EXT: Indexed Search Engine
Contents
1. The last example below has three main issues to discuss:

1. The page "Other languages" is apparently available in three languages. Which ones cannot be determined unless we know the values from the sys_language table. In this case the default language (0) is English, and the languages with id 1 and id 2 are the Danish and German versions of the page. When a search is conducted, each version may turn up as a result, but with a little flag telling whether the page was found in another language than the main language of the website (see the second illustration hereafter).

2. If no phash rows are found for a page, this can mean three things:
   1. The page is not cached. In this case both the tt_products and tt_news plugins apparently disable caching of the page, thereby disabling any indexing of the pages. Searching in news and products must be done with a search function that looks directly in the news and products tables.
   2. For other pages the reason may be that the page has never been visited and is therefore not indexed yet. Indexing of pages in Typo3 happens during the rendering of the page; there is currently no crawler to assist with this job.
   3. Finally, the reason can be a combination of 1 and 2: the page has never been visited, and if it had been visited, the cache would have been disabled.

3. These numbers just tell us that the page "Lists" was indexed once by a user with membership
2. [Screenshot: statistics table with row counts for index_section, index_fulltext and index_phash; legible values include 217, 7119 and 40609.]

This shows that 217 pages are indexed, comprising about 7,000 words and using about 40,000 records in the relation table to glue things together.

List: Typo3 Pages

This view shows a list of indexed pages with all the technical details.

[Screenshot: "Indexing Engine Statistics, List: Typo3 Pages", with columns for id/type, Title, Size, Words, mtime, Indexed, Updated and Parsetime. Rows include "Case stories", "Mitsubishi Danmark News", "GreenSquare Antiques", "FreakZone Internet Cafe", "Kasper's minimalistic homepage", "Kasper's Wedding (private)", "Fladsaa County, Denmark" and "Inter Photo, Photo Dealer".]
3. grouping by phash, but say two Typo3 pages are linking to the document, then only one of them will be shown as the path where the link can be found. However, if neither of the two Typo3 pages is available, the document will not be shown.

Handling extendToSubpages or not

In the searching plugin there are two ways of searching with respect to accessible pages:

1. join_pages = 1: The final result rows are joined with the pages table. This makes sure that no pages hidden through enableFields are selected (pages hidden through extendToSubpages are NOT filtered out). It also makes sure that ALL pages within the rl0 of the index_section table are searched. But extendToSubpages will NOT be taken into account.
2. join_pages = 0 (default): A long list of page ids is selected first, and after that the final result rows are selected without joining the pages table. This works with a limited number of page ids, which means most sites. It makes sure that pages hidden through extendToSubpages are NOT selected, along with pages hidden through enableFields. BUT it will also prevent pages down the branch of a php_tree_stop from being selected.

Access restricted pages

A Typo3 page will be available in the search result only if there is access to the page. This is secured in the final result query. Whether extendToSubpages is taken into account depends on the join_pages flag (see above), but the page will only be listed if the user has access. However, a page may be indexed mo
4. of group 1 and 2. The page "Addresses" was also indexed by a user with membership of group 1 and 2, but has since been visited by a user without login. Both instances yielded the same page content, and it was therefore not indexed twice.

This raises a question about the page "Lists": is it access restricted for users without login, or has a user without login just never visited that page (since no "0,-1" gr_list has been detected)? Both could be the answer. Pages with access restriction, or a whole intranet section, would obviously not have been indexed by users without login. However, in this case nothing indicates that the page should be hidden from users without login, and so we must conclude that the page has simply not yet been visited by such a user. Otherwise it would look like the page "Addresses", which also has the "0,-1" list detected.

The "Guestbook" page was indexed by a user without login only.

[Screenshot: the Indexed search module for "Another site in the same database", with the columns Title, pHash, cHash, rl-012, pid.t.l, Size, grlist and cHashParams, and rows for "Another site", "Lists", "Addresses" and "Guestbook".]
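The bookkeeping described here (same page, same content, different user groups, so no second index entry) can be sketched roughly as follows. This is a minimal illustration, not the extension's code: the page ids, the content strings and the tuple used in place of the integer phash are all hypothetical, and the two containers only mimic the roles of the index_phash and index_grlist tables.

```python
import hashlib

index_phash = {}   # one entry per indexed version of a page
index_grlist = []  # (phash key, gr_list) pairs that saw exactly that version


def content_hash(content: str) -> str:
    """md5 of the indexed content, playing the role of contentHash."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def index_visit(page_id: int, gr_list: str, content: str):
    chash = content_hash(content)
    for key, row in index_phash.items():
        if row["page_id"] == page_id and row["content_hash"] == chash:
            # Same page and same content: do not index again,
            # only note that this gr_list sees the same version.
            if (key, gr_list) not in index_grlist:
                index_grlist.append((key, gr_list))
            return key
    key = (page_id, gr_list)  # hypothetical stand-in for the integer phash
    index_phash[key] = {"page_id": page_id, "content_hash": chash}
    index_grlist.append((key, gr_list))
    return key


# A page that looks the same with and without login (hypothetical id 11)
index_visit(11, "0,-2,1,2", "address list")
index_visit(11, "0,-1", "address list")
# A page whose content depends on login (hypothetical id 5)
index_visit(5, "0,-1", "public view")
index_visit(5, "0,-2,1", "member view")
```

In this sketch the first page ends up with one indexed version and two gr_list records, while the second page is stored twice because its content differs between the two visits.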
5. options for advanced search

[Screenshot: the search form with its advanced options: search word ("german"), match type, AND/OR operator, media type, language, section ("Whole site"), order by (weight/frequency, highest first), number of results at a time, style (section hierarchy) and extended resume.]

User manual

Adding the search plugin to a page

That is really easy:

1. Create a page called "Search" or something like that. This is where the search box will appear.
2. Then create a new content element on that page. From the Web > Page module you can do it like this:
   [Screenshot: the Page module in Columns view, with the "Create page content" button.]
3. Then select some plugin type if you can. It doesn't matter if it's a guestbook or forum. Or, if no plugins are available, just select a "Regular text element" as in the top of the page.
   [Screenshot: the plugin list with Message board, Discussion forum, Guestbook and Todo items.]
4. Then make sure "Insert plugin" is selected (if not, select it and save the element). You will then see the form below. Enter a title and select the Plugin type to be "Indexed search":
   [Screenshot: the "Pagecontent NEW" form with the plugin type set to "Indexed search".]
6. [Screenshot: list of indexed instances of the page "References". Each row shows a title (for example "Malburgen District", "Native Instruments", "www.drums.de", "Jenoptik Camera", "Green Square", "Relations"), its pHash and cHash values, rootline ids, size, the grlist "0,-1" and a cHashParams value starting with "&tx_t3referenc".]
7. 1 page has uid 123.
- pid.t.l: This is the page id, type number and sys_language uid.
- Size: How many bytes the indexed page consumed.
- grlist: This is the gr_list of the user who initiated the indexing operation.
- cHashParams: Additional parameters which identify the page, in addition to the id and type number, which usually do that.

2. The page "Content elements" has one indexed version. The page id of the root page is 1, and the page on level 1 in the rootline has the uid 2. Notice how all subpages of "Content elements" have the exact same rl0 and rl1 values. The page "Content elements" does NOT have a value for rl2, but all the subpages do, because they ARE the level-2 pages themselves. Furthermore, the page has the page id 2, a type value of 0, and is indexed with the default language (0). The size was 10.6 KB, and the user who initiated the indexing operation was a member of the groups 0,-2,1, which is effectively fe_group 1, because 0 and -2 are pseudo-groups.

3. On the page "Special content" there must have been a link to a local PDF file and a Word file, since those two are indexed in relation to this page. The PDF file is located at the path uploads/media/tsref_onepage.pdf relative to the website. Notice that the PDF file is actually indexed three times, one time per page. This is of course configurable. Each indexed section of the PDF file has the potential to show up as a search result row, of course, because the p
8. [Screenshot, continued: further indexed instances of the page "References" (for example "germanmaps.de", "www.umr.edu", "Archined"), each with pHash, cHash, rootline ids, size, grlist "0,-1" and a cHashParams value starting with "&tx_t3referenc".]

As you can see, most pages here are indexed only once. However, a few are indexed twice. This can happen for several reasons; here the most likely reason is a user login or something related.

The most interesting occurrence is the page "References", which has more than 20 indexed instances. The reason is that this page holds multiple cached views, due to parameters used by a plugin on that page. Each instance will be searchable as a unique search result.

Now imagine that you want to clear out all those instances of the "References" page so that they are re-indexed when viewed again. Simply click the page "References" in the page tree to the left. Then you see this:

[Screenshot: the Indexed search module for the page "References".]
9. 8, but at some point PHP changed the behaviour of the hexdec() function: where a 32-bit input originally gave negative results for half the values, the results suddenly became positive for all of them. That would have required a similar change of the fields in the database. To keep it simple, the length was reduced to 7, so that all values are positive.

How pages are indexed

First of all, a page must be cachable. For pages where the cache is disabled, no indexing will occur.

The "phash" is a unique identification of a "page" with regard to the indexer. So an entry in the index_phash table equals one result row in the search results (called a phash row).

A phash is a combination of the page id, type, sys_language id, gr_list and the cHash parameters of the page (function setT3Hashes()). If the phash is made for EXTERNAL media (item_type > 0), then it is a combination of the absolute filename and any "subpage" indication, for instance if a PDF document is split into subsections.

So for external media there is one phash row for each file (except PDF files, where there may be more). But for Typo3 pages there can be several phash rows matching one single page. Obviously the type parameter would normally always be only one, namely the type number of the content page. The cHash may be of importance for the result as well, with regard to plugins using it. For instance, a message board may make pages cachable by using the cHash params. If so, each cached page wil
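To make the arithmetic behind the 7-character hashes concrete, here is a small sketch. It assumes only what the text says (md5, first 7 hex characters, read as an integer). The helper names and the way the page parameters are serialized before hashing are hypothetical; the extension serializes its own data structure.

```python
import hashlib


def md5_int7(text: str) -> int:
    """First 7 hex chars of the md5 digest, read as an integer."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest()[:7], 16)


def phash(page_id, page_type, sys_language, gr_list, chash_params):
    # Hypothetical serialization of the parameters named in the text.
    key = repr((page_id, page_type, sys_language, gr_list,
                sorted(chash_params.items())))
    return md5_int7(key)


MAX_7_CHARS = int("fffffff", 16)    # largest value from 7 hex chars
MAX_8_CHARS = int("ffffffff", 16)   # largest value from 8 hex chars
SIGNED_32_MAX = 2**31 - 1           # largest signed 32-bit integer

example = phash(2, 0, 0, "0,-2,1", {})
```

Seven hex characters carry 28 bits, so every value fits into a signed 32-bit integer field, whereas eight characters can exceed that range. This is consistent with the reduction to 7 characters described above.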
10. EXT: Indexed Search Engine

Extension Key: indexed_search
Copyright 2000-2002, Kasper Skårhøj, <kasper@typo3.com>

This document is published under the Open Content License available from http://www.opencontent.org/opl.shtml

The content of this document is related to Typo3, a GNU/GPL CMS/Framework available from www.typo3.com

Table of Contents

EXT: Indexed Search Engine
  Introduction
    What does it do?
    Features of the indexer
    Features of the search frontend (the plugin)
  User manual
    Adding the search plugin to a page
  [illegible entry]
    Monitoring indexed content
    Monitoring the global picture of indexed pages
  Configuration
  Technical details
    HTML content
    Use of hashes

Introduction

What does it do?

The Indexed Search Engine provides two major elements to Typo3:

1

This is an example of how the search interface on a website looks:

    How pages are indexed
    External media
    Handling extendToSubpages or not
    Access restricted pages
  Analysing the i
11. Configuration

General

The most basic requirement for the search engine to work is that pages are getting indexed. That will not happen just by installing the plugin. You will have to set up in TypoScript that a certain page should be indexed. That is needed for several good reasons. First of all, not all sites in a TYPO3 database might need indexing, so we enable it on a per-site basis. Secondly, a single site may have frames, and in that case we need only index the page object which actually shows the page content.

Let's say that you have a PAGE object called "page" (that is pretty typical). Then you will have to set this config option:

    page.config.index_enable = 1

When this option is set, you should begin to see your pages being indexed when they are shown next time. Remember that only cached pages are indexed.

This is documented in TSref in the CONFIG section. Please look there for further options. For instance, indexing of external media can also be enabled there.

Languages

The plugin supports all system languages in TYPO3. Translation is done using the typo3.org tools. If you want to use e.g. Danish, that will automatically be used if this option is set in your template (the value is the internal language key):

    config.language = dk

TypoScript

Still missing the major parts here. Just use the object browser for now, since that includes all options.

specConfs.[pid]: specConfs is an array of objects with properti
12. ains the gr_list of the user initiating the indexing of the document.

index_section

Points out the section where an entry in index_phash belongs.

- phash: The phash of the indexed document.
- phash_t3: The phash of the "parent" Typo3 page of the indexed document. If the document being indexed is a Typo3 page, then phash and phash_t3 are the same. But if the document is an external file (PDF, Word etc.) which is found as a LINK on a Typo3 page, then phash_t3 points to the phash of that Typo3 page. Normally indexing goes like this: 1) the Typo3 document is indexed (this has a phash value, of course), then 2) if any external files are found on the page, they are indexed as well, AND their phash_t3 becomes the phash of the Typo3 page they were found on. But the significance is unclear. I'm not sure this value is used for anything anywhere, so it might not be important at all. But it can be used to determine the relationship of "sub-documents" to an indexed Typo3 page, if you will.
- rl0: The id of the root page of the site.
- rl1: The id of the level-1 page (if any) of the indexed page.
- rl2: The id of the level-2 page (if any) of the indexed page.
- page_id: The page id of the indexed page.
- uniqid: This is just an auto-incremented unique primary key. Generally not used, I think.

index_fulltext

For free text searching, e.g. with a sentence, in all content: title, description, keywords, body.

- ph
13. an zero (e.g. -1), searching will happen in ALL of the page tree with no regard to branches at all.

search.detect_sys_domain_records (boolean): If set, then the search results are linked to the proper domains where they are found.

search.detect_sys_domain_records.target (string): Target for external URLs.

[tsref: plugin.tx_indexedsearch]

Technical details

HTML content

HTML content is weighted by the indexing engine in this order:

1. <title> data
2. <meta keywords>
3. <meta description>
4. <body>

In addition, you can insert markers as HTML comments which define which part of the body text to include or exclude in the indexing. The marker is <!--TYPO3SEARCH_begin--> or <!--TYPO3SEARCH_end-->.

Rules:

1. If there is no marker at all, everything is included.
2. If the first marker found is an "end" marker, the content before that point is included, and the content following it, until the next "begin" marker, is excluded.
3. If the first marker found is a "begin" marker, the content before that point is excluded, and the content following it, until the next "end" marker, is included.

Use of hashes

The hashes used are md5 hashes where the first 7 chars are converted into an integer, which is used as the hash in the database. This is done in order to save space in the database, thus using only 4 bytes and not a varchar of 32 bytes. It's estimated that a hash of 7 chars 32 is sufficient originally
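The three marker rules from the "HTML content" section can be sketched as a small filter. This is only an illustration of the rules as stated, not the indexer's actual parser. The function name is hypothetical, and accepting whitespace inside the comment is an assumption.

```python
import re

# Assumption: optional whitespace inside the comment is tolerated.
MARKER = re.compile(r"<!--\s*TYPO3SEARCH_(begin|end)\s*-->")


def indexable_body(html: str) -> str:
    """Return only the parts of the body that the marker rules include."""
    # With one capturing group, split() yields: text, kind, text, kind, ...
    parts = MARKER.split(html)
    if len(parts) == 1:
        return html                      # rule 1: no marker, keep everything
    # Rule 2/3: content before the first marker is kept only if that
    # marker is an "end" marker.
    include = parts[1] == "end"
    kept = []
    for i in range(0, len(parts), 2):
        if include:
            kept.append(parts[i])
        if i + 1 < len(parts):
            # After a "begin" marker content is included, after "end" excluded.
            include = parts[i + 1] == "begin"
    return "".join(kept)


menu_excluded = indexable_body(
    "menu<!--TYPO3SEARCH_begin-->content<!--TYPO3SEARCH_end-->footer"
)
```

Here `menu_excluded` holds only the text between the two markers, which is the typical use: navigation and footer text stay out of the index.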
14. ash: The phash of the indexed document.
- fulltextdata: The total content, stripped of any HTML codes.

Currently the MySQL FULLTEXT search (MATCH ... AGAINST) is not used, but hopefully this will be added in the future.

index_grlist

This table holds records related to a phash row. Records in this table confirm that certain gr_lists actually share the same content as represented by the phash row, even though the phash row may have been indexed under another login. The table is used during result display to positively confirm whether the current user may see the resume (which otherwise might contain secret information). Please see the discussion far above.

index_words, index_rel

Words table and word-relation table. Almost self-explanatory. For the index_rel table some fields require explanation:

- count: Number of occurrences on the page.
- first: How close to the top (a low number is better).
- freq: Frequency (please see the source for the calculations). This is converted from a floating point number to an integer.
- flags: Bits which describe the weight of the words:
  8th bit (128) = word found in title
  7th bit (64) = word found in keywords
  6th bit (32) = word found in description
  The last 5 bits are not used yet, but if used they will enter the weight hierarchy. The result rows are ordered by this value if the "Weight/Frequency" sorting is selected. Thus results with a hit in the title, keywords or description are ranked higher in the result list.

Known probl
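As a small worked example of the flags field, the bit values listed for index_rel can be combined and compared like this. The constant and function names are hypothetical; only the bit values 128, 64 and 32 come from the text.

```python
TITLE = 128        # 8th bit: word found in title
KEYWORDS = 64      # 7th bit: word found in keywords
DESCRIPTION = 32   # 6th bit: word found in description


def word_flags(in_title: bool, in_keywords: bool, in_description: bool) -> int:
    """Combine the three weight bits into one integer."""
    flags = 0
    if in_title:
        flags |= TITLE
    if in_keywords:
        flags |= KEYWORDS
    if in_description:
        flags |= DESCRIPTION
    return flags


title_only = word_flags(True, False, False)
keywords_and_description = word_flags(False, True, True)
```

Because the title bit is the highest one, a hit in the title alone (128) still outranks a hit in both keywords and description (64 + 32 = 96) when rows are ordered by this value.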
15. be indexed before it's reconsidered for indexing again. A max age defines an absolute point at which re-indexing will occur (unless the content has not changed according to an md5 hash).
- cHashParams: The cHashParams. For Typo3 pages: these are used to re-generate the actual URL of the Typo3 page in question. For files this is an empty array (not used).
- item_type: An integer indicating the content type. 0 is Typo3 pages; 1 and up are external files such as html (1), pdf (2), doc (3), txt (4) and so on. See the class.indexer.php file.
- item_title: Title. For Typo3 pages, the page title. For files, the basename of the file (no path).
- item_description: Short description of the item. Top information on the page. Used in the search result.
- data_page_id: For Typo3 pages: the id.
- data_page_type: For Typo3 pages: the type.
- data_filename: For external files: the file path (relative) or URL (not used yet).
- contentHash: md5 hash of the content indexed. Before re-indexing, this is compared with the content to be indexed, and if it matches there is obviously no need for re-indexing.
- crdate: The creation date of the INDEXING, not the page/file (see item_crdate).
- parsetime: The parse time of the indexing operation.
- sys_language_uid: Will contain the value of $GLOBALS['TSFE']->sys_language_uid, which tells us the language of the page indexed.
- item_crdate: The creation date. For files only the modification date can be read from the files, so here it will be the filemtime().
- gr_list: Cont
16. bviously used by the plugin "tt_board". The plugin has been constructed so intelligently that it links to the messages in the message board without disabling the normal page cache, instead sending the tt_board_uid parameter along with a so-called "cHash". If this is combined correctly, the caching engine allows the page to be cached. Not only does this mean a quicker display of pages in the message board, it also means we can index the page.

[Screenshot: the Indexed search module for the page "Board" (path Intro > Another site > Lists > Board), with the columns Title, pHash, cHash, rl-012, pid.t.l, Size, grlist and cHashParams. Rows include "Board", "Sourcream and Oni", "Fat percent", "Fat percent tree" and "This is gross"; several rows show a cHashParams value of the form &tt_board_uid=N.]

As you see, the main board page showing the list of messages/threads ("Sourcream and Oni") is indexed without any val
17. [Screenshot: the Indexed search module listing the pages "Board", "Rating", "Poll", "Calendar", "Cool example" and "Other languages". The page "Other languages" has three indexed versions titled "Other languages", "Andre sprog" and "Andere Sprachen".]

[Screenshot: a search result with three hits, "Other languages", "Andere Sprachen" and "Andre sprog", each shown with size 4.3 K, creation and modification dates, and the path "Other languages".]

Illustration 1: A search result showing how localized versions of a page are displayed.

Database Tables

index_phash

This table contains references to Typo3 pages or external documents. The fields are like this:

- phash (7md5/int hash): An integer based on a 7-char md5 hash. This is a unique representation of the "page" indexed. For Typo3 pages t
18. [Screenshot: the Indexed search module showing a page tree (including "Search", "Your own scripts", "XML", "WAP / PDA", "Rich Text Editor", "ISEARCH example") with the columns pHash, cHash, rl-012, pid.t.l, Size, grlist and cHashParams. The entries for the file uploads/media/tsref_onepage.pdf show the page intervals "Page 1-1", "Page 2-2" and "Page 3-3"; a Word document and a text file are listed as well.]

On the image below we are looking at another scenario. In this case the cHashParams is o
19. ems

- Currently the extension is under observation, because instances of heavy server load and instability have been reported. It is not yet clear whether THIS extension has anything to do with it, so it is only under suspicion at this point, until further data has been collected. But for now it is advised to be careful when applying the extension in mission-critical, high-load environments.
- It's still uncertain how performance is under heavy load conditions and when MANY pages are indexed. Currently benchmarks have been done only up to 2,000 pages indexed (approx. 400,000 relation records). It is probable that some parts have to be optimized for such scenarios.

Todo list

- The function linkPage() should be faster. It is currently quite slow because it calls the typolink function in cObj. We pass only an id number, so we could optimize a lot. MAYBE change how the function in pibase works.
- The Tools > Indexing module could use some polishing and more useful features.
20. es that can customize certain behaviours of the display of a result row, depending on its position in the rootline. For instance, you can define that all results which link to pages in a branch from page id 123 should have another page icon displayed. Or you can add a suffix to the class names, so you can style that section differently.

Examples:

If a page "Contact" is found in a search for "address", and that "Contact" page is in the rootline "Frontpage (ID=23) > About us (ID=45) > Contact (ID=77)", then you should set the pid value to either 77 or 45. If 45, then all subpages, including the "About us" page, will have similar configuration.

If the pid value is set to 0 (zero), it will apply to all pages.

Please see the options below.

specConfs.[pid].pageIcon (IMAGE cObject): Alternative page icon.

specConfs.[pid].CSSsuffix (string): A string that will be appended to the class names of all the class attributes used within the result row presentation. Example: if "CSSsuffix = doc", then e.g. the class name "tx-indexedsearch-title" will be "tx-indexedsearch-title-doc".

search.rootPidList (list of int; default: the current root page id): A list of integers which should be root pages to search from. Thus you can search multiple branches of the page tree by setting this property to a list of page id numbers. If this value is set to less th
21. esult.

5. Here we can see that the pages "Special content", "Advanced" and "Menu/Sitemap" are indexed twice each. The reason is that those three pages have had different content depending on whether or not a user was logged in. In the case of the page "Special content", the reason is that the page contained a content element which was visible to users who were members of group number 1. Therefore the page was different in the two cases. The page "Advanced" has a user login form, and that form looks different depending on whether a user is logged in or not. Finally, the page "Menu/Sitemap" apparently changed. The reason is that this page includes a sitemap, and that sitemap displayed some extra pages when the logged-in users hit the page, so the content was not the same as without login.

Another interesting thing is that two different users must have visited those pages. We can see that because the page "Special content" was apparently indexed with the usergroup combination 1,2. Later another user hit the page, but as a member of group 1 only. However, the page content was the SAME. Because those two users saw the very same page, it was not indexed a third time; instead it was noted that a user with membership of only group 1 also saw this same page. That comparison was based on the cHash (contentHash), which is a hash value based on the actual content being indexed. So when the user with group 1 only came to the page, the i
22. hash is different per indexed part. The whole point of this is that a large PDF file might contain so much information that it would match all too many search queries. So breaking a PDF file down into smaller parts makes it possible for us to indicate exactly WHERE in the PDF file the search word was found.

4. Looking at the Word file, and the PDF file as well, we see that they are found on BOTH the page "Special content" and the page "ISEARCH example". But looking at the phash value for the Word file, it is 268192666, the SAME value in both cases. This means that the Word and PDF files are indexed only once, when they are first discovered. Later, when another page is indexed and a link to the same document appears, the document is not indexed as another document. Instead, an entry is made in the index_section table, indicating that this result row is also available (linked to) from another page/section.

Say you are doing a search in the section from "Content elements" and outwards in the page tree. The Word document is matched in the search, but it will appear only once in the search result. Now, if one of the two pages where the Word document was found were either hidden or access restricted, the Word document would still be matched, because one of the pages is accessible for the user. But if BOTH pages with the link to the Word document are inaccessible for the user doing the search, then the Word document will not be included in the search r
23. his is a serialization of id, type, gr_list (see later) and cHashParams (which enables "subcaching" with extra parameters). This concept is also used for Typo3 caching, although the caching hash includes the all-array and thus takes the template into account, which this hash does not. It's expected that template changes through conditions would not seriously alter the page content. For external media this is a serialization of 1) the unique filename id and 2) any subpage indication (parallel to cHashParams). gr_list is NOT taken into consideration here.
- phash_grouping (7md5/int hash): This is a non-unique hash exactly like phash, but WITHOUT the gr_list and (in addition, for external media) without the subpage indication. Thus this field will indicate a "unique" page (or file), while this page may exist twice or more due to gr_list. Use this field to GROUP BY the search, so you get only one hit per page when selecting with gr_list in mind. Currently a search result neither groups nor limits by this; rather, the result display may group the result into logical units.
- item_mtime: Modification time. For Typo3 pages: the SYS_LASTCHANGED value. For external media: the filemtime() value. Depending on config, if mtime hasn't changed compared to this value, the file/page is not indexed again.
- tstamp: Time stamp of the indexing operation. You can configure min/max ages which are checked against this timestamp. A min age defines how long an indexed page must
24. l also be indexed. Thus there can be many phash rows for a single page id.

But the most tricky reason for having multiple phash rows for a single Typo3 page id is when the gr_list is set. It works like this: if a page has exactly the same content both with and without logins, then it's stored only once. If the page content differs depending on whether a user is logged in or not (it may even differ based on the fe_groups), then it's indexed as many times as the content differs. The phash is of course different, but the phash_grouping value is the same.

The table index_grlist will always hold one record per phash row of item_type=0 (that is, Typo3 pages). But it may also hold many more records. These point to the phash row in question in the case of other gr_list combinations which actually had the SAME content, and thus refer to the same phash row.

External media

External media (pdf, doc, html, txt) is tricky. External media is always detected as links to local files in the content of a Typo3 page which is being indexed. But external media can be linked to from more than one page. So the index_section table may hold many entries for a single external phash record, one for each position where it is found. It is also important to notice that external media is only indexed or updated if a "parent" Typo3 page is re-indexed. Only then will the links to the external files be found.

In a searching operation external media will be listed only once
25. even be available in more than one indexed version based on the user groups. But while the same page may have different content based on the user groups, and so must be indexed once for each, such pages may just as well present the SAME content regardless of user groups. This is the trickiest part. Understanding these complex scenarios: The best thing to do is to look at an example. Please refer to the picture below while reading the bullet list here. 1) The overview in general shows one line per phash row (a single row from the index_phash table). Such a row represents a single hit in a searching session. In other words, each line with grayish background in the overview may be a search hit. The columns of these rows are: • Title: The search result title. • Icon: Click here to remove the indexed information for this entry (it will be re-indexed on the next hit). • pHash: The id of the search row. The hash is calculated based on id, type, language, cHashParams and gr_list of the page when indexed. For external media it is based on filepath and page interval (for PDFs only). • cHash: Calculated based on the actual content which was indexed. • rl-012: The rootline ids for levels 0, 1 and 2. Used when searching in certain sections. For instance, a search operation may select all pages with rl1=123, which will result in a search within pages which exist ONLY in the branch of the website where the level
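The section search mentioned under rl-012 can be illustrated with a small SQLite example. The table layout below is an assumption made for the sketch (columns rl0, rl1, rl2 for the rootline ids plus a page_id), and the helper names are hypothetical.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """In-memory stand-in for an index_section table (assumed, simplified layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE index_section "
        "(phash INTEGER, rl0 INTEGER, rl1 INTEGER, rl2 INTEGER, page_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO index_section VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, 123, 0, 200),    # page directly under the level-1 page 123
            (2, 1, 123, 300, 201),  # page deeper in the same branch
            (3, 1, 456, 0, 400),    # page in a different branch
        ],
    )
    return conn


def pages_in_section(conn: sqlite3.Connection, level: int, root_id: int) -> list:
    """Return page ids whose rootline id at the given level equals root_id."""
    if level not in (0, 1, 2):
        raise ValueError("only rootline levels 0, 1 and 2 are stored")
    column = f"rl{level}"  # safe: level is validated above
    rows = conn.execute(
        f"SELECT page_id FROM index_section WHERE {column} = ? ORDER BY page_id",
        (root_id,),
    )
    return [row[0] for row in rows]
```

Calling pages_in_section(demo_connection(), 1, 123) restricts the result to the branch below the level-1 page 123, which mirrors the rl1=123 selection described above.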
26. Analysing the indexed data; Understanding these complex scenarios; Database tables; index_section; index_fulltext; index_words, index_rel; Known problems. Indexing: An indexing engine which indexes Typo3 pages on the fly as they are rendered by Typo3's frontend. Indexing a page means that all words from the page (or from specifically defined areas on the page) are registered, counted, weighted and finally inserted into a database table of words. Then another table is filled with relation records between the word table and the page. This is the basic idea. Searching: A plugin you can insert on your website which allows website users to search for information on your website. When searching, the plugin first looks in the word table to see whether the word exists, and if it does, all pages which have a relation to that word will be considered for the search result display. The search results are ordered based on factors like where on the page the word was found or the frequency of the word on the page. [Screenshot: the search form with the "Search for" field and the "Advanced search" link.]
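The basic idea of a word table plus a relation table can be sketched in a few lines. This is a toy model under assumptions: it only counts word frequency (no weighting by position), it splits words with a simple regular expression, and the class name is hypothetical.

```python
import re
from collections import Counter, defaultdict


class WordIndex:
    """Toy word table and relation table (assumed, frequency only)."""

    def __init__(self):
        self.words = {}                # word -> word id (the "word table")
        self.rel = defaultdict(dict)   # word id -> {page id: count} (the "relation table")

    def index_page(self, page_id: int, text: str) -> None:
        """Register and count every word of a page as it is rendered."""
        counts = Counter(re.findall(r"\w+", text.lower()))
        for word, count in counts.items():
            word_id = self.words.setdefault(word, len(self.words) + 1)
            self.rel[word_id][page_id] = count

    def search(self, word: str) -> list:
        """Look the word up first; then return related pages, most frequent first."""
        word_id = self.words.get(word.lower())
        if word_id is None:
            return []
        hits = self.rel[word_id]
        return sorted(hits, key=lambda page: (-hits[page], page))
```

A search first checks the word table and only then follows the relation records, so an unknown word returns no pages without touching the relations at all.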
27. indexer engine realizes that the page, as it looked, has already been indexed, because another phash row with that content hash was already available. 6) These pages do not contain any tricks, it appears. According to the gr_lists, users with membership of groups 1 and 2, users with membership of group 1 only, and surfers who did not log in at all ("0,-1" is the pseudo group for "no login") have visited the page. And because only one indexed version exists, the page must present the same content to all users regardless of their login status. The reason why the page "Your own scripts" does not contain a gr_list value "0,-2,1,2" as the others do is simply that no user with that combination of usergroups has ever visited the page. 7) txt and html documents can also be indexed as external media. In the case of HTML documents, the document's <title> is detected and used. [Screenshot: the Indexed search overview for the "Content elements" branch, listing pages and linked external files.]
28. not show a resume, but rather link to the page from which the user can see the real link to the document. Note: These tricky scenarios exist only if the content on a page differs based on login. They do not affect situations with access restriction to the page as a whole. A general lesson from this is to reduce the number of hidden content elements and use hidden pages instead, which is better and more reliable. Analysing the indexed data: The indexer is constructed to work with Typo3's page structure. As opposed to a crawler, which simply indexes all the pages it can find, the Typo3 indexer MUST take the following into account: • Only cached pages can be indexed. Pages with dynamic content, such as search pages, should supply their own search engine for lookup in specific tables. Another option is to selectively allow certain of those dynamic pages to be cached anyway (see the cHashParams concept used by some plugins). • Pages in more than one language must be indexed separately as different pages. • Pages with messageboards may have multiple indexed versions based on what is displayed on the page: the overview or a single messageboard item. This is determined by the cHashParams value. • Pages with restricted access must be observed. • Because pages can contain different content depending on whether a user is logged in, and even on which groups the user is a member of, a single page (identified by the combination of id, type, language and cHashParams) may eve
29. [Screenshot: the list of available plugins.] 5) Then select the root page of your website as the Starting point of the plugin content element. [Screenshot: the "Indexed search" page content element with its Startingpoint field.] And that's it. Your frontend should now look like this: [Screenshot: the search form with the "Advanced search" link and the search rules: Only words with 2 or more characters are accepted. Max 200 chars total. Space is used to split words; quotes can be used to search for a whole string (not indexed search then). AND, OR and NOT are prefix words, overruling the default operator. Operator symbols are equivalent to AND, OR and NOT. All search words are converted to lowercase.] The styles are most likely different from this, but that is controlled by the developer having administration access to the system. Administration. Monitoring indexed content: The Indexed Search extension adds two backend modules: a global, database-wide statistics module and a page-specific analysis module. In the Web > Info module you can see an overview of how many instances are indexed per Typo3 page. Look at this image: [Screenshot: the Web > Info "Indexed search" overview.]
30. ption>, 4) <body>. • Indexing external files: text formats like html and txt, plus doc and pdf by external programs (catdoc / pdftotext). • Word counting and frequency are used to rate results. • Exact, partial or metaphone search. • Searching freely for sentences (non-indexed). Searching is NOT case sensitive in any way, though. Features of the search frontend (the plugin): The search interface has several options for advanced searching. Any of those can be disabled and/or preset with default values. • Searching whole word, part of word, sounds like, sentence. • Logical AND and OR search, including syntactical recognition of AND, OR and NOT as logical keywords. Furthermore, sentences encapsulated in quotes will be recognized. • Searching can be targeted at specific media, for instance searching only indexed PDF files, HTML files, Word files, Typo3 pages or everything. • The engine is language sensitive, based on the multiple-language feature of Typo3's CMS frontend. • Searching can be performed in specific sections of the website. • Results can be sorted descending or ascending and ordered by word frequency, weight, location relative to page top, page modification date, page title etc. The display of search results can be intelligently divided into sections based on the internal page hierarchy. Thus results are primarily grouped by relation, then by hit relevance. This shows the full range of default
31. more than once if the content differs from usergroup to usergroup, or just without login. Still, the result display will show only one occurrence, because similar pages (determined based on phash_grouping) will be detected. The tricky scenario: Say that a page has a content element with some secret information visible to only one usergroup. The page as a whole will be visible to all users. The page will be indexed twice, both without login and with login, because the page content differs. The problem is that if a search matches one of the secret words in the access-restricted section, then the page will be in the search result even if the user is not logged in. The best solution to this problem is to allow the result to be listed anyway, but then HIDE the resume if the index_grlist table cannot positively confirm that the user's combination of usergroups has access to the result. So the result is there, but no resume is shown, since the resume might contain hidden text. External media: The same applies to external media, which are linked from a Typo3 page. When an external media item is selected, we can be sure that the page linking to it can be selected. But we cannot be sure that the link was in a section accessible to the user. Similarly, we should make a lookup in the index_grlist table, selecting on phash and gr_list by the phash_t3 value of the section record for the search result. If this is not available, we should not display a link to the document and
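The "list the hit, but hide the resume" rule described in this fragment can be sketched as a simple lookup. The sketch assumes index_grlist can be represented as a set of (phash, gr_list) pairs; the function names and the dictionary shape of a rendered hit are hypothetical.

```python
def may_show_resume(phash: int, user_gr_list: str, grlist_rows: set) -> bool:
    """True only if a gr_list record positively confirms access for this user."""
    return (phash, user_gr_list) in grlist_rows


def render_hit(title: str, resume: str, phash: int,
               user_gr_list: str, grlist_rows: set) -> dict:
    """Always list the hit, but drop the resume unless access is confirmed."""
    if may_show_resume(phash, user_gr_list, grlist_rows):
        return {"title": title, "resume": resume}
    # The resume might contain text from a restricted content element.
    return {"title": title, "resume": None}
```

The check is deliberately strict: the absence of a matching record is treated as "not confirmed", so a user who has never been registered for that phash row sees the title only.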
32. [Screenshot: the Web > Info "Indexed search" overview for the "References" page, showing the main page and several additional phash rows whose cHashParams are of the form &tx_t3references_pi1[showUid]=222.] You can either click the red garbage bin (1) in order to clear all listed instances, or alternatively pick out single instances by clicking the local garbage bin (2). Monitoring the global picture of indexed pages: [Screenshot: the Tools menu with the Indexing module.] With the Tools > Indexing module you can get statistics about the indexing engine. Currently they are sparse and very roughly presented. This view needs some more work to be friendly and really useful. General statistics: RECORDS: index_phash, index_words, index_rel, index_grlist
33. values for the parameter tt_board_uid (the cHashParams field is blank). Then it has also been indexed once for each display of a message. In a search result, any of these five rows may appear as an independent result row; after all, they are to be regarded as single pages with unique content despite sharing the same page id. Another interesting thing is that while the main page has inherited the page title for the search result ("Sourcream"), each of the indexed pages with a message has got another title, namely the subject line of the message shown. Thus a search matching three of these five pages will not show three similar page titles, but a unique page title relative to the actual content on the page. It is the tt_board plugin that sets the page title itself by an API call. The only glitch here is that the tt_board plugin has falsely allowed the main page to be cached twice. See the first and last phash rows. The last row has got the parameter "&tt_board_uid=" sent, and the tt_board plugin should not have allowed that. Looking at the content hash of the first and last rows, we realize that it is the SAME hash (84186444) and therefore the SAME content. However, being two separate result rows, they will both be displayed in the search result as separate hits. The responsibility for this lies with the plugin. However, such occurrences can be automatically filtered out during the search result display. But it is better to avoid this kind of stuff
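Filtering out such duplicate rows during result display could look roughly like this. It is a sketch under assumptions: each result row is a plain dictionary with hypothetical keys page_id and content_hash, and the first row found for a given content is the one kept.

```python
def drop_duplicate_hits(rows: list) -> list:
    """Keep only the first row for each (page id, content hash) combination."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["page_id"], row["content_hash"])
        if key in seen:
            continue  # same page with the SAME content: not a separate hit
        seen.add(key)
        unique.append(row)
    return unique
```

Applied to the message board example, the main page row and the falsely cached row with the empty tt_board_uid parameter share the content hash 84186444 and collapse into one hit, while the rows for individual messages have their own content hashes and stay separate.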
34. [Screenshot: the search result display, "Displaying results 1 to 10 out of 10" in 4 sections, each hit shown with title, rating, resume, size, created and modified dates, and path.] Features of the indexer: The indexing engine has several features: • HTML data priority: 1) <title> data, 2) <meta keywords>, 3) <meta descri
35. [Screenshot: the Web > Info "Indexed search" overview for www.typo3.com, with one line per phash row and the columns Title, pHash, cHash, rl-012, pid, t, l, Size, grlist and cHashParams.]