EXT: Indexed Search Engine
1. [Screenshot: "List: TYPO3 Pages" view, listing indexed pages with the columns id/type, Title, Size, Words, mtime, Indexed, Updated and Parsetime.] Configuration: General. The most basic requirement for the search engine to work is that pages are getting indexed. That will not happen just by installing the plugin. You have to set up in TypoScript that a certain page should be indexed. This is needed for several good reasons. First of all, not all sites in a TYPO3 database might need indexing, so indexing is controlled on a per-site basis. Secondly, a single site may have frames a
2. [Screenshot: search form with the options "Search" and "Advanced search".] Rules: Only words with 2 or more characters are accepted. Max 200 chars total. Space is used to split words; "" can be used to search for a whole string (not indexed search then). AND, OR and NOT are prefix words, overruling the default operator. +/|/- equals AND, OR and NOT as operators. All search words are converted to lowercase. The styles are most likely different from this, but that is controlled by the developer having administration access to the system.
Administration: Monitoring indexed content. The Indexed Search extension adds two backend modules: a global, database-wide statistics module and a page-specific analysis module. In the Web > Info module you can see an overview of how many instances are indexed per TYPO3 page. Look at this image:
[Screenshot: Web > Info "Indexed search" view, 2 levels, listing the pages of www.typo3.com (About, What is a CMS, Highlights, Feature list, ...) with the columns Title, pHash, cHash, rl-012, pid/t/l, Size, grlist and cHashParams.]
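The search rules listed on this slide can be sketched in a few lines of Python. This is only an illustration of the rules as stated (minimum word length, 200-character limit, quoted strings, prefix words, lowercasing); the function name parse_query and the tuple layout are assumptions made here and are not taken from the extension, which is written in PHP. The symbol operators are left out of the sketch.

```python
import re

MAX_QUERY_CHARS = 200  # "Max 200 chars total"
MIN_WORD_CHARS = 2     # "Only words with 2 or more characters are accepted"
PREFIX_WORDS = {"and": "AND", "or": "OR", "not": "NOT"}


def parse_query(query, default_operator="AND"):
    """Return a list of (operator, term, is_sentence) tuples (hypothetical layout)."""
    # All search words are converted to lowercase; overlong input is cut off.
    query = query[:MAX_QUERY_CHARS].lower()
    # A token is either a double-quoted string or a run of non-space characters.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    parsed = []
    operator = default_operator
    for token in tokens:
        if token in PREFIX_WORDS:
            # Prefix word: overrules the default operator for the next term.
            operator = PREFIX_WORDS[token]
            continue
        is_sentence = len(token) >= 2 and token.startswith('"') and token.endswith('"')
        term = token.strip('"')
        if len(term) >= MIN_WORD_CHARS:
            parsed.append((operator, term, is_sentence))
        operator = default_operator
    return parsed
```

Calling parse_query('TYPO3 not "content management" x OR search') keeps three terms: the single-character word is dropped, the quoted string is kept as one sentence term, and each prefix word applies only to the term that follows it.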
3. [Screenshot continued: Web > Info "Indexed search" listing for www.typo3.com. Rows shown include Feature list, Screenshots, Price & License, People, History, Snowboard, Cases & Reviews, Case Studies and References, followed by a series of References instances (Inter Photo A/S, Cryptonet, Malburgen District, Native Instruments and others) whose cHashParams start with &tx_t3referenc...]
4. [Screenshot: Web > Info "Indexed search" listing for the "ISEARCH example" pages, including external files such as tsref_onepage.pdf and a Word document, with the columns pHash, cHash, rl-012, pid/t/l, Size, grlist and cHashParams.] On the image below we are looking
5. [Screenshot continued: remaining instances of the References page, each with a cHashParams value of the form &tx_t3references_pi1[showUid]=...] You can either click the red garbage bin (1) in order to clear all listed instances, or alternatively pick out single instances by clicking the local garbage bin (2).
Monitoring the global picture of indexed pages. [Screenshot: Tools menu of the backend, including the Indexing module.] By the Tools > Indexing module you can get statistics about the indexing engine. Currently they are sparse and very roughly presented. This view needs some more work to be friendly and really useful.
General statistics: [Screenshot: "General statistics" table with record counts for index_phash, index_words, index_rel, index_grlist, index_section and index_fulltext.] This shows that 217 pages are indexed, comprising 7000 words and using 40,000 records in the relation table to glue things together.
List: TYPO3 Pages. This view shows a list of indexed pages with all the technical details. [Screenshot: Indexing Engine Statistics, "List: TYPO3 Pages".]
6. [Screenshot residue: last line of a search result, path Cases & Reviews/References.]
Features of the indexer. The indexing engine has several features:
• HTML data priority: 1) <title> data, 2) <meta keywords>, 3) <meta description>, 4) <body>.
• Indexing external files: text formats like .html and .txt, and .doc and .pdf by external programs (catdoc / pdftotext).
• Word counting and frequency, used to rate results.
• Exact, partial or metaphone search.
• Searching freely for sentences (non-indexed). NOT case-sensitive in any way, though.
Features of the search frontend (the plugin). The search interface has several options for advanced searching. Any of those can be disabled and/or preset with default values:
• Searching whole word, part of word, sounds like, sentence.
• Logical AND and OR search, including syntactical recognition of AND, OR and NOT as logical keywords. Furthermore, sentences encapsulated in quotes will be recognized.
• Searching can be targeted at specific media, for instance searching only indexed PDF files, HTML files, Word files, TYPO3 pages or everything.
• The engine is language sensitive, based on the multiple-language feature of TYPO3's CMS frontend.
• Searching can be performed in specific sections of the website.
• Results can be sorted descending or ascending and ordered by word frequency, weight, location relative to page top, page modification d
7. This is a unique representation of the page indexed. For TYPO3 pages this is a serialization of id, type, gr_list (see later), MP and cHashParams (which enables "subcaching" with extra parameters). This concept is also used for TYPO3 caching, although the caching hash includes the "all" array and thus takes the template into account, which this hash does not. It is expected that template changes through conditions would not seriously alter the page content. For external media this is a serialization of 1) unique filename id, 2) any subpage indication (parallel to cHashParams). gr_list is NOT taken into consideration here.
phash_grouping (7md5/int hash): This is a non-unique hash exactly like phash, but WITHOUT the gr_list and (in addition, for external media) without subpage indication. Thus this field will indicate a "unique" page (or file), while this page may exist twice or more due to gr_list. Use this field to GROUP BY the search so you get only one hit per page when selecting with gr_list in mind. Currently a search result neither groups nor limits by this; rather, the result display may group the result into logical units.
item_mtime: Modification time. For TYPO3 pages: the SYS_LASTCHANGED value. For external media: the filemtime() value. Depending on config, if mtime hasn't changed compared to this value, the file/page is not indexed again.
tstamp: Time stamp of the indexing operation. You can configure min/max ages
8. [Screenshot: Web > Info "Indexed search" listing in which the page "Other languages" appears together with its localized versions "Andre sprog" and "Andere Sprachen".] [Screenshot: search result listing the three language versions, each with Size 4.3 K, Created 13-12-01 and the path Other languages.] Illustration 1: A search result showing how localized versions of a page are displayed.
Database Tables. index_phash: This table contains references to TYPO3 pages or external documents. The fields are like this:
phash (7md5/int hash): It's an integer based on a 7-char md5 hash
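As a rough illustration of a "7-char md5 integer hash", the Python sketch below takes the first seven hexadecimal characters of an md5 digest and reads them as an integer. Seven hex digits can never exceed 0xFFFFFFF (268435455), so the value always fits in a signed 32-bit integer field. The helper names (md5_int_hash, serialize, page_hashes) and the key=value serialization are assumptions for this example only; the slides say the parameters are serialized but not how, and the extension itself is PHP.

```python
import hashlib


def md5_int_hash(text, length=7):
    """Integer built from the first `length` hex characters of md5(text)."""
    hex_digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(hex_digest[:length], 16)


def serialize(params):
    # Hypothetical stand-in for the real serialization: sorted key=value pairs.
    return "&".join(f"{key}={params[key]}" for key in sorted(params))


def page_hashes(page_id, page_type, gr_list, mp="", chash_params=""):
    """Return (phash, phash_grouping) for a page, under the assumptions above."""
    base = {"id": page_id, "type": page_type, "MP": mp, "cHashParams": chash_params}
    # The grouping hash leaves out gr_list, so every login variant of a page shares it.
    phash_grouping = md5_int_hash(serialize(base))
    phash = md5_int_hash(serialize({**base, "gr_list": gr_list}))
    return phash, phash_grouping
```

For example, page_hashes(1229, 0, "0,-1") and page_hashes(1229, 0, "0,-2,1") return the same grouping hash, which is what makes a GROUP BY on that field collapse the login variants into one hit.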
9. page (if any) of the indexed page.
page_id: The page id of the indexed page.
uniqid: This is just an autoincremented unique primary key. Generally not used (I think).
index_fulltext: For free text searching, e.g. with a sentence, in all content: title, description, keywords, body.
phash: The phash of the indexed document.
fulltextdata: The total content stripped of any HTML codes. Currently the MySQL FULLTEXT search is not used (something with MATCH ... AGAINST), but this will be added in the future.
index_grlist: This table will hold records related to a phash-row. Records in this table confirm that certain gr_lists would actually share the same content as represented by the phash-row, even though the phash-row may be indexed under another login. The table is used during result display to positively confirm whether the current user may see the resume (which otherwise might contain secret info). Please see the discussion far above.
index_words, index_rel: Words table and word-relation table. Almost self-explanatory. For the index_rel table some fields require explanation:
count: Number of occurrences on the page.
first: How close to the top (low number is better).
freq: Frequency (please see the source for the calculations). This is converted from some floating point value to an integer.
flags: Bits which describe the weight of the words: 8th bit (128) = word found in title, 7th bit (64) = word f
10. will display only one occurrence, because similar pages (determined based on phash_grouping) will be detected.
The tricky scenario: Say that a page has a content element with some secret information, visible for only one usergroup. The page as a whole will be visible for all users. The page will be indexed twice, both without login and with login, because the page content differs. The problem is that if a search is conducted and matches one of the secret words in the access restricted section, then the page will be in the search result even if the user is not logged in. The best solution to this problem is to allow the result to be listed anyway, but then HIDE the resume if the index_grlist table cannot confirm positively that the combination of usergroups of the user has access to the result. So the result is there, but no resume is shown (the resume might contain hidden text).
External media: Equally for external media, they are linked from a TYPO3 page. When an external media is selected we can be sure that the page linking to it can be selected. But we cannot be sure that the link was in a section accessible for the user. Similarly, we should make a lookup in the index_grlist table, selecting the phash/gr_list by the phash_t3 value of the section record for the search result. If this is not available, we should not display a link to the document and not show the resume, but rather link to the page from which the user can see the real link to the document. No
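The rule for hiding the resume can be sketched as a simple membership test. This is a minimal Python illustration, assuming the index_grlist table is modeled as a set of (phash, gr_list) pairs; the function name may_show_resume and the sample values are hypothetical and do not come from the extension.

```python
def may_show_resume(index_grlist, phash, user_gr_list):
    """True only if index_grlist positively confirms this phash for the user's gr_list."""
    return (phash, user_gr_list) in index_grlist


# Hypothetical data: the same page indexed once with login and once without.
index_grlist = {
    (1001, "0,-2,1"),  # version containing the group-restricted element
    (1002, "0,-1"),    # public version, indexed without login
}
```

With this data, may_show_resume(index_grlist, 1001, "0,-1") is False: the hit may still be listed, but its resume stays hidden because nothing confirms that a visitor without login has seen that content.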
11. [Screenshot continued: further indexed instances of the References page, one row per reference (for example Jenoptik Camera, DIBS corporate website, Green Square A/S, Nubuk Sports, www.magix.net), each with cHashParams starting with &tx_t3referenc...]
12. EXT: Indexed Search Engine. Extension Key: indexed_search. Copyright 2000-2004, Kasper Skårhøj, <kasper@typo3.com>. This document is published under the Open Content License, available from http://www.opencontent.org/opl.shtml. The content of this document is related to TYPO3, a GNU/GPL CMS/Framework available from www.typo3.com.
Table of Contents: EXT: Indexed Search Engine; Introduction; What does it do?; Features of the indexer; Features of the search frontend (the plugin); User manual; Adding the search plugin to a page; Administration; Monitoring indexed content; Monitoring the global picture of indexed pages; TypoScript; Technical details; HTML content; Use of hashes; How pages are indexed; External media; Handling extendToSub
Introduction: What does it do? The Indexed Search Engine provides two major elements to TYPO3
13. at another scenario. In this case the cHashParams is obviously used by the plugin tt_board. The plugin has been constructed so intelligently that it links to the messages in the message board without disabling the normal page cache, but rather sending the tt_board_uid parameter along with a so-called cHash. If this is combined correctly, the caching engine allows the page to be cached. Not only does this mean a quicker display of pages in the message board, it also means we can index the page.
[Screenshot: Web > Info "Indexed search" view for the page "Board" (path Intra / Another site / Lists / Board), 3 levels. The main Board page is listed once, followed by several instances titled after message subjects ("Sourcream and Oni...", "Fat percent...", "This is grass"), each with a cHashParams value of the form &tt_board_uid=...]
As you se
14. ate, page title etc. The display of search results can be intelligently divided into sections based on the internal page hierarchy. Thus results are primarily grouped by relation, then by hit relevance. This shows the full range of default options for advanced search:
[Screenshot: advanced search form with the fields Search for, Match (All words (AND)), Search in (All media, All languages), From section (Whole site), Order by (Weight/Frequency, Highest first), results at a time, and style (Section hierarchy, Extended resume).]
Warning: The search frontend plugin is optimized for features, not speed. In particular it will be slow on a website with many pages in the page tree, because it traverses the whole tree each time to build a list of accessible pages. However, you can circumvent this by modifying the search plugin so it does not check page access based on the id list. But then you lose that feature, of course. Can't have both. In any case, indexing pages and searching the indexed information are two different processes, and therefore you can easily use another frontend plugin for making searches in the same data, for whatever reason you might have for discarding the default search plugin.
User manual: Adding the search plugin to a page. That is really easy: 1. Create a page called "Search" or something like that. This is where the search box will appear. 2. Then create a new content element on that page. From the Web > Page module you can do it li
15. can be automatically filtered out during the search result display. But it's better to avoid this kind of stuff.
The last example below has three main issues to discuss:
1. The page "Other languages" is apparently available in three languages. Which ones is not possible to determine unless we know the values from the sys_language table. In this case the default language (zero, 0) is English, and the languages with id 1 and id 2 are the Danish and German versions of the page. When a search is conducted, each page may turn up as a result page, but with a little flag telling if the page was found in another language than the main language of the website (see the second illustration hereafter).
2. If there are no phash-rows found for a page, this can mean three things: 1) Either the page is not cached. In this case both the tt_products and tt_news plugins apparently disable the caching of the page, thereby disabling any indexing of the pages. Searching in news and products must be done with a search function looking up directly in the news and products tables. 2) In the case of other pages, the reason may be that the pages have never been visited and are therefore not indexed yet. Indexing of pages in TYPO3 happens during the rendering of the page; there is currently no crawler to assist with this job. 3) Finally, the reason for a page not being indexed can be a combination of 1 and 2: the page has never been visited, and if it was visited, the cache would hav
16. ch for "address" and that "Contact" page is in the rootline "Frontpage (ID=23) > About us (ID=45) > Contact (ID=77)", then you should set the pid value to either 77 or 45. If 45, then all subpages, including the "About us" page, will have similar configuration. If the pid value is set to 0 (zero) it will apply to all pages. Please see the options below.
Property: specConfs.[pid].pageIcon. Data type: ->IMAGE cObject. Description: Alternative page icon.
Property: specConfs.[pid].CSSsuffix. Data type: string. Description: A string that will be appended to the class names of all the class attributes used within the result row presentation. Example: If CSSsuffix = "doc" then e.g. the class name "tx-indexedsearch-title" will be "tx-indexedsearch-title-doc".
Property: search.rootPidList. Data type: list of int. Description: A list of integers which should be root pages to search from. Thus you can search multiple branches of the page tree by setting this property to a list of page id numbers. If this value is set to less than zero (e.g. -1), searching will happen in ALL of the page tree with no regard to branches at all. Notice that by "root page" we mean a website root defined by a TypoScript Template. If you just want to search in branches of your site, use the possibility of searching in levels. Default: The current root page id.
Property: search.detect_sys_domain boo
17. e been disabled.
3. These numbers just tell us that the page "Lists" was indexed once by a user with membership of groups 1 and 2; the page "Addresses" was also indexed by a user with membership of groups 1 and 2, but has since been visited by a user without login. Both instances yielded a similar page, and it was therefore not indexed twice. This raises a question about the page "Lists": is it access restricted for users without login, or has a user without login just never visited that page (since no "0,-1" grlist has been detected)? Both could be the answer. Pages which have access restriction (or a whole section in an intranet) would obviously not have been indexed by no-login users. However, in this case nothing indicates that the page should be hidden for non-login users, and so we must conclude that the page has simply not yet been visited by a no-login user; otherwise it would look like the page "Addresses", having also the "0,-1" list detected. The "Guestbook" page was indexed by a user without login only.
[Screenshot: Web > Info "Indexed search" view for "Another site in the same database" (path Intra / Another site), 2 levels, listing the pages Lists, Addresses, Guestbook, Board, Rating, Poll, Calendar, Cool example and Other languages (with Andre sprog and Andere Sprachen).]
18. e, the main board page showing the list of messages/threads ("Sourcream and Oni...") is indexed without any values for the parameter tt_board_uid (the cHashParams field is blank). Then it has also been indexed one time for each display of a message. In a search result any of these five rows may appear as an independent result row; after all, each is to be regarded as a page of its own with unique content, despite sharing the same page id. Another interesting thing is that while the main page has inherited the page title for the search result ("Sourcream and..."), each of the indexed pages with a message has got another title, namely the subject line of the message shown. Thus a search matching three of these five pages will not show three similar page titles, but a unique page title relative to the actual content on the page. It is the tt_board plugin that sets the page title itself by an API call.
The only glitch here is that the tt_board plugin has falsely allowed the main page to be cached twice. See the first and last phash-row. The last row has got the parameter "&tt_board_uid=" sent, and the tt_board plugin should not have allowed that. Looking at the content hash of the first and last row, we realize that it is the SAME hash (84186444) and therefore the SAME content. However, being two separate result rows, they will both be displayed in the search result as separate hits. The responsibility for this lies with the plugin. However such occurrences
19. ed item.
crdate: The creation date. For files only the modification date can be read from the files, so here it will be the filemtime().
gr_list: Contains the gr_list of the user initiating the indexing of the document.
index_section: Points out the section where an entry in index_phash belongs.
phash: The phash of the indexed document.
phash_t3: The phash of the "parent" TYPO3 page of the indexed document. If the document being indexed is a TYPO3 page, then phash and phash_t3 are the same. But if the document is an external file (PDF, Word etc.) which is found as a LINK on a TYPO3 page, then this phash_t3 points to the phash of that TYPO3 page. Normally it goes like this when indexing: 1) The TYPO3 document is indexed (this has a phash value, of course), then 2) if any external files are found on the page, they are indexed as well AND their phash_t3 will become the phash of the TYPO3 page they were on. The significance of this value is that indexed external files may have more than one record in index_section with the same phash: a record for each parent page where a link to the document was found. There are details about this in the section of this document that describes the complexities of indexing pages.
rl0: The id of the root page of the site.
rl1: The id of the level-1 page (if any) of the indexed page.
rl2: The id of the level-2
20. [Screenshot continued: last rows of the References listing.] As you can see, most pages here are indexed only one time. However, a few are indexed twice. This can happen for several reasons, and here the reason is most likely a user login or something related. The most interesting occurrence is the page "References", which has more than 20 indexed instances available. The reason is that this page holds multiple cached views due to some parameters which are used by a plugin on that page. Each instance will be searchable as a unique search result.
Now imagine that you want to clear out all those instances of the "References" page to let them be re-indexed when viewed again. Simply click the page "References" in the page tree to the left. Then you see this:
[Screenshot: Web > Info "Indexed search" view for the page References (path www.typo3.com / Cases & Reviews / References), listing instances such as Inter Photo A/S, Cryptonet and Malburgen District, each with a cHashParams value of the form &tx_t3references_pi1[showUid]=...]
21. example of how the search interface on a website looks:
[Screenshot: search results for the word "search" on a TYPO3 website, 10 results divided into 4 sections (among them search: 1 page, Cases & Reviews: 4 pages, Resources: 1 page). Each hit shows a title, a rating, a resume, the size, the creation and modification dates and the path, for example "DIBS corporate website", 100%, Size 18.4 K, Created 28-05-02, Modified 19-11-02 16:40, Path Cases & Reviews/References.]
22. exed twice each. The reason is that those three pages have had different content depending on whether or not a user was logged in. In the case of the page "Special content", the reason is that the page contained a content element which was visible only for users who were members of group number 1. Therefore the page was different in the two cases. The page "Advanced" has a user login form, and that form looks different depending on whether a user is logged in or not. Finally, the page "Menu Sitemap" apparently changed. The reason was that this page includes a sitemap, and that sitemap displayed some extra pages when logged-in users hit the page, so the content was not the same as without login.
Another interesting thing is that two different users must have visited those pages. We can see that because the page "Special content" was apparently indexed with the usergroup combination 1,2. Later another user hit the page, but was only a member of group 1. However, the page content was the SAME. And because those two users saw the very same page, it was not indexed a third time; instead it was noted down that a user with membership of only group 1 did also see this same page. That comparison was based on the cHash (contentHash), which is a hash value based on the actual content being indexed. So when the user with group 1 only came to the page, the indexing engine realized that the page as it looked had already been indexed, because another phash ro
23. f the values would be negative, they were suddenly positive, all of them. That would require a similar change of the fields in the database. To cut it short, the length was reduced to 7, all values being positive then.
How pages are indexed: First of all, a page must be cachable. For pages where the cache is disabled, no indexing will occur. The "phash" is a unique identification of a page with regard to the indexer. So an entry in the index_phash table equals 1 result row in the search results (called a phash-row). A phash is a combination of the page id, type, sys_language id, gr_list, MP and the cHash parameters of the page (function setT3Hashes()). If the phash is made for EXTERNAL media (item_type > 0), then it is a combination of the absolute filename hashed with any "subpage" indication, for instance if a PDF document is split into subsections. So for external media there is one phash-row for each file (except PDF files, where there may be more). But for TYPO3 pages there can be more phash-rows matching one single page. Obviously the type parameter would normally always be only one, namely the type number of the content page. And the cHash may be of importance for the result as well, with regard to plugins using that. For instance, a message board may make pages cachable by using the cHash params. If so, each cached page will also be indexed. Thus many phash-rows for a single page id. But the most tricky reason for having multiple phash row
24. formation that it might match all too many search queries. So breaking a PDF file down into smaller parts makes it possible for us to indicate exactly WHERE in the PDF file the search word was found.
4. Looking at the Word file (and the PDF file as well), we see that they are found on BOTH the page "Special content" and on the page "ISEARCH example". But looking at the phash values for the Word file (268192666), it is the SAME value in both cases. So this means that the Word and PDF files are indexed only once, when they are first discovered. Later, when another page is indexed and a link to the same document appears, the document is not indexed as another document; rather, an entry in the index_section table is made, indicating that this result row is also available (linked to) from another page/section. Say you are doing a search in the section from "Content elements" and outwards in the page tree. The Word document is matched in the search, but it will appear only once in the search result. Now, if one of the two pages linking to the Word document were either hidden or access restricted, the Word document would still be matched, because one of the pages is accessible for the user. But if BOTH pages with the link to the Word document are not accessible for the user doing the search, then the Word document will not be included in the search result.
5. Here we can see that the pages "Special content", "Advanced" and "Menu Sitemap" are ind
25. he path where the link can be found. However, if both TYPO3 pages are not available, then the document will not be shown.

Handling extendToSubpages or not

In the searching plugin there are two ways of searching with respect to accessible pages:

1. "join_pages" = 1: If set, the final result rows are joined with the pages table. This makes sure that no enableFields-hidden (but NOT extendToSubpages) pages are selected. It also makes sure to search ALL pages within the rl0 of the index_section table. But extendToSubpages will NOT be taken into account.
2. "join_pages" = 0 (default): A long list of page ids is selected first, and after that the final result rows are selected, but without joining the pages table. This works with a limited number of page ids (which means most sites). It makes sure that any extendToSubpages-hidden pages are NOT selected, along with enableFields-hidden pages, BUT it will also prevent pages down the branch of a php_tree_stop from being selected.

Access restricted pages

A TYPO3 page will be available in the search result only if there is access to the page. This is secured in the final result query. Whether extendToSubpages is taken into account depends on the join_pages flag (see above). But the page will only be listed if the user has access. However, a page may be indexed more than once if the content differs from usergroup to usergroup, or just without login. Still, the result display
26. ke this: [Screenshot: page content view of the "Search" page, with the options "Edit page header" and "Create page content"]

3. Then select some plugin type if you can. It doesn't matter if it's a guestbook or forum. Or, if no plugins are available, just select a "Regular text element" as in the top of the page. [Screenshot: plugin list with Message board, Discussion forum, Guestbook and Todo items]

4. Then make sure "Insert plugin" is selected (if not, select it and save the element); then you'll see the form below. Enter a title and select the Plugin type to be "Indexed search". [Screenshot: new page content form with the plugin type selector]

5. Then select the root page of your website as the "Starting point" of the plugin content element. [Screenshot: page content form with header "Search", type "Insert plugin" and plugin "Indexed search"]

And that's it. Your frontend should now look like this:
27. lean. If set, then the search results are linked to the proper domains where they are found.
search.detect_sys_domain_records.target (string): Target for external URLs.
[tsref:plugin.tx_indexedsearch]

Technical details

HTML content

HTML content is weighted by the indexing engine in this order:
1. <title> data
2. <meta keywords>
3. <meta description>
4. <body>

In addition, you can insert markers as HTML comments which define which part of the body text to include or exclude in the indexing. The marker is <!--TYPO3SEARCH_begin--> or <!--TYPO3SEARCH_end-->.

Rules:
1. If there is no marker at all, everything is included.
2. If the first found marker is an "end" marker, the previous content until that point is included, and the following code until the next "begin" marker is excluded.
3. If the first found marker is a "begin" marker, the previous content until that point is excluded, and the following content until the next "end" marker is included.

Use of hashes

The hashes used are md5 hashes where the first 7 chars are converted into an integer, which is used as the hash in the database. This is done in order to save space in the database, thus using only 4 bytes and not a varchar of 32 bytes. It's estimated that a hash of 7 hex chars (28 bits) is sufficient (originally 8, but at some point PHP changed behavior with the hexdec function so that where originally a 32 bit value was input hal
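The three marker rules under "HTML content" above can be sketched as a small Python function. This is an illustration of the rules as stated, not the extension's own parser; the function name is hypothetical and the sketch assumes the markers are written exactly as <!--TYPO3SEARCH_begin--> and <!--TYPO3SEARCH_end-->.

```python
import re

# Matches the two marker comments; the captured group is "begin" or "end".
MARKER = re.compile(r"<!--TYPO3SEARCH_(begin|end)-->")


def indexable_body(body: str) -> str:
    """Return the part of the body text that the marker rules select for indexing."""
    markers = list(MARKER.finditer(body))
    if not markers:
        return body  # Rule 1: no marker at all, everything is included.

    # Rules 2 and 3: text before the first marker is included only
    # when that first marker is an "end" marker.
    including = markers[0].group(1) == "end"
    parts = []
    position = 0
    for marker in markers:
        if including:
            parts.append(body[position:marker.start()])
        including = marker.group(1) == "begin"  # a begin marker switches indexing on
        position = marker.end()
    if including:
        parts.append(body[position:])  # trailing text after a final begin marker
    return "".join(parts)


menu_skipped = indexable_body("menu<!--TYPO3SEARCH_begin-->text<!--TYPO3SEARCH_end-->footer")
```

In the usage line, only "text" survives: the menu before the begin marker and the footer after the end marker are left out of the index.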
28. nd in that case we need only index the page object which actually shows the page content. Let's say that you have a PAGE object called "page" (that is pretty typical); then you will have to set this config option: page.config.index_enable = 1

When this option is set, you should begin to see your pages being indexed the next time they are shown. Remember that only cached pages are indexed. This is documented in TSref in the CONFIG section. Please look there for further options; for instance, indexing of external media can also be enabled there.

Languages

The plugin supports all system languages in TYPO3. Translation is done using the typo3.org tools. If you want to use e.g. Danish, that will automatically be used if this option is set in your template (the value is the internal language key): config.language = dk

TypoScript

Still missing the major parts here. Just use the object browser for now, since that includes all options.

Property / Data type / Description / Default
specConfs.[pid]: specConfs is an array of objects with properties that can customize certain behaviours of the display of a result row depending on its position in the rootline. For instance, you can define that all results which link to pages in a branch from page id 123 should have another page icon displayed. Or you can add a suffix to the class names so you can style that section differently. Examples: If a page "Contact" is found in a sear
29. ound in keywords, 6th bit (32) = word found in description. The last 5 bits are not used yet, but if used they will enter the weight hierarchy. The result rows are ordered by this value if the "Weight/Frequency" sorting is selected. Thus results with a hit in the title, keywords or description are ranked higher in the result list.

Known problems

Currently the extension is "under observation" because instances of heavy server load and instability have been reported. It is not yet clear if THIS extension has anything to do with it, so it is only "under suspicion" at this point, until further data has been collected. But for now it is advised to be careful with applying the extension in mission-critical, high-load environments.

It's still uncertain how performance is under heavy load conditions and when MANY pages are indexed. Currently benchmarks have been done only up to 2000 pages indexed (approx. 400,000 relation records). It is probable that some parts have to be optimized for such scenarios.
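The effect of the flag bits on ranking can be illustrated with a short Python sketch. Only the description bit (32) is taken from the text above; the values 64 for keywords and 128 for title are assumptions (the next two higher bits), and using the word frequency as a tie-breaker is likewise an assumption about how the "Weight/Frequency" sorting combines the two.

```python
FLAG_DESCRIPTION = 32   # stated in the text: 6th bit
FLAG_KEYWORDS = 64      # assumed: next higher bit
FLAG_TITLE = 128        # assumed: highest bit


def rank(rows):
    """Order result rows by flag value first, then by word frequency (both descending)."""
    return sorted(rows, key=lambda row: (row["flags"], row["freq"]), reverse=True)


rows = [
    {"page": "body hit only", "flags": 0, "freq": 9},
    {"page": "description hit", "flags": FLAG_DESCRIPTION, "freq": 1},
    {"page": "title hit", "flags": FLAG_TITLE, "freq": 1},
]
ranked_pages = [row["page"] for row in rank(rows)]
```

Even though the body-only hit has the highest frequency, it ends up last, because any set flag bit outweighs frequency in this ordering.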
30. pages or not; Access restricted pages; Analysing the indexed data; Understanding these complex scenarios; Database Tables; index_section; index_fulltext; index_words, index_rel; Known problems

1. Indexing: An indexing engine which indexes TYPO3 pages on the fly as they are rendered by TYPO3's frontend. Indexing a page means that all words from the page (or specifically defined areas on the page) are registered, counted, weighted and finally inserted into a database table of words. Then another table is filled with relation records between the word table and the page. This is the basic idea.

2. Searching: A plugin you can insert on your website which allows website users to search for information on your website. When searching, the plugin first looks in the word table to see if the word exists, and if it does, all pages which have a relation to that word will be considered for the search result display. The search results are ordered based on factors like where on the page the word was found or the frequency of the word on the page. This is an
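The basic idea of a word table plus relation records can be sketched in a few lines of Python. This is a toy in-memory model, not the extension's database schema: the class name, the dictionaries and the simple word splitting are assumptions for illustration, and it ranks by frequency only.

```python
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy model: a word table and relation records between words and pages."""

    def __init__(self):
        self.words = {}                      # word -> word id (the "word table")
        self.relations = defaultdict(dict)   # word id -> {page id: count}

    def index_page(self, page_id, text):
        # Register and count every word on the page, all in lowercase.
        counts = Counter(re.findall(r"\w+", text.lower()))
        for word, count in counts.items():
            word_id = self.words.setdefault(word, len(self.words) + 1)
            self.relations[word_id][page_id] = count

    def search(self, word):
        # Look in the word table first; only then follow the relations.
        word_id = self.words.get(word.lower())
        if word_id is None:
            return []
        hits = self.relations[word_id]
        return sorted(hits, key=lambda page: (-hits[page], page))


index = TinyIndex()
index.index_page(1, "TYPO3 search search")
index.index_page(2, "search engine")
pages_for_search = index.search("Search")
```

Page 1 is listed before page 2 for "search" because the word occurs twice there; a word missing from the word table yields no result at all.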
31. s for a single TYPO3 page id is if the gr_list is set. This works as follows: If a page has exactly the same content both with and without logins, then it is stored only once. If the page content differs depending on whether a user is logged in or not (it may even do so based on the fe_groups), then it is indexed as many times as the content differs. The phash is of course different, but the phash_grouping value is the same. The table index_grlist will always hold one record per phash row of item_type "0" (that is, TYPO3 pages). But it may also hold many more records. These point to the phash row in question in the case of other gr_list combinations which actually had the SAME content, and thus refer to the same phash row.

External media

External media (pdf, doc, html, txt) is tricky. External media is always detected as links to local files in the content of a TYPO3 page which is being indexed. But external media can be linked to from more than one page. So the index_section table may hold many entries for a single external phash record, one for each position where it is found. Also, it is important to notice that external media is only indexed or updated if a "parent" TYPO3 page is re-indexed. Only then will the links to the external files be found. In a searching operation, external media will be listed only once (grouping by phash), but say two TYPO3 pages are linking to the document; then only one of them will be shown as t
32. te These tricky scenarios exist only if the content on a page differs based on login. It does not affect situations with access restriction to the page as a whole. A general lesson from this is to reduce the number of hidden content elements. Instead, use hidden pages. Better, more reliable.

Analysing the indexed data

The indexer is constructed to work with TYPO3's page structure. As opposed to a crawler, which simply indexes all the pages it can find, the TYPO3 indexer MUST take the following into account:
- Only cached pages can be indexed. Pages with dynamic content, such as search pages, should supply their own search engine for lookup in specific tables. Another option is to selectively allow certain of those dynamic pages to be cached anyway (see the cHashParams concept used by some plugins).
- Pages in more than one language must be indexed separately as different pages.
- Pages with message boards may have multiple indexed versions based on what is displayed on the page: the overview or a single message board item. This is determined by the cHashParams value.
- Pages with restricted access must be observed.
- Because pages can contain different content depending on whether a user is logged in or not, and even based on which groups he is a member of, a single page (identified by the combination of id/type/language/cHashParams) may even be available in more than one indexed version based on
33. the user groups. But while the same page may have different content based on the user groups (and so must be indexed once for each), such pages may just as well present the SAME content regardless of usergroups. This is the very most tricky thing.

Understanding these complex scenarios

The best thing to do is to grab an example. Please refer to the picture below while reading the bullet list here.

1. The overview in general shows one line per phash row (a single row from the index_phash table). Such a row represents a single hit in a searching session. In other words, each line with grayish background in the overview may be a search hit. The columns of these rows are:
- Title: The search result title.
- [icon]: Click here to remove the indexed information for this entry (it will be re-indexed on the next hit).
- pHash: The id of the search row. The hash is calculated based on id, type, language, MP, cHashParams and gr_list of the page when indexed. For external media this is based on filepath and page interval (for PDFs only).
- cHash: Calculated based on the actual content which was indexed.
- rl-012: These are the rootline ids for level 0, 1 and 2. Used when searching in certain sections. For instance, a search operation may select all pages with "rl1=123", which will result in a search within pages which exist ONLY in the branch of the website where the level-1 page has uid 123.
- pid t: This is the page id, type number and sys_language_uid.
- Size: How man
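The section search mentioned for the rootline columns (selecting all rows with "rl1=123") can be sketched in Python as below. The row layout and function name are hypothetical stand-ins for index_section records; the real plugin expresses this as a condition in its SQL query.

```python
def search_in_section(section_rows, level, uid):
    """Return the phash of every row whose rootline id at `level` equals `uid`."""
    key = f"rl{level}"  # rl0, rl1 or rl2
    return [row["phash"] for row in section_rows if row[key] == uid]


# Hypothetical rows: two pages below the level-1 page 123, one below page 456.
section_rows = [
    {"phash": 111, "rl0": 1, "rl1": 123, "rl2": 7},
    {"phash": 222, "rl0": 1, "rl1": 123, "rl2": 8},
    {"phash": 333, "rl0": 1, "rl1": 456, "rl2": 9},
]
branch_123 = search_in_section(section_rows, 1, 123)
whole_site = search_in_section(section_rows, 0, 1)
```

Filtering on rl1 restricts the search to one branch, while filtering on rl0 covers every page under the root page of the site.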
34. w with that content hash was already available.

6. These pages do not contain any tricks, it appears. According to the grlists, users with membership of groups "1,2", users in group "1" only, as well as surfers who did not log in at all ("0,-1" is the pseudo group for "no login") have visited the page. And because only one indexed version exists, the page must have had the same content to present to all users, regardless of their login status. The reason why the page "Your own scripts" does not contain a grlist value "0,-2,1,2" as the others do is simply because no user with that combination of usergroups has ever visited the page.

7. txt and html documents can also be indexed as external media. In the case of HTML documents, the document's title is detected and used.

[Screenshot: "Indexed search" overview in the Web > Info module for the "Content elements" branch, 2 levels deep, listing the pages Content elements, Insert content, Special content (with a PDF and a Word file), Advanced, Menu/Sitemap, Multimedia, Search, Your own scripts, XML/WAP/PDA, Rich Text Editor and ISearch]
35. which are checked with this timestamp. A "min age" defines how long an indexed page must be indexed before it is reconsidered for indexing again. A "max age" defines an absolute point at which re-indexing will occur (unless the content has not changed according to an md5 hash).
cHashParams: The cHashParams. For TYPO3 pages: these are used to re-generate the actual URL of the TYPO3 page in question. For files this is an empty array (not used).
item_type: An integer indicating the content type. 0 is TYPO3 pages; other values are external files like pdf (2), doc (3), html (1), txt (4) and so on. See the class.indexer.php file.
item_title: Title. For TYPO3 pages, the page title. For files, the basename of the file (no path).
item_description: Short description of the item. Top information on the page. Used in search result.
data_page_id: For TYPO3 pages: the id.
data_page_type: For TYPO3 pages: the type.
data_filename: For external files: the filepath (relative) or URL (not used yet).
contentHash: md5 hash of the content indexed. Before reindexing, this is compared with the content to be indexed, and if it matches there is obviously no need for reindexing.
crdate: The creation date of the INDEXING, not the page/file (see item_crdate).
parsetime: The parsetime of the indexing operation.
sys_language_uid: Will contain the value of GLOBALS["TSFE"]->sys_language_uid, which tells us the language of the page index
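How "min age", "max age" and the contentHash comparison might interact can be sketched as a small decision function in Python. This is a guess at the control flow, not the indexer's code: the function name, the three return values and in particular the `page_modified` input (what triggers reconsideration between the two ages) are assumptions, since the text only states the roles of the two ages and of the content hash.

```python
def reindex_decision(age, min_age, max_age, page_modified, content_changed):
    """Decide what to do with an already indexed page.

    age, min_age, max_age: seconds since the page was last indexed and the limits.
    page_modified: assumed signal that the page was edited since indexing.
    content_changed: True if the md5 hash of the new content differs from contentHash.
    """
    if age < min_age:
        return "skip"       # too recently indexed to be reconsidered
    if age <= max_age and not page_modified:
        return "skip"       # within max age and nothing suggests a change
    if not content_changed:
        return "refresh"    # same content hash: no need to re-index the words
    return "reindex"


recent = reindex_decision(10, 60, 600, True, True)
expired_same_content = reindex_decision(700, 60, 600, False, False)
expired_new_content = reindex_decision(700, 60, 600, False, True)
```

The "refresh" outcome stands for the case the contentHash field exists for: the age limits say the page is due, but the content hash matches, so the stored words can be kept.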
36. y bytes the indexed page consumed.
- grlist: This is the gr_list of the user which initiated the indexing operation.
- cHashParams: Additional parameters which identify the page, in addition to the id/type number which usually does that.

2. The page "Content elements" has one indexed version. The page id of the root page is "1", and the page on level 1 in the rootline had the uid "2". Notice how all subpages of "Content elements" have the exact same rl0 and rl1 value. Where the page "Content elements" does NOT have a value for rl2, all the subpages do, because they ARE the level 2 themselves. Furthermore, the page has the page id "2", a type value of "0", and is indexed with the default language (0). The size was 10.6 KB, and the user who initiated the indexing operation was a member of the groups 0,-2,1, which is effectively fe_group "1", because "0" and "-2" are pseudogroups.

3. On the page "Special content" there must have been a link to a local PDF file and a Word file, since those two are indexed in relation to this page. The PDF file is located in the path "uploads media tsref onepage pdf" relative to the website. Notice that the PDF file is actually indexed three times, one time per page. This is of course configurable. Each indexed section of the PDF file has the potential to show up as a search result row, of course, because the phash is different per indexed part. The whole point of this is that a large PDF file might contain so much in