Home
fulltext - SC Home
Contents
1. 010 THE DESIGN AND IMPLEMENTATION OF INGRES 011 DECOMPOSITION A STRATEGY FOR QUERY PROCESSING 012 AN AUTHORIZATION MECHANISM FOR A RELATIONAL DATABASE SYSTEM 013 DIFFERENTIAL FILES THEIR APPLICATION TO THE MAINTENANCE OF LARGE 014 BATCHED SEARCHING OF SEQUENTIAL AND TREE STRUCTURED FILES 015 SYNTHESIZING THIRD NORMAL FORM RELATIONS FROM FUNCTIONAL 016 ALGORITHMS FOR PARSING SEARCH QUERIES IN SYSTEMS WITH INVERTED FILE 017 PERFORMANCE OF A DATABASE MANAGER IN A VIRTUAL MEMORY SYSTEM 018 DATABASE SYSTEM APPROACH TO MANAGEMENT DECISION SUPPORT 019 ON USER CRITERIA FOR DATA MODEL EVALUATION 020 A MODEL OF STATISTICAL DATABASES AND THEIR SECURITY 021 PREFIX B TREES 022 A CLUSTERING ALGORITHM FOR HIERARCHICAL STRUCTURES Figure 7 An example of document list in HypIR 14 DOCNO 344 VOL 15 NO 4 A F CAN E A OZKARAHAN B CONCEPTS AND EFFECTIVENESS OF THE COVER COEFFICIENT BASED CLUSTERING METHODOLOGY FOR TEXT DATABASES C CLUSTERING INDEXING RELATIONSHIPS CLUSTER VALIDITY COVER COEFFICIENT DECOUPLING COEFFICIENT DOCUMENT RETRIEVAL RETRIEVAL EFFECTIVENESS D A NEW ALGORITHM FOR DOCUMENT CLUSTERING IS INTROOUCED THE BASE VERY LOW WTHE RETRIEVAL EXPERIMENTS SHOW THAT THE INFORMATION RETRIEVAL EFFECTIVENESS OF THE ALGORITHM IS COMPATIBLE WITH A VERY DEMANDING 00C232 ADAPTIVE RECORD CLUSTERING it Document aan Figure 8 Document card in HypIR Clustering
2. Update Lists 4 CLICK Export Text TO CREATE A TEXT FILE CONTAINING ALL DOCUMENTS ENTER NO OF CURRENT DOCUMENTS o ENTER NO OF DOCUMENTS AFTER UPDATE 524 Update Lists Enport Tent Figure 14 Update card in HypIR 31
3. 2 Choose Copy from Edit menu 3 Click on HyperCard s icon A new window which contains all the resource folders of HyperCard will appear on the screen 4 Click on XCMD folder Now the XCMD folder should be highlighted 5 Choose Paste from the Edit menu 6 Select quit from the File menu 7 Click on Yes in the new pop up dialog box Now the new XCMD is usable in the HyperCard However the programmer can also build the XCMD in a normal stack although this XCMD can be recognized only in the specified stack To do so replace step 3 by step 3 a 3 a Click on the icon which represents the stack in which the XCMD should be built 22 D Use XCMD in HyperTalk The new XCMD can be used as a standard HyperTalk command To use it just use the name which was typed in the Set Project Type dialog box as the actual command of the XCMD E The HyperCard glue routines There are 29 glue routines in HyperXcmd p which are a collection of procedures and functions that provide access to some of the routines and data inside HyperCard The glue routines are the only method by which the user can set and get the contents of a HyperCard field set and get the contents of a HyperTalk variable evaluate a HyperTalk expression or send a HyperCard message Note that DRVVRtime lib has been replaced by Runtime lib This makes all of the THINK Pascal standard I O functions unavailable In this section we would like to provide a brief description of the
4. the first word of field 1 of card 1 str ZeroToPas paramPtr EvalExpr paramPtr the number of words in field 1 str 5 StrToLong Since the only data type in HyperTalk is zero terminated string data coming from HyperCard may need this routine to convert itself to long integer Parameters Function StrToLong paramPtr XCmdPtr Str Str31 Longint It takes a parameter of type Str31 that contains the value to be converted It returns an unsigned long integer Examples myLonglInt StrloLong paramPtr mystring myLongInt StrToLong paramPtr 3333 6 LongToStr Any long integer that is needed to send back to HyperCard must be converted by using this routine Parameters Function LongToStr paramPtr XCmdPtr posNum LonglInt str31 It takes a longint representing the number to be converted as one of its arguments and returns a string of type Str31 Examples myString LongToStr paramPtr 2345 myString LongtoStr paramPtr myLongInt 25 7 GetGlobal GetGlobal returns a handle to the content of a HyperTalk global variable The global variable must have been declared in a HyperTalk handler prior to calling XCMD and must be passed by name Parameters Function GetGlobal paramPtr XCmdPtr globalName Str255 Handle GetGlobal functions takes a string of type Str255 representing the name of the global variable as its argument and returns a handle that points to the zero terminated contents of t
5. Information 4 476 Retrieval 1535 2754 Figure 9 Query processing card in HypIR Automatic Search icon displays the query processing card Figure 9 On this card the user can first type in his query in natural language click on the Send Query icon and obtain the corresponding query vector from the Query Terms field If the user wants to modify the query vector HypIR provides a simple method To delete a query term the user can click on the term itself By doing so the system will pop up a dialog box Figure 10 to request an action from the user After the user clicks on the Delete 15 Term button the specified term will be erased from the query vector A similar approach is also provided for the modification of a term weight The only difference is that the user needs to type in the term weight in another dialog box Figure 11 To add a new term the user can click on any spot on the Query Terms field After responding to the same dialog box shown in Figure 10 the Index stack will appear in a new window Figure 12 By clicking on the letters in the window and using the same approach as checking a dictionary the user can easily locate desired term and add it to the query vector what is the type of the maditication Alter Weight Addterm Figure 10 Query modification dialog box Type in the weight of this term Figure 12 Index windows The FS and CS icons represent the searching techniques Fu
6. buffer to the volume and updates the file s entry in the file directory As mentioned before the programmer actually does not have standard I O Pascal functions when writing a XCMD In general when it is necessary to read write data from to the screen glue routines are used to replace standard I O function i e read readin write writeln etc When it is needed to read write data from to files Macintosh file manager routines are used to replace the standard I O Pascal functions For further description of glue routines and file manager routines programmers should reference to XCMD s for HyperCard and Inside Macintosh IV respectively APPL88 BOND88 APPENDIX II SUMMARY OF PROGRAMS AND FILES USED IN HypIR XCMDs 1 FSDOCS P the source code of the full search XCMD 2 CBSDOCS2 p the source code of the cluster based search XCMD 3 SEEDDOCS P _ the source code of the nearest neighbors XCMD 4 BROWSECL P the source code of the cluster browsing XCMD 5 CITEDDOC P the source code of the cited documents XCMD 6 PREPAREQ P the source code of the XCMD of the stemming function i e send query function Major Files in HypIR 1 IID file consists of IIS bin and I SHead bin files which are generated by FILEMAKE P Pascal program using the input files 30 2 IC file consists of IC bin and ICHead bin files which are generated by CBSFILEM P Pascal program using the input files 3 DV file consists of DV bin and D
7. card field or FALSE for a background field The second parameter is of type Str255 and contains the name of the field whose contents is needed by the XCMD Examples This example assumes a HyperTalk card field was first declared with the name TestField Var my Handle handle myCardField Boolean fieldName Str255 pasStr Str255 27 eaessoccsesseseeseaes aeneseonscovencanccese fieldName TestField myCardField StrToBoolean paramPtr TRUE make it a card field get the contents of card field TestField myHandle GetFieldByName paramPtr myCardField fieldName ZeroToPas paramPtr myHandle pasStr DisposHandle my Handle Pererrrre rere TTT eit t rei irri rrey 10 SetFieldBy Name SetFieldByName sets the content of a card or background field to a zero terminated string The field is specified by its name Parameters Procedure SetFieldByName paramPtr XCmdPtr cardfieldFlag Boolean fieldName str255 fieldVal Handle It contains three major parameters The first parameter contains a Boolean value that determines what kind of field card or background will be set use TRUE for a card field or FALSE for a background field The second parameter is of type Str255 and contains the name of the field to be set The third parameter contains a handle that points to a zero terminated string Examples This example assume a card field has been created and named TestField var my
8. cluster representatives centroids Then after selecting best matching clusters detailed query by document comparison is performed within the selected clusters Note that this is a conceptual explanation The actual implementation may be different Although the selected clusters may not contain the best matching documents generally speaking CS and clustering provide several advantages 1 In a clustered document environment the user may choose to browse the cluster of any retrieved document This provides some expansion of recall ability as not all documents in a cluster are relevant but they are related in ways not always accessible through a query Furthermore during the process of cluster browsing the user creates a better image of his information need and can submit a better query to the system 2 In a multi search IR system CS constitutes a good alternative to FS 3 The results of FS and CS can be combined to increase the system effectiveness For instance the combination of FS and CS may provide a precision improvement of up to 25 percent CAN92a Precision is defined as the ratio of the number of retrieved relevant documents to the number of retrieved documents 4 In a clustered environment the documents of a cluster can be put into close physical proximity in secondary storage to decrease I O time and therefore to increase system efficiency SALT89 3 3 Clustering Algorithm In HypIR the semantic relationships a
9. document matching and the next section considers the search strategies 2 1 Generation of Document Representatives In IR two common approaches for document representative generation are document signatures and the vector space model SALT89 The document signature approach uses a bit map array for each document whose entries are set by a hash function using the words of documents as its input FALO85 In the vector space model the approach used in this study each document is represented by a document vector describing the words or terms which appear in the associated document This model is simple and appropriate for hypertext environments SALT89 CROU89 According to the vector space model a document database simply becomes a document D matrix For a database of m documents defined by n terms an entry in the ior D matrix in row i document i at column j term j dij 1 lt i lt m 1 lt j lt n represents the weight or frequency of term j in document i i e the number of occurrences of term j in document i When constructing the D matrix a stemming algorithm should be adopted to reduce the size of the D matrix For the documents or queries written in natural English it is known that terms with a common stem will usually have similar meanings such as the following attract attracted attraction and attractive Thus if the IR system can recognize the various suffixes ed ion ive etc and remove them from th
10. may still want to see other documents found by FS To do so only requires a click on one of the five radio buttons In this case FS button is the appropriate choice of course By doing so the user can go back to the original path once again Due to the functionality of this field it is named as browsing monitor The major reason that we adopted browsing monitor is to prevent the user from getting lost in HypIR This design actually contains both of the advantages of system maps and history lists It not only tracks all the navigated documents but also groups the documents according to the retrieval function Since the switching among different groups is so handy we believe that browsing monitor provides a more convenient and practical aid to the user than system maps and history lists 6 CONCLUSION Information retrieval IR also known as text or document retrieval is the process of locating and retrieving documents that are relevant to the user queries In hypertext environments document databases are organized as a network of nodes which are interlinked by various types of links Through the links the documents are navigated or browsed in a non sequential manner which is totally controlled by the user In a hypertext based retrieval system the browsing of documents should be triggered by so called optimal starting nodes In this paper we introduced HypIR a hypertext based IR system which is implemented in the HyperCard environm
11. routines which are frequently used in HypIR system programs 1 SendCardMessage SendCardMessage sends a card message HyperCard command to the current card From there the message traverses the inheritance path in the normal way Any command can be sent as a message including the user defined command Parameters Procedure SendCardMessage paramPtr XCmdPtr msg Str255 It takes a string of type Str255 representing the card message as one of its major arguments XCmdPtr is a pointer of a record XCmdBlock It contains declarations for the actual communication between HyperCard and the XCMD It is not needed to understand this record completely To write a XCMD the programmer can just use paramPtr to be the first parameter in any glue routine Thus in the rest of this section we will ignore the description of this parameter To get better understanding of XCmdBlock the programmer can reference XCMD s for HyperCard by G Bond BOND88 Examples SendCardMessage paramPtr put field 1 into field 2 SendCardMessage paramPtr go to card home 23 2 PasToZero This is one of the most frequently used routine in HypIR since HyperCard utilizes zero terminated string which is not compatible with the string type in THINK Pascal language Thus every variable or data that the programmer wants to send to HyperCard must be converted by this routine first PasToZero converts the contents of a Pascal string to a zero terminated s
12. Handle Handle myCardField Boolean fieldName Str255 tecceseesossossecsosecoros een wecacerercevacccesese 28 fieldName TestField myCardField StrToBool paramPtr TRUE SetFieldByName paramPtr myCardField fieldName myHandle DisposHandle myHandle must dispose of the handle before exiting end E The file manager routines of Macintosh Toolbox File manager is the part of the Macintosh operating system that controls the exchange of information between a Macintosh application and files The file manager allows the programmer to create and access any number of files containing whatever information Compared to the standard I O routines in Pascal file manger is on a lower level Thus to use file manager there is no so called direct access file or sequential file The user can always access any position of a file 1 Function GetVInfo drvNum integer volName StringPtr var vRefNum integer var freebytes integer OSErr It returns the name reference number and available space in bytes in volName vRefNum and freeBytes for the volume in the drive specified by drvNum OSErr is used for error detection If it is equal to 0 it means no error has occured during the execution of the function Detailed information of OSErr is available in Inside Macintosh III APPL88 2 Function FSOpen filename Sir255 vRefNum integer var refNum integer OSErr It creates an access path to file
13. S92 NIEL90 The system map method which is usually adopted in the hierarchical hypertext environment is implemented by dividing the entire hierarchy of the system into several maps according to different subjects Each map is then inserted into the links between the nodes Thus the user can navigate the entire collection according to the system maps If the user is lost he can still find the subjects which he was in History list is usually a 10 special node in a hypertext system It keeps a list of all the navigated nodes for the user Whenever the user is lost he can still find a specific document by checking such a list Neither of these designs is not used in HypIR due to the following reasons 1 In a system adopting system maps all the documents must have a kind of semantic relationship which can be clearly expressed graphically 2 System maps can not help the user find a specific document conveniently In a large database finding a navigated document may still consume a lot of time 3 History list is a test for the user s memory since it expects the user to memorize the titles of the documents without providing any hint 4 History list does not provide any browsing ability In HypIR a new approach is used The detail of this design is provided in section 5 4 5 3 HyperCard s Object Oriented Philosophy HypIR is implemented using the HyperCard graphic programming package because of its hypertext nature object
14. Scholarly Commons at Miami University http sc lib miamioh edu Computer Science and Systems Analysis Computer Science and Systems Analysis Technical Reports Design and Implementation of a Hypertext based Information Retrieval System Downloaded from Miami University Scholarly Commons Computer Science and Systems Analysis Computer Science and Systems Analysis Technical Reports Miami University Year 1992 Design and Implementation of a Hypertext based Information Retrieval System Yuan Lee Miami University commons admin lib muohio edu This paper is posted at Scholarly Commons at Miami University http sc lib muohio edu csa_techreports 48 a SA L 1 9 MIAMI UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE amp SYSTEMS ANALYSIS TECHNICAL REPORT MU SEAS CSA 1992 011 Design and Implementation of a Hypertext based Information Retrieval System Yuan Ming Lee School of Engineering amp Applied Science Oxford Ohio 45056 513 529 5928 Design and Implementation of a Hypertext based Information Retrieval System by Yuan Ming Lee Systems Analysis Department Miami University Oxford Ohio 45056 Working Paper 92 011 08 92 SYSTEMS ANALYSIS DEPARTMENT MASTER S PROJECT FINAL REPORT Presented in Partial Fulfillment of the Requirements for the Degree of Master of Systems Analysis in the Graduate School of Miami University TITLE Design and Implementation of a Hypertext based Information R
15. VHead bin files which are generated by MAKEDVFLP Pascal program using the input files 4 CM file consists of CMEMBERHEAD bin and CMEMBER bin which are generated by CBSCMEM P Pascal program using the input files 5 The file of Cited Documents consists of CITEDDOCS bin and CITEDDOCSHEAD bin which are generated by CITEMAKE P PASCAL program using the input files Furthermore the following data files are used for calculating the similarity values between query and documents and query and centroids 1 ISNC bin and STermDistn bin contain the normalization component and collection frequency component respectively for each document They are generated by FILEMAKE p Pascal program 2 ICNC bin and CTermDistn bin contain the normalization component and collection frequency component respectively for each cluster They are generated by CBSFILEM p Pascal program APPENDIX III THE MAINTENANCE OF HypIR l The maintenance of HypIR is supported by a card called update in the HypIR stack Figure 14 On this card the Update Lists button provides a function which will add the authors and titles of the new documents to the author list and title list The Export Text button contains a script which can write the entire TODS TOIS database to a text file UPDATE CARD UPDATE STEPS 1 CREATE NEW DOCUMENTS ON SEPARATE CARDS 2 ENTER NO OF CURRENT DOCUMENTS AND NO OF DOCUMENTS AFTER UPDATE IN THE FOLLOWING FIELDS 3 CLICK
16. are placed in the second level Meanwhile various functions such as semantic association navigation sequential reading backtracking and history list are also provided AGOS92 Another example of hypertext based IR system with hierarchical cluster browsing capability is implemented by Crouch and his co workers This system allows the user to browse the nodes within a single link clustering structure When using this system the user is not expected to access best matching documents directly Instead the user can utilize the similarity values and single link structure to decide which clusters should be visited CROU89 An application of the hypertext and IR techniques on a medical handbook is defined by Frisse In his system the links are defined by already existing hierarchical relationships of different sections of the handbook FRIS88 In this paper we introduce a hypertext based text retrieval system HypIR As in any other information system efficiency and effectiveness are the main concerns Efficiency and effectiveness are respectively associated with the time and space required for searching and with the quality of retrieval The implementation principles of HypIR are proven to be both effective and efficient CAN90 CAN92a CAN92b In HypIR since the documents are independently created the semantic relationships among documents are obtained using the Cover Coefficient based Clustering Methodology C3M This algorithm generates
17. construct WILL89 In a network structured system nodes are connected in a non hierarchical manner The links between the nodes may be embedded in the names of authors the key terms or an index list in the end of a node Thus the user s navigation is totally free in network environments 5 2 Hypertext Browsing Browsing is a heuristic search through a well connected collection of documents in order to find information relevant to the user s needs THOM89 It provides two major advantages for IR 1 By browsing the user s information need can become more and more clear Hence more appropriate queries can be created 2 Since the related documents are linked together a user can evaluate a large database in a rapid manner In a hypertext based IR system the user may browse 1 the documents written by the authors of the located document 2 the references of the located document 3 the documents in the same publication 4 the documents that are very similar to the located document i e the nearest neighbors 5 the documents containing the same key terms In HypIR all of these browsing methods are implemented and future modification or enhancement will be conducted according to users feedback A major consideration when designing a hypertext based IR system is to prevent users from getting lost during browsing Currently two major designs system map and history list have been used in some hypertext systems AGO
18. d Index for Documents IID Inverted Index for Centroids 11C Templates of the IID and IC files IID lt lt Term No No of Documents Using This Term gt gt lt Document No Term Weight gt gt IC lt lt Term No No of Centroids Using This Term gt gt lt Centroid No Term Weight gt gt indicates one or more occurrences of the enclosed information gt indicates a pointer Figure 2 B IID and IC file structures and templates 5 USING HYPERTEXT FOR IR AND HypIR 5 1 Links and Nodes It is commonly agreed that the major advantage of hypertext model is the non sequential organization of the information In a hypertext based system each piece of information on the screen can provide several links through which the user can browse the entire database and retrieve increasingly useful information In general the structures containing all the nodes and links in a hypertext system can be divided into two categories hierarchical structures and network structures as shown in Figure 3 Figure 3 Hierarchical and network hypertext models In a hierarchical structure the information is divided into several levels Each lower level of the hierarchy provides more detailed information Hence the information can be searched and presented in a semantically meaningful way For a bibliography database adopted in HypIR the network structure is a better choice since the hierarchical relationships are expensive to
19. e stem attract the complexity of the system and the storage requirement of the database can both be reduced In HypIR the stemming program is coded using Porter s algorithm PORT80 Porter s algorithm is simple compared to other stemming algorithms but effective HARM91 2 2 Query Document Matching No single search strategy can satisfy all users queries Therefore it is desirable that an IR system should have more than one search strategy Two common search techniques are Full Search FS and Cluster based Search CS FS has the best performance in terms of retrieval effectiveness and CS facilitates document browsing For both a query matching similarity function also described as a search machine determines which documents or clusters potentially relevant i e match the query and should be returned to the user Several matching functions based on term weighting components of document and query terms have been introduced in the IR literature Term weighting consists of three components the term frequency component TFC the collection frequency component CFC and the normalization component NC Both the weights of terms in a document and a query denoted by waj and wg 1 lt j lt n can be derived by multiplying the term weights of these three components After obtaining the term weights the similarity between a document d and a query q can be defined as follows SALT89 n similar ity d q DE Yi 1 where n is the n
20. ent of Macintosh In HypIR the semantic relationships among documents are obtained using the C3M clustering algorithm which is known to have good IR performance For information retrieval the users are provided with various search and browsing tools These include full search FS cluster based 17 search CS nearest neighbors NN search cluster browsing CB and others In HypIR the user enters natural language queries and can easily modify them The paper introduces the underlying concepts implementation details and the object oriented nature of the HypIR system To prevent the user from being lost in the IR hyperspace a new approach browsing monitor is introduced Browsing monitor provides the advantages of system maps and history lists without their difficulties of usage In the future the modification and enhancement of HypIR can be conducted by utilizing users feedback Two other future research possibilities are the design and implementation of an on line thesaurus for better query formulation and the development of performance measurement methods that would be appropriate for hypertext based IR systems REFERENCES AGOS92 Agosti M Gradenigo G Marchetti P G A hypertext environment for interacting with textual database Information Processing amp Management 28 3 1992 371 387 lotta Apple Computer Inc Inside Macintosh I VI Addison Wesley Reading MA 1988 BELK87 Belkin N J Croft W B Retri
21. environments databases are organized as a network of nodes which are interconnected by various links Through the links the user can navigate or browse the documents in a non sequential manner This network browsing process is totally controlled by the user In a hypertext based IR system documents have multiple entries and numerous connections as shown in Figure 1 a Computed Link SS a Embedded Link Figure 1 An example hypertext structure As shown in Figure 1 computed links are constructed whenever the user submits a query On the other hand embedded links may be provided by system functions In such a system the browsing of hypertext is triggered by some optimal starting nodes which are the documents with high similarity to the user query The user may then navigate among documents following the original query or alternatively the user may utilize system functions to find documents containing the same keywords or documents written by the same author etc Consequently the user can become more and more familiar with the system and his information need Thus a query which is more accurate than the initial query can be formulated and more relevant documents can be found The recent IR literature contains various examples of hypertext based IR systems sometimes with multimedia support For example American Memory is a multimedia integrated system which provides electronic images of selected collections of the Library of Co
22. esos saeaas SEARCH STRATEGIES file st etiveves iol aww bakes Sob Full Searches corkage rome e Gee bes wares bls San ewe 3 3 Clustering Algorithm 4 cide bhsite tee Weeden rererere 3 4 Implementation of Cluster based Search 0e 000s SYSTEM DATABASE AND FILES 2 00 cece eee eee 4 1 System Database joey sdnt Sabla dea ase erese tes oi aaa 42 System Filesi Sic sanded Vita eae a pawte y heea ete USING HYPERTEXT FOR IR AND HYPIR 000 eee eees 5 1 danks and Nodes scrot vey ka eae ii arenie tes werent 5 2 Hypertext Browsing 5 6 sa wiles ieee poh ty beeiet eee ah Seat arate dentate 5 3 HyperCard s Object Oriented Philosophy 000 00 5 4 HypIR System Design and Implementation CONCLUSION e fe nadeinvavec das dngalinted Gawau view 9 eee oa REFERENCES Verea i cartes a eae A E S Dees eave APPENDIX erno a E NET E eae tes Sams I Programming Details for HypIR 6 2 00 esac neste ce ee wes I Summary of Programs and Files Used in HypIR HI The Maintenance of HypIR cic cosa ee Ste Soa WON Be es SD i esonrsasnosessessogreonssasesnsossursvlydn lll Page o e O MN a W W N N 10 11 12 17 18 20 20 30 31 LIST OF TABLES TABLE I Characteristics of the TODS TOIS Database aero eee ee meee eens LIST OF FIGURES Figure Page 1 An example hypertext structure 0 0 cc eee e eee 1 2A Example D marines 4 Vcchy we ae ides dee
23. etrieval System PRESENTED BY Yuan Ming Lee DATE August 13 1992 COMMITTEE MEMBERS Fazli Can Advisor James D Kiper Alton Sanders Design and Implementation of a Hypertext based Information Retrieval System Graduate Research by Yuan Ming Lee In Partial Fulfillment of the Requirements for the Degree Master in Systems Analysis Miami University August 1992 Abstract Information Retrieval IR which is also known as text or document retrieval is the process of locating and retrieving documents that are relevant to the user queries In hypertext environments document databases are organized as a network of nodes which are interconnected by various types of links This study introduces a hypertext based text retrieval system HypIR In HypIR the semantic relationships among documents are obtained using a Clustering algorithm A new approach providing the advantages of system maps and history list is introduced to prevent the user from being lost in the IR hyperspace The paper presents the underlying concepts and implementation details HypIR is based on the object oriented paradigm of HyperCard il Approvals Student Advisor Member Member ill Date Date Date Date TABLE OF CONTENTS ABSTRACT tact et iena Averhrvas ye O ty on ONA LIST OF TABLES 1 2 2 1 Generation of Document Representation eee ee eee 2 2 Query Document Matching 0 56 03 4 eveec des
24. eval techniques In Annual Review of Information Science and Technology ARIST Vol 22 M E Williams Ed Elsevier Science Amsterdam The Netherlands 1987 109 145 BOND88 Bond G XCMD s for HyperCard MIS Press OR 1988 CAN90 Can F Ozkarahan E A Concepts and effectiveness of the cover coefficient based clustering methodology for text databases ACM Transactions on Database Systems 15 4 Dec 1990 483 517 CAN92a Can F Incremental clustering for dynamic information processing ACM Transactions on Information Systems to appear CAN92b Can F On the efficiency of best match cluster searches Tentatively accepted by Information Processing and Management CAN92c Can F McCarthy K J Implementation of an information retrieval system ANIRS with ranking and browsing capabilities Working paper 92 001 Dept of System Analysis Miami Univ Oxford Ohio April 1992 CROF87 Croft W B Thompson R H IR A new approach to the design of document retrieval systems Journal of the American Society for Information Science 38 6 Nov 1987 389 404 18 CROU89 Crouch D B Crouch C J Andreas G The use of cluster hierarchies in hypertext information retrieval Hypertext 89 Proceedings November 1989 225 237 FALO85 Faloutsos C Access methods for text ACM Computing Surveys 17 1 Mar 1985 49 74 FRIS88 Frisse M E Searching for information in hypertext medical handbook Commun
25. fe oues eens 8 2 B IHD and IC file structures and templates 00005 9 3 Hierarchical and network hypertext models 00005 9 4 The hierarchy of objects in HyperCard 0 c cece eee 11 5 HypIR system design ccs wee he eh anann ay Se asawies eee 13 6 Main menu in Hyp Roe teu ince foawiensea ew iedeetaew es 14 ve An example of document list in HypIR 0 0 eee eee 14 8 Document card in AyplRics shearer vekaanay wedasseeia es 15 9 Query processing card in HypIR 0 cece eee eee eee 15 10 Query modification dialog box cee eee eee eee eee 16 11 Term weight dialog box 5 0 4i45 had ban once eawe ds ates 16 12 ndex windows gojccd a4 vad Sina eee us Lah eRe AR 16 13 An example of Set Project Type dialog box 4 21 14 Update card in ByplRe dscns basa e ceed dee ae enw ee 31 vi 1 INTRODUCTION Information Retrieval IR is the process of locating and retrieving documents that are relevant to the user queries SALT89 In general IR is accomplished using document representatives or surrogates Whatever the representation of the documents the major problem in IR is the query formulation This is why several retrieval techniques are available in the IR literature BELK87 Among these techniques the hypertext approach which allows the user to navigate and inspect the database documents according to his own wishes is the most intuitive one NIEL90 In hypertext
26. hat variable Example This example assumes a HyperTalk global variable was first declared with the name MyList Var myHandle handle inList Str255 begin myHandle getGlobal paramPtr MyList ZeroToPas paramPtr myHandle inList end 8 SetGlobal SetGlobal sets the contents of a specified HyperTalk global variable to the contents of a zero terminated string Parameters Procedure SetGlobal paramPtr XCmdPtr globName Str255 glob Value Handle It contains two major parameters The first is a string of type Str255 that contains the name of the global to be set and the second is a handle to a zero terminated string 26 Examples This example assumes a HyperTalk global variable was first declared with the name fieldTotal Var my Handle handle Pee eroe nara erceaneseeereeessesensenesens Cd een cena rena neenereneeeeeesaseseseans SetGlobal paramPtr fieldTotal myHandle DisposHandle myHandle must dispose of the Handle after setting the global 9 GetFieldBy Name GetFieldByName returns a handle to a zero terminated string that points to the contents of a card or background field The field is specified by its name Parameters Function GetFieldByName paramPtr XCmdPtr cardFieldFlag Boolean FieldName Str255 Handle It contains two major parameters The first parameter is of type Boolean and determines what kind of field card or background will be accessed use TRUE for a
27. having the name fileName on the volume specified by vRefNum A path reference number is returned in refNum The access path s read write permission is set to whatever the file s open permission allows 3 Function FSRead refNum integer var count longint buffptr Ptr OSErr FSRead attempts to read the number of bytes specified by the count parameter from the open file whose access path is specified by refNum and transfers them to the data buffer pointed to by buffPtr The read operation begins at the current mark so the user might want to precede this with a call to SetPos If the user try to read past the end of file FSRead moves the mark to the end of file and returns an error message by setting OSErr to nonzero After the read is completed the number of bytes actually read is returned in the count parameter 29 4 Function SetFPos refNum integer posMode integer posOff longint OSErr It sets the mark of the open file whose access path is specified by ref Num to the position specified by posMode and posOff Posmode indicates how to position the mark it must contain one of the following values const fsAtMark 0 at current mark fsFromStart 1 set mark relative to beginning of file fsFromLEOF 2 set mark relative to End of File fsFromMark 3 set mark relative to current mark 5 Function FSClose refNum integer OSErr FSClose removes the access path specified by refNum writes the content of the volume
28. ications of the ACM 31 7 July 1988 880 886 GAUC92 Gauch S Intelligent information retrieval An introduction Journal of the American Society for Information Science 43 2 March 1992 175 182 HARM91 Harman D How effective is suffixing Journal of the American Society for Information Science 42 1 Jan 1991 7 15 HARM92 Harman D User friendly system instead of user friendly front ends Journal of the American Society for Information Science 43 2 March 1992 164 174 LIBC91 Library of Congress American Memory Instruction Manual Washington D C October 1991 NIEL90 Nielsen J Hypertext and Hypermedia Academic Press San Diego CA 1990 PORTED Porter M F An algorithm for suffix stripping Program 14 3 July 1980 130 137 SALT88 Salton G Buckley C Term weight approaches in automatic text retrieval Information Processing and Management 24 5 May 1988 513 523 SALT89 Salton G Automatic Text Processing Addison Wesley Reading MA 1989 SHAF91 Shafer D The Complete Book of Hypertalk 2 Addison Wesley Reading MA 1991 SYMA90 Symatec Corporation THINK Pascal User Manual U S A 1990 THOM89 Thompson R H Croft W B Support for browsing in an intelligent text retrieval system Int J of Man Machine Studies 30 1989 639 668 VANR79 Van Rijsbergen C J Information Retrieval 2nd ed Butterworths London 1979 WILL89 Willett P Recent trends in hie
29. isplayed to the user for browsing purposes It is shown that ICIIS is much more efficient than other conventional CS implementation methods Its efficiency is due to shortness of the query vectors and it is especially suitable to very large databases CAN92b 4 SYSTEM DATABASE AND FILES 4 1 System Database The document database of HypIR is the TODS TOIS database covering the papers published in both ACM Transactions on Database Systems and ACM Transactions on Information Systems In HypIR each document of TODS TOIS is represented by a document card consisting of the title author s and the abstract of the corresponding article For clustering purposes the database is defined with a D matrix using an indexing software The relevant statistics of the current TODS TOIS database are listed in Table I Table I Characteristics of the TODS TOIS Database No of documents m No of terms n No of clusters ne No of nonzero entries in the D matrix Average No of terms per document The TODS TOIS database is currently adopted by a mainframe IR system ANIRS and is available on the Miami IBM environment CAN92c The updates of TODS TOIS and its clustering structure are conducted periodically CAN92a 4 2 System Files Several files are used in the implementation of HypIR They are described in the following 1 IID file contains the inverted index representation of the D matrix and is needed for the selection of the best ma
30. ll Search and Cluster based Search respectively As defined before FS and CS provide a list of the best matching documents of the query vector in the entire database and the selected clusters respectively Again the user accesses the content of the documents by clicking on their titles The document card as shown in Figure 8 contains several icons to support different types of browsing The Nearest Neighbors icon provides the function that sends the 16 entire document as a query and returns the nearest neighbors of it as the result The Cluster Browsing icon provides a way that the user reveals all the members of the cluster in which the located document resides The Cited Documents icon provides all the titles of the documents cited by the located document Furthermore browsing by key terms is also available in HypIR The Document field contains a script that allows the user to navigate the entire database by clicking on any word which appears in the located document The field appearing on the bottom left corner of the document card is responsible for displaying the results of all the previous mentioned functions It keeps track of the results of FS Full Search CS Cluster based Search NN Nearest Neighbors CB Cluster Browsing and CD Cited Documents For instance the user may first locate a document using the FS function then traverse the database by using CB function or links of key terms After all of these the user
31. mong documents is obtained using the Cover Coefficient based Clustering Methodology C3M In C3M some of the documents are selected as the cluster initiators seeds Then the nonseed documents are assigned to one of the clusters initiated by the seed documents C3M produces a single level partitioning type clustering structure The number of clusters ng is determined using the Cover Coefficient CC concept According to CC for an m document by n term D matrix the value range of nc and the average cluster size dc is as follows 1 lt nce lt min m n max l m n lt dc lt m In C3M an m by n D matrix is first mapped into an m by m C matrix using the n C Cre Bpo dx where 1 lt i j lt m and qj and Px are the reciprocals of the ith row sum and kth column following formula sum This asymmetric C matrix shows the relationships among the documents of a database The diagonal entries of the C matrix determine the number of clusters nc and is used for the selection of cluster seeds The relationships between a nonseed document dj and a seed document dj is determined by calculating cij entry of the C matrix The whole clustering process requires the calculation of m m nc nc entries of the total m entries of the C matrix This is a small fraction of m2 since ng is much less than m A detailed discussion of C3M and its complexity analysis are available in CAN90 In a dynamic document environment the clusters of C3M can ea
32. ngress The system provides archival material related to American culture and history on CD ROM discs and laser videodiscs The multimedia database covers motion pictures photographs cartoons speeches songs and text LIBC91 The News Retrieval Tool NRT built at University of Glasgow in Scotland is based on a probabilistic retrieval model This system covers a collection of articles from Financial Times and is designed to test retrieval improvement for users of the existing profile retrieval services HARM92 CANSEARCH which is an application of artificial intelligence techniques provides intelligent access to on line information It is designed to enable doctors to retrieve cancer therapy related documents from the MEDLINE database To use CANSEARCH the user must have sophisticated medical knowledge but little IR experience GAUC92 PR Intelligent Interface for Information Retrieval is a knowledge base system which allows the user to find information using various means To retrieve documents both natural language query and Boolean query formulation can be used CROF87 THOM89 HYPERLINE which has been developed by the Information Retrieval Service of European Space Agency is based on a two level conceptual architecture for the construction of a hypertext environment for interacting with large textual databases In HYPERLINE the collection of documents of interest is placed in the first level and the semantically related concepts
33. ntains the Runtime lib and Interface lib files At this step a new project has been created successfully To create a new source file the programmer can 1 Choose the New from the File menu Then an empty window will appear on the screen 2 Type in the source code After creating the project and source files the source files must be added to the project To add a source file to the project simply use the Add command in the project menu The project window should now contain three files Runtime lib interface lib and the source file To execute the program simply choose Go from the Run menu 20 B Creating a HyperCard XCMD Resource The creation of a HyperCard XCMD is similar to the creation of a Think Pascal program However after successfully accomplishing step 1 to step 4 in the previous section the programmer needs to 1 Select Set Project Type from the Project menu and click the Code Resource icon on the left side of the dialog box 2 Fill in the dialog according to the specification in Fig 13 File information SJ Type rsrc Creator C Bundle Bit Resource Information Name Type Attributes m Driver infarmatian Flags a Delay Gas MEO i Mask iA i Mutti seqment Figure 13 An example of Set Project Type dialog box 3 Click OK to save the changes 4 Add some files to the project that are necessary for creating stand alone code resources that
34. o each other All the backgrounds within a stack have the same size on the screen Figure 4 also expresses the message passing strategy adopted in HyperCard Whenever a message has been generated by an object it will be either sent to a specific object if any exists or it will be passed through the hierarchy until it is interrupted by an object on a higher level 5 4 HypIR System Design and Implementation The architecture of HypIR is based on HyperCard s object oriented philosophy The whole system is accomplished by using three stacks the HypIR stack the help stack and the index stack as shown in Figure 5 The HypIR stack is the main body of the entire system which contains a main menu card a query processing card and several title list cards author list cards and document cards After accessing the system the user is first led to the main menu card Figure 6 On this card the user can make a decision among several searching strategies and start the navigation in the system The user may click on 1 The Automatic Search icon to access query processing card 2 The Doc List icon to browse the titles of the entire collection of documents In this case we assume that the user does not have any particular destination to start the searching 3 The Authors icon and then type in the last name of a particular author in a pop up dialog box Consequently a list of the author s papers are provided by the system 4 One of the bo
35. oks and obtain a list of documents in a certain volume of the collection All the previous choices except 1 provide the user a starting point to begin with his navigation For example after obtaining a list of documents Figure 7 the user then browses through the document titles Whenever an interesting document title is found the user can click on the title and access the content of this document Figure 8 12 HypIR Stack Index Stack Help Stack Send Query CON Full Search con Cluster based Lico EEDA Card Title List E arc Automatic bea On Author List arc O con arg Eral Figure 5 HypIR system design coal YOG id Document Field Browsin Monitor Feld Nearest Neighbors Icon Cluster Browsing Icon Cited Docs Icon lf oo Figure 6 Main menu in HypIR Document List Doc 1 330 ee Aas 001 THE ENTITY RELATIONSHIP MODEL TOWARD A UNIFIED VIEW OF DATA 002 ON THE ENCIPHERMENT OF SEARCH TREES AND RANDOM ACCESS FILES 003 THE DESIGN OF A ROTATING ASSOCIATIVE MEMORY FOR RELATIONAL DATABASE 004 OPTIMAL ALLOCATION OF RESOURCES IN DISTRIBUTED INFORMATION NETWORKS 005 DATABASE MANAGEMENT FACILITY FOR AUTOMATIC GENERATION OF DATABASE 006 SYSTEM R RELATIONAL APPROACH TO DATABASE MANAGEMENT 007 RESTRUCTURING FOR LARGE DATABASES THREE LEVELS OF ABSTRACTION 008 DYNAMIC DATABASE REORGANIZATION ALGORITHM 009 HASHING AND TRIE ALGORITHMS FOR PARTIAL MATCH RETRIEVAL
36. oriented philosophy powerful scripting language and the ability to facilitate general purpose languages such as Pascal and C The object oriented philosophy of HyperCard visualizes the well known object in traditional object oriented programming languages such as C or Smalltalk In HyperCard when a user needs a new object it is not necessary to use a specific statement provided in the language Instead a graphic tool is provided to actually draw an object on the screen The objects provided in HyperCard are stacks backgrounds cards buttons and fields Each of these objects can be associated with a script which enables the object to respond to a HyperCard message A system based on hierarchical inheritance and message passing can be constructed by combining these objects The inheritance of HyperCard is based on the relationships between the objects as shown in Figure 4 Figure 4 The hierarchy of objects in HyperCard 11 The stacks and cards are actually HyperCard terminology to represent HyperCard files and screens respectively As in Figure 4 a stack may have several different backgrounds A background can be shared by numbers of cards Each card has several buttons and fields which distinguish from the other cards with the same background Furthermore from the bottom level to the top each object inherits all the HyperCard properties of the higher level object For example all the cards with the same background look similar t
37. rarchical document clustering A critical review Information Processing and Management 24 5 1989 577 597 19 APPENDIX I IMPLEMENTATION DETAILS OF HypIR A Programming in THINK Pascal This section provides the necessary information to create Think Pascal projects and source files In THINK Pascal to create an executable program the user needs to create a project The project file should include the source files and two THINK Pascal library files runtime lib and interface lib which support the Think Pascal language i e support the standard I O and other Pascal functions The project is only compiled at the first time that the programmer executes the THINK Pascal compiler Later the THINK Pascal compiler will only compile the updated parts of the program i e the source files By doing so a lot of compilation time can be saved To create a project the programmer should 1 Use the Finder s New Folder command to create a new folder This folder will contain the THINK Pascal project and all of the source files 2 Double click on the THINK Pascal icon Then a dialog box will pop up to ask a working project Since we are creating a new project click on the New button THINK Pascal user manual p 24 3 Name the project as the following format in another dialog box THINK Pascal user manual p 25 SYMA90 filename 4 Then Think Pascal will create a new project document on the disk and display a project window which co
38. sily be updated without initiating a reclustering process CAN92a The CS effectiveness of C3M is reported in CAN90 The mentioned study shows that the effectiveness of C gt M is 15 1 to 63 5 with an average of 47 5 percent better than four other clustering algorithms in CS 3 4 Implementation of Cluster based Search Cluster based Search CS is conventionally implemented in the following two ways CVDV Match the query vectors with all the centroid vectors CV and the document vectors DV of the members of the best matching clusters ICDV Match the query vectors with the inverted indexes of centroids IC and the document vectors DV of the members of the best matching clusters In addition to these conventional methods the following method of implementation for CS has been introduced in CAN92pb ICIS Match the query vectors with the inverted indexes of centroids IC and the inverted indexes of all documents In ICIS the system first retrieves the best matching clusters by matching the query vector with the inverted indexes of centroids After obtaining the best matching clusters the documents of these clusters are selected using the results of IIS performed on the complete database In other words by using the ICIIS algorithm the IR system can also provide the results of FS without extra effort By definition the efficiency of ICHS is independent of the number of clusters to be selected and the number of documents to be d
39. statistically valid clusters i e groups of documents that are strongly associated with each other which are appropriate for IR CAN90 The selection of documents from the generated clusters is performed using inverted index search techniques HypIR is implemented using HyperCard and THINK Pascal The system has a dynamic nature and documents can easily be added and deleted The paper is organized as follows Section 2 briefly introduces the concepts of document representation and query document matching process In hypertext browsing Full Search FS and Cluster based Search CS provide the so called optimal browsing starting points Section 3 covers the principles of FS and the details of CS Section 4 and 5 respectively provide the data and file structures and the design principles used for the implementation of HypIR In Section 6 a conclusion is provided Finally three appendices are included to aid future graduate students who will work on projects related to the HypIR system 2 DOCUMENT REPRESENTATION AND QUERY DOCUMENT MATCHING In the design and implementation of IR systems some decisions should be made for the techniques of document representation query document matching and searching strategies Thus before getting into the details of HypIR we would like to introduce the techniques that are adopted in HypIR and the reasons that support the selection of these techniques This section considers document representation and query
40. tching documents using IIS Formerly ACM Transactions on Office Information Systems 2 DV file is a direct access file of document vectors of the D matrix and is used for the creation of IID file and the implementation of Nearest Neighbors browsing technique which returns the documents most similar to a located document 3 IC file contains the inverted index representation of the centroid vectors and is needed for the selection of the best matching clusters 4 CV file is a direct access file of centroid vectors and is needed for the update of the IC file For better understanding of the major file structures IID and IC let us consider the example D matrix of Figure 2 A The application of C gt M to this D matrix produces the cluster Cy dj d2 and C2 d3 d4 d5 CAN90 The centroids and templates of HD and IC for the example D matrix are depicted in Figure 2 B In Figure 2 B IID shows that t appears in two documents dj and d2 since the D matrix is binary the term weights are ones IC provides the same information for centroids For example the header information for t2 indicates that it appears in two centroids cj and c2 In cluster c1 both members dj and d2 contain t2 that is why the weight of ty in cq is two Similarly in c2 two members d4 and d5 contain t2 t t2 t3 t4 ts t6 1 E E 1 0 d 1 1 0 t 1t O dy palo 0 0 0 0 Tl ay 00 g tha jo 1 1 1 0 14 dg Figure 2 A Example D matrix Inverte
41. tring and returns a handle to the converted string A handle is a pointer which points to another pointer The programmer must dispose of the handle before exiting the XCMD use the DisposHandle ROM routine for additional information see Inside Macintosh APPL88 Parameters Function PasToZero paramPtr XCmdPtr str 255 Handle It takes the Pascal string to be the major parameter Examples paramPtr return Value PasToZero paramPtr store my error message in field 1 paramPtr return Value is a field in XCmdBlock record and can be treated as a reserved variable The string assigned to this variable can be retrieved in HyperTalk by a HyperTalk reserved word The result l 3 ZeroToPas For the reason mentioned in 2 PasToZero this routine is used whenever the XCMD gets a string from HyperCard Parameters Procedure ZeroToPas paramPtr XCmdPtr var zeroStr Ptr var pasStr str255 It contains two major parameters The first is a pointer to a zero terminated string and the second is a string variable of type str255 into which the converted string will be placed Examples see 4 EvalExpr 4 EvalExpr EvalExpr evaluates a HyperTalk expression 24 parameters Function EvalExpr paramPtr XCmdPtr expre Str255 Handle The EvalExpr function takes a Pascal string of type Str255 representing the expression to be evaluated in HyperCard and returns a handle Examples ZeroToPas paramPtr EvalExpr paramPtr
42. umber of terms According to Salton and Buckley s research 1800 different combinations of document query term weight assignments i e matching functions can be derived Among these combinations 287 were found to be distinct and six of them were recommended SALT88 The results of the experiments reported in CAN90 indicate that the matching function labeled as TW2 tfc nfx in SALT88 is the most effective one Thus TW2 is used as the search machine of HypIR 3 SEARCH STRATEGIES 3 1 Full Search Full Search FS is implemented using inverted index search IIS In IIS each distinct term in the system has a list of documents in which that term appears Each document is represented by its document number and associated with the weight of the corresponding term By traversing the list of those query terms the similarity values of all database documents are calculated SALT89 The documents with the highest similarity values are then selected to answer the user s query It is known that IIS is both effective and efficient CAN92b SALT89 3 2 Cluster based Search In IR there is a hypothesis known as the clustering hypothesis which states that closely related documents tend to be relevant to the same query VANR79 It is this hypothesis that supports the Cluster based Search CS strategy In CS the documents are divided into several homogeneous groups clusters In a typical CS the user queries are first compared with the
43. work with HyperCard First of all the Runtime lib file must be replaced by its code resource counterpart DRVRRuntim lib To do this click on the file Runtime lib in the project window Remove it from the project by selecting Remove from the Project menu The file Runtime lib should disappear from the project window 5 Select Add File from the Project menu and add the files DRVRRuntime lib HyperXlib lib and HyperXCmd p to the project All three files can be found within the Think Pascal folder DRURRuntim lib and HyperXLib lib are in Libraries subfolder HyperXCmd p is in the Interfaces folder 21 6 Type in your source code according to the following template unit Dummy Unit interface uses HyperXCmd procedure Main paramPtr XCmdPtr implementation procedure Main paramPtr XCmdPtr begin Type in the source code in this block end end 7 Save the file and add it to the project 8 Rearrange the order of the files within the Project window as the following order by using the hand cursor that appears when the mouse is inside the project window Drag the files up or down until the order matches the following DRVRRuntime lib Interface lib HyperXLib lib HyperXCmd p YourXCMD p This is the XCMD file 9 Build the XCMD resource by selecting Build Code Resource from the Project menu Save the XCMD C Add the XCMD to HyperCard stack 1 Execute the ResEdit application by double clicking the XCMD file
Download Pdf Manuals
Related Search
Related Contents
none 66520 Instructions / Assembly Progress Lighting P5729-20 Installation Guide RVG manual, English, German, French, Italian, Spanish Sony KDL-48R550C 48" Full HD 3D compatibility Smart TV Wi-Fi Black LED TV 2 Descripción del dispositivo Modo de empleo GIGA X7c Professional B. Braun lance le lecteur Omnitest® 3 : nouveau système de mesure Copyright © All rights reserved.
Failed to retrieve file