Interactive document summarization
Contents
1. [Drawing sheets, including Sheets 8 and 9 of 9 (U.S. Patent, Feb. 8, 2011, US 7,886,235 B2): sample user interface displays. The figure text is rotated in the scan and is not recoverable from this extraction.] US 7,886,235 B2. INTERACTIVE DOCUMENT SUMMARIZATION. A portion of the d
2. within the document summary window 201. It is important to note here that the examples of FIGS. 2-4 are merely static points in time and that the user has the flexibility to continuously alter the slider position. In this way, the user might first see the summary window as it appears in FIG. 3, wherein one eighth of the document is displayed. Then the user might continuously move the slider towards the "All" setting, thus requesting more and more of the document be displayed in the summary window, until he reaches the summary window as it appears in FIG. 2, wherein all of the original document is available for viewing. Then the user might decide that less of the document is desired to be viewed and thus move the slider back towards the "One" setting, such that the system is continuously showing less and less of the original document. Finally, the user might end up moving the slider all the way down to the "One" setting, wherein only the one most indicative sentence is displayed in the document summary window, as it appears in FIG. 4. As just explained, a significant advantage of the present invention lies in the use of the slider or knob user interface control. Just as in the case of a dimmer switch to control room lighting, which provides direct feedback by having the light get brighter or dimmer as the user moves the slider or knob control, as well as having an essentially infinite number of settings, using a slider or knob control in
3. bool first_in_paragraph bool remove_returns
first_in_paragraph False
/* chew up leading whitespace */
char last_loc_of_buffer buf length 1
/* identify if this is the start of a paragraph */
bool last_was_return False
while isspace buf && buf last_loc_of_buffer
switch buf
case e case n
if last_was_return /* return followed by return */
first_in_paragraph True
else last_was_return True
break
case t
if last_was_return /* return followed by tab */
first_in_paragraph True
else last_was_return False /* something came after the preceding return other than a return or tab */
break
case
if last_was_return && isspace buf 1 /* return followed by more than one white space */
first_in_paragraph True
else last_was_return False /* something came after the preceding return other than a return or tab */
break
default break
buf
start_of_sent buf
ran_out_of_buffer True
if buf last_loc_of_buffer
start_of_sent 0
return 0
/* note that past this point we'll return sum length even if we hit end of the buffer before concluding a sent */
/* Now we start looking for the end of the sentence */
start_of_sent buf
bool conclusive_sentence False
bool abrev False
char lookahead
do /* we're going to repeat a big loop until we find a sentence break or run out of characters in the buffer */
switch buf Consider the current character
4. by the current position of the user interface element against the scale, the means for generating the first document summary including: (i) means for generating a respective score for different portions of the document that measures the respective portions' ability to describe the document's contents as a whole; (ii) means for selecting for inclusion in the first document summary one or more portions having stronger scores relative to other portions and not selecting for inclusion in the first document summary portions having weaker scores relative to other portions, wherein the number of portions selected for inclusion in the first document summary is determined from the current position of the user interface element. 12. A data processing system as in claim 11, wherein the first document summary is displayed. 13. A data processing system as in claim 11, wherein the user interface element comprises one of: (a) a slider; (b) a rotating knob. 14. A data processing system as in claim 11, wherein each of the different portions correspond to different sentences of the document. 15. A data processing system as in claim 11, wherein the portions selected for inclusion in the first document summary is in a same order as the portions in the document but do not produce a continuous portion of the document. UNITED STATES PATENT AND TRADEMARK OFFICE, CERTIFICATE OF CORRECTION. PATENT NO.: 7,886,235 B2. Page of 1
5. in the buffer
case t
if buf 1 es /* handle: Suzanne said I love you */
/* If it's a quotation mark preceded by a period, we found a sentence break */
conclusive_sentence True
break
case 5
lookahead buf 1 /* If it's a period, consider next character */
if lookahead /* handle ellipses */
/* If part of an ellipsis, consider the character after the last period */
while lookahead && lookahead < last_loc_of_buffer
lookahead
if lookahead > last_loc_of_buffer /* no more characters */
buf lookahead
break
/* rule out some abbreviations by checking for space followed by capital letter */
bool was_space_after_period False
/* skip white space. Was there a space after the period? If so, it might be a sentence break */
while isspace lookahead && lookahead < last_loc_of_buffer
lookahead
was_space_after_period True
if lookahead > last_loc_of_buffer
buf lookahead
break
if Uwas_space_after_period break
/* things a sentence can start with here. If we have a quote, bullet, or dash after the space, we'll treat this as a sentence break */
if lookahead lookahead lookahead
conclusive_sentence True
break
else if isupper lookahead break /* If lowercase letter after period, it's not a sentence break */
otherwise check if it
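The comments in the excerpt above describe one of the period rules: a period is treated as a sentence break only when it is followed by white space and then something a sentence can start with. The sketch below illustrates just the "space followed by capital letter" part of that rule. It is not the Appendix A routine: the function name `period_ends_sentence` and its parameters are hypothetical, and the quote, bullet, dash, and abbreviation cases mentioned in the comments are deliberately left out.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from Appendix A): returns 1 if the period at
 * text[pos] is followed by at least one white space character and then an
 * uppercase letter, and 0 otherwise.
 */
int period_ends_sentence(const char *text, size_t len, size_t pos)
{
    size_t look = pos + 1;
    int saw_space = 0;

    if (pos >= len || text[pos] != '.')
        return 0;

    /* Skip white space after the period, remembering whether we saw any. */
    while (look < len && isspace((unsigned char)text[look])) {
        saw_space = 1;
        look++;
    }

    /* No space after the period ("3.14"), or we ran out of characters. */
    if (!saw_space || look >= len)
        return 0;

    /* A lowercase letter after the space ("e.g. this") is not a break. */
    return isupper((unsigned char)text[look]) ? 1 : 0;
}
```

Returning 0 when the buffer ends right after the period mirrors the excerpt's idea that a break cannot be confirmed without more characters; a caller would refill the buffer and ask again.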
6. invention; FIG. 6 is a sample user interface display showing some or all of the top sentence of each document in a display line or listing of documents in a computer system user interface; FIG. 7 is a sample user interface display showing the top sentence of a document in a comments field of an informational window of the document in a computer system user interface; FIG. 8 is a sample user interface display showing the top sentence of a document in a pop-up area of a display line or listing of documents in a computer system user interface; and FIG. 9 is a sample user interface display showing the top sentence of a document in an open dialog box in a computer system user interface. SUMMARY AND OBJECTS OF THE INVENTION It is an object of the present invention to provide an interactive document summarization system. It is a further object of the present invention to provide an interactive document summarization system wherein the user of the system can control the amount of the document summary. It is a still further object of the present invention to provide a file listing containing document summary information. It is an even further object of the present invention to provide document summary information about a document in a variety of contexts. The foregoing and other advantages are provided by a method for a user to display a summary of a document on an electronic display, the
7. keyboard key and/or mouse button combination. A large variety of display options is thus possible with the approach of the present invention, depending upon such factors as display size and resolution, user preferences, and system capabilities. In the foregoing specification, the present invention has been described with reference to a specific exemplary embodiment and alternative embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. APPENDIX A /* find next sentence: On return, start_of_sent will be > buf if first chars encountered are whitespace. Normally returns length of sentence, starting from returned value of start_of_sent. If it returns 0, then it ran out of buffer before finding a sentence. The caller will typically copy the remaining text to the beginning of a buffer, fill up the buffer, and then call this again. The case where a complete sentence does not fit in the buffer should be checked by the caller. Can't handle "see J.P. there" or "call A Morgan's". Handles Mr., Mrs., Ms., Dr. and i.e. */ int find_next_sentence char buf uint32 length char start_of_sent bool ran_out_of_buffer
8. present invention can be implemented on all kinds of computer systems. Regardless of the manner in which the present invention is implemented, the basic operation of a computer system embodying the present invention, including the software and electronics which allow it to be performed, can be described with reference to the block diagram of FIG. 1, wherein numeral 10 indicates a central processing unit (CPU) which controls the overall operations of the computer system, numeral 12 indicates a standard display device such as a CRT or LCD, numeral 14 indicates an input device which usually includes both a standard keyboard and a pointer-controlling device such as a mouse, and numeral 16 indicates a memory device which stores programs according to which the CPU 10 carries out various predefined tasks. The interactive document summarization program according to the present invention, for example, is generally also stored in this memory 16 to be referenced by the CPU 10. As stated above, the process of document summarization, or automatic abstracting, is well known in the art. A variety of different mechanisms, used singly and in combination, have been tried to automatically create document summaries or abstracts. Such mechanisms typically start with determining the significance of particular words and/or sentences, usually by focusing on position in the document, semantic relationships and term frequencies. Further criteria may include contextual inf
9. the present invention has greater intuitiveness and utility than would mere up and down buttons having discrete quantized levels. A slider control combined with immediate display feedback (immediately displaying greater or fewer sentences in the document being summarized as the user moves the slider) means the user only has to be concerned about whether the amount of summarized information being displayed is of the desired quantity. And the present invention has clear advantages over requiring the user to specify actual summary values or percentages. Just as in the case of a light dimmer switch, where the user only knows that they want more or less light, rather than, say, knowing that what they want is 15% more light or 22% less light, the slider control of the present invention avoids placing on the user the additional cognitive load of first estimating the new amount desired. In other words, after the user determined that more or less summary information was desired, if the interface mechanism required specifying a summary percentage or utilizing up and down buttons, then the user would have to be concerned with exactly how much more or less information is truly desired. It is less intuitive to require the user desiring more information to first determine that 49% isn't enough but that 58% is sufficient, or to try a series of static up and down clicks until the desired amount is obtained. The more intuitive interaction mechanism of the p
10. the summary facilitates rapid review of documents in which the user has little interest, as well as review of up to the entire document in the case of great user interest. Furthermore, such interactive control allows the user to expand and contract summarized documents at will, thus freeing the user to focus on the content of the summarized document rather than on trying to determine what amount or percentage is sufficient or how the underlying abstracting mechanism operates. BRIEF DESCRIPTION OF THE DRAWINGS The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which: FIG. 1 is a diagram of a typical computer system as might be used with the present invention; FIG. 2 is a sample summary document window according to one implementation of the present invention, wherein "All" of the original document to be summarized is displayable; FIG. 3 is a sample summary document window according to one implementation of the present invention, wherein one eighth of the original document to be summarized is displayable; FIG. 4 is a sample summary document window according to one implementation of the present invention, wherein "One" most representative sentence of the original document to be summarized is displayable; FIG. 5 is a flowchart of the document summarization methodology according to one implementation of the present
11. 2 Bornstein et al. 715/854. (65) Prior Publication Data: US 2006/0059442 A1, Mar. 16, 2006. (51) Int. Cl.: G06F 17/00 (2006.01). (52) U.S. Cl.: 715/854; 715/802. (58) Field of Classification Search: 715/706, 715/707, 708, 712, 853, 855, 786, 832, 833, 715/802, 707/7, 10, 11, 530. See application file for complete search history. (56) References Cited. U.S. PATENT DOCUMENTS: 5,047,868 A 9/1991 Takeda et al. 386/109; 5,168,533 A 12/1992 Kato et al. 352/54; 5,278,980 A 1/1994 Pedersen et al. 395/600; 5,303,361 A 4/1994 Colwell et al.; 5,384,703 A 1/1995 Withgott et al.; 5,477,451 A 12/1995 Brown et al. 364/419; 7,372,473 B2 5/2008 Venolia. FOREIGN PATENT DOCUMENTS: JP 04348460 A 12/1992. OTHER PUBLICATIONS: Trafton, User's Manual f ED Search Desktop Edition, 1995, pp 1 8 10 11, 1995. (Continued) Primary Examiner: Steven P. Sax. (74) Attorney, Agent, or Firm: Blakely, Sokoloff, Taylor & Zafman LLP. (57) ABSTRACT A real-time interactive document summarization system which allows the user to continuously control the amount of detail to be included in a document summary. 15 Claims, 9 Drawing Sheets. [Front-page drawing, window 201:] Pioneer To License MacOS. Pioneer Electronics Corporation announced today that it has reached an agreement with Apple Computer, Inc. to license the Mac OS for use in Pioneer's new line of personal computers. The f
12. APPLICATION NO.: 10/200,806. DATED: February 8, 2011. INVENTOR(S): Jeremy J. Bornstein et al. It is certified that error appears in the above-identified patent and that said Letters Patent is hereby corrected as shown below: In column 2, line 19, delete "sets" and insert "gets", therefor. Signed and Sealed this Eighth Day of November, 2011. David J. Kappos, Director of the United States Patent and Trademark Office
13. [Drawing sheets, including Sheets 3 and 4 of 9 (U.S. Patent, Feb. 8, 2011, US 7,886,235 B2): sample summary document windows 201 showing the sample Pioneer press release at different summary settings. The figure text is rotated in the scan and is not recoverable from this extraction.]
14. ach sentence in the original document is obtained, by examining either a preset value or the slider position value, which thus indicates how far down the ranked list to go. Again, the markers on the slider could be represented as a proportional amount of the entire document, as a numeric value of the number of sentences of the total document, or even as a non-linear value indicator of the total document. While this last form may not sound as intuitive as the former ones, it is important to note that studies have shown that most of the content of a document can be understood by only reading a relatively small amount of the entire document (e.g., 20-25%). Further, remember that the user interface of the present invention frees the user to focus on the displayed summary content rather than on some more obscure summary percentage or value. As such, a non-linear slider may provide even greater utility to the user of the present invention. Lastly, the slider position is monitored 513 so that if the user changes its position, thus indicating a desire for more or less information, the appropriate amount of summary information based on the new slider position 511 can be displayed. It is important to note a performance advantage in the process just described. In the preferred embodiment of the present invention, because the query 507 asked for all of the sentences in the document before concerning itself with how many sentences will be displayed, every sentenc
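The excerpt above says the slider position determines how far down the ranked list to go, and that the scale may be proportional or non-linear. A minimal sketch of both mappings follows. The function names, the integer slider range `0..max_pos`, and the choice of a quadratic curve for the non-linear case are all assumptions made for illustration; the patent text does not specify a particular curve.

```c
#include <stddef.h>

/*
 * Hypothetical linear mapping: slider position 0 shows one sentence
 * (the "One" end) and position max_pos shows all of them (the "All" end).
 */
size_t sentences_linear(unsigned pos, unsigned max_pos, size_t total)
{
    if (total == 0)
        return 0;
    if (max_pos == 0 || pos >= max_pos)
        return total;
    return 1 + (size_t)(((unsigned long long)(total - 1) * pos) / max_pos);
}

/*
 * Hypothetical non-linear mapping (quadratic, chosen only as an example):
 * the lower part of the slider's travel is spent on short summaries, so
 * small movements near the "One" end add only a few sentences.
 */
size_t sentences_quadratic(unsigned pos, unsigned max_pos, size_t total)
{
    unsigned long long num, den;

    if (total == 0)
        return 0;
    if (max_pos == 0 || pos >= max_pos)
        return total;
    num = (unsigned long long)pos * pos;
    den = (unsigned long long)max_pos * max_pos;
    return 1 + (size_t)(((unsigned long long)(total - 1) * num) / den);
}
```

With a 101-sentence document and a slider running from 0 to 100, the midpoint yields 51 sentences under the linear mapping and 26 under the quadratic one, which is the kind of front-loaded scale the passage suggests could be useful when a small fraction of a document carries most of its content.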
15. (12) United States Patent, Bornstein et al. US007886235B2. (10) Patent No.: US 7,886,235 B2. (45) Date of Patent: Feb. 8, 2011. (54) INTERACTIVE DOCUMENT SUMMARIZATION. (75) Inventors: Jeremy J. Bornstein, Menlo Park, CA (US); Douglass R. Cutting, Oakland, CA (US); John D. Hatton, Mt. Hermon, CA (US); Daniel E. Rose, Cupertino, CA (US). (73) Assignee: Apple Inc., Cupertino, CA (US). (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1555 days. (21) Appl. No.: 10/200,806. U.S. patent documents cited (continued): 5,483,468 A 1/1996 Chen et al. 364/551.01; 5,544,354 A 8/1996 May et al. 707/4; 5,555,369 A 9/1996 Menendez et al. 715/762; 5,576,954 A 11/1996 Driscoll; 5,619,709 A 4/1997 Caid et al.; 5,652,889 A 7/1997 Sites 717/144; 5,675,819 A 10/1997 Schuetze; 5,734,883 A 3/1998 Umen et al. 707/1; 5,794,178 A 8/1998 Caid et al.; 5,802,493 A 9/1998 Sheflott et al.; 5,832,470 A 11/1998 Morita et al.; 5,838,323 A 11/1998 Rose et al. 707/531; 5,867,164 A 2/1999 Bornstein et al. 345/357; 5,924,108 A 7/1999 Fein et al. 345/349; 5,963,205 A 10/1999 Sotomayor 715/236; 6,021,218 A 2/2000 Capps et al.; 6,081,804 A 6/2000 Smith 707/5; 6,112,201 A 8/2000 Wical 707/5; 6,243,724 B1 6/2001 Mander et al.; 6 424 362 Dis 7 200
16. ched. Then the present invention treats the text of the original document as the query to be applied to the corpus. In this way, a determination can be made as to how similar each sentence in the document is to the document as a whole. The result is a ranking or value score for each sentence in the document being summarized. Then, depending upon either a preset value n or the user-specified slider setting n, only those sentences above the ranking or value score of n get displayed in the document summary. Furthermore, the present invention, as is common in the art, uses term weighting to provide distinctions between the various terms (or, in the present invention, words) in a document. The present invention utilizes a well-known term weighting formula (see, e.g., page 518 of Salton and Buckley in the "Term Weighting Approaches in Automatic Text Retrieval" article referred to above and incorporated herein) wherein the term weighting components are as follows: tf = the number of times a term (word) occurs in a sentence (or in a document as a whole); N = the number of sentences in the document; and n = the number of sentences in the document which contain a given term. The term weighting formula is applied to both document and query vector terms and is "tfc", where t is replaced by log(tf + 1) to better normalize long documents and to keep things positive, f is replaced with log(N/n + 1) to permit a search for a word that occurs in every sentence to in
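Read together with the definitions in the excerpt above, the weight of one term in a sentence or document vector can be written out as below. This is a sketch rather than the patent's own notation: the operators inside the two logarithms are read here as log(tf + 1) and log(N/n + 1), which is an assumption, and the symbols w_i, tf_i and n_i are introduced only for this illustration. The denominator is the cosine normalization that the "c" in "tfc" denotes in Salton and Buckley's scheme, taken over all terms j of the same vector.

```latex
w_i \;=\; \frac{\log(\mathit{tf}_i + 1)\,\log\!\left(\dfrac{N}{n_i} + 1\right)}
               {\sqrt{\displaystyle\sum_{j}\left[\log(\mathit{tf}_j + 1)\,
                      \log\!\left(\dfrac{N}{n_j} + 1\right)\right]^{2}}}
```

Under this reading, a word that occurs in every sentence has n_i = N, so its second factor is log 2 rather than 0 and the word still contributes to the match.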
17. * cited by examiner. U.S. Patent, Feb. 8, 2011, Sheet 1 of 9, US 7,886,235 B2. [FIG. 1: block diagram of the computer system; legible labels: DISPLAY 12, MEMORY 16.] [Sheet 2 of 9: sample summary document window showing the sample Pioneer press release. The figure text is rotated in the scan and is not recoverable from this extraction.]
18. document comprising one or more sentences, said method comprising the steps of: i) separating the one or more sentences of the document; ii) ranking the relevance of the separate one or more sentences of the document to the document as a whole; iii) displaying an initial number of said separate one or more sentences of said document based upon the relevance ranking of said one or more sentences of said document; and iv) repeatedly specifying a subsequent number of said separate one or more sentences of said document by user control of a user control means and displaying said subsequent number of said ranked separate one or more sentences of said document. The foregoing and other advantages are also provided by a computer system for displaying a summary of a document comprising: i) a document containing one or more separate sentences; ii) a relevance ranking means for ranking the relevance of the one or more separate sentences to the document as a whole; iii) a continuously variable control means for specifying an amount of the document to be included in the summary; and iv) a display means for displaying the summary of the document based upon the specified document summary amount and the ranked relevance of the one or more sentences. Other objects, features and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description which follows. DETAILED DESCRIPTION OF THE INVENTION The
19. e in the document gets a ranking 509. Then, whenever the slider position is changed 513, displaying the larger or smaller summary is a relatively simple matter of merely displaying more or fewer sentences as dictated by the previously generated relevance ranked list. In other words, by precomputing the relevance ranking, displaying more or less detail can be accomplished quickly, without an additional query to be performed for each change in the slider position. Further, in the preferred embodiment of the present invention, displaying more or less detail is done using an offscreen bitmap, a technique well known in the computer art. Using an offscreen bitmap makes the display appear to have the sentences instantly inserted or deleted in place, rather than having the entire document summary appear to scroll from the top down whenever the user asks for more or less detail. Note that the present invention has numerous applications. A more clear application would be as part of a document browser or within a document retrieval context, thus allowing more rapid review of a corpus of documents. The present invention is equally useful within an electronic mail context, where the user can view a summary of the electronic mail received and can then determine whether more or less of the contents of the entire electronic mail message(s) is desired. Another useful application of the present invention is within the user interfac
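The performance point in the excerpt above is that the ranking is computed once, so a slider change only needs a comparison per sentence rather than a new query. One way this could look is sketched below. The names `build_rank_lookup` and `apply_slider_change`, the array layout, and the use of a per-sentence visibility flag are assumptions for illustration; the offscreen bitmap drawing itself is not modelled.

```c
#include <stddef.h>

/*
 * Hypothetical one-time step: ranked[r] holds the index of the sentence with
 * rank r (0 = most representative). Fill rank_of so that rank_of[s] is the
 * rank of sentence s.
 */
void build_rank_lookup(const size_t *ranked, size_t count, size_t *rank_of)
{
    size_t r;

    for (r = 0; r < count; r++)
        rank_of[ranked[r]] = r;
}

/*
 * Hypothetical per-slider-change step: show the n best sentences. Updates the
 * visible[] flags in place and returns how many sentences changed state, i.e.
 * how many would have to be inserted into or deleted from the display.
 */
size_t apply_slider_change(const size_t *rank_of, size_t count, size_t n,
                           unsigned char *visible)
{
    size_t s, changed = 0;

    for (s = 0; s < count; s++) {
        unsigned char now = (unsigned char)(rank_of[s] < n);
        if (now != visible[s]) {
            visible[s] = now;
            changed++;
        }
    }
    return changed;
}
```

Because the flags are indexed by document position, the sentences that remain visible are already in their original order, and the return value identifies how much of the display actually needs to be touched.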
20. e of a modern computer system, such as the Apple Macintosh Finder, where stored documents (either locally stored, e.g., on a hard disk drive of the computer, or remotely stored, e.g., across a network or even across the internet) can be displayed by name, application type, date created, etc. When using such an interface, a user is oftentimes faced with a window displaying a long list of such stored documents without much hint as to what the documents actually contain. While documents or files are often given a particular name in order to provide a hint of their content or subject matter, the user is still often left wondering what a particular document or documents contain. As such, using the summarization engine of the present invention, the system could provide a "show top sentence" option. This option would display to the user the one sentence of a document which is most indicative of the contents of that document. Such display could take the form of a portion of the display line or listing of documents in a computer system user interface (as in a Finder folder window of the Macintosh computer system), as is shown in FIG. 6, wherein the amount of the top sentence displayed is limited by the amount of window display space allotted to this field. Such display could also take the form of being displayed in a comments field of an informational window about the document in a computer system user interface, as is shown in FIG. 7. Such displa
21. effect of the present invention is that, as the user moves the slider, the window instantaneously updates to display a summary with more or less detail, and in the same order as the original document. Thus, as the user moves the slider to ask for more detail, the summarized document appears to grow, with the ever-increasing number of sentences instantly appearing in their original order and paragraph structure, with the upper limit being the entire original document. And as the user moves the slider to ask for less detail, the summarized document appears to shrink, with the sentences instantly disappearing and the remaining sentences within each remaining paragraph collapsing to form new summary paragraphs, with the lower limit being the one sentence most characteristic of the entire document according to the summarization mechanism. And again, the interface mechanism of the preferred embodiment of the present invention operates as simply as having the user manipulate a cursor control device, such as a mouse, trackball or trackpad, to move a slider control on the computer display to indicate that more or less summary information is desired. Referring now to FIG. 2, a sample screen from the system before it has summarized the document can be seen. In the figure, a document summary window 201 can be seen wherein the slider 203 is set to "All", indicating that all of the sentences in the original document are to be shown. The scroll bar 205 on the
22. erence and/or syntactic coherence. However, again, regardless of the sophistication of the summarization mechanism (and note that the present invention is equally applicable to document summarization using any reasonable summarization mechanism now known or later developed), it is highly unlikely that any particular summarization mechanism will always generate the degree of detail desired by the user. As such, the present invention provides the user with a control mechanism to vary the degree of summary detail so as to suit the particular user's tastes and interests at that point in time and for that particular purpose. In the preferred embodiment of the present invention, a summarization engine (again, any reasonable summarization mechanism would work with the present invention) running on a personal computer is used to rank all of the sentences in a document from most to least representative. The user interacts with the system by adjusting a slider control displayed in a graphical user interface of the computer system. As the user moves the slider to a given position, the engine returns the top n sentences, where n is based on the slider's position. The sentences' original order and paragraph structure are maintained in the preferred embodiment of the present invention as a summary consisting of those n sentences is displayed in a window on the computer screen. The
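The excerpt above describes returning the top n sentences while keeping their original order. A minimal sketch of that selection step is given below, assuming the ranking is already available as an array of sentence indices ordered from most to least representative. The function name `summary_in_document_order` and its parameters are hypothetical, and paragraph structure is not modelled.

```c
#include <stddef.h>

/*
 * Hypothetical helper: ranked[0..count-1] lists sentence indices from most to
 * least representative. Writes the indices of the n best sentences to out[]
 * in ascending (document) order and returns how many were written.
 * out must have room for at least min(n, count) entries.
 */
size_t summary_in_document_order(const size_t *ranked, size_t count,
                                 size_t n, size_t *out)
{
    size_t written = 0, sentence, r;

    if (n > count)
        n = count;

    /* Walk the document front to back; keep a sentence if it is in the top n. */
    for (sentence = 0; sentence < count; sentence++) {
        for (r = 0; r < n; r++) {
            if (ranked[r] == sentence) {
                out[written++] = sentence;
                break;
            }
        }
    }
    return written;
}
```

The nested loop keeps the sketch free of allocation and is adequate for illustration; a lookup table from sentence index to rank would avoid the inner scan for long documents.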
23. fact find every sentence, and c is unaltered (i.e., each weight in a vector is divided by the square root of the sum of all the squares of the unnormalized weights for the vector). Referring now to FIG. 5, the process of the present invention will now be described. When a document is to be summarized 501 with the present invention, it must first be determined 503 where the sentence breaks are in the document. Note that the sentence break determination approach of the preferred embodiment of the present invention is shown in the C programming language format in Appendix A to the present specification. The next step is to determine the sentence ranking within the document being summarized. This is accomplished by first 505 building an index, which is a database representing the contents of the sentences in the document in the form of statistics about the words in those sentences, a process which is well known in the art. Then 507 the entire original document is treated as a query to the corpus of individual sentences in the document, in accordance with the standard vector model approach. The result is a score indicating how well each sentence matches the query of the entire document, and hence the output of the queries is a rank-ordered list, by score, of all the sentences in the document 509. Then the desired number of sentences to include in the document summary display is determined 511, once a ranked list of e
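Steps 507 and 509 in the excerpt above score every sentence against the whole document and then order the sentences by score. The sketch below shows one way those two steps could be expressed, assuming each sentence and the document are already represented as normalized weight vectors of equal length, so that a dot product serves as the match score. The type `ScoredSentence`, the function names, and the tie-breaking rule (earlier sentence first) are assumptions for illustration, not details from the patent.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical record pairing a sentence's document position with its score. */
typedef struct {
    size_t index;   /* position of the sentence in the document */
    double score;   /* how well the sentence matches the whole document */
} ScoredSentence;

/* Dot product of two weight vectors; with normalized vectors this is the
 * cosine similarity used in the standard vector model. */
double similarity(const double *sentence, const double *document, size_t terms)
{
    double sum = 0.0;
    size_t t;

    for (t = 0; t < terms; t++)
        sum += sentence[t] * document[t];
    return sum;
}

/* qsort comparator: higher score first; ties keep document order. */
static int by_score_descending(const void *a, const void *b)
{
    const ScoredSentence *x = (const ScoredSentence *)a;
    const ScoredSentence *y = (const ScoredSentence *)b;

    if (x->score > y->score) return -1;
    if (x->score < y->score) return 1;
    if (x->index < y->index) return -1;
    if (x->index > y->index) return 1;
    return 0;
}

/* Produce the rank-ordered list in place. */
void rank_sentences(ScoredSentence *sentences, size_t count)
{
    qsort(sentences, count, sizeof sentences[0], by_score_descending);
}
```

Sorting once up front is what later lets a change in the requested number of sentences be served from the stored list without scoring anything again.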
24. increasing amount of digital information is the time it takes to read even a small portion of it. Whether one is reviewing a previously arranged set of documents (as in the case of reading an on-line newspaper or magazine), reviewing the results of an electronic search, or scanning documents stored on a large hard disk drive of a personal computer, it can still take considerable time to read more than a minimal amount. What is needed, therefore, is a facility which provides a summary or abstract of each document. Having a summary of each document allows the reader to determine whether that document is of interest and hence reading more of the document might be desirable. Conversely, reading the summary of a document could suffice to sufficiently inform the reader about the document or instead could indicate to the reader that the particular document is not of interest. No matter the result, a good document abstract mechanism could be quite valuable in the modern digital world. However, a good document abstract mechanism means more than merely providing an automatic summary of a document. Prior approaches to document summarization, or "Automatic Sentence Extraction" as discussed on pages 87-89 of the Introduction to Modern Information Retrieval by Salton and McGill, Copyright 1983, incorporated herein by reference in its entirety, have yet to yield abstracts in a readable natural language context which obey normal styli
25. ing sentence punctuation clues Tf the newline followed by another or a tab or 3 or more spaces it s a sentence break if lookahead n lookahead r two returns lI lookahead t return followed by a tab US 7 886 235 B2 13 APPENDIX A continued 14 gt paragraph delimiter lookahead return followed by 3 or more spaces amp amp lookahead 1 amp amp lookahead 2 conclusive sentence True break while isspace lookahead amp amp lookahead if lookahead gt last loc of buffer buf lookahead break Ditto if followed by a bullet or two hpyhens if lookahead lookahead amp amp lookahead 1 break conclusive sentence True break skip white space lookahead lt last loc of buffer Back to our initial character If a question mark or exclamation point it s a break case case l conclusive sentence True if a period or is immediately followed by a double quote count the quote as part of the sentence if buf 1 buf break default break buf while ran out of buffer conclusive_sentence conclusive sentence amp amp buf last loc of buffer return the length you conclusive sentence even if we ran out of buffer before determining conclusively whether
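The comments in this part of Appendix A describe treating a return as a break when it is followed by another return, a tab, three or more spaces, or two hyphens. The C sketch below illustrates that rule only; it is not a reconstruction of the patent's code, whose operators were lost in extraction. The function name `newline_is_paragraph_break` is hypothetical, bullet characters are left out, and line endings are assumed to be single characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the line break at text[pos] looks like a paragraph
   delimiter: it is followed by another line break, by a tab, by three
   or more spaces, or by two hyphens. Assumes text is NUL-terminated
   and that each line ending is a single character. */
bool newline_is_paragraph_break(const char *text, size_t pos)
{
    assert(text != NULL && (text[pos] == '\n' || text[pos] == '\r'));
    const char *next = text + pos + 1;

    /* Two returns in a row, or a return followed by a tab. */
    if (*next == '\n' || *next == '\r' || *next == '\t')
        return true;

    /* Return followed by two hyphens (a list item marker). */
    if (next[0] == '-' && next[1] == '-')
        return true;

    /* Return followed by three or more spaces. The && chain stops at
       the terminating NUL, so it never reads past the string. */
    return next[0] == ' ' && next[1] == ' ' && next[2] == ' ';
}
```

A caller would invoke this each time its scan reaches a line break and, on a true result, close the current sentence at that position.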
26. isclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights. REFERENCES TO RELATED APPLICATIONS The present application is related to co-pending U.S. patent application Ser. No. 08/536,903, filed on the same day as the present application, assigned to the same assignee, and having the same inventive entity. FIELD OF THE INVENTION The present invention relates to the field of document summarization, which is otherwise known as automatic abstracting, wherein an extract of a document (i.e., a selection of sentences from the document) can serve as an abstract. BACKGROUND OF THE INVENTION The advent of the personal computer and modern telecommunications has resulted in millions of computer users communicating with each other around the globe. One of the primary uses of such computers by such users is accessing the vast store of digital information which has been created over the last several decades. Further, additional digital information is created daily due to both the conversion of information previously unavailable digitally and the large amount of new information created by an ever increasing computer user population. One concern with this vast, ever
27. it's a sent or not man out of buffer gives that indicator return buf start of sent We claim: 1. A method to process a document on a data processing system, the method comprising: receiving a first input indicating a change in a position of a user interface element against a displayed scale rendered by the data processing system; and generating a first document summary for the document according to a current position of the user interface element against the scale in response to the first input, the first document summary providing a synopsis of the document's content, the first document summary's amount of text established by the current position of the user interface element against the scale, the generating of the first document summary including: i) generating a respective score for different portions of the document that measures the respective portion's ability to describe the document's contents as a whole; ii) selecting for inclusion in the first document summary one or more portions having stronger scores relative to other portions, and not selecting for inclusion in the first document summary portions having weaker scores relative to other portions, wherein the number of portions selected for inclusion in the first document summary is determined from the current position of the user interface element. 2. A method as in claim 1 wherein the first document summary is displayed. 3. A method as in claim 1 wherein the user i
28. lexibility and power of the Macintosh operating system enables Pioneer to provide complete home entertainment solutions to today's multimedia-centric personal computer customer. "The announcement is great news for consumers, as Pioneer is adeptly equipped to offer unique, personalized computing solutions to the home entertainment market," said Seiji Sanda, president, Apple Japan. Personal Computer MPC GX1: New prototype desktop A/V personal computer equipped with high quality 3D speaker and 4.4x speed CD-ROM drive. CPU: RISC-based 66MHz PowerPC 601 (high performance for processing sound and digital video). Personal Computer MPC LX100: New prototype desktop A/V personal computer equipped with high quality 3D speaker and 4.4x speed CD-ROM drive. CPU: 33MHz Motorola 68LC040 (good price performance). OTHER PUBLICATIONS Gerald Salton, entitled "The Smart Retrieval System: Experiments in Automatic Document Processing", Copyright 1971, Prentice-Hall, Inc., pp. 144-156. Salton & McGill, "The Smart and Sire Experimental Retrieval Systems", 1983, pp. 120-123. Salton & Buckley, "Term-Weighting Approaches in Automatic Text Retrieval", Information Processing & Management, vol. 24, No. 5, pp. 513-523. Witten, Moffat & Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images", 1994, pp. 141-148. Frakes and Baeza-Yates, "Information Retrieval: Data St
29. ment summary one or more portions having stronger scores relative to other portions, and not selecting for inclusion in the first document summary portions having weaker scores relative to other portions, wherein the number of portions selected for inclusion in the first document summary is determined from the current position of the user interface element. 7. A medium as in claim 6 wherein the first document summary is displayed. 8. A medium as in claim 6 wherein the user interface element comprises one of: a) a slider; b) a rotating knob. 9. A medium as in claim 6 wherein each of the different portions correspond to different sentences of the document. 10. A medium as in claim 6 wherein the portions selected for inclusion in the first document summary is in a same order as the portions in the document but do not produce a continuous portion of the document. 11. A data processing system to display a document, the data processing system comprising: means for receiving first input indicating a change in a position of a user interface element against a displayed scale rendered by the data processing system; and means for generating a first document summary for the document according to a current position of the user interface element against the scale in response to the first input, the first document summary providing a synopsis of the document's content, the first document summary's amount of text established
30. nterface element comprises one of: a) a slider; b) a rotating knob. 4. A method as in claim 1 wherein each of the different portions correspond to different sentences of the document. 5. A method as in claim 1 wherein the portions selected for inclusion in the first document summary is in a same order as the portions in the document but do not produce a continuous portion of the document. 6. A non-transitory machine readable storage medium containing stored executable computer program instructions which, when executed by a data processing system, cause said system to perform a method to process a document, the method comprising: receiving a first input indicating a change in a position of a user interface element against a displayed scale rendered by the data processing system; and generating a first document summary for the document according to a current position of the user interface element against the scale in response to the first input, the first document summary providing a synopsis of the document's content, the first document summary's amount of text established by the current position of the user interface element against the scale, the generating of the first document summary including: i) generating a respective score for different portions of the document that measures the respective portion's ability to describe the document's contents as a whole; ii) selecting for inclusion in the first docu
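The selection recited in these claims (keep the stronger-scoring portions, yet present them in the same order as in the document, possibly non-contiguously) can be illustrated with a short C sketch. It is an assumption-laden illustration, not the claimed implementation: the function name `select_top_portions`, the flag-array output and the tie-breaking rule are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mark the k best-scoring portions. keep[i] is set to 1 if portion i
   belongs in the summary and 0 otherwise. Reading keep[] from index 0
   upward yields the chosen portions in their original document order.
   Ties are resolved in favor of the earlier portion. */
void select_top_portions(const double *scores, size_t n, size_t k, int *keep)
{
    assert(scores != NULL && keep != NULL);
    if (k > n)
        k = n;
    for (size_t i = 0; i < n; i++)
        keep[i] = 0;

    /* Pick the best remaining portion, k times over. */
    for (size_t picked = 0; picked < k; picked++) {
        size_t best = n;  /* n means "none found yet" */
        for (size_t i = 0; i < n; i++) {
            if (keep[i])
                continue;
            if (best == n || scores[i] > scores[best])
                best = i;
        }
        keep[best] = 1;
    }
}
```

With scores {0.1, 0.8, 0.3, 0.9, 0.2} and k = 2, portions 1 and 3 are kept: they appear in document order but do not form a continuous run. The repeated scan is quadratic in the worst case, which seems acceptable for documents of a few dozen sentences; a sort-based selection would suit longer inputs.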
31. presentative of the distribution of terms in the document. A particular search query is then represented as a vector, such that the retrieval of a particular record or document then depends upon the magnitude of a similarity computation between the particular document's representative vector and the query's representative vector. Suffice it to say that the vector model of document comparison is well known in the art of computer search and retrieval mechanisms (see Salton and McGill, Introduction to Modern Information Retrieval, 1983, pages 120-123; Salton and Buckley, "Term-Weighting Approaches in Automatic Text Retrieval", Information Processing & Management, Vol. 24, No. 5, pp. 513-523; Witten, Moffat and Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, 1994, pp. 141-148; and Frakes and Baeza-Yates, Information Retrieval: Data Structures & Algorithms, 1992, pp. 363-392, all incorporated herein by reference in their entirety). Typical prior art search and retrieval mechanisms, however, attempt to find, out of a corpus comprised of multiple documents, one or more documents which are most similar to a single query (which may itself be a document). Instead, the preferred embodiment of the present invention treats each sentence in the document to be summarized as being equivalent to an entire document, and thus the set of all of the sentences of the document can be treated as the corpus of documents to be sear
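The idea of scoring each sentence against the whole document can be sketched in C. This is a minimal illustration rather than the patent's implementation: it assumes every sentence and the full document have already been reduced to unit-length term-weight vectors over a shared vocabulary, so the similarity computation reduces to a dot product. The names `dot_product` and `score_sentences` and the row-major layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two term-weight vectors of length n_terms.
   For unit-length vectors this equals their cosine similarity. */
double dot_product(const double *a, const double *b, size_t n_terms)
{
    double sum = 0.0;
    for (size_t t = 0; t < n_terms; t++)
        sum += a[t] * b[t];
    return sum;
}

/* Score every sentence against the whole-document query vector.
   sentences holds n_sentences rows of n_terms weights, row-major;
   scores receives one similarity value per sentence. */
void score_sentences(const double *sentences, size_t n_sentences,
                     size_t n_terms, const double *document,
                     double *scores)
{
    assert(sentences != NULL && document != NULL && scores != NULL);
    for (size_t s = 0; s < n_sentences; s++)
        scores[s] = dot_product(sentences + s * n_terms, document, n_terms);
}
```

A sentence whose terms carry more of the document's overall weight receives a higher score, which is what lets the scores serve as a measure of how representative each sentence is.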
32. resent invention allows the user to interactively operate a continuously variable control while providing immediate display feedback of the greater or lesser information until the user determines that the appropriate amount of information is displayed. Thus, another advantage of the present invention, as alluded to above, is that the user has the option of continuously changing the amount of summary information being displayed, which thus facilitates the user requesting more and more of the original document as the greater and greater summary amount further piques the user's interest. And then, after the user has read the desired amount of document summary, the user still has the option of decreasing the final amount of summary information. This has the added benefit of providing the reader with as much information as desired while still facilitating minimal document summaries which might then be used in other ways (e.g., see below regarding View by Sentence and comment window applications). A general overview of the summarization engine of the present invention will now be explained. Note first, however, that any of a large variety of well known summarization techniques are equally applicable to the present invention. In many prior art document retrieval systems, a vector model approach has been taken where each record or document is represented by a vector re
33. right hand side of the window (a standard feature of the standard Macintosh Finder user interface environment) indicates that there is more of the document that exists than can fit within the window 201 displayed on the screen; in other words, while the All setting allows viewing of the entire document, not all of the document may be displayable at a given point in time due to display screen and/or window size constraints. In this example, the original document contains 32 sentences and, with this window size, would fill several screens of text. Referring now to FIG. 3, the user has moved the slider 203 (typically via a cursor control device such as a mouse, trackball or trackpad) to indicate that he only wants a summary one eighth the size of the original document (note that predetermined summarization settings, wherein the system automatically generates a preset amount of summarization according to previously set system or user values, are equally supportable with the present invention) to be displayed within the document summary window 201. The summary now fits within the window 201, as indicated by the empty scroll bar 205 on the right hand side of the summary window. Referring now to FIG. 4, the user has now moved the slider 203 to indicate that he only wants a summary which shows the one sentence deemed by the summarization engine to be most representative of the document's content to be displayed
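The relationship between a slider setting and the amount of text shown can be made concrete with a small C sketch. It assumes the setting is expressed as a fraction of the document (for example 1/8, or 1/1 for All); the rounding rule (round up, never fewer than one sentence) and the name `sentences_for_fraction` are illustrative assumptions, since the text here does not specify how fractional counts are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sentences to show for a slider setting expressed as the
   fraction num/den of the document (e.g. 1/8). Rounds up, and always
   shows at least one sentence of a non-empty document. */
size_t sentences_for_fraction(size_t total, size_t num, size_t den)
{
    assert(den > 0 && num <= den);
    if (total == 0)
        return 0;
    size_t count = (total * num + den - 1) / den;  /* ceiling division */
    if (count < 1)
        count = 1;
    if (count > total)
        count = total;
    return count;
}
```

For the 32-sentence example document, a one-eighth setting yields `sentences_for_fraction(32, 1, 8)`, which is 4 sentences, and the All setting yields all 32. The One setting corresponds to a count of exactly one.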
34. ructures & Algorithms", 1992, pp. 363-392. Salvador and Zamora, "Automatic Abstracting and Indexing II. Production of Indicative Abstracts by Application of Contextual Inference and Syntactic Coherence Criteria", Journal of American Society for Information Science, 1971, pp. 260-274. Edmundson, "New Methods in Automatic Extracting", Journal of the Association for Computing Machinery, vol. 16, No. 2, 1969, pp. 264-285. Edmundson, "Problems in Automatic Abstracting", Communications of the ACM, vol. 7, No. 4, 1964, pp. 259-263. Edmundson, "Automatic Abstracting and Indexing: Survey and Recommendations", Communications of the ACM, vol. 4, No. 5, 1961, pp. 226-234. Rose et al., "Content Awareness in a File System Interface: Implementing the Pile Metaphor for Organizing Information", ACM Press, SIGIR '93, pp. 260-269. Mander et al., "A Pile Metaphor for Supporting Casual Organization of Information", CHI '92 Conference Proceedings, ACM Press, Human Factors in Computing Systems, 1992, pp. 627-634. Kupiec et al., "A Trainable Document Summarizer", ACM Press, Proceeding of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 68-73. "Inside Architext", The Herring Reporter, 1995, pp. 45-47. "Inside Macintosh: Macintosh Toolbox Essentials", Apple Computer, Inc., Addison-Wesley Publishing Company, 1992, pp. 7-1 to 7-78.
35. U.S. Patent, Feb. 8, 2011, Sheet 5 of 9, US 7,886,235 B2. FIG. 5 (flowchart): 501 Input document to be summarized; 503 Find sentence breaks in document; 505 Build index database for document; 507 Query index to compute similarity scores between each sentence and entire document; 509 Prepare relevance ranked list of all sentences in the document; 511 Display desired number of sentences based on slider position; 513 Has slider position changed? Sheets 6 and 7 of 9: drawing sheets (figure text not legible).
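Step 509 of the flowchart, preparing a relevance-ranked list of all sentences, can be sketched in C with the standard library's `qsort`. This is an illustration under stated assumptions and not the patent's code: the `ranked_sentence` struct, the function names and the rule that ties keep document order are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One entry of the ranked list: a sentence's position in the document
   and its similarity score (hypothetical layout). */
typedef struct {
    size_t index;
    double score;
} ranked_sentence;

/* qsort comparator: higher score first; ties keep document order. */
static int by_score_desc(const void *a, const void *b)
{
    const ranked_sentence *x = a;
    const ranked_sentence *y = b;
    if (x->score > y->score) return -1;
    if (x->score < y->score) return 1;
    return (x->index > y->index) - (x->index < y->index);
}

/* Fill out[0..n-1] with the sentences ordered from best to worst score. */
void rank_sentences(const double *scores, size_t n, ranked_sentence *out)
{
    assert(scores != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i].index = i;
        out[i].score = scores[i];
    }
    qsort(out, n, sizeof out[0], by_score_desc);
}
```

Because the ranked list depends only on the document, it can be computed once; when the slider moves (step 513), only the number of entries taken from the front of the list needs to change.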
36. stic constraints. Salton and McGill further state that "[r]eadable extracts are obtainable without excessive difficulties, but perfection cannot be expected within the foreseeable future." One difficulty with prior document abstract mechanisms, even when overcoming many of the natural language barriers, is that the system or mechanism can never know for certain whether the user is receiving as much or as little of an abstract as they would like. In other words, no matter how well the mechanism can determine which portions of the document to include in the summary or abstract, the mechanism can never automatically include just the right amount of abstract to always please the user. This can be due to different users' interest levels, different users' reasons for reviewing the document, and even time or situation varying interests of the same user. As such, what is needed is not necessarily a better abstracting algorithm as much as a mechanism which allows the user to interactively specify whether the present abstract is sufficient or, instead, whether more or less of the original document should be included in the abstract or summary. The present invention utilizes an interactive control which allows the user to specify whether more or less of the original document should be included in the document summary. Allowing the user to interactively control how much of the original document gets included in
37. was just an abbreviation now we check for Mr Mrs etc currently handles Dr Mr Mrs Ms i e if buf start of sent gt 2 switch buf 1 case T if buf 2 M buf 2 D Dr Mr abrev True break case s if buf 2 MI Ms abrev True break case e if buf 2 amp amp buf 3 i i e abrev True break j if buf start of sent gt 3 amp amp buf 1 s amp amp buf 2 r amp amp buf 3 M abrev True special case ifa period is immediately followed by a double quote count the quote as part of the sentence if labrev amp amp buf 1 A buff conclusive sentence abrev if we get here its the simple case of end of sentence break that is hello there Go away now catch amp separate list items here expensive back to our initial character If it wasn t a quote or period what was it case v This section is trying to separate lists of items e g bullets that may not use punctuation to separate the items case n if remove returns buf 5 replace the return with a space lookahead buf 1 while lookahead amp amp skip space that might be between two returns lookahead lt last loc of buffer lookahead if lookahead gt last loc of buffer buf lookahead break detect list items lack
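The comments in this part of Appendix A say the period handling currently recognizes Dr, Mr, Mrs, Ms and i.e. as abbreviations rather than sentence ends. The C sketch below illustrates that check in a simplified form. It is not a reconstruction of the original code, whose operators were lost in extraction: the function name `period_ends_abbreviation` is hypothetical, and the requirement that the abbreviation start at a word boundary is an added assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the period at text[pos] ends one of the abbreviations
   named in the appendix comments (Dr, Mr, Mrs, Ms, i.e) rather than a
   sentence. The abbreviation must start at the beginning of the text
   or after a space, so a word that merely ends in "Ms" does not match. */
bool period_ends_abbreviation(const char *text, size_t pos)
{
    static const char *const abbrevs[] = { "Dr", "Mr", "Mrs", "Ms", "i.e" };
    assert(text != NULL && text[pos] == '.');

    for (size_t a = 0; a < sizeof abbrevs / sizeof abbrevs[0]; a++) {
        size_t len = strlen(abbrevs[a]);
        if (pos < len)
            continue;  /* not enough characters before the period */
        size_t start = pos - len;
        if (memcmp(text + start, abbrevs[a], len) != 0)
            continue;
        if (start == 0 || text[start - 1] == ' ')
            return true;
    }
    return false;
}
```

A sentence splitter would call this when it meets a period and skip the break on a true result, falling through to the simple end-of-sentence case otherwise.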
38. y could also take the form of being an expanded display in a display line or listing of documents when the user positioned a pointer over the document name or icon, when in a particular expanded display mode, or when depressing a particular keyboard key and/or mouse button combination, as is shown in FIG. 8. Still further, such display could also take the form of an open dialog box where, instead of displaying a thumbnail (miniature image) of a graphic image document or merely the first sentence of a textual document, a summary comprised of a top sentence or sentences could be displayed, as is shown in FIG. 9. An additional feature of the user interface document summary mechanism is the option, as in the more general document summary invention described above, for the user to control whether more or less of the document summary is to be displayed. In other words, while the default setting of a graphical user interface which displayed the show top sentence option might typically be to show only the one top sentence, the user could have the option of displaying a greater number of representative sentences from the summarized document. Such additional sentences might simply wrap onto the next line of the display, or instead might only be displayed when the user positioned a pointer over the document name or icon when in a particular mode (e.g., similarly to the standard Macintosh Finder Balloon Help feature) or when depressing a particular