COMMITting Solr index changes to http://localhost:8983/solr/update.. Time spent: 0:00:00.072
dcabrera@DMBP exampledocs$

Figure 13: IOException from indexing the "50docs.xml" file into Solr

The issue was caused by the existence of ampersand characters ("&") in the XML file we tried to index. To fix this problem, we removed the ampersands and then ran the indexing command again (a sketch of this cleanup appears below). The files were then able to be indexed into our local Solr instance without any more issues.

Concept Map

Figure 14: Xpantrac concept map. The map shows that Xpantrac ("eXPANsion exTRACtion") is an algorithm that combines Cognitive Informatics with Information Retrieval to retrieve the topics of a given page. It takes as input a text file containing relevant information on an article (i.e., the first 30 words in the article) and contains components such as Symbol and Stopword Removal, the Query Unit Builder (which currently builds queries of every 5 words with a 1-word overlap), the External Knowledge Collector (a Solr system, API, or Bing API), the Term-Doc Matrix Builder, and the Topic Selector.

Xpantrac for Yahoo Search API
For our midterm presentation, we tried to modify Yang's original Xpantrac script, which used a database, to instead use the Bing Search API. However, we ran into multiple authentication issues. As a result of these problems, we modified the original Xpantrac script to use the Yahoo Search API.
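Returning to the ampersand issue above: a minimal sketch of the cleanup step, assuming the combined file is named 50docs.xml and contains no pre-existing XML entities (escaping to &amp;amp; instead of deleting would also have satisfied the parser):

    import io

    # Read the combined XML file, strip bare ampersands, and write it back.
    with io.open("50docs.xml", "r", encoding="utf-8") as f:
        data = f.read()

    data = data.replace("&", "")           # what we did: drop the ampersands
    # data = data.replace("&", "&amp;")    # alternative: escape them instead

    with io.open("50docs.xml", "w", encoding="utf-8") as f:
        f.write(data)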
[Yang, 90]

The design of Xpantrac has two parts: Expansion and Extraction. The flow of the algorithm is shown in the figure below.

Figure 3: Components of Xpantrac grouped into two parts. Expansion: Input Text -> Preprocessor (Symbol Remover, Stopword Remover) -> Query Unit Builder -> External Knowledge Collector <-> Web. Extraction: NLP Module -> Term-Doc Matrix Builder -> Topic Selector -> Topic tags.

Because of the modular design of Xpantrac, any component can be flexibly replaced. For our project, we used a web API as the External Knowledge Collector on the first run and later replaced it with a Solr system.

Expansion
The Expansion part of the algorithm is responsible for building a "derived corpus" of relevant information by expanding the input text through queries to an external knowledge source. This part contains three components:
1. Preprocessor: removes symbol characters (e.g., "&", "#") and stopwords (e.g., "a" and "the").
2. Query Unit Builder: segments the preprocessed input text into uniform-sized groups of words. The words are grouped with neighboring words to keep the context.
3. External Knowledge Collector: accesses a knowledge source, located outside the system, to search for and retrieve information relevant to the queries sent.

Extraction
The Extraction part is where a list of topic words is derived from the corpus created during Expansion. This part also contains three components:
1. NLP Module: applies a POS (Part of Speech) tagger to the corpus to select only nouns, verbs, or both. It also finds "lemmas" of the nouns or verbs to resolve singular and plural forms.
2. Term-Doc Matrix Builder: develops a term index using the unique words from the derived corpus and constructs a term-document matrix as in the Vector Space Model.
3. Topic Selector: identifies significant words representative of the input text.
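To make the data flow of the two parts concrete, here is a minimal sketch of the pipeline. The component names follow Figure 3, but every function body below is a simplified stand-in, not Yang's implementation:

    from collections import Counter

    STOPWORDS = set(["a", "an", "the", "and", "of", "to"])  # stand-in list

    def preprocess(text):
        # Symbol Remover + Stopword Remover (simplified)
        words = [w.strip("&#,.!?\"'()") for w in text.lower().split()]
        return [w for w in words if w and w not in STOPWORDS]

    def build_query_units(words, unit_size=5, overlap=1):
        # Query Unit Builder: uniform-sized word groups with a 1-word overlap
        step = unit_size - overlap
        return [" ".join(words[i:i + unit_size]) for i in range(0, len(words), step)]

    def collect_knowledge(queries, search):
        # External Knowledge Collector: `search` is any callable mapping a
        # query string to a list of text snippets (Yahoo API, Solr, Bing, ...)
        corpus = []
        for q in queries:
            corpus.extend(search(q))
        return corpus

    def extract_topics(corpus, num_topics=10):
        # Stand-in for the NLP Module + Term-Doc Matrix Builder + Topic
        # Selector: rank words of the derived corpus by frequency.
        counts = Counter(w for doc in corpus for w in preprocess(doc))
        return [w for w, _ in counts.most_common(num_topics)]

Because the External Knowledge Collector is just a callable here, swapping the web API for Solr only changes the `search` argument, which mirrors how we later replaced the Yahoo Search API with a Solr query.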
For example, if the current IDEAL collection in Solr is changed to suit our specifications, then you would only need to change the hostname and port to match the corresponding URL.

Configuration File
To make Xpantrac easier for future developers to use, we have created a configuration file. This file allows users to enter commonly changed variables, such as the hostname, port, query field, title of the input documents, path to the input documents, number of topics to be found, and window overlap. (A sketch of reading this file appears at the end of this section.)

xpanconfig.ini:
    [server]
    hostname = <hostname>
    port = <port>
    query_field = content
    input_documents = 0
    path_to_input_documents = <path>
    num_topics = 10
    window_overlap = 1

Figure 26: Xpantrac configuration file

Because of this configuration file, there is no longer a need to change variables directly in the Xpantrac script. This will help ensure that all variables are set correctly when a new user wishes to use the system.

Evaluation of Extracted Topics

File Hierarchy
/project/CTR_30 - A directory of 30 CTR files
/project/VARIOUS_30 - A directory of 30 various files
/project/gold_ctr30.csv - The "gold standard" of merged human topics
/project/gold_various30.csv - The "gold standard" of merged human topics
/project/human_topics_CTR30.csv - Human-assigned topics for 30 CTR articles
/project/human_topics_VARIOUS30.csv - Human-assigned topics for 30 various articles
/project/xpantrac_ctr30_10topics.csv - Xpantrac-assigned topics for 30 CTR articles (10 topics per article)
/project/xpantrac_ctr30_20topics.csv - Xpantrac-assigned topics for 30 CTR articles (20 topics per article)
/project/xpantrac_various30_10topics.csv - Xpantrac-assigned topics for 30 various articles (10 topics per article)
/project/xpantrac_various30_20topics.csv - Xpantrac-assigned topics for 30 various articles (20 topics per article)
/project/computePRF1.py - Computes the precision, recall, and F1 score of the extracted topics
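Returning to the configuration file in Figure 26 above: it can be read with Python's standard ConfigParser module (the Python 2 era name matching this codebase). A minimal sketch; the option names mirror Figure 26, but the real script's keys may differ, and the placeholder values must be filled in before getint() will succeed:

    import ConfigParser

    config = ConfigParser.ConfigParser()
    config.read("xpanconfig.ini")

    hostname = config.get("server", "hostname")
    port = config.getint("server", "port")
    query_field = config.get("server", "query_field")
    num_topics = config.getint("server", "num_topics")
    window_overlap = config.getint("server", "window_overlap")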
This link stays constant for all queries. Another option that you see is the part that says "json". Here, you can change it to return "json", "xml", "python", "ruby", "php", or "csv".

WARC Files with IDEAL Documents
Our group collaborated with the IDEAL Pages group for the initial part of our project, since we were both working with IDEAL and Solr. The IDEAL Pages group's goal was to index the IDEAL documents into Solr. To achieve this goal, they had been given a set of WARC files containing IDEAL documents in the form of HTML pages. However, the WARC files also contained non-HTML documents that were unnecessary for our purposes. After the IDEAL Pages group created a Python script to unpack the WARC files, they sent it to us for further modification.

Python Script to Remove HTML
As stated before, the WARC files included the HTML documents we needed, but they also included a lot of other files we did not need. Figure 8 shows the Python script we created to remove all of the unnecessary documents.

removeAllButHTML.py:
    import os
    import sys

    rootdir = sys.argv[1]
    for root, subFolders, files in os.walk(rootdir):
        for filename in files:
            filePath = os.path.join(root, filename)
            if filename.find(".html") == -1:
                print filePath
                os.remove(filePath)

Figure 8: Python script to remove all other files except HTML from a directory

This script recursively deletes all of the files in a root directory that do not end with the HTML extension.
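For example, assuming the unpacked WARC output was placed under a directory such as /path/to/ideal_pages (a hypothetical path), the script would be invoked as:

    > python removeAllButHTML.py /path/to/ideal_pages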
    "content": "... performance.timing ... GLOBAL.window.chrome.csi() ... pageT ... c.tick('_tbnd', void 0, ...) ...",
    "collection_id": "3650",
    "id": "7_74825401865487f671bd0fd388ce2b",
    "_version_": 1465938356823130000

Figure 24: A document from the IDEAL collection in Solr

As you can see, there is a lot of unnecessary text and JavaScript inside of the "content" field. If Xpantrac thought that this page was a match and we returned the first 30 words of the content field, it would look like:

    Google Newsvar GLOBAL window window function function d a this t this tick
    functiona a c b b void b b new Date getTime this t a b c this tick start
    null a ... var a new d GLOBAL window jstiming

Figure 25: First 30 words of the content field from the IDEAL collection in Solr

Therefore, we are unable to use the Solr collection at this time, because the project specifications for IDEAL Pages and Xpantrac were not the same. If an IDEAL collection is created to match our specifications, then you would only need to change the URL to match the collection URL and the field name to match the field containing the relevant content information.
10 Topics: water rain california weather drought los angeles storm fire street
81.75 seconds, 1399317596.51

Figure 18: Output from the Xpantrac_yahooWeb.py script

Xpantrac for Solr

Finding a Larger Solr Collection
After we successfully indexed our 50 CNN documents into Solr, we found out that 50 files is too small a number to enable Xpantrac to work correctly. Instead, we ended up using Yang's collection of Wikipedia articles on Solr. This collection currently holds 4.2 million documents, and counting.

Removing Code from Xpantrac_yahooWeb.py
First, we removed all of the database code (and "db" variables) from the Xpantrac_yahooWeb.py script. This database held one thousand New York Times articles. Solr will replace this database, so we can remove it along with the "import MySQLdb" statement.

Changing the URL in Xpantrac
After obtaining the URL to Yang's Wikipedia collection in Solr, we created a new query request in Xpantrac. First, we had to import urlopen, as seen in Figure 19.

    from urllib import urlopen

Figure 19: Importing urlopen to be used for the query request

Next, we had to modify the query_assembled with the correct URL and field name:

    for item in query_list:
        # query_unit_1, query_unit_2, ...
        query = "+".join(item)
        num_results_returned = 10
        if query != "":
            query_assembled = "http://<hostname>:<port>/solr/collection1/select?q=content:" + query + "&wt=json&indent=true"
            conn = urlopen(query_assembled)
            rsp = eval(conn.read())
            results = rsp["response"]["docs"]

Figure 20: Shows the new query_assembled with "content" as the field name to query in the collection. This can be found in the "makeMicroCorpus" function. (The hostname and port are redacted; see the Special Note.)
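The prototype parses the Solr response with eval, as shown in Figure 20. A small alternative sketch using the standard json module avoids executing the response as Python code; it assumes the same response shape and the same (redacted) URL:

    import json
    from urllib import urlopen

    # Hypothetical placeholder for the redacted collection URL in Figure 20.
    query_assembled = "http://<hostname>:<port>/solr/collection1/select?q=content:rain&wt=json"

    conn = urlopen(query_assembled)
    rsp = json.loads(conn.read())      # safer than eval for JSON responses
    results = rsp["response"]["docs"]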
Xpantrac for Yahoo Search API ... 17
File Hierarchy ... 17
Input Text Files ... 18
Yahoo Search API Authorization ... 19
Output ... 20
Xpantrac for Solr ... 20
Finding a Larger Solr Collection ... 20
Removing Code from Xpantrac_yahooWeb.py ... 20
Changing the URL in Xpantrac ... 20
Handling the Content Field ... 21
Changing the Xpantrac Parameters ... 21
Connecting with IDEAL in the Future ... 22
Configuration File ... 23
Evaluation of Extracted Topics ... 24
File Hierarchy ... 24
How to Run ... 24
Human Assigned Topics ... 24
Gold Standard Files ... 24
Evaluation Metrics ... 25
Evaluation ... 25
Lessons Learned ... 27
Special Note ... 27
Acknowledgements ... 28
References ... 28

Table of Figures
Figure 1: The 0.txt file used to run the Xpantrac script ... 6
Figure 2: How to run Xpantrac from the command line, with output ... 7
Figure 3: Components of Xpantrac grouped into two parts ... 10
Figure 5: Shows the Solr administration page

Indexing
To index documents with the default setup of the Solr server, you can use the post.jar file located in the exampledocs folder. You can copy the post.jar file into any folder and run the command "java -jar post.jar <file name here>". Once you run post.jar, it uploads the files to the server, where they are indexed.

Querying
To query the files you have indexed, you choose the Solr collection to search (for the default setup, the collection is named collection1). Once you choose the collection from the administrator page, you can select the Query tab to see the Query menu. From here you have a lot of options for your search. What we are most concerned with is the "q" box containing the "*:*" query. The left asterisk indicates the tag you want to search in (you can leave the * to search all tags), and the right asterisk indicates the content you want to search for within the tag. Searching "*:*" returns all of the documents that are contained within the server.
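The same query can also be issued over plain HTTP, outside the Admin page. A minimal sketch against the default local server; the collection name and parameters match the defaults described above, so adjust them for your own instance:

    import json
    from urllib import urlopen

    # Ask the default local Solr instance for all documents.
    url = "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true"
    rsp = json.loads(urlopen(url).read())
    print rsp["response"]["numFound"]   # total number of matching documents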
46 [main] INFO org.eclipse.jetty.deploy.providers.ScanningAppProvider - Deployment monitor /Users/dcabrera/Downloads/solr-4.7.2/example/contexts at interval 0
57 [main] INFO org.eclipse.jetty.deploy.DeploymentManager - Deployable added: /Users/dcabrera/Downloads/solr-4.7.2/example/contexts/solr-jetty-context.xml
1616 [main] INFO org.eclipse.jetty.webapp.StandardDescriptorProcessor - NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
1690 [main] INFO org.apache.solr.servlet.SolrDispatchFilter - SolrDispatchFilter.init()
1708 [main] INFO org.apache.solr.core.SolrResourceLoader - JNDI not configured for solr (NoInitialContextEx)
1708 [main] INFO org.apache.solr.core.SolrResourceLoader - solr home defaulted to 'solr/' (could not find system property or JNDI)
1709 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: 'solr/'
1865 [main] INFO org.apache.solr.core.ConfigSolr - Loading container configuration from /Users/dcabrera/Downloads/solr-4.7.2/example/solr/solr.xml

Figure 4: Shows the command to start the server and initialization output
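Once the log above appears, a quick way to confirm the server is reachable before opening the Admin page is to request the Solr root URL programmatically. A minimal sketch against the default port:

    from urllib import urlopen

    # A successful read of the Solr root page means the local instance is up.
    try:
        urlopen("http://localhost:8983/solr/").read()
        print "Solr is up"
    except IOError:
        print "Solr is not reachable"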
Xpantrac Connection with IDEAL

David Cabrera, Erika Hoffman, Samantha Johnson, Sloane Neidig
dcabrera@vt.edu, herika6@vt.edu, sjf2728@vt.edu, sloane10@vt.edu

Client: Seungwon Yang (syang20@gmu.edu)
CS4624: Edward A. Fox
Blacksburg, VA
May 8, 2014

Table of Contents
Table of Contents ... 2
Table of Figures ... 4
Abstract ... 5
User's Manual ... 6
Command Line ... 6
Developer's Manual ... 8
Inventory of Data Files ... 8
Xpantrac Explained ... 9
Expansion ... 10
Extraction ... 10
How to Setup Apache Solr ... 11
Download ... 11
Starting the Server ... 11
Indexing ... 12
Querying ... 12
WARC Files with IDEAL Documents ... 13
Python Script to Remove HTML ... 14
Indexing Documents into Solr ... 14
Attempting to use the IDEAL Pages Script ... 14
Manually Indexing Documents into Solr ... 15
Concept Map ... 17
By Kyung Lah and Ben Brumfield, CNN
updated 3:26 PM EST, Sat March 1, 2014
Watch this video: Mudslides wreak havoc on Southern Calif.
STORY HIGHLIGHTS
- Rains are the first since the weather system behind the drought collapsed
- Though desperately needed, the rain has not been great news
- The deluge has come down at more than an inch an hour at times
- Rain and cold will move, hitting the East Coast Monday
(CNN) -- Mario Vazquez grabbed his dog and got out of the way, as a stream of water and mud came gushing on to his streets.
Since California has been in the middle of its worst drought in 100 years, it would seem that the sight of rain would be good news.
But in Glendora and other towns in Los Angeles County, it wasn't.
The rain has been much needed, but Friday's deluge -- coming down at more than an inch an hour at times -- landed on bone-dry hills scorched ...
With little vegetation left to stop them, walls of water have gushed into valleys below. They have spewed mud and debris into quiet residential ...
More could hit before Saturday is up, the National Weather Service says. It has placed Los Angeles and Ventura counties under a flash flood ...
By the time it's over, up to six inches will have landed on the foothills of Los Angeles County and as much as 10 inches on the ridge line.
Weather weirdness

Figure 9: Text file containing the information from a CNN article
We had initially thought they could help us accomplish some of our tasks, so we waited for them to finish one of their deliverables so that they could share it with us. It turned out that this particular deliverable did not accomplish the same thing we needed, so we wasted time waiting on it.

Another lesson learned dealt with Apache Solr. We were very confused about the purpose of Solr when we first started our project. Additionally, we were unsure how to use it. We did not understand how to index or query files, so we had to find a lot of tutorials, some of which were misleading, or ask our primary contact. However, these tasks became clearer after we had the guest lecture from Tarek Kanan about Solr and completed the Solr assignments for homework. We hope that in the future the Solr activity will be moved toward the beginning of the semester instead of the end. We believe that we would have experienced fewer troubles if the course had been structured this way.

Overall, we gained a lot of knowledge regarding tools that were new to us, such as Solr and the Yahoo Search API. We are glad to have the experience of working with Yang's code and hope that his research can be carried on in the future.

Special Note
Yang has requested that the URL to the GMU Wikipedia Solr collection be redacted, as it should not yet be public. This explains the blackened hostname and port in Figure 20.

Acknowledgements
Handling the Content Field
In addition to changing the query field to "content" in the query_assembled request, we also had to change the field name in the configuration for the results seen later in the code. First, we changed the field name to "content". Next, we returned only the first 30 words of the content field. Only the first 30 words are used because they tend to represent the key issues of an entire document. The field change can be seen in Figure 21.

    # for M_43 configuration, only 1 result's fields merged
    for result in results[0:10]:
        short_result = " ".join(result["content"].split()[:30])
        clean_result = short_result.replace("\n", " ").replace("\t", " ").strip()

Figure 21: Shows the return of the first 30 words of the content field

Because we are no longer using the Yahoo Search API, we also removed all of the authorization code that enabled us to access that API.

Changing the Xpantrac Parameters
With Yang's help, the number of topics for Xpantrac to find was changed to 10, the number of API results to return was set to 10, and the query unit size was set to 5. These changes can be seen in Figures 22 and 23.

    def main():
        num_topics = 10
        window_overlap = 1

Figure 22: num_topics represents the number of topics to be found for each input document
Figure 4: Shows the command to start the server and initialization output ... 11
Figure 5: Shows the Solr administration page ... 12
Figure 6: A query of "*:*" that returns all of the documents in the collection ... 13
Figure 7: URL to the query response ... 13
Figure 8: Python script to remove all other files except HTML from a directory ... 14
Figure 9: Text file containing the information from a CNN article ... 15
Figure 10: XML file containing the information from the text file in Figure 9 ... 15
Figure 11: Command to index "50docs.xml" into Solr ... 15
Figure 12: XML file using the correct format ... 16
Figure 13: IOException from indexing the "50docs.xml" file into Solr ... 16
Figure 14: Xpantrac concept map ... 17
Figure 15: Creates a list of all file IDs from "plain_text_ids.txt" ... 18
Figure 16: Shows how each input text file is accessed ... 18
Figure 17: Authorization and query information for the Yahoo Search API ... 19
Figure 18: Output from the Xpantrac_yahooWeb.py script ... 20
Figure 19: Importing urlopen to be used for the query request ... 21
authorities told the news agency. No motive has been provided. A doctor with the Kunming No. 1 People's Hospital told Xinhua over the phone they're not sure of the number of casualties. Xinhua said the Kunming Railway Station is one of the largest stations in southwest China.</field>
    </doc>
    <doc>
      <field name="id">1</field>
      <field name="title">After forest fires and drought, now rains torment Southern California</field>
      <field name="content">Mario Vazquez grabbed his dog and got out of the way, as a stream of water and mud came gushing on to his streets. Since California has been in the middle of its worst drought in 100 years, it would seem that the sight of rain would be good news ...</field>
    </doc>

Figure 12: XML file using the correct format

Initially, we had 50 separate XML files, one for each of the 50 articles. However, we learned that we were able to combine these into one long XML file, with each article in its own <doc> tag. When we tried to index the "50docs.xml" file into Solr, we received the error seen in Figure 13.

dcabrera@DMBP exampledocs$ java -jar post.jar 0.xml
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
POSTing file 0.xml
SimplePostTool: WARNING: Solr returned an error #500 Server Error
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update
1 files indexed.
How to Run
> python computePRF1.py gold_ctr30.csv xpantrac_ctr30_<N>topics.csv
> python computePRF1.py gold_various30.csv xpantrac_various30_<N>topics.csv
(where <N> is 10 or 20, matching the CSV files listed above)

Human Assigned Topics
Two sets of test files, CTR_30 and VARIOUS_30, were included in this project. These files have been tagged with topics by multiple human sources. The people who tagged these articles were from the Library Sciences field, so they were experienced taggers. The human-assigned topics for each file can be found in human_topics_CTR30.csv and human_topics_VARIOUS30.csv.

Gold Standard Files
The gold standard files are a merged version of the human-assigned topics. That means that if Tagger A said that a file's topics are "Florida, marsh, tropical, coast" and Tagger B said that same file's topics are "marsh, storm, Jacksonville", then those topics would be merged in the gold standard file. Therefore, the gold standard of topics for that file would be "Florida, marsh, tropical, coast, storm, Jacksonville".
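The merge described above is a simple set union over taggers. A minimal sketch, assuming each tagger's topics for one file are given as a Python list (the variable names are illustrative):

    # Merge per-file topic lists from multiple taggers into a gold standard.
    tagger_a = ["Florida", "marsh", "tropical", "coast"]
    tagger_b = ["marsh", "storm", "Jacksonville"]

    gold = []
    for topics in (tagger_a, tagger_b):
        for topic in topics:
            if topic not in gold:    # set union, preserving first-seen order
                gold.append(topic)

    print gold
    # ['Florida', 'marsh', 'tropical', 'coast', 'storm', 'Jacksonville']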
Request handler: /select, with the common parameters (q, fq, sort, start, rows, fl, df, wt, indent). The JSON response:

    "responseHeader": {"status": 0, "QTime": 3,
                       "params": {"indent": "true", "q": "*:*", "wt": "json"}},
    "response": {"numFound": 51, "start": 0, "docs": [
        {"id": "0",
         "title": "Knife-wielding mob kills 27 at China train station",
         "content": "At least 27 people were killed and 109 wounded when a group of people armed with knives stormed a railway station in the ...",
         "_version_": 1467200232247787500},
        {"id": "1",
         "title": "After forest fires and drought, now rains torment Southern California",
         "content": "Mario Vazquez grabbed his dog and got out of the way, as a stream of water and mud came gushing on to his streets. Since ...",
         "_version_": 1467200232282390500},
        ... ]}

Figure 6: A query of "*:*" that returns all of the documents in the collection

The link at the top of the query page gives you the general structure of a query if you do not want to use the Admin page.

http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true

Figure 7: URL to the query response

From here, the "*" characters in the link represent the things we search for, and you can replace the asterisks with the queries of your choice.
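For example, to search only the title field for a single term (a hypothetical query against the same local collection), the asterisks would be replaced like this:

    http://localhost:8983/solr/collection1/select?q=title:California&wt=json&indent=true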
    eyewitnesses told cnn affiliate wpri
    wpri acrobats type aerial scaffolding
    scaffolding human chandelier cable snapped
    snapped payne told cnn fredricka
    fredricka whitfield apparatus multiple performances
    performances week ringling bros barnum
    barnum bailey launched legends time
    time venue equipment performer group
    group performers well performers carefully
    carefully inspected health safety performers
    performers guests seriously company safety
    safety department spends countless making
    making equipment safe effective continued
    continued circus local authorities investigating
    investigating incident payne legends began
    began short providence residency final
    final performances slated rest canceled
    canceled making determination remainder providence
    providence engagement payne

** Micro-corpus is created **
** Vector Space Model is applied for topic extraction **

Topics (separated by ","):
payne, island, rhode, circus, providence, reuter, american, county, john, state

Figure 2: How to run Xpantrac from the command line, with output

Developer's Manual

Inventory of Data Files

Directory containing all project files:
/project/Xpantrac.py - Script containing the Xpantrac algorithm to be used with Apache Solr
/project/0.txt - Sample input file to be used by the algorithm
/project/pos_tagger.py - Part-of-speech tagger. Trained using the CoNLL2000 corpus provided by the Natural Language Toolkit (NLTK)
File Hierarchy
/project - Directory containing all project files
/project/Xpantrac_yahooWeb.py - Script containing the Xpantrac algorithm to be used with the Yahoo Search API
/project/plain_text_ids.txt - Text file containing a list of file IDs. Used in /project/Xpantrac_yahooWeb.py
/project/files - Directory of text files with corresponding IDs. Used in /project/Xpantrac_yahooWeb.py

Input Text Files
The Xpantrac_yahooWeb.py script uses a plain_text_ids.txt file to identify all of the IDs of the text files to be used as input. These text files can be found in the /project/files directory. The IDs for the text files are simply 0-50, and the text files themselves are named 0.txt-50.txt, respectively. Figures 15 and 16 show how the files are accessed in the Xpantrac for Yahoo script.

    # develop id list
    fi = open("plain_text_ids.txt", "r")
    li = fi.read().split()
    fi.close()

Figure 15: Creates a list of all file IDs from "plain_text_ids.txt"

    print "\n========== Document ID: %s is being processed ==========" % doc_id

    filename = str(filenum) + ".txt"
    # for Linux/Mac machines:
    text = open("files/" + filename, "r").read()
    # for Windows machines:
    # text = open("files\\" + filename, "r").read()

Figure 16: Shows how each input text file is accessed

Yahoo Search API Authorization
Querying the Yahoo Search API required authorization.
    <id>1</id>
    <title>After forest fires and drought, now rains torment Southern California</title>
    <content>Mario Vazquez grabbed his dog and got out of the way, as a stream of water and mud came gushing on to his streets.
    Since California has been in the middle of its worst drought in 100 years, it would seem that the sight of rain would be good news.
    But mud from the streets is beginning to ooze over into yards, pools and houses. It has damaged two homes in Glendora so far, police chief Tim Staub said.</content>

Figure 10: XML file containing the information from the text file in Figure 9

Next, we tried to manually index those XML files into Solr using the command line:

    dcabrera@DMBP exampledocs$ java -jar post.jar 50docs.xml

Figure 11: Command to index "50docs.xml" into Solr

However, we ran into an error. After examining Solr's schema.xml file and reviewing some tutorials, we realized that we had been formatting our XML files incorrectly for Solr. The correct formatting can be seen in Figure 12.

50docs.xml:
    <doc>
      <field name="id">0</field>
      <field name="title">Knife-wielding mob kills 27 at China train station</field>
      <field name="content">At least 27 people were killed and 109 wounded when a group of people armed with knives stormed a railway station in the southwest Chinese city of Kunming, authorities said, according to state news agency Xinhua. It was an organized, premeditated terrorist attack,
Figure 20: Shows the new query_assembled with "content" as the field name to query in the collection. This can be found in the "makeMicroCorpus" function ... 21
Figure 21: Shows the return of the first 30 words of the content field ... 21
Figure 22: num_topics represents the number of topics to be found for each input document ... 21
Figure 23: "u_size" represents the query unit size and "a_size" represents the API return size ... 22
Figure 24: A document from the IDEAL collection in Solr ... 22
Figure 25: First 30 words of the content field from the IDEAL collection in Solr ... 22
Figure 26: Xpantrac configuration file ... 23

Abstract

Title: Integrating Xpantrac into the IDEAL software suite, and applying it to identify topics for IDEAL webpages

Identifying topics is useful because it allows us to easily understand what a document is about. If we organize documents into a database, we can then search through those documents using their identified topics.

Previously, our client, Seungwon Yang, developed an algorithm for identifying topics in a given webpage, called Xpantrac. This algorithm is based on the Expansion-Extraction approach; consequently, it is also named after this approach. In the first part, the text of a document is used as input to Xpantrac and is expanded into relevant information using a search engine.
/project/pos_tagger.pyc - Compiled version of /project/pos_tagger.py
/project/get-pip.py - Package installer
/project/stopwords.txt - A list of words to exclude from the topic identification
/project/custom_stops.txt - A list of words to exclude from the topic identification
/project/Xpantrac_yahooWeb.py - Script containing the Xpantrac algorithm to be used with the Yahoo Search API
/project/plain_text_ids.txt - Text file containing a list of file IDs. Used in /project/Xpantrac_yahooWeb.py
/project/files - Directory of text files with corresponding IDs. Used in /project/Xpantrac_yahooWeb.py
/project/processWarcDir.py - Unpacks a WARC file and returns only HTML files
/project/CTR_30 - A directory of 30 CTR files
/project/VARIOUS_30 - A directory of 30 various files
/project/gold_ctr30.csv - The "gold standard" of merged human topics
/project/gold_various30.csv - The "gold standard" of merged human topics
/project/human_topics_CTR30.csv - Human-assigned topics for 30 CTR articles
/project/human_topics_VARIOUS30.csv - Human-assigned topics for 30 various articles
/project/xpantrac_ctr30_10topics.csv - Xpantrac-assigned topics for 30 CTR articles (10 topics per article)
/project/xpantrac_ctr30_20topics.csv - Xpantrac-assigned topics for 30 CTR articles (20 topics per article)
/project/xpantrac_various30_20topics.csv - Xpantrac-assigned topics for 30 various articles (20 topics per article)
Evaluation Metrics
This evaluation of topics measures precision, recall, and F1. Precision is the proportion of matching topics (i.e., C) among all of the retrieved topics (i.e., A), computed by the following formula:

    precision = |C| / |A| = P(relevant | retrieved)

Recall is the proportion of matching topics (i.e., C) among all of the relevant topics (i.e., B), which are assigned by the human topic indexers or exist as the gold standard:

    recall = |C| / |B| = P(retrieved | relevant)

Ideally, both the precision and recall values should be 1. This would mean that the sets of topics compared are exactly the same.

The F1 score combines precision and recall with the following formula:

    F1 = 2 * (precision * recall) / (precision + recall)

Evaluation
The tables below show the evaluation of average precision, recall, and F1 of the gold standard of topics versus 10 Xpantrac topics.

> python computePRF1.py gold_ctr30.csv xpantrac_ctr30_10topics.csv
Evaluation:
  Average Precision:
  Average Recall:
  Average F1:

> python computePRF1.py gold_various30.csv xpantrac_various30_10topics.csv
Evaluation:
  Average Precision:
  Average Recall:
  Average F1:

Above, the number of human-assigned topics is much larger than the number of Xpantrac topics (10). Because of this, the recall value will be somewhat low. Increasing the number of Xpantrac topics from 10 to a larger number, such as 20, will increase the recall value.
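To make the formulas above concrete, here is a minimal sketch of the per-file computation. computePRF1.py's actual implementation may differ; the set semantics and lowercase matching here are assumptions:

    def prf1(gold_topics, extracted_topics):
        # C = matching topics, A = retrieved (extracted), B = relevant (gold)
        gold = set(t.lower() for t in gold_topics)
        extracted = set(t.lower() for t in extracted_topics)
        matches = len(gold & extracted)

        precision = float(matches) / len(extracted) if extracted else 0.0
        recall = float(matches) / len(gold) if gold else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

This also shows why adding more extracted topics raises recall (|C| can only grow against a fixed gold set) while tending to lower precision (|A| grows faster than |C|).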
We take the health and safety of our performers and our guests very seriously, and our company has a safety department that spends countless hours making sure that all of our equipment is indeed safe and effective for continued use," he said.
The circus and local authorities are investigating the incident together, Payne said.
"Legends" began a short Providence residency on Friday. The final five performances there were slated for 11 a.m., 3 p.m., and 7 p.m. on Sunday, and 10:30 a.m. and 7 p.m. on Monday.
"The rest of the 11 a.m. Sunday show was canceled and we're making a determination about the remainder of the shows for the Providence engagement," Payne said.

Figure 1: The 0.txt file used to run the Xpantrac script

PS C:\Users\sloan_000\Desktop\project> python ./Xpantrac.py

Input text 0.txt is being processed ...

List of queries (query size 5):
    cnn ringling bros barnum bailey
    bailey circus performers injured providence
    providence rhode island apparatus failed
    failed circus spokesman stephen payne
    payne performers fell hair hang
    hang apparatus holds performers hair
    hair failed payne performer injured
    injured ground performers hospitalized injuries
    injuries accident rhode island hospital
    hospital spokeswoman jill reuter told
    told cnn listed critical condition
    condition reuter clear victims multiple
    multiple emergency units responded accident
    accident dunkin donuts center eyewitnesses
To run the Xpantrac script, simply type "python ./Xpantrac.py". The output in the console will show the query size, each query performed, and a list of topics found in the relevant documents.

(CNN) -- Nine Ringling Bros. and Barnum and Bailey circus performers were among 11 people injured Sunday in Providence, Rhode Island, after an apparatus used in their act failed, circus spokesman Stephen Payne said.
Eight performers fell when the hair-hang apparatus -- which holds performers by their hair -- failed, Payne added. Another performer was injured on the ground, he said.
The performers were among 11 people hospitalized with injuries related to the accident, Rhode Island Hospital spokeswoman Jill Reuter told CNN. One of those people was listed in critical condition, Reuter said.
It was not immediately clear who the other two victims were.
Multiple emergency units responded to the accident at the Dunkin' Donuts Center.
Eyewitnesses told CNN affiliate WPRI that they saw acrobats up on a type of aerial scaffolding doing a "human chandelier" when a cable snapped.
Payne told CNN's Fredricka Whitfield the apparatus had been used for multiple performances each week since Ringling Bros. and Barnum & Bailey launched its "Legends" show in February.
"Each and every time that we come to a new venue, all of the equipment that is used by this performer -- this group of performers as well as other performers -- is carefully inspected.
Eventually, the F1 measure will increase as well. However, the precision value may decrease slightly. Below are the average precision, recall, and F1 scores for the increased number of topics (20).

> python computePRF1.py gold_ctr30.csv xpantrac_ctr30_20topics.csv
Evaluation:
  Average Precision:
  Average Recall:
  Average F1:

> python computePRF1.py gold_various30.csv xpantrac_various30_20topics.csv
Evaluation:
  Average Precision:
  Average Recall:
  Average F1:

As expected, the precision value has decreased and the recall value has increased. Over time, we should still expect the F1 score to increase.

Lessons Learned

This capstone project was definitely an eye-opening experience for all of us. We had never done this type of work in any of our courses from past semesters. Because of this, we feel that we learned a lot of lessons and gained a lot of experience.

While all of our group members had previous experience working in a team, none of us had ever had to coordinate with another, separate team before. Overall, we felt that there was a good deal of miscommunication between our group and the IDEAL Pages group. Throughout the semester, we were under the impression that some of our project goals overlapped with their project goals. However, this was not the case. In hindsight, we should have made our objectives more clear with the other group and ensured that we had a better understanding of their project goals.
In the second part, the topics in each document are identified, or extracted. In his prototype, Yang used a standard data set, a collection of one thousand New York Times articles, as the search database.

As our CS4624 capstone project, our group was asked to modify Yang's algorithm to search through IDEAL documents in Apache Solr. In order to accomplish this, we set up and became familiar with a Solr instance. Next, we replaced the prototype's database with the Yahoo Search API to understand how the algorithm would work with a live search engine. Then we indexed a set of IDEAL documents into Solr and replaced the Yahoo Search API with Solr. However, the number of documents we had previously indexed was far too small. In the end, we used Yang's Wikipedia collection in Solr instead. This collection has approximately 4.2 million documents, and counting.

We were unable to connect Xpantrac to the IDEAL collection in Solr. This issue is discussed in detail later, along with a future solution. Therefore, our deliverable is Xpantrac for Yang's Wikipedia collection in Solr, along with an evaluation of the extracted topics.

User's Manual

Command Line
In the command prompt, the user must navigate to Xpantrac's project directory.
Before running the Xpantrac script, the user must ensure there is a document named "0.txt" in that project directory. This document will be used as input to Xpantrac.
/project/computePRF1.py - Computes the precision, recall, and F1 score of the extracted topics
/project/xpantrac_various30_10topics.csv - Xpantrac-assigned topics for 30 various articles (10 topics per article)

Xpantrac Explained
Xpantrac is an algorithm that combines Cognitive Informatics with the Vector Space Model to retrieve topics from an input text. The name Xpantrac comes from the Expansion-Extraction approach it takes: expanding the query and eventually extracting the topics. Consider this use case of Xpantrac in the following scenario:

Rachel is a librarian working at a children's library. This library received about 100 short stories, each of which was written by young writers who recently started their literary career. To make these stories accessible online, Rachel decides to organize them based on topic tags. So, she opens a Web browser and enters the URL of the Xpantrac UI. After loading documents that contain the 100 stories, she selects each document to briefly view it, and then extracts suggested topic tags using the UI. After selecting several suggested tags from the Xpantrac UI, and also coming up with additional tags by herself, she enters them as the topic tags representing a story. A library patron, Jason, accesses the library homepage at home and clicks a tag, "Christmas", which lists 5 stories about Christmas. He selects a story that might be appropriate for his 4-year-old daughter, and reads the story to her.
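To ground the Vector Space Model mentioned above, here is a minimal sketch of building a term-document matrix over the derived corpus and scoring terms. This illustrates the general technique only; Yang's actual Topic Selector uses a more refined significance measure:

    from collections import Counter

    def term_doc_matrix(docs):
        # docs: list of token lists, one per document in the derived corpus
        terms = sorted(set(t for doc in docs for t in doc))
        index = dict((t, i) for i, t in enumerate(terms))
        matrix = [[0] * len(docs) for _ in terms]
        for j, doc in enumerate(docs):
            for t, count in Counter(doc).items():
                matrix[index[t]][j] = count   # rows = terms, columns = documents
        return terms, matrix

    def top_terms(terms, matrix, k=10):
        # Score each term by its total weight across all documents (a simple
        # stand-in for the real Topic Selector's significance measure).
        scores = [(sum(row), t) for t, row in zip(terms, matrix)]
        return [t for _, t in sorted(scores, reverse=True)[:k]]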
We would first like to thank Seungwon Yang for taking the time out of his busy schedule at George Mason University to help our group better understand the Xpantrac algorithm and the goals for this capstone project.

We would also like to mention Mohamed Magdy and the IDEAL Pages group, consisting of Mustafa Aly and Gasper Gulotta, for their contributions to the initial part of our project. The IDEAL Pages project goal was to index the IDEAL documents into Solr.

Lastly, we would like to thank Dr. Edward Fox for presenting us with the opportunity to work on and improve this project for our capstone class, and the National Science Foundation (NSF) for supporting the Integrated Digital Event Archiving and Library (IDEAL) organization.

References

Yang, Seungwon. Automatic Identification of Topic Tags from Texts Based on Expansion-Extraction Approach. Diss. Virginia Polytechnic Institute and State University, 2013. 230 pages. <http://hdl.handle.net/10919/25111>
Therefore, this script had a few more authorization lines than normal. Figure 17 shows the necessary authorization and query information.

    if query != "":
        try:
            if yahoo_api_type == "web":
                url = "http://yboss.yahooapis.com/ysearch/web?q=" + query
            else:
                url = "http://yboss.yahooapis.com/ysearch/news?q=" + query

            consumer = oauth2.Consumer(key=OAUTH_CONSUMER_KEY, secret=OAUTH_CONSUMER_SECRET)
            params = {
                "oauth_version": "1.0",
                "oauth_nonce": oauth2.generate_nonce(),
                "oauth_timestamp": int(time.time()),
            }
            oauth_request = oauth2.Request(method="GET", url=url, parameters=params)
            oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, None)
            oauth_header = oauth_request.to_header(realm="yahooapis.com")

            # Get search results
            http = httplib2.Http()
            resp, content = http.request(url, "GET", headers=oauth_header)
            print resp
            print content
            results = simplejson.loads(content)

Figure 17: Authorization and query information for the Yahoo Search API

Output
See Figure 18 for instructions on how to run the Xpantrac for Yahoo script in the command prompt. This figure also shows the list of topics (the output) for each document processed.

PS C:\Users\sloan_000\Desktop\project> python ./Xpantrac_yahooWeb.py
1399317485.37

========== Document ID: 0 is being processed ==========
10 Topics: station people attack news railway xinhua china train group knife
29.3789999485 seconds, 1399317514.75

========== Document ID: 1 is being processed ==========
How to Setup Apache Solr

Download
In order to set up Solr, you need to have the latest Java JRE installed on your system. At the time of this writing, the current version of Java (Java 8) is fully compatible with Apache Solr, but previous versions can be used if desired. Once the latest Java is installed, you can download Apache Solr.

Starting the Server
Once Solr is downloaded, you can run the server in its template form by navigating to <solr download>/example. From here, running "java -jar start.jar" starts the server. You can then navigate to http://localhost:8983/solr/. If the server is successfully started, you should be able to see the administrator page. The figure below shows the command to start the server and what a developer should see when initializing the server.

dcabrera@DMBP example$ ls
README.txt   example-DIH          lib        resources   solr-webapp
contexts     example-schemaless   logs       scripts     start.jar
etc          exampledocs          multicore  solr        webapps
dcabrera@DMBP example$ java -jar start.jar
0 [main] INFO org.eclipse.jetty.server.Server - jetty-8.1.10.v20130312
    # input text
    # Control inputs
    for p_size in [20, 15, 10, 5, 1]:
        for u_size in [20, 15, 10, 5, 2, 1]:    # group 5 words together
            for a_return in [50, 10, 5, 1]:     # ask Solr to return 10 matching documents

Figure 23: "u_size" represents the query unit size and "a_size" represents the API return size. This can be found in the "main" function.

Connecting with IDEAL in the Future
In the future, Xpantrac should connect to the IDEAL collection in Solr. This collection can be found at http://nick.dlib.vt.edu:8080/solr/#/collection1/query. While this collection does contain a "content" field, it does not meet the specifications of our project at this time.

The IDEAL Pages group was given a different specification to use for the content of their Solr collection. Their group was instructed to collect the entire content of an HTML page. This means that all of the text in the <body> of an HTML page is put into their "content" field. Figure 24 shows an example of such a content field:

    "content": "Google Newsvar GLOBAL window window function function d a
    this t this tick function a c b b void 0 b b new Date getTime this t a b c
    this tick start null a var a new d GLOBAL window jstiming Timer d load a
    if GLOBAL window performance && GLOBAL window ...
When running the script, the only parameter needed is the path to the root directory where the files are located. The full path to each deleted file is printed as it is removed.

Indexing Documents into Solr

Attempting to use the IDEAL Pages Script
As mentioned before, the IDEAL Pages group's goal was to index IDEAL documents into Solr. Our group also needed to do this in order to later use IDEAL documents with Xpantrac. After speaking with our professor and primary contacts, our groups were asked to work together. The IDEAL Pages group would supply the Xpantrac group with the script to index documents into Solr, and the Xpantrac group would manually index the documents until that script was created.

When the IDEAL Pages script was finally received, it would not run with our Solr instance. Our group spent a lot of time trying to fix the script and get it to run with our instance. The IDEAL Pages group was also unable to help. Eventually, we realized that we would rather spend time manually indexing the files into Solr instead of trying to fix a script that might never work for us.

Manually Indexing Documents into Solr
Initially, we had 50 text documents from CNN that were supposed to be indexed into Solr (see Figure 9). These documents would represent documents from the IDEAL collection. However, Solr needed those documents to be in XML format (see Figure 10).

After forest fires and drought, now rains torment Southern California
    