
CS 5604 Information Retrieval and Storage, Spring 2015


Contents

1. Install Mahout 0.8:
- Download Mahout 0.8 from https://archive.apache.org/dist/mahout/0.8/.
- Download mahout-distribution-0.8.tar.gz and mahout-distribution-0.8.pom, and rename the .pom file to pom.xml.
- Use mvn to compile: mvn clean && mvn compile && mvn -DskipTests install.
Follow tutorial [17] again. For each tweet we obtain the tweet ID, the tweet content, a score for each class, and the final result (the class it belongs to, marked with "=>"), as shown in Figure 16.
[Fig. 16 (excerpt): per-tweet output listing the tweet ID, the tweet text, scores for the classes apparel, art, camera, event, health, home, and tech, and the predicted class.]
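A consolidated sketch of these installation steps (the archive URL and distribution filenames are taken from the bullets above; wget and the directory layout are assumptions):

    # download the Mahout 0.8 distribution and its POM
    wget https://archive.apache.org/dist/mahout/0.8/mahout-distribution-0.8.tar.gz
    wget https://archive.apache.org/dist/mahout/0.8/mahout-distribution-0.8.pom
    tar -xzf mahout-distribution-0.8.tar.gz
    cd mahout-distribution-0.8
    cp ../mahout-distribution-0.8.pom pom.xml    # Maven expects the file to be named pom.xml
    # build, skipping the test suite
    mvn clean && mvn compile && mvn -DskipTests install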
2. d. We tried to upload the tweets together with their webpages to Solr. We discussed with our TA how to upload them and considered two approaches: (1) use post.jar, which still requires preprocessing to merge the webpages and tweets; or (2) write our own script (Python/Java/C) to do the preprocessing, connect to Solr through an API library inside the script, and upload the results to Solr. We finally chose the second method, since after many attempts we still could not get post.jar to upload the webpages and tweets at the same time.
e. We chose the solrpy library to connect to Solr from the script, preprocess the data, and upload it to Solr.
f. We first tested the solrpy library with the test code from the solrpy website: create a connection to a Solr server with s = solr.SolrConnection('http://example.org:8083/solr'), build a document as a dict (id, title "Lucene in Action", author "Erik Hatcher, Otis Gospodnetic"), and add it with s.add(doc, commit=True). Here we ran into a problem: this example does not work for our Solr. After searching the internet we found that we should use s.add(_commit=True, **doc) instead of s.add(doc, commit=True). The keys of the dict structure must be included in the Solr schema.
g. Then we did the preprocessing for the metadata. In short, we need to read all the tweets in the z356t.csv file.
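A minimal sketch of the corrected solrpy call described above (server URL and document fields follow the solrpy example, not our actual upload script):

    import solr

    # create a connection to a Solr server
    s = solr.SolrConnection('http://example.org:8083/solr')

    # add a document to the index; every key must be a field declared in Solr's schema.xml
    doc = dict(id=1, title='Lucene in Action', author='Erik Hatcher, Otis Gospodnetic')

    # s.add(doc, commit=True) failed against our Solr;
    # passing the fields as keyword arguments with _commit=True works
    s.add(_commit=True, **doc)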
3. [Fig. 12: Generated tf-idf results (screenshot of the generated tf-idf vector file; the raw key/weight pairs are omitted here).]
4. [Fig. 17 (excerpt): labeled training tweets about suicide-bomb attacks, e.g., a suicide bomb attack with a truck near the Indian consulate in Afghanistan, the attack on the Iranian embassy in Lebanon, the Labour MEP candidate Del Singh killed in a Kabul suicide attack, and a suicide blast in Urumqi, China.]
Fig 17: Example of Training Set. When we are done with labeling, we use the NaiveBayesGenerate.java class to generate the model; the command is as follows (a hedged sketch is given after this paragraph). The naive_bayes_generate example is the class from Pangool, train_lda.txt is the training data we showed in Figure 17, and lda_out_bayes_model is the Naive Bayes model we generated. Then we used the labeled data to test the accuracy of our classifier.
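A hedged sketch of the model-generation command (the example name, jar path, and argument order are assumptions based on how the Pangool examples jar is invoked elsewhere in this report):

    # train the Pangool Naive Bayes model from the labeled training file
    hadoop jar target/pangool-examples-0.71-SNAPSHOT-hadoop.jar naive_bayes_generate \
        train_lda.txt lda_out_bayes_model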
5. but they differ from each other in other characteristics, and these differences offer different advantages or drawbacks in different situations.

Size of data set | Mahout algorithm | Execution model | Characteristics
Small to medium (less than tens of millions of training examples) | Stochastic gradient descent (SGD) family: OnlineLogisticRegression, CrossFoldLearner, AdaptiveLogisticRegression | Sequential, online, incremental | Uses all types of predictor variables; sleek and efficient over the appropriate data range (up to millions of training examples)
Medium to large (millions to hundreds of millions of training examples) | Support Vector Machine (SVM) | Sequential | Experimental; still sleek and efficient over the appropriate data range
 | Naive Bayes | Parallel | Strongly prefers text-like data; medium to high overhead for training; effective and useful for data sets too large for SGD or SVM
 | Complementary Naive Bayes | Parallel | Somewhat more expensive to train than Naive Bayes; effective and useful for data sets too large for SGD, but has similar limitations to Naive Bayes
Small to medium (less than tens of millions of training examples) | Random forests | Parallel | Uses all types of predictor variables; high overhead for training; not widely used yet; costly, but offers complex and interesting classifications; handles nonlinear and conditional relationships in data better than other techniques
Table 2: Characteristics of the Mahout learning algorithms used for classification [16]
6. upload['mcbposttime'] = pdate[i]; psolr.add(_commit=True, **upload)  # upload to Solr (tail of the upload script shown elsewhere in this report)
h. Figure 6 shows that we have successfully uploaded the tweets and their corresponding webpages to Solr.
[Fig. 6: Tweets and webpages uploaded to Solr (Solr query screenshot).]
2.4 Load Webpages of Small Collection to HDFS
a. Download the new CSV file and Python script from the TA. Hint: use pip to install BeautifulSoup4 and requests, which are used in the script; otherwise the script cannot run. (A sketch of the install and upload commands follows this list.)
b. Use the script to process the new CSV file. The generated long URLs are shown in Figure 7.
c. Install Nutch, following the Nutch tutorial provided by the TA. Because we do not need Nutch to upload the new webpages to Solr, we need to modify the crawl script in Nutch.
d. We comment out the code after "note that the link inversion - indexing routine can be done within the main loop on a per segment basis" until the end of the script, in order to prevent this code from uploading data to Solr.
e. Use Nutch to crawl the webpages and upload them to HDFS. We put the URL seeds we got from the script into the directory urls and then run the crawl command. The screenshot after Nutch finished crawling the webpages is shown in Figure 8.
[Fig. 7: Long URLs extracted from the new tweets.]
[Fig. 8: Nutch finished crawling the webpages.]
f. After we get the webpage data from Nutch, we uploaded it to HDFS for the Reducing Noise team. We put the pages under the directory user/cs5604s15_class/pages.
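A minimal sketch of the helper commands mentioned above (the HDFS target path is the class directory named in step f; the local Nutch output directory name is an assumption):

    # install the Python packages the TA's script depends on
    pip install beautifulsoup4 requests

    # after Nutch has finished crawling, copy the fetched pages into HDFS for the Reducing Noise team
    hadoop fs -mkdir /user/cs5604s15_class/pages
    hadoop fs -put crawl/segments /user/cs5604s15_class/pages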
8. For each tweet, we extract the URL. Next, we check whether the URL is one of those selected for their high frequency. Then we form a dictionary structure that includes the ID, the date, and the corresponding webpage as content. Finally, we upload the dictionary structure to Solr. Below is the code with annotations (reconstructed from the report; the field delimiter used when splitting each CSV row is not legible in the original and is marked as an assumption):

    import solr

    # create a connection to a Solr server
    psolr = solr.SolrConnection('http://localhost:8983/solr')   # connect with Solr

    DELIM = ','   # field delimiter in z356t.csv; not legible in the original report

    # read the tweets
    f = open('z356t.csv', 'rb')
    text = [line.decode('utf-8').strip() for line in f.readlines()]
    f.close()
    pid, pdate, pcontent, pwebpage, pname = [], [], [], [], []
    for row in text:
        t = row.split(DELIM)
        pcontent.append(t[0])
        pname.append(t[1])
        pid.append(t[2])
        pdate.append(t[3])

    # read the mapping between full URLs and their abbreviated versions
    f = open('short_origURLsMapping_z356t.txt', 'rb')
    text = [line.decode('utf-8').strip() for line in f.readlines()]
    f.close()
    maap = dict()
    i = 1
    for row in text:
        t = row.split()              # split the full-version URL and its corresponding abbreviated versions
        q = t[1].split()
        for o in q:
            maap[o] = i
        i = i + 1

    # read the downloaded webpages (1.txt ... 12.txt); index 0 is a placeholder
    page = ['blank']
    for j in range(1, 13):
        s = str(j) + '.txt'
        f = open(s, 'rb')
        page.append(f.read().decode('utf-8').strip())
        f.close()

    # for each tweet, find the corresponding webpage number and upload to Solr
    for i in range(0, len(pid)):
        t = -1
        print(i)
        for j in maap.keys():
            if pcontent[i].find(j) != -1:
                t = maap[j]          # find the corresponding webpage number
        if t == 9:
            t = 8
        if t > -1:
            print('upload ' + str(t))
            upload = dict()
            upload['id'] = pid[i]
            upload['mcbcontent'] = page[t]
            upload['mcbposttime'] = pdate[i]
            psolr.add(_commit=True, **upload)    # upload to Solr
9. Figure 11: Print-out message from sequence file to vectors
Figure 12: Generated tf-idf results
Figure 13: Generated word count results
Figure 14: Confusion Matrix and Accuracy Statistics from Naive Bayes classification
Figure 15: Error message when generating classification labels for new unlabeled test data
Figure 16: Using Mahout Classification Model and Additional Program to Predict Class Labels of New Unlabeled Test Data
Figure 17: Example of Training Set

List of Tables
Table 1: Comparison between Mahout and non-Mahout approaches
Table 2: Characteristics of the Mahout learning algorithms used for classification
Table 3: Options of Mahout feature vector generation
Table 4: Results of Small Collections of Tweets
Table 5: Results of Large Collections of Tweets
Table 6: Results of Small Collections of Webpages
Table 7: Results of Large Collections of Webpages
10. Negative: 13, 12 (final row of the confusion matrix for the preceding test).
Test 8: Using the Test 7 model, test on the test set. The command is as follows. The overall accuracy is 65%, and the confusion matrix is as follows:
            Positive  Negative
Positive        11        0
Negative         6        0
We concluded that the poor accuracy on webpages here is mainly due to the fact that we did not use the cleaned webpages; the original webpages are very noisy.
3.4 Using Mahout Naive Bayes Classifier for Tweet Small Collections from Other Teams
We used the training tweet datasets from other groups and applied the Mahout Naive Bayes classification algorithm. If the results were not good, we applied feature selection to revise our model. Accuracy (%) without and with feature selection, on the train and test sets:
Team | No FS: Train | No FS: Test | FS: Train | FS: Test
LDA | 99 | 95 | — | —
Hadoop | 95 | 16 | 99 | 81
NER | 100 | 100 | — | —
Reducing Noise | 99 | 93 | — | —
3.5 Generate Class Label for New Data Using Mahout Naive Bayes Model
We are able to use Mahout to train and test our classification model. However, the test data provided to Mahout has to be labeled. The next problem is how to apply the classification model to new, unlabeled test data; Mahout does not provide any function to label new unlabeled test data, and the classification model generated by Mahout is in binary format. We found an article [17] about applying the Naive Bayes classifier to new unlabeled datasets.
11. but the methods for solving routing, filtering, and text classification are essentially the same. Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely machine-learning-based text classification; it is the approach we focus on in our project. In machine learning, the set of rules or, more generally, the decision criterion of the text classifier is learned automatically from training data. This approach is also called statistical text classification if the learning method is statistical. In statistical text classification, we require a number of good example documents (or training documents) for each class. The need for manual classification is not eliminated, because the training documents come from a person who has labeled them, where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules: almost anybody can look at a document and decide whether or not it is related to China. Sometimes such labeling is already implicitly part of an existing workflow; for instance, the user may go through the news articles returned by a standing query each morning and give relevance feedback by moving the relevant articles to a special folder like "multicore processors".
1.2 Feature Selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
12. data CSV file: mcbtwitter, mcbcontent, mcbmark, mcbusername, mcbid, mcbcode, mcblanguage, mcbdevice, mcbpicture, mcbposttime, ...
Fig 3: Data CSV file modification example.
b. Change schema.xml to define properties of the fields. The schema declares what kinds of fields there are, which field should be used as the unique primary key, which fields are required, and how to index and search each field. For each new field we add name, type, indexed, and stored attributes. Name is mandatory and is the name of the field. Type is mandatory and is the name of a previously defined type from the types section. Indexed is set to true if the field should be indexed. Stored is set to true if the field should be retrievable. Figure 4 shows what we have added to the existing schema.xml:
<field name="mcbtwitter" type="string" indexed="true" stored="true"/>
<field name="mcbcontent" type="string" indexed="true" stored="true"/>
<field name="mcbmark" type="string" indexed="true" stored="true"/>
<field name="mcbusername" type="text_general" indexed="true" stored="true"/>
<field name="mcbid" type="string" indexed="true" stored="true"/>
<field name="mcbcode" type="string" indexed="true" stored="true"/>
<field name="mcblanguage" type="string" indexed="true" stored="true"/>
<field name="mcbdevice" type="string" indexed="true" stored="true"/>
<field name="mcbpicture" type="string" indexed="true" stored="true"/>
<field name="mcbposttime" type="string" indexed="true" stored="true"/>
13. [Fig. 5: Import CSV file to Solr (screenshot of an indexed tweet document, e.g., mcbcontent "RT ...: Cartoonists around the world respond to attack in Paris ... #CharlieHebdo", together with its mcbusername, id, mcbdevice, and mcbpicture fields).]
2.3 Upload Tweets and Webpages of Small Collection to Solr
a. Download our small collection. We use the accident data z356t.csv as our small collection.
b. Download the script from the TA. We use tweet_URL_archivingFile.py to process our z356t.csv file. It extracts the URLs of all the tweets, calculates the frequency of each URL, and outputs a sorted list of URLs with their frequencies in the file shortURLs_z356t.txt. At this point we can set a threshold and take the URLs whose frequency is higher than the threshold. tweet_URL_archivingFile.py then translates the abbreviated URLs to their complete versions and outputs a translation table containing the full-version URLs with their corresponding abbreviated versions in the file short_origURLsMapping_z356t.txt. The script also downloads the real webpage for each of these URLs and saves it as a .txt file.
c. We can see that with our threshold we only get 12 webpages; together they appear 952 times.
14. ...java-ee-sdk-6u3-jdk-6u29-downloads-523388.html (tail of the Oracle JDK download URL from the previous step).
- Uncompress jdk-6u35-linux-x64.bin.
- Copy the directory jdk1.6.0_35 to /usr/lib.
- Set up the environment variables (e.g., JAVA_HOME and PATH).
- Test the installation of Java.
Install Maven:
- Download from https://maven.apache.org/download.cgi.
- Since version 3.3.1 is not compatible with Java 1.6, we changed to version 3.2.1.
Install Hadoop 1.1.1:
- Follow the instructions from http://6341986.blog.51cto.com/6331986/1143596: download Hadoop 1.1.1, unzip, and rename the directory.
- Modify conf/core-site.xml and add:
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
  </configuration>
- Modify conf/hdfs-site.xml:
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>
- Modify conf/mapred-site.xml:
  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
  </configuration>
- In order for Hadoop to be able to find Java, add the path of the JDK at the end of conf/hadoop-env.sh.
- Format HDFS, start Hadoop, and check the start-up progress with jps. (A sketch of these commands follows this list.)
- Check the installation of Hadoop: open http://localhost:50030 for the status of MapReduce, and http://localhost:50070 for the status of HDFS.
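A minimal sketch of the format/start/verify commands for a single-node Hadoop 1.1.1 setup (run from the unpacked Hadoop home directory):

    # format the HDFS namenode (run once)
    bin/hadoop namenode -format

    # start the HDFS and MapReduce daemons
    bin/start-all.sh

    # verify that the daemons (NameNode, DataNode, JobTracker, TaskTracker) are running
    jps

    # web UIs: http://localhost:50030 (MapReduce) and http://localhost:50070 (HDFS)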
15. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachno-centric, has no information about a class, say China, but all instances of arachno-centric happen to occur in China documents in our training set. Then the learning method might produce a classifier that wrongly assigns test documents containing arachno-centric to China. Such an incorrect generalization from an accidental property of the training set is called overfitting.
We combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by

tf-idf(t,d) = tf(t,d) × idf(t).

In other words, tf-idf(t,d) assigns to term t a weight in document d that is (a) highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents); (b) lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced frequency signal); and (c) lowest when the term occurs in virtually all documents.
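A small illustrative sketch of this weighting (not code from the report; it uses the standard log-scaled idf from the textbook [15] on made-up documents):

    import math
    from collections import Counter

    docs = [
        "suicide bomb attack in kabul",
        "discount offer on spring dresses",
        "bomb attack near the embassy",
    ]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    # document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))

    def tf_idf(term, doc_tokens):
        tf = doc_tokens.count(term)          # term frequency in this document
        idf = math.log10(N / df[term])       # inverse document frequency
        return tf * idf

    print(tf_idf("kabul", tokenized[0]))     # higher weight: "kabul" occurs in only one document
    print(tf_idf("attack", tokenized[0]))    # lower weight: "attack" occurs in two of the three documents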
16. We split the labeled data into a train set and a test set (80% and 20%). The train set was used in the last step; the test set was used without its labels. We used the model and the NaiveBayesClassifier.java class to classify the testing data, compared the manually assigned labels with the predicted labels, and calculated the accuracy. The command is shown below (a hedged sketch follows this paragraph): we used the naive_bayes_classifier class with our model lda_out_bayes_model/p to classify test_lda.txt and output the labeled data to out-classify-lda. We conducted 5-fold cross-validation to make sure that our model works well for this text classification task; we show our summary of accuracy in Section IV.4. We used this classifier to classify new data: small tweets, small webpages, large tweets, and large webpages. However, the input and output files we used and generated are plain text, whereas we need to read cleaned data from the Reducing Noise team, which is in AVRO format, and our output file also has to be in AVRO format in order for the Hadoop team to upload it to HBase. We give our thanks to Jose Cadena from the Hadoop team, who helped us modify the code so that we can read the AVRO file and generate an AVRO file with the correct schema. The command for labeling new data is shown below as well: we used our new JAR file to classify the new data, which had already been cleaned by the Reducing Noise team, and generated the AVRO file that could be uploaded to HBase.
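A hedged sketch of the classification command (the example name, jar path, and argument order are assumptions; the model, input, and output paths are the ones named above):

    # label the test data with the trained Pangool Naive Bayes model
    hadoop jar target/pangool-examples-0.71-SNAPSHOT-hadoop.jar naive_bayes_classifier \
        lda_out_bayes_model/p test_lda.txt out-classify-lda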
17. something scalable like Mahout is needed.

System size (number of examples) | Choice of classification approach
< 100,000 | Traditional, non-Mahout approaches should work very well; Mahout may even be slower for training.
100,000 to 1 million | Mahout begins to be a good choice; the flexible API may make Mahout a preferred choice even though there is no performance advantage.
1 million to 10 million | Mahout is an excellent choice in this range.
> 10 million | Mahout excels where others fail.
Table 1: Comparison between the Mahout and non-Mahout approaches [16]

The reason Mahout has an advantage with larger data sets is that, as input data increases, the time or memory requirements for training may not increase linearly in a non-scalable system. A system that slows by a factor of 2 with twice the data may be acceptable, but if 5 times as much input data results in the system taking 100 times as long to run, another solution must be found. This is the sort of situation in which Mahout shines. In general, the classification algorithms in Mahout require resources that increase no faster than the number of training or test examples, and in most cases the computing resources required can be parallelized. This allows you to trade off the number of computers used against the time the problem takes to solve. The main advantage of Mahout is its robust handling of extremely large and growing data sets. The algorithms in Mahout all share scalability.
18. the following command. We can copy the results from HDFS to the local filesystem; the result should then be readable with a file reader by running the command. However, on our cluster we came across the error shown in Figure 15. We are still working on this issue; we hope to generate labels for new unlabeled test tweets once we solve this problem.
[Fig. 15 (excerpt of the job log): the error "MapReduceClassifier$ClassifierMap not found" (a class-not-found failure in the map task), repeated for each attempt.]
We discussed this problem with Mohamed and Sunshin. We also tried to come up with other possible solutions for classification label prediction, such as using Python and the Hadoop streaming API, using Apache Spark and MLlib, or using other classification algorithms in Mahout. However, we found that the only difference between our trial and the tutorial [17] was the version of Hadoop: the tutorial [17] used Hadoop 1.1.1 and our cluster used Hadoop 2.5. Thus we tried installing Hadoop 1.1.1 on our own machine, and this actually solved the problem. Our current progress is that we are able to predict labels for new unlabeled test data with the trained Mahout Naive Bayes classification model under Hadoop 1.1.1. We describe our efforts on Hadoop 1.1.1 installation and label prediction as follows.
- Install Java 1.6
  - Download from http://www.oracle.com/technetwork/java/javaee/downloads/java-ee-sdk-6u3-jdk-6u29-downloads-523388.html
19. the vocabulary we use for classification, and n_d is the number of such tokens in d.
1.4 Vector Space Classification
Chapter 14 in the textbook gives an introduction to vector space classification. Each document is represented as a vector, and each component corresponds to a term or word. Terms are the axes of the vector space, so vector spaces have high dimensionality. Generally, vectors are normalized to unit length. Chapter 14 covers two vector space classification methods, Rocchio and kNN. Rocchio divides the vector space into regions centered on centroids (or prototypes); kNN assigns the majority class of the k nearest neighbors to a test document. The chapter discusses the difference between linear and nonlinear classifiers and illustrates how to apply two-class classifiers to problems with more than two classes. Rocchio classification uses standard tf-idf weighted vectors to represent text documents. For the training documents in each category it computes the centroid of the members of each class, and it assigns a test document to the category with the closest centroid, based on cosine similarity. The centroid of a class is computed as the vector average, or center of mass, of its members. The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids. However, Rocchio does worse than the Naive Bayes classifier in many cases; one reason is that Rocchio cannot handle nonconvex, multimodal classes. kNN classification assigns each test document the majority class of its k nearest neighbors in the training set; a small sketch of both decision rules is given below.
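An illustrative sketch of the two decision rules on toy vectors (not code from the report; the documents are assumed to be tf-idf weighted and unit-normalized):

    import numpy as np
    from collections import Counter

    # toy document vectors and their class labels
    X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
    X_train = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)   # unit-normalize
    y_train = np.array(["china", "china", "uk", "uk"])

    def rocchio_predict(x):
        # centroid (vector average) of each class, then pick the closest centroid by cosine similarity
        centroids = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
        sims = {c: np.dot(x, mu) / (np.linalg.norm(x) * np.linalg.norm(mu))
                for c, mu in centroids.items()}
        return max(sims, key=sims.get)

    def knn_predict(x, k=3):
        # majority class among the k nearest neighbors (cosine similarity)
        sims = X_train @ x / np.linalg.norm(x)
        nearest = np.argsort(-sims)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]

    x_test = np.array([0.7, 0.3])
    print(rocchio_predict(x_test), knn_predict(x_test))   # both print "china" for this toy point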
20. use the above-mentioned command to transform the binary vector file into a readable file. This takes any plain text files in the directory inputdirectory and converts them to sequence files. We specify the UTF-8 character set to make sure that every file inside the input directory is processed using the same character set, rather than specifying individual character sets. These sequence files must next be converted into vector files that Mahout can then run any number of algorithms on. To create these vector files you will need to run the command below (a hedged sketch follows this chunk). The -nv flag means that we use named vectors, so that the data files will be easier to inspect by hand. The -x 90 flag means that any word that appears in more than 90% of the files is treated as a stop word (i.e., a word that is too common, such as the articles "the", "an", "a", etc.) and is filtered out. Figure 11 shows the printed message when we have successfully generated the vectors, Figure 12 shows the generated tf-idf results, and Figure 13 shows the generated word-count results.
[Fig. 11: Print-out message from sequence file to vectors (MapReduce job counters: bytes read and written, map and reduce tasks launched, etc.).]
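A hedged sketch of the two Mahout commands this step refers to (the directory names are placeholders; the flags are the ones discussed above):

    # convert the plain-text documents into a <Text,Text> SequenceFile
    mahout seqdirectory -i inputdirectory -o output-seqdir -c UTF-8

    # convert the SequenceFile into tf-idf feature vectors
    #   -nv   : named vectors, easier to inspect by hand
    #   -x 90 : drop terms that occur in more than 90% of the documents
    mahout seq2sparse -i output-seqdir -o output-vectors -nv -x 90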
21. Hadoop | Jan. 25 | 73 | 214M | 18.6s
Solr | Election | 86 | 298M | 34s
Reducing Noise | Charlie Hebdo | 85 | 64M | 8.4s
NER | Storm | 100 | — | —
Table 4: Results of Small Collections of Tweets (Team | Collection Topic | Average Accuracy (%) | Filesize | Runtime)
Since the performance on the Hadoop team's collection was not good, we looked into their data and found that it contains many Arabic words; we think that is the reason the accuracy is poor. We found that the accuracy on the NER team's data is 100%. We looked into their data and found that their negative tweets were selected from other collections. We did no further work on their data, since the trained classifier would not be able to classify their collection. Our results for large collections of tweets are shown as:
Team | Collection Topic | Average Accuracy (%) | Filesize | Runtime
Classification | Airlines | 78 | 270M | 32s
Hadoop | Egypt | 81 | — | 136s
Reducing Noise | Shooting | 73 | — | 112s
LDA | Bomb | 75 | — | 110s
Table 5: Results of Large Collections of Tweets
Our testing files came from the cleaned data produced by the Reducing Noise team, thus we could not work on all collections. Our results for small collections of webpages are shown as:
Team | Collection Topic | Average Accuracy (%) | Filesize | Runtime
NER | Storm | 83 | 65M | 4.7s
Classification | Plane Crash | 87 | 24M | —
Table 6: Results of Small Collections of Webpages
We can see that the sizes of the webpage collections are not necessarily larger than the sizes of the tweet collections. This is because we just extracted the text clean field of the webpages.
22. Fig 16: Using Mahout Classification Model and Additional Program to Predict Class Labels of New Unlabeled Test Data.
We can work on the small collections using our own Hadoop 1.1.1 and Mahout 0.8 now; however, we would run out of space for the large collections. We then moved on to a new method that can be used to predict class labels for new data.
3.6 Using Pangool to Predict Class Labels for New Data
We found another Apache Hadoop implementation library, Pangool [19]. Pangool is a Java, low-level MapReduce API. It aims to be a replacement for the Hadoop Java MapReduce API. By implementing an intermediate Tuple-based schema and configuring the Job conveniently, many of the accidental complexities that arise from using the Hadoop Java MapReduce API disappear. Things like secondary sort and reduce-side joins become extremely easy to implement and understand. Pangool's performance is comparable to that of the Hadoop Java MapReduce API. Pangool also augments Hadoop's API by making multiple outputs and inputs first-class and allowing instance-based configuration. The reason we used Pangool is that it is compatible with all kinds of Hadoop versions; this library does not have the issue we came across with Mahout. Pangool provides an implementation of Naive Bayes. Pangool has these features:
- Easier MapReduce development
- Support for Tuples instead of just key/value pairs
- Secondary sorting as easy as it can get
- Built-in reduce-side joining capabilities
23. [Fig. 16 (excerpt): additional per-tweet class scores and predicted labels for discount/coupon tweets (e.g., "=> art", "=> event", "=> health").]
25. [Fig. 13: Generated word count results (screenshot of the word-count output file).]
The reason we show the word count here is that the row number of a word in the word-count table is the key number used in the tf-idf table. There are 5,999 words in the word-count table, and the tf-idf key numbers vary from 1 to 5,999. Up to now we have successfully generated the sequence file and the vectors. The next step is to label the small collection and apply the Mahout classification algorithms.
3.2 Commands for Classification with Mahout Naive Bayes Algorithm
We have two folders for tweets: one folder contains 100 text files of positive tweets that are related to the topic, and the other contains 100 text files of negative tweets that are not related to the topic. All tweets were randomly selected, using Python, from the small collection we were given.
26. Trying to use tf-idf values instead of word counts in the feature vector, and using more representative features in feature selection, may also improve the accuracy.
VII Acknowledgements
We would like to thank our instructor, Dr. Edward A. Fox, who brought us into this interesting project. We would like to thank our TAs, Mohamed Magdy and Sunshin Lee, for their continued support and valuable suggestions throughout the project. We also give special thanks to Jose Cadena from the Hadoop team, who helped us with input and output formatting problems. Further, we thank the Reducing Noise team, which provided cleaned tweets and webpages for us to work on. Finally, thanks go to the support of NSF grant IIS-1319578: III: Small: Integrated Digital Event Archiving and Library (IDEAL).
VIII References
[1] Apache Mahout: Scalable machine learning for everyone. https://www.ibm.com/developerworks/library/j-mahout-scaling/, accessed on 02/12/2015.
[2] Twenty Newsgroups Classification Example. https://mahout.apache.org/users/classification/twenty-newsgroups.html, accessed on 02/12/2015.
[3] Creating vectors from text. http://mahout.apache.org/users/basics/creating-vectors-from-text.html, accessed on 02/12/2015.
[4] Classifying with random forests. http://mahout.apache.org/users/classification/partial-implementation.html, accessed on 02/12/2015.
[5] Mahout 1.0 Features by Engine. http://mahout.apache.org/users/basics/algorithms.html, accessed on 02/12/2015.
27. Virginia Tech
CS 5604: Information Retrieval and Storage, Spring 2015
Final Project Report: Feature Extraction & Selection, Classification
Project Members: Xuewen Cui, Rongrong Tao, Ruide Zhang
Project Advisor: Dr. Edward A. Fox
05/06/2015
Virginia Tech, Blacksburg, Virginia 24061 USA

Executive Summary
Given the tweets from the instructor and cleaned webpages from the Reducing Noise team, the planned tasks for our group were to find the best (1) way to extract information that will be used for document representation, (2) feature selection method to construct feature vectors, and (3) way to classify each document into categories, considering the ontology developed in the IDEAL project. We have figured out an information extraction method for document representation, a feature selection method for feature vector construction, and a classification method. The categories will be associated with the documents to aid searching and browsing using Solr. Our team handles both tweets and webpages. The tweets and webpages come in the form of text files that have been produced by the Reducing Noise team. The other input is a list of the specific events that the collections are about. We are able to construct feature vectors after information extraction and feature selection using Apache Mahout. For each document, a relational version of the raw data for an appropriate feature vector is generated. We applied the Naive Bayes classification algorithm.
28. [6] Yoon, S., Elhadad, N., & Bakken, S. (2013). A Practical Approach for Content Mining of Tweets. American Journal of Preventive Medicine, 45(1), 122-129. doi:10.1016/j.amepre.2013.02.025.
[7] Cheng, Z., Caverlee, J., & Lee, K. (2010, October). You are where you tweet: a content-based approach to geo-locating Twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (pp. 759-768). ACM.
[8] Naive Bayes. http://scikit-learn.org/stable/modules/naive_bayes.html, accessed on 02/05/2015.
[9] Random Forest. http://scikit-learn.org/stable/modules/ensemble.html#random-forests, accessed on 02/05/2015.
[10] Cross-validation. http://scikit-learn.org/stable/modules/cross_validation.html, accessed on 02/05/2015.
[11] Update CSV. https://wiki.apache.org/solr/UpdateCSV, accessed on 02/05/2015.
[12] Write schema.xml. http://www.solrtutorial.com/schema-xml.html, accessed on 02/05/2015.
[13] NLTK. http://www.nltk.org/, accessed on 02/05/2015.
[14] scikit-learn. http://scikit-learn.org/stable/, accessed on 02/05/2015.
[15] An Introduction to Information Retrieval, written by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze.
[16] Mahout in Action, written by Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman.
[17] Using the Mahout Naive Bayes classifiers to automatically classify Twitter messages
29. The algorithms differ somewhat in the overhead (or cost) of training, in the size of the data set for which they are most efficient, and in the complexity of the analyses they can deliver. Stochastic gradient descent (SGD) is a widely used learning algorithm in which each training example is used to tweak the model slightly so that it gives a more correct answer for that one example. This incremental approach is repeated over many training examples. With some special tricks to decide how much to nudge the model, the model accurately classifies new data after seeing only a modest number of examples. An experimental sequential implementation of the support vector machine (SVM) algorithm has recently been added to Mahout. The behavior of the SVM algorithm will likely be similar to SGD, in that the implementation is sequential and the training speed for large data sizes will probably be somewhat slower than SGD. The Mahout SVM implementation will likely share the input flexibility and linear scaling of SGD, and thus will probably be a better choice than Naive Bayes for moderate-scale projects [17]. The Naive Bayes and complementary Naive Bayes algorithms in Mahout are parallelized algorithms that can be applied to larger datasets than are practical with SGD-based algorithms. Because they can work effectively on multiple machines at once, these algorithms will scale to much larger training data sets than will the SGD-based algorithms.
30. The Mahout implementation of Naive Bayes, however, is restricted to classification based on a single text-like variable. For many problems, including typical large data problems, this requirement isn't a problem. But if continuous variables are needed and they can't be quantized into word-like objects that could be lumped in with other text data, it may not be possible to use the Naive Bayes family of algorithms. In addition, if the data has more than one categorical, word-like, or text-like variable, it is possible to concatenate the variables together, disambiguating them by prefixing them in an unambiguous way. This approach may lose important distinctions, because the statistics of all the words and categories get lumped together. Most text classification problems, however, should work well with Naive Bayes or complementary Naive Bayes. Mahout has sequential and parallel implementations of the random forests algorithm as well. This algorithm trains an enormous number of simple classifiers and uses a voting scheme to get a single result. The Mahout parallel implementation trains the many classifiers in the model in parallel. This approach has somewhat unusual scaling properties: because each small classifier is trained on some of the features of all of the training examples, the memory required on each node in the cluster will scale roughly in proportion to the square root of the number of training examples.
31. We finally decided to use another MapReduce Naive Bayes package, Pangool [19], which was able to generate Naive Bayes classifiers and predict class labels for new data. We finished prediction of small collections of tweets for our team, the LDA team, the Reducing Noise team, the Solr team, and the Hadoop team. We finished prediction of large collections of tweets for our team, the LDA team, the Reducing Noise team, and the Hadoop team. We finished prediction of small collections of webpages for our team and the NER team. We finished prediction of large collections of webpages for the Hadoop team and the Clustering team. The accuracy of our Naive Bayes model was validated using 5-fold cross-validation. Overall, the accuracy is satisfactory but can still be improved, and the running time for the predictions is reasonable.
VII Future Work
For performance improvement, future researchers should consider using larger training sets. Currently, due to the time limit, we were only able to label one hundred positive samples and one hundred negative samples for each collection to be used as a training set. Future researchers can still work on developing a new package to label new data with the model generated by Apache Mahout. Future researchers can also work on modifying the prediction package provided in Learning Apache Mahout Classification [20] to make it compatible with higher versions of Hadoop, not just Hadoop 1.1.1. Additional work is needed on the Pangool package [19] to build more accurate classifiers.
32. In this project we had other teams label their own small collections of tweets. For the small webpages, large tweets, and large webpages, we manually labeled some of the topics in order to compare the performance.
4 Evaluation
4.1 Cross Validation
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples it has just seen would have a perfect score, but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice, when performing a supervised machine learning experiment, to hold out part of the available data as a test set. However, by partitioning the available data into sets, we drastically reduce the number of samples that can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches exist, but generally follow the same principles). The following procedure is followed for each of the k folds (a small sketch is given below):
- A model is trained using k-1 of the folds as training data.
- The resulting model is validated on the remaining part of the data.
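An illustrative sketch of k-fold CV using scikit-learn [14] (not the evaluation code we ran; the classifier and data are placeholders, and the current scikit-learn API is used here — older releases placed KFold in sklearn.cross_validation):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    X = np.random.randint(0, 5, size=(100, 20))   # placeholder term-count features
    y = np.random.randint(0, 2, size=100)         # placeholder binary labels

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = MultinomialNB().fit(X[train_idx], y[train_idx])               # train on k-1 folds
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # validate on the held-out fold

    print(sum(scores) / len(scores))   # the reported accuracy is the average over the folds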
33. [Fig. 16 (excerpt, continued): remaining per-tweet class scores and predicted labels (e.g., "=> health", "=> art", "=> apparel", "=> tech").]
34. 3 Naive Bayes Classification
3.1 Transform Tweets to Sequence File and Feature Vector on Our Own Machine
In this section we use Python and Mahout to transform our small-collection CSV file into a sequence file and vectors, so that we can use the Mahout command line to get the word-count and tf-idf vector files for our small collection. Mahout has utilities to generate vectors from a directory of text documents. Before creating the vectors, we first convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary key/value pairs into it. The document vectorizer requires the key to be a text with a unique document ID, and the value to be the text content in UTF-8 format. After searching online we still could not figure out an approach to convert the CSV-format tweets into a sequence file directly. The only solution we found was on VTechWorks (https://vtechworks.lib.vt.edu/bitstream/handle/10919/47945/Mahout_Tutorial.pdf?sequence=11). In order to transform the CSV-format tweets into a sequence file, we developed another way: we first use Python to read the CSV file and store each row in a separate text file, with its row number as the file name, and we put these text files into one directory. The Python code (truncated here in the original) begins:

    import string
    f = open('z356t.csv', 'rb')
    text = [line.decode('utf-8').strip() for line in f.readlines()]

A completed sketch follows.
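A minimal completed sketch of this CSV-to-text-files step (the output directory name is an assumption; the original snippet is truncated above):

    import os

    # read the small-collection CSV; one tweet per row (Python 3)
    with open('z356t.csv', 'rb') as f:
        rows = [line.decode('utf-8').strip() for line in f]

    # write each row to its own text file, named by its row number, so that
    # `mahout seqdirectory` can later turn the whole directory into a SequenceFile
    outdir = 'tweets_txt'   # assumed output directory name
    os.makedirs(outdir, exist_ok=True)
    for rownum, row in enumerate(rows):
        with open(os.path.join(outdir, '{}.txt'.format(rownum)), 'w', encoding='utf-8') as out:
            out.write(row)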
35. ...mentioned above.
3 Tools and Packages
Mahout is an open-source machine learning library from Apache. The algorithms it implements fall under the broad umbrella of machine learning, or collective intelligence. This can mean many things, but at the moment, for Mahout, it means primarily recommender engines (collaborative filtering), clustering, and classification. It is a Java library; it does not provide a user interface, a prepackaged server, or an installer. It is a framework of tools intended to be used and adapted by developers. It is also scalable: Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable machine learning implementations in Mahout are written in Java, and some portions are built upon Apache's Hadoop distributed computation project. Mahout supports stochastic gradient descent (SGD), a widely used learning algorithm in which each training example is used to tweak the model slightly so that it gives a more correct answer for that one example. An experimental sequential implementation of the support vector machine (SVM) algorithm has recently been added to Mahout. The behavior of the SVM algorithm will likely be similar to SGD, in that the implementation is sequential and the training speed for large data sizes will probably be somewhat slower than SGD. The Mahout SVM implementation will likely share the input flexibility and linear scaling of SGD.
36. [Fig. 2: Cloudera Virtual Machine Installation (VirtualBox Manager screenshot of the cloudera-quickstart-vm-5.1.0-1-virtualbox image: Red Hat 64-bit, 4096 MB base memory, 62.50 GB vmdk disk, NAT network adapter).]
9. Click start to start the virtual machine OS.
Install Mahout
Prerequisites:
1. Java JDK 1.8
2. Apache Maven
Check out the sources from the Mahout GitHub repository.
Compiling: compile Mahout using standard Maven commands, e.g., mvn clean compile.
2.2 Data Import
2.2.1 Approach
We renamed the test data CSV file to books.csv, put it at example/exampledocs/books.csv, and uploaded it to the Solr example server. We used HTTP POST to send the CSV data over the network to the Solr server.
2.2.2 Steps
a. Modify the data CSV file. We added a row describing each column at the beginning of the data CSV file. Figure 3 shows what we have added to the existing data CSV file.
37. Our results for large collections of webpages are shown as:
Team | Collection Topic | Average Accuracy (%) | Filesize | Runtime
Hadoop | Egypt | 79 | 140M | 36s
Clustering | Diabetes | 77 | 305M | 40s
Table 7: Results of Large Collections of Webpages
We also tried to work on the large collection of webpages for the Solr team. However, both the small and the large webpage collections for the Solr team seem to have fewer than 200 documents.
V Timeline / Schedule
This table shows our schedule for this project.
Week 2: Environment setup (Java, Solr, Python); data import into Solr; test the Python library scikit-learn and preprocess the data for single-node testing.
Week 3: Environment setup (Cloudera, Hadoop, Mahout); test the Mahout classification tools (Random Forest and Naive Bayes) for the cluster; draft the final report; test creating vectors from text using Mahout.
Week 5: (1) Use the TA's script to play with the small collection. (2) Download the corresponding webpage for each tweet and process the URLs, then select tweets with high frequency and upload them to Solr. (3) Finish documenting the schema that describes the specific output our team will produce for each of the webpages.
Week 7: Use Mahout to do feature selection for our example data (aircraft accidents): (1) converting a directory of documents to SequenceFile format; (2) creating vectors from the SequenceFile; (3) creating normalized tf-idf vectors from a directory of text documents; (4) converting existing vectors to Mahout's format.
38. (use java -version to check the Java version).
2. Download the Apache Ant binary distribution 1.9.4 from http://ant.apache.org and export the environment variable ANT_HOME. Use ant -version to check.
3. Download the Apache Solr distribution source code and extract it to C:\solr.
4. Navigate to the Solr directory and use ant compile to compile Solr. Use the ant ivy-bootstrap command to install ivy.jar before building Solr.
5. Solr is successfully installed. Alternatively, you can directly download the binary version of Solr and then launch it. A screenshot of the Solr client webpage is shown in Figure 1.
[Fig. 1: Solr Admin dashboard showing a query result; the returned document includes fields such as mcbcontent ("RT ...: R.I.P. Uncle Bernard, columnist at @charliehebdo, among 12 victims in Paris attack ..."), mcbusername, id, mcbcode, mcblanguage, mcbdevice, and mcbpicture.]
39. The resulting model is validated on the remaining part of the data, i.e., it is used as a test set to compute a performance measure such as accuracy. The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but it does not waste too much data (as is the case when fixing an arbitrary test set), which is a major advantage in problems such as inverse inference where the number of samples is very small. KFold divides all the samples into k groups of samples, called folds, of equal sizes. The prediction function is learned using k-1 folds, and the fold left out is used for testing.
4.2 Summary of Results
To summarize our classification results, we list the collection we worked on, the average accuracy over the 5 folds of cross-validation, the size of the new file used for prediction, and the runtime of the prediction. We can see that the accuracy is generally good. Although some of the new files are quite large, the runtime is still acceptable. We noticed that the runtime does not increase linearly with respect to the file size; this is because the number of mappers involved differed among the programs, so for a larger file the program will use more computing resources. Our results for small collections of tweets are shown as:
Team | Collection Topic | Average Accuracy (%) | Filesize | Runtime
Classification | Plane Crash | 92 | 90M | 13s
LDA | Suicide Bomb Attack | 89 | 13M | 8.6s
40. - Performance and flexibility
- Configuration by object instance instead of classes
- First-class multiple inputs & outputs
- Built-in serialization support for Thrift and ProtoStuff
- 100% Hadoop compatibility (0.20.X, 1.X, 2.X, and YARN)
We download Pangool using git clone https://github.com/datasalt/pangool.git. To install Pangool we use Maven; the Maven version we used here is 3.3.1 and the Java version is 1.7.0_55. After installing Maven we simply use the command mvn clean install. The Naive Bayes implementation is under the directory pangool-examples/src/main/java/com/datasalt/pangool/examples/naivebayes. It has two classes: NaiveBayesGenerate.java, which is used to generate the Naive Bayes classifier model, and NaiveBayesClassifier.java, which uses the model we generated to predict and label the new data. We used the Naive Bayes example under target/pangool-examples-0.71-SNAPSHOT-hadoop.jar. Now we give an example of the classification process. First we label our data to generate the training set. For tweets we add POSITIVE before the tweets that are related to the topic and NEGATIVE before the tweets that are not related to the topic; the label and the content of the tweet are separated by a TAB. An example of the training set is shown in Figure 17 (an illustrative sketch of the format follows this chunk).
[Fig. 17 (excerpt): the training file (Train LDA.tsv) with labeled tweets about suicide-bomb attacks, e.g., a suicide bomb attack with a truck near the Indian consulate in Afghanistan.]
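An illustrative sketch of the training-set format described above (the tweets are made-up examples; each line is a label, a TAB character, and the tweet text):

    POSITIVE	Suicide bomb attack near the Indian consulate in Afghanistan kills several
    POSITIVE	Labour MEP candidate killed in Kabul suicide attack
    NEGATIVE	Save on spring summer dresses with this discount code
    NEGATIVE	Weekly deal: 50% off floor fans this weekend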
41. hyperplane with maximized margin, with the support vectors on the two boundaries. If we have the training set, this problem becomes a quadratic optimization problem. Most of the time we have noise data that we have to ignore, because we want to build a hyperplane that is far away from all the data points; if we do not ignore these noise points, we may get a hyperplane with a very small margin, or we may not be able to build a hyperplane at all. So in this case SVM also allows some noise data to be misclassified. To use SVMs for classification, given a new point x we can score its projection onto the hyperplane normal. We can also set a threshold t: if the score is greater than t, answer yes; if the score is less than -t, answer no; otherwise, answer "don't know". This solution works well for datasets that are linearly separable, but sometimes the dataset is too hard to separate. SVM also handles these datasets: what SVM does is define a mapping function to transform the data from a low dimension to a higher dimension, to make the data linearly separable in the high dimension; this is called the kernel trick. So instead of a complicated computation, we can use kernels to stand for the inner product, which makes our calculation easier. In summary, SVM chooses the hyperplane based on support vectors. It is a powerful and elegant way to define a similarity metric; based on our evaluation results, it is perhaps the best performing text classifier.
2 Papers
Paper [6] takes advantage of data mining techniques
42. ing Python from the small collection we are given. Then we uploaded the tweets to the cluster. On the cluster, we followed the example from [2] and performed a complete Naive Bayes classification. The step-by-step commands are listed below (typical invocations are sketched below):
- Create a working directory for the dataset and all input/output.
- Convert the full dataset into a <Text, Text> SequenceFile.
- Convert and preprocess the dataset into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.
- Split the preprocessed dataset into training and testing sets.
- Train the classifier.
- Test the classifier.

A confusion matrix and the overall accuracy will be printed on the screen. Figure 14 is an example of the summary:

    15:42:48 INFO driver.MahoutDriver: Program took 17145 ms (Minutes: 0.28575)

Fig 14 Confusion Matrix and Accuracy Statistics from Naive Bayes classification

3.3 Applying Mahout Naive Bayes Classifier for Our Tweet Small Collection

Test 1: Tweets classification, no feature selection, test on train set. Commands are as follows. The overall accuracy is 96% and the confusion matrix is as follows:

                Positive   Negative
    Positive    86         3
    Negative    4          95

Test 2: Using Test 1, test on test set. Command is as follows. The overall accuracy is 70% and the confusion matrix is as follows:

                Positive   Negative
    Positive    7          3
    Negative    0          0

Test 3: Tweets classificati
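The actual commands appear only as screenshots in the original report. For reference, the same workflow in the standard Mahout example usually looks like the following; the working-directory path and names here are placeholders, not the report's actual ones, and the exact flags should be checked against the installed Mahout version:

    export WORK_DIR=/user/cs5604/tweets                      # placeholder working directory
    mahout seqdirectory -i ${WORK_DIR}/raw -o ${WORK_DIR}/seq -ow
    mahout seq2sparse -i ${WORK_DIR}/seq -o ${WORK_DIR}/vectors -lnorm -nv -wt tfidf
    mahout split -i ${WORK_DIR}/vectors/tfidf-vectors \
        --trainingOutput ${WORK_DIR}/train-vectors --testOutput ${WORK_DIR}/test-vectors \
        --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
    mahout trainnb -i ${WORK_DIR}/train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
    mahout testnb -i ${WORK_DIR}/test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/results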
43. ing existing vectors to Mahout's format.

Week 8: 1. Upload webpages to HDFS. 2. Further investigate Mahout classification.
Week 9: 1. Finished using Nutch to crawl webpages. 2. Apply classification for tweets and webpages. 3. Compare performance between without feature selection and with feature selection.
Week 10: Apply classification on tweet datasets provided by the LDA, RN, NER and Hadoop teams.
Week 11: Work on methods to generate classification labels for unlabeled test data. Work on methods to apply classification to help searching in Solr.
Week 12: Able to predict classification labels for new unlabeled test data using Hadoop 1.1.1.
Week 13: Produce classification models and label predictions for tweets and webpages, for both small collections and large collections.

VI Conclusion

In this project our task is to classify tweet and webpage collections into pre-defined classes in order to help the Solr search engine. We reviewed existing techniques and attempted to apply the Apache Mahout Naive Bayes classification algorithm at the beginning. The Apache Mahout Naive Bayes classification algorithm was able to provide accurate classifiers; however, it cannot predict labels for new data using the model. Learning Apache Mahout Classification [20] provided a solution package to predict class labels for new data using the classifiers generated by Mahout. However, this package was only able to work on Hadoop 1.1.1, and it was not compatible with our cluster, whi
44. ing term frequencies for each document. Split the preprocessed dataset into training and testing sets. Train the classifier. Test the classifier.

2.2 Predict Class Label for New Data Using Mahout Naive Bayes Classifier

Note: this method only works for Hadoop 1.1.1. Get the scripts and Java programs. Compile them. Upload the training data to HDFS. Run the MapReduce job. Copy the results from HDFS to the local file system. Read the result with a file reader.

2.3 Pangool MapReduce Naive Bayes Classification and Class Label Prediction

Download Pangool. Install it. NaiveBayesGenerate.java is the class used to generate the Naive Bayes classifier model; NaiveBayesClassifier.java is the class that uses the generated model to predict and label the new data. We used the Naive Bayes example under target/pangool-examples-0.71-SNAPSHOT-hadoop.jar. Train the classifier. Test the classifier. Use the modified classifier to handle input and output files in AVRO format. (A sketch of the likely command shape is given below.)

IV Developer Manual

1 Algorithms

1.1 Classification Algorithms in Mahout

Mahout can be used on a wide range of classification projects, but the advantage of Mahout over other approaches becomes striking as the number of training examples gets extremely large. What "large" means can vary enormously. Up to about 100,000 examples, other classification systems can be efficient and accurate. But generally, as the input exceeds 1 to 10 million training examples
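The report lists these Pangool steps without the exact invocations. A hedged sketch of their likely shape follows; the argument order, the HDFS paths, and the assumption that the two classes are run directly by their fully qualified names (rather than through a driver that takes an example name) are guesses made for illustration, not taken from the report:

    # generate the Naive Bayes model from the labeled training TSV (paths are placeholders)
    hadoop jar target/pangool-examples-0.71-SNAPSHOT-hadoop.jar \
        com.datasalt.pangool.examples.naivebayes.NaiveBayesGenerate \
        /user/cs5604/train/Train_LDA.tsv /user/cs5604/nb-model

    # label new, unlabeled tweets with the generated model
    hadoop jar target/pangool-examples-0.71-SNAPSHOT-hadoop.jar \
        com.datasalt.pangool.examples.naivebayes.NaiveBayesClassifier \
        /user/cs5604/nb-model /user/cs5604/new-tweets.tsv /user/cs5604/nb-output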
45. interpreted as k-nearest-neighbor classification. To classify a document into a class, we need to find the k nearest neighbors of the document, count the number of documents among those k nearest neighbors that belong to the class, estimate the probability that the document belongs to the class, and choose the majority class. One problem here is how to choose the value of k. Using only the closest example (1NN) to determine the class is subject to errors, since there exists noise in the category label of a single training example. The more robust way is to find the k most similar examples and return the majority category of these k examples. The value of k is typically odd to avoid ties; however, we can also break ties randomly. 3 and 5 are the most common values used for k, but large values from 50 to 100 are also used.

The nearest neighbor method depends on a similarity or distance metric. The simplest for a continuous m-dimensional instance space is Euclidean distance. The simplest for an m-dimensional binary instance space is Hamming distance, which counts the number of feature values that differ. Cosine similarity of TF-IDF weighted vectors is typically most effective for text classification (a small sketch follows below). Feature selection and training are not necessary for kNN classification. kNN also scales well with a large number of classes; however, the scores can be hard to convert to probabilities.

Chapter 14 also introduces the bias-variance tradeoff. Bias is the squared difference between
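A minimal, purely illustrative sketch of kNN with cosine similarity over tf-idf vectors (the report does not actually implement kNN); it assumes NumPy, a dense matrix train_vecs of training tf-idf vectors, and a parallel list train_labels:

    import numpy as np
    from collections import Counter

    def knn_classify(doc_vec, train_vecs, train_labels, k=5):
        """Classify one tf-idf vector by majority vote among its k nearest neighbors."""
        # cosine similarity between the document and every training vector
        sims = (train_vecs @ doc_vec) / (
            np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc_vec) + 1e-12)
        nearest = np.argsort(sims)[-k:]                     # indices of the k most similar documents
        votes = Counter(train_labels[i] for i in nearest)   # class counts among the neighbors
        return votes.most_common(1)[0][0]                   # majority class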
46. Fig 11 Print out message from sequence file to vectors

[Screenshot: outputtfidf.txt opened in a text editor on the Cloudera VM, showing the generated tf-idf vectors; each line lists (term index, tf-idf weight) pairs for one document.]
47. l likely share the input flexibility and linear scaling of SGD, and thus will probably be a better choice than Naive Bayes for moderate-scale projects. The Naive Bayes and complementary Naive Bayes algorithms in Mahout are parallelized algorithms that can be applied to larger datasets than are practical with SGD-based algorithms. Because they can work effectively on multiple machines at once, these algorithms will scale to much larger training data sets than will the SGD-based algorithms.

Mahout also has sequential and parallel implementations of the random forests algorithm. This algorithm trains an enormous number of simple classifiers and uses a voting scheme to get a single result. The Mahout parallel implementation trains the many classifiers in the model in parallel.

Pangool [19] is a framework on top of Hadoop that implements Tuple MapReduce. Pangool is a Java low-level MapReduce API. It aims to be a replacement for the Hadoop Java MapReduce API. By implementing an intermediate Tuple-based schema and configuring a Job conveniently, many of the accidental complexities that arise from using the Hadoop Java MapReduce API disappear. Things like secondary sort and reduce-side joins become extremely easy to implement and understand. Pangool's performance is comparable to that of the Hadoop Java MapReduce API. Pangool also augments Hadoop's API by making multiple outputs and inputs first-class and allowing instance-based configuration. It provides an impleme
48. ll classify based on a simple linear combination of the features. Such classifiers partition the space of features into regions separated by linear decision hyperplanes. Many common text classifiers are linear classifiers, such as Naive Bayes, Rocchio, logistic regression, support vector machines with linear kernel, and linear regression. If there exists a hyperplane that perfectly separates the two classes, then we call the two classes linearly separable.

Classification with more than two classes has two methods: any-of classification and one-of classification. When classes are not mutually exclusive, a document can belong to none, exactly one, or more than one class, and the classes are independent of each other; this is called any-of classification. When classes are mutually exclusive, each document can belong to exactly one of the classes; this is called one-of classification. The difference is that when solving an any-of classification task with linear classifiers, the decision of one classifier has no influence on the decisions of the other classifiers, while when solving a one-of classification task with linear classifiers, we assign the document to the class with the maximum score, or the maximum confidence value, or the maximum probability. We commonly use a confusion matrix to evaluate the performance, which shows, for each pair of classes, how many documents from one class are incorrectly assigned to the other class. In summary, when choosing which clas
49. m in Apache Mahout to generate the vector file and the trained model. The classification algorithm uses the feature vectors, which go into classifiers for training and testing, and this works with Mahout. However, Mahout is not able to predict class labels for new data. Finally we came to a solution provided by Pangool.net [19], which is a Java low-level MapReduce API. This package provides us a MapReduce Naive Bayes classifier that can predict class labels for new data. After modification, this package is able to read in and output to Avro files in HDFS. The correctness of our classification algorithms, measured using 5-fold cross validation, was promising.

Table of Contents
[front-matter entry illegible] ... 5
[entry illegible] ... 7
II Literature Review ... 7
1.1 What is Classification ... 7
1.2 Feature Selection ... 8
1.3 Naive Bayes Classification ... 8
1.4 Vector Space Classification ... 9
1.5 Support Vector Machine ... 10
2 Papers ... 11
3 Tools and Packages ... 12
III User Manual ... 13
1 Attachment Description ...
50. method is also recommended in Learning Apache Mahout Classification [20]. We tried to follow the tutorial and apply the new classifier. First of all, we got the scripts and Java programs used in the tutorials. Then we wanted to compile the Java programs; however, our system did not have Maven, so we downloaded and installed the latest version of Maven, which is Maven 3.3.1, and exported JAVA_HOME. We can use this command to check whether Maven is successfully installed; the messages we get are as follows, and we can see that Maven is successfully installed.

We continue to compile the code by running the command as follows. We got some JAR files in the target directory after running the above command, and we can use these files to generate labels for new data sets.

To repeat the same procedure as described in the tutorial, we tried to use the same dataset. To get the tweets we use the commands below. We changed the tutorial's twitter fetcher Python script with our new consumer keys/secrets and access token key/secrets to use the API to download tweets. The tweet files contain a list of tweets in a tab-separated-value format: the first field is the tweet ID, followed by the tweet message, which is identical to our previous format.

We want to use this method to generate labels for new tweets. We have the tweet-to-classify .tsv file in our data directory. We upload it to HDFS by using the following command. We run the MapReduce job by using
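The commands themselves are screenshots in the original report. A hedged reconstruction of their likely shape is below; the jar name, main class, file name, and HDFS paths are placeholders, and only the generic Maven and Hadoop filesystem commands are standard:

    mvn -version                                       # check that Maven is installed
    mvn clean package                                  # compile; the JAR files land in target/
    hadoop fs -put tweets-to-classify.tsv /user/cs5604/data/          # upload the new tweets to HDFS
    hadoop jar target/TUTORIAL-CLASSIFIER.jar SOME.MainClass \
        /user/cs5604/data/tweets-to-classify.tsv /user/cs5604/output  # run the MapReduce job
    hadoop fs -get /user/cs5604/output ./output        # copy the results back to the local file system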
51. I Introduction

Our team aims to classify the provided tweet collections and webpage collections into pre-defined classes, which ultimately can help the Solr search engine. The Reducing Noise team provided us the cleaned tweets and webpages in HDFS for us to begin with. At first we were recommended to make use of Mahout, which is an open-source machine learning library. For the text classification task, Mahout is able to help us encode the features and then create vectors out of the features. It also provides techniques to set up training and testing sets. Specifically, Mahout can convert the raw text files into Hadoop's SequenceFile format, convert the SequenceFile entries into sparse vectors and modify the labels, split the input data into training and testing sets, and run the built-in classifiers to train and test. Existing classification algorithms provided in Mahout include:
1. Stochastic gradient descent (SGD): OnlineLogisticRegression, CrossFoldLearner, AdaptiveLogisticRegression
2. Support Vector Machine (SVM)
3. Naive Bayes
4. Complementary Naive Bayes
5. Random Forests

We tried our collections with the Naive Bayes classification algorithm, since it is simple and very suitable for text classification tasks. However, we found that the Mahout Naive Bayes classification algorithm is not able to predict class labels for new data. This means that we can only generate Naive Bayes classifiers, but we are not able to label ne
52. ncy relevance signal; c) lowest when the term occurs in virtually all documents. At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by the previous equation (reconstructed below). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.

1.3 Naive Bayes Classification

The first supervised learning method introduced is the multinomial Naive Bayes, or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as

    P(c|d) ∝ P(c) · ∏_{1 ≤ k ≤ n_d} P(t_k|c)

where P(t_k|c) is the conditional probability of term t_k occurring in a document of class c. We interpret P(t_k|c) as a measure of how much evidence t_k contributes that c is the correct class. P(c) is the prior probability of a document occurring in class c. If a document's terms do not provide clear evidence for one class versus another, we choose the one that has a higher prior probability. t_1, t_2, ..., t_{n_d} are the tokens in d that are part of
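The "previous equation" referred to above does not survive in this copy of the report. In the textbook's standard notation it is the tf-idf weight; the following is a reconstruction from that standard form, not copied from the report:

    \mathrm{tf\mbox{-}idf}_{t,d} \;=\; \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}

where tf_{t,d} is the number of occurrences of term t in document d, df_t is the number of documents that contain t, and N is the total number of documents in the collection.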
53. ntation of the M/R Naive Bayes classification algorithm.

III User Manual

1 Attachment Description

- MRClassify-master: a package that can use the Naive Bayes model we trained using Mahout to classify new unlabeled data. This package works fine with Hadoop 1.1.1, but it is not compatible with Hadoop 2.5.
- generate.py: generates an individual text file for each tweet in the CSV file.
- mr-naivebayes.jar: the MapReduce Naive Bayes classifier provided by Pangool [19]. It can generate a Naive Bayes classifier and label new data. It is modified for our project to read in and write to Avro format.
  - NaiveBayesGenerate.java: this class generates the Naive Bayes classifier; it can be modified for classifiers that have better performance.
  - NaiveBayesClassifier.java: this class labels new data using the generated classifier; it can be modified to use a different scoring technique for new data.
- print_webpage.py: this script generates plain text from an AVRO file for webpages.
- tweet_shortToLongURL_File.py: this script is provided by the TA.
- tweet_URL_archivingFile.py: this script is provided by the TA and can be used to generate seed URLs for webpages to be crawled using Nutch.

2 Usage of Package

2.1 Generate Mahout Naive Bayes Classification Model

Create a working directory for the dataset and all input/output. Convert the full dataset into a <Text, Text> SequenceFile. Convert and preprocess the dataset into a <Text, VectorWritable> SequenceFile contain
54. ny other query, except that it is periodically executed on a collection to which new documents are incrementally added over time. If the standing query is just "multicore AND computer AND chip", the user will tend to miss many relevant new articles which use other terms, such as "multicore processors". To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, the user might end up with a query like "(multicore OR multi-core) AND (chip OR processor OR microprocessor)".

To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class a given object belongs to. In the example, the standing query serves to divide new newswire articles into the two classes: documents about multicore computer chips, and documents not about multicore computer chips. We refer to this as two-class classification.

A class need not be as narrowly focused as the standing query "multicore computer chips". Often a class describes a more general subject area like China or coffee. Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. Standing queries and topics differ in their degree of specificity
55. o get metadata from tweets gathered from the Internet; they discover the relationships among tweets. In this paper they separate the method into 5 different steps, which are: 1) selecting keywords to gather an initial set of tweets to analyze; 2) importing data; 3) preparing data; 4) analyzing data (topic, sentiment, and ecologic context); and 5) interpreting data. We find the steps in this paper extremely helpful to our project. We can use similar steps to work with our own CSV file. We can directly get data from others, so we do not need the first step. But when it comes to importing and preparing data, we can apply the method in this paper. The original contents of the tweets are not well prepared for data analysis, so we must stem the data, for example excluding the punctuation and transforming verbs to their original terms. Then we go to the fourth step, which is to analyze the tweets to find features. Finally, this paper uses a method other than machine learning to build up the classification, but in our project we will apply a machine learning algorithm (MLA) to the classification problem.

Similarly, in [7] they apply a model-based method to deal with tweets and obtain geo-locating information purely from the contents of the tweets. They have a similar processing structure to [6]: they also import the data, build a metric to model the data, and draw the conclusion. Our structure for extracting features and classifying tweets should be based on the procedur
56. on using feature selection, test on train set. Commands are as follows. The overall accuracy is 99% and the confusion matrix is as follows:

                Positive   Negative
    Positive    78         1
    Negative    1          98

Test 4: Using Test 3, test on test set. Command is as follows. The overall accuracy is 100% and the confusion matrix is as follows:

                Positive   Negative
    Positive    18         0
    Negative    0          0

We can observe that feature selection for tweets can help improve accuracy.

We use the script from the TA to download some of the webpages. Then we choose some positive (related to the topic) webpages and negative (not related to the topic) webpages manually. At last, we upload these webpages to HDFS and repeat the same procedure as what we did for the tweets.

Test 5: Webpages classification, no feature selection, test on train set. Commands are as follows. The overall accuracy is 100% and the confusion matrix is as follows:

                Positive   Negative
    Positive    55         0
    Negative    0          28

Test 6: Using Test 5, test on test set. Command is as follows. The overall accuracy is 94% and the confusion matrix is as follows:

                Positive   Negative
    Positive    14         0
    Negative    1          2

Test 7: Webpages classification, using feature selection, test on train set. Commands are as follows. The overall accuracy is 84% and the confusion matrix is as follows:

                Positive   Negative
    Positive    58         0
57. or Hadoop 1.1.1.
- A MapReduce Naive Bayes package called Pangool [19], which can be used to generate Naive Bayes classifiers and predict for new data. It is modified to adapt to Avro format in HDFS.
- Evaluation of classifiers.

Here is a list of what we are going to cover in the following sections. Section II gives an overview of related literature. More about the packages used is given in Section II. Section IV chronicles our development efforts: Section IV.1 gives an overview of Apache Mahout for classification; Section IV.2 gives our end-to-end handling of classification in conjunction with a small collection and searching with Solr; Section IV.3.5 gives our ultimate solution for predicting classes using Pangool [19], following discussion earlier in Section IV.3 about attempts to use Mahout; Section IV.4 describes evaluation tests. Section V summarizes our schedule, while Section VI summarizes our efforts and conclusions, leading to Section VII, which gives future plans.

II Literature Review

1 Textbook

1.1 What is Classification

From [15], chapter 13, we learn that many users have ongoing information needs. For example, a user might need to track developments in multi-core computer chips. One method of doing this is to issue the query "multi-core AND computer AND chip" against an index of recent newswire articles each morning. How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like a
58. ription / Examples and Notes

- norm: The norm modifies all vectors by a function that calculates its length. Examples: 1-norm (Manhattan distance), 2-norm (Euclidean distance).
- weight: Calculate the weight of any given feature as either TF-IDF (term frequency, inverse document frequency) or just Term Frequency. TF-IDF is a common weighting scheme in search and machine learning for representing text as vectors.
- maxDFPercent, minSupport: Both of these options drop terms that are either too frequent (maxDFPercent) or not frequent enough (minSupport) across the collection of documents. Useful in automatically dropping common or very infrequent terms that add little value to the calculation.
- analyzerName: An Apache Lucene analyzer class that can be used to tokenize, stem, remove, or otherwise change the words in the document. See Resources to learn more about Lucene.

Table 3: Options of Mahout feature vector generation [16]

The analysis process in Step 2(a) is worth diving into a bit more, given that it is doing much of the heavy lifting needed for feature selection. A Lucene Analyzer is made up of a Tokenizer class and zero or more TokenFilter classes. The Tokenizer is responsible for breaking up the original input into zero or more tokens (such as words); TokenFilter instances are chained together to then modify the tokens produced by the Tokenizer. For example, the Analyzer used in the example:
1. Tokenizes on whitespace, plus a few edge cases for punc
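These options correspond to flags of Mahout's seq2sparse step. A hedged illustration of how they might be set on the command line follows; the values are arbitrary and the flag spellings should be checked against the installed Mahout version:

    # tf-idf weighting, 2-norm, drop terms in >85% of documents or seen fewer than 5 times,
    # and use a standard Lucene analyzer for tokenization
    mahout seq2sparse -i tweets-seq -o tweets-vectors -wt tfidf -n 2 -x 85 -s 5 \
        -a org.apache.lucene.analysis.standard.StandardAnalyzer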
59.
    f.close()
    i = 0
    for row in text:
        t = row.split('\t')
        filename = 'rawtext' + str(i) + '.txt'
        f = open(filename.encode('utf-8'), 'wb')
        print t[0]
        f.write(t[0])
        f.close()
        i = i + 1

After generating the directory, we use the command provided by Mahout to generate the sequence file; it uses MapReduce to transform the plain text into a sequence file. Figure 9 shows the print-out message.

Fig 9 Print out message from text file to sequence file

To check the result of the sequence file, we cannot read it directly because it is stored in binary format. We use Mahout seqdumper to transform this binary file into a readable file; the command is as follows, and the resulting sequence file is shown in Figure 10.

[Screenshot: output.txt opened in gedit on the Cloudera VM, showing the dumped sequence file entries.]

Fig 10 Generated sequence file

From the result we can see that we are on the right track: the key is the file name, which is the row number of the tweet, and the value is the content of the tweet. After that, we try to create the vector file, which needs the word count and tf-idf. We also
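The two Mahout commands referred to above appear only as screenshots. Their usual shape (directory names here are placeholders) is:

    # plain-text directory -> SequenceFile (runs as a MapReduce job)
    mahout seqdirectory -i rawtext -o tweets-seq -ow

    # dump the binary SequenceFile to a readable text file for inspection
    mahout seqdumper -i tweets-seq -o output.txt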
60. s, https://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/, accessed on 04/19/2015.
[18] Integrating the Mahout Bayes classifier with Solr, http://www.oracle.com/technetwork/community/join/member-discounts-523387, accessed on 04/19/2015.
[19] Pangool Package, http://pangool.net/overview.html, accessed on 05/06/2015.
[20] Learning Apache Mahout Classification.
61. [Screenshot: Hadoop MapReduce job counters from the vector-generation step (map/reduce task times, record counts, shuffle statistics, file input/output byte counts); the log ends with "INFO driver.MahoutDriver: Program took 279200 ms (Minutes: 4.653333333333333)".]
62. s isn't quite as good as Naive Bayes, where memory requirements are proportional to the number of unique words seen and thus are approximately proportional to the logarithm of the number of training examples. In return for this less desirable scaling property, random forests models have more power when it comes to problems that are difficult for logistic regression, SVM, or Naive Bayes. Typically these problems require a model to use variable interactions and discretization to handle threshold effects in continuous variables. Simpler models can handle these effects with enough time and effort by developing variable transformations, but random forests can often deal with these problems without that effort.

1.2 Classification Configuration

For classification of text, this primarily means encoding the features and then creating vectors out of the features, but it also includes setting up training and test sets. The complete set of steps taken is:
1. Convert the raw text files into Hadoop's SequenceFile format.
2. Convert the SequenceFile entries into sparse vectors and modify the labels.
3. Split the input into training and test sets.
4. Run the naive Bayes classifier to train and test.

The two main steps worth noting are Step 2 and Step 4. Step 2(a) is the primary feature selection and encoding step, and a number of the input parameters control how the input text will be represented as weights in the vectors.

Option / Desc
63. sification method to use, we need to consider how much training data is available, how simple or complex the problem is, how noisy the problem is, and how stable the problem is over time.

1.5 Support Vector Machine

Chapter 15 gives an introduction to the support vector machine (SVM) classification method. Assume that we have a two-class, linearly separable training set. We want to build a classifier to divide the points into two classes. For a 2D situation the classifier is a line; when it comes to higher dimensions, the decision boundary becomes a hyperplane. Some methods find a separating hyperplane, but not the optimal one. The Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the difficult points close to the decision boundary. That is because, first, if there are no points near the decision surface, then there are no very uncertain classification decisions; secondly, if you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased.

So the main idea of SVMs is that they maximize the margin around the separating hyperplane: the larger the margin we have, the more confidence we can get in our classification. Obviously there should be some points at the boundary; otherwise we could continue to expand the margin to make it larger until it reaches some points. These points are called the support vectors. So our job is to find the
64. ... 13
2 Usage of Package ... 13
2.1 Generate Mahout Naive Bayes Classification Model ... 13
2.2 Predict Class Label for New Data Using Mahout Naive Bayes Classifier ... 14
2.3 Pangool MapReduce Naive Bayes Classification and Class Label Prediction ... 14
IV Developer Manual ... 16
1 Algorithms ... 16
1.1 Classification Algorithms in Mahout ... 16
1.2 Classification Configuration ... 18
2 Environment Setup ... 20
2.1 Installation ... 20
2.2 Data Import ... 22
2.3 Upload Tweets and Webpages of Small Collection to Solr ... 24
2.4 Load Webpages of Small Collection to HDFS ... 27
3 Naive Bayes Classification ... 28
3.1 Transform Tweets to Sequence File and Feature Vector on Our Own Machine ... 28
3.2 Commands for Classification with Mahout Naive Bayes Algorithm ... 32
3.3 Applying Mahout Naive Bayes Classifier for Our Tweet Small Collection ... 33
3.4 Using Mahout Naive Bayes Classifier for Tweet Small Collections from Other H
65. the true conditional probability of a document being in a class and the prediction of the learned classifier, averaged over training sets. Thus bias is large if the learning method produces classifiers that are consistently wrong. Variance is the variation of the prediction of the learned classifier: it is calculated as the average squared difference between the prediction of the learned classifier and its average. Thus variance is large if different training sets give rise to very different classifiers, while variance is small if the training set has a minor effect on the classification decisions. Variance measures how inconsistent the decisions are, not whether they are correct or incorrect.

The bias-variance tradeoff can be summarized as follows: linear methods like Rocchio and Naive Bayes have high bias for nonlinear problems, because they can only model one type of class boundary (a linear hyperplane), and low variance, because most randomly drawn training sets produce similar decision hyperplanes; nonlinear methods like kNN have low bias and high variance. High-variance learning methods are prone to overfitting the training data. Since learning error includes both bias and variance, we know there is no learning method that is optimal across all text classification problems, because there is always a tradeoff between bias and variance.

Chapter 14 also talks about the difference between linear classifiers and nonlinear classifiers. Linear classifiers wi
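Written out in the textbook's notation, with Γ_D(d) the prediction on document d of the classifier learned from training set D, and P(c|d) the true conditional probability, these two quantities are (a reconstruction from the verbal definitions above, not copied from the report):

    \mathrm{bias}(\Gamma)     = E_d \bigl[\, P(c \mid d) - E_D\, \Gamma_D(d) \,\bigr]^2
    \mathrm{variance}(\Gamma) = E_d\, E_D \bigl[\, \Gamma_D(d) - E_D\, \Gamma_D(d) \,\bigr]^2

so that the learning error decomposes into the sum of bias and variance.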
66. [Screenshot: Solr query results showing indexed tweet documents (fields such as mcbtwitter, mcbcontent, mcbusername, mcbposttime, mcbcat) for the CharlieHebdo collection.]

Fig 1 Solr Installation

Install Cloudera VirtualBox VM:
1. Download the latest version of VirtualBox and install it.
2. Download cloudera-quickstart-vm-5.3.x.zip and then unzip it to your VirtualBox machine folder.
3. Run VirtualBox and run "Import Appliance" in the File menu.
4. Find your unzipped Cloudera file and click Open.
5. Click Continue.
6. Click Import.
7. Click Continue.
8. Successfully import th
67. ttime" type="string" indexed="true" stored="true"/>
    <field name="mcbcat" type="string" indexed="true" stored="true"/>

Fig 4 Schema xml modification example

c. Upload the CSV.
d. Restart Solr.

2.2.3 Results

We conducted a query without any keyword, and the result was captured as shown in Figure 5, which shows we have successfully imported books.csv to Solr.

[Screenshot: the Solr Admin query page in Firefox, showing a select query returning numFound=200 documents; the returned documents carry tweet fields such as mcbtwitter, mcbcontent, mcbusername, mcbcode, and id for the CharlieHebdo collection.]
68. tuation. 2. Lowercases all tokens. 3. Converts non-ASCII characters to ASCII, where possible, by converting diacritics and so on. 4. Throws away tokens with more than 40 characters. 5. Removes stop words. 6. Stems the tokens using the Porter stemmer.

The end result of this analysis is a significantly smaller vector for each document, as well as one that has removed common noise words ("the", "a", "an", and the like) that will confuse the classifier. This Analyzer was developed iteratively by looking at examples, processing them through the Analyzer, examining the output, and making judgment calls about how best to proceed.

Step 2(b) does some minor conversions of the data for processing, as well as discarding some content so that the various labels are evenly represented in the training data. Step 4 is where the actual work is done, both to build a model and then to test whether it is valid or not. In Step 4(a), the extractLabels option simply tells Mahout to figure out the training labels from the input; the alternative is to pass them in. The output from this step is a file that can be read via the org.apache.mahout.classifier.naivebayes.NaiveBayesModel class. Step 4(b) takes in the model as well as the test data and checks to see how good a job the training did.

2 Environment Setup

2.1 Installation

Install Solr. Installation progress:
1. Download the Java SE 7 JDK and export environment variables. Ch
69. ... 37
3.5 Generate Class Label for New Data Using Mahout Naive Bayes Model ... 37
3.5 Using Pangool to Predict Class Label for New Data ... 41
4 Evaluation ... 44
4.1 Cross Validation ... 44
4.2 Summary of Results ... 44
V Timeline/Schedule ... 47
VI Conclusion ... 48
VII Future Work ... 49
VII Acknowledgements ... 49
VIII References ... 50

List of Figures
Figure 1 Solr Installation ... 21
Figure 2 Cloudera Virtual Machine Installation ... 22
Figure 3 Data CSV file modification example ... 23
Figure 4 Schema xml modification example ... 23
Figure 5 Import CSV file to Solr ... 24
Figure 6 Tweets and webpages uploaded to Solr ... 24
Figure 7 Long URLs from the new tweets ... 27
Figure 8 Nutch finished crawling the webpages ... 28
Figure 9 Print out message from text file to sequence file ... 29
Figure 10 Generated sequence file ...
70. w data. In order to solve this problem, we looked into available books and online tutorials, and finally found a package from Learning Apache Mahout Classification [20] which could be used to predict class labels for new data using Mahout Naive Bayes classifiers. However, we noticed that this package only works for Hadoop 1.1.1 and is not compatible with our cluster, which is Hadoop 2.5. We tried to modify the code and talked to the TAs; however, we did not successfully adapt this solution to our cluster. Finally we came across another solution provided by Pangool.net [19], which is a Java low-level MapReduce API. This package provides us a MapReduce Naive Bayes classification algorithm, and it can also predict class labels for new data. The most important thing is that this package is compatible with all versions of Hadoop. The package was modified to be able to read in and write to Avro files in HDFS. We used this package to generate Naive Bayes classifiers for the small and large collections of tweets and webpages for the different teams. We showed the accuracy of our generated classifiers using 5-fold cross validation. Our contributions can be summarized as:
- Investigation of classification algorithms provided by Apache Mahout
- Naive Bayes classifiers generated using Apache Mahout
- Prediction of class labels for new data using the package provided in Learning Apache Mahout Classification [20], which only works f
