        WEKA Manual for Version 3-6-8
[Screenshot: Experimenter Setup tab (simple mode), results destination ARFF file, 10-fold cross-validation with 10 repetitions, the iris dataset and J48 -C 0.25 -M 2 configured]

With the Load options... and Save options... buttons one can load and save the setup of a selected classifier from and to XML. This is especially useful for highly configured classifiers (e.g., nested meta-classifiers), where the manual setup takes quite some time, and which are used often.

One can also paste classifier settings here by right-clicking (or Alt+Shift+left-clicking) and selecting the appropriate menu point from the popup menu, to either add a new classifier or replace the selected one with a new setup. This is rather useful for transferring a classifier setup from the Weka Explorer over to the Experimenter without having to set up the classifier from scratch.

5.2.1.7 Saving the setup

For future re-use
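Classifier setups can also be written to and restored from XML through the API; the following is a minimal sketch using weka.core.xml.XMLClassifier (the file name setup.xml is just a placeholder):

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.xml.XMLClassifier;

public class XMLSetupExample {
  public static void main(String[] args) throws Exception {
    // configure a classifier and write its setup to XML
    J48 j48 = new J48();
    j48.setConfidenceFactor(0.25f);
    new XMLClassifier().write("setup.xml", j48);

    // later: restore the classifier setup from the XML file
    Classifier restored =
      (Classifier) new XMLClassifier().read("setup.xml");
    System.out.println(restored.getClass().getName());
  }
}
```

This is the same mechanism the Load options/Save options buttons use, so a setup saved here can be loaded in the GUI and vice versa.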
[Screenshot: Experimenter Setup tab (advanced mode) with the Runs, Distribute experiment, Generator properties and Iteration control panels]

Click Select property and expand splitEvaluator so that the classifier entry is visible in the property list; click Select.

[Screenshot: the "Select a property" dialog with the splitEvaluator node expanded, showing entries such as outputFile, randomizeData, rawOutput, attributeID, predTargetColumn and trainPercent]

The scheme name is displayed in the Generator properties panel.

72 CHAPTER 5. EXPERIMENTER

[Screenshot: Weka Experiment Environment with an InstancesResultListener writing to Experiment1.arff and a RandomSplitResultProducer (train percentage 66.0) using a ClassifierSplitEvaluator]
[Screenshot: ARFF Viewer displaying the heart-h dataset, with a popup menu offering Copy, Search..., Clear search and Delete selected instance]

112 CHAPTER 7. ARFFVIEWER

7.2 Editing

Besides the first column, which is the instance index, all cells in the table are editable. Nominal values can be easily modified via dropdown lists; numeric values are edited directly.

[Screenshot: ARFF Viewer with heart-h.arff loaded, relation hungarian-14-heart-disease]
[Screenshot: Analyse tab, 30 results loaded, Paired T-Tester (corrected) on Percent_correct comparing rules.ZeroR, rules.OneR and trees.J48 -C 0.25 -M 2 at significance 0.05]

5.3 Cluster Experiments

Using the advanced mode of the Experimenter you can now run experiments on clustering algorithms as well as classifiers (note: this is a new feature available with Weka 3.6.0). The main evaluation metric for this type of experiment is the log-likelihood of the clusters found by each clusterer. Here is an example of setting up a cross-validation experiment using clusterers:

Choose CrossValidationResultProducer from the Result generator panel.

[Screenshot: Weka Experiment Environment]
[Screenshot: result producer properties with trainPercent set to 66.0]

Click on rawOutput and select the True entry from the drop-down list. By default, the output is sent to the zip file splitEvaluatorOut.zip. The output file can be changed by clicking on the outputFile panel in the window. Now when the experiment is run, the result of each processing run is archived, as shown below.

[Screenshot: archive listing of splitEvaluatorOut.zip containing one ClassifierSplitEvaluator entry per run for rules.ZeroR and trees.J48 -C 0.25 -M 2 on the iris dataset]
Clicking on the button for the Output format leads to a dialog that lets you choose the precision for the mean and the std. deviations, as well as the format of the output. Checking the Show Average checkbox adds an additional line to the output listing the average of each column. With the Remove filter classnames checkbox one can remove the filter name and options from processed datasets (filter names in Weka can be quite lengthy).

The following formats are supported:

• CSV
• GNUPlot
• HTML
• LaTeX
• Plain text (default)
• Significance only

[Screenshot: Output format dialog with Mean Precision 2, StdDev. Precision 2, Output Format Plain Text, and the Show Average and Remove filter classnames checkboxes]

5.5.2 Saving the Results

The information displayed in the Test output panel is controlled by the currently selected entry in the Result list panel. Clicking on an entry causes the results corresponding to that entry to be displayed.

[Screenshot: Result list panel with entries such as "Available resultsets", "Percent_correct - rules.ZeroR" and "Number_correct - rules.ZeroR"]

The results shown in the Test output panel can be saved to a file by clicking Save output. Only one set of results can be saved at a time, but Weka permits the user to save all re
[Screenshot: Experimenter Setup tab (simple mode) with results destination ARFF file and 10-fold cross-validation with 10 repetitions]

Notes

The advantage of ARFF or CSV files is that they can be created without any additional classes besides the ones from Weka. The drawback is the lack of the ability to resume an experiment that was interrupted, e.g., due to an error or the addition of datasets or algorithms. Especially with time-consuming experiments, this behavior can be annoying.

JDBC database

With JDBC it is easy to store the results in a database. The necessary jar archives have to be in the CLASSPATH to make the JDBC functionality of a particular database available.

After changing ARFF file to JDBC database, click on User... to specify JDBC URL and user credentials for accessing the database.

[Screenshot: Database Connection Parameters dialog with Database URL jdbc:mysql://localhost:3306/weka_test]
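Besides the URL entered in this dialog, Weka reads its JDBC defaults from a DatabaseUtils.props file found on the classpath (or in the user's home directory); a minimal excerpt for a local MySQL setup might look like this (driver class and URL are examples, not requirements):

```
# excerpt from a DatabaseUtils.props for MySQL
jdbcDriver=com.mysql.jdbc.Driver
jdbcURL=jdbc:mysql://localhost:3306/weka_test
```

The corresponding MySQL connector jar then has to be on the CLASSPATH, as noted above.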
[Table of contents fragment: sections 18.6.2 (Serialization of Experiments) through 19.2 (troubleshooting entries such as Weka download problems, Windows, Mac OSX, just-in-time (JIT) compiler, and memory consumption and garbage collection), pages 289-302]

Part I

The Command-line

Chapter 1

A command-line primer

1.1 Introduction

While for initial experiments the included graphical user interface is quite sufficient, for in-depth usage the command line interface is recommended, because it offers some functionality which is not available via the GUI, and uses far less memory. Should you get Out of Memory errors, increase the maximum heap size for your java engine, usually via -Xmx1024M or -Xmx1024m for 1GB (the default setting of 16 to 64MB is usually too small). If you get errors that classes are not found, check your CLASSPATH: does it include weka.jar? You can explicitly set CLASSPATH via the -cp command line option as we
6.4.3 Processing data incrementally

Some classifiers, clusterers and filters in Weka can handle data incrementally, in a streaming fashion. Here is an example of training and testing naive Bayes incrementally. The results are sent to a TextViewer and predictions are plotted by a StripChart component.

[Screenshot: Knowledge Flow layout connecting ArffLoader, ClassAssigner, NaiveBayesUpdateable, IncrementalClassifierEvaluator, TextViewer and StripChart]

• Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will change to a cross-hairs).

• Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area).

• Next specify an ARFF file to load by first right-clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and browse to the location of your ARFF file.

• Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.

• Now connect the ArffLoader to the ClassAssigner: first right-click over the ArffLoader and select the dataSet under Connections in the menu. A rubber-band line will appear. Move the mouse
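The same incremental learning is also available via the API; the following is a minimal sketch, assuming a local iris.arff file, that trains NaiveBayesUpdateable one instance at a time:

```java
import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalNB {
  public static void main(String[] args) throws Exception {
    // load only the header, then stream the instances one by one
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("iris.arff"));  // path is just a placeholder
    Instances structure = loader.getStructure();
    structure.setClassIndex(structure.numAttributes() - 1);

    NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
    nb.buildClassifier(structure);          // initialize on the empty structure
    Instance current;
    while ((current = loader.getNextInstance(structure)) != null)
      nb.updateClassifier(current);         // incremental update per instance
    System.out.println(nb);
  }
}
```

Only classifiers implementing the UpdateableClassifier interface, like NaiveBayesUpdateable, support the updateClassifier call used here.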
Instances train = ... // from somewhere
Instances test = ...  // from somewhere

// filter
Remove rm = new Remove();
rm.setAttributeIndices("1");  // remove 1st attribute

// classifier
J48 j48 = new J48();
j48.setUnpruned(true);  // using an unpruned J48

// meta-classifier
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(rm);
fc.setClassifier(j48);

// train and output model
fc.buildClassifier(train);
System.out.println(fc);

for (int i = 0; i < test.numInstances(); i++) {
  double pred = fc.classifyInstance(test.instance(i));
  double actual = test.instance(i).classValue();
  System.out.print("ID: " + test.instance(i).value(0));
  System.out.print(", actual: " + test.classAttribute().value((int) actual));
  System.out.println(", predicted: " + test.classAttribute().value((int) pred));
}

16.6 Classification

Classification and regression algorithms in WEKA are called "classifiers" and are located below the weka.classifiers package. This section covers the following topics:

• Building a classifier - batch and incremental learning.
• Evaluating a classifier - various evaluation techniques and how to obtain the generated statistics.
• Classifying instances - obtaining classifications for unknown data.

The Weka Examples collection [3] contains example classes covering classification in the wekaexamples.classifiers package.

16.6.1 Building a classifier

By design, all classifiers in WEKA are batch-trainable, i.e., they get trained on the whole datase
A real-world implementation of a stream filter is the MultiFilter class (package weka.filters), which passes the data through all the filters it contains. Depending on whether all the used filters are streamable or not, it acts either as stream filter or as batch filter.

258 CHAPTER 17. EXTENDING WEKA

17.2.2.3 Internals

Some useful methods of the filter classes:

• isNewBatch() - returns true if an instance of the filter was just instantiated or a new batch was started via the batchFinished() method.

• isFirstBatchDone() - returns true as soon as the first batch was finished via the batchFinished() method. Useful for supervised filters, which should not be altered after being trained with the first batch of instances.

17.2.3 Capabilities

Filters implement the weka.core.CapabilitiesHandler interface like the classifiers. This method returns what kind of data the filter is able to process. It needs to be adapted for each individual filter, since the default implementation allows the processing of all kinds of attributes and classes; otherwise correct functioning of the filter cannot be guaranteed. See section "Capabilities" on page 242 for more information.

17.2.4 Packages

A few comments about the different filter sub-packages:

• supervised - contains supervised filters, i.e., filters that take class distributions into account. Must implement the weka.filters.SupervisedFilter interface.

  - attribute - filters t
        "Error generating data:\n" + ex.getMessage(),
        "Error", JOptionPane.ERROR_MESSAGE);
    }
    generator.setRelationName(relName);
  }

• the Use button finally fires a propertyChange event that will load the data into the Explorer:

m_ButtonUse.addActionListener(new ActionListener() {
  public void actionPerformed(ActionEvent evt) {
    m_Support.firePropertyChange("", null, null);
  }
});

272 CHAPTER 17. EXTENDING WEKA

• the propertyChange event will perform the actual loading of the data; hence we add an anonymous property change listener to our panel:

addPropertyChangeListener(new PropertyChangeListener() {
  public void propertyChange(PropertyChangeEvent e) {
    try {
      Instances data = new Instances(new StringReader(m_Output.getText()));
      // set data in preprocess panel (also notifies of capabilities changes)
      getExplorer().getPreprocessPanel().setInstances(data);
    } catch (Exception ex) {
      ex.printStackTrace();
      JOptionPane.showMessageDialog(
        getExplorer(), "Error generating data:\n" + ex.getMessage(),
        "Error", JOptionPane.ERROR_MESSAGE);
    }
  }
});

• In order to add our GeneratorPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this:

Tabs=weka.gui.explorer.GeneratorPanel:standalone,\
     weka.gui.explorer.ClassifierPanel,\
     weka.gui.explorer.ClustererPanel,\
     weka.gui.explorer.Ass
get_wekatechinfo.sh -d . -w ../dist/weka.jar -b > ../tech.txt

(the command is issued from the same directory the Weka build.xml is located in)

http://www.kdd.org/explorations/issues/11-1-2009-07/p2V11n1.pdf

Bash shell script get_wekatechinfo.sh:

#!/bin/bash
#
# This script prints the information stored in TechnicalInformationHandlers
# to stdout.
#
# FracPete, $Revision: 4582 $

# the usage of this script
function usage()
{
   echo
   echo "${0##*/} [-d <dir>] [-w <jar>] [-p|-b] [-h]"
   echo
   echo "Prints the information stored in TechnicalInformationHandlers to stdout."
   echo
   echo " -h   this help"
   echo " -d <dir>"
   echo "      the directory to look for packages, must be the one just above"
   echo "      the 'weka' package, default: $DIR"
   echo " -w <jar>"
   echo "      the weka jar to use, if not in CLASSPATH"
   echo " -p   prints the information in plaintext format"
   echo " -b   prints the information in BibTeX format"
   echo
}

# generates a filename out of the classname TMP and returns it in TMP
# uses the directory in DIR
function class_to_filename()
{
   TMP=$DIR"/"`echo $TMP | sed s/"\."/"\/"/g`".java"
}

# variables
DIR="."
PLAINTEXT="no"
BIBTEX="no"
WEKA=""
TECHINFOHANDLER="weka.core.TechnicalInformationHandler"
TECHINFO="weka.core.TechnicalInformation"
CLASSDISCOVERY="weka.core.ClassDiscovery"

# interpret parameters
while getopts ":hpbw:d:" flag
do
   case $flag in
      p) PLAINTEXT="yes";;
      b) BIBTEX
java weka.filters.supervised.instance.Resample -i data/soybean.arff \
   -o soybean-5%.arff -c last -Z 5
java weka.filters.supervised.instance.Resample -i data/soybean.arff \
   -o soybean-uniform-5%.arff -c last -Z 5 -B 1

StratifiedRemoveFolds creates stratified cross-validation folds of the given dataset. This means that by default the class distributions are approximately retained within each fold. The following example splits soybean.arff into stratified training and test datasets, the latter consisting of 25% (= 1/4) of the data:

java weka.filters.supervised.instance.StratifiedRemoveFolds \
   -i data/soybean.arff -o soybean-train.arff \
   -c last -N 4 -F 1 -V
java weka.filters.supervised.instance.StratifiedRemoveFolds \
   -i data/soybean.arff -o soybean-test.arff \
   -c last -N 4 -F 1

weka.filters.unsupervised

Classes below weka.filters.unsupervised in the class hierarchy are for unsupervised filtering, e.g., the non-stratified version of Resample. A class attribute should not be assigned here.

weka.filters.unsupervised.attribute

StringToWordVector transforms string attributes into word vectors, i.e., creating one attribute for each word which either encodes presence or word count (-C) within the string. -W can be used to set an approximate limit on the number of words. When a class is assigned, the limit applies to each class separately. This filter is useful for text mining.
Obfuscate ren
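The first Resample call above can also be reproduced through the API; a minimal sketch (the dataset path and options mirror the command lines, the class name ResampleExample is made up):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class ResampleExample {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("data/soybean.arff");
    data.setClassIndex(data.numAttributes() - 1);  // -c last

    Resample resample = new Resample();
    resample.setSampleSizePercent(5.0);    // -Z 5
    resample.setBiasToUniformClass(1.0);   // -B 1 (uniform class distribution)
    resample.setInputFormat(data);
    Instances sample = Filter.useFilter(data, resample);
    System.out.println(sample.numInstances() + " instances in sample");
  }
}
```

Leaving out the setBiasToUniformClass call corresponds to the first command line, which keeps the original class distribution.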
In the following a snippet of the UCI dataset iris in ARFF format:

@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa

168 CHAPTER 10. XRFF

10.2.2 XRFF

And the same dataset represented as XRFF file:

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE dataset
[
   <!ELEMENT dataset (header,body)>
   <!ATTLIST dataset name CDATA #REQUIRED>
   <!ATTLIST dataset version CDATA "3.5.4">

   <!ELEMENT header (notes?,attributes)>
   <!ELEMENT body (instances)>
   <!ELEMENT notes ANY>

   <!ELEMENT attributes (attribute+)>
   <!ELEMENT attribute (labels?,metadata?,attributes?)>
   <!ATTLIST attribute name CDATA #REQUIRED>
   <!ATTLIST attribute type (numeric|date|nominal|string|relational) #REQUIRED>
   <!ATTLIST attribute format CDATA #IMPLIED>
   <!ATTLIST attribute class (yes|no) "no">
   <!ELEMENT labels (label*)>
   <!ELEMENT label ANY>
   <!ELEMENT metadata (property*)>
   <!ELEMENT property ANY>
   <!ATTLIST property name CDATA #REQUIRED>

   <!ELEMENT instances (instance*)>
   <!ELEMENT instance (value*)>
   <!ATTLIST instance type (normal|sparse) "normal">
   <!ATTLIST instance weight CDATA #IMPLIED>
   <!ELEMENT value (#
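Such XRFF files can also be loaded programmatically; a minimal sketch using weka.core.converters.XRFFLoader (the file name iris.xrff is just a placeholder):

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.XRFFLoader;

public class LoadXRFF {
  public static void main(String[] args) throws Exception {
    // read an XRFF file into the usual Instances representation
    XRFFLoader loader = new XRFFLoader();
    loader.setSource(new File("iris.xrff"));
    Instances data = loader.getDataSet();
    System.out.println(data.numInstances() + " instances loaded");
  }
}
```

Since the loader returns a regular Instances object, the rest of the code does not need to care whether the data originally came from ARFF or XRFF.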
[Screenshot: dialog for entering a new name for the value (-inf-5.55]]

The popup menu shows a list of values that can be deleted from the selected node. This is only active when there are more than two values for the node (single-valued nodes do not make much sense). By selecting the value, the CPT of the node is updated in order to ensure that the CPT adds up to unity. The CPTs of children are updated by dropping the distributions conditioned on the value.

[Screenshot: node popup menu with Set evidence, Rename, Delete node, Edit CPT, Add parent, Delete parent, Delete child, Add value, Rename value and Delete value entries, listing the values (-inf-2.45], (2.45-4.75] and (4.75-inf)]

A note on CPT learning

Continuous variables are discretized by the Bayes network class. The discretization algorithm chooses its values based on the information in the data set. However, these values are not stored anywhere. So, reading an arff file with continuous variables using the File/Open menu allows one to specify a network, then learn the CPTs from it, since the discretization bounds are still known. However, opening an arff file, specifying a structure, then closing the application, reopening and trying to learn the network from another file containing continuous variables may not give the desired result, since the discretization algorithm is re-applied and new boundaries may have been found. Unexpected behavior may be the result.

Learning f
brings either the chosen attribute into view or displays all the values of an attribute.

After opening a file, by default, the column widths are optimized based on the attribute name and not the content. This is to ensure that overlong cells do not force an enormously wide table, which the user has to reduce with quite some effort.

In the following, screenshots of the table popups:

[Screenshot: ARFF Viewer showing the heart-h dataset with an attribute popup menu offering Set missing values to..., Rename attribute, Attribute as class, Delete attribute, Delete attributes, Sort data (ascending) and Optimal column width (current)]
output(), batchFinished(), flushInput(), getRevision()

But only the following ones normally need to be modified:

getCapabilities(), setInputFormat(Instances), input(Instance), batchFinished(), getRevision()

For more information on "Capabilities" see section 17.2.3. Please note that the weka.filters.Filter superclass does not implement the weka.core.OptionHandler interface. See section "Option handling" on page 249.

setInputFormat(Instances)

With this call, the user tells the filter what structure, i.e., attributes, the input data has. This method also tests whether the filter can actually process this data, according to the capabilities specified in the getCapabilities() method. If the output format of the filter, i.e., the new Instances header, can be determined based alone on this information, then the method should set the output format via setOutputFormat(Instances) and return true, otherwise it has to return false.

getInputFormat()

This method returns an Instances object containing all currently buffered Instance objects from the input queue.

setOutputFormat(Instances)

setOutputFormat(Instances) defines the new Instances header for the output data. For filters that work on a row basis, there should not be any changes between the input and output format. But filters that work on attributes, e.g., removing, adding, modifying, will affect this format. This method must be called with
[Screenshot: Bayes network editor toolbar]

The toolbar allows a shortcut to many functions. Just hover the mouse over the toolbar buttons and a tooltip pops up that tells which function is activated. The toolbar can be shown or hidden with the View/View Toolbar menu.

Statusbar

At the bottom of the screen the statusbar shows messages. This can be helpful when an undo/redo action is performed that does not have any visible effects, such as edit actions on a CPT. The statusbar can be shown or hidden with the View/View Statusbar menu.

Click right mouse button

Clicking the right mouse button in the graph panel outside a node brings up the following popup menu. It allows one to add a node at the location that was clicked, or select a parent to add to all nodes in the selection. If no node is selected, or no node can be added as parent, this function is disabled.

[Screenshot: popup menu with Add node and Add parent entries, listing sepalwidth, petallength and petalwidth]

Clicking the right mouse button on a node brings up a popup menu. The popup menu shows a list of values that can be set as evidence to the selected node. This is only visible when margins are shown (menu Tools/Show margins). By selecting "Clear", the value of the node is removed and the margins are calculated based on CPTs again.

[Screenshot: node popup menu with Set evidence, Rename, Delete node, Edit CPT, Add parent, Delete parent, Delete child, Add value, Rename value and Delete value entries, listing the values (-inf-2.45], (2.45-4.75] and (4.75-inf)]
Arff Viewer

The ArffViewer is a little tool for viewing ARFF files in a tabular format. The advantage of this kind of display over the file representation is that attribute name, type and data are directly associated in columns and not separated in definition and data part. The viewer can not only display multiple files at once, but also provides simple editing functionality, like sorting and deleting.

[Screenshot: ARFF Viewer (File, Edit, View menus) with heart-h.arff loaded, relation hungarian-14-heart-disease, showing columns age, sex, chest_pain, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal and num]
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
filter.setEvaluator(eval);
filter.setSearch(search);
filter.setInputFormat(data);

// filter data
Instances newData = Filter.useFilter(data, filter);
System.out.println(newData);

224 CHAPTER 16. USING THE API

16.8.3 Using the API directly

Using the meta-classifier or the filter approach makes attribute selection fairly easy. But it might not satisfy everybody's needs. For instance, if one wants to obtain the ordering of the attributes (using Ranker) or retrieve the indices of the selected attributes instead of the reduced data.

Just like the other examples, the one shown here uses the CfsSubsetEval evaluator and the GreedyStepwise search algorithm (in backwards mode). But instead of outputting the reduced data, only the selected indices are printed in the console:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;

Instances data = ... // from somewhere

// setup attribute selection
AttributeSelection attsel = new AttributeSelection();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
attsel.setEvaluator(eval);
attsel.setSearch(search);

// perform attribute selection
attsel.SelectAttributes(data);
int[] indices = attsel.selectedAttributes();
      // (globalInfo() continued)
      "... at the end containing a random number. "
      + "The output format can be collected immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    Instances outFormat = new Instances(instanceInfo, 0);
    outFormat.insertAttributeAt(new Attribute("blah"),
      outFormat.numAttributes());
    setOutputFormat(outFormat);
    m_Random = new Random(1);
    return true;  // output format is immediately available
  }

  public boolean input(Instance instance) throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    if (isNewBatch()) {
      resetQueue();
      m_NewBatch = false;
    }
    convertInstance(instance);
    return true;  // can be immediately collected via output()
  }

  protected void convertInstance(Instance instance) {
    double[] newValues = new double[instance.numAttributes() + 1];
    double[] oldValues = instance.toDoubleArray();
    newValues[newValues.length - 1] = m_Random.nextInt();
    System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
    push(new Instance(1.0, newValues));
  }

  public static void main(String[] args) {
    runFilter(new StreamFilter(), args);
  }
}
• invocation (the build file needs not be specified explicitly, if it's in the current directory; if no target is specified, the default one is used):

    ant -f <build-file> <target>

• displaying all the available targets of a build file:

    ant -f <build-file> -projecthelp

18.1.2 Weka and ANT

• a build file for Weka is available from subversion
• some targets of interest:
  - clean -- Removes the build, dist and reports directories; also any class files in the source tree
  - compile -- Compile weka and deposit class files in ${path_modifier}/build/classes
  - docs -- Make javadocs into ${path_modifier}/doc
  - exejar -- Create an executable jar file in ${path_modifier}/dist

278 CHAPTER 18. TECHNICAL DOCUMENTATION

18.2 CLASSPATH

The CLASSPATH environment variable tells Java where to look for classes. Since Java does the search in a first-come-first-serve kind of manner, you'll have to take care where and what you put in your CLASSPATH. I, personally, never use the environment variable, since I'm often working on a project in different versions in parallel. The CLASSPATH would just mess things up, if you're not careful (or just forget to remove an entry). ANT offers a nice way for building (and separating source code and class files of) Java projects. But still, if you're only working on totally separate projects, it might be easiest for you to use the environment variable.

18.2.1 Setting the CLASSPATH
    import weka.core.Capabilities;
    ...
    import java.util.Random;

    public class BatchFilter3 extends Filter {

      protected int m_Seed;
      protected Random m_Random;

      public String globalInfo() {
        return "A batch filter that adds an attribute 'blah' at the end, "
          + "containing a random number. The output format cannot be "
          + "collected immediately.";
      }

      public Capabilities getCapabilities() {
        Capabilities result = super.getCapabilities();
        result.enableAllAttributes();
        result.enableAllClasses();
        result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
        return result;
      }

      public boolean input(Instance instance) throws Exception {
        if (getInputFormat() == null)
          throw new NullPointerException("No input instance format defined");
        if (isNewBatch()) {
          resetQueue();
          m_NewBatch = false;
        }
        if (isFirstBatchDone())
          convertInstance(instance);
        else
          bufferInput(instance);
        return isFirstBatchDone();
      }

      public boolean batchFinished() throws Exception {
        if (getInputFormat() == null)
          throw new NullPointerException("No input instance format defined");
        // output format still needs to be set (random number generator is
        // seeded with number of instances of first batch)
        if (!isFirstBatchDone()) {
          m_Seed = getInputFormat().numInstances();
          Instances outFormat = new Instances(getInputFormat(), 0);
          outFormat.insertAttributeAt(
            new Attribute("blah-" + getInputFormat().numInstances()),
            outFormat.numAttributes());
          setOutputFormat(outFormat);
tive way of serializing and deserializing Java objects in an XML file. Like the normal serialization it serializes everything into XML via an ObjectOutputStream, including the SerialUID of each class. Even though we have the same problems with mismatching SerialUIDs, it is at least possible to edit the XML files by hand and replace the offending IDs with the new ones.

In order to use KOML one only has to assure that the KOML classes are in the CLASSPATH with which the Experimenter is launched. As soon as KOML is present, another filter (*.koml) will show up in the Save/Open dialog.

The DTD for KOML can be found at http://old.koalateam.com/xml/koml12.dtd

Responsible class(es):

weka.core.xml.KOML

The experiment class can of course read those XML files if passed as input or output file (see options of weka.experiment.Experiment and weka.experiment.RemoteExperiment).

18.6.3 Serialization of Classifiers

The options for models of a classifier, -l for the input model and -d for the output model, now also support XML serialized files. Here we have to differentiate between two different formats:

• built-in
The built-in serialization captures only the options of a classifier but not the built model. With -l one still has to provide a training file, since we only retrieve the options from the XML file. It is possible to add more options on the command line, but no check is performed whether they collide with the ones stored in
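The idea of XML serialization of Java object graphs can be demonstrated with the JDK's own java.beans.XMLEncoder/XMLDecoder. This is purely an illustration of the mechanism; WEKA itself uses its weka.core.xml classes and, optionally, KOML rather than this API.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class XmlSerializationDemo {
  // Serializes an object graph to XML text and reads it back again.
  // Unlike binary serialization, the intermediate XML is human-editable.
  public static Object roundTrip(Object obj) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (XMLEncoder encoder = new XMLEncoder(buffer)) {
      encoder.writeObject(obj);
    }
    try (XMLDecoder decoder =
           new XMLDecoder(new ByteArrayInputStream(buffer.toByteArray()))) {
      return decoder.readObject();
    }
  }
}
```

The human-editable intermediate form is exactly what the section describes: mismatching IDs (or, here, bean properties) can be patched by hand before deserializing.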
[Screenshot: the Evaluation toolbar of the KnowledgeFlow]

• TrainingSetMaker - make a data set into a training set.

• TestSetMaker - make a data set into a test set.

• CrossValidationFoldMaker - split any data set, training set or test set into folds.

• TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.

• ClassAssigner - assign a column to be the class for any data set, training set or test set.

• ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below and example 6.4.2).

• ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers.

• IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers.

• ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers.

• PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.

CHAPTER 6. KNOWLEDGEFLOW

6.3.7 Visualization

[Screenshot: the toolbar tabs DataSources, DataSinks, Filters, Classifiers, Clusterers, Associations, ...]
[Screenshot: the ArffViewer displaying the heart disease dataset]

7.1 Menus
[Screenshot: the file chooser showing the changelogs, doc and data folders]

Double click on the data folder to view the available datasets or navigate to an alternate location. Select iris.arff and click Open to select the Iris dataset.

64 CHAPTER 5. EXPERIMENTER

[Screenshot: the file chooser listing contact-lenses.arff, iris.arff, segment-challenge.arff, segment-test.arff, soybean.arff, weather.arff and weather.nominal.arff]

[Screenshot: the Setup tab of the Weka Experiment Environment after adding the dataset]

The dataset name is now displayed
In the following we add the mysql-connector-java-5.1.7-bin.jar to our CLASSPATH variable (this works for any other jar archive) to make it possible to access MySQL databases via JDBC.

Win32 (2k and XP)

We assume that the mysql-connector-java-5.1.7-bin.jar archive is located in the following directory:

C:\Program Files\Weka-3-7

In the Control Panel click on System (or right click on My Computer and select Properties) and then go to the Advanced tab. There you'll find a button called Environment Variables, click it. Depending on whether you're the only person using this computer or it's a lab computer shared by many, you can either create a new system-wide (you're the only user) environment variable or a user-dependent one (recommended for multi-user machines). Enter the following name for the variable:

CLASSPATH

and add this value:

C:\Program Files\Weka-3-7\mysql-connector-java-5.1.7-bin.jar

If you want to add additional jars, you will have to separate them with the path separator, the semicolon ";" (no spaces).

Unix/Linux

We make the assumption that the mysql jar is located in the following directory:

/home/johndoe/jars/

Open a shell and execute the following command, depending on the shell you're using:

• bash:
export CLASSPATH=$CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.7-bin.jar

• c shell:
setenv CLASSPATH $CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.7-bin.jar
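Whether the variable actually took effect can be checked from Java itself: the java.class.path system property holds the effective classpath the JVM was started with. This small check is not part of the manual's setup, just a convenient verification.

```java
public class ShowClasspath {
  // Returns the classpath the running JVM actually uses; this reflects the
  // CLASSPATH environment variable unless -classpath/-cp overrides it.
  public static String effectiveClasspath() {
    return System.getProperty("java.class.path");
  }

  public static void main(String[] args) {
    // Print one entry per line, using the platform's path separator
    // (";" on Windows, ":" on Unix/Linux).
    for (String entry :
         effectiveClasspath().split(java.io.File.pathSeparator))
      System.out.println(entry);
  }
}
```

If the mysql connector jar does not appear in the output, the JDBC driver will not be found at runtime.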
The ArffViewer offers most of its functionality either through the main menu or via popups (table header and table cells).

Short description of the available menus:

• File

[Screenshot: the File menu with Save (Ctrl+S), Save as (Ctrl+Shift+S), Close (Ctrl+W), Close all, Properties (Ctrl+Enter) and Exit]

contains options for opening and closing files, as well as viewing properties about the current file.

• Edit

[Screenshot: the Edit menu with Undo (Ctrl+Z), Copy (Ctrl+Insert), Search (Ctrl+F), Clear search (Ctrl+Shift+F), Rename attribute, Attribute as class, Delete attribute(s), Delete instance(s) and Sort data (ascending)]

allows one to delete attributes/instances, rename attributes, choose a new class attribute, search for certain values in the data and of course undo the modifications.

• View

[Screenshot: the View menu with display options for attribute values and Optimal column width (all)]
Typical values for m are 5, 10 and 20. With m = N, k-fold cross-validation becomes loo-cv.

• Cumulative cross-validation (cumulative-cv) starts with an empty data set and adds instances item by item from D. After each time an item is added, the next item to be added is classified using the then current state of the Bayes network.

Finally, the useProb flag indicates whether the accuracy of the classifier should be estimated using the zero-one loss (if set to false) or using the estimated probability of the class.

126 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.search.global.K2, showing the initAsNaiveBayes, markovBlanketClassifier, maxNrOfParents, randomOrder and useProb options]

The following search algorithms are implemented: K2, Hill Climbing, Repeated Hill Climber, TAN, Tabu Search, Simulated Annealing and Genetic Search. See Section 8.2 for a description of the specific options for those algorithms.

8.5 Fixed structure 'learning'

The structure learning step can be skipped by selecting a fixed network structure. There are two methods of getting a fixed structure: just make it a naive Bayes network, or reading it from a file in XML BIF format.
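The relation between m-fold cross-validation and loo-cv can be made concrete with a small sketch. This is plain Java, not the WEKA implementation: partitioning N instance indices into m folds yields folds of size 1 exactly when m = N.

```java
import java.util.ArrayList;
import java.util.List;

public class FoldSketch {
  // Splits the indices 0..n-1 into k folds of near-equal size.
  // With k == n every fold holds a single instance (leave-one-out).
  public static List<List<Integer>> folds(int n, int k) {
    List<List<Integer>> result = new ArrayList<>();
    for (int f = 0; f < k; f++)
      result.add(new ArrayList<>());
    for (int i = 0; i < n; i++)
      result.get(i % k).add(i);      // deal indices round-robin
    return result;
  }
}
```

In each round, one fold serves as the test set and the remaining k-1 folds as the training set; WEKA additionally stratifies the folds for classification problems.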
actual instance lines.

The @data Declaration

The @data declaration is a single line denoting the start of the data segment in the file. The format is:

@data

The instance data

Each instance is represented on a single line, with carriage returns denoting the end of the instance. A percent sign (%) introduces a comment, which continues to the end of the line.

Attribute values for each instance are delimited by commas. They must appear in the order that they were declared in the header section, i.e. the data corresponding to the nth @attribute declaration is always the nth field of the instance.

Missing values are represented by a single question mark, as in:

@data
4.4,?,1.5,?,Iris-setosa

Values of string and nominal attributes are case sensitive, and any that contain space or the comment-delimiter character % must be quoted. (The code suggests that double-quotes are acceptable and that a backslash will escape individual characters.) An example follows:

% LCSH subject headings vs. LC classification

@relation LCCvsLCSH

@attribute LCC string
@attribute LCSH string

@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'

9.3. SPARSE ARFF FILES 165

Dates must be specified in the data section using the string representation specified in the attribute declaration.
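The quoting rules can be illustrated with a simplified parser. The sketch below is plain Java and deliberately much simpler than WEKA's actual tokenizer: it splits one data line on commas while treating single-quoted values as atomic and backslash as an escape character.

```java
import java.util.ArrayList;
import java.util.List;

public class DataLineSketch {
  // Splits one ARFF-style data line on commas, honoring single quotes
  // and backslash escapes (simplified; no date or sparse handling).
  public static List<String> split(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder cur = new StringBuilder();
    boolean quoted = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '\\' && i + 1 < line.length()) {
        cur.append(line.charAt(++i));        // escaped character, taken literally
      } else if (c == '\'') {
        quoted = !quoted;                    // toggle quoted region
      } else if (c == ',' && !quoted) {
        fields.add(cur.toString().trim());   // field boundary
        cur.setLength(0);
      } else {
        cur.append(c);
      }
    }
    fields.add(cur.toString().trim());
    return fields;
  }
}
```

Note how the comma inside 'Astronomy, Assyro-Babylonian.' does not split the field, which is exactly why such values must be quoted.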
finds in the CLASSPATH, and therefore fixes the location of the package it found the class in, the dynamic discovery examines the complete CLASSPATH you are starting the Java Virtual Machine (= JVM) with. This means that you can have several parallel directories with the same WEKA package structure, e.g. the standard release of WEKA in one directory (/distribution/weka.jar) and another one with your own classes (/development/weka/...), and display all of the classifiers in the GUI. In case of a name conflict, i.e. two directories contain the same class, the first one that can be found is used. In a nutshell, your java call of the GUIChooser can look like this:

java -classpath "/development:/distribution/weka.jar" weka.gui.GUIChooser

Note: Windows users have to replace the ":" with ";" and the forward slashes with backslashes.

18.4.5 Multiple Class Hierarchies

In case you are developing your own framework, but still want to use your classifiers within WEKA: that was not possible with WEKA prior to 3.4.4. Starting with the release 3.4.4 it is possible to have multiple class hierarchies being displayed in the GUI. If you have developed a modified version of NaiveBayes, let us call it DummyBayes, and it is located in the package dummy.classifiers, then you will have to add this package to the classifiers list in the GPC file like this:

weka.classifiers.Classifier=\
 weka.classifiers.bayes,\
 weka.classifiers.functions,\
 weka.class
gui.Explorer.LogHandler interface (but that is only additional functionality):

    public class GeneratorPanel
      extends JPanel
      implements ExplorerPanel {

• some basic members that we need to have (the same as for the SqlPanel class):

    /** the parent frame */
    protected Explorer m_Explorer = null;

    /** sends notifications when the set of working instances gets changed */
    protected PropertyChangeSupport m_Support = new PropertyChangeSupport(this);

• methods we need to implement due to the used interfaces (almost identical to SqlPanel):

    /** Sets the Explorer to use as parent frame */
    public void setExplorer(Explorer parent) {
      m_Explorer = parent;
    }

    /** returns the parent Explorer frame */
    public Explorer getExplorer() {
      return m_Explorer;
    }

    /** Returns the title for the tab in the Explorer */
    public String getTabTitle() {
      return "DataGeneration";  // what's displayed as tab title, e.g., Classify
    }

    /** Returns the tooltip for the tab in the Explorer */
    public String getTabTitleToolTip() {
      return "Generating artificial datasets";  // the tooltip of the tab
    }

    /** ignored, since we "generate" data and not receive it */
    public void setInstances(Instances inst) {
    }

    /** PropertyChangeListener which will be notified of value changes. */
    public void addPropertyChangeListener(PropertyChangeListener l) {
      m_Support.addPropertyChangeListener(l);
    }

    /** Removes a PropertyChangeListener. */
    public void removePropertyChangeListener(PropertyChangeListener l) {
      m_Support.removePropertyChangeListener(l);
    }
help [<command>]

3.1 Commands

The following commands are available in the Simple CLI:

• java <classname> [<args>]
invokes a java class with the given arguments (if any)

• break
stops the current thread, e.g., a running classifier, in a friendly manner

32 CHAPTER 3. SIMPLE CLI

• kill
stops the current thread in an unfriendly fashion

• cls
clears the output area

• exit
exits the Simple CLI

• help [<command>]
provides an overview of the available commands if without a command name as argument, otherwise more help on the specified command

3.2 Invocation

In order to invoke a Weka class, one has only to prefix the class with "java". This command tells the Simple CLI to load a class and execute it with any given parameters. E.g., the J48 classifier can be invoked on the iris dataset with the following command:

java weka.classifiers.trees.J48 -t c:/temp/iris.arff

This results in the following output:

[Screenshot: the Simple CLI window showing the classifier output]

=== Stratified cross-validation ===

Correctly Classified Instances        144       96      %
Incorrectly Classified Instances        6        4      %
Kappa statistic                         0.94
Mean absolute error                     0.035
Root mean squared error                 0.1586
Relative absolute error                 7.8705 %
Root relative squared error            33.6353 %
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
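Internally, a command like java weka.classifiers.trees.J48 ... amounts to resolving the class by name and calling its main method. The sketch below is plain Java and not the actual Simple CLI source (which, in addition, runs the class in its own thread so that break and kill can interrupt it); the Demo class is a stand-in for a WEKA class.

```java
import java.lang.reflect.Method;

public class JavaCommandSketch {
  /** Stand-in target class: records the arguments it was started with. */
  public static class Demo {
    public static String received;
    public static void main(String[] args) {
      received = String.join(" ", args);
    }
  }

  // Resolves the named class and invokes its public static main(String[]),
  // as the Simple CLI's "java" command conceptually does.
  public static void invoke(String classname, String[] args) throws Exception {
    Method main = Class.forName(classname).getMethod("main", String[].class);
    main.invoke(null, (Object) args);   // cast so args is one argument, not varargs
  }
}
```

The (Object) cast is essential: without it, reflection would treat the String[] as a varargs list of separate arguments.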
ing set, Supplied test set and Percentage split (Section 4.3.1), except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem it may be helpful to disable this option.

4.4.3 Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

46 CHAPTER 4. EXPLORER

4.4.4 Working with Filters

The FilteredClusterer meta-clusterer offers the user the possibility to apply filters directly before the clusterer is learned. This approach eliminates the
[Screenshot: probability table entries for Iris-virginica over the interval (6.1, inf)]

So, the graph visualizer allows you to inspect both network structure and probability tables.

8.9 Bayes Network GUI

The Bayesian network editor is a stand alone application with the following features:
• Edit Bayesian network completely by hand, with unlimited undo/redo stack, cut/copy/paste and layout support
• Learn Bayesian network from data using learning algorithms in Weka
• Edit structure by hand and learn conditional probability tables (CPTs) using learning algorithms in Weka
• Generate dataset from Bayesian network
• Inference (using junction tree method) of evidence through the network, interactively changing values of nodes
• Viewing cliques in junction tree
• Accelerator key support for most common operations

The Bayes network GUI is started as

java weka.classifiers.bayes.net.GUI [bif file]

The following window pops up when an XML BIF file is specified; if none is specified an empty graph is shown.

[Screenshot: the Bayes Network Editor main window]

Moving a node

Click a node with the left mouse button
<integer>
	Number of runs
<seed>
	Random number seed
-P <nr of parents>
	Maximum number of parents
-R
	Use arc reversal operation. (default false)
-N
	Initial structure is empty (instead of Naive Bayes)
-mbc
	Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
	Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.SimulatedAnnealing

-A <float>
	Start temperature
-U <integer>
	Number of runs
-D <float>
	Delta temperature
-R <seed>
	Random number seed
-mbc
	Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
	Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.TabuSearch

-L <integer>
	Tabu list length
-U <integer>
	Number of runs
-P <nr of parents>
	Maximum number of parents
-R
	Use arc reversal operation. (default
new Instances(unlabeled);

    // label instances
    for (int i = 0; i < unlabeled.numInstances(); i++) {
      double clsLabel = tree.classifyInstance(unlabeled.instance(i));
      labeled.instance(i).setClassValue(clsLabel);
    }

    // save newly labeled data
    DataSink.write("/some/where/labeled.arff", labeled);

The above example works for classification and regression problems alike, as long as the classifier can handle numeric classes, of course. Why is that? The classifyInstance(Instance) method returns for numeric classes the regression value and for nominal classes the 0-based index in the list of available class labels.

If one is interested in the class distribution instead, then one can use the distributionForInstance(Instance) method (this array sums up to 1). Of course, using this method makes only sense for classification problems. The code snippet below outputs the class distribution, and the actual and predicted label side by side, in the console:

    // load data
    Instances train = DataSource.read(args[0]);
    train.setClassIndex(train.numAttributes() - 1);
    Instances test = DataSource.read(args[1]);
    test.setClassIndex(test.numAttributes() - 1);

    // train classifier
    J48 cls = new J48();
    cls.buildClassifier(train);

    // output predictions
    System.out.println("# - actual - predicted - distribution");
    for (int i = 0; i < test.numInstances(); i++) {
      double pred = cls.classifyInstance(test.instance(i));
      double[] dist = cls.distributionForInstance(test.instance(i));
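The relation between classifyInstance and distributionForInstance for nominal classes can be shown without WEKA: the predicted 0-based label index is simply the argmax of the class distribution, and a proper distribution sums to 1. The sketch below is plain Java, purely conceptual, and not the WEKA source.

```java
public class DistributionSketch {
  // Returns the index of the largest entry: for a nominal class, this is
  // what the predicted 0-based label index corresponds to.
  public static int argmax(double[] dist) {
    int best = 0;
    for (int i = 1; i < dist.length; i++)
      if (dist[i] > dist[best])
        best = i;
    return best;
  }

  // A class distribution should sum (up to rounding) to 1.
  public static double sum(double[] dist) {
    double s = 0;
    for (double d : dist)
      s += d;
    return s;
  }
}
```

For example, for a three-class problem a distribution of {0.1, 0.7, 0.2} yields the predicted label index 1, i.e. the second class label.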
    public void actionPerformed(ActionEvent evt) {
      m_Support.firePropertyChange("", null, null);
    }

• the propertyChange event will perform the actual loading of the data, hence we add an anonymous property change listener to our panel:

    addPropertyChangeListener(new PropertyChangeListener() {
      public void propertyChange(PropertyChangeEvent e) {
        try {
          // load data
          InstanceQuery query = new InstanceQuery();
          query.setDatabaseURL(m_Viewer.getURL());
          query.setUsername(m_Viewer.getUser());
          query.setPassword(m_Viewer.getPassword());
          Instances data = query.retrieveInstances(m_Viewer.getQuery());

          // set data in preprocess panel (also notifies of capabilities changes)
          getExplorer().getPreprocessPanel().setInstances(data);
        }
        catch (Exception ex) {
          ex.printStackTrace();
        }
      }
    });

• In order to add our SqlPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this:

    Tabs=weka.gui.explorer.SqlPanel,\
     weka.gui.explorer.ClassifierPanel,\
     weka.gui.explorer.ClustererPanel,\
     weka.gui.explorer.AssociationsPanel,\
     weka.gui.explorer.AttributeSelectionPanel,\
     weka.gui.explorer.VisualizePanel

17.4. EXTENDING THE EXPLORER 269

• Screenshot:

[Screenshot: the Explorer with the additional SQL tab and its connection panel]
significant win with regard to the scheme in the row.

5.5.6 Ranking Test

Selecting Ranking from Test base causes the following information to be generated:

[Screenshot: the Analyse tab showing the ranking test output for J48, OneR and ZeroR on Percent_correct]

The ranking test ranks the schemes according to the total number of significant wins (>) and losses (<) against the other schemes. The first column (> - <) is the difference between the number of wins and the number of losses. This difference is used to generate the ranking.
[Screenshot: the package tree weka.classifiers.bayes.net.search with the local, global and fixed sub-packages, the latter containing FromFile and NaiveBayes]

8.6 Distribution learning

Once the network structure is learned, you can choose how to learn the probability tables by selecting a class in the weka.classifiers.bayes.net.estimate package.

[Screenshot: the class tree weka.classifiers.bayes.net.estimate with BayesNetEstimator, BMAEstimator, MultiNomialBMAEstimator and SimpleEstimator]

The SimpleEstimator class produces direct estimates of the conditional probabilities, that is,

P(x_i = k | pa(x_i) = j) = (N_ijk + N'_ijk) / (N_ij + N'_ij)

where N'_ijk is the alpha parameter that can be set and is 0.5 by default. With alpha = 0, we get maximum likelihood estimates.

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.estimate.SimpleEstimator, showing the alpha option (default 0.5)]

With the BMAEstimator, we get estimates for the conditional probability tables based on Bayes model averaging of all network structures that are sub-structures of the network structure learned [15]. This is achieved by estimating the conditional probability table of a node x_i given its parents pa(x_i) as a weighted average of all conditional
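Under the convention that N'_ijk = alpha for each of the K values of x_i (so N'_ij sums alpha over those K values), the estimate can be sketched as below. This is plain Java illustrating the formula, not the WEKA SimpleEstimator source.

```java
public class SimpleEstimatorSketch {
  // P(x_i = k | pa(x_i) = j) = (N_ijk + alpha) / (N_ij + K * alpha),
  // where counts[k] = N_ijk for the K values of x_i, and N_ij = sum_k N_ijk.
  // With alpha = 0 this reduces to the maximum likelihood estimate.
  public static double estimate(int[] counts, int k, double alpha) {
    double total = 0;
    for (int c : counts)
      total += c;
    return (counts[k] + alpha) / (total + counts.length * alpha);
  }
}
```

For instance, with counts {3, 1} and the default alpha = 0.5, the estimate for the first value is (3 + 0.5) / (4 + 1) = 0.7, slightly shrunk toward uniform compared with the maximum likelihood value 0.75.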
"yes";;
    d) DIR=$OPTARG;;
    w) WEKA=$OPTARG;;
    h) usage
       exit 0;;
    *) usage
       exit 1;;
  esac
done

# either plaintext or bibtex
if [ "$PLAINTEXT" = "$BIBTEX" ]
then
  echo
  echo "ERROR: either -p or -b has to be given!"
  echo
  usage
  exit 2
fi

CHAPTER 15. RESEARCH

15.2. PAPER REFERENCES 193

# do we have everything?
if [ "$DIR" = "" ] || [ ! -d "$DIR" ]
then
  echo
  echo "ERROR: no directory or non-existing one provided!"
  echo
  usage
  exit 3
fi

# generate Java call
if [ "$WEKA" = "" ]
then
  JAVA="java"
else
  JAVA="java -classpath $WEKA"
fi
if [ "$PLAINTEXT" = "yes" ]
then
  CMD="$JAVA $TECHINFO -plaintext"
elif [ "$BIBTEX" = "yes" ]
then
  CMD="$JAVA $TECHINFO -bibtex"
fi

# find packages
TMP=`find $DIR -mindepth 1 -type d | grep -v CVS | sed s/".*weka"/"weka"/g | sed s/"\/"/"."/g`
PACKAGES=`echo $TMP | sed s/" "/","/g`

# get technicalinformationhandlers
TECHINFOHANDLERS=`$JAVA weka.core.ClassDiscovery $TECHINFOHANDLER $PACKAGES | grep " weka" | sed s/".*weka"/"weka"/g`

# output information
echo
for i in $TECHINFOHANDLERS
do
  TMP=$i
  class_to_filename

  # exclude internal classes
  if [ ! -f $TMP ]
  then
    continue
  fi

  $CMD -W $i
  echo
done

Chapter 16

Using the API

Using the graphical tools, like the Explorer, or just the command line is in most cases sufficient for the normal user. But WEKA's clearly defined API ("application programming interface") makes it
8.11 Adding your own Bayesian network learners

You can add your own structure learners and estimators.

Adding a new structure learner

Here is the quick guide for adding a structure learner:

1. Create a class that derives from weka.classifiers.bayes.net.search.SearchAlgorithm. If your searcher is score-based, conditional-independence based or cross-validation based, you probably want to derive from ScoreSearchAlgorithm, CISearchAlgorithm or CVSearchAlgorithm instead of deriving from SearchAlgorithm directly. Let's say it is called weka.classifiers.bayes.net.search.local.MySearcher, derived from ScoreSearchAlgorithm.

2. Implement the method

public void buildStructure(BayesNet bayesNet, Instances instances)

Essentially, you are responsible for setting the parent sets in bayesNet. You can access the parent sets using bayesNet.getParentSet(iAttribute), where iAttribute is the number of the node/variable.

To add a parent iParent to node iAttribute, use bayesNet.getParentSet(iAttribute).addParent(iParent, instances), where instances need to be passed for the parent set to derive properties of the attribute.

Alternatively, implement public void search(BayesNet bayesNet, Instances instances). The implementation of buildStructure in the base class will call search after initializing the parent sets, and if the initAsNaiveBayes flag is set, it will start
[Figure: file dialog; FileName: Experiment1.exp; Files of Type: Experiment configuration files (*.exp)]

The experiment can be restored by selecting Open in the Setup tab and then selecting Experiment1.exp in the dialog window.

5.2.2.2 Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme.

Click Start to run the experiment. If the experiment was defined correctly, the following 3 messages will be displayed in the Log panel, and the status will return to "Not running":

16:17:12: Started
16:17:12: Finished
16:17:12: There were 0 errors

The results of the experiment are saved to the dataset Experiment1.arff. The first few lines in this dataset are shown below:

@relation InstanceResultListener
@attribute ...
(2000) Explicitly representing expected cost: An alternative to ROC representation. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[6] Extensions for Weka's main GUI on WekaWiki. http://weka.wikispaces.com/Extensions+for+Weka's+main+GUI

[7] Adding tabs in the Explorer on WekaWiki. http://weka.wikispaces.com/Adding+tabs+in+the+Explorer

[8] Explorer visualization plugins on WekaWiki. http://weka.wikispaces.com/Explorer+visualization+plugins

[9] Bengio, Y. and Nadeau, C. (1999) Inference for the Generalization Error.

[10] Ross Quinlan (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.

[11] Subversion. http://weka.wikispaces.com/Subversion

[12] HSQLDB. http://hsqldb.sourceforge.net/

[13] MySQL. http://www.mysql.com/

[14] Plotting multiple ROC curves on WekaWiki. http://weka.wikispaces.com/Plotting+multiple+ROC+curves

[15] R.R. Bouckaert. Bayesian Belief Networks: from Construction to Inference. Ph.D. thesis, University of Utrecht, 1995.

[16] W.L. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8:195-210, 1996.

[17] J. Cheng, R. Greiner. Comparing Bayesian network classifiers. Proceedings UAI, 101-107, 1999.

[18] C.K. Chow, C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE
Capabilities object, like this:

public Capabilities getCapabilities() {
  Capabilities result = new Capabilities(this);

  // attributes
  result.enable(Capability.NOMINAL_ATTRIBUTES);
  result.enable(Capability.NUMERIC_ATTRIBUTES);

  // class
  result.enable(Capability.NUMERIC_CLASS);

  return result;
}

Another classifier, that only handles binary classes and only nominal attributes and missing values, would implement the getCapabilities() method as follows:

public Capabilities getCapabilities() {
  Capabilities result = new Capabilities(this);

  // attributes
  result.enable(Capability.NOMINAL_ATTRIBUTES);
  result.enable(Capability.MISSING_VALUES);

  // class
  result.enable(Capability.BINARY_CLASS);
  result.disable(Capability.UNARY_CLASS);
  result.enable(Capability.MISSING_CLASS_VALUES);

  return result;
}

Meta-classifiers

Meta-classifiers, by default, just return the capabilities of their base classifiers; in case of descendants of weka.classifiers.MultipleClassifiersCombiner, an AND over all the Capabilities of the base classifiers is returned.

Due to this behavior, the capabilities depend, normally, only on the currently configured base classifier(s). To soften filtering for certain behavior, meta-classifiers also define so-called Dependencies on a per-Capability basis. These dependencies tell the filter that even though a certain capability is not supported right now, it is possible that it will be supported with a different
Edit this file and change the

jdbcURL=jdbc:mysql://server_name:3306/database_name

entry to include the name of the machine that is running your database server and the name of the database the results will be stored in, e.g.

jdbcURL=jdbc:mysql://dodo.company.com:3306/experiment

Now start the Experimenter (inside this directory):

java \
  -cp /home/johndoe/jars/mysql.jar:remoteEngine.jar:/home/johndoe/weka/weka.jar \
  -Djava.rmi.server.codebase=file:/home/johndoe/weka/weka.jar \
  weka.gui.experiment.Experimenter

Note: the database name experiment can still be modified in the Experimenter, this is just the default setup.

Now we will configure the experiment:

- First of all select the Advanced mode in the Setup tab.

- Now choose the DatabaseResultListener in the Destination panel. Configure this result listener:

  - HSQLDB: supply the value sa for the username and leave the password empty.

  - MySQL: provide the username and password that you need for connecting to the database.

- From the Result generator panel choose either the CrossValidationResultProducer or the RandomSplitResultProducer (these are the most commonly used ones) and then configure the remaining experiment details (e.g., datasets and classifiers).

- Now enable the Distribute Experiment panel by checking the tick box.

- Click on the Hosts button and enter the names of the machines that you started remote
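For reference, the two database-related entries in the edited properties file might then look like the fragment below. The host and database names are only examples, and the driver class shown is the one typically used for MySQL; use whatever JDBC driver your database requires.

```properties
# example DatabaseUtils.props entries for a MySQL server
jdbcDriver=com.mysql.jdbc.Driver
jdbcURL=jdbc:mysql://dodo.company.com:3306/experiment
```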
Visualization

[Figure: Visualization toolbar with DataVisualizer, ScatterPlotMatrix, AttributeSummarizer, ModelPerformanceChart, TextViewer, GraphViewer and StripChart components]

- DataVisualizer: component that can pop up a panel for visualizing data in a single large 2D scatter plot.

- ScatterPlotMatrix: component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).

- AttributeSummarizer: component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data.

- ModelPerformanceChart: component that can pop up a panel for visualizing threshold (i.e. ROC-style) curves.

- TextViewer: component for showing textual data. Can show data sets, classification performance statistics etc.

- GraphViewer: component that can pop up a panel for visualizing tree-based models.

- StripChart: component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).

6.4 Examples

6.4.1 Cross-validated J48

Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48 (WEKA's C4.5 implementation).

[Figure: KnowledgeFlow layout with ArffLoader, ClassAssigner, CrossValidationFoldMaker, J48 and ClassifierPerformanceEvaluator components connected to a TextViewer]
In order to make sure that your classifier conforms to the WEKA criteria, you should add your classifier to the junit unit test framework, i.e., by creating a Test class. The superclass for classifier unit tests is weka.classifiers.AbstractClassifierTest.

17.2 Writing a new Filter

The "work horses" of preprocessing in WEKA are filters. They perform many tasks, from resampling data, to deleting and standardizing attributes. In the following, two different approaches are covered that explain in detail how to implement a new filter:

- default: this is how filters had to be implemented in the past.

- simple: since there are mainly two types of filters, batch or stream, additional abstract classes were introduced to speed up the implementation process.

17.2.1 Default approach

The default approach is the most flexible, but also the most complicated one for writing a new filter. This approach has to be used if the filter cannot be written using the simple approach described further below.

17.2.1.1 Implementation

The following methods are of importance for the implementation of a filter and are explained in detail further down. It is also a good idea to study the Javadoc of these methods as declared in the weka.filters.Filter class:

getCapabilities()
setInputFormat(Instances)
getInputFormat()
setOutputFormat(Instances)
getOutputFormat()
input(Instance)
bufferInput(Instance)
push(Instance)
J48 in order to view the textual or graphical representations of the trees produced for each fold of the cross-validation (this is something that is not possible in the Explorer).

6.4.2 Plotting multiple ROC curves

The KnowledgeFlow can draw multiple ROC curves in the same plot window, something that the Explorer cannot do. In this example we use J48 and RandomForest as classifiers. This example can be found on the Weka Wiki as well.

[Figure: KnowledgeFlow layout with ArffLoader, ClassAssigner, ClassValuePicker, CrossValidationFoldMaker, a J48 and a RandomForest classifier, their performance evaluators and a ModelPerformanceChart]

- Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will change to a cross hairs).

- Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area).

- Next specify an ARFF file to load by first right-clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and browse to the location of your ARFF file.

- Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to choose which column to be the class) component from the toolbar.
[Figure: Experimenter Setup tab in Simple mode, showing the Results Destination, Experiment Type, Iteration Control, Datasets and Algorithms panels]

5.2.1.2 Results destination

By default, an ARFF file is the destination for the results output. But you can choose between:

- ARFF file

- CSV file

- JDBC database

ARFF file and JDBC database are discussed in detail in the following sections. CSV is similar to ARFF, but it can be loaded into an external spreadsheet application.

ARFF file

If the file name is left empty, a temporary file will be created in the TEMP directory of the system. If one wants to specify an explicit results file, click on Browse and choose a filename, e.g., Experiment1.arff.

[Figure: Save dialog; FileName: Experiment1.arff; Files of Type: ARFF files]

Click on Save and the name will appear in the edit field next to ARFF file.
Ranker search algorithm is usually used in conjunction with these algorithms.

- attribute subset evaluators: work on subsets of all the attributes in the dataset. The weka.attributeSelection.SubsetEvaluator interface is implemented by these evaluators.

- attribute set evaluators: evaluate sets of attributes. Not to be confused with the subset evaluators, as these classes are derived from the weka.attributeSelection.AttributeSetEvaluator superclass.

Most of the attribute selection schemes currently implemented are supervised, i.e., they require a dataset with a class attribute. Unsupervised evaluation algorithms are derived from one of the following superclasses:

- weka.attributeSelection.UnsupervisedAttributeEvaluator (e.g., LatentSemanticAnalysis, PrincipalComponents)

- weka.attributeSelection.UnsupervisedSubsetEvaluator (none at the moment)

Attribute selection offers filtering on the fly, like classifiers and clusterers, as well:

- weka.attributeSelection.FilteredAttributeEval: filter for evaluators that evaluate attributes individually.

- weka.attributeSelection.FilteredSubsetEval: for filtering evaluators that evaluate subsets of attributes.

So much about the differences among the various attribute selection algorithms; back to how to actually perform attribute selection. WEKA offers three different approaches:

- Using a meta-classifier: for performing attribute selection on the fly (similar to FilteredClassifier
16.10.2 Graphs

Classes implementing the weka.core.Drawable interface can generate graphs of their internal models which can be displayed. There are two different types of graphs available at the moment, which are explained in the subsequent sections:

- Tree: decision trees.

- BayesNet: bayesian net graph structures.

16.10.2.1 Tree

It is quite easy to display the internal tree structure of classifiers like J48 or M5P (package weka.classifiers.trees). The following example builds a J48 classifier on a dataset and displays the generated tree visually using the TreeVisualizer class (package weka.gui.treevisualizer). This visualization class can be used to view trees (or digraphs) in GraphViz's DOT language [26].

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.gui.treevisualizer.PlaceNode2;
import weka.gui.treevisualizer.TreeVisualizer;
import java.awt.BorderLayout;
import javax.swing.JFrame;
...
Instances data = ...;  // from somewhere

// train classifier
J48 cls = new J48();
cls.buildClassifier(data);

// display tree
TreeVisualizer tv = new TreeVisualizer(
    null, cls.graph(), new PlaceNode2());
JFrame jf = new JFrame("Weka Classifier Tree Visualizer: J48");
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setSize(800, 600);
jf.getContentPane().setLayout(new BorderLayout());
jf.getContentPane().add(tv, BorderLayout.CENTER);
jf.setVisible(true);

// adjust tree
tv.fitToScreen();
The graph visualizer has two buttons to zoom in and out. Also, the exact zoom desired can be entered in the zoom percentage entry. Hit enter to redraw at the desired zoom level.

Graph drawing options Hit the "extra controls" button to show extra options that control the graph layout settings.

[Figure: Graph Visualizer extra controls: Layout Type (Naive Layout / Priority Layout), Layout Method (Top Down / Bottom Up), With Edge Concentration, Custom Node Size (Width, Height), Layout Graph button]

The Layout Type determines the algorithm applied to place the nodes. The Layout Method determines in which direction nodes are considered. The Edge Concentration toggle allows edges to be partially merged. The Custom Node Size can be used to override the automatically determined node size.

When you click a node in the Bayesian net, a window with the probability table of the node clicked pops up. The left side shows the parent attributes and lists the values of the parents; the right side shows the probability of the node clicked conditioned on the values of the parents listed on the left.

[Figure: probability table of the class node conditioned on its parent sepallength, with parent values (-inf-6.1] and (6.1-inf) and probabilities for Iris-setosa, Iris-versicolor and Iris-virginica]
Visualization

Access to visualization from the ClassifierPanel, ClusterPanel and AttributeSelection panel is available from a popup menu. Click the right mouse button over an entry in the Result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees etc.

19.2.12 Memory consumption and Garbage collector

There is the ability to print how much memory is available in the Explorer and Experimenter and to run the garbage collector. Just right-click over the Status area in the Explorer/Experimenter.

19.2.13 GUIChooser starts but not Experimenter or Explorer

The GUIChooser starts, but Explorer and Experimenter don't start and output an Exception like this in the terminal:

/usr/share/themes/Mist/gtk-2.0/gtkrc:48: Engine "mist" is unsupported, ignoring
---Registering Weka Editors---
java.lang.NullPointerException
        at weka.gui.explorer.PreprocessPanel.addPropertyChangeListener(PreprocessPanel.java:519)
        at javax.swing.plaf.synth.SynthPanelUI.installListeners(SynthPanelUI.java:49)
        at javax.swing.plaf.synth.SynthPanelUI.installUI(SynthPanelUI.java:38)
        at javax.swing.JComponent.setUI(JComponent.java:652)
        at javax.swing.JPanel.setUI(JPanel.java:131)

This behavior happens only under Java 1.5 and Gnome/Linux; KDE doesn't produce this error. The reason for this is that Weka tries to look more "native" and
a dataset D consisting of samples over (x, y). The learning task consists of finding an appropriate Bayesian network given a data set D over U.

All Bayes network algorithms implemented in Weka assume the following for the data set:

- all variables are discrete finite variables. If you have a data set with continuous variables, you can use the following filter to discretize them: weka.filters.unsupervised.attribute.Discretize

- no instances have missing values. If there are missing values in the data set, values are filled in using the following filter: weka.filters.unsupervised.attribute.ReplaceMissingValues

The first step performed by buildClassifier is checking if the data set fulfills those assumptions. If those assumptions are not met, the data set is automatically filtered and a warning is written to STDERR.

Inference algorithm

To use a Bayesian network as a classifier, one simply calculates argmax_y P(y|x) using the distribution P(U) represented by the Bayesian network. Now note that

P(y|x) = P(U) / P(x) ∝ P(U) = ∏_{u∈U} P(u | pa(u))    (8.1)

And since all variables in x are known, we do not need complicated inference algorithms, but just calculate (8.1) for all class values.

Learning algorithms

The dual nature of a Bayesian network makes learning a Bayesian network a natural two-stage process: first learn a network structure, then learn the probability tables.

There are various approaches to structure learning, and in Weka
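Because all variables in x are observed, scoring a class value is just a product of probability table lookups followed by normalization. The self-contained sketch below is not WEKA code: it uses a hypothetical two-node network y → x with invented probabilities to illustrate equation (8.1).

```java
public class BayesInferenceSketch {
    // priorY[y] = P(y); pXgivenY[y][x] = P(x|y); x is the observed value.
    // Returns P(y|x) for all class values by normalizing P(y) * P(x|y).
    static double[] posterior(double[] priorY, double[][] pXgivenY, int x) {
        double[] score = new double[priorY.length];
        double sum = 0.0;
        for (int y = 0; y < priorY.length; y++) {
            score[y] = priorY[y] * pXgivenY[y][x];  // product over all nodes
            sum += score[y];
        }
        for (int y = 0; y < score.length; y++)
            score[y] /= sum;                        // divide by P(x)
        return score;
    }

    public static void main(String[] args) {
        double[] priorY = {0.6, 0.4};                    // invented P(y)
        double[][] pXgivenY = {{0.7, 0.3}, {0.2, 0.8}};  // invented P(x|y)
        double[] post = posterior(priorY, pXgivenY, 1);
        System.out.printf("P(y=0|x=1) = %.2f, P(y=1|x=1) = %.2f%n",
                post[0], post[1]);
    }
}
```

Here the unnormalized scores are 0.6 * 0.3 = 0.18 and 0.4 * 0.8 = 0.32, so the posterior is 0.36 versus 0.64 and class 1 wins the argmax.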
a string: an arbitrarily long list of characters, enclosed in "double quotes". Additional types are date and relational, which are not covered here but in the ARFF chapter. The external representation of an Instances class is an ARFF file, which consists of a header describing the attribute types and the data as a comma-separated list. Here is a short, commented example. A complete description of the ARFF file format can be found here.

% This is a toy example, the UCI weather dataset.
% Any relation to real weather is purely coincidental.

Comment lines at the beginning of the dataset should give an indication of its source, context and meaning.

@relation golfWeatherMichigan_1988/02/10_14days

Here we state the internal name of the dataset. Try to be as comprehensive as possible.

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}

Here we define two nominal attributes, outlook and windy. The former has three values: sunny, overcast and rainy; the latter two: TRUE and FALSE. Nominal values with special characters, commas or spaces are enclosed in 'single quotes'.

@attribute temperature real
@attribute humidity real

These lines define two numeric attributes. Instead of real, integer or numeric can also be used. While double floating point values are stored internally, only seven decimal digits are usually processed.

The last attribute is the default target or class variable used
allowing you to browse the results. Clicking with the left mouse button into the text area, while holding Alt and Shift, brings up a dialog that enables you to save the displayed output in a variety of formats (currently, BMP, EPS, JPEG and PNG). Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:

1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.

2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.

3. The results of the chosen test mode are broken down thus:

4. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.

5. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.

6. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.

7. Source code (optional). This section lists the Java source code if one chose "Output source code" in the "More options" dialog.

4.3.6 The Result List

After training several classifiers, the result list will contain several entries. Left-clicking the
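The relationship between the Confusion Matrix and the accuracy figure in the Summary section can be sketched in a few lines. The matrix below is invented rather than actual Explorer output: the diagonal holds the correctly classified instances, so accuracy is the diagonal sum divided by the total.

```java
public class ConfusionMatrixSketch {
    // cm[actual][predicted]; returns the fraction of correctly classified instances
    static double accuracy(int[][] cm) {
        int correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++)
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];   // diagonal = correct predictions
            }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // invented 3-class matrix, e.g. for the three Iris classes
        int[][] cm = {{50, 0, 0}, {0, 47, 3}, {0, 4, 46}};
        System.out.printf("Correctly classified: %.2f%%%n", 100.0 * accuracy(cm));
    }
}
```

For this matrix, 143 of 150 instances lie on the diagonal, i.e. about 95.33% correctly classified.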
an experiment and doesn't serialize anything else. Its sole purpose is to save the setup of a specific experiment, and it can therefore not store any built models. Thanks to this limitation we'll never run into problems with mismatching SerialUIDs.

This kind of serialization is always available and can be selected via a Filter (*.xml) in the Save/Open Dialog of the Experimenter.

The DTD is very simple and looks like this (for version 3.4.5):

<!DOCTYPE object
[
   <!ELEMENT object (#PCDATA | object)*>
   <!ATTLIST object name      CDATA #REQUIRED>
   <!ATTLIST object class     CDATA #REQUIRED>
   <!ATTLIST object primitive CDATA "no">
   <!ATTLIST object array     CDATA "no">
   <!ATTLIST object null      CDATA "no">
   <!ATTLIST object version   CDATA "3.4.5">
]
>

Prior to versions 3.4.5 and 3.5.0 it looked like this:

<!DOCTYPE object
[
   <!ELEMENT object (#PCDATA | object)*>
   <!ATTLIST object name      CDATA #REQUIRED>
   <!ATTLIST object class     CDATA #REQUIRED>
   <!ATTLIST object primitive CDATA "yes">
   <!ATTLIST object array     CDATA "no">
]
>

Responsible Class(es):

weka.experiment.xml.XMLExperiment

for general Serialization:

weka.core.xml.XMLSerialization
weka.core.xml.XMLBasicSerialization

- KOML (http://old.koalateam.com/xml/serialization/)

The Koala Object Markup Language (KOML) is published under the LGPL (http://www.gnu.org/copyleft/lgpl.html) and is an alternative
are very basic algorithms and only support building of the model.

buildAssociations(Instances)

Like the buildClassifier(Instances) method, this method completely rebuilds the model. Subsequent calls of this method with the same dataset must result in exactly the same model being built. This method also tests the training data against the capabilities:

public void buildAssociations(Instances data) throws Exception {
  // other necessary setups

  // test data against capabilities
  getCapabilities().testWithFail(data);

  // actual model generation
}

getCapabilities()

See section "Capabilities" on page 242 for more information.

toString()

should output some information on the generated model. Even though this is not required, it is rather useful for the user to get some feedback on the built model.

main(String[])

executes the associator from command line. If your new algorithm is called FunkyAssociator, then use the following code as your main method:

/**
 * Main method for executing this associator.
 *
 * @param args the options, use "-h" to display options
 */
public static void main(String[] args) {
  AbstractAssociator.runAssociator(new FunkyAssociator(), args);
}

Testing

For some basic tests from the command line, you can use the following test class:

weka.associations.CheckAssociator -W classname [further options]

For junit tests, you can subclass the weka.associations.AbstractAssociatorTest
attributes, you need to override this method, as it has to return the regression value predicted by the model.

main(String[])

executes the classifier from command line. If your new algorithm is called FunkyClassifier, then use the following code as your main method:

/**
 * Main method for executing this classifier.
 *
 * @param args the options, use "-h" to display options
 */
public static void main(String[] args) {
  runClassifier(new FunkyClassifier(), args);
}

Note: the static method runClassifier (defined in the abstract superclass weka.classifiers.Classifier) handles all the appropriate calls and catches and processes any exceptions as well.

Meta-classifiers

Meta-classifiers define a range of other methods that you might want to override. Normally, this should not be the case. But if your classifier requires the base classifier(s) to be of a certain type, you can override the specific set-method and add additional checks.

SingleClassifierEnhancer

The following methods are used for handling the single base classifier of this meta-classifier.

defaultClassifierString()

returns the class name of the classifier that is used as the default one for this meta-classifier.

setClassifier(Classifier)

sets the classifier object. Override this method if you require further checks, like that the classifiers need to be of a certain class. This is necessary if you still want to allow the user to parametrize
button and drag the node to the desired position.

Selecting groups of nodes

Drag the left mouse button in the graph panel. A rectangle is shown and all nodes intersecting with the rectangle are selected when the mouse is released. Selected nodes are made visible with four little black squares at the corners (see screenshot above).

The selection can be extended by keeping the shift key pressed while selecting another set of nodes.

The selection can be toggled by keeping the ctrl key pressed. All nodes in the selection selected in the rectangle are de-selected, while the ones not in the selection but intersecting with the rectangle are added to the selection.

Groups of nodes can be moved by keeping the left mouse button pressed on one of the selected nodes and dragging the group to the desired position.

File menu

[Figure: File menu with New, Load (Ctrl+O), Save (Ctrl+S), Save As, Print (Ctrl+P), Export and Exit entries]

The New, Save, Save As, and Exit menu entries provide functionality as expected. The file format used is XML BIF [20].

There are two file formats supported for opening:

- .xml for XML BIF files. The Bayesian network is reconstructed from the information in the file. Node width information is not stored, so the nodes are shown with the default width. This can be changed by laying out the graph (menu Tools/Layout).

- .arff Weka data files. When an arff file is selected, a new empty
due to different randomization of the data (see section 6.4 for more information on randomization).

The code snippet below performs 10-fold cross-validation with a J48 decision tree algorithm on a dataset newData, with a random number generator that is seeded with "1". The summary of the collected statistics is output to stdout.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import java.util.Random;
...
Instances newData = ...;  // from somewhere
Evaluation eval = new Evaluation(newData);
J48 tree = new J48();
eval.crossValidateModel(tree, newData, 10, new Random(1));
System.out.println(eval.toSummaryString("\nResults\n\n", false));

The Evaluation object in this example is initialized with the dataset used in the evaluation process. This is done in order to inform the evaluation about the type of data that is being evaluated, ensuring that all internal data structures are set up correctly.

Train/test set

Using a dedicated test set to evaluate a classifier is just as easy as cross-validation. But instead of providing an untrained classifier, a trained classifier has to be provided now. Once again, the weka.classifiers.Evaluation class is used to perform the evaluation, this time using the evaluateModel method.

The code snippet below trains a J48 with default options on a training set and evaluates it on a test set before outputting the summary of the collected
is a parent, child or sibling of the classifier node, nothing happens; otherwise an arrow is added. If set to false, no such arrows are added.

scoreType determines the score metric used (see Section 8.2.1 for details). Currently K2, BDe, AIC, Entropy and MDL are implemented.

maxNrOfParents is an upper bound on the number of parents of each of the nodes in the network structure learned.

8.2.1 Local score metrics

We use the following conventions to identify counts in the database D and a network structure B_S. Let r_i (1 <= i <= n) be the cardinality of x_i. We use q_i to denote the cardinality of the parent set of x_i in B_S, that is, the number of different values to which the parents of x_i can be instantiated. So, q_i can be calculated as the product of cardinalities of nodes in pa(x_i):

    q_i = \prod_{x_j \in pa(x_i)} r_j

Note pa(x_i) = \emptyset implies q_i = 1. We use N_{ij} (1 <= i <= n, 1 <= j <= q_i) to denote the number of records in D for which pa(x_i) takes its jth value. We use N_{ijk} (1 <= i <= n, 1 <= j <= q_i, 1 <= k <= r_i) to denote the number of records in D for which pa(x_i) takes its jth value and for which x_i takes its kth value. So, N_{ij} = \sum_{k=1}^{r_i} N_{ijk}. We use N to denote the number of records in D.

Let the entropy metric H(B_S, D) of a network structure and database be defined as

    H(B_S, D) = -N \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N} \log \frac{N_{ijk}}{N_{ij}}    (8.2)

and the number of parameters K as

    K = \sum_{i=1}^{n} (r_i - 1) \cdot q_i    (8.3)

AIC metric The AIC metric Q_{AIC}(B_S, D)
      10.5.3 Instance weights . . . . . . . . . . . . . . . . . . . 171

11 Converters                                                     173

14 Windows databases                                              185

IV Appendix                                                       189

      15.1 Citing Weka . . . . . . . . . . . . . . . . . . . . . . 191
      15.2 Paper references . . . . . . . . . . . . . . . . . . . . 191

16 Using the API                                                  195
      16.1 Option handling . . . . . . . . . . . . . . . . . . . . 196
      16.2 Loading data . . . . . . . . . . . . . . . . . . . . . . 198
            16.2.1 Loading data from files . . . . . . . . . . . . 198
            16.2.2 Loading data from databases . . . . . . . . . . 199
      16.3 Creating datasets in memory . . . . . . . . . . . . . . 202
            16.3.1 Defining the format . . . . . . . . . . . . . . 202
            16.7.1 Building a clusterer . . . . . . . . . . . . . . 216
            16.8.3 Using the API directly . . . . . . . . . . . . . 224
false)

	Initial structure is empty (instead of Naive Bayes)

-mbc
	Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.

-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
	Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.TAN
-mbc
	Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
	Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.ci.CISearchAlgorithm
-mbc
	Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
	Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm
-cardinality <num>
	When determining whether an edge exists, a search is performed for a set Z that separates the nodes. MaxCardinality determines the maximum size of the set Z. This greatly influences the length of the search. (default 2)
-mbc
	Applies a Markov Bl
[Screenshot: the ArffViewer showing the hungarian-14-heart-disease dataset (heart-h.arff) with the "Optimal column width (all)" option applied; visible attributes include sex, chest_pain, trestbps, chol, fbs, restecg, thalach and exang, with their Nominal/Numeric types shown in the column headers.]
     // filter doesn't need class to be set
     return result;
   }

   public boolean batchFinished() throws Exception {
     if (getInputFormat() == null)
       throw new NullPointerException("No input instance format defined");

     // output format still needs to be set (depends on first batch of data)
     if (!isFirstBatchDone()) {
       Instances outFormat = new Instances(getInputFormat(), 0);
       outFormat.insertAttributeAt(new Attribute(
         "blah-" + getInputFormat().numInstances()), outFormat.numAttributes());
       setOutputFormat(outFormat);
     }

     Instances inst = getInputFormat();
     Instances outFormat = getOutputFormat();
     for (int i = 0; i < inst.numInstances(); i++) {
       double[] newValues = new double[outFormat.numAttributes()];
       double[] oldValues = inst.instance(i).toDoubleArray();
       System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
       newValues[newValues.length - 1] = i;
       push(new Instance(1.0, newValues));
     }

     flushInput();
     m_NewBatch = true;
     m_FirstBatchDone = true;

     return (numPendingOutput() != 0);
   }

   public static void main(String[] args) {
     runFilter(new BatchFilter2(), args);
   }

BatchFilter3

As soon as this batch filter's first batch is done, it can process Instance objects immediately in the input(Instance) method. It adds a new attribute which contains just a random number, but the random number generator being used is seeded with the number of instances from the first batch.

import weka.core.
find conditional independencies between x and y given a set of nodes Z in the data. For each pair of nodes x, y, we consider sets Z starting with cardinality 0, then 1, up to a user-defined maximum. Furthermore, the set Z is a subset of nodes that are neighbors of both x and y. If an independency is identified, the edge between x and y is removed from the skeleton.

The first step in directing arrows is to check, for every configuration x -- z -- y where x and y are not connected in the skeleton, whether z is in the set Z of variables that justified removing the link between x and y (cached in the first step). If z is not in Z, we can assign direction x -> z <- y.

Finally, a set of graphical rules is applied [25] to direct the remaining arrows:

Rule 1: i -> j -- k, with i and k not connected  =>  j -> k
Rule 2: i -> j -> k, with i -- k  =>  i -> k
Rule 3: m -- i, m -- j, m -- k, with i -> j <- k and i, k not connected  =>  m -> j
Rule 4: a similar four-node configuration with m connected to i, j and k directs i -> m and k -> m (see [25] for the exact preconditions)
Rule 5: if no edges are directed, take a random one (the first we can find)

The ICS algorithm comes with the following options:

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm, with options markovBlanketClassifier (False), maxCardinality (2) and scoreType (BAYES). The About text reads: "This Bayes Network learning algorithm uses conditional independence tests to find a skeleton, finds V-nodes and applies a set of rules to find the directions of the remaining arrows."]
footprint, as the training data does not have to fit in memory. ARFF files, for instance, can be read incrementally (see section 16.2).

Training an incremental classifier happens in two stages:

1. initialize the model by calling the buildClassifier(Instances) method. One can either use a weka.core.Instances object with no actual data or one with an initial set of data.
2. update the model row by row, by calling the updateClassifier(Instance) method.

The following example shows how to load an ARFF file incrementally using the ArffLoader class and train the NaiveBayesUpdateable classifier with one row at a time:

 import weka.core.Instance;
 import weka.core.Instances;
 import weka.core.converters.ArffLoader;
 import weka.classifiers.bayes.NaiveBayesUpdateable;
 import java.io.File;
 ...
 // load data
 ArffLoader loader = new ArffLoader();
 loader.setFile(new File("/some/where/data.arff"));
 Instances structure = loader.getStructure();
 structure.setClassIndex(structure.numAttributes() - 1);

 // train NaiveBayes
 NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
 nb.buildClassifier(structure);
 Instance current;
 while ((current = loader.getNextInstance(structure)) != null)
   nb.updateClassifier(current);

16.6.2 Evaluating a classifier

Building a classifier is only one part of the equation; evaluating how well it performs is another important part. WEKA supports two types of evaluation:

- Cross-validation: If one only has a
 @attribute play {yes,no}

for prediction. In our case it is a nominal attribute with two values, making this a binary classification problem.

The rest of the dataset consists of the token @data, followed by comma-separated values for the attributes, one line per example. In our case there are five examples:

 @data
 sunny,FALSE,85,85,no
 sunny,TRUE,80,90,no
 overcast,FALSE,83,86,yes
 rainy,FALSE,70,96,yes
 rainy,FALSE,68,80,yes

In our example, we have not mentioned the attribute type string, which defines "double-quoted" string attributes for text mining. In recent WEKA versions, date/time attribute types are also supported.

By default, the last attribute is considered the class/target variable, i.e. the attribute which should be predicted as a function of all other attributes. If this is not the case, specify the target variable via -c. The attribute numbers are one-based indices, i.e. -c 1 specifies the first attribute.

Some basic statistics and validation of given ARFF files can be obtained via the main() routine of weka.core.Instances:

 java weka.core.Instances data/soybean.arff

weka.core offers some other useful routines, e.g. converters.C45Loader and converters.CSVLoader, which can be used to import C45 datasets and comma/tab-separated datasets respectively, e.g.:

 java weka.core.converters.CSVLoader data.csv > data.arff
 java weka.core.converters.C45Loader c45_filestem
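For reference, a complete minimal ARFF file matching the fragments above could look like this. This is a sketch: the declarations of the first four attributes are reconstructed from the order of the values in the @data lines, not quoted from the manual:

```
% minimal weather dataset (reconstructed sketch)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny,FALSE,85,85,no
sunny,TRUE,80,90,no
overcast,FALSE,83,86,yes
rainy,FALSE,70,96,yes
rainy,FALSE,68,80,yes
```

Such a file can be checked directly with the weka.core.Instances main() routine mentioned above.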
generator
-F <file>
	The BIF file to obtain the structure from.

The network structure is generated by first generating a tree, so that we can ensure that we have a connected graph. If any more arrows are specified, they are randomly added.

8.8 Inspecting Bayesian networks

You can inspect some of the properties of Bayesian networks that you learned in the Explorer in text format and also in graphical format.

Bayesian networks in text

Below, you find output typical for a 10-fold cross-validation run in the Weka Explorer, with comments where the output is specific for Bayesian nets.

=== Run information ===

Scheme: weka.classifiers.bayes.BayesNet -D -B iris.xml -Q weka.classifiers.bayes.n

Options for BayesNet include the class names for the structure learner and for the distribution estimator.

Relation:   iris-weka.filters.unsupervised.attribute.Discretize-B2-M-1.0-Rfirst-last
Instances:  150
Attributes: 5
            sepallength
            sepalwidth
            petallength
            petalwidth
            class
Test mode:  10-fold cross-validation

=== Classifier model (full training set) ===

Bayes Network Classifier
not using ADTree

Indication whether the ADTree algorithm [24] for calculating counts in the data set was used.

#attributes=5 #classindex=4

This line lists the number of attributes and the index of the class variable for which the classifier was trained.

Network structure (nodes followed by parents)
sepallength(2): c
> data.arff

1.2.2 Classifier

Any learning algorithm in WEKA is derived from the abstract weka.classifiers.Classifier class. Surprisingly little is needed for a basic classifier: a routine which generates a classifier model from a training dataset (= buildClassifier) and another routine which evaluates the generated model on an unseen test dataset (= classifyInstance), or generates a probability distribution for all classes (= distributionForInstance).

A classifier model is an arbitrarily complex mapping from all-but-one dataset attributes to the class attribute. The specific form and creation of this mapping, or model, differs from classifier to classifier. For example, ZeroR's (= weka.classifiers.rules.ZeroR) model just consists of a single value: the most common class, or the median of all numeric values in case of predicting a numeric value (= regression learning). ZeroR is a trivial classifier, but it gives a lower bound on the performance of a given dataset which should be significantly improved by more complex classifiers. As such it is a reasonable test of how well the class can be predicted without considering the other attributes.

Later, we will explain how to interpret the output from classifiers in detail. For now, just focus on the Correctly Classified Instances in the section Stratified cross-validation and notice how it improves from ZeroR to J48:

 java weka.classifiers.rules.ZeroR -t
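The three routines named above (buildClassifier, classifyInstance, distributionForInstance) can be exercised with ZeroR in a small self-contained sketch. The dataset below is invented for illustration, and weka.jar (version 3.6) is assumed to be on the classpath:

```java
import weka.classifiers.rules.ZeroR;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

public class ZeroRDemo {
    public static void main(String[] args) throws Exception {
        // build a tiny dataset: one numeric attribute plus a nominal class
        FastVector play = new FastVector();
        play.addElement("yes");
        play.addElement("no");
        FastVector atts = new FastVector();
        atts.addElement(new Attribute("temperature"));
        atts.addElement(new Attribute("play", play));
        Instances data = new Instances("weather-lite", atts, 4);
        data.setClassIndex(1);
        double[][] rows = { {85, 1}, {80, 1}, {83, 1}, {70, 0} };  // 3x "no", 1x "yes"
        for (double[] r : rows)
            data.add(new Instance(1.0, r));

        ZeroR zr = new ZeroR();
        zr.buildClassifier(data);                      // learns the majority class
        double pred = zr.classifyInstance(data.instance(0));
        double[] dist = zr.distributionForInstance(data.instance(0));
        System.out.println(data.classAttribute().value((int) pred));
        System.out.println(dist[0] + " " + dist[1]);
    }
}
```

Since ZeroR ignores all input attributes, every instance receives the same prediction, which is exactly why it serves as a baseline.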
has gone wrong. In that case you should restart the WEKA Explorer.

4.1.5 Graphical output

Most graphical displays in WEKA, e.g. the GraphVisualizer or the TreeVisualizer, support saving the output to a file. A dialog for saving the output can be brought up with Alt+Shift+left-click. Supported formats are currently Windows Bitmap, JPEG, PNG and EPS (encapsulated Postscript). The dialog also allows you to specify the dimensions of the generated image.

4.2 Preprocessing

[Screenshot: the Explorer's Preprocess panel after start-up, with the Open file..., Open URL..., Open DB... and Generate... buttons, the Filter panel (Choose/Apply), and the empty Current relation and Selected attribute panels. The status bar reads "Welcome to the Weka Explorer".]

4.2.1 Loading Data

The first four buttons at the top of the preprocess section enable you to load data into WEKA:

1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.

2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.

3. Open DB.... Reads data from a database. (Note that to make this work yo
impose the same checks when getting/setting parameters.

Randomization

In order to get repeatable experiments, one is not allowed to use unseeded random number generators like Math.random(). Instead, one has to instantiate a java.util.Random object in the buildClassifier(Instances) method with a specific seed value. The seed value can be user supplied, of course, which all the Randomizable... abstract classifiers already implement.

Capabilities

By default, the weka.classifiers.Classifier superclass returns an object that denotes that the classifier can handle any type of data. This is useful for rapid prototyping of new algorithms, but also very dangerous. If you do not specifically define what type of data can be handled by your classifier, you can end up with meaningless models or errors. This can happen if you devise a new classifier which is supposed to handle only numeric attributes. By using the value(int/Attribute) method of a weka.core.Instance to obtain the numeric value of an attribute, you also obtain the internal format of nominal, string and relational attributes. Of course, treating these attribute types as numeric ones does not make any sense. Hence it is highly recommended (and required for contributions) to override this method in your own classifier.

There are three different types of capabilities that you can define:

1. attribute related, e.g. nominal, numeric, date, missing values, ...
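The repeatability point can be illustrated with plain java.util.Random, independent of any WEKA class (SeedDemo is an invented name for this sketch):

```java
import java.util.Random;

public class SeedDemo {
    public static void main(String[] args) {
        // two generators created with the same seed produce the same
        // sequence, so any experiment built on them is repeatable
        Random r1 = new Random(1);
        Random r2 = new Random(1);
        for (int i = 0; i < 5; i++) {
            int a = r1.nextInt(100);
            int b = r2.nextInt(100);
            if (a != b)
                throw new AssertionError("sequences differ");
        }
        // Math.random(), by contrast, offers no way to set the seed,
        // so results would change from run to run
        System.out.println("seeded sequences match");
    }
}
```

This is exactly why the Randomizable... base classes expose the seed as a user-settable option.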
in WEKA is weka.classifiers.Classifier, an abstract class. Your new classifier must be derived from this class at least, to be visible through the GenericObjectEditor. But in order to make implementations of new classifiers even easier, WEKA already comes with a range of other abstract classes derived from weka.classifiers.Classifier. In the following you will find an overview that will help you decide what base class to use for your classifier. For better readability, the weka.classifiers prefix was dropped from the class names:

- simple classifier
  - Classifier: not randomizable
  - RandomizableClassifier: randomizable
- meta classifier
  - single base classifier
    - SingleClassifierEnhancer: not randomizable, not iterated
    - RandomizableSingleClassifierEnhancer: randomizable, not iterated
    - IteratedSingleClassifierEnhancer: not randomizable, iterated
    - RandomizableIteratedSingleClassifierEnhancer: randomizable, iterated
  - multiple base classifiers
    - MultipleClassifiersCombiner: not randomizable
    - RandomizableMultipleClassifiersCombiner: randomizable

If you are still unsure about what superclass to choose, then check out the Javadoc of those superclasses. In the Javadoc you will find all the classifiers that are derived from it, which should give you a better idea whether this particular superclass is suited for your needs.

17.1.2 Additional interfaces

The abstract classes listed above basically just impl
in the menu.

- Next click on the Classifiers tab at the top of the window and scroll along the toolbar until you reach the J48 component in the trees section. Place a J48 component on the layout.

- Connect the CrossValidationFoldMaker to J48 TWICE by first choosing trainingSet and then testSet from the pop-up menu for the CrossValidationFoldMaker.

- Next go back to the Evaluation tab and place a ClassifierPerformanceEvaluator component on the layout. Connect J48 to this component by selecting the batchClassifier entry from the pop-up menu for J48.

- Next go to the Visualization toolbar and place a TextViewer component on the layout. Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the text entry from the pop-up menu for ClassifierPerformanceEvaluator.

- Now start the flow executing by selecting Start loading from the pop-up menu for ArffLoader. Depending on how big the data set is and how long cross-validation takes, you will see some animation from some of the icons in the layout (J48's tree will grow in the icon and the ticks will animate on the ClassifierPerformanceEvaluator). You will also see some progress information in the Status bar and Log at the bottom of the window.

When finished, you can view the results by choosing Show results from the pop-up menu for the TextViewer component.

Other cool things to add to this flow: connect a TextViewer and/or a GraphViewer to
     for (int i = 0; i < inst.numInstances(); i++) {
       double[] values = new double[result.numAttributes()];
       for (int n = 0; n < inst.numAttributes(); n++)
         values[n] = inst.instance(i).value(n);
       values[values.length - 1] = i;
       result.add(new Instance(1.0, values));
     }

     return result;
   }

   public static void main(String[] args) {
     runFilter(new SimpleBatch(), args);
   }

17.2.2.2 SimpleStreamFilter

Only the following abstract methods need to be implemented for a stream filter:

- globalInfo(): returns a short description of what the filter does; will be displayed in the GUI
- determineOutputFormat(Instances): generates the new format, based on the input data
- process(Instance): processes a single instance and turns it from the old format into the new one
- getRevision(): returns the Subversion revision information, see section "Revisions"

If more options are necessary, then the following methods need to be overridden:

- listOptions(): returns an enumeration of the available options; these are printed if one calls the filter with the -h option
- setOptions(String[]): parses the given option array that was passed from the command line
- getOptions(): returns an array of options, resembling the current setup of the filter

See also section 17.1.4.1, covering "Methods" for classifiers.

In the following an example implementation of a stream filter that ad
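To see how the four abstract methods fit together, here is a minimal pass-through stream filter. This is a hypothetical sketch, not the manual's example: the class name PassThroughFilter is invented, the filter simply copies instances through unchanged, and weka.jar (version 3.6) is assumed to be on the classpath:

```java
import weka.core.Instance;
import weka.core.Instances;
import weka.core.RevisionUtils;
import weka.filters.SimpleStreamFilter;

// hypothetical pass-through stream filter, for illustration only
public class PassThroughFilter extends SimpleStreamFilter {

    public String globalInfo() {
        return "A stream filter that passes instances through unchanged.";
    }

    protected Instances determineOutputFormat(Instances inputFormat)
        throws Exception {
        // the output format equals the input format; a real filter
        // would add, remove or modify attributes here
        return new Instances(inputFormat, 0);
    }

    protected Instance process(Instance instance) throws Exception {
        // no transformation; a real filter would change the values here
        return (Instance) instance.copy();
    }

    public String getRevision() {
        return RevisionUtils.extract("$Revision: 1 $");
    }

    public static void main(String[] args) {
        runFilter(new PassThroughFilter(), args);
    }
}
```

Because process() handles one Instance at a time, such a filter never needs to buffer a whole batch, which is what makes it suitable for incremental processing.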
is also possible to Visualize reduced data, or, if you have used an attribute transformer such as PrincipalComponents, Visualize transformed data. The reduced/transformed data can be saved to a file with the Save reduced data... or Save transformed data... option.

In case one wants to reduce/transform a training and a test set at the same time and not use the AttributeSelectedClassifier from the classifier panel, it is best to use the AttributeSelection filter (a supervised attribute filter) in batch mode (-b) from the command line or in the SimpleCLI. The batch mode allows one to specify an additional input and output file pair (options -r and -s) that is processed with the filter setup that was determined based on the training data (specified by options -i and -o).

Here is an example for a Unix/Linux bash:

 java weka.filters.supervised.attribute.AttributeSelection \
   -E "weka.attributeSelection.CfsSubsetEval" \
   -S "weka.attributeSelection.BestFirst -D 1 -N 5" \
   -b \
   -i <input1.arff> \
   -o <output1.arff> \
   -r <input2.arff> \
   -s <output2.arff>

Notes:

- The "backslashes" at the end of each line tell the bash that the command is not finished yet. Using the SimpleCLI, one has to use this command in one line without the backslashes.

- It is assumed that WEKA is available in the CLASSPATH, otherwise one has to use the -classpath option.

- The full filter setup is output in the
is listed as well.

Finally, the divergence between the network distribution on file and the one learned is reported. This number is calculated by enumerating all possible instantiations of all variables, so it may take some time to calculate the divergence for large networks.

The remainder of the output is standard output for all classifiers.

Time taken to build model: 0.01 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      116      77.3333 %
Incorrectly Classified Instances     34      22.6667 %
... etc ...

Bayesian networks in GUI

To show the graphical structure, right click the appropriate BayesNet in the result list of the Explorer. A menu pops up, in which you select "Visualize graph".

[Screenshot: the Explorer's Classify panel with BayesNet -D -B iris.xml -Q weka.classifiers.bayes.net.search.local.K2 -- -P 2 -S BAYES -E weka.classifiers.bayes.net.estimate.S... selected, 10-fold cross-validation chosen under Test options, and the summary (116 correct, 77.3333 %; 34 incorrect, 22.6667 %; Kappa statistic 0.66) shown in the Classifier output area.]
    ...
    java weka.filters.supervised.instance.Resample -S 0 -Z 83 -c last \
      -i $run.t$nr.$f -o $run.t$nr.p1.$f
  end
  echo "Run $run of $f done."
end
end

If meta classifiers are used, i.e. classifiers whose options include classifier specifications, for example StackingC or ClassificationViaRegression, care must be taken not to mix the parameters. E.g.:

 java weka.classifiers.meta.ClassificationViaRegression \
   -W weka.classifiers.functions.LinearRegression -S 1 \
   -t data/iris.arff -x 2

gives us an illegal options exception for -S 1. This parameter is meant for LinearRegression, not for ClassificationViaRegression, but WEKA does not know this by itself. One way to clarify this situation is to enclose the classifier specification, including all parameters, in "double" quotes, like this:

 java weka.classifiers.meta.ClassificationViaRegression \
   -W "weka.classifiers.functions.LinearRegression -S 1" \
   -t data/iris.arff -x 2

However, this does not always work, depending on how the option handling was implemented in the top-level classifier. While for Stacking this approach would work quite well, for ClassificationViaRegression it does not. We get the dubious error message that the class weka.classifiers.functions.LinearRegression -S 1 cannot be found. Fortunately, there is another approach: all parameters given after -- are processed by the first su
length tl of the tabu list.

- Genetic search: applies a simple implementation of a genetic search algorithm to network structure learning. A Bayes net structure is represented by an array of n*n (n = number of nodes) bits, where bit i*n + j represents whether there is an arrow from node j -> i.

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.search.local.GeneticSearch, with options descendantPopulationSize (100), markovBlanketClassifier (False), populationSize (10), runs (10), scoreType (BAYES), seed (1), useCrossOver (True), useMutation (True) and useTournamentSelection (False). The About text reads: "This Bayes Network learning algorithm uses genetic search for finding a well scoring Bayes network structure."]

Specific options:

populationSize is the size of the population selected in each generation.

descendantPopulationSize is the number of offspring generated in each generation.

runs is the number of generations to generate.

seed is the initialization value for the random number generator.

useMutation is a flag to indicate whether mutation should be used. Mutation is applied by randomly adding or deleting a single arc.

useCrossOver is a flag to indicate whether cross-over should be used. Cross-over is applied by randomly picking an index k in the bit representation and selecting the first
 <!ELEMENT options (option)*>
 <!ATTLIST options type CDATA "classifier">
 <!ATTLIST options value CDATA "">
 <!ELEMENT option (#PCDATA | options)*>
 <!ATTLIST option name CDATA #REQUIRED>
 <!ATTLIST option type (flag | single | hyphens | quotes) "single">

The type attribute of the option tag needs some explanation. There are currently four different types of options in WEKA:

- flag
  The simplest option that takes no arguments, like e.g. the -V flag for inverting a selection:

   <option name="V" type="flag"/>

- single
  The option takes exactly one parameter, directly following after the option, e.g. for specifying the training file with -t somefile.arff. Here the parameter value is just put between the opening and closing tag. Since single is the default value for the type attribute, we don't need to specify it explicitly:

   <option name="t">somefile.arff</option>

- hyphens
  Meta-classifiers like AdaBoostM1 take another classifier as option with the -W option, where the options for the base classifier follow after the --. And here it is where the fun starts: where to put parameters for the base classifier if the meta-classifier itself is a base classifier for another meta-classifier?

E.g., does -W weka.classifiers.trees.J48 -- -C 0.001 become this:

 <option name="W" type="hyphens">
   <options type="classifier" value="weka.cla
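One way the nested case can be written out in full, following the DTD above, is shown below. This is a sketch of the encoding the DTD permits; the exact output produced by WEKA's XML option writer may differ:

```xml
<option name="W" type="hyphens">
  <options type="classifier" value="weka.classifiers.trees.J48">
    <option name="C">0.001</option>
  </options>
</option>
```

The nested options element carries the base classifier's class name in its value attribute, while the base classifier's own parameters become child option elements, so arbitrary nesting of meta-classifiers remains representable.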
All of WEKA's savers are available.

[Screenshot: DataSinks toolbar with DatabaseSaver, XRFFSaver, SerializedInstancesSaver and the other savers.]

6.3.3 Filters

All of WEKA's filters are available.

[Screenshot: Filters toolbar with the supervised and unsupervised filters, e.g. AttributeSelection, ClassOrder, Discretize, NominalToBinary, Resample, SpreadSubsample, StratifiedRemoveFolds and Add.]

6.3.4 Classifiers

All of WEKA's classifiers are available.

[Screenshot: Classifiers toolbar showing the bayes classifiers, e.g. ComplementNaiveBayes, NaiveBayes, NaiveBayesMultinomial, NaiveBayesMultinomialUpdateable, NaiveBayesSimple, NaiveBayesUpdateable and AODE.]

6.3.5 Clusterers

All of WEKA's clusterers are available.

[Screenshot: Clusterers toolbar, e.g. Cobweb, EM, FarthestFirst, FilteredClusterer, MakeDensityBasedClusterer, SimpleKMeans and XMeans.]

6.3.6 Evaluation

[Screenshot: Evaluation toolbar.]
manual application of a filter in the Preprocess panel, since the data gets processed on the fly. Useful if one needs to try out different filter setups.

4.4.5 Learning Clusters

The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and Visualize tree. The latter is grayed out when it is not applicable.

4.5 Associating

[Screenshot: the Explorer's Associate panel running Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1 on the weather data. The Associator output lists the sizes of the sets of large itemsets L(1) to L(4) and the best rules found, e.g.:
  outlook=overcast 4 ==> play=yes 4 conf:(1)
  temperature=cool 4 ==> humidity=normal 4 conf:(1)
  humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
  outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
  outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
  outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)]
of a Bayesian network structure B_S for a database D is

Q_{AIC}(B_S, D) = H(B_S, D) + K    (8.4)

A term P(B_S) can be added representing prior information over network structures, but will be ignored for simplicity in the WEKA implementation.

MDL metric The minimum description length metric Q_{MDL}(B_S, D) of a Bayesian network structure B_S for a database D is defined as

Q_{MDL}(B_S, D) = H(B_S, D) + \frac{K}{2} \log N    (8.5)

Bayesian metric The Bayesian metric of a Bayesian network structure B_S for a database D is

Q_{Bayes}(B_S, D) = P(B_S) \prod_{i=0}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}

where P(B_S) is the prior on the network structure (taken to be constant, hence ignored in the WEKA implementation) and \Gamma(\cdot) the gamma-function. N'_{ij} and N'_{ijk} represent choices of priors on counts, restricted by N'_{ij} = \sum_{k=1}^{r_i} N'_{ijk}. With N'_{ijk} = 1 (and thus N'_{ij} = r_i), we obtain the K2 metric [19]

Q_{K2}(B_S, D) = P(B_S) \prod_{i=0}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(r_i - 1 + N_{ij})!} \prod_{k=1}^{r_i} N_{ijk}!

With N'_{ijk} = 1/(r_i \cdot q_i) (and thus N'_{ij} = 1/q_i), we obtain the BDe metric [22].

8.2.2 Search algorithms

The following search algorithms are implemented for local score metrics:

- K2 [19]: hill climbing, adding arcs with a fixed ordering of variables. Specific option: randomOrder. If true, a random ordering of the nodes is made at the beginning of the search. If false (default), the ordering in the data set is used. The only exception in both cases is that if the initial network is a naive Bayes network (initAsNaiveBayes set true), the class variable is made first in the ordering.
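For intuition about the penalty terms in the AIC and MDL metrics, the number of free parameters K is the sum over all nodes of q_i * (r_i - 1), where r_i is the node's cardinality and q_i the product of its parents' cardinalities. A minimal sketch (plain Java, not WEKA code); the naive Bayes network over the weather data used in main() is an illustrative assumption:

```java
public class NetworkScorePenalty {
    // K: number of free parameters of the network,
    // sum over nodes i of q_i * (r_i - 1)
    static int numParameters(int[] cardinality, int[][] parents) {
        int k = 0;
        for (int i = 0; i < cardinality.length; i++) {
            int q = 1; // product of parent cardinalities
            for (int p : parents[i]) q *= cardinality[p];
            k += q * (cardinality[i] - 1);
        }
        return k;
    }

    // MDL penalty added to the entropy term H(B_S, D): (K / 2) * log N
    static double mdlPenalty(int k, int n) {
        return k / 2.0 * Math.log(n);
    }

    public static void main(String[] args) {
        // assumed example: naive Bayes over the weather data, class
        // play(2) as parent of outlook(3), temperature(3),
        // humidity(2), windy(2)
        int[] card = {2, 3, 3, 2, 2};
        int[][] parents = {{}, {0}, {0}, {0}, {0}};
        System.out.println(numParameters(card, parents)); // 13
    }
}
```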
of jdbcDriver (properties files need unique keys!).

- The jdbcURL property has a spelling error and tries to use a non-existing protocol, or you listed it multiple times, which doesn't work either (remember: properties files need unique keys!).

184 CHAPTER 13. DATABASES

Chapter 14

Windows databases

A common query we get from our users is how to open a Windows database in the WEKA Explorer. This page is intended as a guide to help you achieve this. It is a complicated process and we cannot guarantee that it will work for you. The process described makes use of the JDBC-ODBC bridge that is part of Sun's JRE/JDK 1.3 (and higher).

The following instructions are for Windows 2000. Under other Windows versions there may be slight differences.

Step 1: Create a User DSN

1. Go to the Control Panel.
2. Choose Administrative Tools.
3. Choose Data Sources (ODBC).
4. At the User DSN tab, choose Add...
5. Choose database:

- Microsoft Access
  (a) Note: Make sure your database is not open in another application before following the steps below.
  (b) Choose the Microsoft Access driver and click Finish.
  (c) Give the source a name by typing it into the Data Source Name field.
  (d) In the Database section, choose Select...
  (e) Browse to find your database file, select it and click OK.
  (f) Click OK to finalize your DSN.

- Microsoft SQL Server 2000 (Desktop Engine)
  (a) Choose the SQL Server driver and click Finish.
  (b) Give the source a name by typ
one can save the current setup of the experiment to a file by clicking on Save... at the top of the window.

[screenshot: the Save dialog, file type "Experiment configuration files (*.exp)"]

By default, the format of the experiment files is the binary format that Java serialization offers. The drawback of this format is the possible incompatibility between different versions of WEKA. A more robust alternative to the binary format is the XML format.

Previously saved experiments can be loaded again via the Open... button.

62 CHAPTER 5. EXPERIMENTER

5.2.1.8 Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 runs of 10-fold stratified cross-validation on the Iris dataset using the ZeroR and J48 schemes.

[screenshot: the Run tab before starting; status "Not running"]

Click Start to run the experiment.

[screenshot: the Run tab after a successful run, with the log
16:17:12: Started
16:17:12: Finished
16:17:12: There were 0 errors]

If the experiment was defined correctly, the 3 messages shown a
one way or another. RandomizableClusterer, RandomizableDensityBasedClusterer and RandomizableSingleClustererEnhancer all implement this interface already.

Methods

In the following, a short description of methods that are common to all cluster algorithms (see also the Javadoc for the Clusterer interface):

buildClusterer(Instances)

Like the buildClassifier(Instances) method, this method completely rebuilds the model. Subsequent calls of this method with the same dataset must result in exactly the same model being built. This method also tests the training data against the capabilities of this clusterer:

public void buildClusterer(Instances data) throws Exception {
  // test data against capabilities
  getCapabilities().testWithFail(data);
  ...
  // actual model generation
}

clusterInstance(Instance)

returns the index of the cluster the provided Instance belongs to.

17.3. WRITING OTHER ALGORITHMS 261

distributionForInstance(Instance)

returns the cluster membership for this Instance object. The membership is a double array containing the probabilities for each cluster.

numberOfClusters()

returns the number of clusters that the model contains, after the model has been generated with the buildClusterer(Instances) method.

getCapabilities()

see section "Capabilities" on page 242 for more information.

toString()

should output some information on the generated model. Even though this is not required, it is rather useful for the user
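The relationship between the two prediction methods can be illustrated with a small sketch (plain Java, not WEKA code): the index returned by clusterInstance() is typically the cluster with the highest membership probability in the array returned by distributionForInstance():

```java
public class ClusterMembership {
    // returns the index of the largest value, i.e. the most
    // probable cluster for a given membership distribution
    static int argmax(double[] distribution) {
        int best = 0;
        for (int i = 1; i < distribution.length; i++)
            if (distribution[i] > distribution[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        // memberships for three clusters; they sum to 1
        double[] membership = {0.2, 0.7, 0.1};
        System.out.println(argmax(membership)); // 1
    }
}
```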
or removes rows from a batch of data, this no longer works when working in single-row processing mode. This makes sense if one thinks of a scenario involving the FilteredClassifier meta-classifier: after the training phase (first batch of data), the classifier will get evaluated against a test set, one instance at a time. If the filter now removes the only instance or adds instances, it can no longer be evaluated correctly, as the evaluation expects to get only a single result back. This is the reason why instance-based filters only pass through any subsequent batch of data without processing it. The Resample filters, for instance, act like this.

One can find example classes for filtering in the wekaexamples.filters package of the WEKA Examples collection [8].

16.5. FILTERING 207

The following example uses the Remove filter (the filter is located in package weka.filters.unsupervised.attribute) to remove the first attribute from a dataset. For setting the options, the setOptions(String[]) method is used.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R";                         // "range"
options[1] = "1";                          // first attribute
Remove remove = new Remove();              // new instance of filter
remove.setOptions(options);                // set options
remove.setInputFormat(data);               // inform filter about dataset **AFTER** setting options
Instances newData = Filter.useFilter(data, remove);
94.  single dataset and wants to get a  reasonable realistic evaluation  Setting the number of folds equal to the  number of rows in the dataset will give one leave one out cross validation   LOOCV     e Dedicated test set     The test set is solely used to evaluate the built clas   sifier  It is important to have a test set that incorporates the same  or  similar  concepts as the training set  otherwise one will always end up  with poor performance     The evaluation step  including collection of statistics  is performed by the Evaluation  class  package weka classifiers      Cross validation    The crossValidateModel method of the Evaluation class is used to perform  cross validation with an untrained classifier and a single dataset  Supplying an  untrained classifier ensures that no information leaks into the actual evaluation   Even though it is an implementation requirement  that the buildClassifier  method resets the classifier  it cannot be guaranteed that this is indeed the case      leaky    implementation   Using an untrained classifier avoids unwanted side   effects  as for each train test set pair  a copy of the originally supplied classifier  is used    Before cross validation is performed  the data gets randomized using the  supplied random number generator  java util Random   It is recommended  that this number generator is    seeded    with a specified seed value  Otherwise   subsequent runs of cross validation on the same dataset will not yield the same  results 
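The effect of seeding the random number generator can be demonstrated with plain Java (a sketch, not WEKA code): randomizing the same index list twice with the same seed yields the same order, which is what makes repeated cross-validation runs reproducible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeededShuffle {
    // shuffles the instance indices 0..n-1 with a seeded Random,
    // as done before cross-validation is performed
    static List<Integer> shuffledIndices(int n, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        return idx;
    }

    public static void main(String[] args) {
        // same seed => identical order on every run
        System.out.println(shuffledIndices(10, 1).equals(shuffledIndices(10, 1))); // true
        // a different seed will (almost always) give a different order
    }
}
```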
the dataset containing the class attribute and remove the class attribute, using the Remove filter (this filter is located in package weka.filters.unsupervised.attribute)

- build the clusterer with this new data

- evaluate the clusterer now with the original data

And here are the steps translated into code, using EM as the clusterer being evaluated:

1. create a copy of data without class attribute

Instances data = ...                       // from somewhere
Remove filter = new Remove();
filter.setAttributeIndices("" + (data.classIndex() + 1));
filter.setInputFormat(data);
Instances dataClusterer = Filter.useFilter(data, filter);

2. build the clusterer

EM clusterer = new EM();
// set further options for EM, if necessary...
clusterer.buildClusterer(dataClusterer);

3. evaluate the clusterer

ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(clusterer);
eval.evaluateClusterer(data);

// print results
System.out.println(eval.clusterResultsToString());

220 CHAPTER 16. USING THE API

16.7.3 Clustering instances

Clustering of instances is very similar to classifying unknown instances when using classifiers. The following methods are involved:

- clusterInstance(Instance): determines the cluster the Instance would belong to.

- distributionForInstance(Instance): predicts the cluster membership for this Instance. The sum of this array adds up to 1.

The code fragment outlined below trains an EM clusterer on one dataset
96.  the network structure   after a network structure is learned  This ensures that all  nodes in the network are part of the Markov blanket of the  classifier node      S  LOO CV k Fold CV Cumulative CV      Q    Score type  LOO CV k Fold CV Cumulative CV     Use probabilistic or 0 1 scoring    default probabilistic scoring     weka classifiers bayes net search fixed FromFile     B  lt BIF File gt     Name of file containing network structure in BIF format    weka classifiers bayes net search fixed NaiveBayes    8 7  RUNNING FROM THE COMMAND LINE 137    No options     Overview of options for estimators    e weka classifiers bayes net estimate BayesNetEstimator     A  lt alpha gt   Initial count  alpha     e weka classifiers bayes net estimate BMAEstimator     k2   Whether to use K2 prior    A  lt alpha gt    Initial count  alpha     e weka classifiers bayes net estimate MultiNomialBMAEstimator     k2   Whether to use K2 prior    A  lt alpha gt    Initial count  alpha     e weka classifiers bayes net estimate SimpleEstimator     A  lt alpha gt   Initial count  alpha     Generating random networks and artificial data sets    You can generate random Bayes nets and data sets using  weka classifiers bayes net BayesNetGenerator  The options are      B  Generate network  instead of instances    N  lt integer gt   Nr of nodes   A  lt integer gt   Nr of arcs   M  lt integer gt   Nr of instances   C  lt integer gt   Cardinality of the variables   S  lt integer gt   Seed for random number
97.  the same table and are temporarily locked out   this will  resolve itself so just leave your experiment running   in fact  it is a sign  that the experiment is working     5 4  REMOTE EXPERIMENTS 87    e If you serialized an experiment and then modify your Database Utils props  file due to an error  e g   a missing type mapping   the Experimenter will  use the DatabaseUtils props you had at the time you serialized the ex   periment  Keep in mind that the serialization process also serializes the  Database Utils class and therefore stored your props file  This is another  reason for storing your experiments as XML and not in the properietary  binary format the Java serialization produces     e Using a corrupt or incomplete Database Utils props file can cause peculiar  interface errors  for example disabling the use of the     User    button along   side the database URL  If in doubt copy a clean Database Utils props from  Subversion  1      e If you get NullPointerException at java util Hashtable get   in  the Remote Engine do not be alarmed  This will have no effect on the  results of your experiment     88 CHAPTER 5  EXPERIMENTER    5 5 Analysing Results  5 5 1 Setup    Weka includes an experiment analyser that can be used to analyse the results  of experiments  in this example  the results were sent to an InstancesResultLis   tener   The experiment shown below uses 3 schemes  ZeroR  OneR  and J48  to  classify the Iris data in an experiment using 10 train and test runs  wi
98.  to a polygon   which is always closed      Once an area of the plot has been selected using Rectangle  Polygon or  Polyline  it turns grey  At this point  clicking the Submit button removes all  instances from the plot except those within the grey selection area  Clicking on  the Clear button erases the selected area without affecting the graph    Once any points have been removed from the graph  the Submit button  changes to a Reset button  This button undoes all previous removals and  returns you to the original graph with all points included  Finally  clicking the  Save button allows you to save the currently visible instances to a new ARFF  file     52    CHAPTER 4  EXPLORER    Chapter 5    Experimenter    5 1 Introduction    The Weka Experiment Environment enables the user to create  run  modify   and analyse experiments in a more convenient manner than is possible when  processing the schemes individually  For example  the user can create an exper   iment that runs several schemes against a series of datasets and then analyse  the results to determine if one of the schemes is  statistically  better than the  other schemes    The Experiment Environment can be run from the command line using the  Simple CLI  For example  the following commands could be typed into the CLI  to run the OneR scheme on the Iris dataset using a basic train and test process    Note that the commands would be typed on one line into the CLI      java weka experiment Experiment  r  T data iris arf
value is assigned index 0; this means that, internally, this value is stored as a 0. When a SparseInstance is written, string instances with internal value 0 are not output, so their string value is lost (and when the ARFF file is read again, the default value 0 is the index of a different string value, so the attribute value appears to change). To get around this problem, add a dummy string value at index 0 that is never used, whenever you declare string attributes that are likely to be used in SparseInstance objects and saved as Sparse ARFF files.

166 CHAPTER 9. ARFF

9.4 Instance weights in ARFF files

A weight can be associated with an instance in a standard ARFF file by appending it to the end of the line for that instance and enclosing the value in curly braces. E.g.:

@data
0, X, 0, Y, "class A", {5}

For a sparse instance, this example would look like:

@data
{1 X, 3 Y, 4 "class A"}, {5}

Note that any instance without a weight value specified is assumed to have a weight of 1 for backwards compatibility.

Chapter 10

XRFF

The XRFF (Xml attribute Relation File Format) is a format for representing the data in a way that can store comments, attribute and instance weights.

10.1 File extensions

The following file extensions are recognized as XRFF files:

- .xrff: the default extension of XRFF files

- .xrff.gz: the extension for gzip-compressed XRFF files (see Compression section for more details)

10.2 Comparison

10.2.1 ARFF
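The instance-weight syntax from section 9.4 above can be recognized with a few lines of code. A minimal sketch (plain Java, not WEKA's actual parser) that extracts the trailing {w} weight from a data line, defaulting to 1:

```java
public class InstanceWeight {
    // parses the optional trailing "..., {w}" weight of an ARFF
    // @data line; instances without a weight default to 1.
    // For a sparse line like "{1 X, 3 Y}", the only '{' is at
    // position 0, so no weight is found and 1 is returned.
    static double parseWeight(String dataLine) {
        String s = dataLine.trim();
        if (s.endsWith("}")) {
            int open = s.lastIndexOf('{');
            if (open > 0)
                return Double.parseDouble(s.substring(open + 1, s.length() - 1).trim());
        }
        return 1.0;
    }

    public static void main(String[] args) {
        System.out.println(parseWeight("0, X, 0, Y, \"class A\", {5}")); // 5.0
        System.out.println(parseWeight("{1 X, 3 Y, 4 \"class A\"}"));    // 1.0
    }
}
```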
we will now explain the output of a typical classifier, weka.classifiers.trees.J48. Consider the following call from the command line (or start the WEKA Explorer and train J48 on weather.arff):

java weka.classifiers.trees.J48 -t data/weather.arff -i

J48 pruned tree
---------------

outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  :     5

Size of the tree :      8

Time taken to build model: 0.05 seconds
Time taken to test model on training data: 0 seconds

The first part, unless you specify -o, is a human-readable form of the training set model. In this case, it is a decision tree: outlook is at the root of the tree and determines the first decision. In case it is overcast, we'll always play golf. The numbers in parentheses at the end of each leaf tell us the number of examples in this leaf. If one or more leaves were not pure (= all of the same class), the number of misclassified examples would also be given, after a slash.

As you can see, a decision tree learns quite fast and is evaluated even faster. E.g. for a lazy learner, testing would take far longer than training.

1.2. BASIC CONCEPTS

=== Error on training data ===

Correctly Classified Instances        14        100 %
Incorrectly Classified Instances       0          0 %
Kappa statistic                        1
Mean absolute error                    0
Root mean squared error                0
Relative absolute error
101.  weather arff  java weka classifiers trees J48  t weather arff    There are various approaches to determine the performance of classifiers  The  performance can most simply be measured by counting the proportion of cor   rectly predicted examples in an unseen test dataset  This value is the accuracy   which is also 1 ErrorRate  Both terms are used in literature    The simplest case is using a training set and a test set which are mutually  independent  This is referred to as hold out estimate  To estimate variance in  these performance estimates  hold out estimates may be computed by repeatedly  resampling the same dataset     i e  randomly reordering it and then splitting it  into training and test sets with a specific proportion of the examples  collecting  all estimates on test data and computing average and standard deviation of  accuracy    A more elaborate method is cross validation  Here  a number of folds n is  specified  The dataset is randomly reordered and then split into n folds of equal  size  In each iteration  one fold is used for testing and the other n 1 folds are  used for training the classifier  The test results are collected and averaged over  all folds  This gives the cross validation estimate of the accuracy  The folds can  be purely random or slightly modified to create the same class distributions in  each fold as in the complete dataset  In the latter case the cross validation is  called stratified  Leave one out    loo  cross validation signifies th
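The accuracy bookkeeping and the n-fold split described above can be sketched in a few lines (plain Java, not WEKA code): accuracy is the proportion of correct predictions (1 - ErrorRate), and n-fold cross-validation partitions the rows into n folds whose sizes differ by at most one.

```java
public class HoldOutEstimate {
    // accuracy = correctly predicted / total; error rate = 1 - accuracy
    static double accuracy(int correct, int total) {
        return (double) correct / total;
    }

    // sizes of the n folds when 'total' rows are split for n-fold
    // cross-validation; the first (total % n) folds get one extra row
    static int[] foldSizes(int total, int n) {
        int[] sizes = new int[n];
        for (int i = 0; i < n; i++)
            sizes[i] = total / n + (i < total % n ? 1 : 0);
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(accuracy(140, 150)); // about 0.933
        // 10-fold CV on the 150-instance Iris data: ten folds of 15
        System.out.println(java.util.Arrays.toString(foldSizes(150, 10)));
    }
}
```

Setting n equal to the number of rows gives folds of size 1, i.e. leave-one-out cross-validation.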
[screenshot: the Setup tab in Advanced mode, with a RandomSplitResultProducer (66% split) feeding a weka.experiment.ClassifierSplitEvaluator, 10 runs on the iris dataset]

Adding Additional Schemes

Additional schemes can be added in the Generator properties panel. To begin, change the drop-down list entry from Disabled to Enabled in the Generator properties panel.

5.2. STANDARD EXPERIMENTS 71

[screenshot: the Setup tab with the Generator properties panel enabled]
weka.classifiers.bayes.net.search.global.SimulatedAnnealing

-A <float>
  Start temperature
-U <integer>
  Number of runs
-D <float>
  Delta temperature
-R <seed>
  Random number seed
-mbc
  Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
  Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q
  Use probabilistic or 0/1 scoring. (default probabilistic scoring)

136 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

weka.classifiers.bayes.net.search.global.TabuSearch

-L <integer>
  Tabu list length
-U <integer>
  Number of runs
-P <nr of parents>
  Maximum number of parents
  Use arc reversal operation. (default false)
  Initial structure is empty (instead of Naive Bayes)
-mbc
  Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
  Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q
  Use probabilistic or 0/1 scoring. (default probabilistic scoring)

weka.classifiers.bayes.net.search.global.TAN

-mbc
  Applies a Markov Blanket correction to
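For intuition about the start temperature and delta temperature options, here is the textbook simulated-annealing acceptance rule (a sketch in plain Java; the exact update in WEKA's SimulatedAnnealing class may differ): an improvement is always accepted, and a worse candidate is accepted with probability exp(delta / temperature), so worse moves become rarer as the temperature decreases.

```java
import java.util.Random;

public class AnnealingAcceptance {
    // delta: change in score of the candidate network (positive = better)
    // temperature: current annealing temperature
    static boolean accept(double delta, double temperature, Random rng) {
        if (delta >= 0) return true; // improvements always pass
        return rng.nextDouble() < Math.exp(delta / temperature);
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        System.out.println(accept(0.5, 10.0, rng)); // true
        // at a very low temperature, a clearly worse move is rejected
        System.out.println(accept(-50.0, 1e-6, rng)); // false
    }
}
```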
104.  with a naive Bayes network structure  After calling search in your cus    tom class  it will add arrows if the markovBlanketClassifier flag is set   to ensure all attributes are in the Markov blanket of the class node     3  If the structure learner has options that are not default options  you  want to implement public Enumeration listOptions    public void  setOptions String   options   public String   getOptions   and  the get and set methods for the properties you want to be able to set     NB 1  do not use the  E option since that is reserved for the BayesNet  class to distinguish the extra options for the SearchAlgorithm class and  the Estimator class  If the  E option is used  it will not be passed to your  SearchAlgorithm  and probably causes problems in the BayesNet class      NB 2  make sure to process options of the parent class if any in the  get  setOpions methods     Adding a new estimator    This is the quick guide for adding a new estimator     1  Create a class that derives from  weka classifiers bayes net estimate BayesNetEstimator  Let   s say  it is called  weka classifiers bayes net estimate MyEstimator     2  Implement the methods  public void initCPTs BayesNet bayesNet     8 12  FAQ 155    public void estimateCPTs BayesNet bayesNet    public void updateClassifier BayesNet bayesNet  Instance instance    and   public double   distributionForInstance BayesNet bayesNet  Instance  instance      3  If the structure learner has options that are not default op
x. In the matrix, this is the diagonal element divided by the sum over the relevant column, i.e. 7/(7+3) = 0.7 for class yes and 2/(2+2) = 0.5 for class no.

The F-Measure is simply 2*Precision*Recall/(Precision+Recall), a combined measure for precision and recall.

These measures are useful for comparing classifiers. However, if more detailed information about the classifier's predictions are necessary, -p outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none). Let's look at the following example. We shall assume soybean-train.arff and soybean-test.arff have been constructed via weka.filters.supervised.instance.StratifiedRemoveFolds as in a previous example:

java weka.classifiers.bayes.NaiveBayes -K -t soybean-train.arff -T soybean-test.arff -p 0

diaporthe-stem-canker 0.9999672587892333 diaporthe-stem-canker
...
rhizoctonia-root-rot 0.9999999395928124 rhizoctonia-root-rot
rhizoctonia-root-rot 0.999998912860593 rhizoctonia-root-rot
rhizoctonia-root-rot 0.9999994386283236 rhizoctonia-root-rot
...
32 phyllosticta-leaf-spot 0.7789710144361445 brown-spot
39 alternarialeaf-spot 0.6403333824349896 brown-spot
44 phyllosticta-leaf-spot 0.893568420641914 brown-spot
46 alternarialeaf-spot 0.5788190397739439 brown-spot
...
73 brown-spot 0.4943768155314637 alternarialeaf-spot
...

The values in each line are separated by a single space.

If we had chosen a range of attributes via -p, e.g.
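These formulas translate directly into code. A minimal sketch (plain Java, not WEKA code) reproducing the 0.7 precision figure above and showing the F-Measure combination:

```java
public class PRF {
    // precision = TP / (TP + FP): diagonal element divided by column sum
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }

    // recall = TP / (TP + FN): diagonal element divided by row sum
    static double recall(int tp, int fn) { return (double) tp / (tp + fn); }

    // F-Measure = 2 * precision * recall / (precision + recall)
    static double fMeasure(double p, double r) { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        System.out.println(precision(7, 3)); // 0.7, as for class yes above
        // combining an assumed precision of 0.7 with an assumed
        // recall of 0.5 (illustrative values, not from the matrix)
        System.out.println(fMeasure(0.7, 0.5)); // roughly 0.583
    }
}
```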
[screenshot: the Analyse tab; the result list shows "16:37:40 - Percent_correct - rules.ZeroR ..."]

Selecting Number_correct as the comparison field and clicking Perform test generates the average number correct (out of 50 test patterns, i.e. 33% of the 150 patterns in the Iris dataset):

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Number_correct
Datasets:   1
Resultsets: 3
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       21/12/05 16:38

Dataset                  (1) rules.ZeroR | (2) rules.OneR (3) trees.J48
-----------------------------------------------------------------------
iris                (10)          17.00  |       48.10 v        48.40 v
-----------------------------------------------------------------------
                          (v/ /*)        |       (1/0/0)        (1/0/0)

Key:
(1) rules.ZeroR "" 48055541465867954
(2) rules.OneR "-B 6" -2459427002147861445
(3) trees.J48 "-C 0.25 -M 2" -217733168393644444

[screenshot: the result list shows "16:38:12 - Number_correct - rules.ZeroR ..."]
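The tester shown above, weka.experiment.PairedCorrectedTTester, uses the corrected resampled t-test of Nadeau and Bengio, which inflates the variance estimate to account for the overlap between resampled train/test sets. A sketch of the statistic (plain Java, not WEKA code; the train/test ratio value in main() is an illustrative assumption for a 66% split):

```java
public class CorrectedPairedTTest {
    // d: per-run differences between two schemes' results
    // testTrainRatio: n_test / n_train, e.g. 34.0/66 for a 66% split
    static double tStatistic(double[] d, double testTrainRatio) {
        int k = d.length;
        double mean = 0;
        for (double v : d) mean += v;
        mean /= k;
        double var = 0; // unbiased sample variance of the differences
        for (double v : d) var += (v - mean) * (v - mean);
        var /= (k - 1);
        // corrected denominator: (1/k + n_test/n_train) * var
        return mean / Math.sqrt((1.0 / k + testTrainRatio) * var);
    }

    public static void main(String[] args) {
        double[] d = {1, 2, 3}; // illustrative differences over 3 runs
        System.out.println(tStatistic(d, 0.5)); // 2 / sqrt((1/3 + 0.5) * 1)
    }
}
```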
[screenshot: ArffViewer displaying the heart disease dataset in spreadsheet format]

114 CHAPTER 7. ARFFVIEWER
[screenshot: the generated dataset displayed in the Explorer]

17.4. EXTENDING THE EXPLORER 273

Experimenter "light"

Purpose

By default the Classify panel only performs 1 run of 10-fold cross-validation. Since most classifiers are rather sensitive to the order of the data being presented to them, those results can be too optimistic or pessimistic. Averaging the results over 10 runs with differently randomized train/test pairs returns more reliable results. And this is where this plugin comes in: it can be used to obtain statistically sound results for a specific classifier/dataset combination, without having to set up a whole experiment in the Experimenter.

Implementation

- Since this plugin is rather bulky, we omit the implementation details, but the following can be said:

  - based on the weka.gui.explorer.ClassifierPanel
  - the actual code doing the work follows the example in the Using the Experiment API wiki article [2]

- In order to add our ExperimentPanel to the
109. 2  class attribute related     e g   no class  nominal  numeric  missing class  values        3  miscellaneous     e g   only multi instance data  minimum number of in   stances in the training data    There are some special cases     e incremental classifiers     need to set the minimum number of instances in  the training data to 0  since the default is 1   setMinimumNumberInstances  0    e multi instance classifiers     in order to signal that the special multi instance  format  bag id  bag data  class  is used  they need to enable the following  capability   enable  Capability ONLY MULTIINSTANCE    These classifiers also need to implement the interface specific to multi   instance  weka core MultilnstanceCapabilitiesHandler  which returns  the capabilities for the bag data    e cluster algorithms     since clusterers are unsupervised algorithms  they  cannot process data with the class attribute set  The capability that  denotes that an algorithm can handle data without a class attribute is  Capability  NO_CLASS    And a note on enabling disabling nominal attributes or nominal class attributes   These operations automatically enable disable the binary  unary and empty  nominal capabilities as well  The following sections list a few examples of how  to configure the capabilities     17 1  WRITING A NEW CLASSIFIER 243    Simple classifier   A classifier that handles only numeric classes and numeric and nominal at   tributes  but no missing values at all  would configure the
An ArffLoader is used in the following example to build the Cobweb clusterer incrementally:

import weka.clusterers.Cobweb;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
...
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));
Instances structure = loader.getStructure();

// train Cobweb
Cobweb cw = new Cobweb();
cw.buildClusterer(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
  cw.updateClusterer(current);
cw.updateFinished();

218 CHAPTER 16. USING THE API

16.7.2 Evaluating a clusterer

Evaluation of clusterers is not as comprehensive as the evaluation of classifiers. Since clustering is unsupervised, it is also a lot harder to determine how good a model is. The class used for evaluating cluster algorithms is ClusterEvaluation (package weka.clusterers).

In order to generate the same output as the Explorer or the command line, one can use the evaluateClusterer method, as shown below:

import weka.clusterers.EM;
import weka.clusterers.ClusterEvaluation;
...
String[] options = new String[2];
options[0] = "-t";
options[1] = "/some/where/somefile.arff";
System.out.println(ClusterEvaluation.evaluateClusterer(new EM(), options));

Or, if the dataset is already present in memory, one can use the following approach:

import weka.clusterers.ClusterEvaluation;
import weka.clus
[Screenshot: the Weka GUI Chooser, version 3.5.8, (c) 1999-2008 The University of Waikato, Hamilton, New Zealand, with buttons for the Explorer, Experimenter, KnowledgeFlow and Simple CLI]

The buttons can be used to start the following applications:

• Explorer An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).

• Experimenter An environment for performing experiments and conducting statistical tests between learning schemes.

• KnowledgeFlow This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.

• SimpleCLI Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.

The menu consists of four sections:

1. Program

[Screenshot: the Program menu with the entries LogWindow, Memory usage and Exit]

• Log Window Opens a log window that captures all that is printed to stdout or stderr. Useful for environments like MS Windows, where WEKA is normally not started from a terminal.

• Exit Closes WEKA.

2. Tools Other useful applications.

[Screenshot: the Tools menu with the entries ArffViewer, SqlViewer and Bayes net editor]

• ArffViewer An MDI application for viewing ARFF files in spreadsheet format.

• SqlViewer Represents an SQL worksheet, for querying databases via JDBC.

• Bayes net editor An application for editing, visualizing
19.2.4 StackOverflowError

Try increasing the stack of your virtual machine. With Sun's JDK you can use this command to increase the stacksize:

  java -Xss512k ...

to set the maximum Java stack size to 512KB. If still not sufficient, slowly increase it.

19.2.5 just-in-time (JIT) compiler

For maximum enjoyment, use a virtual machine that incorporates a just-in-time compiler. This can speed things up quite significantly. Note also that there can be large differences in execution time between different virtual machines.

19.2.6 CSV file conversion

Either load the CSV file in the Explorer or use the CSV converter on the command line as follows:

  java weka.core.converters.CSVLoader filename.csv > filename.arff

19.2.7 ARFF file doesn't load

One way to figure out why ARFF files are failing to load is to give them to the Instances class. At the command line type the following:

  java weka.core.Instances filename.arff

where you substitute "filename" for the actual name of your file. This should return an error if there is a problem reading the file, or show some statistics if the file is ok. The error message you get should give some indication of what is wrong.

19.2.8 Spaces in labels of ARFF files

A common problem people have with ARFF files is that labels can only have spaces if they are enclosed in single quotes, i.e. a label such as: some value should be written either 'some value' or som
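The quoting rule above can be sketched in a few lines. This is a minimal illustration of the rule, not a WEKA utility; the class and method names are made up for this example.

```java
// Hedged sketch of the ARFF label quoting rule described above: a nominal
// label containing spaces must be surrounded by single quotes, while a label
// without spaces can be written as-is. Illustration only, not part of WEKA.
class ArffLabel {
    static String quoteIfNeeded(String label) {
        if (label.contains(" "))
            return "'" + label + "'";   // e.g. some value -> 'some value'
        return label;                   // e.g. some_value stays unquoted
    }

    public static void main(String[] args) {
        System.out.println(quoteIfNeeded("some value"));  // 'some value'
        System.out.println(quoteIfNeeded("some_value"));  // some_value
    }
}
```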
[Screenshot: the Setup tab with the Experiment Type set to a train/test percentage split (Train percentage 66.0), Data sets first selected, and the choice between Classification and Regression]

Additionally, one can choose between Classification and Regression, depending on the datasets and classifiers one uses. For decision trees like J48 (Weka's implementation of Quinlan's C4.5 [10]) and the iris dataset, Classification is necessary; for a numeric classifier like M5P, on the other hand, Regression. Classification is selected by default.

Note: if percentage splits are used, one has to make sure that the corrected paired T-Tester still produces sensible results with the given ratio [9].

5.2.1.4 Datasets

One can add dataset files either with an absolute path or with a relative one. The latter often makes it easier to run experiments on different machines, hence one should check Use relative paths before clicking on Add new...

[Screenshot: a file chooser opened on the weka-3-5-6 directory, with Files of Type set to "Arff data files (*.arff)"]

In this example, open the data directory and choose the iris a
Chapter 8

Bayesian Network Classifiers

8.1 Introduction

Let U = {x1, . . . , xn}, n >= 1, be a set of variables. A Bayesian network B over a set of variables U is a network structure BS, which is a directed acyclic graph (DAG) over U, and a set of probability tables BP = {p(u | pa(u)) | u in U}, where pa(u) is the set of parents of u in BS. A Bayesian network represents a probability distribution P(U) = Π_{u in U} p(u | pa(u)).

Below, a Bayesian network is shown for the variables in the iris data set. Note that the links between the nodes class, petallength and petalwidth do not form a directed cycle, so the graph is a proper DAG.

[Screenshot: the Weka Classifier Graph Visualizer showing the network structure learned for the iris data set]

This picture just shows the network structure of the Bayes net, but for each of the nodes a probability distribution for the node given its parents is specified as well. For example, in the Bayes net above there is a conditional distribution for petallength given the value of class. Since class has no parents, there is an unconditional distribution for sepalwidth.

Basic assumptions

The classification task consists of classifying a variable y = x0, called the class variable, given a set of variables x = x1 . . . xn, called attribute variables. A classifier h : x -> y is a function that maps an instance of x to a value of y. The classifier is learned from
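The factorization P(U) = Π_{u in U} p(u | pa(u)) can be made concrete with a tiny hand-built network. The sketch below is an illustration with made-up CPT numbers for a two-node network C -> L, not WEKA's BayesNet implementation.

```java
// Hedged sketch: computes the joint probability P(C, L) = p(C) * p(L | C)
// for a two-node network C -> L with hand-coded (invented) probability tables.
// Each factor conditions a node on its parents, as in the formula above.
class TinyBayesNet {
    static final double[] pC = {0.6, 0.4};            // p(C = c), C has no parents
    static final double[][] pLgivenC = {{0.9, 0.1},   // p(L = l | C = 0)
                                        {0.2, 0.8}};  // p(L = l | C = 1)

    static double joint(int c, int l) {
        // product over all nodes of p(node | parents)
        return pC[c] * pLgivenC[c][l];
    }

    public static void main(String[] args) {
        double total = 0;
        for (int c = 0; c < 2; c++)
            for (int l = 0; l < 2; l++)
                total += joint(c, l);
        System.out.println("sum over all assignments = " + total); // 1.0
    }
}
```

Summing the joint over every assignment returns 1, which is a quick sanity check that the tables form a proper distribution.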
Cygwin

The process is like with Unix/Linux systems, but since the host system is Win32, and therefore the Java installation also a Win32 application, you'll have to use the semicolon ";" as separator for several jars.

18.2.2 RunWeka.bat

From version 3.5.4, Weka is launched differently under Win32. The simple batch file got replaced by a central launcher class (= RunWeka.class) in combination with an INI file (= RunWeka.ini). The RunWeka.bat only calls this launcher class now with the appropriate parameters. With this launcher approach it is possible to define different launch scenarios, but with the advantage of having placeholders, e.g., for the max heap size, which enables one to change the memory for all setups easily.

The key of a command in the INI file is prefixed with cmd_, all other keys are considered placeholders:

  cmd_blah=java ...        command "blah"
  bloerk= ...              placeholder "bloerk"

A placeholder is surrounded in a command with "#":

  cmd_blah=java #bloerk#

Note: The key wekajar is determined by the -w parameter with which the launcher class is called.

By default, the following commands are predefined:

• default
  The default Weka start, without a terminal window.

• console
  For debugging purposes. Useful as Weka gets started from a terminal window.

• explorer
  The command that's executed if one double-clicks on an ARFF or XRFF file.

In order to change the maximum heap size for all those commands, one only has to modify the ma
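The #placeholder# substitution described above can be sketched with a regular expression. This is an illustration of the scheme, not the actual RunWeka launcher source; the maxheap key and command string are assumptions for the example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hedged sketch of the placeholder expansion described above: a command value
// may reference any non-cmd_ key as #name#, which is replaced by that key's
// value. Unknown placeholders are left untouched. Not WEKA's launcher code.
class IniExpander {
    static final Pattern PLACEHOLDER = Pattern.compile("#([^#]+)#");

    static String expand(String command, Map<String, String> placeholders) {
        Matcher m = PLACEHOLDER.matcher(command);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // fall back to the literal #name# if no such placeholder exists
            String value = placeholders.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("maxheap", "512m");
        System.out.println(expand("java -Xmx#maxheap# -jar weka.jar", props));
        // java -Xmx512m -jar weka.jar
    }
}
```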
Once an experiment has been run, you can analyze results in the Analyse panel. In the Comparison field you will need to scroll down and select "Log_likelihood".

[Screenshot: the Analyse tab after loading 400 results, with the Comparison field set to Log_likelihood, significance 0.05, and the test output of the PairedCorrectedTTester comparing MakeDensityBasedClusterer and EM on the iris and Glass datasets]

5.4 Remote Experiments

Remote experiments enable you to distribute the computing load across multiple computers. In the following we will discuss the setup and operation for HSQLDB and MySQL [3].

5.4.1 Preparation

To run a remote experiment you will need:

• A database server.

• A number of computers to run re
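The PairedCorrectedTTester shown in the test output is based on the corrected resampled t-test (Nadeau and Bengio), which inflates the usual paired-t denominator by the test/train ratio to account for overlapping training sets. The sketch below computes the statistic on invented numbers; it is an illustration of the formula, not WEKA's tester source.

```java
// Hedged sketch of the corrected resampled t-statistic:
//   t = mean(d) / sqrt((1/k + n_test/n_train) * var(d))
// where d are the per-run performance differences of the two schemes and k is
// the number of runs. Illustration only; not weka.experiment.PairedCorrectedTTester.
class CorrectedTTest {
    static double statistic(double[] diffs, double testTrainRatio) {
        int k = diffs.length;
        double mean = 0;
        for (double d : diffs) mean += d;
        mean /= k;
        double var = 0;                       // sample variance of the differences
        for (double d : diffs) var += (d - mean) * (d - mean);
        var /= (k - 1);
        // corrected denominator: the test/train ratio is added to 1/k
        return mean / Math.sqrt((1.0 / k + testTrainRatio) * var);
    }

    public static void main(String[] args) {
        double[] diffs = {0.1, 0.2, 0.3};     // invented per-run differences
        // a 66% train / 34% test percentage split gives ratio 34/66
        System.out.println(statistic(diffs, 34.0 / 66.0));
    }
}
```

This is also why the note in the percentage-split section warns about the split ratio: the ratio enters the denominator directly.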
E. Trans. on Info. Theory, IT-14, 426-467, 1968.

BIBLIOGRAPHY

[19] G. Cooper, E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347, 1992.

[20] Cozman. See http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/ for details on XML BIF.

[21] N. Friedman, D. Geiger, M. Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29, 131-163, 1997.

[22] D. Heckerman, D. Geiger, D.M. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20(3), 197-243, 1995.

[23] S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their applications to expert systems (with discussion). Journal of the Royal Statistical Society B, 50, 157-224, 1988.

[24] Moore, A. and Lee, M.S. Cached sufficient statistics for efficient machine learning with large datasets. JAIR, Volume 8, pages 67-91, 1998.

[25] Verma, T. and Pearl, J.: An algorithm for deciding if a set of observed independencies has a causal explanation. Proc. of the Eighth Conference on Uncertainty in Artificial Intelligence, 323-330, 1992.

[26] GraphViz. See http://www.graphviz.org/doc/info/lang.html for more information on the DOT language.

[27] JMathPlot. See http://code.google.com/p/jmathplot/ for more information on the project.
A node can be renamed by right clicking it and selecting Rename in the popup menu. The following dialog appears, which allows entering a new node name.

[Screenshot: the rename dialog for the node sepallength]

The CPT of a node can be edited manually by selecting the node, right click, Edit CPT. A dialog is shown with a table representing the CPT. When a value is edited, the values of the remainder of the table are updated in order to ensure that the probabilities add up to 1. It attempts to adjust the last column first, then goes backward from there.

[Screenshot: the probability table editor for petallength, with one row per combination of parent values and Randomize, Ok and Cancel buttons]

The whole table can be filled with randomly generated distributions by selecting the Randomize button.

The popup menu shows a list of parents that can be added to the selected node. The CPT for the node is updated by making copies for each value of the new parent.

[Screenshot: the node popup menu with the entries Set evidence, Rename, Delete node, Edit CPT, Add parent, Delete parent, Delete child, Add value, Rename value and Delete value]
Chapter 6

KnowledgeFlow

6.1 Introduction

The KnowledgeFlow provides an alternative to the Explorer as a graphical front end to WEKA's core algorithms. The KnowledgeFlow is a work in progress, so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the KnowledgeFlow but not in the Explorer.

[Screenshot: the KnowledgeFlow window, with the component tabs DataSources, DataSinks, Filters, Classifiers, Clusterers, Associations, Evaluation, Visualization and Plugins above the Knowledge Flow Layout canvas, and a Log panel showing "Welcome to the Weka Knowledge Flow"]

The KnowledgeFlow presents a data-flow inspired interface to WEKA. The user can select WEKA components from a tool bar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing data. At present, all of WEKA's classifiers, filters, clusterers, loaders and savers are available in the KnowledgeFlow, along with some extra tools.

The KnowledgeFlow can handle data either incrementally or in batches (the Explorer handles batch data only). Of course learning from data incrementally requires a classifier that can be
Since the ICS algorithm is focused on recovering causal structure, instead of finding the optimal classifier, the Markov blanket correction can be made afterwards.

Specific options

The maxCardinality option determines the largest subset of Z to be considered in conditional independence tests (x ⊥ y | Z).

The scoreType option is used to select the scoring metric.

8.4 Global score metric based structure learning

[Screenshot: the class tree under weka.classifiers.bayes.net.search, with the packages ci, fixed, global and local and the classes GeneticSearch, HillClimber, K2, RepeatedHillClimber, SimulatedAnnealing, TabuSearch and TAN]

Common options for cross-validation based algorithms are: initAsNaiveBayes, markovBlanketClassifier and maxNrOfParents (see Section 8.2 for a description).

Further, for each of the cross-validation based algorithms the CVType can be chosen out of the following:

• Leave-one-out cross-validation (loo-cv) selects m = N training sets, simply by taking the data set D and removing the i-th record for training set D_i. The validation set consists of just the i-th single record. Loo-cv does not always produce accurate performance estimates.

• K-fold cross-validation (k-fold cv) splits the data D in m approximately equal parts D_1, . . . , D_m. Training set D_i is obtained by removing part D_i from D
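The k-fold splitting idea above can be sketched on plain index lists. This is an illustration of the partitioning scheme only, not WEKA's own Instances.trainCV/testCV implementation; the round-robin assignment is an assumption for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of k-fold cross-validation splitting: indices 0..n-1 are
// dealt round-robin into k approximately equal parts; training set i is
// everything except part i. Not WEKA's implementation.
class KFold {
    static List<List<Integer>> folds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i);  // round-robin
        return folds;
    }

    static List<Integer> trainingSet(List<List<Integer>> folds, int heldOut) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++)
            if (f != heldOut) train.addAll(folds.get(f));     // all parts but one
        return train;
    }

    public static void main(String[] args) {
        List<List<Integer>> f = folds(10, 3);
        System.out.println("fold sizes: " + f.get(0).size() + " "
                + f.get(1).size() + " " + f.get(2).size());   // 4 3 3
        System.out.println("training set for fold 0: " + trainingSet(f, 0));
    }
}
```

With k = n this degenerates into leave-one-out: each validation set is a single record.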
[Screenshot: the Select a property dialog, expanded to show the splitEvaluator properties attributeID, classForIRStatistics, classifier and predTargetColumn]

[Screenshot: the Setup tab in Advanced mode, with the Result generator set to an AveragingResultProducer (expected field Fold, 10 folds) wrapping a CrossValidationResultProducer, and ZeroR, OneR and J48 (-C 0.25 -M 2) listed in the Generator properties panel]

In this experiment, the ZeroR, OneR, and J48 schemes are run 10 times with 10-fold cross-validation. Each set of 10 cross-validation folds is then averaged, producing one result line for each run (instead of one result line for each fold, as in the previous example using the CrossValidationResultProducer) for a total of 30 result lines. If the raw output is saved, all 300 results are sent to the archive.
Click on the classifier entry (ZeroR) to display the scheme properties.

[Screenshot: the GenericObjectEditor for weka.classifiers.rules.ZeroR, "Class for building and using a 0-R classifier", with a Capabilities button and a debug option]

This scheme has no modifiable properties (besides debug mode on/off), but most other schemes do have properties that can be modified by the user. The Capabilities button opens a small dialog listing all the attribute and class types this classifier can handle. Click on the Choose button to select a different scheme. The window below shows the parameters available for the J48 decision-tree scheme. If desired, modify the parameters and then click OK to close the window.

[Screenshot: the GenericObjectEditor for weka.classifiers.trees.J48, "Class for generating a pruned or unpruned C4...", with the options binarySplits (False), confidenceFactor (0.25), debug (False), minNumObj (2), numFolds (3), reducedErrorPruning (False), saveInstanceData (False), seed (1), subtreeRaising (True), unpruned (False) and useLaplace (False)]

The name of the new scheme is displayed in the Result generator panel.
17.2.2 Simple approach

The base filters and interfaces are all located in the following package:

  weka.filters

One can basically divide filters roughly into two different kinds:

• batch filters - they need to see the whole dataset before they can start processing it, which they do in one go

• stream filters - they can start producing output right away and the data just passes through while being modified

You can subclass one of the following abstract filters, depending on the kind of filter you want to implement:

• weka.filters.SimpleBatchFilter

• weka.filters.SimpleStreamFilter

These filters simplify the rather general and complex framework introduced by the abstract superclass weka.filters.Filter. One only needs to implement a couple of abstract methods that will process the actual data, and override, if necessary, a few existing methods for option handling.

17.2.2.1 SimpleBatchFilter

Only the following abstract methods need to be implemented:

• globalInfo() returns a short description of what the filter does; will be displayed in the GUI

• determineOutputFormat(Instances) generates the new format, based on the input data

• process(Instances) processes the whole dataset in one go

• getRevision() returns the Subversion revision information; see the section "Revisions" on page 258

If more options are necessary, then the following methods need to be overridden:

• listOptions() ret
The popup menu shows a list of parents that can be deleted from the selected node. The CPT of the node keeps only the part conditioned on the first value of the parent node.

[Screenshot: the Delete parent submenu of the node popup menu]

The popup menu shows a list of children that can be deleted from the selected node. The CPT of the child node keeps only the part conditioned on the first value of the parent node.

[Screenshot: the Delete child submenu of the node popup menu]

Selecting Add Value from the popup menu brings up this dialog, in which the name of the new value for the node can be specified. The distribution for the node assigns zero probability to the value. Child node CPTs are updated by copying distributions conditioned on the new value.

[Screenshot: the Add value dialog for the node sepallength, with the new value Value4]

The popup menu shows a list of values that can be renamed for the selected node.

[Screenshot: the Rename value submenu of the node popup menu]

Selecting a value brings up the following dialog in which a new name can be specified.
• MySQL

  We won't go into the details of setting up a MySQL server, but this is rather straightforward and includes the following steps:

  - Download a suitable version of MySQL for your server machine.
  - Install and start the MySQL server.
  - Create a database; for our example we will use experiment as database name.
  - Download the appropriate JDBC driver, extract the JDBC jar and place it as mysql.jar in /home/johndoe/jars.

5.4.3 Remote Engine Setup

• First, set up a directory for scripts and policy files:

  /home/johndoe/remote_engine

• Unzip the remoteExperimentServer.jar (from the Weka distribution, or build it from the source with ant remotejar) into a temporary directory.

• Next, copy remoteEngine.jar and remote.policy.example to the /home/johndoe/remote_engine directory.

• Create a script, called /home/johndoe/remote_engine/startRemoteEngine, with the following content (don't forget to make it executable with chmod a+x startRemoteEngine when you are on Linux/Unix):

  - HSQLDB

      java -Xmx256m \
           -classpath /home/johndoe/jars/hsqldb.jar:remoteEngine.jar \
           -Djava.security.policy=remote.policy \
           weka.experiment.RemoteEngine &

  - MySQL

      java -Xmx256m \
           -classpath /home/johndoe/jars/mysql.jar:remoteEngine.jar \
           -Djava.security.policy=remote.policy \
           weka.experiment.RemoteEngine &

• Now we will start the remote engines that run the experiments on the remote computers (note tha
The Filter... button enables one to highlight classifiers that can handle certain attribute and class types. With the Remove filter button all the selected capabilities will get cleared and the highlighting removed again.

To change to a decision tree scheme, select J48 (in subgroup trees).

[Screenshot: the GenericObjectEditor for weka.classifiers.trees.J48, with the options binarySplits (False), confidenceFactor (0.25), debug (False), minNumObj (2), numFolds (3), reducedErrorPruning (False), saveInstanceData (False), seed (1), subtreeRaising (True), unpruned (False) and useLaplace (False)]

The new scheme is added to the Generator properties panel. Click Add to add the new scheme.

[Screenshot: the Setup tab in Advanced mode, with the Result generator set to the RandomSplitResultProducer and the new J48 scheme listed in the Generator properties panel]
<!ELEMENT DEFINITION ( FOR | GIVEN | TABLE | PROPERTY )*>
<!ELEMENT FOR (#PCDATA)>
<!ELEMENT GIVEN (#PCDATA)>
<!ELEMENT TABLE (#PCDATA)>
<!ELEMENT PROPERTY (#PCDATA)>
]>

Responsible Class(es):

  weka.classifiers.bayes.BayesNet#toXMLBIF03()
  weka.classifiers.bayes.net.BIFReader
  weka.gui.graphvisualizer.BIFParser

18.6.5 XRFF files

With Weka 3.5.4 a new, more feature-rich, XML-based data format got introduced: XRFF. For more information, please see Chapter 10.

Chapter 19

Other resources

19.1 Mailing list

The WEKA Mailing list can be found here:

• https://list.scms.waikato.ac.nz/pipermail/wekalist/ for searching previously posted messages

• Mirrors:
  http://news.gmane.org/gmane.comp.ai.weka
  http://www.nabble.com/WEKA-f435.html

Before posting, please read the Mailing List Etiquette:
http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

19.2 Troubleshooting

Here are a few things that are useful to know when you are having trouble installing or running Weka successfully on your machine.

NB: these java commands refer to ones executed in a shell (bash, command prompt, etc.) and NOT to commands executed in the SimpleCLI.

19.2.1 Weka download problems

When you download Weka, make sure that the resulting file size is the same as on our webpage. Otherwise things won't work properly. Apparently some web browsers have trouble downloadi
• Hill Climbing [16]: hill climbing, adding and deleting arcs, with no fixed ordering of variables.
  useArcReversal: if true, arc reversals are also considered when determining the next step to make.

• Repeated Hill Climber starts with a randomly generated network and then applies hill climbing to reach a local optimum. The best network found is returned.
  useArcReversal option as for Hill Climber.

• LAGD Hill Climbing does hill climbing with look-ahead on a limited set of best scoring steps, implemented by Manuel Neubach. The number of look-ahead steps and the number of steps considered for look-ahead are configurable.

• TAN [21]: Tree Augmented Naive Bayes, where the tree is formed by calculating the maximum weight spanning tree using the Chow and Liu algorithm [18]. No specific options.

• Simulated annealing [15]: using adding and deleting arrows.
  The algorithm randomly generates a candidate network B_S' close to the current network B_S. It accepts the network if it is better than the current, i.e., Q(B_S', D) > Q(B_S, D). Otherwise, it accepts the candidate with probability

    e^{t_i * (Q(B_S', D) - Q(B_S, D))}

  where t_i is the temperature at iteration i. The temperature starts at t_0 and slowly decreases with each iteration.

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.search.local.SimulatedAnnealing]
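The acceptance rule quoted above can be sketched in isolation. The sketch below implements exactly that formula with invented score values; it is an illustration, not WEKA's SimulatedAnnealing class.

```java
import java.util.Random;

// Hedged sketch of the simulated-annealing acceptance rule stated above: an
// improving candidate is always accepted, while a worse one is accepted with
// probability e^{t_i * (Q(candidate) - Q(current))}, the difference being
// negative for worse candidates. Illustration only, not WEKA's search code.
class AnnealingStep {
    static double acceptanceProbability(double scoreCurrent,
                                        double scoreCandidate, double t) {
        if (scoreCandidate > scoreCurrent) return 1.0;         // always accept improvements
        return Math.exp(t * (scoreCandidate - scoreCurrent));  // worse: accept sometimes
    }

    static boolean accept(double scoreCurrent, double scoreCandidate,
                          double t, Random rng) {
        return rng.nextDouble() < acceptanceProbability(scoreCurrent, scoreCandidate, t);
    }

    public static void main(String[] args) {
        // a candidate scoring 0.5 worse than the current network, temperature 2.0
        System.out.println(acceptanceProbability(-10.0, -10.5, 2.0)); // e^{-1}
    }
}
```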
Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988

@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

The Data of the ARFF file looks like the following:

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

9.2 Examples

Several well-known machine learning datasets are distributed with Weka in the $WEKAHOME/data directory as ARFF files.

9.2.1 The ARFF Header Section

The ARFF Header section of the file contains the relation declaration and attribute declarations.

The @relation Declaration

The relation name is defined as the first line in the ARFF file. The format is:

  @relation <relation-name>

where <relation-name> is a string. The string must be quoted if the name includes spaces.

The @attribute Declarations

Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement which uniquely d
[Screenshot: the Capabilities filter dialog, with checkboxes for Numeric attributes, Date attributes, String attributes, Relational attributes, Missing values, No class, Unary class, Empty nominal class, Numeric class, Date class, String class, Relational class, Missing class values and Only multi-Instance data]

One can then choose those capabilities an object, e.g., a classifier, should have. If one is looking for a classification problem, then the Nominal class capability can be selected. On the other hand, if one needs a regression scheme, then the capability Numeric class can be selected. This filtering mechanism makes the search for an appropriate learning scheme easier. After applying that filter, the tree with the objects will be displayed again and lists all objects that can handle all the selected capabilities in black, the ones that cannot in grey, and the ones that might be able to handle them in blue (e.g., meta classifiers which depend on their base classifier(s)).

18.5 Properties

A properties file is a simple text file with this structure:

  <key>=<value>

Comments start with the hash sign #.

To make a rather long property line more readable, one can use a backslash to continue on the next line. The Filter property, e.g., looks like this:

  weka.filters.Filter= \
    weka.filters.supervised.attribute, \
    weka.filters.supervised.instance, \
    weka.filters.unsupervised.attribute, \
    weka.filters.unsu
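The format just described (key=value pairs, # comments, backslash continuation) is exactly what java.util.Properties parses, so the continuation behavior is easy to check. The key and values in the sketch are shortened illustrations of the Filter property above, not WEKA's full registry.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hedged sketch: java.util.Properties implements the properties format
// described above, including backslash line continuation, where leading
// whitespace on the continuation line is skipped.
class PropsDemo {
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new RuntimeException(e);  // cannot happen for a StringReader
        }
        return p;
    }

    public static void main(String[] args) {
        String text =
            "# comment line\n" +
            "weka.filters.Filter=\\\n" +
            " weka.filters.supervised.attribute,\\\n" +
            " weka.filters.supervised.instance\n";
        System.out.println(parse(text).getProperty("weka.filters.Filter"));
        // weka.filters.supervised.attribute,weka.filters.supervised.instance
    }
}
```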
(#PCDATA | instances)*>
<!ATTLIST value index CDATA #IMPLIED>
<!ATTLIST value missing (yes|no) "no">
]>

<dataset name="iris" version="3.5.3">
  <header>
    <attributes>
      <attribute name="sepallength" type="numeric"/>
      <attribute name="sepalwidth" type="numeric"/>
      <attribute name="petallength" type="numeric"/>
      <attribute name="petalwidth" type="numeric"/>
      <attribute class="yes" name="class" type="nominal">
        <labels>
          <label>Iris-setosa</label>
          <label>Iris-versicolor</label>
          <label>Iris-virginica</label>
        </labels>
      </attribute>
    </attributes>
  </header>

  <body>
    <instances>
      <instance>
        <value>5.1</value>
        <value>3.5</value>
        <value>1.4</value>
        <value>0.2</value>
        <value>Iris-setosa</value>
      </instance>
      <instance>
        <value>4.9</value>
        <value>3</value>
        <value>1.4</value>
        <value>0.2</value>
        <value>Iris-setosa</value>
      </instance>
    </instances>
  </body>
</dataset>

10.3 Sparse format

The XRFF format also supports a sparse data representation. Even though the iris dataset does not contain sparse data, the above example will be used here to illustrate the sparse format:

<instances>
  <instance type="sparse">
    <
trainPercent box. The number of runs is specified in the Runs panel in the Setup tab.

A small help file can be displayed by clicking More in the About panel:

NAME
weka.experiment.RandomSplitResultProducer

SYNOPSIS
Performs a random train and test using a supplied evaluator.

OPTIONS
outputFile: Set the destination for saving raw output. If the rawOutput option is selected, then output from the splitEvaluator for individual train-test splits is saved. If the destination is a directory, then each output is saved to an individual gzip file; if the destination is a file, then each output is saved as an entry in a zip file.

randomizeData: Do not randomize dataset and do not perform probabilistic rounding if true.

rawOutput: Save raw output (useful for debugging). If set, then output is sent to the destination specified by outputFile.

splitEvaluator: The evaluator to apply to the test data. This may be a classifier, regression scheme etc.

trainPercent: Set the percentage of data to use for training.

Click on the splitEvaluator entry to display the SplitEvaluator properties.

[Screenshot: the GenericObjectEditor for weka.experiment.ClassifierSplitEvaluator, with the properties attributeID (-1), classForIRStatistics (0), classifier (ZeroR) and predTargetColumn (False)]
Subset Evaluator
Including locally predictive attributes

Selected attributes: 1,3 : 2
                     outlook
                     humidity

4.6.1 Searching and Evaluating

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.

4.6.2 Options

The Attribute Selection Mode box has two options:

1. Use full training set. The worth of the attribute subset is determined using the full set of training data.

2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data.

As with Classify (Section 4.3.1), there is a drop-down box that can be used to specify which attribute to treat as the class.

4.6.3 Performing Selection

Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the classify panel. It
Sets file with cost matrix.

-l <name of input file>
Sets model input file. In case the filename ends with ".xml", the options are loaded from the XML file.

-d <name of output file>
Sets model output file. In case the filename ends with ".xml", only the options are saved to the XML file, not the model.

-v
Outputs no statistics for training data.

-o
Outputs statistics only, not the classifier.

-i
Outputs detailed information-retrieval statistics for each class.

-k
Outputs information-theoretic statistics.

-p <attribute range>
Only outputs predictions for test instances (or the train instances if no test instances are provided), along with attributes (0 for none).

-distribution
Outputs the distribution instead of only the prediction (in conjunction with the -p option; only for nominal classes).

-r
Only outputs cumulative margin distribution.

-g
Only outputs the graph representation of the classifier.

-xml filename | xml-string
Retrieves the options from the XML data instead of the command line.

Options specific to weka.classifiers.bayes.BayesNet:

-D
Do not use ADTree data structure.

-B <BIF file>
BIF file to compare with.

-Q weka.classifiers.bayes.net.search.SearchAlgorithm
Search algorithm.

-E weka.classifiers.bayes.net.estimate.SimpleEstimator
Estimator algorithm.

The search algorithm option -Q and the estimator option -E are mandatory.

Note that it is important
THE UNIVERSITY OF WAIKATO
Te Whare Wananga o Waikato

WEKA Manual for Version 3-6-8

Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann, Alex Seewald, David Scuse

August 13, 2012

(c) 2002-2012 University of Waikato, Hamilton, New Zealand
Alex Seewald (original Command-line primer)
David Scuse (original Experimenter tutorial)

This manual is licensed under the GNU General Public License version 2. More information about this license can be found at http://www.gnu.org/copyleft/gpl.html

Contents

I  The Command-line
   1.2.1 Datasets
   1.2.2 Classifiers
   1.2.4 weka.classifiers

II  The Graphical User Interface
   2  Launching WEKA
   3  Simple CLI
      3.3 Command redirection
      3.4 Command completion
   4.1.2 Status Box, WEKA Status Icon
   4.2.1 Loading Data
   4.2.2 The Current Relation
   4.2.3 Working With Attributes
   4.2.4 Working With Filters
   4.3.1 Selecting a Classifier
   4.3.2 Test Options
   4.3.3 The Class Attribute
   4.3.4 Training a Classifier
   4.3.5 The Classifier Output Text
   4.3.6 The Result List
TextViewer.

- Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will change to a cross hairs).

- Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area).

- Next specify an ARFF file to load by first right-clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and browse to the location of your ARFF file.

- Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.

- Now connect the ArffLoader to the ClassAssigner: first right-click over the ArffLoader and select dataSet under Connections in the menu. A rubber-band line will appear. Move the mouse over the ClassAssigner component and left-click; a red line labeled dataSet will connect the two components.

- Next right-click over the ClassAssigner and choose Configure from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).

- Next grab a CrossValidationFoldMaker component from the Evaluation toolbar and place it on the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right-clicking over ClassAssigner and selecting dataSet from under Connections
1.3 Examples

Usually, if you evaluate a classifier for a longer experiment, you will do something like this (for csh):

java -Xmx1024m weka.classifiers.trees.J48 -t data.arff -i -k \
  -d J48-data.model >&! J48-data.out &

The -Xmx1024m parameter for maximum heap size ensures your task will get enough memory. There is no overhead involved: it just leaves more room for the heap to grow. -i and -k give you some additional information, which may be useful, e.g. precision and recall for all classes. In case your model performs well, it makes sense to save it via -d; you can always delete it later. The implicit cross-validation gives a more reasonable estimate of the expected accuracy on unseen data than the training-set accuracy. Both standard error and standard output should be redirected, so you get both errors and the normal output of your classifier. The last & starts the task in the background. Keep an eye on your task via top, and if you notice the hard disk works hard all the time (for Linux), this probably means your task needs too much memory and will not finish in time for the exam. In that case, switch to a faster classifier or use filters, e.g. Resample to reduce the size of your dataset or StratifiedRemoveFolds to create training and test sets; for most classifiers, training takes more time than testing.

So, now you have run a lot of experiments. Which classifier is best? Try

cat *.out | grep -A 3 "Stra
the following areas are distinguished:

(Footnote: If there are missing values in the test data, but not in the training data, the values are filled in in the test data with a ReplaceMissingValues filter based on the training data.)

- local score metrics: Learning a network structure B_S can be considered an optimization problem where a quality measure of a network structure given the training data, Q(B_S|D), needs to be maximized. The quality measure can be based on a Bayesian approach, minimum description length, information and other criteria. Those metrics have the practical property that the score of the whole network can be decomposed as the sum (or product) of the score of the individual nodes. This allows for local scoring and thus local search methods.

- conditional independence tests: These methods mainly stem from the goal of uncovering causal structure. The assumption is that there is a network structure that exactly represents the independencies in the distribution that generated the data. It then follows that if a (conditional) independency can be identified in the data between two variables, there is no arrow between those two variables. Once the locations of the edges are identified, the direction of the edges is assigned such that conditional independencies in the data are properly represented.

- global score metrics: A natural way to measure how well a Bayesian network performs on a given data set is to predict its f
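The decomposability property claimed for local score metrics can be written out explicitly. This is a sketch in generic notation (the per-node score q and the parent set pa(x_i) are symbols we introduce here; they do not appear in the text above):

```latex
Q(B_S \mid D) \;=\; \sum_{i=1}^{n} q\bigl(x_i \mid \mathrm{pa}(x_i),\, D\bigr)
```

Because each term depends only on one node and its parents, changing the parent set of a single node requires re-evaluating only that node's term, which is what makes local search over network structures cheap.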
actAssociatorTest class and add additional tests.

17.4 Extending the Explorer

The plugin architecture of the Explorer allows you to add new functionality easily without having to dig into the code of the Explorer itself. In the following you will find information on how to add new tabs, like the "Classify" tab, and new visualization plugins for the "Classify" tab.

17.4.1 Adding tabs

The Explorer is a handy tool for initial exploration of your data; for proper statistical evaluation, the Experimenter should be used instead. But if the available functionality is not enough, you can always add your own custom-made tabs to the Explorer.

17.4.1.1 Requirements

Here is roughly what is required in order to add a new tab (the examples below go into more detail):

- your class must be derived from javax.swing.JPanel
- the interface weka.gui.explorer.Explorer.ExplorerPanel must be implemented by your class
- optional interfaces:
  - weka.gui.explorer.Explorer.LogHandler, in case you want to take advantage of the logging in the Explorer
  - weka.gui.explorer.Explorer.CapabilitiesFilterChangeListener, in case your class needs to be notified of changes in the Capabilities, e.g., if new data is loaded into the Explorer
- adding the classname of your class to the Tabs property in the Explorer.props file

17.4.1.2 Examples

The following examples demonstrate the plugin architecture. Only t
false positive rate is additionally output with this parameter. All these values can also be computed from the confusion matrix.

This parameter switches the human-readable output of the model description off. In case of support vector machines or NaiveBayes, this makes some sense unless you want to parse and visualize a lot of information.

We now give a short list of selected classifiers in WEKA. Other classifiers below weka.classifiers may also be used. This is easier to see in the Explorer GUI.

- trees.J48: A clone of the C4.5 decision tree learner.

- bayes.NaiveBayes: A Naive Bayesian learner. -K switches on kernel density estimation for numerical attributes, which often improves performance.

- meta.ClassificationViaRegression -W functions.LinearRegression: Multi-response linear regression.

- functions.Logistic: Logistic Regression.

- functions.SMO: Support Vector Machine (linear, polynomial and RBF kernel) with the Sequential Minimal Optimization algorithm due to [4]. Defaults to SVM with linear kernel; -E 5 -C 10 gives an SVM with polynomial kernel of degree 5 and lambda of 10.

- lazy.KStar: Instance-Based learner. -E sets the blend entropy automatically, which is usually preferable.

- lazy.IBk: Instance-Based learner with fixed neighborhood. -K sets the number of neighbors to use. IB1 is equivalent to IBk -K 1.

- rules.JRip: A clone of the RIPPER rule learner.

Based on a simple example
renames the dataset name, all attribute names and nominal attribute values. This is intended for exchanging sensitive datasets without giving away restricted information.

Remove is intended for explicit deletion of attributes from a dataset, e.g. for removing attributes of the iris dataset:

java weka.filters.unsupervised.attribute.Remove -R 1-2 \
  -i data/iris.arff -o iris.simplified.arff

java weka.filters.unsupervised.attribute.Remove -V -R 3-last \
  -i data/iris.arff -o iris.simplified.arff

weka.filters.unsupervised.instance

Resample creates a non-stratified subsample of the given dataset, i.e. random sampling without regard to the class information. Otherwise it is equivalent to its supervised variant.

java weka.filters.unsupervised.instance.Resample -i data/soybean.arff \
  -o soybean-5%.arff -Z 5

RemoveFolds creates cross-validation folds of the given dataset. The class distributions are not retained. The following example splits soybean.arff into training and test datasets, the latter consisting of 25% (1/4) of the data:

java weka.filters.unsupervised.instance.RemoveFolds -i data/soybean.arff \
  -o soybean-train.arff -c last -N 4 -F 1 -V

java weka.filters.unsupervised.instance.RemoveFolds -i data/soybean.arff \
  -o soybean-test.arff -c last -N 4 -F 1

RemoveWithValues filters instances according to the value of an attribute.

java weka.filters.unsupervised.instance.RemoveWithValues -i data/soybean.arff \
  -o soybean-without_herbicide_inj
an network can be learned, or the CPTs of a network can be estimated. A file-choose menu pops up to select the arff file containing the data.

The Learn Network and Learn CPT menus are only active when a data set is specified, either through

- the Tools/Set Data menu, or
- the Tools/Generate Data menu, or
- the File/Open menu when an arff file is selected.

The Learn Network action learns the whole Bayesian network from the data set. The learning algorithms can be selected from the set available in Weka by selecting the Options button in the dialog below. Learning a network clears the undo stack.

[Dialog: Learn Bayesian Network, with an Options button showing the selected estimator, e.g. weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5]

The Learn CPT menu does not change the structure of the Bayesian network, only the probability tables. Learning the CPTs clears the undo stack.

The Layout menu runs a graph layout algorithm on the network and tries to make the graph a bit more readable. When the menu item is selected, the node size can be specified, or left to be calculated by the algorithm based on the size of the labels by deselecting the custom node size check box.

[Dialog: Graph Layout Options, with a Custom Node Size check box and Width/Height fields]

The Show Margins menu item makes marginal distributions visible. These are calculated using the junction tree algorithm [23]. Marginal probabilities for nod
and learning Bayes nets.

3. Visualization: Ways of visualizing data with WEKA.

- Plot: For plotting a 2D plot of a dataset.
- ROC: Displays a previously saved ROC curve.
- TreeVisualizer: For displaying directed graphs, e.g., a decision tree.
- GraphVisualizer: Visualizes XML BIF or DOT format graphs, e.g., for Bayesian networks.
- BoundaryVisualizer: Allows the visualization of classifier decision boundaries in two dimensions.

4. Help: Online resources for WEKA can be found here.

- Weka homepage: Opens a browser window with WEKA's homepage.
- HOWTOs, code snippets, etc.: The general WekaWiki [2], containing lots of examples and HOWTOs around the development and use of WEKA.
- Weka on Sourceforge: WEKA's project homepage on Sourceforge.net.
- SystemInfo: Lists some internals about the Java/WEKA environment, e.g., the CLASSPATH.

To make it easy for the user to add new functionality to the menu without having to modify the code of WEKA itself, the GUI now offers a plugin mechanism for such add-ons. Due to the inherent dynamic class discovery, plugins only need to implement the weka.gui.MainMenuExtension interface and WEKA be notified of the package they reside in to be displayed in the menu under "Extensions" (this extra menu appears automatically as soon as extensions are discovered). More
and test set.

Output source code. If the classifier can output the built model as Java source code, you can specify the class name here. The code will be printed in the "Classifier output" area.

4.3.3 The Class Attribute

The classifiers in WEKA are designed to be trained to predict a single "class" attribute, which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both.

By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

4.3.4 Training a Classifier

Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button.

When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below, but first we investigate the text that has been output.

4.3.5 The Classifier Output Text

The text in the Classifier output area has scroll bars
blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.

-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score type (BAYES, BDeu, MDL, ENTROPY and AIC).

weka.classifiers.bayes.net.search.global.GeneticSearch

-L <integer>
Population size.

-A <integer>
Descendant population size.

-U <integer>
Number of runs.

-M
Use mutation (default true).

-C
Use cross-over (default true).

-O
Use tournament selection (true) or maximum subpopulation (false) (default false).

-R <seed>
Random number seed.

-mbc
Applies a Markov blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.

-S [LOO-CV|k-Fold-CV|Cumulative-CV]
Score type (LOO-CV, k-Fold-CV, Cumulative-CV).

-Q
Use probabilistic or 0/1 scoring (default probabilistic scoring).

weka.classifiers.bayes.net.search.global.HillClimber

-P <nr of parents>
Maximum number of parents.

-R
Use arc reversal operation (default false).

Initial structure is empty (instead of Naive Bayes).

-mbc
Applies a Markov blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Ma
[Screenshot: ARFF-Viewer displaying the hungarian-14-heart-disease relation (heart-h.arff), with columns such as age, sex, chest_pain, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal and num]

For convenience, it is possible to sort the view based on a column (the underlying data is NOT changed; via Edit/Sort data one can sort the data permanently). This enables one to look for specific values, e.g., missing values. To better distinguish missing values from empty cells, the background of cells with missing values is colored grey.

[Screenshot: ARFF-Viewer window showing heart-h.arff with missing-value cells highlighted in grey]
that n is equal to the number of examples. Out of necessity, loo-cv has to be non-stratified, i.e. the class distributions in the test set are not related to those in the training data. Therefore loo-cv tends to give less reliable results. However it is still quite useful in dealing with small datasets, since it utilizes the greatest amount of training data from the dataset.

1.2.3 weka.filters

The weka.filters package is concerned with classes that transform datasets, by removing or adding attributes, resampling the dataset, removing examples and so on. This package offers useful support for data preprocessing, which is an important step in machine learning.

All filters offer the options -i for specifying the input dataset, and -o for specifying the output dataset. If either of these parameters is not given, standard input and/or standard output will be read from/written to. Other parameters are specific to each filter and can be found out via -h, as with any other class. The weka.filters package is organized into supervised and unsupervised filtering, both of which are again subdivided into instance and attribute filtering. We will discuss each of the four subsections separately.

weka.filters.supervised

Classes below weka.filters.supervised in the class hierarchy are for supervised filtering, i.e., taking advantage of the class information. A class must be assigned via -c; for WEKA default behaviour use -c last.
if (data.classIndex() == -1)
  data.setClassIndex(0);

// uses the last attribute as class attribute
if (data.classIndex() == -1)
  data.setClassIndex(data.numAttributes() - 1);

16.2.2 Loading data from databases

For loading data from databases, one of the following two classes can be used:

- weka.experiment.InstanceQuery
- weka.core.converters.DatabaseLoader

The differences between them are that the InstanceQuery class allows one to retrieve sparse data and the DatabaseLoader can retrieve the data incrementally.

Here is an example of using the InstanceQuery class:

import weka.core.Instances;
import weka.experiment.InstanceQuery;
...
InstanceQuery query = new InstanceQuery();
query.setDatabaseURL("jdbc_url");
query.setUsername("the_user");
query.setPassword("the_password");
query.setQuery("select * from whatsoever");
// if your data is sparse, then you can say so, too
query.setSparseData(true);
Instances data = query.retrieveInstances();

And an example using the DatabaseLoader class in "batch retrieval":

import weka.core.Instances;
import weka.core.converters.DatabaseLoader;
...
DatabaseLoader loader = new DatabaseLoader();
loader.setSource("jdbc_url", "the_user", "the_password");
loader.setQuery("select * from whatsoever");
Instances data = loader.getDataSet();

The DatabaseLoader is used in "incremental mode" as follows:

import weka.core.Instance;
import weka.core.Instances;
attributes.

String attributes allow us to create attributes containing arbitrary textual values. This is very useful in text-mining applications, as we can create datasets with string attributes, then write Weka Filters to manipulate strings (like StringToWordVectorFilter). String attributes are declared as follows:

@ATTRIBUTE LCC string

Date attributes

Date attribute declarations take the form

@attribute <name> date [<date-format>]

where <name> is the name for the attribute and <date-format> is an optional string specifying how date values should be parsed and printed (this is the same format used by SimpleDateFormat). The default format string accepts the ISO-8601 combined date and time format: yyyy-MM-dd'T'HH:mm:ss.

Dates must be specified in the data section as the corresponding string representations of the date/time (see example below).

Relational attributes

Relational attribute declarations take the form

@attribute <name> relational
  <further attribute definitions>
@end <name>

For the multi-instance dataset MUSK1 the definition would look like this ("..." denotes an omission):

@attribute molecule_name {MUSK-jf78,...,NON-MUSK-199}
@attribute bag relational
  @attribute f1 numeric
  ...
  @attribute f166 numeric
@end bag
@attribute class {0,1}

9.2.2 The ARFF Data Section

The ARFF Data section of the file contains the data declaration line and the
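Since the date format is the one understood by Java's SimpleDateFormat, the default ISO-8601 pattern can be tried directly. This is a small stand-alone sketch; the class name and the sample date value are made up for illustration:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ArffDateDemo {
    // the default pattern for date attributes (ISO-8601 combined date/time)
    static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss";

    // Parses a value as it would appear in the @data section and formats
    // it back into a string with the same pattern.
    static String roundTrip(String value) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat(PATTERN);
        Date d = fmt.parse(value);
        return fmt.format(d);
    }

    public static void main(String[] args) throws ParseException {
        // parsing and re-formatting reproduces the original string
        System.out.println(roundTrip("2012-08-13T10:15:00"));
    }
}
```

A custom <date-format>, when given in the attribute declaration, would simply replace the PATTERN string here, since both sides use the SimpleDateFormat pattern syntax.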
sub-classifier, another "--" lets us specify parameters for the second sub-classifier, and so on:

java weka.classifiers.meta.ClassificationViaRegression \
  -W weka.classifiers.functions.LinearRegression \
  -t data/iris.arff -x 2 -- -S 1

In some cases, both approaches have to be mixed, for example:

java weka.classifiers.meta.Stacking -B "weka.classifiers.lazy.IBk -K 10" \
  -M "weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression -- -S 1" \
  -t data/iris.arff -x 2

Notice that while ClassificationViaRegression honors the "--" parameter, Stacking itself does not. Sadly the option handling for sub-classifier specifications is not yet completely unified within WEKA, but hopefully one or the other approach mentioned here will work.

Part II

The Graphical User Interface

Chapter 2

Launching WEKA

The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, then this is provided by an alternative launcher called "Main" (class weka.gui.Main).

The GUI Chooser consists of four buttons (one for each of the four major Weka applications) and four menus.

[Screenshot: Weka GUI Chooser, The University of Waikato, with buttons Explorer, Experimenter, KnowledgeFlow; caption "Waikato Environment for Knowledge Analysis", Version
[Table: predictions on the test set, listing for each instance index the actual and predicted class label (good/bad) and the class probability the classifier returned]

Bar plot with probabilities

The PredictionError.java example uses the JMathTools library (needs the jmathplot.jar in the CLASSPATH) to display a simple bar plot of the predictions. The correct predictions are displayed in blue, the incorrect ones in red. In both cases the class probability that the classifier returned for the correct class label is displayed on the y-axis. The x-axis is simply the index of the prediction, starting with 0.

[Screenshot: "Prediction error" window, correct predictions in blue, incorrect in red. Displays the probability the classifier returns for the actual class label.]

Chapter 18

Technical documentation

18.1 ANT

What is ANT? This is how the ANT homepage (http://ant.apache.org/) defines its tool: "Apache Ant is a Java-based build tool. In theory, it is kind of like Make, but without Make's wrinkles."

18.1.1 Basics

- the ANT build file is based on XML
- the usual name for the build file is: build.xml
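To make the basics above concrete, here is a minimal sketch of a build file. The project name, targets and directory layout are invented for illustration; this is not WEKA's actual build.xml:

```xml
<?xml version="1.0"?>
<project name="demo" default="compile" basedir=".">
  <!-- compile everything under src/ into build/classes -->
  <target name="compile">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes" includeantruntime="false"/>
  </target>

  <!-- remove all generated files -->
  <target name="clean">
    <delete dir="build"/>
  </target>
</project>
```

Running ant in the directory containing build.xml executes the default target (here, compile); ant clean runs the clean target instead.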
above will be displayed in the Log panel. The results of the experiment are saved to the dataset Experiment1.arff.

5.2.2 Advanced

5.2.2.1 Defining an Experiment

When the Experimenter is started in Advanced mode, the Setup tab is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.

[Screenshot: Weka Experiment Environment, Setup tab in Advanced mode. Destination: InstancesResultListener -O weka_experiment25619.arff. Result generator: RandomSplitResultProducer -P 66.0 -O splitEvalutorOut.zip -W weka.experiment.ClassifierSplitEvaluator. Runs from 1 to 10, with panels for Distribute experiment, Generator properties, Iteration control and Datasets]

To define the dataset to be processed by a scheme, first select Use relative paths in the Datasets panel of the Setup tab and then click on Add new... to open a dialog window.

[Screenshot: file chooser dialog, "Look In: weka-3-5-6"]
AbstractFilterTest.

17.3 Writing other algorithms

The previous sections covered how to implement classifiers and filters. In the following you will find some information on how to implement clusterers, associators and attribute selection algorithms. The various algorithms are only covered briefly, since other important components (capabilities, option handling, revisions) have already been discussed in the other chapters.

17.3.1 Clusterers

Superclasses and interfaces

All clusterers implement the interface weka.clusterers.Clusterer, but most algorithms will most likely be derived (directly or further up in the class hierarchy) from the abstract superclass weka.clusterers.AbstractClusterer.

weka.clusterers.SingleClustererEnhancer is used for meta-clusterers, like the FilteredClusterer that filters the data on the fly for the base clusterer.

Here are some common interfaces that can be implemented:

- weka.clusterers.DensityBasedClusterer, for clusterers that can estimate the density for a given instance. AbstractDensityBasedClusterer already implements this interface.

- weka.clusterers.UpdateableClusterer, for clusterers that can generate their model incrementally, like CobWeb.

- NumberOfClustersRequestable, for clusterers that allow specifying the number of clusters to generate, like SimpleKMeans.

- weka.core.Randomizable, for clusterers that support randomization in
import weka.core.converters.DatabaseLoader;
...
DatabaseLoader loader = new DatabaseLoader();
loader.setSource("jdbc_url", "the_user", "the_password");
loader.setQuery("select * from whatsoever");
Instances structure = loader.getStructure();
Instances data = new Instances(structure);
Instance inst;
while ((inst = loader.getNextInstance(structure)) != null)
  data.add(inst);

Notes:

- Not all database systems allow incremental retrieval.

- Not all queries have a unique key to retrieve rows incrementally. In that case, one can supply the necessary columns with the setKeys(String) method (comma-separated list of columns).

- If the data cannot be retrieved in an incremental fashion, it is first fully loaded into memory and then provided row by row ("pseudo-incremental").

16.3 Creating datasets in memory

Loading datasets from disk or database are not the only ways of obtaining data in WEKA: datasets can be created in memory or on the fly. Generating a dataset memory structure, i.e., a weka.core.Instances object, is a two-stage process:

1. Defining the format of the data by setting up the attributes.

2. Adding the actual data, row by row.

The class wekaexamples.core.CreateInstances of the Weka Examples collection generates an Instances object containing all attribute types WEKA can handle at the moment.

16.3.1 Defining the format

There are currently five dif
155. classifiers meta Bagging  gt    lt option name  W  type  hyphens  gt    lt options type  classifier  value  weka classifiers meta AdaBoostM1  gt    lt option name  W  type  hyphens  gt    lt options type  classifier  value  weka classifiers trees J48   gt    lt  option gt    lt  options gt    lt  option gt    lt  options gt    lt  option gt      lt option name  B  type  quotes  gt    lt options type  classifier  value  weka classifiers meta Stacking  gt    lt option name  B  type  quotes  gt    lt options type  classifier  value  weka classifiers trees J48   gt    lt  option gt    lt  options gt    lt  option gt      lt option name  t  gt test datasets hepatitis arff lt  option gt    lt  options gt     Note  The type and value attribute of the outermost options tag is not used  while reading the parameters  It is merely for documentation purposes  so that  one knows which class was actually started from the command line     Responsible Class es      weka core xml XMLOptions    292 CHAPTER 18  TECHNICAL DOCUMENTATION    18 6 2 Serialization of Experiments    It is now possible to serialize the Experiments from the WEKA Experimenter  not only in the proprietary binary format Java offers with serialization  with  this you run into problems trying to read old experiments with a newer WEKA  version  due to different SerialUIDs   but also in XML  There are currently two  different ways to do this     e built in  The built in serialization captures only the necessary informations of
import weka.core.Instances;

Instances data = ... // from somewhere

// setup meta-classifier
AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
J48 base = new J48();
classifier.setClassifier(base);
classifier.setEvaluator(eval);
classifier.setSearch(search);

// cross-validate classifier
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(evaluation.toSummaryString());

16.8. SELECTING ATTRIBUTES 223

16.8.2 Using the filter

In case the data only needs to be reduced in dimensionality, but not used for training a classifier, then the filter approach is the right one. The AttributeSelection filter (package weka.filters.supervised.attribute) takes an evaluator and a search algorithm as parameters.
The code snippet below once again uses CfsSubsetEval as evaluator and a backwards-operating GreedyStepwise as search algorithm. It just outputs the reduced data to stdout after the filtering step:

import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

Instances data = ... // from somewhere

// setup filter
AttributeSelection filter = new AttributeSelection();
d, not those of the children.
The Add Node menu brings up a dialog (see below) that allows you to specify the name of the new node and the cardinality of the new node. Node values are assigned the names 'Value1', 'Value2' etc. These values can be renamed (right click the node in the graph panel and select Rename Value). Another option is to copy/paste a node with values that are already properly named and rename the node.

[Screenshot: the Add node dialog, with fields for Name and Cardinality.]

8.9. BAYES NETWORK GUI 145

Then a dialog is shown to select a parent. Descendants of the child node, parents of the child node and the node itself are not listed, since these cannot be selected as parent: they would either introduce cycles or already have an arc in the network.

[Screenshot: the "Select parent node for sepallength" dialog, offering e.g. petalwidth.]

The Delete Arc menu brings up a dialog with a list of all arcs that can be deleted.

[Screenshot: the "Select arc to delete" dialog, e.g. petallength -> sepallength.]

The list of eight items at the bottom are active only when a group of at least two nodes are selected.
• Align Left/Right/Top/Bottom moves the nodes in the selection such that all nodes align to the utmost left, right, top or bottom node in the selection respectively.
• Center Horizontal/Vertical moves nodes in the selection halfway between left and right most (or
d support for the Snowball stemmers are included.

12.2 Snowball stemmers

Weka contains a wrapper class for the Snowball (homepage: http://snowball.tartarus.org/) stemmers, containing the Porter stemmer and several other stemmers for different languages. The relevant class is weka.core.stemmers.Snowball.
The Snowball classes are not included; they only have to be present in the classpath. The reason for this is that the Weka team doesn't have to watch out for new versions of the stemmers and update them.
There are two ways of getting hold of the Snowball stemmers:

1. You can add the following pre-compiled jar archive to your classpath and you're set (based on source code from 2005-10-19, compiled 2005-10-22):
http://www.cs.waikato.ac.nz/~ml/weka/stemmers/snowball.jar

2. You can compile the stemmers yourself with the newest sources. Just download the following ZIP file, unpack it and follow the instructions in the README file (the zip contains an ANT (http://ant.apache.org/) build script for generating the jar archive):
http://www.cs.waikato.ac.nz/~ml/weka/stemmers/snowball.zip

Note: the patch target is specific to the source code from 2005-10-19.

177

178 CHAPTER 12. STEMMERS

12.3 Using stemmers

The stemmers can either be used:

• from the commandline
• within the StringToWordVector (package weka.filters.unsupervised.attribute)

12.3.1 Commandline

All stemmers support the following options:

• -h, for display
labels.addElement("no");
labels.addElement("yes");
Attribute cls = new Attribute("class", labels);
FastVector attributes = new FastVector();
attributes.addElement(num1);
attributes.addElement(num2);
attributes.addElement(cls);
Instances dataset = new Instances("Test-dataset", attributes, 0);

The final argument in the Instances constructor above tells WEKA how much memory to reserve for upcoming weka.core.Instance objects. If one knows how many rows will be added to the dataset, then it should be specified, as it saves costly operations for expanding the internal storage. It doesn't matter if one aims too high with the number of rows to be added; it is always possible to trim the dataset again, using the compactify() method.

16.3.2 Adding data

After the structure of the dataset has been defined, one can add the actual data to it, row by row. There are basically two constructors of the weka.core.Instance class that one can use for this purpose:

• Instance(double weight, double[] attValues) - generates an Instance object with the specified weight and the given double values. WEKA's internal format is using doubles for all attribute types. For nominal, string and relational attributes this is just an index of the stored values.
• Instance(int numAttributes) - generates a new Instance object with weight 1.0 and all missing values.

The second constructor may be easier to use, but setting values via the Instance class' methods is a b
Instances newData = Filter.useFilter(data, remove);  // apply filter

A common trap to fall into is setting options after setInputFormat(Instances) has been called. Since this method is (normally) used to determine the output format of the data, all the options have to be set before calling it. Otherwise, all options set afterwards will be ignored.

16.5.1 Batch filtering

Batch filtering is necessary if two or more datasets need to be processed according to the same filter initialization. If batch filtering is not used, for instance when generating a training and a test set using the StringToWordVector filter (package weka.filters.unsupervised.attribute), then these two filter runs are completely independent and will create two most likely incompatible datasets. Running the StringToWordVector on two different datasets will result in two different word dictionaries and therefore different attributes being generated.
The following code example shows how to standardize, i.e., transform all numeric attributes to have zero mean and unit variance, a training and a test set with the Standardize filter (package weka.filters.unsupervised.attribute):

Instances train = ... // from somewhere
Instances test = ... // from somewhere
Standardize filter = new Standardize();
// initializing the filter once with training set
filter.setInputFormat(train);
// configures the filter based on train instances and returns
// filtered instances
Instances newTrain = Filter.useFilter(t
161. der they are specified  in the data file     2  Selection tick boxes  These allow you select which attributes are present  in the relation     3  Name  The name of the attribute  as it was declared in the data file     When you click on different rows in the list of attributes  the fields change  in the box to the right titled Selected attribute  This box displays the char   acteristics of the currently highlighted attribute in the list     1  Name  The name of the attribute  the same as that given in the attribute  list     2  Type  The type of attribute  most commonly Nominal or Numeric     3  Missing  The number  and percentage  of instances in the data for which  this attribute is missing  unspecified      4  Distinct  The number of different values that the data contains for this  attribute     5  Unique  The number  and percentage  of instances in the data having a  value for this attribute that no other instances have     Below these statistics is a list showing more information about the values stored  in this attribute  which differ depending on its type  If the attribute is nominal   the list consists of each possible value for the attribute along with the number  of instances that have that value  If the attribute is numeric  the list gives  four statistics describing the distribution of values in the data   the minimum   maximum  mean and standard deviation  And below these statistics there is a  coloured histogram  colour coded according to the attribute chosen a
162. details can be found in the Wiki article    Extensions for  Weka s main GUI     6     If you launch WEKA from a terminal window  some text begins scrolling  in the terminal  Ignore this text unless something goes wrong  in which case it  can help in tracking down the cause  the LogWindow from the Program menu  displays that information as well     This User Manual focuses on using the Explorer but does not explain the  individual data preprocessing tools and learning algorithms in WEKA  For more  information on the various filters and learning methods in WEKA  see the book  Data Mining  I      30    CHAPTER 2  LAUNCHING WEKA    Chapter 3    Simple CLI    The Simple CLI provides full access to all Weka classes  i e   classifiers  filters   clusterers  etc   but without the hassle of the CLASSPATH  it facilitates the  one  with which Weka was started     It offers a simple Weka shell with separated commandline and output           SimpleCLI       Welcome to the WEKA SimpleCLI    Enter commands in the textfield at the bottom of  the window  Use the up and down arrows to move  through previous commands    Command completion for classnames and files is  initiated with  lt Tab gt   In order to distinguish  etween files and classnames  file names must   gt  either absolute or start with        Alt BackSpace gt  is used for deleting the text  in the commandline in chunks      gt  help    Command must be one of   java  lt classname gt   lt args gt     gt  file   break  kill  cls  exit
adds an extra attribute at the end, which is filled with random numbers. The reset() method is only used in this example, since the random number generator needs to be re-initialized in order to obtain repeatable results.

import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;
import java.util.Random;

public class SimpleStream extends SimpleStreamFilter {
  protected Random m_Random;

  public String globalInfo() {
    return "A simple stream filter that adds an attribute 'blah' at the end "
      + "containing a random number.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  protected void reset() {
    super.reset();
    m_Random = new Random(1);
  }

  protected Instances determineOutputFormat(Instances inputFormat) {
    Instances result = new Instances(inputFormat, 0);
    result.insertAttributeAt(new Attribute("blah"), result.numAttributes());
    return result;
  }

  protected Instance process(Instance inst) {
    double[] values = new double[inst.numAttributes() + 1];
    for (int n = 0; n < inst.numAttributes(); n++)
      values[n] = inst.value(n);
    values[values.length - 1] = m_Random.nextInt();
    Instance result = new Instance(1, values);
    return result;
  }

  public static void main(String[] args) {
    runFilter(new SimpleStream(), args);
  }
}
the Explorer.

The Explorer does not only save the built classifier in the model file, but also the header information of the dataset the classifier was built with. By storing the dataset information as well, one can easily check whether a serialized classifier can be applied to the current dataset. The readAll method returns an array with all objects that are contained in the model file.

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;

// the current data to use with classifier
Instances current = ... // from somewhere

// deserialize model
Object[] o = SerializationHelper.readAll("/some/where/j48.model");
Classifier cls = (Classifier) o[0];
Instances data = (Instances) o[1];

// is the data compatible?
if (!data.equalHeaders(current))
  throw new Exception("Incompatible data!");

Serializing a classifier for the Explorer

If one wants to serialize the dataset header information alongside the classifier, just like the Explorer does, then one can use one of the writeAll methods:

import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.SerializationHelper;

// load data
Instances inst = DataSource.read("/some/where/data.arff");
inst.setClassIndex(inst.numAttributes() - 1);

// train J48
Classifier cls = new J48();
cls.buildClassifier(inst);

// serialize classifier and header information
Instances
e available, this method returns null.

17.2. WRITING A NEW FILTER 249

batchFinished()

Signals the end of a dataset being pushed through the filter. In case of a filter that could not process the data of the first batch immediately, this is the place to determine what the output format will be, set it via setOutputFormat(Instances), and finally process the input data. The currently available data can be retrieved with the getInputFormat() method. After processing the data, one needs to call flushInput() to remove all the pending input data.

flushInput()

flushInput() removes all buffered Instance objects from the input queue. This method must be called after all the Instance objects have been processed in the batchFinished() method.

Option handling

If the filter should be able to handle command-line options, then the interface weka.core.OptionHandler needs to be implemented. In addition to that, the following code should be added at the end of the setOptions(String[]) method:

if (getInputFormat() != null) {
  setInputFormat(getInputFormat());
}

This will inform the filter about changes in the options and therefore reset it.

250 CHAPTER 17. EXTENDING WEKA

17.2.1.2 Examples

The following examples, covering batch and stream filters, illustrate the filter framework and how to use it.
Unseeded random number generators like Math.random() should never be used, since they will produce different results in each run and rep
e clipboard or display the properties in a GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.

4.3.2 Test Options

The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:

1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

Note: No matter which evaluation method is used, the model that is output is always the one built from all the training data. Further testing options can be set by clicking on the More options... button.

42 CHAPTER 4. EXPLORER

1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.
2. Output per-class stats. The prec
167. e menu gives two options     35    36 CHAPTER 4  EXPLORER    1  Memory information  Display in the log box the amount of memory  available to WEKA     2  Run garbage collector  Force the Java garbage collector to search for  memory that is no longer needed and free it up  allowing more memory  for new tasks  Note that the garbage collector is constantly running as a  background task anyway     4 1 3 Log Button    Clicking on this button brings up a separate window containing a scrollable text  field  Each line of text is stamped with the time it was entered into the log  As  you perform actions in WEKA  the log keeps a record of what has happened   For people using the command line or the SimpleCLI  the log now also contains  the full setup strings for classification  clustering  attribute selection  etc   so  that it is possible to copy paste them elsewhere  Options for dataset s  and  if  applicable  the class attribute still have to be provided by the user  e g    t for  classifiers or  i and  o for filters      4 1 4 WEKA Status Icon    To the right of the status box is the WEKA status icon  When no processes are  running  the bird sits down and takes a nap  The number beside the x symbol  gives the number of concurrent processes running  When the system is idle it is  zero  but it increases as the number of processes increases  When any process  is started  the bird gets up and starts moving around  If it s standing but stops  moving for a long time  it s sick  something
[Screenshot: sample "-p" predictions, e.g. "diaporthe-stem-canker 0.9999992614503429 diaporthe-stem-canker". The fields are the zero-based test instance id, followed by the predicted class value, the confidence for the prediction (estimated probability of the predicted class) and the true class. All of these are correctly classified, so let's look at a few erroneous ones.]

In each of these cases, a misclassification occurred, mostly between classes alternarialeaf-spot and brown-spot. The confidences seem to be lower than for correct classification, so for a real-life application it may make sense to output "don't know" below a certain threshold. WEKA also outputs a trailing newline.

If we had chosen a range of attributes via "-p first-last", the mentioned attributes would have been output afterwards as comma-separated values (in parentheses). However, the zero-based instance id in the first column offers a safer way to determine the test instances.

If we had saved the output of -p in soybean-test.preds, the following call would compute the number of correctly classified instances:

cat soybean-test.preds | awk '$2==$4&&$0!=""' | wc -l

Dividing by the number of instances in the test set, i.e., wc -l < soybean-test.preds (minus one; trailing newline), we get the test set accuracy.
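The accuracy computation above can be tried without WEKA by faking a small predictions file. A minimal sketch; the file name and the prediction values below are invented for illustration, but the field layout follows the "-p" output described above (instance id, predicted class, confidence, actual class):

```shell
# Create a fake "-p" predictions file
cat > sample.preds << 'EOF'
0 brown-spot 0.81 brown-spot
1 alternarialeaf-spot 0.65 brown-spot
2 brown-spot 0.92 brown-spot
EOF

# Count lines where the predicted class (field 2) matches
# the actual class (field 4), skipping empty lines
correct=$(awk '$2==$4 && $0!=""' sample.preds | wc -l | tr -d ' ')
total=$(wc -l < sample.preds | tr -d ' ')

echo "$correct / $total"   # prints "2 / 3"
```

Dividing the two numbers gives the accuracy, exactly as the manual's awk one-liner does for the soybean predictions.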
e to be set, and some of the "unsupervised attribute" filters will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which case no class is set.

4.3. CLASSIFICATION 41

4.3 Classification

[Screenshot: the Explorer's Classify panel, showing the Classifier box (Choose: J48 -C 0.25 -M 2), the Test options box, the Result list and the Classifier output area with summary statistics, detailed accuracy by class and the confusion matrix.]

4.3.1 Selecting a Classifier

At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier, and its options. Clicking on the text box with the left mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right-click (or Alt+Shift+left-click) you can once again copy the setup string to th
e_value in the file.

19.2.9 CLASSPATH problems

Having problems getting Weka to run from a DOS/UNIX command prompt? Getting java.lang.NoClassDefFoundError exceptions? Most likely your CLASSPATH environment variable is not set correctly - it needs to point to the Weka.jar file that you downloaded with Weka (or the parent of the Weka directory if you have extracted the jar). Under DOS this can be achieved with:

set CLASSPATH=c:\weka-3-4\weka.jar;%CLASSPATH%

298 CHAPTER 19. OTHER RESOURCES

Under UNIX/Linux something like:

export CLASSPATH=/home/weka/weka.jar:$CLASSPATH

An easy way to avoid setting the variable is to specify the CLASSPATH when calling Java. For example, if the jar file is located at c:\weka-3-4\weka.jar you can use:

java -cp c:\weka-3-4\weka.jar weka.classifiers... etc.

See also Section 13.2.

19.2.10 Instance ID

People often want to tag their instances with identifiers, so they can keep track of them and the predictions made on them.

19.2.10.1 Adding the ID

A new ID attribute is added real easy: one only needs to run the AddID filter over the dataset and it's done. Here's an example (at a DOS/Unix command prompt):

java weka.filters.unsupervised.attribute.AddID
  -i data_without_id.arff
  -o data_with_id.arff

(all on a single line)

Note: the AddID filter adds a numeric attribute, not a String attribute, to the dataset. If you want to remove this ID attribute for the classifier in a Filter
repeatable experiments are essential in machine learning.

BatchFilter

This simple batch filter adds a new attribute called blah at the end of the dataset. The rows of this attribute contain only the row's index in the data. Since the batch filter does not have to see all the data before creating the output format, setInputFormat(Instances) sets the output format and returns true (indicating that the output format can be queried immediately). The batchFinished() method performs the processing of all the data.

import weka.core.*;
import weka.core.Capabilities.*;

public class BatchFilter extends Filter {

  public String globalInfo() {
    return "A batch filter that adds an additional attribute 'blah' at the end "
      + "containing the index of the processed instance. The output format "
      + "can be collected immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    Instances outFormat = new Instances(instanceInfo, 0);
    outFormat.insertAttributeAt(new Attribute("blah"),
      outFormat.numAttributes());
    setOutputFormat(outFormat);
    return true;  // output format is immediately available
  }

  public boolean batchFinished() throws E
FilteredClassifier environment again, use the Remove filter instead of the RemoveType filter (same package).

19.2.10.2 Removing the ID

If you run from the command line you can use the -p option to output predictions plus any other attributes you are interested in. So it is possible to have a string attribute in your data that acts as an identifier. A problem is that most classifiers don't like String attributes, but you can get around this by using the RemoveType filter (this removes String attributes by default).
Here's an example. Let's say you have a training file named train.arff, a testing file named test.arff, and they have an identifier String attribute as their 5th attribute. You can get the predictions from J48 along with the identifier strings by issuing the following command (at a DOS/Unix command prompt):

java weka.classifiers.meta.FilteredClassifier
  -F weka.filters.unsupervised.attribute.RemoveType
  -W weka.classifiers.trees.J48
  -t train.arff -T test.arff -p 5

(all on a single line)

19.2. TROUBLESHOOTING 299

If you want, you can redirect the output to a file by adding "> output.txt" to the end of the line.
In the Explorer GUI you could try a similar trick of using the String attribute identifiers here as well. Choose the FilteredClassifier, with RemoveType as the filter, and whatever classifier you prefer. When you visualize the results you will need to click through each instance to see the identifier listed for each.

19.2.11
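Once the predictions with identifiers are in a file, the identifier column can be post-processed with standard tools. A sketch under the assumption that the "-p 5" output appends the 5th-attribute value in parentheses after the four usual fields (the file name and values are invented for illustration):

```shell
# Fake "-p 5" output: instance id, predicted class, confidence,
# actual class, then the identifier attribute in parentheses
cat > preds_with_id.txt << 'EOF'
0 yes 0.90 yes (patient-007)
1 no 0.75 yes (patient-013)
EOF

# Print the identifier of every misclassified instance
# (field 2 = predicted, field 4 = actual, field 5 = identifier)
awk '$2!=$4 {gsub(/[()]/,"",$5); print $5}' preds_with_id.txt
# prints "patient-013"
```

This way the misclassified rows can be traced back to the original records by their identifier, which is exactly why the String attribute was kept in the data.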
needs to be supplied as well, of course.

• weka.core.AllJavadoc - executes all Javadoc-producing classes (this is the tool you would normally use).
• weka.core.GlobalInfoJavadoc - updates the globalinfo tags.
• weka.core.OptionHandlerJavadoc - updates the option tags.
• weka.core.TechnicalInformationHandlerJavadoc - updates the technical tags (plain text and BibTeX).

These tools look for specific comment tags in the source code and replace everything in between the start and end tag with the documentation obtained from the actual class.

• description of the classifier:
<!-- globalinfo-start -->
will be automatically replaced
<!-- globalinfo-end -->

• listing of command line parameters:
<!-- options-start -->
will be automatically replaced
<!-- options-end -->

• publication(s), if applicable:
<!-- technical-bibtex-start -->
will be automatically replaced
<!-- technical-bibtex-end -->

for a shortened, plain-text version use the following:
<!-- technical-plaintext-start -->
will be automatically replaced
<!-- technical-plaintext-end -->

17.1. WRITING A NEW CLASSIFIER 245

Here is a template of a Javadoc class block for an imaginary classifier that also implements the weka.core.TechnicalInformationHandler interface:

/**
 <!-- globalinfo-start -->
 <!-- globalinfo-end -->

 <!-- technical-bibtex-start -->
 <!-- technical-bibtex-end -->

 <!-- technical-plaintext-start -->
See chapter 13 for more details on how to configure WEKA correctly and also more information on JDBC (Java Database Connectivity) URLs.
Example classes, making use of the functionality covered in this section, can be found in the wekaexamples.core.converters package of the Weka Examples collection[3].

The following classes are used to store data in memory:

• weka.core.Instances - holds a complete dataset. This data structure is row-based; single rows can be accessed via the instance(int) method, using a 0-based index. Information about the columns can be accessed via the attribute(int) method. This method returns weka.core.Attribute objects (see below).
• weka.core.Instance - encapsulates a single row. It is basically a wrapper around an array of double primitives. Since this class contains no information about the type of the columns, it always needs access to a weka.core.Instances object (see methods dataset and setDataset). The class weka.core.SparseInstance is used in case of sparse data.
• weka.core.Attribute - holds the type information about a single column in the dataset. It stores the type of the attribute, as well as the labels for nominal attributes, the possible values for string attributes or the datasets for relational attributes (these are just weka.core.Instances objects again).

16.2.1 Loading data from files

When loading data from files, one can either let WEKA choose the appropriate loader (the available loader
16.10. VISUALIZATION 229

16.10.2.2 BayesNet

The graphs that the BayesNet classifier (package weka.classifiers.bayes) generates can be displayed using the GraphVisualizer class (located in package weka.gui.graphvisualizer). The GraphVisualizer can display graphs that are either in GraphViz's DOT language[26] or in XML BIF[20] format. For displaying DOT format, one needs to use the method readDOT, and for the BIF format the method readBIF.
The following code snippet trains a BayesNet classifier on some data and then displays the graph generated from this data in a frame:

import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.gui.graphvisualizer.GraphVisualizer;
import java.awt.BorderLayout;
import javax.swing.JFrame;

Instances data = ... // from somewhere

// train classifier
BayesNet cls = new BayesNet();
cls.buildClassifier(data);

// display graph
GraphVisualizer gv = new GraphVisualizer();
gv.readBIF(cls.graph());
JFrame jf = new JFrame("BayesNet graph");
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setSize(800, 600);
jf.getContentPane().setLayout(new BorderLayout());
jf.getContentPane().add(gv, BorderLayout.CENTER);
jf.setVisible(true);

// layout graph
gv.layoutGraph();

230 CHAPTER 16. USING THE API

16.11 Serialization

Serialization is the process of saving an object in a persistent form, e.g., on the harddisk as a bytestream. Deserialization is the proce
been saved:

import weka.core.Instances;
import weka.core.converters.DatabaseSaver;

// data structure to save
Instances data = ...

// store data in database
DatabaseSaver saver = new DatabaseSaver();
saver.setDestination("jdbc_url", "the_user", "the_password");
// we explicitly specify the table name here:
saver.setTableName("whatsoever2");
saver.setRelationForTableName(false);
// or we could just update the name of the dataset:
// saver.setRelationForTableName(true);
// data.setRelationName("whatsoever2");
saver.setRetrieval(DatabaseSaver.INCREMENTAL);
saver.setStructure(data);
count = 0;
for (int i = 0; i < data.numInstances(); i++) {
  saver.writeIncremental(data.instance(i));
}
// notify saver that we're finished
saver.writeIncremental(null);

16.10. VISUALIZATION 227

16.10 Visualization

The concepts covered in this chapter are also available through the example classes of the Weka Examples collection[3]. See the following packages:

• wekaexamples.gui.graphvisualizer
• wekaexamples.gui.treevisualizer
• wekaexamples.gui.visualize

16.10.1 ROC curves

WEKA can generate "Receiver operating characteristic" (ROC) curves, based on the collected predictions during an evaluation of a classifier. In order to display a ROC curve, one needs to perform the following steps:

1. Generate the plotable data based on the Evaluation's collected predictions, using the ThresholdCurve class (package weka.cla
...defines the name of that attribute and its data type. The order in which the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared, then Weka expects that all of that attribute's values will be found in the third comma-delimited column.

The format for the @attribute statement is:

  @attribute <attribute-name> <datatype>

where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name, then the entire name must be quoted.

The <datatype> can be any of the four types supported by Weka:

  - numeric
    - integer is treated as numeric
    - real is treated as numeric
  - <nominal-specification>
  - string
  - date [<date-format>]
  - relational for multi-instance data (for future use)

where <nominal-specification> and <date-format> are defined below. The keywords numeric, real, integer, string and date are case insensitive.

Numeric attributes

Numeric attributes can be real or integer numbers.

Nominal attributes

Nominal values are defined by providing a <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

For example, the class value of the Iris dataset can be defined as follows:

  @ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

Values that contain spaces must be quoted.

String attributes
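Putting these declarations together, a minimal ARFF header for the Iris data looks like the following (attribute names follow the standard Iris dataset; the single data row is illustrative):

```
@RELATION iris

@ATTRIBUTE sepallength  NUMERIC
@ATTRIBUTE sepalwidth   NUMERIC
@ATTRIBUTE petallength  NUMERIC
@ATTRIBUTE petalwidth   NUMERIC
@ATTRIBUTE class        {Iris-setosa, Iris-versicolor, Iris-virginica}

@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
```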
  import weka.core.Instances;
  import weka.core.converters.CSVSaver;
  import java.io.File;
  ...
  // data structure to save
  Instances data = ...

  // save as CSV
  CSVSaver saver = new CSVSaver();
  saver.setInstances(data);
  saver.setFile(new File("/some/where/data.csv"));
  saver.writeBatch();

16.9.2 Saving data to databases

Apart from the KnowledgeFlow, saving to databases is not very obvious in WEKA, unless one knows about the DatabaseSaver converter. Just like the DatabaseLoader, the saver counterpart can store the data either in batch mode or incrementally.

The first example shows how to save the data in batch mode, which is the easier way of doing it:

  import weka.core.Instances;
  import weka.core.converters.DatabaseSaver;
  ...
  // data structure to save
  Instances data = ...

  // store data in database
  DatabaseSaver saver = new DatabaseSaver();
  saver.setDestination("jdbc_url", "the_user", "the_password");
  // we explicitly specify the table name here
  saver.setTableName("whatsoever2");
  saver.setRelationForTableName(false);
  // or we could just update the name of the dataset:
  // saver.setRelationForTableName(true);
  // data.setRelationName("whatsoever2");
  saver.setInstances(data);
  saver.writeBatch();

Saving the data incrementally requires a bit more work, as one has to specify that writing the data is done incrementally (using the setRetrieval method), as well as notifying the saver when all the data has been
...weka.core.converters. For a certain kind of converter you will find two classes:

  - one for loading (classname ends with Loader) and
  - one for saving (classname ends with Saver).

Weka contains converters for the following data sources:

  - ARFF files (ArffLoader, ArffSaver)
  - C4.5 files (C45Loader, C45Saver)
  - CSV files (CSVLoader, CSVSaver)
  - files containing serialized instances (SerializedInstancesLoader, SerializedInstancesSaver)
  - JDBC databases (DatabaseLoader, DatabaseSaver)
  - libsvm files (LibSVMLoader, LibSVMSaver)
  - XRFF files (XRFFLoader, XRFFSaver)
  - text directories for text mining (TextDirectoryLoader)

11.2 Usage

11.2.1 File converters

File converters can be used as follows:

  - Loader
    They take one argument, which is the file that should be converted, and print the result to stdout. You can also redirect the output into a file:

      java <classname> <input-file> > <output-file>

    Here's an example for loading the CSV file iris.csv and saving it as iris.arff:

      java weka.core.converters.CSVLoader iris.csv > iris.arff

  - Saver
    For a Saver you specify the ARFF input file via -i and the output file in the specific format with -o:

      java <classname> -i <input> -o <output>

    Here's an example for saving an ARFF file to CSV:

      java weka.core.converters.CSVSaver -i iris.arff -o iris.csv

A few notes:

  - Using the ArffSaver from the commandline doesn't make much sense...
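The same conversion can also be performed from Java via the ConverterUtils helper class, which picks the appropriate loader and saver from the file extension. A minimal sketch assuming the WEKA 3.6 API (the file names are placeholders):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.converters.ConverterUtils.DataSink;

public class CsvToArff {
  public static void main(String[] args) throws Exception {
    // load CSV (the loader is chosen automatically from the .csv extension)
    Instances data = DataSource.read("iris.csv");
    // save as ARFF (the saver is likewise chosen from the .arff extension)
    DataSink.write("iris.arff", data);
  }
}
```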
...label index (0-based index)

  - numeric class attribute
    - correlationCoefficient() - The correlation coefficient.
  - general
    - meanAbsoluteError() - The mean absolute error.
    - rootMeanSquaredError() - The root mean squared error.
    - numInstances() - The number of instances with a class value.
    - unclassified() - The number of unclassified instances.
    - pctUnclassified() - The percentage of unclassified instances.

For a complete overview, see the Javadoc page of the Evaluation class. By looking up the source code of the summary methods mentioned above, one can easily determine what methods are used for which particular output.

16.6.3 Classifying instances

After a classifier setup has been evaluated and proven to be useful, a built classifier can be used to make predictions and label previously unlabeled data. Section 16.5.2 already provided a glimpse of how to use a classifier's classifyInstance method. This section elaborates a bit more on this.

The following example uses a trained classifier tree to label all the instances in an unlabeled dataset that gets loaded from disk. After all the instances have been labeled, the newly labeled dataset gets written back to disk to a new file.

  // load unlabeled data and set class attribute
  Instances unlabeled = DataSource.read("/some/where/unlabeled.arff");
  unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

  // create copy
  Instances labeled = new Instances(unlabeled);

  // label instances
  for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = tree.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
  }

  // save newly labeled data
  DataSink.write("/some/where/labeled.arff", labeled);
...implement various combinations of the following two interfaces:

  - weka.core.Randomizable - to allow (seeded) randomization taking place
  - weka.classifiers.IterativeClassifier - to make the classifier an iterated one

But these interfaces are not the only ones that can be implemented by a classifier. Here is a list of further interfaces:

  - weka.core.AdditionalMeasureProducer - the classifier returns additional information, e.g., J48 returns the tree size with this method.
  - weka.core.WeightedInstancesHandler - denotes that the classifier can make use of weighted Instance objects (the default weight of an Instance is 1.0).
  - weka.core.TechnicalInformationHandler - for returning paper references and publications this classifier is based on.
  - weka.classifiers.Sourcable - classifiers implementing this interface can return Java code of a built model, which can be used elsewhere.
  - weka.classifiers.UpdateableClassifier - for classifiers that can be trained incrementally, i.e., row by row, like NaiveBayesUpdateable.

17.1.3 Packages

A few comments about the different sub-packages in the weka.classifiers package:

  - bayes - contains bayesian classifiers, e.g., NaiveBayes
  - evaluation - classes related to evaluation, e.g., confusion matrix, threshold curve (= ROC)
  - functions - e.g., Support Vector Machines, regression algorithms, neural nets
  - lazy - "learning" is performed at prediction time...
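For instance, a classifier implementing UpdateableClassifier, such as NaiveBayesUpdateable, can be trained row by row without ever holding the full dataset in memory. A sketch following the incremental-training pattern used elsewhere in this manual (the file name is a placeholder):

```java
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import java.io.File;

public class IncrementalTraining {
  public static void main(String[] args) throws Exception {
    // read only the header first, then stream instances one by one
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("/some/where/train.arff"));
    Instances structure = loader.getStructure();
    structure.setClassIndex(structure.numAttributes() - 1);

    // initialize the classifier with the structure, then update row by row
    NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
    nb.buildClassifier(structure);
    Instance current;
    while ((current = loader.getNextInstance(structure)) != null)
      nb.updateClassifier(current);
  }
}
```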
182. engines on   lt Enter gt  adds the host to the list      e You can choose to distribute by run or dataset   e Save your experiment configuration   e Now start your experiment as you would do normally     e Check your results in the Analyse tab by clicking either the Database or  Experiment buttons     5 4 5 Multi core support    If you want to utilize all the cores on a multi core machine  then you can do  so with Weka  All you have to do  is define the port alongside the hostname in  the Experimenter  format  hostname port  and then start the RemoteEngine  with the  p option  specifying the port to listen on     5 4 6 Troubleshooting    e If you get an error at the start of an experiment that looks a bit like this     01 13 19  RemoteExperiment    blabla company com RemoteEngine    sub experiment  datataset vineyard arff  failed    java sql SQLException  Table already exists  EXPERIMENT_INDEX  in statement  CREATE TABLE Experiment_index   Experiment_type  LONGVARCHAR  Experiment_setup LONGVARCHAR  Result_table INT       01 13 19  dataset  vineyard arff RemoteExperiment     blabla company com RemoteEngine   sub experiment  datataset  vineyard arff  failed   java sql SQLException  Table already  exists  EXPERIMENT_INDEX in statement  CREATE TABLE  Experiment_index   Experiment_type LONGVARCHAR  Experiment_setup  LONGVARCHAR  Result_table INT     Scheduling for execution on  another host     then do not panic   this happens because multiple remote machines are  trying to create
...er - evaluators that transform the input data

Methods

In the following, a brief description of the main methods of an evaluator.

buildEvaluator(Instances)

Generates the attribute evaluator. Subsequent calls of this method with the same data (and the same search algorithm) must result in the same attributes being selected. This method also checks the data against the capabilities:

  public void buildEvaluator(Instances data) throws Exception {
    // can evaluator handle data?
    getCapabilities().testWithFail(data);

    // actual initialization of evaluator
    ...
  }

postProcess(int[])

Can be used for optional post-processing of the selected attributes, e.g., for ranking purposes.

main(String[])

Executes the evaluator from the command line. If your new algorithm is called FunkyEvaluator, then use the following code as your main method:

  /**
   * Main method for executing this evaluator.
   *
   * @param args the options, use "-h" to display options
   */
  public static void main(String[] args) {
    ASEvaluation.runEvaluator(new FunkyEvaluator(), args);
  }

Search

The search algorithm defines the heuristic of searching, e.g., exhaustive search, greedy or genetic.

Superclasses and interfaces

The ancestor for all search algorithms is the weka.attributeSelection.ASSearch class.

Interfaces that can be implemented, if applicable, by a search algorithm:

  - RankedOutputSearch - for search algorithms that produce ranked lists...
Chapter 4

Explorer

4.1 The user interface

4.1.1 Section Tabs

At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are greyed out. This is because it is necessary to open (and potentially pre-process) a data set before starting to explore the data.

The tabs are as follows:

1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the Weka bird) stays visible regardless of which section you are in.

The Explorer can be easily extended with custom tabs. The Wiki article "Adding tabs in the Explorer" [7] explains this in detail.

4.1.2 Status Box

The status box appears at the very bottom of the window. It displays messages that keep you informed about what's going on. For example, if the Explorer is busy loading a file, the status box will say that.

TIP - right-clicking the mouse anywhere inside the status box brings up a little menu. Th...
185. er labelled Jitter  which is a random  displacement given to all points in the plot  Dragging it to the right increases the  amount of jitter  which is useful for spotting concentrations of points  Without  jitter  a million instances at the same point would look no different to just a  single lonely instance     4 7 3 Selecting Instances    There may be situations where it is helpful to select a subset of the data us   ing the visualization tool   A special case of this is the UserClassifier in the  Classify panel  which lets you build your own classifier by interactively selecting  instances     Below the y axis selector button is a drop down list button for choosing a  selection method  A group of data points can be selected in four ways     1  Select Instance  Clicking on an individual data point brings up a window  listing its attributes  If more than one point appears at the same location   more than one set of attributes is shown     2  Rectangle  You can create a rectangle  by dragging  that selects the  points inside it     3  Polygon  You can build a free form polygon that selects the points inside  it  Left click to add vertices to the polygon  right click to complete it  The  polygon will always be closed off by connecting the first point to the last     4  Polyline  You can build a polyline that distinguishes the points on one  side from those on the other  Left click to add vertices to the polyline   right click to finish  The resulting shape is open  as opposed
186. ercentage into a train and a test file   one cannot specify explicit training and test files in the Experimenter    after the order of the data has been randomized and stratified    5 2  STANDARD EXPERIMENTS 57                                                                                  weka Experiment Environment lolx      Setup   Run   Analyse      xperiment Configuration Mode   8  Simple    Advanced   Open    Save    New   Results Destination    aer file  v Filename  le    Mtemplweka 3 5 6 Experiments1  arff Browse     Experiment Type Iteration Control   Number of repetitions  10   Train percentage   66 0  8  Data sets first    8  Classification    Regression    Algorithms first   Datasets Algorithms   Add new      Edit selecte      Delete select    Add new    Edit selected      Delete selected                      Use relative pat                      U     Ses a          Notes                   e Train Test Percentage Split  order preserved   because it is impossible to specify an explicit train test files pair  one can  abuse this type to un merge previously merged train and test file into the  two original files  one only needs to find out the correct percentage                                                                                                              weka Experiment Environment lolx      Setup   Run   Analyse     xperiment Configuration Mode    Simple    Advanced  Open    Save    New  Results Destination  ARFF file  v Filename  le Mtemplweka 3 5 
187. es are shown in green next to the node  The value of a node can be set   right click node  set evidence  select a value  and the color is changed to red to  indicate evidence is set for the node  Rounding errors may occur in the marginal  probabilities       Bayes Network Editor AS    File Edit Tools View Help         js  Setosa 9999  Iris versicolor  0  Iris virginica  0    es  in  8   9805          petalwidth    SE a    inf 2 45   9680  petalleng A    sepalwidth    a  int 5 55  8673    sepallength 0             Set evidence for class       The Show Cliques menu item makes the cliques visible that are used by the  junction tree algorithm  Cliques are visualized using colored undirected edges   Both margins and cliques can be shown at the same time  but that makes for  rather crowded graphs     148 CHAPTER 8  BAYESIAN NETWORK CLASSIFIERS      Bayes Network Editor Of x     ile Edit Tools View Help       petallength    sepalwidth       sepallength          Set Group Position Action       View menu    The view menu allows for zooming in and out of the graph panel  Also  it allows  for hiding or showing the status and toolbars       Bayes Network Editor    File Edit Tools    k                     El  Zoom out  View toolbar  View statusbar    The help menu points to this document         Help menu      Bayes Network Editor    File Edit Tools View    DIAS                                  8 9  BAYES NETWORK GUI 149    Toolbar    D  a  a   amp  Ba   unao  reao ls  3  F 8s El EH  be  
188. ethod returns the maximum version  exclusive   of WEKA that is necessary to execute the plugin  e g   3 6 0    e getDesignVersion     Returns the actual version of WEKA this plugin was  designed for  e g   3 5 1   e getVisualizeMenultem    The JMenuItem that is returned via this method  will be added to the plugins menu in the popup in the Explorer  The  ActionListener for clicking the menu item will most likely open a new  frame containing the visualized data     17 4  EXTENDING THE EXPLORER    Examples    Table with predictions    275    The PredictionTable  java example simply displays the actual class label and  the one predicted by the classifier  In addition to that  1t lists whether it was  an incorrect prediction and the class probability for the correct class label          Prediction table    Actual Predicted  bad good    good    good    Prob  for    Actual       0  0 7620853164       good    bad          good    bad    S good good       bad o  08496732026    good    0 9183006535  0 2379146835       6    8    bad    good    good    good    9  gd  good    bad    0  0 2379146835     0  0    0 8599439775       good    good    0 8599439775          bad    good    bad    mo  bad bad   14  good good    good    0 8295625942      0 8295625942    0 8145204027       0 8145204027          good    good       15  16  17    good    good    18  bad bad    bad    0 8145204027         0 8145204027      0 9803921568       18  20    bad  good    good    good  bad    22 o  good bad    
189. f   D weka experiment InstancesResultListener   P weka experiment RandomSplitResultProducer      W weka experiment ClassifierSplitEvaluator      W weka classifiers rules OneR    While commands can be typed directly into the CLI  this technique is not  particularly convenient and the experiments are not easy to modify    The Experimenter comes in two flavours  either with a simple interface that  provides most of the functionality one needs for experiments  or with an interface  with full access to the Experimenter   s capabilities  You can choose between  those two with the Experiment Configuration Mode radio buttons     e Simple  e Advanced    Both setups allow you to setup standard experiments  that are run locally on  a single machine  or remote experiments  which are distributed between several  hosts  The distribution of experiments cuts down the time the experiments will  take until completion  but on the other hand the setup takes more time    The next section covers the standard experiments  both  simple and ad   vanced   followed by the remote experiments and finally the analysing of the  results     53    54 CHAPTER 5  EXPERIMENTER  5 2 Standard Experiments    5 2 1 Simple  5 2 1 1 New experiment    After clicking New default parameters for an Experiment are defined                                                                                      weka Experiment Environment l  oj xj  Setup   Run   Analyse     Experiment Configuration Mode   8  Simple    Advanced 
...different types of attributes available in WEKA:

  - numeric - continuous variables
  - date - date variables
  - nominal - predefined labels
  - string - textual data
  - relational - contains other relations, e.g., the bags in case of multi-instance data

For all of the different attribute types, WEKA uses the same class, weka.core.Attribute, but with different constructors. In the following, these different constructors are explained.

  - numeric - The easiest attribute type to create, as it requires only the name of the attribute:

      Attribute numeric = new Attribute("name_of_attr");

  - date - Date attributes are handled internally as numeric attributes, but in order to parse and present the date value correctly, the format of the date needs to be specified. The date and time patterns are explained in detail in the Javadoc of the java.text.SimpleDateFormat class. In the following, an example of how to create a date attribute using a date format of 4-digit year, 2-digit month and 2-digit day, separated by hyphens:

      Attribute date = new Attribute("name_of_attr", "yyyy-MM-dd");

  - nominal - Since nominal attributes contain predefined labels, one needs to supply these, stored in form of a weka.core.FastVector object:

      FastVector labels = new FastVector();
      labels.addElement("label_a");
      labels.addElement("label_b");
      labels.addElement("label_c");
      labels.addElement("label_d");
      Attribute nominal = new Attribute("name_of_attr", labels);

  - string - String attributes are created by supplying a null FastVector reference instead of a list of labels:

      Attribute string = new Attribute("name_of_attr", (FastVector) null);
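Put together, these constructors let one assemble a dataset entirely in memory. A sketch assuming the WEKA 3.6 API (relation and attribute names are arbitrary; note that in 3.6 a data row is a weka.core.Instance created with a weight and a value array):

```java
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

public class CreateDataset {
  public static void main(String[] args) {
    // set up the attributes: one numeric, one nominal class attribute
    FastVector labels = new FastVector();
    labels.addElement("label_a");
    labels.addElement("label_b");
    FastVector atts = new FastVector();
    atts.addElement(new Attribute("num"));
    atts.addElement(new Attribute("cls", labels));

    // create the empty dataset (initial capacity 0) and set the class
    Instances data = new Instances("example", atts, 0);
    data.setClassIndex(data.numAttributes() - 1);

    // add a single row: nominal values are stored as label indices
    double[] vals = new double[data.numAttributes()];
    vals[0] = 3.14;
    vals[1] = labels.indexOf("label_a");
    data.add(new Instance(1.0, vals));
    System.out.println(data);
  }
}
```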
...fiers.meta, weka.classifiers.trees, weka.classifiers.rules

18.4.3 Exclusion

It may not always be desired to list all the classes that can be found along the CLASSPATH. Sometimes, classes cannot be declared abstract but still shouldn't be listed in the GOE. For that reason one can list classes, interfaces, or superclasses for certain packages to be excluded from display. This exclusion is done with the following file:

  weka/gui/GenericPropertiesCreator.excludes

The format of this properties file is fairly simple:

  <key>=<prefix>:<class>[,<prefix>:<class>]

Where the <key> corresponds to a key in the GenericPropertiesCreator.props file and the <prefix> can be one of the following:

  - S - Superclass: any class derived from this will be excluded
  - I - Interface: any class implementing this interface will be excluded
  - C - Class: exactly this class will be excluded

Here are a few examples:

  # exclude all ResultListeners that also implement the ResultProducer interface
  # (all ResultProducers do that!)
  weka.experiment.ResultListener=\
    I:weka.experiment.ResultProducer

  # exclude J48 and all SingleClassifierEnhancers
  weka.classifiers.Classifier=\
    C:weka.classifiers.trees.J48,\
    S:weka.classifiers.SingleClassifierEnhancer

18.4.4 Class Discovery

Unlike the Class.forName(String) method that grabs the first class it can...
...filters.supervised.attribute

Discretize is used to discretize numeric attributes into nominal ones, based on the class information, via Fayyad & Irani's MDL method, or optionally with Kononenko's MDL method. At least some learning schemes or classifiers can only process nominal data, e.g., weka.classifiers.rules.Prism; in some cases discretization may also reduce learning time.

  java weka.filters.supervised.attribute.Discretize -i data/iris.arff \
    -o iris-nom.arff -c last
  java weka.filters.supervised.attribute.Discretize -i data/cpu.arff \
    -o cpu-classvendor-nom.arff -c first

NominalToBinary encodes all nominal attributes into binary (two-valued) attributes, which can be used to transform the dataset into a purely numeric representation, e.g., for visualization via multi-dimensional scaling.

  java weka.filters.supervised.attribute.NominalToBinary \
    -i data/contact-lenses.arff -o contact-lenses-bin.arff -c last

Keep in mind that most classifiers in WEKA utilize transformation filters internally, e.g., Logistic and SMO, so you will usually not have to use these filters explicitly. However, if you plan to run a lot of experiments, pre-applying the filters yourself may improve runtime performance.

weka.filters.supervised.instance

Resample creates a stratified subsample of the given dataset. This means that overall class distributions are approximately retained within the sample. A bias towards uniform class distribution can be specified via...
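The same discretization can be applied programmatically via the Filter.useFilter method; a sketch assuming the WEKA 3.6 API (the dataset path matches the command-line example above):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeExample {
  public static void main(String[] args) throws Exception {
    // load the data and set the class attribute (last attribute here);
    // supervised discretization requires the class to be set
    Instances data = DataSource.read("data/iris.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // configure the filter with the input format, then apply it
    Discretize filter = new Discretize();
    filter.setInputFormat(data);
    Instances nominal = Filter.useFilter(data, filter);
    System.out.println(nominal.toSummaryString());
  }
}
```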
193. g  asc   by   lt default gt      AAN RANA AAA RAR AA  iris  10  33 33   94 00 v 96 00 v  Test base slo   El AAA  tos     1  1 0 40   1 0 0   Displayed Columns Columns  Show std  deviations C  Pee   1  rules ZeroR    48055541465867954  Output Format Select  2  rules OneR   B 6   2459427002147861445  E  3  trees J48   C 0 25  M 2   217733168393644444  Perform test Save output  Result list  16 47 17   Percent_correct   rules ZeroR   4805   4 M i                                                    Averaging Result Producer    An alternative to the Cross ValidationResultProducer is the AveragingResultPro   ducer  This result producer takes the average of a set of runs  which are typ   ically cross validation runs   This result producer is identified by clicking the  Result generator panel and then choosing the AveragingResultProducer from  the GenericObjectEditor     5 2  STANDARD EXPERIMENTS 77               weka gui GenericObjectEditor    weka experiment AveragingResultProducer  About       Takes the results from a ResultProducer and submits the  average to the result listener           More          calculateStdDevs False xs          expectedResultsPerAverage  10          keyFieldName  Fold                resultProducer Choose   CrossValidationResultPro     fer                                  The associated help file is shown below        Information jaj x     NAME  weka experiment AveragingResultProducer    SYNOPSIS   Takes the results from a ResultProducer and submits the ave
194. geListener 1     m_Support  removePropertyChangeListener  1            17 4  EXTENDING THE EXPLORER 271    e additional GUI elements         the GOE for the generators     protected GenericObjectEditor m_GeneratorEditor   new GenericObjectEditor             the text area for the output of the generated data     protected JTextArea m_Output   new JTextArea           the Generate button     protected JButton m_ButtonGenerate   new JButton  Generate            the Use button     protected JButton m_ButtonUse   new JButton  Use        e the Generate button does not load the generated data directly into the  Explorer  but only outputs it in the JTextArea  the Use button loads the  data   see further down      m_ButtonGenerate addActionListener  new ActionListener   1  public void actionPerformed ActionEvent evt    DataGenerator generator    DataGenerator  m_GeneratorEditor getValue      String relName   generator getRelationName        String cname   generator getClass    getName   replaceAl1                String cmd   generator getClass    getName      if  generator instanceof OptionHandler    cmd        Utils joinOptions   OptionHandler  generator   getOptions        try       generate data  StringWriter output   new StringWriter     generator setQutput  new PrintWriter  output       DataGenerator makeData  generator  generator getOptions      m_Output setText  output  toString O        catch  Exception ex     ex printStackTrace      JOptionPane  showMessageDialog   getExplorer
195. h method of simulated annealing to find a    well scoring network structure                                           TStart  10 0  delta  0 999  markovBlanketClassifier False sal  runs  10000  scoreType BAYES E    seed  1  Open      Save      oK   Cancel                      Specific options    TStart start temperature to    delta is the factor    used to update the temperature  so tj41   ti      runs number of iterations used to traverse the search space    seed is the initialization value for the random number generator     e Tabu search  I5   using adding and deleting arrows   Tabu search performs hill climbing until it hits a local optimum  Then it    122 CHAPTER 8  BAYESIAN NETWORK CLASSIFIERS    steps to the least worse candidate in the neighborhood  However  it does  not consider points in the neighborhood it just visited in the last tl steps   These steps are stored in a so called tabu list        weka gui GenericObjectEditor    weka classifiers bayes net search local TabuSearch  About       This Bayes Network learning algorithm uses tabu search for  more  finding a well scoring Bayes network structure                                                                       initAsNaiveBayes True RA  markovBlanketClassifier  False z  maxNrOfParents  2  runs  10  scoreType BAYES      tabuList  5  useArcReversal  False X  Open    Save    OK Cancel                         Specific options   runs is the number of iterations used to traverse the search space   tabuList is the
...changing the file, you just place it in your home directory. In order to find out the location of your home directory, do the following:

  - Linux/Unix
    - Open a terminal
    - Run the following command:
        echo $HOME
  - Windows
    - Open a command prompt
    - Run the following command:
        echo %USERPROFILE%

If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH, you can generate a new GenericObjectEditor.props file and then turn dynamic class discovery off again. It is assumed that you already placed the GPC file in your home directory (see steps above) and that the weka.jar jar archive with the WEKA classes is in your CLASSPATH (otherwise you have to add it to the java call using the -classpath option).

For generating the GOE file, execute the following steps:

  - Generate a new GenericObjectEditor.props file using the following command:

    - Linux/Unix

        java weka.gui.GenericPropertiesCreator \
          $HOME/GenericPropertiesCreator.props \
          $HOME/GenericObjectEditor.props

    - Windows (command must be in one line)

        java weka.gui.GenericPropertiesCreator %USERPROFILE%\GenericPropertiesCreator.props %USERPROFILE%\GenericObjectEditor.props

  - Edit the GenericPropertiesCreator.props file in your home directory and set UseDynamic to false.

A limitation of the GOE prior to 3.4.4 was that additional classifiers, filters, etc., had to fit into the same package structure as the already existing...
...hat work column-wise
    - instance - filters that work row-wise
  - unsupervised - contains unsupervised filters, i.e., they work without taking any class distributions into account. The filter must implement the weka.filters.UnsupervisedFilter interface.
    - attribute - filters that work column-wise
    - instance - filters that work row-wise

Javadoc

The Javadoc generation works the same as with classifiers. See section "Javadoc" on page 244 for more information.

17.2.5 Revisions

Filters, like classifiers, implement the weka.core.RevisionHandler interface. This provides the functionality of obtaining the Subversion revision from within Java. Filters that are not part of the official WEKA distribution do not have to implement the method getRevision(), as the weka.filters.Filter class already implements this method. Contributions, on the other hand, need to implement it, in order to obtain the revision of this particular source file. See section "Revisions" on page 245.

17.2.6 Testing

WEKA already provides a test framework to ensure correct basic functionality of a filter. It is essential for the filter to pass these tests.

17.2.6.1 Option handling

You can check the option handling of your filter with the following tool from the command line:

  weka.core.CheckOptionHandler -W classname [-- additional parameters]

All tests need to return yes.

17.2.6.2 GenericObjectEditor

The CheckGOE class...
...the classifier to pass these tests.

Option handling

You can check the option handling of your classifier with the following tool from the command line:

  weka.core.CheckOptionHandler -W classname [-- additional parameters]

All tests need to return yes.

GenericObjectEditor

The CheckGOE class checks whether all the properties available in the GUI have a tooltip accompanying them and whether the globalInfo() method is declared:

  weka.core.CheckGOE -W classname [-- additional parameters]

All tests, once again, need to return yes.

Source code

Classifiers that implement the weka.classifiers.Sourcable interface can output Java code of the built model. In order to check the generated code, one should not only compile the code, but also test it with the following test class:

  weka.classifiers.CheckSource

This class takes the original WEKA classifier, the generated code and the dataset used for generating the model (and an optional class index) as parameters. It builds the WEKA classifier on the dataset and compares the output, the one from the WEKA classifier and the one from the generated source code, to see whether they are the same.

Here is an example call for weka.classifiers.trees.J48 and the generated class weka.classifiers.WEKAWrapper (it wraps the actual generated code in a pseudo-classifier):

  java weka.classifiers.CheckSource \
      -W weka.classifiers.trees.J48 \
      -S weka.classifiers.WEKAWrapper \
      -t data.arff

It needs to return "Tests OK!".

Unit tests
199. he entries flicks back and forth between the various results that have  been generated  Pressing Delete removes a selected entry from the results   Right clicking an entry invokes a menu containing these items     1  View in main window  Shows the output in the main window  just like  left clicking the entry      44    10     11     12     CHAPTER 4  EXPLORER    View in separate window  Opens a new independent window for view   ing the results     Save result buffer  Brings up a dialog allowing you to save a text file  containing the textual output     Load model  Loads a pre trained model object from a binary file     Save model  Saves a model object to a binary file  Objects are saved in  Java    serialized object    form     Re evaluate model on current test set  Takes the model that has  been built and tests its performance on the data set that has been specified  with the Set   button under the Supplied test set option     Visualize classifier errors  Brings up a visualization window that plots  the results of classification  Correctly classified instances are represented  by crosses  whereas incorrectly classified ones show up as squares     Visualize tree or Visualize graph  Brings up a graphical representation  of the structure of the classifier model  if possible  i e  for decision trees  or Bayesian networks   The graph visualization option only appears if a  Bayesian network classifier has been built  In the tree visualizer  you can  bring up a menu by right clicki
he necessary details are discussed, as the full source code is available from the WEKA Examples [3] (package wekaexamples.gui.explorer).

SQL worksheet

Purpose

Displaying the SqlViewer as a tab in the Explorer, instead of using it either via the Open DB... button or as a standalone application. Uses the existing components already available in WEKA and just assembles them in a JPanel. Since this tab does not rely on a dataset being loaded into the Explorer, it will be used as a standalone one.

Useful for people who work a lot with databases and would like to have an SQL worksheet available all the time, instead of clicking a button every time to open up a database dialog.

Implementation

• class is derived from javax.swing.JPanel and implements the interface weka.gui.Explorer.ExplorerPanel (the full source code also imports the weka.gui.Explorer.LogHandler interface, but that is only additional functionality):

public class SqlPanel
  extends JPanel
  implements ExplorerPanel {

• some basic members that we need to have:

/** the parent frame */
protected Explorer m_Explorer = null;

/** sends notifications when the set of working instances gets changed */
protected PropertyChangeSupport m_Support = new PropertyChangeSupport(this);

• methods we need to implement due to the used interfaces:

/** Sets the Explorer to use as parent frame */
public void setExplorer(Explorer parent) {
  m_Ex
header = new Instances(inst, 0);
SerializationHelper.writeAll("/some/where/j48.model", new Object[]{cls, header});

Chapter 17

Extending WEKA

For most users, the existing WEKA framework will be sufficient to perform the task at hand, offering a wide range of filters, classifiers, clusterers, etc. Researchers, on the other hand, might want to add new algorithms and compare them against existing ones. The framework with its existing algorithms is not set in stone, but basically one big plugin framework. With WEKA's automatic discovery of classes on the classpath, adding new classifiers, filters, etc. to the existing framework is very easy.
Though algorithms like clusterers, associators, data generators and attribute selection are not covered in this chapter, their implementation is very similar to that of a classifier. You basically choose a superclass to derive your new algorithm from and then implement additional interfaces, if necessary. Just check out the other algorithms that are already implemented.
The section covering the GenericObjectEditor (see chapter 8.4) shows you how to tell WEKA where to find your class(es), thereby making them available in the GUI (Explorer/Experimenter) via the GenericObjectEditor.

17.1 Writing a new Classifier

17.1.1 Choosing the base class

The ancestor of all classifiers
202. heme entropy 2 9327 6 3651  SF_nean entropy gain  1 7407 6 3616  KB_information 102 0606 3 995  KB_nean information 1 1365 0 0432  KB relative information 8383 6816 335 4174  True_positive_rate 0 51 0 5024  Nun_true positives 0 51 0 5024  False positive_rate 0 0001 0 0011 al  Mi falsa sities na na le                                274 CHAPTER 17  EXTENDING WEKA    17 4 2 Adding visualization plugins  Introduction    As of WEKA version 3 5 3 you can easily add visualization plugins in the Ex   plorer  Classify panel   This makes it easy to implement custom visualizations  if  the ones WEKA offers are not sufficient  The following examples can be found in  the Examples collection  3   package wekaexamples gui visualize plugins      Requirements  e custom visualization class must implement the following interface    weka gui visualize plugins VisualizePlugin    e the class must either reside in the following package  visualization classes  are automatically discovered during run time     weka gui visualize plugins    e or you must list the package this class belongs to in the properties file  weka gui GenericPropertiesCreator props  or the equivalent in your  home directory  under the key weka  gui  visualize plugins VisualizePlugin     Implementation  The visualization interface contains the following four methods    e getMinVersion     This method returns the minimum version  inclusive   of WEKA that is necessary to execute the plugin  e g   3 5 0    e getMaxVersion     This m
203. ibute   attribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute   Gattribute    Key_Dataset  iris    Key_Run  1 2 3 4 5 6 7 8 9 10   Key_Scheme  weka classifiers rules ZeroR weka classifiers trees J48   Key_Scheme_options       C 0 25  M 2      Key_Scheme_version_ID  48055541465867954   217733168393644444   Date_time numeric  Number_of_training_instances numeric  Number_of_testing_instances numeric  Number_correct numeric  Number_incorrect numeric  Number_unclassified numeric  Percent_correct numeric  Percent_incorrect numeric  Percent_unclassified numeric  Kappa_statistic numeric  Mean_absolute_error numeric  Root_mean_squared_error numeric  Relative_absolute_error numeric  Root_relative_squared_error numeric  SF_prior_entropy numeric  SF_scheme_entropy numeric  SF_entropy_gain numeric  SF_mean_prior_entropy numeric  SF_mean_scheme_entropy numeric  SF_mean_entropy_gain numeric  KB_information numeric    68     attribute  Oattribute  Oattribute  Oattribute   attribute   attribute   attribute   attribute   attribute   attribute   attribute   attribute   attribute   attribute   attribute  Oattribute   attribute       Number    gt  Number   attribute   attribute   attribute     data    CHAPTER 5  EXPERIMENTER    KB_mean_information numeric  KB_relative_information numeric  True_positive_rate numeric  Num_true_positives numeric  False_positive_rate numeric  Num_false_posi
ibute specification

Via the class="yes" attribute in the attribute specification in the header, one can define which attribute should act as the class attribute. This feature can be used on the command line as well as in the Experimenter (which now can also load other data formats), and it removes the limitation of the class attribute always having to be the last one.
Snippet from the iris dataset:

<attribute class="yes" name="class" type="nominal">

10.5.2 Attribute weights

Attribute weights are stored in an attribute's meta-data tag (in the header section). Here is an example of the petalwidth attribute with a weight of 0.9:

<attribute name="petalwidth" type="numeric">
  <metadata>
    <property name="weight">0.9</property>
  </metadata>
</attribute>

10.5.3 Instance weights

Instance weights are defined via the weight attribute in each instance tag. By default, the weight is 1. Here is an example:

<instance weight="0.75">
  <value>5.1</value>
  <value>3.5</value>
  <value>1.4</value>
  <value>0.2</value>
  <value>Iris-setosa</value>
</instance>

Chapter 11

Converters

11.1 Introduction

Weka offers conversion utilities for several formats, in order to allow import from different sorts of datasources. These utilities, called converters, are all located in the following package:

w
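As a quick illustration of the converter framework introduced here, the DataSource helper class (an inner class of weka.core.converters.ConverterUtils) picks the matching converter based on the file extension. This is only a sketch; the file path below is a placeholder:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// DataSource selects the appropriate loader from the extension;
// "/some/where/iris.xrff" is a placeholder path
Instances data = DataSource.read("/some/where/iris.xrff");
System.out.println("Loaded " + data.numInstances() + " instances");
```

The same call works for ARFF, CSV and the other supported formats, since the loader is chosen at run time.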
205. idation 7 Number of repetitions  10  Number of folds   10  8  Data sets first     Classification    Regression    Algorithms first  Datasets Algorithms  Add new      Edit selecte    Delete select      Add new      Edit selected    Delete selected                    y  Use relative pat               dataliris artt       Up Down Load options  Save options    Up   Down                            5 2 1 5 Iteration control    e Number of repetitions  In order to get statistically meaningful results  the default number of it   erations is 10  In case of 10 fold cross validation this means 100 calls of  one classifier with training data and tested against test data     e Data sets first  Algorithms first  As soon as one has more than one dataset and algorithm  it can be useful  to switch from datasets being iterated over first to algorithms  This is  the case if one stores the results in a database and wants to complete the  results for all the datasets for one algorithm as early as possible     5 2 1 6 Algorithms    New algorithms can be added via the Add new    button  Opening this dialog  for the first time  ZeroR is presented  otherwise the one that was selected last          weka gui GenericObjectEditor       Choose   weka classifiers rules ZeroR             About                   Class for building and using a 0 R classifier  More  Capabilities  debug False RA  Open    Save    oK Cancel                                  With the Choose button one can open the GenericObjectEdit
206. ies  From   1   To   10   p AS  eta Enabled   Select property      9  By dataset O Byrun  Iteration control Choose      J48 C 0 25 M2   ma     8  Data sets first    Custom generator first ZeroR  J48  C 0 25 M 2  Datasets  Add new    Edit selecte    Delete select                        v  Use relative pat                  dataliris arff                                    L J Pesca Delete Edit Up Down          Notes                  Now when the experiment is run  results are generated for both schemes   To add additional schemes  repeat this process  To remove a scheme  select  the scheme by clicking on it and then click Delete     Adding Additional Datasets    The scheme s  may be run on any number of datasets at a time  Additional  datasets are added by clicking Add new    in the Datasets panel  Datasets are  deleted from the experiment by selecting the dataset and then clicking Delete  Selected     74 CHAPTER 5  EXPERIMENTER    Raw Output    The raw output generated by a scheme during an experiment can be saved to  a file and then examined at a later time  Open the ResultProducer window by  clicking on the Result generator panel in the Setup tab       weka gui GenericObjectEditor    weka experiment RandomSplitResultProducer  About       Performs a random train and test using a supplied More  evaluator              outputFile  splitEvalutorOutzio       randomizeData  True nA             rawOutput             splitEvaluator Choose    ClassifierSplitEvaluator    weka classifir
ifier's filtering on the fly.

• Using a filter - for preprocessing the data.

• Low-level API usage - instead of using the meta-schemes (classifier or filter), one can use the attribute selection API directly as well.

The following sections cover each of the topics, accompanied with a code example. For clarity, the same evaluator and search algorithm is used in all of these examples.
Feel free to check out the example classes of the Weka Examples collection [3], located in the wekaexamples.attributeSelection package.

16.8.1 Using the meta-classifier

The meta-classifier AttributeSelectedClassifier (this classifier is located in package weka.classifiers.meta) is similar to the FilteredClassifier. But instead of taking a base classifier and a filter as parameters to perform the filtering, the AttributeSelectedClassifier uses a search algorithm (derived from weka.attributeSelection.ASSearch) and an evaluator (superclass is weka.attributeSelection.ASEvaluation) to perform the attribute selection, and a base classifier to train on the reduced data.
This example here uses J48 as base classifier, CfsSubsetEval as evaluator and a backwards operating GreedyStepwise as search method:

import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.
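The snippet breaks off in this copy; a complete sketch of the setup just described might look as follows. It assumes an Instances object data (with its class attribute set) has been loaded elsewhere:

```java
import java.util.Random;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;

// configure the evaluator and a backwards-operating search
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);

// assemble the meta-classifier: selection plus base classifier
AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
classifier.setEvaluator(eval);
classifier.setSearch(search);
classifier.setClassifier(new J48());

// evaluate the whole setup with 10-fold cross-validation
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(evaluation.toSummaryString());
```

Because the attribute selection happens inside the meta-classifier, it is redone on each training fold, so the cross-validation estimate is not biased by selecting attributes on the full dataset.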
208. ifiers lazy    weka classifiers meta    weka classifiers trees    weka classifiers rules    dummy classifiers    286 CHAPTER 18  TECHNICAL DOCUMENTATION    Your java call for the GUIChooser might look like this   java  classpath  weka jar dummy jar  weka gui GUIChooser    Starting up the GUI you will now have another root node in the tree view of the  classifiers  called root  and below it the weka and the dummy package hierarchy  as you can see here         12 root     Y   2 weka   Y   classifiers   gt     bayes   gt     gt  functions   gt  12 lazy   gt  2 meta   gt  mi   gt   J misc   gt   2 trees      gt    rules     Y F dummy     Y   classifiers      _ DummyBayes         Filter       Remove filter         Close         18 4  GENERICOBJECTEDITOR 287    18 4 6 Capabilities    Version 3 5 3 of Weka introduced the notion of Capabilities  Capabilities basi   cally list what kind of data a certain object can handle  e g   one classifier can  handle numeric classes  but another cannot  In case a class supports capabili   ties the additional buttons Filter    and Remove filter will be available in the  GOE  The Filter    button pops up a dialog which lists all available Capabilities      e oe Filtering Capabilities      Classifiers have to support at least the following capabilities   the ones highlighted don t meet these requirements  the ones highlighted blue possibly meet them         O Nominal attributes  O Binary attributes   C Unary attributes   O Empty nominal attributes  
indices = attsel.selectedAttributes();
System.out.println("selected attribute indices (starting with 0):\n" + Utils.arrayToString(indices));

16.9 Saving data

Saving weka.core.Instances objects is as easy as reading the data in the first place, though the process of storing the data again is far less common than that of reading the data into memory. The following two sections cover how to save the data in files and in databases.
Just like with loading the data in chapter 16.2, example classes for saving data can be found in the wekaexamples.core.converters package of the Weka Examples collection [3].

16.9.1 Saving data to files

Once again, one can either let WEKA choose the appropriate converter for saving the data or use an explicit converter (all savers are located in the weka.core.converters package). The latter approach is necessary if the file name under which the data will be stored does not have an extension that WEKA recognizes.
Use the DataSink class (inner class of weka.core.converters.ConverterUtils) if the extensions are not a problem. Here are a few examples:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
...
// data structure to save
Instances data = ...
...
// save as ARFF
DataSink.write("/some/where/data.arff", data);
// save as CSV
DataSink.write("/some/where/data.csv", data);

And here is an example of using the CSVSaver converter explicitly:

import w
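The explicit-converter example is truncated in this copy; based on the saver API (setInstances, setFile, writeBatch), a sketch might look like this, with the output path being a placeholder:

```java
import java.io.File;
import weka.core.Instances;
import weka.core.converters.CSVSaver;

// "data" is the Instances object to save (loaded elsewhere)
CSVSaver saver = new CSVSaver();
saver.setInstances(data);
saver.setFile(new File("/some/where/data.csv"));
saver.writeBatch();
```

The same pattern applies to the other explicit savers in weka.core.converters, e.g., ArffSaver or XRFFSaver.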
ing - In contrast to nominal attributes, this type does not store a predefined list of labels. Normally used to store textual data, i.e., content of documents for text categorization. The same constructor as for the nominal attribute is used, but a null value is provided instead of an instance of FastVector:

Attribute string = new Attribute("name_of_attr", (FastVector) null);

• relational - This attribute just takes another weka.core.Instances object for defining the relational structure in the constructor. The following code snippet generates a relational attribute that contains a relation with two attributes, a numeric and a nominal attribute:

FastVector atts = new FastVector();
atts.addElement(new Attribute("rel.num"));
FastVector values = new FastVector();
values.addElement("val_A");
values.addElement("val_B");
values.addElement("val_C");
atts.addElement(new Attribute("rel.nom", values));
Instances rel_struct = new Instances("rel", atts, 0);
Attribute relational = new Attribute("name_of_attr", rel_struct);

A weka.core.Instances object is then created by supplying a FastVector object containing all the attribute objects. The following example creates a dataset with two numeric attributes and a nominal class attribute with two labels, "no" and "yes":

Attribute num1 = new Attribute("num1");
Attribute num2 = new Attribute("num2");
FastVector labels = new FastVector();
labels.ad
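The final snippet breaks off in this copy; a sketch completing the two-numeric-attributes-plus-class example could look like this (the dataset name "Test-dataset" is an arbitrary choice):

```java
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instances;

Attribute num1 = new Attribute("num1");
Attribute num2 = new Attribute("num2");

// the nominal class attribute with labels "no" and "yes"
FastVector labels = new FastVector();
labels.addElement("no");
labels.addElement("yes");
Attribute cls = new Attribute("class", labels);

// assemble the dataset (initial capacity 0) and set the class
FastVector attributes = new FastVector();
attributes.addElement(num1);
attributes.addElement(num2);
attributes.addElement(cls);
Instances dataset = new Instances("Test-dataset", attributes, 0);
dataset.setClassIndex(dataset.numAttributes() - 1);
```

Setting the class index explicitly is necessary, since Instances does not assume any attribute to be the class by default.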
211. ing a brief help    e  i  lt input file gt   The file to process      e  o  lt output file gt   The file to output the processed data to  default stdout     e     Uses lowercase strings  i e   the input is automatically converted to lower  case    12 3 2 StringToWordVector    Just use the GenericObjectEditor to choose the right stemmer and the desired  options  if the stemmer offers additional options      12 4 Adding new stemmers    You can easily add new stemmers  if you follow these guidelines  for use in the  GenericObjectEditor      e they should be located in the weka  core  stemmers package  if not  then  the GenericObjectEditor props GenericPropertiesCreator props file  need to be updated  and    e they must implement the interface weka core stemmers Stemmer     Chapter 13    Databases    13 1 Configuration files  Thanks to JDBC it is easy to connect to Databases that provide a JDBC    driver  Responsible for the setup is the following properties file  located in  the weka experiment package     DatabaseUtils props    You can get this properties file from the weka jar or weka src jar jar archive   both part of a normal Weka release  If you open up one of those files  you   ll find  the properties file in the sub folder weka experiment    Weka comes with example files for a wide range of databases     e DatabaseUtils props hsql   HSQLDB   gt  3 4 1     e DatabaseUtils props msaccess   MS Access   gt 3 4 14   gt 3 5 8   gt 3 6 0   see the Windows databases chapter for m
212. ing it into the Name field   c  Add a description for this source in the Description field   d     d  Select the server you re connecting to from the Server combobox    185    186 CHAPTER 14  WINDOWS DATABASES     e  For the verification of the authenticity of the login ID choose  With SQL Server       f  Check Connect to SQL Server to obtain default settings     and supply the user ID and password with which you installed  the Desktop Engine    g  Just click on Next until it changes into Finish and click this   too    h  For testing purposes  click on Test Data Source      the result  should be TESTS COMPLETED SUCCESSFULLY     i  Click on OK   e MySQL    a  Choose the MySQL ODBC driver and click Finish    b  Give the source a name by typing it into the Data Source  Name field    c  Add a description for this source in the Description field    d  Specify the server you re connecting to in Server    e  Fill in the user to use for connecting to the database in the User  field  the same for the password        Choose the database for this DSN from the Database combobox    g  Click on OK    6  Your DSN should now be listed in the User Data Sources list    Step 2  Set up the DatabaseUtils props file    You will need to configure a file called DatabaseUtils props  This file already  exists under the path weka experiment  in the weka  jar file  which is just a  ZIP file  that is part of the Weka download  In this directory you will also find a  sample file for ODBC connectivity  cal
ing the splitOptions(String) method of the weka.core.Utils class. Here is an example:

import weka.core.Utils;
...
String[] options = Utils.splitOptions("-R 1");

As this method ignores whitespaces, using " -R 1" or "-R 1 " will return the same result as "-R 1".
Complicated command lines with lots of nested options, e.g., options for the support vector machine classifier SMO (package weka.classifiers.functions) including a kernel setup, are a bit tricky, since Java requires one to escape double quotes and backslashes inside a String. The Wiki [2] article "Use Weka in your Java code" references the Java class OptionsToCode, which turns any command line into appropriate Java source code. This example class is also available from the Weka Examples collection [3]: weka.core.OptionsToCode.

Instead of using the Remove filter's setOptions(String[]) method, the following code snippet uses the actual set-method for this property:

import weka.filters.unsupervised.attribute.Remove;
...
Remove rm = new Remove();
rm.setAttributeIndices("1");

In order to find out which option belongs to which property, i.e., get/set-method, it is best to have a look at the setOptions(String[]) and getOptions() methods. In case these methods use the member variables directly, one just has to look for the methods making this particular member variable accessible to the outside.
Using the set-methods, one will most likely come ac
ion of weka.core.Option objects. This enumeration is used to display the help on the command line, hence it needs to return the Option objects of the superclass as well.

setOptions(String[])
parses the options that the classifier would receive from a command-line invocation. A parameter and argument are always two elements in the string array. A common mistake is to use a single cell in the string array for both of them, e.g., "-S 1" instead of "-S", "1". You can use the methods getOption and getFlag of the weka.core.Utils class to retrieve the values of an option or to ascertain whether a flag is present. But note that these calls remove the option and, if applicable, the argument from the string array ("destructive"). The last call in the setOptions method should always be the super.setOptions(String[]) one, in order to pass on any other arguments still present in the array to the superclass. The following code snippet just parses the only option "alpha" that an imaginary classifier defines:

import weka.core.Utils;
...
public void setOptions(String[] options) throws Exception {
  String tmpStr = Utils.getOption("alpha", options);
  if (tmpStr.length() == 0) {
    setAlpha(0.75);
  } else {
    setAlpha(Double.parseDouble(tmpStr));
  }
  super.setOptions(options);
}

getOptions()
returns a string array of command-line options that resemble the current classifier setup. Supplying this array to the se
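To complement the setOptions example for the imaginary "alpha" option, a matching getOptions sketch could look like this; the option name and the getAlpha() accessor are assumptions carried over from that snippet:

```java
import java.util.Vector;

public String[] getOptions() {
  Vector<String> result = new Vector<String>();

  // the single "-alpha" option of the imaginary classifier
  result.add("-alpha");
  result.add("" + getAlpha());

  // append any options handled by the superclass
  String[] options = super.getOptions();
  for (int i = 0; i < options.length; i++)
    result.add(options[i]);

  return result.toArray(new String[result.size()]);
}
```

Keeping setOptions and getOptions symmetric ensures that a command line produced by getOptions can be fed back through setOptions to restore the same setup.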
ion time, e.g., k-nearest neighbor (k-NN).
• meta - meta-classifiers that use one or more base classifiers as input, e.g., boosting, bagging or stacking.
• mi - classifiers that handle multi-instance data.
• misc - various classifiers that don't fit in any other category.
• rules - rule-based classifiers, e.g., ZeroR.
• trees - tree classifiers, like decision trees, with J48 a very common one.

17.1.4 Implementation

In the following you will find information on what methods need to be implemented and other coding guidelines for methods, option handling and documentation of the source code.

17.1.4.1 Methods

This section explains what methods need to be implemented in general and more specialized ones in case of meta-classifiers (either with single or multiple base classifiers).

General
Here is an overview of methods that your new classifier needs to implement in order to integrate nicely into the WEKA framework:

globalInfo()
returns a short description that is displayed in the GUI, like the Explorer or Experimenter. How long this description will be is really up to you, but it should be sufficient to understand the classifier's underlying algorithm. If the classifier implements the weka.core.TechnicalInformationHandler interface, then you could refer to the publication(s) by extending the returned string with getTechnicalInformation().toString().

listOptions()
returns a java.util.Enumerat
is-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

java weka.classifiers.trees.J48 -t ./data/iris.arff

3.3 Command redirection

Starting with this version of Weka one can perform a basic redirection:

java weka.classifiers.trees.J48 -t test.arff > j48.txt

Note: the > must be preceded and followed by a space, otherwise it is not recognized as redirection, but as part of another parameter.

3.4 Command completion

Commands starting with java support completion for classnames and filenames via Tab (Alt+BackSpace deletes parts of the command again). In case there are several matches, Weka lists all possible matches.

• package name completion
java weka.cl<Tab>
results in the following output of possible matches of package names:
Possible matches:
weka.classifiers
weka.clusterers

• classname completion
java weka.classifiers.meta.A<Tab>
lists the following classes:
Possible matches:
weka.classifiers.meta.AdaBoostM1
weka.classifiers.meta.AdditiveRegression
weka.classifiers.meta.AttributeSelectedClassifier

• filename completion
In order for Weka to determine whether the string under the cursor is a classname or a filename, filenames need to be absolute (Unix/Linux: /some/path/file; Windows: C:\Some\Path\file) or relative and starting with a dot (Unix/Linux: ./some/other/path/file; Windows: .\Some\Other\Path\file).

Chapt
ision-recall and true/false statistics for each class are output. This option is also selected by default.

• Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.

• Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.

• Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is selected by default.

• Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data.

• Output additional attributes. If additional attributes need to be output alongside the predictions, e.g., an ID attribute for tracking misclassifications, then the index of this attribute can be specified here. The usual Weka ranges are supported; "first" and "last" are therefore valid indices as well (example: "first-3,6,8,12-last").

• Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.

• Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.

• Preserve order for % Split. This suppresses the randomization of the data before splitting into train
it costly, especially if one is adding a lot of rows. Therefore, the following code examples cover the first constructor. For simplicity, an Instances object "data" based on the code snippets for the different attributes introduced above is used, as it contains all possible attribute types.

For each instance, the first step is to create a new double array to hold the attribute values. It is important not to reuse this array, but always create a new one, since WEKA only references it and does not create a copy of it when instantiating the Instance object. Reusing it means changing the previously generated Instance object:

double[] values = new double[data.numAttributes()];

After that, the double array is filled with the actual values:

• numeric - just sets the numeric value:
values[0] = 1.23;

• date - turns the date string into a double value:
values[1] = data.attribute(1).parseDate("2001-11-09");

• nominal - determines the index of the label:
values[2] = data.attribute(2).indexOfValue("label_b");

• string - determines the index of the string, using the addStringValue method (internally, a hashtable holds all the string values):
values[3] = data.attribute(3).addStringValue("This is a string");

• relational - first, a new Instances object based on the attribute's relational definition has to be created, before the index of it can be determined, using the addRelation method:

Instances dataRel = ne
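Once the values array has been filled as described, the Instance object itself still has to be created and added to the dataset. A minimal sketch, using the weka.core.Instance class of the 3.6 API and the "data" object from above:

```java
import weka.core.Instance;

// wrap the filled double array in an Instance with weight 1.0;
// the array is referenced, not copied, so do not reuse it
Instance inst = new Instance(1.0, values);

// add it to the dataset (add() stores a shallow copy)
data.add(inst);
```

Repeating these steps (new array, fill, wrap, add) for each row builds up the dataset in memory.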
219. ith  Paired T Tester  cor     v Available resultsets   1  rules ZeroR    48055541465867954  Row Select  2  rules  OneR   B 6   2459427002147861445  ins  3  trees J48   C 0 25  M 2   217733168393644444  Column Select  Comparison field  Percent_correct    4  Significance  0 05  Sorting  asc   by   lt default gt  iz   Test base Select  Displayed Columns Columns  Show std  deviations  Output Format Select  Perform test   Save output  Result list  16 36 04   Available resultsets                               5 5  ANALYSING RESULTS 89    The number of result lines available  Got 30 results  is shown in the Source  panel  This experiment consisted of 10 runs  for 3 schemes  for 1 dataset  for a  total of 30 result lines  Results can also be loaded from an earlier experiment file  by clicking File and loading the appropriate  arff results file  Similarly  results  sent to a database  using the DatabaseResultListener  can be loaded from the  database     Select the Percent_correct attribute from the Comparison field and click  Perform test to generate a comparison of the 3 schemes                                                                                                  weka Experiment Environment   oj x   Setup   Run   Analyse    Source  f 11  Got 30 results File      Database    Jl Experiment    Configure test i Test output  Testing with  Paired T Tester  cor         Tester  weka  experiment  PairedCorrectedTTester  analysing  Percent_correct  Row Select Datasets  1  4    Resu
220. lace this on the layout     e Now connect the ArffLoader to the ClassAssigner  first right click over  the ArffLoader and select the dataSet under Connections in the menu   A rubber band line will appear  Move the mouse over the ClassAssigner  component and left click   a red line labeled dataSet will connect the two  components     e Next right click over the ClassAssigner and choose Configure from the  menu  This will pop up a window from which you can specify which  column is the class in your data  last is the default      e Next choose the Class ValuePicker  allows you to choose which class label  to be evaluated in the ROC  component from the toolbar  Place this  on the layout and right click over ClassAssigner and select dataSet from  under Connections in the menu and connect it with the Class ValuePicker     e Next grab a Cross ValidationFoldMaker component from the Evaluation  toolbar and place it on the layout  Connect the ClassAssigner to the  CrossValidationFoldMaker by right clicking over ClassAssigner and se   lecting dataSet from under Connections in the menu     104    CHAPTER 6  KNOWLEDGEFLOW    e Next click on the Classifiers tab at the top of the window and scroll along    the toolbar until you reach the J48 component in the trees section  Place  a J48 component on the layout     e Connect the CrossValidationFoldMaker to J48 TWICE by first choosing    trainingSet and then testSet from the pop up menu for the CrossValida   tionFoldMaker     e Repeat these tw
...lass
sepalwidth(2): class
petallength(2): class sepallength
petalwidth(2): class petallength
class(3):

This list specifies the network structure. Each of the variables is followed by a list of parents, so the petallength variable has parents sepallength and class, while class has no parents. The number in braces is the cardinality of the variable. It shows that in the iris dataset the class variable has three values. All other variables are made binary by running it through a discretization filter.

LogScore Bayes: -374.9942769685747
LogScore BDeu: -351.85811477631626
LogScore MDL: -416.86897021246466
LogScore ENTROPY: -366.76261727150217
LogScore AIC: -386.76261727150217

These lines list the logarithmic score of the network structure for various methods of scoring.

If a BIF file was specified, the following two lines will be produced (if no such file was specified, no information is printed):

Missing: 0 Extra: 2 Reversed: 0
Divergence: 0.0719759699700729

In this case the network that was learned was compared with a file iris.xml which contained the naive Bayes network structure. The number after "Missing" is the number of arcs that was in the network in file that is not recovered by the structure learner. Note that a reversed arc is not counted as missing. The number after "Extra" is the number of arcs in the learned network that are not in the network on file. The number of reversed arcs...
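The bookkeeping behind the Missing/Extra/Reversed counts can be sketched in plain Java. The class and arc representation below are made up for illustration (this is not WEKA's actual implementation): each arc is a directed "parent->child" string, and the learned set is compared against the reference set from the BIF file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ArcDiff {

    /** Returns {missing, extra, reversed} for a learned arc set vs. a reference set. */
    public static int[] compare(Set<String> reference, Set<String> learned) {
        int missing = 0, extra = 0, reversed = 0;
        for (String arc : reference) {
            String[] p = arc.split("->");
            String rev = p[1] + "->" + p[0];
            if (learned.contains(arc)) continue;        // recovered exactly
            else if (learned.contains(rev)) reversed++; // present, but direction flipped
            else missing++;                             // not recovered at all (a reversed arc is NOT missing)
        }
        for (String arc : learned) {
            String[] p = arc.split("->");
            String rev = p[1] + "->" + p[0];
            // an arc counts as extra only if neither it nor its reversal is in the reference
            if (!reference.contains(arc) && !reference.contains(rev)) extra++;
        }
        return new int[]{missing, extra, reversed};
    }

    public static void main(String[] args) {
        Set<String> ref = new HashSet<>(Arrays.asList("class->sepallength", "class->petallength"));
        Set<String> learned = new HashSet<>(Arrays.asList("class->sepallength", "petallength->class"));
        System.out.println(Arrays.toString(compare(ref, learned)));
    }
}
```

Note how the first loop implements the rule from the text: a flipped arc increments only the reversed counter, never the missing one.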
...lected statistics:

    import weka.core.Instances;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    ...
    Instances train = ...;   // from somewhere
    Instances test = ...;    // from somewhere

    // train classifier
    Classifier cls = new J48();
    cls.buildClassifier(train);

    // evaluate classifier and print some statistics
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(cls, test);
    System.out.println(eval.toSummaryString("\nResults\n\n", false));

Statistics

In the previous sections, the toSummaryString of the Evaluation class was already used in the code examples. But there are other summary methods for nominal class attributes available as well:

- toMatrixString - outputs the confusion matrix.
- toClassDetailsString - outputs TP/FP rates, precision, recall, F-measure, AUC (per class).
- toCumulativeMarginDistributionString - outputs the cumulative margins distribution.

If one does not want to use these summary methods, it is possible to access the individual statistical measures directly. Below, a few common measures are listed:

- nominal class attribute
  - correct() - The number of correctly classified instances. The incorrectly classified ones are available through incorrect().
  - pctCorrect() - The percentage of correctly classified instances (accuracy). pctIncorrect() returns the number of misclassified ones.
  - areaUnderROC(int) - The AUC for the specified class lab...
...led DatabaseUtils.props.odbc, and one specifically for MS Access, called DatabaseUtils.props.msaccess, also using ODBC. You should use one of the sample files as basis for your setup, since they already contain default values specific to ODBC access.

This file needs to be recognized when the Explorer starts. You can achieve this by making sure it is in the working directory or the home directory (if you are unsure what the terms working directory and home directory mean, see the Notes section). The easiest is probably the second alternative, as the setup will apply to all the Weka instances on your machine.

Just make sure that the file contains the following lines at least:

    jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
    jdbcURL=jdbc:odbc:dbname

where dbname is the name you gave the user DSN. (This can also be changed once the Explorer is running.)

Step 3: Open the database

1. Start up the Weka Explorer.
2. Choose Open DB...
3. The URL should read "jdbc:odbc:dbname", where dbname is the name you gave the user DSN.
4. Click Connect.
5. Enter a Query, e.g. "select * from tablename", where tablename is the name of the database table you want to read. Or you could put a more complicated SQL query here instead.
6. Click Execute.
7. When you're satisfied with the returned data, click OK to load the data into the Preprocess panel.

Notes

- Working directory
  The directory a process is started from. When you start Weka from the Wind...
...leted.

These categories should make it clear what the difference between the two Discretize filters in WEKA is: the supervised one takes the class attribute and its distribution over the dataset into account in order to determine the optimal number and size of bins, whereas the unsupervised one relies on a user-specified number of bins.

Apart from this classification, filters are either stream- or batch-based. Stream filters can process the data straight away and make it immediately available for collection again. Batch filters, on the other hand, need a batch of data to set up their internal data structures. The Add filter (this filter can be found in the weka.filters.unsupervised.attribute package) is an example of a stream filter: adding a new attribute with only missing values does not require any sophisticated setup. However, the ReplaceMissingValues filter (same package as the Add filter) needs a batch of data in order to determine the means and modes for each of the attributes. Otherwise, the filter will not be able to replace the missing values with meaningful values. But as soon as a batch filter has been initialized with the first batch of data, it can also process data on a row-by-row basis, just like a stream filter.

Instance-based filters are a bit special in the way they handle data. As mentioned earlier, all filters can process data on a row-by-row basis after the first batch of data has been passed through. Of course, if a filter adds...
...list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this:

    Tabs=weka.gui.explorer.ClassifierPanel,\
    weka.gui.explorer.ExperimentPanel,\
    weka.gui.explorer.ClustererPanel,\
    weka.gui.explorer.AssociationsPanel,\
    weka.gui.explorer.AttributeSelectionPanel,\
    weka.gui.explorer.VisualizePanel

Screenshot:

[Screenshot: the Weka Explorer with the additional Experiment tab selected (between Classify and Cluster), showing a J48 '-C 0.25 -M 2' experiment with 10 runs of cross-validation and its output statistics, e.g. Percent_correct 98.5749 +- 1.0387.]
...ll.

We will begin by describing basic concepts and ideas. Then, we will describe the weka.filters package, which is used to transform input data, e.g. for preprocessing, transformation, feature generation and so on.

Then we will focus on the machine learning algorithms themselves. These are called Classifiers in WEKA. We will restrict ourselves to common settings for all classifiers and shortly note representatives for all main approaches in machine learning.

Afterwards, practical examples are given.

Finally, in the doc directory of WEKA you find a documentation of all java classes within WEKA. Prepare to use it since this overview is not intended to be complete. If you want to know exactly what is going on, take a look at the mostly well-documented source code, which can be found in weka-src.jar and can be extracted via the jar utility from the Java Development Kit (or any archive program that can handle ZIP files).

1.2 Basic concepts

1.2.1 Dataset

A set of data items, the dataset, is a very basic concept of machine learning. A dataset is roughly equivalent to a two-dimensional spreadsheet or database table. In WEKA, it is implemented by the weka.core.Instances class. A dataset is a collection of examples, each one of class weka.core.Instance. Each Instance consists of a number of attributes, any of which can be nominal (= one of a predefined list of values), numeric (= a real or integer number) or...
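As a concrete illustration of these attribute types, a minimal hand-written ARFF file (made up for this example, not one shipped with WEKA) mixing nominal and numeric attributes could look like this:

```
% a tiny weather-style dataset
@relation weather-mini

@attribute outlook {sunny, overcast, rainy}   % nominal: predefined list of values
@attribute temperature numeric                % numeric: real or integer number
@attribute play {yes, no}                     % nominal class attribute

@data
sunny,85,no
overcast,64,yes
rainy,70,yes
```

Each line in the @data section is one example, i.e. one weka.core.Instance in the loaded weka.core.Instances object.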
...ll bring up a popup menu, listing the following options:

1. Show properties... has the same effect as left-clicking on the field, i.e. a dialog appears allowing you to alter the settings.

2. Copy configuration to clipboard copies the currently displayed configuration string to the system's clipboard and therefore can be used anywhere else in WEKA or in the console. This is rather handy if you have to set up complicated, nested schemes.

3. Enter configuration... is the "receiving" end for configurations that got copied to the clipboard earlier on. In this dialog you can enter a classname followed by options (if the class supports these). This also allows you to transfer a filter setting from the Preprocess panel to a FilteredClassifier used in the Classify panel.

Left-clicking on any of these gives an opportunity to alter the filter's settings. For example, the setting may take a text string, in which case you type the string into the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required. Information on the options is provided in a tool tip if you let the mouse pointer hover over the corresponding field. More information on the filter and its options can be obtained by clicking on the More button in the About panel at the top of the GenericObjectEditor window.

Some objects display a brief description of what they d...
...log, as well as the setup for running regular attribute selection.

4.7 Visualizing

[Screenshot: the Visualize panel showing a scatter plot matrix of the weather data (outlook, temperature, humidity, windy), coloured by play, with PlotSize, PointSize, Update, Jitter, Select Attributes and SubSample controls.]

WEKA's visualization section allows you to visualize 2D plots of the current relation.

4.7.1 The scatter plot matrix

When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, colour coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the attribute used to colour the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to sub-sample the data. Note that changes will only come into effect once the Update button has been pressed.

4.7.2 Selecting an individual 2D scatter plot

When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualizatio...
[Screenshot: the Analyse panel after performing the test (Comparison field Percent_correct, Significance 0.05). The Test output pane shows:

    Dataset        (1) rules.Ze |   (2) rules.On    (3) trees.J4
    -----------------------------------------------------------
    iris     (10)  33.33        |   94.31 v         94.90 v
    -----------------------------------------------------------
                   (v/ /*)      |   (1/0/0)         (1/0/0)

    Key:
    (1) rules.ZeroR "" 48055541465867954
    (2) rules.OneR '-B 6' -2459427002147861445
    (3) trees.J48 '-C 0.25 -M 2' -217733168393644444
]

The schemes used in the experiment are shown in the columns and the datasets used are shown in the rows.

The percentage correct for each of the 3 schemes is shown in each dataset row: 33.33% for ZeroR, 94.31% for OneR, and 94.90% for J48. The annotation v or * indicates that a specific result is statistically better (v) or worse (*) than the baseline scheme (in this case, ZeroR) at the significance level specified (currently 0.05). The results of both OneR and J48 are statistically better than the baseline established by ZeroR. At the bottom of each column after the first column is a count (xx/yy/zz) of the number of times that the scheme was better than (xx), the same as (yy), or worse than (zz) the baseline scheme on the da...
...ly to the specific set of results.

4.4 Clustering

[Screenshot: the Cluster panel with EM selected, Classes to clusters evaluation on (Nom) play, and the Clusterer output showing discrete estimator counts for the attributes, Log likelihood -3.54934, Clustered Instances (cluster 0: 14, 100%), Classes to Clusters (cluster 0 assigned to yes), and 5 incorrectly clustered instances (35.7143%).]

4.4.1 Selecting a Clusterer

By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

4.4.2 Cluster Modes

The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use train...
...mber specified is incorrect. Verify that SQL Server is listening with TCP/IP on the specified server and port. This might be reported with an exception similar to: "The login has failed. The TCP/IP connection to the host has failed." This indicates one of the following:

  - SQL Server is installed but TCP/IP has not been installed as a network protocol for SQL Server by using the SQL Server Network Utility for SQL Server 2000, or the SQL Server Configuration Manager for SQL Server 2005.
  - TCP/IP is installed as a SQL Server protocol, but it is not listening on the port specified in the JDBC connection URL. The default port is 1433.
  - The port that is used by the server has not been opened in the firewall.

- The "Added driver: ..." output on the commandline does not mean that the actual class was found, but only that Weka will attempt to load the class later on in order to establish a database connection.

- The error message "No suitable driver" can be caused by the following:

  - The JDBC driver you are attempting to load is not in the CLASSPATH (Note: using -jar in the java commandline overwrites the CLASSPATH environment variable). Open the SimpleCLI, run the command java weka.core.SystemInfo and check whether the property java.class.path lists your database jar. If not, correct your CLASSPATH or the Java call you start Weka with.
  - The JDBC driver class is misspelled in the jdbcDriver property or you have multiple entries...
...mean squared error per example would be a reasonable criterion. We will discuss the relation between confusion matrix and other measures in the text.

The confusion matrix is more commonly named contingency table. In our case we have two classes, and therefore a 2x2 confusion matrix (the matrix could be arbitrarily large). The number of correctly classified instances is the sum of diagonals in the matrix; all others are incorrectly classified (class "a" gets misclassified as "b" exactly twice, and class "b" gets misclassified as "a" three times).

The True Positive (TP) rate is the proportion of examples which were classified as class x, among all examples which truly have class x, i.e. how much part of the class was captured. It is equivalent to Recall. In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e. 7/(7+2)=0.778 for class yes and 2/(3+2)=0.4 for class no in our example.

The False Positive (FP) rate is the proportion of examples which were classified as class x, but belong to a different class, among all examples which are not of class x. In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes, i.e. 3/5=0.6 for class yes and 2/9=0.222 for class no.

The Precision is the proportion of the examples which truly have class x among all those which were classified as class...
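These formulas are easy to check mechanically. The following stand-alone Java sketch (an illustration for this manual's running example; the class and method names are not part of WEKA) computes TP rate, FP rate and precision from the 2x2 matrix above:

```java
public class ConfusionStats {
    // m[i][j] = number of instances of true class i predicted as class j

    /** TP rate of class c: diagonal element divided by the row sum. */
    public static double tpRate(int[][] m, int c) {
        int row = 0;
        for (int j = 0; j < m.length; j++) row += m[c][j];
        return (double) m[c][c] / row;
    }

    /** FP rate of class c: column sum minus diagonal, divided by the other rows' sums. */
    public static double fpRate(int[][] m, int c) {
        int col = 0, others = 0;
        for (int i = 0; i < m.length; i++) {
            if (i == c) continue;
            col += m[i][c];
            for (int j = 0; j < m.length; j++) others += m[i][j];
        }
        return (double) col / others;
    }

    /** Precision of class c: diagonal element divided by the column sum. */
    public static double precision(int[][] m, int c) {
        int col = 0;
        for (int i = 0; i < m.length; i++) col += m[i][c];
        return (double) m[c][c] / col;
    }

    public static void main(String[] args) {
        // row 0 = yes (7 correct, 2 predicted as no), row 1 = no (3 predicted as yes, 2 correct)
        int[][] m = { {7, 2}, {3, 2} };
        System.out.printf("TP(yes)=%.3f FP(yes)=%.3f P(yes)=%.3f%n",
                tpRate(m, 0), fpRate(m, 0), precision(m, 0));
    }
}
```

Running this reproduces the values quoted in the text: TP rate 0.778 and FP rate 0.6 for class yes.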
[Screenshot: the Weka Experiment Environment, Setup tab, Advanced mode. The result listener is InstancesResultListener (-O weka_experiment5520.arff); the result generator is CrossValidationResultProducer (-X 10 -O splitEvalutorOut.zip -W weka.experiment.ClassifierSplitEvaluator). Its GenericObjectEditor window ("Generates for each run, carries out an n-fold cross-validation, using the set SplitEvaluator to generate some results") shows the properties numFolds (10), rawOutput (False) and splitEvaluator (ClassifierSplitEvaluator).]

Next, choose DensityBasedClustererSplitEvaluator as the split evaluator to use.

[Screenshot: the GenericObjectEditor class tree for the splitEvaluator property, listing under weka.experiment: ClassifierSplitEvaluator, CostSensitiveClassifierSplitEvaluator, DensityBasedClustererSplitEvaluator and RegressionSplitEvaluator.]

If you click on DensityBasedClustererSplitE...
...metrize the base classifier, but not choose another classifier with the GenericObjectEditor. Be aware that this method does not create a copy of the provided classifier.

getClassifier()
returns the currently set classifier object. Note: this method returns the internal object and not a copy.

MultipleClassifiersCombiner

This meta-classifier handles its multiple base classifiers with the following methods:

setClassifiers(Classifier[])
sets the array of classifiers to use as base classifiers. If you require the base classifiers to implement a certain interface or be of a certain class, then override this method and add the necessary checks. Note: this method does not create a copy of the array, but just uses this reference internally.

getClassifiers()
returns the array of classifiers that is in use. Careful: this method returns the internal array and not a copy of it.

getClassifier(int)
returns the classifier from the internal classifier array specified by the given index. Once again, this method does not return a copy of the classifier, but the actual object used by this classifier.

17.1.4.2 Guidelines

WEKA's code base requires you to follow a few rules. The following sections can be used as guidelines in writing your code.

Parameters

There are two different ways of setting/obtaining parameters of an algorithm. Both of them are unfortunately completely independent, which makes option ha...
...mm.mysql.Driver (or com.mysql.jdbc.Driver)

- ODBC - part of Sun's JDKs/JREs, no external driver necessary: sun.jdbc.odbc.JdbcOdbcDriver
- Oracle: oracle.jdbc.driver.OracleDriver
- PostgreSQL: org.postgresql.Driver
- sqlite 3.x: org.sqlite.JDBC

URL

jdbcURL specifies the JDBC URL pointing to your database (can still be changed in the Experimenter/Explorer), e.g. for the database MyDatabase on the server server.my.domain:

- HSQLDB
  jdbc:hsqldb:hsql://server.my.domain/MyDatabase
- MS SQL Server 2000 (Desktop Edition)
  jdbc:microsoft:sqlserver://server.my.domain:1433
  (Note: if you add ;databasename=db-name you can connect to a different database than the default one, e.g. MyDatabase)
- MS SQL Server 2005
  jdbc:sqlserver://server.my.domain:1433
- MySQL
  jdbc:mysql://server.my.domain:3306/MyDatabase
- ODBC
  jdbc:odbc:DSN_name (replace DSN_name with the DSN that you want to use)
- Oracle (thin driver)
  jdbc:oracle:thin:@server.my.domain:1526:orcl
  (Note: @machineName:port:SID; for the Express Edition you can use jdbc:oracle:thin:@server.my.domain:1521:XE)
- PostgreSQL
  jdbc:postgresql://server.my.domain:5432/MyDatabase
  You can also specify user and password directly in the URL:
  jdbc:postgresql://server.my.domain:5432/MyDatabase?user=<...>&password=<...>
  where you have to replace the <...> with the correct values.
- sqlite 3.x
  jdbc:sqlite:/path/to/database.db (you can acces...
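Putting driver and URL together, a DatabaseUtils.props entry for the MySQL case above might look as follows (server.my.domain and MyDatabase are placeholders you would replace with your own server and database names):

```
jdbcDriver=com.mysql.jdbc.Driver
jdbcURL=jdbc:mysql://server.my.domain:3306/MyDatabase
```

The same two-line pattern applies to the other databases; only the driver class and URL scheme change.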
...mote engines on.

- To edit the remote engine policy file included in the Weka distribution to allow Java class and dataset loading from your home directory.

- An invocation of the Experimenter on a machine somewhere (any will do).

For the following examples, we assume a user called johndoe with this setup:

- Access to a set of computers running a flavour of Unix (pathnames need to be changed for Windows).
- The home directory is located at /home/johndoe.
- Weka is found in /home/johndoe/weka.
- Additional jar archives, i.e. JDBC drivers, are stored in /home/johndoe/jars.
- The directory for the datasets is /home/johndoe/datasets.

Note: The example policy file remote.policy.example is using this setup (available in weka/experiment).

5.4.2 Database Server Setup

- HSQLDB

  - Download the JDBC driver for HSQLDB, extract the hsqldb.jar and place it in the directory /home/johndoe/jars.

  - To set up the database server, choose or create a directory to run the database server from, and start the server with:

        java -classpath /home/johndoe/jars/hsqldb.jar \
          org.hsqldb.Server \
          -database.0 experiment -dbname.0 experiment

    Note: This will start up a database with the alias "experiment" (-dbname.0 <alias>) and create a properties and a log file at the current location prefixed with "experiment" (-database.0 <file>).

1 Weka's source code can be found in the weka-src.jar archive or obtained from Subversion [1].
[Screenshot: the Analyse panel with the Comparison field set to Percent_correct, Significance 0.05, and Test base set to OneR. The Test output pane shows:

    Dataset        (2) rules.On |   (1) rules.Ze    (3) trees.J4
    ------------------------------------------------------------
    iris     (10)  94.31        |   33.33 *         94.90
    ------------------------------------------------------------
                   (v/ /*)      |   (0/0/1)         (0/1/0)

    Key:
    (1) rules.ZeroR "" 48055541465867954
    (2) rules.OneR '-B 6' -2459427002147861445
    (3) trees.J48 '-C 0.25 -M 2' -217733168393644444
]

5.5.4 Statistical Significance

The term statistical significance used in the previous section refers to the result of a pair-wise comparison of schemes using either a standard T-Test or the corrected resampled T-Test [9]. The latter test is the default, because the standard T-Test can generate too many significant differences due to dependencies in the estimates (in particular when anything other than one run of an x-fold cross-validation is used). For more information on the T-Test, consult the Weka book [1] or an introductory statistics text. As the significance level is decreased, the confidence in the conclusion increases.

In the current experiment, there is not a statistically significant difference between the OneR and J48 schemes.

5.5.5 Summary Test

Selecting Summary from Test base and performing a test causes...
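To make the pair-wise comparison of section 5.5.4 concrete, here is a small stand-alone Java sketch of the standard paired T-statistic over per-run accuracy differences. This is an illustration only, not the Experimenter's implementation; the corrected resampled variant that WEKA uses by default additionally inflates the variance term to account for the overlap between training sets.

```java
public class PairedT {

    /** Standard paired t statistic: t = mean(d) / sqrt(var(d) / n), with d[i] = a[i] - b[i]. */
    public static double tStatistic(double[] a, double[] b) {
        int n = a.length;
        double mean = 0;
        for (int i = 0; i < n; i++) mean += a[i] - b[i];
        mean /= n;
        double var = 0;
        for (int i = 0; i < n; i++) {
            double d = a[i] - b[i] - mean;
            var += d * d;
        }
        var /= (n - 1);                      // sample variance of the differences
        return mean / Math.sqrt(var / n);
    }

    public static void main(String[] args) {
        // hypothetical per-run Percent_correct values for two schemes
        double[] schemeA = {95, 97, 99};
        double[] schemeB = {94, 94, 94};
        System.out.println(tStatistic(schemeA, schemeB));
    }
}
```

The resulting t value is compared against the critical value of the t distribution (with n-1 degrees of freedom) at the chosen significance level to decide whether to print a v or * annotation.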
...n of the scatter plot you selected. (We described above how to visualize particular results in a separate window, for example, classifier errors; the same visualization controls are used here.)

Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is used for the y-axis.

Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up.

To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The "X" and "Y" written beside the strips shows what the current axes are ("B" is used for both X and Y).

Above the attribute strips is a slid...
[Screenshot: the database connection dialog with Username, Password and Debug fields and OK/Cancel buttons.]

After supplying the necessary data and clicking on OK, the URL in the main window will be updated.

Note: at this point, the database connection is not tested; this is done when the experiment is started.

[Screenshot: the Setup tab in Simple mode with Results Destination set to JDBC database and the URL jdbc:mysql://localhost:3306/weka_test.]

The advantage of a JDBC database is the possibility to resume an interrupted or extended experiment. Instead of re-running all the other algorithm/dataset combinations again, only the missing ones are computed.

5.2.1.3 Experiment type

The user can choose between the following three different types:

- Cross-validation (default)
  performs stratified cross-validation with the given number of folds
- Train/Test Percentage Split (data randomized)
  splits a dataset according to the given p...
...ncoming instance before the classifier is trained (updated with the instance).

If you have a pre-trained classifier, you can specify that the classifier not be updated on incoming instances by unselecting the check box in the configuration dialog for the classifier. If the pre-trained classifier is a batch classifier, i.e. it is not capable of incremental training, then you will only be able to test it in an incremental fashion.

[Screenshot: the configuration dialog for an updateable Naive Bayes classifier ("Class for a Naive Bayes classifier using estimator classes"), with the debug, displayModelInOldFormat, useKernelEstimator and useSupervisedDiscretization properties and the "Update classifier on incoming instance stream" check box at the bottom.]

6.5 Plugin Facility

The KnowledgeFlow offers the ability to easily add new components via a plugin mechanism. Plugins are installed in a directory called .knowledgeFlow/plugins in the user's home directory. If this directory does not exist you must create it in order to install plugins. Plugins are installed in subdirectories of the .knowledgeFlow/plugins directory. More than one plugin component may reside in the same subdirectory. Each subdirectory should contain jar file(s) that contain and support the plugin components. The KnowledgeFlow will dynamically load jar files and add them to the classpath. In order to tell the KnowledgeFlow which classes in the jar files to instantiate as components, a second file called Beans...
...ndling so prone to errors. Here are the two:

1. command line options, using the setOptions/getOptions methods
2. using the properties through the GenericObjectEditor in the GUI

Each command line option must have a corresponding GUI property and vice versa. In case of GUI properties, the get- and set-method for a property must comply with Java Beans style in order to show up in the GUI. You need to supply three methods for each property:

- public void set<PropertyName>(<Type>) - checks whether the supplied value is valid and only then updates the corresponding member variable. In any other case it should ignore the value and output a warning in the console or throw an IllegalArgumentException.
- public <Type> get<PropertyName>() - performs any necessary conversions of the internal value and returns it.
- public String <propertyName>TipText() - returns the help text that is available through the GUI. Should be the same as on the command line. Note: everything after the first period "." gets truncated from the tool tip that pops up in the GUI when hovering with the mouse cursor over the field in the GenericObjectEditor.

With a property called "alpha" of type "double", we get the following method signatures:

- public void setAlpha(double)
- public double getAlpha()
- public String alphaTipText()

These get- and set-methods should be used in the getOptions and setOptions methods as well, to...
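A minimal sketch of this pattern for the "alpha" property could look as follows (a hypothetical class fragment for illustration, not an actual WEKA classifier; the [0,1] range check and the 0.5 default are arbitrary):

```java
public class AlphaDemo {

    /** member variable backing the "alpha" property; 0.5 is an arbitrary default */
    protected double m_Alpha = 0.5;

    /** validates the supplied value and only then updates the member variable */
    public void setAlpha(double value) {
        if (value < 0.0 || value > 1.0)
            throw new IllegalArgumentException("alpha must be in [0,1], got: " + value);
        m_Alpha = value;
    }

    /** returns the current value of the property */
    public double getAlpha() {
        return m_Alpha;
    }

    /** help text for the GUI; only the text before the first period appears in the tool tip */
    public String alphaTipText() {
        return "The alpha smoothing parameter. Larger values smooth more.";
    }
}
```

Because setAlpha/getAlpha follow the Java Beans naming convention, the GenericObjectEditor can discover the property via introspection and display the tip text automatically.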
...net.BayesNetGenerator -N 15 -A 20 -C 3 -M 300

will generate a data set in arff format with 300 instances from a random network with 15 ternary variables and 20 arrows.

How do I create an artificial data set using a Bayes net I have on file?

Running

    java weka.classifiers.bayes.net.BayesNetGenerator -F alarm.xml -M 1000

will generate a data set with 1000 instances from the network stored in the file alarm.xml.

How do I save a Bayes net in BIF format?

- GUI: In the Explorer
  - learn the network structure,
  - right-click the relevant run in the result list,
  - choose "Visualize graph" in the pop-up menu,
  - click the floppy button in the Graph Visualizer window,
  - a file "save as" dialog pops up that allows you to select the file name to save to.
- Java: Create a BayesNet and call BayesNet.toXMLBIF03(), which returns the Bayes network in BIF format as a String.
- Command line: use the -g option and redirect the output on stdout into a file.

How do I compare a network I learned with one in BIF format?

Specify the -B <bif-file> option to BayesNet. Calling toString() will produce a summary of extra, missing and reversed arrows. Also the divergence between the network learned and the one on file is reported.

How do I use the network I learned for general inference?

There is no general purpose inference in Weka, but you can export the network as an XML BIF file...
conf:(1)
outlook=rainy windy=FALSE ==> play=yes 3    conf:(1)
temperature=cool play=yes ==> humidity=normal 3    conf:(1)
outlook=sunny temperature=hot 2 ==> humidity=high 2    conf:(1)
temperature=hot play=no 2 ==> outlook=sunny 2    conf:(1)

4.5.1 Setting Up

This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.

4.5.2 Learning Associations

Once appropriate parameters for the association rule learner have been set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.

48 CHAPTER 4. EXPLORER

4.6 Selecting Attributes

[Screenshot: the Select attributes panel. Attribute Evaluator: CfsSubsetEval; Search Method: BestFirst -D 1 -N 5; Attribute Selection Mode: use full training set. The output reads: Attribute Selection on all input data; search direction forward; stale search after 5 node expansions; total number of subsets evaluated: 11; merit of best subset found: 0.247; Attribute Subset Evaluator (supervised, Class (nominal): 5 play): CF
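The CfsSubsetEval/BestFirst selection shown in the panel can also be driven from Java. The following is a hedged sketch: it assumes weka.jar on the classpath and uses iris.arff as a stand-in dataset path; the class names come from the weka.attributeSelection package:

```java
// Programmatic attribute selection mirroring the Select attributes panel:
// CfsSubsetEval as evaluator, BestFirst as search method.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // example path
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(new CfsSubsetEval());   // attribute evaluator
        sel.setSearch(new BestFirst());          // search method
        sel.SelectAttributes(data);              // note the capital S

        // indices of the selected attributes (class index appended last)
        int[] indices = sel.selectedAttributes();
        System.out.println(java.util.Arrays.toString(indices));
    }
}
```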
ng Weka.

19.2.2 OutOfMemoryException

Most Java virtual machines only allocate a certain maximum amount of memory to run Java programs. Usually this is much less than the amount of RAM in your computer. However, you can extend the memory available for the virtual machine by setting appropriate options. With Sun's JDK, for example, you can go

java -Xmx100m ...

to set the maximum Java heap size to 100MB. For more information about these options see http://java.sun.com/docs/hotspot/VMOptions.html.

296 CHAPTER 19. OTHER RESOURCES

19.2.2.1 Windows

Book version

You have to modify the JVM invocation in the RunWeka.bat batch file in your installation directory.

Developer version

- up to Weka 3.5.2: just like the book version.

- Weka 3.5.3: You have to modify the link in the Windows Start menu, if you're starting the console-less Weka (only the link with console in its name executes the RunWeka.bat batch file).

- Weka 3.5.4 and higher: Due to the new launching scheme, you no longer modify the batch file, but the RunWeka.ini file. In that particular file, you'll have to change the maxheap placeholder. See section 8.22.

19.2.3 Mac OSX

In your Weka installation directory (weka-3-x-y.app) locate the Contents subdirectory and edit the Info.plist file. Near the bottom of the file you should see some text like:

<key>VMOptions</key>
<string>-Xmx256M</string>

Alter the 256M to something higher.
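To verify what -Xmx actually gave you, the JVM can report its own limit. A small self-contained snippet (not part of WEKA, just plain Java):

```java
// Prints the maximum heap size available to this JVM,
// which is controlled by the -Xmx option.
public class MaxHeap {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}
```

Running it as `java -Xmx100m MaxHeap` should report a value close to 100 MB.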
245. ng a blank area  pan around by dragging  the mouse  and see the training instances at each node by clicking on it   CTRL clicking zooms the view out  while SHIFT dragging a box zooms  the view in  The graph visualizer should be self explanatory     Visualize margin curve  Generates a plot illustrating the prediction  margin  The margin is defined as the difference between the probability  predicted for the actual class and the highest probability predicted for  the other classes  For example  boosting algorithms may achieve better  performance on test data by increasing the margins on the training data     Visualize threshold curve  Generates a plot illustrating the trade offs  in prediction that are obtained by varying the threshold value between  classes  For example  with the default threshold value of 0 5  the pre   dicted probability of    positive    must be greater than 0 5 for the instance  to be predicted as    positive     The plot can be used to visualize the pre   cision recall trade off  for ROC curve analysis  true positive rate vs false  positive rate   and for other types of curves     Visualize cost curve  Generates a plot that gives an explicit represen   tation of the expected cost  as described by  5      Plugins  This menu item only appears if there are visualization plugins  available  by default  none   More about these plugins can be found in    the Weka Wiki article    Explorer visualization plugins     8      Options are greyed out if they do not app
246. nt and other cross validation  methods can be implemented  such as Monte Carlo cross validation and  bootstrap cross validation     e Implement methods that can handle incremental extensions of the data  set for updating network structures     And for the more ambitious people  there are the following challenges     e A GUI for manipulating Bayesian network to allow user intervention for  adding and deleting arcs and updating the probability tables     e General purpose inference algorithms built into the GUI to allow user  defined queries     e Allow learning of other graphical models  such as chain graphs  undirected  graphs and variants of causal graphs     e Allow learning of networks with latent variables     e Allow learning of dynamic Bayesian networks so that time series data can  be handled     158 CHAPTER 8  BAYESIAN NETWORK CLASSIFIERS    Part III    Data    159    Chapter 9    ARFF    An ARFF    Attribute Relation File Format  file is an ASCII text file that  describes a list of instances sharing a set of attributes     9 1 Overview    ARFF files have two distinct sections  The first section is the Header informa   tion  which is followed the Data information    The Header of the ARFF file contains the name of the relation  a list of  the attributes  the columns in the data   and their types  An example header  on the standard IRIS dataset looks like this       1  Title  Iris Plants Database  7       2  Sources        a  Creator  R A  Fisher     b  Donor  Michael 
nt base classifier. By default, all capabilities are initialized as Dependencies. weka.classifiers.meta.LogitBoost, e.g., is restricted to nominal classes. For that reason it disables the Dependencies for the class:

result.disableAllClasses();               // disable all class types
result.disableAllClassDependencies();     // no dependencies
result.enable(Capability.NOMINAL_CLASS);  // only nominal classes allowed

244 CHAPTER 17. EXTENDING WEKA

Javadoc

In order to keep code quality high and maintenance low, source code needs to be well documented. This includes the following Javadoc requirements:

- class
  - description of the classifier
  - listing of command-line parameters
  - publication(s), if applicable
  - @author and @version tag

- methods (all, not just public)
  - each parameter is documented
  - return value, if applicable, is documented
  - exception(s) are documented
  - the setOptions(String[]) method also lists the command-line parameters

Most of the class-related and the setOptions Javadoc is already available through the source code:

- description of the classifier: globalInfo()
- listing of command-line parameters: listOptions()
- publication(s), if applicable: getTechnicalInformation()

In order to avoid manual syncing between Javadoc and source code, WEKA comes with some tools for updating the Javadoc automatically. The following tools take a concrete class and update its source code (the source code directory ne
nt that the -E options should be used after the -Q option. Extra options can be passed to the search algorithm and the estimator after the class name specified following '--'. For example:

java weka.classifiers.bayes.BayesNet -t iris.arff -D \
   -Q weka.classifiers.bayes.net.search.local.K2 -- -P 2 -S ENTROPY \
   -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 1.0

Overview of options for search algorithms

- weka.classifiers.bayes.net.search.local.GeneticSearch

-L <integer>
 Population size
-A <integer>
 Descendant population size
-U <integer>
 Number of runs
-M
 Use mutation.

130 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

 (default true)
 Use cross-over.
 (default true)
 Use tournament selection (true) or maximum subpopulation (false).
 (default false)
-R <seed>
 Random number seed
-mbc
 Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
 Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

- weka.classifiers.bayes.net.search.local.HillClimber

-P <nr of parents>
 Maximum number of parents
-R
 Use arc reversal operation.
 (default false)
-N
 Initial structure is empty (instead of Naive Bayes)
-mbc
 Applies a Markov Blanket correction to the network structure, after a network structure i
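The command line above can also be expressed through the API. A hedged sketch (assumes weka.jar on the classpath; method names follow the Bean properties behind the options, and the -S ENTROPY score type, settable via a SelectedTag, is omitted here for brevity):

```java
// Configure a BayesNet with K2 search (at most 2 parents) and a
// SimpleEstimator with alpha = 1.0, mirroring the command line above.
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.estimate.SimpleEstimator;
import weka.classifiers.bayes.net.search.local.K2;

public class BayesNetSetup {
    public static BayesNet build() {
        BayesNet net = new BayesNet();

        K2 search = new K2();
        search.setMaxNrOfParents(2);     // -P 2
        net.setSearchAlgorithm(search);  // -Q ...search.local.K2

        SimpleEstimator est = new SimpleEstimator();
        est.setAlpha(1.0);               // -A 1.0
        net.setEstimator(est);           // -E ...estimate.SimpleEstimator

        return net;
    }
}
```

The configured classifier can then be trained as usual with buildClassifier(Instances).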
ntrol panel (Data sets first / Custom generator first), the Datasets panel with its Add new..., Edit selected... and Delete selected buttons, the "Use relative paths" checkbox, and data/iris.arff listed.

To add another scheme, click on the Choose button to display the GenericObjectEditor window.

[Screenshot: the Weka Experiment Environment, Setup tab, Advanced configuration mode. The Result generator is a RandomSplitResultProducer (-P 66.0 -O splitEvalutorOut.zip -W weka.experiment.ClassifierSplitEvaluator), and the Choose dialog shows the weka.classifiers tree with the trees package expanded, listing ADTree, BFTree, DecisionStump, J48, ..., NBTree, RandomForest, RandomTree, REPTree and SimpleCart.]
o in an About box, along with a More button. Clicking on the More button brings up a window describing what the different options do. Others have an additional button, Capabilities, which lists the types of attributes and classes the object can handle.

At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open... and Save..., allow object configurations to be stored for future use. The Cancel button backs out without remembering any changes that have been made. Once you are happy with the object and settings you have chosen, click OK to return to the main Explorer window.

Applying Filters

Once you have selected and configured a filter, you can apply it to the data by pressing the Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be undone by pressing the Undo button. You can also use the Edit... button to modify your data manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel saves the current version of the relation in file formats that can represent the relation, allowing it to be kept for future use.

Note: Some of the filters behave differently depending on whether a class attribute has been set or not (using the box above the histogram, which will bring up a drop-down list of possible selections when clicked). In particular, the "supervised filters" require a class attribute
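The effect of the Apply button can be reproduced programmatically. A hedged sketch using the Remove filter (assumes weka.jar on the classpath; iris.arff is an example path):

```java
// Programmatic equivalent of configuring a filter and pressing Apply.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ApplyFilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // example path

        Remove rm = new Remove();
        rm.setAttributeIndices("1");   // drop the first attribute
        rm.setInputFormat(data);       // must be called before filtering

        // apply the filter; the result is a new, transformed dataset
        Instances filtered = Filter.useFilter(data, rm);
        System.out.println(filtered.numAttributes());
    }
}
```

The original Instances object is left untouched, which corresponds to the Undo behaviour in the GUI.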
251. o steps with the RandomForest classifier     e Next go back to the Evaluation tab and place a ClassifierPerformanceE     valuator component on the layout  Connect J48 to this component by  selecting the batchClassifier entry from the pop up menu for J48  Add  another ClassifierPerformanceEvaluator for RandomForest and connect  them via batchClassifier as well     e Next go to the Visualization toolbar and place a ModelPerformanceChart    component on the layout  Connect both ClassifierPerformanceEvaluators  to the ModelPerformanceChart by selecting the thresholdData entry from  the pop up menu for ClassifierPerformanceEvaluator     e Now start the flow executing by selecting Start loading from the pop up    menu for ArffLoader  Depending on how big the data set is and how long  cross validation takes you will see some animation from some of the icons  in the layout  You will also see some progress information in the Status  bar and Log at the bottom of the window     e Select Show plot from the popup menu of the ModelPerformanceChart    under the Actions section     Here are the two ROC curves generated from the UCI dataset credit g  eval     uated on the class label good           Model Performance Chart Y A oj x   X  False Positive Rate  Num  w    Y  True Positive Rate  Num  v    Colour  Threshold  Num                 Reset Clear Open Save       Jitter i           Plot  german_credit    1 _    x J48  good  4  A on         RandomForest  good        0 06                0 023 
object's ability to parse command-line options via the setOptions(String[]) method (the counterpart of this method is getOptions(), which returns a String[] array). The difference between the two approaches is that the setOptions(String[]) method cannot be used to set the options incrementally. Default values are used for all options that haven't been explicitly specified in the options array.

The most basic approach is to assemble the String array by hand. The following example creates an array with a single option ("-R") that takes an argument ("1") and initializes the Remove filter with this option:

import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R";
options[1] = "1";
Remove rm = new Remove();
rm.setOptions(options);

Since the setOptions(String[]) method expects a fully parsed and correctly split up array (which is done by the console/command prompt), some common pitfalls with this approach are:

- Combination of option and argument: Using "-R 1" as an element of the String array will fail, prompting WEKA to output an error message stating that the option "R 1" is unknown.

- Trailing blanks: Using "-R " will fail as well, since no trailing blanks are removed and therefore option "R " will not be recognized.

The easiest way to avoid these problems is to provide a String array that has been generated automatically from a single command-line string us
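One helper for this is weka.core.Utils.splitOptions(String), which splits a single command-line string the same way a console would. A hedged sketch (assumes weka.jar on the classpath):

```java
// Split a single command-line string into a proper options array,
// avoiding the hand-assembly pitfalls described above.
import weka.core.Utils;
import weka.filters.unsupervised.attribute.Remove;

public class SplitOptionsDemo {
    public static void main(String[] args) throws Exception {
        // "-R 1" is split into {"-R", "1"}
        String[] options = Utils.splitOptions("-R 1");

        Remove rm = new Remove();
        rm.setOptions(options);

        // joinOptions is the inverse: back to a single string
        System.out.println(Utils.joinOptions(rm.getOptions()));
    }
}
```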
ociationsPanel, \
 weka.gui.explorer.AttributeSelectionPanel, \
 weka.gui.explorer.VisualizePanel

- Note: the standalone option is used to make the tab available without requiring the preprocess panel to load a dataset first.

Screenshot:

[Screenshot: the Explorer with an additional DataGeneration tab; the Agrawal generator (weka.datagenerators.classifiers.classification.Agrawal -S 1 -n 100 -F 1 -P 0.05) has produced instances, shown as rows of numeric values.]
of attributes.
- StartSetHandler: search algorithms that can make use of a start set of attributes implement this interface.

Methods

Search algorithms are rather basic classes in regards to methods that need to be implemented. Only the following method needs to be implemented:

search(ASEvaluation, Instances)
uses the provided evaluator to guide the search.

Testing

For some basic tests from the command line, you can use the following test class:

weka.attributeSelection.CheckAttributeSelection -eval classname -search classname [further options]

For junit tests, you can subclass the weka.attributeSelection.AbstractEvaluatorTest or weka.attributeSelection.AbstractSearchTest class and add additional tests.

264 CHAPTER 17. EXTENDING WEKA

17.3.3 Associators

Superclasses and interfaces

The interface weka.associations.Associator is common to all associator algorithms. But most algorithms will be derived from AbstractAssociator, an abstract class implementing this interface. As with classifiers and clusterers, you can also implement a meta-associator, derived from SingleAssociatorEnhancer. An example for this is the FilteredAssociator, which filters the training data on the fly for the base associator.

The only other interface that is used by some other association algorithms is the weka.associations.CARuleMiner one. Associators that learn class association rules implement this interface, like Apriori.

Methods

The associators
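Using an associator from code follows the same pattern as classifiers and clusterers. A hedged sketch with Apriori (assumes weka.jar on the classpath and a nominal dataset such as the weather data; the path is an example):

```java
// Build association rules programmatically, as the Associate panel does.
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");  // example path

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);           // number of rules to find
        apriori.buildAssociations(data);   // learn the association rules

        System.out.println(apriori);       // rules via toString()
    }
}
```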
ones, i.e., all had to be located below weka. WEKA can now display multiple class hierarchies in the GUI, which makes adding new functionality quite easy, as we will see later in an example (it is not restricted to classifiers only, but also works with all the other entries in the GPC file).

18.4.2 File Structure

The structure of the GOE is a key-value pair, separated by an equals sign. The value is a comma-separated list of classes that are all derived from the superclass/superinterface key. The GPC is slightly different: instead of declaring all the classes/interfaces, one only needs to specify all the packages descendants are located in (only non-abstract ones are then listed). E.g., the weka.classifiers.Classifier entry in the GOE file looks like this:

weka.classifiers.Classifier=\
 weka.classifiers.bayes.AODE,\
 weka.classifiers.bayes.BayesNet,\
 weka.classifiers.bayes.ComplementNaiveBayes,\
 weka.classifiers.bayes.NaiveBayes,\
 weka.classifiers.bayes.NaiveBayesMultinomial,\
 weka.classifiers.bayes.NaiveBayesSimple,\
 weka.classifiers.bayes.NaiveBayesUpdateable,\
 weka.classifiers.functions.LeastMedSq,\
 weka.classifiers.functions.LinearRegression,\
 weka.classifiers.functions.Logistic,\
 ...

The entry producing the same output for the classifiers in the GPC looks like this (7 lines instead of over 70 for WEKA 3.4.4):

weka.classifiers.Classifier=\
 weka.classifiers.bayes,\
 weka.classifiers.functions,\
 weka.classifiers.lazy,\
 weka.classi
256. oore  24   Since I have not noticed a lot of improvement  for small data sets  it is set off by default  Note that this ADTree algorithm is dif   ferent from the ADTree classifier algorithm from weka  classifiers tree ADTree   The debug option has no effect     8 2  LOCAL SCORE BASED STRUCTURE LEARNING 119    8 2 Local score based structure learning    Distinguish score metrics  Section 2 1  and search algorithms  Section 2 2   A  local score based structure learning can be selected by choosing one in the  weka classifiers bayes net search local package           weka    U classifiers  9 cf hayes    Ej net    c search    e local  Ey GeneticSearch  E HillClimber  Q k2   y LAGDHIIIClimber  Ey RepeatedHillClimber  IN  SimulatedAnnealing  IN  TabuSearch   y TAN  e E ci   gt  C global  o cf fixed                      Close          Local score based algorithms have the following options in common   initAsNaiveBayes if set true  default   the initial network structure used for  starting the traversal of the search space is a naive Bayes network structure   That is  a structure with arrows from the class variable to each of the attribute  variables    If set false  an empty network structure will be used  i e   no arrows at all    markovBlanketClassifier  false by default  if set true  at the end of the  traversal of the search space  a heuristic is used to ensure each of the attributes  are in the Markov blanket of the classifier node  If a node is already in the  Markov blanket  i
 * <!-- options-start -->
 * <!-- options-end -->
 *
 * @author John Doe (john dot doe at no dot where dot com)
 * @version $Revision: 6192 $
 */

The template for any classifier's setOptions(String[]) method is as follows:

/**
 * Parses a given list of options.
 *
 * <!-- options-start -->
 * <!-- options-end -->
 *
 * @param options the list of options as an array of strings
 * @throws Exception if an option is not supported
 */

Running the weka.core.AllJavadoc tool over this code will output code with the comments filled out accordingly.

Revisions

Classifiers implement the weka.core.RevisionHandler interface. This provides the functionality of obtaining the Subversion revision from within Java. Classifiers that are not part of the official WEKA distribution do not have to implement the method getRevision(), as the weka.classifiers.Classifier class already implements this method. Contributions, on the other hand, need to implement it as follows, in order to obtain the revision of this particular source file:

/**
 * Returns the revision string.
 *
 * @return the revision
 */
public String getRevision() {
  return RevisionUtils.extract("$Revision: 6192 $");
}

Note: a commit into Subversion will replace the revision number above with the actual revision number.

246 CHAPTER 17. EXTENDING WEKA

Testing

WEKA already provides a test framework to ensure correct basic functionality of a classifier. It is essential for t
[Result-list screenshot: alternating rules.ZeroR and trees.J48 -C 0.25 -M 2 entries for the iris dataset, runs 4 to 9, each with a version ID, a timestamp (21.12.2005) and a ClassifierSplitEvaluator.]

The contents of the first run are:

ClassifierSplitEvaluator: weka.classifiers.trees.J48 -C 0.25 -M 2 (version -217733168393644444)
Classifier model:
J48 pruned tree
------------------
petalwidth <= 0.6: Iris-setosa (33.0)
petalwidth > 0.6
|   petalwidth <
or and choose another classifier.

60 CHAPTER 5. EXPERIMENTER

[Screenshot: the GenericObjectEditor tree with weka.classifiers expanded, showing the rules package: ConjunctiveRule, DecisionTable, JRip, M5Rules, NNge, OneR, PART, Prism, Ridor, ZeroR.]

The Filter... button enables one to highlight classifiers that can handle certain attribute and class types. With the Remove filter button all the selected capabilities will get cleared and the highlighting removed again.

Additional algorithms can be added again with the Add new... button, e.g., the J48 decision tree.

[Screenshot: the GenericObjectEditor for weka.classifiers.trees.J48 ("Class for generating a pruned or unpruned C4..."), with options binarySplits=False, confidenceFactor=0.25, debug=False, minNumObj=2, numFolds=3, reducedErrorPruning=False, saveInstanceData=False, seed=1, subtreeRaising=True, unpruned=False, useLaplace=False.]

After setting the classifier parameters, one clicks on OK to add it to the list of algorithms.

5.2. STANDARD EXPERIMENTS 61

Weka Experiment Environment
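The parameters shown in the dialog map directly onto Bean properties of the classifier, so the same setup can be created in code. A hedged sketch (assumes weka.jar on the classpath):

```java
// Configure J48 with the same parameters as in the GenericObjectEditor.
import weka.classifiers.trees.J48;

public class J48Setup {
    public static J48 build() {
        J48 j48 = new J48();
        j48.setConfidenceFactor(0.25f);  // confidenceFactor
        j48.setMinNumObj(2);             // minNumObj
        j48.setUnpruned(false);          // unpruned
        return j48;
    }
}
```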
ore information.

- DatabaseUtils.props.mssqlserver: MS SQL Server 2000 (> 3.4.9, > 3.5.4)
- DatabaseUtils.props.mssqlserver2005: MS SQL Server 2005 (> 3.4.11, > 3.5.6)
- DatabaseUtils.props.mysql: MySQL (> 3.4.9, > 3.5.4)
- DatabaseUtils.props.odbc: ODBC access via Sun's ODBC/JDBC bridge, e.g., for MS Sql Server (> 3.4.9, > 3.5.4); see the Windows databases chapter for more information.
- DatabaseUtils.props.oracle: Oracle 10g (> 3.4.9, > 3.5.4)
- DatabaseUtils.props.postgresql: PostgreSQL 7.4 (> 3.4.9, > 3.5.4)
- DatabaseUtils.props.sqlite3: sqlite 3.x (> 3.4.12, > 3.5.7)

180 CHAPTER 13. DATABASES

The easiest way is just to place the extracted properties file into your HOME directory. For more information on how property files are processed, check out the following URL:

http://weka.wikispaces.com/Properties+File

Note: Weka only looks for the DatabaseUtils.props file. If you take one of the example files listed above, you need to rename it first.

13.2 Setup

Under normal circumstances you only have to edit the following two properties:

- jdbcDriver
- jdbcURL

Driver

jdbcDriver is the classname of the JDBC driver, necessary to connect to your database, e.g.:

- HSQLDB: org.hsqldb.jdbcDriver
- MS SQL Server 2000 (Desktop Edition): com.microsoft.jdbc.sqlserver.SQLServerDriver
- MS SQL Server 2005: com.microsoft.sqlserver.jdbc.SQLServerDriver
- MySQL: org.gjt
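For example, a minimal DatabaseUtils.props for a local MySQL server might contain only these two entries (the host, port and database name here are placeholders, not defaults from the manual):

```properties
# minimal DatabaseUtils.props changes for a MySQL server (example values)
jdbcDriver=org.gjt.mm.mysql.Driver
jdbcURL=jdbc:mysql://localhost:3306/some_database
```

All other properties in the shipped file can usually be left at their defaults.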
over the ClassAssigner component and left-click; a red line labeled dataSet will connect the two components.

- Next right-click over the ClassAssigner and choose Configure from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).

- Now grab a NaiveBayesUpdateable component from the bayes section of the Classifiers panel and place it on the layout.

- Next connect the ClassAssigner to NaiveBayesUpdateable using an instance connection.

- Next place an IncrementalClassifierEvaluator from the Evaluation panel onto the layout and connect NaiveBayesUpdateable to it using an incrementalClassifier connection.

106 CHAPTER 6. KNOWLEDGEFLOW

- Next place a TextViewer component from the Visualization panel on the layout. Connect the IncrementalClassifierEvaluator to it using a text connection.

- Next place a StripChart component from the Visualization panel on the layout and connect IncrementalClassifierEvaluator to it using a chart connection.

- Display the StripChart's chart by right-clicking over it and choosing Show chart from the pop-up menu. Note: the StripChart can be configured with options that control how often data points and labels are displayed.

- Finally, start the flow by right-clicking over the ArffLoader and selecting Start loading from the pop-up menu.

[Screenshot: the Strip Chart window.]

Note that, in this example, a prediction is obtained from naive Bayes for each
ows Start Menu, then this directory would be Weka's installation directory (the java process is started from that directory).

- Home directory
  The directory that contains all the user's data. The exact location depends on the operating system and the version of the operating system. It is stored in the following environment variable:
  - Unix/Linux: $HOME
  - Windows: %USERPROFILE%
  - Cygwin: $USERPROFILE

  You should be able to output the value in a command prompt/terminal with the echo command. E.g., for Windows this would be: echo %USERPROFILE%

188 CHAPTER 14. WINDOWS DATABASES

Part IV

Appendix

189

Chapter 15

Research

15.1 Citing Weka

If you want to refer to Weka in a publication, please cite the following SIGKDD Explorations paper. The full citation is:

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.

15.2 Paper references

Due to the introduction of the weka.core.TechnicalInformationHandler interface it is now easy to extract all the paper references via weka.core.ClassDiscovery and weka.core.TechnicalInformation.

The script listed at the end extracts all the paper references from Weka based on a given jar file and dumps them to stdout. One can either generate simple plain text output (option -p) or BibTeX compliant output (option -b).

Typical use (after an ant exejar) for BibTeX:
pabilities returned by the getCapabilities() method:

public void buildClassifier(Instances data) throws Exception {
  // test data against capabilities
  getCapabilities().testWithFail(data);

  // remove instances with missing class value,
  // but don't modify original data
  data = new Instances(data);
  data.deleteWithMissingClass();

  // actual model generation
  ...
}

toString()
is used for outputting the built model. This is not required, but it is useful for the user to see properties of the model. Decision trees normally output the tree, support vector machines the support vectors and rule-based classifiers the generated rules.

17.1. WRITING A NEW CLASSIFIER 239

distributionForInstance(Instance)
returns the class probabilities array of the prediction for the given weka.core.Instance object. If your classifier handles nominal class attributes, then you need to override this method.

classifyInstance(Instance)
returns the classification or regression for the given weka.core.Instance object. In case of a nominal class attribute, this method returns the index of the class label that got predicted. You do not need to override this method in this case, as the weka.classifiers.Classifier superclass already determines the class label index based on the probabilities array that the distributionForInstance(Instance) method returns (it returns the index in the array with the highest probability; in case of ties the first one). For numeric class
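Putting the pieces together, a minimal nominal-class classifier in the style described above might look like the following. This is a ZeroR-like sketch, not an official WEKA class, and assumes weka.jar (version 3.6, where Classifier is an abstract class) on the classpath:

```java
// A minimal classifier: its "model" is just the training class distribution.
// Demonstrates the buildClassifier/distributionForInstance pattern above.
import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Utils;

public class MinimalClassifier extends Classifier {

    /** the class distribution observed in the training data */
    protected double[] m_Distribution;

    public void buildClassifier(Instances data) throws Exception {
        // test data against capabilities
        getCapabilities().testWithFail(data);

        // remove instances with missing class, without modifying the original
        data = new Instances(data);
        data.deleteWithMissingClass();

        // count class values and normalize to probabilities
        m_Distribution = new double[data.numClasses()];
        for (int i = 0; i < data.numInstances(); i++)
            m_Distribution[(int) data.instance(i).classValue()]++;
        Utils.normalize(m_Distribution);
    }

    /** nominal classes: override distributionForInstance, not classifyInstance */
    public double[] distributionForInstance(Instance instance) {
        return m_Distribution.clone();
    }

    public String toString() {
        return "MinimalClassifier: " + Utils.arrayToString(m_Distribution);
    }
}
```

classifyInstance is inherited: the superclass picks the index with the highest probability from this array.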
264. pervised  instance    18 5 1 Precedence    The Weka property files  extension  props  are searched for in the following  order     e current directory  e the user s home directory   nix  HOME  Windows  USERPROFILEZ   e the class path  normally the weka jar file     If Weka encounters those files it only supplements the properties  never overrides  them  In other words  a property in the property file of the current directory  has a higher precedence than the one in the user s home directory     Note  Under Cywgin  http   cygwin com    the home directory is still the    Windows one  since the java installation will be still one for Windows     18 5 2 Examples  e weka gui LookAndFeel props  e weka gui GenericPropertiesCreator props    e weka gui beans Beans props    18 6  XML 289    186 XML  Weka now supports XML  eXtensible Markup    Language  in several places     18 6 1 Command Line    WEKA now allows Classifiers and Experiments to be started using an  xml  option followed by a filename to retrieve the command line options from the  XML file instead of the command line    For such simple classifiers like e g  J48 this looks like overkill  but as soon  as one uses Meta Classifiers or Meta Meta Classifiers the handling gets tricky  and one spends a lot of time looking for missing quotes  With the hierarchical  structure of XML files it is simple to plug in other classifiers by just exchanging  tags    The DTD for the XML options is quite simple      lt  DOCTYPE options        
played in the Datasets panel of the Setup tab.

Saving the Results of the Experiment

To identify a dataset to which the results are to be sent, click on the InstancesResultListener entry in the Destination panel. The output file parameter is near the bottom of the window, beside the text outputFile. Click on this parameter to display a file selection window.

[Screenshot: the weka.gui.GenericObjectEditor dialog for weka.experiment.InstancesResultListener ("Outputs the received results in arff format to a Writer"), showing the outputFile parameter]

5.2. STANDARD EXPERIMENTS 65

[Screenshot: the weka.gui.FileEditor file selection dialog, open on the weka-3-5-6 directory]

Type the name of the output file and click Select. The file name is displayed in the outputFile panel. Click on OK to close the window.

[Screenshot: the weka.gui.GenericObjectEditor dialog for weka.experiment.InstancesResultListener ("Takes results from a result producer and assembles
plorer) parent.

  /** returns the parent Explorer frame */
  public Explorer getExplorer() {
    return m_Explorer;
  }

  /** Returns the title for the tab in the Explorer */
  public String getTabTitle() {
    return "SQL"; // what's displayed as tab title, e.g., "Classify"
  }

  /** Returns the tooltip for the tab in the Explorer */
  public String getTabTitleToolTip() {
    return "Retrieving data from databases"; // the tooltip of the tab
  }

  /** ignored, since we "generate" data and do not receive it */
  public void setInstances(Instances inst) {
  }

  /** PropertyChangeListener which will be notified of value changes */
  public void addPropertyChangeListener(PropertyChangeListener l) {
    m_Support.addPropertyChangeListener(l);
  }

  /** Removes a PropertyChangeListener */
  public void removePropertyChangeListener(PropertyChangeListener l) {
    m_Support.removePropertyChangeListener(l);
  }

268 CHAPTER 17. EXTENDING WEKA

- additional GUI elements:

  /** the actual SQL worksheet */
  protected SqlViewer m_Viewer;

  /** the panel for the buttons */
  protected JPanel m_PanelButtons;

  /** the Load button - makes the data available in the Explorer */
  protected JButton m_ButtonLoad = new JButton("Load data");

  /** displays the current query */
  protected JLabel m_LabelQuery = new JLabel("");

- loading the data into the Explorer by clicking on the Load button will fire a propertyChange event:

  m_ButtonLoad.addActionListener(new ActionListener() {
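The listener wiring above relies on the standard java.beans mechanism (m_Support is a PropertyChangeSupport). The following self-contained sketch, independent of WEKA, shows what happens when the event fires; the property name "query" is just an illustrative value:

```java
import java.beans.PropertyChangeEvent;
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

public class SupportDemo {
    public static String lastProperty = null;

    public static void main(String[] args) {
        PropertyChangeSupport support = new PropertyChangeSupport(new Object());
        // registration, as in addPropertyChangeListener above
        support.addPropertyChangeListener(new PropertyChangeListener() {
            public void propertyChange(PropertyChangeEvent e) {
                lastProperty = e.getPropertyName();
            }
        });
        // comparable to what the Load button's ActionListener triggers
        support.firePropertyChange("query", null, "select * from results0");
        System.out.println(lastProperty); // prints: query
    }
}
```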
probability tables of x, given subsets of pa(x). The weight of a distribution P(x|S) with S ⊆ pa(x) used is proportional to the contribution of the network structure ∀y∈S y → x to either the BDe metric or the K2 metric, depending on the setting of the useK2Prior option (false and true respectively).

128 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

[Screenshot: the weka.gui.GenericObjectEditor for weka.classifiers.bayes.net.estimate.BMAEstimator ("BMAEstimator estimates conditional probability tables of a Bayes network using Bayes Model Averaging (BMA)"); alpha = 0.5, useK2Prior = False]

8.7 Running from the command line

These are the command line options of BayesNet.

General options:

-t <name of training file>
        Sets training file.

-T <name of test file>
        Sets test file. If missing, a cross-validation will be performed on the training data.

-c <class index>
        Sets index of class attribute (default: last).

-x <number of folds>
        Sets number of folds for cross-validation (default: 10).

-no-cv
        Do not perform any cross validation.

-split-percentage <percentage>
        Sets the percentage for the train/test set split, e.g., 66.

-preserve-order
        Preserves the order in the percentage split.

-s <random number seed>
        Sets random number seed for cross-validation or percentage split (default: 1).

-m <name of file with cost matrix>
props needs to be created and placed into each plugin subdirectory. This file contains a list of fully qualified class names to be instantiated. Successfully instantiated components will appear in a "Plugins" tab in the KnowledgeFlow user interface. Below is an example plugin directory listing, the listing of the contents of the jar file and the contents of the associated Beans.props file:

  cygnus:~ mhall$ ls -l $HOME/.knowledgeFlow/plugins/kettle/
  total 24
  -rw-r--r--  1 mhall mhall  117 20 Feb 10:56 Beans.props
  -rw-r--r--  1 mhall mhall 8047 20 Feb 14:01 kettleKF.jar

  cygnus:~ mhall$ jar tvf /Users/mhall/.knowledgeFlow/plugins/kettle/kettleKF.jar
       0 Wed Feb 20 14:01:34 NZDT 2008 META-INF/
      70 Wed Feb 20 14:01:34 NZDT 2008 META-INF/MANIFEST.MF
       0 Tue Feb 19 14:59:08 NZDT 2008 weka/
       0 Tue Feb 19 14:59:08 NZDT 2008 weka/gui/
       0 Wed Feb 20 13:55:52 NZDT 2008 weka/gui/beans/
       0 Wed Feb 20 13:56:36 NZDT 2008 weka/gui/beans/icons/
    2812 Wed Feb 20 14:01:20 NZDT 2008 weka/gui/beans/icons/KettleInput.gif
    2812 Wed Feb 20 14:01:18 NZDT 2008 weka/gui/beans/icons/KettleInput_animated.gif
    1839 Wed Feb 20 13:59:08 NZDT 2008 weka/gui/beans/KettleInput.class
     174 Tue Feb 19 15:27:24 NZDT 2008 weka/gui/beans/KettleInputBeanInfo.class

  cygnus:~ mhall$ more /Users/mhall/.knowledgeFlow/plugins/kettle/Beans.props
  # Specifies the tools to go into the Plugins toolbar
  weka.gui.beans.KnowledgeFlow.Plugins=weka.gui.beans.KettleInput

108 CHAPTER 6. KNOWLEDGEFLOW

Chapter 7
pty Bayesian network is created with nodes for each of the attributes in the arff file. Continuous variables are discretized using the weka.filters.supervised.attribute.Discretize filter (see note at end of this section for more details). The network structure can be specified and the CPTs learned using the Tools/Learn CPT menu.

The Print menu works (sometimes) as expected.

The Export menu allows for writing the graph panel to image (currently supported are bmp, jpg, png and eps formats). This can also be activated using the Alt-Shift-Left-Click action in the graph panel.

144 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

Edit menu

[Screenshot: the Edit menu, with Undo (Ctrl+Z), Redo (Ctrl+Y), Select All, Delete Node, Cut (Ctrl+X), Copy (Ctrl+C), Paste, Add Node, Add Arc, Delete Arc, and the align, center and space actions]

Unlimited undo/redo support. Most edit operations on the Bayesian network are undoable. A notable exception is learning of network and CPTs.

Cut/copy/paste support. When a set of nodes is selected, these can be placed on a clipboard (internal, so no interaction with other applications yet) and a paste action will add the nodes. Nodes are renamed by adding "Copy of" before the name, and adding numbers if necessary to ensure uniqueness of name. Only the arrows to parents are copie
r to get some feedback on the built model.

main(String[])
executes the clusterer from command line. If your new algorithm is called FunkyClusterer, then use the following code as your main method:

  /**
   * Main method for executing this clusterer.
   *
   * @param args the options, use "-h" to display options
   */
  public static void main(String[] args) {
    AbstractClusterer.runClusterer(new FunkyClusterer(), args);
  }

Testing

For some basic tests from the command line, you can use the following test class:

  weka.clusterers.CheckClusterer -W classname [further options]

For junit tests, you can subclass the weka.clusterers.AbstractClustererTest class and add additional tests.

262 CHAPTER 17. EXTENDING WEKA

17.3.2 Attribute selection

Attribute selection consists basically of two different types of classes:

- evaluator - determines the merit of single attributes or subsets of attributes
- search algorithm - the search heuristic

Each of them will be discussed separately in the following sections.

Evaluator

The evaluator algorithm is responsible for determining the merit of the current attribute selection.

Superclasses and interfaces

The ancestor for all evaluators is the weka.attributeSelection.ASEvaluation class.

Here are some interfaces that are commonly implemented by evaluators:

- AttributeEvaluator - evaluates only single attributes
- SubsetEvaluator - evaluates subsets of attributes
- AttributeTransform
rage to the result listener. Normally used with a CrossValidationResultProducer to perform n x m fold cross validation.

OPTIONS

calculateStdDevs -- Record standard deviations for each run.

expectedResultsPerAverage -- Set the expected number of results to average per run. For example, if a CrossValidationResultProducer is being used (with the number of folds set to 10), then the expected number of results per run is 10.

keyFieldName -- Set the field name that will be unique for a run.

resultProducer -- Set the resultProducer for which results are to be averaged.

Clicking the resultProducer panel brings up the following window:

[Screenshot: the weka.gui.GenericObjectEditor for weka.experiment.CrossValidationResultProducer ("Performs a cross validation run using a supplied evaluator"); numFolds = 10, outputFile = splitEvalutorOut.zip, rawOutput = False, splitEvaluator = ClassifierSplitEvaluator]

As with the other ResultProducers, additional schemes can be defined. When the AveragingResultProducer is used, the classifier property is located deeper in the Generator properties hierarchy:

78 CHAPTER 5. EXPERIMENTER

[Screenshot: the Select a property dialog showing calculateStdDevs, expectedResultsPerAverage, keyFieldName and, expanded under resultProducer: numFolds, outputFile, rawOutput, split...]
rain, filter);

  // create new test set
  Instances newTest = Filter.useFilter(test, filter);

208 CHAPTER 16. USING THE API

16.5.2 Filtering on-the-fly

Even though using the API gives one full control over the data and makes it easier to juggle several datasets at the same time, filtering data on-the-fly makes life even easier. This handy feature is available through meta schemes in WEKA, like FilteredClassifier (package weka.classifiers.meta), FilteredClusterer (package weka.clusterers), FilteredAssociator (package weka.associations) and FilteredAttributeEval/FilteredSubsetEval (in weka.attributeSelection). Instead of filtering the data beforehand, one just sets up a meta-scheme and lets the meta-scheme do the filtering for one.

The following example uses the FilteredClassifier in conjunction with the Remove filter to remove the first attribute (which happens to be an ID attribute) from the dataset, and J48 (J48 is WEKA's implementation of C4.5; package weka.classifiers.trees) as base classifier. First the classifier is built with a training set and then evaluated with a separate test set. The actual and predicted class values are printed in the console. For more information on classification, see chapter 16.6.

  import weka.classifiers.meta.FilteredClassifier;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.filters.unsupervised.attribute.Remove;

  Instances train = ... // from somewhere
  Instances test =
ree software, the foundation version is for free).

TortoiseSVN

Under Windows, TortoiseCVS was a CVS client neatly integrated into the Windows Explorer. TortoiseSVN (http://tortoisesvn.tigris.org/) is the equivalent for Subversion.

282 CHAPTER 18. TECHNICAL DOCUMENTATION

18.4 GenericObjectEditor

18.4.1 Introduction

As of version 3.4.4 it is possible for WEKA to dynamically discover classes at runtime, rather than using only those specified in the GenericObjectEditor.props (GOE) file. In some versions (3.5.8, 3.6.0) this facility was not enabled by default, as it is a bit slower than the GOE file approach and, furthermore, does not function in environments that do not have a CLASSPATH (e.g., application servers). Later versions (3.6.1, 3.7.0) enabled the dynamic discovery again, as WEKA can now distinguish between being a standalone Java application and being run in a non-CLASSPATH environment.

If you wish to enable or disable dynamic class discovery, the relevant file to edit is GenericPropertiesCreator.props (GPC). You can obtain this file either from the weka.jar or weka-src.jar archive. Open one of these files with an archive manager that can handle ZIP files (for Windows users, you can use 7-Zip for this) and navigate to the weka/gui directory, where the GPC file is located. All that is required is to change the UseDynamic property in this file from false to true (for enabling it) or the other way round (for disabling it). After c
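The single-line change described above looks like this inside the GPC file (the property name is taken from the text; the comments here are illustrative):

```properties
# GenericPropertiesCreator.props (inside weka/gui/ in weka.jar or weka-src.jar)
# true  = discover classes dynamically on the classpath
# false = use only the static GenericObjectEditor.props lists
UseDynamic=true
```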
rff dataset:

[Screenshot: a file chooser open on the data directory, listing contact-lenses.arff, cpu.arff, cpu.with.vendor.arff, iris.arff, segment-challenge.arff, segment-test.arff, soybean.arff, weather.arff and weather.nominal.arff]

After clicking Open the file will be displayed in the datasets list. If one selects a directory and hits Open, then all ARFF files will be added recursively. Files can be deleted from the list by selecting them and then clicking on Delete selected.

ARFF files are not the only format one can load, but all files that can be converted with Weka's "core converters". The following formats are currently supported:

- ARFF (+ compressed)
- C4.5
- CSV
- libsvm
- binary serialized instances
- XRFF (+ compressed)

5.2. STANDARD EXPERIMENTS 59

By default, the class attribute is assumed to be the last attribute. But if a data format contains information about the class attribute, like XRFF or C4.5, this attribute will be used instead.

[Screenshot: the Experimenter Setup tab in simple mode, showing the Results Destination, Experiment Type and Iteration Control panels]
rkov blanket of the classifier node.

-S [LOO-CV|k-Fold-CV|Cumulative-CV]
        Score type (LOO-CV, k-Fold-CV, Cumulative-CV)

-Q
        Use probabilistic or 0/1 scoring.
        (default probabilistic scoring)

weka.classifiers.bayes.net.search.global.K2

-N
        Initial structure is empty (instead of Naive Bayes)
-P <nr of parents>
        Maximum number of parents
-R
        Random order. (default false)
-mbc
        Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
        Score type (LOO-CV, k-Fold-CV, Cumulative-CV)

8.7. RUNNING FROM THE COMMAND LINE 135

-Q
        Use probabilistic or 0/1 scoring.
        (default probabilistic scoring)

weka.classifiers.bayes.net.search.global.RepeatedHillClimber

-U <integer>
        Number of runs
-A <seed>
        Random number seed
-P <nr of parents>
        Maximum number of parents
-R
        Use arc reversal operation. (default false)
-N
        Initial structure is empty (instead of Naive Bayes)
-mbc
        Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
        Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q
        Use probabilistic or 0/1 scoring.
        (default probabilistic scoring)
rom a dataset that contains more attributes than there are nodes in the network is ok. The extra attributes are just ignored.

Learning from a dataset with differently ordered attributes is ok. Attributes are matched to nodes based on name. However, attribute values are matched with node values based on the order of the values.

The attributes in the dataset should have the same number of values as the corresponding nodes in the network (see above for continuous variables).

8.10 Bayesian nets in the experimenter

Bayesian networks generate extra measures that can be examined in the experimenter. The experimenter can then be used to calculate mean and variance for those measures.

The following metrics are generated:

- measureExtraArcs: extra arcs compared to reference network. The network must be provided as BIFFile to the BayesNet class. If no such network is provided, this value is zero.
- measureMissingArcs: missing arcs compared to reference network, or zero if not provided.
- measureReversedArcs: reversed arcs compared to reference network, or zero if not provided.
- measureDivergence: divergence of network learned compared to reference network, or zero if not provided.
- measureBayesScore: log of the K2 score of the network structure.
- measureBDeuScore: log of the BDeu score of the network structure.
- measureMDLScore: log of the MDL score.
- measureAICScore: log of the AIC score.
- measureEntropyScore: log of the entropy.

8
ror                          0 %
Root relative squared error              0 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
  1        0        1          1        1       yes
  1        0        1          1        1       no

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

=== Stratified cross-validation ===

Correctly Classified Instances           9      64.2857 %
Incorrectly Classified Instances         5      35.7143 %
Kappa statistic                          0.186
Mean absolute error                      0.2857
Root mean squared error                  0.4818
Relative absolute error                 60 %
Root relative squared error             97.6586 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
 0.778    0.6      0.7       0.778    0.737     yes
 0.4      0.222    0.5       0.4      0.444     no

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no

This is quite boring: our classifier is perfect, at least on the training data - all instances were classified correctly and all errors are zero. As is usually the case, the training set accuracy is too optimistic. The detailed accuracy by class, which is output via -i, and the confusion matrix are similarly trivial.

The stratified cv paints a more realistic picture. The accuracy is around 64%. The kappa statistic measures the agreement of prediction with the true class - 1.0 signifies complete agreement. The following error values are not very meaningful for classification tasks; however for regression tasks, e.g., the root of the
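The kappa value reported above can be reproduced from the cross-validation confusion matrix. The sketch below (plain Java, independent of WEKA) computes observed agreement versus chance agreement for a square confusion matrix:

```java
public class Kappa {
    // Cohen's kappa from a square confusion matrix (rows = actual, cols = predicted).
    public static double kappa(int[][] m) {
        int n = 0;
        double observed = 0;
        int[] rowSum = new int[m.length];
        int[] colSum = new int[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                n += m[i][j];
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
                if (i == j) observed += m[i][j];
            }
        }
        observed /= n;                       // fraction classified correctly
        double chance = 0;                   // agreement expected by chance
        for (int i = 0; i < m.length; i++)
            chance += (double) rowSum[i] * colSum[i] / ((double) n * n);
        return (observed - chance) / (1 - chance);
    }

    public static void main(String[] args) {
        // cross-validation matrix from the output above: 7 2 / 3 2
        System.out.println(Kappa.kappa(new int[][]{{7, 2}, {3, 2}})); // ~0.186
        // training matrix 9 0 / 0 5 gives complete agreement
        System.out.println(Kappa.kappa(new int[][]{{9, 0}, {0, 5}})); // 1.0
    }
}
```

With observed agreement 9/14 and chance agreement (9·10 + 5·4)/14² = 110/196, this yields the 0.186 shown in the output.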
ross ones that require a weka.core.SelectedTag as parameter. An example for this is the setEvaluation method of the meta-classifier GridSearch (located in package weka.classifiers.meta). The SelectedTag class is used in the GUI for displaying drop-down lists, enabling the user to choose from a predefined list of values. GridSearch allows the user to choose the statistical measure to base the evaluation on (accuracy, correlation coefficient, etc.).

A SelectedTag gets constructed using the array of all possible weka.core.Tag elements that can be chosen and the integer or string ID of the Tag. For instance, GridSearch's setOptions(String[]) method uses the supplied string ID to set the evaluation type (e.g., "ACC" for accuracy) or, if the evaluation option is missing, the default integer ID EVALUATION_ACC. In both cases, the array TAGS_EVALUATION is used, which defines all possible options:

  import weka.core.SelectedTag;
  ...
  String tmpStr = Utils.getOption('E', options);
  if (tmpStr.length() != 0)
    setEvaluation(new SelectedTag(tmpStr, TAGS_EVALUATION));
  else
    setEvaluation(new SelectedTag(EVALUATION_CC, TAGS_EVALUATION));

198 CHAPTER 16. USING THE API

16.2 Loading data

Before any filter, classifier or clusterer can be applied, data needs to be present. WEKA enables one to load data from files (in various file formats) and also from databases. In the latter case, it is assumed that the database connection is set up and working. S
ross-validation will be performed. Click on More to generate a brief description of the CrossValidationResultProducer:

NAME
weka.experiment.CrossValidationResultProducer

SYNOPSIS
Performs a cross validation run using a supplied evaluator.

OPTIONS

numFolds -- Number of folds to use in cross validation.

outputFile -- Set the destination for saving raw output. If the rawOutput option is selected, then output from the splitEvaluator for individual folds is saved. If the destination is a directory, then each output is saved to an individual gzip file; if the destination is a file, then each output is saved as an entry in a zip file.

rawOutput -- Save raw output (useful for debugging). If set, then output is sent to the destination specified by outputFile.

splitEvaluator -- The evaluator to apply to the cross validation folds. This may be a classifier, regression scheme etc.

76 CHAPTER 5. EXPERIMENTER

As with the RandomSplitResultProducer, multiple schemes can be run during cross validation by adding them to the Generator properties panel.

[Screenshot: the Experimenter Setup tab in advanced mode]
Destination: Choose InstancesResultListener -O Experiment.arff
Result generator: Choose CrossValidationResultProducer -X 10 -O splitEvalutorO
s can be found in the weka.core.converters package) based on the file's extension, or one can use the correct loader directly. The latter case is necessary if the files do not have the correct extension.

The DataSource class (inner class of the weka.core.converters.ConverterUtils class) can be used to read data from files that have the appropriate file extension. Here are some examples:

  import weka.core.converters.ConverterUtils.DataSource;
  import weka.core.Instances;
  ...
  Instances data1 = DataSource.read("/some/where/dataset.arff");
  Instances data2 = DataSource.read("/some/where/dataset.csv");
  Instances data3 = DataSource.read("/some/where/dataset.xrff");

In case the file does have a different file extension than is normally associated with the loader, one has to use a loader directly. The following example loads a CSV ("comma-separated values") file:

  import weka.core.converters.CSVLoader;

16.2. LOADING DATA 199

  import weka.core.Instances;
  import java.io.File;
  ...
  CSVLoader loader = new CSVLoader();
  loader.setSource(new File("/some/where/some.data"));
  Instances data = loader.getDataSet();

NB: Not all file formats allow to store information about the class attribute (e.g., ARFF stores no information about the class attribute, but XRFF does). If a class attribute is required further down the road, e.g., when using a classifier, it can be set with the setClassIndex(int) method:

  // uses the first attribute as class attribute
  if (d
s learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.

-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
        Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.K2

-N
        Initial structure is empty (instead of Naive Bayes)
-P <nr of parents>
        Maximum number of parents
-R
        Random order. (default false)
-mbc
        Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.

8.7. RUNNING FROM THE COMMAND LINE 131

-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
        Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.LAGDHillClimber

-L <nr of look ahead steps>
        Look Ahead Depth
-G <nr of good operations>
        Nr of Good Operations
-P <nr of parents>
        Maximum number of parents
-R
        Use arc reversal operation. (default false)
-N
        Initial structure is empty (instead of Naive Bayes)
-mbc
        Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
        Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.RepeatedHillClimber

-U
s only local files.

13.3 Missing Datatypes

Sometimes (e.g., with MySQL) it can happen that a column type cannot be interpreted. In that case it is necessary to map the name of the column type to the Java type it should be interpreted as. E.g., the MySQL type TEXT is returned as BLOB from the JDBC driver and has to be mapped to String (0 represents String - the mappings can be found in the comments of the properties file):

182 CHAPTER 13. DATABASES

  Java type | Java method  | Identifier | Weka attribute type
  ----------+--------------+------------+--------------------
  String    | getString()  |     0      | nominal
  boolean   | getBoolean() |     1      | nominal
  double    | getDouble()  |     2      | numeric
  byte      | getByte()    |     3      | numeric
  short     | getByte()    |     4      | numeric
  int       | getInteger() |     5      | numeric
  long      | getLong()    |     6      | numeric
  float     | getFloat()   |     7      | numeric
  date      | getDate()    |     8      | date
  text      | getString()  |     9      | string
  time      | getTime()    |    10      | date

In the props file one now lists the type names that the database returns and what Java type each represents (via the identifier), e.g.:

  CHAR=0
  VARCHAR=0

CHAR and VARCHAR are both String types, hence they are interpreted as String (identifier 0).

Note: in case database types have blanks, one needs to replace those blanks with an underscore, e.g., DOUBLE PRECISION must be listed like this:

  DOUBLE_PRECISION=2

13.4 Stored Procedures

Let's say you're tired of typing the same query over and over again. A good way to shorten that is to create a stored procedure.

PostgreSQL 7.4.x

The following example creates a proced
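To illustrate how such mapping lines behave (independently of WEKA's actual database code; the class and helper names here are made up), the key=value entries parse with the standard java.util.Properties loader, and a type name with blanks is looked up with underscores:

```java
import java.io.ByteArrayInputStream;
import java.util.Properties;

public class TypeMapDemo {
    // Looks up the identifier for a database column type, replacing
    // blanks with underscores as the props format requires.
    public static int lookup(String propsText, String dbType) throws Exception {
        Properties p = new Properties();
        p.load(new ByteArrayInputStream(propsText.getBytes("ISO-8859-1")));
        return Integer.parseInt(p.getProperty(dbType.replace(' ', '_')));
    }

    public static void main(String[] args) throws Exception {
        String props = "CHAR=0\nVARCHAR=0\nDOUBLE_PRECISION=2\n";
        System.out.println(lookup(props, "VARCHAR"));          // 0 -> nominal (getString)
        System.out.println(lookup(props, "DOUBLE PRECISION")); // 2 -> numeric (getDouble)
    }
}
```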
s the Class (using the box above the histogram, which will bring up a drop-down list of available selections when clicked). Note that only nominal Class attributes will result in a colour-coding. Finally, after pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window.

Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The four buttons above can also be used to change the selection:

4.2. PREPROCESSING 39

1. All. All boxes are ticked.
2. None. All boxes are cleared (unticked).
3. Invert. Boxes that are ticked become unticked and vice versa.
4. Pattern. Enables the user to select attributes based on a Perl 5 Regular Expression. E.g., .*_id selects all attributes whose name ends with _id.

Once the desired attributes have been selected, they can be removed by clicking the Remove button below the list of attributes. Note that this can be undone by clicking the Undo button, which is located next to the Edit button in the top right corner of the Preprocess panel.

4.2.4 Working With Filters

[Screenshot: the Explorer Preprocess panel with the filter chooser open, showing the weka.filters package tree; selected attribute: outlook (Type: Nominal)]
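For the Pattern button above, the expression is matched against the full attribute name. A quick, WEKA-independent check of the .*_id example (the attribute names are made up):

```java
import java.util.regex.Pattern;

public class PatternDemo {
    // the same kind of expression one would type into the Pattern dialog
    static final Pattern ID_PATTERN = Pattern.compile(".*_id");

    public static boolean matches(String attributeName) {
        // matches() requires the whole name to match, as in the dialog
        return ID_PATTERN.matcher(attributeName).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("customer_id"));  // true
        System.out.println(matches("id_customer"));  // false
    }
}
```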
[Screenshot: the dataset viewer showing the heart-disease data - attributes age, sex, chest_pain, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca and num (types Numeric/Nominal), with the values of the first rows]
see above) and import it in other packages, for example JavaBayes, available under GPL from http://www.cs.cmu.edu/~javabayes.

8.13 Future development

If you would like to add to the current Bayes network facilities in Weka, you might consider one of the following possibilities:

- Implement more search algorithms, in particular:
  - general purpose search algorithms (such as an improved implementation of genetic search);
  - structure search based on equivalent model classes;
  - implement those algorithms both for local and global metric based search algorithms;

8.13. FUTURE DEVELOPMENT 157

  - implement more conditional independence based search algorithms.

- Implement score metrics that can handle sparse instances in order to allow for processing large datasets.

- Implement traditional conditional independence tests for conditional independence based structure learning algorithms.

- Currently, all search algorithms assume that all variables are discrete. Search algorithms that can handle continuous variables would be interesting.

- A limitation of the current classes is that they assume that there are no missing values. This limitation can be undone by implementing score metrics that can handle missing values. The classes used for estimating the conditional probabilities need to be updated as well.

- Only leave-one-out, k-fold and cumulative cross-validation are implemented. These implementations can be made more efficie
set and outputs for a second dataset the predicted clusters and cluster memberships of the individual instances:

  import weka.clusterers.EM;
  import weka.core.Instances;
  ...
  Instances dataset1 = ... // from somewhere
  Instances dataset2 = ... // from somewhere

  // build clusterer
  EM clusterer = new EM();
  clusterer.buildClusterer(dataset1);

  // output predictions
  System.out.println("# - cluster - distribution");
  for (int i = 0; i < dataset2.numInstances(); i++) {
    int cluster = clusterer.clusterInstance(dataset2.instance(i));
    double[] dist = clusterer.distributionForInstance(dataset2.instance(i));
    System.out.print((i+1));
    System.out.print(" - ");
    System.out.print(cluster);
    System.out.print(" - ");
    System.out.print(Utils.arrayToString(dist));
    System.out.println();
  }

16.8. SELECTING ATTRIBUTES 221

16.8 Selecting attributes

Preparing one's data properly is a very important step for getting the best results. Reducing the number of attributes can not only help speeding up runtime with algorithms (some algorithms' runtime is quadratic in regards to the number of attributes), but also help avoid "burying" the algorithm in a mass of attributes, when only a few are essential for building a good model.

There are three different types of evaluators in WEKA at the moment:

- single attribute evaluators - perform evaluations on single attributes. These classes implement the weka.attributeSelection.AttributeEvaluator interface. The
[screenshot: classifier output (Relative absolute error 42.3453 %, Root relative squared error 66.8799 %, Total Number of Instances 150; Detailed Accuracy By Class for Iris-setosa, Iris-versicolor, Iris-virginica) and the result-list popup menu with entries such as View in main window, View in separate window, Save/Delete result buffer, Load/Save model, Re-evaluate model on current test set, Visualize classifier errors / margin curve / threshold curve / cost curve]

The Bayes network is automatically laid out and drawn thanks to a graph drawing algorithm implemented by Ashraf Kibriya.

[screenshot: Weka Classifier Graph Visualizer showing the learned network over the iris attributes]

When you hover the mouse over a node, the node lights up and all its children are highlighted as well, so that it is easy to identify the relation between nodes in crowded graphs.

Saving Bayes nets

You can save the Bayes network to file in the graph visualizer. You have the choice to save in XML BIF format or in dot format. Select the floppy button and a file save dialog pops up that allows you to select the file name and file format.

Zoom
[screenshot: SQL viewer connected to jdbc:mysql://localhost:3306/weka_test, running the query "select * from results0"; the result table shows columns such as Key_Dataset, Key_Scheme, Key_Scheme_options, Key_Scheme_version_ID and Date_time, with buttons Close, Close all, Re-use query and Optimal width]

270 CHAPTER 17. EXTENDING WEKA

Artificial data generation

Purpose

Instead of only having a Generate... button in the PreprocessPanel, or using it from the command line, this example creates a new panel to be displayed as an extra tab in the Explorer. This tab will be available regardless of whether a dataset is already loaded or not ("standalone").

Implementation

- The class is derived from javax.swing.JPanel and implements the interface weka.gui.Explorer.ExplorerPanel (the full source code also imports the weka
This class checks whether all the properties available in the GUI have a tooltip accompanying them and whether the globalInfo() method is declared:

  java weka.core.CheckGOE -W classname -- additional parameters

All tests, once again, need to return yes.

17.2.6.3 Source code

Filters that implement the weka.filters.Sourcable interface can output Java code of their internal representation. In order to check the generated code, one should not only compile the code, but also test it with the following test class:

  weka.filters.CheckSource

This class takes the original WEKA filter, the generated code and the dataset used for generating the source code (and an optional class index) as parameters. It builds the WEKA filter on the dataset and compares the output of the WEKA filter and the output of the generated source code, checking whether they are the same.

Here is an example call for weka.filters.unsupervised.attribute.ReplaceMissingValues and the generated class weka.filters.WEKAWrapper (it wraps the actual generated code in a pseudo-filter):

  java weka.filters.CheckSource \
    -W weka.filters.unsupervised.attribute.ReplaceMissingValues \
    -S weka.filters.WEKAWrapper \
    -t data.arff

It needs to return "Tests OK!".

17.2.6.4 Unit tests

In order to make sure that your filter applies to the WEKA criteria, you should add your filter to the junit unit test framework, i.e., by creating a Test class. The superclass for filter unit tests is weka.filters.A
works in the opposite direction: creating an object from a persistently saved data structure. In Java, an object can be serialized if it implements the java.io.Serializable interface. Members of an object that are not supposed to be serialized need to be declared with the keyword transient.

In the following are some Java code snippets for serializing and deserializing a J48 classifier. Of course, serialization is not limited to classifiers. Most schemes in WEKA, like clusterers and filters, are also serializable.

Serializing a classifier

The weka.core.SerializationHelper class makes it easy to serialize an object. For saving, one can use one of the write methods:

  import weka.classifiers.Classifier;
  import weka.classifiers.trees.J48;
  import weka.core.converters.ConverterUtils.DataSource;
  import weka.core.Instances;
  import weka.core.SerializationHelper;

  // load data
  Instances inst = DataSource.read("/some/where/data.arff");
  inst.setClassIndex(inst.numAttributes() - 1);
  // train J48
  Classifier cls = new J48();
  cls.buildClassifier(inst);
  // serialize model
  SerializationHelper.write("/some/where/j48.model", cls);

Deserializing a classifier

Deserializing an object can be achieved by using one of the read methods:

  import weka.classifiers.Classifier;
  import weka.core.SerializationHelper;

  // deserialize model
  Classifier cls = (Classifier) SerializationHelper.read("/some/where/j48.model");

16.11 SERIALIZATION

Deserializing a classifier saved from th
classifiers.evaluation).

2. Put the plotable data into a plot container, an instance of the PlotData2D class (package weka.gui.visualize).

3. Add the plot container to a visualization panel for displaying the data, an instance of the ThresholdVisualizePanel class (package weka.gui.visualize).

4. Add the visualization panel to a JFrame (package javax.swing) and display it.

And now, the four steps translated into actual code:

1. Generate the plotable data:

  Evaluation eval = ...; // from somewhere
  ThresholdCurve tc = new ThresholdCurve();
  int classIndex = 0;    // ROC for the 1st class label
  Instances curve = tc.getCurve(eval.predictions(), classIndex);

2. Put the plotable data into a plot container:

  PlotData2D plotdata = new PlotData2D(curve);
  plotdata.setPlotName(curve.relationName());
  plotdata.addInstanceNumberAttribute();

3. Add the plot container to a visualization panel:

  ThresholdVisualizePanel tvp = new ThresholdVisualizePanel();
  tvp.setROCString("(Area under ROC = "
    + Utils.doubleToString(ThresholdCurve.getROCArea(curve), 4) + ")");
  tvp.setName(curve.relationName());
  tvp.addPlot(plotdata);

4. Add the visualization panel to a JFrame:

  final JFrame jf = new JFrame("WEKA ROC: " + tvp.getName());
  jf.setSize(500, 400);
  jf.getContentPane().setLayout(new BorderLayout());
  jf.getContentPane().add(tvp, BorderLayout.CENTER);
  jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
  jf.setVisible(true);
classifiers.trees.J48">
      <option name="C">0.001</option>
    </options>
  </option>

Internally, all the options enclosed by the options tag are pushed to the end after the "--", if one transforms the XML into a command line string.

- quotes
A meta-classifier like Stacking can take several -B options, where each single one encloses other options in quotes (this itself can contain a meta-classifier). From

  -B "weka.classifiers.trees.J48"

we then get this XML:

  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.trees.J48"/>
  </option>

With the XML representation one doesn't have to worry anymore about the level of quotes one is using, and therefore doesn't have to care about the correct escaping (i.e., nested \" and \\\" sequences), since this is done automatically.

18.6 XML

And if we now put it all together, we can transform this more complicated command line (java and the CLASSPATH omitted):

  <options type="class" value="weka.classifiers.meta.Stacking">
    <option name="B" type="quotes">
      <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
        <option name="W" type="hyphens">
          <options type="classifier" value="weka.classifiers.trees.J48">
            <option name="C">0.001</option>
          </options>
        </option>
      </options>
    </option>
    <option name="B" type="quotes">
      <options type="classifier" value="weka
instance(i));
  System.out.print((i+1) + " - ");
  System.out.print(test.instance(i).toString(test.classIndex()) + " - ");
  System.out.print(test.classAttribute().value((int) pred) + " - ");
  System.out.println(Utils.arrayToString(dist));

16.6 CLASSIFICATION

16.7 Clustering

Clustering is an unsupervised machine learning technique for finding patterns in the data, i.e., these algorithms work without class attributes. Classifiers, on the other hand, are supervised and need a class attribute. This section, similar to the one about classifiers, covers the following topics:

- Building a clusterer - batch and incremental learning.
- Evaluating a clusterer - how to evaluate a built clusterer.
- Clustering instances - determining what clusters unknown instances belong to.

Fully functional example classes are located in the wekaexamples.clusterers package of the Weka Examples collection [3].

16.7.1 Building a clusterer

Clusterers, just like classifiers, are by design batch-trainable as well. They can all be built on data that is completely stored in memory. But a small subset of the cluster algorithms can also update the internal representation incrementally. The following two sections cover both types of clusterers.

Batch clusterers

Building a batch clusterer, just like a classifier, happens in two stages:

- set options - either calling the setOptions(String[]) method or the appropriate set-methods.
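As a minimal sketch of these two stages, assuming the EM clusterer and a dataset already loaded into memory (the -I option sets EM's maximum number of iterations; the concrete option values here are illustrative):

```java
import weka.clusterers.EM;
import weka.core.Instances;

Instances data = ...;              // from somewhere
String[] options = new String[2];
options[0] = "-I";                 // max. number of iterations
options[1] = "100";
EM clusterer = new EM();           // new instance of clusterer
clusterer.setOptions(options);     // set the options
clusterer.buildClusterer(data);    // build the clusterer
```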
results to the same file by saving them one at a time and using the Append option instead of the Overwrite option for the second and subsequent saves.

[dialog: "File exists" - Append / Overwrite / Choose new name / Cancel]

5.5.3 Changing the Baseline Scheme

The baseline scheme can be changed by clicking Select base... and then selecting the desired scheme. Selecting the OneR scheme causes the other schemes to be compared individually with the OneR scheme.

[dialog: Select items - rules.ZeroR, rules.OneR -B 6, trees.J48 -C 0.25 -M 2, Summary, Ranking]

92 CHAPTER 5. EXPERIMENTER

If the test is performed on the Percent_correct field with OneR as the base scheme, the system indicates that there is no statistical difference between the results for OneR and J48. There is, however, a statistically significant difference between OneR and ZeroR.

[screenshot: Analyse tab with Paired T-Tester (corrected) output for Percent_correct, 3 resultsets, confidence 0.05 (two-tailed)]
petalwidth <= 1.5: Iris-versicolor (31.0/1.0)
  petalwidth > 1.5: Iris-virginica (35.0/3.0)

  Number of Leaves  : 3
  Size of the tree  : 5

5.2 STANDARD EXPERIMENTS

  Correctly Classified Instances          47              92.1569 %
  Incorrectly Classified Instances         4               7.8431 %
  Kappa statistic                          0.8824
  Mean absolute error                      0.0723
  Root mean squared error                  0.2191
  Relative absolute error                 16.2754 %
  Root relative squared error             46.4676 %
  Total Number of Instances               51

  measureTreeSize  : 5.0
  measureNumLeaves : 3.0
  measureNumRules  : 3.0

5.2.2.4 Other Result Producers

Cross-Validation Result Producer

To change from random train and test experiments to cross-validation experiments, click on the Result generator entry. At the top of the window, click on the drop-down list and select CrossValidationResultProducer. The window now contains parameters specific to cross-validation, such as the number of partitions/folds. The experiment performs 10-fold cross-validation instead of train and test in the given example.

[dialog: weka.gui.GenericObjectEditor for weka.experiment.CrossValidationResultProducer ("Performs a cross validation run using a supplied evaluator"), with numFolds 10, outputFile splitEvalutorOut.zip, rawOutput False and splitEvaluator ClassifierSplitEvaluator]

The Result generator panel now indicates that c
for details on XML BIF.

118 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

[screenshot: GenericObjectEditor tree showing weka.classifiers.bayes with AODE, BayesNet, ComplementNaiveBayes, HNB, NaiveBayes, NaiveBayesMultinomial, NaiveBayesSimple, NaiveBayesUpdateable and WAODE]

The Bayes net classifier has the following options:

[dialog: weka.classifiers.bayes.BayesNet - BIFFile, debug, estimator (SimpleEstimator -A 0.5), searchAlgorithm (e.g. TabuSearch -L 5 -U 10 -P 2 -S BAYES), useADTree]

The BIFFile option can be used to specify a Bayes network stored in a file in BIF format. When the toString() method is called after learning the Bayes network, extra statistics (like extra and missing arcs) are printed, comparing the network learned with the one on file.

The searchAlgorithm option can be used to select a structure learning algorithm and specify its options.

The estimator option can be used to select the method for estimating the conditional probability distributions (Section 8.6).

When setting the useADTree option to true, counts are calculated using the ADTree algorithm of M
setOutputFormat(outFormat);
    Instances inst = getInputFormat();
    for (int i = 0; i < inst.numInstances(); i++)
      convertInstance(inst.instance(i));
    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    m_Random = null;
    return (numPendingOutput() != 0);
  }

  protected void convertInstance(Instance instance) {
    if (m_Random == null)
      m_Random = new Random(m_Seed);
    double[] newValues = new double[instance.numAttributes() + 1];
    double[] oldValues = instance.toDoubleArray();
    newValues[newValues.length - 1] = m_Random.nextInt();
    System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
    push(new Instance(1.0, newValues));
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter3(), args);
  }

StreamFilter

This stream filter adds a random number (the seed value is hard-coded) at the end of each Instance of the input data. Since this does not rely on having access to the full data of the first batch, the output format is accessible immediately after using setInputFormat(Instances). All the Instance objects are immediately processed in input(Instance), via the convertInstance(Instance) method, which pushes them immediately to the output queue.

  import weka.core.*;
  import weka.core.Capabilities.*;
  import java.util.Random;

  public class StreamFilter extends Filter {

    protected Random m_Random;

    public String globalInfo() {
      return "A stream filter that adds an attribute 'blah'
at once. This is fine if the training data fits into memory. But there are also algorithms available that can update their internal model on the go. These classifiers are called incremental. The following two sections cover the batch and the incremental classifiers.

Batch classifiers

A batch classifier is really simple to build:

- set options - either using the setOptions(String[]) method or the actual set-methods.
- train it - calling the buildClassifier(Instances) method with the training set. By definition, the buildClassifier(Instances) method resets the internal model completely, in order to ensure that subsequent calls of this method with the same data result in the same model ("repeatable experiments").

The following code snippet builds an unpruned J48 on a dataset:

  import weka.core.Instances;
  import weka.classifiers.trees.J48;

  Instances data = ...;          // from somewhere
  String[] options = new String[1];
  options[0] = "-U";             // unpruned tree
  J48 tree = new J48();          // new instance of tree
  tree.setOptions(options);      // set the options
  tree.buildClassifier(data);    // build classifier

Incremental classifiers

All incremental classifiers in WEKA implement the interface UpdateableClassifier (located in package weka.classifiers). Bringing up the Javadoc for this particular interface tells one what classifiers implement this interface. These classifiers can be used to process large amounts of data with a small memory
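A hedged sketch of the incremental counterpart, using NaiveBayesUpdateable and ArffLoader to stream the instances one at a time (the file path is hypothetical):

```java
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import java.io.File;

// read only the header first, then stream the rows one by one
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));  // hypothetical path
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);

// initialize with the structure, then update per instance
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
  nb.updateClassifier(current);
```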
takes the first k bits from one and the remainder from another network structure in the population. At least one of useMutation and useCrossOver should be set to true.

useTournamentSelection - when false, the best performing networks are selected from the descendant population to form the population of the next generation. When true, tournament selection is used. Tournament selection randomly chooses two individuals from the descendant population and selects the one that performs best.

8.3 Conditional independence test based structure learning

Conditional independence tests in Weka are slightly different from the standard tests described in the literature. To test whether variables x and y are conditionally independent given a set of variables Z, a network structure with arrows {z -> y for all z in Z} is compared with one with arrows {x -> y} united with {z -> y for all z in Z}. A test is performed by using any of the score metrics described in Section 2.1.

[screenshot: package tree weka.classifiers.bayes.net.search.ci with CISearchAlgorithm and ICSSearchAlgorithm]

At the moment, only the ICS and CI algorithms are implemented.

The ICS algorithm makes two steps: first find a skeleton (the undirected graph with edges if and only if there is an arrow in the network structure), and second direct

124 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

all the edges in the skeleton to get a DAG.

Starting with a complete undirected graph, we try to
that the same version of Java must be used for the Experimenter and remote engines.

- Rename the remote.policy.example file to remote.policy.
- For each machine you want to run a remote engine on:
  - ssh to the machine.
  - cd to /home/johndoe/remote_engine.
  - Run /home/johndoe/startRemoteEngine (to enable the remote engines to use more memory, modify the -Xmx option in the startRemoteEngine script).

5.4.4 Configuring the Experimenter

Now we will run the Experimenter:

- HSQLDB
  - Copy the DatabaseUtils.props.hsql file from weka/experiment in the weka.jar archive to the /home/johndoe/remote_engine directory and rename it to DatabaseUtils.props.
  - Edit this file and change the "jdbcURL=jdbc:hsqldb:hsql://server_name/database_name" entry to include the name of the machine that is running your database server (e.g., jdbcURL=jdbc:hsqldb:hsql://dodo.company.com/experiment).
  - Now start the Experimenter (inside this directory):

  java \
    -cp /home/johndoe/jars/hsqldb.jar:remoteEngine.jar:/home/johndoe/weka/weka.jar \
    -Djava.rmi.server.codebase=file:/home/johndoe/weka/weka.jar \
    weka.gui.experiment.Experimenter

- MySQL
  - Copy the DatabaseUtils.props.mysql file from weka/experiment in the weka.jar archive to the /home/johndoe/remote_engine directory and rename it to DatabaseUtils.props.

(Weka's source code can be found in the weka-src.jar archive or obtained from Subversion.)
makes it very easy to "embed" it in other projects. This chapter covers the basics of how to achieve the following common tasks from source code:

- Setting options
- Creating datasets in memory
- Loading and saving data
- Filtering
- Classifying
- Clustering
- Selecting attributes
- Visualization
- Serialization

Even though most of the code examples are for the Linux platform, using forward slashes in the paths and file names, they do work on the MS Windows platform as well. To make the examples work under MS Windows, one only needs to adapt the paths, changing the forward slashes to backslashes and adding a drive letter where necessary.

Note: WEKA is released under the GNU General Public License version 2 (GPLv2), i.e., derived code or code that uses WEKA needs to be released under the GPLv2 as well. If one is just using WEKA for a personal project that does not get released publicly, then one is not affected. But as soon as one makes the project publicly available (e.g., for download), then one needs to make the source code available under the GPLv2 as well, alongside the binaries.

http://www.gnu.org/licenses/gpl-2.0.html

196 CHAPTER 16. USING THE API

16.1 Option handling

Configuring an object, e.g., a classifier, can either be done using the appropriate get/set-methods for the property that one wishes to change, like the Explorer does. Or, if the class implements the weka.core.OptionHandler interface, one can just use the
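The OptionHandler approach can be sketched as follows (a hedged example, assuming the J48 classifier; Utils.splitOptions and Utils.joinOptions convert between a command-line style string and the String array that the OptionHandler methods expect):

```java
import weka.classifiers.trees.J48;
import weka.core.Utils;

J48 cls = new J48();
// setOptions takes a String array; Utils.splitOptions turns a
// command-line style string into such an array
cls.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
// getOptions returns the current configuration again
System.out.println(Utils.joinOptions(cls.getOptions()));
```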
setOptions(String[]) method must result in the same configuration. This method will get called in the GUI when copying a classifier setup to the clipboard. Since handling of arrays is a bit cumbersome in Java (due to their fixed length), using an instance of java.util.Vector is a lot easier for creating the array that needs to be returned. The following code snippet just adds the only option, "alpha", that the classifier defines to the array that is being returned, including the options of the superclass:

  import java.util.Arrays;
  import java.util.Vector;
  ...
  public String[] getOptions() {
    Vector<String> result = new Vector<String>();
    result.add("-alpha");
    result.add("" + getAlpha());
    result.addAll(Arrays.asList(super.getOptions()));  // superclass
    return result.toArray(new String[result.size()]);
  }

Note that the getOptions() method requires you to add the preceding dash for an option, as opposed to the getOption/getFlag calls in the setOptions method.

getCapabilities()

Returns meta-information on what type of data the classifier can handle, in regards to attributes and class attributes. See section "Capabilities" on page 42 for more information.

buildClassifier(Instances)

Builds the model from scratch with the provided dataset. Each subsequent call of this method must result in the same model being built. The buildClassifier method also tests whether the supplied data can be handled at all by the classifier, utilizing the ca
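For symmetry, here is a hedged sketch of the matching setOptions(String[]) for the same hypothetical "alpha" option. It uses weka.core.Utils.getOption, which (unlike getOptions) expects the option name without the preceding dash; the default value 0.75 is made up for the example:

```java
import weka.core.Utils;

public void setOptions(String[] options) throws Exception {
  // getOption removes the option and its value from the array;
  // an empty string means the option was not supplied
  String tmp = Utils.getOption("alpha", options);
  if (tmp.length() != 0)
    setAlpha(Double.parseDouble(tmp));
  else
    setAlpha(0.75);           // hypothetical default
  super.setOptions(options);  // let the superclass parse its options
}
```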
datasets used in the experiment. In this example, there was only one dataset, and OneR was better than ZeroR once and never equivalent to or worse than ZeroR (1/0/0); J48 was also better than ZeroR on the dataset.

The standard deviation of the attribute being evaluated can be generated by selecting the Show std. deviations check box and hitting Perform test again. The value (10) at the beginning of the iris row represents the number of estimates that are used to calculate the standard deviation (the number of runs in this case).

[screenshot: Analyse tab showing Percent_correct with standard deviations - iris (10): 33.33 +/- 0.00 for rules.ZeroR, 94.31 +/- 2.52 for rules.OneR -B 6 and 94.90 +/- 2.95 for trees.J48 -C 0.25 -M 2]
import weka.clusterers.EM;
  import weka.core.Instances;

  Instances data = ...; // from somewhere
  EM cl = new EM();
  cl.buildClusterer(data);
  ClusterEvaluation eval = new ClusterEvaluation();
  eval.setClusterer(cl);
  eval.evaluateClusterer(new Instances(data));
  System.out.println(eval.clusterResultsToString());

Density-based clusterers, i.e., algorithms that implement the interface DensityBasedClusterer (package weka.clusterers), can be cross-validated and the log-likelihood obtained. Using the MakeDensityBasedClusterer meta-clusterer, any non-density-based clusterer can be turned into one. Here is an example of cross-validating a density-based clusterer and obtaining the log-likelihood:

  import weka.clusterers.ClusterEvaluation;
  import weka.clusterers.DensityBasedClusterer;
  import weka.core.Instances;
  import java.util.Random;

  Instances data = ...; // from somewhere
  DensityBasedClusterer clusterer = new ...; // the clusterer to evaluate
  double logLikelihood =
    ClusterEvaluation.crossValidateModel( // cross-validate
      clusterer, data, 10,                // with 10 folds
      new Random(1));                     // and random number generator with seed 1

Classes to clusters

Datasets for supervised algorithms, like classifiers, can be used to evaluate a clusterer as well. This evaluation is called classes-to-clusters, as the clusters are mapped back onto the classes.

This type of evaluation is performed as follows:

1. create a copy of
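A hedged sketch of this kind of evaluation, assuming the class attribute is the last one (the Remove filter uses 1-based attribute indices, hence the "+ 1"):

```java
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

Instances data = ...; // from somewhere, with a class attribute
data.setClassIndex(data.numAttributes() - 1);

// create a copy of the data without the class attribute
Remove filter = new Remove();
filter.setAttributeIndices("" + (data.classIndex() + 1)); // 1-based
filter.setInputFormat(data);
Instances dataNoClass = Filter.useFilter(data, filter);

// build the clusterer on the class-free copy
EM clusterer = new EM();
clusterer.buildClusterer(dataNoClass);

// evaluate against the original data, which still carries the class
ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(clusterer);
eval.evaluateClusterer(data);
System.out.println(eval.clusterResultsToString());
```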
with 66% of the data used for training and 34% used for testing.

[screenshot: Setup tab with RandomSplitResultProducer -P 66.0 and the algorithms ZeroR, OneR -B 6 and J48 -C 0.25 -M 2]

After the experiment setup is complete, run the experiment. Then, to analyse the results, select the Analyse tab at the top of the Experiment Environment window.

Click on Experiment to analyse the results of the current experiment.

[screenshot: Analyse tab of the Weka Experiment Environment]
the XML file. The file is expected to end with .xml.

- KOML
Since the KOML serialization captures everything of a Java object, we can use it just like the normal Java serialization. The file is expected to end with .koml.

The built-in serialization can be used in the Experimenter for loading/saving options from algorithms that have been added to a Simple Experiment. Unfortunately it is not possible to create a hierarchical structure like the one mentioned in Section 18.6.1. This is because of the loss of information caused by the getOptions() method of classifiers: it returns only a flat String array and not a tree structure.

294 CHAPTER 18. TECHNICAL DOCUMENTATION

Responsible class(es):

  weka.core.xml.KOML
  weka.classifiers.xml.XMLClassifier

18.6.4 Bayesian Networks

The GraphVisualizer (weka.gui.graphvisualizer.GraphVisualizer) can save graphs in the Interchange Format for Bayesian Networks (BIF). If started from the command line with an XML filename as first parameter, and not from the Explorer, it can display the given file directly.

The DTD for BIF is this:

  <!DOCTYPE BIF [
  <!ELEMENT BIF ( NETWORK )*>
  <!ATTLIST BIF VERSION CDATA #REQUIRED>
  <!ELEMENT NETWORK ( NAME, ( PROPERTY | VARIABLE | DEFINITION )* )>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT VARIABLE ( NAME, ( OUTCOME | PROPERTY )* )>
  <!ATTLIST VARIABLE TYPE (nature|decision|utility) "nature">
  <!ELEMENT OUTCOME (#PCDATA)>
  <!E
the appropriate Instances object as parameter, since all Instance objects being processed will rely on the output format (they use it as the dataset that they belong to).

getOutputFormat()

This method returns the currently set Instances object that defines the output format. In case setOutputFormat(Instances) has not been called yet, this method will return null.

input(Instance)

Returns true if the given Instance can be processed straight away and can be collected immediately via the output() method (after adding it to the output queue via push(Instance), of course). This is also the case if the first batch of data has been processed and the Instance belongs to the second batch. Via isFirstBatchDone() one can query whether this Instance is still part of the first batch or of the second.

If the Instance cannot be processed immediately, e.g., the filter needs to collect all the data first before doing some calculations, then it needs to be buffered with bufferInput(Instance) until batchFinished() is called. In this case, the method needs to return false.

bufferInput(Instance)

In case an Instance cannot be processed immediately, one can use this method to buffer it in the input queue. All buffered Instance objects are available via the getInputFormat() method.

push(Instance)

Adds the given Instance to the output queue.

output()

Returns the next Instance object from the output queue and removes it from there. In case there is no Instance
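To see how these methods fit together, here is a hedged sketch of a minimal (hypothetical) pass-through stream filter: each incoming Instance is pushed straight to the output queue, so input(Instance) can always return true:

```java
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;

public class PassThroughFilter extends Filter {

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    // output format equals input format and is known immediately
    setOutputFormat(instanceInfo);
    return true;  // output format can be collected right away
  }

  public boolean input(Instance instance) {
    if (getInputFormat() == null)
      throw new IllegalStateException("No input instance format defined");
    if (m_NewBatch) {
      resetQueue();
      m_NewBatch = false;
    }
    push((Instance) instance.copy());  // straight to the output queue
    return true;  // an output Instance is available via output()
  }
}
```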
the following information to be generated:

5.5 ANALYSING RESULTS

[screenshot: Analyse tab with the Summary output:

    a b c   (No. of datasets where [col] >> [row])
    - 1 1 | a = (1) rules.ZeroR
    0 - 0 | b = (2) rules.OneR -B 6
    0 0 - | c = (3) trees.J48 -C 0.25 -M 2
]

In this experiment, the first row (- 1 1) indicates that column b (OneR) is better than row a (ZeroR), and that column c (J48) is also better than row a. The number in brackets represents the number of significant wins for the column with regard to the row. A 0 means that the scheme in the corresponding column did not score a single win
them into a set of instances.

[Dialog: the InstancesResultListener options, with outputFile set to Experiment1.arff.]

The dataset name is displayed in the Destination panel of the Setup tab.

[Screenshot: the Setup tab in Simple mode, with the InstancesResultListener writing to Experiment1.arff and the RandomSplitResultProducer (-P 66.0 -O splitEvalutorOut.zip -W weka.experiment.ClassifierSplitEvaluator) as result generator, and data/iris.arff as dataset.]

Saving the Experiment Definition

The experiment definition can be saved at any time. Select Save... at the top of the Setup tab. Type the dataset name with the extension .exp (or select the dataset name if the experiment definition dataset already exists) for binary files, or choose Experiment configuration files (*.xml) from the file types combobox (the XML files are robust with respect to version changes).

66 CHAPTER 5. EXPERIMENTER

[File dialog for saving the experiment definition, with Save In set to weka-3-5-6.]
therefore sets a platform-specific Swing theme. Unfortunately, this doesn't seem to be working correctly in Java 1.5 together with Gnome. A workaround for this is to set the cross-platform Metal theme.
In order to use another theme one only has to create the following properties file in one's home directory:

  LookAndFeel.props

With this content:

  Theme=javax.swing.plaf.metal.MetalLookAndFeel

19.2.14 KnowledgeFlow toolbars are empty

In the terminal, you will most likely see this output as well:

  Failed to instantiate: weka.gui.beans.Loader

This behavior can happen under Gnome with Java 1.5, see Section 19.2.13 for a solution.

19.2.15 Links

- Java VM options: http://java.sun.com/docs/hotspot/VMOptions.html

Bibliography

[1] Witten, I.H. and Frank, E. (2005). Data Mining: Practical machine learning tools and techniques. 2nd edition, Morgan Kaufmann, San Francisco.
[2] WekaWiki -- http://weka.wikispaces.com/
[3] Weka Examples -- A collection of example classes, as part of an ANT project, included in the WEKA snapshots (available for download on the homepage) or directly from subversion: https://svn.scms.waikato.ac.nz/svn/weka/branches/stable-3-6/wekaexamples/
[4] J. Platt (1998): Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf and C. Burges and A. Smola, editors, Advances in Kernel Methods -- Support Vector Learning.
[5] Drummond, C. and Holte, R.
thods of the properties);
- build the model with training data by calling the buildClusterer(Instances) method. By definition, subsequent calls of this method must result in the same model ("repeatable experiments"). In other words, calling this method must completely reset the model.

Below is an example of building the EM clusterer with a maximum of 100 iterations. The options are set using the setOptions(String[]) method:

  import weka.clusterers.EM;
  import weka.core.Instances;
  ...
  Instances data = ...            // from somewhere
  String[] options = new String[2];
  options[0] = "-I";              // max. iterations
  options[1] = "100";
  EM clusterer = new EM();        // new instance of clusterer
  clusterer.setOptions(options);  // set the options
  clusterer.buildClusterer(data); // build the clusterer

Incremental clusterers

Incremental clusterers in WEKA implement the interface UpdateableClusterer (package weka.clusterers). Training an incremental clusterer happens in three stages, similar to incremental classifiers:

1. initialize the model by calling the buildClusterer(Instances) method. Once again, one can either use an empty weka.core.Instances object or one with an initial set of data.
2. update the model row by row by calling the updateClusterer(Instance) method.
3. finish the training by calling the updateFinished() method, in case cluster algorithms need to perform computationally expensive post-processing or clean-up operations.

16.7 CLUSTERING
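The three-stage life cycle can be sketched without WEKA at all. The hypothetical OneClusterSketch below maintains a single running-mean "cluster" but follows the same build / update / finish calling pattern as an UpdateableClusterer; the class and its single-cluster behavior are invented purely for illustration.

```java
import java.util.List;

// Illustrative one-cluster "clusterer" with the same three-stage life cycle
// as an incremental clusterer: build, update row by row, finish.
public class OneClusterSketch {
    private double sum;
    private long count;
    private double mean = Double.NaN;

    // Stage 1: initialize the model, optionally with an initial batch.
    // Subsequent calls must fully reset the model (repeatable experiments).
    public void buildClusterer(List<Double> initial) {
        sum = 0; count = 0; mean = Double.NaN;
        for (double v : initial) updateClusterer(v);
    }

    // Stage 2: update the model one row at a time.
    public void updateClusterer(double value) {
        sum += value;
        count++;
    }

    // Stage 3: finish training; post-processing happens here.
    public void updateFinished() {
        mean = (count == 0) ? Double.NaN : sum / count;
    }

    public double clusterCenter() { return mean; }
}
```

The point of the sketch is the protocol, not the clustering: nothing is finalized until updateFinished() has run.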
tified:

  grep "Correctly"

this should give you all cross-validated accuracies. If the cross-validated accuracy is roughly the same as the training set accuracy, this indicates that your classifier is presumably not overfitting the training set.

Now you have found the best classifier. To apply it on a new dataset, use e.g.

  java weka.classifiers.trees.J48 -l J48.data.model -T new-data.arff

You will have to use the same classifier to load the model, but you need not set any options. Just add the new test file via -T. If you want, -p first-last will output all test instances with classifications and confidence, followed by all attribute values, so you can look at each error separately.

The following more complex csh script creates datasets for learning curves, i.e. creating a 75% training set and 25% test set from a given dataset, then successively reducing the test set by factor 1.2 (83%), until it is also 25% in size. All this is repeated thirty times, with different random reorderings (-S) and the results are written to different directories. The Experimenter GUI in WEKA can be used to design and run similar experiments.

  #!/bin/csh
  foreach f ($*)
    set run=1
    while ( $run <= 30 )
      mkdir $run >& /dev/null
      java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -c last -i ...
      java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -V -c ...
      foreach nr (0 1 2 3 4 5)
        set nrp1=$nr
        @ nrp1++
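The overall shape of such a driver script, one directory per run, with the run number doubling as the random seed so that every run is reproducible, can also be sketched in portable sh. The weka invocation below is written out with echo as a placeholder, and the file name data.arff is invented; the exact filter options depend on your setup.

```shell
#!/bin/sh
# Sketch of the learning-curve driver loop: one directory per run, the run
# number reused as the random seed (-S) for reproducibility.
# "data.arff" and the echoed command line are placeholders, not real options.
for f in data.arff; do
  run=1
  while [ "$run" -le 3 ]; do        # the original script uses 30 runs
    mkdir -p "run$run"
    # placeholder for the two StratifiedRemoveFolds invocations that create
    # the train/test split with seed $run:
    echo "java weka.filters.supervised.instance.StratifiedRemoveFolds -S $run -i $f" \
      > "run$run/command.txt"
    run=$((run + 1))
  done
done
```

Because the seed varies with the run while everything else is fixed, re-running the script reproduces exactly the same splits.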
tions you want to implement: public Enumeration listOptions(), public void setOptions(String[] options), public String[] getOptions(), and the get and set methods for the properties you want to be able to set.

NB: do not use the -E option, since that is reserved for the BayesNet class to distinguish the extra options for the SearchAlgorithm class and the Estimator class. If the -E option is used and no extra arguments are passed to the SearchAlgorithm, the extra options to your Estimator will be passed to the SearchAlgorithm instead. In short, do not use the -E option.

8.12 FAQ

How do I use a data set with continuous variables with the BayesNet classes?

Use the class weka.filters.unsupervised.attribute.Discretize to discretize them. From the command line, you can use

  java weka.filters.unsupervised.attribute.Discretize -B 3 -i infile.arff -o outfile.arff

where the -B option determines the cardinality of the discretized variables.

How do I use a data set with missing values with the BayesNet classes?

You would have to delete the entries with missing values or fill in dummy values.

How do I create a random Bayes net structure?

Running from the command line

  java weka.classifiers.bayes.net.BayesNetGenerator -B -N 10 -A 9 -C 2

will print a Bayes net with 10 nodes, 9 arcs and binary variables in XML BIF format to standard output.

How do I create an artificial data set using a random Bayes net?

Running

  java weka.classifiers.bayes.
tives numeric, True_negative_rate numeric, Num_true_negatives numeric, False_negative_rate numeric, Num_false_negatives numeric, IR_precision numeric, IR_recall numeric, F_measure numeric, Area_under_ROC numeric, Time_training numeric, Time_testing numeric, and a nominal Summary attribute whose values are the classifiers' textual summaries (e.g. "Number of leaves: 3\nSize of the tree: 5\n"), followed by measureTreeSize numeric, measureNumLeaves numeric and measureNumRules numeric.

[A sample result row for weka.classifiers.rules.ZeroR on the iris dataset follows in the original listing.]

5.2.2.3 Changing the Experiment Parameters

Changing the Classifier

The parameters of an experiment can be changed by clicking on the Result generator panel.

The RandomSplitResultProducer performs repeated train/test runs.

[Screenshot: the GenericObjectEditor for weka.experiment.RandomSplitResultProducer, with outputFile splitEvalutorOut.zip, randomizeData True, rawOutput False, splitEvaluator ClassifierSplitEvaluator and trainPercent 66.0.]

The number of instances (expressed as a percentage) used for training is given in the
top- and bottom-most, respectively.

- Space Horizontal/Vertical: spaces out nodes in the selection evenly between left- and right-most (or top- and bottom-most), respectively. The order in which the nodes are selected impacts the place the node is moved to.

Tools menu

[Screenshot: the Tools menu, with entries Generate Network (Ctrl+N), Generate Data (Ctrl+D), Set Data (Ctrl+A), Learn Network (Ctrl+L), Learn CPT, Layout, Show Margins and Show Cliques.]

The Generate Network menu allows generation of a complete random Bayesian network. It brings up a dialog to specify the number of nodes, number of arcs, cardinality and a random seed to generate a network.

146 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

[Dialog: Generate Random Bayesian Network, with fields for nr of nodes, nr of arcs, cardinality and random seed.]

The Generate Data menu allows for generating a data set from the Bayesian network in the editor. A dialog is shown to specify the number of instances to be generated, a random seed and the file to save the data set into. The file format is arff. When no file is selected (field left blank), no file is written and only the internal data set is set.

[Dialog: Generate Random Data Options, with fields for nr of instances, random seed and an optional output file.]

The Set Data menu sets the current data set. From this data set a new Bayesi
u might have to edit the file in weka/experiment/DatabaseUtils.props.

4. Generate. Enables you to generate artificial data from a variety of DataGenerators.

Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.

NB: This list of formats can be extended by adding custom file converters to the weka.core.converters package.

4.2.2 The Current Relation

Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation box (the "current relation" is the currently loaded data, which can be interpreted as a single relational table in database terminology) has three entries:

38 CHAPTER 4. EXPLORER

1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.

2. Instances. The number of instances (data points, records) in the data.

3. Attributes. The number of attributes (features) in the data.

4.2.3 Working With Attributes

Below the Current relation box is a box titled Attributes. There are four buttons, and beneath them is a list of the attributes in the current relation. The list has three columns:

1. No. A number that identifies the attribute in the or
uch sense, since this Saver takes an ARFF file as input and output. The ArffSaver is normally used from Java for saving an object of weka.core.Instances to a file.

- The C45Loader either takes the .names file or the .data file as input; it automatically looks for the other one.

- For the C45Saver one specifies as output file a filename without any extension, since two output files will be generated (.names and .data are automatically appended).

11.2.2 Database converters

The database converters are a bit more complex, since they also rely on additional configuration files, besides the parameters on the commandline. The setup for the database connection is stored in the following props file:

  DatabaseUtils.props

The default file can be found here:

  weka/experiment/DatabaseUtils.props

- Loader
  You have to specify at least a SQL query with the -Q option (there are additional options for incremental loading):

  java weka.core.converters.DatabaseLoader -Q "select * from employee"

- Saver
  The Saver takes an ARFF file as input like any other Saver, but then also the table where to save the data to via -T:

  java weka.core.converters.DatabaseSaver -i iris.arff -T iris

Chapter 12

Stemmers

12.1 Introduction

Weka now supports stemming algorithms. The stemming algorithms are located in the following package:

  weka.core.stemmers

Currently, the Lovins Stemmer (+ iterated version) and
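Returning to the database converters above: a minimal DatabaseUtils.props could look as follows. The jdbcDriver and jdbcURL keys are the essential ones; the MySQL driver class, host and database name shown here are examples only, substitute the values for your own database.

```properties
# Minimal DatabaseUtils.props sketch (values are illustrative):
jdbcDriver=com.mysql.jdbc.Driver
jdbcURL=jdbc:mysql://localhost:3306/some_database
```

With this file on the classpath, the DatabaseLoader/DatabaseSaver command lines shown above can locate the database without any further connection parameters.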
umber generators never return a completely random sequence of numbers anyway, only a pseudo-random one. In order to achieve repeatable pseudo-random sequences, seeded generators are used. Using the same seed value will always result in the same sequence then.
The default constructor of the java.util.Random random number generator class should never be used, as such created objects will most likely generate different sequences. The constructor Random(long), using a specified seed value, is the recommended one to use.
In order to get a more dataset-dependent randomization of the data, the getRandomNumberGenerator(int) method of the weka.core.Instances class can be used. This method returns a java.util.Random object that was seeded with the sum of the supplied seed and the hashcode of the string representation of a randomly chosen weka.core.Instance of the Instances object (using a random number generator seeded with the seed supplied to this method).

206 CHAPTER 16. USING THE API

16.5 Filtering

In WEKA, filters are used to preprocess the data. They can be found below package weka.filters. Each filter falls into one of the following two categories:

- supervised -- The filter requires a class attribute to be set.
- unsupervised -- A class attribute is not required to be present.

And into one of the two sub-categories:

- attribute-based -- Columns are processed, e.g., added or removed.
- instance-based -- Rows are processed, e.g., added or de
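The repeatability point made above for seeded generators is easy to verify with plain java.util.Random: two generators built with the same seed produce identical sequences, so a shuffle driven by a seeded Random is itself repeatable. The helper class below is a stand-alone illustration, not part of WEKA.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeededShuffle {
    // Shuffles a copy of the given list with a seeded generator, so the
    // same seed always yields the same ordering (a repeatable experiment).
    public static List<Integer> shuffled(List<Integer> data, long seed) {
        List<Integer> copy = new ArrayList<>(data);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }
}
```

This is exactly why Random(long) is recommended over the default constructor: with new Random() every run would produce a different ordering.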
updated on an instance by instance basis. Currently in WEKA there are ten classifiers that can handle data incrementally:

- AODE
- IB1
- IBk
- KStar
- NaiveBayesMultinomialUpdateable
- NaiveBayesUpdateable
- NNge
- Winnow

And two of them are meta classifiers:

- RacedIncrementalLogitBoost -- can use any regression base learner to learn from discrete class data incrementally
- LWL -- locally weighted learning

6.2 Features

The KnowledgeFlow offers the following features:

- intuitive data flow style layout
- process data in batches or incrementally
- process multiple batches or streams in parallel (each separate flow executes in its own thread)
- chain filters together
- view models produced by classifiers for each fold in a cross-validation
- visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)
- plugin facility for allowing easy addition of new components to the KnowledgeFlow

6.3 Components

Components available in the KnowledgeFlow:

6.3.1 DataSources

All of WEKA's loaders are available.

[Screenshot: the DataSources toolbar, showing the Arff, C4.5, CSV, Database, LibSVM, Serialized Instances, TextDirectory and XRFF loaders.]

6.3.2 DataSinks

All of WEK
ure called employee_name that returns the names of all the employees in table employee. Even though it doesn't make much sense to create a stored procedure for this query, nonetheless it shows how to create and call stored procedures in PostgreSQL.

- Create:

  CREATE OR REPLACE FUNCTION public.employee_name()
    RETURNS SETOF text AS 'select name from employee'
    LANGUAGE 'sql' VOLATILE;

- SQL statement to call procedure:

  SELECT * FROM employee_name()

- Retrieve data via InstanceQuery:

  java weka.experiment.InstanceQuery -Q "SELECT * FROM employee_name()" -U <user> -P <password>

13.5 Troubleshooting

- In case you're experiencing problems connecting to your database, check out the WEKA Mailing List (see the Weka homepage for more information). It is possible that somebody else encountered the same problem as you and you'll find a post containing the solution to your problem.

- Specific MS SQL Server 2000 Troubleshooting:

  - Error Establishing Socket with JDBC Driver: add TCP/IP to the list of protocols as stated in the following article: http://support.microsoft.com/default.aspx?scid=kb;en-us;313178

  - Login failed for user 'sa'. Reason: Not associated with a trusted SQL Server connection. For changing the authentication to mixed mode see the following article: http://support.microsoft.com/kb/319930/en-us

- MS SQL Server 2005: TCP/IP is not enabled for SQL Server, or the server or port nu
urns an enumeration of the available options; these are printed if one calls the filter with the -h option.
- setOptions(String[]) -- parses the given option array that was passed from the command line.
- getOptions() -- returns an array of options, resembling the current setup of the filter.

See section "Methods" on page and section "Parameters" on page for more information.

In the following an example implementation that adds an additional attribute at the end, containing the index of the processed instance:

  import weka.core.*;
  import weka.core.Capabilities.*;
  import weka.filters.*;

  public class SimpleBatch extends SimpleBatchFilter {

    public String globalInfo() {
      return "A simple batch filter that adds an additional attribute 'blah' "
        + "at the end containing the index of the processed instance.";
    }

    public Capabilities getCapabilities() {
      Capabilities result = super.getCapabilities();
      result.enableAllAttributes();
      result.enableAllClasses();
      result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
      return result;
    }

    protected Instances determineOutputFormat(Instances inputFormat) {
      Instances result = new Instances(inputFormat, 0);
      result.insertAttributeAt(new Attribute("blah"), result.numAttributes());
      return result;
    }

    protected Instances process(Instances inst) {
      Instances result = new Instances(determineOutputFormat(inst), 0);
      for (int i = 0; i <
ury.arff -V -C last -L 19

1.2 BASIC CONCEPTS

1.2.4 weka.classifiers

Classifiers are at the core of WEKA. There are a lot of common options for classifiers, most of which are related to evaluation purposes. We will focus on the most important ones. All others, including classifier-specific parameters, can be found via -h, as usual.

-t    specifies the training file (ARFF format).
-T    specifies the test file in ARFF format. If this parameter is missing, a crossvalidation will be performed (default: ten-fold cv).
-x    this parameter determines the number of folds for the cross-validation. A cv will only be performed if -T is missing.
-c    as we already know from the weka.filters section, this parameter sets the class variable with a one-based index.
-d    the model after training can be saved via this parameter. Each classifier has a different binary format for the model, so it can only be read back by the exact same classifier on a compatible dataset. Only the model on the training set is saved, not the multiple models generated via cross-validation.
-l    loads a previously saved model, usually for testing on new, previously unseen data. In that case, a compatible test file should be specified, i.e. the same attributes in the same order.
-p #  if a test file is specified, this parameter shows you the predictions and one attribute (0 for none) for all test instances.
-i    a more detailed performance description via precision, recall, true- and f
ute declaration. For example:

  @RELATION Timestamps

  @ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"

  @DATA
  "2001-04-03 12:12:12"
  "2001-05-03 12:59:55"

Relational data must be enclosed within double quotes ". For example, an instance of the MUSK1 dataset ("..." denotes an omission):

  MUSK-188,"42,...,30",1

9.3 Sparse ARFF files

Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented.

Sparse ARFF files have the same header (i.e. @relation and @attribute tags), but the data section is different. Instead of representing each value in order, like this:

  @data
  0, X, 0, Y, "class A"
  0, 0, W, 0, "class B"

the non-zero attributes are explicitly identified by attribute number and their value stated, like this:

  @data
  {1 X, 3 Y, 4 "class A"}
  {2 W, 4 "class B"}

Each instance is surrounded by curly braces, and the format for each entry is <index> <space> <value>, where index is the attribute index, starting from 0.

Note that the omitted values in a sparse instance are 0, they are not missing values! If a value is unknown, you must explicitly represent it with a question mark (?).

Warning: There is a known problem saving SparseInstance objects from datasets that have string attributes. In Weka, string and nominal data values are stored as numbers; these numbers act as indexes into an array of possible attribute values (this is very efficient). However, the first string
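Putting the pieces together, a complete minimal sparse ARFF file (header plus sparse data section) could look like this; the relation and attribute names are invented for illustration:

```arff
@relation example-sparse

@attribute att1 numeric
@attribute att2 {X, W}
@attribute att3 numeric
@attribute class {A, B}

@data
{0 5, 1 X, 3 A}
{1 W, 2 2, 3 B}
```

In the first instance att2 is X and att3 is omitted, so att3 is 0, not missing; in the second instance att1 is the omitted (zero) value.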
[Screenshot: the Explorer's filter chooser showing the weka.filters.unsupervised.attribute package tree (AddCluster, AddExpression, AddNoise, Center, ChangeDateFormat, ClassAssigner, ClusterMembership, Copy, Discretize, FirstOrder, ...) next to the attribute statistics of the outlook attribute (sunny, overcast, rainy) on the weather data, with class play.]

The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose button. By clicking this button it is possible to select one of the filters in WEKA. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box with the left mouse button brings up a GenericObjectEditor dialog box. A click with the right mouse button (or Alt+Shift+left click) brings up a menu where you can choose either to display the properties in a GenericObjectEditor dialog box, or to copy the current setup string to the clipboard.

The GenericObjectEditor Dialog Box

The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the available options.

Right-clicking (or Alt+Shift+Left-Click) on such a field wi
uture performance by estimating expected utilities, such as classification accuracy. Cross-validation provides an out-of-sample evaluation method to facilitate this by repeatedly splitting the data in training and validation sets. A Bayesian network structure can be evaluated by estimating the network's parameters from the training set and the resulting Bayesian network's performance determined against the validation set. The average performance of the Bayesian network over the validation sets provides a metric for the quality of the network.

Cross-validation differs from local scoring metrics in that the quality of a network structure often cannot be decomposed in the scores of the individual nodes. So, the whole network needs to be considered in order to determine the score.

- fixed structure: Finally, there are a few methods so that a structure can be fixed, for example, by reading it from an XML BIF file.

For each of these areas, different search algorithms are implemented in Weka, such as hill climbing, simulated annealing and tabu search.

Once a good network structure is identified, the conditional probability tables for each of the variables can be estimated.

You can select a Bayes net classifier by clicking the classifier Choose button in the Weka explorer, experimenter or knowledge flow and find BayesNet under the weka.classifiers.bayes package (see below).

2 See http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/
utzip -W weka.experiment.ClassifierSplitEvaluator -- -W weka.classifi...

[Screenshot: the Setup tab with the Generator properties enabled. The splitEvaluator's classifier property has been selected and J48 -C 0.25 -M 2 added, so the list of schemes contains ZeroR, OneR -B 6 and J48 -C 0.25 -M 2, with data/iris.arff as dataset.]

The number of runs is set to 1 in the Setup tab in this example, so that only one run of cross-validation for each scheme and dataset is executed.

When this experiment is analysed, the following results are generated. Note that there are 30 (1 run times 10 folds times 3 schemes) result lines processed.

[Screenshot: the Analyse tab; the Paired T-Tester (corrected) has loaded 30 results and is analysing Percent_correct over 3 result sets (ZeroR, OneR, J48) at a significance of 0.05.]
valuator you will see its options. Note that there is an option for removing the class column from the data. In the Experimenter, the class column is set to be the last column by default. Turn this off if you want to keep this column in the data.

[Screenshot: the GenericObjectEditor for weka.experiment.DensityBasedClustererSplitEvaluator, with clusterer EM and removeClassColumn True.]

Once DensityBasedClustererSplitEvaluator has been selected, you will notice that the Generator properties have become disabled. Enable them again and expand splitEvaluator. Select the clusterer node.

[Screenshot: the Generator properties panel with the splitEvaluator's clusterer property selected.]

Now you will see that EM becomes the default clusterer and gets added to the list of schemes. You can now add/delete other clusterers.

IMPORTANT: in order to use any clusterer that does not produce density estimates (i.e. most other clusterers in Weka), it will have to be wrapped in the MakeDensityBasedClusterer.

[Screenshot: the Generator properties panel with MakeDensityBasedClusterer -M 1.0E-6 -W weka.clusterers.SimpleKMeans added to the list of schemes.]
      <value index="1">5.1</value>
      <value index="2">3.5</value>
      <value index="3">1.4</value>
      <value index="4">0.2</value>
      <value index="5">Iris-setosa</value>
    </instance>
    <instance type="sparse">
      <value index="1">4.9</value>
      <value index="2">3</value>
      <value index="3">1.4</value>
      <value index="4">0.2</value>
      <value index="5">Iris-setosa</value>
    </instance>
    ...
  </instances>

170 CHAPTER 10. XRFF

In contrast to the normal data format, each sparse instance tag contains a type attribute with the value sparse:

  <instance type="sparse">

And each value tag needs to specify the index attribute, which contains the 1-based index of this value:

  <value index="1">5.1</value>

10.4 Compression

Since the XML representation takes up considerably more space than the rather compact ARFF format, one can also compress the data via gzip. Weka automatically recognizes a file being gzip compressed, if the file's extension is .xrff.gz instead of .xrff.

The Weka Explorer, Experimenter and command line allow one to load/save compressed and uncompressed XRFF files (this applies also to ARFF files).

10.5 Useful features

In addition to all the features of the ARFF format, the XRFF format contains the following additional features:

- class attribute specification
- attribute weights

10.5.1 Class attr
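The gzip handling described above needs no special tooling on the Java side either; the standard library's java.util.zip classes suffice, as in this stand-alone round-trip sketch (the class name is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Gzip-compress a string (e.g. XRFF content) to a byte array.
    public static byte[] compress(String text) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress gzipped bytes back to the original string.
    public static String decompress(byte[] data) throws Exception {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

Writing the compressed bytes to a file named with the .xrff.gz extension is enough for Weka's automatic recognition described above.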
new Instances(data.attribute(4).relation(), 0);
  valuesRel = new double[dataRel.numAttributes()];
  valuesRel[0] = 2.34;
  valuesRel[1] = dataRel.attribute(1).indexOfValue("val_C");
  dataRel.add(new Instance(1.0, valuesRel));
  values[4] = data.attribute(4).addRelation(dataRel);

Finally, an Instance object is generated with the initialized double array and added to the dataset:

  Instance inst = new Instance(1.0, values);
  data.add(inst);

16.4 Randomizing data

Since learning algorithms can be prone to the order the data arrives in, randomizing (also called "shuffling") the data is a common approach to alleviate this problem. Especially repeated randomizations, e.g., as during cross-validation, help to generate more realistic statistics.

WEKA offers two possibilities for randomizing a dataset:

- Using the randomize(Random) method of the weka.core.Instances object containing the data itself. This method requires an instance of the java.util.Random class. How to correctly instantiate such an object is explained below.
- Using the Randomize filter (package weka.filters.unsupervised.instance). For more information on how to use filters, see section 16.5.

A very important aspect of Machine Learning experiments is that experiments have to be repeatable. Subsequent runs of the same experiment setup have to yield the exact same results. It may seem weird, but randomization is still possible in this scenario. Random n
want to obtain the source code of the book version, use this URL:

  https://svn.scms.waikato.ac.nz/svn/weka/branches/book2ndEd-branch/weka

18.3.3 JUnit

The latest version of Weka's JUnit tests can be obtained with this URL:

  https://svn.scms.waikato.ac.nz/svn/weka/trunk/tests

And if you want to obtain the JUnit tests of the book version, use this URL:

  https://svn.scms.waikato.ac.nz/svn/weka/branches/book2ndEd-branch/tests

18.3.4 Specific version

Whenever a release of Weka is generated, the repository gets tagged:

- dev-X-Y-Z -- the tag for a release of the developer version, e.g., dev-3-7-0 for Weka 3.7.0:

  https://svn.scms.waikato.ac.nz/svn/weka/tags/dev-3-7-0

- stable-X-Y-Z -- the tag for a release of the book version, e.g., stable-3-4-15 for Weka 3.4.15:

  https://svn.scms.waikato.ac.nz/svn/weka/tags/stable-3-4-15

18.3.5 Clients

Commandline

Modern Linux distributions already come with Subversion either pre-installed or easily installed via the package manager of the distribution. If that shouldn't be the case, or if you are using Windows, you have to download the appropriate client from the Subversion homepage (http://subversion.tigris.org/).

A checkout of the current developer version of Weka looks like this:

  svn co https://svn.scms.waikato.ac.nz/svn/weka/trunk/weka

SmartSVN

SmartSVN (http://smartsvn.com/) is a Java-based, graphical, cross-platform client for Subversion. Though it is not open source f
public boolean batchFinished() throws Exception {
      if (getInputFormat() == null)
        throw new NullPointerException("No input instance format defined");

      Instances inst = getInputFormat();
      Instances outFormat = getOutputFormat();
      for (int i = 0; i < inst.numInstances(); i++) {
        double[] newValues = new double[outFormat.numAttributes()];
        double[] oldValues = inst.instance(i).toDoubleArray();
        System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
        newValues[newValues.length - 1] = i;
        push(new Instance(1.0, newValues));
      }

      flushInput();
      m_NewBatch = true;
      m_FirstBatchDone = true;

      return (numPendingOutput() != 0);
    }

    public static void main(String[] args) {
      runFilter(new BatchFilter(), args);
    }

BatchFilter2

In contrast to the first batch filter, this one cannot determine the output format immediately: the number of instances in the first batch is now part of the attribute name. The output format is therefore determined in the batchFinished() method.

    import weka.core.*;
    import weka.core.Capabilities.*;

    public class BatchFilter2 extends Filter {

      public String globalInfo() {
        return   "A batch filter that adds an additional attribute 'blah' at the"
               + " end containing the index of the processed instance. The output"
               + " format cannot be collected immediately.";
      }

      public Capabilities getCapabilities() {
        Capabilities result = super.getCapabilities();
        result.enableAllAttributes();
        result.enableAllClasses();
        result.enable(Capability.NO_CLASS);
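The deferred-output-format idea behind BatchFilter2 can be shown in isolation. The sketch below is not WEKA code: the class name, the buffer field, and the input()/batchFinished() methods merely imitate the roles of the corresponding Filter methods, and the "blah-" plus batch-size naming scheme follows the description above. The point is that the attribute name depends on how many instances arrived in the first batch, so it can only be fixed once the batch is complete.

```java
import java.util.ArrayList;
import java.util.List;

// WEKA-free sketch: the output format (here just an attribute name) depends
// on the size of the first batch, so it is only determined in batchFinished().
public class DeferredFormatSketch {
    private final List<double[]> buffer = new ArrayList<>();
    private String outputAttributeName;   // null until the first batch is done

    // Plays the role of input(Instance): instances are only buffered,
    // because the output format is not known yet.
    public void input(double[] instance) {
        buffer.add(instance);
    }

    // Plays the role of batchFinished(): now the batch size is known, so the
    // name of the appended index attribute can finally be fixed, and every
    // buffered instance is extended by its index, as in BatchFilter.
    public List<double[]> batchFinished() {
        outputAttributeName = "blah-" + buffer.size();
        List<double[]> out = new ArrayList<>();
        for (int i = 0; i < buffer.size(); i++) {
            double[] oldValues = buffer.get(i);
            double[] newValues = new double[oldValues.length + 1];
            System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
            newValues[newValues.length - 1] = i;  // appended instance index
            out.add(newValues);
        }
        return out;
    }

    public String getOutputAttributeName() {
        return outputAttributeName;
    }
}
```

In the real filter the same two-phase structure applies, except that buffering is handled by bufferInput()/flushInput() and the format is registered via setOutputFormat() inside batchFinished().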
maxheap placeholder.

For more information check out the comments in the INI file.

18.2.3 java -jar

When you are using the Java interpreter with the -jar option, be aware of the fact that it overwrites your CLASSPATH rather than augmenting it. Out of convenience, people often use only the -jar option to skip the declaration of the main class to start. But as soon as you need more jars, e.g., for database access, you need to use the -classpath option and specify the main class.

Here is once again how you start the Weka Main GUI with your current CLASSPATH variable (and 128MB for the JVM):

• Linux

    java -Xmx128m -classpath $CLASSPATH:weka.jar weka.gui.Main

• Win32

    java -Xmx128m -classpath "%CLASSPATH%;weka.jar" weka.gui.Main

18.3 Subversion

18.3.1 General

The Weka Subversion repository is accessible and browseable via the following URL:

    https://svn.scms.waikato.ac.nz/svn/weka/

A Subversion repository usually has the following layout:

    root
     |
     +- trunk
     |
     +- tags
     |
     +- branches

where trunk contains the main trunk of the development, tags contains snapshots in time of the repository (e.g., when a new version got released), and branches contains development branches that forked off the main trunk at some stage (e.g., legacy versions that still get bugfixed).

18.3.2 Source code

The latest version of the Weka source code can be obtained with this URL:

    https://svn.scms.waikato.ac.nz/svn/weka/trunk/weka

If you
    