
CART® for Windows


Contents

1. (Chapter 2: Reading Data) SS&NUMBERS is unacceptable: "&" is not a letter, number, or underscore. This character will be replaced with an underscore. Character variable names are required to end in an additional "$"; if a character variable name does not end with it, it will be added by DATABASE CONVERSION. Numeric variables may optionally have subscripts from 0 to 99, but CART does not use them in any special way: CREDIT(1) is OK; SCORE(99) is OK; ARRAY(0) is OK; ARRAY(100), X(, and X(1,2) are unacceptable, and the parentheses will be replaced with underscores. When using raw ASCII text input data, CART does not check for or alter duplicate variable names in your dataset.

Reading Excel Files. We have found that many users like to use Excel files. Excel files are easily accessible in mode 2 using DATABASE CONVERSION drivers; however, care must be exercised when doing this. Make sure that the following requirements are met: the Excel file must contain only a single data sheet (no charts, macros, or other items are allowed); the Excel data format currently limits the number of variables to 256 and the number of records to 65,535; and the Excel file must not be currently open in Excel, otherwise the op…
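The renaming rules above can be sketched in Python. The exact legal-character set and the "$" suffix handling are assumptions based on the description, not CART's actual converter:

```python
import re

def legalize_name(name, is_character=False):
    # Replace anything that is not a letter, digit, or underscore with
    # an underscore ("$" is kept for the character-variable suffix, an
    # assumption about the naming convention described above).
    clean = re.sub(r"[^A-Za-z0-9_$]", "_", name)
    if is_character and not clean.endswith("$"):
        clean += "$"          # character names must end in "$"
    return clean

def subscript_ok(name):
    # Numeric subscripts must fall in the 0-99 range, e.g. CREDIT(1).
    m = re.fullmatch(r"\w+\((\d+)\)", name)
    return bool(m) and 0 <= int(m.group(1)) <= 99
```

For example, `legalize_name("SS&NUMBERS")` yields `SS_NUMBERS`, matching the replacement rule quoted above.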
2. In the screen above we have set both the Missing Values and the HLC penalties to the frequently useful value of 1.00. Advanced users wishing control over the missing value and high-level categorical penalty details can click the Advanced button.

Penalties on Variables. The penalty specified is the amount by which the variable's improvement score is reduced before deciding on the best splitter in a node. Imposing a 0.10 penalty on a variable will reduce its improvement score by 10%. You can think of the penalty as a handicap: with a 0.10 penalty we are saying that the penalized variable must be at least 10% better than any other variable to qualify as the splitter.

Penalties may be placed to reflect how costly it is to acquire data. For example, in database and targeted marketing, selected data may be available only by purchase from specialized vendors. By penalizing such variables we make it more difficult for them to enter the tree, but they will enter when they are considerably better than any alternative predictor. Predictor-specific penalties have been used effectively in medical diagnosis and triage models. Predictors that are expensive because they require costly diagnostics such as CT scans, or that can only be obtained after a long wait (say, 48 hours for lab results), or that involve procedures that are unpleasant for the patie…
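The handicap arithmetic above is simple enough to state directly. This sketch (not Salford code; the variable names are invented) shows how a 0.10 penalty forces a variable to beat its rivals by 10%:

```python
def penalized_improvement(improvement, penalty):
    # A penalty of 0.10 cuts the variable's improvement score by 10%.
    return improvement * (1.0 - penalty)

# Candidate splitters: (raw improvement, penalty) per variable.
candidates = {"PURCHASED_SCORE": (0.50, 0.10), "AGE": (0.46, 0.0)}

# The purchased variable's raw score is higher (0.50 vs 0.46), but
# after the 10% handicap (0.45) the free variable wins the split.
best = max(candidates, key=lambda v: penalized_improvement(*candidates[v]))
```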
3. (Chapter 4: Classification Trees) Note the bottom portion of the window that specifies "Files of type" and the "ASCII Delimited (csv, dat, txt)" description. If you see a different type of file selected in your window, click the pull-down arrow and select the ASCII file type to see the file we need. Selecting HOSLEM.CSV will bring up the activity screen, which provides some basic information about your file, lists the available variables, and allows you to jump to several other activities.

[screenshot: activity window for HOSLEM.CSV, located in D:\Salford SoftDev\Manuals\CART v6\Examples, modified Friday, December 10, 1999, 1:47:52 PM, showing data records, variable counts, and View Data options]

You can always bring this activity window up by clicking on the icon on your toolbar. Definitions of the variables are given below:
- Birth weight less than 2500 grams, coded 1 if < 2500, 0 otherwise
- Mother's age
- Number of first-trimester physician visits
- History of hypertension, coded 1 if present, 0 otherwise
- Mother's weight at last menstrual period less than 110 lbs, coded 1 if < 110, 0 otherwise
- Occurrence of pre-term labor, coded 1 if present, 0 otherwise
- Mother's ethnicity, coded 1, 2, or 3
- Smoking during pregnancy, coded 1 if smoked, 0 otherwise
- Uterine…
4. (Chapter 7: Scoring and Translating) This chapter provides instructions for the steps required to internally and externally apply models to new data.

Scoring and Translating Models. No predictive modeling process would be complete without the ability to apply your models to new data. CART 6 offers two ways to do this: internally, by using CART's built-in scoring facility, or externally, by first translating your models into any of the supported languages (SAS-compatible, C, or PMML). This section describes how to use the internal SCORE command to predict a target variable using either new data or the old learn data.

The process of using a CART tree to predict a target variable is known as "dropping data down a tree," or scoring data. Each observation is processed case by case, beginning at the root node. The splitting criteria are applied, and in response to each yes/no question the case moves left or right down the tree until it reaches a terminal node. If the primary split criterion cannot be applied because the case is missing data, a surrogate split criterion is used. If no surrogates are available, the case moves down the tree with the priors-adjusted majority.

In CART 6, unlike previous versions of CART, you may score any tree from the pruning sequence without any extra steps involved. Because of the new mechanism, the SELECT command and the Select Tree menu item are no lo…
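The routing logic described above (primary splitter, then surrogates, then a default direction) can be sketched as follows. The node layout, with dict keys such as `split`, `cutoff`, `surrogates`, and `majority`, is invented for illustration and is not CART's internal format:

```python
def drop_case(case, node):
    # Route one record from the root node down to a terminal node.
    # Terminal nodes carry a "class" key; internal nodes carry the
    # primary splitter plus an ordered list of surrogate splitters.
    while "class" not in node:
        rules = [(node["split"], node["cutoff"])] + node.get("surrogates", [])
        value, cutoff = None, None
        for var, cut in rules:          # primary first, then surrogates
            if case.get(var) is not None:
                value, cutoff = case[var], cut
                break
        if value is None:               # nothing usable: default direction
            node = node[node.get("majority", "left")]
        else:
            node = node["left"] if value <= cutoff else node["right"]
    return node["class"]

# A toy two-leaf tree: split on X <= 5, surrogate on Z <= 1.
leaf_a, leaf_b = {"class": "A"}, {"class": "B"}
root = {"split": "X", "cutoff": 5, "surrogates": [("Z", 1)],
        "majority": "right", "left": leaf_a, "right": leaf_b}
```

A case missing X is routed by the surrogate Z; a case missing both moves with the majority direction.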
5. (Chapter 11: CART Segmentation) The root node split on the variable ANYRAQT, a binary indicator variable coded 1 if the member uses the racquetball courts and 0 otherwise. Members who do not use the racquetball courts go to the left (non-terminal) node, while those who use the courts go to the right (terminal) node.

To see more or less detail when hovering over a node, activate a local menu by clicking the right mouse button on the background, or select Node Display from the View menu, and then select the level of detail you prefer. You can elect to see the splitting variable name, the splitting criterion, the class assignment, the class breakdown (counts and percentages), and the number of cases in the node.

If we select the most detailed node report and hover the mouse pointer over the terminal node on the far right (terminal node 7), we can see a very good split: 82 of the original 95 Class 1 cases and none of the Class 2 or 3 cases appear in this node. Thus, based on the first split only, we already know something about these particular members: if ANYRAQT = 1, then SEGMENT = 1. Similarly, the terminal node on the far left side (terminal node 2) shows that after four splits CART is able to separate 71 of the original 98 Class 3 cases into another pure node.

Viewing the Main Splitters. You can quickly scan all main splitters in the entire tree by clicking on the Split…
6. Every target class must have at least as many records as the number of folds in the cross-validation. Otherwise the process breaks down, an error message is reported, and a "No Tree Built" situation occurs. This means that if your data set contains only nine YES records in a YES/NO problem, you cannot run more than nine-fold cross-validation. Modelers usually run into this problem when dealing with, say, a three-class target where two of the classes have many records and one class is very small. In such situations, consider either eliminating the rare-class cases from the dataset or merging them into a larger class. (Chapter 4: Classification Trees)

If your data set has more than 3,000 records and you select cross-validation as your testing method, a dialog will automatically open informing you that you must increase the setting for the maximum number of observations in the learning data set with cross-validation in the Model Setup Advanced tab. This warning is intended to prevent you from inadvertently using cross-validation on larger data sets and thus growing eleven trees instead of just one. To raise the threshold, adjust the value in the Data Set Size Warning Limit for Cross-Validation dialog ("Warn if the number of observations in learning data set for cross-validation exceeds 3000"). The advent of the Pentium IV class of CPUs has made run times so short that you can now comfortably run cross-validation on much larger data…
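The fold constraint above can be checked before launching a run. A minimal sketch: the fold cap is simply the smallest class count.

```python
from collections import Counter

def max_usable_folds(target_values):
    # Every class must supply at least one record per fold, so the
    # rarest class caps the number of cross-validation folds.
    return min(Counter(target_values).values())

# Nine YES records in a YES/NO problem: at most nine-fold CV.
folds = max_usable_folds(["YES"] * 9 + ["NO"] * 100)
```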
7. (Appendix III: Command Reference)

ADJUST
Purpose: The ADJUST command facilitates resizing of critical memory-management parameters. The command syntax is:
  ADJUST LEARN=<n>, TEST=<n>, ATOM=<n>, DEPTH=<n>, NODES=<n>, SUBSAMPLE=<n>
All parameters entered but one should be followed by <n> values. The one parameter on the ADJUST command NOT given a fixed value will be automatically adjusted to attempt to fit the problem into the available workspace. Examples:
  ADJUST ATOM=20, DEPTH=8, LEARN
  ADJUST LEARN=500, NODES
  ADJUST DEPTH

AUXILIARY
Purpose: The AUXILIARY command specifies variables, either in the model or not, for which node-specific statistics are to be computed. For continuous variables, statistics such as N, mean, min, max, sum, SD, and percent missing may be computed; which statistics are actually computed is specified with the DESCRIPTIVE command. For discrete (categorical) variables, frequency tables are produced showing the seven most prevalent categories. The command syntax is:
  AUXILIARY <variable>, <variable>, ...
Example:
  AUXILIARY ONAER, NSUPPS, OFFAER
Variable groups may be used in the AUXILIARY command similarly to variable names.

BATTERY
Purpose: Results are saved into the grove file. The BATTERY command generates a group of models by varying on…
8. (Appendix II: Errors and Warnings)
Error 4: INCORRECT FILE ASSIGNMENT, NO ASSIGNMENT MADE. The OS was not able to open your file. Check your USE, GROVE, and INCLUDE commands. Also make sure that none of the files involved is held by another application.
Error 5: ILLEGAL VALUES FOR SUBSCRIPT. Array variables are limited to 99 elements; anything beyond that will trigger this error message.
Error 8: YOU ARE TRYING TO PROCESS THE WRONG KIND OF DATA. Check that your data file has the right format and is not corrupted.
Error 10: YOU ARE TRYING TO READ AN EMPTY OR NONEXISTENT FILE, OR YOUR FILE IS IN A DIFFERENT DIRECTORY. CART is not able to open one of the files. Check your USE, GROVE, and SUBMIT commands for possible errors. Also make sure that none of the files involved is held by another application.
Error 12: UNEXPECTED END OF FILE ENCOUNTERED. The file you are reading from is corrupt. Try another version of the same file or consider using another data format.
Error 13: YOU HAVE NOT GIVEN AN INPUT FILE WITH THE USE COMMAND. See the USE command in the command reference.
Error 14: YOU CANNOT HAVE MORE THAN FIVE NESTED INCLUDE FILES. The CART command parser allows no more than five nested INCLUDE statements. Consider rearranging your scripts into fewer layers.
Error 24: TEMPORARY FILE CREATE FAILED. CART creates temporary files, needed for its work, in a dedicated folder. Check that there is enough space in the temporary fol…
9. In addition, CART allows variables that have missing values to be penalized. The amount of penalty is usually proportional to the percent missingness, thus discouraging variables with heavy missingness from becoming part of the model (Model Setup, Penalty tab). This proliferation of controls over missing-value handling in CART essentially leads us to support a whole new kind of battery: battery MVI. Currently the battery offers a series of five runs with the most interesting combinations of missing-value settings. We illustrate this battery using FNCELLA.CSV, the cell phone dataset (see the MVI.CMD command file for details). (Chapter 10: CART Batteries)

[screenshot: Battery Summary window, showing model MVI_No_P with 24 terminal nodes and relative error 0.540, plus chart, model quality, sample, and model size controls]

The following five models are defined:
- MVI_No_P: use regular predictors, missing value indicators, and no missing value penalties
- No_MVI_No_P: use regular predictors only (the default CART model, no MVIs, no penalties)
- MVI_only: use missing value indicat…
10. (Chapter 11: CART Segmentation) Check the "Save results to a file" checkbox and specify the output data set name. Choose the tree you want to apply by pressing the Select button in the Sub-tree section; by default CART offers the optimal tree. Set the target, weight, and ID variables when applicable. Press OK. The output data set will contain new variables added by CART, including the node assignment, class assignment, and predicted probabilities for each case.

[screenshot: Score Data dialog, showing the data file and grove file selections, tree Tree_1_Main, "Save results to a file" with an output file, options for single precision, model information, path indicators, and predicted probabilities; sub-tree sequence tree no. 4, 7 nodes, relative cost 0.112057 (optimal tree); a Target, Weight and ID Variables section with SMALLBUS selected and up to 50 ID variables]

The topics of scoring and translating models are discussed in greater detail in the chapter titled Scoring and Translating.

New Analysis. To build another tree using the same data set, select Construct Model from the…
11. [screenshot: Battery Summary, models 1 through 8 with optimal terminal nodes and relative error, plus Model Quality (Misclass, ROC), Sample (Test, Learn), and Model Size (Min Cost, 1SE) controls]

As the results indicate, the effect of sampling the learn data alone produces relative errors between 0.1573 and 0.2665. (Available in CART 6.0 Pro and Pro EX.)

Battery FLIP. Battery FLIP generates two runs with the meaning of the learn and test samples flipped. The user has to specify the test sample explicitly using the Testing tab in the Model Setup window. We illustrate the use of this battery on the SPAMBASE.CSV dataset (see the FLIP.CMD command file for details). (Chapter 10: CART Batteries)

[screenshot: Battery Summary, relative error versus number of nodes (2 through 36), with All/None/Average/Min/Max chart controls and a legend]

(Available in CART 6.0 Pro EX.)

Battery KEEP. Battery KEEP randomly selects a specified number of variables from the initial list of predictors (controlled by the KEEP command) and repeats the random selection multiple times. A user has the option of sp…
12. (Chapter 6: Ensemble Models and Committees of Experts) The multi-tree methods: Bootstrap Aggregation and ARCing.

Building an Ensemble of Trees. Researchers began exploring the potential value of building multiple trees around 1990. The core idea was that if one tree is good, then maybe several trees would be even better. The best known of the straightforward ensembles is Leo Breiman's bagger, which is the main topic of this chapter. Subsequently, Breiman also introduced Random Forests, now available as a separate Salford Systems module. Simple ensembles generate predictions by averaging the outputs of independently built models. The more complex method of boosting builds a sequence of trees, with each new tree attempting to repair the errors made by its predecessor trees. Boosting was first introduced by Freund and Schapire (1996), who showed how a three-tree model could outperform a single tree. Later, a number of researchers explored the boosting of many trees. Leo Breiman (1996) observed that a simple modification to the bagger would yield a method very similar to boosting; that method, ARCing, is also discussed briefly in this chapter. A newer and far more powerful form of boosting is available in the Salford Systems TreeNet module.

Bootstrap Aggregation and ARCing. In addition to growing a classification or regression tree, you may switch to either bootstrap aggregation (bagging) or…
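Bootstrap aggregation as described above can be sketched in a few lines. Here `grow_tree` and `predict` are placeholder callbacks standing in for the actual tree machinery; the point is the resample-and-vote loop:

```python
import random
from collections import Counter

def bag_predict(learn, grow_tree, predict, case, n_trees=10, seed=17):
    # Grow each tree on a with-replacement bootstrap resample of the
    # learn data, then classify by majority vote across the trees.
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_trees):
        boot = [rng.choice(learn) for _ in learn]   # bootstrap resample
        votes[predict(grow_tree(boot), case)] += 1
    return votes.most_common(1)[0][0]

# Degenerate "tree": its model is just the sample's majority label.
grow = lambda sample: Counter(sample).most_common(1)[0][0]
pred = lambda model, case: model
label = bag_predict(["Y"] * 30 + ["N"] * 2, grow, pred, None)
```

For a regression target, the vote would be replaced by averaging the trees' numeric predictions.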
13. [screenshot: Advanced penalty controls with a Fraction setting of 1.00]

Suppose you want to penalize a variable with 70% missing data very heavily, while barely penalizing a variable with only 10% missing data. The Advanced tab lets you do this by setting a fractional power on the percent of good data. For example, using the square root of the fraction of good data to calculate the improvement factor would give the first variable (70% missing) a 55% factor and the second variable (10% missing) a 95% factor. The expression used to scale improvement scores is:

S_adjusted = S x a x (proportion_not_missing)^b

The default settings of a = 1, b = 0 disable the penalty entirely: every variable receives a factor of 1.0. Useful penalty settings set a = 1 with b = 1.00 or 0.50. The closer b gets to 0, the smaller the penalty. The fraction of the improvement kept for a variable is illustrated in the following table, where "good" is the fraction of observations with non-missing data for the predictor. (Chapter 4: Classification Trees)

good | b = .75     | b = .50
0.9  | 0.92402108  | 0.948683298
0.8  | 0.84589701  | 0.894427191
0.7  | 0.76528558  | 0.836660027
0.6  | 0.68173162  | 0.774596669
0.5  | 0.59460355  | 0.707106781
0.4  | 0.50297337  | 0.632455532
0.3  | 0.40536004  | 0.547722558
0.2  | 0.29906975  | 0.447213595
0.1  | 0.17782794  | 0.316227766

Looking at the bottom row of this table, we see that if a variable is good in only 10% of the data, it would receive 10% credit if b = 1, 17.78% credit if b = .75, and 31.62% credit if…
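The scaling expression is easy to verify numerically. A sketch reproducing the bottom row of the table above:

```python
def improvement_factor(good, a=1.0, b=0.0):
    # Fraction of the raw improvement retained for a predictor, where
    # `good` is the proportion of non-missing observations; the defaults
    # (a=1, b=0) disable the penalty so every variable keeps 100%.
    return a * good ** b

# Bottom row of the table: a variable good in only 10% of the data
# keeps 10% of its improvement at b=1, 17.78% at b=0.75, and
# 31.62% at b=0.50.
row = [round(improvement_factor(0.1, b=b), 4) for b in (1.0, 0.75, 0.5)]
```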
14. [screenshot: Model Setup, Force Split / Constraints tab, with a Splitter Variable Disallow Criteria grid listing the variables ANYRAQT, ONAER, NSUPPS, OFFAER, NFAMMEM, TANNING, ANYPOOL, SMALLBUS, FIT, HOME, PERSTRN, and CLASSES, per-group Min Cases / Max Cases columns, and Disallow Split Region sliders for "Split Disallowed Above Depth" and "Split Disallowed At Or Below Depth", applying to primary and surrogate splitters]

Next we use the slider controls in the Disallow Split Region to specify the depths above and below which our two groups will be allowed in the tree. For group 1, we use the right slider control to disallow splits below the depth of 4. For group 2, we use the left slider to disallow splits above the depth of 4. In other words, the group 1 consumer variables should only be split in the top portion of the tree, while the group 2 product variables should only be found in the lower portions of the tree. The resulting setup looks as follows.

[screenshot: Model Setup, Force Split / Constraints tab with the Disallow Split Region settings described above]
15. The hot spot table reports, for each node:
- Spot: sequential hotspot identifier
- Depth: depth level of the node in the tree
- Weight Node Learning Count: weighted node size on the train data
- Weight Node Test Count: weighted node size on the test data
- Focus Class Learning Count: number of focus-class records in the node on the train data
- Weight Focus Class Learning Count: same as above, but weighted
- Focus Class Test Count: number of focus-class records in the node on the test data
- Weight Focus Class Test Count: same as above, but weighted

(Chapter 9: Hot Spot Detection) You can change the default sorting method of the nodes using the Sorting button in the Edit Spread group, or introduce your own filtering conditions using the Filtering button in the same group. The lower Details part of the table contains additional information on each terminal node, covering not only the focus class but also all the remaining classes. According to the table, Node 12 of Tree 1 has 100% test richness but only 31 cases. Node 14 of the same tree is 97.6% rich on a much larger set of 451 test cases. An even larger node (706 test cases) is found in Tree 11, which has a reduced richness of 92.5%. You can double-click on any of the nodes to bring up the corresponding navigator window. Pressing the Show button brings up the Hotspot Chart window.

[screenshot: Hotspot Chart 1, target TARGET, class 1, from Battery Summary 1, showing a scatter of lift versus node for the focus clas…]
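Richness-based ranking of the kind discussed above can be sketched like this. The node tuples (tree, node, focus-class test count, total test count) are an invented layout, and the focus counts for the 97.6% and 92.5% nodes are illustrative back-calculations, not values taken from the manual:

```python
def rank_hotspots(nodes, min_cases=0):
    # Sort terminal nodes by focus-class richness (focus share of the
    # node's test cases), breaking ties by node size.
    rows = [(focus / total, total, tree, node)
            for tree, node, focus, total in nodes
            if total >= min_cases]
    return sorted(rows, reverse=True)

# (tree, node, focus-class test count, total test count)
nodes = [(1, 12, 31, 31),     # 100% rich, but only 31 cases
         (1, 14, 440, 451),   # ~97.6% rich (440 is a back-calculation)
         (11, 5, 653, 706)]   # ~92.5% rich (653 is a back-calculation)
top = rank_hotspots(nodes)
```

The `min_cases` filter mirrors the table's Filtering button: raising it trades richness for node size.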
16. Translate: translate a model into computer code. Score: score data, i.e., use a model to make predictions.

Keyboard Shortcuts. The standard Windows keyboard conventions can also be used to activate menu selections. For example, pressing <ALT-F> will activate the File menu, because the F in the File menu is underlined. You can also use the keyboard to activate frequently used menu commands; the keyboard equivalents for these commands appear on the pull-down menus after the command names.

Opening a File. To open the GOODBAD.CSV file, select Open > Data File from the File menu, or click on the toolbar icon. Note that you can reset the default input and output directories: select Options from the Edit menu and select the Directories tab. In the Open Data File dialog, first navigate to the CART 6.0 Sample Data directory, then select the GOODBAD.CSV file and click Open, or double-click the file name. (CART Basics)

As illustrated below, "Delimited Text (csv, dat, txt)" must be selected in the Files of Type box to see files ending with the CSV extension.

[list of supported file types: Access (mdb), ASCII Delimited (csv, dat, txt), dBASE and compatible (dbf), Epi Info (rec), Excel (xls), FoxPro (dbf), Gauss (dat), HTML (htm, html), JMP (jmp), LIMDEP (lpj), …]

You may see a slightly different list of files in your directory. When you open GOODBAD, the Activity dialog opens automatically as s…
17. (Introducing CART 6.0) MVIs allow formal testing of the core predictive value of knowing that a field is missing. One of the models CART 6.0 will generate for you automatically is a model using only missing value indicators as predictors. In some circumstances such a simple model can be very accurate, and it is important to be aware of this predictive power. Other analyses explore the benefits of imposing penalties on variables that are frequently missing.

Modeling Automation: Batteries. Most modelers conduct a variety of experiments, trying different model control parameters in an effort to find the best settings. This is done for any method that has a number of control settings that can materially affect performance outcomes. In our training courses we have regularly recommended conducting such experiments via our scripting language, and have shown students how to set up such experiments for the most important controls. In CART 6.0 we have made the process easier yet by packaging our recommended batteries of models into batches that the modeler can request with a mouse click. CART Pro includes a core set of batteries, including batteries for ATOM, MINCHILD, MVI (Missing Value Indicators), and tree-growing methods (RULES). Cross-validation can now be repeated with different random number seeds (CVR), and the results can be averaged over a set of CV experiments. See the relevant section in…
18. [screenshot: navigator showing the best ROC tree with 10 nodes, ROC train 0.8474, ROC test 0.7867, and Displays and Reports controls for Save Model, Learn, Splitters, Tree Details, Summary Reports, Commands, Grove, Translate, and Score]

As illustrated below, the Summary Reports dialog contains gains charts, terminal node counts, variable importance measures, misclassification tables, and prediction success tables, as well as a report on the root node splitters and a Profit tab.

Gains Chart (Cumulative Accuracy Profile). The summary report initially displayed is the Gains Chart tab, also known in credit risk as the Cumulative Accuracy Profile (CAP) chart. Gains charts are always tied to a specific level of the target variable, which we also call the Focus class. If your Gains chart appears with the wrong focus class, just select the one you want from the pull-down menu in the lower right portion of the tab. Because we assigned class names, the class we are interested in is now listed as BAD instead of 1. (CART Basics)

[screenshot: Navigator Summary Reports, Gains Chart tab, with columns for node, cases and % of node in the target class, cumulative target-class percentages, and population %, a "Show Perfect Model" option, and ROC train 0.8474]
19. CART1101173521.TXT refers to the CART session that finished on November 1st at 5:35:21 pm. This serves as a complete audit trail of your work with the CART application. Also note that renaming a log file to .CMD and subsequently submitting it (File > Submit Command File) in a new CART session will essentially reproduce the entire previous CART session. There is no limit to the number of session command logs saved to the CART temporary files folder; we suggest that you regularly clean up this folder by deleting obsolete files.

(Chapter 12: Features and Options) This chapter provides an orientation to the features and options not covered in the previous chapters, as well as a description of CART's more advanced options. If any terms or concepts are new to you, please consult the main reference manual.

Unsupervised Learning and Cluster Analysis. CART in its classification role is an excellent example of supervised learning: you cannot start a CART classification analysis without first selecting a target (or dependent) variable. All partitioning of the data into homogeneous segments is guided by the primary objective of separating the target classes. If the terminal nodes are sufficiently pure in a single target class, the analysis will be considered successful even if two…
20. (Chapter 11: CART Segmentation) Competitors and Surrogates.

[screenshot: Classification Rules for node 5, reading "if ANYRAQT = 0 and FIT > 3.45388", with Classic/SQL notation options and Learn/Test/Pooled probability buttons]

Terminal node reports (with the exception of the root node) contain a Rules dialog that displays the rules for the selected node and/or sub-tree. For example, to view the rules for Node 5, click on the node and select the Rules tab from the Node 5 report dialog. The rules for this node, displayed above, indicate that cases meeting the two specified criteria are classified as Class 2. To also view learn or test within-node probabilities, click Learn or Test; click Pooled to view the combined learn and test probabilities. The rules are formatted as C-compatible code to facilitate applying new data to CART models in other applications. The rule set can be exported as a text file, cut and pasted into another application, and/or sent to the printer. This topic is discussed further below in the section titled Displaying and Exporting Tree Rules.

Terminal Node Report. To view node-specific information for a terminal (red) node, click on the terminal node, or right-click and select Node Report. A frequency distribution for the classes in the terminal node is displayed as a bar graph or, optionally, a pie chart, as shown below for the left-most terminal node (Terminal Node 2): summary node information, class as…
21. (Appendix I: Command Line / Menu Equivalents)
…: Model > Construct Model > Categorical
CVLEARN: Model > Construct Model > Advanced
PAGEBREAK: command line only
NODEBREAK: command line only
COPIOUS: Edit > Options > CART
BRIEF: Edit > Options > CART
OPTIONS: command line only
IMPORTANCE: command line only
QUICKPRUNE: command line only
DIAGREPORT: command line only
HLC: command line only
PROGRESS: command line only
MISSING: Model > Construct Model > Advanced
MREPORT: command line only
VARDEF: command line only
CVS: command line only
PLC: command line only
BUILD: Model > Run CART
CATEGORY: Model > Construct Model > Model
CDF: command line only
CLASS: Model > Construct Model > Categorical > Set Class Names
COMBINE: Model > Construct Model > Combine
DATAINFO: View > Data Info
DESCRIPTIVE: command line only
ECHO: File > Log Results To
ERROR: Model > Construct Model > Testing
EXCLUDE: command line only
FORCE: Model > Construct Model > Force Split
FORMAT: Edit > Options > Reporting
FPATH: command line only
GROVE: Model > Score Data; Model > Translate Model
HARVEST: command line only
HELP: Help > CART Help
HISTOGRAM: command line only
IDVAR: Model > Score Data
KEEP: Model > Construct Model > Model
LIMIT: Model > Construct Model > Advanced
LINEAR: Model > Construct Model > Method
EXHAUSTIVE, LCLIST, LOPTIONS, MEANS, TIMING, NOPRINT, PREDICTION_SUCCESS, GAINS, ROC, PS, UNS, UNR, PLOTS, DBMSCOPY, STATTRAN, MEMO, MEMORY, METHOD, MISCLASS, MODEL: …
22. [screenshot: Battery Summary for battery SAMPLE, showing model SAMPLE_Full with 29 terminal nodes and relative error 0.185, plus chart, model quality, sample, and model size controls]

The battery's models by learn-sample fraction: relative error 0.1952 with 3/4 of the learn sample; 0.2220 with half (1,828 learn / 943 test cases); 0.2582 with 1/4 (914 learn / 943 test); and 0.2665 with 1/8 (457 learn / 943 test). Apparently, minor accuracy loss occurs when going from the full sample to 3/4 of the data; however, the loss becomes substantial when half or more of the data are eliminated. (Available in CART 6.0 Pro and Pro EX.)

Battery SHAVING. Battery SHAVING was inspired by conventional step-wise regression modeling techniques. The key idea is to build a model, study the reported variable importance, and proceed by eliminating one variable or a group of variables based on a specified strategy. The following shaving strategies are currently available (assuming K starting variables):
- BOTTOM: remove the least important variables (up to K runs)
- TOP: remove the most important variables (up to K runs)
- ERROR: remove the variable with the least contribution, based on the LOVO battery (see above) applied to the current set of variables (up to K runs)…
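The BOTTOM strategy amounts to a loop that refits and drops the weakest variable. A sketch with a stand-in `fit_and_rank` callback (in a real run this step would grow a CART model and read its variable importance ranking):

```python
def bottom_shave(predictors, fit_and_rank, rounds):
    # At each step, rank the surviving predictors from most to least
    # important, then shave off the least important one.
    history, current = [], list(predictors)
    for _ in range(min(rounds, len(current) - 1)):
        ranked = fit_and_rank(current)   # most important first
        current = ranked[:-1]            # drop the weakest variable
        history.append(list(current))
    return history

# Stub ranking: pretend alphabetical order is importance order.
hist = bottom_shave(["c", "a", "b"], sorted, rounds=2)
```

The TOP strategy is the same loop with `ranked[1:]` in place of `ranked[:-1]`.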
23. [screenshot: navigator with Displays and Reports controls for Model, Splitters, Tree Details, Summary Reports, Translate, and Score]

Next we request an overall Rules display, either via the View > Rules menu or by right-clicking on the root node and choosing the Rules item. The resulting window contains rules for the entire tree when All is pressed, or only for the tagged terminal nodes when Tagged is pressed. Both Classic and SQL rule notations are supported. You can also limit the rules display to a specific branch in the tree by right-clicking on the branch root and choosing the Rules item; the resulting window will only list rules for the terminal nodes covered by the selected branch, as well as the rules leading to the given branch. (Chapter 5: Regression Trees)

The Main Tree Rules window for our example begins:

Terminal Node 1: if RM <= 6.941 && LSTAT <= 4.91, then terminalNode = 1; mean = 31.565
Terminal Node 2: if RM <= 6.941 && LSTAT > 4.91 && LSTAT <= 9.715 && DIS <= 4.46815 && B <= 393.375, then terminalNode = 2; mean = 28.6714

The Main Tree Rules display only gives node-based rules, ignoring missing-value handling mechanisms entirely. To request a full display of the tree logic, including missing-value handling, see the chapter called Translating Models in this manual.
24. (Appendix IV: BASIC Programming Language) Distribution functions (excerpt):

Poisson (P):            PRN(p)      PCF(x, p)      PDF(x, p)      PIF(a, p)      p = Poisson parameter, x = Poisson value
Studentized range (S):  SRN(k, df)  SCF(s, k, df)  SDF(s, k, df)  SIF(a, k, df)  k = parameter, df = degrees of freedom
t (T):                  TRN(df)     TCF(t, df)     TDF(t, df)     TIF(a, df)     df = degrees of freedom, t = t statistic
Uniform (U):            URN         UCF(x)         UDF(x)         UIF(a)         x = uniform value
Weibull (W):            WRN(p, q)   WCF(x, p, q)   WDF(x, p, q)   WIF(a, p, q)   p = scale parameter, q = shape parameter

These functions are invoked with either 0, 1, or 2 arguments, as indicated in the table above, and return a single number, which is either a random draw, a cumulative probability, a probability density, or a critical value for the distribution. We illustrate the use of these functions with the chi-square distribution. To generate 10 random draws from a chi-square distribution with 35 degrees of freedom for each case in your data set:

% DIM CHISQ(10)
% FOR I = 1 TO 10
%   LET CHISQ(I) = XRN(35)
% NEXT

To evaluate the probability that a chi-square variable with 20 degrees of freedom exceeds 27.5:

%LET CHITAIL = 1 - XCF(27.5, 20)

The chi-square density for the same chi-square value is obtained with:

%LET CHIDEN = XDF(27.5, 20)

Finally, the 5% point of the chi-square distribution with 20 degrees of freedom is calculated with:

%LET CHICRIT = XIF(.95, 20)

(Appendix IV: BASIC Programming Language) Missing Values. The system missing value is stored internally as the largest negative number allowed. Missing values in BASIC programs and pr…
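The `1 - XCF(27.5, 20)` tail probability can be cross-checked outside CART. For an even number of degrees of freedom the chi-square upper tail has a simple closed form; this is a verification sketch, not anything CART itself provides:

```python
import math

def chisq_upper_tail(x, df):
    # P(X > x) for a chi-square variable with EVEN df, via the closed
    # form Q = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    assert df > 0 and df % 2 == 0, "closed form holds for even df only"
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i        # next series term (x/2)^i / i!
        total += term
    return math.exp(-half) * total

tail = chisq_upper_tail(27.5, 20)   # the CHITAIL quantity above
```

For odd degrees of freedom one would need the regularized incomplete gamma function instead.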
Purpose

The DISCRETE command sets options specific to discrete (categorical) variables. The command syntax is:

  DISCRETE TABLES = NONE | SIMPLE | DETAILED,
           CASE = MIXED | UPPER | LOWER,
           MISSING = MISSING | LEGAL,
           REFERENCE = FIRST | LAST,
           MAX = <m, n>,
           ORDER = YES | NO,
           ALLLEVELS = YES | NO

TABLES: Controls whether frequency tables should be printed following data preprocessing. SIMPLE generates a listing of the levels encountered for each discrete variable and total counts across learn and test samples. DETAILED breaks down counts by learn and test sample and, for classification trees, also by the dependent variable. The default is SIMPLE.

CASE: Controls whether character strings are case-converted. The default is MIXED.

MISSING: Controls whether missing values for discrete variables are treated as truly MISSING or are considered a legal and distinct level. LEGAL will process missing values for nontarget variables as legal. TARGET will process missing values for the model target only as legal. ALL will process missing values for all variables as legal.

REFERENCE: Specifies which level is considered the reference, or left-out, level. In MARS, a reference level is only needed when computing an OLS model for comparative purposes prior to the MARS model. By default, the FIRST level according to the ORDER and SORT criteria is considered the reference level. You may wish to change this to the LAST level to
[Model Setup, Variable Selection tab: columns for Variable Name, Target, Predictor, Categorical, Weight, and Aux, listing variables such as INCOME, MARITAL, N_INQUIRIES, NUMCARDS, OCCUP_BLANK, OWNRENT, POSTBIN, TARGET, and TIME_EMPLOYED. Controls include Tree Type (Classification, Regression, Unsupervised), Set Focus Class, Target Variable: TARGET, Weight Variable, Sort: Alphabetically, Number of Predictors, Save Grove, CART/Combine/Score, Cancel, and Continue.]

To be safe, it is also worth placing a check mark in the Categorical column. Although CART typically assumes that you intend to conduct a classification and not a regression analysis, it is wise to remove any possibility of doubt.

Next we indicate which variables are to be used as predictors. CART is a capable automatic variable selector, so you do not have to do any selection at all, but in many circumstances you will want to exclude certain variables from the model.

- If you do not explicitly select the predictors CART is allowed to use, then CART will screen all variables for potential inclusion in its model.
- Even if all the variables available are reasonable candidates for model inclusion, it can still be useful to focus on a subset for exploratory analyses.

In our first run we will select all the variables except POSTBIN. Do this by clicking on the Predictor column heading to highlight the column, check the Select Predictors box underneath the column, and
Maximum Number of Nodes (NODES)

NODES = <n> allows you to specify a maximum allowable number of nodes in the largest tree grown. If you do not specify a limit, CART may allow as many as one terminal node per data record. When a limit on NODES is specified, the tree generation process will stop when the maximum allowable number of nodes (internal plus terminal) is reached. This is a crude but effective way to limit tree size.

Depth

This setting limits tree growing to a maximum depth; the root node corresponds to a depth of zero. Limiting a tree in this way is likely to yield an almost perfectly balanced tree, with every branch reaching the same depth. While this may appeal to your aesthetic sensibility, it is unlikely to be the best tree for predictive purposes. By default, CART sets the maximum DEPTH value so large that it will never be reached.

Unlike the complexity control, the NODES and DEPTH controls may handicap the tree and result in inferior performance. Some decision tree vendors set depth values to small limits, such as five or eight. These limits are generally set very low to create the illusion of fast data processing. If you want to be sure to get the best tree, you need to allow for somewhat deeper trees. Command-line users will use the following command syntax:

  LIMIT NODES = <n>, DEPTH = <n>

Learn Sample Size (LEARN)

The LEARN setting limits CART to processing only the first part of the data available and simp
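The NODES and DEPTH controls interact through simple binary-tree arithmetic: a tree limited to depth d can contain at most 2^(d+1) − 1 nodes in total, of which at most 2^d are terminal. The sketch below (our own illustration, not CART code) makes the bound concrete.

```python
def max_nodes(depth):
    """Total node count (internal plus terminal) of a full binary tree
    whose root sits at depth 0 and whose deepest level is `depth`."""
    return 2 ** (depth + 1) - 1

def max_terminal_nodes(depth):
    """Terminal-node count of the same full binary tree."""
    return 2 ** depth

# A DEPTH limit of 5 caps a binary tree at 63 nodes, 32 of them terminal;
# the root alone (depth 0) is a single node.
print(max_nodes(0), max_nodes(5), max_terminal_nodes(5))  # 1 63 32
```

This is why vendor depth limits of five or eight are so restrictive: a depth-5 tree can never have more than 32 terminal nodes, no matter how much structure is in the data.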
[Pruning navigation bar: tree numbers 1 through 8 with relative costs such as 1.000000, 1.000000, 0.287729.]

The current tree number in the pruning sequence (starting with the largest tree and going backwards), the number of terminal nodes, and the relative cost are reported for your convenience.

Target, Weight, and ID Variables

If the target variable in your data set has a name other than the name used in the original learning dataset (a proxy target variable), then you should highlight it in the left panel and press Select in the Target Variable area.

- If the target name is the same as the original, simply skip the above step; CART will detect this automatically. CART will also handle the case when there is no target at all; however, for obvious reasons, some of the scoring results reports will become unavailable.

Select the weight variable, if any, by highlighting it in the variable list panel and pressing the Select button in the Weight Variable area.

Finally, select up to 50 ID variables in the variable list panel and add these to the right panel by pressing the corresponding Select button.

- An ID variable can be any variable that was NOT part of the final model (the target and the finally selected predictors).

Check the Model Information checkbox in the Include area if you want the original target and predictors propagated into the output dataset (see below).

Output Data Set

CART sa
Appendix III: Command Reference

SAVE

Purpose

The SAVE command saves subsequent results to a dataset. If you specify a path name, enclose the whole thing in single or double quotation marks. If an unquoted name is given without an extension, a Systat dataset is saved to the default directory and .SYS is appended to the name. The command syntax is:

  SAVE "<file>" / SINGLE | DOUBLE, "<comment>"

Examples:

  SAVE "projects\scoring\Modella.csv"
  SAVE "results.sas7bdat"
  SAVE "projects\scoring\Modella.xls"    (XLS5, via DBMSCOPY, into a spreadsheet)
  SAVE SCORES                            (save Systat dataset SCORES.SYS into the default directory)
  SAVE SCORES.CSV                        (save CSV dataset SCORES.CSV into the default directory)

The SAVE command must appear before the command that causes data to be stored to the file; e.g., you must issue the SAVE command before the SCORE command if you wish to save the scoring results to a dataset.

SEED

Purpose

The SEED command allows you to set the random number seed to a certain value, as well as to specify that the seed remain in effect after the tree is built. Normally the seed is reset to 13579, 12345, 131 upon starting up CART. The command syntax is:

  SEED I, J, K / RETAIN | NORETAIN

All three values (I, J, K) must be given. Legal values include all whole numbers between 1 and 30000. If RETAIN is not specified, the seed will be reset to 13579, 12345, 131 after the
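The point of a fixed, resettable seed is reproducibility: seeding a generator with the same value always yields the same stream of draws, so runs can be repeated exactly. The sketch below illustrates this with Python's `random` module (an analogy only; it is not CART's generator, and CART's three-part seed has no Python counterpart here).

```python
import random

def draws(seed, n=5):
    """n pseudo-random draws from a generator initialized with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Re-seeding with the same value reproduces the sequence exactly, which is
# why resetting the seed to a fixed default before each tree makes runs
# repeatable, while RETAIN lets the stream continue across trees.
print(draws(13579) == draws(13579))  # True
```

Without RETAIN, every tree-building run starts from the same fixed seed and therefore sees identical random draws; with RETAIN, successive runs continue the stream and so see different draws.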
Click inside any column of the variable importance chart to start highlighting rows. You can use this to select variables on which to focus in a new analysis. Below we have selected the seven variables that actually appear as splitters.

[Navigator, Tree Summary Reports, Variable Importance tab: variables N_INQUIRIES, NUMCARDS, OCCUP_BLANK, AGE, CREDIT_LIMIT, TIME_EMPLOYED, and INCOME highlighted, with their importance scores; options for Consider Only Primary Splitters and Show zero importance variables; buttons New Keep List and New Keep & Build.]

Once you have highlighted variables in this way on the variable importance chart, you can automatically build a new model using only those predictors. Just click on the New Keep & Build button. Clicking on the New Keep List button creates a list of those variables and places them on a KEEP list in a new notepad. You can edit this KEEP command and place it in scripts, or just save it for later use.

Misclassification

The Misclassification report shows how many cases were incorrectly classified in the overall tree, for both learn and test (or cross-validated) samples. The tables, which can be sorted by percent error, cost, or class, display:

  CLASS            Class level
  N CASES          Total number of cases in the class
  N MISCLASSIFIED  Total nu
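The per-class counts in the Misclassification report follow directly from pairing actual with predicted class labels. The sketch below (our own illustration with made-up data, not CART's implementation) computes the CLASS, N CASES, N MISCLASSIFIED, and percent-error columns for a toy two-class sample.

```python
from collections import defaultdict

def misclassification_report(actual, predicted):
    """Per-class rows in the spirit of the report:
    CLASS -> (N CASES, N MISCLASSIFIED, PCT ERROR)."""
    cases, wrong = defaultdict(int), defaultdict(int)
    for a, p in zip(actual, predicted):
        cases[a] += 1
        if a != p:
            wrong[a] += 1
    return {c: (cases[c], wrong[c], 100.0 * wrong[c] / cases[c]) for c in cases}

actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]
report = misclassification_report(actual, predicted)
print(report[0])  # (5, 1, 20.0): class 0 has 5 cases, 1 misclassified
```

Sorting the resulting rows by the third element reproduces the report's sort-by-percent-error view.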
CART will look in \Salford unless the filename is quoted or the FPATH command is canceled by giving an FPATH command without arguments.

One can also specify different default directories for different sorts of files. To specify a default directory for input datasets, use:

  fpath <pathname> /use

To specify a default directory for output datasets, use:

  fpath <pathname> /save

For command files:

  fpath <pathname> /submit

For text output files:

  fpath <pathname> /output

FPATH without arguments restores the default, which is to use the current working directory. FPATH with an option but no pathname restores the default for the specified file type.

Online Help

Console CART has its own online help system, which can be accessed by opening CART in interactive mode and typing HELP at the prompt. To read the entry for a particular command, type HELP followed by the name of the command.

Workspace Allocation

Console CART can allocate arbitrary amounts of memory. The default workspace size is 25 MB, but this can be altered with either the SALFORD_M environment variable or the -m command line flag. We suggest that SALFORD_M be set in the system-wide startup files (/etc/profile and /etc/csh.login on most UNIX-like systems) as appropriate for the hardware.

Limit on Number of Variables

By default, CART will read datasets with up to 32,768 va
Chapter 6: Ensemble Models and Committees of Experts

The three options for specifying the holdout data set are grouped in the Evaluation Sample (Holdout Method) box:

1. Use a fraction of the data (specify the fraction; default 0.10).
2. Use a separate data set (select the data set).
3. Use an indicator variable (select the name of a binary test dummy variable).

Files to Save

To save the individual learn samples obtained using sampling with replacement, simply checkmark the Learn samples box and specify the Root Name, say "learn". Because CART will attach a serial number to the root names of the learn files, we recommend keeping the names to six characters or less to avoid truncation. The serial number corresponds to the resample cycle number; e.g., if CYCLES = 10, the learn samples will be labeled learn01, learn02, ..., learn10.

To grow the committee of experts, click Start. The combined model can be saved into a grove file for further scoring or translating by pressing the Save Grove button and specifying the file name before the model is built.

- The grove file in this case will have multiple trees and does not have an accompanying navigator file.

Report Details

By default, the Combine text output consists of summary statistics for the train (learn) sample and the holdout sample, as well as a prediction success (or confusion) matrix report summarizing how well the holdout sample performed on the initial tree built using the in and out of b
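Sampling with replacement and the serial-numbered naming scheme described above can be sketched in a few lines. This is an illustration only (function and seed are our own), showing how one bootstrap resample per cycle is drawn at the same size as the original data and labeled learn01, learn02, and so on.

```python
import random

def bootstrap_learn_samples(data, cycles, root_name="learn", seed=0):
    """One resample (with replacement, same size as `data`) per cycle,
    keyed by root name plus a two-digit serial number: learn01, learn02, ..."""
    rng = random.Random(seed)
    return {f"{root_name}{cycle:02d}": [rng.choice(data) for _ in data]
            for cycle in range(1, cycles + 1)}

samples = bootstrap_learn_samples(list(range(100)), cycles=10)
print(sorted(samples)[0], sorted(samples)[-1], len(samples["learn03"]))  # learn01 learn10 100
```

Keeping the root name short matters precisely because the two-digit serial number is appended to it for every cycle.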
Classification trees use a categorical target variable (e.g., YES/NO), while regression trees use a continuous target variable such as AGE or INCOME. The purpose of classification is to accurately discriminate between a usually small number of classes; the purpose of regression is to predict values that are close to a true outcome, with a usually large number, or even an infinity, of possible outcomes.

When the Tree Type Classification radio button is checked, the target variable will automatically be considered categorical, regardless of the Categorical check box designation defined in the Model tab. Similarly, the Regression radio button will automatically cancel the categorical status of the target variable, so long as the variable is coded as a number and not as text. In other words, the specified Tree Type determines whether a numeric target is treated as categorical or continuous, superseding any Categorical check box designation.

Predictor Variable Selection

Candidate predictor (independent) variables are specified by check marks in the Predictor column. In this example, include the following subset of variables as predictors: AGE, RACE, SMOKE, HT, UI, FTV, PTD, and LWD, by placing checkmarks in the Predictor column against the above variables. Alternatively, hold down the <Ctrl> key to simultaneously highlight the variables with left mouse clicks, and then place a checkmark in the Select P
[Model Setup dialog: Tree Type (Classification, Regression, Unsupervised), Target Variable, Weight Variable, Sort: File Order, Number of Predictors: 15, Save Grove, CART/Combine, Cancel.]

Default Variable Sorting Order

Many GUI displays include a list of variables, and you can always change the sort order between Alphabetical and File Order (the order in which the variables appear in your data file). This setting allows you to determine the ordering that will always show first when a dialog is opened:

  Default Variable Sorting Order:  File Order / Alphabetical

Controlling CART Report Details

The parameters controlling the contents of the CART Output window can be set in the Options CART tab, the middle tab on the Options dialog. The default Reporting settings are shown below:

  Text Report Preferences
  - Only summary tables of node information
  - Summary plots
  - Report pruning sequence
  - Number of surrogates to report
  - Number of competitors to report
  - Number of trees to list in the tree sequence

Full Node Detail or Summaries Only

Previous versions of CART printed full node detail for CART trees. These reports can be voluminous, as they contain about one text page for every node in an optimal tree. If you elect to produce these details, you
[Data Information window: descriptive statistics per variable, including N, N missing, N = 0, N distinct values, mean, standard deviation, skewness, kurtosis, coefficient of variation, conditional mean, sum of weights, sum, variance, and standard error of the mean, grouped under Location and Variability, plus optional frequency tables.]

Command line users should issue the following command:

  DATAINFO <var1>, <var2>, ...

- DATAINFO without arguments generates data information for all variables present in the data set.
- GUI users may request Data Information for any specific list of variables by issuing the DATAINFO command with the variable list at the command prompt. The Data Information window will then contain information on the specified variables only.
- Requesting DATAINFO on large datasets may result in long processing times. This is a result of an exhaustive attempt to generate frequency tables for all variables with the specified number of discrete levels.

Chapter 13: Working with Command Language

This chapter provides insight into the essentials of CART configuration and gives an important practical introduction to using command files.

Introduction to the Command Language

This chapter describes the situations in which a Windows user may want to take adva
  * Input dataset. Any preprocessing statements go here. ;
  * We'll create a variable AGE: ;
  AGE = (&NOW - BIRTHDATE) / 365.25;
  * Score the data: ;
  LINK MODELBEGIN;
  * Any postprocessing statements could go here. ;
  RETURN;
  * We don't want to execute the TRANSLATE output twice: ;
  %INCLUDE mygrove.sas;                  /* TRANSLATE output */
  keep ID RESPONSE PROB1 PROB2;
  rename PROB1=PROB0 PROB2=PROB1;        /* Original target was a 0/1 binary */
  run;

USE

Purpose

The USE command reads data from the file you specify. You may specify just the root of the filename if the file resides in the current directory (usually C:\Program Files\CART 6.0\Sample Data if one is running the GUI, or the directory from which CART was launched in the case of the console), or specify the directory with Utilities > Defaults > Path in the GUI or the FPATH command. If you specify a path, you must provide the complete file name with the appropriate extension and surround the whole path name and file name with single or double quotation marks.

If the file name is unquoted and given without an extension, CART will search for files with the specified root name and the following extensions, in the order given:

  .SYS  Native Systat binary format
  .SYD  Native Systat binary format
  .CSV  Comma-separated text
  .TXT  Comma-separated text
  .DAT  Comma-separated text

Thus, the command USE SOMEDATA would cause CART to first try to open SOMEDATA.SYS in the default directory; if i
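The extension search order described above is a simple first-match lookup, sketched below in Python. This is our own illustration of the documented order, not CART's code; the `exists` predicate is injected so the example can run against a fake file listing instead of a real disk.

```python
import os

SEARCH_EXTENSIONS = [".SYS", ".SYD", ".CSV", ".TXT", ".DAT"]

def resolve_dataset(root, directory, exists):
    """Try each extension in the documented order; return the first
    existing candidate, or None if no candidate is found."""
    for ext in SEARCH_EXTENSIONS:
        candidate = os.path.join(directory, root + ext)
        if exists(candidate):
            return candidate
    return None

# With SOMEDATA.CSV and SOMEDATA.DAT both present, the CSV wins because
# it is tried earlier, after SYS and SYD miss.
fake_fs = {os.path.join("data", "SOMEDATA.CSV"), os.path.join("data", "SOMEDATA.DAT")}
print(resolve_dataset("SOMEDATA", "data", fake_fs.__contains__))
```

Note that the order matters when several files share a root name: the binary Systat formats always take precedence over the text formats.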
Line numbers must be integers less than 32000, and we recommend that if you use any line numbers at all, all your BASIC statements should be numbered. BASIC will execute the numbered statements in the order of the line numbers, regardless of the order in which the statements are typed, and unnumbered BASIC statements are executed before numbered statements. Here is an example of using the GOTO statement:

  % 10 IF PARTY$ = "GOP" THEN GOTO 96
  % 20 LET NEWDEM = 1
  % 30 LET VEEP$ = "GORE"
  % 40 GOTO 99
  % 96 LET VEEP$ = "KEMP"
  % 99 LET CAMPAIGN = 1

BASIC Programming Language Commands

The following pages contain a summary of the BASIC programming language commands. They include syntax, usage, and examples.

DELETE Statement

Purpose

Drops the current case from the data set.

Syntax

  DELETE

Examples

To keep a random sample of 75% of a data set for analysis:

  % IF URN < .25 THEN DELETE

DIM Statement

Purpose

Creates an array of subscripted variables.

Syntax

  DIM var(n)

where n is a literal integer. Variables of the array are then referenced by variable name and subscript, such as var(1), var(2), etc. In an expression the subscript can be another variable, allowing these array variables to be used in FOR...NEXT loop processing. See the section on the FOR...NEXT statement for more information.

Examples

  % DIM QUARTER(4)
  % DIM MONTH(12)
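The DELETE example above keeps about 75% of the data because each case independently survives a uniform draw with probability 0.75. The Python sketch below (our own analogue, not CART BASIC) makes that behavior concrete on 10,000 records.

```python
import random

def keep_75_percent(records, seed=13579):
    """Analogue of `IF URN < .25 THEN DELETE`: drop each record with
    probability 0.25, keeping roughly three quarters of the data."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() >= 0.25]

kept = keep_75_percent(range(10000))
print(0.70 < len(kept) / 10000 < 0.80)  # True: close to 75% kept
```

Because the test is applied case by case, the retained fraction fluctuates around 75% rather than hitting it exactly, which is usually what one wants for a random subsample.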
Of course, the modeler can always exclude a variable; the penalty offers an opportunity to permit a variable into the tree, but only under special circumstances. The three categories of penalty are:

- Missing Value Penalty: Predictors are penalized to reflect how frequently they are missing. The penalty is recalculated for every node in the tree.
- High-Level Categorical Penalty: Categorical predictors with many levels can distort a tree due to their explosive splitting power. The HLC penalty levels the playing field.
- Predictor-Specific Penalties: Each predictor can be assigned a custom penalty.

A penalty will lower a predictor's improvement score, thus making it less likely to be chosen as the primary splitter. These penalties are defined in the Model Setup Penalty tab. Penalties specific to particular predictors are entered in the left panel, next to the predictor name, and may range from zero to one, inclusive. Penalties for missing values (for categorical and continuous predictors) and for a high number of levels (for categorical predictors only) can range from No Penalty to High Penalty and are normally set via the slider on the Penalty tab, as seen in the following illustration.

[Model Setup, Penalty tab: Missing Penalty slider from No Penalty to High Penalty, Set1 Penalty 0.00; High-Level Categorical Penalty slider from No Penalty to High Penalty, Set1 Penalty 0.00.]
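The arithmetic of a predictor-specific penalty is simple: the improvement score is multiplied by one minus the penalty before splitters are compared. The sketch below (our own illustration with made-up variable names and scores) shows how a 0.10 penalty acts as a 10% handicap in the splitter competition.

```python
def penalized_improvement(improvement, penalty):
    """Improvement score after applying a penalty in [0, 1]:
    a 0.10 penalty reduces the score by 10%."""
    if not 0.0 <= penalty <= 1.0:
        raise ValueError("penalty must lie between 0 and 1 inclusive")
    return improvement * (1.0 - penalty)

def best_splitter(improvements, penalties):
    """The splitter chosen at a node: highest penalized improvement."""
    return max(improvements, key=lambda v: penalized_improvement(
        improvements[v], penalties.get(v, 0.0)))

# COST_VAR is better raw (0.100 vs 0.095), but its 0.10 penalty means it
# must beat AGE by more than 10% to win, so AGE is chosen instead.
improvements = {"AGE": 0.095, "COST_VAR": 0.100}
print(best_splitter(improvements, {"COST_VAR": 0.10}))  # AGE
print(best_splitter(improvements, {}))                  # COST_VAR
```

This is exactly the "handicap" reading of the penalty: the penalized variable still enters the tree whenever its advantage over the alternatives exceeds the penalty.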
...option from the File menu to open the TTC.CMD command file. Second, use the File > Submit Window menu to build a new model. The resulting Navigator suggests an 18-node tree as the optimal in terms of expected cost. Now press the Summary Reports button and go to the Terminal Nodes tab.

[Terminal Nodes chart: Percentage of Node That Is Target Class, by node, for the focus class versus other classes, Learn and Test.]

Note two types of instability of the optimal tree with respect to the Learn and Test results:

Directional Instability

Node 15 has 9% of Class 1 on the learn data and 56% of Class 1 on the test data. Assuming the node majority rule for class assignment, this effectively constitutes instability with respect to the class assignment that depends on the data partition. Another way to look at this is that the node lift is less than 1 on the learn data and greater than 1 on the test data.

Rank Instability

The nodes on the graph are sorted according to node richness using the learn data. However, the sorted order is no longer maintained when looking at the test data; hence we have another type of instability. Many deployment strategies (for example, model-guided sampling of subjects in a direct marketing campaign)
rely only on the sorted list of segments, and therefore eliminating this kind of instability is highly desirable.

Note that the Rank Stability requirement is generally stricter than the Directional Stability requirement. In other words, one may have all nodes directionally stable (agreeing on the class assignment) and yet have non-conforming sort orders.

Also note that it is useful to introduce some slack in the above comparisons due to limited node sizes. For example, one might argue that the discrepancies in the sort sequences must be significant enough to declare the whole model rank-unstable. Similarly, a directional-disagreement node must show a significant difference between the learn and test sample estimates. We employ a simple statistical test on a difference in two population proportions to accomplish this. The z threshold of this test is controlled by the user, thus giving varying degrees of slack.

In addition, special care must be taken in handling nodes where the test data is missing entirely (empty test counts). The user has the option to either declare such trees unstable or to ignore any such node.

Fuzzy Match

Running TTC

To run a TTC analysis, just press the T/T Consist button in the Navigator window. The resulting display shows the results.

[Train-Test Consistency Check window for Navigator 1 (Classification Tree): Consistency by Trees...]
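The test on a difference in two population proportions mentioned above is the standard pooled two-proportion z-test. The sketch below implements it in Python (our own illustration; the node counts are hypothetical, chosen to echo the 9% learn versus 56% test richness of Node 15 discussed earlier, and CART's internal implementation details are not shown).

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """z statistic for a difference in two population proportions,
    using the pooled-proportion standard error."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

# Hypothetical node counts: 9 of 100 learn cases vs 28 of 50 test cases
# in the focus class. |z| far exceeds a 1.96 threshold, so this node's
# learn/test disagreement would be flagged as significant.
z = two_proportion_z(9, 100, 28, 50)
print(abs(z) > 1.96)  # True
```

Raising the user-controlled z threshold admits more slack: larger learn/test discrepancies are tolerated before a node is declared unstable.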
seen earlier, with the following differences:

- Requested output file names have been changed in the OUTPUT and GROVE commands.
- The LIMIT settings have been changed to MINCHILD = 1, ATOM = 2, in agreement with Leo Breiman's suggestions.
- The MOPTIONS command configures the combined run. See Appendix III, Command Reference, for a complete description.

Example: Sample Scoring Run

The contents of a CLASSCOMB.CMD sample command file are shown below. Line-by-line descriptions and comments follow.

  REM **************************************************
  REM SAMPLE SCORING RUN
  REM **************************************************
  REM INPUT/OUTPUT FILES
  1 >>  USE gymtutor.csv
  2 >>  SAVE ScoreOutGym.csv /MODEL
  3 >>  GROVE gymtutor.grv
  REM **************************************************
  REM SCORE MODEL
  REM **************************************************
  4 >>  SCORE DEPVAR = SEGMENT
  REM **************************************************
  5 >>  REM QUIT CART
  REM **************************************************

A detailed description of each command in this command file is provided below. Commands 1 through 3 control which files will be used or created during this run.

1 >> The USE command specifies the data set to be used in modeling. CART has built
unless the user explicitly turns off this feature. Here we provide a brief reminder of the multiple ways to save a grove file:

1. When the Navigator window is active, you may save the corresponding navigator and grove files by clicking the Grove button.

2. Issuing the GROVE <file_name.grv> command results in a copy of the grove file that will be embedded in the navigator. The GROVE command names a grove file in which to store the next tree, committee, or group of trees. Its syntax is:

  GROVE <filename>

When a grove file is embedded into a navigator file, you may easily save it separately by first opening the navigator file in the GUI (File > Open > Open Navigator) and then pressing the Save Grove button.

For example, let's make a default CART run for the GYMTUTOR.CSV data. To begin, simply mark SEGMENT as the target and press Start. When complete, with the Navigator in the foreground, press Grove. In the resulting Save As dialog, choose the name of the file and the folder to which you want the file saved. Finally, press the Save button. The grove file (extension .grv) is now saved. Furthermore, it has the navigator embedded in it. You now have all you need to proceed with scoring or translating.

- Alternatively, you may request that grove and navigator files be saved as part of the model-building process. Simply press the Save Grove button in the Model Setup window and enter the file na
When the number of predictors exceeds the limit, CART uses a dash convention to indicate ranges of predictors, for example X1-X5. This setting only affects the GUI logging mechanism; the command parser supports both short and standard command notations.

Window to Display When a File Is Opened

When you open a data file, CART gives you three choices for what to do next:

  Window To Display When Data Is Opened:  Classic Output / Activity Window / Model Setup

Classic Output

This is the classic text (mainframe-style) output, suitable for diehard UNIX and Linux gurus. You will be greeted with a plain text screen looking something like:

[CART Classic Output window (Ctrl+Alt+C), showing submitted commands and responses such as:
  FORMAT = 5
  USE "C:\Program Files\Salford Data Mining\CART Pro EX 6.0\Examples\HOSLEM.CSV"
  (the file uses "," as delimiter; the variable names in the file and its 189 records are listed)
  REM Resetting Preferences
  REM Setting General default options
  LOPTIONS MEANS = YES, PREDICTIONS = YES, BOTH, TIMING = YES, GAINS = YES, ROC = YES
  FORMAT = 5
  REM Setting CART default options
  LOPTIONS NOPRINT = NO, PLOTS = YES, PS = NO]
Class 1. The nodes are ordered from the richest (highest percentage of Class 1 cases) to the poorest (lowest percentage of Class 1 cases) on the learn data. The table displays the following information for each terminal node (scroll the grid to view the last two columns):

  Node                  Node reference number
  Cases Tgt Class       Number of cases in the node belonging to the target class
  % Of Node Tgt Class   Percentage of cases in the node belonging to the target class
  % Tgt Class           Number of target class cases in the node as a percentage of the total number of target class cases
  Cum % Tgt Class       Cumulative number of target class cases as a percentage of the total number of target class cases
  Cum % Pop             Cumulative number of cases as a percentage of the total number of cases in the analysis
  % Pop                 Percentage of the total number of cases in the analysis that are contained in the node
  Cases In Node         Total number of cases in the node
  Cum Gains             Cumulative percentage of target class cases divided by the cumulative share of the total number of cases
  Lift Index            Percentage of target class cases in the node divided by the percentage of the total number of cases in the node

In the figure displayed in the left panel, the x-axis represents the percentage of the data included and the y-axis represents the percentage of that class included. The 45-degree line maps the percentage of the particular class you would expect if each node were a random sample of the population. The blue cu
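The Cum Gains and Lift Index columns defined above can be computed directly from per-node counts. The sketch below (our own illustration on a made-up two-node example, not CART output) applies those two definitions to nodes already sorted from richest to poorest.

```python
def gains_rows(nodes):
    """Cum Gains and Lift Index for terminal nodes given as
    (target_class_cases, total_cases) pairs, sorted richest first."""
    total_tgt = sum(t for t, _ in nodes)
    total_all = sum(n for _, n in nodes)
    rows, cum_tgt, cum_all = [], 0, 0
    for tgt, n in nodes:
        cum_tgt += tgt
        cum_all += n
        # Cumulative share of target class over cumulative share of cases:
        cum_gains = (cum_tgt / total_tgt) / (cum_all / total_all)
        # Node share of target class over node share of cases:
        lift = (tgt / n) / (total_tgt / total_all)
        rows.append((round(cum_gains, 3), round(lift, 3)))
    return rows

# Toy two-node tree: the first node is 80% target class against a 50% base
# rate, so its lift index is 1.6; over the full population gains fall to 1.
print(gains_rows([(40, 50), (10, 50)]))  # [(1.6, 1.6), (1.0, 0.4)]
```

A lift index above 1 marks a node richer than the population average, which is why sorting by richness makes the cumulative gains curve rise above the 45-degree line before returning to 1.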
The scaling factor is governed by two constants, c and d, together with N_categories, the number of levels of the categorical splitter. By default, c = 1 and d = 0; these values disable the penalty. We recommend that the categorical variable penalty be set to c = 1, d = 1, which ensures that a categorical predictor has no inherent advantage over a continuous variable with unique values for every record.

Command line users will use the following command syntax to specify variable penalties:

  PENALTY <var1> = <penalty1>, <var2> = <penalty2>, ... / MISSING = <mis_val1>, <mis_val2>, HLC = <hlc_val1>, <hlc_val2>

  PENALTY / MISSING = 1, 1, HLC = 1, 1

- The missing value and HLC penalties apply uniformly to all variables. You cannot set different HLC or missing value penalties for different variables; you choose one setting for each penalty and it applies to all variables.
- You can combine variable-specific penalties with the general missing value and HLC penalties. Thus, if you have a categorical variable Z that is also sometimes missing, all three penalties could apply to this variable at the same time.

Setting Reporting, Random Number, and Directory Options

This section is a guide to the reporting and other fine-tuning global controls you may want to set before you grow your trees. These parameters are contained in the Options dialog, accessed by selecting Options from the Edit menu or clicking the corresponding toolbar icon. If you are in the Model Setup dialog box, you must first click on the Cont
The Priors tab ............................................. 118
The Penalty tab ............................................ 121
Setting Reporting, Random Number and Directory Options ..... 125
Working with Navigators .................................... 134
Viewing Auxiliary Variables Information .................... 134
Comparing Children ......................................... 139
Comparing Learn and Test ................................... 139
Saving Navigator Files ..................................... 140
Printing Trees ............................................. 141
Overlaying and Printing Gains Charts ....................... 143
REGRESSION TREES ........................................... 145
Building Regression Trees .................................. 146
Specifying a Regression Model .............................. 146
Tree Navigator ............................................. 149
Regression Tree Summary Reports ............................
Row button 177, 250
Save as Defaults button 127
Save Grove button 75, 78, 87, 166, 171, 202, 259
Save Navigator button 74
Save button 74, 140, 171, 257, 283, 284
Scatter button 197
Score button 75, 78, 172, 178, 237, 259
Select Variables button 110
Select button 78, 173, 174, 178, 181, 182, 259, 260
Send To Right > button 273
Set Class Names button 92
Set Default button 288
Set Defaults button 57, 58, 241, 242
Set Focus Class button 91
Set Left button 273
Set Right button 273
Set Root button 268
Show Min Error button 201

Index

Show button 197
Smaller button 75, 237
Sorting button 197
Split selected groups button 139
Splitters button 51, 151, 238
Start button 26, 47, 135, 149, 166, 171, 200, 234, 269, 270, 274, 298
Summary Reports button 61, 143, 151, 186, 244
Symmetrical button 117

T
T Consist button 187
Table button 208
Tagged button 159
Test button 62, 71, 75, 151, 153, 183, 197, 201, 204, 237, 246, 250, 253, 256, 259
Translate button 75, 77, 180, 237, 259
Tree Details button 56, 141, 151, 239, 271, 274
Unlock button 27
Use Default button 288
View Data button 44

A
above depth 278
accessing data 31, 34
Accuracy tab 202
  1 SE Terminal Nodes 203
  Average Accur
[Table: relative cross-validated error by tree size across repeated cross-validation runs; values range roughly from 0.22 to 0.31. Model Quality / Sample / Model Size controls: Test, Learn, Min Cost, 1 SE.]

According to our findings, the relative CV error could be as low as 0.216 or as high as 0.275, with the average at 0.238.

Chapter 10: CART Batteries

Battery DEPTH (CART 6.0 Pro, Pro EX)

Battery DEPTH specifies the depth limit of the tree. We illustrate it on the SPAMBASE.CSV dataset by trying depths of 2, 3, 4, 5, 6, 7, 8, and 9 (see the DEPTH.CMD command file for details).

[Figure: Battery Summary window (Models, Contents, Accuracy, Error Profiles, Var Imp Averaging, Charts tabs) showing relative error and node counts, e.g. DEPTH_6 (14 nodes, rel. error 0.191) and DEPTH_3 (18 nodes, rel. error 0.185); Min 0.1849, Median 0.1910, Mean 0.2121, Max 0.2975.]
To unlock this copy of CART, enter the unlock code below. Open your email application and compose an email to unlock@salford-systems.com with the following information:

- Name (Last, First)
- Company Name, Institution, or Affiliation
- Email Address
- Phone Number
- System ID, which can be found by pulling down the HELP menu and then looking on the licensing information tab (you just need to paste, Ctrl+V, from your clipboard)
- If you have not already informed us: what are you using the software for?

Once you receive the unlock code, highlight the code, right-click, and select Copy to copy the unlock code to your clipboard. Restart the software, go to the registration tab as you did previously, and verify that the System ID number has not changed. Place your cursor in the Unlock Code box, right-click, and paste the unlock code directly into the entry box. Click Unlock and you are done.

Note: We suggest you not try to type the unlock code. A typo would invalidate the current System ID and cause the whole process to be restarted.

Installing and Starting CART

Preparing Your Data for CART

Accessing data for modeling and analysis: this chapter discusses file formats and rules governing ASCII and Excel files.

Setting up Working Directories

CART will utilize user-specified directories for different input and output files. First choose Edit > Options, then select the
Agreed hides all agreed terminal nodes from the Consistency Details by Nodes report.

Double-clicking on any tree in the Consistency by Trees section (upper half) will produce a graph of train and test focus-class lift by node.

[Figure: learn and test lift by node for Tree_1 (18 nodes), TARGET class 1.]

Chapter 8: Train-Test Consistency (TTC)

Note the apparent directional instability of Node 15 (the Learn and Test values are on opposite sides of the 1.0 lift curve) as well as the rank instability of the Test curve (severe deviation from monotonicity).

Identifying a Stable Tree

Now let us use the TTC results to identify a consistent tree. As can be seen in the Consistency by Trees table, the 9-node tree is stable both in direction and rank.

[Figure: learn and test lift by node for Tree_1 (9 nodes), TARGET class 0.]

Note that even though the rank stability is approximate (slight departures from monotonicity in the Test curve), it is well within the significance level controlled by the Rank z-threshold.

Summary Reports > Terminal Nodes further illustrates the tree stability we were initially looking for.

[Figure: Terminal Nodes report showing the percentage of each node that is target class 1 versus other classes, for learn and test samples.]

Chapter 9: Hot Spot Detection

A new powerful feature designed to identify hot spots in the class of interest.
[Appendix I table: Command Line / Menu Equivalents; the two-column layout was flattened in extraction. Commands covered: MOPTIONS with options CYCLES (number of trees to combine), SETASIDE (evaluation sample holdout pruning test method), TEST (use resampled training data as the pruning test method), CROSS (cross-validation pruning test method), EXPLORE (no pruning), ROOT (trees root name), REMAP, ARC (arcing exponent for the combine method), LROOT (learn sample root name), DETAILS and RTABLES (report details); plus NAMES, NEW, NOTE, OPTIONS, OUTPUT, PAGE, PARTITION, PENALTY, PRIORS, PRINT, QUIT, REM, SAVE, SCORE, STRATA, SEED, SELECT, SUBMIT, TRANSLATE, USE, WEIGHT, XYPLOT. Menu equivalents include Model > Construct Model (Combine, Method, Testing, Penalty, Priors, Costs, and Model tabs), Edit > Options (General and CART tabs), Limits > Growing Limits, File > Clear Workspace, and File > Save > CART output; several commands are command line only.]
DIM REGION(9)

Appendix IV: BASIC Programming Language

ELSE Statement

Purpose: Follows an IF...THEN to specify statements to be executed when the condition following a preceding IF is false.

Syntax: The simplest form is:

IF condition THEN statement1 ELSE statement2

The statement2 can be another IF...THEN condition, thus allowing IF...THEN statements to be linked into more complicated structures. For more information, see the section on IF...THEN.

Examples:

5 IF TRUE = 1 THEN GOTO 20
10 ELSE GOTO 30

IF AGE < 2 THEN LET AGEDES$ = "baby"
ELSE IF AGE < 18 THEN LET AGEDES$ = "child"
ELSE IF AGE < 65 THEN LET AGEDES$ = "adult"
ELSE LET AGEDES$ = "senior"

FOR...NEXT Statement

Purpose: Allows the processing of the steps between the FOR statement and an associated NEXT statement as a block. When an optional index variable is specified, the statements are looped through repetitively while the value of the index variable is in a specified range.

Syntax: The form is:

FOR [index variable and limits]
  statements
NEXT

The index variable and limits clause is optional, but if used it is of the form x = y TO z STEP s, where x is an index variable that is increased from y to z in increments of s. The statements are processed first with x = y, then with x = y + s, and so on, until x = z. If STEP
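For readers more familiar with modern languages, the FOR x = y TO z STEP s loop semantics described above can be mirrored in Python. This is a conceptual rendering, not part of CART's BASIC; the function name is invented.

```python
# A Python rendering of BASIC's FOR x = y TO z STEP s ... NEXT semantics:
# x runs y, y+s, y+2s, ... while it remains within the limit z.

def basic_for(y, z, s):
    out, x = [], y
    while (s > 0 and x <= z) or (s < 0 and x >= z):
        out.append(x)   # the statements between FOR and NEXT would run here
        x += s
    return out

values = basic_for(1, 7, 2)   # x takes 1, 3, 5, 7
```

A negative STEP counts down, e.g. basic_for(5, 1, -2) visits 5, 3, 1.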
Importance Formula

In the Best Tree dialog you can also specify how variable importance scores are calculated and how many surrogates are used to construct the tree. Rather than counting all surrogates equally (the default calculation), you can fine-tune the variable importance calculation by specifying a weight used to discount the surrogates. Click on the Discount surrogates radio button and enter a value between 0 and 1 in the Weight text box.

Number of Surrogates

After CART has found the best splitter (primary splitter) for any node, it proceeds to look for surrogate splitters: splitters that are similar to the primary splitter and can be used when the primary split variable is missing. You have control over the number of surrogates CART will search for; the default value is five. When there are many predictors with similar missing value patterns, you might want to increase the default value. You can increase or decrease the number of surrogates that CART searches for and saves by entering a value in the Number of Surrogates Used to Construct Tree box or by clicking on the up/down arrow keys.

Note: The number of surrogates that can be found will depend on the specific circumstances of each node; in some cases there are no surrogates at all. Your N surrogates setting limits how many will be searched for, but does not guarantee that this is the number that will actually be found. If all surrogates at a given node are missing, or no surrogates
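The surrogate mechanism described above can be sketched as a routing rule: try the primary splitter, and when its variable is missing, fall through the ranked surrogates. This is an illustration only, not CART's internals; the node structure and field names are invented.

```python
# Sketch (not CART internals): routing a case through a node using
# surrogate splitters when the primary split variable is missing.

def route(record, node):
    """node: {"primary": (var, threshold), "surrogates": [(var, threshold), ...]}
    Returns "left" or "right"; None means every candidate value was missing."""
    for var, threshold in [node["primary"]] + node["surrogates"]:
        value = record.get(var)        # None stands in for a missing value
        if value is not None:
            return "left" if value <= threshold else "right"
    return None  # in practice a default direction would be used here

node = {"primary": ("INCOME", 40000), "surrogates": [("AGE", 35)]}
side = route({"INCOME": None, "AGE": 50}, node)   # surrogate AGE decides
```

Because INCOME is missing in this record, the AGE surrogate sends the case to the right, which is the behavior the text describes for back-up rules.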
It appears as follows:

[Figure: DataInfo Setup dialog listing the dataset's variables with Include, Strata, Weight, Frequency Tabulation, List Extreme Values, Levels to Display, Levels to Tabulate, and Save to Grove columns, plus Max levels to display and Max levels to tabulate settings.]

Chapter 12: Features and Options

- Select the variables to include and place a checkmark in the Include column.
- Define a single stratification variable for data information statistics by placing a checkmark in the Strata column (max of eight levels).
- Define a single weighting variable by placing a checkmark in the Weight column.
- Enable frequency tabulations.
- Specify the number of most and least frequent levels for display.
- Specify the maximum number of discrete levels to display.
- Specify the maximum number of discrete levels to track.
- Specify a grove file where data information results are saved.

After you have made your selections using the DataInfo Setup dialog, click the OK button to proceed with the processing. Once the resulting window is open and active, you will see two different views from which you can select by using the Brief and Full buttons. The Brief view provides a snapshot of the data, including the
K = 1, 2, ... runs. By default, each battery starts with the current list of predictors and proceeds until no predictors are left. The user can change both the number of steps (elimination cycles) taken and the number of variables removed at each step (one by default).

We illustrate this process by shaving from the bottom of the entire initial list of predictors in the SPAMBASE.CSV data (see the SHAVING.CMD command file for details).

[Figure: Battery Summary window for the SHAVING battery, charting test relative error by model and listing, for each model, the optimal terminal node count, relative error, the predictor shaved at that step (e.g. WORD_FREQ_MONEY, WORD_FREQ_RE, WORD_FREQ_FREE, WORD_FREQ_TELNET, WORD_FREQ_85, WORD_FREQ_LABS, WORD_FREQ_EDU, WORD_FREQ_650, WORD_FREQ_HPL), and the remaining important-predictor count.]

It follows that the original list of 41 important predictors can be reduced to only 15 predictors without substantial loss of accuracy.
MB of learn sample data

> REM Resetting Preferences
> REM Setting General default options
> LOPTIONS MEANS=YES, PREDICTIONS=YES, BOTH, TIMING=NO, GAINS=YES, ROC=YES
> FORMAT=3
> REM Setting CART default options
> LOPTIONS NOPRINT=NO, PLOTS=YES, PS=NO
> BOPTIONS SURROGATES=5, PRINT=5, COMPETITORS=5, CPRINT=5, TREELIST=10
> BRIEF

Don't worry if some of the minor details are different on your screen. Later you will learn how to customize what you see when the program is started.

About CART Menus

When you first start CART you see one set of menus, but the menu items will change as you progress through an analysis. Menus change to reflect the stage of your analysis and the window you have active; as a result, not all menus are always available. Similarly, when not accessible, the commands that appear in the pull-down menus and the toolbar icons are disabled.

CART BASICS

An overview layout of the main CART menus is presented below.

FILE, EDIT, VIEW menus:
- Open a data set, Navigator file, Grove file, or command file
- Save analysis results, Navigator file, Grove file, or command file
- Open a CART notepad for creating command scripts
- Specify printing parameters
- Activate interactive command mode
- Submit batch command files
- Cut, copy, and paste selected text
- Search and replace text
- Specify colors and fonts
- Control reporting options
Terminal Nodes

The next Summary Report provides a graphical representation of the ability of the tree to capture the BADs in the terminal nodes. Observe that we selected BAD as the target class. This sorts the nodes so that those with the highest concentrations of BAD are listed first. The All Classes button represents each class with its own color; the other classes are simply colored gray.

[Figure: Navigator Tree Summary Reports window, Terminal Nodes tab, showing the percentage of each node that is the target class.]

Node 4 has the highest concentration of BADs, closely followed by nodes 2, 8, and 9. Hover the mouse over a bar to see the precise fraction of the node that is BAD. This is a graphical display of the information that is also in the gains chart. If you have separate test data, you can request a learn/test comparison of the terminal nodes in this window.

Variable Importance

It is natural to expect that the root node splitter will be the most important variable in a CART tree, and indeed in our example this is the case. However, you cannot count on it coming out this way in every tree. Sometimes a variable that splits the tree below the root is most important because it ends up splitting many nodes in the tree and splitting powerfully. Variable importance is determined by looking at every node in which a variable appears
Rectangle: Top 0.5, Bottom 0.5; Save as Defaults, Cancel, Printer buttons.

In our example, changing the orientation to landscape and scaling the tree down to 75% of its original size repositions the tree to fit entirely on one page. Click OK to return to the Print dialog box, and then click OK to send the tree to the printer. See Chapter 4 for a description of other page setup options.

Tree Summary Reports

The overall performance of the current tree is summarized in the five Summary Reports dialog tabs. To access the reports, click Summary Reports at the bottom of the Navigator window or select Tree Summary Reports from the Tree menu. The Tree Summary Reports present information on the currently selected tree, i.e., the tree displayed in the top panel of the Navigator. To view summary reports for another tree in the nested sequence, change the tree topology displayed in the top panel by selecting the tree of interest (click the square box above the number of nodes on the line graph). Alternatively, you can click on the Smaller or Prune button, or choose Select Tree from the Tree menu.

Chapter 11: CART Segmentation

[Figure: Navigator window showing the classification tree topology for SEGMENT, color-coded using target SEGMENT = 1, with Smaller, Next Larger, and Prune controls and the relative-error curve by number of nodes.]
Select to load another one (assuming that it was saved using the Save Grove button in the Grove section). Check the Save results to a file checkbox and specify the output data set name. Choose the tree you want to apply by pressing the Select button in the Subtree section (by default, CART offers the optimal tree). Set the target, weight, and ID variables when applicable, and press OK.

3. The output data set will contain new variables added by CART, including node assignment, class assignment, and predicted probabilities for each case.

New Analysis

To build another tree using the same data set, select Construct Model from the Model menu or click the Model Setup toolbar icon. CART retains the prior model settings in the Model Setup dialogs. To use another data set, select Data File from the File > Open menu. The newly selected file will replace the file currently open, and all dialog box settings will return to default values.

Saving the Command Log

Although we have used the mouse to make menu selections and to set up and run our models, underneath it all CART is actually generating and executing commands. While you do not ever have to learn how to use these commands, they serve one crucial function for everyone: the commands corresponding to a session are your audit trail and permanent record of your actions. If you find that you must reproduce a model or analysis, the command log will ensure that this is possible.
Searching for Hot Spots (available in CART 6.0 Pro EX)

In many modeling situations an analyst is looking for the regions of the modeling space richest in the event of interest. These regions are usually called hot spots. For example, in fraud detection problems we could be interested in identifying a set of rules that lead to a high ratio of fraud, so as to flag records that are almost guaranteed to be fraudulent.

Because target classes usually overlap, making it impossible to have a clear separation of one target group from the other, a search for hot spots usually results in reduced overall accuracy in the class of interest. In other words, while it might be possible to identify areas of the data rich in the event of interest, chances are that a substantial amount of data will be left outside the covered areas, data that cannot be easily separated from the remaining class.

One of the advantages of CART is that it gives clear sets of rules describing each terminal node. Therefore, searching for hot spots usually boils down to searching for the nodes richest in the given class across multiple trees. The hot spot machinery described below can be applied to a single tree, but it is most beneficial in processing CART battery models: collections of trees obtained by a systematic change in model settings. While any CART battery can be used, the most suitable for the task is battery PRIOR. This battery varies the prior probabilities used in tree construction
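The node-scanning idea described above can be sketched in a few lines: pool the terminal nodes of many trees and rank them by richness in the focus class. This is an illustration, not CART's hot-spot machinery; the tree names, node records, and minimum-size cutoff are invented.

```python
# Sketch of hot-spot search as described in the text: scan terminal
# nodes across a collection of trees and rank them by the fraction of
# cases belonging to the focus class.

def hot_spots(trees, min_cases=10):
    """trees: {tree_name: [(node_id, focus_cases, total_cases), ...]}
    Returns (tree, node, richness) triples, richest first."""
    pool = [(tree, node, focus / total)
            for tree, nodes in trees.items()
            for node, focus, total in nodes
            if total >= min_cases]          # ignore tiny, unreliable nodes
    return sorted(pool, key=lambda t: t[2], reverse=True)

ranked = hot_spots({
    "tree_1": [(1, 45, 50), (2, 5, 40)],
    "tree_2": [(3, 30, 30), (4, 2, 5)],    # node 4 is too small to count
})
```

The richest qualifying node (tree_2, node 3, 100% focus class) surfaces first, exactly the kind of rule set one would flag as a hot spot.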
The COMBINE command begins a combined tree (committee of experts) run. All options for COMBINE are set with a previous instance of the MOPTIONS command. The command syntax is:

COMBINE

Examples:

USE "SEATBELT.CSV"
MODEL BMW
MOPTIONS CYCLES=10, EXPLORE=YES, DETAILS=NONE, RTABLES=NO, TRIES=3, ARC=NO, SETASIDE=PROP, 0.100000
COMBINE

Appendix III: Command Reference

DATA

Purpose: The DATA command designates a block of statements to be interpreted as BASIC statements rather than as CART commands. The block is terminated with DATA END.

Example:

DATA
  let mvq1 = mv < 17
  let mvq2 = mv > 17 and mv < 21.2
  let mvq3 = mv > 21.2 and mv < 25
  let mvq4 = mv > 25
  let mvd = mv > 21.2
DATA END

DATAINFO

Purpose: The DATAINFO command generates descriptive statistics for numeric and character variables. Its simplest form is:

DATAINFO

The full command syntax is:

DATAINFO <varlist> / CHARACTER, NUMERIC, EXTREMES=<m>, TABLES

Examples:

To indicate particular variables:

DATAINFO GENDER$ WAGES LOGWAGES

To generate statistics only for numeric variables, and for each such variable to list the 15 extreme values:

DATAINFO / NUMERIC, EXTREMES=15

To produce full frequency tabulations, use the TABLES option:

DATAINFO POLPARTY$ / TABLES

To speed up the computation of statistics and avoid
[Figure: variable importance profiles across battery models, showing importance scores for predictors such as WORD_FREQ_MEETING, WORD_FREQ_RECEIVE, CAPITAL_RUN_LENGTH_AVERAGE, WORD_FREQ_85, WORD_FREQ_LABS, WORD_FREQ_ADDRESS, and others, with Sort, Save Grove, Min, Quantile 0.25, Median, Quantile 0.75, Max, Box Plot, Mean, and Grid controls.]

The following buttons control which graph is being displayed:

Min: smallest importance value for the variable across all models.
Quartile 0.25: first quartile importance value across all models.
Median: median (second quartile) importance value across all models.
Quartile 0.75: third quartile importance value across all models.
Max: maximum importance value across all models.

An additional group of buttons allows:

Box Plot: shows a box plot of importance scores for each variable.
Mean: shows a mean importance profile.
Grid: adds a grid to the display.

The sort order of variables can be changed using the Sort selection box.

We now proceed with the description of all available
[Figure: Model Setup Constraints tab, with Save Grove, CART, Combine, Score, Cancel, and Continue buttons.]

The Constraints tab has two main sections. In the left pane we can specify groups of variables using the check boxes in the columns labeled 1, 2, or 3; the column labeled Ind is used for ungrouped (individual) variables. The second main section, titled Disallow Split Region, has a set of sliders used to specify constraints for each of the three groups or for individual variables. The sliders come in pairs, one on the left and one on the right. The left slider controls the Above Depth value, while the right slider controls the Below Depth value. As the sliders are positioned, either a green or red color coding will appear, indicating at what depth a variable is allowed or disallowed as a splitter.

In the following screen, a group 1 constraint has been set on the Above Depth. Here the slider and color coding indicate that the group 1 variables are disallowed (red) above a depth of 6 but permitted (green) at any depth greater than or equal to 6.

[Figure: Disallow Split Region sliders, Split Disallowed Above Depth and Split Disallowed At Or Below Depth, for groups 1, 2, 3, and Ind.]

A more complex example would be setting both the above and below constraints on a group of variables. In the next screen we use the left slider
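The disallow-split-region logic described above reduces to a simple depth test, sketched below. This is an illustration, not CART's implementation; depth numbering starting at 1 for the root is an assumption for this example.

```python
# Sketch of the Disallow Split Region rule: a group's variables may be
# barred as splitters above one depth and/or at-or-below another depth.

def split_allowed(depth, above=None, at_or_below=None):
    """above: variables disallowed at depths shallower than this value.
    at_or_below: variables disallowed at this depth and deeper."""
    if above is not None and depth < above:
        return False
    if at_or_below is not None and depth >= at_or_below:
        return False
    return True

# Group 1 disallowed above depth 6: barred at depths 1-5, allowed from 6 on.
allowed = [split_allowed(d, above=6) for d in range(1, 9)]
```

Combining both sliders carves out a band of depths in which the group may split, matching the green/red color coding in the dialog.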
a repetition factor: each sub-sampling size is repeated N times with a different random seed each time, e.g.:

BATTERY SUBSAMPLE / VALUES=1000, 2000, 5000, 10000, 20000, 0
BATTERY SUBSAMPLE / VALUES=1000, 2000, REPEAT=20

In the above example, note that 0 indicates sub-sampling should not be used.

BOPTIONS

Purpose: The BOPTIONS command allows several advanced parameters to be set. The command syntax is:

BOPTIONS SERULE=<x>, COMPLEXITY=<x>, COMPETITORS=<n>, CPRINT=<n>, SPLITS=<n|AUTO>, SURROGATES=<n>, PRINT=<n2>, OPTIONS, NCLASSES=<n>, CVLEARN=<n>, NOTEST, ECHO, TREELIST=<n>, PAGEBREAK=<page_break_string>, NODEBREAK=<ALL|EVEN|ODD|NONE|<n>>, IMPORTANCE=<x>, COPIOUS|BRIEF, SCALED, QUICKPRUNE=<YES|NO>, DIAGREPORT=<YES|NO>, HLC=<n1>,<n2>, PLC=<YES|NO>, CVS=<YES|NO>, PROGRESS=<SHORT|LONG|NONE>, MISSING=<YES|NO|DISCRETE|CONTINUOUS|LIST varlist>, MREPORT=<YES|NO>, VARDEF=<N|1>

in which <x> is a fractional or whole number and <n> is a whole number.

SERULE: The number of standard errors to be used in the optimal tree selection rule. The default is 0.0.
COMPLEXITY: Parameter limiting tree growth by penalizing complex trees. The default
adaptive resampling and combining (ARCing) mode by pressing the Combine button in the Model Setup dialog. The Combine tab is now available: the command center for setting up the various bagging and ARCing controls. The Testing and Best Tree tabs are not available because they are used only in single-tree modeling.

CART's Combine dialog allows you to choose from two methods for combining CART trees into a single predictive model. In both bootstrap aggregation (bagging) and adaptive resampling and combining (ARCing), a set of trees is generated by resampling with replacement from the original training data. The trees are then combined, either by averaging their outputs (for regression) or by using an unweighted plurality voting scheme (for classification).

Bagging versus ARCing

The key difference between bagging and ARCing is the way each new resample is drawn. In bagging, each new resample is drawn in an identical way (independent samples), while in ARCing the way a new sample is drawn for the next tree depends on the performance of the prior trees.

Chapter 6: Ensemble Models and Committees of Experts

Bootstrap Resampling

Bootstrap resampling was originally developed to help analysts determine how much their results might have changed if another random sample had been used instead, or how different the results might be when a model is applied to new data. In CART, the bootstrap is applied in a novel way: a separate analysis is conducted
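The bagging procedure described above (resample with replacement, grow one model per resample, combine by plurality vote or averaging) can be sketched as follows. This is a conceptual illustration, not CART's implementation; the "predictions" and class labels are invented stand-ins for actual trees.

```python
# Conceptual sketch of bagging: draw bootstrap samples with replacement,
# then combine one prediction per model by plurality vote (classification).

import random
from collections import Counter

def bootstrap_sample(data, rng):
    # Same size as the original data; duplicates are expected.
    return [rng.choice(data) for _ in data]

def bagged_vote(predictions_per_model):
    """Unweighted plurality vote over one prediction per model."""
    return Counter(predictions_per_model).most_common(1)[0][0]

rng = random.Random(17)
data = list(range(10))
resample = bootstrap_sample(data, rng)       # training set for one tree
vote = bagged_vote(["BAD", "GOOD", "BAD"])   # three hypothetical trees vote
```

For regression, the vote step would instead average the models' numeric outputs, as the text notes.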
algorithm. In other decision tree techniques, testing is conducted only optionally and after the fact, and tree selection is based entirely on training-data computations. CART accommodates many different types of real-world modeling problems by providing a unique combination of automated solutions.

Cross Validation and Repeated Cross Validation

Cross validation, one of CART's self-testing methods, allows modelers to work with relatively small data sets or to maximize sample sizes for training. We mention it here because implementing cross validation for trees is extraordinarily challenging and easy to get technically wrong. With CART you get cross validation as implemented by the people who invented the technology and introduced the concept into machine learning. In CART 6.0 we allow you to rerun many replications of cross validation using different random number seeds automatically, so that you can review the stability of results across the replications and extract summaries from an averaging of the results.

Surrogate splitters intelligently handle missing values

CART handles missing values in the database by substituting surrogate splitters: back-up rules that closely mimic the action of the primary splitting rules. The surrogate splitter contains information that typically is similar to what would be found in the primary splitter. You can think of the surrogate splitter as an imputation that is customized to the node in the tree in which it is
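The repeated cross-validation idea above (partition the cases into folds, then rerun with different seeds to check stability) can be sketched generically. This is not CART's internal fold assignment; the fold logic and function name below are illustrative.

```python
# Sketch of cross-validation partitioning with repeated runs under
# different random seeds, as described above.

import random

def cv_folds(n_cases, k, seed):
    """Split case indices into k disjoint test folds, shuffled by seed."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]   # every case lands in exactly one fold

folds = cv_folds(100, 10, seed=1)
# Rerunning with seed=2, seed=3, ... yields different partitions; comparing
# the resulting error estimates shows the stability the text describes.
```

Each of the 100 cases appears in exactly one test fold, so every case is used for testing once and for training k-1 times.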
and taking into account how good a splitter it is. You should think of the variable importance ranking as a summary of a variable's contribution to the overall tree when all nodes are examined. The formulas for the variable importance calculations are detailed in the CART monograph.

[Figure: Tree Summary Reports window, Variable Importance tab, ranking N_INQUIRIES first, followed by NUMCARDS (33.78), OCCUP_BLANK (31.69), AGE (29.05), GENDER (28.43), INCOME (21.60), CREDIT_LIMIT (18.63), MARITAL$ (10.51), TIME_EMPLOYED (7.54), HH_SIZE (5.98), and EDUCATION$ (1.23), with Consider Only Primary Splitters, Show zero, and Discount Surrogates controls.]

Variables earn credit toward their importance in a CART tree in two ways: as primary splitters that actually split a node, and as surrogate splitters (back-up splitters to be used when the primary splitter is missing). To see how the importance scores change if variables are considered only as primary splitters, click the Consider Only Primary Splitters check box; CART automatically recalculates the scores.

Note: Comparing the standard CART variable importance rankings with the Consider Only Primary Splitters rankings can be very informative. Variables that appear to be important but rarely split nodes are probably highly correlated with the primary splitters and contain very similar information.
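The two-way credit scheme just described (full credit as a primary splitter, optionally discounted credit as a surrogate) can be sketched as a simple accumulation over nodes. This is an illustration of the idea, not the monograph's exact formula; the node records and improvement values are invented.

```python
# Sketch of the variable-importance idea: a variable earns credit at
# every node where it is the primary splitter, plus (optionally
# discounted) credit wherever it serves as a surrogate.

def importance(nodes, surrogate_weight=1.0):
    """nodes: list of {"primary": (var, improvement),
                       "surrogates": [(var, improvement), ...]}"""
    scores = {}
    for node in nodes:
        var, imp = node["primary"]
        scores[var] = scores.get(var, 0.0) + imp
        for svar, simp in node["surrogates"]:
            scores[svar] = scores.get(svar, 0.0) + surrogate_weight * simp
    return scores

scores = importance(
    [{"primary": ("AGE", 0.5), "surrogates": [("INCOME", 0.4)]},
     {"primary": ("INCOME", 0.3), "surrogates": []}],
    surrogate_weight=0.5)   # the Discount Surrogates weight from the dialog
```

With the surrogate weight set to 0, only primary splits count, mirroring the Consider Only Primary Splitters check box.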
are to be used for testing or validation.

- Use a binary 0/1 numeric variable to define simple learn/test partitions. We like to code such variables with 0 indicating train and 1 indicating test.
- If you prefer, you can use a text variable with the value "TEST" for selected records. The other records can be marked as "TRAIN" or "LEARN". You can use lower case if you prefer.

This option gives you complete control over the train/test partition because you can dictate which records are assigned to which partition during the data preparation process. For a three-way partition of the data, create a variable with values for train, test, and valid, and select that variable on the Testing tab after clicking on the Variable separates test method option. In scripts you can use a command like:

ERROR SEPVAR=TEST_FLAG$

Note: Consider creating several separation variables to explore the sensitivity of the model-building process to random data partition variation.

Command-line users implement these strategies using one of the following commands:

ERROR EXPLORATORY
ERROR CROSS=<n>
ERROR PROP=<p>
ERROR FILE="<file name>"
ERROR SEPVAR=<variable>

The Select Cases tab

The Model Setup Select Cases tab allows you to specify up to ten selection criteria for building a tree based on a subset of cases. A selection criterion can be specified
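Creating such a separation variable during data preparation can be sketched as below. This is an illustration outside CART; the 25% test share, seed, and function name are arbitrary choices for the example.

```python
# Sketch: building a 0/1 separation variable (0 = train, 1 = test)
# during data preparation, reproducible via a fixed random seed.

import random

def make_separation(n_records, test_share=0.25, seed=7):
    rng = random.Random(seed)
    return [1 if rng.random() < test_share else 0 for _ in range(n_records)]

flags = make_separation(1000)
same = make_separation(1000)   # same seed reproduces the identical partition
```

Generating several such columns with different seeds supports the sensitivity check suggested in the note above: refit the model under each partition and compare results.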
be in effect. The command syntax is:

PENALTY <var1>=<pen1>, <var2>=<pen2>, ...

in which the improvement evaluated for <var1> is multiplied by (1 - <pen1>).

Two additional types of improvement penalties may be specified: the MISSING and HLC options may be given after the slash. The command syntax is:

PENALTY <var>=<pen> / MISSING=<xm1>,<xm2>, HLC=<xh1>,<xh2>

Missing Value Improvement Penalty

To penalize variables that have a large proportion of missing values in the partition (node) being split, the MISSING option is used. This option allows the significance of the primary splitters and all competitors to be weighted by a simple function of the percentage of cases present (nonmissing) in the node partition. The expression for weighting the significance is:

improvement = improvement * factor

in which factor = 1.0 if there are no missing values; otherwise factor is a function of xm1, xm2, and fract, where fract is the proportion of observations in the partition (node) that have nonmissing values for the splitter in question. If xm1 and xm2 are
case is selected for the next training set is not constant and is not equal for all cases in the original learn data set; instead, the probability of selection increases with the frequency with which a case has been misclassified in previous trees. Cases that are difficult to classify receive an increasing probability of selection, while cases that are classified correctly receive declining weights from resample to resample.

Note, however, that as the probability of selection becomes more skewed in favor of the difficult-to-classify cases, the probability of selection for the typical case quickly declines toward zero, and the process of sample building takes an increasingly long time.

In general, we recommend bagging rather than ARCing, because bagging is more robust to dependent-variable errors and is also much faster. Nevertheless, ARCing is capable of remarkably reducing predictive error.

Note also that both bagging and ARCing generate a committee of experts rather than a single optimal tree. Because a single tree is not displayed, no simple way exists to explain the underlying rationale driving the averaged predictions. In this sense, combined trees are somewhat akin to the "black box" of a neural net, although the trees are built much more quickly.

(Chapter 6: Ensemble Models and Committees of Experts, p. 164)

One final caution on combining via bagging or ARCing: the increase in accuracy is sometimes accomplished for the class in which
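The flavor of the reweighting can be sketched with Breiman's arc-x4 rule, in which the selection probability of case i is proportional to 1 + m(i)^4, where m(i) counts how often case i has been misclassified so far. This is our illustration of the general idea; CART's internal resampling details may differ:

```python
def arcing_probabilities(miss_counts):
    """arc-x4-style selection probabilities: hard-to-classify cases
    (large m(i)) are picked far more often in the next resample."""
    weights = [1 + m ** 4 for m in miss_counts]
    total = sum(weights)
    return [w / total for w in weights]

# Case 2 has been misclassified 3 times and now dominates the next resample,
# illustrating how quickly the typical case's probability declines:
probs = arcing_probabilities([0, 0, 3])
```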
causes CART to stop its tree-growing process before reaching the largest possible tree size. When CART reaches a tree size with a complexity parameter equal to or smaller than your pre-specified value, it stops the tree-splitting process on that branch. If the complexity parameter is judiciously selected, you can save computer time and fit larger problems into your available workspace. See the main reference manual for guidance on selecting a suitable complexity parameter.

As described in detail in the main reference manual, check the Complexity Parameter column in the TREE SEQUENCE section of the CART output to get an initial feel for which complexity values are applicable to your problem.

The Scale Regression check box specifies that, for a regression problem, the complexity parameter should be scaled up by the learn sample size.

Command-line users will use the following command syntax to specify this complexity parameter:

BOPTIONS COMPLEXITY <value> [SCALED]

Dataset Size Warning Limit for Cross-Validation

By default, 3,000 is the maximum number of cases allowed in the learning sample before cross-validation is disallowed and a test sample is required. To use cross-validation on a file containing more than 3,000 records, increase the value in this box to at least the number of records in your data file.

(Chapter 4: Classification Trees, p. 113)

Command-line users will use the following command syntax:

BOPTIONS CVLEARN
click on the Regression Tree radio button. Check all the other variables as predictors.

[Model Setup dialog, Model tab: Tree Type set to Regression, Target Variable MV, Sort set to File Order, Number of Predictors: 13]

In the Model Setup Advanced tab, set Parent Node Minimum Cases to 40 and Terminal Node Minimum Cases to 20. This will ensure that the terminal nodes do not become too small.

(Chapter 5: Regression Trees, p. 149)

[Advanced tab, Minimum Node Sizes: Parent node minimum cases 40; Terminal node minimum cases 20]

Click Start.

Tree Navigator

At the end of the model-building process, a Navigator window for the regression tree will appear.

[Navigator 1: Regression tree topology for MV, color-coded using the target mean (High to Low), with the Relative Error curve plotted against Number of Nodes]

Data Displays and Re
cmd. When operating in batch mode, CART does not send any output to your screen other than startup and error messages, unless ECHO ON is in effect or the -e command-line flag has been specified. It is therefore a good idea to specify an output file with the OUTPUT command inside your command file; otherwise you may never see the results at all. CART will terminate either when it has encountered a QUIT command or when there are no more commands to be executed.

(Chapter 13: Working with Command Language, p. 314)

Startup File

When console CART is started in interactive mode, it looks for a file named SALFORD.CMD, first in your current working directory and then in the directory pointed to by the SALFORD environment variable. If found, CART will execute its contents before displaying the command prompt. This allows one to specify default settings for all Salford Systems applications. SALFORD.CMD is not automatically executed in batch mode.

Command-Line Startup Options

CART has a number of other command-line options, which can be shown by invoking CART with the -h flag. Command-line syntax is:

cart [options] [commandfile] [options]

Options are:

-e              Echo results to console
-q              Quiet; suppress all output, including errors
-o <output_file> Direct text results to a file
-u <use_file>   Attach to a dataset
-d <Path>       Identify DBMS/COPY dll path
-w <Path>       Identify Stat Transfer dll path (not required under UNIX)
-t <Path>       I
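For instance, a minimal batch command file for a scoring session might look like the following. The file names are hypothetical; each command (OUTPUT, USE, GROVE, SAVE, SCORE, QUIT) is documented elsewhere in this manual:

```
OUTPUT scorelog.txt
USE newdata.csv
GROVE mymodel.grv
SAVE predictions.csv
SCORE
QUIT
```

A file like this could be submitted in batch mode with, e.g., `cart score.cmd`; because OUTPUT is given inside the file, the results survive even though batch mode writes nothing to the screen.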
combine levels until it meets this limit. We show how CART can conveniently do this for you later in the manual.

For command-line users, categorical variables are defined using the CATEGORY command. See the following command-line syntax:

CATEGORY <cat_var1>, <cat_var2>, ..., <cat_varN>
CATEGORY LOW, RACE, SMOKE, UI

Case Weights

In addition to selecting target and predictor variables, the Model tab allows you to specify a case-weighting variable. Case weights, which are stored in a variable on the dataset, typically vary from observation to observation. An observation's case weight can, in some sense, be thought of as a repetition factor. A missing, negative, or zero case weight causes the observation to be deleted, just as if the target variable were missing. Case weights may take on fractional values (e.g., 1.5, 27.75, 0.529, 13.001) or whole numbers (e.g., 1, 2, 10, 100).

To select a variable as the case weight, simply put a checkmark against that variable in the Weight column.

Case weights do not affect linear combinations in CART SE, but are otherwise used throughout CART. CART Pro and ProEX include a new linear combination facility that recognizes case weights.

If you are using a test sample contained in a separate dataset, the case weight variable must exist and have the same name in that dataset as in your main learn-sample dataset.

For command-line users, the variable containing observation case weights is
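The two rules above, weights as repetition factors and deletion of observations with missing, zero, or negative weights, can be sketched as follows (our illustration of the semantics, not CART's code):

```python
def weighted_counts(rows):
    """Tally the target using case weights as repetition factors.
    Rows with a missing (None), zero, or negative weight are dropped,
    just as if the target itself were missing."""
    counts = {}
    for target, weight in rows:
        if weight is None or weight <= 0:
            continue  # observation deleted
        counts[target] = counts.get(target, 0.0) + weight
    return counts

data = [("yes", 1.5), ("no", 2.0), ("yes", 0.0), ("no", None), ("yes", 0.5)]
print(weighted_counts(data))  # -> {'yes': 2.0, 'no': 2.0}
```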
correct classification, it will try to ensure that the errors it does make are less costly. If credit risks were classified as low, moderate, or high, for example, it would be more costly to misclassify a high-risk borrower as low-risk than as moderate-risk. Traditional data mining tools and many decision trees cannot distinguish between these types of misclassification errors in their model-construction processes.

Alternative splitting criteria make progress when other criteria fail. CART includes seven single-variable splitting criteria: Gini, Symgini, Twoing, Ordered Twoing, Entropy, and Class Probability for classification trees, and Least Squares and Least Absolute Deviation for regression trees. In addition, we offer one multi-variable, or oblique, splitting criterion: the Linear Combinations, or LC, method. CART 6 includes some important extensions to the classic LC method.

The default Gini method frequently performs best, but by no means is Gini the only method to consider in your analysis. In some circumstances the Twoing method will generate more intuitive trees. To help you find the best method, CART will optionally test all its methods automatically and summarize the results in tables and charts.

(Introducing CART 6.0, p. 14)

What's New in CART 6.0

Our goal in developing CART 6.0 has been to help the data analyst be more productive and to make the whole process of developing high-performance models faster, easier, and more intuitive. W
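To make the default criterion concrete, here are the textbook Gini impurity and split-improvement formulas. This is a simplified sketch; CART's actual computation also folds in priors and misclassification costs, which are omitted here:

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_improvement(parent, left, right):
    """Decrease in impurity from splitting `parent` into `left` and `right`,
    weighting each child by the fraction of cases it receives."""
    n = sum(parent)
    p_left, p_right = sum(left) / n, sum(right) / n
    return gini(parent) - p_left * gini(left) - p_right * gini(right)

# A perfect split of a 50/50 node removes all impurity (0.5 -> 0):
print(gini_improvement([50, 50], [50, 0], [0, 50]))  # -> 0.5
```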
criterion for classification trees and least squares for regression trees (the Method tab)
- Unit (equal) misclassification costs (the Costs tab)
- Equal priors: all classes treated as if they were of equal size (the Priors tab)
- No penalties (the Penalty tab)
- Parent node requirements set to 10 and child node requirements set to 1 (the Advanced tab)
- Allowed sample size set to the currently open data set size (the Advanced tab)

Many other options are available to the advanced user, and we invite you to explore them at your leisure in the chapters that follow. The good news about CART is that you can get started by focusing only on the essentials, deferring advanced topics.

The remainder of this section discusses the model setup process. Subsequent sections cover additional options.

The Model tab

The Model Setup Model tab is the central location for model control, where you identify the target, or dependent, variable. This is the one and only task that CART requires of you: CART will not know which column of your data to try to analyze without your guidance. Once you provide that information, CART is technically able to do everything else for you.

In practice, you will probably also want to select the candidate predictor (independent) variables, because data sets typically contain bookkeeping columns, such as ID variables, that are not suitable for prediction. In some cases you may also have a weight variable. Where possible, CART will auto
current tree is completed. If RETAIN is specified, the seed will keep its latest value after the tree is built.

Examples:

SEED 1, 99, 7773
SEED RETAIN
SEED 35, 784, 29954 NORETAIN

(Appendix III: Command Reference, p. 397)

SELECT

Purpose: The SELECT command selects cases from a file for analysis. You may specify up to ten simple conditions; the data preprocessor then selects those cases in the data file that meet all the conditions (that is, the conditions are linked by logical AND). SELECT commands are processed after any BASIC statements, allowing selections to be made based on variables created on the fly.

Specify each condition as a variable name, a logical relation, and a constant value. The variable name must come first. The six possible logical relations are =, <>, <, >, <=, and >=. You must enclose character values in quotes. Character comparisons are case-sensitive. The command syntax is:

SELECT <var> <relation> '<string>'
or
SELECT <var> <relation> <#>

Examples:

SELECT GROUP = 2
SELECT GROUP <> 2
SELECT AGE > 21, AGE < 65
SELECT SEX$ = "Female", AGE > 25

(Appendix III: Command Reference, p. 398)

STRATA

Purpose: The STRATA command defines a stratification variable for DATAINFO statistics. Its syntax is:

STRATA <variable>

Examples:

STRATA GENDER$
DATAINFO INCOME, AGE, POLPARTY$

(Appendix III: Command Reference, p. 399)
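The SELECT semantics described above, AND-linked conditions with case-sensitive string comparison, can be sketched as follows (our illustration, with hypothetical field names):

```python
import operator

# The six SELECT relations mapped to Python comparisons.
OPS = {"=": operator.eq, "<>": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def select(cases, conditions):
    """Keep a case only if it meets ALL conditions (logical AND);
    string comparison is case-sensitive, as in CART's SELECT."""
    return [case for case in cases
            if all(OPS[rel](case[var], value) for var, rel, value in conditions)]

people = [{"SEX": "Female", "AGE": 30},
          {"SEX": "female", "AGE": 40},   # fails: comparison is case-sensitive
          {"SEX": "Female", "AGE": 20}]   # fails: AGE not > 25

# SELECT SEX$ = "Female", AGE > 25 -> only the first record qualifies
kept = select(people, [("SEX", "=", "Female"), ("AGE", ">", 25)])
```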
due to the way licensing works on those platforms, the information is written to a system folder to which you must have write access.

Starting and Running CART

Start CART by clicking Start and selecting the CART program group icon. CART takes advantage of Windows' preemptive multi-tasking ability, so you can start a CART run and then switch to other Windows tasks. Be aware that performance in CART and your other active applications will decrease as you open additional applications; if CART is running slowly, you may want to close other applications.

Licensing CART

After completing the install process, click your Start button, navigate into the program group, and click the program icon to start the application. You will be presented with a screen similar to the following:

[Evaluation Copy dialog: "To continue in evaluation mode, click the Continue button below. Remaining evaluation days: 3"]

Select Continue to start your instant 3-day evaluation. This will get the software up and running while you work through the unlock process. Once launched, select License from the Help menu and choose the Registration tab. Click on the Copy button to copy the System ID number.

(Installing and Starting CART, p. 27)

[License Information dialog: Registration and License tabs, an Instant Evaluation button, the remaining evaluation days, and the version and System ID strings]
example, appear to be relatively pure, with six of the nine nodes containing only one class. You can also see how populated each terminal node is, and whether particular classes are concentrated in a few nodes or scattered across many nodes, an indication of the number of splits required to partition each of the classes.

Variable Importance

The next Summary Report displays the variable importance rankings, as illustrated below. The scores reflect the contribution each variable makes in classifying or predicting the target variable, with the contribution stemming from both the variable's role as a primary splitter and its role as a surrogate to any of the primary splitters. In our example, ANYRAQT, the variable used to split the root node, is ranked as most important. PERSTRN received a zero score, indicating that this variable played no role in the analysis, either as a primary splitter or as a surrogate.

To see how the scores change if only each variable's role as a primary splitter is considered, click the Consider Only Primary Splitters check box; CART automatically recalculates the scores. You can also discount surrogates by their association values if you check the Discount Surrogates check box and then select the By Association radio button. Alternatively, you can discount the improvement measure attributed to each variable in its role as a surrogate by clicking on the

(Chapter 11: CART Segmentation, p. 249)

Geometric radio button and ent
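The tally described above can be sketched as follows. This is our illustration of the general idea (improvements accumulated over primary and surrogate roles, then rescaled so the top variable reads 100), not CART's exact algorithm; the node data below are hypothetical:

```python
def variable_importance(splits):
    """Sum each variable's improvement as a primary splitter plus the
    improvement attributed to it as a surrogate, then scale to 100."""
    raw = {}
    for node in splits:
        raw[node["primary"]] = raw.get(node["primary"], 0.0) + node["improvement"]
        for var, imp in node.get("surrogates", []):
            raw[var] = raw.get(var, 0.0) + imp
    top = max(raw.values())
    return {var: 100.0 * score / top for var, score in raw.items()}

nodes = [
    {"primary": "ANYRAQT", "improvement": 0.269, "surrogates": [("ANYPOOL", 0.10)]},
    {"primary": "FIT", "improvement": 0.120, "surrogates": [("ANYRAQT", 0.05)]},
]
scores = variable_importance(nodes)
```

A variable that never appears as a primary splitter or surrogate, like PERSTRN in the example, simply accumulates nothing and scores zero.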
files. Small BASIC programs are defined near the beginning of your analysis session, after you have opened your dataset but before you estimate or apply the model, and usually before defining the list of predictor variables. BASIC is powerful enough that in many cases users do not need to resort to a stand-alone data manipulation program. See Appendix IV for more on the BASIC programming language.

Command-Line Mode

Choosing Command Prompt from the File menu allows you to enter commands directly from the keyboard. Switching to command-line mode also enables you to access the integrated BASIC programming language. See Appendix IV for a detailed description of the BASIC programming language.

This menu item is available only when the CART Output window is active. The command-line prompt is marked by the ">" symbol and a vertical blinking cursor at the lower end of the right panel of the CART Output window.

Creating and Submitting Batch Files

The CART Notepad can be used to create and edit command files. From the Notepad you can submit part or all of an open file. To submit a section of the command file, move the cursor to the first line of the selected section and select Submit Current Line to End from the File menu. To submit the entire command file, select Submit Window from the File menu or click the corresponding button in the toolbar. After you submit the file, the analysis proceeds as if you had clicked on the Start button in the GUI. T
SUBSAMPLE <N>

Model Missing Values

CART 6.0 introduces a new set of missing-value analysis tools for automatic exploration of the optimal handling of your incomplete (missing) data. On request, CART will automatically add missing value indicator variables (MVIs) to your list of predictors and conduct a variety of analyses using them. For a variable named X1, the MVI will be named X1_MIS and coded as 1 for every row with a missing value for X1, and 0 otherwise. If you activate this control, the MVIs will be created automatically

(Chapter 4: Classification Trees, p. 115)

as temporary variables and will be used in the CART tree if they have sufficient predictive power. MVIs allow formal testing of the core predictive value of knowing that a field is missing.

Create new variable for MVI

There are three control options for missing value indicators: the user can request MVIs for all variables, or limit them to either continuous-only or categorical-only predictor variables.

Command-line users will use the following command syntax. To turn on MVIs for all variables:

BOPTIONS MISSING YES

To limit MVIs to categorical (discrete) variables only:

BOPTIONS MISSING DISCRETE

To limit MVIs to continuous variables only:

BOPTIONS MISSING CONTINUOUS

Create missing categorical level

For categorical variables, an MVI can be accommodated in two ways: by adding a separate MVI
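The MVI coding rule described above (X1 gets a companion X1_MIS flag) can be sketched as follows. Missing values are represented here as None, which is an assumption of the illustration; CART builds these columns internally:

```python
def add_mvis(records, variables):
    """For each listed variable X, add an X_MIS column coded 1 when X is
    missing and 0 otherwise, mirroring the MVI naming described above."""
    for row in records:
        for var in variables:
            row[var + "_MIS"] = 1 if row.get(var) is None else 0
    return records

data = [{"X1": 3.2, "X2": None},
        {"X1": None, "X2": "blue"}]
add_mvis(data, ["X1", "X2"])
```

The resulting flags can then be tested like any other predictor, which is exactly what makes "missingness" itself formally testable.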
has the following variable importance list:

Variable              Score
CHAR_FREQ_DOLLAR
CHAR_FREQ_EXCLAM      76.10
WORD_FREQ_FREE        72.74
(unreadable)          68.68
WORD_FREQ_OUR         54.12
WORD_FREQ_650         27.80
WORD_FREQ_85          13.19
WORD_FREQ_INTERNET    10.54
WORD_FREQ_PM           5.23
WORD_FREQ_PEOPLE       1.46

(Chapter 10: CART Batteries, p. 214)

(CART 6.0, CART 6.0 Pro, Pro EX)

Battery LOVO

Battery LOVO (Leave One Variable Out) generates a sequence of runs in which each run omits one of the variables on the predictor list, one at a time. We illustrate this battery on the SPAMBASE.CSV data using the full list of predictors (see the LOVO.CMD command file for details).

[Battery Summary window: Battery LOVO models compared by optimal terminal nodes and test relative error, with test relative errors ranging from 0.1849 to 0.1943 across the runs]

Assuming K predictors on the initial keep list, the battery produces K models, each having K-1 predictors.
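The mechanics of the battery are simple to sketch. This is our illustration of the LOVO idea only; the real battery also manages test partitions, groves, and reporting. The scorer below is a toy stand-in for fitting a CART model:

```python
def battery_lovo(predictors, fit_and_score):
    """Run K models, each omitting one predictor from the keep list,
    and record the score obtained without that variable."""
    results = {}
    for leave_out in predictors:
        kept = [p for p in predictors if p != leave_out]
        results[leave_out] = fit_and_score(kept)
    return results

# Toy scorer: pretend error jumps when an 'important' predictor is dropped.
important = {"CHAR_FREQ_DOLLAR"}
score = lambda kept: 0.30 if important - set(kept) else 0.18
errors = battery_lovo(["CHAR_FREQ_DOLLAR", "WORD_FREQ_FREE"], score)
```

A variable whose omission causes a large jump in test error is, by this measure, one the model genuinely depends on.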
if it is unknown.

Error 10069: Unable to open the grove file. Check the GROVE command.

Error 10070: Unable to identify model (e.g., tree, treenet, mars) in grove. You have a corrupted grove file, the wrong version of the file, or the wrong model selection criteria.

Error 10072: Error creating grove file. Check for enough disk space and/or permissions.

Error 10074: Not enough memory available to estimate model. CART does not have enough resources to complete your run. Check the run settings; certain extreme situations, such as high-level categorical predictors and targets, can render your run impossible to conduct. Contact Salford Systems if this message appears under normal settings.

(Appendix II: Errors and Warnings, p. 324)

Error 10075: Invalid MODEL command options; was expecting... Check the MODEL command.

Error 1008: Target had no variation after LAD transformation. This usually happens when the LAD method is activated on binary targets; switch to LSD or classification.

Error 11004: TOO MANY CATEGORICAL OR LINEAR COMBINATION SPLITS. TRY USING THE COMMANDS BOPTION SPLITS / LINEAR LINSPLITS. The number of categorical or linear combination splits has exceeded the initially reserved amounts. Increase the limits using the corresponding commands.

Error 11005: TREE IS GROWING TOO DEEP. TRY USING COMMAND LIMIT DEPTH. The tree depth exceeds the default maximum value. Use the LIMIT DEPTH command to increase it.

Error 1
in support for comma-separated ASCII files.

2. The SAVE command specifies the case-by-case prediction output file. The specified file may contain case-by-case predictions, model variable values, path information, and class probabilities.

(Chapter 13: Working with Command Language, p. 310)

3. The GROVE command specifies the binary grove file to be used for scoring.

Commands 4 through 5 control various engine settings:

4. The SCORE command signals the CART engine to start the scoring process.

5. The QUIT command terminates the program.

UNIX Console Usage Notes

The nature of UNIX-like operating environments affects the operation of CART in non-trivial ways. This section discusses the operation of CART in the UNIX operating environment, and the operation of console (non-GUI) CART in general. Both GUI and console CART are offered for Windows; only the console is offered for UNIX or Linux.

Case Sensitivity

CART's command interpreter is case-insensitive; in fact, commands are generally converted internally to upper-case letters, including file names. The only exception to this rule is that text placed between quotation marks is not converted, remaining in its original case. UNIX file systems, on the other hand, are case-sensitive, meaning that upper- and lower-case letters are treated as completely different characters. Thus one could not refer to a file named this.csv as THIS.CSV, or vice versa. It is t
in the Report window can be changed by selecting Fonts from the Edit menu. Use a mono-spaced font, such as Courier, to maintain the alignment of tabular output.

We have already viewed the majority of the text output through the Node Navigator graphical displays. Sections not summarized in the Navigator and Tree Summary Reports include the Variable Statistics and some of the more detailed information in the Tree Sequence and Terminal Node Information tables. For a line-by-line description of these sections, as well as the rest of the text output, consult the main reference manual.

Displaying and Exporting Tree Rules

Non-terminal and terminal node reports (with the exception of the root node) contain a Rules dialog that displays the rules for the selected node and/or sub-tree. For example, to view the rules for Terminal Node 1, click on the node and select the Rules tab from the Terminal Node Report dialog. The rules for this node, displayed below, indicate that cases meeting the four specified criteria are classified as Class 1.

(Chapter 11: CART Segmentation, p. 259)

Terminal node 1, rules from root node:

if (SMALLBUS == 1 &&
    ANYPOOL == ... &&
    ANYRAQT == ... &&
    FIT <= 3.454)
{
    terminalNode = 1;
    class = 1;
}

To also view learn or test within-node probabilities, click Learn or Test. Click Pooled to view the combined learn and test probabilities. The rules are formatted as C-compatible code.
in which the variables enter the tree. For example, we may want the tree to use only characteristics of the consumer at the top of the tree, and to have only the bottom splits based on product characteristics.

(Chapter 12: Features and Options, p. 276)

Such trees are very easy to read for their strategy advice: first they segment a database into different types of consumer, and then they reveal the product configurations or offers that best elicit a response from each consumer segment. CART now offers a powerful mechanism for generating structured trees by allowing you to specify where a variable, or a group of variables, is allowed to appear in the tree.

The easiest way to structure a tree is to group your predictor variables into lists and then to dictate the levels of the tree where each list is permitted to operate. Thus, in our marketing example, we could specify that the consumer-attributes list can operate anywhere in the top four levels of the tree but nowhere else, and that the product-attributes list can operate from level five on down into the tree but nowhere else. Structuring a tree in this way will provide the marketer with exactly the type of tree described above.

How did we know to limit the consumer attributes to the first four levels? We know only by experimenting, by running analyses using different ways to structure the tree. If we are working with two groups of variables and want to divide the tree into top and bottom re
is easy for most readers to understand, but constraints for structuring trees can be used in many applications. In scientific applications, constraints may be imposed to reflect the natural or causal order in which certain factors may be triggered in a real-world process. Constraints may also be used to induce a tree to use broad, general predictors at the top, and then to complete the analysis using more specific and detailed descriptors at the bottom.

(Chapter 12: Features and Options, p. 277)

CART allows you to structure your trees in a number of ways. You can specify where a variable can appear in the tree based on its location in the tree, or based on the size of the sample arriving at a node. You can also specify as many different regions in the tree as you wish. For example, you could specify a different list for every level of the tree, and one predictor may appear on many different lists.

Structured Trees Using Predictor Groups

For the following example we once again use the GYMTUTOR.CSV data file from the Chapter 11 segmentation example. Using the Model Setup Model tab, specify the target variable as SEGMENT by placing a checkmark inside the checkbox located in the Target column. Select the remaining variables and place a checkmark in the Predictors column. Also place checkmarks in the Categorical column against those predictors that should be treated as categorical; for our example, specify ANYRAQT, TANNING, ANYPOOL, SMALLBUS
[Battery Summary window: Battery RULES models compared by test relative error]

Splitting Rule    Test Rel Error
Class Prob        0.5393
Entropy           0.6001
Sym Gini          0.6180
Twoing            0.6375
Gini              0.6442
Ord Twoing        0.6442

It appears that the Class Probability splitting rule resulted in the smallest relative error, while Gini and Ordered Twoing resulted in the largest relative error.

(CART 6.0, CART 6.0 Pro, Pro EX)

Battery SAMPLE

The CART process iteratively partitions the train data until no more sensible splits can be found. When the train data size is limited, it is possible to run out of support for subsequent splits before the useful signal is fully extracted; CART is sensitive to the overall size of the train data. Battery SAMPLE was designed to investigate the amount of accuracy loss incurred in the course of progressive reduction of the train data size (observation-wise). A total of five runs are produced: the full train data and four progressively smaller fractions of it, down to 1/8 of the train data. We illustrate this battery using the SPAMBASE.CSV data, with 20% randomly allocated to the test partition (see the SAMPLE.CMD command file for details).

(Chapter 10: CART Batteries, p. 222)

[Battery Summary window: 6 models]
likely contain age information belonging to another customer. We now repeat this process in every column of the data. Breiman uses a variant in which each column of the original data is replaced with a bootstrap resample of the column; either method can be used in Salford's software. The following displays a small example:

[Spreadsheet view of CELL10nUNS.csv: the original records (flagged "Original") alongside scrambled copies, with columns such as SEGMENT, AGE, PLAN, DATA, MUSIC, HOME, NFAMMEM, EMP, AVGMIN, BILLNG, BILLNO, and STATE; within the Copy portion, each column's values have been independently reshuffled across the rows]

Note that all we have done is to move information about in the Copy portion of the database. Other than moving data, we have not changed anything (discrete levels or values), so aggregates such as averages and totals will not have changed. Any one customer record is now a "Frankenstein" record, with every item of information having been obtained from a different customer. In the above example, Copy 17 has been given AGE 85 from customer 10, and the average bill (AVGBILL) from customer 3.

2. Now append the scrambled data set to the original data. We therefore now have the sam
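The scrambling step just described can be sketched as follows (a simplified illustration of the idea, not Salford's implementation; Breiman's bootstrap variant would resample each column with replacement instead of shuffling it):

```python
import random

def scramble_columns(rows, seed=0):
    """Independently reshuffle each column of a copy of the data across rows,
    so per-column aggregates are unchanged but every record becomes a
    'Frankenstein' mix of different customers."""
    rng = random.Random(seed)
    copy = [dict(r) for r in rows]
    for col in rows[0]:
        values = [r[col] for r in copy]
        rng.shuffle(values)
        for r, v in zip(copy, values):
            r[col] = v
    return copy

original = [{"AGE": 25, "AVGBILL": 100},
            {"AGE": 40, "AVGBILL": 250},
            {"AGE": 85, "AVGBILL": 33}]
scrambled = scramble_columns(original)
combined = original + scrambled  # step 2: append the copy to the original
```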
list.

A control allows the user to specify how many files to show in the MRU (most recently used) list displayed in the File menu. The maximum allowable is 20 files.

Chapter 2: Reading Data

This chapter covers typical situations you may encounter while accessing your data in CART.

(Chapter 2: Reading Data, p. 32)

General Comments

The following requirements must be met to read your data successfully in CART:

- Data must be organized into a flat file with rows for observations (cases) and columns for variables (features).
- The maximum number of cells (rows x columns) allowed in the analysis will be limited by your license.
- The maximum number of variables allowed in the analysis is initially set to 32,768. See the appendix for dealing with larger numbers of variables.
- CART is case-insensitive for variable names; all reports show variables in upper case.
- CART supports both character and numeric variable values.
- Variable names must not exceed 32 characters.
- Variable names must contain only letters, numbers, or underscores; spaces and characters such as &, #, etc. are NOT ALLOWED. If characters other than letters, numbers, or underscores are encountered, CART will attempt to remedy the problem by substituting underscores for the illegal characters. The only exception is that character variable names in ASCII files must end with a $ sign (see the next section).
- Variable names must start with a letter.

Be especially careful to follow the var
minimum node sample size for linear combinations, which can be changed from the default of three by clicking the up or down arrows, specifies the minimum number of cases required in a node for linear combination splits to be considered. Nodes smaller than the specified size will be split on single variables only. The default value is far too small for most practical applications; we recommend using values such as 20, 50, 100 or more.

Variable Deletion Significance Level

The variable deletion significance level, set by default at 0.20, governs the backwards deletion of variables in the linear combination stepwise algorithm. Using a larger setting will typically select linear combinations involving fewer variables; we often raise this threshold to 0.40 for this purpose.

Estimating the Number of Linear Splits

By default, CART automatically estimates the maximum number of linear combination splits in the maximal tree. The automatic estimate may be overridden to allocate more linear combination workspace. To do so, click on the "Number of

(Chapter 4: Classification Trees, p. 109)

nodes likely to be split by linear combinations in maximal tree" radio button and enter a positive value.

CART will terminate the model-building process prematurely if it finds that it needs more linear combination splits than were actually reserved.

Linear combination splits will be automatically turned off for all nodes that have any constant predi
node sub-sampling parameter was set to 5,000. In the root node, we would take our 100,000 records and extract a random sample of 5,000. The search for the best splitter would be conducted on the 5,000-record random extract. Once found, the splitter would be applied to the full analysis data set. Suppose this splitter divided the 100,000-record root node into 55,000 records on the left and 45,000 records on the right. We would then repeat the process of selecting 5,000 records at random in each of these child nodes to find their best splitters.

As you can see, the tree-generation process continues to work with the complete data set in all respects except for the split-search procedure. By electing to use node sub-sampling, we create a shortcut for split finding that can materially speed up the tree-growing process.

But is node sub-sampling a good idea? That will depend in part on how rare the target class of interest is. If the 100,000-record data set contains only 1,000 YES records and 99,000 NO records, then any form of sub-sampling is probably not helpful. In a more balanced data set, the cost of an abbreviated split search might be minimal, and it is even possible that the final tree will perform better. Since we cannot tell without trial and error, we recommend that you explore the impact of node sub-sampling if you are inclined to consider this approach.

Command-line users will use the following command syntax:

LIMIT LEARN <N>, TEST <N
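The two-step procedure described above (search for the splitter on a random extract, then apply it to every record in the node) can be sketched as follows. This is our illustration of the idea, with a toy scoring function; it is not CART's split-search code:

```python
import random

def best_split_with_subsample(node_records, candidate_splits, score,
                              limit=5000, seed=0):
    """When a node holds more than `limit` records, search for the best
    splitter on a random extract; then apply the winning splitter to ALL
    records in the node."""
    rng = random.Random(seed)
    search_set = (rng.sample(node_records, limit)
                  if len(node_records) > limit else node_records)
    best = max(candidate_splits, key=lambda split: score(split, search_set))
    left = [r for r in node_records if best(r)]
    right = [r for r in node_records if not best(r)]
    return best, left, right

# Toy data: records are (x, label); candidate splits are thresholds on x;
# the score is the fraction of the extract a split labels correctly.
records = [(i, i < 50) for i in range(100)]
splits = [lambda r, t=t: r[0] < t for t in (25, 50, 75)]
score = lambda split, recs: sum(split(r) == r[1] for r in recs) / len(recs)
best, left, right = best_split_with_subsample(records, splits, score, limit=10)
```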
optimal (smallest-cost) tree.

Detailed Node Reports

To see what else we can learn about our CART trees, return to the Navigator by closing the Summary Reports window or by selecting Navigator from the Window menu. Move the mouse pointer to the root (top) node in the tree topology panel and click to activate a non-terminal Node Report dialog, or right-click on the root node and select Node Report.

The Competitors and Surrogates tab

As illustrated below, the first of the three tabs in the non-terminal node report provides node-specific information for both the competitor and the surrogate splits for the selected node, in this case the root node.

(Chapter 11: CART Segmentation, p. 252)

[Node 1 Competitors and Surrogates report: the main splitter is ANYRAQT = 0 with improvement 0.269; the left panel lists competitor splits including FIT <= 3.454, ANYPOOL, OFFAER, and TANNING; the right panel lists surrogate splits with their association and improvement values, including ANYPOOL, NFAMMEM, NSUPPS, and SMALLBUS]

The splitting rule (Is ANYRAQT = 0?) is displayed in the top line, and the main splitter improvement, the metric CART uses to evaluate the quality of the split, in the following line. A table of the top five competitor splits, in decreasing order of importance, is displayed in the left panel. Each competitor is identified by a variable name, the value
percent for the learn sample and two to nine percent for the cross-validated samples, with Class 3 most frequently misclassified in both learn and test data.

250 Chapter 11: CART Segmentation

[Screenshot: Tree Summary Reports, Misclassification tab, listing for each class the number of cases, the number and percent misclassified, and the cost, for both the learning and test samples, with Sort by Class controls.]

Prediction Success
The final Summary Report displays the Prediction Success table (also known as the confusion matrix) for both learn and test (or cross-validated) samples. The Prediction Success table shows whether CART tends to concentrate its misclassifications in specific classes and, if so, where the misclassifications are occurring. The learn and test tables display the following:

Actual Class: class level.
Total Cases: total number of cases in the class.
Percent Correct: percent of cases for the class that were classified correctly.
Class 1 (N): number of Class 1 cases classified in each class, where N is the total number of cases predicted (correctly or incorrectly) as Class 1.
Class 2 (N): number of Class 2 cases classified in each class, where N is the total number of cases predicted as Class 2.
Class 3 (N): N
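The Prediction Success table is an ordinary confusion matrix; it can be sketched in a few lines of Python (toy data, not the sample used above):

```python
from collections import Counter

def prediction_success(actual, predicted, classes):
    """Build a Prediction Success (confusion) table: for each actual
    class, count how its cases were classified and compute the
    percent classified correctly."""
    table = {c: Counter() for c in classes}
    for a, p in zip(actual, predicted):
        table[a][p] += 1
    report = {}
    for c in classes:
        total = sum(table[c].values())
        correct = table[c][c]
        report[c] = {
            "total_cases": total,
            "pct_correct": 100.0 * correct / total if total else 0.0,
            "counts": dict(table[c]),   # how cases of class c were classified
        }
    return report

actual    = [1, 1, 2, 2, 3, 3, 3, 1]
predicted = [1, 2, 2, 2, 3, 1, 3, 1]
report = prediction_success(actual, predicted, classes=[1, 2, 3])
```

Reading the rows of `report` answers exactly the question posed above: whether misclassifications concentrate in specific classes.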
performance.

Detailed Node Reports
To see what else we can learn about our CART trees, return to the Navigator by closing the Summary Reports window or by selecting Navigator from the Window menu. Move the mouse pointer to the root (top) node in the tree topology panel and click to activate a non-terminal Node Report dialog, or right-click on the root node and select Node Report.

The Competitors and Surrogates tab
As illustrated below, the first of the three tabs in the non-terminal node report provides node-specific information for both the competitor and the surrogate splits for the selected node, in this case the root node.

69 CART BASICS

[Screenshot: Node 1 Competitors and Surrogates tab, showing the main splitter (Is N_INQUIRIES <= 1.5, improvement 0.104), competitor splits (including CREDIT_LIMIT 5546.000, OCCUP_BLANK, and NUMCARDS), and surrogate splits (NUMCARDS 6.500, AGE 29.500, MARITAL = "Single") with their association and improvement values.]

The splitting rule (Is N_INQUIRIES <= 1.5) is displayed in the top line, and the main splitter improvement is displayed in the following line on the left. Splitter improvement is the metric CART uses to evaluate the quality of all splits; it is computed differently for different splitting rules. A table of the top five competitor splits, in decreasing order of importance, is displayed in the le
press the Score button and save the results to a file called boston_scored.csv. Note that we score the same dataset as was used for learning; if we needed to score another dataset, we would use the Select button in the Data field to pick the new data file. Click OK to start the scoring process. After all 506 cases are dropped down the tree, a Score dialog opens, as shown below, and a Score Text Report appears in the CART Output window.

179 Chapter 7: Scoring and Translating

[Screenshot: Score results dialog, Response Statistics tab, showing for each terminal node the case count and percent, the predicted and actual mean response, and the RMS error for both the score data and the train data, together with an Overall Results Summary: 506 train cases and 506 score cases, mean predicted response 22.533, mean observed response 22.533, data file C:\Program Files\Salford Da...\Boston.csv.]

Response Statistics Tab
The Response Statistics tab provides distributional information by terminal node: predicted response and observed response. The grid in the top panel displays the following information for each terminal node: Node, Cases, Perc
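The RMS error figures in the Score report are straightforward to reproduce; a small Python sketch (toy values, not the BOSTON data):

```python
import math

def rms_error(observed, predicted):
    """Root-mean-squared prediction error, as reported in the
    Score results summary."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# In a regression tree, every case falling into a terminal node is
# predicted with that node's mean response:
observed  = [24.0, 21.6, 34.7, 33.4, 36.2]
node_mean = sum(observed) / len(observed)
predicted = [node_mean] * len(observed)
err = rms_error(observed, predicted)
```

When a node's cases are scored with the node's own learn-sample mean, as here, the node RMS error is simply the standard deviation of the observed responses in the node.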
probability splitting rule is used (the reverse of ONEOFF).

BATTERY PRIOR <target_class> (CART EX/Pro only)
Vary the priors for the specified class from 0.02 to 0.98 in steps of 0.02 (i.e., 49 models). If you wish to specify a particular set of values, use the START, END, and INCREMENT options, e.g.:
BATTERY PRIOR 3 START=.5 (will infer END and INCREMENT settings)
BATTERY PRIOR "Male" START=.45 END=.75 INCREMENT=.01

BATTERY RULES
Generate a model for each splitting rule: six for classification, two for regression. Note that for the TWOING model, POWER is set to 1.0 to help ensure it differs from the GINI model.

BATTERY SHAVING <n>, TOP|BOTTOM|ERROR, STEPS <n> (CART EX/Pro only)
Shave predictors from the model, cycling until the specified number of steps have been completed (STEPS) or until there are no predictors left. Can shave from the TOP (most important are shaved first) or BOTTOM. ERROR will build a full set of models before determining which single predictor can best be eliminated, based on model error (not importance), repeating for each predictor that is shaved. TOP and BOTTOM can shave N at a time. The defaults are to shave one predictor at a time from the bottom until the model degenerates to nothing. Note that ERROR will proceed until the model degenerates, i.e., the STEPS option has no effect with ERROR.

332 Appendix III: Command Reference

BATTERY TARGET, MP <yes|no>, MT <yes|no>
profiles results in a test ROC within .48 and .54. It would have been difficult to justify the legitimacy of a model having a ROC value within this region.

216 Chapter 10: CART Batteries

Battery MINCHILD (CART 6.0 Pro EX)
Battery MINCHILD is very similar to battery ATOM, described above. It varies the required terminal node size according to a user-supplied setting.

Battery MVI (CART 6.0 Standard, Pro, and Pro EX)
Battery MVI addresses missing value handling, which is important for the success of any data mining project. CART has a built-in default ability to handle missing values via the mechanism of surrogate splits: alternative rules automatically invoked whenever the main splitter is missing. Surrogate splits effectively redistribute the missing part of the data between the left and right sides of the tree, based on an alternative split that most resembles the main split locally. This is fundamentally different from treating a missing value as a separate category, which sends the entire missing subset to one side.

Alternatively, it is often important to find out whether the fact that one variable is missing can be predictive on its own. In CART this can be accomplished by creating missing value indicator variables (MVIs), binary variables set to one when the variable of interest is missing and zero otherwise, and subsequently using the MVIs as part of the analysis (see the Model Setup Advanced tab
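Creating MVIs is simple to do outside CART as well; a minimal Python sketch (record layout and field names are hypothetical):

```python
def add_mvis(records, variables):
    """Augment each record with missing value indicators: X_mis = 1
    when X is missing (represented here as None), 0 otherwise. The
    MVIs can then be used as ordinary predictors in the analysis."""
    for rec in records:
        for var in variables:
            rec[var + "_mis"] = 1 if rec.get(var) is None else 0
    return records

data = [
    {"income": 52000, "age": 34},
    {"income": None,  "age": 41},   # income missing -> income_mis = 1
]
add_mvis(data, ["income", "age"])
```

A tree that selects `income_mis` as a splitter is telling you that the very fact of missingness carries predictive information.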
refer to terminal nodes and zeros refer to depths not applicable for this observation.

When Predicted Probabilities is checked: PROB_<N>, the predicted probabilities (based on the learn data) for each target class.

✓ The predicted probabilities will be included only if the number of classes in the target does not exceed the limit set in the corresponding selection box.

176 Chapter 7: Scoring and Translating

✓ All target classes other than the original classes used in learning will be assumed to be missing.

Score GUI Output for Classification Trees
After you click on OK in the Score Data dialog, a progress dialog appears and, after all the cases are dropped down the tree, a Score dialog opens and a Text Report appears in the CART Output window. The content of both the GUI and the text output for a scoring run will vary depending on whether the target variable is continuous or categorical, and whether you are using new or training data. The Score results dialog using a categorical target variable and the GYMTUTOR.CSV training dataset is discussed below. See the subsequent section for a discussion of score output for a regression tree, and the CART Reference Manual for a description of Score text reports.

[Screenshot: Score results dialog for a classification tree, with Response Statistics, Gains, and Prediction Success tabs, showing per-node case counts, percent correct by class, and an Overall Results Summary for the train and score data.]
remarkable technology. Based on decades of machine learning and statistical research, CART provides reliable performance and accurate results. Its market-proven methodology is characterized by:

11 Introducing CART 6.0

A complete system of reliable data analysis
When the CART monograph was first published, it revolutionized the emerging field of decision trees. An entire methodology was introduced for the first time that included multiple tree-growing methods, tree pruning, methods to deal with unbalanced target classes, adaptation to the cost of learning and the cost of mistakes, self-testing strategies, and cross validation. For the scientifically minded, rigorous mathematical proofs were provided to show that the underlying algorithms were mathematically sound and could be relied upon to yield trustworthy results. The CART monograph, published in 1984, is now justly regarded as a landmark work and one of the most important mathematical events of the last 30 years. It is one of the most frequently cited works in machine learning and data mining.

An effective tree-growing methodology
CART introduced several new methods for growing trees, including the Gini and the innovative Twoing method, among others. These methods have proven effective in uncovering productive trees and generating insights into data. To cover a broad variety of problems, CART also includes special provisions for handling ordered categorical data and the growing of p
reports.

✓ Every variable specified in your KEEP list, or checked off as an allowed predictor on your Model Setup, is a competitor splitter. Normally we do not want or need to see how every one of them performed. The default setting displays the top five, but there is certainly no harm in setting this number to a much larger value. CART tests every allowed variable in its search for the best splitter; this means that CART always measures the splitting power of every predictor in every node. You only need to choose how much of this information you would like to be able to see in a navigator.

✓ Choosing a large number can increase the size of saved navigators/groves.

Command-line equivalent: BOPTIONS COMPETITORS <N>

Number of Trees to List in the Tree Sequence Summary
Each CART run prints a summary of the nested sequence of trees generated during growing and pruning. The number of trees listed in the tree sequence summary can be increased or decreased from the default setting of 10 by entering a new value in the text box.

✓ This option only affects CART's classic output.

Command-line equivalent: BOPTIONS TREELIST <N>

Cross-validation Details (Classic Text Report)
If you use the cross-validation testing method, you can request a text report for each of the maximal trees generated in each cross-validation run by clicking on the corresponding radio button for this option. For example, if testing is set to the default 10-f
reset forced splits, use the command with no options: FORCE

The Constraints tab (CART 6.0 Pro and Pro EX)
The Model Setup Constraints tab is new in CART 6.0. This setup tab specifies how predictor variables are constrained for use as primary splitters and/or as surrogates, at various depths of the tree and according to the size of the learn sample in the node. By default, all predictors are allowed to be used as primary splitters (i.e., competitors) and as surrogates at all depths and node sizes. The Constraints tab is used to specify at which depths, and in which partitions (by size), a predictor or group of predictors is not permitted to be used, either as a splitter, a surrogate, or both.

Constraints and Structured Trees
In marketing applications we often think about predictors in terms of their role in influencing a consumer's choice process. For example, we distinguish between characteristics of the consumer, over which the marketer has no control, and characteristics of the product being offered, over which the marketer may have some degree of control. Normally, CART will be unaware of the different strategic roles different variables may play within the business context, and a CART tree designed to predict response will mix variables of different roles as needed to generate an accurate predictive model. However, it will often be useful to be able to STRUCTURE a CART tree so that there is a systematic order
sample.csv
GROVE classcomb.grv
OUTPUT classcomb.dat
REM ****************************************************
REM OPTIONS SETTINGS
REM ****************************************************
LOPTIONS MEANS=YES, NOPRINT=NO, PREDICTIONS=YES, BOTH, TIMING=YES, PLOTS=YES, GAINS=NO, ROC=NO
FORMAT=7 UNDERFLOW
LIMIT MINCHILD=1, ATOM=2, NODES=5000, DEPTH=50, LEARN=100000, TEST=100000, SUBSAMPLE=100000
REM ****************************************************
REM MODEL SETUP
REM ****************************************************
MODEL Y2
CATEGORY Y2
PRIORS SPECIFY -1=.5, 1=.5
MISCLASSIFY COST=2 CLASSIFY 1 AS -1
MISCLASSIFY COST=3 CLASSIFY -1 AS 1
KEEP Z1-Z29, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
METHOD GINI POWER=0
WEIGHT W
PENALTY MISSING=1,1 HLC=1,1
REM ****************************************************
REM COMBINE SETTINGS
REM ****************************************************
MOPTIONS CYCLES=10, EXPLORE=YES, DETAILS=NONE, RTABLES=NO, TRIES=3, ARC=NO, SETASIDE=PROP, 0.2
REM ****************************************************
REM BUILD MODEL
REM ****************************************************
COMBINE
REM ****************************************************
REM QUIT CART
REM ****************************************************

309 Chapter 13: Working with Command Language

This command file is almost identical to the CLASS.CMD command file
selecting cases that have been previously misclassified; the higher the power, the greater the bias against selecting cases that were previously classified correctly. Breiman has found that a power setting of four works well, while settings of one or two give results virtually identical to bagging. Setting the power greater than four could make it difficult to locate a sample large enough to fill the training sample if only a small fraction of the data is misclassified. Also, as Dietterich (1998) has reported, if the dependent variable is recorded in error,

165 Chapter 6: Ensemble Models and Committees of Experts

then using ARCing will progressively focus new trees on the bad data, yielding poor predictive models.

Combine Controls
After selecting bagging or ARCing, the next step is to select the number of trees you want to grow. Bagging typically shows good results with about 100 trees, but ARCing may require up to 250 trees. The number of trees is initially set at 10 and can be changed by entering a new value in Number of Trees to Combine. We recommend you first experiment with a modest number to see how the procedure is working; if it looks promising, launch a CART run with a full complement of 100 or more trees. As noted above, when using ARCing, as the probability of selection becomes more skewed in favor of difficult-to-classify cases, the probability of selecting the typical case quickly declines to zero, and the time for sample buil
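The power-weighted resampling described above can be sketched as follows (an illustration in the style of Breiman's arc-x4 weighting; the function name is ours, not a CART command):

```python
def arcing_probabilities(misclass_counts, power=4):
    """Resampling probabilities for ARCing: p_i is proportional to
    1 + m_i**power, where m_i counts how many times case i has been
    misclassified by the trees grown so far. Power 4 is Breiman's
    recommended setting; power 1 or 2 behaves much like bagging."""
    weights = [1 + m ** power for m in misclass_counts]
    total = sum(weights)
    return [w / total for w in weights]

# Three cases never misclassified, one misclassified three times so far:
probs = arcing_probabilities([0, 0, 0, 3])
```

Even with one hard case out of four, that case absorbs over 96% of the selection probability, which illustrates why the probability of selecting a typical case quickly declines toward zero.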
series of variable shaving models result. CART and TreeNet engines, analysis data version 2a. END

To see the currently defined memo, issue the command: MEMO ECHO
To reset the memo: MEMO RESET
Normally, memos are reset after a model is built. To force the memo to persist across models until it is explicitly RESET, use the command: MEMO PERSIST=YES (PERSIST=NO returns to the default.)
To cause the memo to be displayed in the classic text output at the start of each model, use the INCLUDE option: MEMO INCLUDE=YES (INCLUDE=NO returns to the default.)
To quickly see any memo that may be embedded in a particular grove, use the ECHO option on the GROVE command: GROVE filename.grv ECHO
As an alternative to the MEMO command, you can specify a single-line quoted memo on the GROVE command itself: GROVE filename.grv MEMO "A one line quoted memo"

374 Appendix III: Command Reference

MEMORY
Purpose: The MEMORY command provides information about memory usage and memory requirements for the current model. Use the BOPTIONS LIMIT and ADJUST commands to refine your problem to fit it into available memory. The command syntax is: MEMORY

375 Appendix III: Command Reference

METHOD
Purpose: The METHOD command specifies the splitting rule used in tree construction. The CLASSIFICATION tree command syntax is:
METHOD GINI | SYMGINI | TWOING | ORDERED | PROB | ENTROPY, POWER=<x>
GINI is the default.
simply issue the LCLIST command alone. Multiple LCLIST commands can be issued; in this way, multiple linear combinations may be developed at each node. The linear combination with the highest improvement will be compared to the best univariate splitter to determine the primary splitter in the node. Options are:

N <n>: Specifies the minimum number of records required in a node for linear combination splits from this LCLIST to be considered. Smaller nodes will not consider this LCLIST. This is essentially an LCLIST-specific atom. Default: 3.
W <x>: Similar to N <n>, but based on the sum of case weights. If this option is issued, a node must have a sum of case weights equal to or exceeding <x> for this LCLIST to be considered. This is essentially an LCLIST-specific weighted atom.
SIZE <n>: The maximum number of predictors in a linear combination. Must be greater than 1. The default is 6.
STORED <n>: Defines how many candidate linear combinations formed from the LCLIST are maintained in memory during the search. A high value allows for a more comprehensive search involving higher-ordered linear combinations, but at a potentially significant increase in compute time. Must be greater than 1. The default is 5.
OPTIM <n>: Must be 0 or greater. The default is 0.

367 Appendix III: Command Reference

PENALTY <x>: Must be in the range 0.5 to 1.0, inclusive. Defaults to 0.9.
POSITIVE <yes|no>: Specifies
smallest child node below which no nodes can be constructed. Naturally, if you set the value too high, you will prevent the construction of any useful tree.

✓ Increasing the allowable parent and child node sizes enables you both to control tree growth and, potentially, to fit larger problems into limited workspace (RAM).

112 Chapter 4: Classification Trees

✓ You will certainly want to override the default settings when dealing with large datasets.
✓ The parent node limit (ATOM) must be at least twice the terminal node limit (MINCHILD), and otherwise will be adjusted by CART to comply with the parent limit setting.
✓ We recommend that ATOM be set to at least three times MINCHILD, to allow CART to consider a reasonable number of alternative splitters near the bottom of the tree. If ATOM is only twice MINCHILD, then a node that is just barely large enough to be split can be split only into two equal-sized children.

Command-line users will use the following command syntax to specify node limitations: LIMIT ATOM <parent limit>, MINCHILD <child limit>

Minimum complexity
This is a truly advanced setting, with no good short explanation for what it means, but you can quickly learn how to use it to limit the growth of potentially large trees. The default setting of zero allows the tree-growing process to proceed until the bitter end. Setting complexity to a value greater than zero places a penalty on larger trees, and
the potentially time-consuming complete tabulation of all variables, use the CONTINUOUS option to specify that only continuous statistics should be produced: DATAINFO PROFIT LOSS VOLUME / CONTINUOUS

Variable groups may be used in the CATEGORY command similarly to variable names, e.g.:
GROUP GRADES = FROSHRECS, SOPHRECS, JUNIORS, SENIORS, PSAT, SAT, MCAT
DATAINFO GRADES

Caution: if you have ordered variables with many distinct values included in the DATAINFO, the TABLES option can generate huge output. The default is: DATAINFO EXTREMES=5

347 Appendix III: Command Reference

DESCRIPTIVE
Purpose: The DESCRIPTIVE command specifies which statistics are computed and printed during the initial pass through the input data. The statistics will not appear in the output unless the LOPTIONS MEANS=YES command is issued. By default, the mean, N, SD, and sum of each variable will appear when LOPTIONS MEANS=YES is used. To indicate that only the N, MIN, and MAX should appear in descriptive statistics tables, use the commands:
DESCRIPTIVE N, MIN, MAX
LOPTIONS MEANS=YES
The command syntax is:
DESCRIPTIVE MEAN=<YES|NO>, N=<YES|NO>, SD=<YES|NO>, SUM=<YES|NO>, MIN=<YES|NO>, MAX=<YES|NO>, MISSING=<YES|NO>, ALL
The ALL option will turn on all statistics, and MISSING will produce the fraction of observations with missing data.

348 Appendix III: Command Reference

DISCRETE
the dataset. By default, this variable is set to the target variable in regression runs; however, it could be changed to any of the continuous auxiliary variables that were specified in the Model tab of the Model Setup dialog.

Second, specify the Default Sort Order. This setting controls how the terminal nodes of the currently selected tree are ordered in the table and the graph above. Currently, sorting either by Profit Learn (node sum of profit values in the Learn data) or by Average Profit Learn (Profit Learn divided by node size) is available.

Third, choose one of the four possible measures to be displayed on the vertical axis of the graph by pressing one of the following group of buttons: Profit, Ave Profit, Cum Profit, Cum Ave Profit.

Profit: accumulated profit within the node.
Ave Profit: Profit divided by the node case count.

153 Chapter 5: Regression Trees

Cum Profit: same as Profit, but accumulated over all nodes in the sorted sequence up until the current node.
Cum Ave Profit: Cum Profit divided by the total number of cases in all nodes in the sorted sequence up until the current node.

✓ All four measures, as well as node case counts, are reported in the table.
✓ In the presence of an explicit Test sample, the user can also choose among Learn, Test, and Pooled measures using the corresponding buttons.
✓ The Zoom and Chart Type controls change the visual appearance of the graph.

Terminal Nodes
The Terminal No
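The four profit measures defined above can be sketched in a few lines of Python (the node figures are hypothetical, and the nodes are assumed to be pre-sorted, e.g. by Profit Learn):

```python
def profit_measures(nodes):
    """Per-node profit measures for a sorted terminal-node sequence:
    Profit, Ave Profit, Cum Profit, and Cum Ave Profit."""
    rows, cum_profit, cum_cases = [], 0.0, 0
    for node in nodes:
        cum_profit += node["profit"]
        cum_cases += node["cases"]
        rows.append({
            "profit": node["profit"],                    # sum within the node
            "ave_profit": node["profit"] / node["cases"],
            "cum_profit": cum_profit,                    # running sum over nodes
            "cum_ave_profit": cum_profit / cum_cases,    # running sum / running cases
        })
    return rows

rows = profit_measures([
    {"profit": 900.0, "cases": 30},
    {"profit": 400.0, "cases": 20},
])
```

Note that Cum Ave Profit divides by the cumulative case count, not by the number of nodes, which is why it declines smoothly as less profitable nodes are appended.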
scroll down the variable list until SEGMENT is visible. Put a checkmark inside the checkbox located in the Target column. If no predictor variables are selected (via checkmarks in the Predictor column), CART by default includes every variable except the target in the analysis, the desired result in this example. The Model tab now appears as follows:

[Screenshot: Model Setup, Model tab, with SEGMENT checked as the Target, the Classification tree type selected, and 12 predictors available, including NFAMMEM, TANNING, ANYPOOL, SMALLBUS, FIT, HOME, PERSTRN, and CLASSES.]

Categorical Predictors
Put checkmarks in the Categorical column against those predictors that should be treated as categorical. For our example, specify ANYRAQT, TANNING, ANYPOOL, SMALLBUS, and HOME as categorical predictor variables.

Growing the Tree
The Classification radio button is already selected in the Tree Type group box by default. Likewise, the CART button is depressed (or ON) by de
then uncheck POSTBIN. Your screen should now look something like this:

47 CART BASICS

[Screenshot: Model Setup, Model tab, with TARGET checked as the Target and predictors including INCOME, MARITAL, N_INQUIRIES, NUMCARDS, OCCUP_BLANK, OWNRENTS, and TIME_EMPLOYED; POSTBIN is unchecked.]

Categorical Predictors
In this data set, TARGET is a categorical variable and should be checked as such. The other categorical variables, such as MARITAL, have been automatically checked as categorical predictors because they are character (text) variables. Numeric variables are not flagged automatically: if a numeric variable should be treated as categorical, you must check it in the Categorical column yourself.

Growing the Tree
To prepare for model building, we only needed to follow these three simple steps:
• Open a file for analysis.
• Select a target variable.
• Indicate which numeric variables, if any, should be treated as categorical.
In this case, we also decided not to use one variable in the analysis. We are now ready to grow our tree. To begin the CART analysis, click the Start button. While the model is being built, a progress report will keep yo
this tree will allow you to quickly spot where linear combination splits have been found. Here we double-click on the root node of the navigator to bring up this display. Observe that the node is split on a linear combination of the two variables AGE and SMOKE, with the splitter displayed near the top of the window. The improvement score of this LC is .0433, which is about 20% better than the best single-variable splitter, PTD, which has an improvement score of .0355.

If you do not restrict the LCs with LCLISTs, and instead run a legacy CART with linear combinations, you won't find any LCs reported. This is not a surprise; we have seen it many times. Limiting LCs to a few choice variables is likely to yield better results than allowing CART to search over all available variables, a reflection of the fact that the LC search procedure cannot guarantee a global maximum.

111 Chapter 4: Classification Trees

The Advanced Tab
The Model Setup Advanced tab allows you to specify additional tree-building control options and settings. You should not hesitate to learn the meaning and use of these controls, as they can be key to getting the best results.

[Screenshot: Model Setup, Advanced tab, showing Minimum Node Sizes (parent node minimum cases: 10; terminal node minimum cases) and Tree Size controls (maximum number of nodes: AUTO; maximum depth: AUTO), with unweighted and weighted settings.]
to be grown.
LEARN <n>: Maximum number of cases to allow into the learning set. By default, no limit is in effect; AUTO removes the current limit.
TEST <n>: Maximum number of cases to allow into the test set. By default, no limit is in effect; AUTO removes the current limit.
MINCHILD <n>: Sets the minimum size for a child node. The default is 1.
WMINCHILD <x>: Sets the minimum weighted size for a child node. It is only used if you explicitly set a nonzero value.

LIMIT LEARN=20000, TEST=5000
LIMIT ATOM=15, NODES=150
LIMIT DEPTH=18, MINCHILD=10, WMINCHILD=30

370 Appendix III: Command Reference

✓ On some platforms CART can automatically determine the number of records in the USE and ERROR FILE datasets, but on other platforms it cannot and will assume 1,000 records. These assumptions may lead to poor choices of memory parameters if your datasets have considerably more records than 1,000. In this case, use the DATASET and ERRORSET options to inform CART of the correct number of records in your datasets. Some examples are:
LIMIT DATASET=33000
LIMIT DATASET=100000, ERRORSET=75000

371 Appendix III: Command Reference

LINEAR
Purpose: The LINEAR command allows CART to search for linear combinations of non-categorical predictor variables to split nodes. The command syntax is:
LINEAR N=<n1>, DELETE=<x>, LINSPLITS=<n2|AUTO>, EXHAUSTIVE
in which <x> is a fractional or whole number and <n1> and <n2> are whole numbers. N specifies
to facilitate applying new data to CART models in other applications. The rule set can be exported as a text file, cut and pasted into another application, and/or sent to the printer. To get the set of rules for the entire tree:
1. Select Rules from the View menu, or right-click on the root node and select Rules from the local menu.
2. Select Export from the File menu (a command only available when the Rules dialog is the active window).
3. In the Save As dialog, specify a directory and file name (the file extension is by default .txt).

✓ This rules display is only intended as a rough guide and does not contain information about surrogate splits. You should use the Translate feature (available by pressing the Translate button in the Navigator window) to get the complete representation of the CART model, including surrogates and procedures for handling missing values. See Chapter 7 for details.

Scoring Data
You may score your data by applying any tree reported in the Navigator window. To score your data, proceed as follows:
1. Press Score in the Navigator window containing the model you want to apply.
2. In the Score Data window:
Accept the current data filename, or change it using the Select button in the Data section.
Accept the current Grove file (embedded into the current Navigator), or use Select to load another one (assuming that it was saved using the Save Grove button in the Grove section).
260
to read and understand (an approximate model), or as complex computer code (a more accurate model). This section focuses on the first form of the rules; the second form is discussed in the sections on scoring and translation.

77 CART BASICS

Every node displayed in a navigator can be described by the rules that lead to it from the root. To view the rules, just right-click on the node and select Rules. If you select the root node for rule extraction, you actually get the rules for every terminal node in the tree. Below we show this for our example:

Terminal Node 1:
if N_INQUIRIES < 1.5 && OCCUP_BLANK < 0.5 && CREDIT_LIMIT < 5546 && AGE < 40.5
then class = 1

You have a few further options on this window. The rules can be displayed as standard C or SQL programming code. Probabilities can be based on Learn data, Test data (if available), or on the combination of learn and test data. Rules can be displayed for specific nodes only: those you have tagged on the navigator via the right mouse click menu.

✓ This rules display is intended only as a rough guide. The rules produced are only an approximate version of the CART model, because they do not contain information about surrogate splits. You should use the Translate feature (available b
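For illustration, the rule for Terminal Node 1 shown above can be rendered as executable logic; Python is used here for readability, while the GUI itself exports C or SQL:

```python
def terminal_node_1(case):
    """Rendering of the exported rule for Terminal Node 1: returns True
    when a case satisfies every condition on the path from the root.
    Note this is the approximate rule form; it ignores surrogates, so
    cases with missing values are not handled as the full model would."""
    return (case["N_INQUIRIES"] < 1.5
            and case["OCCUP_BLANK"] < 0.5
            and case["CREDIT_LIMIT"] < 5546
            and case["AGE"] < 40.5)

hit  = {"N_INQUIRIES": 1, "OCCUP_BLANK": 0, "CREDIT_LIMIT": 5000, "AGE": 35}
miss = {"N_INQUIRIES": 3, "OCCUP_BLANK": 0, "CREDIT_LIMIT": 5000, "AGE": 35}
```

A case for which the function returns True lands in Terminal Node 1 and is assigned class 1.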
variable you have created for this purpose. This is most useful when there are repeated observations on a behavioral unit, such as a person or a firm, and it is important to keep all records pertaining to such a unit together: either all records are in the training sample, or all are in the test sample. User-constructed CV bins are also useful in the analysis of time series or geographically correlated data.

Repeated Cross-Validation Bins (e.g., BATTERY CVR)
CART produces its cross-validation bins via a randomized partition of the data into the requested number of partitions (or folds). To explore how results might differ as the random partitioning differs, you can request repeated CART runs in which the CV bins are constructed using different random starting points.

19 Introducing CART 6.0

Additional Fraction for Auto-Validation
Traditionally, CART trees are grown on learn (or training) data and evaluated on test data. Because the test data are used to help select the optimal-sized tree, some practitioners prefer to conduct a further model check by evaluating performance on a never-looked-at holdout portion of the data. We refer to these holdout data as the validation data.

Improved Probability Trees
In CART 5, probability tree performance was summarized using a version of the Gini splitting criterion. In CART 6, we use the same relative error metric that is used for all other CART splitting rules.

Additional Model Eval
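The "keep all records of a unit together" idea can be sketched as follows (a hypothetical Python illustration, not CART's internal mechanism; the `CVBIN` field name and hashing scheme are our assumptions):

```python
import hashlib

def assign_cv_bins(records, unit_key, n_bins=10):
    """User-constructed CV bins: hash each behavioral unit (person,
    firm, ...) to a fold, so that every record belonging to the same
    unit lands in the same bin. Bin ids run 1..n_bins; the resulting
    variable could then be supplied to CART as the CV-bin variable."""
    for rec in records:
        digest = hashlib.md5(str(rec[unit_key]).encode()).hexdigest()
        rec["CVBIN"] = int(digest, 16) % n_bins + 1
    return records

data = [
    {"person_id": 7, "y": 1},
    {"person_id": 7, "y": 0},   # same person -> must share a bin
    {"person_id": 8, "y": 1},
]
assign_cv_bins(data, "person_id")
```

Hashing the unit id (rather than assigning bins record by record) is what guarantees that repeated observations on one person or firm never straddle the learn/test boundary of a fold.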
we may place little value on the overall performance of a model. So long as a model is effective in identifying a high concentration of the class of interest, it may not matter to us whether the model exhibits good overall accuracy. We call the process of uncovering especially good segments "hot spot" detection, and the process is fully automated in CART EX.

18 Introducing CART 6.0

Additional Summary Reports
ROC curves (train/test): ROC curves have become a preferred way of summarizing the performance of a model, and these are now available for all CART models and ensembles. An estimate of the area under the ROC curve is also produced when cross validation is used to assess model performance.
Learn/Test/Pooled Results: Results can be viewed for either the learn (training) data, the test data, or the aggregate created by pooling the learn and test samples.
Gains Chart: Show Perfect Model: In a gains curve, the performance of a perfect model depends on the balance between the response and nonresponse sample sizes. The perfect-model reference line helps to put the observed gains curve into proper perspective.
Activity Window: The activity window offers a quick way to access summary statistics, summary graphs, the model setup dialog, a view of the data, and scoring.
User-Controlled Cross-Validation Bins: If you prefer to create your own partition of the data for the purpose of cross validation, you can specify that CART is to use a
were found for that particular node, a case that has a missing value for the primary splitter will be moved to the left or right child node according to a default rule discussed later. Because the number of surrogates you request can affect the details of the tree grown, we have placed this control on the Best Tree tab. Usually the impact of this setting on a tree will be small, and it will only affect trees grown on data with missing values.

Command-line users will use the following command syntax to set the standard error rule:

BOPTIONS SERULE = <value>

To discount surrogates, use:

BOPTIONS IMPORTANCE = <weight>

where <weight> must be between 0 and 1. To limit the number of surrogates to be kept, use:

BOPTIONS SURROGATES = <N>

The Method Tab

The Model Setup Method tab allows you to specify the splitting rule used to construct the classification or regression tree and to turn on the linear combinations option.

Splitting Rules

A splitting rule is a method and strategy for growing a tree. A good splitting rule is one that yields accurate trees. Since we often do not know which rule is best for a specific problem, it is good practice to experiment. For classification trees the default rule is the Gini. This rule was introduced in the CART monograph and was selected as the default because it generally works quite well. We have to agree with the original CART authors: working w
whether all coefficients must be constrained to be positive. The default is NO.

DELETE=<x> Governs the backwards deletion of variables in the stepwise linear combination search algorithm. The default is 0.20.

DOF=<x> When comparing a linear combination against univariate competitors, the LC improvement is DOF-adjusted:

adj_imp = improvement * ((N - X*(NC - 1)/2) / N)^2

in which N = number of records used in the LC search algorithm (usually the node size), NC = number of nonzero coefficients in the LC, improvement = the unadjusted improvement (displayed in model results, reports, etc.), and X = the parameter specified on the DOF option. For agreement with previous versions of CART that used the LINEAR command, use DOF=1. To disable the adjustment, use DOF=0. The default is 1.0.

EXH=<yes|no> Tells CART to repeat the stepwise search algorithm using each predictor in the LCLIST as the focal variable. This increases compute time in proportion to the number of predictors in the LCLIST. It can in some cases yield better split points than the default approach. Default: NO.

SS=<yes|no> The default (SS=yes) allows the linear combination search algorithm to proceed even if some of the predictors in the LCLIST have a high proportion of missing values or are constant. Disabling this feature (SS=no) causes CART to use a more stringent, listwise-like criterion for determining which records in a node are used in forming linear
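The DOF adjustment can be sketched as a small function. This is a hedged reading of the adjustment, assuming it takes the form adj_imp = improvement * ((N - X*(NC - 1)/2) / N)^2; note that under this form DOF=0, or a single-variable combination (NC = 1), leaves the improvement unchanged, consistent with the description of the DOF option.

```python
def dof_adjusted_improvement(improvement, n, nc, x=1.0):
    """Sketch of the degrees-of-freedom adjustment applied when a
    linear-combination split is compared against univariate
    competitors.  Assumed form (not verified against CART itself):
        adj = improvement * ((n - x*(nc - 1)/2) / n)**2
    n  = records used in the LC search (usually the node size)
    nc = number of nonzero coefficients in the LC
    x  = value of the DOF option (x=0 disables the adjustment)."""
    return improvement * ((n - x * (nc - 1) / 2.0) / n) ** 2
```

Larger combinations (bigger NC) and smaller nodes (smaller N) are penalized more heavily, which is the usual intent of a degrees-of-freedom correction.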
you have the least interest. For example, in a binary response model in which response is relatively rare, bagging and ARCing may improve the non-response classification accuracy while slightly reducing the response classification accuracy relative to a standard CART tree. We recommend that you experiment with adjusting the priors setting to induce the most useful improvements.

The Combine Tab

The Model Setup Combine tab allows you to specify various advanced committee-tree building control options and settings.

[Screenshot: Model Setup Combine tab, showing Combine Method (Bagging, ARCing), Pruning Test Method (no pruning/exploratory, test sample, cross-validation), Evaluation Sample Holdout Method, Combine Controls (number of trees to combine, maximum number of sample redraws), Report Details, and Files to Save (initial tree, committee trees, learn samples, repeated cases), with Save Grove, CART Combine, Cancel, and Continue buttons.]

Combine Method

The Combine dialog houses the command controls for both bagging and ARCing. To build a committee-of-experts tree, first select either Bagging or ARCing. If you select ARCing, you will need to specify the exponent or power setting as well. Power sets the weight the resampling puts on
0. Other useful methods are PROP=<ratio> (a proportion selected at random), FILE=<file> (a test set in a separate file), and EXPLORE (do not proceed with testing).

14>> The METHOD command sets the improvement calculation method. The commands METHOD GINI and METHOD TWOING are the most widely used methods. POWER > 0 results in more even splits.

15>> The WEIGHT command sets the weight variable, if applicable.

16>> The PENALTY command induces additional penalties on missing-value and high-level categorical predictors. For backwards compatibility with earlier CART engines, one should use the following command instead:

PENALTY MISSING=1,0 HLC=1,0

The remaining two commands are action commands.

17>> The BUILD command signals the CART engine to start the model-building process.

18>> The QUIT command terminates the program. Anything following QUIT in the command file will be ignored. Multiple runs may be conducted using a single command file by inserting additional commands.

Example: A Sample Regression Run

The contents of a REG.CMD sample command file are shown below. Line-by-line descriptions and comments follow.

[Screenshot: REG.CMD opened in Notepad, C:\Program Files\Salford Data Mining\CART 5\..., beginning with a REM banner:]

REM *** SAMPLE REGRESSION RUN ***
100,000 records containing 4,000 bankrupts, we will always work with ratios that are computed relative to 4,000 for the bankrupts and relative to 96,000 for the non-bankrupts. By doing everything in relative terms, we bypass completely the fact that one of the two groups is 24 times the size of the other.

This method of bookkeeping is known as PRIORS EQUAL. It is the default method used for classification trees and often works supremely well. It is the setting we almost always use to start our exploration of new data. This default setting frequently gives the most satisfactory results because each class is treated as equally important for the purpose of achieving classification accuracy.

Priors are usually specified as fractions that sum to 1.0. In a two-class problem, EQUAL priors would be expressed numerically as 0.50, 0.50, and in a three-class problem they would be expressed as 0.333, 0.333, 0.333.

PRIORS may look like weights, but they are not weights. Priors reflect the relative size of a class after CART has made its adjustments. Thus, PRIORS EQUAL assures that no matter how small a class may be relative to the other classes, it will be treated as if it were of equal size. PRIORS DATA (or PRIORS LEARN or PRIORS TEST) makes no adjustments for relative class sizes. Under this setting, small classes will have less influence on the CART tree and may even be ignored if they interfere with CART's
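The relative-ratio bookkeeping described above can be sketched numerically. This is a minimal illustration of the idea, not CART's internal code: each class's node count is taken relative to that class's total sample size and then weighted by the priors (equal priors by default).

```python
def class_shares(node_counts, class_totals, priors=None):
    """Sketch of the priors bookkeeping: a node's class shares are
    computed from counts relative to each class's total sample size,
    weighted by the priors.  Illustrative only."""
    k = len(node_counts)
    if priors is None:
        priors = [1.0 / k] * k  # PRIORS EQUAL
    raw = [p * n / t for p, n, t in zip(priors, node_counts, class_totals)]
    total = sum(raw)
    return [r / total for r in raw]
```

With the bankruptcy example (4,000 bankrupts, 96,000 non-bankrupts), a node holding 400 bankrupts (10% of that class) and 960 non-bankrupts (1% of that class) is dominated by bankrupts under equal priors, even though it contains more than twice as many non-bankrupt records.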
[Screenshot: Model Setup Advanced tab, showing Sample Sizes (Learn Sample Size 189, Test Sample Size, Subsample Size 189), Node Complexity (Minimum complexity 0.00000; scale regression complexity by sample size; automatic if complexity is greater than 1), Data Set Size Warning Limit for Cross-Validation (warn if the number of observations in the learning data set for cross-validation exceeds 3000), Model Missing Values (create new variables for: None; create missing categorical level: None), and Defaults, Save Grove, CART Combine, Cancel, and Continue buttons.]

Parent node minimum cases (ATOM)

When do we admit that we do not have enough data to continue? Theoretically, we can continue splitting nodes until we run out of data, for example, when there is only one record left in a node. In practice it makes sense to stop tree growing when the sample size is so small that no one would take the split results seriously. The default setting for the smallest node we consider splitting is 10, but we frequently set the minimum to 20, 50, 100, or even 200 in very large samples.

Terminal node minimum sizes (MINCHILD)

This control specifies the smallest number of observations that may be separated into a child node. A large node might theoretically be split by placing one record in one child node and all other records into the other node. However, such a split would rightfully be regarded as unsatisfactory in most instances. The MINCHILD control allows you to specify a
1006: TOO MANY CATEGORICAL COMPETITOR SPLITS
The number of categorical splits has exceeded the initially reserved amount. Increase the limit using the BOPTION SPLITS command.

Error 11008: COMPUTATIONAL INSTABILITY DUE TO LINEAR COMBINATIONS; TRY DISABLING LINEAR COMBINATIONS AND RERUN
Contact Salford Systems with details about your run.

Error 20008: YOU HAVE SPECIFIED MULTIPLE DEPENDENT VARIABLES
Check the MODEL command; only one variable is allowed there.

Error 20011: COMPUTATIONAL DIFFICULTIES ENCOUNTERED; UNABLE TO CONTINUE
Contact Salford Systems with details about your run.

Error 20068: Unable to discern a valid set of variable names from your text dataset.
Make sure that the correct value separator is used and that the first line lists the variable names.

Error 20069: Unable to open your text dataset.
Check the file location and the USE command.

Error 20071: You have not specified a grove file yet.
Add the GROVE command appropriately.

Error 20076: Error managing data swap file; cannot continue.
Proceed with regular system maintenance; change swap file settings.

Warning 1: At least one variable had too many distinct values to tabulate completely.
This is most likely to occur with character variables, especially those with long string values. Also, this may be due to treating an ordinal variable as discrete (categorical). Read carefully the entire warning and pro
20% of the data should be allocated for testing purposes and 25% as validation data:

PARTITION TEST=.2, VALID=.25

In the above example the LEARN option does not appear, so the amounts specified for the test and validation samples must be expressed as proportions between 0 and 1 and must sum to less than 1. If you specify the LEARN option, then the amounts will be normalized to sum to 1.0, such as in:

PARTITION LEARN=20, TEST=12, VALID=8

which would result in 50% of the data for the learn sample, 30% for the test sample, and 20% for the validation sample.

PARTITION SEPVAR=PURPOSE$ specifies a character variable that should take on the values "TEST", "Test", "VALID", or "Valid" to steer records into the test and validation samples; otherwise they will go to the learn sample. For a numeric separation variable, such as PARTITION SEPVAR=USAGE, a value of 1 will place the record into the test sample and -1 into the validation sample.

PENALTY

Purpose

The PENALTY command offers three ways to specify a multiplicative fraction between 0 and 1 to penalize (down-weight) the improvement, thus making it more difficult for the variable to be chosen as the primary splitter in relation to other predictor variables.

Predictor-specific improvement factor

By default, no variable-specific penalty is applied to a variable's improvement when considering the variable as a splitter, although a penalty for missing data may
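The LEARN-option normalization arithmetic can be sketched as follows. This is a minimal illustration of the proportion arithmetic, not CART code; the function name is invented.

```python
def normalize_partition(learn, test, valid):
    """Sketch of how PARTITION LEARN/TEST/VALID amounts are normalized
    to proportions summing to 1.0 when the LEARN option is given."""
    total = float(learn + test + valid)
    return learn / total, test / total, valid / total
```

So LEARN=20, TEST=12, VALID=8 yields learn/test/valid proportions of 0.50, 0.30, and 0.20.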
... 3, 261
CV.CMD 206
CVR.CMD 208
DEPTH.CMD 210
DRAW.CMD 211
FLIP.CMD 211
HOTSPOT.CMD 194
KEEP.CMD 212
LOVO.CMD 214
MCT.CMD 215
MVI.CMD 216
ONEOFF.CMD 218
PRIORS.CMD 219
REG.CMD 305
RULES.CMD 220
SAMPLE.CMD 221
SHAVING.CMD 223
TARGET.CMD 224
TTC.CMD 186
command file (.cmd) 297
command input 298
command line equivalents 315
command log 78, 261, 297, 299, 300
command prompt 298
command reference 327
command sequence 300
command syntax 300, 301
  classification example 302
  committee tree (combine) example 308
  regression example 305
  scoring example 309
command line 296
command line mode 298
committee of experts
  ARCing 163
  bootstrap resampling 163
  combine controls 165
  combine method 164
  evaluation sample holdout 165
  files to save 166
  pruning test method 165
  report details 166
  specify model 164
committee tree 167
comparing child nodes 139
comparing learn and test 139
competitors 68, 155, 251
  number to report 131
Competitors and Surrogates tab 155
complexity parameter 112
confusion matrix 250
Consistency by Trees 188
  Dir Fail Count 189
  Direction Max Z 189
  Directional Agreement 188
  Rank Fail Count 189
  Rank Match 188
  Rank Max Z 189
  table 191
  Terminal Nodes 188
  Tree Name 188
Consistency Details by Nodes 189
  Lift Learn 189
  Lift Test 189
  N Focus Learn 189
  N Focus Test 189
  N Node Learn 189
  N Node Test 189
  N Other Learn 189
  N Ot
Filtering the Data Set or Splitting the Data Set .......... 414
DATA Blocks .......... 415
Advanced Programming Features .......... 415
BASIC Programming Language Commands .......... 416
DELETE Statement .......... 416
DIM Statement .......... 417
ELSE Statement .......... 418
FOR...NEXT Statement .......... 419
GOTO Statement .......... 420
IF...THEN Statement .......... 421
LET Statement .......... 422
STOP Statement .......... 423

Introducing CART 6.0

This chapter provides a brief introduction to CART and this manual, and an overview of new features.

Introduction

Welcome to CART 6.0 for Windows, a robust decision-tree tool for data mining, predictive modeling, and data preprocess
[Output excerpt: CART reports insufficient memory, a workspace deficit of 7,742,578 elements, and suggests sub-sampling within nodes via the LIMIT SUBSAMPLE command in order to continue.]

If this occurs, or if you suspect the problem is too large for the workspace, you may need to specify limitations on the structure of the tree to be able to process the model.

Memory Usage Example

A data set with 32,231 records, a 10-level categorical target variable, and 68 categorical predictors is used to illustrate how to overcome a memory shortfall. As shown below, the top three rows provide an overview of the workspace requirements for this example. The estimated total workspace is 41,357,617 elements: 2,092,770 elements to hold the data and 39,264,847 to process the analysis. Because the available workspace is only 33,750,000 workspace elements, the memory deficit is 7,607,617 elements. Your options at this point are to upgrade to a version of CART with more workspace or to specify limitations on the structure of the tree. We offer two methods to specify growing limitations.

Setting Limits Using Model Setup

The easiest method to limit the growth of a tree is to use the Model Setup Advanced tab Tree Size options. By default, CART sets the maximum values based on the dataset size to assure that they can never be reached. Reducing these values will considerably reduce the amount of required workspace.

Tree Size Maxim
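The workspace accounting in the example above is simple arithmetic, sketched here for clarity (an illustrative helper, not CART code): required workspace is the sum of the data and analysis elements, and the deficit is the shortfall relative to the available workspace.

```python
def workspace_deficit(data_elems, analysis_elems, available):
    """Sketch of the workspace accounting in the memory-usage example:
    required = data + analysis; deficit = shortfall vs. available."""
    required = data_elems + analysis_elems
    return max(0, required - available)
```

Plugging in the example's numbers (2,092,770 + 39,264,847 = 41,357,617 required against 33,750,000 available) reproduces the reported deficit of 7,607,617 elements.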
... 383
OUTPUT .......... 384
PARTITION .......... 385
PENALTY .......... 386
PRIORS .......... 388
PRINT .......... 389
QUIT .......... 390
REM .......... 391
... .......... 392
SCORE .......... 393
SAVE .......... 395
SEED .......... 396
SELECT .......... 397
STRATA .......... 398

viii Table of Contents

TRANSLATE .......... 400
USE .......... 402
WEIGHT
[Tree diagram: three terminal nodes produced by the two N_INQUIRIES splits, with learn counts N = 74, N = 86, and N = 198.]

The example shows how CART creates multi-way splits from binary split building blocks. The root node first splits on N_INQUIRIES > 1.5 and then again on N_INQUIRIES > 4.5. This creates three regions for N_INQUIRIES: 0 or 1; 2, 3, or 4; and 5 or more.

With a mouse click you can:
Zoom in or zoom out (toolbar buttons).
Fine-tune the scale by changing the "100%" selection box.
Experiment with two alternative node spacing modes (toolbar buttons).
Turn color coding of target classes on or off (toolbar button).

Try clicking on these controls now to see what they do.

The detail appearing in each of the nodes can be customized separately for internal and terminal nodes. From the View menu, select Node Detail...; the following dialog appears:

[Screenshot: Display Details dialog, with Internal Node Details and Terminal Node Details panels. Node information to display includes parent node, node number, splitting variable name, split criteria (decimal places), weighted cases, and unweighted cases; Regression Trees: average, median, mean average (standard) deviation; Classification Trees: class assignment, class breakdown, class histogram; plus Set Defaults, Copy To Terminal Nodes, OK, Cancel, and Apply buttons.]

The default display setting is shown in a sample node in
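The way two binary splits carve a single variable into three regions can be sketched directly. This is an illustrative function, not CART output; the thresholds 1.5 and 4.5 are the ones quoted in the example.

```python
def n_inquiries_region(n_inquiries):
    """Sketch of how two binary splits on N_INQUIRIES (> 1.5, then
    > 4.5) act together as one three-way split on the variable."""
    if n_inquiries <= 1.5:
        return "0 or 1"
    elif n_inquiries <= 4.5:
        return "2, 3 or 4"
    return "5 or more"
```

Because N_INQUIRIES is integer-valued, the two fractional thresholds partition its values cleanly into the three groups.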
icons 42
IDVAR command 363
IF...THEN command 407, 421
improvement 69, 252
indicators, missing values 34
initial tree 167
input files, default directory 28, 133
installation
  custom 25
  permissions 26
  procedure 25
  typical 25
introduction 10

K
KEEP command 212, 364
keyboard conventions 43
keyboard shortcuts 43

L
LABEL command 365
labels, assigning 59, 242
language, command line 296
learn sample 210, 211, 281
learn sample size 113, 287
least absolute deviation 13, 147
least squares 13, 147
LET command 407, 422
level of detail 238
lift index 246
LIMIT command 112, 287, 369
  ATOM 111
  DEPTH 113
  LEARN 114
  MINCHILD 111
  NODES 113
limits, specifying growth size 113, 286
linear combinations 17, 104, 108
  estimating number of splits 108
  LC lists 17
  minimum node sample size 108
  selected variables 109
  variable deletion 108
LINEAR command 371
logical operators 409
LOPTIONS command 372
  PRINT 126

M
main splitters 238
main tree 56, 57, 239, 241
main tree rules 159
Max Cases 281
MEMORY command 374
memory management 285
memory problems 74, 257
memory requirements 285
memory usage example 286
menus 41, 229
method. See splitting rules
METHOD command 375
Method tab 85, 146, 147, 232, 233
methodology 10
Min Cases 281
MINCHILD command 216
minimum cost tree 102, 103
MISCLASS command 376
misclassification 12, 67, 250
misclassification costs 116
misclassif
6.0 into your CD-ROM drive. If Autorun is enabled on your system, the installation starts automatically and you can skip steps 2 and 3.

From the Start menu, select Run... In the Run dialog box, type D:\SETUP, substituting the appropriate drive letter of your CD-ROM if other than D.

From the pre-installer menu, choose the appropriate option to begin the CART installation procedure.

The installation program prompts you to select a type of setup:

Typical: The Typical installation provides you with all application software, tools, documentation, and sample data files that are normally available. All components will be installed within the directory structure defined during the installation procedure.

Custom: Choose the Custom installation if you would like to choose specific components available for installation. To include a particular option, click the mouse once on the desired option. Be sure that a checkmark appears in the appropriate box to ensure the item will be included as part of the installation.

By default, CART is installed in C:\Program Files\Salford Data Mining\CART 6.0. Each component of the CART installation is installed in a subfolder under CART 6.0.

Ensuring Proper Permissions

If you are installing CART on a machine that uses security permissions, please read the following note:

You must belong to the power user group on Win NT, Win XP, and Win 2000 to be able to run CART. This is
6.0 stores the navigator inside the grove file and no longer makes use of a separate navigator file format. CART 6.0 will recognize and read old navigator files, and you can load these from the File>Open>Open Navigator menu selection.

If the trees you are building are large (e.g., several thousand terminal nodes), Windows system resources can become depleted. To avoid memory problems, consider periodically closing the open Navigator windows you will not need.

More Navigator Controls

standard Relative Cost curve
color-coded Relative Cost curve
percent population by node display

The first two displays show the relative cost curve depending on the number of terminal nodes, while the last display reports how the original data set is distributed into the terminal nodes in the currently selected tree.

If you click on an individual bar in the percent-population-by-node display, the corresponding node in the tree topology is briefly highlighted.

Pressing the Smaller or Larger button causes the scale of the tree topology in the top half of the navigator window to become larger or smaller. This is useful when analyzing large trees.

When applicable, you may switch between learn or test counts displayed for each node by pressing the Learn or the Test buttons. Since cross-validation was used in this example, only learn counts are available on the node-by-node basis.

You can also save the Navigat
... 8
Color button 59, 242
Column button 177, 250
Columns button 196
COMBINE button 162
Continue button 26, 101, 125
Copy to Internal Nodes button 58, 241
Copy to Terminal Nodes button 58, 241
Copy button 26, 27
Cum Lift button 143
Cum Ave Profit button 153
Cum Profit button 153
Defaults button 117, 287
Delete from List button 100
Filtering button 197
Full button 292
Fuzzy Match button 190
Gains button 143
Grid button 205
Grove button 171, 237, 257
Grow button 56, 236, 244
Larger button 75, 237
Learn button 62, 71, 75, 151, 153, 183, 197, 201, 204, 237, 246, 253, 256, 259
Legend button 204
Lift button 143
Line button 201
Max button 205, 208
Mean button 205
Median button 205
Merge selected groups button 138
Min Cost button 201, 203, 204
Min button 205, 208
Misclass button 201, 204
Model button 83, 146
Next Prune button 54, 236
Nodes button 201
None button 208
Open button 43, 140, 230, 298
Optimal Tree button 174
Optimal Tree button 182
Other Classes button 63, 247
Page Setup button 60, 141, 243
Pct button 71, 254
Pooled button 183, 256, 259
Profit button 152
Prune button 56, 236, 244
Quartile 0.25 button 205
Quartile 0.75 button 205
Recall Defaults button 127
Rel Error button 201
Report Now button 289
ROC button
AN environment variable must point to the location of the Stat Transfer libraries (not required under UNIX or Linux); to use the DBMS/COPY interface, the DBMSCOPY environment variable must point to the location of the DBMS/COPY libraries.

Beginning with CART 6, the Stat Transfer interface, where present, takes precedence over the DBMS/COPY interface, which is disabled. To disable the Stat Transfer interface, one can use the command:

LOPTIONS STATTRAN=NO

Likewise, to re-enable the Stat Transfer interface, one uses the command:

LOPTIONS STATTRAN=YES

LOPTIONS DBMSCOPY can be similarly employed to enable or disable the DBMS/COPY interface. If both data translation engines are disabled, the only supported file formats are Systat and text.

CART 6 includes native support for text datasets, which are for many users the most flexible and natural format in which to maintain data. A single delimiter is used throughout the dataset. It is usually a comma, but semicolon, space, and tab are also supported as delimiters. See Chapter 2, Reading Data: Reading ASCII files.

The FPATH Command

The FPATH command can be used to specify locations for different types of input and output files. For example, the following command will cause CART to read and write files in the directory "Salford" under your home directory (by default on UNIX-like systems):

FPATH "Salford"

Thereafter, if one gives an input/output command such as USE, OUTPUT, or SAVE
Battery SUBSAMPLE

Battery SUBSAMPLE varies the sample size that is used at each node to determine competitor and surrogate splits. The default settings are no subsampling, followed by subsampling of 100, 250, 500, 1000, and 5000. You may list a set of values with the VALUES option, as well as a repetition factor: each subsampling size is repeated N times, with a different random seed each time.

Battery TARGET

While theoretical research usually assumes independence among predictors, this assumption is almost always violated in practice. Understanding the mutual relationship among a given list of predictors becomes important in a variety of contexts. A traditional covariance matrix may provide insight into pair-wise correlations among predictors, but usually fails to capture any serious multivariate relationships or possible non-linearities. Battery TARGET was designed to overcome the limitations of conventional approaches and construct a more reliable measure of inter-dependency. The process proceeds as follows: each variable from the current predictor list is taken as a target, and a model is built to predict this target (a classification tree for categorical predictors, a regression tree for continuous predictors) using the remaining variables. The resulting model accuracy indicates the degree of association between the current target and the r
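The Battery TARGET loop can be sketched in outline. This is a schematic of the procedure described above, not CART code; fit_tree and accuracy are hypothetical stand-ins for a tree learner and its scoring function.

```python
def target_battery(data, predictors, fit_tree, accuracy):
    """Sketch of the Battery TARGET idea: take each predictor in turn
    as the target, fit a tree on the remaining predictors, and record
    the model accuracy as a measure of association with the rest."""
    association = {}
    for target in predictors:
        others = [p for p in predictors if p != target]
        model = fit_tree(data, target=target, predictors=others)
        association[target] = accuracy(model, data)
    return association
```

A predictor that the remaining variables can model accurately is strongly tied to them; one that cannot be predicted at all is effectively independent of the list.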
CART 6.0 User's Guide

Dan Steinberg and Mikhail Golovnya

Salford Systems
4740 Murphy Canyon Rd., Suite 200
San Diego, California 92123 USA
619-543-8880 TEL
619-543-8888 FAX
www.salford-systems.com

Developers of TreeNet, MARS, RandomForests, and other award-winning data mining and predictive analytics tools.

Salford Systems 2002-2007

Copyright

Copyright 2002-2007 Salford Systems; all rights reserved worldwide. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language or computer language, in any form or by any means, electronic, mechanical, magnetic, optical, chemical, manual, or otherwise, without the express written permission of Salford Systems.

Limited Warranty

Salford Systems warrants for a period of ninety (90) days from the date of delivery that, under normal use and without unauthorized modification, the program substantially conforms to the accompanying specifications and any Salford Systems authorized advertising material; that under normal use the magnetic media upon which this program is recorded will not be defective; and that the user documentation is substantially complete and contains the information Salford Systems deems necessary to use the program. If, during the ninety (90) day period, a demonstrable defect in the program's magnetic media or documentation should appear, you may return the software to Salford Systems for r
[Score window status bar: Grove, Navigator 1; Data: C:\Program Files\Salfo... GYMTUTOR.CS...]

As illustrated above, the score output dialog displays summary Response Statistics, Gains, and a Prediction Success table for the actual and predicted target variable values. Because the target variable from the original tree appears in the training data, we can assess the predictive accuracy of this particular tree.

Response Statistics Tab

The Response Statistics tab provides distributional information by terminal node, by predicted class, and by the actual target variable (because it is observed). The grid in the top portion displays the following information for each terminal node:

Node: Terminal node number
Cases: Number of cases
Percent Score Data: Percent of scored data in this node
Percent Train Data: Percent of learn data in this node
Node Class: Node class assignment
Percent Correct: Percentage of cases classified correctly in the node

The Results Summary group box in the lower panel displays the number of predicted cases, the number of observed cases for the target variable, and the percent classified correctly (in this example, 96%). The name of the grove file and the dataset used in the Score run are also noted in the last row of the dialog.

Gains Tab

The Gains tab displays gains in both graphical and table forms. Note that you may switch between gains based on the current scored dataset and gains based on
[Data grid excerpt: sample records showing marital status and dollar-amount fields.]

Closing the View Data window puts us back in the Classic Output, so we click on the Activity Window icon and select the Model Setup toolbar icon to reach the Model Setup dialog.

Setting Up the Model

The Model Setup dialog tabs are the primary controls for conducting CART analyses. Fortunately, you only need to visit the first Model tab to get started, so we now focus on this one tab.

Tab headings are displayed in RED when the tab requires information from you before a model can be built. In our example, the tab is red because we have not yet selected a TARGET variable. Without this information, CART does not know which of the 14 variables we are trying to analyze or predict. This is the only required step in setting up a model; everything else is optional.

Selecting Target and Predictor Variables

For this analysis, the binary categorical variable TARGET (coded 0/1) is the target, or dependent, variable. To mark the target variable, scroll down the variable list until the TARGET name is visible and place a checkmark as shown below.

[Screenshot: Model Setup dialog, Model tab selected, with tabs Advanced, Costs, Priors, Penalty, Battery, Model, Categorical, Force Split, Constraints, Testing, Select Cases, Best Tree, and Method.]
Democratic", 3 = "Peace and Freedom"

CLASS GENDER 0 = "female", 1 = "male"
CLASS EVAL$ G = "Good", F = "Fair", P = "Poor"

or you may combine them in a single command, separating variables with a slash:

CLASS PARTY 1 = "Repub", 2 = "Democratic", 3 = "Peace and Freedom" / GENDER 0 = "female", 1 = "male" / EVAL$ G = "Good", F = "Fair", P = "Poor"

Note that the label "Peace and Freedom" requires quotes since it contains spaces. Labels consisting only of numbers and letters can be listed without quotes, but if so, any letters will be converted to uppercase. Note also that all class labels for a given variable must be defined at once, since the <variable> token that leads the list of classes clears out any existing class labels for the variable.

Variable groups that are composed of one type of variable only (i.e., numeric or character) may be used in the CLASS command similarly to variable names, e.g.:

GROUP CREDITEVAL = EVAL3MO, EVAL6MO, EVAL1YR, EVAL3YR
CATEGORY CREDITEVAL
CLASS CREDITEVAL 0 = "n/a", 1 = "Poor", 2 = "Fair", 3 = "Good"

Class labels are reset with the USE command. They are preserved in a CART grove file. They will not carry over from a BUILD run to a CASE run unless in a continuation of the BUILD session. To reset all class labels, issue the CLASS command with no options:

CLASS

To see a summary of class labels, issue the command:

CLASS _TABLE_

COMBINE

Purpose
4>> BOPTIONS SURROGATES=5, PRINT=5, COMPETITORS=5, TREELIST=10, BRIEF, SERULE=0, IMPORTANCE=1, MISSING=NO

5>> LOPTIONS MEANS=YES, NOPRINT=NO, PREDICTIONS=BOTH, TIMING=YES, PLOTS=YES, GAINS=NO, ROC=NO

6>> FORMAT=7, UNDERFLOW

7>> LIMIT MINCHILD=100, ATOM=200, NODES=5000, DEPTH=50, LEARN=100000, TEST=100000, SUBSAMPLE=100000

REM *** MODEL SETUP ***

8>> MODEL Y2
9>> CATEGORY Y2
10>> PRIORS SPECIFY 1=.5, -1=.5
11>> MISCLASSIFY COST=2 CLASSIFY 1 AS -1
11>> MISCLASSIFY COST=3 CLASSIFY -1 AS 1
12>> KEEP Z1, Z2, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
13>> ERROR SEPVAR=T
14>> METHOD GINI POWER=0
15>> WEIGHT W
16>> PENALTY MISSING=1,1 HLC=1,1

REM *** BUILD MODEL ***

17>> BUILD

REM *** QUIT CART ***

18>> QUIT

All lines starting with REM are comments and will be ignored by the command parser. We have marked commands of special interest with RED numbers.
… .................................................. 341

vii

Table of Contents

COMBINE ............................................ 344
… .................................................. 345
DATAINFO ........................................... 346
DESCRIPTIVE ........................................ 347
DISCRETE ........................................... 348
DISALLOW ........................................... 350
… .................................................. 352
EXCLUDE ............................................ 353
FORCE .............................................. 354
FPATH .............................................. 355
FORMAT ............................................. 356
GROUP .............................................. 357
GROVE .............................................. 358
HARVEST ............................................ 359
… .................................................. 361
HISTOGRAM .......................................... 362
IDVAR .............................................. 363
… .................................................. 364
… .................................................. 365
… ..................................................
… .................................................. 48
Viewing the Main Tree .............................. 56
Viewing Sub-trees .................................. 58
Assigning Labels and Color Codes ................... 59
Printing the Main Tree ............................. 60
Tree Summary Reports ............................... 61
Gains Chart (Cumulative Accuracy Profile) .......... 61
Terminal Nodes ..................................... 63
Variable Importance ................................ 64
Misclassification .................................. 66
Prediction Success (or Confusion Matrix) ........... 67
Detailed Node Report ............................... 68
Terminal Node Report ............................... 73
Saving the Navigator/Grove File ....................
EEP or EXCLUDE commands are used.

Examples:

The cost of misclassifying a class 2 case as a class 4 case is 4.5:

MISCLASS COST=4.5 CLASSIFY 2 AS 4

The cost of misclassifying a case from classes 1, 2, 3, 5, or 8 as a class 6 case is 2.75:

MISCLASS COST=2.75 CLASSIFY 1-3, 5, 8 AS 6

MISCLASS commands are cumulative; each command specifies a part of the misclassification matrix. To reset the matrix, use:

MISCLASS UNIT

MODEL

Purpose

The MODEL command specifies the dependent variable. The command syntax is:

MODEL <depvar> [<indep_list>]

in which <depvar> is the dependent variable and <indep_list> is an optional list of potential predictor variables. If no <indep_list> is specified, all variables are used for CART processing unless KEEP or EXCLUDE commands are used.

Examples:

MODEL DIGIT                                   (all non-character variables used in tree generation)
MODEL WAGE AGE IQ EDUC FACTOR(3-8) RACE       (selected variables)
MODEL CLASS PRED(8) VARA VARZ PRED(1-3)

See the KEEP and EXCLUDE commands for another way to restrict the list of candidate predictor variables.

MOPTIONS

Purpose

The MOPTIONS command sets options for a subsequent COMBINE command, which launches the building of combined or multi-trees (a "committee of experts" tree). The data are split into a set-aside set and an overall set. Tre
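The cumulative behavior of MISCLASS can be sketched as successive updates to a cost matrix. This is illustrative Python, not CART itself; the function names and the dictionary representation of the matrix are invented for the example:

```python
# Sketch of cumulative MISCLASS semantics: start from a unit cost
# matrix (cost 1 for any error, 0 on the diagonal), then let each
# command fill in part of the matrix. MISCLASS UNIT would rebuild
# the unit matrix from scratch.
def unit_matrix(classes):
    return {(true, pred): (0.0 if true == pred else 1.0)
            for true in classes for pred in classes}

def apply_misclass(matrix, cost, true_classes, predicted_as):
    # One MISCLASS COST=<cost> CLASSIFY <true list> AS <pred> command.
    for true in true_classes:
        matrix[(true, predicted_as)] = cost

classes = [1, 2, 3, 4, 5, 6, 8]
costs = unit_matrix(classes)
apply_misclass(costs, 4.5, [2], 4)               # CLASSIFY 2 AS 4
apply_misclass(costs, 2.75, [1, 2, 3, 5, 8], 6)  # CLASSIFY 1-3, 5, 8 AS 6
```

Cells not named by any command keep their unit-cost defaults, which is what makes the commands cumulative rather than a full respecification.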
(Root-node Competitors report: competitors 1 through 10 are CLASSES, FIT, ANYPOOL, OFFAER, TANNING, ONAER, SMALLBUS, NFAMMEM, NSUPPS, and HOME.)

Competitor      Split      Improvement
 1  CLASSES     0.50000    0.23363
 2  FIT         3.45388    0.23363
 3  ANYPOOL     0          0.15584
 4  OFFAER      0.50000    0.07206
 5  TANNING     0,1,5      0.06622
 6  ONAER       5.50000    0.05951
 7  SMALLBUS    1          0.03704
 8  NFAMMEM     1.50000    0.02424
 9  NSUPPS      1.50000    0.02248
10  HOME        1          0.00579

(The report also includes N Left, N Right, and N Missing columns.)

The Classification Tab

The next tab in the non-terminal node report displays node frequency distributions in a bar graph or, optionally, a pie chart or horizontal bar chart, for the parent, left, and right child nodes. If you use a test sample, frequency distributions for learn and test samples can be viewed separately using the Learn or Test buttons. As shown below, the parent node (in this example, the root node) contains all 293 cases. The split ANYRAQT = 0 is successful in pulling out 82 of the Class 1 observations and putting them in the right child node (Terminal Node 7). The remaining 13 Class 1 observations and all Class 2 and Class 3 observations are assigned to the left child node.

254 Chapter 11 CART Segmentation

(Navigator 1, 7-node tree, Node 1: Competitors and Surrogates; Root; Competitor Splits; Classification tabs. Splitter: Is ANYRAQT = 0.)

You may switch between counts and percentages by pressing the Cases or Pct button. The horizontal bar chart offers an alternative view of the class partitions. Each colored bar represents one target class. The vertical l
(Set Left Child Node split-value dialog: variable CLASSES is categorical; two lists show which values go to the left child node and which go to the right, with Send To Right, Send To Left, and Reset buttons.)

In this example we are choosing to send classes 1 and 3 to the left and classes 0 and 2 to the right. The resulting setup dialog looks as follows:

(Set Right Child Node split-value dialog for CLASSES, showing the chosen left and right value lists, with OK and Cancel buttons.)

Click OK to continue and return to the Model Setup dialog. From the Model Setup window, click Start to build the model. In the resulting Navigator, if we click on the Tree Details button we will see that our specified forced splits have been implemented. For illustrative purposes we display only the top two levels of splits:

(Tree Details: root split on FIT at 3.96; below it, ANYRAQT = 1 with CLASSES = 0,2 on one side and ANYRAQT = 0 with CLASSES = 1,3 on the other, leading to nodes with N = 82, 115, 82, and 14.)

275 Chapter 12 Features and Options

Command-line users will use the following command syntax to set the forced-split rules:

FORCE ROOT | LEFT | RIGHT ON <predictor> AT <split values>

For example:

FORCE ROOT ON GENDER$ AT "Male","Unknown"
FORCE LEFT ON REGION AT 0,3,4,7,999
FORCE RIGHT ON INCOME AT 100000

To
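The effect of a forced categorical split like the one above can be sketched as a simple partition of records by a level list. This is illustrative Python, not CART; the record layout is invented:

```python
# Sketch of a forced categorical split: records whose CLASSES value is
# in the "left" level set go to the left child, all others go right
# (mirroring classes 1 and 3 left, 0 and 2 right, as in the example).
def force_split(records, variable, left_levels):
    left, right = [], []
    for rec in records:
        (left if rec[variable] in left_levels else right).append(rec)
    return left, right

data = [{"CLASSES": c} for c in [0, 1, 2, 3, 1, 0]]
left, right = force_split(data, "CLASSES", {1, 3})
```

Unlike an ordinary split search, a forced split is applied unconditionally; CART then grows the rest of the tree below the forced node as usual.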
FOR-NEXT loops, GOTOs, and array processing. Core capabilities include filtering and deleting records on the basis of simple or complex criteria. New variables can be constructed with the help of more than 50 mathematical and statistical functions and a complete set of logical, text, and arithmetic operators. These functions have been available to assist modelers in adjusting data and focusing on specific data subsets.

15 Introducing CART 6.0

Starting in 2006, we have made it easier to use our data-processing machinery for the sole purpose of data preparation. You can now read in data in any one of our supported data formats, process the data as required, and then save the results in another data format without having to conduct any modeling. In other words, you can now use our software as a dedicated data preparation tool.

Descriptive Statistics

Our complete set of statistics, including standard summary statistics, quantiles, and detailed tabulations, continues to be available in a single, easy-to-access display. We now also offer an abbreviated version in the traditional one-row-per-predictor format. Also new in CART 6.0 are sub-group statistics based on any segmentation or stratification variable.

Tree Control (e.g., Forced Splits, Constraints)

CART 6.0 allows you to dictate the splitter to be used in the root or in either of the two children of the root. This control is frequently
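The filter-and-derive style of data preparation described above can be sketched in a few lines. This is illustrative Python, not CART's BASIC-like language; the field names and the filter criterion are invented:

```python
import math

# Sketch of the data-preparation pattern described above: delete
# records failing a criterion and construct a new variable with a
# mathematical function. INCOME and LOG_INCOME are invented names.
def prepare(records):
    out = []
    for rec in records:
        if rec["INCOME"] <= 0:
            continue  # delete records that fail the filter
        new = dict(rec)
        new["LOG_INCOME"] = math.log(rec["INCOME"])  # derived variable
        out.append(new)
    return out

rows = [{"INCOME": 1000}, {"INCOME": 0}, {"INCOME": math.e}]
clean = prepare(rows)
```

The same read-process-save pattern is what the text means by using the software as a dedicated data preparation tool: no model need be built for the transformation to be useful.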
G CART ............................................. 26
Preparing Your Data for CART ....................... 28
Setting up Working Directories ..................... 28
READING DATA ....................................... 31
General Comments ................................... 32
Accessing Data from Salford Systems Tools .......... 32
Variable Names ..................................... 35
Reading Excel Files ................................ 36
CART BASICS ........................................ 39
CART T… ............................................ 40
CART Desktop ....................................... 41
About CART Menus ................................... 41
About CART Toolbar Icons ........................... 42
Opening a File ..................................... 43
Setting Up the Model ............................... 45
Tree Navigator .....................................
HOME and CLASSES as categorical predictor variables. The resulting Model tab will look like the following:

(Model Setup dialog, Model tab: Tree Type = Classification; Target Variable = CLASSES; the variable list includes NFAMMEM, TANNING, ANYPOOL, SMALLBUS, FIT, HOME, PERSTRN, CLASSES, and SEGMENT, with Target, Predictor, Categorical, Weight, and Aux check columns; Number of Predictors = 12; tabs for Model, Categorical, Force Split, Constraints, Testing, Select Cases, Best Tree, Method, Advanced, Costs, Priors, Penalty, and Battery; Save Grove, CART/Combine/Score, Cancel, Continue, and Start controls.)

Let's take a closer look at the Model Setup Constraints tab and get ready to specify a group of constraints.

(Constraints tab: one row per splitter variable, here ANYRAQT, ONAER, NSUPPS, OFFAER, NFAMMEM, TANNING, ANYPOOL, SMALLBUS, FIT, HOME, PERSTRN, and CLASSES; for each variable you may disallow its use as a primary or surrogate splitter at or above a given depth, at or below a given depth, or outside specified minimum and maximum node case counts; Clear All and Select buttons.)
IRIES > 4.5 AND NUMCARDS < 4.5

and is estimated to be 70% BAD. We need to click on one of the Probabilities buttons if we want probabilities to be displayed with the rules.

The rules are formatted as C-compatible code to facilitate applying new data to CART models in other applications. The rule set can be exported as a text file, cut and pasted into another application, and/or sent to the printer. This topic is discussed further below in the section titled "Displaying and Exporting Tree Rules."

The Splitter tab

When a node is split on a categorical variable, an additional tab called Splitter is available in the Node Information window for all internal nodes. In our example we will not see a categorical splitter in the tree unless we expand the tree out to 26 nodes. If you do that and go to the parent of Terminal Node 3 (at the bottom left), you will see that it splits on the categorical EDUCATION$ variable. Click that node and select the Splitter tab to obtain:

(Splitter tab: "Is EDUCATION$ = HS, No_HS". Levels That Go Left: HS, No_HS. Levels That Go Right: College.)

With only three education levels, we can readily see whether a level goes to the left or the right. This report is most useful for following high-level categorical splits or for tracing which levels end up where when the same categorical variable is used as the main spli
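Because the exported rules are plain comparisons, applying them outside CART is straightforward. A sketch in Python (the rule is the one quoted above; the record layout is invented):

```python
# Sketch: the exported rule for the terminal node discussed above
# (N_INQUIRIES > 4.5 AND NUMCARDS < 4.5) applied to new records.
def in_segment(rec):
    return rec["N_INQUIRIES"] > 4.5 and rec["NUMCARDS"] < 4.5

cases = [
    {"N_INQUIRIES": 6, "NUMCARDS": 2},  # satisfies both conditions
    {"N_INQUIRIES": 3, "NUMCARDS": 2},  # fails the first condition
    {"N_INQUIRIES": 6, "NUMCARDS": 5},  # fails the second condition
]
flags = [in_segment(c) for c in cases]
```

This is exactly what the C-compatible export enables: the same boolean test, pasted into another application, reproduces the segment definition.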
(Grouped-levels display: Levels 1 2 3 4 5 6; Grow/Prune and Next Prune controls; Model Statistics: 6 nodes, ROC Train 0.9826, ROC Test 0.9756; Data Displays and Reports: Save Model, Learn, Splitters, Tree Details, Summary Reports, Commands, Grove, Translate, Score.)

Similarly, you may color-code terminal nodes by a continuous auxiliary variable. In this case the color codes will be based on the mean instead of the level in focus, similar to target color coding in regression trees (see Chapter 5, Regression Trees).

139 Chapter 4 Classification Trees

✓ You may break the group down into original levels by checking the grouping and pressing the "Split selected groups" button.
✓ Return to the Select Target Variable dialog to return display details back to the original target variable, SEGMENT.

Comparing Children

It is possible to compare two children of any internal node side by side. Simply point the mouse at the internal node, right-click, and choose the Compare Children menu item. A window similar to the Tree Details window shows the two children side by side:

(Children of Node 3: left child TANNING <= 0.50, right child TANNING > 0.50; Terminal Node 1 is Class 1.)

You can control what is reported using the View > Node Detail menu, just as you do for the Tree Details window.

Comparing Learn and Test

It is possible to co
(Learn-sample summary statistics, overall:)

Variable        N        Mean        SD          Max          Sum
AGE             .        .           .           58.000       18561.000
CREDIT_LIMIT    664.00   14966.574   20699.596   104622.000   9937805.000
GENDER          589.00   0.689       4.710       6.000        406.000
HH_SIZE         544.00   3.314       1.567       7.000        1803.000
INCOME          664.00   4081.373    2629.637    20800.000    2710032.000
N_INQUIRIES     664.00   2.985       3.734       23.000       1982.000
NUMCARDS        664.00   1.801       1.839       9.000        1196.000
OCCUP_BLANK     664.00   0.054       0.227       1.000        36.000
TIME_EMPLOYED   664.00   1.746       3.282       33.000       1159.500

You can save a copy of the text output as a record of your analysis by selecting Save Output from the File > Save menu. You can also copy and paste sections of the output into another application or to the clipboard. The font used in the Report window can be changed by selecting Fonts from the Edit menu; use a mono-spaced font such as Courier to maintain the alignment of tabular output.

✓ You can always regenerate most of the classic output from a saved grove file by using the TRANSLATE facility built into every grove.
✓ Advanced users may want to use PERL scripts to process the classic output to create custom reports.

For a line-by-line description of the text output, consult the main reference manual.

Displaying and Exporting Tree Rules

Decision trees can be viewed as flow charts or as sets of rules that define segments of interest in a database. The rules for a CART tree can be rendered in two quite different ways: as simple rules that are easy
M=YES. Note that DCM=YES can generate voluminous output for large committees. If no committees exist in the grove, this option is ignored and reports are printed for all trees.

PROBS=<n> causes predicted probabilities for classification models to be added to the output dataset if there are <n> or fewer target classes. By default, models with five or fewer target classes will have predicted probabilities saved.

PATHS causes path indicators to be added to the output dataset. By default, these are not saved.

DEPVAR is used to specify a proxy target (dependent) variable with a different name than the target variable used when the model was created.

If a variable with the same name as the original target is present, or if a proxy target is specified with the DEPVAR option, SCORE will also produce misclassification or error-rate reports. If the SAVE command is issued prior to SCORE, model scores will be saved to a dataset. To include all model variables in the save file, use the MODEL option on the SAVE command. Merge variables may be included in the SAVE dataset by issuing the IDVAR command prior to the SCORE command; the IDVARs may be any variables on the USE dataset. The MEANS, PREDICTION, GAINS, and ROC options on the LOPTIONS command will generate additional scoring output.

Examples:

USE "gymtutor.csv"
SAVE "testPRED.CSV" MODEL
GROVE "GYMc.GRV"
SCORE DEPVAR=SEGMENT, PATH=YES, PROBS=3
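What the PROBS and PATHS options add to the scored output can be sketched with a toy tree. This is illustrative Python only; the two-split tree, its probabilities, and the field names are all invented:

```python
# Sketch of scoring with a toy two-split tree: each record receives a
# terminal-node id (a path indicator) and that node's class
# probabilities, loosely mirroring the PATHS and PROBS options above.
def score(rec):
    if rec["N_INQUIRIES"] <= 1.5:
        return {"node": 1, "probs": {"GOOD": 0.9, "BAD": 0.1}}
    if rec["NUMCARDS"] <= 4.5:
        return {"node": 2, "probs": {"GOOD": 0.3, "BAD": 0.7}}
    return {"node": 3, "probs": {"GOOD": 0.5, "BAD": 0.5}}

result = score({"N_INQUIRIES": 3, "NUMCARDS": 2})
```

In real SCORE output the node id and the per-class probability columns would be appended to each record of the output dataset, alongside any IDVAR merge variables.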
MD command file for details.

220 Chapter 10 CART Batteries

(Battery Summary window: Models, Contents, Accuracy, Error Profiles, and Var Imp Averaging tabs; per-model columns include Name, Optimal Terminal Nodes, Relative Error, ROC, Class 1 Accuracy, Other Classes Accuracy, Average Accuracy, and Overall Accuracy; Model Quality, Sample, Model Size, Save Grove, Misclass/ROC, Test/Learn, and Min Cost/1SE controls.)

Here the priors were varied from 0.05/0.95 to 0.95/0.05 in increments of 0.05, producing 19 runs overall. Note the powerful impact on individual class accuracies (the sensitivity-versus-specificity tradeoff). This battery is the most suitable raw material for the hot-spot detection procedure (searching for nodes rich in the class of interest) described earlier.

Battery RULES

Battery RULES simply runs each available splitting rule, thus producing six runs for classification and two runs for regression. We illustrate battery RULES for a multinomial target with non-symmetric costs using the Prostate dataset, PROSTATE2.CSV (see the RULES.CMD command file for details).

(Battery Summary window for the RULES battery: relative error and optimal node counts per splitting rule, with bar-chart and zoomed views.)
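The grid of priors used by the battery above (0.05/0.95 through 0.95/0.05 in steps of 0.05) can be generated mechanically. A sketch in Python, with the grid function invented for illustration:

```python
# Sketch: generate the two-class prior settings swept by the battery
# described above: 0.05 to 0.95 in increments of 0.05, 19 runs total.
def prior_grid(start=0.05, stop=0.95, step=0.05):
    grid, p = [], start
    while p <= stop + 1e-9:           # tolerance guards float drift
        grid.append((round(p, 2), round(1 - p, 2)))
        p += step
    return grid

runs = prior_grid()
```

Each tuple is one battery run's (class 1 prior, class 2 prior) setting; a driver would fit one tree per tuple and tabulate the accuracy tradeoff.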
MS=<yes|no> SAVE=<filename>

(CART EX Pro only.) Attempt to model each variable in the KEEP list as a target, using all other variables in the KEEP list as predictors. MP governs whether MVIs (missing value indicators) are used as predictors. MT governs whether MVIs are used as targets. MS governs whether MVIs are saved to the output dataset. SAVE saves the imputed values to a new dataset.

If you wish to specify a list of targets separately from the KEEP list of predictors, use the syntax:

BATTERY TARGET=<target1>, <target2>, ...

In this instance, variables can be part of both the TARGET list and KEEP, but in the most common use the two lists would be mutually exclusive.

BATTERY CVR=<n>
Repeats the CV process <n> times, with a different random seed each time.

BATTERY KEEP=<NK, NR> CORE=<predictor>, <predictor>, ...
(CART EX Pro only.) Repeats the model NR times, selecting a subset of NK predictors from the KEEP list each time. The CORE option defines a group of predictors from the main KEEP list that are included in each of the models of the battery.

BATTERY MCT=<n>
Monte Carlo shuffling of the target. The first model is unperturbed; successive models have the target shuffled to break the correlation between the target and the explanatory variables. MCT may only be run alone or with RULES, in which case it will be nested.

BATTERY QUIET=<YES|NO|AUTO>
Some results that would be produced for a sing
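The target-shuffling idea behind BATTERY MCT can be sketched directly. This is illustrative Python, not the CART implementation; the function name and seed handling are invented:

```python
import random

# Sketch of Monte Carlo target shuffling (BATTERY MCT): the first run
# keeps the target intact; each later run permutes the target column so
# that any apparent model performance on it is due to chance alone.
def mct_targets(target, n_runs, seed=13):
    rng = random.Random(seed)
    runs = [list(target)]            # run 1: unperturbed
    for _ in range(n_runs - 1):
        shuffled = list(target)
        rng.shuffle(shuffled)        # break target/predictor correlation
        runs.append(shuffled)
    return runs

runs = mct_targets([0, 0, 1, 1, 1], 3)
```

Comparing the real run's performance against the shuffled runs gives a reference distribution for judging whether the model found genuine structure.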
(Root Splits report.) Main splitter: ANYRAQT, improvement 0.26929. Competitors:

Competitor      Split      Improvement
 1  CLASSES     0.50000    0.23363
 2  FIT         3.45388    0.23363
 3  ANYPOOL     0          0.15584
 4  OFFAER      0.50000    0.07206
 5  TANNING     0,1,5      0.06622
 6  ONAER       5.50000    0.05951
 7  SMALLBUS    1          0.03704
 8  NFAMMEM     1.50000    0.02424
 9  NSUPPS      1.50000    0.02248
10  HOME        1          0.00579

Terminal Nodes

The next Summary Report provides a graphical representation of the terminal nodes, as illustrated below. You may choose the target class in the selection box. When the Other Classes button is pressed, the bar chart contains one bar per terminal node, sorted by the node's richness in the target class. In the example below, terminal nodes 7, 3, and 1 are nearly pure in Class 1, whereas only about 5% of node 5 belongs to Class 1.

248 Chapter 11 CART Segmentation

(Navigator 1, 7-node tree, Tree Summary Reports: Misclassification, Prediction Success, Gains Chart, Root Splits, Terminal Nodes, and Variable Importance tabs; bar chart titled "Percentage of Node That Is Target Class," with All Classes and Other Classes buttons.)

When the All Classes button is pressed, you will see a stacked bar chart with the target class first. If you use a test sample, more buttons will be available to reflect distributions on the learn, test, or both parts. The bar charts enable you to evaluate the purity (homogeneity) of the terminal nodes, an indication of how well CART partitioned the classes. The terminal nodes in our
Model menu, or click the Model Setup toolbar icon. CART retains the prior model settings in the Model Setup dialogs. To use another data set, select Data File from the File > Open menu. The newly selected file will replace the file currently open, and all dialog box settings will return to default values.

✓ If you want to ensure that all default settings are reset to their original state, select Clear Workspace from the File menu.

Saving Command Log

To save the command log, select Open Command Log from the View menu (or press the Command Log toolbar icon) and then select Save from the File menu. Specify a directory and the name of the command file (saved by default with a .CMD extension). If your model is rather time-consuming (e.g., the model contains many candidate predictors, most of which are categorical), saving the command log can expedite further manipulation of model setup specifications in subsequent CART sessions. See Chapter 13 for more about the CART command log and running CART in batch mode.

✓ IMPORTANT NOTE: When a CART session is finished (the CART application is closed), a log file containing all commands issued during the session is created in the CART temporary folder specified in Edit > Options > Directories. This text file is given a name that starts with CART, followed by month and day, followed by hour (military convention, 0-23), minutes, and seconds, followed by two underscores. For example,
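The log-file naming convention just described can be sketched with a timestamp formatter. This is illustrative Python; the exact digit layout and the absence of an extension are assumptions based only on the description above:

```python
import datetime

# Sketch of the session-log naming convention described above
# (assumed layout): "CART" + month and day + hour (0-23), minutes,
# and seconds + two underscores.
def log_filename(ts):
    return "CART" + ts.strftime("%m%d%H%M%S") + "__"

# A hypothetical session closed on March 9 at 14:05:30.
name = log_filename(datetime.datetime(2006, 3, 9, 14, 5, 30))
```

Because the name embeds the close time to the second, two sessions ending at different moments can never overwrite each other's logs.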
(Battery results grid, models sorted by relative error, continued: INDUS 0.7453, CRIM 0.7900, ZN 0.8000, RAD 0.8551, DIS 0.8712, AGE n/a; Show Min Error and Sort by controls; Model Quality, Sample, Model Size, Save Grove, Test/Learn, and Min Cost/1SE options.)

It is clear that LSTAT alone could reduce the relative error to 0.35, while CHAS has virtually no univariate connection with the response. The following table reports Pearson correlations of the same variables with the response, sorted by the absolute value of the correlation. The results are directly comparable to the CART findings. However, the CART approach has the added advantage of being able to identify potential non-linearities.

VARIABLE    CORRELATION
LSTAT         -0.73766
RM             0.69536
PT            -0.50779
INDUS         -0.48373
TAX           -0.46854
NOX           -0.42732
CRIM          -0.38830
RAD           -0.38163
AGE           -0.37695
ZN             0.36045
B              0.33346
DIS            0.24993
CHAS           0.17526

Battery PRIOR

Prior probabilities play a fundamentally important role in overall tree construction as well as in model evaluation. By manipulating priors one can impose different solutions on the sensitivity-versus-specificity tradeoff, as well as control node purity and overall model performance. Battery PRIOR streamlines this process by allowing priors to be varied within a specified range in user-supplied increments. We illustrate this battery using the SPAMBASE.CSV dataset (see the PRIORS.C
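Ranking predictors by the absolute value of their correlation with the response, as in the table above, is easy to reproduce. A sketch in Python with invented toy data (the helper implements the ordinary Pearson formula):

```python
import math

# Sketch: compute each predictor's Pearson correlation with the
# response and sort by absolute value, as in the table above.
# The toy data below are invented for illustration.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

response = [1.0, 2.0, 3.0, 4.0]
predictors = {
    "UP":   [1.0, 2.0, 3.0, 4.0],   # perfectly positively correlated
    "DOWN": [4.0, 3.0, 2.0, 1.0],   # perfectly negatively correlated
    "WEAK": [1.0, 3.0, 2.0, 3.0],   # weakly related
}
ranked = sorted(((name, pearson(vals, response))
                 for name, vals in predictors.items()),
                key=lambda kv: abs(kv[1]), reverse=True)
```

Sorting on |r| rather than r is what lets a strong negative relationship (like LSTAT's) head the list ahead of weaker positive ones.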
(Navigator 5, 10-node tree, Node 1: Competitors and Surrogates. Root splitter: Is N_INQUIRIES <= 1.500; improvement 0.104.)

Competitor          Split
 1  CREDIT_LIMIT    5546.000
 2  OCCUP_BLANK     0.500
 3  EDUCATION$      College, No_HS
 4  GENDER          1.000
 5  NUMCARDS        0.500
 6  INCOME          606.500
 7  AGE             29.500
 8  MARITAL$        Single
 9  TIME_EMPLOYED   5.250
10  OWNRENT$        Parents, Rent
11  HH_SIZE         1.500

(The report also lists each competitor's improvement and its N Left, N Right, and N Missing counts.)

In some circumstances you may be uncomfortable with a main splitter because it is too frequently missing or because it generates a highly uneven split. For example, OCCUP_BLANK puts 628 cases on the left and only 36 cases on the right; OWNRENT$ has 143 cases missing. Other sections of the manual discuss what you can do if your main splitter exhibits such possibly undesirable characteristics.

The Rules tab

The Rules tab displays text rules describing how to reach the selected node and thus is available for every node except the root. Select Terminal Node 9 from the 10-node tree, double-click on the node, and then select the Rules tab to see:

/* Rules for Terminal Node 9 */
if ( N_INQUIRIES > 4.5 && NUMCARDS < 4.5 ) { terminalNode = 9; class = 1; }

(Display options: Probabilities: None/Learn; Notation; Nodes.)

Node 9 contains the data segment satisfying the rules N_INQU
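The improvement figure by which competitors are ranked is a decrease in node impurity. A sketch of the Gini version in Python (the counts below are invented; this is the textbook formula, not CART's internal code):

```python
# Sketch: Gini improvement of a candidate binary split, the quantity by
# which competitor splits are ranked. "counts" holds per-class case
# counts in a node; the example numbers are invented.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def improvement(parent, left, right):
    n = sum(parent)
    p_left, p_right = sum(left) / n, sum(right) / n
    # Impurity of the parent minus the weighted impurity of the children.
    return gini(parent) - p_left * gini(left) - p_right * gini(right)

# Three-class parent; the split isolates one class in the right child.
imp = improvement(parent=[95, 100, 98], left=[13, 100, 98], right=[82, 0, 0])
```

A competitor that produces purer children earns a larger improvement; a penalized variable would have this score discounted before the comparison.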
(Tree diagram: terminal nodes with class assignments and per-class case counts, e.g. Node 4 with N = 25 and Node 5 with N = 15.)

As illustrated above, in the upper left a Tree Map window shows a thumbnail sketch of the whole tree and outlines the portion of the tree currently displayed in the Tree window. If the tree takes up more than the screen, you can use the tree map to see which portion of the tree you are viewing and to change the displayed section. Clicking on the tree map moves the viewed portion to center on the mouse position. Conversely, the outline in the map and the section of the tree displayed move when you use the horizontal and vertical scroll bars.

✓ Tree Map is also available when viewing the Splitters window.

241 Chapter 11 CART Segmentation

With a simple mouse click you may:

- Zoom in or zoom out by pressing the [+] or [-] keys.
- Fine-tune the scale by changing the 100% selection box.
- Experiment with two alternative node-spacing modes.
- Turn color coding of target classes on or off.

The level of detail appearing in each of the tree nodes can be customized according to your preferences. From the View menu, select Node Detail; the following dialog appears:

(Display Details dialog: Internal Node Details and Terminal Node Details panels. Node Information to Display: Parent Node, Node number, Splitting variable name
nodes; ROC Test 0.9862. Data Displays and Reports: Save Model, Learn, Test, Splitters, Tree Details, Summary Reports, Commands, Grove, Translate, Score.)

As illustrated below, the Summary Reports dialog contains gains charts, terminal node distributions, variable importance measures, misclassification tables, and prediction success tables as result tabs.

Gains Chart

The summary report initially displayed is the Gains Chart tab for the first level of the target variable (Class 1).

(Gains Chart window, Navigator 1, 7-node tree: tabs for Misclassification, Prediction Success, Gains Chart, Root Splits, and Terminal Nodes. The grid's top node contains 82 cases and covers 86.32% of the target class; ROC Train 0.9982, ROC Test 0.9862; Learn, Gains, Lift, Cum. Lift, and ROC views; total cases 95; percent of sample 32.42; Show Perfect Model option.)

If you use a test sample, Learn, Test, and Both buttons will appear in the lower portion of the Gains Chart dialog. To view gains charts for the test sample, click Test. To view gains charts for learn and test combined, click Both. In this example we used cross-validation, so these buttons do not appear.

The grid displayed in the right panel shows the relative contribution of the nodes to coverage of a particular class, in this case Class 1.
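The cumulative figures in the gains grid come from sorting terminal nodes by their richness in the target class. A sketch in Python (the node counts are invented, not the ones in the screen above):

```python
# Sketch: cumulative gains from terminal nodes sorted by richness in
# the target class. Each node contributes a pair
# (cases in node, target-class cases in node); the numbers are invented.
def cumulative_gains(nodes):
    nodes = sorted(nodes, key=lambda n: n[1] / n[0], reverse=True)
    total_cases = sum(n for n, _ in nodes)
    total_target = sum(t for _, t in nodes)
    cum_cases = cum_target = 0
    rows = []
    for n, t in nodes:
        cum_cases += n
        cum_target += t
        rows.append((round(100 * cum_cases / total_cases, 1),    # % population
                     round(100 * cum_target / total_target, 1))) # cum % class
    return rows

rows = cumulative_gains([(95, 82), (100, 8), (105, 5)])
```

Each row pairs the percent of the population reached so far with the percent of the target class captured, which is exactly the curve the gains chart plots.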
O SEGMENT, CREDITRANK, MART GO

Groups can contain a mix of character and numeric variables; however, the CLASS command will accept homogeneous (all-character or all-numeric) groups only. A variable may be included in more than one group. If a group is assigned a name that is identical to a variable name, the group name will take precedence in variable lists, i.e., the variable name will be masked.

The following commands recognize variable groups: CATEGORY, KEEP, EXCLUDE, AUXILIARY, IDVAR, CONSTRAIN, DATAINFO, PENALTY, CLASS, XYPLOT, HISTOGRAM.

GROVE

Purpose

The GROVE command names a grove file in which to store the next tree or committee (or group of impute trees), or to use in the next TRANSLATE or SCORE operation. If an unquoted name is given without an extension, .GRV is appended. The command syntax is:

GROVE <filename> [ IMPORT=<legacy treefile> | LOAD | MEMO "<contents>" | ECHO ]

Examples:

GROVE "c:\modeling\rev1\groves\M_2b.grv"
GROVE MOD1

To convert a legacy tree file (e.g., mytree.tr1) from a previous version of CART to a grove, use the IMPORT option, e.g.:

GROVE "robustus\projects\groves\J3b.grv" IMPORT="c:\c3po\legacy.tr1"

To test a grove file for validity, use the LOAD option, e.g.:

GROVE qmodel1.grv LOAD

If the grove file is invalid, an error message will be generated. To add a memo to a grove, use the MEMO option, e.g.:

GROVE filename
166. OTO 40 20 ELSE STOP 40 LET X 15 Bibliography Breiman L 1996 Arcing classifiers Technical Report Berkeley Statistics Department University of California Breiman L 1996 Bagging predictors Machine Learning 24 123 140 Breiman Leo Jerome Friedman Richard Olshen and Charles Stone 1984 Classification and Regression Trees Pacific Grove Wadsworth Dietterich T 1998 An experimental comparison of three methods for constructing ensembles of decision trees Bagging Boosting and Randomization Machine Learning 40 139 158 Freund Y amp R E Schapire 1996 Experiments with a new boosting algorithm In L Saitta ed Machine Learning Proceedings of the Thirteenth National Conference Morgan Kaufmann pp 148 156 Steinberg Dan and Phillip Colla 1997 CART Classification and Regression Trees San Diego CA Salford Systems Index Buttons button 132 button 57 241 293 button 57 241 293 lt Send To Left button 273 button 142 1 SE button 201 204 Add to List button 100 Add button 200 Advanced button 122 All Classes button 63 248 All button 159 Apply button 57 59 242 Ave Profit button 152 Average button 208 Bar button 201 Bars button 197 Both button 62 246 Box Plot button 205 Brief button 292 CART button 234 Cases button 71 254 Change button 270 273 Chart button 20
167. ROGRAMMING LANGUAGE Chapter Installing and Starting CART This chapter provides a brief instruction on how to install and start CART and how to prepare to read the data 24 Installing and Starting CART Installing and Starting CART 6 0 This chapter provides instructions for installing and starting CART 6 0 for Windows 2000 Windows 2003 and Windows XP Although CART 6 0 may run on older versions of the Windows operating system we strongly recommend that you rely on later versions of Windows Minimum System Requirements To install and run CART the minimum hardware you need includes Pentium processor or similar 512 MB of random access memory RAM This value depends on the size of CART you have licensed 32 MB 64MB 128MB 256MB 512MB 1GIG 2GIG While some versions of CART will run with a minimum of 128MB of RAM we highly recommend that you follow the recommended memory configuration that applies to the particular version of CART you have licensed Using less than the recommended memory configuration results in excessive hard drive paging reducing performance significantly and risking that you will run out of resources quickly leading to a shut down of the software Hard disk with 40 MB of free space for program files data file access utility and sample data files Additional hard disk space for scratch files with the required space contingent on the size of the input data set CD ROM or DVD drive for inst
168. RT will do and the terminal nodes may identify interesting data segments 20 Introducing CART 6 0 New Model Translation Formats In CART 6 we have added Java and PMML to our existing group of model translation languages The Predictive Modeling Markup Language PMML is a form of XML specifically designed to express the predictive formulas or mechanisms of a data mining model In CART 6 we conform to PMML release 3 0 exer Ed CART 6 0 Train Test Consistency r Classic CART trees are evaluated on the basis of overall tree performance However many users of CART are more interested in the performance of specific nodes and the degree to which terminal nodes exhibit strongly consistent results across the train and test samples The TTC report provides new graphical and tabular reports to summarize train test agreement About this Manual This User s Guide provides a hands on tutorial as well as step by step instructions to orient you to the graphical user interface and to familiarize you with the features and options found in CART We have also incorporated command line syntax for our non GUI Linux and UNIX users This manual is not intended to instruct the user on the underlying methodology but rather to provide exposure to the basics of the CART software application If you are new to CART and decision trees we think you will find CART an ideal way to learn After you have become familiar with the nuts and bolts of runnin
169. LANGUAGE = SAS | CLASSIC | C | PMML | HISTORY
OUTPUT = <output file>
VLIST = <yes|no>
TLIST = <yes|no>
DETAILS = <yes|no>
SURROGATES = <yes|no>
SMI = <SAS missing value string>
SBE = <SAS begin label>
SDO = <SAS done label>
SNO = <SAS node prefix>
STN = <SAS terminal node prefix>

The available languages are as follows:

SAS: Implement the model in the form of a subroutine, which can be included in a SAS data step and called with the LINK command. At present, only single-tree models are fully supported.

CLASSIC: Print the model in much the same way it is represented in the classic text output.

C: Implement the model in the form of a C language function.

PMML: Print the model using Predictive Model Markup Language (PMML) 3.1. This is an XML-based language for representing statistical models. Again, only single-tree models are fully supported. Batteries and COMBINE models are currently represented as series of single trees.

HISTORY: List the commands executed between the time CART started and when the model or battery contained in the grove was built. This is useful for reconstructing the code required to build a particular model or battery.

401 Appendix III Command Reference

Example:
GROVE "mygrove.grv"
TRANSLATE LANGUAGE=SAS, OUTPUT="mygrove.sas"

Example SAS data step to score data with TRANSLATE output:
DATA OUTLIB.SCORES;  * Output dataset;
SET INLIB.NEWDATA
170. SIC also includes a collection of probability functions that can be used to determine probabilities and confidence-level critical values, and to generate random numbers. The following table shows the distributions and any parameters that are needed to obtain values for the random draw, the cumulative distribution, the density function, or the inverse density function.

Every function name is composed of two parts. The first letter (the key) identifies the distribution; the remaining letters define the function: RN = random number, CF = cumulative, DF = density, IF = inverse. In the parameter lists below, a is the probability argument for the inverse (density) function.

Distribution       Key  Random draw    Cumulative       Density          Inverse          Comments
Beta               B    BRN            BCF(b,p,q)       BDF(b,p,q)       BIF(a,p,q)       b beta value; p, q beta parameters
Binomial           N    NRN(n,p)       NCF(x,n,p)       NDF(x,n,p)       NIF(a,n,p)       n number of trials; p prob. of success in a trial; x binomial count
Chi-square         X    XRN(df)        XCF(c,df)        XDF(c,df)        XIF(a,df)        c chi-squared value; df degrees of freedom
Exponential        E    ERN            ECF(x)           EDF(x)           EIF(a)           x exponential value
F                  F    FRN(df1,df2)   FCF(F,df1,df2)   FDF(F,df1,df2)   FIF(a,df1,df2)   df1, df2 degrees of freedom; F F value
Gamma              G    GRN(p)         GCF(g,p)         GDF(g,p)         GIF(a,p)         p shape parameter; g gamma value
Logistic           L    LRN            LCF(x)           LDF(x)           LIF(a)           x logistic value
Normal (Standard)  Z    ZRN            ZCF(z)           ZDF(z)           ZIF(a)           z normal z-score

412 Appendix IV BASIC Programming Language

Poisson            P    PRN(p)         PCF(x,p) p
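The BASIC probability functions have counterparts in most environments. For readers who want to sanity-check values outside CART, here is an illustrative Python sketch (our own code, using only the standard library, not part of CART) that mirrors the standard-normal entries ZRN, ZCF, ZDF, and ZIF:

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def zrn() -> float:
    """Random draw from the standard normal (analogue of BASIC's ZRN)."""
    return random.gauss(0.0, 1.0)

def zcf(z: float) -> float:
    """Cumulative probability P(Z <= z) (analogue of ZCF)."""
    return _STD_NORMAL.cdf(z)

def zdf(z: float) -> float:
    """Density at z (analogue of ZDF)."""
    return _STD_NORMAL.pdf(z)

def zif(a: float) -> float:
    """Inverse: the z-score whose cumulative probability is a (analogue of ZIF)."""
    return _STD_NORMAL.inv_cdf(a)
```

For example, zcf(1.96) is approximately 0.975 and zif(0.975) is approximately 1.96, the familiar two-sided 95% critical value.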
171. Select Cases: selects a subset of the original data. Best Tree: defines the best tree selection method. Method: selects a splitting rule.

147 Chapter 5 Regression Trees

Penalty: sets penalties on variables, missing values, and high-level categorical predictors. Advanced: specifies other model-building options. Battery: specifies batteries of automated runs.

The key differences regression tree models impose on both model setup and resulting output are:

- Certain Model Setup dialog tabs are grayed when you select the regression tree type in the Model dialog. These include the Costs and Priors tabs that provide powerful means of control over classification trees.
- Least Squares (the default setting) and Least Absolute Deviation are the only splitting rules available. Even though classification splitting rules are not grayed out, the actual setting is ignored in all regression runs.
- Gains charts, misclassification tables, and prediction success tables are no longer displayed in the Tree Summary Reports because they are not applicable.
- The Mean, or within-node average of the target variable, is reported for each node rather than a class assignment, and node distributions are displayed as box plots rather than as bar or pie graphs.

The only required step for growing a regression tree is to specify a target variable and a tree type in the Model Setup Model tab. If the o
172. - Set random number seed
- Specify default directories
- Open command log
- View data
- View descriptive statistics
- Display next pruning
- Assign class names and apply colors
- View main tree and/or sub-tree rules
- Overlay gains charts
- Specify level of detail displayed in tree nodes

EXPLORE: Generate frequency distributions.

MODEL, TREE, REPORT, WINDOW, HELP:
- Specify model setup parameters
- Grow trees (committee of experts)
- Generate predictions (score data)
- Translate models into SAS, C, PMML, or Java
- Prune/grow tree one level
- View optimal, minimum cost, and maximal trees
- View tree summary reports
- Control CART reporting facility
- Advanced HotSpot and TTC reports (featured in ProEX)
- Control various windows on the CART desktop
- Access online help

About CART Toolbar Icons

The commands used most commonly have corresponding toolbar icons. Use the following icons as shortcuts for:
- Open a data file
- Submit a command file or stored script
- Turning command-line entry mode on or off

43 CART BASICS

- Opening the command log to view your session history
- View data file
- Print the active window
- Cut selected text to clipboard
- Copy selected text to clipboard
- Paste clipboard text
- Set major reporting options and working directory locations
- Display statistics for current data
- Open activity window
- Model Setup
- Grow a tree or launch an analysis
- Grow an Ensemble or Committee of Experts model
173. ..................................................... 403
..................................................... 404
BASIC PROGRAMMING LANGUAGE ........................... 405
BASIC Programming Language ........................... 406
Getting Started with BASIC Programming Language ...... 406
Overview of BASIC Components ......................... 407
..................................................... 407
IF...THEN ............................................ 407
ELSE ................................................. 407
FOR...NEXT ........................................... 408
DO ................................................... 408
DELETE ............................................... 409
Operators ............................................ 409
BASIC Special Variables .............................. 409
BASIC Mathematical Functions ......................... 410
BASIC Probability Functions .......................... 411
Missing Values ....................................... 413
More Examples ........................................ 41
174. REM ************************************************************
REM INPUT/OUTPUT FILES
REM ************************************************************
USE "sample.csv"
GROVE "reg.grv"
OUTPUT "reg.dat"
REM ************************************************************
REM OPTIONS SETTINGS
REM ************************************************************
BOPTIONS SURROGATES=5, PRINT=5, COMPETITORS=5, TREELIST=10, BRIEF, SERULE=0, IMPORTANCE=1, MISSING=NO
LOPTIONS MEANS=YES, NOPRINT=NO, PREDICTIONS=YES, BOTH, TIMING=YES, PLOTS=YES, GAINS=NO, ROC=NO
FORMAT=7, UNDERFLOW
LIMIT MINCHILD=100, ATOM=200, NODES=5000, DEPTH=50, LEARN=100000, TEST=100000, SUBSAMPLE=100000
REM ************************************************************
REM MODEL SETUP
REM ************************************************************
MODEL Y1
CATEGORY
KEEP Z1, Z2, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
ERROR SEPVAR=T
METHOD LS
WEIGHT W
PENALTY MISSING=1,1, HLC=1,1
REM ************************************************************
REM BUILD MODEL
REM ************************************************************
BUILD
REM ************************************************************
REM QUIT CART
REM ************************************************************
QUIT

All lines starting with REM are comments and will be ignored by the command parser. We have marked commands of special interest with RED numbers.
175. TIMING YES, GAINS YES, ROC YES
> FORMAT 5
> REM Setting CART default options
> LOPTIONS NOPRINT NO, PLOTS YES, PS NO
> BOPTIONS SURROGATES 5, PRINT 5, COMPETITORS 5, CPRINT 5, TREELIST 10, BRIEF
> USE "C:\Program Files\Salford Data Mining\CART Pro EX 6.0\Examples\GYMTUTOR.CSV"

C:\Program Files\Salford Data Mining\CART Pro EX 6.0\Examples\GYMTUTOR.CSV uses ',' as delimiter.
VARIABLES IN RECT FILE ARE:
ANYRAQT ONAER NSUPPS TANNING ANYPOOL SMALLBUS PERSTRN CLASSES SEGMENT
C:\Program Files\Salford Data Mining\CART Pro EX 6.0\Examples\GYMTUTOR.CSV: 293 records

> REM Resetting Preferences
> REM Setting General default options
> LOPTIONS MEANS YES, PREDICTIONS YES, BOTH, TIMING YES, GAINS YES, ROC YES
> FORMAT 5
> REM Setting CART default options
> LOPTIONS NOPRINT NO, PLOTS YES, PS NO
> BOPTIONS SURROGATES 5, PRINT 5, COMPETITORS 5, CPRINT 5, TREELIST 10, BRIEF
> C:\Program Files\Salford Data Mining\GYMTUTOR.CSV

Setting Up the Model

The Model Setup dialog tabs are the primary controls for conducting CART analyses, with the analysis functions most commonly used conveniently located in eleven Model Setup tabs. After you open a data set, setting up a CART analysis entails several logical groups of decisions, all of which are carried out in one of the Model Setup dialog tabs:

Model: select target and predictor variables, specify categor
176. CREDIT_LIMIT <= 5546.000 / CREDIT_LIMIT > 5546.000 (left child node)

59 CART BASICS

[Navigator 2 Sub-Tree display: Node 5 splits on N_INQUIRIES <= 4.500 (N = 326); its children split on NUMCARDS <= 1.500 (N = 160) and NUMCARDS <= 4.500 (N = 166); deeper splits use TIME_EMPLOYED <= 9.000 and INCOME <= 2342.000, ending in terminal nodes assigned to classes 0 and 1. The right child node is highlighted.]

Assigning Labels and Color Codes

Trees detailing sample breakdowns can be displayed with or without colors; the node histograms are always color-coded. Instructions for customizing the colors appear below. If your target variable is coded as text, then the text value labels will be displayed where required; but if your target variable is coded as a number, you can replace the numbers with labels via Class Names. Class names (up to 32 characters) and colors can be assigned to each level of the target variable from V
177. [Model Setup dialog, Model tab: variable selection grid listing HOME, CLASSES, FIT, NFAMMEM, TANNING, ANYPOOL, SMALLBUS, PERSTRN, and SEGMENT with Target, Predictor, Categorical, Weight, and Aux checkmark columns; Tree Type = Classification; Target Variable = SEGMENT; Sort = File Order; Number of Predictors = 4.]

After specifying our modeling and auxiliary variables, Start is pressed. The resulting Navigator looks as follows (color coding has been activated for SEGMENT = 2).

136 Chapter 4 Classification Trees

[Navigator 1: classification tree topology for SEGMENT, color-coded using Target SEGMENT = 2, with panels for Model Statistics (predictors, important nodes, min node cases, relative cost, ROC train/test, number of nodes) and Displays and Reports (Model, Splitters, Tree Details, Summary Reports, Translate, Score). Terminal node 6 is highlighted.]

According to the current color coding, terminal node 6 captures the majority of the second segment. Now right-click on this node and choose Auxiliary Variables.

[Navigator 2 T
178. a sample SAS output.

Classic Options

When translating into Classic, you may further define which pieces of information should be included.

Translating in the Command Line

For command-line translating you must have a grove file saved separately. To translate:

1. Issue a GROVE "file_name.grv" command to specify the grove file.

2. Issue one of the following commands to specify a tree other than the optimal tree. If this command is not issued, the optimal tree will be used by default:
HARVEST PRUNE TREENUMBER <N>
HARVEST PRUNE NODES <N>

3. Depending on the language, issue one of the following commands:
TRANSLATE LANGUAGE=SAS, SMI=<missing value string>, SBE=MODELBEGIN, SDO=MODELDONE, SNO=NODE, STN=TNODE, OUTPUT="file_name.sas"
TRANSLATE LANGUAGE=CLASSIC, VLIST=YES, TLIST=YES, DETAILS=YES, SURROGATES=YES, OUTPUT=<file_name>
TRANSLATE LANGUAGE=PMML, OUTPUT=<file_name.xml>

183 Chapter 7 Scoring and Translating

Exporting and Printing Tree Rules

The scoring procedure described above generates detailed output. The resulting code includes not only main splitters but also all surrogate splits as alternative conditions. While having this code is invaluable for external scoring, it might be overkill if all you need is a set of simple rules based on main splitters only. CART 6 has inherited an older, simplified version of model translation that is accessible by selecting Rules from the View menu when the c
179. ability to classify the larger classes accurately. PRIORS DATA is perfectly reasonable when the importance of classification accuracy is proportional to class size. Consider a model intended to predict which political party will be voted for, with the alternatives of Conservative, Liberal, Fringe1, and Fringe2. If the fringe parties together are expected to represent about 5% of the vote, an analyst might do better with PRIORS DATA, allowing CART to focus on the two main parties for achieving classification accuracy.

Six different priors options are available, as follows:

EQUAL    Equivalent to weighting classes to achieve BALANCE (default setting).
DATA     Larger classes are allowed to dominate the analysis.
MIX      Priors set to the average of the DATA and EQUAL options.
LEARN    Class sizes calculated from the LEARN sample only.
TEST     Class sizes calculated from the TEST sample only.
SPECIFY  Priors set to user-specified values.

120 Chapter 4 Classification Trees

[Model Setup Priors tab: radio buttons for EQUAL (all categories have equal probability), LEARN (probabilities match learn sample frequencies), TEST (probabilities match test sample frequencies), DATA (probabilities match total sample frequencies), MIX (the average of EQUAL and DATA), and SPECIFY (specify priors for each level), along with the Save Grove, CART, Combine, Cancel, Continue, and Start buttons.]

Default Priors setti
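As a quick illustration of how the EQUAL, DATA, and MIX schemes relate, here is a small Python sketch (our own illustration, not CART code; CART computes priors internally) that turns class counts into prior probabilities. LEARN and TEST behave like DATA but with counts taken from the corresponding subsample:

```python
def priors(counts: dict, scheme: str = "DATA") -> dict:
    """Class prior probabilities under the EQUAL, DATA, or MIX schemes.

    counts: {class_label: number of cases of that class}
    """
    total = sum(counts.values())
    k = len(counts)
    data = {c: n / total for c, n in counts.items()}   # match sample frequencies
    equal = {c: 1.0 / k for c in counts}               # all classes equally likely
    if scheme == "DATA":
        return data
    if scheme == "EQUAL":
        return equal
    if scheme == "MIX":  # average of the EQUAL and DATA priors
        return {c: (data[c] + equal[c]) / 2.0 for c in counts}
    raise ValueError("unknown scheme: " + scheme)
```

With the voting example's flavor of imbalance, say counts of 475, 475, 30, and 20 across the four parties, a fringe party holding 2% of the data gets prior 0.02 under DATA, 0.25 under EQUAL, and the midpoint 0.135 under MIX.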
180. acy (overall accuracy) of the model. The Battery Summary Error Profiles tab shows the actual error profiles for each run.

[Battery Summary window, Error Profiles tab: relative error plotted against number of nodes (10 through 130) for each model in the battery, with Chart View, Model Quality (Misclass/ROC), Sample (Test/Learn), and Model Size (Min Cost/1 SE) controls and a Save Grove button.]

Relative error profiles are shown when the Misclass button is pressed. The areas under the ROC profiles are shown when the ROC button is pressed. You can also switch between Learn and Test profiles. The vertical markers indicate the optimal tree positions (when the Min Cost button is pressed) or the 1 SE tree positions (when the 1 SE button is pressed). The legend can be turned on or off using the Legend button. Finally, the Battery Summary Var Imp Averaging tab shows the results of variable importance averaging across all models in the battery.

205 Chapter 10 CART Batteries

[Battery Summary window, Var Imp Averaging tab: averaged variable importance chart for predictors such as the WORD_FREQ_* and CAPITAL_RUN_LENGTH_* variables.]
181. acy 203 Avg ROC 203 Class Accuracy 203 Class ROC 203 Opt Terminal Nodes 203 Overall Accuracy 204 Rel Error 203 Activity Window 18 146 ADJUST command 328 advanced options 111 advanced programming features 415 advanced settings 147 148 Advanced tab 85 147 148 232 233 ARCing 162 163 power setting 164 ASCII files 33 character variables 33 numeric variables 33 association 248 252 auto validation 19 AUXILIARY command 329 auxiliary variables 85 91 color coding 136 merge selected groups 138 split selected groups 139 viewing information 134 Averaging tab 208 B bagging 162 bar chart 247 BASIC data management 14 mathematical functions 410 probability functions 411 programming language commands 416 programming language overview 406 special variables 409 BASIC programming language 297 298 batch command file 297 batch processing 296 298 Batteries 16 200 CV 206 CVR 208 DEPTH 210 DRAW 210 FLIP 211 KEEP 212 LOVO 214 MCT 215 MINCHILD 216 MVI 216 NODES 218 ONEOFF 218 PRIOR 219 RULES 220 SAMPLE 221 SHAVING 222 SUBSAMPLE 223 TARGET 224 battery 147 battery models 194 Battery Options 200 Battery Summary 1 SE Terminal Nodes 201 Accuracy tab 202 Avg ROC 202 Classification Battery Models 201 Contents tab 202 Error Profiles tab 204 Model Name 201 Model Specifications 202 Opt Terminal Nodes 201 Rel Error 202 Var Imp Averaging tab 208 Var
182. ag data relative to the committee-of-experts tree. The prediction success tables for the committee and for the initial tree are also displayed in the Combine Results dialog (see example below).

167 Chapter 6 Ensemble Models and Committees of Experts

[Combine Results dialog: Initial Tree and Committee tabs with a Committee Prediction Success table (actual class by predicted class, total cases, and percent correct; Count/Row/Column views), along with the Grove and Data file paths and a Save Grove button.]

In the Report Details group box you can change the default report contents as well as request the following additional reports:

- Initial tree: standard text report (tree sequence, node details, etc.) for the tree grown on the entire in- and out-of-bag data.
- Committee trees: standard text report for each expert tree grown in the series.
- Repeated cases: summary tables displaying the proportion of observations repeated in each resample, displayed for each committee tree and for the committee as a whole.

Given that the initial tree is constructed using CART's default tree-building settings, another benchmark you may want to consider when evaluating the performance of your committee of experts is a single CART tree built using options appropriate for your particular application (e.g., you may want to experiment with different splitting rules, priors, costs, etc.).
183. allation from external media. Installers may be downloaded from our web and ftp sites, eliminating the need for the CD-ROM drive (Windows 2000/2003/XP).

Recommended System Configuration

Because CART is extremely CPU-intensive, the faster your CPU, the faster CART will run. For optimal performance, we strongly recommend that CART run on a machine with a system configuration equal to or greater than the following:

- Pentium 4 processor running at 1.0 GHz.
- Amount of RAM needed depends on the size of CART you have licensed (128 MB, 256 MB, 512 MB, 1 GB, 2 GB). While several versions of CART will run with a minimum of 128 MB of RAM, we urge you to follow the recommended memory configuration that applies to your version of CART.

25 Installing and Starting CART

  Using less than the recommended memory configuration results in hard-drive paging, reducing performance significantly.
- Hard disk with 40 MB of free space for program files, the data file access utility, and sample data files.
- Additional hard disk space for scratch files, with the required space contingent on the size of the input data set.
- CD-ROM or DVD drive to install from external media. All CART installation files, including documentation, are also available over internet connections (Windows 2000/2003/XP).
- 2 GB of additional hard disk space available for virtual memory and temporary files.

Installation Procedure

From CD-ROM

To install CART:

1. Insert the CD labeled CART
184. amming Language

DELETE: Deletes the current case from the data set.

Operators

The table below lists the operators that can be used in BASIC statement expressions. Operators are evaluated in the order they are listed in each row, with one exception: a minus sign before a number (making it a negative number) is evaluated after exponentiation and before multiplication or division. The <> is the "not equal" operator.

Numeric operators:    ^  *  /  +  -
Relational operators: <  <=  <>  =  >=  >
Logical operators:    AND  OR  NOT

BASIC Special Variables

BASIC has five built-in variables available for every data set. You can use these variables in BASIC statements and create new variables from them. You may not redefine them or change their values directly.

Variable  Definition                                   Values
CASE      observation number                           1 to maximum observation number
BOF       logical variable for beginning of file       1 for first record in file, 0 otherwise
EOF       logical variable for end of file             1 for last record in file, 0 otherwise
BOG       logical variable for beginning of BY group   1 for first record in BY group, 0 otherwise
EOG       logical variable for end of BY group         1 for last record in BY group, 0 otherwise

BY groups are not supported in CART, so BOG and EOG are synonymous with BOF and EOF.

410 Appendix IV BASIC Programming Language

BASIC Mathematical Functions

Integrated BASIC also has a number of mathematical and statistical functions. The statistica
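The special variables behave like per-record flags computed while streaming the data set. As a loose analogy (illustrative Python, not BASIC, and not part of CART), one could write:

```python
def with_special_vars(records):
    """Yield (CASE, BOF, EOF, record) tuples mimicking BASIC's built-in
    CASE, BOF, and EOF variables. BOG/EOG would be identical to BOF/EOF
    here, since CART has no BY groups."""
    records = list(records)
    last = len(records)
    for case, rec in enumerate(records, start=1):
        bof = 1 if case == 1 else 0     # first record in the file
        eof = 1 if case == last else 0  # last record in the file
        yield case, bof, eof, rec
```

Iterating over three records produces CASE values 1, 2, 3 with BOF set only on the first and EOF only on the last.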
185. and Language

Commands 1 through 3 control which files will be used or created.

1 >> The USE command specifies the data set to be used in modeling. CART has built-in support for comma-separated ASCII files. You may also access other supported file formats using DATABASE CONVERSION drivers.

Use the GUI Command Log facility to learn quickly how to access the various available file formats through DATABASE CONVERSION.

2 >> The GROVE command specifies the binary grove file to be created in the current directory. This file, which contains detailed model information, will be needed for the scoring and translating described later. This binary file is needed to view trees and model results from inside the CART GUI. It includes complete information about the model-building process, including pruning sequences and multiple collections of trees when applicable.

3 >> The OUTPUT command specifies the classic output file. This text file will report basic information about the data, the model-building process, and the optimal tree. The contents of this file, which are somewhat limited, may be controlled using the LOPTIONS and FORMAT commands.

Commands 4 through 7 control various engine settings.

4 >> The BOPTIONS command sets important model-building options.
5 >> The LOPTIONS command sets various reporting options.
6 >> The FORMAT command sets the number of decimal digits to be reported.
7 >> The LIMIT command sets va
186. and is frequently the best choice.

SYMGINI: May be used with variable misclassification costs.
TWOING: Is a competitor to GINI.
ORDERED: Can be used for ordered categorical dependent variables.
PROB: Requests probability trees instead of classification trees.
ENTROPY: Is a modification of GINI using p*log(p) rather than p*(1-p).
POWER=<x>: Can be used to tune CART away from end-cut splits.

The REGRESSION tree command syntax is:

METHOD LS | LAD

LS uses a least-squares measure of within-node dispersion, and LAD uses a least absolute deviation measure.

Examples:
METHOD TWOING        (use TWOING for classification)
METHOD LAD           (use LAD for regression)
METHOD ENTROPY, LS   (use ENTROPY for classification and least squares for regression)

376 Appendix III Command Reference

MISCLASS

Purpose

The MISCLASS command specifies misclassification costs.

To specify unit misclassification costs, use:

MISCLASS UNIT

To specify other than unit costs, use one of the following command forms:

MISCLASS COST=<x> CLASSIFY <n1>,<n2> AS <m>
MISCLASS COST=<x> CLASSIFY <n> AS <m1>,<m2>

in which <depvar> is the dependent variable and <indep_list> is an optional list of potential predictor variables. If no <indep_list> is specified, all variables are used for CART processing unless K
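To see what misclassification costs do in principle, consider how a node's class assignment can be chosen to minimize expected cost. This is a hedged Python sketch of the general idea only, with names of our own choosing; CART's internal handling also folds costs into splitting (e.g., via SYMGINI) and is more involved:

```python
def best_assignment(node_probs, costs):
    """Pick the class assignment minimizing expected misclassification cost.

    node_probs: {true_class: probability of that class within the node}
    costs: {(true_class, assigned_class): cost}; any pair not listed
           defaults to unit cost off the diagonal and zero on it.
    """
    def expected_cost(assigned):
        return sum(
            p * costs.get((true, assigned), 0.0 if true == assigned else 1.0)
            for true, p in node_probs.items()
        )
    return min(node_probs, key=expected_cost)
```

With node probabilities {0: 0.7, 1: 0.3} and unit costs, the node is assigned class 0. Raising the cost of classifying a true 1 as 0 to 5 makes the expected cost of assigning 0 equal to 1.5 versus 0.7 for assigning 1, so the assignment flips to class 1.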
187. anding of the data from a cost-neutral perspective.

The Priors Tab

The Model Setup Priors tab is one of the most important options you can set in shaping a classification analysis, and you need to understand the basics to get the most out of CART. Although the PRIORS terminology is unfamiliar to most analysts, the core concepts are relatively easy to grasp. Market researchers and biomedical analysts make use of the priors concepts routinely, but in the context of a different vocabulary.

We start by discussing a straightforward 0/1 or YES/NO classification problem. In most real-world situations, the YES or 1 group is relatively rare. For example, in a large field of prospects, only a few become customers; relatively few borrowers default on their loans; only a tiny fraction of credit card transactions and insurance claims are fraudulent; and so on. The relative rarity of a class in the real world is usually reflected in the data available for analysis. A file containing data on 100,000 borrowers might include no more than 4,000 bankrupts for a mainstream lender. Such unbalanced data sets are quite natural for CART and pose no special problems for analysis. This is one of CART's great strengths and differentiates CART from other analytical tools that do not perform well unless the data are balanced.

The CART default method for dealing with unbalanced data is to conduct all analyses using measures that are relative to each class. In our example of 1
188. anguage diagram is completely automated, so that the user does not need to worry about what is taking place underneath. A more demanding user may write separate command files, with or without the help of the GUI front end. This feature is especially attractive for audit-trail and process-automation tasks. Given that the current release of CART for UNIX is entirely command-line driven, the user running CART for UNIX will fall into this category.

The CART engine reads data off the hard drive for modeling or scoring, takes grove files for scoring, or executes command files when requested. In addition, the engine may generate new data with scoring information added, create grove files for models, and save classic text output. The following sections provide in-depth discussions for users who have chosen to utilize command-line controls.

Alternative Control Modes in CART for Windows

In addition to controlling CART with the graphical user interface (GUI), you can control the program via commands issued at the command prompt or via submission of a command (.cmd) file. This built-in flexibility enables you to avoid repetition, create an audit trail, and take advantage of the BASIC programming language.

Avoiding Repetition

You may need to interact with several dialogs to define your model and set model estimation options. This is particularly true when a model has a large number of variables or many categorical variables, or when more than jus
189. are allowed to be used as primary splitters (i.e., competitors) and as surrogates at all depths and node sizes. For each predictor, the DISALLOW command is used to specify at which depths, and in which partitions by size, the predictor is NOT permitted to be used, either as a splitter, as a surrogate, or both. The syntax is:

DISALLOW <variable>, <variable>, ... / ABOVE=<depth>, BELOW=<depth>, MORE=<node_size>, FEWER=<node_size>, SPLIT, SURROGATE

To enable a DISALLOW command to apply to all variables, use the syntax:

DISALLOW / ABOVE=<depth>, BELOW=<depth>, MORE=<node_size>, FEWER=<node_size>, SPLIT, SURROGATE

Note that the ABOVE and BELOW options may be used together to describe the following depth ranges in which a variable is not used (D = depth):

ABOVE=N: The variable will not be used if D <= N, i.e., at depth N or shallower.
BELOW=M: The variable will not be used if D >= M, i.e., at depth M or deeper.
ABOVE=N BELOW=M, N >= M: This defines a single depth range in which the variable will not be used, i.e., the variable will not be used if the depth is between N and M inclusive.
ABOVE=N BELOW=M, N < M: This defines two depth ranges in which the variable will not be used. The variable will not be used if D <= N (depth N and shallower) or if D >= M (depth M and deeper).

Similarly for the MORE and FEWER options, which operate on the node size (number of
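The ABOVE/BELOW depth logic can be captured in a few lines. This is an illustrative sketch of the rule as described in the text, not CART source code; the function and parameter names are our own:

```python
def make_depth_filter(above=None, below=None):
    """Return a predicate telling whether a variable is DISALLOWed at a
    given depth: ABOVE=N blocks depth N and shallower, BELOW=M blocks
    depth M and deeper. When both are given with N >= M, the two limits
    describe a single blocked band between M and N inclusive; with
    N < M they describe two blocked bands."""
    def disallowed(depth):
        if above is not None and below is not None:
            if above >= below:
                # single band: blocked between BELOW and ABOVE inclusive
                return below <= depth <= above
            # two bands: shallow (<= ABOVE) and deep (>= BELOW)
            return depth <= above or depth >= below
        if above is not None:
            return depth <= above
        if below is not None:
            return depth >= below
        return False
    return disallowed
```

For example, make_depth_filter(above=2, below=5) blocks depths 1-2 and 5-and-deeper while leaving depths 3 and 4 available, whereas make_depth_filter(above=5, below=3) blocks only depths 3 through 5.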
190. arget column. Select the remaining variables and place a checkmark in the Predictors column. Also place checkmarks in the Categorical column against those predictors that should be treated as categorical. For our example, specify ANYRAQT, TANNING, ANYPOOL, SMALLBUS, HOME, and CLASSES as categorical predictor variables.

268 Chapter 12 Features and Options

The resulting Model Setup tab will look like the following:

[Model Setup dialog, Model tab: variable selection grid listing NFAMMEM, TANNING, ANYPOOL, SMALLBUS, FIT, HOME, PERSTRN, CLASSES, and SEGMENT with Target, Predictor, Categorical, Weight, and Aux checkmark columns; Tree Type = Classification; Target Variable = SEGMENT; Sort = File Order; Number of Predictors = 12.]

Now let's take a look at the Model Setup Force Split tab and specify a root node split. In this example we only want to force a split on a specific variable, without concern for the split value itself. Later we will force both a split variable and a value. To specify the root node split, select FIT from the variable list and click the Set Root button. This tells CART that the root node split must
191. as the classic CART output appearing in the Classic Output window.

Using the Report Writer is easy. One way is to copy selected reports and diagrams to the Report window as you view the CART results dialog or output windows. Once processing is complete, a CART results window appears, allowing you to explore the performance with a variety of graphic reports, statistics, and diagrams. Virtually any graph, table, grid display, or diagram can be copied to the Report Writer. Simply right-click the item you wish to add to the Report Writer and select Add to Report. The selection will appear at the bottom of the Report window.

CART also produces classic output for those users more comfortable with a text-based summary of the model and its performance. To add any or all of CART's classic output to the Report Writer window, highlight text in the classic output window, copy it to the Windows clipboard (Ctrl+C), switch to the Report Writer window, and paste (Ctrl+V) at the point you want the text inserted. Thus you can combine those CART result elements you find most useful, whether graphic in nature and originating in the CART results dialog, or textual in nature from the classic output, into a single custom report.

Only one Report window is available at a time.

To see whether a given table or chart can be added to the Report, simply right-click on the item you wish to add and see whether the Add to Report line is available in the pop-up m
192. at which the split would be made and the improvement yielded by the split. The best competitor, CLASSES, would split at the value 0.500 and would yield an improvement of 0.234, not far below the improvement afforded by the optimal split. The quality of the competitor splits relative to the primary split can also be evaluated by inspecting the line graph displayed in the upper-right panel. The improvement yielded by each competitor split appears on the y-axis, and the number (or rank) of the competitor split on the x-axis, with the primary split improvement displayed at X = 0.

The top five surrogates are listed in the bottom-right panel, along with the splitting criterion, the association value, and the improvement yielded by the surrogate split. In this example, the best surrogate, ANYPOOL, has an association value of 0.439, resulting in an improvement of 0.156 in the misclassification rate.

See the main reference manual for detailed information about how CART calculates and uses Competitors and Surrogates.

See the main reference manual for a detailed discussion of association and improvement.

253 Chapter 11 CART Segmentation

The Root Competitor Splits Tab

If the root node is selected, the second tab shows the competing root node splits; otherwise this tab is omitted.

[Navigator 1, Node 1 Root Competitor tab: Competitors and Surrogates display; the splitter is ANYRAQT = 0 (main) versus ANYRAQT = 1; CLASS
aximal tree. Note that the number of test-sample data points that may be analyzed is unlimited.

For example, suppose you are using our 32MB version, which sets a learn-sample limitation of 8 MB. Each data point occupies 4 bytes, so an 8MB license will allow up to 8 * 1024 * 1024 / 4 = 2,097,152 learn-sample data points to be analyzed. A data point is represented by one variable for one observation (1 row by 1 column).

In general, the analysis workspace provided to build the tree will be adequate for most modeling scenarios. However, if the user models a large number of high-level categorical predictors, or is using a high-level categorical target, the user may encounter workspace limitations that will not allow the entire learn sample to be used. In these special cases, the user will have to upgrade to a larger memory version or use one of the options discussed below.

Workspace Usage

Because CART checks every possible split at every node, CART must store the full data set in memory when it is building a tree. In certain situations, it may be necessary to restrict the size of the maximal tree grown so that the analysis will fit into the workspace available on your computer. If the available workspace is not large enough to grow the requested tree, a CURRENT MEMORY REQUIREMENTS table will appear in the CART Report window that looks something like the following:

  CURRENT MEMORY REQUIREMENTS
    TOTAL:     41492578
    DATA:       2223939
    ANALYSIS:  41492578
    AVAILABLE: 3
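The capacity arithmetic above can be reproduced in a short sketch. This is illustrative only (not part of CART); the 4-bytes-per-data-point figure and the 8 MB example come from the text, while the function names are our own.

```python
BYTES_PER_DATA_POINT = 4  # each data point (1 variable x 1 observation) occupies 4 bytes

def max_data_points(workspace_mb: int) -> int:
    """Learn-sample data points that fit in a workspace of the given size in MB."""
    return workspace_mb * 1024 * 1024 // BYTES_PER_DATA_POINT

def max_observations(workspace_mb: int, n_variables: int) -> int:
    """Rows that fit, since data points = rows x columns."""
    return max_data_points(workspace_mb) // n_variables

# An 8 MB learn-sample limit holds 2,097,152 data points; with, say,
# 32 variables per record that is 65,536 observations.
```

A dataset with more variables therefore supports proportionally fewer observations within the same workspace.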
b/50. If b = 0, the variable would receive 100% credit, because we would be ignoring its degree of missingness.

Note: In most analyses, we find that the overall predictive power of a tree is unaffected by the precise setting of the missing value penalty. However, without any missing value penalty, you might find heavily missing variables appearing high up in the tree. The missing value penalty thus helps generate trees that are more appealing to decision makers.

High-Level Categorical Penalty

Categorical predictors present a special challenge to decision trees. Because a 32-level categorical predictor can split a data set in over two billion ways, even a totally random variable has a high probability of becoming the primary splitter in many nodes. Such spurious splits will not prevent CART from eventually detecting the true data structure in large data sets, but they make the process inefficient. First, they add unwanted nodes to a tree; and, as they promote the fragmentation of the data into added nodes, the reduced sample size as we progress down the tree makes it harder to find the best splits.

To protect against this possibility, CART offers a high-level categorical predictor penalty used to reduce the measured splitting power. On the Basic Penalty dialog, this is controlled with a simple slider. The Advanced Penalty dialog allows access to the full penalty expression, in which the improvement factor is expressed in terms of the log of the node size and the number of levels.
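The "over two billion ways" figure follows from a simple count: a categorical predictor with k levels can send levels left or right in 2^(k-1) - 1 distinct nontrivial ways. A small sketch (ours, not CART's code) makes the arithmetic concrete:

```python
def n_binary_partitions(levels: int) -> int:
    """Distinct ways to split k categorical levels into two nonempty groups.

    Each level goes left or right (2**k assignments); halving for the
    left/right symmetry and removing the trivial all-on-one-side split
    gives 2**(levels - 1) - 1.
    """
    if levels < 2:
        return 0
    return 2 ** (levels - 1) - 1

# A binary predictor allows exactly 1 split; a 32-level predictor
# allows 2**31 - 1 = 2,147,483,647 splits, i.e. "over two billion".
```

This exponential growth is why even a purely random high-level categorical can look like a strong splitter, and why the penalty is useful.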
ble. It should be noted that CART 6 supports high-level categoricals through its proprietary algorithms, which quickly determine effective splits in spite of the daunting combinatorics of many-valued predictors. This feature was introduced in CART 4 and is increasingly important in view of CART 6's support for character predictors, which in real-world datasets often have hundreds or even thousands of levels. When forming a categorical splitter, traditional CART searches all possible combinations of levels, an approach in which time increases geometrically with the number of levels. In contrast, the time required by CART's high-level categorical algorithm increases only linearly, yet it yields the optimal split in most situations. See the section below titled High-Level Categorical Predictors for additional details.

Chapter 4 Classification Trees

Character Variable Caveats

Character variables are implicitly treated as categorical (discrete), so there is no need to declare them categorical. CART 6 has no internal limit on the length of character data values (strings); you are limited in this respect only by the data format you choose (e.g., SAS, text, Excel, etc.).

Note: Character variables (marked by $ at the end of the variable name) will always be treated as categorical and cannot be unchecked.

Note: Occasionally, columns stored in an Excel spreadsheet will be tagged as Character even though the values in the column are intended to be numeric. If this oc
bmit Window facility. Console versions of CART running in batch mode will terminate automatically once all commands have been processed. Any commands appearing in a command file after a QUIT command will be ignored.

REM

Purpose: The REM command is for comments. All subsequent text on that line is ignored. The REM command is especially useful when writing programs in BASIC and when writing command files. The command syntax is:

  REM <text>

Example:

  REM This is a comment line and is not executed

RUN

Purpose: RUN processes the input dataset(s), produces summary reports, and optionally creates two output datasets, but no modeling is done. Its syntax is:

  RUN SD=<saved dataset>, PP=<processed dataset>, PDM=<YES|NO>

The PDM option governs whether internal class labels are written to the preprocessed dataset (PDM=YES), rather than the original ones (PDM=NO, which is the default). The saved dataset can alternately be specified with the SAVE command.

Examples:

  REM Create a new dataset from the old one by adding a new variable
  REM and deleting some records
  USE INFILE.CSV
  SAVE OUTFILE.CSV
  %IF DEATHDATE=. OR BIRTHDATE=. THEN DELETE
  %LET DEATHAGE = (DEATHDATE - BIRTHDATE) / 365.25
  RUN

  REM Create a preprocessed dataset with categorical variable labels
  REM replaced with consecutively numbered ones
  REM in same order as o
cal. The command syntax is:

  CATEGORY <var1>, <var2>, ...

Examples:

  MODEL LOW
  CATEGORY LOW      ( a categorical dependent variable indicates a CLASSIFICATION tree )
  MODEL SEGMENT
  CATEGORY SEGMENT

CATEGORY is also used to identify categorical predictor variables; CART will determine the number of distinct values for you. Example:

  MODEL LOW
  CATEGORY LOW, AGE, RACE, EDUC

CDF

Purpose: The CDF command evaluates one or more distribution, density, or inverse distribution functions at specified values. For cumulative distribution functions, the syntax is:

  CDF NORMAL=z
      T=t,dof
      F=f,dof1,dof2
      CHISQUARE=chisq,dof
      EXPONENTIAL=x
      GAMMA=gamma,p
      BETA=beta,p,q
      LOGISTIC=x
      STUDENTIZED=x,p,q
      WEIBULL=x,p,q
      BINOMIAL=x,p,q
      POISSON=x,p

To generate density values, use the syntax above with the DENSITY option:

  CDF DENSITY <distribution name> = <user-specified value(s)>

To generate inverse cdf values, specify an alpha value between 0 and 1:

  CDF INVERSE NORMAL=alpha
      T=alpha,dof
      POISSON=alpha,p
      F=alpha,dof1,dof2
      CHISQUARE=alpha,dof
      EXPONENTIAL=alpha
      GAMMA=alpha,p
      BETA=alpha,p,q
      LOGISTIC=alpha
      STUDENTIZED=alpha,p,q
      WEIBULL=alpha,p,q
      BINOMIAL=alpha,p,q

Examples:

  CDF NORMAL=2.16
  CDF DENSITY NORMAL=2.5
  CDF INVERSE CHISQ=.8,3

CHARSET

Purpose: The CHARSET command allows you to select which ty
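For reference, the normal cases of the CDF and DENSITY evaluations can be reproduced with the Python standard library; this sketch (not CART code) uses the closed-form relation between the normal CDF and the error function.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution, Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_density(z: float) -> float:
    """Standard normal density, phi(z) = exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# normal_cdf(2.16) is about 0.9846, the value CDF NORMAL=2.16 would report.
```

Inverse-CDF (quantile) values have no closed form for the normal distribution and require a numeric root-finder, which is why CART exposes them as a separate INVERSE option.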
cal searches that are fast but explore only a limited range of possible splits. The default setting for the number of local splits to search is around 200. To change this default, and thus search more or less intensively, increase or decrease the search intensity gauge. Our experiments suggest that 200 is a good number to use and that little can be gained by pushing this above 400. As indicated in the Categorical dialog, a higher number leads to more intensive and longer searching, whereas a lower number leads to faster, less thorough searching.

If you insist on more aggressive searching, you should go to the command line. Command line users will use the following command syntax to define the high-level categorical thresholds:

  BOPTIONS NCLASSES = 20
  BOPTIONS HLC = 600, 10

BOPTIONS NCLASSES=20 turns on shortcut searching for categoricals with more than 20 levels. BOPTIONS HLC=600,10 conducts 600 local searches, each of which is subjected to a further 10 refinement searches. The default settings of BOPTIONS HLC=200,10 should suffice for most problems.

Remember that these controls are only relevant if your target variable has more than two levels. For the two-level binary target (the YES/NO problem), CART has special shortcuts that always work. Remember, too, that there are actually disadvantages to searching too aggressively for the best HLC splitter, as such searches increase the likelihood of overfitting the model to the trainin
cal or data mining software. CART and other Salford data mining modules now include an approach to cluster analysis, density estimation, and unsupervised learning, using ideas that we trace to Leo Breiman but which may have been known informally among statisticians at Stanford and elsewhere for some time. The method detects structure in data by contrasting original data with randomized variants of that data. Analysts use this method implicitly when viewing data graphically to identify clusters or other structure in data.

Take, for example, customer ages and handsets owned. If there were a pattern in the data, we would expect to see certain handsets owned by people in their early 20s and rather different handsets owned by customers in their early 30s. If every handset is just as likely to be owned in every age group, then no structure relates these two data dimensions. The method we use generalizes this everyday detection idea to higher dimensions.

Chapter 12 Features and Options

The method consists of the following steps:

1. Make a copy of the original data, and then reorder the data in each column using a random scramble. Do this one column at a time, using a different random ordering for each column, so that no two columns are scrambled in the same way. As an example, starting with data typical of a mobile phone company, suppose we randomly exchange date-of-birth information in our copy of the database. Thus, each customer record would
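Step 1, the per-column scramble, can be sketched in a few lines. This is an illustration of the idea, not CART's implementation; the data layout (a list of row lists) is assumed for the example.

```python
import random

def scrambled_copy(rows, seed=0):
    """Return a copy of the data with each column independently shuffled.

    Every column keeps exactly the same set of values (its marginal
    distribution is preserved), but any joint structure across columns
    is destroyed, which is what the method contrasts against.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)  # a different random ordering for each column
    return [list(row) for row in zip(*columns)]

data = [[23, "modelA"], [31, "modelB"], [45, "modelC"], [29, "modelA"]]
copy = scrambled_copy(data)
```

A classifier is then trained to separate original records from scrambled ones; wherever it succeeds, the original data must contain joint structure that the scramble destroyed.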
can easily end up with more than the equivalent of 1,000 pages of plain-text reports. We have now set the default to printing only summary tables, as most users do not refer to the classic text node detail.

Note: You can always recover the full node-detail text report from any saved grove file via the TRANSLATE facility. Thus, there is no longer any real need to produce this text during the normal tree-growing process.

Summary Plots

These are classic mainframe line-printer-style plots for a few classic CART graphs. You can see these plots in the GUI, so they are turned off by default.

Number of Surrogates to Report

Sets the maximum number of surrogates that can appear in the text report and the Navigator displays. This setting only affects the displays in the text report and the Navigator windows; it does not affect the number of surrogates calculated. The maximum number of surrogates calculated is set in the Best Tree tab of the Model Setup dialog. You can elect to try to calculate 10 surrogate splitters for each node but then display only the top five. No matter how many surrogates you request, you will get only as many as CART can find. In some nodes, no surrogates are found and the displays will be empty. The command line equivalent of the number of surrogates to report is:

  BOPTIONS PRINT = <N>

Number of Competitors to Report

Sets the maximum number of competitors that appear in
... 229
Opening a File ... 230
Setting Up the Model ... 232
Tree Navigator ... 235
Viewing Variable Splits ... 237
Viewing the Main Splitters ... 238
Viewing the Main Tree ... 239
Viewing Sub-trees ... 242
Assigning Labels and Color Codes ... 242
Printing the Main Tree ... 243
Tree Summary Reports ... 244
Gains Chart ... 245
Root Splits ... 247
Terminal Nodes ...
ccording to the results, removing CAPITAL_RUN_LENGTH_AVERAGE from the predictor list actually improves the relative error to 0.169.

Chapter 10 CART Batteries

Battery MCT

Battery MCT generates a Monte Carlo test of the significance of the model performance obtained in a given run. The target is first randomly permuted, thus destroying any possible dependency of the target on all remaining variables, and then a regular model is built. The process is repeated many times, and the resulting profiles are shown together with the actual run profile. One would want to see the actual run profile as far away from the MCT profiles as possible.

We illustrate this battery on the SPAMBASE.CSV data using a small list of predictors (see the MCT.CMD command file for details).

[Screen: Battery Summary window showing error profiles for models MCT_1 through MCT_25 plotted against the number of nodes, with the actual run profile plotted separately; Model Quality (Misclass, ROC), Sample (Test, Learn), and Model Size (Min Cost, 1SE) controls at the bottom.]

It is clear that even this arbitrarily chosen set of predictors is capable of capturing some useful signal. Note that the family of MCT
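The Monte Carlo logic behind Battery MCT can be sketched with the standard library alone. This is a simplified stand-in, not CART: instead of growing a tree on each permuted target, it recomputes a simple performance statistic (absolute correlation) and reports how often chance matches the actual run.

```python
import random

def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in model score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5

def mct_pvalue(x, y, score=abs_corr, n_perm=199, seed=7):
    """Share of permuted-target runs whose score matches or beats the actual run."""
    rng = random.Random(seed)
    actual = score(x, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # destroys any dependence of the target on x
        if score(x, y_perm) >= actual:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A small p-value means the actual profile sits far from the MCT profiles, which is exactly the visual criterion described above.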
ceed with the recommendations. Check the KEEP/EXCLUDE commands.

Warning 2: The following variables had more than 2000 distinct values. Check the KEEP/EXCLUDE commands for the presence of undesirable predictors.

Warning 3: CART is using v-fold cross validation on a training sample with <N> records. Using a test sample will speed up the run. Your data set is large enough to allow a separate test set.

Warning 4: Singularity solving for linear combination split. CART has encountered difficulties finding linear combination splits; univariate splits will be used instead for the node where the difficulty appeared.

Warning 5: The optimal tree has no splits and one node. According to the current set of PRIORS and COSTS, the null tree is better than any other tree CART has grown. This situation may also take place when growing regression trees on data sets with a lot of noise.

Warning 7: Obsolete syntax on CATEGORY command. The CATEGORY command no longer requires explicit level counts in CART 6.

Warning 10: Case weights are not supported for linear combinations. Support for weights in linear combinations will be implemented in future versions of CART.

Warning 11: Case weights are not supported for the LAD rule. Support for weights in LAD regression will be implemented in future versions of CART.

Appendix III Command Reference

This appendix provides a command language reference, including syntax and examples.
combinations, and whether linear combination searching is even attempted in a node, for this LCLIST.

  LCLIST SEARCH=<n>

Limits the linear combination search to consider only the topmost N univariate competitors in the LCLIST. The default is 10; the minimum value is 2. Smaller values reduce run time at the expense of perhaps not considering potentially valuable linear combinations.

Examples:

  LCLIST KEEP                           REM Enable LCs; allow all predictors to be considered
  LCLIST CRIM, ZN, INDUS, CHAS N=50     REM Specify LCLIST; set min node size to 50

LIMIT

Purpose: The LIMIT command allows tree growth limits to be set. The command syntax is:

  LIMIT ATOM=<n>, SUBSAMPLE=<n>, NODES=<n|AUTO>, DEPTH=<n|AUTO>,
        LEARN=<n|AUTO>, TEST=<n|AUTO>, DATASET=<n>, ERRORSET=<n>,
        MINCHILD=<n>

in which <n> is a whole number.

  ATOM       Minimum size below which a node will not be split. Default: 10.
  SUBSAMPLE  Node size above which a subsample is used to locate splits.
  NODES      Forecast of the number of terminal nodes in the largest tree grown.
             Default of AUTO lets CART set a value for you; an override
             allocates required workspace for unusual problems.
  DEPTH      Limits maximal tree growth to a specified depth. Default of AUTO
             forecasts the depth of the largest tree likely
crificed in favor of accuracy. While a few academic studies have embraced LCs, also known as oblique splits, they have largely not been used in practical modeling settings. Our new controls may not persuade you to make use of LCs, but they can help to make trees more interpretable and are likely to also give better results.

In CART 6.0, you may specify lists of variables (LC lists) from which any LC can be constructed. Every variable in an LC must appear on a single LC list. Thus, in a credit risk model, you might list credit report variables on one list, core demographics on another list, and current income-related variables on a third list. Such LC lists force the combinations of variables used in an LC splitter to all be of a specific type. Time series analysts might create a separate LC list for a variable and all its lagged values. If LC lists contain no more than two predictors, then any LCs used in the tree will be of the simplest possible form: a weighted average of two predictors.

CART ProEX includes a new control that allows an LC list to be limited to a specific node size, regardless of how many variables are on an LC list. Additionally, we have added controls to limit the number of variables allowed in an LC, an improvement adjustment for degrees of freedom (DOF), as well as an improvement penalty control.

Hot Spot Detection (CART 6.0 Pro EX)

When the goal of an analysis is to identify especially interesting subsets of the data
206. csv X coonean csy SQ HOSLem csv prostate2 csv File name GYMTUTOR CSV Files of type ASCII Delimited csv dat txt After you open GYMTUTOR a dialog opens automatically that gives information on the dataset and allows one to choose between data viewing stats modeling or scoring 231 Chapter 11 CART Segmentation C Program Files Salford Data Mining CART Pro EX 6 0 Examples GYMTUTOR CSV File Name GYMTUTOR CSY Location C Program Files S alford Data Mining CART Pro EX 6 0 Examples Modified Wednesday May 15 2002 10 09 34 AM Variables Records Variables Character Numeric Sort File Order X Activity Stats View Data Score If the Model button is clicked on the Model Setup dialog opens and the CART Output window appears in the background Hyperlinked Report Contents appears in the left panel of the Output window and text output in the right The initial text output contains the variable names the size of the file and the number of records read in 232 Chapter 11 CART Segmentation a CART Classic Output Ctrl Alt C AT File Edit View Explore Model Report Window Help a 5 01 a sje m sagal mael E Bal This launch supports up to 32768 variables Report Contents The license supports up to 100 MB of learn sample data gt REM Resetting Preferences gt REM Setting General default options gt LOPTIONS MEANS YES PREDICTIONS YES BOTH
ct ANYPOOL as the Left Child Node splitter. Repeat, using the Set Right button, for CLASSES. Because ANYPOOL is binary, no split value is specified. For the Right Child Node, check Set Split Value and then click the Change button.

[Screen: Model Setup, Force Split tab, "Specify Splitter For Root Node And Its Children": Root Node = ANYRAQT, Left Child Node = ANYPOOL, Right Child Node = CLASSES with Set Split Value checked.]

The resulting Set Root Node Splits Value dialog will appear:

[Screen: Set Right Child Node Split Value for CLASSES. Variable CLASSES is categorical; the dialog lists the values sent to the left and right child nodes, with Send To Right and Send To Left buttons.]

Unlike our previous example for continuous variables, this time we are using the lower portion of the dialog to specify the left/right direction for individual classes. To do so, select the classes you want to go left or right and then click either the Send To Right >> or the << Send To Left button. Variable CLASS
cted, you may list different pruning criteria for each with the HARVEST PRUNE LIST command. The command syntax is:

  HARVEST PRUNE LIST NODES=<n1,n2,...>, DEPTH=<n1,n2,...>, TREENUMBER=<n1,n2,...>

The options on the HARVEST SELECT command are:

  ALL             Select all trees in the grove.
  RELERR=<x>      Select all trees which, when pruned to optimal size, have a
                  test-sample relative error rate (or resubstitution error rate
                  if no test sample was used) less than <x>.
  COMPLEXITY=<x>  Select all trees which, when pruned to optimal size, have a
                  complexity threshold less than <x>.
  NODES=<n>       Select all trees which, when pruned to optimal size, have
                  less than or equal to <n> terminal nodes.
  RANDOM=<n>      Randomly select up to <n> trees from the grove.
  DEPTH=<n>       Select all trees which, when pruned to optimal size, are
                  less than or equal to <n> nodes deep.
  BEST=<n>        When used with the RELERR, COMPLEXITY, NODES, RANDOM, KEEP,
                  or EXCLUDE criterion, ensures that only the most accurate <n>
                  trees are selected from those meeting the original criterion.
                  Accuracy is based on the test-sample error rate (or
                  resubstitution error rate if no test sample was used).

  HARVEST CVTREES = YES|NO

specifies whether ancillary trees created as part of a CART cross-validation model are selected. By default, they are not. A
ctors (all values the same for all records). Thus, having a constant predictor in the training data will effectively turn off linear combinations for the entire tree. Command line users will use the following command syntax to specify linear combinations:

  LINEAR N=<min_cases>, DELETE=<signif_level>, SPLITS=<max_splits>

LC Lists: Use Only Selected Variables

LC lists are a new addition to CART and can radically improve the predictive power and intuitive usefulness of your trees. In legacy CART, if you request a search for linear combination splitters, ALL the numeric variables in your predictor KEEP list are eligible to enter the linear combination (LC). In every node with a large enough sample size, CART will look for the best possible LC, regardless of which variables combine to produce that LC.

We have found it helpful to impose some structure on this process by allowing you to organize variables into groups from which LCs can be constructed. If you create such groups, then any LC must be constructed entirely from variables found in a single group. In a biomedical study, you might consider grouping variables into demographics (such as AGE and RACE), lifestyle or behavioral variables (such as SMOKE and FTV), and medical history and medical condition variables (such as UI, PTD, and LWT). Specifying LC lists in this way will limit any LCs constructed to those that can be created from the variables in a single list.
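An LC (oblique) split simply sends a case left when a weighted sum of predictors falls at or below a threshold. A minimal sketch of that decision rule (ours, for illustration; the weights and threshold shown are made up, not CART output):

```python
def lc_goes_left(values, weights, threshold):
    """A case goes left if sum(w_i * x_i) <= threshold."""
    return sum(w * v for w, v in zip(weights, values)) <= threshold

# Hypothetical two-variable LC, e.g. 0.7 * AGE_SCALED + 0.3 * LWT_SCALED <= 1.5.
# With only two predictors on the LC list, this is just a weighted
# average of two variables, the simplest possible LC form.
left = lc_goes_left([1.0, 2.0], [0.7, 0.3], 1.5)
```

Grouping variables into LC lists restricts which predictors may appear together in the `weights` vector, which is what keeps the resulting splitters interpretable.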
curs with your data, refer to the READING DATA section to remedy this problem.

Categorical vs. Continuous Predictors

Depending on whether a variable is declared as continuous or categorical, CART will search for different types of splits. Each takes on a unique form.

Continuous Split Form

Continuous splits will always use the following form:

  A case goes left if [split variable] <= [split value]

A node is partitioned into two children such that the left child receives all the cases with the lower values of the split variable.

Categorical Split Form

Categorical splits will always use the following form:

  A case goes left if [split variable] = level_i OR level_j OR level_k

In other words, we simply list the values of the splitter that go left; all other values go right.

Note: If a categorical variable with many levels is coded as a number, it may actually be helpful to treat it as a continuous variable. This is discussed further in a later chapter.

One should exercise caution when declaring continuous variables as categorical, because a large number of distinct levels may result in significant increases in running times and memory consumption.

Note: Any categorical predictor with a large number of levels can create problems for the model. While there is no hard and fast rule, once a categorical predictor exceeds about 50 levels, there are likely to be compelling reasons to try to
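The two split forms reduce to two one-line tests; this sketch (ours, not CART code) shows both:

```python
def goes_left_continuous(value, split_value):
    """Continuous split: a case goes left if value <= split value."""
    return value <= split_value

def goes_left_categorical(value, left_levels):
    """Categorical split: a case goes left if its level is in the listed set."""
    return value in left_levels

# Continuous: cases with the lower values go to the left child.
age_left = goes_left_continuous(42, 50.5)

# Categorical: listed levels go left, all other levels go right.
region_left = goes_left_categorical("B", {"A", "B", "D"})
```

This is also why declaring a numerically coded categorical as continuous changes the search: CART then looks for a single threshold rather than for a subset of levels.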
d dictates which variable should be used to split a node:

  Force Split    dictates which variable should be used to split a node
  Constraints    specifies splitter variable criteria
  Testing        specifies which test method to use
  Select Cases   selects records to use
  Best Tree      specifies the best tree selection method
  Method         specifies the splitting rule used to grow the tree
  Costs          specifies the cost of making specific mistakes
  Priors         specifies how to balance unequal classes
  Penalty        sets penalties on predictors, missing values, categoricals
  Advanced       specifies other model-building options
  Battery        specifies modeling automation

The only required step is to specify a target variable and tree type in the Model Setup Model tab. For most users, the default settings for any tab are reasonable and suffice to obtain useful models with good to excellent performance. As you become more accustomed to the software, you might experiment with the available controls to see if you can improve your results. We also provide automatic experimentation for you using the Battery tab, described in detail later.

If the other Model Setup dialog tabs are left unchanged, the defaults used are:

- All variables in the data set other than the target will be used as predictors (the Model tab)
- No weights will be applied (the Model tab)
- 10-fold cross validation for testing (the Testing tab)
- Minimum cost tree will become the best tree (the Best Tree tab)
- Only five surrogates will be tracked, and they will all count equally in the variable importance formula (the Best Tree tab)
- GINI splitting
d (the Model tab)
- 10-fold cross validation for testing (the Testing tab)
- Minimum cost tree will become the best tree (the Best Tree tab)
- Only five surrogates will be tracked and will count equally in the variable importance formula (the Best Tree tab)
- GINI splitting criterion for classification trees and least squares for regression trees (the Method tab)
- Unitary (equal) misclassification costs (the Costs tab)
- Equal priors (the Priors tab)
- No penalties (the Penalty tab)
- Parent node requirements set to 10 and child node requirements set to 1 (the Advanced tab)
- Allowed sample size set to the currently open data set size (the Advanced tab)
- 3000-record limit for the cross-validation warning

Additional tree-building and reporting options include enabling linear combination splits, combining trees using bagging or ARCing, limiting the size and structure of the tree, filtering data and limiting the size of the test and learn samples, exporting tree rules, and identifying where to permanently save the CART output, navigator file, and tree models. See Chapters 3, 4, and 5 for further discussion of the default settings and an extended tutorial using the other nine Model Setup dialogs.

Selecting Target and Predictor Variables

For this analysis, the three-level categorical variable SEGMENT (1, 2, or 3) is the target, or dependent, variable. To specify the target variable, use
213. d request the Twoing rule with a moderate favoring of even splits Of course you never have to deal with the command language if you do not want to but knowing a little can be helpful If you want to lean further in the direction of even splits then raise the setting to 2 00 as we do below Favor Even Splits Less The GUI limits your POWER setting to a maximum value of 2 00 This is to protect users from setting outlandish values There are situations however in which a higher setting might be useful and if so you will need to enter a command with a POWER setting of your choice Using values greater than 5 00 is probably not helpful w On binary targets when both Favor even splits and the unit cost matrix are set to 0 Gini Symmetric Gini Twoing and Ordered Twoing will produce near identical results Although we make recommendations below as to which splitting rule is best suited to which type of problem it is good practice to always use several splitting rules and compare the results You should experiment with several different splitting rules and should expect different results from each As you work with different types of data and problems you will begin to learn which splitting rules typically work best for specific problem types Nevertheless you should never rely on a single rule alone experimentation is always wise The following rules of thumb are based on our experience in the telecommunications banking and mark
214. data set contains variables for gender and age and you want to create a categorical variable with levels for male senior female senior male non senior female non senior You might type 414 Appendix IV BASIC Programming Language oe IF MALE OR AGE THEN LET NEWVAR ELSE IF MALE 1 AND AGE lt 65 THEN LET NEWVAR 1 ELSE IF MALE 1 AND AGE gt 65 THEN LET NEWVAR 2 ELSE IF MALE 0 AND AGE lt 65 THEN LET NEWVAR 3 ELSE LET NEWVAR 4 oe oe oe oe If the measurement of several variables changed in the middle of the data period conversions can be easily made with the following oe IF YEAR gt 1986 OR MEASTYPES 0OLD THEN FOR LET TEMP OLDTEMP 32 1 80 LET DIST OLDDIST 621 NEXT ELSE FOR LET TEMP OLDTEMP LET DIST OLDDIST oe NEXT If you would like to create powers of a variable Square cube etc as independent variables in a polynomial regression you could type something like oe DIM AGEPWR 5 FOR I 1 TO 5 LET AGEPWR I AGE I NEXT oe oe oe Filtering the Data Set or Splitting the Data Set Integrated BASIC can be used for flexibly filtering observations To remove observations with SSN missing try IF SSN THEN DELETE To delete the first 10 observations type IF CASE lt 10 THEN DELETE Because you can construct complex Boolean expressions with BASIC using programming logic combined with the DELETE statement giv
215. del Translation Dialog The Model Translation dialog is shown below Model Translation Grove Grove file Navigator 1 Select Type Classification Subtree Tree no 9 Rel Cost 0 081445 Optimal Tree Select I Save Output To File Translation Options Language SAS Options G SAS Missing value string gt 2 Classic Begin label MODELBEGIN Ce Done label MODELDONE PMML Node prefix NODE C Java Terminal Node Prefix TNODE Cancel Grove File Hit the Select button to pick an external grove file for translation or leave this field unchanged if you have a navigator file with an embedded grove file in which case the navigator file name will appear in the field 182 Chapter 7 Scoring and Translating Subtree Press Select to translate other than the optimal tree then choose the tree in the Tree Sequence dialog By default CART will translate the optimal tree to which you may always return by pressing the Optimal Tree button Save Output to File Put a checkmark if you want the results of scoring saved into an external output file Press the Select button to specify the output files name location and extension Language Choose the language SAS compatible Classic C PMML and Java are currently available SAS compatible Options When translating into SAS you may also specify additional SAS related preferences The definitions should become clear once you look at
216. dentify scratch file path s lt MBytes gt Data amount in MB subject to license threshold m lt MBytes gt Model space in MB subject to hardware limits l lt optional_logfile gt Error warnings to text logfile mt lt N gt Max ternary size 0 to grow tables without bound v lt N gt Specifies max N variables for the session Examples cart e modell cmd cart DataMining Jobs 1 simulate cmd q cart jobl cmd o RESULTS job1 txt u AnalysisData samplel sys cart d Progra 1 DBMSCopy7 u MyData joint data xls xls5 cart s512 p64 m128 Environment variables can be used in lieu of command line switches SALFORD _S in lieu of s SALFORD M in lieu of m SALFORD P in lieu of p Appendix Command Line Menu Equivalents This appendix provides an overview of command line equivalents to the graphical user interface options 316 Appendix l Command Line Menu Equivalents Command Pull Down Menu Dialog ADJUST Limits Growing Limits AUXILIARY Model Construct Model Model BATTERY Model Construct Model Battery BOPTIONS SERULE Model Construct Model Best Tree COMPLEXITY Model Construct Model Advanced COMPETITORS Model Construct Model Best Tree CPRINT Edit Options CART TREELIST Edit Options CART SPLITS Command Line Only SURROGATES How Many to Store Model Construct Model Best Tree How Many to Report SURROGATES PRINT Edits Options CART SCALED Model Construct Model Advanced NCLASSES Model
...folder, and that you have write permission to that folder.

Error 28: Too many variables in your dataset. You have exceeded CART's limit on the number of variables (currently 8,128). Note that new variables created by BASIC, and missing value indicators, are treated as legitimate variables and may cause the total number of predictors to go beyond the limit.

Error 10002: NO INDEPENDENT VARIABLES WERE SPECIFIED FOR THIS MODEL. Check your KEEP or EXCLUDE commands.

Error 10005: THE NUMBER OF DEPENDENT VARIABLE CATEGORIES IS NOT EQUAL TO THE NUMBER OF PRIOR CLASS PROBABILITIES. Make sure that you have listed all available levels in the PRIORS SPECIFY command.

Error 10006: YOU HAVE SPECIFIED THE DEPENDENT VARIABLE AS THE SEPARATION VARIABLE. CART does not allow the use of the same variable in the MODEL and ERROR commands.

Error 10007: AN INDEPENDENT VARIABLE WAS SPECIFIED AS THE SEPARATION VARIABLE. CART does not allow the use of the same variable in the ERROR and KEEP (or EXCLUDE) commands.

Error 10008: DATASET HAS NO NUMERIC VARIABLES IN COMMON WITH YOUR USE DATASET. You are using the wrong test set.

Error 10009: MISCLASSIFICATION COSTS MUST BE POSITIVE AND NONZERO. Use small positive numbers, such as .001, to reflect zero costs.

Error 10011: OUT OF MEMORY. SPLIT INTO SEVERAL SMALLER COMMANDS. The command parser has encountered difficulties processing one of your commands due to its length. Consider alternative ways to use...
...slider to specify our Above Depth constraint of 2, and the right slider to specify our Below Depth constraint of 5. Now our selected variable(s) are only permitted at depth levels 2, 3, or 4; they are disallowed above depth 2 and at or below depth 5.

[Disallow Split Region dialog: for each variable group, a Split Disallowed Above Depth slider and a Split Disallowed At Or Below Depth slider.]

Now let's run an example and specify two groups of structure constraints using the GYMTUTOR.CSV data. One group of variables is the consumer characteristics, and a second group is the product characteristics.

Consumer characteristics:
NFAMMEM   Number of family members
SMALLBUS  Small business discount (binary indicator coded 0/1)
FIT       Fitness score
HOME      Home ownership (binary indicator coded 0/1)

Product characteristics:
ANYRAQT   Racquetball usage (binary indicator coded 0/1)
ONAER     Number of on-peak aerobics classes attended
NSUPPS    Number of supplements purchased
OFFAER    Number of off-peak aerobics classes attended
TANNING   Number of visits to tanning salon
ANYPOOL   Pool usage (binary indicator coded 0/1)
PERSTRN   Personal trainer (binary indicator coded 0/1)
CLASSES   Number of classes taken

For our group 1 variables, place a check mark for each using the column labeled 1. Repeat this process for group 2 using the column labeled 2. The resulting Constraints tab will look as follows.

Chapter 12: Features and Options
...Nodes tab displays box plots for the node distributions of the target, sorted by the mean. Hover over any of the boxes to see detailed information about the node.

[Navigator 1 Tree Summary Reports window, Terminal Nodes tab: terminal node box plots, sorted by target variable prediction.]

When separate learn and test parts of the data are used, Learn and Test buttons allow switching between learn and test distributions. No matter which button is pressed, the nodes are always sorted by the learn means, to quickly assess node stability.

Chapter 5: Regression Trees

Root Splits
The Root Splits tab lists ALL root node competitors, sorted in descending order by split improvement. The report also shows split details in terms of case counts (N Left, N Right, N Missing). For the main split, Is RM <= 6.941:

Competitor   Split        Improvement
Main  RM     6.94100      ...
1     LSTAT  9.72500      37.34426
2     INDUS  6.66000      21.90361
3     PT     19.90000     20.62983
4     NOX    0.66950      18.84629
5     TAX    416.50000    17.03179
6     CRIM   6.68632      16.33631
7     RAD    16.00000     13.25819
8     ZN     15.00000     13.17997
9     AGE    76.25000     11.01511
10    B      346.39502    10.39391
11    DIS    2.59770      9.87063
12    CHAS   0.50000      2.59304

While the competitor information is also available...
...desired by users wanting to impose some modest structure on a tree. You can also specify the split values for both continuous and categorical splitters if you prefer to do so.

A much more sophisticated set of controls is available in CART ProEX. These controls allow you to pre-specify sets of variables to be used in specific regions of the tree and to determine the order in which splitters appear in the tree. Look for a discussion of the structured tree to learn more about this patent-pending feature.

Missing Value Controls and Analysis
CART has always offered sophisticated, high-performance missing value handling. In CART 6.0 we introduce a new set of missing value analysis tools for automatic exploration of the optimal handling of your incomplete data. On request, CART 6.0 will automatically add missing value indicator variables (MVIs) for every variable containing any missing values to your list of predictors and conduct a variety of analyses using them. For a variable named X1, the MVI will be named X1_MIS and coded as 1 for every row with a missing value for X1, and 0 otherwise. If you activate this control, the MVIs will be created automatically as temporary variables and will be used in the CART tree if they have sufficient predictive power. For categorical variables, an MVI can be accommodated in two ways: by adding a separate MVI variable, or by treating missing as a valid level. Modelers can now experiment to see which works best...
...ding increases. In many runs the ARC process of resampling will simply bog down, and the ARCer will automatically reset the probabilities to their equal starting values and continue generating additional trees. The option Maximum Number of Sample Redraws enables you to control how hard the ARCer should try to build a sample. The default setting is three. If CART cannot build one of the trees in the resampled series, you can increase the maximum number of redraws and try again.

Pruning (Test Method)
When growing a single tree, pruning is not merely optional; it is vital to obtaining a reliable tree. By definition, a CART tree is first overgrown (i.e., overfit) and then the overfit portions are pruned away with the help of a test or cross-validation data set. When combining trees, Breiman has shown that the trees need NOT be pruned, because whatever overfitting may result is averaged away when the combining takes place. For this reason, No Pruning is the default setting when using either bagging or ARCing.

The other two pruning methods are available for historical reasons only.

Evaluation Sample Holdout Method
A holdout sample is used to evaluate the performance of the committee-of-experts tree generated via bagging or ARCing. The holdout sample is NOT used to build or prune any tree, but rather is used only to evaluate the predictive capability of both the committee-of-experts tree and the initial tree built on the full sample.
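The redraw-and-reset behavior described above can be sketched as a short loop. This is an illustrative Python reconstruction, not CART's actual implementation; in particular, the `sample_is_usable` check is a hypothetical stand-in for whatever makes a resample buildable.

```python
import random

def draw_arc_sample(weights, sample_is_usable, max_redraws=3):
    """Sketch of the ARCer redraw logic: try up to max_redraws weighted
    resamples; if none is usable, reset to equal probabilities and draw once
    more (mirroring the automatic reset described in the text)."""
    n = len(weights)
    for _ in range(max_redraws):
        sample = random.choices(range(n), weights=weights, k=n)
        if sample_is_usable(sample):
            return sample, weights
    # Resampling bogged down: reset probabilities to equal starting values.
    equal = [1.0 / n] * n
    return random.choices(range(n), weights=equal, k=n), equal
```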
...drawn from the USE data.

FILE=<file>    sets up a separate test dataset.
SEPVAR=<var>   separates the learn and test samples with a named variable. The set-aside value is 1 for numeric variables and "SETASIDE" (or "setaside") for character variables.

Appendix III: Command Reference

The TEST, CROSS, and EXPLORE options are used to specify if and how pruning is conducted. They are mutually exclusive options:

TEST       Specifies that the unsampled training data is to be used as a test sample to prune each tree.
CROSS      Specifies that N-fold cross validation is used for each tree in the series in lieu of a test sample. If <N> is not specified, it defaults to 10.
EXPLORE    Specifies that no test sample or cross validation is to be used for each tree.
TRIES      Occasionally CART cannot build one of the trees in the series. You can specify how many times CART should draw and redraw learn and test samples in an effort to get it built. The default is 3.
POWER      This is the exponent K in the ARC function evaluated for each observation i in the overall set: arc(i) = (1 + m(i)^K) / sum_j (1 + m(j)^K), where m(i) is the number of times observation i has been misclassified so far. A value of 0 effectively turns ARC off.
RTABLES    Controls the tables CART can produce to summarize how observations in the overall set are being repeated into the learn and test samples, both for each tree and cumulatively at the end of the series.
DETAILS    Controls whether CART produces detailed output (tree sequ...
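The ARC weighting formula under POWER is easy to compute directly. The sketch below is an illustrative Python version (not CART's code), using K = 4 as in Breiman's arc-x4 variant; the misclassification counts in the example are made up.

```python
def arc_weights(miss_counts, k=4):
    """Resampling probabilities arc(i) = (1 + m(i)**k) / sum_j (1 + m(j)**k),
    where m(i) is how often case i has been misclassified so far.
    With k = 0, every case gets equal weight (ARC is effectively off)."""
    raw = [1 + m ** k for m in miss_counts]
    total = sum(raw)
    return [r / total for r in raw]

# Cases misclassified more often get proportionally higher weight:
w = arc_weights([0, 1, 2])   # raw counts become [1, 2, 17], total 20
```

Note how quickly the weight concentrates on hard cases: a case misclassified twice receives 17/20 of the total resampling probability here.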
...e. Click the Select button next to this line to select the data file you want to score. By default, CART puts the most recently opened data file into this field.

Grove File
Click the Select button to pick an external grove file for scoring, or leave this field unchanged if you are scoring from a navigator file with an embedded grove file, in which case the navigator file name will appear in the field.

Save Results to a File
Place a mark in the Save results to a file check box if you want the results of scoring saved into a separate data file. Press the Select button to specify the output file's name, location, and format. Decide whether you want Model Information, Path Indicators, and Predicted Probabilities included in the output dataset (see details below).

Subtree
Click Select if you want to score a tree other than the optimal tree, and then choose the tree in the Tree Sequence dialog. By default, CART scores the optimal tree, which you may always return to by pressing the Optimal Tree button.

Tree Sequence:

Cross-Validation    Resubstitution     Complexity
Relative Error      Relative Error     Parameter
0.081445            0.040773           0.000000
0.081445            0.050773           0.003354
0.086649            0.056037           0.003519
0.101955            0.066241           0.006813
0.117905            0.087293           0.014045
0.194850            0.124135           0.024571
0.204850            0.163829           0.026473
0.206176            0.206176           0.028241
0.568421            0.568421           0.241507
[Node Display options dialog: check boxes for the splitting variable name, split criteria (with selectable decimal places), and weighted/unweighted case counts; for regression trees, the node average, median, mean, and standard deviation; for classification trees, the class assignment, class breakdown (optionally underlined), and class histogram; Set Defaults and Use Defaults buttons; a sample node preview (here a PETALWID split, N = 60); and Copy To Terminal Nodes, OK, and Cancel buttons.]

The default display setting is shown in a sample node in the right panel. Click on the check boxes to turn each option on and off, and then click OK to update the Main Tree display. To save your preferred display options as the default settings, click the Set Defaults button. Also note that you may separately control the display of internal nodes versus terminal nodes. Press the Copy to Terminal Nodes or Copy to Internal Nodes button if you wish the current setup to be copied into the other tab.

Chapter 11: CART Segmentation

The Set Defaults button only sets the defaults for the current tab. If you want to set defaults for both terminal and internal nodes, press this button twice, once for each tab.

Viewing Sub-trees
You can also view sub-trees (different sections of the tree) by right-clicking on an internal node that originates the branch you want displayed and selecting Display Tree. As with the main tre...
...e, the level of node detail can be changed by selecting Node Detail from the View menu. As illustrated below, separate sections of the tree can be displayed side by side by opening a second sub-tree window; the two windows are automatically positioned side by side.

[Side-by-side sub-tree windows: SMALLBUS = 1 vs. SMALLBUS = 0 and ONAER <= 2.50 vs. ONAER > 2.50 splits leading to Terminal Nodes 1, 2, 4, and 5, assigned to classes 1, 3, 3, and 2.]

Assigning Labels and Color Codes
Class names (32-character maximum) and colors can also be assigned to each level of the target variable:

1. Select Assign Class Names from the View menu.
2. Click on the Name text box and enter a label for that class.
3. Click on Color, select a color from the palette, and click OK.
4. Click Apply to enter the name and color; repeat steps 2-4 for the other levels.

An illustrative Class Assignment dialog box for our example is shown below. The labels and color codes are displayed in the individual node detail you see when you hover the mouse pointer over a node in the Navigator window, as well as in the main and sub-tree diagrams and printed tree output.

[Class Name Assignments dialog: classes labeled High, Medium, and Low, each with a color selection; Set Default Colors, Apply, OK, and Cancel buttons.]

Printing the Main Tree
To print the Main Tree, bring the tree window to the foreground and then select Print from the File menu or use <Ctrl+P>...
...e Partitions. To request a random division of your data into a three-way partition, just check the relevant box in the Model Setup Testing tab and specify your preferred fractions. When setting up such partitions, be sure that each partition will be large enough to fulfill its function. In the example below we have set up a partition that is 60% train, 20% test, and 20% validate.

[Model Setup Testing tab: "Fraction of cases selected at random for testing: 0.2" with "Additional fraction for auto validation: 0.2" checked.]

Test Sample Contained in a Separate File
Two separate files are assumed: one for learning and one for testing. The files can be in different database formats, and their columns do not need to be in the same order.

The train and test files must both contain ALL variables to be used in the modeling process. In general, we recommend that you keep your train and test data in the same file for data management purposes. This helps to ensure that if you process your training data, you also process the test data in exactly the same way.

Variable Separates Test and Validate Samples
A variable on your data set can be used to flag which records are to be used for learning (training) and which...
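The 60/20/20 random division described above can be sketched in a few lines. CART performs this assignment internally when the Testing-tab fractions are set; the Python version below is illustrative only.

```python
import random

def three_way_partition(n_rows, f_test=0.2, f_valid=0.2, seed=0):
    """Illustrative sketch of a random train/test/validate split.
    Each row independently lands in 'test' with probability f_test,
    'valid' with probability f_valid, and 'train' otherwise."""
    rng = random.Random(seed)
    labels = []
    for _ in range(n_rows):
        u = rng.random()
        if u < f_test:
            labels.append("test")
        elif u < f_test + f_valid:
            labels.append("valid")
        else:
            labels.append("train")
    return labels

parts = three_way_partition(10_000)
# roughly 60% train, 20% test, 20% valid
```

Because the assignment is random, the realized fractions fluctuate slightly around the requested ones, which is one reason to make sure each partition is large enough for its purpose.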
...e have introduced new ways to shape and control models, new ways to assess the quality of your models, and added tools to report, deploy, and export models for production purposes. This section provides a brief and selective overview of the newest features. Complete details are provided in the main body of the manual. For a list of the features introduced in CART 5.0, please see the relevant appendix.

CART Pro and CART ProEX
To accommodate a diverse set of user requirements, we are now offering three main versions of CART 6.0: the SE or standard edition, the Pro or professional edition, and the ProEX or professional extended edition, intended for our most demanding users. Features available only in the Pro and ProEX versions are marked throughout the documentation using the CART Pro and CART ProEX indicators.

Groves and Navigators
CART 6.0 now uses only the grove file (.grv) to store model information and no longer creates navigator (.nav or .nv3) files. CART 6.0 will still read your old navigator files, so you can continue to view and extract reports from them. You will not need navigator files in the future, because CART 6.0 stores all model information in the grove.

Data Preparation and Management
All Salford tools have traditionally offered a comprehensive built-in BASIC programming language for on-the-fly data manipulation. The language includes full flow control, in...
...e number of columns as before, but twice as many rows. The top portion of the data is the Original data, and the bottom portion will be the scrambled Copy.

2. Add a new column to the data to label records by their data source (Original vs. Copy).

3. Generate a predictive model to attempt to discriminate between the Original and Copy data sets. If it is impossible to tell, after the fact, which records are original and which are random artifacts, then there is no structure in the data. If it is easy to tell the difference, then there is strong structure in the data.

4. In the CART model separating the Original from the Copy records, nodes with a high fraction of Original records define regions of high density and qualify as potential clusters. Such nodes reveal patterns of data values that appear frequently in the real data but not in the randomized artifact.

We do not expect the optimal-sized tree for cluster detection to be the most accurate separator of Original from Copy records. We recommend that you prune back to a tree size that reveals interesting data groupings.

Setting Up an Unsupervised Model
To set up an unsupervised model, we use the Model Setup Model tab. Start by defining your predictors using the check boxes in the Predictors column. For unsupervised learning there is no target variable; if a target variable is checked, it will be discarded and ignored. The only other setup requi...
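The Original-vs-Copy construction in the steps above can be sketched as follows. CART builds this artifact dataset for you when you request an unsupervised model; the Python version below is only an illustrative reconstruction, with made-up sample rows.

```python
import random

def make_unsupervised_data(rows, seed=0):
    """Stack the original data on top of a copy whose columns are each
    independently shuffled, and label the source of every row.
    `rows` is a list of dicts sharing the same keys."""
    rng = random.Random(seed)
    cols = list(rows[0])
    # Shuffle each column independently: this destroys the relationships
    # BETWEEN columns while preserving every column's marginal distribution.
    shuffled = {c: [r[c] for r in rows] for c in cols}
    for c in cols:
        rng.shuffle(shuffled[c])
    copy = [{c: shuffled[c][i] for c in cols} for i in range(len(rows))]
    return ([dict(r, SOURCE="Original") for r in rows] +
            [dict(r, SOURCE="Copy") for r in copy])

data = make_unsupervised_data([{"A": 1, "B": 2}, {"A": 3, "B": 4}])
# len(data) == 4: first half labeled "Original", second half "Copy"
```

The SOURCE column then serves as the target for the discrimination model described in step 3.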
...e or more features or control parameters of the model. It is given prior to the BUILD command, which begins the model-building process. The various forms of the BATTERY command are:

BATTERY ATOM      Eight models are generated using ATOM values of 2, 5, 10, 25, 50, 100, 200, and 500.
BATTERY CV        Cross-validation trees using 5, 10, 20, and 50 CV bins.
BATTERY DEPTH     Generates one unconstrained and seven depth-limited (1, 2, 3, 5, 10, 20, 50) models.
BATTERY FLIP      Generates two models, reversing the learn and test samples.
BATTERY MVI       Generates five models: main effects; main effects with MVIs (missing value indicators); MVIs only; main effects with missing values penalized; and main effects and MVIs with missing values penalized.
BATTERY MINCHILD  Eight models using MINCHILD settings of 1, 2, 5, 10, 25, 50, 100, and 200.
BATTERY NEST      YES|NO (CART ProEX only). Do we nest (combine) battery specifications or not? The default is no.
BATTERY NODES     (CART ProEX only) Four models, each limiting the number of nodes in a tree: 4, 8, 16, and 32 terminal nodes.
BATTERY ONEOFF    (CART ProEX only) Attempt to model the target as a function of one predictor at a time. Note that for CART classification models the class probability splitting rule is used.
BATTERY LOVO      (CART ProEX only) Repeat the model, leaving one predictor out of the model each time. Note that for CART classification models the class...
...predicted with any reasonable accuracy at all (the relative error of 1.09 is greater than 1.0).

Chapter 11: CART Segmentation
A classification/segmentation example to illustrate the multi-class problem.

Modeling the Multi-class Target
So far we have discussed two-class classification examples. In this chapter we walk through a simple three-class example to illustrate some of the unique aspects of this form of modeling. In the example that follows, we analyze a data set containing information on health club members who have been classified into three market segments. The goal of our analysis is to uncover the important factors that differentiate the three segments from each other. The variables in the GYMTUTOR.CSV data set (included on your installation CD) are:

SEGMENT   Member's market segment, coded 1, 2, or 3
ANYRAQT   Racquetball usage (binary indicator coded 0/1)
ONAER     Number of on-peak aerobics classes attended
NSUPPS    Number of supplements purchased
OFFAER    Number of off-peak aerobics classes attended
NFAMMEM   Number of family members
TANNING   Number of visits to tanning salon
ANYPOOL   Pool usage (binary indicator coded 0/1)
SMALLBUS  Small business discount (binary indicator coded 0/1)
FIT       Fitness score
HOME      Home ownership (binary indicator coded 0/1)
PERSTRN   Personal trainer (binary indicator coded 0/1)
CLASSES   Number of classes taken

CART Desktop
Double-click on the CART...
...statement when the condition is not true:

    % IF condition THEN statement1 ELSE statement2

Second, the ELSE may be combined with an IF-THEN to link conditions:

    % IF condition1 THEN statement1 ELSE IF condition2 THEN statement2

To allow multiple statements to be conditionally executed, combine the IF-THEN with a FOR-NEXT:

    % IF condition THEN FOR
    %   statement
    %   statement
    % NEXT

Examples
To remove outlier cases from the data set:

    % IF ABS((Z - ZMEAN)/ZSTD) > 95 THEN DELETE

Appendix IV: BASIC Programming Language

LET Statement
Purpose: Assign a value to a variable.

Syntax: The form of the statement is

    % LET variable = expression

The expression can be any mathematical expression or a logical (Boolean) expression. If the expression is Boolean, then the variable defined will take a value of 1 if the expression is true, or 0 if it is false. The expression may also contain logical operators such as AND, OR, and NOT.

Examples:

    % LET AGEMONTH = (YEAR - BYEAR)*12 + (MONTH - BMONTH)
    % LET SUCCESS = MYSPEED / MAXSPEED
    % LET COMPLETE = (OVER = 1) OR (END = 1)

STOP Statement
Purpose: Stops the processing of the BASIC program on the current observation. The observation is kept, but any BASIC statements following the STOP are not executed.

Syntax: The form of the statement is

    % STOP

Examples:

    % 10 IF X = 10 THEN G
...e the following command syntax to specify selection criteria, where <condition> is written as <variable> <relation> <string>:

    SELECT <condition1>, <condition2>, etc.
    SELECT AGE < 35

Using CART's Built-in Programming Language
As an alternative to the Model Setup Select Cases tab, CART offers a full built-in BASIC programming language. When accessed via the command line, BASIC can be used to modify existing variables as well as to define new variables, filter cases, and implement other database programming functions at any step during the Model Setup process.

For example, if you are in the Model Setup dialog and want to create a new variable to add to your candidate predictor list, click the Continue button. Ensure that the command prompt is on by placing a checkmark by the Command Prompt item in the File menu. The command prompt is represented by the > character. At the >, type:

    %IF FTV > 0 THEN LET NEWVAR = 1 %ELSE LET NEWVAR = 0

to create a categorical variable NEWVAR that takes on the value 1 if the number of first-trimester visits was greater than zero, and a value of 0 otherwise. To then add NEWVAR as a candidate predictor variable, reopen the Model Setup dialog. NEWVAR will now appear in the Variables box of the Model dialog; highlight NEWVAR and add it to the predictor list.

Chapter 4: Classification Trees

The % signs are part of the input and si...
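For readers more comfortable outside CART's BASIC, the NEWVAR derivation above is equivalent to the short Python sketch below; the FTV values are made-up sample data, not taken from the manual's dataset.

```python
# Python equivalent of the CART BASIC line
#   %IF FTV > 0 THEN LET NEWVAR = 1 %ELSE LET NEWVAR = 0
# FTV = number of first-trimester visits; the sample values are hypothetical.

ftv_values = [0, 2, 5, 0, 1]
newvar = [1 if ftv > 0 else 0 for ftv in ftv_values]
# newvar == [0, 1, 1, 0, 1]
```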
...are the property of their respective owners.

Table of Contents

Copyright
Limited Warranty
Trademarks

INTRODUCING CART 6.0
What's New in CART 6.0
About this Manual

INSTALLING AND STARTING CART
Installing and Starting CART 6.0
Minimum System Requirements
Recommended System Configuration
Installation Procedure From CD-ROM
Ensuring Proper Permissions
Starting and Running CART
LICENSIN...
[Error message: "You specified a split point for FIT as a forced splitter that lies outside the observed learn sample range for this variable. This is not permitted. Removing the forced split constraint from the node and continuing." With a "Do not show this message again" checkbox and OK and Cancel buttons.]

From the resulting navigator, if you hover your mouse over the root node, we can see that CART now uses both the specified variable, FIT, and the split point 3.96.

[Root node detail: FIT <= 3.96, with class case counts.]

An alternative view would be to look at the tree details diagram by clicking the Tree Details button found on the Navigator. This gives you the following view, again showing that the split variable and the value were utilized.

[Tree details: Node 1 splits on FIT <= 3.96 (N = 293); the left child (FIT <= 3.96) is Terminal Node 1 and the right child (FIT > 3.96) is Terminal Node 2 (N = 129), each with its class case counts.]

Specifying the Left/Right Child Node Splitter
Using the same root node force-split variable and value, we now demonstrate how to specify the left and right child node splits. Like the root node split, the user can specify not only the variable but also a split value. In this example we use the categorical variables ANYPOOL (0, 1) and CLASSES (0, 1, 2, 3). Using the Set Left button, sele...
To do so, return to the Model Setup Force Split tab. The previously specified variable, FIT, should be retained and displayed as the Root Node entry. This time we will check Set Split Value and then click the Change button. The resulting Set Root Node Split Value dialog will appear.

[Set Root Node Split Value for FIT dialog: "Cases go to left child node if the value is <= 3.96", with lists of values assigned to the left and right child nodes and Send To Right, Send To Left, Reset, OK, and Cancel buttons.]

This dialog allows you to specify the split value for continuous variables in the upper portion and for categorical variables in the lower portion. Here we have placed the value 3.96 in the entry box titled "Cases go to left child node if the value is <=". Click OK to continue and return to the Model Setup dialog. From the Model Setup window, click Start to build the model.

The user is allowed to enter any value as long as it falls within the range of permissible values. In the case of the variable FIT, the minimum value is zero and the maximum is 10.127. A user who enters a value outside the range would receive an error like the following:
More Navigator Controls
CART Text Output
Displaying and Exporting Tree Rules
Scoring Data
New Analysis
Saving the Command Log

CLASSIFICATION TREES
Building Classification Trees
The Model Tab
The Categorical Tab
The Testing Tab
The Select Cases Tab
The Best Tree Tab
The Method Tab
The Advanced Tab
The Cost Tab
...specifying the CORE subset of variables that are always present in each run. We illustrate this battery on the SPAMBASE.CSV dataset by sampling 10 predictors at a time and repeating this process thirty times, while requiring the CHAR_FREQ_EXCLAM and CHAR_FREQ_DOLLAR variables to be always present (see the KEEP.CMD command file for details).

[Battery Summary window: Models, Contents, Accuracy, Error Profiles, Var Imp Averaging, Charts, ROC, and Nodes tabs; a bar/line chart of test average ROC by tree; per-tree keep lists; and Show Max ROC, Model Quality (Misclass, Entropy, ROC, Lift), Sample (Test, Learn), and Model Size (Min Cost, 1SE) controls.]

The resulting models have an average area under the ROC curve ranging from 0.8784 to 0.9347. The largest-ROC model...
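The sampling scheme just described, 10 predictors per run with two core variables always kept, can be sketched as follows. This is an illustrative Python reconstruction of the idea only, and the stand-in predictor names (`WORD_FREQ_0`, ...) are hypothetical abbreviations of the SPAMBASE variables.

```python
import random

def draw_keep_list(predictors, core, n_sample=10, seed=0):
    """Sketch of one battery draw: the core variables are always included,
    and the remaining slots are filled by sampling without replacement
    from the non-core predictors."""
    rng = random.Random(seed)
    pool = [p for p in predictors if p not in core]
    return list(core) + rng.sample(pool, n_sample - len(core))

core = ["CHAR_FREQ_EXCLAM", "CHAR_FREQ_DOLLAR"]
preds = core + [f"WORD_FREQ_{i}" for i in range(48)]   # stand-in names
keep = draw_keep_list(preds, core)
# len(keep) == 10, and both core variables are always present
```

Repeating the draw thirty times with different seeds yields thirty KEEP lists, one per model in the battery.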
...section of the output, highlight that section and select Copy from the Edit menu or from the toolbar. Paste the copied text into the Notepad by selecting New Notepad from the File menu, and then save the Notepad contents by selecting Save As from the File menu. Alternatively, after you copy the text, paste it into another application such as Microsoft Word or Excel.

Printing the CART Output Window
To print output contained within the CART Output window, simply select Print from the File menu. The following Print dialog will appear and provide a set of Print Range options. Choose the desired option and click the OK button to complete printing.

[Print dialog: Print Range options (Selected Text, Entire File, Current Page, Selected Pages with First Page and Last Page fields), Number of Copies, Collate, OK, and Cancel.]

Memory Management
Formerly, CART was compiled into distinct memory versions (64MB, 128MB, etc.). A user's license determined which memory version was delivered. Thus the license was tied to the amount of workspace inherent in the program, and loosely tied to the amount of data (type of data, categorical vs. continuous, size of final tree, etc.) that the user could analyze.

Licensing and workspace are handled differently in CART 6. A user's license sets a limit on the amount of learn sample data that can be analyzed. The learn sample consists of the data used to grow the m...
...for each resample (replication) generated, and then the results are averaged. If the separate analyses differ considerably from each other, suggesting tree instability, averaging will stabilize the results, yielding much more accurate predictions. If the separate analyses are very similar to each other, the trees exhibit stability, and the averaging will neither harm nor improve the predictions. Thus, the more unstable the trees, the greater the benefits of averaging.

When training data are resampled with replacement, a new version of the data is created that is a slightly perturbed version of the original. Some original training cases are excluded from the new training sample, whereas other cases are included multiple times. Typically, 37 percent of the original cases are not included at all in the resample; the sample is brought up to full size by including other cases more than once. A handful of cases will be replicated 2, 3, 4, 5, 6, or even 7 times, although the most common replication counts are 0, 1, and 2. The effect of this resampling is to randomly alter the weights that cases will have in any analysis, thus shifting slightly the results obtained from tree growing or any other type of statistical analysis.

Adaptive Resampling and Combining (ARCing)
Leo Breiman's (1996) variant on the boosting procedure first introduced by Freund and Schapire (1996) performs as well as or better than boosting. In ARCing, the probability with which a...
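The 37-percent figure above is approximately 1/e (about 0.368): the chance that any given case is never drawn in n draws with replacement, (1 - 1/n)^n, approaches 1/e as n grows. The quick simulation below illustrates this; it is a standalone sketch, not part of CART.

```python
import random

def out_of_bag_fraction(n, seed=0):
    """Draw one bootstrap resample of size n and return the fraction of
    original cases that never appear in it (expected to be near 1/e)."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

frac = out_of_bag_fraction(100_000)
# frac comes out close to 0.368: about 37% of cases are left out
```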
…ed in detail in Chapter 3, CART BASICS.

Chapter 5 Regression Trees

Rules for node 2:

  CREATE VIEW BOSTON_Mean_19.9337_terminalNode2 AS
  SELECT * FROM BOSTON WHERE (RM <= 6.941)

The Splitter tab
When the main splitter is continuous, the left and right child summary statistics of the target are displayed in table form. (For the splitter "Is LSTAT <= 14.4", the tab lists the left target's statistics and the right target's statistics side by side.) When the main splitter is categorical, the partition of the splitter's levels between the left and right sides is displayed. This results tab is discussed in more detail in Chapter 3, CART BASICS.

Terminal Node Report

To view node-specific information for a terminal (red) node, click on the terminal node, or right-click it and select Node Report. For our example, left-click on terminal node 18 (the far-right terminal node).

The Node Statistics tab
The Node Statistics tab shows the current node's target box plot in comparison with the target box plot for the root node (the entire learn sample). This helps us to see whether the high-end or the low-end segment of the population is contained in the current node. Node-specific s…
…ed in terms of any variable appearing in the data set, whether or not that variable is involved in the model, and is constructed as follows:

1. Double-click a variable in the variable list to add that variable to the Select text box.
2. Select one of the predefined logical relations by clicking its radio button.
3. Enter a numerical value in the Value text box.
4. Click Add to List to add the constructed criterion to the right window; use Delete from List to remove it.

For example, if you want to exclude all mothers over 35 years of age from the analysis, double-click on AGE, click on the < button, and enter 35 in the Value text box. When you click on Add to List, AGE < 35 will now appear in the previously blank panel on the right, as illustrated above.

Chapter 4 Classification Trees

(The Select Cases tab of the Model Setup dialog shows the criterion AGE < 35 added to the selection list.)

The SELECT criteria are ANDed, meaning that if you specify two conditions, both must be satisfied for a record to be selected into the analysis. If you want to create logical selection criteria that allow some but not all conditions to be met, you will need to use the built-in BASIC programming language. Command line users need to us…
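The ANDing of SELECT criteria can be pictured with a small sketch outside CART. The Python below is illustrative only (the record layout and second criterion are invented for the example); a record enters the analysis only if every criterion holds.

```python
# Hypothetical records; field names are illustrative, not from CART.
records = [
    {"AGE": 28, "INCOME": 42000},
    {"AGE": 41, "INCOME": 18000},
    {"AGE": 33, "INCOME": 12000},
]

# Two SELECT-style criteria; like CART's SELECT, they are ANDed.
criteria = [
    lambda r: r["AGE"] < 35,         # SELECT AGE < 35
    lambda r: r["INCOME"] > 15000,   # SELECT INCOME > 15000 (hypothetical)
]

selected = [r for r in records if all(c(r) for c in criteria)]
print(selected)  # only the record satisfying BOTH conditions survives
```

Allowing some-but-not-all conditions (an OR) requires replacing `all` with `any`, which is exactly the kind of logic CART delegates to its built-in BASIC language.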
… 176
Case Output for Regression Trees … 178
Scoring in Command Line … 180
Translating CART Models … 180
Translating in Command Line … 182
Exporting and Printing Tree Rules … 183
TRAIN-TEST CONSISTENCY (TTC) … 185
Optimal Models and Tree Stability … 186
HOT SPOT DETECTION … 193
Searching for Hot Spots … 194
CART BATTERIES … 199
Batteries of Runs … 200
CART SEGMENTATION … 227
Modeling the multi-class target … 228
CART Desktop … 228
About CART Menus …
244. eeeesseaneeeeeeeeeseseeeeneeees 301 Command Syntax Conventions ssssssssssseseeunnnnnnnnnnnnnnnnnnunnnnnnnnnnnnnnnnnnnnnnnnnnnnn nna 301 Example A sample classification run sssssssssssnneesennnnnnnnnennnnnnnnnnnnnnnnnnnnnnnnnn nnne 302 Example A sample regression run 2 ccccececeeeeeseeeeeeeeeeeeseeeseaneeeeeeeeseeeeeesnenees 305 UNIX Console Usage Notes cceseeccccceseeeeeeeseeeseeeeeeeeseeneesesesneeseeeeeneesesesneeseeeseeees 310 COMMAND LINE MENU EQUIVALENTS o n 315 ERRORS AND WARNINGS cece cece eee i ee eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeees 319 COMMAND REFERENCE visssisisians ian cicessaes ests sins cuentas inca stestesasanainiasivs dvuatecsuedis 327 ADJUST ornina a cased ce tec eg cee oa ce pant ve nk ee enta oa eh eect ge enh Teed hess 328 AUXILIAR Y criera fect tececeteds copeesetecesecuctdnvessaecenectlveduasedevervctevsesssoets restleecersetevtnvestietes 329 BAT TERY cerei cociceccesstaccccadisrcreetiteccdeteatvetssececcesdeateseeneudevehsscteccestoatestesstecerssatversaecet 330 BOPTIONS ince feces ctcegs ceceecdadearettesscecccestetennsnecececeseteck cousstenrsasuercesdeceetuetederersssetveneszeet 334 BULL Desiree aa a Sees Rivet ce dv aba a hedonic Gove shed ctremeed devs esecs Seeethas 338 CATEGORY civeiccscteeeds oie icd eect ns Aneesh deters ns Cebedh N Ateth statins hueee tenets 339 CODE service feces cae EE ds cntecedssucteetecncudesyseuebesstececs su seudessasecuetesevedscsguseudeedsvecdeeevseusceeeses 340 CHARS
…elected file:

1. Select Log Results to from the File menu and choose the File option.
2. Click on the File Name text box in the Text Results to File dialog box to set the file name, as illustrated below, and select the directory in which the file should be saved.
3. Click on Save.

(The Save dialog shows the Examples directory with sample output files such as hotspot.dat, cv.dat, keep.dat, mvi.dat, depth.dat, ttc.dat, draw.dat, and flip.dat; the file name GYMOUT is entered, with "Save as type: Output Files (*.dat)".)

To stop sending the output to a file, select Log Results to Window from the File menu.

The CART Output window must be active to have access to the above menus. Due to some features of the operating system, you will not be able to see the contents of the log file until after CART is closed, a new log file is specified, or the output is logged back to the window.

Command line equivalents:

  OUTPUT <file_name.dat>
  OUTPUT

Specify Output File Post-Processing

To save the complete current contents of the CART Output window to a file after you have built a tree:

1. Select Save Output from the File > Save menu.
2. Click on the File Name text box in the Text Results to File dialog box to set the file name, as illustrated below.
3. Select the directory in which the file should be saved.
4. Click on Save.

To save a particular s…
…en prunes those sections of the tree that contribute least to overall accuracy, pruning all the way back to the root node. As we would expect, the relative cost (or misclassification rate) goes down as the tree gets progressively larger, but at a certain point it plateaus, and in some cases will begin to climb.

CART's Navigator allows you not only to explore the different tree topologies, but also to interactively inspect detailed summary and diagnostic information for each sub-tree in the tree sequence. To explore a different tree, you may click on the blue box in the line graph, choose Select Tree from the Tree menu, use the left and right arrow keys, click the Grow or Prune buttons, or select Grow One Level or Prune One Level in the Tree menu.

The tree size you select appears in the top panel of the Navigator window. The ability to see different trees is particularly useful if you feel the optimal CART tree is too large, or if you are only concerned with the first few splits in the tree. You can also see the tree nodes that will be pruned next as you move one step down the tree sequence by selecting Show Next Pruning from the View menu or pressing the Next Prune button in the Navigator window; these nodes are outlined in bright yellow.

Chapter 11 CART Segmentation

Repeated pressing of the button cycles through three alternative displays in the lower half of the Navigator: standard, Re…
…ence.

SUBMIT

Purpose
The SUBMIT command lets you send a text (not binary) command file to CART for processing in batch mode. The commands are executed as if you had typed them from the keyboard. If the file of commands is in the current directory (or the directory specified with Utilities > Defaults > Path) and has a .CMD extension, you need only specify the basic file name without the extension. Otherwise, specify a path name and the complete file name, enclosed in single or double quotation marks.

The command syntax is:

  SUBMIT <file> ECHO

The ECHO option displays the commands on the screen as CART reads them from the SUBMIT file. Note that screen output is automatically scrolled when you SUBMIT commands. You can use the OUTPUT command to specify an ASCII text file in which to review the output that is quickly generated.

Examples:

  SUBMIT COMMANDS              (reads from file COMMANDS.CMD in current directory)
  SUBMIT "ANALYSES\NEWJOB.CMD" (reads from named file)
  SUBMIT JOB ECHO              (reads JOB.CMD and displays commands on screen)

Appendix III Command Reference

TRANSLATE

Purpose
The TRANSLATE command generates reports and splitting rules from a grove file. A grove file must be named by the GROVE command prior to using the TRANSLATE command; otherwise, the most recently created grove file will be used. The OUTPUT option will direct the output from TRANSLATE to the named file.

The command syntax is:

  TRANSLATE LANGUAGE=CLASSIC…
…ence, node details, etc. for the initial tree and for each tree in the series.

  MOPTIONS CYCLES=10, EXPLORE=YES, DETAILS=NONE, RTABLES=NO, TRIES=3,
           SETASIDE FILE="C:\gymtutor\TEST.csv"

NAMES

Purpose
The NAMES command lists the variables on the data set. The command syntax is:

  NAMES

NEW

Purpose
The NEW command resets all CART-specific options while leaving CART's global options (USE file, PRINT settings, etc.) in effect. The command syntax is:

  NEW

NOTE

Purpose
The NOTE command lets you write comments on your output. A note can span any number of lines, but no line may be more than 150 characters long. You can embed an apostrophe in a note if you enclose the line in double quotation marks; you can embed double quotation marks if you enclose the line in apostrophes (single quotation marks). A number without quotation marks sends the corresponding ASCII character to the current output device.

The command syntax is:

  NOTE <'text'>, <"text">, <#>

Examples:

  NOTE "THIS IS A COMMENT",
       'This is second line of comment',
       "It's the third line here"
  NOTE "This is the top of a new page"   (a subsequent NOTE creates a line break)

OPTIONS

Purpose
The OPTIONS command displays the CART options currently in effect, includin…
…ent Score Data, Percent Train Data, Predicted Mean, Actual Mean, Train RMS Error, and Score RMS Error:

  (…)                — Terminal node number
  (…)                — Number of cases
  Percent Score Data — Percent of scored data in the node
  Percent Train Data — Percent of learn data in the node
  Predicted Mean     — Average of learn cases in the node
  Actual Mean        — Actual target average in the node
  Train RMS Error    — RMS error on train data
  Score RMS Error    — RMS error on scored data

The Results Summary group box in the lower panel displays the number of predicted cases, the number of observed cases for the target variable, the predicted response (overall mean for the predicted target), the observed response (overall mean for the observed target variable), and the total mean squared error for the tree. The name of the grove file and the dataset used are also noted in the last row.

Chapter 7 Scoring and Translating

Scoring in Command Line

For command line scoring, the grove file must be saved separately. To score:

Issue the GROVE command to specify the grove file:

  GROVE <file_name.grv>

Issue one of the following commands to specify a tree other than the optimal tree. If this command is not issued, the optimal tree will be used by default:

  HARVEST PRUNE TREENUMBER=<N>
  HARVEST PRUNE NODES=<N>

Issue either of the following commands, depending on whether or not you want model information added:

  SAVE <file_name> MODEL
  SAVE <file_name>

Start scoring by issuing:

  SCORE PATH=YES PROBS=<N>

Translating CART Models

Any CART mode…
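The two RMS Error columns are root-mean-square errors of the node's constant prediction (its learn-sample mean) against the actual target values. A minimal sketch in Python (illustrative data, not CART output):

```python
import math

def rms_error(actuals, predicted_mean):
    """Root-mean-square error of a constant node prediction."""
    return math.sqrt(
        sum((a - predicted_mean) ** 2 for a in actuals) / len(actuals)
    )

# A regression tree assigns every case in a terminal node the node's
# learn-sample mean; RMS error measures the spread around that mean.
node_actuals = [20.0, 22.0, 24.0]          # hypothetical target values
node_prediction = sum(node_actuals) / len(node_actuals)  # 22.0
print(round(rms_error(node_actuals, node_prediction), 3))  # → 1.633
```

Train RMS Error applies this formula to the learn data in the node; Score RMS Error applies it to the scored data falling into the same node.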
…enu. If it is available, click on it and the item will appear at the bottom of the Report window.

Default Options
In the Report Contents dialog, the currently selected items to report and the Automatic Report checkbox can be saved as a default group of settings for future CART sessions by clicking the Set Default button. These default options will then persist from session to session, because they are saved in the CART preference file (CART6.INI). You may recall these settings at any time with the Use Default button.

(The Report Contents dialog lists the Available Models — CART Models and Combine Models — a Select result window control, and the Select Items to Report checklist: Data file, Target variable, Predictors, Sample, Tree sequence, Tree topology, Cost vs. Nodes, Gains chart, Gains data, and Variable importance, together with Select All / Unselect All, Report Now, Cancel, and OK buttons.)

CART 6 contains two sets of report options: one is for standard one-tree models, the other for the combined (bagging and ARCing) models.

Automatic Report
Additionally, CART can produce a stock report with the click of a button. You decide which components of the CART output would be most useful to you on the Report > Set Report Options menu and then select them. The stock report will be the same for all CART results in the session until you visit the Report Contents dialog again. In addition, the cu…
…epair or replacement, at Salford Systems' option. If Salford Systems cannot repair the defect or replace the software with functionally equivalent software within sixty (60) days of Salford Systems' receipt of the defective software, then you shall be entitled to a full refund of the license fee. Salford Systems cannot and does not warrant that the functions contained in the program will meet your requirements, or that the operation of the program will be uninterrupted or error-free. Salford Systems disclaims any and all liability for special, incidental, or consequential damages, including loss of profit, arising out of or with respect to the use, operation, or support of this program, even if Salford Systems has been apprised of the possibility of such damages.

Citations
The proper citations for CART technology and this software are:

Breiman, Leo, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Pacific Grove: Wadsworth, 1984.

Steinberg, Dan, and Phillip Colla. CART: Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.

Steinberg, Dan, and Mikhail Golovnya. CART 6.0 User's Manual. San Diego, CA: Salford Systems, 2006.

Trademarks
CART is a registered trademark of California Statistical Software, Inc., and is exclusively licensed to Salford Systems. StatTransfer is a trademark of Circle Systems. DBMS/Copy is a trademark of Conceptual Software. All other trademarks mentioned herein ar…
…erating system will block any access to it by an external application such as CART. On some operating systems, if the Excel file was recently open in Excel, the Excel application must be closed to entirely release the file to be opened by CART.

- The first row must contain legal variable names (see the beginning of this chapter for details).
- Missing values must be represented by blank cells; no spaces or any other visible or invisible characters are allowed.
- Any cell with a character value will cause the entire column to be treated as a character variable (it will show up ending in a $ sign within the Model Setup). This situation may be difficult to notice right away, especially in large files.
- Any cell explicitly declared as a character format in Excel will automatically render the entire column as character, even though the value itself might look like a number; such cases are extremely difficult to track down.
- It is best to use the cut-and-paste-values technique to replace all formulas in your spreadsheet with actual values. Formulas have sometimes been reported to cause problems with reading data correctly.
- Alternatively, you may save a copy of your Excel file as a comma-delimited file (CSV) and use the "File of type: Delimited Text (*.csv, *.dat, *.txt)" option; caution: make sure no commas are part of the data values.

Chapter 3 CART BASICS

This chapter provides a hands-on exercise using a credit-risk binar…
…erical variables may be specified. Variable groups may be used in the XYPLOT command similarly to variable names.

Appendix IV BASIC Programming Language

This chapter provides an overview of the built-in BASIC programming language available within CART.

BASIC Programming Language
CART and other Salford Systems modules contain an integrated implementation of a complete BASIC programming language for transforming variables, creating new variables, filtering cases, and database programming. Because the programming language is directly accessible anywhere in CART, you can perform a number of database management functions without invoking the data step of another program.

The BASIC transformation language allows you to modify your input files on the fly while you are in an analysis module. Permanent copies of your changed data can be obtained with the RUN command, which does no modeling. BASIC statements are applied to the data as they are read in, and before any modeling takes place, allowing variables created or modified by BASIC to be used in the same manner as unmodified variables on the input dataset.

Although this integrated version of BASIC is much more powerful than the simple variable transformation functions sometimes found in other statistical procedures, it is not meant to be a replacement for the more comprehensive data steps found in general-use statistics packages. At present, in…
…ering a value between 0 and 1, CART will use this value to geometrically decrease the weight of the contribution of surrogates in proportion to their surrogate ranking (first, second, third, etc.). Finally, you may click on the Use Only Top radio button and select the number of surrogates at each split that you want CART to consider in the calculation.

(The Variable Importance tab of the Tree Summary Reports window lists each variable with its importance score and score bar — here ANYRAQT at 100.00, followed by ANYPOOL, FIT, CLASSES, ONAER, OFFAER, TANNING, NSUPPS, SMALLBUS, NFAMMEM, and HOME — along with the controls Consider Only Primary Splitters, Show zero importance variables, and Discount Surrogates.)

Misclassification
The Misclassification report shows how many cases were incorrectly classified in the overall tree, for both learn and test (or cross-validated) samples. The tables, which can be sorted by percent error, cost, or class, display:

  Class            — Class level
  N Cases          — Total number of cases in the class
  N Misclassified  — Total number of misclassified cases in the class
  Pct Error        — Percent of cases misclassified
  Cost             — Fraction of cases misclassified multiplied by the cost assigned for misclassification

In our example, we can see that the misclassification errors ranged from one to five…
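The geometric discount can be illustrated with a small sketch. This is Python, not CART, and it is a simplification under stated assumptions: CART's importance measure aggregates improvements over all nodes of the tree, whereas the function below discounts a single node's surrogate improvements by rank, with d the user-entered value between 0 and 1.

```python
def discounted_surrogate_score(improvements, d):
    """Weight the first surrogate's improvement by d, the second by d**2,
    and so on -- a geometric discount by surrogate rank.

    improvements: surrogate improvements in rank order (first, second, ...)
    d: discount factor between 0 and 1
    """
    return sum(imp * d ** (rank + 1) for rank, imp in enumerate(improvements))

# With d = 0.5 each successive surrogate counts half as much as the last:
# 0.10*0.5 + 0.08*0.25 + 0.04*0.125 = 0.075
score = discounted_surrogate_score([0.10, 0.08, 0.04], d=0.5)
print(score)
```

Setting d = 1 reproduces the undiscounted sum, and d close to 0 approaches the Consider Only Primary Splitters behavior.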
…erminal Node 6 — Auxiliary Variables

(For this node, the Categorical Variables table reports the frequency distribution of HOME — one case, 1.12 percent, with HOME = 1, and 88 cases, 98.88 percent, with HOME = 0 — while the Continuous Variables table reports N, Min, Max, Mean, and Missing counts for FIT and CLASSES.)

This table reports summary statistics for HOME, CLASSES, and FIT for the given node. Frequency distributions are reported when a predictor is categorical (for example, all but one case have HOME = 0), and means and standard deviations are reported for continuous predictors.

In addition to viewing the summary statistics, you may color-code all terminal nodes based on any of the auxiliary variables. For example, do the following steps to color-code terminal nodes using the HOME variable:

1. Right-click anywhere in the gray area in the top half of the Navigator window and choose Select Current Target (alternatively, use the View > Select Current Target menu). The Select Target Variable window will appear.

Chapter 4 Classification Trees

(The Select Target Variable dialog shows the current variable, SEGMENT, a list of candidate target variables, and Group, Level, Name, and Color settings, with Set Default Colors, Apply, and Cancel buttons.)

2. Click OK. Back in the Navigator window, choose the desired class level; the terminal nodes will now be color-coded as if HOME were the target.

(Navigator 1 — Classification tree to…)
…es to get started. In the left panel, select a variable for which labels are to be defined. If any class labels are currently defined for this variable, they will appear in the left panel and — if the variable is selected — in the right panel as well, where they may be altered or deleted.

To enter a new class name: in the right panel for the selected variable, define a numeric value (one that will appear in your data) in the Level column and its corresponding text label in the "Class names for" column. Repeat for as many class names as necessary for the selected variable. You need not define labels for all levels of a categorical variable; a numeric level which does not have a class name will appear in the CART output as it always has — as a number. Also, it is acceptable to define labels for levels that do not occur in your data. This allows you to define a broad range of class names for a variable, all of which will be stored in a command script (CMD file), but only those actually appearing in the data you are using will be used.

Class names have the greatest use for categorical numeric target variables, i.e., in a classification tree. For example, for a four-level target variable PARTY, classes such as Independent, Liberal, Conservative, and Green could appear in CART reports and the Navigator rather than levels 1, 2, 3, and 4. In genera…
…es are built and pruned using the overall data and are evaluated using the setaside data. Learn and test samples for each of the trees in the expert series are constructed from the overall set. These samples may be copies of the overall data, or may be sampled with or without replacement from the overall set. It is not necessary to have a test set for each tree; they can be built using cross-validation or with no pruning (exploratory). It is not necessary to have a setaside set, although without it, comparison of the initial tree and the expert set must be done with two additional separate case runs.

The command syntax is:

  MOPTIONS CYCLES=<N>, ARC=<YES|NO>,
           SETASIDE PROP=<x> | FILE=<file> | SEPVAR=<var> | TEST | CROSS=<N> | EXPLORE,
           DETAILS=INITIAL|SET|ALL|NONE,
           TRIES=<N>, POWER=<x>, RTABLES=<YES|NO>

CYCLES specifies the number of desired trees in the committee of experts, not including any initial tree.

ARC specifies which combine method will be used. When ARC=YES, the ARCing (Adaptive Resampling and Combining) method is used; when ARC=NO, the bootstrap aggregation, or bagging, method is used. The default is ARC=NO.

SETASIDE specifies how the setaside sample is created. This sample is NOT used to build or prune any of the trees; it is used only to evaluate the predictive capability of trees, including the initial tree.

PROP=<x> specifies the proportion (0 to 1)…
…es the minimum number of cases required in a node for linear combination splits to be considered; smaller nodes will be split on single variables.

DELETE governs the backwards deletion of variables in a stepwise algorithm. The default is 0.20.

LINSPLITS is a forecast of the maximum number of linear combination splits in the maximal tree. This value is estimated automatically by CART and normally need not be set. The automatic estimate may be overridden to allocate more linear combination workspace.

EXHAUSTIVE tells CART to attempt computing linear combinations using each continuous independent variable as the perturbation variable.

Examples:

  LINEAR N=400, DELETE=.30

Linear combination splits are turned off by simply entering the command:

  LINEAR

The LINEAR command is deprecated in favor of the LCLIST command and may be removed from future versions of CART.

Appendix III Command Reference

LOPTIONS

Purpose
The LOPTIONS command toggles several logical options on and off. The command syntax is:

  LOPTIONS MEANS=<YES|NO>, TIMING=<YES|NO>, NOPRINT,
           PREDICTION_SUCCESS=<YES|NO>, GAINS=<YES|NO>, ROC=<YES|NO>,
           DS=<YES|NO>, PLOTS=<YES|NO>, <plot_character>,
           PPMSCOPY=<YES|NO>, STATTRAN=<YES|NO>

  MEANS       — Controls printing of summary stats for all model variables
  TIMING      — Reports CPU time on selected platforms
  NOPRINT     — Omits node-specific output and prints only summary tables
  PREDICTIONS — Requests the prediction success…
…es you far more control than is available with the simple SELECT statement. For example:

  %IF AGE > 50 OR INCOME < 15000 OR (REGION = 9 AND GOLF) THEN DELETE

It is often useful to draw a random sample from a data set to fit a problem into memory or to speed up a preliminary analysis. By using the uniform random number generator in BASIC, this is easily accomplished with a one-line statement:

  %IF URN < .5 THEN DELETE

The data set can be divided into an analysis portion and a separate test portion, distinguished by the variable TEST:

  %LET TEST = URN < .4

This sets TEST equal to 1 in approximately 40% of all cases and 0 in all other cases. The following draws a stratified random sample, taking 10% of the first stratum and 50% of all other strata:

  %IF DEPVAR = 1 AND URN < .1 THEN DELETE
  %ELSE IF DEPVAR <> 1 AND URN < .5 THEN DELETE

DATA Blocks
A DATA block is a block of statements appearing between a DATA command and a DATA END command. These statements are treated as BASIC statements even though they do not start with %. Here is an example:

  DATA
  let ranbeta1 = brn(.25, .75)
  let ranbeta2 = brn(.75, .25)
  let ranbin1 = nrn(100, .25)
  let ranbin2 = nrn(500, .75)
  let ranchi1 = xrn(1)
  let ranchi2 = xrn(2)
  DATA END

Advanced Programming Features
Integrated BASIC also allows statements to have line numbers, which facilitate the use of flow control with GOTO statements.
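The same uniform-random-number sampling idioms translate directly to other languages. A Python sketch (illustrative only, not CART's BASIC; field names are hypothetical) of the test-flag and stratified-sample devices above:

```python
import random

rng = random.Random(42)

# Hypothetical dataset: a stratum variable DEPVAR and one predictor X.
cases = [{"DEPVAR": rng.choice([1, 2]), "X": rng.random()}
         for _ in range(10_000)]

# %LET TEST = URN < .4  -- flag roughly 40% of cases as a test sample.
for case in cases:
    case["TEST"] = 1 if rng.random() < 0.4 else 0

# Stratified sample keeping about 10% of stratum DEPVAR = 1 and
# 50% of all other strata (in the BASIC version, DELETE removes
# the cases that fall outside the sample).
sample = [c for c in cases
          if (c["DEPVAR"] == 1 and rng.random() < 0.10)
          or (c["DEPVAR"] != 1 and rng.random() < 0.50)]

test_share = sum(c["TEST"] for c in cases) / len(cases)
print(round(test_share, 2))   # close to 0.4
```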
…ess OK to activate the scoring process.

The Grove File portion of the Score Data window will contain your navigator file name; this means that the embedded grove file will be used for scoring. You do not have to change this unless you want an external grove file to be used for scoring. This mode may not be available for all older navigators.

Scoring Using Only the Grove File
If you have a grove file you would like to use for scoring, do the following steps:

1. Make sure the CART Output window is active.
2. Choose Score Data in the Model menu.

Both of the above steps can be replaced by simply clicking the corresponding button in the toolbar.

3. Enter relevant information into the Score Data dialog, including the name of the grove file in the Grove File section.
4. Press OK to activate the scoring process.

Score Data Dialog
The Score Data dialog is shown in the picture below.

(The dialog contains Data file and Grove file selections with the grove Type — here Classification, tree Tree_1_Main — a "Save results to a file" option; a Subtree selector showing sequence tree no. 2, 3 nodes, relative cost 0.081445, the Optimal Tree; and controls to select the Target variable, an optional Weight variable, up to 50 ID variables, the sort order (File Order), the CART model, and Select Cases, with OK and Cancel buttons.)
…est of the variables, while the variable importance list tells exactly which variables are involved. We illustrate this process using the SPAMBASE.CSV dataset (see the TARGET.CMD command file for details).

(The Battery Summary window lists each model in the TARGET battery — one per candidate target variable, such as CAPITAL_RUN_LENGTH_AVERAGE, CAPITAL_RUN_LENGTH_LONGEST, WORD_FREQ_FONT, WORD_FREQ_ADDRESS, WORD_FREQ_DIRECT, WORD_FREQ_TECHNOLOGY, and WORD_FREQ_650 — with its optimal number of terminal nodes and test relative error, together with View, Chart Type, Sort by, Show Min Error, and Save Grove controls.)

The results indicate that WORD_FREQ_415 is the easiest to predict (relative error 0.0971). Double-clicking on the highlighted line and looking at the Splitters information in the resulting navigator reveals:

(The navigator's Main Tree splitter list shows WORD_FREQ_857 as the split variable at every node.)

In other words, WORD_FREQ_857 can be used to predict WORD_FREQ_415 nearly perfectly. In contrast, WORD_FREQ_PARTS cannot b…
…et research arenas and may not apply to other subject areas. Nevertheless, they represent such a consistent set of empirical findings that we expect them to continue to hold in other domains and data sets more often than not.

- For a two-level dependent variable that can be predicted with a relative error of less than 0.50, the Gini splitting rule is typically best.
- For a two-level dependent variable that can be predicted with a relative error of only 0.80 or higher, Power-Modified Twoing tends to perform best.
- For target variables with four to nine levels, Twoing has a good chance of being the best splitting rule.
- For higher-level categorical dependent variables with 10 or more levels, either Twoing or Power-Modified Twoing is often considerably more accurate than Gini.

Linear Combination Splits
To deal more effectively with linear structure, CART has an option that allows node splits to be made on linear combinations of non-categorical variables. This option is implemented by clicking on the Use Linear Combinations for Splitting check box on the Method tab, as seen below.

(The Method tab's linear combination controls include: Minimum node sample size for linear combinations; Variable deletion significance level, 0.20; Number of nodes likely to be split by linear combinations in the maximal tree, Automatic; and Use Only Selected Variables.)

Minimum Node Sample Size
The…
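All of the splitting rules being compared score a candidate split by the impurity reduction it achieves; for the Gini rule this can be sketched as follows (a minimal Python illustration with a hypothetical two-class node, not CART's internal code):

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_improvement(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`
    (child impurities weighted by the share of cases they receive)."""
    n, nl, nr = sum(parent), sum(left), sum(right)
    return gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)

# A perfectly separating split on a balanced two-class node removes
# all impurity, for the maximum possible improvement of 0.5:
print(gini_improvement(parent=[50, 50], left=[50, 0], right=[0, 50]))  # → 0.5
```

Twoing and Power-Modified Twoing score the same candidate splits with different criteria, which is why the best rule can change with the number of target levels and the difficulty of the problem.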
…fault. We are now ready to grow our tree. To begin the CART analysis, click the Start button. A progress report appears that lets you know how much time the analysis should take and approximately how much time remains. Once the analysis is complete, text output appears in the CART Output window, blue hyperlinks appear in the Report Contents panel, and a new window, the Navigator, is opened and placed in the foreground. We first explore the Navigator and then return to the text output.

Tree Navigator
The tree topology displayed in the top panel of the Navigator window provides an immediate snapshot of the tree's size and depth. By default, the optimal (or minimum cost) tree is initially displayed; in this example, it is the tree with nine terminal nodes, as illustrated below.

(The Navigator window shows the classification tree topology for SEGMENT, with color-coding controls, model statistics — predictors, important nodes, minimum node cases, best ROC nodes, ROC train and test, number of nodes — and the Displays and Reports buttons: Save, Model, Splitters, Tree Details, Summary Reports, Commands, Grove, Translate, and Score.)

Terminal nodes in classification trees are color-coded to indicate whether a particular class level improves or worsens with respect to terminal node purity when compared to the root node. By default, color-coding is not init…
…ference in two population proportions (learn node content versus test node content). Note that a node may agree on the direction (class assignment) but still have a significant difference between the learn and test proportions, as reflected by the z value.

Chapter 8 Train-Test Consistency (TTC)

Rank Max Z reports the z value of the standard statistical test on the difference in two population proportions, as follows: we first sort nodes by the learn-based responses, then we sort nodes by the test-based responses, and finally we look at the nodes side by side and check the difference in test-based proportions for each pair.

Dir Fail Count reports the total number of terminal nodes in the tree that failed directional agreement.

Rank Fail Count reports the total number of terminal node pairs in the tree that failed the rank agreement.

The Consistency Details by Nodes (lower half) provides a detailed node-by-node stability report for the tree selected in the Consistency by Trees part (upper half). For example, the optimal tree with 18 terminal nodes has one directional instability, in node 15, as seen by scrolling the list in the lower half (at the given significance level). In addition to the columns already present in the Consistency by Trees report, the following ones are added:

  Lift Learn    — node lift on the train data
  Lift Test     — node lift on the test data
  N Focus Learn — number of train records that belong t…
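The z value referred to above is the standard two-proportion test statistic. A sketch in Python (pooled-standard-error form; illustrative counts — CART's exact computation may differ in detail):

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """z statistic for the difference between two sample proportions,
    using the pooled standard error."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical node: 60 responders of 100 learn cases vs. 40 of 100
# test cases. The node agrees in direction yet the learn and test
# proportions still differ significantly (|z| > 1.96 at the 5% level).
z = two_proportion_z(60, 100, 40, 100)
print(round(z, 2))  # → 2.83
```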
265. for all internal nodes by clicking on the node itself; it is usually limited to only the top five entries.

Variable Importance

The Variable Importance tab is the same as in classification, but importance scores are now based on regression improvements. See Chapter 3, CART BASICS, for a discussion of Variable Importance.

Detailed Node Reports

To see what else we can learn about our regression tree, return to the Navigator by closing the Summary Reports window. To request a detailed node information display, simply click on the node of interest; for example, left-click on the left child of the root node (internal node 2).

155 Chapter 5: Regression Trees

[Screenshot: Navigator 1 — regression tree topology for MV, color coded by node mean (smaller to larger), with the Relative Error profile by number of nodes and the Displays and Reports controls.]

The Competitors and Surrogates tab

As illustrated below, the first of the four tabs in the non-terminal node report provides node-specific information on both the competitor and surrogate splits for the selected node, in this case the root node. This results tab is discussed in detail in Chapter 3, CART BASICS.

[Screenshot: Navigator 1, Node 2 — Competitors and S
266. ft panel. Each competitor is identified by a variable name, the value at which the split would be made, and the improvement yielded by the split.

Note: You may need to alter the width of the columns in this display to make everything we discuss here visible. Just position your mouse in the column header, over the border you wish to move. When the cursor changes to a cross-hairs, right-click and drag the border to widen or narrow the column.

The best competitor, CREDIT_LIMIT, would split at the value 5546 and would yield an improvement of 0.0346, quite a bit below the main splitter improvement of 0.1035. Improvement scores should be looked at in relative rather than absolute terms: the improvement of the main splitter is almost three times that of the best competitor, an unusually large but not suspiciously large ratio.

The quality of the competitor splits relative to the primary split can also be evaluated by inspecting the line graph displayed in the upper right panel. The improvement yielded by each competitor split appears on the y-axis, and the number (or rank) of the competitor split on the x-axis, with the primary split improvement displayed at x=0. The graph makes plain that the primary splitter is quite a bit better than the closest competitor, but that the 2nd, 3rd, and 4th competitors all have similar improvements.

Surrogates are an important innovation in CART technology and play a key role in CART prediction and tree interpretation. A surrogate s
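The relative comparison of improvement scores is simple arithmetic; here is the computation for the two figures quoted above.

```python
main_improvement = 0.1035   # primary splitter, from the node report
best_competitor = 0.0346    # CREDIT_LIMIT, the best competitor

ratio = main_improvement / best_competitor
print(f"primary splitter is {ratio:.2f}x the best competitor")  # 2.99x
```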
267. g CART, we recommend that you devote some time to further reading. The primary source of information about the software's methodology is the main reference manual, CART: Classification and Regression Trees, which contains a comprehensive discussion of the conceptual basis and features of CART. As you work through this manual you may find it helpful to consult the main manual for more detailed discussion of some technical terms and concepts. Additional detailed information about the CART algorithm and the thinking of the authors can be found in the original CART monograph: Breiman, Leo, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Pacific Grove: Wadsworth, 1984.

21 Introducing CART 6.0

The remainder of the Windows User's Guide is organized as follows:

- Chapter 1: INSTALLING AND STARTING CART
- Chapter 2: READING DATA
- Chapter 3: CART BASICS
- Chapter 4: CLASSIFICATION TREES
- Chapter 5: REGRESSION TREES
- Chapter 6: ENSEMBLE MODELS AND COMMITTEES OF EXPERTS
- Chapter 7: SCORING AND TRANSLATING
- Chapter 8: TRAIN-TEST CONSISTENCY (TTC)
- Chapter 9: HOT SPOT DETECTION
- Chapter 10: CART BATTERIES
- Chapter 11: CART SEGMENTATION
- Chapter 12: FEATURES AND OPTIONS
- Chapter 13: WORKING WITH COMMAND LANGUAGE
- Appendix I: COMMAND LINE MENU EQUIVALENTS
- Appendix II: ERRORS AND WARNINGS
- Appendix III: COMMAND REFERENCE
- Appendix IV: BASIC P
268. g data.

The Testing Tab

Testing is a vital stage in the CART tree selection process; without testing we cannot know how well a given tree can be expected to perform on new data. CART allows you to choose from five different test strategies, accessed in the Model Setup—Testing tab, where you will see the following methods:

96 Chapter 4: Classification Trees

1. No independent testing
2. V-fold cross validation (default is 10-fold)
3. Fraction of cases to be set aside at random for testing (default 0.20; for validation, default 0.00)
4. Test sample contained in a separate file
5. Variable separates learn and test samples (binary indicator)

[Screenshot: Model Setup—Testing tab, Select Method for Testing Tree, with the "Save all CV trees to grove" option. Default test setting: 10-fold cross validation.]

No Independent Testing

This option skips the entire testing phase and simply reports the largest tree grown. We recommend you use this option only in the earliest stages of becoming familiar with the data set, as this option provides no way to assess the performance of the tree when applied to new data. Because no test method is specified, CART does not select an optimal tree. Bypassing the test phase can be useful when you are using CART to generate a qu
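As an illustration of the default method, the sketch below assigns records to folds for V-fold cross validation. It is a simplified stand-in for what CART does internally; CART's actual fold assignment may differ (for example, by stratifying on the target), and the function and record counts here are our own.

```python
import random

def assign_folds(n_records, v=10, seed=17):
    """Randomly assign each of n_records to one of v folds of
    (nearly) equal size. Each fold serves once as the test set
    while a tree is grown on the remaining v - 1 folds."""
    folds = [i % v for i in range(n_records)]   # balanced fold sizes
    random.Random(seed).shuffle(folds)          # random assignment
    return folds

folds = assign_folds(200, v=10)
held_out = [i for i, f in enumerate(folds) if f == 0]  # test set for fold 0
```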
269. g the currently used file, any weighting, grouping, or selection in effect, short/medium/long output, current graphics character set, number of decimal places to which output prints, and the output destination. The command syntax is:

OPTIONS

384 Appendix III: Command Reference

OUTPUT

Purpose: The OUTPUT command routes output to the screen (the video display) or to a file. If you send output to a file and specify a simple filename, CART automatically gives the file a .DAT extension. If you supply a complete path name for the file, you must enclose the name in quotes. If you send output to a file, the analysis results will also appear on the display; if the screen pauses waiting for you to hit Enter or Return, output to a file will also pause. The command syntax is:

OUTPUT <file>

Examples:

OUTPUT            (sends subsequent output to screen only)
OUTPUT FILE1      (sends output to FILE1.DAT in the default directory)
OUTPUT "C:\REPORTS\NEWOUT.DAT"

385 Appendix III: Command Reference

PARTITION

Purpose: The PARTITION command defines how a single input dataset is to be partitioned into learn, test, and validation samples. There are two options: specify the proportions numerically, or specify a variable that identifies the sample into which each record should be placed.

PARTITION LEARN <x> TEST <x> VALIDATION <x>
PARTITION SEPVAR <variable>

For instance, to specify that
270. gative might allow a potentially life-threatening illness to go untreated. In data mining, costs can be handled in two ways:

- on a post-analysis basis, where costs are considered after a cost-agnostic model has been built, and
- on a during-analysis basis, in which costs are allowed to influence the details of the model.

CART is unique in allowing you to incorporate costs into your analysis and decision making using either of these two strategies. To incorporate costs of mistakes directly into your CART tree, complete the matrix in the Model Setup—Cost tab illustrated below. For example, if misclassifying low birth weight babies (LOW=1) is more costly than misclassifying babies who are not low birth weight (LOW=0), you may want to assign a penalty of two to misclassifying class 1 as 0. See the main reference manual for a detailed discussion of misclassification costs.

117 Chapter 4: Classification Trees

[Screenshot: Model Setup—Costs tab, showing the "misclassified as" cost matrix with Defaults and Symmetrical controls.]

Note: Only cell ratios matter; that is, the actual value in each cell of the cost matrix is of no consequence. Setting costs to 1 and 2 for the binary case is equivalent to setting costs to 10 and 20.

Note: In a two-class problem, set the lower cost to 1.00 and then set the
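To see how a cost matrix changes node class assignment, consider this simplified sketch: the node is labeled with the class that minimizes expected misclassification cost. This illustrates the principle only; CART's actual assignment rule also folds in priors, and the probabilities below are hypothetical.

```python
def min_cost_class(node_probs, cost):
    """Pick the class j minimizing sum_i p_i * cost[i][j],
    where cost[i][j] is the cost of calling true class i class j."""
    classes = range(len(node_probs))
    expected = [sum(node_probs[i] * cost[i][j] for i in classes)
                for j in classes]
    return min(classes, key=lambda j: expected[j])

# As in the LOW birth-weight example: misclassifying a true 1 as 0
# costs 2, while the reverse mistake costs 1.
cost = [[0, 1],
        [2, 0]]
label = min_cost_class([0.6, 0.4], cost)
# label == 1: class 1 wins even though class 0 holds the majority,
# because 0.6 * 1 = 0.6 beats 0.4 * 2 = 0.8.
```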
271. gator window to bring up the patterns available, then left-mouse-click on your preferred display. You can also use the View > Node Display menu to control mouse-hover displays.

Now hover over the bright red node near the bottom right of the tree. This is terminal node 9, which has a bad rate of 70.1%, substantially higher than the baseline rate of 30.6% in the root. Visiting the other bright red nodes reveals similarly concentrated groups of defaulters. Having established that our tree appears to be a promising model, we now want to drill deeper into the results.

51 CART BASICS

Viewing the Main Splitters

A convenient way to get a bird's-eye view of the model is to reveal only the variables used in each node. At the bottom left of the navigator, click on the Splitters button to see:

[Screenshot: Navigator 2 — Main Tree Split Variables, with splitters such as N_INQUIRIES and TIME_EMPLOYED.]

The color coding here is a simplified one: red means above-average risk and blue means below-average risk. Because the CART tree splitters always send low values of a splitter to the left and high values to the right, reading this display is easy. Going down the right side of the tree, we see that if a person has a large number of inquiries but few credit cards, they are quite high risk. Presumably this means that the person has probably attempted to obtain additional cards in the recent past but has failed. Looking down the left-hand side of the tree
272. gions, we can try dividing the tree at different depths, for example by enforcing the top/bottom division point at a depth of 2, then 3, then 4, etc. Usually it is quickly apparent that one of these divisions works better than the others.

How should the variables be divided into different lists? This is entirely up to the analyst, but typically each list will represent a natural grouping of variables. You might group variables by the degree of control you have over them, by the cost of acquisition, by accepted beliefs regarding their importance, or for convenience.

Example: In a model of consumer choice we wanted to develop a model relating consumer needs and wants to a specific product being selected. An unrestricted CART model always placed the country of origin of the product in the root node, as our consumers for the product in question had very strong feelings on this subject. For a number of reasons, our client wanted the country of origin to be the LAST splitter in the tree. To generate such a tree was easy: using CONSTRAINTS, we created one list of attributes containing all the consumer wants and needs and specified that those variables could only be used in the top region of the tree. We also created another list consisting of just the one country-of-origin attribute and specified that it could only appear in the bottom portion of the tree. The resulting tree was exactly what the marketers were looking for. We use marketing as an example because it
273. gnal the command parser that the rest of the line should be treated as a BASIC statement, not as a CART command. Alternatively, you can use BASIC to take the log or square root (as well as many other mathematical and statistical functions) of an existing variable. BASIC can also be used to draw a random sub-sample from the input data set. By using the uniform random number (URN) generator in BASIC, deleting a random sample of 50 percent, for example, is easily accomplished with the following statement:

IF URN > .5 THEN DELETE

For more about CART's built-in BASIC programming language, see Appendix IV in the main reference manual.

The Best Tree tab

The Model Setup—Best Tree tab is largely of historical interest, as it dates to a time when CART would produce a single tree in any run. Specifying how you wanted that single tree to be selected was an important part of the model setup procedure. In today's CART you have full access to every tree in the pruned tree sequence, and you can readily select trees of a size different than considered optimal. Nonetheless, when a tree is saved to a grove, CART always marks one of the pruned sub-trees as optimal. This tree will be selected by default for scoring. When you are working with many trees in a batch scoring mode, it will be most convenient if they are all marked with your preferred method for optimal tree selection. The Best Tree tab allows you to specify and modify the following parameters inf
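The URN statement above keeps roughly half the records. A hypothetical Python analogue of the same idea looks like this; the function and variable names are our own, not part of CART.

```python
import random

def random_subsample(records, keep_fraction=0.5, seed=42):
    """Keep each record independently with probability keep_fraction --
    the analogue of deleting records when URN > .5 in BASIC."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() <= keep_fraction]

sample = random_subsample(list(range(10_000)))
# roughly half of the 10,000 records survive
```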
274. grv

MEMO — a one-line quoted memo. To view any memo that may be embedded in a particular grove, use the ECHO option, e.g.:

GROVE "filename.grv" ECHO

If one of the above options is specified, the file name must be quoted.

359 Appendix III: Command Reference

HARVEST

Purpose: The HARVEST command specifies which trees in a grove are processed during SCORE or TRANSLATE, and how those trees are pruned for processing. For selecting trees in a grove, the HARVEST SELECT command is used. The command syntax is:

HARVEST SELECT ALL | RELERR <x> | COMPLEXITY <x> | NODES <n> | RANDOM <n> | KEEP <n1, n2, ...> | EXCLUDE <n1, n2, ...> | BEST <n>

If the HARVEST SELECT command is not issued, all trees in the grove are selected. HARVEST SELECT is used to select specific trees from multi-tree models created with the COMBINE command, or from groves containing batteries of trees requested with the BATTERY command. Since regular CART models have only a single tree, HARVEST SELECT has no effect on them; use HARVEST PRUNE instead.

Prior to being used in a scoring or translation step, the selected trees are pruned to their optimal size. To specify a pruning condition to be applied to all the selected trees, use the HARVEST PRUNE command. The command syntax is:

HARVEST PRUNE NODES <n> | DEPTH <m> | TREENUMBER <N> | COMPLEXITY <x>

If several trees are sele
275. gt key to simultaneously highlight the variables with left mouse clicks, and then place a checkmark in the Select Categorical box at the bottom of the column.

[Screenshot: Model Setup — Variable Selection grid with Target, Predictor, Categorical, Weight, and Aux checkboxes; Unsupervised and Set Focus Class options; Select Predictors / Select Categorical / Select Aux controls; Sort (File Order); Number of Predictors: 8; and the Tree Type selection.]

Note: When the Tree Type Classification radio button is checked, the target variable will be automatically defined as categorical and will appear with the corresponding checkmark at later invocations of the Model Setup. Similarly, the Regression radio button will automatically cancel the categorical status of the target variable. In other words, the specified Tree Type determines whether the target is treated as categorical or continuous.

Annotation on Categorical Variables

Categorical targets and predictors are those that take on a conceptually finite set of discrete values, for example, data naturally in text form (e.g., Male/Female). You may declare any variable categorical, but you should do so only when this is sensi
276. h or diagram images or table formatting. It is possible to cut and paste to/from the Report Window and other Windows documents, such as Microsoft Word, Notepad, Wordpad, etc. To select the entire report quickly and drop it into another Windows application, use Ctrl+A (shortcut for Edit > Select All), then Ctrl+C (copy to clipboard), move to the other application, and paste.

Data Viewer

Once you have opened your database, CART's Data Viewer allows you to view (but not edit or print) the data as a spreadsheet, for investigating data anomalies or seeing the pattern of missing values. The Data Viewer window is opened by selecting the View > View Data menu item or clicking on the View Data toolbar icon.

Note: Only one data file can be displayed at a time.

291 Chapter 12: Features and Options

[Screenshot: Data Viewer spreadsheet showing variables such as ANYRAQT, ONAER, NSUPPS, OFFAER, NFAMMEM, TANNING, and ANYPOOL.]

Data Information

CART provides a GUI facility for viewing information on the currently open data file. Information is provided in groups of descriptive statistics for each variable, numeric and character. The DataInfo Setup window is opened by selecting the View > Data Info menu item or by clicking the corresponding toolbar icon. This action will open the DataInfo Setup dialog. Here you can see various details about the data information that will be generated
277. hange left, right, top, and bottom margins.

143 Chapter 4: Classification Trees

Overlaying and Printing Gains Charts

You can overlay gains charts for nested trees in a CART sequence, for different CART analyses, and for different classes of a target variable. To overlay two or more gains charts:

1. Select the corresponding navigator.
2. Click Summary Reports and make sure the Gains Chart tab is active.
3. Choose the right target class in the Tgt Class selection box.
4. Repeat steps 1 through 3 as many times as needed to have all the gains charts you would like to overlay.
5. Select Gains Charts from the View menu, which will open the Overlay Gains Charts dialog, listing the charts you want to overlay in the right panel.

Note: Each click on the Summary Reports button creates a new instance of the Summary Reports window.

[Screenshot: Overlay Gains Charts dialog listing Navigator 1 (3-node, Class 1) and Navigator 1 (8-node, Class 1), with Add All / Remove All, Overlay Graphs / Tile Displays, and the Cum. Lift, Lift, Gains, and ROC buttons.]

6. Click Cum. Lift, Lift, Gains, or ROC to request the corresponding overlay charts.

[Screenshot: overlaid Gains Chart (% population on the x-axis). Legend: Navigator 1, 3-node tree, Class 1, LEARN mode, ROC integral 0.90832; Navigator 1, 8-node tree, Class 1, LEARN mode, ROC integral 0.93206.]

144 Chapter 4: Classification Trees

Each chart is displayed in a uni
278. has a tendency to generate trees that include some rather small nodes highly concentrated with the class of interest. If you prefer more balanced trees, you may prefer the results of the Twoing rule.

Symmetric Gini

This is a special variant of the Gini rule designed specifically to work with a cost matrix. If you are not specifying different costs for different classification errors, the Gini and the Symmetric Gini are identical. See the discussions on cost matrices for more information.

Entropy

The Entropy rule is one of the oldest decision tree splitting rules and has been very popular among computer scientists. Although it was the rule first used by CART authors Breiman, Friedman, Olshen, and Stone, they devote a section in the CART monograph to explaining why they switched to Gini. The simple answer is that the Entropy rule tends to produce even smaller terminal nodes (end-cut splits) and is usually less accurate than Gini. In our experience, about one problem in twenty is best handled by the Entropy rule.

Class Probability

The probability tree is a form of the Gini tree that deserves much more attention than it has received. Probability trees tend to be larger than Gini trees, and the predictions made in individual terminal nodes tend to be less reliable, but the details of the data structure that they reveal can be very valuable. When you are primarily interested in the performance of the top few nodes of a tree, you should be looking at probabi
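The impurity functions behind the Gini and Entropy rules are easy to state. The sketch below shows both for a node with given class proportions; a splitter's improvement is then the parent impurity minus the weighted child impurities. This is a textbook illustration, not CART's internal code.

```python
import math

def gini(p):
    """Gini impurity: 1 - sum(p_i^2). Zero for a pure node."""
    return 1.0 - sum(pi * pi for pi in p)

def entropy(p):
    """Entropy impurity in bits: -sum(p_i * log2(p_i))."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

pure, mixed = [1.0, 0.0], [0.5, 0.5]
# gini(pure) == 0.0 and entropy(pure) == 0.0;
# a 50/50 node is maximally impure: gini == 0.5, entropy == 1.0 bit.
```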
279. hat match the selected extension in the File of type selection box. You must select an explicit data format to activate the corresponding data access driver.

Accessing your data regardless of the original file format: CART, as well as other Salford Systems applications, employs built-in DATABASE CONVERSION functionality to enable you to access data in over 90 file formats, including Excel, SAS, S-Plus, Access, etc. By default this capability is enabled during the installation procedure.

35 Chapter 2: Reading Data

The Open Data File window contains a wide selection of supported data formats. Choose the corresponding data format first to see your files.

Variable Naming

Acceptable variable names have a maximum of 32 characters, must be composed of letters, numbers, and underscores, and must begin with a letter.

Note: Spaces are not permitted when reading raw ASCII text files. When using DATABASE CONVERSION, spaces are permitted only when the selected data file format allows them. However, in most cases the space will be converted and displayed as an underscore.

Examples of acceptable and unacceptable variable names:

- AGE_1 — OK
- GENDER — OK
- POLPARTY — OK
- 1WORLD — Unacceptable: leading character other than letter
- $WEIGHT — Unacceptable: leading character other than letter
- SOCIAL_SECURITY_NUMBER_AND_ACCOUNT — Unacceptable: too long; variable name will be truncated to 32 characters
- SALT&PEPPER
280. hat we can do for the current tree is indicated by the green bar marking the low point on the error profile, where we hit a relative error of .488. If we settle for either too small or too large a tree, we will not do as well as we could with the 10-node tree. Here we see the characteristic U-shaped curve with a partially flattened bottom.

Note: At this stage, all you need to keep in mind is that we are looking for trees with low values of relative error.

Note: A tree with a relative error of 0 or near 0 is usually too good to be true. In almost all cases this results from including an inappropriate predictor in the model.

Note: It is possible to have a relative error greater than 1. This happens when the model is actually worse than random guessing.

Returning to the navigator, we see some core model statistics in the bottom right section. The report shows that we conducted the analysis with 12 predictors, of which 11 were found to have some value. The tree being displayed now has 10 terminal nodes, and the smallest of these nodes contains seven records. Just below the main model statistics are ROC measures. If you are not familiar with the ROC, we include some introductory material on this important metric in a later chapter. For right now, all you need to know is that the ROC can range between 0 and 1, with higher values indicating better performance. Our model shows excellent performance, with a test value of the ROC of .7867.

Note: Suppose we were to take a si
281. he Directories tab to access or change the default locations. The tab appears as follows:

[Screenshot: Options — Directories tab, showing Default Directories for Input Files (Data, Model information, Command), Output Files (Model information, Prediction results, Run report), and Temporary Files, all under C:\Program Files\Salford Data Mining\CART Pro EX 6.0\Examples, plus the Most Recently Used file list setting and the Save as Defaults / Recall Defaults buttons.]

Input Files Location

- Data: input or training data sets for modeling.
- Model information: previously saved CART model files to be used for scoring.
- Command: command files or scripts.

Output Files Location

- Model information: CART model files saved for later scoring or export.
- Prediction results: output data sets containing scores or predictions.
- Run report: classic plain-text output.

29 Installing and Starting CART

Temporary Files Location

- Temporary: where CART will create additional temporary files as needed. Make sure that the drive where the temporary f
282. he number of records, number of missing values, percent missing, number of distinct levels, mean, minimum, and maximum values. The following is an example of this view:

293 Chapter 12: Features and Options

[Screenshot: Data Info window for GYMTUTOR.CSV (Brief view), with Find and Sort (File Order) controls and columns for Variable, N Missing, % Missing, N Distinct, and Mean, listing variables such as SMALLBUS, FIT, HOME, PERSTRN, CLASSES, and SEGMENT.]

When the user clicks the Full button, more details can be seen about the data. Use the + and - toggles to expand and contract each information group. The information groups available for viewing include the following:

- DESCRIPTIVE: N, N Missing, N = 0, N <> 0, N Distinct Values, Mean, Std. Deviation, Skewness, Coeff. Variation, Cond. Mean, Sum of Weights, Sum, Variance, Kurtosis, Std. Error Mean
- LOCATION: Mean, Median, Range
- VARIABILITY: Std. Deviation, Variance, Intrqrt. Range
- QUANTILES: 100% (Max), 99%, 95%, 90%, 75% (Q3), 50% (Median), 25% (Q1), 10%, 5%, 1%, 0% (Min)
- FREQUENCY TABLES: Most (Top 5 in Pop.), Least (Bottom 5 in Pop.), All

294 Chapter 12: Features and Options

[Screenshot: Data Info window for GYMTUTOR.CSV (Full view).]
283. he progress report window appears, and after the analysis is complete the Results dialog is opened.

Note: These menu items are available only when the Notepad window is active (see below).

To submit an existing batch file, choose Submit Command File from the File menu. In the Submit Command File dialog that appears, specify the ASCII text file from which command input is to be read, and then click on Open. To facilitate multiple CART runs, the CART results are directed only to the CART Output window in text form, i.e., the GUI Results dialog does not appear. This menu item is available only when the CART Output window is active.

299 Chapter 13: Working with Command Language

Each of these topics is discussed in more detail below.

Command Log

Most GUI dialog and menu selections have command analogs that are automatically sent to the Command Log, where they can be viewed, edited, resubmitted, and saved via the Command Log window. When the command log is first opened, by selecting Open Command Log from the View menu, all the commands for the current CART session are displayed. Subsequently, by selecting Update Command Log from the View menu, the most recent commands are added to the Command Log window. This menu item is available only when the Command Log window is active.

After computing a CART model, the entire set of commands can be archived by updating the command log, highlighting and copying the commands to the Notepad, or sa
284. her Test 189
constrains predictor groups 277
constraints 15, 146
  learn sample 281
Constraints tab 275
Contents tab 202
  Model Specifications 202
continuous target variables 146
Contraints tab 146
contribution, variable 248
control modes 297
converting older tree files 172
copy 284
correlational analysis 218
correlation structure 267
cost matrix 116
Cost tab 233
costs 12
Costs tab 85
counts, comparing learn/test 139
covariance matrix 224
creating batch files 298
creating new variables 101
cross validation 12, 18, 96, 206, 208
  data size warning 112
  reporting options 131
CSV 37
cut 290

D

data
  accessing 32
  ASCII 32
  DBMS/COPY 32
  methods of reading 32
  SPAMBASE.CSV 186
data files
  BOSTON.CSV 146, 178, 206, 218
  FNCELLA.CSV 216
  GOODBAD.CSV 40, 43, 134, 183
  GYMTUTOR.CSV 134, 171, 176, 208, 228, 267, 277, 279
  HOSLEM.CSV 82, 91
  PROSTATE2.CSV 220
  SAMPLE.CSV 34
  SPAM.CSV 194
  SPAMBASE.CSV 210, 211, 212, 214, 215, 219, 221, 223, 224
data information 291
  descriptive statistics 293
  extreme values 292
  frequency tables 291, 292
  include variables 292
  location 293
  maximum levels 292
  maximum tabulations 292
  quantiles 293
  saving to grove 292
  strata variable 292
  variability 293
  weight variable 292
data management 14
data preparation 14
data viewer 290
DATAINFO command 294, 346
DBMS/COPY
  ASCII format 33
  Excel format 36
default directories 29, 132
default displa
285. her extends information previously presented on the Models tab.

203 Chapter 10: CART Batteries

[Screenshot: Battery Summary window — Models, Contents, Accuracy, Error Profiles, Var Imp, and Averaging tabs, with a zoomed chart of Focus Class versus Other Classes accuracy, the Battery Models table, and the Model Quality (Misclass, ROC), Sample (Test, Learn), and Model Size (Min Cost, 1 SE) controls.]

The upper graph shows the accuracy of the focus class (blue curve) and the accuracy in the remaining classes (green curve) by model. The table below contains the actual values in the following columns:

- Model Name — unique model identifier.
- Opt. Terminal Nodes — number of terminal nodes in the minimum cost tree (Min Cost is pressed).
- 1 SE Terminal Nodes — number of terminal nodes in the 1 SE tree (1 SE is pressed).
- Rel. Error — relative error.
- Avg. ROC — average area under the ROC curve.
- Class 0 ROC — the ROC for the class in focus (the Focus Class selection box controls which class is put in focus; class 0 in this example).
- Class 0 Accuracy — accuracy in the focus class.
- Other Classes Accuracy — accuracy in the remaining classes.
- Average Accuracy — average accuracy over all available classes.

204 Chapter 10: CART Batteries

Overall Accur
286. herefore important to remember that unquoted filenames are assumed to be upper case; lower and mixed case names must be quoted.

Platform File Format Dependency

The Systat file format traditionally used by CART and other Salford Systems programs is platform dependent. There are three known variations on the platforms we currently support:

- Big-endian UNIX: Solaris, IRIX, AIX, HP-UX
- Little-endian UNIX: Alpha, Linux
- DOS/Windows

The consequence of this is that Systat datasets created on Windows PCs cannot be read by CART under UNIX, and vice versa, unless the data translation engine is enabled (not currently available for AIX or IRIX). This is far less of a problem than it once was.

311 Chapter 13: Working with Command Language

Use Caution When Transferring PC Files

It is always important to use binary mode when copying non-text files from a DOS/Windows environment to a UNIX environment, or vice versa. Failure to do so will cause the files to be corrupted.

Supporting Database Conversion Libraries

On selected platforms, CART will use the Stat Transfer database engine to read and write any file format supported by Stat Transfer, provided that the interface is enabled. To access data through the Stat Transfer interface, one simply uses the USE, SAVE, or ERROR FILE commands; the file name must be quoted, but no DBMS/COPY-style pseudo-extensions are required. To use the Stat Transfer interface under Windows, the STATTR
287. higher cost as needed. You may find that a small change in a cost is all that is needed to obtain the balance of correct and incorrect classifications you are looking for. Even if one cost is 50 times greater than another, using a setting like 2 or 3 may be adequate.

Note: On binary classification problems, manipulating costs is equivalent to manipulating priors, and vice versa. On multilevel problems, however, costs provide more detailed control over various misclassifications than do priors.

By default, all costs are set to one (unit costs). To change costs anywhere in the matrix, click on the cell you wish to alter and enter a positive numeric value in the text box called Cost. To specify a symmetrical cost matrix, enter the costs in the upper right triangle of the cost matrix and click on Symmetrical; CART automatically updates the remaining cells with symmetrical costs. Click Defaults to restore the unit costs.

118 Chapter 4: Classification Trees

Command-line users should use the following command syntax for each cell that has a non-unit value:

MISCLASSIFY COST <value> CLASSIFY <origin class> AS <predicted>

MISCLASSIFY COST 2 CLASSIFY 1 AS 0

CART requires all costs to be strictly positive; zero is not allowed. Use small values, such as .001, to effectively impose zero costs in some cells. We recommend conducting your analyses with the default costs until you have acquired a good underst
288. hoosing output language 182
  classic output options 182
  command line 182
  SAS options 182
  saving result to a file 182
  sub-trees 182
  tree sequence 182
  using grove file 181
  using navigator file 180
tree control 15
tree map 240
Tree menu 42, 229, 236
  Select Tree 244
  Tree Summary Reports 61, 244
Tree Summary Reports 151
tree navigator 149
tree sequence 174, 182, 186
  number of trees 131
tree size 236
  maximum depth 113, 287
  maximum number of nodes 113, 287
tree stability 186
Tree Summary Reports 61, 147, 151, 244
tree topology 48, 149, 235, 236, 244
tree type 86, 88, 234
  unsupervised 266
Tree window 240
trees 437
  committee 167
  ensembles 162
  initial 167
  minimum cost 102
  optimal 102
  printing 60, 141, 243
  sub-tree 58, 242
  viewing 56, 57, 239, 241
TTC 20, 186
tutorial 40
  segmentation 228
twoing 13, 106

U

UNIX platform 296
UNIX usage notes 310
unsupervised learning 19, 264
USE command 402

V

validation, auto 19
Var Imp tab 204
  Box Plot button 205
  Grid button 205
  Max button 205
  Mean button 205
  Median button 205
  Min button 205
  Quartile 0.25 button 205
  Quartile 0.75 button 205
  sort order 205
variable importance 16, 64, 103, 154, 248
  contribution of surrogates 249
  discounting improvement 248
  measures 61, 245
  number of surrogates considered 249
variable names 32, 33, 35
  ASCII text 35, 36
  DBMS/COPY 35
variable transformati
289. hose in the list that appear as predictors in the model and have missing values in the learn sample will get missing value indicators LIST can include variable groups and variables that are not part of the model Specifies whether a denominator of N or N 1 should be used in variance and standard deviation expressions in regression trees The default is N which is what the original CART implementation used Controls whether linear combinations other than the primary splitter are included in the node by node detail report ignored unless the LCLIST command is in effect 337 Appendix III Command Reference Cvs Controls whether CV trees are saved in the GROVE Examples BOPTIONS SERULE 85 SURROGATES 10 COPIOUS LIST BOPTIONS SPLITS 90 SURROGATES 8 PRINT 3 SERULE 0 OPTIONS 338 Appendix III Command Reference BUILD Purpose The BUILD command reads the data chooses the LEARN and TEST samples if any and generates trees It is the hot command that begins processing If using CART in the interactive mode as opposed to a command file the BUILD phase is ended with a QUIT command that returns you to CART The command syntax is BUILD Examples USE SEATBELT CSV MODEL BMW BUILD 339 Appendix Ill Command Reference CATEGORY Purpose The CATEGORY command indicates whether the target variable is categorical thereby initiating a classification tree and identifies which predictors are categori
hown next:

[Screenshot: the file information window for GOODBAD.CSV, located in C:\Program Files\Salford Data Mining\CART Pro EX 6.0\Examples, last modified Saturday, March 25, 2006, listing the dataset's variables with character/numeric type indicators and Sort, View Data, Stats, and Score controls.]

We can see from here that our file contains 664 records and 14 variables, three of which are character, or text, columns. The variable names are also listed, and you can change the order in which they are sorted from Alphabetical to File Order using the Sort drop-down control.

Start by clicking on the View Data button to bring up a spreadsheet display of the file contents. Note that some of the cells are blank or contain only a "."; these are missing values. The window offers a view-only display: you can scroll through the data, but you cannot edit it from here.

45 CART BASICS

[Screenshot: the view-only data window for GOODBAD.CSV, showing columns including TARGET, AGE, CREDIT_LIMIT, EDUCATION$, GENDER$, HH_SIZE, and INCOME, with blank cells where values are missing.]
… ............................................. 366
… ............................................. 369
LINEAR ...................................... 371
LOPTIONS ................................... 372
MEMO ........................................ 373
MEMORY ..................................... 374
METHOD ..................................... 375
MISCLASS ................................... 376
MODEL ....................................... 377
MOPTIONS ................................... 378
NAMES ....................................... 380
NEW .......................................... 381
NOTE ......................................... 382
OPTIONS ....................................
292. iable name requirements because failure to do so may cause CART to operate improperly When you experience difficulties reading your data first make sure the variable names are legal Accessing Data from Salford Systems Tools Many data analysts already have preferred database formats and use widely known systems such as SAS to manage and store data If you use a format we support then reading in data is as simple as opening the file The Excel file format is the most challenging because Excel allows you to enter data and column headers in a free format that may conflict with most data analysis conventions To successfully import Excel spreadsheets be sure to follow the variable column header naming conventions below If you prefer to manage your data as plain ASCII files you will need to follow the simple rules we list below to ensure successful data import 33 Chapter 2 Reading Data Reading ASCII Files CART has the built in capability to read various forms of delimited raw ASCII text files This built in capability is most appropriate for datasets composed of numeric and quoted character data using a comma for the delimiter Optionally spaces tabs or semicolons instead of commas can separate the data although a single delimiter must be used throughout the text data file ASCII files must have one observation per line with the first line containing variable names see the necessary requirements for variable names in the pre
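The header rules above are easy to pre-check before handing a file to CART. Below is a minimal Python sketch, not part of CART; the legality pattern is our simplified reading of the naming rules (a leading letter, then letters, digits, or underscores, with an optional trailing $ for character variables):

```python
import csv
import io
import re

# Simplified name rule: leading letter, then letters/digits/underscores,
# optionally ending in $ (character variables).
LEGAL_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*\$?$")

def check_header(text):
    """Map each variable name in the first line of a CSV to True/False legality."""
    names = next(csv.reader(io.StringIO(text)))
    return {name: bool(LEGAL_NAME.match(name)) for name in names}

sample = 'TARGET,AGE,GENDER$,CREDIT(LIMIT)\n1,48,"M",19387\n'
print(check_header(sample))
# CREDIT(LIMIT) fails the check; CART would replace the parentheses with underscores
```

Running a check like this on your own files before import is a quick way to find the names CART will alter.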
293. ially displayed To activate color coding select a target class level 1 from the control next to the Color code using title box located at the top of the Navigator As the legend in the upper right corner of the Navigator indicates nodes better than the root node are shades of red while nodes worse than the root node are shades of blue The more saturated the color the greater the improvement or worsening in that terminal node when compared to the root node CART terminal nodes are numbered left to right in ascending order starting with one In our example we can quickly ascertain that Class 1 cases are concentrated primarily in red terminal nodes 1 3 and 7 whereas very few or no Class 1 cases populate the remaining blue terminal nodes 236 Chapter 11 CART Segmentation Hovering the mouse over any of the nodes results in additional node information popping up vx You may change how many details are shown by a right mouse click in the gray area of the navigator window and a further left mouse click on the node information sample display Alternatively one may do the same using the View gt Node Display menu The bottom panel of the Navigator provides a visual indication of the quality of the optimal tree a graph displaying the cross validated relative cost by the number of terminal nodes for each tree size in the nested tree sequence Recall that CART begins by growing the maximal or largest tree possible and th
294. ical predictors and weight variables choose tree type classification regression or cluster analysis Categorical set up categorical class names Force Split specify a split variable for the root node and its immediate children Constraints allows you to pre specify sets of variables to be used in specific regions of the tree and to determine the order in which splitters appear in the tree This is a CART ProEX feature only Testing select a testing or self validation method Select Cases select a subset of original data Best Tree define the best tree selection method Method selecting a splitting rule Advanced specify other model building options 233 Chapter 11 CART Segmentation Cost specify misclassification costs Priors specify priors Penalty set penalties on variables missing values and high level categorical predictors Battery specify a battery of models to be run The only required step is the first one specify a target variable and tree type in the Model Setup dialog In our tutorial example we enter information into the Model tab only and then grow the tree using CART s default settings cross validation Gini splitting rule unitary misclassification costs and equal priors When the other Model Setup dialog tabs are left unchanged the following defaults are used All remaining variables in the data set other than the target will be used as predictors the Model tab No weights will be applie
295. ication table 61 66 245 249 MISCLASSIFY command 118 missing case weight 90 missing value analysis 15 114 missing value controls 15 114 missing value indicators 15 114 missing values 12 34 77 84 103 147 259 290 323 348 386 410 413 penalty 121 123 216 missing values indicators 216 model automation 16 200 MODEL command 377 model information 175 embedded 172 Model menu 42 229 Construct Model 78 261 model setup classification trees 83 default settings 85 147 regression trees 146 setting limits 113 286 Model Setup 84 129 146 Advanced tab 216 Advanced tab 147 148 Battery tab 147 200 Best Tree tab 146 Categorical tab 146 Constraints tab 146 Force Split tab 146 Method tab 146 Model tab 146 147 Penalty tab 147 216 Select Cases tab 146 Testing tab 210 211 Model Setup dialog 34 45 78 83 234 Advanced tab 85 111 232 233 433 Best Tree tab 85 102 232 233 Categorical tab 92 94 232 Combine tab 164 Constraints tab 275 Cost tab 116 233 Costs tab 85 Force Splits tab 267 Method tab 85 104 232 233 Model tab 85 232 233 264 266 Penalty tab 85 121 233 Priors tab 85 118 233 Select Cases tab 100 232 Testing tab 95 232 233 model specifications saving 140 Model tab 85 146 147 232 233 266 model translation 20 180 models scoring 170 172 173 translating 170 180 Monte Carlo test 19 215 MOPTIONS command 378 MRU files mos
296. ick cross tabulation of the target against one of your predictors It is also useful for supervised binning or aggregation of variables such as high level categoricals This use of CART is discussed in more detail in other sections V fold Cross validation Cross validation is a marvelous way to make the maximal use of your training data although it is typically used when data sets are small For example because the HOSLEM data set contains only 189 records it would be painful to segregate some of those data for the sake of testing alone Cross validation allows you to build your tree using all the data The testing phase requires running an additional 10 trees in 10 fold CV each of which is tested on a different 10 of the data The results from those 10 test runs are combined to create a table of synthesized test results 97 Chapter 4 Classification Trees Cross validation is discussed in greater detail in the command line manual and in the references cited there When deciding whether or not to use cross validation keep these points in mind w Cross validation is always a reasonable approach to testing However it is primarily a procedure that substitutes repeated analyses of different segments of your data for a more typical train test methodology If you have plentiful data you can save quite a bit of time by reserving some of the data for testing Cross validation can give you useful reports regarding the sensitivity of
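The partitioning mechanics of V-fold cross-validation can be sketched as follows. This is illustrative Python, not CART's implementation; CART performs the assignment of records to folds internally:

```python
def v_fold_indices(n_records, v=10):
    """Assign record indices round-robin to v test bins."""
    folds = [[] for _ in range(v)]
    for i in range(n_records):
        folds[i % v].append(i)
    return folds

# The 189-record HOSLEM data split into 10 test bins: each of the 10 runs
# tests on one bin and trains on the remaining nine.
folds = v_fold_indices(189, v=10)
print([len(f) for f in folds])   # [19, 19, 19, 19, 19, 19, 19, 19, 19, 18]
```

Every record is used for training in nine of the ten runs and for testing in exactly one, which is how the synthesized test results cover the full dataset.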
iew menu:

1. Select Assign Class Names.
2. Click on the Name text box and enter a label for that class.
3. Click on Color to select a color from the palette, then click OK.
4. Click Apply to enter the name/color; repeat steps 2-4 for the other levels.

An illustrative Class Assignment dialog box for our example is shown below. The labels and color codes are displayed in the individual node detail you see when you hover the mouse pointer over a node in the Navigator window, as well as in the main and sub-tree diagrams and printed tree output.

60 CART BASICS

[Screenshot: node detail for the split on N_INQUIRIES at 1.50, showing the assigned class names, colors, and case counts.]

Printing the Main Tree

To print the Main Tree, bring the tree window to the foreground and then select Print from the File menu, or use <Ctrl+P>. In the Print dialog box, illustrated below, you can select the pages that will be printed and the number of copies, as well as specify printer properties. You can also preview the page layout; CART will automatically shift the positions of the nodes so they are not split by page breaks.

[Screenshot: the Print dialog, with printer selection and properties, number of copies, a "Fit to two pages if possible" option, page selection (All pages / Non-blank only / Selected), Preview, Page Setup, and Cancel controls.]

You can see from the preview that a small section of the GOODBAD main tree spills over
298. ight help to explain the variation in median value across tracts For ease of reference definitions of the variables in BOSTON CSV data included with your installation sample data are given below CRIM per capita crime rate by town ZN proportion of residential land zoned for lots over 25 000 sq ft INDUS proportion of non retail business acres per town CHAS Charles River dummy variable 1 if tract bounds river 0 otherwise NOX nitric oxides concentration parts per 10 million RM average number of rooms per dwelling AGE proportion of owner occupied units built prior to 1940 DIS weighted distances to five Boston employment centers RAD index of accessibility to radial highways TAX full value property tax rate per 10 000 PT pupil teacher ratio by town LSTAT population of lower status MV Median value of owner occupied homes in 1000 s After you open a data set setting up a CART regression analysis entails several logical steps all carried out in one of the Model Setup dialog tabs available after clicking on the Model button in the Activity Window Model selects target and predictor variables specifies categorical predictors and weight variables chooses tree type regression specifies auxiliary variables Categorical sets up categorical class names Force Split specifies splitter for root node and its children Constraints specifies structural constraints on a tree Testing selects a testing or self validation method
ilable batteries.

206 Chapter 10 CART Batteries

Battery CV

[Table: availability of Battery CV by edition (CART 6.0 Standard, CART 6.0 Pro, CART 6.0 Pro EX).]

Battery CV runs cross-validation with the number of folds set to 5, 10, 20, and 50 bins. See the CV.CMD command file for run details on the BOSTON.CSV dataset.

[Screenshot: the Battery Summary window (Models, Contents, Error Profiles, Var Imp Averaging, and Charts tabs), comparing relative error and optimal tree size across the folds; for example, 5-fold: 17 terminal nodes, relative error 0.238; 50-fold: 17 terminal nodes, relative error 0.225.]

207 Chapter 10 CART Batteries

208 Chapter 10 CART Batteries

Battery CVR

[Table: availability of Battery CVR by edition (CART 6.0 Standard, CART 6.0 Pro, CART 6.0 Pro EX).]

Battery CVR repeats cross-validation many times using different random number seeds. We illustrate it on the BOSTON.CSV dataset by requesting 20 cycles (see the CVR.CMD command file for details).

[Screenshot: the Battery Summary window showing the error profile (relative error versus number of nodes) across the repeated cross-validation runs, with Average, Min, and Max curves; the minimum average relative error is 0.238.]

Note that the Var Imp Averaging tab is no longer available because each i
300. ile gt Save menu If you do this before exiting a CART session the resulting command file will contain the audit trail of the entire session rN The Command Log Window supports the cut and paste technique File New Notepad The CART GUI offers a simple text editor to write your own command files You may open multiple instances of the Notepad window using the File gt New Notepad menu You may also open an existing command file using the File gt Open gt Command File menu CART Notepad Untitled 1 BEE rN You may use the cut and paste technique to grab command sequences from the Command Log Window to edit in the notepad window File Submit Window This menu item allows you to submit a command sequence from a CART Notepad window to the CART engine Using this channel does not suppress the results window generated by the GUI back end a This option is also available for the Command Log Window in which case the entire session will be reproduced 301 Chapter 13 Working with Command Language Submitting multiple runs may produce too many open windows seriously affecting your system s performance Saving the contents of the notepad window into a command file and then using the File gt Submit Command File menu item see the following section may be preferable File Submit Command File This menu item allows you to submit a command file cmd directly to the CART engine When this channel is
301. ime series analysts can create one LCLIST for each predictor and its lagged values LCs constructed from such a list can be thought of as distributed lag predictors rN A variable can appear on more than one LCLIST meaning that LC lists can overlap You can even create an LCLIST with all numeric variables on it if you wish Below we have checked the box that activates LC lists for our example 110 Chapter 4 Classification Trees IV Use Linear Combinations for Splitting Minimum node sample size for linear combinations Variable deletion 12 significance level Number of nodes likely tobe Automatic split by linear combinations in maximal tree g Select Variables Clicking on the Select Variables button brings up this new window in which you may create your LC lists Only numeric variables will be displayed in this window Categorical variables will not be considered for incorporation into an LC even if they are simple 0 1 indicators This is one good reason to treat your 0 1 indicators as numeric rather than categorical predictors Sort Fis Order Click on New List to get started and then select the variables you want to include in the first list We will select AGE and SMOKE Add them and then click again on New List to start a second list Now Add HT PTD LWD and click OK to complete the LCLIST setup Click Start to begin the run Hovering your mouse over the nodes of
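To make the idea concrete: a linear-combination split routes a case left or right based on a weighted sum of the variables in one LC list rather than on a single variable. A hypothetical sketch in Python follows; the weights and threshold are invented for illustration, since CART searches for them itself:

```python
def lc_split(record, weights, threshold):
    """Route a record by a linear combination of its predictors."""
    score = sum(w * record[name] for name, w in weights.items())
    return "left" if score <= threshold else "right"

# One linear combination built from the AGE/SMOKE list (coefficients invented).
weights = {"AGE": 0.8, "SMOKE": -3.0}
print(lc_split({"AGE": 25, "SMOKE": 1}, weights, 20.0))  # 0.8*25 - 3.0 = 17.0 -> left
print(lc_split({"AGE": 35, "SMOKE": 0}, weights, 20.0))  # 0.8*35 = 28.0 -> right
```

This also shows why 0/1 indicators are usable inside an LC when treated as numeric: they contribute a simple additive shift to the score.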
302. in Tree dialog is active You can also copy and paste the rules onto the clipboard or directly into another application Chapter Train Test Consistency TTC A new powerful feature designed to identify stable robust trees 186 Chapter 8 Train Test Consistency TTC CART fa CART 6 0 Pro EX Optimal Models and Tree Stability CART relies on the concept of pruning to create a sequence of nested models as final model candidates Independent testing or cross validation is subsequently used to identify the optimal tree with respect to overall model performance based on the expected cost criterion This however does not guarantee that the resulting tree will show stable performance at the node level It is quite possible to have a node created earlier in the tree exhibiting unstable behavior across different partitions of the data Often such nodes cannot be easily eliminated without picking a much smaller tree in the pruning sequence thus picking an inferior in terms of accuracy model Nonetheless some analysts might be more interested in finding robust trees with all nodes exhibiting stable behavior and be less concerned with the actual accuracy measures for example marketing segmentation problems The TTC feature of CART was designed to facilitate such an analysis of the tree sequence Spam Data Example We illustrate the specifics of the TTC feature using the SPAMBASE CSV dataset First use the Open gt Command File
303. ine shows how the class was partitioned between two children with the percentage of the class going to the left child shown on the left side and the percentage of the class going to the right child shown on the right side In this example less than 20 of Class 1 went to the left side and more than 80 went to the right side The Splitter Tab When a node is split on a categorical variable an additional tab called Splitter is available in the Node Information window for all internal nodes For example declare TANNING as categorical and proceed with the standard GYMTUTOR run introduced above The optimal tree now has seven nodes with node number 5 being split on TANNING Left click on this node and choose the Splitter tab 255 Chapter 11 CART Segmentation oix Competitors and Surrogates Classification Is TANNING 23 456 Levels That Go Left Levels That Go Right From this we immediately conclude that all cases with TANNING equal to 2 3 4 5 or 6 go to the left child node whereas all cases with TANNING equal to 0 or 1 go to the right child node This feature is useful for analyzing high level categorical splits or when the same categorical variable is used as the main splitter multiple times in a tree The Rules tab The Rules tab is not present in the root node report because there are no rules to display For our discussion here we use the Rules tab for Node 5 4 Navigator 1 7 Node 5
304. ing CART automatically searches for important patterns and relationships uncovering hidden structure even in highly complex data CART trees can be used to generate accurate and reliable predictive models for a broad range of applications from bioinformatics to risk management and new applications are being reported daily The most common applications include churn prediction credit scoring drug discovery fraud detection manufacturing quality control and wildlife research Several hundred detailed applications studies are available from our website at http www salford systems com CART uses an intuitive Windows based interface making it accessible to both technical and non technical users Underlying the easy interface however is a mature theoretical foundation that distinguishes CART from other methodologies and other decision trees Salford Systems CART is the only decision tree system based on the original CART code developed by world renowned Stanford University and University of California at Berkeley statisticians Breiman Friedman Olshen and Stone The core CART code has always remained proprietary and less than 20 of its functionality was described in the original CART monograph Only Salford Systems has access to this code which now includes enhancements co developed by Salford Systems and CART s originators There is only one true CART and Salford Systems in collaboration with CART s creators is the only source for this
inted output are represented with a period, or dot, and missing values can be generated and their values tested using standard expressions. Thus, you might type:

%IF NOSE$ = "LONG" THEN LET ANSWER = .
%IF STATUS = . THEN DELETE

Missing values are propagated, so that most expressions involving variables that have missing values will themselves yield missing values. One important fact to note: because the missing value is technically a very large negative number, the expression X < 0 will evaluate as true if X is missing.

BASIC statements included in your command stream are executed when a hot command (such as ESTIMATE, APPLY, or RUN) is encountered; thus, they are processed before any estimation or tree building is attempted. This means that any new variables created in BASIC are available for use in MODEL and KEEP statements, and any cases that are deleted via BASIC will not be used in the analysis.

More Examples

It is easy to create new variables or change old variables using BASIC. The simplest statements create a new variable from other variables already in the data set. For example:

LET PROFIT = PRICE * QUANTITY2 + LOG(SQFT * RENT) + 5 * SQR(QUANTITY)

BASIC allows for easy construction of Boolean variables, which take a value of 1 if true and 0 if false. In the following statement, the variable XYZ would have a value of 1 if any condition on the right-hand side is true, and 0 otherwise:

LET XYZ = X1 < 5 OR X2 > 17 OR X3 = 6

Suppose your
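The X < 0 pitfall is easy to demonstrate. The sketch below models missing values with a large negative sentinel; the specific value is our illustration, not CART's actual internal constant:

```python
MISSING = -1e300   # hypothetical "very large negative number" sentinel

def is_missing(x):
    """Test for missingness explicitly rather than with a sign comparison."""
    return x == MISSING

x = MISSING
print(x < 0)            # True, even though x is really "missing", not negative
print(is_missing(x))    # True: an explicit missingness test is the safe form
print(5 < 0)            # False for an ordinary positive value
```

The same logic explains why, in BASIC, a test such as X < 0 should be paired with an explicit missing-value check (X = .) when negative values are meaningful in your data.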
306. inue button to access Options from the Edit menu General Text Report Preferences CART is actually part of an integrated data mining system offering several analytical methods The CART 6 0 Standard Edition product offers only the CART subsystem at this time but in the future other modules will become available The Options General tab controls report and display preferences that are common across several data mining technologies including TreeNet and RandomForests The screen shot below shows one set of user preferences 126 Chapter 4 Classification Trees Options General CART Directories General Options Text Report Preferences Use short notation for commands JT Summary stats for all model variables E I Prediction success tables EE Window To Display When Data Is Opened I Gains tables Classic Output ROC tables Activity Window I Use exponential notation for values near zero Model Setup Decimal places 5 4 I Show command prompt at start up Default Variable Sorting Order File order ROC Graph Axes Labels Alphabetical Tue Pos Rate False Pos Rate Reset Messages Save as Defaults Recall Defaults Cancel The report preferences allow you to turn on and off the following parts in the CART classic output with command line equivalents included Summary stats for all model variables mean standard deviation min max etc In classification models the stats a
ions.

[Screenshot: the Constraints tab, listing the variables ANYRAQT, ONAER, NSUPPS, OFFAER, NFAMMEM, TANNING, ANYPOOL, SMALLBUS, FIT, HOME, PERSTAN, and CLASSES, each with depth and Min Cases/Max Cases constraint settings.]

Had we left the tree unconstrained, ANYRAQT would have been the first split in the tree. However, as we can see from the tree details, the constraint was implemented, and ANYRAQT does not appear as a splitter until Node 2, with only 164 observations.

[Screenshot: the Navigator main-tree display, with Terminal Node 3 (N = 7) highlighted.]

Command-line users will use the following command syntax to set the constraints:

DISALLOW <variable list> ABOVE <depth> BELOW <depth> MORE <node_size> FEWER <node_size> SPLIT

For example:

DISALLOW OFFAER ABOVE 3 SPLIT
DISALLOW NFAMMEM BELOW 4 SPLIT
DISALLOW ANYRAQT MORE 200 SPLIT
DISALLOW CLASSES FEWER 25 SPLIT

283 Chapter 12 Features and Options

To reset constraints, use the command with no options:

DISALLOW

Saving and Printing Text Output

By default, CART text output is sent to the Report window. If you would like to save or print results, use one of the following methods.

Specify Output File Prior to Processing

To simultaneously save the text output to a file, you must specify the output file prior to processing. Once the output file is specified, all subsequent output will be recorded in the s
[Screenshot: the Train-Test Consistency window, with Direction (2.00) and Rank (2.00) agreement thresholds, Hide Agreed check boxes, a Fuzzy Match control, Select Columns to Display options, and consistency tables for TARGET Class 1 and TARGET Class 0.]

The upper half reports stability by trees, one line per tree. You can choose the class of interest by clicking on the corresponding tab. Green marks stable trees, while yellow marks unstable trees. Note that because there are two different approaches to tree stability (rank or directional), it is possible to have a tree agree on one criterion and disagree on the other.

The columns in the Consistency by Trees section are:

Tree Name: name of the tree. It is a constant for single trees but will have varying values for batteries of CART runs, when applicable.
Terminal Nodes: number of terminal nodes.
Direction Agreement: contains "Agree" if all terminal nodes agree on the direction of classification within the supplied degree of confidence.
Rank Match: contains "Agree" if all terminal nodes agree on the sorted sequence, as described above.
Direction Max Z: reports the z-value of the standard statistical test on the dif
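The directional test behind that z-value compares a node's classification rates between the learn and test samples; with the default threshold of 2.00, |z| at or below 2 counts as agreement. A textbook two-proportion z test gives the flavor (our sketch with invented node counts; CART's exact formula may differ):

```python
import math

def two_prop_z(p1, n1, p2, n2):
    """z statistic for the difference between two sample proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical node: 62% Class 1 among 150 learn cases vs. 55% among 140 test cases.
z = two_prop_z(0.62, 150, 0.55, 140)
print(round(z, 2))   # about 1.21, below the 2.00 threshold, so the node "agrees"
```

A larger |z| would indicate that the learn and test samples disagree about the node's behavior, flagging the tree as unstable on the directional criterion.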
309. irritability coded 1 if present 0 otherwise As you might guess we are going to explore the possibility that characteristics of the mother including demographics health status and the mother s behavior might influence the probability of a low birth weight baby Later we will look into viewing the data and obtaining summary statistics graphical displays and histograms Right now let s click the Model button that brings up the Model Setup dialog 84 Chapter 4 Classification Trees Model Setup Advanced Model Penalty Battery Categorical Force Split Constraints Testing Select Cases Best Tree Method Variable Selection Save Grove CART Combine Variable Name Target Predictor Categorical Weight Aux Tree Type a Classification Regression Unsupervised esis Target Variable Weight Variable Sort File Order Y Number of Predictors 15 The dialog offers 13 tabs that allow you to control all details governing the modeling process Fortunately you can set up a model with as few as two mouse clicks The options are there only for those who need them Here is a brief description of each tab Model Categorical Force Split Constraints Testing Select Cases Best Tree Method Cost Priors Penalty Advanced Battery identifies target variable and select predictors notes which numeric predictors are categorical unordere
310. ith many hundreds of data sets in widely different subject matters we have still seen the Gini rule to be an excellent choice Further there is often only a small difference in performance among the rules However there will be circumstances in which the performance between say the Gini and Entropy is quite substantial and we have worked on problems where using the Twoing rule has been the only way to obtain satisfactory results Accuracy is not the only consideration people weigh when deciding on which model to use Simplicity and comprehensibility can also be important While the Gini might give you the most accurate tree the Twoing rule might tell a more persuasive story or yield a smaller 105 Chapter 4 Classification Trees although slightly less accurate tree Our advice is to not be shy about trying out the different rules and settings available on the Method tab Penaty Battey Constraints Testing SelectCases BestTiee Method Costs Priors Model Categorical Force Spit _ Select Spitting Method Classification Trees I Use Linear Combinations for Splitting e Favor Even Spits Less Symmetic Gini C Entropy z ao C Class Probability G AE woing 2 More Gini Least Absolute Deviation Save Grove CART Combine Here are some brief remarks on different splitting rules Gini This default rule often works well across a broad range of problems Gini
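For reference, the impurity functions behind the Gini and Entropy rules are simple to write down. Here is a short Python sketch for a node's class proportions (illustrative only; CART evaluates these internally when scoring candidate splits):

```python
import math

def gini(p):
    """Gini impurity of a vector of class proportions."""
    return 1.0 - sum(pi * pi for pi in p)

def entropy(p):
    """Entropy impurity (in bits) of a vector of class proportions."""
    return 0.0 - sum(pi * math.log2(pi) for pi in p if pi > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0  (maximally impure 2-class node)
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0  (pure node)
```

A split's improvement is the parent's impurity minus the case-weighted average of the children's impurities, which is why the two rules can rank candidate splitters differently on the same node.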
l, only the first 32 characters of a class name are used, and some text reports use fewer due to space limitations. In our example, we specify the following class names for the target variable LOW and the predictor UI (uterine irritability). These labels then will appear in the tree diagrams, the CART text output, and most displays. The setup dialog appears as follows:

[Screenshot: the Categorical Variable Class Names dialog, showing class names for LOW: level 0 = "Birth Weight > 2.5 kg", level 1 = "Birth Weight < 2.5 kg", with Add and Delete buttons and an Auto add check box.]

GUI CART users who use class names extensively should consider defining them with commands in a command file and submitting the command file from the CART notepad once the dataset has been opened. The CLASS commands must be given before the model is built.

If you use the GUI to define class names and wish to reuse the class names in a future session, save the command log before exiting CART. Cut and paste the CLASS commands appearing in the command log into a new command file.

Command-line users will use the following command syntax to define class names:

CLASS <variable> <value1> "<label1>" <value2> "<label2>" etc.

CLASS LOW 0 "Birth Weight > 2.5 kg" 1 "Birth Weight < 2.5 kg"
CLASS UI 0 "No" 1 "Yes"

94 Chapter 4 Classification Trees

You can add labels to the target variable AFTER a tree is grown, bu
312. l can be translated into one of the supported languages SAS compatible C Java and PMML or into the classic text output The translation operation is very similar to scoring and requires a grove file As with scoring you may either use a separate grove file or the one embedded into a navigator Translating Using Navigator with Embedded Grove File The Navigator window must be open and active 1 2 Press the Translate button Enter the relevant information into the Model Translation window described below Press OK to activate the translation process The Grove File portion of the Model Translation window will contain your navigator file name this means that the embedded grove file will be used for translation You do not have to change this unless you want an external grove file to be used instead 181 Chapter 7 Scoring and Translating This mode is not available with older navigators or navigators that were saved without model information Translating Using Grove File Only If you have a grove file you would like to use for translation do the following steps 1 Make sure the CART Output window is active Choose Translate Model from the Model menu w Both these steps can be replaced by simply clicking the toolbar icon Enter relevant information into the Model Translation dialog including the name of the grove file in the Grove File section Press OK to activate the scoring process Mo
functions can take several variables as arguments and automatically adjust for missing values. Only numeric variables may be used as arguments. The general form of the function is FUNCTION(variable, variable, ...). Integrated BASIC also includes a collection of probability functions that can be used to determine probabilities and confidence-level critical values, and to generate random numbers.

Multiple-Argument Functions

  Function   Definition                 Example
  AVG        arithmetic mean            LET XMEAN = AVG(X1, X2, X3)
  MAX        maximum                    LET BEST = MAX(Y1, Y2, Y3, Y4, Y5)
  MIN        minimum                    LET MINCOST = MIN(PRICE1, OLDPRICE)
  MIS        number of missing values
  STD        standard deviation
  SUM        summation

Single-Argument Functions

  Function   Definition                 Example
  ABS        absolute value             LET ABSVAL = ABS(X)
  ACS        arc cosine
  ASN        arc sine
  ATH        arc hyperbolic tangent
  ATN        arc tangent
  COS        cosine
  EXP        exponential
  LOG        natural logarithm          LET LOGXY = LOG(X/Y)
  SIN        sine
  SQR        square root                LET PRICESR = SQR(PRICE)
  TAN        tangent

The following shows the distributions and any parameters that are needed to obtain values for the random draw, the cumulative distribution, the density function, or the inverse density function. Every function name is composed of three letters.

Key Letter: the first letter identifies the distribution.

  Distribution Type   Letters
  random number       RN
  cumulative          CF
  density             DF
  inverse             IF

Appendix IV: BASIC Programming Language

BASIC Probability Functions

CART BA
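The automatic missing-value adjustment of the multiple-argument functions can be mimicked outside Integrated BASIC. The Python sketch below is an illustration only; the names `MISSING`, `avg`, `mis`, and `std` are ours, not part of CART. Missing arguments are dropped before computing, the way AVG, MIS, and STD are described above:

```python
import math

MISSING = None  # stand-in for Integrated BASIC's missing-value code


def _present(args):
    """Drop missing values, mirroring the automatic adjustment."""
    return [a for a in args if a is not MISSING]


def avg(*args):
    """Arithmetic mean over the non-missing arguments."""
    vals = _present(args)
    return sum(vals) / len(vals) if vals else MISSING


def mis(*args):
    """Number of missing values among the arguments."""
    return len(args) - len(_present(args))


def std(*args):
    """Sample standard deviation over the non-missing arguments."""
    vals = _present(args)
    if len(vals) < 2:
        return MISSING
    m = sum(vals) / len(vals)
    return math.sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - 1))
```

For example, `avg(1, 2, MISSING, 3)` averages only the three present values and returns 2.0.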
314. l of your work with the CART application x The number of session command logs that can be saved to the CART temporary files folder has no limit Chapter Classification Trees A Biomedical Example 82 Chapter 4 Classification Trees Building Classification Trees We start by walking through a simple classification problem taken from the biomedical literature The topic is low birth weight of newborns The task is to understand the primary factors leading to a baby being born significantly under weight The topic is considered important by public health researchers because low birth weight babies can impose significant burdens and costs on the healthcare system A cutoff of 2500 grams is typically used to define a low birth weight baby Begin by looking for the HOSLEM CSV data file that should be located in your Sample Data folder The CART installer normally creates a Sample Data directory for you under your CART 6 0 directory If you cannot locate the file you may need to rerun the installer requesting that it install only the sample data files Using the File Open gt Data File menu selections you should see a screen something like the following Open Data File Look in E Sample Data z e cf Be More Samples MoreSamplesReadMe txt F Boston csv RQ nis csv SS oymexama csv FQ sample csv R cymtutor csv RQ Hoslem csv B tris csv Filename Files of type FASCII Delimited csv dat txt
315. lative Cost curve color coded Relative Cost curve e percent population by node display The first two displays show the relative cost curve depending on the number of terminal nodes while the last display reports how the original data set is distributed into the terminal nodes in the currently selected tree rN If you click on an individual bar in the percent population by node display the corresponding node in the tree topology becomes yellow Pressing on the Smaller or Larger button causes the scale of the tree topology in the top half of the navigator window to become larger or smaller This is useful when analyzing very large trees When applicable you may switch between learn or test counts displayed for each node by pressing the Learn or Test button Because cross validation was used in this example only learn counts are available on the node by node basis You can also save the Navigator or Grove file needed for scoring by pressing the Grove button or you may translate CART models into SAS C PMML or Java representations by clicking the Translate button Finally you may apply any tree to data using the Score dialog accessed via the Score button See Chapter 7 for step by step instructions for scoring new data Viewing Variable Splits By hovering the mouse pointer over a non terminal green node you initially see terse information about the split as illustrated below 238
relative to the number of records in the partition (node), the HCC option is used. Consider the expression:

  ratio = log2(N records in node) / (N categories - 1)

The HCC option weights the improvement of primary splitters and all competitors by the following function:

  improvement = improvement * factor

in which factor = 1.0 if ratio >= 1.0, and factor = 1 - xh1 + xh1 * ratio^xh2 if ratio < 1.0. If xh1 and xh2 are set to values that result in taking a root of a negative number, or that result in improvement < 0, improvement is set to 0. If improvement > 1, it is set to 1.

By default, improvement penalties are applied to surrogates in the same way that they are applied to competitors. To disable penalties for surrogates, use the command PENALTY SURROGATE=NO. Variable groups may be used in the PENALTY command similarly to variable names.

The default values are MISSING=1.0,0.0, HCC=1.0,0.0, and SURROGATE=YES.

Examples:

  PENALTY NFAMMEM=.75, TANNING=.25 / MISSING=0.50,0.75, HLC=1.00,3.75

PRIORS

Purpose: The PRIORS command specifies prior class probabilities for classification trees. The command syntax is:

  PRIORS DATA | LEARN | TEST | EQUAL | MIX | SPECIFY <class1>=<x1>, <class2>=<x2>, ...

in which <x1>, <x2>, ... is a vector of real numbers. The options set prior class probabilities as follows:

  DATA    priors match
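As a numeric illustration of the weighting function, here is a Python sketch. The helper name `hcc_factor` is ours, and we default xh1 and xh2 to 1.0 purely for illustration; CART's internal handling of invalid settings is more detailed than the clipping shown here:

```python
import math


def hcc_factor(n_records, n_categories, xh1=1.0, xh2=1.0):
    """Penalty factor applied to a categorical splitter's improvement
    when the variable has many levels relative to the node size."""
    ratio = math.log2(n_records) / (n_categories - 1)
    if ratio >= 1.0:
        return 1.0  # enough records per level: no penalty
    factor = 1.0 - xh1 + xh1 * math.pow(ratio, xh2)
    # The reference text clips the weighted improvement into [0, 1];
    # we apply the same clipping to the factor itself.
    return max(0.0, min(1.0, factor))
```

For a node of 1,024 records and a 4-level splitter, ratio = 10/3 >= 1 and no penalty applies; for 8 records and a 16-level splitter, ratio = 0.2 and the improvement is scaled down to 20% of its raw value.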
CART BASICS

To save the Command Log, select Open Command Log from the View menu (or press the Command Log toolbar icon), then select Save from the File menu. Specify a directory and the name of the command file; it is saved by default with a .CMD extension.

The commands can also help accelerate your work. Once you have set up a model with controls that work well for your data, you can use saved, edited command logs to instantly recreate your working setup. This way you can guarantee that you are including exactly the same list of predictors as you used previously and that you are using your preferred controls. See Chapters 12 and 13 for more about the CART command log and running CART in batch mode. See also the Appendix for a quick reference to the command-line menu equivalents.

CART automatically logs every command associated with your session and automatically saves it to a dedicated file in your CART temporary folder (specified in Edit > Options > Directories). This file will be saved even if your computer crashes for any reason; in the worst-case scenario it will be missing only your last command.

The name of this file starts with CART, followed by the month and day, then the hour (military convention, 0-23), minutes, and seconds, followed by two underscores. For example, CART1101173521__.TXT refers to the CART session that finished on November 1st at 5:35:21 pm. This serves as a complete audit trai
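The naming convention can be decoded mechanically. Here is a small Python sketch; the function name is ours, and the pattern simply transcribes the CART<MMDDhhmmss>__.TXT convention described above:

```python
import re


def parse_session_log_name(filename):
    """Decode a CART<MMDDhhmmss>__.TXT session-log name.
    Returns (month, day, hour, minute, second)."""
    m = re.fullmatch(
        r"CART(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})__\.TXT",
        filename,
        re.IGNORECASE,
    )
    if not m:
        raise ValueError("not a CART session-log name: " + filename)
    return tuple(int(g) for g in m.groups())
```

For the manual's example, `parse_session_log_name("CART1101173521__.TXT")` returns `(11, 1, 17, 35, 21)`, i.e., November 1 at 17:35:21.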
318. le model are not produced for certain batteries You can disable this output for all batteries with BATTERY QUIET YES produce it with BATTERY QUIET NO or allow the program to decide what output is presented with BATTERY QUIET AUTO BATTERY PROXIMITY YES NO Indicates whether a proximity matrix report should be produced for the battery By default it is produced for BATTERY TARGET only but it is possible to produce this report for all batteries 333 Appendix III Command Reference BATTERY PF lt filename gt Saves the proximity matrix to a text comma separated file BATTERY SAMPLE Will result in a series of five models in which the learn sample is reduced randomly four times to examine the effect of learn sample size on error rate BATTERY DRAW lt proportion gt lt nreps gt CART EX Pro only Runs a series of models in which the learn sample is repeatedly drawn without replacement from the main learn sample The test sample is not altered The proportion to be drawn in the range 0 to 1 exclusive and the number of repetitions are specified e g BATTERY DRAW 0 25 20 will repeat the model 20 times each with a random 25 draw of the available learning data BATTERY SUB SAMPLE Varies the sample size that is used at each node to determine competitor and surrogate splits The default values used are 100 250 500 1000 5000 and no sub sampling You may list a set of values with the VALUES option as well as
319. learn sample observations in the node being split before any sub sampling is done rather than the depth 351 Appendix Ill Command Reference MORE N Variable will not be used if the node has N or more records FEWER M Variable will not be used if the node has M or fewer records The DISALLOW command is cumulative To reset all DISALLOW specifications i e to return to the default issue the empty command DISALLOW Variable groups may be used in the DISALLOW command in the same manner as individual variable names Examples DISALLOW SEGMENT ABOVE 3 DISALLOW REVMI ABOVE 1 SPLIT DISALLOW CODES ABOVE 3 SURROGATE DISALLOW OHIGHT BELOW 2 DISALLOW CODES BELOW 2 ABOVE 3 DISALLOW CODES FEWER 1000 352 Appendix III Command Reference ERROR Purpose The ERROR command specifies the method used to measure true regression error and misclassification rates The command syntax is ERROR CROSS lt n var gt EXPLORATORY PROPORTION lt x gt lt y gt SEPVAR lt vdr gt FILE lt filename gt lt x gt is between 0 and 1 lt n gt is an integer lt var gt is a variable and lt filename gt is any valid file CROSS V fold cross validation You may indicate a number of CV cycles in which case binning is carried out randomly while balancing on the target classes or you may specify a variable for which each distinct value defines a CV bin EXPLORATORY No independent testing resubsti
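For illustration, the balanced random assignment of records to CV bins that ERROR CROSS performs when given a number of cycles might look like the Python sketch below. This is our own naming and a plausible implementation, not CART's actual code; each target class is shuffled and dealt round-robin across the V bins so the class mix is balanced:

```python
import random
from collections import defaultdict


def assign_cv_bins(targets, v=10, seed=0):
    """Randomly assign records to V cross-validation bins while
    balancing on the target classes (stratified assignment)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, t in enumerate(targets):
        by_class[t].append(i)
    bins = [0] * len(targets)
    for idxs in by_class.values():
        rng.shuffle(idxs)               # randomize within each class
        for j, i in enumerate(idxs):
            bins[i] = j % v             # deal round-robin across bins
    return bins
```

With 50 records of each of two classes and v=10, every bin receives exactly 5 records of each class.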
Twoing

The major difference between Twoing and the other splitting rules is that Twoing tends to produce splits that are more balanced in size. Twoing has a built-in penalty that makes it avoid unequal splits, whereas the other rules do not take split balance into account when searching for the best split. A Gini or Entropy tree could easily produce 90/10 splits, whereas Twoing will tend to produce 50/50 splits. The differences between Twoing and the other rules become more evident when modeling multi-class targets with more than two levels. For example, if you were modeling segment membership for an eight-way segmentation, the Twoing and Gini rules would probably yield very different trees and performances.

Ordered Twoing

The Ordered Twoing rule is useful when your target levels are ordered classes. For example, you might have customer satisfaction scores ranging from 1 to 5, and in your analysis you want to think of each score as a separate class rather than as a simple score to be predicted by a regression. If you were to use the Gini rule, CART would treat the numbers 1, 2, 3, 4, and 5 as arbitrary labels without any numeric significance. When you request Ordered Twoing, you are telling CART that a 4 is more similar to a 5 than it is to a 1. You can think of Ordered Twoing as developing a model that is somewhere between a classification and a regression. Ordered Twoing works by
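The contrast between the two criteria can be sketched numerically. The following Python fragment is an illustration only (the function names are ours, and CART's internal implementation differs); it computes the Gini improvement and the textbook Twoing criterion for a candidate split, where `left` and `right` are per-class record counts:

```python
def gini(counts):
    """Gini impurity of a node given per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def gini_improvement(left, right):
    """Decrease in Gini impurity produced by a candidate split."""
    parent = [l + r for l, r in zip(left, right)]
    n, nl, nr = sum(parent), sum(left), sum(right)
    return gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)


def twoing(left, right):
    """Twoing criterion; the leading pL*pR/4 term penalizes
    unbalanced splits, which plain Gini does not."""
    nl, nr = sum(left), sum(right)
    n = nl + nr
    pl, pr = nl / n, nr / n
    diff = sum(abs(l / nl - r / nr) for l, r in zip(left, right))
    return (pl * pr / 4.0) * diff ** 2
```

On a perfectly balanced pure two-class split, e.g. `left=[50, 0]` and `right=[0, 50]`, the Gini improvement is 0.5 and the Twoing score is 0.25; making the same pure separation 90/10 shrinks the pL*pR term and with it the Twoing score, while Gini is indifferent to the imbalance per se.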
321. lt is 0 0 no penalty trees grow unlimited Number of competing splits reported for each node Default 5 Number of competing splits printed for each node in the classic text output Defaults to the COMPETITORS option Number of trees reported in tree sequence summary Default 10 Forecast of the number of splits primary and surrogate on categorical variables in maximal tree This value is automatically estimated by CART but may be overridden lt n1 gt is the maximum number of surrogates to store in the binary tree file and to compute variable importance Default 5 lt n2 gt is the number of surrogates to report for each node and is set equal to lt n1 gt if not specified 335 SCALED NCLASSES CVLEARN PAGEBREAK NODEBREAK COPIOUS BRIEF OPTIONS Appendix III Command Reference Indicates the complexity specified IS NOT relative Any complexity specified as greater than 1 0 is considered scaled and the SCALED option is not required For classification problems in which the number of dependent levels is greater than two NCLASSES specifies the maximum number of classes allowed for an independent categorical variable for an exhaustive split search For independent categorical variables with more levels special high level categorical algorithms are used see the HLC option Depending on the platform for classification problems NCLASSES greater than 10 20 can result in significant increases in compu
influencing the selection of the best ("optimal") tree. Default Best Tree settings: the minimum-cost tree regardless of size; all surrogates count equally; five surrogates used to construct the tree.

[Model Setup dialog: Advanced, Costs, Priors, Penalty, Battery, Model, Categorical, Force Split, Constraints, Testing, Select Cases, Best Tree, and Method tabs; panel "Parameters Influencing Selection of Best Tree"]

Standard Error Rule

The standard error rule, the parameter CART uses to select the optimal tree following testing, is specified in the Best Tree tab. The default setting is the minimum-cost tree regardless of size, that is, the tree that is most accurate given the specified testing method. In certain situations you may wish to trade a more accurate tree for a smaller tree by selecting the smallest tree within one standard error of the minimum-cost tree, or by setting the standard error parameter equal to any nonnegative value.

The primary use of the standard error rule is for processing many models in batch mode, or when you do not expect to be able to inspect each model individually. In such circumstances you will want to give some thought to specifying how the best model should be selected automatically. If you are examining each model visually on screen, then the best tree definition is not that important, as you can readily select another tree interactively on screen.

Variable
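The selection logic can be sketched as follows. This is illustrative Python under our own representation (each pruned tree is a `(terminal_nodes, cost, standard_error)` triple), not CART's implementation:

```python
def select_tree(trees, se_rule=1.0):
    """Pick the smallest tree whose cost is within `se_rule` standard
    errors of the minimum-cost tree.  Each tree is a tuple
    (n_terminal_nodes, cost, standard_error).  se_rule=0 reproduces
    the default minimum-cost selection."""
    _, best_cost, best_se = min(trees, key=lambda t: t[1])
    threshold = best_cost + se_rule * best_se
    candidates = [t for t in trees if t[1] <= threshold]
    return min(candidates, key=lambda t: t[0])  # smallest qualifying tree
```

With a pruning sequence `[(10, 0.25, 0.03), (5, 0.27, 0.03), (2, 0.40, 0.04)]`, the 1-SE rule trades the 10-node minimum-cost tree for the 5-node tree, whose cost 0.27 is within 0.25 + 0.03.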
simply ignoring any data that comes after the allowed records. This is useful when you have very large files and want to explore models based on a small portion of the initial data. The control allows for faster processing of the data because the entire data file is never read.

Test Sample Size

The TEST setting is similar to LEARN: it limits the test sample to no more than the specified number of records for testing. The test records are taken on a first-come, first-served basis from the beginning of the file. Once the TEST limit is reached, no additional test data are processed.

Sub-sample Size

Node sub-sampling is an interesting approach to handling very large data sets and also serves as a vehicle for exploring model sensitivity to sampling variation. Although node sub-sampling was introduced in the first release of the CART mainframe software in 1987, we have not found any discussion of the topic in the scientific literature, so we offer a brief discussion here.

Node sub-sampling is a special form of sampling that is triggered for special purposes during the construction of the tree. In node sub-sampling the analysis data are not sampled; instead, we work with the complete analysis data set. When node sub-sampling is turned on, we conduct the process of searching for a best splitter for a node on a subsample of the data in the node. For example, suppose our analysis data set contained 100,000 records and our
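A minimal sketch of the idea in Python (the function name is ours; CART's actual trigger and sampling details are internal): only the split search is fed a subsample, while small nodes are used in full and the complete data still flows down the tree.

```python
import random


def node_split_sample(node_records, subsample_size, seed=0):
    """Return the records used to SEARCH for the best splitter in a
    node.  When the node holds no more records than subsample_size,
    it is used in full; otherwise a random subsample is drawn.  The
    full node population still passes down whichever split is chosen."""
    node_records = list(node_records)
    if len(node_records) <= subsample_size:
        return node_records
    return random.Random(seed).sample(node_records, subsample_size)
```

With a 100,000-record node and a sub-sample size of 5,000, the splitter search examines only 5,000 records, a fixed seed keeping the draw reproducible.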
324. making splits that tend to keep the different levels of the target together in a natural way Thus we would favor a split that put the 1 and 2 levels together on one side of the tree and we would want to avoid splits that placed the 1 and 5 levels together Remember that the other splitting rules would not care at all which levels were grouped together because they ignore the numeric significance of the class label As always you can never be sure which method will work best We have seen naturally ordered targets that were better modeled with the Gini method You will need to experiment w Ordered Twoing works best with targets with numeric levels When a target is a character variable the ordering conducted by CART might not be to your liking See the command reference manual section on the DISCRETE command for more useful information Favor Even Splits The favor even splits control is also on the Method tab and offers an important way to modify the action of the splitting rules By default the setting is 0 which indicates no bias in favor of even or uneven splits In the display below we have set the splitting rule to Twoing and the favor even splits setting to 1 00 107 Chapter 4 Classification Trees Favor Even Splits Less The favor even splits control is set by the POWER parameter in the command language For example the command METHOD TWOING POWER 1 is how we woul
325. mand syntax LIMIT NODES lt N gt DEPTH lt N gt Setting Limits Using Model Setup Advanced tab Alternative methods to limit the growth of a tree can be found in the Model Setup Advanced tab We are displaying the relevant portions of the Advanced tab as follows Tree Size Maximum number of nodes AUTO Depth auto Sample Sizes Learn Sample Size 293 Test Sample Size 293 Subsample Size 293 The parameter table displayed in the middle panel is a guide to tailoring the problem to the available resources The easily adjustable parameters listed in the first column of the table are defined below Maximum Nodes Forecast of the number of terminal nodes in the maximal tree Depth Forecast of the depth of the maximal tree Learn Sample Size Number of cases in the learn data set Test Sample Size Number of cases in the test data set Sub Sampling Node size above which a random sub sample v the full sample is used to locate splits default learn sample size To manually set any one parameter individually or any combination enter a value into the corresponding text box vx You can save the values entered in the Model Setup Advanced tab by clicking the Defaults button 288 Chapter 12 Features and Options Report Writer CART includes Report Writer a report generator word processor and text editor that allows you to construct custom reports from results diagrams tables and graphs as well
326. matically realize that you want to grow a Classification tree But when the target variable is numeric you do have the choice of growing a classification or regression tree and you may need to correct the selection indicated on the Model Setup dialog This is the heart of the Model Setup dialog 86 Chapter 4 Classification Trees Target Variable Selection The target variable is specified by checking off ONE variable in the target column of the Model Setup Model tab Locate the row with LOW as Variable Name and put a checkmark in the Target column Model Setup Advanced Costs Priors Penalty Battery Model Categorical Force Split Constraints Testing Select Cases Best Tree Method Variable Selection Tree Type Variable Name Target Predictor Categorical Weight Aux a Classification J Regression Unsupervised Set Focus Class Target Variable LOW Weight Variable Sort File Order Number of Predictors 14 Save Grove CART Combine Score Cancel Continue Start After the target has been checked the Model tab switches from red to black indicating that CART is ready to start an analysis according to the default settings Specifying Tree Type CART uses a set of Tree Type radio buttons to determine if the tree grown will be a classification tree or a regression tree The difference between the two tree types is simple
number of misclassified cases in the class.

  PCT ERROR   Percent of cases misclassified.
  COST        Fraction of cases misclassified, multiplied by the cost assigned for misclassification.

In our example we can see that the misclassification errors were about 19% for the learn sample and 25% for the cross-validated test results. This tab is primarily useful when working with many target classes.

[Navigator: Tree Summary Reports, Misclassification tab, listing N Cases, N Mis-Classed, and Pct Error for the learning and test samples by class]

Prediction Success, or Confusion Matrix

The confusion matrix is a standard summary for classifiers of all kinds and has been used to assess statistical models such as logistic regression as well as more exotic data mining models. We call it the Prediction Success table, following Nobel Prize-winning economist Daniel McFadden's 1979 paper on the subject. The table is a simple report cross-classifying true class membership against the predictions of the model. The table for our 10-node tree follows:

[Navigator: Tree Summary Reports, Prediction Success tab for the learning sample]
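The prediction-success computation is easy to reproduce outside CART. The sketch below (Python; the function name and dictionary keys are ours) builds the cross-classification table and the per-class N Cases, N Mis-Classed, and Pct Error columns of the report:

```python
def prediction_success(actual, predicted, classes):
    """Cross-classify true class membership against model predictions
    and report per-class misclassification statistics."""
    table = {(a, p): 0 for a in classes for p in classes}
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    report = {}
    for c in classes:
        n = sum(table[(c, p)] for p in classes)   # N Cases in class c
        n_mis = n - table[(c, c)]                 # off-diagonal = mistakes
        report[c] = {
            "N": n,
            "N_mis": n_mis,
            "pct_error": 100.0 * n_mis / n if n else 0.0,
        }
    return table, report
```

For instance, with actual classes `[0, 0, 0, 1, 1]` and predictions `[0, 1, 0, 1, 1]`, class 0 has 3 cases, 1 misclassified, for a 33.3% error, and class 1 has none misclassified.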
328. mbers 306 Chapter13 Working with Command Language If you have already mastered the classification run described in the previous section note that the only differences are The requested output file names have been changed in lines 2 and 3 The MODEL command line 8 now uses a continuous target The CATEGORY command line 9 no longer lists our target The PRIORS and Misclassify commands are no longer needed The METHOD is changed to LS least squares line 12 gt gt gt a A detailed description of each command in this command file is provided below Commands 1 through 3 control which files will be used or created during this run 1 gt gt The USE command specifies the data set to be used in modeling CART has built in support for comma separated ASCII files The GROVE command specifies the binary grove file to be created in the current directory This file will contain detailed model information and will be needed for the scoring and translating described later This binary file is needed to view trees and model results from inside the CART GUI It includes complete information about the model building process including pruning sequences and multiple collections of trees when applicable 2 gt gt The OUTPUT command specifies the classic output file This text file will report basic information about the data the model building process and the optimal tree The content of this file which is co
name and directory. When the model is finished, both the grove file and the embedded navigator will be saved. The above procedure is equivalent to placing the GROVE <file_name.grv> command before the BUILD command in your command file. The default target folder for grove files can be set in the Output Files > Model Information section of the Options Directories tab (select the Edit > Options menu).

Converting a Tree File to a Grove File

The earliest versions of CART stored model information in a tree file (extension .tr1). Tree files had the severe limitation of containing information on a single tree only, usually the optimal tree. For backward compatibility, we have added a command that allows you to translate any .tr1 file into a grove file. Of course, the resulting grove file still has only one tree. To translate an old tree file, old_tree.tr1, into a grove file, old_tree.grv, use the following command syntax:

  GROVE old_tree.grv IMPORT old_tree.tr1

Scoring CART Models

Scoring will differ depending on whether you are working with a grove file or a grove file embedded in a navigator created with CART 5 or CART 4.

Scoring Using a CART 5 Navigator with an Embedded Grove File

The navigator window must be open and active.

1. Press the Score button.
2. Enter the relevant information into the Score Data window, described below.
3. Pr
statement is:

  FOR
    statements
  NEXT

For example, you might execute a block of statements only if a condition is true, as in:

  %IF WINE = "COUNTRY" THEN %FOR
    %LET FIRST = "CABERNET"
    %LET SECOND = "RIESLING"
  %NEXT

When an index variable is specified on the FOR statement, the statements between the FOR and NEXT statements are looped through repeatedly while the index variable remains between its lower and upper bounds:

  FOR <index variable and limits>
    statements
  NEXT

The index variable and limits form is:

  FOR I = start_number TO stop_number STEP stepsize

where I is an integer index variable that is increased from start_number to stop_number in increments of stepsize. The statements in the block are processed first with I = start_number, then with I = start_number + stepsize, and so on until I > stop_number. If STEP stepsize is omitted, the default is to step by 1. Nested FOR...NEXT loops are not allowed.

DIM

Creates an array of subscripted variables. For example, a set of five scores could be set up with:

  DIM SCORE(5)

This creates the variables SCORE(1), SCORE(2), ..., SCORE(5). The size of the array must be specified with a literal integer up to a maximum size of 99; variable names may not be used. You can use more than one DIM statement, but be careful not to create so many large arrays that you exceed the maximum number of variables allowed (currently 8,019).
331. mmary Reports dialog tabs To access the reports click the Summary Reports button at the bottom of the Navigator window or select Tree Summary Reports from the Tree menu 152 Chapter 5 Regression Trees R CAR 2 a Profit ARTS eat we The Profit tab provides a useful model summary in terms of the profit associated with each node It is assumed that each record in a dataset is associated with a certain continuous amount of profit This information is either represented by the continuous target itself in which case the profit value is the actual target of modeling or by any other continuous variable present in the dataset cross evaluation of model 4 Navigator 1 18 Tree Summary Reports Root Splits Terminal Nodes Variable Importance Mean by Node Profit Chart for M Target Cum Profit Leam Cum Ave Cases Profit Leam Leam Profit Ave Profit Node Leam Leam 18 1 352 90 4510 135290 4510 810 70 3525 216360 40 82 Average Profit 666 50 3 461 40 c pe 181811725 367 48 9 111013121415 602 10 4 063 50 Terminal Node 601 70 4 665 20 ai Profit Variable Profit fave Profit Cum Profit Cum Ave Profit ine erat MV Target Zoom Chart Type Default Sort Order Optimal x _Bar Dot Line Profile Ave Proft Leam First choose the Profit Variable carrying information about the profit associated with each record in
332. mpare learn and test node counts and percentages Simply point the mouse to the node of interest right click and choose the Compare Learn Test menu item The resulting window displays the learn and test counts and percentages by each target class 4 Navigator 1 11 Node 3 Learn Test Compare DER Leam N Leam 4il 36 61 C a 71 63 39 112 100 00 x When cross validation trees or exploratory trees are used only the learn counts are available for obvious reasons 140 Chapter 4 Classification Trees Saving Navigator Files CART allows you to save the Navigator to a file and then later reload it To save a Navigator file also Known as the Grove bring the Navigator window to the foreground and select Save gt Save Grove from the File menu In the Save As dialog box click on the File name text box to change the default file name The file extension is GRV and should not be changed Select the directory in which the Navigator file should be saved and click on Save Save in Examples File name GYMTUTOR grv Save as type Grove grv Cancel To open a Navigator file you have previously saved select Open gt Open Grove from the File menu In the Open Grove File dialog box specify the name and directory location of the navigator file and click on Open a CART 6 is backwards compatible with the previous navigator file formats nav nv2 nv3 However opening older ve
333. n 01479 Zoomed J Median 0 1849 Mean 0 1766 Max 0 1851 Chart Type Bar J Line Rel Error Test Rel Error Battery Types Classification Battery Models ATOM Model Opt Terminal Name Nodes Tree 1 Tiee 2 Rel Error MinChild Tree 4 Tree 5 0 1849 Tree 6 0 1849 Tree 0 1849 Show Min Error 0 1849 Model Quality Sample Model Size Save Grove Misclass _roc Test _Leam MinCost_1SE The relative error of the optimal model obtained in each run is shown on the upper graph within the Battery Summary Models tab You can view the graph in the Line or Bar styles Zoomed or All Models as well as Rel Error or Nodes for the Y axis When the Show Min Error button is pressed the model having the smallest relative error is highlighted in green Model performance can be viewed in terms of relative error Misclass button or the average area under the ROC curve ROC button Furthermore it can be presented on the test data default Test button or on the train data Learn button It is also possible to switch between optimal minimum cost trees Min Cost button and 1 SE trees 1 SE button A 1 SE tree is defined as the smallest tree with the relative cost within one standard deviation of the optimal smallest relative cost tree The Classification Battery Models section in the lower half contains a tabular de
Table of Contents

................................................................... 247
Variable Importance ............................................... 248
Misclassification ................................................. 249
Prediction Success ................................................ 250
Detailed Node Reports ............................................. 251
Terminal Node Report .............................................. 256
Saving the Grove File ............................................. 257
CART Text Output .................................................. 257
Displaying and Exporting Tree Rules ............................... 258
Scoring Data ...................................................... 259
New Analysis ...................................................... 261
Saving Command Log ................................................ 261

FEATURES AND OPTIONS .............................................. 263
Features and Options .............................................. 264
Unsupervised Learning and Cluster Analysis ........................ 264
The Fo
335. ncertainty For the next displays we will work with the 1SE tree A tree of a specific size can be selected in several ways gt Viewing the Main Tree use the mouse to click on a blue box in the error profile use the left and right arrow keys to reach a specific tree click the Grow or Prune buttons on the right side of the navigator from the Tree menu select a tree or list of trees The Tree Details button on the navigator brings up an industry standard view of a decision tree This view includes node specific sample breakdowns so that we can see performance throughout the tree Starting with the five node 1SE tree selected click on the Tree Details button at the bottom of the Navigator or right click on the root node and select the Display Tree option to get Navigator 2 Main Tree era E OCCUP_BLANK lt 0 500 1 OCCUP_BLANK gt 0 500 1 Terminal Node 1 Class 0 Class Cases 95 ol Terminal Node 2 Class 1 Class Cases 0 323 1 1 10 769 W 13 000 N 13 N_INQUIRIES gt 1 500 N_INQUIRIES lt 4 500 N INQUIRIES gt 4 500 ni Terminal Node 5 NUMCARDS lt 1 500 Class 1 Class Cases Class Cases 0 103 64 4 0 60 36 1 1 57 35 8 1 106 639 W 160 000 W 166 000 N 160 N 166 NUMCARDS lt 1 600 NUMCARDS gt 1 500 1 1 Terminal Terminal Node 3 Node 4 Class 1 Class 0 Class Cases Class Cases 0 3
336. ndividual run has the same master sequence and resulting variable importance In addition to the actual run profiles you may add average Average button minimal Min button and maximal Max button profiles It is also possible to hide individual profiles using the None button You can switch from the chart view Chart button to the table view Table button In the table view columns represent relative error sequences for each model Optimal trees are highlighted in green while 1 SE trees are highlighted in pink 209 Chapter 10 CART Batteries ea Battery Summary 1 DER Models Contents Error Profiles CvR_1 CVR_2 OR3 CvR4 ORS5 CRE OR7 OR8 cvR9 Cvs Rel Eror Rel Emor Rel Enor Rel Eror Rel Enor Rel Enor Rel Enor Rel Eror Rel Eror Rel 06187 0 6331 06199 06630 0 6603 0 6636 0 6105 0 6375 0 7011 0 4428 0 4501 0 4227 0 4537 0 4511 04575 0 4244 04202 0 4946 0 3533 0 3681 0 3392 0 3719 0 3825 0 3798 0 3407 0 3476 0 3972 0 3277 0 3388 0 3216 0 3207 0 3687 0 3448 0 3379 0 3067 0 3518 0 3140 0 3153 0 3084 0 3118 0 3499 0 3291 0 3204 0 2998 0 3338 0 2821 0 2837 0 2980 0 3011 B 0 3010 0 2836 0 2812 0 3064 0 2827 0 2741 0 2866 0 2801 03195 0 2967 0 2608 0 2801 0 2827 02616 0 2772 0 3201 0 2901 0 2605 0 2604 02768 0 3194 0 2804 0 2542 0 2652 0 2732 0 2596 0 2580 0 3155 0 2799 0 2530 0 2578 12 0 2671 0 2596 0 2607 0 2666 0 3145 0 2783 0 2547 130 2611 0 2526 0 2585 0
[Page Setup dialog: header/footer entry fields, node shape selectors (hexagon, rectangle), and margin settings in inches, with Save as Defaults, Cancel, and Printer buttons.]

The page setup options and their default settings are:

- Node Gaps (0.10): Change the distance between the nodes by increasing or decreasing the horizontal setting, and change the height of the tree branches by increasing or decreasing the vertical setting.
- Orientation (portrait): Choose portrait or landscape.
- Tree Scale (100): Increase or decrease the overall size of the tree.
- Border (thin): Change the width of the page border or select no border.
- Header: Enter text for the header or select from the predefined settings; these include file name, tree name, column, row, and current date and time. Also included here are the alignment options (left, right, center). Note: to include an ampersand in the header, type two ampersands (&&).
- Footer: Replace the default footer text (input file name, page, row, and column) by entering new text or selecting from the predefined settings; these are similar to those for headers (see above).
- Node Shapes (hexagon and rectangle): Change the non-terminal node and terminal node default shapes by clicking the down arrow and selecting an alternative shape.
- Margins (0.50): C
needed, and that makes use of other relevant information in the data. Other trees treat all records with missing values as if the records all had the same unknown value; with that approach all such missings are assigned to the same bin. In CART, each record is processed using data specific to that record, allowing records with different data patterns to be handled differently and resulting in a better characterization of the data. CART 6 also automatically analyzes whether missingness is in itself predictive and will optionally incorporate such findings into the optimal model.

Introducing CART 6.0

Adjustable misclassification penalties help avoid the most costly errors. CART includes cost-sensitive learning, so that models developed by CART can incorporate the seriousness of any mistake. In a binary classification problem we often label the outcomes 0 and 1 and, by default, assume that all classification errors are equally costly. But what if misclassifying a 1 as a 0 (a false negative) is far worse than misclassifying a 0 as a 1 (a false positive)? CART users can specify a higher cost for the more serious mistakes, causing the software to steer the tree away from that type of error. That is, in response to the cost information, CART will actually grow a different tree. The greater the cost of a specific kind of mistake, the more CART will adjust the tree to avoid the high-cost mistakes. Further, when CART cannot guarantee a
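The effect of unequal misclassification costs can be illustrated with a minimal expected-cost calculation. This is a generic sketch, not CART's internal algorithm; the cost matrix and node probabilities are invented:

```python
def min_cost_class(probs, cost):
    """probs: class probabilities in a node; cost[i][j]: cost of predicting j
    when the truth is i. Returns the class assignment with lowest expected cost."""
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return expected.index(min(expected))

probs = [0.7, 0.3]            # node is 70% class 0
costs = [[0, 1], [4, 0]]      # a false negative (1 -> 0) costs 4x a false positive
print(min_cost_class(probs, costs))  # with these costs the node is assigned class 1
```

With equal costs the same node would be labeled class 0; raising the false-negative cost flips the assignment, which is the behavior the text describes.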
A new grove file containing only the harvested trees may be created with the OUTPUT option, for example:

HARVEST SELECT KEEP 5
OUTPUT "justone.grv"

Examples:

USE "gymtutor.csv"
SAVE "testPRED.CSV"
MODEL
GROVE "BUILD_GYMc.GRV"
HARVEST PRUNE TREENUMBER 1
SCORE

Appendix III: Command Reference

HELP

Purpose: The HELP command provides information about CART commands. You can abbreviate the name of the command. The command syntax is:

HELP <command>

Examples:

HELP        (lists commands available for the current procedure)
HELP HELP   (provides information on the HELP command)

HISTOGRAM

Purpose: The HISTOGRAM command produces low-resolution density plots. The command syntax is:

HISTOGRAM <var1> <var2> <var3> ... FULL TICKS GRID WEIGHTED NORMALIZED BIG

The plot is normally a half screen high; the FULL and BIG options will increase it to a full screen (24 lines) or a full page (60 lines). TICKS and GRID add two kinds of horizontal and vertical grids. WEIGHTED requests plots weighted by the WEIGHT command variable. NORMALIZED scales the vertical axis to 0 to 1 (or -1 to 1).

Examples:

HISTOGRAM IQ FULL GRID
HISTOGRAM LEVEL 4 7 NORMALIZED

Only numerical variables may be specified. Variable groups may be used in the HISTOGRAM command similarly to variable names.

IDVAR

Pu
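One simple way to scale histogram bar heights onto a 0-to-1 axis, as the NORMALIZED option does, is division by the tallest bar. This is a generic sketch of the idea, not CART's plotting code, and the exact normalization CART applies is an assumption here:

```python
def normalize_counts(counts):
    """Scale histogram bar heights so the tallest bar is 1.0."""
    peak = max(counts)
    return [c / peak for c in counts]

print(normalize_counts([2, 8, 4]))  # [0.25, 1.0, 0.5]
```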
sorting of the entire pool of terminal nodes. Pressing OK will produce two windows: Hotspot Table and Hotspot Chart. The Hotspot Table window contains the results of the hotspot analysis in tabular form.

[Hotspot window (Target: TARGET, Class: 1, File: Battery Summary 1) with a Nodes lookup table and Edit Spread, Sorting, Filtering, and Chart Details control groups.]

Chapter 9: Hot Spot Detection

The upper Nodes lookup table contains all requested terminal nodes, one line per node, sorted according to learn node richness. The default columns are:

- Tree: unique tree identifier in the current battery.
- Node: unique terminal node identifier in the current tree.
- Learn Sample Count: node size on the train data.
- Test Sample Count: node size on the test data.
- Learn Richness: node richness in the focus class on the train data; table rows are sorted descending on this column.
- Test Richness: node richness in the focus class on the test data.

In addition, the Columns button in the Edit Spread group of controls allows the selective addition of more columns to the table.

[Select Columns dialog listing check boxes for Spot, Tree, Node, Depth, Learn Sample Count, Test Sample Count, Weight Node Learning Count, Weight Node Test Count, Focus Class Learning Count, and more, with Show all, Hide all, Apply, and Cancel buttons.]
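Node richness here is the share of a node's cases that fall in the focus class, and the table ranks nodes by that share on the learn data. A generic sketch of the ranking (field names and counts are invented, not CART's internals):

```python
def rank_nodes(nodes):
    """nodes: list of dicts with focus-class and total learn counts.
    Returns (tree, node, richness) triples sorted by descending learn richness."""
    ranked = [(d["tree"], d["node"], d["focus_learn"] / d["n_learn"]) for d in nodes]
    return sorted(ranked, key=lambda t: t[2], reverse=True)

nodes = [
    {"tree": 1, "node": 3, "focus_learn": 45, "n_learn": 50},
    {"tree": 2, "node": 1, "focus_learn": 30, "n_learn": 60},
]
print(rank_nodes(nodes))  # tree 1 node 3 first: richness 0.90 vs 0.50
```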
pressing the Learn or Test buttons. Below we show the report for the root node. The left child is now clearly dominated by GOODs, and the right child contains an equal number of GOODs and BADs.

[Node report window: Competitors and Surrogates, Root Competitor Splits, Classification tabs; the root splitter is N_INQUIRIES <= 1.5.]

The window offers a choice between bar charts, pie charts, and a horizontal bar chart embedding the sample split. You can switch between counts and percentages by pressing the Cases or Pct buttons.

The horizontal bar chart offers an alternative view of the class partitions. Each colored bar represents one target class. The vertical line shows how the class was partitioned between the two children, with the percentage of the class going to the left child shown on the left side and the percentage going to the right child shown on the right side. In this example, less than 20% of Class 1 went to the left side and more than 80% went to the right side.

CART BASICS

The Root Competitor Splits tab

In the root node a splitter has access to all the data. Thus, we have a special interest in the performance of variables as splitters in the root. This report lists every variable available for splitting and includes this additional information:

- N missing: count of the number of records missing data for this variable.
- N left, N right: count of records going to the left and right children.
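The left/right percentages shown for each class are simple shares of the class counts; the arithmetic can be sketched as follows (the counts below are invented for illustration):

```python
def partition_shares(left_count, right_count):
    """Fraction of a class sent to each child by a split."""
    total = left_count + right_count
    return left_count / total, right_count / total

# e.g., Class 1: 30 cases go left, 173 go right
left, right = partition_shares(30, 173)
print(f"{left:.1%} left, {right:.1%} right")  # under 20% left, over 80% right
```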
no longer supported. To obtain classic output on a tree other than the optimal, you should translate that tree into LANGUAGE CLASSIC (see the Translating section below).

Navigator Files versus Grove Files

Previous versions of CART produced both navigator and grove files. CART 6 combines the two types of information and stores it all in a grove file. A grove file is a binary file that stores all the information about the tree sequence needed to apply any tree from the sequence to new data or to translate the tree into a different presentation. Grove files contain a variety of information, including node information, the optimal tree indicator, and predicted probabilities. Grove files are not limited to storing only one tree sequence, but may contain entire collections of trees obtained as a result of bagging, arcing, or cross validation. The file format is flexible enough to easily accommodate further extensions and exotic tree-related objects such as TreeNet models.

Navigator files, on the other hand, serve the sole purpose of presenting a single tree sequence using the GUI back end, also known as the Navigator window. In the previous chapters, many examples of using navigator displays to analyze trees and present the results were provided.

Chapter 7: Scoring and Translating

To save a CART user the trouble of keeping track of two different files, CART 6 embeds a corresponding navigator file into the grove file whenever the latter is saved
single good borrower and a single defaulter at random from a data set. Our ROC score tells us that we would be able to correctly tell which one was the defaulter in 78.67% of all cases.

If you picked the defaulter at random, you would be right, on average, in 50% of all cases. Therefore, a good model needs to deliver substantially better than an ROC of 50%. In real-world credit risk scoring, an ROC of 70% would be considered respectable.

The predictive performance of a model depends on many factors, including the nature and quality of the data and the inherent predictability of the data under study. You cannot expect every subject matter to support highly accurate models.

The color coding of the terminal nodes is controlled from the pull-down control at the top of the Navigator. For 0/1 target variables, the default coloring uses red to indicate a high concentration of 1s. You can change that if you prefer to have red represent another class instead, and you can also turn off special color coding, leaving all the terminal nodes red.

CART offers many ways to view the tree details and interior. We will start by hovering the mouse over a node. Beginning with the root node at the top of the tree, we note that we started with 461 GOODs (0s) and 203 BADs (1s), for a bad rate of 30.6%.

You can change the detail revealed when you hover your mouse over navigator nodes. Right-click in the gray area of the navi
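The interpretation above, ROC as the probability of correctly ranking a random defaulter above a random good borrower, can be checked directly by counting concordant pairs. A minimal sketch (the scores and labels are invented, not taken from the tutorial data):

```python
def roc_auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative
    (ties count half): the pairwise-ranking reading of the ROC area."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

defaulters = [0.9, 0.8, 0.4]   # model scores for actual BADs
good = [0.7, 0.3, 0.2, 0.1]    # model scores for actual GOODs
print(roc_auc(defaulters, good))  # 11 of 12 pairs ranked correctly, about 0.92
```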
settings (PRIORS EQUAL), applicable to classification trees only. You can change the priors setting by clicking on the new setting's radio button. If you select SPECIFY, you must also enter a value for each level of your target variable: simply highlight the corresponding class and type in the new value.

Only the ratios of priors matter: internally, CART normalizes the specified priors so that the values always sum to one.

Certain combinations of priors may result in a "No Tree Built" situation. This means that, according to this set of priors, having no tree (a trivial model which makes the same class assignment everywhere) is no worse than having a tree. Knowing that your target cannot be predicted from your data can be very valuable, and in some cases is a conclusion you were looking for.

From the command line, use the following syntax:

PRIORS EQUAL
PRIORS DATA
PRIORS MIX
PRIORS LEARN
PRIORS TEST
PRIORS SPECIFY <class1> <value1> <class2> <value2> etc.
PRIORS SPECIFY 0 .25 1 .75

If the target variable contains more than 5000 values, you must use the command line for user-specified priors.

Chapter 4: Classification Trees

The Penalty tab

The penalties available in CART were introduced by Salford Systems starting in 1997 and represent important extensions to decision tree technology. Penalties can be imposed on variables to reflect a reluctance to use a variable as a splitter
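Because only the ratios matter, CART's normalization of user-specified priors can be mimicked directly. This is a generic sketch of the rescaling, not CART's code:

```python
def normalize_priors(priors):
    """Rescale strictly positive prior weights so they sum to one."""
    total = sum(priors.values())
    return {cls: w / total for cls, w in priors.items()}

print(normalize_priors({0: 1, 1: 3}))    # {0: 0.25, 1: 0.75}
print(normalize_priors({0: 25, 1: 75}))  # same ratios -> same result
```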
no effect on quoted file names.

FORMAT

Purpose: The FORMAT command controls the number of digits that are displayed to the right of the decimal point in analysis output. You may select from 1 to 9 digits, or 0 digits, or -1 for no digits and no decimal point. The default is 3. The UNDERFLOW option prints tiny numbers (those that would appear to be zero in the chosen precision) in scientific (exponential) notation. The command syntax is:

FORMAT <#> UNDERFLOW

Examples:

FORMAT 5
FORMAT 0
FORMAT 9 UNDERFLOW   (print tiny numbers with exponents)

GROUP

Purpose: The GROUP command defines variable groups. The command syntax is:

GROUP <groupname> <variable> <variable> ...

Group names are used like variable names in commands that process variable lists, resulting in more compact lists. The following commands set up three groups and use them in the KEEP, CATEGORY, and CLASS commands, along with the variables SEGMENT, AGE, and PROFIT, for a three-level classification tree model:

GROUP DEMOGRAPHICS GENDER RACES REGIONS PARTY EDUCLEV
GROUP CREDITINFO FICO1 FICO2 TRW LOANAMOUNT AUTOPAYMENT MORTGAGEAMOUNT MORTGAGEPAY
GROUP CREDITRANK RANKVER1 RANKVER2 RANKVER3
CATEGORY DEMOGRAPHICS TARGETS SEGMENT CREDITRANK
CLASS CREDITRANK 0 "Not available" 1 "Poor" 2 "Good" 3 "Excellent"
MODEL TARGETS
KEEP DEMOGRAPHICS CREDITINF
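A variable group behaves like a macro that expands into its member variables wherever a variable list is accepted. The effect can be sketched as follows (generic code with invented names, not CART's parser):

```python
def expand_groups(names, groups):
    """Replace any group name in a variable list with its member variables."""
    out = []
    for name in names:
        out.extend(groups.get(name, [name]))  # non-group names pass through
    return out

groups = {"DEMOGRAPHICS": ["GENDER", "REGION", "EDUCLEV"]}
print(expand_groups(["DEMOGRAPHICS", "AGE", "PROFIT"], groups))
```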
patient can be penalized. If penalizing these variables leads to models that are only slightly less predictive, the penalties help physicians to optimize diagnostic procedures.

Note: Setting the penalty to one is equivalent to effectively removing this predictor from the predictor list.

Missing Values Penalty

At every node, every predictor competes to be the primary splitter; the predictor having the best improvement score is selected. Variables with no missing values have their improvement scores computed using all the data in the node, while variables with missings have their improvement scores calculated using only the subset with complete data. Since it is easier to be a good splitter on a small number of records, this tends to give heavily missing variables an advantage. To level the playing field, variables can be penalized in proportion to the degree to which they are missing. This proportion missing is calculated separately at each node in the tree. For example, a variable with good data for only 30% of the records in a node would receive only 30% of its calculated improvement score. In contrast, a variable with good data for 80% of the records in a node would receive 80% of its improvement score. A more complex formula is available for finer control over the missing value penalty, using the Advanced version of the Penalty tab.

[Advanced Penalty tab: Missing Penalty, Fraction of Improvement Kept]
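The proportional rule described above is just a multiplication of the improvement score by the fraction of complete records. A minimal sketch (generic code, not CART's internals; the numbers are invented):

```python
def penalized_improvement(improvement, n_complete, n_node):
    """Scale a splitter's improvement by the fraction of the node's
    records with non-missing values for that variable."""
    return improvement * (n_complete / n_node)

# A variable measured on only 30% of a 200-record node keeps 30% of its score
print(penalized_improvement(0.12, 60, 200))
```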
advantage of the two alternative modes of control in CART (command line and batch) and provides a guide for using these two control modes. For users running CART on a UNIX platform, this chapter contains a detailed guide to command syntax and options, and describes how the Windows version may assist you in learning the command line language.

The following picture illustrates common channels of interaction between a user and CART.

[Diagram: command files (.cmd), source data, and grove files (.grv) flow into the CART engine; the engine produces scored data, grove files (.grv), classic output (.dat), reports (.rtf), plots, and tables.]

First, note that CART itself is a sophisticated analytical engine controlled via command sequences sent to its input; it can generate various pieces of output when requested. An inexperienced user can communicate with the engine via the GUI front and back ends. The GUI front end provides a set of setup screens and knows how to issue the right command sequences according to the user's input. It is also possible to request the GUI front end to save command sequences into an external command file. The GUI back end captures the results produced by the engine and displays various plots, tables, and reports. Most of these can be directly saved to the hard drive for future reference. The whole cycle, marked by the large arrows in the

Chapter 13: Working with Command Language
controlled using the LOPTIONS and FORMAT commands is somewhat limited.

Commands 4 through 7 control various engine settings:

4. The BOPTIONS command sets important model-building options.
5. The LOPTIONS command sets various reporting options.
6. The FORMAT command sets the number of decimal digits to be reported.
7. The LIMIT command sets various limits, including how many data are allowed, the largest tree size allowed, the largest tree depth, the smallest node size allowed, and whether sub-sampling will be used.

For the most part, these commands should be left unchanged unless you need fine control over the CART engine. A more detailed description can be found in Appendix III, Command Reference.

Commands 8 through 16 specify model settings that usually change from run to run:

8. The MODEL command sets the target variable.
9. The CATEGORY command lists all categorical numeric variables. Character variables are always treated as categorical and need not be listed here. In regression runs, the target is always a continuous numeric variable.
10. The KEEP command sets the predictor list. Note that this command is NOT cumulative.
11. The ERROR command specifies the LEARN/TEST partition method. In this example, a dummy variable T separates the TEST part (T=1) from the LEARN part (T=0). Other useful methods are PROP <ratio> (a proportion selected at random), FILE <file> (test set in a separate file), and EXPLORE (do not proceed with testing).
12. The METHOD command sets the loss function: LS (least squares) or LAD (least absolute deviation).
13. The WEIGHT command sets the weight variable, if applicable.
14. The PENALTY command induces additional penalties on missing-value and high-level categorical predictors. We recommend always using the listed penalties. For backwards compatibility with earlier CART engines, one should use the following command instead: PENALTY MISSING=1.0 HLC=1.0

The remaining two commands are action commands:

15. The BUILD command signals the CART engine to start the model-building process.
16. The QUIT command terminates the program. Anything following QUIT in the command file will be ignored.

Multiple runs may be conducted using a single command file by inserting additional commands.

Example: sample classification combine run

The contents of the CLASSCOMB.CMD sample command file are shown below. Line-by-line descriptions and comments follow.

[CART Notepad window showing CLASSCOMB.CMD from C:\Program Files\Salford Data Mining\CART]

REM SAMPLE COMBINE RUN
REM **************************************************
REM INPUT OUTPUT FILES
REM **************************************************
USE
to a detailed description of each of the available batteries, and highlight the specifics of their use.

Common Battery Controls

A large number of batteries are simply collections of runs where one specific control parameter is set to different values. You access battery setup from the Battery tab in the Model Setup dialog. Consider, for example, battery ATOM, which varies the atom size, the minimum required parent node size (see the ATOM.CMD command file).

[Model Setup dialog, Battery tab: ATOM is highlighted in the Battery Types panel; the Battery Options panel lists a set of atom values ranging from 5 to 40.]

First, highlight ATOM in the Battery Types selection panel and press the Add button. Then type a list of possible atom values into the Values entry box found in the panel titled Battery Options. Pressing the Start button produces a summary report window for the resulting eight models with different atom settings.

[Battery Summary window: Models, Contents, Accuracy, Error Profiles, Var Imp, and Averaging tabs, listing the relative error and node count for each model (e.g., ATM_5, ATM_15).]
to the focus class in the node.

- N Focus Test: number of test records that belong to the focus class in the node.
- N Other Learn: number of train records that do not belong to the focus class in the node.
- N Other Test: number of test records that do not belong to the focus class in the node.
- N Node Learn: number of train records in the node.
- N Node Test: number of test records in the node.

You can control which columns are shown, and in what order, in the Select Columns to Display section.

Chapter 8: Train-Test Consistency (TTC)

The following group of controls allows fine user input:

[Thresholds panel: Direction 2.00, Rank 0.50, with Hide Agreed and Fuzzy Match toggles.]

- Direction sets the z-value threshold on directional stability. A node is declared directionally unstable only if it has contradicting class assignments on the learn and test samples and, furthermore, the z-value of the corresponding test is greater than the threshold. Otherwise the node is directionally stable (identical class assignments, or a z-value below the threshold).
- Rank sets the z-value threshold on rank stability. A pair of nodes taken from the learn- and test-based sorted sequences is declared rank stable if the z-value of the corresponding test is below the threshold.
- Fuzzy Match determines whether empty nodes on test data are ignored (Fuzzy Match pressed) or treated as unstable (Fuzzy Match not pressed).
- Hide Agreed
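The directional check compares a node's focus-class shares on the learn and test samples. A standard two-proportion z statistic illustrates the idea; treat this as an assumption, since CART's exact test formula is not documented here, and the counts are invented:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """z statistic for the difference between two sample proportions."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# A node that is 70% focus class on learn data but only 40% on test data
z = two_proportion_z(70, 100, 20, 50)
print(abs(z) > 2.0)  # exceeds the default Direction threshold of 2.00
```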
observed sample shares in the combined learn and test data.

- LEARN: priors match observed sample shares in the learn data alone.
- TEST: priors match observed sample shares in the test data alone.
- EQUAL: uniform priors, automatically set to 1/(number of classes).
- MIX: priors set to the average of the DATA and EQUAL options.
- SPECIFY <class1> <x1> <class2> <x2> ...: priors set to any strictly positive numbers; CART will normalize the values to sum to 1.0. A value must be assigned to each class. For character classes, the class value must be in quotes. The SPECIFY option requires that the dependent variable already be identified on the MODEL command.

Examples:

PRIORS SPECIFY "COKE" 1 "Pepsi" 2 "H20" 4 "7UP" 1   (explicit list; let CART rescale)
PRIORS EQUAL                                        (the default)
PRIORS MIX                                          (split the difference between DATA and EQUAL)

PRINT

Purpose: The PRINT command switches you between standard and extended analysis results for certain procedures. The command syntax is:

PRINT SHORT | LONG | MEDIUM

Examples:

PRINT SHORT   (produces only standard output from commands)
PRINT LONG    (prints extended output for some procedures)

QUIT

Purpose: The QUIT command ends your CART session. The command syntax is:

QUIT

The QUIT command will terminate the GUI, so you probably do not want it at the end of command files intended to be run there via the Su
Appendix IV: BASIC Programming Language

% DIM COLORS(10)
% FOR I = 1 TO 10 STEP 2
%   LET COLORS(I) = Y(I)
% NEXT
% IF SEX$ = "MALE" THEN DELETE

The % symbol appears only once, at the beginning of each line of BASIC code; it should not be repeated anywhere else on the line. You can leave a space after the % symbol, or you can start typing immediately; BASIC will accept your code either way. Our programming language uses standard statements found in many dialects of BASIC.

Overview of BASIC Components

LET: Assigns a value to a variable. The form of the statement is:

% LET variable = expression

IF...THEN: Evaluates a condition and, if it is true, executes the statement following the THEN. The form is:

% IF condition THEN statement

ELSE: Can immediately follow an IF...THEN statement to specify a statement to be executed when the preceding IF condition is false. The form is:

% IF condition THEN statement ELSE statement2

Alternatively, ELSE may be combined with other IF...THEN statements:

% IF condition THEN statement ELSE IF condition THEN statement ELSE IF condition THEN statement ELSE statement

FOR...NEXT: Allows for the execution of the statements between the FOR statement and a subsequent NEXT statement as a block. The form of the simple FOR statement
of the tree. The color coding helps us locate interesting terminal nodes. Bright red nodes isolate defaulters (Target class 1), and deep blue nodes are heavily populated with good borrowers. Other colors indicate more mixed results.

[Navigator window: classification tree topology for TARGET, color-coded by target class, with Grow and Prune controls; model statistics show 10 nodes at the best ROC, ROC Train 0.8474, ROC Test 0.7867; buttons include Save, Tree Details, Summary Reports, Commands, Grove, Translate, and Score.]

The tree displayed automatically is of the size determined by CART to be the most accurate classifier obtained. Other tree sizes are also available for display. In this example, we can review trees with as few as two nodes or as many as 62 nodes.

The performance of the different-sized trees is displayed in the lower panel of the navigator. This curve is a relative cost profile and traces the relationship between classification errors and tree size.

[Relative error profile: relative error plotted against the number of nodes, 10 through 70.]

We call this a relative error curve because it is always scaled to lie between 0 and 1: 0 means no error (a perfect fit) and 1 represents the performance of random guessing. The best t
354. ofit button 152 Average Profit Learn 152 Default Sort Order 152 Profit Learn 152 Profit Variable 152 programming language 101 progress report 47 234 prune 56 236 pruning 11 186 pruning test method 165 pruning tree 236 Q quartile range 156 QUIT command 390 R random number 132 208 210 random sub sampling 102 rank instability 187 reading ASCII files 33 34 reading data 31 reading Excel files 36 regression trees 146 148 relational operators 409 relative contribution 246 relative cost curve 74 237 relative error 149 REM command 391 repeated cases 167 Report Contents window 75 257 Report Current menu 289 report details committee of experts 166 Report menu 42 229 Report All 289 Report Current 289 Set Report Options 289 reporting controlling contents 129 cross validation results 131 number of competitors 131 number of surrogates 130 options 289 short command notation 127 text reports 125 tree sequence 131 reports box plots 156 classic text output 75 257 competitors and surrogates 68 155 251 node detail 68 73 154 251 256 node frequency distributions 71 253 node statistics 158 pre configured 289 Report Options dialog 288 Report Writer 288 rules 72 156 255 splitters 73 157 254 target class 289 terminal node detail 158 tree summary 61 151 244 viewing rules 158 resampling 163 response statistics tab 435 classifcation 176 regres
355. old cross validation a report for each of the ten cross validated trees will follow the report on the final pruned tree in the text output For this option to have full effect be sure to uncheck the Only summary tables of node information The GUI offers more a convenient way to review these CV details 132 Chapter 4 Classification Trees Command line equivalent BOPTIONS BRIEF BOPTIONS COPIOUS Controlling Random Number Seed Values As illustrated below the Options CART tab also allows you to set the random number seed and to specify whether the seed is to remain in effect after a tree is built or data are dropped down a tree Normally the seed is reset to 13579 12345 and 131 on start up and after each tree is constructed or after data are dropped down a tree The seed will retain its latest value after the tree is built if you click on the Retain most recent values for succeeding run radio button Random Number Seeds 135734 1235 1314 Seed Retention For Multiple Runs ve Reset to default values after each run e Retain most recent values for succeeding run amp Command line equivalent SEED lt N1 gt lt N2 gt lt N3 gt NORETAIN SEED lt N1 gt lt N2 gt lt N3 gt RETAIN Setting Directory Preferences The Option Directories tab allows you to set default directory preferences for input data model and command output model scoring results translation code and text report and temporar
356. older is located will have enough space at least the size of the largest data set you are planning to use Depending on your preferences you may choose one of two working styles 1 using the same location for input and output files 2 using separate locations for input and output files Temporary files with names like CART0314114746_ txt are records of your previous sessions The first part of the name refers to today s date 03 14 followed by a random series of digits to give the file a unique name These command logs provide a record of what you were doing during any session and will be stored even if you experience an operating system crash or power outage You may find the record invaluable if you ever need to reconstruct work you were doing Temporary files with names other than CARTnnnnn txt are normally deleted when you shut CART down If you find such files in your temporary directory you should delete them as they contain no useful information Additional Control Functions Ej Control icon that automatically changes all path references to make them identical with the Data entry Control icon that starts the Select Default Directory dialog allowing the user to browse for the desired directory Control icon that automatically changes all path references to make them identical with the Data entry Control that allows you to select from a list of previously used directories V Most Recently Used file
357. om scoring and translation code Run report classic output Temporary Files Temporary where CART will write temporary work files as needed where CART will write the command log audit trail x We suggest dedicating a separate temporary folder to CART Make it a habit to routinely check the Temporary Files Directory for unwanted scratch files These should only appear if for some reason your system crashed or was powered down in a way that did not permit CART to clean up w Depending on your preferences you may choose one of two working styles 1 using the same location for both input and output files 2 using separate locations for input and output files x The files with names like CART06125699_ txt are valuable records of your work sessions and provide an audit trail of your modeling activity Think of them as emergency copies of your command log You can delete these files if you are confident that your other records are adequate Make sure that the drive where the temporary folder is located will have enough space at least the size of the largest data set you are planning to use 134 Chapter 4 Classification Trees Additional Control Functions Control icon that automatically copies your Data file info to all other locations in the dialog except the Temporary File location j Control icon that lets the user browse among directories z Control that allows the user to select from a list of previou
358. on 101 variables auxiliary 85 91 134 categorical 47 88 92 234 character 33 class names 59 242 contribution 248 high level categorical 94 ID 174 importance 248 number of 32 penalize high level categorical 121 124 penalize improvement 121 penalize missing values 121 123 predictors 45 85 87 234 selecting 85 87 88 Index sorting list 92 target 45 85 174 234 transforming 406 weight 174 variable specific penalty 122 View menu 42 59 229 236 238 242 Assign Class Names 59 242 Data Info 291 Node Detail 57 139 151 241 242 Node Display 238 Open Command Log 140 299 Open Command Log 299 rules 159 Rules 183 259 Show Next Pruning 236 Update Command Log 299 View Data 290 viewing auxiliary variables information 134 data 290 data information 291 main splitters 238 main tree 57 241 sub tree 58 242 tree 56 239 variable splits 237 viewing rules 158 WwW warnings and errors 319 WEIGHT command 403 weights 90 missing 90 negative values 90 surrogate discount 103 zeroed 90 Window menu 42 68 229 251 windows CART Output 75 176 257 Data Viewer 290 Datalnfo Setup 291 Main Tree 56 57 239 241 Navigator 48 149 235 Node Report 68 251 Notepad 300 Output 283 Report Contents 75 257 Splitters 238 Sub Tree 58 242 Terminal Node Report 73 158 256 Tree Map 240 Windows keyboard conventions 43 working directories 28 work
359. oportion selected at random FILE lt file gt test set in a separate file and EXPLORE do not proceed with testing 11 gt gt The METHOD command sets the loss function LS least squares loss LAD least absolute deviation loss 12 gt gt The WEIGHT command sets the weight variable if applicable 13 gt gt The PENALTY command induces additional penalties on missing value and high level categorical predictors w We recommend always using the listed penalties For backwards compatibility with earlier CART engines one should use the following command instead PENALTY MISSING 1 0 HIC 1 0 The remaining two commands are action commands 14 gt gt The BUILD command signals the CART engine to start the model building process 15 gt gt The QUIT command terminates the program amp Anything following QUIT in the command file will be ignored Multiple runs may be conducted using a single command file by inserting additional commands 308 Chapter13 Working with Command Language Example Sample classification combine run The contents of a CLASSCOMB CMD sample command file are shown below Line by line descriptions and comments follow CART Notepad C Program Files Salford Data Mining CART 5 0 DAR E aa REM SAMPLE COMBINE RUN RENFTFFAAAA AAA AAA AAA AAA AAA AAAS EAE EERE EERE EERE ERE REM INPUT OUTPUT FILES REMATA SAA AAAA TEAK TEAK AAAATTAKAEAKKAAK AAA KAA K TEAK EER AERA REET USE
360. or example the number of candidate splits is 7 if K 11 the total is 1 023 if K 21 the number is over one million and if K 35 the number of splits is more than 34 billion Naive processing of such problems could take days weeks months or even years to complete To deal more efficiently with high level categorical HLC predictors CART has an intelligent search procedure that efficiently approximates the exhaustive split search procedure normally used The HLC procedure can radically reduce the number of splits actually tested and still find a near optimal split for a high level categorical The control option for high level categorical predictors appears in the Model Setup Categorical tab as follows High Level Categorical Variables Threshold level for enabling intelligent categorical split search 15 Search Intensity 200 More Accurate EE ae At 10 100 200 300 400 Faster 95 Chapter 4 Classification Trees The settings above indicate that for categorical predictors with 15 or fewer levels we search all possible splits and are guaranteed to find the overall best partition For predictors with more than 15 levels we use intelligent shortcuts that will find very good partitions but may not find the absolute overall best The threshold level of 15 for enabling the short cut intelligent categorical split searches can be increased or decreased in the Categorical dialog In the short cut method we conduct lo
or more terminal nodes are similar on most predictor variables. Unsupervised learning, by contrast, does not begin with a target variable. Instead, the objective is to find groups of similar records in the data. One can think of unsupervised learning as a form of data compression: we search for a moderate number of representative records to summarize or stand in for the original database.

Consider a mobile telecommunications company with 20 million customers. The company database will likely contain various categories of information, including customer characteristics such as age and postal code, product information describing the customer's mobile handset, features of the plan the subscriber has selected, details of the subscriber's use of plan features, and billing and payment information. Although it is almost certain that no two subscribers will be identical on every detail in their customer records, we would expect to find groups of customers who are similar in their overall pattern of demographics, selected equipment, plan use, and spending and payment behavior. If we could find, say, 30 representative customer types, such that the bulk of customers are well described as belonging to their type, this information could be very useful for marketing planning and new product development. We cannot promise that we can find clusters or groupings in data that you will find useful, but we include a method quite distinct from that found in other statisti
or Grove file needed for scoring by pressing the Save Grove button, or you may translate CART models into SAS, C, or PMML representations by activating the Translate button. Finally, you may apply any tree to data using the Score dialog, accessed via the Score button. See Chapter 7 for step-by-step instructions for scoring new data.

CART Text Output

The classic text output window contains the detailed technical log that will always be produced by the non-GUI CART running on UNIX, Linux, and mainframe platforms. Most modelers can safely ignore this window, because the same information is reported in the GUI displays we have been demonstrating in this tutorial. The classic text output will contain some exclusive reports and advanced information of interest to experienced modelers. To turn to the text output, select Classic Output (shortcut: Ctrl+Alt+C) from the Window menu, or click on the window if you can see it. The classic output contains an outline panel on the left, with hyperlinks for jumping to specific locations. Below we selected the first topic in the outline, the Target Frequency Table.

[Screenshot: Classic Output window (Ctrl+Alt+C) with the Report Contents outline (Target Frequency Table, Descriptive Statistics, Tree Sequence, Class Tables) and the frequency table for variable TARGET: N Classes 2; one class with 461 cases (69.43%), the other with 203 cases (30.57%).]
Character variables are automatically treated as discrete (categorical). Logically, this is because only numeric values can be continuous in nature. When a variable name does not end with a $ sign, the variable is treated as numeric; in this case, if a character value is encountered, it is automatically replaced by a missing value.

Missing Value Indicators

When a variable contains missing values, CART uses the following missing-value indicator conventions.

Numeric: either a dot "." or nothing at all (e.g., a comma followed by a comma). In the following example records, the third variable (PRED2) is missing:

DPV$,PRED1,PRED2,PRED3
"male",1,.,5
"female",2,,6

Character: either an empty quote string (quote marks with nothing in between) or nothing at all (e.g., a comma followed by a comma). In the following example records, the first and fourth variables are missing:

DPV$ CHAR1 PRED2 CHAR3 PRED4
male 1 3 5 Calif
female 2 4 Illinois

Opening the Example ASCII File

A sample ASCII file, SAMPLE.CSV, comes as part of the CART distribution and resides in the Sample Data folder. To open SAMPLE.CSV:

1. Click on File > Open > Data File.
2. In the Open Data File dialog window, choose ASCII Delimited Text (*.csv, *.dat, *.txt).
3. When you double-click on SAMPLE.CSV, the Model Setup dialog window should appear.

The Open Data File dialog lists only those files t
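As a sketch of how these conventions behave in practice, the following snippet (our illustration, not part of CART; the sample records mirror the numeric example above) parses delimited records and maps both the dot and the empty field to a missing value for numeric columns:

```python
import csv
import io
import math

RAW = 'DPV$,PRED1,PRED2,PRED3\n"male",1,.,5\n"female",2,,6\n'

def to_number(field: str):
    # CART's numeric missing-value markers: "." or an empty field
    return math.nan if field in (".", "") else float(field)

rows = []
for rec in csv.reader(io.StringIO(RAW)):
    if rec[0] == "DPV$":
        continue  # skip the header record
    rows.append((rec[0], [to_number(f) for f in rec[1:]]))

print(rows)
```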
Command line only. File > Exit (or <Alt+F4>): command line only. Model > Score Data; View > Data Info; Edit > Options; CART > Construct Model; Select Cases; File > Submit Command File; Model > Translate Model; File > Open Data File; Model > Construct Model: command line only.

Appendix II: Errors and Warnings

This appendix provides information on common errors and warnings. If you have any difficulty understanding or resolving any of the following errors and warnings, please contact your technical support representative at Salford Systems.

Error 1: UNABLE TO UNDERSTAND WHAT YOU MEAN ABOUT HERE
The program has encountered a problem with your command file syntax that it cannot resolve. Check the syntax immediately before and after the position indicated in the error message.

Error 2: YOU CANNOT WRITE TO A FILE YOU ARE READING FROM
You are attempting to use the same file for reading and writing. Check the USE and SAVE commands. Also make sure that none of the files involved are currently open in another application.

Error 3: THE PROBLEM IS TOO LARGE FOR THIS VERSION
CART does not have enough resources to complete your run. Check the run settings; certain extreme situations, such as high-level categorical predictors and targets, can render your run impossible to conduct. Contact Salford Systems if this message appears under normal settings.
365. ors only no regular predictors no penalties MVI_P use regular predictors missing value indicators and missing value penalties No_MVI_P use regular predictors and missing value penalties no MVIs As the graph above indicates one could reduce the relative error to 0 616 using missing value indicators alone Such remarkable predictability often indicates meaningful patterns of missing values in the data 218 Chapter 10 CART Batteries CART AR 2 l CART 6 0 CART 6 0 Pro Pro EX Battery NODES Battery NODES is very similar to battery DEPTH described above It varies the limit on the tree size in nodes according to a user supplied setting F CART 6 0 Pro EX Battery ONEOFF Battery ONEOFF was designed to generalize conventional co relational analysis by placing the CART engine in its core The battery contains the results of using one variable at a time to predict the response We illustrate this battery using the BOSTON CSV dataset see the ONEOFF CMD command file for details a Battery Summary 3 DAR Models Contents Error Profiles Var Imp Averaging Charts Rel Error _ Nodes 4_LSTAT 11 0 347 1_NOX 23 0 551 p View 10 Zoomed z Chart Type Bar J Line Rel Error Test Rel Error o ES ji Ivist WAL XON b SNANI b Battery Types Regression Battery Models ONEOFF Model Opt Terminal One Off Name Nodes tet ss Predictor 2 0 5511 N0X 0 6201 I
type of characters to use for character graphics (as opposed to high-resolution SYGRAPH graphics). You may choose either IBM screen-and-printer GRAPHICS characters or GENERIC characters that will print on any printer. Caution: GRAPHICS characters do not print correctly on some printers; if you have problems, switch to GENERIC. The command syntax is:

CHARSET GRAPHICS | GENERIC

Examples:
CHARSET GRAPHICS
CHAR GENERIC

CLASS

Purpose
The CLASS command assigns labels to specific levels of categorical variables (target or predictor). Labels are not limited in their length, although in some reports they will be truncated due to space limitations. For instance, if the variable DRINK takes on the values 0, 1, 2, and 3 in the data, you might wish to assign labels to those levels:

CATEGORY DRINK
CLASS DRINK 0 = "tea", 1 = "Columbian coffee", 2 = "soda pop", 3 = "Cold German Beer"

Class labels will appear in the node detail, misclassification reports, terminal node reports, and in most instances where the numeric levels would normally show up, in lieu of the numeric levels themselves. It is not necessary to specify labels for all levels of a categorical variable; any levels without a label will show up as numbers. The command syntax is:

CLASS <variable> <level> = <string>, <level> = <string>, ...

You may issue separate CLASS commands for each variable, such as:
CLASS PARTY 1 = "Repub", 2
specific records included here have been fictionalized. Nevertheless, we have retained the broad statistical relationships between the variables to yield a realistic study. The variables available on the file include:

TARGET: 0 = good, 1 = bad (defaulted)
AGE: Age of borrower in years
CREDIT_LIMIT: Loan amount
EDUCATION: Category of level of schooling attained
GENDER: Male or Female
HH_SIZE: Number of family members
INCOME: Per month
MARITAL: Marital status
N_INQUIRIES: Credit bureau measure
NUMCARDS: Number of credit cards
OCCUP_BLANK: No occupation listed
OWNRENT: Home ownership status
POSTBIN: Postal code
TIME_EMPLOYED: Years work experience

The goal of our analysis is to uncover the factors that are predictive of default. In such studies, the predictors, such as AGE and INCOME, must pertain to the time at which the borrower was granted the loan, and TARGET records whether or not the loan was subsequently satisfactorily repaid. A successful default model could be used to create a credit score and help the lender differentiate between good and bad risks in future loan applicants.

The CART Desktop

Double-click on the CART program icon and you will see a screen similar to the one below.

[Screenshot: the CART desktop with the Classic Output window (Ctrl+Alt+C) reporting: "This launch supports up to 32768 variables. The license supports up to 200
368. piction of the results with the following columns Model Name unique model identifier Opt Terminal Nodes number of terminal nodes in the optimal minimum relative error tree when Min Cost is pressed 1 SE Terminal Nodes number of terminal nodes in the 1 SE tree when 1 SE is pressed 202 Chapter 10 CART Batteries Rel Error relative error of the model when Misclass is pressed Avg ROC average area under the ROC curve when ROC is pressed Atom minimum required parent node size MinChild minimum required terminal node size Double click on any line in the Classification Battery Models section to open the corresponding Navigator window The entire battery can be saved using the Save Grove button The Contents tab includes summary information about the battery as well as a battery specific description of each individual model in the Models Specifications section wu Battery Summary 1 Models Contents Accuracy Error Profiles War Imp Averaging Models Specifications Grove Name atom ary at Test Dataset C Program Files Salford Data Mining CART Pro EX 6 0 EES Leam Dataset C Program Files Salford Data Mining CART Pro EX 6 05 Valid Dataset Total Count Models 8 Term Nodes In Main Trees 417 In All Trees 417 Trees 8 Save Grove The Battery Summary Accuracy tab furt
[Screenshot: Prediction Success Table, with Learn/Test selectors and Count/Row %/Column % display options; rows are the true classes and columns the predicted classes, with Total, Percent Correct, Total Correct, Overall % Correct, and Average % Correct figures.]

The rows of the table represent the true class and the columns the predicted class, and the table can report either train- or test-sample results. Here we have chosen to display test results based on cross-validation. Via cross-validation we determine that, of the 203 actual BADs, we classify 151 correctly (74.38%) and 52 incorrectly. Among the 461 GOODs, we classify 354 correctly (76.79%) and 107 incorrectly. The overall percent correct is simply the total number classified correctly (151 + 354) divided by 664, the total number of cases. The average percent correct is the simple average of the percent correct in each class (74.38% and 76.79%). In this example the two averages are very close, but they may well be quite different in other models.

To export the table as an Excel spreadsheet, or to copy it to the CART report document, just right-click anywhere in the display. As you can see from the window, you can opt to see Learn or Test results. The cells of the table, in either case, can contain counts, row percents, or column percents.

Prediction success tables based on the learn sample are usually too optimistic. You should always use prediction success tables based on the test sample, or on cross-validation when a separate test sample is not available, as fair estimates of CART
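The arithmetic above can be checked directly; this sketch simply recomputes the table's summary figures from the counts quoted in the text:

```python
# Confusion-matrix counts quoted above: (correct, incorrect) per true class
bads_correct, bads_wrong = 151, 52     # 203 actual BADs
goods_correct, goods_wrong = 354, 107  # 461 actual GOODs

total = bads_correct + bads_wrong + goods_correct + goods_wrong
overall_correct = (bads_correct + goods_correct) / total
per_class = (bads_correct / 203, goods_correct / 461)
average_correct = sum(per_class) / 2

print(f"overall {overall_correct:.2%}, average {average_correct:.2%}")
```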
splitter is a splitter that is similar to the main splitter in how it assigns cases in a node to the left and right children. The top surrogate is the splitter that comes closest to matching the main splitter's left/right assignments, but "closest" does not necessarily mean close. In the example, the top surrogate has an association score of 0.13, on a scale of 0.00 to 1.00, which is a rather weak association. You can think of the association as akin to correlation, but scores above 0.20 represent a good degree of matching. When a splitter does not have any close-matching surrogates, it means that the information content of that variable is unique and cannot be easily substituted for by any other variable. In this example it should not be surprising to learn that the credit bureau variable N_INQUIRIES contains unique information not reflected in the other variables.

The top five surrogates are ranked by association score and are listed in the bottom right panel, along with the splitting criterion and the improvement yielded by the surrogate split. In this example the best surrogate, HH_SIZE, has an association value of 0.13 and a low improvement of 0.0007. The next surrogate, GENDER, is ranked 2 because of its association score, but offers a much better improvement.

Surrogates play the role of splitter when the primary splitter is missing. They play the role of backup splitter and are consulted in order: if both the primary and first surrogate splitter are missing, CART would make use of the 2nd-ranked surrogate. More effective surrogates are found in internal node 3 (go left twice from the root and double-click).

[Screenshot: Node 3 Competitors and Surrogates. Main splitter: Is CREDIT_LIMIT <= 5546, improvement 0.008. Competitors include NUMCARDS, GENDER, AGE, TIME_EMPLOYED, and INCOME; surrogates include NUMCARDS, INCOME, TIME_EMPLOYED, AGE, and HH_SIZE.]

Here the main splitter is CREDIT_LIMIT, and the top surrogate, NUMCARDS, has a strong association score of 0.61. This means that if NUMCARDS were used in place of CREDIT_LIMIT, it would partition the data in a similar way and achieve a similar, but lower, improvement score. In this node the top competitor is also the top surrogate, but you should not expect to see this pattern often.
371. pology for SEGMENT Color code for Aux HOME Level 1 gt Bete TT Worse Smaller Next Prune Prune Model Statistics Predictors Important Nodes Min Node Cases Relative Cost Best ROC Nodes ROC Train 0 9826 Number of Nodes ROC Test 0 9756 Data Displays and Reports Save Model Leam Splitters Tree Details Summary Reports Commands Grove Translate Score When a categorical variable has more than two levels it is possible to group several levels to report frequency distributions for the entire group For example choose the NFAMMEM variable in the Current Variable selection box in the Select Target Variable window see the steps above explaining how to get to this window 138 Chapter 4 Classification Trees Now put checkmarks against levels 1 2 3 4 5 and click the Merge selected groups button As a result all five levels are now combined into one group Select Target Variable Curent variable NFAMMEM Group Level s Oo Color Set Default Colors Apply Cgda4 Jonon wn e E r r L Merge selected groups Select Target Variable Curent variable NFAMMEM Group Level s 1 2 3 4 5 6 Color Set Default Colors Apply cl tov 207177779777 Janon wna Now go back into the Navigator where you may color code terminal nodes by the group ji Navigator 1 Clessiication tree oS for SE GMENT Color code for Aux NFAMMEM
372. ports Model Leam Splitters Tree Details Summary Reports Translate Score By default CART uses the least squares splitting rule to grow the maximal tree and cross validated error rates to select the optimal tree In this example the optimal tree is the tree with 18 terminal nodes as displayed in the Navigator above The upper button in the group cycles over three possible display modes in the lower part of the Navigator Window Default Mode shows the relative error profile either Test Cross Validated or Learn depending on the testing method chosen in the Testing tab of the Model Setup window 150 Chapter 5 Regression Trees alll a 04 oa in 4 os a ER ahh i 2 aie eh elke EP EE a Number of Nodes 1 SE Mode shows the relative error profile where all trees with performance within one standard error of the minimal error tree are marked in green luu Relative Error Ol 4 sh eb Gy i re fa ah ag WME ies a ale WE ED 2a Number of Nodes Node Size mode shows the node size bar chart for the currently selected tree 20 00 E o 2 10 00 o o 0 00 1 2s el og fh oy eh im Wi We ae RP key as ale ae Terminal Nodes You can click on any of the bars to see the corresponding node highlighted in yellow on the tree display To change the currently selected tree go to one of the previous modes pick a new tree and switch back to
373. program icon and you will see the following screen 4 CART Classic Output Ctrl Alt C tet File Edt View Explore Model Report Window Help eer e Olea s x sf Bale at alas E Report Contents gt This launch supports up to 32768 variables The license supports up to 100 MB of learn sample data gt REM Resetting Preferences gt REN Setting General default options gt LOPTIONS MEANS NO PREDICTIONS NO TIMING NO GAINS NO ROC NO gt FORMAT 5 gt REM Setting CART default options gt LOPTIONS NOPRINT NO PLOTS NO PS NO gt BOPTIONS SURROGATES 5 PRINT 5 COMPETITORS 5 CPRINT 5 TREELIST 10 gt BRIEF gt 229 Chapter 11 CART Segmentation About CART Menus The menu items in CART change depending on the stage of your analysis and which window is actively in the foreground As a result some menus may be disabled if not available Similarly the commands that appear in the pull down menus and the toolbar icons are disabled if not accessible An overview of the layout of the main CART menus is presented below FILE e Open data set Navigator file or command file e Save analysis results Navigator file Grove file or command file e Export tree rules e Specify printing parameters e Activate interactive command mode Open notepad e Submit batch command files EDIT e Cut copy and paste selected text e Specify colors and fon
374. que color with a different plotting symbol as seen in the illustration above To print the contents of the Overlay Gains Chart dialog box select Print from the File menu To alter the layout prior to printing select Page Setup from the File menu x The tables in the Gains Chart Misclassification and Prediction Success dialog boxes can also be copied and pasted into spreadsheet and word processing programs such as Excel and Word All of these tables and graphs can also be exported into various graphical formats They include bmp emf jpg png and wmf To export right click on the table or graph and select Export form the menu Chapter Regression Trees This chapter provides instructions for the steps required to grow regression trees 146 Chapter 5 Regression Trees Building Regression Trees Our examples so far have focused on classification trees where the target is categorical Using regression trees CART can also be used to analyze and predict continuous target variables Most CART functions are shared by both classification and regression trees but there are several important differences when we grow regression trees these are the focus of this chapter Specifying a Regression Model We develop a regression tree using the Boston Housing Price dataset that reports the median value of owner occupied homes in about 500 U S census tracts in the Boston area together with several variables that m
The Force Split tab .......................... 267
The Constraints tab .......................... 275
Saving and Printing Text Output .......................... 283
Memory Management .......................... 285
Data Information .......................... 291

WORKING WITH COMMAND LANGUAGE .......................... 295
Introduction to the Command Language .......................... 296
Alternative Control Modes in CART for Windows .......................... 297
Command Line Mode .......................... 298
Creating and Submitting Batch Files .......................... 298
Command Log .......................... 299
View > Open Command Log .......................... 299
File > New Notepad .......................... 300
File > Submit Window .......................... 300
File > Submit Command File ..........................
376. rder If both the primary and first surrogate splitter are missing CART would make use of the 28 ranked surrogate More effective surrogates are found in internal node 3 go left twice from the root and double click 4 Navigator 5 10 Node 3 Competitors and Surrogates Classification Is CREDIT_LIMIT lt 5546 Main Splitter Improvement 0 008 g 0 010 Competitor Split Improver 2 0 008 NUMCARDS a eo 0 004 o GENDER 1 000 AGE 56 000 Sae TIME_EMPLOYED 19 750 INCOME 5032 000 Competitor Surrogate Split Association NUMCARDS INCOME 2939 000s TIME_EMPLOYED 0 750r AGE 18 500s HH_SIZE 5 500r Here the main splitter is CREDIT_LIMIT and the top surrogate NUMCARDS has a strong association score of 0 61 This means that if NUMCARDS were used in place of CREDIT_LIMIT it would partition the data in a similar way and achieve a similar but lower improvement score In this node the top competitor is also the top surrogate but you should not expect to see this pattern often 71 CART BASICS 3 See the main reference manual for a detailed discussion of association and improvement The Classification tab The classification tab displays node frequency distributions in a bar graph or optionally a pie chart or horizontal bar chart for the parent left and right child nodes If you use a test sample frequency distributions for learn and test samples can be viewed separately usi
377. re reported for the overall train and test samples and then separately for each level of the target LOPTIONS MEANS YES NO Prediction success tables confusion matrix with misclassification counts and s by class level LOPTIONS PREDICTIONS YES NO Report analysis time CPU time required for each stage of the analysis LOPTIONS TIING YES NO Report Gains tables LOPTIONS GAINS YES NO Report ROC tables LOPTIONS ROC YES NO Decimal places precision to which the numerical output is printed FORMAT lt N gt Exponential notation for near zero values exponential notation used for values close to zero FORMAT lt N gt UNDERFLOW 127 Chapter 4 Classification Trees ROC Graph Labels ROC graphs are traditionally labeled differently in different industries You can select from the two labeling schemes displayed below ROC Graph Axes Labels True Pos Rate False Pos Rate True Pos Rate False Pos Rate Sensitivity 1 Specificit Press the Save as Defaults button to save your preferences permanently If you have made some temporary changes and wish to restore your previously saved defaults press the Recall Defaults button Use Short Command Notation Sets the minimal number of predictors that triggers a short command notation in the command log When the number of predictors is small each predictor is printed in the command log for example KEEP or CATEGORY commands However
378. reach agreement with some other OLS programs Specifies the maximum number of distinct levels in discrete variables The default is 20000 60000 which permits up to 20000 distinct classes for numeric variables and up to 60000 for character variables You should only consider increasing this parameter if the program is unable to obtain a complete tabulation of one or more of your discrete variables 349 Appendix III Command Reference ALLLEVELS By default node statistics will not list discrete variable levels for a node that is not represented N 0 in that node Specifying ALLLEVELS YES results in a complete tabulation of levels including those with N 0 in the node ORDER Discrete variable splitters and cross validation for classification trees can be affected by the sorting of your dataset ORDER YES adjusts for any sorting in your data and should be used when comparing results between CART 5 or greater and previous versions of CART The default is DISCRETE TABLES SIMPLE CASE MIXED MISSING MISSING REFERENCE FIRST ALLLEVELS NO ORDER NO MAX 20000 60000 350 Appendix III Command Reference DISALLOW Purpose The DISALLOW command specifies how predictor variables are constrained to be used as primary splitters and or as surrogates at various depths of the tree and according to the node learn sample size This command is only available in CART EX Pro is ignored by other versions By default all predictors
379. red is to select the Unsupervised radio button from the control section titted Tree Type As you can see all the other Model Setup tabs remain available for additional controls that the analyst may desire Model Setup Advanced Costs Priors Penalty Battery Model Categorical Force Split Constraints Testing Select Cases Best Tree Variable Selection Tree Type Variable Name Target Predictor Categorical Weight T EEA SEGMENT E C Regression AGE a 5 PLAN DATA MUSIC HOME NFAMMEM EMP AVGMIN Set Focus Class Target Variable aadadadadad Weight Variable Sort File Order SA Number of Predictors 11 Save Grove CART Combine xi Cancel Continue 4 267 Chapter 12 Features and Options rN If we simply scramble the data without resampling then the summary statistics for the Original and Copy data sets must be identical The scrambling destroys any correlation structure in the data linear or nonlinear Hence when using all the data for training no variable can split the data productively in the root node which is as it should be If the data sets can be separated at all a combination of at least two variables will be required Thus in the telecommunications example the average customer age is of course identical in the original and the copy But the average age of customers having iPhones may very well not be equal across Original and Copy datasets If it i
380. redictors box at the bottom of the column The Model tab will appear as follows Model Setup a az Categorical Force Split Constraints Testing Select Cases Best Tree Variable Selection Tree Type Classification E E 5 C Regression C Unsupervised Variable Name Target Predictor Categorical Weight Aux Set Focus Class Target Variable LOW EINEN BEBEL weitvaiane Select py Select Select Sort File Order Predictors N Cat Aux Number of Predictors 8 Save Grove CART Combine Score a Continue Start If you inadvertently include a variable as a predictor simply uncheck the corresponding box w Note also that each of the model setup tabs contains a Save Grove button in the lower left corner This allows you to request saving the model for future review scoring or export ra For command line users the MODEL command sets the target variable while the KEEP command defines the predictor list See the following command line syntax MODEL lt depvar gt KEEP lt indep varl indep var2 indep_var gt MODEL LOW KEEP AGE RACE SMOKE HT UI FTV PTD LWD 88 Chapter 4 Classification Trees Categorical Predictors Put checkmarks in the Categorical column against those predictors that should be treated as categorical For our example specify RACE UI and FTV as categorical predictor variables Alternatively as for predictor variables hold down the lt Ctrl
381. rence IMPORTANCE QUICKPRUNE DIAGREPORT HLC PROGRESS MREPORT MISSING VARDEF PLC Places weight on surrogate improvements when calculating variable importance Must be between 0 and 1 The default is 1 0 Invokes an algorithm that avoids rebuilding the tree after pruning has selected an optimally sized tree Produces tree diagnostic reports Accommodates high cardinality categoricals Assume the variable in question has nlev levels n1 number of initial random split trials lt n7 gt must be greater than 0 n2 number of refinement passes Each pass involves nlev trials lt n2 gt must be greater than 0 The default is HCC 200 10 The HCC option is identical to HLC Issues a progress report as the initial tree is built This option is especially useful for trees that are slow to grow LONG produces full information about the node SHORT produces just the main splitter info and NONE turns this feature off The default is NONE Produces a special report summarizing the amount of missing data in the learn and test samples Adds missing value indicators to the model It has several forms NO disables missing value indicators YES will produce missing value indicators for all predictors in the model that have missing values in the learn sample DISCRETE will produce missing value indicators only for discrete predictors CONTINUOUS will do so only for continuous predictors LIST specifies a list of variables t
382. results to small changes in the data Even in a large data set the class of interest may have only a handful of records When you have only a small number of records in an important target class you should think of your data set as small no matter how many records you have for other classes In such circumstances cross validation may be the only viable testing method Reducing the number of cross validation folds below ten is generally not recommended In the original CART monograph Breiman Friedman Olshen and Stone report that the CV results become less reliable as the number of folds is reduced below 10 Further for classification problems there is very little benefit from going up to 20 folds If there are few cases in the class of interest you may need to run with fewer than 10 CV folds For example if there are only 32 YES records in a YES NO classification data set and many more NOs then eight fold cross validation would allow each fold to contain four of these cases Choosing 10 fold for such data would probably induce CART to create nine folds with three YES records and one fold with five YES records In general the better balance obtained from the eight fold CV would be preferable There is nothing technically wrong with two fold cross validation but the estimates of the predictive performance of the model tend to be too pessimistic With 10 fold cross validation you get more accurate assessments of the model s predictive power
383. riables This number can be increased with the v command line flag Modes of Operation Console CART can be invoked interactively by invoking it at the command prompt without arguments You will get a series of startup messages looking something like this 313 Chapter 13 Working with Command Language CART TreeNet version 6 2 0 118 Copyright 1991 2006 Salford Systems San Diego California USA Launched on 9 8 2006 with no expiration This launch supports up to 32768 variables Model space 256 MB RAM allocated at launch partitioned as Real 65109998 cells Integer 1114112 cells Character 3539016 cells Data space allocated as needed The license supports up to 4096 MB of learn sample data Processing commands from usr local salford 1lib SALFORD CMD StatTransfer enabled gt You can then enter commands and get back responses Your session ends when you enter the QUIT command Since CART in interactive mode will accept commands through standard input and send responses through standard output it is sometimes convenient to invoke it this way via a script or batch file Example Read commands from a set of command files and write results to output dat cat runitl cmd runit2 cmd runit3 cmd cart gt output dat Generally the more convenient way to run console CART is in batch mode which can be invoked by specifying a command file as an argument Example Execute runit1 cmd in batch mode cart runitl
384. riginals USE INFILE CSV CATEGORY OCCUPCODE DIAGNOSTIC DISCRETE ORDER YES RUN SD PREPFILE CSV PDM YES 393 Appendix Ill Command Reference SCORE Purpose The SCORE command applies CART trees stored in a grove to data in your dataset reporting prediction success tables gains and ROC charts as well as saving predicted response s terminal node assignment s and predicted probabilities to an optional output dataset The command syntax is SCORE OFT lt yes no gt DCM lt yes no gt PROBS lt N gt PATHS lt yes no gt DEPVAR lt variable gt in which the following options may be set as follows OFT O mits the F irst T ree among trees sharing a common target variable from being a member of the committee for that target variable When CART builds a committee of trees it also builds an initial tree against which the committee is compared When scoring it may be desired for the initial tree to be added to those already in the committee In this event specify OFT NO The default is OFT YES consistent with previous versions of CART and the notion that the initial tree is not to be used as part of the committee DCM D etails C ommittee M embers By default DCM NO in which case prediction success tables terminal node summaries and gains and ROC charts are only produced for committees if a committee exists in the grove If you wish to see these reports for all trees in the committee s use DC
385. …rious limits, including how many observations and variables are allowed, the largest tree size allowed, the largest tree depth, the smallest node size allowed, and whether sub-sampling will be used. For the most part, the above commands should be left unchanged unless you need fine control over the CART engine. A more detailed description can be found in Appendix III, Command Reference.

Commands 8 through 16 specify model settings that usually change from run to run.

8 >> The MODEL command sets the target variable.

9 >> The CATEGORY command lists all categorical numeric variables. Character variables are always treated as categorical and need not be listed here. For classification models, numeric targets must be declared categorical.

10 >> The PRIORS command sets the prior probabilities for all target classes. The commands PRIORS DATA or PRIORS EQUAL are useful aliases for common situations.

11 >> The MISCLASSIFY commands set the cost matrix. Only non-unit costs need to be introduced explicitly. There will be as many MISCLASSIFY commands as there are non-unit cost cells in the cost matrix.

12 >> The KEEP command sets the predictor list. This command is NOT cumulative.

13 >> The ERROR command specifies the LEARN/TEST partition method. In this example, a dummy variable T separates the TEST part (T=1) from the LEARN part (T=0).
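For instance, PRIORS DATA sets each class prior to its observed share of the learn sample, while PRIORS EQUAL assigns every class the same prior. A minimal sketch of the two computations in Python (illustrative only; `priors` is our own helper, not a CART function):

```python
from collections import Counter

def priors(target_values, mode="DATA"):
    """Return a {class: prior} map mimicking PRIORS DATA / PRIORS EQUAL."""
    counts = Counter(target_values)
    if mode == "EQUAL":
        return {c: 1.0 / len(counts) for c in counts}
    n = len(target_values)  # PRIORS DATA: observed class frequencies
    return {c: counts[c] / n for c in counts}

y = [0, 0, 0, 1]           # a toy binary target
print(priors(y, "DATA"))   # {0: 0.75, 1: 0.25}
print(priors(y, "EQUAL"))  # {0: 0.5, 1: 0.5}
```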
386. robability trees Important extensions to these core CART methods found in CART 6 0 are discussed below A powerful binary split search approach CART trees deliberately restrict themselves to two way splits of the data intentionally avoiding the multi way splits common in other methods These binary decision trees divide the data into small segments at a slower rate than multi way splits and thus detect more structure before too few data are left for analysis Decision trees that use multi way splits fragment the data rapidly making it difficult to detect patterns that are visible only across broader ranges of data values An effective pruning strategy CART s developers determined definitively that no stopping rule could be relied on to discover the optimal tree They introduced the notion of over growing trees and then pruning back this idea fundamental to CART ensures that important structure is not overlooked by stopping too soon Other decision tree techniques use problematic stopping rules that can miss important patterns 12 Introducing CART 6 0 Automatic self test procedures When searching for patterns in databases it is essential to avoid the trap of over fitting that is of finding patterns that apply only to the training data CART s embedded test disciplines ensure that the patterns found will hold up when applied to new data Further the testing and selection of the optimal tree are an integral part of the CART
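As a toy illustration of what a binary split search involves (ordinary Python, our own sketch; CART's actual proprietary algorithms are far more elaborate), the best two-way split on a single numeric predictor can be found by scanning every candidate threshold and scoring it by Gini impurity reduction:

```python
def gini(labels):
    """Gini impurity for a 0/1 label list."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n            # fraction of class 1
    return 2 * p * (1 - p)

def best_binary_split(x, y):
    """Try each threshold between distinct sorted x values;
    return (threshold, impurity improvement) of the best split."""
    pairs = sorted(zip(x, y))
    parent = gini(y)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs[:i]]
        right = [lab for v, lab in pairs[i:]]
        w = len(left) / len(pairs)
        improvement = parent - (w * gini(left) + (1 - w) * gini(right))
        if improvement > best[1]:
            best = (thr, improvement)
    return best

# perfectly separable toy data: the best split lands between 2 and 10
print(best_binary_split([1, 2, 10, 11], [0, 0, 1, 1]))  # (6.0, 0.5)
```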
387. …rpose: The IDVAR command lists extra variables to save in the next dataset to be SAVED. These can be any variables from the USE dataset that are not in the model. (Model variables are saved with the SAVE ... MODEL option.) The command syntax is shown in the examples below. If every case in your file has a unique identifier, say SSN, you could specify:

  IDVAR SSN
  SAVE "WATER.CSV"

The file WATER.CSV will include the variable SSN in addition to its normal contents. If you want to include all the non-model and model variables in the saved dataset, you would issue:

  IDVAR ALL
  SAVE "<filename>" MODEL

Variable groups may be used in the IDVAR command similarly to variable names.

364 Appendix III: Command Reference

KEEP

Purpose: The KEEP command specifies a list of independent variables. The command syntax is:

  KEEP <indep_list>

in which <indep_list> is a list of potential predictor variables. If no <indep_list> is specified, all numeric variables are considered for node splitting unless an EXCLUDE command or <indep_list> is included on the MODEL statement. Independent variables may be separated by spaces, commas, or "+" signs. A range of variables may be specified with the first and last variables (in dataset order) separated by a dash. See the MODEL and EXCLUDE commands for other ways to restrict the list of candidate predictor variables.

Examples:

  MODEL CLASS
  KEEP AGE, IQ, EDUC, FACTOR(3)-FACTOR(8), RACE

…selected
388. rrently open CART results dialogs are listed and individual ones can be excluded or added to the list that will appear in the report when the Report Now button is clicked A stock report for the CART results that are currently active i e in the foreground can be generated by choosing Report Report Current If the active window is not a results window the Report Current menu item will be disabled Furthermore if you have several CART results windows open you can generate a report for all the trees in the order in which they were built by choosing the Report Report All menu item Default Target Class Reports summarizing class performance e g gains charts require a target class For binary models i e 0 1 or 1 2 the second level is assumed to be the target class For multinomial models e g 1 2 3 4 the lowest class is assumed to be the target class 290 Chapter 12 Features and Options Printing and Saving Reports Once you have generated a report it may be printed or previewed by using the Print Print Setup and Print Preview options on the File menu To save a report to a file use the File Save As option The contents of the Report window can be saved in three formats rich text format rtf text or text with line breaks txt The rich text rtf can be read by most other word processors and maintains the integrity of any graphics imbedded in the report Neither text format retains grap
389. …rror profile and note the flat region near the 10-node mark.

[Relative error profile: relative cost (roughly 0.50 to 0.70; minimum 0.488) plotted against number of nodes, 10 to 70.]

It is natural to suspect that one of these smaller trees is practically just as good as the optimal tree. If you click on the control on the left side of the navigator, you will see a portion of the relative error profile turn green.

[The same relative error profile, with the qualifying region of the curve shown in green.]

This tells us exactly which sizes of trees exhibit an accuracy performance that is statistically indistinguishable from the optimal tree. The CART authors suggested that we use a one standard error, or 1SE, rule to identify these trees, and in the display we have moved to the smallest of these trees.

The 1SE tree is the smallest tree displaying an error rate that is no worse than one standard error above the error rate of the optimal tree.

Because determining which tree is actually best is subject to statistical error, we cannot be absolutely certain which tree is best. Every tree marked in green is a defensible candidate for best tree.

56 CART BASICS

In our example, the 1SE tree has five terminal nodes, with a relative error of .504 and a test ROC of .7552. The optimal tree has a relative error of .488 and a test ROC of .7867. The optimal tree is better, but it is also twice the size, and our measurements are always subject to some statistical uncertainty.
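The 1SE selection rule is easy to state in code. A sketch in Python (the pruning sequence below is hypothetical, not the run shown in the manual):

```python
def pick_1se_tree(sequence):
    """sequence: list of (terminal_nodes, relative_error, std_error).
    Return the smallest tree whose error is within one standard error
    of the minimum-error (optimal) tree."""
    best_nodes, best_err, best_se = min(sequence, key=lambda t: t[1])
    threshold = best_err + best_se
    eligible = [t for t in sequence if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])   # smallest qualifying tree

# hypothetical pruning sequence: (nodes, rel_error, se)
seq = [(2, 0.573, 0.030), (5, 0.504, 0.028),
       (10, 0.488, 0.027), (62, 0.676, 0.030)]
print(pick_1se_tree(seq))  # (5, 0.504, 0.028): 0.504 <= 0.488 + 0.027
```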
390. rsions will result in some new navigator features being disabled To open navigators from previous versions select Open gt Open Navigator from the File menu In the Open Tree Navigator dialog box specify the name and directory location of the navigator file and click on Open Opening a navigator in subsequent sessions allows you to continue your exploration of detailed and summary reports for each of the trees in the nested sequence or to use the navigator for scoring or translation see Chapter 7 Scoring and Translating however reopening the file does not reload the model setup specifications in the GUI dialogs To do this you should learn the basics of command line use in Chapter 13 To save your model setup specifications save the settings in a command file prior to exiting CART The commands by default stored in CART s command log can be accessed by selecting Open Command Log from the View menu or by clicking the Command Log toolbar icon To save the command log select Save from the File menu To then reload your setting in the Model Setup dialog simply submit the command log The last set of model setup commands in the command file appears in the tabbed Model 141 Chapter 4 Classification Trees amp Command line users will use the following command syntax to save CART models and navigators GROVE lt file name grv gt Printing Trees To print the Main Tree or a sub tree bring the tree window
391. rule va The CART approach to decision tree construction is based on the foundation that it is impossible to know for sure when to stop growing a decision tree You can prove this mathematically Therefore CART does not stop but rather grows and grows and grows rN CART uses extraordinarily fast proprietary algorithms so it does not take much time to grow the initial largest tree Once we have the largest tree constructed we begin pruning This is done for you automatically The pruning process trims the tree by removing the splits and branches that are least useful A pruning step often removes just one split but sometimes several splits are removed together The mathematical details are provided in the original CART monograph To see which nodes are removed in the next pruning step click on the Next Prune button at the upper right side of the navigator The nodes to be pruned next will be highlighted in yellow Use the left arrow key to return to the CART optimal tree marked with the green bar 55 CART BASICS a es The Home key is a short cut to return to the CART optimal tree in the navigator Here we can clearly see which node would be pruned next if we wanted to select a smaller tree The reason CART would prune this particular node next is that by doing so CART would retain as much accuracy as possible Now click on Next Prune again to turn off the node highlighting Look again at the relative e
392. …rved line represents the cumulative percentage of Class 1 cases (column five in the grid) versus the cumulative percentage of the total population (column six), with the data ordered from the richest to the poorest nodes. The vertical difference between these two lines depicts the gain at each point along the x-axis. For example, if you use the CART tree to find Class 1 observations and decide to target 30 percent of the population, you would find 91 percent of the Class 1 observations. If you targeted randomly, you would expect to find only 30 percent of the Class 1 observations. Therefore, the gain in this case is 61 percent (91 minus 30) at x equal to 30. Alternatively, we can say that the lift in this case is 91/30 = 3.03.

The Gains Table can be exported to Excel with a right mouse click, then choosing Export from the pop-up menu.

247 Chapter 11: CART Segmentation

You can print individual Gains Charts, as well as overlay and print Gains Charts for trees of different sizes and from different CART analyses (see Chapter 4). You can also add Gains Charts and Tables to the CART report (see Chapter 12).

Root Splits

The next summary report shows the competing root node splits, in reverse order of improvement.

[Navigator Tree Summary Reports window, Root Splits tab (alongside Misclassification, Prediction Success, Gains Chart, Terminal Nodes, and Variable Importance tabs): columns Competitor, Split, Improvement, N Left, N Right, N Missing; the first listed competitor splits on ANYRAQT at 0.]
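The gain and lift arithmetic above is simple enough to verify in a few lines of Python (the 30 percent and 91 percent figures are the ones from the example; `gain_and_lift` is our own helper, not a CART function):

```python
def gain_and_lift(pct_targets_captured, pct_population_targeted):
    """Both arguments are percentages (0-100).
    Gain is the vertical distance above the random-targeting diagonal;
    lift is the ratio of captured targets to population targeted."""
    gain = pct_targets_captured - pct_population_targeted
    lift = pct_targets_captured / pct_population_targeted
    return gain, lift

gain, lift = gain_and_lift(91, 30)
print(gain)             # 61
print(round(lift, 2))   # 3.03
```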
393. s In classification runs some of the reports generated by CART gains prediction success color coding etc have one target class in focus By default CART will put the first class it finds in the dataset in focus A user can overwrite this by pressing the Set Focus Class button 92 Chapter 4 Classification Trees Sorting Variable List The variable list can be sorted either in physical order or alphabetically by changing the Sort control box Depending on the dataset one of those modes will be preferable which is usually helpful when dealing with large variable lists The Categorical tab The Categorical tab allows you to manage text labels for categorical predictors and it also offers controls related to how we search for splitters on high level categorical predictors The splitter controls are discussed later as this is a rather technical topic and the defaults work well Model Setup Advanced Priors Penalty Battery Ce Force Split Constraints Testing Select Cases Best Tree Method Change Class Names and Set Categorical Search Parameters High Level Categorical Variables Variable Threshold level for enabling intelligent categorical split search 15 Search Intensity 200 Moe Fe Accurate Roane Chee cr 10 100 200 300 400 Set Class Names Save Grove CART Combine Cancel Continue Setting Class Names Class names are defined in the Categorical tab Press Set Class Nam
394. s Count Tree 1 Node 14 440 cases Lift 2 48 ry 300 400 500 600 Node Focus Class Count View Richness Lift Bar Scatter The graph shows a scatter plot of node richness or node lift when the corresponding button is pressed versus node focus class count You can switch between the Bar and Scatter views of the plot You can also switch between the Learn and Test results 198 Chapter 9 Hot Spot Detection Hovering the mouse pointer over a dot produces extra information that contains tree and node number as well as the actual coordinate values as shown above Finally the blue line marks the Effective Frontier the nodes most interesting in terms of balancing node richness versus node size Chapter CART Batteries A new and powerful feature designed to build multiple models automatically 200 Chapter 10 CART Batteries Batteries of Runs The CART algorithm is characterized by a substantial number of control settings Often the optimal values for many parameters cannot be determined beforehand and require a trial and error experimental approach In other cases it is desirable to try various settings to study their impact on the resulting models CART batteries were designed to automate the most frequently occurring modeling situations that require multiple collections of CART runs We start our discussion with a description of common battery controls and output reports Then we move on t
395. …s except ID, SSN, and ATTITUDE can be used in the CART process.

354 Appendix III: Command Reference

FORCE

Purpose: FORCE identifies CART splits to be implemented at the root and first child nodes, in lieu of the splits that CART would naturally determine based on the learn data. The FORCE command applies to CART trees only. Its syntax is:

  FORCE ROOT|LEFT|RIGHT ON <predictor> AT <splits>

For example:

  FORCE ROOT ON GENDER$ AT "Male","Unknown"
  FORCE LEFT ON REGION AT 0,3,4,7,999
  FORCE RIGHT ON INCOME AT 100000

To reset forced splits, use the command with no options:

  FORCE

FPATH

Purpose: The FPATH command sets the default search path for unquoted file names. Its syntax is:

  FPATH "<file prefix or path>" [ OUTPUT | SAVE | SUBMIT | GROVE | USE ]

OUTPUT: Set the default path for classic text output files specified with the OUTPUT command.
SAVE: Set the default path for output datasets specified with the SAVE command.
SUBMIT: Set the default path for command files to be executed via the SUBMIT command.
GROVE: Set the default path for grove files, either input or output.
USE: Set the default path for input datasets specified with the USE or ERROR FILE commands.

If no options are specified, the path indicated applies to all file types. If no path is given, the existing path is replaced by the default, which is the current working directory. The FPATH command has
396. …s is omitted, the default is to step by 1.

Remarks: Nested FOR-NEXT loops are not allowed, and a GOTO which is external to the loop may not refer to a line within the FOR-NEXT loop. However, GOTOs may be used to leave a FOR-NEXT loop, or to jump from one line in the loop to another within the same loop.

Examples: To have an IF-THEN statement execute more than one statement if it is true:

  % IF X < 15 THEN FOR
  %   LET Y = X * 4
  %   LET Z = X / 2
  % NEXT

420 Appendix IV: BASIC Programming Language

GOTO Statement

Purpose: Jumps to a specified numbered line in the BASIC program.

Syntax: The form for the statement is

  GOTO <line number>

where <line number> is a line number within the BASIC program.

Remarks: This is often used with an IF-THEN statement to allow certain statements to be executed only if a condition is met. If line numbers are used in a BASIC program, all lines of the program should have a line number. Line numbers must be positive integers less than 32000.

Examples:

  % 10 GOTO 20
  % 20 STOP

  % 10 IF X THEN GOTO 40
  % 20 LET Z = X / 2
  % 30 GOTO 50
  % 40 LET Z = 0
  % 50 STOP

IF-THEN Statement

Purpose: Evaluates a condition and, if it is true, executes the statement following the THEN.

Syntax:

  IF condition THEN statement

An IF-THEN may be combined with an ELSE statement in two ways. First, the ELSE may be simply used to provide an alternative
397. s not possible to develop a good model to separate Original and Copy data this means that there is little structure in the Original data and there are no distinctive patterns of interest This approach to unsupervised learning represents an important advance in clustering technology because a variable selection is not necessary and different clusters may be defined on different groups of variables b preprocessing or rescaling of the data is unnecessary as these clustering methods are not influenced by how the data are scaled c the missing values present no challenges because the methods automatically manage missing data d the CART based clustering gives easy control over the number of clusters and helps select the optimal number The Force Split tab The Model Setup Force Split tab is new in CART 6 0 This setup tab allows you to dictate the splitter to be used in the root node primary splitter or in either of the two child nodes of the root Users wanting to impose some modest structure on a tree frequently desire this control More specific controls also allow the user to specify the split values for both continuous and categorical variables if you prefer to do so Specifying the Root Node Splitter For this example we once again will be using the GYMTUTOR CSV data file To specify the target variable use the 7 to scroll down the variable list until SEGMENT is visible Put a checkmark inside the checkbox located in the T
398. s tree makes use of only one predictor and is actually quite predictive with a relative error rate of 573 and a test sample ROC value of 7132 This is unusually good for a single predictor and is far from typical To take a closer look move your mouse over the root and right click to reveal this menu 53 CART BASICS Node Report Display Tree Rules Compare Children Compare Learn Test Tag Node Select Compare Children to get the following display 4 Children of Node 1 B N_INQUIRIES lt 1 500 N_INQUIRIES gt 1 500 Terminal Terminal Node 1 Class 0 Class 1 Class Cases Class Cases 0 298 88 2 0 163 50 0 3 40 118 i 163 50 0 W 338 000 W 326 000 N 338 N 326 We see that having had more than one recent inquiry about a borrower at the credit bureau is a powerful indicator of default risk Recall that the default rate in these data is 30 6 overall whereas it is only 11 8 among those with one or no recent inquiries and 50 for those with two or more recent inquiries You can customize the colors and details shown in this window using the ViewNode Detail menu discussed later w CART trees are grown by a procedure called binary recursive partitioning The binary indicates that when a node is split it is divided into two and only two child nodes This is a distinctive characteristic of the CART tree and is one source of the power of the CART technology rN CART ea
399. …sets.

Fraction of Cases Selected at Random for Testing

Use this option to let CART automatically separate a specified percentage of data for test purposes. Because no optimal fraction is best for all situations, you will want to experiment. In the original CART monograph, the authors suggested a 2/3 : 1/3 train-test split, which would have you set the test fraction to .33. In later work, Jerome Friedman suggested using a value of .20. In our work with large datasets we favor a value of .50, and in some cases we even use .70 when we want to quickly extract a modest-sized training sample. So our advice is: don't be reluctant to try different values. In the command language, this value is set with a statement like:

  ERROR P=.20

The advantage of using ERROR P=.50 is that the train and test samples are almost identical in size, facilitating certain performance comparisons in individual nodes. Setting ERROR P=.80, for example, is a fast way to pull a relatively small extract from a large database. Just be sure to check the size of the sample that is selected for training; if it is too small, you cannot expect reliable results.

This mechanism does not provide you with a way of tagging the records used for testing. If you need to know which records were set aside for testing, you should create a flag marking them for test and then use the SEPVAR method for testing (see below).

Three-way Random Train/Test/Validation
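If you build such a test flag yourself, a reproducible random draw is the usual approach. A sketch in Python (the function name and seed are our own choices, not CART syntax):

```python
import random

def make_test_flag(n_records, test_fraction, seed=12345):
    """Return a 0/1 flag per record: 1 = held out for testing.
    A fixed seed makes the partition reproducible across runs."""
    rng = random.Random(seed)
    return [1 if rng.random() < test_fraction else 0 for _ in range(n_records)]

flags = make_test_flag(10000, 0.20)
print(sum(flags) / len(flags))   # close to 0.20
```

The resulting flag variable can then be saved alongside the data and used with the SEPVAR testing method, so the same records are held out every time.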
400. signment number of cases in the node percentage of the data in the node and misclassification cost is also displayed for the learn data and if you use a test sample for the test data In our example terminal Node 1 is a pure node containing only Class 3 cases and consequently has an associated misclassification cost of zero 4 Navigator 1 7 Terminal Node 2 Classification L Rules Terminal Node 2 Leaming Data 3 Number of cases Percentage of data 24 2 Cost 0 Test Data Cases Pct 257 Chapter 11 CART Segmentation Saving the Grove File To save the Navigator aka Grove File so that you can subsequently reopen the file for further exploration in a later CART session select Save Grove from the File gt Save menu or press the Grove button in the Navigator window In the Save As dialog window click on the File Name text box to change the default file name in this case the data set name GYMTUTOR The file extension is by default grv Specify the directory in which the Grove file should be saved and then click on Save Save in Examples 5 ck E M GYMTUTOR grv Hj hotspot grv keep grv H mvi orv tte gry File name Save as type Grove grv amp Ifthe trees you are building are large e g over 100 terminal nodes Windows system resources can quickly be depleted To avoid memory problems be sure to close or save any open Naviga
401. sily creates the equivalent of multi way splits by using a variable more than once We show an example below Close the Children of Node 1 window and use the right arrow key to move all the way to the other extreme the largest tree grown in this run From the relative error profile and the model statistics you can see that this tree has 62 nodes its relative error is 676 and the test ROC is 6581 54 CART BASICS id Navigator 2 Classification tree topology for TARGET Color code using Tat TARGET fi zi Better TT worse E Next Prune Model Statistics Predictors Min Node Cases Best ROC Nodes ROC Train ROC Test Number of Nodes Displays and Reports Save Model Leam Splitters Tree Details J Summary Reports Commands Grove Translate Score This largest tree is quite a bit worse than the simple two node tree indicating that the large tree is seriously overfit While the largest tree is almost always overfit it is not necessarily worse than the smallest tree In some cases the largest tree is also quite accurate though in this example it is not The largest tree is actually the starting point for CART analysis CART first splits the root node then splits the resulting children then splits the grandchildren and so on The CART tree does not stop until it literally runs out of data This is in contrast to other decision trees that use a stopping
402. sion 179 rich text format rtf 290 robust trees 186 ROC graph labels 127 ROC curves 18 213 root node splitter specify 267 root splits 154 rules 72 156 158 255 All button 159 Tagged button 159 classic 159 SQL 159 viewing 158 Rules tab 156 running CART 26 permissions 26 S sample data SPAMBASE CSV 186 sample size 223 224 learn 113 sub sample 114 test 113 Save As 284 SAVE command 180 395 saving command log 78 261 committee of experts 166 grove 257 grove file 171 model specifications 140 navigators 74 140 171 257 output 76 258 reports 290 text output 283 tree topology 74 140 257 SCORE command 170 180 393 scoring classification 176 command line 180 data 77 259 Gains tab 177 GUI output 176 178 ID variables 174 output data 175 Prediction Success tab 177 proxy target variable 174 regression 178 Response Statistics tab 176 179 saving predictions 175 saving result to a file 173 Score Data dialog 173 selecting data file 173 Index selecting grove file 173 sub trees 174 target variable 174 tree sequence 174 weight variable 174 scoring models 170 172 using grove file 172 using navigator file 172 SEED command 132 396 select cases 100 Select Cases tab 146 232 Select Columns 196 Select Columns to Display Direction 190 Fuzzy Match 190 Hide Agreed 190 Rank 190 SELECT command 101 170 397 Select Default Directory dialog
403. sly specified directories V Most Recently Used file list 18 Control that allows the user to specify how many recently used files to remember in the File Open menu The maximum allowed is 20 files Working with Navigators The basics of working with navigators are described in detail in Chapter 3 CART BASICS in the section titled Tree Navigator If you have not already read Chapter 3 CART BASICS we encourage you to do so It contains important and pertinent information on the use of CART result menus and dialogs In the next section of this chapter we complete our exposition of the Navigator by explaining the remaining functions Viewing Auxiliary Variables Information Earlier in Chapter 3 CART BASICS we set up a model based on the GOODBAD CSV data file Here we set up a new but similar modeling run using GYMTUTOR CSV with the following variable and tree type designations 135 Chapter 4 Classification Trees C Program Files Salford Data Mining CART Pro EX 6 0 Examples GYMTUTOR CSV File Name GYMTUTOR CS Location C Program Files Salford Data Mining CART Pro EX 6 0 Examples Modified Wednesday May 15 2002 10 09 34 AM Variables ANYRAQT Data Records Variables Character Numeric ji SEGMENT Sort File Order v Activity Close Target Variable SEGMENT Predictor Variables TANNING ANYPOOL HOME CLASSES Categorical Variables SEGMENT HOME NFAMMEM Auxiliary
404. …smaller commands.

322 Appendix II: Errors and Warnings

Error 10014: NOT ENOUGH MEMORY TO DISPLAY A MISCLASSIFICATION MATRIX. The prediction success tables cannot be displayed because you have too many distinct classes in your target.

Error 10015: YOU HAVE NOT SPECIFIED A TREE FILE YET. Check for the presence of the GROVE command in your scoring runs.

Error 10017: Unable to locate or open your GROVE file. Check the GROVE command. Make sure the grove file is not held by another application.

Error 10018: THE ABOVE VARIABLE IS PART OF THE TREE AND MUST BE PRESENT ON THE CASE-BY-CASE DATA SET. The file you are trying to score does not have one of the variables that were part of the model. To enforce the scoring anyway, you must complete your file with all missing model variables, with values set to missing.

Error 10019: Your grove file does not contain any CART trees. You are probably trying to use a grove file generated by TreeNet or MARS. Check your GROVE command.

Error 10021: PRIORS SUM TO ZERO OR A NEGATIVE NUMBER. Check the PRIORS SPECIFY command; the priors cannot be negative numbers or all zeroes.

Error 10023: Unable to proceed with model estimation. CART has encountered a situation that prevents further modeling. Check your run settings and your data.

Error 10024: The CASE command has been replaced by SCORE. Replace CASE with SCORE in your command file.

Error 10025: No learn sample variance in target
405. space usage 285 438 Index X XYPLOT command 404 Z zero case weight 90 Zoom 153 Zoom in 57 241 Zoom out 57 241 z threshold 191 z value 190
406. specified with the WEIGHT command which is issued after the USE command and before the BUILD command See the following command line syntax WEIGHT lt wgtvar gt 91 Chapter 4 Classification Trees Auxiliary Variables Auxiliary variables are variables that are tracked throughout the CART tree but are not necessarily used as predictors By marking a variable as Auxiliary you indicate that you want to be able to retrieve basic summary statistics for such variables in any node in the CART tree In our modeling run based on the HOSLEM CSV data we mark AGE SMOKE and BWT as auxiliary Model Setup Advanced or F Penalty Battery Model Categorical Force Split Constraints Testing Select Cases Best Tree Variable Selection 7 Tree Type Variable Name Target Predictor Categorical Weight Aux Classification CE resin E Unsupervised iv Iv r E FIFEN HErEr r D r a 5 Target Variable FIV v rT ia Low Iv BWI CEINE wove Select Select Baris ve Order El Predictors 7 Cat Number of Predictors 8 Cancel Continue Start Later in this chapter in the section titled Viewing Auxiliary Variable Information we discuss how to view auxiliary variable distributions on a node by node basis Command line users will use the following command syntax to specify auxiliary variables AUXILIARY lt auxvarl gt lt auxvar2 gt etc AUXILIARY AGE SMOKE BWT Setting Focus Clas
407. ss 1 ROC Test 0 7867 1 Leam canes Lit Cum Lit ROC Total cases 203 Percent of sample 30 57 Reading the gains curve is straightforward Consider the data sorted in order from most likely to be BAD to least likely If we were to look only at the top 10 of the data most likely to be BAD what fraction of all the BADs would we capture Looking at the graph it appears that we would capture about 23 of all BADs The ratio 23 10 or 2 3 is known as the lift among market researchers and relative risk in the biomedical world Clearly the larger the lift the better because it indicates more precise discrimination 3 Click on Show Perfect Model to provide a reference to compare against The perfect model would isolate all the BAD cases into their own nodes Our example has been run using the self testing cross validation method Cross validation is a clever technique for testing models without formally dividing the data into two separate learn and test portions However if in your own analyses you use a test sample buttons for selecting results based on the Learn Test or Both samples will appear in the lower portion of the Gains Chart dialog To view gains charts for the test sample click Test and to view gains charts for learn and test combined click Both vx When you use cross validation CV for testing you will obtain reliable estimates of the overall classification accuracy of the tree and a tes
408. st Ret Error Battery Types Classification Battery Models DEPTH Model Opt Terminal Rel Error Max Depth Name i i Tree 1 0 2975 Tree 2 0 2691 Tree 3 0 2067 Tree 4 5 Tree 6 0 1910 Tree 0 1910 Show Min Error Tree 8 Z Hldad H1d3d t H1d3d S H1d3d 9 Hldad 2 H1d3q 8 Hldad 6 Hidaq Ob Hld3A Model Quality Sample Model Size Save Grove Misclass Roc Test Leam Min Cost 15E Clearly beginning at the depth of 6 the relative error becomes quite flat exer Em 2 2 CART 6 0 CART 6 0 Pro Pro EX Battery DRAW Battery DRAW runs a series of models where the learn sample is repeatedly drawn without replacement from the main learn sample as specified by the Testing tab The test sample is not altered This battery is useful for determining the impact of varying random learn sample selection on ultimate model performance This is similar in spirit to the battery CVR described earlier 211 Chapter 10 CART Batteries We illustrate this battery on the SPAMBASE CSV dataset partitioned into 70 learn and 30 test with twenty 50 drawings from the learn partition see DRAW CMD command file 4 Battery Summary 1 Models Contents Accuracy Eror Profiles Var Imp Averaging Charts Rel Error Nodes View Zoomed zi Chart Type Rel Error DSMP_ 9 0 217 DSMP _ 6 0 157 Battery Types Classification Battery Models DER
409. t In the Print dialog box illustrated below you can select the pages that will be printed and the number of copies as well as specify printer properties You can also preview the page layout CART will automatically shift the positions of the nodes so they are not split by page breaks Printer Copies Name HP DeskJet 820Cse Properties Number of copies Status Ready a Type HP DeskJet 820Cse Where LPT1 Print to file Comment lt Fit to two pages if possible Select Pages C All pages Non blank only Selected Page Setup Cancel You can see from the preview that a small section of the GYMTUTOR main tree spills over to a second page To resize the tree to fit on one page click on the Page Setup 244 Chapter 11 CART Segmentation The current layout is depicted in the tree preview window of the Page Setup dialog shown below As you change the settings the print preview image changes accordingly To change which page is previewed use the left and right arrows just below the sample page image Page Setup Paper Sie Source Automatically Select X X Node Gaps Orientation Horz 9 10 Portrait Vert 0 10 C Landscape Tree Scale Border Sealing 100 2 Width Thin x Row 1 Cok 1 a gt Header Footer Jar Page amp V amp H Eel Node Shape Margins inches Left 0 5 Righ 0 5 Node Hexagon Si bd Tem
410. t BUPTIONS SURROGATES 5 PRINT 5 COMPETITORS 5 CPRINT 5 TREELIST 10 gt BRIEF gt l C Program Files Salford Data Mining CAR HOSLEM CSV Data Description Activity Window This new window can function as a brief description of your data file and a control panel for other data exploration and analysis activities C Program Files Salford Data Mining CART Pro EX 6 0 Examples HOSLEM CSV File Name Location Modified HOSLEM CSV C Program Files Salford Data Mining CART Pro EX 6 0 Examples Friday December 10 1999 1 47 52 PM Variables Data Records Variables Character Numeric Sort File Order X Activity Stats View Data Score From this screen you can conveniently request summary statistics a spreadsheet view of the data or the model set up dialog and you can also move directly to scoring the data using a previously saved model 129 Chapter 4 Classification Trees Once you close this window it can be reopened by clicking on the toolbar icon hammer and wrench icon Model Setup This is the window that came up automatically in CART 4 0 and CART 5 0 and you can also put CART 6 0 into this mode Model Setup Advanced Penalty Battery Model Categorical Force Split Constraints Testing Select Cases Best Tree it Method Variable Selection Tree Type Variable Name Target Predictor Categorical Weight Aux te Classification Lr
…a few options must be set to build the desired model. Suppose that a series of runs is to be accomplished with little variation between each. A batch command file containing the commands that define the basic model and options provides an easy way to perform many CART command functions in one user step. For each run in the series, the core batch command file can be submitted to CART, followed by the few graphical-user-interface selections necessary for the particular run in question.

Creating an Audit Trail

The Command Log window can help you create an audit trail when one is needed. Imagine not being able to reproduce a particular analysis track, perhaps because the specific set of options used to create a model (e.g., the name of the data set itself) was never recorded. The updated command log provides you with the entire command set necessary to exactly reproduce your analysis, provided the input data do not change.

Chapter 13: Working with Command Language

Taking Advantage of CART's Built-In Programming Language

CART offers an integrated BASIC programming language that allows the user to define new variables, modify existing variables, access mathematical, statistical, and probability distribution functions, and define flexible criteria to control case deletion and the partitioning of data into learn and test samples. BASIC commands are implemented through the command interface, either interactively or via batch command…
…test-based measure of the area under the ROC curve. The CV method does not produce a test-based version of the actual Gains or ROC curve.

Note: Because we have used CV for testing in our example, we will see test results on only some of the summary tabs.

CART BASICS

The grid displayed in the right panel contains various counts and ratios corresponding to each node of the tree, along with the quantities used to plot the gains curve. Remember that the nodes have always been sorted for the focus class using learn-data results. The table displays the following information for each terminal node (scroll the grid to view the last two columns):

NODE — node number
CASES TGT CLASS — number of cases in the node belonging to the focus class
% OF NODE TGT CLASS — percent of cases in the node that are focus class
% CLASS TGT CLASS — percent of all focus-class cases present in the node
CUM % TGT CLASS — cumulative percent of the focus class
CUM % POP — cumulative percent of all data
% POP — percent of all data in the node
CASES IN NODE — number of cases in the node
CUM GAINS — cumulative % focus class / cumulative % population
LIFT INDEX — (% focus class in node) / (% focus class in population)

The Gains Table can be exported to Excel with a right mouse click, choosing Export from the pop-up menu. You can print individual Gains Charts, as well as overlay and print Gains Charts for trees of different sizes and from different CART analyses (see Chapter 4). You can also add Gains Charts and Tables to the CART report (see Chapter 12).

Terminal…
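The gains-table arithmetic — cumulative percentages, CUM GAINS, and LIFT INDEX — can be reproduced from per-node counts. A minimal sketch with hypothetical counts (nodes assumed already sorted by focus-class richness, as CART does):

```python
# Each node contributes (cases_in_node, focus_class_cases).
nodes = [(50, 45), (100, 60), (150, 30), (200, 15)]  # hypothetical

total_pop = sum(n for n, _ in nodes)
total_focus = sum(f for _, f in nodes)

cum_pop = cum_focus = 0.0
for cases, focus in nodes:
    cum_pop += cases
    cum_focus += focus
    pct_pop = 100.0 * cum_pop / total_pop          # CUM % POP
    pct_focus = 100.0 * cum_focus / total_focus    # CUM % TGT CLASS
    cum_gains = pct_focus / pct_pop                # CUM GAINS
    node_lift = (focus / cases) / (total_focus / total_pop)  # LIFT INDEX
    print(f"{pct_focus:6.1f} {pct_pop:6.1f} {cum_gains:5.2f} {node_lift:5.2f}")
```

With these counts the first node covers 10% of the population but 30% of the focus class, giving a cumulative gain (and node lift) of 3.0.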
…if it exists. Otherwise it would next try to open SOMEDATA.SYD and, if that fails, continue down the list of extensions until either a file with the expected name is found or the list of extensions is exhausted.

The command syntax is:

USE <file>

Examples:

USE MYDATA              (reads from MYDATA.SYS)
USE MONTHLY SURVEY.SYS

Appendix III: Command Reference

WEIGHT

Purpose: The WEIGHT command identifies a case-weighting variable. The command syntax is:

WEIGHT <variable>

in which <variable> is a variable present in the USE dataset. The WEIGHT variable must be numeric, containing any non-negative real values; no character variables.

XYPLOT

Purpose: The XYPLOT command produces 2-D scatter plots, plotting one or more y-variables against an x-variable in separate graphs. The command syntax is:

XYPLOT <yvar1> <yvar2> <yvar3> <xvar> FULL TICKS GRID WEIGHTED BIG

The plot is normally a half screen high; the FULL and BIG options will increase it to a full screen (24 lines) or a full page (60 lines). TICKS and GRID add two kinds of horizontal and vertical gridding. WEIGHTED requests plots weighted by the WEIGHT-command variable. NORMALIZED scales the vertical axis to 0-to-1 or -1-to-1.

Examples:

XYPLOT IQ AGE FULL GRID
XYPLOT LEVEL 4 7 INCOME NORMALIZED
XYPLOT AGE WAGE INDIC DEPVAR 2 WEIGHTED

Only num…
414. t recently used 29 132 MVI 15 114 216 N NAMES command 380 Navigator window 48 149 158 235 navigators 14 61 170 171 244 opening 140 saving 140 negative case weight 90 NEW command 381 no independent testing 96 node assignment 175 Node Detail 57 241 242 Node Display 238 node frequency distributions 71 253 node report 158 Node Report window 68 154 251 Node Reports Box Plots tab 156 Classification tab 71 253 Competitors and Surrogates tab 68 155 251 Rules tab 72 156 255 Splitter tab 73 157 254 node size 275 node split 238 node statistics 158 nodes 287 comparing children 139 comparing learn test 139 maximum number 113 287 parent node minimum cases 111 richnes 247 terminal node minimum size 111 Index NODES command 218 node specific median 156 non linearities 219 224 NOTE command 382 notepad 300 number of surrogates 103 number of variables 32 numeric operators 409 0 observation number 409 Open Data File 34 35 Open File icon 43 230 Open gt Data File File menu 34 Open 43 230 opening file 43 230 navigators 140 operators logical 409 numeric 409 relational 409 opt 128 optimal models 186 optimal tree 102 options 125 advanced 111 classic output 127 command notation 127 default display window 127 Directories tab 28 132 Random Number tab 132 Report Writer 289 Reporting tab 129 ROC graph labels 127 tex
415. t reports 125 OPTIONS command 383 Options dialog 125 ordered twoing 13 106 outliers 156 output classic text 75 257 specifying filename 283 OUTPUT command 384 output files default directory 28 133 Output window 176 231 234 283 288 overfit 165 P page layout 60 243 page layout preview 141 page setup 141 Page Setup dialog 244 pair wise correlations 224 434 Index parent node 253 paste 290 path indicators 175 path references 29 Pearson correlations 218 penalty 121 147 high level categorical 124 missing values 123 216 variable specific 122 PENALTY command 125 386 Penalty tab 85 147 148 233 predicted probabilities 78 175 260 predicted response 175 predicting 170 180 prediction success table 61 67 245 250 predictor groups 277 predictor variables 45 85 87 234 categorical 47 234 categorical vs continuous 89 preparing data 28 primary split 65 248 primary splitters 275 PRINT command 389 printing gains chart 143 main tree 60 243 page layout preview 141 page setup 141 244 preview window 141 reports 290 text output 283 284 tree 141 tree rules 183 prior probabilities 219 priors 219 DATA 119 EQUAL 119 LEARN 119 MIX 119 SPECIFY 119 specifying 118 TEST 119 PRIORS command 120 219 388 Priors tab 85 233 probability trees 19 105 Profit tab Ave Profit button 152 Cum Ave Profit button 153 Cum Profit button 153 Pr
…these will appear only in the navigator window, not in the text reports. Activate a navigator window, pull down the View menu, and select the Assign Class Names menu item.

High-Level Categorical Predictors

We take great pride in noting that CART is capable of handling categorical predictors with thousands of levels (given sufficient RAM workspace). However, using such predictors in their raw form is generally not a good idea. Rather, it is usually advisable to reduce the number of levels by grouping or aggregating levels, as this will likely yield more reliable predictive models. It is also advisable to impose the HLC penalty on such variables from the Model Setup Penalty tab. These topics are discussed at greater length later in the manual. In this section we discuss the simple mechanics for handling any HLC predictors you have decided to use.

For the binary target, high-level categorical predictors pose no special computational problem, as exact shortcut solutions are available and the processing time is minimal no matter how many levels there are.

For the multi-class target variable (more than two classes), we know of no similar exact shortcut methods, although research has led to substantial acceleration. HLCs present a computational challenge because of the sheer number of possible ways to split the data in a node. The number of distinct splits that can be generated using a categorical predictor with K levels is 2^(K-1) - 1. If K = 4, f…
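The split-count formula (2^(K-1) - 1 distinct binary splits for a K-level categorical predictor) can be checked with a short script; the helper name here is our own, not part of CART:

```python
def n_binary_splits(k: int) -> int:
    """Number of distinct binary splits of a categorical
    predictor with k levels: 2**(k-1) - 1."""
    return 2 ** (k - 1) - 1

# Counts grow exponentially, which is why high-level
# categoricals are expensive for multi-class targets.
for k in (2, 4, 10, 21):
    print(k, n_binary_splits(k))
```

For K = 4 there are 7 possible splits, but by K = 21 the count already exceeds a million.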
…table.

GAINS — toggles the printing of gains charts in CART for classification models. Binary models always show these charts.
ROC — toggles the printing of ROC charts in CART for classification models. Binary models always show these charts.
PS — toggles printing of the pruning sequence when a tree is built.
PLOTS — toggles summary plots and allows a user-specified plotting symbol.
DBMSCOPY — toggles support for the DBMS/COPY data access engine (deprecated).
STATTRAN — toggles support for the Stat Transfer data access engine.

To turn an option ON, the YES portion is not needed.

Examples:

LOPTIONS MEANS          (turn MEANS printing on)
LOPTIONS MEANS=NO       (turn MEANS printing off)

MEMO

Purpose: The MEMO command defines a text memo that is saved with the model. A memo is cumulative until an analysis is performed, after which the memo is reset. Enclosing the content of a memo in quotes is not necessary; however, case is preserved, and certain punctuation marks (e.g., apostrophes) are better handled if the text is quoted.

Examples:

A two-line memo in which the first line has case preserved by using quotes and the second does not:

MEMO "This is my memo, line one"
MEMO a second line will display entirely in uppercase

A memo composed of a group of lines ending with the END tag, which will add three lines to any existing memo:

MEMO This model focuses on IRR and income variables, in Sept 03
A…
…compute time. The default is 12.

NOTE: For BINARY classification trees, special algorithms are used that allow exhaustive split searches for high-level categoricals with essentially no compute-time penalty.

Sets the maximum number of cases allowed in the learning sample before cross-validation is disallowed and a test sample required. The default is 3000.

Defines a string that may be used to mark page breaks for later processing of CART text output. The page-break string may be up to 96 characters long and will be inserted before the tree sequence, the terminal node report, learn/test tables, variable importance, and the final options listing. Page breaks are also inserted in the node detail output according to the NODEBREAK options (see below). If the page-break string is blank, no page breaks are inserted.

This option is only active if you have defined a nonblank page-break string with the PAGEBREAK option. NODEBREAK allows you to specify how often the node detail report is broken by page breaks. The options are ALL, EVEN, ODD, NONE, or you may specify a number such as 3 or 10. The default is ODD, breaking prior to nodes 3, 5, etc. Even if you request NONE, there will still be a page break prior to the node detail title.

COPIOUS reports detailed node information for all maximal trees grown in cross-validation. The default is BRIEF.

Provides a report of advanced control parameters at the end of tree building.
…integrated BASIC does not permit the merging or appending of multiple files, nor does it allow processing across observations. In Salford Systems' statistical analysis packages, the programming workspace for BASIC is limited and is intended for on-the-fly data modifications of 20 to 40 lines of code (though custom large-workspace versions will accommodate larger BASIC programs). For more complex or extensive data manipulation, we recommend you use the large workspace for BASIC in DATA (available from Salford Systems) or your preferred database management software.

The remaining BASIC help topics describe what you can do with BASIC and provide simple examples to get you started. The BASIC help topics provide formal technical definitions of the syntax.

Getting Started with the BASIC Programming Language

Your BASIC program will normally consist of a series of statements that all begin with a % sign (the % sign can be omitted inside of a DATA block). These statements could comprise simple assignment statements that define new variables, conditional statements that delete selected cases, iterative loops that repeatedly execute a block of statements, and complex programs with the flow control provided by GOTO statements and line numbers. Thus, somewhere before a HOT command (such as ESTIMATE or RUN in a Salford module), you might type:

% LET BESTMAN = WINNER
% IF MONTH = 8 THEN LET GAMES = … ELSE IF MONTH > 8 THEN LET GAMES = …
% LET ABODE = LOG(CABIN)
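The on-the-fly derivations sketched in the BASIC statements above have a direct analogue in any general-purpose language. In this Python sketch the variable names come from the manual's example, but the values assigned to GAMES are our own placeholders, since the original values are elided in the source:

```python
import math

def derive(row):
    """Python analogue of the BASIC derivations shown above."""
    out = dict(row)
    out["BESTMAN"] = row["WINNER"]
    if row["MONTH"] == 8:
        out["GAMES"] = 0          # assumed value; elided in the source
    elif row["MONTH"] > 8:
        out["GAMES"] = 1          # assumed value; elided in the source
    out["ABODE"] = math.log(row["CABIN"])
    return out

print(derive({"WINNER": "A", "MONTH": 9, "CABIN": math.e}))
```

Like the BASIC workspace, this operates one observation (row) at a time, with no merging across files.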
[Figure: Constraints tab listing variables such as SMALLBUS, FIT, HOME, PERSTRN, and CLASSES, with "Split Disallowed At Or Below Depth" and surrogate-splitter settings, plus the Save Grove, CART, Combine, Score, Cancel, Continue, and Start buttons.]

Chapter 12: Features and Options

Let's run an exploratory tree with the above constraints and view the splitters. As you can see below, the defined constraints for both groups were implemented: none of the group 1 variables appear below a depth of three (D3), and none of the group 2 variables appear above a depth of four (D4).

[Figure: Navigator Main Tree split-variables display illustrating the depth-constrained splitters.]

Learn Sample Size

CART also allows the user to constrain a tree according to the size of the learn sample in the nodes. Instead of using depth to control where a splitter can be used, we disallow splits based on the size of the learn sample in the node. The Min Cases and Max Cases columns are used to enter positive values in the cells:

Min Cases — the variable will not be used if the node has more than the specified number of records.
Max Cases — the variable will not be used if the node has fewer than the specified number of records.

In the following example, we constrain ANYRAQT from being used as a splitter unless there are fewer than 200 learn-sample observations in a node.
…the Splitters button at the bottom of the Navigator. This display is often useful to quickly identify where in the tree, if at all, a certain variable of interest showed up as the main splitter.

[Figure: Navigator Main Tree split-variables display.]

Viewing the Main Tree

In addition to the thumbnail sketch of the individual nodes (the tree topology, or splitters, window), you can view a complete picture of the tree in a format similar to the way the tree will print. To view the entire tree, click on the Tree Details button at the bottom of the Navigator, or right-click on the root node and select Display Tree.

[Figure: Main Tree detail for the GYMTUTOR model. The root node (293 cases) splits on ANYRAQT; subsequent splits include FIT <= 3.46, ANYPOOL, TANNING in {2,3,4,5,6}, SMALLBUS, and ONAER <= 2.50, with a class frequency table shown in each internal and terminal node.]
…the learn data, which define the sort order of the terminal nodes in the gains table.

[Figure: Gains Chart window plotting cumulative % target class against % population, with the gains table, Gains/Lift/Cumul./ROC controls, and an ROC integral of 0.9994 for the learn sample.]

Prediction Success Tab

The Prediction Success tab of the Score dialog displays the prediction success table that cross-classifies the actual by the predicted class (see also the text output for the actual by the predicted node). To view row or column percentages instead of counts, click on Row or Column.

Chapter 7: Scoring and Translating

[Figure: Prediction Success tab for the three-class GYMTUTOR model, cross-classifying actual by predicted class: 293 total cases, average percent correct 96.62, overall percent correct 96.59.]

Case Output for Regression Trees

To illustrate how to use Score to predict continuous target variables, we will work with the BOSTON.CSV dataset. First, build the default model using MV as the target and save the resulting navigator as boston.nv3. Next…
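The prediction success table is a confusion matrix, and the Row/Column toggles simply re-express its counts as percentages. A small sketch with hypothetical counts for a three-class problem:

```python
# counts[(actual, predicted)] for a hypothetical 3-class model
counts = {
    (1, 1): 94, (1, 2): 1,  (1, 3): 0,
    (2, 1): 2,  (2, 2): 96, (2, 3): 2,
    (3, 1): 0,  (3, 2): 3,  (3, 3): 95,
}
classes = (1, 2, 3)

def row_percent(actual, predicted):
    """Percent of the actual class falling in each predicted cell
    (what the 'Row' button displays)."""
    row_total = sum(counts[(actual, p)] for p in classes)
    return 100.0 * counts[(actual, predicted)] / row_total

overall = 100.0 * sum(counts[(c, c)] for c in classes) / sum(counts.values())
print(f"class 1 correct: {row_percent(1, 1):.2f}%  overall: {overall:.2f}%")
```

The "Column" view divides each cell by its predicted-class total instead, answering "of the cases predicted as class k, how many really were?"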
423. the Node Size mode wv The tree picture can be made smaller or larger by pressing the corresponding buttons in the left upper corner of the navigator window w As with classification trees to change the level of detail you see when hovering over nodes right click on the background of the Navigator window and select your preferred display from the local pop up menu 151 Chapter 5 Regression Trees Copy Add to Report VARIABLE VARIABLE lt 46 VARIABLE lt 46 Qi 42 1 Median 35 6 03 93 1 N 1040 VARIABLE lt 45 hin 14 Qi 42 1 hean 84 1 Median 85 6 03 98 1 htax 99 4 vx The Learn and Test group of buttons controls whether Learn or Test data partitions are used to display the node details on the hover displays or all related Tree Details windows Color Coding The terminal nodes can be color coded by either target mean or median Make your selection in the Color Code Using selection box Viewing Tree Splitters and Details The Splitters button and the Tree Details buttons work similarly to the classification case described previously see Chapter 3 CART BASICS The only difference is that node information now displays target means and variances instead of frequency tables and class assignments The Tree Details display can be configured using the View Node Detail menu Regression Tree Summary Reports The overall performance of the current tree is summarized in the four Su
…the manual for a complete list of batteries offered. CART Pro EX includes a larger set of batteries, including new methods for refining the list of predictors (KEEP list) and assuring greater model stability. These batteries can run hundreds or even thousands of models to help you find a model of suitable performance and complexity (or simplicity). (CART 6.0 Pro, Pro EX)

Model Refinement from the Variable Importance List

Once a model is built, you can easily refine it by managing the variable importance list. Just highlight the variables you want to keep for the next model and click the Build New Model button. CART Pro EX provides a higher degree of automation for predictor-list refinement and feature extraction, and offers an automated pre-modeling predictor discovery stage. This can be very effective when you are faced with a large number of candidate predictors. In our extensive experiments we have established that automatic predictor discovery frequently improves CART model performance on independent holdout (validation) data.

Introducing CART 6.0

New Linear Combination Controls

In classic CART, linear combination splits are searched for over all numeric predictors. If an LC splitter is found, it is expressed in a form like:

If 2.2345*X1 + .01938*X2 + .98548*X3 <= 1.986, then a case goes left.

Such splitters are difficult to interpret and tend to be used only when interpretability can be sa…
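A linear-combination split of the kind CART reports is just a weighted sum compared against a threshold. The sketch below uses coefficients in the style of the manual's example, but treat them as purely illustrative (the operators in the printed rule are reconstructed, not verbatim):

```python
# Hypothetical LC split: a case goes left when the weighted
# sum of its predictors falls at or below the threshold.
COEFS = {"X1": 2.2345, "X2": 0.01938, "X3": 0.98548}
THRESHOLD = 1.986

def goes_left(case):
    score = sum(w * case[name] for name, w in COEFS.items())
    return score <= THRESHOLD

print(goes_left({"X1": 0.5, "X2": 10.0, "X3": 0.2}))  # True
```

The interpretability problem is visible even in this tiny sketch: the rule mixes three predictors on incommensurate scales, so no single variable "explains" the split.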
425. the right panel Click on the check boxes to turn each option on and off and then click Apply to update the Main Tree display To save your preferred display options as the default settings click the Set Defaults button 58 CART BASICS The internal and terminal node detail can be specified separately as each is given its own tab Press the Copy to Terminal Nodes or Copy to Internal Nodes buttons if you wish the current setup to be copied into the other tab The Set Defaults button only sets the defaults for the currently active tab If you want to set defaults for both terminal and internal nodes press this button twice once for each tab Viewing Sub trees Sometimes the tree you want to examine closely is too large to display comfortably on a single screen and looking at a sub tree is more convenient Sometimes you will want to look at two separated parts of the tree side by side To view sub trees first go back to the navigator you can close the tree details window or select the navigator from the Window menu Next right click on an internal node and select Display Tree Below we have done this twice once for the right child of the root and again for the left child bringing up two sub tree displays Below we display the two windows side by side 4 Navigator 2 Sub Tree Node 2 lelle alle E OCCUP_BLANK lt 0 500 Node 3 Class 0 CREDIT_LIMIT lt 5548 000 N 325 CREDI
…other Model Setup dialog tabs are left unchanged, the following defaults are used:

- All remaining variables in the data set (other than the target) will be used as predictors (the Model tab).
- No weights will be applied (the Model tab).
- 10-fold cross-validation will be used for testing (the Testing tab).
- The minimum-cost tree will become the best tree (the Best Tree tab).
- Only five surrogates will be tracked, and they will all count equally in the variable importance formula (the Best Tree tab).
- The least-squares splitting criterion for regression trees will be used (the Method tab).
- No penalties will be applied (the Penalty tab).
- Parent-node requirements will be set to 10 and child-node requirements set to 1 (the Advanced tab).
- The allowed sample size will be set to the currently open data set size (the Advanced tab).
- The 3000-case limit warning for cross-validation will be activated.

With respect to the command line, CART determines which tree to grow (classification or regression) depending on whether the target appears in the CATEGORY command. A classification tree is built for categorical targets and a regression tree for continuous targets.

To illustrate the regression tree concept, we use the following steps to start the analysis. Select File > Open > Data File to open the BOSTON.CSV dataset (506 observations). In the Model Setup dialog, check MV as the target variable and…
…thus directly enforcing different requirements in the tradeoff between node richness and class accuracy.

Spam Data Example

We illustrate searching for hot spots using the SPAMBASE.CSV dataset as an example.

First, use the Open > Command File option from the File menu to open the HOTSPOT.CMD command file. Note at the bottom of the command file that we will be running a battery of priors, with the prior on class 1 (the spam group) varying between 0.5 and 0.9 in increments of 0.02, thus producing 21 models.

Second, use the File > Submit Window menu to build the battery. The resulting Battery Summary contains information on all 21 models requested. Our goal is to scan all terminal nodes across all models and identify the nodes richest in spam.

Note: In CART 6.0 we introduced the modeling automation technique known as batteries. This feature, discussed in the following chapter, makes the process of modeling in batches as easy as a mouse click.

From the Report menu, select Gather Hotspot, which opens the HotSpot Setup dialog.

Chapter 9: Hot Spot Detection

[Figure: HotSpot Setup dialog reporting 21 harvested trees and 394 harvested nodes for variable TARGET, with Focus class and Performance Report controls.]

Note that there are 394 terminal nodes across the 21 trees in the battery. We also set the Focus class to 1 (the spam group) and request actual processi…
428. to the foreground click Tree Details on the Navigator dialog and then select Print from the File menu or use lt Ctrl P gt In the Print dialog box you can select the pages that will be printed and the number of copies as well as specify various printer properties The Print dialog also displays a preview of the page layout CART automatically shifts the positions of the nodes so they are not split by page breaks Printer Copies Name TEIN Properties Number of copies f 4 Status Ready o Type hp LaserJet 1300 PCL 6 Where HP on Daddy XP Comment I Print to file Fit to two pages if possible Select Pages C All pages Non blank only Selected Page Setup Cancel renin To alter the tree layout prior to printing click the Page Setup button As shown below the current layout is depicted in the tree preview window of the Page Setup dialog as you change the settings the print preview image changes accordingly You can use the left and right arrows just below the sample page image to change which page is previewed 142 Chapter 4 Classification Tre es Page Setup Paper ic a Source Automatically Select X Orientation Portrait Node Gaps Horz 0 10 Vert 10 10 Tree Scale Border Scaling 100 Width Thin X C La
429. tor windows before generating the next tree CART will advise you when you are running low on Windows resources and recommend that you close some of the Navigator windows CART Text Output Now turn to the text output displayed on the CART desktop by closing or minimizing the Node Report and Navigator windows The outline of the Report Contents for Tree 1 the only tree grown in our example is displayed in the left panel as illustrated below To view a particular section of the output click on its hyper link or use the scroll bars to browse the output 258 Chapter 11 CART Segmentation Classic Output Ctrl Alt C Report Contents Tree 1 HRAAAAAAAAAAERAA AER ER EKER ARERR ERE EERE ER ER EEE Node 1 ANYRAQT t N 293 HRRAAAARARAARARARARARARER ARERR EER ERERERERERE EEAKARAREEAAARERERE RARER EREER EERE bd Node 2 d Terminal Node bd N 211 ERRRRRRRRRRRRRRRRRRRRERERRRERRRRR ai Node 1 was split on ANYRAQT case goes left if ANYRAQT 0 Improvement 0 269294 Complexity Threshold 0 287719 Node Cases Wgt Counts Cost Class 1 293 293 00 0 667 1 2 211 211 00 0 532 2 7 82 82 00 0 000 1 We recommend that you save a copy of the text output as a record of your analysis by selecting Save CART Output from the File gt Save menu You can also copy and paste sections of the output into another application or to the clipboard The font used
430. ts e Control reporting options e Set random number seed e Specify default directories VIEW Open command log e View data e View descriptive statistics e Display next pruning e Assign class names and apply colors e View main tree and or sub tree rules e Overlay gains charts Specify level of detail displayed in tree nodes EXPLORE Generate frequency distributions MODEL Specify model setup parameters e Grow trees committee of experts e Generate predictions score data Translate models into SAS C or PMML TREE e Prune grow tree one level e View optimal minimum cost maximal tree e View tree summary reports REPORT Control CART reporting facility WINDOW Control various windows on the CART desktop HELP e Access online help 230 Chapter 11 CART Segmentation Opening a File To open the input data file GYMTUTOR CSV used in our example 1 Select Open gt Data File from the File menu or click on the toolbar icon w Note that you can set default input and output directories select Options from the Edit menu and select the Directories tab In the Open Data File dialog select the GYMTUTOR CSV file from the Sample Data folder and click on Open or double click the file name As indicated below Delimited Test csv dat txt must be selected in the Files of Type box to see files ending with the CSV extension Open Data File Look in Examples S eoston csy FQ spambase csv fncella
431. tter multiple times Terminal Node Report To view node specific information just single click the terminal node of your choice or right click and select Node Report A frequency distribution for the classes in the terminal node is displayed as a bar graph or optionally a pie chart as shown below for the left most terminal node Terminal Node 1 Summary node information class assignment number of cases in the node percentage of the data in the node and misclassification cost is also displayed for the learn data and if you use a test sample for the test data 74 CART BASICS Bi Navigator 6 10 Terminal Node 1 Saving the Navigator Grove File To save the Navigator so that you can subsequently reopen the file for further exploration in a later CART session first make sure that the navigator is your active window click anywhere on the navigator Then select Save Grove from the File gt Save menu or press the Save Grove button in the Navigator window In the Save dialog window click on the File Name text box to change the default file name in this case the data set name GOODBAD The file extension is by default grv and should not be changed Specify the directory in which the Navigator Grove file should be saved and then click on Save a Previous versions of CART saved two types of tree files navigator files with extensions like nav or nv3 and grove files CART
…the resubstitution estimate.

PROPORTION — a fraction of cases is selected at random for testing (and, optionally, validation).
SEPVAR — a named variable separates the learn, test, and validation samples. The test value is 1 for numeric SEPVAR variables and "TEST" or "test" for character SEPVAR variables. For the validation sample, the values are -1 (numeric) and "VALID", "Valid", or "valid" (character).
FILE — the test sample is contained in a separate data file. For details on naming conventions, see the reference for the USE command.

Examples:

ERROR CROSS=10                 (the default method for CART models)
ERROR PROPORTION=.25           (select 25% of cases at random for test)
ERROR FILE="SHARP"             (test cases are found in file SHARP.SYS)
ERROR PROPORTION=.3,.2         (30% testing, 20% validation scoring)
ERROR CROSS=MYBINS             (the variable MYBINS contains the CV fold assignments)

EXCLUDE

Purpose: The EXCLUDE command specifies a list of independent variables to exclude from the analysis. In other words, all variables other than the target and those listed in the EXCLUDE and WEIGHT commands will be used as predictors. The command syntax is:

EXCLUDE <varlist>

in which <varlist> is a list of variables NOT to be used in the model-building process; all other variables will be used. See the MODEL and KEEP commands for other ways to restrict the list of candidate predictor variables.

Examples:

MODEL CHOICE
EXCLUDE ID, SSN, ATTITUDE        (all numeric variable…)
433. u informed about the actions being taken and some timing information time elapsed time remaining Our example will run so fast you may not have a chance to notice everything on the progress indicator 48 CART BASICS Once the analysis is complete a new window the Navigator is opened The navigator is the key to almost all CART output reports and diagnostics so it will function as a model summary and guide to everything you may want to know about the results Experts may also redirect the classic text output and some other reports elsewhere These items are later discussed in this manual Tree Navigator The navigator packages everything you need to know about the CART tree You can save the navigator email it to others or just use it temporarily during a single CART session The navigator will offer you many views of the model and its findings will allow you to score new data and can generate formatted text reports tables charts and comparative performance summaries of competing models The rest of this chapter is devoted to discovering the charms of the navigator The initial navigator display is just a simple overview of the shape of the tree or its topology in the top panel and a predictive performance curve in the bottom panel The tree topology displayed in the top panel of the Navigator window provides an immediate snapshot of the tree s size and depth Here we have a tree with 10 terminal nodes nodes at the bottom
Evaluation Methods

Monte Carlo testing (BATTERY MCT). Randomization tests can provide useful sanity checks on model performance. With the MCT battery, CART takes the dependent variable and randomly shuffles it, exchanging the correct value of the target with the value from another randomly selected row in the data. Such shuffling should make it very difficult for CART to generate predictive trees. The extent to which trees are still predictive is a measure of the potential over-optimism in the measurement of any tree on the actual data.

[Screenshot: CART 6.0 Monte Carlo test results]

Profit display using defined auxiliary variables. Profit variables are any variables the modeler is interested in tracking in the terminal nodes. The Profit tab on the Summary window includes tabular and graphical displays of these variables, showing absolute and average node results, and cumulative results based on the ordering of the nodes as determined by the original target variable.

[Screenshot: CART 6.0 Profit tab displays]

Unsupervised Learning. We believe that Leo Breiman invented this trick, but we are not entirely sure. We start with the original data and then make a copy. The copy has each of its columns randomly shuffled to destroy its original correlation structure. CART is then used to try to recognize whether a record belongs to the original data or to the shuffled copy. The stronger the correlation structure in the original data, the better CART
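Invoking the Monte Carlo battery from the command line might look like the sketch below. This is a hedged example based only on the BATTERY MCT name mentioned above; the target and data file are hypothetical:

```text
USE loans.csv
MODEL TARGET
BATTERY MCT
BUILD
```

If the shuffled-target trees still show substantial accuracy, the apparent performance on the real data should be viewed with suspicion.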
Maximum number of nodes = 8192; Depth = 20.

We suggest, however, that you use caution when reducing these limits. The initial objective should be to reduce these values without creating a shortfall for the maximal tree. As long as the maximal tree size is less than the limitation you have set, you need not be concerned: the true optimal tree (one grown without limitations) will be grown. It is only when the imposed limits prevent completing the tree-growing process, so as to grow the maximal tree, that concern should arise. For example, if you set the Maximum number of nodes = 5000 and the tree sequence indicates the maximal tree contains 1500 nodes, you can clearly see that the maximal tree was grown without limitation. However, if you set the Maximum number of nodes = 1000 and the tree sequence indicates the maximal tree contains 985 nodes, you may suspect that the maximal tree was never attained. When this occurs, the Tree Sequence report found in the CART Report window will be followed by a message that reads "Limited tree produced, complexity values may be erroneous."

Chapter 12: Features and Options

Maximum number of nodes: forces the tree generation process to stop when a specified number of nodes (both internal plus terminal) is produced.

Depth: forces the tree generation process to stop after a specified tree depth is reached. The root node corresponds to a depth of 0.

Command line users will use the following command
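The command equivalent is cut off in this excerpt. The snippet below is only a hedged sketch of what the node- and depth-limit settings might look like in command form; the LIMIT keyword and its option names are assumptions not confirmed by this text, so consult the Command Reference for the exact syntax:

```text
LIMIT NODES = 5000, DEPTH = 20
```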
number of Class 3 cases classified in each class, where N is the total number of cases predicted as Class 3. The rest of the table represents a prediction success matrix, with rows representing true class assignment and columns representing predicted class assignment. In our example we can see that five Class 3 cases in the learn sample were misclassified as Class 2, four Class 2 cases were misclassified as Class 3, and only one Class 1 case was misclassified as Class 3. To switch to the test (cross-validated) sample prediction success table, click on Test; similarly, to view row or column percentages rather than counts, click Row or Column.

Chapter 11: CART Segmentation

[Screenshot: Learn Sample Prediction Success table: predicted-class counts for Classes 1-3 (N = 93, 104, and 96), per-class percent correct, 293 total cases, average 95.58%, overall correct 95.56%]

Prediction success tables based on the learn sample are usually too optimistic. You should always use prediction success tables based on the test sample (or on cross validation, when a separate test sample is not available) as fair estimates of CART performance. CART uses test set performance to find the expected cost and identify the
summary statistics are also reported. Both the color coding and the relative position of this node compared to the root node suggest that the highly priced segment is contained in this node.

[Screenshot: Terminal Node 18 statistics: Min 21.90000, Q1 42.67500, Mean 45.09667, Median 46.35000, Q3 50.00000, Max 50.00000, 30 cases]

The Rules tab has been described above. For further discussion of regression tree modeling, splitting rules, and interpreting regression node statistics, see the CART Reference Manual.

Viewing Rules

There are several flexible ways to look at the rules associated with an entire tree or some specific parts of the tree. In the Navigator window you can tag terminal nodes for further use by hovering the mouse over a node, right-clicking, and selecting the Tag Node menu item. In the following example we tagged all nodes color-coded in red and pink (high-end neighborhoods).

Chapter 5: Regression Trees

[Screenshot: Navigator 1, regression tree topology for MV, color-coded using node means: 13 predictors (13 important), 18 nodes, minimum of 20 cases per node, relative error 0.2]
... 151
Detailed Node Reports ... 154
Terminal Node Report ... 158
ENSEMBLE MODELS AND COMMITTEES OF EXPERTS ... 161
Building an Ensemble of Trees ... 162
Bootstrap Aggregation and ARCing ... 162
The Combine Tab ... 164

Table of Contents

SCORING AND TRANSLATING ... 169
Scoring and Translating Models ... 170
Navigator Files versus Grove Files ... 170
Converting a Tree File to a Grove File ... 172
Scoring CART Models ... 172
Score Data Dialog ... 173
Output Data Set ... 175
Score GUI Output for Classification Trees ...
current Navigator is active. Alternatively, you may right-click on the root node and select Rules from the local menu.

You may also generate rules for only a branch of a tree by right-clicking on the internal node that originates the branch and selecting Rules from the local menu.

To add the within-node probabilities for the learn or test samples, click Learn or Test. Combined learn and test probabilities can be added by clicking Pooled. For example, the main tree rule dialog for the GOODBAD.CSV dataset, with learn sample probabilities activated, is displayed below.

[Screenshot: Navigator 1, Main Tree Rules:
Terminal Node 1:
if ANYRAQT <= 0.5 && FIT <= 3.454 && ANYPOOL <= 0.5 && ONAER <= 13.5
  terminalNode = 1; class = 3; probClass1 = 0; probClass2 = 0; probClass3 = 1]

You can also view the rules for a particular tree node. In the Navigator, click on the node of interest and select the Rules tab from the terminal node results dialog. To export rules as a text file, select Export from the File menu. In the Save As dialog, specify a directory and file name; the file extension is by default .txt.

Chapter 7: Scoring and Translating

[Screenshot: Save As dialog: Save in: Rules; File name: GymRules; Save as type: Text Files (*.txt)]

To send the rules to the printer, select Print from the File menu when the Ma
[Screenshot: Node 2, Competitors and Surrogates tab: main splitter Is LSTAT <= 14.4 with improvement 14.45030; competitor variables with their split values and improvements; surrogate variables with association and improvement values]

The Box Plots tab. The Box Plots tab shows the current node's box plot on the left-hand side and the two children's box plots on the right-hand side. This helps to interpret the nature of the split. The blue box depicts the inter-quartile range, with the top of the box (or upper hinge) marking the 75th percentile and the bottom (lower hinge) marking the 25th percentile for the target variable MV. The horizontal green line denotes the node-specific median, while the whiskers (or upper and lower fences) extend to plus or minus 1.5 times the inter-quartile range. Red plusses represent values outside the fences, usually referred to as outliers.

[Screenshot: Node 2 box plots: splitter Is LSTAT <= 14.4; parent Node 2, left child Node 3, right child Node 10]

The Rules tab. The third tab in the node report, the Rules tab, is displayed as follows. For reference we display the Rules tab for Node 2. Non-terminal and terminal node reports, with the exception of the root node, contain a Rules tab. This tab is discussed
use the variable FIT for the initial split, even if it is not the optimal splitter. The resulting dialog appears as follows:

[Screenshot: Model Setup, Force Split tab: controls to specify the splitter for the root node and its children (Root Node, Set Left, Set Split Value, Set Right, Left Child Node, Right Child Node), alongside the Categorical, Constraints, Testing, Select Cases, Best Tree, Costs, Priors, Penalty, and Battery tabs]

Keeping all other default settings, click Start to build the model. As you can see by hovering the mouse over the root node, the resulting Navigator indeed splits on the variable FIT in the root node, with a split point of 3.45388.

[Screenshot: Navigator 2, classification tree topology for SEGMENT; root node split FIT <= 3.45]

Now let us show a similar example, except here we specify the split point as well. In our previous example we saw the root node split of FIT <= 3.45388. In this example we will force the split on FIT's mean value of 3.96.
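Forcing the root split can presumably also be scripted. The command below is only a hedged guess at the form such a command might take; the FORCE keyword and its options are assumptions not confirmed by this excerpt, so check the Command Reference for the exact syntax:

```text
FORCE ROOT ON FIT AT 3.96
```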
used, all output sent to the GUI back end is completely suppressed.

- Use this mode when you want to execute multiple runs without cluttering the GUI with multiple results windows, which may slow things down and drag the system to a halt.
- Consider using the OUTPUT command to save the classic text results to an ASCII text file.
- Consider using the GROVE command to save the GUI results.

Command Syntax Conventions

CART command syntax follows these conventions:
- Commands are case-insensitive.
- Each command takes one line, starting with a reserved keyword.
- A command may be split over multiple lines, using a comma as the line-continuation character.
- No line may exceed 256 characters.

Chapter 13: Working with Command Language

Example: a sample classification run. The contents of a sample command file, CLASS.CMD, are shown below; line-by-line descriptions and comments follow.

[Screenshot: Notepad, C:\Program Files\Salford Data Mining\CART 5.0\...]

REM SAMPLE CLASSIFICATION RUN
REM **************************************************
REM INPUT/OUTPUT FILES
REM **************************************************
1 >> USE sample.csv
2 >> GROVE class.grv
3 >> OUTPUT class.dat
REM **************************************************
REM OPTIONS, SETTINGS
REM **************************************************
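Drawing on the CLASS.CMD skeleton above and the command log shown elsewhere in this manual (MODEL SEGMENT, ERROR CROSS, METHOD GINI, BUILD), a minimal end-to-end command file might look like the following sketch. The file names are illustrative, and treat the option values as examples rather than recommendations:

```text
REM ********** SAMPLE CLASSIFICATION RUN **********
USE gymtutor.csv
GROVE class.grv
OUTPUT class.dat
MODEL SEGMENT
ERROR CROSS = 10
METHOD GINI
BUILD
```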
variable. Your target has the same value for all learn records; because it makes no sense to proceed, modeling ends.

Error 10026: THE ABOVE VARIABLE IS ONE OF THE INDEPENDENT VARIABLES OF THE TREE AND MAY NOT BE USED AS THE DEPENDENT VARIABLE. Check your MODEL and KEEP (or EXCLUDE) commands and make sure they do not overlap.

Appendix II: Errors and Warnings

Error 10050: UNABLE TO LOAD ANY MORE DATA INTO RAM. Increase the amount of RAM available on your machine.

Error 10055: Too many redraws trying to construct ARC resampling. The ARC process has collapsed. Use exploratory trees to reduce the chance this error will occur.

Error 10057: The above variable name in the model KEEP list has an illegal leading character. Read the variable-name requirements in the manual.

Error 10063: Error with prior class probabilities. Check the PRIORS command.

Error 10065: Not enough memory to add the missing value indicators that your data require. The total number of variables, including missing value indicators, exceeds the maximum allowed limit of 8128.

Error 10066: The center cut power exponent can be no larger than 10.0. Modify the POWER2 setting in the METHOD command appropriately.

Error 10067: The model involves a missing value indicator automatically generated from the above variable. The above variable must be present on the case-by-case data set. Add the variable mentioned to the data, filling it with missing values.
variable, as shown above, or by treating missing as a valid level. The Create Missing Categorical Level control specifies whether missing values for discrete variables are treated as truly MISSING or are considered a legal and distinct level. The user can choose from three control options:

1. Process missing values for ALL variables as legal.
2. Process missing values only for predictor variables as legal.
3. Process missing values only for the target variable as legal.

Command line users will use the following command syntax. To process missing values for all variables as legal:

DISCRETE MISSING = ALL

To process missing values only for predictor variables as legal:

DISCRETE MISSING = LEGAL

Chapter 4: Classification Trees

To process missing values only for the target variable as legal:

DISCRETE MISSING = TARGET

To process missing values as truly missing (the default setting):

DISCRETE MISSING = MISSING

The Cost Tab

Because not all mistakes are equally serious or equally costly, decision makers are constantly weighing quite different costs. If a direct mail marketer sends a flyer to a person who is uninterested in the offer, the marketer may waste $1.00. If the same marketer fails to mail to a would-be customer, the loss due to the foregone sale might be $50.00. A false positive on a medical test might cause additional, more costly tests amounting to several hundreds of dollars. A false ne
variables:

MODEL CHOICE
KEEP FOOD, AGE, HEIGHT, WAIST

LABEL

Purpose: The LABEL command defines variable labels. Labels are not limited in length, although in some reports they will be truncated due to space limitations. The command syntax is:

LABEL <variable> = "ADD LABEL IN QUOTES"

Examples:

LABEL RESPONSE = "Did subject purchase at least one item? 1=yes, 0=no"
LABEL PARTY$ = "Political affiliation, sourced from public database"

If labels are embedded in your dataset (such as SAS(tm) datasets), they will be used in CART, and there is no need for you to issue LABEL commands unless you wish to change or remove them. Variable groups may be used in the LABEL command similarly to variable names. To see a summary of variable labels, issue the command:

LABEL _TABLE_

LCLIST

Purpose: The LCLIST command identifies a group of continuous predictors among which CART should attempt to produce a linear combination at each node. The LINEAR command is now deprecated in favor of LCLIST. Its syntax is:

LCLIST <varlist> <options>

in which <varlist> can be an explicit list of continuous predictors or the _KEEP_ keyword, shorthand for whatever the keep list is for the model. Some examples:

LCLIST credit_score, rate, rebate
LCLIST _keep_
LCLIST x, y, z N=100 EXH=YES

To clear out all LCLISTs
over to a second and third page. To resize and reorient the tree, click on the Page Setup button. By selecting the Landscape orientation we now manage to fit the tree on two pages.

The Page Setup is most useful with larger trees, because a little tweaking can reduce the total page count dramatically. You can often obtain convenient thumbnail displays of the most complex tree by selecting "Fit to two pages if possible" on the Print menu.

Tree Summary Reports

The overall performance of the current tree is summarized in seven Summary Reports dialog tabs. To access the reports, click Summary Reports at the bottom of the Navigator window, or select Tree Summary Reports from the Tree menu. Tree Summary Reports present information on the currently selected tree, i.e., the tree displayed in the top panel of the Navigator. To view summary reports for another size of tree, you must first select that tree in the Navigator. For the summary reports that follow, we work with the CART optimal tree with 10 nodes.

[Screenshot: Navigator 5, classification tree topology for TARGET]
saves an output dataset as a result of scoring whenever Save Results to a File is checked in the Score Data dialog. Depending on your settings, different variables may appear in the output dataset.

Variables that are always created:
- CASEID: a record-number identifier.
- RESPONSE: the predicted response: a class assignment for classification trees, or the node average for regression trees.
- NODE: the node assignment; useful when working with hybrid models.

Depending on the file format, having an original target called RESPONSE and checking Model Information (see below) will result either in two variables with identical names (one for the predicted response and one for the actual response) or in distinguishing the original response by renaming it RESPONSE1. We suggest avoiding this situation to eliminate possible complications.

Variables created only when a target exists:
- CORRECT: a binary indicator telling whether the predicted response is the same as the actual response.

When Model Information is checked, CART will include the original target (if available) and all predictors that participate in the model, that is, that have non-zero variable importance scores.

When Path Indicators is checked:
- PATH_<N>_: each of these variables gives the node number that a case goes through at the <N>th depth in the tree. PATH_1 is always set to 1 to indicate that the first node is the root node. Positive numbers refer to internal nodes; negative numbers
saving directly to a text file, then pasting them into your text application. Alternatively, you can edit the text commands, deleting or adding new commands, and then resubmit the analysis by selecting either Submit Window or Submit Current Line to End from the File menu.

View > Open Command Log

Within a single work session, CART keeps a complete log of all the commands given to the engine. You may access this command list at any time through the View > Open Command Log menu.

[Screenshot: Command Log window:
LOPTIONS MEANS=NO, NOPRINT=NO, PREDICTIONS=NO, TIMING=NO, PLOTS=NO, GAINS=NO, ROC=NO, PS=NO, FORMAT=3
USE "C:\Program Files\Salford Data Mining\CART 5.0\Sample Data\Gymtutor.csv"
MODEL SEGMENT
KEEP ANYRAQT, ONAER, NSUPPS, OFFAER, NFAMMEM, TANNING, ANYPOOL, SMALLBUS, FIT, HOME, PERSTRN, CLASSES
CATEGORY SEGMENT, ANYRAQT, TANNING, HOME, CLASSES
ERROR CROSS=10
METHOD GINI POWER=0.0000
BUILD
SAVE "C:\Program Files\Salford Data Mining\CART 5.0\ScoreData.csv"
MODEL
GROVE "C:\DOCUME~1\Owner\LOCALS~1\Temp\sp450"
HARVEST PRUNE TREENUMBER=4
SCORE PATH=NO PROBS=3]

This feature is helpful for learning command syntax and writing your own command files. All you need to do is set up run options using the GUI front end and then read the corresponding command sequence from the Command Log.

You may save the Command Log into a command file on your hard drive using the F
previous section. As previously noted, variable names and values are usually separated using the comma character. For example:

DPV$, PRED1, CHAR2$, PRED3, CHAR4$, PRED5, PRED6, PRED7, PRED8, PRED9, PRED10, IDVAR
0, 2.32, "MALE", 3.05, "B", 0.0039, 0.32, 0.17, 0.051, 0.70, 0.0039, 1
0, 2.32, "FEMALE", 2.97, "0", 0.94, 1.59, 0.80, 1.86, 0.68, 0.940687, 2
1, 2.31, "MALE", 2.96, "H", 0.05398, 0.875059, 1.0656, 0.102, 0.35215, 0.0539858, 3
1, 2.28, "FEMALE", 2.9567, "0", 1.27, 0.83, 0.200, 0.0645709, 1.62013, 1.2781, 4

Character variables are indicated by either placing a $ at the end of the variable name (e.g., POLPARTY$), surrounding the character data with quotes (e.g., "REPUBLICAN"), or both.

Distinguishing Character vs. Numeric

CART uses the following assumptions to distinguish numeric variables from character variables in ASCII files:

- When a variable name ends with $, or if the data value is surrounded by quotes (either " or ') on the first record, or both, it is processed as a character variable. In this case a $ will be added to the variable name if needed.
- If a variable name does NOT end with $, or if the first-record data value is NOT surrounded by quotes, the variable is treated as numeric.

It is safest to use $ to indicate character fields. Quoting character fields is necessary if $ is not used at the end of the variable name, or if the character data string contains commas, which would otherwise be construed as field separators.
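To illustrate the comma caveat, a minimal two-column sketch with hypothetical variable names: the city value contains a comma, so it must be quoted even though the $ suffix already marks the field as character.

```text
CITY$, INCOME
"Portland, OR", 52000
"Boise, ID", 48000
```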
we see that persons who have a low number of inquiries but did not report an occupation are also high risk.

Remember that these data are fictionalized and so should not be thought of as a completely faithful representation of real-world credit risk; some surprises are inevitable in this example.

We find the splitters view of the tree helpful in giving us a quick overview of the main drivers in the tree. We see the variables used at the top of the tree and the direction of their effect. At the bottom left we see that being older is a default risk factor, and at the bottom middle we see that a lower income is also a risk factor. These are just quick impressions that help us acquire a feel for the message of the tree.

The splitters view is an excellent way to quickly detect significant data errors. If you see a pattern of outcomes that is very different from what is expected, or even possible, you have identified a potential data flaw that needs to be investigated.
y classification example.

CART Tutorial

This chapter provides a hands-on tutorial to introduce you to the CART graphical user interface: menus, commands, and dialogs. See firsthand how easy CART is to use. In this first tutorial you will learn how to set up a simple CART analysis, how to navigate the dynamic tree displays, and how to save your work.

A word on our examples: CART can be applied to data from any subject. We have come across CART models in agriculture, banking, genetics, marketing, security, and zoology, among many others, and the citations to CART number in the thousands. Because analysts prefer to work with examples from their own fields, we have included a few alternative case studies. This chapter deals with a simple YES/NO outcome drawn from the field of credit risk. If you prefer to work through a marketing segmentation example instead, you can jump to Chapter 11; Chapter 4 works through a biomedical example; and Chapter 5 discusses a housing regression tree example. We recommend that you try to follow this first example, as it primarily uses concepts with which most readers will be familiar.

Our first tutorial file, GOODBAD.CSV, contains data on 664 borrowers, 461 of whom repaid a loan satisfactorily and 203 who defaulted. Clearly the defaulters have been oversampled; few lenders could afford to have a loss rate as high as 31%. While the data have their origin in the real world, the s
y files. By default, all input and output directories are initially set to the CART installation directory; the temporary directory is your machine's temporary Windows directory. Below we have set directory preferences for our input and output files. To change any of the default directories, click on the [...] button next to the appropriate directory and specify a new directory in the Select Default Directory dialog box. CART will retain default directory settings in subsequent analysis sessions.

When the Most Recently Used File List checkbox is marked, CART adds the list of recently used files to the File > Open menu.

[Screenshot: File > Open menu showing Data File (Ctrl+O), Report, Navigator, Grove, Grove Contents, and Grid entries, plus a most-recently-used list of example files such as Prostate2.csv, Boston.csv, spambase.csv, HOSLEM.CSV, and GOODBAD.CSV]

Input Files:
- Data: input data sets (train and test) for modeling.
- Model information: previously saved model files, navigators, and groves.
- Command: command files.

Output Files:
- Model information: model files (groves) will be saved here.
- Prediction results: output data sets from scoring.
accuracy is not the only sensible criterion to use to select a model; many data mining specialists prefer to use the area under the ROC curve as their model selection criterion. For decision-making purposes you may be interested only in the top-performing nodes of the tree. If so, the accuracy and reliability of these nodes are all that matter, and the overall performance of the tree is not relevant. Judgment can play an important role in the final tree selection.

The Navigator makes it very easy to view, display, and obtain reports for every size of tree found by CART in its tree-building process. Select the Navigator window and then use your left and right arrow keys to display different-sized trees in the Navigator topology display. Begin by moving all the way to the left to reach the two-node tree.

[Screenshot: Navigator 2, classification tree topology for TARGET, pruned to two nodes]

Technically, we could go one step further to arrive at the one-node tree (the null tree), but we make the two-node tree the smallest we will display. This
by pressing the Translate button in the Navigator window to get the complete representation of the CART model, including surrogates. See Chapter 7 for details.

Scoring Data

There are many reasons to score data with a CART model. You might want to run a quick test of the model's predictive power on new data, or you might actually embed your model into a business process. CART gives you several options for doing this. CART can score data from any source using any previously built CART model. All you need to do is attach to your data source, let CART know which grove file to use, and decide where you want the results stored.

CART scoring engines are available for deployment on high-performance servers that can rapidly process millions of records in batch processes. You can TRANSLATE your model into one of several programming languages, including C, SAS, and PMML (Java may be available by the time you read this). The code produced needs no further modification and is ready to be run in accordance with the instructions provided in the main reference manual.

To score data using a model you have just built, proceed as follows:

1. Press Score in the Navigator window containing the model you want to apply.
2. In the Score Data window, accept the current data file or change it using the Select button in the Data section; accept the current grove file (embedded into the current Navigator) or use
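The same scoring steps can be scripted. The sketch below is modeled on the command log excerpt shown elsewhere in this manual (USE, SAVE, GROVE, SCORE); the file names are hypothetical placeholders:

```text
USE newdata.csv
SAVE predictions.csv
GROVE mymodel.grv
SCORE
```

This attaches to the data source, names the output dataset, points CART at the saved grove, and runs the scoring pass in batch.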
y setting 57, 241
default settings 85, 147
DELETE command 409, 416
delimited text: comma 33; semicolon 33; spaces 33; tabs 33
dependent variable 85
depth 210
depth of tree 113, 287
DESCRIPTIVE command 347
descriptive statistics 15, 291
Desktop 41, 228
detailed node report 238
DIM command 408, 417
directional instability 187
directional stability 187
directories 29, 134: input files 133; output files 133; specify defaults 132; temporary files 133; user-specified 28
Directories tab 28: control functions 29; Input files 28, 133; Output files 28, 133; Temporary files 29, 133
disallow 275
DISALLOW command 282, 431
discount surrogates 103, 248
DISCRETE command 115, 348: MISSING 115
Display Tree 242
displaying tree rules 76, 258

E
Edit menu 229: Copy 284; Fonts 76, 258; Options 125
effective frontier 198
ELSE command 407, 418
embedded grove information 172, 180
embedded model information 171, 172, 180
end of file 409
end of group 409
ensemble of trees 162
entropy 13, 105
ERROR command 352
Error Profiles tab 204
error rate 66, 67, 249, 250
errors and warnings 319
evaluation sample (holdout) 165
even splits 106
Excel format 36
EXCLUDE command 353
exploratory tree 96
exporting tree rules 76, 183, 258

F
file formats 34
File menu 42, 43, 229, 230, 284: Command Prompt 298; Export 183, 259; Log Results to 283; most recently used file 132; New Notepad
