Home
The MiningMart User Guide
Contents
1. 20 CHAPTER 2 BASIC CONCEPTS IN MININGMART will be different in each institution Therefore sharing MiningMart applications as explained in section 2 1 makes use only of the conceptual level Connections of the conceptual and relational level Information about a concept like Customer and about the specific business data table containing customer data must be linked Thus there exist connections in MiningMart between the conceptual level and the relational level Concepts are connected to columnsets features are connected to columns see the definitions of these terms There are two ways to create a connection the user can create one or the MiningMart compiler can do that The central idea is that there are some con cepts called DB concepts that represent the input business data for the case For these the connection to the database object they represent is defined by the user with the help of the concept editor When selecting a DB concept in the concept editor the menu item Create Connection is available to do so Other concepts called MINING concepts represent business data that was created automatically during the execution of a MiningMart step This execution is done by the compiler thus the compiler creates not only the data but also the connections to the concepts and features Case case is a knowledge discovery process or data preprocessing appli cation as modelled in MiningMart Users work on one c
2. TheOutputAttribute 4 4 14 DateToNumeric This operator extracts numerical parts of database fields in DATE format This is useful if you need to perform arithmetic operations on time stamps for example when you need to represent the time as days since a given start date Parameters are simply an ThelnputConcept The TargetAttribute of type TIME and Output Format currently one of Year_YYYY Year_YY Month_of_Year Day_of Month Hour of Day Minute of Hour and Second_of Minute The result is stored in TheOutputAttribute ObjType inherited TheInputConcept CON IN inherited TheTargetAttribute BA IN inherited type DATE OutputFormat V IN see text BA OUT TheOutput Attribute 4 4 15 TimelntervalManualDiscretization This operator can be used to discretize a time attribute manually The looped parameters specify a mapping to be performed from TheTargetAttribute a BaseAttribute of type TIME to a set of user specified categories As for all FeatureConstruction operators a BaseAttribute TheOutputAttribute is added to TheInput Concept The mapping is defined by looped parameters An interval is specified by its lower bound IntervalStart its upper bound IntervalEnd and two additional parameters StartIncExc and EndIncEzc stating if the interval bounds are in cluded value T or excluded value E The value an interval is mapped to is given by the looped parameter Map To If an input value does not belong to any interval
3. tion that is used for the regression can be biased such that predicting too high is more expensive LossF unctionPos gt LossFunctionNeg or less expensive LossFunctionNeg gt LossFunctionPos than predicting too low If both val ues are equal no bias is used The parameter C balances training error against generalisation quality positive values between 0 01 and 1000 have been used successfully in the literature Epsilon limits the allowed error an example may produce small values under 0 5 should be used ObjType TheInputConcept inherited TheTarget Attribute inherited ThePredictingAttributes KernelType SampleSize see explanation above see explanation above positive real try 1 0 positive real try 1 0 positive real try 1 0 positive real try 0 1 optional one of true false optional inherited LossFunctionPos LossFunctionNeg Epsilon UseDB_SVM TheKey TheOutputAttribute lt 4444944 an 22222222222 uU gt gt 4 4 12 Support VectorMachineForClassification A data mining operator Values in TheTargetAttribute are used as target func tion values to train the SVM on examples that are formed with ThePredicting 4 4 FEATURE CONSTRUCTION OPERATORS 59 Attributes TheTargetAttribute must be binary as Support Vector Machines can only solve binary classification problems The parameter Positive Target Value specifies the class label of the positive class
4. late them to different features in TheOutputConcept For this the parameters MapInput and MapOutput exist Use MapInput to specify any feature in one of TheConcepts and use MapOutput to specify the corresponding feature in TheOutputConcept To make sure that for each MapInput the right MapOut put is found by this operator it uses the coordination mechanism see section 4 1 However these two parameters only need to be specified for every pair of equally named features in TheConcepts So there are not necessarily as many maps as there are features in TheOutputConcept Object Type TheConcepts CON List no TheInputConcept TheKeys BA List MapInput BA or MCF coordinated MapOutput BA or MCF OUT coordinated TheOutputConcept CON inherited 4 2 3 UnionByKey Takes a list of concepts plus attributes indicating their primary keys and unifies the concepts In contrast to the operator JoinByKey section 4 2 2 the output 36 CHAPTER 4 OPERATORS AND THEIR PARAMETERS columnset is a union of the input columnsets rather than a join For each value occuring in one of the key attributes of an input columnset a tuple in the output columnset is created If a value is not present in all key attributes of the input columnsets the corresponding non key attributes of the output columnset are filled by NULL values In TheOutputConcept only one of the keys will be present Each BaseAtt ribute specified in TheKeys must be a primary key of one
5. 43 plications of the following steps This is e g useful to apply the same chain on different samples drawn in one of the succeeding steps ObjType TheInputConcept CON IN inherited HowOften V IN number of views to create TheOutput Concept CON OUT inherited 4 2 16 Materialize This operator is a normal concept operator but it is a pure technical construc tion to enforce materialization of ColumnSets The table name of the output can optionally be specified as a parameter which is useful if you want to ac cess the preprocessed data afterwards If multiple ColumnSets are connected to TheInputConcept then each of the corresponding table names will be extended by a numerical suffix like _1 ObjType TheInputConcept CON IN inherited TableName V IN name of output table TheOutput Concept BA IN j inherited 4 2 17 MaterializeWithPKs This operator is an extension of the Materialize operator described in section 4 2 16 It materialises the table and also creates a primary key constraint in the database that declares the columns that correspond to the BaseAttributes given in the parameter PrimaryKey as primary keys of the created table ObjType TheInputConcept CON IN inherited TableName V name of output table PrimaryKey BA List TheOutput Concept BA inherited 4 2 18 YaleModelA pplier This operator is able to apply a model written by the learning toolbox YALE to an example set as given by a database ta
6. 5 2 How to 72 5 3 How to 72 5 4 How to uploadacase 222 73 CONTENTS Chapter 1 Getting Started 1 1 Support Support for installing and working with MiningMart is available in this docu ment and some README files in the distribution Further support questions can be addressed by email to miningmart 1s8 cs uni dortmund de The MiningMart team consisting of Timm Euler and Martin Scholz will be happy to answer your specific questions 1 2 Licensing information Please refer to the file MM HOME LICENSES 1 3 File names in this document Unpacking the zip file you have downloaded from the MiningMart website or running the Windows installer see section 1 5 will create a directory named MiningMart 1 1 The complete path up to and including this directory is ab breviated in this and all other MiningMart documentation by MM HOME As an example assume that somewhere in the documentation the file MM HOME lib postgresql jar is mentioned This indicates that you can find the file postgresql jar in the directory e C Programs MiningMart 1 1 lib to use an example for Windows e home username analysis MiningMart 1 1 1ib as an example for Linux or Solaris 8 CHAPTER 1 GETTING STARTED 1 4 Connecting to the database MiningMart works with relational databases It currently supports three data base systems
7. Concepts have features which define them The MiningMart system provides a concept editor to create edit and delete concepts and their features Concepts belong to the conceptual level Concepts are connected to ColumnSets which represent the database contents on the relational level There are two types of concepts DB and MINING The first type are con cepts whose data exists before any MiningMart step is executed That is these concepts represent the input data for the case All MINING concepts in con trast are not connected to any data before the execution called compilation of a MiningMart step The MiningMart compiler creates the data that belongs to the MINING concepts and connects it to them See also under compiler and connection Subconcept link A concept is a subconcept of another concept if it has the same list of features but the corresponding Columnset contains only rows that are also present in the other concept s columnset The link between a subconcept and its super concept is displayed in the Concept Editor as a dashed arrow Projection A conceptis a projection of another concept if all of its features ap pear in the other concept but the other concept also has at least one additional feature Further the corresponding columnset contains only rows that are also present in the other concept s columnset except for the additional feature s A projection link is displayed in the Concept Editor as a dotted arrow 22
8. Generabissues mo eae ROIG Y d X RUD E 4 2 Concept Operatofs Donde Re s m po 401 eee we eS 4 2 1 MultiRelationalFeatureConstruction 42 2 JOmBY Key 2526 eh es Ge od ad 4 2 3 UnioniBy Wey a se a se Ra tee O0 N NNN 4 3 4 4 CONTENTS 42 4 Pivotize x odes Yb BOER A OR SE Ee ed 36 4 2 5 ReversePivotize RR ys 37 4 2 6 SpecifiedStatistics 38 4 27 40 4 2 8 RowSelectionByRandomSampling 40 4 2 9 DeleteRecordsWithMissingValues 40 4 2 10 41 4 2 11 SegmentationByPartitioning 41 4 2 12 SegmentationWithKMean 41 4 2213 UnSegment 4 ser d dE GRE OR d 42 4 2 14 RemoveDuplicates 42 42 15 c m yox 42 4 2 16 Materialize xeu GR BO ed wu 48 4 2 17 MaterializeWithPKs 43 4 2 18 YaleModelApplier 43 4 2 19 44 4 2 20 AttributeDerivation 44 4 2 21 FeatureConstructionByRelation 45 4 2 22 WAN OWING o urbcs ene Rom OR ho Ry ne dn Bes 46 4 2 23 SimpleMovingFunction llle 46 4 2 24 WeightedMovingFunction 47 4 2 25 Expon
9. all machine learning experiments MiningMart provides two operators that ease the integration with YALE see section 1 6 If your installation does not include them binary files for the external op erators can be put into the directory MM_HOME ext bin lt os name gt where lt os name gt is to be replaced with one of Windows Linux or SunOS Currently the following tools are supported if put to the directory specified above e myS VM a Support Vector Machine by Stefan Rueping For details please refer to http www ai cs uni dortmund de SOFTWARE MYSVM The wrapper expectes the target files nysvm and predict e Apriori implemetation of Bart Goethals For details please refer to http www cs helsinki fi u goethals software index html The wrapper expects the binary apriori 1 9 List of operators that use external algorithms e Apriori e FeatureSelectionWithSVM e Missing ValuesWithRegressionS VM e Support Vector MachineForRegression e SegmentationWithK Mean 16 CHAPTER 1 GETTING STARTED Chapter 2 Basic Concepts in Mining Mart In this chapter you will learn about the basic ideas behind MiningMart Its different components and the way they interact will be explained Basic notions that will be needed for any MiningMart session are presented This will also help you to understand this document and any other documents related to MiningMart MiningMart is a system that supports the development documentation and re use o
10. are created such as in the last output table in the example above TheInputConcept inherited TheIndexAttribute see text ThePivotAttribute see text TheGroupBy Attributes optional IndexValue looped MappedAttribute looped AggregationOperator SUM or MIN etc NullOrZero one of Null Zero TheOutputConcept inherited 4 2 5 ReversePivotize This operator reverses a pivotization see operator PIVOTIZE section 4 2 4 except for the aggregation which cannot be reversed since from an aggregated value the single values cannot be known The operator takes a list of attributes ThePivotizedAttributes from the input concept and assumes that all values of these attributes are compatible so that they can all be used as values of a single new attribute in the output concept which will get the name given in the parameter NameForPivotAttribute Further 38 CHAPTER 4 OPERATORS AND THEIR PARAMETERS the operator assumes that all ThePivotizedAttributes are assigned with a certain Index Value or even a combination of index values Thus for every attribute listed as ThePivotizedAttribute one entry for IndexValues must be listed The output concept will contain new index attributes for every type of index value These attributes will list all combinations of index values that are listed as Index Values as their values The output concept will also contain an attribute with the name given in NameForPivoAttribute which will take t
11. as absolute values integers or relative values real between 0 and 1 ObjType TheInputConcept IN inherited TheTargetAttribute BA IN must be numeric MinCardinality Type V IN ABSOLUTE or RELATIVE MinCardinality V IN positive TheOutputAttribute BA OUT should be categorial 4 4 26 GroupingGivenNoOfGroups A grouping operator Values of The TargetAttribute are grouped under a certain label which is stored in TheOutputAttribute which must be categorial This operator groups values of The TargetAttribute by iteratively merging in each step two groups with the lowest frequencies until the number of groups NoOfGroups is reached The algorithm has been inspired by hierarchical clustering Values of TheOutputAttribute can be specified in the parameter Label ObjType TheInputConcept inherited TheTargetAttribute must be numeric NoOfGroups positive integer Label optional TheOutput Attribute should be categorial 4 4 27 UserDefinedGrouping A grouping operator Values of TheTargetAttribute are grouped under a certain label which is stored in TheOutputAttribute which must be categorial This operator creates groups of TheTargetAttribute according to specifications given by the user in TheGroupings which is a list of values Each of the values in the list in turn is a String that lists values of TheTargetAttribute which should be grouped together separating them with a comma Values not specified for grouping retain their orig
12. be added manually and some parame ters of the reversing step may have to be adapted It employs a special operator called ReverseFeatureConstruction which cannot be employed directly but it is documented like other operators in chapter 4 3 3 CONCEPT EDITOR 29 MiningMart Concepts Case Insert Compile Tools Window Help tetta amp tog Concepts Drawing ActiveCRients WonChurners 9 BaseAttributes 2 BaseAttributes MultiColumnFeatures 0 MultiColumnFeatures 1 ColumnSets 1 ColumnSets 9 x j CrainingSetKeyAndLaber 7 2 BaseAttributes 2 BaseAttributes E PE S MultiCelumnFeatures B 0 MultiCelumnFeatures ultiColumnFeatures 1 ColumnSets 1 ColumnSets 1 ColumnSets 2 s Tacnvecment 8 BaseAttributes 2 BaseAttributes 0 MultiColumnFeatures MultiColumnFeatures 1 ColumnSets 1 ColumnSets 4 Application in edit mode Figure 3 2 The concept editor The other advanced option in the Tools menu is Recommend Materialisa tions In large cases this option can be used to automatically find suitable places for inserting a Materialisation operator see section 4 2 16 for a de scription of this operator Every time this option is called atmost one place for materialisation is recommended in order to get all recommendations please call this option repeatedly
13. contain a BaseAttribute for each number of unique values which is computed their names must be those of the input attributes followed by the suffix UNIQUE For ordinal attributes the parameter lists AttributesComputeMin and AttributesComputeMax exists The operator writes the minimum and maximum values of the corresponding attributes to the output BaseAttributes with the suffixes MIN and _ MAX Further for a BaseAttribute specified with AttributesComputeDistrib the distribution of its values is computed For example if a BaseAttribute contains the values 2 4 and 6 three output BaseAttributes will contain the number of entries in the input where the value was 2 4 and 6 respectively For each BaseAttribute whose value distribution is to be computed the possible values must be given with the parameter Distrib Values One entry in this parameter is a comma separated string containing the different values in the example the string would be 2 4 6 Thus the number of entries in AttributesComputeDis trib and Distrib Values must be equal TheOutputConcept will contain the corre sponding number of BaseAttributes three in the example their names will be those of the input attributes followed by the suffix value In the ex ample TheOutputConcept would contain the BaseAttributes inputBaName_2 x inputBaName4 and inputBaName_6 40 CHAPTER 4 OPERATORS AND THEIR PARAMETERS Object Type Thel
14. from one s own work The case base can be found on the MiningMart web site http mmart cs uni dortmund de The direct link is http mmart cs uni dortmund de caseBase index html 5 1 The Internet Presentation of Cases As soon as an efficient chain of preprocessing has been found it can easily be exported and added to an Internet repository of best practice MiningMart cases Only the conceptual level is submitted so even if a case handles sensitive infor mation as is true for most medical or business applications it is still possible to distribute the valuable ideas for re use while hiding all the sensitive data and even the local database schema To support users in finding the most relevant cases their inherent structure is exploited The internet interface visualizes the conceptual elements like chains steps and concepts as web pages linked by HTML links if the corresponding elements are connected in the case It is possible to navigate through the case base and to investigate single steps to see which operators were used on which kind of concepts to see which concepts were used as input or output in which steps which features belong to a concept and so on To each case some free text descriptions giving an overview of the case and the application domain can be added This allows other users to easily relate the work done in one case to their own goals rather than getting too much involved in technical details at an early
15. i Select one table or view and click Create This will create the columnset and you will have to connect each BaseAttribute of the concept to zero or one column ii Select one or more tables or views select a Matcher and click Match This will also create the columnset and MiningMart will use an automatic matching method that depends on the selected matcher to try to connect the BaseAttributes to the columns automatically This automatic matching is based on the names of the BaseAttributes and columns which means that if the names are not similar at all then no suitable matching is likely to be found But you can change the mapping of BaseAttributes to columns easily by hand afterwards and in fact at any time by right clicking on a concept s icon and choosing Show mapping If you click on a column in the mapping tab on the right hand side the list of available columns is displayed and the desired one can be chosen When a concept is connected right clicking on it and choosing Show Column sets displays the columnsets attached to it The buttons Show data and Show statistics can be used to inspect the data that is contained in the table or view represented by this columnset and some statistical information about it For views and virtual columns the latter are created by the compiler there is an SQL field that displays the SQL code which realises this view or virtual column Mining concepts are not usually created by hand but by creating a
16. integer not NULL MinSupport minimal support integer MinConfidence minimal confidence in 0 1 SampleSize the size of the sample to be used PremiseBA premises of rules ConclusionBA conclusions of rules TheOutputConcept inherited 4 2 28 Feature Construction with TF IDF This operator calulates term frequencies inverse document frequencies a mea sure known from information retrieval In this setting the operator is applied for time series with binary attributes instead The parameter TheSelectedAttributes contains a list of attributes for which the TF IDF values should be calculated TheKey is the primary key attribute of this time series while TheTimeStamp is the attribute holding the time infor mation of the tuple Unlike other Feature Construction operators this one yields a concept not a single feature ORT TheInputConcept inherited TheSelectedAttributes attribs to calc TFIDF for TheTimeStamp type TIME TheKey key attribute TheOutpatConcept 4 2 29 Union This operator implements the normal UNION functionality known from SQL thus the different Concepts specified as input need to be union compatible There is one main TheInputConcept which specifies the BaseAttributes of the TheOutputConcept If features of TheInputConcept are deselected then the features will also be deselected in the output All further attributes in these Concepts will be ignored all missing attributes will be replaced by artif
17. is relevant if you have old MiningMart cases that you still want to use after upgrading the system If you do not care about your old cases then you can simply install the latest system from scratch Otherwise you should export all your cases to XML files using the export facility of your old system Then you may delete your M4 schema and install it anew see section 1 4 5 Using the import facility of the new system you can then re create all your cases However you do not need to delete your cases Simply run the script e MM HOME m4install oracle replace operators sql or e MM HOME m4install postgres replace operators sql depending on your database system Start these scripts from an SQL command line in the M4 database schema This will update the information about Min ingMart s operators without changing the M4 schema 1 8 External code Although MiningMart was implemented in Java some operators use external algorithms not all of which are included in all distributions or for all operating systems A list of operators using external algorithms can be found in section 1 9 The operators that use these external implementations all employ machine learning algorithms The MiningMart team cannot give support for these ex ternal implementations If these operators do not work on your system it is suggested to use the very powerful YALE learning environment see section 1 6 1 9 LIST OF OPERATORS THAT USE EXTERNAL ALGORITHMS 15 for
18. method to fill the output can be used You have to add your own Java class to the At trDeriv jar java archive The name of your Java class is given to the opera 4 2 CONCEPT OPERATORS 45 tor as the input parameter ClassName Your Java class must implement the interface AttrDerivInterface which is included in the file AttrDeriv jar Please refer to the example implementation and its Java comments The in terface AttrDerivInterface prescribes one method to be implemented which returns a String object ie an array of strings The strings give the new values of the attribute in the output You must use Java 1 5 alias JDK 5 0 to compile your Java classes ObjType ClassName Name of your Java class TheTarget Attribute optional TypeOfOutputAttribute Datatype for new attribute TheOutput Attribute Name of new attribute TheOutputConcept 4 2 21 FeatureConstructionByRelation An operator that adds information from one concept to another concept The two concepts must be linked by a relationship which is given by the parameter TheRelation The From Concept of that relation provides the target attribute The operator creates a copy of the To Concept of TheRelation with a new addi tional attribute The name of the new attribute is given as TheOutputAttribute The values of the new attribute are aggregated values aggregated by the given function and computed from TheTargetAttribute The computation of aggregated values
19. of TheConcepts thus the number of entries in TheConcepts and TheKeys must be equal If several of the input concepts contain a BaseAttribute or MultiColumn Feature with the same name a special mapping mechanism is needed to re late them to different features in TheOutputConcept For this the parameters MapInput and MapOutput exist Use MapInput to specify any feature in one of TheConcepts and use MapOutput to specify the corresponding feature in TheOutputConcept To make sure that for each MapInput the right MapOut put is found by this operator it uses the coordination mechanism see section 4 1 However these two parameters only need to be specified for every pair of equally named features in TheConcepts So there are not necessarily as many maps as there are features in TheOutputConcept Object Type TheConcepts CON List no List MapInput BA or MCF looped MapOutput looped TheOutputConcept inherited 4 2 4 Pivotize Pivotisation means to take the values that occur in an index attribute and create a new attribute for each of these values The new attributes contain the values of a pivot attribute in those rows that contain the corresponding index value Thus the pivot values are distributed over the new attributes which correspond to the index values For clarification a simple example is given here Assume that this is the input table concept PrimaryKey IndexAttr Pivot
20. on the compilation of previous steps if a step uses input that is the output of a previous step The compiler can be executed in two modes lazy and eager This only makes a difference if there are concepts in the case that have more than one ColumnSet which can happen as the result of a segmentation operator see sections 4 2 10 4 2 11 and 4 2 12 in chapter 4 In lazy mode the compiler executes the operator specific task only on the first of the ColumnSets that belong to the input concept of that operator which saves time for testing For full compilation the eager mode is needed Concept editor In this view you can create view or delete concepts and their relations on both the conceptual and relational level Case editor In this view you can create view or delete steps you can ar range them into chains and define the input and output parameters of their operators The case editor shows the currently defined sequence of steps with their dependencies represented by arrows More details can be found in chapter 3 Export Cases can be exported with the export function This will store all the Metadata that defines the case into a single file This file can then be used for importing the case into another database by another user for example See also chapter 5 Import After exporting a case can be imported into a new database After im port all the Metadata of the case is available however the connections between the conceptua
21. schema The menu item Switch editor can be used to change from the case editor to the concept editor or back The menu items Print and Export image allow to print the contents of the main working area the central area to postscript or png files The Insert menu will be explained in sections 3 2 and 3 3 The options of the Compile menu are only available when the Case editor is opened see section 3 2 25 26 CHAPTER 3 THE GRAPHICAL USER INTERFACE The Tools menu allows to change the connection settings by which Mining Mart connects to the database menu option Edit DB settings This should not be done when a case is opened Another option in this menu called Arrange Items on Grid is to arrange the icons shown in either the case or concept ed itor on a grid which helps to straighten the lines The result of this can only be seen after the case has been saved closed and re opened The other options are explained in sections 3 2 and 3 3 The Windows menu offers different options Grid toggles between visible and invisible states of a grid on the working pane Look n Feel allows to change the look and feel of the graphical interface Show indicates which of the possible elements for the right hand frame are displayed Presentation creates a win dow that gives maximal space to the contents of the working pane The most important item in this menu is Preferences When you choose it a new win dow opens with three areas The first di
22. schema or just a second database under MySql for the data This M4 schema should be empty to start with and will be filled with MiningMart specific tables by the system see section 1 4 5 See section 1 4 to learn about the database connection information which you need when starting MiningMart for the first time With these hints you should be able to install MiningMart easily by following the instructions below They are also to be found in the files MM HOME README and MM_HOME INSTALL 1 5 2 Windows You can download a file called miningmart 1 1 exe Save the file somewhere on your file system Double click on the file to start the installation program You will be asked for the destination directory where MiningMart is to be in stalled The default directory is a subdirectory of your programs directory such as C Programs MiningMart 1 1 But please note that you need to have Administrator status in order to install MiningMart there If you do not have Administrator status on your computer simply choose another directory where you have write access If you would like to use an Oracle database you will also be asked to provide the location of a JDBC driver file that comes with Oracle This file contains Java classes that MiningMart which is written in Java needs to access your Oracle database The file is usually called ojdb14 jar for Java 1 4 classes13 jar classes12 zip for lower Java versions or similar MiningMar
23. step in the case editor whose output is a concept Such output concepts are always Mining concepts and their connection to database objects can only be realised 3 3 CONCEPT EDITOR 3l by compiling the steps whose output they are Depending on which operator created a Mining concept there are often some semantic links to the input concept In MiningMart the three types of semantic links between concepts are subconcept links represented by dashed arrows projections represented by dotted arrows and relationships represented by solid arrows see section 2 2 In the standard setting MiningMart always dis plays all three types of links but you can choose which ones to display in the Window Preferences Concept links menu item These links can also be cre ated by hand using the icons in the ConceptTools tab on the right hand side They can be removed by right clicking on a link and choosing Delete Relations Relations or relationships represent foreign key links between tables or views The two types one to many 1 n and many to many n m are distin guished When creating a relationship by hand the BaseAttributes of the two concepts involved which correspond to the database columns that realise the keys must be specified For many to many relationships also the cross table must be chosen from the available database objects and its key columns must be specified Data models More advanced users can try to connect se
24. All ThePredictingAttributes must belong to TheInputConcept TheOutputAttribute contains the predicted values There are some SVM specific parameters the table gives reasonable values to choose if nothing is known about the data or SVMs For the KernelType only the following values Strings are possible dot polynomial neural radial anova Dot is the linear kernel and can be taken as default This operator can use two different versions of the Support Vector Machine algorithm One runs in main memory it needs the parameter SampleSize to de termine a maximum number of training examples The other runs in the data base it is used if the optional parameter UseDB_SVM is set to the String true When this version is used an additional parameter TheKey is needed which gives the BaseAttribute whose column is the primary key of TheInputConcept TheKey can be left out only if the ColumnSet that belongs to TheInputConcept represents a table rather than a view The database algorithm restricts the pos sible kernel types to dot and radial It can also use the parameter SampleSize You can only use it if you have the DB version of MYSVM installed The parameter C balances training error against generalisation quality pos itive values between 0 01 and 1000 have been used successfully in the literature Epsilon limits the allowed error an example may produce small values under 0 5 should be used TheInputConcept inherited TheTargetAttribute inherit
25. As explained in section 2 1 MiningMart uses a layer of abstraction of the business data in order to hide sensitive details from other MiningMart users This layer is the conceptual level Its name stems from the fact that on this level the data is described in everyday concepts rather than in terms of its technical representation For example many institutions have got data about their customers So it could make sense to introduce the common concept Customer on the conceptual level where it represents the data about customers Information about this level forms part of the Metadata described above The conceptual level is the most important one for MiningMart users be cause all the data processing is described in terms of the conceptual level That is whenever the customer data in the above example is accessed this is done via the concept Customer In contrast to this level there is the relational level which also forms part of the Metadata but which contains less abstract information about the business data Both levels must be connected see below Relational level On this level the business data is described in terms of its technical representation This means that the relational level being part of the Metadata stores exact information about the tables and columns that contain the business data While a concept such as Customer may be rather common in several institutions the way the data about customers is organised
26. Attr PrimaryKey IndexAttr PivotAttr PivotAttr M PivotAttr F M 5 5 NULL M 4 4 NULL F T NULL T This operator Pivotize creates a new output concept that does not contain the original index nor pivot attribute Instead the rows in the output concept are 4 2 CONCEPT OPERATORS 37 grouped by additional attributes and optionally aggregated using an aggregation function In the example above if no GroupBy Attributes are selected and SUM is the aggregation function the output will be PivotAttr M PivotAttr However if the primary key attribute is selected as a GroupBy Attribute the output will look like this because each value of the primary key is its own group PrimaryKey PivotAttr M PivotAttrF 1 5 NULL 2 4 NULL 3 NULL 7 The values in TheIndexAttribute must be mapped by hand to names for the new attributes For this the parameters IndexValue and MappedAttribute are provided in loops so that each loop specifies one mapping Every value used in Index Value should of course occur in TheIndexAttribute ThePivotAttribute will not appear in TheOutputConcept only in TheInputConcept The AggregationOperator is one of NONE SUM MIN MAX AVG average and COUNT If NONE is chosen no aggregation will take place NONE cannot be chosen if any GroupBy attributes are given the parameter TheGroupByAt tributes is optional The parameter NullOrZero specifies whether NULL or 0 should be used whereever empty fields
27. CHAPTER 2 BASIC CONCEPTS IN MININGMART Feature A feature is an attribute of a concept For example a concept Cus tomer may have the features Age Income Address etc A concept Product may have the features Price Number of Sales and others There exist two kinds of features in MiningMart BaseAttributes and MultiColumn Features Like concepts features can be parameters BaseAttribute A BaseAttribute is a feature It represents a single attribute of the MiningMart concept it belongs to BaseAttributes are connected to Columns which represent a database column on the relational level For example the concept Customer may have a BaseAttribute Age which is connected to a column of a table in the database called cust_age MultiColumnFeature A MultiColumnFeature is a feature It represents a conceptual bundle of attributes of a concept Thus it consists of at least two BaseAttributes For example a MultiColumnFeature Address may be used to bundle the BaseAttributes Street City and TelephoneNumber Mul tiColumnFeatures are a conceptual device in MiningMart which may be used to structure the concepts in order to give a more intuitive view on the business data Relation A relation represents a database link between two tables It can either be a 1 n relation or an n m relation Relations in MiningMart store the information about foreign keys and primary keys as w
28. Enabling End User Datawarehouse Mining MiningMart Version 1 1 User Guide The MiningMart User Guide Timm Euler Martin Scholz Dortmund Germany June 26 2007 Contents 1 Getting Started LI Support 2x esee nu OG eS ee ee no ee A dou 1 2 Licensing information 13 File names in this 1 4 Connecting to the database LAT Or clec c a eae we yh RT OE A n 14 2 PostereSQ s soe OE oe eRe le L43 MySQL aud vA Se BO E dies 1 4 4 Editing the connection 1 4 5 M4datainformation 1 5 Installing and running MiningMart 1 5 1 General Hints es s ee RA GR p E De 1 5 2 W ThdOWS 1 afa erus grt tnos s 15 3 LinuxorSolaris aaa a a ioi BOE A 1 6 Integration with YALE a a 1 7 Upgrading from older versions e 18 External Code 4o ied kk ke a Ro vr A 1 9 List of operators that use external algorithms 2 Basic Concepts in MiningMart 2 1 The MiningMart 2 2 Basic notions in MiningMart 3 The Graphical User Interface General issues eue ale ace ae be RR Ae eae aue di 3 2 Caseedibor Sanus hyp b LARGE madura 3 3 Concept editor 24 06 eee E vues 4 Operators and their Parameters 4 1
29. LossFunctionPos than predicting too low If both val ues are equal no bias is used The parameter C balances training error against generalisation quality positive values between 0 01 and 1000 have been used successfully in the literature Epsilon limits the allowed error an example may produce small values under 0 5 should be used ObjType TheInputConcept inherited inherited TheTargetAttribute ThePredictingAttributes BA List KernelType SampleSize LossFunctionPos LossFunctionNeg Epsilon UseDB_SVM TheKey TheOutput Attribute see explanation above see explanation above positive real try 1 0 positive real try 1 0 positive real try 1 0 positive real try 0 1 optional one of true false optional inherited lt a lt a lt lt lt lt lt LLLA 222222222253 ww gt gt O e H 4 4 9 LinearScaling A scaling operator Values in TheTargetAttribute are scaled to lie between NewRangeMin and NewRangeMaz in TheOutputAttribute ObjType TheInputConcept IN inherited TheTargetAttribute BA IN inherited NewRangeMin V IN new min value NewRangeMax V IN new max value TheOutputAttribute BA OUT inherited 4 4 10 LogScaling A scaling operator Values in TheTargetAttribute are scaled to their logarithm to the given LogBase ObjType TheInputConcept IN inherited TheTargetAttribute BA IN inherited LogBase TheOutput Attribute BA OUT inherited 58 CHAPTER 4 OPERA
30. Oracle MySQL and PostgreSQL If you do not have access to an installed and running database system of one of these three types you cannot use MiningMart MiningMart distinguishes between two types of data the data you would like to transform prepare analyse and its own data see chapter 2 The former is called business data in this document while the latter is called M4 data since MiningMart is based on a model called M4 These two types of data can be stored in the same database schema accessible by the same database user or in different schemas with two different database users the latter setting is a little safer and is the default setting but it may also be convenient to use the first setting When two schemas are used you would typically create one database schema database user explicitly for the M4 data while using an existing schema user for your existing data When you start MiningMart for the first time you need to have certain pieces of information at hand in order to tell MiningMart how it should connect to your database and to which database users the data belongs The following sections explain this for each type of database system Please refer also to the documentation of your database system 1 4 1 Oracle e An Oracle database has a global name or SID The database denoted by this name can have many database schemas and can be accessed by many database users You must ask the administrator of your Oracle database i
31. PostgreSQL e A PostgreSQL database has a global name The database denoted by this name can have many database schemas and can be accessed by many data base users You must ask the administrator of your PostgreSQL database in order to get this name The name is case sensitive If MiningMart asks for two different database names one for the business and one for the M4 data use the same name for both e You may need to know the JDBC driver class name of the JDBC driver you are using MiningMart offers a default driver class name which should work without problems but if you are using a different JDBC driver see section 1 5 you may have to change this default class name e Your PostgreSQL database is running on a certain host whose name you must tell MiningMart example dbserver mynet de If the database system is running on the same computer as MiningMart the default entry localhost can be used e Your PostgreSQL database accepts connections at a certain port Min ingMart offers the default port value for PostgreSQL 5432 but if your database uses a different port you must tell MiningMart so e The business data is accessible by a certain database user You must give MiningMart the name of this user e The business user s password must be given e If you have created an extra database user for MiningMart s own data the name for this user must be given to MiningMart as the M4 user If you do not want to create an extra user
32. TORS AND THEIR PARAMETERS 4 4 11 Support VectorMachineForRegression A data mining operator Values in TheTargetAttribute are used as target func tion values to train the SVM on examples that are formed with ThePredicting Attributes All ThePredictingAttributes must belong to ThelnputConcept The OutputAttribute contains the predicted values There are some SVM specific parameters the table gives reasonable values to choose if nothing is known about the data or SVMs For the KernelType only the following values Strings are possible dot polynomial neural radial anova Dot is the linear kernel and can be taken as default This operator can use two different versions of the Support Vector Machine algorithm One runs in main memory it needs the parameter SampleSize to de termine a maximum number of training examples The other runs in the data base it is used if the optional parameter UseDB_SVM is set to the String true When this version is used an additional parameter TheKey is needed which gives the BaseAttribute whose column is the primary key of TheInputConcept TheKey can be left out only if the ColumnSet that belongs to TheInputConcept represents a table rather than a view The database algorithm restricts the pos sible kernel types to dot and radial It can also use the parameter SampleSize You can only use it if you have the DB version of MYSVM installed With the parameters LossFunctionPos and LossFunctionNeg the loss func
33. URE SELECTION OPERATORS 53 ObjType inherited TheAttributes j see section 4 3 TheClassAttribute must be categorial NoOfAttributes positive integer SampleSize positive integer TheOutputConcept inherited 4 3 6 FloatForwardFeatureSelectionGivenNoOfAtt A Feature Selection operator This operator adds one feature a time starting from empty set until the required number of features NoOfAttributes is reached The attributes are selected with respect to TheClassAttribute the group opti mises the information dependence criterion Unlike the simple operator after adding a feature a check is performed if another feature should be removed Use this operator if only a small number of original attributes is to be selected The selection is done from the set of TheAttributes attributes not specified in this set are selected automatically ObjType aoe inherited TheAttributes see section 4 3 TheClassAttribute must be categorial NoOfAttributes positive integer SampleSize positive integer TheOutputConcept inherited 4 3 7 FloatBackwardFeatureSelectionGivenNoOfAtt A Feature Selection operator This operator removes one feature a time start ing from all attributes until the required number of features NoOfAttributes is reached The attributes are selected with respect to TheClassAttribute the group optimises the information dependence criterion Unlike the simple oper ator after removing a feature a check is performed if a
34. al begins ObjType TheInputConcept inherited InputTimeBA InputValueBA Tolerance non negative real number AverageValueBA Increase ValueBA OutputTimeStartBA OutputTimeEndBA TheOutputConcept inherited 4 2 27 An implementation of the well known Apriori algorithm for the data mining step It works on a sample read from the database The sample size is given by the parameter SampleSize The input format is fixed There is one input concept TheInputConcept having a BaseAttribute for the customer ID parameter CustID one for the transaction ID TransID and one for an item part of this customer transaction s itemset Item The algorithm expects all entries of these BaseAttributes to be integers No null values are allowed It then finds all frequent parameter MinSupport rules with at least the specified confidence parameter MinConfidence Please keep in mind that these settings especially the minimal support are applied to a sample The output is specified by three parameters TheOutputConcept is the con cept the output table is attached to It has two BaseAttributes PremiseBA for the premises of rules and ConclusionBA for the conclusions Each entry for one of these attributes contains a set of whitespace separated item IDs integers 4 2 CONCEPT OPERATORS 49 ObjType Concept inherited CustID customer id integer not NULL TransID transaction id integer not NULL Item item id
35. apurConcent inherited AttributesComputeSum AttributesComputeCount BA List AttributesCompute Unique BA List AttributesComputeMin BA List AttributesComputeMax BA List AttributesComputeDistrib BA List GroupBy BA List DistribValues V List TheOutputConcept CON inherited numeric see text as GROUP BY in SQL 222222222 eee es 4 2 7 RowSelectionByQuery The output Concept contains only records that fulfill the SQL condition formu lated by the parameters of this operator This operator is loopable If applied in a loop the conditions from the different loops are connected by AND Every condition consists of a left hand side an SQL operator and a right hand side Together these three must form a valid SQL condition For example to specify that only records rows whose value of attribute sale is either 50 or 60 should be selected the left condition is the BaseAttribute for sale the operator is IN and the right condition is 50 60 If this operator is applied in a loop only the three parameters modelling the condition change from loop to loop while input and output Concept remain the same ObjType TheInputConcept IN inherited same in all loops TheLeftCondition BA IN any BA of input concept TheConditionOperator V IN an SQL operator lt TheRightCondition TheOutputConcept CON OUT inherited same in all loops 4 2 8 RowSelectionByRandomSampling Puts atmost as many rows into the output Conce
36. arts of the operator chain All features in TheOutputConcept must have a corresponding feature with the same name in TheInputConcept 43 FEATURE SELECTION OPERATORS 51 ObjType TheInputConcept CON IN inherited TheSelectedFeatures FEA IN will be in TheOutputConcept OUT TheOutputConcept CON inherited 4 3 2 RemoveFeatures This operator can be used for manual feature selection but here the user specifies all features not to be selected This is done by providing all and only the features that are to be removed from TheInputConcept The operator then simply copies all other features from TheInputConcept to TheOutputConcept It can be used to get rid of features that are not needed in later parts of the operator chain All features in TheOutputConcept must have a corresponding feature with the same name in TheInputConcept ObjType TheInputConcept CON IN inherited FeaturesToRemove FEA IN will not be in TheOutputConcept OUT TheOutputConcept CON inherited 4 3 3 FeatureSelectionWithS VM A Feature Selection operator This operator uses the a estimator as computed by a Support Vector Machine training run to compare the classification perfor mance of different feature subsets Searching either forward or backward it finds the best feature subset according to this criterion Thus it performs a simple beam search of width 1 TheTargetAttribute must be binary as Support Vector Machines can only solve binary classification pro
37. ase at a time A case contains the processing steps which may be organised in chains Cases can be exported and imported They are the unit of knowledge sharing the web plat form described in chapter 5 lists successful cases which were exported by other MiningMart users and can be downloaded and imported Only the conceptual level is ex or imported after import you need to connect that information to the relational level Step A step represents a single processing task in a case In each step ex actly one operator is applied Steps are represented by icons in the MiningMart workspace the case editor Steps are applied to the data in a certain user defined order where the input of one step depends on the output of the previ ous one These dependencies are represented in the MiningMart workspace by arrows They form a Directed Acyclic Graph DAG that is there must not be any cyclic dependencies You can give explanatory names to the steps of a case Chain Any number of steps can be organised into chains This provides a means to organise large cases with many steps so that the functions performed in that case become clearer Comprising several steps which together perform some definable task like data cleaning for example gives a better overview of the case You can give explanatory names to the chains of a case You can also nest chains i e have chains inside chains 2 2 BASIC NOTIONS IN MININGMART 21 Operator An operator per
38. ash It takes one of two values Priority or ValueCombination If the former is chosen then the value of The TargetAttribute is chosen over the one from the Attribute ToMerge only for clashes because the target attribute has priority over the attribute to merge If the latter value is taken a new value is introduced for TheOutputAttribute that represents the particular combination of the values that have clashed This operator should only be applied to sparsely populated attributes in order to create better populated attributes It is also useful for combining an attribute with missing values with an attribute that contains predictions for the missing values For this latter application choose the priority clash resolvation method and make the attribute with missing values the target attribute and the attribute with predicted values the attribute to merge ObjType TheInputConcept CON IN inherited TheTargetAttribute BA IN inherited AttributeToMerge BA IN ClashResolvation V IN Priority or ValueCombination OUT TheOutputAttribute result of merge 4 4 8 MissingValuesWithRegressionS VM A Missing Value operator Each missing value in TheTargetAttribute is replaced by a predicted value For prediction a Support Vector Machine SVM is trained in regression mode from ThePredictingAttributes taking TheTargetAttribute values that are not missing as target function values All ThePredictingAt tributes must belong to TheInp
39. atureConstruction 59 44 14 60 4 4 15 TimelntervalManualDiscretization 60 4 4 16 NumericIntervalManualDiscretization 61 4 4 17 EquidistantDiscretizationGivenWidth 61 4 4 18 EquidistantDiscretizationGivenNoOfIntervals 62 4 4 19 EquifrequentDiscretizationGivenCardinality 62 4 4 20 EquifrequentDiscretizationGivenNoOfIntervals 62 4 4 21 UserDefinedDiscretization 63 4 4 22 ImplicitErrorBasedDiscretization 63 4 4 23 ErrorBasedDiscretizationGivenMinCardinality 64 4 4 24 ErrorBasedDiscretizationGivenNoOflnt 64 4 4 25 GroupingGivenMinCardinality 65 4 4 26 GroupingGivenNoOfGroups 65 4 4 27 UserDefinedGrouping 65 4 4 28 UserDefinedGroupingWithDefaultValue 66 4 4 29 ImplicitErrorBasedGrouping 66 4 4 30 ErrorBasedGroupingGivenMinCardinality 67 4 4 31 ErrorBasedGroupingGivenNoOfGroups 67 4 5 Operators creating relationships lt soo o 68 4 5 1 CreateOneToManyRelation 68 4 5 2 CreateManyToManyRelation 68 4 6 Other Operators 69 4 6 1 ComputeSVMError 69 4 6 2 69 5 The Case Repository 71 5 1 The Internet Presentation of Cases 71
40. ble The result is available as a database view The first thing you need to have is a PrimaryKey feature in your example set view represented by TheInputConcept Unlike other operators this operator will not work correctly if the specified primary key attribute is not unique Usually not all of the available attributes will be available for prediction so an array of PredictingAttributes has to be specified Please note that the primary key must not be part of this list The model is referenced by an absolute path in your file system to the model file written by Yale Finally the base attribute to be predicted PredictedAttribute and TheOutputConcept need to be specified 44 CHAPTER 4 OPERATORS AND THEIR PARAMETERS The order of predicting attributes must be the same as during learning If you want to induce a model with Yale using a database view then please give an explicit list of attributes in the SELECT part of the DatabaseExampleSource operator of Yale The order of attributes in MiningMart is given by the order in the array PredictingAttributes It is possible to apply this operator in loops The input and output concept and the predicting attributes will be the same for all loops while the model file and the output base attribute should change for each loop ObjType TheInputConcept inherited PrimaryKey a unique attribute PredictingAttributes attribute list as during learning ModelFile absolute path to model file PredictedAttri
41. blems The a estimator can only be computed for classification problems The parameter PositiveTarget Value specifies the class label of the positive class There are some SVM specific parameters the table gives reasonable values to choose if nothing is known about the data or SVMs For the KernelType only the following values Strings are possible dot polynomial neural radial anova Dot is the linear kernel and can be taken as default This operator can use two different versions of the Support Vector Machine algorithm One runs in main memory it needs the parameter SampleSize to de termine a maximum number of training examples The other runs in the data base it is used if the optional parameter UseDB_SVM is set to the String true When this version is used an additional parameter TheKey is needed which gives the BaseAttribute whose column is the primary key of TheInputConcept TheKey can be left out only if the ColumnSet that belongs to TheInputCon cept represents a table rather than a view The database algorithm restricts the possible kernel types to dot and radial It can also use the parameter SampleSize 52 CHAPTER 4 OPERATORS AND THEIR PARAMETERS ObjType TheInputConcept inherited TheAttributes see section 4 3 TheTargetAttribute must be binary PositiveTarget Value the positive class label KernelType see explanation above SampleSize see explanation above positive real try 1 0 Epsilon positive real try 0 1 Us
42. bute new attribute to be predicted TheOutputConcept inherited 4 2 19 CreatePrimaryKey Simple concept operator for creating a view representing the same concept with an additional primary key If the original concept has duplicates then the ability of this operator to remove SELECT DISTINCT or keep these duplicates might be interesting This operator is also applicable to views that already have a primary key In the database the newly created attribute will be used for indexing in the output view Parameters are TheInputConcept and TheOutputConcept the PrimaryKey to be added and a flag AllowDuplicates indicating whether the created view should explicitly make sure that the same tuple will not appear multiple times in the output concept ObjType TheInputConcept pou inherited PrimaryKey the new primary key AllowDuplicates one of true or false TheOutputConcept box inherited 4 2 20 AttributeDerivation This is a general operator whose behaviour users can determine by Java pro gramming It creates a concept that contains all attributes of the input concept and one additional attribute The values of this additional attribute are deter mined by running a Java program that users can provide The file MM_HOME lib AttrDeriv jar provides an example implementation in the file AttrDerivExample java that simply copies the contents of the optional target attribute to the output attribute But in general any
43. butes one after the other and produces the respective number of output attributes For each loop the target attribute the output attribute and other parameters namely those marked as looped below can be different but the input concept is the same in all loops For the concept operator RowSelectionByQuery looping means that several query conditions are formulated using the parameters of this operator one set of parameters for each condition and that they are connected with AND See the description of this operator In the following sections all current operators are listed with their exact name a short description and the names of their parameters In general all input BaseAttributes belong to the input Concept and all output BaseAttributes belong to the output Concept 4 2 Concept operators All Concept operators take an input Concept and create at least one new ColumnSet which they attach to the output Concept The output Concept must have all its Features attached to it before the operator is compiled All Concept operators have the two parameters ThelnputConcept and TheOutputConcept which are marked as inherited in the following parameter descriptions 4 21 MultiRelationalFeatureConstruction Takes a list of concepts which are linked by relationships and selects specified Features from them which are collected in the output Concept via a join on the concepts of the chain To be more precise Relationships are only def
44. cretization operator Numeric attributes are discretized and the output is a categorial attribute This operator divides the range of TheTargetAttribute into intervals according to user given cutpoints TheCutpoints which is a list of values which each give a cutpoint for the intervals to be created The cutpoints must be given in ascending order Values of TheOutputAttribute can be specified in the parameter Label ObjType Concept inherited TheTargetAttribute must be numeric TheCutpoints see text ClosedTo one of LEFT or RIGHT Label optional TheOutputAttribute should be categorial 4 4 22 ImplicitErrorBasedDiscretization A discretization operator Numeric attributes are discretized and the output is a categorial attribute This operator divides the range of TheTargetAttribute into intervals by merging subsequent values with the same majority class or classes given in TheClassAttribute TheClassAttribute contains the labels of an example as in a Machine Learning setting The resulting intervals minimize the classification error If FullMerge is set to YES then an interval with two or more majority classes is merged with its neighbour if both intervals share the same majority class The parameter SampleSize gives a maximum number of learning examples for the external algorithm ObjType inherited TheTargetAttribute must be numeric TheClassAttribute must be categorial ClosedTo one of LEFT or RIGHT FullMerge on
45. cretized and the output is a categorial attribute This operator divides the range of TheTargetAttribute into intervals with given Cardinality number of examples whose values are in the interval The first and the last interval cover also the values out of range CardinalityType decides how the parameter Cardinality is to be interpreted Values of TheOutputAttributecan be specified in the parameter Label this makes sense only if CardinalityType is RELATIVE ObjType TheInputConcept inherited TheTargetAttribute must be numeric CardinalityType ABSOLUTE or RELATIVE Cardinality positive ClosedTo one of LEFT or RIGHT Label j optional TheOutputAttribute should be categorial 4 4 20 EquifrequentDiscretizationGivenNoOfIntervals A discretization operator Numeric attributes are discretized and the output is a categorial attribute This operator divides the range of TheTargetAttribute into the given number of intervals NoOfIntervals The intervals have the same cardinality number of examples with values within the interval The first and the last interval cover also the values out of range Values of TheOutputAttribute can be specified in the parameter Label 4 4 FEATURE CONSTRUCTION OPERATORS 63 ObjType Concept inherited TheTargetAttribute must be numeric NoOfIntervals positive integer gt 1 ClosedTo one of LEFT or RIGHT Label optional TheOutputAttribute should be categorial 4 4 21 UserDefinedDiscretization A dis
46. d tables with identical names are deleted If you have used MiningMart before this will delete all your cases Please refer to section 1 7 to learn about saving your old cases Please refer to the file MM HOME m4install README txt for further informa tion on how to prepare the database manually or simply start the system if you want the system to prepare the database automatically 1 5 Installing and running MiningMart 1 5 1 General hints MiningMart works with relational databases Three database management sys tems DBMS are supported Oracle version 8 1 6 or higher PostgreSQL ver sion 7 2 or higher and MySql tested for version 5 0 and higher Although MiningMart uses standard SQL wherever possible access to certain important database system tables is not standardised among DBMS vendors so that adapt ing MiningMart to other DBMS is not difficult but has not been done yet 12 CHAPTER 1 GETTING STARTED MiningMart distinguishes between business data and its own 4 data see also chapter 2 The business data is the collection of tables you want to prepro cess or analyse One specific database schema should be used for this data under MySql a specific database should be used This schema is called business schema everywhere in this documentation For information on database schemas and how to create and manipulate them please refer to your Oracle PostgreSQL or MySql documentation It is useful to create a second database
47. dMovingFunction This operator works like SimpleMovingFunction section 4 2 23 but the weighted average is computed The window size is not given explicitly but is determined from the number of Weights given The sum of all Weights must be 1 ObjType Thelnpui Concept inherited InputTimeBA Input ValueBA Weights j sum must be 1 Distance positive integer Output TimeStartBA OutputTimeEndBA Output ValueBA TheOutputConcept inherited 4 2 25 ExponentialMovingFunction A time series smoothing operator For two values with the given Distance the first one is multiplied with TailWeight and the second one with Head Weight The resulting average is written into Output ValueBA and becomes the new tail value HeadWeight and TailWeight must sum to 1 ObjType inherited InputTimeBA Input ValueBA Head Weight Tail Weight Distance positive integer OutputTimeBA Output ValueBA TheOutputConcept inherited 48 CHAPTER 4 OPERATORS AND THEIR PARAMETERS 4 2 26 SignalToSymbolProcessing A time series abstraction operator Creates intervals their bounds are given in OutputTimeStartBA and OutputTimeEndBA The average value of every interval will be in Average ValueBA The average increase in that interval is in Increase ValueBA Tolerance determines when an interval is closed and a new one is opened if the average increase interpolated from the last interval deviates from a value by more than Tolerance a new interv
48. diting otherwise the new value is not recognised e It is a good idea to save the case you are editing regularly in particular before starting the compiler e In general icons in both editors can be freely moved around allowing to create a clear web structure of steps or concepts For large cases the Bird s view can be very helpful in both editors e Every MiningMart item can be documented There is a Description field in the lower left corner of the working area which always applies to the 3 2 CASE EDITOR 27 I MiningMart DMTT Version 2 Case Insert Compile Tools Window Help RaeBweseS t e HH s Ea bur Varson2 Droning Ch For Drawing amp GW HandleCustomerUata Chain P3 am HandleRevenueData Chain aio HandleServicesData Chain ur 90 Select Client Ids For Mining Chain MarkN nCRurnera SelectNanChurners RenameAttribute SelectKeyAhd MarkSomeChurnedCustorners Step MarkNonChumers Step SelectM arkedChumers Step SelectNonChurners Step ati VarkSomeChurnedCustomera dChurners SelectKeyAndChurnMonth UnifyExamples MaterializeDaleSetKeys Dependency 6 Depend Figure 3 1 The case editor currently selected element You can enter any text here it will be saved and displayed whenever the element is selected again For BaseAttributes the description field ca
49. e compiler creates Columns for only some of them the selected ones This means that in later steps some of the features that are used for the input of an operator may not have a Column If the operator depends on a certain feature the compiler checks whether a Column is present and shows an error message if no Column is found If the operator is executable without that Column no error occurs All feature selection operators have a parameter TheAttributes which spec ifies the set of features from which some are to be selected Again this is not true for FeatureSelectionByAttributesand RemoveFeatures see 4 3 1 and 4 3 2 The parameter is needed because not all of the features of TheInputConcept can be used as they may include a key attribute or the target attribute for a data mining step which should not be deselected This means that all attributes from ThelnputConcept that are not listed as one of TheAttributes will be present in TheOutputConcept both on the conceptual and on the relational level 4 3 1 FeatureSelectionBy Attributes This operator can be used for manual feature selection which means that the user specifies all features to be selected This is done by providing all and only the features that are to be selected in TheOutputConcept The operator then simply copies those features from TheInputConcept to TheOutputConcept which are present in TheOutputConcept It can be used to get rid of features that are not needed in later p
50. e of YES or NO SampleSize optional positive integer TheOutput Attribute should be categorial 64 CHAPTER 4 OPERATORS AND THEIR PARAMETERS 4 4 23 ErrorBasedDiscretizationGivenMinCardinality A discretization operator Numeric attributes are discretized and the output is a categorial attribute This operator divides the range of TheTargetAttribute into intervals with cardinality greater or equal to MinCardinality MinCardinal ityType decides if MinCardinality values are read as absolute values integers or relative values real between 0 and 1 TheTargetAttribute is divided into intervals with respect to TheClassAttribute but unlike the implicit discretiza tion intervals with single majority class are further merged if they do not have the required cardinality This will increase the classification error TheClassAt tribute contains the labels of an example as in a Machine Learning setting The parameter SampleSize gives a maximum number of learning examples for the external algorithm ObjType TheInputConcept inherited TheTargetAttribute must be numeric TheClassAttribute must be categorial MinCardinalityType ABSOLUTE or RELATIVE MinCardinality ClosedTo SampleSize TheOutputAttribute positive one of LEFT or RIGHT optional positive integer should be categorial 4 4 24 ErrorBasedDiscretizationGivenNoOfInt A discretization operator Numeric attributes are discretized and the output is a categorial attribute This
51. eDB_SVM optional one of true false TheKey optional SearchDirection one of forward backward TheOutputConcept inherited IN IN IN IN IN IN IN IN IN 4 3 4 SimpleForwardFeatureSelectionGivenNoOfAttributes A Feature Selection operator This operator adds one feature a time start ing from the empty set until the required number of features NoOfAttributes is reached The attributes are selected with respect to TheClassAttribute the group optimises the information dependence criterion Use this operator if only a small number of original attributes is to be selected The selection is done from the set of TheAttributes attributes not specified in this set are selected automatically ObjType TheInputConcept inherited TheAttributes j see section 4 3 TheClassAttribute must be categorial NoOfAttributes positive integer SampleSize positive integer TheOutputConcept inherited 4 3 5 SimpleBackwardFeatureSelectionGivenNoOfAttributes A Feature Selection operator This operator removes one feature a time start ing from all attributes until the required number of features NoOfAttributes is reached The attributes are selected with respect to TheClassAttribute the group optimises the information dependence criterion Use this operator if a large number of original attributes is to be selected The selection is done from the set of TheAttributes attributes not specified in this set are selected auto matically 4 3 FEAT
52. ecifies indirectly which of the three segmentation operators is reversed If a SegmentationStratified operator is reversed section 4 2 10 this param eter gives the name of the BaseAttribute that was used for stratified segmen tation Note that this BaseAttribute will belong to TheOutputConcept of this operator because the re unified ColumnSet contains different values for this at tribute whereas before the execution of this operator the different ColumnSets did not contain this attribute but each represented one of its values If a SegmentationByPartitioning operator is reversed section 4 2 11 this parameter must have the value Random If SegmentationWithK Mean operator is reversed section 4 2 12 this pa rameter must have the value KMeans Note that the segmentation to be reversed by this operator can be any segmentation in the chain before this operator TheInputConcept inherited UnsegmentAttribute see text TheOutputConcept inherited 4 2 14 RemoveDuplicates This operator produces an output concept that is a copy of the input concept but all duplicate entries are removed from the corresponding ColumnSet ObjType TheInputConcept CON IN inherited TheOutputConcept CON OUT inherited 4 2 15 Repeat The repeat operator will create the same view of TheInputConcept as often as specified by HowOften in the OutputConcept which leads to repeated ap 4 2 CONCEPT OPERATORS
53. ed must be binary ThePredictingAttributes BA List KernelType V SampleSize C Epsilon UseDB SVM TheKey A PositiveTarget Value V TheOutputAttribute BA Z see explanation above see explanation above positive real try 1 0 positive real try 0 1 optional one of true false optional the positive class label inherited V V V V OO 2 2 2 2 2 Z 12 2 2 O g H 4 4 13 GenericFeatureConstruction This operator creates an output attribute on the basis of a given SQL definition Parameter SQL String The definition must be well formed SQL defining how values for the output attribute are computed based on one of the attributes in TheInputConcept To refer to the attributes in TheInputConcept the names of the BaseAttributes are used and not the names of any Columns For exam ple if there are two BaseAttributes named INCOME and TAX in ThelIn putConcept this operator can compute their sum if SQL_String is defined as INCOME TAX 60 CHAPTER 4 OPERATORS AND THEIR PARAMETERS TheTargetAttribute is needed to have a blueprint for TheOutputAttribute The operator ignores The TargetAttribute except that it uses its conceptual data type and the relational data type of its column to specify the corresponding data types for TheOutputAttribute ObjType inherited TheInputConcept CON IN inherited TheTargetAttribute BA IN inherited specifies datatype SQL String V IN see text BA OUT
54. ed if it is associated with at least one Columnset the latter directly represents a database table or view There are two ways of creating a DB concept The first is to choose Insert New concept then give the name for the concept and then add BaseAttributes to it Adding BaseAttributes is done by right clicking on the concept icon choosing Show BaseAttributes and then using the New button to add single BaseAt tributes The second way of creating a DB concept is to choose Insert Concept from Table and then choose directly the database table or view this concept is going to represent Then the names of the concept and its BaseAttributes will be copied from the names of the database table or view and its columns The advantage is that a concept created in this way is immediately connected the disadvantage is that the names may have to be changed manually if the ones from the database objects are unsuitable A DB concept that is not connected to a columnset shows a red database icon a connected one shows a green one Further the number of columnsets is indicated in the concept icon the number 0 indicates a missing connection to columnsets Only DB concepts can be manually connected to database objects represented by columnsets To this end right click on the concept icon and choose Create connection This will display a tab in the right hand area which allows you to choose the database table or view for the connection Now you have two options
55. ell as optional cross tables so that the operators can use this information Thus relations can be parameters like concepts and features As such they should belong to the con ceptual level however since they also store database related information they might also be said to belong to both levels conceptual and relational A rela tion is displayed in the Concept Editor as a solid arrow for 1 n relations or a solid double arrow for n m relations ColumnSet ColumnSets are MiningMart objects that directly represent a database table or view As such they belong to the relational level Each ColumnSet is connected to exactly one concept but a concept may have more than one ColumnSet Each ColumnSet contains one or more Columns Column Column is a MiningMart object that directly represents a column in a database table or view Columns belong to the relational level Each Column belongs to exactly one ColumnSet but a ColumnSet can contain any positive number of Columns Compiler compilation The MiningMart compiler performs the central task in MiningMart it executes operators That is it reads the input parameters of an operator applies the operator specific processing to the data that corresponds to is connected to this input and creates the output data and connects it to the concepts or features that are specified by the operator s output parameters 2 2 BASIC NOTIONS IN MININGMART 23 The compilation of any step depends
56. enter the same user name as for the business data e The MA user s password must be given If your M4 user is the same as your business user then of course the password is also the same 10 CHAPTER 1 GETTING STARTED 1 4 3 MySQL You may need to know the JDBC driver class name of the JDBC driver you are using MiningMart offers a default driver class name which should work without problems but if you are using a different JDBC driver see section 1 5 you may have to change this default class name Your MySQL database is running on a certain host whose name you must tell MiningMart example dbserver mynet de If the database system is running on the same computer as MiningMart the default entry localhost can be used Your MySQL database accepts connections at a certain port MiningMart offers the default port value for MySQL 3306 but if your database uses a different port you must tell MiningMart so The business data is accessible in a certain database You must give Min ingMart the name of this database The business data is accessible by a certain database user You must give MiningMart the name of this user The business user s password must be given You can create an extra database and an extra user for MiningMart s own data If you do so the database name user name and password for this M4 database must be given to MiningMart Otherwise you can use the same entries as for the business data 1 4 4 Editi
57. entialMovingFunction 47 4 2 26 SignalToSymbolProcessing 48 4 2 27 ADEO e e Bad A r dd EEEE ees eR 48 4 2 28 Feature Construction with TF IDF 49 4 220 UBIOD 9 E a SEEM Eee i fee Ss ydo un 49 Feature selection operators 50 4 3 1 FeatureSelectionByAttributes 50 4 3 2 RemoveFeatures eA 51 4 3 8 FeatureSelectionWithSVM llle 51 4 3 4 SimpleForwardFeatureSelectionGivenNoOfAttributes 52 4 3 5 SimpleBackwardFeatureSelectionGivenNoOfAttributes 52 4 3 6 FloatForwardFeatureSelectionGivenNoOfAtt 53 4 3 7 FloatBackwardFeatureSelectionGivenNoOfAtt 53 4 3 8 UserDefinedFeatureSelection 54 Feature construction operators 54 4 41 AssignAverageValue 54 4 4 2 AssignModalValue 54 4 4 3 AssignMedianValue 54 4 4 4 55 4 4 5 55 4 4 6 Binariy s 4 0 BRAD A eee ox 55 4 4 7 MergeAttributes 56 4 4 8 MissingValuesWithRegressionSVM 56 4 4 9 LinearScaling amp umm REG eee ok 57 44 10 Logsealing 2 34 0e grado we wha eb 57 CONTENTS 5 4 4 11 SupportVectorMachineForRegression 58 4 4 12 Support VectorMachineForClassification 58 4 4 13 GenericFe
58. f results in knowledge discovery It is assumed that you are familiar with general concepts in Knowledge Discovery Data Mining However we give a few informal definitions here to provide a common understanding More information about Data Mining can be found on the MiningMart webpages http mmart cs uni dortmund de e The Knowledge Discovery Process refers to the technical steps of data acquisition data cleaning data preparation as well as data mining and model testing e Data Mining is the step in the knowledge discovery process where a Ma chine Learning algorithm is applied to learn a model which is used to make predictions on new data e Preprocessing comprises all steps that are undertaken in order to bring the data into a format that is accessible for data mining The result of pre processing is the input for data mining without any further modifications The input for preprocessing is the data as it is stored in a data warehouse or even the operational database of an institution Section 2 1 gives an overview of the MiningMart approach to the knowledge discovery process In section 2 2 basic terms that are used in MiningMart are defined and explained Those terms will be used everywhere in the MiningMart system and documentation so it is a good idea to familiarize yourself with them 17 18 CHAPTER 2 BASIC CONCEPTS IN MININGMART 2 1 The MiningMart approach MiningMart provides support for knowledge discovery applications Th
59. forms a single precisely defined task on the busi ness data An operator is applied by creating a step setting its parameters and compiling the step A step s parameters define the input and output for an operator in terms of the data on the conceptual level There are two basic kinds of operators those whose output is a concept and those that add an extra feature to their input concept A few operators do not belong to either of these categories Examples for tasks that operators perform are the replacement of missing values in the data that belongs to the input concept or the creation of a new view on the data from the input concept or the selection of important features from the input concept etc A list of all operators with their technical description and details can be found in chapter 4 Parameter Parameters are related to steps they define their input and output on the conceptual level Some parameters that many steps need are TheInput Concept which defines the concept whose data is processed TheOutputConcept or TheOutputAttribute which define the output ie the result of processing etc For every operator its parameters are listed in detail in chapter 4 Concept A concept in MiningMart represents an everyday notion for which there exists data in the database For example as mentioned earlier a con cept Customer may exist in MiningMart and refer to one or more tables in the database that contain data about customers
60. he value of ThePivotizedAttributes that corresponds to the combination of the index values taken by the index attributes For clarification a simple example is given here Assume that this is the input table concept with fictitious distribution of gender in some cities Inhabitants Male Inhabitants Female London Paris Vienna Then reverse pivotization can produce the following output London Male London Female Paris Male Paris Female Vienna Male Vienna Female To achieve this ThePivotizedAttributes must be Inhabitants Male and In habitants Female two corresponding Indez Values must be given as Male and Female the name Gender must be given as parameter NameForIndezAttributes and the name Percentage must be given as parameter NameForPivotAttribute The optional parameter TheKeyAttribute can be set to any extra attribute if there are any here City The only effect is that a 1 n relation can be created between the input and output concept of this operator in the concept editor Object Type Theliput Concent CON inherited ThePivotizedAttributes BA List see text Index Values V List one per pivotized attribute TheKey Attribute BA optional NameForPivotAttribute V name of attribute in output NameForIndexAttributes V List name of attributes in output TheOutputConcept CON inherited 4 2 6 SpecifiedStatistics An operator which computes certain statistical values for the TheInputConcept The computed values a
61. hich may take some time ObjType TheInputConcept CON IN inherited TheTarget Attribute BA IN must be numeric OUT TheOutputAttribute inherited 4 4 2 AssignModalValue A Missing Value operator Each missing value in TheTargetAttribute is replaced by the modal value of that Column The operator computes the column statistics if they are not computed yet which may take some time ObjType TheInputConcept inherited TheTargetAttribute TheOutputAttribute BA inherited 4 4 3 AssignMedian Value A Missing Value operator Each missing value in TheTargetAttribute is replaced by the median of that Column The operator computes the column statistics if they are not computed yet which may take some time 4 4 FEATURE CONSTRUCTION OPERATORS 55 ObjType TheInputConcept inherited TheTargetAttribute TheOutputAttribute BA inherited 4 4 4 AssignDefault Value A Missing Value operator Each missing value in TheTargetAttribute is replaced by the Default Value ObjType TheInputConcept CON IN inherited TheTargetAttribute BA IN inherited Default Value V IN OUT inherited TheOutputAttribute 4 4 5 AssignStochastic Value A Missing Value operator Each missing value in TheTargetAttribute is replaced by a value which is randomly selected according to the distribution of present values in this attribute For example if half of the entries in The TargetAttribute have a specific value this value is chosen with a
62. icially added named NULL values which does not work for all datatypes Please note that if you have no primary keys defined and you have multiple occurences of the same tuples then SQL will usually remove all duplicates when applying a UNION operation In some cases you may prefer a bag or multi set semantics For this reason the parameter DataMode allows to switch between set and multi set 50 CHAPTER 4 OPERATORS AND THEIR PARAMETERS ObjType TheInputConcept inherited DaaMode V 3 ist V set or multi set mode TheOutputConcept inherited IN inherited CON Lis Union compatible Concepts 4 3 Feature selection operators Feature selection operators are also concept operators in that their output is a Concept but they are listed in their own section since they have some common special properties All of them except FeatureSelectionByAttributes see 4 3 1 and RemoveFeatures see 4 3 2 use external algorithms to determine which fea tures are taken over to the output concept This means that at the time of designing an operating chain it is not known which features will be selected How can a complete valid chain be designed then since the input of later op erators may depend on the output of a feature selection operator which is only determined at compile time The answer is that conceptually all possible features are present in the out put concept of a feature selection operator while th
63. inal values Values of TheOutputAttribute can be spec ified in the parameter Label 66 CHAPTER 4 OPERATORS AND THEIR PARAMETERS ObjType inherited TheTargetAttribute must be numeric TheGroupings j see text Label j optional TheOutput Attribute should be categorial 4 4 28 UserDefinedGroupingWithDefault Value A grouping operator Values of TheTargetAttribute are grouped under a certain label which is stored in TheOutputAttribute which must be categorial This operator creates groups of The TargetAttribute values according to specifications given by the user in TheGroupings which is a list of values Each of the values in the list in turn is a String that lists values of TheTargetAttribute which should be grouped together separating them with a comma Values not specified for grouping are grouped into default group Default Values of TheOutputAttribute can be specified in the parameter Label ObjType TAR inherited TheTargetAttribute must be numeric Default default group Label j optional TheOutput Attribute should be categorial 4 4 29 ImplicitErrorBasedGrouping A grouping operator Values of The TargetAttribute are grouped under a certain label which is stored in TheOutputAttribute which must be categorial This operator merges the values of TheTargetAttribute into groups with the same majority class or classes given in TheClassAttribute If FullMerge is set to yes then a group with two or more majori
64. ined 4 2 CONCEPT OPERATORS 35 by the user between initial Concepts of a Case Suppose there is a chain of initial Concepts C1 C such that between all C and Ci41 1 lt i lt n is the FromConcept of the i th Relationship and Cj is its ToConcept These Concepts may be modified in the Case being modelled to result in new Concepts C1 Ch where some C may be equal to C However the BaseAttributes that pops pond to the Relationship keys are still present in the new Concepts C By using their names this operator can find the key Columns and join the new Concepts C The parameter table below refers to this explanation Note that all input Concepts are the new Concepts C but all input Relations link the original Concepts C Object Type CON inherited TheConcepts CON List Concepts C5 C7 TheRelations REL List they link C Cn TheChainedFeatures BA or MCF List from C1 C TheOutput Concept CON inherited 4 2 2 JoinByKey Takes a list of concepts plus attributes indicating their primary keys and joins the concepts In TheOutputConcept only one of the keys will be present Each BaseAttribute specified in TheKeys must be a primary key of one of TheCon cepts thus the number of entries in TheConcepts and TheKeys must be equal If several of the input concepts contain a BaseAttribute or a MultiColumn Feature with the same name a special mapping mechanism is needed to re
65. ingMart This operator makes the combination of YALE and MiningMart more convenient If ThelnputConcept to this operator has more than one Columnset the YALE file produced will contain the YALE operator IteratingOperatorChain and for each Columnset a query file will be produced that is read in one itera tion of the operator chain in YALE The parameter QueryFilePrefiz is used for this purpose the query files produced will have a number added to the given prefix starting with 1 If there is only one Columnset in ThelnputConcept the parameter QueryFilePrefiz is ignored Both filename parameters expect full paths The parameters TheLabel and ThePrimaryKey are optional They provide information to YALE about the label for learning and the primary key of the data set Of course this information can also be added manually in YALE ObjType TheInputConcept CON IN TheLabel BA IN optional ThePrimaryKey BA IN optional ExperimentFileName V IN complete path and filename QueryFilePrefix V IN only for multiple columnsets 70 CHAPTER 4 OPERATORS AND THEIR PARAMETERS Chapter 5 The Case Repository One of the basic ideas behind MiningMart is the aspect of sharing knowledge about successful cases The MiningMart project has set up a central web plat form which allows the public exchange and documentation of exported cases This chapter describes how the platform can be used to benefit from other users work and to let others benefit
66. is done only for those entities of the To Concept for which related entities are available in the From Concept the latter are then aggregated What is more the aggregation is specified to range only over particular entities of the From Concept namely those whose value of TheTargetAttribute matches the value of TheTargetAttribute that is most frequent in the relationship For example suppose that two concepts with data about customers and products of a company are linked by a relationship that indicates which prod uct has been bought by which customer The From Concept has the product information and the To Concept has the customer data Then this operator can compute the number of times a customer has bought the product that has been bought most often by any customer Thus the operator computes a single new aggregated value for each entity in the customer concept the value may be empty if the customer has not bought the frequent product In this example the AggregationOperator would be COUNT Choosing VALUE_OF as Aggregation Operator omits any aggregation 46 CHAPTER 4 OPERATORS AND THEIR PARAMETERS ObjType IN TheRelation TheTargetAttribute BA IN from From Concept of TheRelation AggregationOperator V IN COUNT SUM or VALUE OF TheOutput Attribute BA added to To Concept of TheRelation TheOutputConcept 4 2 22 Windowing Windowing is applicable to time series data It takes two BaseAttributes from the input Concept one
67. it is mapped to the value Default Value To be able to cope with various time formats e g HH MI SS the operator reads the given format from the parameter TimeFormat 4 4 FEATURE CONSTRUCTION OPERATORS 61 ObjType Tel put Concept inherited TheTargetAttribute inherited type TIME IntervalStart looped lower bound of interval IntervalEnd looped upper bound of interval value to map time interval to one of I and E one of I and E Default Value value if no mapping applies TimeFormat ORACLE specific time format TheOutputAttribute inherited MapTo StartIncExc EndIncExc ce ce 222222222 4 4 16 NumericIntervalManualDiscretization This operator can be used to discretize a numeric attribute manually It is very similar to the operator TimelIntervalManualDiscretization described in 4 4 15 The looped parameters ntervalStart IntervalEnd StartIncExc EndIncExc and MapTo again specify a mapping to be performed If an input value does not be long to any interval it is mapped to the value Default Value TheTargetAttribute must be of type ordinal ObjType TheInputConcept inherited TheTargetAttribute inherited type ORDINAL IntervalStart looped lower bound of interval IntervalEnd looped upper bound of interval MapTo value to map time interval to StartIncExc one of I and E EndIncExc one of I and E Default Value
68. itial data model should be connected to MiningMart will then use its schema matching engine to determine a suitable mapping of the concepts of the initial data model to the tables views selected by you Before the connections are performed you 32 CHAPTER 3 THE GRAPHICAL USER INTERFACE are shown MiningMart s mapping suggestions and you can edit the mapping before creating the connections Sometimes the suggested mapping is only par tial because no tables that are similar enough to a given concept could be found In these cases you can still complement the suggested mapping After creating the connections you can always change the connection of a particular concept by right clicking on it and choosing Create connection see above The menu item Match Result of Step lets you connect the data model resulting from a particular Step which you will choose to some database tables or views The menu item Match Any Data Model finally can be used to let Mining Mart search for the best matching data model that fits to the tables views you selected for connecting The search will include the initial data model and the data model resulting from any Step The best point of matching will be deter mined based on name similarities This feature is useful when you download a large case from the case base see chapter 5 and you would like to find the point in the given preparation process where your data fits in best so that you can start pr
69. l and relational level must still be made see under connections See also chapter 5 24 CHAPTER 2 BASIC CONCEPTS IN MININGMART Chapter 3 The Graphical User Interface This chapter explains the main issues for handling MiningMart as a user Users should read chapter 2 and section 4 1 before they start to work with the sys tem MiningMart provides two views on the data transformation process one is centred on the steps and their dependencies the other focuses on the data sets and semantic links between them The former view is called the Case editor and is described in section 3 2 The latter view is called the Concept editor and is described in section 3 3 Before describing these editors section 3 1 gives a few general hints 3 1 General issues The central notion in MiningMart is the Case it can be compared to a file in text editors or other applications except that only one case can be open at any time The Case menu therefore offers options to open and save an existing case or to create a new case Cases are stored in the database Cases can be exported from the database to files or imported from files to the database Deleting a case means to remove it from the database The menu item New Case From Data allows to set up a new case directly from a set of database tables or views Simply select which tables and or views you would like to have represented in your case and MiningMart will create a case with the corresponding data
70. le if an obligatory parame ter is missing an error message will be given when this button is clicked The button Show input estimations will display some available information about the data contents of the input concept of the current step This information is usually estimated and therefore not always reliable but it can help to choose suitable parameter settings Steps can be collected in Chains Despite their names chains can not only contain linear sequences of steps but any transition structure Every case must have exactly one top level chain when you create a new case the first thing you have to do therefore is to create a chain using the menu item Insert Chain Chain All other steps and chains are subelements of this chain Double clicking on a chain opens a new working tab in which its steps and subchains are displayed Chains can be created by selecting any number of steps using the mouse to mark a rectangular area of selection and by then choosing Insert Chain Merge Chain A chain can be dissolved by selecting it and choosing Insert Chain Un merge Chain The OperatorTools pane on the right hand side of the working area provides shortcuts for most of these tasks Right clicking on a chain and choosing Show concepts involved opens the concept editor but displays only the concepts used as input or output in any step of that chain This allows to keep a clearer overview of the concepts in large cases When a step and its paramete
71. level and connected to keys in the database when the step that employs this operator is compiled Note that for technical reasons the two input concepts must be connected to a ColumnSet that represents a database table not a view since views cannot have primary keys If you would like to apply this operator to input views use materialisa tion operators section 4 2 16 in order to produce tables before applying this operator ObjType TheFromConcept CON must be a table TheToConcept CON must be a table FromConceptKeys BA List ToConceptKeys BA List TheRelation REL Name of new relationship 4 5 2 CreateManyToManyRelation This operator creates an n m relationship between its two given input con cepts The relationship is created on the conceptual level and connected to keys in the database when the step that employs this operator is compiled Note that for technical reasons the two input concepts must be connected to a ColumnSet that represents a database table not a view since views cannot have primary keys If you would like to apply this operator to input views use materialisa tion operators section 4 2 16 in order to produce tables before applying this operator The template for the cross table for the new relationship must be rep resented by a MiningMart concept given as parameter TheCrossTable This concept can represent a view The actual cross table for the new relationship will be created by the operator its na
72. me is given by the optional parameter NameForCross Table TheFromConcept CON must be a table TheToConcept CON must be a table TheCrossTable CON NameForCrossTable V optional FromConceptKeys BA List ToConceptKeys BA List CrossToFromConceptKeys BA List CrossToToConceptKeys BA List TheRelation REL Name of new relationship 4 6 OTHER OPERATORS 69 4 6 Other Operators 4 6 1 ComputeSVMError A special evaluation operator used for obtaining some results for the regres sion SVM Values TheTarget ValueAttribute are compared to those in The Predicted ValueAttribute The average loss is determined taking the asymmet ric loss function into account That is why the SVM parameters are needed here as well Note that they must have the same value as for the operator SupportVectorMachineForRegression which must have preceded this evalu ation operator in the chain ObjType TheInputConcept CON IN inherited TheTarget ValueAttribute BA IN actual values ThePredictedValueAttribute BA IN predicted values LossFunctionPos IN same values LossFunctionNeg IN as in SVM Epsilon IN ForRegression 4 6 2 PrepareForYale This operator has no output for MiningMart It produces an XML file that can be read by the YALE software The file will be given the name provided in ExperimentFileName In YALE this will provide a very basic experiment that starts by reading data from the database that is from the business data schema used by Min
73. n MiningMartHome properties point to a symbolic link and to change this link before starting MiningMart 1 4 5 MA data information The MiningMart system generally handles two database schemas users The first one is called the business data schema It holds the data you want to analyse and preprocess with the MiningMart system The second schema the so called M4 schema holds metadata information about your business data and your preprocessing chains You should not only reserve sufficient space on disk for your source business data but account some extra space for materializing some of the views For the M4 schema on the other hand 100 MByte should be sufficient for normal usage When you start MiningMart for the first time the system will create a number of tables in the M4 schema which it uses to store the cases etc unless these tables exist already from an older MiningMart version If any table in the MA schema happens to have a name that is used by MiningMart that table will be deleted If you want to make sure that none of your tables are deleted you can check the file MM HOME m4install XyourDbSystem CreateMATables sql for the table names that are going to be used You can also choose to prepare the M4 tables yourself by using the scripts in the directory MM HOME m4install lt yourDbSystem gt Note that when running the provided scripts that install M4 tables to be used by MiningMart into your M4 database schema any ol
74. n be found in the lower right corner of the working area when a BaseAttribute is selected in the concept editor see 3 3 3 2 Case editor In this editor you mainly create Steps Each step uses one operator that trans forms the input data and creates an output representation The output repre sentation can be inspected in the concept editor section 3 3 The output is not actually created in the database until the step is compiled A step is created by choosing the operator it employs either from the context menu which appears when you right click into the working tab or from the menu item Insert Operator Steps are connected by Transitions If two steps are not connected the out put of one is not available as input for the other The menu item Insert Tran sition can be used to create a transition from one step to the next Figure 3 1 shows a screenshot of the case editor with steps and transitions Double clicking on a step or choosing Insert Show StepSettings will open the parameter view for this step on the right hand side of the working area You can edit the parameters of the selected step here You should read at least section 4 1 and the section in chapter 4 that corresponds to the oper ator your step is using in order to understand how the parameters work and 28 CHAPTER 3 THE GRAPHICAL USER INTERFACE what values they should be given The Valid button can be used to check the syntactical validity of the parameters for examp
75. n order to get this name The name is case sensitive If MiningMart asks for two different database names one for the business and one for the M4 data use the same name for both e You may need to know the JDBC driver class name of the JDBC driver you are using MiningMart offers a default driver class name which should work without problems but if you are using a different JDBC driver see section 1 5 you may have to change this default class name e Your Oracle database is running on a certain host whose name you must tell MiningMart example dbserver mynet de If the database system is running on the same computer as MiningMart the default entry localhost can be used e Your Oracle database accepts connections at a certain port MiningMart offers the default port value for Oracle 1521 but if your database uses a different port you must tell MiningMart so e The business data is accessible by a certain database user You must give MiningMart the name of this user 1 4 CONNECTING TO THE DATABASE 9 e The business user s password must be given e If you have created an extra database user for MiningMart s own data the name for this user must be given to MiningMart as the M4 user If you do not want to create an extra user enter the same user name as for the business data e The M4 user s password must be given If your M4 user is the same as your business user then of course the password is also the same 1 4 2
76. ng examples All these BaseAttributes use the same prefix for their parameter name here The PredictingAttributes Such parameters which may contain a list are marked with the word List in the operator descriptions below 33 34 CHAPTER 4 OPERATORS AND THEIR PARAMETERS Coordinated parameters Sometimes several parameters are realised as arrays see previous paragraph and their semantics require that the values of two or more parameters are related to each other for the same position in the arrays For example the grouping operators have parameters that specify which value s of an input attribute are mapped to which value of the new output attribute These operators can map several groups of values to new values which values are mapped to which others is then specified by using the same position in the parameter arrays In the GUI such parameters are highlighted by blue frames the button Show groups can be used to verify that the values of the parameters are sorted in the right way This button is only available if any coordinated parameters are available Loops Special attention is needed if an operator is applied in a loop All feature con struction operators loopable further the concept operator RowSelection ByQuery is loopable Feature construction operators are applied to one target attribute of an input concept and produce an output attribute Looping means that the operator is applied to several target attri
77. ng the connection information Each MiningMart session accesses one particular database schema or data base under MySQL for the business data and one particular database schema database under MySQL for its own data The latter may be the same as the former If you have different database schemas with different sets of data you would like to analyse or if your database organisation changes you may want to edit the connection information that MiningMart uses This can be done in two ways After MiningMart has started choose the item Edit DB settings in the Tools menu Make sure to save and close your Case before doing so After editing the connection information the MiningMart system will exit and you have to start it again this is necessary to ensure that the right M4 data is read Before MiningMart has started you can edit the configuration file that MiningMart uses to store its connection information The complete path to and name of this file is given in the file MiningMartHome properties as the value for the key DB CONFIG PATH You can open the indicated file using 1 5 INSTALLING AND RUNNING MININGMART 11 an ordinary text editor and edit the connection settings You can keep several different versions of the connection configuration file and indicate in the file MiningMartHome properties which one you would like to use each time you start MiningMart Under Linux or Solaris it is perhaps more convenient to let the entry i
78. ngMart MiningMart supports preprocessing by applying a number of data process ing steps to its input Each step is graphically represented in the MiningMart workspace The complete sequence of steps is stored in the database and can also be exported to other sites where MiningMart is in use In this way a docu mentation of the whole knowledge discovery process is achieved All the details of a discovery process can be easily saved for later usage can be modified using a graphical user interface and can be transferred from one discovery process to another MiningMart uses a layer of abstraction of the actual data to model the knowledge discovery process This abstraction allows to publish successful dis covery applications for the benefit of other users while sensitive details are hidden This means that you can benefit easily from the work done by other MiningMart users The MiningMart web pages provide a central platform for the exchange of successful discovery processes called cases see section 2 2 On this platform such cases are described both in terms of their relevance to a business and in technical terms which allows you to find cases which are similar to the application you have in mind You can then download such cases into your MiningMart system and make the necessary modifications towards your own data The following section describes these central ideas in more detail by explain ing the basic MiningMart terminology Once yo
79. nother feature should be added Use this operator if a large number of original attributes is to be selected The selection is done from the set of TheAttributes attributes not specified in this set are selected automatically ObjType Ta a inherited TheAttributes see section 4 3 TheClassAttribute must be categorial NoOfAttributes positive integer SampleSize positive integer TheOutputConcept inherited 54 CHAPTER 4 OPERATORS AND THEIR PARAMETERS 4 3 8 UserDefinedFeatureSelection A Feature Selection operator This operator copies exactly those features from ThelnputConcept to TheOutputConcept that are specified in TheSelectedAt tributes It can be used for the same task as the operator FeatureSelection ByAttributes see 4 3 1 namely when the user knows which features to select ObjType TheInputConcept CON IN inherited TheSelectedAttributes BA list IN the user s selection OUT TheOutputConcept CON inherited 4 4 Feature construction operators Almost all operators in this section are loopable For loops TheInputConcept remains the same while TheTargetAttribute TheOutputAttribute and further operator specific parameters can change from loop to loop loop numbers start with 1 See also section 4 1 4 4 1 AssignAverageValue A MissingValue operator Each missing value in TheTargetAttribute is replaced by the average value of that Column The operator computes the column statis tics if they are not computed yet w
80. ocessing the data from there instead of from the beginning of the given Case which might include superfluous or unsuitable operations Chapter 4 Operators and their Parameters This chapter explains the operators currently available in MiningMart Oper ators form the basic building blocks of the data preprocessing phase in the Knowledge Discovery process There are a number of operators for different purposes as explained below 4 1 General issues There are mainly two kinds of operators distinguished by their output on the conceptual level those that have an output Concept Concept Operators listed in sections 4 2 and 4 3 and those that have an output BaseAttribute Feature Construction Operators listed in section 4 4 Two operators create relation ships see section 4 5 Further there are some special operators without any output on the conceptual level listed in section 4 6 All operators have parameters such as input Concept or output BaseAt tribute The parameters must be instantiated i e given values for every step that uses the operator The name of such a parameter is fixed for the operator and cannot be changed For instance TheInputConcept is used for the input Concept for all operators Parameter arrays Some operators have an unspecified number of parameters of the same type For example the learning operators take as input a number of BaseAttributes of the same concept and use them to construct their traini
81. of them contains time stamps the other values In the output Concept each row gives a time window there will be two time stamp BaseAttributes which give the beginning and the end of each time window Further there will be as many value attributes as specified by the WindowSize they contain the values for each window Distance gives the distance between windows in terms of number of time stamps While TimeBaseAttrib and ValueBaseAttrib are BaseAttributes that be long to TheInputConcept Output TimeStartBA Output TimeEndBA and the Win dowed ValuesBAs belong to TheOutputConcept ObjType TheInputConcept inherited TimeBaseAttrib time stamps ValueBaseAttrib values WindowSize positive integer Distance positive integer Output TimeStartBA start time of window OutputTimeEndBA end time of window Windowed ValuesBA as many as WindowSize TheOutputConcept inherited 4 2 28 SimpleMovingFunction This operator combines windowing with the computation of the average value in each window There is only one OutputValueBA which contains the average of the values in a window of the given WindowSize windows are computed with the given Distance between each window See also the description of the Windowing operator in section 4 2 22 4 2 CONCEPT OPERATORS 4T ObjType inherited InputTimeBA Input ValueBA WindowSize Distance OutputTimeStartBA OutputTimeEndBA Output ValueBA TheOutputConcept inherited 4 2 24 Weighte
82. omething like this MMART HOME home myname analysis MiningMart 1 1 Save the edited version of the file under the same name 5 Make sure that the file mentioned in step 4 is executable The command chmod u x mmart sh will ensure this 6 That s all You can now run MiningMart by issuing the command mmart sh amp from the MM HOME bin directory 14 CHAPTER 1 GETTING STARTED 1 6 Integration with YALE To enable a smooth integration between preprocessing learning and applying models two new operators called PrepareForYale and YaleModelApplier have been implemented to bridge the gap between MiningMart and the very powerful open source Yale learning toolbox Yale has a Weka wrapper offers automatic parameter setting and powerful feature selection and construction algorithms on top of common classifiers as provided by Weka After preprocessing raw data within your database you might want to draw a sample fitting in main memory which can be read directly by a Yale operator This can easily be done with the help of the PrepareFor Yale operator Then you are able to train your classifiers or induce some other kind of model based on the samples Finally you may want to apply the model to unseen data In this case MiningMart s YaleModelA pplier operator can be used to create a new database view holding the predictions for the new data Yale can be downloaded at http yale sf net 1 7 Upgrading from older versions This section
83. operator divides the range of TheTargetAttribute into at most NoOfIntervals intervals TheTargetAttribute is divided into intervals with respect to TheClassAttribute but unlike the implicit discretization if the number of interval exceeds NoOfIntervals intervals are further merged This will increase the classification error TheClassAttribute contains the labels of an example as in a Machine Learning setting Values of TheOutputAttribute can be specified in the parameter Label The parameter SampleSize gives a maximum number of learning examples for the external algorithm ObjType Consent TheTargetAttribute TheClassAttribute NoOfIntervals ClosedTo Label SampleSize TheOutputAttribute inherited must be numeric must be categorial positive integer gt 1 one of LEFT or RIGHT optional optional positive integer should be categorial 4 4 FEATURE CONSTRUCTION OPERATORS 65 4 4 25 GroupingGivenMinCardinality A grouping operator Values of TheTargetAttribute are grouped under a certain label which is stored in TheOutputAttribute which must be categorial This operator groups values of TheTargetAttribute by iteratively merging in each step two groups with the lowest frequencies until all groups have the cardinality number of examples with values within the interval at least Min Cardinality The algorithm has been inspired by hierarchical clustering Min Cardinality Type decides if Min Cardinality values are read
84. ppear in a ColumnSet which contains exactly one row with the statistical values per group of tuples and which belongs to TheOut 4 2 CONCEPT OPERATORS 39 putConcept Groups of tuples are built by listing attributes with the GroupBy parameter Each combination of values of the underlying BaseAttributes forms one group If no attributes are listed with the parameter list GroupBy then the operator will output a single tuple with the statistics of all the ColumnSet The sum of all values of a numerical attribute can be computed by specifying the corresponding BaseAttribute with the parameter AttributesComputeSum There can be more such attributes the sum is computed for each TheOut putConcept must contain a BaseAttribute for each sum which is computed their names must be those of the input attributes followed by the suffix SUM The total number of entries in an attribute can be computed by specifying a BaseAttribute with the parameter AttributesComputeCount There can be more such attributes the number of entries is computed for each TheOutput Concept must contain a BaseAttribute for each count which is computed their names must be those of the input attributes followed by the suffix COUNT The number of unique values in an attribute can be computed by specifying a BaseAttribute with the parameter AttributesComputeUnique There can be more such attributes the number of unique values is computed for each The OutputConcept must
85. probability of 0 5 The operator computes the column statistics if they are not computed yet which may take some time TheInputConcept CON IN inherited TheTargetAttribute BA IN inherited OUT TheOutput Attribute inherited 4 4 6 Binarify This operator is often called Dichotomisation It introduces one new attribute per loop The new attribute takes only two values 1 and 0 The value 1 is taken whenever the target attribute takes the target value otherwise the value 0 is taken Thus the new attribute is a binary indicator or a boolean flag showing the presence or absence of the target value in the target attribute for each row in the input concept By looping this operator several target values can be used so that several new boolean attributes are created ObjType TheInputConcept CON IN inherited TheTargetAttribute BA IN inherited Target Value V IN looped OUT TheOutput Attribute looped 56 CHAPTER 4 OPERATORS AND THEIR PARAMETERS 4 4 7 MergeAttributes This operator merges TheTargetAttribute and the Attribute TheOutpu tAttribute will have the value of The TargetAttribute where the Attribute ToMerge has the value NULL and it will have the value of the AttributeToMerge where TheTargetAttribute has the value NULL So if both attributes have the value NULL so will TheOutputAttribute If both attributes have a non null value the parameter ClashResolvation decides how to resolve this cl
86. pt as are specified in the parameter HowMany Selects the rows randomly ObjType TheInputConcept CON IN inherited HowMany V IN max no of rows TheOutputConcept CON OUT inherited 4 2 9 DeleteRecordsWithMissing Values Puts only those rows into the output Concept that have an entry which is NOT NULL in the Column for the specified The TargetAttribute 4 2 CONCEPT OPERATORS 41 ObjType TheInputConcept inherited TheTargetAttribute BA may have NULL entries TheOutputConcept inherited 4 2 10 SegmentationStratified A MultiStep operator creates several ColumnSets for the output Concept The input Concept is segmented according to the values of the specified attribute so that each resulting Columnset corresponds to one value of the attribute For numeric attributes intervals are built automatically this makes use of the statistics tables and the functions that compute the statistics The specified attribute will not be present in the output concept ObjType TheInputConcept CON IN inherited TheAttribute BA IN TheOutputConcept CON OUT inherited 4 2 11 SegmentationByPartitioning A MultiStep operator creates several ColumnSets for the output Concept The input Concept is segmented randomly into as many Columnsets as are specified by the parameter HowManyPartitions ObjType TheInputConcept CON IN inherited HowManyPartitions V IN positive integer TheOutput Concept CON OUT inherited 4 2 12 Segmen
87. putConcept inherited TheTargetAttribute must be numeric TheClassAttribute must be categorial SampleSize optional positive integer MinCardinalityType ABSOLUTE or RELATIVE MinCardinality positive TheOutputAttribute should be categorial 4 4 81 ErrorBasedGroupingGivenNoOfGroups A grouping operator Values of The TargetAttribute are grouped under a certain label which is stored in TheOutputAttribute which must be categorial This operator merges the values of TheTargetAttribute into at most NoOfGroups groups The grouping is performed with respect to TheClassAttribute but un like the implicit discretization if the number of groups exceeds NoOfGroups groups are further merged This will increase the classification error Values of TheOutputAttribute can be specified in the parameter Label TheClassAttribute contains the labels of an example as in a Machine Learning setting The param eter SampleSize gives a maximum number of learning examples for the external algorithm ObjType Toa oe inherited TheTargetAttribute must be numeric TheClassAttribute must be categorial NoOfGroups integer gt 1 Label optional SampleSize optional positive integer TheOutput Attribute should be categorial 68 CHAPTER 4 OPERATORS AND THEIR PARAMETERS 4 5 Operators creating relationships 4 5 1 CreateOneToManyRelation This operator creates a 1 n relationship between its two given input concepts The relationship is created on the conceptual
88. rs are fully specified it can be compiled Com pilation means to create the relational level output in the database that corre sponds to the output concepts that have been created in MiningMart See also chapter 2 and section 3 3 The Compile menu offers various options to compile a single step or a number of steps Compilation results are logged in the log tab and file Error messages are displayed An error immediately stops the compi lation Choosing the menu item Garbage collection removes the compilation results from the database The flag Compile in lazy mode can be used to make the compiler create only one output Columnset when several Columnsets are ex pected see the paragraph on compilation in section 2 2 The menu item Kill compile can be used to stop the current compiler run There are two features for more advanced users in the Tools menu when the case editor is opened three other features are relevant for the concept editor and are explained in section 3 3 The first called Add Reversing Step is to add a step automatically that reverses a computation of another step Only steps that employ the operators LinearScaling LogScaling or any Grouping operator can be reversed The step to be reversed must be selected before choosing this option if no step is selected or the selected step does not involve a reversable operator nothing will happen Otherwise the new reversing step is displayed automatically The step transition has to
89. shing its conceptual level in the internet case repository MiningMart allows you to export all conceptual metadata into a single file After you have opened a case choose Export Case from the Case menu You are then shown a file browsing dialogue with which you can choose a name for the exported file It is common to use the file extension xml for exported MiningMart files since these files use an XML syntax Please wait until all M4 objects are exported You can now send the exported file to the following email address miningmart 1s8 cs uni dortmund de The MiningMart team will then do some technical tests to check the consistency of the case You will be kindly asked to provide some background information about the application domain etc per email Then the case will be added by MiningMart administrators to the central case base and will be available under the web address above
90. splays all available operators for your overview they are also available in the Insert menu and the context menu see section 3 2 The second allows you to change the verbosity level of the log out put The different levels are given names which indicate the kinds of operations the system logs if that level or a lower level is chosen The higher you go in the list the less output is produced in the log window and log file Finally the third area allows you to choose the types of links between concepts that the concept editor should display This choice is applied from the next time that the concept editor is opened onwards See also section 3 3 The Help menu opens the MiningMart help system menu item Contents which allows you to browse through help topics You can also start a web browser if your system is configured to allow this The About option displays copyright information The following list provides some general hints which may help you getting familiar with the handling of MiningMart e Every case must have one top level chain to which all its steps and other chains belong So when creating a new case please create a new chain first using the Insert menu then you can start creating steps inside this chain Open the chain by double clicking on it e When creating a new step you can view and edit its parameters by double clicking it When editing a text field parameter by typing please remem ber to press Enter after e
91. stage To use the internet case repository please use an ordinary web browser and 71 72 CHAPTER 5 THE CASE REPOSITORY go to the address given at the beginning of this chapter You can click through the structure of the cases which are already there The following sections describe what to do if you have found a case that you would like to download and modify in your own MiningMart system and what to do if you want to contribute a case to the internet repository 5 2 How to download a case On the MiningMart case base start page see above there is a list of cases Clicking on one case name brings up the main starting page for that case There is a bulleted item called Exported into MA file clicking on the file name will make your browser download the file The file may have to be unpacked if its name ends with zip After unpacking the file you can import it into your MiningMart system There is a menu item Import Case in the Case menu Using it will allow you to select the file and give a name to the case which is to be imported Then the case is available in your MiningMart system You should save it if you want to keep it since importing the case does not automatically save it to your database If you want to execute the case or a modified version of it you now have to link the concepts of type DB to your own database tables or views This may mean that you have to adjust the exact form of concepts to the struct
92. t will work with any of these files Please consult your Oracle documentation to find the file If you are not going to use Oracle you can skip this section of the installer After the installer has finished you can start MiningMart in one of two ways e Click Start then Programs then MiningMart 1 1and then StartMiningMart e Go to the directory where MiningMart was installed then go to the sub directory subfolder called bin Double click on the file startMM bat 1 5 3 Linux or Solaris Follow these steps 1 You should read the file MM HOME README 2 You must have Java installed on your system MiningMart is tested for Java 1 4 and 1 5 For more information see http java sun com 1 5 INSTALLING AND RUNNING MININGMART 13 3 Your Oracle Postgres or MySql installation provides a JDBC driver This is a file which a Java program can use to read and write from the database Your Oracle Postgres or MySql documentation will tell you where to find the file The Oracle file is often called ojdbc14 jar classes13 jar classes12 zip or similar The Postgres file is usually called postgresql jar and the MySql file should be called mysql connector version bin jar or similar As an Oracle user please place copy of that file into MM HOME lib If you are using Linux or Solaris that s it If you are using Windows and have not chosen this file during installation you may now have to edit one line in the file MM HOMEVbinWnmart bat yo
93. tationWithK Mean A MultiStep operator creates several ColumnSets for the output Concept The input Concept is segmented according to the clustering method KMeans an ex ternal learning algorithm The number of ColumnSets in the output concept is therefore not known before the application of this operator However the parameter HowManyPartitions specifies a maximum for this number The pa rameter OptimizePartitionNum is a boolean that specifies if this number should be optimized by the learning algorithm but it will not exceed the maximum The parameter SampleSize gives a maximum number of learning examples for the external algorithm The algorithm KMeans uses ThePredictingAttributes for clustering these attributes must belong to TheInput Concept 42 CHAPTER 4 OPERATORS AND THEIR PARAMETERS ObjType Tipa ne CON inherited HowManyPartitions V positive integer OptimizePartitionNum V true or false ThePredictingAttributes BA List SampleSize V positive integer TheOutputConcept CON inherited 4 2 13 UnSegment This operator is the inverse to any segmentation operator see 4 2 10 4 2 11 4 2 12 While a segmentation operator segments its input concept s ColumnSet into several ColumnSets UnSegment joins several ColumnSets into one This operator makes sense only if a segmentation operator was applied previously in the chain because it exactly reverses the function of that operator To do so the parameter UnsegmentAttribute sp
94. ty classes is merged with a group that has the same majority class The resulting grouping minimizes the classifica tion error TheClassAttribute contains the labels of an example as in a Machine Learning setting The parameter SampleSize gives a maximum number of learn ing examples for the external algorithm ObjType inherited TheTargetAttribute must be numeric TheClassAttribute must be categorial FullMerge one of YES or NO SampleSize optional positive integer TheOutputAttribute should be categorial 44 FEATURE CONSTRUCTION OPERATORS 67 4 4 30 ErrorBasedGroupingGivenMinCardinality A grouping operator Values of TheTargetAttribute are grouped under a certain label which is stored in TheOutputAttribute which must be categorial This operator merges the values of TheTargetAttribute into groups with the cardi nality above the given threshold MinCardinality MinCardinalityType decides if Min Cardinality values are read as absolute values integers or relative values real between 0 and 1 The grouping is performed with respect to TheClas sAttribute but unlike implicit grouping groups with a single majority class are further merged if they do not have the required cardinality This will increase the classification error TheClassAttribute contains the labels of an example as in a Machine Learning setting The parameter SampleSize gives a maximum number of learning examples for the external algorithm ObjType TheIn
95. u have become familiar with those basic notions you can start your own MiningMart application easily 2 2 BASIC NOTIONS IN MININGMART 19 2 2 Basic notions in MiningMart This section explains several terms that are used throughout the MiningMart system and its documentation You can use this section for general reference Where words are printed in italics they have their own entry in this section Business data This is the data in which knowledge is to be discovered It must be stored in a relational database It can consist of any number of tables views and relations between them The MiningMart system assumes that all data is stored in one database schema if this is not the case a single schema with database links to the needed tables should be set up please refer to the documentation of your DBMS Metadata This is administrative data which MiningMart uses to store in formation about the business data as well as about the knowledge discovery process Metadata can be stored in a separate database schema which can live in a separate database from the business data or in the same schema Mining Mart uses a fixed data model for its metadata which is called 4 MiningMart MetaModel M4 MiningMart MetaModel This is the fixed data model in which Min ingMart stores its own information called Metadata M4 consists of several parts but it is not important for users of MiningMart to know much about it Conceptual level
96. u will find the line set ORA 4MMART_HOME 1ib classes12 zip in this file Change the line so that the file name is correct ie matches the Oracle JDBC driver file name If you are going to use postgres Postgres JDBC file is already provided in MM_HOME 1ib under the license described in MM HOME licenses LICENSE POSTGRESQL_JDBC However you may still have to find the JDBC file that belongs to your own Postgres installation if you are experiencing troubles particularly for Postgres versions 8 x or higher Just copy the file to MM HOME 1lib If you are using Linux or Solaris that s it If you are using Windows you may now have to edit one line in the file MM HOME bin mmart bat you will find the line set POSTGR AMMART HOME M libMpostgresql jar in this file Change the line so that the file name is correct ie matches the Postgres JDBC driver file name For MySql the procedure is the same as for Postgres a JDBC file is already provided which has been tested for MySql version 5 0 If you are going to use a higher version of MySql you may also have to use a later version of the JDBC file Please proceed as described above for Postgres in order to let MiningMart use your JDBC file 4 Open and edit the file MM HOME bin mmart sh Follow the instructions given in comments in that file there is only one obligatory line to edit The line looks like this before you have changed it MMART HOME After you have changed it it should look s
97. until a message informs you that no more places will be recommended You can choose to have the Materialisation steps be inserted automatically by MiningMart If you choose this option please be aware that it works best if the chain into which the step is inserted automatically is currently visible If you are experiencing troubles it is recommended to save the case then close and re open it If there are still problems you may have to perform some edits around the new step Normally the automatic insertion should run without problems If you want to be safe export your case before inserting a materialisation operator then you can always re create the exported version by importing the exported file 3 3 Concept editor Concepts represent data tables or views in MiningMart and the concept editor is the tool to create manipulate or delete them Figure 3 2 shows a screenshot of the concept editor In MiningMart there are two types of concepts DB and Mining The former are indicated by a small database icon the latter show a tool icon The DB concepts are used to represent tables or views in the database which have existed 30 CHAPTER 3 THE GRAPHICAL USER INTERFACE before MiningMart was started The Mining concepts represent tables or views created with MiningMart Consequently DB concepts have to be connected to database objects manually while this connection is set up automatically by the compiler for Mining concepts A concept is connect
98. ure of your database objects or that you have to insert additional steps to the case which bring your data into a suitable format For every concept of type DB the ones with a small database symbol in their icon use the concept editor and its Create connection function Then continue with the relationships between the concepts if there are any Once these items are connected to your database tables or views you can continue by compiling the steps or making adjustments to the case 5 3 How to document a case For the documentation of your case which is especially important if you want to publish its conceptual level in the internet case repository see following section there is a description field for every step chain concept baseattribute etc which can be edited directly in the GUI It can be found in the lower left frame of the MiningMart GUI there is a special tab Description which applies to the currently selected element For the attributes of a concept there is an extra description field in the right hand frame when the BaseAttributes of a concept are displayed Any descriptions entered in these fields are shown in the internet repository if you publish your case there They help other users to find out about details of your case 5 4 HOW TO UPLOAD A CASE 73 5 4 How to upload a case If you have developed a successful knowledge discovery case you have the option to let other users benefit from your work by publi
99. us the system is aimed at those people in an institution who actually work with the institution s data and process it in various ways in order to gather statistics or other higher level information While the system provides an intuitive ac cess to data and easy handling of processing steps users should have a certain knowledge about how their data is stored before the application of MiningMart MiningMart works with relational databases It assumes that all input data is given in tables in a relational database and its output are new tables in this database It also stores its own data in relational tables Thus there are no limitations to the amount of data that MiningMart can handle Referring to the definitions at the beginning of this chapter MiningMart supports the whole knowledge discovery process but focusses clearly on pre processing That is the system provides a few common data mining algorithms which can be applied directly from the system but its main value is the support for the technical steps that are needed to bring the data into a format which can be used for data mining Like the input the output of the system is a number of relational database tables but in the output tables the data is stored in a representation suitable for data mining Thus you can use your favourite data mining algorithm easily because the input data for it is stored in a table in your database in exactly the right format after the application of Mini
100. utConcept TheOutputAttribute contains the orig inal values plus the predicted values where the original ones were missing There are some SVM specific parameters the table gives reasonable values to choose if nothing is known about the data or SVMs For the KernelType only the following values Strings are possible dot polynomial neural radial anova Dot is the linear kernel and can be taken as default This operator can use two different versions of the Support Vector Machine algorithm One runs in main memory it needs the parameter SampleSize to de termine a maximum number of training examples The other runs in the data base it is used if the optional parameter UseDB_SVM is set to the String true When this version is used an additional parameter TheKey is needed which gives the BaseAttribute whose column is the primary key of TheInputConcept TheKey can be left out only if the ColumnSet that belongs to TheInputConcept 44 FEATURE CONSTRUCTION OPERATORS 57 represents a table rather than a view The database algorithm restricts the pos sible kernel types to dot and radial It can also use the parameter SampleSize You can only use it if you have the DB version of MYSVM installed With the parameters LossFunctionPos and LossFunctionNeg the loss func tion that is used for the regression can be biased such that predicting too high is more expensive LossF unctionPos gt LossFunctionNeg or less expensive LossFunctionNeg gt
101. value if no mapping applies TimeFormat ORACLE specific time format TheOutput Attribute inherited Z OO G an ZZZZZZZ Z 4 4 17 EquidistantDiscretizationGivenWidth A discretization operator Numeric attributes are discretized and the output is a categorial attribute This operator divides the range of TheTargetAttribute into intervals with given width IntervalWidth starting at StartPoint The first and the last interval cover also the values out of range ObjType Thea oaei inherited TheTargetAttribute must be numeric StartPoint optional IntervalWidth a positive real number Closed To one of LEFT or RIGHT TheOutputAttribute should be categorial 62 CHAPTER 4 OPERATORS AND THEIR PARAMETERS 4 4 18 EquidistantDiscretizationGivenNoOflIntervals A discretization operator Numeric attributes are discretized and the output is a categorial attribute This operator divides the range of TheTargetAttribute into the given number of intervals NoOfIntervals with the same width The first and the last interval cover also the values out of range Values of TheOutputAttribute can be specified in the parameter Label ObjType ThcizputConcept inherited TheTarget Attribute must be numeric NoOflntervals integer ClosedTo one of LEFT or RIGHT Label j optional TheOutput Attribute should be categorial 4 4 19 EquifrequentDiscretizationGivenCardinality A discretization operator Numeric attributes are dis
102. veral concepts to database tables or views at once To understand this functionality the notion of a data model is important A data model is technically simply a collection of concepts to gether with relation ship s between them projections and subconcept links are ignored Conceptually a data model is a snapshot of the database at a partic ular point during data preparation The most salient data model is the initial or input data model of a Case This data model consists simply of all concepts of type DB since they represent the input to the Case before any preparation has taken place When a few operations have been applied to a data model a new data model is the result it represents the results of the preparation so far Thus every Step in MiningMart produces not only an output concept but this output concept is also part of a resulting data model that represents all prepared data at the point of processing that immediately follows the Step The context menu item Show resulting data model available for Steps in the case editor allows to inspect this resulting data model The concepts of one data model can be connected to some database tables or views all at once The menu item Match Input Data in the Tools menu available when the concept editor is opened can be used to match the input initial data model to some tables views When selecting this functionality you are asked to provide at least one database table or view that the in
Download Pdf Manuals
Related Search
Related Contents
BS 75 BS 75 E Design de Interface - PUC-SP DeLOCK 18133 User Manual Bosch 2609255009 Acer Altos R520 M2 Parquet l`indispensable Moteur - DB Moteurs Copyright © All rights reserved.
Failed to retrieve file