Self Organizing Map & 2-stage Clustering User Manual

1. The weight of each pixel is greater the closer it is to the pixel currently being blurred. The Gaussian equation used to calculate the weights is the following:

$$G(i,j) = \frac{1}{2\pi\sigma^{2}}\, e^{-\frac{\|i-j\|^{2}}{2\sigma^{2}}} \qquad (4)$$

where:
i: pixel currently being blurred;
j: pixel whose weight is being computed;
\|i-j\|: Euclidean distance on the grid between neurons i and j;
σ: width of the Gaussian function, which defines the blurring level.

Applying this function to all the pixels surrounding the pixel being blurred yields a matrix of weights. The size of this matrix depends on the number of pixels considered neighbours of the current node. This parameter, known as the kernel, can cover the whole image or have a smaller radius. The new value of the pixel under analysis is the sum of the values of the pixels within the blur radius, each multiplied by its weight.

Note that the results obtained with the Umat-CC method may vary in some details according to the blurring level applied (parameter σ of equation 4). There is no rule for calculating this parameter, because it depends on the characteristics of the specific experiment configured. It must therefore be set manually if the result obtained is not considered satisfactory. This adjustment can be made by looking at the U-Matrix, which is shown in output without the filter applied. In this way, when the user finds possible errors
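The Gaussian weighting of equation 4 can be illustrated with a minimal sketch in plain Python. The function names `gaussian_kernel` and `blur`, the square kernel of a given `radius`, the normalisation of the weights so that they sum to 1, and the renormalisation at the map borders are all assumptions made for this illustration; the manual does not state how DAMEWARE implements the filter.

```python
import math

def gaussian_kernel(radius, sigma):
    """Matrix of Gaussian weights (equation 4) on a (2*radius+1)^2 window.

    The weights are normalised to sum to 1 (an assumption of this sketch).
    """
    size = 2 * radius + 1
    kernel = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            d2 = (r - radius) ** 2 + (c - radius) ** 2  # squared grid distance
            kernel[r][c] = math.exp(-d2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    total = sum(sum(row) for row in kernel)
    return [[w / total for w in row] for row in kernel]

def blur(grid, radius, sigma):
    """Replace every cell by the weighted average of the cells around it."""
    rows, cols = len(grid), len(grid[0])
    kernel = gaussian_kernel(radius, sigma)
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc, wsum = 0.0, 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:  # ignore cells off the map
                        w = kernel[dr + radius][dc + radius]
                        acc += w * grid[rr][cc]
                        wsum += w
            out[r][c] = acc / wsum  # renormalise near the borders
    return out
```

A larger `sigma` spreads the weights over more neighbours and therefore smooths the map more strongly, which is the effect the blurring level controls.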
2. 2.3.2 SOM Topographic error

The topographic error measures the dissimilarity of patterns assigned to different BMUs, according to the following formula:

$$TE = \frac{1}{N}\sum_{i=1}^{N} u(x_i) \qquad (6)$$

where:
N: number of patterns in the dataset;
u(x_i) = 1 if the first and second BMU of pattern i are not adjacent on the map, 0 otherwise.

DAMEWARE SOM 2stage Clustering Model User Manual. This document contains proprietary information of DAME project Board. All Rights Reserved. DAta Mining & Exploration Program.

2.4 Clustering quality indicator

The results of the second stage of the clustering process can be evaluated using the Davies-Bouldin (DB) index. This index measures the ratio of intra-cluster and extra-cluster distances, measured from the centroids (Davies & Bouldin 1979). The internal scatter of a cluster C_i can be written as:

$$S_{i,q} = \left(\frac{1}{|C_i|}\sum_{x \in C_i} \|x - z_i\|_2^{\,q}\right)^{1/q}, \quad i = 1, \dots, K \qquad (7)$$

where |C_i| is the number of patterns assigned to cluster i, x and z_i are respectively a pattern of cluster i and its centroid, q is a positive value and K is the total number of clusters. The distance between two clusters can be written as:

$$d_{ij,t} = \left(\sum_{s=1}^{D} |z_{si} - z_{sj}|^{t}\right)^{1/t} \qquad (8)$$

where z_i and z_j are respectively the centroids of clusters i and j, |z_{si} - z_{sj}| denotes the absolute value of the difference between vectors z_i and z_j computed on dimension s, D is the number of dimensions and t is a positive value. So the DB index can
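As a worked illustration of the scatter and separation terms just defined, the sketch below combines them into the Davies-Bouldin index in its usual form (the average, over clusters, of the worst ratio between summed scatters and centroid distance). The function name `davies_bouldin`, the data layout (a list of clusters, each a list of points) and the defaults q = t = 2 are assumptions of this example, not DAMEWARE settings.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(clusters, q=2, t=2):
    """Davies-Bouldin index for a list of non-empty clusters (at least two)."""
    k = len(clusters)
    # Centroid of each cluster: mean of its points, dimension by dimension.
    centroids = [[sum(col) / len(pts) for col in zip(*pts)] for pts in clusters]
    # Internal scatter S_i (equation 7).
    scatter = [
        (sum(euclidean(x, z) ** q for x in pts) / len(pts)) ** (1.0 / q)
        for pts, z in zip(clusters, centroids)
    ]

    def separation(i, j):
        # Distance between centroids i and j (equation 8).
        return sum(abs(a - b) ** t for a, b in zip(centroids[i], centroids[j])) ** (1.0 / t)

    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / separation(i, j)
                     for j in range(k) if j != i)
    return total / k
```

For two compact, well separated clusters such as `[[(0, 0), (0, 2)], [(10, 0), (10, 2)]]` each scatter is 1 and the centroid distance is 10, so the index is small; overlapping clusters push it up.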
3. (Applicable Documents, continued; columns: ID; Title / Code; Author; Date)
A6 (continued): VONEURAL-PRO-NA-0002-Rel1.1; D'Abrusco, Cavuoti; 06/10/2007
A7: frontend_VONEURAL-SDD-NA-0004-Rel1.4; Manna; 18/03/2009
A8: FW_VONEURAL-SDD-NA-0005-Rel2.0; Fiore; 14/04/2010
A9: REDB_VONEURAL-SDD-NA-0006-Rel1.5; Nocella; 29/03/2010
A10: driver_VONEURAL-SDD-NA-0007-Rel0.6; d'Angelo; 03/06/2009
A11: dm-model_VONEURAL-SDD-NA-0008-Rel2.0; Cavuoti, Di Guido; 22/03/2010
A12: ConfusionMatrixLib_VONEURAL-SPE-NA-0001-Rel1.0; Cavuoti; 07/07/2007
A13: softmax_entropy_VONEURAL-SPE-NA-0004-Rel1.0; Skordovski; 02/10/2007
A14: Clustering con Modelli Software Dinamici, Seminario Dip. di Informatica, Università degli Studi di Napoli Federico II, http://dame.dsf.unina.it/documents.html; Esposito F.; 2013
A15: dm_model_VONEURAL-SRS-NA-0005-Rel0.4; Cavuoti; 05/01/2009
A16: DMPlugins_DAME-TRE-NA-0016-Rel0.3; Di Guido, Brescia; 14/04/2010
A17: BetaRelease_ReferenceGuide_DAME-MAN-NA-0009-Rel1.0; Brescia; 28/10/2010
A18: BetaRelease_GUI_UserManual_DAME-MAN-NA-0010-Rel1.0; Brescia; 03/12/2010
A19: SOM and 2-stage clustering models Design and Requirements, som_DAME-SPE-NA-0014-Rel4.0; Esposito, Brescia; 2013
Table 5: Applicable Documents
4. according to the following equation:

$$W(k,j)_{new} = W(k,j)_{old} + \eta\, h_j \left[X(k) - W(k,j)_{old}\right] \qquad (2)$$

where:
W(k,j): synaptic weight of the link between input k and winner node j;
X(k): k-th feature of the input pattern;
η: learning rate, in the range [0, 1];
h_j: neighbourhood function.

In practice, the weights to update at each input presentation are as many as the input neurons. In several cases the weights of the neurons within an appropriate neighbourhood (dimensionally comparable with the positive part of the Mexican Hat) are also updated with the same criterion, using decreasing values of η (learning decay).

[Figure 3: Activation rule of a node with the Mexican Hat function (excitation versus lateral distance)]

Reading equation 2 carefully, we can see that its role is to rotate the synaptic weights vector towards the input vector in the parameter space. In this way the winner neuron is trained further to identify the presented input (Figure 4).

[Figure 4: Learning diagram with the input vector X and the neuron vectors K]

When a self-organizing network is used, the grid of neurons of the output layer is shown as a set of activity bubbles that correspond to the classes into which the algorithm has divided the input based on their similarity. If a network is
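The effect of the update rule (equation 2), which pulls the weight vector of a node towards the presented input, can be seen in a few lines of Python. This is only a sketch: `update_weights` is a hypothetical helper, and the neighbourhood factor `h` is passed in as a plain number (1.0 for the winner, smaller values for its neighbours) rather than computed from a Mexican Hat or Gaussian function.

```python
def update_weights(weights, x, eta, h=1.0):
    """One application of the update rule to the weight vector of one node.

    weights: current weight vector of the node
    x:       input pattern (same length as weights)
    eta:     learning rate in [0, 1]
    h:       value of the neighbourhood function for this node (assumed given)
    """
    return [w + eta * h * (xk - w) for w, xk in zip(weights, x)]


def train_on_pattern(weights, x, eta_start, eta_end, steps):
    """Present the same pattern several times with a linearly decaying eta."""
    for step in range(steps):
        eta = eta_start + (eta_end - eta_start) * step / max(steps - 1, 1)
        weights = update_weights(weights, x, eta)
    return weights
```

With `eta = 0.5` and `h = 1.0`, a node at `[0.0, 0.0]` presented with `[1.0, 2.0]` moves half of the way, to `[0.5, 1.0]`; repeated presentations bring it progressively closer to the input.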
5. functionality.

[Figure 24: Automatic post processing functionality (experiment creation form with functionality Clustering_SOM_Auto selected)]

The following flowchart explains the behaviour of the model. Let K be the number of expected clusters.

[Figure 25: Behaviour of the automatic post processing. K is either defined by the user or derived from the number of patterns; the test "number of BMU > K" selects the post processing method.]

If some knowledge of the number of expected clusters is available, it can be used. Otherwise, heuristically, the expected number of clusters is the square root of the number of patterns in the input dataset. In order to use K-Means as post processing method, and recalling the behaviour described in paragraph 2.2.1, the number of expected clusters must be less than the number of BMUs found by the SOM at the first stage. Otherwise the Umat-CC method is used, which is independent of the number of clusters and BMUs.

However, this process does not allow evaluating the best method to apply in relation to the input problem. To do this we can proceed as shown in the following example, in which an astronomical image in FITS format is used as input dataset. After the creation of the Workspace and
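The decision made by the automatic functionality, as described in the text (K from the user or from the square-root heuristic, then K-Means only if there are more BMUs than expected clusters), could be sketched as follows. The function name, the returned labels and the rounding of the square root to the nearest integer are assumptions of this example; the manual does not say how the root is rounded.

```python
import math

def choose_second_stage(n_patterns, n_bmu, expected_clusters=None):
    """Return (method, K) following the behaviour described for Clustering_SOM_Auto.

    n_patterns:        number of patterns in the input dataset
    n_bmu:             number of BMUs found by the SOM at the first stage
    expected_clusters: K supplied by the user, or None to use the heuristic
    """
    if expected_clusters is not None:
        k = expected_clusters
    else:
        k = round(math.sqrt(n_patterns))  # rounding is an assumption
    # K-Means needs fewer clusters than BMUs; otherwise fall back to Umat-CC.
    method = "Kmeans" if n_bmu > k else "UmatCC"
    return method, k
```

For instance, with 150 patterns, 20 BMUs and K = 3 given by the user, the sketch selects K-Means; with 150 patterns, 10 BMUs and no user value, the heuristic gives K = 12 and Umat-CC is selected.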
6. nodes may have in the process of clustering by the method explained.

[Figure 16: Example of external node to identify separation zones and/or outliers]

2.3 SOM quality indicators

To evaluate the quality of a SOM obtained as first stage of a clustering process, we can use the criteria suggested by Kiviluoto (1996):
i. What is the degree of continuity of the map topology?
ii. What is the resolution of the map topology?
A quantification of these two properties can be obtained by computing the quantization error and the topographic error (Chi & Yang 2008).

2.3.1 Quantization error

The quantization error measures the similarity of the patterns assigned to the same BMU, according to the following formula:

$$QE = \frac{1}{N}\sum_{i=1}^{N} \|w_{BMU_i} - x_i\| \qquad (5)$$

where:
w_{BMU_i}: weights vector of the BMU of pattern i;
N: number of patterns in the dataset;
x_i: input vector i, assigned to the current BMU.

Equation 5 corresponds to the average distance of each pattern from its BMU.
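Both quality indicators can be computed directly from the trained weights, as in the minimal sketch below. It assumes that the map is stored as a dictionary from grid coordinates `(row, col)` to weight vectors, that adjacency means the 8 surrounding cells of a rectangular grid, and that the topographic error counts the patterns whose first and second BMU are not adjacent (the usual definition by Kiviluoto); these choices are illustrative, not taken from the DAMEWARE code.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_best_units(x, weights):
    """Grid coordinates of the first and second BMU of pattern x."""
    ranked = sorted(weights, key=lambda node: euclidean(x, weights[node]))
    return ranked[0], ranked[1]

def quantization_error(patterns, weights):
    """Average distance of each pattern from its BMU."""
    total = 0.0
    for x in patterns:
        bmu, _ = two_best_units(x, weights)
        total += euclidean(x, weights[bmu])
    return total / len(patterns)

def topographic_error(patterns, weights):
    """Fraction of patterns whose two best units are not neighbours on the grid."""
    errors = 0
    for x in patterns:
        (r1, c1), (r2, c2) = two_best_units(x, weights)
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:  # not among the 8 neighbours
            errors += 1
    return errors / len(patterns)
```

A low quantization error indicates that the nodes represent the data closely, while a low topographic error indicates that neighbouring regions of the data space are mapped to neighbouring nodes.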
7. of a number of expected clusters may be required (partitional clustering).

[Figure 6: 2-Stage Clustering general diagram. Stage 1: the SOM maps N input samples onto the Kohonen BMUs; Stage 2: the BMUs are grouped into clusters.]

From the above, a well-known dichotomy of approach is evident, which distinguishes hierarchical from partitional clustering. In both cases it is always necessary to distinguish between the different types of metrics used.

Three post processing methods are currently implemented. The first is the classic K-Means algorithm (Hartigan & Wong 1979), which belongs to the class of partitional clustering methods. The second is the U-Matrix with connected components, Umat-CC (Hamel & Brown 2011), a hierarchical algorithm with a bottom-up approach. The third is a model created by the DAME group, inspired by an agglomerative method based on dynamic SOMs, hereinafter called Two Winners Linkage (TWL).

2.2.1 Post-SOM with K-Means

K-Means is the classic example of partitional clustering.

[Figure 7: K-Means algorithm]

In order to exclude the patterns completely from the analysis carried out in this phase, the initial cluster centres are not selected on the basis of the distribution of points in the dataset, but among the BMUs calculated by the SOM. This type of approach reduces the sensitivity to noise, because the BMUs are local averages of the data and therefore less sensitive to their variation
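The idea of running K-Means on the BMUs rather than on the raw patterns can be sketched in plain Python as below. The function name `kmeans_on_bmus` and the choice of the first k BMUs as initial centres are assumptions of this example (the manual only states that the initial centres are selected among the BMUs); the loop is the standard assign-and-recompute scheme.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_on_bmus(bmu_weights, k, max_iterations=100):
    """Cluster the BMU weight vectors; returns (labels, centres).

    bmu_weights: list of weight vectors of the BMUs found by the SOM
    k:           number of expected clusters (must be below len(bmu_weights))
    """
    # Initial centres are taken among the BMUs (first k: an assumption).
    centres = [list(w) for w in bmu_weights[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each BMU goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: euclidean(w, centres[c]))
            for w in bmu_weights
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [w for w, lab in zip(bmu_weights, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres
```

Each input pattern then inherits the cluster label of its own BMU, so the patterns themselves never enter the second stage.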
8. the uploading of the input dataset (m101.fits), an experiment must be created, selecting Clustering_SOM as functionality and Train as use case.

[Figure 26: SOM single stage functionality (experiment configuration tab, with m101.fits as input file)]

At the end of the experiment we obtain a trained network, to which any one of the post processing methods can be applied by creating another experiment, selecting Clustering_SOM_Kmeans, Clustering_SOM_UmatCC or Clustering_SOM_TWL, and being careful to select Run as use case.

[Figure 27: Clustering_SOM_Kmeans in Run use case (input file m101.fits, configuration file produced by the SOM Train experiment)]

The Run use case ensures that the network parameters read from the configuration file are not modified. Therefore, starting from the configuration file obtained from the SOM single stage, every post processing method works on the same trained network and the results can be compared.
9. weights vectors are always generated in the range [-1, 1].

2.1.1 SOM output grid visualization: U-matrix

The U-Matrix (Unified Distance Matrix) is the standard for the evaluation and interpretation of a SOM. During the training of the network, the weights vectors of the neurons are computed in such a way that elements near each other on the map are also near in the weights space. In this way the Kohonen layer can represent multi-dimensional data on a map of two or three dimensions, preserving the topology.

Let:
n: a neuron on the map;
NN(n): set of the nodes adjacent to n on the map;
w(n): weights vector of neuron n;
\|w(n) - w(m)\|: Euclidean distance between the weights vectors of neurons n and m;
U_{height}(n): value associated with neuron n.

A value is assigned to each node of the Kohonen layer according to the following equation:

$$U_{height}(n) = \sum_{m \in NN(n)} \|w(n) - w(m)\| \qquad (3)$$

The value thus computed becomes an identifier of the distance between a node and its nearest neighbours, and can be visualized on a heat map in which light colours represent nodes that are close in the weights space, while dark colours represent distant nodes (Moutarde & Ultsch 2005). Typically the map is represented on a greyscale, as shown in Figure 5.

[Figure 5: Examples of U-Matrix]

In order to further increase the interpretability of the U-Matrix, it is possible to overlay on each node that is the BMU of some pattern a colour that identifies the relative cluster. Ob
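Equation 3 translates almost literally into code. The sketch below assumes a rectangular map stored as a nested list `weights[row][col]` and takes NN(n) to be the four nodes sharing an edge with n; both the layout and the choice of neighbourhood are assumptions of this example.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def u_matrix(weights):
    """U-height of every node: sum of distances to its adjacent nodes."""
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:  # skip cells off the map
                    heights[r][c] += euclidean(weights[r][c], weights[rr][cc])
    return heights
```

On a one-row map with weights `[[0.0], [0.0], [10.0]]` the heights are `[0.0, 10.0, 10.0]`: the first node sits inside a homogeneous region, while the other two face the jump between the two groups.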
10. will repeat the post processing, setting a lower or higher blur level depending on the case.

[Figure 11: Use of blurring level. An excessive blurring may produce mistakes, as shown on the left.]

2.2.3 Post-SOM with TWL

The Two Winners Linkage (TWL) is a clustering method used as post processing of the Kohonen layer, inspired by the neighbourhood computation technique used in the Evolving SOM model (Deng & Kasabov 2003). This mechanism consists of establishing a connection between the two BMUs of each pattern. At the end of the algorithm, the connected components show the clusters. Unlike what happens in Evolving SOM, the connections do not have a weight. However, a mechanism to prevent the connection of nodes that are far apart is required.

To obtain this result we can use the dual of the concept of CC internal node seen in the previous paragraph. The CC internal node was defined as a node whose gradient on the U-Matrix is lower than that of all its adjacent nodes. This occurs when those nodes are very close to the internal node. We can therefore define a CC external node as a node whose gradient is greater than that of all its adjacent nodes.

[Figure 12: External node (in red) on the U-Matrix]

An external n
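The linkage step of TWL (connect the two BMUs of every pattern, then read the clusters off the connected components) can be sketched with a small union-find structure. How exactly external nodes block a connection is not spelled out at this point of the manual, so the rule used here, skipping any link that touches an external node, is an assumption of this example, as are the function name and the input format (one pair of node identifiers per pattern).

```python
def twl_clusters(bmu_pairs, external=frozenset()):
    """Group nodes into clusters by linking the two BMUs of each pattern.

    bmu_pairs: iterable of (first_bmu, second_bmu), one pair per pattern
    external:  nodes treated as CC external nodes (links touching them are skipped)
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for first, second in bmu_pairs:
        find(first)  # the first BMU always belongs to some cluster
        if first in external or second in external:
            continue  # assumed rule: external nodes are never linked
        parent[find(first)] = find(second)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

With the pairs `[(1, 2), (2, 3), (4, 5), (3, 4)]` and node 4 marked as external, nodes 1, 2 and 3 form one cluster and node 4 stays isolated; without external nodes all five nodes end up in a single component.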
11. [Figure 22: Moving configuration file in the Workspace and uploading of target clusters file]

Now we have to create a new experiment, choose the functionality Clustering_SOM_Kmeans and select Test as use case. For this model, Test has only five mandatory parameters:
input file: iris.txt;
configuration file: file produced by a Train use case, which contains the experiment parameters;
dataset target file: file that reports the cluster of each pattern present in the input dataset;
dataset type: 0, which indicates an ASCII input file;
expected clusters: 3 (K parameter of K-Means).

[Figure 23: The SomKmeansIris test configuration tab (input file iris.txt, dataset target file iris_target.txt, dataset type 0)]

After submission, the experiment will be executed and will produce the expected output files.

4.2 Second Example: choice of second stage

As often happens, it is difficult to determine a priori the best post processing method for the SOM, assuming we want to use one of them. In this case a good solution is offered by the Clustering_SOM_Auto
12. A typical complete experiment consists of the following steps:
1. Train the network with a dataset as input, then store as output the final weight matrix (best configuration of trained network weights);
2. Test the trained network with a dataset containing both input and target features, in order to verify the training quality;
3. Run the trained and tested network with new datasets. The Run use case implies the simple execution of the trained and tested model, like a generic static function.

3.1 Input

We also remark that the massive datasets to be used in the various use cases are (and sometimes must be) different in terms of internal file content representation. Recall that it is possible to use one of the following data types:
ASCII (extension .dat or .txt): simple text file containing rows (patterns) and columns (features) separated by spaces;
CSV (extension .csv): Comma Separated Values files, where columns are separated by commas;
FITS (extension .fits or .fit): FITS files containing images and/or tables;
VOTABLE (extension .votable): formatted files containing special fields separated by keywords coming from the XML language, with more special keywords defined by VO data standards;
JPEG (extension .jpg or .jpeg): image files;
PNG (extension .png): image files;
GIF (extension .gif): image files.
13. DAta Mining & Exploration Program
Dipartimento di Scienze Fisiche, Università di Napoli Federico II; Istituto Nazionale di Astrofisica, Osservatorio Astronomico di Capodimonte; Caltech

Self Organizing Map & 2-stage Clustering User Manual
DAME-MAN-NA-0020
Issue: 1.2
Author: M. Brescia, F. Esposito
Doc.: SOM2stageClustering_UserManual_DAME-MAN-NA-0020-Rel1.2

Index
1 Introduction
2 SOM and 2-Stage clustering theoretical overview
2.1 The model SOM
2.1.1 SOM output grid visualization: U-Matrix
2.2 2-Stage clustering: SOM post processing
2.2.1 Post-SOM with K-Means
2.2.2 Post-SOM with Umat-CC
2.2.3 Post-SOM with TWL
2.3 SOM quality indicators
2.3.1 Quantization error
2.3.2 SOM Topographic error
14. are Error; Neural Network; Osservatorio Astronomico di Capodimonte; Personal Computer; Principal Investigator; Registry & Database; Rich Internet Application; Sloan Digital Sky Survey; Service Layer; Self Organizing Feature Map; Self Organizing Map; Software; Two Winners Linkage; User Interface; Uniform Resource Indicator; Virtual Observatory; eXtensible Markup Language
Table 3: Abbreviations and acronyms

Reference & Applicable Documents (columns: Title / Code; Author; Date)
Dynamic cell structure learns perfectly topology preserving map. Neural Comput. 7, 845-865; Bruske J., Sommer G.; 1995
A Two-stage Clustering Method Combining Ant Colony SOM and K-means. Journal of Information Science and Engineering 24, 1445-1460; Chi S.C., Yang C.C.; 2008
A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1, 224-227; Davies D.L., Bouldin D.W.; 1979
On-line pattern analysis by evolving self-organizing maps. Neurocomputing 51, Elsevier, 87-103; Deng D., Kasabov N.; 2003
Improved interpretability of the unified distance matrix with connected components. Proceedings of the 2011 International Conference on Data Mining; Hamel L., Brown C.W.; 2011
Extending the Ko
15. be written as:

$$DB = \frac{1}{K}\sum_{i=1}^{K} \max_{j \neq i} \frac{S_{i,q} + S_{j,q}}{d_{ij,t}} \qquad (9)$$

Low values of this index indicate a better clustering. Note, however, that on datasets that are not linearly separable this index may not be objective. A more objective evaluation can be obtained if the cluster of each input pattern is known. In that case it is possible to compute the Index of Clustering Accuracy (ICA) and the Index of Clustering Completeness (ICC). Let:
NC_t: number of theoretical clusters;
NC_f: number of clusters found;
NC_d: number of disjoint clusters. Two theoretical clusters are disjoint if the intersection of the labels assigned by the clustering process in the two clusters is the empty set.

$$ICA = \frac{|NC_t - NC_f|}{NC_t + NC_f} \qquad (10)$$

$$ICC = 1 - \frac{NC_d}{NC_t} \qquad (11)$$

Low values of these indices reflect the best results.

3 Use of the SOM and 2-stage Clustering models

For the user, the SOM and 2-stage clustering systems offer three use cases:
Train;
Test;
Run.
In addition to the use cases just listed, it is possible to perform a Train starting from a previously trained network. This use case is called Resume Training. Note that if the network is overtrained (the final learning rate is greater than the initial one), the software performs a tuning phase in which only the winning node is updated after each pattern presentation.
16. (Figure index, continued)
Figure 19: The SomKmeansIris experiment configuration tab
Figure 20: Experiment finished message
Figure 21: [title illegible in source]
Figure 22: Moving configuration file in the Workspace and uploading of target clusters file
Figure 23: The SomKmeansIris test configuration tab
Figure 24: Automatic post processing functionality
Figure 25: Behaviour of the automatic post processing
Figure 26: SOM single stage functionality
Figure 27: Clustering_SOM_Kmeans in Run use case

1 Introduction

The present document is the user guide of the data mining model Self Organizing Maps (SOM) and other post-SOM clustering models (hereinafter 2-Stage clustering), as implemented and integrated into the DAMEWARE web application. It is a suite of hierarchical models that can be used to execute scientific experiments for clustering on massive data sets, formatted in one of the supported types: ASCII (columns separated by spaces), CSV (comma separated values), FITS-Table (numerical col
17. d node is not the minimum, then a path will exist along the node with minimum gradient, which connects the nodes of a CC. The procedure stops on an internal node, and the next node is then examined. When the CCs thus created are shown on top of the U-Matrix, the membership of a node in a specific cluster becomes evident, as can be seen in the example proposed by the authors in Figure 9.

[Figure 9: On the left the standard U-Matrix, on the right the nodes connected by Umat-CC]

The visualization method proposed by Hamel & Brown (2011) was not implemented, because the method proposed by us was considered more effective.

[Figure 10: 2-Stage Clustering, SOM + Umat-CC. Stage 1: the SOM maps N input samples onto the Kohonen BMUs; Stage 2: Umat-CC groups them into clusters.]

Experimentally, the authors observed that a technique capable of eliminating unnecessary fragmentation of the U-Matrix, making it more homogeneous, improves the results obtained in the search for the CCs. This effect can be obtained by applying to the map a blur filter, commonly used in various graphics software. The basic idea of this filter is that each pixel becomes a weighted average of the pixels around it.
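The connected-components search used by Umat-CC (from every node, keep moving to the adjacent node with the lowest gradient until an internal node is reached) can be illustrated with the following sketch. It assumes that the gradients are the U-heights stored in a nested list `u[row][col]` and that adjacency means the 8 surrounding cells; the function name and the way ties are broken (the first lowest neighbour found wins) are choices of this example.

```python
def umat_cc(u):
    """Map every node (row, col) to the internal node its descent path ends on.

    Nodes that end on the same internal node belong to the same
    connected component, i.e. to the same cluster.
    """
    rows, cols = len(u), len(u[0])

    def lowest_neighbour(r, c):
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and u[rr][cc] < u[best[0]][best[1]]:
                    best = (rr, cc)
        return best

    component = {}
    for r in range(rows):
        for c in range(cols):
            node = (r, c)
            while True:
                nxt = lowest_neighbour(*node)
                if nxt == node:  # internal node: no adjacent node is lower
                    break
                node = nxt       # strictly lower, so the walk always terminates
            component[(r, c)] = node
    return component
```

On the one-row U-Matrix `[[0, 1, 5, 1, 0]]` the walk finds two internal nodes, at the two ends of the row, and therefore two clusters separated by the ridge in the middle.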
18. en layer, as it is commonly called) are not subject to training, but are constant and positive in the neighbourhood of each neuron.

[Figure 2: Self Organizing Map architecture (input layer x_1, x_2, ..., x_n fully connected to the output layer)]

Only one neuron of the output layer must be the winner for each pattern given as input to the network. This neuron identifies the class to which the input belongs. Each neuron of the Kohonen layer receives a stimulus equal to the sum of the inputs, each multiplied by the respective synaptic weight:

$$A(j) = \sum_{k} w(j,k)\, x(k) \qquad (1)$$

The neuron of the output layer with the greatest activation value is chosen as the winner and assumes the value 1, while all the others assume the value 0, following the classic WTA (Winner Takes All) rule. Generally a softer version of this rule is used, the WTM (Winner Takes Most). Applying this rule, we can consider the output layer nodes as connected by a lateral inhibition system called Mexican Hat. The Mexican Hat link leads to the creation of activity bubbles, which identify similar inputs.

The goal of a Kohonen network is to have nearby winning neurons for similar inputs, so that every activation bubble represents a class of inputs with similar features. The behaviour just described is reached after presenting many input patterns to the network a number of times, modifying at each input presentation only the weights that connect the winner neuron in the output layer with the neurons of the input layer. This
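The activation of equation 1 and the Winner Takes All rule amount to a weighted sum followed by an argmax, as in the sketch below. The function names and the representation of the Kohonen layer as a list of weight vectors (one per output neuron) are assumptions made for the illustration.

```python
def activations(weights, x):
    """Stimulus received by each output neuron: sum of inputs times weights."""
    return [sum(w * xk for w, xk in zip(wj, x)) for wj in weights]

def winner(weights, x):
    """Index of the output neuron with the greatest activation."""
    a = activations(weights, x)
    return max(range(len(a)), key=a.__getitem__)

def wta_output(weights, x):
    """Winner Takes All: the winner outputs 1, every other neuron outputs 0."""
    j = winner(weights, x)
    return [1 if i == j else 0 for i in range(len(weights))]
```

With two output neurons whose weights are `[1, 0]` and `[0, 1]`, the input `[0.2, 0.9]` activates the second neuron more strongly, so `wta_output` returns `[0, 1]`.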
19. (Index, continued)
2.3.2 SOM Topographic error
2.4 Clustering quality indicator
3 Use of the SOM and 2-stage Clustering models
3.1 Input
3.2 Output
3.3 [title illegible in source]
4 [title illegible in source]
4.1 First Example: Iris dataset
4.1.1 [title illegible in source]
4.1.2 [title illegible in source]
4.2 Second Example: choice of second stage
5 Appendix: References and Acronyms

TABLE INDEX
Table 1: [title illegible in source]
Table 2: List of model parameter setup web help pages available
Table 3: Abbreviations and acronyms
Table 4: Reference Documents

FIGURE INDEX
Figure 1: Flow chart of a generic unsupervised neural network
Figure 2: Self Organizing Map architecture
Figure 3: Activation rule of a node with the Mexican Hat func
20. honen self-organizing map networks for clustering analysis. Computational Statistics & Data Analysis, Vol. 38, 161-180; Kiang M.Y.; 2001
Topology preservation in self-organizing maps. Proceedings of the International Conference on Neural Networks, 294-299; Kiviluoto K.; 1996
Self-Organizing Maps, 3rd ed., Springer; Kohonen T.; 2001
U*F Clustering: A new performant cluster-mining method on segmentation of self-organizing map. Proceedings of WSOM '05, September 5-8, Paris, France, 25-32; Moutarde F., Ultsch A.; 2005
Clustering with SOM: U*C. Proc. Workshop on Self-Organizing Maps, Paris, France, 75-82; Ultsch A.; 2005
Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks, Vol. 11, No. 3, 586-600; Vesanto J., Alhoniemi E.; 2000
A K-means clustering algorithm. Applied Statistics 28, 100-108; Hartigan J.A., Wong M.A.; 1979
Table 4: Reference Documents

(Applicable Documents; columns: ID; Title / Code; Author)
A1: SuiteDesign_VONEURAL-PDD-NA-0001-Rel2.0; DAME Working Group
A2: project_plan_VONEURAL-PLA-NA-0001-Rel2.0; Brescia
A3: statement_of_work_VONEURAL-SOW-NA-0001-Rel1.0; Brescia
A4: mlpGP_DAME-MAN-NA-0008-Rel2.0; Brescia
A5: pipeline_test_VONEURAL-PRO-NA-0001-Rel1.0; D'Abrusco
A6: scientific_example
21. We give iris.txt as training dataset, specifying:
dataset type: 0, which is the value indicating an ASCII file;
input nodes: 4, because the input dataset has 4 columns;
output rows: 5;
output columns: 5;
expected clusters: 7 (K parameter of K-Means).

[Figure 19: The SomKmeansIris experiment configuration tab (functionality Clustering_SOM_Kmeans, use case Train, input file iris.txt)]

After submission, the experiment will be executed and a message will be shown when the execution is completed.

[Figure 20: Experiment finished message ("Experiment Finished. Please refer to somExp workspace for results")]

The list of output files obtained at the end of the experiment (available when the status is "ended") is shown in the dedicated sec
22. nd strategies. By merging, for fun, two famous commercial taglines we say: "Think different, Just do it!" (casually, this is an example of data (text) mining).

2 SOM and 2-Stage clustering theoretical overview

The goal of this guide is to show the use of the following data mining tools, available through the DAMEWARE web application:
an unsupervised model for clustering and dimensional reduction, the Self Organizing Map (SOM);
a library of clustering models to be used as refining post processing of the SOM.

[Figure 1: Flow chart of a generic unsupervised neural network (submit a training pattern to the network, adjust the weights based on the learning rule, repeat for all patterns, increase the epoch counter by 1, reset back to the first training pattern, stop)]

The theory of neural networks is based on computational models introduced in the 1940s by McCulloch & Pitts (1943), which reproduced in a simplified way the behaviour of a biological neuron. Neural networks are self-adaptive computational models based on the concept of learning from examples (supervised) or self-organizing (unsupervised). Self-organizing neural networks are suitable for the solution of different problems compared to networks with su
23. ode can be interpreted in different ways. In the simplest case the external node is not a BMU, and so it identifies an empty area of the data space. These nodes make the U-Matrix a powerful visualization tool, distinguishing the areas characterised by a high density of patterns (Figure 13).

[Figure 13: External nodes on the U-Matrix identifying low density areas of the data space]

If the external node is a BMU, there are two possible interpretations. The node may identify outliers, so that isolating the node leads to their identification (Figure 14).

[Figure 14: External node as outlier identifier]

Otherwise, in very complex datasets, an external node may derive from a particular configuration of patterns. When the clusters are not clearly separable, the external nodes can be seen as dividers placed between two clusters, making their division possible (Figure 15).

[Figure 15: External nodes as cluster dividers]

Considering the above, the Gaussian blur filter shown in the preceding paragraph should take very low values, in order to avoid an excessive smoothing of the map that would prevent external nodes from being recognized. The following image (Figure 16) clearly shows the role that these external
24. pervised training. The main use of these networks is precisely data analysis, in order to find groups having similarities (pre-processing and data clustering) or form classification (recognition of images or signals).

Supervised learning consists of training a network with input-target pairs, which are known solutions of optimization problems in specific points of the data space (the parameter space of the problem itself): classification, approximation or function regression. Sometimes there are no data available on the solution of the problem, only data to analyse without specific information on them (unsupervised training). A typical problem of this type is the search for classes or groups of data with similar features within an unordered group of data (clustering).

2.1 The model SOM

The best-known self-organizing neural network model takes its name from its author (Kohonen 2001). It is composed of a two-layer network, in which one layer is the input layer and the other is the output layer. The neurons of the two layers are completely connected to each other, while each neuron of the output layer is connected with a neighbourhood of neurons. The connection weights in the output layer (or Kohon
25. 3.2 Output. The following table shows the output files produced. The table is valid for all the models described in this document, so in the file names <second stage> is replaced by the type of second stage selected (Auto, Kmeans, TWL, UmatCC).
FILE | DESCRIPTION | REMARKS
SOM_<second stage>_Train_Network_Configuration.txt
SOM_<second stage>_Train_Status.log
SOM_<second stage>_Test_Status.log
SOM_<second stage>_Run_Status.log
SOM_<second stage>_Train_Results.txt
SOM_<second stage>_Test_Results.txt
SOM_<second stage>_Run_Results.txt
SOM_<second stage>_Train_Normalized_Results.txt
SOM_<second stage>_Test_Normalized_Results.txt
SOM_<second stage>_Run_Normalized_Results.txt
SOM_<second stage>_Train_Histogram.png
SOM_<second stage>_Test_Histogram.png
SOM_<second stage>_Run_Histogram.png
SOM_<second stage>_Train_Validity_indices.txt
SOM_<second stage>_Test_Validity_indices.txt
SOM_<second stage>_Run_Validity_indices.txt
SOM_<second stage>_Train_U_matrix.png
SOM_<second stage>_Test_U_matrix.png
SOM_<second stage>_Run_U_matrix.png
SOM_<second stage>_Train_Output_Layer.txt
SOM_<second stage>_Test_Output_Layer.txt
SOM_<second stage>_Run_Output_Layer.txt
SOM_<second stage>_Train_Clusters.txt
SOM_<second
27. s Even the outliers are not a problem because, by definition, they represent only a small percentage of the data and are unable to influence the process. However, if our aim were to find outliers, the use of this approach would be counterproductive. Figure 8: 2-Stage Clustering SOM K-Means. As already said, the subdivision into partitions occurs between the BMU nodes of the Kohonen layer. In order to execute this process it is therefore necessary to specify a number of clusters smaller than the expected number of BMUs identified by the SOM. 2.2.2 Post-SOM with Umat-CC. In massive multi-dimensional datasets the visualization of clusters on the U-Matrix can be difficult. Hamel and Brown (2011) propose a method to improve the interpretability of the Kohonen map by treating the nodes as the vertices of a graph in which the connected components (CC) identify the clusters. The procedure to identify the CCs is based on the concept that for each of them there exists a node, called the CC internal node, whose gradient is lower than that of all other nodes in the same CC. The gradient of a node is computed as the output of equation (3). For each node on the map, the gradients of the adjacent nodes are evaluated. If the gradient of the examine
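The idea of grouping nodes around a CC internal node can be illustrated with a minimal Python sketch. It is not the DAMEWARE implementation: it assumes that the gradient of every node has already been computed and stored in a grid `u`, that adjacency means the 8 surrounding nodes of a rectangular map, and that each node is attached to the local minimum reached by repeatedly moving to its lowest-gradient neighbour. The function names are hypothetical.

```python
def internal_node(u, r, c):
    """Follow the steepest descent of the gradient grid u from node (r, c)
    until a node with no lower-gradient neighbour is reached."""
    rows, cols = len(u), len(u[0])
    while True:
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and u[nr][nc] < u[best[0]][best[1]]:
                    best = (nr, nc)
        if best == (r, c):
            return best  # local minimum: candidate CC internal node
        r, c = best


def connected_components(u):
    """Label every node with the index of the local minimum it descends to.
    Labels are assigned in row-major order of first appearance."""
    labels = {}
    result = [[0] * len(u[0]) for _ in u]
    for r in range(len(u)):
        for c in range(len(u[0])):
            minimum = internal_node(u, r, c)
            result[r][c] = labels.setdefault(minimum, len(labels))
    return result


# Toy 2x5 map: a ridge of high gradient (0.9) separates two low-gradient areas.
toy = [
    [0.1, 0.2, 0.9, 0.3, 0.1],
    [0.2, 0.3, 0.9, 0.2, 0.2],
]
print(connected_components(toy))
```

On the toy map the ridge nodes are absorbed by one of the two components, which mirrors the remark that high-gradient nodes act as separators rather than as clusters of their own.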
28. s txt must be uploaded in the workspace just created. Figure 17: The starting point, with a Workspace somExp created and the input dataset uploaded. 4.1.1 Train Use Case. Let us suppose we create an experiment named SomKmeansIris and we want to configure it. After creation, the new configuration tab is open. Here we select Clustering_SOM_Kmeans, which indicates the functionality, the model and the type of second stage selected. We also select Train as use case. Figure 18: Selection of functionality and use case. Now we have to configure the parameters for the experiment. In particular, we will leave empty the fields that are not required (labels without asterisk). The meaning of the parameters for this use case is described in paragraph 3.3 of this document. As an alternative, you can click on the Help button to obtain a detailed description of the parameters and their default values directly from the web application.
29. stage>_Test_Clusters.txt
SOM_<second stage>_Run_Clusters.txt
SOM_<second stage>_Train_Clustered_Image.png
SOM_<second stage>_Test_Clustered_Image.png
SOM_<second stage>_Run_Clustered_Image.png
SOM_<second stage>_Train_Clustered_Image.txt
SOM_<second stage>_Test_Clustered_Image.txt
SOM_<second stage>_Run_Clustered_Image.txt
SOM_<second stage>_Train_Datacube.zip
SOM_<second stage>_Test_Datacube.zip
SOM_<second stage>_Run_Datacube.zip
DESCRIPTION column, in table order:
File containing the parameters of a trained network.
File containing details on the executed experiment.
File that, for each pattern, reports ID, features, BMU, cluster and activation of the winner node.
File with the same structure as the previous one, but with normalized features.
Histogram of the clusters found.
File that reports the validity indices of the experiment.
U-Matrix image.
File that, for each node of the output layer, reports ID, coordinates, cluster, number of patterns assigned and Uheight value.
File that, for each cluster, reports label, number of patterns assigned, percentage of association with respect to the total number of patterns, and its centroid.
Image that shows the effect of the clustering process.
File that, for each pixel, reports ID, coordinates, features and cluster assigned.
Archive that includes the clustered images of each slice of a datacube.
Table 1: Output file list.
REMARKS column: Must be moved to File Manager tab
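The naming pattern of Table 1 (SOM_<second stage>_<use case>_<file>) can be made explicit with a small Python helper. This is only an illustrative sketch: the function name is hypothetical, the dot before each extension is assumed, and the list is an upper bound, since some files are produced only under the conditions given in the remarks.

```python
SECOND_STAGES = ("Auto", "Kmeans", "TWL", "UmatCC")
USE_CASES = ("Train", "Test", "Run")

# File suffixes listed in Table 1 for every use case.
COMMON_SUFFIXES = [
    "Status.log",
    "Results.txt",
    "Normalized_Results.txt",
    "Histogram.png",
    "Validity_indices.txt",
    "U_matrix.png",
    "Output_Layer.txt",
    "Clusters.txt",
    "Clustered_Image.png",
    "Clustered_Image.txt",
    "Datacube.zip",
]


def expected_output_files(second_stage, use_case):
    """Return the candidate output file names for one experiment."""
    if second_stage not in SECOND_STAGES:
        raise ValueError(f"unknown second stage: {second_stage}")
    if use_case not in USE_CASES:
        raise ValueError(f"unknown use case: {use_case}")
    prefix = f"SOM_{second_stage}_{use_case}_"
    names = [prefix + suffix for suffix in COMMON_SUFFIXES]
    if use_case == "Train":
        # Table 1 lists the network configuration only for the Train use case.
        names.insert(0, prefix + "Network_Configuration.txt")
    return names


for name in expected_output_files("Kmeans", "Train"):
    print(name)
```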
30. 5 Appendix: References and Acronyms
Abbreviations & Acronyms
A & A | Meaning
AI: Artificial Intelligence
ANN: Artificial Neural Network
ARFF: Attribute Relation File Format
ASCII: American Standard Code for Information Interchange
BoK: Base of Knowledge
BP: Back Propagation
BLL: Business Logic Layer
CC: Connected Components
CSOM: Clustering SOM
CSV: Comma Separated Values
DAL: Data Access Layer
DAME: DAta Mining & Exploration
DAMEWARE: DAME Web Application REsource
DAPL: Data Access & Process Layer
DL: Data Layer
DM: Data Mining
DMM: Data Mining Model
DMS: Data Mining Suite
FITS: Flexible Image Transport System
FL: Frontend Layer
FW: FrameWork
GRID: Global Resource Information Database
GSOM: Gated SOM
GUI: Graphical User Interface
HW: Hardware
A & A: KDD, IEEE, INAF, JPEG, LAR, MDS, MLC, MLP, MSE, NN, OAC, PC, PI, REDB, RIA, SDSS, SL, SOFM, SOM, SW, TWL, UI, URI, VO, XML
Meaning: Knowledge Discovery in Databases; Institute of Electrical and Electronic Engineers; Istituto Nazionale di Astrofisica; Joint Photographic Experts Group; Layered Application Architecture; Massive Data Sets; Multi Layer Clustering; Multi Layer Perceptron; Mean Squ
31. tering SOM TWL test hutpy dame dst unina iVclusterine 2stagesom himi sometui tet Clustering SOM automatic test hup dame dst unina elustering 2stagesom htmlisomeauto test. Table 2: List of model parameter setup web help pages available. 4 Examples. This section shows some practical examples of the correct use of the web application. Not all aspects and available options are reported, but a significant sample of features useful for beginners of the DAME suite with little experience of data mining methodologies based on machine learning algorithms. For this reason, very simple problems will be described. More complex examples will be added here in the next releases of the documentation. 4.1 First Example: Iris dataset. This example shows the use of the SOM model with K-Means at the second stage, applied to the Iris dataset. Note that the following guide is also valid for all the other models described in this document. The models differ only slightly in some input parameters. More information about input parameters can be found in paragraph 3.3. The first step consists in the creation of a new workspace, named for example somExp, and the input dataset iri
32. tion
Figure 4: Learning diagram with vectors X input K ANd K neurons
Figure 5: SOIL 1A m0 Me UMAMIN
Figure 6: 2-Stage Clustering general diagram
Figure 8: 2-Stage Clustering SOM K-Means
Figure 9: On the left the standard U-Matrix and on the right the nodes connected by Umat-CC
Figure 10: 2-Stage Clustering SOM Umat-CC
Figure 11: Use of blurring level. An excessive blurring may produce mistakes, as shown on the left
Figure 12: External node in red on the U-Matrix
Figure 13: External nodes on the U-Matrix identifying low-density areas of data space
Figure 14: External node as outliers identifier
Figure 15: External nodes as clusters dividers
Figure 16: Example of external node to identify separation zones and/or outliers
Figure 17: The starting point, with a Workspace somExp created and the input dataset uploaded
Figure 18: Selection of functionality and use case
33. tion. Each file can be downloaded or moved in the Workspace. Figure 21: List of output files produced. 4.1.2 Test Use Case. This paragraph shows how to execute a Test use case starting from a previously executed Train. The Test use case is useful to evaluate the executed clustering through the indices described in paragraph 2.4. In order to do this, referring to the example shown above, we have to move the file SOM_Kmeans_Train_Network_Configuration.txt into the Workspace. Moreover, in order to execute a Test we need a file with one single column containing the target cluster of each pattern. This file must also be uploaded in the Workspace.
34. to be used for test and run use cases. The file is produced only if normalization of the dataset was requested. Quantization and topographic errors are always produced; the DB index is produced only in the 2-Stage Clustering case; ICA and ICC are produced only in the Test use case. The Uheight value is used to generate the U-Matrix. The file is produced only if the input dataset is an image. The file is produced only if the input dataset is an image. The file is produced only if the input dataset is a datacube. 3.3 Experiment parameter setup. There are several parameters to be set to achieve training, specific for network topology and learning algorithm setup. In the experiment configuration there is also the Help button, which redirects to a web page giving detailed information about all parameters and their default values. We remark that all parameters labeled by an asterisk are considered required. In all other cases the fields can be left empty: default values are used, as shown in the help web pages. The following table reports the web page addresses for all clustering models and related use cases covered by this manual. Table columns: Functionality and Model, Use Case, Setup Help Page. Clustering SOM K-means; Clustering SOM UmatCC; Clus
35. trained on data whose classification is known, each activity bubble may be associated with one of the classes. However, the real attraction of unsupervised paradigms is the extraction of similarity information from the manifold, working on Massive Datasets (MDS) measured in the real world. This kind of network shows very well how neural networks are generally data-intensive processes (more data than computations) rather than number-crunching processes. Several activity bubbles can represent one input class: this may be due to the heterogeneity of a class or to its extension in the space of possible shapes (in this case two input patterns placed at the extremes of the class may lead to the creation of different activity bubbles). The input data can be normalized in a range, usually [0, 1] or [-1, 1], as implemented in this software. In order to do this, a pre-processing phase on the input data is required. The normalization or pre-processing of data can be done in various ways according to the problem type: in some cases the shape of the input patterns is important (recognition of signals or images), while in other cases it is important to keep intact a dimensional relationship between input patterns (to distinguish the point defined by coordinates x, y within an area). The
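As an illustration of the normalization ranges mentioned above, the following Python sketch rescales each feature (column) of a dataset to [0, 1] or [-1, 1] with a per-feature min-max transform. It is an assumption about one common way to do this, not a description of the pre-processing actually performed by the software; the function name and the handling of constant features are hypothetical choices.

```python
def min_max_normalize(patterns, low=0.0, high=1.0):
    """Rescale every feature of `patterns` (a list of equal-length rows)
    linearly so that its minimum maps to `low` and its maximum to `high`."""
    columns = list(zip(*patterns))
    minima = [min(col) for col in columns]
    maxima = [max(col) for col in columns]
    normalized = []
    for row in patterns:
        new_row = []
        for value, mn, mx in zip(row, minima, maxima):
            if mx == mn:
                # Constant feature: no spread to rescale, map it to `low`
                # (an arbitrary choice made for this sketch).
                new_row.append(low)
            else:
                new_row.append(low + (high - low) * (value - mn) / (mx - mn))
        normalized.append(new_row)
    return normalized


data = [[1, 10], [3, 20], [5, 30]]
print(min_max_normalize(data))             # range [0, 1]
print(min_max_normalize(data, -1.0, 1.0))  # range [-1, 1]
```

Normalizing per feature keeps the relative ordering inside each feature but changes the relative scale between features, which is why the text warns that in some problems a dimensional relationship between patterns must be preserved.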
36. umns embedded into the fits file, VOTable, GIF, JPG and FITS Image. This manual is one of the specific guides (one for each data mining model available in the webapp) whose main scope is to help the user to understand the theoretical aspects of the model, to make decisions about its practical use in problem solving cases, and to use it to perform experiments through the webapp, by also being able to select the right functionality associated to the model based upon the specific problem and related data to be explored, to select the use cases, to configure internal parameters, to launch experiments and to evaluate results. The documentation package also includes a general reference manual on the webapp (useful also to understand what we mean by association between functionality and data mining model) and a GUI user guide, providing a detailed description of how to use all GUI features and options. We therefore strongly suggest reading these two manuals and gaining a little practical experience with the webapp interface before exploring specific model features by reading this and the other model guides. The whole documentation package is available at http://dame.dsf.unina.it/beta_info.html, where there is also the direct gateway to the webapp. As a general suggestion, the only effort required of the end user is to have a bit of faith in Artificial Intelligence and a little patience to learn the basic principles of its models a
37. viously, the nodes without an overlaying coloured circle have never been the BMU of any input pattern. 2.2 2-stage clustering: SOM post-processing. The goal of a 2-stage clustering method is to overcome the major problems of conventional methods, such as the sensitivity to initial prototypes (proto-clusters) and the difficulty of determining the expected number of clusters. The most used approach is the combination of a hierarchical clustering method or a SOM, followed by a partitional clustering method. The aim of the SOM at the first stage is to identify the number of clusters and the relative centroids, overcoming the problems described above. In the second stage a partitional clustering method assigns each pattern to its definitive cluster (Chi & Yang, 2008). Alternatively, it is possible to use the SOM to map the input data onto the Kohonen layer, whose nodes will be used in the next clustering stage, which could be a SOM again, or a hierarchical or partitional method. The main advantage of the second approach is that it clusters the nodes (proto-clusters) of the Kohonen layer, which are generally fewer than the input patterns. Therefore this method leads to an advantage from the computational point of view. However, in this case the choice
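The second approach (clustering the Kohonen nodes instead of the input patterns) can be sketched in Python as follows. This is a simplified illustration under several assumptions: the SOM is already trained and its node weights are given as `prototypes`, the second stage is a plain K-Means with a deterministic initialization on the first k points, and only nodes that are the BMU of at least one pattern are clustered. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bmu_index(pattern, prototypes):
    """Index of the Best Matching Unit: the node closest to the pattern."""
    return min(range(len(prototypes)), key=lambda i: sq_dist(pattern, prototypes[i]))


def kmeans(points, k, iterations=20):
    """Plain K-Means; centroids start at the first k points (sketch only).
    Returns the cluster index of each point from the last assignment step."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def two_stage_clustering(patterns, prototypes, k):
    """Stage 1: map each pattern to its BMU. Stage 2: cluster the BMU nodes
    with K-Means; each pattern inherits the cluster of its BMU."""
    bmus = [bmu_index(p, prototypes) for p in patterns]
    used_nodes = sorted(set(bmus))
    if k > len(used_nodes):
        raise ValueError("k must not exceed the number of BMU nodes")
    node_labels = kmeans([prototypes[i] for i in used_nodes], k)
    node_cluster = dict(zip(used_nodes, node_labels))
    return [node_cluster[b] for b in bmus]


prototypes = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]]
patterns = [[0.1, 0.1], [0.2, 0.9], [9.8, 10.1], [10.1, 11.2]]
print(two_stage_clustering(patterns, prototypes, 2))
```

In the example, K-Means runs on four BMU nodes rather than on the patterns themselves; with real data the number of nodes is usually much smaller than the number of patterns, which is the computational advantage described above.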
