The plot shows that for p > 0.75 the detection is perfect.

[Figure 6 plot: n = 200, k = 2, q = 0.1; curves false⁻ and false⁺ for p from 0.50 to 1.00.]

Figure 6: The x-axis shows p, the probability that two elements from the same cluster are grouped together by the similarity matrix. The y-axis shows the error percentage. The probability q is fixed to 0.1.

The situation in figure 6 is only slightly different. In fact, the only difference in the choice of the parameters is that here we fix q to 0.1. The parameter p is again increased in steps of 0.05, starting from 0.5. It is clear that the error is much smaller for the same value of p. But if we look at p − q, then the setup p = 0.7, q = 0.3 from the previous run shows slightly better detection than we see in this run with the parameters p = 0.5, q = 0.1, although both parameter choices have p − q = 0.4. When the interval between p and q is shifted away from its centered position around 0.5, the detection seems to become more difficult for the same p − q.

Finally, in figure 7 you can see false⁻ and false⁺ for different values of k, namely k = 2, 4, 7. The ratio of elements that do not belong to clusters, r_bad, is 0.4 for all following choices of parameters. Again, as in the first series, q = 1 − p, and p runs from 0.5 to 1.0 in steps of 0.05. The size of the problem, n = 60k, depends on k, thus making the problem size of the k = 7 runs equal to 420.

[Figure 7 plot: n = 60k, q = 1 − p, r_bad = 0.4; curves for k = 2, 4, 7 over p from 0.50 to 1.00.]

Figure 7: The x-axis shows p, the probability that two elements from the same cluster are grouped together by the similarity matrix. The y-axis shows the normalized error. The probability q is 1 − p.

In the plot one can see that, in general, an increasing k makes the detection harder. That seems to have something to do with the degeneration of the linear threshold, as in figure 2.

6 Conclusion

One of the main products of this work is the implementation of the two algorithms ClusterHierarchical and SpectralReconstruct. With their help one can do the described two-stage clustering of a set of elements that are characterized by the similarity matrix A, only exploiting the spectral properties of the matrix, i.e. without knowing k, p, and q. The algorithm SpectralReconstruct is replaceable by any other partition reconstruction algorithm. The focus of the experimental work lay clearly on the evaluation of the algorithm ClusterHierarchical, with the help of which one can sort out elements that do not belong to any cluster.

Concerning the algorithm ClusterHierarchical, it is the threshold value T that is of special interest. The goal was to find out if and how the threshold depends on the parameters k, p, and q.
The following routine is needed to make use of the converging computation of TRLAN. Instead of computing all eigenpairs from the start, one can define how many eigenpairs are wanted, and one can also gradually extend the result to more eigenvalues by leaving the workspace unchanged and calling TRLAN again with a request for more eigenvalues. This is done by this routine. ned specifies the number of eigenpairs that should be computed, initially set to one. Then ned is increased step by step until the absolute value of the next computed eigenvalue becomes too small. call_TRLan(ned, print) is the call that actually calls the Fortran function. print is an integer flag that indicates whether output to the standard output should be done or not.

////////////////////////////////////////////////////////////
void call_TRLan_gradually(int print)
////////////////////////////////////////////////////////////
{
    int ned = 1;                    /* number of eigenpairs requested */
    int tmpK = 1;
    float cmp = sqrt((float)size);  /* eigenvalues must stay above sqrt(n) */

    nec = 0;
    call_TRLan(ned, print);
    if (print)
        printf("%f > nec=%d\n", eval[nec + ned - 1], nec);

    /* request more eigenpairs as long as the smallest one computed
       so far is still larger than sqrt(n) */
    while (ned < size && nec > 0 && eval[nec + ned - 1] > cmp) {
        tmpK = ned;
        ned += nec;
        call_TRLan(ned, print);
        if (print) {
            printf("k=%d\n", tmpK);
            printf("%4f > nec=%d\n", eval[nec + ned - 1], nec);
        }
        k = tmpK;
    }
    if (print)
        printf("k (end) = %d\n", k);
}

Interfacing TRLAN. The following function does the call to the Fortran library.
In lines 3 to 6 a characteristic vector c is computed, indicating the status of each element. If c_j = 0, then the element j is considered to be a member of the set N, i.e. not belonging to any cluster. If c_j = 1, then the element j is considered to be a member of the set P. In the last line the algorithm returns the similarity matrix D restricted to the elements from P.

4 Implementation

To test the algorithms on random data we have implemented a C program called FASD, which stands for Finding All Similar Documents, the motivation and original title of my semester project. For calculating the eigenvalues and eigenvectors of the adjacency matrix the implementation makes use of the TRLAN library.

4.1 Random Number Generator

All experiments were done with random matrices. drand48 from the C standard library was used to generate the random matrices.

#include <stdlib.h>
#include <time.h>

double number;
srand48(time(NULL));
number = drand48();

drand48 generates double-precision pseudo-random numbers, uniformly distributed in the interval [0, 1). The routine works by generating a sequence of 48-bit integer values X_n according to the linear congruential formula

$$X_{n+1} = (a X_n + c) \bmod m, \qquad n \ge 0,$$

where $m = 2^{48}$, $a = \text{5DEECE66D}_{16}$, and $c = \text{B}_{16}$. The random number generator is initialized every time the program is started. This is done by the call to srand48(long int seed). As seed, the system time in seconds since 1.1.1970, obtained by the call to time(long int *sec), is used.
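Putting drand48 to work in the context of section 2, a matrix from the A(φ, p, q) distribution can be generated along the following lines. This is a minimal sketch under our own naming (gen_planted_matrix, a flat row-major array, equal-size cluster blocks); it is not the actual FASD generator.

/* Sketch: fill a symmetric n x n 0-1 matrix with zero trace from the
   A(phi, p, q) distribution. Name and flat array layout are
   illustrative assumptions, not the actual FASD code. */
#include <stdlib.h>
#include <time.h>

void gen_planted_matrix(double *a, int n, int k, double p, double q)
{
    int i, j;
    int s = n / k;                  /* cluster size, assuming k divides n */
    srand48(time(NULL));
    for (i = 0; i < n; i++) {
        a[i * n + i] = 0.0;         /* zero trace */
        for (j = i + 1; j < n; j++) {
            /* same cluster iff both indices fall in the same block */
            double prob = (i / s == j / s) ? p : q;
            double entry = (drand48() < prob) ? 1.0 : 0.0;
            a[i * n + j] = entry;
            a[j * n + i] = entry;   /* symmetry */
        }
    }
}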
            &nrow, wrk, &lwrk);

    if (print) {
        printf("\nThe eigenvalues:\n");
        for (i = 0; i < ipar[3]; i++)
            printf("E[%i] = %25.17e %16.4e\n", i + 1, eval[i], wrk[i]);

        /* output the ned biggest eigenvectors */
        printf("\nThe %d biggest eigenvectors:\n", ned);
        for (j = 0; j < size; j++) {
            for (i = ipar[3] - 1; i >= ipar[3] - ned; i--)
                printf("%f ", evec[i * size + j]);
            printf("\n");
        }
        printf("************************************\n");
        printf("End of TRLan\n");
        printf("************************************\n");
    }
    nec = ipar[3];    /* number of eigenpairs actually converged */
    return nec;
}

References

[1] Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233-241, 1981.
[2] M. Krivelevich and V. H. Vu. On the concentration of eigenvalues of random symmetric matrices. Microsoft Technical Report 60, 2000.
[3] F. McSherry. Spectral partitioning of random graphs. Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 529-537, 2001.
[4] M. Meila and D. Verma. A comparison of spectral clustering algorithms. UW CSE Technical Report 03-05-01.
Also, it is of very practical interest for what ranges of the parameters the method works. This work proposes a threshold that is a linear function of k. As for a dependence on p and q, we could not find evidence for any. However, the experiments show the ranges for which the algorithm correctly detects the elements that should be subjected to the second stage of the clustering process. Obviously, if p gets too small and q too large, or alternatively if the difference between p and q gets too small, then the detection collapses.

It has to be mentioned that it is not yet clear whether the results from the section about spectral properties apply to the values of n that we used in the experiments. All those results are asymptotic results for n → ∞, so it is questionable whether we chose a large enough n. However, the results are satisfactory in the sense that we managed to detect the elements that belong to clusters for a reasonable range of k, p, and q. Matrices of moderate size (200 ≤ n ≤ 500) could typically be processed in the order of minutes on a 3.0 GHz Pentium 4 personal computer with 2 GB RAM. The largest matrices (1000 ≤ n ≤ 1200) that we computed took about half an hour per single matrix, including the random generation of the matrix.

As for the algorithm SpectralReconstruct, no series of experiments was conducted, since the computation of the maximum matching that is needed for the measurement of the error is not yet implemented. The algorithm depends on two thresholds that are also derived from asymptotic considerations.
Such a partitioning can be modeled as a graph: the elements become the vertices of the graph, and two vertices are connected by an edge if and only if the associated elements belong to the same cluster. From this graph we can derive a similarity matrix, which is square and symmetric. In our case we use the n × n adjacency matrix, having zeros between elements that do not belong to the same cluster and ones between elements that do.

In the model of planted partitions that we will use, the algorithms do not have access to the adjacency matrix, or else the reconstruction would be trivial. Instead, the model assumes that the matrix representing the perfect clusters is disturbed by changes that occur with a certain probability, very much like one would imagine the clustering information to be when derived from a loosely related set of elements, such as a collection of articles.

Spectral clustering tries to exploit the spectral properties of the similarity matrix in order to reconstruct the clusters from that information only. Important parameters are of course the probabilities by which the matrix is changed, but also the size and the number of clusters.

The idea behind hierarchical clustering is very simple: the reconstruction of the clusters is divided into different steps that form a hierarchy of processing. The two algorithms presented, ClusterHierarchical and SpectralReconstruct, are intended to be applied to the data in sequential order.
It is therefore not clear to what values these thresholds should be set exactly to minimize the error of the planted partition reconstruction. Manual adjustment of the thresholds showed that it is indeed possible to achieve a flawless reconstruction of the clusters.

7 Outlook

As already mentioned, the next step could be to implement the maximum matching used to measure the error of the reconstruction. Then the two thresholds of SpectralReconstruct could be scrutinized in order to come up with practical values that can be substituted for the asymptotic thresholds. Only then will it be possible to assess the whole two-stage clustering process as described in the introduction.

A Source Code

In this appendix we have included some of the source code of the FASD program.

Global Variables. First of all, we include the definitions of some global variables that will be needed in the following code fragments.

#include <stdio.h>
#include <math.h>

int    size = 0;             /* size of the problem            */
int    nec  = 0;             /* number of eigenvalues needed   */
double **mat;                /* variable to store the matrix   */
double *eval, *evec, *wrk;   /* workspace for TRLAN            */
short  *marker;              /* result of TRLAN                */
int    NROW = 0;
int    LOHI = 1;             /* 1 computes the large eigenpairs */
int    MAXLAN;
int    MEV;
int    k = 0;                /* number of clusters             */
float  p = 0.9f;
float  q = 0.1f;

Matrix-Vector Multiplication. The following routine had to be implemented to pass it to TRLAN as an argument.
4.2 Building Under Cygwin

A considerable amount of time was spent making TRLAN accessible on a Windows XP machine. That is why we describe here the steps that led to the final solution, in order to facilitate further work. There were basically two different approaches: the Microsoft Visual Studio C++ solution and the Cygwin solution. In the end we only managed to get the Cygwin solution running. TRLAN and the other libraries are written in Fortran. Our choice of favored implementation language was C/C++; however, the interfacing between the two languages sometimes proved to be treacherous.

Microsoft Visual Studio C++. It is possible to compile Fortran code with Visual Studio, but only with an integrated Fortran compiler such as the Intel Visual Fortran compiler. It is possible to set up a project and compile TRLAN to a library. It is rather tedious work, since one has to manually add those files to the project that are needed by TRLAN from the LAPACK and BLAS directories. In the file collection that comes with this semester project there is a list of all the files that are needed. If you do it manually, it is good practice to include all the TRLAN files in the project and compile once. Then the compiler will complain about undefined references. The missing files have the same names as the functions that were not found. After adding all missing files one has to compile again to find all the undefined references that were caused by the newly added files.
TRLAN requires the caller to provide his or her own multiplication routine that implements the multiplication of the stored matrix A with one or more vectors. The parameters are: nRow, the number of rows of A; nCol, the number of columns; xIn and yOut, two one-dimensional arrays that are used to store the vectors to multiply and the resulting vectors, respectively; ldx and ldy, the numbers of elements in the vectors.

////////////////////////////////////////////////////////////
void mult_vectors(int nRow, int nCol, double *xIn, int ldx,
                  double *yOut, int ldy)
////////////////////////////////////////////////////////////
{
    int i, j, k;
    double *xTmp, *yTmp;

    for (i = 0; i < nCol; i++) {
        yTmp = yOut + ldy * i;      /* i-th output vector */
        xTmp = xIn + ldx * i;       /* i-th input vector  */
        for (j = 0; j < nRow; j++) {
            yTmp[j] = 0.0;
            for (k = 0; k < nRow; k++)
                yTmp[j] += xTmp[k] * mat[j][k];
        }
    }
}

Initialization. The following routine initializes the workspace and sets the maximal relative error that is allowed for the computation of the eigenpairs. This routine is always called once at the beginning of the eigenvalue computation.

////////////////////////////////////////////////////////////
void init_TRLan()
////////////////////////////////////////////////////////////
{
    int i;
    for (i = 0; i < MEV; i++)
        eval[i] = 0.0;
    for (i = 0; i < NROW; i++)
        evec[i] = 1.0;
    /* relative tolerance on residual norms */
    wrk[0] = 1.4901E-8;
}

Stepwise Calculation.
Hierarchical spectral clustering

Yves Brise
September 23, 2005

Professor: Emo Welzl
Advisors: Joachim Giesen, Dieter Mitsche

Contents

1 Introduction
2 Model & Problem
  2.1 Planted Partition Model
  2.2 Disturbed Planted Partition Model
  2.3 Spectral Properties For Partitioning
3 Algorithms
  3.1 SpectralReconstruct
  3.2 ClusterHierarchical
4 Implementation
  4.1 Random Number Generator
  4.2 Building Under Cygwin
  4.3 TRLAN, LAPACK and BLAS
  4.4 FASD
5 Experiments
  5.1 Threshold For ClusterHierarchical
  5.2 Increasing Number Of Bad Elements
  5.3 Varying p & q
6 Conclusion
7 Outlook
A Source Code

1 Introduction

The goal of this semester project was to implement and test two algorithms that deal with the clustering problem, that is, the problem of grouping n possibly related items into k disjoint clusters. It goes without saying that related items should be grouped together, whereas unrelated items should be distributed into different clusters. This kind of problem can, as an example, be motivated by the task of thematically grouping documents in a newsgroup. Given a collection of sports-related newsletters, what are all the documents concerning basketball? This is one of the questions that one hopes to answer reliably and efficiently.
For a complete description and the source code of LAPACK and BLAS, refer to the libraries' own documentation.

4.4 FASD

General. FASD is a simple command-line based C program that lets you generate random matrices and run the described algorithms on those matrices. In particular, you can:

- save and load matrices
- generate different random matrices
- run ClusterHierarchical on the loaded matrix
- run SpectralReconstruct on the loaded matrix
- run series of experiments with ClusterHierarchical using different parameters

The commands will be explained in the following. Typing fasd.exe will run the program from the Cygwin command line. FASD will enter the first command loop and prompt you for the command to execute. From here you can load matrices, generate random matrices, and run series of experiments. Most importantly, you can run help from here to get a short description of all commands.

First Command Loop. The program will prompt "command" and wait for user input. This is a list of all commands that you can run from the first prompt:

- load: Type load and press enter. The program will ask you for a file name. Enter the name of a file that has columns separated by spaces and rows separated by line breaks. The first line of the file should be the string "size n", with n the actual dimension of the matrix. The format can always be checked by generating a random matrix, saving it, and opening it with a text editor.

- params: The command params lets you change the parameters p and q as defined in section 2. The command will prompt you for two real numbers between 0 and 1. The first one will be the new value for p, and the second will be the new value for q. One more note: instead of typing params, enter, and then two times a number and enter, one can directly write "params p q" and enter. The appended parameters will be recognized correctly. This applies to all commands: if you know the format and the order of the following prompts, you can append their parameters to the first command prompt. Just separate the command-line parameters by spaces.

- rand: Type rand and press enter to run the command. FASD will prompt you for two integer numbers: n, the size of the problem, and k, the number of clusters. The command rand will generate a random matrix with at least k clusters. If n is not a multiple of k, then the size of the clusters is ⌊n/k⌋ and the remaining elements will be packed into their own cluster of smaller size. The probability that elements within a cluster are connected is p, and the probability that elements between different clusters are connected is q. Remember that p and q can be set with params.

- randdist: Type randdist and press enter. FASD will prompt you for three numbers. As in rand, it will prompt for n and k. This time it will also prompt for a number r_bad ∈ [0, 1] to specify the ratio of unclassifiable elements that are added to the matrix.
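To illustrate the file format that load expects (and that save writes), here is what a tiny 3 × 3 matrix file could look like; the concrete entries are made up for illustration:

size 3
0 1 0
1 0 1
0 1 0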
Planted partition detection problem. Given a matrix D drawn from the D(φ, p, q) distribution, the planted partition detection problem asks to find all elements $i \in \{1, \ldots, n\}$ with $\varphi(i) = *$. Let us define the size of this special cluster as $n_{bad} = |\varphi^{-1}(*)|$. Then we can also define $n_{ok} = n - n_{bad}$. Assume that all other clusters $\varphi^{-1}(l)$, $l \in \{1, \ldots, k\}$, have the same size $s = n_{ok}/k$. Let us further define $r_{bad} = n_{bad}/n$ to be the ratio of elements that do not belong to a cluster.

Quality of the planted partition detection. A planted partition detection algorithm takes a matrix D drawn from the distribution D(φ, p, q) as input and outputs a division of the elements into two groups. There is P (for positive), the set of elements that the algorithm assumes to belong to a cluster, and N (for negative), the set of elements that do not belong to any cluster according to the algorithm. The name of the algorithm presented later, ClusterHierarchical, is motivated by its purpose to apply this preprocessing step to the data. The following measure of error is used to evaluate the quality of a detection algorithm.

An algorithm that tries to assign an element $e \in \{1, \ldots, n\}$ to either P or N may mistakenly decide that the element does not belong to any cluster, in spite of $\varphi(e) \ne *$. Therefore it will put the element in the class of negative decisions.
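As a small illustration, the two error ratios defined below (false⁻, the ratio of mistakes within N, and false⁺, the ratio of mistakes within P) can be computed from a detection result as in the following sketch. The function name and the encoding — phi[i] == 0 standing for φ(i) = *, inP[i] for i ∈ P — are our illustrative assumptions, not FASD code.

/* Sketch: error ratios of a detection result. phi[i] == 0 encodes
   phi(i) = * (element i belongs to no cluster); inP[i] != 0 means the
   algorithm put element i into P. Names are illustrative only. */
void detection_errors(const int *phi, const short *inP, int n,
                      double *falseNeg, double *falsePos)
{
    int i, nBad = 0, nOk = 0, fn = 0, fp = 0;
    for (i = 0; i < n; i++) {
        if (phi[i] == 0) {
            nBad++;
            if (inP[i]) fp++;      /* bad element classified as P */
        } else {
            nOk++;
            if (!inP[i]) fn++;     /* cluster element classified as N */
        }
    }
    *falseNeg = (nOk  > 0) ? (double)fn / nOk  : 0.0;   /* false^- */
    *falsePos = (nBad > 0) ? (double)fp / nBad : 0.0;   /* false^+ */
}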
The eigenvalues $\bar\lambda_1 \ge \ldots \ge \bar\lambda_n$ of $\bar A$ are

$$\frac{n}{k}(p - q) + nq - p, \qquad \frac{n}{k}(p - q) - p \qquad \text{and} \qquad -p,$$

with corresponding multiplicities 1, k − 1, and n − k, respectively. A possible orthonormal basis of the eigenspace corresponding to the k largest eigenvalues of $\bar A$ is $\bar u_i$, i = 1, …, k, whose j-th coordinates are given as follows:

$$\bar u_i^{(j)} = \begin{cases} \sqrt{k/n} & \text{if } \varphi(j) = i,\\ 0 & \text{else.} \end{cases}$$

Theorem 1 (Weyl). $\max_{i=1,\ldots,n} |\lambda_i - \bar\lambda_i| \le \|A - \bar A\|$.

Spectral separation. The spectral separation $\delta_k(\bar A)$ of the eigenspace of the matrix $\bar A$ of expectations corresponding to its k largest eigenvalues from its complement is defined as the difference of the k-th and the (k+1)-th eigenvalue, i.e. it is

$$\delta_k(\bar A) = \frac{n}{k}(p - q).$$

Projection matrix. The matrix P that projects any vector in $\mathbb{R}^n$ onto the eigenspace corresponding to the k largest eigenvalues of a matrix A drawn from the distribution A(φ, p, q), i.e. the projection onto the space spanned by the vectors $v_1, \ldots, v_k$, is given as

$$P = \sum_{i=1}^{k} v_i v_i^T.$$

The matrix $\bar P$ that projects any vector in $\mathbb{R}^n$ onto the eigenspace corresponding to the k largest eigenvalues of the matrix $\bar A$ of expectations can be characterized even more explicitly. Its entries are given as

$$\bar p_{ij} = \begin{cases} k/n & \text{if } \varphi(i) = \varphi(j),\\ 0 & \text{else.} \end{cases}$$

Lemma 2. All the k largest eigenvalues of A are larger than $\sqrt{n}$, and all the n − k smallest eigenvalues of A are smaller than $\sqrt{n}$, with probability at least $1 - 2e^{-cn/8}$, provided that n is sufficiently large and $k < \frac{p-q}{4}\sqrt{n}$.

Proof. Plugging in our assumption that $k < \frac{p-q}{4}\sqrt{n}$ gives that the k largest eigenvalues of $\bar A$ are larger than $4\sqrt{n} - p > 2\sqrt{n}$. By the lemma of Füredi and Komlós it is $\|A - \bar A\| \le \sqrt{n}$ with probability at least $1 - 2e^{-cn/8}$. Now it follows from Weyl's theorem that the k largest eigenvalues of A are larger than $\sqrt{n}$ with probability at least $1 - 2e^{-cn/8}$. Since the n − k smallest eigenvalues of $\bar A$ are −p, it also follows that the n − k smallest eigenvalues of A are smaller than $\sqrt{n}$ with probability at least $1 - 2e^{-cn/8}$.

Lemma 3. With probability at least $1 - 4e^{-cn/8}$ it holds

$$\frac{n}{k}(p - q) - p - \sqrt{n} \;\le\; \lambda_k \;\le\; \frac{n}{k}(p - q) - p + \sqrt{n},$$

provided n is sufficiently large.

Proof. It holds $\bar\lambda_k = \frac{n}{k}(p - q) - p$. By combining Weyl's theorem and the lemma of Füredi and Komlós we get that with probability at least $1 - 2e^{-cn/8}$ it holds

$$\bar\lambda_k - \sqrt{n} \;\le\; \lambda_k \;\le\; \bar\lambda_k + \sqrt{n}.$$

Hence, with the same probability,

$$\frac{k}{n}\left(\lambda_k - \sqrt{n}\right) \;\le\; p - q \;\le\; \frac{k}{n}\left(\lambda_k + 1 + \sqrt{n}\right),$$

where we used p ≤ 1 for the upper bound and p ≥ 0 for the lower bound.

Remark. All the considerations in this subsection apply to matrices drawn from the A(φ, p, q) distribution only, so strictly speaking we cannot derive anything about the D(φ, p, q) distribution from them. Lemma 2 suggests that we can estimate the number of clusters k with high probability by counting the eigenvalues that are greater than $\sqrt{n}$.
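Both of our algorithms start from exactly this counting step. The following stand-alone sketch shows it as a function; the name estimate_k and the convention that the eigenvalues arrive sorted in descending order are our illustrative assumptions, not the FASD implementation.

/* Sketch: estimate the number of clusters by counting eigenvalues
   above sqrt(n). Assumes lambda is sorted in descending order;
   name and layout are illustrative, not the actual FASD routine. */
#include <math.h>

int estimate_k(const double *lambda, int n)
{
    int i, kEst = 0;
    double bound = sqrt((double)n);
    for (i = 0; i < n && lambda[i] > bound; i++)
        kEst++;
    return kEst;
}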
One has to repeat that process until there are no more undefined references. We used a trial version of the Intel Visual Fortran Compiler and tried to make it work. Unfortunately, we did not manage to call any of the Fortran functions from C/C++. We suppose that with some additional work it should be possible to compile everything under Visual Studio. Anyway, we had to look for a quicker way to the solution; that is why, in the end, we developed the Cygwin solution. One other problem with the Visual Studio solution is that it is not free software: the Intel Fortran Compiler and Microsoft Visual Studio are commercial products, so the redistributability and usability of the FASD code would depend on the availability of these programs.

Cygwin. Cygwin is an open-source emulation of Linux/UNIX-like systems on Windows PCs. It consists of a DLL, cygwin1.dll, which acts as an emulation layer providing substantial POSIX (Portable Operating System Interface) system call functionality, and a collection of tools which provide a Linux look and feel. The Cygwin DLL works with all x86 versions of Windows since Windows 95. The API follows the Single Unix Specification as much as possible, and then Linux practice. Using Cygwin it is possible to compile all the libraries with the Makefiles that they come with, and, more importantly, there are free compilers available for C/C++ as well as for Fortran.
[Figure 2 plot: n = 60k, r_bad = 0.4, p = 0.9, q = 0.1; x-axis k from 3 to 20; curves false⁻ and false⁺.]

Figure 2: The x-axis shows the number of clusters k, and the y-axis shows the ratio of wrongly classified elements. The size n of the matrix varies, since the size of the clusters is fixed.

For a relatively small value of k this threshold detects the bad elements very well. In figure 2 we can see the results for a broader range of k. We use the linear threshold that was suggested by the last experiment. The size of the clusters is fixed to 36, and the ratio of elements that do not belong to clusters, r_bad, is 0.4. That is why the size n amounts to 60k. The probabilities are set as follows: p = 0.9 and q = 0.1. We let k run from 3 to 20. For every setup of k, 30 trials have been calculated to average the error. It may be mentioned that this experiment includes the computation of the largest matrices that have been calculated: for the case of k = 20, the size of the matrices is 1200 × 1200.

5.2 Increasing Number Of Bad Elements

In this experiment we try to find out how ClusterHierarchical behaves when confronted with an increasing number of elements that do not belong to any cluster. As the threshold T we use the linear threshold from the previous subsection.
Again, the range of parameters and the output format have to be changed in the source code, and the program has to be recompiled to change the setting of the experiment.

- quit: Typing this command will result in FASD freeing all of its memory and exiting.

Second Command Loop. In case load, rand, or randdist has been run, there will be a specific matrix loaded in the program. TRLAN will automatically be called to get the eigenvalue decomposition of the matrix. FASD enters another mode and prompts you for an "action", this time asking for an action that should be applied to the currently loaded matrix. The different options that you have are:

- save: Typing save and pressing enter will result in FASD asking you for a filename. If you type a filename, FASD will store the currently loaded matrix to that file, overwriting any present file. The matrix will be stored in a simple space-separated format. The first line of the file will be "size n", where n is the size of the stored matrix. On the second line follows the first row of the matrix, its values separated by spaces. On the third line is the second row of the matrix, and so on.

- spectral: Typing this keyword and pressing enter lets the algorithm SpectralReconstruct run on the loaded matrix. The reconstructed set of clusters is printed out as a vector of integers. The size of the vector is n, and the value of each integer indicates the cluster that this element belongs to according to the reconstruction.
First, randdist determines n_ok and n_bad as defined in section 2.2. The first n_ok entries of the matrix are treated just as in the case of rand. Ideally, n_ok should be a multiple of k. The probabilities are p and q again. The last n_bad entries of the matrix are connected to any other element with probability q, just like two elements that belong to the undisturbed part of the matrix but not to the same cluster.

- seriesCH: This command runs an automated series of experiments with random matrices for ClusterHierarchical. Random matrices of the same type as described in randdist are generated, and the algorithm ClusterHierarchical tries to identify the bad entries. The results are written to a file named fasdCH.output. Be careful: results from previous experiments will be overwritten without prompt. The measure of error is the same as defined in section 2.2. As to the exact range of parameters, we have to refer to the source code. To change the range for, let's say, k, one has to change that in the source code and recompile the program. Also, the exact format of the output file can be specified easily in the source code.

- seriesHR: This command is very similar to the command seriesCH in the sense that it also conducts parameter studies, but this time for the planted partition reconstruction problem with no bad elements and the algorithm SpectralReconstruct. The results are written to a file named fasdHR.output. The measure of error is defined in section 2.1.
[Figure 4 plots: four panels for k = 3, 4, 5, 6; x-axis r_bad from 0.00 to 0.90; curves false⁻ and false⁺; n = 300.]

Figure 4: The x-axis shows r_bad, the ratio of elements that do not belong to any cluster. The y-axis shows false⁻ and false⁺. n = 300.

In figure 4 there are the plots of false⁻ and false⁺ for other values of k. With increasing k we notice that the point at which the detection becomes error-prone shifts to the left. As mentioned, this is at least partly due to the smaller cluster size that results from the fixed n. An interesting point about this series of plots is that for r_bad greater than a certain value, the two measures of error rise above 0.5. So more than half of the elements of each class are actually in the wrong class. This degeneration of the detection coincides with the collapse of the estimation of k in ClusterHierarchical. If the eigenvalues of the clusters become too small, then not all of them will be accounted for in the estimation of k, and ClusterHierarchical will attempt detection in the belief that there are fewer clusters than there are in reality (k′ < k). For example, for k = 5 and r_bad > 0.7 the estimation of k collapses. Shortly after that point, the values for false⁻ and false⁺ rise above 0.5.
- cluster: Runs the algorithm ClusterHierarchical on the loaded matrix. The separation between elements that belong to a cluster and elements that do not is printed out as a vector of size n. The vector consists of zeros and ones only. A zero indicates that the corresponding index belongs to the disturbed part of the matrix, i.e. is considered an element that does not belong to a cluster. A one stands for an element that belongs to a cluster. Since the matrices generated by randdist assign high indices to the disturbed entries, the vector should have ones in the beginning and as many zeros at the end as there are elements that do not belong to a cluster.

- stop: Unloads the current matrix and leaves the second command loop. FASD will enter the first command loop again and prompt you for a command. Type quit from there to completely leave the program.

5 Experiments

The following subsections describe different experiments that were done with FASD. The variables that we use are all defined in section 2.2.

5.1 Threshold For ClusterHierarchical

An important parameter for the algorithm ClusterHierarchical is the threshold T that is used to separate the elements. The goal of the experiments is to find this threshold. We use the following function for T:

$$T = \frac{\alpha}{\sqrt{n}},$$

which reduces the problem to finding α. This first series of experiments investigates the dependence of α on k.
ipar is an array of integers used to pass all the arguments to the function call. For an explanation of all the parameters, please refer to the user manual of TRLAN.

////////////////////////////////////////////////////////////
int call_TRLan(int ned, int print)
////////////////////////////////////////////////////////////
{
    int i, j, lwrk, nrow, mev, ipar[32];
    double tt, to, tp, tr;

    if (print) {
        printf("************************************\n");
        printf("TRLan\n");
        printf("************************************\n");
        printf("Matrix loaded %d x %d\n", size, size);
        printf("Starting eigenvalue decomposition\n\n");
    }

    nrow = NROW;
    mev  = MEV;
    lwrk = MAXLAN * (MAXLAN + 10);

    ipar[0]  = 0;
    ipar[1]  = LOHI;
    ipar[2]  = ned;
    ipar[3]  = nec;
    ipar[4]  = MAXLAN;
    /* restarting scheme 1 */
    ipar[5]  = 1;
    /* maximum number of MATVECs */
    ipar[6]  = 2000;
    /* MPI_COMM */
    ipar[7]  = 0;
    /* verboseness */
    ipar[8]  = 10;
    /* Fortran IO unit number for log messages */
    ipar[9]  = 99;
    /* iguess: 1 = supply the starting vector */
    ipar[10] = 1;
    /* checkpointing flag: write about 5 times */
    ipar[11] = 5;
    /* Fortran IO unit number for checkpoint files */
    ipar[12] = 98;
    /* floating point operations per MATVEC per PE */
    ipar[13] = 3 * NROW;

    /* call Fortran function TRLAN77 */
    F_TRLAN(mult_vectors, ipar, &nrow, &mev, eval, evec,
It turns out that the estimation of k always collapses before the errors cross the 0.5 marker.

5.3 Varying p & q

In this experiment we are interested in the effect that varying p and q has on the detection algorithm ClusterHierarchical. As the threshold T we use the linear threshold from the previous sections. In all the experiments from this subsection, r_bad is set to 0.4. That means that 40% of the elements do not belong to clusters. The size of the problem n, the number and size of the clusters, and the values for p and q themselves are varied in the different runs. The actual values are indicated in the label or the caption of the plot. For every setup of p, q, n, and k, 30 random experiments have been calculated to average the error.

[Figure 5 plot: n = 200, k = 2, q = 1 − p; curves false⁻ and false⁺ for p from 0.50 to 1.00.]

Figure 5: The x-axis shows p, the probability that two elements from the same cluster are grouped together by the similarity matrix. The y-axis shows the normalized error. The probability q is set to 1 − p.

In figure 5 we can see the plot of a run where q = 1 − p, starting from the completely random distribution for D with p = q = 0.5. Then p is increased in steps of 0.05; consequently, q is decreased by 0.05, thus increasing p − q by 0.1. If we look at p and q together, we find ourselves in the symmetric situation around 0.5, that is, p = 0.5 + δ and q = 0.5 − δ.
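In this symmetric parametrization (the δ notation is ours, introduced only to make the step sizes explicit), each step is

$$p = 0.5 + \delta,\quad q = 0.5 - \delta \;\Longrightarrow\; p - q = 2\delta,\qquad \frac{p + q}{2} = 0.5,$$

so increasing p by 0.05 increases p − q by 0.1, while the midpoint of p and q stays fixed at 0.5.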
The first step is done by ClusterHierarchical, which tries to sort out elements that are not classifiable, that is, which cannot be associated with any of the clusters. The algorithm divides the elements into two different groups and only clears one of these groups for the further processing by the following steps; the elements from the other group are discarded. In our model we will call this the planted partition detection problem, or just detection for short (see section 2.2). The second step is done by SpectralReconstruct, which attempts to do the actual grouping into clusters. We call this the planted partition reconstruction problem, or reconstruction for short.

Here is an outline of the different sections of the thesis. Section 2 describes the mathematical models and the spectral properties of the adjacency matrix. It also leads to the estimation of the number of clusters, which is a crucial ingredient of the two algorithms. The detection and the reconstruction problem basically rely on the same model, but nevertheless a few differences have to be explained. The next section, 3, presents the algorithms in pseudo-code and gives a short prose description of what they are doing. Section 4 is an experience report about the problems that occurred during the implementation, as well as a sort of user manual for the program FASD, which is one of the results of this semester project. Finally, section 5 shows the results of the experiments that were made.
Therefore we fix the following parameters. The number of elements in each cluster is 35. The ratio of elements that do not belong to a cluster, r_bad, is 0.3. Then we fix p to 0.9 and q to 0.1. Remember that $w_j = \sum_i v_i^{(j)}$ is the sum of the j-th components of the k eigenvectors that belong to the k largest eigenvalues. This w_j is compared with T to decide whether the element j should be considered as belonging to a cluster or not. The experiment conducts a parameter study for α, letting it run from 1.0 to 4.0 in steps of 0.1. Different values for k were tried. Figure 1 shows the plots of the results for 2 ≤ k ≤ 5. The size of the clusters is fixed; that is why the size of the problem n grows with increasing k. For every different setup of k and α, 30 random experiments have been run. The error measures false⁻ and false⁺ have been averaged over these trials.

The plots that we can see in figure 1 suggest a linear function for the parameter α. For every k there is a very distinct interval where both errors false⁻ and false⁺ tend to zero. Increasing k by one seems to shift this interval to the right by about 0.6 units of α. That led to the threshold

$$T = \frac{1.6 + 0.6\,(k - 2)}{\sqrt{n}}.$$

[Figure 1 plots: four panels, k = 2 (n = 100), k = 3 (n = 150), k = 4 (n = 200), k = 5 (n = 250); x-axis α from 1.0 to 4.0; curves false⁻ and false⁺.]

Figure 1: The x-axis shows the parameter α from the threshold, and the y-axis shows the ratio of wrongly classified elements. The size n of the matrix varies, since the size of the clusters is fixed.
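As a quick numerical illustration of this linear threshold (using the fitted formula above and the setups of this series), the two extreme cases evaluate to

$$k = 2,\; n = 100: \quad T = \frac{1.6}{\sqrt{100}} = 0.16, \qquad k = 5,\; n = 250: \quad T = \frac{1.6 + 0.6 \cdot 3}{\sqrt{250}} \approx 0.215.$$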
Let A be a matrix drawn from the A(φ, p, q) distribution and $\bar A$ be the matrix of expectations corresponding to this distribution. Let $c = \min\{p(1-p),\, q(1-q)\}$ and assume that $c > \frac{\log n}{n}$. Then

$$\|A - \bar A\| \le \sqrt{n}$$

with probability at least $1 - 2e^{-cn/8}$. Here $\|\cdot\|$ denotes the $L_2$ matrix norm, i.e. $\|B\| = \max_{|x| = 1} |Bx|$.

Planted partition reconstruction problem. Given a matrix A drawn from the A(φ, p, q) distribution, assume that all clusters $\varphi^{-1}(l)$, $l \in \{1, \ldots, k\}$, have the same size n/k. Then the function φ is called a partition function. The planted partition reconstruction problem asks to reconstruct φ, up to a permutation of {1, …, k}, from A only.

Quality of a reconstruction algorithm. A planted partition reconstruction algorithm takes a matrix drawn from the distribution A(φ, p, q) as input and outputs a function ψ: {1, …, n} → {1, …, k}. There are two natural measures to assess the quality of the reconstruction algorithm:

1. The probability of correct reconstruction, i.e. P[ψ = φ up to a permutation of {1, …, k}].

2. The distribution of the number of elements in {1, …, n} misclassified by the algorithm. The definition for the number of misclassifications used here (see also Meila et al. [4]) is via the size of a maximum matching on the weighted complete bipartite graph whose vertices are the original clusters $\varphi^{-1}(i)$, i = 1, …, k, and the clusters $\psi^{-1}(j)$, j = 1, …, k, produced by the algorithm.
This time the size n is fixed to 300. As before, p = 0.9 and q = 0.1. The ratio of elements that do not belong to clusters, r_bad, was varied between 0.0 and 0.95, increasing in steps of 0.05. The number of clusters is varied between 2 and 6. For every setup of k and r_bad, 30 random experiments have been calculated to average the error. In this experiment the cluster size s varies, because n is fixed. It is important to note that the smaller s will also make the planted partition detection harder, not only the increasing r_bad.

[Figure 3 plot: k = 2, n = 300; x-axis r_bad from 0.00 to 0.90; curves false⁻ and false⁺.]

Figure 3: The x-axis shows r_bad, the ratio of elements that do not belong to any cluster. The y-axis shows false⁻ and false⁺. n = 300, k = 2.

In figure 3 we can see the plots of false⁻ and false⁺ for k = 2. The algorithm ClusterHierarchical manages to make an almost perfect detection for r_bad ≤ 0.7. In this setup, r_bad = 0.7 means that there are 210 elements that do not belong to clusters and 90 that do; the size of a cluster is therefore 45. If r_bad increases further, then the detection collapses. First false⁻ rises, shortly before false⁺ rises too. In the end, false⁻ and false⁺ come close to 0.5. Note that these values are attained by a detection algorithm that simply distributes the elements to P or N with probability 0.5. As expected, the result for r_bad = 0.95 can be called a complete miss.
Roughly speaking, two indices i and j are put into the same cluster if the Hamming distance of the corresponding vectors c_i and c_j is small (test in line 14). A cluster as created in lines 12 to 17 is not allowed to be too small (test in line 18); otherwise its elements get distributed into other clusters that are going to be constructed in future executions of the body of the while loop. Note that the algorithm runs in time polynomial in n and only makes use of quantities that can be deduced from A, i.e. it does not need to know the values of p, q, and k.

3.2 ClusterHierarchical

This algorithm deals with the removal of elements that do not belong to a cluster. The input is the symmetric similarity matrix D, and the output is the matrix D restricted to a certain subset of its columns and rows. Elements that do not seem to belong to any cluster are detected and discarded for the further reconstruction. Let $v_i^{(j)}$ be the j-th entry of the eigenvector corresponding to the i-th largest eigenvalue.

ClusterHierarchical(D)
1  k′ ← number of eigenvalues of D that are larger than √n
2  w_j ← Σ_{i=1}^{k′} v_i^{(j)},  1 ≤ j ≤ n
3  for j = 1 to n do
4      if w_j > T then c_j ← 1
5      else c_j ← 0
6  end for
7  return D restricted to rows and columns j with c_j = 1

In line 1 the estimated number of clusters k′ is calculated. The estimate is motivated by Lemma 2. In line 2 the component-wise sum of the k′ largest eigenvectors is calculated.
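The thresholding loop of lines 2 to 6 translates almost directly into C. The following sketch is our own illustration (the function and variable names are not from FASD); it assumes the k′ largest eigenvectors are stored one after another in a flat array, entry j of eigenvector i at evec[i*n + j]:

/* Sketch of lines 2-6 of ClusterHierarchical: mark element j iff the
   component-wise sum of the kPrime largest eigenvectors exceeds T.
   Names and the flat evec layout are illustrative assumptions. */
void mark_cluster_elements(const double *evec, int n, int kPrime,
                           double T, short *c)
{
    int i, j;
    for (j = 0; j < n; j++) {
        double w = 0.0;
        for (i = 0; i < kPrime; i++)
            w += evec[i * n + j];   /* j-th entry of i-th eigenvector */
        c[j] = (w > T) ? 1 : 0;     /* 1 = belongs to a cluster */
    }
}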
This is a very important fact that we use in our algorithm SpectralReconstruct. However, in a similar derivation it can be shown that this procedure is also applicable to the case of the D(φ, p, q) distribution, as done in the algorithm ClusterHierarchical.

3 Algorithms

3.1 SpectralReconstruct

Now we have all prerequisites at hand that we need to describe our spectral algorithm to solve the planted partition reconstruction problem.

SpectralReconstruct(A)
 1  k′ ← number of eigenvalues of A that are larger than √n
 2  P ← projection matrix computed from the k′ largest eigenvectors v_1, …, v_{k′} of A
 3  for i = 1 to n do
 4      R_i ← set of row indices which are among the n/k′ largest entries of the i-th column of P
 5      for j = 1 to n do
 6          if j ∈ R_i then c_i^{(j)} ← 1
 7          else c_i^{(j)} ← 0
 8      end for
 9  end for
10  I ← {1, …, n}
11  while there exists an unmarked index i ∈ I do
12      C ← ∅
13      for each j ∈ I do
14          if c_i^T c_j ≥ n/(2k′) then
15              C ← C ∪ {j}
16          end if
17      end for
18      if |C| ≥ n/(2k′) then
19          output C as a cluster
20          I ← I \ C
21      else
22          mark index i
23      end if
24  end while
25  return the clusters C_1, C_2, …

In line 1 the number of planted clusters k′ is estimated. The estimate is motivated by Lemma 2. In line 2 the projection matrix that belongs to the k′ largest eigenvectors is computed. From line 3 to line 9, for each column i of A a vector $c_i \in \{0, 1\}^n$ with exactly n/k′ entries that are one is computed. In lines 10 to 24 the actual reconstruction of the clusters takes place.
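To make the grouping phase (lines 10 to 24) concrete, here is a compact C sketch of the dot-product test between characteristic vectors. The names, the flat 0/1 array representation, and the parameter tau — which stands in for the asymptotically derived bound used in line 14, whatever value one substitutes for it — are our illustrative assumptions, not FASD code:

/* Sketch of the cluster-growing step of SpectralReconstruct: collect
   all active j whose characteristic vector overlaps c_i in at least
   tau positions. c is an n x n 0/1 array, c[i*n + t] = t-th entry of
   c_i. Names and layout are illustrative assumptions. */
int grow_cluster(const short *c, int n, int i, double tau,
                 const short *active, short *inC)
{
    int j, t, sizeC = 0;
    for (j = 0; j < n; j++) {
        int dot = 0;
        if (!active[j]) { inC[j] = 0; continue; }  /* j already removed */
        for (t = 0; t < n; t++)
            dot += c[i * n + t] * c[j * n + t];    /* c_i . c_j */
        inC[j] = (dot >= tau) ? 1 : 0;
        sizeC += inC[j];
    }
    return sizeC;   /* caller keeps C only if it is large enough (line 18) */
}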
In these experiments we generate random clustering data and then measure the error of the detection algorithm ClusterHierarchical for different parameters. This is followed by a conclusion and an outlook on what future work could comprise. In the appendix we add and comment on some of the source code, especially those parts that are concerned with the calling of the external libraries that we use.

2 Model & Problem

2.1 Planted Partition Model

In this subsection we introduce the planted partition reconstruction problem and define two quality measures that can be used to compare different partitioning algorithms. We first introduce the A(φ, p, q) distribution (see also McSherry [3]).

A(φ, p, q) distribution. Given a surjective function φ: {1, …, n} → {1, …, k} and probabilities p, q ∈ [0, 1] with p > q, the A(φ, p, q) distribution is a distribution on the set of n × n symmetric 0-1 matrices with zero trace. Let $A = (a_{ij})$ be a matrix drawn from this distribution. It is $a_{ii} = 0$, and for $i \ne j$:

$$\begin{aligned}
P[a_{ij} = 1] &= p &&\text{if } \varphi(i) = \varphi(j),\\
P[a_{ij} = 0] &= 1 - p &&\text{if } \varphi(i) = \varphi(j),\\
P[a_{ij} = 1] &= q &&\text{if } \varphi(i) \ne \varphi(j),\\
P[a_{ij} = 0] &= 1 - q &&\text{if } \varphi(i) \ne \varphi(j),
\end{aligned}$$

independently. The matrix of expectations $\bar A = (\bar a_{ij})$ corresponding to the A(φ, p, q) distribution is given as $\bar a_{ii} = 0$, $\bar a_{ij} = p$ if $\varphi(i) = \varphi(j)$ and $i \ne j$, and $\bar a_{ij} = q$ if $\varphi(i) \ne \varphi(j)$.

Lemma 1 (Füredi and Komlós [1], Krivelevich and Vu [2]).
As C/C++ compiler we used g++, and as Fortran compiler we used g95. First one has to compile TRLAN with g95 to make a Cygwin library. Then the FASD program is compiled with g++. The linking has to be done with g95 again. Unfortunately, we did not manage to give the g95 linker access to the C++ library, which led to the inconvenience of not being able to use any C++ syntax. That is the reason why FASD is written in pure C. It would have been favorable if we could have used the C++ standard libraries, but again, after a lot of tinkering around, we had to settle for the running version. In the appendix there are some excerpts from the source code that show how the interfacing between TRLAN and FASD is done.

4.3 TRLAN, LAPACK and BLAS

TRLAN is a program designed to find the eigenvalues and the corresponding eigenvectors of sparse symmetric matrices. It is especially optimized to find the extreme eigenvalues, using a restarted Lanczos algorithm. This is convenient for us, since we only need the largest eigenvalues for our algorithms. TRLAN is written in Fortran 95 and had to be built from source. For a complete description of TRLAN please refer to http://crd.lbl.gov/~kewu/trlan.html, the web page of TRLAN.

TRLAN makes use of LAPACK (Linear Algebra PACKage) and BLAS (Basic Linear Algebra Subprograms), two freely available Fortran 77 libraries. The libraries provide efficient implementations of matrix and vector operations that are needed by TRLAN.
We define false negative to be the set of elements that were mistakenly put in the negative class N. It is also convenient to define the ratio false⁻ to be the ratio of mistakes within the class N:

$$\text{false}^- = \frac{|\{e \in N \mid \varphi(e) \ne *\}|}{n_{ok}}.$$

Similarly, we will refer to false positive as the subset of the elements $i \in P$ that do not actually belong to a cluster, i.e. $\varphi(i) = *$. Again, we also define false⁺ to be the ratio of wrong decisions within the class P:

$$\text{false}^+ = \frac{|\{i \in P \mid \varphi(i) = *\}|}{n_{bad}}.$$

2.3 Spectral Properties For Partitioning

Any real symmetric n × n matrix has n real eigenvalues, and $\mathbb{R}^n$ has a corresponding eigenbasis. Here we are concerned with two types of real symmetric matrices: first, any matrix drawn from an A(φ, p, q) distribution; second, the matrix $\bar A$ of expectations corresponding to the distribution A(φ, p, q). We want to denote the eigenvalues of A by $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$ and the vectors of a corresponding orthonormal eigenbasis of $\mathbb{R}^n$ by $v_1, \ldots, v_n$, i.e. it is $A v_i = \lambda_i v_i$, $v_i^T v_j = 0$ for $i \ne j$, $v_i^T v_i = 1$, and the $v_1, \ldots, v_n$ span the whole $\mathbb{R}^n$. For the sake of analysis we want to assume here, without loss of generality, that the matrix $\bar A$ of expectations has a block-diagonal structure, i.e. the elements in the i-th cluster have indices from $\frac{n}{k}(i-1) + 1$ to $\frac{n}{k} i$. It is easy to verify that the eigenvalues of $\bar A$ are the ones given above.
The weight of the edge $(\varphi^{-1}(i), \psi^{-1}(j))$ is $|\varphi^{-1}(i) \cap \psi^{-1}(j)|$, i.e. the size of the intersection of the clusters. The matching gives a pairing of the clusters defined by φ and ψ. Assume, without loss of generality, that always $\varphi^{-1}(i)$ and $\psi^{-1}(i)$ are paired. Then the number of misclassifications is given as

$$n - \sum_{i=1}^{\min\{k, k'\}} |\varphi^{-1}(i) \cap \psi^{-1}(i)|.$$

2.2 Disturbed Planted Partition Model

The assessment of ClusterHierarchical requires a slightly different model. We introduce a special symbol * to represent the set of elements that do not belong to any cluster, and we have to adapt the measurement of error.

D(φ, p, q) distribution. Given a surjective function φ: {1, …, n} → {1, …, k} ∪ {*} and probabilities p, q ∈ [0, 1] with p > q, the D(φ, p, q) distribution is a distribution on the set of n × n symmetric 0-1 matrices with zero trace. Let $D = (d_{ij})$ be a matrix drawn from this distribution. It is $d_{ii} = 0$, and for $i \ne j$:

$$\begin{aligned}
P[d_{ij} = 1] &= p &&\text{if } \varphi(i) = \varphi(j) \ne *,\\
P[d_{ij} = 0] &= 1 - p &&\text{if } \varphi(i) = \varphi(j) \ne *,\\
P[d_{ij} = 1] &= q &&\text{if } \varphi(i) = * \,\vee\, \varphi(j) = * \,\vee\, \varphi(i) \ne \varphi(j),\\
P[d_{ij} = 0] &= 1 - q &&\text{if } \varphi(i) = * \,\vee\, \varphi(j) = * \,\vee\, \varphi(i) \ne \varphi(j),
\end{aligned}$$

independently. The matrix of expectations $\bar D = (\bar d_{ij})$ corresponding to the D(φ, p, q) distribution is given as

$$\bar d_{ij} = \begin{cases} 0 & \text{if } i = j,\\ p & \text{if } i \ne j \text{ and } \varphi(i) = \varphi(j) \ne *,\\ q & \text{if } i \ne j \text{ and } (\varphi(i) = * \vee \varphi(j) = * \vee \varphi(i) \ne \varphi(j)). \end{cases}$$

That means that elements that do not belong to any cluster get connected to any other element with probability q. Elements that do belong to clusters get connected to elements from the same cluster with probability p, and with probability q to any other element.
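Analogously to the generator sketched in section 4.1, a matrix from this disturbed distribution can be produced by appending the n − n_ok unclassifiable elements after the n_ok clustered ones, which is also how FASD's randdist arranges them. The function name and layout are again our illustrative assumptions, not the actual FASD routine:

/* Sketch: fill a symmetric n x n 0-1 matrix from the D(phi, p, q)
   distribution. The first nOk indices form the k clusters; the last
   n - nOk indices belong to no cluster (phi = *) and are connected to
   everything with probability q. Illustrative, not FASD code. */
#include <stdlib.h>

void gen_disturbed_matrix(double *d, int n, int nOk, int k,
                          double p, double q)
{
    int i, j;
    int s = nOk / k;                /* cluster size, assuming k divides nOk */
    for (i = 0; i < n; i++) {
        d[i * n + i] = 0.0;         /* zero trace */
        for (j = i + 1; j < n; j++) {
            int sameCluster = (i < nOk && j < nOk && i / s == j / s);
            double prob = sameCluster ? p : q;
            double entry = (drand48() < prob) ? 1.0 : 0.0;
            d[i * n + j] = entry;
            d[j * n + i] = entry;   /* symmetry */
        }
    }
}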
