ParMETIS 4.x Manual
int ParMETIS_V3_PartGeomKway(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *adjwgt, idx_t *wgtflag, idx_t *numflag, idx_t *ndims, real_t *xyz, idx_t *ncon, idx_t *nparts, real_t *tpwgts, real_t *ubvec, idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm)

Description
This routine is used to compute a k-way partitioning of a graph on p processors by combining the coordinate-based and multi-constraint k-way partitioning schemes.

Parameters
vtxdist: This array describes how the vertices of the graph are distributed among the processors. (See discussion in Section 4.2.1.) Its contents are identical for every processor.
xadj, adjncy: These store the local adjacency structure of the graph at each processor. (See discussion in Section 4.2.1.)
vwgt, adjwgt: These store the weights of the vertices and edges. (See discussion in Section 4.2.1.)
wgtflag: This is used to indicate if the graph is weighted. wgtflag can take one of four values:
0: No weights (vwgt and adjwgt are both NULL).
1: Weights on the edges only (vwgt is NULL).
2: Weights on the vertices only (adjwgt is NULL).
3: Weights on both the vertices and edges.
numflag: This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and part arrays. numflag can take one of two values:
0: C-style numbering that starts from 0.
1: Fortran-style numbering that starts from 1.
If ITR is set high, a repartitioning with a low edge-cut will be computed. If it is set low, a repartitioning that requires little data redistribution will be computed. Good values for this parameter can be obtained by dividing inter-processor communication time by data redistribution time. Otherwise, a value of 1000.0 is recommended.
options: This is an array of integers that is used to pass additional parameters to the routine. The first element (i.e., options[0]) can take either the value of 0 or 1. If it is 0, then the default values are used; otherwise the remaining three elements of options are interpreted as follows:
options[1]: This specifies the level of information to be returned during the execution of the algorithm. Timing information can be obtained by setting this to 1. Additional options for this parameter can be obtained by looking at parmetis.h. The numerical values there should be added to obtain the correct value. The default value is 0.
options[2]: This is the random number seed for the routine.
options[3]: This specifies whether the sub-domains and processors are coupled or un-coupled. If the number of sub-domains desired (i.e., nparts) and the number of processors that are being used are not the same, then these must be un-coupled. However, if nparts equals the number of processors, these can either be coupled or de-coupled. If sub-domains and processors are coupled, then the initial partitioning will be obtained implicitly from the graph distribution.
    ParMETIS_V3_PartGeomKway
    ParMETIS_V3_PartGeom
    ParMETIS_V3_PartMeshKway
  5.2 Graph Repartitioning
    ParMETIS_V3_AdaptiveRepart
  5.3 Partitioning Refinement
    ParMETIS_V3_RefineKway
  5.4 Fill-Reducing Ordering
    ParMETIS_V3_NodeND
    ParMETIS_V32_NodeND
  5.5 Mesh-to-Graph Translation
    ParMETIS_V3_Mesh2Dual
6 Restrictions & Limitations
7 Hardware & Software Requirements, and Contact Information
8 Copyright & License Notice

1 Introduction

ParMETIS is an MPI-based parallel library that implements a variety of algorithms for partitioning and repartitioning unstructured graphs and for computing fill-reducing orderings of sparse matrices. ParMETIS is particularly suited for parallel numerical simulations involving large unstructured meshes. In this type of computation, ParMETIS dramatically reduces the time spent in communication by computing mesh decompositions such that the numbers of interface elements are minimized. The algorithms in ParMETIS are based on the multilevel partitioning and fill-reducing ordering algorithms that are implemented in the widely used serial package METIS [5].
However, the expanded functionality provided with this release is only available by using the new calling sequences.

- The four adaptive repartitioning routines (ParMETIS_RepartLDiffusion, ParMETIS_RepartGDiffusion, ParMETIS_RepartRemap, and ParMETIS_RepartMLRemap) have been replaced by a single implementation of a unified repartitioning algorithm [15], ParMETIS_V3_AdaptiveRepart, that combines the best features of the previous routines.
- Multiple vertex weights/balance constraints are supported for most of the routines. This allows ParMETIS to be used to partition graphs for multi-phase and multi-physics simulations.
- In order to optimize partitionings for specific heterogeneous computing architectures, it is now possible to specify the target sub-domain weights for each of the sub-domains and for each balance constraint. This feature, for example, allows the user to compute a partitioning in which one of the sub-domains is twice the size of all of the others.
- The number of sub-domains has been de-coupled from the number of processors in both the static and the adaptive partitioning schemes. Hence, it is now possible to use the parallel partitioning and repartitioning algorithms to compute a k-way partitioning independent of the number of processors that are used. Note that Version 2.0 provided this functionality for the static partitioning schemes only.
- Routines are provided for both directly partitioning finite-element meshes and for constructing the dual graphs of meshes in parallel.
5.5 Mesh-to-Graph Translation

int ParMETIS_V3_Mesh2Dual(idx_t *elmdist, idx_t *eptr, idx_t *eind, idx_t *numflag, idx_t *ncommonnodes, idx_t **xadj, idx_t **adjncy, MPI_Comm *comm)

Description
This routine is used to construct a distributed graph given a distributed mesh. It can be used in conjunction with other routines in the ParMETIS library. The mesh can contain elements of different types.

Parameters
elmdist: This array describes how the elements of the mesh are distributed among the processors. It is analogous to the vtxdist array. Its contents are identical for every processor. (See discussion in Section 4.2.3.)
eptr, eind: These arrays specify the elements that are stored locally at each processor. (See discussion in Section 4.2.3.)
numflag: This is used to indicate the numbering scheme that is used for the elmdist, eptr, eind, xadj, and adjncy arrays. numflag can take one of two values:
0: C-style numbering that starts from 0.
1: Fortran-style numbering that starts from 1.
ncommonnodes: This parameter determines the degree of connectivity among the vertices in the dual graph. Specifically, an edge is placed between any two elements if and only if they share at least this many nodes. This value should be greater than 0, and for most meshes a value of two will create reasonable dual graphs. However, depending on the type of elements in the mesh, values greater than 2 may also be valid choices. For example, for meshes containing only triangular, tetrahedral, hexahedral, or rectangular elements, this parameter can be set to two, three, four, or two, respectively. (A usage sketch follows below.)
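As an illustration, here is a minimal sketch (not part of the original manual) of calling this routine for a mesh of triangles. The elmdist, eptr, and eind arrays and the communicator comm are assumed to have been set up as described in Section 4.2.3.

    #include <parmetis.h>

    /* Sketch: build the distributed dual graph of a triangle mesh. */
    idx_t numflag = 0;         /* C-style numbering */
    idx_t ncommonnodes = 2;    /* two triangles are adjacent if they share 2 nodes (an edge) */
    idx_t *xadj, *adjncy;      /* allocated by ParMETIS */

    ParMETIS_V3_Mesh2Dual(elmdist, eptr, eind, &numflag, &ncommonnodes,
                          &xadj, &adjncy, &comm);

    /* ... use xadj/adjncy, e.g. as input to ParMETIS_V3_PartKway ... */
    METIS_Free(xadj);          /* the caller must free the returned arrays */
    METIS_Free(adjncy);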
adjwgt, idx_t *wgtflag, idx_t *numflag, idx_t *ncon, idx_t *nparts, real_t *tpwgts, real_t *ubvec, idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm)

Description
This routine is used to improve the quality of an existing k-way partitioning on p processors using the multilevel k-way refinement algorithm.

Parameters
vtxdist: This array describes how the vertices of the graph are distributed among the processors. (See discussion in Section 4.2.1.) Its contents are identical for every processor.
xadj, adjncy: These store the local adjacency structure of the graph at each processor. (See discussion in Section 4.2.1.)
vwgt, adjwgt: These store the weights of the vertices and edges. (See discussion in Section 4.2.1.)
ncon: This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
nparts: This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
wgtflag: This is used to indicate if the graph is weighted. wgtflag can take one of four values:
0: No weights (vwgt and adjwgt are both NULL).
1: Weights on the edges only (vwgt is NULL).
2: Weights on the vertices only (adjwgt is NULL).
3: Weights on both the vertices and edges.
numflag: This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and part arrays. numflag can take one of two values:
0: C-style numbering that starts from 0.
1: Fortran-style numbering that starts from 1.
The quality of the orderings computed by ParMETIS is now comparable to those computed by METIS's nested dissection routines. In addition, version 3.2 contains a number of bug fixes and documentation corrections. Note that changes in the documentation are marked using change bars.

2.3 Changes between 3.0/3.1 and 2.0

Version 3.x contains a number of changes over the previous major release (version 2.x). These changes include the following.

The names and calling sequences of all the routines have changed due to expanded functionality that has been provided in this release. Table 1 shows how the names of the various routines map from version to version.

Version 1.0      Version 2.0                  Version 3.0
PARKMETIS        ParMETIS_PartKway            ParMETIS_V3_PartKway
PARGKMETIS       ParMETIS_PartGeomKway        ParMETIS_V3_PartGeomKway
PARGMETIS        ParMETIS_PartGeom            ParMETIS_V3_PartGeom
PARGRMETIS       Not available                Not available
PARRMETIS        ParMETIS_RefineKway          ParMETIS_V3_RefineKway
PARUAMETIS       ParMETIS_RepartLDiffusion    ParMETIS_V3_AdaptiveRepart
PARDAMETIS       ParMETIS_RepartGDiffusion    ParMETIS_V3_AdaptiveRepart
Not available    ParMETIS_RepartRemap         ParMETIS_V3_AdaptiveRepart
Not available    ParMETIS_RepartMLRemap       ParMETIS_V3_AdaptiveRepart
PAROMETIS        ParMETIS_NodeND              ParMETIS_V3_NodeND
Not available    Not available                ParMETIS_V3_PartMeshKway
Not available    Not available                ParMETIS_V3_Mesh2Dual

Table 1: The relationships between the names of the routines in the different versions of ParMETIS.

Note that Version 3.0 is fully backwards compatible with all previous versions of ParMETIS. That is, the old API calls have been mapped to the new routines.
Even though ParMETIS contains no known bugs, this does not mean that all of its bugs have been found and fixed. If you have any problems, please send email to karypis@cs.umn.edu with a brief description of the problem.

8 Copyright & License Notice

ParMETIS is copyrighted by the Regents of the University of Minnesota. It can be freely used for educational and research purposes by non-profit institutions and US government agencies only. Other organizations are allowed to use ParMETIS only for evaluation purposes, and any further uses will require prior approval. The software may not be sold or redistributed without prior approval. One may make copies of the software for their use provided that the copies are not sold or distributed, and are used under the same terms and conditions. As unestablished research software, this code is provided on an "as is" basis without warranty of any kind, either expressed or implied. Downloading or executing any part of this software constitutes an implicit agreement to these terms. These terms and conditions are subject to change at any time without prior notice.

References
[1] R. Biswas and R. Strawn. A new procedure for dynamic adaption of three-dimensional unstructured grids. Applied Numerical Mathematics, 13:437-452, 1994.
[2] C. Fiduccia and R. Mattheyses. A linear time heuristic for improving network partitions. In Proc. 19th IEEE Design Automation Conference, pages 175-181, 1982.
[3] J. Fingberg, A. Basermann, G. Lonsdale, J. Clinckemaillie, J. Gratien, and R. Ducloux. Dynamic load-balancing for parallel structural mechanics simulations with DRAMA. ECT2000, 2000.
simulations [3]. Figure 3 illustrates the characteristics of partitionings that are needed for these simulations. Figure 3(a) shows a mesh for a particle-in-cells computation. Assuming that a synchronization separates the mesh-based computation from the particle computation, a partitioning is required that balances both the number of mesh elements and the number of particles across the sub-domains. Figure 3(b) shows a mesh for a contact/impact simulation. During the contact detection phase, computation is performed only on the surface (i.e., lightly shaded) elements, while during the impact phase, computation is performed on all of the elements. Therefore, in order to ensure that both phases are load balanced, a partitioning must balance both the total number of mesh elements and the number of surface elements across the sub-domains. The solid partitioning in Figure 3(b) does this. The dashed partitioning is similar to what a traditional graph partitioner might compute; this partitioning balances only the total number of mesh elements. The surface elements are imbalanced by over 50%.

A new formulation of the graph partitioning problem is presented in [6] that is able to model the problem of balancing multiple computational phases simultaneously, while also minimizing the inter-processor communications. In this formulation, a weight vector of size m is assigned to each vertex of the graph. The multi-constraint graph partitioning problem then is to compute a partitioning such that the edge-cut is minimized and every sub-domain has approximately the same amount of each of the vertex weights.
that are well distributed. The algorithm implemented by ParMETIS_V32_NodeND is based on a multilevel nested dissection algorithm. This algorithm has been shown to produce low-fill orderings for a wide variety of matrices. Furthermore, it leads to balanced elimination trees that are essential for parallel direct factorization. ParMETIS_V32_NodeND uses a multilevel node-based refinement algorithm that is particularly suited for directly refining the size of the separators. To achieve high performance, ParMETIS_V32_NodeND first uses ParMETIS_V3_PartKway to compute a high-quality partitioning and redistributes the graph accordingly. Next it proceeds to compute the log p levels of the elimination tree concurrently. When the graph has been separated into p parts (where p is the number of processors), the graph is redistributed among the processors so that each processor receives a single subgraph, and METIS's serial nested dissection ordering algorithm is used to order these smaller subgraphs.

4 ParMETIS API

The various routines implemented in ParMETIS can be accessed from a C, C++, or Fortran program by using the supplied library. In the rest of this section we describe the ParMETIS API: we first describe various calling and usage conventions and the various data structures used to pass information into and get information out of the routines, followed by a detailed description of the calling sequences of the various routines.

4.1 Header files

Any program using the ParMETIS API needs to include the parmetis.h header file.
that is analogous to the vwgt array.

4.2.4 Format of the Computed Partitionings and Orderings

Format of the Partitioning Array. The partitioning and repartitioning routines require that an array called part, of size n (where n is the number of local vertices), be passed as a parameter on each processor. Upon completion of the ParMETIS routine, for each vertex j, the sub-domain number (i.e., the processor label) to which this vertex belongs will have been written to part[j]. Note that ParMETIS does not redistribute the graph according to the new partitioning; it simply computes the partitioning and writes it to the part array.

Additionally, whenever the number of sub-domains does not equal the number of processors that are used to compute a repartitioning, ParMETIS_V3_RefineKway and ParMETIS_V3_AdaptiveRepart require that the previously computed partitioning be passed as a parameter via the part array. (This is also required whenever the user chooses to de-couple the sub-domains from the processors. See discussion in Section 5.2.) This is because the initial partitioning needs to be obtained from the values supplied in the part array. If the numbers of sub-domains and processors are equal, then the initial partitioning can be obtained from the initial graph distribution, and so this information need not be supplied. In this case, for each processor i, every element of part would be set to i.

Format of the Ordering and Separator Sizes Arrays. Each processor
ndims: The number of dimensions of the space in which the graph is embedded.
xyz: The array storing the coordinates of the vertices (described in Section 4.2.2).
ncon: This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
nparts: This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
tpwgts: An array of size ncon x nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon x nparts elements should be set to a value of 1/nparts. If ncon is greater than one, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one. (A sketch of filling these arrays follows below.)
ubvec: An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
options: This is an array of integers that is used to pass parameters to the routine. Their meanings are identical to those of ParMETIS_V3_PartKway.
edgecut: Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
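To make the tpwgts and ubvec conventions concrete, here is a small sketch (not from the original manual) that fills both arrays for equal-sized sub-domains; the values ncon = 2 and nparts = 8 are hypothetical.

    #include <stdlib.h>
    #include <parmetis.h>

    idx_t ncon = 2, nparts = 8;                      /* hypothetical values */
    real_t *tpwgts = malloc(ncon * nparts * sizeof(real_t));
    real_t *ubvec  = malloc(ncon * sizeof(real_t));

    /* The ncon target weights of each sub-domain are stored contiguously;
       for each vertex weight, the fractions over all sub-domains sum to one. */
    for (idx_t j = 0; j < nparts; j++)
        for (idx_t c = 0; c < ncon; c++)
            tpwgts[j * ncon + c] = 1.0 / (real_t)nparts;

    for (idx_t c = 0; c < ncon; c++)
        ubvec[c] = 1.05;                             /* recommended tolerance */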
the same amount of mesh-based work associated with them. The second weight represents the work associated with the particle-based computation. This value is estimated by the number of particles that fall within each element. A multi-constraint partitioning is shown that balances both of these weights.

3.6 Partitioning for Heterogeneous Computing Architectures

Complex, heterogeneous computing platforms, such as groups of tightly coupled shared-memory nodes that are loosely connected via high-bandwidth and high-latency interconnection networks, and/or processing nodes that have complex memory hierarchies, are becoming more common, as they display competitive cost-to-performance ratios. The same is true of platforms that are geographically distributed. Most existing parallel simulation codes can easily be ported to a wide range of parallel architectures, as they employ a standard messaging layer such as MPI. However, complex and heterogeneous architectures present new challenges to the scalable execution of such codes, since many of the basic parallel algorithm design assumptions are no longer valid.

We have taken the first steps toward developing architecture-aware graph partitioning algorithms. These are able to compute partitionings that allow computations to achieve the highest levels of performance regardless of the computing platform. Specifically, we have enabled ParMETIS_V3_PartKway, ParMETIS_V3_PartGeomKway, ParMETIS_V3_PartMeshKway, ParMETIS_V3_RefineKway,
The wgtflag parameter can take one of two values:
0: No weights (elmwgt is NULL).
2: Weights on the vertices only.
numflag: This is used to indicate the numbering scheme that is used for the elmdist, eptr, eind, and part arrays. numflag can take one of two values:
0: C-style numbering that starts from 0.
1: Fortran-style numbering that starts from 1.
ncon: This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
ncommonnodes: This parameter determines the degree of connectivity among the vertices in the dual graph. Specifically, an edge is placed between any two elements if and only if they share at least this many nodes. This value should be greater than 0, and for most meshes a value of two will create reasonable dual graphs. However, depending on the type of elements in the mesh, values greater than 2 may also be valid choices. For example, for meshes containing only triangular, tetrahedral, hexahedral, or rectangular elements, this parameter can be set to two, three, four, or two, respectively. Note that setting this parameter to a small value will increase the number of edges in the resulting dual graph and the corresponding partitioning time.
nparts: This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
tpwgts: An array of size ncon x nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint.
int ParMETIS_V3_AdaptiveRepart(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *vsize, idx_t *adjwgt, idx_t *wgtflag, idx_t *numflag, idx_t *ncon, idx_t *nparts, real_t *tpwgts, real_t *ubvec, real_t *itr, idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm)

Description
This routine is used to balance the work load of a graph that corresponds to an adaptively refined mesh.

Parameters
vtxdist: This array describes how the vertices of the graph are distributed among the processors. (See discussion in Section 4.2.1.) Its contents are identical for every processor.
xadj, adjncy: These store the local adjacency structure of the graph at each processor. (See discussion in Section 4.2.1.)
vwgt, adjwgt: These store the weights of the vertices and edges. (See discussion in Section 4.2.1.)
vsize: This array stores the size of the vertices with respect to redistribution costs. Hence, vertices associated with mesh elements that require a lot of memory will have larger corresponding entries in this array. Otherwise, this array is similar to the vwgt array. (See discussion in Section 4.2.1.)
wgtflag: This is used to indicate if the graph is weighted. wgtflag can take one of four values:
0: No weights (vwgt and adjwgt are both NULL).
1: Weights on the edges only (vwgt is NULL).
2: Weights on the vertices only (adjwgt is NULL).
3: Weights on both the vertices and edges.
numflag: This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and part arrays.
Here is the list of the major changes in version 4.0:
- Support for 64-bit architectures, by explicitly defining the width of the scalar integer data type (idx_t) used to store the adjacency structure of the graph.
- It is based on the 5.0 distribution of METIS, which itself contains many enhancements over the previous version.
- A complete re-write of its internal memory management, which resulted in lower memory requirements.
- Better quality partitionings for multi-constraint partitioning problems.

2.2 Changes between 3.2 and 3.1

The major change in version 3.2 is its better support for computing fill-reducing orderings of sparse matrices. Specifically, version 3.2 contains the following enhancements/additions:
- A new parallel separator refinement algorithm that leads to smaller separators and less fill-in.
- Parallel orderings can now be computed on non-power-of-two processors.
- It provides support for computing multiple separators at each level, both during the parallel and the serial phases. The smallest separator among these multiple runs is selected.
- There is a new API routine, ParMETIS_V32_NodeND, that exposes additional parameters to the user in order to better control various aspects of the algorithm. The old API routine, ParMETIS_V3_NodeND, is still valid and is mapped to the new ordering routine.

The end result of these enhancements is that the quality of the orderings computed by ParMETIS is now comparable to those computed by METIS's nested dissection routines.
However, ParMETIS extends the functionality provided by METIS and includes routines that are especially suited for parallel computations and large-scale numerical simulations. In particular, ParMETIS provides the following functionality:
- Partition unstructured graphs and meshes.
- Repartition graphs that correspond to adaptively refined meshes.
- Partition graphs for multi-phase and multi-physics simulations.
- Improve the quality of existing partitionings.
- Compute fill-reducing orderings for sparse direct factorization.
- Construct the dual graphs of meshes.

The rest of this manual is organized as follows. Section 2 briefly describes the differences between major versions of ParMETIS. Section 3 describes the various algorithms that are implemented in ParMETIS. Section 4.2 describes the format of the basic parameters that need to be supplied to the routines. Section 5 provides a detailed description of the calling sequences for the major routines in ParMETIS. Finally, Section 7 describes software and hardware requirements and provides contact information.

2 Changes Across Key Releases

2.1 Changes between 4.0 and 3.2

The 4.0 release of ParMETIS represents a major code refactoring to allow full support of 64-bit architectures. As part of that re-factoring, no additional capabilities have been added to the library. However, since the 4.0 release relies on the latest version of METIS, it allows for better support of multi-constraint partitioning.
ParMETIS
Parallel Graph Partitioning and Sparse Matrix Ordering Library
Version 4.0

George Karypis and Kirk Schloegel
University of Minnesota, Department of Computer Science and Engineering
Minneapolis, MN 55455
karypis@cs.umn.edu

March 30, 2013

ParMETIS is copyrighted by the Regents of the University of Minnesota.

Contents

1 Introduction
2 Changes Across Key Releases
  2.1 Changes between 4.0 and 3.2
  2.2 Changes between 3.2 and 3.1
  2.3 Changes between 3.0/3.1 and 2.0
3 Algorithms Used in ParMETIS
  3.1 Unstructured Graph Partitioning
  3.2 Partitioning Meshes Directly
  3.3 Partitioning Adaptively Refined Meshes
  3.4 Partition Refinement
  3.5 Partitioning for Multi-phase and Multi-physics Computations
  3.6 Partitioning for Heterogeneous Computing Architectures
  3.7 Computing Fill-Reducing Orderings
4 ParMETIS API
  4.1 Header files
  4.2 Input and Output Formats used by ParMETIS
    4.2.1 Format of the Input Graph
    4.2.2 Format of Vertex Coordinates
    4.2.3 Format of the Input Mesh
    4.2.4 Format of the Computed Partitionings and Orderings
  4.3 Numbering and Memory Allocation
5 Calling Sequence of the Routines in ParMETIS
  5.1 Graph Partitioning
    ParMETIS_V3_PartKway
ParMETIS_V3_PartKway is the routine in ParMETIS that is used to partition unstructured graphs. This routine takes a graph and computes a k-way partitioning (where k is equal to the number of sub-domains desired) while attempting to minimize the number of edges that are cut by the partitioning (i.e., the edge-cut). ParMETIS_V3_PartKway makes no assumptions on how the graph is initially distributed among the processors. It can effectively partition a graph that is randomly distributed as well as a graph that is well distributed. If the graph is initially well distributed among the processors, ParMETIS_V3_PartKway will take less time to run. However, the quality of the computed partitionings

(Footnote: The reader should note the difference between the terms graph distribution and graph partition. A partitioning is a mapping of the vertices to the processors that results in a distribution. In other words, a partitioning specifies a distribution.)

In order to partition a graph in parallel, an initial distribution of the nodes and edges of the graph among the processors is required. For example, consider a graph that corresponds to the dual of a finite-element mesh. This graph could initially be partitioned simply by mapping groups of n/p consecutively numbered elements to each processor, where n is the number of elements and p is the number of processors. Of course, this naive approach is not likely to result in a very good distribution, because elements that belong to a number of different regions of the mesh may get mapped to the same processor.
tpwgts: An array of size ncon x nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon x nparts elements should be set to a value of 1/nparts. If ncon is greater than 1, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one.
ubvec: An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
options: This is an array of integers that is used to pass parameters to the routine. Their meanings are identical to those of ParMETIS_V3_AdaptiveRepart.
edgecut: Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
part: This is an array of size equal to the number of locally stored vertices. Upon successful completion, the partition vector of the locally stored vertices is written to this array. (See discussion in Section 4.2.4.) If the number of processors is not equal to the number of sub-domains and/or options[3] is set to PARMETIS_PSR_UNCOUPLED, then the previously computed partitioning must be passed to the routine as a parameter via this array.
The routines ParMETIS_V3_PartKway, ParMETIS_V3_PartGeomKway, ParMETIS_V3_RefineKway, and ParMETIS_V3_AdaptiveRepart are all able to compute partitionings that satisfy multiple balance constraints.

Figure 3: A computational mesh for a particle-in-cells simulation (a), and a computational mesh for a contact/impact simulation (b). The particle-in-cells mesh is partitioned so that both the number of mesh elements and the number of particles are balanced across the sub-domains. Two partitionings are shown for the contact/impact mesh. The dashed partitioning balances only the number of mesh elements. The solid partitioning balances both the number of mesh elements and the number of surface (lightly shaded) elements across the sub-domains.

Figure 4: A dual graph with vertex weight vectors of size two is constructed from the particle-in-cells mesh from Figure 3. A multi-constraint partitioning has been computed for this graph, and this partitioning has been projected back to the mesh.

Figure 4 gives the dual graph for the particle-in-cells mesh shown in Figure 3. Each vertex has two weights here. The first represents the work associated with the mesh-based computation for the corresponding element. These are all ones, because we assume in this case that all of the elements have the same amount of mesh-based work associated with them.
These separators are computed during the remaining levels of the nested dissection, when the matrix has been divided among the processors and each processor proceeds independently to order its portion of the matrix. The bisections that achieve the smallest separator are selected. The default value is 1 when NULL is supplied, but values greater than 1 can lead to better quality orderings. However, this is a time/quality trade-off.
ubfrac: This value indicates how unbalanced the two partitions are allowed to get during each bisection level. The default value (when NULL is supplied) is 1.05, but higher values (typical range: 1.05-1.25) can lead to smaller separators.
seed: This is the seed for the random number generator. When NULL is supplied, a default seed is used.
dbglvl: This specifies the level of information to be returned during the execution of the algorithm. This is identical to the options[2] parameter of the other routines. When NULL is supplied, a value of 0 is used.
order: This array returns the result of the ordering (described in Section 4.2.4).
sizes: This array returns the number of nodes for each sub-domain and each separator (described in Section 4.2.4).
comm: This is a pointer to the MPI communicator of the processes that call ParMETIS. For most programs, this will point to MPI_COMM_WORLD.

Returns
METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.
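As a usage illustration (a sketch, not from the manual, assuming the version 4.0 calling sequence): passing NULL for vwgt and for all of the tuning parameters selects the defaults described above. The names nvtxs_local (number of locally stored vertices) and p (number of processors) are assumptions.

    #include <stdlib.h>
    #include <parmetis.h>

    idx_t numflag = 0;
    idx_t *order = malloc(nvtxs_local * sizeof(idx_t));
    idx_t *sizes = malloc(2 * p * sizeof(idx_t));  /* sub-domain and separator sizes (Section 4.2.4) */

    ParMETIS_V32_NodeND(vtxdist, xadj, adjncy,
                        NULL,           /* vwgt: unweighted vertices */
                        &numflag,
                        NULL, NULL,     /* mtype, rtype: defaults */
                        NULL, NULL,     /* p_nseps, s_nseps: defaults */
                        NULL,           /* ubfrac: default 1.05 */
                        NULL, NULL,     /* seed, dbglvl: defaults */
                        order, sizes, &comm);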
The possible values are:
0: C-style numbering is assumed that starts from 0.
1: Fortran-style numbering is assumed that starts from 1.
mtype: This is used to indicate the scheme to be used for computing the matching. The possible values (defined in parmetis.h) are:
PARMETIS_MTYPE_LOCAL: A local matching scheme is used, in which each pair of matched vertices reside on the same processor.
PARMETIS_MTYPE_GLOBAL: A global matching scheme is used, in which the pairs of matched vertices can reside on different processors. This is the default value if a NULL value is passed.
rtype: This is used to indicate the separator refinement scheme that will be used. The possible values (defined in parmetis.h) are:
PARMETIS_SRTYPE_GREEDY: Uses a simple greedy refinement algorithm.
PARMETIS_SRTYPE_2PHASE: Uses a higher quality refinement algorithm, which is somewhat slower. This is the default value if a NULL value is passed.
p_nseps: Specifies the number of different separators that will be computed during each bisection at the first log p levels of the nested dissection (these are computed in parallel among the processors). The bisection that achieves the smallest separator is selected. The default value is 1 when NULL is supplied, but values greater than 1 can lead to better quality orderings. However, this is a time/quality trade-off.
s_nseps: Specifies the number of different separators that will be computed during each of the remaining levels of the nested dissection.
The distributed CSR format is illustrated by an example on a three-processor system. The 15-vertex graph in Figure 5(a) is distributed among the processors so that each processor gets 5 vertices and their corresponding adjacency lists. That is, Processor Zero gets vertices 0 through 4, Processor One gets vertices 5 through 9, and Processor Two gets vertices 10 through 14. This figure shows the xadj, adjncy, and vtxdist arrays for each processor.

(Footnote: Multi-constraint and multi-objective graph partitioning formulations [6, 13] can get around this requirement for some applications. These also allow the computation of partitionings for bipartite graphs, as well as for graphs corresponding to non-square and non-symmetric matrices.)

(a) A sample graph: a 3 x 5 grid of 15 vertices, numbered row by row from 0 to 14.

Description of the graph on a serial computer (serial METIS):
xadj:   0 2 5 8 11 13 16 20 24 28 31 33 36 39 42 44
adjncy: 1 5 0 2 6 1 3 7 2 4 8 3 9 0 6 10 1 5 7 11 2 6 8 12 3 7 9 13 4 8 14 5 11 6 10 12 7 11 13 8 12 14 9 13
(b) Serial CSR format.

Description of the graph on a parallel computer with 3 processors (ParMETIS):
Processor 0:
xadj:    0 2 5 8 11 13
adjncy:  1 5 0 2 6 1 3 7 2 4 8 3 9
vtxdist: 0 5 10 15
Processor 1:
xadj:    0 3 7 11 15 18
adjncy:  0 6 10 1 5 7 11 2 6 8 12 3 7 9 13 4 8 14
vtxdist: 0 5 10 15
Processor 2:
xadj:    0 2 5 8 11 13
adjncy:  5 11 6 10 12 7 11 13 8 12 14 9 13
vtxdist: 0 5 10 15
(c) Distributed CSR format.

Figure 5: An example of the parameters passed to ParMETIS in a three-processor case. The arrays vwgt and adjwgt are assumed to be NULL.
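Written as C initializers, processor 0's portion of Figure 5(c) would look as follows (a sketch; the other processors are analogous):

    /* Processor 0's local view of the graph in Figure 5 */
    idx_t vtxdist[] = {0, 5, 10, 15};       /* identical on all three processors */
    idx_t xadj[]    = {0, 2, 5, 8, 11, 13}; /* local xadj always starts at 0 */
    idx_t adjncy[]  = {1, 5,                /* neighbors of vertex 0 */
                       0, 2, 6,             /* neighbors of vertex 1 */
                       1, 3, 7,             /* neighbors of vertex 2 */
                       2, 4, 8,             /* neighbors of vertex 3 */
                       3, 9};               /* neighbors of vertex 4 */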
vwgt, adjwgt: These store the weights of the vertices and edges. (See discussion in Section 4.2.1.)
wgtflag: This is used to indicate if the graph is weighted. wgtflag can take one of four values:
0: No weights (vwgt and adjwgt are both NULL).
1: Weights on the edges only (vwgt is NULL).
2: Weights on the vertices only (adjwgt is NULL).
3: Weights on both the vertices and edges.
numflag: This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and part arrays. numflag can take one of two values:
0: C-style numbering that starts from 0.
1: Fortran-style numbering that starts from 1.
ncon: This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
nparts: This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
tpwgts: An array of size ncon x nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon x nparts elements should be set to a value of 1/nparts. If ncon is greater than 1, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one.
ubvec: An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
numflag can take the following two values:
0: C-style numbering is assumed that starts from 0.
1: Fortran-style numbering is assumed that starts from 1.
ncon: This is used to specify the number of weights that each vertex has. It is also the number of balance constraints that must be satisfied.
nparts: This is used to specify the number of sub-domains that are desired. Note that the number of sub-domains is independent of the number of processors that call this routine.
tpwgts: An array of size ncon x nparts that is used to specify the fraction of vertex weight that should be distributed to each sub-domain for each balance constraint. If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon x nparts elements should be set to a value of 1/nparts. If ncon is greater than one, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one.
ubvec: An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
itr: This parameter describes the ratio of inter-processor communication time compared to data redistribution time. It should be set between 0.000001 and 1000000.0.
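Since the ITR description in this section suggests deriving the value from measured times, the following one-line sketch (with hypothetical variables comm_time and redist_time measured by the application) shows one way to compute and clamp it:

    /* Sketch: ITR as the ratio of inter-processor communication time to
       data redistribution time, clamped to the documented range. */
    real_t itr = (real_t)(comm_time / redist_time);
    if (itr < 0.000001)  itr = 0.000001;
    if (itr > 1000000.0) itr = 1000000.0;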
That is, each processor may get a number of small sub-domains, as opposed to a single contiguous sub-domain. Hence, you would want to compute a new high-quality partitioning for the graph and then redistribute the mesh accordingly. Note that it may also be the case that the initial graph is well distributed, as when meshes are adaptively refined and repartitioned.

Figure 2: The three phases of multilevel k-way graph partitioning (coarsening phase, initial partitioning phase, uncoarsening phase). During the coarsening phase, the size of the graph is successively decreased. During the initial partitioning phase, a k-way partitioning is computed. During the multilevel refinement (or uncoarsening) phase, the partitioning is successively refined as it is projected to the larger graphs. G0 is the input graph, which is the finest graph. Gi+1 is the next-level coarser graph of Gi. G4 is the coarsest graph.

The quality of the computed partitionings, however, does not depend on the initial distribution. The parallel graph partitioning algorithm used in ParMETIS_V3_PartKway is based on the serial multilevel k-way partitioning algorithm described in [6, 7] and parallelized in [4, 14]. This algorithm has been shown to quickly produce partitionings that are of very high quality. It consists of three phases: graph coarsening, initial partitioning, and uncoarsening/refinement. In the graph coarsening phase, a series of successively coarser graphs is constructed.
Substantially better performance will be achieved if the vertices are distributed so that each processor gets an equal number of vertices.

2. The routines must be called by at least two processors. That is, ParMETIS cannot be used on a single processor. If you need to partition on a single processor, use METIS.

3. The partitioning routines in ParMETIS switch to a purely serial implementation (via a call to the corresponding METIS routine) when the following conditions are met: (i) the graph/matrix contains less than 10000 vertices, (ii) the graph contains no edges, and (iii) the number of vertices in the graph is less than 20 x p, where p is the number of processors.

7 Hardware & Software Requirements, and Contact Information

ParMETIS is written in ANSI C and uses MPI for inter-processor communication. Instructions on how to build ParMETIS are available in the Install.txt file. In the directory called Graphs, you will find some graphs that can be used to test ParMETIS using the testing programs that are built with ParMETIS.

In order to use ParMETIS in your application, you need to have a copy of the serial METIS library and link your program with both libraries (i.e., libparmetis.a and libmetis.a). Note that the ParMETIS package already contains the source code for the METIS library. The included build system automatically constructs both libraries.

ParMETIS has been extensively tested on a number of different parallel computers.
Alternatively, the graph can correspond to the dual of the finite-element mesh. In this case, each vertex corresponds to an element, and two vertices are connected via an edge if the corresponding elements share an edge (in 2D) or a face (in 3D). Also, the graph can be similar to the dual, but be more or less connected. That is, instead of limiting edges to those elements that share a face, edges can connect any two elements that share even a single node. However the graph is constructed, it is usually undirected. That is, for every pair of connected vertices v and u, it contains both edges (v, u) and (u, v).

In ParMETIS, the structure of the graph is represented by the compressed storage format (CSR), extended for the context of parallel distributed-memory computing. We will first describe the CSR format for serial graphs and then describe how it has been extended for storing graphs that are distributed among processors.

Serial CSR Format. The CSR format is a widely used scheme for storing sparse graphs. Here, the adjacency structure of a graph is represented by two arrays, xadj and adjncy. Weights on the vertices and edges (if any) are represented by using two additional arrays, vwgt and adjwgt. For example, consider a graph with n vertices and m edges. In the CSR format, this graph can be described using arrays of the following sizes: xadj[n+1], vwgt[n], adjncy[2m], and adjwgt[2m]. Note that the reason both adjncy and adjwgt are of size 2m is because every edge is listed twice (i.e., as (v, u) and as (u, v)).
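For example, a 3-vertex triangle graph (n = 3, m = 3, with edges 0-1, 1-2, and 2-0) would be stored in serial CSR as follows (a sketch, using the idx_t type from parmetis.h):

    idx_t xadj[]   = {0, 2, 4, 6};        /* size n+1; list i spans xadj[i]..xadj[i+1]-1 */
    idx_t adjncy[] = {1, 2, 0, 2, 0, 1};  /* size 2m; each edge appears twice */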
An initial partitioning need not be computed, as one can either be derived from the initial graph distribution (in the case when sub-domains are coupled to processors), or else one needs to be supplied as an input to the routine (in the case when sub-domains are de-coupled from processors). However, this partitioning does need to be balanced. The balancing phase is performed on the coarsest graph twice, by alternative methods. That is, optimized variants of remapping and diffusion algorithms [16] are both used to compute new partitionings. A quality metric for each of these partitionings is then computed (using the ITR Factor), and the partitioning with the highest quality is selected. This technique tends to give very good points from which to start multilevel refinement, regardless of the type of repartitioning problem or the value of the ITR Factor. Note that the fact that the algorithm computes two initial partitionings does not impact its scalability, as long as the size of the coarsest graph is suitably small. Finally, multilevel refinement is performed on the balanced partitioning in order to further improve its quality. Since ParMETIS_V3_AdaptiveRepart starts from a graph that is already well distributed, it is extremely fast.

Appropriate values to pass for the ITR Factor parameter can easily be determined depending on the times required to perform (i) all inter-processor communications that have occurred since the last repartitioning, and (ii) the data redistribution.
ParMETIS requires that the same numbering scheme be used consistently for all the arrays passed to it, and it writes to the part and order arrays similarly.

ParMETIS allocates all the memory that it requires dynamically. This has the advantage that the user does not have to provide workspace. However, if there is not enough memory on the machine, the routines in ParMETIS will abort. Note that the routines in ParMETIS do not modify the arrays that store the graph (e.g., xadj and adjncy). They only modify the part and order arrays.

5 Calling Sequence of the Routines in ParMETIS

The calling sequences of the ParMETIS routines are described in this section.

5.1 Graph Partitioning

int ParMETIS_V3_PartKway(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *adjwgt, idx_t *wgtflag, idx_t *numflag, idx_t *ncon, idx_t *nparts, real_t *tpwgts, real_t *ubvec, idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm)

Description
This routine is used to compute a k-way partitioning of a graph on p processors using the multilevel k-way multi-constraint partitioning algorithm.

Parameters
vtxdist: This array describes how the vertices of the graph are distributed among the processors. (See discussion in Section 4.2.1.) Its contents are identical for every processor.
xadj, adjncy: These store the local adjacency structure of the graph at each processor. (See discussion in Section 4.2.1.)
vwgt, adjwgt: These store the weights of the vertices and edges. (See discussion in Section 4.2.1.)
If all of the sub-domains are to be of the same size for every vertex weight, then each of the ncon x nparts elements should be set to a value of 1/nparts. If ncon is greater than 1, the target sub-domain weights for each sub-domain are stored contiguously (similar to the vwgt array). Note that the sum of all of the tpwgts for a given vertex weight should be one.
ubvec: An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
options: This is an array of integers that is used to pass parameters to the routine. Their meanings are identical to those of ParMETIS_V3_PartKway.
edgecut: Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
part: This is an array of size equal to the number of locally stored vertices. Upon successful completion, the partition vector of the locally stored vertices is written to this array. (See discussion in Section 4.2.4.)
comm: This is a pointer to the MPI communicator of the processes that call ParMETIS. For most programs, this will point to MPI_COMM_WORLD.

Returns
METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.

5.2 Graph Repartitioning

int ParMETIS_V3_AdaptiveRepart
However, if sub-domains are un-coupled from processors, then the initial partitioning needs to be obtained from the initial values assigned to the part array. A value of PARMETIS_PSR_COUPLED indicates that sub-domains and processors are coupled, and a value of PARMETIS_PSR_UNCOUPLED indicates that these are de-coupled. The default value is PARMETIS_PSR_COUPLED if nparts equals the number of processors, and PARMETIS_PSR_UNCOUPLED otherwise. These constants are defined in parmetis.h.
edgecut: Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
part: This is an array of size equal to the number of locally stored vertices. Upon successful completion, the partition vector of the locally stored vertices is written to this array. (See discussion in Section 4.2.4.) If the number of processors is not equal to the number of sub-domains and/or options[3] is set to PARMETIS_PSR_UNCOUPLED, then the previously computed partitioning must be passed to the routine as a parameter via this array.
comm: This is a pointer to the MPI communicator of the processes that call ParMETIS. For most programs, this will point to MPI_COMM_WORLD.

Returns
METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.

5.3 Partitioning Refinement

int ParMETIS_V3_RefineKway(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *adjwgt,
The vtxdist array will always be identical for every processor.

When multiple vertex weights are used (for multi-constraint partitioning), the c vertex weights for each vertex are stored contiguously in the vwgt array. In this case, the vwgt array is of size n x c, where n is the number of locally stored vertices and c is the number of vertex weights (and also the number of balance constraints).

4.2.2 Format of Vertex Coordinates

As discussed in Section 3.1, ParMETIS provides routines that use the coordinate information of the vertices to quickly pre-distribute the graph, and so speed up the execution of the parallel k-way partitioning. These coordinates are specified in an array called xyz of type real_t. If d is the number of dimensions of the mesh (i.e., d = 2 for 2D meshes or d = 3 for 3D meshes), then each processor requires an array of size d x n, where n is the number of locally stored vertices. (Note that the number of dimensions of the mesh, d, is required as a parameter to the routine.) In this array, the coordinates of vertex i are stored starting at location xyz[i*d] up to (but not including) location xyz[i*d + d]. For example, if d = 3, then the x, y, and z coordinates of vertex i are stored at xyz[3*i], xyz[3*i + 1], and xyz[3*i + 2], respectively.

4.2.3 Format of the Input Mesh

The routine ParMETIS_V3_PartMeshKway takes a distributed mesh and computes its partitioning, while ParMETIS_V3_Mesh2Dual takes a distributed mesh and constructs a distributed dual graph.
and ParMETIS_V3_AdaptiveRepart to compute efficient partitionings for networks of heterogeneous processors. To do so, these routines require an additional array, tpwgts, to be passed as a parameter. This array describes the fraction of the total vertex weight each sub-domain should contain. For example, if you have a network of four processors, the first three of which are of equal processing speed and the fourth of which is twice as fast as the others, the user would pass an array containing the values 0.2, 0.2, 0.2, 0.4. Note that by allowing users to specify target sub-domain weights as such, heterogeneous processing power can be taken into account when computing a partitioning. However, this does not allow us to take heterogeneous network bandwidths and latencies into account. Optimizing partitionings for heterogeneous networks is still the focus of ongoing research.

3.7 Computing Fill-Reducing Orderings

ParMETIS_V3_NodeND and ParMETIS_V32_NodeND are the routines provided by ParMETIS for computing fill-reducing orderings, suited for Cholesky-based direct factorization algorithms. Note that ParMETIS_V3_NodeND is simply a wrapper around the more general ParMETIS_V32_NodeND routine and is included for backward compatibility. ParMETIS_V32_NodeND makes no assumptions on how the graph is initially distributed among the processors. It can effectively compute fill-reducing orderings for graphs that are randomly distributed as well as graphs that are well distributed.
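Returning to the four-processor example in Section 3.6 above, the corresponding tpwgts array (with ncon = 1) would simply be the following (a sketch):

    /* The fourth processor is twice as fast, so it gets twice the vertex weight. */
    real_t tpwgts[4] = {0.2, 0.2, 0.2, 0.4};   /* fractions sum to 1.0 */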
Note that setting this parameter to a small value will increase the number of edges in the resulting dual graph and the corresponding partitioning time.
xadj, adjncy: Upon the successful completion of the routine, pointers to the constructed xadj and adjncy arrays will be written to these parameters. (See discussion in Section 4.2.1.) The calling program is responsible for freeing this memory by calling the METIS_Free routine described in the METIS manual.
comm: This is a pointer to the MPI communicator of the processes that call ParMETIS. For most programs, this will point to MPI_COMM_WORLD.

Returns
METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.

Note
This routine can be used in conjunction with ParMETIS_V3_PartKway, ParMETIS_V3_PartGeomKway, or ParMETIS_V3_AdaptiveRepart. It typically runs in half the time required by ParMETIS_V3_PartKway.

6 Restrictions & Limitations

The following is a list of restrictions and limitations imposed by the current release of ParMETIS. Note that these restrictions are on top of any other restrictions described with each API function.

1. The graph must be initially distributed among the processors such that each processor has at least one vertex.
Also note that in the case in which the graph is unweighted (i.e., all vertices and/or edges have the same weight), then either or both of the arrays vwgt and adjwgt can be set to NULL. ParMETIS_V3_AdaptiveRepart additionally requires a vsize array. This array is similar to the vwgt array, except that instead of describing the amount of work that is associated with each vertex, it describes the amount of memory that is associated with each vertex.

The adjacency structure of the graph is stored as follows. Assuming that vertex numbering starts from 0 (C style), the adjacency list of vertex i is stored in the array adjncy starting at index xadj[i] and ending at (but not including) index xadj[i+1] (in other words, adjncy[xadj[i]] up through and including adjncy[xadj[i+1]-1]). Hence, the adjacency lists for each vertex are stored consecutively in the array adjncy. The array xadj is used to point to where the list for each specific vertex begins and ends. Figure 5(b) illustrates the CSR format for the 15-vertex graph shown in Figure 5(a). If the graph has weights on the vertices, then vwgt[i] is used to store the weight of vertex i. Similarly, if the graph has weights on the edges, then the weight of edge adjncy[j] is stored in adjwgt[j]. This is the same format that is used by the serial METIS library routines.

Distributed CSR Format. ParMETIS uses an extension of the CSR format that allows the vertices of the graph and their adjacency lists to be distributed among the processors.
The computational requirements of techniques relying on globally refined meshes become very high, especially as the complexity and size of the problems increase. By locally refining and de-refining the mesh, either to capture flow-field phenomena of interest [1] or to account for variations in errors [11], adaptive methods make standard computational methods more cost-effective. The efficient execution of such adaptive scientific simulations on parallel computers requires a periodic repartitioning of the underlying computational mesh. These repartitionings should minimize both the inter-processor communications incurred in the iterative mesh-based computation and the data redistribution costs required to balance the load. Hence, adaptive repartitioning is a multi-objective optimization problem. ParMETIS provides the routine ParMETIS_V3_AdaptiveRepart for repartitioning such adaptively refined meshes. This routine assumes that the mesh is well distributed among the processors, but that (due to mesh refinement and de-refinement) this distribution is poorly load balanced.

Repartitioning algorithms fall into two general categories. The first category balances the computation by incrementally diffusing load from those sub-domains that have more work to adjacent sub-domains that have less work. These schemes are referred to as diffusive schemes. The second category balances the load by computing an entirely new partitioning, and then intelligently mapping the sub-domains of the new partitioning to the processors such that the redistribution cost is minimized.
part: This is an array of size equal to the number of locally stored vertices. Upon successful completion, the partition vector of the locally stored vertices is written to this array. (See discussion in Section 4.2.4.)
comm: This is a pointer to the MPI communicator of the processes that call ParMETIS. For most programs, this will point to MPI_COMM_WORLD.

Returns
METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.

Note
The quality of the partitionings computed by ParMETIS_V3_PartGeomKway is comparable to that of the partitionings produced by ParMETIS_V3_PartKway. However, the routine may run up to twice as fast.

int ParMETIS_V3_PartGeom(idx_t *vtxdist, idx_t *ndims, real_t *xyz, idx_t *part, MPI_Comm *comm)

Description
This routine is used to compute a p-way partitioning of a graph on p processors using a coordinate-based space-filling curves method.

Parameters
vtxdist: This array describes how the vertices of the graph are distributed among the processors. (See discussion in Section 4.2.1.) Its contents are identical for every processor.
ndims: The number of dimensions of the space in which the graph is embedded.
xyz: The array storing the coordinates of the vertices (described in Section 4.2.2).
part: This is an array of size equal to the number of locally stored vertices. Upon successful completion, the partition vector of the locally stored vertices is written to this array.
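A minimal hedged sketch of a call (not from the manual): vtxdist and xyz are assumed to be filled as described in Sections 4.2.1 and 4.2.2, and nvtxs_local is a hypothetical name for the local vertex count.

    #include <stdlib.h>
    #include <parmetis.h>

    idx_t ndims = 3;                                   /* e.g., a 3D mesh */
    idx_t *part = malloc(nvtxs_local * sizeof(idx_t));

    ParMETIS_V3_PartGeom(vtxdist, &ndims, xyz, part, &comm);
    /* part[i] now holds the sub-domain of local vertex i */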
ubvec: An array of size ncon that is used to specify the imbalance tolerance for each vertex weight, with 1 being perfect balance and nparts being perfect imbalance. A value of 1.05 for each of the ncon weights is recommended.
options: This is an array of integers that is used to pass additional parameters to the routine. The first element (i.e., options[0]) can take either the value of 0 or 1. If it is 0, then the default values are used; otherwise the remaining two elements of options are interpreted as follows:
options[1]: This specifies the level of information to be returned during the execution of the algorithm. Timing information can be obtained by setting this to 1. Additional options for this parameter can be obtained by looking at parmetis.h. The numerical values there should be added to obtain the correct value. The default value is 0.
options[2]: This is the random number seed for the routine.
edgecut: Upon successful completion, the number of edges that are cut by the partitioning is written to this parameter.
part: This is an array of size equal to the number of locally stored vertices. Upon successful completion, the partition vector of the locally stored vertices is written to this array. (See discussion in Section 4.2.4.)
comm: This is a pointer to the MPI communicator of the processes that call ParMETIS. For most programs, this will point to MPI_COMM_WORLD.

Returns
METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.
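Putting the pieces together, here is a hedged end-to-end sketch of calling ParMETIS_V3_PartKway on an unweighted, pre-distributed graph. It is illustrative only; the helper name partition_graph and the local vertex count nvtxs_local are assumptions.

    #include <stdlib.h>
    #include <mpi.h>
    #include <parmetis.h>

    /* Sketch: one balance constraint, no weights, default options. */
    idx_t *partition_graph(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy,
                           idx_t nvtxs_local, idx_t nparts, MPI_Comm comm)
    {
        idx_t wgtflag = 0, numflag = 0, ncon = 1, edgecut;
        idx_t options[3] = {0, 0, 0};            /* options[0] = 0: defaults */
        real_t ubvec[1] = {1.05};                /* recommended tolerance */
        idx_t *part = malloc(nvtxs_local * sizeof(idx_t));
        real_t *tpwgts = malloc(ncon * nparts * sizeof(real_t));

        for (idx_t i = 0; i < ncon * nparts; i++)
            tpwgts[i] = 1.0 / (real_t)nparts;    /* equal-sized sub-domains */

        if (ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, NULL, NULL,
                                 &wgtflag, &numflag, &ncon, &nparts,
                                 tpwgts, ubvec, options, &edgecut,
                                 part, &comm) != METIS_OK) {
            /* handle the error as appropriate for the application */
        }
        free(tpwgts);
        return part;   /* part[i] is the sub-domain of local vertex i */
    }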
[4] G. Karypis and V. Kumar. A coarse-grain parallel multilevel k-way partitioning algorithm. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, 1997.
[5] G. Karypis and V. Kumar. METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0. Technical report, Univ. of MN, Dept. of Computer Sci. and Engr., 1998.
[6] G. Karypis and V. Kumar. Multilevel algorithms for multi-constraint graph partitioning. In Proc. Supercomputing '98, 1998.
[7] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1), 1998.
[8] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Review, 41(2):278-300, 1999.
[9] B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291-307, 1970.
[10] L. Oliker and R. Biswas. PLUM: Parallel load balancing for adaptive unstructured meshes. Journal of Parallel and Distributed Computing, 52(2):150-177, 1998.
[11] A. Patra and D. Kim. Efficient mesh partitioning for adaptive hp finite element meshes. Technical report
…and their adjacency lists to be distributed among the processors. In particular, PARMETIS assumes that each processor Pi stores ni consecutive vertices of the graph and the corresponding mi edges, so that n = Σi ni and 2m = Σi mi. Each processor stores its local part of the graph in the four arrays xadj[ni+1], vwgt[ni], adjncy[mi], and adjwgt[mi], using the CSR storage scheme. Again, if the graph is unweighted, the arrays vwgt and adjwgt can be set to NULL.

The straightforward way to distribute the graph for PARMETIS is to take n/p consecutive adjacency lists from adjncy and store them on consecutive processors, where p is the number of processors. In addition, each processor needs its local xadj array to point to where each of its local vertices' adjacency lists begin and end. Thus, if we take all the local adjncy arrays and concatenate them, we will get exactly the same adjncy array that is used in the serial CSR format. However, concatenating the local xadj arrays will not give us the serial xadj array. This is because the entries in each local xadj must point into the local adjncy array, and so xadj[0] is zero for all processors.

In addition to these four arrays, each processor also requires the array vtxdist[p+1], which indicates the range of vertices that are local to each processor. In particular, processor Pi stores the vertices from vtxdist[i] up to (but not including) vertex vtxdist[i+1]. Figure 5(c) illustrates the distributed CSR format.
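As a concrete (hypothetical, not from the manual) illustration of this format, consider the 5-vertex chain 0-1-2-3-4 distributed across p = 2 processors. Concatenating the two adjncy arrays reproduces the serial adjncy array, while each local xadj restarts at zero:

    /* Worked example of the distributed CSR format: the unweighted
     * 5-vertex chain 0-1-2-3-4 on p = 2 processors (vwgt and adjwgt
     * would simply be NULL). */
    #include <parmetis.h>

    /* Identical on both processors: processor 0 owns vertices 0..2,
       processor 1 owns vertices 3..4. */
    idx_t vtxdist[] = { 0, 3, 5 };

    /* Processor 0: 3 local vertices, 5 local edge endpoints. */
    idx_t xadj0[]   = { 0, 1, 3, 5 };       /* local pointers start at 0 */
    idx_t adjncy0[] = { 1,  0, 2,  1, 3 };  /* global vertex numbers     */

    /* Processor 1: 2 local vertices, 3 local edge endpoints. */
    idx_t xadj1[]   = { 0, 2, 3 };
    idx_t adjncy1[] = { 2, 4,  3 };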
…partitioning unstructured graphs when coordinates for the vertices are available. ParMETIS_V3_PartGeom computes a partitioning based only on the space-filling curve method. Therefore, it is extremely fast (often 5 to 10 times faster than ParMETIS_V3_PartGeomKway), but it computes poor-quality partitionings (it may cut 2 to 10 times more edges than ParMETIS_V3_PartGeomKway). This routine can be useful for certain computations in which the use of space-filling curves is the appropriate partitioning technique (e.g., n-body computations).

3.2 Partitioning Meshes Directly

PARMETIS also provides routines that support the computation of partitionings and repartitionings given meshes (and not graphs) as inputs. In particular, ParMETIS_V3_PartMeshKway takes a mesh as input and computes a partitioning of the mesh elements. Internally, ParMETIS_V3_PartMeshKway uses a mesh-to-graph routine and then calls the same core partitioning routine that is used by ParMETIS_V3_PartKway.

PARMETIS provides no such routines for computing adaptive repartitionings directly from meshes. However, it does provide the routine ParMETIS_V3_Mesh2Dual for constructing a dual graph from a mesh, quickly and in parallel. Since the construction of the dual graph is performed in parallel, it can be used to construct the input graph for ParMETIS_V3_AdaptiveRepart.

3.3 Partitioning Adaptively Refined Meshes

For large-scale scientific simulations, the computational requirements of techniques relying on global…
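Returning to the mesh-handling routines of Section 3.2, a hedged sketch of the Mesh2Dual pipeline follows. The distributed mesh arrays (elmdist, eptr, eind) and the communicator comm are assumed to exist, and the ncommonnodes value shown is only an example:

    /* Sketch: build the dual graph of a distributed mesh in parallel,
     * then hand it to the adaptive repartitioner.  ParMETIS allocates
     * xadj and adjncy itself; they are released with METIS_Free(). */
    idx_t *xadj   = NULL;
    idx_t *adjncy = NULL;
    idx_t numflag = 0;          /* C-style numbering                       */
    idx_t ncommonnodes = 3;     /* e.g., tetrahedra -> face adjacency      */

    ParMETIS_V3_Mesh2Dual(elmdist, eptr, eind, &numflag, &ncommonnodes,
                          &xadj, &adjncy, &comm);

    /* ... the (xadj, adjncy) dual graph can now be passed, together with
       vertex weights reflecting the adapted element counts, to
       ParMETIS_V3_AdaptiveRepart ... */

    METIS_Free(xadj);
    METIS_Free(adjncy);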
…program using the PARMETIS API needs to include the parmetis.h header file. This file provides function prototypes for the various API routines and defines the various data types and constants used by these routines. At PARMETIS installation time, the metis/include/metis.h file defines two important data types and their widths: the idx_t data type for storing integer quantities, and the real_t data type for storing floating-point quantities. The idx_t data type can be defined to be either a 32- or a 64-bit signed integer, whereas the real_t data type can be defined to be either a single- or a double-precision floating-point number. All of the PARMETIS API routines take as input arrays and/or scalars of these two data types.

4.2 Input and Output Formats Used by PARMETIS

4.2.1 Format of the Input Graph

All of the graph routines in PARMETIS take as input the adjacency structure of the graph, the weights of the vertices and edges (if any), and an array describing how the graph is distributed among the processors. Note that, depending on the application, this graph can represent different things. For example, when PARMETIS is used to compute fill-reducing orderings, the graph corresponds to the non-zero structure of the matrix (excluding the diagonal entries). In the case of finite element computations, the vertices of the graph can correspond to nodes (points) in the mesh, while edges represent the connections between these nodes. Alternatively, the…
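For instance, the following small self-contained check (illustrative, not part of the manual) prints the widths with which idx_t and real_t were configured at installation time:

    /* Print the configured widths of idx_t and real_t. */
    #include <mpi.h>
    #include <parmetis.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("idx_t: %zu bits, real_t: %zu bits\n",
                   8 * sizeof(idx_t), 8 * sizeof(real_t));
        MPI_Finalize();
        return 0;
    }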
…graphs is constructed by collapsing together adjacent vertices of the input graph in order to form a related coarser graph. Computation of the initial partitioning is performed on the coarsest (and hence smallest) of these graphs, and so is very fast. Finally, partition refinement is performed on each level graph, from the coarsest to the finest (i.e., the original graph), using a KL/FM-type refinement algorithm [2, 9]. Figure 2 illustrates the multilevel graph partitioning paradigm.

PARMETIS provides the ParMETIS_V3_PartGeomKway routine for computing partitionings of graphs derived from finite element meshes in which the vertices have coordinates associated with them. Given a graph that is distributed among the processors and the coordinates of the vertices, ParMETIS_V3_PartGeomKway quickly computes an initial partitioning using a space-filling curve method, redistributes the graph according to this partitioning, and then calls ParMETIS_V3_PartKway to compute the final high-quality partitioning. Our experiments have shown that ParMETIS_V3_PartGeomKway is often two times faster than ParMETIS_V3_PartKway, and achieves identical partition quality. Note that, depending on how the graph is constructed from the underlying mesh, the coordinates can correspond either to the actual node coordinates of the mesh (nodal graphs) or to the coordinates of the element centers (dual graphs).

PARMETIS also provides the ParMETIS_V3_PartGeom function for partitioning…
…report, Dept. of Mech. Engr., SUNY at Buffalo, 1999.
[12] A. Pothen, H. Simon, L. Wang, and S. Bernard. Towards a fast implementation of spectral nested dissection. In Supercomputing '92 Proceedings, pages 42-51, 1992.
[13] K. Schloegel, G. Karypis, and V. Kumar. A new algorithm for multi-objective graph partitioning. In Proc. EuroPar '99, pages 322-331, 1999.
[14] K. Schloegel, G. Karypis, and V. Kumar. Parallel multilevel algorithms for multi-constraint graph partitioning. In Proc. EuroPar 2000, 2000. Accepted as a Distinguished Paper.
[15] K. Schloegel, G. Karypis, and V. Kumar. A unified algorithm for load balancing adaptive scientific simulations. In Proc. Supercomputing 2000, 2000.
[16] K. Schloegel, G. Karypis, and V. Kumar. Wavefront diffusion and LMSR: Algorithms for dynamic repartitioning of adaptive meshes. IEEE Transactions on Parallel and Distributed Systems, 12(5):451-466, 2001.
[17] J. Watts, M. Rieffel, and S. Taylor. A load balancing technique for multi-phase computations. Proc. of High Performance Computing '97, pages 15-20, 1997.
…partitioning a finite element mesh and for constructing the dual graph of a mesh in parallel. In version 3.1, these routines have been extended to support mixed-element meshes.

3 Algorithms Used in PARMETIS

PARMETIS provides a variety of routines that can be used to compute different types of partitionings and repartitionings, as well as fill-reducing orderings. Figure 1 provides an overview of the functionality provided by PARMETIS, as well as a guide to its use.

[Figure 1 graphic: a decision chart mapping each task to the corresponding routine. Partition a graph: with vertex coordinates and high-quality requirements, ParMETIS_V3_PartGeomKway; with coordinates and low-quality (fast) requirements, ParMETIS_V3_PartGeom; without coordinates, ParMETIS_V3_PartKway. Partition a mesh: ParMETIS_V3_PartMeshKway. Construct a graph from a mesh: ParMETIS_V3_Mesh2Dual. Repartition a graph corresponding to an adaptively refined mesh: ParMETIS_V3_AdaptiveRepart. Refine the quality of a partitioning: ParMETIS_V3_RefineKway. Compute a fill-reducing ordering: ParMETIS_V3_NodeND / ParMETIS_V32_NodeND.]

Figure 1: A brief overview of the functionality provided by PARMETIS. The shaded boxes correspond to the actual routines in PARMETIS that implement each particular operation.

3.1 Unstructured Graph Partitioning

ParMETIS_V3_PartKway is the routine in PARMETIS…
…sizes of the two separators at the second level (from left to right). Similarly, sizes[2p-8] through sizes[2p-5] store the sizes of the four separators of the third level (from left to right), and so on. This array can be used to quickly construct the separator tree (a form of an elimination tree for direct factorization). Given this separator tree and the sizes of the sub-domains, the nodes in the ordering produced by ParMETIS_V3_NodeND are numbered in a postorder fashion. Figure 6 illustrates the sizes array and the postorder ordering.

[Figure 6 graphic: a 3 x 5 grid with its three separators shaded, together with the resulting order and sizes arrays.]

Figure 6: An example of the ordering produced by ParMETIS_V3_NodeND. Consider the simple 3 x 5 grid and assume that we have four processors. ParMETIS_V3_NodeND finds the three separators that are shaded. It first finds the big separator and then, for each of the two sub-domains, it finds the smaller ones. At the end of the ordering, the order vector, concatenated over all the processors, will be the one shown. Similarly, the sizes arrays will all be identical to the one shown, corresponding to the regions pointed to by the arrows.

4.3 Numbering and Memory Allocation

PARMETIS allows the user to specify a graph whose numbering starts either at 0 (C style) or at 1 (Fortran style). Of course, PARMETIS requires that…
…Each processor running ParMETIS_V3_NodeND or ParMETIS_V32_NodeND writes its portion of the computed fill-reducing ordering to an array called order. Similar to the part array, the size of order is equal to the number of vertices stored at each processor. Upon completion, for each vertex j, order[j] stores the new global number of this vertex in the fill-reducing permutation.

Besides the ordering vector, ParMETIS_V3_NodeND also returns information about the sizes of the different sub-domains, as well as the separators at different levels. This array is called sizes and is of size 2p, where p is the number of processors. Every processor must supply this array, and upon return each of the sizes arrays is identical.

To accommodate runs in which the number of processors is not a power of two, ParMETIS_V3_NodeND performs log p levels of nested dissection. Because of that, let p' = 2^⌊log p⌋ be the largest number of processors less than or equal to p that is a power of two. Given this definition of p', the format of the sizes array is as follows. The first p' entries of sizes, from 0 to p'-1, store the number of nodes in each one of the p' sub-domains. The remaining p'-1 entries of this array, from sizes[p'] up to sizes[2p'-2], store the sizes of the separators at the log p' levels of nested dissection. In particular, sizes[2p'-2] stores the size of the top-level separator, and sizes[2p'-4] and sizes[2p'-3] store the sizes…
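The indexing rules above can be made concrete with a small helper. This is only a sketch, assuming sizes was filled in by a successful ParMETIS_V3_NodeND call on p processes:

    /* Walk the sizes[] array returned by ParMETIS_V3_NodeND. */
    #include <parmetis.h>
    #include <stdio.h>

    static void print_separator_tree(const idx_t *sizes, int p)
    {
        int pprime = 1;
        while (2 * pprime <= p)       /* largest power of two <= p */
            pprime *= 2;

        for (int i = 0; i < pprime; i++)
            printf("sub-domain %d: %d nodes\n", i, (int)sizes[i]);

        /* sizes[2p'-2] is the top-level separator; sizes[2p'-4..2p'-3]
           are the second level, sizes[2p'-8..2p'-5] the third, etc. */
        for (int i = pprime; i < 2 * pprime - 1; i++)
            printf("separator entry sizes[%d] = %d\n", i, (int)sizes[i]);
    }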
…as described in Section 4.2.4.

comm: This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs this will point to MPI_COMM_WORLD.

Returns

METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.

Note: The quality of the partitionings computed by ParMETIS_V3_PartGeom is significantly worse than that produced by ParMETIS_V3_PartKway and ParMETIS_V3_PartGeomKway.

int ParMETIS_V3_PartMeshKway(idx_t *elmdist, idx_t *eptr, idx_t *eind, idx_t *elmwgt, idx_t *wgtflag, idx_t *numflag, idx_t *ncon, idx_t *ncommonnodes, idx_t *nparts, real_t *tpwgts, real_t *ubvec, idx_t *options, idx_t *edgecut, idx_t *part, MPI_Comm *comm)

Description

This routine is used to compute a k-way partitioning of a mesh on p processors. The mesh can contain elements of different types.

Parameters

elmdist: This array describes how the elements of the mesh are distributed among the processors. It is analogous to the vtxdist array. Its contents are identical for every processor. See discussion in Section 4.2.3.

eptr, eind: These arrays specify the elements that are stored locally at each processor. See discussion in Section 4.2.3.

elmwgt: This array stores the weights of the elements. See discussion in Section 4.2.3.

wgtflag: This is used to indicate whether the elements of the mesh have weights associated with them. wgtflag can take…
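For orientation, a hedged sketch of a complete ParMETIS_V3_PartMeshKway call follows. The distributed mesh arrays (elmdist, eptr, eind), the local element count nlocalelems (a name introduced here for illustration), npes, and comm are assumed to exist, and the mesh is treated as an unweighted triangle mesh:

    /* Sketch: partition an unweighted triangle mesh into one sub-domain
     * per process.  elmdist, eptr, eind, nlocalelems, npes, and comm
     * are assumed to be defined. */
    idx_t wgtflag = 0, numflag = 0, ncon = 1;
    idx_t ncommonnodes = 2;               /* triangle mesh -> dual graph */
    idx_t nparts = (idx_t)npes;
    idx_t options[3] = { 0, 0, 0 };       /* all defaults */
    idx_t edgecut;

    real_t *tpwgts = malloc(ncon * nparts * sizeof(real_t));
    for (idx_t i = 0; i < ncon * nparts; i++)
        tpwgts[i] = 1.0 / (real_t)nparts;
    real_t ubvec[1] = { 1.05 };

    idx_t *part = malloc(nlocalelems * sizeof(idx_t));

    ParMETIS_V3_PartMeshKway(elmdist, eptr, eind, NULL /* elmwgt */,
                             &wgtflag, &numflag, &ncon, &ncommonnodes,
                             &nparts, tpwgts, ubvec, options,
                             &edgecut, part, &comm);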
…such that the redistribution cost is minimized. These schemes are generally referred to as remapping schemes. Remapping schemes typically lead to repartitionings that have smaller edge-cuts, while diffusive schemes lead to repartitionings that incur smaller redistribution costs. However, since these results can vary significantly among different types of applications, it can be difficult to select the best repartitioning scheme for the job.

ParMETIS_V3_AdaptiveRepart is a parallel implementation of the Unified Repartitioning Algorithm [15] for adaptive repartitioning, which combines the best characteristics of remapping and diffusion-based repartitioning schemes. A key parameter used by this algorithm is the ITR factor. This parameter describes the ratio between the time required for performing the inter-processor communications incurred during parallel processing and the time required to perform the data redistribution associated with balancing the load. As such, it allows us to compute a single metric that describes the quality of the repartitioning, even though adaptive repartitioning is a multi-objective optimization problem.

ParMETIS_V3_AdaptiveRepart is based on the multilevel partitioning algorithm, and so is similar in nature to the algorithm implemented in ParMETIS_V3_PartKway. However, this routine uses a technique known as local coarsening, in which only vertices that have been distributed onto the same processor are coarsened together. On the coarsest graph…
…data redistribution associated with the last repartitioning/load-balancing phase. Simply divide the first time by the second; the result is the correct ITR factor. In case these times cannot be ascertained (e.g., for the first repartitioning/load-balancing phase), our experiments have shown that values between 100 and 1000 work well for a variety of situations.

ParMETIS_V3_AdaptiveRepart can be used to load-balance the mesh either before or after mesh adaptation. In the latter case, each processor first locally adapts its mesh, leading to different processors having different numbers of elements; ParMETIS_V3_AdaptiveRepart can then compute a partitioning in which the load is balanced. However, load balancing can also be done before adaptation, if the degree of refinement for each element can be estimated a priori. That is, if we know ahead of time into how many new elements each old element will subdivide, we can use these estimates as the weights of the vertices of the graph that corresponds to the dual of the mesh. In this case, the mesh can be redistributed before adaptation takes place. This technique can significantly reduce data redistribution times [10].

3.4 Partition Refinement

ParMETIS_V3_RefineKway is the routine provided by PARMETIS to improve the quality of an existing partitioning. Once a graph has been partitioned and redistributed accordingly, ParMETIS_V3_RefineKway can be called to compute a new partitioning that further improves the quality.
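The ITR computation described above amounts to a ratio of two measured times. A minimal sketch follows (the variable names are illustrative, and the result would be passed to ParMETIS_V3_AdaptiveRepart as its ITR parameter):

    /* Estimate the ITR factor from measured times. */
    double t0 = MPI_Wtime();
    /* ... one round of the application's inter-processor communication ... */
    double t_comm = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    /* ... the data redistribution of the last load-balancing phase ... */
    double t_redist = MPI_Wtime() - t0;

    /* ITR = communication time / redistribution time; if the times are
       unavailable, values between 100.0 and 1000.0 are reasonable. */
    real_t itr = (t_redist > 0.0) ? (real_t)(t_comm / t_redist)
                                  : (real_t)1000.0;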
…If options[3] is set to PARMETIS_PSR_UNCOUPLED, then the previously computed partitioning must be passed to the routine as a parameter, via this array.

comm: This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs this will point to MPI_COMM_WORLD.

METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.

5.4 Fill-reducing Orderings

int ParMETIS_V3_NodeND(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *numflag, idx_t *options, idx_t *order, idx_t *sizes, MPI_Comm *comm)

Description

This routine is used to compute a fill-reducing ordering of a sparse matrix using multilevel nested dissection.

Parameters

vtxdist: This array describes how the vertices of the graph are distributed among the processors. See discussion in Section 4.2.1. Its contents are identical for every processor.

xadj, adjncy: These store the local adjacency structure of the graph at each processor. See discussion in Section 4.2.1.

numflag: This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and order arrays. numflag can take the following two values:
0  C-style numbering is assumed that starts from 0
1  Fortran-style numbering is assumed that starts from 1

options: This is an array of integers that is used to pass parameters to the routine. Their meanings are identical to those of ParMETIS_V3_PartKway.

order: This array returns…
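A hedged sketch of a complete ordering call follows; the distributed graph (vtxdist, xadj, adjncy), the local vertex count nlocal, npes, and comm are assumed to exist:

    /* Sketch: compute a fill-reducing ordering with ParMETIS_V3_NodeND. */
    idx_t numflag = 0;
    idx_t options[3] = { 0, 0, 0 };       /* defaults */

    idx_t *order = malloc(nlocal * sizeof(idx_t));    /* new global numbers
                                                         of local vertices  */
    idx_t *sizes = malloc(2 * npes * sizeof(idx_t));  /* sub-domain and
                                                         separator sizes,
                                                         identical on every
                                                         process            */

    int rc = ParMETIS_V3_NodeND(vtxdist, xadj, adjncy, &numflag,
                                options, order, sizes, &comm);
    if (rc != METIS_OK)
        MPI_Abort(comm, 1);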
…distributed dual graph. Both of these routines require an elmdist array, which specifies the distribution of the mesh elements but is otherwise identical to the vtxdist array. They also require a pair of arrays called eptr and eind, as well as the integer parameter ncommonnodes.

The eptr and eind arrays are similar in nature to the xadj and adjncy arrays used to specify the adjacency list of a graph, but now, for each element, they specify the set of nodes that make up that element. Specifically, the set of nodes that belong to element i is stored in array eind, starting at index eptr[i] and ending at (but not including) index eptr[i+1]; in other words, eind[eptr[i]] up through and including eind[eptr[i+1]-1]. Hence, the node lists of the elements are stored consecutively in the array eind. This format allows the specification of meshes that contain elements of mixed type.

The ncommonnodes parameter specifies the degree of connectivity that is desired between the vertices of the dual graph. Specifically, an edge is placed between two vertices if their corresponding mesh elements share at least g nodes, where g is the value of the ncommonnodes parameter. Hence, this parameter can be set to result in a traditional dual graph (e.g., a value of two for a triangle mesh, or a value of four for a hexahedral mesh). However, it can also be set higher or lower for increased or decreased connectivity. Additionally, ParMETIS_V3_PartMeshKway requires an elmwgt array…
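As a concrete (hypothetical) illustration of this layout, consider a processor that stores two triangles sharing an edge:

    /* Worked example of the eptr/eind mesh format: element i's nodes
     * are eind[eptr[i]] .. eind[eptr[i+1]-1]. */
    idx_t eptr[] = { 0, 3, 6 };
    idx_t eind[] = { 0, 1, 2,      /* element 0: nodes 0, 1, 2 */
                     1, 3, 2 };    /* element 1: nodes 1, 3, 2 */

    /* With ncommonnodes = 2 the dual graph connects elements 0 and 1,
       because they share the two nodes 1 and 2. */
    idx_t ncommonnodes = 2;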
…ParMETIS_V3_RefineKway can be used to improve the quality of partitionings that are produced by other partitioning algorithms (such as the technique discussed in Section 3.1 that is used in ParMETIS_V3_PartGeom). ParMETIS_V3_RefineKway can also be used repeatedly to further improve the quality of a partitioning. However, each successive call to ParMETIS_V3_RefineKway will tend to produce smaller improvements in quality.

3.5 Partitioning for Multi-phase and Multi-physics Computations

The traditional graph-partitioning problem formulation is limited in the types of applications that it can effectively model, because it specifies that only a single quantity be load-balanced. Many important types of multi-phase and multi-physics computations require that multiple quantities be load-balanced simultaneously. This is because synchronization steps exist between the different phases of the computation, and so each phase must be individually load-balanced. That is, it is not sufficient to simply sum up the relative times required for each phase and to compute a partitioning based on this sum. Doing so may lead to a processor having too much work during one phase of the computation (so that it may still be working after the other processors have gone idle) and not enough work during another. Instead, it is critical that every processor have an equal amount of work from each phase of the computation. Two examples are particle-in-cell computations [17] and contact/impact…
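Multiple balance constraints are expressed through the ncon parameter and the vertex-weight array: with ncon weights per vertex, the weights of vertex v occupy vwgt[v*ncon] through vwgt[v*ncon + ncon - 1]. A minimal sketch for a two-phase computation follows; the per-phase loads are placeholders that a real application would supply:

    /* Sketch: multi-constraint vertex weights for a two-phase
     * computation.  nlocal is assumed to be the local vertex count. */
    idx_t ncon = 2;
    idx_t *vwgt = malloc(nlocal * ncon * sizeof(idx_t));
    for (idx_t v = 0; v < nlocal; v++) {
        vwgt[v*ncon + 0] = 1;   /* phase-1 load of vertex v (app-specific) */
        vwgt[v*ncon + 1] = 2;   /* phase-2 load of vertex v (app-specific) */
    }
    /* vwgt, together with wgtflag = 2 and an ubvec of length ncon, is
       then passed to the partitioning routine. */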
…returns the result of the ordering, as described in Section 4.2.4.

sizes: This array returns the number of nodes for each sub-domain and each separator, as described in Section 4.2.4.

comm: This is a pointer to the MPI communicator of the processes that call PARMETIS. For most programs this will point to MPI_COMM_WORLD.

Returns

METIS_OK: Indicates that the function returned normally.
METIS_ERROR: Indicates some other type of error.

int ParMETIS_V32_NodeND(idx_t *vtxdist, idx_t *xadj, idx_t *adjncy, idx_t *vwgt, idx_t *numflag, idx_t *mtype, idx_t *rtype, idx_t *p_nseps, idx_t *s_nseps, real_t *ubfrac, idx_t *seed, idx_t *dbglvl, idx_t *order, idx_t *sizes, MPI_Comm *comm)

Description

This routine is used to compute a fill-reducing ordering of a sparse matrix using multilevel nested dissection.

Parameters

vtxdist: This array describes how the vertices of the graph are distributed among the processors. See discussion in Section 4.2.1. Its contents are identical for every processor.

xadj, adjncy: These store the local adjacency structure of the graph at each processor. See discussion in Section 4.2.1.

vwgt: These store the weights of the vertices. A value of NULL indicates that each vertex has unit weight. See discussion in Section 4.2.1.

numflag: This is used to indicate the numbering scheme that is used for the vtxdist, xadj, adjncy, and order arrays. The possible…
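A hedged sketch of a ParMETIS_V32_NodeND call, based on the prototype above, is given below. The graph arrays, order, sizes, and comm are assumed to exist (as in the ParMETIS_V3_NodeND sketch earlier), and the mtype and rtype values are placeholders; the supported codes and their defaults are defined in parmetis.h:

    /* Sketch: fill-reducing ordering via the extended V32 interface.
     * All tuning values below are illustrative, not recommendations. */
    idx_t numflag = 0;
    idx_t mtype   = 1;      /* matching scheme   (placeholder code)   */
    idx_t rtype   = 1;      /* refinement scheme (placeholder code)   */
    idx_t p_nseps = 1;      /* separators tried at parallel levels    */
    idx_t s_nseps = 1;      /* separators tried at serial levels      */
    real_t ubfrac = 1.05;   /* separator balance fraction             */
    idx_t seed    = 42;     /* random-number seed                     */
    idx_t dbglvl  = 0;      /* no debug output                        */

    int rc = ParMETIS_V32_NodeND(vtxdist, xadj, adjncy, NULL /* vwgt */,
                                 &numflag, &mtype, &rtype,
                                 &p_nseps, &s_nseps, &ubfrac,
                                 &seed, &dbglvl, order, sizes, &comm);
    if (rc != METIS_OK)
        MPI_Abort(comm, 1);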