The Dynamic Adaptation of Parallel Mesh-Based
Contents
1. …tioning can be divided into a parallel phase and a serial phase. In the parallel phase we construct the weighted graph M'_0; communication is not required at this point. Each processor P_i computes the weights of each element E_a in M_0 using a simple recursive algorithm, and in the same way it computes the weight of each edge, W(E_a, E_b). The parallel Nested Repartitioning algorithm (Figure 18) proceeds as follows:
- In parallel, each processor P_i computes ElemWeight(E_a), EdgeWeight(E_a) and Weight(E_a, E_b) for each element E_a and each pair of adjacent elements (E_a, E_b) in M_0.
- Each processor P_i sends its portion of M'_0 and the weights to a common processor P_0.
- P_0 receives the sections of M'_0 from each processor and computes a partition Pi of M'_0. This can be done using RSB or a serial multilevel algorithm.
- P_0 returns Pi to each processor P_i.
- Each P_i migrates its elements and nodes according to Pi.
Once P_i obtains its portion of M'_0, it sends it to P_0 for the serial part of the algorithm. The graph M'_0 for the mesh in Figure 17 is shown in Figure 19. We consider two ways of defining the set W:
- W = {(E_a, E_b) | E_a, E_b in M_0 and E_a and E_b have a common vertex}, or
- W = {(E_a, E_b) | E_a, E_b in M_0 and E_a and E_b have a common edge}.
Once P_0 receives a message from each processor P_i, it partitions the reduced graph M'_0 using a serial partitioning algorithm. As M'_0 is assumed to be relatively small, we can use at this stage algorithms that would be considered to…
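The "simple recursive algorithm" for the element weights is not listed in this excerpt; the following is a minimal sketch of how ElemWeight can be computed by recursion over the refinement hierarchy, with hypothetical names (`Elem`, `elemWeight`) rather than the thesis' actual classes. Because each processor owns its own element trees, this pass needs no communication, as the text states.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical multilevel element: refinement gives it children.
struct Elem {
    std::vector<Elem*> children;
};

// ElemWeight(E): the number of lowest-level (leaf) descendants of E in the
// fine mesh, or 1 if E has no children -- a purely local recursion.
std::size_t elemWeight(const Elem* e) {
    if (e->children.empty()) return 1;
    std::size_t w = 0;
    for (const Elem* c : e->children) w += elemWeight(c);
    return w;
}
```

EdgeWeight and Weight admit the same kind of tree walk, counting adjacencies between leaf descendants instead of the leaves themselves.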
2. …END IF; END IF; END WHILE. (Figure 14: Detecting the termination of the refinement phase.) Once P_0 has received an AddRef_Msg from each other processor, it broadcasts a Continue_Msg. Each processor P_i starts the refinement phase waiting for a message. If it receives a Continue_Msg from the initiator, it knows that it can proceed to the next phase. If the message is a Refine_Msg(k, i), it inserts the elements indicated in the message into R_i and refines them using the serial algorithm. Rather than having one critical message, P_i can now have several critical messages sent by the same or different processors; P_i gives them a sequence number and stores them in a table. If the refinement of the elements in R_i creates shared nodes on a boundary between P_i and P_j, then P_i sends a Refine_Msg(i, j) message to P_j. P_i keeps track of how many of these Refine_Msg(i, j) it sends for each critical Refine_Msg it receives. Once it has received an AddRef_Msg(j, i) for each Refine_Msg(i, j), it can send back an AddRef_Msg(i, k) response to P_k. P_i continues to listen for messages until it receives a Continue_Msg from the coordinator. Figure 15 shows an example where the refinement propagates cyclically between processors.

7. Load balancing. In this section we present a strategy for repartitioning and rebalancing the mesh. We first explain serial multilevel partitioning algorithms. We then introduce a new, highly parallel repartitioning method called Parallel Nested Repart…
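The counting discipline above (one outstanding forward per Refine_Msg sent, acknowledge only when all forwards are answered) can be captured in a small bookkeeping class. This is a sketch with assumed names; the real code would hook these methods into the MPI message handlers rather than call them directly.

```cpp
#include <map>

// Per-processor tracker for critical refine messages. Each critical
// Refine_Msg is identified by a sequence number; we count how many
// Refine_Msg(i,j) we forwarded on its behalf, and allow the AddRef_Msg
// response only once every forward has been answered.
class RefinementTracker {
    std::map<int, int> pending_;  // sequence number -> outstanding forwards
public:
    void onCriticalRefine(int seq) { pending_[seq] = 0; }
    void onForwardSent(int seq)    { ++pending_[seq]; }
    // Returns true when the response for `seq` can now be sent back.
    bool onAddRefReceived(int seq) { return --pending_[seq] == 0; }
    bool canAcknowledge(int seq) const {
        auto it = pending_.find(seq);
        return it != pending_.end() && it->second == 0;
    }
};
```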
3. (Figure 37: Partition of the initial mesh using Chaco. Number of shared nodes after computing the partitions of (a) the air mesh and (b) the big mesh, for the Spectral and Multilevel methods with 4, 8, 16 and 32 partitions.) …refinements we were able to create meshes consisting of more than 2,000,000 elements. We estimate that we could store around 40,000 to 50,000 elements in each processor; after that the data sets were too big, and the performance of the algorithm dropped considerably due to thrashing. These results are shown in Tables 9, 10, 11 and 12. The serial time is the time spent creating new elements and nodes, and the communication time is the time spent propagating the refinement to adjacent processors. For each refinement we include the total number of elements, the average number of elements per processor, and the number of elements in the processor that stores the largest number of elements and in the one that stores the smallest number. Imbalance is the ratio between the absolute value of the difference between the largest or smallest number of elements (whichever is larger) and the average. Shared nodes is the number of nodes on an internal boundary between processors…
4. The numerical solution of complex partial differential equations using computational resources requires the definition of a domain Omega in which the problem is to be solved, and a set of conditions to be applied at its boundaries [10]. The continuous domain and boundary conditions are discretized so they become amenable to computer manipulation; a computational mesh M is thereby produced. This mesh is constructed by splitting the domain into a set of simple polygons, such as triangles and quadrilaterals in 2 dimensions or tetrahedra in 3 dimensions, called elements, that are connected by faces, edges and nodes. Once a mesh is constructed, elements can be split into a set of nested smaller elements or combined into a macroelement; this process is called the adaptation of the mesh. In an adaptive method, the selective and local refinement of the mesh is interleaved with the solution of the problem, by contrast with the static grid generation approach, in which a fixed discretization of the geometry is done in a preprocessing step. Adaptive methods can be schematically described as a feedback process where the automatic construction of a quasi-optimal mesh is performed in the course of the computation [1]. Rather than using a uniform mesh with grid points evenly spaced on a domain, adaptive mesh refinement techniques place more grid points where the solution is changing rapidly. The mesh is adaptively refined during the computation according to local er…
5. - A move message, Move_Msg(i, j, E_a), is used to migrate an element E_a from a source processor P_i to a destination processor P_j. We also assume that this message has two other fields. The first field, Move_Msg(i, j).nodes(E_a), contains the vertices of the element E_a. For each node V_b in Adj(E_a), if P_j has a local copy of V_b then only the reference P_j:V_b is included in the message; otherwise we send the node V_b, and we set Ref(V_b) = Ref(V_b) united with {P_j:V_b} to initialize the copy. The other field, Move_Msg(i, j).ref(V_b), contains the references to V_b in the other processors.
- An add-reference message, AddRef_Msg(i, j, V_a, V_b), is used to add a reference to node V_b in processor P_j; in this case we set Ref(V_b) = Ref(V_b) united with {P_i:V_a}. This is essentially the same kind of message used for refining the mesh.
- A delete-reference message, DelRef_Msg(i, j, V_a, V_b), is used to remove a reference to the node V_a in P_j, so Ref(V_b) = Ref(V_b) minus {P_i:V_a}.
In the rest of this section we discuss a migration algorithm that uses these messages.

7.2.1 The migration algorithm. Assume that we need to move an element E_a from P_i to P_j, that is, E_a in Pi_i and E_a in Pi'_j. Assume also that Adj(E_a) = {V_b1, ..., V_bn} is the set of vertices of E_a. We initially send a Move_Msg(i, j, E_a) message from P_i to P_j. If V_b in Adj(E_a) and P_j:V_b in Ref(V_b), we only include the reference in the message. Otherwise, if V_b in Adj(E_a) and P_j:V_b is not in Ref(V_b), we include the node V_b in the m…
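The fields of the move message can be pictured as a plain struct. This is an assumed layout for illustration only (the thesis does not give the wire format); `NodeRef` follows the (processor, address) pair described later in the implementation sections, and `nodeData` stands in for the full node payload.

```cpp
#include <cstdint>
#include <vector>

// A remote reference: (processor, memory address) -- illustrative only.
struct NodeRef { int proc; std::uintptr_t addr; };

// Hypothetical layout of Move_Msg(i, j, E_a).
struct MoveMsg {
    int src, dst;    // P_i -> P_j
    int element;     // id of the element E_a being migrated
    // nodes(E_a): per vertex, either a reference (P_j already has a copy)
    // or the full node data (simplified here to an integer id).
    struct VertexEntry { bool isRef; NodeRef ref; int nodeData; };
    std::vector<VertexEntry> nodes;
    // ref(V_b): references to V_b held by the other processors.
    std::vector<std::vector<NodeRef>> refs;
};
```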
6. …set V_p = V_q; END IF; END WHILE; END FOR. (Figure 2: Rivara's two-triangle refinement algorithm.) (Figure 3: Refinement of a mesh: (a) shows the initial mesh M_0, while (b), (c) and (d) show the meshes M_1, M_2 and M_3. Associated with each node and element is its level.) …children. For each node V_b we define Level(V_b) = 0 if V_b is in M_0, and Level(V_b) = Level(E_a) + 1 if V_b was created as the result of refining an element E_a. Figure 3 gives an example of the refinement of a mesh along with the associated levels. Note that the meshes M_1, M_2 and M_3 in (b), (c) and (d) do not only include nodes and elements of the corresponding level, but can also include nodes and elements of previous levels. Also, the elements are replaced by their children when they are refined, but the nodes are not: for example, every node V_b such that Level(V_b) = 0 will be present in all the successive meshes. Note too that some elements E_a of level l have as vertices nodes of level l - 1 or less. The iterative execution of this algorithm produces nested meshes: if M_0 is the coarsest mesh, then for any level l, M_l < M_{l-1} < ... < M_0, where M_l < M_{l-1} is a relation indicating that M_l has all the nodes present in M_{l-1} and that some elements in M_{l-1} have been split to form the elements in M_l.

3.1 Multilevel refinement. A sequence of nested refinements creates an element hierarchy. In this hierarchy each element of the initial mesh belongs to th…
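The level rule above (initial nodes have level 0; anything created by refining an element of level l gets level l + 1) is easy to state in code. This sketch uses illustrative types, not the thesis' actual classes:

```cpp
#include <vector>

// Minimal level bookkeeping for the nested-refinement hierarchy.
struct Element {
    int level;
    std::vector<Element> children;
};

// Refining an element of level l creates children -- and any new nodes --
// of level l + 1. Returns the level assigned to the new entities.
int refine(Element& e, int nChildren) {
    for (int k = 0; k < nChildren; ++k)
        e.children.push_back({e.level + 1, {}});
    return e.level + 1;
}
```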
7. …Laboratories is another related project. Its refinement algorithm [3] avoids the creation of duplicate nodes on the boundaries between processors by refining the elements in independent
(Table 14: Repartition of the air mesh in 4 processors after none, one and two refinements using the PNR algorithm and the serial Multilevel Bisection algorithm. Times are in seconds.)
(Table 15: Repartition of the air mesh in 8 processors after none, one and two refinements using the PNR algorithm and the serial Multilevel Bisection algorithm. Times are in seconds.)
(Table 16: Repartition of the air mesh in 16 processors after none, one and two refinements using the PNR algorithm and the serial Multilevel Bisection algorithm. Times are in seconds. Each table breaks the total time into sending the mesh to P_0, partitioning, receiving the partition from P_0, and migration, and also reports the number of shared nodes.)
8. …we initialize Ref(V_a) = {P_j:V_b} and Ref(V_b) = {P_i:V_a}. Although at the end of a refinement phase |Ref(V_b)| = 1 for each new node created in that phase, this might not hold after the load-balancing phase. It is possible that a new partition converts an internal node into a shared node and vice versa, or that it modifies Ref(V_b) so that the node is shared by more than two processors.
(Figure 5: A square mesh partitioned by elements between four processors (a), and its internal representation using remote references (b).)
INPUT: M = (E, V), where M is a finite element mesh with a set E of elements and a set V of vertices.
Compute the dual M' = (E, W) of M, where W = {(E_a, E_b) | E_a, E_b in E, a != b, Adj(E_a) intersect Adj(E_b) != empty} is a set of adjacent elements and Adj(E_a) is the set of nodes of element E_a.
Partition M' into P regions using a graph partitioning algorithm, such that E_a in Pi_i if E_a is assigned to P_i, where the union of the Pi_i is E and Pi_i intersect Pi_j = empty for all i != j. Nodes(Pi_i) is the set of nodes corresponding to the elements in Pi_i; note that it is not required that Nodes(Pi_i) intersect Nodes(Pi_j) = empty.
FOR each V_a in Nodes(Pi_i), in parallel, DO
  Ref(V_a) = empty
END FOR
FOR each V_a in Nodes(Pi_i), in parallel, DO
  IF E_b in ElemAdj(V_a) and E_b in Pi_j and j != i THEN
    Ref(V_a) = Ref(V_a) united with {P_j:V_a}
  END IF
END FOR
(Figure 6: Computing the initial references in a parallel mesh.)
The design of these references is highly influenced by the element partitioning method. Their main…
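A serial stand-in for the parallel loops of Figure 6 can make the rule concrete: a node touched by elements in more than one partition receives one reference per partition touching it (each processor's own view would exclude itself). The names and container choices here are illustrative assumptions, not the thesis' code.

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

using NodeId = int;
using Refs = std::map<NodeId, std::set<int>>;  // node -> partitions that must know it

// elems: for each element E_a, its vertex list and its assigned partition Pi_i.
// A node shared across partitions gets references; internal nodes get none.
Refs initialReferences(const std::vector<std::pair<std::vector<NodeId>, int>>& elems) {
    std::map<NodeId, std::set<int>> touchedBy;
    for (const auto& [verts, part] : elems)
        for (NodeId v : verts) touchedBy[v].insert(part);
    Refs refs;
    for (const auto& [v, parts] : touchedBy)
        refs[v] = (parts.size() > 1) ? parts : std::set<int>{};
    return refs;
}
```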
9. (Table 3: All-to-all communication, for messages of 1,000 to 100,000 doubles. Performance is measured in MBytes/sec and the length of the message is in doubles.) Finally, we ran the same tests on an IBM SP-2 with 24 processors. These results are shown in Table 4 and plotted in Figure 33(d). The SP-2 was able to run the same tests around 35 times faster than the network of workstations; furthermore, contention has a much smaller effect on that machine.
10. …Methods, 570-579, Prentice Hall, New Jersey, 1989.
[5] Ivo Babuska, Jagdish Chandra and Joseph E. Flaherty, eds., Adaptive Computational Methods for Partial Differential Equations, SIAM, Philadelphia, 1983.
[6] M. Mamman and B. Larrouturou, Dynamical mesh adaptation for two-dimensional reactive flow simulations, Numerical Grid Generation in Computational Fluid Dynamics and Related Fields, North-Holland, Amsterdam, 1991.
[7] L. Ferragut, R. Montenegro and L. Plaza, Efficient refinement/derefinement algorithm of nested meshes to solve evolution problems, Communications in Numerical Methods in Engineering, Vol. 10, 403-412, 1994.
[8] Kenneth G. Powell, Philip L. Roe and James Quirk, Adaptive mesh algorithms for computational fluid dynamics, Algorithmic Trends in Computational Fluid Dynamics, Springer-Verlag, 1991.
[9] J. E. Savage and M. Wloka, Parallelism in graph partitioning, Journal of Parallel and Distributed Computing, 13, 257-272, 1991.
[10] Roy Williams, Adaptive parallel meshes with complex geometry, Numerical Grid Generation in Computational Fluid Dynamics and Related Fields, North-Holland, 1991.
[11] A. Bowyer, Computing Dirichlet tessellations, Comp. J., 24, 162, 1981.
[12] J. E. Akin, Finite Elements for Analysis and Design, Academic Press, 1994.
[13] E. J. Schwabe, G. E. Blelloch, A. Feldmann, O. Ghattas, J. R. Gilbert, G. L. Miller, D. R. O'Hallaron, J. R. Shewchuk and S. Teng, A separator-based framework for…
11. There is also an option for zooming into regions of the mesh. A sample display with the mesh used in Section 7.1 is shown in Figure 30; the elements are colored according to their processor assignment. In this example the mesh is partitioned between 4 processors. Although the window is managed by P_0, the actual elements and nodes are located in different processors. When we want to display the mesh, all the processors need to send a message to P_0 that contains the elements and their coordinates. When the user issues a command by selecting an option from the menus, P_0 broadcasts the command to all the other processors. All this distributed input and output is managed by 4 classes. When the user selects an option from the menu, the program calls a method of a Console object located in P_0. This object is responsible for putting a wrapper around the command and broadcasting it to all the other processors. When P_i receives a command message from
(Figure 30: Sample display of the window interface of the program.)
(Figure 31: Implementation of the distributed input/output: XWindow, Console and ConsoleStub objects.)
P_0, it invokes a method on its ConsoleStub object. This object unwraps the command and calls the appropriate method on the FEMesh object that implements the mesh. This FEMesh object executes the command, probably communicating with some other FEMesh objects in other processors. When P_i wants to produce some output, like drawing an…
12. …and a memory address: it represents a reference to a node located in that processor at a specific memory location. To invoke a method on a node located in a remote processor, these addresses are packed into a message as long integers (MPI does not support the notion of messages of pointers) and sent to the remote processor. Once the message arrives at the destination processor, the address is cast into a pointer to the node and the corresponding method is invoked on the node. For this reason it is very unsafe to delete or move a node if the other processors still have a reference to it; this restriction had a big influence on the migration algorithm of Section 7.2. The elements are implemented using inheritance: there is an abstract class AbsElement from which we derive classes for the different types of elements, such as Triangle and Quadrilateral. An element has a pointer to its parent, that is, the element whose refinement created it. If the element was also refined, it has a list of pointers to its children. Finally, the element has a vector of pointers to its vertices; this makes it easy to access the coordinates of the vertices of the element to generate its local matrices for the numerical simulations. Another important class is the Node class. This class keeps track of the coordinates of the node and has a list of NodeRefs to the copies of the node in other processors. For an internal node this list is empty; for a node located on an internal bounda…
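The pointer-as-integer round trip described above can be sketched as follows. The MPI send/receive itself is elided; this is an assumed illustration of the packing only, and the comment spells out why the scheme makes node deletion and movement unsafe.

```cpp
#include <cstdint>

// A node, simplified to its coordinates.
struct Node { double x, y; };

// Pack a node's address into an integer suitable for an MPI message
// (MPI has no pointer-valued message type).
std::uint64_t packAddress(const Node* n) {
    return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(n));
}

// Cast the integer back into a pointer. This is only valid on the
// processor that produced the address -- and only while the node still
// lives at that location, which is why deleting or moving a node while
// remote references to it exist is unsafe.
Node* unpackAddress(std::uint64_t word) {
    return reinterpret_cast<Node*>(static_cast<std::uintptr_t>(word));
}
```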
13. …and plotted in Figure 40. We perform this permutation migration after none, one and two refinements on the air mesh for 4, 8, 16 and 32 processors. Remember that the refinement of all the elements of the mesh more than doubles the size of the mesh; also, the migration includes not only the elements in the fine mesh but all the elements in the intermediate meshes as well. When the mesh is not refined, the cost of the algorithm is dominated by the communication; this is why we do not observe any speedup in the column corresponding to the migration of the initial mesh. After one or two refinements we observe linear speedups. The migration of the mesh after two refinements using four processors is a special case: we believe that the low performance of the algorithm in that case is because we are consuming too much memory.
(Table 13: Migration of the mesh. Time in seconds to migrate each element of the air mesh that is assigned to P_i to P_{i+1}, after none, one or two refinements.)
Figure 41 shows two stages of this test. In 41(a) we display a snapshot of the mesh with no refinements before the migration, and in (b) we show the same mesh after this permutation migration. Figure 42(a) shows another migration example: in this case all the elements in each processor are migrated to a random processor. In Figure 42(b) we migrate the elements a…
14. …and replaced by their children (b). In (c) the element E_a1 is further refined. Under the meshes we show the corresponding mesh hierarchy.
- When an element E_a is refined, it is replaced by its children in the fine mesh. To coarsen an element, all its children must be selected for coarsening; in this case the children in the fine mesh are replaced by their parent and destroyed.
- Both refinement and coarsening can propagate to adjacent elements. The algorithms are not completely local, because they need to preserve conformality requirements.
This sequence of refinements is explained in Figure 4. Initially the elements E_a and E_b are selected for refinement (a); under the mesh we show the internal representation. Both E_a and E_b belong to M_0 and M_1. After the first refinement, 4 new elements are created; at this point M_1 includes the children of E_a and E_b (b).

3.2 Local coarsening. The selection of elements for coarsening is performed by evaluating an adaptation criterion at their vertices. The nodes in a finite element mesh are associated with the degrees of freedom and the unknowns of a system of equations. After finding its solution at time t, the system might require the elimination of nodes considered unnecessary according to a selected evaluation criterion, to reduce the number of unknowns. This destruction of nodes requires coarsening of elements to maintain a conforming mesh. The successful evaluation of the adaptation criteria at a nod…
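The two replacement rules above can be sketched over a list-based fine mesh. This is a minimal illustration with assumed types: the real implementation keeps parent/child pointers in the element classes, and "destroyed" children are here merely removed from the fine-mesh list.

```cpp
#include <algorithm>
#include <vector>

struct Elem {
    Elem* parent = nullptr;
    std::vector<Elem*> children;
};

// Refine: E leaves the fine mesh and is replaced by its children.
void refine(std::vector<Elem*>& fine, Elem* e, std::vector<Elem*> kids) {
    for (Elem* k : kids) k->parent = e;
    e->children = kids;
    fine.erase(std::find(fine.begin(), fine.end(), e));
    fine.insert(fine.end(), kids.begin(), kids.end());
}

// Coarsen: legal only if every child of `parent` is currently in the fine
// mesh (i.e., all children were selected); the children are removed and the
// parent re-enters the fine mesh.
bool coarsen(std::vector<Elem*>& fine, Elem* parent) {
    for (Elem* k : parent->children)
        if (std::find(fine.begin(), fine.end(), k) == fine.end()) return false;
    for (Elem* k : parent->children)
        fine.erase(std::find(fine.begin(), fine.end(), k));
    parent->children.clear();
    fine.push_back(parent);
    return true;
}
```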
15. …automated partitioning and mapping of parallel algorithms for numerical solution of PDEs, Proc. 1992 DAGS Symposium, 1992.
[14] John J. Barton and Lee R. Nackman, Scientific and Engineering C++: An Introduction with Advanced Techniques and Examples, Addison-Wesley, 1994.
[15] H. D. Simon, Partitioning of unstructured meshes for parallel processing, Computing Systems Eng., 1991.
[16] S. T. Barnard and H. D. Simon, A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, 711-718, 1993.
[17] S. T. Barnard and H. D. Simon, A parallel implementation of multilevel recursive spectral bisection for application to adaptive unstructured meshes, Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, 627-632, 1995.
[18] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, Tech. Rep. CORR 95-035, University of Minnesota, Dept. of Computer Science, 1995.
[19] G. Karypis and V. Kumar, Parallel multilevel graph partitioning, Tech. Rep. CORR 95-036, University of Minnesota, Dept. of Computer Science, 1995.
[20] B. Hendrickson and R. Leland, A multilevel algorithm for partitioning graphs, Technical Report SAND93-1301, Sandia National Laboratories, 1993.
[21] B. Hendrickson and R. Leland, The Chaco user's guide, Technical R…
16. …element on the screen, it calls a method on its DrawStub object. This object collects several output instructions and sends an individual message to P_0. P_0 receives the output messages through the Draw object. Finally, to display the output, the Draw object calls the appropriate Tcl/Tk commands. The relations between these objects are shown in Figure 31.

8.2 Object-oriented representation of the FE mesh. The distributed mesh is implemented using the FEMesh class; there is only one FEMesh object per processor. As this class is in essence a container for elements and nodes, its two most important data members are a list of pointers to elements and a list of pointers to nodes. These lists are implemented using the Tools [30] templates library. The list of elements is a list of pointers to the elements in the fine mesh assigned to the processor, while the list of nodes is a list of pointers to all the nodes in the processor. The FEMesh class also has a pointer to a root element; the children of this distinguished element are the elements of the coarse mesh assigned to the processor. In this way it is possible to traverse down the element hierarchy from the elements in the coarse mesh to their descendants in the fine mesh. The FEMesh class has methods for all the algorithms described in the previous sections. The remote references described in Section 5.2 are implemented using the NodeRef class. This class is just a pair consisting of a processor…
17. …graph M'_0, we define ElemWeight(E_a) to be the number of descendants of E_a in the fine mesh, or 1 if E_a has no children. We also define EdgeWeight(E_a) to be the number of edges between the descendants of E_a in the fine mesh; that is, EdgeWeight(E_a) = |{(E_b, E_c) such that E_b and E_c are the lowest-level descendants of E_a and they are adjacent to each other}|. For each pair (E_a, E_b) in W we define Weight(E_a, E_b) to be the number of edges between the descendants of E_a and E_b in the fine mesh. Note that if both E_a and E_b are unrefined, then Weight(E_a, E_b) = 1. Although we defined ElemWeight, EdgeWeight and Weight based on our multilevel elements of Section 3, we could also define them recursively for any mesh. Given a graph M' = (E, W), we can define ElemWeight(E_a) = 1, EdgeWeight(E_a) = 0, and Weight(E_a, E_b) = 1 if E_a and E_b are adjacent and 0 otherwise. If we collapse two elements E_a and E_b into one element E, then:
  ElemWeight(E) = ElemWeight(E_a) + ElemWeight(E_b)
  EdgeWeight(E) = EdgeWeight(E_a) + EdgeWeight(E_b) + Weight(E_a, E_b)
In a similar way we can define Weight(E, E_c) for a collapsed element E and an element E_c. The goal of the coarsening phase in the partitioning algorithm is to approximate cliques. A refined vertex E of the graph M' approximates a clique if
  EdgeDensity(E) = 2 * EdgeWeight(E) / (ElemWeight(E) * (ElemWeight(E) - 1))
is close to 1, and it does not approximate a clique if it is close to 0. Intuitively…
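The collapse rules and the density measure can be sketched directly. Note that the denominator of EdgeDensity is partially garbled in this extraction; the form below, ElemWeight(E) * (ElemWeight(E) - 1), is the reconstruction used here, chosen so that a clique of n descendants (with n(n-1)/2 internal edges) scores exactly 1.

```cpp
// Weights carried by a (possibly collapsed) vertex of M'.
struct W { long elem; long edge; };

// Collapsing E_a and E_b into E, following the rules in the text:
//   ElemWeight adds; EdgeWeight adds plus the weight of the joining edge.
W collapse(W a, W b, long weightAB) {
    return { a.elem + b.elem, a.edge + b.edge + weightAB };
}

// EdgeDensity(E): 1.0 when the descendants of E form a clique
// (n vertices, n(n-1)/2 edges), near 0 when they are sparse.
double edgeDensity(W e) {
    if (e.elem < 2) return 0.0;
    return 2.0 * e.edge / (double(e.elem) * (e.elem - 1));
}
```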
18. …migration of the mesh.
(Figure 45: Partitioning of the mesh after none, one and two refinements using the PNR algorithm and the serial Multilevel Bisection algorithm, for 4, 8, 16 and 32 processors on the air mesh: (a) total time; (b) number of shared nodes.)
…sets, so that adjacent elements in different processors are never refined at the same time. These independent sets are computed using Monte Carlo methods. Also, the partitioning algorithm is based on the Orthogonal Recursive Bisection method; in general it is possible to obtain much better partitions using multilevel methods. In [33, 42] an adaptive environment for unstructured problems is presented. In this project the migration is only done to adjacent processors, using iterative local migration techniques. The load-balancing algorithm works by comparing the work between adjacent processors, which means that several iterations might be required to rebalance the work. The refinement is performed using quadtree structures.

11. Conclusion and future work. In this th…
19. …nodes(E_a); insert P_i:V_b into Move_Msg(i, j).ref(V_b);
FOR each reference P_k:V_c in Ref(V_b) DO
  insert P_k:V_c into Move_Msg(i, j).ref(V_b)
END FOR
END IF; END FOR; END FOR
FOR each processor P_j DO
  IF i != j and Move_Msg(i, j) != empty THEN send Move_Msg(i, j) END IF
END FOR
(Figure 24: Migration phase: sending the elements to the destination processors.)
Once a processor P_j receives a message Move_Msg(i, j), it first creates the new nodes as specified in the message and then constructs the elements. If V_b is a node of an element E_a such that E_a is in Move_Msg(i, j) and V_b is in Move_Msg(i, j).nodes(E_a), then a new copy V_b' is created in P_j, and Ref(V_b') is initialized to Move_Msg(i, j).ref(V_b); remember that this also includes a reference to the sending processor, P_i:V_b. At this point P_i:V_b is in Ref(V_b'), but P_j:V_b' is not in Ref(V_b). It is the responsibility of P_j to inform the other processors of the newly created copy. Continuing with the example, when P_2 sends E_b to P_0 it should also include a copy of V_2. Before P_0 constructs the element E_b it should first create the node V_2; using that node and the local copies of the other vertices it can then create the element E_b. Note that the copy of V_2 in P_0 has a reference to the copies in the other processors, and not vice versa; it is the responsibility of P_0 to inform the other processors of the newly created copy. When P_1 receives the element, it also creates a copy of the shared node, but the copies in P_0 and P_2 of that node know n…
20. …refined mesh, we collapse vertices in parallel. The PNR method differs from other methods in that it uses the refinement history of the mesh to collapse the vertices, while other methods use maximal matchings or independent sets. As a consequence, we are able to collapse vertices locally in the parallel phase without any communication overhead, unlike other methods. Our tests show that by using the refinement history we obtain partitions that are almost always of higher quality than those obtained by the multilevel algorithms, yet PNR is very fast. For simplicity we assume that the initial mesh fits into one processor and marks the transition between the serial and the parallel phase; in Section 7.1.5 we discuss possible generalizations of this method.

7.1.1 The serial Multilevel Graph Partitioning algorithms. The pseudocode for a standard serial Multilevel Partitioning Algorithm is sketched in Figure 16. In general, serial multilevel algorithms perform the partitioning of a mesh in three phases:
- In the coarsening phase, these algorithms construct a sequence of graphs G_0, G_1, ..., G_k such that G_i < G_{i-1}, by collapsing adjacent nodes or contracting edges. This contraction is implemented by finding a maximal independent set [16] or a maximal matching [20]. Given a graph G = (U, F), where U is a set of vertices and F is a set of edges, U' contained in U is an independent set of G if for all v in U', (v, w) in F implies w is not in U'. An independent set U' is maxima…
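For contrast with PNR's history-based collapsing, the independent-set operation that multilevel coarseners perform at every level can be sketched with the standard greedy construction (illustrative code, not the cited implementations, which use randomized and parallel variants):

```cpp
#include <set>
#include <utility>
#include <vector>

// Greedy maximal independent set: scan the vertices in order, take any
// vertex not yet excluded, and exclude all its neighbors. The result is
// maximal: every vertex is either in the set or adjacent to a member.
std::set<int> maximalIndependentSet(int n,
        const std::vector<std::pair<int,int>>& edges) {
    std::vector<std::vector<int>> adj(n);
    for (auto [u, v] : edges) { adj[u].push_back(v); adj[v].push_back(u); }
    std::set<int> mis;
    std::vector<bool> blocked(n, false);
    for (int v = 0; v < n; ++v) {
        if (blocked[v]) continue;
        mis.insert(v);                            // take v...
        for (int w : adj[v]) blocked[w] = true;   // ...and exclude its neighbors
    }
    return mis;
}
```

It is exactly this per-level neighbor exclusion that forces adjacent processors to synchronize in a parallel setting, which PNR avoids.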
21. …such as Kernighan and Lin [32] on each G_{i-1}, for the purpose of improving the quality of the partition. To implement this algorithm on a parallel computer, note that for each level of the coarsening phase we need to compute either an independent set or a matching of the graph. This implies that for each G_{i-1} we will need to send messages to ensure that two adjacent processors do not select adjacent vertices of the graph at the same time, an operation that can be very time-consuming. As we will show, our algorithm uses a different heuristic for collapsing the vertices of the graph that does not require synchronization at each level.

7.1.2 General repartitioning of an adapted mesh. In Section 5.1 we explained that our meshes are partitioned by elements. That is, given a mesh M = (E, V), where E is a set of elements and V is a set of vertices, we construct a partition Pi = {Pi_1, Pi_2, ..., Pi_P} such that each element E_a in E is assigned to only one partition Pi_i. This is done by creating a graph M' = (E, W) that is the dual of M, where W is a set of pairs of adjacent elements: (E_a, E_b) is in W if and only if E_a and E_b are adjacent. Thus the elements in E are the vertices of the graph M', and the pairs in W are its edges; a partition of the vertices of M' is a partition of the elements of the mesh M. In order to contract the graph while preserving its global structure, we associate weights with each element E_a and each pair of adjacent elements (E_a, E_b). Given the…
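Building the dual graph can be sketched from the element-to-vertex lists alone. The `minShared` parameter is an illustrative device covering both adjacency variants mentioned earlier (common vertex vs. common edge); the quadratic all-pairs scan is for clarity, not efficiency.

```cpp
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Dual M' = (E, W): elements become vertices; two elements are joined when
// they share at least `minShared` mesh vertices (1 for the common-vertex
// definition of W, 2 for the common-edge definition in 2D).
std::set<Edge> dualGraph(const std::vector<std::vector<int>>& elemVerts,
                         int minShared) {
    std::set<Edge> w;
    for (int a = 0; a < (int)elemVerts.size(); ++a)
        for (int b = a + 1; b < (int)elemVerts.size(); ++b) {
            int shared = 0;
            for (int va : elemVerts[a])
                for (int vb : elemVerts[b])
                    if (va == vb) ++shared;
            if (shared >= minShared) w.insert({a, b});
        }
    return w;
}
```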
22. …termination condition is reached when no element is marked for refinement: if R is the set of all the elements selected for refinement, then the algorithm finishes when R = empty. This global termination condition implies a local termination condition for processor P_i that holds when R_i = empty. We assume that the refinement is started in one special processor referred to as the coordinator, P_0. To simplify the explanation, we assume initially that the refinement does not propagate cyclically from processor P_i to processor P_j and then from processor P_j back to processor P_i. We will show later that although this is not a reasonable restriction, removing it does not affect the algorithm significantly. The algorithm for detecting termination uses two basic kinds of messages:
- A refine message, Refine_Msg(i, j), sent from a source processor P_i to a processor P_j, is used to request the refinement of one or more elements of processor P_j. We will specify the contents of this message later, but let us assume for now that it can either indicate the elements selected for refinement (if the message is sent by the coordinator) or include a reference to a shared node. If P_j receives a Refine_Msg(i, j, V_a), it creates the node V_b, initializes Ref(V_b) = {P_i:V_a}, and inserts the unrefined elements E_a in Adj(V_b) into R_j. Note that at this stage V_a has no reference to V_b; to update Ref(V_a) we use the next type of message.
- An update message, AddRef_Msg(j, i), is returned from…
23. …the performance of the network between machines on the 5th floor and machines in the SunLab. (a) shows the performance of a ping application, where a source processor sends a message to a destination processor and then waits for a response; this example is used to measure the latency of the network. Only when the messages are around 8K (1,000 doubles) do we obtain full speed. (b), (c) and (d) show the performance of a broadcast example, where a source processor sends an individual message to every other processor and waits for a response from all of them. …3 and Figures 33(a), (b) and (c). As shown in the figures, when we increase the number of processors we notice lower performance due to contention. The maximum speed for the 4-processor case was around 0.6 MBytes/sec, almost half the performance of the point-to-point program. When we increase the number of processors to 16, the maximum performance drops to 0.2 MBytes/sec, compared with 1 MByte/sec in the point-to-point…
24. (Table 1: Point-to-point data rates between machines located on the 5th floor and machines located in the SunLab, for messages of 5 to 100,000 doubles. Performance is measured in MBytes/sec and the length of the message is in doubles.) The second test corresponds to the broadcast operation. In this case a distinguished processor P_0 sends a different message to all the other processors and waits for a response from all of them. This operation is usually executed when P_0 broadcasts the result of the repartitioning algorithm. The results for 4, 8 and 16 processors are shown in Table 2 and plotted in Figures 32(b), (c) and (d). Again, full speed is obtained when the messages are of 8K; for longer messages we start to notice a degradation of the performance, possibly due to contention. In these tests the maximum performance was also around 1 MByte/sec. The last test is an all-to-all communication program: each processor sends a message to each other processor and waits for a response. This test is an extreme case of our migration algorithm; the effect is that it saturates the network, as shown in Table 3.
[Table 4: Point-to-point, broadcast, and all-to-all communication on the SP2. Performance is measured in MBytes/sec and the length of the message is in doubles; numeric entries omitted.]

[Figure 33: Comparison of the performance of the network between machines on the 5th floor and machines in the SunLab, and the performance of the same applications run on the SP2. Panels: (a) all-to-all, 4 processors; (b) all-to-all, 8 processors; (c) all-to-all, 16 processors; (d) SP2 point-to-point, broadcast, and all-to-all; in each panel the x-axis is the buffer size in doubles. (a), (b), and (c) show the performance of an all-to-all communication example where each processor sends a message to each other processor; these examples are a measure of the contention on the network. (d) is the result of running the same tests on the SP2.]
[Table 2: Broadcast from a single source to all the other processors, for message lengths up to 100000 doubles. Performance is measured in MBytes/sec and the length of the message is in doubles; numeric entries omitted.]

[Figure 32 panels: (a) point-to-point; (b) broadcast, 4 processors; (c) broadcast, 8 processors; (d) broadcast, 16 processors; in each panel the x-axis is the buffer size in doubles and the curves compare the 5th floor and the SunLab.]

Figure 32: Comparison of
BROWN UNIVERSITY
Department of Computer Science
Master's Project CS-96-M7

The Dynamic Adaptation of Parallel Mesh-Based Computation
by José Gabriel Castaños

The Dynamic Adaptation of Parallel Mesh-Based Computation
José Gabriel Castaños
Sc.M. Thesis
Department of Computer Science, Brown University
Providence, Rhode Island 02912
May 1996

The Dynamic Adaptation of Parallel Mesh-Based Computation
by José Gabriel Castaños
Licenciate in Operations Research, Universidad Católica Argentina, 1989

Thesis submitted in partial fulfillment of the requirements for the Degree of Master of Science in the Department of Computer Science at Brown University.
May 1996

This thesis by José G. Castaños is accepted in its present form by the Department of Computer Science as satisfying the thesis requirement for the degree of Master of Science.
John E. Savage
Approved by the Graduate Council: Kathryn T. Spoehr

1 Introduction

Although massively parallel computers can deliver impressive peak performance, their computational power is not sufficient to simulate physical problems with highly localized phenomena by using only brute-force computations. Adaptive computation offers the potential to provide large increases in performance for problems with dissimilar physical scales by focusing the available computing power on the regions where the solution changes rapidly. Since adaptivity significan
P_j to P_i for each refine message sent from P_i to P_j, to indicate the completion of the requested refinement. This

[Figure 11: The parallel refinement algorithm. In (a) the initiator sends a Refine_Msg to each other processor. Processors P_0 and P_3 immediately return an AddRef_Msg to the initiator, but the refinement in processor P_2 propagates to P_1, so P_2 sends a Refine_Msg to P_1 (b). After P_1 returns an AddRef_Msg to P_2 (c), P_2 returns its AddRef_Msg to the initiator (d).]

message also includes the necessary information to update the references to the nodes shared between P_i and P_j. When P_i receives an AddRef_Msg(j, i, v_j^p), it inserts (P_j, v_j^p) into Ref(v_i^p). If the message is a response to the coordinator, it returns AddRef_Msg(j, C).

The coordinator sends at t = 0 a Refine_Msg(C, i) message to one or more processors P_i, indicating that the refinement phase has started. The initiator can explicitly select the elements for refinement, or it can instruct the processors to select the elements based on an adaptation criterion. Processor P_i then executes the serial refinement algorithm on these marked elements, possibly sending Refine_Msg(i, j) messages to neighboring processors P_j when a node is created on an internal boundary between processors P_i and P_j. The local termination condition holds for processor P_i when no more elements are marked for refinement. When this condition holds, processor P_i does not generate new R
P_0 and P_1, and the refinement of E creates another shared node, then P_1 needs to create its local copies V_y and V_z. It then marks the nonconforming elements E_p and E_n for refinement by inserting them in R_1, and invokes the serial refinement algorithm again.

6.1 Refinement collision

The parallel algorithm can run into two synchronization problems [3]. First, if processor P_i refines an element E_a and processor P_j refines an adjacent element E_b, it is possible that each processor could create a different node at the same position. In this case it is important that both processors do not consider them as two distinct nodes when assembling the matrices and vectors to compute the solution of the system, and that the node incorporates the contributions of all the elements around it. Related to this problem, an element E_b that processor P_i believes is nonconforming in processor P_j might have already been refined there. Processor P_j needs to evaluate and update the propagation requests it receives before executing them; in this case P_j should insert a descendant of E_b in R_j. These two problems are illustrated in Figure 9.

In the top row we show a case where the receiving processor has already refined the element but further refinement is required. Initially E_a and E_b are selected for refinement. The refinement of E_a creates the shared node V_p. P_j creates its copy of V_p, but it then has to determine which of the children of E_b should be inserte
algorithm. By using the refinement history we are able to obtain better partitions at a lower cost than the standard methods.

Our PNR algorithm allows for a very natural parallel implementation. It is possible to compute ElemWeight, EdgeWeight, and Weight in parallel using only local information.

[Figure 17: The Parallel Nested Repartitioning algorithm. (a) shows the initial mesh M_0 and (b) shows the mesh M at the beginning of the partitioning phase.]

We then send the weighted graph to a common processor P_p, which partitions the reduced graph. At this point all the other processors wait until P_p sends back a message informing them of which elements to migrate. Finally, the migration is performed by moving fully refined subtrees. The partitioning algorithm will inform a processor P_i of the elements E_a ∈ M_0 to move to another processor P_j. It is assumed that P_i will not only send E_a but also all its descendants to P_j. The intention is not to break partition trees, in order to simplify the unrefinement of the elements.

In the rest of this section we will explain the algorithm in more detail, using the example shown in Figure 17. The pseudocode for the PNR algorithm is shown in Figure 18 and explained below. As we explained earlier, our Parallel Nested Repartitioning algorithm for mesh parti
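The local weight computation described above is a simple recursion over each coarse element's refinement tree. The following is an illustrative Python sketch (the thesis's implementation is in C++; the class and field names here are hypothetical): a coarse element's weight is the number of fine-mesh leaves nested under it, which is exactly the information each processor can compute without communication.

```python
# Sketch: computing ElemWeight for a coarse element of M_0 as the number of
# leaf (fine-mesh) descendants in its refinement tree. Illustrative names.

class Element:
    def __init__(self, children=None):
        self.children = children or []

def elem_weight(e):
    """Leaves of the refinement tree form the fine mesh; a coarse element's
    weight is the amount of fine-mesh work nested under it."""
    if not e.children:                      # e is in the fine mesh M
        return 1
    return sum(elem_weight(c) for c in e.children)

# A coarse element bisected once, with one child bisected again: 3 leaves.
e0 = Element(children=[Element(children=[Element(), Element()]), Element()])
print(elem_weight(e0))
```

Because only subtree sizes are needed, each processor can evaluate this on its own portion of the mesh before shipping the small weighted coarse graph to P_p.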
are not close to 0, we expect that |NodeAdj(V_p)| will be close to a constant k.

A graph G is constructed from the finite element mesh M. Its adjacency matrix H has one row and column for each node V_p ∈ V. The entry h_pq = 1 if the nodes V_p and V_q are adjacent to a common element, and h_pq = 0 otherwise. The adjacency matrix H can be directly constructed from NodeAdj(V). Since V_p ∈ NodeAdj(V_q) if and only if V_q ∈ NodeAdj(V_p), the matrix H is symmetric and G is an undirected graph. In general |NodeAdj(V_p)| ≪ |V|, so we expect that H will be very sparse.

5.1 Partitioning by elements or partitioning by nodes

In an iterative method for solving systems of equations, the cost of the algorithm is dominated by the cost of performing repeated sparse matrix-vector products Ab = c, where A is |V| × |V|. A and H have the same sparsity structure. This implies that a good partition for G is also a good partition for A, because it minimizes the communication required to perform the matrix-vector products. There are two basic strategies for partitioning the graph G:

• node partitioning: there is a partition β = {β_1, β_2, ..., β_P} of the nodes between P processors such that ∪_i β_i = V and β_i ∩ β_j = ∅ for i ≠ j. If V_p ∈ β_i, it is assigned to P_i. Each node is assigned to a single processor. The partition of G is performed by removing some edges, leaving sets of connected nodes. The edges removed express the communication pattern between processors, and the cost of the partition is
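The construction of G and the edge-cut cost of a node partition can be sketched compactly. This is an illustrative Python model (not the thesis's code): elements are tuples of node ids, two nodes are adjacent iff they share an element, and the cost of a partition is the number of edges whose endpoints land on different processors.

```python
# Sketch: node-adjacency graph G of a mesh and the edge-cut cost of a
# node partition. Data layout is illustrative.
from itertools import combinations

def node_graph(elements):
    """elements: list of node-id tuples (e.g. triangles). Returns adjacency sets
    (the sparsity pattern of the symmetric matrix H)."""
    adj = {}
    for elem in elements:
        for p, q in combinations(elem, 2):
            adj.setdefault(p, set()).add(q)
            adj.setdefault(q, set()).add(p)   # H is symmetric
    return adj

def edge_cut(adj, part):
    """Number of graph edges whose endpoints lie in different partitions."""
    return sum(1 for p in adj for q in adj[p]
               if p < q and part[p] != part[q])

# Two triangles sharing the edge (1, 2):
adj = node_graph([(0, 1, 2), (1, 2, 3)])
part = {0: 0, 1: 0, 2: 1, 3: 1}               # nodes 0,1 on P_0; nodes 2,3 on P_1
print(edge_cut(adj, part))
```

The edges counted by `edge_cut` are exactly the matrix entries of H that force a value exchange during the product Ab = c.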
ated with a node without changing the geometry of the mesh, where p is the polynomial order of some element. Through the rest of this paper we concentrate mainly on h-refinement, although some of the techniques for mesh partitioning and migration are independent of the refinement strategy. Since p-refinement also modifies the workload in each processor, the repartitioning and migration algorithms apply to it as well.

2.3.1 Local h-refinement algorithms

Starting from a conforming mesh M formed by a set E of non-overlapping elements E_j ∈ E that discretize a domain Ω of interest, and a set of elements R, R ⊆ E, that are selected for refinement, h-refinement algorithms construct a new conforming mesh M' of embedded elements E' such that:

• if E_j ∈ R, E_j is split into a set of non-overlapping subelements E_j^1, E_j^2, ..., E_j^k that replace E_j;

• if E_j ∉ R, E_j ∈ E'.

The selection of elements for refinement or coarsening in R is made by examining the values of an adaptation criterion that can be related to a discretization error. Usually these refinement methods cause the propagation of the refinement to other mesh elements, so an element E ∉ R might also be refined in order to obtain a conforming mesh. Coarsening algorithms have similar problems.

One common refinement algorithm is the Rivara bisection refinement algorithm for triangular elements, which is used in two-dimensional problems. In its simplest form, i
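One step of longest-edge bisection can be sketched as follows (an illustrative Python fragment, not the thesis's C++ code; the triangle representation is hypothetical): the triangle is split across the midpoint of its longest edge, creating one new node and two children.

```python
# Sketch of one step of Rivara longest-edge bisection for a single triangle.
import math

def longest_edge(tri):
    """tri: tuple of three (x, y) vertices; returns the vertex-index pair of
    the longest edge."""
    edges = [(0, 1), (1, 2), (2, 0)]
    return max(edges, key=lambda e: math.dist(tri[e[0]], tri[e[1]]))

def bisect(tri):
    """Split tri across the midpoint of its longest edge into two children."""
    i, j = longest_edge(tri)
    k = 3 - i - j                       # the vertex opposite the longest edge
    m = tuple((tri[i][c] + tri[j][c]) / 2 for c in (0, 1))   # the new node
    return (tri[i], m, tri[k]), (m, tri[j], tri[k])

t = ((0.0, 0.0), (4.0, 0.0), (0.0, 3.0))   # right triangle; hypotenuse is longest
c1, c2 = bisect(t)
print(c1, c2)
```

If the bisected edge is shared with a neighboring triangle, the midpoint leaves that neighbor nonconforming, which is exactly the propagation discussed in the text.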
ccording to a target partition.

[Figure 40: Migration of the mesh. We migrate each element of the air mesh that is assigned to P_0 to P_1, after none, one, or two refinements; curves: initial mesh (a), after 1 refinement (b), after 2 refinements (c); time in seconds against 4, 8, 16, and 32 processors.]

9.6 Dynamic partitioning of the mesh

Finally we put all the tests together. In this section we present tests that refine the mesh, find a new partition at run time, and migrate the mesh according to the new partition. These tests are performed using two different methods.

For the first set of tests we partition the mesh using the Parallel Nested Repartitioning algorithm (see Section 7.1.3). In parallel, we compute the weight of the mesh M by collapsing elements that are descendants of a common element of the coarse initial mesh. This phase does not require communication. We then send the dual of the coarse mesh to P_0 to start the serial phase of the algorithm. In the serial phase we find a partition of the coarse mesh using a serial algorithm; in this phase we use Chaco's serial Multilevel Bisection algorithm to partition this reduced graph. We then broadcast the partition to the processors to start the migration phase.

For comparison, we run the same tests using the serial Multilevel Bisection algorithm to repartition the mesh. Rather than collapsing the refined elements in pa
conds. "Edge cuts" is the number of edges cut, as reported by Chaco.

[Table 6: Spectral Bisection on the big mesh using Chaco on a 64MB Sun SparcStation. The dual of the mesh has 30269 vertices and 178639 edges. The times are in hours:minutes:seconds. Columns: number of partitions, clock time, user time, system time, edge cuts (8084), average elements per partition (946), and shared nodes; the remaining numeric entries are omitted.]

[Table 7: Serial Multilevel Bisection on the air mesh using Chaco on a 64MB Sun SparcStation. The times are in hours:minutes:seconds; numeric entries omitted.]

[Table 8: Serial Multilevel Bisection on the big mesh using Chaco on a 64MB Sun SparcStation. The times are in hours:minutes:seconds; edge cuts 6701, average elements per partition 946.]

[Figure 36: Partition of the initial mesh using Chaco. Time spent to compute 4, 8, 16, and 32 partitions of (a) the air mesh and (b) the big mesh, comparing the Spectral (a) and Multilevel (b) methods.]
d into R for further refinement. In the bottom row, the receiving processor should only update the reference rather than creating a new copy: both P_0 and P_1 create shared nodes V_p and V_q at the same mesh location as the result of the refinement of E_a and E_b. We need to detect that both nodes are the same and update the references accordingly.

The solution to the synchronization problem is greatly simplified by using the nested elements of our multilevel algorithm. When an element E_b in processor P_j is refined into two or more elements E_{b1} and E_{b2}, the element E_b is not destroyed, as would be the case in other refinement algorithms. Any message arriving at processor P_j from processor P_i with the instruction to make a copy of a shared node V_p that causes the refinement of the element E_b can be compared against the status of the element E_b. If the element E_b was already refined in the local phase, but processor P_i did not know about this, then the element E_b might not need to be refined again. If the node V_p was already created in the local phase of processor P_j, then a reference is added pointing to its copy v_i^p in processor P_i. If the refinement of the element E_b did not cause, or was not caused by, the creation of the shared node V_p (for example, the refinement was done by dividing another edge, as in the top row of Figure 9), then its children E_{b1} and E_{b2} are evaluated, and the one that shares the internal boundary between P_i and P_j i
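The collision check that nested elements make possible can be sketched as a walk down the refinement tree. This is an illustrative Python fragment (names and data layout are hypothetical, not the thesis's C++ classes): when a refine request for an element arrives, the element may already have children, and the request is redirected to the leaf descendant that still touches the shared boundary.

```python
# Sketch of the nested-element collision check: redirect an incoming refine
# request to the unrefined descendant that still contains the shared edge.

class Element:
    def __init__(self, boundary_nodes, children=None):
        self.boundary_nodes = set(boundary_nodes)
        self.children = children or []

def resolve_refine_target(e, shared_edge):
    """Walk down e's refinement tree to the leaf still containing shared_edge."""
    while e.children:
        e = next(c for c in e.children if shared_edge <= c.boundary_nodes)
    return e

# e was already split into two children; only child1 still owns edge {1, 2}.
child1 = Element({1, 2, 9})
child2 = Element({2, 3, 9})
e = Element({1, 2, 3}, children=[child1, child2])
assert resolve_refine_target(e, {1, 2}) is child1
```

Because the parent element survives refinement, the receiving processor can always compare an incoming request against the element's current status instead of refining blindly.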
daptive Nu

[43] G. C. Fox, R. D. Williams, and P. C. Messina. Parallel Computing Works! Morgan Kaufmann Publishers, Inc., 1994.
ds. As we have mentioned earlier, we generally perform repeated matrix-vector products Ab = c when we need to assemble matrices and vectors. All the effort in the following phases has the goal of improving the performance and quality of this phase. At some time t = t_0 we decide that it is convenient to adapt the mesh, so we start a refinement/coarsening phase. Using error estimates, we select elements for refinement, which we insert into R; if we select elements for coarsening, we insert them into C. If the refinement of the elements in R creates a new shared node V_p on an internal boundary between two processors P_i and P_j, we create the two local copies v_i^p and v_j^p and we initialize

    find an initial partition Π
    load the mesh using the partition Π
    initialize the references Ref(v_i^p) using the algorithm in Figure 6
    FOR t < T DO
        compute a solution
        refine/coarsen the mesh; for each new shared node V_p, determine Ref(v_i^p)
        find a new partition Π'
        migrate the elements and nodes according to Π'; if a node v_i^p is moved
            from P_i to P_j, then for each copy v_k^p of V_p in P_k, update
            Ref(v_k^p) = (Ref(v_k^p) - {(P_i, v_i^p)}) ∪ {(P_j, v_j^p)}, and set
            Ref(v_j^p) = Ref(v_i^p)
    END FOR

    Figure 7: Outline of a general algorithm for computing the solution of a dynamic physical system using a parallel mesh.

Ref(v_i^p) = {(P_j, v_j^p)} and Ref(v_j^p) = {(P_i, v_i^p)}. Since adaptation produces imbalances in the distribution of the work, we compute a ne
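The Ref(V) bookkeeping in the loop above can be modeled with a small sketch. This is an illustrative Python model (the thesis stores these references as C++ smart pointers; the dict-of-sets layout here is an assumption): each copy of a shared node records the (processor, local copy) pairs of its other copies, and the sets are patched when a copy moves.

```python
# Sketch of Ref(V) maintenance for shared nodes: mutual references are created
# on the boundary and patched when one copy migrates to another processor.

def create_shared_node(refs, owner_a, owner_b):
    """A new node on the boundary between two processors gets mutual references."""
    refs[owner_a] = {owner_b}
    refs[owner_b] = {owner_a}

def move_copy(refs, old, new):
    """Node copy `old` = (processor, local id) migrates and becomes `new`."""
    refs[new] = refs.pop(old)          # the new copy inherits the old Ref set
    for other in refs[new]:            # every other copy swaps old -> new
        refs[other].discard(old)
        refs[other].add(new)

refs = {}
create_shared_node(refs, ('P0', 'v0'), ('P1', 'v1'))
move_copy(refs, ('P0', 'v0'), ('P2', 'v2'))   # the P0 copy migrates to P2
assert refs[('P1', 'v1')] == {('P2', 'v2')}
assert refs[('P2', 'v2')] == {('P1', 'v1')}
```

The key invariant is symmetry: after every create or move, copy A lists copy B exactly when B lists A, which is what keeps remote references from dangling.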
e V is not a sufficient condition for the destruction of a node. Nodes created as a result of the propagation of refinement need to be adequately coarsened to preserve the conformality of the mesh. Elements at a lower level that reference the node need to be eliminated to prevent dangling references, and this might require the destruction of other nodes. The conditions for a correct coarsening algorithm are:

• to select an element of level l as a candidate for coarsening, all its vertices V_p that are nodes of level l should be selected as candidates for coarsening by evaluating the adaptivity criterion at the node;

• assume that this element E' of level l is the child of some other element E of level l − 1. In order to replace all the children of E by E, either all its children are selected as candidates for coarsening or none of them is;

• a node V is selected for coarsening (no longer as a simple candidate) only if all its adjacent elements E are selected as candidates for coarsening. This condition prevents dangling references from elements to destroyed nodes;

• if an element E of level l has more than one vertex of level l, and not all of them are selected for coarsening, then none of its vertices of level l is selected for derefinement, since an element that has vertices of its level that are not selected for coarsening will not be coarsened, and we need to prevent that its vertices allow the coarsenin
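Two of the coarsening rules above reduce to simple set predicates. The following Python sketch is illustrative only (the data layout and names are assumptions, not the thesis's structures): a node may be coarsened only when every adjacent element is itself a candidate, and an element's children are removed only as a complete set.

```python
# Sketch of the coarsening safety checks: no dangling references, and sibling
# sets are coarsened all-or-nothing.

def node_coarsenable(node, elem_adj, candidate_elems):
    """elem_adj: node -> set of adjacent elements. The node may go away only
    if every element that references it is a coarsening candidate."""
    return elem_adj[node] <= candidate_elems

def siblings_removable(parent, children, candidate_elems):
    """All children of `parent` must be candidates, or none are removed."""
    return set(children[parent]) <= candidate_elems

elem_adj = {'V': {'E1', 'E2', 'E3'}}
children = {'E0': ['E1', 'E2']}
assert not node_coarsenable('V', elem_adj, {'E1', 'E2'})      # E3 still needs V
assert siblings_removable('E0', children, {'E1', 'E2', 'E3'})
```

Both checks are purely local to a processor's portion of the mesh, which is why the coarsening pass itself needs no extra communication beyond the boundary bookkeeping.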
e coarse mesh M_0, and at time t > 0 each element that is not refined belongs to the fine mesh. A decision to perform an n-fold refinement of E ∈ R is transmitted to the refinement module as the pair (E, n). For example, if n = 1, then using Rivara's bisection refinement the element E is divided into 2 triangles. If n > 1, then each of its children is refined n − 1 times.

The multilevel algorithm for refinement has the following properties:

• an element that has no parents has level 0 and belongs to the coarse initial mesh M_0. No coarsening is done above this level;

• an element with no children belongs to the fine mesh M. The numerical simulations are always based on the fine mesh;

• an element could be at the same time in both the coarse mesh M_0 and the fine mesh M (for example, before any refinement is done) or in any intermediate mesh;

• only elements that are in the fine mesh M can be selected for refinement or coarsening. The hierarchy of elements is only modified at its leaves;

• a node V_p such that Level(V_p) = l is a vertex of elements E of level l or above; an element E of level l has vertices of level m, where m ≤ l;

• as the elements are individually selected for refinement or coarsening, the hierarchy can have different depths in different regions of the mesh.

[Figure 4: Multilevel refinement. Initially, elements E_a and E_b in (a) are selected for refinement. Both elements are refined
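The (E, n) semantics and the leaves-only property can be sketched directly. This is an illustrative Python model (the thesis's element classes are in C++; the names here are hypothetical): refining (E, n) splits a leaf and then refines each child n − 1 times, so levels grow only at the leaves and the fine mesh is always the set of leaves.

```python
# Sketch of n-fold multilevel refinement: the hierarchy is modified only at
# its leaves, and the fine mesh M is the set of leaves.

class Element:
    def __init__(self, level=0):
        self.level = level
        self.children = []

def refine(e, n):
    if n == 0:
        return
    if not e.children:                       # only fine-mesh leaves are split
        e.children = [Element(e.level + 1), Element(e.level + 1)]
    for c in e.children:
        refine(c, n - 1)                     # each child refined n-1 more times

def fine_mesh(e):
    """Leaves of the hierarchy form the fine mesh M."""
    if not e.children:
        return [e]
    return [leaf for c in e.children for leaf in fine_mesh(c)]

root = Element()                             # level 0: coarse mesh M_0
refine(root, 2)                              # two-fold refinement
print(len(fine_mesh(root)))
```

Keeping parents alive (rather than replacing them) is what later allows whole refinement subtrees to be collapsed for repartitioning and moved intact during migration.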
efine_Msg messages, and it is not waiting for any AddRef_Msg messages. Also, processor P_i does not insert new elements into R_i until a Refine_Msg(j, i) message arrives from some other processor P_j. In this case, new nodes are created on the internal boundary as instructed in the message, and the corresponding elements are selected for refinement by inserting them into R_i. Then processor P_i executes the serial refinement algorithm, which might cause further propagation to other processors.

An example is shown in Figure 11, where P_C sends an initial Refine_Msg to the other processors. P_0 and P_3 complete their work without propagation, so they return an AddRef_Msg message to P_C. On the other hand, the refinement of elements in P_2 propagates to P_1, so P_2 sends a Refine_Msg(2, 1) message to P_1. P_1 completes this request without further propagation, so it returns an AddRef_Msg to P_2, which in turn returns an AddRef_Msg(2, C) message to P_C.

We say that the parallel refinement terminates (the global termination condition holds) at some time t if:

• R_i = ∅ for each processor P_i at time t;

• there is no Refine_Msg or AddRef_Msg in transit at time t.

The termination detection procedure is based on message acknowledgments. In particular, Refine_Msg(i, j) messages received by processor P_j from processor P_i are acknowledged by P_j by sending to P_i an AddRef_Msg(j, i) message. These messages return the references to the newly created vertices, so that if v_i^p is a vertex in processor P_i over a
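The acknowledgment discipline can be simulated in-process. The sketch below is an illustrative Python model of the simple, acyclic case (it assumes each processor receives at most one propagated request, as in Figure 11; it is not the thesis's message-passing code): a processor acknowledges its own pending request only after every Refine_Msg it forwarded has been acknowledged back.

```python
# Sketch of acknowledgment-based termination: each forwarded Refine_Msg must
# be matched by an AddRef_Msg before a processor can ack its own request.
from collections import deque

def run_refinement(propagate):
    """propagate[i] = processors that refinement on i spreads to (tree-shaped
    here). Returns the order in which processors ack the coordinator's request."""
    pending = {i: len(propagate[i]) for i in propagate}  # unacknowledged forwards
    parents = {j: i for i in propagate for j in propagate[i]}
    queue = deque(i for i in propagate if pending[i] == 0)
    acked = []
    while queue:                    # a processor acks once all its forwards acked
        i = queue.popleft()
        acked.append(i)
        if i in parents:
            pending[parents[i]] -= 1
            if pending[parents[i]] == 0:
                queue.append(parents[i])
    return acked

# P_2's refinement propagates to P_1 (as in Figure 11); P_0 and P_3 ack at once.
print(run_refinement({0: [], 1: [], 2: [1], 3: []}))
```

When the coordinator has collected one acknowledgment per initial Refine_Msg, both termination conditions above hold: every R_i is empty and no message is in transit.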
eport SAND93-2339, Sandia National Laboratories, 1993.

[22] B. Hendrickson and R. Leland. The Chaco User's Guide, Version 2.0. Technical Report SAND94-2692, Sandia National Laboratories, 1995.

[23] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. 1994.

[24] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.

[25] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI: The Complete Reference. MIT Press, 1996.

[26] W. Gropp and E. Lusk. User's Guide for MPICH, a Portable Implementation of MPI. Argonne National Laboratory and Mississippi State University.

[27] R. Butler and E. Lusk. User's Guide to the p4 Parallel Programming System. Technical Report ANL-92/17, Argonne National Laboratory, 1992.

[28] B. Welch. Practical Programming in Tcl and Tk. Prentice Hall, 1995.

[29] J. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley, 1994.

[30] Sun Microsystems. Tools.h++ Class Library: Introduction and Reference Manual. SunPro, 1993.

[31] R. D. Williams. DIME: A User's Manual. Caltech Concurrent Computation Report C3P 861, 1990.

[32] B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291-307, 1970.

[33] K. Devine, J. Flaherty, R. Loy, and S. Wheat. Parallel partitioning strategies for the adaptive solution of conservation laws. Modeling, Mesh Generation, and A
es. Also, it is very easy to find the nodes of an element and the coordinates of a node; these are the two most common operations required to construct the local and global matrices explained in Section 5. Unfortunately, the storage scheme using simple arrays is inflexible for a dynamic environment where nodes and elements are continuously created and deleted as a result of the refinement and coarsening algorithms.

In this section we will give an overview of the object-oriented approach that we have taken to support dynamic mesh adaptation. We will also explain how we implemented the remote references as smart pointers, to avoid memory leaks or dangling references.

Our program has been designed to run on distributed-memory parallel computers. The current version runs in parallel on a network of SUN workstations. For communication we use the MPI [23] message-passing library; in particular, our program uses the MPICH [26] implementation of MPI from the Argonne National Lab. MPI is becoming the standard for message-passing libraries, and there are efficient implementations for many parallel computers. Although MPI has many different ways of sending a message between two computers, we use the standard blocking send and receive. When a processor P_i wants to send a memory buffer to another processor P_j, it calls a C function and blocks until it is safe to reuse the buffer. This does not guarantee that the message is actually delivered. When a proce
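The smart-pointer idea for remote references can be sketched independently of C++. The following Python model is illustrative only (the thesis's actual implementation is a C++ smart-pointer class; the names here are assumptions): a node records the remote copies that point at it, and it may be destroyed only when no remote reference remains.

```python
# Sketch of the remote-reference discipline: destruction is safe only when
# every remote copy has sent its DelRef, so no dangling reference can survive.

class SharedNode:
    def __init__(self, name):
        self.name = name
        self.remote_refs = set()      # (processor, local copy id) pairs

    def add_ref(self, proc, copy_id):
        self.remote_refs.add((proc, copy_id))

    def del_ref(self, proc, copy_id):
        self.remote_refs.discard((proc, copy_id))

    def deletable(self):
        return not self.remote_refs   # safe to free only when unreferenced

v = SharedNode('v0')
v.add_ref('P1', 'v1')
assert not v.deletable()
v.del_ref('P1', 'v1')
assert v.deletable()
```

This is the same invariant a reference-counted smart pointer enforces locally, extended across processor boundaries by the AddRef_Msg/DelRef_Msg traffic described later.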
es to its adjacent processors and listens for propagation messages from them. If it receives such a message, it creates the new shared nodes and inserts the nonconforming elements into R_i. To determine which element to refine, we use ElemAdj(V_p). Suppose that the refinement of an element E_a in P_i creates a shared node V_p on a boundary between P_i and P_j. This new node is created at the midpoint between two other shared nodes V_q and V_r. Note that (P_j, v_j^q) ∈ Ref(v_i^q), and also (P_j, v_j^r) ∈ Ref(v_i^r). We use these references to send a message from P_i to P_j. When P_j receives this message, it determines the unrefined element E_b = ElemAdj(v_j^q) ∩ ElemAdj(v_j^r) and inserts it into R_j.

As can easily be seen, the parallel algorithm is not a perfect parallelization of the serial one, and it can result in a different mesh. The serial Rivara algorithm [1, 2] first selects an element E_a from R. It then continues refining all the nonconforming elements that result from the refinement of E_a before proceeding with another element from R. In our parallel implementation we ignore this serialization: we approximate the serial algorithm within each processor as much as possible, but do not impose it across processor boundaries. We claim that this modification does not affect the quality of the refinement.

6.2 Termination detection

The algorithm for detecting the termination of parallel refinement is based on a general termination algorithm in [4]. A global
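The intersection used to locate the nonconforming neighbor is a one-liner over adjacency sets. This is an illustrative Python sketch (the data layout is hypothetical): the element P_j must refine is the unrefined element adjacent to both endpoint nodes of the bisected edge.

```python
# Sketch of E_b = ElemAdj(v_q) ∩ ElemAdj(v_r): locating the nonconforming
# element across a processor boundary from the two endpoint nodes.

def find_nonconforming(elem_adj, v_q, v_r, refined):
    """elem_adj: node -> set of adjacent elements on this processor;
    refined: elements already split, which cannot be the target."""
    candidates = (elem_adj[v_q] & elem_adj[v_r]) - refined
    assert len(candidates) == 1, "exactly one unrefined element shares the edge"
    return candidates.pop()

# On P_j: nodes v_q and v_r are both adjacent only to E_b.
elem_adj = {'vq': {'Eb', 'Ec'}, 'vr': {'Eb', 'Ed'}}
print(find_nonconforming(elem_adj, 'vq', 'vr', refined=set()))
```

Because only the two endpoint references travel in the message, P_i never needs to know P_j's element numbering.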
esh using the geometry description of the problem domain. The complex topology of the problem is discretized into a set of simpler elements. This is a global process, usually performed on a sequential computer, and it might require human assistance.

• mesh adaptation: the selective refinement/coarsening of sections of the mesh improves the quality of the solutions, either by increasing the resolution in interesting areas or by decreasing it in regions of little interest. The refinement of elements is largely a local process.

The compatibility of the mesh with the problem topology and the correct treatment of the boundaries are not the only requirements for high-quality meshes. In addition, it is desirable to have meshes whose elements are [1]:

• conforming: the intersection of elements is either a common vertex or a common side;

• non-degenerate: the interior angles of the elements are bounded away from zero;

• smooth: the transition between small and big elements is not abrupt.

2.3 Mesh adaptation

The following are the two principal strategies for mesh refinement [12]:

• h-refinement is performed by splitting an element into two or more smaller subelements (refinement) or by combining two or more subelements into one element (coarsening); h is a parameter of the size of the elements. This method involves the modification of the graph structure of the mesh.

• p-refinement can be thought of as increasing the amount of information associ
esis we present contributions in the areas of mesh refinement, mesh repartitioning, and mesh migration. We have developed parallel refinement algorithms for unstructured meshes and used the refinement history to develop a Parallel Nested Repartitioning algorithm superior to the algorithms in [19] when applied to the partition of adapted meshes. Although we explained the theoretical problems of doing the refinement in parallel, our tests showed very little overhead due to communication. We used the Parallel Nested Repartitioning algorithm to compute partitions on the fly. We showed that we can obtain high-quality partitions at a reasonable cost. We also showed that we can use these partitions to rebalance the work by migrating elements and nodes between the processors.

The implementation of the project was highly simplified by using C++. The use of an object-oriented language allowed us to design irregular data structures efficiently. It also reduced the number of bugs, as we encapsulated the most dangerous components, like the remote pointers, into objects.

There are two major possible improvements to this project. The first one is to port it to the IBM SP-2. The second one is to find a real physical problem to drive the computation. Although we have worked only with triangles, we designed the data structures to allow different types of elements, both in 2D and in 3D.

12 Acknowledgements

I would like to thank the great guys of Room 402 for t
essage, P_j creates the copy v_j^p and initializes Ref(v_j^p) = Ref(v_i^p) ∪ {(P_i, v_i^p)}. At this point P_j has a reference to all the copies of V_p. It then sends an AddRef_Msg(j, k, v_k^p, v_j^p) for each reference (P_k, v_k^p) in Ref(v_j^p), and all the other copies update their references to the new copy. Using v_j^p, processor P_j creates the element E_a. P_i then deletes its element E_a. It can happen that E_a was the only element that pointed to V_p in P_i; in this case we also wish to remove v_i^p. P_i sends a DelRef_Msg(i, k, V_p, v_i^p)

[Figure 21: A migration example. (a) shows the initial mesh. The goal is to move E_a from P_0 to P_2. We first copy E_a to P_2 (b), and then we delete the element E_a in P_0 (c).]

to each reference in Ref(v_i^p). Once all the other processors remove their references to v_i^p, we can delete the node. A simple illustration of this algorithm is shown in Figure 21. In this example we show a mesh with two elements partitioned between three processors P_0, P_1, and P_2. Our goal is to move the element E_a from P_0 to P_2. We initially send a Move_Msg(0, 2, E_a) that includes V_0, V_1, and V_2. We initialize Ref(v_2^0) = {(P_1, v_1^0), (P_0, v_0^0)}; we similarly initialize Ref(v_2^1) and also Ref(v_2^2). P_2 then sends an AddRef_Msg to P_0 and P_1 for the copies of V_0, V_1, and V_2, which includes references to v_2^0, v_2^1, and v_2^2. P_0 then deletes its copy of E_a. Since it can also remove its copies of V_0, V_1, and V_2, it sends a DelRef_Msg to the other processors. We will explain the algorithm in
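The Figure 21 sequence can be simulated in one process. The sketch below is an illustrative Python model (the dicts of owners and copy sets are an assumed layout, not the thesis's C++ structures): a Move_Msg ships the element, the AddRef_Msg step registers the new node copies, and the DelRef_Msg step lets the source drop copies that no local element references anymore.

```python
# Sketch of the three-step element migration of Figure 21: Move_Msg,
# AddRef_Msg for each node, DelRef_Msg to prune copies unused at the source.

def migrate(elem_owner, elem_nodes, node_copies, elem, src, dst):
    elem_owner[elem] = dst                       # Move_Msg: element now on dst
    for v in elem_nodes[elem]:
        node_copies[v].add(dst)                  # AddRef_Msg: register new copies
    for v in elem_nodes[elem]:                   # DelRef_Msg: prune unused copies
        still_used = any(elem_owner[e] == src and v in elem_nodes[e]
                         for e in elem_owner)
        if not still_used:
            node_copies[v].discard(src)

# E_a on P0 (to be moved to P2) and E_b on P1; V0 belongs only to E_a,
# while V1 and V2 lie on the shared boundary.
elem_owner = {'Ea': 'P0', 'Eb': 'P1'}
elem_nodes = {'Ea': ['V0', 'V1', 'V2'], 'Eb': ['V1', 'V2', 'V3']}
node_copies = {'V0': {'P0'}, 'V1': {'P0', 'P1'},
               'V2': {'P0', 'P1'}, 'V3': {'P1'}}
migrate(elem_owner, elem_nodes, node_copies, 'Ea', 'P0', 'P2')
print(node_copies['V0'])   # P0's copy of V0 is gone; only P2 holds it now
```

Note that the prune step removes the source's copies of the boundary nodes too once no element on the source references them, matching the DelRef_Msg behavior in the example.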
etect that there are two copies of V_5 in P and two copies of V_3 in P_3. One of the copies is destroyed and the corresponding elements are updated accordingly. The state of the mesh at the end of this phase is shown in Figures 27 (a) and (b).

• We now remove the elements from the source processors. Moreover, if there is some node V such that ElemAdj(V) = ∅, we delete the node to free memory. In the example, the destruction of E_e and E_y in processor P causes the deletion of V, V_g, and V_e. Note that the node V is referenced by the new element E_p, so it is not deleted. Before deleting the nodes, we send a DelRef_Msg message to the processors that have a reference to the node, indicating that they should remove the reference. Finally, we destroy the nodes. The representation of the mesh at the end of the migration is shown in Figure 28; Figure 29 shows this procedure.

    FOR each new node v_i^p DO
        FOR each reference (P_k, v_k^p) ∈ Ref(v_i^p) DO
            insert (v_k^p, v_i^p) into AddRef_Msg(i, k)
        END FOR
    END FOR
    FOR each processor P_j DO
        IF i ≠ j and AddRef_Msg(i, j) ≠ ∅ THEN
            send AddRef_Msg(i, j)
        END IF
    END FOR
    FOR each message AddRef_Msg(j, i) sent from another processor P_j to P_i DO
        receive AddRef_Msg(j, i)
        FOR each reference (v_j^p, v_i^p) ∈ AddRef_Msg(j, i) DO
            insert (P_j, v_j^p) into Ref(v_i^p)
        END FOR
    END FOR

    Figure 26: Migration phase: updating the references to the new nodes.

[Figure 27: A migration example: internal representation of the mes
evaluate the quality and performance of our system, we performed a series of tests on a network of SUN SparcStation 10 workstations, each with 32MB of RAM and running Solaris 2.4. The processor P_0, which was responsible for the window interface and the serial part of the refinement algorithm, is a multiprocessor workstation with 64MB. We tested our program using between 4 and 32 processors. These machines were scattered between the 1st and the 5th floor of the CIT building and were connected by a 10 Mb/s Ethernet network. Most of the machines that we used were located in the SunLab on the 1st floor, while P_0 was located on the 5th floor. The network is divided into several subnetworks that are connected through gateways on the 5th floor. In particular, to send a message from a machine in one row of the SunLab to a machine in a different row that is connected to a different subnetwork, the message has to travel all the way up to the 5th floor and then all the way down to the SunLab.

The tests cover the major components of the system. We found that the cost of the refinement algorithm is dominated by the serial part. By performing a sequence of successive refinements of the whole mesh, we obtained some very big meshes. Our Parallel Nested Repartitioning algorithm computed high-quality partitions in a very reasonable time; by using the refinement history, we were able to obtain better partitions than with other multilevel algorithms. We ran these tests whe
f the propagation does not generate cycles. As shown in Figure 12, it is possible to design a mesh where the refinement propagates back to processor P_0. In that example, P_0 refines an element E_a, creating a shared node V_p. It then sends a Refine_Msg(0, 1) to P_1. P_1 creates its copy of the shared node and proceeds to refine the nonconforming elements, but before P_1 is ready to return an AddRef_Msg(1, 0), a new shared node V_q is created in the boundary between P_1 and P_0. In this case P_1 sends a new Refine_Msg(1, 0). When P_0 performs all the required refinements, it returns an AddRef_Msg(0, 1) to P_1, which in turn returns an AddRef_Msg(1, 0) to P_0, corresponding to the initial message. In this example it is not easy to detect which one is the parent or the child processor. It also shows that the refinement of some meshes can have a cycle. It is possible to extend the idea to a long but finite sequence of Refine_Msg messages through two processors before one of them is ready to return an AddRef_Msg.

Fortunately, we can modify the previous algorithm to deal with these problems. In the active state, a processor P_i can receive not only AddRef_Msg messages from its neighbors, but also new Refine_Msg messages from other processors P_j. These new Refine_Msg messages might cause further propagation. Now there is not just one critical message for processor P_i, but there is still only one critical message for each of the Refine_Msg messages that processor P_i transmits to other processors. We modify the Refi
finement termination holds only when all the processors have refined their elements and there is no propagation message in the network. A processor Pi might have no more local elements to refine, but it needs to wait for possible propagations from neighboring processors. Only when all the processors agree on the termination of the refinement phase can they proceed to the next phase. The adaptation of the mesh produces imbalances in the work assigned to each processor, as elements and nodes are dynamically created and destroyed. Also, mesh partitions are computed at runtime, interleaved with the numerical simulation. In this environment we cannot afford expensive algorithms that recompute the partitions from scratch after each refinement. Instead, we propose repartitioning algorithms that use the information available from previous partitions and the refinement history. Finally, we must keep a consistent mesh while migrating elements and nodes between processors. In our meshes the physical location of nodes and elements is not fixed throughout the computation. Instead, our design supports dynamically changing connectivity information, where the references to remote elements and nodes are updated as nodes or elements are created, deleted, or moved to a new processor to balance the work load. In the following sections we address these problems in detail and present our solutions. First, however, we introduce some definitions, explain a strategy for st
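The dynamically updated remote references described above can be sketched as a small C++ proxy. This is a minimal sketch with assumed names (the actual system's classes are not shown in the text): instead of a raw pointer into another address space, a proxy records which processor currently owns the object and is updated when the object migrates.

```cpp
#include <cassert>

// Hypothetical local proxy for a possibly remote mesh object. The object
// is dereferenced directly only while it lives on the local processor;
// after a migration the proxy is told about the new owner.
template <class T>
class Proxy {
    int here_;            // rank of the local processor
    int owner_;           // rank of the processor currently owning the object
    T* local_ = nullptr;  // valid only when owner_ == here_
public:
    Proxy(int here, int owner, T* local)
        : here_(here), owner_(owner), local_(local) {}
    bool is_local() const { return owner_ == here_; }
    T& get() { assert(is_local()); return *local_; }
    // Called when we learn that the object moved during load balancing.
    void migrated(int new_owner, T* new_local) {
        owner_ = new_owner;
        local_ = (new_owner == here_) ? new_local : nullptr;
    }
};
```

The key design point is that migration only mutates the proxy; code holding a `Proxy` never stores a raw cross-address-space pointer.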
finements of the air mesh in 16 processors. In each phase all the elements are selected for refinement.

                    Number of refinements
Time serial (s)      5.91    22.80    77.66    376.17
Time comm (s)        0.39     0.68     1.18      4.00
Time total (s)       6.30    23.48    78.84    380.17
Elements total     115363   251490   535902   1124612
Elements avg         3605     7859    16747     35114
Elements max         4567    10214    22080     46764
Elements min         3081     6271    13069     26979
Imbalance (%)       26.69    29.97    31.84     33.06
Shared nodes         2138     3307     4633      7093

Table 12: Successive refinements of the air mesh in 32 processors. In each phase all the elements are selected for refinement.

Figure 38: Successive refinements of the mesh. In (a) we show the time spent in the serial part of the algorithm, while in (b) we show the time spent on communication.

Figure 39: Refinement of the mesh. The initial air mesh (a) and after refining all its elements (b).

involves the destruction of all the elements in Pi and the creation of the same elements in Pi+1. The timings for this test are shown in Table 13.
g of adjacent elements.

- finally, an element E of level l is selected for coarsening if: the element E has no children (it belongs to the fine mesh); the element E has a parent (it does not belong to the coarse mesh); its vertices are nodes of level m, where m ≤ l; and all its vertices that are nodes of level l are selected for coarsening and are not simple candidates. This last condition ensures that the resulting mesh is conforming, because a node Vp is selected for coarsening only if there will be no references to it.

4. The challenge of exploiting parallelism

The data structures and algorithms introduced in the previous section allow us to refine and coarsen a mesh on a serial computer. Most of the work that we present in the rest of this paper extends these ideas to a parallel computer. Parallelism introduces a series of problems that we need to solve in order to perform the dynamic adaptation of parallel mesh-based computations. Refinement algorithms typically use local information to perform refinement. Unfortunately, the refinement of an element Ea that creates a new node V on an internal boundary between two processors requires synchronization between the processors. The second problem concerns the termination of the refinement phase. The serial algorithm terminates when no more elements are marked for refinement; this is not always easy to detect in a parallel environment. In this case global re
g processors. This pseudocode is explained below. Initially Pc sends a Refine_Msg(C,i) to all the other processors Pi to start the refinement phase. These messages select the elements in R to be refined in Pi. Each Pi returns an AddRef_Msg(i,C) once it has refined these elements and has also received an AddRef_Msg(k,i) for each Refine_Msg(i,k) produced by the propagation of the refinement to Pk. The algorithm uses a new type of message, a continue message: Continue_Msg(i,j), sent from the initiator to every other processor, informs them that the refinement phase has concluded and that they can continue to the next phase.

seq := 0
WHILE true DO
    wait for a message msg
    IF msg = Continue_Msg(j,i) THEN
        return
    ELSE IF msg = Refine_Msg(j,i) THEN
        set table[seq].critical := msg and table[seq].count := 0
        FOR each element Ea in msg DO
            create the shared nodes and insert Ea in R
        END FOR
        refine the elements in R
        FOR each shared node Vk created on an internal boundary between Pi and Pk DO
            send Refine_Msg(i,k) containing seq and the elements to refine
            increment table[seq].count
        END FOR
        IF table[seq].count = 0 THEN
            return AddRef_Msg(i,j), as msg did not cause refinement in other processors
        END IF
    ELSE IF msg = AddRef_Msg(j,i) THEN
        seq := the sequence number returned by msg
        decrement table[seq].count
        IF table[seq].count = 0 THEN
            send AddRef_Msg(i,j) to Pj, where Pj sent the message stored in table[seq].critical
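The bookkeeping behind this protocol, a table of critical messages with a pending-reply counter each, can be sketched in C++. The class and field names below are our own illustration, not the system's actual types:

```cpp
#include <cassert>
#include <map>

// Sketch of the critical-message table: each incoming Refine_Msg gets a
// sequence number, and `count` records how many Refine_Msg messages we
// forwarded on its behalf. When the count returns to zero we may answer
// the original sender with an AddRef_Msg.
struct CriticalEntry {
    int sender;  // processor that sent the critical Refine_Msg
    int count;   // outstanding forwarded Refine_Msg messages
};

class CriticalTable {
    std::map<int, CriticalEntry> table_;
    int next_seq_ = 0;
public:
    // A critical Refine_Msg arrives from `sender`; register it.
    int on_refine_msg(int sender) {
        int seq = next_seq_++;
        table_[seq] = {sender, 0};
        return seq;
    }
    // We forwarded a Refine_Msg carrying sequence number `seq`.
    void on_forward(int seq) { table_[seq].count++; }
    // An AddRef_Msg for `seq` came back. Returns the original sender once
    // the critical message is satisfied (so the caller can reply), else -1.
    int on_addref_msg(int seq) {
        CriticalEntry& e = table_[seq];
        if (--e.count == 0) {
            int s = e.sender;
            table_.erase(seq);
            return s;
        }
        return -1;
    }
    // Inactive state: no critical messages pending (R = empty is checked
    // separately by the caller).
    bool inactive() const { return table_.empty(); }
};
```

A processor is inactive, and the coordinator may broadcast Continue_Msg, only when every processor's table is empty and no message is in flight.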
h after copying the elements to the destination processors (a) and after removing the elements from the source processors (b).

Figure 28: A migration example: the internal representation of the mesh at the end of the migration phase.

FOR each element Ea such that Ea ∈ Π^t_j, Ea ∈ Π^(t-1)_i, i ≠ j DO
    delete element Ea
    FOR each node Vp ∈ Adj(Ea) DO
        remove Ea from ElemAdj(Vp)
        IF ElemAdj(Vp) = ∅ THEN
            FOR each reference (Pj, Vp^j) ∈ Ref(Vp) DO
                insert (Vp^j, Vp) into DelRef_Msg(i,j)
            END FOR
            delete Vp
        END IF
    END FOR
END FOR
FOR each processor Pj DO
    IF i ≠ j and DelRef_Msg(i,j) ≠ ∅ THEN
        send DelRef_Msg(i,j)
    END IF
END FOR
FOR each message DelRef_Msg(j,i) sent from another processor Pj to Pi DO
    receive DelRef_Msg(j,i)
    FOR each reference (Vp^i, Vp^j) ∈ DelRef_Msg(j,i) DO
        remove (Pj, Vp^j) from Ref(Vp)
    END FOR
END FOR

Figure 29: Migration phase: deleting the elements on the source processors.

8. Project overview

In finite element programs written in FORTRAN, meshes are typically represented by a table of elements and a table of nodes. Each row in the table of nodes corresponds to a node, and in its columns we store the coordinates of the node. The elements are stored in the table of elements, and each element keeps track of its vertices by storing their indices in the table of nodes. This storage scheme allows for compact representations, which is a desirable property, as real problems have thousands of elements and nod
if a vertex Ea has a large edge density, it will not be cut in half by the bisection in the partition of the contracted graph.

7.1.3 The Parallel Nested Repartitioning (PNR) algorithm

The partitioning algorithm that we discuss in this section incorporates the idea that the fine mesh M1 at time t was obtained as a sequence of refinements of a coarse initial mesh M0. The mesh M1 includes all the elements Ea such that Children(Ea) = ∅ at time t. We define |M| as the number of elements in the mesh. We assume that |M0| ≤ |M1|, but in general |M0| << |M1|. Note that it is possible to have an element Ea ∈ M0 ∩ M1 if Ea is not refined. Figure 17 shows an example of an initial mesh M0 and the refined mesh M1 at time t. The amount of work for processor P0 is far larger than the amount of work of the other processors. The goal of the repartitioning algorithm is to rebalance the work so that each processor has approximately the same number of elements. The PNR algorithm uses the information that M1 was obtained as a sequence of refinements. Rather than computing directly a partition of M1, it computes a partition of M0 and then projects this partition onto M1. The notion is that in the coarsening phase we do not need to find a matching or independent set to collapse the children of refined elements; instead, we use the refinement tree to collapse the descendants of each element of the coarse mesh M0. In Section 9.6 we compare the PNR with the serial multilevel
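The two refinement-tree operations that replace matching in the PNR coarsening phase can be sketched directly. This is a minimal sketch with assumed field names: the weight of a coarse element is the number of fine (leaf) elements below it, and projecting a coarse partition assigns every descendant the coarse element's processor.

```cpp
#include <vector>

// Hypothetical refinement-tree node: an element stores the indices of its
// children; a leaf (no children) is an element of the current fine mesh.
struct Element {
    std::vector<int> children;
};

// Weight of a coarse element = number of leaves in its subtree, computed
// by the simple recursive walk the text describes.
int weight(const std::vector<Element>& mesh, int e) {
    if (mesh[e].children.empty()) return 1;  // leaf: one fine element
    int w = 0;
    for (int c : mesh[e].children) w += weight(mesh, c);
    return w;
}

// Projecting the coarse partition to the fine mesh: every descendant of a
// coarse element inherits its processor assignment.
void project(const std::vector<Element>& mesh, int e, int proc,
             std::vector<int>& part) {
    part[e] = proc;
    for (int c : mesh[e].children) project(mesh, c, proc, part);
}
```

Both walks are purely local to a processor's portion of the tree, which is why this phase needs no communication.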
in. When Pi sends the element Ea to Pj, it also includes the reference (Pj, Vp^j) instead of the node Vp; then Pj can use Vp^j to create Ea. This condition has an important implication: processor Pi cannot delete its copy of Vp^i until it has received all the elements, even if processor Pi has already sent the only element Ea that points to Vp^i to another processor Pk, because some other processor might expect Pi to still have a copy of Vp.

- if processor Pi sends more than one element, Ea and Eb, to Pj, and there is a node Vp ∈ Adj(Ea) ∩ Adj(Eb), so Vp is a vertex of both Ea and Eb, then only one copy Vp^j should be created in Pj, and both elements Ea and Eb should refer to it in the destination processor;

- now, if two processors Pi and Pk send the elements Ea and Eb respectively to processor Pj, where Ea and Eb are adjacent elements and there is a shared node Vp ∈ Adj(Ea) ∩ Adj(Eb), so (Pk, Vp^k) ∈ Ref(Vp^i) and (Pi, Vp^i) ∈ Ref(Vp^k), then Pj should detect that Vp^i and Vp^k are two copies of the same node. In this case Pj should create only one copy Vp^j for both elements Ea and Eb;

- finally, if processor Pi sends an element Ea to another processor Pj, and Pk sends an element Eb to Pl, and Ea and Eb are adjacent elements in two different processors that share a common node Vp, then we should ensure that (Pl, Vp^l) ∈ Ref(Vp^j) and (Pj, Vp^j) ∈ Ref(Vp^l), so Vp^j and Vp^l refer to each other.

In the migration phase we use three different kinds of messages
- in the second phase, P0 computes the dual of the graph and spawns a process to run Chaco. In the case of the PNR algorithm, Chaco computes a partition of the reduced mesh M0'; in the case of the serial algorithm, Chaco computes a partition using all the elements of the fine mesh;

- in the third phase, P0 sends the new partition to all the processors;

- finally, we migrate the elements according to the new partition.

First note that if the mesh is not refined, both the serial and PNR algorithms are essentially the same and should produce similar results. Only when the mesh is refined do we note a difference between the two approaches. It is not surprising that the partitioning is significantly faster if we send only the coarse mesh to P0, as in the PNR algorithm, rather than the fine mesh, as in the serial algorithm. In the PNR algorithm the cost of computing the weights of the mesh M0' and sending it to P0 does not increase much as we perform successive refinements of the mesh: in this case |M0'| is a constant, so the same number of elements is always sent to P0. This is obviously not true in the serial algorithm. As P0 needs to partition a much smaller graph during the serial phase of the PNR algorithm, the cost of performing this partition is much smaller than if it were computed using the serial algorithm. The time to receive the partition from P0 increases faster in the serial algorithm than in the PNR algorithm, as P0 needs to return longer messages
ion criteria we select some elements R ⊆ E for refinement and others C ⊆ E for coarsening. For each refined element E we define Children(E) = {E1, ..., Ek}, the elements into which E is refined, and we let Parent(Ei) = E. Also, for each element E we define Level(E) = 0 if E is in M0, and Level(E) = Level(Parent(E)) + 1 otherwise. The children of an element E of level l can be further refined, and they become the parents of elements of level l+1.

Figure 1: Bisection refinement. In (a) only one element is selected for refinement; (b) shows the mesh after the refinement of that element. A nonconforming (white) node is created, so we propagate the refinement to an adjacent element. (c), (d), and (e) show different stages of the refinement, and (f) shows the final mesh. Although only one element was initially in R, we refined 3 more elements to obtain a conforming mesh.

FOR each E ∈ R DO
    bisect E by the midpoint of its longest side, generating a new node V and two triangles E1 and E2
    WHILE V is a nonconforming node on the side of some triangle E' DO
        make E' conforming by bisecting E' by its longest side, generating a node V' and two triangles E1' and E2'
        IF V' ≠ V THEN
            V' is a nonconforming node at the midpoint of one of the sides of either E1' or E2'; assume that V is on one side of E1'. Bisect over the side that contains V, obtaining two triangles. Now V is a vertex of both triangles.
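The Level definition above follows directly from the parent links of the refinement tree. A minimal sketch, assuming each element stores the index of its parent (with -1 marking elements of the initial mesh M0):

```cpp
#include <vector>

// Hypothetical element record: only the parent link matters here.
struct Elem {
    int parent = -1;  // -1 marks an element of the initial mesh M0
};

// Level(E) = 0 for elements of M0, else Level(Parent(E)) + 1; equivalently,
// the number of parent links followed until reaching M0.
int level(const std::vector<Elem>& mesh, int e) {
    int l = 0;
    for (int p = mesh[e].parent; p != -1; p = mesh[p].parent) ++l;
    return l;
}
```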
ir neighbors and vertices. The data structures in this case are more complex than in structured meshes, but it is easier to represent complex geometries. Each type of mesh has its advantages and disadvantages. Structured meshes require simpler codes with less overhead but are more limited in the representation of complex domains. Unstructured meshes are more complex and require more storage and overhead per element, but can easily represent complex geometries and moving bodies. Some techniques implement meshes as a combination of both approaches; in such cases the mesh is usually decomposed into a set of unstructured super-elements, where each super-element is decomposed into a structured grid. The choice of the mesh type determines the data structures and algorithms available for refinement, partitioning, and rebalancing. For example, a partitioning method adequate for unstructured meshes, such as Recursive Spectral Bisection [15], is useless for structured meshes; a refinement algorithm will perform well on some types of meshes but is not recommended for others; and the migration algorithm described in Section 7.2 depends highly on how the mesh is actually stored. In the rest of this paper we assume that the domain is discretized using unstructured meshes.

2.2 Mesh generation

The generation of meshes for unsteady problems is usually done in two distinct phases [10]:

- initial mesh creation involves the creation of a compatible unstructured m
itioning (PNR) algorithm, which is fast and gives high-quality partitions. In Section 7.2 we explain a mesh migration algorithm. This algorithm receives as input the partition obtained from the repartitioning of the mesh and migrates the elements and nodes according to this partition.

7.1 The mesh repartitioning problem

While the PNR repartitioning algorithm is based on the serial multilevel algorithms presented in [15], [20] and [18], it also makes use of the refinement history to achieve great reductions in execution time and an improvement in the quality of the partitions produced.

Figure 15: Propagation of the refinement. In (a) we show an arbitrary mesh distributed over 4 processors. The refinement of an element (b) causes the refinement to come back to the processor (c). If we repeat this operation we obtain the mesh in (d).

General multilevel algorithms partition the mesh by constructing a tree of vertices. They create a sequence of smaller graphs by collapsing vertices, then partition a suitably small graph, and finally reverse the collapsing process to produce a partition of the larger graphs. The Parallel Nested Repartitioning algorithm can be divided into a serial phase and a parallel phase. When the graph is small enough to fit into one processor, we use a serial refinement algorithm to partition the graph. When the mesh is very large, as in the case of a highly
l if the addition of any vertex would make it no longer an independent set. A matching of G is a set F' ⊆ F of edges such that no two of them are incident on the same vertex. A matching F' is maximal if the addition of any edge would make it no longer a matching. A contraction

compute the weighted graph G0 = (M', E, W)
WHILE |Gi| > K DO
    compute a coarser graph Gi+1 by collapsing the vertices of Gi
END WHILE
partition the coarsest graph Gk
FOR each level i from k downto 0 DO
    project the partition of Gi to Gi-1
    improve the partition of Gi-1 using local heuristics
END FOR

Figure 16: Serial Multilevel Partitioning algorithm.

operation is repeated until Gi is smaller than a defined constant K.

- in the partitioning phase, standard multilevel methods find a partition Π of the graph Gk using any one of a number of different graph partitioning algorithms, such as Recursive Spectral Bisection [15]. Note that typically |Gk| << |G0|, so their use of RSB is generally not very expensive;

- in the uncoarsening phase, these methods project the partition found for Gi to the graph Gi-1 by reversing the collapsing process. Assume that two or more vertices v and w in the graph Gi-1 were collapsed to form a vertex u in Gi in the coarsening phase. If u is assigned to processor Pr, then both v and w are initially assigned to Pr. After projecting the partition to Gi-1, they typically perform local heuristics
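To illustrate the coarsening step, a maximal matching can be computed greedily in one pass over the edge list: accept an edge whenever neither endpoint is matched yet. This is a sketch only; production multilevel partitioners typically use heavy-edge matching on the weighted graph rather than this unweighted scan.

```cpp
#include <utility>
#include <vector>

// Greedy maximal matching: no two accepted edges share a vertex, and no
// remaining edge can be added (both properties follow from the scan).
std::vector<std::pair<int, int>>
maximal_matching(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<bool> matched(n, false);
    std::vector<std::pair<int, int>> m;
    for (auto [u, v] : edges) {
        if (!matched[u] && !matched[v]) {
            matched[u] = matched[v] = true;
            m.push_back({u, v});  // collapse u and v into one coarse vertex
        }
    }
    return m;
}
```

Note that a *maximal* matching (no extendable edge) is much cheaper than a *maximum* one, and maximality is all the coarsening phase needs.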
measured by the number of edges removed.

- element partitioning: in this case there is a partition Π = {Π1, Π2, ..., ΠP} of the elements between P processors such that ∪i Πi = E and Πi ∩ Πj = ∅ for all i ≠ j. If Ea ∈ Πi, it is assigned to Pi. Each element is assigned to a single processor. The partition is performed across the edges that separate two elements. If Vp ∈ Adj(Ea) and also Vp ∈ Adj(Eb), where Ea ∈ Πi, Eb ∈ Πj, and i ≠ j, then Vp is a shared node. Both Pi and Pj have their own copy of Vp, which we denote Vp^i and Vp^j respectively. Communication is required between the multiple copies of the same node, so the cost of the partition is measured by the number of shared nodes. We define Nodes(Πi) = {Vp : Ea ∈ Πi and Vp ∈ Adj(Ea)}; hence Nodes(Πi) is the set of nodes corresponding to the elements in Πi. Note that Nodes(Πi) ∩ Nodes(Πj) ≠ ∅ if the two partitions Πi and Πj are adjacent.

To find a partition of the mesh using element partitioning, we first compute the dual M' = (E, W) of the mesh M, where W = {(Ea, Eb) : Ea, Eb ∈ E, Ea ≠ Eb, Adj(Ea) ∩ Adj(Eb) ≠ ∅}. W is a set of pairs of adjacent elements, so each pair has at least one node in common. We then use a graph partitioning algorithm to assign elements to processors. It is shown in [13] that partitioning by elements has several advantages over partitioning by nodes, due to the way the matrix A is computed in the finite element method. The matrix A is the result of an a
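The dual graph W defined above can be built by first computing ElemAdj and then joining every pair of elements that meet at a node. A minimal sketch, with elements given as lists of node ids (the representation is an assumption for illustration):

```cpp
#include <algorithm>
#include <set>
#include <utility>
#include <vector>

// Build the dual-graph edge set W: (Ea, Eb) is a dual edge when the two
// elements share at least one node. Going through ElemAdj avoids the
// quadratic all-pairs comparison of elements.
std::set<std::pair<int, int>>
dual_edges(const std::vector<std::vector<int>>& elems) {
    // node -> elements adjacent to it (ElemAdj)
    std::vector<std::vector<int>> elem_adj;
    for (int e = 0; e < (int)elems.size(); ++e)
        for (int v : elems[e]) {
            if (v >= (int)elem_adj.size()) elem_adj.resize(v + 1);
            elem_adj[v].push_back(e);
        }
    std::set<std::pair<int, int>> w;
    for (const auto& lst : elem_adj)
        for (size_t i = 0; i < lst.size(); ++i)
            for (size_t j = i + 1; j < lst.size(); ++j)
                w.insert({std::min(lst[i], lst[j]),
                          std::max(lst[i], lst[j])});
    return w;
}
```

The second definition of W (elements sharing a common *edge*) would instead count pairs that appear together under at least two nodes.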
mesh. Average elements is the number of elements in each processor, while shared nodes is the number of nodes on the internal boundaries between processors; a lower number of shared nodes represents a better partition, as it requires less communication. In these examples Chaco's Multilevel Bisection outperformed Chaco's Spectral Bisection, both in time and in the quality of the partitions. The low performance of RSB in computing the partitions for the big mesh was due to the fact that it required a considerable amount of memory, more than the 64MB available in the computer. Although all the partitions required more than 4:30 hours of clock time, only 1 hour was spent doing useful work. In all cases the serial Multilevel Bisection algorithm produced better partitions in less than a minute.

9.4 Refinement of the mesh

To test the refinement algorithm we performed successive refinements of the mesh. In each of these phases all the elements of the mesh are selected for refinement. The number of elements grows exponentially with the level of refinement. By doing a series of successive

Number of partitions                    4      8      16     32
Clock Time / User Time / System Time    3 7 5 4 8 0 11 1
Edge cuts                               928    1702   2747   4417
Avg elements                            2250   1125   563    281
Shared nodes

Table 5: Spectral Bisection on the air mesh using Chaco on a 64MB Sun SparcStation. The dual of the mesh has 9000 vertices and 52507 edges. The times are in hours:minutes:se
more detail using the example in Figure 22. There we show a mesh composed of 8 elements, Ea-Eh, and 9 nodes, V0-V8, partitioned between 4 processors, P0-P3. In the top figure we show the initial partition Π^(t-1), and in the bottom figure we show the target partition Π^t. The initial representation of the mesh is shown in Figure 23(a). Our goal is to move the elements from the initial partition to the destination partition. This can be done by executing the following commands:

- P0 moves Ea to P3 by sending the message Move_Msg(0,3,{Ea}), and moves Ed to P2 by sending the message Move_Msg(0,2,{Ed});

- P2 moves Ee to P3 and Ef to P0 by sending the messages Move_Msg(2,3,{Ee}) and Move_Msg(2,0,{Ef});

- P3 moves Eg to P1 and Eh to P2 by sending the messages Move_Msg(3,1,{Eg}) and Move_Msg(3,2,{Eh}).

First we send the elements to the destination processor. If there is an element Ea located in Pi and Ea ∈ Π^t_j, then we send a Move_Msg(i,j,{Ea}) message from Pi to Pj. If an element Ea refers to a node Vp^i of which Pj has no local copy, then Pi must also include the node in the message. Determining whether Pj has a local copy of Vp is easy: we only need to check the references to remote copies of Vp, that is, whether (Pj, Vp^j) ∈ Ref(Vp^i) in the sending processor Pi. If we find that Pj has a local copy Vp^j, then we use that copy to create the element Ea in Pj. When we send a node we also include all the references to other copies; this way the recei
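The local-copy check described above, include a vertex in the Move_Msg only when the destination processor holds no copy of it, can be sketched as follows (the `Node`/`ref` representation is an assumption for illustration):

```cpp
#include <map>
#include <set>
#include <vector>

// A node's Ref set, reduced to just the processor ranks that are known to
// hold a remote copy of it.
struct Node {
    std::set<int> ref;
};

// For an element being migrated to `dest`, return the vertices whose data
// must travel with the Move_Msg: exactly those for which dest appears
// nowhere in Ref, i.e. dest has no local copy to reuse.
std::vector<int> nodes_to_send(const std::map<int, Node>& nodes,
                               const std::vector<int>& verts, int dest) {
    std::vector<int> out;
    for (int v : verts)
        if (!nodes.at(v).ref.count(dest))  // dest has no copy: include it
            out.push_back(v);
    return out;
}
```

For vertices that are skipped, the message instead carries the reference (Pj, Vp^j) so the destination can reuse its existing copy.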
n, Total time, Number of shared nodes.

Table 17: Repartition of the air mesh in 32 processors after none, one, and two refinements, using the PNR algorithm and the serial Multilevel Bisection algorithm. Times are in seconds.

Figure 43: Partitioning of the mesh after none, one, and two refinements using the PNR algorithm and the serial Multilevel Bisection algorithm. (a) shows the time spent on sending the mesh to one processor; (b) shows the time spent by P0 to partition the graph.

Figure 44: Partitioning of the mesh after none, one, and two refinements using the PNR algorithm and the serial Multilevel Bisection algorithm. (a) is the time spent on communicating back the results of the partition from P0 to the processors, and (b) shows the time spent on the
n most of the machines were idle, but there is no guarantee that the timings were not influenced by other users.

9.1 Network performance

The first test does not evaluate our system directly. Instead, our goal is to determine the performance of the network and compare it with a real parallel computer. We tried three sets of programs on machines located on the 5th floor and machines located on the 1st floor. The intuition was that, as the distance between machines in the SunLab is longer than the distance between the machines on the 5th floor, their messages should take more time. The first test is a point-to-point communication program where a processor Pi sends a message to some other processor Pj and waits for a response. We tried this test for several messages whose length ranged from 1 to 100,000 doubles. Table 1 shows the results of this test, measured in MBytes per second; these results are plotted in Figure 32(a). Sending a message of only 1 double takes 0.0015 sec (this is the latency of the message), while the cost of sending an additional byte for long messages is 0.0015 msec, and the maximum performance was around 1 MByte/sec, consistent with the performance obtainable using a 10 Mbit/sec Ethernet network. We did not experience much difference between machines in the lab and machines connected to the same subnetwork. In both cases, only when the messages are around 8K do we obtain full speed.
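These measurements suggest the usual linear cost model for a message, time = latency + size/bandwidth. A small sketch with the values above plugged in as assumptions (roughly 1.5 ms latency and 1 MByte/sec asymptotic bandwidth):

```cpp
#include <cmath>

// Linear message-cost model: predicted time in seconds to move n bytes.
// The defaults are the figures measured above, treated here as assumed
// constants, not exact calibrated values.
double msg_time(double n_bytes,
                double latency = 1.5e-3,      // seconds per message
                double bandwidth = 1.0e6) {   // bytes per second
    return latency + n_bytes / bandwidth;
}

// Effective bandwidth of a single message of n bytes. It approaches the
// peak only once n is large compared with latency * bandwidth (~1.5 KB
// here), which is consistent with full speed appearing around 8K messages.
double effective_bw(double n_bytes) { return n_bytes / msg_time(n_bytes); }
```

This model also explains why batching many small updates into one long message (as the migration messages do) matters so much on this network.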
nard et al. [16]. It produces high-quality partitions at a low cost, a very important requirement for recomputing partitions at runtime. It has a very natural parallel implementation that allows us to partition meshes of arbitrary size: the collapsing of the vertices is performed locally, using the refinement history and avoiding the communication overhead of other partitioning methods [19]. Compared to iterative local migration techniques [42], this method does not require several iterations to rebalance the work. Finally, we design a mesh data structure where the elements and nodes are not assigned to a fixed processor throughout the computation but can easily migrate from one processor to another in order to rebalance the work. The mesh is represented as a set of C++ objects. To avoid the problem of having dangling pointers between different address spaces, the references to remote objects are handled through local proxies. These proxies keep track of the migration of objects as a result of load balancing. To evaluate these ideas we designed and implemented a system in C++. This program runs on a network of workstations (NOW) and uses MPI [23] to communicate between processors. The most salient characteristic of adaptive codes is the high sophistication of their data structures; the use of object-oriented techniques allowed us to reduce the complexity of the implementation without significantly affecting the performance.

2. Mesh-based computation
ne_Msg message to include a unique sequence number for each processor. We also modify the AddRef_Msg message to return the number of the Refine_Msg that caused the refinement. All the critical messages are added to a table of critical messages. When processor Pi sends back an AddRef_Msg message, it needs to identify, using the sequence number, which critical message in processor Pj caused the propagation to Pi. When a critical message in a processor Pi has received an AddRef_Msg message for all the propagations it caused, processor Pi removes the critical message from the table and sends back an AddRef_Msg message to the processor that originally sent that critical message to it. The processor Pi is in the inactive state if R = ∅ and it has no critical messages in its table.

Figure 12: A propagation cycle. P0 initially has one element marked for refinement (a). The refinement propagates to P1 (b) and then comes back to P0 (c); (d) shows the final mesh.

FOR each processor Pi DO
    send Refine_Msg(C,i) indicating the elements to refine
END FOR
FOR each processor Pi DO
    wait for an AddRef_Msg(i,C)
END FOR
FOR each processor Pi DO
    send Continue_Msg(C,i) to finish the refinement phase
END FOR

Figure 13: Detecting the termination of the refinement phase (coordinator algorithm).

The pseudocode for the algorithm executed by the coordinator is shown in Figure 13, while Figure 14 has an outline of the program executed by all the remainin
o expensive to apply to the refined mesh. The result of this partition is shown in Figure 20(a).

- finally, we resume the parallel phase. Pp sends a message to each processor Pi informing it of which elements to migrate. Pi executes the migration algorithm described in the following section to distribute the mesh as shown in Figure 20(b).

Our method does not require that the complete fine mesh be in one processor in order to compute the partition. It is sufficient that the coarse initial mesh is small enough to fit into one processor; the refined mesh can be of arbitrary size.

Figure 19: The Parallel Nested Repartitioning algorithm. (a) shows the graph G' where there is an edge in G' between each two elements of M0 that share a node; (b) shows the graph G'' where there is an edge in G'' between each two elements of M0 that share a common edge.

Figure 20: The Parallel Nested Repartitioning algorithm. (a) shows the partition Π of M1 computed using the PNR algorithm. Finally, we use the migration algorithm to migrate elements and nodes to their destination processors; (b) shows the resulting mesh.

7.1.4 Partitioning of an initial mesh

We have yet to explain how to compute the partition of M0' in Pp. In theory we can use any serial graph partitioning algorithm without affecting the structure of the PNR algorithm, but in practice we use one of two approaches. In one case Pp spawns a new process that calls Chaco [21]. Thi
ocessors. The algorithm spends most of its time in the serial part, and the communication cost is very small. This is not surprising because of the way we are selecting the elements for refinement: by selecting all the elements, the refinement is unlikely to propagate to many processors. In these tests the longest propagation was to a couple of processors. Also note that when we increase the number of processors there is a higher imbalance in the element distribution, which reaches 33 percent after 6 successive refinements. As shown in the figures, we are able to obtain superlinear speedups. This is due to the fact that when we use a small number of processors we require a lot of memory; this is particularly true when the number of elements assigned to a processor is more than 50,000. In some of these tests we measured 90MB of virtual memory on machines that have around 15MB of physical memory available. This speedup tends to become linear as the number of processors increases and memory is less of an issue. Figure 38 plots these results, and Figure 39 shows the air mesh before and after the refinement of all its elements.

9.5 Migration of the mesh

The migration tests are performed by migrating all the elements in processor Pi to processor Pi+1. This is probably one of the most demanding migrations that we can perform. It

Number of refinements: Time serial, Time comm, Time total, Elements t
ointers to a remote address space. Since this is not allowed in almost any programming language, we designed the remote references as C++ objects using the notion of smart pointers. We will come back to this when we discuss the implementation details. If Vp is a node internal to the processor, then Ref(Vp) = ∅. A node on an internal boundary can be shared by more than two processors; hence, if Vp is a shared node, then 1 ≤ |Ref(Vp)| ≤ P - 1, where P is the number of processors. In a conforming mesh we expect that |Ref(Vp)| < P - 1, and usually |Ref(Vp)| = 1 for a shared node, since most shared nodes are shared by only two processors in a 2-D mesh. The example in Figure 5 shows a mesh with 8 elements and 9 nodes. The node V4 is shared by four processors P0, P1, P2 and P3, so Ref(V4^0) = {(P1, V4^1), (P2, V4^2), (P3, V4^3)}, while Ref(V4^1) = {(P0, V4^0), (P2, V4^2), (P3, V4^3)}. Figure 6 shows how the references are initialized. There is no need to have more than one copy per node in each processor. Suppose that a processor Pi has two copies of the same node, Vp^i and Vq^i, so that (Pi, Vq^i) ∈ Ref(Vp^i). We can detect this condition because the reference points to a node in the same processor Pi. We then remove the copy Vq^i, after updating all the references in other processors that point to Vq^i to point to Vp^i. For a similar reason we do not need, or allow, duplicate references in Ref(Vp). When a node Vp is created on an internal boundary between two processors Pi and Pj
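The Ref(Vp) invariants discussed above can be expressed directly in code. A minimal sketch where a reference is a (processor, node id) pair; the names are ours, not the system's:

```cpp
#include <set>
#include <utility>

// Each entry in Ref(V) is a (processor rank, node id on that processor)
// pair, standing in for the smart-pointer reference of the real system.
using Ref = std::pair<int, long>;

// Duplicate-copy check: a reference in Ref(V) that names our own
// processor means we hold two copies of the same node, and one of them
// must be merged away after redirecting the other processors' references.
bool has_duplicate(const std::set<Ref>& refs, int self) {
    for (const Ref& r : refs)
        if (r.first == self) return true;
    return false;
}

// Cardinality bounds from the text: an interior node has an empty Ref
// set; a shared node has between 1 and P-1 references.
bool ref_count_ok(const std::set<Ref>& refs, int nprocs, bool shared) {
    if (!shared) return refs.empty();
    return !refs.empty() && (int)refs.size() <= nprocs - 1;
}
```

Using a set (rather than a list) enforces "no duplicate references in Ref(Vp)" by construction.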
oring meshes in a distributed-memory parallel computer, which we call a parallel mesh, and show how to use the mesh to solve dynamic problems.

5. Mesh representation in a parallel computer

In Section 3 we presented a data structure to represent a refined mesh on a serial computer, and we introduced serial refinement and coarsening algorithms. In this section we extend this data structure to store adapted meshes on parallel computers. Let M = (E, V) be a finite element mesh, where E is a set of elements and V is a set of nodes. We define Adj(Ea) = {Vp : Vp is a vertex of Ea}. In a similar way we define ElemAdj(Vp) = {Ea : Vp is a node of Ea} and NodeAdj(Vp) = {Vq : Vp and Vq are both nodes of a common element Ea}. Adj(Ea) of an element Ea is the set formed by the vertices of Ea: in the case of triangular elements |Adj(Ea)| = 3, and in the case of quadrilateral elements |Adj(Ea)| = 4. ElemAdj(Vp) of a node Vp is the set formed by the elements adjacent to Vp, and NodeAdj(Vp) is the set formed by the nodes adjacent to Vp. Two nodes are considered adjacent not only when there is an edge between them in the mesh M, but also if they are adjacent to a common element: in the case of quadrilateral elements, two nodes at opposite corners are node-adjacent. In an unstructured mesh |NodeAdj(Vp)| is not a constant. Although in theory we can construct meshes where |NodeAdj(Vp)| can have arbitrary values, if the mesh is non-degenerate the interior angles
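The adjacency sets defined above can all be derived from the element-to-vertex lists alone. A minimal sketch for a triangular mesh (the concrete representation is an assumption for illustration):

```cpp
#include <set>
#include <vector>

// Bare-bones mesh: adj[e] is the vertex list of element e, i.e. Adj(Ea).
struct Mesh {
    std::vector<std::vector<int>> adj;
};

// ElemAdj: for each node, the set of elements it is a vertex of.
std::vector<std::set<int>> elem_adj(const Mesh& m, int nnodes) {
    std::vector<std::set<int>> ea(nnodes);
    for (int e = 0; e < (int)m.adj.size(); ++e)
        for (int v : m.adj[e]) ea[v].insert(e);
    return ea;
}

// NodeAdj: nodes sharing a common element with v. For quadrilaterals this
// correctly counts opposite corners as adjacent, since adjacency is
// defined through common elements, not mesh edges.
std::set<int> node_adj(const Mesh& m,
                       const std::vector<std::set<int>>& ea, int v) {
    std::set<int> na;
    for (int e : ea[v])
        for (int w : m.adj[e])
            if (w != v) na.insert(w);
    return na;
}
```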
otal, Elements avg, Elements max, Elements min, Imbalance, Shared nodes.

                    Number of refinements
Time serial (s)    162.63   927.25
Time comm (s)        0.40     9.80
Time total (s)     163.03   937.05
Elements total     115347   251458
Elements avg        28837    62865
Elements max        29617    64743
Elements min        28461    61949
Imbalance (%)        2.70     2.99
Shared nodes          440      676

Table 9: Successive refinements of the air mesh in 4 processors. In each phase all the elements are selected for refinement.

                    Number of refinements
Time serial (s)     50.89   223.40   1396.67
Time comm (s)        0.30     0.67     10.92
Time total (s)      51.19   224.07   1407.59
Elements total     115363   251490   530896
Elements avg        14420    31436    66987
Elements max        15900    35005    75020
Elements min        13708    29708    63009
Imbalance (%)       10.26    11.35    11.99
Shared nodes          815     1241     1773

Table 10: Successive refinements of the air mesh in 8 processors. In each phase all the elements are selected for refinement.

                    Number of refinements
Time serial (s)     15.64    81.00   347.69   2535.00
Time comm (s)        0.31     0.58     1.69     86.84
Time total (s)      15.95    81.58   349.38   2620.84
Elements total     115347   251458   535840   1124496
Elements avg         7209    15716    33490     70281
Elements max         9173    20514    44357     93940
Elements min         6316    13481    28298     58739
Imbalance (%)       27.24    30.53    32.45     33.66
Shared nodes         1292     2044     2795      4357

Table 11: Successive re
nothing about each other at this moment. There is another problem: a processor receives Eb from Pi and En from Pj, and both messages include the vertex V5; a similar problem happens in another processor with V3. We will explain later how to handle these conditions. Figure 23(b) shows the mesh at this stage, and Figure 25 presents an outline of this phase. In the next phase we update the references to the new nodes. Assume that V'p is a node created in Pi in the previous phase as the result of a Move_Msg(j, i). Pi needs to inform Pj, and all the other processors Pk that have a copy of Vp, about the location of V'p in memory, so they can create a reference to it. Using Ref(Vp), Pi sends an AddRef_Msg(i, k) for each reference (Pk, V''p) in Ref(Vp). This procedure is shown in Figure 26. In the previous example, P0 sends a message to update the references to V4, and so does P1. One processor detects that there is more than one new copy of V7, so it informs its peers to update the reference from V7 to V'7; the same thing happens in P3.

FOR each message Move_Msg(i, j) sent from another processor Pi to Pj DO
  receive Move_Msg(i, j)
  FOR each element Ea in Move_Msg(i, j) DO
    FOR each node Vp in Move_Msg(i, j).nodes(Ea) DO
      IF Vp does not exist in Pj THEN
        create the node Vp and initialize Ref(Vp) from Move_Msg(i, j).ref(Vp)
      END IF
    END FOR
    construct the element Ea
  END FOR
END FOR

Figure 25: Migration phase: creating the elements in the destination processors.

At this stage we d
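The receive side of the migration phase outlined in Figure 25 can be sketched as follows. This is a serial emulation, not the thesis's C++/MPI code; the message layout (`nodes`, `elements` fields) is an assumption made for illustration:

```python
def receive_move_msg(local_nodes, local_refs, msg):
    """Apply one Move_Msg(i, j) on the destination processor: create the
    nodes that do not exist yet, initialize their reference lists from the
    message, then construct the elements from local node data.
    local_nodes: node id -> coordinates already on P_j
    local_refs:  node id -> set of (processor, remote node id) references
    msg: {"nodes": {node: (coords, refs)}, "elements": {elem: [node ids]}}
    """
    created = []
    for v, (coords, refs) in msg["nodes"].items():
        if v not in local_nodes:          # node does not exist in P_j yet
            local_nodes[v] = coords
            local_refs[v] = set(refs)     # initialize Ref(V) from the message
            created.append(v)
    elements = {e: [local_nodes[v] for v in vs]
                for e, vs in msg["elements"].items()}
    return created, elements
```

A node already present locally (a shared node) is reused rather than duplicated, which is exactly the case split described above.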
processor Pi, then all the descendants of Ea are also assigned to Pi. For this reason we have not yet implemented parallel heuristics, such as the MOB heuristic [9], to try to improve the quality of the partition. 7.2 Using remote references for work migration. Although we demonstrated in the previous section how to compute a partition Π(t+1) that balances the work, at this stage of the computation the mesh is still distributed according to an unbalanced partition Π(t). In this section we present an algorithm that migrates elements and nodes between processors. If Π(t+1) ≠ Π(t), then there is at least one element Ea such that Ea ∈ Πi(t) and Ea ∈ Πj(t+1), where i ≠ j. Remember that we assume that the mesh is partitioned by elements, so that the partitions are disjoint. To adjust the mesh to the Π(t+1) partition, we need to move Ea from Pi to Pj. Let us assume that the vertices of Ea are Adj(Ea) = {Vp1, ..., Vpn}. Our algorithm considers several cases that depend on whether Pj has a local copy of these nodes or whether they must be included in the message to Pj: for each node Vp ∈ Adj(Ea), if (Pj, V'p) ∉ Ref(Vp), so that Vp is not a shared node between Pi and Pj at time t+1, then we need to create the node V'p in Pj and then use this node to create the element Ea in Pj; otherwise (Pj, V'p) ∈ Ref(Vp), so Vp is a shared node between Pi and Pj at time t+1 and Pj has a local copy V'p, and we should not create the V'p node again
parallel. Figure 41: Migration of the mesh. In (a) we show the air mesh distributed between 32 processors; this partition was obtained using a Multilevel Bisection algorithm. In (b) we migrate every element in each processor to the next processor. Figure 42: Migration of the mesh. In (a) we migrate the elements to a random processor, and in (b) we migrate the mesh according to a Spectral Bisection partition. As in the PNR algorithm, we send the whole mesh to P0. In this test we forget that the fine mesh was obtained as the result of the refinement of another, coarser mesh; for this reason we eliminate all the intermediate elements and send the fine mesh to P0. We perform this operation by flattening the hierarchy of elements: we delete all the intermediate refined elements, leaving only the elements of the fine mesh. The serial multilevel algorithm running on P0 collapses the elements by computing a matching of the graph; for a complete description of the method see [21]. Note that if we were implementing this method on a parallel computer, we would need to communicate at each level to find a matching of the graph. The results are shown in Tables 14, 15, 16, and 17, and are also plotted in Figures 43, 44, and 45. In these figures and tables we divide the repartitioning into four phases: in the first phase, every processor calculates the weights of its portion of the coarse mesh and sends it to P0;
error estimates on the domain [3]. Meshes are usually refined for two main reasons [10]: to obtain a better solution by increasing the resolution in a particular region (steady case), and to better resolve transient phenomena, like shocks, in the simulation of stiff unsteady two-dimensional flows [6]. During the computation the mesh is refined and coarsened (operations sometimes called fission and fusion) as the regions of interest evolve and move. The construction of meshes for this type of problem requires data structures that allow the addition of elements, when an element is refined by replacing it with two or more nested elements, and the coalescence of elements into larger elements, when the mesh is coarsened. Although the computational power of parallel computers is continuously increasing, it is unlikely that they will reach the level of performance required to solve problems with very localized physical phenomena using a uniform discretization of the domain. Rather than using this brute-force approach, adaptive meshes restrict the use of small elements to the regions of interest while maintaining a coarser mesh everywhere else. The use of adaptive meshes has the potential of producing large computational savings, but at the price of significantly increasing the sophistication of codes and algorithms. As the mesh is no longer regular, we need to develop new data structures that are usually more difficult to implement than the regular ones. Also,
boundary between processors, this list contains the references to the copies of the node in the remote processors. The Node class also has a list of pointers to its adjacent elements. When an element is created, it is automatically added to the lists in its vertices; when the element is deleted, it is removed from the lists of its vertices. If the list of pointers to the adjacent elements of a Node object becomes empty, then there is no element pointing to that node, and the node can be safely deleted without leaving any dangling pointers to it. The messages are also encapsulated in classes that inherit from a common Message class. To send or receive a message, the user just calls the send and receive methods of the message object. These classes also handle all the manipulation of the buffers, so the user does not have to call MPI routines directly. The MPI functions that are not related to messages, such as obtaining the processor number or the number of processors, are encapsulated in an MPI class. 8.3 File format. The file format for the meshes is very simple. Each mesh description consists of a header line that includes the number of nodes and the number of elements. After this header line there is a line for each node that includes the node number and its coordinates. After all the nodes there is a line for each element; for each element we include its type, the element number, and the numbers of its vertices. 9 Experimental results. To
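The node/element bookkeeping described above can be sketched as follows. This is a Python analogue, not the thesis's C++ classes; the field names are illustrative:

```python
class Node:
    """Mesh node with a remote-reference list and back-pointers to the
    elements that use it (a sketch of the scheme described above)."""
    def __init__(self, node_id, coords):
        self.id = node_id
        self.coords = coords
        self.refs = set()       # (processor, remote node id) pairs
        self.adjacent = set()   # elements currently pointing at this node

class Element:
    def __init__(self, elem_id, nodes):
        self.id = elem_id
        self.nodes = nodes
        for n in nodes:         # register with every vertex on creation
            n.adjacent.add(self)

    def delete(self):
        for n in self.nodes:    # deregister from every vertex
            n.adjacent.discard(self)

def reclaimable(nodes):
    """Nodes with no adjacent elements can be deleted safely:
    no dangling pointers remain."""
    return [n for n in nodes if not n.adjacent]
```

Deleting an element automatically exposes orphaned nodes, which mirrors the safe-deletion rule stated above.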
and, if we find that there is an improvement in the quality of the partition, flip them. While these algorithms have been implemented, performance results are not reported because the method provides at least equally good results. 7.1.5 Improving the PNR algorithm. In this section we discuss three possible improvements to the PNR algorithm; two of them are generalizations of the PNR algorithm, while the third is a discussion of parallel heuristics to improve the quality of the partition. We assumed that the transition between the parallel phase and the serial phase is given by the initial mesh M0. This does not always have to be the case: if an element Ea ∈ M0 is highly refined, we can send the children of Ea, or some of its descendants, to P0 rather than Ea itself. Although we send the full mesh M0 to P0, with all the weights, each time we repartition the mesh, this is not necessary. If we assume that the serial partition of M0 is computed using a serial multilevel algorithm, then we can compute the tree just once and store it in P0. To repartition the mesh, each processor needs to send to P0 only the changes in the weights produced as the result of the refinement and coarsening of the mesh. In this way we extend PNR to graphs that are not obtained as the result of a refinement process. In the migration algorithm explained in the next section, we migrate fully refined trees. This means that at every time step t, if an element Ea ∈ M0 is assigned to
because it partitioned a bigger graph. It is also not surprising that the migration phase using the serial algorithm performs better than the one using the PNR algorithm, because in the serial algorithm we removed all the intermediate elements; in this case the migration corresponding to the PNR partition needs to migrate not only the elements in the fine mesh but also all the refined elements. But this advantage is outweighed by the cost of sending the mesh to P0 and performing the partition on a bigger mesh. The really important results are obtained by looking at the last row of Tables 14, 15, 16, and 17 and comparing the quality of the partitions. Our Parallel Nested Repartitioning algorithm almost always produced better partitions than the serial multilevel algorithm, and we showed in Section 9.3 that the serial Multilevel Bisection algorithm produced better partitions than the highly acclaimed Recursive Spectral Bisection algorithm. This shows that the information from the refinement can be effectively used in the mesh partitioning algorithms. 10 Related projects. This project follows the spirit of the Distributed Irregular Mesh Environment (DIME) [31, 43] by Roy Williams at the California Institute of Technology. DIME allowed the refinement of triangles but was not able to coarsen them; also, it is not clear how its parallel refinement and load migration work. The Scalable Unstructured Mesh Computation (SUMAA3d) project at Argonne National
marked for refinement using the shared node Vp. The pseudocode for this algorithm is shown in Figure 10, but there are some details

Figure 9: Refinement of adjacent elements located in different processors. In the top row, two elements are selected for refinement (a). The refinement of Ea creates the shared node Vp (b); we then select Eb for further refinement rather than Ea (c). The bottom row shows another example (a) where two processors create shared nodes in the same position (b); in (c) we detect this problem and update the references.

Rk is the set of elements selected for refinement in Pk.
WHILE Rk is not empty DO
  extract an element Ea from Rk
  refine Ea using a serial refinement algorithm
  FOR each shared node V'p created on an internal boundary between Pk and Pj DO
    send a message from Pk to Pj requesting the creation of a shared node V'p in Pj.
    If a node V'p already exists, then return a reference to it; otherwise create
    the node V'p, determine the element to refine, and insert it into Rj.
    Finally, return its reference to Pk.
  END FOR
END WHILE

Figure 10: Avoiding refinement collisions in a parallel mesh.

that are not explained there. First, we do not send a message for each individual node, because of the high cost of sending messages. Instead, we first refine all the elements in Rk, keeping track of the shared nodes that Pk creates as a result of refining elements in Rk. Once Rk is empty, Pk sends the messages
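The batching optimization just described, one message per neighbour instead of one per shared node, can be sketched as follows (an illustrative helper, not the thesis code):

```python
# Group the shared nodes created during local refinement so that only one
# Refine_Msg goes to each neighbouring processor, instead of one message
# per node. Data layout is illustrative.
from collections import defaultdict

def batch_refine_requests(shared_nodes):
    """shared_nodes: list of (destination_processor, node_id) recorded
    while refining R_k locally. Returns one message body per destination."""
    msgs = defaultdict(list)
    for dest, node in shared_nodes:
        msgs[dest].append(node)
    return dict(msgs)

print(batch_refine_requests([(1, "Vp"), (2, "Vq"), (1, "Vr")]))
# one batched message each for processors 1 and 2
```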
This process finds a partition of Mk-1 using the multilevel algorithm, or any other partitioning algorithm supported by Chaco, and returns it to P0. In the other case, P0 computes the partition of M0 directly. We initially find a matching of the graph, defined to be a set of edges such that no two edges in that set are incident on the same vertex. Once we find this matching, we collapse pairs of vertices to form a new vertex. As a consequence, for 1 ≤ i ≤ k, we create a subgraph Gi = (Ei, Wi) of Gi-1 = (Ei-1, Wi-1), where |Ei| < |Ei-1|. We also compute ElemWeight(Ea), EdgeWeight(Ea, Eb), and Weight(Ea) for each element Ea and each pair (Ea, Eb) of adjacent elements in Gi. We choose a matching that has an approximately maximal edge density. We approximate the matching using a randomized algorithm: we select, in random order, an unmatched vertex r, and we determine, for each unmatched neighbour s of r, the EdgeDensity of the vertex u formed by collapsing r and s. Then r is collapsed with the neighbour with which it has the largest edge density. We then partition the graph Gk using a partitioning algorithm; in our tests we used Recursive Spectral Bisection, and we usually call Chaco for this purpose. This is very fast since |Ek| is small. Finally, we uncoarsen the graph for each level k > i ≥ 0. At this time we also improve the partition using local heuristics that are a variation on Kernighan-Lin [32]: we compare pairs of elements assigned to different processors
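The randomized matching step described above can be sketched as follows. The `edge_density` callable stands in for the EdgeDensity computation, which is not reproduced here; everything else is illustrative:

```python
import random

def random_matching(adj, edge_density, seed=0):
    """adj: vertex -> set of neighbours. Visit unmatched vertices in random
    order; match each with the unmatched neighbour maximizing edge_density,
    approximating a maximal-edge-density matching as described above."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    matched, matching = set(), []
    for r in order:
        if r in matched:
            continue
        candidates = [s for s in adj[r] if s not in matched]
        if not candidates:
            continue                      # r stays unmatched this level
        s = max(candidates, key=lambda s: edge_density(r, s))
        matching.append((r, s))
        matched.update((r, s))
    return matching
```

Collapsing each matched pair then yields the coarser graph Gi of the next level.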
shared edge that caused a propagation to processor Pj, and V'p is its copy in processor Pj, a reference to V'p is added at Vp, and vice versa. A processor Pi can be in two possible states: the inactive state and the active state. While in the inactive state, Pi can send no Refine_Msg or AddRef_Msg, but it can receive messages. If it receives a Refine_Msg(j, i) from another processor Pj while in the inactive state, it moves from the inactive to the active state; it creates the shared nodes as stated in the message and proceeds to refine the nonconforming elements. The message Refine_Msg(j, i) is called the critical message, because it caused the refinement that Pi is performing, and Pj is the parent of processor Pi. In the active state, while a processor Pi is refining some of its elements, the refinement can propagate to a neighbour processor Pk, requiring another Refine_Msg(i, k) message to it. An active processor becomes inactive at the first time t for which the following conditions hold:
- no new Refine_Msg is received by the processor at time t;
- there are no elements marked for refinement in the processor (the local termination condition holds);
- the processor has transmitted, prior to t, an AddRef_Msg for each Refine_Msg it has received, except for the critical message;
- the processor has received, prior to t, an AddRef_Msg for each Refine_Msg it has transmitted.
Using this definition,
assembly process. We first compute a local square matrix L(Ea), of size |Adj(Ea)|, for each element Ea; L(Ea) represents the contribution of Ea to its nodes Vp. The global matrix A is equal to the direct sum of the L(Ea), after converting from the local indices in L to the global indices in A. The matrix A is also partitioned between the processors. If the node Vp is a shared node between two or more processors Pi and Pj, then the entry in Ai corresponding to Vp has the contributions of the elements in Πi, and the entry in Aj corresponding to Vp has the contributions of the elements in Πj. The matrix Ai in processor Pi is partially assembled, since it only considers the contributions of the elements assigned to Pi. The fully assembled matrix is A = Σi Ai. The matrix-vector product Ab = c is performed in two phases. In the first phase, each processor computes Ai b = ci; the resulting vectors ci are also partially assembled. In the second phase, we communicate the individual vectors ci to obtain c = Σi ci. 5.2 Implementing a parallel mesh using remote references. A remote reference is a pair (Pi, V'p), where Pi is a processor and V'p ∈ Nodes(Πi). It represents a reference to the copy V'p of node Vp in processor Pi. We define Ref(Vp) = {(Pi, V'p) : V'p is a copy of Vp in Pi, i ≠ j}. This relation is also symmetric, so that if (Pj, V'p) ∈ Ref(Vp), then (Pi, Vp) ∈ Ref(V'p). The remote references are p
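The two-phase product with partially assembled matrices can be emulated serially. In this sketch each "processor" holds a globally indexed matrix Ai that is zero outside its own elements; the example mesh (two 1D linear elements sharing node 1) is an assumption for illustration:

```python
def matvec(A, b):
    """Dense matrix-vector product on plain lists."""
    return [sum(a * x for a, x in zip(row, b)) for row in A]

def two_phase_matvec(partial_matrices, b):
    """Phase 1: each processor computes c_i = A_i b with no communication.
    Phase 2: summing the partially assembled c_i assembles c = (sum A_i) b."""
    partials = [matvec(A, b) for A in partial_matrices]   # phase 1
    return [sum(vals) for vals in zip(*partials)]         # phase 2

# Two 1D linear elements on nodes 0-1-2; node 1 is shared between P0 and P1.
A0 = [[2.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # element (0,1) on P0
A1 = [[0.0, 0.0, 0.0], [0.0, 1.0, -1.0], [0.0, -1.0, 2.0]]  # element (1,2) on P1
b = [1.0, 2.0, 3.0]
print(two_phase_matvec([A0, A1], b))  # → [0.0, 0.0, 4.0]
```

Note that the shared row (node 1) only becomes correct after the phase-2 sum, which is exactly why the ci are called partially assembled.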
When a processor Pi wants to receive a buffer from Pj, it calls another C function and waits until a message from Pj arrives at Pi. We have designed C++ wrappers around these C routines; in our environment, a message is just another kind of object. We also designed a very simple window interface that allows us to select an element or a group of elements for refinement. We use windows to display the mesh, where the elements are colored depending on the processor in which they are located. P0 is usually responsible for managing the windows; this processor collects information from all the other processors and broadcasts the user commands to them. Finally, we use the Chaco [20] graph partitioning program to generate the initial partitions and for some of the mesh repartitioning algorithms. Since Chaco is a sequential program, we also run it on P0. 8.1 The user interface. The user interface is designed around the Tcl/Tk scripting package [28, 29]. The user is presented with a window that displays the mesh. Using the mouse, the user can click on individual elements to select them, or drag the cursor to select a group of elements. The user can then choose commands from the menus to refine, partition, or migrate the mesh using different algorithms. There are several options to display information about a mesh or about individual elements. The user can load different meshes using a file selection box and can select different initial partition files for a particular mesh.
systems in Figure 7 was to obtain an initial partition of the mesh. This partition is usually computed on a serial computer during a preprocessing step, and it is not part of our system; nevertheless, it allows us to compare the quality of the partitions obtained using serial multilevel algorithms with more standard algorithms like Recursive Spectral Bisection [15]. Recursive Spectral Bisection is known to produce very good partitions, but it is too expensive to use for repartitioning the mesh. Our program assumes that there is an initial partition of the mesh, and we generate this partition using Chaco in a preprocessing step. This system provides several methods for partitioning a graph; we used both Recursive Spectral Bisection and Multilevel Bisection. As Chaco is a serial program, we ran these tests on a single SparcStation 10 workstation with 64 MB of RAM. The results are shown in Tables 5, 6, 7, and 8. The time required to compute the partitions of both meshes is shown in Figure 36, and the number of shared nodes is shown in Figure 37. Clock time is the time elapsed to compute the partition, while user and system time denote the time spent in user and system mode; the difference between clock time and the sum of user and system time represents the time the system was idle because of thrashing. Remember that this partition is computed using the dual of the mesh; in this case, the row labeled "edges cut" is the number of edges cut in the dual of the
that bisects the longest edge of a triangle to form two new triangles with equal area. There are several variants of the serial bisection refinement algorithm. In Figure 1 we illustrate an example of the two-triangles bisection algorithm [1, 2] described in Figure 2. In Figure 1(a), the element selected for refinement is shaded. The refinement of this element creates a non-conforming (white) node on its longest edge. The shaded element in 1(b) must now be refined to maintain a conforming mesh. This process is repeated in (c), where there are two non-conforming nodes. Finally, in (f) we show the resulting mesh. Using the bisection refinement algorithm, the propagation is guaranteed to terminate. Also, if we start the refinement with a mesh M0 that has elements that are smooth, conforming, and non-degenerate, then the elements of the resulting mesh M will also have the same properties. 3 Multilevel mesh adaptation. To support dynamic adaptation of meshes, we designed a data structure based on a multilevel finite element mesh with a filiation hierarchy between two consecutive levels. As we will show in later sections, our algorithms for refinement, partition, and migration take good advantage of this mesh representation. We assume that the user supplies an initial coarse mesh M0 = (E, V), called a 0-level mesh, where E is a set of elements and V is a set of nodes. This is the coarsest mesh that the refinement algorithm is able to manipulate. Using defined adaptat
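The longest-edge bisection step itself is simple to state in code. This sketch splits a single triangle into two of equal area; the propagation to neighbouring non-conforming elements, which is the hard part discussed above, is deliberately omitted:

```python
import math

def longest_edge_bisect(tri):
    """tri: three (x, y) vertices. Bisect the longest edge at its midpoint,
    producing two triangles of equal area (the basic Rivara bisection step;
    conformity propagation is not handled here)."""
    edges = [(0, 1), (1, 2), (2, 0)]
    i, j = max(edges, key=lambda e: math.dist(tri[e[0]], tri[e[1]]))
    k = 3 - i - j                                   # index of the opposite vertex
    mid = tuple((a + b) / 2 for a, b in zip(tri[i], tri[j]))
    return (tri[i], mid, tri[k]), (mid, tri[j], tri[k])

t1, t2 = longest_edge_bisect(((0, 0), (4, 0), (0, 3)))
print(t1)  # child containing the first endpoint of the longest edge
print(t2)  # child containing the second endpoint
```

The new midpoint is exactly the non-conforming node that forces refinement of the neighbour sharing the bisected edge.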
the design of adaptive meshes in a parallel environment requires a close interaction between the algorithms that refine, partition, and rebalance the mesh and the numerical simulation. The success of an adaptive strategy will depend strongly on how well these different modules can communicate with each other. There is a wide variety of strategies for mesh refinement [8]. In the remaining part of this section we review some of the most common techniques for mesh generation and refinement. In the following section we introduce a strategy to implement adaptive meshes using a sequence of nested refinements; later we show how to implement this approach on a parallel computer. We also explain the object-oriented techniques that we use to simplify the software design. 2.1 Selection of the mesh type. The selection of the mesh type depends on the problem to be studied, since there is no strategy that is considered best for every problem. Among the most common approaches we mention [8]:
- structured meshes: there is a mapping from the physical space to the computational space. In the computational space the elements appear as squares in two dimensions or cubes in three dimensions, and the neighbors and vertices of an element are easily calculated using an array-based data structure. The data structures for this type of mesh are very regular.
- unstructured meshes: in this case the elements store explicit connectivity information to determine the
the local termination condition might hold in processor Pi (Ri = ∅), but processor Pi might be in an active state, waiting for an AddRef_Msg(j, i) from processor Pj, if the refinement of the elements of Pi caused the refinement to propagate to processor Pj and Pj has not yet responded. When a processor becomes inactive, it returns an AddRef_Msg to the processor that originally sent its critical message. Initially, at time t = 0, the coordinator is active and all other processors are inactive. Furthermore, at t = 0 the local termination condition holds at all processors except the coordinator. It can be seen that if a processor is inactive at time t, the following rules hold:
- its local termination condition holds at t;
- it has transmitted an AddRef_Msg for all the Refine_Msg messages it has received prior to t;
- it has received AddRef_Msg messages for all Refine_Msg messages it has transmitted prior to t.
If the processor is active at time t, at least one of the above conditions is violated. We say that global termination is detected when the coordinator becomes inactive. In the case of the coordinator, only the last of the previous rules is relevant, as it has no local elements to refine. The coordinator will receive an AddRef_Msg from each processor Pi only when all the processors are inactive; in this case there are no elements marked for refinement and there are no other messages in the network. This algorithm works i
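The local rules that let a processor go inactive can be written down as a predicate over per-processor counters. This is a serial illustration with hypothetical field names, not the thesis's message-passing code:

```python
def may_go_inactive(proc):
    """Check the conditions listed above on one processor's local state:
    no elements left to refine, every non-critical Refine_Msg answered with
    an AddRef_Msg, and an AddRef_Msg received for every Refine_Msg sent.
    Field names are illustrative."""
    return (not proc["marked_for_refinement"]                      # local termination
            and proc["addref_sent"] == proc["refine_received_noncritical"]
            and proc["addref_received"] == proc["refine_sent"])
```

Global termination is then detected when this predicate holds at the coordinator, whose only relevant condition is the last one.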
increases the complexity of algorithms and software, new design techniques based on object-oriented technology are needed to cope with the complexity that arises. In this thesis we study problems that arise when finite element and spectral methods are adapted to dynamically changing meshes. Adaptivity in this context means the local refinement and derefinement of meshes to better follow the physical anomalies. Adaptation produces load imbalances among processors, thereby creating the need for repartitioning of the work load. We present new parallel adaptation, repartitioning, and rebalancing algorithms that are strongly coupled with the numerical simulation. Adaptation, repartitioning, and rebalancing each offer challenging problems on their own. Rather than studying these problems individually, we put special emphasis on investigating the way these different components interact; by considering adaptivity as a whole, we obtain new results that are not available when these problems are studied separately. We discuss the difficulties of designing parallel refinement algorithms, and we introduce a refinement algorithm based on Rivara's bisection algorithm for triangular elements [1, 2]. By representing the adapted mesh as a forest of trees of elements, we avoid the synchronization problems for which Jones et al. use randomization [3]. We propose a new Parallel Nested Repartitioning algorithm that has its roots in the multilevel bisection algorithm of Bar
use is to maintain the connections between the different regions of the mesh as the mesh is partitioned between the processors. As will be shown in later sections, they provide a very flexible mechanism for maintaining a dynamic mesh. When a node is moved to a new processor, it can use its reference list to find its copies in other processors; it can then send a message to these copies telling them to update their references to the new location. The references also simplify the task of assembling matrices and vectors from partially assembled ones as new nodes are created and moved at runtime, because no assumption is initially made about the origin and destination of these messages. 5.3 Using a parallel mesh for the solution of dynamic physical problems. In this paper we assume that we are given an initial coarse mesh M0 at time t = 0, from which we find an initial partition Π0; this partition is computed in a preprocessing step. We distribute the nodes and elements between the processors according to that partition, and we compute the initial references using the algorithm in Figure 6. Our algorithm for finding the solution of dynamic problems consists of four consecutive phases that we execute repeatedly; Figure 7 gives a high-level outline of the program. In the first phase we use numerical approximation techniques to find the solution of the partial differential equations by solving a system of linear equations. We solve this system using iterative methods
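The four-phase loop just described fixes only an ordering; each phase is supplied by the rest of the system. A minimal driver sketch (the callables and their signatures are assumptions for illustration, not the thesis's interfaces):

```python
def adaptive_step(mesh, solve, adapt, repartition, migrate):
    """One iteration of the four consecutive phases described above:
    1) solve the PDE system on the current mesh,
    2) refine/coarsen the mesh from the solution (error estimates),
    3) compute a new balanced partition,
    4) migrate elements and nodes to match it."""
    solution = solve(mesh)                 # phase 1: numerical solution
    mesh = adapt(mesh, solution)           # phase 2: refinement / coarsening
    new_partition = repartition(mesh)      # phase 3: e.g. the PNR algorithm
    mesh = migrate(mesh, new_partition)    # phase 4: element/node migration
    return mesh, solution
```

Repeating `adaptive_step` is the outer loop of the program; the coupling between phases is exactly what the surrounding sections study.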
Figure 34: The air mesh is a 2D finite element grid of a complex airfoil with triangular elements. It consists of 9000 elements and 4720 nodes. This mesh is provided with the Chaco program.

9.2 Mesh examples. Our tests were run on two basic meshes. The first mesh is of relatively small size: it contains 9000 triangular elements and 4720 nodes and is a 2D unstructured finite element grid of an airfoil. This mesh is provided with the Chaco program, where it is known as the Hammond mesh; in our examples we will refer to it as the air mesh. Several views of this mesh are shown in Figure 34. The second mesh is a larger 2D finite element grid of around 30000 triangular elements and 15000 nodes; we will refer to it as the big mesh. Four different views of this mesh are displayed in Figure 35.

Figure 35: The big mesh is a 2D finite element mesh of a complex airfoil. It consists of 30269 elements and 15606 nodes. It is obtained from riacs.edu in the directory pub/grids/big.

As shown in these pictures, there is a big disparity in the size of the elements. 9.3 Initial partition of the mesh. Recall that the first task of the general algorithm for computing the solution of dynamic
receiving processor can create its local copy and then send a message to the other processors to update their references to it. Also, when we are sending multiple elements to a processor, we need to be careful to include only one copy of the nodes. The description of this phase is shown in Figure 24. The initial messages for the previous example are:
- P0: move Ea by sending the message Move_Msg(0, 3) = {Ea}. Include in the message the nodes V0 and V3 and a reference to the existing copy of V4; the destination uses these two nodes and the existing copy of V4 to create the element Ea.
- P1: send Ed to P2 with the message Move_Msg(1, 2) = {Ed}, with the nodes V1, V2, and V5.
- P2: similarly send Ee to P3 with V3 and V6 and a reference to V7, and send Ef with V7 and references to two existing copies.
- P3: at the same time, send one element with V4 and V8 and a reference to a shared node, and send Eb with V5 and V9 with a reference to another shared node.

Figure 22: Migration of elements from an initial partition Π(t) (a) to a target partition Π(t+1) (b).

Figure 23: A migration example: internal representation of the mesh at the beginning of the migration (a) and after copying the elements to the destination processors (b).

FOR each element Ea such that Ea ∈ Πi(t), Ea ∈ Πj(t+1), i ≠ j DO
  insert Ea into Move_Msg(i, j)
  FOR each node Vp ∈ Adj(Ea) DO
    IF (Pj, V'p) ∈ Ref(Vp) THEN
      insert (Pj, V'p) into Move_Msg(i, j).nodes(Ea)
    ELSE
      insert Vp into Move_Msg(i, j).nodes
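The sender-side construction sketched in Figure 24, including the shared-node case split, can be emulated as follows (the message layout is an assumption; this is not the thesis code):

```python
def build_move_msgs(elements, part_old, part_new, adj, refs, me):
    """For each element leaving processor `me`, build Move_Msg(me, j).
    A node is sent as a full copy unless the destination j already holds
    a copy, in which case only the remote reference is included.
    refs: node -> set of (processor, remote node id) pairs."""
    msgs = {}
    for e in elements:
        i, j = part_old[e], part_new[e]
        if i != me or i == j:
            continue                                   # not ours, or not moving
        msg = msgs.setdefault(j, {"elements": [], "nodes": {}, "refs": {}})
        msg["elements"].append(e)
        for v in adj[e]:
            dest_copies = [r for r in refs.get(v, ()) if r[0] == j]
            if dest_copies:                            # shared node: send reference
                msg["refs"][v] = dest_copies[0]
            else:                                      # not shared: send full copy
                msg["nodes"].setdefault(v, set(refs.get(v, ())))
    return msgs
```

One message per destination also guarantees that each node is included at most once, which is the caution noted above about sending multiple elements to the same processor.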
new partition Π(t+1). If Π(t+1) ≠ Π(t), we need to migrate some elements and nodes to adjust the mesh to the new partition. This phase does not create new nodes or elements, but it modifies the reference lists as nodes are moved to new processors. 6 Parallel mesh adaptation. Using the data structures presented in the previous section, we now introduce an algorithm for adapting the mesh on a parallel computer. Let R be a set of elements selected for refinement, and let Ri be the subset of the elements of R assigned to processor Pi. In this case R = ∪i Ri, and also Ri ∩ Rj = ∅ for i ≠ j, because by using the element partitioning method of assigning elements to processors, each element is assigned to only one processor. Each processor has all the information it needs to refine its own subset Ri in parallel using

Figure 8: Propagation of refinement to adjacent processors. In (a) the elements Ea, Ee, Ef, and Eg are selected for refinement. The refinement of these elements creates two nodes Vp and Vq on the boundary between P0 and P1. P1 creates its local copies V'p and V'q and selects its nonconforming elements for refinement (b); (c) shows the resulting mesh.

a serial algorithm, but nonconforming elements might be created on the boundary between processors, as suggested in Figure 8. In that example four elements are selected for refinement, so R0 = {Ea, Ee, Ef, Eg} and R1 = ∅. The refinement of Ea creates a node on an internal boundary between
two wonderful years: Al Mamdani, Andy Foersberg, Dawn Garneau, Jon Metzger, Laura Paglione, Madhu Jalan, Rob Mason, and especially Sonal Jha, who were always finding new ways of keeping me away from the CIT. Thanks also to my friends Vaso Chatzi, Laurent Michel, and Michael Benjamin, and my officemates Sonia Leach and Michael Littman. Many thanks to my parents, Gianna and Buby, and my stepfather Carlos for their support and motivation for coming to the States. I have worked with many professors while at Brown, but I want to mention four of them that had the biggest influence on me: Paris Kanellakis, Paul Fischer, Tom Dean, and Tom Doeppner. Finally, Prof. John Savage, more than an advisor, has been a mentor and a friend.

References

[1] Maria Cecilia Rivara. Selective refinement/derefinement algorithms for sequences of nested triangulations. International Journal for Numerical Methods in Engineering, 28:2889-2906, 1989.
[2] Maria Cecilia Rivara. Algorithms for refining triangular grids suitable for adaptive and multigrid techniques. International Journal for Numerical Methods in Engineering, 20:745-756, 1984.
[3] Mark T. Jones and Paul E. Plassmann. Parallel algorithms for the adaptive refinement and partitioning of unstructured meshes. Proceedings of the Scalable High-Performance Computing Conference, Knoxville, Tennessee, 1994.
[4] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical