Home

MAPHYS USER'S GUIDE 0.9.2

image

Contents

1. [Figure - Example values of the xmph_create_domain arguments (myndof, myndofinterior, myndofintrf, lenindintrf, mysizeintrf, gballintrf, mynbvi, myindexVi, myptrindexVi, myindexintrf, myinterface, myndoflogicintrf, mylogicintrf) for the four domains (domain 0 to domain 3) of the finite element interface example. The corresponding data are not legible in this version of the document.]
2. 6.7.2 Example with finite element interface
7 Writing a matrix to a file
8 Using two levels of parallelism inside MAPHYS
9 Control parameters
9.3 Compatibility table
10 Information parameters
11 Error diagnostics
12 Examples of use of MaPHyS
13.2 How to refer to MAPHYS
13.3 Authors

1 General overview

MaPHyS is a software package for the solution of sparse linear systems

\[ A x = b, \]

where A is a square, real or complex, non-singular general matrix, b is a given right-hand side vector and x is the solution vector to be computed. It follows a non-overlapping algebraic domain decomposition method that first reorders the linear system into the 2 x 2 block system

\[
\begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
\begin{pmatrix} x_{I} \\ x_{\Gamma} \end{pmatrix}
=
\begin{pmatrix} b_{I} \\ b_{\Gamma} \end{pmatrix},
\qquad (1)
\]

where A_{II} and A_{\Gamma\Gamma} respectively represent the interior subdomains and the separators, and A_{I\Gamma} and A_{\Gamma I} are the coupling between interiors and separators. By eliminating the unknowns associated with the interior subdomains we get the reduced system

\[ S\, x_{\Gamma} = f, \qquad (2) \]

with

\[
S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}
\quad \text{and} \quad
f = b_{\Gamma} - A_{\Gamma I} A_{II}^{-1} b_{I}. \qquad (3)
\]
3. 1 : the product of the Schur complement with a vector is performed explicitly. That is, the Schur complement is stored as a dense matrix and a BLAS routine is called to perform the product.
2 : the product of the Schur complement with a vector is performed implicitly. That is, the Schur complement is not stored as a dense matrix to perform the product; instead, a sparse solve and several sparse matrix-vector products are used, based on the definition of the Schur complement, S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}, so that S v = A_{\Gamma\Gamma} v - A_{\Gamma I} (A_{II}^{-1} (A_{I\Gamma} v)). This option reduces the memory footprint by replacing, in memory, the Schur complement by the preconditioner.
Default : 0

ICNTL(28) controls the iterative solver. It controls whether you want to use the residual norm on the local right-hand side or on the global right-hand side. It must be equal to 0 or 1. It is accessed by all processes during the solve step.
0 : compute the residual on the local right-hand side.
1 : compute the residual on the global right-hand side (only available in the centralized version, see ICNTL(43)).
Default : 0

ICNTL(29) UNUSED in current version.

ICNTL(30) controls how to compute the Schur complement matrix or its approximation. It is accessed during the factorisation step by each process. Each process must have the same value (the current version does not check it). Valid values are:
0 : use the Schur complement returned by the sparse direct solver package.
2 : approximate computation of the Schur complement with the partial incomplete factorisation PILUT (see Section 2.3).
4. Choose the user's compiling options (arithmetics, real or complex, solver choices) in the Makefile.inc file.
Compile (make all) in order to get the library and the test programs.

4 Main functionalities of MaPHyS 0.9.2

4.1 Input matrix structure
In the current version of MAPHYS, the input matrix can only be supplied centrally on the host processor (process 0) in elemental format. For full details refer to Section 6.6.1.

4.2 Arithmetic versions
The package is available in the four arithmetics REAL, DOUBLE PRECISION, COMPLEX and DOUBLE COMPLEX. Please refer to the root README of the MAPHYS sources for full details.

4.3 The working host processor
In the current version the working host process is fixed to the process with MPI rank 0 according to the chosen communicator (see Section 6.4).

5 Sequence in which routines are called
The MAPHYS package provides one single static library, libmaphys.a, which contains all the arithmetic versions specified in WITH_ARITHS (see topdir/Makefile.inc). The calling sequence of MAPHYS is similar for all arithmetics and may look like this for the DOUBLE PRECISION case:

  Use DMPH_maphys_mod
  Implicit None
  Include 'mpif.h'
  Integer :: ierr
  Type(DMPH_maphys_t) :: mphs    ! one per linear system to solve

  Call MPI_Init(ierr)
  mphs%comm = ...                ! set communicator
  mphs%job  = -1                 ! request initialisation
  Call DMPH_maphys_driver(mphs)
  mphs%sym  = ...                ! set matrix input
5. [10] W. Hackbusch. Multigrid Methods and Applications. Springer, 1985.
[11] A. Haidar. On the parallel scalability of hybrid solvers for large 3D problems. Ph.D. dissertation, INPT, June 2008. TH/PA/08/57.
[12] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Stds., B49:409-436, 1952.
[13] N. J. Higham. Accuracy and Stability of Numerical Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002.
[14] G. Karypis and V. Kumar. METIS: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0. University of Minnesota, June 1995.
[15] Z. Li, Y. Saad and M. Sosonkina. pARMS: a parallel version of the algebraic recursive multilevel solver. Technical Report UMSI-2001-100, Minnesota Supercomputer Institute, University of Minnesota, Minneapolis, 2001.
[16] T. Mathew. Domain Decomposition Methods for the Numerical Solution of Partial Differential Equations. Springer Lecture Notes in Computational Science and Engineering. Springer, 2008.
[17] A. Quarteroni and A. Valli. Domain Decomposition Methods for Partial Differential Equations. Numerical Mathematics and Scientific Computation. Oxford Science Publications, Oxford, 1999.
[18] G. Radicati and Y. Robert. Parallel conjugate gradient-like algorithms for solving nonsymmetric linear systems on a vector multiprocessor.
6.   If (ierr /= MPI_SUCCESS) ierr = -1
  If (ierr < 0) Goto 9999
  Call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  If (ierr /= MPI_SUCCESS) ierr = -1
  If (ierr < 0) Goto 9999

  ! init instance
  id%job  = JOB_INIT             ! init job
  id%comm = MPI_COMM_WORLD
  Call dmph_maphys_driver(id)
  ierr = id%iinfog(1)
  If (ierr < 0) Goto 9999

  ! init matrix
  id%sym = 2
  n   = 1000
  nnz = 5998
  Allocate(irn(nnz))
  Allocate(jcn(nnz))
  Allocate(a(nnz))
  Allocate(rhs(n))
  Allocate(sol(n))
  Do i = 1, 1000
     rhs(i) = 1.0
     ! diagonal entries a(i), irn(i), jcn(i):
     ! the numerical values of this loop are garbled in the source listing
  End Do
  ! further loops filling a(1000+i), a(1999+i), a(2998+i), a(3998+i), a(4998+i),
  ! irn and jcn (values garbled in the source listing; the C example of
  ! Section 12.3.2 defines the same matrix)

  ! define the problem on the host
  If (myid == 0) then
     id%n      = n
     id%nnz    = nnz
     id%rows   => irn
     id%cols   => jcn
     id%sol    => sol
     id%values => a
     id%rhs    => rhs
  End If

  ! Call the Maphys package
  id%ICNTL(6)  = 2
  id%ICNTL(24) = 500
  id%ICNTL(26) = 500
  id%ICNTL(27) = 1
  id%ICNTL(22) = ...             ! remaining ICNTL/RCNTL settings are
  id%RCNTL(21) = ...             ! garbled in the source listing
  id%RCNTL(11) = ...
  id%job = 6
  call dmph_maphys_driver(id)
  id%job = JOB_END
  Call dmph_maphys_driver(id)
7. Namely, all entries (i, j) of the U factor whose magnitude, relative to their row, falls below RCNTL(8) are deleted. It is only relevant, and must be set by the user before the factorisation step, while using PILUT. See also ICNTL(30), ICNTL(8), ICNTL(9) and RCNTL(9).
Warning : the default value is a bad value. Default : 0.e0
Note : for more details on the ILU method you may refer to Section 2.3.

RCNTL(9) specifies the threshold used to sparsify the Schur complement while computing it with PILUT. Namely, all entries (i, j) of the Schur complement S whose magnitude, relative to their row, falls below RCNTL(9) are deleted. It is only relevant, and must be set by the user before the factorisation step, while using PILUT. See also ICNTL(30), ICNTL(8), ICNTL(9) and RCNTL(8).
Warning : the default value is a bad value. Default : 0.e0
Note : for more details on the ILU method you may refer to Section 2.3.

RCNTL(11) specifies the threshold used to sparsify the Schur complement while computing a sparse preconditioner. It is only useful while selecting a sparse preconditioner (ICNTL(21) = 2 or 3). It must be above 0. Entries (i, j) of the assembled Schur complement S such that |S(i,j)| <= RCNTL(11) (|S(i,i)| + |S(j,j)|) are deleted.
Default : 1.e-4
Note : for more details on the dropping strategy (ICNTL(21) = 2 or 3) you may refer to Section 2.2.

RCNTL(15) specifies the imbalance tolerance used in the Scotch partitioner to create the subdomains: the higher it is, the more imbalance Scotch will allow between the subdomains, in order to reduce the size of the interface and hence of the Schur complement.
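To make the PILUT-related controls above concrete, here is a minimal, hypothetical sketch (double precision interface and field names as in Section 5; the numerical values are illustrative only, not recommended defaults) of selecting the approximated Schur computation and its thresholds before the factorisation step:

  mphs%icntl(30) = 2        ! compute an approximate Schur complement with PILUT
  mphs%icntl(8)  = 40       ! max fill per row in the L/U factors (illustrative)
  mphs%icntl(9)  = 40       ! max fill per row in the approximate Schur (illustrative)
  mphs%rcntl(8)  = 1.0d-4   ! dropping threshold for the L/U factors (illustrative)
  mphs%rcntl(9)  = 1.0d-4   ! dropping threshold for the approximate Schur (illustrative)
  mphs%job = 2              ! then perform the factorisation step
  Call DMPH_maphys_driver(mphs)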
8. lists the row indices of the matrix entries.
mphs%COLS (integer array of size NNZ) lists the column indices of the matrix entries.
mphs%VALUES (Real/Complex array of size NNZ) lists the values of the matrix entries.
Note that in the case of symmetric matrices (SYM = 1 or 2) only half of the matrix should be provided. For example, only the lower triangular part of the matrix (including the diagonal), or only the upper triangular part of the matrix (including the diagonal), can be provided in ROWS, COLS and VALUES. More precisely, a diagonal nonzero a_ii must be provided as VALUES(k) = a_ii with ROWS(k) = COLS(k) = i, and a pair of off-diagonal nonzeros a_ij and a_ji must be provided either as VALUES(k) = a_ij with ROWS(k) = i and COLS(k) = j, or vice versa. Again, duplicate entries are summed. In particular, this means that if both a_ij and a_ji are provided, they will be summed.

6.6.2 Right-hand side and solution vectors / matrices
mphs%RHS (Real/Complex array of length mphs%N) is the centralized right-hand side of the linear system. It holds the right-hand side; it is allocated and filled by the user on the host before the solve step (mphs%JOB = 4 or 6). MAPHYS does not alter its value.
mphs%SOL (Real/Complex array of length mphs%N) is the centralized initial guess and/or the solution of the linear system. It must be allocated by the user before the solve step (mphs%JOB = 4 or 6). Before the solve step, on the host, it holds the initial guess if the initial guess option is activated (mphs%ICNTL(23) = 1).
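As an illustration of the symmetric input format described above, the following hedged sketch (double precision interface; the small SPD matrix and all variable names are hypothetical) supplies only the lower triangular part of a 3 x 3 matrix on the host:

  Integer, Pointer      :: rows(:), cols(:)
  Real(kind=8), Pointer :: values(:)
  ! 3x3 SPD matrix [4 1 0; 1 4 1; 0 1 4], lower triangle only (SYM = 1)
  Allocate(rows(5), cols(5), values(5))
  rows   = (/ 1, 2, 2, 3, 3 /)
  cols   = (/ 1, 1, 2, 2, 3 /)
  values = (/ 4.d0, 1.d0, 4.d0, 1.d0, 4.d0 /)
  mphs%sym    = 1
  mphs%n      = 3
  mphs%nnz    = 5
  mphs%rows   => rows
  mphs%cols   => cols
  mphs%values => values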
9. a sparse preconditioner.
RINFO(28) : the time spent to perform communications while calling the iterative solver. It is cumulated over the iterations.
RINFO(29) : total time spent to perform the matrix-vector products in the iterative solver, where the matrix is the Schur complement. It includes the time to synchronise the vector.
RINFO(30) : total time spent to apply the preconditioner in the iterative solver. This includes the time to synchronise the vector.
RINFO(31) : total time spent to perform scalar products in the iterative solver. It includes the time to reduce the result.
RINFO(32) to RINFO(35) : fields unused in current version.

10.2 Information available on all processes
10.2.1 IINFOG
IINFOG(1) : the instance status.
Warning : in the current version there is no guarantee that, on failure, IINFOG(1) is properly set (see Section 11).
< 0 : an error.
0 : the instance is in a correct state.
-1 : an error occurred on processor IINFOG(2).
IINFOG(2) : additional information if an error occurred.
Warning : in the current version there is no guarantee that, on failure, IINFOG(2) is properly set (see Section 11).
If IINFOG(1) = -1, it specifies which processor failed (from 1 to np).
IINFOG(3) : order of the input matrix.
IINFOG(4) : number of entries of the input matrix. It is only available on processor 0.
IINFOG(5) : number of iterations performed by the iterative solver. It is set during the solving step.

10.2.2 RINFOG
10. called before any other call to MAPHYS. MAPHYS will nullify all pointers of the structure and fill the control arrays mphs%ICNTL and mphs%RCNTL with default values.
-2 : destroys the instance. MAPHYS will free its internal memory.
1 : performs the analysis. In this step MAPHYS will pretreat the input matrix, compute how to distribute the global system, and distribute the global system into local systems.
2 : performs the factorization, assuming the data is distributed in the form of local systems. In this step MAPHYS will factorize the interiors of the local matrices and compute their local Schur complements.
3 : performs the preconditioning. In this step MAPHYS will compute the preconditioner of the Schur complement system.
4 : performs the solution. In this step MAPHYS will solve the linear system. Namely, MAPHYS will reformulate the right-hand side of the input linear system (mphs%RHS) into local system right-hand sides, solve the Schur complement system, and solve the local systems.
5 : combines the actions from JOB = 1 to JOB = 3.
6 : combines the actions from JOB = 1 to JOB = 4.

6.4 Control of parallelism
mphs%COMM (integer) is the MPI communicator used by MAPHYS. It must be set to a valid MPI communicator by the user before the initialization step (mphs%JOB = -1) and must not be changed later. In the current version, the MPI communicator must have a number of processes that is a power of 2.

6.5 Matrix type
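As a complement to the JOB values listed above (Section 6.3), here is a hedged outline of a driver that runs the four phases one after the other instead of using JOB = 6; it builds on the double precision skeleton of Section 5 and omits the matrix and right-hand side setup:

  mphs%comm = MPI_COMM_WORLD
  mphs%job  = -1                 ! initialise the instance
  Call DMPH_maphys_driver(mphs)
  ! ... set mphs%sym, mphs%n, mphs%nnz, mphs%rows, mphs%cols, mphs%values here ...
  mphs%job = 1                   ! analysis: partition and distribute the system
  Call DMPH_maphys_driver(mphs)
  mphs%job = 2                   ! factorisation of the interiors, local Schur complements
  Call DMPH_maphys_driver(mphs)
  mphs%job = 3                   ! preconditioner of the Schur complement system
  Call DMPH_maphys_driver(mphs)
  ! ... set mphs%rhs (and mphs%sol if an initial guess is supplied) here ...
  mphs%job = 4                   ! solve
  Call DMPH_maphys_driver(mphs)
  mphs%job = -2                  ! free the instance
  Call DMPH_maphys_driver(mphs)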
11. computationally expensive because the dense matrices \bar{S}_i should be factorized. We intend to reduce the storage and the computational cost to form and apply the preconditioner by using a sparse approximation of \bar{S}_i, following the strategy described by Equation (12). The approximation \hat{S}_i can be constructed by dropping the elements of \bar{S}_i that are smaller than a given threshold. More precisely, the following symmetric dropping formula can be applied:

\[
\hat{s}_{ij} =
\begin{cases}
\bar{s}_{ij} & \text{if } |\bar{s}_{ij}| \ge \xi\, (|\bar{s}_{ii}| + |\bar{s}_{jj}|), \\
0 & \text{otherwise},
\end{cases}
\qquad (12)
\]

where \bar{s}_{ij} denotes the entries of \bar{S}_i. The resulting preconditioner based on these sparse approximations reads

\[
\hat{M} = \sum_{i=1}^{N} R_{\Gamma_i}^{T}\, \hat{S}_i^{-1}\, R_{\Gamma_i}. \qquad (13)
\]

We notice that such a dropping strategy preserves the symmetry in the symmetric case, but it requires to first assemble \bar{S}_i before sparsifying it.
Note : see ICNTL(21) = 2 or 3 and RCNTL(11) for more details on how to control this dropping strategy.

2.3 Sparse approximation based on partial ILU(t,p)

One can design a computationally and memory cheaper alternative to approximate the local Schur complements S_i. Among the possibilities, we consider in this report a variant based on ILU(t,p) that is also implemented in pARMS [15]. The approach consists in applying a partial incomplete factorisation to the local matrix A_i. The incomplete factorisation is only run on the interior block A_{I_i I_i}, and it computes its ILU factors L_i and U_i using the dropping threshold t_factor.
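Coming back to the dropping rule of Equation (12) above, the following hedged sketch (plain Fortran, all names hypothetical) shows the corresponding sparsification loop applied to a dense assembled local Schur complement of order nschur, with xi playing the role of the threshold RCNTL(11):

  Integer :: i, j, nschur
  Real(kind=8), Allocatable :: sbar(:,:)   ! assembled local Schur complement
  Real(kind=8) :: xi                       ! dropping threshold, e.g. the value of RCNTL(11)
  ! ... nschur, sbar and xi are assumed to be set ...
  Do j = 1, nschur
     Do i = 1, nschur
        If (abs(sbar(i,j)) < xi * (abs(sbar(i,i)) + abs(sbar(j,j)))) Then
           sbar(i,j) = 0.0d0               ! entry is dropped
        End If
     End Do
  End Do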
12. dense submatrices that can then be proces sed with computational intensive standard dense linear algebra kernels such as BLAS LAPACKand SCALAPACK Sparse direct solvers have been for years the methods of choice for solving linear systems of equations because of their reliable numerical be havior 13 There are on going efforts in further improving existing parallel packages However it is admitted that such approaches are intrinsically not scalable in terms of computational complexity or memory for large problems such as those arising from the discretization of large 3 dimensional partial differential equations PDEs Further more the linear systems involved in the numerical simulation of complex phenomena result from some modeling and discretization which contains some uncertainties and approximation errors Consequently the highly accurate but costly solution provided by stable Gaussian eliminations might not be mandatory Iterative methods on the other hand generate sequences of approximations to the solution either through fixed point schemes or via search in Krylov subspaces 20 The best known representatives of these latter numerical techniques are the conjugate gradient and the GMRES methods These methods have the advantage that the memory requirements are small Also they tend to be easier to parallelize than direct methods However the main problem with this class of methods is the rate of convergence which depends on the properti
13. direct sol vers within the many core nodes of the machine and message passing among the nodes to implement the gluing parallel iterative scheme The general underlying ideas are not new They have been used to design domain decomposition techniques for the numerical solution of PDEs 16 Domain decomposition refers to the splitting of the computational domain into sub domains with or without overlap The splitting strategies are generally governed by various constraints objectives but the main one is to enhance parallelism The numerical pro perties of the PDEs to be solved are usually extensively exploited at the continuous or discrete levels to design the numerical algorithms Consequently the resulting spe cialized technique will only work for the class of linear systems associated with the targeted PDEs In our work we develop domain decomposition techniques for general unstructured linear systems More precisely we consider numerical techniques based on a non overlapping decomposition of the graph associated with the sparse matrices The vertex separator constructed using graph partitioning 14 5 defines the interface variables that will be solved iteratively using a Schur complement approach while the variables associated with the interior sub graphs will be handled by a sparse direct sol ver Although the Schur complement system is usually more tractable than the original problem by an iterative technique preconditioning treatment is s
14. estimation in MegaBytes of the peak of the internal data allocated by the MaPHyS Process It only takes into account the allocated data with the heaviest memory cost Schur preconditioner factors of local system IINFO 25 estimation in MegaBytes of the internal data allocated by the MaPHyS Process It only takes into account the allocated data with the heaviest memory cost Schur preconditioner factors of local system IINFO 26 memory usage of the iterative solver It is only available after the solution step IINFO 27 memory peak of the iterative solver It is only available after the solution step IINFO 28 memory used in MAPH YSbuffers It is available after the analysis step IINFO 29 memory peak in MAPHySbuffers It is available after the analysis step IINFO 30 The internal memory usage of the sparse direct solver library MUMPS PaSTixX etc It is available after the factorisation step Note that an external PaSTiX instance may influence this data 31 IINFO 31 internal memory peak of sparse direct solver library MUMPS PaSTixX etc It is available after the factorisation step Note that an external PaSTiX instance may influence this data IINFO 32 The identifier of the node from 1 to the number of nodes It is available after the initialization step IINFO 33 The number of domains associated to the node It is available after the initialization step IINFO 34 The amount of memory used on the
15. 12.1.3 System description
Key : JOB
Fields : JOB
Type : Integer (kind 4)
Description : set the job to perform.
Constraints : those of JOB.
Key : ICNTL(i)
Fields : ICNTL(i)
Type : Integer (kind 4)
Description : set an ICNTL parameter.
Constraints : those of ICNTL.
Key : RCNTL(i)
Fields : RCNTL(i)
Type : Real (kind 8)
Description : set an RCNTL parameter.
Constraints : those of RCNTL.
Note : for more details on the possible values for ICNTL and RCNTL you may refer to Section 9.

12.2 A cube generation problem
Another example that we provide is a cube generation problem that you can run using the command line make testdistsys. These examples of MAPHYS are stored in the directory examples. The driver using MAPHYS is <Arithms>_testdistsys, a driver that uses an input .in file, where <Arithms> is one of the different arithmetics that can be used: smph for real arithmetic, dmph for double precision arithmetic, cmph for complex arithmetic, zmph for double complex arithmetic. Only the tests matching the arithmetics chosen during compilation are available. In order to launch these tests you need to use the command:

  mpirun -np <nbproc> <Arithms>_testdistsys CubeFile <InputFile>

where CubeFile is a string that contains the link to the cube description file you want to use.
16. 1 : thread-to-core binding.
2 : grouped binding.
Note 1 : be careful that if you use more threads than cores available, no binding will be performed.
Note 2 : depending on the problem you want to solve, it could be better to use more MPI processes than threads; you can find a study on some problems in the PhD thesis of Stojce Nakov [22].

9 Control parameters
The execution of MAPHYS can be controlled by the user through the two arrays mphs%ICNTL and mphs%RCNTL. When an execution control should be expressed as an integer (or an equivalent), this control is an entry of mphs%ICNTL; when an execution control should be expressed as a float (or an equivalent), this control is an entry of mphs%RCNTL.

9.1 ICNTL
mphs%ICNTL is an integer array of size MAPHYS_ICNTL_SIZE.

ICNTL(1) controls where error messages are written: the Fortran unit used to write error messages. A negative value means do not print error messages. Default : 0 (stderr)
ICNTL(2) controls where warning messages are written: the Fortran unit used to write warning messages. A negative value means do not print warning messages. Default : 0 (stderr)
ICNTL(3) controls where statistics messages are written: the Fortran unit used to write statistics messages. A negative value means do not print statistics. Default : 6 (stdout)
ICNTL(4) controls the verbosity of MaPHyS:
< 0 : do not print output
1 : print errors
2 : print errors, warnings and main statistics
17. After the solve step, on the host, it holds the solution. The current version does not support multiple right-hand sides.

6.7 Finite element interface (experimental)
6.7.1 How to use it
MAPHYS provides a distributed interface based on a finite element description. The choice of interface is controlled by ICNTL(43) (see Section 9). In order to use the finite element interface, the user has to call, before the analysis step, a routine called xmph_create_domain defined as follows:

  xmph_create_domain(mphs, myndof, myndofinterior, myndofintrf, lenindintrf, mysizeintrf, gballintrf, mynbvi, myindexvi, myptrindexvi, myindexintrf, myinterface, myndoflogicintrf, mylogicintrf)

The parameters are defined as follows:
mphs : the MAPHYS instance.
myndof : integer equal to the number of degrees of freedom of the entire domain (interface + interior).
myndofinterior : integer equal to the number of degrees of freedom of the domain interior.
myndofintrf : integer equal to the number of degrees of freedom of the domain interface.
lenindintrf : integer equal to the size of the array myindexintrf.
mysizeintrf : integer equal to the size of the domain interface.
gballintrf : integer equal to the total size of the interfaces over all the processes.
mynbvi : integer equal to the number of neighbours of the process.
myindexvi : list of the neighbours of the domain, identified by the rank of their process.
18. You can see some examples of these files in the templates available in the same directory examples, such as:
real_bcsstk17.in : the input file to solve Ax = b with A the bcsstk17 matrix (real SPD, from Matrix Market); b is stored in a file; the solution x should be the vector of ones.
real_bcsstk17_noRHS.in : the input file to solve Ax = b with A the bcsstk17 matrix (real SPD, from Matrix Market); b is generated such that the solution x is a pseudo-random vector.
complex_young1c.in : the input file to solve Ax = b with A the young1c matrix (complex unsymmetric, from Matrix Market); b is generated such that the solution x is a pseudo-random vector.
In order to launch these tests you need to use the command:

  mpirun -np <nbproc> <Arithms>_examplekv <ExampleFile>

where ExampleFile is a string that contains the link to the input file you want to use and <nbproc> is the number of processes you want to use. For example, you can use the following command:

  mpirun -np 4 smph_examplekv real_bcsstk17.in

Warning : make sure that <nbproc> is a power of two, because MaPHyS is based on a dissection method.

12.1.1 How to write an input file
The input .in files are files in free format. An example can be found in template.in. They should be written as follows: lines beginning with the comment character are comment lines; empty lines are allowed; setting a parameter is based on key-value pairs, in the format described below and illustrated in the sample sketched after this section.
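For illustration, a minimal input file could look like the sketch below. This is a hypothetical example: the keys (MATFILE, SYM, RHSFILE, ICNTL, RCNTL) are the ones documented in Sections 12.1.2 and 12.1.3, but the exact separator and comment syntax should be checked against the shipped template.in, and the file names and numerical values are purely illustrative.

  MATFILE    matrices/bcsstk17.mtx
  SYM        1
  RHSFILE    matrices/bcsstk17_rhs.ijv
  ICNTL(24)  500
  ICNTL(26)  100
  RCNTL(21)  1.0e-8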
19.   0; i < 1000; i++) {
    a[2998 + i]   = ...;   /* the values and column indices of the last three
                              bands are garbled in the source listing */
    irn[2998 + i] = i + 1;
    jcn[2998 + i] = ...;
  }
  for (i = 0; i < 1000; i++) {
    a[3998 + i]   = ...;
    irn[3998 + i] = i + 1;
    jcn[3998 + i] = ...;
  }
  for (i = 0; i < 1000; i++) {
    a[4998 + i]   = ...;
    irn[4998 + i] = i + 1;
    jcn[4998 + i] = ...;
  }
  id.sym = 2;
  /* Define the problem on the host */
  if (myid == 0) {
    id.n      = n;
    id.nnz    = nnz;
    id.rows   = irn;
    id.cols   = jcn;
    id.sol    = sol;
    id.values = a;
    id.rhs    = rhs;
  }
#define ICNTL(I) icntl[(I)-1]   /* macro s.t. indices match documentation */
#define RCNTL(I) rcntl[(I)-1]   /* macro s.t. indices match documentation */
  /* Call the Maphys package */
  id.ICNTL(6)  = 2;
  id.ICNTL(24) = 500;
  id.ICNTL(26) = 500;
  id.ICNTL(27) = 1;
  id.ICNTL(22) = 3;
  id.RCNTL(21) = 1.0e-5;
  id.RCNTL(11) = 1.0e-6;
  id.job = 6;
  dmph_maphys_driver_c(&id);
  id.job = JOB_END;
  dmph_maphys_driver_c(&id);
  /* Terminate instance */
  if (myid == 0) printf("Ending maphys ok\n");
  ierr = MPI_Finalize();
  if (ierr != 0) fprintf(stderr, "Problem with mpi finalise\n");
  return 0;
}

13 Notes on the MaPHyS distribution

13.1 License
MAPHYS is under the CeCILL-C license; you can find the description of the license in the file doc/CeCILL-C.

13.2 How to refer to MAPHYS
In order to refer to the MAPHYS solver, you can cite the reference article.

13.3 Authors
People involved in the MAPHYS development are: Azzam Haidar, Luc
20. RINFOG(1) : UNUSED in current version.
RINFOG(2) : value of the scaled residual \|Ax - b\| / \|b\|, estimated in centralized form, where \|\cdot\| is the Euclidean norm, A the input matrix, x the computed solution and b the given right-hand side. It is available after the solve step.
RINFOG(3) : the backward error of the Schur system. It is computed during the solve step and is available on each processor.
RINFOG(4) : value of \|Ax - b\| / \|b\| estimated in distributed form. It is thus the same value as RINFOG(2), but computed on the distributed data.

11 Error diagnostics
In the current version, on errors MAPHYS will:
1. print a trace if it was compiled with the option flag -DMAPHYS_BACKTRACE_ERROR;
2. exit by calling MPI_Abort on the MPI communicator mphs%COMM.
Messages in the trace look like:
  00002 ERROR dmph_maphys_mod.F90 line 278 : Check Failed
Here it means that a check failed on process rank 2, in file dmph_maphys_mod.F90, at line 278.

12 Examples of use of MaPHyS
Here are some drivers that we provide to illustrate the use of MaPHyS.

12.1 A matrix problem
These examples of MAPHYS are stored in the directory examples. The driver using MAPHYS is <Arithms>_examplekv, a driver that uses an input .in file, where <Arithms> is one of the different arithmetics that can be used: smph for real arithmetic, dmph for double precision arithmetic, cmph for complex arithmetic, zmph for double complex arithmetic. Only the tests matching the arithmetics chosen during compilation are available.
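Relating to the error diagnostics of Section 11 and the IINFOG fields of Section 10.2 above, here is a minimal hedged sketch of checking the instance status after a driver call (double precision interface; the field names follow Section 10.2, ierr is assumed declared, and the error-handling policy is up to the user):

  Call DMPH_maphys_driver(mphs)
  If (mphs%iinfog(1) < 0) Then
     Write(*,*) 'MaPHyS failed, status = ', mphs%iinfog(1)
     If (mphs%iinfog(1) == -1) Write(*,*) 'failing process = ', mphs%iinfog(2)
     Call MPI_Abort(mphs%comm, 1, ierr)
  End If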
21. Giraud, Emmanuel Agullo, Yohan Lee-tin-yien, Julien Pedron, Stojce Nakov.
22. INFO 8 Time to preprocess the input matrix It is only available on the host after the analysis step RINFO 9 Time to convert the input system into local system It is available after the analysis step RINFO 10 Time to extract subblocks of the local matrix It is available after the analysis step RINFO 11 Time to factorize the local matrix It is available after the factorisation step If the Schur complement is computed by the sparse direct solver It in cludes the time to compute the Schur complement matrix 32 RINFO 12 Time to compute the Schur complement or its approximation It is avai lable after the factorisation step If the Schur complement is computed by the sparse direct solver this is zero Otherwise it is the time to compute the Schur complement from factorization Or to estimate the Schur complement with ILUT RINFO 13 Time to assemble the Schur complement or its approximation It is avai lable after the preconditioning step RINFO 14 Time to factorize the assembled Schur complement the preconditioner It is available after the preconditioning step RINFO 15 the solve step execution time to distribute the global right hand side RINFO 16 In the solve step execution time to generate right hand side of the itera tive solver RINFO 17 In the solution step execution time of the iterative solver to solve the interface 18 In the solution step execution time of the sparse direct solver t
23. The partial factorization reads

\[
A_i =
\begin{pmatrix} A_{I_i I_i} & A_{I_i \Gamma_i} \\ A_{\Gamma_i I_i} & A_{\Gamma_i \Gamma_i} \end{pmatrix}
\approx
\begin{pmatrix} L_i & 0 \\ A_{\Gamma_i I_i} U_i^{-1} & I \end{pmatrix}
\begin{pmatrix} U_i & L_i^{-1} A_{I_i \Gamma_i} \\ 0 & \tilde{S}_i^{pILU} \end{pmatrix},
\quad \text{where} \quad
\tilde{S}_i^{pILU} = A_{\Gamma_i \Gamma_i} - (A_{\Gamma_i I_i} U_i^{-1})\,(L_i^{-1} A_{I_i \Gamma_i}).
\]

The incomplete factors are then used to compute an approximation of the local Schur complement. Because our main objective is to get an approximation of the local Schur complement, we switch to another, less stringent, threshold parameter t_schur to compute the sparse approximation of the local Schur complement. Such a calculation can be performed using an IKJ variant of the Gaussian elimination [20], where the L factor is computed but not stored, as we are only interested in an approximation of S_i. This further alleviates the memory cost. Contrary to the approach of Section 2.2 (Equation (12)), the assembly step can only be performed after the sparsification, which directly results from the ILU process; as a consequence, the assembly step is performed on sparse data structures, thanks to a few neighbour-to-neighbour communications, to form \bar{\tilde{S}}_i from the \tilde{S}_j^{pILU}. The preconditioner finally reads

\[
\tilde{M} = \sum_{i=1}^{N} R_{\Gamma_i}^{T}\, \bar{\tilde{S}}_i^{-1}\, R_{\Gamma_i}. \qquad (14)
\]

Note : to use this method you should set ICNTL(30) = 2; the control parameters related to the ILUT method are ICNTL(8), ICNTL(9), RCNTL(8) and RCNTL(9).

2.4 Exact vs approximated Schur algorithm to build the preconditioner

In a full MPI implementation, the domain decomposition strategy is followed to assign each local PDE problem (subdomain) to one MPI process that works independently of other processes and exchange d
24. MAPHYS USER'S GUIDE 0.9.2

This document explains the Fortran 90 interface of the package MAPHYS, the Massively Parallel Hybrid Solver.
Author : HiePaCS Team, INRIA Bordeaux, France

Table of contents

1 General overview
2 General introduction
  2.1 Parallel hierarchical implementation
  2.2 Dropping strategy to sparsify the preconditioner
  2.3 Sparse approximation based on partial ILU(t,p)
  2.4 Exact vs approximated Schur algorithm to build the preconditioner
3 Installation
  3.1 Requirements
  3.2 Compilation and installation
4 Main functionalities of MaPHyS 0.9.2
  4.1 Input matrix structure
  4.2 Arithmetic versions
  4.3 The working host processor
5 Sequence in which routines are called
6 Input and output parameters
  6.1 Maphys instance
  6.2 Version number
  6.3 Control of the four main phases
  6.4 Control of parallelism
  6.5 Matrix type
  6.6 Centralized assembled matrix input
    6.6.1 Element matrix input
    6.6.2 Right-hand side and solution vectors/matrices
  6.7 Finite element interface (experimental)
    6.7.1 How to use it
25. ICNTL(42) is the control parameter that activates the two-level parallelism version.
0 : only MPI will be used.
1 : activate multithreading + MPI.
Default : 0
Note : if the two levels of parallelism are activated, you also have to set the parameters ICNTL(36) to ICNTL(39) (see Section 8).

ICNTL(43) specifies the input system: centralized on the host, or distributed. Enumeration : ICNTL_INSYSTEM.
1 : the input system is centralized on the host (enumeration INSYSTEM_IsCentralized): the input matrix and the right-hand side are centralized on the host. The user must provide the global system by setting, on the host, the global matrix before the analysis step (mphs%rows, mphs%cols, etc.) and the global right-hand side before the solve step (mphs%rhs).
2 : the input system is distributed (enumeration INSYSTEM_IsDistributed): the input matrix and the right-hand side are distributed. The user must provide, on each MPI process, the local system by setting the local matrix before the analysis step (mphs%rows, mphs%cols, etc.), the local right-hand side before the solve step (mphs%rhs), and the data distribution before the analysis step (mphs%lc_domain).
Warning : see the directory gendistsys (testdistsys) for an example.
Default : 1

9.2 RCNTL
mphs%RCNTL is a real array of size MAPHYS_RCNTL_SIZE.

RCNTL(8) specifies the threshold used to sparsify the LU factors while using PILUT.
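Relating to ICNTL(43) above, here is a hedged sketch of selecting the distributed input interface. The local arrays (loc_rows, loc_cols, loc_values, loc_rhs) are hypothetical and must be built by the user, and the exact name of the data-distribution field should be checked against the gendistsys/testdistsys example.

  mphs%icntl(43) = 2            ! the input system is distributed
  ! on every MPI process, before the analysis step:
  mphs%rows   => loc_rows       ! local matrix in coordinate format
  mphs%cols   => loc_cols
  mphs%values => loc_values
  ! ... plus the data distribution (the mphs%lc_domain field mentioned above) ...
  ! on every MPI process, before the solve step:
  mphs%rhs    => loc_rhs        ! local right-hand side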
26. The matrix S is referred to as the Schur complement matrix. Because most of the fill-in appears in the Schur complement, the Schur complement system is solved using a preconditioned Krylov subspace method, while the interior subdomain systems are solved using a sparse direct solver. Although the Schur complement system is significantly better conditioned than the original matrix A, it is important to consider further preconditioning when employing a Krylov method.
In MaPHyS, several overlapping block diagonal preconditioning techniques are implemented, where each diagonal block is associated with the interface of a subdomain:
1. dense block diagonal factorization: each diagonal block is factorized using the appropriate LAPACK subroutine;
2. sparsified block diagonal factorization: the dense diagonal blocks are first sparsified by dropping an entry s_ij if |s_ij| <= \xi (|s_ii| + |s_jj|) (see Section 2.2); the sparse factorization is then performed by a sparse direct solver;
3. sparse approximation of the diagonal block: a sparse approximation of the diagonal block is computed by replacing A_{I_i I_i} by an approximation obtained with an incomplete ILU(t,p) factorization; the computed approximation of the Schur complement is further sparsified.
Because of its diagonal (and consequently local) nature, the preconditioner tends to be less efficient when the number of subdomains is increased. An efficient solution using MaPHyS results from a trade-off between two contradictory
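To relate the three strategies above to the control parameters of Section 9, here is a hedged sketch (values illustrative) of selecting the sparsified block diagonal preconditioner, strategy 2, together with its dropping threshold:

  mphs%icntl(21) = 2            ! sparse factorization of the assembled local Schur
  mphs%rcntl(11) = 1.0d-4       ! dropping threshold xi (default 1.e-4)
  ! alternatives:
  ! mphs%icntl(21) = 1          ! dense block diagonal factorization (strategy 1)
  ! mphs%icntl(21) = 3          ! PILUT-based sparse approximation (strategy 3), with ICNTL(30) = 2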
27. and is accessed during the factorisation by each process. The user must set this parameter while using ILUT, before the factorisation. See also ICNTL(30), ICNTL(8), RCNTL(8) and RCNTL(9).
Warning : the default value is a bad value. Default : -1
Note : for more details on the ILU method you may refer to Section 2.3.
ICNTL(10) to ICNTL(12) : UNUSED in current version.
ICNTL(13) : the sparse direct solver package to be used by MAPHYS. It is accessed during the factorization and the preconditioning step by each MPI process.
1 : use MUMPS
2 : use PaSTiX
3 : use multiple sparse direct solvers (see ICNTL(15) and ICNTL(32))
Default : the first available, starting from index 1.
ICNTL(14) : UNUSED in current version.
ICNTL(15) : the sparse direct solver package to be used by MAPHYS for computing the preconditioner. It is accessed during the preconditioning step by each MPI process.
1 : use MUMPS
2 : use PaSTiX
Default : value of ICNTL(13).
ICNTL(16) to ICNTL(19) : UNUSED in current version.
ICNTL(20) : controls the 3rd party iterative solver used.
0 : unset
1 : use Pack GMRES
2 : use CG
3 : use CG if the matrix is SPD, Pack GMRES otherwise
Default : 3
ICNTL(21) : the strategy for preconditioning the Schur complement.
Warning : if ICNTL(30) = 2 and ICNTL(21) /= 4, then ICNTL(21) must be 3.
1 : the preconditioner is the dense factorization of the assembled local Schur.
2 : the preconditioner is the sparse factorization of the assembled
28. ata using message passing In that computa tional framework the implementation of the algorithms based on preconditioners built from the exact or approximated local Schur complement only differ in the preliminary phases The parallel SPMD algorithm is as follow 1 Initialization phase Exact Schur using the sparse direct solver we compute at once the LU factorization of A and the local Schur complement S using A as input matrix Approximated Schur using the sparse direct solver we only compute the LU factorization of A then we compute the approximation of the local Schur complement S0 by performing a partial I LU factorization of A 2 Set up of the preconditioner Exact Schur we first assemble the diagonal problem thanks to few neigh bour to neighbour communications computation of S we sparsify the assembled local Schur 1 9 that 15 then factorized Approximated Schur we assemble the sparse approximation also thanks to few neighbour to neighbour communications and we factorize the resulting sparse approximation of the assembled local Schur 3 Krylov subspace iteration the same numerical kernels are used The only dif ference is the sparse factors that are considered in the preconditioning step de pendent on the selected strategy exact v s approximated 10 From a high performance computing point of view the main difference relies in the computation of the local Schur
29. The local matrices according to this decomposition are described in Figure 2. The user has to describe this matrix and the right-hand side as in the centralized interface, except that the matrix and rhs given will be the local ones and not the global ones.

FIGURE 2 - The different matrices with their local and global orderings. The interface nodes are in red, the interior nodes are in black. (The figure lists, for each of the four domains, the global indices (glob) and the corresponding local indices (loc) of its unknowns.)

7 Writing a matrix to a file
The user may get the preprocessed input matrix (assembled and possibly symmetrised in structure) by setting the string mphs%WRITE_MATRIX on process 0 before the analysis step (mphs%JOB = 1). This matrix will be written during the analysis step to the file mphs%WRITE_MATRIX in Matrix Market format.

8 Using two levels of parallelism inside MAPHYS
The current release of MAPHYS supports two levels of parallelism (MPI + threads). In order to activate multithreading inside MAPHYS, you have to set some control parameters:
- activate the multithreading inside MAPHYS;
- the number of nodes (ICNTL(38));
- the number of cores on each node;
- the number of MPI processes;
- the number of threads that you want to use inside each MPI process.
You can also set ICNTL(36) in order to specify the binding you want to use inside MAPHYS, as follows:
0 : do not bind.
30. complement In the exact situation this calcula tion is performed using sparse direct techniques which make intensive use of BLAS 3 technology as most of the data structure and computation schedule are performed in a symbolic analysis phase when fill in is analyzed For partial incomplete factoriza tion because fill in entries might be dropped depending on their numerical values no prescription of the structure of the pattern of the factors can be symbolically compu ted Consequently this calculation is mainly based on sparse BLAS 1 operations that are much less favorable to an efficient use of the memory hierarchy and therefore less effective in terms of their floating point operation rate In short the second case leads to fewer operations but at a lower computing rate which might result in higher ove rall elapsed time in some situations Nevertheless in all cases the approximated Schur approach consumes much less memory as illustrated later on in this report Note To choose between each strategy you should use ICNTL 30 R f rences 1 R Barrett M Berry T F Chan J Demmel J Donato J Dongarra V Eijkhout R Pozo C Romine and H Van der Vorst Templates for the Solution of Linear Systems Building Blocks for Iterative Methods SIAM Philadelphia PA second edition 1994 2 J F Bourgat R Glowinski P Le Tallec and M Vidrascu Variational formula tion and algorithm for trace operator in domain decomposition ca
31. 3 : print errors, warnings and several detailed statistics
4 : print errors, warnings and verbose detailed statistics
> 4 : debug level
Default : 3
ICNTL(5) : controls when to print the list of controls (cntl). This option is useful to check if a control is taken into account.
0 : never print
1 : print once at the beginning of the analysis
2 : print at the beginning of each step
Default : 0
ICNTL(6) : controls when to print the list of informations (info). This option is useful while performing benchmarks.
0 : never print
1 : print once at the end of the solve
2 : print at the end of each step
Default : 0
ICNTL(7) : UNUSED in current version.
ICNTL(8) : controls the partial ILUT solver: the level of filling for L and U in the ILUT method, that is, the maximum number of entries per row in the LU factorization. This parameter is only relevant while using ILUT and is accessed during the factorisation by each process. The user must set this parameter while using ILUT, before the factorisation. See also ICNTL(30), ICNTL(9), RCNTL(8) and RCNTL(9).
Warning : the default value is a bad value. Default : -1
Note : for more details on the ILU method you may refer to Section 2.3.
ICNTL(9) : controls the partial ILUT solver: the level of filling for the Schur complement in the ILUT method, that is, the maximum number of entries per row in the Schur complement. This parameter is only relevant while using ILUT
32. local Schur; you have to specify the threshold using RCNTL(11).
3 : the preconditioner is the factorization of a sparse matrix, where the sparse matrix is the assembly of the approximated local Schur computed with the partial incomplete LU (PILUT); you have to specify the threshold using RCNTL(9).
4 : no preconditioner.
5 : autodetect; the order of preference is 1 > 2 > 3 > 4.
Default : 2
Note : for more details on the dropping strategy (ICNTL(21) = 2 or 3) you may refer to Section 2.2 and to the parameters RCNTL(11) and RCNTL(9) (for ICNTL(21) = 2 and 3 respectively).
ICNTL(22) : controls the iterative solver. Determines which orthogonalisation scheme to apply:
0 : modified Gram-Schmidt orthogonalization (MGS)
1 : iterative modified Gram-Schmidt orthogonalization (IMGS)
2 : classical Gram-Schmidt orthogonalization (CGS)
3 : iterative classical Gram-Schmidt orthogonalization (ICGS)
Default : 3
ICNTL(23) : controls the iterative solver. Controls whether the user wishes to supply an initial guess of the solution vector in mphs%SOL. It must be equal to 0 or 1. It is accessed by all processes during the solve step.
0 : initialization guess, start from the null vector.
1 : user-defined initialization guess.
Default : 0
ICNTL(24) : controls the iterative solver. It defines the maximum number of iterations (cumulated over restarts) allowed. It must be equal to or larger than 0, and be the same on each process.
33. ideas: increasing the number of domains reduces the cost of the sparse factorization of the interior domains, while reducing the number of domains makes the iterative solution of the interface problem easier.

2 General introduction
Solving large sparse linear systems Ax = b, where A is a given matrix, b is a given vector and x is an unknown vector to be computed, appears often in the innermost loops of intensive simulation codes. It is consequently the most time-consuming computation in many large-scale computer simulations in science and engineering, such as computational incompressible fluid dynamics, structural analysis, wave propagation and the design of new materials in nanoscience, to name a few. Over the past decade or so, several teams have been developing innovative numerical algorithms to exploit advanced high performance large-scale parallel computers to solve these equations efficiently. There are two basic approaches for solving linear systems of equations: direct methods and iterative methods. Those two large classes of methods have somehow opposite features with respect to their numerical and parallel implementation efficiencies.
Direct methods are based on the Gaussian elimination, which is probably among the oldest methods for solving linear systems. Tremendous effort has been devoted to the design of sparse Gaussian eliminations that efficiently exploit the sparsity of the matrices. These methods indeed aim at exhibiting
34. Setting a parameter is based on key-value pairs, in the following format: KEY VALUE.

12.1.2 System description
Key : MATFILE
Fields : I, J, V, N, NNZ
Type : string
Description : path to the matrix file.
Constraints : supported file extensions are .mtx or .mm (Matrix Market), .ijv (coordinate format), .rsa (Harwell-Boeing).
Key : SYM
Fields : SYM
Type : Integer (kind 4)
Description : specifies the matrix symmetry; if given, it supersedes the symmetry read in MATFILE.
Constraints : supported values are 0 (general), 1 (SPD), 2 (symmetric).
Key : RHSFILE
Fields : RHS
Type : string
Description : path to the file containing the right-hand side.
Constraints : supported file extension is .ijv (coordinate format).
Note 1 : if the file is invalid or not set, the right-hand side is generated from a pseudo-random solution.
Key : INITGESS
Fields : SOL
Type : string
Description : path to the file containing the initial guess.
Constraints : supported file extension is .ijv (coordinate format).
Note 1 : the user must also set ICNTL(23) to 1 to activate this option.
Note 2 : if not set, the initial guess is the null vector.
Key : OUTRHSFILE
Fields : RHS
Type : string
Description : path to the file where to write the rhs.
Constraints : must be non-empty; written only by process 0.
Key : OUTSOLFILE
Fields : SOL
Type : string
Description : path to the file where to write the solution.
Constraints : must be non-empty; written only by process 0.
35. IINFO(7) : flag indicating which sparse direct solver was used to compute the factorization. It is available after the factorisation step.
IINFO(8) : flag indicating how the Schur complement was computed. It is available after the factorisation step.
IINFO(9) : size in MegaBytes used to store the factors of the local matrix, if available. It is available after the factorisation step. A negative value means the data is unavailable; otherwise it is the size in MegaBytes.
IINFO(10) : the number of static pivots during the factorization. It is available after the factorisation step. A negative value means the data is unavailable; otherwise it is the number of pivots.
IINFO(11) : local Schur complement order. It is available after the analysis step.
IINFO(12) : size in MegaBytes used to store the Schur complement or its approximation. It is available after the factorisation step.
IINFO(13) : number of entries in the approximation of the local Schur complement (only available with ILU). It is available after the factorisation step.
IINFO(14) : preconditioner strategy: the flag indicating which strategy was used to compute the preconditioner (local dense, local sparse, etc.). It is available after the preconditioning step.
IINFO(15) : indicates which sparse direct solver was used to compute the preconditioner. It is available after the preconditioning step.
36. es of the matrix Very effective techniques have been designed for classes of problems such as the multigrid methods 10 that are well suited for specific problems such as those arising from the discretization of elliptic PDEs One way to improve the convergence rate of Krylov subspace solver is through preconditioning and fixed point iteration schemes are often used as preconditioner In many computational science areas highly accurate solutions are not required as long as the quality of the computed solution can be assessed against measurements or data uncertainties In such a framework the iterative schemes play a central role as they might be stopped as soon as an accurate enough solution is found In our work we consider stopping criteria based on the backward error analysis 7 9 Our approach to high performance scalable solution of large sparse linear sys tems in parallel scientific computing is to combine direct and iterative methods Such hybrid approach exploits the advantages of both direct and iterative methods The iterative component allows us to use a small amount of memory and provides a natu ral way for parallelization The direct part provides its favorable numerical properties Furthermore this combination enables us to exploit naturally several levels of paral lelism that logically match the hardware feature of emerging many core platforms as stated in Section 1 In particular we can use parallel multi threaded sparse
37. myptrindexvi : pointer to the first node of the interface shared with each neighbouring domain.
myindexintrf : list of the nodes of the domain interface; indices are in local ordering.
myinterface : conversion from local ordering to global ordering.
myndoflogicintrf : integer corresponding to the number of nodes in the interface for which the domain is responsible.
mylogicintrf : list of the nodes that this domain is responsible for, in the local ordering.

FIGURE 1 - The separators are circled in red, the numbers in bold are the process ids, the node numbers are in global ordering.

6.7.2 Example with finite element interface
According to this input graph of the matrix, already distributed among the processes as described in Figure 1, the user will have to call the function xmph_create_domain with the parameters defined as follows for each process.
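A hedged sketch of the corresponding call is given below. The leading x presumably stands for the arithmetic prefix (s, d, c or z), and all local arrays and sizes are hypothetical data that the caller must build from its own partitioning (for instance the values listed in the figure referenced above for domains 0 to 3):

  Call xmph_create_domain(mphs, myndof, myndofinterior, myndofintrf,   &
                          lenindintrf, mysizeintrf, gballintrf,        &
                          mynbvi, myindexvi, myptrindexvi,             &
                          myindexintrf, myinterface,                   &
                          myndoflogicintrf, mylogicintrf)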
38. the Schur complement to the interface Γ_i. This local assembled preconditioner can be built from the local Schur complements S_i by assembling their diagonal blocks. With these notations, the preconditioner reads

\[
M = \sum_{i=1}^{N} R_{\Gamma_i}^{T}\, \bar{S}_i^{-1}\, R_{\Gamma_i}. \qquad (10)
\]

If we consider a planar graph partitioned into horizontal strips (1D decomposition), the resulting Schur complement matrix has a block tridiagonal structure, as depicted in Equation (11):

\[
S = \begin{pmatrix}
\ddots & \ddots & & & \\
\ddots & S_{k-1,k-1} & S_{k-1,k} & & \\
 & S_{k,k-1} & S_{k,k} & S_{k,k+1} & \\
 & & S_{k+1,k} & S_{k+1,k+1} & \ddots \\
 & & & \ddots & \ddots
\end{pmatrix}. \qquad (11)
\]

For that particular structure of S, the submatrices in boxes correspond to the \bar{S}_i. Such diagonal blocks, which overlap, are similar to the classical block overlap of the Schwarz method when written in matrix form for a 1D decomposition. Similar ideas have been developed in a pure algebraic context in earlier papers for the solution of general sparse linear systems. Because of this link, the preconditioner defined by Equation (10) is referred to as algebraic additive Schwarz for the Schur complement. One advantage of using the assembled local Schur complements \bar{S}_i instead of the local Schur complements S_i, as in the Neumann-Neumann preconditioner [2, 6], is that in the SPD case the assembled Schur complements cannot be singular, as S is SPD.
The original idea of non-overlapping domain decomposition methods consists in subdividing the graph into sub-graphs that are individually mapped to one processor. With this data distribution,
39. Default : 0.2
RCNTL(21) specifies the stopping criterion threshold of the iterative solver. It must be above 0. Default : 1.e-5

9.3 Compatibility table
Some values of the control parameters cannot be used at the same time; here is a table that summarizes the conflicts between the various control parameters.

Control parameter values : incompatibility
ICNTL(30) = 0 : ICNTL(8), ICNTL(9), RCNTL(8) and RCNTL(9) are unused.
ICNTL(30) = 2 and ICNTL(21) /= 4 : ICNTL(15) and ICNTL(32) are unused; ICNTL(21) must be forced to 3.
ICNTL(20) = 2 or 3 : ICNTL(25) is unused.
ICNTL(21) /= 2 or 3 : RCNTL(11) is unused.

10 Information parameters
Values returned to the user so that he/she has feedback on the execution.

10.1 Information local to each processor
10.1.1 IINFO
IINFO(1) : the status of the instance. 0 means success, < 0 means failure, > 0 means warning.
IINFO(2) : UNUSED in current version.
IINFO(3) : flag indicating the strategy used to partition the input linear system. It is available after the analysis step.
IINFO(4) : local matrix order. It is available after the analysis step.
IINFO(5) : size in MegaBytes used to store the local matrix and its subblocks. It is available after the analysis step.
IINFO(6) : number of entries in the local matrix. It is available after the analysis step.
40. lculations In Tony Chan Roland Glowinski Jacques P riaux and Olof Widlund editors Do main Decomposition Methods pages 3 16 Philadelphia 1989 SIAM 3 X C Cai and Y Saad Overlapping domain decomposition algorithms for general sparse matrices Numerical Linear Algebra with Applications 3 221 237 1996 4 L M Carvalho L Giraud and G Meurant Local preconditioners for two level non overlapping domain decomposition methods Numerical Linear Algebra with Applications 8 4 207 227 2001 5 C Chevalier and F Pellegrini PT SCOTCH a tool for efficient parallel graph ordering Parallel Computing 34 6 8 2008 6 Y H De Roeck and P Le Tallec Analysis and test of a local domain decom position preconditioner In Roland Glowinski Yuri Kuznetsov G rard Meurant Jacques P riaux and Olof Widlund editors Fourth International Symposium on Domain Decomposition Methods for Partial Differential Equations pages 112 128 SIAM Philadelphia PA 1991 7 J DrkoSova M Rozloznik Z Strako and A Greenbaum Numerical stability of the GMRES method BIT 35 309 330 1995 8 L Giraud A Haidar and L T Watson Parallel scalability study of hybrid pre conditioners in three dimensions Parallel Computing 34 363 379 2008 9 A Greenbaum Iterative methods for solving linear systems Society for Industrial and Applied Mathematics SIAM Philadelphia PA 1997 11 10 11 12
41. 0 : the parameter is unset; MAPHYS will perform at most 100 iterations.
> 0 : the parameter is set; MAPHYS will perform at most ICNTL(24) iterations.
Default : 0
ICNTL(25) : controls the iterative solver. Controls the strategy to compute the residual at restart; it must be the same on each process. It is irrelevant when the iterative solver is CG (ICNTL(20) = 2 or 3).
0 : a recurrence formula is used to compute the residual at each restart, except if the convergence was detected using the Arnoldi residual during the previous restart.
1 : the residual is explicitly computed using a matrix-vector product.
Default : 0
ICNTL(26) : controls the iterative solver; it is the restart parameter of GMRES. Controls the number of iterations between each restart. It is irrelevant when the iterative solver is CG (ICNTL(20) = 2 or 3).
0 : the parameter is unset; MAPHYS will restart every 100 iterations.
> 0 : the parameter is set; MAPHYS will restart every ICNTL(26) iterations.
Default : 0
ICNTL(27) : controls the iterative solver; it is the method used to perform the multiplication between the Schur complement matrix and a vector. It should be set before the solving step on each process. It can be set on each process before the preconditioning step to allow a possible reduction of the memory footprint (peak and usage).
-1 : value unset (error)
0 : automatic (1 if possible, 2 if not)
42. mphs SYM integer specifies the input matrix type It is accessed during the analysis and the solve steps mphs JOB 1 and mphs 6JOB 4 respecti vely MAPHYS does not modify its value during its execution Possible values of mphs SYM 0 or SM SYM IsGeneral means the input matrix is General also known as Unsymmetric 1 or SM SYM IsSPD means the input matrix is SPD Symmetric Positive Definite in real arithmetics s d In complex arithmetics z c this is an invalid value 2orSM SYM IsSymmetric means the input matrix is Symmetric even in complex In this version MAPHYS does not support Hermitian or Hermitian Positive Definite values 6 6 Centralized assembled matrix input There is currently no way to specifies that the matrix is already assembled Current version will treat it as an elemental matrix input 6 6 1 Element matrix input mphs SYM mphs N mphs NNZ mphs ROWS mphs COLS mphs VALUES specifies the input matrix in elemental coordinate format Those components are ac cessed during the analysis step and the solving step to compute statistics by the host process They are not altered by maphys In details mphs SYM integer specifies the type of symmetry see section 6 5 mphs N integer specifies the order of the input matrix hence N 0 mphs NNZ integer specifies the number of the entries in the input matrix hence NNZ gt 0 mphs ROWS integer array of size NNZ
43.   If (myid == 0) Then
     Print *, 'Ending maphys ok'
  End If
  Call MPI_Finalize(ierr)
  If (ierr /= 0) Then
     Print *, 'Problem with mpi finalise'
  End If
9999 Continue
  If (ierr < 0) then
     Print *, 'error'
  End If
End Program call_maphys

12.3.2 C example
Here is a C example of a driver using MAPHYS.

/* Example program using the C interface to the
 * double precision version of Maphys, dmph_maphys_driver_c */
#include <stdio.h>
#include "mpi.h"
#include "dmph_maphys_type_c.h"
#define JOB_INIT -1
#define JOB_END -2
#define USE_COMM_WORLD 987654
int main(int argc, char **argv)
{
  DMPH_maphys_t_c id;
  int n = 1000;
  int nnz = 5998;
  int i;
  int irn[nnz];
  int jcn[nnz];
  double a[nnz];
  double rhs[n];
  double sol[n];
  int myid, ierr;
  ierr = MPI_Init(&argc, &argv);
  ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  /* Initialize a Maphys instance. Use MPI_COMM_WORLD */
  id.job  = JOB_INIT;
  id.sym  = 2;
  id.comm = USE_COMM_WORLD;
  dmph_maphys_driver_c(&id);
  /* Define A and rhs */
  for (i = 0; i < 1000; i++) {
    rhs[i] = 1.0;
    a[i]   = ...;           /* diagonal values garbled in the source listing */
    irn[i] = i + 1;
    jcn[i] = i + 1;
    sol[i] = 0.0;
  }
  for (i = 0; i < 999; i++) {
    a[1000 + i]   = ...;    /* value garbled in the source listing */
    irn[1000 + i] = i + 2;
    jcn[1000 + i] = i + 1;
  }
  for (i = 0; i < 999; i++) {
    a[1999 + i]   = ...;    /* value garbled in the source listing */
    irn[1999 + i] = i + 1;
    jcn[1999 + i] = i + 2;
  }
  for (i =
44. node, in MegaBytes. It is updated at the end of each step. Note that an external PaSTiX instance may influence this data.
IINFO(35) : the amount of memory used on the node, in MegaBytes. It is updated at the end of each step. Note that an external PaSTiX instance may influence this data.
IINFO(36) : the number of non-zeros in the Schur complement after the factorisation of the local matrix. It is available after the preconditioning step. Note that it is only available for the PaSTiX version.

10.1.2 RINFO
RINFO(1) : UNUSED in current version.
RINFO(2) : 1-norm of the Schur complement stored in a dense matrix. It is only available if the dense local preconditioner was selected (ICNTL(21) = 1). It is computed during the preconditioning step and is available on each processor.
RINFO(3) : reciprocal of the condition number of the local assembled dense Schur matrix. It is only available if the dense local preconditioner was selected (ICNTL(21) = 1). It is computed during the preconditioning step and is available on each processor.
RINFO(4) : execution time of the analysis step. It is available after the analysis step.
RINFO(5) : execution time of the factorisation step. It is available after the factorisation step.
RINFO(6) : execution time of the preconditioning step. It is available after the preconditioning step.
RINFO(7) : execution time of the solution step, including the iterative solution of the Schur complement system and the backsolve to compute the interior. It is available after the solution step.
45. o solve the interiors RINFO 19 In the solution step execution time to gather the solution RINFO 20 Sum of the execution times of the 4 steps It is available after the solution step RINFO 21 Starting Total execution time from the begining of the analysis to the end of the solve Before the solve step it contains the starting time of execution After the solve step it contains the total execution time Warning it includes the time between each steps idle time due to synchronisations RINFO 22 The estimated floating operations for the elimination process on the local system It is only available after the factorisation RINFO 23 The floating operations for the assembly process on the local system It is only available after the factorisation RINFO 24 The floating operations for the elimination process on the local system It is only available after the factorisation RINFO 25 The estimated floating operations for the elimination process to obtain the sparse preconditioner It is only available after the preconditioning step with a sparse preconditioner RINFO 26 The floating point operations for the assembly process to obtain the sparse preconditioner It is only available after the preconditioning step with a sparse preconditioner RINFO 27 The floating point operations for the elimination process to obtain the factorisation of the sparse preconditioner It is only available after the precon ditioning step with
on each processor, P_i can concurrently partially factorize it to compute its local Schur complement S_i. This is the first computational phase, which is performed concurrently and independently by all the processors. The second step corresponds to the construction of the preconditioner. Each processor communicates with its neighbors (in the graph partitioning sense) to assemble its local Schur complement S_i and performs its factorization. This step only requires a few point-to-point communications. Finally, the last step is the iterative solution of the interface problem, Equation (5). For that purpose, parallel matrix-vector products involving S, applications of the preconditioner M, and dot product calculations must be performed. For the matrix-vector product, each processor P_i performs its local matrix-vector product involving its local Schur complement and communicates with its neighbors to assemble the computed vector in order to comply with Equation (7). Because the preconditioner (Equation 10) has a similar form to the Schur complement (Equation 7), its parallel application to a vector is implemented similarly. Finally, the dot products are performed as follows: each processor P_i performs its local dot product and a global reduction is used to assemble the result. In this way, the hybrid implementation can be summarized by these three main phases.

2.2 Dropping strategy to sparsify the preconditioner

The construction of the proposed local preconditioners can be
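Written out, the parallel matrix-vector product described above is a direct consequence of Equation (7): for an interface vector x_Γ,

   y_\Gamma \;=\; S\, x_\Gamma \;=\; \sum_{i=1}^{N} R_{\Gamma_i}^T \bigl( S_i\, R_{\Gamma_i} x_\Gamma \bigr),

so each processor P_i computes the local product S_i (R_{Γ_i} x_Γ) and the neighbor-to-neighbor exchanges accumulate the overlapping interface contributions; the application of M follows the same local-compute/assemble pattern, while the dot products only require a global reduction.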
Parallel Computing, 11:223-239, 1989.

Y. Saad. ILUT: a dual threshold incomplete LU factorization. Numerical Linear Algebra with Applications, 1:387-402, 1994.

Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, second edition, 2003.

Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 7:856-869, 1986.

B. Smith, P. Bjørstad and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, New York, first edition, 1996.

3 Installation

3.1 Requirements

MAPHYS needs some libraries in order to be installed:
- an MPI communication library (Intel MPI, MPICH2, ...);
- a BLAS and LAPACK library (MKL of preference);
- the SCOTCH library, for the partitioning of the adjacency graph of the sparse matrix;
- the MUMPS package, as sparse direct solver;
- optionally, the HWLOC library, for a better fit with the hardware topology.

3.2 Compilation and installation

To compile and install MAPHYS, follow the six steps below:
- Install SCOTCH 5 or higher.
- Install HWLOC.
- Install MUMPS 4.7 or higher.
- Uncompress the MAPHYS archives.
- Copy the file Makefile.inc.example into Makefile.inc.
- Set the variables of Makefile.inc depending on your architecture and libraries, such as the MPI and BLAS available.
mphs%n       = ...        ! set matrix input
mphs%nnz     = ...        ! set matrix input
mphs%rows    = ...        ! set matrix input
mphs%cols    = ...        ! set matrix input
mphs%values  = ...        ! set matrix input
mphs%srhs    = ...        ! set the right hand side
mphs%icntl(10) = ...      ! set the arguments (components of the structure)
mphs%job     = 6          ! request to perform all steps at once
Call DMPH_maphys_driver(mphs)
Call MPI_Finalize(ierr)

To select another arithmetic, simply change the prefix DMPH by CMPH for the COMPLEX arithmetic, SMPH for the REAL arithmetic, or ZMPH for the DOUBLE COMPLEX arithmetic. For convenience purposes, MAPHYS also generates static libraries specific to one arithmetic only, namely:
- libcmaphys for the COMPLEX arithmetic;
- libdmaphys for the DOUBLE PRECISION arithmetic;
- libsmaphys for the REAL arithmetic;
- libzmaphys for the DOUBLE COMPLEX arithmetic.

6 Input and output parameters

6.1 Maphys instance

mphs is the structure that describes one linear system to solve. You have to use one structure for each linear system to solve.

6.2 Version number

mphs%VERSION (string) gives information about the current MAPHYS version. It is set during the initialization step (mphs%JOB = -1).

6.3 Control of the four main phases

The main action performed by MAPHYS is specified by the integer mphs%JOB. The user must set it on all processes before each call to MAPHYS. Possible values of mphs%JOB are:

-1 : initializes the instance. It must be called before any other call to MAPHYS with this instance.
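As an illustration of how these components fit together, here is a minimal sketch of a complete sequence for one linear system. It only relies on the JOB values documented here (-1 for initialization and 6 for all steps at once) plus the JOB_END value (-2) used in the driver examples of Section 12.3; the ICNTL entries shown are the ones documented in Section 9, and everything else is assumed to be set as in the calling sequence of Section 5.

   ! Sketch: select a few options, then run all four phases at once.
   mphs%comm = MPI_COMM_WORLD      ! set communicator
   mphs%job  = -1                  ! initialise the instance
   Call DMPH_maphys_driver(mphs)

   ! ... set sym, n, nnz, rows, cols, values and srhs as above ...

   mphs%icntl(21) = 2              ! local preconditioner: 1 = dense, 2 = sparse (Section 9)
   mphs%icntl(32) = 1              ! local sparse direct solver: 1 = MUMPS, 2 = PaSTiX

   mphs%job = 6                    ! analysis, factorisation, preconditioning and solve
   Call DMPH_maphys_driver(mphs)

   mphs%job = -2                   ! release the instance (JOB_END in the examples)
   Call DMPH_maphys_driver(mphs)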
split the unknowns in two subsets. This induces the following block reordered linear system:

   \begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
   \begin{pmatrix} x_I \\ x_\Gamma \end{pmatrix}
   =
   \begin{pmatrix} b_I \\ b_\Gamma \end{pmatrix}                                              (4)

where x_Γ contains all unknowns associated with sub-graph interfaces and x_I contains the remaining unknowns associated with sub-graph interiors. Because the interior points are only connected to either interior points in the same sub-graph or to points on the boundary of the sub-graphs, the matrix A_II has a block diagonal structure, where each diagonal block corresponds to one sub-graph. Eliminating x_I from the second block row of Equation (4) leads to the reduced system

   S x_\Gamma = f                                                                              (5)

where

   S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma}
   \qquad \text{and} \qquad
   f = b_\Gamma - A_{\Gamma I} A_{II}^{-1} b_I.                                                (6)

The matrix S is referred to as the Schur complement matrix. This reformulation leads to a general strategy for solving Equation (4): an iterative method can be applied to Equation (5); once x_Γ is known, x_I can be computed with one additional solve on the sub-graph interiors. Let Γ denote the entire interface, defined by Γ = ∪_i Γ_i; we notice that if two sub-graphs G_i and G_j share an interface then Γ_i ∩ Γ_j ≠ ∅. As interior unknowns are no longer considered, new restriction operators must be defined as follows. Let R_{Γ_i} : Γ → Γ_i be the canonical point-wise restriction, which maps full vectors defined on Γ into vectors defined on Γ_i. Thus, in the case of many sub-graphs, the fully assembled global Schur complement S is obtained by summing the contributions over the sub-graphs.
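For completeness, the elimination of x_I can be read off the standard block factorisation

   \begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
   =
   \begin{pmatrix} I & 0 \\ A_{\Gamma I} A_{II}^{-1} & I \end{pmatrix}
   \begin{pmatrix} A_{II} & A_{I\Gamma} \\ 0 & S \end{pmatrix},
   \qquad S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma},

so that solving Equation (5) for x_Γ and then performing the back-substitution x_I = A_II^{-1} (b_I - A_IΓ x_Γ) recovers the solution of Equation (4).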
tation done using a modified version of partial ILUT (PILUT). In this case the user must set ICNTL(8) and the parameters of PILUT. This option forces the preconditioner choice ICNTL(21) = 3. Note: for more details on the ILU method you may refer to the corresponding section. Default: 0. Note: for more details on exact and approximated Schur you may refer to Section 2.4.

ICNTL(31): UNUSED in the current version.

ICNTL(32): Controls the sparse direct solver package to be used by MAPHYS for computing the local Schur complement and performing the local factorisation. It is accessed during the factorisation step by each MPI process.
   1 : use MUMPS
   2 : use PaSTiX
Default: value of ICNTL(13).

ICNTL(33)-(35): UNUSED in the current version.

ICNTL(36): Control parameter that specifies how to bind threads inside MAPHYS.
   0 : do not bind
   1 : thread-to-core binding
   2 : grouped bind
Default: 2.

ICNTL(37): In the 2-levels-of-parallelism version, specifies the number of nodes. It is only useful if ICNTL(42) /= 0. Default: 0.

ICNTL(38): In the 2-levels-of-parallelism version, specifies the number of cores per node. It is only useful if ICNTL(42) /= 0. Default: 0.

ICNTL(39): In the 2-levels-of-parallelism version, specifies the number of threads per domain. It is only useful if ICNTL(42) /= 0. Default: 0.

ICNTL(40): In the 2-levels-of-parallelism version, specifies the number of domains. It is only useful if ICNTL(42) /= 0. Default: 0.
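A short sketch of how these binding and 2-levels-of-parallelism controls might be set. The resource counts are purely illustrative, and the value 1 used for ICNTL(42) is an assumption: the descriptions above only state that the other entries are used when ICNTL(42) is non-zero.

   ! Sketch: thread binding and 2 levels of parallelism (illustrative values).
   mphs%icntl(36) = 2       ! grouped bind (the default)
   mphs%icntl(42) = 1       ! activate the 2 levels of parallelism (non-zero value assumed)
   mphs%icntl(37) = 4       ! number of nodes
   mphs%icntl(38) = 16      ! number of cores per node
   mphs%icntl(39) = 4       ! number of threads per domain
   mphs%icntl(40) = 16      ! number of domains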
till required. For that purpose, we developed parallel preconditioners and designed a hierarchical parallel implementation. Linear systems with a few tens of millions of unknowns have been solved on a few thousand processors using the software we designed.

2.1 Parallel hierarchical implementation

In this section, methods based on non-overlapping regions are described. Such domain decomposition algorithms are often referred to as substructuring schemes. This terminology comes from the structural mechanics discipline, where non-overlapping ideas were first developed. The structural analysis finite element community has been heavily involved in the design and development of these techniques. Not only is their definition fairly natural in a finite element framework, but their implementation can preserve data structures and concepts already present in large engineering software packages. Let us now further describe this technique, and let Ax = b be the linear problem. For the sake of simplicity, we assume that A is symmetric in pattern and we denote G = (V, E) the adjacency graph associated with A. In this graph, each vertex is associated with a row or column of the matrix A, and there exists an edge between the vertices i and j if the entry a_ij is non-zero. We assume that the graph is partitioned into N non-overlapping sub-graphs G_1, ..., G_N with boundaries Γ_1, ..., Γ_N. The governing idea behind substructuring or Schur complement methods is to
tioner. It is available after the preconditioning step.

IINFO(17): Size in megabytes used to store the preconditioner, if available. It is available after the preconditioning step.

IINFO(18): The number of static pivots done while computing the preconditioner. It is available after the preconditioning step. -1 means the data was unavailable; >= 0 is the number of pivots.

IINFO(19): The percentage of kept entries in the assembled Schur complement used to build the local part of the preconditioner. It is available after the preconditioning step with sparse local preconditioners (ICNTL(21) = 2). -1 means the data was unavailable; >= 0 is the value.

IINFO(20): Specifies the iterative solver: the flag indicating which iterative solver was used to solve the Schur complement system. It is available after the solution step.

IINFO(21): Specifies the Schur matrix-vector product: the flag indicating how the Schur matrix-vector product was performed (implicit or explicit). It is available after the solution step.

IINFO(22): On the local system, size in megabytes of the internal data allocated by the sparse direct solver. For MUMPS it is INFOG(19). It is available after the factorisation step.

IINFO(23): On the local Schur system, size in megabytes of the internal data allocated by the sparse direct solver to compute the preconditioner. For MUMPS it is INFOG(19). It is available after the preconditioning step with sparse preconditioners.
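As with the timings, these diagnostics can be checked programmatically after the preconditioning step; a small sketch (the component name mphs%iinfo is assumed, by analogy with icntl):

   ! Sketch: check the preconditioner diagnostics on this process.
   If (mphs%iinfo(18) >= 0) Then
      Print *, "Static pivots while building the preconditioner:", mphs%iinfo(18)
   Else
      Print *, "Static pivot count unavailable (IINFO(18) = -1)"
   End If
   If (mphs%iinfo(19) >= 0) Then
      Print *, "Kept entries in the assembled Schur (percent):", mphs%iinfo(19)
   End If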
you want to use, InputFile is a string that contains the link to the input file you want to use, and <nbproc> is the number of processes you want to use. For example, you can use the following command:

   mpirun -np 4 smph_testdistsys cube.in dmph.in

Warning: make sure that <nbproc> is a multiple of two, because MaPHyS is based on a dissection method. The InputFile is a file that is described as in section xxx. The CubeFile is a file that is described as follows:

   10    size_x
   10    size_y
   10    size_z
   0     problem
   0     an
   0.40  p1
   0.40  p2

where size_x, size_y and size_z are the cube dimensions, and problem, an, p1 and p2 are

12.3 How to use MAPHYS in another code

12.3.1 Fortran example

Here is a Fortran example of a driver using MAPHYS.

#include "mph_defs_f.h"
#include "mph_macros_f.h"
#define JOB_INIT       -1
#define JOB_END        -2
#define USE_COMM_WORLD -987654

Program Call_maphys
   ! Modules & co
   Use DMPH_maphys_mod
   Implicit None
   Include 'mpif.h'

   Integer :: ierr
   Integer :: myid
   Integer :: job
   Integer :: nbdom

   ! Arrays
   Real(KIND=8), pointer :: sol(:)
   Real(KIND=8), pointer :: rhs(:)
   Real(KIND=8), pointer :: a(:)
   Integer,      pointer :: irn(:)
   Integer,      pointer :: jcn(:)

   ! Matrix description
   Integer :: n
   Integer :: nnz
   Integer :: sym

   ! Derived types
   Type(DMPH_maphys_t) :: id

   Call MPI_INIT(ierr)
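Based on the calling sequence of Section 5, the body of such a driver typically proceeds along the following lines; this is a sketch under that assumption rather than the original listing (n, nnz, sym, irn, jcn, a and rhs are assumed to have been read or built beforehand, and the pointer association of the matrix components is assumed).

   Call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

   ! Initialise the MaPHyS instance
   id%comm = MPI_COMM_WORLD      ! set communicator
   id%job  = JOB_INIT
   Call DMPH_maphys_driver(id)

   ! Describe the centralised input matrix and right-hand side on the host
   id%sym    = sym               ! set matrix input
   id%n      = n
   id%nnz    = nnz
   id%rows   => irn              ! pointer association assumed
   id%cols   => jcn
   id%values => a
   id%srhs   => rhs              ! set the right hand side

   ! Analysis, factorisation, preconditioning and solve in one call
   id%job = 6
   Call DMPH_maphys_driver(id)

   ! Free the instance
   id%job = JOB_END
   Call DMPH_maphys_driver(id)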
The global Schur complement matrix (Equation 6) can be written as the sum of elementary matrices

   S = \sum_{i=1}^{N} R_{\Gamma_i}^T S_i R_{\Gamma_i}                                          (7)

where

   S_i = A_{\Gamma_i \Gamma_i} - A_{\Gamma_i I_i} A_{I_i I_i}^{-1} A_{I_i \Gamma_i}            (8)

is a local Schur complement associated with G_i, and is in general a dense matrix (e.g., if A_{I_i I_i} is irreducible, S_i is dense). It can be defined in terms of sub-matrices from the local matrix A_i, defined by

   A_i = \begin{pmatrix} A_{I_i I_i} & A_{I_i \Gamma_i} \\ A_{\Gamma_i I_i} & A_{\Gamma_i \Gamma_i} \end{pmatrix}      (9)

While the Schur complement system is significantly better conditioned than the original matrix A, it is important to consider further preconditioning when employing a Krylov method. It is well known, for example, that κ(A) = O(h^{-2}) when A corresponds to a standard discretization (e.g. piecewise linear finite elements) of the Laplace operator on a mesh with spacing h between the grid points. Using two non-overlapping sub-graphs effectively reduces the condition number of the Schur complement matrix to κ(S) = O(h^{-1}). While improved, preconditioning can significantly lower this condition number further. We introduce the general form of the preconditioner considered in this work. The preconditioner presented below was originally proposed in [4] in two dimensions and successfully applied to large three-dimensional problems and real-life applications in [11]. To describe this preconditioner, we define the local assembled Schur complement, \bar{S}_i = R_{\Gamma_i} S R_{\Gamma_i}^T, that corresponds to the restriction of the global Schur complement to the interface Γ_i.
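Combining this definition with Equation (7) shows that only the neighbouring sub-graphs of G_i contribute to the assembled local Schur complement, which is why its construction only needs a few point-to-point communications (Section 2.1). Written out, and with the preconditioner given in the additive form usually built from these assembled blocks (this expression for M is an assumption here, matching the role of Equation (10) in Section 2.1):

   \bar{S}_i = R_{\Gamma_i} S R_{\Gamma_i}^T
             = \sum_{j=1}^{N} \bigl( R_{\Gamma_i} R_{\Gamma_j}^T \bigr)\, S_j\, \bigl( R_{\Gamma_j} R_{\Gamma_i}^T \bigr),
   \qquad
   M = \sum_{i=1}^{N} R_{\Gamma_i}^T \bar{S}_i^{-1} R_{\Gamma_i}.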
