Algorithm A.8 RBi-CGSTAB
1. r0 = Q1 (b − A Q2 x0)
2. r̃0 = r0, u0 = 0, ρ0 = 1, α = 0, ω = 1
for k = 1, 2, ...
    ρ0 = −ω ρ0
    for j = 0, 1, ..., restart − 1
        ρ1 = r̃0^T r_j,  β = α ρ1/ρ0,  ρ0 = ρ1
        u_i = r_i − β u_i,  i = 0, 1, ..., j
        u_{j+1} = Q1 A Q2 u_j
        γ = r̃0^T u_{j+1},  α = ρ0/γ
        r_i = r_i − α u_{i+1},  i = 0, 1, ..., j
        r_{j+1} = Q1 A Q2 r_j
        x0 = x0 + α u0
    endfor
    check stopping criterion
    σ_1 = r_1^T r_1,  γ'_1 = (r_0^T r_1)/σ_1
    for j = 2, 3, ..., restart
        τ_{i,j} = (r_j^T r_i)/σ_i,  r_j = r_j − τ_{i,j} r_i,  i = 1, ..., j − 1
        σ_j = r_j^T r_j,  γ'_j = (r_0^T r_j)/σ_j
    endfor
    γ_{restart} = γ'_{restart},  ω = γ_{restart}
    γ_j = γ'_j − Σ_{i=j+1..restart} τ_{j,i} γ_i,  j = restart − 1, ..., 1
    γ''_j = γ_{j+1} + Σ_{i=j+1..restart−1} τ_{j,i} γ_{i+1},  j = 1, ..., restart − 1
    x0 = x0 + γ_1 r_0
    r_0 = r_0 − γ'_{restart} r_{restart}
    u_0 = u_0 − γ_{restart} u_{restart}
    u_0 = u_0 − γ_j u_j,  j = 1, ..., restart − 1
    x0 = x0 + γ''_j r_j,  j = 1, ..., restart − 1
    r_0 = r_0 − γ'_j r_j,  j = 1, ..., restart − 1
endfor

A.11 PIM_RGMRES

Purpose

Solves the system Q1 A Q2 x = Q1 b using the restarted GMRES method.

Synopsis

PIMSRGMRES(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDRGMRES(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCRGMRES(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZRGMRES(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          (4 + IPAR(5)) * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies
C     Compute with local grid points

C     Need "east" data; wait for completion of receive
      CALL MPI_WAIT(RID0,ISTAT,IERR)
C     Compute with local interfacing grid points in the east

C     Need "west" data; wait for completion of receive
      CALL MPI_WAIT(RID1,ISTAT,IERR)
C     Compute with local interfacing grid points in the west

C     Release message IDs
      CALL MPI_WAIT(SID0,ISTAT,IERR)
      CALL MPI_WAIT(SID1,ISTAT,IERR)
      RETURN
      END

The computation of the transpose matrix-vector product for the PDE case is performed in a similar fashion. Before the computation starts, each processor exchanges with its left and right neighbouring processors the east and west coefficients corresponding to the interfacing grid points. The computation performed is then similar to the matrix-vector product described above, except that for each interfacing grid point we apply

    v_{i,j} = c_{i,j} u_{i,j} + n_{i,j−1} u_{i,j−1} + s_{i,j+1} u_{i,j+1} + e_{i−1,j} u_{i−1,j} + w_{i+1,j} u_{i+1,j}    (8)

Comparing (8) to (7) we see that the coefficients are swapped in the north-south and east-west directions. Note that due to the partitioning imposed we do not need to exchange the north and south coefficients.

A matrix-vector product for parallel vector architectures    For parallel vector architectures like the Cray Y-MP2E, the routines outlined above are not efficient, because of the small vector lengths involved. A routine requiring the use of long vectors may be obtained by writing the matrix-vector product for the 5-point
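As an illustration of (8), the loop below applies the transpose stencil at the interior points of a processor's panel (a sketch only: the coefficient arrays CC, CN, CS, CE, CW, the panel width MYL and the column length L are assumed names, not those used by the PIM example sources):

      SUBROUTINE TMVLOC(L,MYL,CC,CN,CS,CE,CW,U,V)
C     Apply the transpose of the 5-point stencil at the interior
C     grid points; U holds the local grid values, V the product
      INTEGER L,MYL,I,J
      DOUBLE PRECISION CC(L,MYL),CN(L,MYL),CS(L,MYL),CE(L,MYL),
     +                 CW(L,MYL),U(L,MYL),V(L,MYL)
      DO 20 J = 2,MYL-1
         DO 10 I = 2,L-1
C           Each neighbour contributes through its own coefficient
C           pointing back to (I,J): north-south and east-west swapped
            V(I,J) = CC(I,J)*U(I,J) + CN(I,J-1)*U(I,J-1) +
     +               CS(I,J+1)*U(I,J+1) + CE(I-1,J)*U(I-1,J) +
     +               CW(I+1,J)*U(I+1,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END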
Notes

None.

Algorithm A.3 CGNR
r0 = Q1 A^T (b − A Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A^T A Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A^T A Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.6 PIM_CGNE

Purpose

Solves the system Q1 A A^T Q2 x = Q1 b using the CGNE method.

Synopsis

PIMSCGNE(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGNE(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCCGNE(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCGNE(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.4 CGNE
r0 = Q1 (b − A A^T Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A A^T Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A A^T Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k +
PDNRM,PROGRESS)
PIMCCGS(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCGS(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          10 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.6 CGS
1. r0 = Q1 (b − A Q2 x0)
2. p0 = s0 = r̃0 = r0
3. ρ0 = r̃0^T r0
for k = 1, 2, ...
    w_{k−1} = Q1 A Q2 p_{k−1}
    ξ_{k−1} = r̃0^T w_{k−1}
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    t_{k−1} = s_{k−1} − α_{k−1} w_{k−1}
    w_{k−1} = s_{k−1} + t_{k−1}
    x_k = x_{k−1} + α_{k−1} w_{k−1}
    r_k = r_{k−1} − α_{k−1} Q1 A Q2 w_{k−1}
    check stopping criterion
    ρ_k = r̃0^T r_k
    β_k = ρ_k/ρ_{k−1}
    s_k = r_k + β_k t_{k−1}
    w_k = t_{k−1} + β_k p_{k−1}
    p_k = s_k + β_k w_k
endfor

A.9 PIM_BICGSTAB

Purpose

Solves the system Q1 A Q2 x = Q1 b using the Bi-CGSTAB method.

Synopsis

PIMSBICGSTAB(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDBICGSTAB(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCBICGSTAB(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZBICGSTAB(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          10 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function
    x_k = ω_k (w − x_{k−2}) + x_{k−2}
    check stopping criterion
endfor

A.17 PIM_SETPAR

Purpose

Sets the parameter values in the arrays IPAR and _PAR.

Synopsis

PIMSSETPAR(IPAR,SPAR,LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,
           PRECONTYPE,STOPTYPE,MAXIT,EPSILON)
INTEGER IPAR(*)
REAL SPAR(*)
INTEGER LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,PRECONTYPE,STOPTYPE,MAXIT
REAL EPSILON

PIMDSETPAR(IPAR,DPAR,LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,
           PRECONTYPE,STOPTYPE,MAXIT,EPSILON)
INTEGER IPAR(*)
DOUBLE PRECISION DPAR(*)
INTEGER LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,PRECONTYPE,STOPTYPE,MAXIT
DOUBLE PRECISION EPSILON

Storage requirements

Parameter    No. of words
IPAR         13
_PAR         6

Notes

1. When using the COMPLEX and DOUBLE COMPLEX PIM routines, call PIMSSETPAR and PIMDSETPAR respectively.

A.18 PIM_PRTPAR

Purpose

Prints the parameter values stored in the arrays IPAR and _PAR.

Synopsis

PIMSPRTPAR(IPAR,SPAR)
INTEGER IPAR(*)
REAL SPAR(*)

PIMDPRTPAR(IPAR,DPAR)
INTEGER IPAR(*)
DOUBLE PRECISION DPAR(*)

Storage requirements

Parameter    No. of words
IPAR         13
_PAR         6

Notes

1. May be called only on a processing element with I/O capability.

2. When using the COMPLEX and DOUBLE COMPLEX PIM routines, call PIMSPRTPAR and PIMDPRTPAR respectively.

A.19 _INIT

Purpose

Initialises a vector of length n with the scalar value alpha. Based
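As a usage illustration for PIM_SETPAR, the call below fills IPAR and DPAR for a double-precision run (a sketch: the values shown, a system of order 500 on a single processor with no preconditioning, stopping criterion 1, at most 100 iterations and tolerance 1.0D-10, are illustrative only):

      INTEGER IPAR(13)
      DOUBLE PRECISION DPAR(6)
C     Arguments after the two arrays: LDA, N, BLKSZ, LOCLEN,
C     BASISDIM, NPROCS, PROCID, PRECONTYPE, STOPTYPE, MAXIT, EPSILON
      CALL PIMDSETPAR(IPAR,DPAR,500,500,500,500,10,1,0,0,1,100,
     +                1.0D-10)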
      RETURN

where DVPROD is a routine based on the BLAS DAXPY routine that performs an element-by-element vector multiplication. This example also shows the use of dummy arguments (PDUMR).

Note that it is the responsibility of the user to ensure that, when using preconditioning, the matrix Q1 A Q2 must satisfy any requirements made by the iterative method being used with respect to the symmetry and/or positive-definiteness of the matrix. For example, if A is a matrix with arbitrary (i.e., non-constant) diagonal entries, then both diag(A)^{-1} A and A diag(A)^{-1} will not be symmetric, and the CG and CGEV methods will generally fail to converge. For these methods symmetric preconditioning, diag(A)^{-1/2} A diag(A)^{-1/2}, should be used.

Inner products, vector norms and global accumulation    When running PIM routines on multiprocessor architectures, the inner product and vector norm routines require reduction and broadcast operations (in some message-passing libraries these can be supplied by a single routine). On vector processors these operations are handled directly by the hardware, whereas on distributed-memory architectures these operations involve the exchange of messages among the processors.

When a PIM iterative routine needs to compute an inner product, it calls _DOT to compute the partial inner-product values. The user-supplied routine P_SUM is then used to generate the global sum of those partial sums. The following code shows the routines to compute
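A PDSUM-style global-sum routine for MPI might be written as follows (a sketch: it assumes the standard MPI_ALLREDUCE reduction with MPI_SUM over MPI_COMM_WORLD and at most 128 partial values; the routines shipped with the PIM examples may differ):

      SUBROUTINE PDSUM(ISIZE,X)
      INCLUDE 'mpif.h'
      INTEGER ISIZE,IERR,I
      DOUBLE PRECISION X(*),WRK(128)
C     Form the global sum of the ISIZE partial values in X; the
C     result replaces X on all processors (reduction and broadcast)
      CALL MPI_ALLREDUCE(X,WRK,ISIZE,MPI_DOUBLE_PRECISION,MPI_SUM,
     +                   MPI_COMM_WORLD,IERR)
      DO 10 I = 1,ISIZE
         X(I) = WRK(I)
   10 CONTINUE
      RETURN
      END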
Journal on Matrix Analysis and Applications, 13(3):778-795, 1992.

T.C. Oppe, W.D. Joubert, and D.R. Kincaid. NSPCG user's guide, version 1.0. Report No. CNA-216, Center for Numerical Analysis, University of Texas at Austin, April 1988.

A. Pindor. Experiences with implementing PIM (Parallel Iterative Methods) package on KSR1. In Supercomputing Symposium '94, Toronto, June 1994.

Y. Saad. Krylov subspace methods for solving large unsymmetric systems. Mathematics of Computation, 37:105-126, 1981.

Y. Saad and M.H. Schultz. Conjugate Gradient-like algorithms for solving nonsymmetric linear systems. Mathematics of Computation, 44(170):417-424, 1985.

Y. Saad and M.H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7:856-869, 1986.

G.L.G. Sleijpen and D.R. Fokkema. BiCGstab(l) for linear equations involving unsymmetric matrices with complex spectrum. ETNA, 1:11-32, September 1993.

P. Sonneveld. CGS, a fast Lanczos-type solver for nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 10:36-52, 1989.

G. Stellner, S. Lamberts, and T. Ludwig. NXLIB User's Guide. Technical report, Institut für Informatik, Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München, October 1993.

H.A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymm
NSPCG may be used on a parallel vector supercomputer like a Cray Y-MP, there are no versions of these packages available for distributed-memory parallel computers. Second, there is no debugging support; this is dictated by the fact that in some multiprocessing environments parallel I/O is not available. The third aspect is that we do not provide a collection of preconditioners but leave the responsibility of providing the appropriate routines to the user.

In this sense, PIM has many similarities to a proposed standard for iterative linear solvers by Ashby and Seager [5]. In that proposal, the user supplies the matrix-vector product and preconditioning routines. We believe that their proposed standard satisfies many of the needs of the scientific community as, drawing on its concepts, we have been able to provide software that has been used in a variety of parallel and sequential environments. PIM does not always follow the proposal, especially with respect to the format of the matrix-vector product routines and the lack of debugging support.

Due to the openness of the design of PIM, it is also possible to use it on a sequential machine. In this case, the user can take advantage of the BLAS [16] to compute the above operations. This characteristic is important for testing purposes: once the user is satisfied that the selection of preconditioners and stopping criteria are suitable, the computation can be accelerated by using appropriate parallel versions of t
V y (form approximate solution)
    compute eigenvalues of H_{restart}
    r_k = Q1 (b − A Q2 x_k)
    ρ_k = ||r_k||_2
endfor

A.13 PIM_RGCR

Purpose

Solves the system Q1 A Q2 x = Q1 b using the restarted GCR method.

Synopsis

PIMSRGCR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDRGCR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCRGCR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZRGCR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          (5 + 2 * IPAR(5)) * IPAR(4) + 2 * IPAR(5)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

1. The restarting value must be stored in IPAR(5).

Algorithm A.11 RGCR
1. r0 = Q1 (b − A Q2 x0)
for k = 1, 2, ...
    2. P_1 = r_{k−1},  x_k = x_{k−1},  r_k = r_{k−1}
    for j = 1, 2, ..., restart
        W_j = Q1 A Q2 P_j
        ξ_j = W_j^T W_j
        α_j = (r_k^T W_j)/ξ_j
        x_k = x_k + α_j P_j
        r_k = r_k − α_j W_j
        check stopping criterion
        q = Q1 A Q2 r_k
        P_{j+1} = r_k − Σ_{i=1..j} ((q^T W_i)/ξ_i) P_i
    endfor
endfor

A.14 PIM_QMR

Purpose

Solves the system Q1 A Q2 x = Q1 b using the highly parallel algorithm for the QMR method developed by Bücker and Sauren [8].

Synopsis

PIMSQMR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PS
applied using the local interfacing points and those from the neighbouring processors.

This parallel computation offers the possibility of overlapping communication with the computation. If the number of local grid points is large enough, one may expect that while Equation (7) is being applied to those points, the interfacing grid points of the neighbouring processors will have been transferred and be available for use. This method attempts to minimize the overheads incurred by transferring the data; note that we only make gains if the asynchronous transfer of messages is available. The example below is taken from the matrix-vector product routine using MPI.

      SUBROUTINE PDMVPDE(NPROCS,MYID,LDC,L,MYL,COEFS,U,V,UEAST,UWEST)
      INCLUDE 'mpif.h'
C     Declarations...

C     Send border U values to the (myid+1)-th processor
      MSGTYPE = 1000
      TO = MYID + 1
      CALL MPI_ISEND(U(1,MYL),L,MPI_DOUBLE_PRECISION,TO,MSGTYPE,
     +               MPI_COMM_WORLD,SID0,IERR)

C     Post to receive border U values from the (myid+1)-th processor
      MSGTYPE = 1001
      CALL MPI_IRECV(UEAST,L,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,
     +               MSGTYPE,MPI_COMM_WORLD,RID0,IERR)

C     Send border U values to the (myid-1)-th processor
      MSGTYPE = 1001
      TO = MYID - 1
      CALL MPI_ISEND(U(1,1),L,MPI_DOUBLE_PRECISION,TO,MSGTYPE,
     +               MPI_COMM_WORLD,SID1,IERR)

C     Post to receive border U values from the (myid-1)-th processor
      MSGTYPE = 1000
      CALL MPI_IRECV(UWEST,L,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,
     +               MSGTYPE,MPI_COMM_WORLD,RID1,IERR)
computed
B          The right-hand side vector of length IPAR(4)
WRK        A work vector used internally (see the description of each routine for its length)
IPAR       see below
_PAR       see below
MATVEC     Matrix-vector product external subroutine
TMATVEC    Transpose matrix-vector product external subroutine
PRECONL    Left-preconditioning external subroutine
PRECONR    Right-preconditioning external subroutine
P_SUM      Inner product external function
P_NRM      Vector norm external function
PROGRESS   Monitoring routine

IPAR (input)

Element  Description
1        Leading dimension of A
2        Number of rows or columns of A, depending on partitioning
3        Block size, for cyclic partitioning
4        Number of vector elements stored locally
5        Restarting parameter used in GMRES, GCR and RBi-CGSTAB
6        Number of processors
7        Processor identification
8        Preconditioning:
         0  No preconditioning
         1  Left preconditioning
         2  Right preconditioning
         3  Symmetric preconditioning
9        Stopping criterion (see Table 2)
10       Maximum number of iterations allowed

IPAR (output)

Element  Description
11       Number of iterations
12       Exit status:
         0   converged to solution
         -1  no convergence has been achieved
         -2  "soft" breakdown, solution may have been found
         -3  "hard" breakdown, no solution
         -4  conflict in preconditioner and stopping criterion selected
             (if IPAR(8) = 0 or IPAR(8) = 2 then IPAR(9) must not be 6)
         -5  error in stopping criterion 3,
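In application code the output elements are typically inspected immediately after a PIM call; a minimal check might read (a sketch, using the double-precision CG routine as an example):

      CALL PIMDCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,
     +            PDSUM,PDNRM,PROGRESS)
C     IPAR(11): iterations performed; IPAR(12): exit status
      IF (IPAR(12).NE.0) THEN
         WRITE (*,*) 'PIMDCG: status ',IPAR(12),' after ',
     +               IPAR(11),' iterations'
      END IF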
contains examples of the timing functions available on the Cray, the IBM RS/6000 and also the UNIX etime function. By default, the latter is used; this file must be modified to use the timing function available on the target machine.

The PVM and MPI example programs use the Fortran INCLUDE statement to include the PVM and MPI header files. Some compilers have a switch (usually -I) which allows the user to provide search directories in which files to be included are located (as with the IBM AIX XL Fortran compiler), while others require the presence of those files in the same directory as the source code resides. In the first case, you will need to include in the FFLAGS variable the relevant switches (see 4.3); in the latter, you will need to install the PVM and MPI header files (fpvm3.h and mpif.h respectively) by typing

    make install-pvm-include INCFILE=<name of fpvm3.h>
    make install-mpi-include INCFILE=<name of mpif.h>

where you should replace <name of fpvm3.h> and <name of mpif.h> by the full filename of the required include files. For instance, if PVM is installed on /usr/local/pvm3 then you should type

    make install-pvm-include INCFILE=/usr/local/pvm3/include/fpvm3.h

Figure 2 shows the directory tree containing the examples. To build them, type make followed by the name of a subdirectory of examples, e.g., make sequential/single/dense.

The PVM examples use the "groups" library libgpvm.a which provides the reduction
equations of the form

    ∂u/∂t = ∂²u/∂x² + ∂²u/∂y²

have been solved with CG using a Neumann polynomial approximation to A^{-1} as a preconditioner [12].

CG with eigenvalues estimation    An important characteristic of CG is its connection to the Lanczos method [29], which allows us to obtain estimates of the eigenvalues of Q1 A Q2 with only a little extra work per iteration. These estimates, μ_1 and μ_n, are obtained from the Lanczos tridiagonal matrix T_k whose entries are generated during the iterations of the CG method [29, pp. 475-480, 523-524]. If we define the matrices Λ_k = diag(ρ_0, ρ_1, ..., ρ_{k−1}), G_k = diag(ξ_0, ξ_1, ..., ξ_{k−1}) and B_k, the lower bidiagonal matrix with ones on the diagonal and −β_1, ..., −β_{k−1} on the subdiagonal, where ρ_i = ||r_i||_2 (r_i is the residual at the i-th iteration), ξ_i = p_i^T A p_i, and β_i = (r_i^T r_i)/(r_{i−1}^T r_{i−1}) are generated via the CG iterations (at no extra cost), we obtain the Lanczos matrix via the relation

    T_k = Λ_k^{−1} B_k G_k B_k^T Λ_k^{−1}.    (4)

Due to the structure of the matrices B_k, Λ_k and G_k, the matrix T_k can be easily updated during the CG iterations. The general formula for T_k is

    a_i = (ξ_{i−1} + β_{i−1}^2 ξ_{i−2})/ρ_{i−1}^2,  i = 1, 2, ..., k  (with β_0^2 ξ_{−1} = 0)
    b_i = −β_i ξ_{i−1}/(ρ_{i−1} ρ_i),  i = 1, 2, ..., k − 1

where a_i and b_i are the elements along the diagonal and subdiagonal of T_k, respectively.

The strategy employed to obtain the eigenvalue estimates is based on Sturm sequences [29, pp. 437-439]. For the matrix T_2 obtained during the first iteration of CG, the eigenvalues are obtained directly
from the quadratic equation derived from p_2(μ) = det(T_2 − μI). We also set an interval [c, d] = [μ_1, μ_n].

For the next iterations, we update the interval [c, d] using Gerschgorin's theorem. This is easily accomplished since at each iteration only two new values are added to T_k to give T_{k+1}; the updated interval is then

    c = min(c, a_k − (|b_{k−1}| + |b_k|), a_{k+1} − |b_k|)
    d = max(d, a_k + (|b_{k−1}| + |b_k|), a_{k+1} + |b_k|)

The new estimates for the extreme eigenvalues are then computed using a bisection routine applied to the polynomial p_{k+1}(μ) = det(T_{k+1} − μI), which is computed via a recurrence expression [29, pp. 437]. The intervals [c, μ_1] and [μ_n, d] are used in the bisection routine to find the new estimates of μ_1 and μ_n respectively.

A possible use of this routine would be to employ adaptive polynomial preconditioners (see [2] and [3]), where at each iteration information about the extreme eigenvalues of Q1 A Q2 is obtained and the polynomial preconditioner is modified to represent a more accurate approximation to A^{-1}. This routine can also be used as a preliminary step before solving the system using the Chebyshev acceleration routine (PIM_CHEBYSHEV).

CGNR and CGNE    For nonsymmetric systems, one could use the CG formulation applied to systems involving either A^T A or A A^T; these are called CGNR and CGNE respectively. The difference between both methods is that CGNR minimises the residual ||b − A x_k|| and CGNE the error ||A^{-1} b − x_k||. A potential
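The Gerschgorin update above is cheap to code; a sketch in Fortran (the arrays A and B holding the diagonal and subdiagonal of T_{k+1} are assumed names):

      SUBROUTINE GERSCH(K,A,B,C,D)
C     Update the interval [C,D] enclosing the eigenvalues of the
C     Lanczos tridiagonal matrix after rows K and K+1 are appended
C     (valid for K >= 2, i.e. from the second CG iteration onwards)
      INTEGER K
      DOUBLE PRECISION A(*),B(*),C,D
      C = MIN(C,A(K)-(ABS(B(K-1))+ABS(B(K))),A(K+1)-ABS(B(K)))
      D = MAX(D,A(K)+(ABS(B(K-1))+ABS(B(K))),A(K+1)+ABS(B(K)))
      RETURN
      END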
functions.

Figure 2: Directories containing the examples. The tree rooted at examples contains the subdirectories sequential, pvm and mpi; each of these is split into single, double, complex and dcomplex versions, which in turn contain the dense and pde example sets (and, for the sequential single and double versions, the harwell-boeing examples).

The example programs can also be built locally in those directories by changing to a specific directory and typing make.

Cleaning up    You may wish to remove some or all of the compiled codes or other files installed under the PIM directory; in this case you may type one of the following:

    make singleclean
    make doubleclean
    make complexclean
    make dcomplexclean
    make sequentialclean
    make pvmclean
    make mpiclean
    make clean-pvm-include
    make clean-mpi-include
    make examplesclean
    make makefilesclean
    make realclean

which will clean up the PIM routines, the examples, the Makefiles, the include files and all generated files, returning the package to its distribution form.

Using PIM in your application    To use PIM with your application, link your program with the .o file corresponding to the PIM iterative method routine being called and with the PIM support library libpim.a.

4.4 Calling a PIM iterative method routine

With the exception of the Bi-CG, CGNR, CGNE and QMR methods, all the implemented methods have the same parameter list as CG. The argument list for the double precision implementation of the CG
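That argument list appears in the examples later in this guide; as a minimal sketch of a call site (the declarations and the value of LOCLEN are assumptions, and IPAR/DPAR must have been filled beforehand, e.g. via PIMDSETPAR):

      EXTERNAL MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      INTEGER IPAR(13)
      DOUBLE PRECISION DPAR(6)
      DOUBLE PRECISION X(LOCLEN),B(LOCLEN),WRK(6*LOCLEN)
C     X holds the initial guess on entry and the solution on exit
      CALL PIMDCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,
     +            PDSUM,PDNRM,PROGRESS)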
on the level 1 BLAS routine _COPY.

Synopsis

SINIT(N,ALPHA,SX,INCX)
REAL ALPHA,SX(*)
INTEGER N,INCX

DINIT(N,ALPHA,DX,INCX)
DOUBLE PRECISION ALPHA,DX(*)
INTEGER N,INCX

CINIT(N,ALPHA,CX,INCX)
COMPLEX ALPHA,CX(*)
INTEGER N,INCX

ZINIT(N,ALPHA,ZX,INCX)
DOUBLE COMPLEX ALPHA,ZX(*)
INTEGER N,INCX

Storage requirements

Parameter    No. of words
X            IPAR(4)

Notes

None.

Index

Example programs
    dense storage, 30
    description of, 27
    Eigenvalues estimation and Chebyshev acceleration, 29
    PDE storage, 30
    PDE, matrix-vector product for parallel vector architectures, 34
    preconditioners, 35
    results, 36
External routines
    description of, 22
    inner product and vector norm, 25
    matrix-vector product, 22
    monitoring the iterations, 26
    preconditioning step, 23
    synopsis of, 47
Inner product, see External routines, 25
Installation procedures, 18
    Building the examples, 19
    Building the PIM core functions, 18
    Cleaning up, 20
    Using PIM in your application, 21
Iterative methods
    Bi-CG, 9; routine, 57
    Bi-CGSTAB, 10; routine, 61
    CG, 7; routine, 49
    CG with eigenvalues estimation, 8; routine, 51
    CGNE, 9; routine, 55
    CGNR, 9; routine, 53
    CGS, 10; routine, 59
    Chebyshev acceleration, 12; routine, 75
    GCR, 11; routine, 69
    GMRES, 10; routine, 65
    GMRES with eigenvalues estimation, 11; routine, 67
    increasing parallel scalability of, 15
    overview, 7
    QMR, 11; routine, 71
    Restarted Bi-CGSTAB, 10; routine, 63
    TFQMR, 11; routine, 73
Ma
problem with this approach is that the condition number of A^T A or A A^T is large even for a moderately ill-conditioned A, thus requiring a substantial number of iterations for convergence. However, as noted by Nachtigal et al. [37], CGNR is better than GMRES and CGS for some systems, including circulant matrices. More generally, CGNR and CGNE perform well if the eigenvalue spectrum of A has some symmetries; examples of such matrices are the real skew-symmetric and shifted skew-symmetric matrices A = e^{iθ}(T + σI), T = −T^T, θ real and σ complex.

Bi-CG    Bi-CG is a method derived to solve non-Hermitian systems of equations, and is closely related to the Lanczos method to compute the eigenvalues of A. The method requires few vectors per iteration and the computation of a matrix-vector product Au as well as a transpose matrix-vector product A^T u. The iterates of Bi-CG are generated in the Krylov subspace K(r_0, A) = {r_0, r_0 A, r_0 A^2, ...}, where r_0 = b − A x_0. A Galerkin condition, w^T r_k = 0 for all w in K(r̃_0, A^T), is imposed on the residual vector, where r̃_0 is an arbitrary vector satisfying r_0^T r̃_0 ≠ 0. It is important to note that two sequences of residual vectors are generated, one involving r and A and the other r̃ and A^T, but the solution vector x_k is updated using only the first sequence.

Bi-CG has an erratic convergence with large oscillations of the residual 2-norm, which usually cause a large number of iterations to be performed until convergence is achieved. M
storage needed. Although the restarted GMRES does not break down [42, pp. 865], it may, depending on the system and the value of c, produce a stationary sequence of residuals, thus not achieving convergence. Increasing the value of c usually cures this problem and may also increase the rate of convergence.

A detailed explanation of the parallel implementation of the restarted GMRES used can be found in [14].

GMRES with eigenvalues estimation    It is very easy to obtain estimates of the eigenvalues of Q1 A Q2 at each iteration of GMRES, since the upper Hessenberg matrix H_k computed during the Arnoldi process satisfies Q1 A Q2 V_k = V_k H_k. The eigenvalues of H_k approximate those of Q1 A Q2, especially on the boundaries of the region containing λ(Q1 A Q2) [40]. The QR algorithm can be used to obtain the eigenvalues of H_k. The LAPACK routine _HSEQR [1, pp. 158-159] is used for this purpose.

The routine PIM_RGMRESEV returns a box in the complex plane, defining the minimum and maximum values along the real and imaginary axes. These values can then be used by the Chebyshev acceleration routine (PIM_CHEBYSHEV).

GCR    The GCR method is generally used in its restarted form for reasons similar to those given above for GMRES. It is mathematically equivalent to the restarted version of GMRES but it is not as robust. It is applicable to systems where the coefficient matrix is of the form A = μI + R, μ complex and R real symmetric, or A = μI + S, μ real and S = −S^T, arising i
that only the neighbouring points in the vertical and horizontal directions are needed to compute v_{i,j}.

Figure 3: Matrix-vector product, dense storage format. (A) Partitioning in columns; (B) Example; and (C) Computation and communication steps.

Figure 4: Matrix-vector product, PDE storage format. The grid is partitioned into vertical panels owned by processors 0, 1, 2, ...; the boundary grid points of each panel are exchanged between neighbouring processors (five-point stencil).

A parallel computation of (7) may be organised as follows. The grid points are partitioned by vertical panels among the processors as shown in Figure 4. A processor holds at most ⌈n/p⌉ columns of grid points. To compute the matrix-vector product, each processor exchanges with its neighbours the grid points in the "interfaces" between the processors (the points marked with white squares in Figure 4). Equation (7) is then applied independently by each processor at its local grid points, except at the local interfacing points. After the interfacing grid points from the neighbouring processors have arrived at a processor, Equation (7) is
that the methods that have failed to converge in this example do converge for other systems.

Scalability    In Table 6 we present the execution times obtained by solving the test problem above, but with n = 16384 equations, with the PIMDRGMRES routine, using 10 basis vectors and the Neumann polynomial preconditioner of first degree, on the IBM SP/1, Intel Paragon XP/S, Kendall Square Research KSR1, SGI Challenge, Cray Y-MP2E and Cray C90/16256. The PIMDRGMRES routine converged to a tolerance of 10^{-13} in 70 iterations. The results for the Cray machines were obtained with the modified matrix-vector product routines described in 4.6.3. The results for the KSR1 are obtained using the KSRBLAS routines. The programs running on the SGI Challenge are from the set of examples available with the PIM distributed software, using the PVM message-passing library. The results for the IBM SP/1 are obtained using the IBM PVMe 2.0 version of PVM, which enables the use of the IBM SP/1 High Performance Switch. Note that for the IBM SP/1, SGI Challenge and Intel Paragon XP/S superlinear effects occur; we believe this is due to the specific memory organization of those machines (hierarchic memories and/or the presence of a cache memory).

5 Summary

We have described in this report PIM, the Parallel Iterative Methods package, a collection of Fortran 77 routines for the parallel solution of linear systems using iterative methods.

The package was designed to be used in a vari
which can be easily computed as a sequence of vector updates and matrix-vector products using Horner's algorithm (a sketch of this scheme follows Table 3 below). Note that the γ_{m,i} coefficients define the kind of polynomial preconditioner being used. The Neumann preconditioner is obtained when γ_{m,i} = 1 for all i; the weighted and unweighted least-squares polynomial preconditioners are those reported in [36]. The maximum available degree of the polynomial for these latter two is m = 13.

4.6.5 Results

In this section we provide some results for the example programs discussed above.

Stopping criteria    As pointed out earlier, the selection of the stopping criterion has a substantial effect on the execution time. Evidently, there is a trade-off between the time spent on each iteration and the total number of iterations required for convergence. In Table 3 we show, for each of the stopping criteria provided, the execution time per iteration when PIMDCG is applied to the tridiagonal system (described in 4.6) of order n = 500 with diagonal left preconditioning. The increase in execution time of each stopping criterion with respect to criterion 4, the "cheapest" one, is shown.

Table 3: Effect of different stopping criteria on an iterative method routine.

Stopping criterion    k    ||r_k||        Time (s)/iteration    Increase
1                     19   3.56x10^-1     0.3331                2.66
2                     15   6.901x10       0.3531                2.79
3                     14   1.29x10^-8     0.3286                2.62
4                     18   3.829x10^-7    0.1254                -
5                     14   6.45x10^-7     0.1967                1.57
6                     15   1.73x10        0.2904                2.32
7                     19   3.70x10        0.4148                3.2
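Returning to the Horner scheme for the polynomial preconditioners of 4.6.4, a preconditioning step v = q_m(A)u = (γ_0 I + γ_1 A + ... + γ_m A^m)u may be coded as below (a sketch: the coefficient array GAMMA and the work array W are assumed names; MATVEC is the user-supplied matrix-vector product):

      SUBROUTINE POLYPC(M,GAMMA,U,V,W,IPAR)
C     Apply the degree-M polynomial preconditioner by Horner's rule:
C     V = GAMMA(M)*U, then V = A*V + GAMMA(I)*U for I = M-1,...,0
      INTEGER M,I,J,IPAR(*)
      DOUBLE PRECISION GAMMA(0:M),U(*),V(*),W(*)
      EXTERNAL MATVEC
      DO 10 J = 1,IPAR(4)
         V(J) = GAMMA(M)*U(J)
   10 CONTINUE
      DO 30 I = M-1,0,-1
C        W = A*V via the user-supplied matrix-vector product
         CALL MATVEC(V,W,IPAR)
         DO 20 J = 1,IPAR(4)
            V(J) = W(J) + GAMMA(I)*U(J)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END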
The examples include the solution of systems using different preconditioners. In the dense and Harwell-Boeing formats the examples include diagonal and polynomial preconditioners; the five-point PDE format includes a variant of the incomplete LU factorisation and polynomial preconditioners. The polynomial preconditioners provided are the Neumann and the weighted and unweighted least-squares polynomials found in [36].

4.6.1 Eigenvalues estimation and Chebyshev acceleration

Consider the use of Chebyshev acceleration to obtain a solution to a linear system whose coefficient matrix has real entries only; the eigenvalues of the iteration matrix I − Q1 A are known to lie in the complex plane. We can use a few iterations of the routine PIMDRGMRESEV to obtain estimates of the eigenvalues of Q1 A and then switch to PIMDCHEBYSHEV. Before the latter is called, a transformation on the extreme values on the real axis must be made, as described in Section 2.

In the example below, we use the Jacobi preconditioner as shown in 4.5. Note that the vector X returned by PIMDRGMRESEV may be used as an improved initial vector for the routine PIMDCHEBYSHEV. Both routines are combined in a loop to produce a hybrid method; the code below is based on the algorithm given by Elman et al. [21, page 847].

      PROGRAM HYBRID
      INTEGER MAXIT
      EXTERNAL MATVEC,PRECON,PDUMR,PDSUM,PDNRM,PDNRM2,PROGRESS

C     SET MAXIMUM NUMBER OF ITERATIONS FOR THE HYBRID LOOP
      MAXIT = INT(N/2) + 1

C     SET LEFT PRECONDI
General results    We present below the results obtained from solving a system of n = 64 equations derived from the 5-point finite-differences discretisation of Equation (6).

We used both the IDLU(0) and the Neumann polynomial preconditioner of degree 1 as left preconditioners to solve this problem. The stopping criterion used was number 5 with ε = 10^{-9} and the 2-norm; using this criterion a solution will be accepted if ||z_k||_2 < 3.802x10^{-4}, except for PIMDRGMRES which stops its iterations when the norm of the residual is less than ε. The maximum number of iterations allowed was 32 and the initial value of the solution vector was (0, 0, ..., 0)^T. For the restarted GMRES and GCR the restarting value used was 10. The results are reported for the double precision versions of the routines.

Tables 4 and 5 show the results obtained with the PIM routines for the IDLU(0) and Neumann preconditioners on a single workstation. A status value of 0 on exit from a PIM routine indicates that the convergence conditions have been satisfied; a non-zero status value indicates that a problem has been encountered. In particular a status value of -1 is returned when the maximum number of iterations specified by the user has been exceeded. This example is characteristic of the problems facing the user of iterative methods, i.e., not all methods converge to the solution and some preconditioner may cause an iterative method to diverge (or converge slowly). We stress
[46]; the restarted, stabilised version of Bi-Conjugate Gradients (RBi-CGSTAB) [43]; the restarted, generalised minimal residual (RGMRES) [42]; the restarted, generalised conjugate residual (RGCR) [20]; the highly parallel algorithm for the quasi-minimal residual (QMR) method recently proposed by Bücker and Sauren [8]; the transpose-free quasi-minimal residual (TFQMR) [24]; and Chebyshev acceleration [32].

The routines allow the use of preconditioners; the user may choose to use left-, right- or symmetric preconditioning. Several stopping criteria are also available.

PIM was developed with two main goals:

1. To allow the user complete freedom with respect to the matrix storage, access and partitioning;

2. To achieve portability across a variety of parallel architectures and programming environments.

These goals are achieved by hiding from the PIM routines the specific details concerning the computation of the following three linear algebra operations:

1. Matrix-vector (and transpose matrix-vector) product;

2. Preconditioning step;

3. Inner products and vector norm.

Routines to compute these operations need to be provided by the user. Many vendors supply their own optimised linear algebra routines which the user may want to use.

A number of packages for the iterative solution of linear systems are available, including ITPACK [30] and NSPCG [38]. PIM differs from these packages in three main aspects. First, while ITPACK and
B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZRGMRESEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          (4 + IPAR(5)) * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC), _SCAL, _TRSV
LAPACK: _HSEQR
LIBPIM

Notes

1. The size of the orthonormal basis (maximum of 50 vectors) must be stored in IPAR(5). If the user needs to use a larger basis then the parameter IBDIM, defined in PIM_RGMRESEV, must be changed accordingly.

2. The user must supply a routine to compute the 2-norm of a vector.

3. A box containing estimates of the eigenvalues of Q1 A Q2 is returned in DPAR(3), DPAR(4), DPAR(5), DPAR(6), these values representing the minimum and maximum values in the real and imaginary axes, respectively.

Algorithm A.10 RGMRESEV
1. r0 = Q1 (b − A Q2 x0)
2. ρ0 = ||r0||_2
for k = 1, 2, ...
    3. g = (ρ_{k−1}, 0, ..., 0)^T
    4. V_1 = r_{k−1}/ρ_{k−1}
    for j = 1, 2, ..., restart
        R_{i,j} = V_i^T Q1 A Q2 V_j,  i = 1, 2, ..., j
        V̂ = Q1 A Q2 V_j − Σ_{i=1..j} R_{i,j} V_i
        R_{j+1,j} = ||V̂||_2
        V_{j+1} = V̂/R_{j+1,j}
        apply previous Givens rotations to R_{.,j}
        compute Givens rotation to zero R_{j+1,j}
        apply Givens rotation to g
        if |g_{j+1}| < RHSSTOP then
            perform steps 13 and 14 with restart = j
            stop
        endif
    endfor
    13. solve R y = g (solution to least-squares problem)
    14. x_k = x_{k−1} +
BPIM

Notes

- If more accuracy is required in the computation of the estimates of the eigenvalues, the user may modify the value of the maximum number of iterations allowed in the routine BISECTION (files pim22/single/src/pimscgev.f or pim22/double/src/pimdcgev.f).

- Not available in COMPLEX or DOUBLE COMPLEX versions.

Algorithm A.2 CGEV
r0 = Q1 (b − A Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
    compute estimates of eigenvalues
endfor

A.5 PIM_CGNR

Purpose

Solves the system Q1 A^T A Q2 x = Q1 A^T b using the CGNR method.

Synopsis

PIMSCGNR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGNR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCCGNR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCGNR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes
DSUM,PDNRM,PROGRESS)
PIMCCG(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.1 CG
r0 = Q1 (b − A Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.4 PIM_CGEV

Purpose

Solves the system Q1 A Q2 x = Q1 b using the CG method; returns, at each iteration, the estimates of the smallest and largest eigenvalues of Q1 A Q2 derived from the associated Lanczos tridiagonal matrix.

Synopsis

PIMSCGEV(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4) + 2 * (IPAR(10) + 1)
IPAR         13
_PAR         4

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT
LI
||z_k|| = 0
         -6  stopping criterion invalid on PIM_CHEBYSHEV
         -7  no estimates of eigenvalues supplied for PIM_CHEBYSHEV
         -8  underflow while computing μ_1 on PIM_CGEV
         -9  overflow while computing μ_1 on PIM_CGEV
         -10 underflow while computing μ_n on PIM_CGEV
         -11 overflow while computing μ_n on PIM_CGEV
13       If IPAR(12) is either -2 or -3, it gives the step number in the algorithm where a breakdown has occurred

_PAR (input)

Element  Description
1        The value of ε for use in the stopping criterion

_PAR (output)

Element  Description
2        The left-hand side of the stopping criterion selected
3        Minimum real part of the eigenvalues of Q1 A Q2
4        Maximum real part of the eigenvalues of Q1 A Q2
5        Minimum imaginary part of the eigenvalues of Q1 A Q2
6        Maximum imaginary part of the eigenvalues of Q1 A Q2

A.2 External routines

Purpose

To compute the matrix-vector product, transpose matrix-vector product, left preconditioning, right preconditioning, global sum of a vector, vector norm, and to monitor the progress of the iterations.

Note: The coefficient matrix and the preconditioning matrices can be made available to MATVEC, TMATVEC, PRECONL and PRECONR using COMMON blocks.

Synopsis

Matrix-vector product v = Au:
    SUBROUTINE MATVEC(U,V,IPAR)
    precision U(*), V(*)
    INTEGER IPAR(*)

Left preconditioning v = Q1 u:
    SUBROUTINE PRECONL(U,V,IPAR)
    precision U(*), V(*)
    INTEGER IPAR(*)

Para
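As an example of the PRECONL interface, a diagonal (Jacobi) left preconditioner may be written as follows (a sketch: the COMMON block /PIMD/ holding the reciprocals of the diagonal of A, and the value of LOCLEN, are assumptions):

      SUBROUTINE PRECONL(U,V,IPAR)
      INTEGER IPAR(*),I
      DOUBLE PRECISION U(*),V(*)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION DINV(LOCLEN)
      COMMON /PIMD/ DINV
C     v = Q1 u with Q1 = diag(A)**(-1); DINV(I) holds 1/A(I,I)
      DO 10 I = 1,IPAR(4)
         V(I) = DINV(I)*U(I)
   10 CONTINUE
      RETURN
      END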
M were more restrictive in this sense.

Matrix-vector product    Consider as an example a dense matrix partitioned by contiguous columns among a number of processors. For illustrative purposes we assume that N is an integer multiple of NPROCS, and that LOCLEN = N/NPROCS. The following code may then be used:

      PROGRAM MATV
C     A IS DECLARED AS IF USING A COLUMN PARTITIONING FOR AT LEAST
C     TWO PROCESSORS
      INTEGER LDA
      PARAMETER (LDA=500)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION A(LDA,LOCLEN)
      COMMON /PIMA/ A
      ...
C     SET UP PROBLEM-SOLVING PARAMETERS FOR USE BY USER-DEFINED
C     ROUTINES; THE USER MAY NEED TO SET MORE VALUES OF THE IPAR ARRAY
C     LEADING DIMENSION OF A
      IPAR(1) = LDA
C     NUMBER OF ROWS/COLUMNS OF A
      IPAR(2) = N
C     NUMBER OF PROCESSORS
      IPAR(6) = NPROCS
C     NUMBER OF ELEMENTS STORED LOCALLY
      IPAR(4) = N/IPAR(6)
C     CALL PIM ROUTINE
      CALL PIMDCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,
     +            PDSUM,PDNRM,PROGRESS)
      STOP
      END

C     MATRIX-VECTOR PRODUCT ROUTINE CALLED BY A PIM ROUTINE; THE
C     ARGUMENT LIST TO THIS ROUTINE IS FIXED
      SUBROUTINE MATVEC(U,V,IPAR)
      DOUBLE PRECISION U(*),V(*)
      INTEGER IPAR(*)
      INTEGER LDA
      PARAMETER (LDA=500)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION A(LDA,LOCLEN)
      COMMON /PIMA/ A
      ...
      RETURN
      END

The scheme above can be used for the transpose matrix-vector product as well. We note that many different storage schemes are available for storing sparse matrices; the reader may find useful
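For this column partitioning the elided body of MATVEC could form the local contribution to v = Au as a partial product of full length N, for instance with the level 2 BLAS routine DGEMV (a sketch; the accumulation of the partial products across processors is then organised as described in 4.6.2):

      SUBROUTINE MATVEC(U,V,IPAR)
      INTEGER IPAR(*)
      DOUBLE PRECISION U(*),V(*)
      INTEGER LDA,LOCLEN
      PARAMETER (LDA=500,LOCLEN=250)
      DOUBLE PRECISION A(LDA,LOCLEN)
      COMMON /PIMA/ A
      INTEGER N
C     Length-N vector of partial sums: V = A(:,1:LOCLEN)*U(1:LOCLEN)
      N = IPAR(2)
      CALL DGEMV('N',N,LOCLEN,1.0D0,A,LDA,U,1,0.0D0,V,1)
      RETURN
      END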
NR,PDSUM,PDNRM,PROGRESS)
PIMCTFQMR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZTFQMR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          12 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

1. The user must supply a routine to compute the 2-norm of a vector.

Algorithm A.13 TFQMR
1. r0 = Q1 (b − A Q2 x0)
2. w_1 = y_1 = r0
3. v_0 = g = Q1 A Q2 y_1
4. d_0 = 0
5. τ_0 = ||r0||_2
6. θ_0 = η_0 = 0
7. r̃_0 = r0
8. ρ_0 = r̃_0^T r0
for k = 1, 2, ...
    σ_{k−1} = r̃_0^T v_{k−1}
    α_{k−1} = ρ_{k−1}/σ_{k−1}
    y_{2k} = y_{2k−1} − α_{k−1} v_{k−1}
    h = Q1 A Q2 y_{2k}
    for m = 2k − 1, 2k
        w_{m+1} = w_m − α_{k−1} Q1 A Q2 y_m
        θ_m = ||w_{m+1}||_2/τ_{m−1}
        c_m = 1/sqrt(1 + θ_m^2)
        τ_m = τ_{m−1} θ_m c_m
        η_m = c_m^2 α_{k−1}
        d_m = y_m + (θ_{m−1}^2 η_{m−1}/α_{k−1}) d_{m−1}
        x_m = x_{m−1} + η_m d_m
        κ_m = τ_m sqrt(m + 1)
        if κ_m ≤ ε, check stopping criterion
    endfor
    ρ_k = r̃_0^T w_{2k+1}
    β_k = ρ_k/ρ_{k−1}
    y_{2k+1} = w_{2k+1} + β_k y_{2k}
    g = Q1 A Q2 y_{2k+1}
    v_k = g + β_k (h + β_k v_{k−1})
endfor

A.16 PIM_CHEBYSHEV

Purpose

Solves the system Q1 A Q2 x = Q1 b using the Chebyshev acceleration.

Synopsis

PIMSCHEBYSHEV(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCHEBYSHEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,P
NRM,PROGRESS)
PIMDQMR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCQMR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZQMR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          11 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Algorithm A.12 QMR
r0 = Q1 (b − A Q2 x0)
p0 = q0 = d0 = s0 = 0
ρ0 = ||r0||_2
(the remaining initialisations, and the coupled two-term Lanczos recurrences producing the direction vectors p_k, q_k and the quasi-minimisation coefficients θ_k and κ_k, follow the parallel QMR algorithm of Bücker and Sauren [8])
for k = 1, 2, ...
    Lanczos and quasi-minimisation updates as in [8]
    check stopping criterion
    d_k = θ_k d_{k−1} + κ_k p_k
    s_k = θ_k s_{k−1} + κ_k A p_k
    x_k = x_{k−1} + d_k
    r_k = r_{k−1} − s_k
endfor

A.15 PIM_TFQMR

Purpose

Solves the system Q1 A Q2 x = Q1 b using the TFQMR method with 2-norm weights (see [24], Algorithm 5.1).

Synopsis

PIMSTFQMR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDTFQMR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECO
PIM 2.2: The Parallel Iterative Methods package for Systems of Linear Equations
User's Guide (Fortran 77 version)

Rudnei Dias da Cunha
Mathematics Institute and National Supercomputing Centre, Universidade Federal do Rio Grande do Sul, Brasil

Tim Hopkins
Computing Laboratory, University of Kent at Canterbury, United Kingdom

Abstract

We describe PIM (Parallel Iterative Methods), a collection of Fortran 77 routines to solve systems of linear equations on parallel computers using iterative methods.

A number of iterative methods for symmetric and nonsymmetric systems are available, including Conjugate Gradients (CG), Bi-Conjugate Gradients (Bi-CG), Conjugate Gradients Squared (CGS), the stabilised version of Bi-Conjugate Gradients (Bi-CGSTAB), the restarted stabilised version of Bi-Conjugate Gradients (RBi-CGSTAB), generalised minimal residual (GMRES), generalised conjugate residual (GCR), normal equation solvers (CGNR and CGNE), quasi-minimal residual (QMR), transpose-free quasi-minimal residual (TFQMR) and Chebyshev acceleration.

The PIM routines can be used with user-supplied preconditioners, and left-, right- or symmetric preconditioning are supported. Several stopping criteria can be chosen by the user.

In this user's guide we present a brief overview of the iterative methods and algorithms available. The use of PIM is introduced via examples. We also present some results obtained with PIM concerning the selection of stopping criteri
β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.7 PIM_BICG

Purpose

Solves the system Q1 A Q2 x = Q1 b using the Bi-CG method.

Synopsis

PIMSBICG(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDBICG(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCBICG(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZBICG(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          8 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.5 Bi-CG
r0 = Q1 (b − A Q2 x0)
r̃0 = p0 = p̃0 = r0
ρ0 = p̃0^T r0
w0 = Q1 A Q2 p0
ξ0 = p̃0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    r̃_k = r̃_{k−1} − α_{k−1} (Q1 A Q2)^T p̃_{k−1}
    s_k = Q1 A Q2 r_k
    ρ_k = r̃_k^T r_k,  ξ_k = r̃_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    p̃_k = r̃_k + β_k p̃_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.8 PIM_CGS

Purpose

Solves the system Q1 A Q2 x = Q1 b using the CGS method.

Synopsis

PIMSCGS(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGS(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,
ROGRESS)
PIMCCHEBYSHEV(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCHEBYSHEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          5 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _SWAP, _DOT (_DOTC)
LIBPIM

Notes

1. Only stopping tests 1, 2 and 7 are allowed.

2. The box containing the eigenvalues of I − Q1 A Q2 must be stored in DPAR(3), DPAR(4), DPAR(5), DPAR(6), these values representing the minimum and maximum values in the real and imaginary axes, respectively.

Algorithm A.14 CHEBYSHEV
1. Set parameters for iteration:
   - If DPAR(3) ≤ λ(I − Q1 A) ≤ DPAR(4) (eigenvalues on the real axis):
         c = (DPAR(4) − DPAR(3))/2
         γ = 2/(2 − DPAR(4) − DPAR(3))
   - If DPAR(5) ≤ λ(I − Q1 A) ≤ DPAR(6) (eigenvalues on the imaginary axis):
         c^2 = −(max(|DPAR(5)|, |DPAR(6)|))^2
         γ = 1
   - If DPAR(3) ≤ Re(λ(I − Q1 A)) ≤ DPAR(4) and DPAR(5) ≤ Im(λ(I − Q1 A)) ≤ DPAR(6) (eigenvalues in the complex plane):
         p = √2 (DPAR(4) − DPAR(3))/2
         q = √2 (DPAR(6) − DPAR(5))/2
         d = (DPAR(3) + DPAR(4))/2
         c^2 = p^2 − q^2
         γ = 1/(1 − d)
2. f = γ Q1 b
for k = 1, 2, ...
    ω_k = 1, k = 1
    ω_k = 1/(1 − c^2/2), k = 2
    ω_k = 1/(1 − c^2 ω_{k−1}/4), k > 2
    w = (I − γ Q1 A) x_{k−1} + f
TIONING
      IPAR(8) = 1
      CALL DINIT(N,0.0D0,X,1)
      DO 10 I = 1,MAXIT

C     SET SMALL NUMBER OF ITERATIONS FOR RGMRESEV
      IPAR(10) = 3
      CALL PIMDRGMRESEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECON,PDUMR,
     +                  PDSUM,PDNRM,PROGRESS)
      IF (IPAR(12).NE.-1) THEN
         IPAR(11) = I
         GO TO 20
      END IF

C     MODIFY REAL INTERVAL TO REFLECT EIGENVALUES OF I-Q1A. BOX
C     CONTAINING THE EIGENVALUES IS RETURNED IN DPAR(3), DPAR(4),
C     DPAR(5), DPAR(6); THE FIRST TWO ARE THE INTERVAL ALONG THE REAL
C     AXIS, THE LAST TWO ARE THE INTERVAL ALONG THE IMAGINARY AXIS
      MU1 = DPAR(3)
      MUN = DPAR(4)
      DPAR(3) = 1.0D0 - MUN
      DPAR(4) = 1.0D0 - MU1

C     SET NUMBER OF ITERATIONS FOR CHEBYSHEV
      IPAR(10) = 5
      CALL PIMDCHEBYSHEV(X,B,DWRK,IPAR,DPAR,MATVEC,PRECON,PDUMR,
     +                   PDSUM,PDNRM2,PROGRESS)
      IF ((IPAR(12).EQ.0).OR.(IPAR(12).EQ.-6).OR.
     +    (IPAR(12).EQ.-7)) THEN
         IPAR(11) = I
         GO TO 20
      END IF
   10 CONTINUE
   20 CONTINUE

4.6.2 Dense storage

For the dense case, the coefficient matrix is partitioned by columns among the p processors, which are considered to be logically connected on a grid (see Figure 3(A)). Each processor stores at most ⌈n/p⌉ columns of A. For the example shown in Figure 3(B), the portion of the matrix-vector product to be stored in processor 0 is computed according to the diagram shown in Figure 3(C). Basically, each processor computes a vector with the same number of elements as that of the target processor (0 in the example), which holds the partial sums for each element. This vect
a, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the solution of linear systems: building blocks for iterative methods. SIAM, Philadelphia, 1993.

R.A.A. Bruce, J.G. Mills, and G.A. Smith. CHIMP/MPI user guide. EPCC-KTP-CHIMP-V2-USER 1.2, Edinburgh Parallel Computing Centre, University of Edinburgh, June 1994.

H.M. Bücker and M. Sauren. A parallel version of the unsymmetric Lanczos algorithm and its application to QMR. KFA-ZAM-IB-9605, Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH (KFA), March 1996.

R. Butler and E. Lusk. User's guide to the p4 programming system. ANL-92/17, Argonne National Laboratory, October 1992.

F.C. Cheng, P. Vaughan, D. Reese, and A. Skjellum. The Unify system: User's guide document for version 0.9.2. NSF Engineering Research Center, Mississippi State University, September 1994.

[11] E.J. Craig. The N-step iteration procedures. Journal of Mathematical Physics, 34:64-73, 1955.

[12] R.D. da Cunha. A Study on Iterative Methods for the Solution of Systems of Linear Equations on Transputer Networks. PhD thesis, Computing Laboratory, University of Kent at Canterbury, July 1992.

[13] R.D. da Cunha and T.R. Hopkins. Parallel preconditioned Conjugate Gradients methods on transputer networks. Transputer Communications, 1(2):111-125, 1993. Also as TR-5-93, Computing Laboratory, Universi
a and parallel scalability. A reference manual can be found at the end of this report with specific details of the routines and parameters.

Contents

1 Introduction

2 An overview of the iterative methods
    CG with eigenvalues estimation
    CGNR and CGNE
    Bi-CG
    CGS
    Bi-CGSTAB
    RBi-CGSTAB
    GMRES
    GCR
    Chebyshev acceleration

3 Internal details of PIM
    3.1 Supported architectures and environments
    3.2 Parallel programming model
    3.3 Data partitioning
    3.4 Increasing the parallel scalability of iterative methods
    3.5 Stopping criteria

4 Using PIM
    4.1 Naming convention of routines
    4.2 Obtaining PIM
    4.3 Installing PIM
        Building the PIM core functions
        Building the examples
        Cleaning up
        Using PIM in your application
    4.4 Calling a PIM iterative method routine
    4.5 External routines
        M
achcons.f and pim22/common/dmachcons.f respectively. Default values are supplied for the IEEE 754 floating-point standard, and are stored separately in the files pim/common/smachcons.f.ieee754 and pim/common/dmachcons.f.ieee754; these are used by default. However, if you are using PIM on a computer which does not support the IEEE 754 standard, you may:

1. type make smachcons or make dmachcons; this will compile and execute a program which uses the LAPACK routine _LAMCH to compute those constants, and the relevant files will be generated;

2. edit either pim/common/smachcons.f.orig or pim/common/dmachcons.f.orig and replace the strings MACHEPSVAL, UNDERFLOWVAL and OVERFLOWVAL by the values of the machine epsilon, underflow and overflow thresholds of the particular computer you are using, either in single or double precision.

To build PIM, type make makefiles to build the makefiles in the appropriate directories, and then make single, make double, make complex or make dcomplex to build the single-precision, double-precision, complex or double complex versions of PIM. This will generate .o files, one for each iterative method routine, along with the library file libpim.a which contains the support routines.

Building the examples    Example programs are provided for sequential use, and for parallel use with MPI and PVM.

The example programs require a timing routine. The distribution comes with the file examples/common/timer.f which
If we obtain a box [r, s] x [t, u], where r <= Re(lambda(G)) <= s and t <= Im(lambda(G)) <= u, then the axes of this ellipse are defined as

    p = sqrt(2) (s - r)/2,    q = sqrt(2) (u - t)/2.

These parameters for the Chebyshev iteration are computed by PIM_CHEBYSHEV. An example of the use of this routine may be found in Section 4.5.

For nonsymmetric systems, one may use a combination of the routines PIM_RGMRESEV and PIM_CHEBYSHEV as proposed by Elman et al. as a hybrid method [21, page 847].

[Figure 1: Selecting an iterative method. A decision tree: for symmetric matrices, CG (with CGEV for eigenvalue estimation, followed by CHEBYSHEV); for nonsymmetric matrices, depending on whether the transpose matrix-vector product is available, Bi-CG, CG(1), CGNR, CGNE, CGS, Bi-CGSTAB, QMR, RBi-CGSTAB, TFQMR, RGMRES(2) (with RGMRESEV for eigenvalue estimation), RGCR(2) and CHEBYSHEV. Notes: (1) only for mildly non-symmetric systems; (2) use a small restarting value if storage is at a premium.]

To conclude this section, Figure 1 shows a diagram to aid in the selection of an iterative method.

3 Internal details of PIM

3.1 Supported architectures and environments

PIM has been tested on scalar, vector and parallel computers including the Cray Y-MP2E/232, Cray Y-MP C90/16256, SGI Challenge, Intel Paragon, TMC CM-5, and networks of workstations under PVM 3.3.6, CHIMP/MPI v1.2, Argonne MPI, p4 v1.2, TCGMSG 4.02 and Intel NX. Table 1 lists the architectures and environments on which PIM has been successfully tested.
      Matrix-vector product
      Preconditioning
      Inner products, vector norms and global accumulation
  4.6 Monitoring the iterations
      Example programs
      4.6.1 Eigenvalues estimation and Chebyshev acceleration
      4.6.2 Dense storage
      4.6.3 PDE storage
            A matrix-vector product for parallel vector architectures
      4.6.4 Preconditioners
      4.6.5 Results
            Stopping criteria
            General results
            Scalability

5 Summary

References

A Reference manual
  A.1 Description of parameters
  A.2 External routines
      Matrix-vector product v = Au
      Transpose matrix-vector product v = A^T u
      Left preconditioning v = Q1 u
      Right preconditioning v = Q2 u
      Parallel sum
      Parallel vector norm
      Monitoring routine
  A.3 PIM_CG
  A.4 PIM_CGEV
  A.5 PIM_CGNR
  A.6 PIM_CGNE
3.3 Data partitioning

With PIM, the iterative method routines have no knowledge of the way in which the user has chosen to store and access either the coefficient or the preconditioning matrices. We thus restrict ourselves to partitioning the vectors.

The assumption made is that each PE knows the number of elements of each vector stored in it and that all vector variables in a processor have the same number of elements. This is a broad assumption that allows us to accommodate many different data partitioning schemes, including contiguous, cyclic (or wrap-around) and scattered partitionings. We are able to make this assumption because the vector-vector operations used (vector accumulations, assignments and copies) are disjoint element-wise. The other operations used involving matrices and vectors, which may require knowledge of the individual indices of vectors, are the responsibility of the user.

PIM requires that the elements of vectors be stored locally starting from position 1; the user thus has a local numbering of the variables which can be translated to a global numbering if required. For example, if a vector of 8 elements is partitioned in wrap-around fashion among 2 processors, using blocks of length 1, then the first processor stores elements 1, 3, 5 and 7 in the first four positions of its array, and the second processor stores elements 2, 4, 6 and 8 in positions 1 to 4 of its array. We stress that for most of the commonly used partitioning schemes data may be retrieved with very little overhead.
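As an illustration of this local numbering, the helper below (not part of PIM; the name is illustrative) maps a local index to its global counterpart for the wrap-around partitioning with blocks of length 1 just described, given the 0-based processor number MYID and the number of processors NPROCS:

      INTEGER FUNCTION IGLOB(ILOC, MYID, NPROCS)
C     GLOBAL INDEX OF LOCAL ELEMENT ILOC UNDER WRAP-AROUND
C     PARTITIONING WITH BLOCKS OF LENGTH 1: PROCESSOR MYID
C     HOLDS GLOBAL ELEMENTS MYID+1, MYID+1+NPROCS, ...
      INTEGER ILOC, MYID, NPROCS
      IGLOB = MYID + 1 + (ILOC-1)*NPROCS
      RETURN
      END

For the example above, on processor 0 of 2 this returns 1, 3, 5 and 7 for local positions 1 to 4, and on processor 1 it returns 2, 4, 6 and 8.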
Function dependencies
BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes
None.

Algorithm A.7 Bi-CGSTAB

 1. r_0 = Q1(b - A Q2 x_0)
 2. p_0 = v_0 = 0
 3. rho_0 = alpha_0 = omega_0 = 1
    for k = 1, 2, ...
 4.    rho_k = rtilde_0^T r_{k-1}
 5.    beta_k = (rho_k alpha_{k-1}) / (rho_{k-1} omega_{k-1})
 6.    p_k = r_{k-1} + beta_k (p_{k-1} - omega_{k-1} v_{k-1})
 7.    v_k = Q1 A Q2 p_k
 8.    xi_k = rtilde_0^T v_k
 9.    alpha_k = rho_k / xi_k
10.    s_k = r_{k-1} - alpha_k v_k
11.    if ||s_k|| < macheps, a soft breakdown has occurred
12.    t_k = Q1 A Q2 s_k
13.    omega_k = (t_k^T s_k) / (t_k^T t_k)
14.    x_k = x_{k-1} + alpha_k p_k + omega_k s_k
15.    r_k = s_k - omega_k t_k
       check stopping criterion
    endfor

A.10 PIM_RBICGSTAB

Purpose
Solves the system Q1 A Q2 x = Q1 b using the restarted Bi-CGSTAB method.

Synopsis
PIMSRBICGSTAB(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PSSUM, PSNRM, PROGRESS)
PIMDRBICGSTAB(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)
PIMCRBICGSTAB(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PCSUM, PSCNRM, PROGRESS)
PIMZRBICGSTAB(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PZSUM, PDZNRM, PROGRESS)

Storage requirements
Parameter   No. of words
X, B        IPAR(4)
WRK         (6 + 2 IPAR(5)) IPAR(4)
_PAR        6

Possible exit status values returned by IPAR(12): 0

Function dependencies
BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes
1. The degree of the MR polynomial (the maximum degree is 10) must be stored in IPAR(5). If the user needs to use a larger degree then the parameter IBDIM, defined in PIM_RBICGSTAB, must be changed accordingly.
  A.7 PIM_BICG
  A.8 PIM_CGS
  A.9 PIM_BICGSTAB
  A.10 PIM_RBICGSTAB
  A.11 PIM_RGMRES
  A.12 PIM_RGMRESEV
  A.13 PIM_RGCR
  A.14 PIM_QMR
  A.15 PIM_TFQMR
  A.16 PIM_CHEBYSHEV
  A.17 PIM_SETPAR
  A.18 PIM_PRTPAR
  A.19 _INIT

1 Introduction

Parallel Iterative Methods (PIM) is a collection of Fortran 77 routines designed to solve systems of linear equations (SLEs) on parallel computers using a variety of iterative methods.

PIM offers a number of iterative methods, including

  Conjugate Gradients (CG) [29],
  Conjugate Gradients for normal equations with minimisation of the residual norm (CGNR) [35],
  Conjugate Gradients for normal equations with minimisation of the error norm (CGNE) [11],
  Bi-Conjugate Gradients (Bi-CG) [22],
  Conjugate Gradients squared (CGS) [44],
  the stabilised version of Bi-Conjugate Gradients (Bi-CGSTAB),
nonsymmetric linear systems. SIAM Journal of Scientific and Statistical Computing, 13:631-644, 1992. Also as Report No. 90-50, Mathematical Institute, University of Utrecht.

A Reference manual

In this section we provide details of each individual subroutine in PIM. Each entry describes the purpose, the name and parameter list, storage requirements, function dependencies and restrictions of the respective routine.

For each iterative method routine we also provide a description of the implemented algorithm. Vectors and scalar values are denoted by lower-case letters and matrices by capital letters. Subscripts indicate either the iteration or a vector column, the latter in the case of a matrix.

Each computational step in the algorithm is numbered; if a routine suffers a breakdown, the step number where the failure occurred is returned in IPAR(13).

Whenever an underscore (_) appears it indicates the type of a variable or function (REAL, DOUBLE PRECISION, COMPLEX, DOUBLE COMPLEX) and should be replaced by S, D, C or Z, whichever is appropriate.

The COMPLEX and DOUBLE COMPLEX PIM routines compute inner products using the BLAS CDOTC and ZDOTC routines respectively.

A.1 Description of parameters

The parameters used in an iterative method routine are:

Parameter  Description
X          A vector of length IPAR(4). On input, contains the initial estimate. On output, contains the last estimate computed.
a variety of parallel environments, without imposing any restriction on the way the coefficient matrices and the preconditioning steps are handled. The user may thus explore characteristics of the problem and of the particular parallel architecture being used.

Table 4: Example with the IDLU(0) preconditioner

Method       k   Time (s)  ||r||              Status
CG           32  0.0900    59.8747            -1
CGEV         32  0.1500    59.8747            -1
Bi-CG        32  0.1000    10.8444            -1
CGS          11  0.0500    1.8723 x 10^-...    0
Bi-CGSTAB    12  0.0400    1.1099 x 10^-...    0
RBi-CGSTAB    7  0.0300    3.7795 x 10^-12     0
RGMRES        3  0.0600    2.8045 x 10^-12     0
RGMRESEV      3  0.4500    2.8045 x 10^-12     0
RGCR          3  0.0800    2.8048 x 10^-12     0
CGNR         32  0.0800    81.5539            -1
CGNE         32  0.0900    37.0570            -1
QMR          32  0.1200    4.0091             -1
TFQMR        11  0.0500    5.3482 x 10^-...    0

Table 5: Example with the Neumann polynomial preconditioner

Method       k   Time (s)  ||r||              Status
CG           32  0.1000    41.9996            -1
CGEV         32  0.1300    41.9996            -1
Bi-CG        32  0.000     13.8688            -1
CGS          14  0.0500    7.4345 x 10^-...    0
Bi-CGSTAB    14  0.0500    5.3863 x 10^-...    0
RBi-CGSTAB    8  0.0500    8.4331 x 10^-...    0
RGMRES        3  0.0600    4.0454 x 10^-...    0
RGMRESEV      3  0.3900    4.0454 x 10^-...    0
RGCR          3  0.0900    4.0455 x 10^-...    0
CGNR         32  0.080     7.4844 x 10^-...   -1
CGNE         32  0.1000    3.6205 x 10^-...   -1
QMR          32  0.1600    2.3351             -1
TFQMR        14  0.1100    7.6035 x 10^-1...   0

[Table 6: Execution time (in seconds) for the test problem (n = 16384) solved by PIMDRGMRES with the Neumann polynomial preconditioner of first degree, for p = 1, 2, 4, 8, 16 and 32 processors of the IBM SP-1, Intel Paragon XP/S, Intel iPSC/860, SGI Challenge, KSR1, Cray T3D, Cray Y-MP2E and Cray C90; timings contributed by W.H. Purvis (Daresbury Laboratory, U.K.), S. Thomas (CERCA, Montréal), A. Pindor (U. of Toronto) and P.T.M. Bulhões (Cray Research Inc.).]
4.6.4 Preconditioners

The examples involving an incomplete LU factorisation as the preconditioner for the PDE case are a modification of the usual ILU(0) method. This modification was made to allow the computation of the preconditioning step without requiring any communication to be performed. To achieve this we note that the matrices arising from the five-point finite-difference discretisation have a block-tridiagonal structure

    A = tridiag(E, B, F),    (19)

where E and F are diagonal matrices and the tridiagonal blocks B hold the central, north and south coefficients alpha, beta and gamma derived from the discretisation (the subscripts are dropped for clarity). Each block B approximates the unknowns in a single vertical line of the grid of Figure 4.

To compute a preconditioner Q = LU, we modify the ILU(0) algorithm in the sense that the blocks E and F are discarded; because only the diagonal blocks are considered, we refer to this factorisation as IDLU(0). The resulting L and U factors are block diagonal: each block of L is unit lower bidiagonal with entries X below the diagonal, and each block of U is upper bidiagonal with entries Y on the diagonal and the north coefficients beta above it,    (20)

where X and Y are the modified coefficients arising from the IDLU(0) algorithm. From the structure of L and U it may be clearly seen that applying the preconditioning step reduces to the solution of small (order l), independent, triangular systems. Each of these systems corresponds to a vertical line in the grid; since the grid was partitioned in vertical panels, these systems can be solved independently in each processor.

The polynomial preconditioners used can be expressed as

    Q1 = sum_{i=0}^{m} gamma_i (I - diag(A)^{-1} A)^i diag(A)^{-1}.    (21)
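A left preconditioning routine applying the polynomial (21) can be sketched as below, by repeated application of I - diag(A)^{-1}A; this is a sequential illustration under stated assumptions, not the routine shipped with PIM. MATVEC is the user's matrix-vector product in the calling convention of this guide; the array DIAGA (holding diag(A)), the coefficients GAMMA(0:M) (the gamma_i of (21), all equal to 1 for the Neumann preconditioner), the COMMON block and the dimension bounds are all illustrative:

      SUBROUTINE POLYPC(U, V, IPAR)
C     V = SUM_{I=0..M} GAMMA(I) * (I - D^{-1}A)^I * D^{-1} * U,
C     WITH D = DIAG(A); A SEQUENTIAL SKETCH OF (21)
      INTEGER MAXM, MAXN
      PARAMETER (MAXM=10, MAXN=10000)
      DOUBLE PRECISION U(*), V(*)
      INTEGER IPAR(*)
      DOUBLE PRECISION DIAGA(MAXN), GAMMA(0:MAXM), W(MAXN), T(MAXN)
      INTEGER M
      COMMON /POLY/ DIAGA, GAMMA, M
      INTEGER I, K, N
      EXTERNAL MATVEC
      N = IPAR(4)
C     W = D^{-1} U;  V = GAMMA(0)*W
      DO 10 K = 1, N
         W(K) = U(K)/DIAGA(K)
         V(K) = GAMMA(0)*W(K)
   10 CONTINUE
C     REPEATEDLY APPLY (I - D^{-1}A) AND ACCUMULATE
      DO 30 I = 1, M
         CALL MATVEC(W, T, IPAR)
         DO 20 K = 1, N
            W(K) = W(K) - T(K)/DIAGA(K)
            V(K) = V(K) + GAMMA(I)*W(K)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END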
Indeed, the performance of a PIM routine is dependent on the user-supplied routines for the matrix-vector products, inner products and vector norms, and on the computation of the preconditioning steps.

PIM is an ongoing project and we intend to improve it and include other iterative methods. We encourage readers to send their comments and suggestions; the authors may be contacted via e-mail at either rudner@mat.ufrgs.br or trh@ukc.ac.uk.

Acknowledgements

We would like to thank the National Supercomputing Centre (CESUP), Brazil, the National Laboratory for Scientific Computing (LNCC/CNPq), Brazil, the Parallel Laboratory, University of Bergen, Norway, the Army High Performance Computing Research Center, Minnesota, USA, and Digital Equipment Corporation (via the Internet Alpha Program), who kindly made their facilities available for our tests.

We also thank our collaborators Matthias G. Imhof (MIT), Paulo Tibério M. Bulhões (Cray Research), Steve Thomas (CERCA, Montréal), Andrzej Pindor (University of Toronto), William H. Purvis (Daresbury Laboratory, U.K.) and Ramiro B. Willmersdorf (LNCC, Brazil) for their help on testing PIM.
More details are available in the works of Ashby et al. [4], Saad [40, 41], Nachtigal et al. [37], Freund et al. [25, 26] and Barrett et al. [6].

We introduce the following notation. CG, Bi-CG, CGS, Bi-CGSTAB, restarted GMRES, restarted GCR and TFQMR solve a non-singular system of n linear equations of the form

    Q1 A Q2 x = Q1 b,    (1)

where Q1 and Q2 are the preconditioning matrices. For CGNR, the system solved is

    Q1 A^T A Q2 x = Q1 A^T b,    (2)

and for CGNE we solve the system

    Q1 A A^T Q2 x = Q1 b.    (3)

CG  The CG method is used mainly to solve Hermitian positive-definite (HPD) systems. The method minimises the error in the A-norm, and in finite-precision arithmetic it terminates in at most n iterations. The method does not require the coefficient matrix itself; only the result of a matrix-vector product Au is needed. It also requires a relatively small number of vectors to be stored per iteration, since its iterates can be expressed by short, three-term vector recurrences.

With suitable preconditioners, CG can be used to solve nonsymmetric systems. Holter et al. [36] have solved a number of problems arising from the modelling of groundwater flow via finite-difference discretisations of the two-dimensional diffusion equation. The properties of the model led to systems where the coefficient matrix was very ill-conditioned; incomplete factorisations and least-squares polynomial preconditioners were used to solve these systems. Hyperbolic
the three linear algebra operations mentioned above.

A package similar to PIM is the Simplified Linear Equation Solvers (SLES) package by Gropp and Smith [31], part of the PETSc project. In SLES the user has a number of iterative methods (CG, CGS, Bi-CGSTAB, two variants of the transpose-free QMR, restarted GMRES, Chebyshev and Richardson), which can be used together with built-in preconditioners and can be executed either sequentially or in parallel. The package may be used with any data representation of the matrix and vectors, with some routines being provided to create matrices dynamically in its internal format (a feature found in ITPACK). The user can also extend SLES, in the sense that new routines for preconditioners and iterative methods can be provided without modifying SLES. It is also possible to debug and monitor the performance of a SLES routine.

Portability of code across different multiprocessor platforms is a very important issue. For distributed-memory multiprocessor computers, a number of public-domain software libraries have appeared, including PVM [28], TCGMSG [33], NXLIB [45] and p4 [9], the latter with support for shared-memory programming. These libraries are available on a number of architectures, making it possible to port applications between different parallel computers with few, if any, modifications to the code being necessary. In 1993 the Message Passing Interface Forum, a consortium of academia and vendors, drawing on the experiences
This work was supported in part by the Army High Performance Computing Research Center, under the auspices of Army Research Office contract number DAAL03-89-C-0038 with the University of Minnesota.

References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 1992.

[2] S.F. Ashby. Minimax polynomial preconditioning for Hermitian linear systems. Report UCRL-JC-103201, Numerical Mathematics Group, Computing & Mathematics Research Division, Lawrence Livermore National Laboratory, March 1990.

[3] S.F. Ashby, T.A. Manteuffel, and J.S. Otto. A comparison of adaptive Chebyshev and least squares polynomial preconditioning for Hermitian positive definite linear systems. Report UCRL-JC-106726, Numerical Mathematics Group, Computing & Mathematics Research Division, Lawrence Livermore National Laboratory, March 1991.

[4] S.F. Ashby, T.A. Manteuffel, and P.E. Saylor. A taxonomy for Conjugate Gradient methods. SIAM Journal of Numerical Analysis, 27:1542-1568, 1990.

[5] S.F. Ashby and M.K. Seager. A proposed standard for iterative linear solvers (version 1.0). Report UCRL-102860, Numerical Mathematics Group, Computing & Mathematics Research Division, Lawrence Livermore National Laboratory, January 1990.

[6] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra,
Numerically, however, their iterations will usually differ. The method does not require the computation of transpose matrix-vector products, as Bi-CG does, and a smaller number of vectors needs to be stored per iteration than for other restarted methods like GMRES.

GMRES  The GMRES method is a very robust method for solving nonsymmetric systems. The method uses the Arnoldi process to compute an orthonormal basis {v_1, v_2, ..., v_k} of the Krylov subspace K_k(A, v_1). The solution of the system is taken as x_k = x_0 + V_k y_k, where V_k is the matrix whose columns are the orthonormal vectors v_i, and y_k is the solution of the least-squares problem min_y || ||r_0|| e_1 - H_k y ||, where the upper Hessenberg matrix H_k is generated during the Arnoldi process and e_1 = (1, 0, ..., 0)^T. This least-squares problem can be solved using a QR factorisation of H_k.

A problem that arises in connection with GMRES is that the number of vectors of order n that need to be stored grows linearly with k, and the number of multiplications grows quadratically. This may be avoided by using a restarted version of GMRES, which is the method implemented in PIM. Instead of generating an orthonormal basis of dimension k, one chooses a value c, c << n, and generates an approximation to the solution using an orthonormal basis of dimension c, thereby reducing considerably the amount of storage required.

(1) The PIM implementation of Bi-CG, CGS and Bi-CGSTAB sets rtilde_0 = r_0, but the user may modify the code if another choice of rtilde_0 is desirable.
[22] R. Fletcher. Conjugate Gradient Methods for Indefinite Systems, volume 506 of Lecture Notes in Mathematics, pages 73-89. Springer-Verlag, Heidelberg, 1976.

[23] Message Passing Interface Forum. MPI: A message-passing interface standard. TR CS-93-214, University of Tennessee, November 1993.

[24] R.W. Freund. A transpose-free quasi-minimal residual algorithm for non-Hermitian linear systems. Submitted to SIAM Journal of Scientific and Statistical Computing.

[25] R.W. Freund, G.H. Golub, and N.M. Nachtigal. Iterative solution of linear systems. Acta Numerica, 1:57-100, 1991.

[26] R.W. Freund, G.H. Golub, and N.M. Nachtigal. Recent advances in Lanczos-based iterative methods for nonsymmetric linear systems. RIACS Technical Report 92.02, Research Institute for Advanced Computer Science, NASA Ames Research Center, 1992. To appear in Algorithmic Trends for Computational Fluid Dynamics in the 90's.

[27] R.W. Freund and M. Hochbruck. A Biconjugate Gradient-type algorithm for the iterative solution of non-Hermitian linear systems on massively parallel architectures. RIACS Technical Report 92.08, Research Institute for Advanced Computer Science, NASA Ames Research Center, 1992. To appear in Computational and Applied Mathematics I: Algorithms and Theory.

[28] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V.S. Sunderam.
The CGS and TFQMR implementations available in PIM do not benefit from this approach.

3.5 Stopping criteria

PIM offers a number of stopping criteria which may be selected by the user. In Table 2 we list the different criteria used; r_k = b - A x_k is the true residual of the current estimate x_k, z_k is the pseudo-residual (usually generated by linear recurrences and possibly involving the preconditioners), and epsilon is the user-supplied tolerance. Note that the norms are not indicated; these depend on the user-supplied routine to compute a vector norm.

Table 2: Stopping criteria available in PIM

No.  Stopping criterion
1    ||r_k|| < epsilon
2    ||r_k|| < epsilon ||b||
3    sqrt(r_k^T z_k) < epsilon ||b||
4    ||z_k|| < epsilon
5    ||z_k|| < epsilon ||b||
6    ||z_k|| < epsilon ||Q1 b||
7    ||x_k - x_{k-1}|| < epsilon

If speed of execution is of foremost importance, the user needs to select the stopping criterion that imposes the minimum overhead. The following notes may be of use in the selection of an appropriate stopping criterion:

1. If the stopping criterion selected is one of 1, 2 or 3, then the true residual is computed (except when using TFQMR with either no preconditioning or left preconditioning).

2. The restarted GMRES method uses its own stopping criterion (see [42, page 862]), which is equivalent to the 2-norm of the residual (or pseudo-residual if preconditioning is used).

3. If either no preconditioning or right preconditioning is used and criterion 6 is selected, the PIM iterative method called will flag the error and exit without solving the system (except for the restarted GMRES routine).
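A criterion is selected by setting the corresponding entry of the IPAR array before calling the iterative routine. The fragment below assumes, consistently with note 1 above and with the note on PROGRESS in section A.2, that IPAR(9) holds the stopping-criterion number, and that the user-supplied tolerance is the first entry of the floating-point parameter array; both placements are assumptions for illustration:

C     SELECT STOPPING CRITERION 5, ||Z|| < EPSILON*||B||, WHICH USES
C     ONLY THE PSEUDO-RESIDUAL AND SO AVOIDS THE EXTRA TRUE-RESIDUAL
C     COMPUTATION INCURRED BY CRITERIA 1-3
      IPAR(9) = 5
C     USER-SUPPLIED TOLERANCE EPSILON
      DPAR(1) = 1.0D-8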
(2) The results obtained are based upon a beta version of the software and, consequently, are not necessarily representative of the performance of the full version of this software.

Table 1: Computers where PIM has been tested

Architecture           Compiler and O.S.
Sun SPARC              Sun Fortran 1.4, SunOS 4.1.3
Sun SPARC              Sun Fortran 2.0.1, SunOS 5.2
Sun SPARC              EPC Fortran 77, SunOS 4.1.3
Sun SPARC              EPC Fortran 90, SunOS 4.1.3
Sun SPARC              NAG Fortran 90, SunOS 4.1.3
DEC AXP 4000/610       DEC Fortran 3.3.1, DEC OSF/1 1.3
DEC AXP 3000/800       DEC Fortran 3.4.480, DEC OSF/1 2.0
SGI IRIS Indigo        MIPS Fortran 4.0.5, SGI IRIX 4.0.5F
SGI IRIS Crimson       MIPS Fortran 4.0.5, SGI IRIX 4.0.5C
SGI Indy               MIPS Fortran 5.0, SGI IRIX 5.1.1
Cray Y-MP2E/232        Cray Fortran 6.0, UNICOS 7.0.5.2
Cray Y-MP C90/16256    Cray Fortran 7.0, UNICOS 8.2.3
SGI Challenge          MIPS Fortran 5.2, SGI IRIX 5.2
Intel Paragon XP/S     Portland if77 4.5, OSF/1 1.2.6
IBM 9076 SP/1          IBM XL Fortran/6000 2.3, AIX 3.2
Cray T3D               Cray Fortran 8.0, UNICOS 8.3.3
TMC CM-5               CM Fortran 77

3.2 Parallel programming model

PIM uses the Single Program, Multiple Data (SPMD) programming model. The main implication of using this model is that certain scalar values are needed in each processing element (PE). Two of the user-supplied routines, to compute a global sum and a vector norm, must provide for this, preferably making use of a reduction and/or broadcast routine like those present in PVM 3.3.6 and MPI.
Parameters: U (input), V (output), IPAR (input).

Transpose matrix-vector product v = A^T u
SUBROUTINE TMATVEC(U, V, IPAR)
_precision U(*), V(*)
INTEGER IPAR(*)
Parameters: U (input), V (output), IPAR (input).

Right preconditioning v = Q2 u
SUBROUTINE PRECONR(U, V, IPAR)
_precision U(*), V(*)
INTEGER IPAR(*)
Parameters: U (input), V (output), IPAR (input).

Parallel sum
SUBROUTINE P_SUM(ISIZE, X, IPAR)
INTEGER ISIZE
_precision X(*)
INTEGER IPAR(*)
Parameters: ISIZE (input), X (input/output), IPAR (input).

Parallel vector norm
_precision FUNCTION P_NRM(LOCLEN, U, IPAR)
INTEGER LOCLEN
_precision U(*)
INTEGER IPAR(*)
Parameters: LOCLEN (input), U (input), IPAR (input).

Monitoring routine
SUBROUTINE PROGRESS(LOCLEN, ITNO, NORM, X, RES, TRUERES)
INTEGER LOCLEN, ITNO
_precision NORM, X(*), RES(*), TRUERES(*)
Parameters: LOCLEN (input), ITNO (input), NORM (input), X (input), RES (input), TRUERES (input).

Notes

1. Replace _precision by REAL, DOUBLE PRECISION, COMPLEX or DOUBLE COMPLEX, as appropriate.

2. In the monitoring routine PROGRESS above, the array TRUERES contains the value of the true residual r_k = b - A x_k only if IPAR(9) is 1, 2 or 3.

A.3 PIM_CG

Purpose
Solves the system Q1 A Q2 x = Q1 b using the CG method.

Synopsis
PIMSCG(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PSSUM, PSNRM, PROGRESS)
PIMDCG(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)
The calling sequence for the (double-precision) CG method is

      SUBROUTINE PIMDCG(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL,
     +                  PRECONR, PDSUM, PDNRM, PROGRESS)

and for Bi-CG (as well as for CGNR, CGNE and QMR)

      SUBROUTINE PIMDBICG(X, B, WRK, IPAR, DPAR, MATVEC, TMATVEC,
     +                    PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)

where the parameters are as follows:

Parameter  Description
X          A vector of length IPAR(4). On input, contains the initial estimate. On output, contains the last estimate computed.
B          The right-hand-side vector of length IPAR(4).
WRK        A work vector used internally (see the description of each routine for its length).
IPAR       An integer array containing input/output parameters.
_PAR       A floating-point array containing input/output parameters.
MATVEC     Matrix-vector product external subroutine.
TMATVEC    Transpose matrix-vector product external subroutine.
PRECONL    Left preconditioning external subroutine.
PRECONR    Right preconditioning external subroutine.
P_SUM      Global sum (reduction) external function.
P_NRM      Vector norm external function.
PROGRESS   Monitoring routine.

Note in the example above that, contrary to the proposal in [5], PIM uses separate routines to compute the matrix-vector and transpose matrix-vector products. See the reference manual, sections A.1 and A.2, for the description of the parameters above and the synopsis of the external routines.

4.5 External routines

As stated earlier, the user is responsible for supplying certain routines to be used internally by the iterative method routines. One of
3.4 Increasing the parallel scalability of iterative methods

One of the main causes of the poor scalability of implementations of iterative methods on distributed-memory computers is the need to compute inner products,

    alpha = v^T u = sum_{i=1}^{n} u_i v_i,

where u and v are vectors distributed across p processors (without loss of generality, assume that each processor holds n/p elements of each vector). This computation can be divided into three parts:

1. The local computation of partial sums of the form beta_j = sum u_i v_i on each processor.

2. The reduction of the beta_j values, where these values travel across the processors in some efficient way (for instance, as if traversing a binary tree up to its root) and are summed during the process. At the end, the value of alpha = sum beta_j is stored in a single processor.

3. The broadcast of alpha to all processors.

During parts 2 and 3 a number of processors are idle for some time. A possible strategy to reduce this idle time, and thus increase the scalability of the implementation, is to re-arrange the operations in the algorithm so that parts 2 and 3 accumulate a number of partial sums corresponding to several inner products. Some of the algorithms available in PIM, including CG, CGEV, Bi-CG, CGNR and CGNE, have been rewritten using the approach suggested by D'Azevedo and Romine [15]. Others, like Bi-CGSTAB, RBi-CGSTAB, RGCR, RGMRES and QMR, have not been re-arranged, but some or all of their inner products can be computed with a single global sum operation.
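In message-passing terms the three parts are a local _DOT, a reduction and a broadcast. A minimal sketch for one double-precision inner product using MPI (the routine name is illustrative; only the BLAS routine DDOT and the standard MPI calls are assumed):

      DOUBLE PRECISION FUNCTION GDOT(N, U, V)
C     GLOBAL INNER PRODUCT ALPHA = V^T U OVER ALL PROCESSORS:
C     1) LOCAL PARTIAL SUM, 2) REDUCTION TO RANK 0, 3) BROADCAST
      INCLUDE 'mpif.h'
      INTEGER N
      DOUBLE PRECISION U(*), V(*)
      DOUBLE PRECISION BETA, ALPHA, DDOT
      INTEGER IERR
      EXTERNAL DDOT
      BETA = DDOT(N, U, 1, V, 1)
      CALL MPI_REDUCE(BETA, ALPHA, 1, MPI_DOUBLE_PRECISION,
     +                MPI_SUM, 0, MPI_COMM_WORLD, IERR)
      CALL MPI_BCAST(ALPHA, 1, MPI_DOUBLE_PRECISION, 0,
     +               MPI_COMM_WORLD, IERR)
      GDOT = ALPHA
      RETURN
      END

The reduction and broadcast can, of course, be fused into a single MPI_ALLREDUCE, as done in the PDSUM example of section 4.5.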
in electromagnetics and quantum chromodynamics applications, respectively.

QMR  The QMR method of Freund and Nachtigal [27] overcomes the difficulties associated with the Bi-CG method. The original QMR algorithm uses the three-term recurrences found in the underlying Lanczos process. In [8], Bücker and Sauren propose a highly parallel algorithm for the QMR method, with a single global synchronisation point, which is implemented in this version of PIM.

TFQMR  TFQMR is a variant of CGS proposed by Freund [24]. TFQMR uses all available search-direction vectors instead of the two search vectors used in CGS. Moreover, these vectors are combined using a parameter which can be obtained via a quasi-minimisation of the residual. The method is thus extremely robust and has the advantage of not requiring the computation of transpose matrix-vector products. PIM offers TFQMR with 2-norm weights (see [24, Algorithm 5.1]).

Chebyshev acceleration  The Chebyshev acceleration is a polynomial acceleration applied to basic stationary methods of the form

    x_{k+1} = G x_k + f,

where G = I - Q1 A and f = Q1 b. If we consider k iterations of the above method, the iterates x_j may be linearly combined such that y_k = sum_{j=0}^{k} c_j x_j is a better approximation to x = A^{-1} b. The coefficients c_j are chosen so that the norm of the error vector is minimised, subject to sum_{j=0}^{k} c_j = 1. If we assume that the eigenvalues of G are contained in an interval [a, b] with -1 < a < b < 1, then the Chebyshev polynomials satisfy the above conditions on the c_j.
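In the calling convention of this guide (MATVEC applying A and PRECONL applying Q1), one step of the basic iteration can be sketched as below; the routine name is illustrative and T and Z are work vectors of length IPAR(4):

      SUBROUTINE STATIT(X, B, T, Z, IPAR)
C     ONE STEP OF THE BASIC ITERATION X <- G*X + F WITH
C     G = I - Q1*A AND F = Q1*B, COMPUTED AS X <- X + Q1*(B - A*X)
      DOUBLE PRECISION X(*), B(*), T(*), Z(*)
      INTEGER IPAR(*)
      INTEGER I
      EXTERNAL MATVEC, PRECONL
      CALL MATVEC(X, T, IPAR)
      DO 10 I = 1, IPAR(4)
         T(I) = B(I) - T(I)
   10 CONTINUE
      CALL PRECONL(T, Z, IPAR)
      DO 20 I = 1, IPAR(4)
         X(I) = X(I) + Z(I)
   20 CONTINUE
      RETURN
      END

Note that x + Q1(b - Ax) = (I - Q1 A)x + Q1 b, which is exactly G x + f.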
4 Using PIM

4.1 Naming convention of routines

The PIM routines have names of the form

    PIM_method

where _ indicates single precision (S), double precision (D), complex (C) or double-precision complex (Z), and method is one of CG, CGEV (CG with eigenvalue estimation), CGNR, CGNE, BICG, CGS, BICGSTAB, RBICGSTAB, RGMRES, RGMRESEV (RGMRES with eigenvalue estimation), RGCR, QMR, TFQMR and CHEBYSHEV.

4.2 Obtaining PIM

PIM 2.2 is available via anonymous ftp from unix.hensa.ac.uk (file /pub/misc/netlib/pim/pim22.tar.Z) and from ftp.mat.ufrgs.br (file /pub/pim/pim22.tar.gz). There is also a PIM World Wide Web home page which can be accessed at

    http://www.mat.ufrgs.br/pim-e.html

which gives a brief description of the package and allows the reader to download the software and related documentation.

The current distribution contains:

- The PIM routines, in the directories single, double, complex and dcomplex.
- A set of example programs for sequential and parallel execution (using PVM and MPI), in the directories examples/sequential, examples/pvm and examples/mpi.
- This guide, in PostScript format, in the doc directory.

4.3 Installing PIM

To install PIM, unpack the distributed compressed (or gzipped) tar file:

    uncompress pim22.tar.Z   or   gunzip pim22.tar.gz
The computation of the last two parts depends on the actual message-passing library being used. With MPI, parts 2 and 3 are offered as a single operation called MPI_ALLREDUCE. Applications using the PVM 3.3.6 Fortran interface should instead call PVMFREDUCE and then PVMFBCAST.

An important point to make is that we have chosen modifications to the iterative methods that reduce the number of synchronisation points while at the same time maintaining their convergence properties and numerical qualities. This is the case for the D'Azevedo and Romine modification; also, in the specific case of GMRES, which uses the Arnoldi process (a suitable reworking of the modified Gram-Schmidt procedure) to compute a vector basis, the computation of several inner products with a single global sum does not compromise numerical stability.

For instance, in the algorithm for the restarted GMRES (see Algorithm A.9), step 5 involves the computation of j inner products of the form V_i^T v, i = 1, 2, ..., j. It is thus possible to arrange for each processor to compute j partial sums using the BLAS _DOT routine and store these in an array. Then, in a single call to a reduction routine, these arrays are communicated among the processors and their individual elements are summed. On the completion of the global sum, the array containing the respective j inner products is stored in a single processor and is then broadcast to the remaining processors.
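A sketch of this arrangement (routine and array names illustrative): each processor accumulates its j local partial sums with DDOT, and a single MPI_ALLREDUCE delivers all j inner products at once:

      SUBROUTINE GDOTS(LOCLEN, J, V, LDV, W, R)
C     R(1:J) <- GLOBAL INNER PRODUCTS V(:,I)^T W, I = 1..J,
C     USING LOCAL DDOTS AND A SINGLE REDUCTION
      INCLUDE 'mpif.h'
      INTEGER LOCLEN, J, LDV
      DOUBLE PRECISION V(LDV,*), W(*), R(*)
C     50 MATCHES THE MAXIMUM BASIS SIZE OF PIM_RGMRES
      DOUBLE PRECISION S(50)
      INTEGER I, IERR
      DOUBLE PRECISION DDOT
      EXTERNAL DDOT
      DO 10 I = 1, J
         S(I) = DDOT(LOCLEN, V(1,I), 1, W, 1)
   10 CONTINUE
      CALL MPI_ALLREDUCE(S, R, J, MPI_DOUBLE_PRECISION, MPI_SUM,
     +                   MPI_COMM_WORLD, IERR)
      RETURN
      END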
of users of those and other libraries, defined a standard interface for message-passing operations, called MPI [23]. Today we have available implementations of MPI built on top of other, existing libraries, like the CHIMP/MPI library developed at the Edinburgh Parallel Computing Centre [7], and the Unify project [10], which provides an MPI interface on top of PVM. It is expected that native implementations will be available soon. In the previous releases of PIM (1.0 and 1.1) we had distributed examples using PVM, TCGMSG, p4 and NXLIB; from this release onwards we will support only PVM, the "de facto" standard for message passing, and MPI.

We would like to mention two projects which we believe can be used together with PIM. The first is the proposed standard for a user-level sparse BLAS by Duff et al. [19] and Heroux [34]. This standard addresses the common problem of accessing and storing sparse matrices in the context of the BLAS routines; such routines could then be called by the user in conjunction with a PIM routine. The second is the BLACS project by Dongarra et al. [17], which provides routines to perform distributed operations over matrices using PVM 3.1.

2 An overview of the iterative methods

How to choose an iterative method from the many available is still an open question, since any one of these methods may solve a particular system in very few iterations while diverging on another. In this section we provide a brief overview of the iterative methods present in PIM.
Function dependencies
BLAS: _COPY, _AXPY, _DOT (_DOTC), _SCAL, _TRSV
LIBPIM

Notes
1. The size of the orthonormal basis (maximum of 50 vectors) must be stored in IPAR(5). If the user needs to use a larger basis then the parameter IBDIM, defined in PIM_RGMRES, must be changed accordingly.
2. The user must supply a routine to compute the 2-norm of a vector.

Algorithm A.9 RGMRES

 1. r_0 = Q1(b - A Q2 x_0)
 2. beta = ||r_0||_2
    for k = 1, 2, ...
 3.    g = (beta, 0, ..., 0)^T
 4.    V_1 = r_{k-1} / beta
       for j = 1, 2, ..., restart
 5.       R(i,j) = V_i^T Q1 A Q2 V_j, i = 1, 2, ..., j
 6.       vhat = Q1 A Q2 V_j - sum_{i=1}^{j} R(i,j) V_i
 7.       R(j+1,j) = ||vhat||_2
 8.       V_{j+1} = vhat / R(j+1,j)
 9.       apply previous Givens rotations to R(:,j)
10.       compute Givens rotation to zero R(j+1,j)
11.       apply Givens rotation to g
12.       if |g_{j+1}| < RHSSTOP then perform steps 13 and 14 with restart = j, and stop
       endfor
13.    solve R y = g (solution of the least-squares problem)
14.    x_k = x_{k-1} + V y (form the approximate solution)
15.    r_k = Q1(b - A Q2 x_k)
16.    beta = ||r_k||_2
    endfor

A.12 PIM_RGMRESEV

Purpose
Solves the system Q1 A Q2 x = Q1 b using the restarted GMRES method; returns, at each iteration, the estimates of the smallest and largest eigenvalues of Q1 A Q2, obtained from the upper Hessenberg matrix produced during the Arnoldi process.

Synopsis
PIMSRGMRESEV(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PSSUM, PSNRM, PROGRESS)
PIMDRGMRESEV(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)
PIMCRGMRESEV(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PCSUM, PSCNRM, PROGRESS)
The partial-sum vector is then sent across the network to be summed, in a recursive-doubling fashion, until the accumulated vectors (carrying the contributions of the remaining processors) arrive at the target processor. These accumulated vectors are then summed together with the partial-sum vector computed locally in the target processor, yielding the elements of the vector resulting from the matrix-vector product. This process is repeated for all processors. This algorithm is described in [13].

To compute the dense transpose matrix-vector product v = A^T u, each processor broadcasts to the other processors a copy of its own part of u. The resulting part of the v vector is then computed by each processor.

4.6.3 PDE storage

For the PDE storage format, a square region is subdivided into l + 1 rows and columns, giving a grid containing l*l internal points, each point being numbered as (i, j) = (j - 1)l + i, i, j = 1, 2, ..., l (see Figure 4). At each point we assign 5 different values corresponding to the centre, north, south, east and west points on the stencil (alpha_{i,j}, beta_{i,j}, gamma_{i,j}, delta_{i,j}, epsilon_{i,j} respectively), which are derived from the PDE and the boundary conditions of the problem. Each grid point represents a variable, the solution being obtained by solving a linear system of order n = l*l.

A matrix-vector product v = Au is obtained by computing

    v_{i,j} = alpha_{i,j} u_{i,j} + beta_{i,j} u_{i+1,j} + gamma_{i,j} u_{i-1,j}
              + delta_{i,j} u_{i,j+1} + epsilon_{i,j} u_{i,j-1},    (7)

where some of the alpha, beta, gamma, delta and epsilon may be zero according to the position of the point relative to the grid.
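A sequential sketch of (7) follows, under the direction conventions assumed in the reconstruction above (north/south neighbours along i, within a vertical line; east/west along j) and with zero coefficients taken care of by the boundary tests; the routine and array names are illustrative:

      SUBROUTINE PDEMV(L, A, B, G, D, E, U, V)
C     V = A*U FOR THE FIVE-POINT STENCIL (7) ON AN L-BY-L GRID;
C     A,B,G,D,E HOLD THE CENTRE, NORTH, SOUTH, EAST AND WEST
C     COEFFICIENTS OF EACH GRID POINT
      INTEGER L
      DOUBLE PRECISION A(L,L), B(L,L), G(L,L), D(L,L), E(L,L)
      DOUBLE PRECISION U(L,L), V(L,L)
      INTEGER I, J
      DO 20 J = 1, L
         DO 10 I = 1, L
            V(I,J) = A(I,J)*U(I,J)
            IF (I.LT.L) V(I,J) = V(I,J) + B(I,J)*U(I+1,J)
            IF (I.GT.1) V(I,J) = V(I,J) + G(I,J)*U(I-1,J)
            IF (J.LT.L) V(I,J) = V(I,J) + D(I,J)*U(I,J+1)
            IF (J.GT.1) V(I,J) = V(I,J) + E(I,J)*U(I,J-1)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END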
Moreover, the method may break down; for example, the iterations cannot proceed when some quantities, dependent on rtilde_0, become zero.

CGS  CGS is a method that tries to overcome the problems of Bi-CG. By rewriting some of the expressions used in Bi-CG it is possible to eliminate the need for A^T altogether. Sonneveld [44] also noted that it is possible to, theoretically, increase the rate of convergence of Bi-CG at no extra work per iteration. However, if Bi-CG diverges for some system, CGS diverges even faster; it is also possible that CGS diverges while Bi-CG does not for some systems.

Bi-CGSTAB  Bi-CGSTAB is a variant of Bi-CG with a similar formulation to CGS. However, steepest-descent steps are performed at each iteration and these contribute to a considerably smoother convergence behaviour than that obtained with Bi-CG and CGS. It is known that for some systems Bi-CGSTAB may present an erratic convergence behaviour, as do Bi-CG and CGS.

RBi-CGSTAB  The restarted Bi-CGSTAB, proposed by Sleijpen and Fokkema [43], tries to overcome the stagnation of the iterations of Bi-CGSTAB which occurs with a large class of systems of linear equations. The method combines the restarted GMRES method and Bi-CG, being composed of two specific sections: a Bi-CG part, where l + 1 u and r vectors are produced (l being usually 2 or 4), followed by a minimal-residual step, where the residuals are minimised. RBi-CGSTAB is mathematically equivalent to Bi-CGSTAB if l = 1.
problem, it is important to be able to obtain feedback from the PIM routines as to how an iterative method is progressing.

To this end, we have included in the parameter list of each iterative method routine an external subroutine, called PROGRESS, which receives from that routine the number of vector elements stored locally (LOCLEN), the iteration number (ITNO), the norm of the residual (NORMRES, according to the norm being used), the current iteration vector (X), the residual vector (RES) and the true residual vector r_k = b - A x_k (TRUERES). This last vector contains meaningful values only if IPAR(9) is 1, 2 or 3.

The parameter list of the monitoring routine is fixed, as shown in section A.2. The example below shows a possible use of the monitoring routine, for the DOUBLE PRECISION data type:

      SUBROUTINE PROGRESS(LOCLEN, ITNO, NORMRES, X, RES, TRUERES)
      INTEGER LOCLEN, ITNO
      DOUBLE PRECISION NORMRES
      DOUBLE PRECISION X(*), RES(*), TRUERES(*)
      EXTERNAL PRINTV
      WRITE (6,FMT=9000) ITNO, NORMRES
      WRITE (6,FMT=9010) 'X:'
      CALL PRINTV(LOCLEN, X)
      WRITE (6,FMT=9010) 'RESIDUAL:'
      CALL PRINTV(LOCLEN, RES)
      WRITE (6,FMT=9010) 'TRUE RESIDUAL:'
      CALL PRINTV(LOCLEN, TRUERES)
      RETURN
 9000 FORMAT (I5,1X,D16.10)
 9010 FORMAT (A)
      END

      SUBROUTINE PRINTV(N, U)
      INTEGER N
      DOUBLE PRECISION U(*)
      INTEGER I
      DO 10 I = 1, N
         WRITE (6,FMT=9000) U(I)
   10 CONTINUE
      RETURN
 9000 FORMAT (4(D14.8,1X))
      END
PVM 3 user's guide and reference manual. Research Report ORNL/TM-12187, Oak Ridge National Laboratory, May 1993.

[29] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 2nd edition, 1989.

[30] R.G. Grimes, D.R. Kincaid, and D.M. Young. ITPACK 2.0 user's guide. Report No. CNA-150, Center for Numerical Analysis, University of Texas at Austin, August 1979.

[31] W. Gropp and B. Smith. Simplified linear equation solvers users manual. ANL-93/8, Argonne National Laboratory, February 1993.

[32] L.A. Hageman and D.M. Young. Applied Iterative Methods. Academic Press, New York, 1981.

[33] R.J. Harrison. TCGMSG Send/receive subroutines, version 4.02: User's manual. Battelle Pacific Northwest Laboratory, January 1993.

[34] M.A. Heroux. A proposal for a sparse BLAS toolkit (SPARKER working note #2). Report TR/PA/92/90, CERFACS, October 1992.

[35] M.R. Hestenes and E.L. Stiefel. Method of Conjugate Gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49:409-436, 1952.

[36] W.H. Holter, I.M. Navon, and T.C. Oppe. Parallelizable preconditioned Conjugate Gradient methods for the Cray Y-MP and the TMC CM-2. Technical report, Supercomputer Computations Research Institute, Florida State University, December 1991.

[37] N.M. Nachtigal, S.C. Reddy, and L.N. Trefethen. How fast are nonsymmetric matrix iterations? SIAM Journal on Matrix Analysis and Applications, 13:778-795, 1992.
sequential execution only.

The parallel programs for the dense and PDE storage formats have different partitioning strategies, and the matrix-vector products have been designed to take advantage of these.

The systems solved have been set up such that the solution is the vector x = (1, 1, ..., 1)^T, in order to help in checking the results. For the dense storage format, the real system has a tridiagonal coefficient matrix of order n = 500, and the complex system, of order n = 100, has the form A = mu I + S, where S is tridiagonal and mu = 4 + 4i.

The problem using the Harwell-Boeing format is NOS4 from the LANPRO collection of problems in structural engineering [18, pp. 54-55]. Problem NOS4 has order n = 100 and is derived from a finite-element approximation of a beam structure. For the PDE storage format, the system being solved is derived from the five-point finite-difference discretisation of the convection-diffusion equation

    -epsilon (d2u/dx2 + d2u/dy2) + cos(a) du/dx + sin(a) du/dy = 0    (6)

on the unit square, with epsilon = 0.1, a = -pi/6 and u prescribed on the boundary of the region; the first-order terms were discretised using forward differences. This problem was taken from [44].

A different set of systems is used for the HYBRID examples with the dense storage format. The real system has a nonsymmetric tridiagonal coefficient matrix of order n = 500, and the complex system of order n = 100 has a tridiagonal coefficient matrix A.
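In the sequential case, the right-hand side of such a system can be generated directly from the exact solution using the user's MATVEC routine; a sketch (the routine name and the work array XEX are illustrative):

      SUBROUTINE SETRHS(XEX, B, IPAR)
C     BUILD B SO THAT THE EXACT SOLUTION IS THE VECTOR OF ONES:
C     SET XEX = (1,...,1)^T AND COMPUTE B = A*XEX VIA MATVEC
      DOUBLE PRECISION XEX(*), B(*)
      INTEGER IPAR(*)
      INTEGER I
      EXTERNAL MATVEC
      DO 10 I = 1, IPAR(4)
         XEX(I) = 1.0D0
   10 CONTINUE
      CALL MATVEC(XEX, B, IPAR)
      RETURN
      END

The computed solution returned in X can then be compared against XEX element by element.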
stencil as a sequence of _AXPYs. The use of _AXPYs also provides better performance, because these operations are usually very efficient on such machines.

Consider the storage scheme described above, i.e., five coefficients alpha, beta, gamma, delta and epsilon stored per grid point, numbered sequentially as (i, j) = (j - 1)l + i, i, j = 1, 2, ..., l. The coefficients can then be stored in five separate arrays of size n = l*l. The matrix-vector product v = Au can then be obtained by the following sequence of operations:

    v_k = alpha_k u_k,                 k = 1, 2, ..., n        (9)
    v_k = v_k + beta_k u_{k+1},        k = 1, 2, ..., n-1      (10)
    v_k = v_k + gamma_k u_{k-1},       k = 2, 3, ..., n        (11)
    v_k = v_k + delta_k u_{k+l},       k = 1, 2, ..., n-l      (12)
    v_k = v_k + epsilon_k u_{k-l},     k = l+1, l+2, ..., n    (13)

and the transpose matrix-vector product v = A^T u is obtained similarly:

    v_k = alpha_k u_k,                 k = 1, 2, ..., n        (14)
    v_k = v_k + gamma_{k+1} u_{k+1},   k = 1, 2, ..., n-1      (15)
    v_k = v_k + beta_{k-1} u_{k-1},    k = 2, 3, ..., n        (16)
    v_k = v_k + epsilon_{k+l} u_{k+l}, k = 1, 2, ..., n-l      (17)
    v_k = v_k + delta_{k-l} u_{k-l},   k = l+1, l+2, ..., n    (18)

Experiments on the Cray Y-MP2E/232 showed that this approach gave a three-fold improvement in performance, from 40 MFLOPS to 140 MFLOPS. A separate set of the PDE examples containing these matrix-vector product routines is provided in the pim22/examples/sequential/pvp/pde directory.

(1) This may only need to be done once, if the coefficient matrix is unchanged during the solution of the system.
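Written out for one processor, the operations (9) to (13) above become five loops over contiguous arrays, each of which vectorizes fully. A sketch, following the reconstruction of (9) to (13) given above (routine and array names illustrative):

      SUBROUTINE PVPMV(N, L, A, B, G, D, E, U, V)
C     V = A*U FOR THE FIVE-POINT STENCIL STORED AS FIVE ARRAYS OF
C     LENGTH N = L*L, FOLLOWING THE VECTOR OPERATIONS (9)-(13)
      INTEGER N, L, K
      DOUBLE PRECISION A(N), B(N), G(N), D(N), E(N), U(N), V(N)
      DO 10 K = 1, N
         V(K) = A(K)*U(K)
   10 CONTINUE
      DO 20 K = 1, N-1
         V(K) = V(K) + B(K)*U(K+1)
   20 CONTINUE
      DO 30 K = 2, N
         V(K) = V(K) + G(K)*U(K-1)
   30 CONTINUE
      DO 40 K = 1, N-L
         V(K) = V(K) + D(K)*U(K+L)
   40 CONTINUE
      DO 50 K = L+1, N
         V(K) = V(K) + E(K)*U(K-L)
   50 CONTINUE
      RETURN
      END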
the characteristics of PIM is that if external routines are not required by an iterative method routine they are not called (the only exception being the monitoring routines). The user only needs to provide those subroutines that will actually be called by an iterative method routine; depending on the selection of method, preconditioners and stopping criteria, dummy parameters may be passed in place of those that are not used. Some compilers may require the presence of all routines used in the program during the linking phase of the compilation; in this case the user may need to provide stubs for the dummy routines. Section A.2 gives the synopsis of each user-supplied external routine used by PIM.

The external routines have a fixed parameter list to which the user must adhere (see A.2). Note that, from version 2.2 onwards, the coefficient and the preconditioning matrices do not appear in the parameter list of the PIM routines. Indeed, we regard the matrix-vector products and preconditioning routines as operators returning only the appropriate resulting vector; thus the PIM routines have no knowledge of the way in which the matrices are stored.

The external routines, however, may access the matrices declared in the main program via COMMON blocks. This strategy hides from the PIM routines the details of how the matrices are declared in the main program, and thus allows the user to choose the most appropriate storage method for her problem; previous versions of PIM required the matrices in the parameter list.
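For instance, a dense matrix-vector product routine may pick the matrix up from a COMMON block shared with the main program, keeping the fixed parameter list intact. A sequential sketch using the BLAS routine DGEMV (the COMMON block name, array dimensions and use of IPAR(2) as the matrix order follow the DIAGP example later in this section, but are otherwise illustrative):

      SUBROUTINE MATVEC(U, V, IPAR)
C     V = A*U FOR A DENSE MATRIX KEPT IN COMMON /PIMA/, DECLARED
C     IDENTICALLY IN THE MAIN PROGRAM (SEQUENTIAL CASE)
      INTEGER LDA
      PARAMETER (LDA=500)
      DOUBLE PRECISION U(*), V(*)
      INTEGER IPAR(*)
      DOUBLE PRECISION A(LDA,LDA)
      COMMON /PIMA/ A
      EXTERNAL DGEMV
      CALL DGEMV('N', IPAR(2), IPAR(2), 1.0D0, A, LDA, U, 1,
     +           0.0D0, V, 1)
      RETURN
      END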
to be supplied by the user; we have included the source code for the routine as shown above in the directory pim/examples/common, and this may be used as-is or modified by the user as required. Please note that for large system sizes the routine above will produce very large amounts of output. We stress that this routine is always called by the PIM iterative method routines; if no monitoring is needed, a dummy routine must be provided.

Note that some of the iterative methods contain an inner loop within the main iteration loop. This means that, for PIM_RGCR and PIM_TFQMR, the value of ITNO passed to PROGRESS will be repeated as many times as the inner loop is executed. We did not modify the iteration number passed to PROGRESS, so as to reflect the true behaviour of the iterative method being used.

4.6 Example programs

In the distributed software the user will find a collection of example programs under the directory examples. The example programs show how to use PIM with three different matrix storage formats, including dense matrices, matrices derived from the five-point finite-difference discretisation of a partial differential equation (PDE) and the standard sparse representation found in the Harwell-Boeing sparse matrix collection [18].

Most of the examples are provided for sequential and parallel execution, the latter with separate codes for the PVM 3.3.6 and MPI libraries. The examples involving the Harwell-Boeing sparse format are provided for
to consult Barrett et al. [6, pp. 57ff.], where such schemes as well as algorithms to compute matrix-vector products are discussed.

Preconditioning  For the preconditioning routines, one may use the scheme outlined above for the matrix-vector product; in some cases this may not be necessary, when there is no need to operate with A or when the preconditioner is stored as a vector. An example is the diagonal (or Jacobi) left preconditioning, where Q1 = diag(A)^{-1}:

      PROGRAM DIAGP
      INTEGER LDA
      PARAMETER (LDA=500)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
C     Q1 IS DECLARED AS A VECTOR OF LENGTH 250, AS IF USING AT
C     LEAST TWO PROCESSORS
      DOUBLE PRECISION A(LDA,LOCLEN), Q1(LOCLEN)
      COMMON /PIMQ1/ Q1
      EXTERNAL MATVEC, DIAGL, PDUMR, PDSUM, PDNRM, PROGRESS
C     SET UP PROBLEM-SOLVING PARAMETERS FOR USE BY USER-DEFINED
C     ROUTINES; THE USER MAY NEED TO SET MORE VALUES OF THE IPAR
C     ARRAY
C     LEADING DIMENSION OF A
      IPAR(1) = LDA
C     NUMBER OF ROWS/COLUMNS OF A
      IPAR(2) = N
C     NUMBER OF PROCESSORS
      IPAR(6) = NPROCS
C     NUMBER OF ELEMENTS STORED LOCALLY
      IPAR(4) = N/IPAR(6)
C     SET LEFT PRECONDITIONING
      IPAR(8) = 1
      DO 10 I = 1, N
         Q1(I) = 1.0D0/A(I,I)
   10 CONTINUE
      CALL DINIT(IPAR(4), 0.0D0, X, 1)
      CALL PIMDCG(X, B, WRK, IPAR, DPAR, MATVEC, DIAGL, PDUMR,
     +            PDSUM, PDNRM, PROGRESS)
      STOP
      END

      SUBROUTINE DIAGL(U, V, IPAR)
      DOUBLE PRECISION U(*), V(*)
      INTEGER IPAR(*)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION Q1(LOCLEN)
      COMMON /PIMQ1/ Q1
      CALL DCOPY(IPAR(4), U, 1, V, 1)
      CALL DVPROD(IPAR(4), Q1, 1, V, 1)
      RETURN
      END
Matrix-vector product, see External routines, 22
Naming convention of routines, 17
Obtaining PIM, 17
Parallelism
  data partitioning, 14
  programming model, 14
Parameters
  description of, 45
  printing, 78
  setting, 77
Preconditioning step, see External routines, 23
Stopping criteria, 16
Supported architectures and environments, 13
Vector initialisation, 79
Vector norm, see External routines, 25
[14] R.D. da Cunha and T.R. Hopkins. A parallel implementation of the restarted GMRES iterative method for nonsymmetric systems of linear equations. Advances in Computational Mathematics, 2(3):261-277, April 1994. Also as TR 7-93, Computing Laboratory, University of Kent at Canterbury.

[15] E.F. D'Azevedo and C.H. Romine. Reducing communication costs in the Conjugate Gradient algorithm on distributed memory multiprocessors. Research Report ORNL/TM-12192, Oak Ridge National Laboratory, 1992.

[16] J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14(1):1-17, 1988.

[17] J.J. Dongarra, R.A. van de Geijn, and R.C. Whaley. A users' guide to the BLACS v0.1. Technical report, Computer Science Department, University of Tennessee, 1993.

[18] I.S. Duff, R.G. Grimes, and J.G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection. Report TR/PA/92/86, CERFACS, October 1992.

[19] I.S. Duff, M. Marrone, and G. Radicati. A proposal for user-level sparse BLAS (SPARKER working note #1). Report TR/PA/92/85, CERFACS, October 1992.

[20] S.C. Eisenstat. A note on the generalized Conjugate Gradient method. SIAM Journal of Numerical Analysis, 20:358-361, 1983.

[21] H.C. Elman, Y. Saad, and P. Saylor. A hybrid Chebyshev Krylov subspace algorithm for solving nonsymmetric systems of linear equations. SIAM Journal of Scientific and Statistical Computing, 7:840-855, 1986.
The following routines compute the global sum and the vector 2-norm, ||u||_2 = sqrt(u^T u), using the BLAS DDOT routine and the reduction-plus-broadcast operation provided by MPI:

      SUBROUTINE PDSUM(ISIZE, X, IPAR)
      INCLUDE 'mpif.h'
      INTEGER ISIZE
      DOUBLE PRECISION X(*)
      DOUBLE PRECISION WRK(10)
      INTEGER IERR, IPAR(*)
      EXTERNAL DCOPY, MPI_ALLREDUCE
      CALL MPI_ALLREDUCE(X, WRK, ISIZE, MPI_DOUBLE_PRECISION,
     +                   MPI_SUM, MPI_COMM_WORLD, IERR)
      CALL DCOPY(ISIZE, WRK, 1, X, 1)
      RETURN
      END

      DOUBLE PRECISION FUNCTION PDNRM(LOCLEN, U, IPAR)
      INCLUDE 'mpif.h'
      INTEGER LOCLEN
      DOUBLE PRECISION U(*)
      DOUBLE PRECISION PSUM
      INTEGER IERR, IPAR(*)
      DOUBLE PRECISION DDOT
      EXTERNAL DDOT, MPI_ALLREDUCE
      INTRINSIC SQRT
      DOUBLE PRECISION WRK(1)
      PSUM = DDOT(LOCLEN, U, 1, U, 1)
      CALL MPI_ALLREDUCE(PSUM, WRK, 1, MPI_DOUBLE_PRECISION,
     +                   MPI_SUM, MPI_COMM_WORLD, IERR)
      PDNRM = SQRT(WRK(1))
      RETURN
      END

It should be noted that P_SUM is actually a wrapper to the global sum routines available on a particular machine. Also, when executing PIM on a sequential computer, these routines are empty, i.e., the contents of the array X must not be altered in any way, since its elements already are the inner product values.

The parameter list for these routines was decided upon after inspecting the format of the global operations available from existing message-passing libraries.
We refer the user to [32, pp. 45-58, 332-339] for more details.

The Chebyshev acceleration has the property that its iterates can be expressed by short (three-term) recurrence relations and, especially important for parallel computers, no inner products or vector norms are needed (except for the stopping test). The difficulty associated with the Chebyshev acceleration is the need for good estimates either of the smallest or largest eigenvalues of G, if the eigenvalues are real, or, in the case of a complex eigenspectrum, of a region in the complex plane containing the eigenvalues of minimum and maximum modulus.

With PIM, the user may make use of two routines, PIM_CGEV and PIM_RGMRESEV, to obtain such estimates. PIM_CGEV covers the case where the eigenvalues of G are real; for the complex case, PIM_RGMRESEV should be used. To obtain appropriately accurate estimates, these routines must be used with left preconditioning, and should be allowed to run for several iterations. The estimates for the eigenvalues of Q1 A should then be modified to those of I - Q1 A. This is done by replacing the smallest and largest real values, r and s, by 1 - s and 1 - r respectively; the imaginary values should not be modified.

We note that even if A has only real eigenvalues, G may have complex (or imaginary only) eigenvalues. In this latter case, the Chebyshev acceleration is defined in terms of a minimum bounding ellipse that contains the eigenvalues of G.
tar xfp pim22.tar
cd pim

and edit the Makefile. The following variables may need to be modified:

HOME: Your top directory, e.g., /u/users/fred.

FC: Your Fortran compiler of choice, usually f77.

FFLAGS: Flags for the Fortran compilation of main programs (example programs).

OFFLAGS: Flags for the Fortran compilation of separate modules (PIM routines and modules of examples). NOTE: this must include at least the flag required for separate compilation, usually -c.

AR: The archiver program, usually ar.

HASRANLIB: Either t (true) or f (false), indicating whether it is necessary to use a random library program, usually ranlib, to build the PIM library.

BLASLIB: Either the name of an archive file containing the BLAS library, or -lblas if the library libblas.a has been installed on a system-wide basis.

PARLIB: The compilation switches for any required parallel libraries. This variable must be left blank if PIM is to be used in sequential mode. For example, if PVM 3 is to be used, then PARLIB would be defined as

    -L$(PVM_ROOT)/lib/$(PVM_ARCH) -lfpvm3 -lpvm3 -lgpvm3

Each iterative method routine is stored in a separate file, with names in lower case following the naming convention of the routines; e.g., the routine PIMDCG is stored in the file pim22/double/pimdcg.f.

Building the PIM core functions  PIM needs the values of some machine-dependent floating-point constants. The single- and double-precision values are stored in the files
    