Algorithm A.8 RBi-CGSTAB
1. r0 = Q1 (b − A Q2 x0)
2. r̃0 = r0, u0 = 0, ρ0 = 1, α = 0, ω = 1
for k = 1, 2, ...
    ρ0 = −ω ρ0
    for j = 0, 1, ..., restart − 1
        ρ1 = r̃0^T r_j,  β = α ρ1/ρ0,  ρ0 = ρ1
        u_i = r_i − β u_i,  i = 0, 1, ..., j
        u_{j+1} = Q1 A Q2 u_j
        γ = r̃0^T u_{j+1},  α = ρ0/γ
        r_i = r_i − α u_{i+1},  i = 0, 1, ..., j
        r_{j+1} = Q1 A Q2 r_j
        x0 = x0 + α u0
    endfor
    check stopping criterion
    σ_1 = r_1^T r_1,  γ'_1 = (r_0^T r_1)/σ_1
    for j = 2, 3, ..., restart
        τ_{i,j} = (r_j^T r_i)/σ_i,  r_j = r_j − τ_{i,j} r_i,  i = 1, ..., j − 1
        σ_j = r_j^T r_j,  γ'_j = (r_0^T r_j)/σ_j
    endfor
    γ_{restart} = γ'_{restart},  ω = γ_{restart}
    γ_j = γ'_j − Σ_{i=j+1..restart} τ_{j,i} γ_i,  j = restart − 1, ..., 1
    γ''_j = γ_{j+1} + Σ_{i=j+1..restart−1} τ_{j,i} γ_{i+1},  j = 1, ..., restart − 1
    x0 = x0 + γ_1 r_0
    r_0 = r_0 − γ'_{restart} r_{restart}
    u_0 = u_0 − γ_{restart} u_{restart}
    u_0 = u_0 − γ_j u_j,  j = 1, ..., restart − 1
    x0 = x0 + γ''_j r_j,  j = 1, ..., restart − 1
    r_0 = r_0 − γ'_j r_j,  j = 1, ..., restart − 1
endfor

A.11 PIM_RGMRES

Purpose

Solves the system Q1 A Q2 x = Q1 b using the restarted GMRES method.

Synopsis

PIMSRGMRES(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDRGMRES(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCRGMRES(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZRGMRES(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          (4 + IPAR(5)) * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies
C     Compute with local grid points

C     Need "east" data; wait for completion of receive
      CALL MPI_WAIT(RID0,ISTAT,IERR)
C     Compute with local interfacing grid points in the east

C     Need "west" data; wait for completion of receive
      CALL MPI_WAIT(RID1,ISTAT,IERR)
C     Compute with local interfacing grid points in the west

C     Release message IDs
      CALL MPI_WAIT(SID0,ISTAT,IERR)
      CALL MPI_WAIT(SID1,ISTAT,IERR)
      RETURN
      END

The computation of the transpose matrix-vector product for the PDE case is performed in a similar fashion. Before the computation starts, each processor exchanges with its left and right neighbouring processors the east and west coefficients corresponding to the interfacing grid points. The computation performed is then similar to the matrix-vector product described above, except that for each interfacing grid point we apply

    v_{i,j} = c_{i,j} u_{i,j} + n_{i,j−1} u_{i,j−1} + s_{i,j+1} u_{i,j+1} + e_{i−1,j} u_{i−1,j} + w_{i+1,j} u_{i+1,j}    (8)

Comparing (8) to (7) we see that the coefficients are swapped in the north-south and east-west directions. Note that due to the partitioning imposed we do not need to exchange the north and south coefficients.

A matrix-vector product for parallel vector architectures    For parallel vector architectures like the Cray Y-MP2E, the routines outlined above are not efficient, because of the small vector lengths involved. A routine requiring the use of long vectors may be obtained by writing the matrix-vector product for the 5-point
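As an illustration of (8), the loop below applies the transpose stencil at the interior points of a processor's panel (a sketch only: the coefficient arrays CC, CN, CS, CE, CW, the panel width MYL and the column length L are assumed names, not those used by the PIM example sources):

      SUBROUTINE TMVLOC(L,MYL,CC,CN,CS,CE,CW,U,V)
C     Apply the transpose of the 5-point stencil at the interior
C     grid points; U holds the local grid values, V the product
      INTEGER L,MYL,I,J
      DOUBLE PRECISION CC(L,MYL),CN(L,MYL),CS(L,MYL),CE(L,MYL),
     +                 CW(L,MYL),U(L,MYL),V(L,MYL)
      DO 20 J = 2,MYL-1
         DO 10 I = 2,L-1
C           Each neighbour contributes through its own coefficient
C           pointing back to (I,J): north-south and east-west swapped
            V(I,J) = CC(I,J)*U(I,J) + CN(I,J-1)*U(I,J-1) +
     +               CS(I,J+1)*U(I,J+1) + CE(I-1,J)*U(I-1,J) +
     +               CW(I+1,J)*U(I+1,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END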
Notes

None.

Algorithm A.3 CGNR
r0 = Q1 A^T (b − A Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A^T A Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A^T A Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.6 PIM_CGNE

Purpose

Solves the system Q1 A A^T Q2 x = Q1 b using the CGNE method.

Synopsis

PIMSCGNE(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGNE(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCCGNE(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCGNE(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.4 CGNE
r0 = Q1 (b − A A^T Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A A^T Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A A^T Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k +
PDNRM,PROGRESS)
PIMCCGS(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCGS(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          10 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.6 CGS
1. r0 = Q1 (b − A Q2 x0)
2. p0 = s0 = r̃0 = r0
3. ρ0 = r̃0^T r0
for k = 1, 2, ...
    w_{k−1} = Q1 A Q2 p_{k−1}
    ξ_{k−1} = r̃0^T w_{k−1}
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    t_{k−1} = s_{k−1} − α_{k−1} w_{k−1}
    w_{k−1} = s_{k−1} + t_{k−1}
    x_k = x_{k−1} + α_{k−1} w_{k−1}
    r_k = r_{k−1} − α_{k−1} Q1 A Q2 w_{k−1}
    check stopping criterion
    ρ_k = r̃0^T r_k
    β_k = ρ_k/ρ_{k−1}
    s_k = r_k + β_k t_{k−1}
    w_k = t_{k−1} + β_k p_{k−1}
    p_k = s_k + β_k w_k
endfor

A.9 PIM_BICGSTAB

Purpose

Solves the system Q1 A Q2 x = Q1 b using the Bi-CGSTAB method.

Synopsis

PIMSBICGSTAB(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDBICGSTAB(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCBICGSTAB(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZBICGSTAB(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          10 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function
    x_k = ω_k (w − x_{k−2}) + x_{k−2}
    check stopping criterion
endfor

A.17 PIM_SETPAR

Purpose

Sets the parameter values in the arrays IPAR and _PAR.

Synopsis

PIMSSETPAR(IPAR,SPAR,LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,
           PRECONTYPE,STOPTYPE,MAXIT,EPSILON)
INTEGER IPAR(*)
REAL SPAR(*)
INTEGER LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,PRECONTYPE,STOPTYPE,MAXIT
REAL EPSILON

PIMDSETPAR(IPAR,DPAR,LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,
           PRECONTYPE,STOPTYPE,MAXIT,EPSILON)
INTEGER IPAR(*)
DOUBLE PRECISION DPAR(*)
INTEGER LDA,N,BLKSZ,LOCLEN,BASISDIM,NPROCS,PROCID,PRECONTYPE,STOPTYPE,MAXIT
DOUBLE PRECISION EPSILON

Storage requirements

Parameter    No. of words
IPAR         13
_PAR         6

Notes

1. When using the COMPLEX and DOUBLE COMPLEX PIM routines, call PIMSSETPAR and PIMDSETPAR respectively.

A.18 PIM_PRTPAR

Purpose

Prints the parameter values stored in the arrays IPAR and _PAR.

Synopsis

PIMSPRTPAR(IPAR,SPAR)
INTEGER IPAR(*)
REAL SPAR(*)

PIMDPRTPAR(IPAR,DPAR)
INTEGER IPAR(*)
DOUBLE PRECISION DPAR(*)

Storage requirements

Parameter    No. of words
IPAR         13
_PAR         6

Notes

1. May be called only on a processing element with I/O capability.

2. When using the COMPLEX and DOUBLE COMPLEX PIM routines, call PIMSPRTPAR and PIMDPRTPAR respectively.

A.19 _INIT

Purpose

Initialises a vector of length n with the scalar value alpha. Based
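As a usage illustration for PIM_SETPAR, the call below fills IPAR and DPAR for a double-precision run (a sketch: the values shown, a system of order 500 on a single processor with no preconditioning, stopping criterion 1, at most 100 iterations and tolerance 1.0D-10, are illustrative only):

      INTEGER IPAR(13)
      DOUBLE PRECISION DPAR(6)
C     Arguments after the two arrays: LDA, N, BLKSZ, LOCLEN,
C     BASISDIM, NPROCS, PROCID, PRECONTYPE, STOPTYPE, MAXIT, EPSILON
      CALL PIMDSETPAR(IPAR,DPAR,500,500,500,500,10,1,0,0,1,100,
     +                1.0D-10)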
      RETURN

where DVPROD is a routine based on the BLAS DAXPY routine that performs an element-by-element vector multiplication. This example also shows the use of dummy arguments (PDUMR).

Note that it is the responsibility of the user to ensure that, when using preconditioning, the matrix Q1 A Q2 must satisfy any requirements made by the iterative method being used with respect to the symmetry and/or positive-definiteness of the matrix. For example, if A is a matrix with arbitrary (i.e., non-constant) diagonal entries, then both diag(A)^{-1} A and A diag(A)^{-1} will not be symmetric, and the CG and CGEV methods will generally fail to converge. For these methods symmetric preconditioning, diag(A)^{-1/2} A diag(A)^{-1/2}, should be used.

Inner products, vector norms and global accumulation    When running PIM routines on multiprocessor architectures, the inner product and vector norm routines require reduction and broadcast operations (in some message-passing libraries these can be supplied by a single routine). On vector processors these operations are handled directly by the hardware, whereas on distributed-memory architectures these operations involve the exchange of messages among the processors.

When a PIM iterative routine needs to compute an inner product, it calls _DOT to compute the partial inner-product values. The user-supplied routine P_SUM is then used to generate the global sum of those partial sums. The following code shows the routines to compute
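A PDSUM-style global-sum routine for MPI might be written as follows (a sketch: it assumes the standard MPI_ALLREDUCE reduction with MPI_SUM over MPI_COMM_WORLD and at most 128 partial values; the routines shipped with the PIM examples may differ):

      SUBROUTINE PDSUM(ISIZE,X)
      INCLUDE 'mpif.h'
      INTEGER ISIZE,IERR,I
      DOUBLE PRECISION X(*),WRK(128)
C     Form the global sum of the ISIZE partial values in X; the
C     result replaces X on all processors (reduction and broadcast)
      CALL MPI_ALLREDUCE(X,WRK,ISIZE,MPI_DOUBLE_PRECISION,MPI_SUM,
     +                   MPI_COMM_WORLD,IERR)
      DO 10 I = 1,ISIZE
         X(I) = WRK(I)
   10 CONTINUE
      RETURN
      END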
Journal on Matrix Analysis and Applications, 13(3):778-795, 1992.

T.C. Oppe, W.D. Joubert, and D.R. Kincaid. NSPCG user's guide, version 1.0. Report No. CNA-216, Center for Numerical Analysis, University of Texas at Austin, April 1988.

A. Pindor. Experiences with implementing PIM (Parallel Iterative Methods) package on KSR1. In Supercomputing Symposium '94, Toronto, June 1994.

Y. Saad. Krylov subspace methods for solving large unsymmetric systems. Mathematics of Computation, 37:105-126, 1981.

Y. Saad and M.H. Schultz. Conjugate Gradient-like algorithms for solving nonsymmetric linear systems. Mathematics of Computation, 44(170):417-424, 1985.

Y. Saad and M.H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7:856-869, 1986.

G.L.G. Sleijpen and D.R. Fokkema. BiCGstab(l) for linear equations involving unsymmetric matrices with complex spectrum. ETNA, 1:11-32, September 1993.

P. Sonneveld. CGS, a fast Lanczos-type solver for nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 10:36-52, 1989.

G. Stellner, S. Lamberts, and T. Ludwig. NXLIB User's Guide. Technical report, Institut für Informatik, Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München, October 1993.

H.A. van der Vorst. Bi-CGSTAB: A fast and smoothly converging variant of Bi-CG for the solution of nonsymm
NSPCG may be used on a parallel vector supercomputer like a Cray Y-MP, there are no versions of these packages available for distributed-memory parallel computers. Second, there is no debugging support; this is dictated by the fact that in some multiprocessing environments parallel I/O is not available. The third aspect is that we do not provide a collection of preconditioners but leave the responsibility of providing the appropriate routines to the user.

In this sense, PIM has many similarities to a proposed standard for iterative linear solvers by Ashby and Seager [5]. In that proposal, the user supplies the matrix-vector product and preconditioning routines. We believe that their proposed standard satisfies many of the needs of the scientific community as, drawing on its concepts, we have been able to provide software that has been used in a variety of parallel and sequential environments. PIM does not always follow the proposal, especially with respect to the format of the matrix-vector product routines and the lack of debugging support.

Due to the openness of the design of PIM, it is also possible to use it on a sequential machine. In this case, the user can take advantage of the BLAS [16] to compute the above operations. This characteristic is important for testing purposes: once the user is satisfied that the selection of preconditioners and stopping criteria are suitable, the computation can be accelerated by using appropriate parallel versions of t
V y (form approximate solution)
    compute eigenvalues of H_{restart}
    r_k = Q1 (b − A Q2 x_k)
    ρ_k = ||r_k||_2
endfor

A.13 PIM_RGCR

Purpose

Solves the system Q1 A Q2 x = Q1 b using the restarted GCR method.

Synopsis

PIMSRGCR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDRGCR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCRGCR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZRGCR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          (5 + 2 * IPAR(5)) * IPAR(4) + 2 * IPAR(5)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

1. The restarting value must be stored in IPAR(5).

Algorithm A.11 RGCR
1. r0 = Q1 (b − A Q2 x0)
for k = 1, 2, ...
    2. P_1 = r_{k−1},  x_k = x_{k−1},  r_k = r_{k−1}
    for j = 1, 2, ..., restart
        W_j = Q1 A Q2 P_j
        ξ_j = W_j^T W_j
        α_j = (r_k^T W_j)/ξ_j
        x_k = x_k + α_j P_j
        r_k = r_k − α_j W_j
        check stopping criterion
        q = Q1 A Q2 r_k
        P_{j+1} = r_k − Σ_{i=1..j} ((q^T W_i)/ξ_i) P_i
    endfor
endfor

A.14 PIM_QMR

Purpose

Solves the system Q1 A Q2 x = Q1 b using the highly parallel algorithm for the QMR method developed by Bücker and Sauren [8].

Synopsis

PIMSQMR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PS
applied using the local interfacing points and those from the neighbouring processors.

This parallel computation offers the possibility of overlapping communication with the computation. If the number of local grid points is large enough, one may expect that while Equation (7) is being applied to those points, the interfacing grid points of the neighbouring processors will have been transferred and be available for use. This method attempts to minimize the overheads incurred by transferring the data; note that we only make gains if the asynchronous transfer of messages is available. The example below is taken from the matrix-vector product routine using MPI.

      SUBROUTINE PDMVPDE(NPROCS,MYID,LDC,L,MYL,COEFS,U,V,UEAST,UWEST)
      INCLUDE 'mpif.h'
C     Declarations...

C     Send border U values to the (myid+1)-th processor
      MSGTYPE = 1000
      TO = MYID + 1
      CALL MPI_ISEND(U(1,MYL),L,MPI_DOUBLE_PRECISION,TO,MSGTYPE,
     +               MPI_COMM_WORLD,SID0,IERR)

C     Post to receive border U values from the (myid+1)-th processor
      MSGTYPE = 1001
      CALL MPI_IRECV(UEAST,L,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,
     +               MSGTYPE,MPI_COMM_WORLD,RID0,IERR)

C     Send border U values to the (myid-1)-th processor
      MSGTYPE = 1001
      TO = MYID - 1
      CALL MPI_ISEND(U(1,1),L,MPI_DOUBLE_PRECISION,TO,MSGTYPE,
     +               MPI_COMM_WORLD,SID1,IERR)

C     Post to receive border U values from the (myid-1)-th processor
      MSGTYPE = 1000
      CALL MPI_IRECV(UWEST,L,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,
     +               MSGTYPE,MPI_COMM_WORLD,RID1,IERR)
computed
B          The right-hand side vector of length IPAR(4)
WRK        A work vector used internally (see the description of each routine for its length)
IPAR       see below
_PAR       see below
MATVEC     Matrix-vector product external subroutine
TMATVEC    Transpose matrix-vector product external subroutine
PRECONL    Left-preconditioning external subroutine
PRECONR    Right-preconditioning external subroutine
P_SUM      Inner product external function
P_NRM      Vector norm external function
PROGRESS   Monitoring routine

IPAR (input)

Element  Description
1        Leading dimension of A
2        Number of rows or columns of A, depending on partitioning
3        Block size, for cyclic partitioning
4        Number of vector elements stored locally
5        Restarting parameter used in GMRES, GCR and RBi-CGSTAB
6        Number of processors
7        Processor identification
8        Preconditioning:
         0  No preconditioning
         1  Left preconditioning
         2  Right preconditioning
         3  Symmetric preconditioning
9        Stopping criterion (see Table 2)
10       Maximum number of iterations allowed

IPAR (output)

Element  Description
11       Number of iterations
12       Exit status:
         0   converged to solution
         -1  no convergence has been achieved
         -2  "soft" breakdown, solution may have been found
         -3  "hard" breakdown, no solution
         -4  conflict in preconditioner and stopping criterion selected
             (if IPAR(8) = 0 or IPAR(8) = 2 then IPAR(9) must not be 6)
         -5  error in stopping criterion 3,
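In application code the output elements are typically inspected immediately after a PIM call; a minimal check might read (a sketch, using the double-precision CG routine as an example):

      CALL PIMDCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,
     +            PDSUM,PDNRM,PROGRESS)
C     IPAR(11): iterations performed; IPAR(12): exit status
      IF (IPAR(12).NE.0) THEN
         WRITE (*,*) 'PIMDCG: status ',IPAR(12),' after ',
     +               IPAR(11),' iterations'
      END IF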
contains examples of the timing functions available on the Cray, the IBM RS/6000 and also the UNIX etime function. By default, the latter is used; this file must be modified to use the timing function available on the target machine.

The PVM and MPI example programs use the Fortran INCLUDE statement to include the PVM and MPI header files. Some compilers have a switch (usually -I) which allows the user to provide search directories in which files to be included are located (as with the IBM AIX XL Fortran compiler), while others require the presence of those files in the same directory as the source code resides. In the first case, you will need to include in the FFLAGS variable the relevant switches (see 4.3); in the latter, you will need to install the PVM and MPI header files (fpvm3.h and mpif.h respectively) by typing

    make install-pvm-include INCFILE=<name of fpvm3.h>
    make install-mpi-include INCFILE=<name of mpif.h>

where you should replace <name of fpvm3.h> and <name of mpif.h> by the full filename of the required include files. For instance, if PVM is installed on /usr/local/pvm3 then you should type

    make install-pvm-include INCFILE=/usr/local/pvm3/include/fpvm3.h

Figure 2 shows the directory tree containing the examples. To build them, type make followed by the name of a subdirectory of examples, e.g., make sequential/single/dense.

The PVM examples use the "groups" library libgpvm.a which provides the reduction
equations of the form

    ∂u/∂t = ∂²u/∂x² + ∂²u/∂y²

have been solved with CG using a Neumann polynomial approximation to A^{-1} as a preconditioner [12].

CG with eigenvalues estimation    An important characteristic of CG is its connection to the Lanczos method [29], which allows us to obtain estimates of the eigenvalues of Q1 A Q2 with only a little extra work per iteration. These estimates, μ_1 and μ_n, are obtained from the Lanczos tridiagonal matrix T_k whose entries are generated during the iterations of the CG method [29, pp. 475-480, 523-524]. If we define the matrices Λ_k = diag(ρ_0, ρ_1, ..., ρ_{k−1}), G_k = diag(ξ_0, ξ_1, ..., ξ_{k−1}) and B_k, the lower bidiagonal matrix with ones on the diagonal and −β_1, ..., −β_{k−1} on the subdiagonal, where ρ_i = ||r_i||_2 (r_i is the residual at the i-th iteration), ξ_i = p_i^T A p_i, and β_i = (r_i^T r_i)/(r_{i−1}^T r_{i−1}) are generated via the CG iterations (at no extra cost), we obtain the Lanczos matrix via the relation

    T_k = Λ_k^{−1} B_k G_k B_k^T Λ_k^{−1}.    (4)

Due to the structure of the matrices B_k, Λ_k and G_k, the matrix T_k can be easily updated during the CG iterations. The general formula for T_k is

    a_i = (ξ_{i−1} + β_{i−1}^2 ξ_{i−2})/ρ_{i−1}^2,  i = 1, 2, ..., k  (with β_0^2 ξ_{−1} = 0)
    b_i = −β_i ξ_{i−1}/(ρ_{i−1} ρ_i),  i = 1, 2, ..., k − 1

where a_i and b_i are the elements along the diagonal and subdiagonal of T_k, respectively.

The strategy employed to obtain the eigenvalue estimates is based on Sturm sequences [29, pp. 437-439]. For the matrix T_2 obtained during the first iteration of CG, the eigenvalues are obtained directly
from the quadratic equation derived from p_2(μ) = det(T_2 − μI). We also set an interval [c, d] = [μ_1, μ_n].

For the next iterations, we update the interval [c, d] using Gerschgorin's theorem. This is easily accomplished since at each iteration only two new values are added to T_k to give T_{k+1}; the updated interval is then

    c = min(c, a_k − (|b_{k−1}| + |b_k|), a_{k+1} − |b_k|)
    d = max(d, a_k + (|b_{k−1}| + |b_k|), a_{k+1} + |b_k|)

The new estimates for the extreme eigenvalues are then computed using a bisection routine applied to the polynomial p_{k+1}(μ) = det(T_{k+1} − μI), which is computed via a recurrence expression [29, pp. 437]. The intervals [c, μ_1] and [μ_n, d] are used in the bisection routine to find the new estimates of μ_1 and μ_n respectively.

A possible use of this routine would be to employ adaptive polynomial preconditioners (see [2] and [3]), where at each iteration information about the extreme eigenvalues of Q1 A Q2 is obtained and the polynomial preconditioner is modified to represent a more accurate approximation to A^{-1}. This routine can also be used as a preliminary step before solving the system using the Chebyshev acceleration routine (PIM_CHEBYSHEV).

CGNR and CGNE    For nonsymmetric systems, one could use the CG formulation applied to systems involving either A^T A or A A^T; these are called CGNR and CGNE respectively. The difference between both methods is that CGNR minimises the residual ||b − A x_k|| and CGNE the error ||A^{-1} b − x_k||. A potential
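The Gerschgorin update above is cheap to code; a sketch in Fortran (the arrays A and B holding the diagonal and subdiagonal of T_{k+1} are assumed names):

      SUBROUTINE GERSCH(K,A,B,C,D)
C     Update the interval [C,D] enclosing the eigenvalues of the
C     Lanczos tridiagonal matrix after rows K and K+1 are appended
C     (valid for K >= 2, i.e. from the second CG iteration onwards)
      INTEGER K
      DOUBLE PRECISION A(*),B(*),C,D
      C = MIN(C,A(K)-(ABS(B(K-1))+ABS(B(K))),A(K+1)-ABS(B(K)))
      D = MAX(D,A(K)+(ABS(B(K-1))+ABS(B(K))),A(K+1)+ABS(B(K)))
      RETURN
      END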
functions.

Figure 2: Directories containing the examples. The tree rooted at examples contains the subdirectories sequential, pvm and mpi; each of these is split into single, double, complex and dcomplex versions, which in turn contain the dense and pde example sets (and, for the sequential single and double versions, the harwell-boeing examples).

The example programs can also be built locally in those directories by changing to a specific directory and typing make.

Cleaning up    You may wish to remove some or all of the compiled codes or other files installed under the PIM directory; in this case you may type one of the following:

    make singleclean
    make doubleclean
    make complexclean
    make dcomplexclean
    make sequentialclean
    make pvmclean
    make mpiclean
    make clean-pvm-include
    make clean-mpi-include
    make examplesclean
    make makefilesclean
    make realclean

which will clean up the PIM routines, the examples, the Makefiles, the include files and all generated files, returning the package to its distribution form.

Using PIM in your application    To use PIM with your application, link your program with the .o file corresponding to the PIM iterative method routine being called and with the PIM support library libpim.a.

4.4 Calling a PIM iterative method routine

With the exception of the Bi-CG, CGNR, CGNE and QMR methods, all the implemented methods have the same parameter list as CG. The argument list for the double precision implementation of the CG
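That argument list appears in the examples later in this guide; as a minimal sketch of a call site (the declarations and the value of LOCLEN are assumptions, and IPAR/DPAR must have been filled beforehand, e.g. via PIMDSETPAR):

      EXTERNAL MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      INTEGER IPAR(13)
      DOUBLE PRECISION DPAR(6)
      DOUBLE PRECISION X(LOCLEN),B(LOCLEN),WRK(6*LOCLEN)
C     X holds the initial guess on entry and the solution on exit
      CALL PIMDCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,
     +            PDSUM,PDNRM,PROGRESS)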
on the level 1 BLAS routine _COPY.

Synopsis

SINIT(N,ALPHA,SX,INCX)
REAL ALPHA,SX(*)
INTEGER N,INCX

DINIT(N,ALPHA,DX,INCX)
DOUBLE PRECISION ALPHA,DX(*)
INTEGER N,INCX

CINIT(N,ALPHA,CX,INCX)
COMPLEX ALPHA,CX(*)
INTEGER N,INCX

ZINIT(N,ALPHA,ZX,INCX)
DOUBLE COMPLEX ALPHA,ZX(*)
INTEGER N,INCX

Storage requirements

Parameter    No. of words
X            IPAR(4)

Notes

None.

Index

Example programs
    dense storage, 30
    description of, 27
    Eigenvalues estimation and Chebyshev acceleration, 29
    PDE storage, 30
    PDE, matrix-vector product for parallel vector architectures, 34
    preconditioners, 35
    results, 36
External routines
    description of, 22
    inner product and vector norm, 25
    matrix-vector product, 22
    monitoring the iterations, 26
    preconditioning step, 23
    synopsis of, 47
Inner product, see External routines, 25
Installation procedures, 18
    Building the examples, 19
    Building the PIM core functions, 18
    Cleaning up, 20
    Using PIM in your application, 21
Iterative methods
    Bi-CG, 9; routine, 57
    Bi-CGSTAB, 10; routine, 61
    CG, 7; routine, 49
    CG with eigenvalues estimation, 8; routine, 51
    CGNE, 9; routine, 55
    CGNR, 9; routine, 53
    CGS, 10; routine, 59
    Chebyshev acceleration, 12; routine, 75
    GCR, 11; routine, 69
    GMRES, 10; routine, 65
    GMRES with eigenvalues estimation, 11; routine, 67
    increasing parallel scalability of, 15
    overview, 7
    QMR, 11; routine, 71
    Restarted Bi-CGSTAB, 10; routine, 63
    TFQMR, 11; routine, 73
Ma
problem with this approach is that the condition number of A^T A or A A^T is large even for a moderately ill-conditioned A, thus requiring a substantial number of iterations for convergence. However, as noted by Nachtigal et al. [37], CGNR is better than GMRES and CGS for some systems, including circulant matrices. More generally, CGNR and CGNE perform well if the eigenvalue spectrum of A has some symmetries; examples of such matrices are the real skew-symmetric and shifted skew-symmetric matrices A = e^{iθ}(T + σI), T = −T^T, θ real and σ complex.

Bi-CG    Bi-CG is a method derived to solve non-Hermitian systems of equations, and is closely related to the Lanczos method to compute the eigenvalues of A. The method requires few vectors per iteration and the computation of a matrix-vector product Au as well as a transpose matrix-vector product A^T u. The iterates of Bi-CG are generated in the Krylov subspace K(r_0, A) = {r_0, r_0 A, r_0 A^2, ...}, where r_0 = b − A x_0. A Galerkin condition, w^T r_k = 0 for all w in K(r̃_0, A^T), is imposed on the residual vector, where r̃_0 is an arbitrary vector satisfying r_0^T r̃_0 ≠ 0. It is important to note that two sequences of residual vectors are generated, one involving r and A and the other r̃ and A^T, but the solution vector x_k is updated using only the first sequence.

Bi-CG has an erratic convergence with large oscillations of the residual 2-norm, which usually cause a large number of iterations to be performed until convergence is achieved. M
storage needed. Although the restarted GMRES does not break down [42, pp. 865], it may, depending on the system and the value of c, produce a stationary sequence of residuals, thus not achieving convergence. Increasing the value of c usually cures this problem and may also increase the rate of convergence.

A detailed explanation of the parallel implementation of the restarted GMRES used can be found in [14].

GMRES with eigenvalues estimation    It is very easy to obtain estimates of the eigenvalues of Q1 A Q2 at each iteration of GMRES, since the upper Hessenberg matrix H_k computed during the Arnoldi process satisfies Q1 A Q2 V_k = V_k H_k. The eigenvalues of H_k approximate those of Q1 A Q2, especially on the boundaries of the region containing λ(Q1 A Q2) [40]. The QR algorithm can be used to obtain the eigenvalues of H_k. The LAPACK routine _HSEQR [1, pp. 158-159] is used for this purpose.

The routine PIM_RGMRESEV returns a box in the complex plane, defining the minimum and maximum values along the real and imaginary axes. These values can then be used by the Chebyshev acceleration routine (PIM_CHEBYSHEV).

GCR    The GCR method is generally used in its restarted form for reasons similar to those given above for GMRES. It is mathematically equivalent to the restarted version of GMRES but it is not as robust. It is applicable to systems where the coefficient matrix is of the form A = μI + R, μ complex and R real symmetric, or A = μI + S, μ real and S = −S^T, arising i
that only the neighbouring points in the vertical and horizontal directions are needed to compute v_{i,j}.

Figure 3: Matrix-vector product, dense storage format. (A) Partitioning in columns; (B) Example; and (C) Computation and communication steps.

Figure 4: Matrix-vector product, PDE storage format. The grid is partitioned into vertical panels owned by processors 0, 1, 2, ...; the boundary grid points of each panel are exchanged between neighbouring processors (five-point stencil).

A parallel computation of (7) may be organised as follows. The grid points are partitioned by vertical panels among the processors as shown in Figure 4. A processor holds at most ⌈n/p⌉ columns of grid points. To compute the matrix-vector product, each processor exchanges with its neighbours the grid points in the "interfaces" between the processors (the points marked with white squares in Figure 4). Equation (7) is then applied independently by each processor at its local grid points, except at the local interfacing points. After the interfacing grid points from the neighbouring processors have arrived at a processor, Equation (7) is
that the methods that have failed to converge in this example do converge for other systems.

Scalability    In Table 6 we present the execution times obtained by solving the test problem above, but with n = 16384 equations, with the PIMDRGMRES routine, using 10 basis vectors and the Neumann polynomial preconditioner of first degree, on the IBM SP/1, Intel Paragon XP/S, Kendall Square Research KSR1, SGI Challenge, Cray Y-MP2E and Cray C90/16256. The PIMDRGMRES routine converged to a tolerance of 10^{-13} in 70 iterations. The results for the Cray machines were obtained with the modified matrix-vector product routines described in 4.6.3. The results for the KSR1 are obtained using the KSRBLAS routines. The programs running on the SGI Challenge are from the set of examples available with the PIM distributed software, using the PVM message-passing library. The results for the IBM SP/1 are obtained using the IBM PVMe 2.0 version of PVM, which enables the use of the IBM SP/1 High Performance Switch. Note that for the IBM SP/1, SGI Challenge and Intel Paragon XP/S superlinear effects occur; we believe this is due to the specific memory organization of those machines (hierarchic memories and/or the presence of a cache memory).

5 Summary

We have described in this report PIM, the Parallel Iterative Methods package, a collection of Fortran 77 routines for the parallel solution of linear systems using iterative methods.

The package was designed to be used in a vari
which can be easily computed as a sequence of vector updates and matrix-vector products using Horner's algorithm (a sketch of this scheme follows Table 3 below). Note that the γ_{m,i} coefficients define the kind of polynomial preconditioner being used. The Neumann preconditioner is obtained when γ_{m,i} = 1 for all i; the weighted and unweighted least-squares polynomial preconditioners are those reported in [36]. The maximum available degree of the polynomial for these latter two is m = 13.

4.6.5 Results

In this section we provide some results for the example programs discussed above.

Stopping criteria    As pointed out earlier, the selection of the stopping criterion has a substantial effect on the execution time. Evidently, there is a trade-off between the time spent on each iteration and the total number of iterations required for convergence. In Table 3 we show, for each of the stopping criteria provided, the execution time per iteration when PIMDCG is applied to the tridiagonal system (described in 4.6) of order n = 500 with diagonal left preconditioning. The increase in execution time of each stopping criterion with respect to criterion 4, the "cheapest" one, is shown.

Table 3: Effect of different stopping criteria on an iterative method routine.

Stopping criterion    k    ||r_k||        Time (s)/iteration    Increase
1                     19   3.56x10^-1     0.3331                2.66
2                     15   6.901x10       0.3531                2.79
3                     14   1.29x10^-8     0.3286                2.62
4                     18   3.829x10^-7    0.1254                -
5                     14   6.45x10^-7     0.1967                1.57
6                     15   1.73x10        0.2904                2.32
7                     19   3.70x10        0.4148                3.2
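Returning to the Horner scheme for the polynomial preconditioners of 4.6.4, a preconditioning step v = q_m(A)u = (γ_0 I + γ_1 A + ... + γ_m A^m)u may be coded as below (a sketch: the coefficient array GAMMA and the work array W are assumed names; MATVEC is the user-supplied matrix-vector product):

      SUBROUTINE POLYPC(M,GAMMA,U,V,W,IPAR)
C     Apply the degree-M polynomial preconditioner by Horner's rule:
C     V = GAMMA(M)*U, then V = A*V + GAMMA(I)*U for I = M-1,...,0
      INTEGER M,I,J,IPAR(*)
      DOUBLE PRECISION GAMMA(0:M),U(*),V(*),W(*)
      EXTERNAL MATVEC
      DO 10 J = 1,IPAR(4)
         V(J) = GAMMA(M)*U(J)
   10 CONTINUE
      DO 30 I = M-1,0,-1
C        W = A*V via the user-supplied matrix-vector product
         CALL MATVEC(V,W,IPAR)
         DO 20 J = 1,IPAR(4)
            V(J) = W(J) + GAMMA(I)*U(J)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END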
The examples include the solution of systems using different preconditioners. In the dense and Harwell-Boeing formats the examples include diagonal and polynomial preconditioners; the five-point PDE format includes a variant of the incomplete LU factorisation and polynomial preconditioners. The polynomial preconditioners provided are the Neumann and the weighted and unweighted least-squares polynomials found in [36].

4.6.1 Eigenvalues estimation and Chebyshev acceleration

Consider the use of Chebyshev acceleration to obtain a solution to a linear system whose coefficient matrix has real entries only; the eigenvalues of the iteration matrix I − Q1 A are known to lie in the complex plane. We can use a few iterations of the routine PIMDRGMRESEV to obtain estimates of the eigenvalues of Q1 A and then switch to PIMDCHEBYSHEV. Before the latter is called, a transformation on the extreme values on the real axis must be made, as described in Section 2.

In the example below, we use the Jacobi preconditioner as shown in 4.5. Note that the vector X returned by PIMDRGMRESEV may be used as an improved initial vector for the routine PIMDCHEBYSHEV. Both routines are combined in a loop to produce a hybrid method; the code below is based on the algorithm given by Elman et al. [21, page 847].

      PROGRAM HYBRID
      INTEGER MAXIT
      EXTERNAL MATVEC,PRECON,PDUMR,PDSUM,PDNRM,PDNRM2,PROGRESS

C     SET MAXIMUM NUMBER OF ITERATIONS FOR THE HYBRID LOOP
      MAXIT = INT(N/2) + 1

C     SET LEFT PRECONDI
General results    We present below the results obtained from solving a system of n = 64 equations derived from the 5-point finite-differences discretisation of Equation (6).

We used both the IDLU(0) and the Neumann polynomial preconditioner of degree 1 as left preconditioners to solve this problem. The stopping criterion used was number 5 with ε = 10^{-9} and the 2-norm; using this criterion a solution will be accepted if ||z_k||_2 < 3.802x10^{-4}, except for PIMDRGMRES which stops its iterations when the norm of the residual is less than ε. The maximum number of iterations allowed was 32 and the initial value of the solution vector was (0, 0, ..., 0)^T. For the restarted GMRES and GCR the restarting value used was 10. The results are reported for the double precision versions of the routines.

Tables 4 and 5 show the results obtained with the PIM routines for the IDLU(0) and Neumann preconditioners on a single workstation. A status value of 0 on exit from a PIM routine indicates that the convergence conditions have been satisfied; a non-zero status value indicates that a problem has been encountered. In particular a status value of -1 is returned when the maximum number of iterations specified by the user has been exceeded. This example is characteristic of the problems facing the user of iterative methods, i.e., not all methods converge to the solution and some preconditioner may cause an iterative method to diverge (or converge slowly). We stress
[46]; the restarted, stabilised version of Bi-Conjugate Gradients (RBi-CGSTAB) [43]; the restarted, generalised minimal residual (RGMRES) [42]; the restarted, generalised conjugate residual (RGCR) [20]; the highly parallel algorithm for the quasi-minimal residual (QMR) method recently proposed by Bücker and Sauren [8]; the transpose-free quasi-minimal residual (TFQMR) [24]; and Chebyshev acceleration [32].

The routines allow the use of preconditioners; the user may choose to use left-, right- or symmetric preconditioning. Several stopping criteria are also available.

PIM was developed with two main goals:

1. To allow the user complete freedom with respect to the matrix storage, access and partitioning;

2. To achieve portability across a variety of parallel architectures and programming environments.

These goals are achieved by hiding from the PIM routines the specific details concerning the computation of the following three linear algebra operations:

1. Matrix-vector (and transpose matrix-vector) product;

2. Preconditioning step;

3. Inner products and vector norm.

Routines to compute these operations need to be provided by the user. Many vendors supply their own optimised linear algebra routines which the user may want to use.

A number of packages for the iterative solution of linear systems are available, including ITPACK [30] and NSPCG [38]. PIM differs from these packages in three main aspects. First, while ITPACK and
B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZRGMRESEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          (4 + IPAR(5)) * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC), _SCAL, _TRSV
LAPACK: _HSEQR
LIBPIM

Notes

1. The size of the orthonormal basis (maximum of 50 vectors) must be stored in IPAR(5). If the user needs to use a larger basis then the parameter IBDIM, defined in PIM_RGMRESEV, must be changed accordingly.

2. The user must supply a routine to compute the 2-norm of a vector.

3. A box containing estimates of the eigenvalues of Q1 A Q2 is returned in DPAR(3), DPAR(4), DPAR(5), DPAR(6), these values representing the minimum and maximum values in the real and imaginary axes, respectively.

Algorithm A.10 RGMRESEV
1. r0 = Q1 (b − A Q2 x0)
2. ρ0 = ||r0||_2
for k = 1, 2, ...
    3. g = (ρ_{k−1}, 0, ..., 0)^T
    4. V_1 = r_{k−1}/ρ_{k−1}
    for j = 1, 2, ..., restart
        R_{i,j} = V_i^T Q1 A Q2 V_j,  i = 1, 2, ..., j
        V̂ = Q1 A Q2 V_j − Σ_{i=1..j} R_{i,j} V_i
        R_{j+1,j} = ||V̂||_2
        V_{j+1} = V̂/R_{j+1,j}
        apply previous Givens rotations to R_{.,j}
        compute Givens rotation to zero R_{j+1,j}
        apply Givens rotation to g
        if |g_{j+1}| < RHSSTOP then
            perform steps 13 and 14 with restart = j
            stop
        endif
    endfor
    13. solve R y = g (solution to least-squares problem)
    14. x_k = x_{k−1} +
BPIM

Notes

- If more accuracy is required in the computation of the estimates of the eigenvalues, the user may modify the value of the maximum number of iterations allowed in the routine BISECTION (files pim22/single/src/pimscgev.f or pim22/double/src/pimdcgev.f).

- Not available in COMPLEX or DOUBLE COMPLEX versions.

Algorithm A.2 CGEV
r0 = Q1 (b − A Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
    compute estimates of eigenvalues
endfor

A.5 PIM_CGNR

Purpose

Solves the system Q1 A^T A Q2 x = Q1 A^T b using the CGNR method.

Synopsis

PIMSCGNR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGNR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCCGNR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCGNR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes
DSUM,PDNRM,PROGRESS)
PIMCCG(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.1 CG
r0 = Q1 (b − A Q2 x0)
p0 = r0
ρ0 = r0^T r0
w0 = Q1 A Q2 p0
ξ0 = p0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    s_k = Q1 A Q2 r_k
    ρ_k = r_k^T r_k,  ξ_k = r_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.4 PIM_CGEV

Purpose

Solves the system Q1 A Q2 x = Q1 b using the CG method; returns, at each iteration, the estimates of the smallest and largest eigenvalues of Q1 A Q2 derived from the associated Lanczos tridiagonal matrix.

Synopsis

PIMSCGEV(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          6 * IPAR(4) + 2 * (IPAR(10) + 1)
IPAR         13
_PAR         4

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT
LI
||z_k|| = 0
         -6  stopping criterion invalid on PIM_CHEBYSHEV
         -7  no estimates of eigenvalues supplied for PIM_CHEBYSHEV
         -8  underflow while computing μ_1 on PIM_CGEV
         -9  overflow while computing μ_1 on PIM_CGEV
         -10 underflow while computing μ_n on PIM_CGEV
         -11 overflow while computing μ_n on PIM_CGEV
13       If IPAR(12) is either -2 or -3, it gives the step number in the algorithm where a breakdown has occurred

_PAR (input)

Element  Description
1        The value of ε for use in the stopping criterion

_PAR (output)

Element  Description
2        The left-hand side of the stopping criterion selected
3        Minimum real part of the eigenvalues of Q1 A Q2
4        Maximum real part of the eigenvalues of Q1 A Q2
5        Minimum imaginary part of the eigenvalues of Q1 A Q2
6        Maximum imaginary part of the eigenvalues of Q1 A Q2

A.2 External routines

Purpose

To compute the matrix-vector product, transpose matrix-vector product, left preconditioning, right preconditioning, global sum of a vector, vector norm, and to monitor the progress of the iterations.

Note: The coefficient matrix and the preconditioning matrices can be made available to MATVEC, TMATVEC, PRECONL and PRECONR using COMMON blocks.

Synopsis

Matrix-vector product v = Au:
    SUBROUTINE MATVEC(U,V,IPAR)
    precision U(*), V(*)
    INTEGER IPAR(*)

Left preconditioning v = Q1 u:
    SUBROUTINE PRECONL(U,V,IPAR)
    precision U(*), V(*)
    INTEGER IPAR(*)

Para
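As an example of the PRECONL interface, a diagonal (Jacobi) left preconditioner may be written as follows (a sketch: the COMMON block /PIMD/ holding the reciprocals of the diagonal of A, and the value of LOCLEN, are assumptions):

      SUBROUTINE PRECONL(U,V,IPAR)
      INTEGER IPAR(*),I
      DOUBLE PRECISION U(*),V(*)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION DINV(LOCLEN)
      COMMON /PIMD/ DINV
C     v = Q1 u with Q1 = diag(A)**(-1); DINV(I) holds 1/A(I,I)
      DO 10 I = 1,IPAR(4)
         V(I) = DINV(I)*U(I)
   10 CONTINUE
      RETURN
      END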
M were more restrictive in this sense.

Matrix-vector product    Consider as an example a dense matrix partitioned by contiguous columns among a number of processors. For illustrative purposes we assume that N is an integer multiple of NPROCS, and that LOCLEN = N/NPROCS. The following code may then be used:

      PROGRAM MATV
C     A IS DECLARED AS IF USING A COLUMN PARTITIONING FOR AT LEAST
C     TWO PROCESSORS
      INTEGER LDA
      PARAMETER (LDA=500)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION A(LDA,LOCLEN)
      COMMON /PIMA/ A
      ...
C     SET UP PROBLEM-SOLVING PARAMETERS FOR USE BY USER-DEFINED
C     ROUTINES; THE USER MAY NEED TO SET MORE VALUES OF THE IPAR ARRAY
C     LEADING DIMENSION OF A
      IPAR(1) = LDA
C     NUMBER OF ROWS/COLUMNS OF A
      IPAR(2) = N
C     NUMBER OF PROCESSORS
      IPAR(6) = NPROCS
C     NUMBER OF ELEMENTS STORED LOCALLY
      IPAR(4) = N/IPAR(6)
C     CALL PIM ROUTINE
      CALL PIMDCG(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,
     +            PDSUM,PDNRM,PROGRESS)
      STOP
      END

C     MATRIX-VECTOR PRODUCT ROUTINE CALLED BY A PIM ROUTINE; THE
C     ARGUMENT LIST TO THIS ROUTINE IS FIXED
      SUBROUTINE MATVEC(U,V,IPAR)
      DOUBLE PRECISION U(*),V(*)
      INTEGER IPAR(*)
      INTEGER LDA
      PARAMETER (LDA=500)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION A(LDA,LOCLEN)
      COMMON /PIMA/ A
      ...
      RETURN
      END

The scheme above can be used for the transpose matrix-vector product as well. We note that many different storage schemes are available for storing sparse matrices; the reader may find useful
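For this column partitioning the elided body of MATVEC could form the local contribution to v = Au as a partial product of full length N, for instance with the level 2 BLAS routine DGEMV (a sketch; the accumulation of the partial products across processors is then organised as described in 4.6.2):

      SUBROUTINE MATVEC(U,V,IPAR)
      INTEGER IPAR(*)
      DOUBLE PRECISION U(*),V(*)
      INTEGER LDA,LOCLEN
      PARAMETER (LDA=500,LOCLEN=250)
      DOUBLE PRECISION A(LDA,LOCLEN)
      COMMON /PIMA/ A
      INTEGER N
C     Length-N vector of partial sums: V = A(:,1:LOCLEN)*U(1:LOCLEN)
      N = IPAR(2)
      CALL DGEMV('N',N,LOCLEN,1.0D0,A,LDA,U,1,0.0D0,V,1)
      RETURN
      END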
NR,PDSUM,PDNRM,PROGRESS)
PIMCTFQMR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZTFQMR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          12 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

1. The user must supply a routine to compute the 2-norm of a vector.

Algorithm A.13 TFQMR
1. r0 = Q1 (b − A Q2 x0)
2. w_1 = y_1 = r0
3. v_0 = g = Q1 A Q2 y_1
4. d_0 = 0
5. τ_0 = ||r0||_2
6. θ_0 = η_0 = 0
7. r̃_0 = r0
8. ρ_0 = r̃_0^T r0
for k = 1, 2, ...
    σ_{k−1} = r̃_0^T v_{k−1}
    α_{k−1} = ρ_{k−1}/σ_{k−1}
    y_{2k} = y_{2k−1} − α_{k−1} v_{k−1}
    h = Q1 A Q2 y_{2k}
    for m = 2k − 1, 2k
        w_{m+1} = w_m − α_{k−1} Q1 A Q2 y_m
        θ_m = ||w_{m+1}||_2/τ_{m−1}
        c_m = 1/sqrt(1 + θ_m^2)
        τ_m = τ_{m−1} θ_m c_m
        η_m = c_m^2 α_{k−1}
        d_m = y_m + (θ_{m−1}^2 η_{m−1}/α_{k−1}) d_{m−1}
        x_m = x_{m−1} + η_m d_m
        κ_m = τ_m sqrt(m + 1)
        if κ_m ≤ ε, check stopping criterion
    endfor
    ρ_k = r̃_0^T w_{2k+1}
    β_k = ρ_k/ρ_{k−1}
    y_{2k+1} = w_{2k+1} + β_k y_{2k}
    g = Q1 A Q2 y_{2k+1}
    v_k = g + β_k (h + β_k v_{k−1})
endfor

A.16 PIM_CHEBYSHEV

Purpose

Solves the system Q1 A Q2 x = Q1 b using the Chebyshev acceleration.

Synopsis

PIMSCHEBYSHEV(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCHEBYSHEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,PDNRM,P
NRM,PROGRESS)
PIMDQMR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCQMR(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZQMR(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          11 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Algorithm A.12 QMR
r0 = Q1 (b − A Q2 x0)
p0 = q0 = d0 = s0 = 0
ρ0 = ||r0||_2
(the remaining initialisations, and the coupled two-term Lanczos recurrences producing the direction vectors p_k, q_k and the quasi-minimisation coefficients θ_k and κ_k, follow the parallel QMR algorithm of Bücker and Sauren [8])
for k = 1, 2, ...
    Lanczos and quasi-minimisation updates as in [8]
    check stopping criterion
    d_k = θ_k d_{k−1} + κ_k p_k
    s_k = θ_k s_{k−1} + κ_k A p_k
    x_k = x_{k−1} + d_k
    r_k = r_{k−1} − s_k
endfor

A.15 PIM_TFQMR

Purpose

Solves the system Q1 A Q2 x = Q1 b using the TFQMR method with 2-norm weights (see [24], Algorithm 5.1).

Synopsis

PIMSTFQMR(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDTFQMR(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECO
PIM 2.2: The Parallel Iterative Methods package for Systems of Linear Equations
User's Guide (Fortran 77 version)

Rudnei Dias da Cunha
Mathematics Institute and National Supercomputing Centre, Universidade Federal do Rio Grande do Sul, Brasil

Tim Hopkins
Computing Laboratory, University of Kent at Canterbury, United Kingdom

Abstract

We describe PIM (Parallel Iterative Methods), a collection of Fortran 77 routines to solve systems of linear equations on parallel computers using iterative methods.

A number of iterative methods for symmetric and nonsymmetric systems are available, including Conjugate Gradients (CG), Bi-Conjugate Gradients (Bi-CG), Conjugate Gradients Squared (CGS), the stabilised version of Bi-Conjugate Gradients (Bi-CGSTAB), the restarted stabilised version of Bi-Conjugate Gradients (RBi-CGSTAB), generalised minimal residual (GMRES), generalised conjugate residual (GCR), normal equation solvers (CGNR and CGNE), quasi-minimal residual (QMR), transpose-free quasi-minimal residual (TFQMR) and Chebyshev acceleration.

The PIM routines can be used with user-supplied preconditioners, and left-, right- or symmetric preconditioning are supported. Several stopping criteria can be chosen by the user.

In this user's guide we present a brief overview of the iterative methods and algorithms available. The use of PIM is introduced via examples. We also present some results obtained with PIM concerning the selection of stopping criteri
β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.7 PIM_BICG

Purpose

Solves the system Q1 A Q2 x = Q1 b using the Bi-CG method.

Synopsis

PIMSBICG(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDBICG(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PDSUM,PDNRM,PROGRESS)
PIMCBICG(X,B,WRK,IPAR,SPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZBICG(X,B,WRK,IPAR,DPAR,MATVEC,TMATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          8 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes

None.

Algorithm A.5 Bi-CG
r0 = Q1 (b − A Q2 x0)
r̃0 = p0 = p̃0 = r0
ρ0 = p̃0^T r0
w0 = Q1 A Q2 p0
ξ0 = p̃0^T w0
for k = 1, 2, ...
    α_{k−1} = ρ_{k−1}/ξ_{k−1}
    x_k = x_{k−1} + α_{k−1} p_{k−1}
    r_k = r_{k−1} − α_{k−1} w_{k−1}
    check stopping criterion
    r̃_k = r̃_{k−1} − α_{k−1} (Q1 A Q2)^T p̃_{k−1}
    s_k = Q1 A Q2 r_k
    ρ_k = r̃_k^T r_k,  ξ_k = r̃_k^T s_k
    β_k = ρ_k/ρ_{k−1}
    p_k = r_k + β_k p_{k−1}
    p̃_k = r̃_k + β_k p̃_{k−1}
    w_k = s_k + β_k w_{k−1}
    ξ_k = ξ_k − β_k^2 ξ_{k−1}
endfor

A.8 PIM_CGS

Purpose

Solves the system Q1 A Q2 x = Q1 b using the CGS method.

Synopsis

PIMSCGS(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PSSUM,PSNRM,PROGRESS)
PIMDCGS(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PDSUM,
ROGRESS)
PIMCCHEBYSHEV(X,B,WRK,IPAR,SPAR,MATVEC,PRECONL,PRECONR,PCSUM,PSCNRM,PROGRESS)
PIMZCHEBYSHEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECONL,PRECONR,PZSUM,PDZNRM,PROGRESS)

Storage requirements

Parameter    No. of words
X, B         IPAR(4)
WRK          5 * IPAR(4)
IPAR         13
_PAR         6

Possible exit status values returned by IPAR(12)

0

Function dependencies

BLAS: _COPY, _AXPY, _SWAP, _DOT (_DOTC)
LIBPIM

Notes

1. Only stopping tests 1, 2 and 7 are allowed.

2. The box containing the eigenvalues of I − Q1 A Q2 must be stored in DPAR(3), DPAR(4), DPAR(5), DPAR(6), these values representing the minimum and maximum values in the real and imaginary axes, respectively.

Algorithm A.14 CHEBYSHEV
1. Set parameters for iteration:
   - If DPAR(3) ≤ λ(I − Q1 A) ≤ DPAR(4) (eigenvalues on the real axis):
         c = (DPAR(4) − DPAR(3))/2
         γ = 2/(2 − DPAR(4) − DPAR(3))
   - If DPAR(5) ≤ λ(I − Q1 A) ≤ DPAR(6) (eigenvalues on the imaginary axis):
         c^2 = −(max(|DPAR(5)|, |DPAR(6)|))^2
         γ = 1
   - If DPAR(3) ≤ Re(λ(I − Q1 A)) ≤ DPAR(4) and DPAR(5) ≤ Im(λ(I − Q1 A)) ≤ DPAR(6) (eigenvalues in the complex plane):
         p = √2 (DPAR(4) − DPAR(3))/2
         q = √2 (DPAR(6) − DPAR(5))/2
         d = (DPAR(3) + DPAR(4))/2
         c^2 = p^2 − q^2
         γ = 1/(1 − d)
2. f = γ Q1 b
for k = 1, 2, ...
    ω_k = 1, k = 1
    ω_k = 1/(1 − c^2/2), k = 2
    ω_k = 1/(1 − c^2 ω_{k−1}/4), k > 2
    w = (I − γ Q1 A) x_{k−1} + f
TIONING
      IPAR(8) = 1
      CALL DINIT(N,0.0D0,X,1)
      DO 10 I = 1,MAXIT

C     SET SMALL NUMBER OF ITERATIONS FOR RGMRESEV
      IPAR(10) = 3
      CALL PIMDRGMRESEV(X,B,WRK,IPAR,DPAR,MATVEC,PRECON,PDUMR,
     +                  PDSUM,PDNRM,PROGRESS)
      IF (IPAR(12).NE.-1) THEN
         IPAR(11) = I
         GO TO 20
      END IF

C     MODIFY REAL INTERVAL TO REFLECT EIGENVALUES OF I-Q1A. BOX
C     CONTAINING THE EIGENVALUES IS RETURNED IN DPAR(3), DPAR(4),
C     DPAR(5), DPAR(6); THE FIRST TWO ARE THE INTERVAL ALONG THE REAL
C     AXIS, THE LAST TWO ARE THE INTERVAL ALONG THE IMAGINARY AXIS
      MU1 = DPAR(3)
      MUN = DPAR(4)
      DPAR(3) = 1.0D0 - MUN
      DPAR(4) = 1.0D0 - MU1

C     SET NUMBER OF ITERATIONS FOR CHEBYSHEV
      IPAR(10) = 5
      CALL PIMDCHEBYSHEV(X,B,DWRK,IPAR,DPAR,MATVEC,PRECON,PDUMR,
     +                   PDSUM,PDNRM2,PROGRESS)
      IF ((IPAR(12).EQ.0).OR.(IPAR(12).EQ.-6).OR.
     +    (IPAR(12).EQ.-7)) THEN
         IPAR(11) = I
         GO TO 20
      END IF
   10 CONTINUE
   20 CONTINUE

4.6.2 Dense storage

For the dense case, the coefficient matrix is partitioned by columns among the p processors, which are considered to be logically connected on a grid (see Figure 3(A)). Each processor stores at most ⌈n/p⌉ columns of A. For the example shown in Figure 3(B), the portion of the matrix-vector product to be stored in processor 0 is computed according to the diagram shown in Figure 3(C). Basically, each processor computes a vector with the same number of elements as that of the target processor (0 in the example), which holds the partial sums for each element. This vect
a, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the solution of linear systems: building blocks for iterative methods. SIAM, Philadelphia, 1993.

R.A.A. Bruce, J.G. Mills, and G.A. Smith. CHIMP/MPI user guide. EPCC-KTP-CHIMP-V2-USER 1.2, Edinburgh Parallel Computing Centre, University of Edinburgh, June 1994.

H.M. Bücker and M. Sauren. A parallel version of the unsymmetric Lanczos algorithm and its application to QMR. KFA-ZAM-IB-9605, Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich GmbH (KFA), March 1996.

R. Butler and E. Lusk. User's guide to the p4 programming system. ANL-92/17, Argonne National Laboratory, October 1992.

F.C. Cheng, P. Vaughan, D. Reese, and A. Skjellum. The Unify system: User's guide document for version 0.9.2. NSF Engineering Research Center, Mississippi State University, September 1994.

[11] E.J. Craig. The N-step iteration procedures. Journal of Mathematical Physics, 34:64-73, 1955.

[12] R.D. da Cunha. A Study on Iterative Methods for the Solution of Systems of Linear Equations on Transputer Networks. PhD thesis, Computing Laboratory, University of Kent at Canterbury, July 1992.

[13] R.D. da Cunha and T.R. Hopkins. Parallel preconditioned Conjugate Gradients methods on transputer networks. Transputer Communications, 1(2):111-125, 1993. Also as TR-5-93, Computing Laboratory, Universi
a and parallel scalability. A reference manual can be found at the end of this report with specific details of the routines and parameters.

Contents

1 Introduction

2 An overview of the iterative methods
    CG with eigenvalues estimation
    CGNR and CGNE
    Bi-CG
    CGS
    Bi-CGSTAB
    RBi-CGSTAB
    GMRES
    GCR
    Chebyshev acceleration

3 Internal details of PIM
    3.1 Supported architectures and environments
    3.2 Parallel programming model
    3.3 Data partitioning
    3.4 Increasing the parallel scalability of iterative methods
    3.5 Stopping criteria

4 Using PIM
    4.1 Naming convention of routines
    4.2 Obtaining PIM
    4.3 Installing PIM
        Building the PIM core functions
        Building the examples
        Cleaning up
        Using PIM in your application
    4.4 Calling a PIM iterative method routine
    4.5 External routines
        M
achcons.f and pim22/common/dmachcons.f respectively. Default values are supplied for the IEEE 754 floating-point standard, and are stored separately in the files pim/common/smachcons.f.ieee754 and pim/common/dmachcons.f.ieee754; these are used by default. However, if you are using PIM on a computer which does not support the IEEE 754 standard, you may:

1. type make smachcons or make dmachcons; this will compile and execute a program which uses the LAPACK routine _LAMCH to compute those constants, and the relevant files will be generated;

2. edit either pim/common/smachcons.f.orig or pim/common/dmachcons.f.orig and replace the strings MACHEPSVAL, UNDERFLOWVAL and OVERFLOWVAL by the values of the machine epsilon, underflow and overflow thresholds of the particular computer you are using, either in single or double precision.

To build PIM, type make makefiles to build the makefiles in the appropriate directories, and then make single, make double, make complex or make dcomplex to build the single-precision, double-precision, complex or double complex versions of PIM. This will generate .o files, one for each iterative method routine, along with the library file libpim.a which contains the support routines.

Building the examples    Example programs are provided for sequential use, and for parallel use with MPI and PVM.

The example programs require a timing routine. The distribution comes with the file examples/common/timer.f which
If we obtain a box [r, s] x [t, u], where r <= Re(lambda(G)) <= s and t <= Im(lambda(G)) <= u, then the axes of this ellipse are defined as

    p = sqrt(2) (s - r)/2,    q = sqrt(2) (u - t)/2.

These parameters for the Chebyshev iteration are computed by PIM_CHEBYSHEV. An example of the use of this routine may be found in Section 4.5.

For nonsymmetric systems, one may use a combination of the routines PIM_RGMRESEV and PIM_CHEBYSHEV as proposed by Elman et al. as a hybrid method [21, page 847].

[Figure 1: Selecting an iterative method. A decision tree: for symmetric matrices, CG (with CGEV for eigenvalue estimation, followed by CHEBYSHEV); for nonsymmetric matrices, depending on whether the transpose matrix-vector product is available, Bi-CG, CG(1), CGNR, CGNE, CGS, Bi-CGSTAB, QMR, RBi-CGSTAB, TFQMR, RGMRES(2) (with RGMRESEV for eigenvalue estimation), RGCR(2) and CHEBYSHEV. Notes: (1) only for mildly non-symmetric systems; (2) use a small restarting value if storage is at a premium.]

To conclude this section, Figure 1 shows a diagram to aid in the selection of an iterative method.

3 Internal details of PIM

3.1 Supported architectures and environments

PIM has been tested on scalar, vector and parallel computers including the Cray Y-MP2E/232, Cray Y-MP C90/16256, SGI Challenge, Intel Paragon, TMC CM-5, and networks of workstations under PVM 3.3.6, CHIMP/MPI v1.2, Argonne MPI, p4 v1.2, TCGMSG 4.02 and Intel NX. Table 1 lists the architectures and environments on which PIM has been successfully tested.
      Matrix-vector product
      Preconditioning
      Inner products, vector norms and global accumulation
  4.6 Monitoring the iterations
      Example programs
      4.6.1 Eigenvalues estimation and Chebyshev acceleration
      4.6.2 Dense storage
      4.6.3 PDE storage
            A matrix-vector product for parallel vector architectures
      4.6.4 Preconditioners
      4.6.5 Results
            Stopping criteria
            General results
            Scalability

5 Summary

References

A Reference manual
  A.1 Description of parameters
  A.2 External routines
      Matrix-vector product v = Au
      Transpose matrix-vector product v = A^T u
      Left preconditioning v = Q1 u
      Right preconditioning v = Q2 u
      Parallel sum
      Parallel vector norm
      Monitoring routine
  A.3 PIM_CG
  A.4 PIM_CGEV
  A.5 PIM_CGNR
  A.6 PIM_CGNE
3.3 Data partitioning

With PIM, the iterative method routines have no knowledge of the way in which the user has chosen to store and access either the coefficient or the preconditioning matrices. We thus restrict ourselves to partitioning the vectors.

The assumption made is that each PE knows the number of elements of each vector stored in it and that all vector variables in a processor have the same number of elements. This is a broad assumption that allows us to accommodate many different data partitioning schemes, including contiguous, cyclic (or wrap-around) and scattered partitionings. We are able to make this assumption because the vector-vector operations used (vector accumulations, assignments and copies) are disjoint element-wise. The other operations used involving matrices and vectors, which may require knowledge of the individual indices of vectors, are the responsibility of the user.

PIM requires that the elements of vectors be stored locally starting from position 1; the user thus has a local numbering of the variables which can be translated to a global numbering if required. For example, if a vector of 8 elements is partitioned in wrap-around fashion among 2 processors, using blocks of length 1, then the first processor stores elements 1, 3, 5 and 7 in the first four positions of its array, and the second processor stores elements 2, 4, 6 and 8 in positions 1 to 4 of its array. We stress that for most of the commonly used partitioning schemes data may be retrieved with very little overhead.
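As an illustration of this local numbering, the helper below (not part of PIM; the name is illustrative) maps a local index to its global counterpart for the wrap-around partitioning with blocks of length 1 just described, given the 0-based processor number MYID and the number of processors NPROCS:

      INTEGER FUNCTION IGLOB(ILOC, MYID, NPROCS)
C     GLOBAL INDEX OF LOCAL ELEMENT ILOC UNDER WRAP-AROUND
C     PARTITIONING WITH BLOCKS OF LENGTH 1: PROCESSOR MYID
C     HOLDS GLOBAL ELEMENTS MYID+1, MYID+1+NPROCS, ...
      INTEGER ILOC, MYID, NPROCS
      IGLOB = MYID + 1 + (ILOC-1)*NPROCS
      RETURN
      END

For the example above, on processor 0 of 2 this returns 1, 3, 5 and 7 for local positions 1 to 4, and on processor 1 it returns 2, 4, 6 and 8.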
Function dependencies
BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes
None.

Algorithm A.7 Bi-CGSTAB

 1. r_0 = Q1(b - A Q2 x_0)
 2. p_0 = v_0 = 0
 3. rho_0 = alpha_0 = omega_0 = 1
    for k = 1, 2, ...
 4.    rho_k = rtilde_0^T r_{k-1}
 5.    beta_k = (rho_k alpha_{k-1}) / (rho_{k-1} omega_{k-1})
 6.    p_k = r_{k-1} + beta_k (p_{k-1} - omega_{k-1} v_{k-1})
 7.    v_k = Q1 A Q2 p_k
 8.    xi_k = rtilde_0^T v_k
 9.    alpha_k = rho_k / xi_k
10.    s_k = r_{k-1} - alpha_k v_k
11.    if ||s_k|| < macheps, a soft breakdown has occurred
12.    t_k = Q1 A Q2 s_k
13.    omega_k = (t_k^T s_k) / (t_k^T t_k)
14.    x_k = x_{k-1} + alpha_k p_k + omega_k s_k
15.    r_k = s_k - omega_k t_k
       check stopping criterion
    endfor

A.10 PIM_RBICGSTAB

Purpose
Solves the system Q1 A Q2 x = Q1 b using the restarted Bi-CGSTAB method.

Synopsis
PIMSRBICGSTAB(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PSSUM, PSNRM, PROGRESS)
PIMDRBICGSTAB(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)
PIMCRBICGSTAB(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PCSUM, PSCNRM, PROGRESS)
PIMZRBICGSTAB(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PZSUM, PDZNRM, PROGRESS)

Storage requirements
Parameter   No. of words
X, B        IPAR(4)
WRK         (6 + 2 IPAR(5)) IPAR(4)
_PAR        6

Possible exit status values returned by IPAR(12): 0

Function dependencies
BLAS: _COPY, _AXPY, _DOT (_DOTC)
LIBPIM

Notes
1. The degree of the MR polynomial (the maximum degree is 10) must be stored in IPAR(5). If the user needs to use a larger degree then the parameter IBDIM, defined in PIM_RBICGSTAB, must be changed accordingly.
  A.7 PIM_BICG
  A.8 PIM_CGS
  A.9 PIM_BICGSTAB
  A.10 PIM_RBICGSTAB
  A.11 PIM_RGMRES
  A.12 PIM_RGMRESEV
  A.13 PIM_RGCR
  A.14 PIM_QMR
  A.15 PIM_TFQMR
  A.16 PIM_CHEBYSHEV
  A.17 PIM_SETPAR
  A.18 PIM_PRTPAR
  A.19 _INIT

1 Introduction

Parallel Iterative Methods (PIM) is a collection of Fortran 77 routines designed to solve systems of linear equations (SLEs) on parallel computers using a variety of iterative methods.

PIM offers a number of iterative methods, including

  Conjugate Gradients (CG) [29],
  Conjugate Gradients for normal equations with minimisation of the residual norm (CGNR) [35],
  Conjugate Gradients for normal equations with minimisation of the error norm (CGNE) [11],
  Bi-Conjugate Gradients (Bi-CG) [22],
  Conjugate Gradients squared (CGS) [44],
  the stabilised version of Bi-Conjugate Gradients (Bi-CGSTAB),
nonsymmetric linear systems. SIAM Journal of Scientific and Statistical Computing, 13:631-644, 1992. Also as Report No. 90-50, Mathematical Institute, University of Utrecht.

A Reference manual

In this section we provide details of each individual subroutine in PIM. Each entry describes the purpose, the name and parameter list, storage requirements, function dependencies and restrictions of the respective routine.

For each iterative method routine we also provide a description of the implemented algorithm. Vectors and scalar values are denoted by lower-case letters and matrices by capital letters. Subscripts indicate either the iteration or a vector column, the latter in the case of a matrix.

Each computational step in the algorithm is numbered; if a routine suffers a breakdown, the step number where the failure occurred is returned in IPAR(13).

Whenever an underscore (_) appears it indicates the type of a variable or function (REAL, DOUBLE PRECISION, COMPLEX, DOUBLE COMPLEX) and should be replaced by S, D, C or Z, whichever is appropriate.

The COMPLEX and DOUBLE COMPLEX PIM routines compute inner products using the BLAS CDOTC and ZDOTC routines respectively.

A.1 Description of parameters

The parameters used in an iterative method routine are:

Parameter  Description
X          A vector of length IPAR(4). On input, contains the initial estimate. On output, contains the last estimate computed.
a variety of parallel environments, without imposing any restriction on the way the coefficient matrices and the preconditioning steps are handled. The user may thus explore characteristics of the problem and of the particular parallel architecture being used.

Table 4: Example with the IDLU(0) preconditioner

Method       k   Time (s)  ||r||              Status
CG           32  0.0900    59.8747            -1
CGEV         32  0.1500    59.8747            -1
Bi-CG        32  0.1000    10.8444            -1
CGS          11  0.0500    1.8723 x 10^-...    0
Bi-CGSTAB    12  0.0400    1.1099 x 10^-...    0
RBi-CGSTAB    7  0.0300    3.7795 x 10^-12     0
RGMRES        3  0.0600    2.8045 x 10^-12     0
RGMRESEV      3  0.4500    2.8045 x 10^-12     0
RGCR          3  0.0800    2.8048 x 10^-12     0
CGNR         32  0.0800    81.5539            -1
CGNE         32  0.0900    37.0570            -1
QMR          32  0.1200    4.0091             -1
TFQMR        11  0.0500    5.3482 x 10^-...    0

Table 5: Example with the Neumann polynomial preconditioner

Method       k   Time (s)  ||r||              Status
CG           32  0.1000    41.9996            -1
CGEV         32  0.1300    41.9996            -1
Bi-CG        32  0.000     13.8688            -1
CGS          14  0.0500    7.4345 x 10^-...    0
Bi-CGSTAB    14  0.0500    5.3863 x 10^-...    0
RBi-CGSTAB    8  0.0500    8.4331 x 10^-...    0
RGMRES        3  0.0600    4.0454 x 10^-...    0
RGMRESEV      3  0.3900    4.0454 x 10^-...    0
RGCR          3  0.0900    4.0455 x 10^-...    0
CGNR         32  0.080     7.4844 x 10^-...   -1
CGNE         32  0.1000    3.6205 x 10^-...   -1
QMR          32  0.1600    2.3351             -1
TFQMR        14  0.1100    7.6035 x 10^-1...   0

[Table 6: Execution time (in seconds) for the test problem (n = 16384) solved by PIMDRGMRES with the Neumann polynomial preconditioner of first degree, for p = 1, 2, 4, 8, 16 and 32 processors of the IBM SP-1, Intel Paragon XP/S, Intel iPSC/860, SGI Challenge, KSR1, Cray T3D, Cray Y-MP2E and Cray C90; timings contributed by W.H. Purvis (Daresbury Laboratory, U.K.), S. Thomas (CERCA, Montréal), A. Pindor (U. of Toronto) and P.T.M. Bulhões (Cray Research Inc.).]
4.6.4 Preconditioners

The examples involving an incomplete LU factorisation as the preconditioner for the PDE case are a modification of the usual ILU(0) method. This modification was made to allow the computation of the preconditioning step without requiring any communication to be performed. To achieve this we note that the matrices arising from the five-point finite-difference discretisation have a block-tridiagonal structure

    A = tridiag(E, B, F),    (19)

where E and F are diagonal matrices and the tridiagonal blocks B hold the central, north and south coefficients alpha, beta and gamma derived from the discretisation (the subscripts are dropped for clarity). Each block B approximates the unknowns in a single vertical line of the grid of Figure 4.

To compute a preconditioner Q = LU, we modify the ILU(0) algorithm in the sense that the blocks E and F are discarded; because only the diagonal blocks are considered, we refer to this factorisation as IDLU(0). The resulting L and U factors are block diagonal: each block of L is unit lower bidiagonal with entries X below the diagonal, and each block of U is upper bidiagonal with entries Y on the diagonal and the north coefficients beta above it,    (20)

where X and Y are the modified coefficients arising from the IDLU(0) algorithm. From the structure of L and U it may be clearly seen that applying the preconditioning step reduces to the solution of small (order l), independent, triangular systems. Each of these systems corresponds to a vertical line in the grid; since the grid was partitioned in vertical panels, these systems can be solved independently in each processor.

The polynomial preconditioners used can be expressed as

    Q1 = sum_{i=0}^{m} gamma_i (I - diag(A)^{-1} A)^i diag(A)^{-1}.    (21)
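A left preconditioning routine applying the polynomial (21) can be sketched as below, by repeated application of I - diag(A)^{-1}A; this is a sequential illustration under stated assumptions, not the routine shipped with PIM. MATVEC is the user's matrix-vector product in the calling convention of this guide; the array DIAGA (holding diag(A)), the coefficients GAMMA(0:M) (the gamma_i of (21), all equal to 1 for the Neumann preconditioner), the COMMON block and the dimension bounds are all illustrative:

      SUBROUTINE POLYPC(U, V, IPAR)
C     V = SUM_{I=0..M} GAMMA(I) * (I - D^{-1}A)^I * D^{-1} * U,
C     WITH D = DIAG(A); A SEQUENTIAL SKETCH OF (21)
      INTEGER MAXM, MAXN
      PARAMETER (MAXM=10, MAXN=10000)
      DOUBLE PRECISION U(*), V(*)
      INTEGER IPAR(*)
      DOUBLE PRECISION DIAGA(MAXN), GAMMA(0:MAXM), W(MAXN), T(MAXN)
      INTEGER M
      COMMON /POLY/ DIAGA, GAMMA, M
      INTEGER I, K, N
      EXTERNAL MATVEC
      N = IPAR(4)
C     W = D^{-1} U;  V = GAMMA(0)*W
      DO 10 K = 1, N
         W(K) = U(K)/DIAGA(K)
         V(K) = GAMMA(0)*W(K)
   10 CONTINUE
C     REPEATEDLY APPLY (I - D^{-1}A) AND ACCUMULATE
      DO 30 I = 1, M
         CALL MATVEC(W, T, IPAR)
         DO 20 K = 1, N
            W(K) = W(K) - T(K)/DIAGA(K)
            V(K) = V(K) + GAMMA(I)*W(K)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END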
Indeed, the performance of a PIM routine is dependent on the user-supplied routines for the matrix-vector products, inner products and vector norms, and on the computation of the preconditioning steps.

PIM is an ongoing project and we intend to improve it and include other iterative methods. We encourage readers to send their comments and suggestions; the authors may be contacted via e-mail at either rudner@mat.ufrgs.br or trh@ukc.ac.uk.

Acknowledgements

We would like to thank the National Supercomputing Centre (CESUP), Brazil, the National Laboratory for Scientific Computing (LNCC/CNPq), Brazil, the Parallel Laboratory, University of Bergen, Norway, the Army High Performance Computing Research Center, Minnesota, USA, and Digital Equipment Corporation (via the Internet Alpha Program), who kindly made their facilities available for our tests.

We also thank our collaborators Matthias G. Imhof (MIT), Paulo Tibério M. Bulhões (Cray Research), Steve Thomas (CERCA, Montréal), Andrzej Pindor (University of Toronto), William H. Purvis (Daresbury Laboratory, U.K.) and Ramiro B. Willmersdorf (LNCC, Brazil) for their help on testing PIM.
More details are available in the works of Ashby et al. [4], Saad [40, 41], Nachtigal et al. [37], Freund et al. [25, 26] and Barrett et al. [6].

We introduce the following notation. CG, Bi-CG, CGS, Bi-CGSTAB, restarted GMRES, restarted GCR and TFQMR solve a non-singular system of n linear equations of the form

    Q1 A Q2 x = Q1 b,    (1)

where Q1 and Q2 are the preconditioning matrices. For CGNR, the system solved is

    Q1 A^T A Q2 x = Q1 A^T b,    (2)

and for CGNE we solve the system

    Q1 A A^T Q2 x = Q1 b.    (3)

CG  The CG method is used mainly to solve Hermitian positive-definite (HPD) systems. The method minimises the error in the A-norm, and in finite-precision arithmetic it terminates in at most n iterations. The method does not require the coefficient matrix itself; only the result of a matrix-vector product Au is needed. It also requires a relatively small number of vectors to be stored per iteration, since its iterates can be expressed by short, three-term vector recurrences.

With suitable preconditioners, CG can be used to solve nonsymmetric systems. Holter et al. [36] have solved a number of problems arising from the modelling of groundwater flow via finite-difference discretisations of the two-dimensional diffusion equation. The properties of the model led to systems where the coefficient matrix was very ill-conditioned; incomplete factorisations and least-squares polynomial preconditioners were used to solve these systems. Hyperbolic
the three linear algebra operations mentioned above.

A package similar to PIM is the Simplified Linear Equation Solvers (SLES) package by Gropp and Smith [31], part of the PETSc project. In SLES the user has a number of iterative methods (CG, CGS, Bi-CGSTAB, two variants of the transpose-free QMR, restarted GMRES, Chebyshev and Richardson), which can be used together with built-in preconditioners and can be executed either sequentially or in parallel. The package may be used with any data representation of the matrix and vectors, with some routines being provided to create matrices dynamically in its internal format (a feature found in ITPACK). The user can also extend SLES, in the sense that new routines for preconditioners and iterative methods can be provided without modifying SLES. It is also possible to debug and monitor the performance of a SLES routine.

Portability of code across different multiprocessor platforms is a very important issue. For distributed-memory multiprocessor computers, a number of public-domain software libraries have appeared, including PVM [28], TCGMSG [33], NXLIB [45] and p4 [9], the latter with support for shared-memory programming. These libraries are available on a number of architectures, making it possible to port applications between different parallel computers with few, if any, modifications to the code being necessary. In 1993 the Message Passing Interface Forum, a consortium of academia and vendors, drawing on the experiences
This work was supported in part by the Army High Performance Computing Research Center, under the auspices of Army Research Office contract number DAAL03-89-C-0038 with the University of Minnesota.

References

[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, 1992.

[2] S.F. Ashby. Minimax polynomial preconditioning for Hermitian linear systems. Report UCRL-JC-103201, Numerical Mathematics Group, Computing & Mathematics Research Division, Lawrence Livermore National Laboratory, March 1990.

[3] S.F. Ashby, T.A. Manteuffel, and J.S. Otto. A comparison of adaptive Chebyshev and least squares polynomial preconditioning for Hermitian positive definite linear systems. Report UCRL-JC-106726, Numerical Mathematics Group, Computing & Mathematics Research Division, Lawrence Livermore National Laboratory, March 1991.

[4] S.F. Ashby, T.A. Manteuffel, and P.E. Saylor. A taxonomy for Conjugate Gradient methods. SIAM Journal of Numerical Analysis, 27:1542-1568, 1990.

[5] S.F. Ashby and M.K. Seager. A proposed standard for iterative linear solvers (version 1.0). Report UCRL-102860, Numerical Mathematics Group, Computing & Mathematics Research Division, Lawrence Livermore National Laboratory, January 1990.

[6] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra,
Numerically, however, their iterations will usually differ. The method does not require the computation of transpose matrix-vector products, as Bi-CG does, and a smaller number of vectors needs to be stored per iteration than for other restarted methods like GMRES.

GMRES  The GMRES method is a very robust method for solving nonsymmetric systems. The method uses the Arnoldi process to compute an orthonormal basis {v_1, v_2, ..., v_k} of the Krylov subspace K_k(A, v_1). The solution of the system is taken as x_k = x_0 + V_k y_k, where V_k is the matrix whose columns are the orthonormal vectors v_i, and y_k is the solution of the least-squares problem min_y || ||r_0|| e_1 - H_k y ||, where the upper Hessenberg matrix H_k is generated during the Arnoldi process and e_1 = (1, 0, ..., 0)^T. This least-squares problem can be solved using a QR factorisation of H_k.

A problem that arises in connection with GMRES is that the number of vectors of order n that need to be stored grows linearly with k, and the number of multiplications grows quadratically. This may be avoided by using a restarted version of GMRES, which is the method implemented in PIM. Instead of generating an orthonormal basis of dimension k, one chooses a value c, c << n, and generates an approximation to the solution using an orthonormal basis of dimension c, thereby reducing considerably the amount of storage required.

(1) The PIM implementation of Bi-CG, CGS and Bi-CGSTAB sets rtilde_0 = r_0, but the user may modify the code if another choice of rtilde_0 is desirable.
[22] R. Fletcher. Conjugate Gradient Methods for Indefinite Systems, volume 506 of Lecture Notes in Mathematics, pages 73-89. Springer-Verlag, Heidelberg, 1976.

[23] Message Passing Interface Forum. MPI: A message-passing interface standard. TR CS-93-214, University of Tennessee, November 1993.

[24] R.W. Freund. A transpose-free quasi-minimal residual algorithm for non-Hermitian linear systems. Submitted to SIAM Journal of Scientific and Statistical Computing.

[25] R.W. Freund, G.H. Golub, and N.M. Nachtigal. Iterative solution of linear systems. Acta Numerica, 1:57-100, 1991.

[26] R.W. Freund, G.H. Golub, and N.M. Nachtigal. Recent advances in Lanczos-based iterative methods for nonsymmetric linear systems. RIACS Technical Report 92.02, Research Institute for Advanced Computer Science, NASA Ames Research Center, 1992. To appear in Algorithmic Trends for Computational Fluid Dynamics in the 90's.

[27] R.W. Freund and M. Hochbruck. A Biconjugate Gradient-type algorithm for the iterative solution of non-Hermitian linear systems on massively parallel architectures. RIACS Technical Report 92.08, Research Institute for Advanced Computer Science, NASA Ames Research Center, 1992. To appear in Computational and Applied Mathematics I: Algorithms and Theory.

[28] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V.S. Sunderam.
The CGS and TFQMR implementations available in PIM do not benefit from this approach.

3.5 Stopping criteria

PIM offers a number of stopping criteria which may be selected by the user. In Table 2 we list the different criteria used; r_k = b - A x_k is the true residual of the current estimate x_k, z_k is the pseudo-residual (usually generated by linear recurrences and possibly involving the preconditioners), and epsilon is the user-supplied tolerance. Note that the norms are not indicated; these depend on the user-supplied routine to compute a vector norm.

Table 2: Stopping criteria available in PIM

No.  Stopping criterion
1    ||r_k|| < epsilon
2    ||r_k|| < epsilon ||b||
3    sqrt(r_k^T z_k) < epsilon ||b||
4    ||z_k|| < epsilon
5    ||z_k|| < epsilon ||b||
6    ||z_k|| < epsilon ||Q1 b||
7    ||x_k - x_{k-1}|| < epsilon

If speed of execution is of foremost importance, the user needs to select the stopping criterion that imposes the minimum overhead. The following notes may be of use in the selection of an appropriate stopping criterion:

1. If the stopping criterion selected is one of 1, 2 or 3, then the true residual is computed (except when using TFQMR with either no preconditioning or left preconditioning).

2. The restarted GMRES method uses its own stopping criterion (see [42, page 862]), which is equivalent to the 2-norm of the residual (or pseudo-residual if preconditioning is used).

3. If either no preconditioning or right preconditioning is used and criterion 6 is selected, the PIM iterative method called will flag the error and exit without solving the system (except for the restarted GMRES routine).
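A criterion is selected by setting the corresponding entry of the IPAR array before calling the iterative routine. The fragment below assumes, consistently with note 1 above and with the note on PROGRESS in section A.2, that IPAR(9) holds the stopping-criterion number, and that the user-supplied tolerance is the first entry of the floating-point parameter array; both placements are assumptions for illustration:

C     SELECT STOPPING CRITERION 5, ||Z|| < EPSILON*||B||, WHICH USES
C     ONLY THE PSEUDO-RESIDUAL AND SO AVOIDS THE EXTRA TRUE-RESIDUAL
C     COMPUTATION INCURRED BY CRITERIA 1-3
      IPAR(9) = 5
C     USER-SUPPLIED TOLERANCE EPSILON
      DPAR(1) = 1.0D-8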
(2) The results obtained are based upon a beta version of the software and, consequently, are not necessarily representative of the performance of the full version of this software.

Table 1: Computers where PIM has been tested

Architecture           Compiler and O.S.
Sun SPARC              Sun Fortran 1.4, SunOS 4.1.3
Sun SPARC              Sun Fortran 2.0.1, SunOS 5.2
Sun SPARC              EPC Fortran 77, SunOS 4.1.3
Sun SPARC              EPC Fortran 90, SunOS 4.1.3
Sun SPARC              NAG Fortran 90, SunOS 4.1.3
DEC AXP 4000/610       DEC Fortran 3.3.1, DEC OSF/1 1.3
DEC AXP 3000/800       DEC Fortran 3.4.480, DEC OSF/1 2.0
SGI IRIS Indigo        MIPS Fortran 4.0.5, SGI IRIX 4.0.5F
SGI IRIS Crimson       MIPS Fortran 4.0.5, SGI IRIX 4.0.5C
SGI Indy               MIPS Fortran 5.0, SGI IRIX 5.1.1
Cray Y-MP2E/232        Cray Fortran 6.0, UNICOS 7.0.5.2
Cray Y-MP C90/16256    Cray Fortran 7.0, UNICOS 8.2.3
SGI Challenge          MIPS Fortran 5.2, SGI IRIX 5.2
Intel Paragon XP/S     Portland if77 4.5, OSF/1 1.2.6
IBM 9076 SP/1          IBM XL Fortran/6000 2.3, AIX 3.2
Cray T3D               Cray Fortran 8.0, UNICOS 8.3.3
TMC CM-5               CM Fortran 77

3.2 Parallel programming model

PIM uses the Single Program, Multiple Data (SPMD) programming model. The main implication of using this model is that certain scalar values are needed in each processing element (PE). Two of the user-supplied routines, to compute a global sum and a vector norm, must provide for this, preferably making use of a reduction and/or broadcast routine like those present in PVM 3.3.6 and MPI.
Parameters: U (input), V (output), IPAR (input).

Transpose matrix-vector product v = A^T u
SUBROUTINE TMATVEC(U, V, IPAR)
_precision U(*), V(*)
INTEGER IPAR(*)
Parameters: U (input), V (output), IPAR (input).

Right preconditioning v = Q2 u
SUBROUTINE PRECONR(U, V, IPAR)
_precision U(*), V(*)
INTEGER IPAR(*)
Parameters: U (input), V (output), IPAR (input).

Parallel sum
SUBROUTINE P_SUM(ISIZE, X, IPAR)
INTEGER ISIZE
_precision X(*)
INTEGER IPAR(*)
Parameters: ISIZE (input), X (input/output), IPAR (input).

Parallel vector norm
_precision FUNCTION P_NRM(LOCLEN, U, IPAR)
INTEGER LOCLEN
_precision U(*)
INTEGER IPAR(*)
Parameters: LOCLEN (input), U (input), IPAR (input).

Monitoring routine
SUBROUTINE PROGRESS(LOCLEN, ITNO, NORM, X, RES, TRUERES)
INTEGER LOCLEN, ITNO
_precision NORM, X(*), RES(*), TRUERES(*)
Parameters: LOCLEN (input), ITNO (input), NORM (input), X (input), RES (input), TRUERES (input).

Notes

1. Replace _precision by REAL, DOUBLE PRECISION, COMPLEX or DOUBLE COMPLEX, as appropriate.

2. In the monitoring routine PROGRESS above, the array TRUERES contains the value of the true residual r_k = b - A x_k only if IPAR(9) is 1, 2 or 3.

A.3 PIM_CG

Purpose
Solves the system Q1 A Q2 x = Q1 b using the CG method.

Synopsis
PIMSCG(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PSSUM, PSNRM, PROGRESS)
PIMDCG(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)
The calling sequence for the (double-precision) CG method is

      SUBROUTINE PIMDCG(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL,
     +                  PRECONR, PDSUM, PDNRM, PROGRESS)

and for Bi-CG (as well as for CGNR, CGNE and QMR)

      SUBROUTINE PIMDBICG(X, B, WRK, IPAR, DPAR, MATVEC, TMATVEC,
     +                    PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)

where the parameters are as follows:

Parameter  Description
X          A vector of length IPAR(4). On input, contains the initial estimate. On output, contains the last estimate computed.
B          The right-hand-side vector of length IPAR(4).
WRK        A work vector used internally (see the description of each routine for its length).
IPAR       An integer array containing input/output parameters.
_PAR       A floating-point array containing input/output parameters.
MATVEC     Matrix-vector product external subroutine.
TMATVEC    Transpose matrix-vector product external subroutine.
PRECONL    Left preconditioning external subroutine.
PRECONR    Right preconditioning external subroutine.
P_SUM      Global sum (reduction) external function.
P_NRM      Vector norm external function.
PROGRESS   Monitoring routine.

Note in the example above that, contrary to the proposal in [5], PIM uses separate routines to compute the matrix-vector and transpose matrix-vector products. See the reference manual, sections A.1 and A.2, for the description of the parameters above and the synopsis of the external routines.

4.5 External routines

As stated earlier, the user is responsible for supplying certain routines to be used internally by the iterative method routines. One of
3.4 Increasing the parallel scalability of iterative methods

One of the main causes of the poor scalability of implementations of iterative methods on distributed-memory computers is the need to compute inner products,

    alpha = v^T u = sum_{i=1}^{n} u_i v_i,

where u and v are vectors distributed across p processors (without loss of generality, assume that each processor holds n/p elements of each vector). This computation can be divided into three parts:

1. The local computation of partial sums of the form beta_j = sum u_i v_i on each processor.

2. The reduction of the beta_j values, where these values travel across the processors in some efficient way (for instance, as if traversing a binary tree up to its root) and are summed during the process. At the end, the value of alpha = sum beta_j is stored in a single processor.

3. The broadcast of alpha to all processors.

During parts 2 and 3 a number of processors are idle for some time. A possible strategy to reduce this idle time, and thus increase the scalability of the implementation, is to re-arrange the operations in the algorithm so that parts 2 and 3 accumulate a number of partial sums corresponding to several inner products. Some of the algorithms available in PIM, including CG, CGEV, Bi-CG, CGNR and CGNE, have been rewritten using the approach suggested by D'Azevedo and Romine [15]. Others, like Bi-CGSTAB, RBi-CGSTAB, RGCR, RGMRES and QMR, have not been re-arranged, but some or all of their inner products can be computed with a single global sum operation.
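In message-passing terms the three parts are a local _DOT, a reduction and a broadcast. A minimal sketch for one double-precision inner product using MPI (the routine name is illustrative; only the BLAS routine DDOT and the standard MPI calls are assumed):

      DOUBLE PRECISION FUNCTION GDOT(N, U, V)
C     GLOBAL INNER PRODUCT ALPHA = V^T U OVER ALL PROCESSORS:
C     1) LOCAL PARTIAL SUM, 2) REDUCTION TO RANK 0, 3) BROADCAST
      INCLUDE 'mpif.h'
      INTEGER N
      DOUBLE PRECISION U(*), V(*)
      DOUBLE PRECISION BETA, ALPHA, DDOT
      INTEGER IERR
      EXTERNAL DDOT
      BETA = DDOT(N, U, 1, V, 1)
      CALL MPI_REDUCE(BETA, ALPHA, 1, MPI_DOUBLE_PRECISION,
     +                MPI_SUM, 0, MPI_COMM_WORLD, IERR)
      CALL MPI_BCAST(ALPHA, 1, MPI_DOUBLE_PRECISION, 0,
     +               MPI_COMM_WORLD, IERR)
      GDOT = ALPHA
      RETURN
      END

The reduction and broadcast can, of course, be fused into a single MPI_ALLREDUCE, as done in the PDSUM example of section 4.5.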
in electromagnetics and quantum chromodynamics applications, respectively.

QMR  The QMR method of Freund and Nachtigal [27] overcomes the difficulties associated with the Bi-CG method. The original QMR algorithm uses the three-term recurrences found in the underlying Lanczos process. In [8], Bücker and Sauren propose a highly parallel algorithm for the QMR method, with a single global synchronisation point, which is implemented in this version of PIM.

TFQMR  TFQMR is a variant of CGS proposed by Freund [24]. TFQMR uses all available search-direction vectors instead of the two search vectors used in CGS. Moreover, these vectors are combined using a parameter which can be obtained via a quasi-minimisation of the residual. The method is thus extremely robust and has the advantage of not requiring the computation of transpose matrix-vector products. PIM offers TFQMR with 2-norm weights (see [24, Algorithm 5.1]).

Chebyshev acceleration  The Chebyshev acceleration is a polynomial acceleration applied to basic stationary methods of the form

    x_{k+1} = G x_k + f,

where G = I - Q1 A and f = Q1 b. If we consider k iterations of the above method, the iterates x_j may be linearly combined such that y_k = sum_{j=0}^{k} c_j x_j is a better approximation to x = A^{-1} b. The coefficients c_j are chosen so that the norm of the error vector is minimised, subject to sum_{j=0}^{k} c_j = 1. If we assume that the eigenvalues of G are contained in an interval [a, b] with -1 < a < b < 1, then the Chebyshev polynomials satisfy the above conditions on the c_j.
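In the calling convention of this guide (MATVEC applying A and PRECONL applying Q1), one step of the basic iteration can be sketched as below; the routine name is illustrative and T and Z are work vectors of length IPAR(4):

      SUBROUTINE STATIT(X, B, T, Z, IPAR)
C     ONE STEP OF THE BASIC ITERATION X <- G*X + F WITH
C     G = I - Q1*A AND F = Q1*B, COMPUTED AS X <- X + Q1*(B - A*X)
      DOUBLE PRECISION X(*), B(*), T(*), Z(*)
      INTEGER IPAR(*)
      INTEGER I
      EXTERNAL MATVEC, PRECONL
      CALL MATVEC(X, T, IPAR)
      DO 10 I = 1, IPAR(4)
         T(I) = B(I) - T(I)
   10 CONTINUE
      CALL PRECONL(T, Z, IPAR)
      DO 20 I = 1, IPAR(4)
         X(I) = X(I) + Z(I)
   20 CONTINUE
      RETURN
      END

Note that x + Q1(b - Ax) = (I - Q1 A)x + Q1 b, which is exactly G x + f.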
4 Using PIM

4.1 Naming convention of routines

The PIM routines have names of the form

    PIM_method

where _ indicates single precision (S), double precision (D), complex (C) or double-precision complex (Z), and method is one of CG, CGEV (CG with eigenvalue estimation), CGNR, CGNE, BICG, CGS, BICGSTAB, RBICGSTAB, RGMRES, RGMRESEV (RGMRES with eigenvalue estimation), RGCR, QMR, TFQMR and CHEBYSHEV.

4.2 Obtaining PIM

PIM 2.2 is available via anonymous ftp from unix.hensa.ac.uk (file /pub/misc/netlib/pim/pim22.tar.Z) and from ftp.mat.ufrgs.br (file /pub/pim/pim22.tar.gz). There is also a PIM World Wide Web home page which can be accessed at

    http://www.mat.ufrgs.br/pim-e.html

which gives a brief description of the package and allows the reader to download the software and related documentation.

The current distribution contains:

- The PIM routines, in the directories single, double, complex and dcomplex.
- A set of example programs for sequential and parallel execution (using PVM and MPI), in the directories examples/sequential, examples/pvm and examples/mpi.
- This guide, in PostScript format, in the doc directory.

4.3 Installing PIM

To install PIM, unpack the distributed compressed (or gzipped) tar file:

    uncompress pim22.tar.Z   or   gunzip pim22.tar.gz
The computation of the last two parts depends on the actual message-passing library being used. With MPI, parts 2 and 3 are offered as a single operation called MPI_ALLREDUCE. Applications using the PVM 3.3.6 Fortran interface should instead call PVMFREDUCE and then PVMFBCAST.

An important point to make is that we have chosen modifications to the iterative methods that reduce the number of synchronisation points while at the same time maintaining their convergence properties and numerical qualities. This is the case for the D'Azevedo and Romine modification; also, in the specific case of GMRES, which uses the Arnoldi process (a suitable reworking of the modified Gram-Schmidt procedure) to compute a vector basis, the computation of several inner products with a single global sum does not compromise numerical stability.

For instance, in the algorithm for the restarted GMRES (see Algorithm A.9), step 5 involves the computation of j inner products of the form V_i^T v, i = 1, 2, ..., j. It is thus possible to arrange for each processor to compute j partial sums using the BLAS _DOT routine and store these in an array. Then, in a single call to a reduction routine, these arrays are communicated among the processors and their individual elements are summed. On the completion of the global sum, the array containing the respective j inner products is stored in a single processor and is then broadcast to the remaining processors.
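A sketch of this arrangement (routine and array names illustrative): each processor accumulates its j local partial sums with DDOT, and a single MPI_ALLREDUCE delivers all j inner products at once:

      SUBROUTINE GDOTS(LOCLEN, J, V, LDV, W, R)
C     R(1:J) <- GLOBAL INNER PRODUCTS V(:,I)^T W, I = 1..J,
C     USING LOCAL DDOTS AND A SINGLE REDUCTION
      INCLUDE 'mpif.h'
      INTEGER LOCLEN, J, LDV
      DOUBLE PRECISION V(LDV,*), W(*), R(*)
C     50 MATCHES THE MAXIMUM BASIS SIZE OF PIM_RGMRES
      DOUBLE PRECISION S(50)
      INTEGER I, IERR
      DOUBLE PRECISION DDOT
      EXTERNAL DDOT
      DO 10 I = 1, J
         S(I) = DDOT(LOCLEN, V(1,I), 1, W, 1)
   10 CONTINUE
      CALL MPI_ALLREDUCE(S, R, J, MPI_DOUBLE_PRECISION, MPI_SUM,
     +                   MPI_COMM_WORLD, IERR)
      RETURN
      END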
of users of those and other libraries, defined a standard interface for message-passing operations, called MPI [23]. Today we have available implementations of MPI built on top of other, existing libraries, like the CHIMP/MPI library developed at the Edinburgh Parallel Computing Centre [7], and the Unify project [10], which provides an MPI interface on top of PVM. It is expected that native implementations will be available soon. In the previous releases of PIM (1.0 and 1.1) we had distributed examples using PVM, TCGMSG, p4 and NXLIB; from this release onwards we will support only PVM, the "de facto" standard for message passing, and MPI.

We would like to mention two projects which we believe can be used together with PIM. The first is the proposed standard for a user-level sparse BLAS by Duff et al. [19] and Heroux [34]. This standard addresses the common problem of accessing and storing sparse matrices in the context of the BLAS routines; such routines could then be called by the user in conjunction with a PIM routine. The second is the BLACS project by Dongarra et al. [17], which provides routines to perform distributed operations over matrices using PVM 3.1.

2 An overview of the iterative methods

How to choose an iterative method from the many available is still an open question, since any one of these methods may solve a particular system in very few iterations while diverging on another. In this section we provide a brief overview of the iterative methods present in PIM.
Function dependencies
BLAS: _COPY, _AXPY, _DOT (_DOTC), _SCAL, _TRSV
LIBPIM

Notes
1. The size of the orthonormal basis (maximum of 50 vectors) must be stored in IPAR(5). If the user needs to use a larger basis then the parameter IBDIM, defined in PIM_RGMRES, must be changed accordingly.
2. The user must supply a routine to compute the 2-norm of a vector.

Algorithm A.9 RGMRES

 1. r_0 = Q1(b - A Q2 x_0)
 2. beta = ||r_0||_2
    for k = 1, 2, ...
 3.    g = (beta, 0, ..., 0)^T
 4.    V_1 = r_{k-1} / beta
       for j = 1, 2, ..., restart
 5.       R(i,j) = V_i^T Q1 A Q2 V_j, i = 1, 2, ..., j
 6.       vhat = Q1 A Q2 V_j - sum_{i=1}^{j} R(i,j) V_i
 7.       R(j+1,j) = ||vhat||_2
 8.       V_{j+1} = vhat / R(j+1,j)
 9.       apply previous Givens rotations to R(:,j)
10.       compute Givens rotation to zero R(j+1,j)
11.       apply Givens rotation to g
12.       if |g_{j+1}| < RHSSTOP then perform steps 13 and 14 with restart = j, and stop
       endfor
13.    solve R y = g (solution of the least-squares problem)
14.    x_k = x_{k-1} + V y (form the approximate solution)
15.    r_k = Q1(b - A Q2 x_k)
16.    beta = ||r_k||_2
    endfor

A.12 PIM_RGMRESEV

Purpose
Solves the system Q1 A Q2 x = Q1 b using the restarted GMRES method; returns, at each iteration, the estimates of the smallest and largest eigenvalues of Q1 A Q2, obtained from the upper Hessenberg matrix produced during the Arnoldi process.

Synopsis
PIMSRGMRESEV(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PSSUM, PSNRM, PROGRESS)
PIMDRGMRESEV(X, B, WRK, IPAR, DPAR, MATVEC, PRECONL, PRECONR, PDSUM, PDNRM, PROGRESS)
PIMCRGMRESEV(X, B, WRK, IPAR, SPAR, MATVEC, PRECONL, PRECONR, PCSUM, PSCNRM, PROGRESS)
The partial-sum vector is then sent across the network to be summed, in a recursive-doubling fashion, until the accumulated vectors (carrying the contributions of the remaining processors) arrive at the target processor. These accumulated vectors are then summed together with the partial-sum vector computed locally in the target processor, yielding the elements of the vector resulting from the matrix-vector product. This process is repeated for all processors. This algorithm is described in [13].

To compute the dense transpose matrix-vector product v = A^T u, each processor broadcasts to the other processors a copy of its own part of u. The resulting part of the v vector is then computed by each processor.

4.6.3 PDE storage

For the PDE storage format, a square region is subdivided into l + 1 rows and columns, giving a grid containing l*l internal points, each point being numbered as (i, j) = (j - 1)l + i, i, j = 1, 2, ..., l (see Figure 4). At each point we assign 5 different values corresponding to the centre, north, south, east and west points on the stencil (alpha_{i,j}, beta_{i,j}, gamma_{i,j}, delta_{i,j}, epsilon_{i,j} respectively), which are derived from the PDE and the boundary conditions of the problem. Each grid point represents a variable, the solution being obtained by solving a linear system of order n = l*l.

A matrix-vector product v = Au is obtained by computing

    v_{i,j} = alpha_{i,j} u_{i,j} + beta_{i,j} u_{i+1,j} + gamma_{i,j} u_{i-1,j}
              + delta_{i,j} u_{i,j+1} + epsilon_{i,j} u_{i,j-1},    (7)

where some of the alpha, beta, gamma, delta and epsilon may be zero according to the position of the point relative to the grid.
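A sequential sketch of (7) follows, under the direction conventions assumed in the reconstruction above (north/south neighbours along i, within a vertical line; east/west along j) and with zero coefficients taken care of by the boundary tests; the routine and array names are illustrative:

      SUBROUTINE PDEMV(L, A, B, G, D, E, U, V)
C     V = A*U FOR THE FIVE-POINT STENCIL (7) ON AN L-BY-L GRID;
C     A,B,G,D,E HOLD THE CENTRE, NORTH, SOUTH, EAST AND WEST
C     COEFFICIENTS OF EACH GRID POINT
      INTEGER L
      DOUBLE PRECISION A(L,L), B(L,L), G(L,L), D(L,L), E(L,L)
      DOUBLE PRECISION U(L,L), V(L,L)
      INTEGER I, J
      DO 20 J = 1, L
         DO 10 I = 1, L
            V(I,J) = A(I,J)*U(I,J)
            IF (I.LT.L) V(I,J) = V(I,J) + B(I,J)*U(I+1,J)
            IF (I.GT.1) V(I,J) = V(I,J) + G(I,J)*U(I-1,J)
            IF (J.LT.L) V(I,J) = V(I,J) + D(I,J)*U(I,J+1)
            IF (J.GT.1) V(I,J) = V(I,J) + E(I,J)*U(I,J-1)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END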
Moreover, the method may break down; for example, the iterations cannot proceed when some quantities, dependent on rtilde_0, become zero.

CGS  CGS is a method that tries to overcome the problems of Bi-CG. By rewriting some of the expressions used in Bi-CG it is possible to eliminate the need for A^T altogether. Sonneveld [44] also noted that it is possible to, theoretically, increase the rate of convergence of Bi-CG at no extra work per iteration. However, if Bi-CG diverges for some system, CGS diverges even faster; it is also possible that CGS diverges while Bi-CG does not for some systems.

Bi-CGSTAB  Bi-CGSTAB is a variant of Bi-CG with a similar formulation to CGS. However, steepest-descent steps are performed at each iteration and these contribute to a considerably smoother convergence behaviour than that obtained with Bi-CG and CGS. It is known that for some systems Bi-CGSTAB may present an erratic convergence behaviour, as do Bi-CG and CGS.

RBi-CGSTAB  The restarted Bi-CGSTAB, proposed by Sleijpen and Fokkema [43], tries to overcome the stagnation of the iterations of Bi-CGSTAB which occurs with a large class of systems of linear equations. The method combines the restarted GMRES method and Bi-CG, being composed of two specific sections: a Bi-CG part, where l + 1 u and r vectors are produced (l being usually 2 or 4), followed by a minimal-residual step, where the residuals are minimised. RBi-CGSTAB is mathematically equivalent to Bi-CGSTAB if l = 1.
problem, it is important to be able to obtain feedback from the PIM routines as to how an iterative method is progressing.

To this end, we have included in the parameter list of each iterative method routine an external subroutine, called PROGRESS, which receives from that routine the number of vector elements stored locally (LOCLEN), the iteration number (ITNO), the norm of the residual (NORMRES, according to the norm being used), the current iteration vector (X), the residual vector (RES) and the true residual vector r_k = b - A x_k (TRUERES). This last vector contains meaningful values only if IPAR(9) is 1, 2 or 3.

The parameter list of the monitoring routine is fixed, as shown in section A.2. The example below shows a possible use of the monitoring routine, for the DOUBLE PRECISION data type:

      SUBROUTINE PROGRESS(LOCLEN, ITNO, NORMRES, X, RES, TRUERES)
      INTEGER LOCLEN, ITNO
      DOUBLE PRECISION NORMRES
      DOUBLE PRECISION X(*), RES(*), TRUERES(*)
      EXTERNAL PRINTV
      WRITE (6,FMT=9000) ITNO, NORMRES
      WRITE (6,FMT=9010) 'X:'
      CALL PRINTV(LOCLEN, X)
      WRITE (6,FMT=9010) 'RESIDUAL:'
      CALL PRINTV(LOCLEN, RES)
      WRITE (6,FMT=9010) 'TRUE RESIDUAL:'
      CALL PRINTV(LOCLEN, TRUERES)
      RETURN
 9000 FORMAT (I5,1X,D16.10)
 9010 FORMAT (A)
      END

      SUBROUTINE PRINTV(N, U)
      INTEGER N
      DOUBLE PRECISION U(*)
      INTEGER I
      DO 10 I = 1, N
         WRITE (6,FMT=9000) U(I)
   10 CONTINUE
      RETURN
 9000 FORMAT (4(D14.8,1X))
      END
PVM 3 user's guide and reference manual. Research Report ORNL/TM-12187, Oak Ridge National Laboratory, May 1993.

[29] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 2nd edition, 1989.

[30] R.G. Grimes, D.R. Kincaid, and D.M. Young. ITPACK 2.0 user's guide. Report No. CNA-150, Center for Numerical Analysis, University of Texas at Austin, August 1979.

[31] W. Gropp and B. Smith. Simplified linear equation solvers users manual. ANL-93/8, Argonne National Laboratory, February 1993.

[32] L.A. Hageman and D.M. Young. Applied Iterative Methods. Academic Press, New York, 1981.

[33] R.J. Harrison. TCGMSG Send/receive subroutines, version 4.02: User's manual. Battelle Pacific Northwest Laboratory, January 1993.

[34] M.A. Heroux. A proposal for a sparse BLAS toolkit (SPARKER working note #2). Report TR/PA/92/90, CERFACS, October 1992.

[35] M.R. Hestenes and E.L. Stiefel. Method of Conjugate Gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49:409-436, 1952.

[36] W.H. Holter, I.M. Navon, and T.C. Oppe. Parallelizable preconditioned Conjugate Gradient methods for the Cray Y-MP and the TMC CM-2. Technical report, Supercomputer Computations Research Institute, Florida State University, December 1991.

[37] N.M. Nachtigal, S.C. Reddy, and L.N. Trefethen. How fast are nonsymmetric matrix iterations? SIAM Journal on Matrix Analysis and Applications, 13:778-795, 1992.
sequential execution only.

The parallel programs for the dense and PDE storage formats have different partitioning strategies, and the matrix-vector products have been designed to take advantage of these.

The systems solved have been set up such that the solution is the vector x = (1, 1, ..., 1)^T, in order to help in checking the results. For the dense storage format, the real system has a tridiagonal coefficient matrix of order n = 500, and the complex system, of order n = 100, has the form A = mu I + S, where S is tridiagonal and mu = 4 + 4i.

The problem using the Harwell-Boeing format is NOS4 from the LANPRO collection of problems in structural engineering [18, pp. 54-55]. Problem NOS4 has order n = 100 and is derived from a finite-element approximation of a beam structure. For the PDE storage format, the system being solved is derived from the five-point finite-difference discretisation of the convection-diffusion equation

    -epsilon (d2u/dx2 + d2u/dy2) + cos(a) du/dx + sin(a) du/dy = 0    (6)

on the unit square, with epsilon = 0.1, a = -pi/6 and u prescribed on the boundary of the region; the first-order terms were discretised using forward differences. This problem was taken from [44].

A different set of systems is used for the HYBRID examples with the dense storage format. The real system has a nonsymmetric tridiagonal coefficient matrix of order n = 500, and the complex system of order n = 100 has a tridiagonal coefficient matrix A.
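In the sequential case, the right-hand side of such a system can be generated directly from the exact solution using the user's MATVEC routine; a sketch (the routine name and the work array XEX are illustrative):

      SUBROUTINE SETRHS(XEX, B, IPAR)
C     BUILD B SO THAT THE EXACT SOLUTION IS THE VECTOR OF ONES:
C     SET XEX = (1,...,1)^T AND COMPUTE B = A*XEX VIA MATVEC
      DOUBLE PRECISION XEX(*), B(*)
      INTEGER IPAR(*)
      INTEGER I
      EXTERNAL MATVEC
      DO 10 I = 1, IPAR(4)
         XEX(I) = 1.0D0
   10 CONTINUE
      CALL MATVEC(XEX, B, IPAR)
      RETURN
      END

The computed solution returned in X can then be compared against XEX element by element.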
stencil as a sequence of _AXPYs. The use of _AXPYs also provides better performance, because these operations are usually very efficient on such machines.

Consider the storage scheme described above, i.e., five coefficients alpha, beta, gamma, delta and epsilon stored per grid point, numbered sequentially as (i, j) = (j - 1)l + i, i, j = 1, 2, ..., l. The coefficients can then be stored in five separate arrays of size n = l*l. The matrix-vector product v = Au can then be obtained by the following sequence of operations:

    v_k = alpha_k u_k,                 k = 1, 2, ..., n        (9)
    v_k = v_k + beta_k u_{k+1},        k = 1, 2, ..., n-1      (10)
    v_k = v_k + gamma_k u_{k-1},       k = 2, 3, ..., n        (11)
    v_k = v_k + delta_k u_{k+l},       k = 1, 2, ..., n-l      (12)
    v_k = v_k + epsilon_k u_{k-l},     k = l+1, l+2, ..., n    (13)

and the transpose matrix-vector product v = A^T u is obtained similarly:

    v_k = alpha_k u_k,                 k = 1, 2, ..., n        (14)
    v_k = v_k + gamma_{k+1} u_{k+1},   k = 1, 2, ..., n-1      (15)
    v_k = v_k + beta_{k-1} u_{k-1},    k = 2, 3, ..., n        (16)
    v_k = v_k + epsilon_{k+l} u_{k+l}, k = 1, 2, ..., n-l      (17)
    v_k = v_k + delta_{k-l} u_{k-l},   k = l+1, l+2, ..., n    (18)

Experiments on the Cray Y-MP2E/232 showed that this approach gave a three-fold improvement in performance, from 40 MFLOPS to 140 MFLOPS. A separate set of the PDE examples containing these matrix-vector product routines is provided in the pim22/examples/sequential/pvp/pde directory.

(1) This may only need to be done once, if the coefficient matrix is unchanged during the solution of the system.
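Written out for one processor, the operations (9) to (13) above become five loops over contiguous arrays, each of which vectorizes fully. A sketch, following the reconstruction of (9) to (13) given above (routine and array names illustrative):

      SUBROUTINE PVPMV(N, L, A, B, G, D, E, U, V)
C     V = A*U FOR THE FIVE-POINT STENCIL STORED AS FIVE ARRAYS OF
C     LENGTH N = L*L, FOLLOWING THE VECTOR OPERATIONS (9)-(13)
      INTEGER N, L, K
      DOUBLE PRECISION A(N), B(N), G(N), D(N), E(N), U(N), V(N)
      DO 10 K = 1, N
         V(K) = A(K)*U(K)
   10 CONTINUE
      DO 20 K = 1, N-1
         V(K) = V(K) + B(K)*U(K+1)
   20 CONTINUE
      DO 30 K = 2, N
         V(K) = V(K) + G(K)*U(K-1)
   30 CONTINUE
      DO 40 K = 1, N-L
         V(K) = V(K) + D(K)*U(K+L)
   40 CONTINUE
      DO 50 K = L+1, N
         V(K) = V(K) + E(K)*U(K-L)
   50 CONTINUE
      RETURN
      END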
the characteristics of PIM is that if external routines are not required by an iterative method routine they are not called (the only exception being the monitoring routines). The user only needs to provide those subroutines that will actually be called by an iterative method routine; depending on the selection of method, preconditioners and stopping criteria, dummy parameters may be passed in place of those that are not used. Some compilers may require the presence of all routines used in the program during the linking phase of the compilation; in this case the user may need to provide stubs for the dummy routines. Section A.2 gives the synopsis of each user-supplied external routine used by PIM.

The external routines have a fixed parameter list to which the user must adhere (see A.2). Note that, from version 2.2 onwards, the coefficient and the preconditioning matrices do not appear in the parameter list of the PIM routines. Indeed, we regard the matrix-vector products and preconditioning routines as operators returning only the appropriate resulting vector; thus the PIM routines have no knowledge of the way in which the matrices are stored.

The external routines, however, may access the matrices declared in the main program via COMMON blocks. This strategy hides from the PIM routines the details of how the matrices are declared in the main program, and thus allows the user to choose the most appropriate storage method for her problem; previous versions of PIM required the matrices in the parameter list.
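For instance, a dense matrix-vector product routine may pick the matrix up from a COMMON block shared with the main program, keeping the fixed parameter list intact. A sequential sketch using the BLAS routine DGEMV (the COMMON block name, array dimensions and use of IPAR(2) as the matrix order follow the DIAGP example later in this section, but are otherwise illustrative):

      SUBROUTINE MATVEC(U, V, IPAR)
C     V = A*U FOR A DENSE MATRIX KEPT IN COMMON /PIMA/, DECLARED
C     IDENTICALLY IN THE MAIN PROGRAM (SEQUENTIAL CASE)
      INTEGER LDA
      PARAMETER (LDA=500)
      DOUBLE PRECISION U(*), V(*)
      INTEGER IPAR(*)
      DOUBLE PRECISION A(LDA,LDA)
      COMMON /PIMA/ A
      EXTERNAL DGEMV
      CALL DGEMV('N', IPAR(2), IPAR(2), 1.0D0, A, LDA, U, 1,
     +           0.0D0, V, 1)
      RETURN
      END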
to be supplied by the user; we have included the source code for the routine as shown above in the directory pim/examples/common, and this may be used as-is or modified by the user as required. Please note that for large system sizes the routine above will produce very large amounts of output. We stress that this routine is always called by the PIM iterative method routines; if no monitoring is needed, a dummy routine must be provided.

Note that some of the iterative methods contain an inner loop within the main iteration loop. This means that, for PIM_RGCR and PIM_TFQMR, the value of ITNO passed to PROGRESS will be repeated as many times as the inner loop is executed. We did not modify the iteration number passed to PROGRESS, so as to reflect the true behaviour of the iterative method being used.

4.6 Example programs

In the distributed software the user will find a collection of example programs under the directory examples. The example programs show how to use PIM with three different matrix storage formats, including dense matrices, matrices derived from the five-point finite-difference discretisation of a partial differential equation (PDE) and the standard sparse representation found in the Harwell-Boeing sparse matrix collection [18].

Most of the examples are provided for sequential and parallel execution, the latter with separate codes for the PVM 3.3.6 and MPI libraries. The examples involving the Harwell-Boeing sparse format are provided for
to consult Barrett et al. [6, pp. 57ff.], where such schemes as well as algorithms to compute matrix-vector products are discussed.

Preconditioning  For the preconditioning routines, one may use the scheme outlined above for the matrix-vector product; in some cases this may not be necessary, when there is no need to operate with A or when the preconditioner is stored as a vector. An example is the diagonal (or Jacobi) left preconditioning, where Q1 = diag(A)^{-1}:

      PROGRAM DIAGP
      INTEGER LDA
      PARAMETER (LDA=500)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
C     Q1 IS DECLARED AS A VECTOR OF LENGTH 250, AS IF USING AT
C     LEAST TWO PROCESSORS
      DOUBLE PRECISION A(LDA,LOCLEN), Q1(LOCLEN)
      COMMON /PIMQ1/ Q1
      EXTERNAL MATVEC, DIAGL, PDUMR, PDSUM, PDNRM, PROGRESS
C     SET UP PROBLEM-SOLVING PARAMETERS FOR USE BY USER-DEFINED
C     ROUTINES; THE USER MAY NEED TO SET MORE VALUES OF THE IPAR
C     ARRAY
C     LEADING DIMENSION OF A
      IPAR(1) = LDA
C     NUMBER OF ROWS/COLUMNS OF A
      IPAR(2) = N
C     NUMBER OF PROCESSORS
      IPAR(6) = NPROCS
C     NUMBER OF ELEMENTS STORED LOCALLY
      IPAR(4) = N/IPAR(6)
C     SET LEFT PRECONDITIONING
      IPAR(8) = 1
      DO 10 I = 1, N
         Q1(I) = 1.0D0/A(I,I)
   10 CONTINUE
      CALL DINIT(IPAR(4), 0.0D0, X, 1)
      CALL PIMDCG(X, B, WRK, IPAR, DPAR, MATVEC, DIAGL, PDUMR,
     +            PDSUM, PDNRM, PROGRESS)
      STOP
      END

      SUBROUTINE DIAGL(U, V, IPAR)
      DOUBLE PRECISION U(*), V(*)
      INTEGER IPAR(*)
      INTEGER LOCLEN
      PARAMETER (LOCLEN=250)
      DOUBLE PRECISION Q1(LOCLEN)
      COMMON /PIMQ1/ Q1
      CALL DCOPY(IPAR(4), U, 1, V, 1)
      CALL DVPROD(IPAR(4), Q1, 1, V, 1)
      RETURN
      END
Matrix-vector product, see External routines, 22
Naming convention of routines, 17
Obtaining PIM, 17
Parallelism
  data partitioning, 14
  programming model, 14
Parameters
  description of, 45
  printing, 78
  setting, 77
Preconditioning step, see External routines, 23
Stopping criteria, 16
Supported architectures and environments, 13
Vector initialisation, 79
Vector norm, see External routines, 25
[14] R.D. da Cunha and T.R. Hopkins. A parallel implementation of the restarted GMRES iterative method for nonsymmetric systems of linear equations. Advances in Computational Mathematics, 2(3):261-277, April 1994. Also as TR 7-93, Computing Laboratory, University of Kent at Canterbury.

[15] E.F. D'Azevedo and C.H. Romine. Reducing communication costs in the Conjugate Gradient algorithm on distributed memory multiprocessors. Research Report ORNL/TM-12192, Oak Ridge National Laboratory, 1992.

[16] J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14(1):1-17, 1988.

[17] J.J. Dongarra, R.A. van de Geijn, and R.C. Whaley. A users' guide to the BLACS v0.1. Technical report, Computer Science Department, University of Tennessee, 1993.

[18] I.S. Duff, R.G. Grimes, and J.G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection. Report TR/PA/92/86, CERFACS, October 1992.

[19] I.S. Duff, M. Marrone, and G. Radicati. A proposal for user-level sparse BLAS (SPARKER working note #1). Report TR/PA/92/85, CERFACS, October 1992.

[20] S.C. Eisenstat. A note on the generalized Conjugate Gradient method. SIAM Journal of Numerical Analysis, 20:358-361, 1983.

[21] H.C. Elman, Y. Saad, and P. Saylor. A hybrid Chebyshev Krylov subspace algorithm for solving nonsymmetric systems of linear equations. SIAM Journal of Scientific and Statistical Computing, 7:840-855, 1986.
The following routines compute the global sum and the vector 2-norm, ||u||_2 = sqrt(u^T u), using the BLAS DDOT routine and the reduction-plus-broadcast operation provided by MPI:

      SUBROUTINE PDSUM(ISIZE, X, IPAR)
      INCLUDE 'mpif.h'
      INTEGER ISIZE
      DOUBLE PRECISION X(*)
      DOUBLE PRECISION WRK(10)
      INTEGER IERR, IPAR(*)
      EXTERNAL DCOPY, MPI_ALLREDUCE
      CALL MPI_ALLREDUCE(X, WRK, ISIZE, MPI_DOUBLE_PRECISION,
     +                   MPI_SUM, MPI_COMM_WORLD, IERR)
      CALL DCOPY(ISIZE, WRK, 1, X, 1)
      RETURN
      END

      DOUBLE PRECISION FUNCTION PDNRM(LOCLEN, U, IPAR)
      INCLUDE 'mpif.h'
      INTEGER LOCLEN
      DOUBLE PRECISION U(*)
      DOUBLE PRECISION PSUM
      INTEGER IERR, IPAR(*)
      DOUBLE PRECISION DDOT
      EXTERNAL DDOT, MPI_ALLREDUCE
      INTRINSIC SQRT
      DOUBLE PRECISION WRK(1)
      PSUM = DDOT(LOCLEN, U, 1, U, 1)
      CALL MPI_ALLREDUCE(PSUM, WRK, 1, MPI_DOUBLE_PRECISION,
     +                   MPI_SUM, MPI_COMM_WORLD, IERR)
      PDNRM = SQRT(WRK(1))
      RETURN
      END

It should be noted that P_SUM is actually a wrapper to the global sum routines available on a particular machine. Also, when executing PIM on a sequential computer, these routines are empty, i.e., the contents of the array X must not be altered in any way, since its elements already are the inner product values.

The parameter list for these routines was decided upon after inspecting the format of the global operations available from existing message-passing libraries.
We refer the user to [32, pp. 45-58, 332-339] for more details.

The Chebyshev acceleration has the property that its iterates can be expressed by short (three-term) recurrence relations and, especially important for parallel computers, no inner products or vector norms are needed (except for the stopping test). The difficulty associated with the Chebyshev acceleration is the need for good estimates either of the smallest or largest eigenvalues of G, if the eigenvalues are real, or, in the case of a complex eigenspectrum, of a region in the complex plane containing the eigenvalues of minimum and maximum modulus.

With PIM, the user may make use of two routines, PIM_CGEV and PIM_RGMRESEV, to obtain such estimates. PIM_CGEV covers the case where the eigenvalues of G are real; for the complex case, PIM_RGMRESEV should be used. To obtain appropriately accurate estimates, these routines must be used with left preconditioning, and should be allowed to run for several iterations. The estimates for the eigenvalues of Q1 A should then be modified to those of I - Q1 A. This is done by replacing the smallest and largest real values, r and s, by 1 - s and 1 - r respectively; the imaginary values should not be modified.

We note that even if A has only real eigenvalues, G may have complex (or imaginary only) eigenvalues. In this latter case, the Chebyshev acceleration is defined in terms of a minimum bounding ellipse that contains the eigenvalues of G.
tar xfp pim22.tar
cd pim

and edit the Makefile. The following variables may need to be modified:

HOME: Your top directory, e.g., /u/users/fred.

FC: Your Fortran compiler of choice, usually f77.

FFLAGS: Flags for the Fortran compilation of main programs (example programs).

OFFLAGS: Flags for the Fortran compilation of separate modules (PIM routines and modules of examples). NOTE: this must include at least the flag required for separate compilation, usually -c.

AR: The archiver program, usually ar.

HASRANLIB: Either t (true) or f (false), indicating whether it is necessary to use a random library program, usually ranlib, to build the PIM library.

BLASLIB: Either the name of an archive file containing the BLAS library, or -lblas if the library libblas.a has been installed on a system-wide basis.

PARLIB: The compilation switches for any required parallel libraries. This variable must be left blank if PIM is to be used in sequential mode. For example, if PVM 3 is to be used, then PARLIB would be defined as

    -L$(PVM_ROOT)/lib/$(PVM_ARCH) -lfpvm3 -lpvm3 -lgpvm3

Each iterative method routine is stored in a separate file, with names in lower case following the naming convention of the routines; e.g., the routine PIMDCG is stored in the file pim22/double/pimdcg.f.

Building the PIM core functions  PIM needs the values of some machine-dependent floating-point constants. The single- and double-precision values are stored in the files
    