    double start, finish, middle;
    middle = 0.0;
    try {
        InitKeLP(argc, argv);
        KelpConfig(CONTIG_MSG_IN_PLACE, TRUE);
        int L, M, N, chk_freq, niter, reps, si, sj, gi, gj;
        double eps;
        cmdLine(argc, argv, L, M, N, eps, niter, chk_freq, reps, si, sj, gi, gj);
        Region3 domain(1, 1, 1, N, N, N);

        // Print header information
        OUTPUT("rb3D run on P = " << mpNodes() << " nodes with N = " << N << endl);
        OUTPUT("Processor geometry: " << gi << " x " << gj << " x "
               << mpNodes()/(gi*gj) << "; Blocking factors: " << si
               << " x " << sj << endl);
        OUTPUT("Iter: " << niter << "  Reps: " << reps << endl);
        if (chk_freq < niter)
            OUTPUT("Convg check every " << chk_freq << " iterations");

        // Allocate space for the local part of the problem.
        // Use the dock library to help with the partitioning.
        const Region3 STRIP_V(1, 1, 1, gi, gj, mpNodes()/(gi*gj));
        Processors3 P(STRIP_V);
        OUTPUT(P << endl);
        Decomposition3 T(domain);
        T.distribute(BLOCK1, BLOCK1, BLOCK1, P, SputnikTimes);
        OUTPUT(T << endl);
        T.addGhost(1);
        Manhattan grid(T);
        IrregularGrid3 rhs(T);

        // Initialize the local grid
        InitGrid(grid);
        rhs.fill(0.0);
        IrregularGrid3 *U = &grid;
        double stop = 1.0;
        double *times    = new double[reps];
        double *timesLoc = new double[reps];
        const int RED = 0, BLK = 1;

        // Do the computation
    DecompositionX(const RegionX& R): _domain(R) { }
    DecompositionX(const DecompositionX& D);
    DecompositionX& operator=(const DecompositionX& D);

    /***** simple access functions *****/
    const PointX& distributionRules() const { return _distType; }
    const RegionX& domain() const { return _domain; }
    int domainEmpty() const { return _domain.empty(); }
    int domainLower(const int dim) const { return _domain.lower(dim); }
    int domainUpper(const int dim) const { return _domain.upper(dim); }
    int domainExtents(const int dim) const { return _domain.extents(dim); }
    int domainSize() const { return _domain.size(); }

    /***** query functions about the virtual processor array *****/
    int pLower(const int dim) const { return _Map.lower(dim); }
    int pUpper(const int dim) const { return _Map.upper(dim); }
    int pExtents(const int dim) const { return _Map.extents(dim); }
    int pMap(const PointX& P) const { return _Map(P); }

    /***** query functions about the global index domain *****/
    int pIndex(const int dim, const int gIndex) const;
    int pOwner(const PointX& P) const;
    RegionX pRegion(const RegionX& R) const;

    const XObjectX& operator()(const int i) const
        { return FloorPlanX::operator()(i); }
    const XObjectX& operator()(const PointX& P) const
        { return (*this)(pMap(P)); }

    void setowner(const int i, const int proc)
        { FloorPlanX::setowner(i, proc); }
    void setowner(const PointX& P, const int proc)
        { FloorPlanX::setowner(pMap(P), proc); }
             "TEST MODE ON\n";
        if (maxThr[0] != 0 && maxThr[1] != 0)
            cout << "MAX THREADS SET INDIVIDUALLY\n";
        else
            cout << "MAX THREADS SET AS A GROUP\n";
    }
    else if (testSMPS == 1 && (numThreads > 0 || numThr[0] > 0)
             && maxThreads == 0) {
        cout << "TEST MODE ON\n";
        if (numThr[0] != 0 && numThr[1] != 0)
            cout << "SPECIFIC NUM THREADS SET INDIVIDUALLY\n";
        else
            cout << "SPECIFIC NUM THREADS SET AS A GROUP\n";
    }
    else {
        cout << "FIXED NUMBER OF THREADS\n";
        if (numThr[0] != 0 && numThr[1] != 0)
            cout << "SPECIFIC NUM THREADS SET INDIVIDUALLY\n";
        else
            cout << "SPECIFIC NUM THREADS SET AS A GROUP\n";
    }

    k = MPI_Get_processor_name(procName, &j);
    if (!k)
        cout << "\tname of SMP " << myid << ": " << procName << endl;

    int bestRun;
    double bestTime;
    double final;
    double *bestTimes   = new double[nodes];
    double *staticTimes = new double[nodes];
    int    *bestRuns    = new int[nodes];
    FILE   *stats;

    MPI_Barrier(MPI_COMM_WORLD);

    // MPI problem: only the root process gets argv correctly;
    // must broadcast the settings to the other processes.
    MPI_Bcast(&testSMPS,   1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&hetero,     1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&numThreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&maxThreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(maxThr, 2, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(numThr, 2,
I have not only introduced a method for improved performance, but also reduced the complexity of multi-tier programming.

Most importantly, Sputnik targets heterogeneous clusters. As clusters of multiprocessors age, unless a cluster is completely replaced at high cost (as opposed to being upgraded), being able to utilize the entire cluster without having an entire node, or parts of several nodes, remain idle is desirable. No programmer, researcher, or owner of the cluster will want to waste precious time on a costly, high-maintenance piece of computing hardware. To that end, Sputnik performs irregular partitionings and regulates the number of OpenMP threads that are used within each node.

I.C Organization of the Thesis

This thesis discusses the structure of, and the problems associated with, programming a heterogeneous cluster of multiprocessors.

Chapter 2 presents background information on multiprocessors, clusters of multiprocessors, and tools for programming both.

Chapter 3 introduces a variety of methods for programming clusters of multiprocessors and also discusses a variety of experiments that I ran, which led up to my conclusion that the API I had in mind would work.

Chapter 4 discusses the most important aspect of this thesis and my research: the theory of operation of the Sputnik Model. In addition, chapter 4 also looks at a few implementation details.

Chapter 5 discusses the results of the Sputnik API in an application study.

Chapter 
                bestTime = timings[i];
                bestRun = i;
            }
            i++;
        }
        MPI_Barrier(MPI_COMM_WORLD);

        bestTimes[myid] = bestTime;
        bestRuns[myid]  = bestRun;

        for (i = 0; i < nodes; i++) {
            MPI_Bcast((void *) &bestTimes[i], 1, MPI_DOUBLE, i, MPI_COMM_WORLD);
            MPI_Bcast((void *) &bestRuns[i],  1, MPI_INT,    i, MPI_COMM_WORLD);
        }

        if (myid == 0) {
            stats = fopen("StatPut", "w");
            for (i = 0; i < nodes; i++) {
                fprintf(stats, "%f\n", bestTimes[i]);
                cout << "bestRuns[node " << i << "] = " << bestRuns[i] << endl;
                cout << "bestTimes[node " << i << "] = " << bestTimes[i] << endl;
            }
            for (i = 0; i < maxThreads + 1; i++)
                fprintf(stats, "Time for %d threads = %f\n", i, timings[i]);
            fclose(stats);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        testSMPS = 0;
        omp_set_num_threads(bestRuns[myid]);
        final = SputnikMain(argc, argv, testSMPS, bestTimes);
    }
    else if (testSMPS == 1 && (numThreads > 0 || numThr[0] > 0)
             && maxThreads == 0) {
        if (numThr[0] != 0) {
            omp_set_num_threads(numThr[myid]);
            cout << "SETTING ID THREADS " << myid << " "
                 << numThr[myid] << endl;
        }
        else {
            omp_set_num_threads(numThreads);
            cout << "SETTING ID THREADS " << myid << " "
                 << numThreads << endl;
        }

        MPI_Barrier(MPI_COMM_WORLD);
        bestTimes[myid] = SputnikMain(argc, argv, testSMPS, NULL);
        MPI_Barrier(MPI_COMM_WORLD);

        for (
i = 0; i < nodes; i++)
            MPI_Bcast((void *) &bestTimes[i], 1, MPI_DOUBLE, i, MPI_COMM_WORLD);

        if (myid == 0)
            for (i = 0; i < nodes; i++)
                cout << "bestTimes[node " << i << "] = " << bestTimes[i] << endl;

        MPI_Barrier(MPI_COMM_WORLD);
        testSMPS = 0;
        final = SputnikMain(argc, argv, testSMPS, bestTimes);
    }
    else {   // testSMPS == 0
        if (numThr[0] != 0)
            omp_set_num_threads(numThr[myid]);
        else
            omp_set_num_threads(numThreads);

        if (hetero) {
            if (myid == 0)
                cout << "TIMES TO PARTITION FOR:\n";
            stats = fopen("StatPut", "r");
            for (j = 0; j < nodes; j++) {
                fscanf(stats, "%lf\n", &staticTimes[j]);
                if (myid == 0)
                    cout << staticTimes[j] << " seconds\n";
            }
            fclose(stats);
            final = SputnikMain(argc, argv, testSMPS, staticTimes);
        }
        else {
            // Run unaltered
            final = SputnikMain(argc, argv, testSMPS, NULL);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    cout << "Final " << myid << " " << final << endl;
    cout << "Done" << endl;
    cout.flush();
    MPI_Barrier(MPI_COMM_WORLD);

Appendix C

Source code to redblack3D with Sputnik

C.A rb.F

c*****************************************************************
      subroutine rb7rrelax(u, ul0, ul1, ul2, uh0, uh1, uh2)
      integer ul0, ul1, uh0, uh1, ul2, uh2
      double precision u(ul0:uh0, ul1:uh1, ul2:uh2)
c
c     perform 7-point red-black relaxation for Poisson's
c     equation with h = 1.0
c
c     Originally written by Stephen J. Fink
c     Modified b
                  u(i,j,k) = c *
     2                ( u(i-1,j,k) + u(i+1,j,k)
     3                + u(i,j-1,k) + u(i,j+1,k)
     4                + u(i,j,k-1) + u(i,j,k+1)
     5                - c2*rhs(i,j,k) )
               end do
            end do
         end do
      end do
      end do
c$omp end do nowait
c$omp end parallel
      return
      end

C.B rb3D.C

/*****************************************************************
 *  rb3D.C
 *
 *  Program that solves Poisson's equation on a unit cube
 *  using Red-Black ordering.  (We should never solve Poisson's
 *  equation this way; this kernel is intended to be used in
 *  multigrid.)
 *
 *  This version uses a modified custom MotionPlan to send
 *  contiguous messages where possible.
 *
 *  It also optimizes communication still further, using the
 *  Manhattan class to avoid communicating corner and edge
 *  ghost cells; the solver uses a 7-point stencil, so there is
 *  no need to send these extra points.
 *
 *  If you use this code as a starting point for another
 *  application and you need the corner or edge points, do not
 *  use the Manhattan class; use an IrregularGrid3 instead.
 *  The code may be easily modified to this end: replace
 *  Manhattan by IrregularGrid3 and be sure to uncomment the
 *  code that sets up the Mover and MotionPlan objects.
 *
 *  Uncomment the following Mover member function calls:
 *      execute()
 *  Comment out the following Manhattan member function calls:
 *      fillGhost(), Optimize()
 *
 *  Original 2D code was written by Stephen J. Fink.
 *  Extensively modified for benchmarking by Scott B. Baden,
 *  Department of Com
I increased the problem size because I was not finding good scaling with N=761. Therefore, for the case with 64 and 32 threads, I had N=949. For the 128/64 and 128/96-thread cases, I used N=1163.

As seen from the timings in Table V.7, and especially from the speedups in Table V.8, Sputnik performed well with large numbers of threads per system as well, showing more than a 34% improvement after repartitioning for either the 64/32-thread case or the 128/64-thread case. As expected, as the numbers of threads per node narrow, and as the total number of threads grows, the speedup gains decline, indicated by only a 6.77% improvement in the 128/96-thread case.

[Figure V.4 plot: redblack3D with 48 threads on balder; time (seconds) versus number of threads on aegir (24-48); series: Original Computation Time, New Computation Time, Predicted Time.]
Figure V.4: Important redblack3D timings with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library.

Threads (balder/aegir) | Original Compute (balder / aegir) | New Total (balder / aegir) | New Compute (balder / aegir)
64/32   | 57.1 / 99.3  | 74.6 / 74.6085    | 71.5 / 71.8537
128/64  | 59.6 / 108   | 83.2697 / 83.7    | 80.3 / 75.5464
128/96  | 65.6 / 73.5  | 72.9498 / 72.7    | 65.5 / 68.841

Table V.7: Complete redblack3D timings for large numbers of threads per system.

V.E.4 Anomalies

There were some strange 
J., and S. B. Baden. "Run-time Data Distribution for Block-Structured Applications on Distributed-Memory Computers." CSE Technical Report CS94-386, September 1994.

Foster, Ian, and Nicholas T. Karonis. "A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems." SC '98 Conference, Orlando, FL, Nov. 1998.

Gatlin, K. S., and L. Carter. "Architecture-Cognizant Divide and Conquer Algorithms." SC '99 Conference, Portland, OR, Nov. 1999.

Kesselman, C., et al. <http://www.globus.org>

Gropp, W. W., and E. L. Lusk. "A Taxonomy of Programming Models for Symmetric Multiprocessors and SMP Clusters." Proc. Programming Models for Massively Parallel Computers, October 1995, Berlin, pp. 2-7.

Hill, J. M. D., P. I. Crumpton, and D. A. Burgess. "The Theory, Practice, and a Tool for BSP Performance Prediction Applied to a CFD Application." PRG-TR-4-1996, Oxford University Computing Laboratory, 1996.

IBM PowerPC 604e User Manual. <http://www.chips.ibm.com/products/powerpc/chips/604e/604eUM_book.pdf>

Kleiman, S., D. Shah, and B. Smaalders. "Programming with Threads." SunSoft Press.

Kohn, S. R. "A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations." Ph.D. dissertation, University of California at San Diego, La Jolla, CA, 1995.

Laudon, J., and D. Lenoski. "The SGI Origin: A ccNUMA Highly Scalable Server." ISCA.

Lawrence Livermore National Labs. "Usin
Sputnik API if the experiment was successful. After seeing that the program compiled and appeared to run successfully, I had all of the elements in place to write the Sputnik API and begin gathering data on its effectiveness.

2. KeLP Web Page: <http://www.cse.ucsd.edu/groups/hpcl/scg/kelp>

Chapter IV

The Sputnik Model and Theory of Operation

IV.A Introduction

Using the experience gained from experiments with hand-modifying various combinations of MPI, KeLP, OpenMP, and Pthreads programs, I extended the API and functionality of KeLP in a new set of routines called Sputnik. These routines perform two steps that work in tandem to achieve efficiency on heterogeneous clusters.

Although the Sputnik Model allows, in theory, for any sort of shared-memory multiprocessor and any sort of optimizations, the Sputnik API has been written with a specific focus. The API has been written with two of the many possible optimizations, which are validated as good optimizations for the redblack3D benchmark in the next chapter. The order of events for the Sputnik Model, on which the API is based, is:

1. ClusterDiscovery: Runs the kernel of the program repeatedly on each separate node to determine the timings and relative performance, changing program parameters over several runs to determine the configuration that achieves optimal performance.

2. ClusterOptimization: Using the parameters which the pro
best times for each node. This way, not only does each node run with an optimal number of threads per node (it may not be the maximum available), but also with an optimal division of work.

    time[i] = SputnikMain(int argc, char **argv, bestTimes);

Appendix B

Source Code to Sputnik

B.A DecompositionX.h.m4

/*****************************************************************
 *  DecompositionX.h.m4
 *
 *  Author: Stephen Fink
 *  Modified for Sputnik: Sean Peisert
 *
 *  Class DecompositionX represents a distributed index space
 *  with a regular block decomposition.
 *****************************************************************/

#include "ArrayX.h"
#include "ProcessorsX.h"
#include "GhostPlanX.h"
#include "menagerie.h"

#define BLOCK1 1
#define BLOCK2 2

/*****************************************************************
 *  Class DecompositionX is a first-class dynamic template.
 *****************************************************************/

class DecompositionX: public GhostPlanX {
    RegionX _domain;      // Global region of DecompositionX
    PointX _distType;     // Distribution directive in each dimension
    ArrayX<int> _Map;     // maps virtual proc array to 1-d floorplan

public:
    /***** constructors and destructors *****/
    DecompositionX();
    DecompositionX(ArrayIndexArguments):
        _domain(PointX(1), PointX(ArrayIndices))
combining both technologies to form a highly tuned piece of parallel code can be daunting. For this reason, using a library with multi-tier support built in can make programming an application significantly easier than using its component technologies, MPI and Pthreads, "by hand," without sacrificing performance [23]. I will discuss various programming libraries and methods in the next chapter.

Chapter III

Heterogeneous Multi-Tier Programming

III.A Previous Work

Significant progress has already been made in programming homogeneous multi-tier machines, but the issue is still an open problem. Software APIs have been developed which can specifically address a multi-tier machine. In the simplest approach, at least one vendor has implemented MPI on their machines so that if a message is to go to another processor on the same node, it gets converted into a native shared-memory call. If it goes off-node, it gets converted into a native messaging-layer call. Finally, if it goes off-machine, it is sent via TCP/IP. A standardized technology based on this concept exists, called Multiprotocol Active Messages, and additionally Sun has implemented similar technology in the version of MPI that runs on the Sun Enterprise 6500 and Enterprise 10000 servers [42]. Other vendors are working on their own implementations.

III.A.1 Multi-Protocol MPI

Multi-protocol active messages are a technology that has been developed to allow different mes
directives around did not produce any parallel speedup at all. Second, it turned out that only when the program's built-in cache-tiling mechanism was disabled, by using two specific options (-si 1 -sj 1), did the program produce any scaling at all. And then, as I mentioned before, it performed several times worse than KeLP alone, which uses MPI exclusively.

Again, since the repartitioning increased the utilization of the cluster and the speed of the program, I did not feel that this affected the validity of my results. I am confident that experiments on a true commodity cluster of multiprocessors, as Sputnik was designed for, will resolve the OpenMP scaling and speedup issues.

Chapter VI

Conclusions and Future Work

The results that Sputnik produced with the application study of redblack3D indicate strongly that, as part of the ClusterOptimization step discussed in chapter 1, repartitioning the load to balance out the time that each Origin2000 system spends running, so that both nodes finish at the same time, works.

The speedup over running unoptimized with uniform partitioning, though not completely linear, is good, and works well for medium-sized problems up to very large ones. The most dramatic results came with the example of 48 threads on balder and 24 threads on aegir, where the repartitioning revealed a speedup of 35.7%. This speedup is actually 2.1% better than the equations predict for the theoretical results. This is shown despite OpenMP 
equal two-dimensional blocks.

Thus, each block, or Region, would have one third of the dataset, which would in turn get assigned to a cluster where each node gets only one MPI process. However, a question this thesis addresses is: what happens if each node is not equal in processing power? What if, instead of the processing power of the cluster being as evenly divided as the data is in figure IV.1, node 0 is twice as powerful as the other two nodes, and therefore can run in 10 seconds whatever nodes 1 or 2 can run in 20 seconds? In this case, the partitioning of the dataset should look like the one in figure IV.2.

By modifying the DOCK library so that the domain is partitioned non-uniformly, according to how well each node really performs, rather than uniformly, the program can be run on the cluster more optimally.

In the previous chapter, I discussed this partitioning scheme for use with the redblack3D version that runs with hand-tuned MPI and Pthreads, and described

[Figure IV.2 diagram: Node 0, Node 1, Node 2]
Figure IV.2: Two-dimensional dataset partitioned so that node 0 gets twice as much data to work with as either node 1 or node 2.

how the relative power of a node can be determined by comparing the inverse of its time with the sum of the inverses of the times, forming a ratio. This ratio is multiplied by the total amount of work available to determine the size of the block to give to a particular node.

IV.E API Des
of optimizations (more than just repartitioning or adjusting the number of threads).

- Working with many other different types of applications.

Contact Information

The author can be contacted at the email address:
peisert@sdsc.edu

This thesis can be downloaded in full, in PDF format, at:
http://www.sdsc.edu/~peisert

Appendix A

User's Guide

A.A Introduction

Sputnik runs in two stages. In the first stage, the routines gather information about how the program actually runs on each multiprocessor node. Once it gets timings for each separate node, it calculates the fraction of the workload that each multiprocessor node should run. In a cluster with N multiprocessor nodes, this ratio looks like: time_i / time_total = work_i / work_total. Additionally, the first stage determines the optimal number of OpenMP threads to run per node. This particular optimization, of the ideal number of threads per node to run, is only one possible optimization that we could be doing. Other possible optimizations might include tiling for cache and making predictions about dynamic optimizations that the program might need after the original partitioning.

In the second stage, the routines partition the problem based on the calculated fractions and finally make a final run of the program at the peak speed possible based on the chosen partition.

Sputnik expands upon KeLP1 by adding to the API two new functions and one new technology: Ope
on balder and varying numbers of threads on aegir

V.5 Complete redblack3D timings with 48 threads on balder and varying numbers of threads on aegir

V.6 Speedup and predicted timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir

V.7 Complete redblack3D timings for large numbers of threads per system

V.8 Speedup and predicted timings for redblack3D with large numbers of threads per system

LIST OF FIGURES

I.1 Diagram of a heterogeneous cluster of multiprocessor nodes

II.1 Diagram of a multiprocessor

II.2 Diagram of a distributed-memory machine

II.3 Diagram of an SGI Origin2000 (courtesy of SGI's "Origin2000 and Onyx2 Performance Tuning and Optimization Guide," IRIX 6.5)

II.4 Diagram of a cluster of symmetric multiprocessors

III.1 Hierarchy of software layers for KeLP2

III.2 redblack3D with heterogeneous partitioning using hand-coded MPI and Pthreads, N=300; PE0 has 2 threads and PE1 and PE2 have 1 each

III.3 OpenMP fork-join example

III.4 OpenMP scalability test code

III.5 OpenMP Scaling Test Timings on an SGI Origin2000 with 250 MHz processors

III.6 OpenMP Scaling Test Spe
optimal number of threads for each node. When it has found the optimal times, it writes them to a file called "stats" and makes a final run of the program using the optimal number of threads per node and the optimal decomposition.

- The -hetero argument is ignored.

2. mpirun -np 2 mpiprog -testSMPS 1 -maxthr0 10 -maxthr1 20

- With a maximum number of 10 threads on node 0 and 20 threads on node 1, Sputnik will test each node with the kernel of the program to determine the optimal number of threads for each node. When it has found the optimal times, it writes them to a file called "stats" and makes a final run of the program using the optimal number of threads per node and the optimal decomposition.

- The -hetero argument is ignored.

3. mpirun -np 2 mpiprog -testSMPS 0 -hetero 0 -numthreads 10

- Sputnik runs on 2 nodes with exactly 10 threads per node with homogeneous decomposition.

4. mpirun -np 2 mpiprog -testSMPS 0 -hetero 0 -numthr0 10 -numthr1 20

- Sputnik runs on 2 nodes with exactly 10 threads on node 0 and 20 threads on node 1 with homogeneous decomposition. Use of this set of options is comparable to running with KeLP2.

5. mpirun -np 2 mpiprog -testSMPS 1 -hetero 0 -numthr0 10 -numthr1 20

- Sputnik runs on 2 nodes with exactly 10 threads on node 0 and 20 threads on node 1, but with heterogeneous decomposition based on a single test run made before the final run.

6. mpirun -np 2 mpipr
that helped shape who I am and bring me to this point. And for a lot of fun. And to Alex, Eric, Kent, Laura, P.J., Steve, and Tage, too. I have such fantastic friends.

My girlfriend Cathy, who has given me inspiration when things proved to be most frustrating elsewhere.

Ms. Jean Isaac, with whom I had my first computer class in the First Grade at St. Mark's School in Marin County, California, who inspired me to explore computers more.

Mrs. Judy Farrin, my great friend and teacher, who made computers and chemistry at Redwood High School in Marin County, California, fun and interesting by believing in me and letting me explore on my own. And for letting me blow things up in the chemistry lab.

Dr. Paul H. J. Kelly, Dr. Jarmo Rantakokko, and Dr. Daniel Shalit for their help and friendship, especially when things were going very awry.

Uppsala University in Sweden for generously allowing me use of the Yggdrasil DEC Alpha cluster.

The National Center for Supercomputing Applications (NCSA) and their extremely knowledgeable and helpful staff for the extremely generous allocation of computing time on the Silicon Graphics Origin2000 machines there and their help in using them. When I couldn't find any other machines to use, NCSA heroically bailed me out. In particular, I would like to thank Faisal Saied, Louis Hoyenga, Susan John, Yiwu Chen, Roy Heimbach, Scott Koranda, Dave McWilliams, and Michael Pflugmacher.

UCSD and the UCSD Computer Sci
6 presents my conclusions along with future work possibilities.

The appendices contain source code and present a user's guide to the Sputnik API.

Chapter II

Clusters of Multiprocessors

II.A Multiprocessors

A multiprocessor is a machine with two or more processors that all share the same main memory, as shown in Figure II.1. Some multiprocessors contain processors that are equidistant from the main memory. This is referred to specifically as a symmetric multiprocessor, or SMP, which has uniform memory access, or UMA. By contrast, some multiprocessors, including the SGI Origin2000, have internal networks which cause the processors to have non-uniform memory access, or NUMA, because the amount of time it takes for a processor to retrieve memory from two other processors might differ. The SGI Origin2000 actually has a NUMA variation called cache-coherent NUMA, or ccNUMA. Typically, although main memory is shared on a multiprocessor, each processor in the node is separated from main memory by one, two, or sometimes even three levels of individual cache.

Multiprocessors are starkly different from their vector supercomputer and massively parallel multicomputer ancestors. First, by definition, a multiprocessor uses shared memory, whereas in a multicomputer all processors have their own, separate main memories, making them distributed memory (also called shared-nothing). The SGI Origin2000, the principal machine used in obtaining the re
ClusterOptimization component appears to work. This can certainly be extended in the future to function well with problems that may benefit not only from a partitioning or balancing of the problem, but possibly from other optimizations, including cache tiling or focusing specific sections of the program on specific machines that have unique characteristics from which those sections would benefit.

The idea could certainly also be adapted to work in a dynamic environment, where instead of sampling just once, at the beginning, testing and sampling could happen continuously throughout the run of the program to optimally execute long-running programs, tuning throughout the run of the program.

The Model might also be brought to address nodes of multiple architectures in the same cluster. For example, if we define our "cluster" to be an SGI/Cray T3E, an SGI/Cray T90, and an IBM SP, connected by a high-bandwidth, low-latency network, we will have a phenomenally heterogeneous cluster, or PHC (as opposed to a slightly heterogeneous cluster, or SHC). A problem that is able to make use of all of these machines and their unique characteristics would be rare, but it is entirely possible that a program might have some loops that are easily vectorizable and should best be directed to the T90, and parts of the program that simply should be farmed out to as many processors as possible on the T3E and SP [5]. A possible step for the ClusterDiscovery stage w
... Fortran and C

                             balder      aegir
Processors                   256         128
Main Memory                  128 GB      64 GB
Peak Theoretical Performance 128 GFLOPS  64 GFLOPS

Table V.1: Specifications for the two Origin2000 machines, balder and aegir.

Although I experimented with a variety of environment variables in many possible combinations, I found the optimal settings for "DSM_MIGRATION" to be "ALL_ON" and "DSM_PLACEMENT" to be "ROUND_ROBIN". The system software configurations that I used, and their versions, are shown in table V.2. The two Origin2000 machines were connected by an SGI "Gigabyte System Network (GSN)" interconnect that supports a maximum bandwidth of 800 MB per second and has a theoretical latency of less than 30 microseconds. Experimental results showed that the actual latency might be much closer to 140 microseconds and the bandwidth less than 100 MB per second.

Component        | Version  | Special Flags
Operating System | IRIX 6.5 |
MIPSpro f77      | 7.3.1m   | -mp -O3 -mips4 -r10000 -64
MIPSpro CC       | 7.3.1m   | -mp -lmpi -lm -lftn -lcomplex -O3 -r10000 -64
KeLP             | 1.3a     |

Table V.2: Software configurations.

V.C Predicted Results

Generically, the optimal speedup can be computed from the following equations, where T is total time. For all cases, both with heterogeneous and homogeneous partitioning, the total time for the program is only as fast as the slowest node:

T_{total} = T_{slowest\ node} \qquad (V.1)

Thus, for a cluster, if one node runs faster than the 
              MPI_INT, 0, MPI_COMM_WORLD);

    if (maxThr[0] != 0 && maxThr[1] != 0)
        maxThreads = maxThr[myid];

    int maxMax = maxThreads;

    // Get the biggest number of threads on all nodes
    for (i = 0; i < nodes; i++)
        if (maxMax < maxThr[i])
            maxMax = maxThr[i];

    if (testSMPS == 1 && maxThreads > 0) {
        double *timings = new double[maxMax + 1];

        // Initialize values to zero.  We can't use zero
        // threads, so we go all the way up to maxThreads.
        for (i = 0; i <= maxMax; i++)
            timings[i] = 0.0;

        int gotit = 0;
        int *a_gotit = new int[nodes];
        int *a_min   = new int[nodes];
        int *a_max   = new int[nodes];
        int *a_iter  = new int[nodes];
        int minMin = 1;

        for (i = 0; i < nodes; i++) {
            bestTimes[i] = 0.0;
            bestRuns[i]  = 0;
            a_gotit[i] = 0;
            a_min[i]   = 0;
            a_max[i]   = 0;
        }

        i = 1;

        // Find the Max and Min times that the best
        // is in between
        while (i <= maxMax) {
            cout << "PE " << myid << ": Running with "
                 << i << " threads...\n";
            MPI_Barrier(MPI_COMM_WORLD);
            a_iter[myid] = i;
            for (j = 0; j < nodes; j++) {
                MPI_Bcast((void *) &a_iter[j], 1, MPI_INT, j, MPI_COMM_WORLD);
                assert(a_iter[j] == i);
            }
            omp_set_num_threads(i);

            // Initial run
            timings[i] = SputnikMain(argc, argv, testSMPS, NULL);

            // Store the best time for the first run,
            // or a lower time
            if ((timings[i] < bestTime || i == 1) && !gotit
                && i <= maxThr[myid]) {
                bestRun = i;
                bestTime = tim
[Figure III.5 plot: OpenMP test timings for Node 0 and Node 1 versus the number of threads for Node 1.]
Figure III.5: OpenMP Scaling Test Timings on an SGI Origin2000 with 250 MHz processors.

[Figure III.6 plot: OpenMP test parallel speedup versus the number of threads for Node 1 (32, 16, 8, 4, 2, 1).]
Figure III.6: OpenMP Scaling Test Speedup for an SGI Origin2000 with 250 MHz processors.

Threads | Node 0 | Node 1 | Speedup
32      | 1.8800 | 0.2417 | 7.8182
16      | 1.8867 | 0.2611 | 7.2263
8       | 1.8807 | 0.3280 | 5.7342
4       | 1.8733 | 0.5378 | 3.4833
2       | 1.8768 | 0.9297 | 2.0186
1       | 1.8677 | 1.7462 | 1.0696

Table III.2: OpenMP Scaling Test Speedup for an SGI Origin2000 with 250 MHz processors.

had an advantage in ease of programming.

III.C.3 KeLP and OpenMP

Finally, because MPI, one of the component technologies that KeLP is built on, functioned properly with OpenMP, I assumed that KeLP would function properly with OpenMP. However, prior to adding an API to KeLP directly, I decided to test both KeLP and OpenMP's compatibility and also the efficiency of programs built using both technologies. I went back to rb3D this time, but instead of starting with a hand-coded MPI and Pthreads version, I began with a KeLP version, then added OpenMP directives and code to have the two interact, which would be later replaced by calls directly to the new
R. B. Frost, and D. Shalit. "KeLP User Guide Version 1.3." Department of Computer Science and Engineering, University of California, San Diego, September 25, 1999.

Baden, S. B., and S. J. Fink. "The Data Mover: A Machine-Independent Abstraction for Managing Customized Data Motion." LCPC '99, August 1999.

Baden, S. B., and S. J. Fink. "Communication overlap in multi-tier parallel algorithms." SC98, Orlando, FL, Nov. 1998.

Baden, S. B., and S. J. Fink. "A Programming Methodology for Dual-Tier Multicomputers." Submitted for publication.

Bader, David A., and Joseph JaJa. "SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs)." Tech. Rep. CS-TR-3798, Univ. of Maryland Inst. for Advanced Computer Studies, Dept. of Computer Sci., Univ. of Maryland, May 1997.

Barney, Blaise M. "POSIX Threads Programming." Maui High Performance Computing Center, October 29, 1998.

Becker, D. J., T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer. "Beowulf: A Parallel Workstation for Scientific Computation."

Cappello, F., and O. Richard. "Performance characteristics of a network of commodity multiprocessors for the NAS benchmarks using a hybrid memory model." PACT '99, July 20, 1999.

Carter, L. "Single Node Optimization." NPACI Parallel Computing Ins
UNIVERSITY OF CALIFORNIA, SAN DIEGO

A Programming Model for Automated Decomposition on Heterogeneous Clusters of Multiprocessors

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Computer Science

by

Sean Philip Peisert

Committee in charge:

Professor Scott B. Baden, Chair
Professor Larry Carter
Professor Jeanne Ferrante
Professor Sidney Karin

2000

Copyright
Sean Philip Peisert, 2000
All rights reserved.

The thesis of Sean Philip Peisert is approved, and it is acceptable in quality and form for publication on microfilm:

University of California, San Diego
2000

Dedicated to:

My parents and grandparents, who raised me wonderfully and gave me everything I needed to figure out how to succeed in life, including my Opa, who is no longer here to see me present this thesis, but who first set me in front of a typewriter, an experience without which I would probably never have taken to the computer, much less the supercomputer.

"It is an old maxim of mine which states that once you have eliminated the impossible, whatever remains, however improbable, must be the truth."
Sherlock Holmes, The Sign of Four

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Tables
List of Figures
    // We actually do it twice: once with communication,
    // and once without.

    if (mpNodes() > 1 && !testSMPS) {
        for (int k = 0; k < reps; k++) {
            STATS_RESET();
            STATS_START(STATS_ITERATION);
            if (k == 0)
                start = MPI_Wtime();

            for (int i = 1; i <= niter; i++) {
                // Exchange boundary data with neighboring processors
                U->fillGhost();

                // Perform the local smoothing operation on the RED points
                ComputeLocal(U, si, sj, RED, rhs);

                // Exchange boundary data with neighboring processors
                U->fillGhost();

                // Perform the local smoothing operation on the BLK points
                ComputeLocal(U, si, sj, BLK, rhs);
            }
            if (k == 0) {
                finish = MPI_Wtime();
                middle = finish - start + middle;
            }
            STATS_STOP(STATS_ITERATION);
            times[k] = STATS_TIME(STATS_ITERATION);
        }

        if (!testSMPS)
            cout << "SPUTNIK TIME WITH COMMUNICATION: " << middle << endl;

        middle = 0.0;
        for (int k = 0; k < reps; k++) {
            STATS_RESET();
            STATS_START(STATS_LOCAL);
            if (k == 0)
                start = MPI_Wtime();

            for (int i = 1; i <= niter; i++) {
                // Perform the local smoothing on the RED points
                ComputeLocal(U, si, sj, RED, rhs);
                // Perform the local smoothing on the BLACK points
                ComputeLocal(U, si, sj, BLK, rhs);
            }
            STATS_STOP(STATS_LOCAL);
            timesLoc[k] = STATS_TIME(STATS_LOCAL);
            if (k == 0) {
                finish = MPI_Wtime();
                middle = finish - start + middle;
            }
        }
        // Re
had something to do with memory distribution as well as thread distribution (having all threads spawned on the same processor). On the Origin2000, threads and memory can be distributed throughout the system. There may be processors spawned on one part and memory placed on another, using OpenMP. Despite memory and thread migration and round-robin memory distribution, having all the OpenMP threads realize good memory-access speed does not seem to be trivial.

Due to the fact that the program did improve with greater numbers of threads, the thread-distribution idea could be discounted. What was left was memory distribution, and this was never solved.

I decided in the end, though, that since Sputnik demonstrated the results of my thesis (that performance results could be analyzed and that action could be taken; in the case of Sputnik, adjusting the number of threads per system and re-partitioning the dataset according to "processing power"), my thesis had been proven. The Sputnik API is currently designed to work with a commodity cluster with a good MPI and OpenMP implementation. It will take more work on other vendors' systems, with other vendors' implementations of OpenMP, to determine exactly what the cause of the problems with OpenMP on the Origin2000 is.

Another problem, presumably related to the OpenMP issues as well, had to do with scaling. First, it turned out that the loop that was most obviously the one to put the OpenMP
ads" and "hetero" arguments, and that it should test the speed of each individual multiprocessor node before making a final run with the optimal number of threads per node. If equal to the integer "0", the "hetero" argument is checked. If "hetero" is equal to "1", then the "stats" file is read for its timing information and threads are allocated to the nodes based on the timings from "stats" and the value of "maxthreads". If "hetero" is equal to 0, then the "stats" file and "maxthreads" are ignored and "numthreads" threads per node are used rather than reading the timings from the "stats" file. All other conditions will produce an error.

2. hetero: Whether or not to run heterogeneously, based on the "stats" file, if the multiprocessor nodes are not tested individually.

3. maxthreads: If given, is the maximum number of threads to be used on each node. If not given, a default value is used.

4. numthreads: The number of threads per node to allocate if testSMPS is 0 (false).

A.C Examples of Command-Line Options

An example of each possible running mode with the Sputnik command-line arguments follows. Each example runs an MPI program, mpiprog, with a certain number of threads on two multiprocessor nodes.

1. mpirun -np 2 mpiprog -testSMPS 1 -maxthreads 10

- With a maximum number of 10 threads on both nodes, Sputnik will test each node with the kernel of the program to determine the
anomalies encountered. In fact, all of the results show that Sputnik gives a speedup within 5.2% of the theoretical optimum, and the majority of the results are within 2%. Some, in fact, are up to 2.3% better than the theoretical optimum. Figure V.5 shows graphically just how close Sputnik comes to perfect speedup with the available optimizations.

An application written with KeLP can be converted to Sputnik very easily, as long as a good OpenMP implementation exists and the kernel that one is trying to parallelize can in fact be taken and optimized by OpenMP well. As noted by other researchers, this is not always possible [21, 44, 45]. This makes code for heterogeneous clusters almost as easy to program as for homogeneous clusters by using Sputnik instead of KeLP, which was one of my primary goals.

Following the Sputnik Model, the Sputnik API library could certainly be adapted to work with different component technologies. For instance, instead of KeLP1 and OpenMP, it could be built on top of one of the already existing multi-tier APIs described in my related-work section. The intent would be to continue to make developing Sputnik-based scientific code supporting heterogeneous clusters of multiprocessors even easier than Sputnik currently provides, while still achieving at least the speedup that was demonstrated on the Origin2000s at NCSA.

Regardless of the component technologies used, however, the idea of a ClusterDiscovery and 
before, the program only runs as fast as the slowest system. Therefore, the time for the slowest system is also the time for the whole program. The goal of Sputnik is to make the program run faster. A side benefit is that it also increases machine utilization by not having a system or systems remain idle while waiting for the slower system or systems to catch up. In table V.3, we can see the original

[Figure V.1 plot: redblack3D using 32 threads on balder; time (seconds) versus number of threads on aegir (16-32); series: Original Computation Time, New Computation Time, Predicted Time.]
Figure V.1: Important redblack3D timings with 32 threads on balder and varying numbers of threads on aegir.

unmodified run with 16 threads on aegir causing balder to remain idle for up to 40 seconds while waiting for aegir to catch up.

As can be seen clearly from figure V.2, Sputnik does provide a speedup over the version of the program without the heterogeneous Sputnik partitioning. As expected, the speedup tapers off as the numbers of threads per system become closer together on the pair of Origin2000s. Not only does Sputnik provide a speedup but, as shown, it also demonstrates improved system utilization, because one system does not remain idle for nearly as long as it originally had. At its best, Sputnik shows a 34.9% speedup when 32 threads are used on balder and 16 thread
blocks.

Block decomposition facilitates the implementation of features including tiling for cache and repartitioning. The Sputnik Model is different from the Sputnik API. In the Sputnik Model, I discuss how an API like KeLP can discover resources and act appropriately to refine the program to run heterogeneously. In my API, I have made specific choices and assumptions about what to implement. Although, as I have mentioned, many possible optimizations could have been made, I specifically chose repartitioning and thread adjustment as methods for optimizing for heterogeneous clusters.

IV.D.2 Heterogeneous Partitioning

KeLP provides a mechanism for doing automated decomposition of a Grid object into Domain objects and further into FloorPlan objects, which include assignments of each Region to processors. This mechanism is contained in a library called DOCK (DecOmposition Classes for KeLP). Among other uses, DOCK will take a Grid and partition it into equal-size Regions, with slight tolerance and variation when the number of Regions to divide the Grid into does not evenly divide the number of columns or rows to be partitioned. Since a Region or Grid can be multi-dimensional, DOCK can partition in multiple dimensions as well. Roughly, the one-dimensional partitioning of a two-dimensional Grid into three two-dimensional Regions might look like the partitioning in figure IV.1 (a condensed sketch of the corresponding decomposition calls appears below the figure).

Figure IV.1: Two-dimensional dataset partitioned into three
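For concreteness, the appendix code for redblack3D drives DOCK roughly as follows: a Decomposition is built over the global domain and then distributed block-wise over a virtual processor arrangement. The snippet below is a condensed excerpt-style sketch, not a complete program; I assume here that SputnikTimes carries the per-node timing information that the Sputnik-modified distribute() uses to size each Region non-uniformly.

    // Condensed from the rb3D driver in Appendix C (not a complete program).
    // domain is the global N x N x N index space; gi x gj x (P/(gi*gj)) is
    // the virtual processor geometry.
    Region3 domain(1, 1, 1, N, N, N);
    const Region3 STRIP_V(1, 1, 1, gi, gj, mpNodes() / (gi * gj));
    Processors3 P(STRIP_V);          // virtual processor arrangement

    Decomposition3 T(domain);        // DOCK decomposition of the global Grid
    T.distribute(BLOCK1, BLOCK1, BLOCK1, P, SputnikTimes);  // block split per node
    T.addGhost(1);                   // one ghost layer for the 7-point stencil

    Manhattan grid(T);               // storage for the local Regions
    IrregularGrid3 rhs(T);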
d method for many current cutting-edge clusters of multiprocessors, including the IBM Blue Horizon machine at the San Diego Supercomputer Center and ASCI Blue Pacific at Lawrence Livermore National Labs [37, 62].

finely tuned implementation of MPI for running optimally on their hardware, but there are freely available versions of MPI as well, including MPICH [7].

Of further interest beyond MPI, however, was to be able to use a message-passing library that supports both heterogeneity and block decomposition and also assists in hiding most of the details of heterogeneity, especially data decomposition. KeLP is a library that supports these requirements and has its communication mechanisms built on top of MPI [10].

The experiments that led up to my combination of OpenMP (a thread library easier to use than Pthreads, programmed with compiler directives and API calls) and KeLP follow.

The first question I attempted to answer in the following experiments was whether or not the idea of repartitioning the data helped address the heterogeneous cluster. I did this by first working with a benchmark built with MPI and Pthreads combined to work on a multi-tier machine. The second question I wanted to answer was whether MPI and OpenMP, the two underlying technologies that I ultimately wanted to use, were interoperable. At the same time, I wanted to assess the scalability of OpenMP on the Origin2000. Finally, I tested KeLP and OpenMP tog
d well. In this case, I varied the number of threads on aegir from 24 up to 48. Table V.5 shows the complete list of timings, as does figure V.3. The important timings, showing only the slowest system from each run, are indicated in figure V.4. The speedup, which is very good, is shown in figure V.5 and table V.6. In this case, the results for running with 48 threads on balder and 24 threads on aegir are even slightly better than the case with 32 and 16, showing a 35.7% speedup.

As with the runs with a 32-thread maximum per system, these runs also differed somewhat from the predicted runs, but to a lesser extent. In the cases when I ran with 48 threads per system, at best, Sputnik produced results 1.67%

[Figure V.3 plot: redblack3D with 48 threads on balder and 24 threads on aegir; time (seconds); series: Original Total (balder, aegir), Original Computation (balder, aegir), Balanced Total (balder, aegir), Balanced Computation (balder, aegir), Theoretical Balanced.]
Figure V.3: Complete timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library.

better than predicted and, at worst, it produced results 1.03% worse than predicted.

V.E.3 Large numbers of threads per system

With the very large numbers of threads per Origin2000 system 
ded MPI and Pthreads, N=300; PE0 has 2 threads and PE1 and PE2 have 1 each.

III.C.2 MPI and OpenMP

Following successful results with MPI and Pthreads, I combined MPI and OpenMP in a single program. The idea, if successful, would make it easier for programmers to create multi-tier programs, since OpenMP is inherently easier to program with than Pthreads, due to its higher-level constructs. I was not certain, however, whether thread binding in OpenMP and compatibility with MPI would function properly on the Origin2000 system. Also, based on the fork-join model that OpenMP is built on (as opposed to the "parked threads" concept that I discussed earlier in relation to Pthreads), I was not sure that the MPI/OpenMP combination would work with all of the different kernels I wanted it to. A fork-join model has the threads fork at the beginning of the parallel region, declared with compiler directives, and join at the end. Thus, if the parallel region is called many times, there might be a significant amount of overhead involved in creating and cleaning up threads.

Because of the way some kernels are written, I suspected that there might be problems with this fork-join model. For instance, a tiny amount of computation in a Fortran kernel which is itself inside many nested C++ loops would cause a problem, because the machine would be inefficient due to the cost of forks and joins at the beginning of the parallel region inside the Fortran code with each itera
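To make the fork-join concern concrete, here is a minimal OpenMP fragment of the kind discussed above. It is illustrative only and not taken from the thesis code: every time control reaches the parallel region inside the outer loop, the runtime may pay fork/join overhead, which is exactly the cost that becomes significant when the region encloses only a tiny amount of work.

    #include <algorithm>   // std::swap

    // Illustrative sketch (hypothetical kernel): a Jacobi-style 1-D update
    // whose OpenMP parallel region sits inside an outer iteration loop.
    // Each pass through the outer loop forks a thread team at the
    // "parallel for" and joins it at the implicit barrier that ends the
    // loop, so when the inner work per iteration is small the fork/join
    // overhead can dominate the runtime.
    void relax(double *u, double *unew, const double *rhs,
               int n, int niter, double c)
    {
        for (int iter = 0; iter < niter; iter++) {     // outer, sequential loop
            #pragma omp parallel for                   // fork happens here each time
            for (int i = 1; i < n - 1; i++)            // inner loop split among threads
                unew[i] = c * (u[i - 1] + u[i + 1] - rhs[i]);
            // implicit join/barrier here, once per outer iteration
            std::swap(u, unew);                        // reuse buffers for the next sweep
        }
    }

Whether this overhead matters in practice depends on how much work each parallel sweep does; the scaling experiments in this chapter were designed to expose exactly that.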
e processes even need to communicate with each other. One wants to make sure, for instance, that the processes or threads are as close as possible to the others that they communicate with. Whereas a multi-tier API can ensure an order to the distribution of the processes or threads, a scattering of MPI processes cannot. Culler [42], Fink and Baden [24, 26], and Fink [23] employ higher-dimensional partitionings to solve this problem.

Finally, and most importantly, although multiprotocol MPI can use all the processors and thus make some heterogeneous clusters function as if they are homogeneous, this is only true when all processors are the same speed. If the processors are all different speeds, then merely utilizing all the processors and the shared-memory hardware is not enough. One needs to make sure that slower processors are processing less data.

Although multiprotocol MPI is reasonable for a programmer to use to quickly run existing MPI-based parallel software on a cluster of multiprocessors and realize improved performance, by starting from the ground up with a multi-tier program written in Sputnik (or migrated from KeLP to Sputnik), the programmer can avoid the problems that I have just mentioned.

III.A.2 Multi-Tier APIs

SIMPLE

SIMPLE is an API that is based on two lower-level technologies [14] and primarily addresses collective communication. One is a message-passing layer and the other is an SMP node layer. The principal req
e node has the same percentage of work to be done, relative to the total amount of work, as the power of the node has relative to the total power of the cluster. Thus,

\frac{work_i}{work_{total}} = R_i \qquad (V.9)

and so, finally, the amount of work we give to each node is

work_i = R_i \cdot work_{total}. \qquad (V.10)

In the case of round-off issues, some work chunks will have 1 added to them to make sure all data is accounted for.

Plugging this back into our original equation for predicting the new time,

T_{i,new} = T_{i,orig} \cdot \frac{\textrm{new amount of data for node } i}{\textrm{original amount of data for node } i} \qquad (V.11)

= T_{i,orig} \cdot \frac{work_i}{work_{i,orig}} \qquad (V.12)

= T_{i,orig} \cdot \frac{R_i \, work_{total}}{work_{i,orig}} \qquad (V.13)

= T_{i,orig} \cdot \frac{P_i \, work_{total}}{P_{total} \, work_{i,orig}} \qquad (V.14)

= T_{i,orig} \cdot \frac{P_i \, work_{total}}{work_{i,orig} \sum_{k=0}^{N-1} P_k} \qquad (V.15)

= T_{i,orig} \cdot \frac{(work_{i,orig}/T_{i,orig}) \, work_{total}}{work_{i,orig} \sum_{k=0}^{N-1} work_{k,orig}/T_{k,orig}} \qquad (V.16)

= \frac{work_{total}}{\sum_{k=0}^{N-1} work_{k,orig}/T_{k,orig}} \qquad (V.17)

V.D Experiments

The purpose of each of the experiments that I ran was to determine how close to the optimal time the run could come by repartitioning with the Sputnik library, regardless of the numbers of threads actually used or the size of the problem. The experiment is designed to establish, artificially, various levels of heterogeneity, in order to detect any sensitivity in Sputnik. The results show
ed that Sputnik was insensitive to this parameter, and so the degree of heterogeneity (as long as no node is less than half as fast as any other node) is not relevant.

As I have said, Sputnik can find the optimal number of OpenMP threads per system to use. Alternatively, these can be manually and individually set. Principally, I manually set the number of OpenMP threads per system so as to focus more on finding the right partition than finding the right number of threads per system. I ran with many different configurations of threads per system, however, so in effect I was able to manually try to find the optimal number of threads per system. The reason for this is that, using a system with 128-256 threads per system, one might have to do 30 or more runs to find the optimal number of threads per system, and the time on the Origin2000s was not available.

Additionally, due to scaling issues, I did not use the full number of processors on each Origin2000. By manually setting the number of threads, I created a "virtual cluster." The virtual cluster had the effect of simulating a heterogeneous cluster, since the full Origin2000 cluster could not be used and no other "commodity
edup for an SGI Origin2000 with 250 MHz processors

IV.1 Two-dimensional dataset partitioned into three equal two-dimensional blocks

IV.2 Two-dimensional dataset partitioned so that node 0 gets twice as much data to work with as either node 1 or node 2

V.1 Important redblack3D timings with 32 threads on balder and varying numbers of threads on aegir

V.2 Speedup for redblack3D with 32 threads on balder and varying numbers of threads on aegir, using the Sputnik library

V.3 Complete timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library

V.4 Important redblack3D timings with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library

V.5 Speedup for redblack3D with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library

A.1 Hierarchy of software layers for Sputnik

ACKNOWLEDGEMENTS

I would like to thank the following people and institutions for their generous support and encouragement during my life and the research and writing process of this thesis.

All my friends, especially my best friend Stephen Shore, for many enlightening discussions about life, the universe, everything, and beyond, for the past twelve years
39. efore  the solution is to partition the dataset in a way that the nodes    would each finish their runs at the same time   4  The problem of how to partition the dataset to do this was then created     5  The solution to the partitioning  I decided  was to find the relative speeds  of each multiprocessor node and calculate the fraction of the power of the  whole cluster that each individual multiprocessor node had  Then  one would  assign the same fraction of the dataset to a node as the fraction of power of  the node has in the whole cluster     Power    for a node is defined to be the  inverse of the time that a node takes to run a benchmark relative to the sum  of the inverses of the timings of the same benchmark on every node in the    cluster     III B 2 Requirements    As I have discussed  existing programs that use a hybrid of a message   passing and shared memory model as tools to program multi tier machines    do  not work effectively on heterogeneous machines  The feasibility studies I made  and how they were modified  in some cases  from existing programs  to support  heterogeneous clusters  follow in the next section    An MPI based messaging library was an obvious choice to use as the inter   node messaging component  The reason is that MPI is standardized across all of  the multiprocessor platforms  with each vendor creating their own implementation    adhering to the MPI standard  6   Not only does each vendor specifically have a        The currently recommende
40. ence and Engineering department for  five and a half great years of college and graduate education and a very good time    The San Diego Supercomputer Center  SDSC  and the National Part   nership for Advanced Computing Infrastructure  NPACI  and so many people  there  I certainly can   t list every person who has been so wonderful in their friend   ship and support  but some of the people who have been there most have been  Jeanne Aloia  Harry Ammons  Mike Bailey  Chaitan Baru  Gracie Cheney Parsons   Cheryl Converse Rath  Sandy Davey  Richard Frost  Jon Genetti  Anke Kamrath   Sid Karin  Greg Johnson  Dan Jonsson  Amit Majumdar  Yuki Marsden  Jen   nifer Matthews  Steve Mock  Reagan Moore  Arcot Rajasekar  Wayne Schroeder   Teri Simas  Allan Snavely  Ken Steube  Shawn Strande  Peggy Wagner  and Bill  Zamora  Without my first internship at SDSC doing MacSupport  I never would  have become involved with supercomputers and be where I am today    Everyone in the PCL  especially Shelly Strout for her friendship and mak   ing writing a thesis more fun  and helping to keep me sane     My fantastic professors  especially Professor Larry Carter  with his en   lightening  challenging  interactive classes that made algorithms and performance  programming fun    Finally  Professor Scott Baden  my friend and thesis advisor  who gave me  the opportunity to explore parallel and scientific computation and write a Master   s  thesis  Despite many other life experiences  I had never done any
41. ent in sending large  frequent communications over a network with high con   tention and varying latencies  He proposes that collective MPI operations  such  as MPI Reduce     might well first reduce within each SMP node  then within each  MPP  and finally across MPPs      28     Non Uniform 2 D Grid Partitioning    Crandall investigates different partitioning schemes for heterogeneous com   puting  19  20   Unlike the Sputnik API  which does a one dimensional decompo   sition  Crandall   s work suggests a multi dimensional decomposition using a variety  of different schemes including block  strip and    Fair Binary Recursive Decomposi   tion     The advantage of this work  over a plain one dimensional strip decomposi   tion  is that the cache miss rate can theoretically be improved  In a simple block  decomposition  for instance  by adjusting the dimensions of the block   s edges  one    can attempt to fit the rows  or columns  depending on the programming language    21    used  in cache  thus improving the performance by reducing the number of expen   sive cache misses occurring  Also  the amount of data that needs to be transmitted   but not necessarily the amount of sends and receives  can be constrained  Crandall  claims that this trade off can lead to an overall savings in communication time   Whereas strip decomposition might be extremely straightforward  irregular blocks  can be much more complicated  however    Crandall also works with a    decomposition advisory sy
42. er D  Dalrymple   St  Peter   s College  Oxford University  England    XIV    ABSTRACT OF THE THESIS    A Programming Model for Automated Decomposition on    Heterogeneous Clusters of Multiprocessors  by    Sean Philip Peisert  Master of Science in Computer Science  University of California  San Diego  2000  Professor Scott B  Baden  Chair    Clusters of multiprocessor nodes are becoming common in scientific com   puting  As a result of the expandability of clusters  faster nodes are frequently  added and older nodes are gradually removed  making the cluster heterogeneous   As heterogeneity increases  traditional methods for programming clusters of mul   tiprocessors become less optimal  because they do not account for the fact that  a cluster will only run as fast as the slowest node  Sputnik is a programming  methodology and software library that addresses the problem of heterogeneity on  a dedicated cluster of multiprocessors    Sputnik uses a two stage process for running applications on a cluster of  multiprocessors  The first stage assesses the relative performance of each node by  running the program individually on each node  determining from the run times  both the performance and application specific optimization  Using the timings  obtained from stage one  the second stage partitions the dataset non uniformly   according to the relative speed of each node  All future runs of the program use  the optimal partitionings and number of threads per node    Sputnik is imp
43. erent number of threads for  aegir that I wanted to test with  For aegir  I started with 16 threads and ran  again with 20  24  28 and 32  The reason that I started with 16 is that I made the  decision ahead of time that if one system was less than half as powerful as the other  that it probably would not even be worth using the slower system  Therefore  I  scaled my problem from one half of the fixed number of threads on balder up to  the same number of threads on aegir    After doing the experiments with 32 threads on balder  I ran with the  identical problem size with 48 threads on balder  this time scaling from 24 to 30   36  42 and 48  Finally  I ran just a few very large problem sizes  with up to 128  threads on balder and 96 on aegir  as I will discuss below    In the tables and graphs that follow  I include many different types of  times  Although one can compare the original and new total times  communication  plus computation  to get the most realistic    real world    times  comparing the  computation times alone shows the results better  The reason for this  as I will  show in a few limited runs  is that when times are observed when communication  is measured for homogeneously partitioned runs  the times for both nodes will be  nearly identical  The reason for this is that they need to synchronize at various  points and so both maintain a sort of lock step with each other at certain barriers   Even though the times are similar  however  they are both very high  Th
essor, which inspired this thesis because it uses distributed shared memory. Each node on the Origin2000 consists of two processors which have locally shared memory. The nodes are all connected together in a complex structure to achieve 128 to 256 processors per system; a diagram of this is shown in Figure II.3. This large system, since it has distributed shared memory, can be used to simulate a multiprocessor, although one could argue that an Origin2000 itself is really a collection of tightly coupled SMPs. Therefore, since in these test cases I used a large part of the Origin2000 and a node is only a small part of the system, in this results chapter I will refer specifically to a "system" where I have referred interchangeably to either a node or a multiprocessor in previous chapters. Likewise, since before I ran with one MPI process per multiprocessor node, here I will run with one MPI process per Origin2000 system. Unlike an SMP, this distributed shared memory system is not a UMA machine; instead, it is a NUMA derivative called cache-coherent non-uniform memory access, or ccNUMA.

I obtained my results by performing experiments on the two machines, balder and aegir, shown in Table V.1.

                                        balder / aegir
    Processor Type and Clock Speed      250 MHz MIPS R10000
    Cycle Time                          4.0 ns
    Processor Peak Performance          500 MFLOPS
    L1 Cache Size                       32 KB
    L2 Cache Size                       4 MB
    Operating System                    IRIX 6.5
    Compilers and Linkers               Native SG
45. ether to make sure that by using KeLP instead of MPI  even though  KeLP is built on top of MPI  there were no new incompatibilities of performance    bugs when used with OpenMP     HI C Maulti Tier Experiments    III C 1 MPI and Pthreads    In order to determine whether a heterogeneous API for KeLP might work   I decided to use a hand coded multi tier program in a simulated  heterogeneous  hardware environment  To do this  I used a piece of software called    Red Black  3D     hereafter referred to as rb3D or redblack3D   23  11   According to its own    documentation     The rb3D program is a 3D iterative solver which solves Poisson   s    24    equation using Gauss Seidel   s method with Red Black ordering  58      The program  is described in more detail in chapter 5    This particular implementation of the rb3D algorithm is hand coded using  MPI and Pthreads  This implementation partitions the data twice     once for the  node level and once for the processor level  MPI is used to pass messages between  multiprocessors and Pthreads are used to communicate within a multiprocessor  node via shared memory  The Pthread model is different than the one OpenMP  uses in many ways  but one important issue is what happens to the threads dur   ing the run of the program between iterations  Pthreads uses    parked threads      meaning that instead of the threads being destroyed between iterations or sections  of the program  they are temporarily parked for future use  The benefit of 
46. etowner  pMap P   proc        void setDomain const RegionX amp  D     _domain   D       DOO GOO RA Ka a       distribution methods     DOO EEK    void distribute const PointX amp  D const ProcessorsX amp  P   double   Times    void distribute ArrayIndexArguments  const ProcessorsX amp  P   double   Times     distribute  PointX ArrayIndices   P Times      void distribute const PointX amp  D  double   Times     distribute D ProcessorsX   Times      void distribute ArrayIndexArguments  double   Times     distribute  PointX ArrayIndices   ProcessorsX    Times        private      simple access functions     int procExtents int dim  const   return _Map extents  dim      int procLower int dim  const   return _Map lower  dim           distribution functions     void distributeBlock1 int dim  double   Times    void distributeBlock2 int dim            endif    B B DecompositionX C m4    B B 1 distribute    POCO o kkk kkk kkk kkk kk kkk kk kkk kk kkk k kk kkk kkk k k k k k    void DecompositionX   distribute const PointX amp  D         const ProcessorsX amp  P            73      Distribute a decomposition across the logical processor array    BAO AGAR I I ED I A A A 1 21 21 21 kkk kkk 4 24 24 24    void DecompositionX   distribute const PointX amp  D    const ProcessorsX amp  P    double   Times        initialize the _Map array     _Map resize P region       resize _Map size        int i   0   for_point_X p _Map   _Map p    i    itt     end_for  if  domainEmpty    return     for  int di
ey are clearly demonstrating the fact that a program can run only as fast as its slowest node. It is more interesting and informative, however, to look at the times for computation exclusively.

V.E Results of redblack3D

V.E.1 Up to 32 threads per system

    Threads     Original Compute        New Total               New Compute
    (aegir)     balder      aegir       balder      aegir       balder      aegir
    16          47.7        87.7        66.415      65.7        65          60.4183
    20          48.4        74.4        63.0848     63.8        61.3        58.0324
    24          51          62.7        58.4        59.5019     56          55.4033
    28          50          55.6        54.409      53.8        50.3        51.4442
    32          51.2        48.7        51.6        52.5732     48          50.5865

Table V.3: Complete redblack3D timings with 32 threads on balder and varying numbers of threads on aegir.

    Threads                 New Compute             Predicted   Speedup     Theoretical
    balder      aegir       balder      aegir                               Speedup
    32          16          65          60.4138     61.7916     1.3492      1.4193
    32          20          61.3        58.0324     58.6476     1.2137      1.2686
    32          24          56          55.4033     56.2480     1.1196      1.1147
    32          28          50.3        51.4442     52.6515     1.0808      1.056
    32          32          48          50.5865     49.9187     1.012       1.0257

Table V.4: Speedup and predicted timings for redblack3D with 32 threads on balder and varying numbers of threads on aegir.

Figure V.1, Table V.3, and Table V.4 show the results for the runs of redblack3D with 32 threads on balder and 16 to 32 threads on aegir. The important comparison is between the slowest of the original computation times and the slowest of the new computation times. For the run with 16 threads on aegir, for example, the speedup of 1.3492 in Table V.4 is the slowest original compute time, 87.7, divided by the slowest new compute time, 65; the theoretical speedup of 1.4193 divides the same 87.7 by the predicted time of 61.7916. This is because, as mentioned
48. filling Curves     January 10  1995     Baden  S  B     RedBlack 3D     1999     Ridge  D   B  Becker  P  Merkey  and T  Sterling     Beowulf  Harnessing the  Power of Parallelism in a Pile of PCs        Schopf  J  M  and F  Berman     Performance Prediction Using Intervals      UCSD CSE Dept Technical Report CS97 541  May 1997     Schopf  J  M  and F  Berman     Performance Prediction in Production Envi   ronments     UCSD CSE Dept  Technical Report CS97 558  September 1997     San Diego Supercomputer Center  SDSC   lt http   www sdsc edu  gt    SGI Cray     Models of Parallel Computation        SGI Cray     Origin2000 and Onyx2 Performance Tuning and Optimization  Guide  IRIX 6 5        lt http   techpubs sgi com library tpl cgi bin browse cgi coll   0650 amp db bks amp cmd toc amp pth  SGI_Developer Or0n2_PfTune gt      Simon  H  D  and S  H  Teng     How Good is Recursive Bisection     SIAM J   S  C   1995     Smallen  S   W  Cirne  J  Frey  F  Berman  R  Wolski  M H  Su  C  Kesselman   S  Young  M  Ellisman     Combining Workstations and Supercomputers to  Support Grid Applications  The Parallel Tomography Experience     October  12  1999     Sun Microsystems  Inc      Sun Servers      lt http   www sun com servers  gt      Sun Microsystems  Inc      UltraSPARC II Products       lt http   www sun com microelectronics UltraSPARC II index html gt      98     69  Wolski  R   N  Spring  and C  Peterson     Implementing a Performance Fore   casting System for Metacomputing  T
49. g ASCI Blue Pacific       lt http   www llnl gov asci platforms bluepac  gt      Lawrence Livermore National Labs     The ASCI Program       lt http   www 11n1 gov asci  gt       39      40      41     42    43          44    45    46          47    48    49          90    51          52    96    Grimshaw  A   et al    lt http   legion virginia edu  gt      Lenoski  D   J  Lauden  K  Gharachorloo  W D  Weber  A  Gupta  J  Hen   nessy  M  Horowitz  and M  S  Lam     The Stanford Dash Multiprocessor      Stanford University  March 1992     Lim  B  H   P  Heidelberger  P  Pattnaik  and M  Snir     Message Proxies  for Efficient  Protected Communication on SMP Clusters     Proc  Third Int  Symp  on High Performance Computer Architecture  San Antonio  TX  Feb   1997  IEEE Computer Society Press  pp  116 27     Lumetta  S  S   A  M  Mainwaring  and D  E  Culler     Multi protocol active  messages on a cluster of SMPS     in Proc  SC97  Nov  1997     Majumdar  A      Parallel Monte Carlo Photon Transport     NPACI Parallel  Computing Training  1999     May  J   B  de Supinski  B  Pudliner  S  Taylor  and S  Baden     Final Report  Programming Models for Shared Memory Clusters     Lawrence Livermore Na   tional Labs  99 ERD 009  January 13  2000     May  J  and B  R  de Supinski     Experience with Mixed MPI Threaded Pro   gramming Models     Lawrence Livermore National Labs  UCRL JC 133213     Mitchell  N   L  Carter  J  Ferrante  and K  Hogstedt     Quantifying the Multi   Level Na
50. gram estimates to  be optimal from ClusterDiscovery  decomposes the data non uniformly based    on the relative powers of the nodes     3  Runs the program one last time using the optimizations and decompositions    from ClusterDiscovery and ClusterOptimization     IV B Sputnik Model    IV B 1 ClusterDiscovery    The ClusterDiscovery performs an estimation  It searches the param   eter space  somewhat intelligently  for the available optimizations  and seeks a  performance gain in the program it is optimizing  The ClusterDiscovery works  transparently to the user  Since one of the goals of the Sputnik Model is to allow  the user to program as if the cluster is heterogeneous  the ClusterDiscovery runs  through the possible optimizations automatically and finds the best optimizations  and the timings for those runs using the optimizations  The kernel runs inside  a kind of    shell    so that Sputnik has access to running it whenever it needs to   Not only does the ClusterDiscovery save the user from manually searching for all  the optimal configuration parameters  but there is no firm limit on the amount  of permutations that can be searched since the search all happens transparently  inside its shell    Optimizations could include adjusting the number of OpenMP threads   as is done in the Sputnik API   doing cache tiling  sending vectorizable code to a  vector computer in the cluster and parallelizable code to an MPP in the cluster   and a huge number of other possible variat
gram repeatedly on each individual multiprocessor node, varying the number of threads, until the optimal number of threads per node is reached. The kernel runs in a kind of "shell" that Sputnik creates using SputnikMain(), described below.

3. Run the kernel with the optimal number of threads per node.

In order to use the Sputnik library, there are three principal changes that need to be made to the KeLP code, in addition to adding OpenMP directives. First, a new function, SputnikMain(), needs to be defined by the programmer. Second, all calls to the distribute function from the user code need to have an additional argument added. Finally, the main Sputnik function, SputnikGo(), needs to be called by the programmer from within main().

A.E.1 SputnikMain()

SputnikMain() is the code that is called over and over again while trying to find the optimal number of threads per multiprocessor node. It is not just the kernel of the program, but also everything that the program needs to call before running the kernel, such as the initialization of values in arrays. SputnikMain() returns a double; the value of that double should be the time it takes for the kernel to run. For example:

    double SputnikMain(int argc, char **argv, double *SputnikTimes)
    {
        double start, finish;

        // <declarations, initializations>

        start = MPI_Wtime();     // start timing the kernel

        // ... call the kernel function ...

        finish = MPI_Wtime();    // finish timing

        return finish - start;
    }
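Put together, the three changes look roughly like the following condensed sketch. It is not a complete program: the grid setup and the kernel are elided, the forward declarations stand in for the Sputnik.h header reproduced in Appendix B (whose listing also threads a testSMPS flag through SputnikMain(); that flag is omitted here for brevity), and the distribute() call is shown only in a comment because it belongs with the KeLP decomposition code.

    #include <mpi.h>

    // Declarations corresponding to the Sputnik interface described in this
    // appendix (see Sputnik.h in Appendix B).
    void SputnikGo(int argc, char **argv);
    double SputnikMain(int argc, char **argv, double *SputnikTimes);

    // main() only initializes MPI and hands control to Sputnik's "shell",
    // following the pattern of the redblack3D driver in Appendix C.
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        SputnikGo(argc, argv);        // calls SputnikMain() repeatedly
        MPI_Finalize();
        return 0;
    }

    double SputnikMain(int argc, char **argv, double *SputnikTimes)
    {
        // Set up the KeLP decomposition as usual; the one change is that the
        // per-node timings are passed through to distribute(), e.g.
        //     T.distribute(BLOCK1, BLOCK1, BLOCK1, P, SputnikTimes);
        // so that the partitioning can be made non-uniform.

        double start = MPI_Wtime();   // time only the kernel
        // ... run the kernel ...
        double finish = MPI_Wtime();

        return finish - start;        // SputnikGo() compares these timings
    }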
52. he Network Weather Service     Proceed   ings of Supercomputing 1997     
53. i   0     for_point_X p _Map   dimOffset   p dim    PLower     75    if  Times    NULL  amp  amp  P  gt  1            ceiling  if  i  gt          e         0     low    lse    low    ceilings i         aHigh i 1    1     domainLower  dim    ceiling   dimOffset     aHigh i    low   ceiling   1     i            else           low    domainLower  dim    ceiling   dimOffset     setlower _Map p   dim  low     setupper _Map p  dim MIN low   ceiling   1   domainUpper  dim          end_for    B C Sputnik h    void SputnikGo int argc  char  x argv    double SputnikMain int argc  char   xargv  int testSMPS     double   SputnikTimes       B D Sputnik C     include   include   include   include   include   include   include   include     define   define   define   define     lt iostream h gt    Sputnik h      lt omp   h gt    lt mpi h gt      lt stdlib h gt      lt stdio h gt      lt string h gt    lt assert h gt     def _numThreads 10   def _maxThreads 0   def _testSMPS 0   def _hetero 0     76    void SputnikGo int argc  char   argv     int numThreads   int hetero   int maxThreads   int testSMPS   int maxThr  2   numThr 2      int i  j  k    int myid  nodes    char procName  80      MPI_Comm_rank  MPI_COMM_WORLD   amp myid     MPI_Comm_size MPI_COMM_WORLD   amp nodes       if  myid    0      testSMPS   def_testSMPS   testSMPS   def_hetero   numThreads   def_numThreads   maxThreads   def_maxThreads   maxThr  0     maxThr  1   numThr  0   numThr  1     E     0  0   0  0    E     E     fo
54. ign    The API is designed so that the main   routine of a program is moved   mostly  to a user defined routine called SputnikMain    The real main   does  initialization and calls a routine called SputnikGo    SputnikGo   acts as a kind  of    shell    that calls SputnikMain   over and over to determine the optimal number  of threads per node  and make the final run with the optimized configuration  The  repartitioning  one of the primary features of the Sputnik API is a modification  of the distribution functions in the DOCK library  Although no new functions    are added  the distribution functions are mostly rewritten to support non uniform    40    partitioning     IV F Assumptions and Limitations    There are assumptions this API makes  First  it assumes that no node  is more than twice as fast as another node  This assumption also helps to ensure  that communication time does not overwhelm a particular node because it is doing  so much less computation than another node  Second  because the API is running  the entire kernel on each separate node  it assumes that the speed of the node  will not change when the node is given only a portion of the entire computational  domain to run  as is done after the repartitioning  for the final run  Finally  the  repartitioning that Sputnik does is only one dimensional  Although DOCK  and  therefore KeLP  support multi dimensional decomposition  for simplicity  Sputnik  does not    One reason to support multi dimensional decompositi
55. ings i         If the time goes up  if   timings i   gt  bestTime   amp  amp  gotit    0    a_min myid    i 4     80    a_max myid    i   gotit   1   a_gotit myid    1       keep increasing number of threads  if  i 2  gt  maxMax  amp  amp  i    maxMax     if  gotit    0        2nd highest power of 2     before maxMax  a_min myid    i 2       maxMax  a_max myid    maxMax      i   maxMax      else    1  2           Check to see if everyone   s time went up  int allgotit   0   for  j   0  j  lt  nodes  j       MPI_Bcast  void     amp a_gotit j  1   MPI_INT j MPI_COMM_WORLD     allgotit   a_gotit j      if  a_max myid     maxThr myid      a_min myid    maxThr  myid  2   gotit   1   a_gotit myid    1     minMin   a_min myid    maxMax   a_max myid    for  j   0  j  lt  nodes  j    4  MPI_Bcast  void     amp a_min j  1 MPI_INT   j MPI_COMM_WORLD     if  a_min j   lt  minMin   minMin   a_min j      81       for  j   0  j  lt  nodes  j       MPI_Bcast  void     amp a_max j  1 MPI_INT   j MPI_COMM_WORLD     if  a_max j   gt  maxMax   maxMax   a_max j      if  minMin  lt   0   minMin   1   i   minMin      Step through from the min to the max     to find the best    while  i  lt   maxMax       if  timings i     0 0       i     continue      cout  lt  lt   PE   lt  lt  myid  lt  lt     Running with       lt  lt  i  lt  lt    threads  n    omp_set_num_threads  i      timings i    SputnikMain argc  argv   testSMPS  NULL    if   timings i   lt   bestTime    amp  amp  i  lt   maxThr  myid    
56. io PU Model ese mer AA Res    a AA EA ae    34   1  Cl  sterDiscovery Line Een ta e da dt 34   2  Cluster Optimiser    ade Boe Se da UN RS di a 35   De   MO Le rt ig mn A et  Se Se nl sinh gg Sa NP awk  Des    ae 30   C  Sputnik API O a 36   EEO E E A RU he A REE 36   D  Decomposition 2 ct Se aN eee RN eee ec  US 37   1  Block Decomposition ada Wien Sa    Le ere A DR 2e 37   2  Heterogeneous Partitioning 5 Lu le di a ce  Ss 37   E APEDESI OI nee ent WO arc are Gt eh ae Ge PER te Mani 39   F  Assumptions  and Limitations         o 0a a STS 40   V Validation  Aix aoe a a a ly A ee ee a Be 41   Ae Ti OGWEHOW ds ake Sk Mee a aura ROR Sf Des oe Gi DASS A 41   Lis R   Black 3D ta ne se A A RUE RAA    41   Bl ACL ETS te  ie ere EE ADA al dl ae 42   C  Predicted Results    Shoe are eae te bos a bow dB be a 44   D  Experiments ca SE AN ee eee de Se oe ee 46   Ex Results of redblack3D su AAA A une die dk BS 48   1  Up to 32 threads per system A hat li dk 48   2  Up to 48 threads per system                      51   3  Large numbers of threads per system                   52   4  Anomalies se LU A Rd Red Leds pelo Lc de 53   VI Conclusions and Future Work tease E Ea Gif dem be 56   Contact Information s mens ai iva nan Lune io 59  Appendices   A  MUS  E CHIT Ce Fe  gs es TA 60   A  Introduction AA et Te dt ten  GNT a I a  et nt ae 60   B  Major Changes in Usage from KeLP1                   61   C  Examples of Command Line Options                    62   D  Example of Intra Node Parallel
57. ions in the entire parameter space of    possible optimizations for scientific code     39    This shell is run separately on each separate node in the cluster so that  each node is optimized individually with a distinct parameter set and timing results  are returned to the shell for those optimizations    One does not need to know anything about the characteristics of each  node in the cluster  which may or may not even be multiprocessors  prior to running    a program written using the Sputnik Model     IV B 2 ClusterOptimizer    The ClusterOptimizer uses the optimizations found in the ClusterDiscov   ery stage and the best timings of each node to decompose the computation of the  problem according to the performance of each node in the cluster  A node discov   ered to have better performance than other nodes will therefore work on a larger  chunk of the problem  Depending on the size of the problem as a whole  the cache  sizes  and the amount of communication taking place  there are a variety of differ   ent decomposition schemes available  which are described below  The important  thing  however  is to make sure that which ever decomposition scheme is used   computation must be balanced out so that each node finishes at the same time   Finally  one must make sure that after decomposition  communication does not  overwhelm computation  To the degree that the original problem does not have  this issue and we insist that no node is slower than half the speed of the fastest  
58. ism                      64   E  Sp  tnik Usage ek ite  IA RI SE Lee US 65   l  SPUR MAM   enc odes ate E Ae ke a eS 66   F  Sputnik Implementation     2 4 408 8 4 84468 e fae AES 67    vil    B Source Code to Sputnik      rt Som me A ee Boas Be hos Pati de 70    A  DecompositionX h m4 ir Ya a 70   B  Deco posto XI  oa AA D    RES    a Pew mie LA n   72   Le distributes ar ton a toe ih poe ss hee st hel dez 72   2  distributeBlock      aoaaa a 73   Es SRE ss y en e ey Me oko ete Ae vst a Su AP xis aada vee Os ot 75   P Sputnik ks oe ce ee ee tS cae OS  ae acta oe 75   C Source code to redblack3D with Sputnik                     85  Fig  DD er NU es ee ane Se Se QL he he ess bee ee bee eee ES     85   Be BSO  Gale ls e Wake infos Ie et oh  da A 87  a A ee Sh ae Bi NE    93    vill    LIST OF TABLES    II 1 Growth of connections in a crossbar switched machine             II 1 redblack3D MFLOPS rates with heterogeneous partitioning using  hand coded MPI and Pthreads  N 300  PEO has 2 processors and  PET and PE2 have each  cunde ridad ia   111 2 Table of OpenMP Scaling Test Speedup for an SGI Origin2000 with  250  MHZ  Processors     cp ay ate Se Wy oe oa ADA    V 1 Specifications for the two Origin2000 machines  balder and aegir       V 2 Software configurations    lada DA ee PO ee ke te  V 3 Complete redblack3D timings with 32 threads on balder and varying  numbers of threads  on aegir ei Late ue da ee eee ee het  V 4 Speedup and predicted timings for redblack3D with 32 threads
59. l 1 or level 2 cache of varied sizes is certainly also possible   The emphasis is  however  that the ClusterDiscovery and ClusterOptimization are  separate processes that can function in tandem easily and are not just limited to  multiprocessors  Other architectures and possible optimizations are beyond the  scope of this thesis and are not addressed in this incarnation of the API that I  have developed     Finally  I make certain assumptions about the condition of the cluster     1  Multiprocessor nodes are connected by a uniform  local  dedicated network     2  The program running has dedicated access to the cluster hardware     3  The cluster is set up to have many more processors per node  on average  than  nodes in the whole cluster  A cluster with many nodes of very few processors   including ASCI Blue Pacific  may be better off with a single tiered approach  including MPI because the shared memory aspect of Sputnik will be much    less relevant     4  No node in the cluster runs less than half as fast as any other node in the    cluster     5  The problem does not fit  in entirety  into memory cache on any of the    processors        IB Heterogeneity    I take the concept of some of the existing API   s for programming clusters  of multiprocessors one step further by directly supporting heterogeneous clusters   Acting on the assumption that clusters of multiprocessors are the immediate future  of high performance parallel computing  I decided that programming clusters 
60. lemented on top of the KeLP infrastructure to handle ir   regular decomposition and data motion  It enables code to be written for a het     erogeneous cluster as if the cluster is homogeneous  Sputnik can run scientific    XV    applications on a heterogeneous cluster faster  with improved utilization  than a  nearly identical program written in KeLP alone  Experimental results from a pair  of SGI Origin2000   s indicate that Sputnik can improve run time of an iterative    solver for Poisson   s equation by 35 percent     xvl    Chapter I    Introduction    IA Motivation    The computer hardware used for scientific and technical computing is  continually evolving  In the most recent generations of supercomputing hardware   there have been vector supercomputers  made by companies including Cray as  well as multicomputer style massively parallel processors  MPPs  made by many  companies  including Cray  IBM  Intel  Hewlett Packard and SGI  Clusters of mul   tiprocessors  however  are increasing in popularity  replacing older mainframes  As  a result of mass production of the components used  monetary costs for purchas   ing clusters of multiprocessors are dropping and therefore  use of the technology  has been spreading from business computing to scientific and technical computing   replacing vector supercomputers and multicomputer MPPs in science where they  replaced mainframes in industry    Unfortunately  although multiprocessors and multiprocessor clusters are  attractive 
61. ly finely grained  communication might be inappropriate for the current iteration of the API because  extremely tight communication imposes a kind of synchronization on the program  that might negate the speedup that Sputnik can provide    The API is packaged as a C   library and is built on top of the KeLP  infrastructure  23  35   The library allows users to write scientific programs that  run effectively on heterogeneous clusters of multiprocessors where the component  nodes may all run at different speeds  It is intended as something of a    proof of  concept    about what steps are needed to make heterogeneous clusters run more  efficiently  I also present a broader theory of which the API is merely a subset  The  broader theory  the Sputnik Model  is not limited to multiprocessors  stencil type  applications  or an just two optimization techniques    This thesis introduces a two stage process for optimizing performance on  a heterogeneous cluster  Though Sputnik has been targeted for multiprocessors   some or all of the nodes may be uniprocessors    The first stage  the ClusterDiscovery stage  performs a resource discovery  to understand how the application  or perhaps a part of the application  runs on  each individual node in the cluster  The second stage  ClusterOptimization  makes    specific optimizations based on what the first stage has discovered  Depending on    the hardware and the type of problem  there can be many possible types of opti   mizations  In this 
62. m 0 dim lt NDIM dim        switch D dim       case BLOCK1   distributeBlocki dim  Times    break    case BLOCK2   distributeBlock2 dim     break    default   break        do processor assignments     for_point_X p _Map    setowner _Map p   P p     end_for    B B 2 distributeBlock1    J K Kk k k 2k 2k 2k 2k ak ak k k k k K 2k 2K 2K 2K FK FK 2K 2K 2K 2K 2K K K K 2K 2K 2K 2K 2K 2K 2K 2K 2K K 2K K 2K K 2K 2K FK FK FK FK 2K 2K 2K 2K K 2K K 2K 2K 2K 2K 2K 2K FK      void DecompositionX   distributeBlock1 int dim                       74    In a BLOCK1 distribution  each procesor gets exactly    ceiling N P  elements  If N doesn   t divide P  this will    result in a load imbalance       BRC OO I E k k kk kkk k kkk k kkk k kkk 24 LL LL LES LL 2k k kkk kkk      void DecompositionX  distributeBlocki int dim  double   Times          int N   domainExtents  dim    int P   pExtents  dim     int PLower   pLower  dim    int dimOffset  low    int ceiling     int 1    double tTotal 0 0    double   invTimes   new double P    int   ceilings   new int P     int   aHigh   new int P      if  Times    NULL  amp  amp  P  gt  1     for  i   0  i  lt  P  i            Get the inverses of the total times     tTotal    1 0 Times  i    invTimes i    1 0 Times  i       for  i   0  i  lt  P  i          Get the ratios and even it out  if necessary     ceilings i    floor  invTimes i  tTotal  N     if  NAP    0   ceilings i     1   F     else    ceiling    N P    N P  1   N P        _distType dim    BLOCK1   
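    // Summary of the two cases above: when per-node benchmark Times are
    // supplied and P > 1, tTotal holds the sum of the inverse times and each
    // node's share along this dimension is
    //     ceilings[i] = floor((1.0 / Times[i]) / tTotal * N),
    // so a node with a smaller benchmark time (more power) receives
    // proportionally more of the N elements, padded by one when N does not
    // divide evenly.  When Times is NULL or P == 1, the ordinary uniform
    // BLOCK1 rule of ceiling(N/P) elements per processor applies instead.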
63. m partitioning  and have a previously   generated file containing the timing results prepared  the program can partition  based on any arbitrary data  The idea is that the program would be run on each  individual multiprocessor  the times would be recorded  in order  and put into the  file  Then  in the future  the program could be run without re running the resource  discovery part of the program    The results of this implementation are shown in figure 111 2 and table II 1   As can be seen  with the exception of the first trial  the results are positive and the  heterogeneous version performs better than the homogeneous version and close to  the theoretical best  Essentially  the code works by first being told that it has three  nodes and that one node can run the program twice as fast as the other two  Instead  of breaking the problem into three equal partitions  one for each multiprocessor  as  would normally happen  the program divides up the data such that node 0 gets half  of the work  since it has half of the combined processors of the cluster  and nodes  1 and 2 each get a quarter of the work  Node 1 and 2 then can work on their parts  of the workload without spawning off extra threads and can communicate with  each other and node 0 with MPI  Node 0 first spawns two threads and performs  shared memory communication with Pthreads to communicate between the two  threads and MPI to communicate with nodes 1 and 2  In this manner  the program    has been able to be sped u
64. memory may or may not  necessarily be physically shared on the hardware  This means that one processor  might be able to access the memory of a processor on a completely different ma   chine because the software has been built to allow that style of access  This is  called a non uniform memory access or NUMA    One of the principle challenges in future development of multiprocessors  is determining what an optimal configuration is  Due to the nature of the con   struction of a multiprocessor  it is possible that putting too many processors in    one multiprocessor could cause congestion on the bus  Additionally  too many    11    threads all trying to perform read and write accesses to the same place in mem   ory can create a bottleneck due to each process having to wait for other process   s    mutex locks to unlock before they can go and set their own mutex lock   21  44  45     I B Multicomputers    In parallel programs where one processor works on data and then needs  to exchange data with another processor  there must be communication between  processors  In a shared nothing architecture  distributed memory machine with   out shared memory   the only way to do this is to pass messages between the  processors  In message passing  one processor communicates with other proces   sors through a basic set of message passing primitives  including send  receive   broadcast  scatter  and gather  Using send and receive  one processor sends a mes   sage across the communications ne
multiprocessors are also sometimes called hierarchical machines or multi-tier machines because one level of communication is possible within the node (generally shared memory) and one level of communication is possible between the nodes themselves. The two tiers of processor-level and node-level communication are why they are considered hierarchical. It is certainly possible to take the model beyond two tiers as well, to three or more, by connecting a two-tiered cluster of multiprocessors to another two-tiered cluster of multiprocessors, thus creating a three-tiered cluster of multiprocessors, or a cluster of clusters of multiprocessors. Clusters of more than two tiers are outside the scope of this thesis.

Clusters of multiprocessors are attractive because the manufacturers are able to scale the systems to large numbers of processors much more easily and inexpensively than if they tried to solve the problem of networking every processor together (as in a multicomputer) or sharing memory between all processors (as in a multi

[Figure II.3: Diagram of an SGI Origin2000, showing how 4-processor building blocks are combined into 8-, 16-, 32- and 64-processor systems. Courtesy of SGI's "Origin2000 and Onyx2 Performance Tuning and Optimization Guide, IRIX 6.5."]

[Figure II.4: Diagram of a cluster of symmetric multiprocessors: SMP Nodes 0 through 3 connected by a network hub or switch.]
[Figure II.1: Diagram of a multiprocessor: Processors 0 through 3, each with its own L1 and L2 cache, connected by a memory bus to a shared main memory.]

sults presented in this thesis might be considered a hybrid between the multiprocessor and the multicomputer because it uses distributed shared memory. Second, the multiprocessors that compete in today's market with modern multicomputers are typically built partially from commodity parts for the purpose of reducing cost by means of increased economies of scale. A recent multiprocessor incarnation from IBM, for instance, forms the basis of each node comprising the ASCI Blue Pacific machine at Lawrence Livermore National Labs and Blue Horizon at the San Diego Supercomputer Center [37, 62]. These machines use hundreds of IBM's PowerPC processors, a chip family that powers all modern Apple Macintosh computers [33, 37, 38]. Similarly, the chip inside Sun's Enterprise servers, the Sparc, also powers every Sun workstation that comes off the line [67, 68]. The large IBM systems, despite being massively parallel, are still essentially large clusters of multiprocessors. As discussed below, many multicomputers are built using expensive, specialty communications hardware significantly more complex than that of a multiprocessor.

In a multiprocessor, parallel programs can typically be run either using message passing across the bu
67. nMP  Further  Sputnik modifies the existing  KeLP distribution by adding new partitioning algorithms to the distribute and  distributeBlock1 functions in the DOCK DecompositionX class  The goal of the    60    61    r    User Program           OpenMP    Figure A 1  Hierarchy of software layers for Sputnik    API  designed to be consistent with the original KeLP1 goals of making scientific  program easier while making the overall performance greater has resulted in an  API that requires only a few minor modifications to an existing KeLP1 program  to work in Sputnik  A good progression for writing a Sputnik program would be  to first write the serial program  then modify it to be a KeLP1 program  then add  OpenMP directives  and then finally modify the KeLP1 program to be a Sputnik  program using the Sputnik API calls     A B Major Changes in Usage from KeLP1    The two new routines are as follows  which are described in detail later    in this chapter      void SputnikGo int argc  char   argv     double SputnikMain int argc  char   xargv  double   SputnikTimes     Sputnik accepts more arguments than those that may just be passed into  the application or to MPI  A typical MPI program  called mpiprog in this example   is started like this on four nodes  mpirun  np 4 mpiprog     Sputnik takes four additional arguments     62    1  testSMPS  If equal to the integer    1     this tells the program that it should  pay attention to the    maxthreads    argument  ignore the    numthre
68. nly run as fast as the slowest component node   Thus  to the degree that any node finishes early and idles while waiting  we will  under utilize the hardware and nodes will be forced to idle rather than spending  their time productively computing  Ideally  therefore  we want all nodes to finish  at the same time    Though the techniques for programming multiprocessors and homoge   neous clusters of multiprocessors have been explored in detail and have achieved  some level of sophistication  programming heterogeneous clusters of multipro   cessors for use with scientific computation is still a difficult challenge  8  9  10      11   12  13  14  23   24   25   26   27   35   42   Existing software technologies may be    used to program heterogeneous clusters of multiprocessors  however  the process  of doing so and still achieving good performance through load balancing can be  extremely difficult    The goal of my research presented in this thesis is to investigate a way  to enable scientific programs to run faster and effectively utilize a heterogeneous  cluster of multiprocessors while allowing the user to write the program as if they  were running on a homogeneous cluster    This thesis introduces an API called Sputnik designed to assist in greater  performance and utilization on a heterogeneous cluster for certain types of appli   cations  Current applications for which Sputnik has been proven to function well  with include stencil based programs  Applications with extreme
69. node  there should not necessarily be any inherent restrictions on which type of    decomposing to use to partition the data for the heterogeneous cluster     IV B 3 Running    Given the optimal parameters and partitioning  we know enough to run  the program on a heterogeneous cluster and can expect it to utilize the cluster  and perform significantly better  Because results of ClusterDiscovery are saved on    disk  future runs of the program on the same cluster will not have to    re discover       36    the cluster each time and can simply run with the optimal settings     IV C Sputnik API    IV C 1 Goals    The Sputnik API is based on the Sputnik Model  It implements a specific  subset of the ideas generalized in the Model     1  ClusterDiscovery  Runs the kernel of the program repeatedly on each sep   arate node to determine the timings and relative performance  varying the    threads per node given to determine the optimal number of threads per node     2  ClusterOptimization  Using the parameters which the program estimates  to be optimal from ClusterDiscovery  it decomposes the data non uniformly    based on the relative powers of the nodes     3  Runs the program one last time using the optimizations and decompositions    from ClusterDiscovery and ClusterOptimization     The first optimization that Sputnik performs is the determination of the  optimal number of OpenMP threads per node to run with  The optimal number  of threads might be equal to the number of processo
70. of  multiprocessors was a problem worth investigating    I decided that I could build my API on top of either two existing API   s      one that could handle message passing between nodes and one that could handle  shared memory communication within a node     or one existing API that already  supported multi tier communication    In choosing a basis for my API  I decided to use KeLP1 as my message   passing layer because  unlike MPI and PVM  it has built in support for data de   scription and general blocked decomposition as well as transparent message passing  communication  This makes the job of the programmer using the API much easier  for complex parallel programming  23  35   Such levels of abstraction have been    shown to come without a performance penalty  11   The question then was what    technology would be best used to take advantage of the shared memory architec   ture for intra node communication  I use OpenMP as my shared memory layer to  handle intra node parallelization because at the moment  it is the easiest technol   ogy to use and is an emerging standard as an alternative to Pthreads  My API is  built on top of KeLP and OpenMP  KeLP itself is built on top of MPI  OpenMP   can be based on a variety of sub technologies  depending on the particular vendor   s  implementation  The Origin2000  which I run on  uses SGI   s    sprocs     though the  IBM SP systems build OpenMP on top of Pthreads  48     Hopefully by extending KeLP1   s API instead of simply MPI
71. og  testSMPS O  hetero 1  numthreads 10    e Sputnik runs on 2 nodes with exactly 10 threads per node with hetero     geneous decomposition based on the times stored in the     stats    file     7  mpirun  np 2 mpiprog  testSMPS O  hetero 1     numthr0 10  numthri 20    e Sputnik runs on 2 nodes with exactly 10 threads on node 0 and 20  threads on node 1 with heterogeneous decomposition based on the times    stored in the     stats    file     A D Example of Intra Node Parallelism    Additionally  OpenMP directives must be put around the loops that the  programmer wishes to parallelize  The directives function either in C C   or  Fortran kernels and can be as simple as those used in this code from a KeLP    version of Red Black 3D       OMP PARALLEL PRIVATE jj ii k j i jk   do jj   uli 1  uhi 1  sj  do ii   ul0 1  uh0 1  si    OMP DO SCHEDULE  STATIC   do k   ul2 1  uh2 1  do j   jj  min jj sj 1 uh1 1   jk   mod j k 2   do i   iitjk  min ii jk si 1 uh0 1   2  u i j k    c    2   u i 1 j k    u i 1 j k      u i j 1 k     3 u i j 1 k      u i j k 1    u i j k 1     4 c2 rhs i j k       65    end do  end do   end do    OMP END DO   end do    end do      OMP END PARALLEL    The code  above  scales well and with the OpenMP directives in place    can show parallelism well  depending on the size of the problem     A E Sputnik Usage    Sputnik works in this manner     1  Initialize MPI and KeLP    2  Set the number of threads on each multiprocessor node and run the kernel of  the pro
72. on  since  half the number of processors are working on the problem    As can be seen from the charts in Figure III 5  figure II 6  and table 111 2   for at least up to 8 threads  the kernel scales very well  but for this size problem   does not improve significantly with 16 or 32 threads  Scaling is good but not  great  but MPI and OpenMP are shown by the results to interoperate without any  problems    After I ran experiments to determine whether MPI and OpenMP as well  as C   and Fortran OpenMP directives would coexist and function correctly  I  set out to obtain timings to see how well OpenMP would parallelize scientific code    to see if OpenMP would be close to Pthreads in terms of efficiency  since it already    29    for  j   64  j  gt  0  j j 2       if  myid    0     omp_set_num_threads j     start   MPI_Wtime         pragma omp parallel shared numthreads        if  0    omp_get_thread_num      numthreads   omp_get_num_threads         pragma omp for schedule static     for i   0  i  lt  LONG  i       arr i    arrli 1    i   1 0001      finish   MPI_Wtime        else    start   MPI_Wtime     numthreads   omp_get_num_threads      for i   0  i  lt  LONG  i     arr i    arr i 1    i   1 0001   finish   MPI_Wtime        cout  lt  lt   PE    lt  lt  myid  lt  lt    threads    lt  lt  numthreads   lt  lt    time    lt  lt  finish   start   lt  lt     j    lt  lt  j  lt  lt    n     MPI_Barrier  MPI_COMM_WORLD       Figure 111 4  OpenMP scalability test code    0 2    OpenM
73. on is if Sputnik will  be running on a cluster with a huge number of nodes so that memory and cache  tiling could be built into the decomposition and message passing communication  could be made automatically more efficient  Since the initial release of the API has  focused on much smaller clusters with 2 4 nodes  I assumed that one dimensional  decomposition was adequate and the possible downsides would be negligible    The Sputnik API also requires that the application is written in C    at  least as a wrapper  though the kernel s  of the program may be written in either  C  C    or Fortran and linked in  Finally  Sputnik depends on the fact that  the cluster has a thread safe implementation of MPI installed as well as OpenMP  for both Fortran and C C   and all technical requirements for both MPI and  OpenMP programs to run are adhered to    The validation of these results and of the model appear in the next chap     ter     Chapter V    Validation    V A Introduction    Redblack3D was the program I chose to run on a pair of SGI Origin2000  supercomputers at the National Center for Supercomputing Applications  NCSA   to determine the success of the performances aspect of the Sputnik API  48  47    Whether it succeeds also in its goal of being able to allow the user to program a  heterogeneous cluster as if it is homogeneous is much harder to measure  although  there are not very many changes that have to be made  and I estimate that someone  familiar with OpenMP could make 
74. ould be to recognize this and direct  the ClusterOptimizer to divide not necessarily just the data or the computation  as a whole but to separate the problem into specific tasks that could be assigned  to each unique hardware architecture according to the machines    specialties    I would like the readers of this thesis to bring away with them  the fol     lowing points     1  Sputnik is an API which demonstrates that heterogeneous clusters of multi   processors need not be difficult to program  As clusters of multiprocessors  appear to be the near term future for supercomputing  ways are needed to    address the evolution of these machines  Sputnik is one of these ways     2  Sputnik is an API which demonstrates that programs need not waste any  available processing power on a heterogeneous cluster of multiprocessors  By  adjusting the number of threads per multiprocessors and repartitioning the  problem so that each multiprocessors node is time balanced  the program    can run as if the entire cluster was homogeneous     3  Sputnik can be extended in the future  as I intend to do myself in many ways  in future research with the San Diego Supercomputer Center  including  but  not limited to     a  Running on a variety of vendor platforms    b  Performing dynamic optimizations      c  Running on clusters of varying architectures  including vector machines  and MPPs   not just varying speed  memory  network  or cache size of    multiprocessors      d  Performing varying types
75. p via heterogeneous partitioning        Original   Balanced   Theoretical   Balanced Speedup   Theoretical Speedup  136 978   173 233   182 637333 26 47  33 33                             Table 111 1  redblack3D MFLOPS rates with heterogeneous partitioning using  hand coded MPI and Pthreads  N 300  PEO has 2 processors and PET and PE2    have 1 each     To analyze what the theoretical best possible speedup might be through    heterogeneous decomposition  consider N to be the total amount of data in the    26    problem  In the homogeneous run  each processor gets N 3 data  Node 0 finishes  in time T while nodes 1 and 2 finish in time 2 x T because they only run half as  fast  So two processors are wasted for a time of T  If the data is repartitioned  optimally  so that node 0 gets N 2 data and nodes 1 and 2 get N 4 data each  then  the time that each node would take is Tx N 2   N 3     3 2 T  assuming optimal  efficiency  the equations for obtaining the new times are described in chapter 5    Although the timings did not indicate optimal efficiency  they approached it and at  least gave indications that not only that MPI and Pthreads worked well together     but more importantly  that the repartitioning concept is valid     REDBLACK 3D   MPI PTHREADS  2005       180        160     140     1207    MFLOPS  a  o       40        Original  Balanced  Theoretical Balanced                                           Figure 111 2  redblack3D with heterogeneous partitioning using hand co
76. portTimings times timesLoc   reps  chk_freq  niter  N   si  sj  gi  gj       catch  KelpErr  amp  ke     ke abort          return  middle      Bibliography     1                 Alpern B   L  Carter  J  Ferrante     Modeling Parallel Computers as Memory  Hierarchies     1993 Conference on Programming Models for Massively Parallel  Computers     Alpern  B   and L  Carter      Towards a Model for Portable Performance   Exposing the Memory Hierarchy     1992     Alpern  B   L  Carter  E  Feig  and T  Selker     The Uniform Memory Hierarchy  Model of Computation     IBM Watson Resarch Center  November 2  1992     Ammon  J      Hypercube connectivity with CC NUMA architecture     SGI   Cray White paper  1999     Anglano  C   J  Schopf  R  Wolski  and F  Berman     Zoom  A Hierarchical  Representation for Heterogeneous Applications     UCSD CSE Technical Re   port CS95 451  January 5  1995     Argonne National Laboratories     MPI   The Message Passing Interface Stan   dard      lt http   www unix mcs anl gov mpi  gt      Argonne National Laboratories     MPICH A Portable Implementation of  MPI      lt http   www unix mcs anl gov mpi mpich  gt      Baden  S  B      Tradeoffs in hierarchical algorithm design on multi tier archi   tectures        Baden  S  B      Programming Abstractions for Dynamically Patitioning and  Coordinating Localized Scientific Calculations Running on Multiprocessors      SIAM J  Sci  Stat  Comput   Vol  12  No  1  pp 145 157  January 1991     Baden  S  B   
77. pow   erful class of machines for parallel programmers to use the KeLP style technology  on    Whereas KeLP1 was an interface on top of the MPI  6  technology un   derneath  KeLP2 was an interface on top of both MPI and Pthreads  as shown  in Figure III 1  As described previously  the idea was that although MPI could  be used for both inter  and intra node communication  in theory  this is not the  optimal communication method because MPI does not utilize the shared memory  hardware  Instead the designers of KeLP2 decided that MPI would be used for  inter node communication and Pthreads would communicate within an SMP node  using fast  fine grained  shared memory accesses that Pthreads are particularly  good for  KeLP2 adds support the multi tier nature of a cluster of SMPs  if the  SMP nodes are different  however  one sees inefficiencies  The program will only  run as fast as the slowest node and so if one node finishes in 13 seconds and another  in 5  the program will take 13 seconds to run    Although KeLP2 is a very good multi tier implementation  at the time  of writing the Sputnik API  it was not running on my platform of choice  the  Origin2000  so I built my Sputnik API on top of KeLP1  MPI  instead of KeLP2   multi tier with MPI and Pthreads      20    User Program    KeLP2        PThreads    Figure 111 1  Hierarchy of software layers for KeLP2    III A 3 Heterogeneity Work  Grid Enabled MPI    In Grid Enabled MPI  Foster attempts to reduce communication time  inher
78. processor.

The principal downside of a cluster of multiprocessors is that they are difficult to program. Memory within a node is only "shared" between processors within a given multiprocessor node. Therefore, a shared memory program cannot be used across the entire cluster. Message passing can be used across the entire cluster, but this does not take advantage of the unique shared memory hardware that exists within a multiprocessor, which is what often makes communication in a multiprocessor fast. So message passing throughout the entire cluster is not optimal either.

The solution currently being adopted to program clusters of multiprocessors is to use a dual-tier approach of combining message passing and shared-memory programming, as if blending multicomputer and multiprocessor programming techniques, respectively. In this manner, within a multiprocessor, intranode communication can be achieved through shared memory (Pthreads, OpenMP), and between multiprocessor nodes (internode), communication can be achieved through message passing, e.g., MPI and PVM [15, 50, 52, 51, 6]. One can either hand-write the code that combines one method from each paradigm or use a library that specifically supports multi-tier machines, including KeLP2 or SIMPLE [35, 23, 14]. Since programming a piece of software with MPI in mind can easily double the size of the code, and programming and debugging in Pthreads can be a challenging task as well, the prospect of
79. processor to every other processor directly. As the number of processors grows, the number of connections grows by

\[ \frac{p(p-1)}{2}. \tag{II.1} \]

Number of Processors | Number of Connections
2                    | 1
3                    | 3
4                    | 6
5                    | 10
6                    | 15
7                    | 21

Table II.1: Growth of connections in a crossbar-switched machine.

II.C Clusters of Multiprocessors

A cluster of multiprocessors is the composite of two distinct hardware technologies for building parallel machines: multicomputers and multiprocessors.

A cluster of multiprocessors is a possible solution to the lack of expandability of a lone multiprocessor and the expensive nature of expanding a multicomputer. This solution has been adopted in high-end server lines and even supercomputers from nearly all major computer hardware manufacturers, including Sun, IBM, HP, and Compaq/DEC. SGI has created a machine called the "Origin2000," which uses shared memory but has a very advanced hypercube interconnection structure to support scalability of up to 256 processors per Origin2000. An image of what the Origin2000's architecture looks like can be seen in Figure II.3. Some of the computers with the highest theoretical peak speed in the world, such as ASCI Red at Sandia National Labs (Intel), ASCI Blue Pacific at Lawrence Livermore National Labs (IBM), and the Blue Horizon machine at the San Diego Supercomputer Center (IBM), are all essentially clusters of multiprocessors [33, 62].

Clusters of
80. puter Science and Engineering, University of California, San Diego
 ************************************************************************/

#include "Sputnik.h"
#include "j3D.h"
#include "XArray3.h"
#include "Grid3.h"
#include "Mover3.h"
#include "timer.h"
#include "manhattan.h"
#include <omp.h>

extern int testSMPS;

void cmdLine(int argc, char **argv,
             int& L, int& M, int& N, double& eps, int& niter,
             int& chk_freq, int& reps, int& si, int& sj, int& gi, int& gj);

void ReportTimings(double *times, double *timesLoc,
                   int reps, int chk_freq, int niter, int N,
                   int si, int sj, int gi, int gj);

void InitGrid(IrregularGrid3& grid)
{
   grid.fill(1.0);
   grid.assignGhost(0.0);
}

void ComputeLocal(IrregularGrid3& grid, int si, int sj, const int color,
                  IrregularGrid3& rhs);

/************************************************************************
 * main()
 * main() takes one argument, N, the number of points
 * along each axis.
 ************************************************************************/

int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);
   SputnikGo(argc, argv);
   MPI_Finalize();
   return (0);
}

double SputnikMain(int argc, char **argv, int testSMPS, double *SputnikTimes
81. for (int arg = 1; arg < argc; arg++) {
      if (!strcmp("testSMPS", argv[arg]))
         testSMPS = atoi(argv[++arg]);
      else if (!strcmp("hetero", argv[arg]))
         hetero = atoi(argv[++arg]);
      else if (!strcmp("numthreads", argv[arg]))
         numThreads = atoi(argv[++arg]);
      else if (!strcmp("maxthreads", argv[arg]))
         maxThreads = atoi(argv[++arg]);
      else if (!strcmp("maxthr0", argv[arg]))
         maxThr[0] = atoi(argv[++arg]);
      else if (!strcmp("maxthr1", argv[arg]))
         maxThr[1] = atoi(argv[++arg]);
      else if (!strcmp("numthr0", argv[arg]))
         numThr[0] = atoi(argv[++arg]);
      else if (!strcmp("numthr1", argv[arg]))
         numThr[1] = atoi(argv[++arg]);
   }

   cout << "INPUTS:" << endl;
   cout << "\ttestSMPS = " << testSMPS << endl;
   cout << "\thetero = " << hetero << endl;
   cout << "\tnumthreads = " << numThreads << endl;
   cout << "\tmaxthreads = " << maxThreads << endl;
   cout << "\tmaxthreads for SMP 0 = " << maxThr[0] << endl;
   cout << "\tmaxthreads for SMP 1 = " << maxThr[1] << endl;
   cout << "\tnumthreads for SMP 0 = " << numThr[0] << endl;
   cout << "\tnumthreads for SMP 1 = " << numThr[1] << endl;

   for (i = 0; i < nodes; i++) {
      k = MPI_Get_processor_name(procName, &j);
      if (!k)
         cout << "\tname of SMP " << i << ": " << procName << endl;
   }

   if (testSMPS == 1 && maxThreads > 0) {
      cout <<
82. rest, the fastest node, is wasting time while it waits for the slower nodes to finish. Thus the wasted time can be computed as follows, where N is the total number of nodes and the times for the nodes are ranked from 0 to N-1:

\[ T_i = \mathrm{TimeOnNode}_i \tag{V.2} \]

\[ T_{wasted} = \sum_{i=0}^{N-1} \left( \max(T) - T_i \right) \tag{V.3} \]

Therefore, the solution is to repartition the data. After optimally repartitioning, the program will run in a time in between that of the faster and slower nodes:

\[ T_{optimal} = \frac{\text{new amount of data for node } i}{\text{original amount of data for node } i} \cdot T_{i,orig} \tag{V.4} \]

Therefore, the speedup will be

\[ \mathrm{Speedup} = \frac{\max(T)}{T_{optimal}}. \tag{V.5} \]

This speedup is based on a repartitioning of the dataset. Where in a homogeneously partitioned run each node would be assigned 1/N of the work, in a heterogeneously partitioned run the work assigned is a bit more complex. If the node originally ran in time T_{i,orig}, then first we want to find out what the speed of this node is relative to all the others and to the total. We can do this by assigning a power P to the node equal to the inverse of the time. We can use this then to produce a power ratio R:

\[ P_i = \frac{1}{T_{i,orig}} \tag{V.6} \]

\[ P_{totalcluster} = \sum_{k=0}^{N-1} P_k = \sum_{k=0}^{N-1} \frac{1}{T_{k,orig}} \tag{V.7} \]

\[ R_i = \frac{P_i}{P_{totalcluster}} = \frac{1/T_{i,orig}}{\sum_{k=0}^{N-1} 1/T_{k,orig}} \tag{V.8} \]

The goal of the software is to assign a percentage of the total work of the problem to a node such that th
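As a concrete illustration of equations (V.6) through (V.8), here is a minimal C++ sketch, not the Sputnik source, that computes each node's power ratio from its original timing and therefore the fraction of the data it should receive; the function name and structure are hypothetical. With the Chapter III example of one node running twice as fast as the other two, it reproduces the N/2, N/4, N/4 split.

#include <vector>
#include <cstdio>

// timesOrig[i] is T_{i,orig}: the time node i took on the uniform partition.
std::vector<double> workShares(const std::vector<double>& timesOrig)
{
    double powerTotal = 0.0;                    // P_totalcluster = sum of 1/T_{k,orig}
    for (double t : timesOrig) powerTotal += 1.0 / t;

    std::vector<double> shares;                 // R_i = (1/T_{i,orig}) / P_totalcluster
    for (double t : timesOrig) shares.push_back((1.0 / t) / powerTotal);
    return shares;
}

int main()
{
    // Node 0 runs twice as fast as nodes 1 and 2 (times 1, 2, 2).
    std::vector<double> shares = workShares({1.0, 2.0, 2.0});
    for (size_t i = 0; i < shares.size(); i++)
        std::printf("node %zu gets %.3f of the data\n", i, shares[i]);  // 0.500, 0.250, 0.250
    return 0;
}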
83. rs in the node, but may be less if the problem is not large enough to warrant the overhead and costs of shared-memory communication for that many threads. More threads would not speed up communication and may in fact slow it down, because two or more threads would be competing for one processor's time and memory.

The second optimization that Sputnik performs is decomposition. One of the appealing aspects of using KeLP, aside from the fact that it is based on MPI, which has a standardized interface across all major parallel platforms, is that it supports block decomposition.

IV.D Decomposition

IV.D.1 Block Decomposition

Block decomposition allows for manipulating certain types of data in a much easier way than with standard C++ datatypes. Also, rather than describing data in terms of C++ arrays and having to program complex MPI communications, in KeLP one can describe the domain of the data, called a Grid, and multidimensional blocks within the domain, called Regions. Rather than forcing the application programmer to write long sequences of loops to pass ghost cells (boundary conditions between the blocks) between processors, KeLP handles this automatically, with built-in functions to expand and shrink any given Region. Further, a Region can be intersected with another Region, and since it is merely a subset of the overall Grid, it can overlap with another Region and cover a domain which may as a whole be very irregular, with neat, individual
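To make the region-calculus idea above concrete, here is a small, self-contained C++ sketch. It is not the KeLP API; Block3 and its methods are hypothetical stand-ins for the grow/shrink and intersection operations that KeLP's Regions provide, used here to compute the ghost cells one block needs from a neighbor.

#include <algorithm>
#include <cstdio>

// Hypothetical stand-in for a KeLP-style Region: a rectangular block
// [lo, hi] in 3D index space.
struct Block3 {
    int lo[3], hi[3];

    Block3 grow(int g) const {                  // expand by g cells on every face
        Block3 r = *this;
        for (int d = 0; d < 3; d++) { r.lo[d] -= g; r.hi[d] += g; }
        return r;
    }
    Block3 intersect(const Block3& o) const {   // overlap of two blocks
        Block3 r;
        for (int d = 0; d < 3; d++) {
            r.lo[d] = std::max(lo[d], o.lo[d]);
            r.hi[d] = std::min(hi[d], o.hi[d]);
        }
        return r;
    }
    bool empty() const {                        // true if the block has no cells
        for (int d = 0; d < 3; d++)
            if (lo[d] > hi[d]) return true;
        return false;
    }
};

int main() {
    Block3 mine  = {{ 1, 1, 1}, { 50, 50, 50}};
    Block3 yours = {{51, 1, 1}, {100, 50, 50}};
    // The ghost cells this block needs from its neighbor are the overlap
    // of the grown block with the neighbor's block.
    Block3 ghost = mine.grow(1).intersect(yours);
    if (!ghost.empty())
        std::printf("ghost slab spans x = %d..%d\n", ghost.lo[0], ghost.hi[0]);  // 51..51
    return 0;
}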
84. s are used on aegir. This speedup is good, though still 7% less than the theoretical best for the thread configuration with 32 threads on balder and 16 threads on aegir.

The "New Compute" timings for each system should be close to identical. As the timings from Table V.3 indicate, the speeds are not perfectly identical. The reasons for this are not completely clear, but there are a number of possible explanations. First, speed on the Origin2000 depends highly on thread scheduling and memory placement, both of which seem to be massive issues affecting performance and can vary widely even within the run of a program, as in this case. Since memory is stored throughout the entire machine within the two-processor nodes, a poor distribution could certainly affect the timings and cause the kind of variation that we see in Table V.3. Because these timings were done in a dedicated environment, competition for processor, memory, or network bandwidth is not an issue.

[Figure V.2 (plot): measured and theoretical speedup versus number of threads on aegir, 16 to 32.]
Figure V.2: Speedup for redblack3D with 32 threads on balder and varying numbers of threads on aegir, using the Sputnik library.

Based on the equations presented above, we can predict, from the original timings and the number of threads per system that we are using, the theoretical or predicted time for a re-par
85. Acknowledgements ................................................ xii
Vita, Publications, and Fields of Study ............................ xiv
Abstract of the Thesis ............................................. xv

I   Introduction ...................................................  1
    A. Motivation ..................................................  1
    B. Heterogeneity ...............................................  5
    C. Organization of the Thesis ..................................  6

II  Clusters of Multiprocessors ....................................  8
    A. Multiprocessors .............................................  8
    B. Multicomputers .............................................. 11
    C. Clusters of Multiprocessors ................................. 13

III Heterogeneous Multi-Tier Programming ........................... 16
    A. Previous Work ............................................... 16
       1. Multi-Protocol MPI ....................................... 17
       2. Multi-Tier APIs .......................................... 18
       3. Heterogeneity Work ....................................... 20
    B. Heterogeneous Multi-Tier Programming ........................ 21
       1. Problem .................................................. 21
       2. Requirements ............................................. 22
    C. Multi-Tier Experiments ...................................... 23
       1. MPI and Pthreads ......................................... 23
       2. MPI and OpenMP ........................................... 27
       3. KeLP and OpenMP .......................................... 32

IV  The Sputnik Model and Theory of Operation ...................... 33
    A. Introduction ................................................ 33
    B.
86. s that connects the processors, or by interacting using their shared memory. The latter is often accomplished by way of a threads library such as Pthreads (short for POSIX Threads, a POSIX-compliant thread library) that supports mutual exclusion on pieces of memory: one thread can hold exclusive access to a piece of memory while it is writing to it, so that other threads see only the final value once the thread holding the lock has finished writing [6, 15, 50, 51, 52]. The API calls for Pthreads are starkly different from MPI. Instead of "communication primitives," there are commands that do thread manipulation (e.g., create, fork, and join) and commands that do memory "locking" so that no more than one thread has access to a critical section of code simultaneously (e.g., pthread_mutex_lock and pthread_mutex_unlock). A program running on a multiprocessor using shared memory can often achieve significantly better timing results because it is specifically using the shared memory hardware that a multiprocessor is designed to use, rather than a congested processor bus.

It is possible to simulate message passing with shared memory and also possible to simulate shared memory with message passing. Machines that are called distributed shared memory machines are often examples of the latter. A distributed shared memory machine is one that would have a library that would allow "shared memory" style calls even though the
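For readers unfamiliar with the Pthreads calls named above, the following toy C++ example, which is not taken from the thesis code, shows thread creation and joining plus mutex locking around a critical section.

#include <pthread.h>
#include <cstdio>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void*)                    // each thread increments the shared counter
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     // only one thread in the critical section
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}

int main()
{
    pthread_t threads[4];
    for (int t = 0; t < 4; t++)
        pthread_create(&threads[t], nullptr, worker, nullptr);
    for (int t = 0; t < 4; t++)
        pthread_join(threads[t], nullptr);    // wait for all threads to finish
    std::printf("counter = %ld\n", counter);  // 400000
    return 0;
}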
87. saging protocols depending on where the message is in the machine. If the message needs to be passed within an SMP, for instance, the message is "passed" using shared memory. Within an MPP, the message is passed using a native messaging layer. Outside the MPP, between MPPs, the message is passed using TCP/IP. Lumetta finds that the shared memory protocol can achieve five times the bandwidth of the networking protocol, for instance.

While multi-method or multi-protocol communication, especially when running across MPI, can theoretically run effectively on a heterogeneous cluster, there is also a strong possibility that it may underperform an API that has been specifically designed in a multi-tier fashion. The reason for this is that although the multiprotocol-capable API can use all of the processors and even take advantage of whatever special shared memory and networking hardware exists in the cluster, it is entirely possible that the runtime system will distribute the MPI processes in a nonoptimal manner that requires significantly more communication than is necessary. An API that has been specifically designed to partition on two levels, as the machine itself is built on a hardware level, can specify exactly how all the processes should be arranged. For example, the MPI processes may be scattered throughout the entire cluster when the program executes. Some processes might be able to share memory within the cluster, but there is no guarantee that thos
88. stem, which functions like Sputnik in some respects, choosing an optimal decomposition system based on pre-known aspects of the computational demands.

Zoom

Anglano, Schopf, Wolski, and Berman investigated a method for describing heterogeneous applications in terms of structure, implementation, and data. The motivation is that not every machine in existence (e.g., multicomputers, vector supercomputers, multiprocessors) is adept at solving all problems. The Zoom representation attempts to solve this problem by allowing the user to abstract the program such that each particular segment in the abstraction can be sent to the machine, or class of machine, that it is best suited for [5]. Such a technology would be a fantastic improvement to Sputnik in the future. See the Proceedings of the Heterogeneous Computing Workshop (1995) and the Proceedings of the 1992 Heterogeneous Workshop (IEEE CS Press) for more information on related technology.

III.B Heterogeneous Multi-Tier Programming

III.B.1 Problem

1. As discussed previously, the goal of my research, presented in this thesis, is to find a way to make scientific programs run faster and improve utilization on heterogeneous clusters of multiprocessors and still allow the user to write the program as if they were running on a homogeneous cluster.

2. The problem is that they currently have imbalanced loads if running homogeneous partitions on heterogeneous multiprocessor nodes.

3. Ther
89. tart.

Essentially everything that was in main() can now be in SputnikMain(), with the addition of timing calls. A bare-bones main() might now look like this:

int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);     // Initialize MPI
   InitKeLP(argc, argv);       // Initialize KeLP

   // Call Sputnik's main routine, which in turn will
   // then call SputnikMain()
   SputnikGo(argc, argv);

   MPI_Finalize();             // Close off MPI
   return (0);
}

The call that sets the number of threads per node to use is actually made in SputnikGo(), and so does not need to be used by the programmer. The number of threads actually being used in a loop should be tested by using the OMP_GET_MAX_THREADS() call. In this way, the programmer can determine whether OpenMP is doing a good job of parallelizing the kernel that the programmer would like to speed up. A number of factors can affect OpenMP's parallelization, including striding of loops and data dependencies.

A.F Sputnik Implementation

The overall process of what Sputnik does has been discussed previously in this thesis; I will discuss the specifics here. Sputnik has many command-line options, but in the following section I will discuss just the part where we test the strength of the multiprocessor nodes (testSMPS = 1), which is the most important and unique part of Sputnik's functionality.

After SputnikGo() is called, probably from within main(), the program follows something similar
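As a quick illustration of that thread-count check (OMP_GET_MAX_THREADS is the Fortran spelling; in C and C++ the calls are omp_get_max_threads() and omp_get_num_threads()), here is a small stand-alone snippet. It is illustrative only and not part of the Sputnik source.

#include <omp.h>
#include <cstdio>

int main()
{
    // Upper bound on the team size the next parallel region may use.
    std::printf("max threads available: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single             // one thread reports for the whole team
        std::printf("threads actually spawned: %d\n", omp_get_num_threads());
    }
    return 0;
}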
90. the modifications to a program that already runs in KeLP in an hour. As already discovered in several papers, however, some scientific codes optimize better than others [21, 44, 45].

V.A.1 Red-Black 3D

Redblack3D is a scientific program written using Sputnik, which is itself composed of MPI, KeLP, and OpenMP. It is mostly in C++, except for the kernel of the program, which is written in Fortran and linked in with the rest of the code. The OpenMP directives are in the Fortran part of the code. The program solves Poisson's equation, a differential equation, using the Gauss-Seidel method. Each run was done with one MPI process per Origin2000 and varying numbers of OpenMP threads within each system. The program alternates between running a kernel on the "black" points and the "red" points, which are arranged in a three-dimensional grid. The reason for this is that redblack3D operates by doing a 7-point stencil calculation (right, left, front, back, top, bottom, and itself) and needs to make sure that the values next to it don't change in a given iteration. So the grid alternates, with every other point being a red point and all the others being black, so that no red point is orthogonally adjacent to another red point (diagonal adjacency is fine). The entire source code for redblack3D and the kernel, both modified with the Sputnik API, are in Appendices B and C.

V.B Hardware

The Origin2000 is somewhat different than the typical multiproc
91. thesis, I make one specific application study with two different types of optimizations: repartitioning the amount of data each node works on, and adjusting the number of threads that run on each node.

This thesis does not attempt to solve the problem of scheduling heterogeneous clusters of multiprocessors that are networked over a heavily trafficked wide-area network, including grids. Such problems are best solved through different methods, including dynamic load balancing by use of Globus, Legion, the Network Weather Service, or AppLeS [30, 39, 66, 69]. Blending one or more of these technologies with my API is beyond the scope of this thesis.

The thesis also does not attempt to tackle the problem of machines with a deeper hierarchy than two tiers (processor and node). A deeper hierarchy might be a "cluster of clusters." Additionally, this thesis and the API it presents focus specifically on clusters of multiprocessors. They do not look at the added level of detail of what happens when several completely different architecture types are clustered together, such as vector and multicomputer-style MPP supercomputers, and parts of the computation run better on different architectures.

Optimizations such as the ones I have just mentioned, in addition to many others, are possible optimizations that could be done in ClusterOptimization. A process whereby the ClusterOptimizer does nothing more than vary the tiling of the problem to fit in leve
92. thing quite like this. It's been a unique, fun, life-changing experience that has given me not just an interest in scientific computing but an interest in, and enthusiasm towards, academia and research in general.

This work was supported by UC MICRO program award number 99-007 and Sun Microsystems.

VITA

March 23, 1976    Born, Marin County, California
1993-1996         Intern, Autodesk, Inc.
1996-1997         Freelance Writer, TidBITS
1997              Study Abroad, Oxford University, England
1997              Programming Intern, Pixar Animation Studios
1998              Programming Intern, Dolby Laboratories, Inc.
1999              B.A., University of California, San Diego; Minor in Organic Chemistry
1999              Research Assistant, University of California, San Diego
1996-2000         Programming Intern, San Diego Supercomputer Center (SDSC)
2000              M.S., University of California, San Diego
1994-Present      Computer Consultant
2000-Present      Chief Operating Officer, Dyna-Q Corporation

PUBLICATIONS

"Simulating Neurotransmitter Activity on Parallel Processors." S. Peisert, S. Mock, and S. B. Baden. UCSD Research Review 1999 Poster Session, February 26, 1999.

FIELDS OF STUDY

Major Field: Computer Science

Studies in Computer Graphics
Dr. Michael J. Bailey

Studies in High Performance and Scientific Computing
Professor Scott B. Baden

Studies in George Gershwin
Dr. Elizabeth B. Strauchen, Wolfson College, Oxford University, England

Studies in Russian Literature
Dr. Rog
93. things that occurred in the course of gathering results for redblack3D modified with the KeLP/OpenMP Sputnik API libraries. All of them appear to stem from anomalies either with OpenMP in general, or perhaps with the specific OpenMP implementation on the Origin2000 machines. Specifically, I found that MPI processes appear to run significantly faster than OpenMP threads. For example, when I ran with one MPI process with between 2 and 64 threads spawned by that MPI process, it ran significantly slower than if I ran with between 2 and 64 MPI processes with 1 OpenMP thread per process.

[Figure V.5 (plot): measured and theoretical speedup versus number of threads on aegir, 24 to 48.]
Figure V.5: Speedup for redblack3D with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library.

Threads (balder / aegir) | New Compute (balder / aegir) | Predicted | Speedup | Theoretical Speedup
62 / 32                  | 71.5 / 71.8537               | 72.5068   | 1.3820  | 1.3695
128 / 64                 | 80.3 / 75.5464               | 76.8115   | 1.345   | 1.4060
128 / 96                 | 65.5 / 68.841                | 69.3257   | 1.0677  | 1.0602

Table V.8: Speedup and predicted timings for redblack3D with large numbers of threads per system.

Despite speaking with the NCSA consultants, who in turn contacted SGI engineers, as well as performing a variety of experiments myself, the precise cause of this was never solved. It was speculated that this too h
94. this is that there is less time overhead in continually creating and destroying threads. The downside is that the threads are using system resources when other system processes might need a spare moment on a CPU.

Instead of a simple, static, uniform partitioning where each node n of N is assigned the fraction 1/N of the work, my version uses two partitioning schemes. The first is an optimized, non-uniform partitioning that bases the partitioning on the total number of processors, as opposed to nodes. It calculates the total number of processors available on all nodes, P, the total number of nodes, N, and how many processors each individual node has, p_0, ..., p_{N-1}. It then assigns work chunks of the total data set, W, equal to the fraction of processing power each node has:

\[ w_i = \frac{p_i}{P} \, W, \qquad i = 0, \ldots, N-1. \]

Therefore, as long as everything is equal in a machine (especially the processing power of each individual processor) except the number of processors in a node, the problem workload will remain balanced and the problem will run optimally fast based on the configuration of the cluster of multiprocessors.

One cannot be guaranteed an optimal partitioning as in the first scheme, however. More than likely, speeds of processors will differ, speeds of memory and networks will differ, and cache sizes and other important machine characteristics will differ. Therefore, by having the user issue the proper flag to the program to initiate balanced, non-unifor
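A minimal sketch of this first, processor-count-based scheme follows; the helper is hypothetical and not the Sputnik source. Given the processor count of each node and the total data size W, it returns chunk sizes w_i = (p_i/P)W, rounding so the chunks sum exactly to W. With a configuration of 2, 1, and 1 processors and W = 300, it yields 150, 75, and 75, matching the N/2, N/4, N/4 example of Chapter III.

#include <vector>

std::vector<long> partitionByProcessors(const std::vector<int>& procsPerNode, long W)
{
    long P = 0;
    for (int p : procsPerNode) P += p;          // total processors in the cluster

    std::vector<long> w;
    long assigned = 0;
    for (size_t i = 0; i < procsPerNode.size(); i++) {
        // The last node takes the remainder so the chunks sum exactly to W.
        long wi = (i + 1 == procsPerNode.size())
                      ? W - assigned
                      : (W * procsPerNode[i]) / P;
        w.push_back(wi);
        assigned += wi;
    }
    return w;      // e.g. {2, 1, 1} processors and W = 300 -> {150, 75, 75}
}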
95. tion of the C++ loops. An example of this is shown in Figure III.3.

Finally, there was the issue of Fortran and C++ OpenMP compatibility, since there were Fortran directives recognized separately by each language's compiler. I wrote a small microkernel to test OpenMP scaling as well as C++/Fortran/OpenMP and MPI compatibility on the Origin2000, as shown in Figure III.4. The microkernel was written to avoid compiler optimization and precomputation as much as possible.

The program runs two heavyweight processes. One process keeps running just the inner loop for every iteration:

for (i = 1; i < LONG; i++)
   arr[i] = arr[i-1] * i * 1.0001;

for (int i = 0; i < 40000; i++)
   times += kernel(x, y, z);

double kernel(double x, double y, double z)
{
   // Every time that kernel() is called, the following
   // pragma is called as well. If kernel() is called
   // 40,000 times, as shown in this example, overhead
   // of generating threads will be incurred 40,000
   // times.
   #pragma omp parallel shared(x, y, z)
   #pragma omp for schedule(static)
   for (int i = 0; i < MAX_INT; i++) {
      // <mathematical calculations>
   }
}

Figure III.3: OpenMP Fork-Join Example

The other process, for each iteration of the outer loop, runs with a different number of threads, set with OpenMP commands. The program starts at 64 threads and halves the number of threads each time through, all the way down to 1. If the program scales well, the time should double with each iterati
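The following self-contained C++/OpenMP sketch, which is not the thesis microkernel itself, shows one way to run the halving-threads scaling test just described: time the same loop at 64, 32, ..., 1 threads with omp_set_num_threads() and omp_get_wtime(), and check whether the time roughly doubles at each step. The loop body avoids the loop-carried dependence of the serial example above so that it can be parallelized safely.

#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const long LONG_N = 1 << 24;               // hypothetical problem size
    std::vector<double> arr(LONG_N, 1.0);

    for (int threads = 64; threads >= 1; threads /= 2) {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < LONG_N; i++)
            arr[i] = arr[i] * 1.0001 + i;      // independent arithmetic per element

        double t1 = omp_get_wtime();
        std::printf("%2d threads: %.4f s\n", threads, t1 - t0);
    }
    return 0;
}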
96. titioned and balanced run of the program. At its worst, the runs with a maximum of 32 threads per system are 5.19% worse than the predicted results. At best, the actual runs are 2.29% better than the predicted results. Again, this variance is presumably due to the same conditions that cause a discrepancy between the timings of the nodes after their workloads have been re-balanced.

V.E.2 Up to 48 threads per system

Threads on aegir | Original Compute (balder / aegir) | New Total (balder / aegir) | New Compute (balder / aegir)
24               | 37.6 / 62.9                       | 50.6 / 49.8617             | 43.8 / 46.3449
30               | 37.6 / 50.1                       | 45.498 / 45.5              | 43.4 / 42.687
36               | 36.1 / 43.5                       | 42.2992 / 42.3             | 37.8 / 39.675
42               | 36.5 / 38.4                       | 40.5622 / 40.7             | 36.8 / 35.4458
48               | 37.4 / 34.3                       | 41.4 / 42.3218             | 33.3 / 36.2054

Table V.5: Complete redblack3D timings with 48 threads on balder and varying numbers of threads on aegir.

Threads (balder / aegir) | New Compute (balder / aegir) | Predicted | Speedup | Theoretical Speedup
48 / 24                  | 43.8 / 46.3449               | 47.0655   | 1.3572  | 1.3364
48 / 30                  | 43.4 / 42.687                | 42.9592   | 1.1544  | 1.1662
48 / 36                  | 37.8 / 39.675                | 39.4560   | 1.0964  | 1.1025
48 / 42                  | 36.8 / 35.4458               | 37.4259   | 1.0435  | 1.0260
48 / 48                  | 33.3 / 36.2054               | 35.7830   | 1.0330  | 1.0452

Table V.6: Speedup and predicted timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir.

As with the runs with a maximum of 32 threads per system, runs with a maximum of 48 threads per system also scale
97. titute, August 1998.

[19] Crandall, P. E. and M. J. Quinn. "A Decomposition Advisory System for Heterogeneous Data-Parallel Processing." Proceedings of the Third International Symposium on High Performance Distributed Computing.

[20] Crandall, P. E. and M. J. Quinn. "Non-Uniform 2-D Grid Partitioning for Heterogeneous Parallel Architectures." Proceedings of the 9th International Parallel Processing Symposium, 1995.

[21] de Supinski, B. R. and J. May. "Benchmarking Pthreads Performance." Lawrence Livermore National Labs, UCRL-JC-133263.

[22] Donaldson, S., J. M. D. Hill, and D. B. Skillicorn. "BSP Clusters: High Performance, Reliable and Very Low Cost." PRG-TR-5-98, Oxford University Computing Laboratory, 1998.

[23] Fink, S. J. "A Programming Model for Block-Structured Scientific Calculations on SMP Clusters." UCSD CSE Department Ph.D. Dissertation, June 1998.

[24] Fink, S. J., S. B. Baden, and S. R. Kohn. "Efficient Run-Time Support for Irregular Block-Structured Applications." Journal of Parallel and Distributed Computing, 1998.

[25] Fink, S. J., S. B. Baden, and S. R. Kohn. "Flexible Communication Mechanisms for Dynamic Structured Applications." IRREGULAR '96.

[26] Fink, S. J. and S. B. Baden. "Runtime Support for Multi-Tier Programming of Block-Structured Applications on SMP Clusters." ISCOPE '97, December 1997.

[27] Fink, S.
98. to organizations with a need for large-scale parallel computation, well-established techniques used to program multicomputer MPPs and vector machines are not always the optimal techniques to program multiprocessors or multiprocessor clusters. Further, since one of the appealing aspects of clusters of multiprocessors is that many of their components can be built from readily available, commercial hardware solutions, such as Sun, IBM, or SGI multiprocessor workstations, it can be cost-effective to add in or swap out systems in the cluster at will, replacing old components gradually. The result is that what was originally a homogeneous cluster of multiprocessors can easily become heterogeneous over time, with the addition of newer systems with different processor speeds, numbers of processors, memory sizes, cache sizes, and network speeds, an example of which is shown in Figure I.1.

[Figure I.1: Diagram of a heterogeneous cluster of multiprocessor nodes: a network hub or switch connecting multiprocessor nodes 0, 1, and 2, each with a different number of processors.]

In a heterogeneous cluster, a uniform partitioning of data across the cluster is not optimal because some of the nodes will finish before others, leaving parts of the cluster idle until the slower nodes terminate. This problem can generally be stated to say that a cluster will o
99. to this outline in pseudo-code:

// Until we have hit the user-defined limit on the maximum number
// of threads, or until the time we get by increasing the number of
// threads is higher (worse performance) than with the lower number of
// threads, keep increasing the number of threads and running the
// kernel of the program, as contained in SputnikMain(), without
// communication. This way, we can get the individual timings for
// each multiprocessor node.
while (i <= MAX_THREADS
       && time[last iteration] < time[second-to-last iteration]) {
   omp_set_num_threads(i);

   // By passing in NULL for the times, we are telling the
   // routine inside the DecompositionX class not to do any
   // special modifications.
   time[i] = SputnikMain(argc, argv, NULL);
   i = i * 2;
}

i = <iteration before the best we found in the previous loop>;

// During the first loop, we move quickly to find the optimal
// solution by doubling i each time. This time, we only
// increment by 1, starting with the best estimate from the
// first while loop.
while (time[last iteration] < time[second-to-last iteration]) {
   omp_set_num_threads(i);
   time[i] = SputnikMain(argc, argv, NULL);
   i = i + 1;
}

// Set the optimal number of threads. Each node may have a
// different optimal value.
omp_set_num_threads(<optimal number>);

// This time, pass in the best times and let the DecompositionX
// routines do partitioning based on the
100. ture of Tiling Interactions." LCPC 1997.

[47] National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign, part of The Alliance. <http://www.ncsa.uiuc.edu/>, <http://www.uiuc.edu/>, <http://www.ncsa.edu/>

[48] NCSA Silicon Graphics Origin2000. <http://www.ncsa.uiuc.edu/SCD/Hardware/Origin2000/>

[49] Nguyen, T., M. M. Strout, L. Carter, and J. Ferrante. "Asynchronous Dynamic Load Balancing of Tiles." SIAM '99.

[50] OpenMP Architecture Review Board. OpenMP FAQ. <http://www.openmp.org/index.cgi?faq>

[51] OpenMP Architecture Review Board. "OpenMP Fortran Application Program Interface 1.0." October 1997.

[52] OpenMP Architecture Review Board. "OpenMP C and C++ Application Program Interface 1.0." October 1998.

[53] Patterson, D. A. and J. L. Hennessy. Computer Architecture: A Quantitative Approach, 2nd Ed. Morgan Kaufmann.

[54] Peisert, S., S. Mock, and S. Baden. "Simulating Neurotransmitter Activity on Parallel Processors." UCSD Graduate Research Conference, 1999.

[55] Pfister, G. F. In Search of Clusters: The Coming Battle in Lowly Parallel Computing. Prentice Hall PTR, 1995.

[56] Pilkington, J. R. and S. B. Baden. "Partitioning with Spacefilling Curves." CSE Technical Report Number CS94-349, March 1994.

[57] Pilkington, J. R. and S. B. Baden. "Dynamic Partitioning of Non-Uniform Structured Workloads with Space
101. twork while the other receives it. In the broadcast and scatter methods, one processor sends a message to all other processors in the network. In the gather method, all processors send to a single processor in the network. It is clear that as more messages are passed, the more congested the network becomes and the more complex the solutions needed to solve the problem of building a low-latency, high-bandwidth network.

An example of a distributed memory machine can be seen in Figure II.2. An extremely basic example of a simple distributed memory machine might use a bus to pass messages. The problem with this particular design is that the bus that messages travel on is often a bottleneck and would therefore not scale well to larger numbers of processors due to competing demands on the bus. For this reason, large machines typically use a more scalable interconnect.

The advantage of topologies like the crossbar, the hypercube, or the toroidal mesh (as is used in the SGI/Cray T3E) is that they make for an extremely fast network and are very expandable to a large number of processors. Unfortunately, toroidal meshes and crossbars are among the technologies that are very expensive to construct.

[Figure II.2: Diagram of a distributed memory machine: processors with L1 and L2 caches and main memory, connected by an interconnection network.]

Using a crossbar switch, for instance, would involve connecting every
102. uirement is that the message-passing library implements a collection of eleven message-passing primitives and the node library implements three shared-memory primitives. These combine to implement a final set of eight SIMPLE primitives: barrier, reduce, broadcast, allreduce, alltoall, alltoallv, gather, and scatter. Although MPI (including MPICH) works as a messaging layer, the architects of SIMPLE discovered that the Internode Communication Library (ICL) provided superior performance. They also decided to use Pthreads as the SMP node layer, although, as they describe, a faster library, possibly something vendor-specific, might work even better.

The calls in the SIMPLE library allow the programmer to work across multiple SMPs seamlessly, instead of having to partition the dataset twice manually: a primary partition to determine which data segment runs on each SMP and a secondary partition to determine which part of each data segment runs on each individual processor of each SMP.

KeLP2

Expanding upon the KeLP1 API, which already made programming SMPs easier with support for region calculus and data motion, among other parallel programming abstractions, KeLP2 supports the unique multi-tier structure of clusters of SMPs [23, 26, 12, 8]. While KeLP1 aids programmers in making parallel code easier to write without suffering performance penalties, KeLP2 does the same thing for multi-tier machines. Essentially, KeLP2 opened up a whole new and more
103. y Scott B. Baden for 3D RB
c     Converted to 3D, Blocked for Cache
c     RB ordering
c***********************************************************************

c     Smooth the red points
      subroutine rb7rrelax(u, ul0, ul1, ul2, uh0, uh1, uh2, si, sj, rhs)
      integer ul0, ul1, uh0, uh1, ul2, uh2, si, sj
      double precision u(ul0:uh0, ul1:uh1, ul2:uh2)
      double precision rhs(ul0:uh0, ul1:uh1, ul2:uh2)
      double precision c, h, c2
      integer i, j, k, ii, jj, jk

      c = 1.0d0/6.0d0
      h = 1.0d0
      c2 = h*h

!$OMP PARALLEL DEFAULT(SHARED), PRIVATE(jj,ii,k,j,i,jk)
!$OMP DO SCHEDULE(STATIC)
      do jj = ul1+1, uh1-1, sj
       do ii = ul0+1, uh0-1, si
        do k = ul2+1, uh2-1
         do j = jj, min(jj+sj-1, uh1-1)
          jk = mod(j+k, 2)
          do i = ii+jk, min(ii+jk+si-1, uh0-1), 2
           u(i,j,k) = c *
     2        (u(i-1,j,k) + u(i+1,j,k) +
     3         u(i,j-1,k) + u(i,j+1,k) +
     4         u(i,j,k-1) + u(i,j,k+1) -
     5         c2*rhs(i,j,k))
          end do
         end do
        end do
       end do
      end do
!$OMP END DO NOWAIT
!$OMP END PARALLEL

      return
      end

c     Smooth the black points
      subroutine rb7brelax(u, ul0, ul1, ul2, uh0, uh1, uh2, si, sj, rhs)
      integer ul0, ul1, uh0, uh1, ul2, uh2, si, sj
      double precision u(ul0:uh0, ul1:uh1, ul2:uh2)
      double precision rhs(ul0:uh0, ul1:uh1, ul2:uh2)
      double precision c, c2, h
      integer i, j, k, ii, jj, jk

      c = 1.0d0/6.0d0
      h = 1.0d0
      c2 = h*h

!$OMP PARALLEL DEFAULT(SHARED), PRIVATE(jj,ii,k,j,i,jk)
!$OMP DO SCHEDULE(STATIC)
      do jj = ul1+1, uh1-1, sj
       do ii = ul0+1, uh0-1, si
        do k = ul2+1, uh2-1
         do j = jj, min(jj+sj-1, uh1-1)
          jk = 1 - mod(j+k, 2)
          do i = ii+jk, min(ii+jk+si-1, uh0-1)
    