
Parallel Programming Using the Global Arrays Toolkit


Contents

1. (Matrix multiply example, continued from item 34) ...and copy them into the local buffers a and b:

   lo2[0] = lo[0]; lo2[1] = 0;  hi2[0] = hi[0]; hi2[1] = dims[0]-1;
   NGA_Get(g_a, lo2, hi2, a, ld);
   lo3[0] = 0; lo3[1] = lo[1];  hi3[0] = dims[1]-1; hi3[1] = hi[1];
   NGA_Get(g_b, lo3, hi3, b, ld);

   /* Do the local matrix multiplication and store the result in a local
      buffer c.  Start by evaluating the transpose of b
      (btrns[j][i] = b[i][j]), then multiply a by the transpose. */
   for (i = 0; i < hi[0]-lo[0]+1; i++) {
     for (j = 0; j < hi[1]-lo[1]+1; j++) {
       c[i][j] = 0.0;
       for (k = 0; k < dims[0]; k++)
         c[i][j] += a[i][k]*btrns[j][k];
     }
   }
   NGA_Put(g_c, lo, hi, c, ld);

New Interface for Creating Arrays
Developed to handle the proliferating number of properties that can be assigned to Global Arrays. Fortran:
- integer function ga_create_handle()
- subroutine ga_set_data(g_a, dim, dims, type)
- subroutine ga_set_array_name(g_a, name)
- subroutine ga_set_chunk(g_a, chunk)
- subroutine ga_set_irreg_distr(g_a, map, nblock)
- subroutine ga_set_ghosts(g_a, width)
- subroutine ga_set_block_cyclic(g_a, dims)
- subroutine ga_set_block_cyclic_proc_grid(g_a, dims, proc_grid)
- logical function ga_allocate(g_a)
2. (Hartree-Fock SCF, continued from item 46) Assuming the one-electron orbitals are expanded as

   ψ_i(r) = Σ_μ C_iμ χ_μ(r)

the calculation reduces to the self-consistent eigenvalue problem

   F_μν C_νk = ε_k S_μν C_νk,   with the density matrix D_μν = Σ_k C_μk C_νk
   F_μν = h_μν + Σ_ωλ [ 2 (μν|ωλ) − (μω|νλ) ] D_ωλ

Parallelizing the Fock Matrix
The bulk of the work involves computing the four-index elements (μν|ωλ). This is done by decomposing the quadruple loop (do i / do j / do k / do l) into evenly sized blocks and assigning blocks to each processor using a global counter: each processor reads and increments the counter, evaluates the corresponding block, and accumulates the results, then increments the counter again to get its next block.

Gordon Bell finalist at SC09 -- GA Crosses the Petaflop Barrier
A GA-based parallel implementation of a coupled cluster calculation ran at 1.39 petaflops using over 223,000 processes on ORNL's Jaguar (1.39 PetaFLOP/s at 223K cores). Apra et al., "Liquid water: obtaining the right answer for the right reasons", SC 2009; (H2O)24, 72 atoms, 1224 basis functions, cc-pVTZ(-f) basis. Global Arrays is one of two programming models that have achieved this level of performance.
3. Non-blocking Operations
The non-blocking APIs are derived from the blocking interface by adding a handle argument that identifies an instance of the non-blocking request.
Fortran:
- subroutine nga_nbput(g_a, lo, hi, buf, ld, nbhandle)
- subroutine nga_nbget(g_a, lo, hi, buf, ld, nbhandle)
- subroutine nga_nbacc(g_a, lo, hi, buf, ld, alpha, nbhandle)
- subroutine nga_nbwait(nbhandle)
C:
- void NGA_NbPut(int g_a, int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t *nbhandle)
- void NGA_NbGet(int g_a, int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t *nbhandle)
- void NGA_NbAcc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha, ga_nbhdl_t *nbhandle)
- int NGA_NbWait(ga_nbhdl_t *nbhandle)
Python:
- handle = ga.nbput(g_a, buffer, lo=None, hi=None)
- buffer, handle = ga.nbget(g_a, lo=None, hi=None, buffer=None)
- handle = ga.nbacc(g_a, buffer, lo=None, hi=None, alpha=None)
- ga.nbwait(handle)

Non-Blocking Operations -- double-buffering example:

   double precision buf1(nmax,nmax)
   double precision buf2(nmax,nmax)
   call nga_nbget(g_a, lo1, hi1, buf1, ld1, nb1)
   ncount = 1
   do while (...)
     if (mod(ncount,2).eq.1) then
       ! Evaluate lo2, hi2
       call nga_nbget(g_a, lo2, hi2, buf2, ld2, nb2)
       call nga_nbwait(nb1)
       ! Do work using data in buf1
     else
       ! Evaluate lo1, hi1
       call nga_nbget(g_a, lo1, hi1, buf1, ld1, nb1)
       call nga_nbwait(nb2)
       ! Do work using data in buf2
     endif
     ncount = ncount + 1
   end do
4. (New-interface example, continued)

   call ga_set_data(g_b, ndim, dims, MT_F_DBL)
   call ga_set_chunk(g_b, chunk)
   call ga_set_array_name(g_b, 'array B')
   status = ga_allocate(g_b)

Basic Array Operations -- Whole Arrays
To set all the elements in the array to zero:
- Fortran: subroutine ga_zero(g_a);  C: void GA_Zero(int g_a);  Python: ga.zero(g_a)
To assign a single value to all the elements in the array:
- Fortran: subroutine ga_fill(g_a, val);  C: void GA_Fill(int g_a, void *val);  Python: ga.fill(g_a, val)
To scale all the elements in the array by the factor val:
- Fortran: subroutine ga_scale(g_a, val);  C: void GA_Scale(int g_a, void *val);  Python: ga.scale(g_a, val)

Basic Array Operations (cont.) -- Whole Arrays
To copy data between two arrays:
- Fortran: subroutine ga_copy(g_a, g_b);  C: void GA_Copy(int g_a, int g_b);  Python: ga.copy(g_a, g_b)
The arrays must be the same size and dimension; the distribution may be different (for example, global arrays g_a and g_b each distributed on a 3x3 process grid in different ways). A minimal C sketch of these whole-array operations follows this item.

Basic Array Patch Operations -- Patch Operations
The copy patch operation:
- Fortran: subroutine nga_copy_patch(trans, g_a, alo, ahi, g_b, blo, bhi)
- C:       void NGA_Copy_patch(char trans, int g_a, int alo[], int ahi[], int g_b, int blo[], int bhi[])
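Putting the whole-array operations above together, here is a minimal, hedged C sketch. The array name, sizes, fill value, and scale factor are illustrative assumptions, not taken from the slides; the GA calls themselves are the ones listed in item 4.

   #include <stdio.h>
   #include "mpi.h"
   #include "ga.h"
   #include "macdecls.h"

   int main(int argc, char **argv) {
       MPI_Init(&argc, &argv);
       GA_Initialize();

       int dims[2]  = {100, 100};          /* illustrative 100x100 array            */
       int chunk[2] = {-1, -1};            /* let GA choose the blocking            */
       int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);
       int g_b = GA_Duplicate(g_a, "B");   /* same shape, possibly new distribution */

       double val = 3.0, fac = 0.5;
       GA_Zero(g_a);                       /* A = 0        */
       GA_Fill(g_a, &val);                 /* A(i,j) = 3.0 */
       GA_Scale(g_a, &fac);                /* A = 0.5 * A  */
       GA_Copy(g_a, g_b);                  /* B = A        */
       GA_Sync();

       GA_Destroy(g_a);
       GA_Destroy(g_b);
       GA_Terminate();
       MPI_Finalize();
       return 0;
   }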
5. GA Support / Help
- http://www.emsl.pnl.gov/docs/global/userinterface.html
- hpctools@pnl.gov or hpctools@emsl.pnl.gov
- Mailing lists: GA User Forum and GA_Announce

Installing GA
- GA 5.0 established autotools (configure && make && make install) for building; no environment variables are required.
- Traditional configure env vars apply: CC, CFLAGS, CPPFLAGS, LIBS, etc.
- Specify the underlying network communication protocol; this is only required on clusters with a high-performance network. For example, if the underlying network is InfiniBand using the OpenIB protocol, use: configure --with-openib
- GA requires MPI for basic start-up and process management. You can use either MPI or the TCGMSG wrapper to MPI:
    MPI (the default):    configure
    TCGMSG-MPI wrapper:   configure --with-mpi --with-tcgmsg
    TCGMSG:               configure --with-tcgmsg
- Various make targets: make (build the GA libraries), make install (install the libraries), make checkprogs (build tests and examples), make check MPIEXEC="mpiexec -np 4" (run the test suite).
- VPATH builds are supported: one source tree, many build trees (i.e., configurations).

Compiling and Linking GA Programs -- Your Makefile
Please refer to the CFLAGS, FFLAGS, CPPFLAGS, LDFLAGS and LIBS variables, which will be printed if you run "make flags".
6. Fault Tolerance (continued)
- Exploration of multiple data redundancy models for fault tolerance
- Recent demonstrations of fault tolerance with Global Arrays and ARMCI
- Design and implementation of CCSD(T) using this methodology; ongoing demonstrations at the PNNL booth
- Future and ongoing developments for leading platforms (Cray- and IBM-based systems)

Exascale Challenges
- Node architecture will change significantly: multiple memory and program spaces (develop GA support for hybrid platforms)
- Small amounts of memory per core force the use of non-SPMD programming/execution models (thread-safety support for multithreaded execution)
- There is not enough memory or memory bandwidth to fully replicate data in private process spaces (distribute GA metadata within nodes)
- Greater portability challenges (refactoring ARMCI)

Exascale Challenges (cont.)
- Much shorter mean time between failures (fault-tolerant GA and ARMCI)
- Traditional SPMD execution will likely not be feasible; programming models with intrinsic parallelism will be needed (MPI and GA in their current incarnations only have external parallelism)
- Data consistency will be more of a challenge at extreme scales
7. (Summary, continued)
- Demonstrated scalability to 200K cores and greater than 1 petaflop of performance
- High programmer productivity: the global address space and one-sided communication eliminate many programming overheads

Thanks
- DOE Office of Advanced Scientific Computing Research
- PNNL Extreme Scale Computing Initiative

Discussion
8. Message Passing vs. One-Sided Communication
Message passing: a message requires cooperation on both sides. The processor sending the message (P1) and the processor receiving the message (P0) must both participate (MPI send/receive).
One-sided communication: once the message is initiated on the sending processor (P1), the sending processor can continue computation. The receiving processor (P0) is not involved; data is copied directly from the switch into memory on P0 (SHMEM, ARMCI, MPI-2 one-sided).

Remote Data Access in GA vs. MPI
Message passing: identify the size and location of the data blocks, then
   loop over processors:
     if (me = P_N) then pack data in a local message buffer; send the block of data to a message buffer on P0
     else if (me = P0) then receive the block of data from P_N in a message buffer; unpack the data from the message buffer to a local buffer
   end loop
   copy local data on P0 to the local buffer
Global Arrays:
   NGA_Get(g_a, lo, hi, buffer, ld);
where g_a is the global array handle, lo and hi are the global lower and upper indices of the data patch, and buffer and ld are the local buffer and its array of strides. (A minimal C usage sketch follows this item.)

One-sided vs. Message Passing
- Message passing: communication patterns are regular or at least predictable; algorithms have a high degree of synchronization; data consistency is straightforward.
- One-sided: communication is irregular (an advantage for load balancing); continued in item 35.
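To make the one-sided access above concrete, here is a small, hedged C sketch of fetching a patch of a distributed 2-D array into a local buffer with NGA_Get. The array shape and patch bounds are illustrative assumptions; the key point is that the process owning the patch takes no part in the transfer.

   /* Fetch rows 10..19, columns 0..4 of a 2-D C_DBL global array g_a. */
   int    lo[2] = {10, 0};        /* lower corner of the requested patch (0-based in C) */
   int    hi[2] = {19, 4};        /* upper corner, inclusive                            */
   double buf[10][5];             /* local buffer large enough for the patch            */
   int    ld[1] = {5};            /* leading dimension (stride) of the local buffer     */

   NGA_Get(g_a, lo, hi, buf, ld); /* one call replaces the whole send/receive loop      */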
9. (ga_matmul_patch arguments, continued from item 17)
   integer g_b, bilo, bihi, bjlo, bjhi   - patch of g_c's second operand g_b   [input]
   integer g_c, cilo, cihi, cjlo, cjhi   - patch of g_c                        [input]
   dbl prec/comp alpha, beta             - scale factors                       [input]
   character*1 transa, transb            - transpose flags                     [input]

Linear Algebra on Patches (cont.)
To compute the element-wise dot product of two array patches, there are three separate functions for the supported data types:
- Integer:          Fortran nga_idot_patch(g_a, ta, alo, ahi, g_b, tb, blo, bhi);  C NGA_Idot_patch(int g_a, char ta, int alo[], int ahi[], int g_b, char tb, int blo[], int bhi[])
- Double precision: Fortran nga_ddot_patch(g_a, ta, alo, ahi, g_b, tb, blo, bhi);  C NGA_Ddot_patch(int g_a, char ta, int alo[], int ahi[], int g_b, char tb, int blo[], int bhi[])
- Double complex:   Fortran nga_zdot_patch(g_a, ta, alo, ahi, g_b, tb, blo, bhi);  C NGA_Zdot_patch(int g_a, char ta, int alo[], int ahi[], int g_b, char tb, int blo[], int bhi[])
Python has only one function: ga.dot(g_a, g_b, alo=None, ahi=None, blo=None, bhi=None, bint ta=False, bint tb=False)
Whole-array dot products in C: int GA_Idot(int g_a, int g_b); long GA_Ldot(int g_a, int g_b); float GA_Fdot(int g_a, int g_b); double GA_Ddot(int g_a, int g_b); DoubleComplex GA_Zdot(int g_a, int g_b)

Block-Cyclic Data Distributions (figure: normal data distribution vs. block-cyclic data distribution)
10. (End of the double-buffering loop shown in item 3.)

SRUMMA Matrix Multiplication (patch matrix multiplication, C = A.B)

   issue non-blocking Get of the first A and B blocks
   do until last chunk
     issue non-blocking Get for the next blocks
     wait for the previously issued call
     compute A.B (sequential dgemm)
     non-blocking atomic accumulate into the C matrix
   done

Computation/communication overlap. Advantages: minimum memory, highly parallel, overlaps computation and communication (latency hiding), exploits data locality, patch matrix multiplication (easy to use), dynamic load balancing. http://hpc.pnl.gov/projects/srumma/

SRUMMA Matrix Multiplication: Improvement over PBLAS/ScaLAPACK
Parallel matrix multiplication on the HP/Quadrics cluster at PNNL; matrix size 40000x40000. Efficiency is 92.9% w.r.t. the serial algorithm and 88.2% w.r.t. machine peak on 1849 CPUs. (Figure: TeraFLOPs vs. number of processors, 0-2048, comparing SRUMMA, PBLAS/ScaLAPACK pdgemm, and the theoretical peak.)

Cluster Information
Example: 2 nodes with 4 processors each, and say 7 processes are created. Then:
- ga_cluster_nnodes() returns 2
- ga_cluster_nodeid() returns 0 or 1
- ga_cluster_nprocs(inode) returns 4 or 3
- ga_cluster_procid(inode, iproc) returns a processor ID
11. (Data locality, continued from item 35)
- NGA_Distribution(g_a, iproc, lo, hi): where is the data?
- NGA_Access(g_a, lo, hi, ptr, ld): direct access to the locally held data
Use this information to organize the calculation so that maximum use is made of locally held data.

Example: Matrix Multiply -- global arrays represent the matrices; nga_get moves blocks into local buffers on the processor, dgemm multiplies them, and nga_put stores the result.
Matrix Multiply, a better version -- atomic accumulate into the product array: more scalable, less memory, higher parallelism, still using local buffers on the processor.

Application Areas: electronic structure chemistry (the major area), smoothed particle hydrodynamics, hydrology, material sciences, molecular dynamics; others include financial security forecasting, astrophysics, biology, climate analysis, and visual analytics (immigration, biowarfare, WMD reports, financial transactions).

Recent Applications -- ScalaBLAST: C. Oehmen and J. Nieplocha, "ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis", IEEE Trans. Parallel and Distributed Systems, Vol. 17, No. 8, 2006.
12. (Why Parallel, continued)
- Data and operations can be distributed amongst N processors instead of 1 processor; codes execute potentially N times as quickly.
- Disadvantages of parallel programs: they are bad for your mental health.

Parallel vs. Serial
- Parallel codes can divide the work and memory required for application execution amongst multiple processors.
- New costs are introduced into parallel codes: communication, code complexity, and new dependencies.

Communication
- Parallel applications require data to be communicated from one processor to another at some point.
- Data can be communicated by having processors exchange data via messages (message passing).
- Data can be communicated by having processors directly write or read data in another processor's memory (one-sided).

Data Transfers
The amount of time required to transfer data consists of two parts:
- Latency: the time to initiate the data transfer, independent of data size.
- Transfer time: the time to actually transfer the data once the transfer is started, proportional to data size.
Because of latency costs, a single large message is preferred over many small messages.
13. Parallel Efficiency
- Strong scaling: for a given problem size, the time to execute is inversely proportional to the number of processors used. If you want to get your answers faster, you want a strong-scaling program.
- Weak scaling: if the problem size increases in proportion to the number of processors, the execution time is constant. If you want to run larger calculations, you are looking for weak scaling.
- Speedup: the ratio of the execution time on 1 processor to the execution time on N processors. If your code is linearly scaling (the best case), then the speedup is equal to the number of processors.
- Strong scaling and weak scaling are not incompatible: you can have both.

Sources of Parallel Inefficiency
- Communication: message latency is a constant regardless of the number of processors; not all message sizes decrease with increasing numbers of processors; the number of messages per processor may increase with the number of processors, particularly for global operations such as synchronizations.
- Load imbalance: some processors are assigned more work than others, resulting in processors that sit idle.
Note: parallel inefficiency is like death and taxes -- it is inevitable. The goal of parallel code development is to put off as long as possible the point at which the inefficiencies dominate.
14. Basic GA Operations
The GA programming model is very simple; most of a parallel program can be written with these basic calls: GA_Initialize, GA_Terminate, GA_Nnodes, GA_Nodeid, GA_Create, GA_Destroy, GA_Put, GA_Get, GA_Distribution, GA_Access, GA_Sync.

GA Initialization/Termination
There are two functions to initialize GA:
- Fortran: subroutine ga_initialize();  subroutine ga_initialize_ltd(limit)
- C:       void GA_Initialize();        void GA_Initialize_ltd(size_t limit)
- Python:  import ga, then ga.set_memory_limit(limit)
To terminate a GA program:
- Fortran: subroutine ga_terminate()
- C:       void GA_Terminate()

Example (Fortran):
   program main
   include 'mafdecls.fh'
   include 'global.fh'
   integer ierr
   call mpi_init(ierr)
   call ga_initialize()
   write(6,*) 'Hello world'
   call ga_terminate()
   call mpi_finalize(ierr)
   end

Parallel Environment -- Process Information
- How many processes are working together (size), and what their IDs are (ranging from 0 to size-1).
- To return the process ID of the current process:
    Fortran: integer function ga_nodeid();  C: int GA_Nodeid();  Python: nodeid = ga.nodeid()
- To determine the number of computing processes (continued in item 29).
15. Block-Cyclic Data (cont.)
(Figure: simple distribution vs. ScaLAPACK distribution of blocks over a process grid.)

Block-Cyclic Data (cont.)
- Most operations work exactly the same way; the data distribution is transparent to the user.
- Some operations (matrix multiplication, non-blocking put/get) are not implemented.
- Additional operations have been added to provide access to the data associated with particular sub-blocks.
- You need to use the new interface for creating Global Arrays to create block-cyclic data distributions.

Creating Block-Cyclic Arrays (must use the new API for creating Global Arrays)
- Fortran: subroutine ga_set_block_cyclic(g_a, dims);  subroutine ga_set_block_cyclic_proc_grid(g_a, dims, proc_grid)
- C:       void GA_Set_block_cyclic(int g_a, int dims[]);  void GA_Set_block_cyclic_proc_grid(int g_a, int dims[], int proc_grid[])
- Python:  ga.set_block_cyclic(g_a, dims);  ga.set_block_cyclic_proc_grid(g_a, block, proc_grid)
where integer dims[] gives the dimensions of the blocks and integer proc_grid[] gives the dimensions of the processor grid (note that the product of all proc_grid dimensions must equal the number of processors).

Block-Cyclic Methods -- methods for accessing the data of individual blocks (continued in item 26).
16. Outline of the Tutorial
- Overview of parallel programming
- Introduction to the Global Arrays programming model
- Basic GA commands
- Advanced features of the GA Toolkit
- Current and future developments in GA

Profiling Capability
- Weak bindings for the ARMCI and GA APIs enable custom user wrappers to intercept these calls.
- ARMCI/GA support in TAU: on par with the support for MPI; available in the current stable TAU release.
- Performance patterns for ARMCI in SCALASCA: analysis of traces from ARMCI/GA programs; available in an upcoming SCALASCA release.
- Consistent naming convention (NGA_).

Restricted Arrays
Create arrays in which only a few processors have data, or arrays in which data is distributed to processors in a non-standard way:
   ga_set_restricted(g_a, list, nproc)
(Figures: a process list mapped onto the global array; 4 nodes / 16 processors with a standard data distribution vs. a user-specified distribution.)

TASCEL -- Dynamic Load Balancing (task SPMD with termination detection)
- Express the computation as a collection of tasks (continued in item 36).
17. Linear Algebra on Patches (continued from item 20): to add, element-wise, two patches and save the result into another patch:
- Fortran: subroutine nga_add_patch(alpha, g_a, alo, ahi, beta, g_b, blo, bhi, g_c, clo, chi)
- C:       void NGA_Add_patch(void *alpha, int g_a, int alo[], int ahi[], void *beta, int g_b, int blo[], int bhi[], int g_c, int clo[], int chi[])
- Python:  ga.add(g_a, g_b, g_c, alpha=None, beta=None, alo=None, ahi=None, blo=None, bhi=None, clo=None, chi=None)
Arguments: g_a, g_b, g_c (array handles, input); alpha, beta (double precision/complex/integer scale factors, input); ailo, aihi, ajlo, ajhi (g_a patch coordinates, input); bilo, bihi, bjlo, bjhi (g_b patch coordinates, input); cilo, cihi, cjlo, cjhi (g_c patch coordinates, input).

Linear Algebra on Patches (cont.): to perform matrix multiplication on patches:
- Fortran: subroutine ga_matmul_patch(transa, transb, alpha, beta, g_a, ailo, aihi, ajlo, ajhi, g_b, bilo, bihi, bjlo, bjhi, g_c, cilo, cihi, cjlo, cjhi)
- C:       void GA_Matmul_patch(char transa, char transb, void *alpha, void *beta, int g_a, int ailo, int aihi, int ajlo, int ajhi, int g_b, int bilo, int bihi, int bjlo, int bjhi, int g_c, int cilo, int cihi, int cjlo, int cjhi)
- Python:  ga.matmul_patch(bool transa, bool transb, alpha, beta, g_a, ailo, aihi, ajlo, ajhi, g_b, bilo, bihi, bjlo, bjhi, g_c, cilo, cihi, cjlo, cjhi)
Arguments: g_a, ailo, aihi, ajlo, ajhi (patch of g_a, input); the remaining arguments are listed in item 9.
18. (Creating/Destroying Arrays, continued)

   call nga_create(MT_F_INT, dim, dims, 'array_a', chunk, g_a)
   call ga_duplicate(g_a, g_b, 'array_b')
   call ga_destroy(g_a)

where name is a character string (input), g_a is the integer handle of the reference array (input), and g_b is the integer handle of the new array (output).

Put/Get
Put copies data from a local array to a global array section:
- Fortran: subroutine nga_put(g_a, lo, hi, buf, ld);  C: void NGA_Put(int g_a, int lo[], int hi[], void *buf, int ld[]);  Python: ga.put(g_a, buf, lo=None, hi=None)
Get copies data from a global array section to a local array:
- Fortran: subroutine nga_get(g_a, lo, hi, buf, ld);  C: void NGA_Get(int g_a, int lo[], int hi[], void *buf, int ld[]);  Python: buffer = ga.get(g_a, lo, hi, numpy.ndarray buffer=None)
Arguments: g_a (global array handle, input); lo(), hi() (limits on the data block to be moved, input); buf (double precision/complex/integer local buffer); ld() (array of strides for the local buffer, input).

Put/Get (cont.) -- example of a put operation: transfer data from a local buffer (a 10x10 array) to the (7:15,1:8) section of a two-dimensional 15x10 global array:

   double precision buf(10,10)
   lo(1) = 7
   lo(2) = 1
   hi(1) = 15
   hi(2) = 8
   ld = 10
   call nga_put(g_a, lo, hi, buf, ld)

(A hedged C version of this example follows this item.)
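For reference, a hedged C rendering of the put example above. C indices are 0-based, so the Fortran section (7:15,1:8) becomes lo={6,0}, hi={14,7}; the buffer size and handle name are taken from the example, the buffer contents are an assumption.

   double buf[10][10];            /* local 10x10 buffer holding the data to write       */
   int lo[2] = {6, 0};            /* Fortran (7,1)  -> C (6,0)                          */
   int hi[2] = {14, 7};           /* Fortran (15,8) -> C (14,7), inclusive              */
   int ld[1] = {10};              /* leading dimension (stride) of the local buffer     */

   /* ... fill buf with the values to be written ... */
   NGA_Put(g_a, lo, hi, buf, ld); /* g_a is the handle of the 15x10 C_DBL global array  */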
19. (Copy patch, continued)
- Python: ga.copy(g_a, g_b, alo=None, ahi=None, blo=None, bhi=None, bint trans=False)
The number of elements in the two patches must match.

Basic Array Patch Operations (cont.)
To set only the region defined by lo and hi to zero:
- Fortran: subroutine nga_zero_patch(g_a, lo, hi);  C: void NGA_Zero_patch(int g_a, int lo[], int hi[]);  Python: ga.zero(g_a, lo=None, hi=None)
To assign a single value to all the elements in a patch:
- Fortran: subroutine nga_fill_patch(g_a, lo, hi, val);  C: void NGA_Fill_patch(int g_a, int lo[], int hi[], void *val);  Python: ga.fill(g_a, value, lo=None, hi=None)
To scale the patch defined by lo and hi by the factor val:
- Fortran: subroutine nga_scale_patch(g_a, lo, hi, val);  C: void NGA_Scale_patch(int g_a, int lo[], int hi[], void *val);  Python: ga.scale(g_a, value, lo=None, hi=None)
The copy patch operation:
- Fortran: subroutine nga_copy_patch(trans, g_a, alo, ahi, g_b, blo, bhi)
- C:       void NGA_Copy_patch(char trans, int g_a, int alo[], int ahi[], int g_b, int blo[], int bhi[])
- Python:  ga.copy(g_a, g_b, alo=None, ahi=None, blo=None, bhi=None, bint trans=False)
20. (GA_Dgemm, continued)
- Python: def gemm(bool ta, bool tb, m, n, k, alpha, g_a, g_b, beta, g_c)
Arguments: alpha, beta (double precision/complex/integer scale factors, input); g_a, g_b, g_c (array handles, input); transa, transb (character*1, input); m, n, k (integer, input).

Linear Algebra (cont.)
To compute the element-wise dot product of two arrays, there are three separate functions for the supported data types:
- Integer:          Fortran ga_idot(g_a, g_b);  C GA_Idot(int g_a, int g_b)
- Double precision: Fortran ga_ddot(g_a, g_b);  C GA_Ddot(int g_a, int g_b)
- Double complex:   Fortran ga_zdot(g_a, g_b);  C GA_Zdot(int g_a, int g_b)
Python has only one function: ga.dot(g_a, g_b).
C return types: int GA_Idot(int g_a, int g_b); long GA_Ldot(int g_a, int g_b); float GA_Fdot(int g_a, int g_b); double GA_Ddot(int g_a, int g_b); DoubleComplex GA_Zdot(int g_a, int g_b).

Linear Algebra (cont.)
To symmetrize a matrix:
- Fortran: subroutine ga_symmetrize(g_a);  C: void GA_Symmetrize(int g_a);  Python: ga.symmetrize(g_a)
To transpose a matrix:
- Fortran: subroutine ga_transpose(g_a, g_b);  C: void GA_Transpose(int g_a, int g_b);  Python: ga.transpose(g_a, g_b)

Linear Algebra on Patches: to add two patches element-wise (continued in item 17).
21. Cluster Information (cont.)
To return the total number of nodes that the program is running on:
- Fortran: integer function ga_cluster_nnodes();  C: int GA_Cluster_nnodes();  Python: nnodes = ga.cluster_nnodes()
To return the node ID of the process:
- Fortran: integer function ga_cluster_nodeid();  C: int GA_Cluster_nodeid();  Python: nodeid = ga.cluster_nodeid()
To return the number of processors available on node inode:
- Fortran: integer function ga_cluster_nprocs(inode);  C: int GA_Cluster_nprocs(int inode);  Python: nprocs = ga.cluster_nprocs(inode)
To return the processor ID associated with node inode and the local processor ID iproc:
- Fortran: integer function ga_cluster_procid(inode, iproc);  C: int GA_Cluster_procid(int inode, int iproc);  Python: procid = ga.cluster_procid(inode, iproc)

Accessing Processor Memory (figure): within a node, ga_access gives a process direct access to the SMP shared memory holding the local portion of a global array.

Processor Groups
To create a new processor group:
- Fortran: integer function ga_pgroup_create(list, size)
- C:       int GA_Pgroup_create(int *list, int size)
22. (Processor Groups, continued)
- Python: pgroup = ga.pgroup_create(list)

To assign a processor group when creating an array:
- Fortran: logical function nga_create_config(type, ndim, dims, name, chunk, p_handle, g_a)
- C:       int NGA_Create_config(int type, int ndim, int dims[], char *name, int chunk[], int p_handle)
- Python:  g_a = ga.create(type, dims, name, chunk, pgroup=-1)
Arguments: g_a (global array handle, output); p_handle (processor group handle, input); list(size) (list of processor IDs in the group, input); size (number of processors in the group, input).

Processor Groups (figure): groups A, B and C carved out of the world group.

Processor Groups (cont.)
To set the default processor group:
- Fortran: subroutine ga_pgroup_set_default(p_handle);  C: void GA_Pgroup_set_default(int p_handle);  Python: ga.pgroup_set_default(p_handle)
To access information about a processor group:
- Fortran: integer function ga_pgroup_nnodes(p_handle);  integer function ga_pgroup_nodeid(p_handle)
- C:       int GA_Pgroup_nnodes(int p_handle);  int GA_Pgroup_nodeid(int p_handle)
- Python:  nnodes = ga.pgroup_nnodes(p_handle);  nodeid = ga.pgroup_nodeid(p_handle)
where p_handle is the processor group handle (input). (A minimal C sketch of creating and using a group follows this item.)
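As an illustration of the group calls above, here is a hedged C sketch that puts the lower half of the ranks into a group and creates an array on it. The group split, array size, and the use of a NULL chunk are illustrative assumptions; it is a fragment that assumes GA is already initialized and that ga.h, macdecls.h, stdio.h and stdlib.h are included.

   int me    = GA_Nodeid();
   int nproc = GA_Nnodes();
   int half  = nproc / 2;                       /* assume nproc >= 2                 */
   int *list = (int*) malloc(half * sizeof(int));
   int i;
   for (i = 0; i < half; i++) list[i] = i;      /* ranks 0 .. half-1 form the group  */

   int p_grp = GA_Pgroup_create(list, half);    /* collective over all processes     */

   if (me < half) {
       /* group members create and use an array that lives only on the group */
       int dims[1] = {1000};
       int g_v = NGA_Create_config(C_DBL, 1, dims, "group_vec", NULL, p_grp);
       GA_Zero(g_v);                            /* collective over the group only    */
       printf("rank %d is %d of %d inside the group\n",
              me, GA_Pgroup_nodeid(p_grp), GA_Pgroup_nnodes(p_grp));
       GA_Destroy(g_v);
   }
   free(list);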
23. Periodic Interfaces (continued from item 40)
- They provide an index-translation layer that allows users to request blocks using put, get, and accumulate operations that may extend beyond the boundaries of a global array.
- References that fall outside the boundaries are wrapped around inside the global array.
- The current version of GA supports three periodic operations: periodic get, periodic put, and periodic acc.
(Figure: a requested patch wrapping around the array edges into the local buffer.)

Periodic Get/Put/Accumulate
- Fortran: subroutine nga_periodic_get(g_a, lo, hi, buf, ld);  C: void NGA_Periodic_get(int g_a, int lo[], int hi[], void *buf, int ld[]);  Python: ndarray = ga.periodic_get(g_a, lo=None, hi=None, buffer=None)
- Fortran: subroutine nga_periodic_put(g_a, lo, hi, buf, ld);  C: void NGA_Periodic_put(int g_a, int lo[], int hi[], void *buf, int ld[]);  Python: ga.periodic_put(g_a, buffer, lo=None, hi=None)
- Fortran: subroutine nga_periodic_acc(g_a, lo, hi, buf, ld, alpha);  C: void NGA_Periodic_acc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha);  Python: ga.periodic_acc(g_a, buffer, lo=None, hi=None, alpha=None)

Lock and Mutex
Lock works together with mutex; it is a simple synchronization mechanism used to protect a critical section (continued in item 45).
24. Writing GA Programs
GA requires the following functionality from a message-passing library (MPI or TCGMSG):
- initialization and termination of processes
- broadcast and barrier
- a function to abort the running parallel job in case of an error
The message-passing library has to be initialized before the GA library and terminated after the GA library is terminated. GA is compatible with MPI.

   #include <stdio.h>
   #include "mpi.h"
   #include "ga.h"
   #include "macdecls.h"

   int main(int argc, char **argv) {
       MPI_Init(&argc, &argv);
       GA_Initialize();
       printf("Hello world\n");
       GA_Terminate();
       MPI_Finalize();
       return 0;
   }

Source Code and More Information
- Version 5.0.2 is available
- Homepage: http://www.emsl.pnl.gov/docs/global
- Platforms: IBM SP, BlueGene; Cray XT, XE6 (Gemini); Linux clusters with Ethernet or InfiniBand; Solaris; Fujitsu; Hitachi; NEC; HP; Windows

Documentation on Writing, Building and Running GA Programs
- GA webpage: GA papers, APIs, user manual, etc. (Google "Global Arrays"); http://www.emsl.pnl.gov/docs/global
- GA User Manual: http://www.emsl.pnl.gov/docs/global/user.html
- GA API documentation: GA webpage, User Interface page (see item 5)
25. Parallel Programming Using the Global Arrays Toolkit
Bruce Palmer, Sriram Krishnamoorthy, Daniel Chavarria, Abhinav Vishnu, Jeff Daily
Pacific Northwest National Laboratory

Global Arrays
- Developed over 20 years
- Under active development, focusing on preparing for future exascale platforms
- Available across platforms from PCs to leadership machines
- Easy access to distributed data on multiprocessor machines; high programmer productivity
- Library available from http://www.emsl.pnl.gov/docs/global

Outline of the Tutorial
- Overview of parallel programming
- Introduction to the Global Arrays programming model
- Basic GA commands
- Advanced features of the GA Toolkit
- Current and future developments in GA

Why Parallel?
When to parallelize:
- The program takes too long to execute on a single processor.
- The program requires too much memory to run on a single processor.
- The program contains multiple elements that are executed, or could be executed, independently of each other.
Advantages of parallel programs:
- Single-processor performance is not increasing; the only way to improve performance is to write parallel programs.
26. Block-Cyclic Methods (continued) -- accessing the data of individual blocks:
Fortran:
- subroutine ga_get_block_info(g_a, num_blocks, block_dims)
- integer function ga_total_blocks(g_a)
- subroutine nga_access_block_segment(g_a, iproc, index, length)
- subroutine nga_access_block(g_a, idx, index, ld)
- subroutine nga_access_block_grid(g_a, subscript, index, ld)
C:
- void GA_Get_block_info(int g_a, int num_blocks[], int block_dims[])
- int GA_Total_blocks(int g_a)
- void NGA_Access_block_segment(int g_a, int iproc, void *ptr, int *length)
- void NGA_Access_block(int g_a, int idx, void *ptr, int ld[])
- void NGA_Access_block_grid(int g_a, int subscript[], void *ptr, int ld[])
Python:
- num_blocks, block_dims = ga.get_block_info(g_a)
- blocks = ga.total_blocks(g_a)
- ndarray = ga.access_block_segment(g_a, iproc)
- ndarray = ga.access_block(g_a, idx)
- ndarray = ga.access_block_grid(g_a, subscript)
where integer length is the total size of the blocks held on a processor, integer idx is the index of a block in the array (for the simple block-cyclic distribution), and integer subscript() is the location of a block in the block grid (for the ScaLAPACK distribution).

Interfaces to Third-Party Software Packages
- ScaLAPACK: solve a system of linear equations; compute the inverse of a double-precision matrix
- TAO: general optimization problems
- Interoperability with others: PETSc, CUMULVS

Data Mapping Information (continued in item 38).
27. Recent Applications (continued)
- Parallel Inspire: Krishnan M, SJ Bohn, WE Cowley, VL Crow, and J Nieplocha, "Scalable Visual Analytics of Massive Textual Datasets", Proc. IEEE International Parallel and Distributed Processing Symposium, 2007.
- Smoothed Particle Hydrodynamics: B Palmer, V Gurumoorthi, A Tartakovsky, T Scheibe, "A Component-Based Framework for Smoothed Particle Hydrodynamics Simulations of Reactive Fluid Flow in Porous Media", Int. J. High Perf. Comput. App., Vol. 24, 2010.

Recent Applications
- Subsurface Transport Over Multiple Phases (STOMP)
- Transient Energy Transport Hydrodynamics Simulator (TETHYS)

Outline of the Tutorial
- Overview of parallel programming
- Introduction to the Global Arrays programming model
- Basic GA commands

Structure of GA
Application programming: Global Arrays and MPI are completely interoperable, so code can contain calls to both. GA is built on ARMCI (portable one-sided communication: put, get, locks, etc.), which in turn uses system-specific interfaces and libraries (LAPI, InfiniBand, threads, VIA).
28. (Distributed Data vs. Shared Memory, continued from item 33)
Shared memory: data is in a globally accessible address space; any processor can access data by specifying its location using a global index. Data is mapped out in a natural manner, usually corresponding to the original problem, and access is easy; however, information on data locality is obscured, which leads to loss of performance.

Distributed vs. Shared Data View
Distributed data: data is explicitly associated with each processor, so accessing data requires specifying the location of the data on the processor as well as the processor itself. Data locality is explicit but data access is complicated. Distributed computing is typically implemented with message passing (e.g., MPI).

Global Arrays
Distributed dense arrays that can be accessed through a shared-memory-like style: physically distributed data behind a single shared data structure with global indexing (e.g., access A(4,3) rather than buf(7) on task 2), giving a global address space.

Creating Global Arrays

   g_a = NGA_Create(type, ndim, dims, name, chunk)

where type is the data type (float, double, int, etc.), ndim and dims[] give the number of dimensions and the array of dimensions, name is a character string, chunk[] is the minimum block size on each processor, and the returned integer g_a is the array handle.

One-Sided Communication (see item 8).
29. (Parallel Environment, continued) To determine the number of computing processes:
- Fortran: integer function ga_nnodes();  C: int GA_Nnodes();  Python: nnodes = ga.nnodes()

Parallel Environment -- Process Information (example)

   program main
   include 'mafdecls.fh'
   include 'global.fh'
   integer ierr, me, size
   call mpi_init(ierr)
   call ga_initialize()
   me = ga_nodeid()
   size = ga_nnodes()
   write(6,*) 'Hello world: My rank is ', me, ' out of ', size, ' processes/nodes'
   call ga_terminate()
   call mpi_finalize(ierr)
   end

Running it with "mpirun -np 4 helloworld" (or "mpirun -np 4 python helloworld.py") prints:
   Hello world: My rank is 0 out of 4 processes/nodes
   Hello world: My rank is 2 out of 4 processes/nodes
   Hello world: My rank is 3 out of 4 processes/nodes
   Hello world: My rank is 1 out of 4 processes/nodes

GA Data Types
- C data types: C_INT (int), C_LONG (long), C_FLOAT (float), C_DBL (double), C_SCPL (single complex), C_DCPL (double complex)
- Fortran data types: MT_F_INT (integer, 4/8 bytes), MT_F_REAL (real), MT_F_DBL (double precision), MT_F_SCPL (single complex), MT_F_DCPL (double complex)
30. New Interface for Creating Arrays (C)
- int GA_Create_handle()
- void GA_Set_data(int g_a, int dim, int dims[], int type)
- void GA_Set_array_name(int g_a, char *name)
- void GA_Set_chunk(int g_a, int chunk[])
- void GA_Set_irreg_distr(int g_a, int map[], int nblock[])
- void GA_Set_ghosts(int g_a, int width[])
- void GA_Set_block_cyclic(int g_a, int dims[])
- void GA_Set_block_cyclic_proc_grid(int g_a, int dims[], int proc_grid[])
- int GA_Allocate(int g_a)

New Interface for Creating Arrays (Python)
- handle = ga.create_handle()
- ga.set_data(g_a, dims, type)
- ga.set_array_name(g_a, name)
- ga.set_chunk(g_a, chunk)
- ga.set_irreg_distr(g_a, map, nblock)
- ga.set_ghosts(g_a, width)
- ga.set_block_cyclic(g_a, dims)
- ga.set_block_cyclic_proc_grid(g_a, dims, proc_grid)
- bool ga.allocate(g_a)

New Interface for Creating Arrays (cont.)

   integer ndim, dims(2), chunk(2)
   integer g_a, g_b
   logical status
c
   ndim = 2
   dims(1) = 5000
   dims(2) = 5000
   chunk(1) = 100
   chunk(2) = 100
c
c  Create global array A using the old interface
c
   status = nga_create(MT_F_DBL, ndim, dims, 'array A', chunk, g_a)
c
c  Create global array B using the new interface
c
   g_b = ga_create_handle()
31. Atomic Accumulate
Accumulate combines the data from the local array with data in the global array section:
- Fortran: subroutine nga_acc(g_a, lo, hi, buf, ld, alpha)
- C:       void NGA_Acc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha)
- Python:  ga.acc(g_a, buffer, lo=None, hi=None, alpha=None)
Arguments: g_a (array handle, input); lo(), hi() (limits on the data block to be moved, input); buf (double precision/complex local buffer, input); ld() (array of strides for the local buffer, input); alpha (double precision/complex arbitrary scale factor, input).

Atomic Accumulate (cont.): the effect is ga(i,j) = ga(i,j) + alpha*buf(k), applied atomically to the global patch. (A minimal C sketch follows this item.)

Sync
- Sync is a collective operation.
- It acts as a barrier: it synchronizes all the processes and ensures that all outstanding Global Array operations are complete at the call.
- Fortran: subroutine ga_sync();  C: void GA_Sync();  Python: ga.sync()
- Sync is the main mechanism in GA for guaranteeing data consistency.
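A hedged C sketch of an atomic accumulate into a patch of an existing 2-D C_DBL global array g_a; the patch bounds, buffer contents, and scale factor are illustrative assumptions.

   double contrib[4][4];           /* locally computed contribution                  */
   int lo[2] = {0, 0};             /* accumulate into the 4x4 patch at the origin    */
   int hi[2] = {3, 3};
   int ld[1] = {4};
   double alpha = 1.0;             /* ga(i,j) += alpha * contrib                     */
   int i, j;

   for (i = 0; i < 4; i++)
     for (j = 0; j < 4; j++)
       contrib[i][j] = 1.0;        /* ... fill with this process's contribution ...  */

   NGA_Acc(g_a, lo, hi, contrib, ld, &alpha);  /* atomic: no explicit locking needed */
   GA_Sync();                                  /* make all contributions visible     */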
32. Scalability: GA Metadata is a Key Component
- GA currently allocates metadata for each global array in a replicated manner on each process.
- This is OK for now on petascale systems with O(10^5) processes: 200,000 x 8 bytes is about 1.5 MB per global array instance, and there are not that many global arrays in a typical application.
(Figure: each process holds n pointers to the other processes' global array portions, alongside its own local portion.)

Scalability: Proposed Metadata Overhead Reduction
- Share metadata between processes on the same shared-memory domain (today's node).
- This reduces metadata storage by the number of processes per shared-memory domain.

Summary
- Global Arrays supports a global address space: easy mapping between distributed data and the original problem formulation.
- One-sided communication: no need to coordinate between sender and receiver; random access patterns are easily programmed (load balancing).
- High performance (continued in item 7).
33. Increasing Scalability
- Design algorithms to minimize communication: exploit data locality; aggregate messages.
- Overlap computation and communication: on most high-end platforms, computation and communication use non-overlapping resources, so communication can occur simultaneously with computation (one-sided non-blocking communication and double buffering).
- Load balancing: static load balancing partitions the calculation at startup so that each processor has approximately equal work; dynamic load balancing restructures the algorithm to repartition work while the calculation is running. Note that dynamic load balancing generally increases the other source of parallel inefficiency, communication.

Outline of the Tutorial
- Overview of parallel programming
- Introduction to the Global Arrays programming model
- Basic GA commands
- Advanced features of the GA Toolkit
- Current and future developments in GA

Distributed Data vs. Shared Memory (continued in item 28).
34. Global Operations
- Fortran: subroutine ga_brdcst(type, buf, lenbuf, root);  subroutine ga_igop(type, x, n, op);  subroutine ga_dgop(type, x, n, op)
- C:       void GA_Brdcst(void *buf, int lenbuf, int root);  void GA_Igop(long x[], int n, char *op);  void GA_Dgop(double x[], int n, char *op)
- Python:  buffer = ga.brdcst(buffer, root);  ga.gop(x, op)

Global Array Model of Computations: copy data from the global array to a local buffer (get), compute, and update the global array (put/accumulate).

Locality Information
Discover the array elements held by each processor:
- Fortran: nga_distribution(g_a, proc, lo, hi);  C: void NGA_Distribution(int g_a, int proc, int lo[], int hi[]);  Python: lo, hi = ga.distribution(g_a, proc)
Arguments: g_a (array handle, input); proc (processor ID, input); lo(ndim), hi(ndim) (lower and upper indices, output).

   do iproc = 1, nproc
     write(6,*) 'Printing g_a info for processor', iproc
     call nga_distribution(g_a, iproc, lo, hi)
     do j = 1, ndim
       write(6,*) j, lo(j), hi(j)
     end do
   end do

(A minimal C sketch combining NGA_Distribution with direct local access follows this item.)

Example: Matrix Multiply
Determine which block of data is locally owned; note that the same block is locally owned for all GAs:
   NGA_Distribution(g_c, me, lo, hi);
Get the blocks from g_a and g_b needed to compute this block (continued in item 1).
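A hedged C sketch of the locality calls above: each process asks where its own patch lives, touches it in place through NGA_Access, and releases it. The array is assumed to be an existing 2-D C_DBL global array g_a; the in-place zeroing is an illustrative operation.

   int me = GA_Nodeid();
   int lo[2], hi[2], ld[1];
   double *ptr;
   int i, j, ncols;

   NGA_Distribution(g_a, me, lo, hi);   /* which patch do I own?                 */
   NGA_Access(g_a, lo, hi, &ptr, ld);   /* direct pointer into the local patch   */

   ncols = ld[0];                       /* leading dimension of the local patch  */
   for (i = 0; i <= hi[0]-lo[0]; i++)
     for (j = 0; j <= hi[1]-lo[1]; j++)
       ptr[i*ncols + j] = 0.0;          /* operate on locally held data in place */

   NGA_Release_update(g_a, lo, hi);     /* data was modified; release the patch  */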
35. (One-sided vs. Message Passing, continued)
- Algorithms are asynchronous (but one-sided communication can also be used for synchronous algorithms).
- Data consistency must be explicitly managed.

Global Array Model of Computations: copy data from the global array to a local buffer (get), compute, and update the global array (put/accumulate).

Global Arrays vs. Other Models
Advantages:
- Inter-operates with MPI: use the more convenient global shared view for multi-dimensional arrays, but use the MPI model wherever needed.
- Data locality and granularity control are explicit with GA's get-compute-put model, unlike the non-transparent communication overheads of other models (except MPI).
- The library-based approach does not rely upon smart compiler optimizations to achieve high performance.
Disadvantage:
- Data consistency must be explicitly managed.

Global Arrays (cont.)
- Shared-data model in the context of distributed dense arrays
- Much simpler than message passing for many applications
- Complete environment for parallel code development
- Compatible with MPI
- Data locality control similar to the distributed-memory/message-passing model
- Extensible and scalable

Data Locality in GA: what data does a processor own? NGA_Distribution(g_a, iproc, lo, hi) (continued in item 11).
36. (TASCEL, continued)
- Tasks operate on data stored in PGAS (Global Arrays).
- Tasks are executed in collective task-parallel phases.
- The TASCEL runtime system manages task execution: load balancing, locality optimization, etc.
- TASCEL extends the Global Arrays execution model.

Global Pointer Arrays
- Create arrays where each array element can be an arbitrary data object. (This may be more limited in Fortran, where each array object might need to be restricted to an arbitrarily sized array of some type.)
- Access blocks of array elements, or single elements, and copy them into local buffers using the standard put/get syntax.
- Potential applications: block-sparse matrices, embedded refined grids, recursive data structures.
(Figures: a pointer array whose elements reference separately stored data blocks.)

Fault Tolerance
(Figure: layered infrastructure with the application/domain science on top of Global Arrays and TCGMSG, a fault-tolerant ARMCI and barrier, a fault-resilient process manager, and fault-tolerance management infrastructure over the network, using non-MPI message passing.)
Fault Tolerance (cont.) -- continued in item 6.
37. Creating/Destroying Arrays
To create an array with a regular distribution:
- Fortran: logical function nga_create(type, ndim, dims, name, chunk, g_a)
- C:       int NGA_Create(int type, int ndim, int dims[], char *name, int chunk[])
- Python:  g_a = ga.create(type, dims, name, int pgroup=-1, chunk=None)
Arguments: name (a unique character string, input); type (GA data type, input); dims() (array dimensions, input); chunk() (minimum size that dimensions should be chunked into, input); g_a (array handle for future references, output).

   dims(1) = ...
   dims(2) = ...
   chunk(1) = ...   ! use defaults
   chunk(2) = ...
   if (.not.nga_create(MT_F_DBL, 2, dims, 'Array A', chunk, g_a))
  +   call ga_error('Could not create global array A', g_a)

(A hedged C version of this call follows this item.)

Creating/Destroying Arrays (cont.)
To create an array with an irregular distribution:
- Fortran: logical function nga_create_irreg(type, ndim, dims, array_name, map, nblock, g_a)
- C:       int NGA_Create_irreg(int type, int ndim, int dims[], char *array_name, int nblock[], int map[])
- Python:  g_a = ga.create_irreg(int gtype, dims, block, map, name, pgroup=-1)
Arguments: name (a unique character string, input); type (GA data type, input); dims() (array dimensions, input); nblock() (number of blocks each dimension is divided into, input); map() (starting index for each block, input); g_a (integer handle for future references, output).
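A hedged C counterpart of the creation call above. The dimensions and array name mirror the Fortran example; passing -1 for the chunk entries asks GA to pick the distribution, and treating a zero return handle as failure is an assumption (the Fortran interface reports failure through its logical status).

   int dims[2]  = {5000, 5000};   /* global 5000x5000 array     */
   int chunk[2] = {-1, -1};       /* let GA choose the blocking */
   int g_a;

   g_a = NGA_Create(C_DBL, 2, dims, "Array A", chunk);
   if (!g_a) GA_Error("Could not create global array A", 0);

   /* ... use the array ... */
   GA_Destroy(g_a);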
38. Data Mapping Information (continued)
To determine the process ID that owns the element defined by the array subscripts:
- Fortran: logical function nga_locate(g_a, subscript, owner)
- C:       int NGA_Locate(int g_a, int subscript[])
- Python:  proc = ga.locate(g_a, subscript)
Arguments: g_a (array handle, input); subscript(ndim) (element subscript, input); owner (process ID, output). (Figure: an example element mapping to owner = 5.)

Data Mapping Information (cont.)
To return a list of process IDs that own a patch:
- Fortran: logical function nga_locate_region(g_a, lo, hi, map, proclist, np)
- C:       int NGA_Locate_region(int g_a, int lo[], int hi[], int map[], int procs[])
- Python:  map, procs = ga.locate_region(g_a, lo, hi)
Arguments: np (number of processors that own a portion of the block, output); g_a (global array handle, input); ndim (number of dimensions of the global array); lo(ndim), hi(ndim) (starting and ending indices of the array section, input); map(2*ndim,*) (array with mapping information, output); procs(np) (list of processes that own a part of the array section, output). (Figure: an example patch with its owning processes and the lo/hi sub-block indices returned in map.)
39. (Compiling and Linking GA Programs, continued) Suggested compiler/linker options are printed by "make flags", for example:

   GA libraries are installed in /Users/d3n000/ga/ga-5-0/bld_openmpi_shared/lib
   GA headers are installed in /Users/d3n000/ga/ga-5-0/bld_openmpi_shared/include
   CPPFLAGS=-I/Users/d3n000/ga/ga-5-0/bld_openmpi_shared/include
   LDFLAGS=-L/Users/d3n000/ga/ga-5-0/bld_openmpi_shared/lib
   For Fortran programs: FFLAGS=-fdefault-integer-8  LIBS=-lga -framework Accelerate
   For C programs:       CFLAGS=  LIBS=-lga -framework Accelerate -L/usr/local/lib/gcc/x86_64-apple-darwin10/4.6.0 -lgfortran

You can use these variables in your Makefile, for example:
   gcc $(CPPFLAGS) $(LDFLAGS) -o ga_test ga_test.c $(LIBS)

Writing GA Programs
GA definitions and data types: C programs include the files ga.h and macdecls.h; Fortran programs should include the files mafdecls.fh and global.fh.

   #include <stdio.h>
   #include "mpi.h"
   #include "ga.h"
   #include "macdecls.h"

   int main(int argc, char **argv) {
       /* parallel program */
   }

Running GA Programs
Example: running a test program ga_test on 2 processes, for GA built using the MPI runtime:
   mpirun -np 2 ga_test
Running a GA program is the same as running an MPI program.
40. (Creating Arrays with Ghost Cells, continued)
For arrays with an irregular distribution:
- Fortran: logical function nga_create_ghosts_irreg(type, dims, width, array_name, map, block, g_a)
- C:       int NGA_Create_ghosts_irreg(int type, int ndim, int dims[], int width[], char *array_name, int map[], int block[])
- Python:  g_a = ga.create_ghosts_irreg(type, dims, width, block, map, name, pgroup=-1)
where integer width(ndim) is the array of ghost-cell widths (input). (Figure: visible data vs. ghost-cell data.)

Ghost Cells (figure: normal global array vs. global array with ghost cells)
Operations:
- NGA_Create_ghosts: creates an array with ghost cells
- GA_Update_ghosts: updates ghost cells with data from adjacent processors
- NGA_Access_ghosts: provides access to the local ghost-cell elements
- NGA_Nbget_ghost_dir: non-blocking call to update ghost cells along a given direction

Ghost Cell Update: automatically updates ghost cells with the appropriate data from neighboring processors. A multiprotocol implementation has been used to optimize the update operation to match platform characteristics. (A minimal C sketch follows this item.)

Periodic Interfaces: periodic interfaces to the one-sided operations were added to Global Arrays in version 3.1 to support computational fluid dynamics problems on multidimensional grids (continued in item 23).
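A hedged C sketch of creating a 2-D array with a one-element ghost region and refreshing the ghosts; the grid size and ghost width are illustrative assumptions.

   int dims[2]  = {512, 512};     /* global grid                                    */
   int width[2] = {1, 1};         /* one layer of ghost cells in each dimension     */
   int chunk[2] = {-1, -1};       /* default blocking                               */
   int g_grid;

   g_grid = NGA_Create_ghosts(C_DBL, 2, dims, width, "grid", chunk);

   /* ... each process updates its locally owned patch (e.g. via NGA_Access) ...    */

   GA_Update_ghosts(g_grid);      /* pull boundary data from neighboring processes  */
   GA_Sync();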
41. (Fence, continued) Typical usage:
   ga_init_fence()
   ga_put(g_a, ...)
   ga_fence()
The initialize-fence functions are:
- Fortran: subroutine ga_init_fence();  C: void GA_Init_fence();  Python: ga.init_fence()
The fence functions are:
- Fortran: subroutine ga_fence();  C: void GA_Fence();  Python: ga.fence()

Synchronization Control in Collective Operations
To eliminate redundant synchronization points:
- Fortran: subroutine ga_mask_sync(prior_sync_mask, post_sync_mask)
- C:       void GA_Mask_sync(int prior_sync_mask, int post_sync_mask)
- Python:  ga.mask_sync(prior_sync_mask, post_sync_mask)
where the first argument (logical, input) masks (0/1) the prior internal synchronization and the second (logical, input) masks (0/1) the post internal synchronization. In the slide example, ga_mask_sync is called between two collective calls to drop one of the redundant internal synchronizations.

Linear Algebra
To add two arrays:
- Fortran: subroutine ga_add(alpha, g_a, beta, g_b, g_c)
- C:       void GA_Add(void *alpha, int g_a, void *beta, int g_b, int g_c)
- Python:  ga.add(g_a, g_b, g_c, alpha=None, beta=None, alo=None, ahi=None, blo=None, bhi=None, clo=None, chi=None)
To multiply arrays:
- Fortran: subroutine ga_dgemm(transa, transb, m, n, k, alpha, g_a, g_b, beta, g_c)
- C:       void GA_Dgemm(char ta, char tb, int m, int n, int k, double alpha, int g_a, int g_b, double beta, int g_c)
(Argument descriptions continue in item 20. A hedged C usage sketch of GA_Dgemm follows this item.)
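A hedged C sketch of a distributed matrix multiply C = A*B with GA_Dgemm. The matrix order, fill values, and handle names are illustrative assumptions, and passing NULL for chunk to request the default distribution is also an assumption.

   int n = 1000;
   int dims[2] = {n, n};
   int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);
   int g_b = NGA_Create(C_DBL, 2, dims, "B", NULL);
   int g_c = NGA_Create(C_DBL, 2, dims, "C", NULL);
   double one = 1.0;

   GA_Fill(g_a, &one);                 /* A(i,j) = 1                               */
   GA_Fill(g_b, &one);                 /* B(i,j) = 1                               */

   /* C = 1.0 * A * B + 0.0 * C ; 'N','N' means neither operand is transposed      */
   GA_Dgemm('N', 'N', n, n, n, 1.0, g_a, g_b, 0.0, g_c);

   GA_Destroy(g_a);
   GA_Destroy(g_b);
   GA_Destroy(g_c);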
42. Processor Groups (cont.)
To determine the handle for a standard group at any point in the program:
- Fortran: integer function ga_pgroup_get_default();  integer function ga_pgroup_get_mirror();  integer function ga_pgroup_get_world()
- C:       int GA_Pgroup_get_default();  int GA_Pgroup_get_mirror();  int GA_Pgroup_get_world()
- Python:  p_handle = ga.pgroup_get_default();  p_handle = ga.pgroup_get_mirror();  p_handle = ga.pgroup_get_world()

Default Processor Group (figure)

MD Application on Groups (figures: scaling of a single parallel task vs. scaling of parallel MD tasks on groups; speedup vs. number of processors against the ideal curve, with the grouped runs scaling to roughly 1000 processors)

Creating Arrays with Ghost Cells
To create arrays with ghost cells, for arrays with a regular distribution:
- Fortran: logical function nga_create_ghosts(type, dims, width, array_name, chunk, g_a)
- C:       int NGA_Create_ghosts(int type, int ndim, int dims[], int width[], char *array_name, int chunk[])
- Python:  g_a = ga.create_ghosts(type, dims, width, name, chunk=None, pgroup=-1)
(Continued in item 40.)
43. Creating/Destroying Arrays (cont.) -- example of an irregular distribution
The distribution is specified as a Cartesian product of distributions for each dimension, and the array indices start at 1. The figure demonstrates the distribution of a 2-dimensional 8x10 array on 6 (or more) processors, with block = (3,2); the map array has size 5 and contains the elements (1,3,7,1,6). The distribution is nonuniform because P1 and P4 get 20 elements each while processors P0, P2, P3 and P5 get only 10 elements each.

   block(1) = 3
   block(2) = 2
   map(1) = 1
   map(2) = 3
   map(3) = 7
   map(4) = 1
   map(5) = 6
   if (.not.nga_create_irreg(MT_F_DBL, 2, dims, 'Array A', map, block, g_a))
  +   call ga_error('Could not create array A', g_a)

Creating/Destroying Arrays (cont.)
To duplicate an array:
- Fortran: logical function ga_duplicate(g_a, g_b, name);  C: int GA_Duplicate(int g_a, char *name);  Python: ga.duplicate(g_a, name)
Global arrays can be destroyed by calling:
- Fortran: subroutine ga_destroy(g_a);  C: void GA_Destroy(int g_a);  Python: ga.destroy(g_a)
44. Outline of the Tutorial
- Overview of parallel programming
- Introduction to the Global Arrays programming model
- Basic GA commands
- Advanced features of the GA Toolkit
- Current and future developments in GA

Scatter/Gather
Scatter puts array elements into a global array:
- Fortran: subroutine nga_scatter(g_a, v, subscrpt_array, n);  C: void NGA_Scatter(int g_a, void *v, int *subscrpt_array[], int n);  Python: ga.scatter(g_a, values, subsarray)
Gather gets array elements from a global array into a local array:
- Fortran: subroutine nga_gather(g_a, v, subscrpt_array, n);  C: void NGA_Gather(int g_a, void *v, int *subscrpt_array[], int n);  Python: values = ga.gather(g_a, subsarray, numpy.ndarray values=None)
Arguments: g_a (array handle, input); v(n) (array of values, input/output); n (number of values, input); subscrpt_array (locations of the values in the global array, input).

Scatter/Gather (cont.) -- example of a scatter operation: scatter 5 elements into a 10x10 global array:
- Element 1: v[0] = 5, subsArray[0][0] = 2, subsArray[0][1] = 3
- Element 2: v[1] = 3, subsArray[1][0] = 3, subsArray[1][1] = 4
- Element 3: v[2] = 8, subsArray[2][0] = 8, subsArray[2][1] = 5
- Element 4: (continued in item 46)
45. Lock and Mutex (continued)
To enter a critical section, typically one needs to:
  1. create mutexes
  2. lock a mutex
  3. do the exclusive operation in the critical section
  4. unlock the mutex
  5. destroy the mutexes
The create-mutex functions are:
- Fortran: logical function ga_create_mutexes(number);  C: int GA_Create_mutexes(int number);  Python: bool ga.create_mutexes(number)

Lock and Mutex (cont.) (figure: lock ... unlock)

Lock and Mutex (cont.)
The destroy-mutex functions are:
- Fortran: logical function ga_destroy_mutexes();  C: int GA_Destroy_mutexes();  Python: bool ga.destroy_mutexes()
The lock and unlock functions are:
- Fortran: subroutine ga_lock(mutex);  subroutine ga_unlock(mutex)
- C:       void GA_Lock(int mutex);  void GA_Unlock(int mutex)
- Python:  ga.lock(mutex);  ga.unlock(mutex)
where integer mutex (input) is the mutex ID. (A minimal C critical-section sketch follows this item.)

Fence
Fence blocks the calling process until all the data transfers corresponding to the Global Array operations initiated by this process complete. For example, since ga_put might return before the data reaches its final destination, ga_init_fence and ga_fence allow a process to wait until the data transfer is fully completed (continued in item 41).
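A hedged C sketch of the mutex pattern above, serializing a read-modify-write of one element of a global array. The patch, value, and use of a single mutex are illustrative assumptions; for simple element updates, the atomic NGA_Acc shown in item 31 is usually preferable.

   if (!GA_Create_mutexes(1))             /* one mutex, created collectively       */
       GA_Error("failed to create mutexes", 0);

   GA_Lock(0);                            /* enter the critical section            */
   {
       /* exclusive read-modify-write of an element that several processes share   */
       int lo[2] = {0, 0}, hi[2] = {0, 0}, ld[1] = {1};
       double x;
       NGA_Get(g_a, lo, hi, &x, ld);
       x += 1.0;
       NGA_Put(g_a, lo, hi, &x, ld);
   }
   GA_Unlock(0);                          /* leave the critical section            */

   GA_Sync();
   if (!GA_Destroy_mutexes())
       GA_Error("failed to destroy mutexes", 0);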
46. (Scatter example, continued)
- Element 4: v[3] = 7, subsArray[3][0] = 3, subsArray[3][1] = 7
- Element 5: v[4] = 2, subsArray[4][0] = 6, subsArray[4][1] = 3
After the scatter operation the five elements are scattered into the global array as shown in the figure.

   integer subscript(ndim, nlen)
   call nga_scatter(g_a, v, subscript, nlen)

Read and Increment
Read_inc remotely updates a particular element in an integer global array and returns the original value:
- Fortran: integer function nga_read_inc(g_a, subscript, inc)
- C:       long NGA_Read_inc(int g_a, int subscript[], long inc)
- Python:  val = ga.read_inc(g_a, subscript, inc=1)
It applies to integer arrays only and can be used as a global counter for dynamic load balancing. Arguments: g_a (input); subscript(ndim), inc (input).

Read and Increment (cont.) -- using a global counter for task distribution:

   c Create task counter
   status = nga_create(MT_F_INT, one, one, 'counter', chunk, g_counter)
   call ga_zero(g_counter)
   ...
   c Read and increment the counter, then translate itask into a task
   itask = nga_read_inc(g_counter, one, one)

Access to the counter element is serialized by a global lock, so every integer value is read once and only once by some processor. (A hedged C sketch of this pattern follows this item.)

Hartree-Fock SCF: obtain variational solutions to the electronic Schroedinger equation, H Psi = E Psi, within the approximation of a single Slater determinant (continued in item 2).
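A hedged C sketch of the global-counter pattern above for dynamic load balancing. NTASKS and do_task are hypothetical names standing in for the application's task count and task body; the one-element counter array is the same idea as the Fortran example.

   int  one_dim[1] = {1};
   int  zero[1]    = {0};
   long itask;
   int  g_counter = NGA_Create(C_LONG, 1, one_dim, "counter", NULL);

   GA_Zero(g_counter);
   GA_Sync();

   /* every process pulls the next task index until the work runs out */
   while ((itask = NGA_Read_inc(g_counter, zero, 1)) < NTASKS) {
       /* do_task(itask);   -- hypothetical task body */
   }

   GA_Sync();
   GA_Destroy(g_counter);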
47. (GA Crosses the Petaflop Barrier, continued; figure: floating-point performance vs. number of processors up to about 225,000 cores, measured on ORNL's Jaguar at Oak Ridge National Laboratory.)

Direct Access to Local Data
- Global Arrays support the abstraction of a distributed array object.
- The object is represented by an integer handle.
- A process can access its own portion of the data in the global array.
- To do this, the following steps are needed: find the distribution of the array (i.e., which part of the data the calling process owns); access the data; operate on the data (read/write); release the access to the data.

Access
To provide direct access to local data in the specified patch of the array owned by the calling process:
- Fortran: subroutine nga_access(g_a, lo, hi, index, ld)
- C:       void NGA_Access(int g_a, int lo[], int hi[], void *ptr, int ld[])
- Python:  ndarray = ga.access(g_a, lo=None, hi=None)
Processes access their local portion of the global array in place (e.g., process 0 can access the specified patch of its local portion of the array), which avoids a memory copy.

Access (cont.) (figure)
