
Global Arrays with hands-on time


Contents

1. Transpose Example (Fortran). Invert the data globally by copying the locally inverted blocks into their inverted positions in the GA: lo2(1) = dims(1) - hi(1) + 1, hi2(1) = dims(1) - lo(1) + 1, followed by call nga_put(g_b, lo2, hi2, b, ld). Synchronize all processors with ga_sync to make sure the inversion is complete, then check the result (if (me.eq.0) call verify(g_a, g_b)) and deallocate the arrays with ga_destroy(g_a) and ga_destroy(g_b). The example declares integer dims(3), chunk(3), lo(3), hi(3), lo1(3), hi1(3), lo2(3), hi2(3), ld(3), the handles g_a and g_b, and the local buffers a(MAXPROC*TOTALELEMS) and b(MAXPROC*TOTALELEMS); sets heap = 300000 and stack = 300000; initializes the communication library (call mpi_init(ierr) when USE_MPI is defined, otherwise call pbeginf); and initializes the GA library with call ga_initialize(). It then finds the local processor ID and the number of processors (me = ga_nodeid(), nprocs = ga_nnodes()), allocates memory for the MA library (status = ma_init(MT_F_DBL, stack/nprocs, heap/nprocs)), configures the array dimensions to force an unequal data distribution (dims(1) = nprocs*TOTALELEMS + nprocs/2, ld(1) = MAXPROC*TOTALELEMS, chunk(1) = TOTALELEMS, the minimum data on each processor), and creates the global array with status = nga_create(MT_F_INT, NDIM, dims, 'Array A', chunk, g_a).
2. The GA programming model is very simple: most of a parallel program can be written with these basic calls — GA_Initialize, GA_Terminate, GA_Nnodes, GA_Nodeid, GA_Create, GA_Destroy, GA_Put, GA_Get, GA_Sync. GA Initialization/Termination: there are two functions to initialize GA. Fortran: subroutine ga_initialize() and subroutine ga_initialize_ltd(limit); C: void GA_Initialize() and void GA_Initialize_ltd(size_t limit), where limit is the amount of memory in bytes per process (input). A minimal program includes mafdecls.fh, calls mpi_init(ierr), then ga_initialize(), and prints 'Hello world'. To terminate a GA program: Fortran subroutine ga_terminate(); C void GA_Terminate(). Parallel Environment / Process Information: the parallel environment describes how many processes are working together (size) and what their IDs are (ranging from 0 to size-1). To return the process ID of the current process: Fortran integer function ga_nodeid(); C int GA_Nodeid(). To determine the number of computing processes: Fortran integer function ga_nnodes(); C int GA_Nnodes(). The example program includes mafdecls.fh and global.fh, declares ierr, me, nproc, and calls mpi_init(ierr) followed by ga_initialize().
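A minimal C sketch of the limited-memory initialization named above (GA_Initialize_ltd); the 16 MB per-process limit is an illustrative value, not something mandated by GA.

    #include <stdio.h>
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        /* Limit GA to roughly 16 MB of memory per process (illustrative value). */
        GA_Initialize_ltd(16 * 1024 * 1024);
        printf("Hello from process %d of %d\n", GA_Nodeid(), GA_Nnodes());
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }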
3. Patch Operations. The copy-patch operation: Fortran subroutine nga_copy_patch(trans, g_a, alo, ahi, g_b, blo, bhi); C void NGA_Copy_patch(char trans, int g_a, int alo[], int ahi[], int g_b, int blo[], int bhi[]). The number of elements in the two patches must match. To set only the region defined by lo and hi to zero: Fortran subroutine nga_zero_patch(g_a, lo, hi); C void NGA_Zero_patch(int g_a, int lo[], int hi[]). To assign a single value to all the elements in a patch: Fortran subroutine nga_fill_patch(g_a, lo, hi, val); C void NGA_Fill_patch(int g_a, int lo[], int hi[], void *val). To scale the patch defined by lo and hi by the factor val: Fortran subroutine nga_scale_patch(g_a, lo, hi, val); C void NGA_Scale_patch(int g_a, int lo[], int hi[], void *val). Scatter/Gather: scatter puts array elements into a global array — Fortran subroutine nga_scatter(g_a, v, subscrpt_array, n); C void NGA_Scatter(int g_a, void *v, int *subscrpt_array[], int n). Gather gets array elements from a global array into a local array — Fortran subroutine nga_gather(g_a, v, subscrpt_array, n); C void NGA_Gather(int g_a, void *v, int *subscrpt_array[], int n).
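A small C sketch of the patch operations listed above; the handle g_a is assumed to be an existing 2-D double array, and the patch bounds and values are illustrative.

    #include "ga.h"

    /* Zero, fill, and scale a sub-block of an existing 2-D double array g_a.
       GA and MPI are assumed to be initialized already. */
    void patch_demo(int g_a) {
        int lo[2] = {2, 2}, hi[2] = {5, 5};       /* patch bounds (zero-based in C) */
        double val = 3.14, factor = 2.0;

        NGA_Zero_patch(g_a, lo, hi);              /* clear the patch            */
        NGA_Fill_patch(g_a, lo, hi, &val);        /* set every element to 3.14  */
        NGA_Scale_patch(g_a, lo, hi, &factor);    /* multiply the patch by 2.0  */
    }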
4. Processor Groups (cont.). To determine the handle for a standard group at any point in the program: Fortran integer function ga_pgroup_get_default(), integer function ga_pgroup_get_mirror(), integer function ga_pgroup_get_world(); C int GA_Pgroup_get_default(), int GA_Pgroup_get_mirror(), int GA_Pgroup_get_world(). Typical usage: p_a = ga_pgroup_create(list, nproc); call ga_pgroup_set_default(p_a); call parallel_task(); call ga_pgroup_set_default(ga_pgroup_get_world()). Inside parallel_task a subgroup can be created in the same way: p_b = ga_pgroup_create(new_list, new_nproc); call ga_pgroup_set_default(p_b); call parallel_subtask(). Lock and Mutex: lock works together with mutex; it is a simple synchronization mechanism used to protect a critical section. To enter a critical section, typically one needs to create the mutexes, lock a mutex, do the exclusive operation in the critical section, unlock the mutex, and destroy the mutexes. The create-mutex functions are Fortran logical function ga_create_mutexes(number); C int GA_Create_mutexes(int number), where number is the number of mutexes in the mutex array (input). The destroy-mutex functions are Fortran logical function ga_destroy_mutexes(); C int GA_Destroy_mutexes(). The lock and unlock functions are Fortran subroutine ga_lock(mutex) and subroutine ga_unlock(mutex); C void GA_lock(int mutex) and void GA_unlock(int mutex), where mutex (integer, input) is the mutex ID. Fence: a fence blocks the calling process until all the data transfers corresponding to GA operations initiated by this process complete (see item 25).
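A hedged C sketch of the create/lock/unlock/destroy sequence described above, protecting a critical section with a single mutex (id 0, an illustrative choice); capitalization follows the usual GA C convention (GA_Lock/GA_Unlock).

    #include "ga.h"

    /* Protect a critical section with a GA mutex.  Mutex creation and
       destruction are collective: every process must make these calls. */
    void critical_update(void) {
        if (!GA_Create_mutexes(1))
            GA_Error("mutex creation failed", 1);
        GA_Lock(0);
        /* ... exclusive work, e.g. update a shared counter or log ... */
        GA_Unlock(0);
        GA_Destroy_mutexes();
    }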
5. Scatter/Gather (cont.). Arguments: integer g_a — array handle (input); double precision v(n) — array of values (input/output); integer n — number of values (input); integer subscrpt_array — locations of the values in the global array (input). Example of a scatter operation: scatter 5 elements into a 10x10 global array. Each value v(i) is paired with a two-element subscript subsArray(i,0:1) giving its destination; for example v(1)=3 goes to element (3,4), v(2)=8 goes to (8,5), v(3)=7 goes to (3,7), and v(4) goes to (6,3). After the scatter operation the five elements are placed into the global array as shown in the figure; the call is call nga_scatter(g_a, v, subscript, nlen). Read and Increment: nga_read_inc remotely updates a particular element in an integer global array and returns the original value. Fortran: integer function nga_read_inc(g_a, subscript, inc); C: long NGA_Read_inc(int g_a, int subscript[], long inc). It applies to integer arrays only and can be used as a global counter for dynamic load balancing. Example — create a task counter: call nga_create(MT_F_INT, one, one, chunk, g_counter); call ga_zero(g_counter); itask = nga_read_inc(g_counter, one, one).
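A C sketch of the task-counter pattern described above, using a one-element integer GA as a shared counter for dynamic load balancing; the work itself (do_task) is left as an illustrative placeholder.

    #include "ga.h"
    #include "macdecls.h"

    /* Use a 1-element integer GA as a shared task counter.  Each task index
       is claimed by exactly one process via the atomic read-and-increment. */
    void process_tasks(int ntasks) {
        int one = 1, subscript = 0;
        int g_counter = NGA_Create(C_INT, 1, &one, "task counter", NULL);
        GA_Zero(g_counter);

        long itask;
        while ((itask = NGA_Read_inc(g_counter, &subscript, 1)) < ntasks) {
            /* do_task(itask);  -- illustrative placeholder for real work */
        }
        GA_Sync();
        GA_Destroy(g_counter);
    }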
6. Transpose Example (Fortran, cont.). The verification loop prints a mismatch message (write(6,111) i, a(i), b(dims(1)-i+1); 111 format('Mismatch at ', 3i8)) and increments ichk for every bad element; if ichk.eq.0 and me.eq.0 it prints 'Transpose OK'. The program then deallocates memory for the arrays and cleans up the GA library: if (me.eq.0) write(6,*) 'Terminating'; status = ga_destroy(g_a); status = ga_destroy(g_b); call ga_terminate(); followed by mpi_finalize (or pend when TCGMSG is used), stop, and end. Matrix Multiply Example (C). The example declares int dims[NDIM], chunk[NDIM], ld[NDIM], the index arrays lo/hi, lo1/hi1, lo2/hi2, lo3/hi3, and the handles g_a, g_b, g_c; finds the local processor ID and the number of processors (me = GA_Nodeid(), nprocs = GA_Nnodes()); and configures the array dimensions to force an unequal data distribution (for each dimension i: dims[i] = TOTALELEMS, ld[i] = dims[i], chunk[i] = TOTALELEMS/nprocs - 1, the minimum block size on each process). Matrix Multiply Example (C, cont.): create a global array g_a and duplicate it to get g_b and g_c — g_a = NGA_Create(C_DBL, NDIM, dims, "array A", chunk); if (!g_a) GA_Error("create failed: A", NDIM); g_b = GA_Duplicate(g_a, "array B"); g_c = GA_Duplicate(g_a, "array C"); if (!g_b || !g_c) GA_Error("duplicate failed", NDIM); with process 0 printing "Created Array A" and "Created Arrays B and C" before initializing the data in matrices a and b.
7. Creating/Destroying Arrays (cont.). To duplicate an array: Fortran logical function ga_duplicate(g_a, g_b, name); C int GA_Duplicate(int g_a, char *name). Global arrays can be destroyed by calling: Fortran subroutine ga_destroy(g_a); C void GA_Destroy(int g_a). Example: call nga_create(MT_F_INT, dim, dims, 'array_a', chunk, g_a); call ga_duplicate(g_a, g_b, 'array_b'); call ga_destroy(g_a). Arguments: g_a — integer handle of the reference array (input); g_b — integer handle of the new array (output); name — a character string (input). Put/Get. Put copies data from a local array into a global array section: Fortran subroutine nga_put(g_a, lo, hi, buf, ld); C void NGA_Put(int g_a, int lo[], int hi[], void *buf, int ld[]). Get copies data from a global array section into a local array: Fortran subroutine nga_get(g_a, lo, hi, buf, ld); C void NGA_Get(int g_a, int lo[], int hi[], void *buf, int ld[]). Arguments: g_a — global array handle (integer, input); lo, hi — limits on the data block to be moved (integer, input); buf — local buffer (double precision/complex/integer; output for get, input for put); ld — array of strides for the local buffer (integer, input). Both operate directly on the shared object. Put/Get example: transfer data from a local buffer (a 10x10 array) into the (7:15, 1:8) section of a two-dimensional 15x10 global array — lo = (7,1), hi = (15,8), ld = 10, double precision buf(10,10), call nga_put(g_a, lo, hi, buf, ld).
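A C sketch of one-sided put/get on a section of a 2-D global array, mirroring the slide's Fortran example but shifted to C's zero-based indexing; the 15x10 shape and the section bounds are taken from that example, the rest is illustrative.

    #include "ga.h"

    /* Fetch a 9x8 block of a 2-D double array, modify it locally, and
       write it back with one-sided put/get. */
    void put_get_demo(int g_a) {
        double buf[9][8];                     /* local buffer for the block      */
        int lo[2] = {6, 0}, hi[2] = {14, 7};  /* rows 7..15, cols 1..8 (1-based) */
        int ld[1] = {8};                      /* leading dimension of buf        */

        NGA_Get(g_a, lo, hi, buf, ld);        /* copy the section into buf       */
        buf[0][0] += 1.0;                     /* ... work on the local copy ...  */
        NGA_Put(g_a, lo, hi, buf, ld);        /* write the block back            */
    }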
8. Programming-model comparison (cont.). Message passing is tedious because of the partitioned, local-only view of the data. OpenMP: uniform global view of shared data; available for Fortran, C, and C++; work-sharing constructs (parallel loops and sections) and the globally shared data view ease program development; disadvantage — data-locality issues are obscured by the programming model. Co-Array Fortran: partitioned but globally addressable data view; SPMD programming model with local (private) and shared variables; shared variables carry an additional co-array dimension mapped to the process space, so each process can directly access array elements in the space of other processes (e.g. A(1,J) locally versus A(1,J)[me-1] on another image); compiler optimization of communication is critical to performance, but all non-local access is explicit. Unified Parallel C (UPC): SPMD programming model with a globally shared view for arrays as well as pointer-based data structures; compiler optimizations are critical for controlling inter-processor communication overhead — a very challenging problem, since local versus remote access is not explicit in the syntax (unlike Co-Array Fortran) and the linearization of multidimensional arrays makes compiler optimization of communication very difficult. Global Arrays vs. other models — advantages: GA inter-operates with MPI, so you can use the more convenient globally shared view for multi-dimensional arrays while falling back to the MPI model wherever needed (the list of advantages continues in item 33).
9. Linear Algebra (cont.) — patches. C: void GA_Transpose(int g_a, int g_b) transposes a matrix (completing the whole-array list of item 27). To add, element-wise, two patches and save the result into another patch: Fortran subroutine nga_add_patch(alpha, g_a, alo, ahi, beta, g_b, blo, bhi, g_c, clo, chi); C void NGA_Add_patch(void *alpha, int g_a, int alo[], int ahi[], void *beta, int g_b, int blo[], int bhi[], int g_c, int clo[], int chi[]). Arguments: g_a, g_b, g_c — array handles (input); alpha, beta — scale factors (double precision/complex/integer, input); ailo, aihi, ajlo, ajhi — g_a patch coordinates (input); bilo, bihi, bjlo, bjhi — g_b patch coordinates (input); cilo, cihi, cjlo, cjhi — g_c patch coordinates (input). To perform matrix multiplication on patches: Fortran subroutine ga_matmul_patch(transa, transb, alpha, beta, g_a, ailo, aihi, ajlo, ajhi, g_b, bilo, bihi, bjlo, bjhi, g_c, cilo, cihi, cjlo, cjhi); C void GA_Matmul_patch(char transa, char transb, void *alpha, void *beta, int g_a, int ailo, int aihi, int ajlo, int ajhi, int g_b, int bilo, int bihi, int bjlo, int bjhi, int g_c, int cilo, int cihi, int cjlo, int cjhi). Arguments: the patch coordinates of g_a, g_b, and g_c (input); alpha, beta — scale factors (double precision/complex, input); transa, transb — character*1 transpose flags (input). The element-wise dot product of two patches is covered in item 35.
10. Data locality (cont.) and structure of GA. NGA_Distribution(g_a, iproc, lo, hi) answers "where is the data?", and NGA_Access(g_a, lo, hi, ptr, ld) returns a pointer to it; use this information to organize the calculation so that maximum use is made of locally held data. Structure of GA (figure): the application programming layer (with F90, Java, and other language bindings) sits on top of the Global Arrays layer and MPI, which are completely interoperable — code can contain calls to both; underneath, ARMCI provides portable one-sided communication operations (put, get, locks, etc.) over system-specific interfaces and libraries (LAPI, GM/Myrinet, threads, VIA). Disk Resident Arrays: extend the GA model to disk (similar to Panda from U. Illinois, but with higher-level APIs); provide easy transfer of data between N-dimensional arrays stored on disk (disk resident arrays) and distributed arrays stored in memory (global arrays); use them when arrays are too big to store in core, for checkpoint/restart, and for out-of-core solvers. Application areas: electronic-structure chemistry (the major area), biology (organ simulation, bioinformatics), visual analytics, visualization and image analysis, smooth particle hydrodynamics and molecular dynamics, and others such as financial transaction security, forecasting, astrophysics, geosciences, and atmospheric chemistry (figure labels also mention immigration cases, border crossings, and shipping data). New applications: ScalaBLAST — C. Oehmen and J. Nieplocha, "ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis", IEEE Trans. Parallel and Distributed Systems, Vol. 17, No. 8, 2006.
11. Compiling and Linking GA Programs. Option 1 — the GA makefile in global/testing: to compile and link your GA-based program (for example app.c or app.f), copy it to GA_DIR/global/testing and type "make app.x" or "gmake app.x"; this compiles it like any test program in the GA testing directory with the appropriate compile/link flags. Option 2 — your own makefile: refer to the INCLUDES, FLAGS, and LIBS variables, which are printed at the end of a successful GA installation on your platform. For example: INCLUDES = -I../include -I/msrc/apps/mpich-1.2.6/gcc/ch_shmem/include; LIBS = -L/msrc/home/manoj/GA/cvs/lib/LINUX -lglobal -lma -llinalg -larmci -L/msrc/apps/mpich-1.2.6/gcc/ch_shmem/lib -lmpich -lm; for Fortran programs FLAGS = -g -Wall -funroll-loops -fomit-frame-pointer -malign-double -fno-second-underscore -Wno-globals; for C programs the FLAGS are analogous. You can use these variables in your makefile, for example: gcc $(INCLUDES) $(FLAGS) -o ga_test ga_test.c $(LIBS). NOTE: refer to chapter 2 of the GA user manual for more information. Running GA Programs: running a GA program is the same as running an MPI program; for example, to run the test program ga_test on 2 processes, mpirun -np 2 ga_test. Outline: Writing, Building and Running GA Programs — basic calls, intermediate calls, advanced calls; GA basic operations.
12. (Task-counter example, cont.: itask = nga_read_inc(g_counter, one, one), and the returned itask is translated into a task.) Creating Arrays with Ghost Cells. For arrays with a regular distribution: Fortran logical function nga_create_ghosts(type, dims, width, array_name, chunk, g_a); C int NGA_Create_ghosts(int type, int ndim, int dims[], int width[], char *array_name, int chunk[]). For arrays with an irregular distribution: Fortran logical function nga_create_ghosts_irreg(type, dims, width, array_name, map, block, g_a); C int NGA_Create_ghosts_irreg(int type, int ndim, int dims[], int width[], char *array_name, int map[], int block[]). Argument: integer width(ndim) — array of ghost-cell widths (input). A normal global array becomes a global array with ghost cells around each locally held block. Operations: NGA_Create_ghosts — creates an array with ghost cells; GA_Update_ghosts — updates the ghost cells with data from adjacent processors; NGA_Access_ghosts — provides access to the local ghost-cell elements; NGA_Nbget_ghost_dir — non-blocking call to update ghost cells. Ghost Cell Update: automatically updates ghost cells with the appropriate data from neighboring processors; a multiprotocol implementation has been used to optimize the update operation to match platform characteristics. Linear Algebra — whole arrays. To add two arrays: Fortran subroutine ga_add(alpha, g_a, beta, g_b, g_c); C void GA_Add(void *alpha, int g_a, void *beta, int g_b, int g_c).
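A hedged C sketch of the ghost-cell creation and update calls listed above; the 512x512 shape and one-element halo are illustrative, and chunk is passed as NULL to accept the default distribution.

    #include "ga.h"
    #include "macdecls.h"

    /* Create a 2-D double array with a one-element ghost-cell halo and
       refresh the halo from the neighboring processes. */
    int create_with_ghosts(void) {
        int dims[2]  = {512, 512};
        int width[2] = {1, 1};        /* ghost-cell width in each dimension */

        int g_a = NGA_Create_ghosts(C_DBL, 2, dims, width, "halo array", NULL);
        if (!g_a) GA_Error("create with ghosts failed", 0);

        GA_Zero(g_a);                 /* initialize the visible elements    */
        GA_Update_ghosts(g_a);        /* fill ghost cells from neighbors    */
        return g_a;
    }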
13. Example: invert a 1-D array (C), with #define NDIM 1 and #define TOTALELEMS 32768. The main program declares dims[1], chunk, ld, lo/hi, lo2/hi2, the handles g_a and g_b, and the local buffers a[TOTALELEMS] and b[TOTALELEMS]; calls GA_Initialize(); queries me = GA_Nodeid() and nprocs = GA_Nnodes(); sets dims = nprocs*TOTALELEMS and chunk = ld = TOTALELEMS; creates the global array (g_a = NGA_Create(C_INT, NDIM, dims, "array A", chunk)) and duplicates it (g_b = GA_Duplicate(g_a, "array B")); initializes the data in the GA; finds the locally owned block (NGA_Distribution(g_a, me, lo, hi)) and fetches it (NGA_Get(g_a, lo, hi, a, ld)); inverts the data locally (for (i=0; i<nelem; i++) b[i] = a[nelem-1-i]); inverts the data globally by computing lo2 = dims - hi - 1 and hi2 = dims - lo - 1 and writing the inverted block into its mirrored position with NGA_Put; and finally calls GA_Terminate(). (The related matrix-multiplication kernel combines nga_get with dgemm.) One-sided Communication. Message passing: a message requires cooperation on both sides — the processor sending the message (P1) and the processor receiving it (P0) must both participate (MPI send/receive). One-sided communication: once the transfer is initiated on the sending processor (P1), that processor can continue computation; the receiving processor (P0) is not involved, and the data is copied directly from the switch into memory on P0 (SHMEM, ARMCI, MPI-2 one-sided). Data Locality in GA: what data does a processor own? NGA_Distribution answers this (see item 10).
14. (Building the GA libraries, cont.) You can override the default compiler and optimization flags when building, e.g. gmake FC=f90 CC=cc FOPT=-O4 COPT=-g. Writing GA Programs. GA definitions and data types: C programs include the files ga.h and macdecls.h; Fortran programs should include mafdecls.fh and global.fh. GA_Initialize and GA_Terminate initialize and terminate the GA library. A minimal C program: #include <stdio.h>, #include "mpi.h", #include "ga.h", #include "macdecls.h"; int main(int argc, char **argv) { MPI_Init(&argc, &argv); GA_Initialize(); printf("Hello world\n"); GA_Terminate(); MPI_Finalize(); return 0; }. GA requires the following functionality from a message-passing library (MPI or TCGMSG): initialization and termination of processes; broadcast and barrier; and a function to abort the running parallel job in case of an error. The message-passing library has to be initialized before the GA library and terminated after the GA library is terminated; GA is compatible with MPI. Compiling and Linking GA Programs: there are two ways — use the GA makefile in global/testing, or write your own makefile.
15. Outline: installing GA; writing GA programs; compiling and linking; running GA programs. For detailed information: the GA webpage (GA papers, APIs, user manual, etc. — Google "Global Arrays" or go to http://www.emsl.pnl.gov/docs/global/); the GA User Manual (http://www.emsl.pnl.gov/docs/global/user.html); the GA API documentation (GA webpage -> User Interface, http://www.emsl.pnl.gov/docs/global/userinterface.html); GA support and help (hpctools@pnl.gov or hpctools@emsl.pnl.gov; two mailing lists, GA User Forum and GA Announce). Installing GA — required environment settings: TARGET sets the platform, e.g. setenv TARGET LINUX for a 32-bit Linux platform (see chapter 2 of the GA user manual for the complete list); ARMCI_NETWORK specifies the underlying network communication protocol — required only on clusters with a high-performance network, e.g. if the underlying network is Infiniband using the OpenIB protocol, setenv ARMCI_NETWORK OPENIB. GA requires MPI for basic start-up and process management; you can use either MPI or the TCGMSG wrapper to MPI: to use MPI, setenv MSG_COMMS MPI; to use the TCGMSG-MPI wrapper, setenv USE_MPI y. Also set MPI_LIB and MPI_INCLUDE, which contain the paths to the MPI libraries and include files, and set LIBMPI, which points to the actual MPI libraries, e.g. setenv LIBMPI -lmpich. Refer to chapter 2 of the user manual for other optional arguments. Then run make or gmake to build the GA libraries (continued in item 14).
16. Parallel Programming Models: single-threaded data parallel (e.g. HPF); multiple processes with partitioned local data access (MPI); uniform globally shared data access (OpenMP); partitioned globally shared data access (Co-Array Fortran); uniform globally shared plus partitioned data access (UPC, Global Arrays, X10). High Performance Fortran: single-threaded view of computation; data parallelism and parallel loops; user-specified data distributions for arrays; the compiler transforms the HPF program into an SPMD program, so communication optimization is critical to performance and the programmer may not be conscious of the communication implications of the parallel program. Example fragments: !HPF$ INDEPENDENT nested loops DO I = 1, N; DO J = 1, N with A(I,J) = B(J,I) (communication-heavy) versus A(I,J) = B(I,J) (local); and an independent loop DO I = 1, 100 with A(I) = B(I-1) + B(I+1), whose footprints A(1:100), B(0:99), B(2:101) imply communication at the boundaries. Message Passing Interface: the most widely used parallel programming model today; bindings for Fortran, C, C++, MATLAB; P parallel processes, each with local (private) data; MPI-1 provides send/receive messages for inter-process communication; MPI-2 adds one-sided get/put access to/from local data at a remote process; explicit control of all inter-processor communication. Advantage: the programmer is conscious of communication overheads and attempts to minimize them. Drawback: program development and debugging are tedious (continued in item 8).
17. Overview of the Global Arrays Parallel Software Development Toolkit. Vinod Tipparaju (tipparajuv@ornl.gov, http://ft.ornl.gov/~vinod), Future Technologies and NCCS, Oak Ridge National Laboratory. Contributors to these slides include Manoj Krishnan (PNL), Bruce Palmer (PNL), and Jarek Nieplocha. Managed by UT-Battelle for the Department of Energy. Outline of the tutorial: introduction to parallel programming models; the Global Arrays (GA) programming model; GA operations — writing, compiling and running GA programs; basic, intermediate and advanced calls, with C and Fortran examples; GA hands-on session. People involved (not a complete list): Ryan Olson, Kitrick Sheets, Brian Smith (Cray, "the supercomputer company"); Robert Harrison (harrisonrj@ornl.gov); Mike Blocksome; Sameer Kumar; Manoj Krishnan (manoj@pnl.gov), Bruce Palmer (bruce.palmer@pnl.gov), Abhinav Vishnu (abhinav.vishnu@pnl.gov), and Edo Apra (apra@ornl.gov) — the group leading the GA effort at PNL. Section: Overview of the Global Arrays Parallel Software Development Toolkit — Introduction to Parallel Programming Models. Performance vs. abstraction and generality (figure): domain-specific systems, autoparallelized C/Fortran90, and other approaches trade scalability against generality.
18. Transpose Example (C). The program declares the local buffers a[MAXPROC*TOTALELEMS] and b[MAXPROC*TOTALELEMS] and the counters nelem and i; finds the local processor ID and the number of processors (me = GA_Nodeid(), nprocs = GA_Nnodes()); configures the array dimensions for a 1-D transpose and forces an unequal data distribution (ndim = 1; dims[0] = nprocs*TOTALELEMS + nprocs/2; ld[0] = dims[0]; chunk[0] = TOTALELEMS, the minimum data on each process); and creates a global array g_a and duplicates it to get g_b — g_a = NGA_Create(C_INT, 1, dims, "array A", chunk); if (!g_a) GA_Error("create failed: A", 0); g_b = GA_Duplicate(g_a, "array B"); if (!g_b) GA_Error("duplicate failed", 0) — with process 0 printing progress messages. Transpose Example (C, cont.): initialize the data in g_a (process 0 fills a[i] = i for i < dims[0], sets lo[0] = 0 and hi[0] = dims[0]-1, and calls NGA_Put(g_a, lo, hi, a, ld)); synchronize all processors with GA_Sync() to guarantee that everyone has data before proceeding to the next step; start the initial phase of the inversion by inverting the data held locally on each processor — find out which data each processor owns (NGA_Distribution(g_a, me, lo1, hi1)), copy the locally held data into the local buffer a (NGA_Get(g_a, lo1, hi1, a, ld)), and invert it locally (nelem = hi1[0] - lo1[0] + 1; for (i=0; i<nelem; i++) b[i] = a[nelem-1-i]).
19. Non-blocking Operations (cont.). Argument: nbhandle — non-blocking request handle (output/input). Example of overlapping communication with computation using double buffering: double precision buf1(nmax,nmax), buf2(nmax,nmax); call nga_nbget(g_a, lo1, hi1, buf1, ld1, nb1); ncount = 1; do while (...); if (mod(ncount,2).eq.1) then evaluate lo2/hi2, call nga_nbget(g_a, lo2, hi2, buf2, ld2, nb2), call nga_nbwait(nb1), and do work using the data in buf1; else evaluate lo1/hi1, call nga_nbget(g_a, lo1, hi1, buf1, ld1, nb1), call nga_nbwait(nb2), and do work using the data in buf2; endif; ncount = ncount + 1; end do. Cluster Information. Example: 2 nodes with 4 processors each and 7 processes created — ga_cluster_nnodes returns 2; ga_cluster_nodeid returns 0 or 1; ga_cluster_nprocs(inode) returns 4 or 3; ga_cluster_procid(inode, iproc) returns a processor ID. To return the total number of nodes the program is running on: Fortran integer function ga_cluster_nnodes(); C int GA_Cluster_nnodes(). To return the node ID of the process: Fortran integer function ga_cluster_nodeid(); C int GA_Cluster_nodeid(). To return the number of processors available on node inode: Fortran integer function ga_cluster_nprocs(inode); C int GA_Cluster_nprocs(int inode). To return the processor ID associated with node inode and local processor ID iproc: Fortran integer function ga_cluster_procid(inode, iproc); C int GA_Cluster_procid(int inode, int iproc).
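A C counterpart to the Fortran double-buffering pattern above, prefetching the next block of a 1-D double array while working on the current one; the block size and the summation "work" are illustrative placeholders.

    #include "ga.h"

    #define BLOCK 1024

    static double process_block(const double *buf) {
        double s = 0.0;
        for (int i = 0; i < BLOCK; i++) s += buf[i];   /* stand-in for real work */
        return s;
    }

    /* Pipeline non-blocking gets: issue the get for block ib+1, then wait on
       and process block ib. */
    double pipeline_gets(int g_a, int nblocks) {
        double buf[2][BLOCK], total = 0.0;
        int lo[2][1], hi[2][1], ld[1] = {1};
        ga_nbhdl_t handle[2];
        int cur = 0;

        lo[0][0] = 0; hi[0][0] = BLOCK - 1;
        NGA_NbGet(g_a, lo[0], hi[0], buf[0], ld, &handle[0]);

        for (int ib = 0; ib < nblocks; ib++) {
            int nxt = 1 - cur;
            if (ib + 1 < nblocks) {                    /* prefetch block ib+1  */
                lo[nxt][0] = (ib + 1) * BLOCK;
                hi[nxt][0] = lo[nxt][0] + BLOCK - 1;
                NGA_NbGet(g_a, lo[nxt], hi[nxt], buf[nxt], ld, &handle[nxt]);
            }
            NGA_NbWait(&handle[cur]);                  /* current block is ready */
            total += process_block(buf[cur]);
            cur = nxt;
        }
        return total;
    }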
20. Transpose Example (Fortran, cont.). Create g_a with nga_create(..., chunk, g_a) and duplicate it: status = ga_duplicate(g_a, g_b, 'Array B'). Initialize the data in g_a: do i = 1, dims(1); a(i) = i; end do; lo1(1) = 1; hi1(1) = dims(1). Copy the data from the local buffer a to the global array g_a — only processor 0 does this: if (me.eq.0) call nga_put(g_a, lo1, hi1, a, ld). Synchronize all processors (call ga_sync) to guarantee that everyone has data before proceeding to the next step. Start the initial phase of the inversion by inverting the data held locally on each processor: first find out which data each processor owns (call nga_distribution(g_a, me, lo, hi)), get the locally held data into the local buffer a (call nga_get(g_a, lo, hi, a, ld)), and invert it locally (nelem = hi(1) - lo(1) + 1; do i = 1, nelem; b(i) = a(nelem - i + 1); end do). Do the global inversion by copying the locally inverted data blocks into their inverted positions in the GA: lo2(1) = dims(1) - hi(1) + 1; hi2(1) = dims(1) - lo(1) + 1; call nga_put(g_b, lo2, hi2, b, ld). Synchronize all processors (call ga_sync) to make sure the inversion is complete. Check whether the inversion is correct, starting by copying g_a into local buffer a and g_b into local buffer b: call nga_get(g_a, lo1, hi1, a, ld); call nga_get(g_b, lo1, hi1, b, ld); ichk = 0; then loop over i = 1, dims(1), and if a(i).ne.b(dims(1)-i+1) and me.eq.0, print a mismatch message and increment ichk (continued in item 6).
21. Access example (Fortran, cont.): call nga_create(MT_F_DBL, 2, dims, 'Array', chunk, g_a); call nga_distribution(g_a, me, lo, hi); call nga_access(g_a, lo, hi, index, ld); call do_subroutine_task(dbl_mb(index), ld(1)); call nga_release(g_a, lo, hi); where subroutine do_subroutine_task(a, ld1) receives double precision a(ld1,*). Locality Information (cont.): Global Arrays support the abstraction of a distributed array object; the object is represented by an integer handle; a process can access its own portion of the data in the global array. To do this, the following steps need to be taken: find the distribution of the array, i.e. which part of the data the calling process owns; access the data; operate on the data (read/write); and release the access to the data. Non-blocking Operations — overlapping computation with communication: the non-blocking APIs are derived from the blocking interface by adding a handle argument that identifies an instance of the non-blocking request. Fortran: subroutine nga_nbput(g_a, lo, hi, buf, ld, nbhandle); subroutine nga_nbget(g_a, lo, hi, buf, ld, nbhandle); subroutine nga_nbacc(g_a, lo, hi, buf, ld, alpha, nbhandle); subroutine nga_nbwait(nbhandle). C: void NGA_NbPut(int g_a, int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t *nbhandle); void NGA_NbGet(int g_a, int lo[], int hi[], void *buf, int ld[], ga_nbhdl_t *nbhandle); void NGA_NbAcc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha, ga_nbhdl_t *nbhandle); int NGA_NbWait(ga_nbhdl_t *nbhandle).
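A C counterpart to the Fortran access example above — query the locally owned patch, obtain a direct pointer to it, update it in place, and release it; a 2-D double array is assumed and the scaling is illustrative.

    #include "ga.h"

    /* Operate directly on the locally owned patch of a 2-D double array:
       distribution query, access, in-place update, release. */
    void scale_local_patch(int g_a, double factor) {
        int lo[2], hi[2], ld[1];
        double *ptr;

        NGA_Distribution(g_a, GA_Nodeid(), lo, hi);   /* what do I own?        */
        NGA_Access(g_a, lo, hi, &ptr, ld);            /* pointer to local data */

        int nrow = hi[0] - lo[0] + 1, ncol = hi[1] - lo[1] + 1;
        for (int i = 0; i < nrow; i++)
            for (int j = 0; j < ncol; j++)
                ptr[i * ld[0] + j] *= factor;         /* update in place       */

        NGA_Release_update(g_a, lo, hi);              /* data was modified     */
    }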
22. (The regular creation example ends with: if the create fails, call ga_error('Could not create global array A', g_a).) Creating/Destroying Arrays (cont.). To create an array with an irregular distribution: Fortran logical function nga_create_irreg(type, ndim, dims, array_name, map, nblock, g_a); C int NGA_Create_irreg(int type, int ndim, int dims[], char *array_name, int nblock[], int map[]). Arguments: name — a unique character string (input); type — GA data type (input); dims — array dimensions (input); nblock — number of blocks each dimension is divided into (input); map — starting index of each block (input); g_a — integer handle for future references (output). Example of an irregular distribution: the distribution is specified as a Cartesian product of distributions for each dimension, and the array indices start at 1. The figure demonstrates the distribution of a 2-dimensional 8x10 array on 6 (or more) processors: nblock = (3,2), the size of the map array is 5, and map contains the elements (1, 3, 7, 1, 6). The distribution is nonuniform because P1 and P4 get 20 elements each while processors P0, P2, P3 and P5 get only 10 elements each. Fortran example: block(1) = 3; block(2) = 2; map(1) = 1; map(2) = 3; map(3) = 7; map(4) = 1; map(5) = 6; call nga_create_irreg(MT_F_DBL, 2, dims, 'Array_A', map, block, g_a); if the create fails, call ga_error('Could not create global array A', g_a).
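A hedged C sketch of the irregular-distribution example above, mirroring the 8x10 array split into 3x2 blocks; the map entries are shifted to C's zero-based indexing, which is an assumption of this sketch rather than something stated on the slide.

    #include "ga.h"
    #include "macdecls.h"

    /* Irregular distribution of an 8x10 double array over 6 processes:
       row blocks start at 0/2/6, column blocks start at 0/5. */
    int create_irregular(void) {
        int dims[2]   = {8, 10};
        int nblock[2] = {3, 2};            /* 3 row blocks x 2 column blocks */
        int map[5]    = {0, 2, 6, 0, 5};   /* block starting indices         */

        int g_a = NGA_Create_irreg(C_DBL, 2, dims, "Array_A", nblock, map);
        if (!g_a) GA_Error("Could not create global array A", 0);
        return g_a;
    }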
23. (Matrix Multiply Example (C), end: the inner loop accumulates c[i][j] += a[i][k]*btrns[j][k] for k < dims[0], and the result is copied back to g_c with NGA_Put(g_c, lo, hi, c, ld).) Section: Overview of the Global Arrays Parallel Software Development Toolkit — Intermediate and Advanced APIs (Managed by UT-Battelle for the Department of Energy). Outline: writing, building and running GA programs; basic calls; intermediate calls; writing scalable GA code with advanced calls. Basic Array Operations — whole arrays. To set all the elements in the array to zero: Fortran subroutine ga_zero(g_a); C void GA_Zero(int g_a). To assign a single value to all the elements in the array: Fortran subroutine ga_fill(g_a, val); C void GA_Fill(int g_a, void *val). To scale all the elements in the array by the factor val: Fortran subroutine ga_scale(g_a, val); C void GA_Scale(int g_a, void *val). To copy data between two arrays: Fortran subroutine ga_copy(g_a, g_b); C void GA_Copy(int g_a, int g_b) — the arrays must have the same size and dimension, but their distributions may be different. Example: call nga_create(MT_F_INT, ndim, dims, 'array A', chunk_a, g_a); call nga_create(MT_F_INT, ndim, dims, 'array B', chunk_b, g_b); initialize g_a; call ga_copy(g_a, g_b) — the figure shows g_a and g_b distributed on a 3x3 process grid.
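A short C sketch of the whole-array operations listed above; g_a and g_b are assumed to be existing double arrays of the same shape, and the values are illustrative.

    #include "ga.h"

    /* Whole-array operations: zero, fill, scale, and copy. */
    void whole_array_ops(int g_a, int g_b) {
        double val = 1.0, factor = 0.5;

        GA_Zero(g_a);             /* every element of g_a becomes 0.0        */
        GA_Fill(g_a, &val);       /* every element becomes 1.0               */
        GA_Scale(g_a, &factor);   /* every element is multiplied by 0.5      */
        GA_Copy(g_a, g_b);        /* g_b gets a copy; distribution may differ */
    }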
24. (Hello-world example, cont.) me = ga_nodeid(); size = ga_nnodes(); write(6,*) 'Hello world: My rank is', me, 'out of', size, 'processes/nodes'; call ga_terminate(); call mpi_finalize(); end. Running it with mpirun -np 4 helloworld prints 'Hello world: My rank is 0 out of 4 processes/nodes' (and likewise for ranks 1, 2 and 3, in arbitrary order). GA Data Types. C data types: C_INT (int), C_LONG (long), C_FLOAT (float), C_DBL (double), C_SCPL (single complex), C_DCPL (double complex). Fortran data types: MT_F_INT (integer, 4/8 bytes), MT_F_REAL (real), MT_F_DBL (double precision), MT_F_SCPL (single complex), MT_F_DCPL (double complex). Creating/Destroying Arrays. To create an array with a regular distribution: Fortran logical function nga_create(type, ndim, dims, name, chunk, g_a); C int NGA_Create(int type, int ndim, int dims[], char *name, int chunk[]). Arguments: name — a unique character string (input); type — GA data type (input); dims — array dimensions (input); chunk — minimum size that dimensions should be chunked into (input); g_a — array handle for future references (output). Example: dims(1) = 5000; dims(2) = 5000; chunk(1) = -1 (use defaults); chunk(2) = -1; if (.not.nga_create(MT_F_DBL, 2, dims, 'Array_A', chunk, g_a)) then call ga_error('Could not create global array A', g_a).
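A C counterpart to the regular-distribution creation example above; chunk is passed as NULL to accept the default distribution, the equivalent of the Fortran chunk = -1.

    #include "ga.h"
    #include "macdecls.h"

    /* Create a 5000x5000 double array with the default (regular) distribution. */
    int create_regular(void) {
        int dims[2] = {5000, 5000};

        int g_a = NGA_Create(C_DBL, 2, dims, "Array_A", NULL);
        if (!g_a) GA_Error("Could not create global array A", 0);
        return g_a;
    }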
25. Fence (cont.). A fence blocks the calling process until all the data transfers corresponding to GA operations initiated by this process complete. For example, since ga_put might return before the data reaches its final destination, ga_init_fence and ga_fence allow a process to wait until the data transfer is fully completed: call ga_init_fence(); call ga_put(g_a, ...); call ga_fence(). The initialize-fence functions are Fortran subroutine ga_init_fence(); C void GA_Init_fence(). The fence functions are Fortran subroutine ga_fence(); C void GA_Fence(). Synchronization Control in Collective Operations: to eliminate redundant synchronization points — Fortran subroutine ga_mask_sync(prior_sync_mask, post_sync_mask); C void GA_Mask_sync(int prior_sync_mask, int post_sync_mask), where the first mask (logical, input) enables or disables (0/1) the prior internal synchronization and the last mask (logical, input) does the same for the post internal synchronization; e.g. call ga_duplicate(g_a, g_b); call ga_mask_sync(0, 1); call ga_zero(g_b). Block-Cyclic Data Distributions: figures contrast a normal (blocked) data distribution with a block-cyclic data distribution. Interfaces to Third-Party Software Packages: ScaLAPACK — solve a system of linear equations, compute the inverse of a double-precision matrix; TAO — general optimization problems; interoperability with others — PETSc, CUMULVS. Locality Information: to determine the process ID that owns the element defined by the array subscripts, use nga_locate (signatures in item 30).
26. Global Arrays (cont.): physically distributed data presented as a single shared data structure with global indexing, in a shared-memory-like style — e.g. access A(4,3) rather than buf(7) on task 2. The shared-data model is set in the context of distributed dense arrays; it is much simpler than message passing for many applications; it provides a complete environment for parallel code development; it is compatible with MPI; data-locality control is similar to the distributed-memory message-passing model; and it is extensible and scalable. Global Array model of computations: get a block, compute, update the block. Creating Global Arrays: g_a = NGA_Create(type, ndim, dims, name, chunk) — type is the element type (float, double, int, etc.), ndim and dims give the dimensions, name is a character string, chunk is the minimum block size on each processor, and the returned integer is the array handle. Remote Data Access in GA vs. MPI: in GA a single call identifies the size and location of the data blocks — NGA_Get(g_a, lo, hi, buffer, ld), with the global array handle, the lower and upper indices of the patch, a local buffer, and an array of strides. In message passing one must loop over processors: if me = P_N then pack the data into a local message buffer and send the block to P0; else if me = P0 then receive the block from P_N into a message buffer and unpack it into the local buffer; end loop; then copy the local data on P0 into the local buffer. Example: invert a 1-D array — #define NDIM 1, #define TOTALELEMS 32768 (the listing continues in item 13).
27. Linear Algebra — whole arrays (cont.). ga_add arguments: alpha, beta — scale factors (double precision/complex/integer, input); g_a, g_b — input array handles; g_c — output array handle. To multiply arrays: Fortran subroutine ga_dgemm(transa, transb, m, n, k, alpha, g_a, g_b, beta, g_c); C void GA_Dgemm(char ta, char tb, int m, int n, int k, double alpha, int g_a, int g_b, double beta, int g_c). Arguments: g_a, g_b — input array handles; g_c — output array handle; m, n, k — integer matrix dimensions (input); alpha, beta — scale factors (double precision for DGEMM, double complex for ZGEMM, input); transa, transb — character*1 transpose flags (input). To compute the element-wise dot product of two arrays, there are three separate functions by data type: integer — Fortran ga_idot(g_a, g_b), C GA_Idot(int g_a, int g_b); double precision — Fortran ga_ddot(g_a, g_b), C GA_Ddot(int g_a, int g_b); double complex — Fortran ga_zdot(g_a, g_b), C GA_Zdot(int g_a, int g_b). The C prototypes are int GA_Idot(int g_a, int g_b); long GA_Ldot(int g_a, int g_b); float GA_Fdot(int g_a, int g_b); double GA_Ddot(int g_a, int g_b); DoubleComplex GA_Zdot(int g_a, int g_b). To symmetrize a matrix: Fortran subroutine ga_symmetrize(g_a); C void GA_Symmetrize(int g_a). To transpose a matrix: Fortran subroutine ga_transpose(g_a, g_b); C void GA_Transpose(int g_a, int g_b) (see item 9).
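A hedged C sketch combining two of the whole-array operations above — a distributed matrix multiply followed by an element-wise dot product; the arrays are assumed to be n x n double-precision GAs.

    #include "ga.h"

    /* C = alpha*A*B + beta*C, then the element-wise dot product of A and C. */
    double dgemm_and_dot(int g_a, int g_b, int g_c, int n) {
        double alpha = 1.0, beta = 0.0;

        GA_Dgemm('n', 'n', n, n, n, alpha, g_a, g_b, beta, g_c);  /* C = A*B        */
        return GA_Ddot(g_a, g_c);         /* sum over all elements of A(i,j)*C(i,j) */
    }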
28. (Matrix Multiply Example (C), end.) The result block is copied back with NGA_Put(g_c, lo, hi, c, ld); the program then verifies the result with verify(g_a, g_b, g_c, lo1, hi1, ld) and deallocates the arrays with GA_Destroy(g_a), GA_Destroy(g_b), and GA_Destroy(g_c). Sources of information: http://www.emsl.pnl.gov/docs/global/
29. Matrix Multiply Example (C, cont.). Initialize the data in matrices a and b (process 0 prints "Initializing matrix A and B"): for (i=0; i<dims[0]; i++) for (j=0; j<dims[1]; j++) { a[i][j] = (double)(k % 29); b[i][j] = (double)(l % 37); }, where k and l are running counters. Copy the data to the global arrays g_a and g_b: lo1[0] = 0; lo1[1] = 0; hi1[0] = dims[0]-1; hi1[1] = dims[1]-1; if (me == 0) { NGA_Put(g_a, lo1, hi1, a, ld); NGA_Put(g_b, lo1, hi1, b, ld); }. Synchronize all processors with GA_Sync() to make sure everyone has the data. Determine which block of data is locally owned — note that the same block is locally owned for all GAs: NGA_Distribution(g_c, me, lo, hi). Get the blocks from g_a and g_b needed to compute this block of g_c and copy them into the local buffers a and b: lo2[0] = lo[0]; lo2[1] = 0; hi2[0] = hi[0]; hi2[1] = dims[0]-1; NGA_Get(g_a, lo2, hi2, a, ld); lo3[0] = 0; lo3[1] = lo[1]; hi3[0] = dims[1]-1; hi3[1] = hi[1]; NGA_Get(g_b, lo3, hi3, b, ld). Do the local matrix multiplication and store the result in the local buffer c, starting by evaluating the transpose of b: for (i=0; i<hi3[0]-lo3[0]+1; i++) for (j=0; j<hi3[1]-lo3[1]+1; j++) btrns[j][i] = b[i][j].
30. Locality Information (cont.). Fortran logical function nga_locate(g_a, subscript, owner); C int NGA_Locate(int g_a, int subscript[]). Arguments: g_a — array handle (input); subscript(ndim) — element subscript (input); owner — process ID (output); in the figure the element belongs to owner = 5. To return the list of process IDs that own a patch: Fortran logical function nga_locate_region(g_a, lo, hi, map, proclist, np); C int NGA_Locate_region(int g_a, int lo[], int hi[], int map[], int procs[]). Arguments: np — number of processors that own a portion of the block (output); g_a — global array handle (input); ndim — number of dimensions of the global array; lo(ndim), hi(ndim) — starting and ending indices of the array section (input); map(2*ndim,*) — array with mapping information (output); procs(np) — list of processes that own a part of the array section (output); the figure shows procs = {0, 1, 2, 4, 5, 6} with the corresponding lo/hi pairs for each owner. New Interface for Creating Arrays — developed to handle the proliferating number of properties that can be assigned to Global Arrays. Fortran: integer function ga_create_handle(); subroutine ga_set_data(g_a, dim, dims, type); subroutine ga_set_array_name(g_a, name); subroutine ga_set_chunk(g_a, chunk); subroutine ga_set_irreg_distr(g_a, map, nblock); and the further set routines listed in item 34.
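A C sketch of the handle-based creation interface introduced above: allocate a handle, attach properties, then allocate the array. The 5000x5000 shape and 100x100 chunking mirror the Fortran example in item 34 and are otherwise illustrative.

    #include "ga.h"
    #include "macdecls.h"

    /* Create an array with the new handle/set/allocate interface. */
    int create_with_new_interface(void) {
        int dims[2]  = {5000, 5000};
        int chunk[2] = {100, 100};

        int g_b = GA_Create_handle();
        GA_Set_data(g_b, 2, dims, C_DBL);
        GA_Set_array_name(g_b, "array B");
        GA_Set_chunk(g_b, chunk);
        if (!GA_Allocate(g_b)) GA_Error("allocate failed", 0);
        return g_b;
    }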
31. (Global Array model of computations: get a block, compute, update.) Locality Information. Discover the array elements held by each processor: Fortran nga_distribution(g_a, proc, lo, hi); C void NGA_Distribution(int g_a, int proc, int lo[], int hi[]). Arguments: g_a — array handle (input); proc — processor ID (input); lo(ndim), hi(ndim) — lower and upper indices (output). Example: do iproc = 1, nproc; write(6,*) 'Printing g_a info for processor', iproc; call nga_distribution(g_a, iproc, lo, hi); do j = 1, ndim; write(6,*) j, lo(j), hi(j); end do; end do. Example (matrix multiply): determine which block of data is locally owned — the same block is locally owned for all GAs: NGA_Distribution(g_c, me, lo, hi). Get the blocks from g_a and g_b needed to compute this block of g_c and copy them into the local buffers a and b: lo2[0] = lo[0]; lo2[1] = 0; hi2[0] = hi[0]; hi2[1] = dims[0]-1; NGA_Get(g_a, lo2, hi2, a, ld); lo3[0] = 0; lo3[1] = lo[1]; hi3[0] = dims[1]-1; hi3[1] = hi[1]; NGA_Get(g_b, lo3, hi3, b, ld). Do the local matrix multiplication and store the result in the local buffer c, starting by evaluating the transpose of b (btrns[j][i] = b[i][j]) and then multiplying: for each (i, j), set c[i][j] = 0.0 and accumulate c[i][j] += a[i][k]*btrns[j][k] for k < dims[0].
32. (New applications, cont.) "...data-intensive bioinformatics analysis", IEEE Trans. Parallel and Distributed Systems, Vol. 17, No. 8, 2006 — the ScalaBLAST figure mentions immigration and WMD-report datasets. Parallel Inspire: Krishnan M, SJ Bohn, WE Cowley, VL Crow, and J Nieplocha, "Scalable Visual Analytics of Massive Textual Datasets", Proc. IEEE International Parallel and Distributed Processing Symposium, 2007. Smooth Particle Hydrodynamics. Productivity and Scalability: "Liquid Water: Obtaining the Right Answer for the Right Reasons", Edoardo Apra, Robert J. Harrison, Wibe A. de Jong, Alistair Rendell, Vinod Tipparaju, Sotiris Xantheas — an SC2009 Gordon Bell finalist; the accompanying figure shows the (H2O)n CCSD(T) benchmark on the XT5 for 30,000 to 90,000 processors. Source Code and More Information: version 4.1 is available; homepage at http://www.emsl.pnl.gov/docs/global/; platforms — 32- and 64-bit IBM SP, BlueGene, Cray X1, XD1, XT3, XT4, XT5, Linux clusters with Ethernet, Myrinet, Infiniband or Quadrics, Solaris, Fujitsu, Hitachi, NEC, HP, Windows. Section: Overview of the Global Arrays Parallel Software Development Toolkit — Getting Started, Basic Calls (Managed by UT-Battelle for the Department of Energy). Writing, Building and Running GA Programs: basic calls, intermediate calls, advanced calls.
33. Global Arrays vs. Other Models — advantages (cont.): GA inter-operates with MPI, so you can use the more convenient globally shared view for multi-dimensional arrays but use the MPI model wherever needed; data-locality and granularity control are explicit with GA's get-compute-put model, unlike the non-transparent communication overheads of the other models (except MPI); and the library-based approach does not rely on smart compiler optimizations to achieve high performance. Disadvantage: only usable for array data structures. Section: Overview of the Global Arrays Parallel Software Development Toolkit — Global Arrays Programming Model (Managed by UT-Battelle for the Department of Energy). Overview of GA: programming model; structure of the GA toolkit; overview of interfaces. Distributed vs. Shared Data View. Distributed data: data is explicitly associated with each processor; accessing data requires specifying the location of the data on the processor as well as the processor itself; data locality is explicit, but data access is typically coordinated with message passing (e.g. MPI). Shared data: data is in a globally accessible address space; any processor can access data by specifying its location using a global index; data is mapped out in a natural manner, usually corresponding to the original problem, and access is easy; however, information on data locality is obscured, which can lead to loss of performance. Global Arrays: distributed dense arrays that can be accessed through a shared-memory-like style (continued in item 26).
34. New Interface for Creating Arrays (cont.). Fortran (continued): subroutine ga_set_irreg_distr(g_a, map, nblock); subroutine ga_set_ghosts(g_a, width); subroutine ga_set_block_cyclic(g_a, dims); subroutine ga_set_block_cyclic_proc_grid(g_a, dims, proc_grid); logical function ga_allocate(g_a). C: int GA_Create_handle(); void GA_Set_data(int g_a, int dim, int dims[], int type); void GA_Set_array_name(int g_a, char *name); void GA_Set_chunk(int g_a, int chunk[]); void GA_Set_irreg_distr(int g_a, int map[], int nblock[]); void GA_Set_ghosts(int g_a, int width[]); void GA_Set_block_cyclic(int g_a, int dims[]); void GA_Set_block_cyclic_proc_grid(int g_a, int dims[], int proc_grid[]); int GA_Allocate(int g_a). Example (Fortran): integer ndim, dims(2), chunk(2), g_a, g_b; logical status; ndim = 2; dims(1) = 5000; dims(2) = 5000; chunk(1) = 100; chunk(2) = 100. Create global array A using the old interface: status = nga_create(MT_F_DBL, ndim, dims, 'array A', chunk, g_a). Create global array B using the new interface: g_b = ga_create_handle(); call ga_set_data(g_b, ndim, dims, MT_F_DBL); call ga_set_chunk(g_b, chunk); call ga_set_array_name(g_b, 'array B'); status = ga_allocate(g_b). Example Code: 1-D transpose (Fortran), 1-D transpose (C), matrix multiply (Fortran), matrix multiply (C). Transpose Example (C): the program declares int ndim, dims[1], chunk[1], ld[1], lo[1], hi[1], lo1[1], hi1[1], lo2[1], hi2[1], and the handles g_a and g_b (continued in item 18).
35. Linear Algebra (cont.) — patches. To compute the element-wise dot product of two patches, there are three separate functions by data type: integer — Fortran nga_idot_patch(g_a, ta, alo, ahi, g_b, tb, blo, bhi), C NGA_Idot_patch(int g_a, char ta, int alo[], int ahi[], int g_b, char tb, int blo[], int bhi[]); double precision — Fortran nga_ddot_patch(g_a, ta, alo, ahi, g_b, tb, blo, bhi), C NGA_Ddot_patch(int g_a, char ta, int alo[], int ahi[], int g_b, char tb, int blo[], int bhi[]); double complex — Fortran nga_zdot_patch(g_a, ta, alo, ahi, g_b, tb, blo, bhi), C NGA_Zdot_patch(int g_a, char ta, int alo[], int ahi[], int g_b, char tb, int blo[], int bhi[]). As for the whole-array versions, the C prototypes return int (GA_Idot), long (GA_Ldot), float (GA_Fdot), double (GA_Ddot), and DoubleComplex (GA_Zdot). Outline: writing, building and running GA programs; basic calls; intermediate calls; writing scalable GA code with advanced calls. Access — performance from locality awareness: to provide direct access to the local data in the specified patch of the array owned by the calling process — Fortran subroutine nga_access(g_a, lo, hi, index, ld); C void NGA_Access(int g_a, int lo[], int hi[], void *ptr, int ld[]). Processes can access their local portion of the global array directly (e.g. process 0 can access the specified patch of its local portion), which avoids a memory copy. Example (Fortran): call nga_create(MT_F_DBL, 2, dims, 'Array', chunk, g_a), continued in item 21.
36. Cluster Information (cont.). Arguments for ga_cluster_procid: integer inode, iproc (input). Accessing Processor Memory (figure): within an SMP node, ga_access returns a pointer into the node's shared memory. Processor Groups. To create a new processor group: Fortran integer function ga_pgroup_create(list, size); C int GA_Pgroup_create(int *list, int size). To assign a processor group at array-creation time: Fortran logical function nga_create_config(type, ndim, dims, name, chunk, p_handle, g_a); C int NGA_Create_config(int type, int ndim, int dims[], char *name, int p_handle, int chunk[]). Arguments: g_a — global array handle (input); p_handle — processor group handle (output); list — list of processor IDs in the group (input); size — number of processors in the group (input). The figure shows group A, group B, and group C carved out of the world group. Processor Groups (cont.): to set the default processor group — Fortran subroutine ga_pgroup_set_default(p_handle); C void GA_Pgroup_set_default(int p_handle). To access information about the processor group: Fortran integer function ga_pgroup_nnodes(p_handle), integer function ga_pgroup_nodeid(p_handle); C int GA_Pgroup_nnodes(int p_handle), int GA_Pgroup_nodeid(int p_handle), where p_handle is the processor group handle (input). To determine the handle for a standard group at any point in the program, see item 4.
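A hedged C sketch of the create / set-default / restore-world pattern used with processor groups above; the "lower half of the processes" split and the task callback are illustrative, and the sketch assumes GA_Pgroup_create is called by all processes.

    #include <stdlib.h>
    #include "ga.h"

    /* Make the lower half of the processes the default group, run a task
       collectively on that subgroup, then restore the world group. */
    void run_on_lower_half(void (*task)(void)) {
        int nproc = GA_Nnodes();
        int half  = nproc / 2;
        int *list = (int *) malloc(half * sizeof(int));
        for (int i = 0; i < half; i++) list[i] = i;   /* processes 0 .. half-1 */

        int p_half = GA_Pgroup_create(list, half);
        if (GA_Nodeid() < half) {
            GA_Pgroup_set_default(p_half);            /* new GAs live in the subgroup */
            task();                                   /* collective work on the subgroup */
            GA_Pgroup_set_default(GA_Pgroup_get_world());
        }
        free(list);
    }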
37. (Put example, cont.: call nga_put(g_a, lo, hi, buf, ld).) Atomic Accumulate: accumulate combines the data from the local array with the data in the global array section — Fortran subroutine nga_acc(g_a, lo, hi, buf, ld, alpha); C void NGA_Acc(int g_a, int lo[], int hi[], void *buf, int ld[], void *alpha). Arguments: g_a — array handle (input); lo, hi — limits on the data block to be moved (input); buf — local buffer (double precision/complex, input); ld — array of strides for the local buffer (input); alpha — arbitrary scale factor (double precision/complex, input). The operation is ga(i,j) = ga(i,j) + alpha*buf(k), applied atomically with respect to other updates of the same elements. "What's wrong with this picture?" — a GA_Put of "my section" followed only by an MPI_Barrier and a GA_Access/read on other processes: the barrier synchronizes the processes but does not guarantee that outstanding one-sided GA transfers have completed, so the readers may see stale data; GA_Sync (or fences) must be used instead. Sync: GA_Sync is a collective operation; it acts as a barrier that synchronizes all the processes and ensures that all Global Array operations are complete at the call. Fortran subroutine ga_sync(); C void GA_Sync(). Global Operations: Fortran subroutine ga_brdcst(type, buf, lenbuf, root), subroutine ga_igop(type, x, n, op), subroutine ga_dgop(type, x, n, op); C void GA_Brdcst(void *buf, int lenbuf, int root), void GA_Igop(long x[], int n, char *op), void GA_Dgop(double x[], int n, char *op). Global Array model of computations: get a block, compute, update.
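A hedged C sketch combining the accumulate and global-operation calls above: each process atomically adds a local block into the same section of a 2-D double array, synchronizes, and then reduces a local scalar; the 4x4 block at the origin is an illustrative choice.

    #include "ga.h"

    /* ga += alpha*buf on a shared section, followed by a global sum. */
    double acc_and_reduce(int g_a, double partial) {
        double buf[4][4], alpha = 1.0;
        int lo[2] = {0, 0}, hi[2] = {3, 3}, ld[1] = {4};

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                buf[i][j] = 1.0;                  /* local contribution         */

        NGA_Acc(g_a, lo, hi, buf, ld, &alpha);    /* atomic scaled accumulate   */
        GA_Sync();                                /* all updates complete       */

        GA_Dgop(&partial, 1, "+");                /* global sum of partial      */
        return partial;
    }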
