Shmem Programming Manual - Pittsburgh Supercomputing Center

Contents

1. Upon return from a collective routine, the following are true for the local PE:
   - The target data object is updated.
   - The values in the pSync array are restored to the original values.
   SEE ALSO
   shmem_broadcast(3)

   2.12 Address Manipulation
   The address manipulation functions listed in Table 2.9 are not supported in this implementation.

   Table 2.9: Address Manipulation Functions
   Name                  Description
   shmem_ptr             Returns a pointer to a data object on a remote PE
   (name not recovered)  Makes a stack address remotely accessible

   Entry points for these functions are provided in the library. The functions will generate an exception if called.

   2.13 Control Data Cache
   The functions that control the data cache are listed in Table 2.10.

   Table 2.10: Control Data Cache Functions
   Name                 Description
   shmem_udcflush       Makes the entire user data cache coherent
   shmem_udcflush_line  Makes a cache line coherent

   These routines are supplied for compatibility with the Cray Shmem library. They perform no operations and return to the caller successfully. The control data cache functions are described in detail on the following pages.

   shmem_cache(3)
   NAME
   shmem_clear_cache_inv, shmem_set_cache_inv, shmem_set_cache_line_inv, shmem_udcflush, shmem_udcflush_line: control data cache utilities
   SYNOPSIS
   #include <shmem.h>
   void shmem_clear_c
2. printf Pin printf Numbers may be postfixed with k or m n Programming Examples 3 11 Program Listing xn 0 printf exit void printStats int proc if doprint proc amp 1 printf 3d pinged 3d int peer int doprint int now double t 8d words 9 2f uSec 8 2 MB s n proc peer now t sizeof long now t int main int argc char argv double t tv 2 int reps 10000 int doprint 0 char progName int minWords 1 int maxWords 1 int incWords int nwords int nproc int proc int peer Int Ci int ry int a long rbut long tbuf for progName argv 0 strlen argv 0 progName gt argv 0 amp amp progName 1 progName while c getopt argc argv n eh 1 switch c case n i if reps getSize optarg lt 0 usage progName break case e doprint break case h help progName default usage progName if optind argc minWords 1 else if minWords getSize argv optind lt 0 usage progName 3 12 Programming Examples Program Listing if optind argc maxWords minWords else if maxWords getSize argv optind lt minWords usage progName if optind argc incWords 0 else if incWords getSize argv optind lt 0 usage progName if rbuf long malloc maxWords sizeof long perror Failed mem
3. from the remote PE. The function shmem_longlong_iget reads a strided array of type long long from the remote PE. The function shmem_short_iget reads a strided array of type short from the remote PE.
   SEE ALSO
   shmem_iput(3), shmem_put(3), shmem_get(3), shmem_quiet(3)

   2.8 Synchronization Operations
   The synchronization functions are listed in Table 2.5.

   Table 2.5: Synchronization Functions
   Name                   Description
   shmem_long_wait_until  Waits for a long variable to change and satisfy a condition
   (name not recovered)   Waits for completion of all outstanding remote writes
   (name not recovered)   Waits for a long variable to change on the local PE

   These functions are used to express different kinds of synchronization. The barrier routines (shmem_barrier and shmem_barrier_all) are used to synchronize all or a subset of the processes belonging to the parallel application. Routines like shmem_wait are used to synchronize a pair of processing elements: the PE calling one of these functions on a local variable V is blocked until a remote PE changes the value of V. The function shmem_fence ensures ordering of remote write (put) operations. All put operations issued to a particular processing element (PE) prior to the call to shmem_fence are guaranteed to be delivered before any subsequent put operations to the same PE which follow the call to shmem_fence. The funct
4. shmem_longlong_put(3), shmem_short_put(3), shmem_put32(3), shmem_put64(3), shmem_put128(3), shmem_putmem(3)
   shmem_iput(3), shmem_double_iput(3), shmem_float_iput(3), shmem_int_iput(3), shmem_iput32(3), shmem_iput64(3), shmem_iput128(3), shmem_long_iput(3), shmem_longdouble_iput(3), shmem_longlong_iput(3), shmem_short_iput(3)
   2.7 Remote Read Operations
   shmem_double_g(3), shmem_float_g(3), shmem_int_g(3), shmem_long_g(3), shmem_short_g(3)
   shmem_get(3), shmem_double_get(3), shmem_float_get(3), shmem_get32(3), shmem_get64(3), shmem_get128(3), shmem_getmem(3), shmem_int_get(3), shmem_long_get(3), shmem_longdouble_get(3), shmem_longlong
5. void shmem_short_prod_to_all(short *target, short *source, int nreduce, int PE_start, int logPE_stride, int PE_size, short *pWrk, long *pSync);
   PARAMETERS
   target        A symmetric array of length nreduce to receive the result of the reduction operations. The data type of target should match that implied in the SYNOPSIS section.
   source        A symmetric array of length nreduce that contains one element for each separate reduction operation. The source argument must have the same data type as target.
   nreduce       The number of elements in the target and source arrays.
   PE_start      The lowest virtual PE number of the active set of PEs.
   logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
   PE_size       The number of PEs in the active set.
   pWrk          A symmetric work array. The pWrk argument must have the same data type as target. In C/C++, this contains max(nreduce/2+1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements.
   pSync         A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.
   DESCRIPTION
   The shared memory reduction routines compute one or more reductions across symmetric arrays on multip
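   The pWrk sizing rule quoted above can be sketched as a small helper. This is an illustration only: reduce_wrk_elems and EXAMPLE_MIN_WRKDATA_SIZE are hypothetical names, and the value 8 is a placeholder, since the real _SHMEM_REDUCE_MIN_WRKDATA_SIZE is defined by <shmem.h> and is not given in this excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder only: the real _SHMEM_REDUCE_MIN_WRKDATA_SIZE comes from
 * <shmem.h>; its value is implementation-defined and is not assumed here. */
#define EXAMPLE_MIN_WRKDATA_SIZE 8

/* Number of elements pWrk must hold: max(nreduce/2 + 1, min_wrk). */
static size_t reduce_wrk_elems(int nreduce, size_t min_wrk)
{
    assert(nreduce >= 0);
    size_t need = (size_t)nreduce / 2 + 1;
    return need > min_wrk ? need : min_wrk;
}
```

   With the placeholder minimum of 8, a reduction over 100 elements would need 51 work elements, while a reduction over a single element would still need 8.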
6. remote write to begin with while the even numbered processes write the remote target, deadlock is avoided. After the set of repetitions, the process calls the function gettime again. It calculates the time taken for one ping in each direction: the difference between the two timer readings divided by the number of repetitions. This value, expressed in microseconds, is halved to get the value for a ping in one direction. Before the processes print the results, they synchronize again. This means that all the results are displayed at roughly the same time and the printing does not interfere with the network performance. When the process has come out of the for loop, it synchronizes with its peers again before exiting.

   3.9 Subsidiary Functions
   The subsidiary functions make no use of the Shmem library.
   getSize     This function checks whether the user has suffixed the number of repetitions (specified on the command line with the -n option) with either a k or K for kilobytes, or m or M for megabytes. If it finds a suffix, it multiplies the number as appropriate (a left shift by one place multiplies by 2).
   dt          This function returns the difference between its two arguments.
   usage       This function prints out the command line syntax for the program and then exits.
   help        This function prints out the command line syntax for the program and enumerates the various options before exiting.
   printStats  This function displays the timing s
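   A minimal sketch of the suffix handling described for getSize, assuming that k/K scales by 1024 (a left shift of 10 places) and m/M by 1,048,576 (a left shift of 20). It is not the manual's listing: parse_size is a hypothetical name, it uses strtol rather than whatever the original uses, and it does not guard against overflow.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse "<number>[k|K|m|M]". Returns -1 for malformed or negative input.
 * Overflow of very large values is not handled in this sketch. */
static long parse_size(const char *s)
{
    assert(s != NULL);
    char *end;
    long n = strtol(s, &end, 10);
    if (end == s || n < 0)
        return -1;                  /* no digits, or a negative number */
    switch (*end) {
    case 'k': case 'K': n <<= 10; end++; break;   /* multiply by 1024 */
    case 'm': case 'M': n <<= 20; end++; break;   /* multiply by 1048576 */
    default: break;
    }
    return *end == '\0' ? n : -1;   /* reject trailing characters */
}
```

   For example, parse_size("4k") yields 4096 and parse_size("2M") yields 2097152, while a string such as "12x" is rejected.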
7. shmem_collect, shmem_collect32, shmem_collect64, shmem_fcollect, shmem_fcollect32, shmem_fcollect64: concatenates blocks of data from multiple processing elements (PEs) to an array in every PE
   SYNOPSIS
   #include <shmem.h>
   void shmem_collect(void *target, void *source, int nlong, int PE_start, int logPE_stride, int PE_size, long *pSync);
   void shmem_collect32(void *target, void *source, int nlong, int PE_start, int logPE_stride, int PE_size, long *pSync);
   void shmem_collect64(void *target, void *source, int nlong, int PE_start, int logPE_stride, int PE_size, long *pSync);
   void shmem_fcollect(void *target, void *source, int nlong, int PE_start, int logPE_stride, int PE_size, long *pSync);
   void shmem_fcollect32(void *target, void *source, int nlong, int PE_start, int logPE_stride, int PE_size, long *pSync);
   void shmem_fcollect64(void *target, void *source, int nlong, int PE_start, int logPE_stride, int PE_size, long *pSync);
   PARAMETERS
   target  A symmetric array. The target argument must be large enough to accept the concatenation of the source arrays on all PEs. For shmem_collect, shmem_collect64, shmem_fcollect and shmem_fcollect64, the data type of target can be any type that has an element size of 64 bits.
   source  A symmetric data object that can be of any data type that is permissible for the target argument.
   nlong   The number of elements in the source array
8. Index: shmem_longlong_max_to_all, shmem_longlong_min_to_all, shmem_longlong_or_to_all, shmem_longlong_prod_to_all, shmem_longlong_put, shmem_longlong_sum_to_all, shmem_longlong_swap, shmem_longlong_wait, shmem_longlong_wait_until, shmem_longlong_xor_to_all, shmem_put, shmem_put128, shmem_put32, shmem_put64, shmem_putmem, shmem_quiet, shmem_short_add, shmem_short_and_to_all, shmem_short_cswap, shmem_short_fadd, shmem_short_finc, shmem_short_g, shmem_short_get, shmem_short_iget, shmem_short_inc, shmem_short_iput, shmem_short_max_to_all, shmem_short_min_to_all, shmem_short_mswap, shmem_short_or_to_all, shmem_short_p, shmem_short_prod_to_all, shmem_short_sum_to_all, shmem_short_swap, shmem_short_wait, shmem_short_wait_until, shmem_short_xor_to_all, shmem_swap, shmem_wait, shmem_wait_until; shmem_longdouble_max_to_all, shmem_longdouble_min_to_all, shmem_longdouble_prod_to_all, shmem_longdouble_put, shmem_longdouble_sum_to_all, shmem_longlong_and_to_all
9. The number may have a k or an m appended to it (or their upper case equivalents) to denote multiples of 1024 and 1,048,576 respectively. By default the program pings 10,000 times.
   -e  Instructs every process to print its timing statistics.
   -h  Displays the list of options.
   nwords maxWords incWords
       nwords specifies to sping how many words there are in each packet. If maxWords is given, it specifies a maximum number of words to send in each packet and invokes the following behavior: after each set of repetitions (as specified with the -n option), the packet size is increased by incWords (the default is a doubling in size) and another set of repetitions is performed, until the packet size exceeds maxWords. This means that if neither of the optional parameters is specified, only one set of repetitions is performed.

   3.3 Program Output
   At the start of the program, if printing has been enabled for all processes with the -e option, a message like this is displayed by each process:
       1 8 Shmem PING reps 250000 minWords 64 maxWords 128 incWords 32
   where 1 is the process's identity number (i.e. the processing element or PE number) and 8 gives the number of processes running in parallel. After each set of repetitions, timing statistics are displayed like this:
       1 pinged 0 64 words 10.14 uSec 50.49 MB/s
   This indicates that process 1 pinged process 0 with 64 word packets. The pingi
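   The bandwidth figure in the sample statistics line can be checked with a short calculation. The sketch below assumes 8-byte words (a long on a 64-bit platform) and takes MB to mean 10^6 bytes, so that bytes per microsecond equals MB/s; bandwidth_mb_per_s is a hypothetical helper, not a function from the program.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per microsecond is numerically equal to MB/s when MB = 10^6 bytes. */
static double bandwidth_mb_per_s(size_t nwords, size_t word_bytes, double usec)
{
    assert(usec > 0.0);
    return (double)(nwords * word_bytes) / usec;
}
```

   Under those assumptions, 64 words of 8 bytes each in 10.14 microseconds gives 512 / 10.14, roughly 50.49 MB/s, which agrees with the sample line.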
10. ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_longlong_iput(long long *target, const long long *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_short_iput(short *target, const short *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    PARAMETERS
    target  The remotely accessible array data object to be updated on the remote PE.
    source  Array containing the data to be copied to the remote PE.
    tst     The stride between consecutive elements of the target array. The stride is scaled by the element size of the target array. A value of 1 indicates contiguous data. tst must be of type integer.
    sst     The stride between consecutive elements of the source array. The stride is scaled by the element size of the source array. A value of 1 indicates contiguous data. sst must be of type integer.
    len     Number of elements in the target and source. len must be of integer type.
    pe      The number of the remote PE where the strided data will be stored.
    DESCRIPTION
    These routines provide the means for copying a strided array from the local PE to a contiguous data object on a different PE. The routines return when the data has been copied out of the source array on the local PE, but not necessarily before the data has been delivered to the remote data object. The function shmem_iput writes a strided array where each element is any non-character type that has a storage size equal to 64 b
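    The meaning of the tst and sst strides can be illustrated with a purely local copy loop. This is a sketch of the indexing only: strided_copy_long is a hypothetical function, it moves data within one process, and it says nothing about how the library performs the remote transfer.

```c
#include <assert.h>
#include <stddef.h>

/* Copy len elements so that target[i * tst] = source[i * sst].
 * Strides are counted in elements, not bytes; 1 means contiguous. */
static void strided_copy_long(long *target, const long *source,
                              ptrdiff_t tst, ptrdiff_t sst, size_t len)
{
    assert(target != NULL && source != NULL);
    for (size_t i = 0; i < len; i++)
        target[(ptrdiff_t)i * tst] = source[(ptrdiff_t)i * sst];
}
```

    With sst = 2 and tst = 1, every second element of the source lands in consecutive elements of the target.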
11. short *source, int nreduce, int PE_start, int logPE_stride, int PE_size, short *pWrk, long *pSync);
    PARAMETERS
    target        A symmetric array of length nreduce to receive the result of the reduction operations. The data type of target should match that implied in the SYNOPSIS section.
    source        A symmetric array of length nreduce that contains one element for each separate reduction operation. The source argument must have the same data type as target.
    nreduce       The number of elements in the target and source arrays.
    PE_start      The lowest virtual PE number of the active set of PEs.
    logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set. logPE_stride must be of type integer.
    PE_size       The number of PEs in the active set.
    pWrk          A symmetric work array. The pWrk argument must have the same data type as target. In C/C++, this contains max(nreduce/2+1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements.
    pSync         A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.
    DESCRIPTION
    The shared memory reduction routines compute one or more reductions across symmetric arrays on multiple virtual PEs. A reduction performs an asso
12. shmem_int_swap(3), shmem_long_swap(3), shmem_longlong_swap(3), shmem_short_swap(3)
    shmem_int_cswap(3), shmem_long_cswap(3), shmem_longlong_cswap(3), shmem_short_cswap(3)
    shmem_short_add(3)
    shmem_int_mswap(3), shmem_long_mswap(3), shmem_short_mswap(3)
    shmem_int_fadd(3), shmem_long_fadd(3), shmem_longlong_fadd(3), shmem_short_fadd(3)
    shmem_int_finc(3), shmem_long_finc(3), shmem_longlong_finc(3), shmem_short_finc(3)
    shmem_short_inc(3)
    2.10 Collective Reduction Operations
    shmem_int_and_to_all(3), shmem_long_and_to_all(3), shmem_longlong_and_to_all(3), shmem_short_and_to_all(3)
    shmem_double_max_to_all(3), shmem_float_max_to_all(3), shmem_int_max_to_all(3), shmem_long_max_to_all(3), shmem_longdouble
13. PE. The function shmem_get64 reads any non-character type that has a storage size equal to 64 bits from a remote PE. The function shmem_get128 reads any non-character type that has a storage size equal to 128 bits from a remote PE. The function shmem_getmem reads any data type from a remote PE; len is scaled in bytes. The function shmem_int_get reads contiguous elements of type integer from a remote PE. The function shmem_long_get reads contiguous elements of type long from a remote PE. The function shmem_longdouble_get reads contiguous elements of type long double from a remote PE. The function shmem_longlong_get reads contiguous elements of type long long from a remote PE. The function shmem_short_get reads contiguous elements of type short from a remote PE.
    SEE ALSO
    shmem_iput(3), shmem_put(3), shmem_iget(3), shmem_quiet(3)

    shmem_iget(3)
    NAME
    shmem_iget, shmem_double_iget, shmem_float_iget, shmem_iget32, shmem_iget64, shmem_iget128, shmem_int_iget, shmem_long_iget, shmem_longdouble_iget, shmem_longlong_iget, shmem_short_iget: transfer strided data from a remote PE
    SYNOPSIS
    #include <shmem.h>
    void shmem_iget(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_double_iget(double *target, const double *source, ptrdiff_t tst, ptrdiff_t
14. The nlong argument must be equal on all PEs for shmem_fcollect, shmem_fcollect32 and shmem_fcollect64. The nlong argument can be different across PEs for shmem_collect, shmem_collect32 and shmem_collect64.
    PE_start      The lowest virtual PE number of the active set of PEs.
    logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
    PE_size       The number of PEs in the active set.
    pSync         A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.
    DESCRIPTION
    The shared memory collective routines concatenate nlong 64-bit or 32-bit data items from the source array into the target array, over the set of PEs defined by PE_start, logPE_stride and PE_size, in processor number order. The resultant target array contains the contribution from PE PE_start first, then the contribution from PE PE_start + PE_stride second, and so on. The collected result is written to the target array for all PEs in the active set. The values of arguments PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source array, and the same pSync work array, must be passed to all PEs in the active set
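    The active set defined by PE_start, logPE_stride and PE_size can be enumerated with one line of arithmetic, taking PE_stride to be 2 raised to logPE_stride. The helper below is a hypothetical illustration and does not call the Shmem library.

```c
#include <assert.h>

/* The i-th member (counting from 0) of the active set:
 * PE_start + i * 2^logPE_stride. */
static int active_set_pe(int pe_start, int log_pe_stride, int i)
{
    assert(pe_start >= 0 && log_pe_stride >= 0 && i >= 0);
    return pe_start + i * (1 << log_pe_stride);
}
```

    For PE_start = 2, logPE_stride = 1 and PE_size = 4, the active set would be PEs 2, 4, 6 and 8, and their contributions would appear in the target array in that order.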
15. and offer a mechanism to notify a PE that another processing element has completed some action.
    The function shmem_wait blocks the calling PE until some remote PE writes a long value not equal to value into var on the waiting PE.
    The function shmem_int_wait blocks the calling PE until some remote PE writes an integer value not equal to value into var on the waiting PE.
    The function shmem_long_wait blocks the calling PE until some remote PE writes a long value not equal to value into var on the waiting PE.
    The function shmem_longlong_wait blocks the calling PE until some remote PE writes a long long value not equal to value into var on the waiting PE.
    The function shmem_short_wait blocks the calling PE until some remote PE writes a short value not equal to value into var on the waiting PE.
    The function shmem_wait_until blocks the calling PE until some remote PE changes the long variable var to satisfy the condition implied by cond and value.
    The function shmem_int_wait_until blocks the calling PE until some remote PE changes the integer variable var to satisfy the condition implied by cond and value.
    The function shmem_long_wait_until blocks the calling PE until some remote PE changes the long variable var to satisfy the condition implied by cond and value.
    The function shmem_longlong_wait_until blocks the calling PE until some remote PE changes the long
16. and shmem_short_add perform atomic fetch-and-add, fetch-and-increment and atomic add on a remote data object, respectively. The functions performing atomic memory operations are described in detail on the following pages.

    shmem_swap(3)
    NAME
    shmem_swap, shmem_double_swap, shmem_float_swap, shmem_int_swap, shmem_long_swap, shmem_longlong_swap, shmem_short_swap: perform an atomic swap to a remote data object
    SYNOPSIS
    #include <shmem.h>
    long shmem_swap(long *target, long value, int pe);
    double shmem_double_swap(double *target, double value, int pe);
    float shmem_float_swap(float *target, float value, int pe);
    int shmem_int_swap(int *target, int value, int pe);
    long shmem_long_swap(long *target, long value, int pe);
    long long shmem_longlong_swap(long long *target, long long value, int pe);
    short shmem_short_swap(short *target, short value, int pe);
    PARAMETERS
    target  The pointer to the remotely accessible data object to be updated on the remote PE. The type of target should match that implied in the SYNOPSIS section.
    value   Value to be atomically written to the remote PE. value is the same type as target.
    pe      An integer indicating the PE number on which target is to be updated.
    DESCRIPTION
    These functions perform atomic swap operations. It is worth noting that atomic access to a variable V is only guaranteed if V is updated solely by Shmem routines. Thus, in order to preserv
17. element (PE) to change
    SYNOPSIS
    #include <shmem.h>
    void shmem_wait(long *var, long value);
    void shmem_int_wait(int *var, int value);
    void shmem_long_wait(long *var, long value);
    void shmem_longlong_wait(long long *var, long long value);
    void shmem_short_wait(short *var, short value);
    void shmem_wait_until(long *var, int cond, long value);
    void shmem_int_wait_until(int *var, int cond, int value);
    void shmem_long_wait_until(long *var, int cond, long value);
    void shmem_longlong_wait_until(long long *var, int cond, long long value);
    void shmem_short_wait_until(short *var, int cond, short value);
    PARAMETERS
    var    A remotely accessible integer variable that is being updated by a remote processing element.
    cond   The compare operator that compares var with value. The following cond values are supported:
           SHMEM_CMP_EQ  Equal operator
           SHMEM_CMP_NE  Not equal operator
           SHMEM_CMP_GT  Greater than operator
           SHMEM_CMP_LE  Less than or equal operator
           SHMEM_CMP_LT  Less than operator
           SHMEM_CMP_GE  Greater than or equal operator
    value  The value used as the right operand of the compare operator cond. The left operand is the value pointed to by var.
    DESCRIPTION
    These functions wait for var to be changed by a write (put) or atomic swap issued by a remote PE. These routines can be used for point-to-point direct synchronization
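    How cond, var and value combine can be sketched as a plain predicate, with the value pointed to by var as the left operand and value as the right operand. The enum below is a hypothetical stand-in for the SHMEM_CMP_* constants, whose actual numeric values come from <shmem.h> and are not assumed here; the sketch evaluates the condition once and does no waiting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the SHMEM_CMP_* constants in <shmem.h>. */
enum example_cmp {
    EX_CMP_EQ, EX_CMP_NE, EX_CMP_GT, EX_CMP_LE, EX_CMP_LT, EX_CMP_GE
};

/* True when "var cond value" holds; var is the left operand. */
static bool cond_holds(long var, enum example_cmp cond, long value)
{
    switch (cond) {
    case EX_CMP_EQ: return var == value;
    case EX_CMP_NE: return var != value;
    case EX_CMP_GT: return var >  value;
    case EX_CMP_LE: return var <= value;
    case EX_CMP_LT: return var <  value;
    case EX_CMP_GE: return var >= value;
    }
    assert(!"unknown comparison operator");
    return false;
}
```

    In these terms, a wait_until call returns once the predicate becomes true for the current contents of var, and a plain wait corresponds to the not-equal case.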
18. int argc, char *argv[]

        for (progName = argv[0] + strlen(argv[0]);
             progName > argv[0] && *(progName - 1) != '/';
             progName--)
            ;

        while ((c = getopt(argc, argv, "n:eh")) != -1)
            switch (c) {
            case 'n':
                if ((reps = getSize(optarg)) <= 0)
                    usage(progName);
                break;
            case 'e':
                doprint++;
                break;
            case 'h':
                help(progName);
            default:
                usage(progName);
            }

        if (optind == argc)
            minWords = 1;
        else if ((minWords = getSize(argv[optind++])) < 0)
            usage(progName);

        if (optind == argc)
            maxWords = minWords;
        else if ((maxWords = getSize(argv[optind++])) < minWords)
            usage(progName);

        if (optind == argc)
            incWords = 0;
        else if ((incWords = getSize(argv[optind++])) < 0)
            usage(progName);

    Initialization
    The program name is passed in as argv[0], the first string on the command line. This string may take the form of a pathname, such as /opt/rms/example/sping. The progName variable is set to point to the end of the program name. The loop then steps the variable backwards one character at a time until either a filename separator or the beginning of the name is reached. This leaves progName pointing at the start of the program name.
    The while loop steps through the options given on the command line:
    - If the -n option has been used, the variable reps is set to the requested number of repetitions, after a check that the number is greater than 0. If the number is invalid, the usage function is
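    The backwards scan over argv[0] can be isolated into a small function. This is a sketch under the assumption that '/' is the filename separator; program_name is a hypothetical name and is not part of the program listing.

```c
#include <assert.h>
#include <string.h>

/* Return a pointer to the last path component of path.
 * Starts at the terminating NUL and steps backwards until the previous
 * character is a '/' or the start of the string is reached. */
static const char *program_name(const char *path)
{
    assert(path != NULL);
    const char *p = path + strlen(path);
    while (p > path && *(p - 1) != '/')
        p--;
    return p;
}
```

    Given "/opt/rms/example/sping" this returns a pointer to "sping"; a name with no separator is returned unchanged.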
19. of nwords is incremented by the value of incWords. If no value was specified for incWords on the command line, the original value of nwords is doubled or, if nwords was unspecified, it is set to 1. If the user specified maxWords, the for loop is iterated until nwords exceeds the value of maxWords. If not, the loop is only executed once.
    Before the processes begin to time how long the ping operation takes, they synchronize using shmem_barrier_all. This ensures that they are all ready to start sending and receiving messages at the same time. The timing is done by calling the function gettime twice: once before the remote writes start and once when they have finished.
    After testing that the process has a peer (this test has to be repeated here, since all the processes must participate in the synchronization), the read/write operations on shared variables can begin. The odd numbered processes (proc & 1) start first by waiting for the shared variable to be modified by the peer. The call to shmem_wait blocks the process until the value stored in the nwords-1 position of the buffer rbuf is modified by the peer. The call to shmem_wait specifies:
    - The address of a remotely accessible variable that is being updated by a remote processing element.
    - The value V to be compared with the value S stored in the remotely accessible variable. The process blocks while S and V remain equal, that is
20. range of valid PE identifiers. This singleton is disabled from printing.

    3.8 Writing Shared Variables
    In the final section of main, the process pings its peer a given number of times using the Shmem functions.

    int main(int argc, char *argv[])

        for (nwords = minWords;
             nwords <= maxWords;
             nwords = incWords ? nwords + incWords : nwords ? 2 * nwords : 1) {
            r = reps;
            shmem_barrier_all();
            tv[0] = gettime();
            if (peer < nproc) {
                if (proc & 1) {
                    r--;
                    shmem_wait(&rbuf[nwords - 1], 0);
                    rbuf[nwords - 1] = 0;
                }
                while (r-- > 0) {
                    shmem_put(rbuf, tbuf, nwords, peer);
                    shmem_wait(&rbuf[nwords - 1], 0);
                    rbuf[nwords - 1] = 0;
                }
                if (proc & 1)
                    shmem_put(rbuf, tbuf, nwords, peer);
            }
            tv[1] = gettime();
            t = dt(&tv[1], &tv[0]) / (2 * reps);
            shmem_barrier_all();
            printStats(proc, peer, doprint, nwords, t);
        }
        shmem_barrier_all();
        exit(0);

    The Shmem library functions to access a shared variable are described here. The for loop controls how many sets of repetitions are performed. In each set of repetitions, a message containing nwords words is written from one process to its peer for the number of times specified by reps. The first time through the loop, nwords is set to minWords. This was initialized earlier (see Section 3.5) to the value the user entered for nwords on the command line (by default, 1). On subsequent iterations, the value
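    The packet-size progression driven by the for loop can be modeled without the Shmem calls. The sketch below assumes the update rule described in the text (add incWords when it is nonzero, otherwise double, with 0 stepping to 1); next_nwords and count_sets are hypothetical helpers, and overflow for very large sizes is ignored.

```c
#include <assert.h>

/* Next packet size: add inc_words if given, else double (0 becomes 1). */
static int next_nwords(int nwords, int inc_words)
{
    assert(nwords >= 0 && inc_words >= 0);
    if (inc_words > 0)
        return nwords + inc_words;
    return nwords > 0 ? 2 * nwords : 1;
}

/* Number of sets of repetitions the loop would perform. */
static int count_sets(int min_words, int max_words, int inc_words)
{
    int sets = 0;
    for (int n = min_words; n <= max_words; n = next_nwords(n, inc_words))
        sets++;
    return sets;
}
```

    With minWords 64, maxWords 128 and incWords 32, the sizes would be 64, 96 and 128, giving three sets; with the defaults (1, 1, 0) there is a single set.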
21. These functions provide low latency reads of variables stored in the memory of a remote PE. The library offers a wide range of remote read (get) functions that are optimized for most basic data types. In particular, the remote read functions can be grouped as follows:
    1. Functions reading a single data item having basic type from the memory of a remote PE (e.g. shmem_double_g, etc.).
    2. Functions reading contiguous data from the memory of a remote PE (e.g. shmem_double_get, etc.).
    3. Functions reading strided data from the memory of a remote PE (e.g. shmem_double_iget, etc.).
    The remote read functions are described in detail on the following pages.

    shmem_double_g(3)
    NAME
    shmem_double_g, shmem_float_g, shmem_int_g, shmem_long_g, shmem_short_g: transfer one data item from a remote PE
    SYNOPSIS
    #include <shmem.h>
    double shmem_double_g(double *addr, int pe);
    float shmem_float_g(float *addr, int pe);
    int shmem_int_g(int *addr, int pe);
    long shmem_long_g(long *addr, int pe);
    short shmem_short_g(short *addr, int pe);
    PARAMETERS
    addr  The remotely accessible array element or scalar data object.
    pe    The number of the remote PE on which addr resides.
    DESCRIPTION
    These routines provide a very low latency remote read capability
22. shmem_double_prod_to_all performs a reduction applying the product function to double values distributed across the PEs.
    The function shmem_float_prod_to_all performs a reduction applying the product function to float values distributed across the PEs.
    The function shmem_int_prod_to_all performs a reduction applying the product function to integer values distributed across the PEs.
    The function shmem_long_prod_to_all performs a reduction applying the product function to long values distributed across the PEs.
    The function shmem_longdouble_prod_to_all performs a reduction applying the product function to long double values distributed across the PEs.
    The function shmem_longlong_prod_to_all performs a reduction applying the product function to long long values distributed across the PEs.
    The function shmem_short_prod_to_all performs a reduction applying the product function to short values distributed across the PEs.
    SEE ALSO
    shmem_barrier(3), shmem_barrier_all(3)

    shmem_double_sum_to_all(3)
    NAME
    shmem_double_sum_to_all, shmem_float_sum_to_all, shmem_int_sum_to_all, shmem_long_sum_to_all, shmem_longdouble_sum_to_all, shmem_longlong_sum_to_all, shmem_short_sum_to_all: performs a sum reduction across a set of processing elements (PEs)
    SYNOPSIS
    #include <shmem.h>
    void shmem_double_sum_to_all(double *target, int PE_start, int logPE
23. shmem_int_get(int *target, const int *source, size_t len, int pe);
    void shmem_long_get(long *target, const long *source, size_t len, int pe);
    void shmem_longdouble_get(long double *target, const long double *source, size_t len, int pe);
    void shmem_longlong_get(long long *target, const long long *source, size_t len, int pe);
    void shmem_short_get(short *target, const short *source, size_t len, int pe);
    PARAMETERS
    target  Local data object to be updated.
    source  Data object on the PE identified by pe that contains the data to be copied. This data object must be remotely accessible.
    len     Number of elements in the target and source.
    pe      The number of the remote PE on which source resides.
    DESCRIPTION
    These routines provide the means for copying a contiguous data object from a remote PE to a contiguous data object on the local PE. The routines return when the data has been delivered to the target array on the local PE.
    The function shmem_get reads any non-character type that has a storage size equal to 64 bits from a remote PE.
    The function shmem_double_get reads contiguous elements of type double from a remote PE.
    The function shmem_float_get reads contiguous elements of type float from a remote PE.
    The function shmem_get32 reads any non-character type that has a storage size equal to 32 bits from a remote
24. sst, size_t len, int pe);
    void shmem_float_iget(float *target, const float *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_iget32(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_iget64(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_iget128(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_int_iget(int *target, const int *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_long_iget(long *target, const long *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_longdouble_iget(long double *target, const long double *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_longlong_iget(long long *target, const long long *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    void shmem_short_iget(short *target, const short *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
    PARAMETERS
    target  Array to be updated on the local PE.
    source  Array containing the data to be copied on the remote PE.
    tst     The stride between consecutive elements of the target array. The stride is scaled by the element size of the target array. A value of 1 indicates contiguous data.
    sst     The stride between consecutive elements of
25. the logical AND operator to short values distributed across the PEs.
    SEE ALSO
    shmem_barrier(3), shmem_barrier_all(3)

    shmem_double_max_to_all(3)
    NAME
    shmem_double_max_to_all, shmem_float_max_to_all, shmem_int_max_to_all, shmem_long_max_to_all, shmem_longdouble_max_to_all, shmem_longlong_max_to_all, shmem_short_max_to_all: performs a maximum function reduction across a set of processing elements (PEs)
    SYNOPSIS
    #include <shmem.h>
    void shmem_double_max_to_all(double *target, double *source, int nreduce, int PE_start, int logPE_stride, int PE_size, double *pWrk, long *pSync);
    void shmem_float_max_to_all(float *target, float *source, int nreduce, int PE_start, int logPE_stride, int PE_size, float *pWrk, long *pSync);
    void shmem_int_max_to_all(int *target, int *source, int nreduce, int PE_start, int logPE_stride, int PE_size, int *pWrk, long *pSync);
    void shmem_long_max_to_all(long *target, long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long *pWrk, long *pSync);
    void shmem_longdouble_max_to_all(long double *target, long double *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long double *pWrk, long *pSync);
    void shmem_longlong_max_to_all(long long *target, long long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long long *pWrk, long *pSync);
    void shmem_short_max_to_all(short *target,
26. the means for copying a contiguous data object from the local PE to a contiguous data object on another PE. The routines return when the data has been copied out of the source array on the local PE, but not necessarily before the data has been delivered to the remote data object. Use shmem_quiet to force completion of all remote transfers.
    The function shmem_put writes any non-character type that has a storage size equal to 64 bits to the remote PE.
    The function shmem_double_put writes contiguous elements of double type to the remote PE.
    The function shmem_float_put writes contiguous elements of float type to the remote PE.
    The function shmem_int_put writes contiguous elements of integer type to the remote PE.
    The function shmem_long_put writes contiguous elements of long type to the remote PE.
    The function shmem_longdouble_put writes contiguous elements of long double type to the remote PE.
    The function shmem_longlong_put writes contiguous elements of long long type to the remote PE.
    The function shmem_short_put writes contiguous elements of short type to the remote PE.
    The function shmem_put32 writes any non-character type that has a storage size equal to 32 bits to the remote PE.
    The function shmem_put64 writes any non-character type that has a storage size equal to 64 bits to the remote PE.
    The function shmem_put128 writes any non-character ty
27. to be updated.
    DESCRIPTION
    The conditional swap routines conditionally update a target data object on an arbitrary processing element (PE) and return the prior contents of the data object in one atomic operation. Note that atomic access to a variable V is guaranteed only if V is updated solely by Shmem routines. To preserve the correct semantics of atomic operations, all the processing elements, including the one for which the variable V is local, must therefore refer to V using the Shmem atomic routines.
    The function shmem_int_cswap performs an atomic conditional swap on a remotely accessible integer data object.
    The function shmem_long_cswap performs an atomic conditional swap on a remotely accessible long data object.
    The function shmem_longlong_cswap performs an atomic conditional swap on a remotely accessible long long data object.
    The function shmem_short_cswap performs an atomic conditional swap on a remotely accessible short data object.
    RETURN VALUES
    These functions return the contents that had been at the target address on the remote PE prior to the conditional swap.
    SEE ALSO
    shmem_swap(3)
    shmem_short_add(3)
    NAME
    shmem_short_add - performs an atomic add operation on a remote data object
    SYNOPSIS
    #include <shmem.h>
    void shmem_short_add(short *target, short value, int pe);
    PARAMETERS
    t
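The conditional swap described above can be modelled on a single local variable. The following sketch is not atomic and is not a Shmem routine. model_int_cswap is a hypothetical helper, and its argument order (target, cond, value) is an assumption based on the usual Cray-style interface, because the synopsis is not part of this excerpt. It only shows the value semantics: the old contents are always returned, and the store happens only when the old contents equal cond.

```c
#include <assert.h>
#include <stddef.h>

/* Non-atomic, single-process model of a conditional swap.
 * Hypothetical helper: the real routines perform these steps as one
 * atomic operation on a remote PE. */
int model_int_cswap(int *target, int cond, int value)
{
    int old;

    assert(target != NULL);

    old = *target;          /* prior contents, always returned */
    if (old == cond) {
        *target = value;    /* update only when the comparison succeeds */
    }
    return old;
}
```

A caller can tell whether the swap took place by comparing the return value with cond.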
28. until a remote processing element writes a different value to the shared variable. The buffer rbuf was initialised earlier to 0 (see Section 3.6), so the process blocks until the peer writes a value different from 0 to the shared buffer. When the process returns from the function shmem_wait, it sets the (nwords - 1)th position of the shared buffer to 0, preparing for the next iteration.
    In the while loop, both the odd- and even-numbered processes write the shared variable and then wait until the peer has executed the remote write. The number of repetitions r is decremented each time. The call to shmem_put specifies:
    - The address of the remote variable to be updated on the remote PE, i.e. rbuf.
    - The address of the local variable containing the data to be copied to the remote variable, i.e. tbuf.
    - The number of elements in the local and remote variables, i.e. nwords.
    - The PE number of the remote processing element to which the local variable will be copied.
    When the process has written the remote variable rbuf, it waits until the peer performs a write on its local buffer. This is done by calling the shmem_wait function. Once the process returns from this function, it sets the (nwords - 1)th element of the shared buffer to 0, preparing for the next iteration.
    Finally, the odd-numbered process performs the final write on the remote buffer. By making the odd-numbered processes wait for a
    6
29. with the main processor; that is to say, memory on the CPU's high-speed memory bus.
    main processor: The main CPU (or CPUs for a multi-processor) of a node, typically an Alpha 21264.
    management network: A private network used by the RMS daemons for control and diagnostics.
    Footnote 1: Used to be called GMT.
    multirail system: A system that has more than one Elan card connected to each node, each Elan card being connected to a different switch network.
    multi-threaded program: A program constructed such that, during its execution, multiple sequences of instructions are executed concurrently, possibly by different CPUs. Each thread of execution has a separate stack, but otherwise they all share the same address space.
    node: A system with memory, one or more CPUs, and one or more Elan cards, running an instance of the operating system.
    poll: Loop and check on each loop whether a specified event has occurred.
    rank: An integer value that identifies a single process from a set of parallel processes.
    reduce: Combine the results of a parallel computation into a single value.
    remote memory: The memory (Elan card or main) of a node when accessed by another node over the network.
    resource: A set of CPUs allocated to a user to run one or more parallel jobs.
    slice: A local copy of a global object.
    switch network: The network constructed from the Elan cards and Elite cards.
    thread: An independent sequence of execution. Every hos
30. shmem_get128, shmem_get32, shmem_get64, shmem_getmem, shmem_iget, shmem_iget128, shmem_iget32, shmem_iget64, shmem_init, shmem_int_and_to_all, shmem_int_cswap, shmem_int_fadd, shmem_int_finc, shmem_int_g, shmem_int_get, shmem_int_iget, shmem_int_iput, shmem_int_max_to_all, shmem_int_min_to_all, shmem_int_mswap, shmem_int_or_to_all, shmem_int_p, shmem_int_prod_to_all, shmem_int_put, shmem_int_sum_to_all, shmem_int_swap, shmem_int_wait, shmem_int_wait_until, shmem_int_xor_to_all, shmem_iput, shmem_iput128, shmem_iput32, shmem_iput64, shmem_long_and_to_all, shmem_long_cswap, shmem_long_fadd, shmem_long_finc, shmem_long_g, shmem_long_get, shmem_long_iget, shmem_long_iput, shmem_long_max_to_all, shmem_long_min_to_all, shmem_long_mswap, shmem_long_or_to_all, shmem_long_p, shmem_long_prod_to_all, shmem_long_put, shmem_long_sum_to_all, shmem_long_swap, shmem_long_wait, shmem_long_wait_until, shmem_long_xor_to_all, shmem_longdouble_get, shmem_longdouble_iget, shmem_longdouble_iput, shmem_longlong_cswap, shmem_longlong_fadd, shmem_longlong_finc, shmem_longlong_get, shmem_longlong_iget, shmem_longlong_iput
31. 32, shmem_put64, shmem_put128, shmem_putmem - transfer data to a remote PE
    SYNOPSIS
    #include <shmem.h>
    void shmem_put(void *target, const void *source, size_t len, int pe);
    void shmem_double_put(double *target, const double *source, size_t len, int pe);
    void shmem_float_put(float *target, const float *source, size_t len, int pe);
    void shmem_int_put(int *target, const int *source, size_t len, int pe);
    void shmem_long_put(long *target, const long *source, size_t len, int pe);
    void shmem_longdouble_put(long double *target, const long double *source, size_t len, int pe);
    void shmem_longlong_put(long long *target, const long long *source, size_t len, int pe);
    void shmem_put32(void *target, const void *source, size_t len, int pe);
    void shmem_put64(void *target, const void *source, size_t len, int pe);
    void shmem_put128(void *target, const void *source, size_t len, int pe);
    void shmem_putmem(void *target, const void *source, size_t len, int pe);
    void shmem_short_put(short *target, const short *source, size_t len, int pe);
    PARAMETERS
    target  The remotely accessible array data object to be updated on the remote PE.
    source  Data object containing the data to be copied to the remote PE.
    len     Number of elements in the target and source. len must be of integer type.
    pe      The number of the remote PE to which the data object source will be transferred.
    DESCRIPTION
    These routines provide
32. IS section.
    source  A symmetric array of length nreduce that contains one element for each separate reduction operation. The source argument must have the same data type as target.
    nreduce  The number of elements in the target and source arrays.
    PE_start  The lowest virtual PE number of the active set of PEs.
    logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
    PE_size  The number of PEs in the active set.
    pWrk  A symmetric work array. The pWrk argument must have the same data type as target. In C/C++, this contains max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements.
    pSync  A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.
    DESCRIPTION
    The shared memory reduction routines compute one or more reductions across symmetric arrays on multiple virtual PEs. A reduction performs an associative binary operation across a set of values. The nreduce argument determines the number of elements on which to perform the reduction operation. The source array on all PEs in the active set provides one element for each reduction. The results of the reductions are placed in the target array on all PEs in the active set. The act
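The sizing rule for pWrk can be written out as a small helper. This is a sketch under stated assumptions: the formula is read from the extracted text as max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE), reduce_pwrk_elems is a hypothetical name, and the minimum is passed in as a parameter because the real constant comes from shmem.h, which is not available in this stand-alone fragment.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements a pWrk array needs for a reduction over nreduce
 * elements. min_wrkdata stands in for _SHMEM_REDUCE_MIN_WRKDATA_SIZE.
 * Hypothetical helper for illustration only. */
size_t reduce_pwrk_elems(size_t nreduce, size_t min_wrkdata)
{
    size_t need;

    assert(min_wrkdata > 0);

    need = nreduce / 2 + 1;   /* integer division, as in the C expression */
    return need > min_wrkdata ? need : min_wrkdata;
}
```

In a real program the result would size a symmetric array of the same element type as target.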
33. PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source arrays, and the same pWrk and pSync work arrays, must be passed to all PEs in the active set.
    Before any PE calls a reduction routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
    - The pWrk and pSync arrays on all PEs in the active set are not still in use from a prior call to a collective shared memory routine.
    - The target array on all PEs in the active set is ready to accept the results of the reduction.
    Upon return from a reduction routine, the following are true for the local PE:
    - The target array is updated.
    - The values in the pSync array are restored to the original values.
    The function shmem_double_min_to_all performs a reduction applying the minimum function to double values distributed across the PEs.
    The function shmem_float_min_to_all performs a reduction applying the minimum function to float values distributed across the PEs.
    The function shmem_int_min_to_all performs a reduction applying the minimum function to integer values distributed across the PEs.
    The function shmem_long_min_to_all performs a reduction applying the minimum function to long values distributed across the PEs.
    The function shmem_longdouble_min_to_all performs a
34. Shmem Programming Manual
    Quadrics Supercomputers World Ltd
    Document Version 3, June 27th 2001
    The information supplied in this document is believed to be correct at the time of publication, but no liability is assumed for its use or for the infringements of the rights of others resulting from its use. No licence or other rights are granted in respect of any rights owned by any of the organisations mentioned herein.
    This document may not be copied, in whole or in part, without the prior written consent of Quadrics Supercomputers World Ltd.
    Copyright 1998, 1999, 2000, 2001 Quadrics Supercomputers World Ltd.
    The specifications listed in this document are subject to change without notice.
    Compaq, the Compaq logo, Alpha, AlphaServer, and Tru64 are trademarks of Compaq Information Technologies Group, L.P. in the United States and other countries. UNIX is a registered trademark of The Open Group in the U.S. and other countries. TotalView and Etnus are registered trademarks of Etnus LLC. All other product names mentioned herein may be trademarks of their respective companies. Cray is a registered trademark of Cray Inc.
    The Quadrics Supercomputers World Ltd (Quadrics) web site can be found at: http://www.quadrics.com
    Quadrics' address is: QSW Limited, One Bridewell Street, Bristol BS1 2AA, UK. Tel: +44 (0)117 9075375. Fax: +44 (0)117 9075395.
    Circulation Control: None
    Document Revision History
    Revision Da
35. (Uniform Resource Locator).
    LED: Light Emitting Diode.
    MIMD: Multiple Instruction Multiple Data; a parallel processing computer architecture characterised as having multiple processors, each potentially executing a different instruction sequence on different data.
    MMU: Memory Management Unit; the part of the CPU that provides protection between user processes and support for virtual memory.
    MPI: Message Passing Interface; a parallel processing API.
    MPP: Massively Parallel Processing; processing that involves the use of a large number of processors in a coordinated fashion.
    PCI: Peripheral Component Interconnect; the Elan is connected to a node through this interface.
    PDF: Portable Document Format; the page description language used by Adobe Acrobat (derived from PostScript) for displaying pages on the screen.
    PTE: Page Table Entry; an entry in the page table which maps the base address of a page to physical memory.
    RISC: Reduced Instruction Set Computer; a computer whose machine instructions represent relatively simple operations that can be executed very quickly.
    RMS: Resource Management System; Quadrics software for managing clusters of UNIX nodes.
    SDRAM: Synchronous Dynamic Random Access Memory; a high-performance computer memory architecture.
    Shmem: A one-sided put/get inter-process communication interface used on high-performance parallel systems.
    SMP: Symmetric MultiProcessor; a computer whose main memory is shared by more than one processor.
    SNMP: Simple Network Ma
36. _stride, int PE_size, double *pWrk, long *pSync);
    void shmem_float_sum_to_all(float *target, float *source, int nreduce, int PE_start, int logPE_stride, int PE_size, float *pWrk, long *pSync);
    void shmem_int_sum_to_all(int *target, int *source, int nreduce, int PE_start, int logPE_stride, int PE_size, int *pWrk, long *pSync);
    void shmem_long_sum_to_all(long *target, long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long *pWrk, long *pSync);
    void shmem_longdouble_sum_to_all(long double *target, long double *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long double *pWrk, long *pSync);
    void shmem_longlong_sum_to_all(long long *target, long long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long long *pWrk, long *pSync);
    void shmem_short_sum_to_all(short *target, short *source, int nreduce, int PE_start, int logPE_stride, int PE_size, short *pWrk, long *pSync);
    PARAMETERS
    target  A symmetric array of length nreduce to receive the result of the reduction operations. The data type of target should match that implied in the SYNOPSIS section.
    source  A symmetric array of length nreduce that contains one element for each separate reduction o
37. a allocated by malloc
    3. C and C++ data allocated by elan_allocMain or elan_gallocMain
    Warning: Note that calls to malloc, calloc, etc. are unsynchronised and that these functions are called from other C library routines. You should not rely on dynamically allocated objects being at the same address in each process. The global allocator elan_gallocMain performs synchronised storage allocation (see the Elan Programming Manual for details).
    2.3.1 Word Lengths
    The Shmem library provides functions that perform the same operation for different data types, for example shmem_int_put, shmem_long_put and shmem_double_put. Some types have different lengths under different operating system and compiler combinations, and in particular they may differ from the lengths found in the Cray Shmem implementation. The sizes of each type, in bytes, are listed in Table 2.1.
    Table 2.1: Data Type Sizes (columns: Type, Tru64 UNIX, Alpha Linux, Solaris, Unicos; the cell values did not survive text extraction)
    2.4 Library Function Categories
    The functions in the Shmem library can be grouped according to the operations they perform. These groups are:
    Initialisation
    The initialisation functions (Section 2.5) prepare the process to participate in shared memory operations. Furthermore, this group of functions can be used to retrieve information such as
38. ache_inv(void);
    void shmem_set_cache_inv(void);
    void shmem_set_cache_line_inv(void *target);
    void shmem_udcflush(void);
    void shmem_udcflush_line(void *target);
    DESCRIPTION
    These routines are supplied for compatibility with the Cray Shmem library. They perform no operations and return to the caller successfully.
    Programming Examples
    3.1 Introduction
    This chapter contains a programming example which makes use of the facilities of the Shmem library. The example implements a multiprocess version of the UNIX program ping using the Shmem routines. ping sends packets across the network to elicit a response from a specified network host and prints out timing statistics for the round trip (sending a packet and getting a response). The example program extends ping to work on multiple network hosts by running processes in parallel on a number of processors. The processes form pairs, and each process in the pair pings the other. After a user-specified number of pings, one of the processes in each pair prints its timing statistics.
    The following sections describe how the program is implemented. The complete program listing is given in Section 3.10.
    3.2 The Command Line Interface
    This is the command-line interface for the program sping:
    sping [-n number[k|K|m|M]] [-eh] nwords [maxWords] [incWords]
    The options for the program are:
    -n number[k|K|m|M]  Specifies the number of times to ping
39. and add.
    Collective Reduction
    The shared memory reduction routines distribute work across a set of PEs (Section 2.10). In particular, these functions perform an associative binary operation across a set of values distributed on a set of PEs.
    Collective Communication
    The shared memory collective routines operate on the same data object on multiple PEs. The Shmem library supplies routines to broadcast a block of data from a processing element to one or more target PEs, and to concatenate data items coming from a subset of PEs (Section 2.11).
    Address Manipulation
    The Shmem library routines that provide multi-process programs with access to a contiguous region of virtual address space (Section 2.12) are not supported in this implementation.
    Control Data Cache
    These routines are supplied for compatibility with the Cray Shmem library, and they are implemented as NOPs (Section 2.13).
    The following sections describe these groups of functions in more detail. Each section starts by discussing how the functions work as a group, and then the functions are described individually.
    2.5 Initialisation
    The initialisation functions are listed in Table 2.2.
    Table 2.2: Initialisation Functions
    Name        Description
    start_pes   Not supported in this implementation
    shmem_init  Initialize a process to use Shmem
    num_pes     Return the number of processes using Shmem
    my_pe       Return the processing element identifier
    These functions are use
40. arget  The pointer to a remotely accessible data object to be updated on the remote PE. The data type of target should match that implied in the SYNOPSIS section.
    value  The value to be atomically added to the target.
    pe     An integer that indicates the PE number upon which target is to be updated.
    DESCRIPTION
    The shmem_short_add routine performs an atomic add operation. It adds value to the variable pointed to by target on the processing element specified by pe. Note that atomic access to a variable V is guaranteed only if V is updated solely by Shmem routines. To preserve the correct semantics of atomic operations, all the processing elements, including the one for which the variable V is local, must therefore refer to V using the Shmem atomic routines.
    SEE ALSO
    shmem_short_cswap(3)
    shmem_int_mswap(3)
    NAME
    shmem_int_mswap, shmem_long_mswap, shmem_short_mswap - perform an atomic masked swap on a remote data object
    SYNOPSIS
    #include <shmem.h>
    int shmem_int_mswap(int *target, int mask, int value, int pe);
    long shmem_long_mswap(long *target, long mask, long value, int pe);
    short shmem_short_mswap(short *target, short mask, short value, int pe);
    PARAMETERS
    target  The pointer to a remotely accessible data object to be updated on the remote PE. The data type of target should match that implied in the SYNOPSIS section.
    mask    Identifies the bits within target that are to be updated w
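The masked swap can be pictured as a bitwise merge. The excerpt cuts off in the middle of the mask description, so the sketch below relies on an assumption: bits set in mask are taken from value, bits clear in mask keep their old contents, and the previous contents are returned. model_mswap is a hypothetical, non-atomic local helper, and it uses unsigned int to keep the bit operations well defined, whereas the library routines take signed types.

```c
#include <assert.h>
#include <stddef.h>

/* Non-atomic, single-process model of a masked swap.
 * Assumed semantics (not confirmed by the excerpt): bits selected by
 * mask are replaced with the corresponding bits of value. */
unsigned int model_mswap(unsigned int *target, unsigned int mask,
                         unsigned int value)
{
    unsigned int old;

    assert(target != NULL);

    old = *target;
    *target = (old & ~mask) | (value & mask);  /* merge under the mask */
    return old;                                /* prior contents */
}
```

A mask of all ones reduces this model to a plain swap, and a mask of zero leaves the target untouched.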
41. ariables
    3.9 Subsidiary Functions
    3.10 Program Listing
    Glossary
    Index
    List of Tables
    2.1 Data Type Sizes
    2.2 Initialisation Functions
    2.3 Remote Write Functions
    2.4 Remote Read Functions
    2.5 Synchronization Functions
    2.6 Atomic Memory Operations
    2.7 Collective Reduction Operation
    2.8 Collective Communication Functions
    2.9 Address Manipulation Functions
    2.10 Control Data Cache Functions
    Preface
    1.1 Scope of Manual
    This manual describes the Shmem programming library. This library supports a shared memory programming model where cooperating processes exchange data by performing read and write operations on logically shared variables.
    1.2 Audience
    This manual is intended for developers who want to develop parallel applications using a shared memory programming model. The manual assumes that the reader is familiar with the following:
    - UNIX operating system
    - C programming language
    1.3 Using this Manual
    This manual contains three chapters. Their contents are as follows:
    Chapter 1 (Preface) describes the layout of the manual and the conventions used to present information.
    Chapter 2 (The Shmem Library) describes the functions in the Shmem library.
    Chapter 3 (Programming Examples) contains a worked example of using the Shmem librarie
42. base 2 of the stride between consecutive virtual PE numbers in the active set.
    PE_size  The number of PEs in the active set. PE_size must be of type integer.
    pSync    A symmetric work array. pSync must have size _SHMEM_BARRIER_SYNC_SIZE. Every element of this array must be initialized to 0 before any of the PEs in the active set enter shmem_barrier the first time.
    DESCRIPTION
    shmem_barrier is a collective synchronization routine. Control returns from shmem_barrier after all PEs in the active set (specified by PE_start, logPE_stride and PE_size) have called shmem_barrier.
    The values of the arguments PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same work array must be passed in pSync to all PEs in the active set.
    The shmem_barrier routine ensures that all previously issued local stores, and all previously issued remote memory updates done by any of the PEs in the active set using Shmem calls (for example shmem_put), are complete before returning.
    The same pSync array may be reused on consecutive calls to shmem_barrier if the same active PE set is used.
    SEE ALSO
    shmem_barrier_all(3)
    shmem_wait(3)
    NAME
    shmem_wait, shmem_int_wait, shmem_long_wait, shmem_longlong_wait, shmem_short_wait, shmem_wait_until, shmem_int_wait_until, shmem_long_wait_until, shmem_longlong_wait_until, shmem_short_wait_until - Waits for a variable on the local processing
43. called. This merely displays the command-line syntax for the program and then exits.
    - If the -e option has been used, the variable doprint is incremented. This variable is used later to enable or disable the printing of statistics.
    - The -h option calls the help function, which displays the command-line syntax for the program and explains the meaning of the various options (or flags), like this:
        Usage: sping [flags] nwords [maxWords] [incWords]
        Flags may be any of
          -n number    repetitions to time
          -e           everyone print timing info
          -h           print this info
        Numbers may be postfixed with 'k' or 'm'
    - If any option other than the three mentioned here is given, the function usage is called to display the correct command-line syntax and then exit.
    The three if statements determine whether the optional arguments for specifying a varying packet size have been set. The variable optind is defined externally and is declared by the header files included at the start of the program. After the while loop has stepped through all the options, optind indexes the first remaining argument in argv.
    The first argument should be nwords, the number of words in each packet. If the user has not specified this argument, the program continues rather than exiting, but assumes a value of 1. Note that the value is assigned to minWords rather than to the variable nwords. Later on, the value is transferred to nwords when it acts as an iteration variable.
    E
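The help text says that numbers may be postfixed with k or m, and the listing calls a function named getSize that signals an error with a negative result. Its body is not part of this excerpt, so the following is only a sketch of how such a parser could look. parse_size is a hypothetical name, and the multipliers (k = 1024, m = 1024 * 1024) are assumptions rather than something the manual states here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parse a non-negative decimal number with an optional k/K or m/M
 * suffix. Returns the scaled value, or -1 if the string is malformed.
 * Hypothetical stand-in for the getSize helper; assumes k = 1024 and
 * m = 1024 * 1024. Overflow of very large inputs is not checked. */
long parse_size(const char *s)
{
    char *end;
    long n;

    assert(s != NULL);

    n = strtol(s, &end, 10);
    if (end == s || n < 0) {
        return -1;                      /* no digits, or negative */
    }

    switch (tolower((unsigned char)*end)) {
    case 'k':
        n *= 1024L;
        end++;
        break;
    case 'm':
        n *= 1024L * 1024L;
        end++;
        break;
    default:
        break;                          /* no suffix */
    }

    return (*end == '\0') ? n : -1;     /* reject trailing characters */
}
```

A caller would then treat any negative result as a usage error, in the same way the listing tests the result of getSize.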
44. ciative binary operation across a set of values. The nreduce argument determines the number of elements on which to perform the reduction operation. The source array on all PEs in the active set provides one element for each reduction. The results of the reductions are placed in the target array on all PEs in the active set. The active set is defined by the PE_start, logPE_stride, PE_size triplet.
    The source and target arrays may be the same array, but they may not be overlapping arrays.
    The values of the arguments nreduce, PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source arrays, and the same pWrk and pSync work arrays, must be passed to all PEs in the active set.
    Before any PE calls a reduction routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
    - The pWrk and pSync arrays on all PEs in the active set are not still in use from a prior call to a collective shared memory routine.
    - The target array on all PEs in the active set is ready to accept the results of the reduction.
    Upon return from a reduction routine, the following are true for the local PE:
    - The target array is updated.
    - The values in the pSync array are restored to the original values.
    The function shmem_double_max_to_all performs a reduct
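The PE_start, logPE_stride, PE_size triplet can be unpacked with a little arithmetic: member i of the active set is PE number PE_start + i * 2^logPE_stride, for i from 0 to PE_size - 1. The helpers below are a sketch of that rule. Their names are hypothetical and they are not part of the Shmem library.

```c
#include <assert.h>

/* Virtual PE number of the i-th member (0-based) of the active set.
 * Hypothetical helper illustrating the triplet arithmetic. */
int active_set_member(int PE_start, int logPE_stride, int PE_size, int i)
{
    assert(PE_start >= 0 && logPE_stride >= 0);
    assert(i >= 0 && i < PE_size);

    return PE_start + i * (1 << logPE_stride);
}

/* Returns 1 if pe belongs to the active set, 0 otherwise. */
int in_active_set(int pe, int PE_start, int logPE_stride, int PE_size)
{
    int stride = 1 << logPE_stride;   /* stride is a power of two */
    int offset = pe - PE_start;

    return offset >= 0 && offset % stride == 0 && offset / stride < PE_size;
}
```

For example, the triplet (2, 1, 4) describes the PEs 2, 4, 6 and 8.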
45. ction, e.g. shmem_int_min_to_all
    OR    The logical OR function, e.g. shmem_int_or_to_all
    PROD  The product function, e.g. shmem_int_prod_to_all
    SUM   The sum function, e.g. shmem_int_sum_to_all
    XOR   The logical exclusive OR function, e.g. shmem_int_xor_to_all
    The collective reduction functions are described in detail on the following pages.
    shmem_int_and_to_all(3)
    NAME
    shmem_int_and_to_all, shmem_long_and_to_all, shmem_longlong_and_to_all, shmem_short_and_to_all - perform a logical AND function across a set of processing elements (PEs)
    SYNOPSIS
    #include <shmem.h>
    void shmem_int_and_to_all(int *target, int *source, int nreduce, int PE_start, int logPE_stride, int PE_size, int *pWrk, long *pSync);
    void shmem_long_and_to_all(long *target, long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long *pWrk, long *pSync);
    void shmem_longlong_and_to_all(long long *target, long long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long long *pWrk, long *pSync);
    void shmem_short_and_to_all(short *target, short *source, int nreduce, int PE_start, int logPE_stride, int PE_size, short *pWrk, long *pSync);
    PARAMETERS
    target  A symmetric array of length nreduce to receive the result of the reduction operation. The data type of target should match that implied in the SYNOPSIS section.
    source  A symmetric ar
46. d shmem_broadcast(void *target, void *source, int nlong, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync);
    void shmem_broadcast32(void *target, void *source, int nlong, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync);
    void shmem_broadcast64(void *target, void *source, int nlong, int PE_root, int PE_start, int logPE_stride, int PE_size, long *pSync);
    PARAMETERS
    target  A symmetric data object used to receive the data broadcast by the processing element specified by PE_start. For shmem_broadcast and shmem_broadcast64, the data type of target can be any type that has an element size of 64 bits.
    source  A symmetric data object that can be of any data type that is permissible for the target argument.
    nlong   The number of elements in source. For shmem_broadcast and shmem_broadcast64, this is the number of 64-bit words. For shmem_broadcast32, this is the number of 32-bit halfwords.
    PE_root  Zero-based ordinal of the PE, with respect to the active set, from which the data is copied. Must be greater than or equal to 0 and less than PE_size.
    PE_start  The lowest virtual PE number of the active set of PEs.
    logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
    PE_size  The number of PEs in the active set.
    pSync   A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this a
47. d to initialize the environment for the processes using the features offered by the Shmem library. In particular, shmem_init expects all of the processes to have been started by RMS. The function initialises the caller and then synchronises the caller with the other processes. The functions num_pes and my_pe supply the number of PEs belonging to the parallel application and the PE identifier of the calling process, respectively.
    The initialisation functions are described in detail on the following pages.
    The function start_pes is not supported in this implementation. Shmem programs are started via prun (see the RMS User Manual for details).
    my_pe(3)
    NAME
    my_pe - returns the processing element number of the calling PE
    SYNOPSIS
    #include <shmem.h>
    int my_pe(void);
    DESCRIPTION
    The function my_pe returns the processing element (PE) number of the calling PE.
    RETURN VALUES
    The function my_pe returns an integer between 0 and npes - 1, where npes is the total number of PEs executing the current program.
    SEE ALSO
    num_pes(3), shmem_init(3)
    num_pes(3)
    NAME
    num_pes - returns the number of PEs running in an application
    SYNOPSIS
    #include <shmem.h>
    int num_pes(void);
    DESCRIPTION
    The function num_pes computes the number of PEs running in a parallel application.
    RETURN VALUES
    The function num_pes returns an integer indicating the number of PEs that are curre
48. dwidth and minimize data latency.
    The Shmem library can be used in conjunction with, or as a replacement for, message passing routines (e.g. MPI), so that developers can optimally mix message passing and shared memory programming models in the same application.
    2.2 Compiling
    To use the functions in the Shmem library, programs must include the header file shmem.h. The library functions reference header files which are, by default, installed in the directory /usr/include (or /opt/rms/include for Solaris). Programs must be linked with libshmem.so. An example command line to compile a program prog.c is shown here:
        cc -o prog prog.c -lshmem
    Definitions for the Fortran interface to Shmem can be found in the header file shmem.fh.
    2.3 Using the Shmem Library
    Shmem routines can be used in programs that perform computations in separate address spaces and that explicitly pass data to and from different processes in the program. The processes participating in shared memory applications are referred to as processing elements (PEs).
    Typically, target or source data that reside on remote processing elements are identified by passing the address of the corresponding data object on the local PE. The local existence of a corresponding data object implies that a data object is remotely accessible. The remotely accessible data objects are listed below:
    1. Non-stack C and C++ variables
    2. C and C++ dat
49. e library.
    CFS: Cluster File System; the file system for Tru64 UNIX clusters.
    CGI: Common Gateway Interface; a standard method for generating HTML pages dynamically from an application, so that a Web server and a Web browser can exchange information. A CGI script can be written in any language and can access various types of data, for example a SQL database.
    CPU: Central Processing Unit; the part of the computer that executes the machine instructions that make up the various user and system programs.
    CRC: Cyclic Redundancy Check; a method of error detection.
    CVS: Concurrent Versions System; a revision control utility for managing software releases and controlling the concurrent editing of files by multiple software developers.
    DIMM: Dual In-Line Memory Module.
    DMA: Direct Memory Access; a high-performance I/O technique where peripherals read and write memory directly and not through a CPU.
    GNU: GNU's Not UNIX; a UNIX-like development effort of the Free Software Foundation, headed by Richard Stallman.
    HTML: HyperText Markup Language; a generic markup language comprising a set of tags that enables structured documents to be delivered over the World Wide Web and viewed by a browser.
    HTTP: HyperText Transfer Protocol; a communications protocol commonly used between a Web server and a Web browser, together with a URL
50. e *pWrk, long *pSync);
void shmem_float_min_to_all(float *target, float *source, int nreduce, int PE_start, int logPE_stride, int PE_size, float *pWrk, long *pSync);
void shmem_int_min_to_all(int *target, int *source, int nreduce, int PE_start, int logPE_stride, int PE_size, int *pWrk, long *pSync);
void shmem_long_min_to_all(long *target, long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long *pWrk, long *pSync);
void shmem_longdouble_min_to_all(long double *target, long double *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long double *pWrk, long *pSync);
void shmem_longlong_min_to_all(long long *target, long long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long long *pWrk, long *pSync);
void shmem_short_min_to_all(short *target, short *source, int nreduce, int PE_start, int logPE_stride, int PE_size, short *pWrk, long *pSync);
PARAMETERS
target: A symmetric array of length nreduce to receive the result of the reduction operations. The data type of target should match that implied in the SYNOPSIS section.
source: A symmetric array of length nreduce that contains one element for each separate reduction operation. The source argument mus
51. e the correct semantics of atomic operations, all the processing elements, including the one for which the variable V is local, must refer to V using the Shmem atomic routines. The shmem_swap function writes the long value value into the variable pointed to by target on processing element pe and returns the previous contents of target as an atomic operation. The shmem_double_swap function writes the double value value into the variable pointed to by target on processing element pe and returns the previous contents of target as an atomic operation. The shmem_float_swap function writes the float value value into the variable pointed to by target on processing element pe and returns the previous contents of target as an atomic operation. The shmem_int_swap function writes the integer value value into the variable pointed to by target on processing element pe and returns the previous contents of target as an atomic operation. The shmem_long_swap function writes the long value value into the variable pointed to by target on processing element pe and returns the previous contents of target as an atomic operation. The shmem_longlong_swap function writes the long long value value into the variable pointed to by target on processing element pe and returns the previous contents of target as an atomic operation. The shmem_short_swap function writes the short value value into the variable pointed to by target on processing elemen
52. es not return until all the data is delivered to the remote PE's memory.
SEE ALSO
shmem_put(3), shmem_fence(3), shmem_barrier(3), shmem_wait(3)
2.9 Atomic Memory Operations
The atomic memory functions are listed in Table 2.6.
Table 2.6 Atomic Memory Operations
Name                 Description
shmem_int_fadd       Atomic fetch and add on an integer data object
shmem_long_fadd      Atomic fetch and add on a long data object
shmem_short_finc     Atomic fetch and increment on a short data object
shmem_short_inc      Atomic increment on a short data object
shmem_short_mswap    Atomic masked swap to a short data object
These routines are used to perform atomic read and update operations on a remote data object. It is worth noting that atomic access to a shared variable V is only guaranteed if V is updated solely by the Shmem routines. Thus, in order to preserve the correct semantics of atomic operations, all the processing elements, including the one for which V is local, must refer to V using the Shmem atomic routines. Routines like shmem_swap, shmem_int_cswap and shmem_int_mswap perform an atomic swap operation, an atomic conditional swap operation and a masked atomic swap operation on a remote data object, respectively. Functions like shmem_int_fadd, shmem_int_finc
53. for single elements of most basic types. The function shmem_double_g transfers a double data item from a remote PE. The function shmem_float_g transfers a float data item from a remote PE. The function shmem_int_g transfers an integer data item from a remote PE. The function shmem_long_g transfers a long data item from a remote PE. The function shmem_short_g transfers a short data item from a remote PE.
RETURN VALUES
These functions return the contents that had been at the target address addr on the remote PE specified by pe.
SEE ALSO
shmem_get(3)
NAME
shmem_get, shmem_double_get, shmem_float_get, shmem_get32, shmem_get64, shmem_get128, shmem_getmem, shmem_int_get, shmem_long_get, shmem_longdouble_get, shmem_longlong_get, shmem_short_get - transfer contiguous data from a remote PE
SYNOPSIS
#include <shmem.h>
void shmem_get(void *target, const void *source, size_t len, int pe);
void shmem_double_get(double *target, const double *source, size_t len, int pe);
void shmem_float_get(float *target, const float *source, size_t len, int pe);
void shmem_get32(void *target, const void *source, size_t len, int pe);
void shmem_get64(void *target, const void *source, size_t len, int pe);
void shmem_get128(void *target, const void *source, size_t len, int pe);
void shmem_getmem(void *target, const void *source, size_t len, int pe);
void
54. get(3)
shmem_short_get(3)
shmem_iget(3)
shmem_double_iget(3)
shmem_float_iget(3)
shmem_iget32(3)
shmem_iget64(3)
shmem_iget128(3)
shmem_int_iget(3)
shmem_long_iget(3)
shmem_longdouble_iget(3)
shmem_longlong_iget(3)
shmem_short_iget(3)
2.8 Synchronization Operations
barrier(3)
shmem_barrier_all(3)
shmem_barrier(3)
shmem_wait(3)
shmem_int_wait(3)
shmem_long_wait(3)
shmem_longlong_wait(3)
shmem_short_wait(3)
shmem_wait_until(3)
shmem_int_wait_until(3)
shmem_long_wait_until(3)
shmem_longlong_wait_until(3)
shmem_short_wait_until(3)
shmem_fence(3)
shmem_quiet(3)
2.9 Atomic Memory Operations
shmem_swap(3)
shmem_double_swap(3)
shmem_float_swap(3)
55. h>
void shmem_double_p(double *addr, double value, int pe);
void shmem_float_p(float *addr, float value, int pe);
void shmem_int_p(int *addr, int value, int pe);
void shmem_long_p(long *addr, long value, int pe);
void shmem_short_p(short *addr, short value, int pe);
PARAMETERS
addr: The remotely accessible array element or scalar data object which will receive the data on the remote PE.
value: The value to be transferred to addr on the remote PE.
pe: The number of the remote PE where value will be transferred.
DESCRIPTION
These routines provide a very low latency remote write capability for single elements of most basic types. These functions start the remote transfer and may return before the data is delivered to the remote PE. Use shmem_quiet to force completion of all remote transfers. The function shmem_double_p transfers a double data item to the remote PE. The function shmem_float_p transfers a float data item to the remote PE. The function shmem_int_p transfers an integer data item to the remote PE. The function shmem_long_p transfers a long data item to the remote PE. The function shmem_short_p transfers a short data item to the remote PE.
SEE ALSO
shmem_put(3), shmem_quiet(3)
NAME
shmem_put, shmem_double_put, shmem_float_put, shmem_int_put, shmem_long_put, shmem_longdouble_put, shmem_longlong_put, shmem_short_put, shmem_put
56. he only process, it exits, as there is no one for it to ping.
3.7 Establishing the Peer Group
Before starting the first (and possibly only) set of repetitions, the processes must synchronize and group themselves into pairs.
int main(int argc, char *argv[])
    if (doprint)
        printf("%d(%d): Shmem PING reps %d minWords %d maxWords %d incWords %d\n", proc, nproc, reps, minWords, maxWords, incWords);
    shmem_barrier_all();
    peer = proc ^ 1;
    if (peer >= nproc)
        doprint = 0;
If all the processes have been enabled for printing (with the e option), each prints a message to confirm its identity, the number of processes in the program, and the program parameters. Before starting to ping each other, the processes synchronize; that is to say, each waits in the call to shmem_barrier_all until all have made the call. This guarantees that all the processes are initialized and ready to write and read shared variables before any one of them starts to ping another. In order to ping each other, the processes split up into pairs. Each process determines its opposite number, or peer, simply by an exclusive OR of its own PE number with the constant 1. The processes have PE identifiers numbered from 0 to nproc - 1, where nproc is the number of processes in the program. With an odd number of processes, one will have no peer. This can be determined by checking that the peer's PE number is in the
57. ion applying the maximum function to double values distributed across the PEs. The function shmem_float_max_to_all performs a reduction applying the maximum function to float values distributed across the PEs. The function shmem_int_max_to_all performs a reduction applying the maximum function to integer values distributed across the PEs. The function shmem_long_max_to_all performs a reduction applying the maximum function to long values distributed across the PEs. The function shmem_longdouble_max_to_all performs a reduction applying the maximum function to long double values distributed across the PEs. The function shmem_longlong_max_to_all performs a reduction applying the maximum function to long long values distributed across the PEs. The function shmem_short_max_to_all performs a reduction applying the maximum function to short values distributed across the PEs.
SEE ALSO
shmem_barrier(3), shmem_barrier_all(3)
NAME
shmem_double_min_to_all, shmem_float_min_to_all, shmem_int_min_to_all, shmem_long_min_to_all, shmem_longdouble_min_to_all, shmem_longlong_min_to_all, shmem_short_min_to_all - perform a minimum function reduction across a set of processing elements (PEs)
SYNOPSIS
#include <shmem.h>
void shmem_double_min_to_all(double *target, double *source, int nreduce, int PE_start, int logPE_stride, int PE_size, doubl
58. ion shmem_quiet waits for completion of all outstanding remote writes initiated from the current PE. The routine shmem_quiet does not return until all data is delivered to the remote PEs' memory. The synchronization functions are described in detail on the following pages.
NAME
barrier, shmem_barrier_all - register the arrival of a PE at a barrier and suspend PE execution until all other PEs arrive at the barrier
SYNOPSIS
#include <shmem.h>
void barrier(void);
void shmem_barrier_all(void);
DESCRIPTION
Barriers are a fast mechanism for synchronizing all PEs at once. The function shmem_barrier_all causes a PE to suspend execution until all PEs have called shmem_barrier_all. These barrier functions also ensure completion of all previously issued local memory stores and remote memory updates issued via shared memory routine calls such as shmem_put32.
SEE ALSO
shmem_barrier(3), shmem_init(3)
NAME
shmem_barrier - performs a barrier operation on a subset of processing elements (PEs)
SYNOPSIS
#include <shmem.h>
void shmem_barrier(int PE_start, int logPE_stride, int PE_size, long *pSync);
PARAMETERS
PE_start: The lowest virtual PE number of the active set of PEs. PE_start must be of type integer. If you are using Fortran, it must be a default integer value.
logPE_stride: The log
59. ith bits from value. The bits set to 1 in mask indicate bits to be copied from value into the corresponding bit location in target. The parameter mask must be the same data type as target.
value: Contains the bits to be atomically written to target on the remote PE. The parameter mask identifies the bits to be transferred. The parameter value must be the same data type as target.
pe: An integer that indicates the PE number upon which target is to be updated.
DESCRIPTION
The masked swap routines update a target data object on an arbitrary processing element (PE) and return the prior content of the data object in one atomic operation. It is worth noting that atomic access to a variable V is only guaranteed if V is updated solely by these Shmem routines. Thus, in order to preserve the correct semantics of atomic operations, all the processing elements, including the one for which V is local, must refer to the variable V using Shmem atomic routines. The shmem_int_mswap routine atomically updates the integer value pointed to by target according to the bit mask specified by mask. The shmem_long_mswap routine atomically updates the long value pointed to by target according to the bit mask specified by mask. The shmem_short_mswap routine atomically updates the short value pointed to by target according to the bit mask specified by mask.
RETURN VALUES
These functions return the contents that had been in the target add
60. its to the remote PE. The function shmem_double_iput writes a strided array of type double to the remote PE. The function shmem_float_iput writes a strided array of type float to the remote PE. The function shmem_int_iput writes a strided array of type integer to the remote PE. The function shmem_iput32 writes a strided array where each element is any non-character type that has a storage size equal to 32 bits to the remote PE. The function shmem_iput64 writes a strided array where each element is any non-character type that has a storage size equal to 64 bits to the remote PE. The function shmem_iput128 writes a strided array where each element is any non-character type that has a storage size equal to 128 bits to the remote PE. The function shmem_long_iput writes a strided array of type long to the remote PE. The function shmem_longdouble_iput writes a strided array of type long double to the remote PE. The function shmem_longlong_iput writes a strided array of type long long to the remote PE. The function shmem_short_iput writes a strided array of type short to the remote PE.
SEE ALSO
shmem_put(3), shmem_get(3), shmem_iget(3), shmem_quiet(3)
2.7 Remote Read Operations
The Shmem library includes the functions shown in Table 2.4 for performing remote read operations.
Table 2.4 Remote Read Functions
61. ive set is defined by the PE_start, logPE_stride, PE_size triplet. The source and target arrays may be the same array, but they may not be overlapping arrays. The values of arguments nreduce, PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source arrays and the same pWrk and pSync work arrays must be passed to all PEs in the active set. Before any PE calls a reduction routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
• The pWrk and pSync arrays on all PEs in the active set are not still in use from a prior call to a collective shared memory routine.
• The target array on all PEs in the active set is ready to accept the results of the reduction.
Upon return from a reduction routine, the following are true for the local PE:
• The target array is updated.
• The values in the pSync array are restored to the original values.
The function shmem_int_or_to_all performs a reduction applying the logical OR operator on integer values distributed across the PEs. The function shmem_long_or_to_all performs a reduction applying the logical OR operator on long values distributed across the PEs. The function shmem_longlong_or_to_all performs a reduction applying the logical OR operator on long long values distributed across the PEs. The function shmem_short_or_to_all perform
62. le virtual PEs. A reduction performs an associative binary operation across a set of values. The nreduce argument determines the number of separate reductions to perform. The source array on all PEs in the active set provides one element for each reduction. The results of the reductions are placed in the target array on all PEs in the active set. The active set is defined by the PE_start, logPE_stride, PE_size triplet. The source and target arrays may be the same array, but they may not be overlapping arrays. The values of arguments nreduce, PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source arrays and the same pWrk and pSync work arrays must be passed to all PEs in the active set. Before any PE calls a reduction routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
• The pWrk and pSync arrays on all PEs in the active set are not still in use from a prior call to a collective shared memory routine.
• The target array on all PEs in the active set is ready to accept the results of the reduction.
Upon return from a reduction routine, the following are true for the local PE:
• The target array is updated.
• The values in the pSync array are restored to the original values.
The function
63. long variable var to satisfy the condition implied by comp and val. The function shmem_short_wait_until blocks the calling PE until some remote PE changes the short variable var to satisfy the condition implied by comp and val.
SEE ALSO
shmem_put(3)
NAME
shmem_fence - assures ordering of delivery of puts
SYNOPSIS
#include <shmem.h>
void shmem_fence(void);
DESCRIPTION
This function ensures ordering of remote write (put) operations. All put operations issued to a particular processing element (PE) prior to the call to shmem_fence are guaranteed to be delivered before any subsequent remote write operation to the same PE which follows the call to shmem_fence. The shmem_quiet function should be called if ordering of puts is desired when multiple remote PEs are involved.
SEE ALSO
shmem_quiet(3)
NAME
shmem_quiet - waits for completion of all outstanding remote writes issued by a processing element (PE)
SYNOPSIS
#include <shmem.h>
void shmem_quiet(void);
DESCRIPTION
This function waits for completion of all outstanding remote writes initiated from the calling PE. Remote writes are issued by calls to shmem_put and related put routines. When control returns from shmem_put, the data has been delivered to the communication circuitry but may not yet have arrived at the remote PE. The shmem_quiet function do
64. m_int_fadd operates on an integer data object. The shmem_long_fadd operates on a long data object. The shmem_longlong_fadd operates on a long long data object. The shmem_short_fadd operates on a short data object.
RETURN VALUES
These functions return the contents that had been at the target address on the remote PE prior to the atomic addition operation.
SEE ALSO
shmem_int_swap(3), shmem_int_cswap(3), shmem_int_finc(3)
NAME
shmem_int_finc, shmem_long_finc, shmem_longlong_finc, shmem_short_finc - perform an atomic fetch and increment operation on a remote data object
SYNOPSIS
#include <shmem.h>
int shmem_int_finc(int *target, int pe);
long shmem_long_finc(long *target, int pe);
long long shmem_longlong_finc(long long *target, int pe);
short shmem_short_finc(short *target, int pe);
PARAMETERS
target: The pointer to a remotely accessible data object to be incremented on the remote PE. The data type of target should match that implied in the SYNOPSIS section.
pe: An integer that indicates the PE number upon which target is to be updated.
DESCRIPTION
These routines perform an atomic fetch and increment operation. They increment the data object pointed to by target on the PE specified by pe and return the previous contents of target as an atomic operation. It is worth noting that atomic access to a variable V is only gua
65. max_to_all(3)
shmem_longlong_max_to_all(3)
shmem_short_max_to_all(3)
shmem_double_min_to_all(3)
shmem_float_min_to_all(3)
shmem_int_min_to_all(3)
shmem_long_min_to_all(3)
shmem_longdouble_min_to_all(3)
shmem_longlong_min_to_all(3)
shmem_short_min_to_all(3)
shmem_int_or_to_all(3)
shmem_long_or_to_all(3)
shmem_longlong_or_to_all(3)
shmem_short_or_to_all(3)
shmem_double_prod_to_all(3)
shmem_float_prod_to_all(3)
shmem_int_prod_to_all(3)
shmem_long_prod_to_all(3)
shmem_longdouble_prod_to_all(3)
shmem_longlong_prod_to_all(3)
shmem_short_prod_to_all(3)
shmem_double_sum_to_all(3)
shmem_float_sum_to_all(3)
shmem_int_sum_to_all(3)
shmem_long_sum_to_all(3)
shmem_longdouble_sum_to_all(3)
shmem_longlong_sum_to_all(3)
shmem_short_sum_to_all(3)
shmem_int_xor
66. nagement Protocol, a protocol used to monitor and control devices on the Internet.
SQL: Structured Query Language, a database language.
TLB: Translation Lookaside Buffer, part of the MMU that caches the result of virtual to physical address translations to minimise translation times in subsequent accesses to the same page.
URL: Uniform Resource Locator, a standard protocol for addressing information on the World Wide Web.
UTC: Coordinated Universal Time; on UNIX systems it is represented as the time elapsed in seconds since January 1, 1970 at 00:00:00.
Terms
barrier: A synchronisation point in a parallel computation that all of the processes must reach before they are allowed to continue.
bisectional bandwidth: The worst case bandwidth across the diameter of the network.
block: A thread that blocks without relinquishing the processor until a specified event occurs.
critical section: A section of program statements that can yield incorrect results if more than one thread tries to execute the section at the same time.
Elan memory: The SDRAM on the Elan card.
event: A parallel processing synchronisation primitive implemented by the Elan card.
Flit: A communications cycle unit of information.
HTTP cookies: Cookies provide a general mechanism that HTTP server side connections use to store and to retrieve information on the client side of the connection.
main memory: The memory normally associated
67. ng took 10.14 microseconds, giving a rate of 50.49 MBytes per second. If printing has been enabled for all processes (with the e option), this message is displayed by each process. By default, only one process in each pair displays the message.
3.4 Header Files and Variables
The header files and variables used by the program are shown here. The variables are declared in main.
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <shmem.h>
int main(int argc, char *argv[])
    double t, tv[2];
    int reps = 10000;
    int minWords = 1;
    int maxWords = 1;
    int incWords;
    int proc;
    int peer;
    int nproc;
    long *rbuf;
    long *tbuf;
    int doprint = 0;
    char *progName;
    int nwords;
    int c, r, i;
The header files and variables are described here. Besides the standard C header files, the shmem.h header file is required for the Shmem libraries. The two time variables are used to time each set of repetitions of writing and reading a shared variable. The tv array is used to record two times using the function gettimeofday:
1. The time before the set of repetitions begins.
2. The time after the set of repetitions has ended.
The variable t is used to hold the difference between these two readings. All the time values are expressed in microseconds. This group of variables is used to control how many time
68. ntly allowed to cooperate using the Shmem library functions.
SEE ALSO
my_pe(3), shmem_init(3)
NAME
shmem_init - initialise a process to use the Shmem library
SYNOPSIS
#include <shmem.h>
void shmem_init(void);
DESCRIPTION
The function shmem_init initialises the Shmem library. The shmem_init call must be made before any other Shmem library calls. The function shmem_init should only be called once for each process.
SEE ALSO
num_pes(3), my_pe(3)
2.6 Remote Write Operations
The remote write functions are listed in Table 2.3.
Table 2.3 Remote Write Functions
Name                    Description
shmem_double_p          Transfers a double data item to a PE
shmem_float_p           Transfers a float data item to a PE
shmem_int_p             Transfers an integer data item to a remote PE
shmem_double_put        Transfers contiguous double data to a PE
shmem_float_put         Transfers contiguous float data to a PE
shmem_int_put           Transfers contiguous integer data to a PE
shmem_long_put          Transfers contiguous long data to a remote PE
shmem_longdouble_put    Transfers contiguous long double data to a PE
shmem_longlong_put      Transfers contiguous long long data to a PE
shmem_short_put         Transfers contiguous short data to a PE
shmem_put               Transfers data type having 64 bits storage size
shmem_put32             Transfers data type having 32 bits storage
69. on shmem_short_xor_to_all performs a reduction applying the logical exclusive OR operator on short values distributed across the PEs.
SEE ALSO
shmem_barrier(3), shmem_barrier_all(3)
2.11 Collective Communication
The collective communication functions are listed in Table 2.8.
Table 2.8 Collective Communication Functions
Description
Concatenates blocks of data having 64 bit storage class
Concatenates blocks of data having 64 bit storage class
Concatenates blocks of data having 32 bit storage class
Concatenates blocks of data having 64 bit storage class
Collective communication routines operate on the same data object on multiple PEs. Shmem supports two different types of collective communication, as explained below:
• Broadcast routines (i.e. shmem_broadcast), which are used to broadcast a block of data from one processing element, named the root of the operation, to a set of PEs.
• Concatenation routines (i.e. shmem_collect), which are used to concatenate data items distributed over a set of PEs.
The collective communication functions are described in detail on the following pages.
NAME
shmem_broadcast, shmem_broadcast32, shmem_broadcast64 - broadcasts a block of data from one processing element (PE) to one or more target PEs
SYNOPSIS
#include <shmem.h>
voi
70. ory allocation exit 1 memset rbuf 0 maxWords sizeof long if tbuf long malloc maxWords sizeof long perror Failed memory allocation exit 1 shmem_init proc my_pe nproc num_pes if nproc 1 exit 0 for i 0 i lt maxWords 1 tbuf i 1000 i amp 255 if doprint printf Sd d Shmem PING reps d minWords d maxWords d incWords d n proc nproc reps minWords maxWords incWords shmem_barrier_all peer proc 1 if peer gt nproc doprint 0 for nwords minWords nwords lt maxWords nwords incWords nwords incWords nwords 2 nwords 1 r reps shmem_barrier_all tv 0 gettime if peer lt nproc if proc amp 1 Programming Examples 3 13 Program Listing ri shmem_wait amp rbuf nwords 1 rbuf nwords 1 0 while r gt 0 shmem_put rbuf tbuf shmem_wait amp rbuf nwords 1 rbuf nwords 1 0 if proc amp 1 shmem_put rbuf tbuf tv 1 gettime t dt amp tv 1 amp tv 0 shmem_barrier_all printStats proc peer doprint shmem_barrier_all exit 0 3 14 Programming Examples 0 nwords peer 0 nwords peer 2 reps nwords t Abbreviations API CFS CGI CPU CRC CVS DIMM DMA GNU HTML Glossary Application Program Interface specification of interface to software packag
71. pe that has a storage size equal to 128 bits to the remote PE. The function shmem_putmem writes any data type to the remote PE; len is scaled in bytes.
SEE ALSO
shmem_iput(3), shmem_quiet(3)
NAME
shmem_iput, shmem_double_iput, shmem_float_iput, shmem_int_iput, shmem_iput32, shmem_iput64, shmem_iput128, shmem_long_iput, shmem_longdouble_iput, shmem_longlong_iput, shmem_short_iput - transfer strided data to a remote PE
SYNOPSIS
#include <shmem.h>
void shmem_iput(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_double_iput(double *target, const double *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_float_iput(float *target, const float *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_int_iput(int *target, const int *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_iput32(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_iput64(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_iput128(void *target, const void *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_long_iput(long *target, const long *source, ptrdiff_t tst, ptrdiff_t sst, size_t len, int pe);
void shmem_longdouble_iput(long double *target, const long double *source
72. peration. The source argument must have the same data type as target.
nreduce: The number of elements in the target and source arrays.
PE_start: The lowest virtual PE number of the active set of PEs.
logPE_stride: The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
PE_size: The number of PEs in the active set.
pWrk: A symmetric work array. The pWrk argument must have the same data type as target. In C/C++, this contains max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements.
pSync: A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.
DESCRIPTION
The shared memory reduction routines compute one or more reductions across symmetric arrays on multiple virtual PEs. A reduction performs an associative binary operation across a set of values. The nreduce argument determines the number of separate reductions to perform. The source array on all PEs in the active set provides one element for each reduction. The results of the reductions are placed in the target array on all PEs in the active set. The active set is defined by the PE_start, logPE_stride, PE_size triplet. The source and target arrays may be the same array, but they may not be overlapping arrays. The values of arg
73. ranteed if V is updated solely by these Shmem routines. Thus, in order to preserve the correct semantics of atomic operations, all the processing elements, including the one for which the variable V is local, must refer to V using the Shmem atomic routines. The shmem_int_finc operates on an integer data object. The shmem_long_finc operates on a long data object. The shmem_longlong_finc operates on a long long data object. The shmem_short_finc operates on a short data object.
RETURN VALUES
These functions return the contents that had been at the target address on the remote PE prior to the atomic increment.
SEE ALSO
shmem_int_swap(3), shmem_int_cswap(3), shmem_int_fadd(3)
NAME
shmem_short_inc - perform an atomic increment operation on a remote data object
SYNOPSIS
#include <shmem.h>
void shmem_short_inc(short *target, int pe);
PARAMETERS
target: The pointer to a remotely accessible data object to be incremented on the remote PE.
pe: An integer that indicates the PE number upon which target is to be updated.
DESCRIPTION
This routine performs an atomic increment on a remote variable pointed to by target on the PE specified by pe. It is worth noting that atomic access to a variable V is only guaranteed if V is updated solely by these Shmem routines. Thus, in order to preserve the correct semantics of atomic ope
74. rations, all the processing elements, including the one for which the variable V is local, must refer to V using the Shmem atomic routines.
SEE ALSO
shmem_short_swap(3), shmem_short_finc(3), shmem_short_fadd(3)
2.10 Collective Reduction Operations
The collective reduction functions are listed in Table 2.7.
Table 2.7 Collective Reduction Operation
Name                      Description
shmem_short_xor_to_all    Performs a logical exclusive OR on short
The Shmem library supplies a wide range of functions to perform associative binary operations across a set of values distributed on a set of processing elements. The following associative binary operators are supported:
AND: The logical AND function (e.g. shmem_int_and_to_all).
MAX: The maximum function (e.g. shmem_int_max_to_all).
MIN: The minimum fun
75. ray of length nreduce that contains one element for each separate reduction operation. The source argument must have the same data type as target.
nreduce       The number of elements in the target and source arrays.
PE_start      The lowest virtual PE number of the active set of PEs.
logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
PE_size       The number of PEs in the active set.
pWrk          A symmetric work array. The pWrk argument must have the same data type as target. In C/C++, this contains max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements.
pSync         A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.

DESCRIPTION
The shared memory reduction routines compute one or more reductions across symmetric arrays on multiple virtual PEs. A reduction performs an associative binary operation across a set of values. The nreduce argument determines the number of elements on which to perform the reduction operation. The source array on all PEs in the active set provides one element for each reduction. The results of the reductions are placed in the target array on all PEs in the active set. The active set is defined by the PE_sta
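The pWrk sizing rule quoted in the parameter list can be written out as a small helper. This is a minimal sketch, not library code: reduce_wrk_elems is a hypothetical name, and the minimum size is passed in as a parameter because the real value of _SHMEM_REDUCE_MIN_WRKDATA_SIZE comes from shmem.h and is not stated in this excerpt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of elements the pWrk array needs for a
 * reduction over nreduce elements, i.e. max(nreduce/2 + 1, minimum).
 * min_wrkdata_size stands in for _SHMEM_REDUCE_MIN_WRKDATA_SIZE. */
size_t reduce_wrk_elems(size_t nreduce, size_t min_wrkdata_size)
{
    size_t needed = nreduce / 2 + 1;   /* integer division, as in C */

    assert(min_wrkdata_size > 0);
    return needed > min_wrkdata_size ? needed : min_wrkdata_size;
}
```

A caller would typically allocate reduce_wrk_elems(nreduce, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements of the same type as target, so that small reductions still get the minimum work area.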
76. reduction applying the minimum function to long double values distributed across the PEs. The function shmem_longlong_min_to_all performs a reduction applying the minimum function to long long values distributed across the PEs. The function shmem_short_min_to_all performs a reduction applying the minimum function to short values distributed across the PEs.

SEE ALSO
shmem_barrier(3), shmem_barrier_all(3)

shmem_int_or_to_all(3)

NAME
shmem_int_or_to_all, shmem_long_or_to_all, shmem_longlong_or_to_all, shmem_short_or_to_all - perform a logical OR function across a set of processing elements (PEs)

SYNOPSIS
#include <shmem.h>
void shmem_int_or_to_all(int *target, int *source, int nreduce, int PE_start, int logPE_stride, int PE_size, int *pWrk, long *pSync);
void shmem_long_or_to_all(long *target, long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long *pWrk, long *pSync);
void shmem_longlong_or_to_all(long long *target, long long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long long *pWrk, long *pSync);
void shmem_short_or_to_all(short *target, short *source, int nreduce, int PE_start, int logPE_stride, int PE_size, short *pWrk, long *pSync);

PARAMETERS
target  A symmetric array of length nreduce to receive the result of the reduction operation. The data type of target should match that implied in the SYNOPS
77. ress on the remote PE prior to the masked swap.

SEE ALSO
shmem_int_swap(3), shmem_int_cswap(3)

shmem_int_fadd(3)

NAME
shmem_int_fadd, shmem_long_fadd, shmem_longlong_fadd, shmem_short_fadd - perform an atomic fetch-and-add operation on a remote data object

SYNOPSIS
#include <shmem.h>
int shmem_int_fadd(int *target, int value, int pe);
long shmem_long_fadd(long *target, long value, int pe);
long long shmem_longlong_fadd(long long *target, long long value, int pe);
short shmem_short_fadd(short *target, short value, int pe);

PARAMETERS
target  A pointer to the remotely accessible data object to be updated on the remote PE. The data type of target should match that implied in the SYNOPSIS section.
value   The value to be atomically added to target. The type of value should match that implied in the SYNOPSIS section.
pe      An integer that indicates the PE number on which target is to be updated.

DESCRIPTION
These routines perform an atomic fetch-and-add operation, adding value to target on the PE specified by pe and returning the previous contents of target. Note that atomic access to a variable V is only guaranteed if V is updated solely by these Shmem routines. Thus, to preserve the correct semantics of atomic operations, all processing elements, including the one for which the variable V is local, must refer to V using the Shmem atomic routines.

The shme
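The fetch-and-add contract in the DESCRIPTION (add value, return the previous contents, and route every update through the atomic routines) can be sketched on a single machine with C11 atomics. This is only a local analogy under stated assumptions: model_long_fadd and next_ticket are hypothetical helpers, not part of the Shmem library, and there is no pe argument because nothing here is remote.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical local stand-in for a fetch-and-add: atomically adds value
 * to *target and returns the contents held just before the addition. */
long model_long_fadd(atomic_long *target, long value)
{
    assert(target != NULL);
    return atomic_fetch_add(target, value);
}

/* Typical use: hand out consecutive ticket numbers from a shared counter.
 * Every caller gets a distinct number because the read and the update
 * happen as one indivisible step. */
long next_ticket(atomic_long *counter)
{
    return model_long_fadd(counter, 1);
}
```

The sketch also mirrors the caveat in the text: the guarantee only holds while every access to the counter goes through the atomic operation. A plain, non-atomic store to the same variable from another thread would break it.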
78. rms a reduction applying the sum function to long double values distributed across the PEs. The function shmem_longlong_sum_to_all performs a reduction applying the sum function to long long values distributed across the PEs. The function shmem_short_sum_to_all performs a reduction applying the sum function to short values distributed across the PEs.

SEE ALSO
shmem_barrier(3), shmem_barrier_all(3)

shmem_int_xor_to_all(3)

NAME
shmem_int_xor_to_all, shmem_long_xor_to_all, shmem_longlong_xor_to_all, shmem_short_xor_to_all - perform a logical exclusive OR function across a set of processing elements (PEs)

SYNOPSIS
#include <shmem.h>
void shmem_int_xor_to_all(int *target, int *source, int nreduce, int PE_start, int logPE_stride, int PE_size, int *pWrk, long *pSync);
void shmem_long_xor_to_all(long *target, long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long *pWrk, long *pSync);
void shmem_longlong_xor_to_all(long long *target, long long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long long *pWrk, long *pSync);
void shmem_short_xor_to_all(short *target, short *source, int nreduce, int PE_start, int logPE_stride, int PE_size, short *pWrk, long *pSync);

PARAMETERS
target  A symmetric array of length nreduce to receive the result of the reduction operation. The data type of target should match that implied in
79. rray must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.

DESCRIPTION
The shared memory broadcast routines are collective routines. They copy the data object source from the PE specified by PE_root and store the values at target on the other PEs specified by the triplet PE_start, logPE_stride, PE_size. The data is not copied to the target area on the root PE.

The values of the arguments PE_root, PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source data objects and the same pSync work array must be passed to all PEs in the active set.

Before any PE calls a broadcast routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
• The pSync array on all PEs in the active set is not still in use from a prior call to a broadcast routine.
• The target array on all PEs in the active set is ready to accept the broadcast data.

Upon return from a broadcast routine, the following are true for the local PE:
• If the current PE is not the root PE, the target data object is updated.
• The values in the pSync array are restored to the original values.

SEE ALSO
shmem_barrier(3), shmem_barrier_all(3)

shmem_collect(3)

NAME
80. rt, logPE_stride, PE_size triplet. The source and target arrays may be the same array, but they may not be overlapping arrays.

The values of the arguments nreduce, PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source arrays, and the same pWrk and pSync work arrays, must be passed to all PEs in the active set.

Before any PE calls a reduction routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
• The pWrk and pSync arrays on all PEs in the active set are not still in use from a prior call to a collective shared memory routine.
• The target array on all PEs in the active set is ready to accept the results of the reduction.

Upon return from a reduction routine, the following are true for the local PE:
• The target array is updated.
• The values in the pSync array are restored to the original values.

The function shmem_int_and_to_all performs a reduction applying the logical AND operator to integer values distributed across the PEs. The function shmem_long_and_to_all performs a reduction applying the logical AND operator to long values distributed across the PEs. The function shmem_longlong_and_to_all performs a reduction applying the logical AND operator to long long values distributed across the PEs. The function shmem_short_and_to_all performs a reduction applying
81. s

1.4 Related Information

The following manuals provide additional information relevant to developing parallel applications using Shmem:
• Elan Programming Manual
• RMS Reference Manual
• RMS User Manual

Programming examples are installed in the directory /usr/lib/rms/examples (or /opt/rms/examples for Solaris), together with makefiles for compiling the programs.

1.5 Location of Online Documentation

Online documentation in HTML format is installed in the directory /usr/lib/rms/docs/html (or /opt/rms/docs/html for Solaris) and can be accessed from a browser at http://rmshost:8081/html/index.html. PostScript and PDF versions of the documents are in /usr/lib/rms/docs (or /opt/rms/docs for Solaris). Please consult your system administrator if you have difficulty accessing the documentation.

New versions of this and other Quadrics documentation can be found on the Quadrics web site, http://www.quadrics.com.

1.6 Reader's Comments

If you would like to make any comments on this or any other Quadrics manual, please send them to support@quadrics.com.

1.7 Conventions

The following typographical conventions have been used in this document:

monospace type       Monospace type denotes literal text. This is used for command descriptions, file names and examples of output.
bold monospace type  Bold monospace type indicates text that the user enters when contrasted with on-screen computer output.
italic monospace t
82. s a reduction applying the logical OR operator to short values distributed across the PEs.

SEE ALSO
shmem_barrier(3), shmem_barrier_all(3)

shmem_double_prod_to_all(3)

NAME
shmem_double_prod_to_all, shmem_float_prod_to_all, shmem_int_prod_to_all, shmem_long_prod_to_all, shmem_longdouble_prod_to_all, shmem_longlong_prod_to_all, shmem_short_prod_to_all - perform a product reduction across a set of processing elements (PEs)

SYNOPSIS
#include <shmem.h>
void shmem_double_prod_to_all(double *target, double *source, int nreduce, int PE_start, int logPE_stride, int PE_size, double *pWrk, long *pSync);
void shmem_float_prod_to_all(float *target, float *source, int nreduce, int PE_start, int logPE_stride, int PE_size, float *pWrk, long *pSync);
void shmem_int_prod_to_all(int *target, int *source, int nreduce, int PE_start, int logPE_stride, int PE_size, int *pWrk, long *pSync);
void shmem_long_prod_to_all(long *target, long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long *pWrk, long *pSync);
void shmem_longdouble_prod_to_all(long double *target, long double *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long double *pWrk, long *pSync);
void shmem_longlong_prod_to_all(long long *target, long long *source, int nreduce, int PE_start, int logPE_stride, int PE_size, long long *pWrk, long *pSync);
83. s the process pings its opposite number and the size of packets sent.

The variable reps is set to the number of repetitions requested with the -n option. It has a default setting of 10,000.

The next three variables hold the minimum, maximum and increment values for the packet size. They are used when more than one set of repetitions is requested. The variable incWords is used to iterate from minWords to maxWords during a set of repetitions.

These variables are used to identify, by means of their PE numbers, the process and its peer (or opposite number) to which it writes a shared variable, and to hold the total number of processing elements.

The variable rbuf is a pointer to a shared buffer. A processing element uses this buffer pointer to write data into the memory of its peer. The variable tbuf is a pointer to a buffer containing the data used to fill the shared buffer rbuf.

The variable doprint is used to enable (1) or disable (0) the printing of results by all the processes.

The progName variable is used to extract the name of the program for use with the standard UNIX-style -h option and the Usage message, which is displayed when the program is called with the wrong arguments.

The remaining four variables are general-purpose iteration variables.

3.5 Argument Checking

The first section of main is concerned with checking the arguments passed to the program on the command line.

int main
84. size
shmem_put64            Transfers data type having 64 bits storage size
shmem_put128           Transfers data type having 128 bits storage size
shmem_putmem           Transfers any contiguous data type to a remote PE
shmem_double_iput      Transfers strided array of double to a remote PE
shmem_float_iput       Transfers strided array of float to a remote PE
shmem_int_iput         Transfers strided array of integer to a remote PE
shmem_long_iput        Transfers strided array of long to a remote PE
shmem_longdouble_iput  Transfers strided array of long double to a PE

These functions provide low-latency writes to variables in the memory of a remote PE. The library offers a large number of remote write functions that are optimized for most basic data types. In particular, the remote write functions can be grouped as follows:

1. Functions transferring a single data item of a basic type into the memory of a remote PE (e.g. shmem_double_p, etc.).
2. Functions transferring contiguous data into the memory of a remote PE (e.g. shmem_double_put, etc.).
3. Functions transferring strided data into the memory of a remote PE (e.g. shmem_double_iput, etc.).

The remote write functions are described in detail on the following pages.

shmem_double_p(3)

NAME
shmem_double_p, shmem_float_p, shmem_int_p, shmem_long_p, shmem_short_p - transfer one data item to a remote PE

SYNOPSIS
#include <shmem
85. stablishing the Peer Group

3.6 Initialization

The next section of main is concerned with initializing the process to use the Shmem library and setting up a target shared variable.

    int main(int argc, char *argv[])
    {
        /* allocate the receive (rbuf) and transmit (tbuf) buffers */
        if (!(rbuf = (long *)malloc(maxWords * sizeof(long)))) {
            perror("Failed memory allocation");
            exit(1);
        }
        if (!(tbuf = (long *)malloc(maxWords * sizeof(long)))) {
            perror("Failed memory allocation");
            exit(1);
        }

        /* fill the transmit buffer with a number sequence, zero the receive buffer */
        for (i = 0; i < maxWords; i++)
            tbuf[i] = 1000 + (i & 255);
        memset(rbuf, 0, maxWords * sizeof(long));

        /* attach to the Shmem library and find out who we are */
        shmem_init();
        proc = my_pe();
        nproc = num_pes();
        if (nproc == 1)
            exit(0);

The initialization process is as follows:

The process allocates memory for the two message buffers, rbuf and tbuf, using the malloc function. The buffers are used as the destination and source in the Shmem remote write operation. Pointers to them are passed to the Shmem library functions. If a maximum number of words for the packet size is specified on the command line to sping, the process allocates a buffer of this size. By default the buffers are 8 bytes (1 word).

The transmit buffer tbuf is initialized by writing a sequence of numbers to it. The remote buffer rbuf is initialized to zero.

The process calls shmem_init to initialize itself to use the Shmem library.

The process calls the functions my_pe and num_pes to determine its PE number and to find out how many processes are running in parallel. If it is t
86. t have the same data type as target.
nreduce       The number of elements in the target and source arrays.
PE_start      The lowest virtual PE number of the active set of PEs.
logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
PE_size       The number of PEs in the active set.
pWrk          A symmetric work array. The pWrk argument must have the same data type as target. In C/C++, this contains max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements.
pSync         A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.

DESCRIPTION
The shared memory reduction routines compute one or more reductions across symmetric arrays on multiple virtual PEs. A reduction performs an associative binary operation across a set of values. The nreduce argument determines the number of elements on which to perform the reduction operation. The source array on all PEs in the active set provides one element for each reduction. The results of the reductions are placed in the target array on all PEs in the active set. The active set is defined by the PE_start, logPE_stride, PE_size triplet. The source and target arrays may be the same array, but they may not be overlapping arrays.

The values of arguments nreduce
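The PE_start, logPE_stride, PE_size triplet is easier to read with the arithmetic spelled out. The sketch below assumes the straightforward reading of the parameter descriptions: the active set consists of PE_size PEs, starting at PE_start and spaced 2^logPE_stride apart. The helper names active_set_member and in_active_set are hypothetical and are not Shmem library functions.

```c
#include <assert.h>

/* Hypothetical helper: PE number of the i-th member (0-based) of the
 * active set described by the triplet. */
int active_set_member(int PE_start, int logPE_stride, int PE_size, int i)
{
    assert(PE_start >= 0 && logPE_stride >= 0);
    assert(i >= 0 && i < PE_size);
    return PE_start + i * (1 << logPE_stride);
}

/* Hypothetical helper: returns 1 if pe belongs to the active set,
 * 0 otherwise. */
int in_active_set(int pe, int PE_start, int logPE_stride, int PE_size)
{
    int stride = 1 << logPE_stride;   /* stride is 2^logPE_stride */
    int offset = pe - PE_start;

    if (offset < 0 || offset % stride != 0)
        return 0;                     /* below the start, or between members */
    return offset / stride < PE_size; /* within the first PE_size members */
}
```

For example, PE_start = 0, logPE_stride = 1 and PE_size = 4 select PEs 0, 2, 4 and 6, while logPE_stride = 0 selects a contiguous block of PE_size PEs starting at PE_start.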
87. t is defined by the PE_start, logPE_stride, PE_size triplet. The source and target arrays may be the same array, but they may not be overlapping arrays.

The values of the arguments nreduce, PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source arrays, and the same pWrk and pSync work arrays, must be passed to all PEs in the active set.

Before any PE calls a reduction routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
• The pWrk and pSync arrays on all PEs in the active set are not still in use from a prior call to a collective shared memory routine.
• The target array on all PEs in the active set is ready to accept the results of the reduction.

Upon return from a reduction routine, the following are true for the local PE:
• The target array is updated.
• The values in the pSync array are restored to the original values.

The function shmem_int_xor_to_all performs a reduction applying the logical exclusive OR operator to integer values distributed across the PEs. The function shmem_long_xor_to_all performs a reduction applying the logical exclusive OR operator to long values distributed across the PEs. The function shmem_longlong_xor_to_all performs a reduction applying the logical exclusive OR operator to long long values distributed across the PEs. The functi
88. t pe and returns the previous contents of target as an atomic operation.

RETURN VALUES
These functions return the contents that had been at the target address on the remote PE prior to the swap.

SEE ALSO
shmem_put(3)

shmem_int_cswap(3)

NAME
shmem_int_cswap, shmem_long_cswap, shmem_longlong_cswap, shmem_short_cswap - perform an atomic conditional swap on a remote data object

SYNOPSIS
#include <shmem.h>
int shmem_int_cswap(int *target, int cond, int value, int pe);
long shmem_long_cswap(long *target, long cond, long value, int pe);
long long shmem_longlong_cswap(long long *target, long long cond, long long value, int pe);
short shmem_short_cswap(short *target, short cond, short value, int pe);

PARAMETERS
target  A pointer to the remotely accessible data object to be updated on the remote PE. The data type of target should match that implied in the SYNOPSIS section.
cond    The value of cond is compared to the remote target value. If cond and the remote target value are equal, then value is swapped into the remote target. Otherwise, the remote target is unchanged. In either case, the old value of the remote target is returned as the function return value. The parameter cond must be of the same data type as target.
value   The value to be atomically written to the remote PE. value must be of the same data type as target.
pe      An integer that indicates the PE number upon which target is
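The conditional swap rule for cond (write value only if the target currently equals cond, and return the old contents either way) maps closely onto a C11 compare-and-exchange. The following is a single-machine sketch of those semantics only. model_long_cswap is a hypothetical name, not a Shmem routine, and it takes no pe argument because no remote memory is involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical local model of a conditional swap: if *target equals cond,
 * store value into *target. In either case return the contents that
 * *target held before the call. */
long model_long_cswap(atomic_long *target, long cond, long value)
{
    long expected = cond;

    assert(target != NULL);
    /* On success *target becomes value and expected still holds the old
     * contents (equal to cond). On failure *target is left unchanged and
     * expected is overwritten with the current contents. */
    atomic_compare_exchange_strong(target, &expected, value);
    return expected;
}
```

A common use of this pattern is a simple try-lock: a call such as model_long_cswap(&lock, 0, 1) returns 0 only for the caller that found the lock free and took it, and returns 1 for everyone else.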
89. t process has at least one thread.

virtual memory   A feature provided by the operating system, in conjunction with the MMU, that provides each process with a private address space that may be larger than the amount of physical memory accessible to the CPU.
virtual process  A (possibly multi-threaded) component of a parallel program executing on a node.
word             A 32-bit value.

Index

B  barrier
D  documentation (feedback, online)
M  my_pe
N  num_pes
S  shmem_barrier, shmem_barrier_all, shmem_broadcast, shmem_broadcast32, shmem_broadcast64, shmem_clear_cache_inv, shmem_set_cache_inv, shmem_set_cache_line_inv, shmem_udcflush, shmem_udcflush_line, shmem_collect, shmem_collect32, shmem_collect64, shmem_double_g, shmem_double_get, shmem_double_iget, shmem_double_iput, shmem_double_max_to_all, shmem_double_min_to_all, shmem_double_p, shmem_double_prod_to_all, shmem_double_put, shmem_double_sum_to_all, shmem_double_swap, shmem_fcollect, shmem_fcollect32, shmem_fcollect64, shmem_fence, shmem_float_g, shmem_float_get, shmem_float_iget, shmem_float_iput, shmem_float_max_to_all, shmem_float_min_to_all, shmem_float_p, shmem_float_prod_to_all, shmem_float_put, shmem_float_sum_to_all, shmem_float_swap, shmem_get
90. tatistics generated during each set of repetitions. Unless printing is enabled for all processes with the -e option, only the odd-numbered processes have their statistics displayed.

3.10 Program Listing

This section shows the program in its entirety.

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/time.h>
    #include <shmem.h>

    /* parse a number with an optional k/K (x1024) or m/M (x1048576) suffix */
    int getSize(char *str)
    {
        int size;
        char mod[32];

        switch (sscanf(str, "%d%1[mMkK]", &size, mod)) {
        case 1:
            return (size);
        case 2:
            switch (*mod) {
            case 'm':
            case 'M':
                return (size << 20);
            case 'k':
            case 'K':
                return (size << 10);
            default:
                return (size);
            }
        default:
            return (-1);
        }
    }

    /* current time in microseconds */
    double gettime()
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return (tv.tv_sec * 1000000 + tv.tv_usec);
    }

    /* elapsed time between two samples */
    double dt(double *tv1, double *tv2)
    {
        return (*tv1 - *tv2);
    }

    void usage(char *name)
    {
        fprintf(stderr, "Usage: %s [flags] nwords [maxWords] [incWords]\n", name);
        fprintf(stderr, "       %s -h\n", name);
        exit(1);
    }

    void help(char *name)
    {
        printf("Usage: %s [flags] nwords [maxWords] [incWords]\n", name);
        printf("\n");
        printf("   Flags may be any of\n");
        printf("      -n number    repetitions to time\n");
        printf("      -e           everyone print timing info\n");
        printf("      -h           print this info\n");
91. te       Author  Remarks
Jan 2001   DR      First public draft
June 2001          Corrections for Linux release 1 2

Contents

1 Preface
  1.1 Scope of Manual
  1.2 Audience
  1.3 Using this Manual
  1.4 Related Information
  1.5 Location of Online Documentation
  1.6 Reader's Comments
  1.7 Conventions
2 The Shmem Library
  2.1 Introduction
  2.2 Compiling
  2.3 Using the Shmem Library
    2.3.1 Word Lengths
  2.4 Library Function Categories
  2.5 Initialisation
    my_pe(3)
    num_pes(3)
    shmem_init(3)
  2.6 Remote Write Operations
    shmem_double_p(3)
    shmem_float_p(3)
    shmem_int_p(3)
    shmem_long_p(3)
    shmem_short_p(3)
    shmem_put(3)
    shmem_double_put(3)
    shmem_float_put(3)
    shmem_int_put(3)
    shmem_long_put(3)
    shmem_longdouble_put(3)
92. the SYNOPSIS section.
source        A symmetric array of length nreduce that contains one element for each separate reduction operation. The source argument must have the same data type as target.
nreduce       The number of elements in the target and source arrays.
PE_start      The lowest virtual PE number of the active set of PEs.
logPE_stride  The log (base 2) of the stride between consecutive virtual PE numbers in the active set.
PE_size       The number of PEs in the active set.
pWrk          A symmetric work array. The pWrk argument must have the same data type as target. In C/C++, this contains max(nreduce/2 + 1, _SHMEM_REDUCE_MIN_WRKDATA_SIZE) elements.
pSync         A symmetric work array. In C/C++, pSync must be of type long and size _SHMEM_REDUCE_SYNC_SIZE. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the reduction routine.

DESCRIPTION
The shared memory reduction routines compute one or more reductions across symmetric arrays on multiple virtual PEs. A reduction performs an associative binary operation across a set of values. The nreduce argument determines the number of separate reductions to perform. The source array on all PEs in the active set provides one element for each reduction. The results of the reductions are placed in the target array on all PEs in the active set. The active se
93. the number of processing elements (PEs) belonging to a shared memory application and the PE identifier.

Remote Write Operations
The Shmem library offers a large number of functions to perform remote write operations (put operations, Section 2.6). Using these functions, a processing element is able to transfer a remotely accessible data object to a remote PE.

Remote Read Operations
The Shmem library offers a large number of functions to perform remote read operations (get operations, Section 2.7). Using these functions, a processing element is able to transfer a remotely accessible data object from a remote PE.

Synchronisation Operations
The library supplies a set of functions providing synchronisation (Section 2.8) among the processing elements participating in a parallel computation. In particular, two types of synchronisation are supported: one expresses a barrier among a group of PEs, and the other notifies a PE when a local variable has been modified by a remote PE.

Atomic Memory Operations
The Shmem library supplies programmers with a set of functions allowing atomic operations on shared variables (Section 2.9). An atomic memory operation is an atomic (i.e. uninterruptible) read-and-update operation on a remote data object. The value read is guaranteed to be the value of the data object just prior to the update. A wide range of atomic operations is supported, such as swap, add, fetch-and-increment and fetch
94. the source array. The stride is scaled by the element size of the source array. A value of 1 indicates contiguous data.

Number of elements in the target and source arrays.

The number of the remote PE on which source resides.

These routines provide the means for copying a strided array from a remote PE to a local strided array. The routines return when the data has been copied into the local target array.

The function shmem_iget reads a strided array, where each element is any non-character type that has a storage size equal to 64 bits, from the remote PE.
The function shmem_double_iget reads a strided array of type double from the remote PE.
The function shmem_float_iget reads a strided array of type float from the remote PE.
The function shmem_iget32 reads a strided array, where each element is any non-character type that has a storage size equal to 32 bits, from the remote PE.
The function shmem_iget64 reads a strided array, where each element is any non-character type that has a storage size equal to 64 bits, from the remote PE.
The function shmem_iget128 reads a strided array, where each element is any non-character type that has a storage size equal to 128 bits, from the remote PE.
The function shmem_int_iget reads a strided array of type integer from the remote PE.
The function shmem_long_iget reads a strided array of type long from the remote PE.
The function shmem_longdouble_iget reads a strided array of type long double
95. to_all(3)
    shmem_long_xor_to_all(3)
    shmem_longlong_xor_to_all(3)
    shmem_short_xor_to_all(3)
  2.11 Collective Communication
    shmem_broadcast(3)
    shmem_broadcast32(3)
    shmem_broadcast64(3)
    shmem_collect(3)
    shmem_collect32(3)
    shmem_collect64(3)
    shmem_fcollect(3)
    shmem_fcollect32(3)
    shmem_fcollect64(3)
  2.12 Address Manipulation
  2.13 Control Data Cache
    shmem_clear_cache_inv(3)
    shmem_set_cache_inv(3)
    shmem_set_cache_line_inv(3)
    shmem_udcflush(3)
    shmem_udcflush_line(3)
3 Programming Examples
  3.1 Introduction
  3.2 The Command Line Interface
  3.3 Program Output
  3.4 Header Files and Variables
  3.5 Argument Checking
  3.6 Initialization
  3.7 Establishing the Peer Group
  3.8 Writing Shared V
96. uments nreduce, PE_start, logPE_stride and PE_size must be equal on all PEs in the active set. The same target and source arrays, and the same pWrk and pSync work arrays, must be passed to all PEs in the active set.

Before any PE calls a reduction routine, you must ensure that the following conditions exist (synchronization via a barrier or some other method is often needed to ensure this):
• The pWrk and pSync arrays on all PEs in the active set are not still in use from a prior call to a collective shared memory routine.
• The target array on all PEs in the active set is ready to accept the results of the reduction.

Upon return from a reduction routine, the following are true for the local PE:
• The target array is updated.
• The values in the pSync array are restored to the original values.

The function shmem_double_sum_to_all performs a reduction applying the sum function to double values distributed across the PEs. The function shmem_float_sum_to_all performs a reduction applying the sum function to float values distributed across the PEs. The function shmem_int_sum_to_all performs a reduction applying the sum function to integer values distributed across the PEs. The function shmem_long_sum_to_all performs a reduction applying the sum function to long values distributed across the PEs. The function shmem_longdouble_sum_to_all perfo
97. ype  Italic (slanted) monospace type denotes some meta text. This is used most often in command or parameter descriptions to show where a textual value is to be substituted.

italic type  Italic (slanted) proportional type is used in the text to introduce new terms. It is also used when referring to labels on graphical elements such as buttons.

Ctrl/x  This symbol indicates that you hold down the Ctrl key while you press another key or mouse button (shown here by x).

Small capital letters indicate an abbreviation (see Glossary).

A cross-reference to a reference page includes the appropriate section number in parentheses.

#  A number sign represents the superuser prompt.
%  A percent sign represents the C shell system prompt.
$  A dollar sign represents the system prompt for the Bourne, Korn and POSIX shells.

2 The Shmem Library

2.1 Introduction

This chapter describes in detail the functions belonging to the Shmem programming library. This library allows users to write parallel applications using a shared memory programming model, in which all the processes can operate on a globally accessible address space. To support this programming model, the Shmem routines supply remote data transfer, work-shared broadcast and reduction, barrier synchronization, and atomic memory operations. Furthermore, the Shmem routines minimize the overhead associated with data passing requests, maximize ban
