
The Parallel C Preprocessor


Contents

1. are provided to fully exploit the notion of team splitting. These are: (1) memory that is private to a processor (private memory); (2) memory that is shared among all processors (shared memory); and (3) memory that is shared among the members of a given team or grouping of processors but private to the team (teamprivate memory). Private memory is implemented on the processor that requires access to it. Shared memory is implemented in the interleaved shared memory facility (see Section 7) of the BBN TC2000. There are many situations where a user would like to have static data that is not shared by all processors but is shared by all of the processors within a given team. This type of data is declared using the storage class modifier teamprivate. Teamprivate memory is allocated as an array in the interleaved shared memory, indexed by a team descriptor that is unique to a given team. More details on the team descriptor are given in Section 4. By default, all statically allocated data is shared and thus accessible by all of the processors. Stack or auto data is private to a processor and is stored in a processor's local memory if it exists. The choice of default for statically allocated data is a holdover from the days of running PCP codes on shared memory machines such as those manufactured by Sequent or Alliant, and is not that appropriate for a scalable machine such as the BBN TC2000. Fortunately, the default can be switched to
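To make the three storage classes concrete, here is a minimal sketch of how they might appear in PCP source. Only the teamprivate modifier is taken from the text above; the variable names and sizes are illustrative assumptions.

    /* Sketch of PCP data declarations (names and sizes are illustrative). */
    double grid[4096];            /* statically allocated: shared by default      */
    teamprivate double team_tmp;  /* one copy per team, held in interleaved       */
                                  /* shared memory indexed by the team descriptor */

    void work(void)
    {
        double local_sum = 0.0;   /* auto (stack) data: private to the processor, */
                                  /* kept in local memory where it exists         */
        /* ... */
    }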
2. careful about the possibility of generating adverse impact on available memory bandwidth through the introduction of a hot spot.

6. PCP SYNTAX

A well-structured C code requires very few changes to make effective use of parallelism. While this typically does not take into account data locality issues, simply getting a program running correctly with the split-join programming model is usually not a difficult task. This has been our experience with PCP on systems providing a monolithic shared memory. On systems with a hierarchical memory structure, such as the BBN TC2000, further work optimizing the code to exploit data locality is required to get the expected performance out of the hardware. In this section we describe the lightweight parallel execution mechanisms of PCP. Under the assumption that nonlocal memory references are expensive relative to local references, PCP control constructs generate fast inline code involving only local references. The typical control construct requires only a few local memory references for execution. The absence of overhead in this area has allowed programmers to exploit parallelism in sections of their applications that were previously deemed not heavy enough to amortize the overhead. In addition, the control structures contain no implicit barrier synchronization. If there is no data dependency explicitly involved with the construct, processors are free to run ahead in the execution of subseq
3. loosely follow each other through the code is referred to as a team of processors. In a serial program, a single processor enters the main routine and executes code until it either returns from main, calls exit, or encounters an exception. In the PCP model we generalize this in a natural way, having a team of processors enter main and having the job end when any team member returns from main, calls exit, or encounters an exception. A team of processors consists of a team master and zero or more other processors, which travel through the code almost in unison. The split-join parallel programming model is very similar to Harry Jordan's Force [4] and the IBM SPMD [5] programming model, the most significant difference being the support for team splitting and the arbitrary nesting of concurrency constructs. The PCP concept of team splitting allows an arbitrary subdivision of the team of processors executing the code and allows each subteam to execute arbitrarily different codes within the constraint of block structure. This is a flexible extension of the SPMD programming model that supports the exploitation of nested concurrency for both subroutines and nested loops. The PCP concept of teams is dynamic and well suited for massively parallel machines. The user synchronizes the members of a team, designates tasks for individual team members, and splits up the team into smaller sized teams to execute logically di
4. programmer. No attempt is made to automatically parallelize serial code, and in particular data dependence analysis is entirely up to the user. If users write code that allows a race condition, they are likely to get bitten by it sooner or later. Adding data dependence analysis to aid the user in spotting the rather trivial race conditions that often bite would be useful, and is most certainly technically possible, but would require a compiler development that is entirely beyond the scope of the effort to date. PCP in its current form only offers easy management of data structures in private and interleaved shared memory. Users must allocate and manage data structures in block shared memory by hand, and as reported this is profitable but very tedious [14]. Data distribution models supporting data parallel computation, which have recently been a very popular research area [16], provide a means of exploiting data structures allocated in block shared memory. We are developing such a programming model as a high level layer on top of our Fortran implementation of the split-join programming model, PFP. This will hopefully provide the ease of programming that Fortran 90 offers for array operations while maintaining an efficient escape to the PFP split-join programming model for everything else. [Footnote: The largest BBN TC2000 delivered to date is the 128-processor machine at Lawrence Livermore National Laboratory.] The resulting Par
5. rier (barrier). A barrier requires all members of the team to arrive at the barrier before any are allowed to continue. Each team has its own distinct barrier. A barrier is often used after a master block or a forall loop to ensure that the preceding work is complete before any processor is allowed to continue. A fast algorithm that has no hot spots or critical regions has been implemented for PCP run-time support.

6.4 lock/unlock

Concurrency must be inhibited in a statement that modifies a variable that many processors are accessing for both reading and writing. To prevent processors from destructively interfering with each other, entrance to a critical section of a code must be restricted so that only one processor may execute it at a time. This is accomplished by using a lock. PCP offers spin-wait locks that are implemented by variables of the lock data type, which has the two states locked and unlocked. A lock variable is a statically allocated and initialized PCP data type:

    lock var = unlocked;

The functions that change the state of a lock are lock() and unlock(), which take the pointer to the lock variable as an argument. lock(), when passed a pointer to a variable of the data type lock, waits until the lock is unlocked and then atomically sets it to locked. unlock(), when passed a pointer to a lock, sets it to unlocked. A lock is used to protect a critical section in the following way:

    lock(&var);
    <
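As a concrete illustration of the lock type and the lock()/unlock() calls just introduced, here is a minimal sketch; the variable names and the shared counter are illustrative assumptions, not taken from the paper.

    /* Sketch: protecting a shared accumulator with a PCP spin-wait lock.
       'hits' is statically allocated and therefore shared by default;
       'histlock' uses the lock type and initializer shown above.          */
    lock histlock = unlocked;
    int  hits = 0;

    void record_hit(void)
    {
        lock(&histlock);      /* wait until the lock is free, then take it */
        hits = hits + 1;      /* critical section: one processor at a time */
        unlock(&histlock);    /* release the lock for the next contender   */
    }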
6. the split blocks are to compute some results required by all the members of the parent team, but for which the use of the top level shared memory would pose an access hazard due to nested use of team splitting in a reentrant way. We have not given the syntax for this here, as the notion of accessing the teamprivate data of a parent team is still undergoing exploratory use. We expect changes in syntax and functionality as we learn from the experience of users. [Footnote: Hot spots are shared memory locations for which many processors are contending.] The sophisticated user will find the team index and team size useful when a custom scheduling algorithm for a segment of code is desired. By customizing the scheduling algorithms for certain tasks, the need for barrier synchronization can often be reduced, with an attendant increase in efficiency. Programmers must be careful to design the custom algorithm so that it will work properly for any team size. The more aggressive programmer could also manipulate the team state values, perhaps creating custom splitting algorithms. Simple heuristic techniques are used for team splitting, and sometimes a customized heuristic is warranted for special circumstances. Programmers must be careful to preserve the integrity of the team state. The correct functioning of nearly all the parallel language constructs is dependent on the team state being consistent on each processor.

5. SYNCHRONIZATION

Barrier syn
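The following sketch shows the kind of custom scheduling the team index and team size make possible. The block-partitioning arithmetic is ours, and the spelling of the team-state names follows the TINDEX and TSIZE of Section 4; it is not a PCP-defined construct.

    /* Sketch: a hand-rolled block schedule over a team using the team state.
       Each member takes one contiguous block of iterations, so no barrier is
       needed while the blocks are independent.                              */
    void scale(double *x, int n, double a)
    {
        int i, lo, hi, chunk;
        chunk = (n + TSIZE - 1) / TSIZE;   /* ceiling(n / team size)       */
        lo = TINDEX * chunk;               /* this member's first index    */
        hi = lo + chunk;
        if (hi > n) hi = n;
        for (i = lo; i < hi; i++)
            x[i] = a * x[i];
    }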
7. the weights of the next iteration appropriately. This second strategy would be effective if the relative load of two split tasks varied slowly with the iteration index. Splitting the team into smaller subteams to exploit nested concurrency is counterintuitive. The goal, however, in executing nested concurrency is to use a fixed number of processors more efficiently, not to use more processors. The split-join programming model is in some sense the dual of the fork-join model. One finds that one can usually accomplish the task at hand with either programming model. The advantage of the split-join model is its bottleneck-free implementation through a highly portable preprocessor. As will be seen in Section 4, the process of team splitting is accomplished in a few instructions, independent of the size of the team, which only access local memory. Teams may be split both statically and dynamically. Static team splitting is specified using the keywords split and and. As an example, consider the routines foo and bar, which perform an equal amount of work and can be called concurrently. In the PCP model the concurrent execution of the two routines is arranged through the syntax

    split
        foo();
    and
        bar();

where the amount of work performed by foo and bar is assumed to take the same amount of time. The team encountering the split divides into two subteams of equal size. Th
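The static split is not limited to two blocks; as noted in Section 6.5, and blocks can be concatenated. Here is a hedged sketch with three invented, independent routines; weights could be added as described in Section 6.5 if the blocks do unequal amounts of work.

    /* Sketch: a three-way static split (routine names are invented). */
    extern void build_grid(void), read_boundary(void), init_tables(void);

    void setup(void)
    {
        split
            build_grid();
        and
            read_boundary();
        and
            init_tables();
    }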
8. 61.

[7] D. M. Dias and M. Kumar, "Preventing congestion in multistage networks in the presence of hot spots," Proc. of the 1989 International Conference on Parallel Processing, University Park and London: The Pennsylvania State University Press, August 1989, pp. I-9–I-13.

[8] D. Hensgen, R. Finkel, and U. Manber, "Two algorithms for barrier synchronization," Int. J. Parallel Programming, vol. 17, no. 1, pp. 1–17, 1988.

[9] B. Gorda, K. Warren, and E. D. Brooks III, Programming in PCP, UCRL-MA-107029, Lawrence Livermore National Laboratory, Livermore, CA, 1991.

[10] T. E. Anderson, "The performance of spin lock alternatives for shared memory multiprocessors," IEEE Trans. Parallel and Distributed Syst., vol. 1, no. 1, pp. 6–16, January 1990.

[11] Inside the TC2000, BBN Advanced Computers Inc., Cambridge, MA, 1989.

[12] H. J. Siegel, Interconnection Networks for Large Scale Parallel Processing, 2nd edition, New York: McGraw-Hill, 1990, pp. 113–174.

[13] L. H. Yang, E. D. Brooks III, and J. Belak, A Linked Cell Domain Decomposition Method for Molecular Dynamics Simulation on a Scalable Multiprocessor, UCRL-JC-109752, Lawrence Livermore National Laboratory, Livermore, CA, 1992 (submitted to Scientific Programming).

[14] S. Picano, E. D. Brooks III, and J. E. Hoag, Assessing Programming Costs of Explicit Memory Localization on a Large Scale Shared Memory Multiprocesso
9. The Parallel C Preprocessor

EUGENE D. BROOKS III, BRENT C. GORDA, AND KAREN H. WARREN
Massively Parallel Computing Initiative, Lawrence Livermore National Laboratory, Livermore, CA 94550

ABSTRACT

We describe a parallel extension of the C programming language designed for multiprocessors that provide a facility for sharing memory between processors. The programming model was initially developed on conventional shared memory machines with small processor counts, such as the Sequent Balance and Alliant FX/8, but has more recently been used on a scalable massively parallel machine, the BBN TC2000. The programming model is split-join rather than fork-join. Concurrency is exploited to use a fixed number of processors more efficiently rather than to exploit more processors as in the fork-join model. Team splitting, a mechanism to split the team of processors executing a code into subteams to handle parallel subtasks, is used to provide an efficient mechanism to exploit nested concurrency. We have found the split-join programming model to have an inherent implementation advantage compared to the fork-join model when the number of processors in a machine becomes large. © 1992 by John Wiley & Sons, Inc.

1. INTRODUCTION

Shared memory multiprocessors, wherein a small number of processors access a common monolithic system memory, have been around for a long time. The frequently used parallel programming model on these systems is the fork-j
10. ached accesses cause the processor to stall until the request is satisfied. Because of the rather large performance penalty for accesses to interleaved shared memory compared to cache hits, users are required to expend a considerable amount of effort in localizing data structures if they are to achieve a substantial fraction of the performance potential of the machine. Using the current PCP facilities, users have typically managed the interleaved and local memories [13] to accomplish this. The use of block shared memory has usually been implicit but critical, having been hidden in the provided synchronization constructs. Some users have gone further by using block shared memory directly for their own data structures, but have found doing this to be tedious.

8. DISCUSSION

PCP was originally developed on shared memory systems with small processor counts to solve the code portability problem for parallel applications written in C for this class of machines. During this period, which in fact has not yet ended, each vendor of a parallel machine developed and marketed its own parallel execution environment, which was incompatible with those of other vendors. The implementation of PCP as a preprocessor with a minimal amount of machine-dependent run-time support was critical to being successful in achieving this goal. The split-join paradigm, an extension of the simple SPMD model [5], allows nested concurrency to be exploite
11. allel Data Distribution Preprocessor (PDDP), is expected to be useful on the BBN TC2000 and will likely be even more useful on future machines of this general architectural class employing substantially faster processors. The latency of communication networks is not improving at the same rate as the speeds of microprocessors, and this will make the successful exploitation of data locality even more important in the future.

REFERENCES

[1] TC2000 Fortran Reference, BBN Advanced Computers Inc., Cambridge, MA, 1991.

[2] E. D. Brooks III, PCP: A Parallel Extension of C that is 99% Fat Free, UCRL-99673, Lawrence Livermore National Laboratory, Livermore, CA, 1988.

[3] K. H. Warren, B. Gorda, and E. D. Brooks III, Programming in PFP, UCRL-MA-107028, Lawrence Livermore National Laboratory, Livermore, CA, 1991.

[4] H. F. Jordan, "The force: A highly portable parallel programming language," Proc. of the International Conference on Parallel Processing, University Park and London: The Pennsylvania State University Press, August 1989, pp. II-112–II-117.

[5] F. Darema, D. A. George, V. A. Norton, and G. F. Pfister, "A single-program-multiple-data computational model for EPEX/FORTRAN," Parallel Comput., vol. 7, no. 1, pp. 11–24, April 1988.

[6] K. H. Warren and E. D. Brooks III, "Gauss elimination: A case study on parallel machines," Proc. of Compcon 91, Los Alamitos, CA: IEEE Computer Society Press, February 1991, pp. 57–
12. cessors access this memory through the interconnection network and face latency and contention problems in order to gain access. Machines with this complex memory structure have the advantage of being scalable to very high processor counts, but the programmer pays for this advantage with a much more demanding coding effort being required to obtain a reasonable fraction of the performance capability offered by the architecture. On the BBN TC2000, accesses through the network have priority over local memory references. We have exploited this feature in the supplied PCP synchronization constructs by arranging that a processor that is busy polling does so in a memory location residing in the on-card block shared memory. This has been done in the barrier synchronization algorithm, the most heavily used synchronization operation. Doing this is a simple matter of allocating the barrier data structure used to control the progress of the processors in appropriate segments of block shared memory. For locks, the data structure representing the state of the lock resides in interleaved shared memory, but a linked list of pointers to locations in block shared memory can be used to control processors that are waiting for the lock. This is referred to as a queueing lock [10]. Using a lock data structure of this sort, users again arrange that busy polling is done in the on-card block shared memory at low priority o
13. chronization and the notion of locks are provided in the PCP implementations of the split-join programming model. In barrier synchronization, all of the processors in a given team are forced to wait at the barrier until the last processor arrives. A bottleneck-free software implementation [8] is used, requiring 30 to 40 microseconds to synchronize 32 processors. The execution time of the barrier scales as the log of the processor count. Each team has its own unique barrier. A lock is used to provide for critical region access to data. A processor attempting to acquire a lock spin-waits until the lock is unlocked and then indivisibly locks it. When the processor unlocks the lock, it is available for others immediately. Locks may be declared by the user in either shared memory or teamprivate memory. When in shared memory, the lock is visible to all the processors regardless of the team to which they belong. Locks declared in teamprivate memory are visible only within a team. In addition to the use of barriers and locks, the user may implement event notifications by simply spin-waiting on a location in shared or teamprivate memory to change. On a machine supporting coherent shared memory caches this is particularly effective and has no negative impact. [Footnote: It was in fact through such activity that the notion of team splitting was invented initially.] If the machine lacks this support, as is the case for the BBN TC2000, users must be
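The following is a minimal sketch of the spin-wait event notification described above, using the master construct of Section 6.1. The names are invented; the volatile qualifier is our own precaution against the compiler caching the flag and is not prescribed by the paper.

    /* Sketch: one-shot event notification by spin-waiting on a shared flag.
       'ready' is statically allocated and therefore shared by default.      */
    volatile int ready = 0;
    double answer;

    void wait_for_answer(void)
    {
        master {                    /* the team master produces the value ... */
            answer = 42.0;          /* (stand-in for real work)               */
            ready = 1;              /* ... then raises the event flag         */
        }
        while (!ready)              /* everyone else spin-waits for the event */
            ;                       /* see Section 7 for hot-spot caveats     */
    }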
14. critical section>
    unlock(&var);

The lock variable may be declared as either shared or teamprivate. If the lock variable is shared, then the critical section is global. If the lock is declared teamprivate, then the critical section is local to the team.

6.5 split

To divide a number of tasks which is known at compile time among subteams that are split from the parent, static team splitting is used:

    split <weight1>
        <task1>
    and <weight2>
        <task2>
    ...
    and <weightn>
        <taskn>

The tasks may be executed in any order, including sequentially, if the team encountering the split statement cannot be split for some reason. If one task contains more work than another, weights may be assigned to the blocks of work to achieve some measure of load balancing. The weights determine the fraction of the current team's processors that are split into each subteam.

6.6 splitall

The dynamic version of team splitting is the splitall loop:

    splitall (int i = <start>; <cond>; i += <step>; nteams; tsize) {
        <work dependent on the index i>
    }

When a team encounters a splitall loop, it dissociates into subteams, to which the indices of the loop are interleaved. The number and size of the subteams may be determined by the optional integer expressions nteams (for specifying the desired number of teams) and tsize (for the desired size of teams), or by compile-
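A hedged sketch of the splitall control parameters just described: the placement of nteams and tsize follows the header as reconstructed above (the exact delimiters may differ in real PCP), and the routine name and constants are invented.

    /* Sketch: ask for four subteams of eight processors each, if the
       run-time processor count allows it; otherwise the implementation
       chooses, giving priority to the nteams request (Section 7 text).   */
    extern void relax_block(int);

    void relax_all(int nblocks)
    {
        splitall (int i = 0; i < nblocks; i += 1; 4; 8) {
            relax_block(i);        /* each index is one independent block */
        }
    }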
15. ct teams could not have independent barrier operations. PCP quickly calculates the new team descriptor for team splits in local memory by shifting the current team descriptor left n times, where n is the log base 2 of the number of new subteams, and or-ing in an integer from 0 to the number of subteams minus 1. Teamprivate memory is not initialized. As team descriptors are reused, initializations of teamprivate data are almost guaranteed to have been corrupted. The team descriptor and team size are identical for all of the members of a given team, but each team member carries its own private copy in order to prevent hot spots [7] as the team state is accessed. Team splitting is handled in a block-structured way. Each time a processor becomes a member of a new subteam, it computes a new team descriptor and its position in the new team without accessing any shared memory or synchronization resources. This leads to an efficient, bottleneck-free implementation of team splitting, the cost of which is completely independent of the number of processors in the team. As the processor computes a new team descriptor, it pushes the old one onto a private stack for recovery when it reaches the end of its share of the work in the split block. Since a processor carries the team descriptors of all its antecedent teams on a stack, it has access to the teamprivate memory of a parent team. This can be very useful in a situation where the tasks in
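The descriptor arithmetic described above can be written out as a few lines of plain C. This is our own sketch of the calculation, not PCP-generated code; the function name and the power-of-two assumption on the number of subteams are ours.

    /* Sketch: new team descriptor for subteam k of nsub (nsub a power of 2). */
    unsigned new_descriptor(unsigned parent_desc, unsigned nsub, unsigned k)
    {
        unsigned n = 0;
        while ((1u << n) < nsub)          /* n = log2(nsub)                  */
            n++;
        return (parent_desc << n) | k;    /* k is in the range 0 .. nsub - 1 */
    }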
16. d while still preserving the simple preprocessor implementation and high efficiency. In the early days of using PCP on a 12-processor Sequent Balance or on an 8-processor Alliant FX/8, users did not have to concern themselves with exploiting data locality. Team splitting was used to improve the concurrency available and was used to reduce overhead by reducing the cost of a barrier within a team. PCP was used to obtain rather efficient execution on shared memory multiprocessors [15], but users had to be very careful about the quantization error involved when splitting a team of a few processors in size, and this tended to restrict the use of team splitting to the trivial case of binary splits of an even number of processors for two equally sized subtasks. With the BBN TC2000 we have up to 128 processors available, and the quantization error for team splitting is an order of magnitude smaller. This allows much more accurate matching of the team size to the work involved for the subtasks. The sheer number of processors available begins to support deeply nested team splitting. With the earlier shared memory systems possessing only a few processors, a programmer could essentially enumerate the task for each processor individually and keep track of them. With hundreds or more processors, the programmer ceases to think of individual processors and begins to think of teams as abstract aggregates of processors. PCP provides no frills for the
17. d with a split-join programming model. This is indeed the case, and on the BBN TC2000 we have developed a Fortran implementation, the Parallel Fortran Preprocessor (PFP). We discuss only the C syntax for the split-join programming model for the sake of brevity; see Warren et al. [3] if you are interested in the Fortran syntax. PCP was originally written to solve the problem of the portability of C language based parallel programs among several different shared memory multiprocessors. The PCP preprocessor is machine independent; the amount of machine-dependent run-time support is small (200 lines of C on the BBN TC2000) and can be easily implemented with fast inline code. PCP has been used successfully on the Alliant, Sequent, Cray, SGI, and Stellar machines. The sections of this paper are as follows. The split-join model and its memory model are described in Sections 2 and 3. Details on how the implementations of these models take advantage of the architecture are included. The PCP team state is discussed in Section 4. The synchronization primitives offered in PCP are discussed in Section 5. The actual PCP syntax is shown in Section 6. In Section 7 we discuss the issues of implementing and using PCP on a scalable machine such as the BBN TC2000. A general discussion follows in Section 8.

2. THE SPLIT-JOIN MODEL

In the traditional fork-join parallel programming model, a single processor starts the execution of the program a
18. e first subteam calls foo while the second calls bar concurrently, and the two subteams join again at the end of the and block. The tasks performed in foo and bar must be independent. If there is only one processor in the team that encounters the split, or if an implementation limit for team splitting has been reached, the encountering team calls foo and then bar sequentially, in an implementation-dependent order. For this reason one must view the execution of foo and bar to be completely asynchronous. The algorithms implemented by foo and bar should not be designed in a way that requires simultaneous progress in both routines. This could be done, but the code would not be portable to multiprocessors for which the team size entering the split block is unity. In addition to the static splitting described above, wherein the amount of work performed in each block of code is assumed to be the same, weights may be assigned to the blocks of work. If the user has provided accurate loading information via the weight parameters for team splitting (see Section 6.5) that determine the subteam sizes, the processors in the subteams finish their work and rejoin to become the parent team nearly simultaneously. The total number of processors is conserved in the team splitting process. As an example of team splitting with weights, consider the case of two concurrent linear system solves that are for potentially differin
19. g dimensions. For a description of the Gauss elimination implementation in this model, see Warren et al. [6].

    int dim1, dim2;
    double *matrix1, *matrix2;
    double *rhs1, *rhs2;

    split dim1*dim1*dim1
        dgauss(matrix1, rhs1, dim1);
    and dim2*dim2*dim2
        dgauss(matrix2, rhs2, dim2);

Here the routine dgauss performs a linear system solve, leaving the result in the vector rhs provided as an argument. The operations required scale like the cube of the dimension, and this is noted by the weight expressions included on the split and and lines. The weight expressions are used to compute the sizes of the two subteams, dividing the parent team into two subteams having relative sizes that match the ratio of weights as closely as possible. By specifying weights for the concurrent blocks of work, some measure of load balancing can be achieved, subject to the algebraic restrictions caused by the fact that the number of processors is finite. Split constructs are not limited to the static binary form shown above. Static splits may have more parallel blocks, specified by concatenating and blocks. Team splitting may also be treated dynamically. The construct

    splitall (int i = 0; i < imax; i += 1) {
        <work dependent on the index i>
    }

specifies that the body of the loop is to be executed with the indices i = 0, 1, ..., imax - 1, parceling out the indices to a collection of subteams that are split off from the team tha
20. hese two values are set by the run-time system before main is entered, and under no conditions should they be changed. In implementations where the target multiprocessor does not support local memory directly, the processor index is used to index an array to simulate local memory. To support the concept of team splitting, three more variables were added to the set carried along by the processors, and the name team state was coined. Unlike the processor index and the number of processors, these values are read-write and are manipulated by the language constructs of PCP:

1. TSIZE, that is, the team size. If no team splitting has occurred, this will be equal to _NPROCS.
2. TINDEX, that is, the index of the member within the team. It must have a unique value within the team, in the range 0 to TSIZE - 1. The team master is the processor with a team index of zero.
3. TDESC, that is, the team descriptor, a nonnegative value unique to the team.

Teams are a dynamic association of physical processors, and because of this, teams can be created and destroyed as these associations change during split-join operations. The team descriptor is a small positive integer used to identify the team and provide for teamprivate memory, which is private to the team but shared among the team members. Arrays in shared memory are indexed by this value to simulate the notion of privacy. Without the team descriptor, distin
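A minimal sketch of the descriptor-indexed-array idea mentioned above. The array bound, the names, and the use of a single shared lock are illustrative assumptions; real team-private data would normally be declared with the teamprivate storage class modifier instead.

    /* Sketch: simulating team-private data with a shared array indexed by
       the team descriptor.  MAXTEAMS is an assumed bound on descriptors.   */
    #define MAXTEAMS 256
    double team_sum[MAXTEAMS];          /* statically allocated: shared     */
    lock   sum_lock = unlocked;         /* PCP lock type, Section 6.4       */

    void add_contribution(double x)
    {
        lock(&sum_lock);                /* protect the read-modify-write    */
        team_sum[TDESC] += x;           /* slot selected by team descriptor */
        unlock(&sum_lock);
    }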
21. memory are likely to be an order of magnitude short of the performance required to keep a processor running at full speed. The problem with the fork-join model is that the process of acquiring and relinquishing processors requires highly contended accesses to the shared memory or some specialized hardware mechanism to handle processor dispatch. Although the fork-join programming style is comfortable, the high overhead of its implementation quickly becomes burdensome. An alternative to the fork-join model is the split-join programming model. In the split-join paradigm, all the processors that the code will ever have enter the main program in a live manner at the start of the job. Processors are not fetched from and returned to a pool; doing so would involve accesses to shared memory to implement. PCP [2] is an implementation of the split-join programming model that provides a good fit to massively parallel machines offering a shared memory facility in addition to the local memory that such machines usually have. Its constructs, though simple, provide powerful and efficient flow control in the form of parallel loops and team splitting. The programming model is a relatively straightforward extension of the conventional C programming model. An arbitrary number of processors executes the code stream, as a single processor executes a serial program. Nested concurrency is exploited through the str
22. n the BBN TC2000, so that the progress of processors accessing this memory through the switch in the course of performing real work is not retarded. We have not implemented queueing locks, as application codes have typically used locks in a way that avoids contention. As if having to deal with local, interleaved shared, and block shared memory is not complicated enough, the performance of the network used to support memory references between processors can be highly dependent on how the run-time communication patterns clash with the topology of the network. If the interconnection network is a 2-D mesh, for instance, a remote memory reference to a neighboring processor is accomplished with a much lower latency and higher bandwidth than a remote memory reference to a random processor in the machine. Fortunately, we have not had to face this problem on the BBN TC2000, as the latency and bandwidth of the interconnection network used in this architecture are rather independent of the source and destination addresses for memory traffic flowing through the network. The BBN TC2000 [11] is a scalable multiprocessor architecture that can support up to 512 Motorola 88100 RISC microprocessors running at 20 MHz. The processors, with their 16-megabyte memories, are interconnected to each other in a PE-to-PE model using a variant of a multistage cube network [12], which BBN refers to as the butterfly switch. The BBN TC2000 supports l
23. nd acquires more processors as concurrency is encountered in the code. The fork-join programming model has been quite useful on tightly coupled shared memory machines with relatively few processors. Some architectures, such as the Alliant FX/8 and the Convex C2, provide special hardware to make the dispatch of slave processors happen as quickly as possible. Scalable machine architectures are not as tightly coupled, and the cost of communication between processors, heavily used in the process of dispatching processors in the fork-join model, is relatively high. The BBN TC2000, described in more detail in Section 7, is a realistic example of what might be expected in this regard. The latency of a cache hit on local memory is 3 clocks, pipelined at a rate of one per clock, whereas the latency of a remote memory reference is roughly 40 clocks, not pipelined. If one must deal with a 40-clock latency for every memory reference required in the code used to dispatch processors, even an efficient spanning tree implementation can have substantial overhead. In the split-join paradigm we deal with the high cost of processor dispatch and communication between processors by minimizing their occurrence in the fundamental constructs of the programming model. All of the processors the job will ever acquire are dispatched at the start of the program and are immediately placed under the control of the programmer. This group of processors, which
25. ocal memory, block shared memory, wherein successive cache lines reside on one processor card, and interleaved shared memory, wherein successive cache lines are placed on successive cards and wrap around the machine. The aggregate bandwidth to interleaved shared memory scales linearly with the number of processor cards contributing to this memory pool. The contribution of each node to the interleaved shared memory pool is made at boot time, set via device registers in the interface to the switch that connects the processors. Any number of processors can be configured to contribute to the interleaved shared memory pool, and it is useful and convenient to set the number of contributing processors to a prime number to avoid bandwidth reductions when accessing arrays in a strided manner. The rest of the memory in each node can be used for either local memory or block shared memory. The division between these two memory types is enforced by the memory management unit attached to the processor and is set at the time an application is run, using operating system calls, in a completely flexible way. On the BBN TC2000, the memory latency for a cache hit is 3 clocks, the latency for a cache miss to the local memory card is 8 clocks, and the latency for a remote memory reference satisfied through the switch is 39 clocks. Cache hits are fully pipelined at a rate of one access per clock. Cache misses or nonc
26. oin model, where one processor starts out executing the serial code and additional processors are acquired when a parallel construct is encountered. Although the fork-join model is serving well in the form of vendor-supplied implementations on machines with a small number of processors, such as the Cray YMP and the Convex C-2 series, fundamental shortcomings in its implementation begin to surface when the number of processors grows. [Footnote: Work performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48. Received February 1992; revised April 1992. © 1992 by John Wiley & Sons, Inc. Scientific Programming, Vol. 1, pp. 79-89 (1992). CCC 1058-9244/92/010079-11 $04.00.] These shortcomings have become apparent through our experiences with the vendor-supplied fork-join extension of Fortran [1] supplied by BBN on their TC2000 multiprocessor. It is no longer possible to build a monolithic shared memory system that can sustain the bandwidth and latency demands of hundreds or thousands of processors when the number of processors is scaled up. Large local memories are introduced in order to provide an acceptable level of memory performance in such systems. Use of the shared memory facility in large parallel systems is best relegated to those times when communication is absolutely required by the application, as the bandwidth and latency of the shared
27. p is the PCP concurrent equivalent of the C language for loop. It achieves a fine-grained parallelism by dividing the passes of the for loop among the members of the team:

    forall (int i = <start>; <cond>; i += <step>) {
        <work dependent on the index i>
    }

The indices of the loop are interleaved among the members of the executing team. The loop index variable must be declared in the forall statement. We have borrowed this syntax from C++ to remind the user that the loop index is not defined after the closing brace of the loop body. [Footnote: In C++ the index i would be defined outside of the loop body. We borrowed the syntax but not the semantics.] The <start> and <step> expressions are currently restricted to simple constants or variables. The <cond> expression is unrestricted and not checked for sanity. forall loops may be nested arbitrarily. The team index inside the loop body is set to 0 and the team size to 1. This makes a team of size 1 for the scope of the loop iteration, enabling any enclosed PCP constructs to work correctly. [Footnote: Amdahl's law states that a program's performance is dominated by its slowest component, typically a serial section.]

6.3 barrier

The team of processors executing the code freely run through it unless explicit synchronization primitives are encountered. One basic and frequently used form of synchronization is the bar
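As a concrete illustration of the forall construct just described, here is a hedged sketch of a parallel vector update. The function and array names are invented, the header follows the reconstruction above, and the spelling of the barrier statement is assumed.

    /* Sketch: a team-parallel axpy using forall.  Indices are interleaved
       over the team; because forall has no implicit barrier, one is added
       before the result is consumed (Section 6.3).                         */
    void axpy(int n, double a, double *x, double *y)
    {
        forall (int i = 0; i < n; i += 1) {
            y[i] = y[i] + a * x[i];
        }
        barrier;       /* assumed spelling of the PCP barrier statement */
    }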
28. private through the use of a compile-time flag.

4. TEAM STATE

The PCP concept of teams is implemented using a small amount of local memory. The team state, which is carried by the processors, is made available to the programmer, who may use it to construct parallel language extensions that are not directly supported. The current features of PCP were not postulated before its first implementation. These features evolved over time as users constructed some of their own using the team state variables. Those features that were found to be generally useful have been standardized and elevated to constructs directly supported by the Parallel C Preprocessor itself. Further evolution of the parallel programming model in this manner will reduce the need to directly use the team state, but the team state will always remain accessible to the programmer, both for backward compatibility and to facilitate the creation of new language constructs that might eventually find their way into PCP. The team state consists of five values that are carried along by the processors. Two of these values are implicitly read-only, not to be touched by user programs. They are:

1. NPROCS, that is, the number of processors that execute the program. It is the size of the team that enters main.
2. IPROC, that is, the processor index. It has a value unique to each executing processor, in the range from 0 to NPROCS - 1. T
29. r, UCRL-JC-109751, Lawrence Livermore National Laboratory, Livermore, CA, 1992; Scientific Programming, vol. 1, no. 1, pp. 65–76, Fall 1992.

[15] E. D. Brooks III, "Effective use of shared memory multiprocessors," Proc. Third International Conference on Supercomputing, May 15-20, 1988, Vol. II, pp. 365–371. International Supercomputing Institute, Inc.

[16] G. Fox et al., Fortran D Language Specification, Rice COMP TR90-141, Rice University, December 1990.
30. stinct code sections. All parallel constructs apply to teams. Working within any team, the programmer has access to the parallel looping constructs, the barrier, locks, and even further team splitting. In the team splitting process the total number of processors remains constant. The processors temporarily become members of new subteams. As this happens, the processors save their old state for restoration at the end of the split construct. As processors within a team complete their work, they then rejoin the parent team with no implicit serialization. Team splitting is used to exploit nested concurrency with a fixed number of processors. In the split construct the user explicitly marks off separate blocks of work that can be executed independently; each block of work may itself be a job consisting of subtasks that can be executed in parallel. The user may also indicate the relative amount of work in each block of code. The PCP split construct takes an optional weight parameter that controls the fraction of processors to send into each team. This form of load balance control is effective if the workload can be accurately predicted at execution time, as is the case for the two concurrent linear system solves below. The team splitting weights, which occur in an iterative loop during execution, might also be corrected by using a real-time clock to detect imbalance in a prior iteration and adjusting
31. t encoun ters the split The actual number of subteams is determined at run time and is possibly influenced by a compile time flag There may not be a split at all or the team may be split to individual proces sors using the best heuristic algorithm that can be conjured up Extra parameters see Section 6 6 inside the splitall header can be used to establish firmer control of the team splitting mechanism To give a trivial application of the splitall loop consider the parallel computation of a set of ma trix vector products double result double matrices double multplend int dim int number splitall int i 0 i lt number i 1 mvprod result i matrices i multplend i dim If a team split would be profitable the team encountering the splitall block is divided into subteams each subteam handling a subset of the indices i The library routine mvprod is de signed for team entry and contains parallel lan guage constructs designed to efficiently exploit the parallelism of each matrix vector product If the team that enters the splitall loop has 100 proces sors and the number and dimension of the matrix vector products is 5 and 20 respectively we see that the use of team splitting will have a substan tial impact on program performance 3 THE MEMORY MODEL PCP allows the user to designate the memory class for all data In the split join programming para digm itself three types of memory
32. time flags to PCP. If the appropriate number of processors is available at run time, the user-supplied directives are followed. Otherwise the number of teams and team size are picked by the implementation, giving priority to the nteams parameter.

7. PCP AND THE BBN TC2000

On conventional shared memory multiprocessors, which provide a single monolithic shared memory system to the programmer, a great deal of effort does not have to be expended to restructure a serial code to exploit data locality in order to closely approach the performance limits of the machine. This situation is characteristic of the shared memory machines manufactured by Cray Research, Alliant, Convex, and Sequent. There is perhaps the slight exception of those systems that employ coherent caches in an attempt to insulate an inadequate shared memory system from the demands of the processors. Massively parallel systems, on the other hand, can have a rather complex structure, which includes local memory, finely interleaved shared memory scattered across the processor boards, and shared memory which all processors can access but which remains on a given processor board as the physical address is incremented. We call this last form of shared memory block shared memory, for lack of a better term. Block shared memory has the characteristic that the processor on the same card as the memory accesses it without going through the interconnection network. The other pro
33. uctured mechanism of team splitting, in which the team of processors divides up into subteams in order to address independent but parallel tasks at a lower level. These subteams are relatively independent of one another and are free to execute different code modules until they rejoin the parent team in a block-structured manner. The PCP programming language places the issues of scheduling, communication, and synchronization directly into the hands of the programmer. It does this in a simple manner by providing useful constructs to handle these complexities. PCP allows the user to specify which portions of the program are to be executed in parallel, which are to be executed by subteams, and which are to be executed by one processor only. PCP allows any number of processors to be allocated to a job, the number of processors being a run-time parameter that is set by the end user of the code when the program is actually executed. Through the use of a compile-time flag, the PCP preprocessor can also produce serial code that does not contain the run-time synchronization overhead required for parallel execution. This feature is often used to provide both a sanity check and a point of performance comparison for parallel executions. The split-join parallel programming paradigm is independent of the target language. Examining the basic programming model, it is quickly concluded that any procedural programming language could be extende
34. uent code. A short summary of the PCP syntax follows. For a more detailed specification, see the PCP user's manual [9]. Anyone interested in the equivalent PFP syntax for Fortran should refer to the PFP user's manual [3].

6.1 master

Within the context of a specific team, that processor whose current team index is 0 executes the code delimited by a master block. Arbitrary PCP constructs may be enclosed by a master block. A master block is often used in the portion of the program that performs initialization, as well as input/output and memory allocation. At a much smaller scale of granularity, master blocks are used to initialize shared data, such as accumulators, which all team members will access:

    master {
        <declarations>
        <executable code>
    }

Note that a master block does not provide a serial critical region in the sense that most people think. If team splitting has occurred, several teams exist, each with its own master and each executing its own task. Multiple teams may be in master blocks concurrently. A race condition could exist within a master block for access to shared data if it is possible for the teams to encounter the block asynchronously. A simple global semaphore is all that would be required to protect the region. Thus master blocks do not necessarily have the Amdahl's law impact that they might otherwise be expected to have. (A sketch of a typical master block appears at the end of this item.)

6.2 forall

The forall loo
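Referring back to the master block described above, here is a minimal sketch of its small-granularity use, initializing a shared accumulator before a parallel loop. The names are invented, and the spelling of the barrier statement is assumed.

    /* Sketch: a master block initializing shared data that the whole team
       then updates.  The barrier keeps other team members from racing
       ahead before the initialization is complete (Section 6.3).           */
    double total = 0.0;               /* statically allocated: shared       */
    lock   total_lock = unlocked;     /* PCP lock type, Section 6.4         */

    void accumulate(double *x, int n)
    {
        master {
            total = 0.0;              /* team master resets the accumulator */
        }
        barrier;                      /* assumed spelling of the barrier    */
        forall (int i = 0; i < n; i += 1) {
            lock(&total_lock);
            total += x[i];            /* critical section per Section 6.4   */
            unlock(&total_lock);
        }
        barrier;                      /* total is complete after this point */
    }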
