
A “Hands-on” Introduction to OpenMP*


Contents

1. Monte Carlo pi with a linear congruential generator (LCG): one thread, versus four threads (trials 1, 2, and 3), versus a thread-safe four-thread version. The thread-safe version gives the same answer each time you run the program, but for a large number of samples its quality is lower than the one-thread result. Why? Random number generators (RNGs) define a cycle of pseudo-random numbers whose length equals the period of the RNG. In a typical problem you grab a subsequence of that cycle; the seed determines the starting point. Grab arbitrary seeds and you may generate overlapping sequences (e.g., three sequences for threads 1, 2, and 3, with the last one wrapping around at the end of the RNG period). Overlapping sequences mean over-sampling and bad statistics: lower quality, or even wrong answers. When multiple threads cooperate to generate and use random numbers, the options are: replicate and pray; give each thread a separate, independent generator; have one thread generate all the numbers; leapfrog (deal out sequence values round robin, as if dealing a deck of cards); or the block method (pick your seeds so each thread gets a distinct contiguous block of the sequence). Other than replicate and pray, these are difficult to implement, so be smart: buy a math library that does it right. Intel's Math Kernel Library supports all of these methods.
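To make the block method concrete, here is a minimal sketch in C (my illustration, not a slide from the tutorial). It assumes the tutorial's LCG constants (MULTIPLIER = 1366, ADDEND = 150889, PMOD = 714025) and simply steps each thread's state forward to the start of its block; a production version would use a logarithmic-time jump-ahead rather than this linear skip.

   #include <stdio.h>
   #include <omp.h>

   static const long long MULTIPLIER = 1366;
   static const long long ADDEND     = 150889;
   static const long long PMOD       = 714025;

   /* Advance the LCG state by one step. */
   static long long lcg_step(long long x)
   {
       return (MULTIPLIER * x + ADDEND) % PMOD;
   }

   int main(void)
   {
       const long long block = 100000;       /* samples reserved per thread (assumed) */
       #pragma omp parallel
       {
           int id = omp_get_thread_num();
           long long x = 0;                  /* every thread starts from the same seed... */
           for (long long k = 0; k < (long long)id * block; k++)
               x = lcg_step(x);              /* ...then skips to the start of its own block */
           x = lcg_step(x);
           printf("thread %d, first sample: %f\n", id, (double)x / (double)PMOD);
       }
       return 0;
   }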
2. Broadcasting input values with copyprivate: one thread reads the inputs inside a single construct, and copyprivate copies them into the private variables of the other team members.

   #include <omp.h>
   void input_parameters(int*, int*);   // fetch values of input parameters (by pointer, so they can be set)
   void do_work(int, int);

   void main()
   {
      int Nsize, choice;
      #pragma omp parallel private(Nsize, choice)
      {
         #pragma omp single copyprivate(Nsize, choice)
            input_parameters(&Nsize, &choice);
         do_work(Nsize, choice);
      }
   }

Monte Carlo calculations: sample a problem domain to estimate areas, compute probabilities, find optimal values, etc. Example: computing pi with a digital dart board. Throw darts at a circle inscribed in a square. The chance of a dart falling in the circle is proportional to the ratio of the areas: A_circle = pi*r^2 and A_square = (2r)^2 = 4r^2, so P = A_circle/A_square = pi/4. Compute pi by randomly choosing points and counting the fraction that falls in the circle: N = 10 gives pi = 2.8; N = 100 gives pi = 3.16; N = 1000 gives pi = 3.148.

Exercise 5: we provide three files for this exercise: pi_mc.c (the Monte Carlo pi program), random.c (a simple random number generator), and random.h (the include file for the random number generator). Create a parallel version of this program without changing the interfaces to the functions in random.c. This is an exercise in modular software: why should a user of your parallel random number generator have to know any details of the generator, or make any changes to how the generator is called? Extra credit: make the random number generator threadsafe, and make it numerically correct (non-overlapping sequences of pseudo-random numbers).

Outline: Introduction to OpenMP; Creating Threads; ...
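A minimal sketch of one shape the parallel version can take (my illustration, not the official solution): the generator state is threadprivate behind an unchanged interface, and each thread seeds its own stream ("replicate and pray", the simplest of the options listed on page 1). The LCG constants are the ones used later in the deck; drandom() is a stand-in for whatever random.c exposes.

   #include <stdio.h>
   #include <omp.h>

   /* Minimal stand-in for random.c: an LCG whose state is threadprivate,
      so each thread has its own copy of the stream. */
   static long MULTIPLIER = 1366, ADDEND = 150889, PMOD = 714025;
   static long random_last = 0;
   #pragma omp threadprivate(random_last)

   static double drandom(void)                 /* uniform in [0,1) */
   {
       random_last = (MULTIPLIER * random_last + ADDEND) % PMOD;
       return (double)random_last / (double)PMOD;
   }

   int main(void)
   {
       const long num_trials = 1000000;
       long Ncirc = 0;

       #pragma omp parallel reduction(+:Ncirc)
       {
           random_last = omp_get_thread_num() + 1;   /* "replicate and pray" seeding */
           #pragma omp for
           for (long i = 0; i < num_trials; i++) {
               double x = 2.0 * drandom() - 1.0;     /* point in [-1,1) x [-1,1) */
               double y = 2.0 * drandom() - 1.0;
               if (x * x + y * y <= 1.0) Ncirc++;
           }
       }
       printf("%ld trials, pi = %f\n", num_trials, 4.0 * (double)Ncirc / num_trials);
       return 0;
   }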
3. Exercise roster: Exercise 1: hello world. Exercise 2: simple SPMD pi program. Exercise 3: SPMD pi without false sharing. Exercise 4: loop-level pi. Exercise 5: Monte Carlo pi and random numbers. Exercise 6 (hard): linked lists without tasks. Exercise 6 (easy): matrix multiplication. Exercise 7: producer-consumer. Exercise 8: linked lists with tasks.

Write a multithreaded program where each thread prints "hello world":

   #include <omp.h>                        // the OpenMP include file
   void main()
   {
      #pragma omp parallel                 // parallel region with the default number of threads
      {
         int ID = omp_get_thread_num();    // runtime library function to return a thread ID
         printf(" hello(%d)", ID);
         printf(" world(%d)\n", ID);
      }                                    // end of the parallel region
   }

Sample output (the interleaving varies from run to run): hello(1) hello(0) world(1) world(0) hello(3) hello(2) world(3) world(2).

The most common approach for parallel algorithms is the SPMD, or Single Program Multiple Data, pattern: each thread runs the same program (Single Program), but using the thread ID the threads operate on different data (Multiple Data) or take slightly different paths through the code. In OpenMP this means: a parallel region near the top of the code; pick up the thread ID and the number of threads, and use them to split up loops and select different blocks of data to work on.
4. Result: poor scalability. Solution: when updates to an item are frequent, work with local copies of data instead of an array indexed by the thread ID, and pad arrays so the elements you use are on distinct cache lines.

   #include <omp.h>
   static long num_steps = 100000;
   double step;
   #define NUM_THREADS 2
   void main()
   {
      int nthreads;
      double pi = 0.0;
      step = 1.0/(double) num_steps;
      omp_set_num_threads(NUM_THREADS);
      #pragma omp parallel
      {
         int i, id, nthrds;
         double x, sum;                     // create a scalar local to each thread to accumulate partial sums
         id = omp_get_thread_num();
         nthrds = omp_get_num_threads();
         if (id == 0) nthreads = nthrds;
         for (i = id, sum = 0.0; i < num_steps; i = i + nthrds) {
            x = (i + 0.5) * step;
            sum += 4.0/(1.0 + x*x);         // no array, so no false sharing
         }
         #pragma omp critical               // sum goes out of scope beyond the parallel region, so you must
            pi += sum * step;               // sum it in here; critical protects the summation into pi so updates don't conflict
      }
   }

Pi with a loop and a reduction (for good OpenMP implementations, reduction is more scalable than critical):

   static long num_steps = 100000;
   double step;
   void main()
   {
      int i;
      double x, pi, sum = 0.0;
      step = 1.0/(double) num_steps;
      #pragma omp parallel for private(x) reduction(+:sum)
      for (i = 0; i < num_steps; i++) {
         x = (i + 0.5) * step;
5.       sum = sum + 4.0/(1.0 + x*x);
      }
      pi = step * sum;
   }

The loop index i is private by default, and x is made private explicitly. Note: we created a parallel program without changing any of the serial code, simply by adding four lines.

Random numbers: we use dice to make random numbers: given the previous values, you cannot predict the next value; there are no patterns in the series, and it goes on forever. Computers are deterministic machines: set an initial state, run a sequence of predefined instructions, and you get a deterministic answer. By design, computers are not random and cannot produce random numbers. However, with some very clever programming, we can make pseudo-random numbers that are as random as you need them to be, but only if you are very careful. Why do I care? Random numbers drive the statistical methods used in countless applications, e.g., sampling a large space of alternatives to find statistically good answers (Monte Carlo methods). Sample a problem domain to estimate areas, compute probabilities, find optimal values, etc. Example: computing pi with a digital dart board: throw darts at the circle/square; the chance of a dart falling in the circle is proportional to the ratio of the areas (continued on page 29).
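For reference, here is the loop-level pi program of pages 4 and 5 assembled into one compilable file (a sketch; the timing call is my addition, not part of the slide):

   #include <stdio.h>
   #include <omp.h>

   static long num_steps = 100000;
   double step;

   int main(void)
   {
       int i;
       double x, pi, sum = 0.0;
       step = 1.0 / (double)num_steps;

       double t0 = omp_get_wtime();
       #pragma omp parallel for private(x) reduction(+:sum)
       for (i = 0; i < num_steps; i++) {
           x = (i + 0.5) * step;              /* midpoint of interval i */
           sum += 4.0 / (1.0 + x * x);
       }
       pi = step * sum;
       printf("pi = %.10f in %f seconds\n", pi, omp_get_wtime() - t0);
       return 0;
   }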
6. ...synchronization. Flush forces data to be updated in memory so other threads see the most recent value:

   double A;
   A = compute();
   #pragma omp flush(A)   // flush to memory to make sure other threads can pick up the right value

Note: OpenMP's flush is analogous to a fence in other shared-memory APIs.

Exercise 7: parallelize the prod_cons.c program. This is a well-known pattern called the producer-consumer pattern: one thread produces values that another thread consumes. It is often used with a stream of produced values to implement pipeline parallelism. The key is to implement pairwise synchronization between threads.

   int main()
   {
      double *A, sum, runtime;
      int flag = 0;
      A = (double *) malloc(N * sizeof(double));
      runtime = omp_get_wtime();
      fill_rand(N, A);            // producer: fill an array of data
      sum = Sum_array(N, A);      // consumer: sum the array
      runtime = omp_get_wtime() - runtime;
      printf(" In %lf seconds, the sum is %lf \n", runtime, sum);
   }

(You would need to put the producer-consumer pair inside a loop for true pipeline parallelism.)

Compilers routinely reorder the instructions implementing a program; this helps better exploit the functional units, keep the machine busy, hide memory latencies, etc. A compiler generally cannot move instructions past a barrier, or past a flush on all variables, but it can move them past a flush with a list of variables so long as those variables are not accessed. Keeping track of consistency when flushes are used can be confusing.
7. ...pooh(3, A). A single copy of A is shared between all threads; at the closing brace of the parallel region the threads wait for all threads to finish before proceeding (i.e., a barrier), and only then does printf("all done\n") run.

*The name "OpenMP" is the property of the OpenMP Architecture Review Board.

Exercises 2 to 4: numerical integration. Mathematically, we know that the integral of 4.0/(1+x^2) from 0 to 1 equals pi. We can approximate the integral as a sum of N rectangles, sum over i of F(x_i)*dx, which tends to pi as N grows, where each rectangle has width dx and height F(x_i) at the middle of interval i.

Serial pi program:

   static long num_steps = 100000;
   double step;
   void main()
   {
      int i;
      double x, pi, sum = 0.0;
      step = 1.0/(double) num_steps;
      for (i = 0; i < num_steps; i++) {
         x = (i + 0.5) * step;
         sum = sum + 4.0/(1.0 + x*x);
      }
      pi = step * sum;
   }

Exercise 2: create a parallel version of the pi program using a parallel construct. Pay close attention to shared versus private variables. In addition to a parallel construct, you will need these runtime library routines: void omp_set_num_threads(int) (request a number of threads in the team); int omp_get_num_threads() (number of threads in the team); int omp_get_thread_num() (thread ID or rank); double omp_get_wtime() (time in seconds since a fixed point in the past).

Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.

Synchronization.
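A small runnable demo of the four routines just listed (my illustration, not a slide):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
       omp_set_num_threads(4);                  /* request 4 threads          */
       double t0 = omp_get_wtime();             /* wall clock, fixed origin   */
       #pragma omp parallel
       {
           int id     = omp_get_thread_num();   /* rank within the team       */
           int nthrds = omp_get_num_threads();  /* actual team size           */
           printf("thread %d of %d\n", id, nthrds);
       }
       printf("elapsed: %f s\n", omp_get_wtime() - t0);
       return 0;
   }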
8. ...the entire region (the data-scoping clauses apply to the OpenMP construct, NOT to the entire region). The final value of a private variable inside a parallel loop can be transmitted to the shared variable outside the loop with lastprivate. The default attributes can be overridden with DEFAULT(PRIVATE | SHARED | NONE); DEFAULT(PRIVATE) is Fortran only.

private(var) creates a new local copy of var for each thread:

   void wrong()
   {
      int tmp = 0;
      #pragma omp for private(tmp)
      for (int j = 0; j < 1000; ++j)
         tmp += j;             // tmp was not initialized
      printf("%d\n", tmp);     // tmp: 0 in 3.0, unspecified in 2.5
   }

The original variable's value is unspecified in OpenMP 2.5; in OpenMP 3.0 it is unspecified if it is referenced outside of the construct:

   int tmp;
   void danger()
   {
      tmp = 0;
      #pragma omp parallel private(tmp)
         work();               // unspecified which copy of tmp is used
      printf("%d\n", tmp);     // tmp has an unspecified value
   }

   extern int tmp;
   void work() { tmp = 5; }

firstprivate is a special case of private: each thread gets its own copy, with an initial value taken from the original variable (here 0):

   void useless()
   {
      int tmp = 0;
      #pragma omp for firstprivate(tmp)
      for (int j = 0; j < 1000; ++j)
         tmp += j;
      printf("%d\n", tmp);     // tmp: 0 in 3.0, unspecified in 2.5
   }

lastprivate passes the value of a private variable from the last iteration to a global variable:

   void closer()
   {
      int tmp = 0;
      #pragma omp parallel for firstprivate(tmp) lastprivate(tmp)
      for (int j = 0; j < 1000; ++j)
         tmp += j;             // each thread gets its own tmp, with an initial value of 0
      printf("%d\n", tmp);     // tmp is defined as its value at the last sequential iteration
   }
9. The ordered example:

   #pragma omp parallel private(tmp)
   #pragma omp for ordered reduction(+:res)
   for (I = 0; I < N; I++) {
      tmp = NEAT_STUFF(I);
      #pragma omp ordered
         res += consum(tmp);
   }

Simple lock routines: a simple lock is available if it is unset. A lock implies a memory fence (a "flush") of all thread-visible variables: omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock(). Nested locks: a nested lock is available if it is unset, or if it is set but owned by the thread executing the nested lock function. Note: a thread always accesses the most recent copy of the lock, so you don't need to use a flush on the lock variable.

Protect resources with locks:

   omp_lock_t lck;
   omp_init_lock(&lck);
   #pragma omp parallel private(tmp, id)
   {
      id = omp_get_thread_num();
      tmp = do_lots_of_work(id);
      omp_set_lock(&lck);          // wait here for your turn
      printf("%d %d\n", id, tmp);
      omp_unset_lock(&lck);        // release the lock so the next thread gets a turn
   }
   omp_destroy_lock(&lck);         // free up storage when done

Runtime environment routines (plus a few less commonly used routines). To use a known, fixed number of threads in a program: (1) tell the system that you don't want dynamic adjustment of the number of threads, (2) set the number of threads, then (3) save the number you got.

   #include <omp.h>
   void main()
   {
      int num_threads;
      omp_set_dynamic(0);                          // disable dynamic adjustment of the number of threads
      omp_set_num_threads(omp_get_num_procs());    // request as many threads as there are processors
      #pragma omp parallel
      {
         int id = omp_get_thread_num();
         #pragma omp single
            num_threads = omp_get_num_threads();
         do_lots_of_stuff(id);
      }
   }
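The lock fragment above, assembled into a compilable example (a sketch; do_lots_of_work is a hypothetical stand-in for real work):

   #include <stdio.h>
   #include <omp.h>

   /* Hypothetical stand-in for the tutorial's do_lots_of_work(). */
   static int do_lots_of_work(int id) { return id * id; }

   int main(void)
   {
       omp_lock_t lck;
       omp_init_lock(&lck);

       #pragma omp parallel
       {
           int id  = omp_get_thread_num();
           int tmp = do_lots_of_work(id);   /* do the work outside the lock  */

           omp_set_lock(&lck);              /* wait here for your turn       */
           printf("%d %d\n", id, tmp);      /* one thread at a time prints   */
           omp_unset_lock(&lck);            /* let the next thread in        */
       }

       omp_destroy_lock(&lck);              /* free lock storage when done   */
       return 0;
   }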
10. ...the serial program.

Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.

Synchronization: barrier. Each thread waits until all threads arrive:

   #pragma omp parallel shared(A, B, C) private(id)
   {
      id = omp_get_thread_num();
      A[id] = big_calc1(id);
      #pragma omp barrier
      #pragma omp for
      for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }   // implicit barrier at the end of a for worksharing construct
      #pragma omp for nowait
      for (i = 0; i < N; i++) { B[i] = big_calc2(C, i); }   // no implicit barrier, due to nowait
      A[id] = big_calc4(id);
   }                                                         // implicit barrier at the end of a parallel region

The master construct denotes a structured block that is only executed by the master thread; the other threads just skip it (no synchronization is implied):

   #pragma omp parallel
   {
      do_many_things();
      #pragma omp master
      { exchange_boundaries(); }
      #pragma omp barrier
      do_many_other_things();
   }

The single construct denotes a block of code that is executed by only one thread (not necessarily the master thread). A barrier is implied at the end of the single block (you can remove the barrier with a nowait clause):

   #pragma omp parallel
   {
      do_many_things();
      #pragma omp single
      { exchange_boundaries(); }
      do_many_other_things();
   }

The ordered region executes in the sequential order, as in the example shown on page 9.
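A small runnable contrast of single versus master (my illustration, not a slide): single is run by whichever thread gets there first and implies a barrier; master is run only by thread 0 and implies no synchronization, so an explicit barrier is added.

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
       #pragma omp parallel num_threads(4)
       {
           int id = omp_get_thread_num();

           #pragma omp single        /* one thread (any) runs this; others wait at the implied barrier */
           printf("single block ran on thread %d\n", id);

           #pragma omp master        /* only thread 0 runs this; no barrier is implied */
           printf("master block ran on thread %d\n", id);

           #pragma omp barrier       /* make everyone wait for the master block */
           printf("thread %d past the barrier\n", id);
       }
       return 0;
   }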
11. ...you get beyond the high-error, small-sample-count range, adding threads doesn't decrease the quality of the random sampling.

Leapfrog method: interleave samples in the sequence of pseudo-random numbers; thread i starts at the i-th number in the sequence and strides through the sequence with a stride length equal to the number of threads. Result: the same sequence of values regardless of the number of threads.

   #pragma omp single
   {
      nthreads = omp_get_num_threads();
      iseed = PMOD/MULTIPLIER;                 // one thread computes the offsets and the strided multiplier
      pseed[0] = iseed;
      mult_n = MULTIPLIER;
      for (i = 1; i < nthreads; ++i) {
         iseed = (unsigned long long)((MULTIPLIER * iseed) % PMOD);
         pseed[i] = iseed;
         mult_n = (mult_n * MULTIPLIER) % PMOD;
      }
   }
   random_last = (unsigned long long) pseed[id];   // each thread stores its offset starting point into its threadprivate "last random value"

(This uses an LCG with addend = 0, just to keep things simple.)

We can use the leapfrog method to generate the same answer for any number of threads:

   Steps        One thread   2 threads    4 threads
   1000         3.156        3.156        3.156
   10000        3.1168       3.1168       3.1168
   100000       3.13964      3.13964      3.13964
   1000000      3.140348     3.140348     3.140348
   10000000     3.141658     3.141658     3.141658

Used the MKL library with two generator streams per computation: one for the x values (WH) and one for the y values (WH+1). Also used the leapfrog method to deal out iterations among threads.
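A compilable sketch of the leapfrog setup described above (my assembly of the fragment, under the stated assumptions: addend-0 LCG, the tutorial's multiplier and modulus, an arbitrary common seed, and a fixed-size pseed array):

   #include <stdio.h>
   #include <omp.h>

   #define MAX_THREADS 128
   static const unsigned long long MULTIPLIER = 1366;
   static const unsigned long long PMOD       = 714025;

   static unsigned long long random_last;
   #pragma omp threadprivate(random_last)

   int main(void)
   {
       unsigned long long pseed[MAX_THREADS], mult_n = MULTIPLIER;
       int nthreads;

       #pragma omp parallel
       {
           int id = omp_get_thread_num();
           #pragma omp single
           {
               unsigned long long iseed = PMOD / MULTIPLIER;  /* arbitrary seed */
               nthreads = omp_get_num_threads();
               pseed[0] = iseed;
               for (int i = 1; i < nthreads; ++i) {
                   iseed = (MULTIPLIER * iseed) % PMOD;       /* offset for thread i */
                   pseed[i] = iseed;
                   mult_n = (mult_n * MULTIPLIER) % PMOD;     /* stride multiplier A^nthreads */
               }
           }   /* implied barrier: all threads now see pseed and mult_n */

           random_last = pseed[id];                           /* thread's starting point */
           random_last = (mult_n * random_last) % PMOD;       /* next strided value: every
                                                                 nthreads-th element of the
                                                                 one global sequence */
           printf("thread %d, first leapfrog value: %llu\n", id, random_last);
       }
       return 0;
   }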
12. Exercise 1, Part A: hello world. Verify that your environment works: write a program that prints "hello world".

   void main()
   {
      int ID = 0;
      printf(" hello(%d)", ID);
      printf(" world(%d)\n", ID);
   }

Part B: write a multithreaded program that prints "hello world". Switches for compiling and linking: -fopenmp (gcc), -mp (pgi), /Qopenmp (intel). The multithreaded version, where each thread prints "hello world", is the program shown on page 3 (include omp.h, open a parallel region, and use omp_get_thread_num() to pick up the thread ID).

Core ideas: OpenMP is a multi-threading, shared-address model; threads communicate by sharing variables. Unintended sharing of data causes race conditions: the program's outcome changes as the threads are scheduled differently. To control race conditions, use synchronization to protect data conflicts. Synchronization is expensive, so change how data is accessed to minimize the need for it.

Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.

Fork-join parallelism: the master thread spawns a team of threads as needed. Parallelism is added incrementally until the performance goals are met, i.e., the sequential program evolves into a parallel program.
13. (random(), continued from page 29:)

      random_last = random_next;
      return ((double)random_next / (double)PMOD);
   }

Running the pi_mc program with the LCG generator (plot: log relative error versus log number of samples; LCG one thread versus LCG 4 threads, trials 1, 2, and 3): run the same program the same way and get different answers! That is not acceptable. Issue: my LCG generator is not threadsafe. (Program written using the Intel C/C++ compiler 10.0.659.2005 in Microsoft Visual Studio 2005 8.0.50727.42, running on a dual-core laptop, Intel T2400 at 1.83 GHz with 2 GB RAM, under Microsoft Windows XP.)

   static long MULTIPLIER = 1366;
   static long ADDEND     = 150889;
   static long PMOD       = 714025;
   long random_last = 0;                      // random_last carries state between random number computations
   #pragma omp threadprivate(random_last)     // to make the generator threadsafe, make random_last threadprivate so each thread has its own copy

   double random()
   {
      long random_next;
      random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
      random_last = random_next;
      return ((double)random_next / (double)PMOD);
   }

Thread-safe random number generators (plot: log relative error versus log number of samples for the thread-safe version).
14. OpenMP: A "Hands-on" Introduction to OpenMP. Tim Mattson, Principal Engineer, Intel Corporation (timothy.g.mattson@intel.com); Larry Meadows, Principal Engineer, Intel Corporation (lawrence.f.meadows@intel.com). *The name "OpenMP" is the property of the OpenMP Architecture Review Board.

Disclosures: the views expressed in this tutorial are those of the people delivering the tutorial. This is a new tutorial for us; help us improve it by telling us how you would make this tutorial better.

Our plan for the day: active learning. We will mix short lectures with short exercises. You will use your laptop for the exercises; that way you'll have an OpenMP environment to take home, so you can keep learning on your own. Please follow these simple rules: do the exercises we assign, and then change things around and experiment; don't cheat: do NOT look at the solutions before you complete an exercise, even if you get really frustrated.

Agenda: I. OpenMP introduction: install software, hello_world (parallel regions). II. Creating threads: pi_spmd_simple (parallel region, default data environment, runtime library calls). III. Synchronization: pi_spmd_final (false sharing, critical, atomic). IV. Parallel loops: pi_loop (for, schedule, reduction). V. Odds and ends: no exercise (single, master, runtime libraries, environment variables, synchronization, etc.). VI. Data environment: pi_mc (data environment details, modular software, threadprivate). VII. Worksharing and schedule: linked list and matmul (for schedules, sections). VIII. Memory model (continued on page 34).
15. ...place the appropriate OpenMP directive and test. Note: the loop index i is private by default.

   Sequential code with a loop-carried dependence:      Parallel version, dependence removed:
   int i, j, A[MAX];                                    int i, A[MAX];
   j = 5;                                               #pragma omp parallel for
   for (i = 0; i < MAX; i++) {                          for (i = 0; i < MAX; i++) {
      j += 2;                                              int j = 5 + 2*(i+1);
      A[i] = big(j);                                       A[i] = big(j);
   }                                                    }

How do we handle this case?

   double ave = 0.0, A[MAX];
   int i;
   for (i = 0; i < MAX; i++) {
      ave += A[i];
   }
   ave = ave / MAX;

We are combining values into a single accumulation variable (ave); there is a true dependence between loop iterations that can't be trivially removed. This is a very common situation: it is called a reduction. Support for reduction operations is included in most parallel programming environments.

OpenMP reduction clause: reduction(op : list). It can appear on a parallel or a worksharing construct; the variables in "list" must be shared in the enclosing parallel region:

   double ave = 0.0, A[MAX];
   int i;
   #pragma omp parallel for reduction(+:ave)
   for (i = 0; i < MAX; i++) {
      ave += A[i];
   }
   ave = ave / MAX;

Many different associative operands can be used with reduction; the initial values are the ones that make sense mathematically. C/C++ and Fortran: + (0), * (1), - (0). C/C++ only: & (all bits on), | (0), ^ (0), && (1), || (0). Fortran only: .AND. (.true.), .OR. (.false.), .NEQV. (.false.), .EQV. (.true.), .IEOR. (0), .IOR. (0), .IAND. (all bits on), MIN (largest positive number), MAX (most negative number).

Exercise 4: go back to the serial pi program and parallelize it with a loop construct. Your goal is to minimize the number of changes made to the serial program.
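The reduction average, assembled into a runnable file (my illustration; the data values are arbitrary):

   #include <stdio.h>
   #define MAX 1000

   int main(void)
   {
       double ave = 0.0, A[MAX];
       int i;

       for (i = 0; i < MAX; i++) A[i] = i;   /* sample data: 0, 1, ..., MAX-1 */

       #pragma omp parallel for reduction(+:ave)
       for (i = 0; i < MAX; i++)
           ave += A[i];          /* each thread sums into its own 0-initialized copy */

       ave = ave / MAX;          /* the per-thread copies were combined at the end of the loop */
       printf("average = %f (expect %f)\n", ave, (MAX - 1) / 2.0);
       return 0;
   }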
16. ...its value at the last sequential iteration (i.e., for j = 999).

Consider this example of PRIVATE and FIRSTPRIVATE. Variables A, B, and C all equal 1, and then:

   #pragma omp parallel private(B) firstprivate(C)

Are A, B, and C local to each thread, or shared inside the parallel region? What are their initial values inside, and their values after, the parallel region? Inside this parallel region: A is shared by all threads and equals 1; B and C are local to each thread; B's initial value is undefined, and C's initial value equals 1. Outside this parallel region: the values of B and C are unspecified in OpenMP 2.5, and in OpenMP 3.0 if they are referenced in the region but outside the construct.

Data sharing: the default clause. Note that the default storage attribute is DEFAULT(SHARED), so there is no need to use it (exception: #pragma omp task). To change the default: DEFAULT(PRIVATE) makes each variable in the construct private, as if specified in a private clause (mostly saves typing); DEFAULT(NONE) gives no default for variables in static extent, so you must list a storage attribute for each variable in static extent: good programming practice! Only the Fortran API supports DEFAULT(PRIVATE); C/C++ only has DEFAULT(SHARED) or DEFAULT(NONE).

Default clause example: these two code fragments are equivalent:

   itotal = 1000
   C$OMP PARALLEL PRIVATE(np, each)
      np = omp_get_num_threads()
      each = itotal/np
   C$OMP END PARALLEL

   itotal = 1000
   C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
      np = omp_get_num_threads()
      each = itotal/np
   C$OMP END PARALLEL

The default for tasks is usually firstprivate, because the task may not be executed until later (and variables may have gone out of scope).
17. ...synchronization is used to impose order constraints and to protect access to shared data. High-level synchronization: critical, atomic, barrier, ordered. Low-level synchronization: flush, locks (both simple and nested).

Mutual exclusion: only one thread at a time can enter a critical region.

   float res;
   #pragma omp parallel
   {
      float B;
      int i, id, nthrds;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      for (i = id; i < niters; i += nthrds) {
         B = big_job(i);
         #pragma omp critical
            consume(B, res);       // threads wait their turn: only one at a time calls consume()
      }
   }

atomic provides mutual exclusion, but it only applies to the update of a memory location (the update of X in the following example):

   #pragma omp parallel
   {
      double tmp, B;
      B = DOIT();
      tmp = big_ugly(B);
      #pragma omp atomic
         X += tmp;                 // atomic protects only the read/update of X
   }

Exercise 3: in exercise 2, you probably used an array to create space for each thread to store its partial sum. If the array elements happen to share a cache line, this leads to false sharing. Modify your pi program from exercise 2 to avoid false sharing due to the sum array.

Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.

A parallel construct by itself creates an SPMD or "Single Program Multiple Data" program, i.e., each thread redundantly executes the same code. How do you split up pathways through the code between threads within a team? (See worksharing, page 40.)
18. Linked list solutions (OpenMP 2.5). See the file Linked_omp25.c:

   while (p != NULL) {
      p = p->next;
      count++;                       // count the number of items in the linked list
   }
   p = head;
   for (i = 0; i < count; i++) {
      parr[i] = p;
      p = p->next;                   // copy a pointer to each node into an array
   }
   #pragma omp parallel
   {
      #pragma omp for schedule(static,1)
      for (i = 0; i < count; i++)    // process the nodes in parallel with a for loop
         processwork(parr[i]);
   }

Results on an Intel dual-core 1.83 GHz CPU (Intel IA-32 compiler 10.1, build 2): run times for one thread versus two threads.

See the file Linked_cpp.cpp:

   std::vector<node *> nodelist;
   for (p = head; p != NULL; p = p->next)
      nodelist.push_back(p);          // copy a pointer to each node into an array; nodelist.size() counts the items
   int j = (int) nodelist.size();
   #pragma omp parallel for schedule(static,1)
   for (int i = 0; i < j; ++i)        // process the nodes in parallel with a for loop
      processwork(nodelist[i]);

Results on the same machine: C++ with the default schedule, C++ with (static,1), and C with (static,1), for one thread and two threads.
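A self-contained version of the count-then-array pattern (my sketch: the node type and processwork() here are toy stand-ins for the ones in the exercise files):

   #include <stdio.h>
   #include <stdlib.h>
   #include <omp.h>

   typedef struct node { int data; struct node *next; } node;

   /* Toy stand-in for the exercise's processwork(). */
   static void processwork(node *p) { p->data *= 2; }

   int main(void)
   {
       node *head = NULL;
       for (int v = 9; v >= 0; v--) {             /* build a small list 0..9 */
           node *n = malloc(sizeof *n);
           n->data = v; n->next = head; head = n;
       }

       int count = 0;                             /* count the items */
       for (node *p = head; p != NULL; p = p->next) count++;

       node **parr = malloc(count * sizeof *parr);
       node *p = head;                            /* copy node pointers into an array */
       for (int i = 0; i < count; i++) { parr[i] = p; p = p->next; }

       #pragma omp parallel for schedule(static,1)
       for (int i = 0; i < count; i++)            /* process nodes in parallel */
           processwork(parr[i]);

       for (node *q = head; q != NULL; q = q->next) printf("%d ", q->data);
       printf("\n");
       free(parr);
       while (head) { node *t = head->next; free(head); head = t; }
       return 0;
   }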
19. (Outline, continued: Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.)

The sections worksharing construct gives a different structured block to each thread:

   #pragma omp parallel
   {
      #pragma omp sections
      {
         #pragma omp section
            x_calculation();
         #pragma omp section
            y_calculation();
         #pragma omp section
            z_calculation();
      }
   }

By default there is a barrier at the end of the omp sections; use the nowait clause to turn off the barrier.

The schedule clause affects how loop iterations are mapped onto threads: schedule(static [,chunk]), schedule(dynamic [,chunk]), schedule(guided [,chunk]), schedule(runtime). When to use each: STATIC: work per iteration is pre-determined and predictable by the programmer; least work at runtime (scheduling is done at compile time). DYNAMIC: unpredictable, highly variable work per iteration; most work at runtime (complex scheduling logic used at run time, with some overhead).

Exercise 6 (hard): consider the program linked.c. It traverses a linked list, computing a sequence of Fibonacci numbers at each node. Parallelize this program using constructs defined in OpenMP 2.5 (loop worksharing constructs). Once you have a correct program, optimize it. Exercise 6 (easy): parallelize the matrix multiplication program in the file matmul.c; can you optimize the program by playing with how the loops are scheduled? (A sketch follows below.)

Outline: Introduction to OpenMP; ...
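A hedged sketch of one way to parallelize the triple loop of the matmul exercise (my illustration, not matmul.c itself; square matrices and the initialization values are assumptions). Parallelizing the outer loop gives each thread whole rows of C, so no two threads write the same element:

   #include <stdio.h>
   #include <stdlib.h>
   #include <omp.h>

   #define N 600    /* Ndim = Mdim = Pdim = N, for simplicity */

   int main(void)
   {
       double *A = malloc(N * N * sizeof(double));
       double *B = malloc(N * N * sizeof(double));
       double *C = malloc(N * N * sizeof(double));
       for (long k = 0; k < (long)N * N; k++) { A[k] = 1.0; B[k] = 2.0; }

       double t0 = omp_get_wtime();
       #pragma omp parallel for schedule(static)
       for (int i = 0; i < N; i++)
           for (int j = 0; j < N; j++) {
               double tmp = 0.0;
               for (int k = 0; k < N; k++)
                   tmp += A[i * N + k] * B[k * N + j];   /* C(i,j) = sum over k of A(i,k)*B(k,j) */
               C[i * N + j] = tmp;
           }
       printf("C[0] = %f in %f s\n", C[0], omp_get_wtime() - t0);
       free(A); free(B); free(C);
       return 0;
   }

Try schedule(dynamic) or different chunk sizes, as the exercise suggests, and compare the timings.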
20. ...were good; the ones after lunch were too hard. I need to refine the exercise set. One idea is, for each slot, to have an easy exercise and a hard exercise; this will help me keep the group's experience more balanced. Most people didn't get the threadprivate exercise. The few people who tried the linked-list exercise were amazingly creative: one even got a single/nowait version to work.
21. ...evolves into a parallel program (parallel regions, shown in red, with the master thread in between; nested parallel regions are possible).

You create threads in OpenMP with the parallel construct. For example, to create a 4-thread parallel region:

   double A[1000];
   omp_set_num_threads(4);               // runtime function to request a certain number of threads
   #pragma omp parallel
   {
      int ID = omp_get_thread_num();     // runtime function returning a thread ID
      pooh(ID, A);                       // each thread executes a copy of the code within the structured block
   }

Each thread calls pooh(ID, A) for ID = 0 to 3. *The name "OpenMP" is the property of the OpenMP Architecture Review Board.

Alternatively, use a clause to request a certain number of threads:

   double A[1000];
   #pragma omp parallel num_threads(4)
   {
      int ID = omp_get_thread_num();     // runtime function returning a thread ID
      pooh(ID, A);                       // each thread executes a copy of the code within the structured block
   }

Each thread executes the same code redundantly:

   double A[1000];
   omp_set_num_threads(4);
   #pragma omp parallel
   {
      int ID = omp_get_thread_num();
      pooh(ID, A);
   }
   printf("all done\n");

omp_set_num_threads(4) forks a team, and each thread calls pooh with its own ID: pooh(0,A), pooh(1,A), pooh(2,A), pooh(3,A).
22. ...executed immediately by the encountering thread. The data environment is still local to the new task, and it is still a different task with respect to synchronization. It's a user-directed optimization: for when the cost of deferring the task is too great compared to the cost of executing the task code, or to control cache and memory affinity.

Task completion. At thread barriers (explicit or implicit): applies to all tasks generated in the current parallel region up to the barrier; this matches user expectation. At task barriers (#pragma omp taskwait): wait until all tasks defined in the current task have completed. Note: this applies only to tasks generated in the current task, not to "descendants".

Example: parallel pointer chasing using tasks:

   #pragma omp parallel
   {
      #pragma omp single private(p)
      {
         p = listhead;
         while (p) {
            #pragma omp task
               process(p);        // p is firstprivate inside this task
            p = next(p);
         }
      }
   }

Example: parallel pointer chasing on multiple lists using tasks:

   #pragma omp parallel
   {
      #pragma omp for private(p)
      for (int i = 0; i < numlists; i++) {
         p = listheads[i];
         while (p) {
            #pragma omp task
               process(p);
            p = next(p);
         }
      }
   }

Example (OpenMP 3.0): postorder tree traversal:

   void postorder(node *p)
   {
      if (p->left)
         #pragma omp task
            postorder(p->left);
      if (p->right)
         #pragma omp task
            postorder(p->right);
      #pragma omp taskwait           // wait for descendants: the parent task suspends until its children complete
      process(p->data);
   }

The taskwait is a task scheduling point (see page 38).
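The single-list pointer-chasing pattern above, assembled into a runnable file (my sketch, requiring an OpenMP 3.0 compiler; the node type and process() are toy stand-ins):

   #include <stdio.h>
   #include <stdlib.h>
   #include <omp.h>

   typedef struct node { int data; struct node *next; } node;

   static void process(node *p)
   {
       printf("thread %d processed %d\n", omp_get_thread_num(), p->data);
   }

   int main(void)
   {
       node *head = NULL;
       for (int v = 5; v >= 1; v--) {          /* build the list 1..5 */
           node *n = malloc(sizeof *n);
           n->data = v; n->next = head; head = n;
       }

       #pragma omp parallel
       {
           #pragma omp single      /* one thread walks the list and packages tasks */
           for (node *p = head; p != NULL; p = p->next) {
               #pragma omp task firstprivate(p)
               process(p);         /* any thread in the team may execute this task */
           }
       }   /* implicit barrier: all tasks complete before the region ends */

       while (head) { node *t = head->next; free(head); head = t; }
       return 0;
   }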
23. Notes on the fixed-thread-count code (page 9): protect the store to num_threads with single, since memory stores are not atomic. Even in this case, the system may give you fewer threads than requested; if the precise number of threads matters, test for it and respond accordingly.

Environment variables: set the default number of threads to use with OMP_NUM_THREADS int_literal. Control how schedule(runtime) loop iterations are scheduled with OMP_SCHEDULE "schedule[, chunk_size]". Plus several less commonly used environment variables.

Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.

Shared-memory programming model: global variables are SHARED among threads. But not everything is shared: stack variables in functions called from parallel regions, and automatic variables within a statement block, are PRIVATE.

   double A[10];                      extern double A[10];
   int main()                         void work(int *index)
   {                                  {
      int index[10];                     double temp[10];
      #pragma omp parallel               static int count;
         work(index);                    ...
      printf("%d\n", index[0]);       }
   }

A, index, and count are shared by all threads; temp is local to each thread.

Third party trademarks and names are the property of their respective owner.

One can selectively change the storage attributes for constructs using the clauses SHARED, PRIVATE, and FIRSTPRIVATE. All the clauses on this page apply to the OpenMP construct, NOT to the entire region.
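A small runnable illustration of schedule(runtime) and the OMP_SCHEDULE variable just described (my example, not a slide):

   #include <stdio.h>
   #include <omp.h>

   int main(void)
   {
       #pragma omp parallel for schedule(runtime)   /* schedule chosen at run time */
       for (int i = 0; i < 16; i++)
           printf("iteration %2d on thread %d\n", i, omp_get_thread_num());
       return 0;
   }

Run it with, e.g., OMP_NUM_THREADS=4 and OMP_SCHEDULE="static,1", then rerun with OMP_SCHEDULE="dynamic,2" and compare which thread gets which iterations.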
24. (Producer-consumer solution, continued from page 36:)

         flag = 1;
         #pragma omp flush(flag)
      }
      #pragma omp section            // the consumer
      {
         #pragma omp flush(flag)
         while (flag != 1) {
            #pragma omp flush(flag)
         }
         #pragma omp flush
         sum = Sum_array(N, A);
      }
   }

Use flag to signal when the "produced" value is ready. Flush forces a refresh to memory, guaranteeing that the other thread sees the new value of A. A flush is needed on both the reader and writer sides of the communication. Notice that you must put the flush inside the while loop to make sure the updated flag variable is seen.

Intel taskq solution: see the file Linked_intel_taskq.c:

   #pragma omp parallel
   {
      #pragma intel omp taskq
      {
         while (p != NULL) {
            #pragma intel omp task captureprivate(p)
               processwork(p);
            p = p->next;
         }
      }
   }

Results on an Intel dual-core 1.83 GHz CPU (Intel IA-32 compiler 10.1, build 2): array, static(1), and Intel taskq variants, one thread versus two threads.

The OpenMP 3.0 tasks version:

   #pragma omp parallel
   {
      #pragma omp single
      {
         p = head;
         while (p) {
            #pragma omp task firstprivate(p)
               processwork(p);
            p = p->next;
         }
      }
   }

This creates a task with its own copy of p, initialized to the value of p at the time the task is defined.
26. ...the task may not be executed until later, and variables may have gone out of scope. Variables that are shared in all constructs, starting from the innermost enclosing parallel construct, are shared, because the barrier guarantees task completion.

   #pragma omp parallel shared(A) private(B)
   {
      ...
      #pragma omp task
      {
         int C;
         compute(A, B, C);   // A is shared, B is firstprivate, C is private
      }
   }

threadprivate makes global data private to a thread: in Fortran, COMMON blocks; in C, file-scope and static variables, and static class members. This is different from making them PRIVATE: with PRIVATE, global variables are masked; THREADPRIVATE preserves global scope within each thread. Threadprivate variables can be initialized using COPYIN, or at the time of definition (using language-defined initialization capabilities).

A threadprivate example (C): use threadprivate to create a counter for each thread.

   int counter = 0;
   #pragma omp threadprivate(counter)

   int increment_counter()
   {
      counter++;
      return counter;
   }

Data copying: copyin. You initialize threadprivate data using a copyin clause:

   parameter (N=1000)
   common/buf/A(N)
   !$OMP THREADPRIVATE(/buf/)

   C Initialize the A array
   call init_data(N, A)

   !$OMP PARALLEL COPYIN(A)
   ... now each thread sees the threadprivate array A initialized to the global value set in the subroutine init_data()
   !$OMP END PARALLEL
   end

Data copying: copyprivate. Used with a single region to broadcast values of privates from one member of a team to the rest of the team, as in the input_parameters example on page 2.
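The threadprivate counter, assembled into a runnable demo (my illustration, not a slide):

   #include <stdio.h>
   #include <omp.h>

   int counter = 0;
   #pragma omp threadprivate(counter)

   int increment_counter(void)
   {
       counter++;            /* each thread updates its own copy */
       return counter;
   }

   int main(void)
   {
       #pragma omp parallel num_threads(4)
       {
           for (int i = 0; i < 3; i++)
               increment_counter();
           printf("thread %d: counter = %d\n",
                  omp_get_thread_num(), counter);   /* prints 3 on every thread */
       }
       return 0;
   }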
28. ...num_threads: use them to split up loops and select different blocks of data to work on. Exercise 2 solution (SPMD pi, promoting the scalar sum to an array):

   #include <omp.h>
   static long num_steps = 100000;
   double step;
   #define NUM_THREADS 2
   void main()
   {
      int i, nthreads;
      double pi, sum[NUM_THREADS];        // promote the scalar to an array dimensioned by the number of threads to avoid a race condition
      step = 1.0/(double) num_steps;
      omp_set_num_threads(NUM_THREADS);
      #pragma omp parallel
      {
         int i, id, nthrds;
         double x;
         id = omp_get_thread_num();
         nthrds = omp_get_num_threads();
         if (id == 0) nthreads = nthrds;  // only one thread should copy the number of threads to the global value,
                                          // to make sure multiple threads writing to the same address don't conflict
         for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {   // a common trick in SPMD programs:
            x = (i + 0.5) * step;                                       // a cyclic distribution of loop iterations
            sum[id] += 4.0/(1.0 + x*x);
         }
      }
      for (i = 0, pi = 0.0; i < nthreads; i++)
         pi += sum[i] * step;
   }

If independent data elements happen to sit on the same cache line, each update will cause the cache lines to "slosh back and forth" between threads: this is called false sharing. If you promote scalars to an array to support creation of an SPMD program, the array elements are contiguous in memory and hence share cache lines.
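A hedged sketch of the padding fix described on page 4 (my illustration): give each thread's accumulator its own cache line, assuming a 64-byte line, i.e. 8 doubles.

   #include <stdio.h>
   #include <omp.h>

   #define NUM_THREADS 2
   #define PAD 8                  /* assumed 64-byte L1 cache line: 8 doubles */

   static long num_steps = 100000;

   int main(void)
   {
       double pi = 0.0, sum[NUM_THREADS][PAD];   /* pad rows onto distinct lines */
       double step = 1.0 / (double)num_steps;
       int nthreads;

       omp_set_num_threads(NUM_THREADS);
       #pragma omp parallel
       {
           int id = omp_get_thread_num();
           int nthrds = omp_get_num_threads();
           if (id == 0) nthreads = nthrds;
           sum[id][0] = 0.0;
           for (long i = id; i < num_steps; i += nthrds) {
               double x = (i + 0.5) * step;
               sum[id][0] += 4.0 / (1.0 + x * x);   /* only element 0 of each row is used */
           }
       }
       for (int i = 0; i < nthreads; i++)
           pi += sum[i][0] * step;
       printf("pi = %.10f\n", pi);
       return 0;
   }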
29. ...the chance of a dart falling in the circle is proportional to the ratio of the areas: A_circle = pi*r^2 and A_square = (2r)^2 = 4r^2, so P = A_circle/A_square = pi/4. Compute pi by randomly choosing points and counting the fraction that falls in the circle: N = 10 gives pi = 2.8; N = 100 gives pi = 3.16; N = 1000 gives pi = 3.148.

"Embarrassingly parallel": the parallelism is so easy it's embarrassing. Add two lines and you have a parallel program.

   static long num_trials = 10000;
   int main()
   {
      long i, Ncirc = 0;
      double pi, x, y;
      double r = 1.0;          // radius of the circle; the side of the square is 2*r
      seed(0, -r, r);          // the circle and square are centered at the origin
      for (i = 0; i < num_trials; i++) {
         x = random();
         y = random();
         if (x*x + y*y <= r*r) Ncirc++;
      }
      pi = 4.0 * ((double)Ncirc / (double)num_trials);
      printf("\n %ld trials, pi is %lf \n", num_trials, pi);
   }

Linear congruential generator (LCG): easy to write, cheap to compute, portable, OK quality:

   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;

If you pick the multiplier and addend correctly, an LCG has a period of PMOD. Picking good LCG parameters is complicated, so look it up (Numerical Recipes is a good source). I used the following: MULTIPLIER = 1366, ADDEND = 150889, PMOD = 714025.

   static long MULTIPLIER = 1366;
   static long ADDEND     = 150889;
   static long PMOD       = 714025;
   long random_last = 0;      // seed the pseudo-random sequence by setting random_last

   double random()
   {
      long random_next;
      random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
30. ...confusing, especially if a flush list is used. Note: the flush operation does not actually synchronize different threads; it just ensures that a thread's values are made consistent with main memory.

Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.

OpenMP pre-history: OpenMP is based upon the SMP directive standardization efforts of PCF and the aborted ANSI X3H5 (late 80's); nobody fully implemented either standard, and there were only a couple of partial implementations. Vendors considered proprietary APIs to be a competitive feature: every vendor had its own proprietary directive set; even KAP, a portable multi-platform parallelization tool, used different directives on each platform. (PCF: the Parallel Computing Forum; KAP: a parallelization tool from KAI.)

History of OpenMP: SGI and Cray merged and needed commonality across products; KAI, an ISV, needed a larger market; together they wrote a rough-draft "straw man" SMP API. The ASCI program was tired of recoding for SMPs and urged the vendors to standardize; other vendors were invited to join, and OpenMP appeared in 1997. OpenMP release history: a single specification for Fortran, C, and C++ arrived in 2005 (OpenMP 2.5); tasking and other new features arrived with OpenMP 3.0. Adding tasking is the biggest addition for 3.0. It was worked on by a separate subcommittee, led by Jay Hoeflinger at Intel, which re-examined the issue from the ground up.
31. (Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.)

OpenMP memory model. A memory model is defined in terms of: coherence: the behavior of the memory system when a single address is accessed by multiple threads; and consistency: the orderings of accesses to different addresses. (Diagram: the program order of reads and writes, Wa, Wb, Ra, Rb, is re-ordered by the compiler into a code order with the R's and W's in any semantically equivalent order, and re-ordered again by the machine into the commit order seen by memory.)

Consistency: sequential consistency: in a multi-processor, the ops (R, W, S) are sequentially consistent if program order = code order = commit order. Relaxed consistency: remove some of the ordering constraints for the memory ops (R, W, S). OpenMP defines consistency as a variant of weak consistency: S ops must be in sequential order across threads, and S ops cannot be reordered with R or W ops on the same thread (S before W, S before R, R before S, W before S, S before S).

The synchronization operation relevant to this discussion is flush. Flush defines a sequence point at which a thread is guaranteed to see a consistent view of memory with respect to the "flush set": all thread-visible variables for a flush construct without an argument list, or the list of variables when the flush(list) construct is used. The action of flush is to guarantee that memory ops (R = read, W = write, S = synchronization) on the flush set that precede the flush complete before it, and that those that follow it wait for it.
32. If done right, these methods can generate the same sequence regardless of the number of threads: nice for debugging, but not really needed scientifically.

Intel's MKL includes several families of RNGs in its vector statistics library, specialized to efficiently generate vectors of random numbers:

   #define BLOCK 100
   double buff[BLOCK];
   VSLStreamStatePtr stream;                        // initialize a stream of pseudo-random numbers

   vslNewStream(&stream, VSL_BRNG_WH, seed_val);    // select the type of RNG and set the seed
   vdRngUniform(VSL_METHOD_DUNIFORM_STD, stream,
                BLOCK, buff, low, hi);              // fill buff with BLOCK pseudo-random numbers,
                                                    // uniformly distributed between low and hi
   vslDeleteStream(&stream);                        // delete the stream when you are done

WH is a family of 273 parameter sets, each defining a non-overlapping and independent RNG. Easy to use: just make each stream threadprivate and initiate the RNG stream so each thread gets a unique WH RNG:

   VSLStreamStatePtr stream;
   #pragma omp threadprivate(stream)
   ...
   vslNewStream(&stream, VSL_BRNG_WH + Thrd_ID, seed);

Independent generator for each thread (plot: log relative error versus log number of samples for WH with one, two, and four threads). Notice that once you get beyond the high-error, small-sample-count range, adding threads doesn't decrease the quality of the random sampling.
33. ...loops. Added an environment variable to control the size of child threads' stacks: OMP_STACKSIZE. Added an environment variable to hint to the runtime how to treat idle threads: OMP_WAIT_POLICY (ACTIVE: keep threads alive at barriers/locks; PASSIVE: try to release the processor at barriers/locks). Added an environment variable and runtime routines to get/set the maximum number of active levels of nested parallelism: OMP_MAX_ACTIVE_LEVELS, omp_set_max_active_levels(), omp_get_max_active_levels(). Added an environment variable to set the maximum number of threads in use: OMP_THREAD_LIMIT, omp_get_thread_limit().

Odds and ends in 3.0: allow unsigned ints in parallel for loops; disallow use of the original variable as the master thread's private variable; make it clearer where and how private objects are constructed and destructed; relax some restrictions on allocatable arrays and Fortran pointers; plug some minor gaps in the memory model; allow C++ static class members to be threadprivate; improve the C/C++ grammar; minor fixes and clarifications to 2.5.

Exercise 8: consider the program linked.c. It traverses a linked list, computing a sequence of Fibonacci numbers at each node. Parallelize this program using tasks. Compare your solution's complexity to that of the approach without tasks.

Conclusion: OpenMP 3.0 is a major upgrade that expands the range of algorithms accessible from OpenMP. OpenMP is fun, and about as easy as we can make it for applications programmers working with shared-memory platforms.
34. (Agenda, continued:) VIII. Memory model: producer-consumer (point-to-point synchronization with flush). IX. OpenMP 3 and tasks: linked list (tasks and other OpenMP 3 features).

Outline: Introduction to OpenMP; Creating Threads; Synchronization; Parallel Loops; Synchronize: single, masters and stuff; Data environment; Schedule your for and sections; Memory model; OpenMP 3.0 and Tasks.

OpenMP overview (a cloud of syntax surrounds the definition: C$OMP FLUSH, #pragma omp critical, C$OMP PARALLEL COPYIN(/blk/), C$OMP DO lastprivate(XX), Nthrds = OMP_GET_NUM_PROCS(), omp_set_lock(lck), ...). OpenMP: an API for writing multithreaded applications. It is a set of compiler directives and library routines for parallel application programmers; it greatly simplifies writing multi-threaded (MT) programs in Fortran, C, and C++; and it standardizes the last 20 years of SMP practice.

OpenMP basic definitions, the solution stack: at the top, the application; then directives, the OpenMP runtime library, and environment variables; then OS/system support for shared memory and threading; at the bottom, processors Proc1 through ProcN over a shared address space.

Most of the constructs in OpenMP are compiler directives. The function prototypes and types are in the file omp.h. Most OpenMP constructs apply to a "structured block": a block of one or more statements with one point of entry at the top and one point of exit at the bottom. It's OK to have an exit() within the structured block.
35. ...working with shared-memory platforms. Resources: the OpenMP Architecture Review Board, www.openmp.org, is the owner of the OpenMP specification. The OpenMP User's Group, cOMPunity: get involved, join cOMPunity, and help define the future of OpenMP. Books about OpenMP: "Patterns for Parallel Programming" by Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill; "Using OpenMP" by Barbara Chapman, Gabriele Jost, and Ruud van der Pas, with a foreword by David J. Kuck.
36. Matrix multiplication solution:

   #pragma omp parallel for private(tmp, j, k)
   for (i = 0; i < Ndim; i++) {
      for (j = 0; j < Mdim; j++) {
         tmp = 0.0;
         for (k = 0; k < Pdim; k++) {
            // C(i,j) = sum over k of A(i,k) * B(k,j)
            tmp += *(A + (i*Ndim + k)) * *(B + (k*Pdim + j));
         }
         *(C + (i*Ndim + j)) = tmp;
      }
   }

On a dual-core laptop: 13.2 seconds (153 MFLOPS) with one thread; 7.5 seconds (270 MFLOPS) with two threads. Results on an Intel dual-core 1.83 GHz CPU (Intel IA-32 compiler 10.1, build 2).

Pairwise synchronization in OpenMP: OpenMP lacks synchronization constructs that work between pairs of threads; when this is needed, you have to build it yourself. Pairwise synchronization: use a shared flag variable; the reader spins, waiting for the new flag value; use flushes to force updates to and from memory.

Exercise 7 solution (producer-consumer; this code continues on page 24):

   int main()
   {
      double *A, sum, runtime;
      int numthreads, flag = 0;
      A = (double *) malloc(N * sizeof(double));
      #pragma omp parallel sections
      {
         #pragma omp section            // the producer
         {
            fill_rand(N, A);
            #pragma omp flush
37. ...issue from the ground up (quite different from Intel taskq's). A task has: code to execute; a data environment (it owns its data); and an assigned thread that executes the code and uses the data. Two activities: packaging and execution. Each encountering thread packages a new instance of a task (code and data); some thread in the team executes the task at some later time.

Definitions. Task construct: a task directive plus a structured block. Task: the package of code and instructions for allocating data, created when a thread encounters a task construct. Task region: the dynamic sequence of instructions produced by the execution of a task by a thread.

Tasks have been fully integrated into OpenMP. Key concept: OpenMP has always had tasks, we just never called them that. A thread encountering a parallel construct packages up a set of implicit tasks, one per thread; a team of threads is created; each thread in the team is assigned to one of the tasks (and tied to it); a barrier holds the original master thread until all the implicit tasks are finished. We have simply added a way to create a task explicitly for the team to execute. Every part of an OpenMP program is part of one task or another.

Task construct syntax:

   #pragma omp task [clause [[,] clause] ...]
      structured-block

where clause can be one of: if (expression), untied, shared (list), private (list), firstprivate (list), default(shared | none).

The if clause: when the if clause argument is false, the task is...
38. ...the parent task is suspended until the children tasks complete. Task switching: certain constructs have task scheduling points at defined locations within them. When a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (this is called "task switching"); it can then return to the original task and resume.

   #pragma omp single
   {
      for (i = 0; i < ONEZILLION; i++)
         #pragma omp task
            process(item[i]);
   }

Too many tasks are generated in an eye-blink, so the generating task will have to suspend for a while. With task switching, the executing thread can: execute an already-generated task (draining the "task pool"); or dive into the encountered task (which could be very cache-friendly).

Thread switching:

   #pragma omp single
   {
      #pragma omp task untied
      for (i = 0; i < ONEZILLION; i++)
         #pragma omp task
            process(item[i]);
   }

Eventually, too many tasks are generated; the generating task is suspended, and the executing thread switches to a long and boring task; the other threads get rid of all the already-generated tasks and start starving. With thread switching, the generating task can be resumed by a different thread, and the starvation is over. This is too strange to be the default: the programmer is responsible (the untied clause).

Dealing with taskprivate data: the taskprivate directive was removed from OpenMP 3.0 (too expensive to implement). Restrictions on task scheduling allow threadprivate data to be used: the user can avoid thread switching with tied tasks, and the task scheduling points are well defined.

Conclusions on tasks: an enormous...
39. ...the task is defined.

Intel compiler setup (on my laptop): launch the software development environment; cd to the directory that holds your source code; build the software for program foo.c; set the number-of-threads environment variable; run your program. (To get rid of the pwd on the prompt, use the prompt command.)

Visual Studio setup: start a new project; select "Win32 console project"; set the name and path; on the next panel, click "next" instead of "finish" so you can select an empty project on the following panel; drag and drop your source file into the source folder in the Visual Studio solution explorer; activate OpenMP; set the number of threads inside the program; build the project; run "without debug" from the debug menu.

Post-mortem: it seemed to go very well. We had over 50 people who stuck it out throughout the day. Most people brought their laptops (only 7 loaner laptops were used), and of those with laptops, most had preloaded an OS. The chaos at the beginning was much less than I expected; I had fears of an hour or longer to get everyone set up, but thanks to PGI providing a single key in a temp file, we were able to get everyone installed in short order. Having a team of 4 (two speakers and two assistants) worked well; it would have been messier without a hardcore compiler person such as Larry: with dozens of different configurations, he had to do some serious troubleshooting to get the most difficult cases up and running. The exercises used early in the course were good.
40. Splitting up pathways through the code between threads within a team is called worksharing.

The loop worksharing construct splits up loop iterations among the threads in a team:

   #pragma omp parallel
   {
      #pragma omp for
      for (I = 0; I < N; I++) {
         NEAT_STUFF(I);
      }
   }

The loop construct name is "for" in C/C++ and "do" in Fortran. The loop variable I is made private to each thread by default; you could do this explicitly with a private(I) clause.

Loop worksharing constructs, a motivating example. Sequential code:

   for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

An OpenMP parallel region (SPMD by hand):

   #pragma omp parallel
   {
      int id, i, Nthrds, istart, iend;
      id = omp_get_thread_num();
      Nthrds = omp_get_num_threads();
      istart = id * N / Nthrds;
      iend = (id + 1) * N / Nthrds;
      if (id == Nthrds - 1) iend = N;
      for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
   }

An OpenMP parallel region and a worksharing for construct:

   #pragma omp parallel
   #pragma omp for
   for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP shortcut: put the parallel and the worksharing directive on the same line. These are equivalent:

   double res[MAX]; int i;              double res[MAX]; int i;
   #pragma omp parallel                 #pragma omp parallel for
   {                                    for (i = 0; i < MAX; i++) {
      #pragma omp for                      res[i] = huge();
      for (i = 0; i < MAX; i++) {       }
         res[i] = huge();
      }
   }

Working with loops, the basic approach: find compute-intensive loops; make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies; place the appropriate OpenMP directive and test.
41. ...an enormous amount of work by many people. Tasks are tightly integrated into the 3.0 spec; they give a flexible model for irregular parallelism; they provide a balanced solution despite often-conflicting goals; and it appears that performance can be reasonable.

Better support for nested parallelism: per-thread internal control variables allow, for example, calling omp_set_num_threads() inside a parallel region, controlling the team sizes for the next level of parallelism. Library routines determine the depth of nesting, the IDs of parent/grandparent threads, and the team sizes of parent/grandparent teams: omp_get_active_level(), omp_get_ancestor_thread_num(level), omp_get_team_size(level).

Parallel loops: OpenMP 3.0 guarantees that this works, i.e., that the same schedule is used in the two loops:

   !$omp do schedule(static)
   do i = 1, n
      a(i) = ...
   end do
   !$omp end do nowait
   !$omp do schedule(static)
   do i = 1, n
      ... = a(i)
   end do

OpenMP 3.0 loops (continued): allow collapsing of perfectly nested loops; this will form a single loop and then parallelize it:

   !$omp parallel do collapse(2)
   do i = 1, n
      do j = 1, n
         ...
      end do
   end do

Made schedule(runtime) more useful: you can get/set it with library routines (omp_set_schedule(), omp_get_schedule()), and implementations are allowed to implement their own schedule kinds. Added a new schedule kind, AUTO, which gives full freedom to the runtime to determine the scheduling of iterations to threads. Allowed C++ random-access iterators as loop control variables in parallel loops.
