Home

28.1 Locks: The Basic Idea

1. ning on the CPU Hardware support alone cannot solve the problem We ll need OS sup port too Let s now figure out just how that might work 28 13 A Simple Approach Just Yield Baby Hardware support got us pretty far working locks and even as with the case of the ticket lock fairness in lock acquisition However we still have a problem what to do when a context switch occurs in a critical section and threads start to spin endlessly waiting for the interrupted lock holding thread to be run again Our first try is a simple and friendly approach when you are going to spin instead give up the CPU to another thread Or as Al Davis might say just yield baby D91 Figure 28 8 presents the approach In this approach we assume an operating system primitive yield which a thread can call when it wants to give up the CPU and let an other thread run A thread can be in one of three states running ready or blocked yield is simply a system call that moves the caller from the 1 void init 2 flag 0 3 4 5 void lock 6 while TestAndSet amp flag 1 1 7 yield give up the CPU 8 Re 10 void unlock 11 flag 0 12 Figure 28 8 Lock With Test and set And Yield OPERATING SYSTEMS VERSION 0 91 WWW OSTEP ORG 28 14 LOCKS 15 running state to the ready state and thus promotes another thread to running Thus the yielding process essentially deschedules itse
2. relegated to the dustbin of history operating system itself will use interrupt masking to guarantee atom icity when accessing its own data structures or at least to prevent cer tain messy interrupt handling situations from arising This usage makes sense as the trust issue disappears inside the OS which always trusts itself to perform privileged operations anyhow 2014 ARPACI DUSSEAU THREE EASY PIECES LOCKS OPERATING SYSTEMS VERSION 0 91 typedef struct __lock_t int flag lock_t void init lock_t mutex 0 gt lock is available 1 gt held mutex gt flag 0 void lock lock_t mutex while mutex gt flag 1 TEST the flag spin wait do nothing mutex gt flag 1 now SET it void unlock lock_t mutex mutex gt flag 0 Figure 28 1 First Attempt A Simple Flag Test And Set Atomic Exchange Because disabling interrupts does not work on multiple processors system designers started to invent hardware support for locking The earliest multiprocessor systems such as the Burroughs B5000 in the early 1960 s M82 had such support today all systems provide this type of support even for single CPU systems The simplest bit of hardware support to understand is what is known as a test and set instruction also known as atomic exchange To under stand how test and set works let s first try to build a simple lock without it In this failed attempt we us
3. setpark new code m gt guard 0 A different solution could pass the guard into the kernel In that case the kernel could take precautions to atomically release the lock and de queue the running thread Different OS Different Support We have thus far seen one type of support that an OS can provide in order to build a more efficient lock in a thread library Other OS s provide similar support the details vary For example Linux provides a futex which is similar to the Solaris in terface but provides more in kernel functionality Specifically each futex has associated with it a specific physical memory location as well as a per futex in kernel queue Callers can use futex calls described below to sleep and wake as need be Specifically two calls are available The call to fFutex_wait address expected puts the calling thread to sleep assuming the value at address is equal to expected If it is not equal the call returns immediately The call to the routine futex_wake address wakes one thread that is wait ing on the queue The usage of these in Linux is shown in Figure 28 10 This code snippet from lowlevellock h in the nptl library part of the gnu libc library L09 is pretty interesting Basically it uses a single integer to track both whether the lock is held or not the high bit of the integer and the number of waiters on the lock all the other bits Thus if the lock is negative it is held because the high
4. simply checks if the flag is 0 and if so atomically swaps in a 1 thus acquiring the lock Threads that try to acquire the lock while it is held will get stuck spinning until the lock is finally released If you want to see how to really make a C callable x86 version of compare and swap this code sequence might be useful from S05 1 char CompareAndSwap int ptr int old int new 2 unsigned char ret 3 4 Note that sete sets a byte not the word 5 __asm__ __volatile__ 6 lock n 7 cmpxchgl 2 1 n 8 sete 0 n 9 q ret m ptr 10 r new m ptr a old 11 memory 12 return ret 13 Finally as you may have sensed compare and swap is a more power ful instruction than test and set We will make some use of this power in the future when we briefly delve into wait free synchronization H91 However if we just build a simple spin lock with it its behavior is iden tical to the spin lock we analyzed above OPERATING SYSTEMS VERSION 0 91 WWW OSTEP ORG CaONankrone Bee RES 28 10 CaONATRONH Pee apes LOCKS 11 int LoadLinked int ptr return ptr int StoreConditional int ptr int value if no one has updated ptr since the LoadLinked to this address xptr value return 1 success else return 0 failed to update Figure 28 5 Load linked And Store conditional Load Linked and Store Conditional Some platforms provi
5. single processor systems The code would look like this void lock DisablelInterrupts 2E unlock EnableInterrupts Assume we are running on such a single processor system By turn ing off interrupts using some kind of special hardware instruction be fore entering a critical section we ensure that the code inside the critical section will not be interrupted and thus will execute as if it were atomic When we are finished we re enable interrupts again via a hardware in struction and thus the program proceeds as usual The main positive of this approach is its simplicity You certainly don t have to scratch your head too hard to figure out why this works Without interruption a thread can be sure that the code it executes will execute and that no other thread will interfere with it The negatives unfortunately are many First this approach requires us to allow any calling thread to perform a privileged operation turning interrupts on and off and thus trust that this facility is not abused As you already know any time we are required to trust an arbitrary pro gram we are probably in trouble Here the trouble manifests in numer ous ways a greedy program could call 1ock at the beginning of its execution and thus monopolize the processor worse an errant or mali cious program could call 1ock and go into an endless loop In this latter case the OS never regains control of the system and there is only
6. special hardware instructions and even OS support Dekker s algorithm uses just loads and stores assuming they are atomic with respect to each other which was true on early hardware Dekker s approach was later refined by Peterson P81 Once again just loads and stores are used and the idea is to ensure that two threads never enter a critical section at the same time Here is Peterson s algorithm for two threads see if you can understand the code What are the flag and turn variables used for int flaglzi int CUr VOLA ARIC 4 flag 0 flag 1 0 1 gt thread wants to grab lock turn 0 whose turn thread 0 or 1 void lock 4 flag self 1 self thread ID of caller turn 1 self make it other thread s turn while flag l self 1 amp amp turn 1 self spin wait void unlock flag self 0 simply undo your intent For some reason developing locks that work without special hardware support became all the rage for a while giving theory types a lot of prob lems to work on Of course this line of work became quite useless when people realized it is much easier to assume a little hardware support and indeed that support had been around from the earliest days of multipro cessing Further algorithms like the ones above don t work on mod ern hardware due to relaxed memory consistency models thus making them even less useful than they were before Yet more research
7. the most basic requirement providing mutual exclusion WWW OSTEP ORG 28 7 ar one LOCKS Thread 1 Thread 2 call Lock while flag 1 interrupt switch to Thread 2 call Lock while flag 1 flag 1 interrupt switch to Thread 1 flag 1 set flag to 1 too Figure 28 2 Trace No Mutual Exclusion The performance problem which we will address more later on is the fact that the way a thread waits to acquire a lock that is already held it endlessly checks the value of flag a technique known as spin waiting Spin waiting wastes time waiting for another thread to release a lock The waste is exceptionally high on a uniprocessor where the thread that the waiter is waiting for cannot even run at least until a context switch oc curs Thus as we move forward and develop more sophisticated solu tions we should also consider ways to avoid this kind of waste Building A Working Spin Lock While the idea behind the example above is a good one it is not possi ble to implement without some support from the hardware Fortunately some systems provide an instruction to support the creation of simple locks based on this concept This more powerful instruction has differ ent names on SPARC it is the load store unsigned byte instruction 1dstub whereas on x86 it is the atomic exchange instruction xchg but basically does the same thing across platforms and is generally re ferred to as test and s
8. the following flags a bx 1000 bx 1000 this flag sets each thread to loop through the critical 1000 times Watch what happens over time do the threads spend much time spinning waiting for the lock How does the code behave as you add more threads Now examine yield s in which we pretend that a yield instruc tion enables one thread to yield control of the CPU to another re alistically this would be an OS primitive but for the simplicity of simulation we assume there is an instruction that does the task Find a scenario where test and set s wastes cycles spinning but yield s does not How many instructions are saved In what scenarios do these savings arise Finally examine test and test and set s What does this lock do What kind of savings does it introduce as compared to test and set s 2014 ARPACI DUSSEAU THREE EASY PIECES
9. thread enters the lock code also executing the load linked instruction and also getting a 0 and continuing At this point two threads have each executed the load linked and each are about to attempt the store conditional The key feature of these instructions is that only one of these threads will succeed in updating the flag to 1 and thus acquire the lock the second thread to attempt the store conditional will fail because the other thread updated the value of flag between its load linked and store conditional and thus have to try to acquire the lock again In class a few years ago undergraduate student David Capel sug gested a more concise form of the above for those of you who enjoy short circuiting boolean conditionals See if you can figure out why it is equivalent It certainly is shorter void lock lock_t lock while LoadLinked amp lock gt flag StoreConditional amp lock gt flag 1 7 spin Fetch And Add One final hardware primitive is the fetch and add instruction which atomically increments a value while returning the old value at a partic ular address The C pseudocode for the fetch and add instruction looks like this int FetchAndAdd int ptr int old ptr xptr old 1 return old WWW OSTEP ORG CaONanRONH Peeper e pee oe SCaWUanraoanre 28 12 LOCKS 13 typedef struct __lock_t int ticket int turn F lock_t void lock_init lock_t lock lock gt ticket 0 loc
10. 28 1 aR oneR 28 Locks From the introduction to concurrency we saw one of the fundamental problems in concurrent programming we would like to execute a series of instructions atomically but due to the presence of interrupts on a single processor or multiple threads executing on multiple processors concur rently we couldn t In this chapter we thus attack this problem directly with the introduction of something referred to as a lock Programmers annotate source code with locks putting them around critical sections and thus ensure that any such critical section executes as if it were a sin gle atomic instruction Locks The Basic Idea As an example assume our critical section looks like this the canonical update of a shared variable balance balance 1 Of course other critical sections are possible such as adding an ele ment to a linked list or other more complex updates to shared structures but we ll just keep to this simple example for now To use a lock we add some code around the critical section like this lock_t mutex some globally allocated lock mutex lock amp mutex balance balance 1 unlock amp mutex A lock is just a variable and thus to use one you must declare a lock variable of some kind such as mutex above This lock variable or just lock for short holds the state of the lock at any instant in time It is ei ther available or unlocked or free and thus
11. STEP ORG LOCKS 21 MS91 Algorithms for Scalable Synchronization on Shared Memory Multiprocessors John M Mellor Crummey and M L Scott ACM TOCS Volume 9 Issue 1 February 1991 An excellent and thorough survey on different locking algorithms However no operating systems support is used just fancy hardware instructions P81 Myths About the Mutual Exclusion Problem G L Peterson Information Processing Letters 12 3 pages 115 116 1981 Peterson s algorithm introduced here S05 Guide to porting from Solaris to Linux on x86 Ajay Sood April 29 2005 Available http www ibm com developerworks linux library 1l solar S09 OpenSolaris Thread Library Available http src opensolaris org source xref onnv onnv gate usr src lib libc port threads synch c This is also pretty interesting to look at though who knows what will happen to it now that Oracle owns Sun Thanks to Mike Swift for the pointer to the code W09 Load Link Store Conditional Wikipedia entry on said topic as of October 22 2009 http en wikipedia org wiki Load Link Store Conditional Can you believe we referenced wikipedia Pretty lazy no But we found the information there first and it felt wrong not to cite it Further they even listed the instructions for the different architec tures 1dl_1 stl cand ldq_1 stq _c Alpha lwarx stwcx PowerPC 11 sc MIPS and ldrex strex ARM version 6 and above Actually wik
12. ard queue_t q F lock_t void lock_init lock_t m m gt flag 0 m gt guard 0 queue_init m gt q void lock lock_t m while TestAndSet amp m gt guard 1 1 acquire guard lock by spinning if m gt flag 0 m gt flag 1 lock is acquired m gt guard 0 else queue_add m gt q gettid m gt guard 0 park void unlock lock_t x m while TestAndSet amp m gt guard 1 1 acquire guard lock by spinning if queue_empty m gt q m gt flag 0 let go of lock no one wants it else unpark queue_remove m gt q hold lock for next thread m gt guard 0 Figure 28 9 Lock With Queues Test and set Yield And Wakeup might be interrupted while acquiring or releasing the lock and thus cause other threads to spin wait for this one to run again However the time spent spinning is quite limited just a few instructions inside the lock and unlock code instead of the user defined critical section and thus this approach may be reasonable Second you might notice that in Lock when a thread can not ac quire the lock it is already held we are careful to add ourselves to a queue by calling the gettid call to get the thread ID of the current thread set guard to 0 and yield the CPU A question for the reader What would happen if the release of the guard lock came after the park and not before Hint something bad You might also notice th
13. ast as far back to Dahm Locks in the early 1960 s M82 and is now referred to as a two phase lock A two phase lock realizes that spinning can be useful particularly if the lock is about to be released So in the first phase the lock spins for a while hoping that it can acquire the lock However if the lock is not acquired during the first spin phase a sec ond phase is entered where the caller is put to sleep and only woken up when the lock becomes free later The Linux lock above is a form of such a lock but it only spins once a generalization of this could spin in a loop for a fixed amount of time before using futex support to sleep Two phase locks are yet another instance of a hybrid approach where combining two good ideas may indeed yield a better one Of course whether it does depends strongly on many things including the hard ware environment number of threads and other workload details As always making a single general purpose lock good for all possible use cases is quite a challenge WWW OSTEP ORG 28 17 LOCKS 19 Summary The above approach shows how real locks are built these days some hardware support in the form of a more powerful instruction plus some operating system support e g in the form of park and unpark primitives on Solaris or futex on Linux Of course the details differ and the exact code to perform such locking is usually highly tuned Check out the Solaris or Linux code ba
14. bit is set and that bit determines the sign of the integer The code is also interesting because it shows how to optimize for the common case where there is no contention with only one thread acquiring and releasing a lock very little work is done the atomic bit test and set to lock and an atomic add to release the lock See if you can puzzle through the rest of this real world lock to see how it works 2014 ARPACI DUSSEAU THREE EASY PIECES 18 LOCKS 28 16 OPERATING SYSTEMS VERSION 0 91 void mutex_lock int mutex int vi Bit 31 was clear we got the mutex this is the fastpath if atomic_bit_test_set mutex 31 0 return atomic_increment mutex while 1 if atomic_bit_test_set mutex 31 0 atomic_decrement mutex return We have to wait now First make sure the futex value we are monitoring is truly negative i e locked v xmutex if v gt 0 continue futex_wait mutex v void mutex_unlock int mutex x Adding 0x80000000 to the counter results in 0 if and only if there are not other interested threads if atomic_add_zero mutex 0x80000000 return x There are other threads waiting for this mutex wake one of them up x futex_wake mutex Figure 28 10 Linux based Futex Locks Two Phase Locks One final note the Linux approach has the flavor of an old approach that has been used on and off for years going at le
15. cks thus allowing more threads to be in locked code at once a more fine grained approach WWW OSTEP ORG 28 3 28 4 LOCKS Building A Lock By now you should have some understanding of how a lock works from the perspective of a programmer But how should we build a lock What hardware support is needed What OS support It is this set of questions we address in the rest of this chapter The Crux How To BUILD A LOCK How can we build an efficient lock Efficient locks provided mutual exclusion at low cost and also might attain a few other properties we discuss below What hardware support is needed What OS support To build a working lock we will need some help from our old friend the hardware as well as our good pal the OS Over the years a num ber of different hardware primitives have been added to the instruction sets of various computer architectures while we won t study how these instructions are implemented that after all is the topic of a computer architecture class we will study how to use them in order to build a mu tual exclusion primitive like a lock We will also study how the OS gets involved to complete the picture and enable us to build a sophisticated locking library Evaluating Locks Before building any locks we should first understand what our goals are and thus we ask how to evaluate the efficacy of a particular lock implementation To evaluate whether a lock works and works well we sh
16. de a pair of instructions that work in concert to help build critical sections On the MIPS architecture H93 for example the load linked and store conditional instructions can be used in tandem to build locks and other concurrent structures The C pseudocode for these instructions is as found in Figure 28 5 Alpha PowerPC and ARM provide similar instructions W09 The load linked operates much like a typical load instruction and sim ply fetches a value from memory and places it in a register The key differ ence comes with the store conditional which only succeeds and updates the value stored at the address just load linked from if no intervening store to the address has taken place In the case of success the store conditional returns 1 and updates the value at ptr to value if it fails the value at pt r is not updated and 0 is returned Asa challenge to yourself try thinking about how to build a lock using load linked and store conditional Then when you are finished look at the code below which provides one simple solution Do it The solution is in Figure 28 6 The lock code is the only interesting piece First a thread spins waiting for the flag to be set to 0 and thus indicate the lock is not held Once so the thread tries to acquire the lock via the store conditional if it succeeds the thread has atomically changed the flag s value to 1 and thus can proceed into the critical section void lock lock_t lock
17. different interrupt frequencies what values lead to a bad outcomes Which lead to good outcomes Now let s look at the program test and set s First try to un derstand the code which uses the xchg instruction to build a sim ple locking primitive How is the lock acquire written How about lock release Now run the code changing the value of the interrupt interval i again and making sure to loop for a number of times Does the code always work as expected Does it sometimes lead to an ineffi cient use of the CPU How could you quantify that Use the P flag to generate specific tests of the locking code For example run a schedule that grabs the lock in the first thread but then tries to acquire it in the second Does the right thing happen What else should you test Now let s look at the code in peterson s which implements Pe terson s algorithm mentioned in a sidebar in the text Study the code and see if you can make sense of it Now run the code with different values of i What kinds of differ ent behavior do you see Can you control the scheduling with the P flag to prove that the code works What are the different cases you should show hold Think about mutual exclusion and deadlock avoidance WWW OSTEP ORG LOCKS 23 11 12 13 14 15 Now study the code for the ticket lock in ticket s Does it match the code in the chapter Now run the code with
18. e a simple flag variable to denote whether the lock is held or not In this first attempt Figure 28 1 the idea is quite simple use a simple variable to indicate whether some thread has possession of a lock The first thread that enters the critical section will call lock which tests whether the flag is equal to 1 in this case it is not and then sets the flag to 1 to indicate that the thread now holds the lock When finished with the critical section the thread calls unlock and clears the flag thus indicating that the lock is no longer held If another thread happens to call lock while that first thread is in the critical section it will simply spin wait in the while loop for that thread to call unlock and clear the flag Once that first thread does so the waiting thread will fall out of the while loop set the flag to 1 for itself and proceed into the critical section Unfortunately the code has two problems one of correctness and an other of performance The correctness problem is simple to see once you get used to thinking about concurrent programming Imagine the code interleaving in Figure 28 2 page 7 assume f1ag 0 to begin As you can see from this interleaving with timely untimely inter rupts we can easily produce a case where both threads set the flag to 1 and both threads are thus able to enter the critical section This behavior is what professionals call bad we have obviously failed to provide
19. e interesting fact that the flag does not get set back to 0 when another thread gets woken up Why is this Well it is not an error but rather a necessity When a thread is woken up it will be as if it is returning from park however it does not hold the guard at that point in the code and thus cannot even try to set the flag to 1 Thus we just pass the lock directly from the thread releasing the lock to the next thread acquiring it flag is not set to 0 in between WWW OSTEP ORG 28 15 LOCKS 17 Finally you might notice the perceived race condition in the solution just before the call to park With just the wrong timing a thread will be about to park assuming that it should sleep until the lock is no longer held A switch at that time to another thread say a thread holding the lock could lead to trouble for example if that thread then released the lock The subsequent park by the first thread would then sleep forever potentially This problem is sometimes called the wakeup waiting race to avoid it we need to do some extra work Solaris solves this problem by adding a third system call setpark By calling this routine a thread can indicate it is about to park If it then happens to be interrupted and another thread calls unpark before park is actually called the subsequent park returns immediately instead of sleep ing The code modification inside of lock is quite small queue_add m gt q gettid
20. et We define what the test and set instruction does with the following C code snippet int TestAndSet int old_ptr int new int old xold_ptr fetch old value at old_ptr xold_ptr new store new into old_ptr return old return the old value What the test and set instruction does is as follows It returns the old value pointed to by the ptr and simultaneously updates said value to new The key of course is that this sequence of operations is performed atomically The reason it is called test and set is that it enables you to test the old value which is what is returned while simultaneously setting the memory location to a new value as it turns out this slightly more powerful instruction is enough to build a simple spin lock as we now examine in Figure 28 3 Or better yet figure it out first yourself Let s make sure we understand why this lock works Imagine first the case where a thread calls lock and no other thread currently holds the lock thus f Lag should be 0 When the thread calls Test AndSet flag 2014 ARPACI DUSSEAU THREE EASY PIECES LOCKS OPERATING SYSTEMS VERSION 0 91 typedef struct __lock_t int flag lock_t void init lock_t xlock 0 indicates that lock is available 1 that it is held lock gt flag 0 void lock lock_t lock while TestAndSet amp lock gt flag 1 1 spin wait do nothing void unlock lock_t loc
21. ferred Tip THINK ABOUT CONCURRENCY AS MALICIOUS SCHEDULER From this example you might get a sense of the approach you need to take to understand concurrent execution What you should try to do is to pretend you are a malicious scheduler one that interrupts threads at the most inopportune of times in order to foil their feeble attempts at building synchronization primitives What a mean scheduler you are Although the exact sequence of interrupts may be improbable it is possible and that is all we need to demonstrate that a particular approach does not work It can be useful to think maliciously at least sometimes WWW OSTEP ORG 28 8 28 9 LOCKS to as a spin lock It is the simplest type of lock to build and simply spins using CPU cycles until the lock becomes available To work correctly on a single processor it requires a preemptive scheduler i e one that will interrupt a thread via a timer in order to run a different thread from time to time Without preemption spin locks don t make much sense on a single CPU as a thread spinning on a CPU will never relinquish it Evaluating Spin Locks Given our basic spin lock we can now evaluate how effective it is along our previously described axes The most important aspect of a lock is correctness does it provide mutual exclusion The answer here is yes the spin lock only allows a single thread to enter the critical section at a time Thus we have a correct lock T
22. he next axis is fairness How fair is a spin lock to a waiting thread Can you guarantee that a waiting thread will ever enter the critical sec tion The answer here unfortunately is bad news spin locks don t pro vide any fairness guarantees Indeed a thread spinning may spin forever under contention Spin locks are not fair and may lead to starvation The final axis is performance What are the costs of using a spin lock To analyze this more carefully we suggest thinking about a few different cases In the first imagine threads competing for the lock on a single processor in the second consider the threads as spread out across many processors For spin locks in the single CPU case performance overheads can be quite painful imagine the case where the thread holding the lock is pre empted within a critical section The scheduler might then run every other thread imagine there are N 1 others each of which tries to ac quire the lock In this case each of those threads will spin for the duration of a time slice before giving up the CPU a waste of CPU cycles However on multiple CPUs spin locks work reasonably well if the number of threads roughly equals the number of CPUs The thinking goes as follows imagine Thread A on CPU 1 and Thread B on CPU 2 both contending for a lock If Thread A CPU 1 grabs the lock and then Thread B tries to B will spin on CPU 2 However presumably the crit ical section is short and thus
23. ipedia is pretty amazing so don t be so harsh ok WG00 The SPARC Architecture Manual Version 9 David L Weaver and Tom Germond September 2000 SPARC International San Jose California Available http www sparc org standards SPARCV9 pdf Also see http developers sun com solaris articles atomic_sparc for some more details on Sparc atomic operations THREE 2014 ARPACI DUSSEAU EASY PIECES 22 LOCKS OPERATING SYSTEMS VERSION 0 91 Homework This program x86 py allows you to see how different thread inter leavings either cause or avoid race conditions See the README for de tails on how the program works and its basic inputs then answer the questions below Questions J 10 First let s get ready to run x86 py with the flag p flag s This code implements locking with a single memory flag Can you understand what the assembly code is trying to do When you run with the defaults does flag s work as expected Does it produce the correct result Use the M and R flags to trace variables and registers and turn on c to see their values Can you predict what value will end up in flag as the code runs Change the value of the register bx with the a flag e g a bx 2 bx 2 if you are running just two threads What does the code do How does it change your answer for the question above Set bx to a high value for each thread and then use the i flag to generate
24. k lock gt flag 0 Figure 28 3 A Simple Spin Lock Using Test and set 1 the routine will return the old value of flag which is 0 thus the call ing thread which is testing the value of flag will not get caught spinning in the while loop and will acquire the lock The thread will also atomi cally set the value to 1 thus indicating that the lock is now held When the thread is finished with its critical section it calls unlock to set the flag back to zero The second case we can imagine arises when one thread already has the lock held i e fag is 1 In this case this thread will call Lock and then call TestAndSet flag 1 as well This time TestAndSet will return the old value at flag which is 1 because the lock is held while simultaneously setting it to 1 again As long as the lock is held by another thread TestAndSet will repeatedly return 1 and thus this thread will spin and spin until the lock is finally released When the flag is finally set to 0 by some other thread this thread will call TestAndSet again which will now return 0 while atomically setting the value to 1 and thus acquire the lock and enter the critical section By making both the test of the old lock value and set of the new value a single atomic operation we ensure that only one thread acquires the lock And that s how to build a working mutual exclusion primitive You may also now understand why this type of lock is usually re
25. k gt turn 0 void lock lock_t xlock int myturn FetchAndAdd amp lock gt ticket while lock gt turn myturn spin void unlock lock_t lock lock gt turn lock gt turn 1 Figure 28 7 Ticket Locks In this example we ll use fetch and add to build a more interesting ticket lock as introduced by Mellor Crummey and Scott MS91 The lock and unlock code looks like what you see in Figure 28 7 Instead of a single value this solution uses a ticket and turn variable in combination to build a lock The basic operation is pretty simple when a thread wishes to acquire a lock it first does an atomic fetch and add on the ticket value that value is now considered this thread s turn myturn The globally shared 1ock gt turn is then used to determine which thread s turn it is when myturn turn fora given thread it is that thread s turn to enter the critical section Unlock is accomplished simply by incrementing the turn such that the next waiting thread if there is one can now enter the critical section Note one important difference with this solution versus our previous attempts it ensures progress for all threads Once a thread is assigned its ticket value it will be scheduled at some point in the future once those in front of it have passed through the critical section and released the lock In our previous attempts no such guarantee existed a thread spinning on test and set for e
26. lf Think about the example with two threads on one CPU in this case our yield based approach works quite well If a thread happens to call lock and find a lock held it will simply yield the CPU and thus the other thread will run and finish its critical section In this simple case the yielding approach works well Let us now consider the case where there are many threads say 100 contending for a lock repeatedly In this case if one thread acquires the lock and is preempted before releasing it the other 99 will each call lock find the lock held and yield the CPU Assuming some kind of round robin scheduler each of the 99 will execute this run and yield pattern before the thread holding the lock gets to run again While better than our spinning approach which would waste 99 time slices spinning this approach is still costly the cost of a context switch can be substantial and there is thus plenty of waste Worse we have not tackled the starvation problem at all A thread may get caught in an endless yield loop while other threads repeatedly enter and exit the critical section We clearly will need an approach that addresses this problem directly Using Queues Sleeping Instead Of Spinning The real problem with our previous approaches is that they leave too much to chance The scheduler determines which thread runs next if the scheduler makes a bad choice a thread runs that must either spin waiting for the lock our first a
27. no thread holds the lock or acquired or locked or held and thus exactly one thread holds the lock and presumably is in a critical section We could store other information LOCKS 28 2 abr one OPERATING SYSTEMS VERSION 0 91 in the data type as well such as which thread holds the lock or a queue for ordering lock acquisition but information like that is hidden from the user of the lock The semantics of the Lock and unlock routines are simple Call ing the routine Lock tries to acquire the lock if no other thread holds the lock i e it is free the thread will acquire the lock and enter the crit ical section this thread is sometimes said to be the owner of the lock If another thread then calls lock on that same lock variable mutex in this example it will not return while the lock is held by another thread in this way other threads are prevented from entering the critical section while the first thread that holds the lock is in there Once the owner of the lock calls unlock the lock is now available free again If no other threads are waiting for the lock i e no other thread has called lock and is stuck therein the state of the lock is simply changed to free If there are waiting threads stuck in lock one of them will eventually notice or be informed of this change of the lock s state acquire the lock and enter the critical section Locks provide some minimal amount of control
28. one recourse restart the system Using interrupt disabling as a general purpose synchronization solution requires too much trust in applications Second the approach does not work on multiprocessors If multiple threads are running on different CPUs and each try to enter the same critical section it does not matter whether interrupts are disabled threads will be able to run on other processors and thus could enter the critical section As multiprocessors are now commonplace our general solution will have to do better than this Third turning off interrupts for extended periods of time can lead to interrupts becoming lost which can lead to serious systems problems Imagine for example if the CPU missed the fact that a disk device has finished a read request How will the OS know to wake the process wait ing for said read Finally and probably least important this approach can be inefficient Compared to normal instruction execution code that masks or unmasks interrupts tends to be executed slowly by modern CPUs For these reasons turning off interrupts is only used in limited con texts as a mutual exclusion primitive For example in some cases an WWW OSTEP ORG LOCKS ASIDE DEKKER S AND PETERSON S ALGORITHMS In the 1960 s Dijkstra posed the concurrency problem to his friends and one of them a mathematician named Theodorus Jozef Dekker came up with a solution D68 Unlike the solutions we discuss here which use
29. ould first establish some basic criteria The first is whether the lock does its basic task which is to provide mutual exclusion Basically does the lock work preventing multiple threads from entering a critical section The second is fairness Does each thread contending for the lock get a fair shot at acquiring it once it is free Another way to look at this is by examining the more extreme case does any thread contending for the lock starve while doing so thus never obtaining it The final criterion is performance specifically the time overheads added by using the lock There are a few different cases that are worth con sidering here One is the case of no contention when a single thread is running and grabs and releases the lock what is the overhead of do ing so Another is the case where multiple threads are contending for the lock on a single CPU in this case are there performance concerns Fi nally how does the lock perform when there are multiple CPUs involved and threads on each contending for the lock By comparing these differ ent scenarios we can better understand the performance impact of using various locking techniques as described below 2014 ARPACI DUSSEAU THREE EASY PIECES LOCKS 28 5 AnkoONB OPERATING SYSTEMS VERSION 0 91 Controlling Interrupts One of the earliest solutions used to provide mutual exclusion was to disable interrupts for critical sections this solution was invented for
30. over scheduling to programmers In general we view threads as entities created by the pro grammer but scheduled by the OS in any fashion that the OS chooses Locks yield some of that control back to the programmer by putting a lock around a section of code the programmer can guarantee that no more than a single thread can ever be active within that code Thus locks help transform the chaos that is traditional OS scheduling into a more controlled activity Pthread Locks The name that the POSIX library uses for a lock is a mutex as it is used to provide mutual exclusion between threads i e if one thread is in the critical section it excludes the others from entering until it has completed the section Thus when you see the following POSIX threads code you should understand that it is doing the same thing as above we again use our wrappers that check for errors upon lock and unlock pthread_mutex_t lock PTHREAD_MUTEX_INITIALIZER Pthread_mutex_lock amp lock wrapper for pthread_mutex_lock balance balance 1 Pthread_mutex_unlock amp lock You might also notice here that the POSIX version passes a variable to lock and unlock as we may be using different locks to protect different variables Doing so can increase concurrency instead of one big lock that is used any time any critical section is accessed a coarse grained locking strategy one will often protect different data and data structures with different lo
31. pproach or yield the CPU immediately our second approach Either way there is potential for waste and no prevention of starvation Thus we must explicitly exert some control over which thread next gets to acquire the lock after the current holder releases it To do this we will need a little more OS support as well as a queue to keep track of which threads are waiting to acquire the lock For simplicity we will use the support provided by Solaris in terms of two calls park to puta calling thread to sleep and unpark threadID to wake a particular thread as designated by threadID These two rou tines can be used in tandem to build a lock that puts a caller to sleep if it tries to acquire a held lock and wakes it when the lock is free Let s look at the code in Figure 28 9 to understand one possible use of such primitives We do a couple of interesting things in this example First we combine the old test and set idea with an explicit queue of lock waiters to make a more efficient lock Second we use a queue to help control who gets the lock next and thus avoid starvation You might notice how the guard is used Figure 28 9 page 16 basi cally as a spin lock around the flag and queue manipulations the lock is using This approach thus doesn t avoid spin waiting entirely a thread 2014 ARPACI DUSSEAU THREE EASY PIECES 16 LOCKS OPERATING SYSTEMS VERSION 0 91 typedef struct __lock_t int flag int gu
32. ses if you want to see more details they are a fascinating read L09 S09 Also see David et al s excellent work for a comparison of locking strategies on modern multiprocessors D 13 2014 ARPACI DUSSEAU THREE EASY PIECES 20 LOCKS OPERATING SYSTEMS VERSION 0 91 References D91 Just Win Baby Al Davis and His Raiders Glenn Dickey Harcourt 1991 There is even an undoubtedly bad book about Al Davis and his famous just win quote Or we suppose the book is more about Al Davis and the Raiders and maybe not just the quote Read the book to find out D 13 Everything You Always Wanted to Know about Synchronization but Were Afraid to Ask Tudor David Rachid Guerraoui Vasileios Trigonakis SOSP 13 Nemacolin Woodlands Resort Pennsylvania November 2013 An excellent recent paper comparing many different ways to build locks using hardware primitives A great read to see how many ideas over the years work on modern hardware D68 Cooperating sequential processes Edsger W Dijkstra 1968 Available http www cs utexas edu users EWD ewd01xx EWD123 PDF One of the early seminal papers in the area Discusses how Dijkstra posed the original concurrency problem and Dekker s solution H93 MIPS R4000 Microprocessor User s Manual Joe Heinrich Prentice Hall June 1993 Available http cag csail mit edu raw documents R4400_Uman_book_Ed2 pdf H91 Wait free Synchroni
33. soon the lock becomes available and is ac quired by Thread B Spinning to wait for a lock held on another processor doesn t waste many cycles in this case and thus can be effective Compare And Swap Another hardware primitive that some systems provide is known as the compare and swap instruction as it is called on SPARC for exam ple or compare and exchange as it called on x86 The C pseudocode for this single instruction is found in Figure 28 4 The basic idea is for compare and swap to test whether the value at the 2014 ARPACI DUSSEAU THREE EASY PIECES 10 LOCKS 1 int CompareAndSwap int ptr int expected int new 2 int actual ptr 3 if actual expected 4 xptr new 5 return actual 6 Figure 28 4 Compare and swap address specified by pt r is equal to expected if so update the memory location pointed to by ptr with the new value If not do nothing In either case return the actual value at that memory location thus allowing the code calling compare and swap to know whether it succeeded or not With the compare and swap instruction we can build a lock in a man ner quite similar to that with test and set For example we could just replace the lock routine above with the following 1 void lock lock_t xlock 2 while CompareAndSwap amp lock gt flag 0 1 1 3 spin 4 The rest of the code is the same as the test and set example above This code works quite similarly it
34. while 1 while LoadLinked amp lock gt flag 1 spin until it s zero if StoreConditional amp lock gt flag 1 1 return if set it to l was a success all done otherwise try it all over again void unlock lock_t lock lock gt flag 0 Figure 28 6 Using LL SC To Build A Lock 2014 ARPACI DUSSEAU THREE EASY PIECES 12 LOCKS RON 28 11 ar one OPERATING SYSTEMS VERSION 0 91 TIP LESS CODE IS BETTER CODE LAUER S LAW Programmers tend to brag about how much code they wrote to do some thing Doing so is fundamentally broken What one should brag about rather is how little code one wrote to accomplish a given task Short concise code is always preferred it is likely easier to understand and has fewer bugs As Hugh Lauer said when discussing the construction of the Pilot operating system If the same people had twice as much time they could produce as good of a system in half the code L81 We ll call this Lauer s Law and it is well worth remembering So next time you re bragging about how much code you wrote to finish the assignment think again or better yet go back rewrite and make the code as clear and con cise as possible Note how failure of the store conditional might arise One thread calls lock and executes the load linked returning 0 as the lock is not held Before it can attempt the store conditional it is interrupted and another
35. xample could spin forever even as other threads acquire and release the lock Too Much Spinning What Now Our simple hardware based locks are simple only a few lines of code and they work you could even prove that if you d like to by writing some code which are two excellent properties of any system or code However in some cases these solutions can be quite inefficient Imagine you are running two threads on a single processor Now imagine that one thread thread 0 is in a critical section and thus has a lock held and unfortunately gets interrupted The second thread thread 1 now tries to acquire the lock but finds that it is held Thus it begins to spin And spin 2014 ARPACI DUSSEAU THREE EASY PIECES 14 LOCKS Then it spins some more And finally a timer interrupt goes off thread 0 is run again which releases the lock and finally the next time it runs say thread 1 won t have to spin so much and will be able to acquire the lock Thus any time a thread gets caught spinning in a situation like this it wastes an entire time slice doing nothing but checking a value that isn t going to change The problem gets worse with N threads contending for a lock N 1 time slices may be wasted in a similar manner simply spinning and waiting for a single thread to release the lock And thus our next problem THE CRUX How TO AVOID SPINNING How can we develop a lock that doesn t needlessly waste time spin
36. zation Maurice Herlihy ACM Transactions on Programming Languages and Systems TOPLAS Volume 13 Issue 1 January 1991 A landmark paper introducing a different approach to building concurrent data structures However because of the complexity involved many of these ideas have been slow to gain acceptance in deployed systems L81 Observations on the Development of an Operating System Hugh Lauer SOSP 81 Pacific Grove California December 1981 A must read retrospective about the development of the Pilot OS an early PC operating system Fun and full of insights L09 glibc 2 9 include Linux pthreads implementation Available http ftp gnu org gnu glibe In particular take a look at the nptl subdirectory where you will find most of the pthread support in Linux today M82 The Architecture of the Burroughs B5000 20 Years Later and Still Ahead of the Times Alastair J W Mayer 1982 www ajwm net amayer papers B5000 html From the paper One particularly useful instruction is the RDLK read lock It is an indivisible operation which reads from and writes into a memory location RDLK is thus an early test and set primitive if not the earliest Some credit here goes to an engineer named Dave Dahm who apparently invented a number of these things for the Burroughs systems including a form of spin locks called Buzz Locks as well as a two phase lock eponymously called Dahm Locks WWW O

28.1 Locks: The Basic Idea

Contents

Download Pdf Manuals

Related Search

Related Contents