Is Parallel Programming Hard [07]: Locking
Claude generated this markdown from the original .tex source; the early parts came out fine, but the later parts are a mess.
Locking
Locking is the worst general-purpose synchronization mechanism except for all those other mechanisms that have been tried from time to time.
— With apologies to the memory of Winston Churchill and to whoever he was quoting
In recent concurrency research, locking often plays the role of villain. Locking stands accused of inciting deadlocks, convoying, starvation, unfairness, data races, and all manner of other concurrency sins. Interestingly enough, the role of workhorse in production-quality shared-memory parallel software is also played by locking. This chapter will look into this dichotomy between villain and hero, as fancifully depicted in Section Villain or Slob? and Section Workhorse or Hero?.
There are a number of reasons behind this Jekyll-and-Hyde dichotomy:
Many of locking’s sins have pragmatic design solutions that work well in most cases, for example:
Use of lock hierarchies to avoid deadlock.
Deadlock-detection tools, for example, the Linux kernel’s lockdep facility [JonathanCorbet2006lockdep].
Locking-friendly data structures, such as arrays, hash tables, and radix trees, which will be covered in Section Data Structures.
Some of locking’s sins are problems only at high levels of contention, levels reached only by poorly designed programs.
Some of locking’s sins are avoided by using other synchronization mechanisms in concert with locking. These other mechanisms include statistical counters (see Section Counting), reference counters (see Section Reference Counting), hazard pointers (see Section Hazard Pointers), sequence-locking readers (see Section Sequence Locks), RCU (see Section Read-Copy Update (RCU)), and simple non-blocking data structures (see Section Non-Blocking Synchronization).
Until quite recently, almost all large shared-memory parallel programs were developed in secret, so that it was not easy to learn of these pragmatic solutions.
Locking works extremely well for some software artifacts and extremely poorly for others. Developers who have worked on artifacts for which locking works well can be expected to have a much more positive opinion of locking than those who have worked on artifacts for which locking works poorly, as will be discussed in Section Hero or Villain?.
All good stories need a villain, and locking has a long and honorable history serving as a research-paper whipping boy.
Quick Quiz: Just how can serving as a whipping boy be considered to be in any way honorable???
Answer: The reason locking serves as a research-paper whipping boy is because it is heavily used in practice. In contrast, if no one used or cared about locking, most research papers would not bother even mentioning it.
This chapter will give an overview of a number of ways to avoid locking’s more serious sins.
Staying Alive
I work to stay alive.
— Bette Davis
Given that locking stands accused of deadlock and starvation, one important concern for shared-memory parallel developers is simply staying alive. The following sections therefore cover deadlock, livelock, starvation, unfairness, and inefficiency.
Deadlock
Deadlock occurs when each member of a group of threads is holding at least one lock while at the same time waiting on a lock held by a member of that same group. This happens even in groups containing a single thread when that thread attempts to acquire a non-recursive lock that it already holds. Deadlock can therefore occur even given but one thread and one lock!
Without some sort of external intervention, deadlock is forever. No thread can acquire the lock it is waiting on until that lock is released by the thread holding it, but the thread holding it cannot release it until the holding thread acquires the lock that it is in turn waiting on.
We can create a directed-graph representation of a deadlock scenario with nodes for threads and locks, as shown in Section Deadlock Cycle. An arrow from a lock to a thread indicates that the thread holds the lock, for example, Thread B holds Locks 2 and 4. An arrow from a thread to a lock indicates that the thread is waiting on the lock, for example, Thread B is waiting on Lock 3.
A deadlock scenario will always contain at least one deadlock cycle. In Section Deadlock Cycle, this cycle is Thread B, Lock 3, Thread C, Lock 4, and back to Thread B.
Quick Quiz: But the definition of lock-based deadlock only said that each thread was holding at least one lock and waiting on another lock that was held by some thread. How do you know that there is a cycle?
Answer: Suppose that there is no cycle in the graph. We would then have a directed acyclic graph (DAG), which would have at least one leaf node.
If this leaf node was a lock, then we would have a thread that was waiting on a lock that wasn’t held by any thread, counter to the definition. In this case the thread would immediately acquire the lock.
On the other hand, if this leaf node was a thread, then we would have a thread that was not waiting on any lock, again counter to the definition. And in this case, the thread would either be running or be blocked on something that is not a lock. In the first case, in the absence of infinite-loop bugs, the thread will eventually release the lock. In the second case, in the absence of a failure-to-wake bug, the thread will eventually wake up and release the lock.1
Therefore, given this definition of lock-based deadlock, there must be a cycle in the corresponding graph.
Although there are some software environments such as database systems that can recover from an existing deadlock, this approach requires either that one of the threads be killed or that a lock be forcibly stolen from one of the threads. This killing and forcible stealing works well for transactions, but is often problematic for kernel and application-level use of locking: Dealing with the resulting partially updated structures can be extremely complex, hazardous, and error-prone.
Therefore, kernels and applications should instead avoid deadlocks. Deadlock-avoidance strategies include locking hierarchies (Section Locking Hierarchies), local locking hierarchies (Section Local Locking Hierarchies), layered locking hierarchies (Section Layered Locking Hierarchies), temporal locking hierarchies (Section Temporal Locking Hierarchies), strategies for dealing with APIs containing pointers to locks (Section Locking Hierarchies and Pointers to Locks), conditional locking (Section Conditional Locking), acquiring all needed locks first (Section Acquire Needed Locks First), single-lock-at-a-time designs (Section Single-Lock-at-a-Time Designs), and strategies for signal/interrupt handlers (Section Signal/Interrupt Handlers). Although there is no deadlock-avoidance strategy that works perfectly for all situations, there is a good selection of tools to choose from.
Locking Hierarchies
Locking hierarchies order the locks and prohibit acquiring locks out of order. In Section Deadlock Cycle, we might order the locks numerically, thus forbidding a thread from acquiring a given lock if it already holds a lock with the same or a higher number. Thread B has violated this hierarchy because it is attempting to acquire Lock 3 while holding Lock 4. This violation permitted the deadlock to occur.
Again, to apply a locking hierarchy, order the locks and prohibit out-of-order lock acquisition. For different types of locks, it is helpful to have a carefully considered hierarchy from one type to the next. For many instances of the same type of lock, for example, a per-node lock in a search tree, the traditional approach is to carry out lock acquisition in order of the addresses of the locks to be acquired. Either way, in a large program, it is wise to use tools such as the Linux-kernel ‘lockdep‘ [JonathanCorbet2006lockdep] to enforce your locking hierarchy.
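For example, a function that must hold two locks of the same type might acquire them in address order. The following is a minimal sketch of this approach; the ‘account‘ structure and ‘xfer()‘ function are illustrative assumptions rather than code from the original text:

```c
#include <pthread.h>
#include <stdint.h>

struct account {
	pthread_mutex_t lock;
	long balance;
};

/* Acquire two per-account locks in address order, so that all threads
 * follow the same lock hierarchy regardless of argument order. */
static void lock_pair(struct account *a1, struct account *a2)
{
	if ((uintptr_t)a1 < (uintptr_t)a2) {
		pthread_mutex_lock(&a1->lock);
		pthread_mutex_lock(&a2->lock);
	} else {
		pthread_mutex_lock(&a2->lock);
		pthread_mutex_lock(&a1->lock);
	}
}

/* Assumes from != to. */
void xfer(struct account *from, struct account *to, long amount)
{
	lock_pair(from, to);
	from->balance -= amount;
	to->balance += amount;
	pthread_mutex_unlock(&from->lock);
	pthread_mutex_unlock(&to->lock);
}
```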
Local Locking Hierarchies
However, the global nature of locking hierarchies makes them difficult to apply to library functions. After all, when a program using a given library function has not yet been written, how can the poor library-function implementor possibly follow the yet-to-be-defined locking hierarchy?
One special (but common) case is when the library function does not invoke any of the caller’s code. In this case, the caller’s locks will never be acquired while holding any of the library’s locks, so that there cannot be a deadlock cycle containing locks from both the library and the caller.
Quick Quiz: Are there any exceptions to this rule, so that there really could be a deadlock cycle containing locks from both the library and the caller, even given that the library code never invokes any of the caller’s functions?
Answer: Indeed there are! Here are a few of them:
If one of the library function’s arguments is a pointer to a lock that this library function acquires, and if the library function holds one of its locks while acquiring the caller’s lock, then we could have a deadlock cycle involving both caller and library locks.
If one of the library functions returns a pointer to a lock that is acquired by the caller, and if the caller acquires one of its locks while holding the library’s lock, we could again have a deadlock cycle involving both caller and library locks.
If one of the library functions acquires a lock and then returns while still holding it, and if the caller acquires one of its locks, we have yet another way to create a deadlock cycle involving both caller and library locks.
If the caller has a signal handler that acquires locks, then the deadlock cycle can involve both caller and library locks. In this case, however, the library’s locks are innocent bystanders in the deadlock cycle. That said, please note that acquiring a lock from within a signal handler is a no-no in many environments—it is not just a bad idea, it is unsupported. But if you absolutely must acquire a lock in a signal handler, be sure to block that signal while holding that same lock in thread context, and also while holding any other locks acquired while that same lock is held.
But suppose that a library function does invoke the caller’s code. For example, ‘qsort()‘ invokes a caller-provided comparison function. Now, normally this comparison function will operate on unchanging local data, so that it need not acquire locks, as shown in Section No qsort() Compare-Function Locking. But maybe someone is crazy enough to sort a collection whose keys are changing, thus requiring that the comparison function acquire locks, which might result in deadlock, as shown in Section Without qsort() Local Locking Hierarchy. How can the library function avoid this deadlock?
The golden rule in this case is “Release all locks before invoking unknown code.” To follow this rule, the ‘qsort()‘ function must release all of its locks before invoking the comparison function. Thus ‘qsort()‘ will not be holding any of its locks while the comparison function acquires any of the caller’s locks, thus avoiding deadlock.
Quick Quiz: But if ‘qsort()‘ releases all its locks before invoking the comparison function, how can it protect against races with other ‘qsort()‘ threads?
Answer: By privatizing the data elements being compared (as discussed in Section Data Ownership) or through use of deferral mechanisms such as reference counting (as discussed in Section Deferred Processing). Or through use of layered locking hierarchies, as described in Section Layered Locking Hierarchies.
On the other hand, changing a key in a list that is currently being sorted is at best rather brave.
To see the benefits of local locking hierarchies, compare Section Without qsort() Local Locking Hierarchy with Section Local Locking Hierarchy for qsort(). In both figures, application functions ‘foo()‘ and ‘bar()‘ invoke ‘qsort()‘ while holding Locks A and B, respectively. Because this is a parallel implementation of ‘qsort()‘, it acquires Lock C. Function ‘foo()‘ passes function ‘cmp()‘ to ‘qsort()‘, and ‘cmp()‘ acquires Lock B. Function ‘bar()‘ passes a simple integer-comparison function (not shown) to ‘qsort()‘, and this simple function does not acquire any locks.
Now, if ‘qsort()‘ holds Lock C while calling ‘cmp()‘ in violation of the golden release-all-locks rule above, as shown in Section Without qsort() Local Locking Hierarchy, deadlock can occur. To see this, suppose that one thread invokes ‘foo()‘ while a second thread concurrently invokes ‘bar()‘. The first thread will acquire Lock A and the second thread will acquire Lock B. If the first thread’s call to ‘qsort()‘ acquires Lock C, then it will be unable to acquire Lock B when it calls ‘cmp()‘. But the first thread holds Lock C, so the second thread’s call to ‘qsort()‘ will be unable to acquire it, and thus unable to release Lock B, resulting in deadlock.
In contrast, if ‘qsort()‘ releases Lock C before invoking the comparison function, which is unknown code from ‘qsort()‘’s perspective, then deadlock is avoided as shown in Section Local Locking Hierarchy for qsort().
If each module releases all locks before invoking unknown code, then deadlock is avoided if each module separately avoids deadlock. This rule therefore greatly simplifies deadlock analysis and greatly improves modularity.
Nevertheless, this golden rule comes with a warning. When you release those locks, any state that they protect is subject to arbitrary changes, changes that are all too easy for the function’s caller to forget, resulting in subtle and difficult-to-reproduce bugs. Because the ‘qsort()‘ comparison function rarely acquires locks, let’s switch to a different example.
Consider the recursive tree iterator in Section Recursive Tree Iterator (‘rec_tree_itr.c‘). The iterator visits every node in the tree, invoking a user’s callback function. The tree lock is released before the invocation and re-acquired after return. This code makes dangerous assumptions:
The number of children of the current node has not changed,
The ancestors stored on the stack by the recursion are still there, and
The visited node itself has not been removed and freed.
A few of these hazards can be encountered if one thread calls ‘tree_add()‘ while another thread releases the tree’s lock to run a callback function.
Quick Quiz: So the iterating thread may or may not observe the added child. What is the big deal?
Answer: There are at least two hazards in this situation.
One is indeed that the number of children may or may not be observed to have changed. While that would be consistent with ‘tree_add()‘ being called either before or after the iterator started, it is better not to leave this to the vagaries of the compiler. A more serious problem is that ‘realloc()‘ may not be able to extend the array in place, causing the heap to free the block used by the iterator and replace it with another block of memory. If the ‘children‘ pointer is not re-read, then the iterating thread will access invalid memory (either freed or reclaimed for another use).
One strategy is to ensure that state is preserved despite the lock being released, for example, by acquiring a reference on a node to prevent it from being freed. Alternatively, the state can be re-initialized once the lock is re-acquired after the callback function returns.
```c
struct node {
	int data;
	int nchildren;
	struct node **children;
};

struct tree {
	spinlock_t s;
	struct node *root;
};

void tree_for_each_rec(struct tree *tr, struct node *nd,
                       void (*callback)(struct node *))
{
	struct node **itr;

	spin_unlock(&tr->s);
	callback(nd);
	spin_lock(&tr->s);

	itr = nd->children;
	for (int i = 0; i < nd->nchildren; i++) {
		tree_for_each_rec(tr, *itr, callback);
		itr++;
	}
}

void tree_for_each(struct tree *tr,
                   void (*callback)(struct node *))
{
	spin_lock(&tr->s);
	tree_for_each_rec(tr, tr->root, callback);
	spin_unlock(&tr->s);
}

void tree_add(struct tree *tr, struct node *parent,
              struct node *new_child)
{
	spin_lock(&tr->s);
	parent->nchildren++;
	parent->children = realloc(parent->children,
	                           sizeof(struct node *) *
	                           parent->nchildren);
	parent->children[parent->nchildren - 1] = new_child;
	spin_unlock(&tr->s);
}
```
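As a rough sketch of the second strategy (re-initializing state once the lock is re-acquired), the variant below re-reads the ‘children‘ pointer and the child count on each pass through the loop rather than caching them, so a concurrent ‘realloc()‘ in ‘tree_add()‘ can no longer leave the iterator pointing at freed memory. This sketch does not address removal and freeing of nodes, which would still require reference counting or one of the deferral mechanisms of Section Deferred Processing:

```c
/* Sketch: re-read per-node state each time the tree lock is re-acquired. */
void tree_for_each_rec_safe(struct tree *tr, struct node *nd,
                            void (*callback)(struct node *))
{
	spin_unlock(&tr->s);
	callback(nd);
	spin_lock(&tr->s);

	/* nd->children may have been realloc()ed while the lock was dropped,
	 * so index through the current pointer instead of a cached copy. */
	for (int i = 0; i < nd->nchildren; i++)
		tree_for_each_rec_safe(tr, nd->children[i], callback);
}
```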
Layered Locking Hierarchies
Unfortunately, it might be infeasible to preserve state on the one hand or to re-initialize it on the other, thus ruling out a local locking hierarchy where all locks are released before invoking unknown code. However, we can instead construct a layered locking hierarchy, as shown in Section Layered Locking Hierarchy for qsort(). Here, the ‘cmp()‘ function uses a new Lock D that is acquired after all of Locks A, B, and C, avoiding deadlock. We therefore have three layers to the global deadlock hierarchy, the first containing Locks A and B, the second containing Lock C, and the third containing Lock D.
```c
struct locked_list {
	spinlock_t s;
	struct cds_list_head h;
};

struct cds_list_head *list_start(struct locked_list *lp)
{
	spin_lock(&lp->s);
	return list_next(lp, &lp->h);
}

struct cds_list_head *list_next(struct locked_list *lp,
                                struct cds_list_head *np)
{
	struct cds_list_head *ret;

	ret = np->next;
	if (ret == &lp->h) {
		spin_unlock(&lp->s);
		ret = NULL;
	}
	return ret;
}
```
Please note that it is not typically possible to mechanically change ‘cmp()‘ to use the new Lock D. Quite the opposite: It is often necessary to make profound design-level modifications. Nevertheless, the effort required for such modifications is normally a small price to pay in order to avoid deadlock. More to the point, this potential deadlock should preferably be detected at design time, before any code has been generated!
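A minimal sketch of such a layered design is shown below; the ‘key_lock‘ (playing the role of Lock D) and the element layout are illustrative assumptions. Because ‘cmp()‘ acquires only this lowest-layer lock, it cannot participate in a deadlock cycle with Locks A, B, or C:

```c
#include <pthread.h>

pthread_mutex_t key_lock = PTHREAD_MUTEX_INITIALIZER; /* "Lock D" */

struct element {
	int key; /* may be updated concurrently, always under key_lock */
};

/* Comparison function passed to qsort() on an array of element pointers. */
int cmp(const void *lp, const void *rp)
{
	const struct element *l = *(const struct element *const *)lp;
	const struct element *r = *(const struct element *const *)rp;
	int ret;

	pthread_mutex_lock(&key_lock);
	ret = (l->key > r->key) - (l->key < r->key);
	pthread_mutex_unlock(&key_lock);
	return ret;
}
```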
For another example where releasing all locks before invoking unknown code is impractical, imagine an iterator over a linked list, as shown in Section Concurrent List Iterator (‘locked_list.c‘). The ‘list_start()‘ function acquires a lock on the list and returns the first element (if there is one), and ‘list_next()‘ either returns a pointer to the next element in the list or releases the lock and returns ‘NULL‘ if the end of the list has been reached.
```c
struct list_ints {
	struct cds_list_head n;
	int a;
};

void list_print(struct locked_list *lp)
{
	struct cds_list_head *np;
	struct list_ints *ip;

	np = list_start(lp);
	while (np != NULL) {
		ip = cds_list_entry(np, struct list_ints, n);
		printf("\t%d\n", ip->a);
		np = list_next(lp, np);
	}
}
```
Section Concurrent List Iterator Usage shows how this list iterator may be used. The ‘list_ints‘ structure holds the single integer in each list element, and ‘list_print()‘ shows how to iterate over the list: ‘list_start()‘ locks the list and fetches a pointer to the first element, ‘cds_list_entry()‘ provides a pointer to the enclosing ‘list_ints‘ structure, ‘printf()‘ prints the corresponding integer, and ‘list_next()‘ moves to the next element. This is quite simple, and it hides all of the locking.
That is, the locking remains hidden as long as the code processing each list element does not itself acquire a lock that is held across some other call to ‘list_start()‘ or ‘list_next()‘, which results in deadlock. We can avoid the deadlock by layering the locking hierarchy to take the list-iterator locking into account.
This layered approach can be extended to an arbitrarily large number of layers, but each added layer increases the complexity of the locking design. Such increases in complexity are particularly inconvenient for some types of object-oriented designs, in which control passes back and forth among a large group of objects in an undisciplined manner.2 This mismatch between the habits of object-oriented design and the need to avoid deadlock is an important reason why parallel programming is perceived by some to be so difficult.
Some alternatives to highly layered locking hierarchies are covered in Section Deferred Processing.
Temporal Locking Hierarchies
One way to avoid deadlock is to defer acquisition of one of the conflicting locks. This approach is used in Linux-kernel RCU, whose ‘call_rcu()‘ function is invoked by the Linux-kernel scheduler while holding its locks. This means that ‘call_rcu()‘ cannot always safely invoke the scheduler to do a wakeup, for example, in order to wake up an RCU kthread in order to start the new grace period that is required by the callback queued by ‘call_rcu()‘.
Quick Quiz: What do you mean “cannot always safely invoke the scheduler”? Either ‘call_rcu()‘ can or cannot safely invoke the scheduler, right?
Answer: It really does depend.
The scheduler locks are always held with interrupts disabled. Therefore, if ‘call_rcu()‘ is invoked with interrupts enabled, no scheduler locks are held, and ‘call_rcu()‘ can safely call into the scheduler. Otherwise, if interrupts are disabled, one of the scheduler locks might be held, so ‘call_rcu()‘ must play it safe and refrain from calling into the scheduler.
However, grace periods last for many milliseconds, so waiting another millisecond before starting a new grace period is not normally a problem. Therefore, if ‘call_rcu()‘ detects a possible deadlock with the scheduler, it arranges to start the new grace period later, either within a timer handler or within the scheduler-clock interrupt handler, depending on configuration. Because no scheduler locks are held across either handler, deadlock is successfully avoided.
The overall approach is thus to adhere to a locking hierarchy by deferring lock acquisition to an environment in which no locks are held.
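Outside the kernel, the same idea can be sketched with a deferral queue: if the conflicting lock might already be held in the current context, record the request and let a later context that holds no locks carry it out. The helpers below (‘in_atomic_context()‘ and ‘defer_queue_push()‘) are hypothetical placeholders, not part of any real API:

```c
#include <pthread.h>
#include <stdbool.h>

extern pthread_mutex_t wakeup_lock;      /* might already be held here */
extern bool in_atomic_context(void);     /* hypothetical environment query */
extern void defer_queue_push(void *req); /* hypothetical: drained later by
                                          * code holding no locks */
extern void do_wakeup(void *req);

void request_wakeup(void *req)
{
	if (in_atomic_context()) {
		/* wakeup_lock might be held: defer rather than deadlock. */
		defer_queue_push(req);
		return;
	}
	pthread_mutex_lock(&wakeup_lock);
	do_wakeup(req);
	pthread_mutex_unlock(&wakeup_lock);
}
```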
Locking Hierarchies and Pointers to Locks
Although there are some exceptions, an external API containing a pointer to a lock is very often a misdesigned API. Handing an internal lock to some other software component is after all the antithesis of information hiding, which is in turn a key design principle.
Quick Quiz: Name one common situation where a pointer to a lock is passed into a function.
Answer: Locking primitives, of course!
One exception is functions that hand off some entity, where the caller’s lock must be held until the handoff is complete, but where the lock must be released before the function returns. One example of such a function is the POSIX ‘pthread_cond_wait()‘ function, where passing a pointer to a ‘pthread_mutex_t‘ prevents hangs due to lost wakeups.
Quick Quiz: Doesn’t the fact that ‘pthread_cond_wait()‘ first releases the mutex and then re-acquires it eliminate the possibility of deadlock?
Answer: Absolutely not!
Consider a program that acquires ‘mutex_a‘, and then ‘mutex_b‘, in that order, and then passes ‘mutex_a‘ to ‘pthread_cond_wait()‘. Now, ‘pthread_cond_wait()‘ will release ‘mutex_a‘, but will re-acquire it before returning. If some other thread acquires ‘mutex_a‘ in the meantime and then blocks on ‘mutex_b‘, the program will deadlock.
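A minimal sketch of that deadlock (the mutex and condition-variable names are illustrative):

```c
#include <pthread.h>

pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready;

void waiter(void)
{
	pthread_mutex_lock(&mutex_a);
	pthread_mutex_lock(&mutex_b);
	/* pthread_cond_wait() releases mutex_a while waiting, but this thread
	 * still holds mutex_b and must re-acquire mutex_a before returning. */
	while (!ready)
		pthread_cond_wait(&cond, &mutex_a);
	pthread_mutex_unlock(&mutex_b);
	pthread_mutex_unlock(&mutex_a);
}

void other(void)
{
	pthread_mutex_lock(&mutex_a); /* succeeds while waiter() is waiting */
	pthread_mutex_lock(&mutex_b); /* blocks: waiter() holds mutex_b and
	                               * can never re-acquire mutex_a */
	pthread_mutex_unlock(&mutex_b);
	pthread_mutex_unlock(&mutex_a);
}
```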
In short, if you find yourself exporting an API with a pointer to a lock as an argument or as the return value, do yourself a favor and carefully reconsider your API design. It might well be the right thing to do, but experience indicates that this is unlikely.
Conditional Locking
But suppose that there is no reasonable locking hierarchy. This can happen in real life, for example, in some types of layered network protocol stacks where packets flow in both directions, for example, in implementations of distributed lock managers. In the networking case, it might be necessary to hold the locks from both layers when passing a packet from one layer to another. Given that packets travel both up and down the protocol stack, this is an excellent recipe for deadlock, as illustrated in Section Protocol Layering and Deadlock.
Here, a packet moving down the stack towards the wire must acquire the next layer’s lock out of order. Given that packets moving up the stack away from the wire are acquiring the locks in order, the out-of-order lock acquisition in the listing below can result in deadlock.
```c
spin_lock(&lock2);
layer_2_processing(pkt);
nextlayer = layer_1(pkt);
spin_lock(&nextlayer->lock1);
spin_unlock(&lock2);
layer_1_processing(pkt);
spin_unlock(&nextlayer->lock1);
```
One way to avoid deadlocks in this case is to impose a locking hierarchy, but when it is necessary to acquire a lock out of order, acquire it conditionally, as shown in Section Avoiding Deadlock Via Conditional Locking.
Instead of unconditionally acquiring the layer-1 lock, the code conditionally acquires it using the ‘spin_trylock()‘ primitive. This primitive acquires the lock immediately if the lock is available (returning non-zero), and otherwise returns zero without acquiring the lock.
```c
retry:
	spin_lock(&lock2);
	layer_2_processing(pkt);
	nextlayer = layer_1(pkt);
	if (!spin_trylock(&nextlayer->lock1)) {
		spin_unlock(&lock2);
		spin_lock(&nextlayer->lock1);
		spin_lock(&lock2);
		if (layer_1(pkt) != nextlayer) {
			spin_unlock(&nextlayer->lock1);
			spin_unlock(&lock2);
			goto retry;
		}
	}
	spin_unlock(&lock2);
	layer_1_processing(pkt);
	spin_unlock(&nextlayer->lock1);
```
If ‘spin_trylock()‘ was successful, the code does the needed layer-1 processing. Otherwise, it releases ‘lock2‘ and then acquires the two locks in the correct order. Unfortunately, there might be multiple networking devices on the system (e.g., Ethernet and WiFi), so that the ‘layer_1()‘ function must make a routing decision. This decision might change at any time, especially if the system is mobile.3 Therefore, the code must recheck the decision after re-acquiring the locks, and if it has changed, must release the locks and start over.
Acquire Needed Locks First
In an important special case of conditional locking, all needed locks are acquired before any processing is carried out, where the needed locks might be identified by hashing the addresses of the data structures involved. In this case, processing need not be idempotent: If it turns out to be impossible to acquire a given lock without first releasing one that was already acquired, just release all the locks and try again. Only once all needed locks are held will any processing be carried out.
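A hedged sketch of this pattern, using address hashing to pick the per-object locks and a trylock-and-retry loop to gather both of them before any processing (the hash and bucket-count choices are illustrative assumptions):

```c
#include <pthread.h>
#include <stdint.h>

#define NLOCKS 64
pthread_mutex_t bucketlock[NLOCKS]; /* each initialized at startup */

/* Identify the needed lock by hashing the data structure's address. */
static pthread_mutex_t *lock_for(void *p)
{
	return &bucketlock[((uintptr_t)p >> 4) % NLOCKS];
}

/* Acquire both locks before doing any work; if the second cannot be
 * obtained immediately, drop the first and start over. */
static void acquire_both(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
	if (l1 == l2) {               /* both objects hash to the same lock */
		pthread_mutex_lock(l1);
		return;
	}
	for (;;) {
		pthread_mutex_lock(l1);
		if (pthread_mutex_trylock(l2) == 0)
			return;       /* both held: safe to proceed */
		pthread_mutex_unlock(l1);
	}
}
```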
However, this procedure can result in livelock, which will be discussed in Section Livelock and Starvation.
Quick Quiz: When using the “acquire needed locks first” approach described in Section Acquire Needed Locks First, how can livelock be avoided?
Answer: Provide an additional global lock. If a given thread has repeatedly tried and failed to acquire the needed locks, then have that thread unconditionally acquire the new global lock, and then unconditionally acquire any needed locks. (Suggested by Doug Lea.)
A related approach, two-phase locking [PhilipABernstein1987], has seen long production use in transactional database systems. In the first phase of a two-phase locking transaction, locks are acquired but not released. Once all needed locks have been acquired, the transaction enters the second phase, where locks are released, but not acquired. This locking approach allows databases to provide serializability guarantees for their transactions, in other words, to guarantee that all values seen and produced by the transactions are consistent with some global ordering of all the transactions. Many such systems rely on the ability to abort transactions, although this can be simplified by avoiding making any changes to shared data until all needed locks are acquired. Livelock and deadlock are issues in such systems, but practical solutions may be found in any of a number of database textbooks.
Single-Lock-at-a-Time Designs
In some cases, it is possible to avoid nesting locks, thus avoiding deadlock. For example, if a problem is perfectly partitionable, a single lock may be assigned to each partition. Then a thread working on a given partition need only acquire the one corresponding lock. Because no thread ever holds more than one lock at a time, deadlock is impossible.
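For example, a perfectly partitionable hash table might dedicate one lock to each bucket, as in this minimal sketch (the table layout is an illustrative assumption):

```c
#include <pthread.h>

#define NBUCKETS 1024

struct elem {
	struct elem *next;
	unsigned long key;
};

struct bucket {
	pthread_mutex_t lock; /* initialized at startup */
	struct elem *head;
};

struct bucket table[NBUCKETS];

/* Each operation touches exactly one bucket, so no thread ever holds
 * more than one lock at a time and deadlock is impossible. */
void table_insert(struct elem *e)
{
	struct bucket *b = &table[e->key % NBUCKETS];

	pthread_mutex_lock(&b->lock);
	e->next = b->head;
	b->head = e;
	pthread_mutex_unlock(&b->lock);
}
```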
However, there must be some mechanism to ensure that the needed data structures remain in existence during the time that neither lock is held. One such mechanism is discussed in Section Lock-Based Existence Guarantees and several others are presented in Section Deferred Processing.
Signal/Interrupt Handlers
Deadlocks involving signal handlers are often quickly dismissed by noting that it is not legal to invoke ‘pthread_mutex_lock()‘ from within a signal handler [OpenGroup1997pthreads]. However, it is possible (though often unwise) to hand-craft locking primitives that can be invoked from signal handlers. Besides which, almost all operating-system kernels permit locks to be acquired from within interrupt handlers, which are analogous to signal handlers.
The trick is to block signals (or disable interrupts, as the case may be) when acquiring any lock that might be acquired within a signal (or an interrupt) handler. Furthermore, if holding such a lock, it is illegal to attempt to acquire any lock that is ever acquired outside of a signal handler without blocking signals.
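In a POSIX program, this blocking can be done with ‘pthread_sigmask()‘, as in the sketch below; ‘SIGUSR1‘ and ‘sig_lock‘ are illustrative, and the signal-handler side would of course need a hand-crafted, async-signal-safe lock as cautioned above:

```c
#include <pthread.h>
#include <signal.h>

pthread_mutex_t sig_lock = PTHREAD_MUTEX_INITIALIZER;

/* Block SIGUSR1 before taking a lock that the SIGUSR1 handler also takes,
 * so the handler cannot run (and self-deadlock) while the lock is held. */
void locked_update(void)
{
	sigset_t mask, omask;

	sigemptyset(&mask);
	sigaddset(&mask, SIGUSR1);
	pthread_sigmask(SIG_BLOCK, &mask, &omask);

	pthread_mutex_lock(&sig_lock);
	/* ... critical section shared with the signal handler ... */
	pthread_mutex_unlock(&sig_lock);

	pthread_sigmask(SIG_SETMASK, &omask, NULL);
}
```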
Quick Quiz: Suppose Lock A is never acquired within a signal handler, but Lock B is acquired both from thread context and by signal handlers. Suppose further that Lock A is sometimes acquired with signals unblocked. Why is it illegal to acquire Lock A holding Lock B?
Answer: Because this would lead to deadlock. Given that Lock A is sometimes held outside of a signal handler without blocking signals, a signal might be handled while holding this lock. The corresponding signal handler might then acquire Lock B, so that Lock B is acquired while holding Lock A. Therefore, if we also acquire Lock A while holding Lock B, we will have a deadlock cycle. Note that this problem exists even if signals are blocked while holding Lock B.
This is another reason to be very careful with locks that are acquired within interrupt or signal handlers. But the Linux kernel’s lock dependency checker knows about this situation and many others as well, so please do make full use of it!
If a lock is acquired by the handlers for several signals, then each and every one of these signals must be blocked whenever that lock is acquired, even when that lock is acquired within a signal handler.
Quick Quiz: How can you legally block signals within a signal handler?
Answer: One of the simplest and fastest ways to do so is to use the ‘sa_mask‘ field of the ‘struct sigaction‘ that you pass to ‘sigaction()‘ when setting up the signal.
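For example, the sketch below blocks ‘SIGUSR2‘ for the duration of the ‘SIGUSR1‘ handler; the signal numbers and handler are illustrative assumptions:

```c
#include <signal.h>

extern void my_handler(int sig); /* assumed to be defined elsewhere */

void install_handler(void)
{
	struct sigaction sa;

	sa.sa_handler = my_handler;
	sigemptyset(&sa.sa_mask);
	sigaddset(&sa.sa_mask, SIGUSR2); /* blocked while my_handler() runs */
	sa.sa_flags = 0;
	sigaction(SIGUSR1, &sa, NULL);
}
```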
Unfortunately, blocking and unblocking signals can be expensive in some operating systems, notably including Linux, so performance concerns often mean that locks acquired in signal handlers are only acquired in signal handlers, and that lockless synchronization mechanisms are used to communicate between application code and signal handlers.
Or that signal handlers are avoided completely except for handling fatal errors.
Quick Quiz: If acquiring locks in signal handlers is such a bad idea, why even discuss ways of making it safe?
Answer: Because these same rules apply to the interrupt handlers used in operating-system kernels and in some embedded applications.
In many application environments, acquiring locks in signal handlers is frowned upon [OpenGroup1997pthreads]. However, that does not stop clever developers from (perhaps unwisely) fashioning home-brew locks out of atomic operations. And atomic operations are in many cases perfectly legal in signal handlers.
Discussion
There are a large number of deadlock-avoidance strategies available to the shared-memory parallel programmer, but there are sequential programs for which none of them is a good fit. This is one of the reasons that expert programmers have more than one tool in their toolbox: Locking is a powerful concurrency tool, but there are jobs better addressed with other tools.
Quick Quiz: Given an object-oriented application that passes control freely among a group of objects such that there is no straightforward locking hierarchy,4 layered or otherwise, how can this application be parallelized?
Answer: There are a number of approaches:
In the case of parametric search via simulation, where a large number of simulations will be run in order to converge on (for example) a good design for a mechanical or electrical device, leave the simulation single-threaded, but run many instances of the simulation in parallel. This retains the object-oriented design, and gains parallelism at a higher level, and likely also avoids both deadlocks and synchronization overhead.
Partition the objects into groups such that there is no need to operate on objects in more than one group at a given time. Then associate a lock with each group. This is an example of a single-lock-at-a-time design, which is discussed in Section Single-Lock-at-a-Time Designs.
Partition the objects into groups such that threads can all operate on objects in the groups in some groupwise ordering. Then associate a lock with each group, and impose a locking hierarchy over the groups.
Impose an arbitrarily selected hierarchy on the locks, and then use conditional locking if it is necessary to acquire a lock out of order, as was discussed in Section Conditional Locking.
Before carrying out a given group of operations, predict which locks will be acquired, and attempt to acquire them before actually carrying out any updates. If the prediction turns out to be incorrect, drop all the locks and retry with an updated prediction that includes the benefit of experience. This approach was discussed in Section Acquire Needed Locks First.
Use transactional memory. This approach has a number of advantages and disadvantages which will be discussed later in this book.
Refactor the application to be more concurrency-friendly. This would likely also have the side effect of making the application run faster even when single-threaded, but might also make it more difficult to modify the application.
Use techniques from later chapters in addition to locking.
Nevertheless, the strategies described in this section have proven quite useful in many settings.
Livelock and Starvation
Although conditional locking can be an effective deadlock-avoidance mechanism, it can be abused. Consider for example the beautifully symmetric example shown in Section Abusing Conditional Locking. This example’s beauty hides an ugly livelock. To see this, consider the following sequence of events:
```c
void thread1(void)
{
retry:
	spin_lock(&lock1);
	do_one_thing();
	if (!spin_trylock(&lock2)) {
		spin_unlock(&lock1);
		goto retry;
	}
	do_another_thing();
	spin_unlock(&lock2);
	spin_unlock(&lock1);
}

void thread2(void)
{
retry:
	spin_lock(&lock2);
	do_a_third_thing();
	if (!spin_trylock(&lock1)) {
		spin_unlock(&lock2);
		goto retry;
	}
	do_a_fourth_thing();
	spin_unlock(&lock1);
	spin_unlock(&lock2);
}
```
Thread 1 acquires ‘lock1‘, then invokes ‘do_one_thing()‘.
Thread 2 acquires ‘lock2‘, then invokes ‘do_a_third_thing()‘.
Thread 1 attempts to acquire ‘lock2‘, but fails because Thread 2 holds it.
Thread 2 attempts to acquire ‘lock1‘, but fails because Thread 1 holds it.
Thread 1 releases ‘lock1‘, then jumps back to ‘retry‘.
Thread 2 releases ‘lock2‘, and jumps back to ‘retry‘.
The livelock dance repeats from the beginning.
Quick Quiz: How can the livelock shown in Section Abusing Conditional Locking be avoided?
Answer: Section Avoiding Deadlock Via Conditional Locking provides some good hints. In many cases, livelocks are a hint that you should revisit your locking design. Or visit it in the first place if your locking design “just grew”.
That said, one good-and-sufficient approach due to Doug Lea is to use conditional locking as described in Section Conditional Locking, but combine this with acquiring all needed locks first, before modifying shared data, as described in Section Acquire Needed Locks First. If a given critical section retries too many times, unconditionally acquire a global lock, then unconditionally acquire all the needed locks. This avoids both deadlock and livelock, and scales reasonably assuming that the global lock need not be acquired too often.
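A sketch of that combined approach; the retry threshold and the lock names are illustrative assumptions:

```c
#include <pthread.h>

pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Conditional locking with bounded retries, falling back to a global lock
 * that serializes the unlucky threads, avoiding both deadlock and livelock.
 * Callers release l1 and l2 when their critical section ends. */
void acquire_pair(pthread_mutex_t *l1, pthread_mutex_t *l2)
{
	for (int tries = 0; tries < 10; tries++) {
		pthread_mutex_lock(l1);
		if (pthread_mutex_trylock(l2) == 0)
			return;                    /* got both locks */
		pthread_mutex_unlock(l1);
	}
	pthread_mutex_lock(&global_lock);  /* too many failures: serialize */
	pthread_mutex_lock(l1);
	pthread_mutex_lock(l2);
	pthread_mutex_unlock(&global_lock);
}
```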
Livelock can be thought of as an extreme form of starvation where a group of threads starves, rather than just one of them.5
```c
void thread1(void)
{
	unsigned int wait = 1;
retry:
	spin_lock(&lock1);
	do_one_thing();
	if (!spin_trylock(&lock2)) {
		spin_unlock(&lock1);
		sleep(wait);
		wait = wait << 1;
		goto retry;
	}
	do_another_thing();
	spin_unlock(&lock2);
	spin_unlock(&lock1);
}

void thread2(void)
{
	unsigned int wait = 1;
retry:
	spin_lock(&lock2);
	do_a_third_thing();
	if (!spin_trylock(&lock1)) {
		spin_unlock(&lock2);
		sleep(wait);
		wait = wait << 1;
		goto retry;
	}
	do_a_fourth_thing();
	spin_unlock(&lock1);
	spin_unlock(&lock2);
}
```
Livelock and starvation are serious issues in software transactional memory implementations, and so the concept of contention manager has been introduced to encapsulate these issues. In the case of locking, simple exponential backoff can often address livelock and starvation. The idea is to introduce exponentially increasing delays before each retry, as shown in Section Conditional Locking and Exponential Backoff.
Quick Quiz: What problems can you spot in the code in Section Conditional Locking and Exponential Backoff?
Answer: Here are a couple:
A one-second wait is way too long for most uses. Wait intervals should begin with roughly the time required to execute the critical section, which will normally be in the microsecond or millisecond range.
The code does not check for overflow. On the other hand, this bug is nullified by the previous bug: 32 bits worth of seconds is more than 50 years.
For better results, backoffs should be bounded, and even better high-contention results are obtained via queued locking [Anderson90], which is discussed more in Section Other Exclusive-Locking Implementations. Of course, best of all is to use a good parallel design that avoids these problems by maintaining low lock contention.
Unfairness
Unfairness can be thought of as a less-severe form of starvation, where a subset of threads contending for a given lock are granted the lion’s share of the acquisitions. This can happen on machines with shared caches or NUMA characteristics, for example, as shown in Section System Architecture and Lock Unfairness. If CPU 0 releases a lock that all the other CPUs are attempting to acquire, the interconnect shared between CPUs 0 and 1 means that CPU 1 will have an advantage over CPUs 2–7. Therefore CPU 1 will likely acquire the lock. If CPU 1 holds the lock long enough for CPU 0 to be requesting the lock by the time CPU 1 releases it and vice versa, the lock can shuttle between CPUs 0 and 1, bypassing CPUs 2–7.
Quick Quiz: Wouldn’t it be better just to use a good parallel design so that lock contention was low enough to avoid unfairness?
Answer: It would be better in some sense, but there are situations where it can be appropriate to use designs that sometimes result in high lock contentions.
For example, imagine a system that is subject to a rare error condition. It might well be best to have a simple error-handling design that has poor performance and scalability for the duration of the rare error condition, as opposed to a complex and difficult-to-debug design that is helpful only when one of those rare error conditions is in effect.
That said, it is usually worth putting some effort into attempting to produce a design that is both simple and efficient during error conditions, for example by partitioning the problem.
Inefficiency
Locks are implemented using atomic instructions and memory barriers, and often involve cache misses. As we saw in Section Hardware and its Habits, these instructions are quite expensive, roughly two orders of magnitude greater overhead than simple instructions. This can be a serious problem for locking: If you protect a single instruction with a lock, you will increase the overhead by a factor of one hundred. Even assuming perfect scalability, one hundred CPUs would be required to keep up with a single CPU executing the same code without locking.
This situation is not confined to locking. Section Saw Kerf shows how this same principle applies to the age-old activity of sawing wood. As can be seen in the figure, sawing a board converts a small piece of that board (the width of the saw blade) into sawdust. Of course, locks partition time instead of sawing wood,6 but just like sawing wood, using locks to partition time wastes some of that time due to lock overhead and (worse yet) lock contention. One important difference is that if someone saws a board into too-small pieces, the resulting conversion of most of that board into sawdust will be immediately obvious. In contrast, it is not always obvious that a given lock acquisition is wasting excessive amounts of time.
And this situation underscores the importance of the synchronization/granularity tradeoff discussed in Section Synchronization Granularity, especially Section Synchronization Efficiency: Too coarse a granularity will limit scalability, while too fine a granularity will result in excessive synchronization overhead.
Acquiring a lock might be expensive, but once held, the CPU’s caches are an effective performance booster, at least for large critical sections. In addition, once a lock is held, the data protected by that lock can be accessed by the lock holder without interference from other threads.
Quick Quiz: How might the lock holder be interfered with?
Answer: If the data protected by the lock is in the same cache line as the lock itself, then attempts by other CPUs to acquire the lock will result in expensive cache misses on the part of the CPU holding the lock. This is a special case of false sharing, which can also occur if a pair of variables protected by different locks happen to share a cache line. In contrast, if the lock is in a different cache line than the data that it protects, the CPU holding the lock will usually suffer a cache miss only on first access to a given variable.
Of course, the downside of placing the lock and data into separate cache lines is that the code will incur two cache misses rather than only one in the uncontended case. As always, choose wisely!
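In C, this placement can be arranged by aligning the lock and the data to separate cache lines, as in the sketch below, which assumes 64-byte cache lines and the GCC/Clang ‘aligned‘ attribute:

```c
#include <pthread.h>

/* Keep the lock and the counter it protects in different cache lines so
 * that spinning acquirers do not repeatedly steal the holder's data line.
 * The cost is the extra cache miss in the uncontended case noted above. */
struct counter_shard {
	pthread_mutex_t lock __attribute__((aligned(64)));
	long count __attribute__((aligned(64)));
};
```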
The Rust programming language takes lock/data association a step further by allowing the developer to make a compiler-visible association between a lock and the data that it protects [RalfJung2021RustSafeSysProg]. When such an association has been made, attempts to access the data without the benefit of the corresponding lock will result in a compile-time diagnostic. The hope is that this will greatly reduce the frequency of this class of bugs. Of course, this approach does not apply straightforwardly to cases where the data to be locked is distributed throughout the nodes of some data structure or when that which is locked is purely abstract, for example, when a small subset of state-machine transitions is to be protected by a given lock. For this reason, Rust allows locks to be associated with types rather than data items or even to be associated with nothing at all. This last option permits Rust to emulate traditional locking use cases, but is not popular among Rust developers. Perhaps the Rust community will come up with other mechanisms tailored to other locking use cases.
Types of Locks
Only locks in life are what you think you know, but don’t. Accept your ignorance and try something new.
— Dennis Vickers
There are a surprising number of types of locks, more than this short chapter can possibly do justice to. The following sections discuss exclusive locks (Section Exclusive Locks), reader-writer locks (Section Reader-Writer Locks), multi-role locks (Section Beyond Reader-Writer Locks), and scoped locking (Section Scoped Locking).
Exclusive Locks
Exclusive locks are what they say they are: Only one thread may hold the lock at a time. The holder of such a lock thus has exclusive access to all data protected by that lock, hence the name.
Of course, this all assumes that this lock is held across all accesses to data purportedly protected by the lock. Although there are some tools that can help (see for example Section Axiomatic Approaches and Locking), the ultimate responsibility for ensuring that the lock is always acquired when needed rests with the developer.
Quick Quiz: Does it ever make sense to have an exclusive lock acquisition immediately followed by a release of that same lock, that is, an empty critical section?
Answer: Empty lock-based critical sections are rarely used, but they do have their uses. The point is that the semantics of exclusive locks have two components:
The familiar data-protection semantic and
A messaging semantic, where releasing a given lock notifies a waiting acquisition of that same lock.
An empty critical section uses the messaging component without the data-protection component.
The rest of this answer provides some example uses of empty critical sections; however, these examples should be considered “gray magic.”7 As such, empty critical sections are almost never used in practice. Nevertheless, pressing on into this gray area …
One historical use of empty critical sections appeared in the networking stack of the 2.4 Linux kernel through use of a read-side-scalable reader-writer lock called ‘brlock‘ for “big reader lock”. This use case is a way of approximating the semantics of read-copy update (RCU), which is discussed in Section Read-Copy Update (RCU). And in fact this Linux-kernel use case has been replaced with RCU.
The empty-lock-critical-section idiom can also be used to reduce lock contention in some situations. For example, consider a multithreaded user-space application where each thread processes units of work maintained in a per-thread list, where threads are prohibited from touching each others’ lists [PaulEMcKenney2012EmptyLocks]. There could also be updates that require that all previously scheduled units of work have completed before the update can progress. One way to handle this is to schedule a unit of work on each thread, so that when all of these units of work complete, the update may proceed.
In some applications, threads can come and go. For example, each thread might correspond to one user of the application, and thus be removed when that user logs out or otherwise disconnects. In many applications, threads cannot depart atomically: They must instead explicitly unravel themselves from various portions of the application using a specific sequence of actions. One specific action will be refusing to accept further requests from other threads, and another specific action will be disposing of any remaining units of work on its list, for example, by placing these units of work in a global work-item-disposal list to be taken by one of the remaining threads. (Why not just drain the thread’s work-item list by executing each item? Because a given work item might generate more work items, so that the list could not be drained in a timely fashion.)
If the application is to perform and scale well, a good locking design is required. One common solution is to have a global lock (call it ‘G‘) protecting the entire process of departing (and perhaps other things as well), with finer-grained locks protecting the individual unraveling operations.
Now, a departing thread must clearly refuse to accept further requests before disposing of the work on its list, because otherwise additional work might arrive after the disposal action, which would render that disposal action ineffective. So simplified pseudocode for a departing thread might be as follows:
Acquire lock ‘G‘.
Acquire the lock guarding communications.
Refuse further communications from other threads.
Release the lock guarding communications.
Acquire the lock guarding the global work-item-disposal list.
Move all pending work items to the global work-item-disposal list.
Release the lock guarding the global work-item-disposal list.
Release lock ‘G‘.
Of course, a thread that needs to wait for all pre-existing work items will need to take departing threads into account. To see this, suppose that this thread starts waiting for all pre-existing work items just after a departing thread has refused further communications from other threads. How can this thread wait for the departing thread’s work items to complete, keeping in mind that threads are not allowed to access each others’ lists of work items?
One straightforward approach is for this thread to acquire ‘G‘ and then the lock guarding the global work-item-disposal list, then move the work items to its own list. The thread then releases both locks, places a work item on the end of its own list, and then waits for all of the work items that it placed on each thread’s list (including its own) to complete.
This approach does work well in many cases, but if special processing is required for each work item as it is pulled in from the global work-item-disposal list, the result could be excessive contention on ‘G‘. One way to avoid that contention is to acquire ‘G‘ and then immediately release it. Then the process of waiting for all prior work items looks something like the following:
Set a global counter to one and initialize a condition variable to zero.
Send a message to all threads to cause them to atomically increment the global counter, and then to enqueue a work item. The work item will atomically decrement the global counter, and if the result is zero, it will set a condition variable to one.
Acquire ‘G‘, which will wait on any currently departing thread to finish departing. Because only one thread may depart at a time, all the remaining threads will have already received the message sent in the preceding step.
Release ‘G‘.
Acquire the lock guarding the global work-item-disposal list.
Move all work items from the global work-item-disposal list to this thread’s list, processing them as needed along the way.
Release the lock guarding the global work-item-disposal list.
Enqueue an additional work item onto this thread’s list. (As before, this work item will atomically decrement the global counter, and if the result is zero, it will set a condition variable to one.)
Wait for the condition variable to take on the value one.
Once this procedure completes, all pre-existing work items are guaranteed to have completed. The empty critical sections are using locking for messaging as well as for protection of data.
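The empty critical section at the heart of this procedure (the acquire-then-immediately-release of ‘G‘) is short enough to sketch directly:

```c
#include <pthread.h>

extern pthread_mutex_t G; /* global lock guarding the departure sequence */

/* Empty critical section: nothing is read or written while G is held.
 * Acquiring G merely waits for any currently departing thread, which
 * holds G across its whole departure sequence, to finish.  This is
 * locking used purely for messaging. */
void wait_for_departing_thread(void)
{
	pthread_mutex_lock(&G);
	pthread_mutex_unlock(&G);
}
```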
It is important to note that unconditionally acquiring an exclusive lock has two effects:
Waiting for all prior holders of that lock to release it and
Blocking any other acquisition attempts until the lock is released.
As a result, at lock acquisition time, any concurrent acquisitions of that lock must be partitioned into prior holders and subsequent holders. Different types of exclusive locks use different partitioning strategies [BjoernBrandenburgPhD,Guerraoui:2019:LPA:3319851.3301501], for example:
Strict FIFO, with acquisitions starting earlier acquiring the lock earlier.
Approximate FIFO, with acquisitions starting sufficiently earlier acquiring the lock earlier.
FIFO within priority level, with higher-priority threads acquiring the lock earlier than any lower-priority threads attempting to acquire the lock at about the same time, but so that some FIFO ordering applies for threads of the same priority.
Random, so that the new lock holder is chosen randomly from all threads attempting acquisition, regardless of timing.
Unfair, so that a given acquisition might never acquire the lock (see Section Unfairness).
Unfortunately, locking implementations with stronger guarantees typically incur higher overhead, motivating the wide variety of locking implementations in production use. For example, real-time systems often require some degree of FIFO ordering within priority level, and much else besides (see Section Event-Driven Real-Time Support), while non-realtime systems subject to high contention might require only enough ordering to avoid starvation, and finally, non-realtime systems designed to avoid contention might not need fairness at all.
Reader-Writer Locks
Reader-writer locks [Courtois71] permit any number of readers to hold the lock concurrently on the one hand or a single writer to hold the lock on the other. In theory, then, reader-writer locks should allow excellent scalability for data that is read often and written rarely. In practice, the scalability will depend on the reader-writer lock implementation.
The classic reader-writer lock implementation involves a set of counters and flags that are manipulated atomically. This type of implementation suffers from the same problem as does exclusive locking for short critical sections: The overhead of acquiring and releasing the lock is about two orders of magnitude greater than the overhead of a simple instruction. Of course, if the critical section is long enough, the overhead of acquiring and releasing the lock becomes negligible. However, because only one thread at a time can be manipulating the lock, the required critical-section size increases with the number of CPUs.
It is possible to design a reader-writer lock that is much more favorable to readers through use of per-thread exclusive locks [WilsonCHsieh92a]. To read, a thread acquires only its own lock. To write, a thread acquires all locks. In the absence of writers, each reader incurs only atomic-instruction and memory-barrier overhead, with no cache misses, which is quite good for a locking primitive. Unfortunately, writers must incur cache misses as well as atomic-instruction and memory-barrier overhead—multiplied by the number of threads.
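A minimal sketch of such a reader-favoring implementation, assuming a fixed maximum number of threads, each with a small integer ID, and per-thread locks initialized elsewhere:

```c
#include <pthread.h>

#define MAX_THREADS 128

pthread_mutex_t per_thread_lock[MAX_THREADS]; /* initialized at startup */

/* Readers touch only their own lock ... */
void read_lock(int tid)   { pthread_mutex_lock(&per_thread_lock[tid]); }
void read_unlock(int tid) { pthread_mutex_unlock(&per_thread_lock[tid]); }

/* ... while writers sweep all of them, always in index order so that
 * concurrent writers cannot deadlock with each other. */
void write_lock(void)
{
	for (int i = 0; i < MAX_THREADS; i++)
		pthread_mutex_lock(&per_thread_lock[i]);
}

void write_unlock(void)
{
	for (int i = 0; i < MAX_THREADS; i++)
		pthread_mutex_unlock(&per_thread_lock[i]);
}
```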
In short, reader-writer locks can be quite useful in a number of situations, but each type of implementation does have its drawbacks. The canonical use case for reader-writer locking involves very long read-side critical sections, preferably measured in hundreds of microseconds or even milliseconds.
As with exclusive locks, a reader-writer lock acquisition cannot complete until all prior conflicting holders of that lock have released it. If a lock is read-held, then read acquisitions can complete immediately, but write acquisitions must wait until there are no longer any readers holding the lock. If a lock is write-held, then all acquisitions must wait until the writer releases the lock. Again as with exclusive locks, different reader-writer lock implementations provide different degrees of FIFO ordering to readers on the one hand and to writers on the other.
But suppose a large number of readers hold the lock and a writer is waiting to acquire the lock. Should readers be allowed to continue to acquire the lock, possibly starving the writer? Similarly, suppose that a writer holds the lock and that a large number of both readers and writers are waiting to acquire the lock. When the current writer releases the lock, should it be given to a reader or to another writer? If it is given to a reader, how many readers should be allowed to acquire the lock before the next writer is permitted to do so?
There are many possible answers to these questions, with different levels of complexity, overhead, and fairness. Different implementations might have different costs, for example, some types of reader-writer locks incur extremely large latencies when switching from read-holder to write-holder mode. Here are a few possible approaches:
Reader-preference implementations unconditionally favor readers over writers, possibly allowing write acquisitions to be indefinitely blocked.
Batch-fair implementations ensure that when both readers and writers are acquiring the lock, both have reasonable access via batching. For example, the lock might admit five readers per CPU, then two writers, then five more readers per CPU, and so on.
Writer-preference implementations unconditionally favor writers over readers, possibly allowing read acquisitions to be indefinitely blocked.
Of course, these distinctions matter only under conditions of high lock contention.
Please keep the waiting/blocking dual nature of locks firmly in mind. This will be revisited in Section Deferred Processing’s discussion of scalable high-performance special-purpose alternatives to locking.
Beyond Reader-Writer Locks
| | Null (Not Held) | Concurrent Read | Concurrent Write | Protected Read | Protected Write | Exclusive |
|---|---|---|---|---|---|---|
| Null (Not Held) | | | | | | |
| Concurrent Read | | | | | | X |
| Concurrent Write | | | | X | X | X |
| Protected Read | | | X | | X | X |
| Protected Write | | | X | X | X | X |
| Exclusive | | X | X | X | X | X |
Reader-writer locks and exclusive locks differ in their admission policy: Exclusive locks allow at most one holder, while reader-writer locks permit an arbitrary number of read-holders (but only one write-holder). There is a very large number of possible admission policies, one of which is that of the VAX/VMS distributed lock manager (DLM) [Snaman87], which is shown in Section VAX/VMS Distributed Lock Manager Policy. Blank cells indicate compatible modes, while cells containing “X” indicate incompatible modes.
The VAX/VMS DLM uses six modes. For purposes of comparison, exclusive locks use two modes (not held and held), while reader-writer locks use three modes (not held, read held, and write held).
The first mode is null, or not held. This mode is compatible with all other modes, which is to be expected: If a thread is not holding a lock, it should not prevent any other thread from acquiring that lock.
The second mode is concurrent read, which is compatible with every other mode except for exclusive. The concurrent-read mode might be used to accumulate approximate statistics on a data structure, while permitting updates to proceed concurrently.
The third mode is concurrent write, which is compatible with null, concurrent read, and concurrent write. The concurrent-write mode might be used to update approximate statistics, while still permitting reads and concurrent updates to proceed concurrently.
The fourth mode is protected read, which is compatible with null, concurrent read, and protected read. The protected-read mode might be used to obtain a consistent snapshot of the data structure, while permitting reads but not updates to proceed concurrently.
The fifth mode is protected write, which is compatible with null and concurrent read. The protected-write mode might be used to carry out updates to a data structure that could interfere with protected readers but which could be tolerated by concurrent readers.
The sixth and final mode is exclusive, which is compatible only with null. The exclusive mode is used when it is necessary to exclude all other accesses.
It is interesting to note that exclusive locks and reader-writer locks can be emulated by the VAX/VMS DLM. Exclusive locks would use only the null and exclusive modes, while reader-writer locks might use the null, protected-read, and protected-write modes.
Quick Quiz: Is there any other way for the VAX/VMS DLM to emulate a reader-writer lock?
Answer: There are in fact several. One way would be to use the null, protected-read, and exclusive modes. Another way would be to use the null, protected-read, and concurrent-write modes. A third way would be to use the null, concurrent-read, and exclusive modes.
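To make the admission policy concrete, the compatibility matrix can be encoded directly in code. The following sketch is not from the original text; the enum, table, and helper names are hypothetical, but the true/false entries follow the table above.

```c
#include <stdbool.h>

/* Hypothetical encoding of the VAX/VMS DLM admission policy.
 * Mode ordering matches the table above. */
enum dlm_mode { DLM_NL, DLM_CR, DLM_CW, DLM_PR, DLM_PW, DLM_EX, DLM_NMODES };

static const bool dlm_compatible[DLM_NMODES][DLM_NMODES] = {
	/*         NL     CR     CW     PR     PW     EX   */
	/* NL */ { true,  true,  true,  true,  true,  true  },
	/* CR */ { true,  true,  true,  true,  true,  false },
	/* CW */ { true,  true,  true,  false, false, false },
	/* PR */ { true,  true,  false, true,  false, false },
	/* PW */ { true,  true,  false, false, false, false },
	/* EX */ { true,  false, false, false, false, false },
};

/* A request may be granted only if it is compatible with every mode
 * currently held by other threads. */
static bool dlm_may_grant(enum dlm_mode request,
			  const int held_count[DLM_NMODES])
{
	for (int m = 0; m < DLM_NMODES; m++)
		if (held_count[m] && !dlm_compatible[request][m])
			return false;
	return true;
}
```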
Although the VAX/VMS DLM policy has seen widespread production use for distributed databases, it does not appear to be used much in shared-memory applications. One possible reason for this is that the greater communication overheads of distributed databases can hide the greater overhead of the VAX/VMS DLM’s more-complex admission policy.
Nevertheless, the VAX/VMS DLM is an interesting illustration of just how flexible the concepts behind locking can be. It also serves as a very simple introduction to the locking schemes used by modern DBMSes, which can have more than thirty locking modes, compared to VAX/VMS’s six.
Scoped Locking
The locking primitives discussed thus far require explicit acquisition and release primitives, for example, ‘spin_lock()‘ and ‘spin_unlock()‘, respectively. Another approach is to use the object-oriented “resource acquisition is initialization” (RAII) pattern [MargaretAEllis1990Cplusplus]. This pattern is often applied to auto variables in languages like C++, where the corresponding constructor is invoked upon entry to the object’s scope, and the corresponding destructor is invoked upon exit from that scope. This can be applied to locking by having the constructor acquire the lock and the destructor free it.

This approach can be quite useful; in fact, in 1990 I was convinced that it was the only type of locking that was needed. One very nice property of RAII locking is that you don’t need to carefully release the lock on each and every code path that exits that scope, a property that can eliminate a troublesome set of bugs.
However, RAII locking also has a dark side. RAII makes it quite difficult to encapsulate lock acquisition and release, for example, in iterators. In many iterator implementations, you would like to acquire the lock in the iterator’s “start” function and release it in the iterator’s “stop” function. RAII locking instead requires that the lock acquisition and release take place in the same level of scoping, making such encapsulation difficult or even impossible.
Strict RAII locking also prohibits overlapping critical sections, due to the fact that scopes must nest. This prohibition makes it difficult or impossible to express a number of useful constructs, for example, locking trees that mediate between multiple concurrent attempts to assert an event. Of an arbitrarily large group of concurrent attempts, only one need succeed, and the best strategy for the remaining attempts is for them to fail as quickly and painlessly as possible. Otherwise, lock contention becomes pathological on large systems (where “large” is many hundreds of CPUs). Therefore, C++17 [RichardSmith2019N4800] has escapes from strict RAII in its ‘unique_lock‘ class, which allows the scope of the critical section to be controlled to roughly the same extent as can be achieved with explicit lock acquisition and release primitives.
Example strict-RAII-unfriendly data structures from Linux-kernel RCU are shown in Section Locking Hierarchy. Here, each CPU is assigned a leaf ‘rcu_node‘ structure, and each ‘rcu_node‘ structure has a pointer to its parent (named, oddly enough, ‘->parent‘), up to the root ‘rcu_node‘ structure, which has a ‘NULL‘ ‘->parent‘ pointer. The number of child ‘rcu_node‘ structures per parent can vary, but is typically 32 or 64. Each ‘rcu_node‘ structure also contains a lock named ‘->fqslock‘.
The general approach is a tournament, where a given CPU conditionally acquires its leaf ‘rcu_node‘ structure’s ‘->fqslock‘, and, if successful, attempts to acquire that of the parent, then releases that of the child. In addition, at each level, the CPU checks a global ‘gp_flags‘ variable, and if this variable indicates that some other CPU has asserted the event, the first CPU drops out of the competition. This acquire-then-release sequence continues until either the ‘gp_flags‘ variable indicates that someone else won the tournament, one of the attempts to acquire an ‘->fqslock‘ fails, or the root ‘rcu_node‘ structure’s ‘->fqslock‘ has been acquired. If the root ‘rcu_node‘ structure’s ‘->fqslock‘ is acquired, a function named ‘do_force_quiescent_state()‘ is invoked.
**Listing: Conditional Locking to Reduce Contention**

```c
void force_quiescent_state(struct rcu_node *rnp_leaf)
{
	int ret;
	struct rcu_node *rnp = rnp_leaf;
	struct rcu_node *rnp_old = NULL;

	/* Advance up the tree, conditionally acquiring each ->fqslock. */
	for (; rnp != NULL; rnp = rnp->parent) {
		ret = (READ_ONCE(gp_flags)) ||
		      !raw_spin_trylock(&rnp->fqslock);
		if (rnp_old != NULL)
			raw_spin_unlock(&rnp_old->fqslock);
		if (ret)
			return;	/* Someone else won, or trylock failed. */
		rnp_old = rnp;
	}
	/* Won the tournament: rnp_old is the root rcu_node structure. */
	if (!READ_ONCE(gp_flags)) {
		WRITE_ONCE(gp_flags, 1);
		do_force_quiescent_state();
		WRITE_ONCE(gp_flags, 0);
	}
	raw_spin_unlock(&rnp_old->fqslock);
}
```
Simplified code to implement this is shown in the listing above (Section Conditional Locking to Reduce Contention). The purpose of this function is to mediate between CPUs that have concurrently detected a need to invoke the ‘do_force_quiescent_state()‘ function. At any given time, it only makes sense for one instance of ‘do_force_quiescent_state()‘ to be active, so if there are multiple concurrent callers, we need at most one of them to actually invoke ‘do_force_quiescent_state()‘, and we need the rest to (as quickly and painlessly as possible) give up and leave.

To this end, each pass through the loop attempts to advance up one level in the ‘rcu_node‘ hierarchy. If the ‘gp_flags‘ variable is already set, or if the ‘raw_spin_trylock()‘ attempt to acquire the current ‘rcu_node‘ structure’s ‘->fqslock‘ is unsuccessful, then local variable ‘ret‘ is set to 1. If local variable ‘rnp_old‘ is non-‘NULL‘, meaning that we hold ‘rnp_old‘’s ‘->fqslock‘, that lock is released (but only after the attempt has been made to acquire the parent ‘rcu_node‘ structure’s ‘->fqslock‘). If either of the conditions recorded in ‘ret‘ indicated a reason to give up, the function returns to the caller. Otherwise, we must have acquired the current ‘rcu_node‘ structure’s ‘->fqslock‘, so a pointer to that structure is saved in local variable ‘rnp_old‘ in preparation for the next pass through the loop.

If control reaches the code following the loop, we won the tournament and now hold the root ‘rcu_node‘ structure’s ‘->fqslock‘. If the global variable ‘gp_flags‘ is still zero, the function sets ‘gp_flags‘ to one, invokes ‘do_force_quiescent_state()‘, and resets ‘gp_flags‘ back to zero. Either way, it releases the root ‘rcu_node‘ structure’s ‘->fqslock‘.
This function illustrates the not-uncommon pattern of hierarchical locking. This pattern is difficult to implement using strict RAII locking, just like the iterator encapsulation noted earlier, and so explicit lock/unlock primitives (or C++17-style ‘unique_lock‘ escapes) will be required for the foreseeable future.
Locking Implementation Issues
When you translate a dream into reality, it’s never a full implementation. It is easier to dream than to do.
— Shai Agassi

Developers are almost always best-served by using whatever locking primitives are provided by the system, for example, the POSIX pthread mutex locks [OpenGroup1997pthreads,Butenhof1997pthreads]. Nevertheless, studying sample implementations can be helpful, as can considering the challenges posed by extreme workloads and environments.
Sample Exclusive-Locking Implementation Based on Atomic Exchange
This section reviews the implementation shown in Section Sample Lock Based on Atomic Exchange (the listing below). The data structure for this lock is just an ‘int‘, but it could be any integral type. The initial value of this lock is zero, meaning “unlocked”.
**Listing: Sample Lock Based on Atomic Exchange**

```c
typedef int xchglock_t;
#define DEFINE_XCHG_LOCK(n) xchglock_t n = 0

void xchg_lock(xchglock_t *xp)
{
	while (xchg(xp, 1) == 1) {
		while (READ_ONCE(*xp) == 1)
			continue;
	}
}

void xchg_unlock(xchglock_t *xp)
{
	(void)xchg(xp, 0);
}
```

Quick Quiz: Why not rely on the C language’s default initialization of zero instead of using the explicit initializer shown in ‘DEFINE_XCHG_LOCK()‘?

Answer: Because this default initialization does not apply to locks allocated as auto variables within the scope of a function.

Lock acquisition is carried out by the ‘xchg_lock()‘ function. This function uses a nested loop, with the outer loop repeatedly atomically exchanging the value of the lock with the value one (meaning “locked”). If the old value was already the value one (in other words, someone else already holds the lock), then the inner loop spins until the lock is available, at which point the outer loop makes another attempt to acquire the lock.

Quick Quiz: Why bother with the inner loop of ‘xchg_lock()‘? Why not simply repeatedly do the atomic exchange operation?

Answer: Suppose that the lock is held and that several threads are attempting to acquire the lock. In this situation, if these threads all loop on the atomic exchange operation, they will ping-pong the cache line containing the lock among themselves, imposing load on the interconnect. In contrast, if these threads are spinning in the inner loop, they will each spin within their own caches, placing negligible load on the interconnect.

Lock release is carried out by the ‘xchg_unlock()‘ function, which atomically exchanges the value zero (“unlocked”) into the lock, thus marking it as having been released.

Quick Quiz: Why not simply store zero into the lock word in ‘xchg_unlock()‘?

Answer: This can be a legitimate implementation, but only if this store is preceded by a memory barrier and makes use of ‘WRITE_ONCE()‘. The memory barrier is not required when the ‘xchg()‘ operation is used because this operation implies a full memory barrier due to the fact that it returns a value.

This lock is a simple example of a test-and-set lock [Segall84], but very similar mechanisms have been used extensively as pure spinlocks in production.

Other Exclusive-Locking Implementations

There are a great many other possible implementations of locking based on atomic instructions, many of which are reviewed in the classic paper by Mellor-Crummey and Scott [MellorCrummey91a]. These implementations represent different points in a multi-dimensional design tradeoff [Guerraoui:2019:LPA:3319851.3301501,HugoGuirouxPhD,McKenney96a]. For example, the atomic-exchange-based test-and-set lock presented in the previous section works well when contention is low and has the advantage of small memory footprint. It avoids giving the lock to threads that cannot use it, but as a result can suffer from unfairness or even starvation at high contention levels.

In contrast, the ticket lock [MellorCrummey91a], which was once used in the Linux kernel, avoids unfairness at high contention levels. However, as a consequence of its strict FIFO discipline, it can grant the lock to a thread that is currently unable to use it, perhaps due to that thread being preempted or interrupted.
On the other hand, it is important to avoid getting too worried about the possibility of preemption and interruption. After all, in many cases, this preemption and interruption could just as well happen just after the lock was acquired.

All locking implementations where waiters spin on a single memory location, including both test-and-set locks and ticket locks, suffer from performance problems at high contention levels. The problem is that the thread releasing the lock must update the value of the corresponding memory location. At low contention, this is not a problem: The corresponding cache line is very likely still local to and writeable by the thread holding the lock. In contrast, at high levels of contention, each thread attempting to acquire the lock will have a read-only copy of the cache line, and the lock holder will need to invalidate all such copies before it can carry out the update that releases the lock. In general, the more CPUs and threads there are, the greater the overhead incurred when releasing the lock under conditions of high contention.

This negative scalability has motivated a number of different queued-lock implementations [Anderson90,Graunke90,MellorCrummey91a,Wisniewski94,Craig93,Magnusson94,Takada93], some of which are used in recent versions of the Linux kernel [JonathanCorbet2014qspinlocks]. Queued locks avoid high cache-invalidation overhead by assigning each thread a queue element. These queue elements are linked together into a queue that governs the order that the lock will be granted to the waiting threads. The key point is that each thread spins on its own queue element, so that the lock holder need only invalidate the first element from the next thread’s CPU’s cache. This arrangement greatly reduces the overhead of lock handoff at high levels of contention.

More recent queued-lock implementations also take the system’s architecture into account, preferentially granting locks locally, while also taking steps to avoid starvation [McKenney02e,radovic03hierarchical,radovic02efficient,BenJackson02,McKenney02d]. Many of these can be thought of as analogous to the elevator algorithms traditionally used in scheduling disk I/O. Dice et al. discuss use of local and global locks in order to transform any architecture-oblivious lock into a local/global locking scheme that optimizes for the system’s structure by providing (for example) per-socket locks along with a global lock [DavidDice2012NUMAlock1,DavidDice2012NUMAlock2].

Unfortunately, the same scheduling logic that improves the efficiency of queued locks at high contention also increases their overhead at low contention. Beng-Hong Lim and Anant Agarwal therefore combined a simple test-and-set lock with a queued lock, using the test-and-set lock at low levels of contention and switching to the queued lock at high levels of contention [BengHongLim94], thus getting low overhead at low levels of contention and getting fairness and high throughput at high levels of contention. Browning et al. took a similar approach, but avoided the use of a separate flag, so that the test-and-set fast path uses the same sequence of instructions that would be used in a simple test-and-set lock [LukeBrowning2005SimpleLockNUMAAware]. This approach has been used in production.
Another issue that arises at high levels of contention is when the lock holder is delayed, especially when the delay is due to preemption, which can result in *priority inversion*, where a low-priority thread holds a lock, but is preempted by a medium-priority CPU-bound thread, which results in a high-priority process blocking while attempting to acquire the lock. The result is that the CPU-bound medium-priority process is preventing the high-priority process from running. One solution is *priority inheritance* [Lampson1980Mesa], which has been widely used for real-time computing [LuiSha1990PriorityInheritance,JonathanCorbet2006PriorityInheritance], despite some lingering controversy over this practice [Yodaiken2004FSM,DougLocke2002a].

Another way to avoid priority inversion is to prevent preemption while a lock is held. Because preventing preemption while locks are held also improves throughput, most proprietary UNIX kernels offer some form of scheduler-conscious synchronization mechanism [Kontothanassis97a], largely due to the efforts of a certain sizable database vendor. These mechanisms usually take the form of a hint that preemption should be avoided in a given region of code, with this hint typically being placed in a machine register. These hints frequently take the form of a bit set in a particular machine register, which enables extremely low per-lock-acquisition overhead for these mechanisms. In contrast, Linux avoids these hints. Instead, the Linux kernel community’s response to requests for scheduler-conscious synchronization was a mechanism called *futexes* [HubertusFrancke2002Futex,IngoMolnar2006RobustFutexes,StevenRostedt2006piFutexes,UlrichDrepper2011Futexes].

Interestingly enough, atomic instructions are not strictly needed to implement locks [Dijkstra65a,Lamport74a]. An excellent exposition of the issues surrounding locking implementations based on simple loads and stores may be found in Herlihy’s and Shavit’s textbook [HerlihyShavit2008Textbook,HerlihyShavit2020Textbook]. The main point echoed here is that such implementations currently have little practical application, although a careful study of them can be both entertaining and enlightening. Nevertheless, with one exception described below, such study is left as an exercise for the reader.

Gamsa et al. [Gamsa99] describe a token-based mechanism in which a token circulates among the CPUs. When the token reaches a given CPU, it has exclusive access to anything protected by that token. There are any number of schemes that may be used to implement the token-based mechanism, for example:

1. Maintain a per-CPU flag, which is initially zero for all but one CPU. When a CPU’s flag is non-zero, it holds the token. When it finishes with the token, it zeroes its flag and sets the flag of the next CPU to one (or to any other non-zero value).
2. Maintain a per-CPU counter, which is initially set to the corresponding CPU’s number, which we assume to range from zero to $`N-1`$, where $`N`$ is the number of CPUs in the system. When a CPU’s counter is greater than that of the next CPU (taking counter wrap into account), the first CPU holds the token. When it is finished with the token, it sets the next CPU’s counter to a value one greater than its own counter.

This lock is unusual in that a given CPU cannot necessarily acquire it immediately, even if no other CPU is using it at the moment. Instead, the CPU must wait until the token comes around to it.
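The first of these schemes might be sketched as follows. This is not from the original text, and ‘NR_CPUS‘, ‘token_wait()‘, and ‘token_pass()‘ are made-up names; each CPU spins only on its own flag, and passing the token is just a pair of stores.

```c
#include <stdatomic.h>

#define NR_CPUS 64		/* assumed system size */

/* token[cpu] is non-zero only on the CPU currently holding the token;
 * token[0] must be set to 1 during initialization so CPU 0 starts
 * with the token. */
static atomic_int token[NR_CPUS];

static void token_wait(int cpu)
{
	/* Spin until the token arrives at this CPU. */
	while (!atomic_load(&token[cpu]))
		continue;
}

static void token_pass(int cpu)
{
	/* Done: clear our flag, then hand the token to the next CPU. */
	atomic_store(&token[cpu], 0);
	atomic_store(&token[(cpu + 1) % NR_CPUS], 1);
}
```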
This is useful in cases where CPUs need periodic access to the critical section, but can tolerate variances in token-circulation rate. Gamsa et al. [Gamsa99] used it to implement a variant of read-copy update (see Section Read-Copy Update (RCU)), but it could also be used to protect periodic per-CPU operations such as flushing per-CPU caches used by memory allocators [McKenney93], garbage-collecting per-CPU data structures, or flushing per-CPU data to shared storage (or to mass storage, for that matter).

The Linux kernel now uses queued spinlocks [JonathanCorbet2014qspinlocks], but because of the complexity of implementations that provide good performance across the range of contention levels, the path has not always been smooth [CatalinMarinas2018qspinlockTLA,WillDeacon2018qspinlock].

As increasing numbers of people gain familiarity with parallel hardware and parallelize increasing amounts of code, we can continue to expect more special-purpose locking primitives to appear, see for example Guerraoui et al. [Guerraoui:2019:LPA:3319851.3301501,HugoGuirouxPhD]. Nevertheless, you should carefully consider this important safety tip: Use the standard synchronization primitives whenever humanly possible. The big advantage of the standard synchronization primitives over roll-your-own efforts is that the standard primitives are typically *much* less bug-prone.

Lock-Based Existence Guarantees

Existence precedes and rules essence.
— Jean-Paul Sartre

A key challenge in parallel programming is to provide *existence guarantees*, so that attempts to access a given object can rely on that object being in existence throughout a given access attempt.

**Listing: Per-Element Locking Without Existence Guarantees (Buggy!)**

```c
int delete(int key)
{
	int b;
	struct element *p;

	b = hashfunction(key);
	p = hashtable[b];
	if (p == NULL || p->key != key)
		return 0;
	spin_lock(&p->lock);
	hashtable[b] = NULL;
	spin_unlock(&p->lock);
	kfree(p);
	return 1;
}
```
In some cases, existence guarantees are implicit:
1. Global variables and static local variables in the base module will
exist as long as the application is running.
2. Global variables and static local variables in a loaded module will
exist as long as that module remains loaded.
3. A module will remain loaded as long as at least one of its functions
has an active instance.
4. A given function instance’s on-stack variables will exist until that
instance returns.
5. If you are executing within a given function or have been called
(directly or indirectly) from that function, then the given function
has an active instance.
These implicit existence guarantees are straightforward, though bugs
involving implicit existence guarantees really can happen.
But the more interesting—and troublesome—guarantee involves heap memory:
A dynamically allocated data structure will exist until it is freed. The
problem to be solved is to synchronize the freeing of the structure with
concurrent accesses to that same structure. One way to do this is with
*explicit guarantees*, such as locking. If a given structure may only be
freed while holding a given lock, then holding that lock guarantees that
structure’s existence.
But this guarantee depends on the existence of the lock itself. One
straightforward way to guarantee the lock’s existence is to place the
lock in a global variable, but global locking has the disadvantage of
limiting scalability. One way of providing scalability that improves as
the size of the data structure increases is to place a lock in each
element of the structure. Unfortunately, putting the lock that is to
protect a data element in the data element itself is subject to subtle race conditions,
as shown in Section Per-Element
Locking Without Existence Guarantees (Buggy!).
To see one of these race conditions, consider the following sequence of
events:
1. Thread 0 invokes ‘delete()‘ for a given key, finds the matching element, and acquires that element’s lock via ‘spin_lock(&p->lock)‘.
2. Thread 1 concurrently invokes ‘delete()‘ for that same key, reaches the same ‘spin_lock()‘ call, but spins on the lock because Thread 0 holds it.
3. Thread 0 executes the remainder of ‘delete()‘, removing the element from the hashtable, releasing the lock, and then freeing the element.
4. Thread 0 continues execution, and allocates memory, getting the exact block of memory that it just freed.
5. Thread 0 then initializes this block of memory as some other type of structure.
6. Thread 1’s ‘spin_lock()‘ operation fails due to the fact that what it believes to be ‘p->lock‘ is no longer a spinlock.

Because there is no existence guarantee, the identity of the data element can change while a thread is attempting to acquire that element’s lock!
**Listing: Per-Element Locking With Lock-Based Existence Guarantees**
```c
int delete(int key)
{
	int b;
	struct element *p;
	spinlock_t *sp;

	b = hashfunction(key);
	sp = &locktable[b];
	spin_lock(sp);
	p = hashtable[b];
	if (p == NULL || p->key != key) {
		spin_unlock(sp);
		return 0;
	}
	hashtable[b] = NULL;
	spin_unlock(sp);
	kfree(p);
	return 1;
}
```
One way to fix this example is to use a hashed set of global locks, so that each hash bucket has its own lock, as shown in Section Per-Element Locking With Lock-Based Existence Guarantees. This approach allows acquiring the proper lock (via ‘spin_lock(sp)‘) before gaining a pointer to the data element (the load from ‘hashtable[b]‘). Although this approach works quite well for elements contained in a single partitionable data structure such as the hash table shown in the listing, it can be problematic if a given data element can be a member of multiple hash tables or of more-complex data structures such as trees or graphs. Not only can these problems be solved, but the solutions also form the basis of lock-based software transactional memory implementations. However, Section Deferred Processing describes simpler—and faster—ways of providing existence guarantees.
Locking: Hero or Villain?
You either die a hero or you live long enough to see yourself become the villain.
— Aaron Eckhart as Harvey Dent

As is often the case in real life, locking can be either hero or villain, depending on how it is used and on the problem at hand. In my experience, those writing whole applications are happy with locking, those writing parallel libraries are less happy, and those parallelizing existing sequential libraries are extremely unhappy. The following sections discuss some reasons for these differences in viewpoints.
Locking For Applications: Hero!
When writing an entire application (or entire kernel), developers have full control of the design, including the synchronization design. Assuming that the design makes good use of partitioning, as discussed in Section Partitioning and Synchronization Design, locking can be an extremely effective synchronization mechanism, as demonstrated by the heavy use of locking in production-quality parallel software. Nevertheless, although such software usually bases most of its synchronization design on locking, such software also almost always makes use of other synchronization mechanisms, including special counting algorithms (Section Counting), data ownership (Section Data Ownership), reference counting (Section Reference Counting), hazard pointers (Section Hazard Pointers), sequence locking (Section Sequence Locks), and read-copy update (Section Read-Copy Update (RCU)). In addition, practitioners use tools for deadlock detection [JonathanCorbet2006lockdep], lock acquisition/release balancing [JonathanCorbet2004sparse], cache-miss analysis [ValgrindHomePage], hardware-counter-based profiling [LinuxKernelPerfWiki,OProfileHomePage], and many more besides. Given careful design, use of a good combination of synchronization mechanisms, and good tooling, locking works quite well for applications and kernels.
Locking For Parallel Libraries: Just Another Tool
Unlike applications and kernels, the designer of a library cannot know the locking design of the code that the library will be interacting with. In fact, that code might not be written for years to come. Library designers therefore have less control and must exercise more care when laying out their synchronization design.

Deadlock is of course of particular concern, and the techniques discussed in Section Deadlock need to be applied. One popular deadlock-avoidance strategy is therefore to ensure that the library’s locks are independent subtrees of the enclosing program’s locking hierarchy. However, this can be harder than it looks.

One complication was discussed in Section Local Locking Hierarchies, namely when library functions call into application code, with ‘qsort()‘’s comparison-function argument being a case in point. Another complication is the interaction with signal handlers. If an application signal handler is invoked from a signal received within the library function, deadlock can ensue just as surely as if the library function had called the signal handler directly. A final complication occurs for those library functions that can be used between a ‘fork()‘/‘exec()‘ pair, for example, due to use of the ‘system()‘ function. In this case, if your library function was holding a lock at the time of the ‘fork()‘, then the child process will begin life with that lock held. Because the thread that will release the lock is running in the parent but not the child, if the child calls your library function, deadlock will ensue.

The following strategies may be used to avoid deadlock problems in these cases:
Don’t use either callbacks or signals.
Don’t acquire locks from within callbacks or signal handlers.
Let the caller control synchronization.
Parameterize the library API to delegate locking to caller.
Explicitly avoid callback deadlocks.
Explicitly avoid signal-handler deadlocks.
Avoid invoking ‘fork()‘.
Each of these strategies is discussed in one of the following sections.
Use Neither Callbacks Nor Signals
If a library function avoids callbacks and the application as a whole avoids signals, then any locks acquired by that library function will be leaves of the locking-hierarchy tree. This arrangement avoids deadlock, as discussed in Section Locking Hierarchies. Although this strategy works extremely well where it applies, there are some applications that must use signal handlers, and there are some library functions (such as the ‘qsort()‘ function discussed in Section Local Locking Hierarchies) that require callbacks. The strategy described in the next section can often be used in these cases.
Avoid Locking in Callbacks and Signal Handlers
If neither callbacks nor signal handlers acquire locks, then they cannot be involved in deadlock cycles, which allows straightforward locking hierarchies to once again consider library functions to be leaves on the locking-hierarchy tree. This strategy works very well for most uses of ‘qsort‘, whose callbacks usually simply compare the two values passed in to them. This strategy also works wonderfully for many signal handlers, especially given that acquiring locks from within signal handlers is generally frowned upon [TheOpenGroup1997SUS], but it can fail if the application needs to manipulate complex data structures from a signal handler. Here are some ways to avoid acquiring locks in signal handlers even if complex data structures must be manipulated:
Use simple data structures based on non-blocking synchronization, as will be discussed in Section Simple NBS.
If the data structures are too complex for reasonable use of non-blocking synchronization, create a queue that allows non-blocking enqueue operations. In the signal handler, instead of manipulating the complex data structure, add an element to the queue describing the required change. A separate thread can then remove elements from the queue and carry out the required changes using normal locking. There are a number of readily available implementations of concurrent queues [ChristophMKirsch2012FIFOisntTR,MathieuDesnoyers2009URCU,MichaelScott96].
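A minimal sketch of the second approach appears below. It is not from the original text, and the ring buffer, ‘enqueue_change()‘, and ‘dequeue_change()‘ are hypothetical: the signal handler only writes into a fixed-size single-producer ring, so it never acquires a lock, while a worker thread drains the ring and applies the changes under normal locking.

```c
#include <stdatomic.h>

#define RING_SIZE 256

struct change_req { int op; int key; };

static struct change_req ring[RING_SIZE];
static atomic_uint head, tail;

/* Called from the signal handler: never blocks, never locks. */
static int enqueue_change(struct change_req req)
{
	unsigned int h = atomic_load(&head);

	if (h - atomic_load(&tail) == RING_SIZE)
		return 0;		/* full: drop or count the overflow */
	ring[h % RING_SIZE] = req;
	atomic_store(&head, h + 1);
	return 1;
}

/* Called from an ordinary thread, which then applies the change to
 * the complex data structure under its usual lock. */
static int dequeue_change(struct change_req *req)
{
	unsigned int t = atomic_load(&tail);

	if (t == atomic_load(&head))
		return 0;		/* empty */
	*req = ring[t % RING_SIZE];
	atomic_store(&tail, t + 1);
	return 1;
}
```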
This strategy should be enforced with occasional manual or (preferably) automated inspections of callbacks and signal handlers. When carrying out these inspections, be wary of clever coders who might have (unwisely) created home-brew locks from atomic operations.
Caller Controls Synchronization
Letting the caller control synchronization works extremely well when the library functions are operating on independent caller-visible instances of a data structure, each of which may be synchronized separately. For example, if the library functions operate on a search tree, and if the application needs a large number of independent search trees, then the application can associate a lock with each tree. The application then acquires and releases locks as needed, so that the library need not be aware of parallelism at all. Instead, the application controls the parallelism, so that locking can work very well, as was discussed in Section Hero!. However, this strategy fails if the library implements a data structure that requires internal concurrency, for example, a hash table or a parallel sort. In this case, the library absolutely must control its own synchronization.
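Returning to the search-tree example, caller-controlled synchronization might look like the following sketch. It is not from the original text; ‘struct tree‘, ‘tree_lookup()‘, and the wrapper names are hypothetical stand-ins for a lock-oblivious library and the application code around it.

```c
#include <pthread.h>

struct tree;				/* opaque library type (hypothetical) */
struct node;
extern struct node *tree_lookup(struct tree *t, int key);	/* hypothetical library call */

/* The application owns the locking: one lock per independent tree. */
struct app_tree {
	pthread_mutex_t lock;
	struct tree *tree;
};

static struct node *app_tree_lookup(struct app_tree *t, int key)
{
	struct node *n;

	pthread_mutex_lock(&t->lock);
	n = tree_lookup(t->tree, key);
	pthread_mutex_unlock(&t->lock);
	return n;
}
```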
Parameterize Library Synchronization
The idea here is to add arguments to the library’s API to specify which locks to acquire, how to acquire and release them, or both. This strategy allows the application to take on the global task of avoiding deadlock by specifying which locks to acquire (by passing in pointers to the locks in question) and how to acquire them (by passing in pointers to lock acquisition and release functions), but also allows a given library function to control its own concurrency by deciding where the locks should be acquired and released. In particular, this strategy allows the lock acquisition and release functions to block signals as needed without the library code needing to be concerned with which signals need to be blocked by which locks. The separation of concerns used by this strategy can be quite effective, but in some cases the strategies laid out in the following sections can work better. That said, passing explicit pointers to locks to external APIs must be very carefully considered, as discussed in Section Locking Hierarchies and Pointers to Locks. Although this practice is sometimes the right thing to do, you should do yourself a favor by looking into alternative designs first.
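A minimal sketch of such a parameterized API follows; it is not from the original text, and all of the names are hypothetical. The caller supplies the lock and the acquire/release functions (which may also block signals), while the library decides where its critical section begins and ends.

```c
struct sync_ops {
	void *lock;
	void (*acquire)(void *lock);
	void (*release)(void *lock);
};

struct collection;					/* opaque (hypothetical) */
extern int collection_insert_locked(struct collection *c, int key);

int library_insert(struct collection *c, int key, const struct sync_ops *ops)
{
	int ret;

	ops->acquire(ops->lock);
	ret = collection_insert_locked(c, key);
	ops->release(ops->lock);
	return ret;
}
```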
Explicitly Avoid Callback Deadlocks
The basic rule behind this strategy was discussed in Section Local Locking Hierarchies: “Release all locks before invoking unknown code.” This is usually the best approach because it allows the application to ignore the library’s locking hierarchy: The library remains a leaf or isolated subtree of the application’s overall locking hierarchy. In cases where it is not possible to release all locks before invoking unknown code, the layered locking hierarchies described in Section Layered Locking Hierarchies can work well. For example, if the unknown code is a signal handler, this implies that the library function block signals across all lock acquisitions, which can be complex and slow. Therefore, in cases where signal handlers (probably unwisely) acquire locks, the strategies in the next section may prove helpful.
Explicitly Avoid Signal-Handler Deadlocks
Suppose that a given library function is known to acquire locks, but does not block signals. Suppose further that it is necessary to invoke that function both from within and outside of a signal handler, and that it is not permissible to modify this library function. Of course, if no special action is taken, then if a signal arrives while that library function is holding its lock, deadlock can occur when the signal handler invokes that same library function, which in turn attempts to re-acquire that same lock. Such deadlocks can be avoided as follows:
If the application invokes the library function from within a signal handler, then that signal must be blocked every time that the library function is invoked from outside of a signal handler.
If the application invokes the library function while holding a lock acquired within a given signal handler, then that signal must be blocked every time that the library function is called outside of a signal handler.
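As an illustration of the first rule, the caller can block the relevant signal around the call. This sketch is not from the original text, and ‘library_func()‘ is a hypothetical lock-acquiring function that does not block signals itself.

```c
#include <pthread.h>
#include <signal.h>

extern void library_func(void);	/* hypothetical: acquires locks */

/* Block signo for the duration of the call so that its handler
 * (which also calls library_func()) cannot run while the lock
 * is held. */
static void call_library_func_safely(int signo)
{
	sigset_t set, oldset;

	sigemptyset(&set);
	sigaddset(&set, signo);
	pthread_sigmask(SIG_BLOCK, &set, &oldset);
	library_func();
	pthread_sigmask(SIG_SETMASK, &oldset, NULL);
}
```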
These rules can be enforced by using tools similar to the Linux kernel’s lockdep lock dependency checker [JonathanCorbet2006lockdep]. One of the great strengths of lockdep is that it is not fooled by human intuition [StevenRostedt2011locdepCryptic].
Library Functions Used Between ‘fork()‘ and ‘exec()‘
As noted earlier, if a thread executing a library function is holding a lock at the time that some other thread invokes ‘fork()‘, the fact that the parent’s memory is copied to create the child means that this lock will be born held in the child’s context. The thread that will release this lock is running in the parent, but not in the child, which means that although the parent’s copy of this lock will be released, the child’s copy never will be. Therefore, any attempt on the part of the child to invoke that same library function (thus acquiring that same lock) will result in deadlock.

A pragmatic and straightforward way of solving this problem is to ‘fork()‘ a child process while the process is still single-threaded, and have this child process remain single-threaded. Requests to create further child processes can then be communicated to this initial child process, which can safely carry out any needed ‘fork()‘ and ‘exec()‘ system calls on behalf of its multi-threaded parent process.

Another rather less pragmatic and straightforward solution to this problem is to have the library function check to see if the owner of the lock is still running, and if not, “break” the lock by re-initializing and then acquiring it. However, this approach has a couple of vulnerabilities:
The data structures protected by that lock are likely to be in some intermediate state, so that naively breaking the lock might result in arbitrary memory corruption.
If the child creates additional threads, two threads might break the lock concurrently, with the result that both threads believe they own the lock. This could again result in arbitrary memory corruption.
The ‘pthread_atfork()‘ function is provided to help deal with these situations. The idea is to register a triplet of functions, one to be called by the parent before the ‘fork()‘, one to be called by the parent after the ‘fork()‘, and one to be called by the child after the ‘fork()‘. Appropriate cleanups can then be carried out at these three points.

Be warned, however, that coding of ‘pthread_atfork()‘ handlers is quite subtle in general. The cases where ‘pthread_atfork()‘ works best are cases where the data structure in question can simply be re-initialized by the child. Which might be one reason why the POSIX standard forbids use of any non-async-signal-safe functions between the ‘fork()‘ and the ‘exec()‘, which rules out acquisition of locks during that time. Other alternatives to ‘fork()‘/‘exec()‘ include ‘posix_spawn()‘ and ‘io_uring_spawn()‘ [JoshTriplett2022io_uring_spawn,JakeEdge2022io_uring_spawn].
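Returning to ‘pthread_atfork()‘, for the simple case where the library’s state is protected by a single lock and can be handed across ‘fork()‘ in a consistent state, the classic idiom looks roughly like the sketch below. This is not from the original text; ‘lib_lock‘ and the handler names are made up, and as the text warns, anything more complex than this quickly becomes subtle.

```c
#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

/* Acquire the lock before fork() so the child inherits it in a
 * quiescent state, then release it in both parent and child. */
static void lib_prepare(void) { pthread_mutex_lock(&lib_lock); }
static void lib_parent(void)  { pthread_mutex_unlock(&lib_lock); }
static void lib_child(void)   { pthread_mutex_unlock(&lib_lock); }

static void lib_init(void)
{
	pthread_atfork(lib_prepare, lib_parent, lib_child);
}
```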
Parallel Libraries: Discussion
Regardless of the strategy used, the description of the library’s API must include a clear description of that strategy and how the caller should interact with that strategy. In short, constructing parallel libraries using locking is possible, but not as easy as constructing a parallel application.
Locking For Parallelizing Sequential Libraries: Villain!
With the advent of readily available low-cost multicore systems, a common task is parallelizing an existing library that was designed with only single-threaded use in mind. This all-too-common disregard for parallelism can result in a library API that is severely flawed from a parallel-programming viewpoint. Candidate flaws include:
Implicit prohibition of partitioning.
Callback functions requiring locking.
Object-oriented spaghetti code.
These flaws and the consequences for locking are discussed in the following sections.
Partitioning Prohibited
Suppose that you were writing a single-threaded hash-table implementation. It is easy and fast to maintain an exact count of the total number of items in the hash table, and also easy and fast to return this exact count on each addition and deletion operation. So why not? One reason is that exact counters do not perform or scale well on multicore systems, as was seen in Section Counting. As a result, the parallelized implementation of the hash table will not perform or scale well. So what can be done about this? One approach is to return an approximate count, using one of the algorithms from Section Counting. Another approach is to drop the element count altogether. Either way, it will be necessary to inspect uses of the hash table to see why the addition and deletion operations need the exact count. Here are a few possibilities:
Determining when to resize the hash table. In this case, an approximate count should work quite well. It might also be useful to trigger the resizing operation from the length of the longest chain, which can be computed and maintained in a nicely partitioned per-chain manner.
Producing an estimate of the time required to traverse the entire hash table. An approximate count works well in this case, also.
For diagnostic purposes, for example, to check for items being lost when transferring them to and from the hash table. This clearly requires an exact count. However, given that this usage is diagnostic in nature, it might suffice to maintain the lengths of the hash chains, then to infrequently sum them up while locking out addition and deletion operations.
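For the third, diagnostic case, the partitioned bookkeeping might look like the following sketch (not from the original text; the names are hypothetical). Each bucket's length is updated only under that bucket's lock, and an approximate total is obtained by summing without excluding updaters.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NBUCKETS 1024			/* assumed table size */

struct element;				/* opaque (hypothetical) */

struct bucket {
	pthread_mutex_t lock;
	struct element *chain;
	atomic_ulong nelems;		/* chain length, updated under ->lock */
};

static struct bucket table[NBUCKETS];

/* Approximate by design: the sum does not exclude concurrent
 * updates, which is good enough for resize triggers or estimates. */
static unsigned long approx_count(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NBUCKETS; i++)
		sum += atomic_load(&table[i].nelems);
	return sum;
}
```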
It turns out that there is now a strong theoretical basis for some of the constraints that performance and scalability place on a parallel library’s APIs [HagitAttiya2011LawsOfOrder,Attiya:2011:LOE:1925844.1926442,PaulEMcKenney2011SNC]. Anyone designing a parallel library needs to pay close attention to those constraints. Although it is all too easy to blame locking for what are really problems due to a concurrency-unfriendly API, doing so is not helpful. On the other hand, one has little choice but to sympathize with the hapless developer who made this choice in (say) 1985. It would have been a rare and courageous developer to anticipate the need for parallelism at that time, and it would have required an even more rare combination of brilliance and luck to actually arrive at a good parallel-friendly API. Times change, and code must change with them. That said, there might be a huge number of users of a popular library, in which case an incompatible change to the API would be quite foolish. Adding a parallel-friendly API to complement the existing heavily used sequential-only API is usually the best course of action. Nevertheless, human nature being what it is, we can expect our hapless developer to be more likely to complain about locking than about his or her own poor (though understandable) API design choices.
Deadlock-Prone Callbacks
Section Local Locking Hierarchies/Layered Locking Hierarchies/ Just Another Tool described how undisciplined use of callbacks can result in locking woes. These sections also described how to design your library function to avoid these problems, but it is unrealistic to expect a 1990s programmer with no experience in parallel programming to have followed such a design. Therefore, someone attempting to parallelize an existing callback-heavy single-threaded library will likely have many opportunities to curse locking’s villainy. If there are a very large number of uses of a callback-heavy library, it may be wise to again add a parallel-friendly API to the library in order to allow existing users to convert their code incrementally. Alternatively, some advocate use of transactional memory in these cases. While the jury is still out on transactional memory, Section Transactional Memory discusses its strengths and weaknesses. It is important to note that hardware transactional memory (discussed in Section Hardware Transactional Memory) cannot help here unless the hardware transactional memory implementation provides forward-progress guarantees, which few do. Other alternatives that appear to be quite practical (if less heavily hyped) include the methods discussed in Section Conditional Locking/Acquire Needed Locks First, as well as those that will be discussed in Section Data Ownership/Deferred Processing.
Object-Oriented Spaghetti Code
Object-oriented programming went mainstream sometime in the 1980s or 1990s, and as a result there is a huge amount of single-threaded object-oriented code in production. Although object orientation can be a valuable software technique, undisciplined use of objects can easily result in object-oriented spaghetti code. In object-oriented spaghetti code, control flits from object to object in an essentially random manner, making the code hard to understand and even harder, and perhaps impossible, to accommodate a locking hierarchy. Although many might argue that such code should be cleaned up in any case, such things are much easier to say than to do. If you are tasked with parallelizing such a beast, you can reduce the number of opportunities to curse locking by using the techniques described in Section Conditional Locking/Acquire Needed Locks First, as well as those that will be discussed in Section Data Ownership/Deferred Processing. This situation appears to be the use case that inspired transactional memory, so it might be worth a try as well. That said, the choice of synchronization mechanism should be made in light of the hardware habits discussed in Section Hardware and its Habits. After all, if the overhead of the synchronization mechanism is orders of magnitude more than that of the operations being protected, the results are not going to be pretty. And that leads to a question well worth asking in these situations: Should the code remain sequential? For example, perhaps parallelism should be introduced at the process level rather than the thread level. In general, if a task is proving extremely hard, it is worth some time spent thinking about not only alternative ways to accomplish that particular task, but also alternative tasks that might better solve the problem at hand.
Summary
— Unknown
Locking is perhaps the most widely used and most generally useful synchronization tool. However, it works best when designed into an application or library from the beginning. Given the large quantity of pre-existing single-threaded code that might need to one day run in parallel, locking should therefore not be the only tool in your parallel-programming toolbox. The next few chapters will discuss other tools, and how they can best be used in concert with locking and with each other.
- Of course, one type of failure-to-wake bug is a deadlock that involves not only locks, but also non-lock resources. But the question really did say “lock-based deadlock”!
- One name for this is “object-oriented spaghetti code.”
- And, in contrast to the 1900s, mobility is the common case.
- Also known as “object-oriented spaghetti code.”
- Try not to get too hung up on the exact definitions of terms like livelock, starvation, and unfairness. Anything that causes a group of threads to fail to make adequate forward progress is a bug that needs to be fixed, and debating names doesn’t fix bugs.
- That is, locking is temporal synchronization. Mechanisms that synchronize both temporally and spatially are described in Section Deferred Processing.
- Thanks to Alexey Roytman for this description.
- Though more clearly expressed at https://www.stroustrup.com/bs_faq2.html#finally.
- My later work with parallelism at Sequent Computer Systems very quickly disabused me of this misguided notion.
- Which is why many RAII locking implementations provide a way to leak the lock out of the scope that it was acquired and into the scope in which it is to be released. However, some object must mediate the scope leaking, which can add complexity compared to non-RAII explicit locking primitives.
- But the standard’s words do not stop clever coders from creating their own home-brew locking primitives from atomic operations.