8    Coordination

In the previously described “reading world” (chapter 3) there was a reader process interacting with a book. There are two different ways to add in a second activity: we can add a second reader, and/or we can add a second book. We can diagram it as a 2 × 2 grid of possibilities (table 8.1):

•  one reader, one book (which we described in a previous chapter);

•  two readers, one book;

•  one reader, two books; or

•  two readers, two books.

Table 8.1

              1 Book                    2 Books
1 Reader      No multiprocess issues    ?
2 Readers     ?                         ?

We’ll fill in this table to cover the different cases. Initially we just put a question mark for each of the new combinations.

The one-reader, two-books case is a reader who is trying to make progress in two books at once. This implies that the reader spends some time reading one book, then switches to the other and spends some time reading that book. The reader then switches back to reading the first book, and the alternation between the books continues. A computer scientist would refer to this as “timesharing” or “multitasking.” This alternating approach turns out to be very important to how computer systems operate. It’s also challenging to get it right, as we’ll see in chapter 10.

One thing we know for sure is that we don’t want the reader to start from the beginning of the book each time there’s a change of books. For the reader to make progress within this arrangement, there must be some kind of marker or reminder of where to restart: possibly a bookmark or a pencil mark on a page, or a memory retained in the reader’s brain to identify a starting point. Without such a marker, each switch to a different book restarts reading at the beginning; in such an arrangement, the reader may stay very busy but doesn’t make much progress.

As soon as we are concerned with multiple processes, there is some additional data that’s not really related to the problem being solved—it’s just related to the work of coordinating the processes. This particular kind of data is what computer scientists usually call context. In general, the context of a process lets the process be bundled up and put away, then later resumed from the same place.
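To make “context” concrete, here’s a small sketch in Python (the names are our own illustration, not anyone’s real system): each book carries a bookmark, and a reader timeshares between two books by saving and restoring that bookmark at every switch.

    class Book:
        def __init__(self, title, pages):
            self.title = title
            self.pages = pages
            self.bookmark = 0    # the context: where to resume reading

        def read_some(self, n):
            # Resume from the bookmark, read up to n pages, save the new place.
            if self.bookmark >= self.pages:
                return           # already finished
            start = self.bookmark
            self.bookmark = min(start + n, self.pages)
            print(f"{self.title}: resumed at page {start + 1}, "
                  f"stopped after page {self.bookmark}")

    dickens = Book("Dickens", pages=6)
    shakespeare = Book("Shakespeare", pages=6)

    # Timesharing: alternate between the books. Because each switch saves
    # and restores the bookmark, both books make steady progress.
    while dickens.bookmark < dickens.pages or shakespeare.bookmark < shakespeare.pages:
        dickens.read_some(2)
        shakespeare.read_some(2)

Remove the bookmark from this sketch and it degenerates into the busy-but-unproductive reader described above: every turn at a book would restart from page 1.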

Sharing a Book

The next case to consider is two readers, one book. Two readers are each separately trying to make progress in the single shared book. Because the book is a single shared thing, there must be some kind of controlled access to it, or else the two readers will interfere with each other’s reading. For example, our first reader (named Alice) may be ready to turn from page 5 to page 6, but turning the page will prevent our second reader (named Bob) from finishing his reading of page 5. In one annoying scenario, the two readers could each briefly grab the book, flip the page to where they want it, read something there, and then let go of the book—only to have the other reader immediately flip the page back again to where it was previously. It’s possible to make progress under these circumstances, but it’s frustrating.

In an even worse scenario, the two readers might grab hold of the same page at the same time and pull it in opposite directions, tearing it. Such a disaster is like the kinds of damage or corruption that can happen if two or more processes have uncontrolled access to some shared data in a computer. Fortunately, there are tools that we can use to avoid those problems; we’ll learn more about some of those tools in chapter 10.
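In software, the usual tool for controlled access is a lock. Here’s a minimal Python sketch of the idea, with invented names: whichever reader holds the lock has exclusive use of the shared page, so simultaneous page-turns can’t corrupt it.

    import threading

    book_lock = threading.Lock()   # controls access to the shared book
    current_page = 5               # shared state: the page the book is open to

    def turn_to(reader, page):
        global current_page
        with book_lock:            # only one reader holds the book at a time
            current_page = page
            print(f"{reader} turned the book to page {current_page}")

    alice = threading.Thread(target=turn_to, args=("Alice", 6))
    bob = threading.Thread(target=turn_to, args=("Bob", 5))
    alice.start(); bob.start()
    alice.join(); bob.join()

The readers can still annoy each other by flipping the page back and forth, but the page itself always stays intact.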

Multiple Books and Multiple Readers

We’ve now covered both the one-reader-two-books scenario and the two-readers-one-book scenario. You might well think that the two-readers-two-books case is the most complex of all, and in a sense that is correct. However, that case really adds only a little new difficulty to the issues that we’ve already discussed. In one arrangement, the two readers are each reading a distinct book, so there’s nothing shared between them. That means we effectively have a single-reader, single-book arrangement that just happens to be next to another single-reader, single-book arrangement. If there’s no interaction between those two setups, it doesn’t really matter if they’re next to each other or on different planets—they’ll still both have the same experiences as any other single-reader, single-book arrangement.

In the other “two readers, two books” case, the two readers are both reading from both books. Because there is sharing, this situation requires the same kind of controlled access that we needed when two readers were reading from one book. It almost doesn’t matter how many different shared books there are, or how many readers. As soon as there’s the possibility of a book being shared, there needs to be some kind of controlled access to it. Table 8.2 summarizes all of the cases we’ve considered.

Table 8.2

              1 Book                    2 Books
1 Reader      No multiprocess issues    Timesharing
2 Readers     Controlled access         No sharing: No multiprocess issues
                                        Sharing: Controlled access

Deadlock

There is one new wrinkle in the two-book, two-reader case compared to what we saw previously. That new wrinkle is that there’s now a simple way to get “stuck” if the two readers both want to gain exclusive access to both books at once.

What do we mean by getting “stuck”? To see how the problem arises, let’s return to our two readers, Alice and Bob. We’ll call the books Dickens and Shakespeare. Let’s further assume that both Alice and Bob want to grab both Dickens and Shakespeare before either one does anything with them, perhaps comparing them page for page to look at differences. Alice plans to grab Dickens, then grab Shakespeare. Meanwhile, Bob is planning to grab Shakespeare, then Dickens. They will get “stuck” if Alice successfully grabs Dickens while Bob successfully grabs Shakespeare.

They will each be politely waiting for the other one to give up what has been grabbed. If they are obtuse and stubborn, they will each cling to what they have and wait for the other to give up—which might not ever happen.
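The same trap is easy to reproduce with two locks. In this Python sketch (again, invented names), Alice and Bob grab the books in opposite orders; each second grab uses a timeout only so that the stuck readers can report their plight instead of waiting forever.

    import threading, time

    dickens = threading.Lock()
    shakespeare = threading.Lock()

    def reader(name, first, first_title, second, second_title):
        with first:
            print(f"{name} grabbed {first_title}")
            time.sleep(0.1)      # give the other reader time to grab a book
            if second.acquire(timeout=1):
                print(f"{name} grabbed {second_title} too")
                second.release()
            else:
                print(f"{name} is stuck waiting for {second_title}")

    alice = threading.Thread(target=reader,
                             args=("Alice", dickens, "Dickens", shakespeare, "Shakespeare"))
    bob = threading.Thread(target=reader,
                           args=("Bob", shakespeare, "Shakespeare", dickens, "Dickens"))
    alice.start(); bob.start()
    alice.join(); bob.join()

If both readers agreed to grab the books in the same order (say, always Dickens first), this stuck state could not arise.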

Getting “stuck” in this way is called deadlock. Deadlock is easy to see in a two-process case like this, but the problem can also occur with larger groups of processes. Any arrangement with multiple shared resources and multiple processes poses some risk of deadlock until we figure out how to avoid it. Any time that a process has grabbed some of the shared resources it needs, but still needs other resources that are currently grabbed by other processes, there is a risk of deadlock.

Gridlock

Deadlock among readers sharing books may seem implausible, but there’s a real-life version you may well have experienced if you drive in a crowded city like Boston. A traffic gridlock is a form of deadlock. Every intersection is shared among the various cars on the road. Cars are like processes, and their forward motion is analogous to steps (although of course cars are not digital in their movements!).

In a simple case with a badly designed intersection, two cars facing each other and both trying to turn left can block each other. Each car is occupying part of the intersection that the other one needs in order to leave the intersection.

In a simple gridlock like this one, it is easy for both drivers to see that there is indeed a “stuck” situation. It’s also simple to resolve the situation by one or both drivers giving up their attempt to turn left. What makes a large-scale gridlock more hazardous is that the collection of drivers and intersections may be spread across a number of blocks, so that there is no single intersection where it’s obvious that ordinary traffic behavior has broken down.

Bad driving habits can spread the gridlock. In Boston, it’s common for a driver to enter an intersection without having a clear path to exit the intersection. These naughty drivers usually get away with this behavior even if they are stranded in the intersection for a cycle or two of the traffic lights, because traffic eventually moves forward and gives them some space outside the intersection. There are some intersections where this kind of ugly intersection-blocking behavior happens as a matter of course for both the roads that are crossing. But if something happens further down the road to create a genuine blockage, the intersection can’t clear as it usually does—it stays blocked. The newly blocked intersection in turn blocks traffic in all directions, which in turn creates problems for traffic leading to that intersection—especially if that traffic is in turn coming through intersections blocked by misbehaving motorists.

Detecting Deadlock

If we know what each process has grabbed and what each process is waiting for, then we can tell exactly when deadlock has occurred by drawing a diagram of waiting processes. We draw an arrow from process A to process B whenever process A is waiting for a shared resource that process B has grabbed. We previously mentioned that Alice is waiting for Bob, since Bob is currently holding Shakespeare and Alice needs Shakespeare before she can proceed further. Figure 8.1 captures that A-waits-for-B relationship.

Figure 8.1

A waits for B.

If we draw in all the arrows wherever one process is waiting for another, we can look at the diagram to see if there are any loops. Figure 8.2 is our diagram for when Alice has Dickens and Bob has Shakespeare, but each of them is also waiting for the book the other has.

Figure 8.2

A cycle of waiting: A and B are deadlocked.

If there’s a loop of processes waiting for each other, with the arrows leading back to our process of interest, then there’s a deadlock among all the processes in that loop. Part of the challenge is that there might be a lot of processes in the loop. Figure 8.3 shows six processes in a deadlock. Looking at any one pair of processes doesn’t reveal the deadlock—only the whole group of six has a loop.

Figure 8.3

Six processes in a deadlock.
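A program can hunt for these loops the same way we do by eye. In this Python sketch (names are ours), the waits-for diagram becomes a dictionary of arrows, and we follow arrows from a starting process until we either run out of arrows or return to a process already on our path.

    # The waits-for diagram of figure 8.2: Alice waits for Bob, and
    # Bob waits for Alice.
    waits_for = {"Alice": ["Bob"], "Bob": ["Alice"]}

    def find_cycle(graph, start):
        """Follow waits-for arrows from start; returning to a process
        already on the path means a loop, and a loop means deadlock."""
        path, seen = [], set()

        def visit(process):
            if process in path:
                return path[path.index(process):]   # the deadlocked loop
            if process in seen:
                return None                         # explored; no loop here
            seen.add(process)
            path.append(process)
            for other in graph.get(process, []):
                loop = visit(other)
                if loop:
                    return loop
            path.pop()
            return None

        return visit(start)

    print(find_cycle(waits_for, "Alice"))    # ['Alice', 'Bob']: deadlock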

Breaking Deadlock

We’ve seen that we can detect deadlock by looking for loops in the diagram of who is waiting for whom. That diagram is also useful for fixing the problem. Any process that notices such a loop can choose to leave it, and thereby break the deadlock. The leaving process stops waiting for whatever it currently needs, and/or releases whatever it has that others are waiting for. The remaining processes are no longer deadlocked and can make progress.

An alternative to this “noble volunteer” approach uses a kind of overseer process that tracks the behavior of some collection of other processes. When the overseer sees a deadlock, it can choose at least one “victim” from among the deadlocked processes. The victim will have to lose its grabbed resources so that others can proceed. With the involuntary, pick-a-victim approach to breaking a deadlock, the victim process may not be prepared for its demise. Simply taking resources away from a process doesn’t necessarily give a good result: the process might have already changed some of the items that it grabbed. If we simply wipe out a process in the middle of its activities, and then hand its resources to some other process, we might create a mess. After all, if you have to stop some cooks in the middle of making a meal and send them home, you also want to clean up whatever they’ve left behind in the kitchen, not just parachute in a new cook to start cooking more stuff in addition to what’s currently on the stove.

Fortunately, there are elegant mechanisms available to “undo” whatever has been done by the victim process, if we know that we may have to break deadlocks. We’re not going to get into those details; instead, we will just wave our hands and declare that we can reliably reverse all the steps taken by a process that has become trapped in a deadlock situation.
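Taking that undo machinery on faith, the overseer’s remaining job is simple enough to sketch in Python (illustrative only, with a three-process loop standing in for figure 8.3’s six): pick a victim from the detected loop and erase its waits-for arrows, which stands in for releasing its resources.

    def break_deadlock(graph, loop):
        victim = loop[0]                 # any policy will do; take the first
        graph[victim] = []               # the victim stops waiting...
        for edges in graph.values():     # ...and nobody waits on it anymore
            if victim in edges:
                edges.remove(victim)
        return victim

    # A three-process loop: A waits for B, B for C, C for A.
    waits_for = {"A": ["B"], "B": ["C"], "C": ["A"]}
    loop = ["A", "B", "C"]               # as the detector above would report
    print("victim:", break_deadlock(waits_for, loop))
    print("remaining arrows:", waits_for)   # no loop left; the rest proceed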

Livelock

A subtle trap is that it’s possible to replace a deadlock with a livelock—that is, a situation in which no process is “stuck,” but there is also no forward progress. For example, assume that there’s a group of processes that have somehow reached deadlock in a ring configuration, like the group in figure 8.3.

Each process (labeled A through F) is waiting for something. Although the processes may be unrelated to each other and unaware of each other, we can see that they have a kind of relationship through these shared items. If each of the processes handles a “stuck” situation in the exact same way, then each may choose to release and then regrab items at the same time. Bizarrely, the collection of processes will repeatedly create a brief deadlock that is then “resolved.” Although there is ongoing activity in such a system, it’s not doing any real work in any of the processes—each of the processes is just repeatedly grabbing a resource and then releasing it.
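We can simulate that unlucky lockstep directly. In this Python sketch (purely illustrative), each process holds one book and wants the other; every round, both release what they hold and grab what they wanted at the same moment, so the conflict merely flips back and forth.

    holding = {"A": "Dickens", "B": "Shakespeare"}

    def the_other(book):
        return "Shakespeare" if book == "Dickens" else "Dickens"

    for round_number in range(1, 5):
        # Both processes notice they are stuck, release their book, and
        # grab the one they wanted, all at the same moment.
        holding = {p: the_other(book) for p, book in holding.items()}
        print(f"round {round_number}: A holds {holding['A']}, "
              f"B holds {holding['B']}, and still neither has both")

A common remedy is randomness: if each process waits a different, randomly chosen time before regrabbing, the lockstep breaks and one of them eventually gets both books.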

A real-world experience that’s like a different kind of livelock can happen in a large bureaucracy like a university or a state government. A difficult issue may fall into a gray area where it’s not clear who should solve the problem, and so department A will redirect an inquiry to department B, which may in turn redirect to department C. No one actually provides an answer, only a pointer to someone else who might be able to provide an answer. If this sequence of redirections leads back to a previously encountered department, we are in a redirection loop and can’t ever expect to get an answer no matter how many more times we’re redirected.

Now, when a person actually follows such a sequence of redirections, they usually notice if they find themselves back at a place they’ve visited previously. But that only means that this particular kind of livelock is easy to detect. In general, detecting livelock is hard. Although deadlock is readily detected by constructing the “waits-for” graph, there is no comparably easy way to detect livelock. In the scenario where each of the processes is repeatedly in a deadlock and then out of a deadlock, it’s not obvious how any particular process can tell the difference between livelock and a run of bad luck. In the worst case, detecting livelock requires predicting the behavior of programs. Such prediction is equivalent to solving the halting problem—and as we learned in chapter 7, the halting problem is unsolvable.

Thrashing

Thrashing is a situation where multiple processes are spending all of their time in coordination overhead rather than in doing real work. Thrashing is similar to livelock in some ways but is distinct from it. The usual cause of thrashing is overload—as a step-taking system divides its resources among more and more processes, each process’s share of steps gets smaller. Eventually, that share is no more than the steps required for housekeeping and bookkeeping associated with each distinct process.
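The arithmetic of that overload is easy to sketch. In this Python fragment (the numbers are made up), a fixed budget of steps is split evenly among the processes, and each process costs a fixed amount of bookkeeping; past a certain point, the bookkeeping consumes the entire share.

    TOTAL_STEPS = 1000    # steps the system can take per second (made up)
    OVERHEAD = 40         # bookkeeping steps each process costs per second

    for n in (2, 5, 10, 20, 25):
        share = TOTAL_STEPS // n
        useful = max(share - OVERHEAD, 0)
        status = "thrashing" if useful == 0 else "ok"
        print(f"{n:2d} processes: {share:4d} steps each, {useful:4d} useful ({status})")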

We also see a similar class of problem in everyday life: as people try to increase their intensity of multitasking, they eventually reach a point where they are ineffective because they spend too little time “on task” and too much time being interrupted or reminding themselves of what they should be doing next.

The difference between livelock and thrashing might seem academic—after all, the real point in either case is that there’s a lot of activity but not much real progress. But it’s useful to distinguish them and understand what the differences mean. Livelock is not something that typically develops or goes away based on how much work needs to be done. Instead, livelock usually reflects some underlying design error that affects the correctness of the system. Something in the sequencing or arrangement of the processes means that it’s impossible for them to make progress, because they get in each other’s way.

In contrast, thrashing both appears and disappears based on how much work needs to be done. Thrashing is essentially about implementation costs. In thrashing, there is no structural or design problem with the processes that keeps them from working; instead, the process-related coordination overhead has become so high relative to their real work that little or no useful work is being done.