Concurrent Programming in ML: A Race

Lately I’ve been super interested in language models for concurrency, after hearing a fascinating talk from Adam Solove on synchronizable abstractions for UI.Unfortunately, the talk isn’t online (Adam presented it at work), so the blog post linked above is the next best thing!

I’ve been working my way through a handful of books, including PCPH, the Concurrency section of PFPL, and most recently Concurrent Programming in ML, by John Reppy.

In particular, I think I’ve found a race condition in one of the code listings of Concurrent Programming in ML. After introducing the listing itself, we’ll walk through a trace that shows the errant behavior, then propose a small change that prevents it from happening.

Setup: Snippets from the Book

Before we begin, here’s the listing in full. It’s a sample implementation of a 1-element concurrent buffer, using condition variables. It supports creation, insertion, and removal.

datatype 'a buffer = BUF of {
  data      : 'a option ref,
  mu        : mutex,
  dataAvail : condition,
  dataEmpty : condition

fun buffer () =
  let val mu = mutex() in
    BUF {
      data      = ref NONE,
      mu        = mu,
      dataAvail = condition mu,
      dataEmpty = condition mu

fun insert (BUF {data, mu, dataAvail, dataEmpty}, v) =
    fun waitLp NONE = (data := SOME v; signal dataAvail)
      | waitLp (SOME v) = (wait dataEmpty; waitLp (!data))
    withLock mu waitLp (!data)

fun remove (BUF {data, mu, dataAvail, dataEmpty}) =
    fun waitLp NONE = (wait dataAvail; waitLp (!data))
      | waitLp (SOME v) = (data := NONE; signal dataEmpty)
    withLock mu waitLp (!data)
Concurrent Programming in ML, Listing 2.3

You might also want to reference this exerpt which explains the semantics of the concurrency primitives at play in the snippet above: locks and condition variables. Study the listing above and exerpt below for a moment. See if you can spot a race, or are convinced the code is correct.

The semantics of the expression

withLock mu f x

are that first the lock mu is acquired, then the function f is applied to x, and then the function’s result is returned after releasing the lock.

The basic operations on condition variables are

val wait : condition -> unit

which causes a process to block on the condition variable, and

val signal : condition -> unit

which wakes up one waiting process. A condition variable is associated with a specific mutex lock, which must be held when performing a wait operation on the variable. The semantics of the wait operation are that the mutex lock is released, and then the process is blocked; when the condition is signaled, the next process in the condition’s waiting queue is unblocked and it reacquires the mutex lock and proceeds. A signal operation on a condition variable that has an empty waiting queue has no effect; in this sense condition variables are memoryless.

Concurrent Programming in ML, section 2.4.2

A trace to expose the problem

The problem I see has to do with SML’s eager evaluation: before calling a function f e, we evaluate e to a value v. Then substitution kicks in and we substitute v into the body of f. For us, that means that in the definition of insert:

fun insert (* ··· *) =
  (* ··· *)
    withLock mu waitLp (!data)
  (* ··· *)

we evaluate !data to a value before we run the body of withLock to acquire the lock. When inserting into an empty queue, !data evaluates to NONE. And since this happens outside the withLock if two calls to insert attempt to acquire the lock at the same time, they’ll both think the queue is empty when they wake up! When this happens, the one to wake up second will unknowingly overwrite what the first one inserted.

Here’s a sample trace of a program allocating a buffer and then doing two concurrent insertions:

(* '=>' marks steps where two threads evolve concurrently  *)
(* '->' marks steps where just one thread evalautes        *)

-> val buf = buffer ()

(* Fork two threads; both have access to 'buf'.            *)

   (* thread 1 *)                       (* thread 2 *)
=> insert buf 1                         insert buf 2
=> withLock mu waitLp (!data)           withLock mu waitLp (!data)
=> withLock mu waitLp NONE              withLock mu waitLp NONE

   (* thread 1 acquires lock *)
-> waitLp NONE
-> (data := SOME v; signal dataAvail)
-> (data := SOME 1; signal dataAvail)
   (* {data = ref (SOME 1), ...} *)
-> ((); signal dataAvail)
-> signal dataAvail
-> ()
   (* thread 1 releases lock *)

                                        (* thread 2 acquires lock *)
                                        (* NONE is now stale! *)
->                                      waitLp NONE
                                        (* selects wrong case in function *)
->                                      (data := SOME v; signal dataAvail)
->                                      (data := SOME 2; signal dataAvail)
                                        (* ==> data = ref (SOME 2) *)
->                                      ((); signal dataAvail)
->                                      signal dataAvail
->                                      ()
Sample trace, showing that first insert gets dropped

Notice how the stale read allowed two consecutive inserts. What we wanted was for the second insert to wake up, see that the buffer is full, then wait for the dataEmpty condition variable to wake it up. So having the !data outside the lock is not good.

Fixing the stale read

The solution to this is to delay evaluating !data until the body of the waitLp function, which only executes when we have the lock. This ensures that we don’t read a stale value for the content of the buffer.

datatype 'a buffer = (* ··· *)
fun buffer () = (* ··· *)

fun insert (BUF {data, mu, dataAvail, dataEmpty}, v) =
    (* !data is now within waitLp, so it's never stale. *)
    fun waitLp () =
      case !data
        of NONE => (data := SOME v; signal dataAvail)
         | SOME v => (wait dataEmpty; waitLp ())
    withLock mu waitLp ()

fun remove (BUF {data, mu, dataAvail, dataEmpty}) =
    fun waitLp () =
      case !data
        of NONE => (wait dataAvail; waitLp ())
         | SOME v => (data := NONE; signal dataEmpty)
    withLock mu waitLp ()
Listing 2.3, updated to avoid race

Pretty small bug, and it doesn’t detract from the main point of the listing, which is to show how to use condition variables in a sort of “mutually recursive” style where dataEmpty wakes up insert which signals on dataAvail which wakes up remove.

This also underscores how difficult it really is to ensure correctness in the presence of concurrency! That’s exactly why I’ve been reading about all these language models for concurrency, to better understand how we can leverage our programming language to ensure our programs are correct by construction.