Trip Report: Ad-Hoc Meeting on Threads in C++
by Eric Niebler
October 17, 2006

Summary
The C++ standardization committee is hard at work standardizing threads for the next version of C++. Some members recently met to discuss the issues, and The C++ Source was there. Read on to learn what the world’s leading experts on concurrency are planning for C++0x.

C++ programmers have been writing multi-threaded programs for years, and yet the current C++ standard is conspicuously silent on the issue of multi-threading. C++ programmers have to do platform-specific gymnastics to launch a separate thread of execution within their process, and once they do, the standard gives them no guarantees to hang their hats on.

Nobody would disagree that C++ needs threads, least of all the standardization committee, which recognized the need as early as 1992. Adding concurrency to C++ has been a major initiative throughout the planning of C++0x, the next version of the standard, with thread-related papers and proposals occupying much of the committee’s time, both at meetings and between them. Domain experts and interested parties of all sorts attend in-between meetings, called “ad-hoc” meetings, to share information and get early feedback on various proposals. Just such an ad-hoc meeting convened on August 23-25 in Redmond, Washington to discuss concurrency. The result was three days of detailed and dizzying discussion and debate. Read on to learn which direction the C++ committee is leaning on these important issues!

The Players

The meeting was called by Hans Boehm of HP and Herb Sutter of Microsoft. You may know Hans Boehm by his work on the excellent Boehm Collector,1 a conservative garbage collector for C and C++. Herb Sutter should need no introduction here. A prolific C++ author, Herb is also convener of the C++ standardization committee, and his article, “The Free Lunch Is Over,”2 was the call to arms in the concurrency revolution. Also in attendance were representatives of most of the C++ world’s biggest movers and shakers: IBM, Intel, AMD, Apple, Microsoft, Google, ARM, HP, and Sun, just to name a few. The cross section ran the gamut, from hardware designers to compiler writers to library and application developers—everyone had a chance to air their unique perspectives and concerns.

The Issues

The group explored multi-threading issues in two broad domains:

  1. What guarantees can C++ programmers expect from the compiler and the processor in the presence of concurrent memory accesses?
  2. What abstractions can C++ programmers use to express parallelism in their code?

The first question can be rephrased as, “What is C++’s memory model?” The second question is less esoteric, and amounts to, “What does a portable C++ threading library look like?”

Memory Model

At the source level, C++ programs appear sequential: first do this, and then do that. However, compilers and processors often reorder the instructions in a program to increase performance. This is kosher so long as the reorderings do not change the observed behavior of the program. But when an extra thread is in play, the rules of the game change because one thread can sometimes tell when instructions in another thread have been reordered. The effects can be nothing short of hair-raising. For instance, one thread may “see” the effects of a write to memory before it sees whatever caused it. People in general like causes to precede effects, and programmers are no exception. “You mean the compiler is allowed to do that?!” was a common refrain, even among this group of experts.

A memory model answers two questions:

  1. Which reorderings are allowed on memory reads and writes within a thread of execution?
  2. In what order are writes to memory in one thread allowed to become visible to other threads?

Without coherent answers to these two questions, quite literally anything is possible. A good memory model should bring reality in line with programmers’ expectations. Describing C++’s memory model was first on the agenda.

As it turns out, there is no shortage of smart people who have been giving this issue thought lately. We heard from no less than 3 different groups working on a memory model for C++! The first is the C++ standardization committee, as represented by Hans Boehm. The second is Microsoft, represented by Herb Sutter, who is trying to define a consistent memory model across all of their native platforms. The third is the OpenMP group, represented by Michael Wong of IBM and Bronis R. de Supinski of LLNL. OpenMP is a consortium that has garnered broad industry support for their pragma-based model for expressing concurrency in C++. They have run smack into the same problem that Microsoft and the broader C++ community have—the lack of portable concurrency guarantees. Rather than competing, these three groups are now working together to come up with one unified memory model, or at the very least, to avoid coming up with three mutually incompatible memory models.

Hans Boehm and Herb Sutter both presented very detailed and well-thought out memory models. Their differences are subtle and important, but in broad strokes, both proposals paint a similar picture. In particular, both proposals:

  • Specify a set of atomic (aka, interlocked) primitive operations.
  • Explicitly specify the ordering constraints on atomic reads and writes.
  • Specify the visibility of atomic writes.
  • Disallow speculative stores on potentially shared objects.
  • Disallow reading and re-writing of unrelated objects. (For instance, given struct S { char a, b; }; it is not OK to modify b by reading in the whole struct, bit-twiddling b, and writing the whole struct back, because that would interfere with another thread that is trying to write to a.)

In addition, both proposals seek to define what constitutes a data race; that is, when one thread is writing to an ordinary (non-atomic) variable that another thread is using, and what guarantees if any can be made in a program that contains a race. Both Herb and Hans would agree that data races are bad, and that if you have one in your program, surprising things will happen.

An atomic read or write is one that must happen all at once, so that it’s impossible for an observer to see an erroneous value as a result of a partial write. (Windows programmers will recognize these as interlocked operations, because of the APIs like InterlockedIncrement() and InterlockedCompareExchange().) Some atomic operations may have ordering constraints. Ordering constraints are inviolable decrees to the compiler and processor: “Thou shalt not reorder memory accesses across this instruction.” For instance, an acquire constraint is like a one-way barrier; memory operations can be moved below the acquire, but not above it. A release constraint is the opposite; memory operations can be moved above the release, but not below it. The names “acquire” and “release” are evocative of a critical section, and that’s a helpful way to think about it. It’s generally safe to move code into a critical section (down, after the acquire, and up, before the release), but not the other way. We want compilers and processors to have some latitude in reordering our programs because the performance wins can be huge. But too much latitude makes it impossible to write correct programs. The ordering constraints give us the knobs to turn to make our programs correct.

Hans Boehm’s proposal3,4 gives programmers powerful, explicit, low-level control of the ordering constraints. For every atomic operation (e.g., load, store, fetch_add, etc.), we have four ordering constraints to choose from: acquire, release, both, and “raw”. For instance, we might specify atomic reads and writes as follows:

atomic<int> x(0);          // An atomic integer
int i = x.load<acquire>(); // atomic read with acquire semantics
x.store<release>(i);       // atomic write with release semantics

Raw proved to be the most controversial ordering. It basically means “no ordering constraint,” and would only be useful to the most performance hungry, insane, low-level hacker type. (That’s not just my characterization; that’s how it was presented.) The ability to specify the ordering constraint or lack thereof for every atomic memory access is both powerful and dangerous. As we'll see, it is one of the places where Hans’ and Herb’s proposals differ.

Hans illustrated his memory model with a series of enlightening concrete examples. One of these sparked considerable debate. Consider the following use of atomic variables x and y, which are initially 0, and ordinary local variables r1 and r2.

Thread 1                   Thread 2
y = 1;  // atomic store    x = 1;  // atomic store
r1 = x; // atomic load     r2 = y; // atomic load

The question Hans put to the group is, “Is r1 == r2 == 0 a possible result?” Under Hans’ original proposal the answer, somewhat surprisingly, is “yes,” although he admitted to some fence-sitting. To make sense of this example, you need to know that for y, which has type atomic<int>, Hans defined that:

  • “y = 1” is equivalent to y.store<release>(1), and
  • “r1 = y” is equivalent to r1 = y.load<acquire>()

... and likewise for x.

Hans explained the logic of his reasoning. A store<release> prevents code from moving down across it, and a load<acquire> prevents code from moving up across it. But there is no ordering restriction on a store<release> followed by a load<acquire>. As a result, each thread’s load may be hoisted above its store, so the code might actually execute as if it were written:

Thread 1 (reordered)    Thread 2 (reordered)
r1 = x;                 r2 = y;
y = 1;                  x = 1;

Now both loads can execute before either store, leaving r1 == r2 == 0.

The ordering rules may be logical, but in this case, the result came as a bit of a shock to everyone, especially Herb who felt this was far too dangerous. Won’t unsuspecting programmers expect such a straightforward use of atomics to be fully ordered? Most of the room agreed, and Hans has since altered his proposal so that the conversions between atomic<T> and T are fully ordered. You may lose some performance, but you get to keep your sanity, and the lower-level load<acquire> and store<release> are still there for all the insane hacker-types out there.

After Hans’ presentation was Herb’s. Herb’s memory model5 is clean and simple but less expressive. It does away with explicitly specified ordering constraints as too low-level and error prone. Locks and “interlocked” (aka, atomic) operations which are always fully ordered are the way to go. It is impossible to see an “event” before its cause. (Here, an “event” is a lock/atomic operation, or a batch of normal operations that are book-ended by a lock or an atomic operation.) It’s an appealing model that is relatively easy to reason about. The general feeling in the room was that if there ever were such a thing as multi-threading for the masses, this is it.

The issue of ordering constraints illustrates a difference in philosophy between the two memory models. Hans’ proposal, in the traditional spirit of C and C++, gives programmers access to the hardware at the lowest levels and the freedom to do very crazy things very quickly. Well and good. Herb’s approach is to give C++ programmers, compiler writers and hardware architects a simple and sane mental model, which will hopefully result in someone somewhere writing a correct multi-threaded application some day—also a worthy goal. These two philosophies are only in apparent conflict. A straw poll (which is totally non-binding, by the way) showed near universal consensus to keep the low-level “raw” atomic APIs, possibly with intentionally uglified names, and higher-level interfaces that make stronger guarantees, for use by us mere mortals.

Not all the dragons were so easily defeated, unfortunately. Among the issues guaranteed to waste at least 20 minutes of group time with little or nothing to show:

  • Are threads cancelable?
  • What does volatile mean?
  • What interfaces should we provide for C interoperability?

The cancellation and volatile issues seem particularly intractable, and it’s unlikely C++0x will have much to say about either.

The Abstractions

With the memory model proposals on track, the group turned its attention toward other matters; namely, how will C++ programmers express parallelism in their code? What does launching a thread look like? Or joining a thread? What higher-level synchronization facilities do we put in the standard (locks certainly, but condition variables? queues?), and what do those look like?

Thread Synchronization

Lawrence Crowl of Google threw out the first pitch. The problem with a library-based threading interface, said Lawrence, is the issue of C compatibility. If we put classes and templates in the interface, there is no easy way for programmers to use those from straight C code. Instead, Lawrence proposed adding keywords for launching and joining threads, locking a scope and waiting on and notifying condition variables.6 For Lawrence, delegating work to another thread should look something like this:

int function( int argument ) {
    int join pending = fork work1( argument );
    // work1 continues concurrently
    int value = work2( argument );
    return value + join pending;
}

Here, fork and join are placeholders for new keywords yet to be chosen. This is bold! Generally, the committee avoids language changes at all costs, preferring pure library solutions wherever possible. Lawrence was obviously aware of the fire his proposal would draw, but he eloquently defended the needs of C programmers. After some heated debate, the general consensus was that, while C interoperability is certainly a good thing, language changes are not the right approach. The most damning objection was that, since the C committee is not actively involved in the discussion, whatever we came up with by way of C compatibility might not actually pass muster with them, and we may have compromised our design for nothing. The group turned its attention to more traditional library proposals.  

Howard Hinnant of Apple presented his design for synchronization objects.7 Howard began with the mutex and lock classes from Boost.Threads and loosened the coupling between them. He then added a pile of useful and powerful functionality, including reader/writer mutexes and a clean and elegant syntax for upgrading a read lock into a write lock using move constructors. For instance, you might want to grab a read lock, and then upgrade it to a write lock:

upgradable_mutex mut;
void foo() {
   upgradable_lock<upgradable_mutex> read_lk(mut);
   // do read operation
   if ( some_condition() ) {
       scoped_lock<upgradable_mutex> write_lk(move(read_lk));
       // do write operation
   }
}

Few people know mutexes and locks as well as Howard, and the only nit the group could pick was about his spelling of the word “upgradeable.” Howard’s design for condition variables was likewise well received.

Thread Launching and Joining

A slightly more contentious issue was that of thread launching and joining. Pete Becker of Dinkumware has a very mature proposal8 based on Dinkumware’s own CoreX threading library, which itself is based on the Boost.Threads design. It is a traditional OOP design, with a noncopyable std::thread class that is used both for launching and joining threads. This design has seen heavy industry use over the past few years, a fact that scores major points with the committee, and rightly so. In the past when the committee has decided to innovate rather than standardize existing practice, the results have sometimes been less than stellar.

Kevlin Henney has a different take on thread launching and joining.9 Although Kevlin couldn’t attend this meeting, many of his ideas were championed by Howard Hinnant. Kevlin views threads less as objects and more as asynchronous function calls. This functional viewpoint led Kevlin to separate the concerns of launching and joining into two objects: the threader and the joiner. A threader executes a function on a separate thread and returns a joiner, which can be used to wait for the function’s completion and to extract the return value, as follows:

void void_task();
int int_task();
...
threader run;
joiner<void> wait_for_completion = run(void_task);
joiner<int> wait_for_value = run(int_task);
...
int result = wait_for_value();
wait_for_completion();

In a sense, the joiner is a stand-in for the thread function’s return value. Users of other threading APIs might recognize this as a “future.” In fact, in the meeting some suggested that “joiner<>” should be spelled “future<>”. The threader may seem useless at first until you realize that it doesn’t necessarily have to launch a new thread for each task. Instead, it may have an internal pool of threads and a task queue!

One of the more interesting ideas floated was that a future<> type could have application outside the domain of threading. For instance, an asynchronous file I/O library might return its result as a future<>, as might a non-blocking network operation. All these different futures might be put in the same container, and then there could be a way to wait for one or all of them to complete—an intriguing possibility!

Howard showed that the two competing thread-launching designs are actually quite similar, and many in the room agreed that each proposal could benefit from the other. The thread/launcher/joiner/future debate is ongoing, but the way things look from here, we’ll probably end up with a layered approach that has a simple, low-level thread launching and joining facility like Boost’s, and higher-level facilities like futures built on top. Whether the higher-level facilities will make it into the standard this time around is still an open question. Let’s just hope that we can turn our future<proposal> into an actual proposal without blocking until after C++0x is done!

Threads and Exceptions

What happens if a thread terminates because of an uncaught exception? That was the question put to the group by Beman Dawes, a committee member and co-founder of Boost.org. It’s a profound question that goes straight to the heart of C++’s exception mechanism. It’s never a good idea to ignore an exception, so the fact that something went wrong should be signaled somehow. The question is how.

The logical place to signal that something went wrong is when joining the terminated thread. Consider:

void function_that_throws() {
    throw my_exception();
}
...
std::thread my_thread(&function_that_throws);
...
my_thread.join(); // 1

What happens on line 1? (I should point out here that although I’m using the Boost.Threads interface as an example, the same issue arises in Kevlin Henney’s proposal.) The reasonable options are:

  1. Do nothing. Swallow the exception.
  2. Throw some exception, not necessarily of the same type that terminated the thread.
  3. Throw a copy of the exception that terminated the thread.

Option (1) is not a very good idea. Something bad has happened; the rest of the program has a right to know. Option (2) is nice, but if you’re going to throw an exception, it should probably be the same as the one that caused the thread to terminate, right? So option (3) seems like the winner, except that it’s sadly not possible. The problem in a nutshell is illustrated below:

template<typename Fun>
void thread_fun(Fun fun) {
    try {
        fun();
    }
    catch(...) {
        // save the exception, but how?!
    }
}

If you want the join operation to throw a copy of the exception that terminated the thread, you’ll need to have a catch-all clause like above, and it will need to save a copy of the exception somewhere and then throw it in the context of the joining thread. That’s just not possible in the language today.

Beman presented several options. One is a library-based solution that only works on exception types derived from a certain base class. Another option is to simply require that it works, and not specify how it’s done. Compiler writers would do the necessary bit-twiddling behind the curtains. A third option would be to extend the language to give everybody the hooks needed to copy and rethrow exceptions of unknown type. A straw poll showed that few liked the It’s-Just-Magic second option. Both the first and third options got weak support, but at this point nobody really knows what a language extension would look like.

Conclusion

As you can see, there was plenty to keep the Ad-Hoc Thread Standardization meeting busy, and I didn’t even talk about Clark Nelson’s sequence point proposal,10 Intel’s Threading Building Blocks presentation, or Ion Gaztañaga’s process-launching and shared memory proposal,11 all of which were excellent. The meeting ended with pledges from everyone to keep the lines of communication open, and thanks to Microsoft for hosting the meeting and to Hans Boehm for his role as chief cat-herder. As Beman Dawes correctly observed to me over lunch, the group really would be in bad shape without Hans’ encyclopedic knowledge of threading issues.

future<c_plus_plus> is ( bright() );

Further Reading

Hans Boehm has collected useful links to thread-related information and standardization proposals on his website, which you can find here:

http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/.

Acknowledgements

I’d like to thank Hans Boehm, Herb Sutter and Howard Hinnant for checking this article for technical accuracy. Any errors that remain are entirely mine. Thanks are also due to Hans Boehm for his work spearheading the standardization effort and to Herb Sutter and Microsoft for hosting the Ad-Hoc C++ Threading Standardization Meeting.

End Notes

1. The Boehm Garbage Collector for C and C++, http://www.hpl.hp.com/personal/Hans_Boehm/gc/

2. Sutter, Herb. The Free Lunch is Over: A Fundamental Turn Towards Concurrency in Software, Dr. Dobb’s Journal, March 2005, http://www.gotw.ca/publications/concurrency-ddj.htm

3. Boehm, Hans. A Memory Model for C++: Strawman Proposal, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1942.html

4. Boehm, Hans. An Atomic Operations Library for C++, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html

5. Sutter, Herb. Prism: A Principle-Based Sequential Memory Model for Microsoft Native Code Platforms, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2075.pdf

6. Crowl, Lawrence. C++ Threads, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1875.html

7. Hinnant, Howard. Multithreading API for C++0X - A Layered Approach, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2094.html

8. Becker, Pete. A Multi-threading Library for Standard C++, Revision 1, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1907.html

9. Henney, Kevlin. Preliminary Threading Library Proposal for TR2, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1883.pdf

10. Nelson, Clark. A finer-grained alternative to sequence points, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1944.htm

11. Gaztañaga, Ion. Memory Mapped Files And Shared Memory For C++, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2044.html


About the author

Eric Niebler is an independent C++ consultant currently working with Dave Abrahams and Boost Consulting. Formerly of Microsoft Research, Eric has also written template libraries for Visual C++. He is the author of the GRETA Regular Expression Template Library. When not writing C++ for a living, he can often be found in coffee shops around Seattle writing C++ for fun.