Implementing thread-safe unordered sets without reducers
Published On : | October 29, 2009 12:00 AM PDT |
Rate |
Races and Reducers: the Risks and Rewards
One of the biggest challenges in building a parallel program is dealing with data races. Cilk++ offers several tools and techniques to find and eliminate races from your program. Reducers provide a a powerful mechanism for eliminating races, but as Spiderman said, "with great power comes great responsibility."
The Problem
I recently looked at a customer problem that at first looked like an obvious application for reducers. Briefly, the task was to sort a collection of objects into separate bins based on certain properties of the objects. The simplest parallel implementation, converting the for loop to a cilk_for loop, introduces data races on the bins if two or more objects can be put into the same bin.
At first look, this seemed like a good opportunity to use reducers. The contents of each bin is a list of elements, so why not just have each bin hold a list reducer? Reducers are extremely efficient when used sparingly, but unfortunately can become expensive if you create thousands or even millions of reducers, as would be needed in this case.
The other safe approach to handling races is to use locks to protect access to the shared memory locations. In this case, that would either mean using a single lock to protect the entire collection (safe, but bad because the high contention on this lock would eliminate all advantages of parallelism) or to create and acquire a lock for each bin (also safe, but bad because it requires thousands or millions of expensive locks.)
Luckily, there is another way.
The Approach
Mutex locks provided by the operating system are expensive to create and acquire, and are generally more heavy weight than is required for this problem. However, there is a much lower-cost alternative. The underlying hardware offers atomic instructions, which are effectively locks around single hardware instructions. These are the same instructions that are used to implement mutexes, but we can use them directly to save a significant amount of overhead for our problem at hand. It is important to note that there is no free lunch - atomic instructions require a memory barrier that can cost 40 or more instruction cycles. This is far more expensive than the few cycles used by an unlocked read/write, but still much less than the thousands required to create and acquire mutexes. You can think of these as small locks that lock just the read/write hardware operation to ensure that no other processor can race on the memory between the read and the write.
My approach is to use the atomic instruction to safely update the bins in a way that will not create data races. As with any use of locks, this approach prevents data races, but does not guarantee a deterministic result. In this case, each bin will hold the correct list of elements, but the order of the elements in the list is indeterminate.
The Solution
An atomic swap operation for pointers is available under different names in both Gnu and Windows C/C++ compilers:
GNU: | __sync_lock_test_and_set |
MSVC: | InterlockedExchangePointer |
An implementation of a thread-safe sets can be implemented as a template with the class C of the set element as the template parameter:
#ifdef __GNUC__ |
0 意見 (+add yours?)
張貼留言