The goal of multithreading technology, as usual for software, is to raise the level of abstraction. Instead of starting, joining, and stopping threads, locking mutexes, and signaling semaphores, you'd like to be able to express yourself at a higher level, like "run this operation in parallel". You don't want to have to worry about figuring out how many free processors the machine has, load-balancing your threads across those processors, synchronizing the processing results of your threads, and so on, ad nauseam.
The most common technology right now for high-level threading is OpenMP. This is an industry standard for extensions to the C and C++ languages, implemented by compilers. It's a fairly successful standard, in that most major compilers these days actually support it. It provides a small function library and a set of pragmas that allow you to specify the intended parallel behavior of your code, and then the compiler generates the threading behind the scenes.
An up-and-coming contender is Intel's Threading Building Blocks library. This is a pure library, with no compiler dependencies. By means of generic (template-based) classes in the style of the standard C++ library, it provides powerful parallel-processing primitives (sorry, couldn't resist the alliteration), and handles all the threading itself.
The common factor between these two technologies is that both of them relieve you of ever having to deal with threads directly, and both of them gracefully degrade to the single-threaded case, which is extremely useful for debugging ("let me turn off the threading and see if the bug still happens").
Conveniently enough, I recently had to rewrite some OpenMP code in TBB, which gave me a great chance to compare their strengths and weaknesses. Bottom line, OpenMP is easier and results in clearer code, as long as you're dealing with a fairly simple problem. TBB is much more powerful, and scales much better when your problem is not so simple. For example, here's the same code written both ways:
float a[100], b[100], c[100];
// ... initialize a and b
OpenMP
#pragma omp parallel for
for (int cnt = 0; cnt < 100; ++cnt)
    c[cnt] = a[cnt] + b[cnt];
TBB
using namespace tbb;
class Adder {
    const float *myA, *myB;
    float *myC;
public:
    Adder(const float *a, const float *b, float *c) : myA(a), myB(b), myC(c) {}
    void operator()(const blocked_range<int>& r) const {
        for (int cnt = r.begin(); cnt != r.end(); ++cnt)
            myC[cnt] = myA[cnt] + myB[cnt];
    }
};

task_scheduler_init init;
parallel_for(blocked_range<int>(0, 100), Adder(a, b, c));
A bit more complicated in the TBB case, no? However, all that complexity means that if I suddenly decide to change the for loop, say, to a do-while loop, it's pretty easy with TBB. With OpenMP, not so much - in fact, some very simple things (such as breaking out of a loop being executed in parallel) are virtually impossible. That's the price you pay for a simple syntax.
Overall, if all you want is to decompose a processing step using data parallelism, OpenMP is the better bet. However, as your problem becomes more complicated, you may find that you need to switch over to TBB for its greater expressiveness and power.
6 comments:
Years of research and effort have gone into making application parallelism automatic and implicit. The intent has been to remove the need for error-prone, fine-grained lock "fiddling". The best approaches I have seen so far are in functional programming languages. When there are no side effects and all dependencies are known, a scheduler can move pieces around willy-nilly to any given core as needed. Unfortunately, C and C++ are the antithesis of side-effect free.
We all know it is very difficult to get locking correct. Debugging a nasty race condition with a low frequency of occurrence is programmer hell. In general I am in favor of a middle ground where the design intent is explicit and the load distribution is under control of an engine or scheduler.
For C and C++ and most traditional imperative languages, I prefer explicit parallelism via asynchronous queues. Building networks of processes/threads and asynchronous queues lays out an explicit design for your application. These systems are easier to monitor, so you can see where the bottlenecks are and fine-tune your application. Look at the dataflow programming model and products like Erlang, IBM MQ Series, Apache ActiveMQ. Today these provide coarse-grained, process-level parallelism and fault tolerance, but the model should work on multi-core as well. We already have a model that is more intuitive (to me at least) with a 25-year track record in industry. I think what we need is more parallel design patterns, not another library/compiler extension to hide away threading issues.
"20,000 quatloos on asynchronous queue based designs for efficient multicore utilization!"
Did all of that just come out of my mouth........
After re-reading my post I want to say that I wasn't really trying to bash OpenMP or TBB. I actually haven't used either. I was just adding that there are existing models that support high levels of parallelism that minimize the locking issues of shared memory based designs.
I agree that explicit parallelism using queues is the best model for task-based decomposition (that is, each thread is doing one thing, and they hand data around using queues). However, it's, shall we say, suboptimal for data decomposition.
Barring an "embarrassingly parallel" algorithm, where you can have multiple worker threads, each with its own queue, and a feeder thread decomposing the problem, there's no reasonable way to break a problem like this into explicitly parallel processes (using that word in its general sense, not "process" as in "heavyweight flow of execution") without having to deal with all the low-level crap like load-balancing your processors.
I accept that other languages probably do it better than C++. I have very little experience with functional languages; all my languages are incestuously related: C, C++, Java, C#, etc., so I can't really speak to that. Unfortunately, most of us don't get to choose the language that we do most of our work in, so solutions like TBB and OpenMP fill a need.
Given the new support for threads in C++0x, I'm hopeful that we'll see some very high-powered threading libraries built once the developers can be sure that their libraries will work on all standards-compliant platforms.
You make a good point about data decomposition. I tend to think about problems using tasked based decomposition. Like building a circuit. Lots of little processes all filtering inputs and generating outputs. This fits well with explicit parallelism with queues.
I agree that hand-built explicit parallelism using queues is only as good as the programmer who builds the network. It doesn't scale as well and requires the programmer to find the bandwidth bottlenecks. It is, however, easy to understand, and a sufficient solution can usually be found with a few iterations.
I will have to look at OpenMP to have a better basis for comparison.
FYI - this is mostly idealistic prattle from me anyway. I really don't have much religion as far as the "best" way to do anything. I pay the bills as a grunt embedded programmer working in C. No OS. No multi-core. Just me and the hardware. I do think about the multi-core revolution and the implications once my toaster has more computing power than the laptop I am writing this on.
One problem with task decomposition using queues is that its spatial locality characteristics truly suck. Think about it - I have a chunk of data that I'm handing between threads. If I've structured my system optimally, I have exactly as many logical threads as physical threads (processor cores), and none of my logical threads ever get pre-empted. That means that a thread spends its whole processing time getting a chunk of data into its processor cache, and then immediately has to discard the whole thing when it starts working on the next chunk. The next task has to load it all again into its processor cache. You basically never get a cache hit.
Even if you have a unified cache model, it still sucks, because the temporal locality is just as bad. Putting a chunk of data in a queue guarantees that it will sit there for some indeterminate amount of time. By the time the next task gets ahold of it, it's probably cooled off in cache completely, so he'll need to load it all again anyway.
If you have a unified cache, your cache is big enough to hold N chunks (where N = the number of threads), and you can get all your tasks to work in lockstep, you can get around the locality problems, but those are pretty stringent requirements, and they'll most likely introduce additional inefficiencies (e.g., that's probably a pretty small chunk size).
By the way, I don't have any religious axe to grind here either. Rereading my previous comment, I see that I might have come across as defensive about C++ threading. I'm not - it's just the language I have to use at work, so I have a vested interest in figuring out how to write efficient code.