This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: My views on OpenMP
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
In private email a correspondent observed that OpenMP makes threading
very easy, but "it really seems under utilized in the community."
(Here, 'community' is 'scientific programming.') I was surprised to
find out that I had strong views on the topic.
OpenMP sits among several other technologies:
GPU computing
cloud computing
POSIX and other common threading libraries
The new hotness is GPUs. Wes Faler gave a presentation at the recent
28th Chaos Communication Congress on Evolving
Custom Communication Protocols. He mentioned they ported C++ code
over to the GPU. The unoptimized version was 7 times slower on the GPU
than on the CPU. However, they do many evaluations using the same
function, and because there are so many compute threads in the GPU,
the overall time was a factor of 7 faster. Similarly, Haque et al.
showed that a 4 core desktop machine, properly tuned, was "only" about
5x slower than a GPU card.
It looks like GPU computing is currently the approach to take if you
do a lot of evaluation of similar tasks, assuming you have the GPUs
and programming time available. That performance (and the novel way of
computing) interests people who might otherwise use OpenMP.
Cloud computing is another hotness. Alex Martelli was recently
interviewed by Larry Hastings in Radio Free Python
episode #2. At 33:47 Larry asked about Python's global
interpreter lock and Alex's reply was:
"I hate threading anyway. Multiprocessing is the way to go, and
message-passing, not shared memory. That just doesn't scale. I use
multithreading so I can use all of my 16 cores, or whatever is the
average number of cores in a machine these days. Big furry deal. I've
got a few thousand servers waiting for me in the data center and how
do I use those with threading?"
The topic comes up several times in the ensuing discussion.
What good indeed is OpenMP, which might be used for a 16-core machine,
if you're working on problems which involve 10,000 distributed
servers?
Even single nodes have multiple cores these days, and a good OpenMP
implementation might help make good use of the nodes in that
cloud. However, you have to compare OpenMP to traditional POSIX
multithreading. OpenMP works for C/C++ and Fortran, but not for Python,
nor (it seems) for Java, nor for other languages which support pthreads.
You're out of luck if you want to use OpenMP with one of those other
languages.
Some things scale up wonderfully well by adding one or two OpenMP
directives, but parallelism is rarely as trivial as giving a few hints
to the compiler. I think that the non-trivial cases of parallelizing
with OpenMP are about as much work as using pthreads, or a system like
Grand Central Dispatch. I'll work through an example of doing that in my
next essay.
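As an illustration of the easy case, where one directive is enough, here is a minimal sketch in C. It is not the author's code: the function name sum_of_squares and the choice of a sum-of-squares loop are hypothetical, picked only because a reduction over an array is a common example of a loop that a single "parallel for" directive can handle.

```c
#include <stddef.h>

/* Hypothetical example: sum of squares of n doubles.
   The single directive below asks the compiler to split the loop
   across threads and to combine the per-thread partial sums.
   If OpenMP is not enabled, the pragma is ignored and the loop runs
   serially; the result is the same up to floating-point rounding. */
double sum_of_squares(const double *values, size_t n)
{
    double total = 0.0;
    long i;  /* signed loop index, accepted by older OpenMP versions too */

    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++) {
        total += values[i] * values[i];
    }
    return total;
}
```

With GCC the directive takes effect only when the file is compiled with -fopenmp. Other compilers use different flags, which is part of the configuration problem described further down.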
I do believe that OpenMP scales better than these alternatives for
some cases, in part because the compiler is doing the work rather than
using a library API. My tests so far show that pthreads and OpenMP
have about the same scaling with two processors, and I need four or
more cores to show a strong OpenMP advantage.
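To make the "compiler versus library API" contrast concrete, here is a rough sketch of what a simple array sum can look like when written against the pthreads API by hand. Everything in it is an assumption made for illustration: the names sum_of_squares_pthreads, sum_chunk and struct chunk, the fixed NUM_THREADS of 4, and the even chunking scheme. It is not the code used in the tests mentioned above, and it says nothing about relative performance.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4  /* hypothetical fixed thread count */

/* Work description and result slot for one thread. */
struct chunk {
    const double *values;
    size_t start, end;   /* half-open range [start, end) */
    double partial;      /* filled in by the worker */
};

/* Thread entry point: sum the squares in one chunk. */
static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    double total = 0.0;
    for (size_t i = c->start; i < c->end; i++)
        total += c->values[i] * c->values[i];
    c->partial = total;
    return NULL;
}

/* Returns 0 and stores the sum in *result on success,
   or -1 if a thread could not be created. */
int sum_of_squares_pthreads(const double *values, size_t n, double *result)
{
    pthread_t threads[NUM_THREADS];
    struct chunk chunks[NUM_THREADS];
    size_t per_thread = n / NUM_THREADS;
    int started = 0, status = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].values = values;
        chunks[t].start = (size_t)t * per_thread;
        /* the last thread also takes the leftover elements */
        chunks[t].end = (t == NUM_THREADS - 1) ? n
                                               : (size_t)(t + 1) * per_thread;
        chunks[t].partial = 0.0;
        if (pthread_create(&threads[t], NULL, sum_chunk, &chunks[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* Wait for every thread that did start, then combine partial sums. */
    double total = 0.0;
    for (int t = 0; t < started; t++) {
        pthread_join(threads[t], NULL);
        total += chunks[t].partial;
    }
    if (status == 0)
        *result = total;
    return status;
}
```

The chunking, thread creation, joining and combining of partial sums are all spelled out by hand here, which is the bookkeeping an OpenMP compiler generates from a directive. With GCC or Clang this is typically built with the -pthread flag.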
Most desktop/laptop computers just don't yet have 8+ cores. (Alex
Martelli said otherwise, but perhaps he's talking about Google's data
centers.) Most people develop for their own computers, which lessens
the incentive to work on good multicore scaling.
I have a four-core machine, and I'm willing to write a Python
extension in C which uses OpenMP. Even then I've run into some
difficulties. It took a while, but I figured out how to configure
Python's setup.py so it includes the right "use OpenMP" flag for each
compiler. That setup.py includes a hard-coded list of compilers which do
and do not support OpenMP. Also, did you know that on a Mac you must run
OpenMP tasks in the main thread, and not in a pthread? Otherwise your
program crashes, even when you have a single OpenMP thread! I had to
figure out a workaround so I could use my library unchanged inside
Django.
People are interested in OpenMP development, but some who might use
OpenMP are drawn to other technologies. Some tasks are very
appropriate for OpenMP, but they are almost as appropriate for other,
more common technologies. OpenMP scales well, but most people don't
have the hardware where OpenMP shines. Even when they do, they have to
work in one of a handful of languages, and in somewhat restricted
circumstances.
All of these factors reduce OpenMP's utilization in the community.