OpenMP vs. POSIX threads
The changes to the code were almost trivial, so I asked Kim if he
would help me add OpenMP support to chemfp. He did,
and it will be part of the upcoming 1.1 release.
OpenMP is only one of many ways to make effective use of multiple
processors. Another common way is through POSIX threads, or its
equivalent on Windows. A third is to spawn off a new process and use
IPC to communicate with it.
How do these techniques compare to each other?
I decided to use Python 3.2's new concurrent.futures
module to handle the multithreaded and multiprocess cases, or rather,
the backport to
2.x. This merges the underlying "threading" and "multiprocessing"
APIs into a common form, based on the "future"
concept for asynchronous programming. I quickly found that the
multiprocess API had too much overhead, so I won't talk about it.
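If you haven't seen the module, here is a minimal sketch of the future-based style, not the actual benchmark code; "work" is just a stand-in task, and on Python 2.x the module comes from the "futures" backport package on PyPI:

# Minimal sketch of the "future" style: the same code runs with
# threads or with processes just by swapping the executor class.
from concurrent import futures

def work(n):
    # Stand-in for a real task.
    return n * n

def run(executor_class):
    with executor_class(max_workers=4) as executor:
        # submit() returns a Future right away; result() blocks until done.
        pending = [executor.submit(work, n) for n in range(8)]
        return [f.result() for f in pending]

if __name__ == "__main__":
    print(run(futures.ThreadPoolExecutor))
    print(run(futures.ProcessPoolExecutor))  # same API, but more overhead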
My test case computes and stores the NxN Tanimoto similarity matrix
between a set of fingerprints. I took N=110885 compounds from the ZINC
data set and generated 2048-bit fingerprints using RDKit's hash
fingerprint. The chemfp fingerprint type is "RDKit-Fingerprint/1
minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=4 useHs=1".
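If you want to make comparable fingerprints yourself, here is roughly how those parameters map onto a direct RDKit call. It's only a sketch, not how the actual data set was generated, and the SMILES is a placeholder compound rather than one of the ZINC structures:

# Sketch: an RDKit hash fingerprint with parameters along the lines of
# the chemfp type string above.
from rdkit import Chem

mol = Chem.MolFromSmiles("c1ccccc1O")  # placeholder compound
fp = Chem.RDKFingerprint(mol, minPath=1, maxPath=7, fpSize=2048,
                         nBitsPerHash=4, useHs=True)
print(fp.GetNumOnBits())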
I don't actually save the 12,295,483,225 values. For one, simple
symmetry reduces that to 6,147,686,170 values. For another, in the
problems I'm interested in, I can ignore similarities below a
threshold. Instead, chemfp internally uses a sparse matrix format.
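To make the threshold idea concrete, here is a toy version of the computation in pure Python; the real chemfp code is C, works on packed byte strings, and writes into its own sparse matrix format rather than a Python list:

# Toy version: thresholded Tanimoto comparison of bit-vector
# fingerprints held as Python integers, keeping only the upper
# triangle and only the scores at or above the threshold.
def tanimoto(fp1, fp2):
    both = bin(fp1 & fp2).count("1")
    either = bin(fp1 | fp2).count("1")
    return float(both) / either if either else 0.0

def upper_triangle_hits(fps, threshold):
    hits = []  # sparse (row, column, score) entries
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            score = tanimoto(fps[i], fps[j])
            if score >= threshold:
                hits.append((i, j, score))
    return hits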
For this test I used a simple parallelization. I break up the rows of
the matrix into bands, and fill in the parts of the upper-right
triangle which are at or above the threshold. In the OpenMP version,
all the OpenMP threads work on a single band. In the
concurrent.futures version, each thread processes its own band. These
differences fall naturally out of how those two APIs work.
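Here is a sketch of the band decomposition on the concurrent.futures side; search_band is a hypothetical placeholder for the chemfp call that fills in one band of the sparse matrix:

# Sketch of the concurrent.futures decomposition: each worker thread
# gets its own contiguous band of rows.
from concurrent import futures

def search_band(fps, start_row, end_row, threshold):
    # Stand-in for the C search that compares rows
    # [start_row, end_row) against the rest of the matrix.
    return []

def threaded_search(fps, threshold, num_threads):
    n = len(fps)
    band_size = (n + num_threads - 1) // num_threads
    with futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        jobs = [executor.submit(search_band, fps, start,
                                min(start + band_size, n), threshold)
                for start in range(0, n, band_size)]
        return [job.result() for job in jobs]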
Once I got both implementations working, debugged, and optimized, I
could finally gather some performance numbers. I wanted to see how OpenMP
and pthreads scale over a range of processors and range of threshold
values. My desktop has two dual-core CPUs, so I decided to rent time
on a "High-CPU Extra Large Instance" Amazon EC2 node. It has 8 nodes
of the form:
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
stepping : 10
cpu MHz : 2333.336
cache size : 6144 KB
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall
nx lm constant_tsc rep_good aperfmperf pni ssse3 cx16 sse4_1 hypervisor lahf_lm
bogomips : 4666.67
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
Sadly, this is not a machine which supports the POPCNT
instruction, so the chemfp popcount code fell back to Imran Haque's SSSE3-based implementation.
The machine has about 6GB of free memory. I knew from testing on my
desktop that I only needed 4.1 GB for the worst case, so I didn't run
into memory problems.
I ran my benchmark code through a set of combinations of OpenMP
vs. pthreads, with thread counts from 1 to 8, and with threshold
values of 1.0, 0.99, 0.97, 0.95, 0.93, 0.9, 0.88, 0.85, 0.8, 0.75,
0.7, 0.65, 0.6, 0.55, and 0.5. I also ran each combination several
times to get some idea of the timing variation. Yes, that took about
20 hours to run.
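The driver amounts to a set of nested loops over those parameters. Here is the rough shape of it, not the actual script; run_search is a hypothetical stand-in for calling into either the OpenMP or the concurrent.futures search, and the repeat count is arbitrary:

# Rough shape of the benchmark driver: every (implementation,
# thread count, threshold) combination, repeated a few times,
# keeping all the elapsed times so the variation is visible.
import time

THRESHOLDS = [1.0, 0.99, 0.97, 0.95, 0.93, 0.9, 0.88, 0.85,
              0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5]

def benchmark(run_search, num_repeats=3):
    timings = {}  # (method, num_threads, threshold) -> list of seconds
    for method in ("openmp", "threads"):
        for num_threads in range(1, 9):
            for threshold in THRESHOLDS:
                for _ in range(num_repeats):
                    start = time.time()
                    run_search(method, num_threads, threshold)
                    elapsed = time.time() - start
                    key = (method, num_threads, threshold)
                    timings.setdefault(key, []).append(elapsed)
    return timings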
You can get the raw
data if you're really interested. I used matplotlib to make a
couple of 3D plots. First, here are the overall times:
You can see that the OpenMP code (in red) is usually faster than the
pthread code (in blue). The exception is for thresholds of 0.55 and
lower. BTW, a threshold of 0.5 finds 285,371,794 matches in the NxN
matrix, which means this stores a few gigabytes of data.
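For reference, here is the rough shape of that kind of matplotlib 3D plot. It's a sketch rather than the actual plotting script, and best_times is a hypothetical mapping from (num_threads, threshold) to the fastest time in seconds:

# Sketch of a matplotlib 3D surface over the timing grid.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the "3d" projection

def plot_times(best_times, thresholds, max_threads=8):
    threads = np.arange(1, max_threads + 1)
    X, Y = np.meshgrid(threads, thresholds)
    Z = np.array([[best_times[n, t] for n in threads] for t in thresholds])
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.plot_surface(X, Y, Z, color="red")
    ax.set_xlabel("number of threads")
    ax.set_ylabel("threshold")
    ax.set_zlabel("time (seconds)")
    plt.show()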
To make more sense of this data, here's a plot of the speedup, defined
as T0/T where T0 is the fastest single-threaded time for a given
threshold and T is the fastest time for a given number of
threads. Perfect speedup would give a value of 8 for 8 processors.
The region near threshold=1.0 is so jagged because the search time is
less than the variability in the system, and is close to the minimum
time resolution size of 1 second.
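In terms of the timing data, one way to read that definition is the following hypothetical helper, assuming best_times holds the fastest time for one implementation at each thread count and threshold:

# Speedup per thread count: fastest single-threaded time for the
# threshold divided by the fastest time at each thread count.
def speedups(best_times, threshold, max_threads=8):
    t0 = best_times[1, threshold]
    return [t0 / best_times[n, threshold]
            for n in range(1, max_threads + 1)]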
Most people in this field use thresholds in the range 0.7-1.0. It's
obvious from this graph that OpenMP is the right solution. It's almost
always faster, and overall it makes much better use of a multi-core
machine.