This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Update: Faster population counts
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Short version: I've updated my popcount benchmark
to include code written by Imran Haque (SSE2 and SSSE3 version) and
Kim Walisch (improved platform-independent bitslicing and OpenMP
support).
In 2008 I wrote a series of essays on how to do a Tanimoto
comparison between two fingerprints. The most compute-intensive
part computes the population count of a set of bits, that is, the
number of bits which are set to 1.
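For concreteness, the population count can be defined by the obvious bit-at-a-time loop. (This is far too slow for fingerprint work, which is the whole point of what follows; the function name is mine, for illustration.)

```cpp
#include <cstdint>

// Naive population count: test each bit in turn.
// Illustrative only; the fast methods below avoid this per-bit loop.
int popcount_naive(uint64_t x) {
    int count = 0;
    while (x) {
        count += x & 1;  // add the lowest bit
        x >>= 1;         // move the next bit into position
    }
    return count;
}
```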
This has a long history in computing, with many variations of how to
compute it. The earliest reference I know of is cited by Knuth as the
"Gillies-Miller method for sideways addition" in The Preparation of
Programs for an Electronic Digital Computer by Wilkes, Wheeler,
and Gill, second edition, 1957, 191-193. This citation comes from Knuth's
MMIXware: a RISC computer for the third millennium and is also
in his Pre-fascicle
1A to volume 4 of his The Art of Computer Programming. In
modern terms, this is a "bit-twiddler" solution.
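The modern descendant of that sideways addition is the well-known 64-bit SWAR reduction. A sketch of the idea (my own version, not the original Gillies-Miller code):

```cpp
#include <cstdint>

// Classic "bit-twiddler" popcount: compute pairwise sums of ever-wider
// bit fields, then use a multiply to gather the byte sums into the top byte.
int popcount_swar(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                            // 2-bit sums
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);  // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                            // 8-bit sums
    return (int)((x * 0x0101010101010101ULL) >> 56);                       // total in top byte
}
```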
However, many other techniques exist. I described a number of them in
my essay HAKMEM
169 and other popcount implementations, and collected them into a
benchmark so others could test them out. My conclusion in 2008 was
that the lookup table was competitive with the best bit-twiddler
solution; the fastest algorithm depended on the hardware/compiler
combination.
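For comparison, the lookup-table approach can be sketched like this (a simplified version, not the exact code from the benchmark):

```cpp
#include <cstdint>

// 8-bit lookup table: table[b] holds the number of set bits in byte b.
static unsigned char table[256];

// Fill the table using the recurrence popcount(i) = (i & 1) + popcount(i/2).
void init_table() {
    for (int i = 1; i < 256; i++)
        table[i] = (unsigned char)((i & 1) + table[i / 2]);
}

// One table lookup per byte of the 64-bit word.
int popcount_lut(uint64_t x) {
    int count = 0;
    for (int i = 0; i < 8; i++) {
        count += table[x & 0xFF];
        x >>= 8;
    }
    return count;
}
```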
Times have changed. Hardware has changed. Have the results changed?
Get the new
benchmark and judge for yourself.
My answer is that if you have a chip with the POPCNT instruction
built-in then use it. I still don't have one of those chips, but I
know someone who does, who has given me some numbers.
Failing that, there's a 4-bit lookup table you can implement with
processor-specific SSSE3 code.
Failing that, there's a portable bitslice implementation which is only
25% slower than the hardware POPCNT instruction!
Finally, if you have OpenMP then you're likely memory bound with any
of these fast implementations.
Built-in popcount
If you use GCC (or clang or Intel's icpc) then you can use the
__builtin_popcount and __builtin_popcountll functions. These are
portable across every platform, but remember to tell GCC to compile
for the specific architecture by using the "-mpopcnt" or "-msse4.2"
flag. Otherwise you'll be stuck with a rather slow implementation.
(Why is GCC's fallback implementation slow? Because it is optimized
for infrequent use, and can't share the setup cost or cache cost the
way the fastest solutions do.)
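A sketch of that usage (the function name is mine, not from the benchmark; compile with, say, "-O3 -mpopcnt" so GCC emits the POPCNT instruction instead of its slow fallback):

```cpp
#include <cstddef>
#include <cstdint>

// Sum the set bits across a fingerprint stored as 64-bit words.
// With -mpopcnt (or -msse4.2) each __builtin_popcountll call
// compiles down to a single POPCNT instruction on GCC/clang.
int fingerprint_popcount(const uint64_t *words, size_t n) {
    int count = 0;
    for (size_t i = 0; i < n; i++)
        count += __builtin_popcountll(words[i]);
    return count;
}
```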
Shuffling with SSSE3
Years ago I implemented popcount on my PowerPC Mac. The Altivec manual
described a neat trick, which is also available on newer Intel chips
with the "shuffle" instructions. Briefly, pack into one 128-bit word
the byte values {0, 1, 1, 2, 1, 2, ..., 4}, which are the population
counts for the values 0..15. With a shuffle instruction and some care
you can use this as a 4-bit lookup table. The advantage of this over
an 8-bit lookup table is that you can keep the table in a register.
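Here is a sketch of that shuffle trick with the SSE intrinsics (my own simplified version, not Haque's tuned kernel from the paper):

```cpp
#include <cstdint>
#include <tmmintrin.h>  // SSSE3 intrinsics

// Popcount of one 128-bit word using PSHUFB as a 4-bit lookup table.
// The target attribute lets GCC/clang emit SSSE3 code for this one
// function even when the file isn't compiled with -mssse3.
__attribute__((target("ssse3")))
int popcount_ssse3(__m128i v) {
    // Population counts of the nibble values 0..15, kept in one register.
    const __m128i lookup = _mm_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3,
        1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i low_mask = _mm_set1_epi8(0x0F);

    __m128i lo = _mm_and_si128(v, low_mask);                     // low nibbles
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);  // high nibbles
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lookup, lo),
                               _mm_shuffle_epi8(lookup, hi));    // per-byte counts

    // Horizontal sum of the 16 byte counts via sum-of-absolute-differences.
    __m128i sums = _mm_sad_epu8(cnt, _mm_setzero_si128());
    return _mm_cvtsi128_si32(sums) + _mm_extract_epi16(sums, 4);
}
```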
I tend to stay away from assembly programming, or C programming which
is this close to assembly. Thankfully, Imran Haque did the work for
me. Well, more for a collaboration he was doing with Vertex. The end
result is Anatomy of High-Performance 2D Similarity
Calculations [DOI:
10.1021/ci200235e | author's copy] and
the SSSE3 implementation is available as part of the supplemental
information.
I haven't yet had the chance to test it on a machine with the POPCNT
instruction, but estimating from the data I have it's about 5% slower
than using __builtin_popcountll.
Slicing with SSE2
Imran also implemented three different variations of the bitslice
algorithm, using the SSE2 intrinsics. The 8-bit version is about 8%
faster than the 32-bit version and about twice as fast as what I had
called "Bitslice(24)."
However ... !
Improved portable bitslicing
Kim Walisch contributed a better version of Cédric Lauradoux's
original bitslice implementation. It's portable across
the GCC, Microsoft, and Intel compilers and on my laptop it's only 1%
slower than the fastest SSE2 version and only 25% slower than the
hardware popcount.
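I haven't reproduced Kim Walisch's code here, but the flavor of merged bit counting behind these implementations can be sketched: reduce each word to per-byte sums with the usual masks, accumulate those byte fields across up to 31 words (so no byte can overflow), and pay the final gathering step only once per block. A simplified sketch, not his actual code:

```cpp
#include <cstddef>
#include <cstdint>

// Merged popcount over an array of 64-bit words.  Each word is reduced
// to per-byte counts (each <= 8); those byte fields are accumulated
// across up to 31 words (31 * 8 = 248 <= 255, so no byte overflows),
// and the cross-byte gather is done once per block of 31.
uint64_t popcount_merged(const uint64_t *data, size_t n) {
    const uint64_t m1 = 0x5555555555555555ULL;
    const uint64_t m2 = 0x3333333333333333ULL;
    const uint64_t m4 = 0x0F0F0F0F0F0F0F0FULL;
    uint64_t total = 0;
    size_t i = 0;
    while (i < n) {
        size_t chunk = (n - i < 31) ? (n - i) : 31;
        uint64_t acc = 0;                    // eight parallel byte accumulators
        for (size_t j = 0; j < chunk; j++) {
            uint64_t x = data[i + j];
            x -= (x >> 1) & m1;              // 2-bit sums
            x = (x & m2) + ((x >> 2) & m2);  // 4-bit sums
            acc += (x + (x >> 4)) & m4;      // 8-bit sums, accumulated
        }
        // Widen bytes to 16-bit fields, then sum the four fields into
        // the top 16 bits with one multiply.
        acc = (acc & 0x00FF00FF00FF00FFULL) + ((acc >> 8) & 0x00FF00FF00FF00FFULL);
        total += (acc * 0x0001000100010001ULL) >> 48;
        i += chunk;
    }
    return total;
}
```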
My Timings
My laptop is an Apple MacBook Pro with a 2.53 GHz Intel Core 2
Duo. I used gcc 4.2.1 to generate these timings:
Kim Walisch also rewrote the benchmark to use OpenMP. Use the
"-fopenmp" flag to enable OpenMP support with GCC. On my Macs that means:
g++ -O3 popcnt.cpp -o popcnt -mssse3 -fopenmp
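The parallelization pattern is the usual OpenMP reduction over the loop; a minimal sketch of the shape (not the benchmark's actual driver):

```cpp
#include <cstddef>
#include <cstdint>

// Parallel popcount over a buffer: each thread sums its share of the
// words, and reduction(+:total) combines the per-thread sums.
// Compile with: g++ -O3 -fopenmp file.cpp  (without -fopenmp the
// pragma is ignored and the loop simply runs serially).
uint64_t popcount_parallel(const uint64_t *data, size_t n) {
    uint64_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += (uint64_t)__builtin_popcountll(data[i]);
    return total;
}
```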
My desktop can run four threads, and the timings improve by 17% for
the SSSE3 and 30% for the Lauradoux implementations. Indeed, the
timings are competitive when OMP_NUM_THREADS=4 on my desktop:
I've updated his report so the benchmark names and order match the
current suite, but you can see I'm missing timings for some of Imran's
work.
Running with OMP_NUM_THREADS=4, and updating the names and order to
match the current benchmark, gives these results (you can see I'm
missing the SSE3 benchmark):
You can quite clearly see that the popcount instruction is fastest,
but once you have multiple threads running, as the paper by Haque et
al. points out, you are memory-bandwidth bound, not CPU bound. In
other words, it's better to do multiple different tasks with the same
data, if you can.