Testing the bitslice algorithm for popcount
I want to compute the popcount of molecular fingerprints containing
1024 bits. My algorithm has been to compute the popcount 8, 16, 32,
or 64 bits at a time and repeat that for all of the words in the
fingerprint. There are a couple of algorithms which compute the
popcount of multiple words in parallel, doing a partial popcount of
each word and combining the intermediate results into a total over
all of the words.
Cédric Lauradoux wrote a nice overview of
Hamming weight, including explanations of the tree-merging and
bit-slicing algorithms. In personal email he pointed me to a page
from comp.arch.arithmetic where someone attributes the algorithm to
Robert Harley.
It took me a while to understand it. It's been rather a while since I
had to do much thinking about half-adders. About 17 years. But it's
not hard. I'm not going to explain it here, though; it's best shown
in pictures, and those would take me too long to make. You can follow
the links above instead.
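The central building block is small enough to show in code, though.
It is a full adder run across whole words: given three input words,
each bit position's sum (0 to 3) is split into a "ones" bit and a
"twos" (carry) bit. This is my sketch of that primitive, not code
from either source:

    #include <stdint.h>

    /* Word-parallel full adder: at each bit position,
       a + b + c = 2*(bit of *twos) + (bit of *ones). */
    static void csa(uint64_t *twos, uint64_t *ones,
                    uint64_t a, uint64_t b, uint64_t c)
    {
        uint64_t partial = a ^ b;            /* half-adder sum of a and b */
        *twos = (a & b) | (partial & c);     /* carries from both additions */
        *ones = partial ^ c;                 /* final sum bits */
    }

The combined popcount of the three inputs is then
2*popcount(twos) + popcount(ones), so three per-word popcounts are
reduced to two.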
While he hasn't finished the overview explanation on how to combine
the two algorithms, his Hamming
weight page includes source.
I've added two bit-slicing implementations to my popcnt.c
benchmark. The algorithm from comp.arch.arithmetic processes 7 words
at a time and the one from Cédric processes 24 words at a time.
Both fall back to other methods for the last few words when the input
isn't an exact multiple of 7 or 24 words.
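For a feel of the 7-word version, here is a sketch along the lines of
the comp.arch.arithmetic posting, not a copy of either implementation.
It uses the csa() helper from the sketch above to reduce seven words
to "ones", "twos", and "fours" words, then does three per-word
popcounts instead of seven:

    #include <stdint.h>
    #include <stddef.h>

    /* Any per-word popcount works here; this is the usual
       parallel-add version for 64-bit words. */
    static unsigned pop64(uint64_t x)
    {
        x = x - ((x >> 1) & 0x5555555555555555ULL);
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
        return (unsigned)((x * 0x0101010101010101ULL) >> 56);
    }

    unsigned popcount_bitslice7(const uint64_t *w, size_t n)
    {
        unsigned total = 0;
        size_t i = 0;
        while (i + 7 <= n) {
            uint64_t o0, o1, ones, t0, t1, t2, twos, fours;
            csa(&t0, &o0, w[i],   w[i+1], w[i+2]);  /* 3 words -> twos, ones */
            csa(&t1, &o1, w[i+3], w[i+4], w[i+5]);
            csa(&t2, &ones, o0, o1, w[i+6]);        /* merge the ones */
            csa(&fours, &twos, t0, t1, t2);         /* merge the twos */
            total += 4 * pop64(fours) + 2 * pop64(twos) + pop64(ones);
            i += 7;
        }
        while (i < n)            /* tail: fall back to one word at a time */
            total += pop64(w[i++]);
        return total;
    }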
Cédric implemented several algorithms in his code distribution
and included programs to compare their performance. The results
suggested that the 24-word bitslice algorithm would be twice as fast
as the 16-bit lookup table. I implemented it and found that while
it was fast, and on my laptop even faster than the 16-bit lookup
table, it was only about 10% faster, not a factor of 2.
I dug around for a while and, following Toby's comment, tried Shark
as a profiler. Things made a bit more sense after I remembered to add
the '-g' option to gcc so I could get Shark to line up the code and
assembly side-by-side. My conclusion is that Cédric's code has
too much function call overhead: the 16-bit LUT function is called
for every 32-bit word, while the 24-word bitslice function is only
called once every 24 words. My code doesn't have that overhead, which
is why my two times are closer to each other.
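If that diagnosis is right, the fix is to let the compiler inline the
per-word lookup instead of calling out to it. A sketch, with
hypothetical names (lut16 being a 65536-entry table of 16-bit-value
popcounts, defined elsewhere):

    #include <stdint.h>

    extern const unsigned char lut16[65536];  /* popcount of each 16-bit value */

    /* 'static inline' lets gcc fold the lookup into the caller's
       loop, removing the call overhead on every 32-bit word. */
    static inline unsigned pop32_lut16(uint32_t x)
    {
        return lut16[x & 0xFFFF] + lut16[x >> 16];
    }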
As before, I wondered how much "-m64" affects things. I'll declare it
a tie between the new bitslice code and the old 16-bit lookup table.
(Though the numbers aren't identical, the difference is about the
same as the variation I see when rerunning the same algorithm.)
Bitslice(24) is still the fastest bit-twiddler solution, but quite a
bit slower than using an 8-bit lookup table. There is no single best
solution. The performance also seems to be compiler-dependent, so I'm
installing GCC 4.2 to see how that changes things. That'll probably
take a few hours, so it's time to sleep.