HAKMEM 169 and other popcount implementations
Andrew Dalke
Many other people have worked on doing high-speed population counts.
In 1972 the AI Laboratory at MIT published "HAKMEM", also known as
"Artificial Intelligence Memo No. 239". It is a collection of
algorithms, tricks, observations, and unsolved but interesting
problems, one of which is to "Solve Go." Item 169 is a PDP-6/10
assembly routine to count the number of one bits in a word. The
operation was called "count ones" and is now called popcount.
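Translated out of PDP-10 assembly, the trick is to compute 3-bit
partial sums with octal masks, then fold them together with a final
modulus. Here is the usual C rendering for 32-bit words (a sketch;
the "HAKMEM 169/X11" entry in the benchmarks below is a variant of
this):

int popcount_hakmem(unsigned int n)
{
    unsigned int tmp;
    /* each 3-bit field of tmp holds the popcount of the
       corresponding 3 bits of n */
    tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    /* sum adjacent 3-bit fields into 6-bit fields, then add the
       6-bit fields by taking the remainder modulo 63 */
    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}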
I first came across HAKMEM Item 169 in 2005, when I last looked at
doing fast Tanimoto calculations. I tried it and a few other
solutions, but concluded that the special AltiVec instructions were
the fastest solution for my PowerPC-based PowerBook. I don't have a
PowerPC any more; I bought a MacBook Pro last year with a 2.33 GHz
Intel Core 2 Duo. I've been looking quizzically at the MMX
instructions, trying to figure out if there's an equivalent
solution. I am not an assembly programmer.
The best I could find was an AMD "x86 Code Optimization Guide". It
gives a 32-bit integer version using normal Intel assembly and a
64-bit version using MMX, which "can do popcounts about twice as
fast as the integer version (for an identical number of bits)."
The newer guide (I misplaced the URL) says to use the popcount
instruction in SSE4a.
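I won't reproduce the guide's assembly here, but its integer
version is essentially the standard "SIMD within a register"
reduction. A C sketch of that approach:

unsigned int popcount32(unsigned int v)
{
    v = v - ((v >> 1) & 0x55555555);                /* 2-bit sums */
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); /* 4-bit sums */
    v = (v + (v >> 4)) & 0x0F0F0F0F;                /* 8-bit sums */
    return (v * 0x01010101) >> 24;                  /* add the four bytes */
}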
Other popcount references
Computing popcounts is apparently a very common need. Many people
have written about it; for example, one benchmark I came across
reports:
popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter
popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter
popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter
popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter
popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter
popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter
popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter
popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter
...
At the suggestion of Bill Trost, I added 8- and 16-bit table-driven
popcount implementations. These are the fastest in the benchmark
harness, at about 3 ns/iteration. They're about the same speed, so
one would obviously use the 8-bit version, since its table is 256
entries instead of 65,536 and so puts less pressure on the cache.
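The table-driven approach is straightforward; here's a sketch of
the 8-bit version (not the exact code from the benchmark):

static unsigned char bitcount_table[256];

static void init_bitcount_table(void)
{
    int i;
    /* each entry holds the popcount of its index */
    for (i = 1; i < 256; i++)
        bitcount_table[i] = bitcount_table[i >> 1] + (i & 1);
}

static int popcount_8bit_lut(unsigned int v)
{
    return bitcount_table[v & 0xff] +
           bitcount_table[(v >> 8) & 0xff] +
           bitcount_table[(v >> 16) & 0xff] +
           bitcount_table[v >> 24];
}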
Majek
has pointers to code. That's where I found the AMD optimization
guide link.
Lauradoux Cédric nicely summarizes why few people use the fastest
solution:
The fastest solutions use pre-computation. However, you will never
find these two implementations in current software products (X11,
the Linux kernel, ...). The reason is simple: programmers are
reluctant to use solutions requiring a large amount of memory,
since it can affect the portability of their applications and
their performance. As a consequence, the tree-based algorithms are
the most popular.
He also found that lookups were the fastest solution.
There's a gcc bug report from April 2008 showing that
__builtin_popcount isn't even as fast as other bit-twiddling
solutions. Joseph Zbiciak attached a benchmark. On my machine it
shows that gcc's builtin solution is about 1/3rd the speed of a
better one.
Eyas El-Qawasmeh has a paper titled "Beating the popcount" which I
haven't processed yet. I think it uses the same approach as
Lauradoux Cédric's. In essence, there are additional tricks you can
do when computing the popcount over a set of words.
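I haven't verified what the paper actually does, but one well-known
trick of that sort is the carry-save adder, which merges three
words so that three popcounts become two (popcount32 is the
bit-twiddling sketch from earlier):

unsigned int popcount3(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int sum   = a ^ b ^ c;               /* bit i of a+b+c */
    unsigned int carry = (a & b) | ((a ^ b) & c); /* carry into bit i+1 */
    return popcount32(sum) + 2 * popcount32(carry);
}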
Finally, you can fetch the GMP
package. It contains hand-written assembly implementations for many
different architectures. The Intel ones are based on bit twiddling
and do not use a lookup table.
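From C, GMP exposes the counting as mpz_popcount; a minimal example
(link with -lgmp):

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t n;
    mpz_init_set_str(n, "fff", 16);  /* 0xfff has 12 one bits */
    printf("%lu\n", (unsigned long) mpz_popcount(n));
    mpz_clear(n);
    return 0;
}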
Benchmarking popcount in GMP
Interesting. Since I have gmpy installed on this machine, I can
figure out just how good GMP's hand-written bit twiddling actually
is. Here's what I did, with commentary inserted.
# Read the entire data file into memory
>>> s = open("nci.binary_fp").read()
>>> len(s)
155667200
# Convert the data into a hex string (keeping 's' for the check below)
>>> h = s.encode("hex")
# Convert the hex string into a gmp integer, using base 16
>>> import gmpy
>>> n = gmpy.mpz(h, 16)
# Do the popcount
>>> gmpy.popcount(n)
136963400
# How long did it take?
>>> import time
>>> t1 = time.time(); gmpy.popcount(n); t2 = time.time()
136963400
>>> t2-t1
0.36236190795898438
>>> print (155667200/256)/(t2-t1), "fingerprints per second"
1678087.5325 fingerprints per second
# Verify that I got the right answer by using an 8-bit lookup table
>>> bitcounts = (
... 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,)
>>> bitcount_table = dict((chr(i), count) for (i,count) in enumerate(bitcounts))
# Double-check the table creation -- this should be 2
>>> bitcount_table["A"]
2
# Do the full popcount; this will take a while
>>> sum(bitcount_table[c] for c in s)
136963400
>>>
GMP takes about 0.36 seconds to compute one popcount. My C code
using the 16-bit lookup table, before loop unrolling, took 0.41
seconds to compute two popcounts, or about 0.2 seconds for one.
This is in good agreement with the relative numbers from a1k0n's
benchmark (3.4 seconds vs. 2.1 seconds).
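The 16-bit version is the same idea as the 8-bit table, with a
65,536-entry table and two lookups per 32-bit word instead of four.
A sketch, before any unrolling:

static unsigned char bitcount_table16[65536];

static void init_bitcount_table16(void)
{
    int i;
    for (i = 1; i < 65536; i++)
        bitcount_table16[i] = bitcount_table16[i >> 1] + (i & 1);
}

/* popcount of nwords 32-bit words */
static long popcount_16bit_lut(const unsigned int *data, long nwords)
{
    long i, total = 0;
    for (i = 0; i < nwords; i++)
        total += bitcount_table16[data[i] & 0xffff] +
                 bitcount_table16[data[i] >> 16];
    return total;
}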
Another popcount benchmark
There are so many different algorithms, and the benchmarks aren't
compatible enough for me to compare them directly. I decided to
expand a1k0n's benchmark to include a few other implementations.
The result is popcnt.c. Yes, I didn't even change the name. It
implements quite a few algorithms, but it's still incomplete.
Anyone want to contribute assembly versions using MMX instructions?
Here's what my laptop reports. It's a 2.33 GHz Intel Core 2 Duo,
and I compiled with gcc 4.0.1. The fastest implementation is still
the 16-bit lookup table, about 40% faster than the fastest
bit-twiddling code. The "8-bit LUT" implementation is by a1k0n. I
noticed it was a lot slower than my code, so I investigated. The
only difference was the order of calculations for a simple
addition; the "8-bit LUT v2" is my version. It's interesting that
such a minor change causes a major timing difference. And HAKMEM is
holding its own; it's even better than the Wikipedia ones, which I
didn't expect.
A Core 2 Duo is a 64-bit processor, so I can ask gcc to compile for
64 bits (with -m64) and see if that makes a difference.
It does, for some things. The lookup tables are faster (and the
8-bit counts are now the same), and the 64-bit algorithms
(Wikipedia, gcc popcountll, and Anderson) are much happier. The
parallel algorithm used by the Wikipedia implementations is almost
as fast as the 8-bit lookup table and only 33% slower than the
16-bit lookup table, while HAKMEM got 25% slower.
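For reference, the 64-bit parallel reduction in the style of the
Wikipedia code looks like this (a sketch assuming a 64-bit integer
type):

#include <stdint.h>

static int popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int) ((x * 0x0101010101010101ULL) >> 56);
}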
Is there any way I can ask gcc to use the 64-bit instructions but
still talk to a Python which was compiled for 32 bits?
Profile-directed optimizations with gcc
I recently learned how to do profile-directed optimization with
gcc, using the -fprofile-arcs and -fbranch-probabilities options.
Sometimes the compiler has to guess which assembly code is best for
a given circumstance. For instance, if the branch for an 'if'
statement is usually taken, it might hint to the processor pipeline
to optimistically assume the branch will be taken. It can't guess
correctly all the time. Profile-directed optimization works by
asking gcc to instrument the program during compilation, running
the program to record how the code is actually used, saving the
results to a file, then recompiling while asking gcc to use the
instrumentation data.
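As an aside, gcc's __builtin_expect gives the same sort of branch
hint by hand; a toy example (checked_increment is just for
illustration):

#define unlikely(x) __builtin_expect(!!(x), 0)

int checked_increment(int *p)
{
    if (unlikely(p == 0))  /* gcc lays out code assuming p != 0 */
        return -1;
    return *p + 1;
}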
Is it useful for my case?
compile with instrumentation - these timings will be slower than usual!
% cc popcnt.c -O3 -fprofile-arcs
% ./a.out
FreeBSD version 1 : 4304773 us; cnt=32511665
FreeBSD version 2 : 4062710 us; cnt=32511665
16-bit LUT : 2835959 us; cnt=32511665
8-bit LUT : 5344514 us; cnt=32511665
8-bit LUT v2 : 4535290 us; cnt=32511665
8x shift and add : 22246178 us; cnt=32511665
32x shift and add : 19915850 us; cnt=32511665
Wikipedia #2 : 5626697 us; cnt=32511665
Wikipedia #3 : 5318770 us; cnt=32511665
gcc popcount : 7005410 us; cnt=32511665
gcc popcountll : 8173944 us; cnt=32511665
naive : 62215630 us; cnt=32511665
Wegner/Kernigan : 34541338 us; cnt=32511665
Anderson : 65256755 us; cnt=32511665
HAKMEM 169/X11 : 4727297 us; cnt=32511665
See the profile data?
% ls -l popcnt.*
-rw-r--r-- 1 dalke staff 9177 Jul 3 05:01 popcnt.c
-rw-r--r-- 1 dalke staff 1816 Jul 3 04:43 popcnt.gcda
-rw-r--r-- 1 dalke staff 7872 Jul 3 04:43 popcnt.gcno
recompile using the profile data
% cc popcnt.c -O3 -fbranch-probabilities
% ./a.out
FreeBSD version 1 : 3815036 us; cnt=32511665
FreeBSD version 2 : 3477915 us; cnt=32511665
16-bit LUT : 2251514 us; cnt=32511665
8-bit LUT : 5291010 us; cnt=32511665
8-bit LUT v2 : 3936142 us; cnt=32511665
8x shift and add : 21715704 us; cnt=32511665
32x shift and add : 19708580 us; cnt=32511665
Wikipedia #2 : 5273262 us; cnt=32511665
Wikipedia #3 : 5074212 us; cnt=32511665
gcc popcount : 6431689 us; cnt=32511665
gcc popcountll : 7562124 us; cnt=32511665
naive : 28044028 us; cnt=32511665
Wegner/Kernigan : 15280455 us; cnt=32511665
Anderson : 63431368 us; cnt=32511665
HAKMEM 169/X11 : 4209093 us; cnt=32511665
No; there were no significant differences compared to the normal
32-bit code. Nor did I see any real differences when I compiled
with architecture-specific flags: