HAKMEM 169 and other popcount implementations
Andrew Dalke
Many other people have worked on doing high-speed population counts.
In 1972 the AI Laboratory at MIT published "HAKMEM", also known as
"Artificial Intelligence Memo No. 239". It is a collection of
algorithms, tricks, observations, and unsolved but interesting
problems, one of which is to "Solve Go." Item 169 is a PDP-6/10
assembly routine to count the number of one bits in a word. The
operation was called "count ones" and is now called popcount.
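Translated out of PDP-10 assembly, the trick is to compute 3-bit
partial sums with octal masks, then fold them together with a final
modulus. Here is the usual C rendering for 32-bit words (a sketch;
the "HAKMEM 169/X11" entry in the benchmarks below is a variant of
this):

int popcount_hakmem(unsigned int n)
{
    unsigned int tmp;
    /* each 3-bit field of tmp holds the popcount of the
       corresponding 3 bits of n */
    tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    /* sum adjacent 3-bit fields into 6-bit fields, then add the
       6-bit fields by taking the remainder modulo 63 */
    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}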
I first came across HAKMEM Item 169 in 2005, when I last looked at
doing fast Tanimoto calculations. I tried it and a few other
solutions, but concluded that the special AltiVec instructions were
the fastest solution for my PowerPC-based PowerBook. I don't have a
PowerPC any more; I bought a MacBook Pro last year with a 2.33 GHz
Intel Core 2 Duo. I've been looking quizzically at the MMX
instructions, trying to figure out if there's an equivalent
solution. I am not an assembly programmer.
The best I could find was an AMD "x86 Code Optimization Guide". It
gives a 32-bit integer version using normal Intel assembly and a
64-bit version using MMX, which "can do popcounts about twice as
fast as the integer version (for an identical number of bits)."
The newer guide (I misplaced the URL) says to use the popcount
instruction in SSE4a.
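I won't reproduce the guide's assembly here, but its integer
version is essentially the standard "SIMD within a register"
reduction. A C sketch of that approach:

unsigned int popcount32(unsigned int v)
{
    v = v - ((v >> 1) & 0x55555555);                /* 2-bit sums */
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); /* 4-bit sums */
    v = (v + (v >> 4)) & 0x0F0F0F0F;                /* 8-bit sums */
    return (v * 0x01010101) >> 24;                  /* add the four bytes */
}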
Other popcount references
Computing popcounts is apparently a very common need. Many people
have written about it; for example, one benchmark I came across
reports:
popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter
popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter
popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter
popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter
popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter
popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter
popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter
popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter
...
At the suggestion of Bill Trost, I added 8- and 16-bit table-driven
popcount implementations. These are the fastest in the benchmark
harness, at about 3 ns/iteration. They're about the same speed, so
one would obviously use the 8-bit version, since its table is 256
entries instead of 65,536 and so puts less pressure on the cache.
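The table-driven approach is straightforward; here's a sketch of
the 8-bit version (not the exact code from the benchmark):

static unsigned char bitcount_table[256];

static void init_bitcount_table(void)
{
    int i;
    /* each entry holds the popcount of its index */
    for (i = 1; i < 256; i++)
        bitcount_table[i] = bitcount_table[i >> 1] + (i & 1);
}

static int popcount_8bit_lut(unsigned int v)
{
    return bitcount_table[v & 0xff] +
           bitcount_table[(v >> 8) & 0xff] +
           bitcount_table[(v >> 16) & 0xff] +
           bitcount_table[v >> 24];
}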
Majek
has pointers to code. That's where I found the AMD optimization
guide link.
Lauradoux Cédric nicely summarizes why few people use the fastest
solution:
The fastest solutions use pre-computation. However, you will never
find these two implementations in current software products (X11,
the Linux kernel, ...). The reason is simple: programmers are
reluctant to use solutions requiring a large amount of memory,
since it can affect the portability of their applications and
their performance. As a consequence, the tree-based algorithms are
the most popular.
He also found that lookups were the fastest solution.
There's a gcc bug report from April 2008 showing that
__builtin_popcount isn't even as fast as other bit-twiddling
solutions. Joseph Zbiciak attached a benchmark. On my machine it
shows that gcc's builtin solution is about 1/3rd the speed of a
better one.
Eyas El-Qawasmeh has a paper titled "Beating the popcount" which I
haven't processed yet. I think it uses the same approach as
Lauradoux Cédric's. In essence, there are additional tricks you can
do when computing the popcount over a set of words.
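I haven't verified what the paper actually does, but one well-known
trick of that sort is the carry-save adder, which merges three
words so that three popcounts become two (popcount32 is the
bit-twiddling sketch from earlier):

unsigned int popcount3(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int sum   = a ^ b ^ c;               /* bit i of a+b+c */
    unsigned int carry = (a & b) | ((a ^ b) & c); /* carry into bit i+1 */
    return popcount32(sum) + 2 * popcount32(carry);
}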
Finally, you can fetch the GMP
package. It contains hand-written assembly implementations for many
different architectures. The Intel ones are based on bit twiddling
and do not use a lookup table.
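From C, GMP exposes the counting as mpz_popcount; a minimal example
(link with -lgmp):

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t n;
    mpz_init_set_str(n, "fff", 16);  /* 0xfff has 12 one bits */
    printf("%lu\n", (unsigned long) mpz_popcount(n));
    mpz_clear(n);
    return 0;
}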
Benchmarking popcount in GMP
Interesting. Since I have gmpy installed on this machine, I can
figure out just how good GMP's hand-written bit twiddling actually
is. Here's what I did, with commentary inserted.
# Read the entire data file into memory
>>> s = open("nci.binary_fp").read()
>>> len(s)
155667200
# Convert the data into a hex string (keeping 's' for the check below)
>>> h = s.encode("hex")
# Convert the hex string into a gmp integer, using base 16
>>> import gmpy
>>> n = gmpy.mpz(h, 16)
# Do the popcount
>>> gmpy.popcount(n)
136963400
# How long did it take?
>>> import time
>>> t1 = time.time(); gmpy.popcount(n); t2 = time.time()
136963400
>>> t2-t1
0.36236190795898438
>>> print (155667200/256)/(t2-t1), "fingerprints per second"
1678087.5325 fingerprints per second
# Verify that I got the right answer by using an 8-bit lookup table
>>> bitcounts = (
... 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,)
>>> bitcount_table = dict((chr(i), count) for (i,count) in enumerate(bitcounts))
# Double-check the table creation -- this should be 2
>>> bitcount_table["A"]
2
# Do the full popcount; this will take a while
>>> sum(bitcount_table[c] for c in s)
136963400
>>>
GMP takes about 0.36 seconds to compute one popcount. My C code
using the 16-bit lookup table, before loop unrolling, took 0.41
seconds to compute two popcounts, or about 0.2 seconds for one.
This is in good agreement with the relative numbers from a1k0n's
benchmark (3.4 seconds vs. 2.1 seconds).
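The 16-bit version is the same idea as the 8-bit table, with a
65,536-entry table and two lookups per 32-bit word instead of four.
A sketch, before any unrolling:

static unsigned char bitcount_table16[65536];

static void init_bitcount_table16(void)
{
    int i;
    for (i = 1; i < 65536; i++)
        bitcount_table16[i] = bitcount_table16[i >> 1] + (i & 1);
}

/* popcount of nwords 32-bit words */
static long popcount_16bit_lut(const unsigned int *data, long nwords)
{
    long i, total = 0;
    for (i = 0; i < nwords; i++)
        total += bitcount_table16[data[i] & 0xffff] +
                 bitcount_table16[data[i] >> 16];
    return total;
}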
Another popcount benchmark
There are so many different algorithms, and the benchmarks aren't
compatible enough for me to compare them directly. I decided to
expand a1k0n's benchmark to include a few other implementations.
The result is popcnt.c. Yes, I didn't even change the name. It
implements quite a few algorithms, but it's still incomplete.
Anyone want to contribute assembly versions using MMX instructions?
Here's what my laptop reports. It's a 2.33 GHz Intel Core 2 Duo,
and I compiled with gcc 4.0.1. The fastest implementation is still
the 16-bit lookup table, about 40% faster than the fastest
bit-twiddling code. The "8-bit LUT" implementation is by a1k0n. I
noticed it was a lot slower than my code, so I investigated. The
only difference was the order of calculations for a simple
addition; the "8-bit LUT v2" is my version. It's interesting that
such a minor change causes a major timing difference. And HAKMEM is
holding its own; it's even better than the Wikipedia ones, which I
didn't expect.
A Core 2 Duo is a 64-bit processor, so I can ask gcc to compile for
64 bits (with -m64) and see if that makes a difference.
It does, for some things. The lookup tables are faster (and the
8-bit counts are now the same), and the 64-bit algorithms
(Wikipedia, gcc popcountll, and Anderson) are much happier. The
parallel algorithm used by the Wikipedia implementations is almost
as fast as the 8-bit lookup table and only 33% slower than the
16-bit lookup table, while HAKMEM got 25% slower.
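For reference, the 64-bit parallel reduction in the style of the
Wikipedia code looks like this (a sketch assuming a 64-bit integer
type):

#include <stdint.h>

static int popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int) ((x * 0x0101010101010101ULL) >> 56);
}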
Is there any way I can ask gcc to use the 64-bit instructions but
still talk to a Python which was compiled for 32 bits?
Profile-directed optimizations with gcc
I recently learned how to do profile-directed optimization with
gcc, using the -fprofile-arcs and -fbranch-probabilities options.
Sometimes the compiler has to guess which assembly code is best for
a given circumstance. For instance, if the branch for an 'if'
statement is usually taken, it might hint to the processor pipeline
to optimistically assume the branch will be taken. It can't guess
correctly all the time. Profile-directed optimization works by
asking gcc to instrument the program during compilation, running
the program to record how the code is actually used, saving the
results to a file, then recompiling while asking gcc to use the
instrumentation data.
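As an aside, gcc's __builtin_expect gives the same sort of branch
hint by hand; a toy example (checked_increment is just for
illustration):

#define unlikely(x) __builtin_expect(!!(x), 0)

int checked_increment(int *p)
{
    if (unlikely(p == 0))  /* gcc lays out code assuming p != 0 */
        return -1;
    return *p + 1;
}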
Is it useful for my case?
compile with instrumentation - these timings will be slower than usual!
% cc popcnt.c -O3 -fprofile-arcs
% ./a.out
FreeBSD version 1 : 4304773 us; cnt=32511665
FreeBSD version 2 : 4062710 us; cnt=32511665
16-bit LUT : 2835959 us; cnt=32511665
8-bit LUT : 5344514 us; cnt=32511665
8-bit LUT v2 : 4535290 us; cnt=32511665
8x shift and add : 22246178 us; cnt=32511665
32x shift and add : 19915850 us; cnt=32511665
Wikipedia #2 : 5626697 us; cnt=32511665
Wikipedia #3 : 5318770 us; cnt=32511665
gcc popcount : 7005410 us; cnt=32511665
gcc popcountll : 8173944 us; cnt=32511665
naive : 62215630 us; cnt=32511665
Wegner/Kernigan : 34541338 us; cnt=32511665
Anderson : 65256755 us; cnt=32511665
HAKMEM 169/X11 : 4727297 us; cnt=32511665
See the profile data?
% ls -l popcnt.*
-rw-r--r-- 1 dalke staff 9177 Jul 3 05:01 popcnt.c
-rw-r--r-- 1 dalke staff 1816 Jul 3 04:43 popcnt.gcda
-rw-r--r-- 1 dalke staff 7872 Jul 3 04:43 popcnt.gcno
recompile using the profile data
% cc popcnt.c -O3 -fbranch-probabilities
% ./a.out
FreeBSD version 1 : 3815036 us; cnt=32511665
FreeBSD version 2 : 3477915 us; cnt=32511665
16-bit LUT : 2251514 us; cnt=32511665
8-bit LUT : 5291010 us; cnt=32511665
8-bit LUT v2 : 3936142 us; cnt=32511665
8x shift and add : 21715704 us; cnt=32511665
32x shift and add : 19708580 us; cnt=32511665
Wikipedia #2 : 5273262 us; cnt=32511665
Wikipedia #3 : 5074212 us; cnt=32511665
gcc popcount : 6431689 us; cnt=32511665
gcc popcountll : 7562124 us; cnt=32511665
naive : 28044028 us; cnt=32511665
Wegner/Kernigan : 15280455 us; cnt=32511665
Anderson : 63431368 us; cnt=32511665
HAKMEM 169/X11 : 4209093 us; cnt=32511665
No; there were no significant differences compared to the normal
32-bit code. Nor did I see any real differences when I compiled
with architecture-specific flags: