Testing the bitslice algorithm for popcount

Andrew Dalke


Andrew Dalke is a consultant and software developer in computational chemistry and biology.
Posted: Jul 4, 2008 8:28 PM

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: Testing the bitslice algorithm for popcount
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.


This is part 7 of an ongoing series dealing with molecular fingerprints:

  1. Molecular fingerprints, background
  2. Generating molecular fingerprints with OpenBabel
  3. Computing Tanimoto scores, quickly
  4. Block Tanimoto calculations
  5. Faster block Tanimoto calculations
  6. HAKMEM 169 and other popcount implementations
  7. Testing the bitslice algorithm for popcount


I want to compute the popcount of molecular fingerprints containing 1024 bits. So far my approach has been to compute the popcount 8, 16, 32, or 64 bits at a time and repeat that for every word in the fingerprint. There are a couple of algorithms that compute the popcount of multiple words in parallel, doing a partial popcount of each word and combining the intermediate results to get the total over all of the words.
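
To make that concrete, the word-at-a-time approach over a 1024-bit fingerprint looks roughly like the following. This is only a sketch: it uses a 16-bit lookup table, assumes the fingerprint is stored as sixteen 64-bit words, and the names are mine rather than the ones in popcnt.c.

#include <stdint.h>
#include <stddef.h>

#define FP_WORDS 16                /* 1024 bits as 16 x 64-bit words (assumed layout) */

static uint8_t lut16[1 << 16];     /* popcount of every 16-bit value */

static void init_lut16(void) {
    for (unsigned i = 1; i < (1 << 16); i++)
        lut16[i] = lut16[i >> 1] + (i & 1);
}

/* Word-at-a-time popcount: split each 64-bit word into four 16-bit
   chunks and add up the table lookups. */
static unsigned popcount_fingerprint(const uint64_t fp[FP_WORDS]) {
    unsigned total = 0;
    for (size_t i = 0; i < FP_WORDS; i++) {
        uint64_t w = fp[i];
        total += lut16[w & 0xFFFF] + lut16[(w >> 16) & 0xFFFF] +
                 lut16[(w >> 32) & 0xFFFF] + lut16[(w >> 48) & 0xFFFF];
    }
    return total;
}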

Cédric Lauradoux wrote a nice overview of Hamming weight, including explanations of the tree-merging and bit-slicing algorithms. In personal email he pointed me to a page from comp.arch.arithmetic where someone attributes the algorithm to Robert Harley.

It took me a while to understand it. It's been quite a while since I had to think much about half-adders. About 17 years. But it's not hard. I'm not going to explain it here; it's best shown in pictures, and that would take me too long to do. You can follow the links above instead.

While he hasn't finished the part of the overview explaining how to combine the two algorithms, his Hamming weight page includes source code.

I've added two bit-slicing implementations to my popcnt.c benchmark. The algorithm from comp.arch.arithmetic processes 7 words at a time and the one from Cédric processes 24 words at a time. Both fall back to other methods for the last few words if the input isn't an exact multiple of 7 or 24 words.
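
The 7-word version boils down to combining words with bitwise full adders (carry-save adders) so that only a few single-word popcounts are needed. Here's a rough sketch of the idea. It's my own reconstruction, not the code from those pages or from popcnt.c:

#include <stdint.h>

/* Carry-save adder: one full adder per bit position, so that
   popcount(a) + popcount(b) + popcount(c) == popcount(*lo) + 2*popcount(*hi). */
static void csa(uint64_t a, uint64_t b, uint64_t c, uint64_t *lo, uint64_t *hi) {
    uint64_t u = a ^ b;
    *lo = u ^ c;
    *hi = (a & b) | (u & c);
}

/* Plain 64-bit popcount used on the reduced words. */
static unsigned popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Popcount of 7 words using four carry-save adders and only three
   single-word popcounts instead of seven. */
static unsigned popcount7(const uint64_t w[7]) {
    uint64_t s1, c1, s2, c2, ones, c3, twos, fours;
    csa(w[0], w[1], w[2], &s1, &c1);
    csa(w[3], w[4], w[5], &s2, &c2);
    csa(s1, s2, w[6], &ones, &c3);   /* `ones` holds the bits of weight 1 */
    csa(c1, c2, c3, &twos, &fours);  /* weight-2 inputs give weight 2 and 4 */
    return popcount64(ones) + 2 * popcount64(twos) + 4 * popcount64(fours);
}

As I understand it, the 24-word version applies the same kind of reduction across more words before doing the final single-word counts.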

Cédric implemented several algorithms in his code distributions and included programs to compare their performance. The results suggested that the 24-word bitslice algorithm would be twice as fast as the 16-bit lookup table. I implemented it and found that while it was fast, and on my laptop even faster than the 16-bit lookup table, it was only about 10% faster, not a factor of 2.

% sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM)2 CPU         T7600  @ 2.33GHz
% gcc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5367)
   ... deleted ...
% cc -O3 popcnt.c
% ./a.out
FreeBSD version 1   :  3854904 us; cnt=32511665
FreeBSD version 2   :  3517953 us; cnt=32511665
16-bit LUT          :  2127718 us; cnt=32511665   was the fastest
8-bit LUT           :  5317244 us; cnt=32511665
8-bit LUT v2        :  3762467 us; cnt=32511665
Wikipedia #2        :  5493756 us; cnt=32511665
Wikipedia #3        :  5295541 us; cnt=32511665
gcc popcount        :  6242596 us; cnt=32511665
gcc popcountll      : 10402364 us; cnt=32511665
HAKMEM 169/X11      :  4305536 us; cnt=32511665
Bitslice(7)         :  2077473 us; cnt=32511665
Bitslice(24)        :  1843217 us; cnt=32511665  fastest
naive               : 23460323 us; cnt=32511665
Wegner/Kernigan     : 14976014 us; cnt=32511665
Anderson            : 63971600 us; cnt=32511665
8x shift and add    : 21728414 us; cnt=32511665
32x shift and add   : 19747808 us; cnt=32511665

I dug around for a while and, following Toby's comment, tried Shark as a profiler. Things made a bit more sense after I remembered to add the '-g' option to gcc so that Shark could line up the code and assembly side by side. My conclusion is that Cédric's comparison code has too much function-call overhead: the 16-bit LUT function is called for every 32-bit word, while the 24-word bitslice function is only called once every 24 words. My code doesn't have that overhead, which is why my times are closer to each other.

As before, I wondered how much "-m64" affects things. I'll declare it a tie between the new bitslice code and the old 16-bit lookup table. (Though the numbers aren't identical, the difference is about the same as the run-to-run variation I see when timing the same algorithm twice.)

% cc -O3 popcnt.c -m64
% ./a.out
FreeBSD version 1   :  3902385 us; cnt=32511665
FreeBSD version 2   :  3422655 us; cnt=32511665
16-bit LUT          :  1621557 us; cnt=32511665 fastest
8-bit LUT           :  2070320 us; cnt=32511665
8-bit LUT v2        :  2094362 us; cnt=32511665
Wikipedia #2        :  2135130 us; cnt=32511665
Wikipedia #3        :  2142576 us; cnt=32511665
gcc popcount        :  8128024 us; cnt=32511665
gcc popcountll      :  4087390 us; cnt=32511665
HAKMEM 169/X11      :  5805272 us; cnt=32511665
Bitslice(7)         :  1728262 us; cnt=32511665
Bitslice(24)        :  1698257 us; cnt=32511665 fastest
naive               : 24990997 us; cnt=32511665
Wegner/Kernigan     : 16864887 us; cnt=32511665
Anderson            : 12463648 us; cnt=32511665
8x shift and add    : 21715482 us; cnt=32511665
32x shift and add   : 20003160 us; cnt=32511665

All of this was on one processor and compiler. I decided to try another machine with a slightly different processor and an earlier version of gcc.

%sysctl -w hw.model
hw.model: Intel(R) Core(TM)2 CPU          6420  @ 2.13GHz
%gcc --version
gcc (GCC) 3.4.6 [FreeBSD] 20060305
    ...
%cc popcnt.c -O3 
%./a.out 
FreeBSD version 1   :  4270677 us; cnt=32511665
FreeBSD version 2   :  3785047 us; cnt=32511665
16-bit LUT          :  2034836 us; cnt=32511665
8-bit LUT           :  1935156 us; cnt=32511665   fastest
8-bit LUT v2        :  1987370 us; cnt=32511665
Wikipedia #2        :  5010639 us; cnt=32511665
Wikipedia #3        :  4931266 us; cnt=32511665
gcc popcount        :  5793299 us; cnt=32511665
gcc popcountll      :  6815575 us; cnt=32511665
HAKMEM 169/X11      : 10788366 us; cnt=32511665
Bitslice(7)         :  2467150 us; cnt=32511665
Bitslice(24)        :  2469399 us; cnt=32511665
naive               : 27722786 us; cnt=32511665
  ... too slow to worry about ...

In this case I would use the 8-bit lookup table, which is about 20% faster than the bitslice code.

Here are the results on an older Pentium 4 (Prescott) 2.80GHz CPU with gcc 3.4.6:

FreeBSD version 1   :  3745509 us; cnt=32500610
FreeBSD version 2   :  3774150 us; cnt=32500610
16-bit LUT          :  2762583 us; cnt=32500610
8-bit LUT           :  2430362 us; cnt=32500610    fastest
8-bit LUT v2        :  2433684 us; cnt=32500610
Wikipedia #2        : 11509758 us; cnt=32500610
Wikipedia #3        :  8540921 us; cnt=32500610
gcc popcount        :  9100055 us; cnt=32500610
gcc popcountll      : 10568664 us; cnt=32500610
HAKMEM 169/X11      : 10970609 us; cnt=32500610
Bitslice(7)         :  4226871 us; cnt=32500610
Bitslice(24)        :  3700020 us; cnt=32500610
naive               : 36117662 us; cnt=32500610
Wegner/Kernigan     : 46479326 us; cnt=32500610
Anderson            : 174285325 us; cnt=32500610
8x shift and add    : 90028247 us; cnt=32500610
32x shift and add   : 86390410 us; cnt=32500610

Bitslice(24) is still the fastest bit-twiddling solution, but quite a bit slower than the 8-bit lookup table. There is no single best solution. The results also seem to be compiler-dependent, so I'm installing GCC 4.2 to see how that affects things. That'll probably take a few hours, so it's time to sleep.

