The Wide Finder 2 benchmark
measures the speed at which a program can analyze 42GB worth of webserver logs
and generate basic statistics (top URLs by hits, byte count and 404 errors,
top clients and referrers) on a pretty beefy machine: a 8-core Sun Fire T2000
with 32 hardware threads and 32GB of RAM.
Several people have written a multitude of implementations in various
languages, including C++, OCaml, C, Java, Scala, Groovy, Fan, Python, Perl,
Ruby, and combinations thereof. The results are presented
on this table.
Speedup
The reference implementation was written in Ruby and took over 25 hours to
process the logs. The fastest versions, coded in C++ and OCaml, can do it in
around 5 minutes, nearly 300 times faster.
In the case of OCaml, this speedup can be broken down into 3 factors.
Use of a more efficient language implementation: 1 order of magnitude
This does not necessarily cause an explosion in code size, as shown by the
wf2_simple.ml semi-naïve port to OCaml, which
is in fact shorter than the reference Ruby implementation by line count (this
is far from being the case in C++, though), but runs an order of magnitude
faster.
It is interesting to analyze the bang-per-buck ratio for the different
languages: while two C++ implementations ultimately yielded the highest
performance (a third one failed to parallelize properly, and was over three
time slower than the top C++ or OCaml versions) , they do so at the cost of
an explosion in the amount of code, requiring 2 to 3 times more lines than the
OCaml version whose performance was within 10% of the best one, even though
the C++ entries used libraries like Boost.
Parallelism: 1 order of magnitude
The Wide Finder 2 problem is fairly easy to decompose into smaller subproblems
(essentially by having several workers processing different parts of the
logs). The T2K is a bit particular in that each of its 32 hardware threads
(which are seen as 32 CPUs when you program) is much less powerful than the
x64 cores we're using most of the time, so parallel execution is a must.
I parallelized the OCaml program trivially by using a semi-standard 15-line
higher-order function which accepts a function and executes it in another
process.
Small-scale optimizations that apply to both the single-threaded and the parallel cases: 2-4X
The Wide Finder 2 implies lots of IO activity, which proved to be relatively hard
to optimize on the T2K, because it you can easily saturate a core by doing
mere IO (i.e., a single core is barely able to cope with the sustained read
rate of the disk). While my first OCaml implementations used line-oriented IO,
like the reference version in Ruby, the fastest one (like most other entries)
uses block-oriented IO, which accounted for a large part of the linecount
increase.