This post originated from an RSS feed registered with Ruby Buzz
by Eric Hodel.
Original Post: Verifying benchmarks with ministat
Feed Title: Segment7
Feed URL: http://blog.segment7.net/articles.rss
Feed Description: Posts about and around Ruby, MetaRuby, ruby2c, ZenTest and work at The Robot Co-op.
Most people know to benchmark to verify performance improvements, but comparing a handful of results isn't enough. You need to be sure your results are statistically significant. ministat is a tool written by Poul-Henning Kamp that determines whether the difference between two result sets is statistically significant and displays the difference in performance between the two.
I recently used ministat to measure performance changes for my GVL release patch for Ruby's Zlib extension. The patch will allow multiple zlib or gzip streams to be deflated or inflated in parallel. I was asked to show any performance difference because the GVL release/acquire can be expensive depending upon the work done without the GVL.
I posted a video showing the difference between inflating very large files, but what about single-threaded use of Zlib? What about the worst case? While parallel processing of large streams is great, it shouldn't overly penalize single-thread or small-stream processing. I ran several benchmarks of single-threaded use to show minimal difference, but in the worst-case benchmarks I couldn't be sure there was any significant performance change.
Zlib background
For simplicity I'll only be mentioning inflate from here on, but the same loop handles inflate and deflate of gzip, zlib and raw deflate streams identically.
While the Zlib extension is written in C, I've written this in roughly equivalent ruby while hand-waving over some of the details of how libz works, since they're unimportant to the performance comparison. Before my patch the run loop looked like this:
setup_input_buffer stream
setup_output_buffer stream

loop do
  status = inflate stream

  schedule_another_thread

  return output_buffer if status == STREAM_END

  if status == BUF_ERROR then
    expand_output_buffer stream
    next
  else
    # handle other errors
  end
end
First the input and output buffers are set up. The input buffer contains the deflated string to be inflated and the output buffer will contain the inflated string. The default output buffer size is 1024 bytes.
Inside the loop inflate is called, which inflates as much of the input stream as will fit in the output buffer, then ruby is given the chance to schedule another thread to provide fairness, then errors are checked.
A BUF_ERROR indicates there isn't enough room in the output buffer and more space needs to be added. The expand_output_buffer function adds up to another 16KB to the output buffer. Combined with scheduling another thread each time inflate returns, this provides fairness.
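As a rough illustration of that growth policy (the 1024-byte initial size and 16KB growth step come from the description above; the 100,000-byte stream size is made up for the example), each trip through the loop grows the buffer by at most 16KB, and each trip is also a chance to schedule another thread:

```ruby
# Illustration only: a stream that inflates to 100_000 bytes starts
# with a 1024-byte output buffer and grows by 16KB per trip through
# the loop, so it takes several trips -- and therefore several thread
# scheduling points -- to finish.
output_size = 100_000
capacity    = 1024
trips       = 1

while capacity < output_size
  capacity += 16 * 1024
  trips    += 1
end

puts trips # => 8
```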
With my patch the run loop looks something like this:
setup_input_buffer stream
setup_output_buffer stream

release_GVL do
  until interrupted do
    status = inflate stream

    return output_buffer if status == STREAM_END

    if status == BUF_ERROR then
      expand_output_buffer_without_GVL stream
      next
    else
      # …
    end
  end
end
The differences are releasing the GVL, the interrupted flag and the change in expanding the output buffer.
To reduce copying the Zlib extension uses a ruby String for the output buffer, so expanding it requires GC interaction. The new function uses realloc(3) directly to avoid GVL contention and mirrors rb_str_resize().
The interrupted flag allows processing of timeouts, a killed thread or a clean shutdown to proceed smoothly. As in the original version, the buffer is only expanded by 16KB at a time so interruptions can be serviced quickly. Without this flag a large stream would need to complete before the thread it is running in could be shut down, which may take several seconds. (In the C version of release_GVL, rb_thread_blocking_region(), you provide an unblocking function that sets the interrupted flag when ruby wants to regain control of the thread.)
Releasing the GVL allows other threads to run the Ruby VM, so while a stream is being inflated other threads can run on other CPUs. For large streams this can result in a performance increase when other Ruby threads are running.
Worst Case Benchmark
After inspecting the two implementations we have enough information to create a worst-case scenario for testing the performance of the patch.
Since GVL release/acquire can be expensive we want to do that as much as possible, which means inflate should do as little work as possible. The buffer should not be expanded, since expansion allows more than one trip through inflate() per GVL release/acquire, so inflated strings must fit in the 1024-byte output buffer.
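That premise is easy to sanity-check (this snippet is mine, not from the original benchmark): a string that inflates to 1000 bytes fits in the default 1024-byte output buffer, so inflate never needs to expand it.

```ruby
require 'zlib'

# Sanity check of the worst-case premise: deflate 1000 random bytes,
# inflate them back, and confirm the inflated size fits within the
# default 1024-byte output buffer.
input    = Random.new(0).bytes 1000
deflated = Zlib::Deflate.deflate input
inflated = Zlib::Inflate.inflate deflated

puts inflated.bytesize # => 1000
```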
I wrote a benchmark that inflates many strings with an inflated size of 1000 bytes using multiple threads, but not so many strings as to prevent a quick return from the benchmark section. (While inflating strings of zero bytes would probably be worse, that seems unrealistic.)
Here's the benchmark program I wrote:
require 'zlib'
require 'benchmark'

r = Random.new 0

file_count = 10_000

deflated = file_count.times.map do
  input = r.bytes 1000
  Zlib::Deflate.deflate input
end

times = Benchmark.measure do
  (0..3).map do
    Thread.new do
      deflated.each do |input|
        Zlib::Inflate.inflate input
      end
    end
  end.each do |t|
    t.join
  end
end

puts times.real
Instead of using time, Benchmark.measure is used, but only around the inflate part. This reduces the amount of noise in the benchmark. By choosing only 10,000 files multiple benchmarks can be run (on my machine the inflate portion takes a little over ½ second). By using the same random seed the benchmark is repeatable.
Before using ministat I ran the same benchmark, but with 100,000 files. Without the patch I had the following results:
$ for f in `jot 5`; do ruby20 test.rb; done
5.420000 5.970000 11.390000 ( 8.162893)
5.400000 6.270000 11.670000 ( 8.263046)
5.460000 5.920000 11.380000 ( 8.133742)
5.410000 6.290000 11.700000 ( 8.289913)
5.500000 6.620000 12.120000 ( 8.478085)
and with the patch:
$ for f in `jot 5`; do make -s runruby; done
5.120000 6.240000 11.360000 ( 8.039715)
5.240000 6.260000 11.500000 ( 8.097961)
5.280000 5.940000 11.220000 ( 8.004246)
5.210000 6.360000 11.570000 ( 8.171124)
5.240000 6.200000 11.440000 ( 8.054929)
So while running with the patch seemed faster, the results overlap by a small margin. I wanted to be reasonably sure there was an improvement which is where ministat comes in. In order to get useful data out of ministat I would need more data (this is easy to verify by running ministat on these results and seeing that it says "No difference").
To capture the data I used a shell for loop to run the benchmark 50 times and redirect the output to a file: for f in `jot 50`; do make -s runruby; done > with, and again without the patch applied directed to a "without" file.
ministat
To determine if the results were statistically significant I ran the data through ministat with a confidence interval of 99%. From the ministat man page:
The ministat command was written by Poul-Henning Kamp out of frustration over all the bogus benchmark claims made by people with no understanding of the importance of uncertainty and statistics.
So through ministat we will be able to determine if the suspected difference from the previous benchmark is an actual difference, and what the certainty of that difference is.
$ ministat -c 99 -s without with
x without
+ with
+------------------------------------------------------------------------------+
| + + + x * xx x |
| + x + + + + + + x **+xx*xx x * x + x |
|+ ++++ x + x*+ * + ++x+x++x*++x* ***+x**xx * +x**** * + x x xx|
| |_____________A_____________| |
| |______________A_______________| |
+------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 50 0.61660767 0.71641994 0.66669798 0.66647618 0.022548872
+ 50 0.59242272 0.69294882 0.64941692 0.64917859 0.02509235
Difference at 99.0% confidence
-0.0172976 +/- 0.0125332
-2.59538% +/- 1.88051%
(Student's t, pooled s = 0.0238545)
Looking at the last four lines, you can see there is a statistically significant difference in the benchmark results at 99% confidence. Running the benchmark with the patch shows a 17ms decrease in time ± 13ms or a 2.6% decrease in time ± 1.9% for this sample set, so I can say "minor improvement".
The box shows a plot of the with and without values along with the average (and median) values, and standard deviation bars. For full details see the ministat man page.
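The arithmetic behind those last four lines can be sketched in Ruby. This is my own approximation of what ministat computes (the Student's t critical value is passed in rather than looked up from a table; for 99% confidence and 98 degrees of freedom it is roughly 2.63):

```ruby
# Sketch of the statistic ministat reports: the difference of means
# with a confidence interval from Student's t and a pooled standard
# deviation.
def mean(xs)
  xs.sum.fdiv xs.size
end

def variance(xs)
  m = mean xs
  xs.sum { |x| (x - m)**2 } / (xs.size - 1.0)
end

# t_crit is the Student's t critical value for the chosen confidence
# level and n1 + n2 - 2 degrees of freedom.
def difference(without, with_patch, t_crit)
  n1, n2 = without.size, with_patch.size
  pooled = Math.sqrt(((n1 - 1) * variance(without) +
                      (n2 - 1) * variance(with_patch)) / (n1 + n2 - 2.0))
  diff   = mean(with_patch) - mean(without)
  margin = t_crit * pooled * Math.sqrt(1.0 / n1 + 1.0 / n2)

  [diff, margin] # significant when diff.abs > margin
end
```

When the absolute difference exceeds the margin, as it does in the ministat output above, the two result sets differ at the chosen confidence level.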
There are a couple of likely reasons for the slight increase in performance, such as inflate taking long enough relative to the GVL release/acquire to allow parallelism, or the GVL release/acquire not being as expensive as thought, but without an easy way to measure GVL contention this is only speculation.
Getting ministat
To my knowledge ministat only ships with FreeBSD, but fortunately it is really easy to add to nearly any system with a C compiler. You can find a copy of ministat in this repository (which, ironically, uses autotools) or you can grab the C file directly, run cc ministat.c -lm -o ministat, and move the ministat executable into your PATH like I did.