Verifying benchmarks with ministat
Eric Hodel (drbrain), Jun 27, 2012

Eric Hodel is a long-time Rubyist and co-founder of Seattle.rb.

Originally posted on Segment7 (http://blog.segment7.net/articles.rss).
Most people know to benchmark to verify performance improvements, but comparing a handful of results isn't enough. You need to be sure your results are statistically significant. ministat is a tool written by Poul-Henning Kamp that determines whether the difference between two result sets is statistically significant and displays the performance difference between them.
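
For example, given two files of timings with one measurement per line, the simplest invocation just names the two data sets (the file names here are placeholders; a full run with options appears later in this post):

$ ministat without-patch with-patch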

I recently used ministat to measure performance changes for my GVL release patch for Ruby's Zlib extension. The patch will allow multiple zlib or gzip streams to be deflated or inflated in parallel. I was asked to show any performance difference because the GVL release/acquire can be expensive depending upon the work done without the GVL.

I posted a video showing the difference between inflating very large files, but what about single-threaded use of Zlib? What about the worst case? While parallel processing of large streams is great, it shouldn't overly penalize single-thread or small-stream processing. I ran several benchmarks of single-threaded use to show minimal difference, but in the worst-case benchmarks I couldn't be sure there was any significant performance change.

Zlib background

For simplicity I'll only be mentioning inflate from here on, but the same loop handles inflate and deflate of gzip, zlib and raw deflate streams identically.

While the Zlib extension is written in C, I've written this in roughly equivalent ruby, hand-waving over some of the details of how libz works since they're unimportant to the performance comparison. Before my patch the run loop looked like this:

setup_input_buffer stream
setup_output_buffer stream # output buffer defaults to 1024 bytes

loop do
  status = inflate stream # inflates as much as fits in the output buffer

  schedule_another_thread # give ruby a chance to run other threads

  return output_buffer if status == STREAM_END

  if status == BUF_ERROR then # the output buffer is full
    expand_output_buffer stream # grow the buffer by up to 16KB
    next
  else
    # handle other errors
  end
end

First the input and output buffers are set up. The input buffer contains the deflated string to be inflated and the output buffer will contain the inflated string. The default output buffer size is 1024 bytes.

Inside the loop, inflate is called to expand as much of the input stream as will fit in the output buffer, then ruby is given a chance to schedule another thread to provide fairness, then the status is checked for errors.

A BUF_ERROR indicates there isn't enough room in the output buffer and more space needs to be added. The expand_output_buffer function adds up to another 16KB to the output buffer. Expanding in small increments, combined with scheduling another thread each time inflate returns, provides fairness.
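
As a rough sketch of what the expansion might look like in the same hand-wavy ruby (EXPAND_CHUNK and stream.output are illustrative names, not the extension's actual code):

EXPAND_CHUNK = 16 * 1024 # grow by at most 16KB per BUF_ERROR

def expand_output_buffer stream
  # the C extension resizes the ruby String with rb_str_resize(),
  # which interacts with the GC
  stream.output << "\0" * EXPAND_CHUNK
end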

With my patch the run loop looks something like this:

setup_input_buffer stream
setup_output_buffer stream

release_GVL do # other ruby threads may run in parallel from here
  until interrupted do # interrupted is set when ruby wants the thread back
    status = inflate stream

    return output_buffer if status == STREAM_END

    if status == BUF_ERROR then
      expand_output_buffer_without_GVL stream # realloc(3), no GC involvement
      next
    else
      # …
    end
  end
end

The differences are releasing the GVL, the interrupted flag and the change in expanding the output buffer.

To reduce copying the Zlib extension uses a ruby String for the output buffer, so expanding it requires GC interaction. The new function uses realloc(3) directly to avoid GVL contention and mirrors rb_str_resize().

The interrupted flag allows timeouts, a killed thread or a clean shutdown to proceed smoothly. As in the original loop, the buffer is only expanded by 16KB at a time, here so interruptions can be serviced quickly. Without this flag a large stream would need to complete before the thread it is running in could be shut down, which may take several seconds. (In the C version of release_GVL, rb_thread_blocking_region(), you provide an unblocking function that sets the interrupted flag when ruby wants to regain control of the thread.)
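
In the same pseudocode style, the unblocking hook might be wired up like this (passing the unblocker as an argument to release_GVL is purely an illustration; rb_thread_blocking_region() takes it as a separate C function pointer):

interrupted = false

unblock = proc do
  interrupted = true # ruby calls this to regain control of the thread
end

release_GVL unblock do
  until interrupted do
    # the inflate loop from above
  end
end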

Releasing the GVL allows other threads to run the Ruby VM, so while a stream is being inflated other threads can run on other CPUs. For large streams this can result in a performance increase when other Ruby threads are running.

Worst Case Benchmark

After inspecting the two implementations we have enough information to create a worst-case scenario for testing the performance of the patch.

Since GVL release/acquire can be expensive we want to do it as often as possible, which means each inflate call should do as little work as possible. The buffer should not be expanded, since expansion allows more than one trip through inflate() per GVL release/acquire, so inflated strings must fit in the 1024 byte output buffer.

I wrote a benchmark that uses multiple threads to inflate many strings, each with an inflated size of 1000 bytes, but not so many strings that the benchmark section takes long to return. (While inflating zero-byte strings would probably be worse, that seems unrealistic.)
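
As a quick sanity check (not from the original post) that a 1000 byte payload round-trips in a single pass of the default 1024 byte buffer:

require 'zlib'

input    = Random.new(0).bytes 1000
deflated = Zlib::Deflate.deflate input

Zlib::Inflate.inflate(deflated).bytesize # => 1000, fits in one 1024 byte buffer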

Here's the benchmark program I wrote:

require 'zlib'
require 'benchmark'

r = Random.new 0 # fixed seed so the benchmark is repeatable

file_count = 10_000

deflated = file_count.times.map do # exactly 10,000 deflated strings
  input = r.bytes 1000 # 1000 bytes inflated fits the 1024 byte buffer
  Zlib::Deflate.deflate input
end

times = Benchmark.measure do
  (0..3).map do # four threads inflating in parallel
    Thread.new do
      deflated.each do |input|
        Zlib::Inflate.inflate input
      end
    end
  end.each do |t|
    t.join
  end
end

puts times.real

Instead of using time(1), Benchmark.measure is used, and only around the inflate part, which reduces noise in the measurement. By choosing only 10,000 files multiple benchmark runs can be made quickly (on my machine the inflate portion takes a little over ½ second). By using a fixed random seed the benchmark is repeatable.

Before using ministat I ran the same benchmark, but with 100,000 files. Without the patch I had the following results:

$ for f in `jot 5`; do ruby20 test.rb; done
5.420000   5.970000  11.390000 (  8.162893)
5.400000   6.270000  11.670000 (  8.263046)
5.460000   5.920000  11.380000 (  8.133742)
5.410000   6.290000  11.700000 (  8.289913)
5.500000   6.620000  12.120000 (  8.478085)

and with the patch:

$ for f in `jot 5`; do make -s runruby; done
5.120000   6.240000  11.360000 (  8.039715)
5.240000   6.260000  11.500000 (  8.097961)
5.280000   5.940000  11.220000 (  8.004246)
5.210000   6.360000  11.570000 (  8.171124)
5.240000   6.200000  11.440000 (  8.054929)

So while running with the patch seemed faster, the results overlap by a small margin. I wanted to be reasonably sure there was an improvement, which is where ministat comes in. To get a useful answer out of ministat I would need more data (this is easy to verify by running ministat on these results and seeing that it says "No difference").

To capture the data I used a shell for loop to run the benchmark 50 times, redirecting the output to a file:

$ for f in `jot 50`; do make -s runruby; done > with

I then did the same without the patch applied, redirecting to a "without" file.

ministat

To determine if the results were statistically significant I ran the data through ministat at a confidence level of 99%. From the ministat man page:

The ministat command was written by Poul-Henning Kamp out of frustration over all the bogus benchmark claims made by people with no understanding of the importance of uncertainty and statistics.

So through ministat we will be able to determine if the suspected difference from the previous benchmark is an actual difference, and what the certainty of that difference is.

$ ministat -c 99 -s without with
x without
+ with
+------------------------------------------------------------------------------+
|                           +     + +  x    *  xx      x                       |
|         +     x +   +     +  +  + +  x   **+xx*xx    x * x +               x |
|+        ++++  x + x*+ * + ++x+x++x*++x* ***+x**xx * +x**** * + x   x       xx|
|                                |_____________A_____________|                 |
|                    |______________A_______________|                          |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  50    0.61660767    0.71641994    0.66669798    0.66647618   0.022548872
+  50    0.59242272    0.69294882    0.64941692    0.64917859    0.02509235
Difference at 99.0% confidence
	-0.0172976 +/- 0.0125332
	-2.59538% +/- 1.88051%
	(Student's t, pooled s = 0.0238545)

Looking at the last four lines, you can see there is a statistically significant difference in the benchmark results at 99% confidence. Running the benchmark with the patch shows a 17ms ± 13ms decrease in time, or a 2.6% ± 1.9% decrease, for this sample set, so I can say "minor improvement".
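
The reported difference is simply the gap between the two averages: 0.64917859 − 0.66647618 = −0.0172976 seconds, and −0.0172976 / 0.66647618 ≈ −2.6%.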

The box shows a plot of the with and without values along with the average (and median) values, and standard deviation bars. For full details see the ministat man page.

There are a couple of likely reasons for the slight increase in performance: inflate may take long enough relative to the GVL release/acquire that parallelism pays off, or the GVL release/acquire may not be as expensive as thought. Without an easy way to measure GVL contention, though, this is only speculation.

Getting ministat

To my knowledge ministat only ships with FreeBSD, but fortunately it is really easy to add to nearly any system with a C compiler. You can find a copy of ministat in this repository (which, ironically, uses autotools) or you can grab the C file directly, run cc ministat.c -lm -o ministat and move the ministat executable into your PATH like I did.
