Unleash the test army

Posted: May 29, 2017 8:54 AM by Thomas Guest

This post originated from an RSS feed registered with Python Buzz by Thomas Guest.
Original Post: Unleash the test army
Feed Title: Word Aligned: Category Python
Feed URL: http://feeds.feedburner.com/WordAlignedCategoryPython
Feed Description: Dynamic languages in general. Python in particular. The adventures of a space sensitive programmer.

Are the tests adequate?

Recently I described a solution to the problem of dividing a list into evenly sized chunks. It’s a simple enough problem with just two inputs: the list (or other sliceable container) xs and the number of chunks n. Nonetheless, there are traps to avoid and special cases to consider — what if n is larger than the length of the list, for example? Must the chunks comprise contiguous elements from the original list?

The tests I came up with are straightforward and uninspiring. They were developed within the context of my own assumptions about the solution and the special cases I could imagine. They were written after the implementation — which is to say, development wasn’t driven by tests. They are whitebox tests, designed to cover the various paths through the code based on my insider knowledge.

Are these tests adequate? Certainly they don’t accurately represent the data which will hit the algorithm in practice. Can we be sure we haven’t missed anything? Would the tests still cover all paths if the implementation changed?

Property based testing

David R. MacIver described another, complementary, approach at a talk I attended at ACCU 2016. In the talk abstract he characterises the (class of) tests I’d written as anecdotal — “let me tell you about this time I called a function … and then it returned this … and then it raised an exception … etc. etc.”

How about if the test suite instead describes the properties required of the system under test, and then conjures up inputs designed to see if these properties hold under stress? So, rather than our test suite being a limited set of input/output pairs, it becomes an executable specification validated by a robot army.
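To make the idea concrete before diving in, here’s a minimal sketch (my own illustration, not from the talk) of a property based test for Python’s built-in sorted: rather than asserting one input/output pair, it states a property which must hold for any list, and lets the library generate the examples.

import hypothesis.strategies as st
from hypothesis import given

@given(st.lists(st.integers()))
def test_sorted_is_ordered(xs):
    # Property: every adjacent pair in the sorted result is in order.
    ys = sorted(xs)
    assert all(a <= b for a, b in zip(ys, ys[1:]))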

[Image: China's robot army]

Hypothesis

This approach sounds compelling but I had my doubts. I also had my doubts about the adequacy of both my code and tests. A perfect opportunity, then, to try out Hypothesis, an open source property-based testing library developed by David MacIver.

I used the Python version of the library, which is the primary implementation. The rest of this article describes my experience of using hypothesis for the first time: I’m not claiming expertise.

First impressions

Excellent!

Installation was the usual pip invocation. The documentation is written with care. It’s evident the library is mature, supported and actively developed. It’s licensed under the Mozilla Public License.

My first test

Recall that the code I wanted to test reads:

def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    L = len(xs)
    assert 0 < n <= L
    s, r = divmod(L, n)  # each chunk holds at least s items; r chunks hold one extra
    t = s + 1
    return ([xs[p:p+t] for p in range(0, r*t, t)] +  # r chunks of length s+1
            [xs[p:p+s] for p in range(r*t, L, s)])   # n-r chunks of length s

I also proposed a second itertools based version:

from itertools import accumulate, chain, repeat, tee

def chunk(xs, n):
    '''Split the list, xs, into n evenly sized chunks'''
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)
    widths = chain(repeat(s+1, r), repeat(s, n-r))  # r wider chunks, then n-r narrower
    offsets = accumulate(chain((0,), widths))       # running totals mark chunk boundaries
    b, e = tee(offsets)                             # pair up consecutive boundaries ...
    next(e)                                         # ... as (begin, end) for each slice
    return [xs[s] for s in map(slice, b, e)]
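
To see how the itertools version assembles its slices, here’s a trace of the intermediate values for an eight element list split into three chunks:

# xs = list(range(8)), n = 3
# L = 8; s, r = divmod(8, 3)  ->  s = 2, r = 2
# widths:  3, 3, 2       (r chunks of s+1, then n-r chunks of s)
# offsets: 0, 3, 6, 8    (running totals of the widths)
# slices:  xs[0:3], xs[3:6], xs[6:8]
chunk(list(range(8)), 3)  # [[0, 1, 2], [3, 4, 5], [6, 7]]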

The first thing you notice when thinking about a property based test is that the specification — the function’s docstring — doesn’t describe the exact form of the output. In fact, as a comment on the article pointed out, my own interpretation of the specification is not the only one, and allowing the chunks to be formed from non-contiguous items permits a particularly elegant solution.

Also, if the list doesn’t divide exactly into n chunks, what should the result be? Well, although I’d have been happy with any evenly-chunked solution, my conventional unit tests assumed an implementation which placed the larger chunks first.

def test_chunk():
    assert chunk('', 1) == ['']
    assert chunk('ab', 2) == ['a', 'b']
    assert chunk('abc', 2) == ['ab', 'c']
    
    xs = list(range(8))
    assert chunk(xs, 2) == [[0, 1, 2, 3], [4, 5, 6, 7]]
    assert chunk(xs, 3) == [[0, 1, 2], [3, 4, 5], [6, 7]]
    assert chunk(xs, 5) == [[0, 1], [2, 3], [4, 5], [6], [7]]
    
    rs = range(1000000)
    assert chunk(rs, 2) == [range(500000), range(500000, 1000000)]

Notice, by the way, that although the docstring only mentions lists, I can’t resist demonstrating the algorithm also behaves for strings and ranges — for any sliceable sequence, in fact.

Here’s what I started with when I tried specifying the “evenly sized” property using hypothesis.

def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

This first test case defines “evenly sized”, stating that the result comprises n chunks, that the set of chunk lengths contains either 1 distinct value (all chunks the same size) or 2, and that the maximum chunk length is equal to or one more than the minimum chunk length.

This doesn’t fully specify the function. We also need assertions which confirm that recombining the chunks produces the original sequence.

def test_combining_chunks(xs_n):
    pass # We'll come back to this!

We’ll come back to this later!

Now, test_evenly_chunked() looks quite like a conventional test function. It just needs some input values. Rather than poke the function with some hand-chosen values, we can let hypothesis have a go.

Based on a read of the Quick start guide I tried this:

import hypothesis as ht
import hypothesis.strategies as st

@ht.given(xs=st.lists(), n=st.integers())
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

As you can see, the test function pre-conditions are encapsulated in a hypothesis.given decorator, which specifies the use of hypothesis.strategies.lists() and hypothesis.strategies.integers() to generate test values for xs and n respectively.

The result was a lengthy but helpful failure, which printed out the documentation of the lists() strategy and the usage error:

hypothesis.errors.InvalidArgument: Cannot create non-empty lists without an element type

OK then. The function doesn’t really care about the element type. Integers will do.

@ht.given(xs=st.lists(st.integers()), n=st.integers())
def test_evenly_chunked(xs, n):
    ....

This gave me an error, along with a minimal test case which produces it.

xs = [], n = 0

def chunk(xs, n):
        '''Split the list, xs, into n evenly sized chunks'''
        L = len(xs)
>       assert 0 < n <= L
E       AssertionError

Our function, chunk(), requires the value n to lie in the half-open interval (0, len(xs)]. Looking more closely at the failure, we can see that the function under test isn’t great: we can’t split an empty list into any number of chunks, since in that case L is zero and no value of n satisfies 0 < n <= L.

At this point I had to make some choices:

  • should my tests confirm chunk() was checking pre-conditions (by catching the AssertionError)? See the sketch after this list.
  • should my function handle the case when n > L? It’s not the intended use of the function, but it can be handled.
  • what about when n == 0? Splitting a non-empty list into 0 chunks is impossible, but I guess an empty list can be split into 0 chunks.
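
To illustrate the first of these choices, here’s a minimal sketch of what a pre-condition test might look like, using pytest’s raises helper (illustrative only; as described below, I decided against this option):

import pytest

def test_rejects_zero_chunks():
    # Splitting a non-empty list into zero chunks should trip the assertion.
    with pytest.raises(AssertionError):
        chunk([1, 2, 3], 0)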

My second test

I made some decisions.

  • I decided not to test the pre-condition assertions. Instead, I’d modify the test strategy to pass in valid inputs.
  • I decided I’d go with the itertools chunk function which naturally handles n > L.
  • I decided my function needn’t handle n == 0, even when xs == [].

Here’s the modified test code:

@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1))
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}
    ....

When I tried running the tests again, they appeared to hang until I interrupted them.

> py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
....
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py   C-c C-c

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
to show a full traceback on KeyboardInterrupt use --fulltrace

Now, I had a suspicion that chunk() couldn’t really handle any input integers — it was designed for a value of n equal to multiprocessing.cpu_count() — but I wanted to see what would happen with no upper limits. Here was my answer. Running and interrupting again with --fulltrace set, I got several pages of output ending with the test inputs:

xs = [50], n = 67108865

Evidently my code was taking a while to create a list comprising a single list [50] and over 67 million empty lists [], [], [], ....

Once again, I had a decision to make. Perhaps unsurprisingly, it’s a decision I’d already faced. I could make chunk() a generator function, yielding the chunks one at a time — a trivial and natural change to the itertools based implementation — or I could constrain the tests to more suitable values of n.
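
For the record, here’s a sketch of that generator version; it’s just the itertools implementation with the list comprehension swapped for a generator expression (the name chunk_iter is mine):

from itertools import accumulate, chain, repeat, tee

def chunk_iter(xs, n):
    '''Generate n evenly sized chunks of xs, one at a time.'''
    assert n > 0
    L = len(xs)
    s, r = divmod(L, n)
    widths = chain(repeat(s+1, r), repeat(s, n-r))
    offsets = accumulate(chain((0,), widths))
    b, e = tee(offsets)
    next(e)
    # Lazy: no chunk is sliced until the caller asks for it.
    return (xs[s] for s in map(slice, b, e))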

In this case I decided to stick with what I had: my function would accept a list and return a list (of lists). In an attempt to get some passing tests, I set a maximum value on n.

@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1, max_value=100))
def test_evenly_chunked(xs, n):
    ....

At last, I had a passing test.

py.test 
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py .

========================== 1 passed in 0.23 seconds ===========================

Building on this success, I wanted to confirm the function also handled other sliceable types — strings and bytes specifically. Hypothesis provides a one_of strategy for combining other strategies.

@ht.given(xs=st.one_of(st.text(),
                       st.binary(),
                       st.lists(st.integers())),
          n=st.integers(min_value=1, max_value=100))
def test_evenly_chunked(xs, n):
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

Again, the test passes.

py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 1 items

test_chunk.py .

========================== 1 passed in 0.30 seconds ===========================

This output is rather inscrutable. Generally, passing tests shouldn’t draw attention to themselves, but what inputs had my test strategies generated? Were they sufficient?

A command-line switch provides a little more detail.

py.test --hypothesis-show-statistics

....

test_chunk.py::test_evenly_chunked:

- 200 passing examples, 0 failing examples, 0 invalid examples
  - Typical runtimes: < 1ms
  - Stopped because settings.max_examples=200
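
That max_examples limit is configurable. Assuming hypothesis’s settings decorator, here’s a sketch of raising the example budget for a single test:

import hypothesis as ht
import hypothesis.strategies as st

@ht.settings(max_examples=1000)  # generate more inputs for this test
@ht.given(xs=st.lists(st.integers()),
          n=st.integers(min_value=1, max_value=100))
def test_evenly_chunked(xs, n):
    ....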

It’s also possible to peek at examples produced by the test strategy.

>>> s=st.one_of(st.text(), st.binary(), st.lists(st.integers()))
>>> s.example()
b''
>>> s.example()
b'\xc2\xfd6['
>>> s.example()
':\nú&\U000ea7e8'
>>> s.example()
b'\xe7z'
>>> s.example()
''
>>> s.example()
[184, 36, -205, 1486638017]

Complete test suite

Here’s my final test suite. Rather than hard code a maximum value for n, I used a composite strategy which adapts n to the size of xs. I’ve also added a test which confirms the result does comprise chunks of the input sequence.

import functools

import hypothesis as ht
import hypothesis.strategies as st

@st.composite
def items_and_chunk_count(draw):
    xs = draw(st.one_of(st.text(),
                        st.binary(),
                        st.lists(st.integers())))
    n = draw(st.integers(min_value=1,
                         max_value=max(1, len(xs))))
    return xs, n

@ht.given(xs_n=items_and_chunk_count())
def test_evenly_chunked(xs_n):
    '''Verify there are n evenly sized chunks'''
    xs, n = xs_n
    chunks = chunk(xs, n)
    assert len(chunks) == n
    chunk_lens = {len(c) for c in chunks}
    assert len(chunk_lens) in {1, 2}
    assert max(chunk_lens) - min(chunk_lens) in {0, 1}

@ht.given(xs_n=items_and_chunk_count())
def test_combining_chunks(xs_n):
    '''Verify recombining the chunks reproduces the original sequence.'''
    xs, n = xs_n
    chunks = chunk(xs, n)
    assert functools.reduce(lambda x, y: x+y, chunks) == xs

Quality of failure

In the comments on my original article, Mike Edey put forward an elegant solution to the original problem of evenly subdividing a sequence into an exact number of chunks:

def chunk(xs, n):
    return [xs[index::n] for index in range(n)]
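
For example, splitting eight items into three chunks interleaves the elements rather than taking contiguous runs:

>>> chunk(list(range(8)), 3)
[[0, 3, 6], [1, 4, 7], [2, 5]]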

This is a delightful piece of code, and an approach I simply hadn’t considered. If the input list xs represents a number of tasks to be distributed amongst n workers, this does the job evenly. In my actual motivating example, though, the input sequence was a document which caused a problem, and what I wanted to do was split that document into a number of sections and see which of these exhibited the same problem: that is, I needed the chunks to be contiguous blocks of text from the original document. This is the property which test_combining_chunks() checks.

Running Mike Edey’s implementation through the test suite, we get:

py.test
============================= test session starts =============================
platform win32 -- Python 3.5.3, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /work/sliced-python, inifile:
plugins: hypothesis-3.8.3
collected 2 items

test_chunk.py .F

================================== FAILURES ===================================
____________________________ test_combining_chunks ____________________________

@ht.given(xs_n=items_and_chunk_count())
>   def test_combining_chunks(xs_n):

test_chunk.py:29: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:634: in wrapped_test
    state.run()
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:531: in run
    print_example=True, is_final=True
d:\venvs\slackbot\lib\site-packages\hypothesis\executors.py:58: in default_new_style_executor
    return function(data)
d:\venvs\slackbot\lib\site-packages\hypothesis\core.py:113: in run
    return test(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

xs_n = ('001', 2)

@ht.given(xs_n=items_and_chunk_count())
    def test_combining_chunks(xs_n):
        '''Verify recombining the chunks reproduces the original sequence.'''
        xs, n = xs_n
        chunks = chunk(xs, n)
>       assert functools.reduce(lambda x, y: x+y, chunks) == xs
E       AssertionError: assert '010' == '001'
E         - 010
E         + 001

test_chunk.py:33: AssertionError
--------------------------------- Hypothesis ----------------------------------
Falsifying example: test_combining_chunks(xs_n=('001', 2))
===================== 1 failed, 1 passed in 0.52 seconds ======================

Hypothesis has discovered a minimal failing example: the string '001' splits into 2 chunks, '01' and '0', which recombine to give '010' rather than '001'.

Conclusions

Hypothesis worked well for this particular example.

  • it forced me to pin down the function specification
  • I had to consider the special cases: would the function behave for all logically permissible inputs, not just the ones I had in mind when I wrote it?
  • it increased my confidence the function was correct
  • and particularly appealing, in this case — the tests were not tied to a detail of the implementation, and would continue to work if, for example, the larger chunks were to appear at the end of the results.

More generally, I found the hypothesis library solid. It’s well designed and documented, as well as being a fine example of how to use Python decorators.

I’d say property based testing complements example based testing. Example based unit tests show you, by example, how a function is used; with hypothesis, this useful demonstration happens behind the scenes (though note that your hypothesis tests can include explicit @examples). Example based unit tests are also typically one or two orders of magnitude quicker to execute. It’s not a problem if a couple of tests take half a second to run, but what if you have a couple of thousand tests?
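
Here’s a sketch of pinning such an explicit example to a property based test, using hypothesis’s example decorator (the test itself is my own illustration):

import hypothesis as ht
import hypothesis.strategies as st

@ht.given(n=st.integers(min_value=1, max_value=100))
@ht.example(n=1)  # always exercise the single-chunk case too
def test_singleton_chunks(n):
    xs = list(range(n))
    assert chunk(xs, n) == [[x] for x in xs]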

In my case the built-in strategies were good enough to generate inputs to my function. I can imagine that’s not the case for functions higher up a software stack. Test setup functions can be hard work and I suspect test setup strategies would be harder.

In closing, I’d like to quote from the section of the Hypothesis documentation which describes its purpose.

Software is, as they say, eating the world. Software is also terrible. It’s buggy, insecure and generally poorly thought out. This combination is clearly a recipe for disaster.

And the state of software testing is even worse. Although it’s fairly uncontroversial at this point that you should be testing your code, can you really say with a straight face that most projects you’ve worked on are adequately tested?

A lot of the problem here is that it’s too hard to write good tests. Your tests encode exactly the same assumptions and fallacies that you had when you wrote the code, so they miss exactly the same bugs that you missed when you wrote the code.

The Purpose of Hypothesis
