This post originated from an RSS feed registered with Python Buzz
by Thomas Guest.
Original Post: DEFLATE: run-length encoding, but better
Feed Title: Word Aligned: Category Python
Feed URL: http://feeds.feedburner.com/WordAlignedCategoryPython
Feed Description: Dynamic languages in general. Python in particular. The adventures of a space sensitive programmer.
Run-length encoding is a simple compression scheme in which runs of equal values are represented by the value and a repeat count. For example, a supermarket cashier might process a line of shopping as:
4 bananas
3 apples
2 bananas
1 pineapple
3 apples
Unix packs in its very own run-length encoder, uniq -c. It works just fine — so long as the values you want to encode are newline-separated byte strings, that is.
Let’s use a sequence of coin tosses as an example stream. $RANDOM generates random numbers, and we use the least significant bit of these numbers as an index into an array containing the values heads and tails.
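The tosses function itself hasn't survived in this copy of the post; here's one way it might be written in bash (an assumption — the original definition may differ):

```shell
# Emit an endless stream of coin tosses, one per line.
# $RANDOM is bash's built-in random number generator; its least
# significant bit indexes into the (heads tails) array.
tosses() {
    local faces=(heads tails)
    while :; do
        echo "${faces[$((RANDOM & 1))]}"
    done
}
```

Piping through head stops the infinite loop: when head exits, the next echo receives SIGPIPE and the generator terminates cleanly.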
An awk script can be used as a run-length decoder. (There must be a neater way, using sed maybe?)
$ runlendec() { awk '{ while ($1--) print $2 }'; }
$ tosses | head | tee orig.log | uniq -c | runlendec | tee encdec.log
heads
tails
heads
tails
heads
heads
tails
tails
heads
heads
$ diff orig.log encdec.log
Here, we toss a coin 10 times, teeing the original sequence to a file. The next two links in the pipeline compress and decompress the sequence, teeing the results to another file. Finally, as a sanity check, we confirm the round-tripped results are the same.
Run-length encoding in Python
This Unix run-length codec is fun, but of limited practical use. One good feature, though, is the way it operates on streams of data (including infinite streams), leaving clients free to decide how best to slice and buffer these streams.
Python has a fine library of high-level stream transformation tools from which we can build a generic and flexible run-length codec in just a few lines. Since I want to progress from run-length coding to something more advanced, I’ll leave discussing how to implement this codec for now, but if you’d like to write your own version, here’s a description suitable for doctesting.
Import the run-length codec functions and compress a short string.
>>> from runlength import compress, decompress
>>> comp = compress('AABBBACC')
The returned compressor is a stream (an iterable).
>>> next(comp)
(2, 'A')
Pull the rest of the stream into memory.
>>> rest = list(comp)
>>> rest
[(3, 'B'), (1, 'A'), (2, 'C')]
Simple decompress example.
>>> concat = ''.join
>>> concat(decompress(rest))
'BBBACC'
Both compress and decompress work with infinite streams, like the a2b3 stream, which repeatedly cycles two pairs.
>>> from itertools import cycle, islice
>>> a2b3 = cycle([(2, 'a'), (3, 'b')])
>>> dec = decompress(a2b3)
Pull 8 values from the decompressed stream.
>>> concat(islice(dec, 8))
'aabbbaab'
Now compress the decompressed stream, and explore a few items.
>>> comp = compress(dec)
>>> next(comp)
(2, 'b')
>>> list(islice(comp, 2))
[(2, 'a'), (3, 'b')]
DEFLATE
The Wikipedia page on run-length encoding identifies monochrome images as good candidates for run-length compression. The white and black pixels typically group into long runs. Indeed, any simple image using a limited palette should reduce well using this compression scheme.
The chessboard above is 256×256 pixels, each square being 32×32 pixels. We could run-length encode this 64K pixel image as 256×8 = 2K runs of 32 pixels, a decent saving. (Actually, we should do slightly better, noting that there are runs of length 64 at the chessboard rank boundaries, but you get the idea.)
Like a paletted image, a block of text — the web page you’re reading now, for example — employs a limited alphabet. Although the characters in this text don’t usually group into long runs there’s plenty of repetition, especially in the raw HTML: all the occurrences of <div>, <span> and class used for CSS styling, for example. The DEFLATE compression algorithm uses a clever twist on run-length encoding to remove this redundancy:
The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. (RFC-1951)
(In addition, a multiple-level dynamic Huffman encoding scheme reduces the space needed for the strings, distances and lengths themselves.)
There’s more to these pointer elements than first appears: the length can exceed the backward distance. Thus the sequence:
heads
heads
heads
heads
heads
can be deflated as the literal type heads\n followed by the pointer type <24, 6>.
If you’ve spotted the potential for recursion, good! The inflating stream can reference itself, which can reference itself, which can … Confusing?
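To see why the length may exceed the backward distance, here's a sketch of an inflater's pointer-resolution loop (hypothetical helper name; the real zlib code is considerably more involved):

```python
def copy_back(out: bytearray, distance: int, length: int) -> None:
    """Resolve a <length, backward distance> pointer, LZ77-style.

    Copying proceeds byte by byte, so the copy may read bytes it
    has only just written -- that's how length can exceed distance.
    """
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])

buf = bytearray(b"heads\n")   # the literal element, 6 bytes
copy_back(buf, 6, 24)         # the pointer element <24, 6>
# buf now holds five lines of "heads": 6 + 24 = 30 bytes
```

A pointer like <9, 1> is the degenerate case: it repeats the single preceding byte nine times, which is plain run-length encoding.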
Zipping pixels
PNG images use DEFLATE compression (as implemented by zlib) to save on pixel storage space. Here’s a binary view of the raw data in the chessboard graphic shown above, all 137 bytes of it. The 64K pixels themselves compress into an 88-byte IDAT chunk, of which the final 8 bytes are checksums: the zlib stream’s Adler-32, then the chunk’s own CRC-32. Maybe the image could be squeezed harder, but I’m impressed!
I’ve attempted to show the first few stages of the genesis of the uncompressed stream in the picture below. The way the stream recursively inflates itself is quite beautiful.
put 00
put ff
go back 1 (to ff), put 3
put 00
go back 1 (to 00), put 3
go back 8 (to 00 00 00 00 ff ff ff ff), put 24
Two elements later, and the repeat length has grown to 258. In fact, the entire chessboard is generated from just 3 literal and 43 pointer elements.
(Not all graphics have such a regular pattern, of course, so we can’t always achieve such dramatic compression.)
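As a rough cross-check of those numbers, here's a sketch that rebuilds an approximation of the chessboard's raw scanline data (1 bit per pixel, with a leading PNG-style filter byte per row; not byte-identical to the actual file, since PNG's filter choices and zlib settings may differ) and deflates it with zlib:

```python
import zlib

def scanline(y):
    # One 256-pixel row packed at 1 bit per pixel (32 bytes),
    # preceded by a filter byte of 0. Squares are 32 pixels
    # (4 packed bytes) wide, in bands 32 rows tall.
    band = (y // 32) % 2
    return bytes([0]) + bytes(0xFF * (((x // 4) + band) % 2)
                              for x in range(32))

raw = b"".join(scanline(y) for y in range(256))
packed = zlib.compress(raw, 9)
print(len(raw), len(packed))  # 8448 bytes of raw data; packed is far smaller
```

The deflated size comes out at well under 1% of the raw scanline data, consistent with the tiny IDAT payload above.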
Deflated HTML
Web servers can and do save on bandwidth by transferring gzip-compressed HTML to gzip-capable clients. (Gzip is a simple wrapper around DEFLATE.) Any PNG images transferred will also have their pixels DEFLATE compressed.
$ curl http://wordaligned.org --head --compress
HTTP/1.1 200 OK
Date: Sun, 17 May 2009 17:41:53 GMT
Server: lighttpd | Word Aligned
Content-Type: text/html; charset=UTF-8
....
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 20
The Word Aligned front page contains about 75Kb of HTML, which gzips to just 16Kb — a decent saving. Relevant lines from the lighttpd configuration file read:
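The configuration snippet itself hasn't survived in this copy of the post; a typical lighttpd 1.4 mod_compress setup (an illustration, not the author's actual lines) looks like:

```
server.modules += ( "mod_compress" )
compress.cache-dir = "/var/cache/lighttpd/compress/"
compress.filetype  = ( "text/html", "text/plain", "text/css" )
```

With this in place, lighttpd caches a compressed copy of each eligible response and serves it to clients whose Accept-Encoding header admits gzip.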
I uphold Gzip (built on zlib, which implements DEFLATE) as a hero of the web. As we’ve seen, it implements a powerful and elegant algorithm, but perhaps the best thing about it is that it’s free to use, a freedom worth fighting for. Check out this battle report from the FAQ.
What about patents?
gzip was developed as a replacement for compress because of the UNISYS and IBM patents covering the LZW algorithm used by compress.
I have probably spent more time studying data compression patents than actually implementing data compression algorithms. I maintain a list of several hundred patents on lossless data compression algorithms, and I made sure that gzip isn’t covered by any of them. In particular, the --fast option of gzip is not as fast it could, precisely to avoid a patented technique. — Jean-Loup Gailly, Gzip FAQ