This post originated from an RSS feed registered with Python Buzz
by Thomas Guest.
Original Post: The Third Rule of Program Optimisation
Feed Title: Word Aligned: Category Python
Feed URL: http://feeds.feedburner.com/WordAlignedCategoryPython
Feed Description: Dynamic languages in general. Python in particular. The adventures of a space sensitive programmer.
The list of experts who’ve cautioned against optimising computer
programs too soon reads like a who’s who of computer science. Michael
A Jackson famously reduced the advice to a couple of succinct rules:
The First and Second Rules of Program Optimisation
Don’t do it.
(For experts only!): Don’t do it yet.
The first rule guards us against writing complicated code in the
belief it will run faster – it’s simple, readable code we want, and
humans are notoriously bad at guessing where the bottlenecks in a
program will be. The second rule notes that computers get faster and
tools (such as compilers) improve, so a program which may stress a
computer today will run comfortably on another computer in six months.
It’s the second of these rules I’m going to discuss. Of course I agree
with it, but there are caveats.
Caveats
Since computers get exponentially faster, linear optimisations to a
program are a waste of effort in the long run. This is true, but …
It’s not always simple to replace a computer with a more up-to-date
version: you may be able to download the latest drivers, but you
can’t download the latest processor.
Users’ expectations increase in line with computers’ power: so in
six months’ time you may find they want their computer to do
rather more.
In an agile development environment, you’ll probably want to
demonstrate your product to a customer today, even if the final
delivery is scheduled for some time next year.
In a typical project, it’s preferable to settle on a hardware platform
sooner rather than later.
Recently, the second and third of these caveats bit me. I can’t go into details
about the project here – let’s say I was working on a dedicated server
which acted a bit like a traffic junction. Several roads fed into the junction and
a single road fed out of it. Traffic should flow through this junction smoothly
in real time with no user intervention. And the server had to run reliably even
when (here’s the specialised hardware requirement)
subjected to a high level of vibration.
Preliminary Development
During preliminary development we used standard off-the-shelf
machines running suitably stripped versions of Linux. We chose to
divide the job into three core processes: one, written in C, addressed
the drivers and platform directly; another, in C++, did the bulk of
the heavy processing; and finally, a Python program performed the
more fiddly logical operations. To use our road-junction analogy: the C++
program dealt with incoming traffic, the C program with outgoing
traffic, and Python managed the junction itself.
We soon had something we could demonstrate to the customer. We
explained the demonstration was merely a proof-of-concept, and in
particular used unrepresentative hardware (please don’t shake
it!). During the demo the server ran comfortably, in a steady state
consuming about 10% of the available processor power, and of this
percentage the C:C++:Python ratio was about 6:3:1. At peak times we
saw processor use double to around 20% – no cause for concern.
On the strength of this success, we arranged a date for a second
demonstration.
In parallel with this ongoing development the customer was working on
the requirements and we were investigating a more suitable hardware
platform (one which would work reliably when shaken, that is).
Changing Requirements
As I mentioned, the server’s job was to process a steady flow of
traffic. A critical factor in performing this job is the rate of flow
of this traffic – how many vehicles per second would the server need
to handle? It turned out that the customer wanted to double the
original specified rate. Could our prototype handle the new peak rate?
Changing Hardware
We found a supplier who could provide a server guaranteed to function
over the specified vibration range – and with some room to
spare. The only problem was that the processor speed was roughly half
that of those we’d used in the original demonstration. We ordered
a trial machine and waited.
Changing Figures
It took a while for the trial machine to be delivered. It took a while
for us to install a suitable flavour of Linux. And it took a while for
us to learn we had a performance problem with our software on the new
platform at the new bitrate.
In the steady state the three processes were now typically consuming
50% of the available CPU – which I suppose we could have
predicted. Less predictably, the ratio of C:C++:Python had changed; it
was now about 3:4:3 (compared to the earlier 6:3:1). At peak times,
the load reached 100%.
This new load on the machine was unacceptable. Running at 100% capacity
caused the server’s function to degrade to the extent that we couldn’t
demonstrate it; and the next customer demonstration was scheduled in three
days’ time.
Why had the ratios changed?
I don’t have an answer. You only have to look at real-life traffic
flow to understand that a system can behave very differently when
running at high rates; and in particular, at maximum capacity,
things can go wrong. I’m guessing that the junction got
gridlocked.
This is a good example of why you shouldn’t optimise programs, or at
least not yet. If we’d optimised the original demonstration on the
original platform, we’d have concentrated on the C program – since it
appeared to be half as hungry again as the Python and C++ programs
combined. Now it was the C++ program we’d have to look at first.
Options
We could:
reschedule the second demo
run the demo on the original platform
try and optimise the software so it would perform adequately on the new platform
Let me restate the first two rules of program optimisation: don’t do
it, and (for experts only) don’t do it yet. I’ll add a third rule – a rule
which applies to software development in general – don’t rush it.
The last thing we wanted to do was complicate the software just to speed
it up. Given more time, couldn’t we source a faster machine?
C++ Optimisation
Despite these rules of software optimisation we decided it was at
least worth investigating the third option: we had enough time to
implement some opportunistic optimisation (that is, to find the few
places in the code which were slowing the server down the most, and
which might be easy to optimise).
The thing to do, of course, was to record current performance figures,
and then run a profiler to try and determine where time was being
spent. Actually, I ran two profilers: Callgrind and Shark. Shark
proved better for this real-time analysis since Callgrind slowed the
program down to the extent that it was no longer operating in
real-time and consequently spent much of its time in recovery
mode. Both profilers, however, told much the same story: the program
was processing a lot of data and it was spending a considerable
amount of time in the C++ standard library simply walking through
collections and comparing iterators.
The first part of this story was to be expected (there was lots of data
to be processed). The second part was more surprising: the C++ standard library
is designed to provide generic abstractions at no cost in performance – a vector should
be pretty much as efficient as an array, for example, and advancing an iterator
should be as efficient as incrementing a pointer.
How do you optimise the C++ standard library?
Easy. Let the compiler do it for you. GCC comes with a wide range
of optimisation options. I tried -O3, which enables the most aggressive
of its standard optimisation levels, and to
my surprise saw the program apparently run 5 times faster. This, to
me, was an unprecedented speed up. (We tried the same trick on the C
code, with almost no measurable effect.)
Why were we not routinely using optimised code? Because, until now, we
hadn’t needed to, and we preferred to have just one build variant; and
because a debug build can flag problems and offer useful diagnostics,
while an optimised build is less helpful. As the GCC manual says:
Turning on optimization flags makes the compiler attempt to improve
the performance and/or code size at the expense of compilation time
and possibly the ability to debug the program.
In our case, compilation times and code size weren’t an issue – it
wasn’t a huge program since the tricky code was all written in Python.
An extra build variant was a problem, as was any loss of ability to debug.
How do you optimise Python?
Python comes with a standard profiling library. It’s as easy
to use as:
Running the Python Profiler
$ python -m cProfile -o prof.dump <program to profile>
You then load the prof.dump output file and analyse it using the
pstats module. For example, to find which 10 functions took the
most time:
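Sketched out in full – generating a small dump first so the snippet is self-contained, where in practice you’d load the prof.dump produced by the command above:

```python
import cProfile
import pstats

# In practice this dump would come from:
#     python -m cProfile -o prof.dump <program to profile>
# Here we profile a trivial workload just to have something to analyse.
cProfile.run('sum(i * i for i in range(100000))', 'prof.dump')

# Load the dump and print the 10 functions with the greatest internal
# ("own") time; 'cumulative' is the other commonly useful sort key.
stats = pstats.Stats('prof.dump')
stats.sort_stats('time').print_stats(10)
```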
Armed with this evidence, I soon found a quick way to speed things up:
the Python program was sticking debug labels onto traffic passing through
the junction which the C program would then remove and analyse to confirm the
junction was behaving correctly. It turned out that these debug
labels – which were inessential to the server’s visible
functionality – took a considerable amount of time to compute. By
simply eliminating them, we got the Python program running at twice the speed.
Again, this was less than ideal. We weren’t happy to remove the debug
labels. In the event of things going wrong, we no longer had all the
information we needed. Of course, we could parametrise the
labelling decision but this again would mean introducing a variant we didn’t
particularly want to introduce.
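A minimal sketch of what parametrising that decision might look like – the names here (DEBUG_LABELS, label_vehicle) are illustrative, not our actual code:

```python
import os

# Illustrative only: gate the expensive debug labelling behind an
# environment variable, so a single code base supports both behaviours.
DEBUG_LABELS = os.environ.get('DEBUG_LABELS', '0') == '1'

def label_vehicle(vehicle):
    """Attach a debug label only when labelling is switched on."""
    if DEBUG_LABELS:
        # The real labels were costly to compute; this stands in for that work.
        vehicle['label'] = 'debug:%s' % vehicle['id']
    return vehicle

vehicle = label_vehicle({'id': 42})
```

The cost of this approach is exactly the variant we were reluctant to introduce: two modes of operation, one of which gets exercised far less often than the other.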
The Second Demonstration
Within a few hours, we’d got the server running at something closer to
the desired capacity, and without really compromising the code. We still
ended up selecting the second option – of running the demo on the
original hardware – for reasons I won’t go into here. I don’t regard
the time spent profiling the program as wasted: it’s always useful to
watch and measure your code in action, even if you need to rein in the
urge to act on the results. The third rule of program
optimisation applies: don’t rush it.