Software Jam Sessions
Memory debugging in Python -- fighting the GC
by Barry Warsaw
October 13, 2006

Summary
Debugging memory use in a Python program is hard, and sometimes you have to fight the garbage collector.

I work on a big embedded Python app which contains lots of extension types. The application normally runs in two phases: phase 1 builds up a big static model of the world based on the input data. This can create a lot of objects but the number of objects is fairly well bounded. However, these phase 1 objects contain reference cycles that they will live until the end of the phase 2. After phase 2, we want those objects to get collected (through decref or gc) because we may go back to a new phase 1 run. We'll see below why this can be a huge problem.

Phase 2 is less well-bounded. Lots of objects are created and destroyed, and lots of those objects participate in cycles. We definitely want Python's cyclic gc to run over these objects periodically to collect whatever trash exists. For most input data, our app is pretty well behaved memory-wise, but some input data can trigger very long phase two runs, and some of those long phase two runs can result in out-of-memory crashes. So, how do you debug and fix this, and are there any tools available to help?

Let me explain how we've tuned Python's gc. By default, we set the generation 0, 1, and 2 thresholds to 30000, 5, and 100000. The huge gen2 threshold effectively means we never collect gen2 cyclic trash. This seems bad at first, but it was implemented to try to fix another nasty problem.

As I mentioned above, our phase 1 can create lots and lots of objects and lots of cycles. Now, we definitely want gc to collect these objects, but we also know that they won't be collected until phase 2 is complete. The problem is that Python's gc traverses all the objects in a generation when it collects that generation. If you've got millions of gen2 objects that you know won't be collected this time through, you're paying a significant performance penalty for no gain. Our initial solution was to crank up the gen2 threshold so that we don't needlessly traverse all those millions of objects when we know they won't get collected until the end of phase 2.

But that causes another problem of course: many objects that could be collected during phase 2 end up in gen2 because they (correctly) survive the gen0 and gen1 collections. What it seems like we really want is a way to segregate the phase 1 objects so that we don't traverse them until after phase 2 is complete, but then let the normal gc process collect cycle trash quickly and effectively. More on this later.

Let me backtrack a bit. So, we're basically running out of memory sometimes, but why? The first step is to try to see if there's an obvious memory leak. Having been here before, we've instrumented debug builds of the app so that on program exit, it iterates through every Python object in existence, checking the object's reference count against what (through experience) we've come to expect. We see no regressions here in either the pathological cases (at least the ones that do eventually exit) or the non-pathological cases. This covers many of our's and Python's extension types, but not everything, and it doesn't address most pure-Python objects either. But it's a pretty good first-line-of-defense against stupid coding errors.

It is at this point that I start diving into memory analysis tools.

I develop on the Mac primarily, and there are several very good tools on that platform, which I'm just starting to learn. Shark is excellent, and between it and Activity Monitor, I can watch pathological cases grow very large in about an hour's time. One thing I've learned about Shark though is that if the process dies with an out of memory error before you stop collecting samples, Shark will throw up an error dialog and you've just lost all your data. I've submitted an Apple bug on this issue.

Unfortunately, while Shark provided some good clues, it didn't provide any definitive answers. On reason is that our app is highly recursive, and the stack traces can get unreadable. I decided to look at a few other OSX tools to see what information they could provide. Next up MallocDebug.app.

Unfortunately, I didn't get very far with this because of an underlying configuration problem. MallocDebug.app relies on lower level malloc(3) debugging support, such as the environment variables MallocStackLogging. You can actually run some decent malloc debugging from the shell, but I quickly learned that doing so crashes Python in such a way as to corrupt the stack, even when running in gdb. After much pain and single-stepping, I discovered that we were linking Python's _ssl module against OpenDarwin's OpenSSL 0.9.8b instead of Apple's default openssl 0.9.7i. If you turn on malloc debugging with the OpenDarwin version of openssl, it crashes too. I've opened a bug in the OpenDarwin project on this.

Still, with a bit of dilligent work (along with spending some quality time in valgrind on Linux), I was able to learn where a lot of our memory was going. There was one particular subsystem that was creating a lot of objects during phase 2; these objects contained cycles, but we know that these objects too will survive until the end of phase 2. I rewrote this subsystem to save the data in a database instead of keeping alive a big tree of Python objects, and that helped our app considerably. Instead of crashing w/oom on a particular app after say, an hour or two, the same app could run for almost 24 hours before running out of memory. Because the rewrite also allowed us to get intermediate results, it was a huge improvement.

But the question still remains, in the really pathological cases, where are all those objects coming from and why aren't they getting destroyed? In my next post, I'll talk about some of the modifications I've made to Python's gc to provide better diagnostics, and what some of that data tells me.

Talk Back!

Have an opinion? Readers have already posted 7 comments about this weblog entry. Why not add yours?

RSS Feed

If you'd like to be notified whenever Barry Warsaw adds a new entry to his weblog, subscribe to his RSS feed.

Digg |

del.icio.us |

About the Blogger

Barry Warsaw has been developing software and playing music for more than 25 years. Since 1995 he has worked in Guido van Rossum's Pythonlabs team. He has been the lead developer of the JPython system, and is now the lead developer of GNU Mailman, a mailing list management system written primarily in Python. He's also a semi-professional musician. Python and the bass are his main axes. Music and software are both at their best when enjoyed, participated in, and shared by their enthusiastic fans and creators.


	Web Artima.com