Debugging memory use in a Python program is hard, and sometimes you have to fight the garbage collector.
I work on a big embedded Python app that contains lots of extension types. The application normally runs in two phases: phase 1 builds up a big static model of the world based on the input data. This can create a lot of objects, but the number of objects is fairly well bounded. However, these phase 1 objects contain reference cycles, so they will live until the end of phase 2. After phase 2, we want those objects to get collected (through decref or gc) because we may go back to a new phase 1 run. We'll see below why this can be a huge problem.
Phase 2 is less well-bounded. Lots of objects are created and destroyed, and lots of those objects participate in cycles. We definitely want Python's cyclic gc to run over these objects periodically to collect whatever trash exists. For most input data, our app is pretty well behaved memory-wise, but some input data can trigger very long phase 2 runs, and some of those can end in out-of-memory crashes. So, how do you debug and fix this, and are there any tools available to help?
Let me explain how we've tuned Python's gc. We set the generation 0, 1, and 2 thresholds to 30000, 5, and 100000 (CPython's defaults are 700, 10, and 10). The huge gen2 threshold effectively means we never collect gen2 cyclic trash. This seems bad at first, but it was done to work around another nasty problem.
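For reference, this kind of tuning is applied with `gc.set_threshold` (the numbers below are the ones quoted above, not a recommendation):

```python
import gc

# CPython's default thresholds are (700, 10, 10).
print(gc.get_threshold())

# The tuning described above: a large gen0 threshold, a small gen1
# threshold, and a gen2 threshold so huge that full collections
# effectively never happen.
gc.set_threshold(30000, 5, 100000)
print(gc.get_threshold())   # (30000, 5, 100000)
```

Note that the gen2 threshold in CPython counts gen1 collections, not objects, so a value of 100000 really does postpone full collections almost indefinitely.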
As I mentioned above, our phase 1 can create lots and lots of objects and lots of cycles. Now, we definitely want gc to collect these objects, but we also know that they won't be collected until phase 2 is complete. The problem is that Python's gc traverses all the objects in a generation when it collects that generation. If you've got millions of gen2 objects that you know won't be collected this time through, you're paying a significant performance penalty for no gain. Our initial solution was to crank up the gen2 threshold so that we don't needlessly traverse all those millions of objects when we know they won't get collected until the end of phase 2.
But that causes another problem of course: many objects that could be collected during phase 2 end up in gen2 because they (correctly) survive the gen0 and gen1 collections. What it seems like we really want is a way to segregate the phase 1 objects so that we don't traverse them until after phase 2 is complete, but then let the normal gc process collect cycle trash quickly and effectively. More on this later.
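As an aside for readers on a modern CPython: since Python 3.7, `gc.freeze()` provides roughly this segregation. It moves every currently tracked object into a permanent generation that subsequent collections skip, which is exactly what you want for a long-lived phase 1 object graph. A minimal sketch (the phase structure here is a stand-in, not the article's code):

```python
import gc

def build_world():
    # Stand-in for phase 1: create long-lived objects with cycles.
    nodes = [[] for _ in range(1000)]
    for node in nodes:
        node.append(node)     # each node references itself (a cycle)
    return nodes

world = build_world()

# Move everything alive right now into the permanent generation;
# collections during "phase 2" will not traverse these objects.
gc.freeze()
print(gc.get_freeze_count() > 0)   # True

# ... phase 2 runs here; gc only examines objects created after freeze()

# At the end of phase 2, make the old objects collectable again.
del world
gc.unfreeze()
gc.collect()
```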
Let me backtrack a bit. So, we're basically running out of memory sometimes, but why? The first step is to try to see if there's an obvious memory leak. Having been here before, we've instrumented debug builds of the app so that on program exit, it iterates through every Python object in existence, checking the object's reference count against what (through experience) we've come to expect. We see no regressions here in either the pathological cases (at least the ones that do eventually exit) or the non-pathological cases. This covers many of our own extension types and Python's, but not everything, and it doesn't address most pure-Python objects either. But it's a pretty good first line of defense against stupid coding errors.
It is at this point that I start diving into memory analysis tools.
I develop on the Mac primarily, and there are several very good tools on that platform, which I'm just starting to learn. Shark is excellent, and between it and Activity Monitor, I can watch pathological cases grow very large in about an hour's time. One thing I've learned about Shark though is that if the process dies with an out of memory error before you stop collecting samples, Shark will throw up an error dialog and you've just lost all your data. I've submitted an Apple bug on this issue.
Unfortunately, while Shark provided some good clues, it didn't provide any definitive answers. One reason is that our app is highly recursive, and the stack traces can get unreadable. I decided to look at a few other OSX tools to see what information they could provide. Next up: MallocDebug.app.
Unfortunately, I didn't get very far with this because of an underlying configuration problem. MallocDebug.app relies on lower-level malloc(3) debugging support, such as the MallocStackLogging environment variable. You can actually do some decent malloc debugging from the shell, but I quickly learned that doing so crashes Python in such a way as to corrupt the stack, even when running in gdb. After much pain and single-stepping, I discovered that we were linking Python's _ssl module against OpenDarwin's OpenSSL 0.9.8b instead of Apple's default openssl 0.9.7i. If you turn on malloc debugging with the OpenDarwin version of openssl, it crashes too. I've opened a bug in the OpenDarwin project on this.
Still, with a bit of diligent work (along with spending some quality time in valgrind on Linux), I was able to learn where a lot of our memory was going. There was one particular subsystem that was creating a lot of objects during phase 2; these objects contained cycles, but we know that these objects too will survive until the end of phase 2. I rewrote this subsystem to save the data in a database instead of keeping alive a big tree of Python objects, and that helped our app considerably. Instead of crashing with OOM after, say, an hour or two on a particular input, the same app could run for almost 24 hours before running out of memory. Because the rewrite also allowed us to get intermediate results, it was a huge improvement.
But the question still remains, in the really pathological cases, where are all those objects coming from and why aren't they getting destroyed? In my next post, I'll talk about some of the modifications I've made to Python's gc to provide better diagnostics, and what some of that data tells me.
It sounds like you need to think hard about alternate ways to manage the different lifetimes of your objects.
If you can find an efficient external representation for the state you need to share between phase 1 and phase 2, you may find it far easier to reclaim resources in phase 2 explicitly. For instance you could have each run of phase 2 occur in an entirely separate process or interpreter from phase 1. If you use a separate process the operating system will reclaim resources for you.
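One way to sketch this suggestion is to push each phase 2 run into a child process with `multiprocessing`, handing it the externalized phase 1 state and getting a summary back; when the child exits, the OS reclaims all of its memory regardless of any cycles. The function names and the hand-off path here are hypothetical:

```python
import multiprocessing as mp

def run_phase2(model_path, result_queue):
    # Hypothetical phase 2: load the externalized phase-1 state from
    # model_path, do the heavy cyclic-object churn here, and send
    # back only a small summary.
    result_queue.put(("ok", model_path))

def phase2_in_child(model_path):
    queue = mp.Queue()
    child = mp.Process(target=run_phase2, args=(model_path, queue))
    child.start()
    result = queue.get()
    child.join()     # child exits; the OS reclaims all its memory
    return result

if __name__ == "__main__":
    print(phase2_in_child("/tmp/model.db"))
```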
If there is even a remote chance that your application may have to "fight the gc", it's probably best to design that problem away from the start. This is what you would probably have had to do if you were coding in a low-level language like C. I think a good part of the reason most people feel more productive in Python than in a lower-level language is that Python lets you develop without worrying about this level of detail until you hit the limit of your resources. But once you do, there's simply no substitute for doing this kind of design.
In many ways your war metaphor is a good one here. Letting your application grow to the point where you hit the limit of your resources is like letting your community grow without regard to the limits posed by your neighbors. When you're at the point of war it is unlikely that you'll find the same kind of clean and simple solution to your problem that you could have had you engaged in some diplomatic resource planning early on.
> There was one particular subsystem that was creating a lot of objects during phase 2; these objects contained cycles,
I have read that CPython doesn't have true GC like Java and instead uses 'reference counting'. Are you saying that the cycles are causing things to not be GC'd? In Java, cycles are collected if they are not reachable from the root.
Like the previous poster stated, the issues you describe here are much easier to deal with from the start of the design than in debugging. Does Python have anything comparable to WeakReference in Java?
Thank you for this article. I look forward to the next installment. Python is great, and the more information there is about getting in the nitty-gritty and handling scaling issues like these the better.
> > There was one particular subsystem that was creating a lot of objects during phase 2; these objects contained cycles,
>
> I have read that CPython doesn't have true GC like Java and instead uses 'reference counting'. Are you saying that the cycles are causing things to not be GC'd? In Java, cycles are collected if they are not reachable from the root.
> Like the previous poster stated, the issues you describe here are much easier to deal with from the start of the design than in debugging. Does Python have anything comparable to WeakReference in Java?
Does it do this by default? I'm just trying to understand why the author mentioned cycles. I think this is a non-issue. I like Python and am using it more and more but I'm definitely a noob.
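To answer the question directly: yes, CPython's cyclic collector is enabled by default, and it does reclaim unreachable cycles that pure reference counting cannot. A quick demonstration:

```python
import gc

class Node:
    pass

a, b = Node(), Node()
a.partner = b
b.partner = a          # a <-> b is a reference cycle

del a, b               # refcounts stay nonzero, so reference
                       # counting alone cannot free the pair...
collected = gc.collect()
print(collected >= 2)  # ...but the cyclic gc reclaims both objects
```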
> > Like the previous poster stated, the issues you describe here are much easier to deal with from the start of the design than in debugging. Does Python have anything comparable to WeakReference in Java?

Python has weakref in its standard library, which works like Java's WeakReference: http://docs.python.org/lib/module-weakref.html
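For completeness, here is a minimal sketch of `weakref` in action. A weak reference does not keep its target alive, so a parent/child pair linked this way forms no strong cycle and reference counting alone can free it:

```python
import weakref

class Child:
    def __init__(self, parent):
        # Hold the parent weakly so parent <-> child is not a strong cycle.
        self._parent = weakref.ref(parent)

    @property
    def parent(self):
        return self._parent()   # None once the parent has been collected

class Parent:
    def __init__(self):
        self.child = Child(self)

p = Parent()
c = p.child
print(c.parent is p)     # True: the parent is still alive
del p                    # last strong reference gone; in CPython the
                         # parent is freed immediately by refcounting
print(c.parent is None)  # True
```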
I must admit that I have no solution to your problem; I'll just offer some remarks that may, or may not, help.
A few years ago I attended a talk given by a Smalltalk specialist, called "All you ever wanted to know about gc that you never dared to ask", where he explained a very similar problem he had with a giant billing application, which also had the peculiarity of running in two phases: a big creation phase followed by an intensive computation phase.
He ended up debugging the gc itself, which ran out of memory while trying to collect garbage (what an irony!). At the time, with almost no visual IDE available, it was just a matter of analysing a 20 MB core dump file (under UNIX), which took him several days (and nights too; the customer couldn't wait that long for the billing).
His conclusion went along the same lines as the first person who posted a comment: given that the gc is there to free you from memory management, when you are at the point of twiddling the gc itself so it does not get in the way, surely you are on the wrong track. Get rid of the gc in the first place and manage the memory yourself! (provided that's possible in Python, which I'm not sure of).
At least in Smalltalk, he was able to design a simple pre-allocator, which reserved a big chunk of memory that he managed himself by cutting it into small pieces for the numerous objects he had to build in the first phase. What made his solution simple was that these objects were all the same size, or even all of the same class; I don't remember exactly.
He nonetheless had to twiddle the gc a bit to deactivate it during the first phase, so he could do his allocation job without interference or performance penalty.
At the end of the conference, I had learned a great deal about garbage collection (that was the positive side), but I nonetheless left the room with a small interior smile, because any moderately experienced C++ programmer would have known beforehand that in such a case you have to design your own simplified allocator (i.e. redefine the new operator for the classes concerned), thus avoiding the bug in the first place.
I always find these real-life examples funny, because they illustrate very well that a gc is by no means the silver bullet some narrow-minded gurus advocate. It's only a tool that may help you, most of the time, to deal with one single kind of resource (i.e. memory), provided you are not too concerned with performance, but that at other times gets badly in your way at the worst possible moment: when your application (and you, by the way) is under stress.
Good luck, and I hope you'll find an acceptable solution. Otherwise, there's always the option of resorting to a programming language without gc, or one where you can cleanly switch it off and do the serious job yourself.
Why have cycles of strong references? That's what weak references are for.
I'd argue that a program becomes incorrect at the moment it has a cycle of strong references. It would be interesting to have a debug mode to detect cycles at the moment they're created. Such a mode wouldn't be efficient, but it would be useful.
If that could be made to work, the garbage collector could be eliminated, and only reference counting would be needed.
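A crude version of such a debug mode can be sketched in pure Python. The `in_cycle` helper below is hypothetical (a real "detect at creation time" hook would need interpreter support); it just does a bounded graph walk from an object through `gc.get_referents` and reports whether the object can reach itself:

```python
import gc

def in_cycle(obj, limit=10000):
    """Return True if obj is reachable from itself via strong
    references.  A hypothetical debugging helper, not a production
    tool: the walk is bounded by `limit` visited objects."""
    seen = set()
    stack = list(gc.get_referents(obj))
    while stack and len(seen) < limit:
        o = stack.pop()
        if o is obj:
            return True
        if id(o) in seen:
            continue
        seen.add(id(o))
        stack.extend(gc.get_referents(o))
    return False

acyclic = [1, 2, 3]
cyclic = []
cyclic.append(cyclic)     # the list contains itself
print(in_cycle(acyclic))  # False
print(in_cycle(cyclic))   # True
```

Calling something like this from debug-build object constructors would be slow, as the commenter says, but it shows the check is at least expressible.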
Debugging memory use in a Python program is hard, and it sometimes makes us fight the garbage collector. I enjoyed reading about Barry's embedded Python application; the details of the problems and errors he faced during development are helpful for anyone who programs in Python. I'll be waiting for the next post with the details of the modifications to Python's garbage collector for better diagnostics, as that will be useful for me.