Thursday, March 14, 2013

Autorelease pool hell

I spent a lot of time hunting down a crash in autorelease pools in our Python - Chromium Mac application, and I finally found the primary cause. I couldn't find any information at all about this kind of crash on the internet, so I decided to share the experience with you. Brace yourselves, it's going to be long.

Our app was crashing on exit somewhere in AutoreleasePoolPage::pop, attempting to access memory at 0x00000010, which is very bad. Everybody on the internet says that for Objective-C crashes you should use Instruments and Zombies, right? Well, the zombies were sleeping peacefully in their graves for me. Nothing found. So let's try the next tool.

I couldn't get Valgrind to work for this app on OS X, because I had a recent machine that supports hardware instructions for AES and Valgrind in 32-bit mode doesn't, which is basically game over. So I got an older machine without these instructions and found ... nothing. Really, no information related to this problem at all. It seemed that I would need to do some more thinking.

In the meantime, I discovered some other problems, largely unrelated to this one. Because of them, I was considering removing wxPython as the window-creating mechanism from our app and rolling my own. This did solve the problem with accessing the USB on OS X, but this crash persisted, in exactly the same form, even after getting rid of wx. It's starting to get sad, really.

I was forced to roll up my sleeves and dig once more into the disassembly. I found that the memory which AutoreleasePoolPage::pop() was accessing was somewhat wrong. But I couldn't get any more information out of that mountain of instructions, and this is where I got lucky - I found that there is actually source code for this class! Just get the right version of objc from http://opensource.apple.com! This made things a lot easier and clearer. And what's more, after a while I was able to compile this project and use the debug version instead of the system one. How awesome is that!

What I found out is that autorelease pools are nothing more than a block of memory containing pointers. This memory is managed manually by the AutoreleasePoolPage C++ object. And that is rather unfortunate, because there are almost no safeguards at all. If they were using the standard malloc or new, I could employ all kinds of diagnostics, safeguards and checks, and they might be able to slap me at the right moment. But with this manual management, we are on our own. This was probably one of the reasons why Valgrind, Guard Malloc and other tricks just didn't work.
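
To make the "block of memory containing pointers" idea concrete, here is a toy model in C++. It is not Apple's code and it skips everything the real runtime does beyond the basic shape: one flat stack of pointers, where opening a pool pushes a marker and closing it releases everything above that marker. ToyObject, ToyPoolStack and nested_usage_demo are names made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only, NOT Apple's AutoreleasePoolPage.
// Stand-in for an autoreleased object; "released" records that it was drained.
struct ToyObject {
    bool released = false;
};

class ToyPoolStack {
public:
    // Opening a pool pushes a null marker and returns its index as the token.
    std::size_t push() {
        slots_.push_back(nullptr);
        return slots_.size() - 1;
    }

    // "Autoreleasing" an object just appends its pointer to the stack.
    void add(ToyObject* obj) { slots_.push_back(obj); }

    // Closing a pool releases every object at or above its marker and drops
    // the marker itself. Nothing verifies that the token is still meaningful.
    void pop(std::size_t token) {
        while (slots_.size() > token) {
            if (slots_.back() != nullptr) slots_.back()->released = true;
            slots_.pop_back();
        }
    }

    std::size_t size() const { return slots_.size(); }

private:
    std::vector<ToyObject*> slots_;
};

// Correctly nested usage: the inner pool is closed before the outer one.
void nested_usage_demo() {
    ToyPoolStack stack;
    ToyObject a, b;
    std::size_t outer = stack.push();  // slot 0: marker of the outer pool
    stack.add(&a);                     // slot 1
    std::size_t inner = stack.push();  // slot 2: marker of the inner pool
    stack.add(&b);                     // slot 3
    stack.pop(inner);                  // drains b only
    assert(b.released && !a.released);
    stack.pop(outer);                  // drains a
    assert(a.released && stack.size() == 0);
}
```

Notice that a pool here is nothing but a position in the stack. That is why, in a model like this, a token that outlives its pool can quietly point at slots that now belong to somebody else.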

Takeaway 1: If you manage memory by yourself, you will probably need heavy checking, the kind that malloc has.

So I knew that I was somehow overwriting my own memory inside an AutoreleasePoolPage. So let's introduce some logging and custom diagnostics. I used dtrace and Instruments to get the arguments and call stack at key points in the program (well, basically at every call to AutoreleasePoolPage methods). dtrace is awesome; if you don't know it, you need to. This told me that, indeed, one autorelease pool memory block was getting overwritten by a different one. And I also knew which one! It didn't take much to realize that I was using the pools incorrectly. It seems they should always be used like a block and nested correctly:

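  NSAutoreleasePool *p1 = [[NSAutoreleasePool alloc] init];  // outer pool
  NSAutoreleasePool *p2 = [[NSAutoreleasePool alloc] init];  // inner pool
  [p2 release];  // the inner pool goes first
  [p1 release];  // then the outer one

or use this shorthand syntax: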

@autoreleasepool {
  @autoreleasepool {
    ...
  }
}

which means creation and destruction in a single function. What I did was create the pool in one method and release it later, somewhere in a run loop. This apparently breaks nesting and causes all this headache.

Takeaway 2: Always nest autorelease pools correctly.
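
Both takeaways can be combined in a small sketch: a pool stack that checks the nesting rule itself instead of trusting the caller. This is a C++ toy, not the objc runtime, and it only tracks which pools are open. CheckedPoolStack, PopResult and misnested_demo are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <vector>

// Toy model only, not Apple's implementation.
enum class PopResult { Ok, NotInnermost, Stale };

class CheckedPoolStack {
public:
    // Open a pool and return its unique serial number as the token.
    unsigned push() {
        open_.push_back(next_serial_);
        return next_serial_++;
    }

    // Close a pool, refusing anything that is not the innermost open pool.
    PopResult pop(unsigned token) {
        if (!open_.empty() && open_.back() == token) {
            open_.pop_back();
            return PopResult::Ok;
        }
        for (unsigned serial : open_) {
            if (serial == token) return PopResult::NotInnermost;
        }
        return PopResult::Stale;  // already closed, or never opened
    }

private:
    std::vector<unsigned> open_;  // serials of open pools, innermost last
    unsigned next_serial_ = 1;
};

// Releasing the outer pool first is reported instead of silently accepted.
void misnested_demo() {
    CheckedPoolStack stack;
    unsigned p1 = stack.push();  // outer pool
    unsigned p2 = stack.push();  // inner pool
    assert(stack.pop(p1) == PopResult::NotInnermost);  // wrong order: rejected
    assert(stack.pop(p2) == PopResult::Ok);            // inner first: fine
    assert(stack.pop(p1) == PopResult::Ok);            // then the outer one
    assert(stack.pop(p2) == PopResult::Stale);         // closing twice is caught
}
```

A check like this costs one comparison per pop in the common case, and it would have pointed straight at the pool that was released out of order.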

Thanks to the call stacks from Instruments, I knew which pool was destroying which other pool, and now I only need to figure out how to fix it. Phewww.

Comments:

  1. Thanks for posting this! My Preview application was having the same problem with crashing on calls to that function.