Saturday, December 22, 2012

Small Object Allocator


Prior to the latest update of Granny Smith, we ran into memory problems when loading levels. This was due to the unfortunate choice of using XML and TinyXML for level data. I'm generally very positive to both XMl as a data format (as opposed to most game developers) and TinyXML as a library. However, since we don't use instancing, a level can easily be 500 kb or more and TinyXMl will parse the entire file into a DOM structure before it can be read. This results in a myriad of small allocations ranging from a few bytes up to about 100 bytes. I'm not sure exactly what the per-allocation overhead is on iOS and Android, but I suppose it's at least 16 or 32 bytes. In addition the LUA library also makes a fair amount of small allocations.

So far I had not bothered using a custom allocator for anything since I usually try to avoid allocations alltogether. Strings, dynamic arrays, stream buffers, etc, all have a templated built-in storage, so you can reserve a specific amount when they are declared:

So instead of:

std::vector<QiVec3> tmpVerts;

I use:

QiArrayInplace<QiVec3, 128> tmpVerts;

Which means I reserve memory for 128 verts on the stack before any dynamic allocation happen. Anyway, I decided to try a small object allocator for the TinyXML and LUA allocations and my choice fell on the simplest possible solution - a fixed memory area for small allocations.

The idea is to reserve a fixed amount of memory at application startup, say 2 Mb, devide it into sections of different sizes. In Granny Smith I ended up using the following sizes: 8, 16, 32, 48, 64, 96 and 128 bytes, but they can of course be chosen to match the most common object sizes etc. The number of slots for each size is based on real statistics from the game. In my case, the 48 byte allocation was the most common, so almost half of the preallocated space is reserved for that size.

Now, for allocating out of this pool I use a packed bitfield for each allocation size to track which slots are in use. I also track the lowest free slot for each size. In the worst case scenario the whole bitfield needs to be searched. This is a linear operation, but in practice it is very cheap - one can search 32 slots per instruction (actually 64 nowadays) and it's a contiguous memory block, so very cache-friendly. I use 12.800 slots for 48 byte allocations, which translates to 400 32-bit integer comparisons all lined up in memory. This is the worst case scenario. On average I typically only need to search a few bytes. Deallocation is basically fo free. The pointer adress reveals if it's a small object allocation (if the pointer lies within the preallocated memory block) and also what size the allocation is (all allocations of the same size are grouped together). The actual deallocation is done by clearing the corresponding bit and potentially update the lowest free slot.

The most obvious win here is not speed, but low memory overhead. It uses only one bit of memory overhead per allocation (plus potential padding bytes up to the smallest matching size), so doing 1.000 allocations of 8 byte each will require exactly 8.125 bytes of memory. Performance is great, but there might be other allocators out there that are even better. Since it is all allocated out of a contiguous memory block it is cache friendly and does not cause any fragmentation.

It all sounds great, but there is of course one huge downside to it - when you run out of slots your're out. The allocator can only be made this efficient by using a single preallocated memory block, so you need to know the memory requirements at load time. However, when running out of slots one can just start to use the regular malloc instead, also for small objects. These will of course have the regular overhead, but at least it "fails" gracefully. I'm really happy with the results so far. Levels load very noticably faster and the overall memory usage went down. For games in particular, where the memory allocation pattern and total memory usage is known at compile time, I think this is a pretty good choice for general purpose small object allocator.

2 comments:

  1. You may want to try RapidXML. On top of being way faster than TinyXML it constructs the DOM in place over the block of memory that contains the original XML data. Granted you can also count me in as one of those people against parsing stuff in a shipping title :)

    ReplyDelete
  2. A haven't heard about that one, thanks! I don't really have anything against parsing in a shipping title if level load times are okay, but I always parse the nodes in order, so it feels really lame to parse the whole file and then traverse it in the exact same order instead of just streaming it..

    ReplyDelete