Skip to main content

Small Object Allocator

Prior to the latest update of Granny Smith, we ran into memory problems when loading levels. This was due to the unfortunate choice of using XML and TinyXML for level data. I'm generally very positive to both XMl as a data format (as opposed to most game developers) and TinyXML as a library. However, since we don't use instancing, a level can easily be 500 kb or more and TinyXMl will parse the entire file into a DOM structure before it can be read. This results in a myriad of small allocations ranging from a few bytes up to about 100 bytes. I'm not sure exactly what the per-allocation overhead is on iOS and Android, but I suppose it's at least 16 or 32 bytes. In addition the LUA library also makes a fair amount of small allocations.

So far I had not bothered using a custom allocator for anything since I usually try to avoid allocations alltogether. Strings, dynamic arrays, stream buffers, etc, all have a templated built-in storage, so you can reserve a specific amount when they are declared:

So instead of:

std::vector<QiVec3> tmpVerts;

I use:

QiArrayInplace<QiVec3, 128> tmpVerts;

Which means I reserve memory for 128 verts on the stack before any dynamic allocation happen. Anyway, I decided to try a small object allocator for the TinyXML and LUA allocations and my choice fell on the simplest possible solution - a fixed memory area for small allocations.

The idea is to reserve a fixed amount of memory at application startup, say 2 Mb, devide it into sections of different sizes. In Granny Smith I ended up using the following sizes: 8, 16, 32, 48, 64, 96 and 128 bytes, but they can of course be chosen to match the most common object sizes etc. The number of slots for each size is based on real statistics from the game. In my case, the 48 byte allocation was the most common, so almost half of the preallocated space is reserved for that size.

Now, for allocating out of this pool I use a packed bitfield for each allocation size to track which slots are in use. I also track the lowest free slot for each size. In the worst case scenario the whole bitfield needs to be searched. This is a linear operation, but in practice it is very cheap - one can search 32 slots per instruction (actually 64 nowadays) and it's a contiguous memory block, so very cache-friendly. I use 12.800 slots for 48 byte allocations, which translates to 400 32-bit integer comparisons all lined up in memory. This is the worst case scenario. On average I typically only need to search a few bytes. Deallocation is basically fo free. The pointer adress reveals if it's a small object allocation (if the pointer lies within the preallocated memory block) and also what size the allocation is (all allocations of the same size are grouped together). The actual deallocation is done by clearing the corresponding bit and potentially update the lowest free slot.

The most obvious win here is not speed, but low memory overhead. It uses only one bit of memory overhead per allocation (plus potential padding bytes up to the smallest matching size), so doing 1.000 allocations of 8 byte each will require exactly 8.125 bytes of memory. Performance is great, but there might be other allocators out there that are even better. Since it is all allocated out of a contiguous memory block it is cache friendly and does not cause any fragmentation.

It all sounds great, but there is of course one huge downside to it - when you run out of slots your're out. The allocator can only be made this efficient by using a single preallocated memory block, so you need to know the memory requirements at load time. However, when running out of slots one can just start to use the regular malloc instead, also for small objects. These will of course have the regular overhead, but at least it "fails" gracefully. I'm really happy with the results so far. Levels load very noticably faster and the overall memory usage went down. For games in particular, where the memory allocation pattern and total memory usage is known at compile time, I think this is a pretty good choice for general purpose small object allocator.


  1. You may want to try RapidXML. On top of being way faster than TinyXML it constructs the DOM in place over the block of memory that contains the original XML data. Granted you can also count me in as one of those people against parsing stuff in a shipping title :)

  2. A haven't heard about that one, thanks! I don't really have anything against parsing in a shipping title if level load times are okay, but I always parse the nodes in order, so it feels really lame to parse the whole file and then traverse it in the exact same order instead of just streaming it..


Post a Comment

Popular posts from this blog

Bokeh depth of field in a single pass

When I implemented bokeh depth of field I stumbled upon a neat blending trick almost by accident. In my opinion, the quality of depth of field is more related to how objects of different depths blend together, rather than the blur itself. Sure, bokeh is nicer than gaussian, but if the blending is off the whole thing falls flat. There seems to be many different approaches to this out there, most of them requiring multiple passes and sometimes separation of what's behind and in front of the focal plane. I experimented a bit and stumbled upon a nice trick, almost by accident. I'm not going to get into technical details about lenses, circle of confusion, etc. It has been described very well many times before, so I'm just going to assume you know the basics. I can try to summarize what we want to do in one sentence – render each pixel as a discs where the radius is determined by how out of focus it is, also taking depth into consideration "somehow". Taking depth i

Screen Space Path Tracing – Diffuse

The last few posts has been about my new screen space renderer. Apart from a few details I haven't really described how it works, so here we go. I split up the entire pipeline into diffuse and specular light. This post will focusing on diffuse light, which is the hard part. My method is very similar to SSAO, but instead of doing a number of samples on the hemisphere at a fixed distance, I raymarch every sample against the depth buffer. Note that the depth buffer is not a regular, single value depth buffer, but each pixel contains front and back face depth for the first and second layer of geometry, as described in this post . The increment for each step is not view dependant, but fixed in world space, otherwise shadows would move with the camera. I start with a small step and then increase the step exponentially until I reach a maximum distance, at which the ray is considered a miss. Needless to say, raymarching multiple samples for every pixel is very costly, and this is with

Undo for lazy programmers

I often see people recommend the command pattern for implementing undo/redo in, say, a level editor. While it sure works, it's a lot of code and a lot of work. Some ten years ago I came across an idea that I have used ever since, that is super easy to implement and has worked like a charm for all my projects so far. Every level editor already has the functionality to serialize the level state (and save it to disk). It also has the ability to load a previously saved state, and the idea is to simply use those to implement undo/redo. I create a stack of memory buffers and serialize the entire level into that after each action is completed. Undo is implemented by walking one step up the stack and load that state. Redo is implemented in the same way by walking a step down the stack and load. This obviously doesn't work for something like photoshop unless you have terabytes of memory laying around, but in my experience the level information is usually relatively compact and se