Tuesday, February 24, 2015

Stack allocated containers

No matter what fancy allocator you come up with, memory allocation will always be expensive. In order to reduce memory allocations I have been using stack-allocated containers for the past four-five years and I think it has worked really well, so I thought I'd write a post about them here. These methods have been used in all our games (Sprinkle, Granny Smith, Smash Hit as well as our new project).

Using containers is really convenient and often necessary. Take the array for example:

MyArray<Object> array;

Everybody's got one. The problem with these are that you need to allocate memory when putting stuff in them. In many cases you know on beforehand approximately how many objects you want to put in there, reducing it to a single allocation:

MyArray<Object> array;

But if this code is in a function that is called frequently it would be much nicer if those first fifty objects were reserved on the stack instead of the heap, like this:

MyArray<Object, 50> array;

Now, the array object would have built-in storage for fifty objects and still be able to grow beyond that using regular heap allocations. Great! Or? What if you want to pass such an array by refence to a method? Say, a string splitting method:

void splitString(const String& del, MyArray<Object, 50>& result);

The obviously problem is that the output array now has to be stack allocated for exactly fifty objects. If we try to pass another array it won't compile, because C++ requires arguments with matching template arguments:

MyArray<Object, 8> result;
String str="this is a test";
str.split(" ", result);

Compiler error!

This effectively kills the entire charm of using template arguments for the array size. So how can we design a container class that can be passed around by reference while still offering stack allocation with a template argument?

What I've done is to create a base class, without stack allocation that can be used as a regular array:

template<class T>
class MyArray
    void* mData;

And then a subclass for the stack-allocated version:

template<class T, int size>
class MyArrayStack : public MyArray
        mData = mStackStorage;
    char mStackStorage[size*sizeof(T)];

The problem now becomes resizing the array at the base class, because the base class won't know wether the data storage is heap or stack allocated, and we don't want to have virtual methods and a vtable for such a lightweight class. Therefore, the base class needs to be aware of the subclass and avoid freeing memory when it is stack-allocated.

Fortunately this is not very complicated, since the stack allocation will always be placed immediately after the array object itself in memory, so we can just look at the pointer when resizing. If the storage pointer is right after the object itself in memory we have a stack allocation:

void resize(int newSize)
    ... allocate heap memory and copy over contents from mData
    if (mData != ((char*)this)+sizeof(MyArray))

Now we can modify the string splitting method to accept the base class:

void splitString(const String& delimiter, MyArray& result);

This will enable us to pass any stack allocated size array we want:

MyArrayStack<String, 8> result;
str.splitString(" ", result);


MyArrayStack<String, 128> result;
str.splitString(" ", result);

Compiler happy!

So the next big question would be – is this safe? There is still a small (well microscopic) chance that a non-stack allocated array object lines up right at the end of the last stack frame, and at the next memory adress is our heap-allocated storage data for it. In practice though this won't happen, because in standard memory allocators there is a header before the allocated data. Even if you use your own allocator without a header, this won't happen, because the memory area it is using would still need to be allocated by the OS, which will put a header in front of it.

I'm currently using this for arrays, sets, hash tables as well as memory streams, memory buffers and a few other things. It has been incredibly useful and probably one of the most important optimizations I have ever added to my framework. This effectively avoids memory allocations altogether without the hassle of using the typical "maxSize" for method arguments. I'm currently not using it for string objects (which all have a fixed stack allocated storage that depends on the project), but I'm kind of tempted to refactor that.

Friday, February 20, 2015

Physics tutorial at GDC 2015

I will give a talk about the fracture algorithm in Smash Hit at GDC 2015. Come by room 304 on Tuesday at 1:45 PM. The session will cover the actual fracture algorithm as well as the physics engine to support it and how fracture affects other subsystems in the game. There will be be cool, shiny videos of stuff breaking :-P

Wednesday, May 28, 2014

Hail to the hall - Environmental Acoustics

One of our early goals with Smash Hit was to combine audiovisual realism with highly abstract landscapes and environments. A lot of effort was put into making realistic shadows and visuals, and our sound designer spent long hours finding the perfect glass breaking sound. However, without proper acoustics to back up the different environments, the sense of presence simply would be there.

To achieve full control over the audio processing and add environmental effects I needed to do all the mixing myself. Platform dependent solutions like OpenAL and OpenSL cannot be trusted here, because support for environmental effects is device/firmware specific and missing in most mobile implementations. Even it was available it would be virtually impossible to reliably map parameters between OpenSL and OpenAL. As in most cases with multi-platform game development, DIY is the way to go.

Showcasing a few different acoustic environments

Software mixer

Writing a software mixer is quite rewarding – a small, well defined task with a handful operations performed on a large chunk of data, thus very suitable for SIMD optimization. A software mixer is also one of those few subsystems that has, or can have, a real work analogue - the physical audio mixer. I chose a conventional, physical abstraction, so my interface classes are named Mixer, Channel, Effect, etc, but there might better ways to structure it.

The biggest hurdle when writing a software mixer turned out to be the actual mixing. Two samples playing at the same time are added together, but what happens if they both play at maximum volume? The intuitive implementation, and what also happens in the real world is clipping. This is what most real audio programs do. Clipping is a form of distortion, where minimum and maximum audio levels are simply clamped above or below the physical threshold, effectively destroying or reshaping the waveform. In an audio software, you would typically adjust the levels manually in order to avoid clipping, but in games, where audio is interactive this can be really tricky. Say for instance you have a click sound for buttons. If there are no other sounds playing you want the click played back at maximum volume, but if there is music in the background the volume level needs to be lowered. If there is an explosion nearby it needs to be adjusted even further to avoid clipping.

One way to reduce clipping is to transform the output signal in a non-linear fashion, so that it never really reaches the maximum level. This still has the problem that it will affect the result when there is only one sample playing. Hence, when the click is played back in isolation it won't be at maximum volume.

Some people suggest that the output should be averaged in the case of multiple channels. So if there are three sounds, A B C playing at once. You mix them as (A+B+C)/3. This is not a good way to do it, because the formula doesn't know anything about the content of each channel (B and C can for instance be silent, still resulting in A played back at a third of the volume).

What we need is some form of audio compression - an algorithm that compress audio dynamically, based on the current levels. Real audio compressors are pretty advanced, with a sliding window to analyze the current audio content and adjust the levels accordingly. Fortunately there is a "magic formula" that sounds good enough in most cases. I found this solution by Viktor T. Toth: mix=A+B-A*B, but when adapting it to floating point math I realized that a slight modification into: mix=A+B-abs(A)*B is more suitable to better deal with negative numbers. Each channel is added to the mix separately, one at a time, using the following pseudo code:

mix = 0
for each channel C
  mix = mix+C - abs(mix)*C

This means that if there is only one channel playing, it will pass unmodified through the mixer. The same applies if there are two channels but one is completely silent. If both channels have the maximum value (1.0), the result will be 1.0, and anything in between will be compressed dynamically. It is definitely not the best or most accurate way to do it, but considering how cheap it is, it sounds amazingly good. I use this for all mixing in Smash Hit, and there are typically 10-20 channels playing simultaneously, so it does handle complex scenarios quite well.

I use three separate mixers in Smash Hit – the HUD mixer, which is used for all button clicks and menu sounds, the gameplay mixer, which represent all 3D sounds, and the music mixer which is used for streaming music. The gameplay mixer has a series of audio effects attached to it to emulate the acoustics of different room types.


Given how useful a reverb effect is in game development, it's quite surprising to me how difficult it was to find any implementations or even an explanation online. At a first glance, the reverb effect seems much like a long series of small echoes, but when trying it, the result sounds exactly like that – a long series of small echoes, not the warm, rich acoustics of a big church. If one tries to make the echoes shorter, it turns more and more metallic, like being inside a sewage pipe.

There is a great series of blog posts about digital reverberation by Christian Floisand that contains a lot of the theory and also a practical implementation: Digital reverberation and Algorithmic Reverbs: The Moorer Design.

It uses a series of parallel comb filters passed through all-pass filters in series. The comb filters is basically a short delay line with feedback, representing reflected sounds, while the all-pass filter are used to thicken and diffuse the reflected sound by altering the phase. I don't know enough signal theory to fully understand the all-pass filter, but it works great and implementation is fairly easy.

In addition to the comb filters and all-pass filters I also added a couple of tap-delays (delay line without feedback), representing early reflections on hard surfaces, as well as low-pass filters in each comb filter allowing a great way to control the room characteristics. Christian's article suggest the use of six comb filters, but for performance reasons I cut it down to four. I'm using four tap-delays and two all-pass filter, plus a pre-delay on the entire late reflection network.

All audio is processed in stereo in Smash Hit, so the reverb needs to be processed separately on the left and right channel. I slightly randomize the loop time in the comb filters differently for the left and right channel, which gives the final mix a very nice stereo spread and a much better sense of presence.

In addition to reverb I also implemented a regular echo as well as a low-pass filter. The parameters of these three filters are used to give each room its unique acoustics.

Tuesday, May 13, 2014

Cracking destruction

Smash Hit is a game built entirely around destruction. We knew from the beginning that destruction had to be fully procedural and also 100% reliable. Prefabricated pieces broken the same way every time simply wouldn't be enough for the type of game we wanted to make. The clean art style and consistent materials made breakage easier, since everything that breaks is made out of the same, solid glass material.

Procedural breakage

Physically based breakage is hard. Really hard. I remember NovodeX had some kind of FEM approximation for breakage in the early days of PhysX, but it never worked well and didn't look very convincing. I think for most game scenarios it is also completely overkill, especially for brittle fracture, like glass. I designed the breakage in Smash Hit around one rough simplification – objects always break where they get hit. This is not true in the real world, where tension builds up in the material, and objects tend to break at their weakest spot, but hey, we're not really after accurate breakage here. What we want is breakage that feels right, and looks cool.

In Smash Hit, when a breakable object gets hit, and the exerted impulse is above a certain threshold, I carve out a small volume around the point of impact and break that up into several smaller pieces that become new dynamic objects.

It sounds simple, but looks quite convincing. However, there are a few obstacles to overcome in the implementation:

  • Carving out a piece from a generic object in a robust way
  • Slicing that volume
  • Check for topology changes in the original object. Carving out a piece may cause it to fall apart.

Plane splitting 

Carving out a volume from a generic mesh in a robust way is a really complex geometric problem. Furthermore, since every object in Smash Hit is physically simulated we want the resulting objects to be well-behaved in a physics-friendly format. From a physics point of view, I represent all objects as compounds of convex shapes, so the resulting broken object must also be a collection of convex shapes. Luckily there is one safe way to split up a convex object into two new objects that are also guaranteed to be convex – slicing it with a plane.

So to carve out a piece from the original object, I slice all the convex shapes with five bounding planes, slightly randomized around the point of impact. It will increase the total number of convex shapes used in the original object, but that is inevitable due to the fact we are making the object more concave. It takes a bit of bookkeeping to get the splitting code right, but as long as the plane splitting is robust, so will the overall breakage algorithm.

The carved out piece can be split up into smaller pieces using the same plane-splitting, so all we need is a really fast, reliable plane-splitting method for convex objects.

Robust splitting

The classic problem with plane-splitting is not handling degenerate cases. Say for instance one of the faces coincide with the splitting plane, so that only one edge crosses the plane due to floating point precision. That can easily result in degenerate geometry, non-closed polyhedra or even broken data structures and hard crashes. To avoid that, I start by determining a side for each vertex (above or below the plane) and then base all edge-splitting on that information, hence only split an edge if the corresponding two vertices are on opposite sides of the plane. The split point for each edge is capped to be a certain distance away from both vertices, so that the resulting two shapes are always closed, well-defined and all edges are above a certain length.

I tried quite a few data structures for doing the plane-splitting before finding one that works well in practice. The biggest hurdle was that I needed a way to track vertices through multiple splits in order to keep vertex normals consistent. It might not be visible at a first glance, but vertex normals are used extensively to make nice gradients in the glass rendering and create a certain softness to the material. Recomputing those normals after splitting an object would create a deviation in the material that is very visible to the player. In the end, the half-edge data structure turned out perfect for the job, offering both high performance and very small memory footprint. I even use 16 bit integers instead of pointers to keep the size down.

Tracking topology

After carving out a hole in the object, it needs to be checked for topology changes. Depending on the shape, it might have been split into two or more separate objects. Finding connected components in a series of  inter-connected nodes is classic graph theory. What makes out a connection in this case is wether two shapes are touching face to face. The most elegant way to do this would be to track split faces between shapes and remember the connections, but it takes a lot of bookkeeping. I'm using a more pragmatic approach, where the distance between shapes are measured geometrically. I have earlier on this blog written about the GJK algorithm and how immensely useful it is for all kinds of geometric operations. In this case, we only want to know if two object are within a certain distance of each other. This can be done as a simple overlap test in configuration space, which is extremely fast.

The algorithm for finding connected components goes like this:
Assign each object a unique id
For each object A
  For each object B
    If id[A] and id[B] are different and they are touching
      Replace all id[A] and id[B] with min(id[A], id[B])

After we're done all shapes with the same id represent the same rigid body. This is obviously not a very fast general algorithm due to cubic complexity, but since the number of shapes in a rigid body is usually within a few dozen it works well in practice. The nice thing about it is the complete lack of state, which simplifies further breakage.

Simulation pipeline

A final word on breakage is the importance of getting the simulation pipeline right. With an off-the-shelf physics engine, contact impulses are typically analyzed after the simulation step to determine if something breaks. This give the impression of unbreakable objects colliding and then artificially split up afterwards. The key to natural looking breakage is to limit the contact forces of a breakable object during the simulation and break them up while preserving some of the motion. There are tricks to mix in some of the pre-break velocity on broken pieces, but I went all in and actually do multiple solves during breakage, so a simulation step typically involves running the following steps:
  1. General collision detection
  2. Run solver with capped impulses on breakable objects
  3. Check if some contacts reached the cap and in that case break objects
  4. Collision detection for new objects
  5. Run the solver again on affected contacts with capped impulses
  6. Repeat from 3
  7. Integrate
This means that objects are created and broken several times during a simulation step before proceeding to integration. I have capped the number of solves per simulation step to three in order to limit performance impact.

This is the first game where I'm giving my low level physics engine a proper work-out. It has been in development on and off for almost four years so it's great to finally see it in action. Good destruction ties very deeply into the simulation pipeline, so it's not ideal to plaster it on top of an existing physics engine.

Wednesday, April 2, 2014

Smashing tech

Our latest game Smash Hit has gone far beyond all expectations, being the #1 free game in over 100 countries during launch week and approaching 35 million downloads! I will write several blog posts about the technology here, starting with a tech summary of what is being used and then go deeper into each subject in future posts.

This is by far our most physics intense game to date. It's almost like a physics playground with a game glued on top. The physics engine is tailor made for this game specifically, but builds on top of the low level physics library I was working on a few years ago. The game is actually a great show case for the low level physics library since it is very non-generic. It's a streaming, highly dynamic world where more or less everything is moving all the time and objects get inserted and removed constantly. Therefore there is no deactivation (sleeping) or islands generation. There are also two types of objects simulated differently. The full rigid body, plus debris which are more light weight.

Destruction is the core game mechanic and had to be fully procedural, very robust and with predictable performance. The engine supports compounds of convex shapes, like most physics engines. These shapes are then split with planes and glued back together when shattered. Though most objects in the game are flat, the breakage actually support full 3D objects with no limitations. The breaking mechanic is built into the core solver, so that objects can break in multiple steps during the same iteration. This is essential for good breakage of this magnitude.

Due to the highly dynamic environment where there can be hundreds of moving objects at the same time, one draw call per object was not an option. Instead, all objects are gathered into dynamic vertex buffers. So there are basically only one draw call per material. Vertex transformation is done on the CPU to offload the GPU and allow culling before vertices and triangles are even sent to the GPU. CPU transformation also opens up for a few other tricks not available with conventional rendering. The camera is facing the same direction all the time, which allows the use of billboards to approximate geometry. You can see this in a few instances for round shapes in particular throughout the game.

The static soft shadows are precomputed vertex lighting based on ambient occlusion. Lighting is quadratically interpolated in the fragment shader for a natural falloff. The dynamic soft-shadows are gaussian blobs rendered with one quad per rigid body. The size and orientation of the shadow need to be determined in run-time since an object can break arbitrarily. I'm using the inertia tensor of the rigid body to figure this out, and the shadow is then projected down on a plane using a downward raycast. This is of course an enormous simplification, but it looks great in 99% of all cases!

Music and sound
I wrote my own software mixing layer for this game, which enables custom sound effects processing for environmental acoustics. I use a reverb, echoes and low-pass filters with different settings for each environment in the game. The music is made out of about 30 different patterns, each with an intro and an outro, which are sample-correct mixed together during the transitions. The camera motion is synchronized to the music progression, so the music always switches to the next pattern exactly when entering a new room. This was pretty hard to get right, since this had to be done independent of the main time stepping in order to support slower devices. Hence, camera motion and physics simulation had to be completely decoupled in order to have both predictable simulation and music synchronization on all devices.

Scripting has been a very useful tool during the development of this game. Each obstacle in the game is built and animated using a separate lua script. Since each obstacle is procedurally generated, it allows us to make many different varations of the same obstacle. For instance configuring width, height and color, or number blades in a fan, etc. Each obstacle runs within its very own lua context, so it is a completely safe sandbox environment. I've configured lua to minimize memory consumption, and implemented an efficient custom memory allocator, so each context only requires a single memory block of about 40 kb, and there are a few dozed of them active at the same time at most. Garbage collection is amortized to only run for one context each frame, so performance impact is minimal.

The game is designed for multicore devices and uses a fork-and-merge approach for both physics and graphics. I was considering putting the rendering on a separate background thread, but this would incur an extra frame of latency, which is really bad for an action game. The audio mixing and sounds decoding is done on separate threads.

If there is any area you find particularly interesting, let me know!

Tuesday, January 7, 2014

GDC Physics Tutorial

I will give a talk on fluid simulation on GDC this year! Make sure to attend the physics tutorial.

The session will focus on the formulation of a fluid constraint. In contrast to most other particle-based fluid simulatiors, mine uses a sequential impulse solver, normally found in rigid body engines. This improves incompressibility and makes interaction with rigid bodies very stable. This is the method used in Sprinkle and all 3D fluid movies posted earlier on the blog.

The tutorial does also include interesting talks about convex hull creation, physics debugging, constraint solvers and character collision.

Tuesday, June 25, 2013

Low level audio

Audio has been a neglected area of our game engine for a long time. Except for a few tweaks, we have essentially used the exact same code for all our games. A pretty simple channel pool with pitch and volume per channel and an OpenAL back-end for iOS, Windows and OSX and an OpenSL back-end for Android.

OpenAL is quite okay for iOS and OSX, but requires a third-party installation on Windows. That's not a big issue since we're only using Windows for development, but still annoying. A bigger issue is the lack of render-to-file support in almost every Audio API I have seen, including OpenAL and OpenSL. That's a pretty given feature for a game developer, I can't believe there is so little support for it. We always render out gameplay sequences directly from the game engine to use in trailers, teasers, etc. Because we render in high resolution with motion blur there is no way to capture the audio in real-time along with the video, so audio has to be manually added later in a video editing software. Obviously a very time consuming procedure. A better way would be to render out the game audio frame by frame along with the video, but this requires software mixing and direct access to the final mix - something a hardware-centric audio API cannot offer. Does anyone know if OpenAL and OpenSL actually use hardware mixing these days or do they just mix in software anyway?

I decided to write my own software mixer with a very thin hardware abstraction layer per platform. Partly because I would get the render-to-file feature basically for free, but it also opens up the possibility to add in effects like echo and reverb.

I thought for sure that the hardest part of the mixer project would be to write the actual mixer. I couldn't be more wrong.. It turns out researching approrpriate low-level audio interfaces for each platform took way more time. There are at least four or five different APIs on Windows and just as many on OSX, yet it's suprisingly hard to just stream raw PCM data to the left and right speaker. Extremely frustrating, because this is exactly what the driver wants in the end anyway. Why not just offer a simple callback from a high-priority system thread - feedMoreDataPlease(void* buffer, int size)? I get that some people want 3D positioning, Internet streaming and what-not, but that's not a reason to hide low-level access for the rest of us that just want to submit a raw mix to the driver.

Cudos to Apple for exposing exactly this type of interface through the Audio Unit API (part of Core Audio). It can actually be configured to do what I'm asking for, and offers really good latency too - around 10 ms on both OSX and iOS.

Android doesn't seem to offer any viable options to OpenSL through the NDK. Fortunately OpenSL is also reasonably sane when it comes to streaming and offers latency in the 50 ms range (plus some hidden latency within OpenSL or the drivers, unclear exactly how much). It's not perfect, but acceptable. OpenSL is a beast to setup though. Why on earth would anyone design an API like that? And there is very little help available online except for the 600 page reference manual. I finally got it to work and I hope I never ever need to touch it again.

Windows is the only platform where I still haven't found a decent low-level audio API. There is Media Foundation and WASAPI which I haven't looked at because it's Windows 7 and Vista only. DirectSound is probably the most widely used, but it doesn't seem to offer a direct callback. Instead it relies on a user thread to feed new buffers periodically (making it virtually useless for low-latency stuff due to the horrible Windows scheduler). There is also the old waveOut interface in WinMM which at a first glance looks like a perfect match - it offers a callback when the driver needs more data, but here's the catch - you are not allowed to feed audio data from the callback itself because it may cause the process to deadlock! You are supposed to do this from a separate thread and can only commnuicate with certain "safe" system functions. I'm totally ignoring that for the time being, feeding audio data from the callback and it seems to work great in Windows 7 at least (I suppose this is because the waveOut interface is deprecated and wrapped in fifteen layers of user-mode code at this point...). The latency with the waveOut method ended up being in the 30 ms range.

It took some experimenting to get decent performance of the software mixer, but since it all happens on a separate thread I'm not too concerned these days when even mid-range mobile phones have multiple cores...