Friday, May 4, 2018

Bokeh depth of field in a single pass

When I implemented bokeh depth of field I stumbled upon a neat blending trick almost by accident. In my opinion, the quality of depth of field is more related to how objects of different depths blend together, rather than the blur itself. Sure, bokeh is nicer than gaussian, but if the blending is off the whole thing falls flat. There seems to be many different approaches to this out there, most of them requiring multiple passes and sometimes separation of what's behind and in front of the focal plane. I experimented a bit and stumbled upon a nice trick, almost by accident.

I'm not going to get into technical details about lenses, circle of confusion, etc. It has been described very well many times before, so I'm just going to assume you know the basics. I can try to summarize what we want to do in one sentence – render each pixel as a discs where the radius is determined by how out of focus it is, also taking depth into consideration "somehow".

Taking depth into consideration is the hard part. Before we even start thinking about how that can be done, let's remind ourselves that there is no correct solution to this problem. We simply do not have enough information. Real depth of field needs to know what's behind objects. In a post processing pass we don't know that, so whatever we come up with will be an approximation.

I use a technique sometimes referred to as scatter-as-gather. I like that term, because it's very descriptive. GPU's are excellent at reading data from multiple inputs, but hopelessly miserable at writing to multiple outputs. So instead of writing out a disc for each pixel, which every sane programmer would do in a classic programming model, we have to do the whole operation backwards and compute the sum of all contributing discs surrounding a pixel instead. This requires us to look as far out as there can be any contributions and the safest such distance is the maximum disc radius. Hence, every pixel becomes the worst case. Sounds expensive? It is! There are ways to optimize it, but more on that later. Here is the test scene before and after blur with a fixed disc size for all pixels:

Now to the hard part. When gathering discs, each one of them have different sizes. That in itself it not very hard, just compare the distance to the disc radius. If it's smaller it contributes, otherwise not. The hard part is how it should contribute and when. Taking a stab at a "correct" solution, I compute how much a sample from each disc contributes. A larger disc means that the pixel intensity will get spread out over a larger area, so each individual sample gives a smaller contribution. Doing the math right, summing up everything in the end should yield the same overall intensity in the image. This is what it looks like:

It's not totally wrong, but it's definitely not right. Foreground and background objects do get blurred, but foreground objects get darkened around edges. You might also notice bright streaks across the focus areas. The streaks are artefacts from working with discrete samples in something that should be continuous. It's hard to do the correct thing when the disc size approaches zero, since you have to take at least one sample. What about the dark edges? If we did this the right way, we would actually blend in each disc, allowing some of the background to shine through. Since that information is occluded this is what we get.

Instead of trying to compute the correct intensity from each disc sample, let's just sum them up and remember the total, then divide with the total afterwards. This is pretty classic conditional blur and always gives the correct intensity:

for each sample
if sample contributes then
color += sample * contribution
total += contribution
color /= total

Definitely better. Both the streaks and the dark corners are gone, but foreground objects don't blur correctly over a sharp background. This is because the contribution from in-focus objects have a stronger weight than the large discs in the foreground. There are several ways to attack this issue. One that I found to look relatively good is to compute the contribution using the distance to the sample instead of the disc size. This is far from correct, but gives a nicer fall-off:

Now to the trick. Instead of computing a contribution based on disc size or distance, I use a fixed weight for all contributing samples no matter the distance or disc size, and blend in the current average if there is no contribution:

color = centerColor
total = 1.0
for each sample
if sample contributes then
color += sample
color += color/total
total += 1.0
color /= total

Huh? What this does is basically to give each contributing sample equal weight, but when there is no contribution, the current average is allowed to grow stronger and get more impact on the result. This is what it looks like:

Not bad. The blur is continuous and nice for both foreground and background objects. Note that this only works when sampling in a spiral, starting at the center pixel being shaded, because the first sample needs to be the pixel itself. I'm not going to try and explain details on why this algorithm works, simply because I'm not sure. I found it by accident, and I have spent days trying other methods but nothing works as well as this one.

There is one small thing to add before this is usable. The background shouldn't blur over foreground object that are in focus. Note though that if the foreground is out of focus, we want some of the background to blur in with it. What I do is to clamp the background circle of confusion (disc size) between zero and the circle of confusion of the center pixel. This means that the background can contribute only up to the amount of blurryness of the pixel being shaded. The scatter-as-gather way of thinking requires a lot of strong coffee.

if sample depth > center depth then
sample size = clamp(sample size, 0, center size)

Here is what the final image looks like:

A couple of notes on my implementation. After some experimentation I changed the clamping of sample size to clamp(sample size, 0, center size*2.0). For larger max radius I increase it even more. This controls how much of the background gets blended into a blurry foreground. It is totally unphysical and is only there to approximate the occluded information behind the foreground object.

The following code is written for clarity rather than speed. My actual implementation is in two passes, calculating the circle of confusion for each pixel and stores in the alpha component, at the same time downscaling the image to quarter resolution (half width, half height). When computing the depth of field I also track the average sample size and store in the alpha channel, then use this to determine wether the original or the the downscaled blurred version should be used when compositing. More on performance optimizations in a future post.

Monday, April 23, 2018

Stratified sampling

After finishing my framework overhaul I'm now back on hybrid rendering and screen space raytracing. My first plan was to just port the old renderer to the new framework but I ended up rewriting all of it instead, finally trying out a few things that has been on my mind for a while.

I've been wanting to try stratified sampling for a long time as a way to reduce noise in the diffuse light. The idea is to sample the hemisphere within a certain set of fixed strata instead of completely random to give a more uniform distribution. The direction within each stratum is still random, so it would still cover the whole hemisphere and converge to the same result, just in a slightly more predictable way. I won't go into more detail, but full explanation is all over the Internet, for instance here.

Let's look at the difference between stratified and uniform sampling. To make a fair comparison there is no lighting in these images, just ambient occlusion and an emissive object.

They may look similar at first, but when zooming in a little one can easily see that the noise in the stratified version is quite different. Less chaotic and more predictable. The light bleeding onto the sphere here is a good example. In the stratified version, the orange pixels are more evenly distributed.

For environment lighting, I use fixed, precomputed light per stratum, sampled from a low resolution version of the environment cube map. The strata are fixed in world space and shared for all fragments. More accurately, my scene is lit with a number of uniformly distributed area lights. The reason I want these lights fixed in world space is because the position and area of each light can be adapted to the environment. An overcast sky might for instance have a uniform distribution of lights around the upper hemisphere, while a clear sky will have one very narrow and powerful stratum directed towards the sun and a number of less powerful wider ones at an even distribution. The way I represent environment lighting has therefore changed from a cubemap to an array of the following:

struct Light
    vec3 direction; // Normalized direction towards center of light
    vec3 perp0;     // Vector from center of light to one edge
    vec3 perp1;     // Vector from center of light to other edge
    vec3 rgb;       // Color and intensity

Each pixel shoots one ray towards every light in the direction of direction + perp0*a + perp1*b, where a and b are random numbers from -1 to 1. If the ray is a miss, the light contributes to that pixel's lighting. If it's a hit, I use radiance from the hit point, using a downscaled and reprojected version of the lighting from previous frame.

The key to reducing noise here is that each pixel is getting exactly the same incoming light every frame, so an unshadowed surface will always get the same color. Here is an example using 16 of these area lights.

Importance sampling is another popular method for reducing noise, but that requires a unique distribution of rays per pixel. Since my area lights are fixed in world space that isn't really an option. But, one thing I can change is the quality of each ray. Since this is raymarching, rather than raytracing, lower probability samples (those that deviate a lot from the surface normal) can be of lower quality (smaller step count) without affecting the result very much. This makes a huge difference for performance with very little visual difference. I'm now marching a ray between four and sixteen steps depending on the probability, almost cutting the diffuse lighting time in half.

While I'm at it, the stepping in my raymarching is done in world space, not in screen space like you would typically do for screen space reflections. The reason for this is that each ray is sampled very sparsely with incremental step size. I start with a small step size and increase every iteration in a geometric series. This allows for fine detail near geometry and contact shadows, while still catching larger obstacles further away at a reasonable speed. World space stepping also gives a more consistent (well, as far as screen space methods goes...) result as the camera is moving.

Since lighting in the stratified version is more evenly distributed it is also easier to denoise. I still use a combination of spatial and temporal filters, but I've abandoned the smoothing group id and now using depth and normals again. When blurring, I use a perspective correct depth comparison, meaning that when the depth of two pixels are compared, one is projected onto the plane formed by the other one and the corresponding normal. Doing that is quite expensive, but since the stratified sampling looks good already when blurred with a 4x4 kernel I found it to be worth the effort.

Reflections were rewritten to do stepping in screen space. I feel like there is an opportunity to use stratified sampling also for rough reflections, but I haven't tried it yet. As a side note, materials with higher roughness shoot rays with larger step size. There is actually a lot more to say about denoising reflections (especially rough ones) and hierarchical raymarching, but I'll stop here and might come back to that in another post. If there is an area you would like to hear more detail about, don't hesitate to contact me on twitter: @tuxedolabs.

Friday, March 23, 2018

GDC Rant

It's almost a tradition for me to get grumpy on the last day of GDC, and even though I had a great week this year there are some things that I would like to shine some light on.

A lot of people seem to think of GDC as this cuddly, educational industry event, by game developers for game developers. It might have been in the beginning, but nowadays it is not. GDC is run by UBM Tech, a global, non-transparent corporation, organizing dozens of different conferences for profit. They don't care about the games industry, they care about making money. Every year the passes get more expensive and every year something is excluded. Since last year you don't even get free coffee unless you buy one of the more expensive passes (and as a side note they probably don't even pay for the coffee – look for that "sponsored by" tag).

As a speaker you get a free pass, a shiny tag on your badge and a couple of lunch boxes. That's it. They don't pay for travel or accomodation, which is standard on many other conferences. On top of that they lock in recordings of your preentation in the vault, to which they offer access for an extra $550 unless you already purchased the most expensive pass. Preparing a high quality session takes a lot of time. I know I spent at least a full week on each of my presentations. You do this for free, because you want to share something and then UBM Tech sell it for hard cash.

More and more sessions now come with a "presented by" tag, meaning someone actually paid UBM Tech to give the talk. And even those talks you can't see unless you buy an "Expo Plus" pass, or better.

That said, I still love going to GDC, and I'll probably come back next year, but I really wish there was an industry initiative to organize this in a better way.

Tuesday, March 13, 2018

Hot reloading hardcoded parameters

Here is a trick that I cannot take any credit for, but that I finally implemented. I remember reading about it online several years ago, but I cannot find the reference again (it might have been on mollyrocket), so I'll write up the idea:

Everyone uses hardcoded parameters sometimes because it's fast and easy:

float someValue = 5.0f;

Once you have a parameter in the code, it's likely that you sooner or later want to tune that into some kind of sweet spot. With a hardcoded parameter the process often involves recompiling and restarting (unless you implemented code hot reloading, in which case it still involves recompiling) many times to try out different values. A popular approach is to add some form of config file to get rid of the recompile step. Config files can be hot reloaded to also get rid of the restart step, but config files require some extra work for each parameter. You need to name the parameter, and you need to add it to the config file.

The idea of parameter hot reloading is to use the code itself as config file. The file is already there, the context and naming is already there, the initial value is already there, and once you're done tweaking, the result is already in your code!

This can be done by wrapping the tweakable, hardcoded parameter in a macro:

float someValue = TWEAK(5.0f);

The macro expands to a function call that looks something like this, using the __FILE__ and __LINE__ built-ins:

float TweakValue(float v, const char* fileName, int lineNumber);

This function stores each tweakable value in a hash table using the file and line properties, so that it can be accessed quickly. The really sweet part is that since we know the file and line number we can periodically (say once a frame, or using some file modification event) check if the source file has changed, and when it changes, just parse out the new value. Note that this is rather trivial, since at this point we know exactly on what line to look for it and how to parse it, because it will be wrapped by a parenthesis right after the word TWEAK.

One limitation is that it only works for one tweakable parameter per line. It's probably possible to make it work for more than one, but that requires a lot more parsing. Note that this can be done for more than just floats. I've also added support for booleans, vectors and colors. The boolean especially can be useful to toggle between two implementations at run time:

if (TWEAK(true))

Needless to say, in production builds, the TWEAK macro is just bypassed, adding zero overhead. Pretty neat isn't it?

Friday, February 9, 2018

Header file dependencies

Ten years ago, I wrote my own C++ software framework and it was probably one of the best moves in my career as a software developer. It has been immensely useful for every little project I have done ever since, but adding bits and pieces and modifying it down the road has made the software quality slowly degrade. I'm half-way through a rewrite, not from scratch but a pretty serious overhaul. One thing I've spent a lot of time on is reducing header file dependencies to improve compile times. It is one of those strangely satisfying things that you can never really motivate to spend time on while in production. So far I've managed to cut the compile time in half (from 17 seconds for a full rebuild down to 9, so it really wasn't that bad before either), mostly by eliminating system headers.

A very accessible tool for this is actually GCC. Just add the -H flag and it will print out a hierarchical header dependency graph, including system headers. Using this I found out that the system header, required for standard placement new operator included over 50 system headers. I could get rid of it thanks to a tip I got on twitter, by just declaring my own placement new operator like this:

enum TMemType { MEMTYPE_DUMMY };
inline void* operator new(size_t, TMemType, void* ptr) { return ptr; }

Then I just use T_PNEW instead of new in all of my templated containers.

I've also moved over to private implementation as a standard practice. It's a bit clunky and requires an extra pointer dereference for every call, but it reduces dependencies a lot. Most of my headers now only include the "tconfig.h" with standard types, and sometimes containers. The only place I found this extra dereference to be unacceptable was the mutex, but I also didn't want to include system headers in my own headers, so the ugly but functional compromise I settled on was to reserve memory using a char mMutexStorage[T_MUTEX_SIZE] in the header and then do a placement new in the implementation file. The value of T_MUTEX_SIZE will be platform dependent, but can easily be compared to the actual mutex size at runtime to avoid memory corruption.

Finally, I must mention something that took me a long time as a developer to realize – return types can be forward declared even when returned by value. It makes sense if you know how they are passed on the stack, but I somehow always just assumed the compiler needed to know the size for it unless it was pointer or reference. It was maybe five years ago I learned this and all compilers I came across since then accept forward declaration for return types. That makes a huge difference, because you can do this:

class TString getSomeInformation();


class TVec3 getWorldPosition();

Or any type you like, without including the actual header for it. This means you can get away with pretty much zero includes as long as the implementation is private without compromising your API. I'm pretty sure this works for parameters as well, but since they are usually passed as const references anyway (mine are at least) it's not such a big deal.

The only exception I found was MSVC, that does not allow this for cast operators, but luckily it can be bypassed by forward declaring outside the class instead of inlining it in the method declaration.

Speaking of forward declarations, this is also why I don't use namespaces. I always make inline forward declarations like the ones above, never at the top. It would have been so convenient to make inline forward declarations of namespace members, like this:

class mylib::math::Vec3 getWorldPosition();

But there is no such thing and I hate those clunky forward namespace declarations at the top of the header, so I'll stick with prefixes for now.

Tuesday, January 2, 2018

Screen Space Path Tracing – Diffuse

The last few posts has been about my new screen space renderer. Apart from a few details I haven't really described how it works, so here we go. I split up the entire pipeline into diffuse and specular light. This post will focusing on diffuse light, which is the hard part.

My method is very similar to SSAO, but instead of doing a number of samples on the hemisphere at a fixed distance, I raymarch every sample against the depth buffer. Note that the depth buffer is not a regular, single value depth buffer, but each pixel contains front and back face depth for the first and second layer of geometry, as described in this post.

The increment for each step is not view dependant, but fixed in world space, otherwise shadows would move with the camera. I start with a small step and then increase the step exponentially until I reach a maximum distance, at which the ray is considered a miss. Needless to say, raymarching multiple samples for every pixel is very costly, and this is without a doubt the most time consuming part of the renderer. However, since light is usually fairly low frequency information, it can be done at lower resolution and upscaled using this technique. I was also surprised that the number of steps needed for each ray is also quite low, as long as the step size is exponential, starting with a small step and then increase the step size gradually. This will capture fine detail near creases while also preserving occlusion from bigger obstacles nearby.

In the case of a ray miss, I fetch light from a low resolution environment map and in case of a hit, I fetch light from the hit pixel reprojected from the previous frame. This creates a very crude approximation of global illumination since light is able to bounce between surfaces across multiple frames. This is also what enables me to light objects using emissive materials as shown in this video. Here is pseudo code for the light sampler:

function sampleLight(pixel, dir)
  stepSize = smallDistance
  pos = worldPos(pixel)
  for each step s
    pos += dir * stepSize
    if pos depth is occluded by first or second layer of pixel(pos) then
      return fraction of light from pixel(pos) in previous frame
    stepSize *= gamma (greater than one to increase step size)
  return fraction of light transfer from cubeMap[dir]

Like any path tracing, what comes out will contain a certain amount of noise. Even on a powerful machine and in half resolution, there isn't time for more than a handful samples per pixel, so noise reduction is a very important step. This is what it looks like without any noise reduction at sixteen samples per pixel, and each sample is marched in twelve steps. Click image to view high resolution.

After applying a temporal reprojection filter, I get rid of a lot of the noise. Note that the filter runs on the light buffer before it gets applied to the scene using object color, etc. The problem with temporal filters is of course that areas where information is missing will be very noticeable during motion. Therefore, this filter cannot be too aggressive. I keep it around 70% and I also compare depth values in order to avoid ghosting.

After the temporal filter, I also apply a spatial filter that uses object ID, or more specifically smoothing group ID, to not blur across different surfaces. Again, this filter cannot be too aggressive, or fine detail will get lost. I use a 7x7 pixel kernel in two passes.

There is still some low frequency noise present, but when it will get eaten by temporal anti-aliasing and the final image looks like this.

Tuesday, November 21, 2017

A better depth buffer for raymarching

When doing any type of raymarching over a depth buffer, it is very easy to determine if there is no occluder – the depth in the buffer is farther away than the current point on the ray. However, when the depth in the buffer is closer you might be occluded or you might not, depending on a) the thickness of the occluder and b) if there are any other occluders behind the first one and their thickness. It seems most people assume a) is either infinite or a constant value and b) is ignored alltogether.

Since my new renderer is entirely based around screen space raymarching I wanted to improve on this to make it more accurate. This has been done before, but mostly in the context of order independent transparency (I think).

Let's look at a scene where the occluders are assumed to have infinite depth (I have tweaked the lighting for more distinct shadows to get a better look at raymarching artefacts, so the lighting does not exactly match the environment in these screenshot).

At a first glance it may look okay, but at certain angles, it is very evident that something is off:

Even an object that is visibly thin will receive a shadow as if infinitely thick. The go-to trick in this situation is to hardcode a thickness and tweak until it looks acceptable:

Still artefacts, but much better. However, for most scenes it's just not possible to find one single thickness that works for everything. What we ideally want is the actual object thickness per pixel. One relatively cheap way of approximating depth is to render a depth buffer for back faces. As long as objects don't overlap, are closed and reasonably convex, the difference between front face depth and back face depth is actually a pretty accurate representation of the object thickness.

I store front face and back face depth in different channels of the same texture, so I just retrieve RG instead of R for each pixel and compare the depth to both values when raymarching, making it really cheap. This removes a lot of artefacts, but there is still room for improvement.

It is hard to visualize in a still image, but with a moving camera it becomes very clear that shadows are only visible for the first layer of objects. As soon as an object disappears behind something, its shadow is also gone. This is of course particularly evident with long shadows from, say, a sunset.

Creating another layer of depth information is called depth peeling and there are several ways to do it. I use the stencil buffer, but it can also be done by discarding fragments in a shader. I already mentioned that I store front and back face depth in two different channels of the same texture, so why not add another layer of front and back face depth and make it a full, four channel texture? All four depth values (first front, first back, second front, second back) can still be fetched as a single texture read, making it really fast.

One could imagine doing even more depth layers, but the visual improvement would be hard to notice.