

I've run into a pretty interesting/annoying problem while working on parallelization of the physics pipeline.

I started off with a pretty standard thread pool, but proper physics requires quite a bit of synchronization. Especially in the solver, you need several full barriers to coordinate the threads. This is not so bad when you run thousands of objects, because the overhead of the synchronization is dwarfed by the massive simulation, but what would be more impressive is to go from say 3 ms per frame down to 1 ms.

Ideally you want your physics engine, rendering pipeline, AI, etc, all data-parallel, and run them sequentially every frame. The more common scenario of running physics on a separate thread and rendering on a separate thread introduces two extra frames of lag, which combined with double or triple buffering on the graphics card, plus another frame delay in the display (at least some modern TV sets do this) gives you very sluggish response. A secondary benefit from data-parallelism is better cache utilization. It's more efficient to process all physics at once, and keep it in the L2, rather than running all subsystems simultaneously and having them thrash the cache for each other along the way. Anyway, that's why I want the physics pipe to be super freakin data-parallel with a good speedup even for moderate scenes.

My problem with the standard synchronization primitives is not excessive locking. All tasks are pretty independent and data is locked very sparingly. My problem is the barrier sync, and especially getting the worker threads to start processing again after the sync. I use a standard semaphore which I release when there is a new job coming in. The worker threads then start munching jobs off a queue in any order they like. When there are no more jobs, they go to sleep again on the semaphore. Now, doing a barrier sync involves putting all worker threads to sleep, since I have to wait for the last job to finish, then I immediately feed in a new job and start processing again.
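To make the setup above concrete, here is a minimal, portable sketch of that kind of job queue (all names are mine, and std::condition_variable stands in for the Windows semaphore): workers sleep when the queue is empty, and the barrier is a blocking wait for the pending-job count to hit zero.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal job queue: workers block when idle, wait() is the barrier.
class JobQueue {
public:
    explicit JobQueue(int workers) {
        for (int i = 0; i < workers; ++i)
            threads_.emplace_back([this] { workerLoop(); });
    }
    ~JobQueue() {
        { std::lock_guard<std::mutex> lk(m_); quit_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void push(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); ++pending_; }
        cv_.notify_one();
    }
    // Barrier sync: block until every queued job has completed.
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        done_.wait(lk, [this] { return pending_ == 0; });
    }
private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                // Going to sleep here is the expensive kernel round trip.
                cv_.wait(lk, [this] { return quit_ || !jobs_.empty(); });
                if (quit_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
            {
                std::lock_guard<std::mutex> lk(m_);
                if (--pending_ == 0) done_.notify_all();
            }
        }
    }
    std::vector<std::thread> threads_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_, done_;
    int pending_ = 0;
    bool quit_ = false;
};
```

Every wait() in there is a full barrier, and every barrier can put all workers to sleep in the kernel just before the next batch of jobs arrives.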

When a worker thread hits the semaphore, Windows first goes into kernel mode, then realizes that the thread should go to sleep, starts looking for other threads to run, and in most cases there won't be any, so it switches to the idle process, clocks down the core frequency, takes out a good book and starts preparing a cup of coffee. Then, one microsecond later (who knew!), the main thread releases the semaphore and the whole process is reversed. Well, at least this is what I think it does, but I'm no kernel expert...

Now, since I know exactly what needs to be computed, I can tell from the beginning when this is going to happen. Which is every barrier sync, every iteration, every frame. That seems somewhat unnecessary. So I started ripping out synchronization code from the thread pool and replacing it with interlocked functions and spinlocks, having the workers spin until there is more work, but only during the simulation step, and then putting them properly back to sleep when we're done with physics.
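The spin-wait variant might look something like this sketch (portable std::atomic in place of the Win32 interlocked functions, names made up): during the physics step each worker busy-waits on an atomic generation counter, so releasing everyone after a barrier is a single atomic increment by the main thread, with no kernel transition at all.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical spin-wait pool for the duration of the physics step.
struct SpinBarrierPool {
    std::atomic<int> generation{0};    // bumped by the main thread to release workers
    std::atomic<int> remaining{0};     // workers still running the current batch
    std::atomic<bool> quit{false};
    std::function<void(int)> batchJob; // job for the current generation
    std::vector<std::thread> threads;

    explicit SpinBarrierPool(int workers) {
        for (int w = 0; w < workers; ++w)
            threads.emplace_back([this, w] {
                int seen = 0;
                for (;;) {
                    // Spin until the main thread publishes a new generation.
                    while (generation.load(std::memory_order_acquire) == seen) {
                        if (quit.load(std::memory_order_relaxed)) return;
                        // On real hardware a pause instruction belongs here.
                    }
                    ++seen;
                    batchJob(w);
                    remaining.fetch_sub(1, std::memory_order_release);
                }
            });
    }
    // Release all workers, then spin until the batch completes (the barrier).
    void runBatch(std::function<void(int)> job) {
        batchJob = std::move(job);
        remaining.store((int)threads.size(), std::memory_order_relaxed);
        generation.fetch_add(1, std::memory_order_release);
        while (remaining.load(std::memory_order_acquire) != 0) { /* spin */ }
    }
    ~SpinBarrierPool() {
        quit.store(true);
        for (auto& t : threads) t.join();
    }
};
```

Between frames you would still park these threads on a proper semaphore; the spinning only pays off inside the tight barrier-heavy solver loop.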

This new strategy works perfectly and gives a really nice speedup on small scenes. The one annoying problem is that if there is anything, really anything, else running on the machine, the Windows scheduler gives my process a mad penalty. Even if the workers only spin for a small fraction of a frame. For some reason it chooses to swap out the entire process for several time quanta, leaving me with a 60 ms hiccup every second or so. No cookie for Windows scheduler! Two questions:

A) Why does this bother Windows? Even on very small simulations where all workers sleep most of the frame this happens. That's right, most of my threads sleep on a semaphore most of the time, but I still get a penalty.

B) How the heck does Windows even know that I'm using a spinlock? Does modern hardware have some sort of spinlock detection layer? I can't imagine the scheduler would be so severely bothered by any process that happens to run four threads simultaneously, especially not for such a short time.

Seriously, what's going on? The problem goes away if I do a Sleep(0) in the spinlock, but that kind of defeats the purpose. Even though Sleep(0) doesn't switch to the idle process, it's still a rather costly operation, and an open invitation to the OS to go do something else, which is exactly what I'm trying to prevent in the first place! Sigh.
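For reference, the compromise I'm grumbling about looks roughly like this (a hedged sketch, not what I actually ship): spin a bounded number of iterations, then yield the time slice. Sleep(0) is the Windows call; std::this_thread::yield() is the portable equivalent used here.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Bounded spin, then yield: cheap wake-up in the common case, but each yield
// is still an open invitation for the OS to schedule something else.
inline void spinWait(const std::atomic<bool>& flag, int spinLimit = 4000)
{
    int spins = 0;
    while (!flag.load(std::memory_order_acquire)) {
        if (++spins > spinLimit) {
            std::this_thread::yield();  // Sleep(0) on Windows
            spins = 0;
        }
    }
}
```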


  1. Yeah, trying to do multi threading for low latency stuff sucks on windows.

    I would guess you are triggering some sort of deadlock detection, which is effectively what happens if you get rescheduled in a spinlock.

    The alternative of having a partial spinlock/Windows sync primitive (e.g. a critical section with a spin count) is just as bad, because as soon as you exceed the spin count you cause a reschedule if there is anything else waiting (which then presumably gets a priority boost).

    I guess the ways to try and improve the situation are to avoid waiting for work by statically scheduling as much work upfront as possible, and to statically allocate threads to a core (à la the 360). Then only have one thread per core.

    Sadly it would be pretty difficult to allocate all work upfront due to not actually knowing in advance what your solver islands will be.

    Maybe some hacking with timeBeginPeriod() can improve things as well, to reduce the quantum length (I think, haven't tried this though).

