Skip to main content


I've run into a pretty interesting/annoying problem while working on parallization of the physics pipeline.

I started off with a pretty standard thread pool, but proper physics requires quite a bit of synchronization. Especially in the solver, you need several full barriers to coordinate the threads. This is not so bad when you run thousands of objects, because the overhead of the synchronization is dwarfed by the massive simulation, but what would be more impressive is to go from say 3 ms per frame down to 1 ms.

Ideally you want your physics engine, rendering pipeline, AI, etc, all data parallel and run them sequentially every frame. The more common scenario of running physics on a separate thread and rendering on a separate thread introduces two extra frames of lag, which combined with double or triple buffering on the graphics card, plus another frame delay in the display (at least some modern TV sets do this) gives you very sluggish reponse. A secondary benefit from data-parallelism is the better cache utilization. It's more efficient to process all physics at once, and keep it in the L2, rather than doing all subsystems simultaneously, and have them trash the cache for each other along the way. Anyway, that's why I want the physics pipe to be super freakin data-parallel with a good speedup even for moderate scenes.

My problem with the standard synchronization primitives is not excessive locking. All tasks are pretty independent and data is locked very sparingly. My problem is the barrier sync, and especially getting the worker threads to start processing again after the sync. I use a standard semaphore which I release when there is a new job coming in. The worker threads then start munching jobs off a queue in any order they like. When there are no more jobs, they go to sleep again on the semaphore. Now, doing a barrier sync involves putting all worker threads to sleep, since I have to wait for the last job to finish, then I immediately feed in a new job and start processing again.

When a worker thread hits the semaphore, Windows first goes into kernel mode, then realizes that the thread should go to sleep, starts looking for other threads to run, and in most cases there won't be any, so it switches to the idle process, clocks down the core frequency, takes out a good book and starts preparing a cup of coffee. Then, one microsecond later (who knew!), the main thread releases the semaphore and the whole process is reversed. Well, at least this is what I think it does, but I'm no kernel expert...

Now, since I know exacly what needs to be computed I can tell from the beginning when this is going to happen. Which is every barrier sync, every iteration, every frame. That seems somewhat unnecessary. So I started ripping out synchronization code from the thread pool and replacing it with interlocked functions and spinlocks, having the workers spin until there is more work, but only during the simulation step, and then put them properly back to sleep when we're done with physics.

This new strategy works perfectly and gives a really nice speedup on small scenes. The one annoying problem is that if there is anything, really anything else running on the machine, the Windows scheduler gives my process a mad penalty. Even if they only spin for a small fraction of a frame. For some reason it chooses to swap out the entire process for several time quanta, leaving me with a 60 ms hickup every second or so. No cookie for Windows scheduler! Two questions:

A) Why does this bother Windows? Even on very small simulations where all workers sleep most of the frame this happens. That's right, most of my threads sleep on a semaphore most of the time, but I still get a penalty.

B) How the heck does Windows even know that I'm using a spinlock? Does modern hardware have some sort of spinlock detection layer? I can't imagine the scheduler will be so severely bothered by any process that happens run four threads simultaneously, especially not for such a short time.

Seriously, what's going on? The problem goes away if I do a Sleep(0) in the spinlock, but that kind of defeats the purpose. Even though Sleep(0) doesn't switch to the idle process, it's still a rather costly operation, and an open invitation to the OS to go do something else, which is exactly what I'm trying to prevent in the first place! Sigh.


  1. Yeah, trying to do multi threading for low latency stuff sucks on windows.

    I would guess you are triggering some sort of deadlock detection, which is effectivly what happens if you get rescheduled in a spinlock.

    The alternative of having a partial spinlock/windows sync primitive(eg a critical section and spin count) is just as bad because as soon as you exceed the spin count you cause a reschedule if there is anything else waiting(which then presumably gets a priority boost).

    I guess the ways to try and improve the situation are to avoid waiting for work by statically scheduling as much work upfront as possible and to statically allocate threads to a core(ala 360). Then only have one thread per core.

    Sadly it would be pretty difficult to allocate all work upfront due to not actually knowing in advance what your solver islands will be.

    Maybe some hacking with timeBeginPeriod() can improve things as well to reduce the quantum length(I think, havnt tried this though).


Post a Comment

Popular posts from this blog

Bokeh depth of field in a single pass

When I implemented bokeh depth of field I stumbled upon a neat blending trick almost by accident. In my opinion, the quality of depth of field is more related to how objects of different depths blend together, rather than the blur itself. Sure, bokeh is nicer than gaussian, but if the blending is off the whole thing falls flat. There seems to be many different approaches to this out there, most of them requiring multiple passes and sometimes separation of what's behind and in front of the focal plane. I experimented a bit and stumbled upon a nice trick, almost by accident. I'm not going to get into technical details about lenses, circle of confusion, etc. It has been described very well many times before, so I'm just going to assume you know the basics. I can try to summarize what we want to do in one sentence – render each pixel as a discs where the radius is determined by how out of focus it is, also taking depth into consideration "somehow". Taking depth i

Screen Space Path Tracing – Diffuse

The last few posts has been about my new screen space renderer. Apart from a few details I haven't really described how it works, so here we go. I split up the entire pipeline into diffuse and specular light. This post will focusing on diffuse light, which is the hard part. My method is very similar to SSAO, but instead of doing a number of samples on the hemisphere at a fixed distance, I raymarch every sample against the depth buffer. Note that the depth buffer is not a regular, single value depth buffer, but each pixel contains front and back face depth for the first and second layer of geometry, as described in this post . The increment for each step is not view dependant, but fixed in world space, otherwise shadows would move with the camera. I start with a small step and then increase the step exponentially until I reach a maximum distance, at which the ray is considered a miss. Needless to say, raymarching multiple samples for every pixel is very costly, and this is with

Undo for lazy programmers

I often see people recommend the command pattern for implementing undo/redo in, say, a level editor. While it sure works, it's a lot of code and a lot of work. Some ten years ago I came across an idea that I have used ever since, that is super easy to implement and has worked like a charm for all my projects so far. Every level editor already has the functionality to serialize the level state (and save it to disk). It also has the ability to load a previously saved state, and the idea is to simply use those to implement undo/redo. I create a stack of memory buffers and serialize the entire level into that after each action is completed. Undo is implemented by walking one step up the stack and load that state. Redo is implemented in the same way by walking a step down the stack and load. This obviously doesn't work for something like photoshop unless you have terabytes of memory laying around, but in my experience the level information is usually relatively compact and se