Wednesday, October 13, 2010


I've run into a pretty interesting/annoying problem while working on parallization of the physics pipeline.

I started off with a pretty standard thread pool, but proper physics requires quite a bit of synchronization. Especially in the solver, you need several full barriers to coordinate the threads. This is not so bad when you run thousands of objects, because the overhead of the synchronization is dwarfed by the massive simulation, but what would be more impressive is to go from say 3 ms per frame down to 1 ms.

Ideally you want your physics engine, rendering pipeline, AI, etc, all data parallel and run them sequentially every frame. The more common scenario of running physics on a separate thread and rendering on a separate thread introduces two extra frames of lag, which combined with double or triple buffering on the graphics card, plus another frame delay in the display (at least some modern TV sets do this) gives you very sluggish reponse. A secondary benefit from data-parallelism is the better cache utilization. It's more efficient to process all physics at once, and keep it in the L2, rather than doing all subsystems simultaneously, and have them trash the cache for each other along the way. Anyway, that's why I want the physics pipe to be super freakin data-parallel with a good speedup even for moderate scenes.

My problem with the standard synchronization primitives is not excessive locking. All tasks are pretty independent and data is locked very sparingly. My problem is the barrier sync, and especially getting the worker threads to start processing again after the sync. I use a standard semaphore which I release when there is a new job coming in. The worker threads then start munching jobs off a queue in any order they like. When there are no more jobs, they go to sleep again on the semaphore. Now, doing a barrier sync involves putting all worker threads to sleep, since I have to wait for the last job to finish, then I immediately feed in a new job and start processing again.

When a worker thread hits the semaphore, Windows first goes into kernel mode, then realizes that the thread should go to sleep, starts looking for other threads to run, and in most cases there won't be any, so it switches to the idle process, clocks down the core frequency, takes out a good book and starts preparing a cup of coffee. Then, one microsecond later (who knew!), the main thread releases the semaphore and the whole process is reversed. Well, at least this is what I think it does, but I'm no kernel expert...

Now, since I know exacly what needs to be computed I can tell from the beginning when this is going to happen. Which is every barrier sync, every iteration, every frame. That seems somewhat unnecessary. So I started ripping out synchronization code from the thread pool and replacing it with interlocked functions and spinlocks, having the workers spin until there is more work, but only during the simulation step, and then put them properly back to sleep when we're done with physics.

This new strategy works perfectly and gives a really nice speedup on small scenes. The one annoying problem is that if there is anything, really anything else running on the machine, the Windows scheduler gives my process a mad penalty. Even if they only spin for a small fraction of a frame. For some reason it chooses to swap out the entire process for several time quanta, leaving me with a 60 ms hickup every second or so. No cookie for Windows scheduler! Two questions:

A) Why does this bother Windows? Even on very small simulations where all workers sleep most of the frame this happens. That's right, most of my threads sleep on a semaphore most of the time, but I still get a penalty.

B) How the heck does Windows even know that I'm using a spinlock? Does modern hardware have some sort of spinlock detection layer? I can't imagine the scheduler will be so severely bothered by any process that happens run four threads simultaneously, especially not for such a short time.

Seriously, what's going on? The problem goes away if I do a Sleep(0) in the spinlock, but that kind of defeats the purpose. Even though Sleep(0) doesn't switch to the idle process, it's still a rather costly operation, and an open invitation to the OS to go do something else, which is exactly what I'm trying to prevent in the first place! Sigh.

1 comment:

  1. Yeah, trying to do multi threading for low latency stuff sucks on windows.

    I would guess you are triggering some sort of deadlock detection, which is effectivly what happens if you get rescheduled in a spinlock.

    The alternative of having a partial spinlock/windows sync primitive(eg a critical section and spin count) is just as bad because as soon as you exceed the spin count you cause a reschedule if there is anything else waiting(which then presumably gets a priority boost).

    I guess the ways to try and improve the situation are to avoid waiting for work by statically scheduling as much work upfront as possible and to statically allocate threads to a core(ala 360). Then only have one thread per core.

    Sadly it would be pretty difficult to allocate all work upfront due to not actually knowing in advance what your solver islands will be.

    Maybe some hacking with timeBeginPeriod() can improve things as well to reduce the quantum length(I think, havnt tried this though).