Exodus is slow, and threading is part of it, but it's not the biggest concern. The main killer is actually the cycle-accurate VDP core. The amount of number crunching involved per pixel clock step is insane, and it has to hammer through over 6.5 million of them per second. The VDP uses more time in its render process than every other device in the system, and the system itself, put together. The next biggest hit behind that is the bus system. Nothing about the interconnections between devices is hardcoded in Exodus; it's all defined in XML, then mapped out through data structures. These structures are heavily optimized, but they get hit millions of times a second, so the overhead they introduce does add up.

Exodus is slow because it starts more than 12 threads, each with its own stack and so on.
Don't forget that all threads are managed by the "task manager" in the core of the operating system. Not the Task Manager you get with CTRL+ALT+DEL, which is a graphical application; there is another "task manager", a system inside the core of the operating system (the scheduler).
Also, each synchronization takes way too long because it's an API call, which includes argument checking and setting up the call stack for the interrupt handler, since the core functions require privileged mode (supervisor mode?). That's why any system API call in Windows is slow by definition. There are some other operating systems that work differently, but I'm just giving one example.
Threading is a concern, but where contention is low and lock durations are short, you can often get away with a spin lock using some interlocked test-and-set machine code operations, with a memory barrier or two where required. With that kind of model, there are no OS calls at all, no context switching, and minimal overhead. There's more that can be done in Exodus to improve the threading, in particular making use of the new C++11 language-level threading features, which can theoretically produce more optimized code than the rather heavy boost mutexes I'm using in a lot of places. The currently unreleased version 1.1 of Exodus actually folds active devices that are inherently bound in lock step into a single execution thread, so they actually do execute in a similar fashion to what you propose.
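To make the spin lock idea above concrete, here's a rough sketch using the C++11 `std::atomic_flag` type. This is not code from Exodus; the `SpinLock` class and the `countWithSpinLock` helper are made-up names for illustration only. It assumes low contention and very short critical sections, which is the only case where busy-waiting like this is a reasonable trade.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical illustration, not taken from Exodus.
class SpinLock {
public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous
        // value. If it was already set, another thread holds the lock, so
        // keep retrying. Acquire ordering keeps the protected reads and
        // writes from being reordered before the lock is taken.
        while (flag.test_and_set(std::memory_order_acquire)) {
            // Busy-wait: no OS call and no context switch.
        }
    }

    void unlock() {
        // Release ordering makes writes done inside the critical section
        // visible to the next thread that acquires the lock.
        flag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
};

// Hypothetical helper: several threads increment one shared counter,
// each holding the lock only for a single increment.
long countWithSpinLock(int threadCount, int incrementsPerThread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&lock, &counter, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter;
}
```

Calling `countWithSpinLock(4, 10000)` should return 40000, since every increment happens under the lock. If a lock is held for long or heavily contended, the waiting threads just burn CPU time, and a regular mutex is usually the better choice.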