I've done a lot of work in this area. As you may have heard, the emulator I'm developing is multithreaded, and runs each emulated CPU in its own thread, in parallel across cores. I'm really looking forward to attempting SegaCD and 32x emulation, as my timing model will provide perfect accuracy with no modifications or hard-coding, and due to some recent performance optimizations, there's a good chance it could be done in realtime on a decent quad-core. I'm sure a lot of people would consider that very slow, but perfect timing with this many devices in one system is no easy thing.
If the CPUs are tightly synced, as in the case of SegaCD and 32x, it's going to be hard and slow. Emulating SegaCD and 32x works most accurately if you run the CPUs cycle by cycle (at least, that's what my experience says so far). This means you're going to have to switch between the CPU threads very frequently, requiring lots of thread context switches (not to be confused with the emulated CPU context), which can be quite heavy (and on x86, I think it is). Then, to solve the sync issue, you're going to have to use cooperative multithreading rather than preemptive, so you can control how much each CPU runs.
I've solved that problem.
In my emulator, I have what I call a "dependency" rule between the Z80 and the M68000. The Z80 is dependent on the M68000. In my emulator, this means the Z80 core will never advance beyond the point the M68000 has reached. This isn't the same as lock-step: if the Z80 thread were switched out for a block of time, it might then execute a few hundred opcodes at once, for example. As long as the Z80 is behind the M68000, it will continue to execute without interruption. I keep the Z80 behind the M68000 so the Z80 is aware of RESET and BUSREQ events before they occur, so I don't end up with the case of the Z80 having advanced past the M68000, then finding out it should have been paused 30 cycles ago due to a BUSREQ. My emulator can handle that case, but it requires a relatively expensive rollback operation, which dependency rules are designed to predict and avoid. You can also set two cores as dependent on each other, in which case my emulator will advance both cores side by side in lock-step.
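To illustrate the idea, here's a heavily simplified sketch. This is not the actual code from my emulator; the counter names and the step function are made up for the example, and clock ratio conversion and counter overflow are ignored:

```cpp
#include <atomic>

// Hypothetical core step function: executes one opcode, returns cycles taken
unsigned int ExecuteNextZ80Opcode();

// Each core publishes how far it has advanced as a cycle count. A dependent
// core simply refuses to run past its master.
std::atomic<unsigned long long> m68000Time(0);
std::atomic<unsigned long long> z80Time(0);

void StepZ80()
{
    // Spin until the M68000 is ahead of us. No mutex, no context switch,
    // just re-reading the shared counter until the condition is met.
    while(z80Time.load() >= m68000Time.load()) {}

    // Safe to advance: any RESET or BUSREQ asserted at or before this point
    // in time has already been recorded, so no rollback can be required.
    z80Time += ExecuteNextZ80Opcode();
}
```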
Now, that's all very interesting, but where does context switching come into it? Here's an important thing to remember: you do not have to use a mutex, or require a context switch, in order to share information between threads. A full mutex is designed for locks which are expected to last for a relatively long period of time (say, a millisecond or two). When a thread hits a mutex which is already locked, it would usually call an OS-level function which forfeits the remainder of the timeslice for that thread, and instructs the OS to resume the thread some time after the lock is released. This is where you get a context switch. In the case of "high performance" thread synchronization, where a lock is extremely transient and expected to last for a very small period of time (say, a couple of nanoseconds), why would you forfeit the timeslice?
You can build low-level synchronization primitives on most platforms from one or two opcodes. x86 provides opcodes like XADD which are designed specifically for this kind of task. Win32 also provides functions like InterlockedIncrement and InterlockedDecrement which wrap these assembly primitives. Simply put, you can use these instructions to test whether it is "safe" to advance. If it is not, instead of forfeiting the remainder of the timeslice, you can just spin around in a loop, continuously testing the condition until it is met. For long delays this would be really bad, since the core your thread is executing on is going to sit there at max utilization until the lock is released, but when the overhead of a formal mutex is higher than the loop, this method can greatly improve performance. Importantly, when you hand the remainder of a timeslice back to the OS, the OS may not schedule your thread again until well after the condition has been met. If you fail a lock 1000 times a second, for example, and each failure costs an average of 0.5 milliseconds between the condition being met and the OS switching your thread back in, that's potentially an extra 0.5 seconds of OS-level overhead added to your thread, every second. When you're sitting looping on the wait condition in your own code however, your thread knows within a few nanoseconds when the condition has been met, and can resume immediately.
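As a concrete example, here's roughly what a thin lock built on those Win32 primitives looks like. Again, this is a minimal sketch, not the locking code from my emulator:

```cpp
#include <windows.h>

// Minimal spin lock built on the Win32 interlocked primitives. Acquiring the
// lock never gives up the timeslice; it just retries until the lock is free.
class SpinLock
{
public:
    SpinLock() : lockState(0) {}

    void Lock()
    {
        // Atomically set lockState to 1 only if it's currently 0. If another
        // thread holds the lock, spin until it's released. YieldProcessor()
        // emits a pause instruction on x86 to ease pressure on the core
        // while we wait, without involving the OS scheduler.
        while(InterlockedCompareExchange(&lockState, 1, 0) != 0)
        {
            YieldProcessor();
        }
    }

    void Unlock()
    {
        // An interlocked write acts as a full barrier, publishing everything
        // we did under the lock before releasing it.
        InterlockedExchange(&lockState, 0);
    }

private:
    volatile LONG lockState;
};
```

The uncontended acquire path here is only a handful of instructions, which is exactly why it beats a kernel mutex when locks are only ever held for nanoseconds.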
Back to my emulator: using thin locks, I can guarantee two cores are kept within a cycle of each other, while still potentially running them in parallel on the hardware, and potentially without generating a single context switch. In reality, the performance degradation for lock-step vs no locks at all seems to be around 40% for most applications in the tests I've done.
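For the mutual-dependency case, the same counters work in both directions, with a tolerance deciding how far either core may get ahead. Another simplified sketch, with hypothetical step functions standing in for the real cores:

```cpp
#include <atomic>

std::atomic<unsigned long long> timeA(0);
std::atomic<unsigned long long> timeB(0);

// Hypothetical step functions: execute one opcode, return cycles consumed
unsigned int ExecuteNextOpcodeA();
unsigned int ExecuteNextOpcodeB();

// Core A's loop; core B runs the mirror image with the counters swapped.
// With a tolerance of one cycle, neither core can get more than one cycle
// ahead of the other, which is the lock-step case described above. Note
// that when the counters are level, both cores are free to run in parallel.
void RunCoreA()
{
    const unsigned long long tolerance = 1;
    for(;;)
    {
        // Spin, don't yield: we expect the other core to catch up within
        // nanoseconds, so handing the timeslice back would cost far more.
        while(timeA.load() > timeB.load() + tolerance) {}
        timeA += ExecuteNextOpcodeA();
    }
}
```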
The future of emulation has to be multithreading, since that's the direction computers themselves are heading for the foreseeable future. We're hitting the limits of how fast a single core can run with current silicon technology. Single-threaded emulation is fine for older systems, as long as they can run full speed on a single 3GHz core. If they can't, well, it could be a long time before anyone can run them full speed.
I'm really looking forward to C++0x, since the revision will include native synchronization primitives and a memory model that allow high performance, thin locking mechanisms to be built without a lot of the potential dangers we currently have to worry about in C++, like compiler optimization and cache synchronization. Thin locking mechanisms are essential for the development of complex multithreaded systems like emulators, where there are a lot of interactions occurring and performance is extremely important.
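For example, the earlier spin lock rebuilt on the C++0x primitives would look something like this, with the memory ordering spelled out in the language itself rather than relying on volatile and platform intrinsics (again, just a sketch):

```cpp
#include <atomic>

// The same thin lock using std::atomic_flag. Acquire/release ordering is
// part of the language now, so the compiler can't reorder the protected
// accesses out from under us, and no platform headers are needed.
class SpinLock
{
public:
    void Lock()
    {
        // test_and_set returns the previous value; looping until it returns
        // false means we were the thread that set the flag, so we own the lock.
        while(flag.test_and_set(std::memory_order_acquire)) {}
    }

    void Unlock()
    {
        // Release ordering publishes our writes to the next thread that
        // successfully acquires the lock.
        flag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
};
```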