The "dependency" rule is a very interesting idea, and it seems you can always avoid a rollback if you check, before executing the next Z80 instruction, that it won't take the Z80 past the 68k's current point in time. About the rollback, how do you decide when to create it? I mean, a rollback operation is simply a restore point for the emu, right? So you must be able to create it such that it's least damaging when triggered. "Least damaging" means it should be as close as possible to the point where the rollback was required, so we have to re-execute as little as possible. From what I can tell, that will require restore points to be created very often, which will be expensive. And then I'm pretty sure a rollback has to be emulator-wide, meaning it restores the entire emulator to that state rather than just the CPU states, which will be even more expensive. Have you thought about that yet? I think rollbacks may be required a lot in the VDP.
What the emulator essentially does is allocate a timeslice to each core. Once each core has reached the end of that timeslice, all the cores are in sync and idle, and the emulator is able to perform operations like taking savestates, pausing emulation, and either committing or rolling back the timeslice that was just executed. Until a commit occurs, every change that has been made to any device in the system can be rolled back. For most devices, this just means taking a copy of the registers each time a commit occurs. If we trigger a rollback, just restore the state of the registers from the copy. It's a little more complex for memory buffers, but I've come up with some efficient containers which are optimised for this kind of use.
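A rough sketch of that register-snapshot idea, in hypothetical C++ (none of these names come from the actual emulator):

```cpp
// Minimal sketch of a rollback-aware register, assuming the
// commit/rollback scheme described above; all names are hypothetical.
// Each device keeps a "working" copy it mutates freely during a
// timeslice, and a "committed" copy used as the restore point.
template<typename T>
class RollbackValue {
public:
    explicit RollbackValue(T initial) : current(initial), committed(initial) {}

    // Normal access during a timeslice touches only the working copy.
    T& operator*() { return current; }

    // End of a clean timeslice: the working copy becomes the restore point.
    void Commit() { committed = current; }

    // Rollback: discard everything done since the last commit.
    void Rollback() { current = committed; }

private:
    T current;    // state modified during the current timeslice
    T committed;  // state at the last committed timeslice boundary
};
```

The memory-buffer containers mentioned above would need something smarter than a full copy, but the contract is the same: mutate freely, then commit or discard as a unit.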
The "length" of the timeslice blocks that are allocated determines how costly a rollback operation is. A large number of small timeslice blocks reduces the impact of each rollback, but makes the system run slower when no rollbacks are being generated. Currently, I have a relatively long maximum timeslice length of 20 milliseconds. The intention is to build in heuristics to calculate the "optimal" maximum timeslice length, based on the collisions currently occurring within the system. I'm also probably going to build in a way to immediately abort the current timeslice once a rollback is triggered (currently, each device must complete the timeslice before it's rolled back).
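To make the timeslice mechanics concrete, here's an illustrative loop under my own naming assumptions; the real scheduler is certainly more involved:

```cpp
#include <vector>

// Illustrative sketch only (not the emulator's actual scheduler):
// every device runs to the end of a shared timeslice, then the whole
// slice is either committed or discarded as a unit.
struct Device {
    virtual void ExecuteTimeslice(double lengthMs) = 0;
    virtual bool RollbackRequired() const = 0;
    virtual void Commit() = 0;
    virtual void Rollback() = 0;
    virtual ~Device() = default;
};

void RunTimeslice(std::vector<Device*>& devices, double lengthMs) {
    for (Device* d : devices) d->ExecuteTimeslice(lengthMs);
    // All devices are now idle at the same point in emulated time,
    // so it's safe to take savestates, pause, or roll back.
    bool rollback = false;
    for (Device* d : devices) rollback = rollback || d->RollbackRequired();
    for (Device* d : devices) {
        if (rollback) d->Rollback();
        else d->Commit();
    }
}

// Toy device for demonstration: advances a counter each timeslice.
struct Counter : Device {
    int working = 0, committed = 0;
    bool flag = false;  // set to force a rollback of the current slice
    void ExecuteTimeslice(double) override { ++working; }
    bool RollbackRequired() const override { return flag; }
    void Commit() override { committed = working; }
    void Rollback() override { working = committed; }
};
```

The cost tradeoff falls out of `lengthMs`: a longer slice means fewer commit sweeps per emulated second, but more work thrown away whenever the rollback branch is taken.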
Most of my effort so far, however, has gone into avoiding the need for rollbacks in the first place. A rollback is always going to be the most expensive operation. I've concentrated on optimising the "commit" process, which occurs far more often than a rollback, and on finding efficient ways of preventing rollbacks from occurring at all. Currently, virtually no rollbacks occur when my emulator is running the basic Mega Drive system. If horizontal interrupts are enabled mid-frame, that would usually generate a rollback. If both the M68000 and Z80 were to attempt access to a device like the VDP, PSG, or YM2612 at virtually the same time, there's a chance that could generate a rollback. I can't think of any other cases right now, and I virtually never see a rollback occur anymore in most of the games I've run.
My general philosophy is that code which is not heavily timing dependent should run as fast as possible. If code is heavily dependent on timing, it may run slowly, but it must always be accurate to the cycle.
That was the point this topic originally started on: how to get accurate 32x and SegaCD emulation running fast enough without requiring exotic quad cores? Dual core is fine since it is now so widely used. I can get 32x and SegaCD emulation running quite accurately, but what's the use if I can only get 5-15 FPS? I want a way to get the same accuracy at 60 FPS without requiring the user to upgrade their hardware. I know it's possible. There is always a way.
My thinking is that by the time you figure it out, everyone will have upgraded anyway.
The easiest way to do it is to find a way to optimise your cores. Run a profiler. Identify your bottlenecks, and target them. Look at the areas of code that get called the most. Small optimisations there will give you big performance increases. Use inline assembly if you can see a way to beat the compiler. That's how to make a single-threaded emulator faster. Apart from that, as has already been discussed, you could get a major speed boost by moving your VDP and YM2612 in particular into separate threads. How easy that is to do depends a lot on how you've designed your emulator.
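As one hedged illustration of the threading suggestion, a device core can be moved behind a queue serviced by a worker thread; everything below is a sketch with invented names, not any emulator's actual code:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical sketch of moving a device core (e.g. the VDP) into its
// own worker thread: the CPU thread queues accesses instead of
// executing them inline, and the worker drains the queue.
class ThreadedDevice {
public:
    explicit ThreadedDevice(int& processedCount)
        : processed(processedCount), worker(&ThreadedDevice::Run, this) {}

    ~ThreadedDevice() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
        worker.join();  // worker drains any remaining accesses first
    }

    // Called from the CPU thread: record the access and return at once.
    void QueueAccess(int value) {
        { std::lock_guard<std::mutex> lock(m); pending.push(value); }
        cv.notify_one();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return done || !pending.empty(); });
            while (!pending.empty()) { pending.pop(); ++processed; }
            if (done) return;
        }
    }

    std::mutex m;
    std::condition_variable cv;
    std::queue<int> pending;
    int& processed;      // externally visible count of handled accesses
    bool done = false;
    std::thread worker;  // declared last so all state exists before it starts
};
```

How well this pays off depends on how often the CPU thread needs a result back from the device; fire-and-forget writes thread cheaply, reads that must return a value do not.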
But just using all the cores isn't the same as using them efficiently. From what I understand, Snake is saying to multithread the VDP and sound chips and keep the CPUs running on a single core, which is simple and fast, and frees up the other cores for the user who is playing After Burner 32x while downloading stuff off the internet and simultaneously running a slow-ass AVG scan in the background, not to mention the countless other programs and services he will be running. Worse still, he is running... Vista.
I'm trying to build a system which can scale to large numbers of processors, so I'm going to attempt threading them regardless. Snake may be right, and threading the CPUs on the 32x or SegaCD may not be worth it. If that turns out to be true, I'll fold the problem devices into a single thread. I agree there's no point maxing out two cores instead of one when you're getting no benefit from it. For other systems, I'll still have the benefit of threaded CPU emulation, and I know that the base Mega Drive, for one, runs with it quite happily.
As for leaving cores available for other tasks, if someone wants to encode a video in the background while running my emulator, they can go right ahead. Just don't complain if it runs a little slow.

With my emulator being built as a debugger, I don't feel the same pressing need to get every game running full speed with the lowest possible system requirements. It's designed for development, testing, analysis, and reverse engineering. If people don't like the speed, there are plenty of other emulators I can point them to, like Kega or Regen.
I kind of suspect my emulator is one people will thank me for in 10 years' time, and bitch about until then. I really don't care that much about performance though. After all, emulation is supposed to be about preservation; having the fastest emulator is only meaningful for a few years. I remember when people used to complain that Gens was too slow. They'd rather use Genecyst instead. I was one of them. Mind you, the realtime palette and VRAM windows had a little something to do with that.
Your emu is great, but it's slow. Then again, I don't think it's slow because of multithreading (okay, maybe a little), but rather because of the huge number of debugging features it provides. Do you currently have a way to disable all the debugging hooks and run your emulator as fast as it can?
My emulator is still slow, but a lot has changed since the last build I released. The main performance issues in the build you have are related to the large amount of memory allocation and deallocation during execution. This problem has since been eliminated. Most games now run at full speed on my 2.2GHz Core 2 Duo laptop.
Apart from the memory allocation, my emulator is slow because every core uses very high-level code (no ASM), and even for a high-level core, I've avoided pre-calculated lookup tables and excessive optimisation, to preserve the readability and transparency of the code itself. I consider my emulator cores a form of documentation on these devices, so it's important to me that I comment my cores really well, explain why I implement particular functions in certain ways, and most of all, that other people can look at my code and understand how the device works. In other words, speed hasn't been the primary concern when writing my cores. It's been a factor, but not the main one. I'm currently in the process of rewriting my VDP core, however, which is the slowest core at this stage. I know I can significantly improve the rendering process, which I hope will provide a healthy speed boost.
Apart from that, there's also my bus implementation. One thing that's important to understand about my emulator is that it's generic. It's not a Mega Drive emulator. I have an XML file which tells my emulator how to build a Mega Drive. I could add a few lines of XML and tell it I want a second Z80, or an alternate 64KB of RAM which is only visible when the M68000 is in user mode, for example. The XML file tells my emulator which devices are included in a system, and describes the physical connections between those devices.
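Purely as an illustration of the idea (the real file format will differ, and every element and attribute name here is invented), such a system definition might look like:

```xml
<!-- Hypothetical sketch of a system definition file; the actual
     format used by the emulator is certainly different. -->
<System name="MegaDrive">
  <Device name="M68000" />
  <Device name="Z80" />
  <Device name="VDP" />
  <Device name="MainRAM" size="0x10000" />
  <!-- Describe how a device is wired onto the M68000's bus -->
  <BusMapping device="MainRAM" cpu="M68000"
              addressBase="0xFF0000" addressMask="0xFFFF" />
</System>
```

The point is that nothing about the Mega Drive is hard-coded; adding that second Z80 would just be another `<Device>` entry plus its bus mappings.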
The generic nature of this system, and in particular the extremely flexible and open-ended bus system, are both the most powerful part of my emulator, and the slowest. When the M68000 accesses RAM for example, it doesn't just get a direct line to a block of memory. Rather, it has to hand off the access request to a bus object, which then has to determine which device, if any, the access is targeted at, and how the address lines and data lines are mapped to the target device. I build physical lookup tables to accelerate the process, but it's still going to be much slower than other emulators which are hard-coded to emulate a particular system.
Basically, due to the nature of how bus access works in my emulator, it's going to be several orders of magnitude slower. Since the bus gets hit millions of times a second, it adds up to a major performance hit.
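The dispatch path being described could be sketched like this, with invented names and a deliberately simplified 64KB-page table:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the bus-object idea described above: the CPU hands every
// access to a bus, which resolves the target device and address
// mapping through a prebuilt lookup table. Everything here is an
// illustration; the real emulator's bus is far more general.
struct IDevice {
    virtual uint16_t Read(uint32_t localAddress) = 0;
    virtual ~IDevice() = default;
};

struct BusEntry {
    IDevice* target = nullptr;  // device mapped into this region, if any
    uint32_t addressMask = 0;   // how the CPU address maps to the device
};

class Bus {
public:
    Bus() : table(256) {}  // one entry per 64KB page of a 24-bit bus

    void Map(uint32_t page, IDevice* device, uint32_t mask) {
        table[page] = BusEntry{device, mask};
    }

    // Every CPU access pays for this indirection, even a plain RAM read.
    uint16_t Read(uint32_t address) {
        const BusEntry& entry = table[(address >> 16) & 0xFF];
        if (entry.target == nullptr) return 0xFFFF;  // unmapped: open bus
        return entry.target->Read(address & entry.addressMask);
    }

private:
    std::vector<BusEntry> table;  // the "physical lookup table"
};

// Toy device for demonstration: always returns a fixed value.
struct FixedDevice : IDevice {
    uint16_t value = 0;
    uint16_t Read(uint32_t) override { return value; }
};
```

Compare this with a hard-coded emulator, where the same RAM read is a single array index with no virtual call and no table lookup; that gap is where the performance goes.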
Oh, and then there's the way I represent data. My emulator uses virtually no binary masks, shifts, or manual data manipulation of any kind throughout the cores. It seemed to me that decoding, editing, and building the raw data structures used within a core was the most error-prone part of emulation, and the most difficult part of any core to decipher. I've built a generic "Data" structure which I use for registers, bus communication, pretty much everything. This structure has a large number of helper functions to assist with the gritty data manipulation, and it greatly simplifies my code. It has minimal overhead, but it's used everywhere, so that minimal overhead gradually adds up.
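A minimal sketch of what such a "Data" structure might look like (names and helpers are my own guesses, not the emulator's actual interface):

```cpp
#include <cstdint>

// Sketch of a generic "Data" wrapper of the sort described above,
// hiding raw masks and shifts behind named helper functions. The
// actual structure in the emulator is richer; these names are mine.
class Data {
public:
    explicit Data(unsigned bitCount, uint32_t value = 0)
        : bits(bitCount), data(value & Mask()) {}

    uint32_t Mask() const {
        return (bits >= 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    }

    bool GetBit(unsigned n) const { return ((data >> n) & 1u) != 0; }

    void SetBit(unsigned n, bool state) {
        if (state) data |= (1u << n);
        else data &= ~(1u << n);
    }

    // Extract the bit field [low, high] as an unsigned value.
    uint32_t GetDataSegment(unsigned low, unsigned high) const {
        return (data >> low) & ((1u << (high - low + 1u)) - 1u);
    }

    uint32_t GetData() const { return data; }

private:
    unsigned bits;  // register width in bits
    uint32_t data;
};
```

With something like this, a raw check such as `(status >> 6) & 1` reads as `status.GetBit(6)` instead, which is exactly the readability-for-overhead tradeoff being described.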
And then there are the debug hooks, as you mentioned. I've made an effort to reduce the overhead of the debug checks in the CPU cores in particular, so I think I've now minimised the impact they have on the emulator. Ultimately, I don't think removing them entirely would make all that much of a difference anymore. The biggest bottleneck is undoubtedly the bus, and I've already heavily optimised it.