The Mega Drive VDP has an H/V counter, but it has to be specifically read by the CPU.
Well, the same goes for reading the counter value from the SFC's PPU. It's just the H/V pin signals that the CPU polls every clock tick.
The only unsolicited output from the VDP to the Mega Drive is in the form of these interrupt lines.
That really seems to be the same problem I'm having, just with interrupts in place of blanking. Interrupt lines seem to be a bit more forgiving, though: interrupts tend to only trigger once per opcode, so you only have to sync up (or predict the state of the line) once.
Whereas in my case, the line counters could wrap at any cycle within an opcode (there are ~16 clock cycles per opcode on average), and being off by even a single cycle will misalign the two separate counters and permanently break IRQs.
But essentially, it's the same problem: how can your 68k core know the state of the VDP's interrupt line if it isn't caught up? The only answers seem to be lock-step (way too slow), prediction (messy, and a departure from a strict hardware model), and rewind (very complex). All the options suck.
One possibility I've considered is to provide a way for the source device (the PPU in this case) to predict its output lines ahead of time, and for the target device, if any, to "request" the state of those lines each time they're sampled, or possibly even a list of all the changes to those lines between two points in time. Consider this point: given the current state of the PPU, it's always possible to calculate what the state of the vblank and hblank pins will be at any point in the future, as long as no external accesses occur which alter the state of the PPU, such as a register write.
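As a rough sketch of what that kind of look-ahead could look like (everything here is illustrative: the timing constants are round numbers rather than exact SNES PPU timing, and the names are mine):

```cpp
#include <cstdint>

// Illustrative timing constants; the real PPU has variable-length
// scanlines, interlace, etc., but the principle is the same.
constexpr uint32_t DotsPerLine   = 341;
constexpr uint32_t LinesPerFrame = 262;
constexpr uint32_t HblankStart   = 274;  // dot at which hblank asserts
constexpr uint32_t VblankStart   = 225;  // line at which vblank asserts

struct PpuClock {
  uint32_t dot;   // 0 .. DotsPerLine-1
  uint32_t line;  // 0 .. LinesPerFrame-1

  // Pin state `cycles` dots into the future. Only valid as long as no
  // external register write alters PPU timing in the meantime.
  bool hblank_in(uint32_t cycles) const {
    return ((dot + cycles) % DotsPerLine) >= HblankStart;
  }
  bool vblank_in(uint32_t cycles) const {
    uint32_t d = dot + cycles;                          // dots past line start
    uint32_t l = (line + d / DotsPerLine) % LinesPerFrame;
    return l >= VblankStart;
  }
};
```

The target device can then sample these predictions instead of forcing the PPU to actually run forward.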
I really wanted to avoid needing a look-ahead calculation to determine the pin states. Without a rewind mechanism, my entire emulator is 100% lock-step, with no prediction and no timestamps ... it just does exactly what hardware (probably) would. And it has a single integer to represent a bidirectional cycle counter that tells which of two chips is currently "ahead" of the other.
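That single-integer relative clock might look something like this (a minimal sketch with names of my own choosing, not the emulator's actual internals):

```cpp
#include <cstdint>

// One signed counter tracks the relative clock between two chips.
// Positive: the CPU is ahead of the PPU; negative: the PPU is ahead.
// Each chip advances its side; whoever is behind gets run next.
struct RelativeClock {
  int64_t offset = 0;

  void cpu_step(int64_t cycles) { offset += cycles; }
  void ppu_step(int64_t cycles) { offset -= cycles; }

  bool cpu_ahead() const { return offset > 0; }
  bool ppu_ahead() const { return offset < 0; }
};
```

The appeal is that no absolute timestamps are needed: synchronization is just "run the other chip until the sign flips."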
But yes, it appears to be the only solution. At the very least, I could very easily predict what the state will be several hundred cycles out. I could make a function inside the PPU, something like bool will_hblank_pin_change_in_n_cycles(unsigned cycle_count) with an assert(cycle_count < 200), and only force a sync when it returns true. That would get the number of forced syncs down to 2 (bidirectional) * 262 (scanlines) * 60 (fps), instead of 2 * 10.5 million.
Best case scenario, I could even hide the CPU calling this function inside the scheduler / synchronization core. So the CPU core itself would look like it was syncing up always, as it does now with memory accesses.
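A sketch of how that PPU-side look-ahead could work (the timing constants and names are hypothetical, and a real version would compute the answer directly instead of scanning; the loop just keeps the sketch obvious):

```cpp
#include <cassert>

// The scheduler only forces a CPU<>PPU sync when the PPU reports that
// the hblank pin could change within the next opcode's worst case.
struct PpuLookahead {
  static constexpr unsigned DotsPerLine = 341;  // illustrative
  static constexpr unsigned HblankStart = 274;  // illustrative
  unsigned dot = 0;

  bool hblank_at(unsigned d) const { return (d % DotsPerLine) >= HblankStart; }

  // True if the pin may change state within the next n cycles.
  bool will_hblank_pin_change_in_n_cycles(unsigned n) const {
    assert(n < 200);  // as in the original sketch
    bool now = hblank_at(dot);
    for (unsigned i = 1; i <= n; i++)
      if (hblank_at(dot + i) != now) return true;
    return false;
  }
};
```

The CPU (or the scheduler on its behalf) calls this once per opcode with the opcode's maximum length, and only context-switches when it returns true.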
I don't know how you handle this kind of communication in your emulator, but perhaps a similar approach might work for you?
I don't emulate the only two special chips that can assert the CPU's /IRQ line. I'll be pretty much screwed there, as the look-ahead method used for the PPU's blanking lines won't work for them.
That would require running both in true lock-step, which exposes the limitations of single-core programming. I'll cover this more below in response to Steve.
And luckily, no peripheral can really do so, either. They just change an I/O bit. I special-case those controllers and test every single cycle whether the bit should change. Since I don't have to perform a context switch to do the peripheral tests, it doesn't eat up much time at all. Maybe a 3% speed hit.
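The special case amounts to something like this (a sketch under my own hypothetical names; the point is that the peripheral exposes a predicate cheap enough to poll inline, with no context switch):

```cpp
// A peripheral whose only observable output is one I/O bit that flips
// at some cycle. The CPU's per-cycle step polls io_bit() directly.
struct LightGun {
  unsigned fire_cycle = ~0u;  // cycle at which the I/O bit flips
  bool io_bit(unsigned cycle) const { return cycle >= fire_cycle; }
};

// Inside the CPU's per-cycle step, something like:
//   if (gun.io_bit(cycle) != previous_bit) { /* latch, raise flag, ... */ }
```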
Nah, I think doing 100% perfect emulation is an insane task.
I know that at least in my case, 100% perfection is impossible. I just don't think we're anywhere near as close as we can get. I'll be happy with 99.98% or better.
I echo previous statements -- I have a world of respect for the hard-core code optimizers. But since you guys already exist, I see no harm (indeed only benefits) in guys like us taking on the opposite extreme. We complement each other nicely.
/ waves at byuu - it's been a while
Indeed it has! As always, a pleasure to speak with you
How about spending 99% of their time working on various parts of the same data, and making dangerous assumptions about when various pieces of it are ready?
Point well taken. Silly of me to miss that; I see it all the time with the main<>sound processor communication. This is what breaks the sound in Earthworm Jim 2 and others -- most emulators don't synchronize the two processors tightly enough.
My VDP core is not scanline driven.
Breathtaking. Here I thought only the NES crowd had just barely managed to pull off a cycle-level video processor.
Once again, SNES emulation falls painfully far behind everyone else.
Yes, but I don't think you're really grasping how much more difficult this gets when you have two fast CPUs mapped to the exact same memory. It's pretty easy to do this with the Mega Drive; in fact, I could have shoved every element into a separate thread in KGen and it would have pretty much worked with little extra effort.

But when you have two CPUs that are supposed to run at the same speed (they actually don't, due to collisions, but...), where every emulated instruction takes a different amount of time, running these in two threads is going to get out of sync almost immediately. How do you detect a collision, and how do you fix it, in a way that isn't going to be slow? Unless you keep track of every access, I think this is much more complex than you are betting on.

What about when both CPUs try to *write* the same address? By the time you notice, it's going to be very difficult to work out which one should win, and when the other write should take place. And it's important: games will lock up if you get it wrong. (NB: *MOST* of the games do this.)
And again, you can only afford to add a few instructions before you're running slower than a single thread.
I believe we both understand these complexities. And they don't go away, regardless of whether you use a cooperative or pre-emptive model to emulate the chips. You still need exactly the same number of synchronization operations. The difference:
In the pre-emptive model, you put up a lock and wait for another core to catch up. That one core will sit there in a tight loop waiting, while the other core is actively doing things.
In the cooperative model, you switch out the context to pass control to the other chip. Your only core is bleeding time while switching contexts.
My experience shows me that the cooperative model has tremendous costs on modern pipelined processors. Suddenly and violently switching from one emulated processor to another absolutely destroys the pipeline, L1 cache, etc. It didn't hurt much back in the early 90s, but context switches are the bane of modern programming, and they only get more painful with each new generation.
Pretty sure I've bugged you (all) about my answer to speeding up the cooperative model: rather than using multiple error-prone, nested, hard-to-read-and-maintain state machines, which incur the same context switch problems in the first place ... why not instead keep a separate stack for each "process", and just add one more register to swap out: the stack pointer?
I tried this, and it works amazingly well. I got tremendous speed-ups over the old state machine model. But it's still a very painful operation. Doing absolutely nothing else but switching contexts back and forth in a tight loop, I can only manage ~10 million such switches per second on a P4 1.7GHz. On my Core 2 Duo, that number only goes up to ~20 million a second. And the switch() operation is a mere 11 x86 opcodes: { push ebx,edi,esi,ebp; mov [oldcontext],esp; mov esp,[newcontext]; pop ebp,esi,edi,ebx; ret } -- it's practically impossible to optimize that any further. The overhead is not in the instructions themselves, but hidden in the processor architecture's model of executing instructions out of order and as fast as possible.
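For what it's worth, the same stack-per-process mechanism can be approximated portably with POSIX ucontext. This is far heavier than the hand-rolled 11-opcode switch, and the names are purely illustrative, but it shows the structure: each emulated chip owns a private stack and resumes exactly where it yielded, no state machine required.

```cpp
#include <ucontext.h>
#include <vector>

static ucontext_t host, coro;      // scheduler context and chip context
static std::vector<int> trace;     // records the interleaving for the demo

static void chip_entry() {
  trace.push_back(1);              // run a slice of the emulated chip
  swapcontext(&coro, &host);       // yield back to the scheduler
  trace.push_back(3);              // resume exactly where we left off
  swapcontext(&coro, &host);
}

static void run_demo() {
  static char stack[64 * 1024];    // the chip's private stack
  getcontext(&coro);
  coro.uc_stack.ss_sp = stack;
  coro.uc_stack.ss_size = sizeof(stack);
  coro.uc_link = &host;
  makecontext(&coro, chip_entry, 0);

  swapcontext(&host, &coro);       // enter the chip
  trace.push_back(2);              // scheduler work in between
  swapcontext(&host, &coro);       // re-enter; chip resumes mid-function
}
```

Assumes a POSIX system; in production you would replace swapcontext with the minimal push/mov/pop/ret sequence above, since ucontext also saves signal masks and other state you don't need.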
Now take a look at my SNES CPU<>PPU example: it needs 21 million of these per second to remain in 100% perfect lock-step. So by context swaps alone, it's already impossible on any modern processor to get full speed this way. Obviously, that's why we absolutely need to use other tricks to ensure we can run them out of order.
And this is for the SNES! Imagine trying to do something similar on the 100MHz N64 processors, or 3GHz PS3 processors o_O
It's very likely that getting these below ~10 million syncs a second while remaining 100% accurate is not possible, no matter how many tricks you use.
You see, the single-threaded model is already exhausted. I believe the gist of what Nemesis and I are getting at, is that the multi-threaded model scales better for these synchronizations: they're less painful. I believe Nemesis could pull off more than 21 million sync operations per second by using two true pre-emptive threads.
I do agree with you that it would "waste" more overall processing power this way. But it will scale much further, while the single-threaded model will actually become less effective over time.
So yes, if you can get the same 100% accuracy in a single thread, that's obviously the ideal way to go. The million-dollar question is, "can it be done for these 16-bit systems?" My experience tells me that it cannot. My example is two simultaneous CPUs sharing all hardware and memory, just as you described above: the S-CPU and SA-1. Maybe it can for the Genesis, maybe it can't. It seems much less likely to be possible for the 32X and Sega CD.
Either way, it's great that we have people trying to reach perfection with both approaches. Regardless of who ends up being right, we all win. But we all know that already: if only the laypeople who constantly belittle us over having "slow" emulators could understand that.