SegaCD and 32x

AamirM's Regen forum

Moderator: AamirM

Eke
Very interested
Posts: 884
Joined: Wed Feb 28, 2007 2:57 pm

Post by Eke » Mon Nov 24, 2008 5:50 pm

yes, both approaches are interesting; optimization gurus have all my respect, because I feel it is still a much harder task than designing the "perfect" hardware emulator :wink:


PS: you got Virtua Racing running at 60fps on GP2X, really? what did you do with your core?

notaz
Very interested
Posts: 193
Joined: Mon Feb 04, 2008 11:58 pm
Location: Lithuania

Post by notaz » Tue Nov 25, 2008 10:52 am

Eke wrote:yes, both approaches are interesting; optimization gurus have all my respect, because I feel it is still a much harder task than designing the "perfect" hardware emulator :wink:
Nah, I think doing a 100% perfect emu is an insane task.
Eke wrote:PS: you got Virtua Racing running at 60fps on GP2X, really? what did you do with your core?
Well, it may be wrong to say 60fps, as the game itself is doing ~15. I used a dynarec for this, capable of detecting certain instruction sequences (as there is only one game) and replacing them with ARM code. I also enumerated all instruction RAM states, so I can get away without ever clearing the translation cache; the actual translation only happens at the start of levels.
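
To make the shape of that concrete, here is a hypothetical translate-once dispatch loop in C. This is not PicoDrive's actual code: the cache size, run_from(), and the translate_block() helper (which would emit the host ARM code) are all invented for the sketch.

/* Hypothetical translate-once dispatch loop. translate_block() is an
   assumed helper that emits host (e.g. ARM) code for the guest block
   starting at the given PC. */
#include <stdint.h>
#include <stddef.h>

#define IRAM_WORDS 0x400                  /* illustrative instruction RAM size */

typedef void (*block_fn)(void);           /* one translated block of host code */
static block_fn cache[IRAM_WORDS];        /* entry PC -> translated block; never cleared */

extern block_fn translate_block(uint16_t pc);

void run_from(uint16_t pc)
{
    block_fn fn = cache[pc % IRAM_WORDS];
    if (fn == NULL) {
        /* First (and only) time this entry point is seen: translate it.
           Because every possible instruction-RAM state was enumerated up
           front, a block can never become stale, so no invalidation pass
           is ever needed. */
        fn = translate_block(pc);
        cache[pc % IRAM_WORDS] = fn;
    }
    fn();                                 /* execute the recompiled block */
}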

AamirM
Very interested
Posts: 472
Joined: Mon Feb 18, 2008 8:23 am

Post by AamirM » Tue Nov 25, 2008 1:29 pm

Hi,
notaz wrote:Well, it may be wrong to say 60fps, as the game itself is doing ~15. I used a dynarec for this, capable of detecting certain instruction sequences (as there is only one game) and replacing them with ARM code. I also enumerated all instruction RAM states, so I can get away without ever clearing the translation cache; the actual translation only happens at the start of levels.
You may be better off using a static recompiler, since it's just this one game that uses the SVP.

stay safe,

AamirM

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Wed Nov 26, 2008 12:49 am

Chilly Willy wrote:I play most of my CDs without strict timing in PicoDrive. I can't think of the ONE that requires strict timing off the top of my head, but out of about a dozen, there was ONE that needed it.
... probably code based on Gens, I would imagine. In which case the timing is *way* more accurate than the 'once per frame' you quoted. The SegaCD BIOS won't even run if you do that.
Nemesis wrote:All of that is easily solvable. Remember that I've spent over 3 years writing an emulator which has to deal with these issues.
Well, I've been coding for multi-CPU systems for - what - 14 years. I may have given the impression that I'm not familiar with this stuff, but I've done a lot of it - you aren't telling me anything I don't know. There's just not much of it in Kega. (well, there is some, has been since v1...)

I still think this is not a good idea.
Nemesis wrote:The simple fact is, 99.99999% of the time in any real program, this doesn't occur.
But it does. Some of the 32X games, for example, spend a very large amount of time doing exactly that, and with no safeguards whatsoever. Yes, it's a very bad way to do things. But they do it.
Nemesis wrote:To deal with cases where the collisions between two devices are extreme, build in heuristics.
But you're missing the point that you're seriously overcomplicating things for no benefit at all. This is the first rule of multithreaded programming - "is there a simpler, safer way to do this?"
Nemesis wrote:You're only thinking about the Mega Drive
I thought that was the topic of the conversation.
Nemesis wrote:Besides, and this is another point entirely, I don't consider hand-optimized assembly cores, fast or not, to be a solution to preserve a system into the future, which, afterall, is what emulation is supposed to be about. Give me a slow core, which is high-level, flexible, and easy to understand, any day.
This is why I suggested Aamir does this in C. Also - you are assuming that a well-written ASM core cannot be flexible and easy to understand ;)

But I certainly think all the things you are proposing will make any core far more difficult to understand.
Nemesis wrote:What, exactly, would exclude a multi-threaded emulator from being accurate?
Some of the things being suggested may get the job done in the end, but certainly could not be called accurate.
Nemesis wrote:Is a single-threaded emulator an efficient use of a quad core?
If said emulator is taking less than 100% of a single core, then yes, absolutely, it's highly efficient. Taking 50% of a second core while the first core isn't even maxed out? Inefficient, and missing the point.
Nemesis wrote:Those cores are there to be used.
...but not just because they are there.
Nemesis wrote:These problems are difficult, but not unsolvable. Spend a few years thinking about the issues, with an attitude that there IS a solution, and you'll start to come up with solutions.
But you're still trying to solve an issue that doesn't need to be solved. It makes no sense when there are simpler, safer, more efficient ways to achieve the same goal. It's also a hell of a lot easier to debug and maintain.

/ waves at byuu - it's been a while :)
byuu wrote:Take two processors that only share a 4-byte communication bridge. You only have to lock when one accesses the others' memory address range. Since there's only four addresses, your lock-stepping will be minimized greatly.
Oh yeah, that's not a problem. To me, this is the difference between "worth doing" and "not worth doing".
byuu wrote:Now, when you have two processors that share a large bus, say a big 64k read/write memory chunk (read-only memory would obviously not require locks since it cannot change any state); then yes, you have real problems.
Yup. And that's the case here. Except it's just a bit more than 64K, plus a ton of hardware. And it's not always possible to even notice that there is a problem / that you're going to have to 'roll back' - other than the fact that the game will crash.
byuu wrote:but I can't really see how a game could function if the two cores spend 99% of their time talking to each other, rather than actually doing stuff.
How about spending 99% of their time working on various parts of the same data, and making dangerous assumptions about when various pieces of it are ready? Yeah, happens all the time. It isn't easy to detect, and even if it were, and even if you CAN roll back everything needed, you may have to roll back several frames. It's not pretty.
Nemesis wrote:Yep, same here. Keeping everything 100% separate and modular is the only way to do it. Imagine how easy it becomes to support all the crazy variations on 80's arcade systems for example when you can just drop in all your generic cores, and know it's all ready to go. It's good to know there's someone else out there working along a similar line.
OT, but all my cores are, absolutely, written this way.
Nemesis wrote:Nice, I wasn't aware of this prefix. That expands the list of thread-safe opcodes quite considerably.
It only works for a very limited set of instructions.
Nemesis wrote:Personally, I'm most interested in the behaviour of the InterlockedCompareExchange() function
Ah, that's just the magic of the x86 CMPXCHG instruction at work.
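
For readers following along, here is a minimal sketch of the kind of primitive being discussed: a spinlock built on InterlockedCompareExchange, which boils down to exactly that LOCK CMPXCHG. The bridge_lock name is invented; this is illustrative, not anyone's actual emulator code.

#include <windows.h>

static volatile LONG bridge_lock = 0;      /* 0 = free, 1 = held */

void bridge_acquire(void)
{
    /* atomically: if (bridge_lock == 0) bridge_lock = 1; the previous value
       is returned, so anything non-zero means another thread holds it */
    while (InterlockedCompareExchange(&bridge_lock, 1, 0) != 0)
        YieldProcessor();                  /* PAUSE hint while spinning */
}

void bridge_release(void)
{
    InterlockedExchange(&bridge_lock, 0);  /* atomic store releases the lock */
}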
Nemesis wrote:If the current single-core limitations can be solved, I'm sure individual cores will continue to get faster and faster.
Absolutely. Given there are people who've almost doubled the speed of a Core2Duo via overclocking, I don't think we're anywhere near the limit yet.
Nemesis wrote:Dual core in particular is a really important step, as it gives your system the flexibility to run one task maxing out a core, while still having the other core around to drive the OS, and juggle the other apps idling in the background.
Again, absolutely, and this is why I don't think it's optimal to just jump on another core unless you really need it. Leave it for someone else to use. For a start, if you're using DDraw, D3D, DSound, DInput, you're already running five threads anyway.
byuu wrote:Looking at platforms like the Saturn, PS3, etc ... I see a really compelling case for refining multi-core methods. It may be the only way to get accurate and playable framerates for newer generation systems.
Absolutely. But you'd look at where it makes the most sense to do it first. I think you'll find CPU emulation is not that place.

TascoDLX
Very interested
Posts: 262
Joined: Tue Feb 06, 2007 8:18 pm

Post by TascoDLX » Wed Nov 26, 2008 4:35 am

AamirM wrote:
notaz wrote:Well, it may be wrong to say 60fps, as the game itself is doing ~15. I used a dynarec for this, capable of detecting certain instruction sequences (as there is only one game) and replacing them with ARM code. I also enumerated all instruction RAM states, so I can get away without ever clearing the translation cache; the actual translation only happens at the start of levels.
You may be better off using a static recompiler, since it's just this one game that uses the SVP.
If you wanted, you could write high-level functions to replace the SVP entirely. Not to imply that this is an easy task -- static recomp would certainly be easier -- but the SVP code is very modular so I wouldn't be surprised if someone did this one day.

As notaz indicated, there are plenty of assumptions you can make about the code that can save a lot of time because "there is only one game".

notaz
Very interested
Posts: 193
Joined: Mon Feb 04, 2008 11:58 pm
Location: Lithuania

Post by notaz » Wed Nov 26, 2008 12:06 pm

AamirM wrote: You may be better off using a static recompiler, since it's just this one game that uses the SVP.
Well, it sort of is already, as it only recompiles stuff once - but it does so on demand only.
Snake wrote:
Chilly Willy wrote:I play most of my CDs without strict timing in PicoDrive. I can't think of the ONE that requires strict timing off the top of my head, but out of about a dozen, there was ONE that needed it.
... probably code based on Gens, I would imagine. In which case the timing is *way* more accurate than the 'once per frame' you quoted. The SegaCD BIOS won't even run if you do that.
Yeah, it is a rewrite of Gens code. And true, it syncs once per line, not per frame. And there is a bunch of games needing better sync than that (all Wolfteam games and several others).

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Thu Nov 27, 2008 1:25 am

notaz wrote:And true, it syncs once per line, not per frame.
Well, once a line is still a HELL of a lot better for speed and multi-core than "lockstep". Spinlocking between lines would be quite doable and wouldn't waste much time. Not like spinning between instructions. :D
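
A sketch of what once-per-line spinlocking between two CPU threads might look like, with neither thread allowed to get more than one line ahead of the other. C11 atomics are used for brevity (the real thing in 2008 would use interlocked ops), and lines_done, run_one_line() and exec_line() are invented names.

#include <stdatomic.h>

static _Atomic int lines_done[2];          /* scanlines completed by each CPU */

/* Each CPU thread calls this once per scanline. 'self' and 'other' are 0/1;
   exec_line() stands in for running one line's worth of emulated cycles. */
void run_one_line(int self, int other, void (*exec_line)(void))
{
    /* spin until the other CPU has caught up to within one line of us */
    while (atomic_load(&lines_done[other]) < atomic_load(&lines_done[self]))
        ;   /* a few wasted spins per line is cheap next to per-instruction sync */

    exec_line();
    atomic_fetch_add(&lines_done[self], 1);
}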

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Thu Nov 27, 2008 4:07 am

Snake wrote:...
Ok, let me put this another way. The number 1 goal of my emulator is accuracy. Right now, my emulator maintains cycle-level accuracy between all devices. My VDP core is not scanline driven. It is capable of responding to changes mid-line, and I've even emulated the "noise" that occurs on CRAM writes during rendering. My YM2612 and PSG are capable of responding to register changes at the exact sample they should be applied. Any processors can sit in an endless loop fighting over shared access to any memory address, device, or any other obscure dependency, and my emulator will ensure that it is always executed in the exact same way, with everything happening in the correct order, regardless of what's running in what thread or what order things get processed in.

Gens as you've pointed out is scanline driven. In regular Mega Drive mode at least, its timing is only accurate to the scanline. It wouldn't be hard to construct code which breaks the timing under this model, and most of the single-line raster bugs in various games are a symptom of a scanline-driven VDP.

How do you handle timing in Kega Fusion? I took the approach of a multi-threaded emulator because, after doing tests and calculations, I believed it was impossible to achieve cycle-accurate emulation in a single-threaded emulator which could emulate even the Mega Drive at full speed in cycle-level lock-step on the computers of the day (Athlon 64 when I started), let alone devices like the 32x or SegaCD. I moved into multi-threading as a way to unlock extra performance, given that a single thread was not able to do the job.

Every current Mega Drive emulator, including yours if I'm not mistaken, approximates the timing. I know Fusion doesn't support mid-line VDP changes for example. When drx released a whole batch of new prototypes recently, including a bunch of 32x prototypes, a significant number of them didn't work in the emulators available at the time. Apparently, there were a lot of hard-coded timing fixes for specific games to get them running. How many 32x games would break if I changed the title in the header?

My multi-threaded design may not be fast, but it is accurate. It's also much faster than I could make it with only a single thread, without compromising on accuracy. It's also got potential to scale to larger, more complex systems, while still maintaining that perfect timing model. If you or someone else is able to achieve the same level of accuracy at a faster speed on a single-threaded emulator, particularly when running on a quad-core, I'll be very impressed. Maybe it's possible, but I couldn't do it, and from what I can tell, nobody has done it yet.

Eke
Very interested
Posts: 884
Joined: Wed Feb 28, 2007 2:57 pm

Post by Eke » Thu Nov 27, 2008 9:21 am

about VDP mid-line changes, have you figured out what can be modified and what can't? the documentation says that some registers/data are latched during hblank, and currently I'm only supporting changes to registers 1 (display on/off) and 7 (background color palette entry)... I don't think VRAM/CRAM changes during active display have an effect (except for the "dot bug")


about the "debate", I think the difference lies in the use and concept of your "projects": your emulator is designed to be a 100% accurate representation of the system, while other emulators are designed to be 100% compatible with games (and most homebrew programs)... and I can assure you that you can get 100% accuracy with commercial games (at least for the Genesis) with a single-threaded approach

I imagine the multi-threaded approach becomes interesting when emulating a "pixel accurate" VDP, but even that could be done with a single-threaded approach (run the CPUs for some cycles and, when they write to the VDP, execute the appropriate number of VDP cycles, rendering pixels as blocks... or else log timestamped VDP writes and do the rendering appropriately at the end of the line)
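
A sketch of that second option in C: log VDP writes with master-clock timestamps during the line, then replay them pixel by pixel while rendering at the end of the line. All of the names here (vdp_log_write, pixel_to_cycles, vdp_apply_write, render_pixel) are invented for the sketch.

#include <stdint.h>

typedef struct {
    uint32_t cycle;                        /* master-clock time of the write */
    uint16_t data;
    uint8_t  port;                         /* data port vs. control port */
} vdp_write_t;

static vdp_write_t write_log[256];
static int write_count;

extern uint32_t pixel_to_cycles(int x);            /* assumed helpers */
extern void vdp_apply_write(const vdp_write_t *w);
extern void render_pixel(int x);

/* called from the CPU cores whenever they hit a VDP address */
void vdp_log_write(uint32_t cycle, uint8_t port, uint16_t data)
{
    if (write_count < 256)
        write_log[write_count++] = (vdp_write_t){ cycle, data, port };
}

/* called once per scanline, after the CPUs have finished it */
void vdp_render_line(uint32_t line_start)
{
    int i = 0;
    for (int x = 0; x < 320; x++) {
        uint32_t now = line_start + pixel_to_cycles(x);
        while (i < write_count && write_log[i].cycle <= now)
            vdp_apply_write(&write_log[i++]);   /* commit state change mid-line */
        render_pixel(x);
    }
    write_count = 0;
}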

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Thu Nov 27, 2008 10:26 am

Nemesis wrote:My VDP core is not scanline driven. It is capable of responding to changes mid-line
Mine does too, to a certain extent. Rather, changes to certain registers etc. happen on the correct line (this is not true of Gens), and changes to certain registers happen immediately. There are no doubt a few registers I don't have behaving exactly as the real hardware does, but they are accurate according to all documentation. But I could definitely do what you say quite easily without taking much of a speed hit (in fact, next to none at all, given that no game I know of does very much with the VDP during active scan). The main reason I don't is that most registers etc. are not supposed to have any effect. This is also the main reason why it's 'scanline driven' - it doesn't have to be, really, but I bring it up to speed at least once per line, just because it's easier to debug/follow what's going on that way.

BTW, the original KGen wasn't line-based at all, and would render the VDP up to the correct cycle as soon as the 68K attempted to modify anything. Same for the Z80.
Nemesis wrote:and I've even emulated the "noise" that occurs on CRAM writes during rendering
Yep, I've thought about that once or twice, but I never got around to doing some decent tests to make sure I'd got all situations covered.
Nemesis wrote:My YM2612 and PSG are capable of responding to register changes at the exact sample they should be applied.
Me too. My PSG also runs at the exact hardware frequency of something-stupid-that-I-don't-recall.

Nemesis wrote:Any processors can sit in an endless loop fighting over shared access to any memory address, device, or any other obscure dependency, and my emulator will ensure that it is always executed in the exact same way, with everything happening in the correct order, regardless of what's running in what thread or what order things get processed in.
Yes, but I don't think you're really grasping how much more difficult this gets when you have two fast CPUs mapped to the exact same memory. It's pretty easy to do this with the MegaDrive; in fact, I could have shoved every element into a separate thread in KGen and it would have pretty much worked with little extra effort. But when you have two CPUs that are supposed to run at the same speed (they actually don't, due to collisions, but...), where every emulated instruction will take a different amount of time, running these in two threads is going to get out of sync almost immediately. How do you detect a collision, and how do you fix it, in a way that isn't going to be slow? Unless you keep track of every access, I think this is much more complex than you are betting on. What about when both CPUs try to *write* the same address? By the time you notice, it's going to be very difficult to work out which one should win, and when the other write should take place. And it's important: games will lock up if you get it wrong. (NB: *MOST* of the games do this.)

And again, you can only afford to add a few instructions before you're running slower than a single thread.
Nemesis wrote:I believed it was impossible to achieve cycle-accurate emulation in a single-threaded emulator which could emulate even the Mega Drive at full speed in cycle-level lock-step
...but you're not doing that now, either, right? The same things you are doing can be done on a single core.
Nemesis wrote:Every current Mega Drive emulator, including yours if I'm not mistaken, approximates the timing.
No, it's very accurate. Although the current release version does have some test code that I left in which breaks it a little (which is why things that used to work got broken at some point). This was all fixed a long time ago and I really do need to get a new build out.
Nemesis wrote:I know Fusion doesn't support mid-line VDP changes for example.
Covered above.
Nemesis wrote:When drx released a whole batch of new prototypes recently, including a bunch of 32x prototypes, a significant number of them didn't work in the emulators available at the time. Apparently, there were a lot of hard-coded timing fixes for specific games to get them running.
Actually, it was only the Chaotix protos that had a problem. There are no hard-coded timing 'fixes'; rather, in order to attempt to speed things up (I was still supporting people with 500MHz CPUs at the time), I lowered the timing requirements of the handful of games that didn't need quite as heavy lockstepping. The 'fixed' version (which I provided) should work just fine with any ROM you throw at it, and I now have an option to just ignore this table anyway. There are no per-game 'fixes' of any kind, but there are certain games that need to be detected in order to enable something they use (such as EEPROMs).
Nemesis wrote:If you or someone else is able to achieve the same level of accuracy at a faster speed on a single-threaded emulator, particularly when running on a quad-core, I'll be very impressed.
Well, it's pretty much there already. As far as the MegaDrive goes, everything is already cycle accurate, with the exception of the VDP stuff mentioned above. Given that it already runs at about 20 times real speed on my 1.7GHz system, I don't see any problem.
Eke wrote:run the CPUs for some cycles and when they writes VDP, execute the appropriate number VDP cycles, render pixel as blocks
Absolutely. In a multi-threaded design, you'd have to lock one of the threads until the VDP caught up, anyway. Don't want to be writing over anything. This is basically exactly how KGen worked.

Near
Very interested
Posts: 109
Joined: Thu Feb 28, 2008 4:45 pm

Post by Near » Thu Nov 27, 2008 2:03 pm

The Mega Drive VDP has an H/V counter, but it has to be specifically read by the CPU.
Well, the same for reading the counter value from the SFC's PPU. It's just the H/V pin signals that the CPU polls every clock tick.
the only unsolicited output from the VDP to the Mega Drive is in the form of these interrupt lines
That really seems to be the same problem I'm having, just with interrupts in place of blanking. Interrupt lines seem to be a bit more forgiving, though: interrupts tend to only trigger once per opcode, so you only have to sync up (or predict the state of the line) once.

Whereas in my case, the line counters could wrap at any cycle within an opcode (there are ~16 clock cycles per opcode on average). And being off by even a single cycle will misalign the two separate counters and permanently break IRQs.

But essentially, it's the same problem: how can your 68k core know the state of the VDP's interrupt line if it isn't caught up? The only answers seem to be lock-step (way too slow), prediction (messy, a departure from a strict hardware model) and rewind (very complex). All the options suck :(
One possibility I've considered is that I'd provide a way for the source device, the PPU in this case, to predict its output lines ahead of time, and for the target device, if any, to "request" the state of those lines each time they're sampled, or possibly even a list of all the changes to those lines between two points in time. Consider this point: Given the current state of the PPU, it's always possible for the PPU to calculate what the state of the vblank and hblank pins will be at any point into the future, as long as no external accesses occur which alter the state of the PPU, such as modifying a register.
I really wanted to avoid needing a look-ahead calculation to determine the pin states. Without a rewind mechanism, my entire emulator is 100% lock-step, with no prediction, no timestamps ... it just does exactly what hardware (probably) would. And it has a single integer to represent a bi-directional cycle counter to tell which of two chips is currently "ahead" of the other.

But yes, it appears to be the only solution. At the very least, I could very easily predict what the state will be in several hundred cycles. I could make a function inside the PPU, something like bool will_hblank_pin_(possibly_)change_in_n_cycles(unsigned cycle_count) : assert(cycle_count < 200); and only force a sync when that returns true. It'd get the number of forced syncs down to 2 (bidirectional) * 262 (scanlines) * 60 (fps) = ~31,440 per second, instead of 2 * 10.5 million.

Best case scenario, I could even hide the CPU calling this function inside the scheduler / synchronization core. So the CPU core itself would look like it was syncing up always, as it does now with memory accesses.
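
Something like the following, where only the predictor's name comes from the description above; synchronize_ppu() and sample_hblank_pin() are invented stand-ins for whatever the scheduler's entry points would be.

#include <stdbool.h>
#include <assert.h>

/* lives inside the PPU core; conservative, so it may say "true" early but
   must never say "false" when the pin could actually change */
extern bool will_hblank_pin_change_in_n_cycles(unsigned cycle_count);
extern void synchronize_ppu(void);         /* force-sync: catch the PPU up */

/* the CPU (or the scheduler, hiding it from the CPU core) calls this
   before each sampling of the hblank pin */
void sample_hblank_pin(unsigned cycles_ahead)
{
    assert(cycles_ahead < 200);
    if (will_hblank_pin_change_in_n_cycles(cycles_ahead))
        synchronize_ppu();                 /* rare: ~2*262*60 = 31,440 times/s */
    /* otherwise the cached pin state is guaranteed still valid -- no sync,
       which is the whole win over ~21M lock-step switches per second */
}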
I don't know how you handle this kind of communication in your emulator, but perhaps, a similar approach might work for you?
I don't emulate the only two special chips that can assert the CPU's /IRQ line. I'll be pretty much screwed there, as the look-ahead method for the PPU's blanking lines will not work there.

That would require running both in true lock-step, which exposes the limitations of single-core programming. I'll cover this more below in response to Steve.

And luckily, no peripheral can really do so, either. They just change an I/O bit. I special-case those controllers and test every single cycle whether the bit should change. Since I don't have to perform a context switch to do the peripheral tests, it doesn't eat up much time at all. Maybe a 3% speed hit.
notaz wrote:Nah, I think doing a 100% perfect emu is an insane task.
I know that at least in my case, 100% perfection is impossible. I just don't think we're anywhere near as close as we can get. I'll be happy with 99.98% or better.

I echo previous statements -- I have a world of respect for the hard-core code optimizers. But since you guys already exist, I see no harm (indeed only benefits) in guys like us taking on the opposite extreme. We complement each other nicely.
Snake wrote:/ waves at byuu - it's been a while
Indeed it has! As always, a pleasure to speak with you :D
Snake wrote:How about spending 99% of their time working on various parts of the same data, and making dangerous assumptions about when various pieces of it are ready?
Point well taken. Silly of me to miss that - I see it all the time with the main<>sound processor communication. This is what breaks the sound in Earthworm Jim 2 and others -- most emulators do not synchronize the two processors tightly enough.
Nemesis wrote:My VDP core is not scanline driven.
Breathtaking. Here I thought only the NES crowd had just barely managed to pull off a cycle-level video processor.

Once again, SNES emulation falls painfully far behind everyone else :(
Snake wrote:Yes, but I don't think you're really grasping how much more difficult this gets when you have two fast CPUs mapped to the exact same memory. It's pretty easy to do this with the MegaDrive; in fact, I could have shoved every element into a separate thread in KGen and it would have pretty much worked with little extra effort. But when you have two CPUs that are supposed to run at the same speed (they actually don't, due to collisions, but...), where every emulated instruction will take a different amount of time, running these in two threads is going to get out of sync almost immediately. How do you detect a collision, and how do you fix it, in a way that isn't going to be slow? Unless you keep track of every access, I think this is much more complex than you are betting on. What about when both CPUs try to *write* the same address? By the time you notice, it's going to be very difficult to work out which one should win, and when the other write should take place. And it's important: games will lock up if you get it wrong. (NB: *MOST* of the games do this.)

And again, you can only afford to add a few instructions before you're running slower than a single thread.
I believe we both understand these complexities. And they don't go away, regardless of whether you use a cooperative or pre-emptive model to emulate the chips. You still need exactly the same number of synchronization operations. The difference:

In the pre-emptive model, you put up a lock and wait for another core to catch up. That one core will sit there in a tight loop waiting, while the other core is actively doing things.

In the cooperative model, you switch out the context to pass control to the other chip. Your only core is bleeding time while switching contexts.

My experience shows me that the cooperative model has tremendous costs on modern pipelined processors. Suddenly and violently switching from one emulated processor to another absolutely destroys the pipeline, L1 cache, etc. It didn't hurt much back in the early 90s, but context switches are the bane of modern programming. And they only get more painful with each new generation.

Pretty sure I've bugged you (all) about my answer to speeding up the cooperative model: rather than using multiple error-prone, nested, hard-to-read-and-maintain state machines, which incur the same context switch problems in the first place ... why not instead keep a separate stack for each "process", and just add one more register to swap out, the stack pointer?

I tried this, and it works amazingly well. I got tremendous speed-ups over the old state machine model. But it's still a very painful operation. Doing absolutely nothing else but switching contexts back and forth in a tight loop, I can only manage ~10 million such switches per second on a P4 1.7GHz. On my Core 2 Duo, that number only goes up to ~20 million a second. And the switch() operation is an amazing 11 x86 opcodes: { push ebx,edi,esi,ebp; mov [oldcontext],esp; mov esp,[newcontext]; pop ebp,esi,edi,ebx; ret } -- it's practically impossible to optimize that any further. The overhead is not in the instructions themselves, but hidden in the processor architecture's model to execute instructions out of order and as fast as possible.
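
For anyone who wants to reproduce that kind of measurement without writing assembly, here is a self-contained stand-in using POSIX ucontext. Note that swapcontext() also saves and restores the signal mask on most systems, so expect numbers well below the hand-rolled 11-opcode switch quoted above; it demonstrates the cost trend, not the exact figure.

#include <ucontext.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000L               /* each iteration = 2 switches */

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];           /* the coroutine's separate stack */

static void coroutine(void)
{
    for (;;)
        swapcontext(&co_ctx, &main_ctx);   /* yield straight back */
}

int main(void)
{
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link          = &main_ctx;
    makecontext(&co_ctx, coroutine, 0);

    clock_t start = clock();
    for (long i = 0; i < ITERATIONS; i++)
        swapcontext(&main_ctx, &co_ctx);   /* out and back: two switches */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("%.1f million context switches/sec\n",
           ITERATIONS * 2 / secs / 1e6);
    return 0;
}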

Now take a look at my SNES CPU<>PPU example: it needs 21 million of these to remain in 100% perfect lock-step. So just by context swaps alone, it's already impossible on any modern processor to get full speed in this way. Obviously, that's why we absolutely need to use other tricks to ensure we can run them out of order.

And this is for the SNES! Imagine trying to do something similar on the 100MHz N64 processors, or 3GHz PS3 processors o_O
It's very likely that getting these to less than ~10 million syncs a second while still 100% accurate is not possible, no matter how many tricks you use.

You see, the single-threaded model is already exhausted. I believe the gist of what Nemesis and I are getting at, is that the multi-threaded model scales better for these synchronizations: they're less painful. I believe Nemesis could pull off more than 21 million sync operations per second by using two true pre-emptive threads.

I do agree with you that it would "waste" more overall processing power this way. But it will scale much further, while the single-threaded model will actually become less effective in time.

So yes, if you can get the same 100% accuracy in a single thread, that's obviously the ideal way to go. The million dollar question is, "can it be done for these 16-bit systems?" -- my experience tells me that it cannot. My example is two simultaneous CPUs sharing all hardware and memory, just as you described above: the S-CPU and SA-1. Maybe it can for the Genesis, maybe it can't. It seems much less likely to be possible for the 32x and Sega CD.

Either way, it's great that we have people trying to reach perfection with both approaches. Regardless of who ends up being right, we all win. But we all know that already: if only the lay-person who constantly belittles us over having "slow" emulators could understand that.

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Thu Nov 27, 2008 7:55 pm

byuu wrote:The overhead is not in the instructions themselves, but hidden in the processor architecture's model to execute instructions out of order and as fast as possible.
Yes indeed - specifically, it really does not like you changing the stack pointer like that. I forget the exact reason, but it's documented somewhere. It's not something you're supposed to do. However, of course, the OS *is* expected to do this, so there may be a specific way of doing it that's much faster. It may not be doable from user code, though.

That being said, my 'context switches' are basically free, so I don't have a problem here.
byuu wrote:Now take a look at my SNES CPU<>PPU example: it needs 21 million of these to remain in 100% perfect lock-step.
I'm not sure how you arrive at the 21 million figure given that the SNES CPU is pretty damn slow. What am I missing?
byuu wrote:And this is for the SNES! Imagine trying to do something similar on the 100MHz N64 processors, or 3GHz PS3 processors o_O
It's very likely that getting these to less than ~10 million syncs a second while still 100% accurate is not possible, no matter how many tricks you use.
N64, I would think, is doable. But sure, for things at the GHz level you'd probably want a core per core. Sync would be less of an issue there anyway, given that the real CPUs don't run in sync anyway (due to pipeline/cache/RAM timings and other stalls) and the fact that proper sync will be absolutely necessary in any software they are running.
byuu wrote:You see, the single-threaded model is already exhausted. I believe the gist of what Nemesis and I are getting at, is that the multi-threaded model scales better for these synchronizations: they're less painful.
I'm not convinced about either of those, personally.
byuu wrote:But it will scale much further, while the single-threaded model will actually become less effective in time.
Or this. Using multi-threading in this way actually scales very badly, while memory speeds/cache sizes keep increasing, making the single-threaded model more effective.

I'd like to talk to you more sometime about BSNES and why you have made some of the decisions you have more recently, but this is not the place.

Near
Very interested
Posts: 109
Joined: Thu Feb 28, 2008 4:45 pm

Post by Near » Thu Nov 27, 2008 8:59 pm

Snake wrote:I forget the exact reason, but it's documented somewhere ... the OS *is* expected to do this, so there may be a specific way of doing it that's much faster.
It breaks everything: branch prediction, pipelining, out-of-order execution, L1 cache, etc. Very bad stuff.

And I'd be really surprised if the OS could do it any faster. It's probably not much of a problem as rarely will an OS need to swap contexts more than a hundred thousand times a second.
Snake wrote:That being said, my 'context switches' are basically free, so I don't have a problem here.
How's that? You use the state machine approach? That just puts the overhead inside your code. My testing shows that a single switch/case to get to the code you want is faster, but when you need two or more switches to drill down to where you want to resume code (say ... exec -> switch(opcode) -> switch(opcodecycle) -> exec_one_cycle()), just swapping stacks is faster. I got a ~40% speedup in the CPU core from the latter, with that exact setup. Of course, my tests could have been flawed in some way ...
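
For the curious, a toy version of that nested-switch state machine, showing where the drill-down cost comes from. The opcode and its cycle breakdown are illustrative, not any particular CPU's, and fetch_opcode() is an assumed bus helper.

/* Resumable CPU core as a state machine: the scheduler can stop it between
   any two cycles, but every re-entry pays for two dispatches to find where
   to resume -- exactly the overhead that stack-swapping avoids. */
typedef struct {
    int opcode;                            /* -1 = between instructions */
    int cycle;                             /* cycle index within the opcode */
} cpu_t;

extern unsigned char fetch_opcode(cpu_t *c);   /* assumed bus helper */

void cpu_step_one_cycle(cpu_t *c)
{
    if (c->opcode < 0) {                   /* fetch cycle */
        c->opcode = fetch_opcode(c);
        c->cycle  = 0;
        return;
    }
    switch (c->opcode) {                   /* dispatch #1: which instruction */
    case 0xA9:                             /* e.g. a 2-cycle load-immediate */
        switch (c->cycle++) {              /* dispatch #2: which cycle of it */
        case 0: /* read operand byte */                 return;
        case 1: /* write register, set flags */
                c->opcode = -1;                          return;
        }
        break;
    /* ... one case per opcode ... */
    }
}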
Snake wrote:I'm not sure how you arrive at the 21 million figure given that the SNES CPU is pretty damn slow. What am I missing?
The SNES crystal clock is 315/88*6MHz, or 6x the NTSC color subcarrier. Each CPU opcode cycle consumes 6 clocks (I/O and fast memory regions), 8 clocks (slow memory regions) or 12 clocks (input device memory regions), so you get an effective rate of ~2.68MHz (slow) to ~3.58MHz (fast) for the CPU.

The two PPU processors each perform operations at a rate of 10.5MHz, half the crystal clock rate.

And the CPU itself runs the IRQ and ALU units at 10.5MHz as well. It's possible to range-test these over whole CPU cycles (6-12 clock ticks at a time), but it's very tricky (despite sounding deceptively easy). I spent the first two years trying to pass a few hundred edge-case IRQ tests and failed miserably. I gave up and just tested every single clock cycle @ 10.5MHz, and I had all of my tests passing within two days. I probably am just a bad programmer, but it wasn't worth the hassle.

So you factor in 10.5M context switches both to the PPU and back to the CPU, and you get 21M/second.
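
(Spelled out: 315/88 MHz ≈ 3.579545 MHz is the NTSC colour subcarrier, so the master clock is 6 × 3.579545 ≈ 21.48 MHz; half of that is ~10.74 MHz for the PPU step rate, rounded down to 10.5M above. One switch to the PPU plus one back per step is 2 × 10.5M ≈ 21M switches per second.)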
Snake wrote:Sync would be less of an issue there anyway, given that the real CPUs don't run in sync anyway (due to pipeline/cache/RAM timings and other stalls) and the fact that proper sync will (not) be absolutely necessary in any software they are running.
Very true on both counts. Speaking of which, I couldn't even imagine emulating a 12-stage pipeline + cache. The SNES CPU is a two-stage, but we fake it with some clever tricks.
Snake wrote:Or this. Using multi-threading in this way actually scales very badly, while memory speeds/cache sizes keep increasing, making the single-threaded model more effective.
... it seems we disagree, then. Though I'm not saying I know I'm right, and my instinct tells me to listen to someone with more years of experience working with multi-core setups.

Still, seeing is believing. I could post my single-threaded context switch timing test, and maybe Nemesis can do the same for his, and we can get some objective numbers, perhaps?
Snake wrote:I'd like to talk to you more sometime about BSNES and why you have made some of the decisions you have more recently, but this is not the place.
Sure, I'd be happy to. I'll PM you my private e-mail address.

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Thu Nov 27, 2008 9:40 pm

byuu wrote:It breaks everything: branch prediction, pipelining, out-of-order execution, L1 cache, etc. Very bad stuff.
Yeah, but there are a lot of things that should also break everything, but they don't, because Intel thought about it and worked around it. This isn't one of them, and they specifically mention it somewhere, and the reasons why. I didn't pay it that much attention at the time because it's not something I do often - not now, anyway - though I actually used this technique all the time in ASM games programming :)
byuu wrote:Of course, my tests could have been flawed in some way ...
Probably not; it's more likely that your context switches are not as optimal as they could be. Mine are down to two instructions, one of which I'd have to do anyway, and the other gets nicely pipelined and effectively takes less than a cycle. Of course, the cache thrashes a bit more, but that's going to happen anyway, and as it turns out, isn't a big deal.
byuu wrote:It's possible to range-test these over whole CPU cycles (6-12 clock ticks at a time), but it's very tricky (despite sounding deceptively easy).
Hmm. Given that the CPU will not see an interrupt till, at the very earliest, the start of the next instruction, there's got to be a way to do this. I'm surprised it's even a problem given the available interrupt sources you have on the SNES. Anyway, discussion for another time.

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Mon Dec 01, 2008 1:32 am

Eke wrote:about VDP mid-line changes, have you figured out what can be modified and what can't? the documentation says that some registers/data are latched during hblank, and currently I'm only supporting changes to registers 1 (display on/off) and 7 (background color palette entry)... I don't think VRAM/CRAM changes during active display have an effect (except for the "dot bug")
Not yet. I can tell you that mid-line changes to the various RAM buffers do have an effect, however (http://www.spritesmind.net/_GenDev/foru ... =5347#5347). I know the timing for when VRAM/CRAM/VSRAM writes are committed (although there is more testing to be done), and I know when some register changes are applied, but I haven't checked when all the various register changes take effect. I've still got a massive list of tests I need to run for the VDP, and some of those tests relate to when register data is sampled. Most of my VDP testing so far has been targeted at documenting unknown and undefined behaviours related to basic data/control port access and DMA operations, of which there are many, and much of which is not emulated accurately.
Snake wrote:Mine does too, to a certain extent. Rather, changes to certain registers etc. happen on the correct line (this is not true of Gens), and changes to certain registers happen immediately. There are no doubt a few registers I don't have behaving exactly as the real hardware does, but they are accurate according to all documentation. But I could definitely do what you say quite easily without taking much of a speed hit (in fact, next to none at all, given that no game I know of does very much with the VDP during active scan). The main reason I don't is that most registers etc. are not supposed to have any effect. This is also the main reason why it's 'scanline driven' - it doesn't have to be, really, but I bring it up to speed at least once per line, just because it's easier to debug/follow what's going on that way.
I think mid-line changes are more important than most people realize. If you've got a game with a HInt routine that modifies the VDP state (which is kind of the point of a HInt), it is making changes mid-line. There is no such thing as "between" lines. With a VDP that's scanline driven, you have to choose a single fixed point in time to commit every change to the VDP. This is fundamentally inaccurate. I'm sure picking the timing between when to generate HInt, and when to commit all pending VDP changes for the next line, while obtaining results which are correct for every Mega Drive game, is a very difficult task.

As for the documentation, I never trust it. I have yet to find any comprehensive documentation on any device, official or unofficial, that doesn't contain errors. When you start to stray into undefined territory like this, documentation becomes virtually useless. The only thing you can trust is the hardware. I make it a point of running my own hardware tests on anything I have even the slightest suspicion about. Then a week later I often find myself questioning my own documentation, and running the tests again. :)
Snake wrote:
Nemesis wrote:and I've even emulated the "noise" that occurs on CRAM writes during rendering
Yep, I've thought about that once or twice, but I never got around to doing some decent tests to make sure I'd got all situations covered.
The rules are pretty simple. A single write to CRAM/VRAM/VSRAM can occur on each "pixel" the VDP renders (ignoring access limitations while drawing). Only writes to CRAM cause any visual artifacts. When a write to CRAM is committed, the pixel that is being drawn at that location will use the colour value that was just written to the CRAM. One write to CRAM always causes one pixel to be altered, even while rendering the borders and overscan regions. Only writes during blanking don't cause this to occur, obviously.
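
Reduced to code, those rules might look something like this per-dot check; the pending_write_t structure and every name in it are invented for the sketch.

#include <stdint.h>
#include <stddef.h>

enum { TARGET_VRAM, TARGET_CRAM, TARGET_VSRAM };

typedef struct {
    int      target;                       /* which RAM the write goes to */
    int      commit_x;                     /* pixel slot where it's committed */
    uint16_t data;                         /* the value being written */
} pending_write_t;

/* returns the colour for pixel x of the current line; w is the write (if
   any) scheduled to commit somewhere on this line */
uint16_t render_dot(int x, const pending_write_t *w,
                    const uint16_t *cram, unsigned colour_index)
{
    if (w != NULL && w->target == TARGET_CRAM && w->commit_x == x) {
        /* the write and the dot collide: exactly one pixel shows the value
           just written, even out in the border/overscan regions */
        return w->data;
    }
    return cram[colour_index];             /* normal palette lookup */
}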
Snake wrote:Me too. My PSG also runs at the exact hardware frequency of something-stupid-that-I-dont-recall.
Yep, same, with sample-accurate noise emulation and all that stuff. I still need to add FIR filtering to simulate the audio output circuitry of the Mega Drive though. It has a major effect on the square wave output of the PSG, especially for the longer tone cycles.

BTW, there's actually a problem with PSG emulation in all current emulators. There's some bad info in Maxim's doc. When the tone data is set to 0 on a Mega Drive, it acts the same as when the tone data is set to 1, namely, it produces a square wave which oscillates on each cycle. I was suspicious when I found a number of Mega Drive games which rely on the noise channel shifting each cycle when using the second channel tone data, when it was set to 0.

I took a measurement directly from the PSG output pin with an oscilloscope to confirm. For all 3 normal PSG channels, a tone setting of either 0 or 1 is equivalent. Check the sound in After Burner II for an example of the effect this has. The people who designed this game seemed to believe the output was held at +1 with tone data set to 0, like Maxim described. Since this is incorrect, the PSG sounds extremely low volume and somewhat distorted on the real system, and it actually sounds "too good" in current Mega Drive emulators (probably how the designers intended it to sound, rather than how it actually ended up). I don't know if Maxim's info is correct for other implementations of the SN76489, but it appears to be incorrect for the embedded version used in the Mega Drive.
Snake wrote:Yes, but I don't think you're really grasping how much more difficult this gets when you have two fast CPUs mapped to the exact same memory. It's pretty easy to do this with the MegaDrive; in fact, I could have shoved every element into a separate thread in KGen and it would have pretty much worked with little extra effort. But when you have two CPUs that are supposed to run at the same speed (they actually don't, due to collisions, but...), where every emulated instruction will take a different amount of time, running these in two threads is going to get out of sync almost immediately. How do you detect a collision, and how do you fix it, in a way that isn't going to be slow? Unless you keep track of every access, I think this is much more complex than you are betting on. What about when both CPUs try to *write* the same address? By the time you notice, it's going to be very difficult to work out which one should win, and when the other write should take place. And it's important: games will lock up if you get it wrong. (NB: *MOST* of the games do this.)

And again, you can only afford to add a few instructions before you're running slower than a single thread.
Detecting and dealing with collisions isn't a problem. Honestly, those issues are simpler than you think. As for performance, yes, there's overhead to all this of course.

To put things simply, if you're emulating a system/program which has constant, extremely tight timing requirements between all its processors, which require synchronization every few cycles, you're right, it will be slower in an emulator which attempts multithreading of CPU cores. If you're emulating a system where there are virtually no collisions, it will be faster in a multi-threaded emulator. Real programs running on real systems tend to fall somewhere in between, and everything in between is grey area, where performance depends on how efficient and sophisticated the designs of the respective emulators are, the nature of the collisions occurring between the devices, and how the multi-threaded emulator deals with them.

You obviously think an emulator which attempts to use threading for CPU emulation is going to be inherently slower with no significant benefits. Tests with my own code have proven to my satisfaction that it is not inherently slower in real-world applications, but at the same time, I'm not running those tests with your hyper-optimized cores. The lower the overhead of the actual cores themselves, the greater the relative overhead of the multi-threaded synchronization becomes.
Snake wrote:
Nemesis wrote:Every current Mega Drive emulator, including yours if I'm not mistaken, approximates the timing.
No, it's very accurate. Although the current release version does have some test code that I left in which breaks it a little (which is why things that used to work got broken at some point). This was all fixed a long time ago and I really do need to get a new build out.
Snake wrote:
Nemesis wrote:When drx released a whole batch of new prototypes recently, including a bunch of 32x prototypes, a significant number of them didn't work in the emulators available at the time. Apparently, there were a lot of hard-coded timing fixes for specific games to get them running.
Actually, it was only the Chaotix protos that had a problem. There are no hard-coded timing 'fixes'; rather, in order to attempt to speed things up (I was still supporting people with 500MHz CPUs at the time), I lowered the timing requirements of the handful of games that didn't need quite as heavy lockstepping. The 'fixed' version (which I provided) should work just fine with any ROM you throw at it, and I now have an option to just ignore this table anyway. There are no per-game 'fixes' of any kind, but there are certain games that need to be detected in order to enable something they use (such as EEPROMs).
Snake wrote:As far as the MegaDrive goes, everything is already cycle accurate, with the exception of the VDP stuff mentioned above. Given that it already runs at about 20 times real speed on my 1.7GHz system, I don't see any problem.
Well it sounds like you've achieved a higher level of timing accuracy than I thought. Your emulator must be insanely efficient in order to run at the speed it does, with that level of accuracy.
byuu wrote:But essentially, it's the same problem: how can your 68k core know the state of the VDP's interrupt line if it isn't caught up? The only answers seem to be lock-step (way too slow), prediction (messy, a departure from a strict hardware model) and rewind (very complex). All the options suck
Well, right now I force the system to be synchronized before each VDP interrupt is generated. I'm currently rewriting my VDP core though, and in the redesigned core, I plan to use prediction, of a sort. Prediction isn't any less accurate than synchronization, as long as the prediction is correct. I can always rely on a rollback if a truly unusual circumstance breaks a generally safe prediction however (eg, if VInt is enabled at the start of the frame, and we predict it will still be enabled at the end of the frame).
I really wanted to avoid needing a look-ahead calculation to determine the pin states. Without a rewind mechanism, my entire emulator is 100% lock-step, with no prediction, no timestamps ... it just does exactly what hardware (probably) would. And it has a single integer to represent a bi-directional cycle counter to tell which of two chips is currently "ahead" of the other.
I had a long hard think about what kind of behaviour I would allow and what behaviour I would not, and this form of prediction was one of the points I debated. Ultimately, I came to the conclusion that there's no problem, as long as the outcome remains deterministic and accurate. This prediction method is just an optimization to lower the synchronization requirements. One of the interesting things about a multi-threaded design is that the efficiency of each individual emulation core becomes less important for performance. What becomes critically important for performance however, is the amount of time the cores can run unsynchronized.
byuu wrote:Breathtaking. Here I thought only the NES crowd had just barely managed to pull off a cycle-level video processor.

Once again, SNES emulation falls painfully far behind everyone else :(
Well, the Mega Drive VDP mostly runs in a world of its own, so I only have to get the timing cycle-accurate on its external interactions, and when external changes are applied. I don't have to emulate the actual read/write cycles between the VDP and its private VRAM for example. Since VRAM write access is limited to specific "slots" during rendering, I only have to ensure any "buffered" VRAM data is correct at each slot. It sounds like a much harder task to achieve this on the SNES. Everything sounds a little more inter-dependent.
