
PS: you got virtua racing running at 60fps on GP2X, really? What did you do with your core?
Eke wrote:
> yes, both approaches are interesting; optimization gurus have all my respect, because I feel it is still a much harder task than designing the "perfect" hardware emulator!

Nah, I think doing a 100% perfect emu is an insane task.
Eke wrote:
> PS: you got virtua racing running at 60fps on GP2X, really? What did you do with your core?

Well, it may be wrong to say 60fps, as the game itself is doing ~15. I used a dynarec for this, capable of detecting certain instruction sequences (as there is only one game) and replacing them with ARM code. I also enumerated all instruction RAM states so I can get away without ever clearing the translation cache, so the actual translation only happens at the start of levels.
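The caching idea described here can be sketched in a few lines (a toy model of my own, not PicoDrive's actual code; names are hypothetical): key the translation cache on a digest of the DSP's instruction RAM, so "translation" only ever happens the first time a given program state is seen.

```python
# Hypothetical sketch: a translation cache keyed by the contents of
# instruction RAM, so code is only "translated" once per unique program.
import hashlib

class TranslationCache:
    def __init__(self):
        self._blocks = {}  # digest of instruction RAM -> translated block

    def get(self, iram: bytes):
        key = hashlib.sha1(iram).digest()
        block = self._blocks.get(key)
        if block is None:
            # First time we see this program state: "translate" it.
            # A real dynarec would emit ARM code here; we just make a tag.
            block = ("translated", key.hex()[:8])
            self._blocks[key] = block
        return block

cache = TranslationCache()
level_code = bytes(range(16))
first = cache.get(level_code)
again = cache.get(level_code)   # pure cache hit: no retranslation
print(first is again)           # True
print(len(cache._blocks))       # 1
```

If every reachable instruction-RAM state is enumerable (as notaz says it is for this one game), the cache never needs flushing.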
notaz wrote:
> Well, it may be wrong to say 60fps, as the game itself is doing ~15. I used a dynarec for this, capable of detecting certain instruction sequences (as there is only one game) and replacing them with ARM code. I also enumerated all instruction RAM states so I can get away without ever clearing the translation cache, so the actual translation only happens at the start of levels.

You may be better off using a static recompiler, since it's just this one game that uses the SVP.
Chilly Willy wrote:
> I play most of my CDs without strict timing in PicoDrive. I can't think of the ONE that requires strict timing off the top of my head, but out of about a dozen, there was ONE that needed it.

... probably code based on Gens, I would imagine. In which case the timing is *way* more accurate than the 'once per frame' you quoted. The SegaCD BIOS won't even run if you do that.
Nemesis wrote:
> All of that is easily solvable. Remember that I've spent over 3 years writing an emulator which has to deal with these issues.

Well, I've been coding for multi-CPU systems for - what - 14 years. I may have given the impression that I am not familiar with this stuff. I've done a lot of it - you aren't telling me anything I don't know. There's just not much of it in Kega. (Well, there is some; there has been since v1...)
Nemesis wrote:
> The simple fact is, 99.99999% of the time in any real program, this doesn't occur.

But it does. Some of the 32X games, for example, spend a very large amount of time doing exactly that, and with no safeguards whatsoever. Yes, it's a very bad way to do things. But they do it.
Nemesis wrote:
> To deal with cases where the collisions between two devices are extreme, build in heuristics.

But you're missing the point that you're seriously overcomplicating things for no benefit at all. This is the first rule of multithreaded programming: "is there a simpler, safer way to do this?"
Nemesis wrote:
> You're only thinking about the Mega Drive

I thought that was the topic of the conversation.
Nemesis wrote:
> Besides, and this is another point entirely, I don't consider hand-optimized assembly cores, fast or not, to be a solution to preserve a system into the future, which, after all, is what emulation is supposed to be about. Give me a slow core, which is high-level, flexible, and easy to understand, any day.

This is why I suggested Aamir does this in C. Also, you are assuming that a well-written ASM core cannot be flexible and easy to understand.
Nemesis wrote:
> What, exactly, would exclude a multi-threaded emulator from being accurate?

Some of the things being suggested may get the job done in the end, but certainly could not be called accurate.
Nemesis wrote:
> Is a single-threaded emulator an efficient use of a quad core?

If said emulator is taking less than 100% of a single core, then yes, absolutely, it's highly efficient. Taking 50% of a second core while the first core isn't even maxed out? Inefficient, and missing the point.
Nemesis wrote:
> Those cores are there to be used.

...but not just because they are there.
Nemesis wrote:
> These problems are difficult, but not unsolvable. Spend a few years thinking about the issues, with an attitude that there IS a solution, and you'll start to come up with solutions.

But you're still trying to solve an issue that doesn't need to be solved. It makes no sense when there are simpler, safer, more efficient ways to achieve the same goal. It's also a hell of a lot easier to debug and maintain.
byuu wrote:
> Take two processors that only share a 4-byte communication bridge. You only have to lock when one accesses the other's memory address range. Since there are only four addresses, your lock-stepping will be minimized greatly.

Oh yeah, that's not a problem. To me, this is the difference between "worth doing" and "not worth doing".
byuu wrote:
> Now, when you have two processors that share a large bus, say a big 64k read/write memory chunk (read-only memory would obviously not require locks since it cannot change any state); then yes, you have real problems.

Yup. And that's the case here. Except it's just a bit more than 64K, plus a ton of hardware. And it's not always possible to even notice that there is a problem / that you're going to have to 'roll back' - other than the fact that the game will crash.
byuu wrote:
> but I can't really see how a game could function if the two cores spend 99% of their time talking to each other, rather than actually doing stuff.

How about spending 99% of their time working on various parts of the same data, and making dangerous assumptions about when various pieces of it are ready? Yeah, happens all the time. It isn't easy to detect, and even if it were, and even if you CAN roll back everything needed, you may have to roll back several frames. It's not pretty.
Nemesis wrote:
> Yep, same here. Keeping everything 100% separate and modular is the only way to do it. Imagine how easy it becomes to support all the crazy variations on 80's arcade systems, for example, when you can just drop in all your generic cores and know it's all ready to go. It's good to know there's someone else out there working along a similar line.

OT, but all my cores are, absolutely, written this way.
Nemesis wrote:
> Nice, I wasn't aware of this prefix. That expands the list of thread-safe opcodes quite considerably.

It only works for a very limited set of instructions.
Nemesis wrote:
> Personally, I'm most interested in the behaviour of the InterlockedCompareExchange() function

Ah, that's just the magic of the x86 CMPXCHG instruction at work.
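For readers who haven't used it: the semantics of that compare-exchange operation can be sketched in a few lines. The hardware CMPXCHG instruction does this atomically in one step; the Python below is only atomic because of the explicit lock, and is purely illustrative.

```python
import threading

_lock = threading.Lock()

def interlocked_compare_exchange(cell, comparand, new_value):
    """Mimics InterlockedCompareExchange: if cell[0] == comparand,
    store new_value; either way, return the value that was seen."""
    with _lock:  # the CPU does this atomically; Python needs a lock
        seen = cell[0]
        if seen == comparand:
            cell[0] = new_value
        return seen

cell = [5]
print(interlocked_compare_exchange(cell, 5, 9))  # 5 -> swap happened
print(cell[0])                                   # 9
print(interlocked_compare_exchange(cell, 5, 7))  # 9 -> comparand stale, no swap
print(cell[0])                                   # 9
```

The returned "value seen" is what lets a thread detect that another thread got there first and retry, which is the basis of lock-free algorithms.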
Nemesis wrote:
> If the current single-core limitations can be solved, I'm sure individual cores will continue to get faster and faster.

Absolutely. Given that there are people who've almost doubled the speed of a Core2Duo via overclocking, I don't think we're anywhere near the limit yet.
Nemesis wrote:
> Dual core in particular is a really important step, as it gives your system the flexibility to run one task maxing out a core, while still having the other core around to drive the OS and juggle the other apps idling in the background.

Again, absolutely, and this is why I don't think it's optimal to just jump on another core unless you really need it. Leave it for someone else to use. For a start, if you're using DDraw, D3D, DSound, and DInput, you're already running five threads anyway.
byuu wrote:
> Looking at platforms like the Saturn, PS3, etc ... I see a really compelling case for refining multi-core methods. It may be the only way to get accurate and playable framerates for newer generation systems.

Absolutely. But you'd look at where it makes the most sense to do it first. I think you'll find CPU emulation is not that place.
AamirM wrote:
> notaz wrote:
> > Well, it may be wrong to say 60fps, as the game itself is doing ~15. I used a dynarec for this, capable of detecting certain instruction sequences (as there is only one game) and replacing them with ARM code. I also enumerated all instruction RAM states so I can get away without ever clearing the translation cache, so the actual translation only happens at the start of levels.
>
> You may be better off using a static recompiler, since it's just this one game that uses the SVP.

If you wanted, you could write high-level functions to replace the SVP entirely. Not to imply that this is an easy task -- static recomp would certainly be easier -- but the SVP code is very modular, so I wouldn't be surprised if someone did this one day.
AamirM wrote:
> You may be better off using a static recompiler, since it's just this one game that uses the SVP.

Well, it sort of is already, as it only recompiles stuff once, but does it on demand only.
Snake wrote:
> Chilly Willy wrote:
> > I play most of my CDs without strict timing in PicoDrive. I can't think of the ONE that requires strict timing off the top of my head, but out of about a dozen, there was ONE that needed it.
>
> ... probably code based on Gens, I would imagine. In which case the timing is *way* more accurate than the 'once per frame' you quoted. The SegaCD BIOS won't even run if you do that.

Yeah, it is a rewrite of Gens code. And true, it syncs once per line, not per frame. And there are a bunch of games needing better sync than that (all Wolfteam games and several others).
Snake wrote:
> ...

Ok, let me put this another way. The number 1 goal of my emulator is accuracy. Right now, my emulator maintains cycle-level accuracy between all devices. My VDP core is not scanline driven. It is capable of responding to changes mid-line, and I've even emulated the "noise" that occurs on CRAM writes during rendering. My YM2612 and PSG are capable of responding to register changes at the exact sample they should be applied. Any processors can sit in an endless loop fighting over shared access to any memory address, device, or any other obscure dependency, and my emulator will ensure that it is always executed in the exact same way, with everything happening in the correct order, regardless of what's running in what thread or what order things get processed in.
Nemesis wrote:
> My VDP core is not scanline driven. It is capable of responding to changes mid-line

Mine does too, to a certain extent. Rather, changes to certain registers etc. happen on the correct line; this is not true of Gens, where changes to certain registers happen immediately. There are no doubt a few registers I don't have exactly as the real hardware, but they are accurate according to all documentation. But I could definitely do what you say quite easily without taking much of a speed hit (in fact, next to none at all, given that no game I know of does very much with the VDP during active scan). The main reason I don't is that most registers etc. are not supposed to have any effect. This is also the main reason why it's 'scanline driven' - it doesn't have to be, really, but I bring it up to speed at least once per line, just because it's easier to debug/follow what's going on that way.
Nemesis wrote:
> and I've even emulated the "noise" that occurs on CRAM writes during rendering

Yep, I've thought about that once or twice, but I never got around to doing some decent tests to make sure I'd got all situations covered.
Nemesis wrote:
> My YM2612 and PSG are capable of responding to register changes at the exact sample they should be applied.

Me too. My PSG also runs at the exact hardware frequency of something-stupid-that-I-don't-recall.
Nemesis wrote:
> Any processors can sit in an endless loop fighting over shared access to any memory address, device, or any other obscure dependency, and my emulator will ensure that it is always executed in the exact same way, with everything happening in the correct order, regardless of what's running in what thread or what order things get processed in.

Yes, but I don't think you're really grasping how much more difficult this gets when you have two fast CPUs mapped to the exact same memory. It's pretty easy to do this with the MegaDrive; in fact, I could have shoved every element into a separate thread in KGen and it would have pretty much worked with little extra effort. But when you have two CPUs that are supposed to run at the same speed (they actually don't, due to collisions, but...), where every emulated instruction will take a different amount of time, running these in two threads is going to get out of sync almost immediately. How do you detect a collision, and how do you fix it, in a way that isn't going to be slow? Unless you keep track of every access, I think this is much more complex than you are betting on. What about when both CPUs try to *write* the same address? By the time you notice, it's going to be very difficult to work out which one should win, and when the other write should take place. And it's important: games will lock up if you get it wrong. (NB: *MOST* of the games do this.)
Nemesis wrote:
> I believed it was impossible to achieve cycle-accurate emulation in a single threaded emulator, which could emulate even the Mega Drive at full speed in cycle-level lock-step

...but you're not doing that now, either, right? The same things you are doing can be done on a single core.
Nemesis wrote:
> Every current Mega Drive emulator, including yours if I'm not mistaken, approximates the timing.

No, it's very accurate. Although the current release version does have some test code that I left in which breaks it a little (which is why things that used to work got broken at some point). This was all fixed a long time ago, and I really do need to get a new build out.
Nemesis wrote:
> I know Fusion doesn't support mid-line VDP changes for example.

Covered above.
Nemesis wrote:
> When drx released a whole batch of new prototypes recently, including a bunch of 32x prototypes, a significant number of them didn't work in the emulators available at the time. Apparently, there were a lot of hard-coded timing fixes for specific games to get them running.

Actually, it was only the Chaotix protos that had a problem. There are no hard-coded timing 'fixes'; rather, in order to attempt to speed things up (I was still supporting people with 500MHz CPUs at the time), I lowered the timing requirements of the handful of games that didn't need quite as heavy lockstepping. The 'fixed' version (which I provided) should work just fine with any ROM you throw at it, and I now have an option to just ignore this table anyway. There are no per-game 'fixes' of any kind, but there are certain games that need to be detected in order to enable something they use (such as EEPROMs).
Nemesis wrote:
> If you or someone else is able to achieve the same level of accuracy at a faster speed on a single-threaded emulator, particularly when running on a quad-core, I'll be very impressed.

Well, it's pretty much there already. As far as the MegaDrive goes, everything is already cycle accurate, with the exception of the VDP stuff mentioned above. Given that it already runs at about 20 times real speed on my 1.7GHz system, I don't see any problem.
Eke wrote:
> run the CPUs for some cycles, and when they write to the VDP, execute the appropriate number of VDP cycles, render pixels as blocks

Absolutely. In a multi-threaded design, you'd have to lock one of the threads until the VDP caught up anyway. Don't want to be writing over anything. This is basically exactly how KGen worked.
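That catch-up scheme can be sketched as follows (a toy model with hypothetical names, not KGen's actual code): the VDP lags behind the CPU and is only brought up to date, in one burst, at the moment the CPU actually touches it.

```python
class LazyVDP:
    """Toy VDP that renders one 'pixel' per cycle, and is only
    synchronized when the CPU accesses it (catch-up scheduling)."""
    def __init__(self):
        self.cycle = 0
        self.pixels = 0

    def run_until(self, target_cycle):
        while self.cycle < target_cycle:
            self.pixels += 1        # render work done in a burst ("blocks")
            self.cycle += 1

    def write(self, cpu_cycle, value):
        self.run_until(cpu_cycle)   # lock-step only at the access point
        # ... apply the register/VRAM write here ...

vdp = LazyVDP()
# The CPU runs 500 cycles without touching the VDP: zero VDP work is done.
# Then it writes, and the VDP catches up in one go before the write lands.
vdp.write(500, 0xAB)
print(vdp.cycle, vdp.pixels)   # 500 500
```

The win is that synchronization cost is proportional to the number of cross-device accesses, not the number of emulated cycles.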
> The Mega Drive VDP has a H/V counter, but it has to be specifically read by the CPU.

Well, the same goes for reading the counter value from the SFC's PPU. It's just the H/V pin signals that the CPU polls every clock tick.
> the only unsolicited output from the VDP to the Mega Drive is in the form of these interrupt lines

That really seems to be the same problem I'm having, just with an interrupt in place of blanking. Interrupt lines seem to be a bit more forgiving, though. Interrupts tend to only trigger once per opcode, so you only have to sync up (or predict the state of the line) once.
> One possibility I've considered is that I'd provide a way for the source device, the PPU in this case, to predict its output lines ahead of time, and for the target device, if any, to "request" the state of those lines each time they're sampled, or possibly even a list of all the changes to those lines between two points in time. Consider this point: given the current state of the PPU, it's always possible for the PPU to calculate what the state of the vblank and hblank pins will be at any point into the future, as long as no external accesses occur which alter the state of the PPU, such as modifying a register.

I really wanted to avoid needing a look-ahead calculation to determine the pin states. Without a rewind mechanism, my entire emulator is 100% lock-step, with no prediction, no timestamps ... it just does exactly what the hardware (probably) would. And it has a single integer to represent a bi-directional cycle counter to tell which of two chips is currently "ahead" of the other.
> I don't know how you handle this kind of communication in your emulator, but perhaps a similar approach might work for you?

I don't emulate the only two special chips that can assert the CPU's /IRQ line. I'll be pretty much screwed there, as the look-ahead method for the PPU's blanking lines will not work there.
> Nah, I think doing a 100% perfect emu is an insane task.

I know that, at least in my case, 100% perfection is impossible. I just don't think we're anywhere near as close as we can get. I'll be happy with 99.98% or better.
> /waves at byuu - it's been a while

Indeed it has! As always, a pleasure to speak with you.
> How about spending 99% of their time working on various parts of the same data, and making dangerous assumptions about when various pieces of it are ready?

Point well taken. Silly of me to miss that; I see it all the time with the main<>sound processor communication. This is what breaks the sound in Earthworm Jim 2 and others -- most emulators do not synchronize the two processors tightly enough.
> My VDP core is not scanline driven.

Breathtaking. Here I thought only the NES crowd had just barely managed to pull off a cycle-level video processor.
Snake wrote:
> Yes, but I don't think you're really grasping how much more difficult this gets when you have two fast CPUs mapped to the exact same memory. [...] And it's important: games will lock up if you get it wrong. (NB: *MOST* of the games do this.)
>
> And again, you can only afford to add a few instructions before you're running slower than a single thread.

I believe we both understand these complexities. And they don't go away, regardless of whether you use a cooperative or pre-emptive model to emulate the chips. You still need exactly the same number of synchronization operations. The difference:
byuu wrote:
> The overhead is not in the instructions themselves, but hidden in the processor architecture's model to execute instructions out of order and as fast as possible.

Yes indeed; specifically, it really does not like you changing the stack pointer like that. I forget the exact reason, but it's documented somewhere. It's not something you're supposed to do. However, of course, the OS *is* expected to do this, so there may be a specific way of doing it that's much faster. It may not be doable from user code, though.
byuu wrote:
> Now take a look at my SNES CPU<>PPU example: it needs 21 million of these to remain in 100% perfect lock-step.

I'm not sure how you arrive at the 21 million figure, given that the SNES CPU is pretty damn slow. What am I missing?
byuu wrote:
> And this is for the SNES! Imagine trying to do something similar on the 100MHz N64 processors, or 3GHz PS3 processors o_O

N64, I would think, is doable. But sure, for things in the GHz range you'd probably want a core per core. Sync would be less of an issue there anyway, given that the real CPUs don't run in sync (due to pipeline/cache/RAM timings and other stalls) and the fact that proper sync will be absolutely necessary in any software they are running.
It's very likely that getting these below ~10 million syncs a second while remaining 100% accurate is not possible, no matter how many tricks you use.
byuu wrote:
> You see, the single-threaded model is already exhausted. I believe the gist of what Nemesis and I are getting at is that the multi-threaded model scales better for these synchronizations: they're less painful.

I'm not convinced about either of those, personally.
byuu wrote:
> But it will scale much further, while the single-threaded model will actually become less effective in time.

Or this. Using multi-threading in this way actually scales very badly, and memory speeds/cache sizes keep increasing, making the single-threaded model more effective.
> I forget the exact reason but it's documented somewhere ... the OS *is* expected to do this, so there may be a specific way of doing this that's much faster.

It breaks everything: branch prediction, pipelining, out-of-order execution, L1 cache, etc. Very bad stuff.
> That being said, my 'context switches' are basically free, so I don't have a problem here.

How's that? Do you use the state machine approach? That just puts the overhead inside your code. My testing shows that a single switch/case to get to the code you want is faster, but when you need two or more switches to drill down to where you want to resume code (say ... exec -> switch(opcode) -> switch(opcodecycle) -> exec_one_cycle()), just swapping stacks is faster. I got a ~40% speedup in the CPU core from the latter with that exact setup. Of course, my tests could have been flawed in some way ...
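In a language with coroutines, the same "resume exactly where you stopped" effect falls out for free. This Python generator sketch is only an analogy for the stack-swap technique, not anyone's actual core: each `yield` is a cycle boundary where the scheduler may switch chips, and resuming needs no opcode/cycle dispatch at all.

```python
def cpu_core():
    """One generator per emulated CPU. Each 'yield' marks a cycle
    boundary; on resume, execution continues exactly where it stopped,
    with no switch(opcode)/switch(cycle) drill-down."""
    executed = []
    while True:
        executed.append("fetch")
        yield executed
        executed.append("execute")
        yield executed

core = cpu_core()
print(next(core)[-1])   # fetch
print(next(core)[-1])   # execute
print(next(core)[-1])   # fetch  (resumed mid-loop: no dispatch code ran)
```

The state machine approach rebuilds this resume point by hand on every entry, which is exactly the overhead being discussed.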
> I'm not sure how you arrive at the 21 million figure, given that the SNES CPU is pretty damn slow. What am I missing?

The SNES crystal clock is 315/88*6MHz, or 6x the NTSC colour subcarrier - roughly 21.477MHz. Each CPU opcode cycle consumes 6 clocks (I/O and fast memory regions), 8 clocks (slow memory regions), or 12 clocks (input device memory regions), so you get an effective rate of ~2.68MHz (slow) to ~3.58MHz (fast) for the CPU.
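The arithmetic behind those figures checks out; a quick sanity check:

```python
# NTSC colour subcarrier is 315/88 MHz; the SNES master clock is 6x that.
master_hz = 315 / 88 * 6 * 1_000_000
print(round(master_hz))          # 21477273 -> the "21 million" syncs/second

# Effective CPU rates: one opcode cycle costs 6, 8, or 12 master clocks.
fast_hz = master_hz / 6          # ~3.58 MHz (I/O and fast memory regions)
slow_hz = master_hz / 8          # ~2.68 MHz (slow memory regions)
print(round(fast_hz), round(slow_hz))   # 3579545 2684659
```

So the 21 million figure counts master clock ticks, not CPU opcode cycles - hence "pretty damn slow" CPU, huge sync count.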
> Sync would be less of an issue there anyway, given that the real CPUs don't run in sync anyway (due to pipeline/cache/RAM timings and other stalls) and the fact that proper sync will (not) be absolutely necessary in any software they are running.

Very true on both accounts. Speaking of which, I couldn't even imagine emulating a 12-stage pipeline + cache. The SNES CPU is a two-stage, but we fake it with some clever tricks.
> Or this. Using multi-threading in this way actually scales very badly, and memory speeds/cache sizes keep increasing, making the single-threaded model more effective.

... it seems we disagree, then. Though I'm not saying I know I'm right, and my instinct tells me to listen to someone with more years of experience working with multi-core setups.
> I'd like to talk to you more sometime about BSNES and why you have made some of the decisions you have more recently, but this is not the place.

Sure, I'd be happy to. I'll PM you my private e-mail address.
byuu wrote:
> It breaks everything: branch prediction, pipelining, out-of-order execution, L1 cache, etc. Very bad stuff.

Yeah, but there are a lot of things that should also break everything, yet they don't, because Intel thought about it and worked around it. This isn't one of them; they specifically mention it somewhere, and the reasons why. I didn't pay it that much attention at the time because it's not something I do often - not now, anyway - though I actually used this technique all the time in ASM games programming.
byuu wrote:
> Of course, my tests could have been flawed in some way ...

Probably not; it's more likely that your context switches are not as optimal as they could be. Mine are down to two instructions, one of which I'd have to do anyway, and the other gets nicely pipelined and effectively takes less than a cycle. Of course, the cache thrashes a bit more, but that's going to happen anyway, and as it turns out, it isn't a big deal.
byuu wrote:
> It's possible to range test these over whole CPU cycles (6-12 clock ticks at a time), but it's very tricky (despite sounding deceptively easy.)

Hmm. Given that the CPU will not see an interrupt until, at the very earliest, the start of the next instruction, there's got to be a way to do this. I'm surprised it's even a problem, given the available interrupt sources you have on the SNES. Anyway, discussion for another time.
Eke wrote:
> About VDP mid-line changes, have you figured out what can be modified and what can't? The documentation says that some registers/data are latched during hblank, and currently I'm only supporting changes to register 1 (display on/off) and register 7 (background colour palette entry). I don't think VRAM/CRAM changes during active display have an effect (except for the "dot bug").

Not yet. I can tell you that mid-line changes to the various RAM buffers do have an effect, however (http://www.spritesmind.net/_GenDev/foru ... =5347#5347). I know the timing for when VRAM/CRAM/VSRAM writes are committed (although there is more testing to be done), and I know when some register changes are applied, but I haven't checked when all the various register changes take effect. I've still got a massive list of tests I need to run for the VDP, and some of those tests relate to when register data is sampled. Most of my VDP testing so far has been targeted at documenting unknown and undefined behaviours related to basic data/control port access and DMA operations, of which there are many, and much of which is not emulated accurately.
Snake wrote:
> Mine does too, to a certain extent. [...] The main reason I don't is that most registers etc. are not supposed to have any effect. This is also the main reason why it's 'scanline driven' - it doesn't have to be, really, but I bring it up to speed at least once per line, just because it's easier to debug/follow what's going on that way.

I think mid-line changes are more important than most people realize. If you've got a game with an HInt routine that modifies the VDP state (which is kind of the point of an HInt), it is making changes mid-line. There is no such thing as "between" lines. With a VDP that's scanline driven, you have to choose a single fixed point in time to commit every change to the VDP. This is fundamentally inaccurate. I'm sure picking the timing between when to generate HInt and when to commit all pending VDP changes for the next line, while obtaining results which are correct for every Mega Drive game, is a very difficult task.
Snake wrote:
> Yep, I've thought about that once or twice, but I never got around to doing some decent tests to make sure I'd got all situations covered.

The rules are pretty simple. A single write to CRAM/VRAM/VSRAM can occur on each "pixel" the VDP renders (ignoring access limitations while drawing). Only writes to CRAM cause any visual artifacts. When a write to CRAM is committed, the pixel that is being drawn at that location will use the colour value that was just written to the CRAM. One write to CRAM always causes one pixel to be altered, even while rendering the borders and overscan regions. Only writes during blanking don't cause this to occur, obviously.
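A sketch of that rule as I read it (a toy model of my own, not Nemesis's renderer): a CRAM write committed on a given pixel both updates the palette and leaks the freshly written colour into that one pixel - the "dot".

```python
def render_line(cram, writes, width=8):
    """Toy scanline renderer. Every pixel shows cram[0], except a pixel
    where a CRAM write lands: that pixel shows the just-written colour."""
    line = []
    for x in range(width):
        if x in writes:
            index, colour = writes[x]
            cram[index] = colour   # the write is committed on this pixel...
            line.append(colour)    # ...and this pixel shows the written value
        else:
            line.append(cram[0])
    return line

cram = [0x000] * 64
line = render_line(cram, {3: (5, 0xEEE)})
print(line)      # one bright "dot" at x == 3, background everywhere else
print(cram[5])   # the palette entry was updated as well
```

One write, one altered pixel - matching the "one write to CRAM always causes one pixel to be altered" rule above.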
Snake wrote:
> Me too. My PSG also runs at the exact hardware frequency of something-stupid-that-I-don't-recall.

Yep, same here, with sample-accurate noise emulation and all that stuff. I still need to add FIR filtering to simulate the audio output circuitry of the Mega Drive, though. It has a major effect on the square wave output of the PSG, especially for the longer tone cycles.
Snake wrote:
> Yes, but I don't think you're really grasping how much more difficult this gets when you have two fast CPUs mapped to the exact same memory. [...] And it's important: games will lock up if you get it wrong. (NB: *MOST* of the games do this.)
>
> And again, you can only afford to add a few instructions before you're running slower than a single thread.

Detecting and dealing with collisions isn't a problem. Honestly, those issues are simpler than you think. As for performance, yes, there's overhead to all of this, of course.
Snake wrote:
> Nemesis wrote:
> > Every current Mega Drive emulator, including yours if I'm not mistaken, approximates the timing.
>
> No, it's very accurate. Although the current release version does have some test code that I left in which breaks it a little (which is why things that used to work got broken at some point). This was all fixed a long time ago, and I really do need to get a new build out.
Snake wrote:
> Nemesis wrote:
> > When drx released a whole batch of new prototypes recently, including a bunch of 32x prototypes, a significant number of them didn't work in the emulators available at the time. Apparently, there were a lot of hard-coded timing fixes for specific games to get them running.
>
> Actually, it was only the Chaotix protos that had a problem. There are no hard-coded timing 'fixes'; rather, in order to attempt to speed things up (I was still supporting people with 500MHz CPUs at the time), I lowered the timing requirements of the handful of games that didn't need quite as heavy lockstepping. The 'fixed' version (which I provided) should work just fine with any ROM you throw at it, and I now have an option to just ignore this table anyway. There are no per-game 'fixes' of any kind, but there are certain games that need to be detected in order to enable something they use (such as EEPROMs).
Snake wrote:
> As far as the MegaDrive goes, everything is already cycle accurate, with the exception of the VDP stuff mentioned above. Given that it already runs at about 20 times real speed on my 1.7GHz system, I don't see any problem.

Well, it sounds like you've achieved a higher level of timing accuracy than I thought. Your emulator must be insanely efficient in order to run at the speed it does with that level of accuracy.
byuu wrote:
> But essentially, it's the same problem: how can your 68k core know the state of the VDP's interrupt line if it isn't caught up? The only answers seem to be lock-step (way too slow), prediction (messy, a departure from a strict hardware model) and rewind (very complex). All the options suck.

Well, right now I force the system to be synchronized before each VDP interrupt is generated. I'm currently rewriting my VDP core though, and in the redesigned core I plan to use prediction, of a sort. Prediction isn't any less accurate than synchronization, as long as the prediction is correct. I can always rely on a rollback if a truly unusual circumstance breaks a generally safe prediction, however (eg, if VInt is enabled at the start of the frame, we predict it will still be enabled at the end of the frame).
byuu wrote:
> I really wanted to avoid needing a look-ahead calculation to determine the pin states. Without a rewind mechanism, my entire emulator is 100% lock-step, with no prediction, no timestamps ... it just does exactly what the hardware (probably) would. And it has a single integer to represent a bi-directional cycle counter to tell which of two chips is currently "ahead" of the other.

I had a long hard think about what kind of behaviour I would allow and what behaviour I would not, and this form of prediction was one of the points I debated. Ultimately, I came to the conclusion that there's no problem, as long as the outcome remains deterministic and accurate. This prediction method is just an optimization to lower the synchronization requirements. One of the interesting things about a multi-threaded design is that the efficiency of each individual emulation core becomes less important for performance. What becomes critically important for performance, however, is the amount of time the cores can run unsynchronized.
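The VInt example could be sketched like this (hypothetical names and a toy model; real rollback machinery is far more involved): commit the cheap prediction up front, and only fall back to a rollback when a mid-frame register write invalidates it.

```python
class VIntPredictor:
    """Toy model: predict the VInt-enable flag keeps its frame-start
    value; any mid-frame write that changes it forces a rollback."""
    def __init__(self, enabled_at_frame_start):
        self.predicted = enabled_at_frame_start
        self.rollbacks = 0

    def write_mode_register(self, value, vint_bit=0x20):
        enabled = bool(value & vint_bit)
        if enabled != self.predicted:
            self.rollbacks += 1       # prediction broken: re-run from a snapshot
            self.predicted = enabled  # and adopt the new state going forward

p = VIntPredictor(enabled_at_frame_start=True)
p.write_mode_register(0x74)   # bit 0x20 still set: prediction holds
p.write_mode_register(0x64)   # still set, still fine
p.write_mode_register(0x40)   # VInt disabled mid-frame: rollback needed
print(p.rollbacks)            # 1
```

Since almost no game toggles VInt mid-frame, the rollback path is nearly never taken, and the common case runs with no synchronization at all.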
byuu wrote:
> Breathtaking. Here I thought only the NES crowd had just barely managed to pull off a cycle-level video processor.

Well, the Mega Drive VDP mostly runs in a world of its own, so I only have to get the timing cycle-accurate on its external interactions and on when external changes are applied. I don't have to emulate the actual read/write cycles between the VDP and its private VRAM, for example. Since VRAM write access is limited to specific "slots" during rendering, I only have to ensure any "buffered" VRAM data is correct at each slot. It sounds like a much harder task to achieve this on the SNES. Everything sounds a little more inter-dependent.
Once again, SNES emulation falls painfully far behind everyone else.