SegaCD and 32x

AamirM's Regen forum

Moderator: AamirM

AamirM
Very interested
Posts: 472
Joined: Mon Feb 18, 2008 8:23 am
Contact:

SegaCD and 32x

Post by AamirM » Fri Nov 14, 2008 8:53 am

Hi,

I did some initial work on 32x and SegaCD some time ago. It didn't work out very well :( . CPU synchronization, among other things, was to blame. It turned out to be very slow and inaccurate. I have the following possible solutions:

1) Create two separate instances of CPUs rather than doing expensive context switching. This is the model used by Gens. I don't know if it creates any worthwhile performance gain.

2) Use assembly. Can it be that C is too slow for this? Remember I am using everything in C (68K as well as SH2). My own 68K CPU emulator in assembly is already 30-50% faster than the one I am using right now, but it has some problems that I never fixed. If speed is a problem because of C, that could be the motivation for me to finish it.

3) Run CPU cores in parallel. This can be more accurate and may solve the speed issue as well but will be hard.

4) Write a SH2 dynarec :shock: . I wanna do this ;) .

I hope some of the experienced ones like Stef, Steve, Gerrie can guide me here.

stay safe,

AamirM

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Re: SegaCD and 32x

Post by Snake » Fri Nov 14, 2008 6:42 pm

AamirM wrote:2) Use assembly. Can it be that C is too slow for this?
I don't think so. Well, I guess it depends on what CPU you're trying to run it on. On anything less than five years old, C should be fast enough. Sure you can do 68K faster but SH2 doesn't even have any flags, so most of the benefits of going ASM disappear anyway. I think, where you could get a VERY big speed improvement from ASM, would be your VDP core.
AamirM wrote:3) Run CPU cores in parallel. This can be more accurate and may solve the speed issue as well but will be hard.
On the contrary. This is never going to work. The way pretty much all of the 32X games are written requires you to sync CPUs much tighter than you're going to be able to get this way. Also - pretty sure it wouldn't actually be faster.
AamirM wrote:4) Write a SH2 dynarec :shock: . I wanna do this ;) .
Again, not going to work, for the same reasons as above. Dynarec is great if you only have one CPU...

I originally wrote a DualCore emulator: all registers, everything, was duplicated in the main context, and the code was designed to run one instruction from two CPUs at once. It seemed like a fantastic way to obtain both accuracy and speed. But the fact that I dumped it before even releasing a build probably tells you it wasn't ;)

[EDIT] forgot to answer this:
AamirM wrote:1) Create two separate instances of CPUs rather than doing expensive context switching. This is the model used by Gens. I don't know if it creates any worthwhile performance gain.
Well, I don't have two instances, and it doesn't cause me any problems. In fact it's probably a lot better for the cache that you *don't* do it this way.

AamirM
Very interested
Posts: 472
Joined: Mon Feb 18, 2008 8:23 am
Contact:

Post by AamirM » Sat Nov 15, 2008 7:26 am

Hi,
Snake wrote: I think, where you could get a VERY big speed improvement from ASM, would be your VDP core.
I had been thinking about that but never really gave it a try because I know it requires quite a lot of time to get right. I will try to write the sprite rendering in ASM, which takes the most time in the VDP.
Snake wrote: The way pretty much all of the 32X games are written requires you to sync CPUs much tighter than you're going to be able to get this way.
Snake wrote:Again, not going to work, for the same reasons as above. Dynarec is great if you only have one CPU...
And SegaCD games are even less forgiving. So these options are ruled out.
Snake wrote: Well, I don't have two instances, and it doesn't cause me any problems. In fact its probably a lot better for the cache that you *don't* do it this way.
Ah... I hadn't thought about that. But with caches growing these days, you never know ;) .

So you aren't doing *anything* special to speed things up? That's pretty damn awesome considering the speed Kega gets for 32x and SegaCD.

stay safe,

AamirM

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres
Contact:

Post by Stef » Sat Nov 15, 2008 10:57 am

In Gens I created 2 specific cores for the 68000 because they both had specific parts of code... it was a pretty awful way to do these things. Starscream uses a slow context switch method and you should avoid that solution in your case.
Just use a context pointer structure for your SH2 and 68000 cores so the context switch becomes fast.
Try to minimize and optimize the "initialisation" and "finalisation" code in the Core_Execute(context *cpu) function, then the switch will be really fast.
In Gens the SH2 core eats an important part of CPU time, more than the VDP... You can do a fast SH2 core in C but ASM can help a bit here imo.
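The context-pointer approach Stef describes can be sketched roughly like this (the `CPUContext` layout and the placeholder execute loop are illustrative, not Gens's or Starscream's actual code):

```c
#include <stdint.h>

/* Hypothetical per-CPU context: all state lives behind one pointer,
 * so "switching" CPUs is just passing a different pointer -- no bulk
 * copy of registers in and out of globals like Starscream does. */
typedef struct {
    uint32_t r[16];       /* general registers */
    uint32_t pc;
    uint32_t sr;
    int32_t  cycles_left; /* budget for this slice */
} CPUContext;

/* Execute until the cycle budget is exhausted. Keeping the entry/exit
 * ("initialisation"/"finalisation") code this thin is what makes
 * alternating between cores cheap. */
static int32_t Core_Execute(CPUContext *cpu, int32_t cycles)
{
    cpu->cycles_left = cycles;
    while (cpu->cycles_left > 0) {
        /* fetch/decode/execute one opcode here; as a placeholder,
         * pretend every opcode is 2 bytes and takes 1 cycle: */
        cpu->pc += 2;
        cpu->cycles_left -= 1;
    }
    return cycles - cpu->cycles_left; /* cycles actually run */
}
```

With one `CPUContext` per 68000 and per SH2, the scheduler just calls `Core_Execute` with whichever context is due next.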

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres
Contact:

Re: SegaCD and 32x

Post by Stef » Sat Nov 15, 2008 10:58 am

Snake wrote: I originally write a DualCore emulator, all registers, everything, was duplicated in the main context, and the code was designed to run one instruction from two CPUs at once. It seemed like a fantastic way to obtain both accuracy and speed. But the fact that I dumped it before even releasing a build probably tells you it wasn't ;)
I remember Gerrie had the same idea, and dumped it as well I believe ;)

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Mon Nov 17, 2008 9:21 am

Given the prevalence of multi-core CPUs now, how do you think a one-emulated-CPU-per-core emu would fare? Of the three main computers in my house, two are dual-core.

AamirM
Very interested
Posts: 472
Joined: Mon Feb 18, 2008 8:23 am
Contact:

Post by AamirM » Mon Nov 17, 2008 2:46 pm

Hi,
Chilly Willy wrote: Given the prevalence of multi-core CPUs now, how do you think a one emulated cpu per core emu would fair? Of the three main computers in my house, two are dual-core.
If the CPUs are tightly synced, as in the case of SegaCD and 32x, it's going to be hard and slow, since emulating SegaCD and 32x works most accurately if you run the CPUs cycle by cycle (at least that's what my experience says so far). This means you are going to have to switch between the CPU threads quite a lot, requiring lots of thread context switches (not to be confused with the emulated CPU context), which can be (and on x86, I think is) heavier. Then, to solve the sync issue, you are going to have to use cooperative multithreading rather than preemptive, so that we can control how much each CPU runs.

In the case of the Megadrive, with just the 68k and Z80, it could work, I think.
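For reference, the single-threaded cycle-by-cycle interleaving being described looks roughly like this (the step functions are placeholders for real per-cycle execution):

```c
#include <stdint.h>

/* Hypothetical cycle counters for two tightly coupled CPUs,
 * e.g. the two SH2s of the 32X. */
static uint64_t msh2_cycles, ssh2_cycles;

static void msh2_step(void) { msh2_cycles += 1; } /* placeholder */
static void ssh2_step(void) { ssh2_cycles += 1; } /* placeholder */

/* Interleave the two CPUs so neither ever runs ahead of the other
 * by more than one cycle. Accurate, but every single step alternates
 * between cores -- which is exactly why doing the same thing with OS
 * threads would drown in thread context switches. */
static void run_slice(uint64_t target)
{
    while (msh2_cycles < target || ssh2_cycles < target) {
        if (msh2_cycles <= ssh2_cycles)
            msh2_step();
        else
            ssh2_step();
    }
}
```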

stay safe,

AamirM

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Tue Nov 18, 2008 8:51 am

I've done a lot of work in this area. As you may have heard, the emulator I'm developing is multithreaded, and runs each CPU in a separate core in parallel. I'm really looking forward to attempting SegaCD and 32x emulation, as my timing model will provide perfect accuracy with no modifications or hard-coding, and due to some recent performance optimizations, there's a good chance it could be done in realtime on a decent quad-core. I'm sure a lot of people would consider that very slow, but perfect timing with this many devices in one system is no easy thing.
AamirM wrote:If the CPUs are tightly synced, as in the case of SegaCD and 32x, it's going to be hard and slow, since emulating SegaCD and 32x works most accurately if you run the CPUs cycle by cycle (at least that's what my experience says so far). This means you are going to have to switch between the CPU threads quite a lot, requiring lots of thread context switches (not to be confused with the emulated CPU context), which can be (and on x86, I think is) heavier. Then, to solve the sync issue, you are going to have to use cooperative multithreading rather than preemptive, so that we can control how much each CPU runs.
I've solved that problem. :D

In my emulator, I have what I call a "dependency" rule between the Z80 and the M68000. The Z80 is dependent on the M68000. In my emulator, this means the Z80 core will never advance beyond the point the M68000 has reached. This isn't the same as lock-step, since if the Z80 thread was switched out for a block of time, it may execute a few hundred opcodes at once for example. As long as the Z80 is behind the M68000, it will continue to execute without interruption. I keep the Z80 behind the M68000 so the Z80 is aware of RESET and BUSREQ events before they occur, so I don't end up with the case of the Z80 having advanced past the M68000, then finding out it should have been paused 30 cycles ago due to a BUSREQ. My emulator can handle that case, but it requires a relatively expensive rollback operation, which dependency rules are designed to predict and avoid. You can also set two cores as dependent on each other, in which case, my emulator will advance both cores side by side in lock-step.
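The dependency rule above can be sketched in a few lines (the timestamp variables and step function are illustrative, not the emulator's actual code):

```c
#include <stdint.h>

/* Hypothetical device timestamps, in some common cycle unit. */
static uint64_t m68k_time, z80_time;

static void z80_step(void) { z80_time += 1; } /* placeholder */

/* One-way dependency: the Z80 may execute freely, even a burst of a
 * few hundred opcodes if its thread was switched out for a while, as
 * long as it stays BEHIND the M68000. That way it always learns about
 * RESET/BUSREQ before reaching the cycle on which they occur, and no
 * rollback is needed. */
static void z80_run(void)
{
    while (z80_time < m68k_time)
        z80_step();
}
```

Note this is weaker than lock-step: the bound is one-sided, so the Z80 thread can catch up in bursts rather than alternating with the M68000 every cycle.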

Now, that's all very interesting, but where does context switching come into it? Here's an important thing to remember: You do not have to use a mutex, or require a context switch, in order to share information between threads. A full mutex is designed for locks which are expected to last for a relatively long period of time (say, a millisecond or two). When a thread hits a mutex which is already locked, it would usually call an OS-level function which forfeits the remainder of the timeslice for that thread, and instructs the OS to resume the thread some time after the lock is released. This is where you get a context switch. In the case of "high performance" thread synchronization, where a lock is extremely transient and expected to last for a very small period of time (say, a couple of nanoseconds), why would you forfeit the timeslice?

You can build low-level synchronization primitives on most platforms based on one or two opcodes. x86 provides opcodes like XADD which are designed specifically for this kind of task. Win32 also provides functions like InterlockedIncrement and InterlockedDecrement which wrap over these assembly primitives. Simply put, you can use these instructions to test whether it is "safe" to advance. If it is not, instead of forfeiting the remainder of the timeslice, you can just spin around in a loop continuously testing the condition until it is met. For long delays, this would be really bad, since the core your thread is executing on is going to sit there at max utilization until the lock is released, but when the overhead of a formal mutex is higher than the loop, this method can greatly improve performance. Significantly, when you hand the remainder of a timeslice back to the OS, the OS may not call your thread again until a significant period of time after the condition has been met. If you fail a lock 1000 times a second for example, and each lock takes an average of 0.5 milliseconds between the condition being met and the OS switching your thread back in, that's potentially an extra 0.5 seconds of OS-level overhead to your thread, each second. When you're sitting looping on the wait condition in your own code however, your thread knows within a few nanoseconds when the condition has been met, and can resume immediately.
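A minimal sketch of that kind of thin lock, using C11 atomics as a portable stand-in for the XADD / InterlockedIncrement primitives mentioned above (the variable and function names are made up for illustration):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared progress counter, published by the other core's thread. */
static _Atomic uint64_t other_core_time;

/* Spin until the other core has reached at least `needed` cycles.
 * No mutex, no OS call, no context switch: just a busy-wait on an
 * atomic load. Only sensible when the expected wait is a handful of
 * nanoseconds, otherwise you burn a whole core doing nothing. */
static void wait_for(uint64_t needed)
{
    while (atomic_load_explicit(&other_core_time,
                                memory_order_acquire) < needed)
        ;  /* spin; a real implementation might add a pause hint */
}

/* The other thread advances the counter with a single atomic add --
 * the C11 equivalent of x86 XADD / Win32 InterlockedExchangeAdd. */
static void publish_progress(uint64_t cycles)
{
    atomic_fetch_add_explicit(&other_core_time, cycles,
                              memory_order_release);
}
```

The acquire/release pairing makes everything the publishing thread wrote before `publish_progress` visible to the waiter once the load observes the new count.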

Back to my emulator, using thin locks, I can guarantee two cores are kept within a cycle of each other, while still potentially running them in parallel on the hardware, and also potentially without generating a single context switch. In reality, the performance degradation for lock-step vs no locks at all seems to be around 40% for most applications in the tests I've done.


The future of emulation has to be multithreading, since that is the future of computers themselves for the foreseeable future. We're hitting the limits of how fast a single core can run with current silicon technology. Single-threaded emulation is fine for older systems, as long as they can run full speed on a single 3GHz core. If they can't, well, it could be a long time before anyone can run them full speed.

I'm really looking forward to C++0x, since the revision will include native synchronization primitives and commands to allow high performance, thin locking mechanisms to be built, without a lot of the potential dangers we have to worry about in C++ currently, like compiler optimization, and cache synchronization. Thin locking mechanisms are essential for the development of complex multithreaded systems like emulators, where there are a lot of interactions occurring, and performance is extremely important.

SmartOne
Very interested
Posts: 77
Joined: Sun Sep 21, 2008 5:18 am

Post by SmartOne » Tue Nov 18, 2008 4:48 pm

Wow. That is all.

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Tue Nov 18, 2008 6:51 pm

This is all basic stuff. Trouble is the CPUs are required to be in lockstep all the time, which makes it much, much faster to do on one CPU.

Sure, multithreading is "the future" but there's other things than CPUs you can stick in other threads.

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Tue Nov 18, 2008 9:52 pm

Snake wrote:This is all basic stuff. Trouble is the CPUs are required to be in lockstep all the time, which makes it much, much faster to do on one CPU.
Not really. If you have two processors running in lock-step under this threading model, and you assume each virtual core has a dedicated physical core, it can be just as fast as the single-threaded implementation (in my model, if two processors are on the same cycle, they'll execute in parallel, so it could even be a little faster). In a relative sense, more processing power is used (two dedicated cores rather than one), but in terms of actual execution time, with the thin locking mechanism, a multithreaded implementation won't be any worse than the single-threaded implementation, provided you have enough physical cores to go around. And remember that lock-step is the worst case scenario. Even when it is required, if you're clever about it, you can begin to parallelize even two heavily dependent cores running in lock-step to a degree. Simply put, the best performance you can achieve with a single-threaded implementation should be the worst-case scenario for a multithreaded implementation. And with multithreading, meanwhile, every device which is NOT running in lock-step is also running freely in parallel. In a completely single-threaded model, your single core still has to time-share with the VDP, YM2612, Z80, etc.
Sure, multithreading is "the future" but there's other things than CPUs you can stick in other threads.
Definitely. State-driven output devices like the VDP, SN76489 and YM2612 should be the first things to be threaded. Processors can greatly benefit as well however, it's just harder to do.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Wed Nov 19, 2008 2:13 am

At the moment, SMP requires all TSCs (for x86/amd64 CPUs) to be as close to the same value as possible (they're within 10 counts on Linux). It seems to me that by using the TSC as the speed control on emulated code, it should be possible to keep emulated CPUs in sync without any interaction at all.

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Wed Nov 19, 2008 9:46 am

In order to use the TSC reliably in that way, you'd have to be able to deal with cases where the counters are not in sync. Eg, what happens with multi-processor systems, or "fake" quad-core systems like the Core 2 Quad, where you've got two cores on the same die? It might be possible to base a system on the counter, but it'd be a lot of work to make it reliable, and avoid a reliance on undefined behaviour.

SmartOne
Very interested
Posts: 77
Joined: Sun Sep 21, 2008 5:18 am

Post by SmartOne » Wed Nov 19, 2008 6:57 pm

Snake wrote:This is all basic stuff.
You're all basic stuff. :o :o :o
Sorry. I should keep my stupid distractions out of this. :wink:

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Wed Nov 19, 2008 11:54 pm

Nemesis wrote:In order to use the TSC reliably in that way, you'd have to be able to deal with cases where the counters are not in sync. Eg, what happens with multi-processor systems, or "fake" quad-core systems like the Core 2 Quad, where you've got two cores on the same die? It might be possible to base a system on the counter, but it'd be a lot of work to make it reliable, and avoid a reliance on undefined behaviour.
True, but the TSC runs at the instruction rate (GHz), so there's plenty of room for scaling down to 32X rates, which would allow for quite a variance in the counters. You could be almost 1000 off and still be okay after scaling. Even Windows is supposedly getting the TSCs within a few hundred counts of each other.
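The scaling argument is easy to put numbers on. A quick sketch, assuming a 3 GHz host TSC and the 32X's ~23 MHz SH2 clock (both figures are assumptions for illustration):

```c
#include <stdint.h>

/* Assumed clocks: a 3 GHz host TSC and the 32X's ~23 MHz SH2. */
#define HOST_HZ 3000000000ULL
#define SH2_HZ  23000000ULL

/* TSC counts that correspond to one emulated SH2 cycle. */
static uint64_t counts_per_sh2_cycle(void)
{
    return HOST_HZ / SH2_HZ;  /* ~130 counts per cycle */
}

/* How many emulated SH2 cycles a given TSC skew between host cores
 * becomes after scaling down to the SH2 clock. */
static uint64_t skew_in_sh2_cycles(uint64_t tsc_skew)
{
    return tsc_skew / counts_per_sh2_cycle();
}
```

With these numbers, each SH2 cycle spans about 130 TSC counts, so a 1000-count skew between cores scales down to roughly 7 emulated cycles; whether that is "okay" depends on how tightly the two SH2s need to agree.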
