Designing a cooperative-threaded scheduler for an emulator

Ask anything your want about Megadrive/Genesis programming.

Moderator: BigEvilCorporation

Post Reply
byuu
Very interested
Posts: 91
Joined: Thu Feb 28, 2008 4:45 pm

Designing a cooperative-threaded scheduler for an emulator

Post by byuu » Wed Jun 12, 2019 8:52 am

(Hope this is okay to post here, also posted to NESdev for more feedback.)

I know I'm a bit of an outlier here with using cooperative-threading to design emulators, and that most will use state machines that run each chip the minimum amount of time, but I ran into a rather interesting challenge with threads, and was curious if anyone here had some input.

The Sega Genesis + Sega CD turns out to be a bit of a nightmare of a design to emulate.

You have two 68Ks, a Z80, a VDP graphics chip, PSG audio, YM2612 audio. And tons of components in the Sega CD that I basically roll into its 68K: a graphics rotozoom ASIC, CD controller, CD driver, CD-DA playback, etc. Then there's controllers that may need threads of their own, cartridges as well, etc.

The way my scheduler works is, every time a thread is running, and it does something that takes time, I increment that thread's current clock cycle. There's some extra math in there to scale the value so that each thread can run at different clock rates.

Many chips share RAM and resources, and so you need to make sure the current thread doesn't perform an operation that could affect the other chips while it's ahead in time. So every time the CPU accesses a PSG register, it has to catch up the Z80 *and* the PSG, since the Z80 can also access the PSG. The same for the YM2612. In fact, the Z80 can access almost the entire bus of the 68K, and vice versa, so it ends up being a lot of synchronization.

The naive approach looks like this:

Code: Select all

CPU::step(uint clocks) {
  clock += clocks;
  //resume(thread) is a thread switch:
  //after resume(cpu) later, CPU code execution continues here:
  if(clock >= apu.clock) resume(apu);
  if(clock >= psg.clock) resume(psg);
  if(clock >= ym2612.clock) resume(ym2612);
}

APU::step(uint clocks) {
  clock += clocks;
  if(clock >= cpu.clock) resume(cpu);
  if(clock >= psg.clock) resume(psg);
  if(clock >= ym2612.clock) resume(ym2612);
}

YM2612::step(uint clocks) {
  clock += clocks;
  if(clock >= cpu.clock) resume(cpu);
  if(clock >= apu.clock) resume(apu);
}
Once every emulated frame, I find the smallest clock value of all the threads, and then subtract that value from every thread's clock value, which prevents the clock counters from ever overflowing.

So notice how the YM2612 can't modify the PSG. As a result, it doesn't need to sync the PSG right now and can happily keep running even if it's ahead of the PSG, so long as it doesn't run so long that the PSG runs out of buffered audio samples.

The design works well, but with one glaring flaw: let's say the APU runs for a few clock cycles, and calls step. It is currently behind the CPU, but it is ahead of the YM2612. So it switches to it. The YM2612 runs for quite a while (my core is sample-based, so it eats 144 clock cycles per sample), and now it's quite a ways ahead of the CPU, so then it switches over to the CPU. BUT!! The CPU is still behind the APU, and now the CPU goes on to perform a write that affects the APU.

The naive solution to that is: every time any thread is about to do something that would influence another processor, we scan through every single thread, and find the thread with the smallest clock value, and if that's not the current thread, switch to it.

Code: Select all

Scheduler::synchronize() {
  uint minimum = ~0;
  Thread* context = nullptr;
  for(auto thread : allThreads) {
    if(thread->clock < minimum) {
      minimum = thread->clock;
      context = minimum;
    }
  }
  //should never happen, don't check in release builds
  assert(context);
  if(context != currentThread) resume(*context);
}
But that means we have to loop over a list of threads, compare their clock values, and keep track of which thread has the smallest value. And when we're the YM2612, that means synchronizing the PSG potentially, even though we usually don't care about that. And synchronizations are *painful*, so this approach ends up costing me about 20% more performance and I absolutely cannot spare a hit that big anymore.

My initial thought was something like a binary min-heap array for the threads, so I can find min(list) in O(1) time, but that doesn't solve the problem with synchronizing threads that aren't important.

I can also unroll the synchronization logic, for instance in the YM2612's case:

Code: Select all

YM2612::step(uint clocks) {
  clock += clocks;
  if(clock >= cpu.clock) {
    if(clock >= apu.clock) {
      resume(apu.clock >= cpu.clock ? apu : cpu);
    } else [
      resume(cpu);
    }
  } else if(clock >= apu.clock) {
    resume(apu);
  }
}
Which is an annoying increase of code that I can hide in a library function, but mostly works ... until you have three threads. Or four. Or in the case of the Genesis CPU, the APU+PSG+YM2612+VDP+Mega CD+Controller 1+Controller 2. Unrolling all of that is ... not for the faint of heart. And not even possible in some instances: if you don't have controller 2 plugged in, there is no controller 2 thread.

So right now, the naive approach seems to be the most likely to work. But as it's way too slow, I was curious if anyone had better ideas?

User avatar
Nemesis
Very interested
Posts: 754
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Re: Designing a cooperative-threaded scheduler for an emulator

Post by Nemesis » Wed Jun 12, 2019 10:58 am

I don't have time to do this justice right now, but I'll get back to you on this soon. My problem here will be reducing my comments from book length. When I designed Exodus, I deliberately took different approaches on a lot of things, with the threading and execution models being no exception. I can share with you a bit of a postmortem on the relevant aspects, what worked and what didn't, and what I plan to change in the future. There are a few things in there that I think will be directly relevant for you. You might also be able to offer me a few tips based on what's worked for you in the past.

One thing to remember is that nobody has really pulled this off yet. As you point out, the Mega Drive + MegaCD is a bit of a monster. Now throw the 32x on top (remember, there are MegaCD32x games). The only emulators that do that right now totally approximate the timing, and/or have game specific hacks. Cycle accurate emulation with this many active devices is hard, as you know. I tried to design Exodus to scale to this system and beyond. Some things worked, others not so much. I'll cover that when I get a chance. I'm away for the week, but I'll find some time to share my thoughts next week.

byuu
Very interested
Posts: 91
Joined: Thu Feb 28, 2008 4:45 pm

Re: Designing a cooperative-threaded scheduler for an emulator

Post by byuu » Wed Jun 12, 2019 3:35 pm

Hey cool, take your time if you're busy. Whenever you're up for it, I'd love to have a more thorough discussion.

I remember comparing my cooprerative approach to your preemptive approach in the past, but comparing the SNES to the Genesis was always rather imprecise anyway. And throwing in the Sega CD (and heaven forbid, the CD-32X) as well, makes it all the more interesting.

I agree this is definitely not a solved problem by any means. Our forebears had the same concerns, _Demo_ with the SNES/SA-1, and Steve Snake with the MD/Mega CD.

I thought of another way to describe the issue my old scheduler was running into more succinctly:

Thread A is about to do something, and needs threads B and C to be caught up before proceeding. A jumps to (not calls, eg this is asynchronous and not synchronous) thread B, because B is behind in time. B runs for a bit, and B needs to do something where thread D is caught up first, so B jumps to D this time. D does something, but only needs A to be caught up, which it is, so it jumps back to A. But the problem is that B only ran for a very short while, and so when D jumps back to A, A resumes execution even though B may not have caught up yet.

Another way to solve this could be to turn this:
* if(threadA.clock() >= threadB.clock()) threadB.resume();
Into this:
* while(threadA.clock() >= threadB.clock()) threadB.resume();

That should all but guarantee us that B is caught up before proceeding.
EDIT: and it does. Fun. Well, at least I ended up doing a lot of code cleanup and boilerplate removal.

User avatar
Miquel
Very interested
Posts: 437
Joined: Sat Jul 30, 2016 12:33 am

Re: Designing a cooperative-threaded scheduler for an emulator

Post by Miquel » Thu Jun 13, 2019 12:49 am

Thanks for bring this subject up, it’s really interesting. My major point is I need more time to think about all this because bring a lot of knowledge altogether. But so far my developing thoughts are:

General theory says you should make a difference between a thread and a workload: threads should be ONLY execution units, completely uncoupled with data, so any thread could be executing any workload indistinctly. A part from watch-dogs threads, you should have as many threads as physical execution units in you system.
byuu wrote:
Wed Jun 12, 2019 8:52 am
Many chips share RAM
CPUs are designed to work alone with and maximize usage of memory (read here ram or cache whatever it corresponds) because if you put more than one cpu on the same bus performance goes down the 50% barrier, so to communicate with other devices all sort of protocols are devised, this gives us the opportunity to use this protocols as a stop and start point for the workloads.

I think it will be also a great exercise to write the single thread version first, or at least a sketch, to not be too much lost on the details off the multithread implementation and to understand exactly what we want to accomplish.
byuu wrote:
Wed Jun 12, 2019 8:52 am
the APU+PSG+YM2612+VDP+Mega CD+Controller 1+Controller 2. Unrolling all of that is ... not for the faint of heart. And not even possible in some instances: if you don't have controller 2 plugged in, there is no controller 2 thread.
A workload for the controllers seems a bit overdone, is there any special reason for that?

And finally this kind of development deserves a rant for current cpu design, and a overview of the future to not be too obsessed with multithread implementation, probably on next post.

byuu
Very interested
Posts: 91
Joined: Thu Feb 28, 2008 4:45 pm

Re: Designing a cooperative-threaded scheduler for an emulator

Post by byuu » Thu Jun 13, 2019 7:03 am

General theory says you should make a difference between a thread and a workload:
I'm really an oddball here. My threads are actual cooperative threads, each with their own stack frames.

But there's no pre-emption. One threads runs until it needs state from another thread that's behind, and then it will jump to the other thread, swapping out the stack frame and all.

The result is an emulator that is *almost* devoid of state machines. I only really draw the line at things like flash chips. In an ideal world with unlimited CPU power, I'd absolutely have those be threaded as well, but ... a core goal is that all my emulators have to run full speed on my box (even if my box is a very expensive one.)

To the real host CPU though, my emulators run on a single actual thread, on a single actual CPU core (well, unless you're FreeBSD and like to constantly migrate threads between CPUs >_>)
A workload for the controllers seems a bit overdone, is there any special reason for that?
So far, I use it for latching the Super Scope and Justifier light guns at the right point in time instead of once per frame, and I also use it for the Genesis Fighting Pad timeout.

I have used it in the past and likely will again for UARTs (eg asynchronous serial bit-banging protocols like modems.)

Ordinary controllers in the current WIP are not threaded.
And finally this kind of development deserves a rant for current cpu design, and a overview of the future to not be too obsessed with multithread implementation, probably on next post.
The latencies of modern CPUs make multi-core useless for low-level emulation. And the hierarchical design (threads inside cores inside contexts inside dies inside chips inside sockets) makes even the traditional infrequent multi-threading (like my SNES parallel scanline renderer) far less efficient than it otherwise could be.

It's really bad. But hey, modern designs at least compile software and transcode video much faster, so ... good enough, right Intel/AMD? >_>

Chilly Willy
Very interested
Posts: 2736
Joined: Fri Aug 17, 2007 9:33 pm

Re: Designing a cooperative-threaded scheduler for an emulator

Post by Chilly Willy » Thu Jun 13, 2019 6:24 pm

The problem with single-threaded emulators is that single thread performance has stagnated over the last decade or so. Other than tiny improvements in IPC efficiency, virtually all work have been towards adding more and more threads in parallel. We've seen a huge shift in game programming in the same decade, from single-threaded games to ones that make full use of all threads. Emulations must do the same to improve accuracy while still performing well on current systems.

byuu
Very interested
Posts: 91
Joined: Thu Feb 28, 2008 4:45 pm

Re: Designing a cooperative-threaded scheduler for an emulator

Post by byuu » Thu Jun 13, 2019 11:53 pm

I agree, but modern processor design is entirely incompatible with synchronizing threads more than a few hundred thousand times a second. That's not even enough for a good Atari 2600 or NES emulator.

For PS2 and beyond, it's the only way to go, of course. Thankfully you don't need those to be cycle-accurate.

User avatar
Huge
Very interested
Posts: 196
Joined: Sat Dec 13, 2008 11:50 pm

Re: Designing a cooperative-threaded scheduler for an emulator

Post by Huge » Fri Jun 14, 2019 3:02 am

I have never done low level programming and I'm completely out of my depth here, but one time I contemplated, just for the sake of argument, how emulating a multiple devices down to a cycle could work... Very possible that I'm completely wrong here, but couldn't you keep an array of variables in a core thread, that would tell which part of the hardware is busy and at what cycle each chip is at, and use that as a basis for timing and cross-chip access?

For example, you have a chip running at 8 MHz, that means it executes 8 million cycles a second, so every time it does something, you increment an "elapsed time" variable for that device by (cycles the operation takes) * (1 / clockspeed). So an operation that takes 16 cycles at 8MHz would increment the elapsed time by 0,000002. If it locks Bus A, you again note that in a "global" variable next to the "elapsed time". Then you proceed to the next device with the lowest "elapsed time" count, and do an operation there. If this next device wants to access Bus A, then we will know that it can't because we can check the variable that tells us whether it is accessed or not, and then we can execute whatever the chip does in this case (like check the bus every cycle or take priority over the other device or whatever).

Of course for this you also need a specific order as to which chip takes priority to which bus / memory, which then gets more complex because there might be exceptions (like, chip C has last access to everything, except memory A, where it takes priority over everything)… and so on.
This of course assumes that you know exactly how many cycles each operation on a chip takes, otherwise you get a huge mess until you at least approximate the proper cycle lengths, which can take months of testing or even trial and error.

Every few seconds of execution, we can compare the device clocks and shave off a set value from each of them, to prevent our time counter to get too large and overflow. Or perhaps whenever we hit the one second mark for elapsed time on one device, we check all devices' clocks to see if they are above 1 second, and if we are, we subtract 1 second from all numbers (that way we only need to check all values as many times per emulated second as we have devices, which is still faster than doing it on every time the time counter advances, and more reliable than doing it at random intervals).

I feel that this would impose a significant amount of wasted CPU time, since every emulated operation hangs on how fast we can access our time counter. At worst we'd be bottlenecked by memory speed, at best, by shared CPU cache speed (L3 is shared on all cores in multicore x86 CPUs, are they not?). So on hindsight, this approach would be more suitable for a single threaded emulator, because there's extremely little that each thread could do, on its own, before the need arises to look into our time counter. Now that I think about it, I suppose this is why spreading workloads over multiple cpu threads is a challenge entirely of its own.


Please consider that I never wrote anything nearly as complex and timing sensitive as an emulator, and this may be decades old information for those of you who did. This was just my own, personal brainstorming without even looking up any reading material on how to implement software emulation, let alone on how to do multi-core scheduling. It's probably really useless to you guys.

byuu
Very interested
Posts: 91
Joined: Thu Feb 28, 2008 4:45 pm

Re: Designing a cooperative-threaded scheduler for an emulator

Post by byuu » Fri Jun 14, 2019 5:43 am

What you're describing is pretty close to what I'm doing, with a few key differences.

Instead of a global array of timestamps, each class that inherits from Thread gets its own timestamp and support functions. There's a scheduler class that automatically keeps track of every thread in existence.

You set a frequency for the thread, and it's a fixed-point version of what you mentioned: I allow up to one second of time to pass before the counters would overflow in a 128-bit integer, and so I scale the chip frequency to 2^127. That gives us timing accuracy that's far beyond the most accurate time keeping clocks in the world (64-bit integers and 2^63 would be significantly less.) Something more precise than attoseconds. If we emulated systems with only one oscillator, this would be entirely unnecessary and 100% timing accuracy would be trivial. But the more oscillators you throw in, the more difficult it gets without doing this scaling.

Every emulated frame, I subtract the smallest timestamp from every thread, so that they never overflow.

My original proposal in this thread was basically to just force-sync whatever is the furthest behind always, but that was incurring a lot of unnecessary speed penalties. About 20-40% depending on the emulator.

The optimization I'm using is: the Genesis YM2612 needs to synchronize to the CPU and APU, but doesn't care if it's ahead of the PSG, VDP, Mega CD, etc since those can't access the YM2612 anyway. By only syncing what we need, we reduce the number of context switches by a lot.

... but it's still really painful. Preemptive thread synchronizations may be worse, but cooperative thread switching still absolutely murders modern CPUs. They *really* don't like their stack frames being changed.

User avatar
Miquel
Very interested
Posts: 437
Joined: Sat Jul 30, 2016 12:33 am

Re: Designing a cooperative-threaded scheduler for an emulator

Post by Miquel » Sun Jun 16, 2019 1:53 am

Ok, knowing that you are aiming for single thread and cycle accuracy my first approach would be:

Code: Select all

void emulator()
{
	int commonCycles = 1;
	for(;;)
	{
		run( commonCycles );
		// This doesn't work, but the idea is to adjust the time slice to
		// speed up/down depending on the cpu power we got
		if( plentyOfTime )
			commonCycles -= 1;
		else if( needToHurry )
			commonCycles += 1;
	}
}

inline void run( int commonCycles )
{
	md68k.run( commonCycles * 8 );
	vpd.run( commonCycles * 12 );
	z80.run( commonCycles * 2 );
	cd68k.run( commonCycles * 14 );
	cdAsic.run( commonCycles * 60 );
	sh2A.run( commonCycles * 80 );
	sh2B.run( commonCycles * 80 );
	// ...
}

inline void z80::run( int cycles )
{
	while( cycles-- )
	{
		switch( z80ReadMem8( this->pc++ ) )
		{
		case 4: // inc b
			this->b++;
			break;
		// ...
		}
		if( somethingPending )
		{
			// Check for interrupts
			// ...
		}
	}
}
About the timing of the hole machine, what I would do is the minimum common multiple of every clock involved as the master clock; and the adjust to every device through division/multiplication. Which what I do here.

Post Reply

Who is online

Users browsing this forum: No registered users and 3 guests