The "dependency" rule is a very interesting idea, and it seems you can always avoid a rollback if you check, before executing the next Z80 instruction, that it won't take the Z80 past the 68k's current point in time. About the rollback, how do you decide when to create it? I mean, a rollback operation is simply a restore point for the emu, right? So you must be able to create it such that it's least damaging when triggered. "Least damaging" means it should be as close as possible to the point where the rollback was required, so we have to re-execute as little as possible. From what I can tell, that will require restore points to be created very often, which will be expensive. And then I'm pretty sure a rollback has to be emulator-wide, meaning it restores the entire emulator to that state rather than just the CPU states, which will be even more expensive. Have you thought about that yet? I think rollbacks may be required a lot in the VDP.
What the emulator essentially does is allocate a timeslice to each core. Once each core has reached the end of that timeslice, all the cores are in sync and idle, and the emulator is able to perform operations like taking savestates, pausing emulation, and either committing or rolling back the timeslice that was just executed. Until a commit occurs, every change that has been made to any device in the system can be rolled back. For most devices, this just means taking a copy of the registers each time a commit occurs. If we trigger a rollback, just restore the state of the registers from the copy. It's a little more complex for memory buffers, but I've come up with some efficient containers which are optimised for this kind of use.
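A rough sketch of that register-snapshot idea, in hypothetical C++ (none of these names come from the actual emulator):

```cpp
// Minimal sketch of a rollback-aware register, assuming the
// commit/rollback scheme described above; all names are hypothetical.
// Each device keeps a "working" copy it mutates freely during a
// timeslice, and a "committed" copy used as the restore point.
template<typename T>
class RollbackValue {
public:
    explicit RollbackValue(T initial) : current(initial), committed(initial) {}

    // Normal access during a timeslice touches only the working copy.
    T& operator*() { return current; }

    // End of a clean timeslice: the working copy becomes the restore point.
    void Commit() { committed = current; }

    // Rollback: discard everything done since the last commit.
    void Rollback() { current = committed; }

private:
    T current;    // state modified during the current timeslice
    T committed;  // state at the last committed timeslice boundary
};
```

The memory-buffer containers mentioned above would need something smarter than a full copy, but the contract is the same: mutate freely, then commit or discard as a unit.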
The "length" of the timeslice blocks that are allocated determines how costly a rollback operation is. A large number of small timeslice blocks reduces the impact of each rollback, but makes the system run slower when no rollbacks are being generated. Currently, I have a relatively long maximum timeslice length of 20 milliseconds. The intention is to build in heuristics to calculate the "optimal" maximum timeslice length, based on the collisions currently occurring within the system. I'm also probably going to build in a way to immediately abort the current timeslice once a rollback is triggered (currently, each device must complete the timeslice before it's rolled back).
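To make the timeslice mechanics concrete, here's an illustrative loop under my own naming assumptions; the real scheduler is certainly more involved:

```cpp
#include <vector>

// Illustrative sketch only (not the emulator's actual scheduler):
// every device runs to the end of a shared timeslice, then the whole
// slice is either committed or discarded as a unit.
struct Device {
    virtual void ExecuteTimeslice(double lengthMs) = 0;
    virtual bool RollbackRequired() const = 0;
    virtual void Commit() = 0;
    virtual void Rollback() = 0;
    virtual ~Device() = default;
};

void RunTimeslice(std::vector<Device*>& devices, double lengthMs) {
    for (Device* d : devices) d->ExecuteTimeslice(lengthMs);
    // All devices are now idle at the same point in emulated time,
    // so it's safe to take savestates, pause, or roll back.
    bool rollback = false;
    for (Device* d : devices) rollback = rollback || d->RollbackRequired();
    for (Device* d : devices) {
        if (rollback) d->Rollback();
        else d->Commit();
    }
}

// Toy device for demonstration: advances a counter each timeslice.
struct Counter : Device {
    int working = 0, committed = 0;
    bool flag = false;  // set to force a rollback of the current slice
    void ExecuteTimeslice(double) override { ++working; }
    bool RollbackRequired() const override { return flag; }
    void Commit() override { committed = working; }
    void Rollback() override { working = committed; }
};
```

The cost tradeoff falls out of `lengthMs`: a longer slice means fewer commit sweeps per emulated second, but more work thrown away whenever the rollback branch is taken.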
Most of my effort so far, however, has gone into avoiding the need for rollbacks in the first place. A rollback is always going to be the most expensive operation. I've concentrated on optimising the "commit" process, which occurs far more often than a rollback, and on finding efficient ways of preventing rollbacks from occurring at all. Currently, virtually no rollbacks occur when my emulator is running the basic Mega Drive system. If horizontal interrupts are enabled mid-frame, that would usually generate a rollback. If both the M68000 and Z80 were to attempt access to a device like the VDP, PSG, or YM2612 at virtually the same time, there's a chance that could generate a rollback. I can't think of any other cases right now, and I virtually never see a rollback occur anymore in most of the games I've run.
My general philosophy is that code which is not heavily timing dependent should run as fast as possible. If code is heavily dependent on timing, it may run slowly, but it must always be accurate to the cycle.
That was the point this topic originally started on: how to get accurate 32x and SegaCD emulation running fast enough without requiring exotic quad cores? Dual core is fine since it is now so widely used. I can get 32x and SegaCD emulation running quite accurately, but what's the use if I can only get 5-15 FPS? I want a way to get the same accuracy at 60 FPS without requiring the user to upgrade their hardware. I know it's possible. There is always a way.
My thinking is that by the time you figure it out, everyone will have upgraded anyway.
The easiest way to do it is to find a way to optimise your cores. Run a profiler. Identify your bottlenecks, and target them. Look at the areas of code that get called the most. Small optimisations there will give you big performance increases. Use inline assembly if you can see a way to beat the compiler. That's how to make a single-threaded emulator faster. Apart from that, as has already been discussed, you could get a major speed boost by moving your VDP and YM2612 in particular into separate threads. How easy that is to do depends a lot on how you've designed your emulator.
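As one hedged illustration of the threading suggestion, a device core can be moved behind a queue serviced by a worker thread; everything below is a sketch with invented names, not any emulator's actual code:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical sketch of moving a device core (e.g. the VDP) into its
// own worker thread: the CPU thread queues accesses instead of
// executing them inline, and the worker drains the queue.
class ThreadedDevice {
public:
    explicit ThreadedDevice(int& processedCount)
        : processed(processedCount), worker(&ThreadedDevice::Run, this) {}

    ~ThreadedDevice() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
        worker.join();  // worker drains any remaining accesses first
    }

    // Called from the CPU thread: record the access and return at once.
    void QueueAccess(int value) {
        { std::lock_guard<std::mutex> lock(m); pending.push(value); }
        cv.notify_one();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return done || !pending.empty(); });
            while (!pending.empty()) { pending.pop(); ++processed; }
            if (done) return;
        }
    }

    std::mutex m;
    std::condition_variable cv;
    std::queue<int> pending;
    int& processed;      // externally visible count of handled accesses
    bool done = false;
    std::thread worker;  // declared last so all state exists before it starts
};
```

How well this pays off depends on how often the CPU thread needs a result back from the device; fire-and-forget writes thread cheaply, reads that must return a value do not.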
But just using all the cores isn't the same as using them efficiently. From what I understand, Snake is saying to multithread the VDP and sound chips and keep the CPUs running on a single core, which is simple and fast, and frees up the other cores for the user who is playing After Burner 32x while downloading stuff off the internet and simultaneously running a slow-ass AVG scan in the background, not to mention the countless other programs and services he will be running. Worse still, he is running... Vista.
I'm trying to build a system which can scale to large numbers of processors, so I'm going to attempt threading them regardless. Snake may be right, and threading the CPUs on the 32x or SegaCD may not be worth it. If that turns out to be true, I'll fold the problem devices into a single thread. I agree there's no point maxing out two cores instead of one when you're getting no benefit from it. For other systems, I'll still have the benefit of threaded CPU emulation, and I know that the base Mega Drive, for one, runs with it quite happily.
As for leaving cores available for other tasks, if someone wants to encode a video in the background while running my emulator, they can go right ahead. Just don't complain if it runs a little slow.

With my emulator being built as a debugger, I don't feel the same pressing need to get every game running full speed with the lowest possible system requirements. It's designed for development, testing, analysis, and reverse engineering. If people don't like the speed, there are plenty of other emulators I can point them to, like Kega or Regen.
I kind of suspect my emulator is one people will thank me for in 10 years' time, and bitch about until then. I really don't care that much about performance though. After all, emulation is supposed to be about preservation; having the fastest emulator is only meaningful for a few years. I remember when people used to complain that Gens was too slow. They'd rather use Genecyst instead. I was one of them. Mind you, the realtime palette and VRAM windows had a little something to do with that.
Your emu is great, but it's slow. Then again, I don't think it's slow because of multithreading (okay, maybe a little), but rather because of the huge number of debugging features it provides. Do you currently have a way to disable all the debugging hooks and run your emulator as fast as it can?
My emulator is still slow, but a lot has changed since the last build I released. The main performance issues in the build you have are related to the large amount of memory allocation and deallocation during execution. This problem has since been eliminated. Most games now run at full speed on my 2.2GHz Core 2 Duo laptop.
Apart from the memory allocation, my emulator is slow because every core uses very high-level code (no ASM), and even for a high-level core, I've avoided pre-calculated lookup tables and excessive optimisation, to preserve the readability and transparency of the code itself. I consider my emulator cores a form of documentation on these devices, so it's important to me that I comment my cores really well, explain why I implement particular functions in certain ways, and most of all, that other people can look at my code and understand how the device works. In other words, speed hasn't been the primary concern when writing my cores. It's been a factor, but not the main one. I'm currently in the process of rewriting my VDP core, however, which is the slowest core at this stage. I know I can significantly improve the rendering process, which I hope will provide a healthy speed boost.
Apart from that, there's also my bus implementation. One thing that's important to understand about my emulator is that it's generic. It's not a Mega Drive emulator. I have an XML file which tells my emulator how to build a Mega Drive. I could add a few lines of XML and tell it I want a second Z80, or an alternate 64KB of RAM which is only visible when the M68000 is in user mode, for example. The XML file tells my emulator which devices are included in a system, and describes the physical connections between those devices.
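Purely as an illustration of the idea (the real file format will differ, and every element and attribute name here is invented), such a system definition might look like:

```xml
<!-- Hypothetical sketch of a system definition file; the actual
     format used by the emulator is certainly different. -->
<System name="MegaDrive">
  <Device name="M68000" />
  <Device name="Z80" />
  <Device name="VDP" />
  <Device name="MainRAM" size="0x10000" />
  <!-- Describe how a device is wired onto the M68000's bus -->
  <BusMapping device="MainRAM" cpu="M68000"
              addressBase="0xFF0000" addressMask="0xFFFF" />
</System>
```

The point is that nothing about the Mega Drive is hard-coded; adding that second Z80 would just be another `<Device>` entry plus its bus mappings.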
The generic nature of this system, and in particular the extremely flexible and open-ended bus system, are both the most powerful part of my emulator, and the slowest. When the M68000 accesses RAM for example, it doesn't just get a direct line to a block of memory. Rather, it has to hand off the access request to a bus object, which then has to determine which device, if any, the access is targeted at, and how the address lines and data lines are mapped to the target device. I build physical lookup tables to accelerate the process, but it's still going to be much slower than other emulators which are hard-coded to emulate a particular system.
Basically, due to the nature of how bus access works in my emulator, it's going to be several orders of magnitude slower. Since the bus gets hit millions of times a second, it adds up to a major performance hit.
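The dispatch path being described could be sketched like this, with invented names and a deliberately simplified 64KB-page table:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the bus-object idea described above: the CPU hands every
// access to a bus, which resolves the target device and address
// mapping through a prebuilt lookup table. Everything here is an
// illustration; the real emulator's bus is far more general.
struct IDevice {
    virtual uint16_t Read(uint32_t localAddress) = 0;
    virtual ~IDevice() = default;
};

struct BusEntry {
    IDevice* target = nullptr;  // device mapped into this region, if any
    uint32_t addressMask = 0;   // how the CPU address maps to the device
};

class Bus {
public:
    Bus() : table(256) {}  // one entry per 64KB page of a 24-bit bus

    void Map(uint32_t page, IDevice* device, uint32_t mask) {
        table[page] = BusEntry{device, mask};
    }

    // Every CPU access pays for this indirection, even a plain RAM read.
    uint16_t Read(uint32_t address) {
        const BusEntry& entry = table[(address >> 16) & 0xFF];
        if (entry.target == nullptr) return 0xFFFF;  // unmapped: open bus
        return entry.target->Read(address & entry.addressMask);
    }

private:
    std::vector<BusEntry> table;  // the "physical lookup table"
};

// Toy device for demonstration: always returns a fixed value.
struct FixedDevice : IDevice {
    uint16_t value = 0;
    uint16_t Read(uint32_t) override { return value; }
};
```

Compare this with a hard-coded emulator, where the same RAM read is a single array index with no virtual call and no table lookup; that gap is where the performance goes.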
Oh, and then there's the way I represent data. My emulator uses virtually no binary masks, shifts, or manual data manipulation of any kind throughout the cores. It seemed to me that decoding, editing, and building the raw data structures used within a core was the most error-prone part of emulation, and the most difficult part of any core to decipher. I've built a generic "Data" structure which I use for registers, bus communication, pretty much everything. This structure has a large number of helper functions to assist with the gritty data manipulation, and it greatly simplifies my code. It has minimal overhead, but it's used everywhere, so that minimal overhead gradually adds up.
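A minimal sketch of what such a "Data" structure might look like (names and helpers are my own guesses, not the emulator's actual interface):

```cpp
#include <cstdint>

// Sketch of a generic "Data" wrapper of the sort described above,
// hiding raw masks and shifts behind named helper functions. The
// actual structure in the emulator is richer; these names are mine.
class Data {
public:
    explicit Data(unsigned bitCount, uint32_t value = 0)
        : bits(bitCount), data(value & Mask()) {}

    uint32_t Mask() const {
        return (bits >= 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    }

    bool GetBit(unsigned n) const { return ((data >> n) & 1u) != 0; }

    void SetBit(unsigned n, bool state) {
        if (state) data |= (1u << n);
        else data &= ~(1u << n);
    }

    // Extract the bit field [low, high] as an unsigned value.
    uint32_t GetDataSegment(unsigned low, unsigned high) const {
        return (data >> low) & ((1u << (high - low + 1u)) - 1u);
    }

    uint32_t GetData() const { return data; }

private:
    unsigned bits;  // register width in bits
    uint32_t data;
};
```

With something like this, a raw check such as `(status >> 6) & 1` reads as `status.GetBit(6)` instead, which is exactly the readability-for-overhead tradeoff being described.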
And then there are the debug hooks, as you mentioned. I've made an effort to reduce the overhead of the debug checks in the CPU cores in particular, so I think I've now minimised the impact they have on the emulator. Ultimately, I don't think removing them entirely would make all that much of a difference anymore. The biggest bottleneck is undoubtedly the bus, and I've already heavily optimised it.