Starting developing of another Emulator

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Wed Apr 02, 2014 5:09 am

r57shell wrote:Exodus is slow because it starts more than 12 threads, each with its own stack and so on.

Don't forget that all threads are managed by the "task manager" in the core of the operating system. Not the CTRL+ALT+DEL "task manager", which is a graphical application; there is another "task manager", a system inside the core of the operating system.

Also, each synchronization takes way too long because it's an API call, which includes argument checking and forming the API call stack for the interrupt handler, because core functions require privileged mode (supervisor mode?). That's why any system API call in Windows is slow by definition. There are some other operating systems that work in other ways, but I'm just giving one example.
Exodus is slow, and threading is part of it, but not the biggest concern. The main killer is actually the cycle-accurate VDP core. The amount of number crunching involved per pixel clock step is insane, and it's got to hammer through over 6.5 million of them per second. The VDP uses more time in its render process than every other device in the system, and the system itself, put together. The next biggest hit behind that is the bus system. Nothing about the interconnections between devices is hardcoded in Exodus, it's all defined in XML, then mapped out through data structures. These structures are heavily optimized, but they get hit millions of times a second, so the overhead they introduce does add up.

Threading is a concern, but where contention is low and lock durations are short, you can often get away with a spin lock using some interlocked test and set machine code operations, with a memory barrier or two where required. With that kind of model, there's no OS calls at all, no context switching, and minimal overhead. There's more that can be done in Exodus to improve the threading, in particular making use of the new C++11 language level threading features, which can theoretically produce more optimized code than the rather heavy boost mutexes I'm using in a lot of places. The currently unreleased version 1.1 of Exodus actually folds active devices that are inherently bound in lock step into a single execution thread, so they actually do execute in a similar fashion to what you propose.
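
To illustrate the spin lock idea above, here is a minimal sketch using the C++11 `std::atomic_flag`. It is not code from Exodus; the `SpinLock` class and the `run_spinlock_demo` helper are hypothetical names invented for this example. `test_and_set` plays the role of the interlocked test and set operation, and the acquire/release orderings stand in for the memory barriers.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical spin lock built on std::atomic_flag (C++11).
class SpinLock
{
public:
    void lock()
    {
        // Atomically set the flag and get its previous value. Acquire
        // ordering keeps later reads/writes from moving above the lock.
        while(flag.test_and_set(std::memory_order_acquire))
        {
            // Busy-wait: no OS call and no context switch requested.
        }
    }

    void unlock()
    {
        // Release ordering publishes the protected writes before unlocking.
        flag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
};

// Illustrative helper: several threads increment one counter under the lock.
unsigned int run_spinlock_demo(unsigned int threadCount, unsigned int incrementsPerThread)
{
    SpinLock lock;
    unsigned int counter = 0;
    std::vector<std::thread> threads;
    for(unsigned int i = 0; i < threadCount; ++i)
    {
        threads.emplace_back([&]()
        {
            for(unsigned int j = 0; j < incrementsPerThread; ++j)
            {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for(auto& t : threads)
    {
        t.join();
    }
    return counter;
}
```

A lock like this only pays off under the conditions described above: low contention and very short lock durations. If a thread can hold the lock for a long time, the waiting threads burn CPU time spinning.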

Mask of Destiny
Very interested
Posts: 616
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Wed Apr 02, 2014 5:20 am

Mask of Destiny wrote:
r57shell wrote:1) Where is your Musashi setup for comparison?
I'm using the version of musashi from Genesis Plus GX. I'll upload the code I'm using to drive it when I get home tonight.
Here it is

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Wed Apr 02, 2014 5:25 am

A few more notes about your execution idea. Not trying to rain on your parade or anything, just share my thoughts after having been through a similar process.

You may find that not using threads actually hurts you more here than using them. The reason for that is caching. When you have a single thread doing something like you propose here:

while(true) 
{ 
    m68k_update(); 
    vdp_update(); 
    z80_update(); 
    ym2612_update(); 
}
Chances are, there's going to be a hell of a lot of code contained in each of those update functions, much more than can be fit in your processor cache. You're going to get a lot of cache misses on each iteration of this function as a result. If you instead had each update process executing on its own core, each core with at the very least its own L1 cache, you have a better chance of getting good cache utilization from the update process.

The second thought I have is about the nature of how these update functions will be implemented. Consider that not every device in a system necessarily shares the same clock, and the clock cycles might not even be related in any way. In the Mega Drive, for example, each of those devices you've listed has a totally different idea of what a "cycle", or "single step", is. Now, in the case of the Mega Drive, you can derive each individual device clock by dividing from the main system clock, sure. That system clock runs at around 53MHz, so that would mean your main update loop needs to run around 53 million times a second. What about your lower level update functions then, like the Z80, which only runs at around 3.5MHz? Does its update method look something like this?

void z80_update()
{ 
    static unsigned int accumulatedCycles = 0;
    static const unsigned int systemClockCyclesPerZ80Cycle = 15;
    ++accumulatedCycles;
    if(accumulatedCycles >= systemClockCyclesPerZ80Cycle)
    {
        z80_update_inner();
        accumulatedCycles = 0;
    }
}
If so, that's going to absolutely kill performance. You're actually going to hit this method 14 times more than you need to just to spin some internal counter up to a target value, right in the middle of your innermost, most performance critical loop in your emulator.

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia

Post by r57shell » Wed Apr 02, 2014 12:29 pm

Nemesis wrote:Chances are, there's going to be a hell of a lot of code contained in each of those update functions, much more than can be fit in your processor cache.
I thought about the processor cache, and that's why I implemented a "tiny" version of ori. It has very little code, and you know what? I don't see any benefit from the processor cache. Maybe that's because in both cases all the code of my M68k stays in cache... I'll check it again later.

Don't forget about all the other threads in your system: they use the same processor caches (different processors, different caches). There is no separate processor cache for each thread.

My processor has a 3 MB cache. I don't think it helps much for one application (my M68K test program) while 116 processes are running in the system. If I run Process Explorer, turn on the thread count column, and sort the list by threads, the process in the middle of the list has 10 threads. So approximately 1160 threads are running, with 2.4 GB of RAM in use.

Sure, you could develop an operating system just to run a Genesis emulator and turn off everything that's not needed. Then you would achieve very good performance, if you have only one process: the Genesis emulator :).

This code:

while(true) 
{ 
    m68k_update(); 
    vdp_update(); 
    z80_update(); 
    ym2612_update(); 
}
Actually, that is only the idea. The implementation is a bit cleverer. I already use a different technique.
Nemesis wrote:That system clock runs at around 53MHz, so that would mean your main update loop needs to run around 53 million times a second. What about your lower level update functions then, like the Z80, which only runs at around 3.5MHz?
I'll skip any "empty" cycle.
Nemesis wrote:If so, that's going to absolutely kill performance.
Yes, but it's not "so".
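
For illustration only, here is one way that skipping "empty" cycles could look. This is a sketch under assumptions, not the code from this emulator: the `Device` struct, its fields, and `run_until` are hypothetical names. Each device records the master-clock time at which it is next due, and the loop jumps straight to the earliest due time instead of ticking once per master clock.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical device record; all names here are illustrative.
struct Device
{
    std::uint64_t nextDue;  // master-clock time of this device's next step
    std::uint32_t divider;  // master clocks per device cycle (15 for the Z80 example above)
    std::uint64_t steps;    // number of real steps executed so far
};

// Run the devices until the master clock reaches endTime.
// Requires count >= 1. Returns the number of loop iterations used.
std::uint64_t run_until(Device* devices, int count, std::uint64_t endTime)
{
    std::uint64_t iterations = 0;
    for(;;)
    {
        // Find the earliest time at which any device has work to do.
        std::uint64_t now = devices[0].nextDue;
        for(int i = 1; i < count; ++i)
        {
            now = std::min(now, devices[i].nextDue);
        }
        if(now >= endTime)
        {
            break;
        }

        // Step every device that is due at exactly that time.
        for(int i = 0; i < count; ++i)
        {
            if(devices[i].nextDue == now)
            {
                ++devices[i].steps;  // stand-in for the real device update
                devices[i].nextDue += devices[i].divider;
            }
        }
        ++iterations;
    }
    return iterations;
}
```

With two devices using dividers 7 and 15, running for 105 master clocks takes 21 loop iterations rather than 105, and no device is ever called just to bump a counter.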

Meanwhile, I'm writing an M68k code generator, because I realized that I need flexible (agile) code. I thought about it a lot. Why do I need it? It's very ugly to implement andi by copy-pasting ori with small differences. And defines are not enough. Even templates (which are more flexible) are not enough :(. Also, I don't trust compiler optimizations. I'm trying to force some code unrolling. If this is not enough, I'll use this or this or something with the same idea.
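
For readers unfamiliar with the template approach mentioned above (the one described as not flexible enough on its own), a minimal sketch might look like the following. It is not the poster's code: `Flags`, `OpOr`, `OpAnd` and `logic_imm_long` are hypothetical names. The point is only that ori.l and andi.l can share one body, with the operation supplied as a template parameter.

```cpp
#include <cstdint>

// Hypothetical condition-code holder. X is omitted because the
// logical immediate instructions leave it unchanged.
struct Flags
{
    bool n;
    bool z;
    bool v;
    bool c;
};

// Policy types supplying the one operation that differs.
struct OpOr
{
    static std::uint32_t apply(std::uint32_t a, std::uint32_t b) { return a | b; }
};

struct OpAnd
{
    static std::uint32_t apply(std::uint32_t a, std::uint32_t b) { return a & b; }
};

// One body shared by the long-sized ori and andi forms.
template<class Op>
std::uint32_t logic_imm_long(std::uint32_t dest, std::uint32_t imm, Flags& f)
{
    std::uint32_t result = Op::apply(dest, imm);
    f.n = (result & 0x80000000u) != 0;  // N: most significant bit of the result
    f.z = (result == 0);                // Z: result is zero
    f.v = false;                        // V: always cleared
    f.c = false;                        // C: always cleared
    return result;
}
```

This covers only the data operation and flags. Addressing modes, operand sizes and timing multiply the number of variants, which is where a dedicated code generator can become more attractive than templates.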

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia

Post by r57shell » Fri Apr 04, 2014 7:38 pm

Mask of Destiny wrote:
Mask of Destiny wrote:
r57shell wrote:1) Where is your Musashi setup for comparison?
I'm using the version of musashi from Genesis Plus GX. I'll upload the code I'm using to drive it when I get home tonight.
Here it is
I'm using the original Musashi with all options set to OPT_OFF except the reset handler. And I have done all of move, movea, move <ea>, sr and move <ea>, ccr.

Great test tool!
I found a READ_8 bug with it. But so far, that's the only bug I've found with it. Do you test odd addresses with byte operations?

At this time I have an m68k generator which generates 8 MB of code.
The Microsoft C/C++ compiler says:
genstation\core\m68k_code.c(238682) : fatal error C1128: number of sections exceeded object file format limit : compile with /bigobj
:roll:
but with /bigobj it compiles fine.

Hmm:

start:
	move #0, CCR
	move.l #1521579067, (16719515).l
	ori.l #-1272751427, (16719515).l
	move.l (16719515).l, d0
	move SR, d1
	move #$1F, CCR
	move.l #1737861168, (16719515).l
	ori.l #-1272751427, (16719515).l
	move.l (16719515).l, d2
	reset
My core will throw an access exception (address error). What should I do with it?

Mask of Destiny
Very interested
Posts: 616
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Mon Apr 07, 2014 4:23 am

r57shell wrote:My core will throw an access exception (address error). What should I do with it?
I haven't implemented address exceptions yet and I'm guessing the version of Musashi I'm using doesn't either (or at least not with the options from Genesis Plus GX). The simplest fix would be to handle address exceptions in the code generator the same way that exceptions for instructions like chk and div are handled.
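
As a rough sketch of the check involved (not code from either emulator or from the test generator; `AccessSize` and `causes_address_error` are hypothetical names): on the 68000, word and long accesses to an odd address raise an address error, while byte accesses are allowed anywhere. The generated test above does long accesses at 16719515 (0xFF1E9B), which is odd.

```cpp
#include <cstdint>

// Hypothetical access sizes, in bytes.
enum AccessSize
{
    SizeByte = 1,
    SizeWord = 2,
    SizeLong = 4
};

// Returns true if a 68000 data access of the given size at the given
// address would raise an address error (word or long at an odd address).
bool causes_address_error(std::uint32_t address, AccessSize size)
{
    return size != SizeByte && (address & 1u) != 0;
}
```

A test generator could use a predicate like this either to avoid emitting such accesses or to mark the test case as one that is expected to take the exception.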
