Starting development of another emulator

Talk about development tools here

Moderator: BigEvilCorporation

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Starting development of another emulator

Post by r57shell » Fri Mar 28, 2014 3:32 pm

Yes, I decided to make a new Sega Genesis emulator.
More info here: https://github.com/realmonster/GenStation/wiki
Any help appreciated.
I'll start with some messy M68k work.

Mask of Destiny
Very interested
Posts: 616
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Fri Mar 28, 2014 9:01 pm

Since your new emulator is GPLv3, feel free to take code from BlastEm if any of it would be useful.

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Post by r57shell » Fri Mar 28, 2014 11:08 pm

There is a reason why I'm building it from scratch: I want a different system.

As I mentioned in the wiki, I want to implement cycle-step emulation.
What does that mean? I know it may be mad, but whatever...
Cycle-step emulation would be implemented like this:

Code: Select all

while(true)
{
    m68k_update();
    vdp_update();
    z80_update();
    ym2612_update();
}
It's pseudocode. Did you notice that there is no cycles argument? That is IT! That's the main point of the whole approach: each call advances by exactly one cycle. No more, no less.

What about accuracy? I didn't say it will be cycle accurate. I want to achieve only one thing: it will emulate everything in perfect sync. As you can see from the implementation, all chips advance by exactly one cycle. You don't need any synchronization, and you don't need any threads.

Also, that's how the hardware actually works: it is strictly synchronized by the clock pulse. Nothing really runs in parallel. It's more synchronized than parallel, if you think about it for a while.
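
To give an idea of what I mean, here is a minimal sketch (just an illustration, not my actual code; the struct and names are made up) of how each *_update() call can advance by exactly one master-clock cycle, while every chip divides the master clock by its own ratio (7 for the M68K, 15 for the Z80):

Code: Select all

#include <stdint.h>

typedef struct
{
    uint32_t divider;   /* master cycles per chip cycle: 7 for M68K, 15 for Z80 */
    uint32_t counter;   /* master cycles counted since the last chip cycle */
    void (*step)(void); /* advances the chip by exactly one of its own cycles */
} chip_t;

void chip_update(chip_t *chip)
{
    /* called once per master-clock cycle */
    if (++chip->counter >= chip->divider)
    {
        chip->counter = 0;
        chip->step(); /* one chip cycle, no more, no less */
    }
}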

I need to either write from scratch or rewrite almost all of the existing code.
I don't like "rewrite almost all code" for two reasons:
1) It would still carry all the past contributors as a dependency. Don't blame me, but I think of that as some "garbage".
2) Any code that wasn't rewritten would be hard to see through. I want a full picture of how the emulator works.

I have done some analysis of the most popular current code bases, and today I even looked at some of your BlastEm code. But I haven't found any emulator with the system I want to implement.

Also, as a feature, it will support save states at an exact clock cycle.

I know all of this is mad. But I hope it might work GREAT!
And if it turns out to be slow, like Exodus for example :lol: ,
don't be afraid, I'll do my best to speed it up with dirty hacks :lol:.
I don't need a slow emulator.

By the way, there is my first bunch of code. I'm going to sleep.

As I understand it, your M68k implementation uses recompilation from M68k into x86. That's not good for me because I want:
1) Cross platform
2) Cycle step
The idea of recompiling is to gain speed from native x86 code emulating multiple opcodes in a row.
But, as I mentioned before, I need to change context very frequently:
M68k->VDP->Z80->YM2612... (I don't know the correct order).

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres
Contact:

Post by Stef » Fri Mar 28, 2014 11:50 pm

I perfectly see the idea, just a "simple" way of doing cycle-perfect emulation. But honestly I don't see how you will achieve that at decent speed. I mean, Exodus is already cycle accurate, except it does it the complex way (it compiles cycles where it can), and Exodus is already very slow (at least too slow for my computer).
Just imagine: the Megadrive master clock runs at 53 MHz, so that means you will have to call each device's method 53 000 000 times per second! If we assume around 10 devices (I think that is realistic, as you will have to separate many hardware parts such as the bus, IO ports, VDP IO...), that means you make about 500 000 000 method calls per second, and not multi-threaded... Good luck with that ;) I think that even the simple method call, without any code in it, will already eat all your CPU resources!
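
Just to get an order of magnitude, something as dumb as this (nothing emulator-specific, only empty calls through a volatile pointer so the compiler cannot inline them away; the call count is just my assumption of ~10 devices for one emulated second) would show how much those calls alone already cost on your machine:

Code: Select all

#include <stdio.h>
#include <time.h>

static void empty_update(void) {}
/* volatile pointer so the compiler cannot inline or drop the calls */
static void (*volatile update_ptr)(void) = empty_update;

int main(void)
{
    const long calls = 500000000L; /* ~10 devices x 53 million master cycles */
    clock_t start = clock();
    for (long i = 0; i < calls; ++i)
        update_ptr();
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%.2f s for %ld empty calls\n", secs, calls);
    return 0;
}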

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Post by r57shell » Sat Mar 29, 2014 11:31 am

Exodus is slow because it starts more than 12 threads, each with its own stack and so on.

Don't forget that all those threads are managed by the "task manager" in the core of the Operating System. Not the "task manager" you open with CTRL+ALT+DEL (that is a graphical application); there is another "task manager", the scheduler inside the core of the Operating System.

Also, each synchronization takes way too long because it's an API call, which includes argument checking and building a call stack for the interrupt handler, because core functions require privileged mode (supervisor mode?). That's why any system API call in Windows is slow by definition. There are other Operating Systems that work differently, but I'm just giving an example.

You said that I'm making too many calls. OK, then try to find any other difference from the common approach:

Code: Select all

while(true)
{
    m68k_update(M68K_CYCLES);
    vdp_update(VDP_CYCLES);
    ym2612_update(YM_CYCLES);
    z80_update(Z80_CYCLES);
}
Is there any other difference except the order of execution? I don't think so.
So the asymptotic speed will be the same. Only the constant is different.
I can only guess what the constant would be.
You can think about Exodus in the same way.
And therefore: Exodus's asymptotic speed is the same. It's all a matter of the constant.

If the constant in my implementation is better, then the speed will be better.
But first, I want to write it in a clear way, without any mess.
Then, if that's not enough, I'll make some mess.

:lol: don't forget that modern systems can do 10^9 calculations per second.

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres
Contact:

Post by Stef » Sat Mar 29, 2014 1:58 pm

r57shell wrote:[...] So the asymptotic speed will be the same. Only the constant is different. [...] :lol: don't forget that modern systems can do 10^9 calculations per second.
I do agree that the main difference is the "constant" compared to a conventional emulator. But if I compare to Gens, which is scanline-timing based, the constant multiplier is ~x3500, which is definitely not a small constant. Also, Gens is pretty inaccurate and does not emulate a bunch of stuff, so that makes the difference even more important :)
I do agree it sounds nice, but I am just afraid that current CPUs are not yet powerful enough for that approach, at least to emulate a Megadrive at full speed. But maybe I am wrong ;)

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Post by r57shell » Sat Mar 29, 2014 5:36 pm

Don't quote the whole message, it's not necessary :o
Stef wrote:I do agree that the main difference is the "constant" compared to a conventional emulator. But if I compare to Gens, which is scanline-timing based, the constant multiplier is ~x3500, which is definitely not a small constant.
Where did you get that ~x3500?

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres
Contact:

Post by Stef » Sat Mar 29, 2014 8:04 pm

488 M68K cycles per line x 7 (since you use the master clock) = 3416 (OK, closer to 3400).

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Post by r57shell » Sat Mar 29, 2014 8:14 pm

Which one is faster?

Code: Select all

while (true)
{
  for (int i=0; i<3416; ++i)
      do_something_terrible();
  for (int i=0; i<3416*5; ++i)
      do_something_scary();
}

Code: Select all

while (true)
{
  for(int j=0; j<3416; ++j)
  {
      do_something_terrible();
      for (int i=0; i<5; ++i)
          do_something_scary();
  }
}
I think now you must understand. :wink:
The constant depends on things like stack manipulation, prefetching, branch prediction, and processor cache misses.
So 3416 is not the real constant.
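
If you want to check it yourself, a quick test like this (the two work functions are only dummies, of course; real chip cores touch far more state, so take it as an illustration, not a proof) shows how much the loop structure alone changes the constant:

Code: Select all

#include <stdio.h>
#include <time.h>

#define LINES (262 * 60) /* roughly one second of NTSC scanlines */

static volatile int sink; /* volatile so the calls are not optimized away */
static void do_something_terrible(void) { sink += 1; }
static void do_something_scary(void)    { sink += 2; }

int main(void)
{
    clock_t t0 = clock();
    for (int line = 0; line < LINES; ++line)
    {
        /* first version: batched */
        for (int i = 0; i < 3416; ++i) do_something_terrible();
        for (int i = 0; i < 3416 * 5; ++i) do_something_scary();
    }
    clock_t t1 = clock();
    for (int line = 0; line < LINES; ++line)
    {
        /* second version: interleaved */
        for (int j = 0; j < 3416; ++j)
        {
            do_something_terrible();
            for (int i = 0; i < 5; ++i) do_something_scary();
        }
    }
    clock_t t2 = clock();
    printf("batched: %.2f s, interleaved: %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}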

neologix
Very interested
Posts: 122
Joined: Mon May 07, 2007 5:19 pm
Location: New York, NY, USA
Contact:

Post by neologix » Sat Mar 29, 2014 9:17 pm

r57shell wrote:Which one is faster? (code)
To be fair, your first example runs do_something_terrible() 3416 times, THEN do_something_scary() 3416*5 times, while the second runs {do_something_terrible(), then 5 do_something_scary() calls} 3416 times. If one depends on the other, there will be significantly different results.

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Post by r57shell » Sat Mar 29, 2014 11:40 pm

You're right:

Code: Select all

#include <stdbool.h>

static bool need_update = false;
void move_slowly(void); /* some expensive work, defined elsewhere */

void do_something_terrible()
{
    need_update = true;
}
void do_something_scary()
{
    if (need_update)
        for (int i = 0; i < 1000; ++i)
            move_slowly();
    need_update = false;
}
In this case the second code will be slower, but we are talking about the emulation of chips, where that's not the case (I think).

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Post by r57shell » Mon Mar 31, 2014 6:08 pm

I have good news.
I have a working "ori" opcode, with almost all the stuff it needs.
Bus arbitration is not done.
Comparing to Musashi (all OPT_OFF), on this code:
my M68k: 0.147 average
Musashi: 0.047 average
As you can see, my code is about 3 times slower.

Also, my M68k has the same speed for this code. But I can't test the same code on Musashi, because this code messes up the registers and I can't guarantee there will be no exceptions.
In my M68k I just strictly force any error to a reset.

Mask of Destiny
Very interested
Posts: 616
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Mon Mar 31, 2014 10:01 pm

r57shell wrote:There is a reason why I'm building it from scratch: I want a different system.
I wasn't necessarily suggesting you should use code from BlastEm, just giving my blessing in case it would be useful. Not that you need my blessing given the license...
r57shell wrote:As I understand it, your M68k implementation uses recompilation from M68k into x86. That's not good for me because I want:
1) Cross platform
2) Cycle step
The idea of recompiling is to gain speed from native x86 code emulating multiple opcodes in a row.
But, as I mentioned before, I need to change context very frequently:
M68k->VDP->Z80->YM2612... (I don't know the correct order).
I wouldn't really recommend using my 68K core anyway. Beyond the issues you mentioned, it needs a lot of instruction timing work, and it's somewhat awkward to use as it currently expects to effectively act as the "main loop" of the emulator.

Some of the other parts might be more useful. My VDP core isn't perfect and the actual drawing code is too slow and messy, but it's fairly accurate. It would also be fairly trivial to adapt to a cycle-at-a-time interface: just add a function that increments the target cycle by one and runs the core.
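
Roughly something like this (the type and function names are made up for illustration, not the actual BlastEm API):

Code: Select all

#include <stdint.h>

typedef struct
{
    uint32_t cycles;       /* cycle the core has already run up to */
    uint32_t target_cycle; /* the core runs until it reaches this */
    /* ... rest of the VDP state ... */
} vdp_context;

void vdp_run_context(vdp_context *context); /* existing "run until target" entry point */

void vdp_run_one_cycle(vdp_context *context)
{
    context->target_cycle = context->cycles + 1; /* advance by exactly one cycle */
    vdp_run_context(context);
}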

Beyond that, I have some Python code for generating 68K and Z80 test programs which might be useful for testing your cores.

Of course, I can certainly understand the desire to start from scratch so I won't be offended if you don't want to use it.

Anyway, your general approach seems somewhat similar to BSNES/Higan, which is pretty slow (at least with the "accuracy" profile) but certainly usable on a sufficiently fast computer. Granted, from what I remember it also models bus operation at a fairly low level, so that accounts for some of the slowness, and it's not clear you care about that for GenStation. Still, I think you may be underestimating the performance impact of this approach. I expect the VDP is where you'll be hit the hardest: running it a cycle at a time gets expensive because things that would otherwise be part of a tight inner loop now carry per-call overhead, and the VDP is also the most processor-intensive part of a Genesis emulator. Again, I don't think it will be unusably slow given fast hardware, just that the performance hit will be somewhat significant.

r57shell
Very interested
Posts: 478
Joined: Sun Dec 23, 2012 1:30 pm
Location: Russia
Contact:

Post by r57shell » Tue Apr 01, 2014 12:51 pm

A great m68k test generator!
But I have a few questions about it:
1) Where is your Musashi setup for comparison?
2) What assumptions did you make about RAM? You write to addresses selected at random; is there any "guarantee" that it's not modifying the code it's running?
3) As I understand it, I need to implement only: move #imm,(xxx).l; move #imm,dn; move #imm,an; move <ea>,dn; move #imm,ccr; move sr,dn. Is that true?

Mask of Destiny
Very interested
Posts: 616
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Tue Apr 01, 2014 6:17 pm

r57shell wrote:1) Where is your Musashi setup for comparison?
I'm using the version of Musashi from Genesis Plus GX. I'll upload the code I'm using to drive it when I get home tonight.
r57shell wrote:2) What assumptions did you make about RAM? You write to addresses selected at random; is there any "guarantee" that it's not modifying the code it's running?
It generally assumes a simplified Genesis/Megadrive memory map with ROM at $0 and RAM starting at $EE0000. It tries to keep reads and writes to the RAM area, but it doesn't properly handle instructions that use a register more than once. To avoid any complications I've made sure the ROM area is not writeable and that the unmapped region between ROM and RAM returns a predictable value.
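
In other words, the map the generated tests assume looks roughly like this (the sizes and names are just for illustration, not the generator's actual code):

Code: Select all

#include <stdint.h>

#define ROM_SIZE       0x400000  /* ROM mapped at $000000 */
#define RAM_START      0xEE0000  /* RAM area used by the generated tests */
#define RAM_SIZE       0x010000
#define UNMAPPED_VALUE 0xFFFF    /* predictable value for the gap */

extern uint16_t rom[ROM_SIZE / 2];
extern uint16_t ram[RAM_SIZE / 2];

uint16_t read_word(uint32_t address)
{
    address &= 0xFFFFFF; /* 24-bit address bus */
    if (address < ROM_SIZE)
        return rom[address >> 1];
    if (address >= RAM_START && address < RAM_START + RAM_SIZE)
        return ram[(address - RAM_START) >> 1];
    return UNMAPPED_VALUE;
}

void write_word(uint32_t address, uint16_t value)
{
    address &= 0xFFFFFF;
    if (address >= RAM_START && address < RAM_START + RAM_SIZE)
        ram[(address - RAM_START) >> 1] = value;
    /* writes to ROM or the unmapped region are ignored */
}
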
r57shell wrote:3) As I understand it, I need to implement only: move #imm,(xxx).l; move #imm,dn; move #imm,an; move <ea>,dn; move #imm,ccr; move sr,dn. Is that true?
That sounds about right for the general case. For instructions that can trigger an exception (chk, div, etc.), you also need to implement rte and maybe addq. The ones you listed should be enough to get you started though.
