How hard would it be to code a NES/SMS emulator
Moderator: Mask of Destiny
Hi
I think I already chatted with MaskOfDestiny about this a long time ago.
What about coding a NES emulator that uses the Sega CD?
I mean, we could use the Sega CD CPU to emulate the NES CPU, and the Mega Drive CPU as a sound/display adapter... I read that the NES display is quite "close" to the Mega Drive VDP (except sprites, maybe?).
As for emulating the SMS, I know it would be more reasonable to use special hardware (a cartridge that boots in SMS mode)...
Any idea whether it's a stupid idea or a mad bad idea?
It's a great mad idea !!!
See www.smspower.org for a whole bunch of docs.
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
I'm not sure the NES is simpler to emulate than the SMS.
The SMS's VDP is already present, and the sound chip too...
Even the Z80 is there, but we can't use it for that... Actually, apart from the Z80 and I/O there is almost nothing to emulate... but even a single Z80 can't really be emulated on the Sega CD sub CPU :-/
The NES video processor uses 2-bit palettes, which are not as easy to handle as the 4-bit palettes of the SMS (the Genesis hardware uses 4-bit palettes too).
That sounds like a crazy project, but I'm afraid it's too crazy :p
I know there is a very fast Z80 emulator for the 68000 (used in a Spectrum emulator for the TI-92 calculator), but I've never heard of a 6502 emulator for the 68000...
-
- Very interested
- Posts: 616
- Joined: Thu Nov 30, 2006 6:30 am
Depends on how accurate you want to be. A 68K at 12.5 MHz isn't fast enough to do accurate Z80 emulation at anything close to full speed. My thinking is that you might get at least a few games to run reasonably well if you cut some corners. Flag calculation is a real speed killer. When I was working on Z80 emulation for my Sega CD SMS emulator, my plan was to just store the status register after each emulated instruction and fake the Z80 flags using the 68K flags. Accuracy would be horrible, but my hope was that few enough games would rely on bizarre Z80 flag behavior (half carry bit, anyone?) for it not to be a major issue.
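A rough sketch of the flag-faking idea, in Python rather than 68K assembly so it can be run as-is. It assumes the saved 68K condition codes (C, V, Z, N) are simply copied into the Z80 F positions for C, P/V, Z and S. The helper name `ccr_to_z80_f` is made up for illustration, and H and N are left clear, which is exactly the kind of inaccuracy described above (P/V is also wrong whenever the Z80 would report parity rather than overflow).

```python
# 68K CCR bit masks (low byte of the status register).
CCR_C, CCR_V, CCR_Z, CCR_N = 0x01, 0x02, 0x04, 0x08

# Z80 F register bit masks for the flags we can approximate.
F_C, F_PV, F_Z, F_S = 0x01, 0x04, 0x40, 0x80


def ccr_to_z80_f(ccr):
    """Approximate a Z80 F value from a saved 68K CCR (hypothetical helper)."""
    f = 0
    if ccr & CCR_C:
        f |= F_C    # carry maps to carry
    if ccr & CCR_V:
        f |= F_PV   # overflow stands in for P/V
    if ccr & CCR_Z:
        f |= F_Z    # zero maps to zero
    if ccr & CCR_N:
        f |= F_S    # negative maps to sign
    return f        # H and N are never set: the 68K has no equivalent
```

In a real core this would more likely be a 32-entry lookup table indexed by the CCR, so the conversion only costs anything when the emulated code actually reads F.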
As Stef said, video and sound are already done for you; however, you need to do a bit of work to handle the VDP interfacing, as the Genesis VDP is word oriented and the SMS VDP is byte oriented. I got around this by keeping a copy of the Mode 4 accessible portion of VRAM (8KB), setting auto-increment to 1, combining the byte to be written with its corresponding byte from my VRAM copy and then doing a swap for (IIRC) odd-addressed VRAM writes (the Genesis VDP swaps the word when you write to an odd address).
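Here is a small Python model of that shadow-copy trick, only to make the byte/word juggling concrete. It assumes the even-addressed byte is the high byte of the VDP word and that the VDP swaps the two bytes on an odd-addressed write, as described above; `word_for_byte_write` is a hypothetical name, not something from the actual emulator.

```python
def word_for_byte_write(shadow, addr, value):
    """Record an SMS-style byte write and return the word to send to the VDP."""
    shadow[addr] = value                       # keep the shadow copy current
    even = addr & ~1                           # address of the containing word
    word = (shadow[even] << 8) | shadow[even + 1]
    if addr & 1:
        # Pre-swap the bytes so the swap done by the VDP on odd
        # addresses puts them back in the right order.
        word = ((word & 0xFF) << 8) | (word >> 8)
    return word


shadow_vram = bytearray(8 * 1024)              # 8KB shadow copy, as in the post
first = word_for_byte_write(shadow_vram, 0, 0x12)
second = word_for_byte_write(shadow_vram, 1, 0x34)
```

The cost is one extra read and write of the shadow copy per emulated VRAM write, traded against never having to read VRAM back from the VDP.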
Supposedly Yuji Naka wrote an NES emulator for the Genesis back in the system's heyday, but I have no idea how accurate it was.
In some ways the architecture of the NES is more suited to being split between the sub and main CPUs, as most games had tile data in a separate ROM accessed by the PPU. There's a performance penalty for passing data back and forth between the 2 CPUs, but on the NES the processor doesn't have to push the tile data to the PPU (unless the cart has RAM instead of ROM for its CHR ROM), so there should be less interaction between the chips.
Video would definitely be more work, but the 6502 is a simpler chip than the Z80, so the CPU emulator should be a little easier to write. I'm not sure how easy it would be to make it perform decently, though. The 6502 has a much simpler instruction set than the Z80 and it runs at a significantly lower frequency, but IIRC it also executes more instructions per clock than the Z80.
If you want a fast 68K 6502 emulator you might look for some old Amiga emulators. The 6502 was also used in some old home computers (Commodore 64, Atari 800, etc.), so an emulator for one of those targeted at the Amiga might have a decent 68K asm core. Of course they might also be targeted at Amigas with faster processors (68020+), so I'm not sure what you'd find.
-
- Very interested
- Posts: 2440
- Joined: Tue Dec 05, 2006 1:37 pm
- Location: Estonia, Rapla City
- Contact:
An SMS emulator is a little impractical, because the MD has an SMS practically built in.
A NES emulator may work nicely, since one crazy guy programmed a NES emulator in QB that is fast enough to run at 15 FPS on a 166 MHz CPU (it would run a hell of a lot faster if the programmer had used XMS/EMS instead of a swap file). Since QB is around 6 times slower than other languages and the 68K is easier to handle than x86, I guess you can write a decent-speed NES emu.
What are you reading? You won't understand it anyway
http://www.tmeeco.eu
Files of all broken links and images of mine are found here : http://www.tmeeco.eu/FileDen
-
- Very interested
- Posts: 616
- Joined: Thu Nov 30, 2006 6:30 am
I did a few calculations and it would appear that each instruction of the 6502 in the NES takes the equivalent of anywhere from 14 to 56 68K (@ 12.5 MHz) cycles. 14 is obviously too few cycles to emulate an instruction on the 68000, as each 68K instruction takes a minimum of 4 cycles. You might be able to make up a few cycles on some of the longer-to-execute instructions, but I don't think it will come close to averaging out. The SMS and its Z80 aren't really any better: its instructions take anywhere from 14 to 59.5 68K cycles.
Emulating the CPU at half speed or better should be doable. If you could figure out an efficient way to decode two instructions at once and efficiently combine certain pairs you might be able to make some gains, but it would be hard to do in a reasonable amount of memory, and it's hard to say how big an improvement it would make even if you could come up with an efficient mechanism.
I'll have to do some cycle counting when I get home and see what I can come up with.
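For anyone who wants to redo the arithmetic, here is a quick Python sketch of the cycle budget. The guest clocks are the usual NTSC values, which the post doesn't state. The instruction lengths (2 to 8 cycles for the 6502, 4 to 17 for the Z80) are guesses that roughly reproduce the figures above, not something taken from the post.

```python
SUB_CPU_HZ = 12_500_000   # Sega CD sub CPU clock, as used in the post
NES_CPU_HZ = 1_789_773    # NTSC NES 6502 clock (assumption)
SMS_CPU_HZ = 3_579_545    # NTSC SMS Z80 clock (assumption)


def budget(guest_hz, guest_cycles, host_hz=SUB_CPU_HZ):
    """68K cycles available while the guest CPU spends guest_cycles."""
    return guest_cycles * host_hz / guest_hz


nes_range = (budget(NES_CPU_HZ, 2), budget(NES_CPU_HZ, 8))
sms_range = (budget(SMS_CPU_HZ, 4), budget(SMS_CPU_HZ, 17))
print("NES: %.1f to %.1f 68K cycles per instruction" % nes_range)
print("SMS: %.1f to %.1f 68K cycles per instruction" % sms_range)
```

Each 6502 cycle is worth about 7 cycles of the 68K and each Z80 cycle about 3.5, which is where the shared lower bound of roughly 14 comes from.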
Oh, I was probably a bit off. I thought it was possible to find a 68K equivalent for each 6502 instruction (using a big equivalency table), but yeah, now it comes to mind that all the jump stuff would get stuck.
Mask of Destiny wrote: You might be able to make up a few cycles on some of the longer to execute instructions, but I don't think it will come close to averaging out.
Understood.
It's funny you mention making a NES/SMS emulator, because I was thinking of something similar just the other day. Maybe even a Game Boy emulator?
For the CPU emulation, perhaps some kind of dynamic recompilation? That would definitely help with the speed, plus you could utilize the 68K flags. (The idea being that the Game Boy has 2 32k blocks it looks at, with code recompiled on the fly if needed and stored in the sub RAM.)
Also, that's a pretty good idea, splitting things so the CPU is on the sub side and the graphics on the main side. I would think the sound would be better on the sub side with the PCM, but hey, perhaps you could also use the Z80.
-
- Very interested
- Posts: 616
- Joined: Thu Nov 30, 2006 6:30 am
I've done some cycle counting and it's not pretty. If I do a standard 16-bit displacement style jump table, it takes 38 cycles just to read in an opcode and jump to the code that emulates it, plus at least another 8 cycles to jump back for the next instruction. Here's the code:
Code:
;assume a0 is PC, a2 points to jump table, a1 is base offset for emulation code
eor.w d0, d0           ; 4   clear d0 so the upper byte is zero
move.b (a0)+, d0       ; 8   fetch opcode, advance emulated PC
add.w d0, d0           ; 4   scale to a word index
move.w (a2, d0.w), d0  ; 12  load 16-bit handler displacement
jmp (a1, d0.w)         ; 10  jump to the handler
;Total 38 cycles
So in total that's 46 cycles, only about 30% of the speed of the real thing.
If we assume that we can fit most of our emulated instructions (or at least the ones that need to be fastest anyway) in 16 bytes, we can save two cycles by using the following:
Code:
eor.w d0, d0           ; 4   clear d0 so the upper byte is zero
move.b (a0)+, d0       ; 8   fetch opcode, advance emulated PC
lsl.w #4, d0           ; 14  opcode * 16 = offset of its 16-byte handler
jmp (a1, d0.w)         ; 10  jump to the handler
;Total 36 cycles
That's still not very good. We're still at 44 cycles once we add in the final jmp, still only ~32% of full speed.
If we waste some RAM (and thereby limit ourselves to smaller carts) we can cheat a bit. If we zero out RAM and only write data to even addresses we can do something like this:
Code:
move.w (a0)+, d0       ; 8   opcode lands in the high byte, low byte is the zero padding
jmp (a1, d0.w)         ; 10  jump to the handler
;Total 18 cycles
So that's 26 cycles with the return jump, which gets us up to about 54% of full speed (for a nop).
If we grab two instructions at once we could use something like the following:
Code:
;assume a0 is PC, a2 points to jump table, a3 is base offset for emulation code
move.w (a0)+, d0       ; 8   fetch two opcodes at once
move.l a2, a1          ; 4   copy the table base
add.l d0, a1           ; 8   add the opcode pair as an offset (upper word of d0 assumed clear)
move.w (a1), d1        ; 8   load handler displacement for the pair
jmp (a3, d1.w)         ; 10  jump to the handler (d1, not d0, holds the displacement)
;Total 38 cycles
If we grabbed two nops then we'd be done in 46 cycles, which is about 61% of full speed, though some instruction combos might incur additional overhead as we'd need to do some decoding to keep the code from bloating out too much.
A dynarec would be interesting. They're generally not considered appropriate when timing is important, but we've already kind of given up on any attempt at timing accuracy, so it's probably no worse than an interpreter in our case.
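The percentages in this post can be re-derived with a few lines of Python. This is only a worked check of the arithmetic, taking the cycle counts quoted above (38, 36, 18 and 38 for the dispatch, 8 for the return jump) and the 14-cycle budget per nop at face value. The names are made up for the example.

```python
NOP_BUDGET = 14   # 68K cycles available per emulated nop, from the earlier post
RETURN_JMP = 8    # cycles to jump back for the next instruction


def percent_of_full_speed(dispatch_cycles, nops=1):
    """Share of real-time speed when the handler body itself costs nothing."""
    return 100.0 * NOP_BUDGET * nops / (dispatch_cycles + RETURN_JMP)


results = {
    "16-bit jump table": percent_of_full_speed(38),
    "opcode shifted by 4": percent_of_full_speed(36),
    "zero-padded opcodes": percent_of_full_speed(18),
    "two nops per dispatch": percent_of_full_speed(38, nops=2),
}
for name, pct in results.items():
    print("%-22s %5.1f%%" % (name, pct))
```

These are best-case numbers, since they count only the dispatch overhead and none of the work a real instruction handler would do.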
-
- Very interested
- Posts: 2440
- Joined: Tue Dec 05, 2006 1:37 pm
- Location: Estonia, Rapla City
- Contact:
Where do you put the recompiled code? There isn't very much RAM in the MCD.
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
Mask of Destiny wrote: I've done some cycle counting and it's not pretty. [...] So in total that's 46 cycles, only about 30% of the speed of the real thing. [...]
Yep, with a simple interpreter there is no way to get anything close to full speed :-/
Mask of Destiny wrote: A dynarec would be interesting. [...]
Dynarec could be a solution. The Sega CD RAM is probably large enough to hold the compiled SMS, NES or GB code, and anyway we can limit the size of the generated code (the more you limit it, the more recompilation you need). But the problem is the recompiler itself... it's complex stuff to write, and I don't know if the recompiler plus the compiled code can fit in the available Sega CD RAM :-/ And that's ignoring resource data such as video and sound...
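To illustrate the size/recompilation trade-off, here is a toy Python model of a size-limited code cache. It is only a sketch under simple assumptions: blocks are keyed by guest PC, the whole cache is flushed when it fills up, and `translate` is a stand-in for a real recompiler. None of the names come from an existing emulator.

```python
class TranslationCache:
    """Toy cache of recompiled blocks with a fixed byte budget."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.blocks = {}            # guest PC -> generated code
        self.recompilations = 0

    def lookup(self, pc, translate):
        """Return the block for pc, recompiling it if it is not cached."""
        block = self.blocks.get(pc)
        if block is None:
            block = translate(pc)   # hypothetical recompiler callback
            self.recompilations += 1
            if self.used_bytes + len(block) > self.capacity_bytes:
                # Out of room: throw everything away and start over.
                self.blocks.clear()
                self.used_bytes = 0
            self.blocks[pc] = block
            self.used_bytes += len(block)
        return block
```

Running the same loop of four 64-byte blocks through a 256-byte cache recompiles each block once, while a 128-byte cache ends up recompiling on every single lookup, which is the worst case of "the more you limit it, the more recompilation you need".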
I think that 512 KB (or 512 KB minus the emulator) is quite good...
I wonder if it's possible to use the Sega CD communication table to fake the I/O (the VDP ports of the NES).
I mean, if the main CPU runs at full speed polling the I/O, there may be no lag.
The 128KB flip/flop RAM can be used to store some additional stuff on the sub CPU side
and the emulator on the main CPU side...
Btw, it seems I've awakened some of your old demons, MOD ;P
Also, why emulate the nops? Just skipping them wouldn't break the code, would it?
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
Fonzie wrote: Also, why emulating the nops? Just skipping them woundn't damage the code, isn't it?
ElBarto wrote: nops are sometimes used for timing so you have to emulate them.
They can be emulated just for timing purposes. In an interpreter that still eats time, but not in a dynarec. With a dynarec we can also do more complex code analysis to remove wait loops and things like that.
Doing a dynarec is a very interesting project, imo.
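A tiny Python illustration of why the nops matter for timing. It models an interpreter that only understands the 6502 NOP opcode (0xEA, 2 cycles) and either counts its cycles or skips it outright; the function and its `skip_nops` switch are invented for the example.

```python
NOP = 0xEA          # 6502 NOP opcode
NOP_CYCLES = 2      # cycles a NOP takes on the 6502


def run(program, skip_nops=False):
    """Run a nop-only program and return the emulated cycle count."""
    pc = 0
    cycles = 0
    while pc < len(program):
        opcode = program[pc]
        pc += 1
        if opcode != NOP:
            raise ValueError("opcode %#04x not handled in this sketch" % opcode)
        if not skip_nops:
            cycles += NOP_CYCLES   # the nop does nothing except pass time
    return cycles


delay_loop = bytes([NOP] * 4)      # a 4-nop delay as a game might use
```

With the nops counted the delay costs 8 emulated cycles; with them skipped it costs none, so anything the game timed against that delay would drift.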