How hard would it be to code a NES/SMS emulator

Fonzie
Genny lover
Posts: 323
Joined: Tue Aug 29, 2006 11:17 am

Post by Fonzie » Mon Feb 12, 2007 8:24 pm

Hi :)

I think I already chatted with MaskOfDestiny about this... a long time ago.
What about coding a NES emulator that would use the Sega CD?

I mean, we could use the Sega CD CPU to emulate the NES CPU and the Mega Drive CPU as a sound/display adapter... I read that the NES display was quite "close" to the Mega Drive VDP (except sprites maybe?).

About emulating the SMS, I know it would be more reasonable to use special hardware (a cartridge that boots in SMS mode)...

Any idea if it's a stupid idea or a mad bad idea?

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Mon Feb 12, 2007 9:05 pm

It's a great mad idea !!!
See www.smspower.org for a whole bunch of docs.

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Mon Feb 12, 2007 9:07 pm

I'm not sure the NES is simpler to emulate than the SMS.
The SMS's VDP is already present, the sound chip too...
Even the Z80 is there, but we can't use it for that... actually, apart from the Z80 and I/O there is almost nothing to emulate... anyway, even emulating just the Z80 can't be done on the Sega CD sub CPU :-/
The NES video processor uses a 2-bit palette, which is not as easy as the 4-bit palette of the SMS (the Genesis hardware uses a 4-bit palette too).

That sounds like a crazy project, but I'm afraid it's too crazy :p

I know there is a very fast Z80 emulator for the 68000 CPU (used in a Spectrum emulator for the TI-92 calculator). But I've never heard of a 6502 emulator on the 68000...

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Mon Feb 12, 2007 10:25 pm

Depends on how accurate you want to be. A 68K at 12.5MHz isn't fast enough to do accurate Z80 emulation at anything close to full speed. My thinking is that you might get at least a few games to run reasonably well if you cut some corners. Flag calculation is a real speed killer. When I was working on Z80 emulation for my Sega CD SMS emulator, my plan was to just store the status register after each emulated instruction and fake the Z80 flags using the 68K flags. Accuracy would be horrible, but my hope was that few enough games would rely on bizarre Z80 flag behavior (half carry bit anyone?) for it not to be a major issue.
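To make the flag shortcut concrete, here is a minimal Python sketch (not code from the emulator mentioned above). `z80_add_flags` computes the documented flags of a Z80 `ADD A, r`, while `faked_z80_flags` copies what a 68000 `add.b` would leave in its condition codes, which have no half-carry bit. All function names are made up for illustration, and the undocumented Z80 flag bits are ignored.

```python
def z80_add_flags(a: int, b: int) -> dict:
    """Documented flags of the Z80 `ADD A, r` for 8-bit operands."""
    result = (a + b) & 0xFF
    return {
        "S": bool(result & 0x80),                      # sign
        "Z": result == 0,                              # zero
        "H": ((a & 0x0F) + (b & 0x0F)) > 0x0F,         # carry out of bit 3
        "PV": bool(~(a ^ b) & (a ^ result) & 0x80),    # signed overflow
        "N": False,                                    # cleared by additions
        "C": (a + b) > 0xFF,                           # carry out of bit 7
    }


def m68k_add_b_ccr(a: int, b: int) -> dict:
    """Condition codes a 68000 `add.b` would produce (X mirrors C)."""
    result = (a + b) & 0xFF
    return {
        "N": bool(result & 0x80),
        "Z": result == 0,
        "V": bool(~(a ^ b) & (a ^ result) & 0x80),
        "C": (a + b) > 0xFF,
    }


def faked_z80_flags(a: int, b: int) -> dict:
    """Shortcut: reuse the 68K flags and give up on the half carry."""
    ccr = m68k_add_b_ccr(a, b)
    return {"S": ccr["N"], "Z": ccr["Z"], "H": False,
            "PV": ccr["V"], "N": False, "C": ccr["C"]}
```

In this model the shortcut agrees with the exact flags on everything except `H`, so only code that depends on the half carry (typically BCD arithmetic through `DAA`) would notice the difference for additions.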

As Stef said, video and sound are already done for you; however, you need to do a bit of work to handle the VDP interfacing, as the Genesis VDP is word oriented and the SMS VDP is byte oriented. I got around this by keeping a copy of the Mode 4 accessible portion of VRAM (8KB), setting auto-increment to 1, combining the byte to be written with its corresponding byte from my VRAM copy and then doing a swap for, IIRC, odd-addressed VRAM writes (the Genesis VDP swaps the bytes of the word when you write to an odd address).
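The shadow-copy approach can be sketched in Python as follows. This is a toy model only: `WordVram` stands in for a word-oriented VRAM port that swaps bytes on odd addresses (as recalled in the post, not verified here), and `ByteVramAdapter` is a hypothetical wrapper that turns byte writes into word writes. Auto-increment is not modeled.

```python
class WordVram:
    """Toy word-oriented VRAM port; swaps the bytes on odd addresses."""

    def __init__(self, size: int):
        self.mem = bytearray(size)

    def write_word(self, addr: int, word: int) -> None:
        if addr & 1:
            word = ((word & 0xFF) << 8) | (word >> 8)  # assumed port behavior
        base = addr & ~1
        self.mem[base] = word >> 8
        self.mem[base + 1] = word & 0xFF


class ByteVramAdapter:
    """Emulates byte writes on top of a word-only port via a shadow copy."""

    def __init__(self, vram: WordVram, size: int):
        self.vram = vram
        self.shadow = bytearray(size)   # copy of everything written so far

    def write_byte(self, addr: int, value: int) -> None:
        self.shadow[addr] = value & 0xFF
        base = addr & ~1
        # Combine the new byte with its neighbor from the shadow copy.
        word = (self.shadow[base] << 8) | self.shadow[base + 1]
        if addr & 1:
            # Pre-swap so the port's own swap puts the bytes back in order.
            word = ((word & 0xFF) << 8) | (word >> 8)
        self.vram.write_word(addr, word)
```

The shadow copy costs RAM, but it avoids reading VRAM back before every byte write.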

Supposedly Yuji Naka wrote an NES emulator for the Genesis back in the system's heyday, but I have no idea how accurate it was.

In some ways the architecture of the NES is more suited to being split between the sub and main CPUs, as most games had tile data in a separate ROM accessed by the PPU. There's a performance penalty for passing data back and forth between the 2 CPUs, but on the NES the processor doesn't have to push the tile data to the PPU (unless the cart has RAM instead of ROM for its CHR data), so there should be less interaction between the chips.

Video would definitely be more work, but the 6502 is a simpler chip than the Z80 so the CPU emulator should be a little easier. Not sure how easy it would be to do a decent performing CPU emulator. The 6502 has a much simpler instruction set than the Z80 and it runs at a significantly lower frequency, but IIRC it also executes more instructions per clock than the Z80.

If you want a fast 68K 6502 emulator you might look for some old Amiga emulators. The 6502 was also used in some old home computers (Commodore 64, Atari 800, etc.) so an emulator for one of those targeted at the Amiga might have a decent 68K asm core. Of course they might also be targeted at Amigas with faster processors (68020+) so I'm not sure what you'd find.

TmEE co.(TM)
Very interested
Posts: 2440
Joined: Tue Dec 05, 2006 1:37 pm
Location: Estonia, Rapla City

Post by TmEE co.(TM) » Tue Feb 13, 2007 1:16 pm

An SMS emulator is a little impractical, because the MD has the SMS practically built in.
A NES emulator may work nicely, since one crazy guy programmed a NES emulator in QB which is fast enough to run at 15FPS on a 166MHz CPU (it would run a hell of a lot faster if the programmer had used XMS/EMS instead of a swapfile). Since QB is around 6 times slower than other languages and the 68K is easier to handle than x86, I guess you can write a decent-speed NES emu.
What are you reading? You won't understand it anyway ;)
http://www.tmeeco.eu
Files of all broken links and images of mine are found here : http://www.tmeeco.eu/FileDen

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Tue Feb 13, 2007 2:36 pm

A 166 MHz Pentium (I guess that was the x86 CPU used here) is equivalent to a ~1000 MHz 68000 (even more in reality)... that's really, really, really far from our 12.5 MHz 68000 :-/

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Tue Feb 13, 2007 6:53 pm

I did a few calculations and it would appear that each instruction of the 6502 in the NES corresponds to anywhere from 14 to 56 68K cycles (@ 12.5MHz). 14 is obviously too few cycles to emulate an instruction on the 68000, as each 68K instruction takes a minimum of 4 cycles. You might be able to make up a few cycles on some of the longer-to-execute instructions, but I don't think it will come close to averaging out. The SMS and its Z80 aren't really any better: its instructions take anywhere from 14 to 59.5 68K cycles.
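The budget figures can be reproduced with a short calculation. The sketch below assumes the NTSC clock rates (about 1.79 MHz for the NES CPU and 3.58 MHz for the SMS Z80) and guest instruction lengths of 2 to 8 cycles for the 6502 and 4 to 17 cycles for the Z80. Those ranges are assumptions picked because they match the numbers in the post, not a statement of each CPU's true minimum and maximum.

```python
SUB_CPU_HZ = 12_500_000   # Sega CD sub CPU, as stated in the thread
NES_CPU_HZ = 1_789_773    # NTSC NES CPU clock
SMS_CPU_HZ = 3_579_545    # NTSC SMS Z80 clock


def host_cycles(guest_cycles: int, guest_hz: int,
                host_hz: int = SUB_CPU_HZ) -> float:
    """68K cycles that elapse while the guest CPU spends `guest_cycles`."""
    return guest_cycles * host_hz / guest_hz


# Assumed shortest and longest instruction lengths (see the note above).
nes_budget = (host_cycles(2, NES_CPU_HZ), host_cycles(8, NES_CPU_HZ))
sms_budget = (host_cycles(4, SMS_CPU_HZ), host_cycles(17, SMS_CPU_HZ))

if __name__ == "__main__":
    print("6502: %.1f to %.1f 68K cycles" % nes_budget)
    print("Z80:  %.1f to %.1f 68K cycles" % sms_budget)
```

Both CPUs end up with roughly 14 68K cycles for their shortest instructions, which is why neither target looks easier than the other on this measure.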

Emulating the CPU at half speed or better should be doable. If you could figure out an efficient way to decode two instructions at once and efficiently combine certain pairs you might be able to make some gains, but it would be hard to do in a reasonable amount of memory and it's hard to say how big an improvement it would make even if you could come up with an efficient mechanism.

I'll have to do some cycle counting when I get home and see what I can come up with.

Fonzie
Genny lover
Posts: 323
Joined: Tue Aug 29, 2006 11:17 am

Post by Fonzie » Tue Feb 13, 2007 9:26 pm

Oh, I'm probably a bit off :) I thought it was possible to find a 68K equivalent for each 6502 instruction (using a big equivalency table), but yeah, now it comes to mind that all the jump stuff would get stuck :)
Mask of Destiny wrote:You might be able to make up a few cycles on some of the longer to execute instructions, but I don't think it will come close to averaging out.
Understood.

MERLiX
Newbie
Posts: 8
Joined: Mon Jan 15, 2007 10:45 pm

Post by MERLiX » Wed Feb 14, 2007 1:36 am

It's funny you mention making a NES/SMS emulator because I was thinking something similar just the other day, maybe even a Game Boy emulator?

For the CPU emulation, perhaps do some kind of dynamic recompilation? That would definitely help with the speed, plus you could utilize the 68K flags. (The idea being that the Game Boy has 2 32k blocks it looks at, with code recompiled on the fly as needed and stored in the sub RAM.)

Also, that's a pretty good idea, splitting the CPU onto the sub side and the graphics onto the main side. I would think the sound would be better on the sub side with the PCM, but hey, perhaps you could also use the Z80 ;).

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Wed Feb 14, 2007 4:27 am

I've done some cycle counting and it's not pretty. If I do a standard 16-bit displacement style jump table, it takes 38 cycles just to read in an opcode and jump to the code that emulates it plus at least another 8 cycles to jump back for the next instruction. Here's the code:

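Code:

;assume a0 is PC, a2 points to jump table, a1 is base offset for emulation code

	eor.w	d0, d0			; 4   clear d0 so the opcode byte is zero-extended
	move.b	(a0)+, d0		; 8   fetch opcode, advance emulated PC
	add.w	d0, d0			; 4   scale opcode to a word index
	move.w	(a2, d0.w), d0		; 12  load 16-bit handler displacement
	jmp	(d0.w, a1)		; 10  jump to the handler
	;Total				38 cycles

So in total that's 46 cycles, only about 30% of the speed of the real thing.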

If we assume that we can fit most of our emulated instructions (or at least the ones that need to be fastest anyway) in 16 bytes we can save two cycles by using the following:

Code:

	eor.w	d0, d0		; 4   clear d0
	move.b	(a0)+, d0	; 8   fetch opcode, advance emulated PC
	lsl.w	#4, d0		; 14  opcode * 16 = offset of its 16-byte handler
	jmp	(d0.w, a1)	; 10  jump to the handler
	;Total			36 cycles

That's still not very good. We're still at 44 cycles once we add in the final jmp, still only ~32% of full speed.

If we waste some RAM (and thereby limit ourselves to smaller carts) we can cheat a bit. If we zero out RAM and only write data to even addresses we can do something like this:

Code:

	move.w	(a0)+, d0	; 8   fetch opcode together with its zero padding byte
	jmp	(d0.w, a1)	; 10  jump to the handler
	;Total			18 cycles

So that's 26 cycles with the return jump which gets us up to about 54% of full speed (for a nop).
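One way to read the even-address trick, sketched in Python: every guest byte is stored at an even address with a zero after it, so a big-endian word fetch returns `opcode << 8`, which can serve directly as a signed handler offset. The function names are hypothetical, and the 256-byte handler spacing that follows from this layout is an inference, not something stated in the post.

```python
def expand_to_even(rom: bytes) -> bytes:
    """Store each guest byte at an even address, followed by a zero byte."""
    out = bytearray(len(rom) * 2)
    out[0::2] = rom
    return bytes(out)


def fetch_word(mem: bytes, pc: int) -> int:
    """Big-endian 16-bit fetch, like `move.w (a0)+, d0` on the 68000."""
    return int.from_bytes(mem[pc:pc + 2], "big")


def handler_offset(word: int) -> int:
    """Interpret the fetched word as the signed 16-bit index of the jmp."""
    return word - 0x10000 if word & 0x8000 else word
```

With this layout opcodes 0x80 and above produce negative offsets, so the base register would have to point at the middle of a 64KB block of handlers, and the emulated memory takes twice its original size.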

If we grab two instructions at once we could use something like the following:

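Code:

;assume a0 is PC, a2 points to jump table, a3 is base offset for emulation code

	move.w	(a0)+, d0		; 8   fetch two opcodes at once
	move.l	a2, a1			; 4   copy the jump table base
	add.l	d0, a1			; 8   index by the opcode pair (upper word of d0 assumed clear)
	move.w	(a1), d1		; 8   load 16-bit handler displacement
	jmp	(d1.w, a3)		; 10  jump to the combined handler
	;Total				38 cycles

If we grabbed two nops then we'd be done in 46 cycles which is about 61% of full speed, though some instruction combos might incur additional overhead as we'd need to do some decoding to keep the code from bloating out too much.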
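The percentages for the four dispatch variants all come from the same ratio, which the following Python sketch spells out. The 14-cycle budget, the 8-cycle return jump and the per-variant dispatch costs are taken from the post as stated rather than re-derived from the Motorola timing tables. The variant labels are just descriptive names.

```python
GUEST_BUDGET = 14   # 68K cycles available for the shortest guest instruction
RETURN_JUMP = 8     # cycles to jump back to the dispatcher


def relative_speed(dispatch_cycles: int, instructions: int = 1) -> float:
    """Fraction of full speed when emulating `instructions` free nops."""
    return instructions * GUEST_BUDGET / (dispatch_cycles + RETURN_JUMP)


variants = {
    "jump table": relative_speed(38),
    "shift by 4": relative_speed(36),
    "pre-expanded words": relative_speed(18),
    "paired fetch": relative_speed(38, instructions=2),
}

if __name__ == "__main__":
    for name, speed in variants.items():
        print(f"{name:20s} {speed:.0%}")
```

These are best-case numbers: they charge nothing for the emulated instruction itself, so real handlers would pull every figure down further.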

A dynarec would be interesting. They're generally not considered appropriate when timing is important, but we've already kind of given up on any attempt at timing accuracy so it's probably no worse than an interpreter in our case.

TmEE co.(TM)
Very interested
Posts: 2440
Joined: Tue Dec 05, 2006 1:37 pm
Location: Estonia, Rapla City

Post by TmEE co.(TM) » Wed Feb 14, 2007 8:19 am

Where do you put the recompiled code? There isn't very much RAM in the MCD.

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Wed Feb 14, 2007 8:37 am

Mask of Destiny wrote:I've done some cycle counting and it's not pretty. [...] A dynarec would be interesting. They're generally not considered appropriate when timing is important, but we've already kind of given up on any attempt at timing accuracy so it's probably no worse than an interpreter in our case.
Yep, with a simple interpreter there is no way to get anything close to full speed :-/
A dynarec could be a solution. The Sega CD RAM is probably large enough to hold the compiled SMS, NES or GB code, and anyway we can limit the size of the generated code (the more you limit it, the more recompilation you need). But the problem is the recompiler itself... it's complex stuff to write, and I don't know if the recompiler plus the compiled code can fit in the available Sega CD RAM :-/ and that's ignoring resource data such as video and sound...
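The trade-off between cache size and recompilation can be illustrated with a small Python sketch of a bounded translation cache. Everything here is hypothetical: `TranslationCache`, the flush-everything eviction policy and the fixed 100-byte block size are illustration choices, and a real recompiler would emit 68K code rather than placeholder bytes.

```python
class TranslationCache:
    """Bounded cache of translated blocks, keyed by guest PC."""

    def __init__(self, capacity_bytes, translate):
        self.capacity = capacity_bytes
        self.translate = translate      # callable: guest PC -> code bytes
        self.blocks = {}
        self.used = 0
        self.recompilations = 0

    def lookup(self, pc):
        block = self.blocks.get(pc)
        if block is None:
            block = self.translate(pc)
            self.recompilations += 1
            if self.used + len(block) > self.capacity:
                # Simplest policy: throw away everything and start over.
                self.blocks.clear()
                self.used = 0
            self.blocks[pc] = block
            self.used += len(block)
        return block


def count_recompilations(capacity_bytes, trace, block_size=100):
    """Run a trace of guest PCs through a cache and count translations."""
    cache = TranslationCache(capacity_bytes, lambda pc: bytes(block_size))
    for pc in trace:
        cache.lookup(pc)
    return cache.recompilations
```

For the trace `[0, 1, 2, 0]`, a 1000-byte cache translates each of the three blocks once, while a 250-byte cache has to flush and translate block 0 a second time.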

Fonzie
Genny lover
Posts: 323
Joined: Tue Aug 29, 2006 11:17 am

Post by Fonzie » Wed Feb 14, 2007 11:21 am

I think that 512 KB (or 512 KB minus the emulator) is quite good...
I wonder if it's possible to use the Sega CD communication table to fake the I/O (the VDP ports of the NES).
I mean, if the main CPU runs at full speed polling the I/O, there may be no lag.

The flip/flop 128KB RAM can be used to store some additional stuff for the sub CPU
and the emulator for the main CPU...

:D Btw, it seems I awakened some of your old demons, MoD ;P

Also, why emulate the nops? Just skipping them wouldn't damage the code, would it?

ElBarto
Very interested
Posts: 160
Joined: Wed Dec 13, 2006 10:29 am

Post by ElBarto » Wed Feb 14, 2007 2:02 pm

Fonzie wrote:Also, why emulate the nops? Just skipping them wouldn't damage the code, would it?
Nops are sometimes used for timing, so you have to emulate them.

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Wed Feb 14, 2007 2:17 pm

ElBarto wrote:
Fonzie wrote:Also, why emulate the nops? Just skipping them wouldn't damage the code, would it?
Nops are sometimes used for timing, so you have to emulate them.
They can be emulated just for timing purposes. In an interpreter that still eats time, but not in a dynarec. With a dynarec we can also do more complex code analysis to remove wait loops and stuff like that :)
Doing a dynarec is a very interesting project IMO :)
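A minimal Python sketch of the "timing only" treatment of nops in a recompiler: the cycle cost of every instruction in a block is summed at translation time, and nops emit no work at all. The three-opcode toy instruction set and the `compile_block` helper are assumptions for illustration. The 2-cycle costs are those of the 6502's NOP, INX and DEX.

```python
CYCLES = {"nop": 2, "inx": 2, "dex": 2}   # 6502 cycle counts


def compile_block(ops):
    """Translate a straight-line block; nops survive only as cycle counts."""
    total_cycles = sum(CYCLES[op] for op in ops)   # paid once, at compile time
    work = [op for op in ops if op != "nop"]       # nops emit no code

    def run(state):
        for op in work:
            if op == "inx":
                state["x"] = (state["x"] + 1) & 0xFF
            elif op == "dex":
                state["x"] = (state["x"] - 1) & 0xFF
        state["cycles"] += total_cycles            # one addition per block

    return run, len(work)


if __name__ == "__main__":
    block, emitted = compile_block(["nop", "inx", "nop", "nop", "inx"])
    state = {"x": 0, "cycles": 0}
    block(state)
    print(state, emitted)
```

The emulated clock still advances by the full 10 cycles for that block, but only the two `inx` operations cost host time when it runs.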
