I've done some cycle counting and it's not pretty. If I do a standard 16-bit displacement style jump table, it takes 38 cycles just to read in an opcode and jump to the code that emulates it, plus at least another 8 cycles to jump back for the next instruction. Here's the code:
Code:
;assume a0 is PC, a2 points to jump table, a1 is base offset for emulation code
eor.w d0, d0; 4
move.b (a0)+, d0; 8
add.w d0, d0; 4
move.w (a2, d0.w), d0; 12
jmp (d0.w, a1); 10
;Total 38 cycles
So in total that's 46 cycles, only about 30% of the speed of the real thing.
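To make the table lookup concrete, here is a rough C model of the address arithmetic those five instructions perform. The function name and the addresses in the usage note are made up for illustration. The one hardware detail it leans on is that the 68000 sign-extends a word-sized index register, so every handler has to sit within about 32 KB either side of the base held in a1.

```c
#include <stdint.h>

/* Hypothetical model of the 16-bit displacement jump table.
   code_base plays the role of a1, table the role of the data at (a2). */
uint32_t table_dispatch_target(uint32_t code_base,
                               const int16_t table[256],
                               uint8_t opcode)
{
    /* add.w d0, d0 doubles the opcode because each table entry is 2 bytes;
       indexing an int16_t array does the same scaling here. */
    int16_t displacement = table[opcode];

    /* jmp (d0.w, a1) sign-extends the word before adding it to the base,
       so negative displacements reach handlers below code_base. */
    return code_base + (uint32_t)(int32_t)displacement;
}
```

For example, with a base of 0x10000, a table entry of 0x0120 lands at 0x10120 and an entry of -0x20 lands at 0xFFE0.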
If we assume that we can fit most of our emulated instructions (or at least the ones that need to be fastest anyway) in 16 bytes each, we can save two cycles by using the following:
Code:
eor.w d0, d0; 4
move.b (a0)+, d0; 8
lsl.w #4,d0; 14
jmp (d0.w, a1); 10
;Total 36 cycles
That's not much better. Once we add in the final jmp we're at 44 cycles, only ~32% of full speed.
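A small C sketch of what the shift buys: the opcode itself becomes the offset, so there is no table at all, only 256 fixed slots. The names are hypothetical, and the sketch assumes anything longer than 16 bytes branches out of its slot to finish elsewhere.

```c
#include <stdint.h>

enum {
    SLOT_BYTES   = 16,   /* lsl.w #4 multiplies the opcode by 16 */
    OPCODE_COUNT = 256   /* one slot per possible opcode byte */
};

/* Address the 68000 would jump to for a given opcode; code_base is a1. */
uint32_t slot_target(uint32_t code_base, uint8_t opcode)
{
    /* The largest offset is 0xFF * 16 = 0xFF0, so the word index stays
       positive and sign extension never comes into play. */
    return code_base + (uint32_t)opcode * SLOT_BYTES;
}

/* Total size of the slot area that has to be reserved. */
uint32_t slot_area_bytes(void)
{
    return (uint32_t)OPCODE_COUNT * SLOT_BYTES;
}
```

The whole slot area is only 4096 bytes, which is why the 16-byte limit is the real constraint here rather than memory.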
If we waste some RAM (and thereby limit ourselves to smaller carts), we can cheat a bit. If we zero out RAM and only write data to even addresses, we can do something like this:
Code:
move.w (a0)+, d0; 8
jmp (d0.w, a1); 10
;Total 18 cycles
So that's 26 cycles with the return jump, which gets us up to about 54% of full speed (for a nop).
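One consequence of this trick is worth spelling out. The 68000 is big-endian, so an opcode at an even address followed by a zero byte reads back as opcode * 256, and the word index is sign-extended. As I read the scheme, that puts the handlers 256 bytes apart, with opcodes 0x80 to 0xFF landing below a1. The C model below is only a sketch of that arithmetic, with hypothetical names.

```c
#include <stdint.h>

/* Signed offset produced by move.w (a0)+, d0 when the opcode sits in the
   high byte of the word and the padding byte after it is zero. */
int32_t padded_index(uint8_t opcode)
{
    uint16_t word = (uint16_t)(opcode << 8);

    /* Mirror the sign extension applied to d0.w by jmp (d0.w, a1). */
    return word >= 0x8000 ? (int32_t)word - 0x10000 : (int32_t)word;
}

/* Address the 68000 would jump to; code_base is a1. */
uint32_t padded_target(uint32_t code_base, uint8_t opcode)
{
    return code_base + (uint32_t)padded_index(opcode);
}
```

So a1 would need to point at the middle of a 64 KB handler region: opcode 0x7F maps to +32512 and opcode 0x80 maps to -32768.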
If we grab two instructions at once we could use something like the following:
Code:
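;assume a0 is PC, a2 points to jump table, a3 is base offset for emulation code
;assume the upper word of d0 is already zero, since add.l uses all 32 bits
move.w (a0)+, d0; 8
move.l a2, a1; 4
add.l d0, a1; 8
move.w (a1), d1; 8
jmp (d1.w, a3); 10
;Total 38 cycles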
If we grabbed two nops, then we'd be done in 46 cycles, which is about 61% of full speed, though some instruction combos might incur additional overhead as we'd need to do some decoding to keep the code from bloating out too much.
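For anyone checking the arithmetic: the four percentages quoted so far all line up with a budget of about 14 host cycles per emulated nop. That budget is inferred from the numbers and is not a measured value, so treat the sketch below as a consistency check only. The names are hypothetical.

```c
#include <stdio.h>

/* Assumption: roughly 14 host cycles are available per emulated nop.
   This is inferred from the quoted percentages, not measured. */
enum { NOP_BUDGET_CYCLES = 14 };

/* Rounded percentage of full speed when `ops` emulated nops
   cost `host_cycles` cycles in total, return jump included. */
int speed_percent(int ops, int host_cycles)
{
    return (100 * NOP_BUDGET_CYCLES * ops + host_cycles / 2) / host_cycles;
}

/* Prints the four cases discussed in the text. */
void print_speed_table(void)
{
    printf("16-bit jump table: %d%%\n", speed_percent(1, 46));
    printf("16-byte slots:     %d%%\n", speed_percent(1, 44));
    printf("padded RAM:        %d%%\n", speed_percent(1, 26));
    printf("two nops at once:  %d%%\n", speed_percent(2, 46));
}
```

With that budget the four cases come out at 30, 32, 54 and 61 percent, matching the figures above.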
A dynarec would be interesting. They're generally not considered appropriate when timing is important, but we've already kind of given up on any attempt at timing accuracy, so it's probably no worse than an interpreter in our case.