Gameboy on 32X

Chilly Willy · Post by **Chilly Willy** » Mon Feb 23, 2009 7:45 am

While the GB might be slow, certain things like the flags are going to take time, so best to save those cycles anywhere else you can. Fortunately you have another processor for handling graphics. Just think if you had to do the gfx as well as the CPU emulation on the Master SH2.

Does the GB do a lot of line-oriented effects? It might be worth adding an option to sync once a frame instead of every line, for games that don't need per-line synchronization.

mic_ · Post by **mic_** » Mon Feb 23, 2009 7:46 am

As for the pending interrupt checks (lines 4181-4238 in cpu.s), they're kinda needed because an IRQ might've occurred earlier while the corresponding IE bit was 0 or IME was 0. But if you're willing to trade some accuracy for speed you could move those checks out of the main loop and put them right before the loop starts, or right after it finishes. That way the interrupt should still occur, but it will be something on the order of 100 GB-Z80 clocks too late.
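
A rough C sketch of that trade-off, for illustration only. The struct, `service_pending_irq`, `step_one` and `run_slice` are made-up names, not code from cpu.s, and pushing the old PC is left out. The only hardware facts assumed are the five interrupt bits shared by IF and IE and the vectors at 0x40 + 8 * bit.

```c
#include <stdint.h>

/* Hypothetical, stripped-down CPU state; field names are illustrative. */
typedef struct {
    uint8_t  ime;     /* interrupt master enable: 0 or 1 */
    uint8_t  if_reg;  /* IF (0xFF0F): interrupts that have been requested */
    uint8_t  ie_reg;  /* IE (0xFFFF): interrupts that are enabled */
    uint16_t pc;
} gb_cpu;

/* Dispatch the highest-priority pending interrupt, if any.
   Returns 1 if one was taken. Pushing the old PC is omitted for brevity. */
static int service_pending_irq(gb_cpu *cpu)
{
    uint8_t pending = cpu->if_reg & cpu->ie_reg & 0x1F;
    if (!cpu->ime || !pending)
        return 0;
    for (int bit = 0; bit < 5; bit++) {
        if (pending & (1u << bit)) {
            cpu->if_reg &= (uint8_t)~(1u << bit);  /* acknowledge the request */
            cpu->ime = 0;                          /* interrupts off while servicing */
            cpu->pc = (uint16_t)(0x40 + 8 * bit);  /* jump to the vector */
            return 1;
        }
    }
    return 0;
}

/* Stand-in for executing one opcode; pretends everything costs 4 clocks. */
static int step_one(gb_cpu *cpu) { cpu->pc++; return 4; }

/* Run one time slice with the pending check hoisted out of the inner loop. */
static void run_slice(gb_cpu *cpu, int cycles)
{
    service_pending_irq(cpu);   /* once per slice, not once per opcode */
    while (cycles > 0)
        cycles -= step_one(cpu);
}
```

The cost is the one described above: an interrupt that becomes serviceable in the middle of a slice waits until the next call to `run_slice`.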

Chilly Willy · Post by **Chilly Willy** » Mon Feb 23, 2009 8:25 am

I'm thinking one flag that is set for anything like an interrupt. Then your main loop is: check the cycle count, check the one flag, loop. If the one flag is set, THEN check if it's DI, EI, or IME pending.

As to the cycle count, I was thinking once you find the min, store it, then negate it into r12. As instructions execute and do the ADDCYCLES macro, they'll be incrementing it toward 0, making the check for the cycle end a simple cmp/pz r12. Make that event flag b31 of r13 and the one flag check is now just cmp/pz r13. So your loop comes down to

loop:
 fetch byte
 get vector
 jsr
 cmp/pz r13
 bf do_event
 cmp/pz r12
 bf loop
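
For anyone who finds C easier to read than the pseudo-assembly, here is roughly the same loop as a host-side sketch. The `core` struct and the helper names are assumptions for illustration: `cycles` plays the role of r12, `events` plays the role of r13, and every opcode is pretended to cost 4 clocks.

```c
#include <stdint.h>

typedef struct {
    int32_t  cycles;    /* role of r12: -(slice length), counts up toward 0 */
    int32_t  events;    /* role of r13: bit 31 set means "something needs attention" */
    uint32_t executed;  /* opcodes executed, only here so the sketch is observable */
    uint32_t handled;   /* events handled, likewise */
} core;

/* Setting bit 31 makes the value negative, so the check is a plain sign test. */
static void raise_event(core *c) { c->events |= INT32_MIN; }

/* Stand-in for fetch + dispatch; the ADDCYCLES equivalent is the += 4. */
static void exec_opcode(core *c) { c->cycles += 4; c->executed++; }

/* Stand-in for do_event: clear the flag and note that it ran. */
static void do_event(core *c) { c->events &= INT32_MAX; c->handled++; }

static void run_slice(core *c, int32_t slice_cycles)
{
    c->cycles = -slice_cycles;      /* negate once, on entry */
    do {
        exec_opcode(c);
        if (c->events < 0)          /* like cmp/pz r13 ; bf do_event */
            do_event(c);
    } while (c->cycles < 0);        /* like cmp/pz r12 ; bf loop */
}
```

Both checks reduce to "is this register negative?", which is why each one costs a single compare on the SH2.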

Oh, another thing that would speed things up - don't look up the read function every single loop, just on entry. Code is not going to "wander" from one zone into another. It'll change in specific places, like jumps and the like. Even branches won't change zones (highly improbable). So look up the read function before the main loop, and store it in r14. Then the fetch byte part of the code above is just jsr @r14. To handle the case where the code COULD change zones (like a jump), just set the event flag in the jump opcode code, then in do_event, if there was no int, just jump back into the loop a little further back where you look up the read function and store it in r14. That gives just about the tightest main loop possible.
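
A small C sketch of the cached-read-function idea, under simplifying assumptions: only two zones exist, only NOP (0x00), JP nn (0xC3) and HALT (0x76) are decoded, and all names (`lookup_read_fn`, `run`, the `lookups` counter) are hypothetical. The `fetch` variable stands in for r14.

```c
#include <stdint.h>

typedef uint8_t (*read_fn)(uint16_t addr);

static uint8_t rom[0x8000];   /* zone 1: 0x0000-0x7FFF */
static uint8_t ram[0x8000];   /* zone 2: 0x8000-0xFFFF, flattened for the sketch */
static unsigned lookups;      /* how often the zone lookup ran */

static uint8_t read_rom(uint16_t a) { return rom[a & 0x7FFF]; }
static uint8_t read_ram(uint16_t a) { return ram[a & 0x7FFF]; }

/* The "expensive" lookup that should not run on every fetch. */
static read_fn lookup_read_fn(uint16_t pc)
{
    lookups++;
    return pc < 0x8000 ? read_rom : read_ram;
}

/* Runs until HALT or max_ops opcodes; returns the final PC. */
static uint16_t run(uint16_t pc, unsigned max_ops)
{
    read_fn fetch = lookup_read_fn(pc);   /* looked up once, on entry */
    int zone_event = 0;

    while (max_ops--) {
        uint8_t op = fetch(pc++);
        if (op == 0xC3) {                 /* JP nn: may land in another zone */
            uint16_t lo = fetch(pc);
            uint16_t hi = fetch((uint16_t)(pc + 1));
            pc = (uint16_t)(lo | (hi << 8));
            zone_event = 1;               /* ask for a re-lookup */
        } else if (op == 0x76) {          /* HALT ends the sketch */
            break;
        }
        if (zone_event) {                 /* the do_event path */
            fetch = lookup_read_fn(pc);
            zone_event = 0;
        }
    }
    return pc;
}
```

With a program that executes two NOPs, jumps to 0xC000 and halts there, the lookup runs twice in total instead of once per fetched byte.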

mic_ · Post by **mic_** » Mon Feb 23, 2009 9:43 am

In my DS emulator I took that one step further and didn't even bother calling a function to fetch the instruction word. Instead I would map the ARM program counter to a host address and just read the instructions using that pointer, offsetting it for every instruction executed, and invalidating the mapping only when a branch was taken (and some other cases, like when an interrupt occurred). I'm not sure if that would be practical for the Gameboy because of how its memory space is organized, but bypassing _mem_read_byte sounds doable, and would cut the code path down by 8 instructions.
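
A sketch of what the PC-to-host-pointer mapping could look like for the two Gameboy ROM regions, in C. Everything here is an assumption for illustration (`fetch_state`, `map_pc`, `fetch_byte`, `current_pc`); it only covers the fixed bank at 0x0000-0x3FFF and the switchable bank at 0x4000-0x7FFF, and reports failure for anything else so the caller can fall back to the normal read function.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint8_t *rom;          /* whole cartridge image */
    unsigned       bank;         /* ROM bank currently mapped at 0x4000-0x7FFF */
    const uint8_t *host_pc;      /* host address of the next opcode byte */
    const uint8_t *region_host;  /* host address where the current region starts */
    uint16_t       region_start; /* GB address where that region starts */
} fetch_state;

/* Re-establish the mapping; call after jumps, interrupts and bank switches.
   Returns 0 if pc is outside ROM and the slow path must be used. */
static int map_pc(fetch_state *fs, uint16_t pc)
{
    if (pc < 0x4000) {
        fs->region_start = 0x0000;
        fs->region_host  = fs->rom;
    } else if (pc < 0x8000) {
        fs->region_start = 0x4000;
        fs->region_host  = fs->rom + (size_t)fs->bank * 0x4000;
    } else {
        return 0;
    }
    fs->host_pc = fs->region_host + (pc - fs->region_start);
    return 1;
}

/* The fast path: no function lookup, just a pointer bump. */
static uint8_t fetch_byte(fetch_state *fs) { return *fs->host_pc++; }

/* Recover the GB program counter when something needs it (e.g. a CALL). */
static uint16_t current_pc(const fetch_state *fs)
{
    return (uint16_t)(fs->region_start + (fs->host_pc - fs->region_host));
}
```

The sketch does not handle code that runs straight off the end of one region into the next without a jump, which is one of the awkward cases the Gameboy memory layout creates.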

One other thing I've considered is adding functions for accessing the memory in 16-bit units, since it's a fairly common scenario that an instruction wants to read/write a 16-bit value. My main concern with that is that all the extra code might make it not fit in cache anymore (at least in the 2kB area where I'm explicitly caching stuff).
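
For reference, the simplest form of such 16-bit accessors, sketched in C over a flat 64kB array. The names are placeholders and not the emulator's real handlers. The bytes are combined by hand because the GB CPU is little-endian while the SH2 is big-endian, so a plain host word read would give the wrong byte order.

```c
#include <stdint.h>

static uint8_t mem[0x10000];  /* flat stand-in for the GB address space */

static uint8_t mem_read_byte(uint16_t a)             { return mem[a]; }
static void    mem_write_byte(uint16_t a, uint8_t v) { mem[a] = v; }

/* Little-endian 16-bit read: low byte first, address wraps at 0xFFFF. */
static uint16_t mem_read_word(uint16_t a)
{
    uint16_t lo = mem_read_byte(a);
    uint16_t hi = mem_read_byte((uint16_t)(a + 1));
    return (uint16_t)(lo | (hi << 8));
}

/* Little-endian 16-bit write. */
static void mem_write_word(uint16_t a, uint16_t v)
{
    mem_write_byte(a, (uint8_t)(v & 0xFF));
    mem_write_byte((uint16_t)(a + 1), (uint8_t)(v >> 8));
}
```

Written this way the word functions are thin wrappers, so they add little code; a real speedup would need one zone lookup per word instead of two, and that extra code is where the cache-size concern comes in.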

As for a "sync less" option, I guess I could try that and see what happens. The only variants I've tried are syncing on every scanline, and never syncing (which didn't work so well).

Chilly Willy · Post by **Chilly Willy** » Mon Feb 23, 2009 10:01 am

I think right now the best bet would be the optimizations I suggested for the main loop, and the sync each vblank. You might not need the word optimization after that. At least the other stuff is doable pretty quickly and will give you a better idea of whether more optimization is needed.

mic_ · Post by **mic_** » Mon Feb 23, 2009 10:11 am

Chilly Willy wrote:
and the sync each vblank

I just tried that and it wasn't pretty. Anything that moves looks like it's been run through a noise filter. One problem might be that the slave gets the scanline number from the main SH2, so if the main keeps adding draw commands without waiting for the slave to be ready, some scanlines might be skipped. I could try it with the slave keeping track of the scanline number internally later..

mic_ · Post by **mic_** » Mon Feb 23, 2009 10:13 am

...the speed was excellent though - at least when playing Super Mario Land in Fusion, I have no idea how it'd perform on HW since I don't have my 32X or Megacart handy.

Chilly Willy · Post by **Chilly Willy** » Mon Feb 23, 2009 10:24 am

Well, you could either do all the lines, or you could set a bitmask where each bit represents a line asked to be drawn. I'm not sure if your code skips lines or just simply does them all. Haven't looked at that part.
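
A minimal C sketch of the bitmask variant, assuming the 144 visible GB scanlines and made-up names (`request_line`, `take_next_line`). It ignores the synchronization a real two-CPU setup would need, since both SH2s would be touching the same words.

```c
#include <stdint.h>

#define GB_LINES 144   /* visible scanlines on the GB */

/* One bit per scanline: set by the producer, cleared by the consumer. */
static uint32_t line_mask[(GB_LINES + 31) / 32];

/* Main CPU side: ask for a line to be drawn. */
static void request_line(unsigned line)
{
    line_mask[line >> 5] |= (uint32_t)1 << (line & 31);
}

/* Drawing side: return the lowest requested line and clear its bit,
   or -1 if nothing is waiting. */
static int take_next_line(void)
{
    for (unsigned line = 0; line < GB_LINES; line++) {
        uint32_t bit = (uint32_t)1 << (line & 31);
        if (line_mask[line >> 5] & bit) {
            line_mask[line >> 5] &= ~bit;
            return (int)line;
        }
    }
    return -1;
}
```

Because a request is a bit rather than a single "current line" value, a line asked for while the drawing side is busy is not lost.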

mic_ · Post by **mic_** » Mon Feb 23, 2009 3:16 pm

mic_ wrote:
I could try it with the slave keeping track of the scanline number internally later..

Well, I did that. And then I split the command longword into two words so that the main SH2 can add a new command while the slave SH2 is executing another one (actually it can add two, because I also changed the command passing a bit). But already with that amount of out-of-sync-ness it started getting a bit unstable (it looks bad in games that change the horizontal scrolling during HBLANK to get different amounts of scrolling for different parts of the screen).

Chilly Willy · Post by **Chilly Willy** » Mon Feb 23, 2009 9:57 pm

What we did for the Atari emu was make an array for parameters that could be changed on a line basis. The "default" value would be set at the start of each line, and the CPU could override it by setting it directly. Then when the screen was drawn (once at the end of an entire frame), the drawing routine would simply pull the value from the array. It was a huge time-saver while still allowing on-the-fly effects.
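
Applied to the GB horizontal scroll register, the per-line array might look something like this C sketch. This is one reading of the description, not the Atari emulator's code: the "default" is taken to be the register's current value, and all names (`begin_line`, `write_scx`, `scx_for_line`) are hypothetical.

```c
#include <stdint.h>

#define GB_LINES 144   /* visible scanlines on the GB */

static uint8_t  scx_reg;                /* current value of the scroll register */
static uint8_t  scx_by_line[GB_LINES];  /* value that was in effect on each line */
static unsigned cur_line;

/* Called by the CPU core at the start of every scanline:
   record the default, i.e. whatever the register holds right now. */
static void begin_line(unsigned line)
{
    cur_line = line;
    scx_by_line[line] = scx_reg;
}

/* Called when the emulated program writes the register mid-frame:
   update the register and override the entry for the current line. */
static void write_scx(uint8_t value)
{
    scx_reg = value;
    scx_by_line[cur_line] = value;
}

/* Called by the renderer, once per line, when the whole frame is drawn. */
static uint8_t scx_for_line(unsigned line) { return scx_by_line[line]; }
```

The renderer never needs to run in step with the CPU; it only needs the array to be complete by the time the frame is drawn.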

The Atari emu had several of these because a variety of things could be changed, but the GB shouldn't have nearly as many to be concerned with. We had a queue for the POKEY - when the CPU stored to the POKEY, we recorded the CPU cycle count, register, and register value. The sound buffer fill code would then pull entries from the queue to generate near-perfect audio.
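
The register-write queue could be sketched in C along these lines. The layout and names (`reg_write`, `queue_write`, `pop_write_until`) are assumptions for illustration rather than the Atari emulator's actual code; the queue is a plain ring buffer with free-running head and tail counters.

```c
#include <stdint.h>

#define QUEUE_SIZE 256u   /* a power of two, so the counters can wrap freely */

typedef struct {
    uint32_t cycle;   /* CPU cycle count at the time of the write */
    uint8_t  reg;     /* which sound register was written */
    uint8_t  value;   /* the value written */
} reg_write;

static reg_write queue[QUEUE_SIZE];
static unsigned  head, tail;   /* free-running; slot = counter % QUEUE_SIZE */

/* CPU side: record a write. Returns 0 if the queue is full. */
static int queue_write(uint32_t cycle, uint8_t reg, uint8_t value)
{
    if (head - tail == QUEUE_SIZE)
        return 0;
    reg_write *w = &queue[head++ % QUEUE_SIZE];
    w->cycle = cycle;
    w->reg   = reg;
    w->value = value;
    return 1;
}

/* Sound side: pop the oldest write if it happened at or before `cycle`.
   Returns 0 when no such write is waiting. */
static int pop_write_until(uint32_t cycle, reg_write *out)
{
    if (head == tail || queue[tail % QUEUE_SIZE].cycle > cycle)
        return 0;
    *out = queue[tail++ % QUEUE_SIZE];
    return 1;
}
```

The buffer fill code would call `pop_write_until` with the cycle count that corresponds to the sample it is about to generate, apply each returned write, then render the sample.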

Snake · Post by **Snake** » Tue Feb 24, 2009 7:51 am

I noticed you are doing some sort of cache purge every scanline. Is this really needed? That's probably going to be slow.

Also, you're doing something with the palette every scanline too. Again, is this needed?

mic_ · Post by **mic_** » Tue Feb 24, 2009 8:25 am

Snake wrote:
I noticed you are doing some sort of cache purge every scanline. Is this really needed? That's probably going to be slow.

The slave doesn't use cache-through when reading from the GB VRAM, so I'm making sure that before it starts drawing a scanline it'll get the most current VRAM data. Of course, I probably don't need to purge all lines, but I haven't looked into which ones would be needed.

Snake wrote:
Also, you're doing something with the palette every scanline too. Again, is this needed?

The palette could've changed since the previous scanline, so yes. And I hardly think this has much of an impact, since it's just setting up a 12-byte array that I've explicitly put in cache.

Chilly Willy · Post by **Chilly Willy** » Tue Feb 24, 2009 9:31 am

I tried moving the cache purge to gui_present so it's only done once a frame. I only use an associative purge to clear the line the IOREGS palette entries occupy in the line code. On an emulator, none of this cache stuff makes any difference at all - it's only real hardware that would show a difference. In this case, it seems to make almost no discernible difference on real hardware. The way I check is to time how long it takes in DigDug from the screen showing until the first baddie hops off the left side of the screen. In Gens/GS, that's about 20 seconds. On real hardware (with my changes), it's about 22 seconds. It MIGHT be 23 when purging the cache every line. The timing is so close it's hard to tell, but it's clearly making very little difference. The cache IS being used, since not purging the cache at least once per frame results in a blank display on real hardware.

mic_ · Post by **mic_** » Tue Feb 24, 2009 10:19 am

And just to clarify my previous post: I don't really care about the purging since the slave only reads from the GB VRAM. What I'm really after is the invalidation of the cache lines, so that it won't get any cache hits on the GB VRAM from the previous call to ppu_draw_scanline.

bastien · Post by **bastien** » Wed Apr 01, 2009 3:56 pm

Hi,
Any news about this project?

Can't wait for a release.