Code benchmarking techniques?

djcouchycouch · Post by **djcouchycouch** » Mon Mar 12, 2012 1:41 pm

Hi,

What are the recommended tools and techniques used to benchmark C code on the Megadrive? Does the 68000 provide any timers or registers to help?

Thanks!
DJCC

TmEE co.(TM) · Post by **TmEE co.(TM)** » Mon Mar 12, 2012 2:39 pm

Just change the backdrop color for a segment of code and see how much time it takes. It won't work on emulators, though, but you get very accurate readings from real hardware tests.

djcouchycouch · Post by **djcouchycouch** » Tue Mar 13, 2012 1:31 pm

Let's say I want to benchmark a while loop in my main game loop. How would your suggested technique work? If I understand it, the color would change only for a fraction of a second, which would be hard to time manually.

djcouchycouch · Post by **djcouchycouch** » Tue Mar 13, 2012 1:48 pm

One idea I've got is to find the code I want to benchmark and set it to run a thousand or ten thousand times, and have something visible happen when it's done. Because it's looping many, many times, the amount of time it takes should be measurable with a stopwatch. Not the most friendly way, but it might be possible to use in some cases.

TmEE co.(TM) · Post by **TmEE co.(TM)** » Tue Mar 13, 2012 1:48 pm

You change the color before the loop and again after it, and you will get a visual indication of how much of the frame the code takes. You don't have many other options besides cycle counting. There are no timers you can set and read back.
You could count how many lines the operation takes, but that is far less accurate than cycle counting or the color change method. Line counting will work in emulators, though.
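
A minimal sketch of the color change method in C, not taken from any post in this thread. It assumes the backdrop is still CRAM entry 0 (VDP register 7 left at palette 0, color 0) and that the compiler defines `__m68k__` when targeting the 68000; on any other target the VDP ports are replaced by plain variables so the sketch still compiles. The names `set_backdrop`, `profile_with_backdrop` and `sample_work` are hypothetical.

```c
#include <stdint.h>

/* On the console the VDP ports are memory mapped. On a host build they are
   replaced by plain variables (assumption: GCC-style __m68k__ macro). */
#if defined(__m68k__)
#define VDP_DATA (*(volatile uint16_t *)0xC00000)
#define VDP_CTRL (*(volatile uint32_t *)0xC00004)
#else
static volatile uint16_t VDP_DATA;
static volatile uint32_t VDP_CTRL;
#endif

/* VDP command: CRAM write starting at color entry 0. */
#define CRAM_WRITE_ENTRY0 0xC0000000u

/* Colors are stored as 0000BBB0GGG0RRR0. */
#define COLOR_BLACK 0x0000u
#define COLOR_RED   0x000Eu

/* Hypothetical helper: overwrite CRAM entry 0. This is the backdrop only
   while VDP register 7 selects palette 0, color 0. */
void set_backdrop(uint16_t color)
{
    VDP_CTRL = CRAM_WRITE_ENTRY0; /* two word writes on the 68000 */
    VDP_DATA = color;             /* third write: the color itself */
}

/* Hypothetical wrapper: the red band on screen covers the scanlines
   during which work() was running. */
void profile_with_backdrop(void (*work)(void))
{
    set_backdrop(COLOR_RED);
    work();
    set_backdrop(COLOR_BLACK);
}

/* Hypothetical stand-in for the code being measured. */
static volatile uint32_t work_calls;
void sample_work(void)
{
    work_calls++;
}
```

Calling `profile_with_backdrop(sample_work)` once per frame should give a steady red band whose height tracks the cost of the call. Code that runs entirely inside VBlank would leave no visible band.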

Stef · Post by **Stef** » Tue Mar 13, 2012 6:38 pm

There are some timer functions in SGDK (timer.h) which can be used to profile your code, with some limits though:

// return elapsed subticks from console reset (1/76800 second based)
// WARNING : this function isn't accurate because of the VCounter rollback
u32  getSubTick();
// return elapsed ticks from console reset (1/300 second based)
u32  getTick();
// return elapsed time from console reset (1/256 second based)
u32  getTime(u16 fromTick);
// return elapsed time from console reset as fix32 number (in second)
fix32 getTimeAsFix32(u16 fromTick);

// start internal timer (0 <= numtimer < MAXTIMER)
void startTimer(u16 numTimer);
// get elapsed subticks from last startTimer(numTimer)
u32  getTimer(u16 numTimer, u16 restart);

As you can see, getSubTick() is not very accurate because of the hardware V counter rollback feature... so as soon as you use functions relying on subticks you have to take care of that. getTick() is accurate and can be used safely, though.

djcouchycouch · Post by **djcouchycouch** » Wed Mar 14, 2012 2:04 am

I'm calling getSubTick() before and after the code I'm trying to benchmark and I'm getting numbers! And when I optimize the code the numbers get smaller!

So it seems to be working for me. Thanks!

djcouchycouch · Post by **djcouchycouch** » Wed Mar 14, 2012 2:37 am

Maybe I spoke too soon.

To get a general idea of how my main loop performs, I call getTick() at the beginning (starttime) and the end (endtime). But if I do endtime - starttime I get zero. I also get zero if I use getSubTick(). But if I call getSubTick() before and after some calls to VDP_setTileMap(), I get a time.

My game loop is in a while loop in the main function.

// totally simplified:
int main()
{
    while(1)
    {
        VDP_waitVSync();
        VDP_resetSprites();
        u32 starttime = getSubTick();

        /* Do all my game stuff here. Incredibly unoptimized and ugly */

        u32 endtime = getSubTick();

        char outputString[16];
        uintToStr(endtime - starttime, outputString, 16);
        VDP_drawText(outputString, 5, 6); // prints a whole bunch of zeros.
    }
}

Code like this will give me differences in the start time and end time. This code, for example, updates the bottom row of tiles when the screen gets scrolled vertically downward.

if (Player.speedy > 0) // if the player happens to be pressing down on the DPAD.
{
    arrayStart = endtiley * foreground.width;

    for (loop = starttilex; loop <= endtilex; loop++)
    {
        tileNumber = foreground_background[arrayStart + loop] + FOREGROUND_TILE_STARTINDEX;
        VDP_setTileMap(APLAN, TILE_ATTR_FULL(PAL3, 0, 0, 0, tileNumber), loop % 64, endtiley % 64);
    }
}

Could it be the frequent calls to VDP_setTileMap() ?

Maybe the getTick and getSubTick functions aren't supposed to be called this way?

Stef · Post by **Stef** » Wed Mar 14, 2012 10:05 am

You are using them correctly, but subtick is scanline based (whereas tick is frame based), so if your code takes less than one scanline (< 488 68000 cycles) you will get 0.
Also, if your code fits in a single VBlank period you will get an inaccurate result.

Something you can do is execute your critical code in a 10x or 100x loop so it takes much longer and can be profiled more easily.
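
The arithmetic behind this can be sketched on a host machine. The model below is an assumption, not SGDK code: it treats the timer as counting whole scanlines of 488 68000 cycles (the figure quoted above), and the function names are hypothetical.

```c
#include <stdint.h>

#define CYCLES_PER_SCANLINE 488u /* 68000 cycles per scanline, as quoted above */

/* Model of a scanline-granular timer: only whole scanlines are counted,
   so anything shorter than one scanline reads as 0. */
uint32_t scanlines_elapsed(uint32_t cycles)
{
    return cycles / CYCLES_PER_SCANLINE;
}

/* Estimate the cost of one call from a measurement taken around
   `repeats` back-to-back calls. Loop overhead is not subtracted. */
uint32_t estimate_cycles_per_call(uint32_t measured_scanlines, uint32_t repeats)
{
    return (measured_scanlines * CYCLES_PER_SCANLINE) / repeats;
}
```

Under this model a 100-cycle routine measures as 0 scanlines on its own. Repeated 100 times it spans 20 whole scanlines, and `estimate_cycles_per_call(20, 100)` gives 97 cycles, close to the real cost.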

Nemesis · Post by **Nemesis** » Wed Mar 14, 2012 10:52 pm

It should be possible to make a highly accurate timer using hcounter/vcounter progression, in H32 mode at least. H40 mode is also possible but would take more work. The VCounter jump-back behaviour is a problem though. I don't know of any easy way to compensate for that.

I've got another suggestion though. Is relying on the Z80 an option? The Z80 has an internal "refresh register", which is incremented by the Z80 after each instruction fetch. The refresh counter can be set in code and read out in code (using "LD R,A" and "LD A,R"). Sure, it only has 7 effective bits of precision, but it has the very useful property that it directly corresponds to Z80 instruction register fetch cycles, meaning the Z80 can loop for example and wait for it to pass some magic value, like, say, 0x6F, and then when it does, read it out, add the contents to a larger 16-bit counter, add an adjustment value based on the amount of time all this counter read code took in instruction fetch cycles, reset the counter to 0, and loop again. In this manner, if the adjustment value which is added to the counter is accurate, you could get a counter which can be read without any major interruption in its progression. All you have to do is make the loop interruptable by the M68K, eg, by monitoring some addresses in Z80 RAM, and when the M68K wants to start/stop the counter, it will obtain the bus, and write to those addresses to start/stop the counter, then read it back from Z80 RAM once the Z80 has calculated the total.

Thinking further down this line, you could do this in a much easier way if you don't mind sacrificing a little precision. Leave the Z80 in a loop adding 1 to a 16-bit counter, and checking for a stop flag to be set in memory. When the stop flag is set, write the accumulated value out to memory. You won't get quite as many ticks per Z80 clock cycle, but it'll be more accurate in the sense that an exact known number of Z80 cycles will correspond to each tick.
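
On the 68000 side, the value read back from the simpler counting loop still has to be converted into something comparable with 68000 timings. A rough sketch in C, with hypothetical names: it relies on the 68000 running at the master clock divided by 7 and the Z80 at the master clock divided by 15, and it takes the number of Z80 cycles per loop iteration as a parameter that would have to be counted by hand from the actual Z80 loop.

```c
#include <stdint.h>

/* Difference between two reads of the 16-bit counter, tolerant of a
   single wraparound between the reads. */
uint16_t counter_delta(uint16_t start, uint16_t stop)
{
    return (uint16_t)(stop - start);
}

/* Convert counter ticks to 68000 cycles.
   68000 clock = master / 7, Z80 clock = master / 15, so one Z80 cycle
   lasts 15/7 68000 cycles. z80_cycles_per_tick is an assumption that
   depends on the exact Z80 loop used. */
uint32_t z80_ticks_to_m68k_cycles(uint16_t ticks, uint32_t z80_cycles_per_tick)
{
    uint64_t z80_cycles = (uint64_t)ticks * z80_cycles_per_tick;
    return (uint32_t)((z80_cycles * 15u) / 7u);
}
```

This ignores any time the Z80 spends halted while the 68000 holds its bus, so the start and stop writes themselves add a small error.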

Stef · Post by **Stef** » Wed Mar 14, 2012 11:09 pm

Do you mean you are sacrificing the Z80 to get an accurate timer X'D ??

I can make a better timer by using the H counter as you said. The problem is that the H counter rolls back too...
So I thought about using the HBlank and VBlank flags to detect the rollback, but if the user forces VDP blanking then we are done! Maybe we can assume the timer is not accurate in this case (another problem would be the HV counter latch on level 2 interrupt).

Shiru · Post by **Shiru** » Wed Mar 14, 2012 11:34 pm

Not a big deal to sacrifice the Z80 for a debug build. However, why not simply modify an emulator and add a profiler? I.e. a write to a nonexistent register starts a counter, a write to another register stops it, and the value is displayed on the OSD. It's done this way in a few debugging NES emulators; measurement is done in CPU cycles.

notaz · Post by **notaz** » Thu Mar 15, 2012 4:14 pm

You could also run your code in a loop and increment a counter on each iteration, then register a vsync handler that reads and resets the counter. This way you get a times-per-vsync value that's easy to understand, and you can easily calculate how much time per frame the code uses and how much you have left for other things.
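
A host-side model of this idea, as a sketch only. How the handler gets registered on the console is left out, the names are hypothetical, and the frame length assumes NTSC timing of 262 scanlines at 488 68000 cycles each.

```c
#include <stdint.h>

/* Assumption: NTSC frame of 262 scanlines, 488 68000 cycles per scanline. */
#define CYCLES_PER_FRAME (488u * 262u)

static volatile uint32_t iterations;     /* bumped by the benchmark loop */
static volatile uint32_t last_per_frame; /* snapshot taken at each vsync */

/* Call once after every run of the code under test. */
void count_iteration(void)
{
    iterations++;
}

/* Body of the vsync handler: read the counter, then reset it. */
void on_vsync(void)
{
    last_per_frame = iterations;
    iterations = 0;
}

/* Approximate 68000 cycles used per iteration, including loop overhead.
   Returns 0 if no iteration completed in the last frame. */
uint32_t cycles_per_iteration(void)
{
    return last_per_frame ? CYCLES_PER_FRAME / last_per_frame : 0;
}
```

If four iterations complete between two vsyncs, each one costs roughly a quarter of a frame, about 31964 cycles under these assumptions.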

Chilly Willy · Post by **Chilly Willy** » Thu Mar 15, 2012 4:58 pm

Nemesis wrote: I've got another suggestion though. Is relying on the Z80 an option? The Z80 has an internal "refresh register", which is incremented by the Z80 after each instruction fetch. The refresh counter can be set in code and read out in code (using "LD R,A" and "LD A,R"). Sure, it only has 7 effective bits of precision, but it has the very useful property that it directly corresponds to Z80 instruction register fetch cycles, meaning the Z80 can loop for example and wait for it to pass some magic value, like, say, 0x6F, and then when it does, read it out, add the contents to a larger 16-bit counter, add an adjustment value based on the amount of time all this counter read code took in instruction fetch cycles, reset the counter to 0, and loop again. In this manner, if the adjustment value which is added to the counter is accurate, you could get a counter which can be read without any major interruption in its progression. All you have to do is make the loop interruptable by the M68K, eg, by monitoring some addresses in Z80 RAM, and when the M68K wants to start/stop the counter, it will obtain the bus, and write to those addresses to start/stop the counter, then read it back from Z80 RAM once the Z80 has calculated the total.

Thinking further down this line, you could do this in a much easier way if you don't mind sacrificing a little precision. Leave the Z80 in a loop adding 1 to a 16-bit counter, and checking for a stop flag to be set in memory. When the stop flag is set, write the accumulated value out to memory. You won't get quite as many ticks per Z80 clock cycle, but it'll be more accurate in the sense that an exact known number of Z80 cycles will correspond to each tick.

Funny you mention that - Conleon added Z80 timing to the NeoFlash MD Myth menu quite some time back when he was profiling the SD reading. The difference is he simply incremented HL and stored it to sram and looped. This is actually better than adding the R register since the R register isn't incremented at a set rate; it's incremented at the rate the initial opcode byte is fetched. So it's faster in some parts of the code, and slower in others. Merely incrementing HL once per loop in fixed code always maintains the same rate of incrementing, giving a more accurate result in the end.

Fonzie · Post by **Fonzie** » Mon Mar 26, 2012 7:54 am

I concur with Tiido: the color update trick works all the time, with no need for any code modification. Just put a 1-color change (equivalent to 3 writes to the VDP, I think) in your game loop between each of the functions you want to benchmark and watch what's going on.

Since most games run their "game code" during VACTIVE and transfer data to VRAM during VBLANK, it works for most cases!

It works nicely on emulators too, but it's less precise for very short stuff, as emulators do not emulate "mid-line" color updates. But honestly it never bothered me.

Another method that can be used to benchmark CPU-intensive functions unrelated to gameplay is to just let them loop 10,000 times and see how long it takes in frames. It's so lame, but I had to mention it.