Cycle counting on real hardware

Post by **twosixonetwo** » Thu Aug 07, 2014 3:45 pm

I have been running some tests on real hardware because I was concerned about whether my code runs in the time it should, so I timed some loops. To time them I play a sine wave and record the output at a 96000 Hz sample rate.
I set the volume to max, run my loop, and then set the volume to min again:

Code:

	move.l #7000, %d0       | set loop counter
	move.b #0x00,0xA04001   | unmute sine wave (0x00 = max volume)
countCycles_l1:
	asl.l #2, %d1           | do something
	sub.l #1, %d0           | decrement loop counter
	bne countCycles_l1      | loop until the counter reaches zero
	move.b #0x7F,0xA04001   | mute sine wave (0x7F = min volume)

I ran the following tests:

Code:

  Code lies in | Loop reads | loop iterations
1.   ROM       |   -        |   7000
2.   ROM       |   -        |  14000
3.   ROM       |  ROM       |   7000
4.   ROM       |  RAM       |   7000
5.   RAM       |   -        |   7000
6.   RAM       |   -        |  14000
7.   RAM       |  ROM       |   7000
8.   RAM       |  RAM       |   7000

So I went ahead and counted the cycles with EASy68K and calculated how long the waveforms should be, based on the PAL CPU frequency (7600485.714 Hz):

Code:

   measured (s) | predicted (s) | difference
1. 0.0280625    | 0.027632181   | 1.53%
2. 0.05615625   | 0.055261995   | 1.59%
3. 0.031833333  | 0.031316156   | 1.62%
4. 0.03634375   | 0.035000132   | 3.70%
5. 0.0283125    | 0.027632181   | 2.40%
6. 0.05665625   | 0.055261995   | 2.46%
7. 0.03215625   | 0.031316156   | 2.61%
8. 0.03590625   | 0.035000132   | 2.52%
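
The predicted column and the percentages can be rebuilt with a few lines of arithmetic. The sketch below is a reconstruction, not the original spreadsheet: it assumes 30 cycles per iteration for tests 1, 2, 5 and 6, 34 for tests 3 and 7, 38 for tests 4 and 8, and a fixed 18 cycles outside the loop. These counts are assumptions chosen because they reproduce the predicted times above. The differences only match the table when taken relative to the measured time.

```python
# Sketch: rebuild the "predicted" and "difference" columns of the table.
PAL_CPU_HZ = 7600485.714  # PAL 68000 clock as used in the post

# Assumption: fixed cost outside the loop body, in CPU cycles.
OVERHEAD_CYCLES = 18

# test number -> (iterations, assumed cycles per iteration, measured seconds)
TESTS = {
    1: (7000, 30, 0.0280625),
    2: (14000, 30, 0.05615625),
    3: (7000, 34, 0.031833333),
    4: (7000, 38, 0.03634375),
    5: (7000, 30, 0.0283125),
    6: (14000, 30, 0.05665625),
    7: (7000, 34, 0.03215625),
    8: (7000, 38, 0.03590625),
}


def predicted_seconds(iterations, cycles_per_iteration):
    """Expected run time if every cycle took exactly 1 / PAL_CPU_HZ."""
    total_cycles = iterations * cycles_per_iteration + OVERHEAD_CYCLES
    return total_cycles / PAL_CPU_HZ


def difference_percent(measured, predicted):
    """Shortfall relative to the measured time, in percent."""
    return (measured - predicted) / measured * 100.0


for number, (iterations, cycles, measured) in TESTS.items():
    predicted = predicted_seconds(iterations, cycles)
    diff = difference_percent(measured, predicted)
    print(f"{number}. {measured:.9f} | {predicted:.9f} | {diff:.2f}%")
```

If the per-iteration counts differ from what the assembler actually emitted (for example a different addressing mode or branch size), only the `TESTS` entries need to change.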

So what is interesting about this:

- Everything is slower than it should be.
- The differences between the percentages of tests 1, 2 and 3 are so small that they may be due to measurement inaccuracy. The same goes for tests 5, 6 and 7.
- Code running from RAM is generally slower, but comparing test 4 with test 8, the RAM version is actually faster.

TL;DR / My actual questions:
1. My CPU runs slower than it should, at about 7.48 MHz. Is this likely?
2. There is a speed penalty for using RAM. Does anyone know whether you can actually calculate how big it is?
3. There might be caching or something similar involved. Otherwise I don't see how the difference between test 8 and test 4 could be in this direction.
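
For questions 1 and 2, the measurements can be turned into an effective clock rate and an average number of extra cycles per loop iteration. This is a rough sketch using only the test 1 (ROM) and test 5 (RAM) rows of the table. It assumes the predicted time is exact, so any excess is attributed to wait states, and it says nothing about where those wait states come from.

```python
# Sketch: effective clock and average extra cycles per iteration.
PAL_CPU_HZ = 7600485.714   # nominal PAL 68000 clock from the post
ITERATIONS = 7000

PREDICTED_S = 0.027632181  # tests 1 and 5 share the same prediction
MEASURED_ROM_S = 0.0280625 # test 1: code in ROM
MEASURED_RAM_S = 0.0283125 # test 5: code in RAM


def effective_clock_hz(predicted_s, measured_s):
    """Clock rate that would make the predicted cycle count fit the measured time."""
    return PAL_CPU_HZ * predicted_s / measured_s


def extra_cycles_per_iteration(predicted_s, measured_s, iterations):
    """Average unexplained nominal-clock cycles added to each loop iteration."""
    return (measured_s - predicted_s) * PAL_CPU_HZ / iterations


rom_extra = extra_cycles_per_iteration(PREDICTED_S, MEASURED_ROM_S, ITERATIONS)
ram_extra = extra_cycles_per_iteration(PREDICTED_S, MEASURED_RAM_S, ITERATIONS)

print(f"effective clock (test 1): {effective_clock_hz(PREDICTED_S, MEASURED_ROM_S):.0f} Hz")
print(f"extra cycles per iteration, ROM: {rom_extra:.3f}")
print(f"extra cycles per iteration, RAM: {ram_extra:.3f}")
print(f"RAM penalty per iteration:       {ram_extra - rom_extra:.3f}")
```

A fractional result does not mean fractional wait states. It suggests that only some iterations are delayed, which would fit delays that occur at intervals rather than on every access.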

Note: I measured everything twice to see whether there would be a significant measurement error. The largest difference between my first and second measurements was 0.0000313 seconds, with most differences being smaller.
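
To put that spread in context, it helps to convert it into audio samples and CPU cycles. The sketch below assumes the times were read off the recording at whole-sample resolution, which the 96000 Hz sample rate and the measured values suggest but the post does not state.

```python
# Sketch: resolution of the audio-based measurement.
SAMPLE_RATE_HZ = 96000
PAL_CPU_HZ = 7600485.714

SPREAD_S = 0.0000313        # largest difference between repeated measurements
TEST1_MEASURED_S = 0.0280625


def seconds_to_samples(seconds):
    """Length of an interval in audio samples at the recording rate."""
    return seconds * SAMPLE_RATE_HZ


def cpu_cycles_per_sample():
    """Nominal CPU cycles that elapse during one audio sample."""
    return PAL_CPU_HZ / SAMPLE_RATE_HZ


spread_samples = seconds_to_samples(SPREAD_S)
spread_cycles = spread_samples * cpu_cycles_per_sample()
spread_percent = SPREAD_S / TEST1_MEASURED_S * 100.0

print(f"test 1 length: {seconds_to_samples(TEST1_MEASURED_S):.1f} samples")
print(f"spread: {spread_samples:.2f} samples, about {spread_cycles:.0f} CPU cycles")
print(f"spread relative to test 1: {spread_percent:.2f}%")
```

On these numbers the repeatability is roughly an order of magnitude finer than the 1.5% to 3.7% gaps in the table, so the gaps are unlikely to be pure measurement noise.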

Post by **Mask of Destiny** » Thu Aug 07, 2014 5:10 pm

Some memory accesses take longer due to what are presumably refresh cycles. From what I remember, these happen regardless of whether you're reading from ROM or RAM.

Post by **TmEE co.(TM)** » Thu Aug 07, 2014 6:43 pm

The YM is on the Z80 bus, and bus arbitration is likely playing some role too. If you have the Z80 running there will be a fair bit of decrease in 68K-side performance; VDP accesses will also cause some problems.
RAM has refresh cycles happening at least once per line. ROM area can have refresh enabled through a register which should not be enabled by default.

Post by **twosixonetwo** » Fri Aug 08, 2014 11:36 am

Okay, so refresh cycles might very well be the reason why code execution from RAM is slower than from ROM.

However is the Z80 stealing cycles even when it does nothing? I don't transfer anything to the Z80, and Exodus doesn't show any execution there (it doesn't run through the NOPs so it shouldn't even fetch instructions). Also does the VDP cause slowdown even when I disabled all Interrupts and am not accessing the VRAM/CRAM in any way?

Post by **Mask of Destiny** » Fri Aug 08, 2014 5:51 pm

TmEE co.(TM) wrote:RAM has refresh cycles happening at least once per line. ROM area can have refresh enabled through a register which should not be enabled by default.

I'm pretty sure I've seen delays on ROM access that look like refresh delays. That register may only control whether refresh signals are generated on the cart port and not whether DTACK delays are inserted. I remember being surprised at the time. Of course, that only explains why everything is slower than expected and not why there's a discrepancy between RAM and ROM access.

twosixonetwo wrote:However is the Z80 stealing cycles even when it does nothing?

I think what TmEE is suggesting is that there may be odd delays from accessing the YM-2612 since it's on the Z80's bus. I don't think this is likely to make a big difference since you don't touch it in the main loop, but you could use the PSG instead if you want to avoid that complication.

twosixonetwo wrote:Also does the VDP cause slowdown even when I disabled all Interrupts and am not accessing the VRAM/CRAM in any way?

There should be no VDP induced delays in this scenario.

Could you post the code for cases 3,4 and 8?

Post by **MintyTheCat** » Sat Aug 09, 2014 8:14 pm

Hello twosixonetwo.

I am not clear in my understanding here and it makes me ask myself some questions:

1. Are you interested in the timing for your code to exercise the 68K or to exercise a peripheral such as the YM2612?

2. Are you interested in timing your Z80 code's execution?

In either case you might look at using UMDK in trace mode to time the execution of your code and get an exact value, as opposed to using an emulator: the hardware never lies.

If you want to time the execution of some Commercial MD Game then you can again use Tracing and then disable the code to see what is going on. If you want to step through then set up a Breakpoint then watch the cycles per step.

I need to ask if any of you are interested in getting a UMDK PCB made as a community instead of having individuals having to make them on their own in isolation...

Post by **twosixonetwo** » Sun Aug 10, 2014 11:12 am

Mask of Destiny wrote:Could you post the code for cases 3,4 and 8?

Test 3/7:

Code:

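	| Test 3 (test 7 runs the same loop from RAM)
	move.l #7000, %d0      | set loop counter
	move.b #0x00,0xA04001  | set TL to 0x00 (max volume)
countCycles_l5:
	move.l 0x000000, %d1   | read a longword from ROM
	sub.l #1, %d0          | decrement loop counter
	bne countCycles_l5     | loop until the counter reaches zero
	move.b #0x7F,0xA04001  | set TL to 0x7F (mute)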

Test 4/8:

Code:

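	| Test 4 (test 8 runs the same loop from RAM)
	move.l #7000, %d0      | set loop counter
	move.b #0x00,0xA04001  | set TL to 0x00 (max volume)
countCycles_l7:
	move.l 0xFF0000, %d1   | read a longword from RAM
	sub.l #1, %d0          | decrement loop counter
	bne countCycles_l7     | loop until the counter reaches zero
	move.b #0x7F,0xA04001  | set TL to 0x7F (mute)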

MintyTheCat wrote:1. Are you interested in the timing for your code to exercise the 68K or to exercise a peripheral such as the YM2612?

2. Are you interested in timing your Z80 code's execution?

I am trying to find out how fast my 68k code is being executed and am only using the YM2612 as an output which I can measure. I am not using any Z80 code.
While the possibilities of the UMDK are indeed great, and would allow me to do some much more direct measurements, I am afraid that I currently don't have the money to order one...

Post by **MintyTheCat** » Sun Aug 10, 2014 5:20 pm

twosixonetwo wrote: I am trying to find out how fast my 68k code is being executed and am only using the YM2612 as an output which I can measure. I am not using any Z80 code.
While the possibilities of the UMDK are indeed great, and would allow me to do some much more direct measurements, I am afraid that I currently don't have the money to order one...

Ok, thank you for the clarification. I had assumed that but was not sure.
I would have to look up how long the YM2612 takes to latch values into its registers to work out the added delay. To be honest, if you can, try instead to output on one of the Parallel ports using a pin-toggling bit of code in one of the Interrupts. If you have something like a Scope, Bus-Pirate or Logic-Analyser, you can then see how your code is executing externally and do some timing analysis.

You could even use four pins on one of the Parallel-Ports to signal when each of the Interrupts is activated, leaving two for your own code, which keeps things simple and fast. You only need to take the pin low, keep it low, then raise it once you have finished. You can then watch the pins using a Scope, Logic-Analyser, Bus-Pirate or something else.

Another option: if you save the PC for where your Code is currently executing then you could even consider copying the PC for the task-code and a tick-count to a circular-buffer. Then, according to the limit of the MD's 'Serial-Port' you can print out on the RS232 your tick and PC values. Maybe even add in the H-Int and V-Int tick-counts for comparison with your own code. This will not work if you are using H/V-Interrupt Exceptions as part of your code though...

The MD's Serial-Port has baud rates of 1200, 2400, 4800 and 9600. Use 1200 first to check, and do not attempt to send too much data out of the port all at once; instead 'buffer back' so that you keep a backlog of messages to write out on the port.
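
To get a feel for how much backlog such a buffer would need, here is a small throughput estimate. It is only a sketch: it assumes 8N1 framing (10 bits on the wire per byte) and a hypothetical 8-byte record made of a 4-byte PC and a 4-byte tick count sent as raw binary. Neither detail is specified in this thread.

```python
# Sketch: how fast a serial link can drain a backlog of trace records.
BITS_PER_BYTE_ON_WIRE = 10  # assumption: 1 start + 8 data + 1 stop bit (8N1)
RECORD_BYTES = 8            # hypothetical record: 4-byte PC + 4-byte tick count


def bytes_per_second(baud):
    """Payload bytes per second at the given baud rate."""
    return baud / BITS_PER_BYTE_ON_WIRE


def records_per_second(baud, record_bytes=RECORD_BYTES):
    """Trace records per second the link can carry."""
    return bytes_per_second(baud) / record_bytes


def seconds_to_drain(backlog_records, baud, record_bytes=RECORD_BYTES):
    """Time needed to send a full backlog of records."""
    return backlog_records / records_per_second(baud, record_bytes)


print(f"1200 baud: {bytes_per_second(1200):.0f} bytes/s, "
      f"{records_per_second(1200):.0f} records/s")
print(f"draining 64 records at 1200 baud: {seconds_to_drain(64, 1200):.2f} s")
```

At these rates the link carries far fewer records per second than a tight loop can produce, which is why logging selectively and letting the backlog drain between bursts matters.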

I hope that these ideas are of some help.

I have not done any Megadrive development for a couple of weeks but I can pass you some code to handle the Serial side of things. I could also knock up an example to show you what I mean if you are still stuck but it might take me a few days to get back onto the Megadrive for now.

In another Thread, over in the 'UMDK' section, we are discussing an idea to have a batch of UMDK Kits built. Please have a look.

Post by **Eke** » Mon Aug 11, 2014 8:59 am

Mask of Destiny wrote: I'm pretty sure I've seen delays on ROM access that look like refresh delays. That register may only control whether refresh signals are generated on the cart port and not whether DTACK delays are inserted. I remember being surprised at the time. Of course, that only explains why everything is slower than expected and not why there's a discrepancy between RAM and ROM access.

I have seen that too and I couldn't figure out what they were related to; they didn't seem related to Z80 access to ROM.

The number of wait-states before DTACK is asserted when this happens is not constant; see this thread:
viewtopic.php?t=1411