Wait States on Z80 Bank Access

Ask anything you want about Megadrive/Genesis programming.

Moderator: BigEvilCorporation

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am


Post by Mask of Destiny » Fri Jan 24, 2014 6:59 am

Does anyone know if there are any wait states for Z80 access to the 68K bus even when the 68K's bus is not busy? I've got the sound driver I'm working on taking exactly 297 cycles per sample (as measured by hand and confirmed using the debugger in BlastEm after I fixed a few Z80 instruction timing bugs). When I run my simple test ROMs on real hardware though, it sounds as if the pitch is bending up and down. Since sample buffering is done in bursts it seems like a likely culprit for the change in playback speed.

I was going to check this with my logic analyzer, but I've managed to damage the cart slot in my Genesis 2 and I'd rather not crack open my Genesis 1 at the moment. Does anyone have any data on this?
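For reference, a per-sample cycle budget maps to an output rate as sketched below. This is a minimal sketch, assuming the usual Mega Drive clocking where the Z80 runs at the master clock divided by 15; the master clock values and the function name are assumptions, not something stated in this thread.

```python
# Assumed master clock frequencies in Hz (not stated in this thread).
MCLK_NTSC = 53_693_175
MCLK_PAL = 53_203_424
Z80_DIVIDER = 15  # assumed: Z80 clock = master clock / 15


def sample_rate(cycles_per_sample, mclk=MCLK_NTSC):
    """Output sample rate in Hz for a loop costing cycles_per_sample Z80 cycles."""
    return mclk / Z80_DIVIDER / cycles_per_sample


# A 297-cycle loop on an NTSC machine comes out at roughly 12 kHz.
print(round(sample_rate(297)))
```

Any wait states that are not counted in the 297 cycles stretch the loop and lower this rate, which is why uneven bus access shows up as a pitch change.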

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Fri Jan 24, 2014 9:05 am

As the Z80 uses the 68K bus in cycle stealing mode, you sometimes get a penalty when accessing the 68K BUS from the Z80. In my driver I always assume a 1 cycle penalty for reading a byte from banked memory (ex: POP HL takes 12 cycles instead of the usual 10 cycles). This is just an estimation but it worked more or less right on real hardware. Except for very accurate emulators such as Exodus, I believe that no emulator takes care of emulating the BUS interaction.
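The flat-penalty estimate described above can be written down as a small helper. This is only a sketch of that rule of thumb; the function name and parameters are hypothetical, and the one-cycle default is the estimate from the post, not a measured value.

```python
def banked_cycles(base_cycles, banked_bytes, penalty_per_byte=1):
    """Estimated cost of an instruction that reads banked_bytes from the 68K bus."""
    return base_cycles + banked_bytes * penalty_per_byte


# POP HL: 10 cycles normally, reading 2 bytes from banked memory.
print(banked_cycles(10, 2))  # 12
```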

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Post by Nemesis » Fri Jan 24, 2014 10:26 am

And even in the case of Exodus currently, things aren't cycle accurate on bus sharing, since a lack of microcode level emulation of the M68000, or at the very least, correct bus access timing during opcode execution, makes it impossible for the M68000 to yield the bus at the correct points. In fact, it would take emulating both the Z80 and M68000 at this level, as well as measuring any additional delays introduced by the bus arbiter, in order to get the timing right here. My plan is to get Exodus to that point, and the platform itself supports it, but it requires a lot more sophisticated emulation cores for these chips than have ever been written, and a lot more hardware testing to confirm bus access timing within every opcode.

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Sat Jan 25, 2014 7:50 am

I managed to coax my Genesis 2 back into action and I was able to capture some data. In my current test ROM, the 68K is halted with a stop #$2700 instruction so my initial results pertain to the case in which the 68K's bus is effectively free.

!WAIT is brought low pretty much immediately after !MREQ on access to the bank area. !BR on the 68K will be brought low on the next rising edge of the 68K's clock following the assertion of !MREQ. !BG will go low 2 68K cycles later, also on the rising edge of the clock. !WAIT then stays low for about 3.5 68K cycles and appears to rise on the falling edge of the 68K's clock, though there's a little bit of noise in my data on that.

The total number of wait states that results from the above appears to be 2-3 depending on how in sync the Z80 and 68K clocks are at the time of the access.

I did a second test with a ROM that replaces the stop #$2700 with an infinite loop using jmp (a0). This instruction takes 8 cycles and consists of 2 reads so the 68K is constantly on the bus in this version. In this test the !BR to !BG time was sometimes 3 68K cycles and there were a total of 4 wait states. The average delay across 16 accesses appeared to be just shy of 3 cycles.

I can upload the logic analyzer captures if anyone wants to do any closer analysis. I probably won't do much more with them until I'm ready to properly tackle Z80/68K synchronization in BlastEm.
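Since the capture mixes figures in 68K cycles and in Z80 wait states, a conversion helps when comparing them. A minimal sketch, assuming the standard clock dividers (68K at master clock / 7, Z80 at master clock / 15), which are not stated in the thread; the helper name is made up for illustration.

```python
from fractions import Fraction

M68K_DIVIDER = 7   # assumed: 68K clock = master clock / 7
Z80_DIVIDER = 15   # assumed: Z80 clock = master clock / 15


def m68k_to_z80_cycles(m68k_cycles):
    """Convert a duration in 68K cycles to the equivalent in Z80 cycles."""
    return Fraction(m68k_cycles) * M68K_DIVIDER / Z80_DIVIDER


# The ~3.5 68K cycles that !WAIT was observed low, expressed in Z80 cycles:
print(float(m68k_to_z80_cycles(Fraction(7, 2))))
```

The Z80 only samples !WAIT once per clock cycle, so a fractional duration always ends up as a whole number of wait states, and which whole number you get depends on the phase between the two clocks at the time of the access.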

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Sat Jan 25, 2014 11:04 am

Oh it would be a nice feature to have that somehow emulated in BlastEm :)
I always thought the STOP instruction was actually still doing BUS accesses by fetching the next instruction indefinitely...
So based on your results the mean wait state is 3/4 cycles when the 68000 is active ? Does that mean 2 cycles on the Z80 ?
In my case with my drivers I saw that 2 cycles was more or less good, but for word fetching. Does it mean an instruction such as POP HL, which actually does 2 byte reads, can have the same penalty as a single byte read instruction ? I guess it has to do with the way the Z80 requests the BUS for its word memory instructions.

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Sun Jan 26, 2014 8:02 am

Stef wrote:Oh it would be a nice feature to have that somehow emulated in BlastEm :)
I'll probably "fake it" pretty soon with a fixed wait state based on the average. The fixed version of my driver sounds lousy in BlastEm now. A proper fix is a fair bit of work though. Currently synchronization is done on demand when a CPU accesses a non-memory device. This works fine for something like the VDP which won't access the 68K's bus unless the 68K requests it, but doesn't work for the Z80 which can access the bank area at any time. The long term plan was to support some sort of rollback for one of the CPU cores so I can back up to the point where the other CPU intruded on the bus. That's a non-trivial amount of work though and the payoff is relatively small, so it's down the priority list.
Stef wrote:I always thought the STOP instruction was actually still doing BUS accesses by fetching the next instruction indefinitely...
The manual says:
Moves the immediate operand into the status register (both user and supervisor portions), advances the program counter to point to the next instruction, and stops the fetching and executing of instructions. A trace, interrupt, or reset exception causes the processor to resume instruction execution.
Stef wrote:So based on your results the mean wait state is 3/4 cycles when the 68000 is active ? Does that mean 2 cycles on the Z80 ?
The range is 2-4 Z80 cycles with an average of a little less than 3 (seemed about 2.93, but my sample was kind of small). It seems like the !BR to !BG delay is 2 cycles unless the request comes 1 cycle into a 68K bus operation in which case it will be 3 cycles. Obviously it can also be longer for TAS or for a bus operation that's delayed by DTACK. When the !BR to !BG delay is 2 cycles, there will be 2 or 3 wait states on the Z80 side depending on how in sync the clocks are. If the !BR to !BG delay is 3, then there will be 4 wait states. At least that's what it looks like from a quick look at the data anyway.
Stef wrote:In my case with my drivers I saw that 2 cycles was more or less good, but for word fetching. Does it mean an instruction such as POP HL, which actually does 2 byte reads, can have the same penalty as a single byte read instruction ? I guess it has to do with the way the Z80 requests the BUS for its word memory instructions.
As far as I know, a POP just does 2 normal byte-sized reads. If you have a ROM handy with your driver, I can do a logic analyzer capture on it.
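The "fake it" approach mentioned earlier in this post could look roughly like the following. This is a hypothetical sketch, not BlastEm's code: the function name, the 3-cycle base read cost, and the choice of 3 wait states (the measured average, rounded) are all assumptions for illustration. The bank window sitting in the upper 32 KB of the Z80 address space is standard Mega Drive behaviour rather than something stated in the thread.

```python
BANK_WINDOW_START = 0x8000  # Z80 addresses 0x8000-0xFFFF map onto the 68K bus
BASE_READ_CYCLES = 3        # assumed cost of a plain Z80 memory read cycle
FIXED_WAIT_STATES = 3       # assumed: measured average (~2.93), rounded


def read_cycles(address):
    """Cycles charged for one byte read, with a fixed penalty in the bank window."""
    if address >= BANK_WINDOW_START:
        return BASE_READ_CYCLES + FIXED_WAIT_STATES
    return BASE_READ_CYCLES


print(read_cycles(0x1000), read_cycles(0x8000))  # 3 6
```

A fixed value like this gets the average playback speed about right but cannot reproduce the 2 to 4 cycle spread seen in the capture.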

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Sun Jan 26, 2014 10:12 am

Mask of Destiny wrote: I'll probably "fake it" pretty soon with a fixed wait state based on the average. The fixed version of my driver sounds lousy in BlastEm now. A proper fix is a fair bit of work though. Currently synchronization is done on demand when a CPU accesses a non-memory device. This works fine for something like the VDP which won't access the 68K's bus unless the 68K requests it, but doesn't work for the Z80 which can access the bank area at any time. The long term plan was to support some sort of rollback for one of the CPU cores so I can back up to the point where the other CPU intruded on the bus. That's a non-trivial amount of work though and the payoff is relatively small, so it's down the priority list.
I understand that is far too much work just to fix such a minor issue in accuracy. Does BlastEm somehow fake the BUS contention when DMA occurs ? IMO that is the most problematic scheme for me (it can sound *very* different on real hardware) and that could probably be faked more easily than the 68k interaction.
Mask of Destiny wrote:The manual says:
Moves the immediate operand into the status register (both user and supervisor portions), advances the program counter to point to the next instruction, and stops the fetching and executing of instructions. A trace, interrupt, or reset exception causes the processor to resume instruction execution.
Glad to hear that, so we can definitely make the BUS free for testing !
I should make some tests with that to hear how it impacts my drivers.
Stef wrote:So based on your results the mean wait state is 3/4 cycles when the 68000 is active ? Does that mean 2 cycles on the Z80 ?
Mask of Destiny wrote:The range is 2-4 Z80 cycles with an average of a little less than 3 (seemed about 2.93, but my sample was kind of small).
...
As far as I know, a POP just does 2 normal byte-sized reads. If you have a ROM handy with your driver, I can do a logic analyzer capture on it.
That is really surprising. Based on your results we should see an average 3 cycle penalty on byte access, but doing that in my drivers would really screw up the playback (too fast). Now that I think of it, I am using an approximation of the number of available cycles because I wanted to use the same driver for both PAL and NTSC systems. For instance, for 16000 Hz playback I estimated 223 Z80 cycles per sample, which is a bit less than the real NTSC value but more than the PAL one. But even taking that into consideration I am still quite far from your numbers. Did you try to put these numbers in your driver to see what happens on real hardware ?
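The 223-cycle compromise can be checked with a quick calculation. This is a sketch only, assuming a Z80 clock of master clock / 15 and the NTSC and PAL master clock values below, none of which are given in the thread.

```python
# Assumed Z80 clock frequencies in Hz (master clock / 15).
Z80_CLOCK_NTSC = 53_693_175 / 15
Z80_CLOCK_PAL = 53_203_424 / 15


def cycles_per_sample(rate_hz, z80_clock):
    """Z80 cycles available per output sample at the given playback rate."""
    return z80_clock / rate_hz


for name, clock in (("NTSC", Z80_CLOCK_NTSC), ("PAL", Z80_CLOCK_PAL)):
    print(name, round(cycles_per_sample(16000, clock), 1))
```

Under these assumptions the exact budgets for 16000 Hz fall just above 223 cycles on NTSC and a little below it on PAL, so 223 sits between the two.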

You can see the source of my 4 PCM driver here :
http://code.google.com/p/sgdk/source/br ... 0_drv3.s80

You can find the parts where I counted the penalty by looking for the "10+2" text.

And you can test it with this rom :
http://code.google.com/p/sgdk/source/br ... ut/rom.bin

The 4 PCM and 2 ADPCM drivers are interesting for tests as they use a fixed playback rate, and when they are idle (sample playback done) they actually continue to access ROM to play a "silent" sample.

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Sun Jan 26, 2014 11:26 pm

Stef wrote:I understand that is far too much work just to fix such a minor issue in accuracy.
Well I intend to fix it eventually since it is my goal to have that level of accuracy, but I've got bigger accuracy issues I'd like to tackle first.
Stef wrote:Does BlastEm somehow fake the BUS contention when DMA occurs ? IMO that is the most problematic scheme for me (it can sound *very* different on real hardware) and that could probably be faked more easily than the 68k interaction.
Not yet, but apart from the timing inaccuracy introduced by not properly emulating the Z80/68K bus interaction I can do a good job of this with my current sync strategy.
Stef wrote:That is really surprising. Based on your results we should see an average 3 cycle penalty on byte access, but doing that in my drivers would really screw up the playback (too fast). Now that I think of it, I am using an approximation of the number of available cycles because I wanted to use the same driver for both PAL and NTSC systems. For instance, for 16000 Hz playback I estimated 223 Z80 cycles per sample, which is a bit less than the real NTSC value but more than the PAL one. But even taking that into consideration I am still quite far from your numbers.
I did some measurements on the 4 PCM and 2 ADPCM drivers and it seems like the average is actually a little higher for those. The second read from the POP seems to always have at least 4 wait states and in one case there were 5 wait states. It seems the 68K manages to sneak in a bus operation in between the first and second Z80 read so the Z80 is stuck waiting on that read. The average across 10 reads was 3.5 cycles.
Stef wrote:Did you try to put these numbers in your driver to see what happens on real hardware ?
I adjusted my code to assume 2 wait states since that's the minimum and I only had 8 cycles to spare in my copy "instruction". Any further adjustments will require me to free up some cycles from the main playback loop. It sounds a lot better on real hardware now, though it sounds awful in BlastEm since I'm not emulating any wait states yet.

I suspect the reason it's more noticeable for my driver than your 4 PCM driver is that currently I buffer 4 bytes a sample for a bunch of samples and then I do no buffering at all for a long period. The buffering and not buffering period are long enough to perceive the change in pitch whereas in your 4 PCM driver it slows down playback a bit and increases jitter, but the changes in playback speed are mostly over a handful of samples so it's not that noticeable.
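The size of the pitch change between the two phases can be estimated with some quick arithmetic. This is a rough sketch, not a measurement: it assumes the 297-cycle loop from the first post was counted with no wait states, that 4 banked bytes are read per sample while buffering, and that each read costs about the 3 wait states measured earlier. All names are illustrative.

```python
import math

CYCLES_PER_SAMPLE = 297  # loop cost counted without wait states
BYTES_PER_SAMPLE = 4     # banked reads per sample while buffering
AVG_WAIT_STATES = 3      # assumed: measured average (~2.93), rounded


def pitch_drop_cents(base_cycles, extra_cycles):
    """How far playback goes flat, in cents, when a sample takes extra_cycles longer."""
    return 1200 * math.log2((base_cycles + extra_cycles) / base_cycles)


extra = BYTES_PER_SAMPLE * AVG_WAIT_STATES
print(round(pitch_drop_cents(CYCLES_PER_SAMPLE, extra), 1))
```

Under these assumptions the buffering phase plays roughly two thirds of a half step flat relative to the non-buffering phase, which would be heard as a slow pitch wobble rather than jitter.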

The pitch difference between BlastEm (and presumably a number of other emulators) and real hardware is noticeable if you play them back to back, but it's not terrible.

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Mon Jan 27, 2014 9:44 am

Mask of Destiny wrote:
Stef wrote:Does BlastEm somehow fake the BUS contention when DMA occurs ? IMO that is the most problematic scheme for me (it can sound *very* different on real hardware) and that could probably be faked more easily than the 68k interaction.
Not yet, but apart from the timing inaccuracy introduced by not properly emulating the Z80/68K bus interaction I can do a good job of this with my current sync strategy.
I think it would be a good start ! The 68K/Z80 interaction itself is far less problematic ;)
Mask of Destiny wrote:I did some measurements on the 4 PCM and 2 ADPCM drivers and it seems like the average is actually a little higher for those. The second read from the POP seems to always have at least 4 wait states and in one case there were 5 wait states. It seems the 68K manages to sneak in a bus operation in between the first and second Z80 read so the Z80 is stuck waiting on that read. The average across 10 reads was 3.5 cycles.
Thanks for testing it ! So with a POP instruction you got a mean of 3.5 cycles of wait state per byte right ? So instead of the usual 10 cycles used by the instruction we should assume 17 cycles ?
I definitely believe I have a wrong cycle count somewhere else, because for instance in the 4 PCM driver I read 4 samples for each played sample, so the difference in cycles per sample is 10 cycles (233 instead of 223), giving a final rate of 15350 Hz instead of ~16000 Hz, which is still a huge difference.
Mask of Destiny wrote: I adjusted my code to assume 2 wait states since that's the minimum and I only had 8 cycles to spare in my copy "instruction". Any further adjustments will require me to free up some cycles from the main playback loop. It sounds a lot better on real hardware now, though it sounds awful in BlastEm since I'm not emulating any wait states yet.

I suspect the reason it's more noticeable for my driver than your 4 PCM driver is that currently I buffer 4 bytes a sample for a bunch of samples and then I do no buffering at all for a long period. The buffering and not buffering period are long enough to perceive the change in pitch whereas in your 4 PCM driver it slows down playback a bit and increases jitter, but the changes in playback speed are mostly over a handful of samples so it's not that noticeable.
It's crazy that it could make such a big difference that it sounds awful on BlastEm with your driver when it sounds ok on real hardware.
Mask of Destiny wrote:The pitch difference between BlastEm (and presumably a number of other emulators) and real hardware is noticeable if you play them back to back, but it's not terrible.
You can clearly hear the difference between emulator and real hardware. For instance in the Bad Apple demo the emulator playback rate is always a bit ahead (more or less, depending on the emulator) and the song ends a bit before the animation, whereas it ends almost perfectly synchronized on real hardware ;)
Last edited by Stef on Mon Jan 27, 2014 10:18 pm, edited 1 time in total.

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Mon Jan 27, 2014 9:56 pm

Stef wrote:So with a POP instruction you got a mean of 3.5 cycles of wait state per byte right ? So instead of the usual 10 cycles used by the instruction we should assume 17 cycles ?
Correct. Typical timing is something like 4 cycles for decode/prefetch, 6 cycles for the first byte and 7 cycles for the second byte.
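That breakdown can be reproduced from the Z80's normal machine-cycle timing for POP (a 4-cycle opcode fetch followed by two 3-cycle memory reads) plus wait states on each banked read. In this sketch the wait-state counts of 3 and 4 are chosen to match the typical timing quoted above; the function name is hypothetical.

```python
OPCODE_FETCH = 4  # M1 cycle, fetched from Z80 RAM so no bank penalty
MEMORY_READ = 3   # each stack byte read, before wait states


def pop_cycles(wait_first=0, wait_second=0):
    """Total cycles for POP with the given wait states on each byte read."""
    return OPCODE_FETCH + (MEMORY_READ + wait_first) + (MEMORY_READ + wait_second)


print(pop_cycles())      # 10, no wait states
print(pop_cycles(3, 4))  # 17, typical banked timing from the capture
```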
Stef wrote:I definitely believe I have a wrong cycle count somewhere else, because for instance in the 4 PCM driver I read 4 samples for each played sample, so the difference in cycles per sample is 10 cycles (233 instead of 223), giving a final rate of 15350 Hz instead of ~16000 Hz, which is still a huge difference.
It's about 4% which is a little less than a musical half step. I can only do a relatively small capture, so I can't say how long your relatively large main loop takes. I can say that readAndMix2 seems to take 81 cycles on real hardware though. There were two other pop to pop intervals I recorded, but they were harder to pin down and I don't remember the numbers offhand.
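For comparison, a frequency ratio converts to musical cents (100 cents per equal-tempered half step) with a logarithm. A quick sketch using the two rates quoted above; the function name is made up.

```python
import math


def cents(f_high, f_low):
    """Interval between two frequencies in cents (100 cents = one half step)."""
    return 1200 * math.log2(f_high / f_low)


# Expected vs. achieved playback rate in Hz, as quoted above.
print(round(cents(16000, 15350)))
```

The result lands somewhere between half and a full half step, in line with the estimate above.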
Stef wrote:It's crazy that it could make such a big difference that it sounds awful on BlastEm with your driver when it sounds ok on real hardware.
Well it effectively adds a bunch of vibrato which sounds rather weird on a guitar sample. On other samples it's not as bad.

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Mon Jan 27, 2014 11:36 pm

Mask of Destiny wrote: Correct. Typical timing is something like 4 cycles for decode/prefetch, 6 cycles for the first byte and 7 cycles for the second byte.
...
It's about 4% which is a little less than a musical half step. I can only do a relatively small capture, so I can't say how long your relatively large main loop takes. I can say that readAndMix2 seems to take 81 cycles on real hardware though. There were two other pop to pop intervals I recorded, but they were harder to pin down and I don't remember the numbers offhand.
81 cycles for readAndMix2 instead of the 76 cycles I expected, that indeed confirms the +5 cycles delay for the POP instruction !
I believe I just made a mistake somewhere else while counting my cycles, probably something stupid like using the wrong base clock :p

Mask of Destiny wrote:Well it effectively adds a bunch of vibrato which sounds rather weird on a guitar sample. On other samples it's not as bad.
Ah yeah, it usually sounds really bad on that type of sample, even with minor distortion. By the way, what type of driver are you writing ? :)

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Tue Jan 28, 2014 12:19 am

Stef wrote:By the way, what type of driver are you writing ? :)
FM + 2 PCM channels with resampling (8.8 fixed point increment), primitive looping (loop start address must be a multiple of 256, loop length must be a power of 2 multiple of 256) and a 2 level PCM volume control (mostly to prevent overflow when playing 2 samples together while allowing full volume when playing a single sample) with a ~12050Hz output sample rate. Sample buffering is handled in the command stream rather than in the main loop. The command stream interpreter has some higher level commands for things like looping, subroutines, coroutines and instrument loading, but all other FM stuff is handled with a simple register write command. The Wait instruction has a resolution of 1 sample, so generally events can be scheduled with that precision; however, most instructions take more than one sample to complete so conflicts need to be dealt with.
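As an illustration of the 8.8 fixed-point stepping mentioned above (not the driver's actual code, which is not shown in this thread), a source position with 8 fractional bits can be advanced once per output sample like this. The function and variable names are hypothetical, and looping and volume handling are left out.

```python
def resample(samples, increment, count):
    """Pick count output samples, stepping the source position by an 8.8 increment.

    increment is fixed point: 0x0100 is 1.0 (original speed), 0x0080 is half
    speed, 0x0180 is 1.5x speed. No interpolation; the fraction is truncated.
    """
    position = 0  # source position in 1/256ths of a sample
    out = []
    for _ in range(count):
        index = position >> 8  # integer part selects the source sample
        if index >= len(samples):
            break  # ran off the end; a real driver would loop or stop here
        out.append(samples[index])
        position += increment
    return out


source = [10, 20, 30, 40, 50, 60, 70, 80]
print(resample(source, 0x0180, 5))  # [10, 20, 40, 50, 70]
print(resample(source, 0x0080, 4))  # [10, 10, 20, 20]
```

Keeping the fraction in the low byte means the integer part of the position is just the high bits, which is cheap to extract on an 8-bit CPU.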

I have yet to implement a strategy for sound effects or syncing to VBlank (to avoid doing buffering during DMA). I also haven't decided whether I'm going to add PSG support. That probably will depend on whether PSG access involves cycle stealing from the 68K or not. I also need to work on the composition workflow. I have a simple Python script that deals with scheduling copy instructions for filling buffers, but it needs support for some more high level note-playing commands to really be usable. I'd also like to investigate adding an option to switch one (or both) PCM channels to some kind of compressed format with fixed speed playback, but that will probably come later.

Stef
Very interested
Posts: 3131
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Post by Stef » Wed Jan 29, 2014 10:11 am

Oh yeah now I remember, you already talked about that resampling PCM driver idea right ? It's cool to see all these new sound drivers coming out lately :) Do you plan to have support for SFX as well ? I guess that is a custom driver designed for your needs and with your own tools. Do you plan to eventually open it at some point ? ^^
You said the wait command is 1 sample resolution, this is quite low, do we really need that granularity ? I was thinking of making a new driver with FM support and having the wait command be frame based (256 samples). I don't see a case where we need more than that.

Edit: Just realized you actually want to add SFX support :p

Mask of Destiny
Very interested
Posts: 615
Joined: Thu Nov 30, 2006 6:30 am

Post by Mask of Destiny » Wed Jan 29, 2014 7:27 pm

Stef wrote: I guess that is a custom driver designed for your needs and with your own tools.
For the most part, I'm just scratching an itch to see what's possible. I do need to do some sound test ROMs for fixing issues in BlastEm and I do have a little homebrew game in mind, but neither of those necessarily require a new driver. I'm writing my own tools for working with the driver because there aren't any existing tools that are tailored to its abilities and limitations.
Stef wrote:Do you plan to eventually open it at some point ? ^^
That's the plan. I haven't released it yet because it's not really done enough to be terribly useful, but if someone wants to poke at it a bit I can upload it now.
Stef wrote:You said the wait command is 1 sample resolution, this is quite low, do we really need that granularity ? I was thinking of making a new driver with FM support and having the wait command be frame based (256 samples). I don't see a case where we need more than that.
Generally speaking, I think 1 sample resolution is overkill; however, I do need it for scheduling the event that stops playback for a sample as there's no code in the main loop to handle that. Might also be useful for certain special effects, but I haven't looked into that yet.
Stef wrote:Just realized you actually want to add SFX support :p
Yeah, I just have to figure out how I'm going to make it work. I think I have enough cycles available for interpreter instructions that I can probably handle temporarily blocking a channel for FM sound effects. PCM sound effects are going to be problematic though, but I might be able to pull it off by abusing the co-routine feature. Time will tell.
