VDP Internals

Nemesis · Post by **Nemesis** » Wed Sep 11, 2013 10:47 pm

There are a LOT of edge cases in the VDP, and they tend to betray internal information too, so you can kind of chain them together to probe internal operation, eg, building off an edge case that exposes internal information, you can probe into the internal state of another operation, or force unexpected conditions in it and observe the results, and so on. Data port reads are by far the buggiest part of the VDP, and a lot of the tests rely on them to dig information out. I'll try and answer your questions as best as I can, but there's more going on here than I can document in one post.

1) FIFO state has no impact on CTRL port writes: you can actually setup a READ operation while the FIFO is full and it still returns without delay => this means the VDP command list stores the source/destination as well as the type of access (presumably, setting a READ operation adds a read command at the end of the list)

Mostly correct, except that read operations aren't added to any kind of "list". A read is simply flagged as pending, and that state is ignored until the FIFO is empty. Each of the four FIFO entries effectively stores its own copy of the command code and command address registers, while there's a fifth "live" command code and command address, which is what you write into when you perform control port writes. There's also some interesting behaviour to do with the auto-incremented addresses when performing reads and writes. I'm pretty sure, from memory, that there's a sixth "active" command address register too, which is what gets auto-incremented when performing reads or writes, and in the case of the FIFO, gets written into each FIFO entry when a data port write is made. This cached address register gets reloaded from the live address register even if you only perform a partial control port write, which may cause some surprising results. There's a test which covers this behaviour.
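As a rough illustration of the register layout described above, here is a minimal Python sketch. The class and field names (`FifoEntry`, `PortState`, `active_address`, `increment`) are hypothetical, the control port write is assumed to arrive already decoded into a code and an address, and the address is truncated to 16 bits purely for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class FifoEntry:
    code: int = 0      # copy of the command code at the time of the write
    address: int = 0   # copy of the active (auto-incremented) address
    data: int = 0      # 16-bit word written to the data port


@dataclass
class PortState:
    live_code: int = 0        # set by control port writes
    live_address: int = 0     # set by control port writes
    active_address: int = 0   # auto-incremented copy used for accesses
    increment: int = 2        # auto-increment value (assumed register setting)
    fifo: list = field(default_factory=lambda: [FifoEntry() for _ in range(4)])
    next_slot: int = 0        # ring index of the next entry to fill

    def control_write(self, code, address):
        # Even a partial control port write reloads the active address
        # from the live address register.
        self.live_code, self.live_address = code, address
        self.active_address = address

    def data_write(self, word):
        # Each FIFO entry gets its own copy of the code and active address.
        entry = self.fifo[self.next_slot]
        entry.code = self.live_code
        entry.address = self.active_address
        entry.data = word & 0xFFFF
        self.next_slot = (self.next_slot + 1) % 4
        self.active_address = (self.active_address + self.increment) & 0xFFFF
```

Two data port writes after one control port write leave the live address untouched while the active address has moved on by two increments.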

2) FIFO state is only affected by WRITE operations: setting a READ operation does not clear the FIFO EMPTY flag and this flag isn't cleared either when the read data is available.

Correct. The VDP doesn't use the FIFO for performing read operations; there's a separate read buffer. Most importantly, there's a read pre-cache process, where as soon as you set up a valid read target, the VDP will attempt to fetch the first value for the read target once the write FIFO is empty, and after each read operation, it will attempt to cache the next value. Note there are a lot of quirks and potential hardware lockups surrounding this read pre-cache process when you start messing with VDP state while a pre-cache operation is in progress. There are even problems that can occur intermittently if you attempt a data port read immediately after setting up a read target. With all the tests I've done, I still don't have the VDP behaviour fully mapped out surrounding control port access while read caching is in progress, but most of the bugs seem to be related to control port writes when only half of the data has been cached, such as when reading from VRAM where reads are byte-wide. I think I saw unusual behaviour with VSRAM as well, which I speculated was related to collisions with the render process reading from the VSRAM buffer. I think there are some disabled tests around this caching behaviour, and I have attempted support for the assumed causes of these problems within Exodus, but I currently don't actually deadlock the VDP core when errors like this occur, I just ignore them. After some more testing, I'm planning to activate this behaviour, when I'm confident I'm emulating it correctly.

Note also that the current data in the FIFO for previously processed data port writes can affect read operations. If you're reading from a target like VSRAM or CRAM, which has some bits in the 16-bit wide result that are undefined in the target memory, those undefined bits are actually initialized to the content of the next available FIFO entry (the one containing the data written to the data port four writes ago). This is very useful behaviour, as it allows the previous contents of the FIFO to be snooped on, which is how you can determine how and when the VDP uses the FIFO. A lot of the tests in this ROM rely on this being done correctly, as a lot of the tests verify the resulting FIFO state.
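A minimal Python sketch of how an emulator might merge those undefined bits is shown below. The two masks are assumptions not stated in this thread (9-bit CRAM colours laid out as `0x0EEE`, 11-bit VSRAM values as `0x07FF`), and `compose_read` and its parameters are hypothetical names.

```python
CRAM_DEFINED_BITS = 0x0EEE   # assumed: 3 bits each of blue, green, red
VSRAM_DEFINED_BITS = 0x07FF  # assumed: 11-bit scroll values


def compose_read(stored, defined_mask, fifo_data, next_slot):
    """Merge the stored bits with the stale word in the next available FIFO slot."""
    stale = fifo_data[next_slot]  # the word written four data port writes ago
    undefined_mask = ~defined_mask & 0xFFFF
    return (stored & defined_mask) | (stale & undefined_mask)
```

Reading the result back and masking off the defined bits is then enough to recover the upper bits of an old FIFO word, which is the snooping trick described above.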

3) reading from DATA port hangs until read command has been processed (and data is available): if the FIFO is full when the read is being set up, the DATA port read will take more time than if the FIFO was empty => this indicates the read command is processed after all writes have been processed and FIFO is empty.

Correct, although be careful attempting data port reads from odd VRAM addresses until the read cache operation is complete! In the case of the FIFO being full, this means you should wait for it to drain first. There's a bug in the VDP core surrounding reads from odd VRAM addresses, where the VDP thinks the read is complete after reading the first byte (upper byte of result), and the second byte (lower byte of result) retains the previous data which was held in the read cache. Check the "TestVRAMByteswapping.asm" test, there are some interesting test cases in there. Also be sure to check out the tests surrounding VDP fill, there are some crazy things going on there that'll show a lot about how DMA operations work. Ever thought about what happens when you try to set up a DMA operation while the FIFO contains pending writes? Or set up a DMA fill, then disable DMA operations while the fill is waiting for the data port write, then perform the write anyway? How about repeating these questions when trying a DMA fill to VSRAM? The tests for that kind of crazy stuff are all in there.

Eke · Post by **Eke** » Thu Sep 12, 2013 11:59 am

Thank you.

Nemesis wrote:A lot of the tests in this ROM rely on this being done correctly, as a lot of the tests verify the resulting FIFO state.

Indeed, the number of tests passing OK jumped to around 70 as soon as I added support for this.

I started looking at the source and it's very interesting to read.
I'm not sure I understand how CRAM and VSRAM fill work though, i.e. how are they related to the last FIFO entry?

Same with CRAM / VSRAM copy: what does the CD4 bit do? I would say it indicates special read/write access, but did you figure out exactly how it works and what effects it has outside DMA copy? Your tests seem to indicate it has no effect during DMA fill.

Also, I've seen one game setting up a DMA copy with the CD0 bit set (VRAM write) and expecting the VRAM copy to work: wouldn't that make the first write happen before the read and miss the first byte of the copy?

Nemesis · Post by **Nemesis** » Fri Sep 13, 2013 2:28 am

Eke wrote: I started looking at the source and it's very interesting to read.
I'm not sure I understand how CRAM and VSRAM fill work though, i.e. how are they related to the last FIFO entry?

Same with CRAM / VSRAM copy: what does the CD4 bit do? I would say it indicates special read/write access, but did you figure out exactly how it works and what effects it has outside DMA copy? Your tests seem to indicate it has no effect during DMA fill.

Also, I've seen one game setting up a DMA copy with the CD0 bit set (VRAM write) and expecting the VRAM copy to work: wouldn't that make the first write happen before the read and miss the first byte of the copy?

I'm going to have to give a lot of explanation in order to adequately answer some of your questions. I've got a bit of time to write this all up. It's going to be a long post, so please bear with me. Also note that some of the described behaviour surrounding CD4 isn't implemented in Exodus yet, so consider this theory, not 100% proven. It's also been a little while now since I did all this testing, so I may have forgotten some points. I may say something here that contradicts known behaviour, since I haven't actually modelled all of this in code and verified it passes my test suite. Let me know if you spot an apparent contradiction.

To begin with, one thing you need to understand is the asynchronous nature of VDP port access. The VDP has an internal update cycle that runs continuously, looking at the current processor state, and determining what work it needs to do. When you perform a read or a write operation on either the control or data ports, the calling device is usually just writing data that gets cached, and then "picked up" by the internal VDP update cycle, or reading cached data that's available in output buffers that have been filled by the VDP previously. The calling device only actually gets held waiting on the VDP in certain circumstances, such as when the write FIFO is full and a data port write is attempted, when a read is attempted and no read data has been cached, etc. If you understand that, it'll also be clear that no control port write, even something seemingly simple like a register write, is ever processed immediately at the time of the write. Instead, it goes into the command code and address registers, and all these operations have a pending state. The "live" command and address registers are set by a calling device writing to the control port, but nothing is done until the VDP picks up that state change, detects if some kind of work is required as a result, and acts on it. I'll also add that based on my understanding of the operation of the VDP, I believe if the calling device was able to perform two control port writes before the VDP had been able to internally process the first, the calling device would also be held waiting until the first port write was complete, although I don't think this can ever occur on the Mega Drive because the clock rate of the 68000 isn't fast enough.

With all that understood, you now need to understand the command code register fully. There are 6 bits in the command code register, and they have the following basic interpretation:
CD0 - Read/Write target (write target if set)
CD1-CD3 - Target identifier
CD4 - Work complete
CD5 - DMA work pending
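
The bit layout listed above can be decoded along these lines. This is only a sketch: `decode_command_code` and the dictionary keys are hypothetical names, chosen to mirror the interpretation given in the list.

```python
def decode_command_code(code):
    """Split a 6-bit command code into the fields listed above."""
    code &= 0x3F
    return {
        "write_target": bool(code & 0x01),   # CD0: write target if set
        "target_id": (code >> 1) & 0x07,     # CD1-CD3: target identifier
        "work_complete": bool(code & 0x10),  # CD4: work complete
        "dma_pending": bool(code & 0x20),    # CD5: DMA work pending
    }
```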

The interpretation of these bits is consistent under all operating modes. The most interesting and important one to understand is CD4, and how it affects the various states the VDP can be in. CD4 is the key for how the VDP knows it needs to do some kind of work. If CD4 is unset (0), and the VDP update cycle detects that the current internal state indicates some kind of work to perform, the VDP will perform that work, then set CD4 to indicate it is complete. Here are the cases when I believe this occurs under non-DMA conditions:
-When a write operation is made to the data port, CD4 is set. (Not 100% sure why at this stage, but probably related to the write being accepted into the FIFO.)
-When a read cache operation is complete, CD4 is set.
-When a cached read value is read from the data port, CD4 is cleared. (Next value will now be cached)
-When the first half of a control port write has been picked up by the internal VDP state loop (and a register write has been completed if necessary) CD4 is set. Note that CD4 is set in this case whether a register write is flagged or not.
-When the second half of a control port write has been picked up by the internal VDP state loop, if a non-read target has been specified, CD4 is set.

A little more on implementation too, you need to understand a few things about the read buffer. The read buffer contains a 16-bit data buffer, and appears to carry at least two internal state flags, one flag indicating if the upper 8 bits of the data buffer have been populated, and the other indicating if the lower 8 bits of the data buffer have been populated. When reading from CRAM or VSRAM, the data is read in a single operation, so both the upper and lower data present flags are set at the same time, and the data is loaded into the data buffer. For implementation reasons I don't fully understand, when you're reading from VSRAM or CRAM, which have "undefined" bits which aren't actually present in the source, the read buffer ends up with those bits being set according to the current contents of the next available FIFO buffer entry, which is the data you wrote to the data port 4 writes ago.

When reading from VRAM, only one byte can be read at a time. In this case, the lower byte is always read first, and the upper byte is always read second. Also note that there's an implementation bug here when you pass in an odd VRAM address for a read operation. The VDP ignores the LSB of the target address for CRAM and VSRAM reads and writes. For VRAM writes, when the LSB is set, the data being written is byteswapped. For VRAM reads, when the LSB is set, it has no effect whatsoever on the actual read buffer, and the read buffer reads the target VRAM word by reading the lower byte first, and the upper byte second. What it does do however is switch when CD4 is set. If an even VRAM address is read from, CD4 will only be set when the upper byte has been read, so in other words, the data will only be flagged as available when it has been fully read. If an odd VRAM address is read from, CD4 will be set when the lower byte has been read, so the data will be flagged as available when only half of it has been read. If you perform a data port read at this point, you'll actually retrieve a result with the lower byte being the requested data, and the upper byte containing the previous contents of the read buffer at the time of the last read operation.
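
The odd-address behaviour described in this paragraph could be sketched as follows, assuming the caller reads as soon as the data is flagged available. The function name `vram_read` is hypothetical, and the sketch assumes VRAM is stored as a big-endian byte array (upper byte of each word at the even address), which is an emulator-side choice rather than anything confirmed here.

```python
def vram_read(vram, address, previous_buffer):
    """Return the word seen by a data port read issued as soon as CD4 is set."""
    word_base = address & ~1
    upper = vram[word_base]       # assumed big-endian storage
    lower = vram[word_base + 1]
    if address & 1:
        # Odd address: flagged available after the lower byte only,
        # so the upper byte still holds the old read buffer contents.
        return (previous_buffer & 0xFF00) | lower
    # Even address: flagged available only once both bytes are fetched.
    return (upper << 8) | lower
```

With `vram[4:6]` holding `0x12, 0x34` and an old buffer value of `0xAABB`, a read from address 4 yields `0x1234` while a read from address 5 yields `0xAA34`.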

So, with this understood, it should now become clearer how various operations actually work. This covers how the VDP knows when to process register writes, how it knows when data port writes need to be added to the FIFO, and how the external device knows when a read cache operation is complete and there's data waiting in the read buffer. Apart from DMA operations, this is everything you can do. With this information, you should be able to start to see some cases where you can break things:
-If you setup a read target in CD0-CD3, but set CD4, the calling device locks up if it attempts a data port read. This happens because you've actually flagged to the internal VDP state loop that the data is already cached, so the VDP never fetches any data from the read target. The calling device sees CD4 set when you read from the control port though, then tries to access the read buffer to read the cached data out. Unfortunately, unless the data is actually really cached in the read buffer when the caller accesses the read buffer, you get a lockup, because both of the data available state flags in the read buffer will be cleared right now, because they are cleared whenever a control or data port write occurs, and the calling device is stalled waiting for them to be set at this point.
-If you attempt a data port read when you wrote a valid read target, but you rewrite just the first half of the two-word read command to the command port, you'll also get a lockup, because the first half of a command port write sets CD4, and you now enter the same condition described above.
-If you setup a read target and perform a read, then perform a write to the data port and perform a read again, you'll get a lockup, since you've just set CD4 by doing a data port write, and the read cache operation will no longer run.
There are many more. A lot of them should be covered in that port access test ROM.
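
To make the three lockup cases above concrete, here is a deliberately simplified Python model of the CD4 handshake. Everything in it is a sketch under assumptions: `ReadPath` and its methods are hypothetical, a single `data_valid` flag stands in for the two upper/lower data-present flags, and the flag is assumed to be cleared again once the cached value has been read.

```python
class ReadPath:
    def __init__(self):
        self.cd4 = True
        self.data_valid = False  # stands in for both data-present flags

    def setup_read(self, cd4_bit=False):
        # Full two-word control port write selecting a read target.
        self.data_valid = False  # any port write clears the flags
        self.cd4 = cd4_bit

    def partial_control_write(self):
        # The first half of a control port write sets CD4.
        self.data_valid = False
        self.cd4 = True

    def data_write(self):
        # A data port write sets CD4.
        self.data_valid = False
        self.cd4 = True

    def update(self):
        # Internal update cycle: pre-cache a value only while CD4 is clear.
        if not self.cd4:
            self.data_valid = True
            self.cd4 = True

    def data_read_locks_up(self):
        self.update()
        if not self.data_valid:
            return True          # caller stalls forever waiting on the flags
        self.data_valid = False  # assumed: consumed on read
        self.cd4 = False         # the next value will now be cached
        return False
```

Each bullet above maps onto one call sequence: `setup_read(cd4_bit=True)`, `setup_read()` followed by `partial_control_write()`, and a successful read followed by `data_write()` all end with `data_read_locks_up()` returning `True`.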

Now on to CD5. CD5 has a similar function to CD4, but it relates specifically to DMA operations. One thing you need to understand about CD5 is that it can only ever be modified externally by a control port write if the DMA enable bit is set (reg 1, bit 4). If DMA enable is cleared, the state of CD5 will be retained whenever the command code register is modified. Note that I said retained, not cleared. If you have a pending DMA fill just waiting on a data port write to kick it off, and you then clear the DMA enable bit and attempt to rewrite the same command data you wrote to setup the DMA fill, but this time leave CD5 unset, CD5 will still be set afterwards, and a DMA fill operation will still be triggered when you perform a data port write. The absolute only effect the DMA enable bit ever has is to enable or disable control port writes being able to modify the current state of CD5.
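
The gating of CD5 by the DMA enable bit can be sketched like this. `apply_code_write` is a hypothetical helper that takes the full 6-bit code a control port write would produce and applies the retention rule described above.

```python
CD5 = 0x20


def apply_code_write(current_code, written_code, dma_enabled):
    """Return the new command code after a control port write."""
    new_code = written_code & 0x3F
    if not dma_enabled:
        # CD5 is retained (not cleared) while DMA is disabled.
        new_code = (new_code & ~CD5) | (current_code & CD5)
    return new_code
```

So rewriting a fill command without CD5 while DMA is disabled still leaves CD5 set, and equally a write with CD5 set cannot raise it while DMA is disabled.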

Before I say any more, a quick word about DMA. You need to understand that DMA is kind of a "bolt-on" addition to the VDP. Nothing about the fundamental way the VDP processes command or data port writes is altered by the presence of the DMA unit, the DMA unit simply detects some additional state conditions and performs some work of its own over the top of what the VDP normally does. Another critical thing about DMA operations, is that, I believe, they have no additional internal state settings. DMA itself is driven entirely from the command code and address registers, and the DMA-specific VDP registers. At no point does the DMA unit latch or store additional data internally. DMA operations are advanced one "step" at a time, and whether a DMA operation is going to run is re-evaluated on each step based on the current register settings. Every DMA operation also performs the exact same set of steps after it is advanced one step, which is to firstly add 1 to the lower 2 DMA source address registers, then to subtract 1 from the DMA length counter register, and then if the resulting DMA length counter is 0, clear CD5 in the command code register, which signals that a DMA operation is complete. Note that this means that the DMA source registers need to be advanced for a DMA fill, even though it doesn't use them. These DMA registers are modified "live", so their modified state is retained between DMA operations, and of course, the third DMA source register 0x17(23), which contains the DMD1/DMD0 flags in the upper bits, is never modified by the DMA state advance process, only the lower two are modified. This is what causes DMA transfers to "wrap" on a 0x20000 byte boundary (0x20000 bytes because there's no bit 0 for the source address).
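
The common advance step described above might look like this in Python. The register numbering is an assumption based on the usual VDP register map (19/20 for the DMA length, 21/22 for the lower two source bytes, 23 for the third source register), and `dma_advance` is a hypothetical name.

```python
def dma_advance(regs, code):
    """Apply one DMA advance step to the register list; return the new code."""
    # Add 1 to the lower two source registers only; register 23 is untouched,
    # which is what produces the wrap on a 0x20000 byte boundary.
    source = (((regs[22] << 8) | regs[21]) + 1) & 0xFFFF
    regs[21], regs[22] = source & 0xFF, source >> 8
    # Subtract 1 from the length counter.
    length = (((regs[20] << 8) | regs[19]) - 1) & 0xFFFF
    regs[19], regs[20] = length & 0xFF, length >> 8
    # Clear CD5 when the counter reaches zero.
    if length == 0:
        code &= ~0x20
    return code
```

Because the registers are modified in place, their state carries over into the next DMA operation, matching the "live" behaviour described above.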

Ok, with all that said, let's talk about how DMA works. Let's start with a DMA fill. When a DMA Fill operation is pending, and you perform a data port write, that data port write is completed as normal, because the DMA unit is a bolt-on addition to the VDP core. The basic VDP state update cycle doesn't know or care about DMA. It doesn't know or care about the CD5 bit. All it sees is that you did a data port write. That data port write is picked up, and written to the FIFO, with a copy of the current command code and the current incremented command address register, and the incremented command address register is incremented again. That pending write is then pulled out of the FIFO, and processed as a normal FIFO write. Now here's where the DMA unit gets involved. I should say at this point, I'm not 100% confident of everything I'm about to state about DMA fill internals, but this is my best working theory, based on testing.

The DMA unit seems to have hooks into the memory writing logic and FIFO advance process in order to advance DMA fill operations. Somehow, when the FIFO enters an empty state, a DMA fill operation is triggered. I believe this is stateless, IE, there's never a "DMA fill in operation" flag set or cleared. If this is true, the DMA fill operation most likely listens for a memory write complete signal from the memory write logic. When this is triggered, if the FIFO is currently empty, it advances the DMA fill and performs the next write in the fill, and so on until the fill operation is complete. It's not clear how the DMA fill knows what data was written in order to repeat it. I highly doubt it pulls it from the FIFO itself, most likely, it snoops on the memory write hardware and caches itself, or it pulls it back out of some temporary buffer and feeds it back into the memory write logic continuously. Note that pending FIFO writes take priority over DMA fills, so DMA fill operations will only ever run at an access slot if the FIFO is empty. When deciding whether to run a DMA fill, it checks if CD5 is currently set. Note that this is based on the live command register state, not anything written in the FIFO. If CD5 is set, and DMD1 is true, and DMD0 is false, the DMA unit will pull the write target and the upper byte of the write data from the FIFO entry, and write that single byte to the write target, using the current incremented command address register, which will then be incremented afterwards. Once the write has been performed, the standard set of DMA advance operations is then performed, as described above.

When you perform a data port write during a DMA fill, that data port write is processed as normal, it simply gets added to the next available slot in the FIFO, and the incremented command address register is incremented again. The DMA fill operation will effectively be suspended until the FIFO is empty again, and at that point, it will now pick up its fill data from the last data that was moved through the FIFO, effectively modifying the fill data mid-way through the DMA fill operation. The fill will now continue along its way, and will finish one location further than it would have normally, since the data port write incremented the command address, and the fill continued from this incremented location. Note that there is a race condition here, where occasionally if the timing is spot on, the data port write will try and increment the command address at the same time the DMA fill operation tries to increment it. Remember that port access to the VDP is asynchronous to the internal update state, and the incremented command address is updated by the calling device when it writes to the data port, so this can happen. When it does, the command address is only incremented once between the two operations, so the fill will finish where it would have originally, but the DMA fill operation will write the new fill data back to the same location that was written to by the data port write before continuing on to the next address.

Note that there's a quirk you need to be aware of when executing a DMA fill, and that's to do with a non-empty FIFO. It's quite possible to perform both control port writes to setup a DMA fill operation while pending writes are still held in the FIFO. If you do this though, as soon as the FIFO is empty, it's going to kick off a DMA fill operation based on the last written data in the FIFO.

When it comes to DMA fills to CRAM and VSRAM, there's a bug. When VRAM is the write target, DMA fill behaves as I've described above, but when CRAM or VSRAM is the write target, DMA fill seems to fail to latch the fill data correctly. The apparent effect you see is that instead of using the data in the last written FIFO slot, it uses the data in the next available FIFO slot, or in other words, the data that was written 4 writes ago to the data port. I suspect this is because the implementation was only designed to work for VRAM, and is binding to some kind of internal register or buffer that's only set for VRAM writes, and when this buffer is undefined, it retrieves data from the next available FIFO buffer entry, just like the read buffer does for undefined bits. Whatever the cause, this is the main thing that affects DMA fill operations to VSRAM or CRAM. Apart from retrieving the data from the wrong write, the fill operation works, with a bonus in fact that it performs a full 2-byte write in each "step". This means you can perform a DMA fill to CRAM or VSRAM if you want, all you have to do is write the data you want to use for the fill 4 times, the first 3 of which you perform before setting up the fill, and the last one to trigger it.
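
The difference between "last written" and "next available" FIFO slots can be sketched with a four-entry ring. `FifoRing` and `fill_word` are hypothetical names, and the sketch only models which word is picked up as fill data, not the write itself.

```python
class FifoRing:
    def __init__(self):
        self.data = [0, 0, 0, 0]
        self.next_slot = 0

    def write(self, word):
        self.data[self.next_slot] = word & 0xFFFF
        self.next_slot = (self.next_slot + 1) % 4

    def last_written(self):
        return self.data[(self.next_slot - 1) % 4]

    def next_available(self):
        # Still holds the word written four data port writes ago.
        return self.data[self.next_slot]


def fill_word(fifo, target):
    """Pick the fill data source as observed for each write target."""
    if target == "VRAM":
        return fifo.last_written()
    return fifo.next_available()  # CRAM / VSRAM quirk
```

Writing the same value four times makes both slots agree, which is why the four-write trick described above works for CRAM and VSRAM fills.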

That's basically DMA fill in a nutshell. As you'll see, with this implementation, it has no additional state beyond the DMA registers and the FIFO buffer itself, and you can start to understand how and why it will behave the way it will under various circumstances. CD4 is completely ignored, because setting it has no effect for write operations, and the DMA fill operation doesn't use it.

When it comes to DMA copy, it's actually much simpler than it seems. For DMA copy, CD0-CD3 are ignored. You can only perform a DMA copy within VRAM. You must set CD4 to avoid a clash with the read pre-cache operation I believe. Without CD4 set, the VDP locks up. I speculate that setting CD0 to true and CD4 to false might actually have the same effect, I haven't tested this in hardware yet, but it would be well worth trying. At any rate, during the VDP update cycle, if CD5, DMD1, and DMD0 are all set, a DMA copy operation will advance one step, which simply involves reading a byte from the current target address in VRAM based on the DMA source address register, and writing that byte to VRAM using the current incremented command address register, which will then be incremented afterwards. Once the write has been performed, the standard set of DMA advance operations is then performed, as described above. A DMA transfer is similar, if CD5 is set and DMD1 is clear, and there's an available slot in the FIFO, it will read a value from external memory using the DMA source address register and add it to the FIFO using the current command code and incremented command address registers, then it runs the standard set of DMA advance operations.
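
Putting the three cases together, the choice of DMA operation from CD5, DMD1 and DMD0 could be sketched as below. The bit positions of DMD1 and DMD0 within register 23 (bits 7 and 6) are an assumption based on the usual register layout, and `dma_mode` is a hypothetical name.

```python
def dma_mode(code, reg23):
    """Return which DMA operation would advance, or None if CD5 is clear."""
    if not code & 0x20:          # CD5 clear: no DMA work pending
        return None
    dmd1 = bool(reg23 & 0x80)    # assumed bit position
    dmd0 = bool(reg23 & 0x40)    # assumed bit position
    if not dmd1:
        return "transfer"        # external memory to FIFO
    return "copy" if dmd0 else "fill"
```

Since the decision is re-evaluated from the live registers on every step, calling a function like this once per step matches the stateless model described earlier.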

Jorge Nuno · Post by **Jorge Nuno** » Fri Sep 13, 2013 10:57 pm

While this info is pretty much overwhelming I'll just say that the test rom failed on my GenIII on tests 6, 25, 34, 74, 77, 80, 83, 86, 89, 92, and 95

The extra VScroll memory may be related to why this particular model isn't affected by the 2-cell VScroll bug on column -1 (or maybe not)

But yeah this ASIC has its own particularities, one of which is the fact that z80 accesses also have /AS and /DTAK pulses O_o

Oh, and fixing the TAS instruction...

This being the 315-6123 ASIC (VA2). The VA1 board may behave differently as it uses an earlier IC (-5960)

This model uses a 256kB SDRAM IC as the main 68k ram (only 64k being accessible by software), vram and z80 ram seem to be integrated inside the ASIC...

Eke · Post by **Eke** » Sat Nov 02, 2013 3:39 pm

@Nemesis: thanks for the detailed information, although I am not sure I follow your whole theory about CD4 being an image of an internal "busy" status (it seems a little bit too twisted from a hardware design point of view)

Anyway, I took some time recently to implement these new findings in Genesis Plus GX (r829 & r830) which now passes all tests except one (FIFO timing).

Apart from the aforementioned stuff (FIFO ring buffer usage on reads, DMA, etc), here is some undocumented behaviour that was also required and (partially) verified:

- on DMA Fill, busy flag is actually immediately (?) set after the CTRL port write, not the DATA port write that starts the Fill operation

- on VRAM copy, VRAM source and destination addresses are actually the adjacent addresses (address ^ 1) to the internal address register values. This does not matter for most VRAM Copy operations since they are done on an even byte quantity, but it can be verified when doing a single byte copy for example.

- on VSRAM read, address is masked like with CRAM reads (bits 0 and 7-15 are ignored) but since VSRAM only has 40 entries, reading above VSRAM boundaries will return VSRAM first entry
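
The VRAM copy point in the list above can be illustrated with a one-step sketch. `vram_copy_byte` is a hypothetical helper operating on a flat byte array, and it shows only the `address ^ 1` adjustment, not the register updates that follow each step.

```python
def vram_copy_byte(vram, source_reg, dest_reg):
    """Copy one byte using the adjacent (address ^ 1) locations."""
    vram[dest_reg ^ 1] = vram[source_reg ^ 1]
```

A single byte copy from register value 0 to register value 4 therefore moves the byte at address 1 to address 5, which is the case that makes the effect visible.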

Can you confirm this ?

Nemesis · Post by **Nemesis** » Sun Nov 03, 2013 11:04 pm

Eke wrote: - on VSRAM read, address is masked like with CRAM reads (bits 0 and 7-15 are ignored) but since VSRAM only has 40 entries, reading above VSRAM boundaries will return VSRAM first entry

Can you confirm this ?

I can confirm everything except this. When you read beyond the end of VSRAM, you don't get the first VSRAM entry, what actually happens is that the read doesn't latch any data at all, and what gets returned is actually the current state of the internal register that latches the VSRAM read data. This register state is continuously modified by the render process itself, since it latches data from VSRAM. Some interesting points to note about how the VDP latches VSRAM data for rendering are as follows:
-The VDP latches VSRAM data for a 40-cell display even in 32-cell mode, it just doesn't use the extra data.
-The VDP latches VSRAM data for each line, even for lines that are outside the visible area of the display.
-Disabling the display doesn't stop the VDP latching VSRAM data
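
A minimal sketch of the out-of-range behaviour described above, under stated assumptions: the address masking (bits 0 and 7-15 ignored) is taken from Eke's description, `render_latch` is a hypothetical stand-in for the internal register that the render process keeps updating, and the undefined-bit merging from the FIFO is left out.

```python
VSRAM_ENTRIES = 40


def vsram_read(vsram, address, render_latch):
    """Return the stored entry, or the render latch for out-of-range reads."""
    index = (address & 0x7E) >> 1
    if index < VSRAM_ENTRIES:
        return vsram[index]
    # No data is latched beyond the end of VSRAM; the caller sees whatever
    # the render process last latched.
    return render_latch
```

An accurate emulator would need `render_latch` to track the render-time VSRAM fetches, which is where the timing sensitivity comes from.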

Obviously since reads from VSRAM aren't at a premium, it wasn't worth adding more complexity to limit access to them. The VSRAM read cycle runs its own independent process, simply latching the necessary VSRAM data relative to the hcounter and the vscroll register settings into an internal register. The render process simply reads the current live state of this register in order to determine which vscroll data to use. If you're reading during active scan with the display enabled, you can predict what values you'll see based on the relative alignment of the external access slots to the VSRAM read operations the VDP performs for rendering. Just fill VSRAM with different data for each entry and enable 2-cell column scrolling and you'll see this come through.

It's also possible for the VDP's render-time VSRAM reads and an external VSRAM read from this upper area to "collide", which causes unexpected (but predictable) results. With a highly sensitive test, I was able to use these collision points to detect at which points the VDP latches VSRAM data for rendering. See "TestVSRAMDataCacheNew.asm", which is a disabled test. There are a lot of notes in there. I'm not sure the final results are actually in there, but there's a lot of info, and the test just needs another rewrite based on my latest info to make it useful.

What all this means for testing is that reads from the upper VSRAM areas are fully predictable, but highly timing sensitive. Exodus has support for this right now, but I don't believe I've got the timing 100% cycle accurate yet, it might be a couple of cycles off.

mickagame · Post by **mickagame** » Sun Oct 12, 2014 9:02 am

When the FIFO is full, the VDP doesn't release DTACK until one slot is free?

Mask of Destiny · Post by **Mask of Destiny** » Sun Oct 12, 2014 7:37 pm

Correct. !DTACK will remain high until the VDP has been able to commit the word to the FIFO. Note that FIFO entries hold a full word of data to write, but VRAM is only byte-wide so it takes two external slots for a single external entry to exit the FIFO. There's also some latency involved in the FIFO, so there's a delay (2 or 3 slots IIRC) between when a word gets written to the FIFO and when the first byte gets written to VRAM even when the display is off.

mickagame · Post by **mickagame** » Mon Oct 13, 2014 8:24 am

Thanks !

mickagame · Post by **mickagame** » Mon Oct 13, 2014 2:04 pm

Why are the PSG registers listed in the "VDP registers" section of Charles MacDonald's documentation?

TmEE co.(TM) · Post by **TmEE co.(TM)** » Mon Oct 13, 2014 3:00 pm

PSG is inside the VDP

Charles MacDonald · Post by **Charles MacDonald** » Sat Jan 03, 2015 10:29 pm

TmEE co.(TM) wrote:PSG is inside the VDP

In fact $7F00-$7F1F on the Z80 side maps to $C00000-$C0001F on the 68K side, though it's not possible for the Z80 to control the data or control port in a meaningful way because of the bus size mismatch of 8-bit vs 16-bit. Only in Mark III compatibility mode (/M3 pin grounded) can the Z80 correctly access the VDP and use Mode 5, etc.
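
The address mapping mentioned above amounts to a fixed offset, sketched here in Python. `z80_to_68k_vdp_address` is a hypothetical helper name, and the sketch covers only the address translation, not the 8-bit versus 16-bit bus mismatch.

```python
def z80_to_68k_vdp_address(z80_address):
    """Translate a Z80 address in $7F00-$7F1F to its 68K-side equivalent."""
    if not 0x7F00 <= z80_address <= 0x7F1F:
        raise ValueError("address is outside the Z80 VDP window")
    return 0xC00000 + (z80_address - 0x7F00)
```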

So we can't use the Z80 as a graphics co-processor. ;)