VDP VRAM access timing

Nemesis · Post by **Nemesis** » Thu Dec 30, 2010 8:41 am

I've been doing a lot of testing on the VDP over the last year or so, mostly in bursts of activity here and there. I have a lot of unpublished information to share, but I'm going to wait to post most of it until I'm further along in some areas of my testing, because a lot of it relates to each other, and there are still a lot of things I don't know for sure, and I want to make sure everything I post is right. I've got a tidbit I want to post now though, because it's fairly independent, and I think there are a number of people here who might be interested in it, and could make use of it now.

One of the big things about the VDP that's currently unknown is exactly when various bits of data are latched from VRAM during a scanline, and when external reads and writes are allowed into VRAM relative to them. This information is essential in order to be able to accurately respond to mid-line changes to the contents of VRAM. Unfortunately, the only real way to know this for sure is to use a logic analyzer to snoop on the activity over the private VRAM bus during operation, which needs specialized equipment and a modified Mega Drive in order to do. Erm, like this:

I could post a lot of information about exactly how the VDP uses the VRAM, how the memory is addressed, the bus logic used to handle data transfer, etc, but I don't want to take too long right now writing up pages and pages of information which isn't really directly helpful for emulation. I'll post all this stuff later when I formally write up all my notes on the VDP as a whole.

Here's what you need to know in order to understand the timing diagrams:
1. The VDP has a Serial Clock (SC) which drives access to VRAM. SC is twice the "pixel clock", IE, there are two SC ticks for every pixel output by the VDP. When running in H40 mode, SC is equivalent to EDCLK. When running in H32 mode, SC is equivalent to MCLK/5.
2. Every VRAM bus operation effectively takes 4 SC cycles. It's a bit more complicated than that in reality (operations actually take 7 SC cycles, but 3 cycles of each operation overlap), but when thinking about VRAM access, you can simply consider each access to VRAM takes 4 SC cycles, or in other words, two pixels are output by the VDP for every one access to VRAM.
3. Within each 4 SC read cycle, 4 bytes are read (4 nybbles from each VRAM chip in parallel).
4. Within each 4 SC write cycle, 1 byte is written (1 nybble to each VRAM chip in parallel).
5. A refresh cycle takes 4 SC cycles, within which nothing can be read or written.
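
The points above boil down to a little arithmetic. Here is a minimal sketch, assuming the NTSC master clock of 53.693175 MHz and treating EDCLK in H40 as a steady MCLK/4 (an approximation only, since it is slower during hsync). All names are made up for illustration.

```python
# Rough slot arithmetic for the rules listed above (illustrative names only).
MCLK_NTSC_HZ = 53_693_175          # assumption: NTSC master clock
SC_DIVIDER = {"H32": 5, "H40": 4}  # assumption: EDCLK treated as a steady MCLK/4 in H40

SC_PER_SLOT = 4           # one VRAM access slot takes 4 SC cycles
SC_PER_PIXEL = 2          # SC runs at twice the pixel clock
BYTES_PER_READ_SLOT = 4   # 4 nybbles from each VRAM chip
BYTES_PER_WRITE_SLOT = 1  # 1 nybble to each VRAM chip


def slot_time_ns(mode: str) -> float:
    """Duration of one 4-SC access slot in nanoseconds."""
    sc_hz = MCLK_NTSC_HZ / SC_DIVIDER[mode]
    return SC_PER_SLOT / sc_hz * 1e9


def pixels_per_slot() -> int:
    """Pixels output by the VDP while one access slot completes."""
    return SC_PER_SLOT // SC_PER_PIXEL


for mode in ("H32", "H40"):
    print(f"{mode}: slot = {slot_time_ns(mode):.1f} ns, "
          f"{pixels_per_slot()} pixels per slot")
```

Under those assumptions a slot is roughly 372 ns in H32 and 298 ns in H40.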

So, now that you know all that, here are my timing notes (I hope you can understand my writing):

Here's some additional things you need to know about these notes:
1. For layers A and B, two mapping pairs (IE, 4 bytes) are read in each mapping read slot.
2. A single row of pixels (4 bytes) is read for each pattern read slot. Only the row of pixels within a cell which is visible on the current scanline is read, so when a block mapping references a cell, only 4 bytes are read, returning an 8 pixel row from that cell.
3. Sprite mapping data is read in the order sprites are parsed according to the link data. If there are no more sprites on the current scanline, any remaining sprite mapping read slots are wasted. They don't become available for external VRAM access.
4. Sprite pattern data is read in the order the mapping data is read, with the order being left to right for cells within a single sprite. If there are no more sprites on the current scanline, any remaining sprite pattern read slots are wasted. They don't become available for external VRAM access.
5. As you can see, sprite pattern data is finished being read at the start of the scanline, and then is used to display pixels on that scanline. This means for each scanline, all the mapping data for every sprite, and most of the pattern data, is read during the previous line.
6. The first read of layer A and B pattern data at the start of the line is used to read in an additional 2 cell block for each layer, to support horizontal scrolling (IE, even though the display may be, for example, 32 cells wide, 33 cells might be visible due to scrolling). If the cells in layer A and B are perfectly aligned to the screen, so that every cell is entirely visible, these additional reads are still performed, but the results are not used.
7. Only the last 4 bytes of each 8-byte sprite attribute entry are read from VRAM, containing the block mapping for the sprite, and the horizontal position. The first 4 bytes, containing the vertical position, size, and link data, are cached internally in the VDP, as per the notes already published here about sprite attribute table caching.
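
To illustrate point 7, here is a small sketch that splits one 8-byte sprite attribute entry into the half held in the internal cache and the half fetched in a sprite mapping read slot. The type and function names are hypothetical, and no individual bit fields are decoded.

```python
# Split one 8-byte sprite attribute entry into the half the VDP keeps in its
# internal cache and the half it fetches from VRAM in a sprite mapping slot.
from typing import NamedTuple


class SpriteEntryHalves(NamedTuple):
    cached: bytes    # bytes 0-3: vertical position, size, link
    fetched: bytes   # bytes 4-7: block mapping, horizontal position


def split_sprite_entry(entry: bytes) -> SpriteEntryHalves:
    """Return the cached and VRAM-fetched halves of a sprite entry."""
    if len(entry) != 8:
        raise ValueError("a sprite attribute entry is 8 bytes")
    return SpriteEntryHalves(cached=entry[:4], fetched=entry[4:])


halves = split_sprite_entry(bytes(range(8)))
print(halves.cached.hex(), halves.fetched.hex())
```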

I hope you guys find this info useful. Feel free to ask any questions.

TmEE co.(TM) · Post by **TmEE co.(TM)** » Thu Dec 30, 2010 10:54 am

I guess the refresh happens once every 64 pixel "zone" (there's no access spots there).

Regarding sprites, one thing I saw when connecting the VRAM lines to the RGB input of a TV (the red channel, to be precise): sprite tile data appeared 16 pixels before the end of the active screen area (well, all data is seen 16 pixels before it's actually shown on the VDP output, including background tile data). 32 pixels' worth of sprite tiles were seen in 8 screen pixels, which should mean that there are 4 sprite pixels fetched per screen pixel, so it should take 80 pixels to fetch a line of sprite pixels in H40, 64 in H32.... ?

TascoDLX · Post by **TascoDLX** » Thu Dec 30, 2010 9:50 pm

Awesome, Nemesis! Can't wait for more.

Nemesis · Post by **Nemesis** » Fri Dec 31, 2010 3:24 am

TmEE co.(TM) wrote:Regarding sprites, one thing I saw when connecting the VRAM lines to the RGB input of a TV (the red channel, to be precise): sprite tile data appeared 16 pixels before the end of the active screen area (well, all data is seen 16 pixels before it's actually shown on the VDP output, including background tile data). 32 pixels' worth of sprite tiles were seen in 8 screen pixels, which should mean that there are 4 sprite pixels fetched per screen pixel, so it should take 80 pixels to fetch a line of sprite pixels in H40, 64 in H32.... ?

Correct. In each 4 SC read cycle, 4 bytes are read from VRAM, and for pattern data there are two pixels per byte, so two pixels are read every SC cycle, with 8 pixels read in every pattern read slot. There are 40 sprite pattern read slots in H40 mode and 32 in H32 mode, so there is a maximum of 256 sprite pixels per line in H32, and 320 in H40.
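
The numbers in this exchange can be checked with a few lines. This is only a sketch of the arithmetic, with illustrative names, using the slot counts quoted above.

```python
# Sprite pattern fetch budget per scanline, from the slot counts quoted above.
SPRITE_PATTERN_SLOTS = {"H32": 32, "H40": 40}
PIXELS_PER_PATTERN_SLOT = 8   # 4 bytes per read slot, 2 pixels per byte
SCREEN_PIXELS_PER_SLOT = 2    # each slot spans 4 SC cycles = 2 output pixels


def max_sprite_pixels(mode: str) -> int:
    """Most sprite pixels that can be fetched for one scanline."""
    return SPRITE_PATTERN_SLOTS[mode] * PIXELS_PER_PATTERN_SLOT


def fetch_span_in_screen_pixels(mode: str) -> int:
    """How many output pixels go by while the sprite pattern slots run."""
    return SPRITE_PATTERN_SLOTS[mode] * SCREEN_PIXELS_PER_SLOT


for mode in ("H32", "H40"):
    print(mode, max_sprite_pixels(mode), fetch_span_in_screen_pixels(mode))
```

This gives 256 sprite pixels fetched over 64 screen pixels in H32, and 320 over 80 in H40, which is 4 sprite pixels per screen pixel in both modes.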

TascoDLX · Post by **TascoDLX** » Fri Dec 31, 2010 4:45 pm

A minor gripe:

The manual says the maximum wait time for VRAM writes (to a full FIFO) is 4.77us in H40 mode. Let's see: SC in H40 mode is MCLK/4, outside of H-sync. Because of the refresh cycle, there's a gap of 16 access slots between external access slots. So, that's a maximum wait time of 4/MCLK * 4 * 16 = 4.77us. Good.

However, according to the new data, the largest gap is actually 26 slots, and half of that is during H-sync with the overall slower clock. Maybe you have some thoughts on this.
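
For reference, the wait-time arithmetic above can be reproduced as follows. This is a sketch that assumes the NTSC master clock of 53.693175 MHz and a constant SC of MCLK/4; because SC is slower during H-sync, the 26-slot figure is only a lower bound.

```python
# Worst-case wait for an external access slot, given a gap measured in slots.
MCLK_NTSC_HZ = 53_693_175  # assumption: NTSC master clock
SC_PER_SLOT = 4            # one access slot takes 4 SC cycles


def wait_us(gap_slots: int, mclk_divider: int = 4) -> float:
    """Time in microseconds to sit through gap_slots slots at SC = MCLK/divider."""
    sc_period_s = mclk_divider / MCLK_NTSC_HZ
    return gap_slots * SC_PER_SLOT * sc_period_s * 1e6


print(f"16-slot gap: {wait_us(16):.2f} us")   # the figure from the manual
print(f"26-slot gap: {wait_us(26):.2f} us")   # lower bound, ignores slower H-sync clock
```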

Also, from the manual, it's pretty obvious that on V-blank lines it's all external access slots except for the usual refresh slots (H32: 167+4=171 ; H40: 205+5=210).

Now if we count the lines in NTSC, for instance, we have 224 active + 36 blank (that's according to the manual). So, 2 lines remain (or 2.5, in case that means anything). There should be 80 slots in H40 mode used to cache the sprite table, or 64 slots in H32 mode, but what other than that? Perhaps I'm forgetting something. Is there anything else going on there, or is it just bonus time?
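
The tallies in the last two paragraphs, written out as a quick sketch. The 262-line NTSC frame is an assumption for the non-interlaced case, and the names are illustrative.

```python
# Line and slot tallies from the manual figures quoted above.
NTSC_LINES_PER_FRAME = 262      # assumption: non-interlaced NTSC frame
ACTIVE_LINES = 224
BLANK_LINES_PER_MANUAL = 36

# (external access slots, refresh slots) on a V-blank line
BLANK_LINE_SLOTS = {"H32": (167, 4), "H40": (205, 5)}
PIXELS_PER_SLOT = 2             # each slot spans 4 SC cycles = 2 pixels


def unaccounted_lines() -> int:
    """Lines that are neither active nor counted as blank by the manual."""
    return NTSC_LINES_PER_FRAME - ACTIVE_LINES - BLANK_LINES_PER_MANUAL


def slots_per_line(mode: str) -> int:
    """Total slots on a V-blank line: external plus refresh."""
    external, refresh = BLANK_LINE_SLOTS[mode]
    return external + refresh


print("unaccounted lines:", unaccounted_lines())
for mode in ("H32", "H40"):
    slots = slots_per_line(mode)
    print(mode, slots, "slots =", slots * PIXELS_PER_SLOT, "pixels per line")
```

At two pixels per slot, the manual's slot counts imply a line of 342 pixels in H32 and 420 pixels in H40.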

panzeroceania · Post by **panzeroceania** » Sun Jan 02, 2011 6:50 am

Hey Nemesis, I was curious if there is any newer public build of Exodus than the one released in 2007

I don't mean to nag but I've been a fan for a long time and was just wondering if there have been any public updates since then.

Eke · Post by **Eke** » Mon Jan 03, 2011 11:07 am

Very good stuff (as always), thank you Nemesis for doing this.

I better understand now how sprites are processed:
so, during the previous line, sprites are preprocessed (one sprite per 2-cell group, giving 16 or 20 sprites max. per line) using the ypos/size/link info from internal RAM, retrieving xpos/attributes/name from VRAM and storing it somewhere (maybe still in internal RAM?) for further processing.
During HBLANK, using that information, pixel data is grabbed from VRAM and stored in the internal sprite buffer, which is later emptied at the dot rate during the next active line.
All these steps only happen when the display is enabled (and the VBLANK flag is cleared), which explains all the weird stuff that can happen when playing with the display status bit mid-frame and/or mid-line.

Some hypotheses:
-> Sprite overflow flag is probably set at the end of this line, if last link is != 0.
-> Sprite collision flag is probably set during HBLANK, when the internal buffer is filled. Maybe it can be set if the internal buffer has not been cleared entirely on previous line (display disabled during active line)

The remaining questions would be about the other RAM accesses. I think the VDP uses a shared bus and wouldn't be able to access them at the same time, so CRAM/VSRAM access slots must fit somewhere in that scheme (at least, external access slots are obviously shared).
CRAM accesses are most likely done on each pixel, once the priority controller outputs a CRAM entry value, but it would be interesting to figure out when VSRAM accesses are done on a line relative to the VRAM slots.
This could also help explain the weird stuff happening to the left-most column when horizontal and 2-cell vertical scrolling are applied at the same time. I don't really know how this could be accurately measured though.

Another thing I'm curious about is the parallel/serial access thing. VRAM can use both types of access, and it seemed to me that only the second uses the serial clock. I would have thought that mapping data was addressed using parallel mode and pixel data using serial mode, and that the access slots might not all be the same... but I guess you will describe this more precisely later.

I will also try to relate your data to HCOUNTER, which would be of more use to emulators than the HSYNC signal.

TascoDLX wrote:Now if we count the lines in NTSC, for instance, we have 224 active + 36 blank (that's according to the manual). So, 2 lines remain (or 2.5, in case that means anything). There should be 80 slots in H40 mode used to cache the sprite table, or 64 slots in H32 mode, but what other than that? Perhaps I'm forgetting something. Is there anything else going on there, or is it just bonus time?

Maybe the manual is simply not correct (they made some calculation mistakes for other modes). We already know that the VBLANK flag is cleared on the last line, which is obviously done in order to enable sprites (pre)processing for line 0. I bet that external/refresh slots are similar as observed during active display and others are simply unused.

HardWareMan · Post by **HardWareMan** » Mon Jan 03, 2011 11:28 am

Also, I've discovered that the VRAM signal "SC" (yellow, clock for SIO) is EDCLK (blue).

And RAS-CAS cycle according to EDCLK:

Also, I saw a CAS-before-RAS refresh cycle, but I don't know where in the raster it is (an oscilloscope with more than 2 channels is required).

Nemesis · Post by **Nemesis** » Mon Jan 03, 2011 10:41 pm

TascoDLX wrote:A minor gripe:

The manual says the maximum wait time for VRAM writes (to a full FIFO) is 4.77us in H40 mode. Let's see: SC in H40 mode is MCLK/4, outside of H-sync. Because of the refresh cycle, there's a gap of 16 access slots between external access slots. So, that's a maximum wait time of 4/MCLK * 4 * 16 = 4.77us. Good.

However, according to the new data, the largest gap is actually 26 slots, and half of that is during H-sync with the overall slower clock. Maybe you have some thoughts on this.

I'd say the manual is simply incorrect. My measurements from the VRAM bus are pretty definitive. My logic analyser is powerful enough to sample all bus activity over an entire hscan line at once, so my reading of the bus activity is accurate. I haven't had to fill in any blanks or take guesses, just observe and record what the actual VDP is doing during operation.

TascoDLX wrote:Also, from the manual, it's pretty obvious that on V-blank lines it's all external access slots except for the usual refresh slots (H32: 167+4=171 ; H40: 205+5=210).

Correct, although I want to look a bit more into the bus activity on the very first and last active lines of the display. I know that the VDP becomes available for user access when it reaches the two back-to-back external access slots on the last line, and it becomes unavailable at that same point on the line before the first line of the display. I have some unanswered questions about sprites though. Does anyone know for sure if you can actually display sprites on the very first line of the display? If you can, the VDP would have to start parsing the sprite list before the first line, or parse it during the last line of the previous frame and carry it over to the next frame instead, otherwise, how would it know the sprite mappings for the first line? Usually the list of sprite mappings are built up in the previous line, but for the first line in the display, there is no previous line.

Another thing I need to confirm is what the VDP does when interlacing is active. One thing I can tell you is that when the "shorter" lines which occur during vsync happen, the refresh slots stay the same, IE, the bus activity is unaffected. During interlacing, the number of shorter lines in vblank is odd however, in order to displace the raster position on the following field, so I want to check what the VDP does in this case, since it would need to adjust the bus timing somehow to "resync" the bus access with the start of a line.

TascoDLX wrote:Now if we count the lines in NTSC, for instance, we have 224 active + 36 blank (that's according to the manual). So, 2 lines remain (or 2.5, in case that means anything). There should be 80 slots in H40 mode used to cache the sprite table, or 64 slots in H32 mode, but what other than that? Perhaps I'm forgetting something. Is there anything else going on there, or is it just bonus time?

The entire sprite table is never cached (that we know of). The VDP latches and stores the first 4 bytes of each sprite mapping in an internal cache as they are written to VRAM, but that cache is never invalidated and re-loaded from VRAM, which is why changing the sprite table address register results in the data from the old location being used. To my knowledge, at this point, I don't believe any VDP access to VRAM occurs outside of the active scan areas, apart from the refresh cycles.
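
For emulator authors, the caching behaviour described here might be modelled along these lines. This is a toy sketch, not how any particular emulator does it: the class and method names are hypothetical, and the 80-entry table size is an assumption for H40.

```python
# Toy model: the first 4 bytes of each sprite entry are latched as they are
# written to VRAM, and the cache is never re-loaded when the table moves.
class SpriteCacheModel:
    def __init__(self, table_base: int, sprite_count: int = 80):
        self.table_base = table_base
        self.sprite_count = sprite_count          # assumption: 80 entries (H40)
        self.vram = bytearray(0x10000)
        self.cache = [bytearray(4) for _ in range(sprite_count)]

    def write_vram(self, address: int, value: int) -> None:
        """Write one byte to VRAM, latching it if it hits a cached field."""
        self.vram[address] = value
        offset = address - self.table_base
        if 0 <= offset < self.sprite_count * 8 and offset % 8 < 4:
            self.cache[offset // 8][offset % 8] = value

    def set_table_base(self, new_base: int) -> None:
        """Change the sprite table address; the cache is left untouched."""
        self.table_base = new_base


vdp = SpriteCacheModel(table_base=0xF000)
vdp.write_vram(0xF000, 0x12)   # lands in sprite 0's cached half
vdp.write_vram(0xE000, 0x99)   # outside the table: goes to VRAM only
vdp.set_table_base(0xE000)     # cache still holds data from the old location
print(hex(vdp.cache[0][0]))
```

After the table address changes, the cached half of sprite 0 still reflects what was written at the old location, while the data already sitting at the new location was never latched.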

Eke wrote:Some hypotheses:
-> Sprite overflow flag is probably set at the end of this line, if last link is != 0.
-> Sprite collision flag is probably set during HBLANK, when the internal buffer is filled. Maybe it can be set if the internal buffer has not been cleared entirely on previous line (display disabled during active line)

This is something I am going to very specifically test. It ties into this comment:

Eke wrote:I will also try to relate your data to HCOUNTER, which would be of more use to emulators than the HSYNC signal.

One very cool thing that's important to note is that the VDP outputs the value of the status register live. What that means is, if the M68000 is in the process of reading the status register, and the value of any of the bits change while the data lines from the VDP are being asserted reporting its current value, the data lines immediately change to reflect the new value. I know from previous testing that the same is true of the hcounter. This means, I can get a cycle-accurate reading of the digital properties of the VDP, such as the status register and hv counter, and determine the exact SC cycle they change, and therefore how they relate to the analog output of the chip as well as VRAM bus access. This is my next project. I've successfully sampled the timing of the hblank flag relative to hsync now, for example. I'll post more information on how all these signals correlate when I've completed this next phase of testing.

Eke wrote:The remaining questions would be about the other RAM accesses. I think the VDP uses a shared bus and wouldn't be able to access them at the same time, so CRAM/VSRAM access slots must fit somewhere in that scheme (at least, external access slots are obviously shared).
CRAM accesses are most likely done on each pixel, once the priority controller outputs a CRAM entry value, but it would be interesting to figure out when VSRAM accesses are done on a line relative to the VRAM slots.
This could also help explain the weird stuff happening to the left-most column when horizontal and 2-cell vertical scrolling are applied at the same time. I don't really know how this could be accurately measured though.

Yep, I'm going to look into this too, but I haven't thought too much yet about how I'll measure it. Oh well, I'll find a way somehow.

Eke wrote:Another thing I'm curious about is the parallel/serial access thing. VRAM can use both types of access, and it seemed to me that only the second uses the serial clock. I would have thought that mapping data was addressed using parallel mode and pixel data using serial mode, and that the access slots might not all be the same... but I guess you will describe this more precisely later.

One of the most surprising things for me when analysing the VDP VRAM access was to discover that, although the VRAM is dual-port, the VDP seems to ONLY use serial access. The VDP never uses the RAM port at all. Since the serial bus provided enough bandwidth, and allowing external access during active scan wasn't a priority, I guess they opted for simple and just used the serial bus for everything. Coordinating parallel access to both ports would have no doubt increased the complexity of the VDP.

HardWareMan wrote:Also, I've discovered that the VRAM signal "SC" (yellow, clock for SIO) is EDCLK (blue).

Correct, when H40 mode is active, SC is equivalent to EDCLK (or, more specifically, when bit 7, RS0, of register 12 is set. Bit 0, RS1, affects the digital operation of the chip, enabling the drawing of a 40 cell display, while RS0 affects the analog operation of the chip, affecting the clock signals and analog video timing). When H32 mode is active, SC is equivalent to MCLK/5.
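
A sketch of how the two bits could be decoded separately, following the description above. The function name and the returned dictionary keys are made up for illustration.

```python
# Decode the two width-related bits of VDP register 12 as described above.
RS0_BIT = 7   # analog side: selects the clock source for SC
RS1_BIT = 0   # digital side: enables drawing of a 40 cell display


def decode_reg12(value: int) -> dict:
    """Return the SC source and cell width implied by register 12."""
    rs0 = bool((value >> RS0_BIT) & 1)
    rs1 = bool((value >> RS1_BIT) & 1)
    return {
        "sc_source": "EDCLK" if rs0 else "MCLK/5",
        "cells_wide": 40 if rs1 else 32,
    }


print(decode_reg12(0x81))  # both bits set: normal H40
print(decode_reg12(0x00))  # both bits clear: normal H32
print(decode_reg12(0x01))  # RS1 only: 40 cell drawing on the MCLK/5 clock
```

Keeping the two bits apart lets a model represent the mixed settings instead of assuming a single H32/H40 switch.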

HardWareMan wrote:And RAS-CAS cycle according to EDCLK:

That's when the VDP is setting up a read. One SC cycle sets the row, the following SC cycle sets the column, then there's a one cycle delay, and the first nybble of the data is moved over the serial bus on the following clock cycle. For a 4 cycle read (which reads 4 bytes: because the VRAM chips work in parallel, that's 4 nybbles from each), the read cycles overlap. When RAS is next asserted, the second nybble of the previous read is only just being moved over the bus; when CAS is asserted, it's the third nybble; and during the one cycle delay, the last nybble of the previous read is sent. Then the next read is processed, so the serial data bus can read one nybble from each chip (1 byte combined) every SC cycle without any gap between successive reads. Writes can't overlap with reads, so a write cycle can only manage to transfer 1 byte in 4 SC cycles.
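
The overlap is easier to see laid out cycle by cycle. The sketch below is a simplified model of the description above: each read is given 7 SC cycles (row, column, one delay, four data transfers) and a new read starts every 4 SC cycles. The phase labels are illustrative.

```python
# Lay out overlapping 7-cycle reads that start every 4 SC cycles.
PHASES = ["RAS", "CAS", "wait", "data0", "data1", "data2", "data3"]
SLOT_LEN = 4  # a new read is started every 4 SC cycles


def bus_timeline(num_reads: int) -> list:
    """Return, for each SC cycle, the list of events active in that cycle."""
    total_cycles = SLOT_LEN * (num_reads - 1) + len(PHASES)
    timeline = [[] for _ in range(total_cycles)]
    for read in range(num_reads):
        start = read * SLOT_LEN
        for offset, phase in enumerate(PHASES):
            timeline[start + offset].append(f"read{read}:{phase}")
    return timeline


for cycle, events in enumerate(bus_timeline(3)):
    print(f"SC {cycle:2d}: {', '.join(events)}")
```

From SC cycle 3 onwards, every cycle carries exactly one data transfer, which is the "no gap between successive reads" behaviour.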

Note that even when the bus is "idle", it is still transferring data. The VDP just runs off and reads the next successive value from VRAM when it has nothing to do, so in reality, the VRAM bus is always active transferring something, the VDP just doesn't always use the result.

HardWareMan wrote:Also, I saw a CAS-before-RAS refresh cycle, but I don't know where in the raster it is (an oscilloscope with more than 2 channels is required).

Those will be the refresh slots I've marked on the timing diagram.

Nemesis · Post by **Nemesis** » Mon Jan 03, 2011 10:44 pm

panzeroceania wrote:Hey Nemesis, I was curious if there is any newer public build of Exodus than the one released in 2007

I don't mean to nag but I've been a fan for a long time and was just wondering if there have been any public updates since then.

It's nice to know there are people interested in it. I haven't released any "public" builds since 2007, but I have made some private releases on another forum. Here's the last release I made, back in February last year:
http://nemesis.hacking-cult.org/Exodus/ ... -02-04.rar

This is still using my old crappy VDP core, so that means it's still locked in PAL mode, and there'll still be a lot of graphical problems. The purpose of the testing I'm doing on the VDP right now is to answer the remaining questions I need to answer in order to complete my cycle-accurate VDP core, which is really the last major step in terms of emulation I need to complete in order to get my emulator in a releasable state. I'm quite confident that 2011 will be the year this thing finally gets a full public release. Lots more to do before then, but it's moving forward.

Here's what I said about this release last year when I posted it on the other forum:
"I haven't said much about this emulator in awhile because I haven't wanted to build up too much enthusiasm before it's ready for release. I've been working on it steadily for the last few years, with some breaks, and some frantic development at other times. It's coming along nicely IMO, but it's not ready for public release yet. I want to make sure this emulator is really, really awesome when it comes out so it gets a bit of a following on release, rather than release it in a crappy state and lots of people never look at it again, even though those problems all get fixed in a later release.

Ok, a few things:
1. This emulator is still in development. I've been working on it steadily for the last few years, with some breaks, and some frantic development at other times. It's coming along nicely IMO.
2. This emulator has come a long way, but it's still not ready for release. It's probably going to set a record for the most mature emulator on first release ever, but that's the way I want it.
3. This emulator is NOT Windows 9x compatible. Later builds use API features that lock in a minimum of Windows XP. Windows 2000 and earlier, and the Win9X line, will not work.
4. Minimum system requirements: Core 2 Duo. Won't run everything full speed. Recommended: Core 2 Quad. You'll be laughing. This emulator makes very effective use of multiple cores.
5. Here's the latest build: http://nemesis.hacking-cult.org/Exodus/ ... -02-04.rar

It's got the most accurate YM2612 and PSG cores ever made. The Z80 needs some work, but it's more accurate than anything except Kega and MAME. The M68000 is more accurate than anything I've tested, including MAME. The VDP is a pile of crap; a new core is in development, but it isn't operational yet, so you've got the old crappy core. The I/O interface is still just a stub, so you've only got basic 3-button controller support. Region is still locked to European until the new VDP core is online.

Because this is my pet project, I've been focusing on whatever interests me at the time. I haven't been working towards a release, so I've spent inordinate amounts of time working on some parts of the emulator, and pretty much no time working on some more critical sections. That's just the way it is.

Let me know what you think so far. Some things to check out:
-Savestates. Check the file menu for shortcut keys. I really like the interface. I think you will too.
-Not documented, but use ctrl+tab to select subwindows.

Oh, and PLEASE let me know if this emulator ever crashes. It should be a rock. If you ever get a crash, you've found a bug, and I can fix it, but only if I know it's there, and where to look. If a crash occurs, the emulator will spit out a minidump crash report to a "Crash Reports" subfolder. Send me a copy of this crash report and I'll be able to debug the crash."

Just to add to that, here's what I said in a previous post on that forum about how to use it:
"How to use the emulator:
1. Load up the exe. I've set the config file to auto-load the preliminary Mega Drive system.
2. Select "File->Open ROM" to load a game. This will change later, but for the time being, that's how you load games. Select a raw binary Mega Drive ROM. SMD format is not supported and will never be supported.
3. Controls are hard-coded. The input is fixed as a 3-button controller. Keys zxcv are A, B, C, and start respectively. Keys ijkl are up, left, down, and right. You'll need to have the "Image" subwindow selected in order for your keystrokes to count as input."

Enjoy.

Chilly Willy · Post by **Chilly Willy** » Mon Jan 03, 2011 11:53 pm

Nemesis wrote: One of the most surprising things for me when analysing the VDP VRAM access was to discover that, although the VRAM is dual-port, the VDP seems to ONLY use serial access. The VDP never uses the RAM port at all. Since the serial bus provided enough bandwidth, and allowing external access during active scan wasn't a priority, I guess they opted for simple and just used the serial bus for everything. Coordinating parallel access to both ports would have no doubt increased the complexity of the VDP.

If this is the case, it seems to me that a mod can be made that allows the 68000 to directly access the VRAM. Have a daughterboard with a buffer to the VRAM data and address buses, and have the logic watch for serial port loads. Seems to me a single CPLD would handle everything... granted, it would take a LOT of wiring, but it seems doable.

HardWareMan · Post by **HardWareMan** » Tue Jan 04, 2011 5:19 am

Nemesis wrote:One of the most surprising things for me when analysing the VDP VRAM access was to discover that, although the VRAM is dual-port, the VDP seems to ONLY use serial access. The VDP never uses the RAM port at all. Since the serial bus provided enough bandwidth, and allowing external access during active scan wasn't a priority, I guess they opted for simple and just used the serial bus for everything. Coordinating parallel access to both ports would have no doubt increased the complexity of the VDP.

I disagree with that. I analyzed the "DT/OE" signal and saw that it is a very active one. Moreover, it sometimes (periodically) performs an SIO access cycle. It's not logical to use SIO for CPU access: you would have to read a whole row, then change a single data cell and write the row back to the matrix. But SIO is very helpful when burst reading is needed. I think CPU access still goes through PIO, while raster reads go through SIO. And because RAS/CAS are common signals, PIO and SIO have to share access cycle time between them (which is periodic and has stable timing).

Chilly Willy · Post by **Chilly Willy** » Tue Jan 04, 2011 6:07 am

The serial port in VRAM isn't completely independent of the parallel port - you have to make a standard access like you would on the parallel port, but instead of getting one word, you load the entire row into the VRAM's shift register. After that, data can be shifted out independently of the parallel port. So some cycles for the serial port will take away access cycles from the parallel port.
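
As a rough illustration of that point, here is a toy dual-port model in which a row transfer costs one cycle on the shared array interface, while shifting data out does not. The class, its dimensions, and the cycle counter are all invented for the example and do not describe a real VRAM part.

```python
# Toy dual-port VRAM: row loads share the array interface, shifting does not.
class DualPortVramModel:
    def __init__(self, rows: int = 4, cols: int = 8):
        # Fill each cell with a recognisable 4-bit value.
        self.cells = [[(r * cols + c) & 0xF for c in range(cols)]
                      for r in range(rows)]
        self.shift_register = []
        self.array_cycles = 0   # cycles spent on the shared RAS/CAS interface

    def parallel_read(self, row: int, col: int) -> int:
        """Ordinary random access through the parallel port."""
        self.array_cycles += 1
        return self.cells[row][col]

    def load_serial_row(self, row: int) -> None:
        """Copy a whole row into the shift register; uses an array cycle too."""
        self.array_cycles += 1
        self.shift_register = list(self.cells[row])

    def shift_out(self) -> int:
        """Clock one value out of the serial port without touching the array."""
        return self.shift_register.pop(0)


vram = DualPortVramModel()
vram.load_serial_row(1)
first = vram.shift_out()
value = vram.parallel_read(0, 3)   # can happen while serial data is shifting
second = vram.shift_out()
print(first, second, value, vram.array_cycles)
```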

Eke · Post by **Eke** » Tue Jan 04, 2011 9:14 am

Nemesis wrote:Does anyone know for sure if you can actually display sprites on the very first line of the display? If you can, the VDP would have to start parsing the sprite list before the first line, or parse it during the last line of the previous frame and carry it over to the next frame instead, otherwise, how would it know the sprite mappings for the first line? Usually the list of sprite mappings are built up in the previous line, but for the first line in the display, there is no previous line.

Some games define a smaller screen area (vertical "bars") by using the display enable bit, and in that very particular case the first line of sprites is dropped, which is explained by the fact that sprites are not pre-processed when this bit is cleared. This is generally not handled by emulators.

However, in games that use the full active area, the first line can show sprites; I think otherwise they wouldn't have defined that virtual area for sprites coming from hidden parts of the screen.

Anyway, I'm pretty sure this is the reason the VBLANK flag is cleared one line before the first active line and that this line behaves like any other following lines, with only 16/18 slots for CPU access and 16/20 slots for sprite parsing, the rest being simply unused. I also have the example of Outrunners, which is very picky about DMA/VINT timings (corrupted intro screen) and, when I implemented more accurate DMA writes, only worked if I reduced the DMA transfer rate of the last line.
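
A sketch of what that means for a DMA model, using the external slot counts mentioned in this thread (16/18 on active-style lines, 167/205 on blank lines) and one byte per VRAM write slot. The names are hypothetical, and treating the line before the display as "active" follows the guess above rather than a confirmed measurement.

```python
# VRAM write throughput per line, from the external access slot counts above.
import math

EXTERNAL_SLOTS = {
    ("H32", "active"): 16,
    ("H40", "active"): 18,
    ("H32", "blank"): 167,
    ("H40", "blank"): 205,
}
BYTES_PER_WRITE_SLOT = 1   # a VRAM write slot transfers one byte


def vram_bytes_per_line(mode: str, line_kind: str) -> int:
    """Bytes that can be written to VRAM during one line of the given kind."""
    return EXTERNAL_SLOTS[(mode, line_kind)] * BYTES_PER_WRITE_SLOT


def lines_needed(byte_count: int, mode: str, line_kind: str) -> int:
    """Whole lines needed to write byte_count bytes at that rate."""
    return math.ceil(byte_count / vram_bytes_per_line(mode, line_kind))


# Assumption (per the post above): the line before the first active line is
# budgeted like an active line, not like a blank one.
print(lines_needed(1024, "H40", "blank"))
print(lines_needed(1024, "H40", "active"))
```

A 1024-byte transfer in H40 fits in 5 blank lines but would need 57 lines at the active-line rate, which shows how much the budget for that one line can matter.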

Nemesis · Post by **Nemesis** » Wed Jan 05, 2011 6:51 am

HardWareMan wrote:
Nemesis wrote:One of the most surprising things for me when analysing the VDP VRAM access was to discover that, although the VRAM is dual-port, the VDP seems to ONLY use serial access. The VDP never uses the RAM port at all. Since the serial bus provided enough bandwidth, and allowing external access during active scan wasn't a priority, I guess they opted for simple and just used the serial bus for everything. Coordinating parallel access to both ports would have no doubt increased the complexity of the VDP.
I disagree with that. I analyzed the "DT/OE" signal and saw that it is a very active one. Moreover, it sometimes (periodically) performs an SIO access cycle. It's not logical to use SIO for CPU access: you would have to read a whole row, then change a single data cell and write the row back to the matrix. But SIO is very helpful when burst reading is needed. I think CPU access still goes through PIO, while raster reads go through SIO. And because RAS/CAS are common signals, PIO and SIO have to share access cycle time between them (which is periodic and has stable timing).

You're right, DT/OE becomes active when user code is accessing VRAM. I haven't completed thorough testing of the bus logic during external access to VRAM yet; I just identified the access slots and focused on the VDP access to VRAM. The VDP uses the serial port exclusively for reading data for rendering, but it looks like the RAM port is used for external VRAM access.

Eke wrote:Anyway, I'm pretty sure this is the reason the VBLANK flag is cleared one line before the first active line and that this line behaves like any other following lines, with only 16/18 slots for CPU access and 16/20 slots for sprite parsing, the rest being simply unused. I also have the example of Outrunners, which is very picky about DMA/VINT timings (corrupted intro screen) and, when I implemented more accurate DMA writes, only worked if I reduced the DMA transfer rate of the last line.

Ahh, that would make sense. I'll try and capture some measurements on the number of "active" scanlines in a frame and see if I can verify this.