Super VDP

Ask anything you want about 32X Mushroom programming.

Moderator: BigEvilCorporation

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Fri Jan 30, 2009 8:37 am

Ok, thought I'd add my input here since I played with this sort of thing a lot back in the day. ob1 has already discussed this with me privately but I figured someone else might find it interesting.

First of all, sadly, most of the ideas presented here just will not work if you want to hit 60fps. In 256 colour mode, doing byte writes, it will take almost an entire frame just to do a straight copy from SDRAM to the frame buffer. So, one plane, and a few sprites is about the best you can expect.

The RAM speed doesn't really come into it. There is a FIFO in the VDP between you and the framebuffer, and writing constantly is going to get you one write every five cycles, at best. There's no way around that.

I could never work out whether you can write faster during VBLANK. The answer seemed to be no, but since the fastest you could hope for is three cycles per write, due to the RAM speed, the difference isn't huge anyway.

The upshot of this is approx 76000 writes per frame max, *IF* you can hit that 5 cycle per pixel, which isn't easy to do. Given that 320x224 is 71680 pixels, you can see it isn't possible to get much more than one plane this way.

You could up this to two planes if you do word writes. But that complicates scrolling, and flipping. Unfortunately the SH2 instruction set isn't exactly helpful here, and anything more complicated than that is going to make it very hard to hit that 5 cycles. Alternatively you could do one 32K colour plane, which would be much easier, but... kinda sucks.

You really don't have much time for decompressing/recolouring/caching or converting tiles on the fly. You might save a bit of bandwidth, but writing fast enough is the holy grail here.

DMA is a complete waste of time. It's no faster than the CPU at best, due to that 5 cycle thing again. It's only as fast as the CPU if you use the modes that require both source and destination to be on a, IIRC, 128 byte boundary, which is useless here - and unless you're going to DMA big chunks at a time, it's going to be way slower anyway.

It should be fairly obvious that writing to an external buffer, and then copying to the framebuffer, isn't going to be faster. Also, that means you're going to have to mask the layers, because you can't use the overwrite function for the second layer, or any sprites.

The best way to hit that 5 cycle per pixel limit is using both CPUs. The first one to write will lock for 5 cycles, the other will continue to run and fetch the next pixel, and eventually lock for 5 cycles when it comes to write - giving the first CPU time to fetch its next pixel, etc. If you time it right you can get max performance this way.

You could use longs - it won't write any faster, but it will mean you lock for 10 cycles instead of 5, giving you more time to prepare the next write.

The CPUs do not get in the way of each other anywhere near as much as everybody, everywhere, seems to keep saying. Most instructions in your loop will be coming from each CPU's own instruction cache, most data reads will come from each CPU's own data cache. They only contend when they both need to use the same bus, which, if you do it right, isn't often (this should happen while the other CPU is locked waiting for the VDP, anyway).

Obviously you don't need to clear the screen, you're going to overwrite the entire thing anyway. There may be mileage in checking, while writing the second layer, for blank tiles - and just skipping them - under the assumption that big chunks of the second layer are often empty. On the other hand, you don't want this to become a requirement...

This absolutely requires ASM, and mad optimisation skillz ;)

Good luck, you've made me want to play with this again. Just need that damn dev cart...

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Wed Feb 04, 2009 10:44 am

Snake wrote:There is a FIFO in the VDP between you and the framebuffer, and writing constantly is going to get you one write every five cycles, at best.
Well ... I don't get this.
The 32X Hardware Manual states
The frame buffer and overwrite image have 4 word write FIFO and can write in 3 clock cycles. Five clock cycle are required when continuously writing 4 words or more.
The way I understand this is that there's a FIFO attached to the FB. A kind of pipeline, as I imagine, like this :

Code:

    VDP
+---------+	^
| Stage 4 |	|
+---------+	| W
| Stage 3 |	| R
+---------+	| I
| Stage 2 |	| T
+---------+	| E
| Stage 1 |	|
+---------+	|
    SH2
So, I wouldn't say I cannot write more than one 16-bit word every five cycles. I would say that each write has a 3-step latency (I'd rather say 4 steps ...), and, when the FIFO is full, one has to wait for a word to exit before another can enter.
I understand it exactly as a pipeline. You have to pass through some stages, but it flows straight through. If the FIFO is not full, the data streams like water in a pipe.
Or do I misunderstand something ?

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Wed Feb 04, 2009 11:47 am

The way I read it is that if there is space in the FIFO, it takes only 3 cycles to write to the FIFO. However, the FIFO empties at no faster than 5 cycles, so once it fills up, you have to wait for a slot to become free, which is 5 cycles.

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Wed Feb 04, 2009 12:34 pm

Here's a datasheet for a 16 x 4-bit FIFO register : http://www.datasheetcatalog.com/datashe ... 0105.shtml

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Fri Feb 06, 2009 7:48 pm

Chilly Willy wrote:The way I read it is that if there is space in the FIFO, it takes only 3 cycles to write to the FIFO. However, the FIFO empties at no faster than 5 cycles, so once it fills up, you have to wait for a slot to become free, which is 5 cycles.
Correct. And since you are going to be constantly writing, you are pretty much going to be waiting 5 cycles constantly.
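
A toy model of that reading of the manual may make the numbers easier to see. The sketch below assumes a 4-entry FIFO, 3 cycles for a write while a slot is free, and one entry draining every 5 cycles. The function name `write_times` and the exact stall rule are invented for illustration; they are not taken from the 32X Hardware Manual.

```python
def write_times(n_writes, depth=4, write_cycles=3, drain_cycles=5):
    """Cycle at which each of n_writes back-to-back writes is accepted."""
    accepted = []   # time each write enters the FIFO
    drained = []    # time each entry leaves the FIFO
    t = 0
    for i in range(n_writes):
        t += write_cycles
        if i >= depth:
            # FIFO full: wait until the entry written `depth` writes ago has left
            t = max(t, drained[i - depth])
        accepted.append(t)
        # entries leave in order, one every drain_cycles
        previous = drained[-1] if drained else 0
        drained.append(max(t, previous) + drain_cycles)
    return accepted

times = write_times(1000)
gaps = [b - a for a, b in zip(times, times[1:])]
print(gaps[:12])                 # spacing of the first writes
print(times[-1] / len(times))    # average cycles per write over a long burst
```

In this model the first few writes land 3 cycles apart, and once the FIFO has filled every write lands 5 cycles after the previous one, which is the "waiting 5 cycles constantly" case.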

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Mon Mar 16, 2009 9:50 am

Snake wrote:The upshot of this is approx 76000 writes per frame max, *IF* you can hit that 5 cycle per pixel
76k writes/frame = 23 MHz / 60 fps / 5 cycles per write
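
The same arithmetic, spelled out as a small sketch. It assumes the rounded 23 MHz clock quoted above, 60 fps, and a 320x224 screen in 256 colour mode; the names `SH2_HZ` and `writes_per_frame` are just labels for this example.

```python
SH2_HZ = 23_000_000        # rounded SH2 clock, as quoted in the thread
FPS = 60
WIDTH, HEIGHT = 320, 224

def writes_per_frame(cycles_per_write):
    """Upper bound on framebuffer writes in one frame."""
    return SH2_HZ // FPS // cycles_per_write

pixels = WIDTH * HEIGHT                        # one byte per pixel in 256 colour mode
budget = writes_per_frame(5)                   # sustained rate once the FIFO is full
planes_byte_writes = budget / pixels           # one pixel per write
planes_word_writes = budget / (pixels // 2)    # two pixels per write

print(budget, pixels)
print(round(planes_byte_writes, 2), round(planes_word_writes, 2))
```

With byte writes the budget covers barely more than one full plane; with word writes it covers a little over two, which matches the "one plane" and "two planes" estimates earlier in the thread.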

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Mon Mar 16, 2009 8:22 pm

If you're writing words, remember a framebuffer will be 32000 writes for 256 color mode, and 64000 words for 15 bit mode. So 256 color mode is easy to keep going 60 Hz, but 15 bit mode consumes a considerable amount of bus time. Much of the CPU would be devoted solely to writing video data. Yet another reason most folks use 256 color mode for game screens, saving 15 bit mode for static displays.

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Tue Mar 17, 2009 8:28 am

@Chilly Willy : answer in GLide32X topic ;)

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Transparency and propagation

Post by ob1 » Tue Apr 21, 2009 2:06 pm

Let's say we have a tile line (1 32 bits longword) : 40516273
I unpack them in ~5 cycles and get 2 32-bits longword :
-0-1-2-3 and -4-5-6-7

Previously I interlaced them with the pal so to get :
p0p1p2p3 and p4p5p6p7
If you're in A plane (background), no problem, you're going to overwrite previous pixels and we happy. But, if you're in B plane, it can be a problem. If I want entry 0 of each palette to be transparent (lets say pixel 2 and 5 are 0), I must modify my 2 longwords to get :
p0p100p3 and p400p6p7
How do I do this ?

First I get back (to where my heart belong) to where I started from :
-0-1-2-3 and -4-5-6-7
and add 0F0F0F0F. I get :
snsnsnsn and snsnsnsn
"s" being Saturation, and "n"... yway !!! s is set if the palette entry is not zero, s is reset if palette entry is zero.
Then I isolate Saturation, ANDing F0F0F0F0. I get :
s-s-s-s- and s-s-s-s-
But "s" is only 1 bit long. In order to mask the palette number, I need to propagate this value on the 4 bits of the palette. Thus I'd get a mask.
Not that hard. R3 being s-s-s-s-,
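mov R3,R4
shll R4
or R4,R3      ; s now covers 2 bits
mov R3,R4
shll2 R4      ; shift by 2 this time, so the 2 set bits fill the remaining 2
or R4,R3      ; s now covers all 4 bits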
6 cycles. Now, s is 4 bit long, 0h if palette entry is 0, and Fh if palette entry is <> 0. I can now mask the palette number. R1 being p0p1p2p3 :
and R3,R1
Et voilà !
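
The saturate-and-propagate idea can be checked on plain 32-bit integers. This is only a model, not SH2 code: it fills the four bits with a shift-and-OR by 1 followed by a shift-and-OR by 2, and the merge step in `apply_palette` (mask the palette nibble, then OR the pixel nibble back in) is an assumption about how the mask ends up being used.

```python
MASK32 = 0xFFFFFFFF

def transparency_mask(pixels):
    """pixels is 0x0n0n0n0n: one 4-bit palette entry per byte."""
    s = (pixels + 0x0F0F0F0F) & 0xF0F0F0F0   # 0x10 in each byte where n != 0
    s |= (s << 1) & MASK32                   # 0x30
    s |= (s << 2) & MASK32                   # 0xF0
    return s

def apply_palette(pixels, palette):
    """Assumed merge: keep the palette nibble only where the pixel is non-zero."""
    pal = (palette & 0xF) * 0x10101010       # palette nibble replicated into each byte
    return (pal & transparency_mask(pixels)) | pixels

line = 0x01020003                            # pix0=1, pix1=2, pix2=0, pix3=3
print(hex(transparency_mask(line)))
print(hex(apply_palette(line, 1)))
```

For this input the byte belonging to pix2 stays 0x00 while the other three bytes pick up the palette number, so only the non-zero pixels would be written through the O/W.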

While in meeting this morning, I got the illumination, the holy shine. I suddenly remember that :
0h x Fh = 0h
and
1h x Fh = Fh

So, R3 being s-s-s-s-,
mov #$F,R0
mul.l R0,R3
sts MACL,R3
2 cycles. 4 fewer cycles for 4 pixels. Nearly 10k fewer cycles / CPU / B frame. 5% more !
(Yes, I will watch the pipeline and make other things between mul.l and sts ;) )
It's not Wonderland yet, there's still a huge amount of work to do, but I'm improving, I am.
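
A quick way to convince yourself that the multiply gives the same mask as shift-and-OR is to try every zero / non-zero combination of the four pixels in a longword. The sketch below does that on Python integers; the helper names are made up for the example and it says nothing about cycle counts.

```python
MASK32 = 0xFFFFFFFF

def saturation(pixels):
    """0x10 in each byte whose 4-bit pixel value is non-zero."""
    return (pixels + 0x0F0F0F0F) & 0xF0F0F0F0

def mask_by_shifts(s):
    s |= (s << 1) & MASK32
    s |= (s << 2) & MASK32
    return s

def mask_by_multiply(s):
    # 0x10 * 0xF = 0xF0 in each byte, so nothing carries into the next byte
    return (s * 0xF) & MASK32

results = []
for bits in range(16):
    # build a longword with pixel value 5 wherever the corresponding bit is set
    pixels = sum((0x05 if (bits >> i) & 1 else 0) << (8 * i) for i in range(4))
    s = saturation(pixels)
    results.append(mask_by_shifts(s) == mask_by_multiply(s))

print(all(results))
```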

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Propagation and multiply

Post by ob1 » Thu Apr 23, 2009 7:30 am

Well ... heaven is not here yet !
MUL.L is a damn slow operation : IF ID EX (stalls CPU) MA MA (yes, twice !) and mm mm mm mm. 9 steps, 10 cycles and a lot of stalls. With that in mind, I can only have 6 instructions (besides MUL.L and STS.L) in 12 cycles. It ain't faster than classic propagation anymore :(

Edit : SH2-A is better suited.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Thu Apr 23, 2009 6:23 pm

Did we ever get any exact timings on the overwrite area? If MUL is too slow, it might make it useful after all.

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Sat Apr 25, 2009 8:10 am

As you and Snake said, a VDP write takes between 3 and 5 cycles.
I guess the O/W is connected to the VDP, so it should take 3-5 cycles too.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Sun Apr 26, 2009 1:06 am

ob1 wrote:As you and Snake said, a VDP write takes between 3 and 5 cycles.
I guess the O/W is connected to the VDP, so it should take 3-5 cycles too.
If overwrite is just 5 cycles long, isn't that far better than making masks and all that other stuff? The code you gave above was what, 8 cycles? Waiting 5 cycles for the O/W would be 3 cycles faster. As long as 0 is your transparent color, writing to the O/W area would eliminate the need for masking. That's the whole point of having an O/W area in the first place. Remember that the masking code also has to read the framebuffer to get the data that isn't masked, then write it back out, and we know the write will take 3 to 5 cycles. So you have code to make the masks, the read, the masking, then the write. That's GOT to be slower than just writing the O/W area.
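
For anyone following along, here is a rough model of what writing through the O/W area buys you, assuming the behaviour described in this thread: a byte of 0 is simply not written. `overwrite_write` is a hypothetical helper for illustration, not a real API.

```python
def overwrite_write(framebuffer, offset, data):
    """Model of writing bytes through the overwrite image: zero bytes are skipped."""
    for i, value in enumerate(data):
        if value != 0:
            framebuffer[offset + i] = value

fb = bytearray([0xAA] * 4)                        # background plane already drawn
overwrite_write(fb, 0, bytes([0x11, 0x00, 0x13, 0x10]))
print(fb.hex())
```

The second byte keeps the background value 0xAA, with no read, mask, or write-back in software. Note that only an exact 0x00 is skipped: a non-zero byte such as 0x10 is written like any other.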

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Sun Apr 26, 2009 8:41 am

I don't think the O/W makes a read from the current FB, compares it with the value to write, and, if this value is 0x00, leaves the current FB untouched. I'd rather think of simple logic circuitry. So it is quite fast. Then there's the VDP, and the FIFO that makes it 5 cycles long when full. And that isn't that fast any more.
Anyway, this is not my point.

I got a tile entry in the plane table for example :
pppptttt tttttttt
p being the 4-bit palette number (I divided the 32X Palette in 16 palettes of 16 colors).
and t being a 12-bit tile address. Just like what you'd find on Genesis.
Then, each line of my tiles is defined like this :
40516273 (32-bits)
which stands for pix4, pix0, pix5, pix1, ... pix7 then pix3. They are all 4-bits, and, when combined with the palette number in the tile entry, you finally get an 8-bit value that you can write to the FB (or O/W). This interlaced tile was suggested by TotoOnTheMoon and is still one of the best ideas of this SuperVDP. It allows reduced bandwidth, clever use of both SH2s, and quite good performance.
I call unpacking the operation getting from
40516273
to
#0#1#2#3
and
#4#5#6#7
It is achieved in 5 cycles, which is quite fast.
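
The actual SH2 sequence isn't shown here, but the reason the interlaced layout unpacks so cheaply can be sketched with one mask and one shift. This is a model on Python integers; `unpack_line` and `add_palette` are names invented for the example.

```python
def unpack_line(line):
    """Split an interlaced 40516273 tile line into -0-1-2-3 and -4-5-6-7."""
    low = line & 0x0F0F0F0F            # pixels 0..3, one per byte
    high = (line >> 4) & 0x0F0F0F0F    # pixels 4..7, one per byte
    return low, high

def add_palette(pixels, palette):
    """Put the 4-bit palette number in the high nibble of every byte."""
    return pixels | ((palette & 0xF) * 0x10101010)

low, high = unpack_line(0x40516273)
print(hex(low), hex(high))
print(hex(add_palette(low, 2)))
```

Using the digits of the example line as pixel values, `low` comes out as 0x00010203 and `high` as 0x04050607, so each longword is already in framebuffer byte order once the palette nibble is ORed in.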

Then we got the transparency. I want each entry 0 of each palette to be transparent, just like the Genesis.
Of course, if the palette number is 0 (4 bits), it's gonna be easy. If the palette is 0 and a tile value is 0 (4 bits), the final value will be 0 (8 bits = 1 byte), which won't be written on the O/W, reading the 32X Hardware Manual. We happy.
But, if the palette number is not 0, it's gonna be harder. In this case, and that case only happens on B-plane, the one that allows transparency, I have to take a look at the tile value, which is a palette entry. And if this value is 0 (4 bits), I have to transform the palette number to 0 (4 bits) too, in order not to write anything with the O/W.
I call propagating the operation from getting from
#0#1#2#3
to
#0--#2#3
, pix1 being 0.
I used to think about propagation in terms of bit manipulations only. And found it quite slow. Then I thought about multiply. And, while it allows fewer operations, the SH2 hardware implementation of MUL.L (I even tried MULU.W !) limits the gain, even worsens it. So, I finally gave up on MUL.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Sun Apr 26, 2009 7:12 pm

ob1 wrote:I got a tile entry in the plane table for example :
pppptttt tttttttt
p being the 4-bit palette number (I divided the 32X Palette in 16 palettes of 16 colors).
and t being a 12-bit tile address. Just like what you'd find on Genesis.
Then, each line of my tiles is defined like this :
40516273 (32-bits)
which stands for pix4, pix0, pix5, pix1, ... pix7 then pix3. They are all 4-bits, and, when combined with the palette number in the tile entry, you finally get an 8-bit value that you can write to the FB (or O/W). This interlaced tile was suggested by TotoOnTheMoon and is still one of the best ideas of this SuperVDP. It allows reduced bandwidth, clever use of both SH2s, and quite good performance.
I call unpacking the operation getting from
40516273
to
#0#1#2#3
and
#4#5#6#7
It is achieved in 5 cycles, which is quite fast.
Ah, yes, that's a clever way of turning 16 color tiles into a 256 color value.

ob1 wrote:Then we got the transparency. I want each entry 0 of each palette to be transparent, just like the Genesis.
Of course, if the palette number is 0 (4 bits), it's gonna be easy. If the palette is 0 and a tile value is 0 (4 bits), the final value will be 0 (8 bits = 1 byte), which won't be written on the O/W, reading the 32X Hardware Manual. We happy.
Yes, that's the ideal case.
ob1 wrote:But, if the palette number is not 0, it's gonna be harder. In this case, and that case only happens on B-plane, the one that allows transparency, I have to take a look at the tile value, which is a palette entry. And if this value is 0 (4 bits), I have to transform the palette number to 0 (4 bits) too, in order not to write anything with the O/W.
I call propagating the operation from getting from
#0#1#2#3
to
#0--#2#3
, pix1 being 0.

I used to think about propagation in terms of bit manipulations only. And found it quite slow. Then I thought about multiply. And, while it allows fewer operations, the SH2 hardware implementation of MUL.L (I even tried MULU.W !) limits the gain, even worsens it. So, I finally gave up on MUL.
Okay, lemme make sure I got that: If the tile's palette table number is 0, all its pixels will be of the form 0x0n, where n is from the tile pattern data. So if the data is 0, then the final 256 color data is 0x00 which is transparent and Bob's your friend.

However, now our tile palette number is 1, so the pixels are all of the form 0x1n. So if the data is 0, the final value is 0x10 which is NOT transparent in the O/W area and we're screwed. Now I see the issue. :?

Perhaps an optional bitmask for tiles? That would eliminate the need to construct a bitmask from the data at the expense of a little more space used for the tile. Instead of an 8x8 tile taking 32 bytes, it would then take 40. The mask could also be used for more sophisticated collision detection as well.
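
A minimal sketch of what such a precomputed mask could look like, assuming one mask byte per tile row with bit 7 as the leftmost pixel. That layout, and the helper names, are assumptions made for the example; the mask would be built offline by a tile converter rather than on the SH2.

```python
def row_mask(pixels):
    """pixels: 8 palette entries (0-15), left to right. Bit set = opaque pixel."""
    mask = 0
    for p in pixels:
        mask = (mask << 1) | (1 if p != 0 else 0)
    return mask

def tile_mask(tile):
    """One mask byte per row of an 8x8 tile."""
    return bytes(row_mask(row) for row in tile)

tile = [[0, 1, 2, 0, 0, 3, 4, 0]] * 8      # same row repeated, just for the demo
mask = tile_mask(tile)

pattern_bytes = 8 * 8 // 2                 # 4 bits per pixel
total_bytes = pattern_bytes + len(mask)
print(mask.hex(), pattern_bytes, total_bytes)
```

The size works out as in the post: 32 bytes of pattern data plus 8 bytes of mask gives 40 bytes per tile.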
