Super VDP

Ask anything you want about the 32X Mushroom programming.

Moderator: BigEvilCorporation

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Thu May 14, 2009 8:21 pm

You're thinking of a modern PC, not a 32X. You don't have the memory or CPU cycles to mess around with. You don't want any screen related buffers in SDRAM if you can help it. You also neglect the time it takes to determine which write you'd use based on two/three different pixel "masks" - you have to fetch values from three arrays, compute the entry in a jump table, fetch it, then call it... for every single pixel for every layer. That's not a speed up, it's an incredible slow down.

TotOOntHeMooN
Interested
Posts: 38
Joined: Sun Jun 01, 2008 1:12 pm
Location: Lyon, France

Post by TotOOntHeMooN » Thu May 14, 2009 8:39 pm

It's possible that I haven't really grasped how much power the SH-2 has.
I was explaining a concept, not the way to program it.
Not every idea is worth taking, but they can help move things forward.
I'll have to do better. :)

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Thu May 14, 2009 9:40 pm

No sweat. First, you want to operate on as many pixels as possible in the smallest number of instructions. That's why that "interleaved" format for pixel colors was discussed previously. You'll also notice that the currently discussed methods operate on at least 4 pixels at once (in 256 color mode). Your method would need to be expanded to operate on four pixels at once, which would increase the size of the write arrays, which would waste more memory. If I've followed the discussion right, the current code is something like this:

fetch a block of pixels
optionally shift it to de-interleave pixels
AND to clear upper bits
OR in the cell color palette
fetch bit mask
look up pixel mask using bit mask
AND to clear transparent pixels
store to framebuffer if first layer, store to overwrite if other layer

If the framebuffer is pre-filled with a particular background color, you could simply always write to overwrite as that last step. I believe the code is dealing with the screen on a line-to-line basis, so the framebuffer could be filled on each line to start while the SH2 initializes for the loop. That way you could change the background color on a line-to-line basis as well.
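
As a rough illustration of the step list above, here is a minimal C sketch for one 8-pixel cell line in 256 color mode. Several things are assumptions made for the example and are not confirmed by the thread: the interleave layout (low nibble of byte k holds pixel k, high nibble holds pixel k+4), the flat table layout `pixmask[2*bitmask]` / `pixmask[2*bitmask+1]`, and the names `draw_cell_line` and `ow_store32`. The overwrite area is emulated on a plain buffer by skipping zero bytes, which is how overwrite stores behave on the real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates a store to the overwrite image on a plain buffer: a zero byte
   is not written, so the existing framebuffer pixel shows through.
   Bytes are taken in big-endian order, as on the SH-2. */
static void ow_store32(uint8_t *fb, uint32_t px)
{
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(px >> (24 - 8 * i));
        if (b != 0)
            fb[i] = b;
    }
}

/* Draws one 8-pixel line of a 4bpp cell into an 8bpp framebuffer line.
   packed  : 8 interleaved 4-bit pixels (layout is an assumption, see above)
   bitmask : one bit per pixel, set = opaque (assumed convention)
   palette : 0..15, becomes the upper nibble of every pixel
   pixmask : hypothetical 256-entry table, two 32-bit words per entry */
void draw_cell_line(uint8_t *fb, uint32_t packed, uint8_t bitmask,
                    unsigned palette, const uint32_t *pixmask)
{
    assert(fb != 0 && pixmask != 0);

    const uint32_t nibbles = 0x0F0F0F0Fu;
    uint32_t pal = (uint32_t)(palette & 0x0Fu) * 0x10101010u;

    /* fetch, optionally shift to de-interleave, AND to clear upper bits,
       OR in the cell color palette */
    uint32_t left  = (packed & nibbles) | pal;
    uint32_t right = ((packed >> 4) & nibbles) | pal;

    /* fetch bit mask, look up pixel mask, AND to clear transparent pixels */
    left  &= pixmask[2u * bitmask];
    right &= pixmask[2u * bitmask + 1u];

    /* always store through the overwrite path */
    ow_store32(fb, left);
    ow_store32(fb + 4, right);
}
```

On the SH-2 each commented step would be one or two instructions working on registers; the byte loop in `ow_store32` exists only to imitate on a PC what the overwrite area does for free.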

TotOOntHeMooN
Interested
Posts: 38
Joined: Sun Jun 01, 2008 1:12 pm
Location: Lyon, France

Post by TotOOntHeMooN » Thu May 14, 2009 10:08 pm

I have read it again. (Sorry for my English.)

Maybe; I think it's possible to do it faster by using a bit mask for the screen (9.8 KB).
I add the sprite and tile masks to it.
Depending on this "big mask", pixels are either skipped or written to the frame buffer.
This way, you never expand and write hidden pixels.
Last edited by TotOOntHeMooN on Fri May 15, 2009 2:30 pm, edited 1 time in total.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Thu May 14, 2009 11:20 pm

TotOOntHeMooN wrote:
Steps:
1- init the clip-screen mask as in the previous post (with the border).
2- ADD the sprite mask to the clip-screen mask, then draw the sprite to the FB [...]
3- ADD the tile mask to the clip-screen mask, then draw the B-plane tile to the FB [...]
4- draw the A-plane tile to the FB [...]
The bit mask for the screen takes 9.8 KB.
I ADD the sprite and tile masks to it.
Depending on this mask, pixels are either skipped or written to the frame buffer.

Note that everything I do is bit-mask manipulation in memory. Does that take a lot of time?
(compared to the time saved by not de-interleaving and writing pixels, many times, at the same coordinates in the frame buffer)
Note that each one of my steps was either one or two assembly instructions (depending on if it's working on one or two regs worth of pixels). Each one of your steps is several instructions... at best. You're discussing cell/sprite graphics at a high level while we're talking about individual cycles and instructions.

Also, you're STILL talking about processing individual pixels. My list was on a block of 8 pixels. Your method does minimize writes to the framebuffer, but at 3 cycles per write (5 if the FIFO is full), writing the framebuffer is less expensive than testing if we should write to the framebuffer at all. Consider: to determine if you should write to the framebuffer (your method), we have to fetch the data from the array that has the bit representing the pixel we are concerned about, we have to test that bit, then we have to do a conditional branch. If you assume the bits to test are prefetched, we can ignore the fetch time. If we assume a register holds a single bit that we rotate once per pixel to test against, we get one cycle for the rotate. The test will be another cycle. The conditional branch will be 3 or 1 cycles depending on the way you do the branching. So you're looking at 3 cycles per pixel best case, just for this one test. Since you can write two pixels per write (256 color mode), that's 6 cycles, which is more than the 5 cycle worst case if you just wrote the damn thing. 8)

EDIT: Oh yeah, most of the time, pixels WILL be drawn, so your method adds 3 cycles (best case) to every write, which since it's a pixel at a time, you have 3/5 cycles per pixel on the write itself. So you've quadrupled the write time for sprites/cells that are mostly drawn - twice the time for the extraneous test, and twice that because we're only writing one pixel at a time instead of two.
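
The arithmetic in this post can be checked with a small C sketch. It uses only the cycle counts quoted above (3 or 5 cycles per framebuffer write, two pixels per write in 256 color mode, and a best case of 1 + 1 + 1 cycles for rotate, test and branch). The function names are made up for the example, and `tested_cycles` assumes every pixel ends up being drawn, which is the "mostly drawn" case from the EDIT.

```c
#include <assert.h>

enum {
    WRITE_BEST = 3,        /* cycles per framebuffer write */
    WRITE_WORST = 5,       /* cycles per write when the FIFO is full */
    TEST_BEST = 1 + 1 + 1, /* rotate + bit test + conditional branch */
    PIXELS_PER_WRITE = 2   /* 256 color mode, 16-bit write */
};

/* Cycles to write `pixels` pixels unconditionally, two per write. */
int direct_cycles(int pixels, int write_cycles)
{
    assert(pixels % PIXELS_PER_WRITE == 0);
    return (pixels / PIXELS_PER_WRITE) * write_cycles;
}

/* Cycles when every pixel is tested first and then written one at a time
   (assumes all pixels pass the test). */
int tested_cycles(int pixels, int write_cycles)
{
    return pixels * (TEST_BEST + write_cycles);
}
```

For 8 pixels this gives 12 cycles direct against 48 tested with 3-cycle writes, which is the factor of four mentioned in the EDIT. With 5-cycle writes it is 20 against 64, and the test alone for two pixels (6 cycles) already costs more than the worst-case write (5 cycles).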

TotOOntHeMooN
Interested
Posts: 38
Joined: Sun Jun 01, 2008 1:12 pm
Location: Lyon, France

Post by TotOOntHeMooN » Thu May 14, 2009 11:58 pm

Yes, I forgot that I have to write pixel by pixel.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Fri May 15, 2009 12:00 am

TotOOntHeMooN wrote:Yes, I forgot that I have to write pixel by pixel.
Well, if you can combine it to work on multiple pixels at once, it might become worth doing. Think about it some more and see if anything comes to mind. Not all ideas pan out, but sometimes you have a breakthrough that helps.

TotOOntHeMooN
Interested
Posts: 38
Joined: Sun Jun 01, 2008 1:12 pm
Location: Lyon, France

Post by TotOOntHeMooN » Fri May 15, 2009 6:24 am

Yes, I'm thinking about it.
Otherwise, is it possible to perform "blitting" by doing a data-block transfer from RAM to the FB using DMA?
Is it efficient to use part of the FB as a buffer?

Thank you.
Last edited by TotOOntHeMooN on Fri May 15, 2009 1:49 pm, edited 1 time in total.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Fri May 15, 2009 6:50 am

TotOOntHeMooN wrote:Yes, I'm thinking about it.

Otherwise, is it possible to perform "blitting" by doing a data-block transfer from RAM to the FB using DMA?
Is it efficient to use part of the FB as a buffer?

Thank you.
According to the SEGA docs, you can't use the SH2 DMA on the framebuffer, or on the cache memory either if you're running the cache in half-n-half mode. By the way, one interesting thing - everyone knows that in half-n-half mode you can treat half the cache like very fast memory; what many people don't know is that when the cache is disabled, you can treat the ENTIRE cache as very fast memory. So you really have THREE modes you could run: all cache and no RAM, half cache and half RAM, and all RAM and no cache. So if you needed 4 KB of fast RAM and didn't need cache, you could disable the cache.
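
The three modes can be summarized in a short C sketch. The 4 KB total comes from the post above. The cache control register bit positions (CE in bit 0, TW in bit 3) are given from memory of the SH7604 hardware manual and should be treated as an assumption to verify there. The enum and function names are made up for the example, and nothing here touches real hardware.

```c
#include <stdint.h>

enum cache_mode {
    CACHE_ALL,   /* all cache, no fast RAM */
    CACHE_HALF,  /* half cache, half fast RAM ("half-n-half") */
    CACHE_NONE   /* cache disabled, all of it usable as fast RAM */
};

/* Assumed SH7604 CCR bit positions; check the hardware manual. */
#define CCR_CE 0x01u /* cache enable */
#define CCR_TW 0x08u /* two-way mode: half of the cache becomes RAM */

#define CACHE_TOTAL_BYTES 4096u

/* How much of the cache memory is directly addressable as fast RAM. */
unsigned fast_ram_bytes(enum cache_mode mode)
{
    switch (mode) {
    case CACHE_HALF: return CACHE_TOTAL_BYTES / 2u;
    case CACHE_NONE: return CACHE_TOTAL_BYTES;
    default:         return 0u;
    }
}

/* CCR value that would select the mode (purge and other bits left out). */
uint8_t ccr_value(enum cache_mode mode)
{
    switch (mode) {
    case CACHE_ALL:  return CCR_CE;
    case CACHE_HALF: return CCR_CE | CCR_TW;
    default:         return 0u;
    }
}
```

Real initialization code would also have to purge the cache when changing modes; that is left out to keep the sketch short.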

TotOOntHeMooN
Interested
Posts: 38
Joined: Sun Jun 01, 2008 1:12 pm
Location: Lyon, France

Post by TotOOntHeMooN » Fri May 15, 2009 8:36 am

Chilly Willy wrote:when the cache is disabled, you can treat the ENTIRE cache as very fast memory.
Nice. And what about performing DMA?
So, would using this memory be the fastest way to copy many bytes to the framebuffer?

Compared to the 68K, the SH-2 has a lot of registers.
Is it a good idea to use two of them to store a tile and speed up some copies?

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Fri May 15, 2009 11:23 am

TotOOntHeMooN wrote:
Chilly Willy wrote:when the cache is disabled, you can treat the ENTIRE cache as very fast memory.
Nice. And what about performing DMA?
You cannot perform DMA on the cache memory, as I mentioned in the previous post.
TotOOntHeMooN wrote:So, would using this memory be the fastest way to copy many bytes to the framebuffer?
Fastest, yes, so it would be a good place for small temporary data structures. The SDRAM would be the next best place.
TotOOntHeMooN wrote:Compared to the 68K, the SH-2 has a lot of registers.
They have the same number, it's just the SH2 makes them all general purpose while half of them are address oriented on the 68000.
TotOOntHeMooN wrote:Is it a good idea to use two of them to store a tile and speed up some copies?
Well, you won't store a whole tile in two registers, but they would hold one line of a cell (assuming cells are 8x8).
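
To make the "one line of a cell in two registers" point concrete: in 256 color mode a line of 8 pixels is 8 bytes, which is two 32-bit values. A minimal C sketch of copying an 8x8 cell that way follows. The function name, the stride parameter and the assumption that the cell is stored as 16 consecutive 32-bit words are all invented for the example.

```c
#include <stdint.h>

/* Copies an 8x8, 8bpp cell to a destination buffer, one line at a time.
   Each line is held in two 32-bit locals, which a compiler would normally
   keep in two registers.
   dst_stride_words: distance between destination lines in 32-bit words
   (for example 80 for a 320-pixel-wide 8bpp line). */
void copy_cell(uint32_t *dst, unsigned dst_stride_words, const uint32_t *cell)
{
    for (unsigned y = 0; y < 8; y++) {
        uint32_t left  = cell[2 * y];     /* pixels 0..3 of this line */
        uint32_t right = cell[2 * y + 1]; /* pixels 4..7 of this line */
        dst[0] = left;
        dst[1] = right;
        dst += dst_stride_words;
    }
}
```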

TotOOntHeMooN
Interested
Posts: 38
Joined: Sun Jun 01, 2008 1:12 pm
Location: Lyon, France

Post by TotOOntHeMooN » Fri May 15, 2009 11:44 am

Chilly Willy wrote:Well, if you can combine it to work on multiple pixels at once, it might become worth doing.
You suggest using a bitmask + lookup table to allow O/W (overwrite).

Why not compute a new bitmask + lookup table?
sprite_mask OR screen_mask + lookup table. This allows O/W too!
By writing front to back, you can skip all hidden pixels, and not only transparent values.

EDIT :
Thank you for your last reply. I'm learning too. :)

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Fri May 15, 2009 7:38 pm

TotOOntHeMooN wrote:
Chilly Willy wrote:Well, if you can combine it to work on multiple pixels at once, it might become worth doing.
You suggest using a bitmask + lookup table to allow O/W (overwrite).

Why not compute a new bitmask + lookup table?
sprite_mask OR screen_mask + lookup table. This allows O/W too!
By writing front to back, you can skip all hidden pixels, and not only transparent values.

EDIT:
Thank you for your last reply. I'm learning too. :)
The bitmask is made by the artist to show transparent pixels in the cell/sprite. Each bit represents a pixel, so one byte makes a single cell line of 8 pixels. One byte is also 0 to 255, so a lookup table will only have 256 entries, which is manageable (2 KB for an 8 pixel mask). Just two fetches using the bit mask as an index gives your pixel mask.
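
A minimal sketch of how such a table could be built, in C. The table is 256 entries of two 32-bit words (8 bytes each), which is the 2 KB mentioned above. The bit order (most significant bit is the leftmost pixel), the convention that a set bit means an opaque pixel, and the flat `table[2*b]` / `table[2*b+1]` layout are assumptions made for the example.

```c
#include <stdint.h>

/* Fills a 256-entry lookup table that expands an 8-bit bit mask into an
   8-byte pixel mask: 0xFF for every set bit, 0x00 for every clear bit.
   table[2*b] covers pixels 0..3, table[2*b + 1] covers pixels 4..7,
   with pixel 0 in the most significant byte. */
void build_pixmask_table(uint32_t table[512])
{
    for (unsigned b = 0; b < 256; b++) {
        uint32_t left = 0, right = 0;
        for (unsigned i = 0; i < 4; i++) {
            unsigned shift = 24u - 8u * i;
            if (b & (0x80u >> i))
                left |= (uint32_t)0xFFu << shift;
            if (b & (0x08u >> i))
                right |= (uint32_t)0xFFu << shift;
        }
        table[2 * b] = left;
        table[2 * b + 1] = right;
    }
}
```

At draw time the two fetches are then `table[2*bitmask]` and `table[2*bitmask + 1]`, each ANDed with one register of pixels.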

Now from the way I take your meaning, what you would do now would be to fetch the bitmask for the cell line, fetch the bitmask for the screen, AND the two together (if we assume that set bits represent empty pixels... the opposite would require NAND, which is two operations), then lookup the pixel mask for the result.

That would work, but remember since we're writing multiple pixels at a time and always writing with the understanding that the OW area takes care of overlap, the only thing the above will do is allow us to write front to back instead of back to front. So in the end, it's not worth the extra space and time needed to AND the two bitmasks together unless the game REQUIRES front to back drawing for some reason.
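
For completeness, the front-to-back variant described above could look roughly like this in C for one 8-pixel group. The conventions are assumptions: a set bit in the cell mask means an opaque pixel, and a set bit in the screen mask means the screen pixel is still empty. The step that clears the claimed bits in the screen mask is not spelled out in the thread; it is added here because later (further back) layers need to see those pixels as taken. The function name is made up.

```c
#include <stdint.h>

/* Combines the cell line's bit mask with the screen's bit mask.
   Returns the bits of the pixels that should actually be drawn, and marks
   those pixels as occupied in the screen mask. The result would then be
   used as the index into the pixel mask lookup table. */
uint8_t claim_pixels(uint8_t cell_opaque, uint8_t *screen_free)
{
    uint8_t draw = cell_opaque & *screen_free; /* opaque AND still empty */
    *screen_free &= (uint8_t)~draw;            /* those pixels are now taken */
    return draw;
}
```

For example, starting from an empty screen byte of 0xFF, a front sprite with mask 0xF0 draws all four of its pixels, and a tile behind it with mask 0x3C then only draws the two pixels (0x0C) that the sprite did not cover.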

Snake
Very interested
Posts: 206
Joined: Sat Sep 13, 2008 1:01 am

Post by Snake » Fri May 15, 2009 10:52 pm

Chilly Willy wrote:Loading the mask can be neglected as you'll do it once at the start of the loop. That spreads 1 clock cycle over enough loops that it's essentially zero. So it really is just five cycles. :D
Hmm, no. The last three instructions are going to pipeline stall. I thought this might be the case.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Sat May 16, 2009 2:12 am

Snake wrote:
Chilly Willy wrote:Loading the mask can be neglected as you'll do it once at the start of the loop. That spreads 1 clock cycle over enough loops that it's essentially zero. So it really is just five cycles. :D
Hmm, no. The last three instructions are going to pipeline stall. I thought this might be the case.
I need to pay more attention to the pipes in the SH2. :lol:
