Faster Sprites routines on the 32X?

Ask anything you want about 32X Mushroom programming.

Moderator: BigEvilCorporation

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Faster Sprites routines on the 32X?

Post by gameblabla » Mon Feb 20, 2017 10:04 pm

I'm back again (I almost gave up on the 32X, but I just came back) and it is clear to me that the 32K-color mode is unusable
for games; it is only suitable for static screens.
I switched back to the 256-color mode and, after some changes to 32x_images.c, came up with the following:

Code: Select all

void drawBackground(char* spriteBuffer)
{
	vu16 *frameBuffer16 = &MARS_FRAMEBUFFER;
	unsigned short size = 35840;
	fast_memcpy(frameBuffer16 + 0x100, spriteBuffer, size);
}

void drawSprite(vu8* spriteBuffer2, vu16 x, vu16 y, unsigned short xWidth, unsigned char yWidth)
{
	//each byte represents the color for each pixel. Allows us to reverse which we can't do with words
	vu8 *frameBuffer8 = (vu8*) &MARS_FRAMEBUFFER;
	vu16 xOff;
	unsigned short bufCnt;
	unsigned char rowPos,yCount,xCount;
	const unsigned short lineTableEnd = 0x100;
	uint32 fbOff = lineTableEnd;
	unsigned short topre;
	int i;

	//offset the number of pixels in each line to start to draw the image
	xOff = x;
	fbOff = lineTableEnd*2;
	//y-offset for top of 320 to correct line in framebuffer
	fbOff = fbOff + (y * 320);
	//x-offset from start of first line
	fbOff = fbOff + xOff;
	bufCnt = 0;
	yCount = 0;
	xCount = 0;
	rowPos = 0;
	topre = (320 - (xWidth + xOff)) + xOff;

	for (rowPos = 0; rowPos < yWidth; rowPos++)
	{
		for(xCount = 0; xCount < xWidth; xCount++)
		{	
			bufCnt = (rowPos * (xWidth)) + xCount;
			frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
			fbOff++;
		}
		fbOff = fbOff + topre;
	}
}
The funny thing I noticed is that byte 0x00 is transparent when using drawSprite, even though I'm not explicitly supporting transparency.
I looked up why it does that, and apparently Sega did it on purpose.
Either way, it works to our advantage for sprites, but it's really annoying for backgrounds.

I got pretty good speed, but I feel like it could be a little... faster.
ammianus came up with drawSpriteMaster here, but it does not work well for me (plus, it's slower anyway).

haroldoop told me to use SuperVDP, but apparently it does not work on real hardware because 68k->32X DMA does not work properly.
The function I'm taking issue with is drawSprite: the two nested for loops slow things down a bit.

How could it be improved?
I'm only interested in one plane plus sprites: if I want another plane, I would just use the Genesis hardware for that.
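
One way to trim the inner loop without leaving C is to drop the per-pixel multiply and index arithmetic and walk two pointers instead. The sketch below is a host-side illustration only: `blit_bytes`, `FB_WIDTH`, and the plain byte buffer standing in for the frame buffer are assumptions, not names from 32x_images.c.

```c
#include <stdint.h>
#include <stddef.h>

#define FB_WIDTH 320  /* pixels (bytes) per line in 256-color mode */

/* Hypothetical helper: copy a w*h byte sprite to (x, y) in a byte-addressed
 * buffer `fb` whose pixel data starts at fb[0]. On real hardware the caller
 * would pass a pointer just past the line table instead. */
static void blit_bytes(volatile uint8_t *fb, const uint8_t *sprite,
                       unsigned x, unsigned y, unsigned w, unsigned h)
{
    volatile uint8_t *dst = fb + (size_t)y * FB_WIDTH + x;
    const size_t skip = FB_WIDTH - w;  /* bytes from end of one row to start of the next */

    while (h--) {
        unsigned col = w;
        while (col--)
            *dst++ = *sprite++;        /* no multiply, no index, just two pointers */
        dst += skip;
    }
}
```

The sprite pointer never needs resetting because the sprite rows are stored back to back; only the destination has to skip the rest of the line.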

Stef
Very interested
Posts: 2586
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Post by Stef » Tue Feb 21, 2017 1:15 pm

Calling word_8byte_copy(..) each time you want to copy 8 pixels of sprite data is definitely heavy if your compiler does not inline it (and if you use GCC, I'm not sure it does). Also, C is much less efficient for low-level work like this; with pure SH2 assembly you can probably improve performance by a factor of 3.
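
If a small helper is not being inlined, GCC can be asked to inline it explicitly. A minimal sketch, assuming a GCC-compatible compiler; `copy8` is a hypothetical stand-in for a small copy helper, not the actual word_8byte_copy from the poster's sources.

```c
#include <stdint.h>

/* `static inline` plus GCC's always_inline attribute asks the compiler to
 * expand the body at every call site, even when optimization is off. */
static inline __attribute__((always_inline))
void copy8(volatile uint16_t *dst, const uint16_t *src)
{
    /* 8 bytes = four 16-bit words, unrolled by hand */
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}
```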

Sik
Very interested
Posts: 511
Joined: Thu Apr 10, 2008 3:03 pm

Re: Faster Sprites routines on the 32X?

Post by Sik » Tue Feb 21, 2017 4:14 pm

gameblabla wrote: The funny thing I noticed is that byte 0x00 is transparent when using drawSprite, even though I'm not explicitly supporting transparency.
I looked up why it does that, and apparently Sega did it on purpose.
Either way, it works to our advantage for sprites, but it's really annoying for backgrounds.

The feature is optional; it depends on which address you use to access the framebuffer (0x24000000 doesn't do it, 0x24020000 does, if I'm reading the manual correctly).
Sik is pronounced as "seek", not as "sick".

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Post by gameblabla » Tue Feb 21, 2017 11:45 pm

Sik wrote: The feature is optional; it depends on which address you use to access the framebuffer (0x24000000 doesn't do it, 0x24020000 does, if I'm reading the manual correctly).

Hmm, you're right, but oddly enough I was using 0x24000000, not 0x24020000, and my sprite was still transparent.
Strange.

Anyway, when I changed spriteBuffer2 from vu8 to vu16, I noticed a huge speed-up.
After fixing it, it ran twice as fast!

Code: Select all

void drawSprite(vu16* spriteBuffer2, vu16 x, vu16 y, unsigned short xWidth, unsigned char yWidth)
{
	//each byte represents the color for each pixel. Allows us to reverse which we can't do with words
	vu16 *frameBuffer8 = (vu16*) &MARS_OVERWRITE_IMG;
	vu16 xOff;
	unsigned short bufCnt;
	unsigned char rowPos,xCount;
	const unsigned short lineTableEnd = 0x100;
	uint32 fbOff = lineTableEnd;
	unsigned short topre;
	unsigned short width_halved;

	//offset the number of pixels in each line to start to draw the image
	xOff = x;
	fbOff = lineTableEnd;
	//y-offset for top of 320 to correct line in framebuffer
	fbOff = fbOff + (y * 160);
	//x-offset from start of first line
	fbOff = fbOff + xOff;
	bufCnt = 0;
	xCount = 0;
	rowPos = 0;
	width_halved = xWidth/2;
	topre = (160 - (width_halved + xOff)) + xOff;

	for (rowPos = 0; rowPos < yWidth; rowPos++)
	{
		for(xCount = 0; xCount < width_halved; xCount++)
		{	
			bufCnt = (rowPos * width_halved) + xCount;
			frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
			fbOff++;
		}
		fbOff = fbOff + topre;
	}
}
I guess I was hitting a bottleneck somewhere? It seems to be because the framebuffer is 16-bit.
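
The byte and word versions above address the same frame buffer in different units, which is easy to get wrong. A small sketch of the arithmetic, assuming the 0x100-word line table and 320-pixel lines used in the posted code; the helper names are hypothetical.

```c
#include <stdint.h>

#define LINE_TABLE_WORDS 0x100u  /* line table size in 16-bit words, as in the posted code */
#define LINE_PIXELS      320u    /* one byte per pixel in 256-color mode */

/* Byte offset of pixel (x, y) from the start of the frame buffer. */
static uint32_t fb_byte_offset(uint32_t x, uint32_t y)
{
    return LINE_TABLE_WORDS * 2u + y * LINE_PIXELS + x;
}

/* Index of the 16-bit word that holds pixel (x, y). */
static uint32_t fb_word_index(uint32_t x, uint32_t y)
{
    return fb_byte_offset(x, y) / 2u;
}
```

Each word holds two pixels, so a line is 160 words and x has to be halved to get a word index. As far as I can tell, the vu16 version above adds xOff to the word index unchanged, which would make x count in 2-pixel steps.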
Then I tried to implement it using fast_memcpy:

Code: Select all

for (rowPos = 0; rowPos < yWidth; rowPos++)
{
	for(xCount = 0; xCount < width_halved; xCount+=8)
	{	
		bufCnt = (rowPos * width_halved) + xCount;
		fast_memcpy(frameBuffer8 + fbOff, spriteBuffer2 + bufCnt, 4);
		//word_8byte_copy(frameBuffer8 + fbOff, spriteBuffer2 + bufCnt, 2);
		//frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
		fbOff+=8;
	}
	fbOff = fbOff + topre;
}
I doubt it's inlined, because even with the "inline" keyword applied to the functions, it does not seem to make any difference...
Also, I saw no noticeable difference between fast_memcpy and word_8byte_copy here.
Maybe I'm doing it wrong? Or maybe I should release the full source on GitHub :P
I wish I could do it in SH2 ASM, but I'm not really interested in learning assembly for any CPU.
I think it's plenty fast for now. Thanks for the help.

One last thing though: has anyone managed to compile ammianus's sixpack?
https://github.com/ammianus/sixpack-ammianus
The official sixpack reorders and strips the palette, even if the images use the same internal palette.
He's using Eclipse on Windows; that sucks, as I'm using Ubuntu on my main PC and I don't want to use Windows ever again.
In fact, I wonder how you guys managed to get around this?

Stef
Very interested
Posts: 2586
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Post by Stef » Wed Feb 22, 2017 10:25 pm

If inlining does not work, at least try replacing your fast_memcpy(..) call with the body of fast_memcpy(..) written inline.
I can't tell whether you are copying 8 bytes or 16 bytes per memcpy(..) call; it looks like each pixel is 16 bits, so 16 bytes?
In which case you should try to replace:

Code: Select all

for (rowPos = 0; rowPos < yWidth; rowPos++)
{
   for(xCount = 0; xCount < width_halved; xCount++)
   {   
      bufCnt = (rowPos * width_halved) + xCount;
      frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
      fbOff++;
   }
   fbOff = fbOff + topre;
}
by this :

Code: Select all

int srcOff = 0;
int row = yWidth;

while (row--)
{
   vu32* src = (vu32*) (&spriteBuffer2[srcOff]);
   vu32* dst = (vu32*) (&frameBuffer8[fbOff]);
   int x = width_halved / 2;
   
   while(x--) *dst++ = *src++;
   
   srcOff += width_halved;
   fbOff += topre;
}
Should be faster...

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Post by gameblabla » Thu Feb 23, 2017 4:07 am

I could be wrong, but I think you forgot to add fbOff++; in the while(x--) loop.
I couldn't get it to work as is, so I had to hack it a bit:

Code: Select all

inline void drawSprite(vu32* spriteBuffer, vu16 x, vu16 y, unsigned short xWidth, unsigned char yWidth)
{
	vu32 *frameBuffer8 = (vu32*) &MARS_OVERWRITE_IMG;
	vu16 xOff;
	unsigned short bufCnt;
	const unsigned short lineTableEnd = 0x100;
	uint32 fbOff = lineTableEnd;
	unsigned short topre;
	unsigned short width_halved;

	width_halved = xWidth/4;
	//offset the number of pixels in each line to start to draw the image
	xOff = x;
	fbOff = lineTableEnd;
	//y-offset for top of 320 to correct line in framebuffer
	fbOff = fbOff + ((y * 320));
	//x-offset from start of first line
	fbOff = ((fbOff + xOff) - 128);
	bufCnt = 0;
	topre = (80 - (width_halved + xOff)) + xOff;
	
	unsigned long srcOff = 0;
	short row = yWidth;

	while (row--)
	{
		vu32* src = (vu32*) (&spriteBuffer[srcOff]);
		vu32* dst = (vu32*) (&frameBuffer8[fbOff]);
		short x = width_halved;
	   
		while(x--)
		{
			*dst++ = *src++;
			fbOff+=1;
		}
	   
		srcOff += width_halved;
		fbOff += topre;
	}
}
It's not ideal, but I can definitely see an improvement in speed, so many thanks Stef!
I was doing a memcpy for every 8 pixels, thinking it would improve the speed somewhat, but I was wrong...
Looking at fast_memcpy, it seems that doing so would actually slow it down...
Anyway, thanks guys.

Stef
Very interested
Posts: 2586
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Post by Stef » Thu Feb 23, 2017 12:02 pm

gameblabla wrote: I could be wrong, but I think you forgot to add fbOff++; in the while(x--) loop.

Oh yeah, stupid typo. In fact, seeing your code, you can improve it a bit further:

Code: Select all

inline void drawSprite(u32* spriteBuffer, u16 x, u16 y, u16 xWidth, u16 yWidth)
{
   vu32 *frameBuffer = (vu32*) &MARS_OVERWRITE_IMG;
   // dst frame buffer pointer (X + Y offset applied; divide by 4 because this is a 32-bit pointer)
   vu32* dst = &frameBuffer[(0x100 + (y * 320) + (x - 0x80)) / 4];
   // src sprite pointer
   u32* src = spriteBuffer;

   const u16 qwidth = xWidth / 4; 
   const int dstStep = 80 - qwidth; 
   
   u16 row = yWidth;
   
   while (row--)
   {
      u16 col = qwidth;
      
      while(col--) *dst++ = *src++;
      
      dst += dstStep;
   }
}

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Post by gameblabla » Fri Feb 24, 2017 12:07 am

Hmm... Sorry for bothering you, but your new code does not work well for me: the sprites move correctly vertically but not horizontally.
Here's the source:
https://github.com/gameblabla/32x-playground
And the ROM :
https://github.com/gameblabla/32x-playg ... xample.32x

The sprites move in 2-pixel steps instead of 1. Not sure why.

Stef
Very interested
Posts: 2586
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Post by Stef » Fri Feb 24, 2017 10:22 am

Yeah, I thought about the problem just after posting the code.
The problem is that a sprite can be displayed at any X offset, so possibly at an odd address.
With a 32-bit pointer you force the address to be aligned on 4 bytes, which results in X movement in 4-pixel steps.
So you definitely need to fall back to a byte copy (as spriteBuffer is not necessarily aligned with the frame buffer); just change the code like this:

Code: Select all

inline void drawSprite(u8* spriteBuffer, u16 x, u16 y, u16 xWidth, u16 yWidth)
{
   vu8 *frameBuffer = (vu8*) &MARS_OVERWRITE_IMG;
   // dst frame buffer pointer (X + Y offset applied)
   vu8* dst = &frameBuffer[0x100 + (y * 320) + (x + 256)];
   // src sprite pointer
   u8* src = spriteBuffer;

   const u16 xw = xWidth;
   const int dstStep = 320 - xw;
   
   u16 row = yWidth;
   
   while (row--)
   {
      u16 col = xw;
     
      while(col--) *dst++ = *src++;
     
      dst += dstStep;
   }
}
Unfortunately that will be much slower :-/
Using the 16-bit graphics mode would at least allow 16-bit transfers.
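
The alignment effect is easy to see in the index arithmetic alone. A tiny sketch, assuming 320 bytes per line and a 0x200-byte line table as in the code above; `fb_long_index` is a hypothetical helper.

```c
#include <stdint.h>

/* Index of the 32-bit word that contains the byte for pixel (x, y).
 * The integer division throws away the low two bits of the byte offset. */
static uint32_t fb_long_index(uint32_t x, uint32_t y)
{
    return (0x200u + y * 320u + x) / 4u;
}
```

Every x from 0 to 3 lands on the same 32-bit word, and the index only advances at x = 4, which is why a sprite drawn through a 32-bit pointer can only start on a 4-pixel boundary.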

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Post by gameblabla » Sat Feb 25, 2017 3:26 am

Stef wrote: Yeah, I thought about the problem just after posting the code.
The problem is that a sprite can be displayed at any X offset, so possibly at an odd address.
With a 32-bit pointer you force the address to be aligned on 4 bytes, which results in X movement in 4-pixel steps.
So you definitely need to fall back to a byte copy (as spriteBuffer is not necessarily aligned with the frame buffer); just change the code like this:
...
Unfortunately that will be much slower :-/

So copying per byte is the only way to make sure it is not misplaced due to alignment issues?
Well, that sucks.
I guess the only way is to have 2 functions: one that copies per byte and another that is faster but less precise.
This way, I would only use the slower function when a sprite needs it.
Not a real fix, but it's better than nothing. (dat ref to m. n°9)

Stef wrote: Using the 16-bit graphics mode would at least allow 16-bit transfers.

Yeah, that could work (it would still have issues with 32-bit copies, but 16-bit is faster than 8-bit).
But then I would be left with less memory and a smaller screen resolution due to the higher color depth...

This ended up being harder than I thought it would be; I never thought I would be dealing with alignment issues haha.
A new alternative to SuperVDP would be nice. (rip ob1)
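
The "two functions" idea can also be folded into one routine that checks alignment at run time. This is a host-side sketch under stated assumptions, not tested on a 32X: `draw_sprite_auto` and `FB_WIDTH` are hypothetical names, `fb` points at the first pixel rather than at the line table, and the buffers are assumed to be safely accessible as 32-bit words when the fast path is taken.

```c
#include <stdint.h>
#include <stddef.h>

#define FB_WIDTH 320  /* bytes per line in 256-color mode */

/* Copy a w*h sprite to (x, y), using 32-bit copies only when the
 * destination, the source and the width are all multiples of 4. */
static void draw_sprite_auto(volatile uint8_t *fb, const uint8_t *sprite,
                             unsigned x, unsigned y, unsigned w, unsigned h)
{
    volatile uint8_t *dst = fb + (size_t)y * FB_WIDTH + x;
    const size_t skip = FB_WIDTH - w;

    if ((((uintptr_t)dst | (uintptr_t)sprite | w) & 3u) == 0) {
        /* Fast path: FB_WIDTH is a multiple of 4, so every row stays aligned. */
        volatile uint32_t *d32 = (volatile uint32_t *)dst;
        const uint32_t *s32 = (const uint32_t *)sprite;
        while (h--) {
            unsigned n = w / 4;
            while (n--)
                *d32++ = *s32++;
            d32 += skip / 4;
        }
        return;
    }

    /* Slow path: one byte at a time, works for any x. */
    while (h--) {
        unsigned n = w;
        while (n--)
            *dst++ = *sprite++;
        dst += skip;
    }
}
```

With this shape the caller does not have to pick a function per sprite; sprites that happen to sit on a 4-pixel boundary get the faster copy and everything else falls back to bytes.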

Chilly Willy
Very interested
Posts: 2420
Joined: Fri Aug 17, 2007 9:33 pm

Re: Faster Sprites routines on the 32X?

Post by Chilly Willy » Wed Jun 28, 2017 11:51 pm

If you write BYTES to the frame buffer, writes of 0x00 will be ignored. It's in the manual, but not spelled out clearly. The overwrite buffer looks for 0 bytes during WORD writes to the overwrite buffer. So if you're writing bytes to the frame buffer, please note that 0x00 will not be written. One way around that is to clear the frame buffer after switching it. You can do that in the background by DMAing a 16-byte block of zeroes to the frame buffer using one of the SH2 DMA channels in 16-byte mode. Don't advance the source; do advance the destination. Why 16 bytes? Put the block on a 16-byte boundary (very important) and the DMA will read the whole block in one burst RAM read cycle (which the 32X SDRAM does support). Oh yeah, be sure that block of zeroes is in SDRAM, not ROM. While the DMA is busy clearing the frame buffer, you're free to do other things.
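
The write behaviour described above can be modelled on a host machine to see why a sprite drawn with byte writes looks transparent. This is only a model of the description in this post, not of the actual hardware; the function names are hypothetical.

```c
#include <stdint.h>

/* Model of a byte write to the frame buffer: a value of 0x00 is ignored
 * and the existing pixel is left untouched. */
static void fb_write_byte(uint8_t *fb, uint32_t off, uint8_t v)
{
    if (v != 0)
        fb[off] = v;
}

/* Model of a word write through the overwrite image: each byte of the
 * word is checked on its own and zero bytes are skipped. */
static void ow_write_word(uint8_t *fb, uint32_t off, uint16_t w)
{
    fb_write_byte(fb, off,     (uint8_t)(w >> 8));    /* high byte first: the SH2 is big-endian */
    fb_write_byte(fb, off + 1, (uint8_t)(w & 0xFF));
}
```

Under this model a background drawn with byte writes can never put color 0 over an old pixel, which is why the buffer has to be cleared (or the whole frame copied in by DMA) first.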

Another way around this is to keep your frame in SDRAM (on a 16-byte boundary), then DMA the frame to the frame buffer once you're done with it. Again, the DMA will run in the background transferring 16-byte blocks. This code works well for me.

Code: Select all

    // start DMA1 to draw frame into screen
    SH2_DMA_CHCR1; // read TE
    SH2_DMA_CHCR1 = 0; // clear TE

    // start DMA
    SH2_DMA_SAR1 = (uint32_t)frame;
    SH2_DMA_DAR1 = (uint32_t)&MARS_FRAMEBUFFER + 512;
#ifdef PAL_HW
    SH2_DMA_TCR1 = ((320 * 240 >> 4) << 2); // xfer count (4 * # of 16 byte units)
#else
    SH2_DMA_TCR1 = ((320 * 224 >> 4) << 2); // xfer count (4 * # of 16 byte units)
#endif
    SH2_DMA_CHCR1 = 0x5EE1; // dest incr, src incr, size 16B, auto req, cycle-steal, dual addr, intr disabled, clear TE, dma enabled
You can also do that with the TE interrupt if you want to do something the instant the DMA is done. I've posted code on using the DMA-done interrupt for double-buffered sound DMA.
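
For reference, the transfer-count expression in the snippet above works out as follows. A small sketch: `dma_count_16b` is a hypothetical helper that just repeats that arithmetic, and the `aligned` attribute is the GCC/Clang way to put a buffer on a 16-byte boundary.

```c
#include <stdint.h>

/* Same arithmetic as the TCR1 lines above: the number of 16-byte units
 * in the frame, multiplied by 4. */
static uint32_t dma_count_16b(uint32_t width, uint32_t height)
{
    return ((width * height) >> 4) << 2;
}

/* A 320x224 frame kept in RAM on a 16-byte boundary. */
static uint8_t frame[320 * 224] __attribute__((aligned(16)));
```

For a 320x224 frame that is 71680 bytes, or 4480 blocks of 16 bytes, giving a count of 17920; the 320x240 PAL case gives 19200.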
