Faster Sprites routines on the 32X?

Ask anything you want about the 32X Mushroom programming.

Moderator: BigEvilCorporation

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Postby gameblabla » Mon Feb 20, 2017 10:04 pm

I'm back again (I almost gave up on the 32X, but I just came back), and it is clear to me that the 32K-color mode is unusable
for games; it is only good for static screens.
I switched back to the 256-color mode and, after some changes to 32x_images.c, came up with the following:

Code: Select all

void drawBackground(char* spriteBuffer)
{
   vu16 *frameBuffer16 = &MARS_FRAMEBUFFER;
   unsigned short size = 35840;
   fast_memcpy(frameBuffer16 + 0x100, spriteBuffer, size);
}

void drawSprite(vu8* spriteBuffer2, vu16 x, vu16 y, unsigned short xWidth, unsigned char yWidth)
{
   //each byte represents the color for each pixel. Allows us to reverse which we can't do with words
   vu8 *frameBuffer8 = (vu8*) &MARS_FRAMEBUFFER;
   vu16 xOff;
   unsigned short bufCnt;
   //xCount must be as wide as xWidth, or sprites wider than 255 pixels never finish a row
   unsigned short xCount;
   unsigned char rowPos;
   const unsigned short lineTableEnd = 0x100;
   uint32 fbOff;
   unsigned short topre;

   //x position in pixels (one byte per pixel)
   xOff = x;
   //skip the line table: 0x100 words = 0x200 bytes
   fbOff = lineTableEnd*2;
   //y-offset: 320 bytes per line
   fbOff = fbOff + (y * 320);
   //x-offset from start of the line
   fbOff = fbOff + xOff;
   //bytes to skip from the end of one sprite row to the start of the next
   topre = 320 - xWidth;

   for (rowPos = 0; rowPos < yWidth; rowPos++)
   {
      for(xCount = 0; xCount < xWidth; xCount++)
      {   
         bufCnt = (rowPos * (xWidth)) + xCount;
         frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
         fbOff++;
      }
      fbOff = fbOff + topre;
   }
}


The funny thing I noticed is that byte 0x00 is transparent when using drawSprite, even though I'm not explicitly supporting transparency.
I looked up why it does that, and apparently Sega did it on purpose.
Either way, it works to our advantage for sprites, but it's really annoying for backgrounds.

I got pretty good speed, but I feel it could be a little faster.
ammianus came up with drawSpriteMaster here, but it does not work well for me (and it's slower anyway).

haroldoop told me to use SuperVDP, but apparently it does not work on real hardware because 68k->32X DMA does not work properly.
The function I have an issue with is drawSprite: the two nested for loops slow things down a bit.

How could it be improved?
I'm only interested in one plane plus sprites: if I want another plane, I'll just use the Genesis hardware for that.

Stef
Very interested
Posts: 2532
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Postby Stef » Tue Feb 21, 2017 1:15 pm

Calling word_8byte_copy(..) each time you want to copy 8 pixels of sprite data is definitely heavy if your compiler does not inline it (and if you use GCC, I'm not sure it does). Also, C is definitely much less efficient for low-level work like this; with pure SH2 assembly code you can probably improve performance by a factor of 3.

Sik
Very interested
Posts: 483
Joined: Thu Apr 10, 2008 3:03 pm

Re: Faster Sprites routines on the 32X?

Postby Sik » Tue Feb 21, 2017 4:14 pm

gameblabla wrote: The funny thing I noticed is that byte 0x00 is transparent when using drawSprite, even though I'm not explicitly supporting transparency.
I looked up why it does that, and apparently Sega did it on purpose.
Either way, it works to our advantage for sprites, but it's really annoying for backgrounds.

The feature is optional, depends on which address you're using to access the framebuffer (0x24000000 doesn't do it, 0x24020000 does, if I'm reading the manual correctly).
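
To make the behaviour under discussion concrete, here is a minimal host-side sketch (not 32X code) of what writing through the overwrite window is described as doing: a pixel value of 0 leaves the destination untouched. The function names overwrite_store8 and overwrite_copy_row are made up for illustration, and the model assumes the filtering applies per byte; on real hardware the VDP would do this itself, so check the manual before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model (hypothetical helper): a store of pixel value 0
 * is dropped, any other value replaces the destination byte. */
static void overwrite_store8(uint8_t *dst, uint8_t pixel)
{
    if (pixel != 0)
        *dst = pixel;
}

/* Copy one sprite row through the modelled overwrite window.
 * Zero bytes in src act as transparent pixels. */
static void overwrite_copy_row(uint8_t *dst, const uint8_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++)
        overwrite_store8(&dst[i], src[i]);
}
```

This also shows why the same behaviour is annoying for backgrounds: a background pixel that really should be colour 0 can never replace what is already there.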

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Postby gameblabla » Tue Feb 21, 2017 11:45 pm

Sik wrote: The feature is optional, depends on which address you're using to access the framebuffer (0x24000000 doesn't do it, 0x24020000 does, if I'm reading the manual correctly).

Hmm, you're right, but oddly enough I was using 0x24000000, not 0x24020000, and my sprite was still transparent.
Strange.

Anyway, when I changed spriteBuffer2 from vu8 to vu16, I noticed a huge speed-up.
After fixing it, it ran twice as fast!

Code: Select all

void drawSprite(vu16* spriteBuffer2, vu16 x, vu16 y, unsigned short xWidth, unsigned char yWidth)
{
   //each 16-bit word holds two 8-bit pixels, so one store writes two pixels at once
   vu16 *frameBuffer8 = (vu16*) &MARS_OVERWRITE_IMG;
   vu16 xOff;
   unsigned short bufCnt;
   unsigned char rowPos,xCount;
   const unsigned short lineTableEnd = 0x100;
   uint32 fbOff = lineTableEnd;
   unsigned short topre;
   unsigned short width_halved;

   //offset the number of pixels in each line to start to draw the image
   xOff = x;
   fbOff = lineTableEnd;
   //y-offset for top of 320 to correct line in framebuffer
   fbOff = fbOff + (y * 160);
   //x-offset from start of first line
   fbOff = fbOff + xOff;
   bufCnt = 0;
   xCount = 0;
   rowPos = 0;
   width_halved = xWidth/2;
   topre = (160 - (width_halved + xOff)) + xOff;

   for (rowPos = 0; rowPos < yWidth; rowPos++)
   {
      for(xCount = 0; xCount < width_halved; xCount++)
      {   
         bufCnt = (rowPos * width_halved) + xCount;
         frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
         fbOff++;
      }
      fbOff = fbOff + topre;
   }
}

I guess I was hitting a bottleneck somewhere? It seems to be because the framebuffer is 16 bits wide.
Then I tried to implement it using fast_memcpy:

Code: Select all

for (rowPos = 0; rowPos < yWidth; rowPos++)
{
   for(xCount = 0; xCount < width_halved; xCount+=8)
   {   
      bufCnt = (rowPos * width_halved) + xCount;
      fast_memcpy(frameBuffer8 + fbOff, spriteBuffer2 + bufCnt, 4);
      //word_8byte_copy(frameBuffer8 + fbOff, spriteBuffer2 + bufCnt, 2);
      //frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
      fbOff+=8;
   }
   fbOff = fbOff + topre;
}


I doubt it's inlined, because even with "inline" applied to the functions, it does not seem to change anything.
Also, I saw no noticeable difference between fast_memcpy and word_8byte_copy here.
Maybe I'm doing it wrong? Or maybe I should release the full source on GitHub :P
I wish I could do it in SH2 ASM, but I'm not really interested in learning assembly for any CPU.
I think it's plenty fast for now. Thanks for the help.

One last thing though: has anyone managed to compile ammianus's sixpack?
https://github.com/ammianus/sixpack-ammianus
The official sixpack reorders and strips the palette, even if the images use the same internal palette.
He's using Eclipse on Windows; that sucks, as I'm using Ubuntu on my main PC and I don't want to use Windows ever again.
In fact, I wonder how you guys managed to get around this?

Stef
Very interested
Posts: 2532
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Postby Stef » Wed Feb 22, 2017 10:25 pm

If inlining does not work, at least try replacing your fast_memcpy(..) call with the inlined fast_memcpy(..) code.
I can't tell whether you are copying 8 bytes or 16 bytes per memcpy(..) call; it looks like each pixel is 16 bits, so 16 bytes?
In that case you should try to replace:

Code: Select all

for (rowPos = 0; rowPos < yWidth; rowPos++)
{
   for(xCount = 0; xCount < width_halved; xCount++)
   {   
      bufCnt = (rowPos * width_halved) + xCount;
      frameBuffer8[fbOff] = spriteBuffer2[bufCnt];
      fbOff++;
   }
   fbOff = fbOff + topre;
}


with this:

Code: Select all

int srcOff = 0;
int row = yWidth;

while (row--)
{
   vu32* src = (vu32*) (&spriteBuffer2[srcOff]);
   vu32* dst = (vu32*) (&frameBuffer8[fbOff]);
   int x = width_halved / 2;
   
   while(x--) *dst++ = *src++;
   
   srcOff += width_halved;
   fbOff += topre;
}


Should be faster...

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Postby gameblabla » Thu Feb 23, 2017 4:07 am

I could be wrong, but I think you forgot to add fbOff++; in the while(x--) loop.
I couldn't get it to work as is, so I had to hack it a bit:

Code: Select all

inline void drawSprite(vu32* spriteBuffer, vu16 x, vu16 y, unsigned short xWidth, unsigned char yWidth)
{
   vu32 *frameBuffer8 = (vu32*) &MARS_OVERWRITE_IMG;
   vu16 xOff;
   unsigned short bufCnt;
   const unsigned short lineTableEnd = 0x100;
   uint32 fbOff = lineTableEnd;
   unsigned short topre;
   unsigned short width_halved;

   width_halved = xWidth/4;
   //offset the number of pixels in each line to start to draw the image
   xOff = x;
   fbOff = lineTableEnd;
   //y-offset for top of 320 to correct line in framebuffer
   fbOff = fbOff + ((y * 320));
   //x-offset from start of first line
   fbOff = ((fbOff + xOff) - 128);
   bufCnt = 0;
   topre = (80 - (width_halved + xOff)) + xOff;
   
   unsigned long srcOff = 0;
   short row = yWidth;

   while (row--)
   {
      vu32* src = (vu32*) (&spriteBuffer[srcOff]);
      vu32* dst = (vu32*) (&frameBuffer8[fbOff]);
      short x = width_halved;
      
      while(x--)
      {
         *dst++ = *src++;
         fbOff+=1;
      }
      
      srcOff += width_halved;
      fbOff += topre;
   }
}

It's not ideal, but I can definitely see an improvement in speed, so many thanks Stef!
I was doing a memcpy for every 8 pixels thinking it would improve the speed somewhat, so I was wrong...
Looking at fast_memcpy, it seems that doing so would actually slow it down...
Anyway, thanks guys.
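
The point about call overhead can be sketched as one bulk copy per sprite row instead of one call per 8 pixels. This is only a host-side illustration: the standard memcpy stands in for fast_memcpy (whose exact signature and length units are not shown in this thread), and blit_rows and SCREEN_WIDTH are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_WIDTH 320  /* bytes per line in 256-color mode */

/* Copy a sprite with a single bulk copy per row.
 * dst points at the top-left destination pixel; sprite rows are
 * stored back to back, width bytes each. */
static void blit_rows(uint8_t *dst, const uint8_t *sprite,
                      unsigned width, unsigned height)
{
    assert(width <= SCREEN_WIDTH);
    for (unsigned row = 0; row < height; row++) {
        memcpy(dst, sprite, width);  /* stand-in for a fast row copy */
        dst    += SCREEN_WIDTH;      /* next framebuffer line */
        sprite += width;             /* next sprite row */
    }
}
```

For a 32-pixel-wide sprite this makes one call per row rather than four, so the fixed cost of each call is paid less often.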

Stef
Very interested
Posts: 2532
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Postby Stef » Thu Feb 23, 2017 12:02 pm

gameblabla wrote:I could be wrong but i think you forgot to add fbOff++; in the while(x--) loop.


Oh yeah, stupid typo. In fact, seeing your code, you can improve it a bit further:

Code: Select all

inline void drawSprite(u32* spriteBuffer, u16 x, u16 y, u16 xWidth, u16 yWidth)
{
   vu32 *frameBuffer = (vu32*) &MARS_OVERWRITE_IMG;
   // dst frame buffer pointer (offset by X and Y; divide by 4 because this is a 32-bit pointer)
   vu32* dst = &frameBuffer[(0x100 + (y * 320) + (x - 0x80)) / 4];
   // src sprite pointer
   u32* src = spriteBuffer;

   const u16 qwidth = xWidth / 4;
   const int dstStep = 80 - qwidth;
   
   u16 row = yWidth;
   
   while (row--)
   {
      u16 col = qwidth;
     
      while(col--) *dst++ = *src++;
     
      dst += dstStep;
   }
}

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Postby gameblabla » Fri Feb 24, 2017 12:07 am

Hmm... Sorry for bothering you, but your new code does not work well for me: the sprites move correctly vertically but not horizontally.
Here's the source:
https://github.com/gameblabla/32x-playground
And the ROM :
https://github.com/gameblabla/32x-playground/raw/master/Example.32x

The sprites move in 2-pixel steps instead of 1. Not sure why.

Stef
Very interested
Posts: 2532
Joined: Thu Nov 30, 2006 9:46 pm
Location: France - Sevres

Re: Faster Sprites routines on the 32X?

Postby Stef » Fri Feb 24, 2017 10:22 am

Yeah, I thought about the problem just after posting the code.
The problem is that a sprite can be displayed at any X offset, so possibly at an odd address.
With a 32-bit pointer you force the address to be aligned to 4 bytes, which makes the sprite move in 4-pixel steps along X.
So you definitely need a byte copy (as spriteBuffer is not necessarily aligned with the frame buffer); just change the code like this:

Code: Select all

inline void drawSprite(u8* spriteBuffer, u16 x, u16 y, u16 xWidth, u16 yWidth)
{
   vu8 *frameBuffer = (vu8*) &MARS_OVERWRITE_IMG;
   // dst frame buffer pointer (offset by X and Y, in bytes)
   vu8* dst = &frameBuffer[0x100 + (y * 320) + (x + 256)];
   // src sprite pointer
   u8* src = spriteBuffer;

   const u16 xw = xWidth;
   const int dstStep = 320 - xw;
   
   u16 row = yWidth;
   
   while (row--)
   {
      u16 col = xw;
     
      while(col--) *dst++ = *src++;
     
      dst += dstStep;
   }
}


Unfortunately that will be much slower then :-/
Using the 16-bit graphics mode would at least allow 16-bit transfers.
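
The alignment effect can be checked with plain arithmetic. The sketch below is host-side C, not 32X code; it assumes the layout used in the code in this thread (a 0x200-byte line table followed by 320-byte lines), and the helper names are hypothetical. It computes the byte offset of a pixel and the offset a 1-, 2- or 4-byte store can actually start at once the low address bits are dropped.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_TABLE_BYTES 0x200u /* assumed: 0x100 words of line table */
#define SCREEN_WIDTH     320u   /* bytes per line in 256-color mode */

/* Byte offset of pixel (x, y) from the start of the framebuffer. */
static uint32_t pixel_byte_offset(uint32_t x, uint32_t y)
{
    return LINE_TABLE_BYTES + y * SCREEN_WIDTH + x;
}

/* Offset reachable when storing units of access_size bytes (1, 2 or 4):
 * the pixel offset rounded down to that alignment. */
static uint32_t aligned_offset(uint32_t x, uint32_t y, uint32_t access_size)
{
    assert(access_size == 1 || access_size == 2 || access_size == 4);
    return pixel_byte_offset(x, y) & ~(access_size - 1u);
}
```

With 4-byte stores, x = 0, 1, 2 and 3 all land on the same offset, so the sprite only moves every fourth pixel; with 2-byte stores it moves every second pixel, and only byte stores reach every x.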

gameblabla
Interested
Posts: 13
Joined: Thu Sep 26, 2013 12:41 am

Re: Faster Sprites routines on the 32X?

Postby gameblabla » Sat Feb 25, 2017 3:26 am

Stef wrote: Yeah, I thought about the problem just after posting the code.
The problem is that a sprite can be displayed at any X offset, so possibly at an odd address.
With a 32-bit pointer you force the address to be aligned to 4 bytes, which makes the sprite move in 4-pixel steps along X.
So you definitely need a byte copy (as spriteBuffer is not necessarily aligned with the frame buffer); just change the code like this:
...
Unfortunately that will be much slower then :-/

So copying byte by byte is the only way to make sure the sprite is not misplaced because of alignment?
Well, that sucks.
I guess the only way is to have 2 functions: one that copies byte by byte, and another that is faster but less precise.
This way, I would only use the slower function when a sprite needs it.
Not a real fix, but it's better than nothing. (dat ref to m. n°9)
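
A rough sketch of that two-function idea, written as host-side C rather than real 32X code: a dispatcher uses 32-bit copies when the destination, the source and the width are all multiples of 4, and falls back to a byte copy otherwise. All names here (blit_bytes, blit_longs, draw_sprite) are invented for the example, fb is assumed to point at the first visible line (line table already skipped), and both buffers are assumed to be backed by 4-byte-aligned storage.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_WIDTH 320u  /* bytes per line in 256-color mode */

/* Slow path: byte copy, works at any x and any width. */
static void blit_bytes(uint8_t *dst, const uint8_t *src, unsigned w, unsigned h)
{
    while (h--) {
        for (unsigned i = 0; i < w; i++)
            dst[i] = src[i];
        dst += SCREEN_WIDTH;
        src += w;
    }
}

/* Fast path: 32-bit copy. Requires 4-byte aligned pointers and a width
 * that is a multiple of 4 (SCREEN_WIDTH already is). */
static void blit_longs(uint8_t *dst, const uint8_t *src, unsigned w, unsigned h)
{
    assert((((uintptr_t)dst | (uintptr_t)src | w) & 3u) == 0);
    while (h--) {
        uint32_t *d = (uint32_t *)dst;
        const uint32_t *s = (const uint32_t *)src;
        for (unsigned i = 0; i < w / 4; i++)
            d[i] = s[i];
        dst += SCREEN_WIDTH;
        src += w;
    }
}

/* Pick the fastest path that is safe for this position.
 * Returns 1 if the 32-bit path was used, 0 for the byte path. */
static int draw_sprite(uint8_t *fb, const uint8_t *sprite,
                       unsigned x, unsigned y, unsigned w, unsigned h)
{
    uint8_t *dst = fb + y * SCREEN_WIDTH + x;
    if ((((uintptr_t)dst | (uintptr_t)sprite | w) & 3u) == 0) {
        blit_longs(dst, sprite, w, h);
        return 1;
    }
    blit_bytes(dst, sprite, w, h);
    return 0;
}
```

With this arrangement a sprite sitting at an x that is a multiple of 4 takes the fast path automatically, and only the in-between positions pay for the byte copy.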

Stef wrote: Using the 16-bit graphics mode would at least allow 16-bit transfers.

Yeah, that could work (32-bit copies would still have the issue, but 16-bit is faster than 8-bit).
But then I would be left with less memory and a smaller screen resolution due to the higher color depth...

This ended up being harder than I thought it would be; I never thought I would have to deal with alignment issues, haha.
A new alternative to SuperVDP would be nice. (rip ob1)

