GFX_WRITE_VRAM_ADDR is expensive? (was: code benchmarking)

djcouchycouch · Post by **djcouchycouch** » Thu Mar 15, 2012 2:41 am

Hi,

I've been using getTick()/getSubTick() as suggested in the Code Benchmarking techniques thread, running a thousand iterations and measuring the time it takes. For example, I've been benchmarking a loop that sets a column of tiles at the edge of the screen when the screen scrolls. One surprising thing I noticed is that the calls to GFX_WRITE_VRAM_ADDR and TILE_ATTR_FULL to set the tiles are very expensive.

Running my loop 1000 times and using getSubTick(), I get a benchmark time of 61440 ticks. I've managed to bring it down from around 67000 by using some of the C optimization tricks that were suggested in another thread (thanks for that!). On a whim I commented out the calls to GFX_WRITE_VRAM_ADDR and TILE_ATTR_FULL and the number of ticks dramatically dropped to 6210. I was like "wow!". No matter what else I could optimize on that loop, those calls would still slow it down by a lot.

Are those calls naturally expensive or am I seeing something out of the ordinary?

What can I do to minimize the performance impact?

One idea would be to use DMA, but since I'm writing a column of tiles I'm not sure if that'll work right. But it might if I was writing a row.

More generally, what are the typical techniques on the Megadrive for scrolling levels that are bigger than the available RAM? My background planes are 64x64 but my foreground is 256x64. Obviously, my current techniques are very slow.

Thanks!
DJCC

Shiru · Post by **Shiru** » Thu Mar 15, 2012 2:56 am

These aren't calls, actually; they are macros. As you can see, there is nothing in them that could be that slow:

Code: Select all

#define GFX_WRITE_VRAM_ADDR(adr)    ((0x4000 + ((adr) & 0x3FFF)) << 16) + (((adr) >> 14) | 0x00)

#define TILE_ATTR(pal, pri, flipV, flipH)   (((flipH) << 11) + ((flipV) << 12) + ((pal) << 13) + ((pri) << 15))

So probably you have some other problem?
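To see what the first macro actually produces, here is a small host-side sketch (plain C for a PC, not SGDK code) that evaluates the same expression as a function. The name `vram_write_cmd` is made up for illustration; the arithmetic is copied from the macro above.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as the GFX_WRITE_VRAM_ADDR macro quoted above, written as
 * a function so it can be checked on a PC. The name is hypothetical. */
static uint32_t vram_write_cmd(uint32_t adr)
{
    assert(adr <= 0xFFFFu);   /* VRAM addresses are 16 bits wide */
    /* High word: 0x4000 plus address bits 13..0.
     * Low word: address bits 15..14. */
    return ((0x4000u + (adr & 0x3FFFu)) << 16) + (adr >> 14);
}
```

For example, `vram_write_cmd(0xC000)` gives `0x40000003`. With a constant argument the compiler can fold the whole expression away; with a variable argument inside a loop, the shifts and masks are paid on every iteration.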

Stef · Post by **Stef** » Thu Mar 15, 2012 9:27 am

Shiru wrote: These aren't calls, actually; they are macros. As you can see, there is nothing in them that could be that slow:
Code: Select all
#define GFX_WRITE_VRAM_ADDR(adr)    ((0x4000 + ((adr) & 0x3FFF)) << 16) + (((adr) >> 14) | 0x00)

#define TILE_ATTR(pal, pri, flipV, flipH)   (((flipH) << 11) + ((flipV) << 12) + ((pal) << 13) + ((pri) << 15))
So probably you have some other problem?

In the latest SGDK I replaced the GFX_WRITE_VRAM_ADDR macro:

Code: Select all

#define GFX_WRITE_VRAM_ADDR(adr)    vramwrite_tab[adr]

So if you use GFX_WRITE_VRAM_ADDR a lot, it can speed up your code a bit. But that lookup table costs 256 KB of ROM space, and many people complained about the minimum ROM size, so I will make it optional in the next version (as it was initially).

Anyway, as Shiru said, I think there is something wrong in your code... If those macros slow down your code this way, it is probably because you evaluate GFX_WRITE_VRAM_ADDR and TILE_ATTR for each single tilemap entry write, which is definitely not the way you have to update your tilemap.
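On the lookup-table trade-off: the 256 KB figure follows from storing one 32-bit command for each possible 16-bit VRAM address. The sketch below builds such a table at run time on a PC, purely to illustrate the size and contents. SGDK's `vramwrite_tab` is a ROM table, and the helper names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VRAM_ADDR_COUNT 0x10000u   /* one entry per 16-bit VRAM address */

/* Build a table equivalent in spirit to vramwrite_tab. Hypothetical helper:
 * the real table sits in ROM rather than being built at run time. */
static uint32_t *build_vram_write_table(void)
{
    uint32_t *tab = malloc(VRAM_ADDR_COUNT * sizeof *tab);
    assert(tab != NULL);
    for (uint32_t adr = 0; adr < VRAM_ADDR_COUNT; adr++)
        tab[adr] = ((0x4000u + (adr & 0x3FFFu)) << 16) + (adr >> 14);
    return tab;
}

/* Table size in bytes: 65536 entries * 4 bytes = 262144 bytes = 256 KB. */
static size_t vram_write_table_bytes(void)
{
    return VRAM_ADDR_COUNT * sizeof(uint32_t);
}
```

The table replaces the shifts and masks with a single indexed load, which is the speed-for-ROM exchange described above.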

djcouchycouch · Post by **djcouchycouch** » Thu Mar 15, 2012 11:01 am

In the latest SGDK I replaced the GFX_WRITE_VRAM_ADDR macro:
Code: Select all
#define GFX_WRITE_VRAM_ADDR(adr)    vramwrite_tab[adr]

Yes, that's the version I have. It is noticeably faster than the version Shiru mentioned.

Anyway, as Shiru said, I think there is something wrong in your code... If those macros slow down your code this way, it is probably because you evaluate GFX_WRITE_VRAM_ADDR and TILE_ATTR for each single tilemap entry write,

That's exactly what my loop is doing, sadly

which is definitely not the way you have to update your tilemap.

In that case, what would be the best way to do it? As I mentioned earlier, if I were writing a row to the tile map, I think I could use DMA, but in this case I'm writing in a column. So every tile map index I write to needs to be offset by the width of the plane after every write. If there's a much better way of doing it, I'm all ears!

Thanks!
DJCC

TmEE co.(TM) · Post by **TmEE co.(TM)** » Thu Mar 15, 2012 12:23 pm

auto increment register does wonders

djcouchycouch · Post by **djcouchycouch** » Thu Mar 15, 2012 1:32 pm

TmEE co.(TM) wrote:auto increment register does wonders

How would I use that, like this?

// pseudocode-ish code ahead.

// pre-loop setup
plctrl = (u32 *) GFX_CTRL_PORT;
pwdata = (u16 *) GFX_DATA_PORT;

// set the auto-increment before the address: a register write issued after
// the address setup would disturb it. The increment is counted in bytes and
// each tilemap entry is 2 bytes, so one row down is plane width * 2.
VDP_setAutoInc(VDP_getPlanWidth() * 2);

addr = <some location in the plane I want to write to>
*plctrl = GFX_WRITE_VRAM_ADDR(addr); // only calling this once.

// loop!
while (not done)
{
    tilenumber = LookUpTileNumber(x, y);
    *pwdata = tilenumber; // by doing this the address also automatically increments by the value I specified, going to the next tile address in the column.
}

Would that be right? I'm at work, so I'll test it when I get home tonight.
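The idea can be tried out on a PC with a small stand-in for the VDP write path. This is only a sketch: `FakeVdp`, `fake_data_write` and `write_column` are invented for illustration, and the 64x64 plane size is an assumption taken from the first post. It models the two things that matter here: the address is set once, and every data write advances it by the auto-increment, which is counted in bytes.

```c
#include <assert.h>
#include <stdint.h>

#define PLANE_W 64   /* plane width in tiles (assumed) */
#define PLANE_H 64   /* plane height in tiles (assumed) */

/* Minimal stand-in for the VDP: a tilemap, a current address and an
 * auto-increment, both counted in bytes. */
typedef struct {
    uint16_t vram[PLANE_W * PLANE_H];
    uint32_t addr;       /* byte address of the next write */
    uint32_t auto_inc;   /* bytes added after each data write */
} FakeVdp;

static void fake_data_write(FakeVdp *v, uint16_t value)
{
    assert(v->addr / 2 < (uint32_t)(PLANE_W * PLANE_H));
    v->vram[v->addr / 2] = value;   /* each tilemap entry is 2 bytes */
    v->addr += v->auto_inc;
}

/* Write `count` entries down one column with a single address setup. */
static void write_column(FakeVdp *v, int x, int y,
                         const uint16_t *tiles, int count)
{
    v->auto_inc = PLANE_W * 2;                   /* one full row, in bytes */
    v->addr = (uint32_t)(x + y * PLANE_W) * 2;   /* set once, before the loop */
    for (int i = 0; i < count; i++)
        fake_data_write(v, tiles[i]);
}
```

Writing three tiles at column 5 starting from row 2 lands them at rows 2, 3 and 4 of that column and leaves the neighbouring columns untouched.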

Stef · Post by **Stef** » Thu Mar 15, 2012 4:46 pm

Yep that's the idea.
Usually you set the VRAM address once, then you write your data to the VDP data port.
Also, you should cache the tile flags instead of recomputing them on each write...

By the way, can't you use the following method ?

Code: Select all

void VDP_setTileMapRect(u16 plan, const u16 *data, u16 index, u16 flags, u16 x, u16 y, u16 w, u16 h);

This one is taken from the latest SGDK version, but you have a similar one (the famous method that everyone reports as buggy :p).
With this method you can update a rectangular region of your tilemap: just prepare your data in the "data" buffer and call the method.
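For a one-tile-wide rectangle, "preparing the data" means gathering the column from the wide level map into a contiguous buffer first. Below is a sketch of that gathering step in plain C. It assumes a 256-tile-wide map of 16-bit entries, as described earlier in the thread; `gather_column` and `MAP_W` are illustrative names, not SGDK API.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_W 256   /* level map width in tiles, as stated in the thread */

/* Copy `h` entries of column `x`, starting at row `y`, into `out`.
 * The source pointer advances by one full map row per entry. */
static void gather_column(const uint16_t *map, int x, int y, int h,
                          uint16_t *out)
{
    assert(x >= 0 && x < MAP_W && y >= 0 && h >= 0);
    const uint16_t *src = map + (y * MAP_W + x);
    for (int i = 0; i < h; i++) {
        out[i] = *src;
        src += MAP_W;
    }
}
```

The resulting buffer could then be passed as `data` with `w = 1` and `h` set to the column height. How `index` and `flags` are combined with each entry is not shown here and should be checked against the SGDK source.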

djcouchycouch · Post by **djcouchycouch** » Thu Mar 15, 2012 5:06 pm

Stef wrote: By the way, can't you use the following method ?
Code: Select all
void VDP_setTileMapRect(u16 plan, const u16 *data, u16 index, u16 flags, u16 x, u16 y, u16 w, u16 h);

I probably could. I didn't think of it at the time. I'll try it out later today.

djcouchycouch · Post by **djcouchycouch** » Fri Mar 16, 2012 1:23 am

I played around some more with the code, following the pseudocode I wrote earlier. It definitely helps with performance. Again, running 1000 times, the benchmark number drops to around 42240 ticks.

But by accident I discovered something weird.

So I'm caching the tile flag like this:

u16 tileAttr = TILE_ATTR_FULL(PAL3, 0, 0, 0, 0);

And then in my loop I set the data like so

do
{

u16 tileNumber = <look up for tile number>

*pwdata = tileAttr + tileNumber;
} while (not done)

Like that my benchmark number is around 42240 as above. But if I remove the tileNumber when setting pwdata, meaning I'm only really setting the tileAttr, the benchmark number dramatically drops to 9220.

*pwdata = tileAttr; // benchmark drops to 9220!

If I use just tileNumber, benchmark number is around 39000.

*pwdata = tileNumber; // ~39000.

Both variables are u16 and declared within or very close to the loop. I have no idea what is causing the difference.

Any ideas?

Here's the code I have. I apologize for the messiness. I've been going over and over it to find out how to make it run faster.

Code: Select all

// starttilex is where the column is being drawn on the x axis
// starttiley is the start of the column of tiles I want to draw
// endtiley is the end

        if (Player.speedx < 0) // the player is moving towards the left.
        {
            u32 firstTick = getSubTick();
            VDP_setAutoInc(VDP_getPlanWidth() * 2); // increment is in bytes; each tilemap entry is 2 bytes
            u16 tileMask = TILE_ATTR_FULL(PAL3, 0, 0, 0, 0);



            int counter = 0;

            while (counter < 1000)
            {

                loop = endtiley - starttiley;
                const u32 addr = APLAN + ((starttilex + (starttiley << 6)) << 1);

                *plctrl = GFX_WRITE_VRAM_ADDR(addr);

                do
                {
                   // actual map is 256 tiles wide, hence the shift left of 8
                    u16 tileNumber = foreground_layer[((loop + starttiley) << 8) + starttilex] + FOREGROUND_TILE_STARTINDEX; 
                    *pwdata = (tileMask + tileNumber); // weirdly slow!
                    ////*pwdata = tileNumber; // also slow!
                    ////*pwdata = tileMask; // not slow!
                    --loop;
                } while (loop > 0);


                 counter++;
            }
            counter = 0;

            u32 secondTick = getSubTick();

            uintToStr(secondTick - firstTick, blah, 16);
            VDP_drawText(blah, 5, 6);
         }

Thanks!
DJCC

Shiru · Post by **Shiru** » Fri Mar 16, 2012 2:37 am

Isn't the GCC optimizer smart enough to detect that the value is never used, so it does not even try to calculate it? That is, your u16 tileNumber ... line does not execute at all when you comment out its use.
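One way to keep a benchmark honest when this happens is to route the computed value into something the compiler cannot discard, such as a `volatile` variable. A minimal sketch in plain C; `sink` and `run_lookups` are illustrative names, and the table contents are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Stores to a volatile object count as observable behaviour, so the
 * compiler has to perform each one, along with the lookup feeding it. */
static volatile uint16_t sink;

/* Run the lookup `count` times, storing every result in the sink.
 * Returns the last value written so a caller can sanity-check it. */
static uint16_t run_lookups(const uint8_t *table, int count, uint16_t base)
{
    assert(count > 0);
    for (int i = 0; i < count; i++)
        sink = (uint16_t)(table[i] + base);
    return sink;
}
```

In the loop from the previous post, the data port write already plays that role for whichever value is written. The variants that never write tileNumber simply give the compiler no reason to compute it, so they measure a different, cheaper loop.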

Gigasoft · Post by **Gigasoft** » Fri Mar 16, 2012 8:59 am

Yes, it is. However, you should always check the generated assembly code when using GCC. GCC is a very poor compiler, and you need to help it a lot. By the way, you are drawing your tile map upside down.

If foreground_layer is a byte array, and FOREGROUND_TILE_STARTINDEX is a multiple of 256, then you could rewrite the loop like this:

Code: Select all

// Replace the low byte of h with l using a single move.b; the high byte is kept.
static inline u16 WHL(u8 l,u16 h)
{
	u16 res;
	__asm (
		"move.b %2,%0" : "=d"(res) : "0"(h), "g"(l)
	);
	return res;
}

...

u8*p=foreground_layer + WHL(starttilex,starttiley<<8);
tileMask+=FOREGROUND_TILE_STARTINDEX;
do {
    *pwdata = tileMask = WHL(*p,tileMask); // or *pwdata = *p+tileMask, if FOREGROUND_TILE_STARTINDEX is not a multiple of 256 (slower)
    p+=256;
} while(--loop);
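For readers unfamiliar with the move.b trick: it overwrites only the low byte of a 16-bit value. A portable C equivalent is sketched below to make the semantics and the "multiple of 256" condition explicit. It is probably slower than the inline assembly on a 68000, and `whl_portable` and `tile_word` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the high byte of h, replace the low byte with l. */
static uint16_t whl_portable(uint8_t l, uint16_t h)
{
    return (uint16_t)((h & 0xFF00u) | l);
}

/* Build a tilemap word from an 8-bit tile number and a base value.
 * The byte merge equals base + tile only when the base has a zero low
 * byte, i.e. when it is a multiple of 256. */
static uint16_t tile_word(uint8_t tile, uint16_t base)
{
    assert((base & 0x00FFu) == 0);
    return whl_portable(tile, base);
}
```

With a base of 0x6100 and tile 0x05 the merge gives 0x6105, the same as the addition. With a base of 0x6180 the merge would give 0x6105 while the addition gives 0x6185, which is why the plain `*p+tileMask` form is needed in that case.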

Stef · Post by **Stef** » Fri Mar 16, 2012 9:05 am

Exactly, GCC detects that you never use tileNumber, so it just removes the complete calculation of the variable.

The slow part of your code is this :

Code: Select all

u16 tileNumber = foreground_layer[((loop + starttiley) << 8) + starttilex] + FOREGROUND_TILE_STARTINDEX;

Indeed, compared to the rest, it is heavier.
As the GCC compiler is not very good, you should use a pointer instead to make your code more efficient.

Code: Select all

  ...
  const u16 tileBaseValue = TILE_ATTR_FULL(PAL3, 0, 0, 0, 0) + FOREGROUND_TILE_STARTINDEX;

  int counter = 0;

  while (counter < 1000)
  {
    loop = endtiley - starttiley;
    const u32 addr = APLAN + ((starttilex + (starttiley << 6)) << 1);
    u16* src = &foreground_layer[(starttiley << 8) + starttilex];

    *plctrl = GFX_WRITE_VRAM_ADDR(addr);

    while(loop--)
    {
      *pwdata = tileBaseValue + *src;
      // pass to next line
      src += 256;
    }

    counter++;
  }
  ...

This code should be faster...

Edit : seems Gigasoft was faster to reply :p
By the way, the inline keyword does not work for me, and this is really annoying when it comes to writing small methods which need to be inlined. Has someone already managed to get it working with the GCC m68k-elf target?

djcouchycouch · Post by **djcouchycouch** » Fri Mar 16, 2012 1:30 pm

Shiru wrote: Isn't the GCC optimizer smart enough to detect that the value is never used, so it does not even try to calculate it? That is, your u16 tileNumber ... line does not execute at all when you comment out its use.

Yeah. I realized that the compiler was stripping that out right after posting. Should've thought of that much faster. Duh.

Gigasoft wrote: By the way, you are drawing your tile map upside down.

The function stopped being proper a long time ago.

It's now a testbed for developing the right coding techniques. I've never really written code that close to the metal before.

Stef wrote: As the GCC compiler is not very good, you should use a pointer instead to make your code more efficient.

I'll try out your suggestion this weekend.

I'll also try out just unrolling the loop. I think it might end up always being the same length anyway.
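A sketch of what a four-times unrolled column copy could look like, in plain C for a PC. The names are illustrative, the height is assumed to be a multiple of 4, and `dst` stands in for the VDP data port. Whether this actually helps on hardware would need measuring.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_W 256   /* level map width in tiles, as stated in the thread */

/* Copy a column with the loop body unrolled four times. On hardware every
 * store would go to the same volatile port address instead of an
 * advancing destination pointer. */
static void copy_column_unrolled(const uint16_t *src, uint16_t *dst,
                                 int height, uint16_t base)
{
    assert(height > 0 && height % 4 == 0);   /* sketch assumes a multiple of 4 */
    for (int i = height / 4; i > 0; i--) {
        dst[0] = (uint16_t)(base + src[0 * MAP_W]);
        dst[1] = (uint16_t)(base + src[1 * MAP_W]);
        dst[2] = (uint16_t)(base + src[2 * MAP_W]);
        dst[3] = (uint16_t)(base + src[3 * MAP_W]);
        src += 4 * MAP_W;
        dst += 4;
    }
}
```

The loop counter and branch are now paid once per four tiles instead of once per tile, at the cost of larger code.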

Thanks!
DJCC

Chilly Willy · Post by **Chilly Willy** » Fri Mar 16, 2012 5:59 pm

Stef wrote: By the way, the inline keyword does not work for me, and this is really annoying when it comes to writing small methods which need to be inlined. Has someone already managed to get it working with the GCC m68k-elf target?

My tests on 4.6.2 show that gcc inlines the function as long as the opt level is not 0 and you don't use -fno-inline.

Stef · Post by **Stef** » Fri Mar 16, 2012 8:35 pm

Chilly Willy wrote:
Stef wrote: By the way, the inline keyword does not work for me, and this is really annoying when it comes to writing small methods which need to be inlined. Has someone already managed to get it working with the GCC m68k-elf target?
My tests on 4.6.2 show that gcc inlines the function as long as the opt level is not 0 and you don't use -fno-inline.

Indeed, inline does work with my GCC 4.1.1 build but not with the 3.4.6 version.

I ran several benchmarks and unfortunately it appears that GCC 4.1.1 (and I guess newer versions) is definitely slower than GCC 3.4.6 :-/
That's basically why I chose 3.4.6 for SGDK (it also takes less space).

Here are the results for those interested:

GCC 3.4.6

O1 :
cube 3D : 11 FPS minimum
partic : 15 FPS with 610 particles, 11 FPS with 999 particles

O2 :
cube 3D : 10 FPS minimum
partic : start at 12 FPS, 10 FPS with 999

O3 :
cube 3D : 10 FPS minimum
partic : start at 12 FPS, 10 FPS with 999

GCC 4.1.1

O1 :
cube 3D : 10 FPS minimum
partic : 15 FPS with 450 particles, 9 FPS with 999

O2 :
cube 3D : 9 FPS minimum
partic : start at 12 FPS, 9 FPS with 999

O3 :
cube 3D : 9 FPS minimum
partic : start at 12 FPS, 9 FPS with 999

Maybe I biased SGDK to produce better code with the 3.4.6 version...
Does someone have an interesting piece of code we can benchmark with various versions of GCC, to see if older versions are really better for m68k?