GFX_WRITE_VRAM_ADDR is expensive? (was: code benchmarking)
Moderator: Stef
-
- Very interested
- Posts: 710
- Joined: Sat Feb 18, 2012 2:44 am
Hi,
I've been using getTick()/getSubTick() as suggested in the Code Benchmarking techniques thread, running a thousand iterations and measuring the time it takes. For example, I've been benchmarking a loop that sets a column of tiles at the edge of the screen when the screen scrolls. One surprising thing I noticed is that the calls to GFX_WRITE_VRAM_ADDR and TILE_ATTR_FULL to set the tiles are very expensive.
Running my loop 1000 times and using getSubTick(), I get a benchmark time of 61440 ticks. I've managed to bring it down from around 67000 by using some of the C optimization tricks that were suggested in another thread (thanks for that!). On a whim I commented out the calls to GFX_WRITE_VRAM_ADDR and TILE_ATTR_FULL and the number of ticks dramatically dropped to 6210. I was like "wow!". No matter what else I could optimize on that loop, those calls would still slow it down by a lot.
Are those calls naturally expensive or am I seeing something out of the ordinary?
What can I do to minimize the performance impact?
One idea would be to use DMA, but since I'm writing a column of tiles I'm not sure if that'll work right. But it might if I was writing a row.
More generally, what are the typical techniques on the Megadrive for scrolling for levels that are bigger than the ram available? My background planes are 64x64 but my foreground is 256x64. Obviously, my current techniques are very slow.
Thanks!
DJCC
These aren't calls, actually. As you can see, there is nothing in them that could be that slow:
Code: Select all
#define GFX_WRITE_VRAM_ADDR(adr) ((0x4000 + ((adr) & 0x3FFF)) << 16) + (((adr) >> 14) | 0x00)
#define TILE_ATTR(pal, pri, flipV, flipH) (((flipH) << 11) + ((flipV) << 12) + ((pal) << 13) + ((pri) << 15))
So you probably have some other problem?
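To make the point about the two macros concrete, here is a small host-side sketch that repeats their arithmetic as plain functions. The function names are made up for this example; only the shifts and masks come from the macros quoted above. With constant arguments a compiler can normally fold all of this at compile time, which is why the macros on their own should not cost much.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as GFX_WRITE_VRAM_ADDR above, as a function.
   The name vram_write_command is hypothetical. */
static uint32_t vram_write_command(uint32_t adr)
{
    assert(adr <= 0xFFFFu);  /* VRAM addresses fit in 16 bits */
    return ((0x4000u + (adr & 0x3FFFu)) << 16) + ((adr >> 14) | 0x00u);
}

/* Same arithmetic as TILE_ATTR above, as a function.
   The name tile_attr is hypothetical. */
static uint16_t tile_attr(unsigned pal, unsigned pri, unsigned flipV, unsigned flipH)
{
    return (uint16_t)((flipH << 11) + (flipV << 12) + (pal << 13) + (pri << 15));
}
```

For example, a plane located at 0xC000 gives the command word 0x40000003, and palette 3 with no priority or flipping gives the attribute bits 0x6000.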
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
Shiru wrote:
> These aren't calls, actually, as you can see there is nothing that could be that slow:
> Code: Select all
> #define GFX_WRITE_VRAM_ADDR(adr) ((0x4000 + ((adr) & 0x3FFF)) << 16) + (((adr) >> 14) | 0x00)
> #define TILE_ATTR(pal, pri, flipV, flipH) (((flipH) << 11) + ((flipV) << 12) + ((pal) << 13) + ((pri) << 15))
> So probably you have some other problem?
In the latest SGDK I replaced the GFX_WRITE_VRAM_ADDR macro:
Code: Select all
#define GFX_WRITE_VRAM_ADDR(adr) vramwrite_tab[adr]
Anyway, as Shiru said, I think there is something wrong in your code... If those macros slow down your code this much, it is probably because you evaluate GFX_WRITE_VRAM_ADDR and TILE_ATTR for every single tilemap entry write, which is definitely not the way to update your tilemap.
-
- Very interested
- Posts: 710
- Joined: Sat Feb 18, 2012 2:44 am
Stef wrote:
> In the latest SGDK I replaced the GFX_WRITE_VRAM_ADDR macro:
> Code: Select all
> #define GFX_WRITE_VRAM_ADDR(adr) vramwrite_tab[adr]
Yes, that's the version I have. It is noticeably faster than the version Shiru mentioned.
Stef wrote:
> Anyway, as Shiru said, I think there is something wrong in your code... If those macros slow down your code this much, it is probably because you evaluate GFX_WRITE_VRAM_ADDR and TILE_ATTR for every single tilemap entry write,
That's exactly what my loop is doing, sadly.
Stef wrote:
> which is definitely not the way to update your tilemap.
In that case, what would be the best way to do it? As I mentioned earlier, if I were writing a row to the tile map, I think I could use DMA, but in this case I'm writing a column. So every tile map index I write to needs to be offset by the width of the plane after every write. If there's a much better way of doing it, I'm all ears!
Thanks!
DJCC
-
- Very interested
- Posts: 2440
- Joined: Tue Dec 05, 2006 1:37 pm
- Location: Estonia, Rapla City
- Contact:
auto increment register does wonders
What are you reading? You won't understand it anyway
http://www.tmeeco.eu
Files of all broken links and images of mine are found here : http://www.tmeeco.eu/FileDen
-
- Very interested
- Posts: 710
- Joined: Sat Feb 18, 2012 2:44 am
TmEE co.(TM) wrote:
> auto increment register does wonders
How would I use that, like this?
// pseudocode-ish code ahead.
// pre-loop setup
plctrl = (u32 *) GFX_CTRL_PORT;
pwdata = (u16 *) GFX_DATA_PORT;
addr = <some location in the plane I want to write to>
*plctrl = GFX_WRITE_VRAM_ADDR(addr); // only calling this once.
VDP_setAutoInc(VDP_getPlanWidth());
// loop!
while (not done)
{
tilenumber = LookUpTileNumber(x, y);
*pwdata = tilenumber; // by doing this the address also automatically increments by the value I specified, going to the next tile address in the column.
}
Would that be right? I'm at work, so I'll test it when I get home tonight.
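A host-side sketch of the auto-increment idea, for anyone who wants to check the address arithmetic without hardware. `FakeVdp` and the `fake_*` helpers are invented for this example and only model the write address and the auto-increment; they are not SGDK functions, and the 64-tile plane width is an assumption. One detail worth double-checking in the pseudocode above: the auto-increment is counted in bytes and a tilemap entry is 2 bytes, so if VDP_getPlanWidth() returns the width in tiles, stepping down one row would need width * 2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VRAM_SIZE 0x10000u
#define PLANE_W   64u   /* plane width in tiles (assumed for this sketch) */

/* Toy stand-in for the VDP: write address plus auto-increment only. */
typedef struct {
    uint8_t  vram[VRAM_SIZE];
    uint16_t addr;      /* current write address, in bytes */
    uint8_t  auto_inc;  /* added to addr after each data write, in bytes */
} FakeVdp;

static void fake_init(FakeVdp *v) { memset(v, 0, sizeof *v); }
static void fake_set_write_addr(FakeVdp *v, uint16_t addr) { v->addr = addr; }
static void fake_set_auto_inc(FakeVdp *v, uint8_t inc) { v->auto_inc = inc; }

/* One 16-bit write to the "data port", stored big-endian, then auto-increment. */
static void fake_write_data(FakeVdp *v, uint16_t value)
{
    assert((uint32_t)v->addr + 1u < VRAM_SIZE);
    v->vram[v->addr]     = (uint8_t)(value >> 8);
    v->vram[v->addr + 1] = (uint8_t)(value & 0xFFu);
    v->addr = (uint16_t)(v->addr + v->auto_inc);
}

static uint16_t fake_read_word(const FakeVdp *v, uint16_t addr)
{
    return (uint16_t)((v->vram[addr] << 8) | v->vram[addr + 1]);
}

/* Write `count` entries straight down a column: the address is set once,
   the auto-increment does the stepping from row to row. */
static void write_column(FakeVdp *v, uint16_t plane_base, unsigned x, unsigned y,
                         const uint16_t *entries, unsigned count)
{
    fake_set_auto_inc(v, (uint8_t)(PLANE_W * 2u));  /* one row = width * 2 bytes */
    fake_set_write_addr(v, (uint16_t)(plane_base + (x + y * PLANE_W) * 2u));
    for (unsigned i = 0; i < count; i++)
        fake_write_data(v, entries[i]);
}
```

Writing three entries at column 5 starting on row 2 lands them on rows 2, 3 and 4 of that column and leaves the neighbouring column untouched.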
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
Yep, that's the idea.
Usually you set the VRAM address once, then you send your data to the VDP data port.
Also, you should cache the tile flags instead of recomputing them on every write...
By the way, can't you use the following method?
Code: Select all
void VDP_setTileMapRect(u16 plan, const u16 *data, u16 index, u16 flags, u16 x, u16 y, u16 w, u16 h);
This one is taken from the latest SGDK version, but you have a similar one (the famous method that everyone reports as buggy :p).
By using this method you can update a rectangular region of your tilemap: just prepare your data in the "data" buffer and call the method.
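A sketch of the "prepare your data in the buffer" step for a one-tile-wide column. `gather_column` is a hypothetical helper, not part of SGDK, and the map layout (row-major, one u16 per tile) is an assumption. The commented-out call at the end only guesses at the meaning of `index` and `flags` from their names in the prototype above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `h` entries of column `x`, starting at row `y`, from a row-major map
   into a contiguous buffer. Hypothetical helper for this sketch. */
static void gather_column(const uint16_t *map, size_t map_width,
                          size_t x, size_t y, size_t h, uint16_t *out)
{
    const uint16_t *src = map + y * map_width + x;
    assert(x < map_width);
    for (size_t i = 0; i < h; i++) {
        out[i] = *src;
        src += map_width;   /* step down one row of the source map */
    }
}

/* On hardware the buffer could then be handed over as a 1 x h rectangle,
   for example (argument values are placeholders, untested):
   VDP_setTileMapRect(APLAN, column, FOREGROUND_TILE_STARTINDEX, flags, x, y, 1, h); */
```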
-
- Very interested
- Posts: 710
- Joined: Sat Feb 18, 2012 2:44 am
Stef wrote:
> By the way, can't you use the following method?
> Code: Select all
> void VDP_setTileMapRect(u16 plan, const u16 *data, u16 index, u16 flags, u16 x, u16 y, u16 w, u16 h);
I probably could. I didn't think of it at the time. I'll try it out later today.
-
- Very interested
- Posts: 710
- Joined: Sat Feb 18, 2012 2:44 am
I played around some more with the code, following the pseudocode I wrote earlier. It definitely helps with performance. Again, running 1000 times, the benchmark number drops to around 42240 ticks.
But by accident I discovered something weird.
So I'm caching the tile flag like this:
u16 tileAttr = TILE_ATTR_FULL(PAL3, 0, 0, 0, 0);
And then in my loop I set the data like so
do
{
u16 tileNumber = <look up for tile number>
*pwdata = tileAttr + tileNumber;
} while (not done)
Like that my benchmark number is around 42240 as above. But if I remove the tileNumber when setting pwdata, meaning I'm only really setting the tileAttr, the benchmark number dramatically drops to 9220.
*pwdata = tileAttr; // benchmark drops to 9220!
If I use just tileNumber, benchmark number is around 39000.
*pwdata = tileNumber; // ~39000.
Both variables are u16 and declared within or very close to the loop. I have no idea what is causing the difference.
Any ideas?
Here's the code I have. I apologize for the messiness. I've been going over and over on it to find out how to make
it run faster.
Code: Select all
// starttilex is where the column is being drawn on the x axis
// starttiley is the start of the column of tiles I want to draw
// endtiley is the end
if (Player.speedx < 0) // the player is moving towards the left.
{
    u32 firstTick = getSubTick();
    VDP_setAutoInc(VDP_getPlanWidth());
    u16 tileMask = TILE_ATTR_FULL(PAL3, 0, 0, 0, 0);
    int counter = 0;
    while (counter < 1000)
    {
        loop = endtiley - starttiley;
        const u32 addr = APLAN + ((starttilex + (starttiley << 6)) << 1);
        *plctrl = GFX_WRITE_VRAM_ADDR(addr);
        do
        {
            // actual map is 256 tiles wide, hence the shift left of 8
            u16 tileNumber = foreground_layer[((loop + starttiley) << 8) + starttilex] + FOREGROUND_TILE_STARTINDEX;
            *pwdata = (tileMask + tileNumber); // weirdly slow!
            ////*pwdata = tileNumber; // also slow!
            ////*pwdata = tileMask; // not slow!
            --loop;
        } while (loop > 0);
        counter++;
    }
    counter = 0;
    u32 secondTick = getSubTick();
    uintToStr(secondTick - firstTick, blah, 16);
    VDP_drawText(blah, 5, 6);
}
DJCC
Yes, it is. However, you should always check the generated assembly code when using GCC. GCC is a very poor compiler, and you need to help it a lot. By the way, you are drawing your tile map upside down.
If foreground_layer is a byte array, and FOREGROUND_TILE_STARTINDEX is a multiple of 256, then you could rewrite the loop like this:
Code: Select all
inline u16 WHL(u8 l,u16 h)
{
u16 res;
__asm (
"move.b %2,%0" : "=d"(res) : "0"(h), "g"(l)
);
return res;
}
...
u8*p=foreground_layer + WHL(starttilex,starttiley<<8);
tileMask+=FOREGROUND_TILE_STARTINDEX;
do {
*pwdata = tileMask = WHL(*p,tileMask); // or *pwdata = *p+tileMask, if FOREGROUND_TILE_STARTINDEX is not a multiple of 256 (slower)
p+=256;
} while(--loop);
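For readers who do not read m68k inline assembly: `move.b` into a data register replaces only its low byte, so WHL(l, h) keeps the high byte of h and substitutes l as the low byte. A portable sketch of the same operation is below; `replace_low_byte` is a name invented here. It also shows why the trick needs the base value to be a multiple of 256: only then does replacing the low byte give the same result as adding the tile number.

```c
#include <assert.h>
#include <stdint.h>

/* Portable equivalent of what WHL does: keep the high byte of `high`,
   put `low` in the low byte. The name is hypothetical. */
static uint16_t replace_low_byte(uint8_t low, uint16_t high)
{
    return (uint16_t)((high & 0xFF00u) | low);
}

/* Reference result: plain addition of the base value and the tile number. */
static uint16_t add_tile(uint8_t tile, uint16_t base)
{
    assert(base <= 0xFF00u);   /* keep the sum inside 16 bits */
    return (uint16_t)(base + tile);
}
```

With a base of 0x6100 both functions agree for every tile number from 0 to 255. With a base of 0x6180 they diverge as soon as the sum carries into the high byte.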
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
Exactly, GCC detects that you never use tileNumber, so it simply removes the whole calculation of the variable.
The slow part of your code is this:
Code: Select all
u16 tileNumber = foreground_layer[((loop + starttiley) << 8) + starttilex] + FOREGROUND_TILE_STARTINDEX;
Indeed, compared to the rest of the loop, it is heavy.
As the GCC compiler is not very good, you should use a pointer instead to make your code more efficient.
Code: Select all
...
const u16 tileBaseValue = TILE_ATTR_FULL(PAL3, 0, 0, 0, 0) + FOREGROUND_TILE_STARTINDEX;
int counter = 0;
while (counter < 1000)
{
    loop = endtiley - starttiley;
    const u32 addr = APLAN + ((starttilex + (starttiley << 6)) << 1);
    u16* src = &foreground_layer[(starttiley << 8) + starttilex];
    *plctrl = GFX_WRITE_VRAM_ADDR(addr);
    while(loop--)
    {
        *pwdata = tileBaseValue + *src;
        // pass to next line
        src += 256;
    }
    counter++;
}
...
This code should be faster...
Edit: seems Gigasoft was faster to reply :p
By the way, the inline keyword does not work for me, and this is really annoying when it comes to writing small methods which need to be inlined. Has someone already managed to get it working with the GCC m68k-elf target?
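On the pointer-based loop in the post above: a small host-side check that stepping a pointer by one map row per iteration visits the same entries as recomputing the index each time. The function names, the 256-entry-wide u16 map and the output buffer standing in for the VDP data port are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_W 256u   /* level map width in tiles, as in the posts above */

/* Index recomputed on every iteration (shift and add each time). */
static void emit_by_index(const uint16_t *map, unsigned x, unsigned y0,
                          unsigned n, uint16_t base, uint16_t *out)
{
    for (unsigned i = 0; i < n; i++)
        out[i] = (uint16_t)(map[((y0 + i) << 8) + x] + base);
}

/* Start address computed once, then a constant stride per iteration. */
static void emit_by_pointer(const uint16_t *map, unsigned x, unsigned y0,
                            unsigned n, uint16_t base, uint16_t *out)
{
    const uint16_t *src = &map[(y0 << 8) + x];
    assert(x < MAP_W);
    while (n--) {
        *out++ = (uint16_t)(base + *src);
        src += MAP_W;   /* move to the same column on the next row */
    }
}
```

Both functions produce identical output; the second one just leaves less arithmetic inside the loop for the compiler to get wrong.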
-
- Very interested
- Posts: 710
- Joined: Sat Feb 18, 2012 2:44 am
Shiru wrote:
> Isn't the GCC optimizer smart enough to detect that the value is never used, so it does not even try to calculate it, i.e. your u16 tileNumber ... line does not execute at all when you comment out its use?
Yeah. I realized that the compiler was stripping that out right after posting. Should've thought of that much faster. Duh.
Gigasoft wrote:
> By the way, you are drawing your tile map upside down.
The function stopped being proper a long time ago. It's now a testbed for developing the right coding techniques. I've never really written code that close to the metal before.
Stef wrote:
> As GCC compiler is not very good, you should use pointer instead to make your code more efficient.
I'll try out your suggestion this weekend.
I'll also try out just unrolling the loop. I think it might end up to always be the same length, anyway.
Thanks!
DJCC
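As a footnote to the benchmarking surprise above: when a computed value is never used, the optimizer is free to drop the computation, so the timing no longer measures it. One common way to keep the work in a benchmark is to store each result to a volatile object. The sketch below is host-side C; `benchmark_sink` and `timed_lookup_loop` are names invented here, and the byte table merely stands in for the level map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sink: stores to a volatile object may not be optimised away. */
static volatile uint16_t benchmark_sink;

/* Runs the lookup `iterations` times over `count` entries. Every result is
   stored to the volatile sink, so the lookup stays in the generated code. */
static uint32_t timed_lookup_loop(const uint8_t *table, size_t count,
                                  unsigned iterations, uint16_t base)
{
    uint32_t checksum = 0;
    assert(table != NULL);
    for (unsigned it = 0; it < iterations; it++) {
        for (size_t i = 0; i < count; i++) {
            uint16_t value = (uint16_t)(base + table[i]);
            benchmark_sink = value;   /* forces the compiler to keep `value` */
            checksum += value;
        }
    }
    return checksum;
}
```

Returning the checksum gives a second reason for the compiler to keep the work, and a cheap way to confirm that two variants of the loop compute the same thing.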
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Stef wrote:
> By the way, the inline keyword does not work for me, and this is really annoying when it comes to writing small methods which need to be inlined. Has someone already managed to get it working with the GCC m68k-elf target?
My tests on 4.6.2 show that gcc inlines the function as long as the opt level is not 0 and you don't use -fno-inline.
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
Chilly Willy wrote:
> Stef wrote:
> > By the way, the inline keyword does not work for me, and this is really annoying when it comes to writing small methods which need to be inlined. Has someone already managed to get it working with the GCC m68k-elf target?
> My tests on 4.6.2 show that gcc inlines the function as long as the opt level is not 0 and you don't use -fno-inline.
Indeed, inline does work with my GCC 4.1.1 build, but not with the 3.4.6 version.
I ran several benchmarks and unfortunately it appears that GCC 4.1.1 (and I guess newer versions) definitely produces slower code than GCC 3.4.6 :-/
That's basically why I chose it for SGDK (it also takes less space).
Here are the results for those interested:
GCC 3.4.6
O1:
cube 3D: 11 FPS minimum
partic: 15 FPS with 610 particles, 11 FPS with 999 particles
O2:
cube 3D: 10 FPS minimum
partic: starts at 12 FPS, 10 FPS with 999 particles
O3:
cube 3D: 10 FPS minimum
partic: starts at 12 FPS, 10 FPS with 999 particles
GCC 4.1.1
O1:
cube 3D: 10 FPS minimum
partic: 15 FPS with 450 particles, 9 FPS with 999 particles
O2:
cube 3D: 9 FPS minimum
partic: starts at 12 FPS, 9 FPS with 999 particles
O3:
cube 3D: 9 FPS minimum
partic: starts at 12 FPS, 9 FPS with 999 particles
Maybe I biased SGDK towards producing better code with the 3.4.6 version...
Does anyone have an interesting piece of code we can benchmark with various versions of GCC, to see if older versions are really better for m68k?
Last edited by Stef on Thu Mar 29, 2012 1:37 pm, edited 2 times in total.