Optimizing background tiles drawn to framebuffer


ammianus
Very interested
Posts: 124
Joined: Sun Jan 29, 2012 2:10 pm
Location: North America


Post by ammianus » Sat May 26, 2012 5:57 pm

So based on some ideas in this demo thread I am looking at building some infrastructure for using tiles to draw the background instead of one big bitmap, which was more or less a placeholder.

I've got something basic working, but I'm looking for thoughts on how (or whether) it can be done better. I've downloaded the source for SuperVDP, but it's in assembly and I am making my game demo in C (plus I'm not sure I need some of those features yet).

Below you can see I have created a "floor" with a single repeating tile.
[Image: screenshot of the tiled floor]
It's not quite as smooth as before. I sort of expected that, but it's not too choppy yet.

Here is roughly the algorithm I use. The tiles are arbitrarily 16x16

Code:

		for(background_tile_x = 0;background_tile_x < MAP_SCREEN_WIDTH; background_tile_x++)
		{
			for(background_tile_y = 10;background_tile_y < MAP_SCREEN_HEIGHT; background_tile_y++)
			{
				//where map_tiles[0] contains my 16x16 tile
				drawSprite(map_tiles[0],
					background_tile_x * BG_TILE_SIZE,
					background_tile_y * BG_TILE_SIZE,
					BG_TILE_SIZE,
					BG_TILE_SIZE,0);
			}
		}
I have a couple of "where should I go from here" questions now as I've never made a game at this low level before :D

- Is there something specifically bad about how I've built my for loops? Note that each drawSprite() call draws pixels to part of the framebuffer. It seemed logical and flexible to me, so that in the future I can look up in a tilemap which tile to draw at each square. Maybe there is a better way to draw the data, like going across an entire line in one pass or something?

- I could do a "dirty rectangle" algorithm and try to redraw only tiles that have a sprite move in front of them. But I think in the long term it's going to be more work when I have a lot more sprites, plus the entire background may be scrolling as the characters walk around.

- Is there any best practice around how large the tiles should be? I just picked 16x16 as it seemed to make sense. I'm not sure whether there is any significance to how large or small they need to be.

- Is there any standard way that I can follow to represent the tile map for a level or stage?

- What about scrolling this horizontally? I would probably start that by just redrawing with the tiles shifted left or right, then maybe drawing different tiles from the map.
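
For the tile map and scrolling questions above, one possible layout is sketched below in C. It is only an illustration: `level_map`, `MAP_WIDTH`, `MAP_HEIGHT`, `tile_index_at`, and `scroll_to_column` are hypothetical names, and the map size is an arbitrary assumption. The idea is one byte per map square holding a tile index, with the scroll offset split into a starting map column and a fine pixel shift.

```c
#include <stdint.h>

#define BG_TILE_SIZE 16   /* tile edge in pixels, as in the loop above */
#define MAP_WIDTH    64   /* hypothetical level width in tiles */
#define MAP_HEIGHT   14   /* hypothetical level height in tiles */

/* One byte per map square: an index into the tile set. */
uint8_t level_map[MAP_HEIGHT][MAP_WIDTH];

/* Return the tile index under a pixel position in level coordinates.
   Out-of-range positions return tile 0 so callers need no bounds checks. */
uint8_t tile_index_at(int level_x, int level_y)
{
    int tx = level_x / BG_TILE_SIZE;
    int ty = level_y / BG_TILE_SIZE;
    if (level_x < 0 || level_y < 0 || tx >= MAP_WIDTH || ty >= MAP_HEIGHT)
        return 0;
    return level_map[ty][tx];
}

/* For a horizontal scroll offset in pixels (assumed non-negative), compute
   the first visible map column and how many pixels (0..15) of that column
   are already scrolled off the left edge. */
void scroll_to_column(int scroll_x, int *first_col, int *fine_x)
{
    *first_col = scroll_x / BG_TILE_SIZE;
    *fine_x    = scroll_x % BG_TILE_SIZE;
}
```

With a layout like this, the drawing loop would look up `level_map[row][first_col + column]` for each square instead of always passing `map_tiles[0]`.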


Edit: and how can we measure performance more quantitatively? Fusion seems to always tell me around 60 fps no matter how choppy the gameplay seems to be.

Thank you

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Sat May 26, 2012 10:25 pm

The sdram is designed to do burst reads of 8 words, so 16 bytes is the most efficient width for data in sdram. Note that rom is read as single words, and may be held up by activity on the bus.

On 32X programs, MAKE SURE THE 68000 SPENDS MOST OF ITS TIME RUNNING IN WORK RAM. Copy the code for the 68000 to execute into ram and run it from there. Do not run the 68000 from the rom if you can help it.

If the cells you use to draw the display are in rom, it would help to copy them to sdram... just the ones you need. Then when you copy the cell to the display, you can get those burst reads working for you. Even the SH2 DMA will burst read 16 bytes from sdram if you have the DMA set for 16 byte transfer mode.

The frame buffer is only written a word at a time, but it does have a FIFO. Writing the FIFO takes 3 cycles per word, and once filled, takes 5 cycles per word. There are four slots in the FIFO, so the "ideal" width from the point of view of the frame buffer is 4 words (8 bytes). You would read four words from sdram (use a cacheable address so it fills a cache line and then reads from the cache), then write four words to the frame buffer. Then while the FIFO writes the frame buffer, you would do things like calculations/fetch next line/whatever, then write the next batch of four words. Note that if you did four words, that may be 8 pixels in 256 color mode, but you HAVE to write it as four WORDS to the frame buffer, meaning you would only be able to position the pixels as if they were two pixels wide - you cannot write words to odd byte addresses. You can use the pixel delay setting to scroll by one pixel, but cells would always be written to even addresses.
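
A minimal C sketch of the "batch of four words" idea, assuming 256 color mode (one byte per pixel, two pixels per word). `put_cell_row` is a hypothetical helper, not part of any 32X library. It shows why a cell can only land on an even pixel column when written as words: the odd bit of `x` cannot be expressed by the write and is returned so the caller can deal with it some other way (for example with the pixel delay setting).

```c
#include <stdint.h>

/* Write one 8-pixel cell row (four 16-bit words) at pixel column x.
   `line` points at the first word of a display line.
   Returns the odd pixel (0 or 1) that word writes cannot express. */
int put_cell_row(volatile uint16_t *line, int x, const uint16_t *src)
{
    volatile uint16_t *dst = line + (x >> 1);   /* always an even byte address */

    /* four back-to-back word stores: one FIFO's worth */
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];

    return x & 1;
}
```

On real hardware `line` would point into the uncached frame buffer; here it is just a pointer so the sketch can be tried on any machine.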

ammianus
Very interested
Posts: 124
Joined: Sun Jan 29, 2012 2:10 pm
Location: North America

Post by ammianus » Sun May 27, 2012 12:18 am

Thanks Chilly
Note that if you did four words, that may be 8 pixels in 256 color mode, but you HAVE to write it as four WORDS to the frame buffer, meaning you would only be able to position the pixels as if they were two pixels wide - you cannot write words to odd byte addresses. You can use the pixel delay setting to scroll by one pixel, but cells would always be written to even addresses.
When you say you can't write one pixel at a time, do you mean that it's not optimal?

I ask because I am writing one pixel at a time in the drawSprite function: I cast the framebuffer pointer to a vu8 array and write one byte at a time. It's been working so far :)
On 32X programs, MAKE SURE THE 68000 SPENDS MOST OF ITS TIME RUNNING IN WORK RAM. Copy the code for the 68000 to execute into ram and run it from there. Do not run the 68000 from the rom if you can help it.
I am just using your boilerplate code to start the 68K. Do I need to do something else?

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Sun May 27, 2012 3:17 am

ammianus wrote:Thanks Chilly
Note that if you did four words, that may be 8 pixels in 256 color mode, but you HAVE to write it as four WORDS to the frame buffer, meaning you would only be able to position the pixels as if they were two pixels wide - you cannot write words to odd byte addresses. You can use the pixel delay setting to scroll by one pixel, but cells would always be written to even addresses.
When you say you can't write one pixel at a time, do you mean that it's not optimal?

I ask because I am writing one pixel at a time in the drawSprite function: I cast the framebuffer pointer to a vu8 array and write one byte at a time. It's been working so far :)
You can write one pixel at a time, but it wastes bandwidth. Byte writes take the same time as a word write, so writing two bytes takes twice the time as one word. Also, if you write a byte at a time to the frame buffer, it ignores writes of 0. The only way to write 0 to the frame buffer is as a word. The overwrite area ignores words with 0 as one or both bytes. So the regular frame buffer acts like the overwrite buffer for byte writes. So that's two reasons you might not want to write bytes.
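
To make the two points above concrete, here is a small host-side sketch in C. `pack_pixels` and `model_fb_byte_write` are hypothetical names. The first packs two 8-bit pixels into one word so a single write covers both. The second only models the described "byte writes of 0 are ignored" behaviour on an ordinary array; it does not touch hardware.

```c
#include <stdint.h>

/* Pack two 8-bit pixels into one 16-bit word. The SH2 is big-endian, so the
   left pixel (lower address) goes in the high byte. */
uint16_t pack_pixels(uint8_t left, uint8_t right)
{
    return (uint16_t)((left << 8) | right);
}

/* Model of a byte write to the frame buffer as described above:
   a value of 0 is dropped, so the old pixel survives. */
void model_fb_byte_write(uint8_t *fb, int offset, uint8_t value)
{
    if (value != 0)
        fb[offset] = value;
}
```

So a sprite drawn with byte writes gets color 0 treated as transparent for free, while a background tile that really contains color 0 has to be written as words.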

On 32X programs, MAKE SURE THE 68000 SPENDS MOST OF ITS TIME RUNNING IN WORK RAM. Copy the code for the 68000 to execute into ram and run it from there. Do not run the 68000 from the rom if you can help it.
I am just using your boilerplate code to start the 68K. Do I need to do something else?
My 32X code keeps the main 68000 loop in ram, so it should be fine. Just remember to keep this in mind if you add any 68000 code. One of the best features the 32X has is that you can run all the processors in different areas of memory so that no one hogs the bus to the cart. That makes your game faster. Wolf32X was less than half the speed before I moved the 68000 code from rom into ram.

ammianus
Very interested
Posts: 124
Joined: Sun Jan 29, 2012 2:10 pm
Location: North America

Post by ammianus » Sun May 27, 2012 1:54 pm

Chilly Willy wrote:
You can write one pixel at a time, but it wastes bandwidth. Byte writes take the same time as a word write, so writing two bytes takes twice the time as one word. Also, if you write a byte at a time to the frame buffer, it ignores writes of 0. The only way to write 0 to the frame buffer is as a word. The overwrite area ignores words with 0 as one or both bytes. So the regular frame buffer acts like the overwrite buffer for byte writes. So that's two reasons you might not want to write bytes.
Ok thanks for the clarification, that is encouraging, since the game is now running pretty smoothly, and I have a lot of room to improve here.

When you say write a word, or write four words (8 pixels), what does that mean in C?

I can write a word by treating the framebuffer as a vu16 array. To write four words at a time in C, is there some construct to do that? Or could I cast it to a uint64 array, in which case I'd have to build a 64-bit number from the 8 pixels of data?

Edit: I think I am being dense. I can use this, correct?

extern void fast_memcpy(void *dst, void *src, int len);
where src is my 8 bytes (4 words).


The other issue I have is drawing the images flipped, I have to write the pixels "backwards". This is easy to do where 1 pixel = 1 byte, but with words not only do I need to write the words backwards I have to flip the two bytes in the word. I don't know if there is some simple way to do that...I feel like I am back in my CS undergrad classes. :?

Edit: If I treat all the pixels as an array of bytes before writing to framebuffer using memcpy then this isn't so hard.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Sun May 27, 2012 6:14 pm

ammianus wrote: Ok thanks for the clarification, that is encouraging, since the game is now running pretty smoothly, and I have a lot of room to improve here.

When you say write a word or writing four words (8 pixels) in C what does that mean?

I can write a word by treating framebuffer as vu16 array. To write four words at a time in C is there some construct to do that or could I cast it to a uint64 array then I'd have to build a 64 bit number from the 8 pixels of data?
If you wish to move the data as words, use 16-bit values; make u16 pointers and use them to move the data. If you wish to move the data as longs, make the pointers u32 pointers. Sometimes, you want more control of the exact code used to move data, in which case I suggest making some assembly functions like my fast_memcpy.
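
In plain C, the two options look roughly like this. It is a sketch only: the `u16`/`u32` typedefs are assumed to match whatever the project headers define, and `copy_words`/`copy_longs` are hypothetical helper names.

```c
#include <stdint.h>

typedef uint16_t u16;   /* assumed equivalents of the project's typedefs */
typedef uint32_t u32;

/* Copy `words` 16-bit values; both pointers must be 2-byte aligned. */
void copy_words(volatile u16 *dst, const u16 *src, int words)
{
    while (words-- > 0)
        *dst++ = *src++;
}

/* Copy `longs` 32-bit values; both pointers must be 4-byte aligned. */
void copy_longs(volatile u32 *dst, const u32 *src, int longs)
{
    while (longs-- > 0)
        *dst++ = *src++;
}
```

The pointer type alone decides the access size the compiler emits, which is why no special "write four words" construct is needed: four words are simply four `u16` stores or two `u32` stores.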

Edit: I think I am being dense. I can use this correct?

extern void fast_memcpy(void *dst, void *src, int len);
where src is my 8 bytes (4 words).
Well, let's look at my "fast_memcpy" command... first, it's in sdram; the place where it exists in the file sh2_crt0.s is part of the data section, which is copied into sdram on reset.

Code:

! Fast memcpy function - copies longs, runs from sdram for speed
! On entry: r4 = dst, r5 = src, r6 = len (in longs)

        .align  4
        .global _fast_memcpy
_fast_memcpy:
        mov.l   @r5+,r3
        mov.l   r3,@r4
        dt      r6
        bf/s    _fast_memcpy
        add     #4,r4
        rts
        nop
Note that it copies longs... a long on the SH2 is four bytes. On the 32X, a long is the most efficient unit for copying data IN GENERAL. If you remember what I said earlier, the sdram is set to burst reads - it reads eight words in 12 cycles to fill a cache line on reading a new address. But what if you have set the pointer to make the read uncacheable (OR with 0x20000000)? It STILL reads as a burst of eight words, but throws away the unused data. So reading a word as uncached from sdram throws away seven of the eight words read. Reading a long throws away six of the eight words read. So if I passed an uncached pointer to fast_memcpy, it wastes fewer cycles copying longs than if it copied words.

Of course, the code counts on the addresses being long aligned, and the length is the number of longs to copy, not the number of bytes. So you can't always use this routine... you could always make another function similar to it if you need a copy routine with different characteristics.
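
Those two constraints (long alignment, length in longs) can be checked in a thin C wrapper before calling a long-based copy. The sketch below is an assumption-laden illustration: `copy_bytes_as_longs` is a hypothetical wrapper, and `long_copy` is a plain C stand-in for the assembly routine so the example runs on any machine.

```c
#include <stdint.h>

/* Plain C stand-in for a long-based copy: len is counted in longs. */
void long_copy(void *dst, const void *src, int len)
{
    uint32_t *d = (uint32_t *)dst;
    const uint32_t *s = (const uint32_t *)src;
    while (len-- > 0)
        *d++ = *s++;
}

/* Hypothetical wrapper taking a byte count. Returns 1 if the copy was done,
   0 if the arguments do not suit a long-based copy. */
int copy_bytes_as_longs(void *dst, const void *src, int bytes)
{
    if (bytes <= 0 || (bytes & 3) != 0)
        return 0;                        /* not a whole number of longs */
    if ((((uintptr_t)dst | (uintptr_t)src) & 3) != 0)
        return 0;                        /* a pointer is not long aligned */
    long_copy(dst, src, bytes / 4);      /* convert bytes to longs */
    return 1;
}
```

Rejecting a zero length matters for a `dt`-style loop in particular: `dt` decrements before testing for zero, so a count of 0 would wrap around rather than copy nothing.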

Putting the copy routine in sdram means that the code instructions themselves load faster (the SH2 can load instructions to the cache using burst reads from the sdram, while the cart takes many more cycles to load from). That's why there's the ".align 4" right in front of the function - that starts the code on the start of a cache line; notice the function is seven instructions, so one burst read will put the entire function into the cache where the SH2 can run with no further external instruction access cycles that might have to wait on the other processors.


The other issue I have is drawing the images flipped, I have to write the pixels "backwards". This is easy to do where 1 pixel = 1 byte, but with words not only do I need to write the words backwards I have to flip the two bytes in the word. I don't know if there is some simple way to do that...I feel like I am back in my CS undergrad classes. :?
The 68000 was easy for that... just "ror.w #8,d0" would flip the bytes in a word. The SH2 did keep byte swapping in mind - use "swap.b rs,rd". That copies rs to rd with the bytes of the lower word swapped. To swap all the bytes in a long, you would do this:

Code:

  swap.b r1,r1
  swap.w r1,r1
  swap.b r1,r1
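
For anyone doing the flip in C rather than assembly, the same operations can be sketched with shifts and masks. `swap_bytes16` and `reverse_bytes32` are hypothetical names; whether a given compiler turns them into `swap.b`/`swap.w` is not guaranteed.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit word. */
uint16_t swap_bytes16(uint16_t w)
{
    return (uint16_t)((w << 8) | (w >> 8));
}

/* Reverse all four bytes of a long, following the same three steps as the
   swap.b / swap.w / swap.b sequence. */
uint32_t reverse_bytes32(uint32_t v)
{
    /* swap.b: swap the two bytes of the lower word, keep the upper word */
    v = (v & 0xFFFF0000u) | ((v & 0x00FFu) << 8) | ((v & 0xFF00u) >> 8);
    /* swap.w: exchange upper and lower words */
    v = (v << 16) | (v >> 16);
    /* swap.b again on what is now the lower word */
    v = (v & 0xFFFF0000u) | ((v & 0x00FFu) << 8) | ((v & 0xFF00u) >> 8);
    return v;
}
```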

ammianus
Very interested
Posts: 124
Joined: Sun Jan 29, 2012 2:10 pm
Location: North America

Post by ammianus » Mon May 28, 2012 10:34 pm

I created a copy of your function, just for word sized data

Code:

! Fast wmemcpy function - copies words, runs from sdram for speed
! On entry: r4 = dst, r5 = src, r6 = len (in words)

        .align  4
        .global _fast_wmemcpy
_fast_wmemcpy:
        mov.w   @r5+,r3
        mov.w   r3,@r4
        dt      r6
        bf/s    _fast_wmemcpy
        add     #2,r4
        rts
        nop
I suppose though, what are the real bottlenecks to performance when writing data to the framebuffer? I am now writing the pixels by copying an array of words containing all that I want to draw (sized now at 16 bytes).

I assume my local variables in my C program are in SDRAM.

If I have a pointer to the bytes of the 16x16 tile bitmap I want to draw, how do I know if they are cacheable?

Is cache the main bottleneck? I am likely to want to draw many different background tiles, so I probably would be drawing multiple different tiles in a row. I suspect in practice the cache hit rate is going to be pretty low?


Anyway, I am wondering if any emulator has a mechanism that lets me measure the performance of different code quantitatively. For example, after switching from writing bytes one at a time to writing words using the fast_wmemcpy above, I don't see a visible difference; perhaps the new way is even slower.
[Edit: side-by-side movement comparisons seem to be 50% slower using the above function instead of writing one byte at a time.]

[Edit 2: OK, I think I see part of the problem. In the new approach I fill a buffer with bytes until I have enough to write the word(s) using fast_wmemcpy. That byte copy seems to be causing the slowdown (if I remove it, it goes as fast as before).]

Code:

//write byte to my pixelbuffer (slow!)
pixelWords[p] = spriteBuffer[bufCnt];
...
//pseudo code
//if p == size of bytes i want to write
//e.g. write to framebuffer four words at a time
//fast_wmemcpy((void *)(frameBuffer8+fbOff), (void *)&pixelWords, pixelWriteBufferSizeWords);

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Tue May 29, 2012 12:48 am

Page 17 of the 32X hardware manual shows the cached and uncached addresses of each block. In summary:

$00004000 - $000043FF = cached io
$20004000 - $200043FF = uncached io

$02000000 - $023FFFFF = cached rom
$22000000 - $223FFFFF = uncached rom

$04000000 - $0401FFFF = cached frame buffer
$24000000 - $2401FFFF = uncached frame buffer

$04020000 - $0403FFFF = cached overwrite buffer
$24020000 - $2403FFFF = uncached overwrite buffer

$06000000 - $0603FFFF = cached sdram
$26000000 - $2603FFFF = uncached sdram

You should always use uncached access to io. You don't want io values in the cache, or weird things can happen.

The frame buffer and overwrite buffer should be used uncached.

Code (whether in rom or ram) should ALWAYS be cached.

Data in the SDRAM may or may not be cached, depending on how you use it. If it's shared between SH2s, you probably want it uncached. If you need it to be fetched as fast as possible, you normally want it cached.

Data in the ROM may or may not be cached, depending on how you use it. If the data is rather randomly accessed over a wide range, you probably want to access it uncached. If the data is accessed sequentially or over a small range, you probably want to cache it.

When drawing cells from ROM or SDRAM, you probably want them cached. You probably want to write four words per line to the frame buffer, then adjust the pointers for the next line in the cell and display. Since four words is not much, I wouldn't loop for the line - unroll it to four separate load/stores. Remember that the first load will load the cache line (for a cached address), loading 16 bytes. That's two lines worth of cell data if you do four words per line per cell. So write your code to fetch the first word (which loads the cache line), then store the first word to the frame buffer. The next seven loads come from the cache. Load the second word and store it, then load the third word and store it, then the fourth word. The FIFO is now full, so update the frame buffer pointer to the next line. That should give the FIFO time to empty. Load and store the fifth word, then the sixth, then the seventh, then the eighth. You've now exhausted the data in the cache line. Now do your looping (4 loops of two lines for eight line tall cells).
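
The load/store order described above might be written in C roughly as follows. This is a sketch under assumptions: `draw_cell_8x8` and `LINE_WORDS` are hypothetical names, display lines are assumed to sit 160 words apart, and the cell data is assumed to start on a 16-byte boundary so that each pair of lines matches one cache line. A C compiler is free to reorder or roll this differently, so the assembly remains the way to get exact control.

```c
#include <stdint.h>

#define LINE_WORDS 160   /* assumed: 320 pixels per line, two pixels per word */

/* Draw an 8x8 cell stored as 8 lines of four words (64 bytes).
   Each loop pass handles two lines = 16 bytes of source data. */
void draw_cell_8x8(volatile uint16_t *dst, const uint16_t *cell)
{
    int pair;
    for (pair = 0; pair < 4; pair++) {
        /* first line of the pair: on the SH2 the first load fills the line */
        dst[0] = cell[0];
        dst[1] = cell[1];
        dst[2] = cell[2];
        dst[3] = cell[3];
        dst += LINE_WORDS;   /* pointer update while the FIFO drains */

        /* second line of the pair: these loads would hit the cache */
        dst[0] = cell[4];
        dst[1] = cell[5];
        dst[2] = cell[6];
        dst[3] = cell[7];
        dst += LINE_WORDS;

        cell += 8;           /* next 16 bytes of cell data */
    }
}
```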

No emulator comes close to accuracy on the 32X side. You'll have to try your code on real hardware to see how it really works speed-wise.

TapamN
Interested
Posts: 15
Joined: Mon Apr 25, 2011 1:05 am

Post by TapamN » Tue May 29, 2012 2:48 am

ammianus wrote:

Code:

! Fast wmemcpy function - copies words, runs from sdram for speed
! On entry: r4 = dst, r5 = src, r6 = len (in words)

        .align  4
        .global _fast_wmemcpy
_fast_wmemcpy:
        mov.w   @r5+,r3
        mov.w   r3,@r4
        dt      r6
        bf/s    _fast_wmemcpy
        add     #2,r4
        rts
        nop
You could make that faster if you swap the DT and the second MOV.W instructions. Loads have an extra cycle of latency compared to ALU instructions, so you should try to put at least one cycle between a load and the first use of its result.

Also, the SH-2 has a single unified cache with one read/write port. The SH-2 can't read or write from cache when it's fetching an instruction to execute. The SH-2 fetches two instructions at a time, so in speed critical code you need to time your reads and writes to occur between instruction fetches, or there will be a one cycle penalty. Under normal circumstances, this means memory read and write instructions should be on addresses that are long aligned (i.e. bottom two bits of address are 00). Because of this, it can sometimes help to keep loop starts long aligned as well (it doesn't always matter, though).

So by swapping those two instructions, it should be at least one cycle faster each iteration. Assuming it never cache misses, I think it would be 6 cycles per loop. I'm not sure how it works out if both issues come up at the same time, if they add together into a two cycle overhead or one masks the other out resulting in just one cycle lost.

I don't know how well this applies to writing to the 32X framebuffer, but if you want fast copies, the loop needs to be unrolled. This should average one word copied every 4.5 cycles (assuming no cache misses). Each line represents a cycle; wasted cycles are marked with comments.

Code:

        ! void word_8byte_copy(short *dst, short *src, int count)
        .align  4
        .global _word_8byte_copy
_word_8byte_copy:
        mov.w   @r5+,r0
        dt      r6
        mov.w   @r5+,r1
        !wasted cycle here
        mov.w   @r5+,r2
        !wasted cycle here
        mov.w   @r5+,r3
        !wasted cycle here
        mov.w   r0,@r4
        add     #2,r4
        mov.w   r1,@r4
        add     #2,r4
        mov.w   r2,@r4
        add     #2,r4
        mov.w   r3,@r4
        bf/s    _word_8byte_copy
        add     #2,r4
        rts
        nop 
Chilly Willy wrote:Data in the SDRAM may or may not be cached, depending on how you use it. If it's shared between SH2s, you probably want it uncached.
To clarify, when he says shared, he means when one or both CPUs write to it and the other needs to see the changes the other CPU has made. If the data is only read by either CPU and not changed (outside of loading a new level or something), it should still be cached.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Tue May 29, 2012 5:41 am

TapamN wrote: You could make that faster if you swap the DT and the second MOV.W instructions. Loads have an extra cycle of latency compared to ALU instructions, so you should try to put at least one cycle between a load and the first use of its result.

Also, the SH-2 has a single unified cache with one read/write port. The SH-2 can't read or write from cache when it's fetching an instruction to execute. The SH-2 fetches two instructions at a time, so in speed critical code you need to time your reads and writes to occur between instruction fetches, or there will be a one cycle penalty. Under normal circumstances, this means memory read and write instructions should be on addresses that are long aligned (i.e. bottom two bits of address are 00). Because of this, it can sometimes help to keep loop starts long aligned as well (it doesn't always matter, though).

So by swapping those two instructions, it should be at least one cycle faster each iteration. Assuming it never cache misses, I think it would be 6 cycles per loop. I'm not sure how it works out if both issues come up at the same time, if they add together into a two cycle overhead or one masks the other out resulting in just one cycle lost.

I don't know how well this applies to writing to the 32X framebuffer, but if you want fast copies, the loop needs to be unrolled. This should average one word copied every 4.5 cycles (assuming no cache misses). Each line represents a cycle; wasted cycles are marked with comments.

Code:

        ! void word_8byte_copy(short *dst, short *src, int count)
        .align  4
        .global _word_8byte_copy
_word_8byte_copy:
        mov.w   @r5+,r0
        dt      r6
        mov.w   @r5+,r1
        !wasted cycle here
        mov.w   @r5+,r2
        !wasted cycle here
        mov.w   @r5+,r3
        !wasted cycle here
        mov.w   r0,@r4
        add     #2,r4
        mov.w   r1,@r4
        add     #2,r4
        mov.w   r2,@r4
        add     #2,r4
        mov.w   r3,@r4
        bf/s    _word_8byte_copy
        add     #2,r4
        rts
        nop 
Nice optimizing - I need to make those changes myself. :) It's been a while since I originally wrote that code for the copy.
Chilly Willy wrote:Data in the SDRAM may or may not be cached, depending on how you use it. If it's shared between SH2s, you probably want it uncached.
To clarify, when he says shared, he means when one or both CPUs write to it and the other needs to see the changes the other CPU has made. If the data is only read by either CPU and not changed (outside of loading a new level or something), it should still be cached.
More specifically, the cache on the SH2 is always write-through - when an SH2 writes to ram, it goes in both the cache and the ram; it reads it back from the cache. If the SH2 writing the data doesn't care if it changes, it can leave it cached since writes will always go through to ram. The other SH2 would need to either make the same location uncached to get the written data, or flush that address to force a new read of that location. I've done both in my code depending on circumstances. So with a little judicious flushing, you could leave both SH2s caching the data in ram. To flush a cache line, write to the address | 0x40000000. That flushes the entire line containing the addressed location. Because the cache is write-through, flushing just invalidates the cache entry; no write-back is needed, since the data was already written to memory and cache together.
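
The address arithmetic for that flush can be sketched in C. The names below (`SH2_CACHE_PURGE`, `purge_address`, `lines_spanned`) are hypothetical, and the functions only compute addresses so they can be run anywhere; the actual purge write, shown in the comment, only makes sense on the SH2 itself.

```c
#include <stdint.h>

#define SH2_CACHE_PURGE 0x40000000u   /* OR-ed into the address, as described above */
#define SH2_CACHE_LINE  16u           /* bytes per cache line */

/* Address to write in order to invalidate the line holding `addr`.
   On hardware, roughly: *(volatile uint32_t *)purge_address(addr) = 0; */
uint32_t purge_address(uint32_t addr)
{
    return addr | SH2_CACHE_PURGE;
}

/* Number of cache lines covering the byte range [addr, addr + len), len > 0.
   A shared buffer needs one purge write per line it touches. */
uint32_t lines_spanned(uint32_t addr, uint32_t len)
{
    uint32_t first = addr & ~(SH2_CACHE_LINE - 1u);
    uint32_t last  = (addr + len - 1u) & ~(SH2_CACHE_LINE - 1u);
    return (last - first) / SH2_CACHE_LINE + 1u;
}
```

Note that a 16-byte buffer that does not start on a 16-byte boundary straddles two lines, so aligning shared data keeps the number of purges down.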

TapamN
Interested
Posts: 15
Joined: Mon Apr 25, 2011 1:05 am

Post by TapamN » Tue May 29, 2012 4:56 pm

Oops, I wasn't paying attention to what I had said earlier about aligning memory accesses to long aligned addresses. A slight modification:

Code:

        ! void word_8byte_copy(short *dst, short *src, int count)
        .align  4
        .global _word_8byte_copy
_word_8byte_copy:
        mov.w   @r5+,r0
        dt      r6
        mov.w   @r5+,r1
        !wasted cycle here
        mov.w   @r5+,r2
        !wasted cycle here
        mov.w   @r5+,r3
        nop     !realign writes
        mov.w   r0,@r4
        add     #2,r4
        mov.w   r1,@r4
        add     #2,r4
        mov.w   r2,@r4
        add     #2,r4
        mov.w   r3,@r4
        bf/s    _word_8byte_copy
        add     #2,r4
        rts
        nop 
According to the pipeline graph I made, it looks like leaving that nop out has a net cost of one cycle. In the fixed version, one cycle is lost by having the nop, but two cycles are gained from the instruction alignment. I think. I'm much more familiar with the SH-4 on the Dreamcast, so my cycle counts for the SH-2 might be off a bit (my estimations come from the CPU docs without any empirical timing observations), but this version should still be faster.

ammianus
Very interested
Posts: 124
Joined: Sun Jan 29, 2012 2:10 pm
Location: North America

Post by ammianus » Sun Jun 03, 2012 8:57 pm

Chilly Willy wrote:Page 17 of the 32X hardware manual shows the cached and uncached addresses of each block. In summary:

...

$06000000 - $0603FFFF = cached sdram
$26000000 - $2603FFFF = uncached sdram
...
So in practice how do you find the address of something in the cached sdram block? Is it just by subtracting 0x20000000 from the pointer?

If I print the address of an array that is used in my main, for example, it is at 0x0603F7A8. Does that mean it is cached, or does the doc have it backwards and I need to use 0x2603F7A8? (That doesn't seem to work: if I point directly to the 0x26... address, it draws garbage instead of my tile.)

ammianus
Very interested
Posts: 124
Joined: Sun Jan 29, 2012 2:10 pm
Location: North America

Post by ammianus » Sun Jun 03, 2012 9:23 pm

TapamN wrote: I don't know how well this applies to writing to the 32X framebuffer, but if you want fast copies, the loop needs to be unrolled. This should average one word copied every 4.5 cycles (assuming no cache misses). Each line represents a cycle; wasted cycles are marked with comments.

Code:

        ! void word_8byte_copy(short *dst, short *src, int count)
        .align  4
        .global _word_8byte_copy
_word_8byte_copy:
        mov.w   @r5+,r0
        dt      r6
        mov.w   @r5+,r1
        !wasted cycle here
        mov.w   @r5+,r2
        !wasted cycle here
        mov.w   @r5+,r3
        !wasted cycle here
        mov.w   r0,@r4
        add     #2,r4
        mov.w   r1,@r4
        add     #2,r4
        mov.w   r2,@r4
        add     #2,r4
        mov.w   r3,@r4
        bf/s    _word_8byte_copy
        add     #2,r4
        rts
        nop 
Thanks for the suggestion. What units is your count argument in? bytes? words?

Edit: I assume that count is the number of 8-byte groups to copy.
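
Reading the assembly supports that: `dt r6` runs once per pass and each pass moves four words. A C reference version with the same contract might look like the sketch below. `word_8byte_copy_c` is a hypothetical name, and like the assembly loop it assumes count is at least 1.

```c
#include <stdint.h>

/* C reference for the unrolled copy. count is the number of 8-byte groups
   (four 16-bit words each) and must be >= 1, because the loop body runs
   before the counter is tested. */
void word_8byte_copy_c(int16_t *dst, const int16_t *src, int count)
{
    do {
        /* four loads first, then four stores, mirroring the assembly */
        int16_t w0 = src[0];
        int16_t w1 = src[1];
        int16_t w2 = src[2];
        int16_t w3 = src[3];
        dst[0] = w0;
        dst[1] = w1;
        dst[2] = w2;
        dst[3] = w3;
        src += 4;
        dst += 4;
    } while (--count != 0);
}
```

For one row of a 16-pixel-wide tile in 256 color mode (16 bytes), count would be 2.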
Last edited by ammianus on Mon Jun 04, 2012 12:09 am, edited 1 time in total.

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Sun Jun 03, 2012 11:34 pm

ammianus wrote:
Chilly Willy wrote:Page 17 of the 32X hardware manual shows the cached and uncached addresses of each block. In summary:

...

$06000000 - $0603FFFF = cached sdram
$26000000 - $2603FFFF = uncached sdram
...
So in practice how do you find the address of something in the cached sdram block? Is it just by subtracting 0x20000000 from the pointer?

If I print the address of an array that is used in my main, for example, it is at 0x0603F7A8. Does that mean it is cached, or does the doc have it backwards and I need to use 0x2603F7A8? (That doesn't seem to work: if I point directly to the 0x26... address, it draws garbage instead of my tile.)
When in doubt, check the linker script: the linker script I use for the 32X has these values

Code:

...
  .text 0x02000000 :
  AT ( 0x00000000 )
...
  .data 0x06000000 :
  AT ( LOADADDR (.text) + SIZEOF (.text) )
...
  .bss :
So the rom and sdram addresses used for the compiled code are both cached addresses. So if you wish to use an UNCACHED pointer, you need to OR the pointer with 0x20000000.

If you look at my 32x.h file, you find

Code:

#define MARS_CRAM           (*(volatile unsigned short *)0x20004200)
#define MARS_FRAMEBUFFER    (*(volatile unsigned short *)0x24000000)
#define MARS_OVERWRITE_IMG  (*(volatile unsigned short *)0x24020000)
#define MARS_SDRAM          (*(volatile unsigned short *)0x26000000)

#define MARS_SYS_INTMSK     (*(volatile unsigned short *)0x20004000)
#define MARS_SYS_DMACTR     (*(volatile unsigned short *)0x20004006)
#define MARS_SYS_DMASAR     (*(volatile unsigned long *)0x20004008)
#define MARS_SYS_DMADAR     (*(volatile unsigned long *)0x2000400C)
#define MARS_SYS_DMALEN     (*(volatile unsigned short *)0x20004010)
#define MARS_SYS_DMAFIFO    (*(volatile unsigned short *)0x20004012)
#define MARS_SYS_VRESI_CLR  (*(volatile unsigned short *)0x20004014)
#define MARS_SYS_VINT_CLR   (*(volatile unsigned short *)0x20004016)
#define MARS_SYS_HINT_CLR   (*(volatile unsigned short *)0x20004018)
#define MARS_SYS_CMDI_CLR   (*(volatile unsigned short *)0x2000401A)
#define MARS_SYS_PWMI_CLR   (*(volatile unsigned short *)0x2000401C)
#define MARS_SYS_COMM0      (*(volatile unsigned short *)0x20004020) /* Master SH2 communication */
#define MARS_SYS_COMM2      (*(volatile unsigned short *)0x20004022)
#define MARS_SYS_COMM4      (*(volatile unsigned short *)0x20004024) /* Slave SH2 communication */
#define MARS_SYS_COMM6      (*(volatile unsigned short *)0x20004026)
#define MARS_SYS_COMM8      (*(volatile unsigned short *)0x20004028) /* controller 1 current value */
#define MARS_SYS_COMM10     (*(volatile unsigned short *)0x2000402A) /* controller 2 current value */
#define MARS_SYS_COMM12     (*(volatile unsigned long *)0x2000402C)  /* vcount current value */

#define MARS_PWM_CTRL       (*(volatile unsigned short *)0x20004030)
#define MARS_PWM_CYCLE      (*(volatile unsigned short *)0x20004032)
#define MARS_PWM_LEFT       (*(volatile unsigned short *)0x20004034)
#define MARS_PWM_RIGHT      (*(volatile unsigned short *)0x20004036)
#define MARS_PWM_MONO       (*(volatile unsigned short *)0x20004038)

#define MARS_VDP_DISPMODE   (*(volatile unsigned short *)0x20004100)
#define MARS_VDP_FILLEN     (*(volatile unsigned short *)0x20004104)
#define MARS_VDP_FILADR     (*(volatile unsigned short *)0x20004106)
#define MARS_VDP_FILDAT     (*(volatile unsigned short *)0x20004108)
#define MARS_VDP_FBCTL      (*(volatile unsigned short *)0x2000410A)
Note that all those addresses are uncached. If you wished to make a cached address from them, you would AND the pointer with 0x1FFFFFFF. Not that you should, since hardware should be addressed uncached. The only one you might consider making cached is MARS_SDRAM.
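
The OR and AND conversions can be wrapped in small helpers. The sketch below works on plain 32-bit address values so it can be run on any machine; the names (`SH2_UNCACHED_BIT`, `to_uncached`, `to_cached`, `is_uncached`) are hypothetical and not from 32x.h.

```c
#include <stdint.h>

#define SH2_UNCACHED_BIT 0x20000000u

/* Cached address -> uncached alias of the same location. */
uint32_t to_uncached(uint32_t addr)
{
    return addr | SH2_UNCACHED_BIT;
}

/* Uncached address -> cached alias, using the mask given above. */
uint32_t to_cached(uint32_t addr)
{
    return addr & 0x1FFFFFFFu;
}

/* Nonzero if the address goes around the cache. */
int is_uncached(uint32_t addr)
{
    return (addr & SH2_UNCACHED_BIT) != 0;
}
```

On the 32X itself the same thing would be applied to a pointer, for example `(unsigned short *)((unsigned long)ptr | 0x20000000)`, since pointers are 32 bits wide there.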

ammianus
Very interested
Posts: 124
Joined: Sun Jan 29, 2012 2:10 pm
Location: North America

Post by ammianus » Mon Jun 04, 2012 12:59 am

Thanks again for the clarifications, I am starting to see how it all fits together.

So I am using cached access to read all the image data, and I am now using the further optimized word_8byte_copy function (thanks TapamN) to both read the cells from SDRAM and write to the FB. Basically the graphics are being drawn the same as they were before (I've still got some issues with the "mirroring function", which I'll have to solve). It does seem almost as fast as my original performance before adding in the tiled "floor" and writing 1 byte at a time.

Given my experience with this so far, I am afraid to have too much logic in my draw function, so I wonder if it is better to store all of my transformed sprites in memory rather than compute the transformations as I draw e.g. for the flipping horizontally when a character turns around, but that seems like a waste of SDRAM and/or ROM memory to store both ways.
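
For a sense of the trade-off: a 16x16 tile at one byte per pixel is 256 bytes, so keeping a pre-flipped copy costs another 256 bytes per frame of animation. The alternative, flipping while drawing, can be sketched in C as below. `copy_row_flipped` is a hypothetical helper; it writes the words of a row in reverse order and swaps the two pixels inside each word, so the extra work per word is one byte swap.

```c
#include <stdint.h>

/* Copy one row of `words` 16-bit words mirrored left to right:
   word order is reversed and the two bytes of each word are exchanged. */
void copy_row_flipped(volatile uint16_t *dst, const uint16_t *src, int words)
{
    int i;
    for (i = 0; i < words; i++) {
        uint16_t w = src[words - 1 - i];            /* read from the far end */
        dst[i] = (uint16_t)((w << 8) | (w >> 8));   /* swap the two pixels */
    }
}
```

Whether this is fast enough in practice would need measuring on hardware; the sketch only shows that the mirrored draw can stay word-based.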
