Super VDP
Moderator: BigEvilCorporation
Hey Chilly.
I've thought about it but haven't gotten very far, because:
- I don't know how much of the DMA is implemented in Gens (quite hard to debug without it, right?)
- copying sprites is very memory intensive, but I'd need a special DMAC for it:
I copy 8 bytes from a tile to 8 bytes in the Frame Buffer. I then copy 8 more bytes from the tile to a place located 320 bytes (at least) further in the Frame Buffer. I thus need a special increment, and the SH2 DMAC only allows +1/+2/+4 increments. So, with DMA, each transfer would be faster, but between transfers I'd have to set the destination value, i.e. write the DMAC registers, and that would take quite long. Moreover, I'm not sure whether the DMAC handles the bus any better than a normal copy.
And that's without even mentioning changing tiles!
I've thought about SH2 DMA, but I'm afraid that regularly changing registers will sink my fillrate.
By the way, I only achieve nearly 2 layers with the 2 CPUs. "2 layers" is not exact; more precisely, I can achieve 2200 tiles/frame @ 30 fps. Using only one CPU, I'd reach just a little more than 1100 tiles/frame @ 30 fps.
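To make the access pattern in this post concrete, here is a minimal C model of the strided copy and of the per-tile cycle budget. It is only a sketch: it assumes an 8-bit-per-pixel frame buffer with a 320-byte pitch, 8x8 tiles stored as 64 contiguous bytes, and an SH-2 clock of about 23.01 MHz (the clock figure is not stated in the thread). The names `copy_tile` and `cycles_per_tile` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TILE_W = 8, TILE_H = 8, FB_PITCH = 320 }; /* assumed layout */

/* Copy one 8x8 tile (64 contiguous bytes) to pixel (x, y) of the frame buffer. */
void copy_tile(uint8_t *fb, const uint8_t *tile, unsigned x, unsigned y)
{
    uint8_t *dst;
    int row;

    assert(x + TILE_W <= FB_PITCH);
    dst = fb + (size_t)y * FB_PITCH + x;
    for (row = 0; row < TILE_H; ++row) {
        memcpy(dst, tile, TILE_W); /* 8 bytes: one tile line */
        tile += TILE_W;            /* source advances linearly */
        dst += FB_PITCH;           /* destination jumps one screen line */
    }
}

/* Integer cycle budget per tile for one CPU. */
unsigned cycles_per_tile(unsigned clock_hz, unsigned fps, unsigned tiles_per_frame)
{
    assert(fps > 0 && tiles_per_frame > 0);
    return clock_hz / (fps * tiles_per_frame);
}
```

The destination stride (320) differs from the source stride (8), which is exactly what a fixed +1/+2/+4 increment cannot express in a single transfer. Under the assumed clock, 1100 tiles per frame at 30 fps on one CPU leaves roughly 697 cycles per tile, the same order as the 632 cycles estimated in the comments of the copy loop posted in this thread.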
ob1 wrote:
Code: Select all
...
; Copy one tile
MOV #7,R6 ; R6 = Counter : 8 lines/tile
.align 4
REPEAT_TILE:
MOV.L @R5,R0 ; ---
ADD #4,R5 ; |
MOV.L R0,@R1 ; |
ADD #4,R1 ; | 89 cycles when cache miss, 23 when cache hit
MOV.L @R5,R0 ; | For each tile 2-lines, 1 miss then 3 hits
ADD #4,R5 ; | So 4 * (89 + 3*23) = 632 cycles
MOV.L R0,@R1 ; ---
MOV #$9E,R7 ; 0x9E = (320 - 4) >> 1
SHLL R7
ADD R7,R1
BT/S REPEAT_TILE ; 2 cycles
SUB #1,R6
...
This code executes 12 instructions for every 8 bytes copied.
You should be able to reduce it to about 6 instructions.
The problem is you are thinking in C, and converting from C to assembly. Don't do that. Think directly in assembly instead.
Toshi
Hi Toshi, and thank you for joining ;)
6 instructions ?!?
I have thought about the following, going down through the FrameBuffer with the value 320 (offset to the next line) in R7:
Code: Select all
MOV.L @R5+,R0
MOV.L R0,@R1-
MOV.L @R5+,R0
MOV.L R0,@R1-
BT/S REPEAT_TILE ; 2 cycles
SUB R7,R1
But then I'd have another problem: if MOV.L R0,@R1- is not located on a 2-word boundary, its MA stage will contend with BT/S REPEAT_TILE and its IF stage.
What's better? More instructions or more contention?
Hi Tasco.
Thanks for your interest.
TascoDLX wrote: I'm not sure you understand how BT/S works
Sure, I can be wrong. I think that BT/S is like BT, except that the instruction after BT/S is executed before. Is that correct?
TascoDLX wrote: you should look to use DT before the branch instruction
I had forgotten it. I've just checked the SUB instruction and it doesn't affect T! Sure, I will use DT instead ;)
TascoDLX wrote: And MOV #imm, Rn will sign-extend your bytes!
I had also forgotten it! But, reading my code once more, I don't see any problem with it. Thanks for the reminder anyway.
TascoDLX wrote: You're wasting your time with all that shifting.
Well, I have 2 ways: either I load a value (0x800, 560, ...) from a constant declared further on, or I build my value with some ARMish instructions. What I lose while shifting, I gain by not accessing memory. Is it really slower?
I need to reread the DMA section on the SH2... I'm too used to memory-based descriptor chains like in the Mac. You'd convert the name table into a linked list of DMA ops. Now that you mention it, the SH2 DMA is more restricted.
TascoDLX wrote: I'm not sure you understand how BT/S works
Maybe I should replace
Code: Select all
...
ADD R7,R1
BT/S REPEAT_TILE ; 2 cycles
SUB #1,R6
...
with
Code: Select all
...
DT R6
BT/S REPEAT_TILE ; 2 cycles
ADD R7,R1 ; Increment FrameBuffer pointer before branching, ie not dead-code
...
There is no MOV.L R0,@R1- instruction. The SH2 does not support post-decrement on writes, only pre-decrement. So MOV.L R0,@-R1 is valid, but MOV.L R0,@R1- is not.
Toshi
ob1 wrote: I think that BT/S is like BT, except that the instruction after BT/S is executed before. Is it correct ?
Not correct.
With a pipelined CPU, the CPU normally fetches a few instructions past the current instruction. When a branch occurs the CPU flushes its instruction pipeline because the instructions after the branch are not executed. This is wasteful, because all those instructions have already been fetched and decoded, etc.
With a slotted branch instruction, the CPU executes one instruction past the branch instruction, so the instruction is executed while the branch instruction is fetching new instructions from the new address. So the slotted branch instruction cannot be dependent on the result of its slot instruction.
Toshi
Toshi is correct. In brief, BT/S (or BF/S) is executed before the slot instruction, but the branch (if taken) does not occur until after the slot instruction is executed.
ob1 wrote:
Code: Select all
...
DT R6
BT/S REPEAT_TILE ; 2 cycles
ADD R7,R1 ; Increment FrameBuffer pointer before branching, ie not dead-code
...
Almost. The T bit is set when the result is zero, so you want to use BF/S. And make sure R6 starts out at 8 instead of 7, or else your loop will come up short.
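The point about DT, the T bit and the starting count can be checked with a small C model. This is a sketch only: it models just R6 and T, and the names `sh2_state`, `dt_r6` and `count_passes` are hypothetical. It relies on two documented behaviours: DT sets T when the decremented register reaches zero, and the delayed branch decides on T before its slot instruction runs.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the two pieces of SH-2 state this loop depends on. */
typedef struct {
    uint32_t r6; /* loop counter */
    int t;       /* T bit */
} sh2_state;

/* DT Rn: decrement Rn, then set T to 1 if the result is zero, else 0. */
void dt_r6(sh2_state *s)
{
    s->r6 -= 1u;
    s->t = (s->r6 == 0u);
}

/*
 * Count passes through:   loop: <copy one tile line>
 *                               DT   R6
 *                               Bx/S loop
 *                               ADD  R7,R1   ; delay slot, always executed
 * branch_if_t_set = 0 models BF/S, 1 models BT/S.
 * max_passes only guards the model against a runaway loop.
 */
unsigned count_passes(uint32_t start, int branch_if_t_set, unsigned max_passes)
{
    sh2_state s = { start, 0 };
    unsigned passes = 0;
    int taken;

    assert(max_passes > 0);
    do {
        ++passes;
        dt_r6(&s);
        /* The branch decision uses T as left by DT; the slot runs afterwards. */
        taken = branch_if_t_set ? s.t : !s.t;
    } while (taken && passes < max_passes);
    return passes;
}
```

In this model, `count_passes(8, 0, 100)` gives 8 passes (BF/S, counter starting at 8), `count_passes(7, 0, 100)` gives 7, and `count_passes(8, 1, 100)` gives 1, because with BT/S the branch is not taken until the counter has already reached zero.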
ob1 wrote: I had also forgotten it ! But, reading my code once more, I don't see any problem with it. Thanks for the reminder anyway.
Code: Select all
MOV #$8C,R4 ; 0x8C = 560 >> 2
SHLL2 R4
SUB #1,R4
MOV #$8C,R4 (encoded as $E48C) stores $FFFFFF8C in R4, per the sign-extension. I don't think that's what you're aiming for.
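The sign-extension effect is easy to verify with a few lines of C. This is a sketch that models only the register value after MOV #imm,Rn and a left shift; the function names `mov_imm8` and `mov_then_shift` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* MOV #imm,Rn: the 8-bit immediate is sign-extended to 32 bits. */
uint32_t mov_imm8(uint8_t imm)
{
    return (imm & 0x80u) ? (0xFFFFFF00u | imm) : imm;
}

/* MOV #imm,Rn followed by `shift` one-bit left shifts (SHLL, SHLL2, ...). */
uint32_t mov_then_shift(uint8_t imm, unsigned shift)
{
    assert(shift < 32u);
    return mov_imm8(imm) << shift;
}
```

Under this model, `mov_then_shift(0x8C, 2)` yields $FFFFFE30 rather than 560, and the same applies to MOV #$9E,R7 followed by SHLL in the copy loop posted in this thread: `mov_then_shift(0x9E, 1)` yields $FFFFFF3C rather than 316. Immediates up to 127 are unaffected, so `mov_then_shift(70, 3)` gives 560 and `mov_then_shift(79, 2)` gives 316; loading the constant PC-relative is the other usual option. Note also that the SH-2 has no SUB #imm,Rn form, so the final adjustment would be written ADD #-1,R4.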
ob1 wrote: Well, I have 2 ways: either I load a value (0x800, 560, ...) from a constant declared further on, or I build my value with some ARMish instructions. What I lose while shifting, I gain by not accessing memory. Is it really slower?
Keep in mind, under normal conditions, all of your code is fetched from the cache. Assuming everything stays cached, fetching two extra instructions accesses the same amount of memory as loading a single longword. So, you might not be gaining as much as you think.
TascoDLX wrote: The T bit is set when the result is zero, so you want to use BF/S. And make sure R6 starts out at 8 instead of 7, or else your loop will come up short.
Oops, I missed it! You're right.
TascoDLX wrote: MOV #$8C,R4 (encoded as $E48C) stores $FFFFFF8C in R4, per the sign-extension. I don't think that's what you're aiming for.
You're obviously right.
TascoDLX wrote: Keep in mind, under normal conditions, all of your code is fetched from the cache. Assuming everything stays cached, fetching two extra instructions accesses the same amount of memory as loading a single longword. So, you might not be gaining as much as you think.
I got it.
edit: I got it, but if I load data from cache, I also have to fetch one instruction. So, I have 2 IF vs 1 IF + 1 MA. So, what's better?
Yet, in this particular case, after loading this data from the cache, a lot of things would be read (64 distinct tiles fill the whole cache), and it's going to be a long time until the next access to this particular constant. Anyway, I got it.
Any clue about my quote?
ob1 wrote: What's better ? More instructions or more contention ?
Anyway, thank you all very much for your interest.
ob1 wrote: I got the thing, but if I load data from cache, I also have to fetch one instruction. So, I have 2 IF vs 1 IF + 1 MA. So, what's better ?
Ideally, the SH-2 can fetch one aligned longword per clock cycle *if* the data is on-chip (in the cache, in this case). So, if the code is properly arranged, you can't say either one is better. This is because, in the most ideal case, an instruction fetch (from memory) occurs every other clock cycle, so any odd clock cycle would be available for a memory access. Therefore, the MA would not delay (or, would not extend the slot time, as it were).
However, because of the pipeline, there is no effective way to measure how long it takes two instructions to execute without knowing the instructions that come before and after. At least, there is no purpose to measuring it.
In any case, this is not a good way to code. You shouldn't be scrapping for a couple clock cycles. First, you should write clear and concise code that works -- efficiently, that is -- then you can mangle the code all you want for speed.
ob1 wrote: Yet, in this particular case, after loading this data from the cache, a lot of things would be read (64 distinct tiles fill the whole cache), and it's gonna be a long time until the next access to this particular constant. Anyway, I got it.
You should really consider some type of compression, or else you've got a cache nightmare on your hands. On second thought, you've got a nightmare no matter what. This whole tile system was doomed from the start! But it's good practice to think these things out.
ob1 wrote: What's better ? More instructions or more contention ?
Well, the more you can do without accessing the external bus, the better. That should go without saying. As far as on-chip access goes... let's just say, it's not about how much you have, but how you use it. Actually, let's not say that.