Super VDP

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Thu May 29, 2008 6:51 am

Hey Chilly.
I've thought about it but haven't gotten very far, because:
- I don't know how much of the DMA is implemented in Gens (quite hard to debug without it, right?)
- copying sprites is very memory-intensive, but I'd need a special DMAC for it:
I copy 8 bytes from a tile to 8 bytes in the frame buffer. I then copy the next 8 bytes of the tile to a place located 320 bytes (at least) further on in the frame buffer. I thus need a special increment, and the SH2 DMAC only allows +1/+2/+4 increments. So, with DMA, each transfer would be faster, but between transfers I'd have to set the destination address, i.e. write the DMAC registers, and that would take quite long. Moreover, I'm not sure whether the DMAC handles the bus any better than a normal copy.
And that's not even counting changing tiles!
I've thought about SH2 DMA, but I'm afraid that regularly rewriting the registers will sink my fill rate.
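For readers following along, here is a minimal C sketch of the copy pattern described above: the source tile is 64 contiguous bytes, while each destination row sits one scanline further on. The 8-bit-per-pixel tile format, the names `blit_tile` and `FB_PITCH`, and the tile-grid coordinates are assumptions made for illustration, not the actual routine.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: 8x8 tiles, 1 byte per pixel, 320 bytes per scanline. */
enum { TILE_W = 8, TILE_H = 8, FB_PITCH = 320 };

/* Copy one tile (64 contiguous bytes) into a frame buffer whose rows are
 * FB_PITCH bytes apart. tile_x / tile_y are tile-grid coordinates. */
static void blit_tile(uint8_t *fb, const uint8_t *tile, int tile_x, int tile_y)
{
    uint8_t *dst = fb + (size_t)tile_y * TILE_H * FB_PITCH
                      + (size_t)tile_x * TILE_W;

    for (int row = 0; row < TILE_H; ++row) {
        memcpy(dst, tile, TILE_W);  /* 8 contiguous bytes, i.e. two longwords */
        tile += TILE_W;             /* source advances linearly */
        dst  += FB_PITCH;           /* destination skips a whole scanline */
    }
}
```

The uneven destination step (8 bytes written, then a jump of 320) is what a fixed +1/+2/+4 increment cannot express in a single transfer.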

By the way, I only achieve nearly 2 layers with the 2 CPUs. "2 layers" is not exact; more precisely, I can achieve 2200 tiles/frame @ 30 fps. Using only one CPU, I'd reach just a little more than 1100 tiles/frame @ 30 fps.
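As a rough sanity check on these figures, here is a small C sketch. The 224-line display height, the 8x8 tile size and the roughly 23 MHz SH2 clock are assumptions, not numbers from the post, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumed display and tile geometry: 320x224 pixels, 8x8 tiles. */
enum { SCREEN_W = 320, SCREEN_H = 224, TILE = 8 };

/* Number of tiles needed to cover one full-screen layer. */
static uint32_t tiles_per_layer(void)
{
    return (uint32_t)((SCREEN_W / TILE) * (SCREEN_H / TILE));
}

/* Average cycle budget per tile for one CPU (integer division, rounds down). */
static uint32_t cycles_per_tile(uint32_t cpu_hz, uint32_t fps,
                                uint32_t tiles_per_frame)
{
    return cpu_hz / fps / tiles_per_frame;
}
```

Under those assumptions a full layer is 1120 tiles, so about 1100 tiles per frame is close to one layer per CPU, and `cycles_per_tile(23000000, 30, 1100)` leaves a budget of roughly 700 cycles per tile.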

TMorita
Interested
Posts: 17
Joined: Thu May 29, 2008 8:07 am

Post by TMorita » Thu May 29, 2008 8:11 am

ob1 wrote:

Code: Select all

...
; Copy one tile
	MOV	#7,R6		; R6 = Counter : 8 lines/tile
.align	4
REPEAT_TILE:
	MOV.L	@R5,R0		; ---
	ADD	#4,R5		;  |
	MOV.L	R0,@R1		;  |
	ADD	#4,R1		;  | 89 cycles when cache miss, 23 when cache hit
	MOV.L	@R5,R0		;  | For each tile 2-lines, 1 miss then 3 hits
	ADD	#4,R5		;  | So 4 * (89 + 3*23) = 632 cycles
	MOV.L	R0,@R1		; ---

	MOV	#$9E,R7		; 0x9E = (320 - 4) >> 1
	SHLL	R7
	ADD	R7,R1

	BT/S	REPEAT_TILE	; 2 cycles
	SUB	#1,R6
...
This code executes 12 instructions for every 8 bytes copied.
You should be able to reduce it to about 6 instructions.

The problem is you are thinking in C, and converting from C to assembly. Don't do that. Think directly in assembly instead.

Toshi

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Thu May 29, 2008 8:23 am

Hi Toshi, and thank you for joining ;)

6 instructions ?!?
I have thought about

Code: Select all

   MOV.L   @R5+,R0
   MOV.L   R0,@R1-
   MOV.L   @R5+,R0
   MOV.L   R0,@R1-
   BT/S   REPEAT_TILE   ; 2 cycles
   SUB   R7,R1
, going downward through the frame buffer, with the value 320 (the offset to the next line) in R7. But then I'd have another problem:

Code: Select all

   MOV.L   R0,@R1-
is not located on a two-word boundary, so its MA stage will contend with

Code: Select all

   BT/S   REPEAT_TILE   ; 2 cycles
and its IF stage.
What's better ? More instructions or more contention ?

TascoDLX
Very interested
Posts: 262
Joined: Tue Feb 06, 2007 8:18 pm

Post by TascoDLX » Thu May 29, 2008 10:34 am

ob1,

I'm not sure you understand how BT/S works. For the kind of loops you're coding, you should look to use DT before the branch instruction. And MOV #imm, Rn will sign-extend your bytes! You're wasting your time with all that shifting.

Clean it up. :wink:

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Thu May 29, 2008 11:13 am

Hi Tasco.
Thanks for your interest.
TascoDLX wrote:I'm not sure you understand how BT/S works
Sure, I could be wrong. I think that BT/S is like BT, except that the instruction after BT/S is executed first. Is that correct?
TascoDLX wrote:you should look to use DT before the branch instruction
I had forgotten about it. I've just checked the SUB instruction and it doesn't affect T! I'll definitely use DT instead ;)
TascoDLX wrote:And MOV #imm, Rn will sign-extend your bytes!
I had also forgotten it ! But, reading my code once more, I don't see any problem with it. Thanks for the reminder anyway.
TascoDLX wrote:You're wasting your time with all that shifting.
Well, I have 2 options: either I load a value (0x800, 560, ...) from a constant declared further down, or I build the value with some ARM-ish instructions. What I lose by shifting, I gain by not accessing memory. Is it really slower?

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Thu May 29, 2008 11:17 am

ob1 wrote:Hey Chilly.
I've thought about it but haven't gotten very far, because:
- I don't know how much of the DMA is implemented in Gens (quite hard to debug without it, right?)
- copying sprites is very memory-intensive, but I'd need a special DMAC for it:
I copy 8 bytes from a tile to 8 bytes in the frame buffer. I then copy the next 8 bytes of the tile to a place located 320 bytes (at least) further on in the frame buffer. I thus need a special increment, and the SH2 DMAC only allows +1/+2/+4 increments. So, with DMA, each transfer would be faster, but between transfers I'd have to set the destination address, i.e. write the DMAC registers, and that would take quite long. Moreover, I'm not sure whether the DMAC handles the bus any better than a normal copy.
And that's not even counting changing tiles!
I've thought about SH2 DMA, but I'm afraid that regularly rewriting the registers will sink my fill rate.

By the way, I only achieve nearly 2 layers with the 2 CPUs. "2 layers" is not exact; more precisely, I can achieve 2200 tiles/frame @ 30 fps. Using only one CPU, I'd reach just a little more than 1100 tiles/frame @ 30 fps.
I need to reread the DMA section on the SH2... I'm too used to memory based descriptor chains like in the Mac. You'd convert the name table into a linked list of DMA ops. Now that you mention it, the SH2 DMA is more restricted.

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Thu May 29, 2008 11:37 am

Chilly Willy wrote:You'd convert the name table into a linked list of DMA ops.
The SEGA Saturn SCU's DMAC can work like this. You operate in "program" mode: you specify a batch of DMA transfers to perform, put the list somewhere in SCU RAM, then trigger the DMA, which runs the transfers on its own.

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Thu May 29, 2008 1:06 pm

TascoDLX wrote:I'm not sure you understand how BT/S works
Maybe I should replace

Code: Select all

   ...
   ADD   R7,R1

   BT/S   REPEAT_TILE   ; 2 cycles
   SUB   #1,R6 
   ...
with

Code: Select all

   ...

   DT   R6
   BT/S   REPEAT_TILE   ; 2 cycles
   ADD   R7,R1   ; Increment FrameBuffer pointer before branching, ie not dead-code
   ...

TMorita
Interested
Posts: 17
Joined: Thu May 29, 2008 8:07 am

Post by TMorita » Thu May 29, 2008 5:02 pm

ob1 wrote:Hi Toshi, and thank you for joining ;)

6 instructions ?!?
I have thought about

Code: Select all

   MOV.L   @R5+,R0
   MOV.L   R0,@R1-
   MOV.L   @R5+,R0
   MOV.L   R0,@R1-
   BT/S   REPEAT_TILE   ; 2 cycles
   SUB   R7,R1
, going downward through the frame buffer, with the value 320 (the offset to the next line) in R7. But then I'd have another problem:

Code: Select all

   MOV.L   R0,@R1-
is not located on a two-word boundary, so its MA stage will contend with

Code: Select all

   BT/S   REPEAT_TILE   ; 2 cycles
and its IF stage.
What's better ? More instructions or more contention ?
There is no MOV.L R0,@R1- instruction. The SH2 does not support post-decrement on writes; only pre-decrements. So MOV.L R0,@-R1 is valid, but MOV.L R0,@R1- is not.

Toshi
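To make the pre-decrement point concrete, here is a minimal C model of `MOV.L Rm,@-Rn`. It is only a sketch: `store_predec` and `copy_row_predec` are hypothetical helper names, and the pointer arithmetic stands in for the register update.

```c
#include <stdint.h>

/* Model of MOV.L Rm,@-Rn: Rn is decremented by one longword (4 bytes)
 * first, then Rm is stored at the new address. Returns the new Rn. */
static uint32_t *store_predec(uint32_t *rn, uint32_t rm)
{
    --rn;       /* pre-decrement */
    *rn = rm;   /* store at the decremented address */
    return rn;
}

/* Copy one 8-byte tile row (two longwords) using pre-decrement stores.
 * dst_end must point one past the last longword of the destination row,
 * and the source is read back to front so the row is not reversed. */
static void copy_row_predec(uint32_t *dst_end, const uint32_t *src_row)
{
    dst_end = store_predec(dst_end, src_row[1]);
    (void)store_predec(dst_end, src_row[0]);
}
```

The consequence for a downward-running copy is that the destination pointer has to start just past the end of the row, and the source has to be consumed in the opposite order.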

TMorita
Interested
Posts: 17
Joined: Thu May 29, 2008 8:07 am

Post by TMorita » Thu May 29, 2008 5:07 pm

ob1 wrote:Hi Tasco.
Thanks for your interest.
TascoDLX wrote:I'm not sure you understand how BT/S works
Sure, I could be wrong. I think that BT/S is like BT, except that the instruction after BT/S is executed first. Is that correct?
...
Not correct.

With a pipelined CPU, the CPU normally fetches a few instructions past the current instruction. When a branch occurs the CPU flushes its instruction pipeline because the instructions after the branch are not executed. This is wasteful, because all those instructions have already been fetched and decoded, etc.

With a slotted branch instruction, the CPU executes one instruction past the branch instruction, so the instruction is executed while the branch instruction is fetching new instructions from the new address. So the slotted branch instruction cannot be dependent on the result of its slot instruction.

Toshi

TascoDLX
Very interested
Posts: 262
Joined: Tue Feb 06, 2007 8:18 pm

Post by TascoDLX » Fri May 30, 2008 11:04 am

Toshi is correct. In brief, BT/S (or BF/S) is executed before the slot instruction, but the branch (if taken) does not occur until after the slot instruction is executed.
ob1 wrote:

Code: Select all

   ...

   DT   R6
   BT/S   REPEAT_TILE   ; 2 cycles
   ADD   R7,R1   ; Increment FrameBuffer pointer before branching, ie not dead-code
   ...
Almost. The T bit is set when the result is zero, so you want to use BF/S. And make sure R6 starts out at 8 instead of 7, or else your loop will come up short.
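The loop-count argument can be checked with a small C model. This is a sketch only: `count_passes` is a hypothetical name, and it models just the DT and conditional-branch semantics described here (DT decrements the register and sets T when the result is zero), not pipeline timing.

```c
#include <stdint.h>

/* Model of a loop shaped like:
 *   LOOP: ...body...
 *         DT   R6       ; R6 -= 1, T = (R6 == 0)
 *         BF/S LOOP     ; branch while T == 0   (BT/S: branch while T == 1)
 *         ADD  R7,R1    ; delay slot, executed on every pass
 * Returns how many times the body (and the delay slot) runs. */
static int count_passes(uint32_t r6, int branch_if_t_set)
{
    int passes = 0;
    int t;

    do {
        ++passes;        /* loop body plus delay slot */
        r6 -= 1;         /* DT: decrement ... */
        t = (r6 == 0);   /* ... and set T if the result is zero */
    } while (branch_if_t_set ? t : !t);

    return passes;
}
```

With BF/S, `count_passes(8, 0)` gives the 8 passes a tile needs and `count_passes(7, 0)` comes up one short; with BT/S, `count_passes(8, 1)` falls out of the loop after a single pass.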
ob1 wrote:I had also forgotten it ! But, reading my code once more, I don't see any problem with it. Thanks for the reminder anyway.

Code: Select all

   MOV   #$8C,R4      ; 0x8C = 560 >> 2
   SHLL2   R4
   SUB   #1,R4 
MOV #$8C,R4 (encoded as $E48C) stores $FFFFFF8C in R4, per the sign-extension. I don't think that's what you're aiming for.
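A short C model of the sign extension shows where the value goes wrong. It is a sketch under the semantics stated above; `mov_imm` and `quoted_sequence` are hypothetical names, and the subtraction is modelled exactly as written in the quoted code.

```c
#include <stdint.h>

/* MOV #imm,Rn: the 8-bit immediate is sign-extended to 32 bits. */
static uint32_t mov_imm(uint8_t imm)
{
    return (imm & 0x80u) ? (0xFFFFFF00u | imm) : (uint32_t)imm;
}

/* The quoted sequence: MOV #$8C,R4 / SHLL2 R4 / subtract 1. */
static uint32_t quoted_sequence(void)
{
    uint32_t r4 = mov_imm(0x8C);  /* 0xFFFFFF8C, not 0x0000008C */
    r4 <<= 2;                     /* SHLL2: 0xFFFFFE30 */
    r4 -= 1;                      /* 0xFFFFFE2F instead of 559 (0x22F) */
    return r4;
}
```

Immediates up to $7F are unaffected, so one way to build 560 with shifts is to start from 70 and shift left by 3 (70 * 8 = 560).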
ob1 wrote:Well, I have 2 options: either I load a value (0x800, 560, ...) from a constant declared further down, or I build the value with some ARM-ish instructions. What I lose by shifting, I gain by not accessing memory. Is it really slower?
Keep in mind, under normal conditions, all of your code is fetched from the cache. Assuming everything stays cached, fetching two extra instructions accesses the same amount of memory as loading a single longword. So, you might not be gaining as much as you think.

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Fri May 30, 2008 11:40 am

TascoDLX wrote:The T bit is set when the result is zero, so you want to use BF/S. And make sure R6 starts out at 8 instead of 7, or else your loop will come up short.
Oops, I missed it! You're right.
TascoDLX wrote:MOV #$8C,R4 (encoded as $E48C) stores $FFFFFF8C in R4, per the sign-extension. I don't think that's what you're aiming for.
You're obviously right.
TascoDLX wrote:Keep in mind, under normal conditions, all of your code is fetched from the cache. Assuming everything stays cached, fetching two extra instructions accesses the same amount of memory as loading a single longword. So, you might not be gaining as much as you think.
I got the thing

edit : I got the thing, but if I load data from cache, I also have to fetch one instruction. So, I have 2 IF vs 1 IF + 1 MA. So, what's better ?

Yet, in this particular case, after loading this data from the cache, a lot of things would be read (64 distinct tiles fill the whole cache), and it's gonna be a long time until the next access to this particular constant. Anyway, I got it.

Any thoughts on my earlier question?
ob1 wrote:What's better ? More instructions or more contention ?
Anyway, thank you all very much for your interest.

TascoDLX
Very interested
Posts: 262
Joined: Tue Feb 06, 2007 8:18 pm

Post by TascoDLX » Sat May 31, 2008 11:56 am

ob1 wrote:I got the thing, but if I load data from cache, I also have to fetch one instruction. So, I have 2 IF vs 1 IF + 1 MA. So, what's better ?
Ideally, the SH-2 can fetch one aligned longword per clock cycle *if* the data is on-chip (in the cache, in this case). So, if the code is properly arranged, you can't say either one is better. This is because, in the most ideal case, an instruction fetch (from memory) occurs every other clock cycle, so any odd clock cycle would be available for a memory access. Therefore, the MA would not delay (or, would not extend the slot time, as it were).

However, because of the pipeline, there is no effective way to measure how long it takes two instructions to execute without knowing the instructions that come before and after. At least, there is no purpose to measuring it.

In any case, this is not a good way to code. You shouldn't be scrapping for a couple clock cycles. First, you should write clear and concise code that works -- efficiently, that is -- then you can mangle the code all you want for speed. :twisted:
ob1 wrote:Yet, in this particular case, after loading this data from the cache, a lot of things would be read (64 distinct tiles fill the whole cache), and it's gonna be a long time until the next access to this particular constant. Anyway, I got it.
You should really consider some type of compression, or else you've got a cache nightmare on your hands. On second thought, you've got a nightmare no matter what. This whole tile system was doomed from the start! :lol: But it's good practice to think these things out.
ob1 wrote:What's better ? More instructions or more contention ?
Well, the more you can do without accessing the external bus, the better. That should go without saying. As far as on-chip access goes... let's just say, it's not about how much you have, but how you use it. Actually, let's not say that.

TotOOntHeMooN
Interested
Posts: 38
Joined: Sun Jun 01, 2008 1:12 pm
Location: Lyon, France

Post by TotOOntHeMooN » Sun Jun 01, 2008 9:29 pm

Amazing project ! :shock: (my dream)
I hope that it will be possible to achieve this two-tile-plane "Super 32X".
Do you know if it will be possible to use the PWM audio after that ?

A 2D Ikaruga remake would be fabulous and would really match the MD/32X capabilities ! 8)

ob1
Very interested
Posts: 463
Joined: Wed Dec 06, 2006 9:01 am
Location: Aix-en-Provence, France

Post by ob1 » Mon Jun 02, 2008 7:37 am

TotOOntHeMooN wrote:Do you know if it will be possible to use the PWM audio after that ?
I really don't know. I want to fully assign both SH2s to display. But the PWM remains available to the 68k.
