Feed your pipeline

ob1 · Post by **ob1** » Wed Mar 21, 2007 11:16 am

I remembered an old post, from Stef I think, about the Genesis vsync. Instead of incrementing a vtimer (thank you Paul Lee), why not read the VDP status?
I did so for the 32X, just looking at bit 15 of FB_CTRL_REG (2000 410Ah).
Here's the first routine I used to do so:

Code: Select all

waitForVBLK:
*	mov	#0,R11
	mov.l	REG_VDP,R1
	mov.w	VALUE_8000h,R2
whileDISP:
*	add	#1,R11
	mov.w	@($A,R1),R0	; while ((FB_CTRL_REG & 8000h) == 0) ;
	and	R2,R0
	cmp/eq	#0,R0
	bt	whileDISP
*	mov	R11,R10
	rts
	nop			; Executes NOP before branching

(R10 and R11 are for benchmarking purposes)
In an empty main, this routine loops 23k times per frame (nearly 1 Mops, which seems quite slow to me, btw). I then remembered the delayed branch, used to keep the pipeline fed (see the SH2 programming manual, chapter 7). And here it is:

Code: Select all

waitForVBLK:
	mov.l	REG_VDP,R1
	mov.w	VALUE_8000h,R2
	mov.w	@($A,R1),R0	; while ((FB_CTRL_REG & 8000h) == 0) ;
whileDISP:
	and	R2,R0
	cmp/eq	#0,R0
	bt/s	whileDISP
	mov.w	@($A,R1),R0	; Executes MOV before branching - Feed pipeline
	rts
	nop			; Executes NOP before branching

And this time, it loops 25k times per frame, or 9% more.

By the way, the compiler you can find on segaxtreme.net does not seem to use delayed branches.

Stef · Post by **Stef** » Wed Mar 21, 2007 11:51 am

25k isn't that bad if you count it only for the display period...
If you add the blanking period you can reach 25k * 262/224 ~ 29k per frame.
29k per frame at 60 FPS = 29k * 60 = 1740k = 1,740,000 loops per second.
23 MHz / 1,740,000 ~ 13 cycles per loop, which seems OK.
The BT/S flushes the pipeline, and the MOV @($A,R1),R0 eats several cycles because of the indirect access mode...
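
The estimate above can be redone without rounding. This is only a sanity-check sketch in Python: the 23 MHz clock, 60 FPS, and the 262/224 line counts are the figures quoted in this post, and the function name `cycles_per_loop` is made up for the illustration.

```python
# Back-of-the-envelope check of the cycles-per-loop estimate.
# Assumptions taken from the post, not measured here:
# 23 MHz SH2 clock, 60 frames/s, 262 lines per frame, 224 of them displayed.
CPU_HZ = 23_000_000
FPS = 60
TOTAL_LINES = 262
VISIBLE_LINES = 224


def cycles_per_loop(loops_in_display_period: float) -> float:
    """Convert a loop count measured over the display period into cycles per loop."""
    # Scale the count up to a whole frame (display + blanking).
    loops_per_frame = loops_in_display_period * TOTAL_LINES / VISIBLE_LINES
    loops_per_second = loops_per_frame * FPS
    return CPU_HZ / loops_per_second


if __name__ == "__main__":
    for label, loops in (("bt + nop", 23_000), ("bt/s + mov.w", 25_000)):
        print(f"{label}: about {cycles_per_loop(loops):.1f} cycles per loop")
```

With the rounded 23k and 25k counts this gives roughly 14 and 13 cycles per loop, so the delayed branch saves about one cycle per iteration under these assumptions.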

ob1 · Post by **ob1** » Wed Mar 21, 2007 2:09 pm

29k loops / frame is quite short.
Even doubling it with the slave CPU, I'd just reach 60k loops.
Compare it to the 80k pixels (320 x 240) of my framebuffer.

I don't think SH2 cycles can be compared to 68k cycles.
Basically, on the 68k, the instruction move.w (d16,An),Dn takes 12 cycles, whereas move.w (An),Dn takes just 8 cycles.
But on the SH2, thanks to the pipeline, you can count on one operation being issued per cycle. A latency can sure exist, but add #1,R2 is as "fast" as mov.w @($A,R0),R12. In the EX stage, these 2 operations are executed in 1 cycle.
Moreover, chapter 7.7.1 states that both mov @(disp,Rm),R0 and mov @Rm,Rn need the 5 cycles. So it isn't faster to do add #$A,R1, then mov @R1,R0.

I don't think BT/S flushes the pipeline. Reading chapter 7.7.5 of the programming manual, here's how I understand it.
With BT (non-delayed branch):

Code: Select all

waitForVBLK:
   mov.l   REG_VDP,R1
   mov.w   VALUE_8000h,R2
whileDISP:
   mov.w   @($A,R1),R0   ; while ((FB_CTRL_REG & 8000h) == 0) ;
   and   R2,R0
   cmp/eq   #0,R0
   bt   whileDISP
   rts
   nop         ; Executes NOP before branching

Cycle 1:
IF stage : bt whileDISP
ID stage :
EX stage :
MA stage :
WB stage :

Cycle 2:
IF stage : rts
ID stage : bt whileDISP
EX stage :
MA stage :
WB stage :

Cycle 3:
IF stage : nop
ID stage : (rts fetched but discarded)
EX stage : bt whileDISP
MA stage :
WB stage :

Cycle 4:
IF stage : mov.w @($A,R1),R0
ID stage : (nop fetched but discarded)
EX stage :
MA stage :
WB stage :

The branch target instruction is fetched 3 cycles after the branch instruction was fetched. And yes, in a way, the pipeline is flushed (although I'd prefer the term "not filled").
Now, let's see with the delayed branch:

Code: Select all

waitForVBLK:
   mov.l   REG_VDP,R1
   mov.w   VALUE_8000h,R2
   mov.w   @($A,R1),R0   ; while ((FB_CTRL_REG & 8000h) == 0) ;
whileDISP:
   and   R2,R0
   cmp/eq   #0,R0
   bt/s   whileDISP
   mov.w   @($A,R1),R0   ; Executes MOV before branching - Feed pipeline
   rts
   nop         ; Executes NOP before branching

Cycle 1:
IF stage : bt/s whileDISP
ID stage :
EX stage :
MA stage :
WB stage :

Cycle 2:
IF stage : mov.w @($A,R1),R0
ID stage : bt/s whileDISP
EX stage :
MA stage :
WB stage :

Cycle 3:
IF stage : rts
ID stage : - (slot inserted)
EX stage : bt/s whileDISP
MA stage :
WB stage :

Cycle 4:
IF stage : and R2,R0
ID stage : mov.w @($A,R1),R0 (rts fetched but discarded)
EX stage :
MA stage :
WB stage :

Cycle 5:
IF stage :
ID stage : and R2,R0
EX stage : mov.w @($A,R1),R0
MA stage :
WB stage :

The branch target instruction is still fetched 3 cycles after the branch instruction was fetched, but the prefetched instruction (mov.w @($A,R1),R0) is by then in the ID stage of the pipeline.

Stef · Post by **Stef** » Wed Mar 21, 2007 2:58 pm

ob1 wrote:29k loops / frame is quite short.
Even doubling it with the slave CPU, I'd just reach 60k loops.
Compare it to the 80k pixels (320 x 240) of my framebuffer.

I don't think SH2 cycles can be compared to 68k cycles.
Basically, on the 68k, the instruction move.w (d16,An),Dn takes 12 cycles, whereas move.w (An),Dn takes just 8 cycles.
But on the SH2, thanks to the pipeline, you can count on one operation being issued per cycle. A latency can sure exist, but add #1,R2 is as "fast" as mov.w @($A,R0),R12. In the EX stage, these 2 operations are executed in 1 cycle.
Moreover, chapter 7.7.1 states that both mov @(disp,Rm),R0 and mov @Rm,Rn need the 5 cycles. So it isn't faster to do add #$A,R1, then mov @R1,R0.

I don't think BT/S flushes the pipeline. Reading chapter 7.7.5 of the programming manual, here's how I understand it.

...

The branch target instruction is still fetched 3 cycles after the branch instruction was fetched, but the prefetched instruction (mov.w @($A,R1),R0) is by then in the ID stage of the pipeline.

I don't have the SH-2 manual here and honestly I don't know everything about the pipeline fill/flush logic :-/
Of course SH-2 instructions are a lot faster than 68k ones; almost all of them can execute in 1 cycle. But when you write:

<<A latency can sure exist, but add #1,R2 is as "fast" as mov.w @($A,R0),R12. In the EX stage, these 2 operations are executed in 1 cycle.>>

Take it as the best case; in fact, most of the time a mov instruction accessing external memory will eat 3 CPU cycles at best. Even using the internal CPU cache, I'm not sure you can reach a 1-cycle execution time...
In your case, you can't access it through the cached mapping and there is probably some latency; I wouldn't be surprised to see your MOV instruction taking something like 6 or 7 cycles here.

ob1 · Post by **ob1** » Wed Mar 21, 2007 3:18 pm

Let (Sinclair/C64 BASIC way)

Code: Select all

	mov.w	@R1,R0
	mov.w	@(2,R2),R0
	mov.w	@(4,R3),R0
	mov.w	@(6,R4),R0
	mov.w	@(8,R5),R0
	mov.w	@($A,R6),R0
	mov.w	@($C,R7),R0
	mov.w	@($E,R8),R0

The mov.w @(disp,Rm),R0 uses the 5 stages of the pipeline.

Cycle 1:
IF stage: mov.w @R1,R0
ID stage:
EX stage:
MA stage:
WB stage:

Cycle 2:
IF stage: mov.w @(2,R2),R0
ID stage: mov.w @R1,R0
EX stage:
MA stage:
WB stage:

Cycle 3:
IF stage: mov.w @(4,R3),R0
ID stage: mov.w @(2,R2),R0
EX stage: mov.w @R1,R0
MA stage:
WB stage:

Cycle 4:
IF stage: mov.w @(6,R4),R0
ID stage: mov.w @(4,R3),R0
EX stage: mov.w @(2,R2),R0
MA stage: mov.w @R1,R0
WB stage:

Cycle 5:
IF stage: mov.w @(8,R5),R0
ID stage: mov.w @(6,R4),R0
EX stage: mov.w @(4,R3),R0
MA stage: mov.w @(2,R2),R0
WB stage: mov.w @R1,R0

Cycle 6 (mov.w @R1,R0 is issued):
IF stage: mov.w @($A,R6),R0
ID stage: mov.w @(8,R5),R0
EX stage: mov.w @(6,R4),R0
MA stage: mov.w @(4,R3),R0
WB stage: mov.w @(2,R2),R0

Cycle 7 (mov.w @(2,R2),R0 is issued):
IF stage: mov.w @($C,R7),R0
ID stage: mov.w @($A,R6),R0
EX stage: mov.w @(8,R5),R0
MA stage: mov.w @(6,R4),R0
WB stage: mov.w @(4,R3),R0

Cycle 8 (mov.w @(4,R3),R0 is issued):
...

So, each instruction takes 5 cycles to be issued (latency), but after these 5 cycles, one instruction is issued every cycle.

Beware: I may be missing something, since the pipeline mechanism is still quite obscure to me at some stages.
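
The walk-through above can be reproduced mechanically. The following Python sketch is a toy model only: it assumes an ideal 5-stage pipeline with no stalls, no memory wait states, and no contention, which is the best case and not a description of real SH2 behaviour. The names `simulate` and `PROGRAM` are invented for the example.

```python
# Toy model of an ideal 5-stage pipeline: no stalls, no memory wait states.
STAGES = ("IF", "ID", "EX", "MA", "WB")


def simulate(instructions):
    """Return one dict per cycle mapping each stage to the instruction in it."""
    n = len(instructions)
    total_cycles = n + len(STAGES) - 1  # fill latency, then one completion per cycle
    timeline = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            idx = cycle - depth  # instruction idx enters IF at cycle idx
            row[stage] = instructions[idx] if 0 <= idx < n else ""
        timeline.append(row)
    return timeline


# The same eight loads as in the listing above.
PROGRAM = [
    "mov.w @R1,R0",
    "mov.w @(2,R2),R0",
    "mov.w @(4,R3),R0",
    "mov.w @(6,R4),R0",
    "mov.w @(8,R5),R0",
    "mov.w @($A,R6),R0",
    "mov.w @($C,R7),R0",
    "mov.w @($E,R8),R0",
]

if __name__ == "__main__":
    for number, row in enumerate(simulate(PROGRAM), start=1):
        print(f"Cycle {number}:")
        for stage in STAGES:
            print(f"  {stage} stage: {row[stage]}")
```

In this idealized model the 8 loads leave the pipeline after 8 + 4 = 12 cycles: 5 cycles of latency for the first one, then one per cycle. Anything that stalls a stage will make the real figure worse.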

Stef · Post by **Stef** » Wed Mar 21, 2007 3:54 pm

ob1 wrote:Let (Sinclair/C64 BASIC way)
Code: Select all
	mov.w	@R1,R0
	mov.w	@(2,R2),R0
	mov.w	@(4,R3),R0
	mov.w	@(6,R4),R0
	mov.w	@(8,R5),R0
	mov.w	@($A,R6),R0
	mov.w	@($C,R7),R0
	mov.w	@($E,R8),R0
The mov.w @(disp,Rm),R0 uses the 5 stages of the pipeline.

Cycle 1:
IF stage: mov.w @R1,R0
ID stage:
EX stage:
MA stage:
WB stage:

...

Cycle 8 (mov.w @(4,R3),R0 is issued):
...

So, each instruction takes 5 cycles to be issued (latency), but after these 5 cycles, one instruction is issued every cycle.

Beware: I may be missing something, since the pipeline mechanism is still quite obscure to me at some stages.

Yeah, I do know the principle of pipelining

What I meant is that there are several ways to get the pipeline flushed, and some extra latencies that we aren't aware of, which explains your ~13 cycles of execution time for such a small loop!

ob1 · Post by **ob1** » Wed Mar 21, 2007 4:05 pm

ok, got it.
;)

Stef · Post by **Stef** » Wed Mar 21, 2007 4:17 pm

ob1 wrote:ok, got it.

Try modifying your loop this way:

Code: Select all

whileDISP:
  mov.w   @($A,R1),R0 

  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 
  mov.w   @($A,R1),R0 

  and   R2,R0 
  cmp/eq   #0,R0 
  bt/s   whileDISP
  nop   ; delay slot, executed before the branch (added here; left empty in the original post)

and test how many loops it can handle per frame, then test again with this code:

Code: Select all

whileDISP:
  mov.w   @($A,R1),R0 
  and   R2,R0 

  and   R2,R0 
  and   R2,R0 
  and   R2,R0 
  and   R2,R0 
  and   R2,R0 
  and   R2,R0 
  and   R2,R0 
  and   R2,R0 
  and   R2,R0 

  cmp/eq   #0,R0 
  bt/s   whileDISP
  nop   ; delay slot, executed before the branch (added here; left empty in the original post)

I wouldn't be surprised to see the second one be *a lot* faster

ob1 · Post by **ob1** » Wed Mar 21, 2007 4:30 pm

You're right.
With my classic waitVBLK, my loop runs nearly 14k times.
Inserting 10 "mov.w ...", my loop runs 2k times.
Inserting 10 "and ...", my loop runs 8k times.
4 times faster!!!
But I don't really get where this huge gap could come from. The latency I had in mind does not explain a factor of 4! Guess I need a little course or lecture on pipelining!!!
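
One way to read these three numbers is to compare the extra time each batch of inserted instructions adds to a single pass of the loop. This is a rough Python sketch under loud assumptions: the three counts were taken over the same time window, the same number of instructions was inserted in both tests, and the rounded "k" figures are accurate enough. The function name `extra_cost_ratio` is made up for the illustration.

```python
def extra_cost_ratio(base_loops, loops_with_movs, loops_with_ands):
    """Cost of one inserted mov.w relative to one inserted and.

    Time per pass is proportional to 1 / (loops per window), so the extra
    time added by the inserted instructions is the difference of reciprocals.
    The measurement window cancels out in the ratio.
    """
    extra_movs = 1 / loops_with_movs - 1 / base_loops
    extra_ands = 1 / loops_with_ands - 1 / base_loops
    return extra_movs / extra_ands


# Figures from the post: 14k (plain loop), 2k (with mov.w), 8k (with and).
ratio = extra_cost_ratio(14_000, 2_000, 8_000)
print(f"each inserted mov.w costs about {ratio:.1f}x an inserted and")
```

With these figures the ratio comes out at about 8, so if an `and` really takes 1 cycle, each `mov.w @($A,R1),R0` costs on the order of 8 cycles here. The factor of 4 in loop counts understates the gap because the fixed part of the loop is included in both totals.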

Stef · Post by **Stef** » Wed Mar 21, 2007 4:44 pm

ob1 wrote:You're right.
With my classic waitVBLK, my loop runs nearly 14k times.
Inserting 10 "mov.w ...", my loop runs 2k times.
Inserting 10 "and ...", my loop runs 8k times.
4 times faster!!!
But I don't really get where this huge gap could come from. The latency I had in mind does not explain a factor of 4! Guess I need a little course or lecture on pipelining!!!

Not only pipelining; think of every memory access as something slow

In Gens I need to emulate them as quite slow to run at the same speed as the 32X... Each I/O port, DRAM or VRAM access has a minimum 3-cycle penalty, so your MOV probably takes 4 or 5 cycles.
By the way, is your code located in DRAM or in ROM? ROM access makes things even slower!
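
To see how much such a penalty hurts a run of back-to-back loads, here is a deliberately crude Python model. It assumes each memory access stalls the pipeline for a fixed number of extra cycles and that stalls never overlap; the 3-cycle value is the minimum quoted above, the other values are arbitrary, and `total_cycles` is a made-up helper name.

```python
PIPELINE_DEPTH = 5  # IF, ID, EX, MA, WB


def total_cycles(n_instructions, n_memory_accesses, penalty):
    """Cycles to retire a straight-line sequence in a simple stall model."""
    # Ideal case: fill the pipeline once, then one instruction per cycle.
    ideal = n_instructions + PIPELINE_DEPTH - 1
    # Assumption: every memory access adds `penalty` stall cycles, no overlap.
    return ideal + n_memory_accesses * penalty


if __name__ == "__main__":
    # Eight consecutive loads, as in ob1's earlier listing.
    for penalty in (0, 3, 6):
        cycles = total_cycles(8, 8, penalty)
        print(f"penalty {penalty}: {cycles} cycles, {cycles / 8:.1f} per load")
```

Even the minimum penalty triples the cost of the sequence in this model (36 cycles instead of 12), which is why a loop dominated by register reads is so sensitive to the number of loads it contains.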

ob1 · Post by **ob1** » Wed Mar 21, 2007 8:40 pm

The security program puts the program in SDRAM (I was surprised to learn ROM is that slow).
Do you know if Gens (well, actually, the SH2 core) emulates the DMAC (not the 32X DMA, but the true SH2 DMAC)?

Stef · Post by **Stef** » Wed Mar 21, 2007 9:33 pm

ob1 wrote:The security program puts the program in SDRAM (I was surprised to learn ROM is that slow).
Do you know if Gens (well, actually, the SH2 core) emulates the DMAC (not the 32X DMA, but the true SH2 DMAC)?

Of course I know, since I wrote it

And yes it does; it's one of the main features you want to see emulated in an SH-2 core (along with some other timer stuff)

I'm not sure how accurate it is, but it's enough anyway to make games work (and they do use it).

ob1 · Post by **ob1** » Wed Mar 21, 2007 10:04 pm

Stef wrote:Of course I know, since I wrote it ;)

You guys rock. Oh!?! I've already said it? Never mind. I didn't know you wrote the SH2 core. I thought you got it from elsewhere.

Stef wrote:And yes it does; it's one of the main features you want to see emulated in an SH-2 core (along with some other timer stuff) :)
I'm not sure how accurate it is, but it's enough anyway to make games work (and they do use it).

Cool. I might use it.

Stef · Post by **Stef** » Thu Mar 22, 2007 8:57 am

ob1 wrote:
Stef wrote:And yes it does; it's one of the main features you want to see emulated in an SH-2 core (along with some other timer stuff)
I'm not sure how accurate it is, but it's enough anyway to make games work (and they do use it).
Cool. I might use it.

DMA is quite fast as far as I remember; it's still the best way to transfer large amounts of data

ob1 · Post by **ob1** » Sun Mar 25, 2007 8:32 pm

A Saturn programming guide advises using compilers instead of handmade ASM. I wonder if compilers are efficient enough to avoid contention, and to sometimes suggest ways to bypass it... I'm sticking with ASM. But it's hard!!!

SpritesMind.Net