32X raw performance

Posted: Tue May 15, 2007 7:35 am
by ob1
Hi all.
With the SuperVDP project, I ran into a performance problem: how many tiles can the 32X actually display? Taking a step back, I decided to benchmark the CPU. Here's my method.

Enable VInt :

	mov.w	@(0,GBR),R0
	or	#8,R0
	mov.w	R0,@(0,GBR)		; Enable V INT

Main loop :

main:
	bra	main
	add	#1,R8		; Executes ADD before branching

VInt routine :

V_INT:
	mov.l	V_INT_vtimer,R0
	mov.l	@R0,R1
	add	#1,R1
	mov.l	R1,@R0

	mov	R8,R9
	mov	#0,R8

	rte
	nop			; Executes NOP before branching
	.align	4
V_INT_vtimer:	dc.l	$2000402C

And I got R9 = $1767D = 95,869 in NTSC,
or R9 = $1BD5D = 114,013 in PAL.
Does this mean each CPU can do no more than ~100k operations per frame?
On real hardware, it would be even slower, since the instruction I use (add #1,R8) only goes through 3 pipeline stages, whereas more complex ones (mov.b @R8,R9, for example) use 4 or even 5 stages!
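
As a sanity check on the figures above, here is a small Python sketch (not part of the original test) that converts the two R9 readings from hex and scales them to loop iterations per second. The variable names are illustrative, and the 60/50 Hz refresh rates are nominal values assumed to be exact.

```python
# R9 readings reported above, written as hex literals
r9_ntsc = 0x1767D
r9_pal = 0x1BD5D

# Nominal refresh rates (assumption: treated as exactly 60 Hz and 50 Hz)
FPS_NTSC = 60
FPS_PAL = 50

# Loop iterations per second = iterations counted per frame * frames per second
iters_per_sec_ntsc = r9_ntsc * FPS_NTSC
iters_per_sec_pal = r9_pal * FPS_PAL

print(r9_ntsc, r9_pal)
print(iters_per_sec_ntsc, iters_per_sec_pal)
```

Both modes come out at roughly 5.7 million loop iterations per second, which suggests the per-frame difference is explained by the refresh rate alone.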

Posted: Tue May 15, 2007 10:48 am
by evildragon
when you benchmark PAL, is it in 240 or 224 lines? (or does it not matter?) just curious..

Posted: Tue May 15, 2007 11:39 am
by TmEE co.(TM)
It doesn't matter if you use 224 or 240 lines, at least not on the MD on its own.

Posted: Tue May 15, 2007 2:39 pm
by ob1
The line count doesn't matter. What matters is the refresh rate: 60 Hz or 50 Hz.

Re: 32X raw performance

Posted: Tue May 15, 2007 3:24 pm
by Shiru
ob1 wrote: Does this mean each CPU can do no more than ~100k operations per frame?
Did you count the branch? In a loop made of an add #1 (i.e. an increment) and a branch, a counter value of ~100K means ~200K operations were executed.

One SH2 at 23 MHz delivers approximately 20 MIPS, i.e. 20,000,000 simple operations (usually register-to-register transfers) per second, so per frame you should get between 333,333 and 400,000 simple operations (at 60 Hz and 50 Hz respectively).
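
The per-frame budget above can be reproduced with a short Python sketch. The 20 MIPS figure is the approximation quoted in this post, not a measurement, and the function name is made up for illustration.

```python
# Approximate throughput quoted above (assumption, not a measured value)
MIPS = 20
OPS_PER_SECOND = MIPS * 1_000_000


def ops_per_frame(refresh_hz):
    """Simple operations available in one frame at the given refresh rate."""
    return OPS_PER_SECOND // refresh_hz


budget_ntsc = ops_per_frame(60)  # NTSC, nominal 60 Hz
budget_pal = ops_per_frame(50)   # PAL, nominal 50 Hz

print(budget_ntsc, budget_pal)
```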

Posted: Tue May 15, 2007 3:25 pm
by Mask of Destiny
I seem to remember that branches are relatively expensive on the SH-2, even with the delay slot instruction. Still, 95,869 * 60 fps * 2 instructions = 11.5 MIPS, which at ~23 MHz means an average of ~2 cycles per instruction. You can do better than that if you keep your branching to a minimum and avoid pipeline stalls. Of course, in real-world code you also have the cache to worry about.
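
The arithmetic above, spelled out as a Python sketch. The clock value is the ~23 MHz figure quoted earlier in the thread, taken here as an assumption; the variable names are illustrative.

```python
loop_count = 95_869      # R9 reading in NTSC from the first post
fps = 60                 # nominal NTSC refresh rate
instr_per_iter = 2       # bra plus the add in its delay slot
clock_hz = 23_000_000    # assumption: the ~23 MHz figure quoted in the thread

# Instructions retired per second by the benchmark loop
instr_per_sec = loop_count * fps * instr_per_iter
mips = instr_per_sec / 1e6

# Average clock cycles spent per instruction
cycles_per_instr = clock_hz / instr_per_sec

print(f"{mips:.1f} MIPS, {cycles_per_instr:.2f} cycles per instruction")
```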