Sega Genesis Dev Kit (SGDK)

fdarkangel · Post by **fdarkangel** » Wed Mar 28, 2012 8:08 am

Stef wrote: Unfortunately inline is just *not supported* in the m68k-elf GCC 3.4.6, whatever switch you try to specify. This is really a pity, as this is the best GCC version for this target (in terms of code generation).

?
I checked the documentation, but I couldn't find any such documented limitation.

Here's an experiment using the gcc you compiled; I get (for "wine gcc -S -O2 test.c")

Code: Select all

inline int f(int x) __attribute__((always_inline));

inline int f(int x)  {
	return x*x;
}

int g(int x) {
	return f(x);
}

Code: Select all

#NO_APP
	.file	"test.c"
	.text
	.align	2
	.globl	f
	.type	f, @function
f:
	link.w %a6,#0
	move.l 8(%a6),%d0
	muls.l %d0,%d0
	unlk %a6
	rts
	.size	f, .-f
	.align	2
	.globl	g
	.type	g, @function
g:
	link.w %a6,#0
	move.l 8(%a6),%d0
	muls.l %d0,%d0
	unlk %a6
	rts
	.size	g, .-g
	.ident	"GCC: (GNU) 3.4.6"

So it works (it works without the -O2 flag as well, of course, I just wanted to keep the output and the post a little bit smaller).
If you're worried about the "duplicate" f over there, it's there just because it might be referenced by some other file. You can simply use "static inline" (which is mentioned in the documentation page I mentioned last time) to get rid of it:

Code: Select all

#NO_APP
	.file	"test.c"
	.text
	.align	2
	.globl	g
	.type	g, @function
g:
	link.w %a6,#0
	move.l 8(%a6),%d0
	muls.l %d0,%d0
	unlk %a6
	rts
	.size	g, .-g
	.ident	"GCC: (GNU) 3.4.6"
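
The source for this second listing was not posted. Presumably it is the same test.c with static added to f; a minimal sketch under that assumption:

```c
/* Assumed variant of test.c: f is now static, so the compiler does not
   have to emit a standalone copy of f once every call is inlined. */
static inline int f(int x) __attribute__((always_inline));

static inline int f(int x) {
	return x * x;
}

int g(int x) {
	return f(x);	/* body of f is expanded here */
}
```

Only g should show up as a global symbol in the output; whether that holds in every case depends on the compiler version and flags.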

Stef wrote: as this is the best GCC version for this target (in terms of code generation).

There have been many improvements since 3.4.6; can I ask you to post a sample of C code that compiles to better assembly with 3.4.6? If there is such a case, I think we should submit it to the GCC bug tracker.

Stef wrote:To reply to sega16, inlining a function is exactly the same as doing a macro but the function code looks better

Speed-wise, an inline function is as fast as a macro (see "An Inline Function is As Fast As a Macro" in the GCC manual). Code-wise, they don't just look better: one can argue against using macros at all, since an inline function keeps you away from their pitfalls and has explicit types.
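
To illustrate one such pitfall, here is a small sketch (all names are made up for the example): a macro pastes its argument text wherever the parameter appears, so an argument with a side effect can be evaluated twice, while a function evaluates it exactly once.

```c
/* Hypothetical names, for illustration only. */
#define SQUARE_MACRO(x) ((x) * (x))

static inline int square_fn(int x) {
	return x * x;
}

/* next_value() counts how many times it has been called. */
static int calls = 0;

static int next_value(void) {
	calls++;
	return 3;
}

/* Number of next_value() calls when squaring through the macro. */
int calls_with_macro(void) {
	calls = 0;
	(void)SQUARE_MACRO(next_value());	/* expands to next_value() * next_value() */
	return calls;
}

/* Number of next_value() calls when squaring through the function. */
int calls_with_function(void) {
	calls = 0;
	(void)square_fn(next_value());		/* argument is evaluated once */
	return calls;
}
```

The macro version also accepts any operand type without complaint, whereas square_fn makes the int parameter explicit.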

Stef · Post by **Stef** » Wed Mar 28, 2012 9:22 am

fdarkangel wrote: I checked the documentation, but I couldn't find any such documented limitation.

Yeah, from the documentation it should be supported, but in practice it doesn't look like it is...

Here's an experiment using the gcc you compiled; I get (for "wine gcc -S -O2 test.c")
Code: Select all
inline int f(int x) __attribute__((always_inline));
...

Hmm so you get functions inlined by forcing it with __attribute__ ?
Why do we have to use that ?
Normally using the simple :

Code: Select all

inline int f(int x) {
...
}

is enough to make sure the function is inlined. Also, -O2 optimization should automatically inline functions where it makes sense, which is not the case here.
But if we can force it with "__attribute__((always_inline))", that is already good news.

Now what we still miss is passing parameters in registers; that way we could make C code much faster in some cases :p

There have been many improvements since 3.4.6; can I ask you to post a sample of C code that compiles to better assembly with 3.4.6? If there is such a case, I think we should submit it to the GCC bug tracker.

There are actually many cases where the code produced by GCC 3.4.6 is better than the code produced by GCC 4.1.1.
I could probably put together some test cases; to start with, you have the

Code: Select all

while(i--) {
  ...
}

which produces a correct DBNE instruction with 3.4.6 but not with 4.1.1.
It really depends on how you write the code, actually.
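
As a concrete instance of that loop shape, here is a minimal sketch (sum_words is a hypothetical helper, not an SGDK function). The counter is a 16-bit value that is tested and then decremented, which is the pattern a 68000 decrement-and-branch instruction can implement; whether a given GCC version actually emits one depends on the version and flags.

```c
typedef unsigned short u16;

/* Hypothetical helper: sums `count` 16-bit words with a count-down loop. */
u16 sum_words(const u16 *src, u16 count) {
	u16 total = 0;
	u16 i = count;

	while (i--) {		/* test i, then decrement: runs exactly `count` times */
		total += *src++;
	}
	return total;
}
```

Comparing the output of "gcc -S" for this function between compiler versions is one way to build the kind of test case asked for here.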

Speed-wise, an inline function is as fast as a macro (see "An Inline Function is As Fast As a Macro" in the GCC manual). Code-wise, they don't just look better: one can argue against using macros at all, since an inline function keeps you away from their pitfalls and has explicit types.

Of course, that's why I prefer to use functions when I can. And if inline now works, I will probably replace some macros with functions =)

fdarkangel · Post by **fdarkangel** » Wed Mar 28, 2012 10:44 am

Stef wrote:Yeah from the documentation it should be supported but in practice it looks not...
Here's an experiment using the gcc you compiled; I get (for "wine gcc -S -O2 test.c")
Code: Select all
inline int f(int x) __attribute__((always_inline));
...
Hmm so you get functions inlined by forcing it with __attribute__ ?
Why do we have to use that ?
Normally using the simple :
Code: Select all
inline int f(int x) {
...
}
is enough to make sure the function is inlined. Also, -O2 optimization should automatically inline functions where it makes sense, which is not the case here.
But if we can force it with "__attribute__((always_inline))", that is already good news.

Oh, as I mentioned last time

fdarkangel wrote: The inline keyword is a hint to the compiler. You should specify the always_inline attribute if you want it to be inlined no matter what.

You can tune some parameters that affect the compiler's judgement, including the optimization level, but at the end of the day it is the compiler that decides what gets inlined. This follows from the C99 standard; it is not specific to GCC or to any version of it.

wikipedia wrote:In various versions of the C and C++ programming languages, an inline function is a function upon which the compiler has been requested to perform inline expansion. In other words, the programmer has requested that the compiler insert the complete body of the function in every place that the function is called, rather than generating code to call the function in the one place it is defined. (However, compilers are not obligated to respect this request.)

You can read more on it here.
By the way, you can compile your code with the -Winline switch to get some details on why a function declared inline (and without the always_inline attribute) is not going to be inlined.

Stef wrote: Now what we still miss is passing parameters in registers; that way we could make C code much faster in some cases :p

This is determined by the calling convention; I couldn't find any such convention for m68000 here.

Stef wrote: There are actually many cases where the code produced by GCC 3.4.6 is better than the code produced by GCC 4.1.1.
I could probably put together some test cases; to start with, you have the
Code: Select all
while(i--) {
  ...
}
which produces a correct DBNE instruction with 3.4.6 but not with 4.1.1.
It really depends on how you write the code, actually.

Can you give a little more information? How is i defined? (Is it defined to be volatile?) What is inside the loop? What are the compiler flags? What are the assembly outputs in 3.4.6 and 4.1.1?
It would be great if we could compile a list of them. Could you please post other examples whenever you come across them?
I would like to cross-compile the current stable release (4.7.0) and test it against these cases as well. The latest stable release is supposed to be the better compiler, since generated code is supposed to improve with newer releases, so this may mean you've stumbled upon an as-yet-unknown compiler bug introduced after 3.4.6.

Stef · Post by **Stef** » Wed Mar 28, 2012 11:47 am

fdarkangel wrote: Oh, as I mentioned last time
fdarkangel wrote: The inline keyword is a hint to the compiler. You should specify the always_inline attribute if you want it to be inlined no matter what.
You can tune some parameters that affect the compiler's judgement, including the optimization level, but at the end of the day it is the compiler that decides what gets inlined. This follows from the C99 standard; it is not specific to GCC or to any version of it.

I see, but then why don't you need that in GCC 4.1.1, for instance? Inlining works as expected there, unlike in GCC 3.4.6.

By the way, you can compile your code with the -Winline switch to get some details on why a function declared inline (and without the always_inline attribute) is not going to be inlined.

I should try it to see the reason for "not inlining", but honestly most of the time it doesn't make any sense; even for a single-line static function, the inline keyword does not work.

This is determined by the calling convention; I couldn't find any such convention for m68000 here.

Yeah, I know x86 has many calling conventions, some of which allow passing parameters in registers. Unfortunately m68k does not have that, which is a pity as the CPU has many registers. D0-D1 and A0-A1 could be used, for instance, for the first four parameters.

Can you give a little more information? How is i defined? (Is it defined to be volatile?) What is inside the loop? What are the compiler flags? What are the assembly outputs in 3.4.6 and 4.1.1?
It would be great if we could compile a list of them. Could you please post other examples whenever you come across them?
I would like to cross-compile the current stable release (4.7.0) and test it against these cases as well. The latest stable release is supposed to be the better compiler, since generated code is supposed to improve with newer releases, so this may mean you've stumbled upon an as-yet-unknown compiler bug introduced after 3.4.6.

It would take me ages to produce that many test cases, report the differences, etc...
Look at this topic (end of first page) :
viewtopic.php?t=1087

You will see there are major differences between GCC 3.4.6 and GCC 4.1.1 regarding inlining, and code generation in general. I compiled both GCC versions with the exact same parameters.

fdarkangel · Post by **fdarkangel** » Thu Mar 29, 2012 4:37 am

Stef wrote: I see, but then why don't you need that in GCC 4.1.1, for instance? Inlining works as expected there, unlike in GCC 3.4.6.

I think the point of the standard is that you shouldn't expect anything to begin with. Whether a given call is inlined is left to the implementation, and GCC makes no promises to be consistent about it between releases.

Stef wrote: Yeah, I know x86 has many calling conventions, some of which allow passing parameters in registers. Unfortunately m68k does not have that, which is a pity as the CPU has many registers. D0-D1 and A0-A1 could be used, for instance, for the first four parameters.

It is indeed a shame that none of the scratch registers are used for passing parameters in function calls. However, such a need is quite rare, and in such cases you can (force-)inline your function calls and/or use inline assembly.

Stef wrote: It would take me ages to produce that many test cases, report the differences, etc...
Look at this topic (end of first page) :
viewtopic.php?t=1087

You will see there are major differences between GCC 3.4.6 and GCC 4.1.1 regarding inlining, and code generation in general. I compiled both GCC versions with the exact same parameters.

Since it's just an excerpt, I "randomly" filled in the gaps, and made this C function

Code: Select all

#define APLAN 0

typedef unsigned short u16;
typedef unsigned int u32;

void f(int starttilex, int starttiley, int endtiley) {
	const u16 tileBaseValue = 0xc0de;

	int counter = 0;
	u16* foreground_layer = (u16 *)(0xc0dedead);
	u16* plctrl = (u16*)(0xdeadc0de);
	volatile u16 *vram = (u16*)(0xc0dec0de);
	u16* pwdata = (u16*)(0xc0dec);
	
	u16* src = &foreground_layer[(starttiley << 8) + starttilex]; 

	while (counter < 1000)
	{ 
		int loop = endtiley - starttiley; 
		const u32 addr = APLAN + ((starttilex + (starttiley << 6)) << 1); 

		*plctrl = vram[addr];

		while(loop--) 
		{ 
			*pwdata = tileBaseValue + *src; 
			src -= 256;

		} 

		counter++; 
	}  
}

I compiled it with gcc 3.4.6 and 4.7.0 at -O2 to obtain the assembly outputs below (4.7.0 first, then 3.4.6).

Code: Select all

#NO_APP
	.file	"test2.c"
	.text
	.align	2
	.globl	f
	.type	f, @function
f:
	link.w %fp,#0
	movem.l #15392,-(%sp)
	move.l 8(%fp),%d1
	move.l 12(%fp),%d0
	move.l %d0,%d2
	lsl.l #8,%d2
	move.l %d2,%a1
	add.l %d1,%a1
	add.l %a1,%a1
	add.l #-1059135827,%a1
	move.l 16(%fp),%d2
	sub.l %d0,%d2
	lsl.l #6,%d0
	move.l %d1,%a2
	add.l %d0,%a2
	add.l %a2,%a2
	add.l %a2,%a2
	add.l #-1059143458,%a2
	move.l %d2,%d3
	subq.l #1,%d3
	move.l %d3,%d4
	moveq #9,%d0
	lsl.l %d0,%d4
	move.l #-512,%d5
	sub.l %d4,%d5
	move.l %d5,%d4
	move.l #1000,%d1
.L4:
	move.w (%a2),-559038242
	tst.l %d2
	jeq .L2
	move.l %d3,%d0
	move.l %a1,%a0
.L3:
	move.w (%a0),%d5
	add.w #-16162,%d5
	move.w %d5,789996
	lea (-512,%a0),%a0
	dbra %d0,.L3
	clr.w %d0
	subq.l #1,%d0
	jcc .L3
	add.l %d4,%a1
.L2:
	subq.l #1,%d1
	jne .L4
	movem.l (%sp)+,#1084
	unlk %fp
	rts
	.size	f, .-f
	.ident	"GCC: (GNU) 4.7.0"

Code: Select all

#NO_APP
	.file	"test2.c"
	.text
	.align	2
	.globl	f
	.type	f, @function
f:
	link.w %a6,#0
	movm.l #0x3020,-(%sp)
	move.l 8(%a6),%a0
	move.l 12(%a6),%d1
	move.l #-1059143458,%a2
	move.l %d1,%d0
	lsl.l #8,%d0
	add.l %a0,%d0
	add.l %d0,%d0
	move.l %d0,%a1
	add.l #-1059135827,%a1
	move.l 16(%a6),%d2
	sub.l %d1,%d2
	lsl.l #6,%d1
	add.l %a0,%d1
	add.l %d1,%d1
	move.w #999,%a0
	move.w (%a2,%d1.l*2),-559038242
	move.l %d2,%d0
	subq.l #1,%d0
	moveq #-1,%d3
	cmp.l %d0,%d3
	jbeq .L11
	.align	2
.L6:
	move.w (%a1),%d3
	add.w #-16162,%d3
	move.w %d3,789996
	lea (-512,%a1),%a1
	dbra %d0,.L6
	clr.w %d0
	subq.l #1,%d0
	jbcc .L6
	jbra .L11
	.align	2
.L13:
	move.w (%a2,%d1.l*2),-559038242
	move.l %d2,%d0
	subq.l #1,%d0
	moveq #-1,%d3
	cmp.l %d0,%d3
	jbne .L6
	.align	2
.L11:
	subq.l #1,%a0
	tst.l %a0
	jbge .L13
	movm.l (%sp)+,#0x40c
	unlk %a6
	rts
	.size	f, .-f
	.ident	"GCC: (GNU) 3.4.6"

So far, 4.7.0 looks good.
However, I spotted a possible compiler bug when trying to compile the libgendev with -O2 -funroll-loops (compiles okay without unroll switch)

gcc 4.7.0 wrote:m68k-elf-gcc -m68000 -Wall -O2 -funroll-loops -fomit-frame-pointer -fno-builtin-memset -fno-builtin-memcpy -Iinclude -c src/maths3D.c -o src/maths3D.o
src/maths3D.c: In function ‘M3D_transform3D’:
src/maths3D.c:276:1: internal compiler error: in replace_pseudos_in, at reload1.c:577
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

I fear that this won't be the last one either.

It looks like there is an observable difference in cube_flat FPS. I compiled it with -O2 using GCC 4.7.0; can you test this ROM? (I couldn't be sure which compiler flags were used for the binary files that ship with SGDK, plus I don't have the hardware.)

http://www.mediafire.com/?kipka82z28x582c

Regarding the discussion in this topic, I can also suggest using the restrict keyword whenever possible, as well as the -funroll-loops switch. With more hints (such as restrict, volatile and compiler switches), GCC can optimize pointer accesses better while staying correct.
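
A minimal sketch of what the restrict hint looks like (add_bias is a hypothetical helper, not part of SGDK). The qualifier promises the compiler that the two buffers do not overlap, so a store through dst cannot change what src points to. restrict is a C99 keyword; in GCC's older default gnu89 mode the spelling __restrict__ is available instead.

```c
typedef unsigned short u16;

/* Hypothetical helper: copies n words from src to dst, adding bias to each.
   The caller must guarantee that dst and src do not overlap. */
void add_bias(u16 *restrict dst, const u16 *restrict src, u16 bias, unsigned n) {
	while (n--) {
		*dst++ = (u16)(*src++ + bias);
	}
}
```

Calling such a function with overlapping buffers breaks the promise and gives undefined results, so the qualifier only belongs where the non-overlap really holds.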

Stef · Post by **Stef** » Thu Mar 29, 2012 1:35 pm

fdarkangel wrote: It is indeed a shame that none of the scratch registers are used for passing parameters in function calls. However, such a need is quite rare, and in such cases you can (force-)inline your function calls and/or use inline assembly.

Indeed, now I can force inlining, and that could help a lot.

Since it's just an excerpt, I "randomly" filled in the gaps, and made this C function

...

So far, 4.7.0 looks good.

Agreed, the code looks as good in one version as in the other, but honestly -O2 does not give me the best results; -O1 generally produces better code. I guess there is a specific optimization flag which messes up the code with -O2 compared to -O1...

However, I spotted a possible compiler bug when trying to compile the libgendev with -O2 -funroll-loops (compiles okay without unroll switch)
gcc 4.7.0 wrote:m68k-elf-gcc -m68000 -Wall -O2 -funroll-loops -fomit-frame-pointer -fno-builtin-memset -fno-builtin-memcpy -Iinclude -c src/maths3D.c -o src/maths3D.o
src/maths3D.c: In function ‘M3D_transform3D’:
src/maths3D.c:276:1: internal compiler error: in replace_pseudos_in, at reload1.c:577
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.
I fear that this won't be the last one either.

Well, it definitely seems that newer GCC versions are not paying much attention to old architectures such as M68K. I guess the problem is not present with 3.4.6.

It looks like there is an observable difference in cube_flat FPS. I compiled it with -O2 using GCC 4.7.0; can you test this ROM? (I couldn't be sure which compiler flags were used for the binary files that ship with SGDK, plus I don't have the hardware.)

http://www.mediafire.com/?kipka82z28x582c

Regarding the discussion in this topic, I can also suggest using the restrict keyword whenever possible, as well as the -funroll-loops switch. With more hints (such as restrict, volatile and compiler switches), GCC can optimize pointer accesses better while staying correct.

Thanks for the binary, I will test it after work and report the frame rate I get.
The "restrict" keyword can also help in functions that work on several memory blocks; it is good to know there is a keyword for that (I thought it was only possible through a custom pragma).

fdarkangel · Post by **fdarkangel** » Thu Mar 29, 2012 3:12 pm

Stef wrote: Agreed, the code looks as good in one version as in the other, but honestly -O2 does not give me the best results; -O1 generally produces better code. I guess there is a specific optimization flag which messes up the code with -O2 compared to -O1...

What kind of problem do you think there is with -O2? The list of optimization switches enabled at -O2 is quite long:

http://gcc.gnu.org/onlinedocs/gcc-3.4.6 ... tions.html

I know it's quite a task, but if you have the time, could you please try turning on these options one by one and try to pinpoint the problem?

Stef wrote: Well, it definitely seems that newer GCC versions are not paying much attention to old architectures such as M68K. I guess the problem is not present with 3.4.6.

I just filed a bug report to the tracker, and attached a minimal excerpt from the SGDK that demonstrates the bug. I tested it against 3.4.6, and the bug is indeed not present there.
Apparently the community around m68k-elf target is relatively small. I think the moral of this story is we should be using the latest releases and cooperate with the GCC team by submitting bugs (including the issues related to the quality of generated code) and fixes for m68k-elf, or bugs may go unnoticed otherwise.

Stef wrote: Thanks for the binary, I will test it after work and report the frame rate I get.

Thanks a lot! I'm very curious about the results. If 3.4.6 outperforms 4.7.0, it probably indicates a bug in the optimizer.

Stef · Post by **Stef** » Thu Mar 29, 2012 5:39 pm

fdarkangel wrote: What kind of problem do you think there is with -O2? The list of optimization switches enabled at -O2 is quite long:

http://gcc.gnu.org/onlinedocs/gcc-3.4.6 ... tions.html

I know it's quite a task, but if you have the time, could you please try turning on these options one by one and try to pinpoint the problem?

I'll try to isolate the faulty flag(s)... give me some time to test that.

fdarkangel wrote: I just filed a bug report to the tracker, and attached a minimal excerpt from the SGDK that demonstrates the bug. I tested it against 3.4.6, and the bug is indeed not present there.
Apparently the community around m68k-elf target is relatively small. I think the moral of this story is we should be using the latest releases and cooperate with the GCC team by submitting bugs (including the issues related to the quality of generated code) and fixes for m68k-elf, or bugs may go unnoticed otherwise.

Yeah, maybe. Another reason why I used GCC 3.4.6 is the binary size. Of course that is not a big deal today with multi-GB hard drives, but anyway, 4.xx was not better (even worse, actually), and most of the new features probably do not apply to an old architecture such as M68K.

fdarkangel wrote: Thanks a lot! I'm very curious about the results. If 3.4.6 outperforms 4.7.0, it probably indicates a bug in the optimizer.

Tested, a lot slower actually (as was GCC 4.1.1):
8 FPS at worst, compared to 11 FPS with GCC 3.4.6 (I tested with -O1 on 3.4.6).
The problem is that I may have biased the result a bit, as I wrote my C code against GCC 3.4.6.
Several times I tweaked the code to help the GCC 3.4.6 optimizer; maybe these tweaks actually make things worse.

fdarkangel · Post by **fdarkangel** » Thu Mar 29, 2012 9:31 pm

Thanks a lot for the test!

Stef wrote: Yeah, maybe. Another reason why I used GCC 3.4.6 is the binary size. Of course that is not a big deal today with multi-GB hard drives, but anyway, 4.xx was not better (even worse, actually), and most of the new features probably do not apply to an old architecture such as M68K.

Tested, a lot slower actually (as was GCC 4.1.1):
8 FPS at worst, compared to 11 FPS with GCC 3.4.6 (I tested with -O1 on 3.4.6).
The problem is that I may have biased the result a bit, as I wrote my C code against GCC 3.4.6.
Several times I tweaked the code to help the GCC 3.4.6 optimizer; maybe these tweaks actually make things worse.

The instruction set of the m68000 indeed doesn't change, but that's not the whole story. New features of the compiler certainly do apply to m68k-elf: since 4.0.0, gcc performs language- and architecture-independent optimizations, which apply to any architecture. I believe this slowness is due to a bug. It'd be great if we could pinpoint which function(s) slow down the program.
I noticed that there is a FASTFILL flag in cube_flat demo; I enabled it and recompiled:

http://www.mediafire.com/?f7lpbj5ndt2ha5a

I'm wondering how this one performs when compared to the FASTFILL enabled gcc-3.4.6 program. If they perform close, then we can compare the generated code for BMP_clear, BMP_flip and BMP_drawPolygone.

Do you have any functions in the library for counting ticks spent in a function? We can try checking the execution times for each call in the main loop, and continue from there on.
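
The idea of timing each call can be sketched as follows. This is only an illustration on a host machine: it uses the standard C clock() as a stand-in for whatever tick counter the library provides, and ticks_spent, sample_work and work_counter are hypothetical names.

```c
#include <time.h>

/* Hypothetical workload standing in for a call made from the main loop. */
volatile unsigned work_counter = 0;

void sample_work(void) {
	work_counter++;
}

typedef void (*work_fn)(void);

/* Runs fn() `iterations` times and returns the elapsed clock() ticks.
   clock() is a host-side stand-in; on the console a hardware tick
   counter would be read instead. */
clock_t ticks_spent(work_fn fn, unsigned iterations) {
	clock_t start = clock();

	while (iterations--) {
		fn();
	}
	return clock() - start;
}
```

Wrapping each call of the main loop this way and comparing the totals between the two compilers would show which function to look at first.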

Stef · Post by **Stef** » Fri Mar 30, 2012 8:37 pm

fdarkangel wrote: The instruction set of the m68000 indeed doesn't change, but that's not the whole story. New features of the compiler certainly do apply to m68k-elf: since 4.0.0, gcc performs language- and architecture-independent optimizations, which apply to any architecture. I believe this slowness is due to a bug. It'd be great if we could pinpoint which function(s) slow down the program.
I noticed that there is a FASTFILL flag in cube_flat demo; I enabled it and recompiled:

http://www.mediafire.com/?f7lpbj5ndt2ha5a

I'm wondering how this one performs when compared to the FASTFILL enabled gcc-3.4.6 program. If they perform close, then we can compare the generated code for BMP_clear, BMP_flip and BMP_drawPolygone.

Do you have any functions in the library for counting ticks spent in a function? We can try checking the execution times for each call in the main loop, and continue from there on.

I tested with FASTFILL. Actually, my previous tests were always done in flat mode (which is not supported with FASTFILL yet), so this time I obtained 25 FPS in wireframe with GCC 4 and 30 FPS with GCC 3... so still the same difference.
There are some functions in the timer.h unit to count ticks.
Anyway, I will try to isolate which flags cause problems with -O2 to start with.

Stef · Post by **Stef** » Fri Mar 30, 2012 9:32 pm

I found the problematic flags:
-fgcse gives a significant speed loss.
-funit-at-a-time gives a minor speed loss.

All other flags give no improvement, or a very minor one.

With -O3, the -fweb flag also gives a minor speed drop.
So now I use -O3 with -fno-gcse, -fno-unit-at-a-time and -fno-web, and this brings a small speed improvement =)
I don't think I can improve the compilation flags any further.
I tried with GCC 4.XX; still a lot slower, unfortunately...

fdarkangel · Post by **fdarkangel** » Sat Mar 31, 2012 9:38 am

Stef wrote: I found the problematic flags:
-fgcse gives a significant speed loss.
-funit-at-a-time gives a minor speed loss.

All other flags give no improvement, or a very minor one.

With -O3, the -fweb flag also gives a minor speed drop.
So now I use -O3 with -fno-gcse, -fno-unit-at-a-time and -fno-web, and this brings a small speed improvement =)
I don't think I can improve the compilation flags any further.
I tried with GCC 4.XX; still a lot slower, unfortunately...

Thanks a lot!
We can start tracking down why exactly -O1 performs better than -O2. I will compare the generated code while toggling these options one by one when I have time. This sounds like a problem with the optimizer. We need to hunt down the culprit function that slows things down and produce a minimal working example to submit as a bug; since I don't have the hardware, I will try Gens KMod's timer functions, which should give a rough idea. (KMod's display doesn't really work under wine, but the debug console works OK.)

Next up is, of course, why gcc4 -On is worse than gcc3 -On. By GCC 4.XX, did you mean 4.7.0?

Relevant excerpt from GCC 4.7.0 manual

-fgcse
Perform a global common subexpression elimination pass. This pass also performs global constant and copy propagation.
Note: When compiling a program using computed gotos, a GCC extension, you may get better run-time performance if you disable the global common subexpression elimination pass by adding -fno-gcse to the command line.

-funit-at-a-time
This option is left for compatibility reasons. -funit-at-a-time has no effect, while -fno-unit-at-a-time implies -fno-toplevel-reorder and -fno-section-anchors.
Enabled by default.

-fno-toplevel-reorder
Do not reorder top-level functions, variables, and asm statements. Output them in the same order that they appear in the input file. When this option is used, unreferenced static variables will not be removed. This option is intended to support existing code that relies on a particular ordering. For new code, it is better to use attributes.
Enabled at level -O0. When disabled explicitly, it also implies -fno-section-anchors, which is otherwise enabled at -O0 on some targets.

-fsection-anchors
Try to reduce the number of symbolic address calculations by using shared “anchor” symbols to address nearby objects. This transformation can help to reduce the number of GOT entries and GOT accesses on some targets.
For example, the implementation of the following function foo:
Code: Select all
static int a, b, c;
int foo (void) { return a + b + c; }
would usually calculate the addresses of all three variables, but if you compile it with -fsection-anchors, it will access the variables from a common anchor point instead. The effect is similar to the following pseudocode (which isn't valid C):
Code: Select all
int foo (void)
{
    register int *xr = &x;
    return xr[&a - &x] + xr[&b - &x] + xr[&c - &x];
}
Not all targets support this option.

-fweb
Constructs webs as commonly used for register allocation purposes and assign each web individual pseudo register. This allows the register allocation pass to operate on pseudos directly, but also strengthens several other optimization passes, such as CSE, loop optimizer and trivial dead code remover. It can, however, make debugging impossible, since variables will no longer stay in a “home register”.
Enabled by default with -funroll-loops.

Stef · Post by **Stef** » Sat Mar 31, 2012 4:19 pm

Honestly, given the description of each optimization flag, it doesn't really make sense that enabling them makes the generated code worse...

Something that could help in figuring out the problem with GCC 4.XX (it's GCC 4.1.1 for me, but I also tested with 4.6.X and obtained almost identical results) is maybe just to compare the generated code.

Here's the assembly code generated with GCC 3.4.6 for the "Particle" sample.

Code: Select all

	.align	2
	.type	updatePartic, @function
updatePartic:
	movm.l #0x3830,-(%sp)
	move.l 24(%sp),%a2
	move.w 30(%sp),%d4
	subq.w #1,%d4
	cmp.w #-1,%d4
	jbeq .L85
	lea random,%a3
	.align	2
.L83:
	move.w (%a2),%d1
	move.w %d1,%d0
	subq.w #1,%d0
	cmp.w #8190,%d0
	jbhi .L89
	move.w 2(%a2),%d2
	jble .L91
	add.w 4(%a2),%d1
	move.w %d1,(%a2)
	move.w 6(%a2),%d0
	add.w %d0,%d2
	move.w %d2,2(%a2)
	sub.w gravity,%d0
	move.w %d0,6(%a2)
.L78:
	addq.l #8,%a2
	dbra %d4,.L83
	jbra .L85
	.align	2
.L89:
	move.w baseposx,(%a2)
	move.w baseposy,2(%a2)
	jbsr (%a3)
	and.w #126,%d0
	moveq #64,%d1
	sub.w %d0,%d1
	move.w %d1,4(%a2)
	jbsr (%a3)
	and.w #504,%d0
	add.w #128,%d0
	move.w %d0,6(%a2)
	jbra .L78
	.align	2
.L91:
	move.w 6(%a2),%d3
	move.w %d3,%a0
	move.w gravity,%d0
	ext.l %d0
	lsl.l #3,%d0
	neg.l %d0
	cmp.l %a0,%d0
	jblt .L89
	add.w 4(%a2),%d1
	move.w %d1,(%a2)
	sub.w %d3,%d2
	move.w %d2,2(%a2)
	neg.w %d3
	asr.w #1,%d3
	move.w %d3,6(%a2)
	addq.l #8,%a2
	dbra %d4,.L83
	.align	2
.L85:
	movm.l (%sp)+,#0xc1c
	rts


	.align	2
	.type	drawPartic, @function
drawPartic:
	link.w %a6,#0
	movm.l #0x3f00,-(%sp)
	move.w 14(%a6),%d2
	move.b 19(%a6),%d5
	move.l %sp,%d6
	move.w %d2,%d0
	ext.l %d0
	lsl.l #2,%d0
	addq.l #2,%d0
	sub.l %d0,%sp
	move.l %sp,%d4
	move.w #160,%d3
	move.l 8(%a6),%a1
	move.l %sp,%a0
	move.w %d2,%d1
	subq.w #1,%d1
	cmp.w #-1,%d1
	jbeq .L97
	.align	2
.L98:
	move.w (%a1),%d7
	asr.w #6,%d7
	move.w %d7,(%a0)
	move.w 2(%a1),%d0
	asr.w #6,%d0
	move.w %d3,%d7
	sub.w %d0,%d7
	move.w %d7,2(%a0)
	addq.l #8,%a1
	addq.l #4,%a0
	dbra %d1,.L98
.L97:
	move.w %d2,-(%sp)
	clr.w -(%sp)
	clr.l -(%sp)
	move.b %d5,(3,%sp)
	move.l %d4,-(%sp)
	jbsr BMP_setPixels_V2D
	lea (12,%sp),%sp
	move.l %d6,%sp
	movm.l -24(%a6),#0xfc
	unlk %a6
	rts

I used the following flags:

-O3 -fno-web -fno-gcse -fno-unit-at-a-time -fomit-frame-pointer -fno-builtin-memset -fno-builtin-memcpy

I added blank lines to separate functions to make code easier to read.
It would be nice if you could post the code produced by your version of GCC with the exact same flags.

Then we can compare function by function.

fdarkangel · Post by **fdarkangel** » Sat Mar 31, 2012 6:47 pm

Stef wrote: Honestly, given the description of each optimization flag, it doesn't really make sense that enabling them makes the generated code worse...

This is the reason I suspect it's an optimizer bug. But we might be focusing on the wrong thing here as well; I'm wondering what FPS you're getting with optimizer disabled?

Stef wrote: Something that could help in figuring out the problem with GCC 4.XX (it's GCC 4.1.1 for me, but I also tested with 4.6.X and obtained almost identical results) is maybe just to compare the generated code.

Here's the assembly code generated with GCC 3.4.6 for the "Particle" sample.

...

I used the following flags:
-O3 -fno-web -fno-gcse -fno-unit-at-a-time -fomit-frame-pointer -fno-builtin-memset -fno-builtin-memcpy
I added blank lines to separate functions to make code easier to read.
It would be nice if you could post the code produced by your version of GCC with the exact same flags. Then we can compare function by function.

I tried linking the cube example against the libgendev that ships with SGDK 0.9, and got the same good FPS as SGDK 0.9's cube binary. So apparently the real culprit here is a library function. This is why I think we can start off by measuring times in main.c, then recursively narrow down the list by focusing on the functions that take the longest.

Stef · Post by **Stef** » Sat Mar 31, 2012 8:11 pm

fdarkangel wrote: This is the reason I suspect it's an optimizer bug. But we might be focusing on the wrong thing here as well; I'm wondering what FPS you're getting with optimizer disabled?

I will test that and post the result but what is the point of having this information ?

Edit :
Ok, same awful performance in GCC 3.4.6 and GCC 4.1.1 when optimizer is disabled.

I tried linking the cube example against the libgendev that ships with SGDK 0.9, and got the same good FPS as SGDK 0.9's cube binary. So apparently the real culprit here is a library function. This is why I think we can start off by measuring times in main.c, then recursively narrow down the list by focusing on the functions that take the longest.

Yeah, the cube sample relies heavily on the library; that is less the case for the Particle sample when you have many particles on the screen. That is why I chose this sample to compare code. Actually, we only have to focus on 2 simple functions which eat 99% of the CPU time: updatePartic() & drawPartic(). I will modify my previous assembly output to keep only these functions.