Sprite List Code Messed Up

Miquel · Post by **Miquel** » Wed Nov 15, 2017 5:26 pm

Miquel wrote: Tue Nov 14, 2017 6:57 pm About question 2, "clr vs moveq":

What happens is that in the case of clr, the operand is read before the write (a side effect of saving die space). For a register that means nothing: "clr.l d0" and "moveq.l #0, d0" both take 4 typical cycles. In the case of "clr.l (a0)", yes, it will add 2 more cycles in comparison with "moveq.l #0,(a0)".

I'm checking at what GCC does and:
- I see a lot: "clr.l d0" -like
- Not a single "clr.l (a0)" -like
- Some "clr.w -(%sp)", "clr.b -45(%a6)" -like which, unless I'm much mistaken, are not doable with moveq, in other words: saves cycles by combining instructions.

So GCC is really not that bad a compiler at the 68K level after all.

Correction:
"clr.w" and "moveq" both take 4 cycles; "clr.l" takes 6 cycles.

So, knowing that, I rewrote the previous table:
GCC with my C code does:
- A lot of "clr.w", "clr.b"
- Not a single "clr.l"
- Not a single "clr.? (a?)"
- Some "clr.w -(%sp)", "clr.b -45(%a6)", "clr.l -(%sp)", "clr.l g_globals+840" -like

Again, the mistake was mine, not GCC's: I didn't differentiate between word and long access.

Miquel · Post by **Miquel** » Wed Nov 15, 2017 8:08 pm

flamewing wrote: Wed Nov 15, 2017 10:04 am
Miquel wrote: Tue Nov 14, 2017 6:57 pm 1) .b instructions are 2 byte larger (except for move) so they are necessarily slower.
That is not true at all. This is true for immediate .l instructions, but for most other instructions, .b and .w differ by a couple of bits. Those that are different by more than a couple bits are generally shorter in .b by 2 bytes. Hint: I already wrote a 68k disassembler in 68k assembly that is used in a lot of Sonic hacks as a generic error screen.

I really need to check this for sure, since I'm writing code based on the assumption that it's better to avoid byte addressing. Right now the problem is that my code doesn't use bytes in calculations because of that rule.
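To make the "differ by a couple of bits" point concrete, here is a minimal sketch in C of how the size field sits inside the opcode word for one instruction, CLR on a data register. The function and enum names are made up for illustration; the bit layout (0100 0010 ss mmm rrr, with ss = 00 byte, 01 word, 10 long) follows the standard 68000 encoding of CLR. Other instructions place their size bits differently, so this shows one case, not a general rule.

```c
#include <assert.h>
#include <stdint.h>

/* Operand sizes as encoded in bits 7-6 of the CLR opcode word. */
enum op_size { SIZE_BYTE = 0, SIZE_WORD = 1, SIZE_LONG = 2 };

/* Hypothetical helper: encode "clr.<size> dN" (data register direct,
 * mode bits 000). Layout: 0100 0010 ss 000 rrr.
 * Every size produces exactly one 16-bit opcode word. */
static uint16_t encode_clr_dn(enum op_size size, unsigned reg)
{
    assert(reg < 8);
    return (uint16_t)(0x4200u | ((unsigned)size << 6) | reg);
}
```

With this layout, clr.b d0, clr.w d0 and clr.l d0 are all 2 bytes long and differ only in bits 7-6 of the opcode word.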

flamewing wrote: Wed Nov 15, 2017 10:04 am A hint: when someone says things like "the clr microcode is not well optimized", that is a good hint that they know what they are talking about.

And when someone says the Earth is flat (not round), better believe him... oh! wait, you already know that one. OK, let's try this other one: when someone says there is nobody else in the Solar System, better believe him, because for sure he is telling the truth.

No, man, I want an explanation (like the one you gave before); unfortunately, even so, trickery could be at hand.

"clr" always reads the operand first, no matter the addressing mode, because it shares microopcodes with "move"-like opcodes. This way pre-decrement and post-increment are supported, and they did not need to create a new microopcode or a new instruction, which saves transistors.

Anyway, the thing is that sometimes it is better to use "clr" and other times "moveq". I hope this can be worked out with some kind of macro that makes the decision.
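A minimal sketch of such a decision, written as a C function rather than an assembler macro. It uses only the numbers quoted in this thread for data-register destinations (clr.b/clr.w: 4 cycles, clr.l: 6 cycles, moveq: 4 cycles) plus the fact that moveq can only target a data register. All names here are hypothetical, and the sketch keeps clr for byte and word sizes because moveq always overwrites all 32 bits of the register.

```c
#include <assert.h>

enum dest_kind { DEST_DATA_REG, DEST_MEMORY };
enum op_size   { SIZE_BYTE, SIZE_WORD, SIZE_LONG };
enum zero_insn { USE_CLR, USE_MOVEQ };

/* Cycle counts for zeroing a data register, as quoted in this thread. */
static int zeroing_cycles_dn(enum zero_insn insn, enum op_size size)
{
    if (insn == USE_MOVEQ)
        return 4;                       /* moveq #0,dN */
    return (size == SIZE_LONG) ? 6 : 4; /* clr.l vs clr.b / clr.w */
}

/* Hypothetical chooser: pick the instruction used to zero a destination. */
static enum zero_insn pick_zeroing_insn(enum dest_kind dest, enum op_size size)
{
    assert(size == SIZE_BYTE || size == SIZE_WORD || size == SIZE_LONG);

    /* moveq cannot write to memory, so clr is the only candidate there. */
    if (dest == DEST_MEMORY)
        return USE_CLR;

    /* Long clear of a register: moveq is 4 cycles, clr.l is 6. */
    if (size == SIZE_LONG)
        return USE_MOVEQ;

    /* Byte/word: same cost, and clr leaves the upper bits untouched. */
    return USE_CLR;
}
```

An assembler macro would make the same three-way decision from its operand and size arguments; the C form is only meant to spell out the rule.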

flamewing wrote: Wed Nov 15, 2017 10:04 am For what is worth, GCC will generate these if you pass -m68000 or -march=68000, so it is something to do if you don't already.

Here is the example I mentioned above:
Code: Select all
#include "stdint.h"
void InitVDP(uint16_t *init_vals, int size) {
    volatile uint16_t *Ctrl = (uint16_t*)0xC00004;
    for (int ii = 0; ii < size; ii++) {
        *Ctrl = init_vals[ii];
    }
}
[...]

I'm very glad to see the usage of dbra on newer GCC!

This is what an earlier version of GCC does:

Code: Select all

void NOINLINE InitVDP(u16 *init_vals, u16 size)
{
    volatile u16 *Ctrl = (u16*)0xC00004;
	int ii=0;
    for(; ii < size; ii++)
	{
        *Ctrl = init_vals[ ii ];
    }
}

void NOINLINE InitVDP( u16 *init_vals, u16 size )
 move.l 4(%sp),%a1
 move.w 8(%sp),%d1
 clr.w %d0
 cmp.w %d0,%d1
 jbls .L8
.L13:
 move.w %d0,%a0
 add.l %a0,%a0
 move.w (%a1,%a0.l),12582916
 addq.w #1,%d0
 cmp.w %d0,%d1
 jbhi .L13
.L8:
rts

void NOINLINE InitVDP2( u16 *init_vals, u16 size )
{
	u16* p = (u16*)0xC00004;
	for( ; size; size-- )
	{
		*p = *init_vals++;
	}
}

void NOINLINE InitVDP2( u16 *init_vals, u16 size )
 move.l 4(%sp),%a0
 move.w 8(%sp),%d0
 jbeq .L8
.L13:
 move.w (%a0)+,12582916
 subq.w #1,%d0
 jbne .L13
.L8:
rts

With flags:
-m68000 -Wall -Wextra -Wparentheses -Wno-unused -Wno-switch -fshort-enums --param inline-unit-growth=10000 -fomit-frame-pointer -mshort -O3
I usually also add these:
-Winline -fmerge-all-constants -fgcse-sm -fno-keep-static-consts -funroll-loops
With those, the previous functions become unrolled and muuuuch larger.

The point is, I checked all this to see whether it is worth upgrading my GCC version.

Stef · Post by **Stef** » Thu Nov 16, 2017 9:30 am

Myself, I always use a while(i--) statement instead of a for loop so GCC correctly optimizes it into dbra

Chilly Willy · Post by **Chilly Willy** » Thu Nov 16, 2017 3:59 pm

Good point. Maybe there should be a thread on 68000 gcc optimizations and what versions they're good for. Things like the while (i--), or that gcc 7 with -Ofast will store longs when it can instead of bytes in a loop like this:

Code: Select all

                for (x=0; x<w; x++)
                    *dp++ = *sp++;

where dp and sp are both uint8_t pointers.

Stef · Post by **Stef** » Thu Nov 16, 2017 9:56 pm

Chilly Willy wrote: Thu Nov 16, 2017 3:59 pm ... Things like the while (i--), or that gcc 7 with -Ofast will store longs when it can instead of bytes in a loop like this:
Code: Select all
                for (x=0; x<w; x++)
                    *dp++ = *sp++;
where dp and sp are both uint8_t pointers.

Does it?? It works only if both pointers are aligned on a word address (even address), otherwise --> BUS error.

flamewing · Post by **flamewing** » Thu Nov 16, 2017 10:08 pm

FYI, I decided to completely ignore your false equivalences with flat Earth and no one else in the solar system.

Miquel wrote: Wed Nov 15, 2017 8:08 pm "clr" always reads the operand first, no matter the addressing mode, because it shares microopcodes with "move"-like opcodes. This way pre-decrement and post-increment are supported, and they did not need to create a new microopcode or a new instruction, which saves transistors.

No, it shares microcode with neg/negx/not (these 4 instructions use the same microcode). For register access, they share no microcode with move opcodes. For memory access, they use different microcode from register access, because it needs to decode the addressing mode (and yeah, this microcode calls common subroutines to do that, and they are shared — for the most part — with move opcodes as well).

But it really is not the reading part that makes register access mode slower for .l for these instructions: move from any register to any other is always 4 cycles, regardless of size (.b, .w or .l). clr/neg/negx/not are 4 cycles for .b and .w, 6 cycles for .l because the register values go through the 68k's ALU, which is 16-bit: it takes one microopcode to handle the low byte (.b) or low word (.w), two microopcodes to clear the two 16-bit halves (.l). The final microopcode is prefetch.

Stef wrote: Thu Nov 16, 2017 9:30 am Myself, I always use a while(i--) statement instead of a for loop so GCC correctly optimizes it into dbra

Hm. We have a regression to report in GCC 7.2 then (the browncc and brownc++ are GCC and G++): 6.x uses dbra.

Stef · Post by **Stef** » Fri Nov 17, 2017 10:42 am

flamewing wrote: Thu Nov 16, 2017 10:08 pm
Stef wrote: Thu Nov 16, 2017 9:30 am Myself, I always use a while(i--) statement instead of a for loop so GCC correctly optimizes it into dbra
Hm. We have a regression to report in GCC 7.2 then (the browncc and brownc++ are GCC and G++): 6.x uses dbra.

Too bad

That's why I have to carefully test any GCC version before switching to it in SGDK.
3.4.6 was OK and 6.0.0 is good... so for now I'm staying on this version =)

Chilly Willy · Post by **Chilly Willy** » Fri Nov 17, 2017 6:20 pm

Stef wrote: Thu Nov 16, 2017 9:56 pm Does it?? It works only if both pointers are aligned on a word address (even address), otherwise --> BUS error.

It checks for things like alignment and length before using the long code. If either pointer can't do a long read or it's not at least a long worth of bytes, it does a byte move. It does NOT do word moves that I noticed. I should probably do some more checking on it... maybe also try it with the while (i--) instead of a for loop. Maybe that's even better!
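The check described above can be sketched by hand in C. This is not GCC's generated code, only an illustration of the same idea: use long moves when both pointers allow a long access and at least one long's worth of data remains, otherwise fall back to bytes. The names `copy_bytes` and `demo_copy` are hypothetical. The alignment test uses `_Alignof(uint32_t)`, which is assumed to be 2 on a 68000 target and is typically 4 on a desktop host, so the sketch can be run off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes and return how many 32-bit moves were used.
 * Long moves are taken only when both pointers are suitably aligned
 * and n covers at least one long; everything else is copied bytewise. */
static size_t copy_bytes(uint8_t *dp, const uint8_t *sp, size_t n)
{
    size_t longs = 0;
    assert(dp != NULL && sp != NULL);

    if (n >= sizeof(uint32_t) &&
        (((uintptr_t)dp | (uintptr_t)sp) % _Alignof(uint32_t)) == 0) {
        while (n >= sizeof(uint32_t)) {
            /* Valid here because the demo storage is declared uint32_t. */
            *(uint32_t *)(void *)dp = *(const uint32_t *)(const void *)sp;
            dp += sizeof(uint32_t);
            sp += sizeof(uint32_t);
            n -= sizeof(uint32_t);
            longs++;
        }
    }
    while (n--) {
        *dp++ = *sp++;   /* byte tail, or the whole copy if unaligned */
    }
    return longs;
}

/* Demo helper: copy n bytes between two scratch buffers starting at the
 * given byte offset. Returns the long-move count, or -1 on a mismatch. */
static long demo_copy(size_t offset, size_t n)
{
    static uint32_t src_words[8], dst_words[8];
    uint8_t *src = (uint8_t *)src_words;
    uint8_t *dst = (uint8_t *)dst_words;
    size_t i, longs;

    assert(offset + n <= sizeof src_words);
    for (i = 0; i < sizeof src_words; i++) {
        src[i] = (uint8_t)(i + 1);
        dst[i] = 0;
    }
    longs = copy_bytes(dst + offset, src + offset, n);
    for (i = 0; i < n; i++) {
        if (dst[offset + i] != src[offset + i])
            return -1;
    }
    return (long)longs;
}
```

Like the behaviour described above, the sketch never uses word moves: it is either longs plus a byte tail, or bytes only.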

Miquel · Post by **Miquel** » Mon Nov 20, 2017 4:08 pm

Stef wrote: Thu Nov 16, 2017 9:30 am Myself, I always use a while(i--) statement instead of a for loop so GCC correctly optimizes it into dbra

Awesome! With:

Code: Select all

void NOINLINE InitVDP3( u16 *init_vals, u16 size )
{
	u16* p = (u16*)0xC00004;
	while( size-- )
	{
		*p = *init_vals++;
	}
}

or with:

Code: Select all

void NOINLINE InitVDP4( u16 *init_vals, u16 size )
{
	u16* p = (u16*)0xC00004;
	for( ;size--; ) // <------- Only change here
	{
		*p = *init_vals++;
	}
}

GCC 3.4.6 uses dbra.
But with:

Code: Select all

void NOINLINE InitVDP5( u16 *init_vals, u16 size )
{
	u16* p = (u16*)0xC00004;
	for( ;size; size-- ) // <------- Only change here
	{
		*p = *init_vals++;
	}
}

it doesn't!
Basically, you can't use 'size' (or the looping variable) in the 3rd expression if you are using a 'for'.
Nice to know!
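For anyone wanting to check this on a desktop first: the three loop shapes are equivalent as far as C is concerned, so any difference is purely in what the compiler emits. Here is a minimal host-side sketch that runs all three against a stand-in variable instead of the real VDP port and confirms they perform the same writes. The names `fake_port`, `init_while`, `init_for_cond`, `init_for_step` and `loops_agree` are hypothetical and not part of the code above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

/* Stand-in for the VDP control port so the sketch runs on a host. */
static volatile u16 fake_port;

/* Shape of InitVDP3: while (size--) */
static unsigned init_while(const u16 *vals, u16 size)
{
    unsigned writes = 0;
    while (size--) { fake_port = *vals++; writes++; }
    return writes;
}

/* Shape of InitVDP4: for (; size--; ) */
static unsigned init_for_cond(const u16 *vals, u16 size)
{
    unsigned writes = 0;
    for (; size--; ) { fake_port = *vals++; writes++; }
    return writes;
}

/* Shape of InitVDP5: for (; size; size--) */
static unsigned init_for_step(const u16 *vals, u16 size)
{
    unsigned writes = 0;
    for (; size; size--) { fake_port = *vals++; writes++; }
    return writes;
}

/* Returns 1 if all three shapes write the same number of values and
 * leave the same last value in the port, 0 otherwise. */
static int loops_agree(u16 size)
{
    static const u16 table[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
    unsigned w1, w2, w3;
    u16 last1, last2, last3;

    assert(size <= 8);
    fake_port = 0xFFFF; w1 = init_while(table, size);    last1 = fake_port;
    fake_port = 0xFFFF; w2 = init_for_cond(table, size); last2 = fake_port;
    fake_port = 0xFFFF; w3 = init_for_step(table, size); last3 = fake_port;

    return w1 == size && w2 == size && w3 == size &&
           last1 == last2 && last2 == last3;
}
```

Whether a given GCC version turns each shape into dbra still has to be checked in the generated assembly for that version.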

Miquel · Post by **Miquel** » Mon Nov 20, 2017 6:03 pm

flamewing wrote: Thu Nov 16, 2017 10:08 pm FYI, I decided to completely ignore your false equivalences with flat Earth and no one else in the solar system.

flamewing wrote: Thu Nov 16, 2017 10:08 pm For register access, they share no microcode with move opcodes. For memory access, they use different microcode from register access,

I'm wondering if you and I are on the same page here...

flamewing wrote: Thu Nov 16, 2017 10:08 pm But it really is not the reading part that makes register access mode slower for .l for these instructions: move from any register to any other is always 4 cycles, regardless of size (.b, .w or .l). clr/neg/negx/not are 4 cycles for .b and .w, 6 cycles for .l because the register values go through the 68k's ALU, which is 16-bit: it takes one microopcode to handle the low byte (.b) or low word (.w), two microopcodes to clear the two 16-bit halves (.l). The final microopcode is prefetch.

Obviously both things add cycles. "Register read + operation + register write" uses 2 cycles at microcode level.

flamewing · Post by **flamewing** » Tue Nov 21, 2017 8:26 pm

Miquel wrote: Mon Nov 20, 2017 6:03 pm Obviously both things add cycles. "Register read + operation + register write" uses 2 cycles at microcode level.

The person who 2 minutes ago was avoiding byte instructions because he believed, without evidence, that they were slower and 2 bytes longer is trying to argue about microcode? Fine, if you want to make a fool of yourself, I will indulge you.

Microcode can write to a register, read another register and do an ALU operation on the same microcode. It can also do other things, like also increment PC and start a memory read or write cycle (with either the new or the old PC).

Let's start by opening US4325121. On figure 21G, you can see the microcode table for clr/neg/negx/not, and a few others that are not of interest here. The first row is clr/neg/negx/not in .b mode; the second row is for .w mode; the third row is for .l mode. The first column is data register; the second is address register (illegal for these instructions); the next 7 columns are for the memory alterable modes and absolute modes; the others are illegal for these instructions.

For .b and .w, the microopcodes are NNRW1 (data register), and <address decoder>+NNMW1 (memory modes). For .l, the microopcodes are NNRL1 (data register), and <address decoder>+NNML1 (memory modes). I will ignore the memory modes because they are dominated by the memory read+write+prefetch.

Go now to appendix H and search for these microopcodes. NNRW1 is on pp. 171-172, and NNRL1 is on pp. 179-180. Looking them up you will see that we will also need ROAW2 and ROAL4; ROAW2 is on pp. 115-116; and ROAL4 is on pp. 119-120. Transcripts:

Code: Select all

----------------------------------------------
|                          <       |         |
| au -> db -> aob,au,pc            | irix    |
| (ryl)->ab*->alu                  |---------|
| 0->alu                           | dbi     |
| +2->au                           |---------|
|                                  | 2i      |
|                                  |---------|
|                                  | dxry    |
----------------------------------------------
|                      112 | NNRW1 | NNRW1   |
----------------------------------------------
                               |
                               v
                             ROAW2

----------------------------------------------
|                          <       |         |
| au -> db -> aob,au,pc            | irix    |
| (ryl)->ab*->alu                  |---------|
| 0->alu                           | dbi     |
| +2->au                           |---------|
|                                  | 2i      |
|                                  |---------|
|                                  | dxry    |
----------------------------------------------
|                      116 | NNRL1 | NNRW1   |
----------------------------------------------
                               |
                               v
----------------------------------------------
|                           >      |         |
| alu -> db -> ryl                 | frix    |
| edb->dbin,irc                    |---------|
| (ryh)->ab->alu                   | db      |
| 0->alu                           |---------|
|                                  | 3f      |
|                                  |---------|
|                                  |         |
----------------------------------------------
|                       AE | NNRL2 | NNRL2   |
----------------------------------------------
                               |
                               v
                             ROAL4

----------------------------------------------
|                           >      |         |
| alu -> db -> ryl                 | frix    |
| edb->dbin,irc                    |---------|
| (ir)->ird                        | a1      |
| (pc)->db->au                     |---------|
| +2->au                           | xnf     |
|                                  |---------|
|                                  |         |
----------------------------------------------
|                      297 | ROAW2 | ROAW2   |
----------------------------------------------

----------------------------------------------
|                                  |         |
| alu -> ab -> ryh                 | np      |
| (ir)->ird                        |---------|
| (pc)->db->au                     | a1      |
| +2->au                           |---------|
|                                  | x       |
|                                  |---------|
|                                  |         |
----------------------------------------------
|                      30B | ROAL4 | ROAL4   |
----------------------------------------------

Now following on the patent, this translates to:

NNRW1:
irix = initiate read of immediate or instruction
dbi = direct branch, (IRC)->IR
2i = on figure 17, select column 2 of proper row for ALU operation (i is irrelevant since it only applies to column 1)
dxry = don't care about field Rx, read register specified by Ry
au -> db -> aob,au,pc = output of addressing unit (PC computed by prefetch cycle of last instruction) goes to data bus, then to address output buffer, to addressing unit as input, and to PC
(ryl)->ab*->alu = low word of register Ry goes to address bus, then as input to ALU
0->alu = other input to ALU is zero
+2->au = other input of addressing unit is +2

NNRL1: identical to NNRW1 except for microopcode branch destination

NNRL2:
frix = finish read of immediate or instruction
db = direct branch, (IRC)->IR
3f = on figure 17, select column 3 of proper row for ALU operation (f is irrelevant since it only applies to column 1)
alu -> db -> ryl = output of ALU goes to data bus, then to low word of register Ry
edb->dbin,irc = external data bus goes to data bus input and to IRC
(ryh)->ab->alu = high word of register Ry goes to address bus, then into ALU as input
0->alu = other input to ALU is zero

ROAW2:
frix = finish read of immediate or instruction
a1 = go to starting address A1
xnf = don't care about ALU function, do not change condition codes, byte transfer
alu -> db -> ryl = output of ALU goes to data bus, then to low word of register Ry
edb->dbin,irc = external data bus goes to data bus input and to IRC
(ir)->ird = value of IR goes to IRD
(pc)->db->au = PC goes to data bus, then to addressing unit as input
+2->au = other input to addressing unit is +2

ROAL4:
np = no memory access, process only
a1 = go to starting address A1
x = don't care about ALU function (I think this should be an xn instead of x)
alu -> ab -> ryh = output of ALU goes to address bus, then to high word of register Ry
(ir)->ird = value of IR goes to IRD
(pc)->db->au = PC goes to data bus, then to addressing unit as input
+2->au = other input to addressing unit is +2

In high level:
NNRW1 = start prefetch, do low word; ROAW2 = finish prefetch, save low word, compute PC which next instruction will prefetch

NNRL1 = start prefetch, do low word; NNRL2 = finish prefetch, save low word, do high word; ROAL4 = save high word, compute PC which next instruction will prefetch
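The two chains above can be turned into cycle counts with a small sketch in C. It walks the microword sequence from the transcripts (NNRW1 to ROAW2, and NNRL1 to NNRL2 to ROAL4) and charges a fixed cost per microword. The 2 clock cycles per microword figure is an assumption stated here, not something read off the transcripts, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: every microword takes 2 clock cycles. */
#define CYCLES_PER_MICROWORD 2

struct microword {
    const char *name;
    const char *next;   /* NULL when the word ends the instruction (a1) */
};

/* Branch targets taken from the transcripts above. */
static const struct microword microcode[] = {
    { "NNRW1", "ROAW2" },
    { "NNRL1", "NNRL2" },
    { "NNRL2", "ROAL4" },
    { "ROAW2", NULL },
    { "ROAL4", NULL },
};

static const struct microword *find_microword(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof microcode / sizeof microcode[0]; i++) {
        if (strcmp(microcode[i].name, name) == 0)
            return &microcode[i];
    }
    return NULL;
}

/* Follow the chain from a starting microword and sum the cycles. */
static int cycles_from(const char *start)
{
    int cycles = 0;
    const char *name = start;

    while (name != NULL) {
        const struct microword *mw = find_microword(name);
        assert(mw != NULL);
        cycles += CYCLES_PER_MICROWORD;
        name = mw->next;
    }
    return cycles;
}
```

Under that assumption the .b/.w chain comes out at 4 cycles and the .l chain at 6, matching the register timings quoted earlier in the thread.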

So I was misremembering a tiny bit, in that prefetch happens concurrently with doing the two halves of the operation on .l; but overall, I was correctly remembering that the bottleneck for .l is passing the data through the ALU: free half-register read + 1 ALU op, write + free half-register read + 1 ALU op, write.

I will leave moveq as an exercise, but will just note that it bypasses the ALU entirely and just sign-extends the second opcode byte and writes to the destination register directly as a full word on the first microopcode.
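At the instruction-set level (not the microcode level), the sign extension looks like the sketch below. It decodes a moveq opcode word using the standard layout 0111 rrr 0 dddddddd and returns the 32-bit value the destination register ends up holding. The function names are hypothetical, and this says nothing about how the microcode itself is organised.

```c
#include <assert.h>
#include <stdint.h>

/* MOVEQ layout: bits 15-12 = 0111, bits 11-9 = register,
 * bit 8 = 0, bits 7-0 = 8-bit immediate. */
static int is_moveq(uint16_t opcode)
{
    return (opcode & 0xF100u) == 0x7000u;
}

/* Destination data register number (0-7). */
static unsigned moveq_reg(uint16_t opcode)
{
    return (opcode >> 9) & 7u;
}

/* Value written to the register: the low opcode byte sign-extended
 * to 32 bits. */
static int32_t moveq_value(uint16_t opcode)
{
    int32_t v;
    assert(is_moveq(opcode));
    v = (int32_t)(opcode & 0xFFu);
    if (v & 0x80)
        v -= 0x100;     /* bit 7 set: negative immediate */
    return v;
}
```

For example, 0x7000 is moveq #0,d0 and 0x72FF is moveq #-1,d1.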

Hm. I wonder if anyone ever transcribed the whole microcode portion of US4325121... having it as a searchable document would be a lot better, as well as less error prone because of how bad the scan is.

Mask of Destiny · Post by **Mask of Destiny** » Tue Nov 21, 2017 9:25 pm

flamewing wrote: Tue Nov 21, 2017 8:26 pm Hm. I wonder if anyone ever transcribed the whole microcode portion of US4325121... having it as a searchable document would be a lot better, as well as less error prone because of how bad the scan is.

I've got just shy of 100 transcribed which is pretty far from complete, though it probably covers a lot of the more "interesting" cases. I've thought about trying to make a concerted effort to transcribe everything, but I really want to try and decode the production micro/nanocode instead. Worked on that a bit, but got some nonsensical results and haven't revisited it. Happy to share my transcription when I get home.

flamewing · Post by **flamewing** » Wed Nov 22, 2017 4:30 am

Mask of Destiny wrote: Tue Nov 21, 2017 9:25 pm I've got just shy of 100 transcribed which is pretty far from complete, though it probably covers a lot of the more "interesting" cases. I've thought about trying to make a concerted effort to transcribe everything, but I really want to try and decode the production micro/nanocode instead. Worked on that a bit, but got some nonsensical results and haven't revisited it. Happy to share my transcription when I get home.

I'd love to see that transcription, yes. Maybe put it on GitHub so people can contribute to finishing it up?

I'd also like to know more about the issues you faced when looking at production micro/nanocode, when/if you have the time.

Miquel · Post by **Miquel** » Thu Nov 23, 2017 12:26 am

flamewing wrote: Tue Nov 21, 2017 8:26 pm Microcode can write to a register, read another register and do an ALU operation on the same microcode. It can also do other things, like also increment PC and start a memory read or write cycle (with either the new or the old PC).
[...]

And it could calculate an integral. But that's not what we were talking about.

flamewing wrote: Tue Nov 21, 2017 8:26 pm The person who 2 minutes ago was avoiding byte instructions because he believed, without evidence, that they were slower and 2 bytes longer is trying to argue about microcode? Fine, if you want to make a fool of yourself, I will indulge you.

You are going personal. Please ignore me from now on.
