Faster x mod 384 (x%384)?

Miquel · Post by **Miquel** » Sun Jul 23, 2017 5:33 pm

Have you found a way to do "mod 384" faster than doing a division?

Since MD horizontal resolution is 320, not a n^2 number, the closest bigger computer friendly number is 384 (256+128). 512 is too big. As you may know having this kind of number helps a lot in computation, in this case when trying to subdivide playfield into close to screen areas.

Perhaps you have been in this situation before and already have found a solution. Kind to hear your thoughts.

cero · Post by **cero** » Sun Jul 23, 2017 5:59 pm

If you're doing a lot of that, enough that the 140-cycle divu is too slow, you can precompute a table that takes just 2kb space (assuming 16-bit coords).

Code: Select all

const u16 table[1024] = { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ... };

u16 screen = table[x >> 6];

r57shell · Post by **r57shell** » Mon Jul 24, 2017 10:06 am

Implying that x % 320 = x - floor(x / 320) * 320, all we need is to find floor(x / 320), which is "whole" divisions.

We need cool property.
I'll describe it using decimals system.
Assume you want to divide decimal by 500.
In other words, you want to find floor(x / 500).
Imagine you know already result, and this result is y.
then y*5*100 <= x
in other words, floor of (x/500) is how many y*5 fits into number floor(x / 100) which is just cut off two digits in the end

Now, same in binary: 320 = 32*10 = 32*2*5 = 64*5.
this means, that floor(x/320)=floor(floor(x/64)/5)
where floor(x/64) is simple bitwise shift right by 6. 64 = 2^6.
assume z = floor(x/64), then it is 10 bits long if x is 16 bits long.
now you need only make table to find floor(z/5), for all numbers 10 bits long.
but then you need to multiply result by 320 again.
I suggest using another table for this purpose.

other possible way is some divisibility property.
If latest 6 bits is discardable, all you need is to find remainder of divison by 5.
And, there is only 5 possible remainders: 0, 1, 2, 3, 4, 5.
Somehow (I guess) you may find it as for example in divisibility of 3 in decimals: all digits must have sum divisible by 3.
But, it can be slower.

Edit: easier. If you imagine long division method from school,
you'll find that it will look as:

Code: Select all

 xxxx xxxx xxxx xxxx |1 0100 0000
-1010 0000 0
.....
     -1010 0000 0
     ....

subtracting 1 0100 0000 equivalent to subtracting 5 (101) shifted by 6.
from opposite, you can't subtract less than 1 0100 0000, this means shift is greater or equal to 6.
This means that it may subtract during process of division at following positions:

Code: Select all

 xxxx xxxx xxxx xxxx |1 0100 0000
-101
 -101
  -10 1
   ...
        -1 01

and result is equivalent to long division by 5 of shifted x by 6 with goal to get remainder.
result I mean remainder here. Three bits that you'll get, is same. Just because you do literally same.
This means: x % 320 = (floor(x/64)%5) * 64 + ( x % 64 ) = ( ((x >> 6) % 5) << 6) | (x & 0x3F)
where x & 0x3F is bitwise and.
With this method you need to precompute ((x % 5) << 6) table for 10 bits long numbers.

Edit: replaced ceil with floor.

Miquel · Post by **Miquel** » Mon Jul 24, 2017 5:02 pm

thanks! r57shell, I will take an slow and carefull look at it.

I prefer 384 over 320 because:
- This number still let you do a small scroll (looks better)
- Playfield, or map in RAM, is 4096x2048 pixels max in size, so using 384 fits better. (In true, playfield can take several sizes but all using numbers of 2^n serie, still 384 fits better).

So we have: 384 = 3*128. Mod of 128 is easy: x AND 127. For "x mod 3": I have found an easy solution:

( From http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml )

Code: Select all

static inline u32 mod3( u32 a )
{
    a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits a <= 0x1FFFE */
    a = (a >>  8) + (a & 0xFF);   /* sum base 2**8 digits a <= 0x2FD */
    a = (a >>  4) + (a & 0xF);    /* sum base 2**4 digits a <= 0x3C; worst case 0x3B */
    a = (a >>  2) + (a & 0x3);    /* sum base 2**2 digits a <= 0x1D; worst case 0x1B */
    a = (a >>  2) + (a & 0x3);    /* sum base 2**2 digits a <= 0x9; worst case 0x7 */
    a = (a >>  2) + (a & 0x3);    /* sum base 2**2 digits a <= 0x4 */
    if (a > 2) a = a - 3;
    return a;
}

But unfortunately I can't put all together to work because I can't see any correlation. I have to take a better look at it.

cero · Post by **cero** » Mon Jul 24, 2017 6:02 pm

r57shell, I think you mean floor(), not ceil().

GManiac · Post by **GManiac** » Mon Jul 24, 2017 7:21 pm

Miquel, this C-written mod3 function has about 20 arithmetic operations. I'm afraid, in terms of M68k they will consume more cycles in total than one DIVU

Fast version will require pre-computed array.
Straightforward method is (don't be confused with different meaning of 'x'):
x mod 320 = x - x div 320 * 320.
1. x div 320 = table[x >> 6] - something like 20 cycles

2. x * 320 = (x << 6) * 5
x*5 = x << 2 + x
in asm (something like 24 cycles):

Code: Select all

lsl.w #6, x
move.w x, x4
lsl.w #2, x4
add.w x4, x

GManiac · Post by **GManiac** » Mon Jul 24, 2017 9:02 pm

All code below is in Python
A key formula is the "general rule" in the article you linked earlier
Example:

Code: Select all

a = 123
m = 5
b = 16
# equal results
print( a % m )
print( ( (b % m) * (a // b) + (a % b) ) % m )

To find residue of dividing by 320 you can find residues of 5 and 64 and then convert them to usual positional notation (it's not trivial)
https://en.wikipedia.org/wiki/Residue_number_system
Conversion example:
http://neo.dmcs.p.lodz.pl/csII/ca2_ResidueNS.pdf
Fast example In Russian:
http://wiki.tgl.net.ru/index.php/%D0%A1 ... 0%BE%D0%B2

Code demo for your task

Code: Select all

# pre-compute table
table = [0]*320
for i in range(320):
    table[64 * (i % 5 ) + (i % 64)] = i

a = 65533
mod64 = a & 0x3F;

#a mod 5
a = (a >> 8) + (a & 0xFF) # 256 % 5 = 1, a will become <= 510
a = (a >> 4) + (a & 0x0F) #  16 % 5 = 1, a will become <=  45
if (a > 14): a = a - 15;
if (a > 14): a = a - 15;
if (a > 4): a = a - 5;
if (a > 4): a = a - 5;
mod5 = a

mod320 = table[64*mod5 + mod64]
print(mod320)

Anyway, this method takes at least 70 cycles.

r57shell · Post by **r57shell** » Tue Jul 25, 2017 8:31 am

cero wrote: Mon Jul 24, 2017 6:02 pm r57shell, I think you mean floor(), not ceil().

Yes, I mean floor instead ceil. Updated my post.
And, for full desc, here is C code:

Code: Select all

u16 rem320(u16 a)
{
	return (table[a>>6]<<6) | (a & 0x3F);
}

Try this code (my sanity check): https://gist.github.com/realmonster/3ab ... 3abc7ed895
you may run it here: http://cpp.sh/

Second version using *2 instead of <<6, which is faster if your compiler do:

Code: Select all

add d0, d0
instead of
lsr #1, d0

and, to do so, you should keep ((x%5)<<5) in table, because <<6 will overflow 8 bits,
plus you'll have issues with addressing 2 bytes instead of 1 bytes, because you'll need to discard least significant bit somehow.

you can do all of this with other modulo. results depends on modulo.

GManiac · Post by **GManiac** » Tue Jul 25, 2017 9:21 am

Great formula.
a = 56959
d1 = 21
d2 = 17
# d1 and d2 are random divisors
a % (d1*d2) == a // d1 % d2 * d1 + a % d1

a % 320 == a // 64 % 5 * 64 + a % 64
a % 320 == (((a >> 6) % 5) << 6) + (a & 0x3F)

r57shell, there's no need in "|" in next line, just use "+".

Code: Select all

| (a & 0x3F);

Miquel · Post by **Miquel** » Tue Jul 25, 2017 7:42 pm

GManiac wrote: Mon Jul 24, 2017 7:21 pm Miquel, this C-written mod3 function has about 20 arithmetic operations. I'm afraid, in terms of M68k they will consume more cycles in total than one DIVU

To simplify the explanation I imply the operation is with (16bit) integers. But instead is with 16:16 fixed points. So really the operation is something like:
x mod 384<<16 where x can range from 0 to 4096<<16. (Meaning it uses signed 32 bit integers)

I can try to use a lousy trick to get rid off decimals, but I shouldn't...

GManiac · Post by **GManiac** » Wed Jul 26, 2017 8:21 am

Miquel wrote: Tue Jul 25, 2017 7:42 pm To simplify the explanation I imply the operation is with (16bit) integers. But instead is with 16:16 fixed points. So really the operation is something like:
x mod 384<<16 where x can range from 0 to 4096<<16. (Meaning it uses signed 32 bit integers)

As I understand, integer part of your 16.16 number is stored in upper word of longword, and fractional part - in lower word.
It's not a problem, just use this formula again
a % (d1*d2) == a // d1 % d2 * d1 + a % d1
384<<16 = 24M (24 mega)
24M = 3 * 128 * 65536
a % 24M == a // 8M % 3 * 8M + a % 8M
a % 24M == (((a >> 23) % 3) << 23) | (a & 0x7FFFFF)

Or, to prevent calculations on longs you can drop fractional part and then restore it.

Code: Select all

# save fractional
move.w d0, d1
clr.w d0
swap d0
mod #384, d0
# restore fractional
swap d0
move.w d1, d0

Or, if this coordinate is stored in memory, you may just skip reading fractional part and read the integer part alone:

Code: Select all

move.l $F004, d0       -- read the whole coordinate
move.w $F004, d0       -- read the integer part

SpritesMind.Net

Faster x mod 384 (x%384)?

Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?

Re: Faster x mod 384 (x%384)?