## Faster x mod 384 (x%384)?

**Moderator:** BigEvilCorporation

### Faster x mod 384 (x%384)?

Have you found a way to do "mod 384" faster than doing a division?

Since MD horizontal resolution is 320, not a n^2 number, the closest bigger computer friendly number is 384 (256+128). 512 is too big. As you may know having this kind of number helps a lot in computation, in this case when trying to subdivide playfield into close to screen areas.

Perhaps you have been in this situation before and already have found a solution. Kind to hear your thoughts.

Since MD horizontal resolution is 320, not a n^2 number, the closest bigger computer friendly number is 384 (256+128). 512 is too big. As you may know having this kind of number helps a lot in computation, in this case when trying to subdivide playfield into close to screen areas.

Perhaps you have been in this situation before and already have found a solution. Kind to hear your thoughts.

### Re: Faster x mod 384 (x%384)?

If you're doing a lot of that, enough that the 140-cycle divu is too slow, you can precompute a table that takes just 2kb space (assuming 16-bit coords).

Code: Select all

```
const u16 table[1024] = { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ... };
u16 screen = table[x >> 6];
```

### Re: Faster x mod 384 (x%384)?

Implying that x % 320 = x - floor(x / 320) * 320, all we need is to find floor(x / 320), which is "whole" divisions.

We need cool property.

I'll describe it using decimals system.

Assume you want to divide decimal by 500.

In other words, you want to find floor(x / 500).

Imagine you know already result, and this result is y.

then y*5*100 <= x

in other words, floor of (x/500) is how many y*5 fits into number floor(x / 100) which is just cut off two digits in the end

Now, same in binary: 320 = 32*10 = 32*2*5 = 64*5.

this means, that floor(x/320)=floor(floor(x/64)/5)

where floor(x/64) is simple bitwise shift right by 6. 64 = 2^6.

assume z = floor(x/64), then it is 10 bits long if x is 16 bits long.

now you need only make table to find floor(z/5), for all numbers 10 bits long.

but then you need to multiply result by 320 again.

I suggest using another table for this purpose.

other possible way is some divisibility property.

If latest 6 bits is discardable, all you need is to find remainder of divison by 5.

And, there is only 5 possible remainders: 0, 1, 2, 3, 4, 5.

Somehow (I guess) you may find it as for example in divisibility of 3 in decimals: all digits must have sum divisible by 3.

But, it can be slower.

you'll find that it will look as:
subtracting 1 0100 0000 equivalent to subtracting 5 (101) shifted by 6.

from opposite, you can't subtract less than 1 0100 0000, this means shift is greater or equal to 6.

This means that it may subtract during process of division at following positions:
and result is equivalent to long division by 5 of shifted x by 6 with goal to get remainder.

result I mean remainder here. Three bits that you'll get, is same. Just because you do literally same.

This means: x % 320 = (floor(x/64)%5) * 64 + ( x % 64 ) = ( ((x >> 6) % 5) << 6) | (x & 0x3F)

where x & 0x3F is bitwise and.

With this method you need to precompute ((x % 5) << 6) table for 10 bits long numbers.

Edit: replaced ceil with floor.

We need cool property.

I'll describe it using decimals system.

Assume you want to divide decimal by 500.

In other words, you want to find floor(x / 500).

Imagine you know already result, and this result is y.

then y*5*100 <= x

in other words, floor of (x/500) is how many y*5 fits into number floor(x / 100) which is just cut off two digits in the end

Now, same in binary: 320 = 32*10 = 32*2*5 = 64*5.

this means, that floor(x/320)=floor(floor(x/64)/5)

where floor(x/64) is simple bitwise shift right by 6. 64 = 2^6.

assume z = floor(x/64), then it is 10 bits long if x is 16 bits long.

now you need only make table to find floor(z/5), for all numbers 10 bits long.

but then you need to multiply result by 320 again.

I suggest using another table for this purpose.

other possible way is some divisibility property.

If latest 6 bits is discardable, all you need is to find remainder of divison by 5.

And, there is only 5 possible remainders: 0, 1, 2, 3, 4, 5.

Somehow (I guess) you may find it as for example in divisibility of 3 in decimals: all digits must have sum divisible by 3.

But, it can be slower.

**Edit:**easier. If you imagine long division method from school,you'll find that it will look as:

Code: Select all

```
xxxx xxxx xxxx xxxx |1 0100 0000
-1010 0000 0
.....
-1010 0000 0
....
```

from opposite, you can't subtract less than 1 0100 0000, this means shift is greater or equal to 6.

This means that it may subtract during process of division at following positions:

Code: Select all

```
xxxx xxxx xxxx xxxx |1 0100 0000
-101
-101
-10 1
...
-1 01
```

result I mean remainder here. Three bits that you'll get, is same. Just because you do literally same.

This means: x % 320 = (floor(x/64)%5) * 64 + ( x % 64 ) = ( ((x >> 6) % 5) << 6) | (x & 0x3F)

where x & 0x3F is bitwise and.

With this method you need to precompute ((x % 5) << 6) table for 10 bits long numbers.

Edit: replaced ceil with floor.

Last edited by r57shell on Tue Jul 25, 2017 8:33 am, edited 1 time in total.

### Re: Faster x mod 384 (x%384)?

thanks! r57shell, I will take an slow and carefull look at it.

I prefer 384 over 320 because:

- This number still let you do a small scroll (looks better)

- Playfield, or map in RAM, is 4096x2048 pixels max in size, so using 384 fits better. (In true, playfield can take several sizes but all using numbers of 2^n serie, still 384 fits better).

So we have: 384 = 3*128. Mod of 128 is easy: x AND 127. For "x mod 3": I have found an easy solution:

( From http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml )

But unfortunately I can't put all together to work because I can't see any correlation. I have to take a better look at it.

I prefer 384 over 320 because:

- This number still let you do a small scroll (looks better)

- Playfield, or map in RAM, is 4096x2048 pixels max in size, so using 384 fits better. (In true, playfield can take several sizes but all using numbers of 2^n serie, still 384 fits better).

So we have: 384 = 3*128. Mod of 128 is easy: x AND 127. For "x mod 3": I have found an easy solution:

( From http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml )

Code: Select all

```
static inline u32 mod3( u32 a )
{
a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits a <= 0x1FFFE */
a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits a <= 0x2FD */
a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits a <= 0x3C; worst case 0x3B */
a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x1D; worst case 0x1B */
a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x9; worst case 0x7 */
a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x4 */
if (a > 2) a = a - 3;
return a;
}
```

### Re: Faster x mod 384 (x%384)?

r57shell, I think you mean floor(), not ceil().

### Re: Faster x mod 384 (x%384)?

Miquel, this C-written mod3 function has about 20 arithmetic operations. I'm afraid, in terms of M68k they will consume more cycles in total than one DIVU

Fast version will require pre-computed array.

Straightforward method is (don't be confused with different meaning of 'x'):

x mod 320 = x - x div 320 * 320.

1. x div 320 = table[x >> 6] - something like 20 cycles

2. x * 320 = (x << 6) * 5

x*5 = x << 2 + x

in asm (something like 24 cycles):

Fast version will require pre-computed array.

Straightforward method is (don't be confused with different meaning of 'x'):

x mod 320 = x - x div 320 * 320.

1. x div 320 = table[x >> 6] - something like 20 cycles

2. x * 320 = (x << 6) * 5

x*5 = x << 2 + x

in asm (something like 24 cycles):

Code: Select all

```
lsl.w #6, x
move.w x, x4
lsl.w #2, x4
add.w x4, x
```

Last edited by GManiac on Mon Jul 24, 2017 9:06 pm, edited 1 time in total.

### Re: Faster x mod 384 (x%384)?

All code below is in Python

A key formula is the "general rule" in the article you linked earlier

Example:
To find residue of dividing by 320 you can find residues of 5 and 64 and then convert them to usual positional notation (it's not trivial)

https://en.wikipedia.org/wiki/Residue_number_system

Conversion example:

http://neo.dmcs.p.lodz.pl/csII/ca2_ResidueNS.pdf

Fast example In Russian:

http://wiki.tgl.net.ru/index.php/%D0%A1 ... 0%BE%D0%B2

Code demo for your task
Anyway, this method takes at least 70 cycles.

A key formula is the "general rule" in the article you linked earlier

Example:

Code: Select all

```
a = 123
m = 5
b = 16
# equal results
print( a % m )
print( ( (b % m) * (a // b) + (a % b) ) % m )
```

https://en.wikipedia.org/wiki/Residue_number_system

Conversion example:

http://neo.dmcs.p.lodz.pl/csII/ca2_ResidueNS.pdf

Fast example In Russian:

http://wiki.tgl.net.ru/index.php/%D0%A1 ... 0%BE%D0%B2

Code demo for your task

Code: Select all

```
# pre-compute table
table = [0]*320
for i in range(320):
table[64 * (i % 5 ) + (i % 64)] = i
a = 65533
mod64 = a & 0x3F;
#a mod 5
a = (a >> 8) + (a & 0xFF) # 256 % 5 = 1, a will become <= 510
a = (a >> 4) + (a & 0x0F) # 16 % 5 = 1, a will become <= 45
if (a > 14): a = a - 15;
if (a > 14): a = a - 15;
if (a > 4): a = a - 5;
if (a > 4): a = a - 5;
mod5 = a
mod320 = table[64*mod5 + mod64]
print(mod320)
```

### Re: Faster x mod 384 (x%384)?

Yes, I mean floor instead ceil. Updated my post.

And, for full desc, here is C code:

Code: Select all

```
u16 rem320(u16 a)
{
return (table[a>>6]<<6) | (a & 0x3F);
}
```

you may run it here: http://cpp.sh/

Second version using

***2**instead of

**<<6**, which is faster if your compiler do:

Code: Select all

```
add d0, d0
instead of
lsr #1, d0
```

plus you'll have issues with addressing 2 bytes instead of 1 bytes, because you'll need to discard least significant bit somehow.

you can do all of this with other modulo. results depends on modulo.

### Re: Faster x mod 384 (x%384)?

Great formula.

a = 56959

d1 = 21

d2 = 17

# d1 and d2 are random divisors

a % 320 == a // 64 % 5 * 64 + a % 64

a % 320 == (((a >> 6) % 5) << 6) + (a & 0x3F)

r57shell, there's no need in "|" in next line, just use "+".

a = 56959

d1 = 21

d2 = 17

# d1 and d2 are random divisors

**a % (d1*d2) == a // d1 % d2 * d1 + a % d1**a % 320 == a // 64 % 5 * 64 + a % 64

a % 320 == (((a >> 6) % 5) << 6) + (a & 0x3F)

r57shell, there's no need in "|" in next line, just use "+".

Code: Select all

`| (a & 0x3F);`

### Re: Faster x mod 384 (x%384)?

To simplify the explanation I imply the operation is with (16bit) integers. But instead is with 16:16 fixed points. So really the operation is something like:

x mod 384<<16 where x can range from 0 to 4096<<16. (Meaning it uses signed 32 bit integers)

I can try to use a lousy trick to get rid off decimals, but I shouldn't...

### Re: Faster x mod 384 (x%384)?

As I understand, integer part of your 16.16 number is stored in upper word of longword, and fractional part - in lower word.

It's not a problem, just use this formula again

a % (d1*d2) == a // d1 % d2 * d1 + a % d1

384<<16 = 24M (24 mega)

24M = 3 * 128 * 65536

a % 24M == a // 8M % 3 * 8M + a % 8M

a % 24M == (((a >> 23) % 3) << 23) | (a & 0x7FFFFF)

Or, to prevent calculations on longs you can drop fractional part and then restore it.

Code: Select all

```
# save fractional
move.w d0, d1
clr.w d0
swap d0
mod #384, d0
# restore fractional
swap d0
move.w d1, d0
```

Code: Select all

```
move.l $F004, d0 -- read the whole coordinate
move.w $F004, d0 -- read the integer part
```