Faster x mod 384 (x%384)?
Moderator: BigEvilCorporation
Faster x mod 384 (x%384)?
Have you found a way to do "mod 384" faster than doing a division?
Since MD horizontal resolution is 320, not a n^2 number, the closest bigger computer friendly number is 384 (256+128). 512 is too big. As you may know having this kind of number helps a lot in computation, in this case when trying to subdivide playfield into close to screen areas.
Perhaps you have been in this situation before and already have found a solution. Kind to hear your thoughts.
Since MD horizontal resolution is 320, not a n^2 number, the closest bigger computer friendly number is 384 (256+128). 512 is too big. As you may know having this kind of number helps a lot in computation, in this case when trying to subdivide playfield into close to screen areas.
Perhaps you have been in this situation before and already have found a solution. Kind to hear your thoughts.
HELP. Spanish TVs are brain washing people to be hostile to me.
Re: Faster x mod 384 (x%384)?
If you're doing a lot of that, enough that the 140-cycle divu is too slow, you can precompute a table that takes just 2kb space (assuming 16-bit coords).
Code: Select all
const u16 table[1024] = { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ... };
u16 screen = table[x >> 6];
Re: Faster x mod 384 (x%384)?
Implying that x % 320 = x - floor(x / 320) * 320, all we need is to find floor(x / 320), which is "whole" divisions.
We need cool property.
I'll describe it using decimals system.
Assume you want to divide decimal by 500.
In other words, you want to find floor(x / 500).
Imagine you know already result, and this result is y.
then y*5*100 <= x
in other words, floor of (x/500) is how many y*5 fits into number floor(x / 100) which is just cut off two digits in the end
Now, same in binary: 320 = 32*10 = 32*2*5 = 64*5.
this means, that floor(x/320)=floor(floor(x/64)/5)
where floor(x/64) is simple bitwise shift right by 6. 64 = 2^6.
assume z = floor(x/64), then it is 10 bits long if x is 16 bits long.
now you need only make table to find floor(z/5), for all numbers 10 bits long.
but then you need to multiply result by 320 again.
I suggest using another table for this purpose.
other possible way is some divisibility property.
If latest 6 bits is discardable, all you need is to find remainder of divison by 5.
And, there is only 5 possible remainders: 0, 1, 2, 3, 4, 5.
Somehow (I guess) you may find it as for example in divisibility of 3 in decimals: all digits must have sum divisible by 3.
But, it can be slower.
Edit: easier. If you imagine long division method from school,
you'll find that it will look as:
subtracting 1 0100 0000 equivalent to subtracting 5 (101) shifted by 6.
from opposite, you can't subtract less than 1 0100 0000, this means shift is greater or equal to 6.
This means that it may subtract during process of division at following positions:
and result is equivalent to long division by 5 of shifted x by 6 with goal to get remainder.
result I mean remainder here. Three bits that you'll get, is same. Just because you do literally same.
This means: x % 320 = (floor(x/64)%5) * 64 + ( x % 64 ) = ( ((x >> 6) % 5) << 6) | (x & 0x3F)
where x & 0x3F is bitwise and.
With this method you need to precompute ((x % 5) << 6) table for 10 bits long numbers.
Edit: replaced ceil with floor.
We need cool property.
I'll describe it using decimals system.
Assume you want to divide decimal by 500.
In other words, you want to find floor(x / 500).
Imagine you know already result, and this result is y.
then y*5*100 <= x
in other words, floor of (x/500) is how many y*5 fits into number floor(x / 100) which is just cut off two digits in the end
Now, same in binary: 320 = 32*10 = 32*2*5 = 64*5.
this means, that floor(x/320)=floor(floor(x/64)/5)
where floor(x/64) is simple bitwise shift right by 6. 64 = 2^6.
assume z = floor(x/64), then it is 10 bits long if x is 16 bits long.
now you need only make table to find floor(z/5), for all numbers 10 bits long.
but then you need to multiply result by 320 again.
I suggest using another table for this purpose.
other possible way is some divisibility property.
If latest 6 bits is discardable, all you need is to find remainder of divison by 5.
And, there is only 5 possible remainders: 0, 1, 2, 3, 4, 5.
Somehow (I guess) you may find it as for example in divisibility of 3 in decimals: all digits must have sum divisible by 3.
But, it can be slower.
Edit: easier. If you imagine long division method from school,
you'll find that it will look as:
Code: Select all
xxxx xxxx xxxx xxxx |1 0100 0000
-1010 0000 0
.....
-1010 0000 0
....
from opposite, you can't subtract less than 1 0100 0000, this means shift is greater or equal to 6.
This means that it may subtract during process of division at following positions:
Code: Select all
xxxx xxxx xxxx xxxx |1 0100 0000
-101
-101
-10 1
...
-1 01
result I mean remainder here. Three bits that you'll get, is same. Just because you do literally same.
This means: x % 320 = (floor(x/64)%5) * 64 + ( x % 64 ) = ( ((x >> 6) % 5) << 6) | (x & 0x3F)
where x & 0x3F is bitwise and.
With this method you need to precompute ((x % 5) << 6) table for 10 bits long numbers.
Edit: replaced ceil with floor.
Last edited by r57shell on Tue Jul 25, 2017 8:33 am, edited 1 time in total.
Re: Faster x mod 384 (x%384)?
thanks! r57shell, I will take an slow and carefull look at it.
I prefer 384 over 320 because:
- This number still let you do a small scroll (looks better)
- Playfield, or map in RAM, is 4096x2048 pixels max in size, so using 384 fits better. (In true, playfield can take several sizes but all using numbers of 2^n serie, still 384 fits better).
So we have: 384 = 3*128. Mod of 128 is easy: x AND 127. For "x mod 3": I have found an easy solution:
( From http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml )
But unfortunately I can't put all together to work because I can't see any correlation. I have to take a better look at it.
I prefer 384 over 320 because:
- This number still let you do a small scroll (looks better)
- Playfield, or map in RAM, is 4096x2048 pixels max in size, so using 384 fits better. (In true, playfield can take several sizes but all using numbers of 2^n serie, still 384 fits better).
So we have: 384 = 3*128. Mod of 128 is easy: x AND 127. For "x mod 3": I have found an easy solution:
( From http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml )
Code: Select all
static inline u32 mod3( u32 a )
{
a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits a <= 0x1FFFE */
a = (a >> 8) + (a & 0xFF); /* sum base 2**8 digits a <= 0x2FD */
a = (a >> 4) + (a & 0xF); /* sum base 2**4 digits a <= 0x3C; worst case 0x3B */
a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x1D; worst case 0x1B */
a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x9; worst case 0x7 */
a = (a >> 2) + (a & 0x3); /* sum base 2**2 digits a <= 0x4 */
if (a > 2) a = a - 3;
return a;
}
HELP. Spanish TVs are brain washing people to be hostile to me.
Re: Faster x mod 384 (x%384)?
r57shell, I think you mean floor(), not ceil().
Re: Faster x mod 384 (x%384)?
Miquel, this C-written mod3 function has about 20 arithmetic operations. I'm afraid, in terms of M68k they will consume more cycles in total than one DIVU
Fast version will require pre-computed array.
Straightforward method is (don't be confused with different meaning of 'x'):
x mod 320 = x - x div 320 * 320.
1. x div 320 = table[x >> 6] - something like 20 cycles
2. x * 320 = (x << 6) * 5
x*5 = x << 2 + x
in asm (something like 24 cycles):
Fast version will require pre-computed array.
Straightforward method is (don't be confused with different meaning of 'x'):
x mod 320 = x - x div 320 * 320.
1. x div 320 = table[x >> 6] - something like 20 cycles
2. x * 320 = (x << 6) * 5
x*5 = x << 2 + x
in asm (something like 24 cycles):
Code: Select all
lsl.w #6, x
move.w x, x4
lsl.w #2, x4
add.w x4, x
Last edited by GManiac on Mon Jul 24, 2017 9:06 pm, edited 1 time in total.
Re: Faster x mod 384 (x%384)?
All code below is in Python
A key formula is the "general rule" in the article you linked earlier
Example:
To find residue of dividing by 320 you can find residues of 5 and 64 and then convert them to usual positional notation (it's not trivial)
https://en.wikipedia.org/wiki/Residue_number_system
Conversion example:
http://neo.dmcs.p.lodz.pl/csII/ca2_ResidueNS.pdf
Fast example In Russian:
http://wiki.tgl.net.ru/index.php/%D0%A1 ... 0%BE%D0%B2
Code demo for your task
Anyway, this method takes at least 70 cycles.
A key formula is the "general rule" in the article you linked earlier
Example:
Code: Select all
a = 123
m = 5
b = 16
# equal results
print( a % m )
print( ( (b % m) * (a // b) + (a % b) ) % m )
https://en.wikipedia.org/wiki/Residue_number_system
Conversion example:
http://neo.dmcs.p.lodz.pl/csII/ca2_ResidueNS.pdf
Fast example In Russian:
http://wiki.tgl.net.ru/index.php/%D0%A1 ... 0%BE%D0%B2
Code demo for your task
Code: Select all
# pre-compute table
table = [0]*320
for i in range(320):
table[64 * (i % 5 ) + (i % 64)] = i
a = 65533
mod64 = a & 0x3F;
#a mod 5
a = (a >> 8) + (a & 0xFF) # 256 % 5 = 1, a will become <= 510
a = (a >> 4) + (a & 0x0F) # 16 % 5 = 1, a will become <= 45
if (a > 14): a = a - 15;
if (a > 14): a = a - 15;
if (a > 4): a = a - 5;
if (a > 4): a = a - 5;
mod5 = a
mod320 = table[64*mod5 + mod64]
print(mod320)
Re: Faster x mod 384 (x%384)?
Yes, I mean floor instead ceil. Updated my post.
And, for full desc, here is C code:
Code: Select all
u16 rem320(u16 a)
{
return (table[a>>6]<<6) | (a & 0x3F);
}
you may run it here: http://cpp.sh/
Second version using *2 instead of <<6, which is faster if your compiler do:
Code: Select all
add d0, d0
instead of
lsr #1, d0
plus you'll have issues with addressing 2 bytes instead of 1 bytes, because you'll need to discard least significant bit somehow.
you can do all of this with other modulo. results depends on modulo.
Re: Faster x mod 384 (x%384)?
Great formula.
a = 56959
d1 = 21
d2 = 17
# d1 and d2 are random divisors
a % (d1*d2) == a // d1 % d2 * d1 + a % d1
a % 320 == a // 64 % 5 * 64 + a % 64
a % 320 == (((a >> 6) % 5) << 6) + (a & 0x3F)
r57shell, there's no need in "|" in next line, just use "+".
a = 56959
d1 = 21
d2 = 17
# d1 and d2 are random divisors
a % (d1*d2) == a // d1 % d2 * d1 + a % d1
a % 320 == a // 64 % 5 * 64 + a % 64
a % 320 == (((a >> 6) % 5) << 6) + (a & 0x3F)
r57shell, there's no need in "|" in next line, just use "+".
Code: Select all
| (a & 0x3F);
Re: Faster x mod 384 (x%384)?
To simplify the explanation I imply the operation is with (16bit) integers. But instead is with 16:16 fixed points. So really the operation is something like:
x mod 384<<16 where x can range from 0 to 4096<<16. (Meaning it uses signed 32 bit integers)
I can try to use a lousy trick to get rid off decimals, but I shouldn't...
HELP. Spanish TVs are brain washing people to be hostile to me.
Re: Faster x mod 384 (x%384)?
As I understand, integer part of your 16.16 number is stored in upper word of longword, and fractional part - in lower word.
It's not a problem, just use this formula again
a % (d1*d2) == a // d1 % d2 * d1 + a % d1
384<<16 = 24M (24 mega)
24M = 3 * 128 * 65536
a % 24M == a // 8M % 3 * 8M + a % 8M
a % 24M == (((a >> 23) % 3) << 23) | (a & 0x7FFFFF)
Or, to prevent calculations on longs you can drop fractional part and then restore it.
Code: Select all
# save fractional
move.w d0, d1
clr.w d0
swap d0
mod #384, d0
# restore fractional
swap d0
move.w d1, d0
Code: Select all
move.l $F004, d0 -- read the whole coordinate
move.w $F004, d0 -- read the integer part