Glide 32X

ob1 · Post by **ob1** » Wed May 09, 2018 6:55 am

All right, I've been trying - for long - to make a 3D engine for the 32X.
I failed.
Twice.
At least.
But there's no reason not to share a post mortem ^^

So,
here is how I compute the slopes of a triangle

Rasteror how to draw a triangle without grDrawTriangle, a FPU, or even a DIV!

Slopes

Code: Select all

   O
    +---> x
    |
    |		+ A (160, 50)
    V          / \
               |  \
    y          |   \
              /     \
              |      \
              |       \
             /         \
             |          \
             + B (130, 100)
              \           \
               ---\        \
                   ---\     \
                       ---\  \
                           ---+ C (260, 150)

From the A point, I'm going to get down, 1 line after 1 line.
On each next line y, I'm going to have 2 points whose x-coordinate will be
- x1, on the [AB] segment
- x2, on the [AC] segment

x1 = xA + p * (y - yA), p being the reciprocal slope of the [AB] segment
x2 = xA + q * (y - yC), q being the reciprocal slope of the [AC] segment

Thus, p = (yB - yA) / (xB - xA)

Problem: I can't compute any decimal value on the 32X (and division is slow, ~20 cycles).

Introduction to fixed-point

Let's consider [AB]

Code: Select all

   O
    +---> x
    |
    |		+ A (160, 50)
    V          /
               |
    y          |
              /
              |
              |
             /
             |
             + B (130, 100)

Let p = (yB - yA) / (xB - xA) = (100 - 50) / (130 - 160) = 50 / (-30) = -1.666 ~ -1 - .65625 = - 1 - 21/32
So, starting from xA, I will reach xB, as long as, for each next y, I remove 1 (the floor part), AND, 21 line of 32, I also remove 1.

Question is: on which 21 lines from 32 will I also remove 1 ?

Fixed-point arithmetic

Let's consider 5/32 = .15625.
On the first next line, I'll add nothing. So, x value will be the floor part, and that's all.
On the second line, I'll add the floor part, AND 1 * 0.15625 = 0.15625, which is rounded to 0.
On the third line, I'll add the floor part, AND 2 * 0.15625 = 0.3125, which is rounded to 0.
On the fourth line, I'll add the floor part, AND 3 * 0.15625 = 0.46875, which is rounded to 0.
On the fifth line, I'll add the floor part, AND 4 * 0.15625 = 0.625, which is rounded to 1.
...
On the 32nd line, I'll add the floor part, AND 31 * 0.15625 = 4.84375, which is rounded to 5.

What we will focus on is the rounded part, and especially WHEN did it change.
On the first line, the rounded part was 0.
On the second line, the rounded part was 0.
On the third line, the rounded part was 0.
On the fourth line, the rounded part was 0.
On the fifth line, the rounded part was 1, we increased by 1.
...
And so on.

So,
On the first line, I didn't add anything.
On the second line, I didn't add anything.
On the third line, I didn't add anything.
On the fourth line, I didn't add anything.
On the fifth line, I did add 1.
...
And so on.

Finally, for 5/32, I get
Line 0, I add 0
Line 1, I add 0
Line 2, I add 0
Line 3, I add 0
Line 4, I add 1
Line 5, I add 0
Line 6, I add 0
Line 7, I add 0
Line 8, I add 0
Line 9, I add 0
Line 10, I add 1
Line 11, I add 0
...
Line 28, I add 0
Line 29, I add 1
Line 30, I add 0
Line 31, I add 0

Let's note that for 5/32, and considering 32 lines, I added 1 exactly 5 times.
If I had just 10 lines to raster, I would have add 1 once (line 0 to line 9).
Which may be slightly different from 10 * 5/ 32, but it's the rounding error.

What if I got more than 32 lines to draw, you ask ?
Let's just start from line 0 again!

Implementation
Of course, I could have gone to 6-bits precision (1/64), or even more.
And I could have computed for, eg 21/64, and the 64 times I would have incremented, or not.
But I didn't need the extra-precision.

Let's now consider 5-bit precision, which tells me 32 times whether or not I have to increment.
32 times true or false.
32 times a binary value.
32 times one bit.
So, a 32 bit value.
Sticking to 5/32, it's easy to write its increments horizontally:
0 0 0 0 1 0 0 0 0 0 1 0 0 ... 0 0 1 0 0
, then wrapping it all together, binary style:
0000100000100 ... 00100b
, then put it in a 32-bit register, hexa-style:
08208104h
, then rotate it multiple times to the left

Code: Select all

 +---+       +----------+    
 | T |<--+---| 08208104 |<--+
 +---+   |   +----------+   |
         |                  |
         +------------------+

Thus, on each next line, I rotate my value to the left,
the MSb is moved into T,
I then move T into a register with the MOVT instruction,
finally, I add this register to my xA register.

No test, no lookup-table, plus, 2 main advantages:
- reduced memory footprint
- I don't have to bother for the modulus, ROTL does this for me!

#Wrapping it all together
Here are the values for 5-bit precision.
I add to adjust the value above 15/32 to alias some roundig errors.
If your decimal value is between 0 and 1/32 ( 0.03125), use 0.
If your decimal value is between 1/32 ( 0.03125) and 2/32 (0.0625), use $00008000.
If your decimal value is between 2/32 (0.0625) and 3/32 (0.09375), use $00800080.
If your decimal value is between 3/32 (0.09375) and 4/32 (0.125), use $08080808.
If your decimal value is between 4/32 (0.125) and 5/32 (0.15625), use $08208104.
...
And so on.

DC.L $00000000,$00008000,$00800080,$02008010
DC.L $08080808,$08208104,$10821082,$11088422
DC.L $22222222,$22448891,$24892489,$24929249
DC.L $29292929,$294A94A5,$2A952A95,$2AAA9555
DC.L $55555555,$5555AAAB,$55AAD5AB,$56B5AD6B
DC.L $5ADADADB,$5B6DB6DB,$5BB6DDB7,$5DDBBB77
DC.L $6EEEEEEF,$6F77BDEF,$6FBEF7DF,$77EFDFBF
DC.L $7BFDFEFF,$7EFFEFFF,$7FEFFFFF,$7FFFFFFF

## How to do a real DIV without, floating-point nor DIV
In your ROM, just put a lookup table, eg 512 x 512. Let's say that the dividend will be on "lines", and divisors on "columns".
Let's say that each value will be, eg 9 bits for the integer part, and 5 bits for the "floating" part.
The "floating" part will be expressed in 32th of 1, eg 5 for 0.15625.
Each value will use 16 bits (2 bits for padding, 9 for the integer part, and 5 for the floating part), or 2-bytes.
I'll then use the "floating" part as an offset for the rotate table above.
Thus, 216 / 60 (= 3.6) is located at line 216, column 60, or, (216 * 512 + 60) * 2 = 221 304.
I will find there 0x0073

Code: Select all

+--------+-------+-------+--------+
| 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 |
+----+-----------------+----------+
 padd     integer part   how many 32-th

which is equals to 3 + 19/32 = 3.59375

Drawing a triangle
So, for each 2D-triangle (we'll see the 3D -> 2D projection another time), I sorted the vertices (well ... pixels), A being the uppermost, then B being the leftmost.
I compute the slopes (with the fixed-decimal part, remember?), and I figure two pixels starting from A : x1 and x2.
I increment the line, and add the left slope integer part to x1, and the right slope integer part to x2.
I rotate the left slope decimal part and if T is true, I increment (or decrement) x1 by 1 pixel.
I rotate the right slope decimal part and if T is true, I increment (or decrement) x2 by 1 pixel.
... till I'm at yB (or yC).
This is the upper part of my triangle.
Then, I compute the last slope, and to the same for the remaining of the triangle. That would be the lower part of my triangle.

Problem ?
It's sooooo damn slow.
While I'm not even projecting, rotating, or lighting, and even while I'm only drawing half of a triangle, I can barely reach 8 triangles @60fps. I'm pretty sure Virtua Racing was better :/

Anyway,
for anyone interested, and just for everyone to notice I'm still alive, here it is :
https://www.dropbox.com/sh/ojhwag0hvz2k ... Hcw1a?dl=0
As a bonus, a binary table to convert 16-bits to 8-bits. Could be useful for Z-buffer

cero · Post by **cero** » Wed May 09, 2018 7:37 am

DDA rendering is faster than fixed point. The details to one can be found in the Game Developer Magazine archives, starting from 1995 AprMay - all available online for free.

ob1 · Post by **ob1** » Wed May 09, 2018 9:16 am

I've stated fixed-decimal, but only the representation is decimal (10 bits for the integer part, 5 bits for the "fractional" part).
My implementation uses something between Digital Differential Analyzer and Bresenham's line algorithm.
PS : didn't know of GDM. Thx.

Sik · Post by **Sik** » Wed May 09, 2018 9:37 am

The most common way to do software rendering of triangles is to use something akin to bresenham lines for calculating the sides (really just the interpolation part). Then you don't need to do multiplication or division at all.

This is the gist of the interpolation if you wonder (basically error accumulation):

Code: Select all

x += xdir
accum += ydist
if accum >= xdist
    accum -= xdist
    y += ydir
end

Where

x = longer axis
y = shorter axis
xdir = direction of X axis (+1 or -1)
ydir = direction of Y axis (+1 or -1)
xdist = distance between x1 and x2 (i.e. abs(x2-x1))
ydist = distance between y1 and y2 (i.e. abs(y2-y1))
accum = accumulated error

Start with accum at either 0 or (xdist >> 1) or whatever. Sort that one out.

Note that it's possible the slow down comes from elsewhere, e.g. screwing up how you're doing the pixel writes to the framebuffer. So you need to check out that stuff first if you ever want to try again.

ob1 · Post by **ob1** » Wed May 09, 2018 6:01 pm

OR,
I could use 32X Auto Fill and climb from 8 to 68 triangles per frame.
Still a pity, but ...

Chilly Willy · Post by **Chilly Willy** » Thu May 10, 2018 1:38 pm

The best old-school renderer articles were by Chris Hecker for GDM. You can find them one his web site here:
http://chrishecker.com/Miscellaneous_Technical_Articles

This is rather close to how quads are rendered in Yeti3D, but the affine loops are in C. I've always intended to go back into my Yeti3D demo and replace the rendering, but never got around to it. As is, the Yeti3D demo does very roughly about 1200 polys per second on the 32X, with finding which polys to render being a significant amount of the slow down.

If you're doing more than 16 bit division, use the hardware divider built into the SH2 instead. You can do other things while it's dividing. The Saturn example code from Sega for their 3D uses the hw divider by default for all division in handling vertexes and such.

ob1 · Post by **ob1** » Thu May 10, 2018 2:07 pm

Great answer and lots of useful informations, as usual. Thank you very much Chilly ^^

ehaliewicz · Post by **ehaliewicz** » Thu May 24, 2018 6:28 pm

Sik wrote: ↑
Wed May 09, 2018 9:37 am
Note that it's possible the slow down comes from elsewhere, e.g. screwing up how you're doing the pixel writes to the framebuffer. So you need to check out that stuff first if you ever want to try again.

I remember writing a sprite blitting engine for the 32X once that I think was severely bottlenecked by framebuffer writes.
What's the correct way to do it?

Chilly Willy · Post by **Chilly Willy** » Thu May 24, 2018 8:26 pm

ehaliewicz wrote: ↑
Thu May 24, 2018 6:28 pm

Sik wrote: ↑
Wed May 09, 2018 9:37 am
Note that it's possible the slow down comes from elsewhere, e.g. screwing up how you're doing the pixel writes to the framebuffer. So you need to check out that stuff first if you ever want to try again.
I remember writing a sprite blitting engine for the 32X once that I think was severely bottlenecked by framebuffer writes.
What's the correct way to do it?

Frame buffer writes take 3 cycles per word until the FIFO fills (four words), then empty at 5 cycles per word. If you're filling pixels with the cpu, you need to arrange the code so that you take at least that many cycles between writes or you wind up wasting time waiting on a FIFO slot. You could also draw to an off-screen buffer in SDRAM at 2 cycles per word, then transfer the off-screen buffer to the frame buffer using DMA. If you use the 16 byte transfer size, the DMA will use burst reads from SDRAM (1.5 cycles per word). The writes will still be no faster than 5 cycles per word after the first four words, but you could schedule your DMA for times when you're otherwise not doing anything.

SpritesMind.Net

Glide 32X

Glide 32X

Re: Glide 32X

Re: Glide 32X

Re: Glide 32X

Re: Glide 32X

Re: Glide 32X

Re: Glide 32X

Re: Glide 32X

Re: Glide 32X