VLBitmap2Scroll

Sik · Post by **Sik** » Fri Apr 18, 2008 2:51 am

Here goes a subroutine made for VideoLib 3 (remind me to upload the patched VL2 to my page; VL3 isn't released yet). You may find it useful. Basically, it loads a 4-bit bitmap into VRAM as tiles. The tiles are stored in VRAM in this order:

01 02 03 04 05
06 07 08 09 10
11 12 13 14 15

And so on

You get the idea.
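For anyone who prefers to see the layout spelled out, here is a rough Python model of the reordering (not the 68k routine itself). It assumes a linear 4-bit bitmap with two pixels per byte, so one tile line is 4 bytes; the function name `bitmap_to_tiles` is made up for this sketch.

```python
def bitmap_to_tiles(bitmap: bytes, width_tiles: int, height_tiles: int) -> bytes:
    """Reorder a linear 4bpp bitmap into 8x8 tiles, left to right, top to bottom."""
    stride = width_tiles * 4                     # bytes per pixel row (8 pixels = 4 bytes)
    out = bytearray()
    for tile_y in range(height_tiles):
        for tile_x in range(width_tiles):
            base = tile_y * 8 * stride + tile_x * 4
            for line in range(8):                # the eight pixel rows of one tile
                start = base + line * stride
                out += bitmap[start:start + 4]
    return bytes(out)


# Tiny 2x1-tile bitmap: every byte is tagged with the tile column it belongs to.
demo = bytes([0, 0, 0, 0, 1, 1, 1, 1] * 8)
tiles = bitmap_to_tiles(demo, 2, 1)
print(tiles[:32] == bytes([0] * 32), tiles[32:] == bytes([1] * 32))
```

Each tile ends up as 32 consecutive bytes, which is the order the data has to reach the VDP data port.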

This subroutine seems to be very fast. Under Fusion, I managed to load a 320x224 bitmap into VRAM in about two frames. The function is very sensitive to size changes: a 256x192 bitmap loads about twice as fast. Again, this is under Fusion. I have no idea about real hardware, but it shouldn't differ too much since Fusion has FIFO emulation.

Also, before you ask, I didn't use (a0,a2) or similar because I'm not sure whether that is valid on the 68000. Could somebody please confirm this? If it is, dealing with the precalculated table would be a lot easier (just so you know, I'm building a table and keeping it in registers, and that's extremely crazy).

Parameters:

d0.b: bitmap width (in tiles)
d1.b: bitmap height (in tiles)
d2.w: index of first tile in VRAM (0..2047)
a0.l: pointer to the bitmap (68k address, even).

Code:

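VLBitmap2Scroll:
    movem.l d0-d7/a0-a1, -(sp)      ; save every register used below

    ; Build the VRAM write command for the first tile (address = tile * 32).
    andi.l  #2047, d2               ; keep the tile index in 0..2047
    lsl.l   #7, d2                  ; address << 2, so A15-A14 reach the high word
    lsr.w   #2, d2                  ; low word = A13-A0
    ori.w   #$4000, d2              ; VRAM write
    swap    d2                      ; high word = $4000|A13-A0, low word = A15-A14
    move.l  d2, ($C00004).l         ; VDP control port

    ; d0 = row stride in bytes (4 bytes per tile), d1 = tile rows - 1.
    andi.l  #$FF, d0                ; width is passed as a byte
    lsl.w   #2, d0
    andi.w  #$FF, d1                ; height is passed as a byte
    subq.w  #1, d1
    lea     ($C00000).l, a1         ; VDP data port (auto-increment must be 2)

    ; Offsets of pixel rows 1..7 of a tile: d0, d2, d4, d5, d6, d7, d1 (high word).
    move.w  d0, d2
    add.w   d2, d2                  ; d2 = stride * 2
    move.w  d0, d4
    add.w   d2, d4                  ; d4 = stride * 3
    move.w  d2, d5
    add.w   d5, d5                  ; d5 = stride * 4
    move.w  d5, d6
    add.w   d0, d6                  ; d6 = stride * 5
    move.w  d5, d7
    add.w   d2, d7                  ; d7 = stride * 6
    swap    d1
    move.w  d7, d1
    add.w   d0, d1                  ; stride * 7, parked in the high word of d1
    swap    d1
VLBitmap2ScrollVLoop:
    move.w  d0, d3
    lsr.w   #2, d3
    subq.w  #1, d3                  ; d3 = tiles per row - 1
    swap    d1                      ; d1.w = stride * 7
VLBitmap2ScrollHLoop:
    move.l  (a0), (a1)              ; the eight pixel rows of one tile
    move.l  (a0,d0.w), (a1)
    move.l  (a0,d2.w), (a1)
    move.l  (a0,d4.w), (a1)
    move.l  (a0,d5.w), (a1)
    move.l  (a0,d6.w), (a1)
    move.l  (a0,d7.w), (a1)
    move.l  (a0,d1.w), (a1)
    addq.l  #4, a0                  ; next tile in this tile row
    dbf     d3, VLBitmap2ScrollHLoop
    adda.w  d1, a0                  ; skip the 7 pixel rows already sent
    swap    d1                      ; d1.w = remaining tile rows
    dbf     d1, VLBitmap2ScrollVLoop
    movem.l (sp)+, d0-d7/a0-a1
    rts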
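
The first few instructions pack the VRAM address into the long that goes to the control port. As a cross-check, here is a small Python sketch of that encoding, assuming the usual VDP layout for a VRAM write (first word $4000 plus address bits 13-0, second word address bits 15-14). Both function names are invented for this example.

```python
def vram_write_command(tile_index: int) -> int:
    """Control-port long for a VRAM write starting at the given tile."""
    address = (tile_index & 2047) * 32           # 32 bytes per tile
    high_word = 0x4000 | (address & 0x3FFF)      # write flag + A13-A0
    low_word = (address >> 14) & 3               # A15-A14
    return (high_word << 16) | low_word


def vram_write_command_shifts(tile_index: int) -> int:
    """Same value, built with the shift/swap sequence used in the routine."""
    d2 = ((tile_index & 2047) << 7) & 0xFFFFFFFF        # andi.l + lsl.l #7
    low = ((d2 & 0xFFFF) >> 2) | 0x4000                 # lsr.w #2 + ori.w #$4000
    d2 = (d2 & 0xFFFF0000) | low
    return ((d2 << 16) | (d2 >> 16)) & 0xFFFFFFFF       # swap


print(hex(vram_write_command(0)), hex(vram_write_command(2047)))
```

The two functions agree for every tile index, which is a quick way to convince yourself that the lsl/lsr/swap trick lands each address bit where it belongs.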

I wonder if this is of any use for software rendering. You know, you need to do things very fast when doing stuff such as 3D, so it may be helpful.

PS: loading a 320x224 bitmap in about two frames means loading 560 tiles in a single frame, right? How many of them get loaded inside VBlank? Even so, it's a lot, and I'm not loading that many tiles using DMA, which in theory doubles the number of tiles loadable in VBlank and loads at the same speed as a 68k loop in active scan.

EDIT: before I forget, this subroutine does not use RAM at all. In fact, the only memory accesses it makes are to the bitmap and the VDP.

So I guess that helps with performance?

TmEE co.(TM) · Post by **TmEE co.(TM)** » Fri Apr 18, 2008 5:43 am

I managed to load about 10 320x240 4-bit BMPs a second in my FMV demo on real hardware.

http://www.hot.ee/tmeeco/DWNLOADS/CAR.RAR

Sik · Post by **Sik** » Fri Apr 18, 2008 4:26 pm

Already saw that. And 10FPS (you) vs. 30FPS (me) *shot*

Though seriously, it would be interesting to see how well my subroutine performs on real hardware (I can't test that). Also, I could make this subroutine a bit faster still (especially if it's adapted to a specific size), but I'm not sure how well that would work; I'm probably already pushing the VDP to its limits.

Stef · Post by **Stef** » Fri Apr 18, 2008 7:00 pm

Sik wrote:Already saw that. And 10FPS (you) vs. 30FPS (me) *shot*

Though seriously, it would be interesting to see how well my subroutine performs on real hardware (I can't test that). Also, I could make this subroutine a bit faster still (especially if it's adapted to a specific size), but I'm not sure how well that would work; I'm probably already pushing the VDP to its limits.

I have some functions like that in my library too; I used one to display the main basic logo in the mini dev kit. Unlike yours, it's pure C code, so it's probably not as optimised as yours...

Sik · Post by **Sik** » Fri Apr 18, 2008 7:06 pm

Hey, I know many of you have made fake bitmap functions; in fact, a lot of games have such a thing. I was wondering whether this one is useful given how fast it is.

tomaitheous · Post by **tomaitheous** » Sat Apr 19, 2008 3:23 am

Sik wrote:Already saw that. And 10FPS (you) vs. 30FPS (me) *shot*

I'd like to know how you're getting a faster transfer rate than local-to-VRAM DMA. H40 mode with 224 active scanlines is 7380 bytes per frame, and you're transferring 17920 bytes per frame without DMA!? Please explain.
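
The two figures being compared can be reproduced with a little arithmetic. A rough Python sketch, using only numbers quoted in this thread (a 320x224 bitmap at 4 bits per pixel, loaded in about two frames, against 7380 bytes per frame for DMA); the variable names are just for illustration:

```python
WIDTH, HEIGHT = 320, 224
FRAMES_PER_LOAD = 2                  # "about two frames", as measured under Fusion
DMA_BYTES_PER_FRAME = 7380           # H40 figure quoted above

bitmap_bytes = WIDTH * HEIGHT // 2               # two pixels per byte
tile_count = (WIDTH // 8) * (HEIGHT // 8)        # 8x8 tiles

cpu_bytes_per_frame = bitmap_bytes // FRAMES_PER_LOAD
cpu_tiles_per_frame = tile_count // FRAMES_PER_LOAD

print(bitmap_bytes, tile_count)                  # whole bitmap
print(cpu_bytes_per_frame, cpu_tiles_per_frame)  # claimed per-frame rate
print(cpu_bytes_per_frame / DMA_BYTES_PER_FRAME) # claimed rate relative to DMA
```

So the claimed rate is well over twice the quoted DMA budget, which is why the emulator timing looks suspicious.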

Sik · Post by **Sik** » Sat Apr 19, 2008 5:36 am

I would actually like to know where Sega got all those timings from. Maybe they only took VBlank into account? Because this subroutine keeps sending data during active scan. I mean, the Mega Drive isn't as slow as the tech docs say. Otherwise there are a lot of things games do that shouldn't be possible. The only thing I know is that it's possible to get over 15FPS, unlike what they said. The VDP can definitely push way more data, and there are some games out there that do manage to push that much. I'm not sure about 60FPS, but between 20FPS and 30FPS should be possible if a subroutine is fast enough.

By the way, look at the way this subroutine is written. I really don't think that writing the data one piece at a time ever causes the FIFO to fill up. Those tests were done without any DMA getting in the way, so I guess that helped, since the FIFO was already empty; hence the 68k never got stalled and the VDP kept writing without any delay. Which underlines the importance of an empty FIFO.

I guess that's why it manages to load so much data so fast.

Anyway, I would like you to test it. I don't want to rely on the results of a single test of mine, which may have been inaccurate (though it was definitely refreshing about once every two frames, and only Genecyst disagreed (there it went at 8 refreshes per frame)).

tomaitheous · Post by **tomaitheous** » Sat Apr 19, 2008 5:55 am

And you've tested this on the real hardware?

Sik wrote:I would actually like to know where Sega got all those timings from. Maybe they only took VBlank into account? Because this subroutine keeps sending data during active scan. I mean, the Mega Drive isn't as slow as the tech docs say. Otherwise there are a lot of things games do that shouldn't be possible.

From my understanding from talking with Charles, you're allowed 16 or so writes to the VDP during hblank (or whatever fits in that time window); the active scanline will not transfer any data to the VDP other than filling the FIFO. Once the FIFO is full, the CPU is stalled until the next opening, i.e. the end of the scanline. Are you saying that there are gaps in between during active display on a scanline?

The only thing I know is that it's possible to get over 15FPS, unlike what they said. The VDP can definitely push way more data, and there are some games out there that do manage to push that much.

By ending the active display early, you gain the additional 205 bytes per scanline for DMA. Games that push a faster transfer rate do this. Of course, a PAL system gives a lot more DMA bandwidth per frame than an NTSC system. Running 320x200 on an NTSC system, you can get 20fps just using DMA.

Sik · Post by **Sik** » Sat Apr 19, 2008 6:01 am

Except the games I'm talking about seem to take up the entire screen. Moreover, some of them even run in H32, and you know that things are a lot slower there...

Oh, yeah, now I remember: the timings in the tech docs were for H32. And I think that, for some stupid reason, Sega took those timings and used them to calculate how fast a 320x224 screen would load, when the H40 timings should have been used. I guess that solves the mystery...

tomaitheous · Post by **tomaitheous** » Sat Apr 19, 2008 6:46 am

Sik wrote:Moreover, some of them even run in H32, and you know that things are a lot slower there...

Via DMA, 256x224 takes 4.797 frames to update the screen in H32, whereas 320x224 takes 4.856 frames. Fairly close, and you save VRAM space using H32. So it's not really slower; actually, it's a hair faster. Though both equate to 12fps.
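
Those frame counts follow from dividing the screen size by the per-frame DMA budget. A small Python sketch of the calculation, assuming a 60 Hz NTSC display and the 7380 bytes per frame quoted earlier for H40; the H32 budget is not stated in this thread, so the value below is only backed out from the 4.797 figure:

```python
NTSC_FPS = 60                                    # assumed refresh rate


def frames_to_update(total_bytes: int, bytes_per_frame: float) -> float:
    """How many frames a full-screen update takes at a given per-frame budget."""
    return total_bytes / bytes_per_frame


h40_bytes = 320 * 224 // 2                       # 4bpp screen in H40
h32_bytes = 256 * 224 // 2                       # 4bpp screen in H32

h40_frames = frames_to_update(h40_bytes, 7380)
h32_frames = 4.797                               # figure quoted above
h32_budget = h32_bytes / h32_frames              # implied bytes per frame, derived

print(round(h40_frames, 3), round(h32_budget))
print(NTSC_FPS / h40_frames, NTSC_FPS / h32_frames)
```

Both modes come out a little above 12 updates per second, with H32 marginally ahead.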

Oh, yeah, now I remember: the timings in the tech docs were for H32. And I think that, for some stupid reason, Sega took those timings and used them to calculate how fast a 320x224 screen would load, when the H40 timings should have been used. I guess that solves the mystery...

That would give you 12fps in either mode. What was the manual stating? (I forget)

TmEE co.(TM) · Post by **TmEE co.(TM)** » Sat Apr 19, 2008 3:16 pm

Hey Sik, make a small demo ROM with your code and I'll see how fast it goes... it will not run as fast on real HW... VDP access limitations aren't emulated as well as they should be... none of the 3D games run as fast as they do on emulators; all are a bit slower.

Sik · Post by **Sik** » Sat Apr 19, 2008 6:37 pm

It seems like I've somehow scrambled the subroutine >_> Anyway, never mind; it seems the demo runs a lot slower in Fusion than in Gens. Weird, I was quite sure it ran at the same speed in both emulators. Mmmmmmh...

Anyway, it's faster than 10FPS, that's for sure (now somebody will come and tell me it runs at 5FPS).

TmEE co.(TM) · Post by **TmEE co.(TM)** » Sun Apr 20, 2008 9:43 am

Your code doesn't seem any slower than this:

Code:

LoadBMP:                ; Loads a 4-bit BMP, A0 = Source, D4 = Start tile
 MOVE.L #DPORT, A3
 ADD.L  #2, A0          ; ID WORD, must be "BM"
 ADD.L  #4, A0          ; Size LONG
 ADD.L  #4, A0          ; nothing
 ADD.L  #4, A0          ; Pixel data offset LONG (118 for a 16-colour BMP)
 ADD.L  #4, A0          ; Header size LONG, must be 40
 MOVE.L (A0)+, D0       ; Width LONG
 ROR.W  #8, D0
 SWAP   D0
 ROR.W  #8, D0
 MOVE.L D0, (BMPX)      
 MOVE.L (A0)+, D0       ; Height LONG
 ROR.W  #8, D0
 SWAP   D0
 ROR.W  #8, D0
 MOVE.L D0, (BMPY)
 ADD.L  #2, A0          ; 1 WORD
 ADD.L  #2, A0          ; bpp WORD, must be 4
 ADD.L  #4, A0          ; compression LONG, must be 0
 ADD.L  #4, A0          ; (compressed) image size LONG
 ADD.L  #16, A0         ; Nothing
 MOVE.L #$C0000000, (CPORT) ; BMP palette format B, G, R, Z
 MOVE.W #15, D3
BMPpalLoop:
 MOVE.B (A0)+, D0       ; B (BMP palette entries start with blue)
 AND.L  #$E0, D0
 LSL.W  #4, D0
 MOVE.B (A0)+, D1       ; G
 AND.L  #$E0, D1
 MOVE.B (A0)+, D2       ; R
 AND.L  #$E0, D2
 LSR.W  #4, D2
 OR.W   D2, D0
 OR.W   D1, D0
 ADD.L  #1, A0          ; Z
 MOVE.W D0, (DPORT)
 DBRA   D3, BMPpalLoop
 MOVE.W D4, D0          ; D4=Start tile
 LSL.W  #5, D0
 AND.L  #$FFFF, D0
 MOVE.L #$40000000, D2  ; D2=VDP command
 MOVE.L D0, D1
 LSR.L  #8, D1
 LSR.L  #6, D1
 OR.L   D1, D2          ; Add Address bits 14 and 15
 MOVE.L D0, D1
 AND.L  #$3FFF,D1
 SWAP   D1
 OR.L   D1, D2          ; Add rest of the Address bits
 MOVE.L D2, (CPORT)     ; Write command+screen pointer
 MOVE.L A0, A1
 MOVE.L (BMPY), D6
 SUBQ.L #1, D6
 LSR.L  #3, D6
ImageLoopY:
 MOVE.L A1, A2
 MOVE.L (BMPX), D7
 SUBQ.L #1, D7
 LSR.L  #3, D7
TileLoop:
 MOVE.L A2, A0
 MOVE.L (BMPX), D0
 LSR.L  #1, D0
 MOVE.L (A0), (A3)      ; Load one tile
 ADD.L  D0, A0
 MOVE.L (A0), (A3)
 ADD.L  D0, A0
 MOVE.L (A0), (A3)
 ADD.L  D0, A0
 MOVE.L (A0), (A3)
 ADD.L  D0, A0
 MOVE.L (A0), (A3)
 ADD.L  D0, A0
 MOVE.L (A0), (A3)
 ADD.L  D0, A0
 MOVE.L (A0), (A3)
 ADD.L  D0, A0
 MOVE.L (A0), (A3)
 ADD.L  D0, A0
 ADD.L  #4, A2
 DBRA   D7, TileLoop
 ADD.L  (BMPX), A1
 ADD.L  (BMPX), A1
 ADD.L  (BMPX), A1
 ADD.L  (BMPX), A1
 DBRA   D6, ImageLoopY
 MOVE.L (BMPY), D1      ; Tiles loaded, now display
 SUBQ.L #1, D1
 LSR.L  #3, D1
 MOVE.W D1, D7
 MOVE.W #29, D0         ; Center image Y
 SUB.W  D1, D0
 LSR.W  #1, D0
 ADD.W  D1, D0
 MOVE.L (BMPX), D2
 SUBQ.L #1, D2
 LSR.L  #3, D2
 MOVE.W #39, D1         ; Center image X
 SUB.W  D2, D1
 LSR.W  #1, D1
 MOVE.W D4, D3          ; D3 = Start tile
 OR.W   #$1000, D3
Yloop2:                  
 JSR    CalcOffset
 MOVE.L (BMPX), D6
 SUBQ.L #1, D6
 LSR.L  #3, D6
Xloop2:
 MOVE.W D3, (A3)
 ADDQ.W #1, D3
 DBRA   D6, Xloop2
 SUBQ.W #1, D0
 DBRA   D7, Yloop2
 RTS

CalcOffset:             ; Calculates offset in VRAM for pattern table
 MOVEM.L D0-D1, -(A7)   ; modifying routines    
 LSL.W  #6, D0          
 ADD.W  D1, D0          ; D0=Y, D1=X
 ADD.W  D0, D0
 ADD.W  (PTADDR), D0
 MOVE.W D0, D1
 AND.W  #$3FFF, D0
 OR.W   #$4000, D0
 SWAP   D0
 ROL.W  #2, D1
 AND.W  #3, D1
 MOVE.W D1, D0
 MOVE.L D0, (CPORT)
 MOVEM.L (A7)+, D0-D1
 RTS

At least this scans the file for width/height, sets up the palette, and centers the image on screen.
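
For reference, the header and palette handling can be modelled in a few lines of Python. This is only a sketch of the same steps, assuming an uncompressed BMP with a 40-byte info header (width and height as little-endian longs at offsets 18 and 22) and palette entries stored as blue, green, red, unused; the helper names are invented here.

```python
import struct


def read_bmp_size(header: bytes) -> tuple:
    """Width and height, stored little-endian at offsets 18 and 22."""
    return struct.unpack_from("<ii", header, 18)


def bmp_to_md_colour(blue: int, green: int, red: int) -> int:
    """Keep the top 3 bits of each channel: 0000 BBB0 GGG0 RRR0."""
    return ((blue & 0xE0) << 4) | (green & 0xE0) | ((red & 0xE0) >> 4)


# Fake 54-byte header carrying only the two size fields.
header = bytearray(54)
struct.pack_into("<ii", header, 18, 320, 240)

print(read_bmp_size(bytes(header)))
print(hex(bmp_to_md_colour(255, 255, 255)), hex(bmp_to_md_colour(255, 0, 0)))
```

The byte swapping done with ROR.W/SWAP/ROR.W in the routine is what `"<ii"` does here: it turns the little-endian fields into values the big-endian 68000 can use directly.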

Sik · Post by **Sik** » Sun Apr 20, 2008 2:20 pm

BMP isn't really a file format to try to use, not to mention that it supports native compression for 4-bit and 8-bit bitmaps, so I guess that overcomplicates things even more.

VLBitmap2Scroll doesn't deal with such things: it only takes the size you pass in the registers, and the bitmap is just that, the raw bitmap, completely linear. Actually, I've sped it up by moving eight pixels at once instead of reading each pixel individually.

Anyway, argh, the subroutine got scrambled or something... Weird, I just copied it from a place where I was sure it was working properly before :/ I wonder what is wrong...

TmEE co.(TM) · Post by **TmEE co.(TM)** » Sun Apr 20, 2008 2:52 pm

A typo? One letter can make a difference, especially in assembly... a MOVE.L instead of a MOVE.W can cause quite a bit of headache...