Super VDP
Moderator: BigEvilCorporation
Stef wrote: What about handling only one plane? The Genesis already offers two planes. You can use the 32X hardware to implement one enhanced plane :) That sounds good to me. Is 60 FPS possible with only one plane?
Yeah, for sure. But, as I've already said: what's the point of it? I mean, drawing a single layer really just means drawing an image. I don't need to cut this image into tiles, then run two big CPUs to rearrange those tiles. A smarter way is to simply send the image from the ROM to the frame buffer, and that operation can be handled by the mere 68k.
Moreover, with a single layer, you can't have transparency, as the Genesis layers are not seen as a frame buffer.
Finally, scrolling, even line scrolling, is equally pointless, since even scrolled, a single-layer image is still a simple image. Maybe it would be bigger in ROM, but the two CPUs wouldn't run pointlessly.
Fonzie wrote: Yeah... One plane is good too... Anyway, I hope you had fun experimenting with this
Don't worry: I really enjoyed myself! I've learned a lot of things, and I think that's a very important part of it!
-
- Very interested
- Posts: 3131
- Joined: Thu Nov 30, 2006 9:46 pm
- Location: France - Sevres
- Contact:
A tile plane is far more interesting than a simple scrolled image.
A bitmap consumes a lot of memory, whereas a tilemap plane lets you define a very large level with little data :) You reuse the same tiles many times :) Pitfall 32X did that for that reason.
Only one plane and 30 FPS. Yours is already better =)
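To put rough numbers on the memory argument, here is a small C sketch comparing a raw 8-bit bitmap with a tile map plus tile set. The 64-byte tile matches the tile size used later in this thread; the 1-byte map entry, the level size and the number of distinct tiles are assumptions made for illustration only.

```c
#include <stdint.h>

/* Bytes needed to store a level as a raw 8-bit-per-pixel bitmap. */
uint32_t bitmap_bytes(uint32_t width_px, uint32_t height_px)
{
    return width_px * height_px;
}

/* Bytes needed for the same level as a tile map: one 1-byte entry per
   8x8 cell (assumed), plus 64 bytes for each distinct tile. */
uint32_t tilemap_bytes(uint32_t width_px, uint32_t height_px,
                       uint32_t distinct_tiles)
{
    uint32_t cells = (width_px / 8) * (height_px / 8);
    return cells + distinct_tiles * 64;
}
```

For a hypothetical level ten screens wide (3200 x 224 pixels) built from 256 distinct tiles, this gives 716800 bytes for the bitmap against 27584 bytes for the tile map.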
Stef wrote: A tile plane is far more interesting than a simple scrolled image. A bitmap consumes a lot of memory, whereas a tilemap plane lets you define a very large level with little data :) You reuse the same tiles many times :) Pitfall 32X did that for that reason.
Well, the goal of RLE is the same. And if you want scrolling, you can convert your RLE-encoded image to a full-screen image with DMA FILL. Then, just apply the tips I gave at the start of this topic.
Stef wrote: Only one plane and 30 FPS. Yours is already better =)
Why not? Anyway, thank you.
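As an illustration of the RLE idea, here is a minimal C sketch that expands runs into a flat buffer. The (count, value) byte-pair format is an assumption, since the thread does not specify one, and memset only stands in for a hardware fill such as DMA FILL.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expand (count, value) byte pairs from src into dst.
   Returns the number of bytes written; runs are clipped so that
   dst is never overflowed. The pair format is hypothetical. */
size_t rle_decode(const uint8_t *src, size_t src_len,
                  uint8_t *dst, size_t dst_cap)
{
    size_t out = 0;
    for (size_t i = 0; i + 1 < src_len; i += 2) {
        size_t run = src[i];
        if (run > dst_cap - out)
            run = dst_cap - out;
        memset(dst + out, src[i + 1], run);  /* stand-in for a hardware fill */
        out += run;
    }
    return out;
}
```

Each run maps to one fill operation, which is why long runs of a single colour decode cheaply.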
But one question remains: what do we mean by fps?
A movie looks smooth above 24 fps, meaning 24 images are shown in one second. Luckily, electricity in Europe is 50 Hz (and 60 Hz in the USA and Japan). So, if I draw half an image twice, I'll get one image every 1/50 s (1/60 s in the USA). Does it mean I have 50 fps?
Likewise, Gens reports 60 FPS during my demo. Does that mean I actually get 60 fps? If so, I get twice what I need, so I'm very happy with it.
Here's how I benchmarked my SuperVDP. On every V_INT, I increment a value in CommPort($1C). Before my drawPlane routine, I save that V_INT count. Just after my drawPlane routine, I compute the current V_INT count minus the saved one. The number I get is the number of frames that were dropped. I want it to be 0!!! And, drawing 2 planes, I get 1 :(
So, what to believe? Do I get nearly 60 fps, or nearly 30 fps?
FPS = frames per second.
Just determine how many frames you're drawing per second.
The Gens FPS counter is the number of frames that Gens draws per second, but on your side you may be drawing only 30 frames per second.
With your implementation, you can tell whether you missed a frame: if you get 1, then you only manage 30 fps (meaning you're refreshing the screen 30 times per second).
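The relation between the missed-interrupt count and the effective frame rate can be sketched in C as follows. It assumes the display is only updated on a vertical interrupt, so one missed V_INT means each image stays on screen for two refreshes; the function name is made up for this example.

```c
/* Effective frame rate when a new image is shown only every
   (missed_vints + 1) vertical interrupts. */
unsigned effective_fps(unsigned refresh_hz, unsigned missed_vints)
{
    return refresh_hz / (missed_vints + 1);
}
```

With a 60 Hz display, a count of 0 gives 60 fps and a count of 1 gives 30 fps; on a 50 Hz display a count of 1 gives 25 fps.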
When is V_Int triggered?
The beam draws even lines, then odd lines, and that's a frame.
So which is it?
Solution 1:
the beam draws even lines
V_Int is triggered
the beam draws odd lines
V_Int is triggered
And I'd get 2 (since 2 V_Ints are triggered).
Solution 2:
the beam draws even lines
the beam draws odd lines
V_Int is triggered
And I'd get 1 (since 1 V_Int is triggered).
The TV displays 50 (PAL) or 60 (NTSC) half frames per second.
The console does this:
- send even lines (1st half frame)
- V Int
- send odd lines (2nd half frame)
- V Int
- ...
Anyway, don't worry about the even and odd lines: on a 320x240 resolution system, even and odd lines are the same.
Last edited by Stef on Fri May 11, 2007 6:32 pm, edited 1 time in total.
OK, so if I have this:
Code:
R11 = vtimer
drawPlane(A plane)
drawPlane(B plane)
R12 = vtimer
R12 = R12 - R11
R13 = R12
If R13 = 1, that means I got just 1 V_Int while drawing; that is, even if I don't reach 60 fps, I'd still be at 30 fps, wouldn't I?
Here I am, back again.
After cruising with the Amiga (www.amigaimpact.org) and messing around with smart PowerPC guys, I've reopened my SuperVDP project (it seems GLide is still far, far away from me).
I've thought about the various modes I could use. First is Tile Processing: the size of the data I move each time. I started with bytes, so Tile Processing 1, since I fetch 1 byte at a time.
It allows me to handle Packed Pixel tiles (mirror, flip, rotate), but it is slooow (30-60 fps), as you've already seen.
I can work with Tile Processing 2 or 4 (word or long, respectively), which would be faster. Dramatically faster with Tile Processing 4 and Packed Pixel Mode.
I can even look at Direct Color. This is the only mode that allows me to do transparency.
In the end, I get 6 modes (only 4 are really useful):
- mode 0, Packed Pixel Background, which allows tile handling
- mode 3, Direct Color Background, with tile handling and transparency
- mode 4, Packed Pixel Sprites, with sprites
- mode 5, Sprites and Transparency, with transparency and maybe sprites
Stay tuned.
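The trade-off between Tile Processing 1 and Tile Processing 4 can be illustrated with a short C sketch for one 8-pixel tile row at one byte per pixel. The function names are made up for this example and this is not the SuperVDP code itself.

```c
#include <stdint.h>
#include <string.h>

/* "Tile Processing 1": copy one 8-pixel row a byte at a time, so each
   pixel can be placed individually (here: optional horizontal mirror). */
void copy_row_bytes(uint8_t *dst, const uint8_t *src, int mirror)
{
    for (int x = 0; x < 8; x++)
        dst[x] = src[mirror ? 7 - x : x];
}

/* "Tile Processing 4": copy the same row as two 32-bit longwords.
   Fewer memory accesses, but pixels can no longer be reordered for free. */
void copy_row_longs(uint8_t *dst, const uint8_t *src)
{
    uint32_t a, b;
    memcpy(&a, src, 4);
    memcpy(&b, src + 4, 4);
    memcpy(dst, &a, 4);
    memcpy(dst + 4, &b, 4);
}
```

The byte version makes eight reads and eight writes per row against two of each for the longword version, which is where the speed difference comes from.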
Hi all.
Long time since my last post, huh?
Anyway, here it is.
Excuse me, but it's theory only, since I can neither test nor implement it for now. Be sure I would if I could.
Here's the C main loop:
Code:
#include <stdint.h>

extern int CPU_SLAVE;  /* nonzero on the slave CPU (set elsewhere) */

void drawPlane(void)
{
    uint8_t *FB = (uint8_t *)0x24000000;
    const uint8_t *tiles = (const uint8_t *)0x06006000;
    const uint8_t *screenMap = (const uint8_t *)0x06004000;  /* plane A map */
    int firstTile = 0;

    if (CPU_SLAVE) {
        screenMap += 0x800;  /* Slave CPU draws the lower tiles */
        firstTile = 560;
    }
    for (int t = firstTile; t < firstTile + 560; t++) {  /* 40 x 28 tiles = 1120 tiles, 560 for each CPU */
        int tileNumber = *screenMap++;
        const uint32_t *tileAddress = (const uint32_t *)(tiles + tileNumber * 64);  /* One tile is 64 bytes long */
        /* Top-left pixel of tile t in the 320-byte-wide frame buffer */
        uint32_t *dest = (uint32_t *)(FB + (t / 40) * 8 * 320 + (t % 40) * 8);
        /* Copy one tile */
        for (int line = 0; line < 8; line++) {
            *dest++ = *tileAddress++;  /* 1 uint32_t = 1 32-bit longword = 4 bytes */
            *dest++ = *tileAddress++;
            dest += 312 / 4;           /* 312 = 320 - 8 bytes, i.e. 78 longwords */
        }
    }
}
And here's the ASM, translated by me:
Code:
        MOV.L   FB,R1           ; R1 = FrameBuffer
        MOV.L   TILES,R2        ; R2 = Tiles data
        MOV.L   PLANE_A,R3      ; R3 = Plane data
        MOV.L   CPU,R0          ; Let's assume I've set bit 0 there when the CPU is slave
        MOV.L   @R0,R0          ; read the flag stored at that address (assumed)
        TST     #1,R0           ; T = 1 when bit 0 is clear (master)
        BT      CPUSlaveInitSkip
        MOV.W   MAP_OFFSET,R0   ; 0x800, from memory: MOV #imm only takes a signed 8-bit value
        ADD     R0,R3           ; slave reads the lower part of the map
        MOV.L   FB_OFFSET,R0
        ADD     R0,R1           ; ... and draws the lower 14 tile rows
CPUSlaveInitSkip:
; Main loop
        MOV.W   TILE_COUNT,R4   ; 560 tiles for each CPU
        MOV.W   LINE_SKIP,R7    ; 316 = 320 - 4
        MOV     #40,R8          ; R8 = tiles left in the current tile row
        .align 4
REPEAT_PLANE:
        MOV.B   @R3+,R5         ; R5 = tileNumber - 18 cycles
        EXTU.B  R5,R5           ; MOV.B sign-extends, tile numbers are unsigned
        SHLL8   R5
        SHLR2   R5              ; R5 = tileOffset (tileNumber x 64)
        ADD     R2,R5           ; R5 = tileAddress
; Copy one tile
        MOV     #8,R6           ; R6 = Counter : 8 lines/tile
        .align 4
REPEAT_TILE:
        MOV.L   @R5,R0          ; ---
        ADD     #4,R5           ;  |
        MOV.L   R0,@R1          ;  |
        ADD     #4,R1           ;  | 89 cycles when cache miss, 23 when cache hit
        MOV.L   @R5,R0          ;  | For each tile 2-lines, 1 miss then 3 hits
        ADD     #4,R5           ;  | So 4 * (89 + 3*23) = 632 cycles
        MOV.L   R0,@R1          ; ---
        ADD     R7,R1           ; next frame buffer line
        DT      R6              ; R6 = R6 - 1, T = 1 when it reaches 0
        BF      REPEAT_TILE
        MOV.W   TILE_REWIND,R0  ; 2552 = 8 x 320 - 8
        SUB     R0,R1           ; back to the top of the next tile column
        DT      R8
        BF      SAME_ROW
        MOV.W   ROW_SKIP,R0     ; 2240 = 7 x 320
        ADD     R0,R1           ; first tile of the next tile row
        MOV     #40,R8
SAME_ROW:
        DT      R4
        BF      REPEAT_PLANE
        .align 4
CPU         dc.l $06000000
FB          dc.l $24000000
TILES       dc.l $06006000
PLANE_A     dc.l $06004000
FB_OFFSET   dc.l 35840          ; 14 tile rows x 8 lines x 320 bytes
MAP_OFFSET  dc.w $0800
TILE_COUNT  dc.w 560
LINE_SKIP   dc.w 316
TILE_REWIND dc.w 2552
ROW_SKIP    dc.w 2240
I've aligned the data to avoid MA access contention with IF. I don't think this code would fill the whole cache: the instructions take less than 2 cache lines, so 16-byte alignment is unnecessary.
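Since the routine above is untested, one way to check the tile addressing before going to an emulator is a host-side C sketch like the one below. It writes into an ordinary array instead of the 32X frame buffer; the one-byte map entries, the 40-tile row and the function name are assumptions of this sketch, not something confirmed by the thread.

```c
#include <stdint.h>

enum { FB_W = 320, TILE = 8, COLS = 40 };

/* Draw `count` tiles starting at tile index `first` (row-major, 40 per
   row) into an 8-bit frame buffer. `map` holds one tile number per cell
   and `tiles` holds 64 bytes per tile. */
void draw_tiles(uint8_t *fb, const uint8_t *map, const uint8_t *tiles,
                int first, int count)
{
    for (int t = first; t < first + count; t++) {
        const uint8_t *src = tiles + map[t] * 64;
        uint8_t *dst = fb + (t / COLS) * TILE * FB_W + (t % COLS) * TILE;
        for (int y = 0; y < TILE; y++) {
            for (int x = 0; x < TILE; x++)
                dst[x] = src[x];
            src += TILE;   /* next row of the tile */
            dst += FB_W;   /* next line of the frame buffer */
        }
    }
}
```

The master/slave split then corresponds to calling draw_tiles(fb, map, tiles, 0, 560) for the upper half and draw_tiles(fb, map, tiles, 560, 560) for the lower half.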
This code is going to run on both CPUs, each handling one part of the screen: the master CPU draws the upper tiles, the slave CPU draws the lower tiles. I took into account 32X wait time and SH2 wait time (approximately here, but pessimistically).
Anyway, I get 560 x (24 + 632) = 367 360 cycles to draw a plane, that is, 62 planes/s. More than 2 planes/frame @ 30 fps!!!
OK. That's theory only. I don't know why I was wrong earlier. But it definitely needs to be tested (and debugged!!!) in GensKmod, then on real hardware ;)
Two big steps are still missing:
- clear FB when switching frame buffer
- scrolling data ?
edit : clear FB routine
        MOV     #0,R0           ; Clear R0
        MOV.L   VALUE_16000,R1  ; 16000 longwords = 320 x 200 bytes / 4
        MOV.L   FB,R2
clearFBloop:
        MOV.L   R0,@R2          ; ---
        ADD     #4,R2           ;  |
        DT      R1              ;  | 24 cycles (including wait time for concurrent access) for 1 32-bit longword, i.e. 4 bytes / CPU
        BF      clearFBloop     ; ---
        .align 4
VALUE_16000 dc.l 16000
FB          dc.l $24000000
4 bytes x 2 CPUs --> 24 cycles, that is 64000 / 8 x 24 = 192 000 cycles for clearing 64000 bytes.
192 000 cycles compared to 286720 cycles is far from negligible, so clearing the FB may well be a problem.
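The draw-time estimate in this post (560 tiles per CPU, 24 cycles of per-tile overhead plus 632 cycles of copying) can be re-checked with a few lines of C. The clock rate of roughly 23 MHz used in the usage note is an assumption about the 32X SH2s and is not stated in the thread; the function names are hypothetical.

```c
/* Cycles to draw one plane: tiles per CPU x (per-tile overhead + copy cost).
   Both CPUs work in parallel, so this is also the elapsed cost. */
unsigned long plane_cycles(unsigned tiles_per_cpu, unsigned overhead,
                           unsigned copy)
{
    return (unsigned long)tiles_per_cpu * (overhead + copy);
}

/* Whole planes that fit in one second at the given clock rate. */
unsigned long planes_per_second(unsigned long clock_hz,
                                unsigned long cycles_per_plane)
{
    return clock_hz / cycles_per_plane;
}
```

plane_cycles(560, 24, 632) gives 367360, and planes_per_second(23000000UL, 367360UL) gives 62, in line with the figures quoted above.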
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Have you tried using the SH2 DMA to draw the tiles instead of the CPU? Seems to me it'd probably be faster as well as not tying up both CPUs for god-knows-how-long.
In fact, I'd probably start the DMA on the Master SH2 to the framebuffer, then start the DMA on the Slave SH2 to the overwrite buffer to simulate two layers.