Color counts per screen
tomaitheous,
I appreciate your taking the time to write all of that out; it was very informative. So, while storing more palettes takes up more ROM, the amount of ROM is relatively tiny, and calling/displaying more palettes doesn't affect system resources at all. That is very interesting on its own, but so is your description of how backgrounds can be broken up on any system.
I find it hard to believe that each system was equally capable of line scrolling techniques when no software demonstrates this. There must be some disconnect here between theory and reality. For example, it can be said that the Genesis was capable of software scaling and rotation, but no software shows this to be superior to the SNES's Mode 7 effect. The lack of multiple scrolling backgrounds in Sonic clones on the SNES (e.g. Speedy Gonzales, Road Runner, etc.) just doesn't seem like something the developers would have chosen arbitrarily. With that said, the only fact is that no PCE/SNES game demonstrates parallax at the level of top tier Genesis software, even though the systems were technically capable of it. It is worth noting that line scrolling background effects like those found in Super Aleste are just as technically impressive as any of the parallax effects in Genesis titles.
Are you averse to my using your SF2 and Final Fight pics in a future comparison article? I will post the article here, and as always make any corrections the group deems necessary. Of course, I would never be able to cover everything we've discussed in a single article, but I would want to clear up any common misconceptions if possible.
-
- Very interested
- Posts: 256
- Joined: Tue Sep 11, 2007 9:10 pm
Snake wrote: You would think so, but try it. It takes a fairly huge chunk of the CPU just to do that with all of them.
Oh, I have. "Hardly taxing" is relative, of course (when you take out the collision detection for other objects and map) - haha
Quote: Worth mentioning here that SNES DMA is quite a bit slower than Genesis DMA...
A single channel for direct mode is 8 master cycles per byte. That's about 6.4 KB per frame (NTSC) for a 38 scanline length vblank.
Quote: It's actually 32 or 272 but it doesn't really matter. Not sure what you mean by the second bit, the Genesis can do 320.
Oops, yes, 32 sprites or 34 cells (34x8=272 pixels). For the second part, I mean relative to the screen width. A 256 wide display with a 272 pixel limit is a better ratio than either H32 or H40 of the VDP.
Quote: I find it hard to believe that each system was equally capable of line scrolling techniques when no software demonstrates this as such. There must be some disconnect here between theory and reality.
Hmm. I could give you an example with numbers. I currently have an h-int routine on the PCE that takes up 93 CPU cycles per scanline call. Now, let's say I have a screen window 200 pixels tall (the rest is for a status window) and I want 22 parallax scrolls in that window. That would require 21 interrupt calls, or 1953 cycles per frame. Now, I'll need a speed value that's added to each scroll based on the current map scroll speed. I want this to be flexible for non-linear startups and slowdowns, so I'll not optimize it for immediate additions. So I have a plain Jane, non-optimized 8bit+8bit->16bit addition at 22 cycles. That's 462 cycles. So my total is 2415 cycles. The PCE has ~119436 cycles per frame (1/60th of a second). 2415 cycles is 2% of the CPU resource. If a developer wrote a sloppy/unoptimized routine, then you're looking at 3% vs the one I wrote. This isn't going to be slower on the SNES either. If anything it would be faster, since HDMA only steals a small number of CPU cycles (it stalls the CPU while fetching from the bus).
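The arithmetic in that estimate can be checked with a rough sketch like the one below. All figures come from the post; the variable names are made up for illustration. Note that the 462-cycle figure corresponds to 21 additions (one per interrupt); counting 22 additions would give 484 instead, which barely moves the result.

```python
# Figures quoted in the post; names are illustrative, not from any real tool.
HINT_CYCLES = 93            # one h-int routine call, in HuC6280 cycles
INTERRUPTS = 21             # interrupt calls per frame for 22 scroll regions
ADD_CYCLES = 22             # one unoptimized 8bit+8bit->16bit addition
CYCLES_PER_FRAME = 119_436  # the post's approximate per-frame cycle count

interrupt_cost = INTERRUPTS * HINT_CYCLES   # time spent inside the h-int routine
add_cost = INTERRUPTS * ADD_CYCLES          # scroll speed additions (as counted in the post)
total = interrupt_cost + add_cost
share = total / CYCLES_PER_FRAME            # fraction of one frame's CPU time

print(f"interrupts: {interrupt_cost} cycles")
print(f"additions:  {add_cost} cycles")
print(f"total:      {total} cycles = {share:.2%} of a frame")
```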
And oh, yeah, that's totally fine with me. You can use whatever pics you'd like. I'm not sure if you're looking for demos or just commercial software, but here's a couple of audio demos: here and here, since we were talking about audio in another thread. If you don't have a flash card, then 'mednafen' is the only emu that runs them correctly. And maybe AamirM's emu soon as well.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
tomaitheous wrote: On the PC-Engine, you can set the video processor to create an interrupt on every scanline (all 262 scanlines - yes, even the ones in vblank). An interrupt is when another device taps the CPU on the shoulder to perform a small task (or larger or whatever). The CPU takes a break from what it's doing, jumps to the interrupt routine, then jumps back and continues its work. This means the CPU doesn't have to sit there in a specialized timed loop waiting for the correct time to update the X/Y/whatever video registers. On the Genesis, this method isn't the best for the 68k CPU since it uses a slow interrupt system (pushes all regs onto the stack, in comparison to the PCE and other processors which push barely anything onto the stack for the call). To get around this with a more efficient method, Sega chose to have a block in VRAM that contains scroll values for each scanline. So no interrupt is needed.
Just to clear something up - the 68000 is not that slow with interrupts. It DOESN'T push ALL the registers - it only pushes the current PC and the processor flags (total of three words); then it fetches the interrupt code vector from the ROM and jumps to it. In a 68000 interrupt routine you only have to push the registers you use, and you can actually make interrupt routines that don't use ANY registers. It's EXTREMELY efficient, and many Genesis games use the horizontal int to change the palette on the fly.
The reason Genesis games don't change the scrolling in an interrupt is you don't need to. The scroll tables are sufficient for virtually any kind of scrolling you may need, examples being Sonic and the underwater levels of Earthworm Jim. Not needing to change the scrolling during an interrupt means that interrupt is free for things like changing the palette.
-
- Very interested
- Posts: 256
- Joined: Tue Sep 11, 2007 9:10 pm
Quote: Just to clear something up - the 68000 is not that slow with interrupts.
44 vs 8 cycles just for the call is slower. The return is slower too. That's about an 11k cycle difference for a full scanline count effect (224 scanlines). I'm not saying it couldn't do it, and as you and I have already pointed out, the scroll block method is much more efficient and faster, especially for a full scanline count effect.
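One way to arrive at the "about 11k" figure is sketched below. The entry costs (44 and 8) are from the post; the return costs (20 for rte, 7 for rti) are taken from the annotated listings that appear later in this thread, and combining them this way is an assumption about how the figure was reached. Each count is in its own CPU's clock cycles, which is exactly what the following posts argue about.

```python
# Per-interrupt fixed overhead, in each CPU's own clock cycles.
ENTRY_68K, ENTRY_6280 = 44, 8      # trigger -> first instruction of the handler
RETURN_68K, RETURN_6280 = 20, 7    # rte / rti, per the listings later in the thread
SCANLINES = 224                    # one interrupt per visible scanline

entry_diff = (ENTRY_68K - ENTRY_6280) * SCANLINES
return_diff = (RETURN_68K - RETURN_6280) * SCANLINES
total_diff = entry_diff + return_diff

print(f"entry difference:  {entry_diff} cycles per frame")
print(f"return difference: {return_diff} cycles per frame")
print(f"combined:          {total_diff} cycles per frame")
```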
Quote: that interrupt is free for things like changing the palette.
I have no problem with running/juggling multiple routines on a single generated interrupt system, be it palette, scroll, sprite on/off, BG on/off, resolution changes, or audio/DAC writes. But it sure would be nice not to have the overhead of the scroll updates for full scanline effects and 16kHz DAC playback.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
44 vs 8 what? Sounds like you're deliberately misrepresenting the "cycles" to make your point. The "cycles" on a 68000 are more like the Z80 than the 6502/65816 - you do multiple "cycles" for each instruction. A simple move register to register takes the 68000 4 "cycles". So that "44 cycles" isn't nearly as bad as you seem to imply. Let's put it in perspective:
addi.l #imm, address = 20 cycles
ror.l #8, reg = 24 cycles
mulu.w reg, reg = 70 cycles
move.l address, address = 36 cycles!
There now, 44 doesn't seem NEARLY so slow now, does it?
-
- Very interested
- Posts: 256
- Joined: Tue Sep 11, 2007 9:10 pm
Chilly Willy wrote: 44 vs 8 what? Sounds like you're deliberately misrepresenting the "cycles" to make your point.
What are you talking about? I'm not deliberately misrepresenting anything. And I don't appreciate the implication.
I was referring *just* to the number of cycles from the interrupt being triggered to the point at which the interrupt user routine starts execution. The Motorola datasheet specifies 44 cycles. A 6280 takes 8 cycles from the point of trigger to the start of the user routine. And then there are the cycles for returning from an interrupt, which is also larger on the 68k. That's uncontrollable overhead. Just the overhead. Overhead that would be added on top of a routine that would have to manually update the scroll registers, if you didn't have a scroll block. I didn't go into any detail about the routine itself. I gave an example of where the overhead would add up in a situation. I mentioned 'full scanlines' multiple times. I don't see how I was misrepresenting anything. And I don't see a need for you to talk down to me. If you think there's something that's in question/false/mistaken/whatever, I'm sure you're able to point it out without a condescending tone and rolling your eyes.
tomaitheous wrote: A single channel for direct mode is 8 master cycles per byte. That's 6.4k per frame (NTSC) for a 38 scanline length vblank.
...or about 21% less than the Genesis.
Also - don't most games use the other mode where there's only 23 scanlines of vblank? The ones I remember did, anyway.
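For a sense of scale, here is a rough upper-bound sketch of the SNES DMA budget for both vblank lengths mentioned. The 8 master cycles per byte and the 38 and 23 line counts are from the thread; the 1364 master cycles per scanline is an assumption brought in from the usual NTSC SNES timing, and DMA setup time and DRAM refresh are ignored, so real figures would be somewhat lower.

```python
MASTER_CYCLES_PER_BYTE = 8     # from the thread: one DMA channel, direct mode
MASTER_CYCLES_PER_LINE = 1364  # assumption: typical NTSC SNES scanline length

def dma_bytes_per_vblank(vblank_lines: int) -> int:
    """Upper bound on bytes one DMA channel can move during vblank."""
    return vblank_lines * MASTER_CYCLES_PER_LINE // MASTER_CYCLES_PER_BYTE

long_vblank = dma_bytes_per_vblank(38)   # the 38-line vblank from the quoted post
short_vblank = dma_bytes_per_vblank(23)  # the 23-line vblank mentioned above

print(f"38-line vblank: {long_vblank} bytes")
print(f"23-line vblank: {short_vblank} bytes")
```

The 38-line result lands close to the "6.4k" quoted earlier, and the shorter vblank cuts the budget by roughly 40%.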
There's also the annoying way the SNES sprites are stored, meaning that if you want to DMA a new 32x32 sprite, it requires 4 DMAs. There's overhead there, both in starting the DMA, and writing the registers, that quickly adds up.
That, of course, leads to something else - SNES sprites can only access 16KB of VRAM in any one frame, whereas the Genesis can access all 64KB of it. Yet another reason why the Genesis is 'better at animation'.
IIRC, something similar is also true for the backgrounds.
So both chips have their positives and negatives, and once again, it entirely depends on what you need to do.
Last edited by Snake on Tue Feb 10, 2009 10:57 am, edited 1 time in total.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
tomaitheous wrote: What are you talking about? I'm not deliberately misrepresenting anything. And I don't appreciate the implication.
Then quit misrepresenting things.
tomaitheous wrote: I was referring *just* to the number of cycles form the interrupt being trigger to the point in which the interrupt user routine starts execution. The Motorola datasheet specifies 44 cycles.
44 68000 cycles.
tomaitheous wrote: A 6280 takes 8 cycles from the point of trigger to the start of the user routine.
8 HuC6280 cycles.
They aren't the same thing, but you're comparing them as if they were. Also, you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.
Chilly Willy wrote: you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.
This is very true, and is why the SNES is much slower than the Genesis, no matter how hard some people try to claim otherwise... Having said that, if the SNES CPU ran at the speed of the PCE CPU, it would have made me very happy indeed.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Snake wrote: This is very true, and is why the SNES is much slower than the Genesis, no matter how hard some people try to claim otherwise... Having said that, if the SNES CPU ran at the speed of the PCE CPU, it would have made me very happy indeed.
There's a reason the 68000 was in workstations while the 6502 wasn't.
-
- Very interested
- Posts: 256
- Joined: Tue Sep 11, 2007 9:10 pm
Chilly Willy wrote: Also, you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.
I didn't neglect it. I was only pointing out the additional overhead of the call/return of the interrupt request. That's easier to compare. Comparing code and optimization between the two processors isn't as easy, and I wasn't looking to make this into some sort of processor fight/argument. I've seen too many of these and it gets old.
I have a spare original 68000 DIP rated at 8MHz. Both PCE video processors can run in 16-bit bus mode if a specific pin is removed from ground. Let's assume I drop the 68k in place of the 6280 (which I do plan on doing in the future for fun - I already have the 2612 set up to the system through the ROM cart). And for the sake of comparison, each clock cycle is 139.683ns.
Code: Select all
;call 8
pha ;3
lda $0000 ;6
bit #$04 ;2
beq .vcheck ;2
.hsync
st0 #$07 ;5 X scroll Reg select
phy ;3
.sm01 ldy #$00 ;2
lda array.l,y ;5
sta $0002 ;6 data port
lda array.h,y ;5
sta $0003 ;6 data port
st0 #$06 ;5 H-int reg select
sty $0002 ;6 data port
iny ;2
beq .check_01 ;2
sty .sm01+1 ;5
.vreg st0 #$00 ;5 restore current VDC register incase game is writing/reading VRAM during active display.
ply ;4
pla ;4
rti ;7
93 cycles
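;68000 version (operands in Motorola source,destination order)
;A3 = register port
;A4 = scroll array
;A5 = data port
;D2 = test reg
;D3 = bits to test against
;D4 = scanline counter
;D5 = register restore
;call 44
move (A3),D2 ;8 read interrupt status
btst D3,D2 ;6
beq v_check ;8
move.b #$07,(A3) ;12 X scroll reg select
move (A4)+,(A5) ;12 next scroll value to data port
move.b #$06,(A3) ;12 H-int reg select
move D4,(A5) ;8 scanline for next interrupt
addq #1,D4 ;4
move D5,(A3) ;8 restore current VDC register
rte ;20
142 cycles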
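As a quick sanity check, the per-instruction annotations in the two listings can be summed and converted to wall-clock time. This is only a sketch: the cycle counts are the ones written next to each instruction above, and the 139.683 ns cycle time is the post's hypothetical of both CPUs running from the same roughly 7.16 MHz clock.

```python
# Cycle annotations copied from the two listings (interrupt entry included).
HUC6280_LISTING = [
    ("call", 8), ("pha", 3), ("lda", 6), ("bit", 2), ("beq", 2), ("st0", 5),
    ("phy", 3), ("ldy", 2), ("lda", 5), ("sta", 6), ("lda", 5), ("sta", 6),
    ("st0", 5), ("sty", 6), ("iny", 2), ("beq", 2), ("sty", 5), ("st0", 5),
    ("ply", 4), ("pla", 4), ("rti", 7),
]
M68K_LISTING = [
    ("call", 44), ("move", 8), ("btst", 6), ("beq", 8), ("move.b", 12),
    ("move", 12), ("move.b", 12), ("move", 8), ("addq", 4), ("move", 8),
    ("rte", 20),
]
NS_PER_CYCLE = 139.683  # assumption from the post: same clock for both CPUs

def total_cycles(listing):
    """Sum the annotated cycle counts of one listing."""
    return sum(cycles for _, cycles in listing)

huc_total = total_cycles(HUC6280_LISTING)
m68k_total = total_cycles(M68K_LISTING)

for name, total in (("HuC6280", huc_total), ("68000", m68k_total)):
    print(f"{name}: {total} cycles, about {total * NS_PER_CYCLE / 1000:.2f} us per call")
```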
And then there's the 4 memory mapped 8/16bit wide registers with 21bit linear address range and 16bit signed auto incrementing, that I didn't optimize for.
My 68k is a bit rusty. Maybe you guys could write a more optimized version. The rules are simple: $0000 is the register select port, $0002/3 is the data read/write port, $07 is the X scroll register, and $06 is the scanline interrupt register. Reading $0000 (which must be done) gives you the status of which interrupt has been generated: D5 for Vblank, D2 for Hsync. You also need to restore the previous register being used (the interrupt routine assumes game logic/code will write to VDC during active display).
Chilly Willy wrote: There's a reason the 68000 was in workstations while the 6502 wasn't.
I'd say there are multiple reasons. The big three: the processor was actually available in higher clock speeds, it had a non-convoluted large linear address range, and its register/instruction set design is actually optimal for higher level language compilers. All those reasons right there save on complexity of code and time/money.
I did read of a system that used external logic to monitor the bus of an original 6502 and insert NOP(s), LDx #imm, etc. when an illegal instruction was encountered. The illegal opcodes (and operands for some) were treated as specialized instructions for pseudo registers that were memory mapped, for stuff like mul/div, auto increment/decrement indirect, etc. I guess it was cheaper at the time than the cost of a 68k.
Anyway, I think it's ignorant to simply assume that anything on the 68k at the same clock speed is going to be a number of times faster than on a 65x. These three systems have additional hardware that offloads processing tasks to the additional chips, bringing up the number of occurrences of simpler logic (load/compare/conditional branch). That said, I think it is safe to assume that in general the code will run quickly and efficiently on the 68k. That's not always the assumption with the 65x model and variants.
Snake wrote: That, of course, leads to something else - SNES sprites can only access 16KB of VRAM in any one frame, whereas the Genesis can access all 64KB of it. Yet another reason why the Genesis is 'better at animation'.
That's what I thought when I first looked over the official SFX dev manuals, but I could have sworn there was a bit to select which bank the sprite in the table was referencing (like with the BG layer). The address bits are divided up into multiple regs. But even if that's so, you're right that not having a linear address range accessible is a hindrance/limitation. The more I look at the sPPU layout, the more it seems to resemble the NES PPU layout - not specifically, but in spirit, if that makes sense.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Chilly Willy wrote:
tomaitheous wrote: What are you talking about? I'm not deliberately misrepresenting anything. And I don't appreciate the implication.
Then quit misrepresenting things.
tomaitheous wrote: I was referring *just* to the number of cycles form the interrupt being trigger to the point in which the interrupt user routine starts execution. The Motorola datasheet specifies 44 cycles.
44 68000 cycles.
tomaitheous wrote: A 6280 takes 8 cycles from the point of trigger to the start of the user routine.
8 HuC6280 cycles.
They aren't the same thing, but you're comparing them as if they were. Also, you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.
On the other hand, Genesis at 7.67MHz has 130.378ns cycle times, and PC-Engine at 7.16MHz has 139.665ns cycle times. So really they aren't that far from the same thing at all, at least in the real world where units like seconds are a relevant metric for how much time we have to do things in (such as hblanks). Maybe you didn't know that PC-Engine's CPU is clocked almost as high as Genesis's is?
Just because the 68k takes a large number of clock cycles to do other things doesn't make the interrupt overhead negligible; it just means the 68k has a low instructions-per-cycle rate in general.
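Converting the disputed cycle counts into time makes the comparison concrete. The sketch below uses only the clock speeds and entry costs quoted in this thread; the function name is made up for illustration.

```python
GENESIS_HZ = 7.67e6  # 68000 clock quoted above
PCE_HZ = 7.16e6      # HuC6280 clock quoted above

def cycles_to_ns(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count at the given clock rate to nanoseconds."""
    return cycles * 1e9 / clock_hz

genesis_entry_ns = cycles_to_ns(44, GENESIS_HZ)  # 68000 interrupt entry
pce_entry_ns = cycles_to_ns(8, PCE_HZ)           # HuC6280 interrupt entry

print(f"68000 cycle:   {cycles_to_ns(1, GENESIS_HZ):.3f} ns")
print(f"HuC6280 cycle: {cycles_to_ns(1, PCE_HZ):.3f} ns")
print(f"68000 entry:   {genesis_entry_ns / 1000:.2f} us")
print(f"HuC6280 entry: {pce_entry_ns / 1000:.2f} us")
```

Because the two cycle times differ by only about 7%, the gap in entry cost stays close to the raw 44 to 8 ratio when measured in real time.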
I won't deny that you can do more in fewer 68k instructions than 65xx style instructions in the SNES and PC-Engine, but throwing out a number like "4 times" needs empirical evidence to back it up. Since we're talking about game consoles it'd be good to know specifically how much more efficient 68k code is for typical games of the era. Quantifying such a thing would be very difficult indeed.
For very simple code that's just updating a few, fixed, hardware registers, yes, you can sometimes do it faster on a 65816. But your game is only going to be spending 0.01% of its time doing things like that.
Exophase wrote: Quantifying such a thing would be very difficult indeed.
It would, yes. Because the more complex the code gets, the worse things get for the 65816. Nobody is going to post some really complex code to illustrate, because nobody will be able to follow it. The best I can give you right now is a very simple example. No, I'm not going to supply cycle counts because this is right off the top of my head, but...
A function that adds X and Y velocities to X and Y coordinates stored in RAM. They are stored as 32 bit values (they are fixed point 16:16 format), in the format XVel, XCo, YVel, YCo for simplicity. The function should return a 'pointer' to the next entry in the list.
68000:
Code: Select all
lea coords,a0
bsr AddVelocity
...
AddVelocity:
move.l (a0)+,d0
add.l d0,(a0)+
move.l (a0)+,d0
add.l d0,(a0)+
rts
65816:
Code: Select all
ldx #coords
jsr AddVelocity
...
AddVelocity:
lda 0,x
clc
adc 4,x
sta 4,x
lda 2,x
adc 6,x
sta 6,x
lda 8,x
clc
adc 12,x
sta 12,x
lda 10,x
adc 14,x
sta 14,x
txa
clc
adc #16
tax
rts
In the 65816 indexed instructions I'm using the address as index, and the index as an address. It also assumes that both A and X are in 16 bit mode, if they're not, you'll need to set that externally. I don't think you can get it any shorter, in terms of instructions executed. It's almost four times as many instructions already, and that's very, very simple code.
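To make the data flow of the 65816 version easier to follow, here is a small Python model of the same operation: each 32-bit 16:16 value is handled as two 16-bit words, low word first, with the carry passed from the low add to the high add. It is a sketch of the arithmetic only, not of either CPU's timing, and the function names and sample values are made up for illustration.

```python
def add16(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """16-bit add with carry, like adc in 16-bit mode: returns (result, carry_out)."""
    total = a + b + carry_in
    return total & 0xFFFF, total >> 16

def add_velocity(words: list[int], base: int) -> int:
    """Add XVel to XCo and YVel to YCo for one entry; return index of the next entry.

    Entry layout in 16-bit words, low word first:
    XVel lo, XVel hi, XCo lo, XCo hi, YVel lo, YVel hi, YCo lo, YCo hi
    """
    for vel, co in ((base, base + 2), (base + 4, base + 6)):
        lo, carry = add16(words[vel], words[co], 0)          # clc / adc / sta
        hi, _ = add16(words[vel + 1], words[co + 1], carry)  # adc / sta, carry kept
        words[co], words[co + 1] = lo, hi
    return base + 8  # 16 bytes per entry = 8 words

# One entry: XVel = 0.5, XCo = 1.5, YVel = -1.0, YCo = 3.25 (all 16:16 fixed point)
words = [0x8000, 0x0000, 0x8000, 0x0001, 0x0000, 0xFFFF, 0x4000, 0x0003]
next_entry = add_velocity(words, 0)
x_co = (words[3] << 16) | words[2]
y_co = (words[7] << 16) | words[6]
print(hex(x_co), hex(y_co), next_entry)
```

The 68000 does each of these two-step additions in a single add.l, which is where most of the instruction count difference in this example comes from.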
That's not to say the 65816 version isn't faster. It might even be; I haven't checked cycle counts for any of this. But the CPU clock speed needs to be taken into account too - and even if this particular example IS faster, that doesn't last for very long...
It shouldn't be too difficult to see that only slightly more complex code is going to get much, much worse. As soon as you need to do something that requires more than the three registers you have, you're going to have to start pushing and popping, or loading and storing all over the place. It's usually much faster at that point to just bite the bullet and rewrite your code using fixed addresses, unrolling everything. Usually you end up with a lot more instructions this way, but it's still the most efficient way to do it.
"4 times the number of instructions" is in no way an exaggeration. In fact the opposite is probably true generally speaking.
-
- Very interested
- Posts: 2984
- Joined: Fri Aug 17, 2007 9:33 pm
Love that example, Snake. I'm not saying the 6502 (and derivatives) was a bad CPU. It was an excellent 8-bit CPU. The thing was, programmers really had to be on the ball to write code to get things done. We're talking about systems with 4 to 48 KB of RAM, maybe four or eight sprites, 160x200 four-color bitmaps, and 40x25 character modes. They didn't really need to be that powerful, just efficient. What was silly was trying to use such processors in 16-bit systems. You COULD get the job done, but it was a much bigger pain in the butt. Just think of what could have been - like a SNES with a 68000 instead of the CPU it had. Or maybe with a NS16032 (I really thought that CPU had something, but it never caught on).