Color counts per screen

sheath · Post by **sheath** » Mon Feb 09, 2009 5:39 pm

tomaitheous,

I appreciate your taking the time to write all of that out, it was very informative. So, while storing more palettes takes up more ROM, the amount of ROM is relatively tiny, and calling/displaying more palettes doesn't affect system resources at all. That is very interesting on its own, but so is your description of how backgrounds can be broken up on any system.

I find it hard to believe that each system was equally capable of line scrolling techniques when no software demonstrates this as such. There must be some disconnect here between theory and reality. For example, it can be said that the Genesis was capable of software scaling and rotation, but software does not demonstrate this to be superior to the SNES' Mode 7 effect. The lack of multiple scrolling backgrounds in Sonic clones on the SNES (e.g. Speedy Gonzales, Road Runner, etc.) just doesn't seem like something the developer would have chosen to do arbitrarily. With that said, the only facts are that no PCE/SNES game demonstrates parallax at the level of top tier Genesis software, even though they were technically capable of doing so. It is worth noting that line scrolling background effects like those found in Super Aleste are just as technically impressive as any of the parallax effects in Genesis titles.

Are you averse to my using your SF2 and Final Fight pics in a future comparison article? I will post the article here, and as always make any corrections the group deems necessary. Of course, I would never be able to cover everything we've discussed in a single article, but I would want to clear up any common misconceptions if possible.

tomaitheous · Post by **tomaitheous** » Mon Feb 09, 2009 11:44 pm

Snake wrote: You would think so, but try it. It takes a fairly huge chunk of the CPU just to do that with all of them.

Oh, I have

Hardly taxing is relative of course (when you take out the collision detection for other objects and map) - haha

Worth mentioning here that SNES DMA is quite a bit slower than Genesis DMA...

A single channel for direct mode is 8 master cycles per byte. That's 6.4k per frame(NTSC) for a 38 scanline length vblank.

It's actually 32 or 272 but it doesn't really matter. Not sure what you mean by the second bit, the genesis can do 320.

Oops, yes: 32 sprites or 34 cells (34x8=272 pixels). For the second part, I mean relative to the screen width. A 256-wide display with a 272 pixel limit is a better ratio than either H32 or H40 of the VDP.

I find it hard to believe that each system was equally capable of line scrolling techniques when no software demonstrates this as such. There must be some disconnect here between theory and reality.

Hmm. I could give you an example with numbers. I currently have an h-int routine on the PCE that takes up 93 CPU cycles per scanline call. Now, let's say I have a screen window 200 pixels tall (the rest is for a status window) and I want 22 parallax scrolls in that window. That would require 21 interrupt calls, or 1953 cycles per frame. Now, I'll need a speed value that's added to each scroll based on the current map scroll speed. I want this to be flexible for non-linear startups and slowdowns, so I won't optimize it for immediate additions. So I have a plain Jane, non-optimized 8bit+8bit->16bit addition at 22 cycles. That's 462 cycles. So my total is 2415 cycles. The PCE has ~119436 cycles per frame (1/60th). 2415 cycles is 2% of the CPU resource. If a developer wrote a sloppy/unoptimized routine, then you're looking at 3% vs. the one I wrote. This isn't going to be slower on the SNES either. If anything it would be faster, since HDMA only steals a small number of CPU cycles (it stalls the CPU while fetching from the bus).
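The arithmetic in that example can be re-run with a short Python sketch. Every figure comes from the post itself; the function name is made up for illustration, and the count of 21 additions is an assumption chosen because it reproduces the quoted 462 cycles (22 additions would give 484).

```python
# Cycle budget for the parallax example; the numbers are the ones quoted in the post.
HINT_ROUTINE_CYCLES = 93    # h-int routine cost per scanline call
INTERRUPT_CALLS = 21        # calls quoted for 22 parallax scroll regions
ADD_CYCLES = 22             # unoptimized 8bit+8bit->16bit addition
ADDS_PER_FRAME = 21         # assumption: 21 additions reproduces the quoted 462 cycles
CYCLES_PER_FRAME = 119436   # approximate PCE CPU cycles per 1/60 s, as quoted


def parallax_budget(calls=INTERRUPT_CALLS, adds=ADDS_PER_FRAME):
    """Return (interrupt cycles, addition cycles, total cycles, percent of a frame)."""
    interrupt_cycles = calls * HINT_ROUTINE_CYCLES
    add_cycles = adds * ADD_CYCLES
    total = interrupt_cycles + add_cycles
    return interrupt_cycles, add_cycles, total, 100.0 * total / CYCLES_PER_FRAME


if __name__ == "__main__":
    irq, add, total, pct = parallax_budget()
    print(f"interrupts: {irq}, additions: {add}, total: {total} ({pct:.2f}% of a frame)")
```

Changing `calls` or `adds` shows how slowly the cost grows: even one interrupt on each of the 200 lines of the window stays well under a fifth of the frame with these per-call figures.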

And oh, yeah, that's totally fine with me. You can use whatever pics you'd like. I'm not sure if you're looking for demos or just commercial soft, but here's a couple of audio demos: here and here. Since we were talking about audio in another thread. If you don't have a flash card, then 'mednafen' is the only emu that runs them correctly. And maybe AamirM's emu soon as well.

Chilly Willy · Post by **Chilly Willy** » Tue Feb 10, 2009 1:52 am

tomaitheous wrote: On the PC-Engine, you can set the video processor to create an interrupt on every scanline (all 262 scanlines - yes, even the ones in vblank). An interrupt is when another device taps the CPU on the shoulder to perform a small task (or larger or whatever). The CPU takes a break from what it's doing, jumps to the interrupt routine, then jumps back and continues its work. This means the CPU doesn't have to sit there in a specialized timed loop waiting for the correct time to update the X/Y/whatever video registers. On the Genesis, this method isn't the best for the 68k CPU since it uses a slow interrupt system (pushes all regs onto the stack, in comparison to the PCE and other processors which push barely anything onto the stack for the call). To get around this with a more efficient method, Sega chose to have a block in VRAM that contains scroll values for each scanline. So no interrupt is needed.

Just to clear something up - the 68000 is not that slow with interrupts. It DOESN'T push ALL the registers - it only pushes the current PC and the processor flags (total of three words); then it fetches the interrupt code vector from the ROM and jumps to it. In a 68000 interrupt routine you only have to push the registers you use, and you can actually make interrupt routines that don't use ANY registers. It's EXTREMELY efficient, and many Genesis games use the horizontal int to change the palette on the fly.

The reason Genesis games don't change the scrolling in an interrupt is you don't need to. The scroll tables are sufficient for virtually any kind of scrolling you may need, examples being Sonic and the underwater levels of Earthworm Jim. Not needing to change the scrolling during an interrupt means that interrupt is free for things like changing the palette.

tomaitheous · Post by **tomaitheous** » Tue Feb 10, 2009 3:45 am

Just to clear something up - the 68000 is not that slow with interrupts.

44 vs. 8 cycles just for the call is slower. The return is a lot slower too. That's about an 11k cycle difference for a full scanline count effect (224 scanlines). I'm not saying it couldn't do it, and as you and I have already pointed out, the scroll block method is much more efficient and faster. Especially for a full scanline count effect.
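A minimal Python sketch of where the "about 11k" figure can come from. The entry costs (44 and 8) are from this post; the return costs (20 for `rte`, 7 for `rti`) are the per-instruction counts given in the code listings further down the thread. Combining them this way is an assumption about how the number was derived, and the function name is hypothetical. Each count is in its own CPU's clock cycles.

```python
# Fixed interrupt overhead per call, counted in each CPU's own clock cycles.
M68K_ENTRY, M68K_RETURN = 44, 20       # interrupt entry, rte
HUC6280_ENTRY, HUC6280_RETURN = 8, 7   # interrupt entry, rti
SCANLINES = 224                        # full scanline count effect


def overhead_difference(lines=SCANLINES):
    """Extra call/return cycles the 68000 spends compared with the 6280."""
    per_line = (M68K_ENTRY + M68K_RETURN) - (HUC6280_ENTRY + HUC6280_RETURN)
    return per_line * lines


if __name__ == "__main__":
    print(overhead_difference())  # 49 extra cycles per line over 224 lines
```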

that interrupt is free for things like changing the palette.

I have no problem with running/juggling multiple routines on a single generated-interrupt system. Be it palette, scroll, sprite on/off, BG on/off, resolution changes, and audio/DAC writes. But it sure would be nice not to have the overhead of the scroll updates for full scanline effects and 16 kHz DAC playback.

Chilly Willy · Post by **Chilly Willy** » Tue Feb 10, 2009 4:48 am

44 vs 8 what? Sounds like you're deliberately misrepresenting the "cycles" to make your point. The "cycles" on a 68000 are more like the Z80 than the 6502/65816 - you do multiple "cycles" for each instruction. A simple move register to register takes the 68000 4 "cycles". So that "44 cycles" isn't nearly as bad as you seem to imply. Let's put it in perspective:

addi.l #imm, address = 20 cycles
ror.l #8, reg = 24 cycles
mulu.w reg, reg = 70 cycles
move.l address, address = 36 cycles!

There now, 44 doesn't seem NEARLY so slow now, does it?

tomaitheous · Post by **tomaitheous** » Tue Feb 10, 2009 5:55 am

Chilly Willy wrote:44 vs 8 what? Sounds like you're deliberately misrepresenting the "cycles" to make your point.

What are you talking about? I'm not deliberately misrepresenting anything. And I don't appreciate the implication.

I was referring *just* to the number of cycles from the interrupt being triggered to the point at which the interrupt user routine starts execution. The Motorola datasheet specifies 44 cycles. A 6280 takes 8 cycles from the point of trigger to the start of the user routine. And there are the cycles for returning from an interrupt, which are also more on the 68k. That's uncontrollable overhead. Just the overhead. Overhead that would be added on top of a routine that would have to manually update the scroll registers, if you didn't have a scroll block. I didn't go into any detail about the routine itself. I gave an example of where the overhead would add up in a situation. I mentioned 'full scanlines' multiple times. I don't see how I was misrepresenting anything. And I don't see a need for you to talk down to me. If you think there's something that's in question/false/mistaken/whatever, I'm sure you're able to point it out without a condescending tone and rolling your eyes.

Snake · Post by **Snake** » Tue Feb 10, 2009 10:51 am

tomaitheous wrote:A single channel for direct mode is 8 master cycles per byte. That's 6.4k per frame(NTSC) for a 38 scanline length vblank.

...or about 21% less than the Genesis

Also - don't most games use the other mode where there's only 23 scanlines of vblank? The ones I remember did, anyway.
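The vblank budgets under discussion can be sketched in Python. The 8 master cycles per byte and the 38 and 23 line vblank lengths are from the thread; the 1364 master cycles per NTSC scanline is an assumption that is not stated anywhere in it, and the function name is made up for illustration. Per-transfer setup overhead is ignored.

```python
MASTER_CYCLES_PER_LINE = 1364   # assumption: NTSC SNES scanline length in master cycles
MASTER_CYCLES_PER_BYTE = 8      # single-channel DMA cost quoted in the thread


def dma_bytes_per_vblank(vblank_lines):
    """Upper bound on bytes one DMA channel can move during vblank."""
    return vblank_lines * MASTER_CYCLES_PER_LINE // MASTER_CYCLES_PER_BYTE


if __name__ == "__main__":
    for lines in (38, 23):
        print(f"{lines} vblank lines: {dma_bytes_per_vblank(lines)} bytes")
```

Under these assumptions the 38-line case comes out at roughly the 6.4k quoted, and the 23-line mode drops the budget to a little under 4k.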

There's also the annoying way the SNES sprites are stored, meaning that if you want to DMA a new 32x32 sprite, it requires 4 DMAs. There's overhead there, both in starting the DMA, and writing the registers, that quickly adds up.

That, of course, leads to something else - SNES sprites can only access 16KB of VRAM in any one frame, whereas the Genesis can access all 64KB of it. Yet another reason why the Genesis is 'better at animation'.

IIRC, something similar is also true for the backgrounds.

So both chips have their positives and negatives, and once again, it entirely depends on what you need to do.

Chilly Willy · Post by **Chilly Willy** » Tue Feb 10, 2009 10:57 am

tomaitheous wrote: What are you talking about? I'm not deliberately misrepresenting anything. And I don't appreciate the implication.

Then quit misrepresenting things.

I was referring *just* to the number of cycles from the interrupt being triggered to the point at which the interrupt user routine starts execution. The Motorola datasheet specifies 44 cycles.

44 68000 cycles.

A 6280 takes 8 cycles from the point of trigger to the start of the user routine.

8 HuC6280 cycles.

They aren't the same thing, but you're comparing them as if they were. Also, you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.

Snake · Post by **Snake** » Tue Feb 10, 2009 11:05 am

Chilly Willy wrote:you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.

This is very true, and is why the SNES is much slower than the Genesis, no matter how hard some people try to claim otherwise... Having said that, if the SNES CPU ran at the speed of the PCE CPU, it would have made me very happy indeed

Chilly Willy · Post by **Chilly Willy** » Tue Feb 10, 2009 11:06 am

Snake wrote:
Chilly Willy wrote:you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.
This is very true, and is why the SNES is much slower than the Genesis, no matter how hard some people try to claim otherwise... Having said that, if the SNES CPU ran at the speed of the PCE CPU, it would have made me very happy indeed

There's a reason the 68000 was in workstations while the 6502 wasn't.

tomaitheous · Post by **tomaitheous** » Tue Feb 10, 2009 6:57 pm

Also, you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.

I didn't neglect it. I was only pointing out the additional overhead of the call/return of the interrupt request. That's easier to compare. Comparing code and optimization between the two processors isn't as easy and I wasn't looking to make this into some sort of processor fight/argument. I've seen too many of these instances and it gets old.

I have a spare original 68000 DIP rated at 8 MHz. Both PCE video processors can run in 16-bit bus mode if a specific pin is removed from ground. Let's assume I drop the 68k in place of the 6280 (which I do plan on doing in the future for fun - I already have the 2612 set up to the system through the ROM cart). And for the sake of comparison, each clock cycle is 139.683 ns.

Code:

    ;call            8 
      pha           ;3
      lda $0000     ;6
      bit #$04      ;2
      beq .vcheck   ;2
.hsync      
      st0 #$07      ;5  X scroll Reg select
      phy           ;3
.sm01 ldy #$00      ;2
      lda array.l,y ;5
      sta $0002     ;6  data port
      lda array.h,y ;5
      sta $0003     ;6  data port
      st0 #$06      ;5  H-int reg select
      sty $0002     ;6  data port
      iny           ;2
      beq .check_01 ;2
      sty .sm01+1   ;5
.vreg st0 #$00      ;5  restore current VDC register incase game is writing/reading VRAM during active display.
      ply           ;4
      pla           ;4
    rti             ;7
  
  93 cycles
  
    
    
    A3= register port
    A4= scroll array
    A5= data port
    D2= test reg
    D3= bits to test against
    D4= scanline counter
    D5= register restore
  
    ;call                  44
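      move    (A3),D2     ;8   read VDC status
      btst    D3,D2       ;6
      beq     v_check     ;8
      move.b  #$07,(A3)   ;12  X scroll reg select
      move    (A4)+,(A5)  ;12  next scroll value -> data port
      move.b  #$06,(A3)   ;12  H-int reg select
      move    D4,(A5)     ;8   next interrupt scanline -> data port
      addq    #1,D4       ;4
      move    D5,(A3)     ;8   restore previous register select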
    rte                   ;20
      
  142 cycles

The 6280 version does use self-modifying code, but it only saves 6 cycles. If the map is set up for parallax, I could shave another 11 cycles off the 6280 code by writing only a single byte (removing the MSB write), giving an 82-cycle count. And to be fair, I reserved 7 registers on the 68k side so there's no pushing/popping overhead. Not exactly 4 times the number of instructions (exactly 2 times).

And then there's the 4 memory mapped 8/16bit wide registers with 21bit linear address range and 16bit signed auto incrementing, that I didn't optimize for.

My 68k is a bit rusty. Maybe you guys could write a more optimized version. The rules are simple: $0000 is the register select port, $0002/3 is the data read/write port, $07 is the X scroll register, and $06 is the scanline interrupt register. Reading $0000 (which must be done) gives you the status of which interrupt has been generated: bit 5 (D5) for Vblank, bit 2 (D2) for Hsync. You also need to restore the previous register being used (the interrupt routine assumes game logic/code will write to the VDC during active display).

There's a reason the 68000 was in workstations while the 6502 wasn't.

I'd say there are multiple reasons. The big three: the processor was actually available at higher clock speeds, it had a large, non-convoluted linear address range, and its register/instruction set design is actually well suited to higher-level language compilers. All of those reasons save on code complexity and on time/money.

I did read of a system that used external logic to monitor the bus of an original 6502 and insert NOP(s), LDx #imm, etc. when an illegal instruction was encountered. The illegal opcodes (and operands for some) were treated as specialized instructions for pseudo registers that were memory mapped, for stuff like mul/div, auto increment/decrement indirect, etc. I guess it was cheaper at the time than the cost of a 68k.

Anyway, I think it's ignorant to simply assume that anything on the 68k at the same clock speed is going to be a number of times faster than on a 65x. These three systems have additional hardware that offloads processing tasks to the additional chips, bringing up the number of occurrences of simpler logic (load/compare/conditional branch). That said, I think it is safe to assume that in general the code will run quickly and efficiently on the 68k. That's not always the assumption with the 65x model and variants.

That, of course, leads to something else - SNES sprites can only access 16KB of VRAM in any one frame, whereas the Genesis can access all 64KB of it. Yet another reason why the Genesis is 'better at animation'.

That's what I thought when I first looked over the official SFX dev manuals, but I could have sworn there was a bit to select which bank the sprite in the table was referencing (like with the BG layer). The address bits are divided up into multiple regs. But even if that's so, you're right that not having a linear address range accessible is a hindrance/limitation. The more I look at the sPPU layout, the more it seems to resemble the NES PPU layout - not specifically, but in spirit, if that makes sense.

Chilly Willy · Post by **Chilly Willy** » Tue Feb 10, 2009 9:54 pm

Okay, now THAT was a good comparison.

However, I still don't see the 68000 as "slow" - perhaps "slower", but not by much, and ONLY if you are clocking the 6280 at high speed AND writing tricky code. Then it CAN be a smidgen faster, and only because it responds to an interrupt faster.

Exophase · Post by **Exophase** » Tue Feb 10, 2009 10:18 pm

Chilly Willy wrote:
tomaitheous wrote: What are you talking about? I'm not deliberately misrepresenting anything. And I don't appreciate the implication.
Then quit misrepresenting things.

I was referring *just* to the number of cycles from the interrupt being triggered to the point at which the interrupt user routine starts execution. The Motorola datasheet specifies 44 cycles.
44 68000 cycles.

A 6280 takes 8 cycles from the point of trigger to the start of the user routine.
8 HuC6280 cycles.

They aren't the same thing, but you're comparing them as if they were. Also, you neglect to note that the 6280 is a 6502-alike chip, and it takes about four times as many instructions to do anything compared to a 68000.

On the other hand, Genesis at 7.67MHz has 130.378ns cycle times, and PC-Engine at 7.16MHz has 139.665ns cycle times. So really they aren't that far from the same thing at all, at least in the real world where units like seconds are a relevant metric for how much time we have to do things in (such as hblanks). Maybe you didn't know that PC-Engine's CPU is clocked almost as high as Genesis's is?
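To put the cycle counts on a common time base, here is a small Python sketch that converts them to nanoseconds using the clock speeds given above. The interrupt entry counts (44 and 8) and the full routine counts (142 and 93) are the ones quoted earlier in the thread; the function name is made up for illustration.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds at the given clock speed in MHz."""
    return cycles * 1000.0 / clock_mhz


GENESIS_MHZ = 7.67   # 68000 clock quoted in the post
PCE_MHZ = 7.16       # HuC6280 clock quoted in the post

if __name__ == "__main__":
    rows = [
        ("68000 interrupt entry", 44, GENESIS_MHZ),
        ("6280 interrupt entry", 8, PCE_MHZ),
        ("68000 full routine", 142, GENESIS_MHZ),
        ("6280 full routine", 93, PCE_MHZ),
    ]
    for label, cycles, mhz in rows:
        print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles, mhz):.0f} ns")
```

With these inputs the entry latency works out to roughly 5.7 microseconds against 1.1 microseconds, so the gap survives the conversion to real time.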

Just because the 68k takes a large number of clock cycles to do other things doesn't make the interrupt overhead negligible; it just means the 68k has a low instructions-per-cycle rate in general.

I won't deny that you can do more in fewer 68k instructions than 65xx style instructions in the SNES and PC-Engine, but throwing out a number like "4 times" needs empirical evidence to back it up. Since we're talking about game consoles it'd be good to know specifically how much more efficient 68k code is for typical games of the era. Quantifying such a thing would be very difficult indeed.

Snake · Post by **Snake** » Wed Feb 11, 2009 2:11 am

For very simple code that's just updating a few, fixed, hardware registers, yes, you can sometimes do it faster on a 65816. But your game is only going to be spending 0.01% of its time doing things like that.

Exophase wrote:Quantifying such a thing would be very difficult indeed.

It would, yes. Because the more complex the code gets, the worse things get for the 65816. Nobody is going to post some really complex code to illustrate, because nobody will be able to follow it. The best I can give you right now is a very simple example. No, I'm not going to supply cycle counts because this is right off the top of my head, but...

A function that adds X and Y velocities to X and Y coordinates stored in RAM. They are stored as 32 bit values (they are fixed point 16:16 format), in the format XVel, XCo, YVel, YCo for simplicity. The function should return a 'pointer' to the next entry in the list.

68000:

Code:

lea coords,a0
bsr AddVelocity
...

AddVelocity:
move.l (a0)+,d0
add.l d0,(a0)+
move.l (a0)+,d0
add.l d0,(a0)+
rts

65816:

Code:

ldx #coords
jsr AddVelocity

...

AddVelocity:
lda 0,x
clc
adc 4,x
sta 4,x
lda 2,x
adc 6,x
sta 6,x
lda 8,x
clc
adc 12,x
sta 12,x
lda 10,x
adc 14,x
sta 14,x
txa
clc
adc #16
tax
rts

In the 65816 indexed instructions I'm using the address as index, and the index as an address. It also assumes that both A and X are in 16-bit mode; if they're not, you'll need to set that externally. I don't think you can get it any shorter, in terms of instructions executed. It's almost four times as many instructions already (19 against 5 inside AddVelocity), and that's very, very simple code.
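For readers who don't follow either assembly dialect, the difference between the two listings can be modelled in Python. This is only an illustration of the arithmetic, not of cycle counts: one function adds the two 16-bit halves with a carry in between, the way the 65816 code does with `clc`/`adc`/`adc`, and the other does a single 32-bit add like `add.l`. The function names and the sample values are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def add_by_halves(velocity, coordinate):
    """Add two 16:16 fixed-point values as two 16-bit adds linked by the carry."""
    low = (velocity & MASK16) + (coordinate & MASK16)   # clc / adc on the low words
    carry = low >> 16                                   # carry out of the low-word add
    high = ((velocity >> 16) & MASK16) + ((coordinate >> 16) & MASK16) + carry
    return ((high & MASK16) << 16) | (low & MASK16)     # result wraps at 32 bits


def add_long(velocity, coordinate):
    """Add the same values with one 32-bit operation."""
    return (velocity + coordinate) & MASK32


if __name__ == "__main__":
    vel, pos = 0x00018000, 0x00008000   # 1.5 and 0.5 in 16:16 format
    print(hex(add_by_halves(vel, pos)), hex(add_long(vel, pos)))
```

Both functions return the same value for any pair of 32-bit inputs; the 65816 simply has to spell out each half, which is where the extra instructions come from.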

That's not to say the 65816 version isn't faster. It might even be, I haven't checked cycle counts for any of this. But the CPU clock speed needs to be taken into account too - and even if this particular example IS faster, that doesn't last for very long...

It shouldn't be too difficult to see that only slightly more complex code is going to get much, much worse. As soon as you need to do something that requires more than the three registers you have you're going to have to start pushing and popping, or loading and storing all over the place. It's usually much faster at that point to just bite the bullet and rewrite your code using fixed addresses, and unrolling everything. Usually you end up with a lot more instructions this way, but it's still the most efficient way to do it.

"4 times the number of instructions" is in no way an exaggeration. In fact the opposite is probably true, generally speaking.

Chilly Willy · Post by **Chilly Willy** » Wed Feb 11, 2009 4:34 am

Love that example, Snake. I'm not saying the 6502 (and derivatives) was a bad CPU. It was an excellent 8-bit CPU. The thing was, programmers really had to be on the ball to write code to get things done. We're talking about systems with 4 to 48 KB of RAM, maybe four or eight sprites, 160x200 four-color bitmaps, and 40x25 character modes. They didn't really need to be that powerful, just efficient. What was silly was trying to use such processors in 16-bit systems. You COULD get the job done, but it was a much bigger pain in the butt. Just think of what could have been - like a SNES with a 68000 instead of the CPU it had. Or maybe with a NS16032 (I really thought that CPU had something, but it never caught on).