Color counts per screen

tomaitheous · Post by **tomaitheous** » Wed Feb 11, 2009 4:48 am

I don't think you can get it any shorter, in terms of instructions executed.

ldx #coords
jsr AddVelocity

...

AddVelocity:
lda 0,x
clc
adc 4,x
sta 4,x
lda 2,x
adc 6,x
sta 6,x
lda 8,x
adc 12,x
sta 12,x
lda 10,x
adc 14,x
sta 14,x
txa
clc
adc #16
tax
rts

Shortened by one 'clc'

And you can replace this:

txa
clc
adc #16
tax
rts

with this:

inx
rts

If you interleave the layout accordingly.

But that's a pretty good example of larger generated code on the 65816. But then again, that's a very small part of an overall 'frame's' work of code. Statistically, the majority of your code is going to be simpler logic than that. And I counted cycles (sorry, it's in my nature) - your unmodified 65816 code as is, is 30% faster than the 68k version. Disregarding clock speed of course.

I thought I'd mention: you're still going to have to copy that upper 16bit value either directly to the VDP, or into a local sprite table for DMA usage, adding additional instructions and cycle times. The 65816 version can easily be modified to write directly to the sPPU or to a local sprite table in RAM for DMA.

Or maybe with a NS16032 (I really thought that CPU had something, but it never caught on).

Oouu? What CPU is that?

What was silly was trying to use such processors in 16-bit systems. You COULD get the job done, but it was a much bigger pain in the butt. Just think of what could have been - like a SNES with a 68000 instead of the CPU it had.

I'm in the camp that feels speed is worth more than the size code. "Write once, play many" - you only need to write it once. I'm the kinda person who has no problem expanding code, abusing LUTs, unrolling logic (not just loops), if it means my code is going to be that much faster - an example is here (ppu.asm). Of course, I'm not going to optimize something that takes less than 1% CPU resource per frame. Maybe that's why I feel at home with the 65x arch. Guess I'd make a poor developer.

Chilly Willy · Post by **Chilly Willy** » Wed Feb 11, 2009 5:27 am

Actually, counting every last cycle makes you just about the perfect programmer for 8-bit systems, and some semi-16-bit systems like the 65816.

The NS16032 was renamed the NS32016 later to show it was a 32-bit CPU with a 16-bit bus. The full 32-bit bus version was the 32032. NS had a reputation for unreliable chips. You wanted to make sure you bought pre-tested chips. That made a lot of companies pass the 32016/32032 over in favor of the 68000/68020. Later chips in the NS32K family would go into laser printers and the like. There was even one "public domain" computer based on the 32532, the PC532. This CPU family is pretty much defunct and little more than a short page on Wikipedia these days. Its major claim to fame was that it had the most orthogonal instruction set of any CISC chip out.

sheath · Post by **sheath** » Wed Feb 11, 2009 1:02 pm

Since it came up, I would like to know if anybody here has heard this before. I've seen it implied in several interviews that Nintendo started out the SNES days with a very rigid policy on designing SNES games. The comments range from Nintendo not providing documents to aid developers in programming custom game engines, to Nintendo encouraging developers to run their games in "NES compatibility mode" (which is the lower MHz setting on the CPU).

This would effectively mean that many to most SNES games ran in Modes 1-7 with code that was based on the *documentation* Nintendo provided to developers, not on the actual chips themselves. Now, we know this is true of Sega as well, as the YM2612 documentation proves, but what I don't know is the degree. The end result, I suspect, is another historical limitation on what could be theoretically coded for each system depending on what year the game was released.

We already have the time table during which ROM became cheaper and can safely say how much ROM could have been used in game up to 1995. Then there's the typical "first gen" release, in which most developers do little more than port their code from another system. With the chip documentation issue, we have a span of years, during which *no developer* could do the things members of this group can.

This theory is not even taking into consideration what types of workstations games were made on, which were far less capable than modern PCs to say the least. I don't think this goes too far from the discussion at hand, and I do think it is relevant. Do we think that in real world situations, or in real games, a developer could actually, as an example, make Sonic 2, or Contra: Hard Corps, or Gunstar Heroes on all three platforms? If this is too theoretical that's fine, but it sounds to me like we've already delved head first into theory anyway.

Snake · Post by **Snake** » Wed Feb 11, 2009 5:26 pm

tomaitheous wrote:Shortened by one 'clc'

No, that breaks it. You can't remove that.

tomaitheous wrote:And you can replace...

You'd need 16 "inx"s, which makes the code even longer

tomaitheous wrote:But then again, that's a very small part of an overall 'frame's' work of code. Statistically, the majority of your code is going to be simpler logic than that.

What? You think the majority of game code is going to be simpler than adding a couple of values together? I'm afraid not

Hell, the code above didn't even take the current scroll coordinates into account, which makes it much more complex.

tomaitheous wrote:And I counted cycles (sorry, it's in my nature) - your unmodified 65816 code as is, is 30% faster than the 68k version. Disregarding clock speed of course.

I hope you remembered it's in 16 bit mode and therefore slower... 30% faster? Are you sure?

tomaitheous wrote:I'm in the camp that feels speed is worth more than the size code.

Oh, absolutely. I was just illustrating the point that 65816 code is usually much longer.

Exophase · Post by **Exophase** » Wed Feb 11, 2009 9:41 pm

When I was referring to code size I was referring to number of instructions executed, not number of instructions in a routine. For that to be a meaningful comparison you'd have to factor in instruction density which is totally different between the processors.

I don't think a 32bit add is a good "simple example." Using 16.16 fixed for coordinates in a 2D game of that era sounds more like a luxury that Genesis might afford you than something you really need to do. A 16bit fixed point format like 12.4 would probably work fine in most cases. In fact, I would argue that you rarely need 32bit pure ALU calculations in these games. When you do then of course 68k will win huge in number of instructions needed, same with a lot of ALU things. But a lot of game portions are also simple state machines that aren't as heavily penalized by the poor ISA as pure ALU sequences are.
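To make the 12.4 versus 16.16 trade-off concrete, here is a small Python model. It is only a sketch, not code from any game, and the helper names `fixed_format` and `to_fixed` are made up for illustration. It shows the pixel range and the smallest sub-pixel step that each format offers.

```python
def fixed_format(int_bits, frac_bits):
    """Describe an unsigned fixed-point format (hypothetical helper)."""
    return {
        "total_bits": int_bits + frac_bits,
        "pixel_range": 1 << int_bits,            # whole pixels before wrapping
        "smallest_step": 1 / (1 << frac_bits),   # finest sub-pixel movement
    }


def to_fixed(value, int_bits, frac_bits):
    """Encode a non-negative pixel position, wrapping like a register would."""
    raw = int(value * (1 << frac_bits))
    return raw & ((1 << (int_bits + frac_bits)) - 1)


fmt_12_4 = fixed_format(12, 4)     # fits one 16-bit word
fmt_16_16 = fixed_format(16, 16)   # needs two 16-bit words

# A position of 4096 pixels no longer fits in 12.4 and wraps to zero.
wrapped = to_fixed(4096, 12, 4)
```

Under these assumptions 12.4 covers 4096 pixels in steps of 1/16 pixel, while 16.16 covers 65536 pixels in steps of 1/65536 pixel at the cost of a second word per coordinate.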

You really can't take typical algorithms written with a 68k in mind and convert them to 65xx and expect it to be fair. A real comparison has to be much higher level than that, with code designed with the CPU's strengths and weaknesses in mind. But such a thing is, like I said, almost impossible to quantify.

tomaitheous · Post by **tomaitheous** » Wed Feb 11, 2009 11:25 pm

No, that breaks it. You can't remove that.

If it was in 8bit mode, sure - I'd be worrying about rollover. But in 16bit mode? I doubt you're going to have an offset greater than 65536 pixels on the X or Y axis, and thus you can pretty much guarantee the carry will be cleared for the next instruction (the float part of Y).

You'd need 16 "inx"s, which makes the code even longer

No, I mean you can arrange the array into multiples of groups, i.e. treat the group as a large element. Then, you only need to increment by 1... err 2 (brain is stuck in 8bit mode when doing stuff for the '816).
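One way to read the layout suggestion is as the difference between one 16-byte record per object and a separate array per field. The Python sketch below only models the address arithmetic; the base addresses and function names are hypothetical, and the 16-byte record size is taken from the `adc #16` in the original listing.

```python
OBJECT_SIZE = 16  # bytes per object in the original listing (adc #16)
WORD_SIZE = 2     # one 16-bit element


def struct_address(base, index, field_offset):
    """Address of a field when each object is one contiguous 16-byte record."""
    return base + index * OBJECT_SIZE + field_offset


def interleaved_address(field_base, index):
    """Address of a field when every field lives in its own array of words."""
    return field_base + index * WORD_SIZE


# Hypothetical base addresses, chosen only for the example.
x_fraction_of_object_3 = struct_address(0x1000, 3, 4)
x_fraction_interleaved = interleaved_address(0x2000, 3)
```

In the record layout the index register advances by 16 per object; in the per-field layout it advances by 2, which is what would let two `inx` instructions replace the `txa / clc / adc #16 / tax` sequence. Whether that is worth reorganising the data is exactly what the thread disagrees about.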

What? You think the majority of game code is going to be simpler than adding a couple of values together? I'm afraid not Hell, the code above didn't even take the current scroll coordinates into account, which makes it much more complex.

Hmm. I keep thinking of cycle times in relation to higher clocks (7+ MHz). 75 cycles x 128 objects = 9600 cycles or 21% of a frame. But there are only 128 16bit DP registers in 16bit mode, so you're going to have to change the DP reg offset. Or did you mean to use indirect like LDA [$00],y ?
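The 21% figure can be reproduced with a short calculation. The sketch below assumes a 21.47727 MHz NTSC master clock divided by 8 (roughly 2.68 MHz, the slow-ROM rate mentioned later in the thread) and a rounded 60 frames per second; both constants are assumptions, not figures stated in the post.

```python
MASTER_CLOCK_HZ = 21_477_270        # assumed NTSC master clock
SLOW_CPU_HZ = MASTER_CLOCK_HZ / 8   # assumed slow-ROM CPU rate, about 2.68 MHz
FRAMES_PER_SECOND = 60              # assumed, rounded


def frame_share(cycles_per_object, objects):
    """Fraction of one frame's CPU cycles spent on the given workload."""
    cycles_per_frame = SLOW_CPU_HZ / FRAMES_PER_SECOND
    return cycles_per_object * objects / cycles_per_frame


total_cycles = 75 * 128
share = frame_share(75, 128)
```

With these assumptions a frame has a little under 45,000 CPU cycles, so 9600 cycles comes to about a fifth of it. DMA, refresh and interrupt overhead are ignored here.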

I hope you remembered it's in 16 bit mode and therefore slower... 30% faster? Are you sure?

It's faster because it is in 16bit mode and 16bit elements (yeah, I added the 1 cycle on the ZP load/add/store). And since we were looking at cycles only, I assumed slow rom mode so as not to have cycles with fractional parts (I hate having to list them in master cycles). From BSR to RTS it's 100 cycles. From JSR to RTS it's 75 cycles (removing that CLC). I forgot to count the RTS cycles, so it's actually a 25% difference.

But what I was getting at earlier: it would be faster overall on the 65816 to use indirect addressing and have the upper 16bit value pointed directly into the sprite table that's going to be DMA'd to VRAM, than to have the overhead of fetching the value again. In the memory arrangement of the 68k version, because of how you have it laid out, you still need to grab that upper 16bit value and write it directly to the VDP or into a local copy for DMA. The same would also apply to scroll blocks on both systems. And I would assume the '816 would have even more of an upper hand in cycle times.

Snake · Post by **Snake** » Thu Feb 12, 2009 1:49 am

Exophase wrote:When I was referring to code size I was referring to number of instructions executed

Which is exactly what I posted.

Exophase wrote:I don't think a 32bit add is a good "simple example." Using 16.16 fixed for coordinates in a 2D game of that era sounds more like a luxury that Genesis might afford you than something you really need to do.

...No. I wouldn't have posted an example that's not useful. 16 bits does not give you big enough ranges in both the integer and fractional parts. 12 bits is not enough (2D game 'worlds' are usually much bigger than that) and 4 bits is not enough for the fractional, either. 16:16 may be overkill but it also happens to be much faster than 16:8.

Exophase wrote:You really can't take typical algorithms written with a 68k in mind and convert them to 65xx and expect it to be fair.

Didn't do that either. This code is actually used in a few SNES games. I merely invented the 68K version to show the difference.

I'm not quite sure why people are calling a simple add an 'algorithm'. You can't get much simpler than this.

tomaitheous wrote:I doubt you're going to have an offset greater than 65536

What, you don't want your objects to be able to move left, or up?

tomaitheous wrote:No, I mean you can arrange the array into multiples of groups, i.e. treat the group as a large element

But that's completely changing how it works, for the sake of a very minimal saving. There may be 6 other places in the code where this change is going to make it slower.

tomaitheous wrote:by 1... err 2

yes I wish they'd have added an INX2 instruction

tomaitheous wrote:(yeah, I added the 1 cycle on the ZP load/add/store).

Ok, now where did I indicate DP? Wrong. X can be any address, that's why I used the cool trick of flipping addresses and indexes, and specified both A and X are 16 bit (incidentally there's pretty much no need for indirect addressing anymore when you do this). These are not direct page instructions. Consider the direct page completely out of bounds, because it's being used for something else that absolutely requires direct page. Well, that, and the fact that as you've figured out, the code posted wouldn't work very well in DP anyway.

I can see I'm going to have to time this myself

tomaitheous wrote:But what I was getting at early, it would be faster overall on the 65816 to use indirect addressing and have the upper 16bit value being pointed directly into the sprite table that's going to be DMA'd to VRAM

No, because the game needs to know where objects are in the world. Its internal variables need to be updated. THEN you need to add the scroll coordinates, do some shifting and messing around with that and various other bits and pieces, before writing them to the hardware. I think you'd find if you tried to do all that inside this function you'd end up doing much more work. Especially if you also want to priority sort your sprites.

Exophase · Post by **Exophase** » Thu Feb 12, 2009 2:30 am

Snake wrote:Which is exactly what I posted.

I'm not really sure if you're saying you posted execution or overall size when you say that. What I do know is that you've referred to several things as increasing instruction count that decrease execution time, like unrolling loops.

Snake wrote:...No. I wouldn't have posted an example that's not useful. 16 bits does not give you big enough ranges in both the integer and fractional parts. 12 bits is not enough (2D game 'worlds' are usually much bigger than that) and 4 bits is not enough for the fractional, either. 16:16 may be overkill but it also happens to be much faster than 16:8.

I'm sorry, but I don't agree with you at all and I hope you can do more to justify what you're saying than just saying "because it does." Even if a typical level in a game is more than 4096 pixels in both directions (I'm skeptical) you really don't need to keep full precision (or even move around) entities that are off the screen in quite a number of games.

I would be pretty surprised if any significant percentage of SNES games did use 32bit coordinates, do you have a reference on this?

Snake wrote:I'm not quite sure why people are calling a simple add an 'algorithm'. You can't get much simpler than this.

I was speaking in general terms there, but it certainly applies to what you're saying. Doing 32bit coordinates is a bad approach for a 16bit platform, especially 65c816, and your example worked by taking code written with 68k in mind then adapting it.

You keep referring to this as a "simple add" but a 32bit example is quite loaded IMO, it'll take quite a bit to convince me otherwise.

tomaitheous · Post by **tomaitheous** » Thu Feb 12, 2009 5:41 am

Ok, now where did I indicate DP? Wrong.

Expanding your syntax to $xxxx, X would be less confusing/ambiguous or my interpretation of it

I can see I'm going to have to time this myself.

Should have done it in the beginning

    ;call           6
AddVelocity:
    lda $0000,x    ;5
    clc            ;2
    adc $0004,x    ;5
    sta $0004,x    ;5
    lda $0002,x    ;5
    adc $0006,x    ;5
    sta $0006,x    ;5

    lda $0008,x    ;5
    clc            ;2
    adc $000c,x    ;5
    sta $000c,x    ;5
    lda $000a,x    ;5
    adc $000e,x    ;5
    sta $000e,x    ;5

    txa            ;2
    clc            ;2
    adc #$0010     ;3
    tax            ;2
    rts            ;6

85 then. Of course I didn't include the 3 cycles for lda #imm, but I also didn't include the 12 (or 8 if it's within range) cycles for LEA in the previous post.
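The total of 85 can be checked by adding up the per-instruction counts exactly as they are annotated in the listing. The Python sketch below does only that; the constant names are invented, and the 100-cycle 68k figure is the one quoted earlier in the thread, not something measured here.

```python
CALL = 6                              # jsr
ADD_32BIT = [5, 2, 5, 5, 5, 5, 5]     # lda, clc, adc, sta, lda, adc, sta
ADVANCE_X = [2, 2, 3, 2]              # txa, clc, adc #imm, tax
RETURN = 6                            # rts
M68K_TOTAL = 100                      # 68k figure quoted earlier in the thread

# The routine does the 32-bit add twice: once for X and once for Y.
total = CALL + 2 * sum(ADD_32BIT) + sum(ADVANCE_X) + RETURN
without_second_clc = total - 2
saving_vs_68k = 1 - total / M68K_TOTAL
```

These are the poster's cycle annotations taken at face value; a datasheet check of each addressing mode in 16-bit mode could change individual entries.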

But that's completely changing how it works, for the sake of a very minimal saving. There may be 6 other places in the code where this change is going to make it slower.

Well, it changes your example. One plays to the strengths of the processor and/or hardware. You did just that for the 68k - wrote an optimal method. Sure, 5 cycles isn't much of a saving over that number of occurrences, but I think the layout would be easier to access as well. If I'm going to update the float or the whole part, I would only need a single shift in the index reg before accessing said section. The code and method would be cleaner IMO. I hope I don't sound like I'm full of shit, 'cause I really do write my code that way - it's not an afterthought <_<

No, because the game needs to know where objects are in the world. Its internal variables need to be updated. THEN you need to add the scroll coordinates, do some shifting and messing around with that and various other bits and pieces, before writing them to the hardware. I think you'd find if you tried to do all that inside this function you'd end up doing much more work. Especially if you also want to priority sort your sprites.

Ahh ok, I can see that. Assuming that's the type of engine the game needs. None of the projects I've worked on require to have global positions that large updated in realtime, only a dynamic system for within small distances from the screen become active or none at all (fixed scrolling engine - scroll triggered event objects). But still, 65536 is 256 screens on the SNES. That's a bit overkill, no? Hehe - I guess if you needed more than 32768x32768 map, then the 'clc' stays.

Edit: Actually, the 68k doesn't handle higher than 16bit whole, so there's no reason to have that extra 'clc' in there on the '816 side either. It can be up to $FFFF as long as it doesn't roll over, so the float add of Y is fine without it. Yeah.. a whole 2 cycles

Also, I can still see an instance where you can copy the output to the sprite table since it's already in Acc register *when* the object is near onscreen, albeit in a different routine but still using the same argument calling as that one.

and 4 bits is not enough for the fractional

Hmm. I'd have to disagree with that. And 12.4 as in a single 16bit add would be perfect. But of course you'd want to do a dynamic system. Actually, it seems kind of wasteful to have objects farther away than 12 bits can reach being active when the player object isn't going to be interacting with them at that distance.

16:16 may be overkill but it also happens to be much faster than 16:8.

16.8+8.8->16.8

    ldx #$0001           ;3
    jsr AddVelocity      ;6

AddVelocity:
    lda x_float,x       ;5
    clc                 ;2
    adc x_float_inc,x   ;5
    sta x_float,x       ;5
    lda x_float+1,x     ;5
    adc #$0000          ;3
    sta x_float+1,x     ;5= 30

    lda y_float,x       ;5
    adc y_float_inc,x   ;5
    sta y_float,x       ;5
    lda y_float+1,x     ;5
    adc #$0000          ;3
    sta y_float+1,x     ;5= 28

    inx                 ;2
    inx                 ;2
    inx                 ;2= 6
    rts                 ;6

                      ;58+6+6+6+3 = 79

Chilly Willy · Post by **Chilly Willy** » Thu Feb 12, 2009 9:20 am

tomaitheous wrote: Edit: Actually, the 68k doesn't handle higher than 16bit whole, so there's no reason to have that extra 'clc' in there on the '816 side either.

Actually, it does. The 68000 is a 32 bit architecture with a 16 bit bus and ALU. It may split a 32 bit add internally into two 16 bit adds, but the carry is certainly handled correctly without the programmer needing to do anything explicit. As far as the programmer is concerned, the 32 bit add is one operation.

Snake · Post by **Snake** » Thu Feb 12, 2009 9:42 am

Arrgh, another large post typed, and lost. Let's try that again.

I'll start by repeating that I actually like the 65816, and I've probably done as much code on it as I have the 68K.

Exophase wrote:I'm not really sure if you're saying you posted execution or overall size when you say that.

Well, my reply was in response to you saying that "4 times as many instructions executed" needed backup. I could have used a loop in my example, but I didn't, because although it would have looked shorter on paper, it would have been more instructions executed. I also pointed out that the 65816 version might be smaller cycle-count wise. Sorry, I thought that was pretty clear.

Exophase wrote:Even if a typical level in a game is more than 4096 pixels in both directions (I'm skeptical)

In a 16 bit game? Most games I've seen scroll 2 pixels per frame or faster. 1 pixel looks very slow. It takes no time at all to scroll 4096 pixels...

Exophase wrote:I would be pretty surprised if any significant percentage of SNES games did use 32bit coordinates, do you have a reference on this?

No, of course I don't, do you have a reference that says they don't?

But I know that some do, and I'm sure a LOT do, because contrary to what you might think, it makes perfect sense to do so. However, that doesn't really matter. If you were to look, you'll find a significant number of SNES games use 32 bit addition *somewhere*, even if not for coordinates. I can say that with a lot of confidence. The SNES hardware multiply is insanely quick, and returns a 24 bit result. Games are definitely going to use it, and are definitely going to want to add results - and as I already mentioned, 24 bit adds are more work than 32 bit ones.

Again, the code was only an example. Maybe tomaitheous can provide an example showing the 65816 using less instructions than the 68000. I'm sure other people can provide many more examples like mine.

Exophase wrote:Doing 32bit coordinates is a bad approach for a 16bit platform, especially 65c816, and your example worked by taking code written with 68k in mind then adapting it.

It isn't, and as I said before, and will again - I didn't. This is production SNES code. The 68K I just made up on the spot.

Exophase wrote:You keep referring to this as a "simple add" but a 32bit example is quite loaded IMO

What, just because it happens to be something the SNES doesn't do in a single instruction? You seem to be wanting me to produce code that's deliberately limited on the 68K side just to make the 65816 look better. I don't see why I should do that given that we ARE trying to show that the 68000 can do more per instruction than the 65816 can. I haven't deliberately sabotaged the 65816 code, and I could have made it much worse had I wanted to do that. As I mentioned, that is actually real SNES code*, not something I invented to prove a point.

(* ok so it's from memory, but given that the code cannot do exactly what it does in any more optimal way, I am confident to say it's identical as far as instruction count goes.)

It *IS* a simple add. It's not like it's hideously slow or anything, far from it, and definitely isn't something any SNES programmer would consider a 'bad approach'. It's necessary, and it's good code. If 16 bit is good enough, you'd do 16 bit on the 68000 too, and it would be faster, and still be less instructions than the 65816 version.

I really wasn't trying to force the issue, but since you don't seem to be convinced, and this isn't a good enough example, I'll give you this one:

asr.w #4,d0

There, that's deliberately limited to only 16 bit, and something that will almost certainly appear in a lot of games. How many 65816 instructions is that going to take? If someone can do it in 4 or less, and without cheating by using a huge table, I'll be impressed.
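For readers wondering what the 65816 side of this challenge might look like, one commonly used idiom for a signed shift right is `cmp #$8000` followed by `ror a`, repeated once per bit. That idiom is an assumption here; the thread does not say which sequence Snake had in mind, and shorter tricks may exist for special cases. The Python sketch below models the idiom on a 16-bit accumulator to show that it behaves like an arithmetic shift and takes two instructions per bit.

```python
def cmp_8000_then_ror(a):
    """Model `cmp #$8000` followed by `ror a` on a 16-bit accumulator."""
    carry = 1 if a >= 0x8000 else 0   # cmp sets carry when A >= $8000 (sign bit set)
    return (carry << 15) | (a >> 1)   # ror rotates that carry into bit 15


def asr16(a, count):
    """Arithmetic shift right of a 16-bit value, one cmp/ror pair per bit."""
    for _ in range(count):
        a = cmp_8000_then_ror(a)
    return a


INSTRUCTIONS_FOR_ASR4 = 2 * 4  # one cmp and one ror for each of the four bits

negative_example = asr16(0xFFF0, 4)   # -16 shifted right by 4
positive_example = asr16(0x0100, 4)   # 256 shifted right by 4
```

Under this assumption the four-bit shift costs eight instructions against the single `asr.w #4,d0`, which is the gap the challenge is pointing at.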

tomaitheous wrote:Expanding your syntax to $xxxx, X would be less confusing/ambiguous or my interpretation of it

Fair enough, the trick I used there isn't immediately obvious to people coming from 6502-land, where they're not used to having 16 bit indexing

tomaitheous wrote:Well, it changes your example

...and my example is part of a larger project where you can't just go around changing the order of things in memory to save two instructions, because you'd have to rewrite most of the game

tomaitheous wrote:You did just that for the 68k - wrote an optimal method

Nah, I didn't

Seriously, to save anything there you'd need to either use 4 INX instructions, and store the data in a slightly strange way, or use 2 and store it in a REALLY odd way. Either way, you'll save, what, 2 instructions max? It isn't worth the saving, and I guarantee you'd just complicate code elsewhere anyway. Plus the 68K code would STILL be way shorter, so you've achieved nothing.

tomaitheous wrote:I hope I don't sound like I'm full of shit

No no, not at all, but I definitely don't agree that your method is 'cleaner'

tomaitheous wrote:None of the projects I've worked on require to have global positions that large updated in realtime

...and I'm not saying that any do, although it definitely makes sense in some games to have things being updated at least some distance from the viewable area. However, it is very nice to be able to use world coordinates. A lot of games have the player coordinate in world space. So you may be in the centre of the screen, but at X=5000 in the world. The scroll coordinates are calculated from there. Other objects on the screen are also in world space. So you see, it often doesn't have anything to do with the size of the screen. It's just a nice way to keep track of things.

tomaitheous wrote:But still, 65536 is 256 screens on the SNES. That's a bit overkill, no?

Yes. But once you've given up trying to cram everything into 16 bits, 32 bit is the next step up, because it's less hassle than 24 bit.

tomaitheous wrote:I guess if you needed more than 32768x32768 map, then the 'clc' stays

No, you're missing the point. Think about it. Every time you add a negative value, you're going to get a carry. The clc needs to be there.
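The carry argument can be checked with a small model of two chained 16-bit `adc` operations. This is a Python sketch with invented helper names, not the production code being discussed; it represents 16.16 values as 32-bit two's complement integers.

```python
MASK16 = 0xFFFF


def adc16(a, b, carry):
    """16-bit add with carry in; returns (result, carry out)."""
    total = a + b + carry
    return total & MASK16, total >> 16


def add_16_16(pos, vel, carry_in):
    """Add two 16.16 values as a low-word adc followed by a high-word adc."""
    lo, carry = adc16(pos & MASK16, vel & MASK16, carry_in)
    hi, carry = adc16(pos >> 16, vel >> 16, carry)
    return (hi << 16) | lo, carry


def to_16_16(value):
    """Encode a number as 16.16 two's complement."""
    return int(value * 65536) & 0xFFFFFFFF


# X moves left by 1.5 pixels: the high-word add leaves the carry set.
x_pos, carry_after_x = add_16_16(to_16_16(100.0), to_16_16(-1.5), 0)

# Y add with the carry cleared first (clc) and with the stale carry left in.
y_with_clc, _ = add_16_16(to_16_16(50.0), to_16_16(0.25), 0)
y_without_clc, _ = add_16_16(to_16_16(50.0), to_16_16(0.25), carry_after_x)
```

In this model the X result is correct either way, but the leftover carry adds one unit of 1/65536 pixel to Y whenever the X velocity is negative. The error per frame is tiny, but it is there on every such frame.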

tomaitheous wrote:Hmm. I'd have to disagree with that

Well, then I'll tell you why you're wrong

What if you wanted to fire a missile from any point on the screen to any other point on the screen? To get a missile to move from just off the left of the screen, to just off the right of the screen one scanline lower, you'd need at least 9 bit fractions. You'd definitely want more accuracy than that if you wanted these things moving at different speeds, or changing speeds.
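A rough way to see where the 9-bit figure comes from: if the missile advances one pixel horizontally per step and must drop exactly one pixel over the whole trip, the vertical increment per step is one divided by the number of steps, and the fraction needs enough bits for that increment to be non-zero. The Python sketch below assumes a 256-pixel screen plus a 16-pixel sprite for the trip length; those numbers and the function names are assumptions for illustration.

```python
def per_step_increment(frac_bits, horizontal_steps):
    """Truncated fixed-point vertical increment for a one-pixel total drop."""
    return (1 << frac_bits) // horizontal_steps


def min_fraction_bits(horizontal_steps):
    """Smallest fraction width whose per-step increment is non-zero."""
    bits = 0
    while per_step_increment(bits, horizontal_steps) == 0:
        bits += 1
    return bits


TRIP = 256 + 16  # assumed: 256-pixel screen plus a 16-pixel sprite

bits_needed = min_fraction_bits(TRIP)
increment_with_4_bits = per_step_increment(4, TRIP)
```

With a 4-bit fraction the increment truncates to zero, so the missile would never descend at all. Even at the minimum width the truncated increment only covers about half of the intended one-pixel drop, which fits the remark that more accuracy is wanted in practice.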

Now you may think that that extreme example of moving just one scanline isn't going to happen often. But you still need way more than 4 bits to get any sort of accuracy here, and do you really want to limit what your game engine can do just to save a few cycles? Please say no.

Can you see now why 32 bit is a good idea even if your game DOESN'T scroll?

tomaitheous wrote:16.8+8.8->16.8

That code does not work. Try it. It's also wasteful, and doesn't do 24+24=24

You think it's worth a 2 cycle saving?

[edit]oh dear, this thread is now waaay off topic. Sorry, sheath

Chilly Willy · Post by **Chilly Willy** » Thu Feb 12, 2009 11:04 am

Snake wrote:I'll start by repeating that I actually like the 65816, and I've probably done as much code on it as I have the 68K.

I got my start on the Atari 400, so I'm more familiar with the 6502 than the 65816. Lemme tell ya - on one of those systems, you crammed for every byte and cycle you could. I'll give an example later.

I really wasn't trying to force the issue, but since you don't seem to be convinced, and this isn't a good enough example, I'll give you this one:
asr.w #4,d0
There, that's deliberately limited to only 16 bit, and something that will almost certainly appear in a lot of games. How many 65816 instructions is that going to take? If someone can do it in 4 or less, and without cheating by using a huge table, I'll be impressed.

Ouch! Yeah, that's a good one.

What if you wanted to fire a missile from any point on the screen to any other point on the screen? To get a missile to move from just off the left of the screen, to just off the right of the screen one scanline lower, you'd need at least 9 bit fractions. You'd definitely want more accuracy than that if you wanted these things moving at different speeds, or changing speeds.

Okay, here's where I show my 8-bit roots. I can understand tomaitheous here because we didn't have the luxury of using 32 bit fixed point math for this. Here's what you'd do for this: You record the start and the end coords (as 8-bit values, not 16! ). Then you do a highly optimized Bresenham line algorithm, but instead of plotting the points, you use them as sprite coords.
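As an illustration of the approach described here, the following is a textbook integer Bresenham line walk in Python that yields coordinates instead of plotting pixels. It is a generic sketch, not the highly optimized 6502 routine the post refers to, and the function name is made up.

```python
def bresenham_points(x0, y0, x1, y1):
    """Yield integer (x, y) positions along a line, endpoints included."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    error = dx + dy
    while True:
        yield x0, y0
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * error
        if doubled >= dy:      # step along X
            error += dy
            x0 += step_x
        if doubled <= dx:      # step along Y
            error += dx
            y0 += step_y


# A missile crossing 256 pixels while dropping a single scanline.
missile_path = list(bresenham_points(0, 0, 255, 1))
```

Only integer adds and compares are involved, so no fractional coordinate has to be stored. The trade-off is that each moving object carries an error term and its endpoints, rather than a position and a velocity.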

... and do you really want to limit what your game engine can do just to save a few cycles? Please say no.

Want to? No! Have to? Quite often.

Snake · Post by **Snake** » Thu Feb 12, 2009 11:18 am

Chilly Willy wrote:Want to? No! Have to? Quite often.

On 8 bit systems, absolutely

But this is a SUPER nintendo. For SUPER games

Exophase · Post by **Exophase** » Thu Feb 12, 2009 3:58 pm

Snake wrote:Well, my reply was in response to you saying that "4 times as many instructions executed" needed backup. I could have used a loop in my example, but I didn't, because although it would have looked shorter on paper, it would have been more instructions executed. I also pointed out that the 65816 version might be smaller cycle-count wise. Sorry, I thought that was pretty clear.

Number of instructions executed, number of instructions in the code, and cycle times are all different variables. You can easily have fewer instructions that take more cycles, which is what I'm sure you were trying to demonstrate.

For reference on the cycles per instruction matter: Genesis games tend to achieve about 700,000 instructions per second and PC-Engine games about 1.5-1.8 million instructions per second (as I said earlier, PC-Engine is clocked slightly below Genesis). These numbers were obtained by profiling several games; I profiled PC-Engine and Lordus profiled Genesis. The numbers are a little biased given that they're sucking in the idle loops as well (done w/o any kind of idle loop detection/elimination of course) so they have a bit of a BogoMIPS quality to them, but they do include games that are genuinely pushing the CPUs pretty hard on both platforms.
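The instruction rates above can be turned into rough average cycles per instruction. The Python sketch below assumes the commonly quoted nominal clocks of about 7.67 MHz for the Genesis 68000 and about 7.16 MHz for the PC-Engine; those clock values are assumptions added here, and the instruction rates are the ones given in the post.

```python
GENESIS_68000_HZ = 7_670_000    # assumed nominal NTSC clock
PC_ENGINE_6280_HZ = 7_160_000   # assumed nominal high-speed clock


def cycles_per_instruction(clock_hz, instructions_per_second):
    """Average clock cycles spent per executed instruction."""
    return clock_hz / instructions_per_second


genesis_cpi = cycles_per_instruction(GENESIS_68000_HZ, 700_000)
pce_cpi_low = cycles_per_instruction(PC_ENGINE_6280_HZ, 1_800_000)
pce_cpi_high = cycles_per_instruction(PC_ENGINE_6280_HZ, 1_500_000)

# How many more instructions per second the PC-Engine figures represent.
rate_ratio_low = 1_500_000 / 700_000
rate_ratio_high = 1_800_000 / 700_000
```

Under these assumptions the 68000 averages roughly 11 cycles per instruction and the 6280 roughly 4 to 5, so the PC-Engine executes a little over twice as many instructions per second. As the post notes, idle loops are included in the profiles, so these are ballpark numbers.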

PC-Engine isn't SNES and 65c816 isn't 6280, but the cycle timings should at least be in the same ballpark - 6280 has additional penalties for its memory translation, 65c816 will have 16bit instructions that cost more going over its 8bit bus.

I'm in no way trying to argue that 68k instructions aren't much more efficient than 65c816 ones, I just think that 4x is an exaggeration, I'd expect something closer to 2-3x for well optimized games written with the architecture in mind. But I'm not going to stand by that number because it's so difficult to actually verify. This is for instructions executed and not written, it's obviously worse for written because 65xx demands more inlining and unrolling, and has far less in the way of "macro instructions" than 68k does.

Snake wrote:In a 16 bit game? Most games I've seen scroll 2 pixels per frame or faster. 1 pixel looks very slow. It takes no time at all to scroll 4096 pixels...

What kind of game are you talking about here? If there's a need to store the coordinates for anywhere on the level it means that you have to be able to traverse the level at any given point. If the game is automatically scrolling like a shooter then you don't have this requirement. You also don't have it if the enemies have a limited active/spawn range like many games do (Mega Man games for instance). For a sidescroller, 16 screens for a level, no, for an AREA is plenty. When you reach the edge you can stop scrolling and go to a new region (which many sidescrollers do). You'll also often only be scrolling heavily in one area, although it varies. In most games that actually let you persistently go back and forth between the entire area you aren't going to be constantly scrolling through it, in both directions no less. Of course there are going to be exceptions.

Snake wrote:No, of course I don't, do you have a reference that says they don't?

Just doesn't seem to be something that's necessary based on how typical games seem to be modeled.

Snake wrote:But I know that some do, and I'm sure a LOT do, because contrary to what you might think, it makes perfect sense to do so. However, that doesn't really matter. If you were to look, you'll find a significant number of SNES games use 32 bit addition *somewhere*, even if not for coordinates. I can say that with a lot of confidence. The SNES hardware multiply is insanely quick, and returns a 24 bit result. Games are definitely going to use it, and are definitely going to want to add results - and as I already mentioned, 24 bit adds are more work than 32 bit ones.

I have no doubt that some use 32bit for coordinates, the question is whether or not they really have to. I'm sure that many games do things they don't really have to do.

Yeah, there might be a 32bit addition somewhere, but at least I'm talking about critical code and not just needing the instructions for it at some point. Saying that games are liable to use the 24bit results of a multiplication is like saying that they often use 64bit results for 64bit adds when 32x32->64 multiplies are available - generally speaking, they don't. I mean, they do sometimes, but much less often than they don't (games, that is).

Again, the code was only an example. Maybe tomaitheous can provide an example showing the 65816 using less instructions than the 68000. I'm sure other people can provide many more examples like mine.

Snake wrote:It *IS* a simple add. It's not like it's hideously slow or anything, far from it, and definitely isn't something any SNES programmer would consider a 'bad approach'. It's necessary, and it's good code. If 16 bit is good enough, you'd do 16 bit on the 68000 too, and it would be faster, and still be less instructions than the 65816 version.

I really wasn't trying to force the issue, but since you don't seem to be convinced, and this isn't a good enough example, I'll give you this one:
Code: Select all
asr.w #4,d0
There, that's deliberately limited to only 16 bit, and something that will almost certainly appear in a lot of games. How many 65816 instructions is that going to take? If someone can do it in 4 or less, and without cheating by using a huge table, I'll be impressed.
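For a rough count on the 65816 side, here is a C model of two well-known idioms for a signed 16-bit shift right by 4. It is only a sketch under assumptions the thread doesn't state: 16-bit accumulator mode, decimal mode off, and function names made up for illustration. The first models four `cmp #$8000` / `ror a` pairs (8 instructions); the second models four `lsr a` followed by `eor #$0800`, `sec`, `sbc #$0800` (7 instructions). Neither of these two gets down to 4.

```c
#include <stdint.h>

/* One CMP #$8000 / ROR A pair in 16-bit accumulator mode.
   CMP sets carry when A >= $8000 (bit 15 set); ROR A then shifts
   right and rotates that carry back into bit 15. */
static uint16_t cmp_ror_step(uint16_t a)
{
    unsigned carry = (a >= 0x8000u) ? 1u : 0u;
    return (uint16_t)((a >> 1) | (carry << 15));
}

/* Model of 4x (CMP #$8000, ROR A): 8 instructions. */
uint16_t asr4_cmp_ror(uint16_t a)
{
    for (int i = 0; i < 4; ++i) {
        a = cmp_ror_step(a);
    }
    return a;
}

/* Model of 4x LSR A, then EOR #$0800, SEC, SBC #$0800: 7 instructions.
   After the logical shift the old sign bit sits in bit 11; the
   xor/subtract pair sign-extends it into bits 12-15. */
uint16_t asr4_lsr_fixup(uint16_t a)
{
    a = (uint16_t)(a >> 4);
    a = (uint16_t)(a ^ 0x0800u);
    return (uint16_t)(a - 0x0800u);
}

/* Reference: what asr.w #4 computes, written as floor division so it
   does not depend on how the C compiler shifts negative values. */
uint16_t asr4_reference(uint16_t a)
{
    int32_t s = (a & 0x8000u) ? (int32_t)a - 0x10000 : (int32_t)a;
    int32_t q = (s >= 0) ? s / 16 : -((-s + 15) / 16);
    return (uint16_t)q;
}
```

Both models agree with the reference for every 16-bit input, so the comparison comes down to instruction and cycle counts rather than correctness.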

I had a feeling you'd get around to using something like this - which is of course a fair argument for the strength of the 68k. However, the problem I have with it is that it's an instruction that doesn't have a constant execution time but scales with the number of shifts - the way I've generally been looking at this (or, the reason why I've been saying anything) is to ascertain an average number of instructions necessary vs an average number of cycles per instruction. When you start using instructions like this it throws off the latter.

Let me reiterate this a little bit. Would you argue that a 68k typically needs less than 1/4th of the MIPS that a 65c816 needs to run an SNES/Genesis era game?

Snake wrote:What if you wanted to fire a missile from any point on the screen to any other point on the screen? To get a missile to move from just off the left of the screen, to just off the right of the screen one scanline lower, you'd need at least 9 bit fractions. You'd definitely want more accuracy than that if you wanted these things moving at different speeds, or changing speeds.

Now you may think that that extreme example of moving just one scanline isn't going to happen often. But you still need way more than 4 bits to get any sort of accuracy here, and do you really want to limit what your game engine can do just to save a few cycles? Please say no.
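The 9 bit figure can be checked with a little arithmetic. The sketch below assumes a missile that advances one pixel horizontally per frame across a 256-pixel-wide screen plus 8 pixels of off-screen margin on each side (272 frames in total). Those numbers and the function names are illustrative assumptions, not taken from the posts.

```c
#include <stdint.h>

/* Smallest number of fraction bits n such that the smallest nonzero
   per-frame step (2^-n pixels) moves at most one pixel over `frames`
   frames, i.e. the smallest n with 2^n >= frames. */
unsigned min_fraction_bits(unsigned frames)
{
    unsigned bits = 0;
    while ((1ull << bits) < frames) {
        ++bits;
    }
    return bits;
}

/* Whole pixels covered after `frames` frames when moving at the
   smallest nonzero velocity a format with `frac_bits` fraction bits
   can represent (one fixed-point unit per frame). */
unsigned min_drift_pixels(unsigned frames, unsigned frac_bits)
{
    return frames >> frac_bits;
}
```

Under these assumptions `min_fraction_bits(272)` comes out at 9, and with a 4 bit fraction the shallowest nonzero slope already drops 17 scanlines over the same trip (`min_drift_pixels(272, 4)`). Whether a game needs angles that shallow is the open question here.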

I didn't realize that a high percentage of SNES/Genesis era games actually let you fire projectiles at such small angles, with many being limited to 8 directions (which don't need any fractional portion at all). Not wanting to limit your game engine is a pretty open ended argument, where exactly do you draw the line? Maybe some engines are just overengineered, but I'm sure that at least back then most of them weren't meant to be extremely versatile and reusable and I think what you're describing isn't a necessity for at least a large number of games (ignoring ones that don't require any kind of precision motion in the first place).

tomaitheous · Post by **tomaitheous** » Thu Feb 12, 2009 6:14 pm

Actually, it does. The 68000 is a 32 bit architecture with a 16 bit bus and ALU. It may split a 32 bit add internally into two 16 bit adds, but the carry is certainly handled correctly without the programmer needing to do anything explicit. As far as the programmer is concerned, the 32 bit add is one operation.

No no, you misunderstand. I mean the upper two bytes of the LONG. He's got a 32bit fixed point add, with the lower 16bit being the float and the upper being the whole. A 32bit addition could result in overflow in his example, which means the 'whole' part would roll over. I was just saying that unless he expects the whole part to roll over (or set the flag from a signed addition), you don't need the clear carry for the start of the Y addition in the '816 version.
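The carry argument can be modelled in a few lines of C. This is a sketch, not the code from the thread: `Fixed16_16`, `adc16`, `add_fixed` and `add_velocity` are hypothetical names, and `adc16` only imitates a 16-bit binary-mode `adc`. The X add starts with carry clear, and the Y add simply inherits whatever carry the X high word left behind.

```c
#include <stdint.h>

/* 16.16 fixed point value: low word is the fraction, high word the whole part. */
typedef struct {
    uint16_t frac;
    uint16_t whole;
} Fixed16_16;

/* Imitates a 16-bit ADC: adds a, b and the incoming carry,
   then updates the carry with bit 16 of the sum. */
static uint16_t adc16(uint16_t a, uint16_t b, unsigned *carry)
{
    uint32_t sum = (uint32_t)a + b + *carry;
    *carry = (unsigned)((sum >> 16) & 1u);
    return (uint16_t)sum;
}

/* pos += vel as two chained 16-bit adds; returns the carry out of the
   whole part, which is 1 only if the whole part rolled over. */
unsigned add_fixed(Fixed16_16 *pos, Fixed16_16 vel, unsigned carry_in)
{
    unsigned carry = carry_in;
    pos->frac = adc16(pos->frac, vel.frac, &carry);
    pos->whole = adc16(pos->whole, vel.whole, &carry);
    return carry;
}

/* X add starts with carry clear (the CLC); the Y add reuses the
   carry left by X instead of clearing it again. */
void add_velocity(Fixed16_16 *x, Fixed16_16 vx, Fixed16_16 *y, Fixed16_16 vy)
{
    unsigned carry = add_fixed(x, vx, 0);
    add_fixed(y, vy, carry);
}
```

As long as the X whole part never rolls over, the carry handed to the Y add is 0 and the second clear is redundant. If X does roll over, Y picks up one stray fraction unit, which is the case described above.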

Maybe tomaitheous can provide an example showing the 65816 using less instructions than the 68000.

Heh. I'm more interested in cycle times. But from some of my own examples I did on the side for code size, the 65x usually seems to be in the 3.x range of instructions (relative to the 68k) for similar code and the 658x around 2-3.x.

Fair enough, the trick I used there isn't immediately obvious to people coming from 6502-land, where they're not used to having 16 bit indexing

I meant that seeing only two characters in the operand, and not being an immediate, clicked as ZP/DP. I didn't see the shift operator usually used on the operand, but I figured you were using a different assembler or your 65x/658x was just rusty (hehe). Anyway, it's cleared up now.

Now you may think that that extreme example of moving just one scanline isn't going to happen often. But you still need way more than 4 bits to get any sort of accuracy here, and do you really want to limit what your game engine can do just to save a few cycles? Please say no.

Hmm. Maybe we're talking about two different things then. Are you talking about scrolls or objects? A 4bit float gives you 16 subpixel precision. To be honest, I can't think of a situation where I need more precision than that for an object. In my current shooter engine, I do use an 8bit float part - but it's way overkill. It's done for speed reasons. If I find a way to do 9bit.4bit+=4bit.4bit (no, I don't code 6280 in C) faster, then I will. None of my objects needs to move faster than 16.99 pixels per frame.

That code does not work. Try it. It's also wasteful, and doesn't do 24+24=24. You think it's worth a 2 cycle saving?

Haha - I didn't add the whole part, but just the carry instead! 0bit.8bit movement FTW

The 16bit.16bit is faster (with my memory arrangement). That's what I get for writing examples on the fly at 1am in the morning.

But my point was that you don't need 24+24->24. 24bit is your object position, but do you really need an object to move at a rate higher than 256 pixels per frame? Though that example above, if corrected, would see no cycle improvement and 1 byte less per array entry.

For 658x, I'll stick with 12bit.4bit for objects until I come across the need for it (16.16) - in which case there will be a line of ascii text in the rom (or ISO) stating Mr. Snake told me so

Nah, I didn't. Seriously, to save anything there you'd need to either use 4 INX instructions, and store the data in a slightly strange way, or use 2 and store it in a REALLY odd way. Either way, you'll save, what, 2 instructions max? It isn't worth the saving, and I guarantee you'd just complicate code elsewhere anyway. Plus the 68K code would STILL be way shorter, so you've achieved nothing.

I wouldn't say it's odd. It's nicely organized, separated into high/low 16bit parts. And now that I know you're not using DP, but straight indexed addressing, it makes it even simpler. The method would just be an extension of what you'd do with WORDs on 8bit 65x processors. Two 'inx' is less complicated, cleaner looking, easy to understand in relation to the overall process, and it saves some cycles to boot. But the other advantage is in changing the adder: it's just a simple asl, tax and then changing the upper and lower arrays with the same index. That's really standard and commonplace for 65x/658x (at least judging from 6502.org, nesdev, c64, and A8 dev forums).
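For readers who don't think in 65x idioms, the split layout is roughly a structure of arrays. The C sketch below is only an analogy under assumed names (`ObjectTable`, `MAX_OBJECTS`, `add_velocity_x` and `add_velocity_all` are all hypothetical): each 16bit half lives in its own word array, and one index selects the same object in every array, so stepping to the next object is a single index bump (two `inx` on the '816, since each entry is two bytes).

```c
#include <stdint.h>

#define MAX_OBJECTS 64  /* hypothetical table size */

/* Low (fraction) and high (whole) words kept in separate arrays,
   all addressed with the same object index. */
typedef struct {
    uint16_t x_lo[MAX_OBJECTS];
    uint16_t x_hi[MAX_OBJECTS];
    uint16_t vx_lo[MAX_OBJECTS];
    uint16_t vx_hi[MAX_OBJECTS];
} ObjectTable;

/* x += vx for one object: add the low words, then add the high
   words plus the carry out of the low add. */
void add_velocity_x(ObjectTable *t, unsigned i)
{
    uint32_t lo = (uint32_t)t->x_lo[i] + t->vx_lo[i];
    t->x_lo[i] = (uint16_t)lo;
    t->x_hi[i] = (uint16_t)(t->x_hi[i] + t->vx_hi[i] + (lo >> 16));
}

/* Walk the first `count` objects; the loop increment plays the role
   of the index bump between objects. */
void add_velocity_all(ObjectTable *t, unsigned count)
{
    for (unsigned i = 0; i < count && i < MAX_OBJECTS; ++i) {
        add_velocity_x(t, i);
    }
}
```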

The 658x actually surprised me. I figured they'd be neck and neck at the same clock speed, but the 658x pulls ahead in cycle times in quite a few examples and with minimal effort (IMO). I'm reluctant to take this discussion further into optimization territory, though. Optimizations are very specific to the processor *and* situation. It's too much time and effort to post huge examples ('cause you'd need to know the exact situations and details, etc). Not that this wasn't fun, but I'd rather spend that time coding

I've always thought that putting a 68k at the same clock speed in the SNES wasn't going to solve the slowdown problems (and given some examples, it might slightly add to them - who knows. Depends on the programmer). This discussion is evidence of that, I think. All the SNES needed was a little kick in the ass, not a different processor. You don't know how many times I've seen the "If only the SNES had a 68k" comment.

I would love more discussions like this though. Not processor VS threads, but game logic and related coding, and the delicate dance of optimizations - be it high level or low level. For platformers (simple or complex), shooters, RPGs, etc.

On a side and somewhat related note: I had a chance to look at the source code (the actual programmer's source, not some disassembled stuff) of the Art of Fighting Neo Geo port to the PCE ACD. It's funny in that you can see the programmer just simulated the 68k registers with ZP and translated the code almost line for line into 6280 in most areas. Surprisingly, it runs fast enough not to slow down, given that the source was a 68k at 12MHz, but it's a really inefficient way of porting a game, speed wise. I've always theorized that the Megadrive had a 68k because its main focus was on arcade ports when it was released, as mentioned by the designer in some interview, and because a large majority of arcade systems were using 68k processors, making for easier code portability.