Why's mah assembly so gosh-darn slow?

Ask anything you want about 32X Mushroom programming.

Moderator: BigEvilCorporation

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Fri Jul 07, 2017 4:23 pm

If you're talking about the SH2 side, long writes can ONLY go to long-aligned addresses (multiples of four). Anything else generates an address error, although it's possible an emulator ignores that and simply writes to a long-aligned address. Most emulators ignore things like address errors for speed. It's why you see homebrew that works on emulators but fails on real hardware. You can do most of your testing on emulators (I do), but you should always test on real hardware periodically. If you don't have real hardware to test on, find someone who does (most folks here) who can test it for you.
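
As a rough illustration of the alignment rule above, a C helper can check an address before a routine commits to long accesses. The function names are made up for this sketch and are not taken from any 32X toolchain.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 when the address is a multiple of 4, i.e. usable for a long access. */
static int is_long_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Number of leading bytes to handle one at a time before p reaches a long boundary. */
static size_t bytes_to_long_boundary(const void *p)
{
    return (size_t)((4u - ((uintptr_t)p & 3u)) & 3u);
}
```

A blitter could call `is_long_aligned` on both source and destination and fall back to byte copies when either check fails.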

KillaMaaki
Very interested
Posts: 84
Joined: Sat Feb 28, 2015 9:22 pm

Post by KillaMaaki » Fri Jul 07, 2017 5:17 pm

Gah. Well my drawing routines are about to get a bit more complicated then... T u T
See if I can get it simple enough to still outperform byte copies while doing something like drawing whatever lefthand portion is not long aligned byte-for-byte, then blast out the rest in long copies.

And I currently do not have access to hardware myself (got a base Model 2 that I think needs some resoldering, but no 32X addon). I may have someone in a gamedev discord who has access to a 32X though, I could see if I can pester him :)

EDIT
SHIT. If my sprites aren't long aligned in the framebuffer then even if I pad the misaligned portion with byte copies my sprite data reads are not long aligned then. So basically if they're not 4-pixel aligned I'm just SOL for optimization as far as I can tell. Awesome. Well at least this keeps my blit code simple T u T

EDIT 2
Well, I made the change so it just resorts to a per-byte copy for sprites not aligned to 4-pixel boundaries horizontally. Now it seems to handle around 80 sprites tops before performance starts taking a dump, whereas it handled easily 150 or so before T - T
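
The fallback described in this post can be sketched in C roughly as follows. It is an assumption about the shape of such a routine, not the actual blit code from the project: longs are used only when source and destination share the same misalignment, and everything else goes byte by byte.

```c
#include <stdint.h>
#include <stddef.h>

/* Copy n pixels (one byte each). Longs are used only when src and dst share
   the same misalignment, so both reach a long boundary at the same time. */
static void copy_row(uint8_t *dst, const uint8_t *src, size_t n)
{
    if ((((uintptr_t)dst ^ (uintptr_t)src) & 3u) != 0) {
        /* Alignments differ: no long access can be legal on both sides. */
        while (n--)
            *dst++ = *src++;
        return;
    }
    /* Head: single bytes until both pointers sit on a long boundary. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* Body: aligned long copies, four pixels at a time. */
    while (n >= 4) {
        *(uint32_t *)(void *)dst = *(const uint32_t *)(const void *)src;
        dst += 4;
        src += 4;
        n -= 4;
    }
    /* Tail: whatever is left over. */
    while (n--)
        *dst++ = *src++;
}
```

With sprite data that always starts on a long boundary, the fast path is only taken when the destination x position is a multiple of four pixels, which matches the slowdown reported above.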

Post by Chilly Willy » Fri Jul 07, 2017 6:36 pm

Yes, tile drawing in general can get complicated if you want it as fast as possible. You can normally guarantee that the source is aligned, so you want to read as much data as possible the fastest way possible... read at least a long, maybe two or four if you can spare the registers. Then you have complex code to store the contents of those registers to the frame buffer in the most efficient way that obeys alignment rules. Or you can say "screw it" and just do a byte/word loop (for 8 bit/16 bit pixels). As long as it's fast enough for what you want your game to do, you don't need further optimization. Also, it's possible there are other ways to optimize than making a mess of the drawing routine. For example, the most common way to make tile drawing/scrolling faster on 8-bit systems was to have multiple instances of the tile with different shifts. If you don't have too many tiles, just keep four copies with different byte offsets (for 8-bit sprites) that you can now still draw using longs. That will require one extra long per tile line, but like I said, this is an optimization based on having relatively fewer tiles that need to be drawn at an arbitrary offset (sprites).

So have one table of tiles for the background that have only one copy of each tile, and one table of tiles for sprites that have four copies of each tile, one at each byte offset.
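
A minimal sketch of building those four copies, assuming 8-bit pixels, a width that is a multiple of four, and colour 0 as the transparent padding value. The function names and the uniform "width plus one long" stride are choices made for this example only.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Bytes per line in every shifted copy: the source width plus one padding long. */
static size_t shifted_stride(size_t width)
{
    return width + 4;
}

/* Build four copies of an 8-bit sprite; copy number `shift` holds the pixels
   moved right by `shift` bytes. out must hold
   4 * shifted_stride(width) * height bytes.
   Padding bytes are set to colour 0, assumed here to mean transparent. */
static void build_shifted_copies(uint8_t *out, const uint8_t *sprite,
                                 size_t width, size_t height)
{
    size_t stride = shifted_stride(width);
    for (size_t shift = 0; shift < 4; shift++) {
        uint8_t *copy = out + shift * stride * height;
        memset(copy, 0, stride * height);
        for (size_t y = 0; y < height; y++)
            memcpy(copy + y * stride + shift, sprite + y * width, width);
    }
}
```

Because the padding bytes get written along with the pixels when drawing in longs, the draw routine still has to treat them as transparent or mask the first and last long of each line.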

Post by KillaMaaki » Fri Jul 07, 2017 6:55 pm

Man that offset copy trick sounds like it'd work but sounds real ugly lol.
I guess I could store those copies contiguously in memory, then just take a horizontal offset and multiply by total sprite size to get pointer to offset version.
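
That lookup might look something like this in C, assuming the four copies are the same size and stored back to back. Both helper names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Pick the pre-shifted copy matching the low two bits of the x position.
   copy_size is the size in bytes of ONE shifted copy. */
static const uint8_t *select_shifted_copy(const uint8_t *copies,
                                          size_t copy_size, int x)
{
    return copies + (size_t)(x & 3) * copy_size;
}

/* The x position the selected copy is drawn at: rounded down to a multiple of 4
   so that every framebuffer write can be a long. */
static int aligned_dest_x(int x)
{
    return x & ~3;
}
```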

Post by Chilly Willy » Fri Jul 07, 2017 7:11 pm

Ugly but fast! That's old-school game programming in a nutshell. 8) :lol:

You could also keep all the tiles with a common shift together and use masks for the first and last long. That would allow the most compact storage, but require a little more effort in drawing. If you have fewer tiles, I'd keep them separate and have a simpler/faster drawing routine. Say you're using 16x16 tiles - normally they'd take 4 longs per line; with the shifting, one unshifted copy would be 4 longs x 16 lines, and three copies of each at a further byte offset would be 5 longs x 16 lines. So the total storage would be (4 + 3x5) x 16 = 19x16 longs - just slightly less than five times the space.
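
The storage arithmetic above can be checked with a few lines of C. The function name is invented for this example, and it assumes 8-bit pixels with a width that is a multiple of four.

```c
#include <stddef.h>

/* Longs needed for one unshifted copy plus three shifted copies of an 8-bit
   tile. width is in pixels and assumed to be a multiple of 4. */
static size_t shifted_tile_longs(size_t width, size_t height)
{
    size_t longs_per_line = width / 4;
    size_t unshifted = longs_per_line * height;
    size_t shifted = (longs_per_line + 1) * height;  /* one extra long per line */
    return unshifted + 3 * shifted;
}
```

For a 16x16 tile this gives 64 + 3 x 80 = 304 longs, which is 19x16, against 64 longs for the plain tile.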

Post by KillaMaaki » Fri Jul 07, 2017 7:34 pm

I'm only really concerned about sprite rendering here, as tilemaps will be handled by the MegaDrive's VDP (early tests suggested I would not be getting a solid 60 out of trying to draw the tilemap purely in software, so I figure it'd be easy enough to just let the hardware VDP take care of that!).

So do you think it'd make sense to preprocess and store sprites in that copied-and-shifted format directly in the ROM, or would it make sense to try and make that part of a sprite loading function (retrieves source sprite from ROM, allocates buffer with enough space for copies and pastes shifted copies into that buffer)?

EDIT Or perhaps I approach this differently. The more I think about it the more I realize that doing this shift trick for everything that needs to be rendered might be a bit silly. So maybe instead I just do it on a case-by-case basis. Most things just rendered with the code I've got now, and then hand-pick elements that are processed and rendered using the shift-trick as an optimization if the situation demands it.

EDIT 2 Also happy to report that the guy I was talking about earlier on the gamedev Discord server just tested my ROM on his everdrive and sent back a video and everything appears to be working precisely as I intended. So it's 100% functional on real hardware, yay! :)

Post by Chilly Willy » Fri Jul 07, 2017 10:47 pm

Good to hear! As to the sprites, I'd make the loader code automatically make shifted versions if needed. That would be really easy to do while copying the sprite from rom. No need to waste rom space on something trivial. And yes, if you can design the game to not need it more often than not, that's clearly the way to go. With vertical scrolling shooters, only SOME of the sprites will move horizontally, and the others can simply always be aligned for proper drawing. Since you're using the MD layers for the background, as long as the sprites that move horizontally are not on the same line with ones that don't, you can use the 32X VDP line table and the scroll register to move them over a few pixels when they don't fall on a natural alignment. More work in maintaining the 32X screen, but if it fits the game better, it's yet another way to deal with it. The scroll register shifts the line a pixel, and the line table lets you set the word the line starts at. Between the two, you can cover any byte offset. But it means needing to use the horizontal interrupt.
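
The split between the line table and the shift can be sketched as below, assuming 8-bit pixels so that one framebuffer word holds two pixels. The struct and function are hypothetical, the actual register writes are left out, and which direction the hardware shift moves the line has not been verified here.

```c
#include <stdint.h>

/* How a per-line byte offset divides between the two mechanisms. */
typedef struct {
    uint16_t word_offset; /* adjustment to the line's entry in the line table */
    uint16_t shift_bit;   /* 1 = use the one-pixel screen shift for this line */
} line_offset_t;

/* Split a byte (pixel) offset into a whole-word part and a one-pixel remainder. */
static line_offset_t split_byte_offset(unsigned byte_offset)
{
    line_offset_t r;
    r.word_offset = (uint16_t)(byte_offset >> 1);
    r.shift_bit = (uint16_t)(byte_offset & 1u);
    return r;
}
```

For the offsets 0 to 3 that matter for long alignment, the word part is 0 or 1 and the shift bit covers the odd pixel.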

Post by KillaMaaki » Sun Jul 16, 2017 7:43 am

OK, so here's a mystery

Why on earth does the simple act of adding another entry to the slave vector base table, even just:

	.long slav_irq

Suddenly make my rom break Gens completely? That is, if I *just* do that, that one thing all by itself, if I try to reload the ROM after loading it the first time, reset the CPU, or try to load literally any other ROM, Gens just sits there on a black screen. Your XM player does not exhibit this behavior at all.
I feel like this is somehow related to having issues getting any slave code to work (it's like the slave function in the hw_32x.c just doesn't get called at all, but I can't for the life of me figure out why)

EDIT: OK so the issues are now the same between Fusion and Gens. And it's *definitely* related to my slave code not running.

Now, if I do this in the slave function:

MARS_SYS_COMM6 = MIXER_UNLOCKED;

With my main function just sitting there waiting for MARS_SYS_COMM6 to be equal to that value, it just waits forever. If I do:

while( 1 )
{
	MARS_SYS_COMM6 = MIXER_UNLOCKED;
}

It works. I guess something, at some point after slave() runs, resets the comm register? I'm really not sure.

But then, if I add that interrupt, any interrupt at all, to the slave vector table, it goes right back to just waiting forever. I'm assuming there's something about this that I'm just being completely clueless about, but this has me stumped.

Post by Chilly Willy » Sun Jul 16, 2017 3:05 pm

It means you're using one of my old fixed headers without understanding what the entries are.

! Standard Mars Header at 0x3C0

        .ascii  "Rick Dangerous  "              /* module name (16 chars) */
        .long   0x00000000                      /* version */
        .long   __text_end-0x02000000           /* Source (in ROM) */
        .long   0x00000000                      /* Destination (in SDRAM) */
        .long   __data_size                     /* Size */
        .long   0x06000240                      /* Master SH2 Jump */
        .long   0x06000244                      /* Slave SH2 Jump */
        .long   0x06000000                      /* Master SH2 VBR */
        .long   0x06000120                      /* Slave SH2 VBR */

My newer header helps with this by using labels.

! Standard Mars Header at 0x3C0

        .ascii  "Doom for 32X    "              /* module name */
        .long   0x00000000                      /* version */
        .long   __text_end-0x02000000           /* Source (in ROM) */
        .long   0x00000000                      /* Destination (in SDRAM) */
        .long   __data_size                     /* Size */
        .long   master_start                    /* Master SH2 Jump */
        .long   slave_start                     /* Slave SH2 Jump */
        .long   master_vbr                      /* Master SH2 VBR */
        .long   slave_vbr                       /* Slave SH2 VBR */
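
To see why one extra table entry is enough to break the fixed header, here is a small arithmetic sketch in C. It assumes the two vector tables sit back to back at the start of SDRAM with the startup code right behind them, which is what the hardcoded values suggest but is not spelled out in the thread. The macro and function names are made up.

```c
#include <stdint.h>

#define SDRAM_BASE     0x06000000u
#define OLD_TABLE_SIZE 0x120u   /* 0x06000120 - 0x06000000, i.e. 72 longs */

/* Address of the first byte after both vector tables, given how many extra
   longs have been appended to them. */
static uint32_t code_start_after_tables(uint32_t extra_longs)
{
    return SDRAM_BASE + 2u * OLD_TABLE_SIZE + 4u * extra_longs;
}
```

With no extra entries the result is 0x06000240, matching the hardcoded master jump. With one extra entry everything behind the tables moves up by four bytes, so under these assumptions the fixed 0x06000240 would point at the new table entry rather than at code. Labels avoid this because the assembler recomputes the addresses.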

Post by KillaMaaki » Sun Jul 16, 2017 8:02 pm

AH. Of course. Took another look at both sh2_crt0.s in this project and crt0.s in your XM player and of course you are correct. Your XM player uses labels, while the one in this project hardcodes the addresses.
Thanks! I would never have thought to look there otherwise. Consequence of diving in a little too deep? XD

EDIT: AWESOME. It no longer blows up. Bugfix get! Just had to modify the linker script to add the symbols your updated header uses, and add a couple of labels for the vector tables. Suppose I should watch out in the future for anywhere a code comment specifically mentions a memory address ;)

EDIT 2: Well I've now got it outputting an extremely irritating high pitched noise via PWM audio, so that's something. However, this leads me to problem #2.
Your example uses COMM4 to send commands to the slave CPU, and COMM6 as a mixer lock. However, there's a problem here - I'm already using those to send commands to the 68k. COMM4 is used by both my load map and my load tileset commands, to store the lower two bytes of a pointer. COMM6 is used by my load map command, to indicate size of the data.
Should I be finding another way to send this data over to the 68k so that it only uses COMM0 and COMM2? For example, by writing the command and two bytes of the pointer, the 68k reads those bytes and sets COMM2 to zero, the MSH2 sees that COMM2 is zero and writes the next two bytes of the pointer, the 68k reads those two bytes and sets COMM2 to zero, the MSH2 sees that COMM2 is zero and writes the length, the 68k... etc. That seems like an extremely hacky workaround, so is there a different way I can utilize these ports or should I just go with that approach?
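
For reference, a word-at-a-time handshake along those lines could be sketched like this. It is a host-side simulation, not 32X code: the struct stands in for the real comm registers, `fake_receiver` plays the 68k, and all names are invented. It also differs from the description above in one respect: the receiver acknowledges by clearing COMM0 instead of COMM2, so a payload word that happens to be zero is not mistaken for an acknowledgement.

```c
#include <stdint.h>

/* Hypothetical stand-in for two 16-bit comm ports. On hardware these would be
   the real MARS_SYS_COMM0 and MARS_SYS_COMM2 registers, not a struct in RAM. */
typedef struct {
    volatile uint16_t comm0;  /* nonzero = word pending, 0 = receiver is done */
    volatile uint16_t comm2;  /* one 16-bit payload word */
} comm_ports_t;

/* Host-only hook that plays the receiving CPU while the sender spins. */
typedef void (*peer_fn)(comm_ports_t *ports, void *ctx);

/* Sender side: publish one word, then wait for the receiver to clear comm0. */
static void send_word(comm_ports_t *p, uint16_t cmd, uint16_t word,
                      peer_fn peer, void *ctx)
{
    p->comm2 = word;  /* payload first... */
    p->comm0 = cmd;   /* ...then the flag, so the payload is valid when seen */
    while (p->comm0 != 0) {
        if (peer)
            peer(p, ctx);  /* simulation only; on hardware the loop is empty */
    }
}

/* Send a 32-bit pointer and a 16-bit length as three handshaked words. */
static void send_load_command(comm_ports_t *p, uint16_t cmd, uint32_t addr,
                              uint16_t length, peer_fn peer, void *ctx)
{
    send_word(p, cmd, (uint16_t)(addr >> 16), peer, ctx);
    send_word(p, cmd, (uint16_t)(addr & 0xFFFFu), peer, ctx);
    send_word(p, cmd, length, peer, ctx);
}

/* Receiver stand-in: log each word and acknowledge by clearing comm0. */
typedef struct {
    uint16_t words[3];
    int count;
} peer_log_t;

static void fake_receiver(comm_ports_t *p, void *ctx)
{
    peer_log_t *log = (peer_log_t *)ctx;
    if (p->comm0 != 0 && log->count < 3) {
        log->words[log->count++] = p->comm2;
        p->comm0 = 0;
    }
}
```

The cost is one round trip per word, which is why freeing up more comm registers is usually the nicer fix when it is possible.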

Post by Chilly Willy » Mon Jul 17, 2017 2:42 am

When I did the interrupt driven DMA PWM, I ran into that hardcoded address for the vector tables myself, so I fixed it right then, although all my projects that don't use the DMA interrupt tend to still be hardcoded. :lol:

If you use less than 16kHz for your sample rate, you may get aliasing noise on some 32X systems. I used 14kHz for Wolf32X, which was twice the sample rate for Wolf3D samples (7kHz). I don't get any aliasing noise on my 32X, but people reported they do, so for the last one, I bumped it up to 21kHz (three times the sample rate).
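
One simple way to play a 7kHz sample at an integer multiple such as 14kHz or 21kHz is to repeat each input sample. The sketch below shows only that idea; it is not taken from Wolf32X, and the function name and signed 8-bit sample format are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Stretch a low-rate sample to an integer multiple of its rate by repeating
   each input sample `factor` times. Returns the number of samples written,
   which is capped at out_cap. */
static size_t upsample_repeat(int8_t *out, size_t out_cap,
                              const int8_t *in, size_t in_len, unsigned factor)
{
    size_t written = 0;
    for (size_t i = 0; i < in_len; i++) {
        for (unsigned r = 0; r < factor; r++) {
            if (written == out_cap)
                return written;
            out[written++] = in[i];
        }
    }
    return written;
}
```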

My initial usage of the 32X comm registers was pretty simplistic. You can redo it however you like. I'd recommend writing down each comm register, then list what you want each processor to do, then make some assignments for the comm registers. If you make functions for MSH2 <> 68K to read the ports and the vcount, you no longer need to reserve those comm registers used for those. You would do that similar to my MSH2 function for reading the mouse - that sets a command in comm0 and then reads the mouse values from comm2.

Post by KillaMaaki » Mon Jul 17, 2017 3:49 am

Awesome, I realized I don't have any code that requires vcount passed from the 68k side and don't really plan to (and even if I do I could probably just put that into a command that returns data via COMM2 or something like you mention). Also noticed that the tickcount actually spans two words, which is perfect, so I just split it into two pointers COMM12 and COMM14. COMM12 is used as the mixer lock, and COMM14 is used as the command port for the slave SH2. Both commands sent to the 68k and commands sent to the slave SH2 can still use COMM2, COMM4, and COMM6, as any time I send a command to either one I have the master SH2 spin wait until the command completes and signals (so it's never doing both at the same time). All is right in the world again! :)
(also believe it or not that irritating high pitched noise I mentioned was definitely intentional, just wanted to see if I could get any sound at all).
Now to see if I can get this MUS player ported over... >:)

EDIT Weee, I have a single solitary piano note playing, and I can change its patch (currently only have a piano but the code is in place for loading other patches), its note, its volume, the left/right balance of the channel it belongs to, and also the pitch bend of the channel it belongs to. Since I've already prototyped the bulk of the playback code in Unity (so I could work out my slightly custom song format and the logic for playing it), what I have now should be a pretty solid foundation for porting the rest of the actual playback logic.

Post by Chilly Willy » Mon Jul 17, 2017 2:35 pm

MUS is pretty simple... much easier to deal with than MIDI, and even MIDI isn't that bad. What's tough is getting a full set of GM1/GM2 patches that sound good, but are still small enough to be worthwhile for a 32X game. That's the best thing about MOD/XM - they come with the instruments used built-in. :D

Post by KillaMaaki » Mon Jul 17, 2017 9:29 pm

I ended up picking a MUS-like format after talking with a composer in a game dev Discord. Not quite MUS because I've made some modifications to support loop points, and it also supports variable tickrates with a tickrate change event (at one point got really quite fed up with my timing conversion breaking on MIDI files so I just converted over the tempo change events to a custom event instead so I could directly use the input tick values), but it's still heavily based on MUS.
The primary strengths to going with this type of format are that it's easier to compose for, as you can just export a MIDI directly out of any modern DAW and then run it through my converter which spits out a SONG file (which is what I'm calling my custom format), and also that MUS and by extension this format is directly set up for having a global sample bank (including the fact that it spits out a table of all patches used by the song, which is great for only loading in samples off of ROM which are actually needed for a song).
XM is great, and almost certainly preferable, if you need a lot of unique samples per song. Sonic Rush and Sonic Rush Adventure would be good examples of this, with almost zero sample reuse from one song to the next (although I think they actually did use a MIDI-style format internally, with unique sample banks per song).
MUS is better if you need lots of sample reuse (short of creating a new XM-like format with patch ID instead of actual sample data, but then IMHO there's really not much good reason to use XM if you're just going to reference a global sample bank). Good examples would be games with symphonic soundtracks, or even games with rock-oriented soundtracks (where distortion guitar, bass, and drumkit are pretty much a constant in every track so you might as well reuse the samples).
Although, technically, my content builder project actually supports multiple sample banks - it spits out a pointer table to each sample bank in ROM. If I wanted to, I actually could even have my songs reference a sample bank ID and allow unique sample banks that way. In fact, I might do that - my content builder lets you cross-reference assets pretty easily, so I could just allow you to drop the sample bank onto the song that uses it. Then at build time it just spits out sample bank ID, and when it goes to load the patches for that song, it gets a pointer to the sample bank out of the pointer table using that ID and then loads the patches.
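
A minimal sketch of that bank lookup in C. The struct layouts and function names are hypothetical, since the SONG and sample bank formats are not shown in the thread.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical layouts; the real content builder formats are not shown here. */
typedef struct {
    const uint8_t *patch_data;   /* start of this bank's patches in ROM */
    uint16_t patch_count;        /* number of patches in the bank */
} sample_bank_t;

typedef struct {
    uint16_t bank_id;             /* index into the bank pointer table */
    uint16_t patch_count;         /* number of entries in patches_used */
    const uint16_t *patches_used; /* patch IDs the song actually plays */
} song_header_t;

/* Look up a song's bank; returns NULL when the ID is out of range. */
static const sample_bank_t *bank_for_song(const sample_bank_t *const *table,
                                          size_t table_len,
                                          const song_header_t *song)
{
    if (song->bank_id >= table_len)
        return NULL;
    return table[song->bank_id];
}

/* Count how many of the song's patches exist in the bank, i.e. the ones
   that would need loading from ROM. */
static size_t loadable_patches(const sample_bank_t *bank,
                               const song_header_t *song)
{
    size_t n = 0;
    for (uint16_t i = 0; i < song->patch_count; i++) {
        if (song->patches_used[i] < bank->patch_count)
            n++;
    }
    return n;
}
```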

Post by Chilly Willy » Tue Jul 18, 2017 3:52 pm

Sounds pretty nifty! An improved MUS format... might be better for Doom ports on lesser consoles. Or similar programs. :D
