Why's mah assembly so gosh-darn slow?

Ask anything you want about the 32X Mushroom programming.

Moderator: BigEvilCorporation

KillaMaaki
Very interested
Posts: 84
Joined: Sat Feb 28, 2015 9:22 pm

Re: Why's mah assembly so gosh-darn slow?

Post by KillaMaaki » Tue Jul 18, 2017 7:51 pm

It's not quite an improved MUS per se so much as a MUS modified for my needs. The original format would be better if you were going to play DOOM's MUS files specifically, as they're slightly smaller: delay bytes are optional, indicated with a delay flag on the event code. I got rid of that flag so I could have 4 bits for the event instead of 3, so now there's always at least one delay byte, which inflates sizes just a bit. The originals are also set up to use a hardcoded tickrate of 140Hz. I kept having irritating timing issues when trying to convert MIDI files over to the MUS timing of 140 ticks per second, so eventually I gave up and just tossed in a tickrate change event so that I could directly copy over the tick values from the source MIDI, which fixed all of my timing problems (nothing like listening to Supporting Me from SA2 played with all the musical acuity of a blackout drunk man slapping the keys... lol).

But if you wanted a general song playback system with global sample banks, loop point support, and easy conversion from MIDI, I do think this custom format's a pretty good one.

Speaking of which, I got music playback and full sample bank support up and running last night. Right now my sample bank only has piano and string patches, it doesn't support drumkits, and there are no ADSR envelopes yet, but none of those should be much of a challenge at this point.

https://www.youtube.com/watch?v=dlpF5Kxd9wo

Also after the video was already uploaded I just so happened to fix a problem with my echo code... so it also has a (currently hardcoded) echo of 0.2ms delay with 50% feedback. That part can probably be optimized and/or perhaps shuffled off into assembly, but still, pretty happy with it :D

EDIT: Actually, this should have occurred to me before, but I just now realized that one of the event codes is completely unused. It's the End of Measure event. I'm not entirely sure what its original purpose is, and I don't think any of DOOM's music files contain it, and my converter certainly does not export it. I could just repurpose its event code as a tickrate change, and then all of my event codes fit within the 0-7 range and then I can add that delay bit back in to trim down on filesize.
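
For reference, a minimal sketch of how an original-style MUS event descriptor and its optional delay could be decoded. It assumes the commonly documented layout (bit 7 is the delay flag, bits 6-4 the event code, bits 3-0 the channel) and a variable-length delay of 7 data bits per byte. The names `mus_desc`, `mus_decode_desc`, and `mus_read_delay` are hypothetical and are not taken from the converter discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; layout assumed from the original MUS format. */
typedef struct {
    uint8_t has_delay; /* bit 7: a delay follows this event */
    uint8_t type;      /* bits 6-4: event code, 0-7 */
    uint8_t channel;   /* bits 3-0: channel number */
} mus_desc;

static mus_desc mus_decode_desc(uint8_t b)
{
    mus_desc d;
    d.has_delay = (uint8_t)(b >> 7);
    d.type      = (uint8_t)((b >> 4) & 0x07);
    d.channel   = (uint8_t)(b & 0x0F);
    return d;
}

/* Reads a variable-length delay: 7 data bits per byte, and bit 7 set
 * means another byte follows. Returns the number of bytes consumed. */
static size_t mus_read_delay(const uint8_t *p, size_t len, uint32_t *ticks)
{
    size_t i = 0;
    uint32_t t = 0;
    assert(p != NULL && ticks != NULL);
    while (i < len) {
        uint8_t b = p[i++];
        t = (t << 7) | (uint32_t)(b & 0x7F);
        if (!(b & 0x80))
            break; /* last delay byte */
    }
    *ticks = t;
    return i;
}
```

With three bits for the event code, all eight codes plus the delay flag fit in the high nibble, which is why freeing an unused code allows the flag to come back.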

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Re: Why's mah assembly so gosh-darn slow?

Post by Chilly Willy » Wed Jul 19, 2017 1:24 am

There are two unused events: 5, which was end of measure as you mention, and 7, which takes an additional byte in the stream, but was otherwise unused. All current MUS players simply fetch the next byte and skip it when they encounter event 7. If your timing event takes one byte, it would be perfect. When MUS players encounter 5, they simply skip it without fetching any extra bytes. Not that I imagine you care all that much about people trying to play your format with existing players. :lol:

In case you weren't aware, 3 (system event) and 4 (change control) are actually the same other than 3 implies the value is 0 while 4 passes the value in the next byte.

Post by KillaMaaki » Wed Jul 19, 2017 5:08 am

Ooooh right, I don't know why it never occurred to me but yeah I don't need event 7 at all. I spit out event 7 whenever I encounter any event from a MIDI file that I'm not explicitly handling, as a default case. But now that you say that, I realize I don't need to at all - I can just read in the delay and add that delay to the last converted event. And then if I'm not spitting out any event 7s, and so long as I don't care about MUS players (which I don't), I can repurpose that event code for something else too.
Nice! Thank you.

EDIT: Side note, though: my tickrate change event is two bytes. I've seen several MIDIs with a computed tickrate above 255. Even a test file I made with just a few piano notes and a tempo of 170 already overflowed the limits of a byte. So, it's two bytes.
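
To illustrate the overflow, here is a small sketch of one way a tickrate could be computed from a MIDI file. The formula (ticks per second = BPM × PPQ / 60) and the PPQ value of 96 used in the note below are assumptions for illustration; the converter's actual calculation is not shown in this thread, and `midi_tickrate` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed formula: ticks per second = BPM * PPQ / 60, rounded to nearest.
 * PPQ is the MIDI file's "pulses per quarter note" resolution. */
static uint16_t midi_tickrate(uint32_t bpm, uint32_t ppq)
{
    uint32_t rate = (bpm * ppq + 30) / 60;
    assert(rate <= UINT16_MAX); /* must fit a two-byte event payload */
    return (uint16_t)rate;
}
```

Under these assumptions, a tempo of 170 at 96 PPQ gives 272 ticks per second, which no longer fits in one byte.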

Post by Chilly Willy » Wed Jul 19, 2017 6:24 pm

If you really want to allow old MUS players to not choke on your file (ignoring playing it correctly), use one 7/val for when it fits in one byte, and back to back 7's (7/lo/7/hi) for two bytes. That would also be easier to decode. Just always make the first 7 the low byte, then if another 7 comes immediately after, it's the hi byte. That would also make it smaller for when delays DO fit in one byte.
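
The back-to-back scheme could be sketched roughly as follows. This is only an illustration: the descriptor value `0x70` (event 7 on channel 0, no delay flag) is an assumption based on the original MUS layout, delay bytes are ignored, and `tickrate_encode` / `tickrate_decode` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EV_TICKRATE 0x70 /* assumed descriptor: event 7, channel 0 */

/* Writes 7/lo, plus 7/hi only when the rate needs a second byte.
 * out must have room for 4 bytes. Returns the number of bytes written. */
static size_t tickrate_encode(uint16_t rate, uint8_t *out)
{
    size_t n = 0;
    assert(out != NULL);
    out[n++] = EV_TICKRATE;
    out[n++] = (uint8_t)(rate & 0xFF);
    if (rate > 0xFF) {
        out[n++] = EV_TICKRATE;
        out[n++] = (uint8_t)(rate >> 8);
    }
    return n;
}

/* in points at the first event 7. The first value is always the low byte;
 * an immediately following event 7 carries the high byte.
 * Returns bytes consumed, or 0 if the input does not start with event 7. */
static size_t tickrate_decode(const uint8_t *in, size_t len, uint16_t *rate)
{
    if (len < 2 || in[0] != EV_TICKRATE)
        return 0;
    *rate = in[1];
    if (len >= 4 && in[2] == EV_TICKRATE) {
        *rate |= (uint16_t)(in[3] << 8);
        return 4;
    }
    return 2;
}
```

One consequence of this design is that two separate one-byte tickrate changes emitted back to back would be read as a single lo/hi pair, so the converter would have to avoid producing that sequence.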

Post by KillaMaaki » Sat Jul 22, 2017 11:10 am

Hm, is there anything you can think of off the top of your head that might cause my audio, on real hardware, to emit sound, but not the sound I'm expecting or the sound I get in Fusion? Someone who tested my ROM says:

"No sound at all on real hardware.
Actually, I take that back. About one note plays. I thought it was just white noise but it sounds like some sort of instrument."

A couple of notes:

- The samplerate in the bin I sent him was 24 kHz. I have it lowered to 22 kHz in my local code just in case that caused problems (at one point I had tried 32 kHz in Fusion and that just resulted in screeching, so maybe?), though by that time he had already put his 32X away and I didn't want to be a bother.
- For the most part I just followed your tutorial on interrupt-driven DMA PWM audio and cross-referenced with your XM player. I also ended up porting over your voice mixing routine and switched over to signed 8-bit PCM for my samples (whereas before, with my pure C voice mixing, I was using unsigned 8-bit and converting by subtracting 127 and storing in a signed 16-bit integer).
- It works perfectly in Fusion, though it gives me no sound in Gens Plus (then again, neither does your own XM player, so I'm not too worried about that; I assume an issue with Gens Plus and not my own code).
Attachments
OutRom.zip
32X ROM file of music player. Doesn't work properly on real HW apparently, can't figure out why.

Post by Chilly Willy » Sat Jul 22, 2017 4:36 pm

Well, it does work in Fusion, but I also know that certain aspects of Fusion are hardcoded around existing 32X games rather than actually using what you set in the hardware. For example, Fusion assumes the PWM values played are in the range games use, which are centered around a 22 kHz sample rate. Using a different sample rate can result in noise on Fusion depending on how far from 22 kHz the rate is (that was an issue I ran into while testing Wolf32X). Since it's playing clean music in Fusion, you're mixing the audio correctly. I can't tell anything else without seeing the actual code as changing a single instruction can make code fail, depending on what you change.

If I had to guess... maybe you're starting the DMA on the wrong buffer before filling the wrong buffer. Maybe you're not clearing the DMA flags before restarting the DMA - that works on emulation, but not real hardware. Look at my int-driven DMA PWM int handler code closely:

Code:

void slave_dma1_handler(void)
{
    static int32_t which = 1;

    while (MARS_SYS_COMM6 == MIXER_LOCK_MSH2) ; // locked by MSH2

    SH2_DMA_CHCR1; // read TE
    SH2_DMA_CHCR1 = 0; // clear TE

    if (which)
    {
        // start DMA on first buffer and fill second
        SH2_DMA_SAR1 = ((uint32_t)&snd_buffer[0]) | 0x20000000;
        SH2_DMA_TCR1 = num_samples; // number longs
        SH2_DMA_CHCR1 = 0x18E5; // dest fixed, src incr, size long, ext req, dack mem to dev, dack hi, dack edge, dreq rising edge, cycle-steal, dual addr, intr enabled, clear TE, dma enabled

        fill_buffer(&snd_buffer[MAX_NUM_SAMPLES * 2]);
    }
    else
    {
        // start DMA on second buffer and fill first
        SH2_DMA_SAR1 = ((uint32_t)&snd_buffer[MAX_NUM_SAMPLES * 2]) | 0x20000000;
        SH2_DMA_TCR1 = num_samples; // number longs
        SH2_DMA_CHCR1 = 0x18E5; // dest fixed, src incr, size long, ext req, dack mem to dev, dack hi, dack edge, dreq rising edge, cycle-steal, dual addr, intr enabled, clear TE, dma enabled

        fill_buffer(&snd_buffer[0]);
    }

    which ^= 1; // flip audio buffer
}
Be very careful about the 'which' variable - be certain you're starting DMA on the buffer that was PREVIOUSLY filled, then fill the NEXT buffer.

See those two lines before the if?

Code:

    SH2_DMA_CHCR1; // read TE
    SH2_DMA_CHCR1 = 0; // clear TE
Those are MANDATORY!!!! It WILL NOT WORK ON REAL HARDWARE without them.
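
The buffer-ordering rule can be checked with a toy host-side model. This is a simplified sketch, not 32X code: it assumes startup fills the first buffer before the first interrupt, it treats each fill as completing instantly, and the names `pingpong`, `pingpong_init`, and `pingpong_irq` are hypothetical.

```c
#include <assert.h>

enum { BUF_EMPTY = 0, BUF_FILLED = 1, BUF_PLAYING = 2 };

typedef struct {
    int state[2];   /* state of each half of the ping-pong buffer */
    int which;      /* same meaning as 'which' in the handler */
    int bad_starts; /* times DMA was started on a half that was not filled */
} pingpong;

static void pingpong_init(pingpong *p, int which)
{
    p->state[0] = BUF_FILLED; /* assumed: startup fills the first buffer */
    p->state[1] = BUF_EMPTY;
    p->which = which;
    p->bad_starts = 0;
}

/* One DMA-complete interrupt: start DMA on one half, refill the other. */
static void pingpong_irq(pingpong *p)
{
    int play = p->which ? 0 : 1;
    int fill = 1 - play;
    assert(p->which == 0 || p->which == 1);
    if (p->state[play] != BUF_FILLED)
        p->bad_starts++;         /* DMA would play stale or garbage data */
    p->state[play] = BUF_PLAYING;
    p->state[fill] = BUF_FILLED; /* fill modelled as instantaneous */
    p->which ^= 1;               /* flip audio buffer */
}
```

In this model, starting with `which = 1` never starts DMA on an unfilled half, while starting with `which = 0` does so exactly once, on the very first interrupt.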

Post by KillaMaaki » Sat Jul 22, 2017 7:48 pm

I'm definitely reading and clearing TE. I DID have which set to 0 instead of 1, which meant that it would fill the first buffer, then on the very first interrupt it would DMA the second buffer and fill the first buffer again. Although would this really prevent it from generating audio? I've changed it anyway, but even if the second buffer had garbage in it I would think that'd result in a single buffer of garbage audio and then it should start producing audio as normal. In any case, the DMA code looks like I'd expect it to - it starts a DMA from one buffer, and then starts filling the other buffer, using precisely the code you just posted.

One thing: I did play with the max sample and num sample values. num_samples is set to 512, and MAX_NUM_SAMPLES is also 512, so I could reduce the size of my sound buffers and my echo buffer (which is also sized based on MAX_NUM_SAMPLES).

Post by Chilly Willy » Sat Jul 22, 2017 10:14 pm

An old version of the xmplayer had which start as 0, so it would have an issue with the first buffer. As to playing the wrong buffer all the time, you might not notice it if you don't have much processing to do. Remember that int-driven DMA places one buffer, then stops. The interrupt from the DMA causes the handler to start the next buffer, then fill the previous buffer. If you have it doing that in the wrong order, as long as the filling routine stays ahead of the DMA, you'll get proper sound. So you might only notice the very first sample being wrong before all the rest are right. Theoretically, you could mix the first few samples to a single buffer, then start the DMA, then do the rest of the samples. That would allow audio with a single buffer, but you'd need to do at least a sample or two before starting the DMA to avoid hearing bad samples.

If the DMA is actually running, and you're giving it the right buffer, then perhaps the issue is with the PCM to PWM conversion. The maximum PWM value you can use is determined by the period value you calculate. If the period is 1040, the max value is 1040, though I tend to subtract one or two from it. The minimum value is NOT ZERO!! Zero is an invalid PWM value and should never be set. 1 is the minimum PWM value you can use, but I tend to use 2 as my minimum. The center value is (MAX - MIN) / 2 + MIN. So you scale your PCM (signed 2's complement format) for volume, add the center value, and clamp to the minimum and maximum. I usually scale the PCM first so that if it were at max, it would be somewhat larger than PWM can handle, then do volume scale (which can be combined into a single multiply), add center, and then clamp if needed. If you scale the PCM so that it's never larger than MAX and never smaller than MIN, then you can skip the clamping.
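
The scale, center, and clamp steps could look roughly like this. It is a sketch under assumptions: a period of 1040 with two subtracted for the maximum, a minimum of 2, a hypothetical 0-64 volume range, and an arbitrary divisor of 60 chosen so that full-scale PCM slightly overshoots the PWM swing. `pcm_to_pwm` is an illustrative name, not a function from the XM player.

```c
#include <assert.h>
#include <stdint.h>

#define PWM_PERIOD 1040                /* example period from the text */
#define PWM_MIN    2                   /* never 0: zero is not a valid PWM value */
#define PWM_MAX    (PWM_PERIOD - 2)
#define PWM_CENTER ((PWM_MAX - PWM_MIN) / 2 + PWM_MIN)

/* pcm: signed 16-bit sample; volume: 0..64 (hypothetical scale). */
static uint16_t pcm_to_pwm(int16_t pcm, int volume)
{
    int32_t s;
    assert(volume >= 0 && volume <= 64);
    /* volume and range scaling folded into one multiply and one divide */
    s = ((int32_t)pcm * volume) / (64 * 60);
    s += PWM_CENTER;
    if (s < PWM_MIN) s = PWM_MIN;      /* clamp low */
    if (s > PWM_MAX) s = PWM_MAX;      /* clamp high */
    return (uint16_t)s;
}
```

With these numbers the center works out to 520, and silence (PCM 0) always maps to the center regardless of volume.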

Post by KillaMaaki » Tue Jul 25, 2017 6:37 am

This is my PCM to PWM conversion code:

Code:

// convert buffer from s16 pcm samples to u16 pwm samples
for (i = 0; i < num_samples*2; i++)
{
    s16 s = *buffer + SAMPLE_CENTER;
    *buffer++ = (s < SAMPLE_MIN) ? SAMPLE_MIN : (s > SAMPLE_MAX) ? SAMPLE_MAX : s;
}
Where samplerate is currently 22050, SAMPLE_MIN is 2, SAMPLE_CENTER is 517, and SAMPLE_MAX is 1032.

I also tried to rule out the possibility of improper 32X setup. He's using a Model 1 which I've heard can have audio problems with the 32X but most things I've seen have said that using the headphone output instead of the rear output will fix those, and he IS using the 3.5mm headphone output so... I suppose I could try and send him your own XM player build and see if he has problems with that too.
Also taking a look at getting my own 32X. Eyeing a whole set of 32X/cables/Megadrive on Ebay for $50. Don't have a TV for it though, gave all my CRTs to Goodwill when I moved like two houses ago, so I'd also probably have to pick up a cheap upscaler or something. Money's kinda tight though, which makes getting myself a real 32X a challenge.

Post by Chilly Willy » Tue Jul 25, 2017 2:22 pm

Does other 32X homebrew work and sound fine on his system? Let's narrow it down. Send him a rom for plain DMA PWM and Wolf32X and let's see what works and what doesn't for him. If none of it works, he's got the audio connected wrong... and my understanding was the 32X cannot be heard through the CD audio outputs. I've only ever used the audio out from the 32X itself.

Post by KillaMaaki » Wed Jul 26, 2017 5:02 am

OK, so your XM player runs mostly fine on his system - except he described a weird white noise in the background. But otherwise, it produced sound.
However, to my test I also added sound effects played on the A, B, and C buttons. The demo has a bouncing face. When he presses any of those buttons to play a sound, the bouncing face thing locks up (so it seems like the entire game hangs at that point).
I'm wondering if the slave CPU just hangs at some point in my audio stuff and never unlocks the mixer, which would explain the sound effects locking up the main CPU (if the mixer is never unlocked, it'd just be sitting there waiting for the lock). If that's the case, the real question is where the slave CPU is locking up and why.

I'm going to walk through all of the relevant bits of my audio setup. Maybe you can spot an issue here?

So first, I modified the Mars header:

Code:

.ascii  "32X Example     "              /* module name (16 chars) */
.long   0x00000000                      /* version */
.long   __text_end-0x02000000           /* Source (in ROM) */
.long   0x00000000                      /* Destination (in SDRAM) */
.long   __data_size                     /* Size */
.long   mstart                          /* Master SH2 Jump */
.long   sstart                          /* Slave SH2 Jump */
.long   master_vbr                      /* Master SH2 VBR */
.long   slave_vbr                       /* Slave SH2 VBR */
Then, I added a new entry to the slave vector table:

Code:

.long   slav_irq    /* Command interrupt */
.long   slav_irq    /* H Blank interrupt */
.long   slav_irq    /* V Blank interrupt */
.long   slav_irq    /* Reset Button */
.long   slave_dma1  /* DMA1 TE INT */
Which points to this subroutine:

Code:

! Handles DMA interrupts		
slave_dma1:

	! save registers
	sts.l   pr,@-r15
	mov.l   r0,@-r15
	mov.l   r1,@-r15
	mov.l   r2,@-r15
	mov.l   r3,@-r15
	mov.l   r4,@-r15
	mov.l   r5,@-r15
	mov.l   r6,@-r15
	mov.l   r7,@-r15
	
	! call C-side slave_dma1_handler() callback
	mov.l	sd1_handler,r0
	jsr		@r0
	nop
	
	! restore registers
	mov.l   @r15+,r7
	mov.l   @r15+,r6
	mov.l   @r15+,r5
	mov.l   @r15+,r4
	mov.l   @r15+,r3
	mov.l   @r15+,r2
	mov.l   @r15+,r1
	mov.l   @r15+,r0
	lds.l   @r15+,pr

	rte
	nop
	
	.align 2
sd1_handler:
	.long	_slave_dma1_handler
That, in turn, points to this C function:

Code:

void slave_dma1_handler(void)
{
	static u8 which = 1;
	
	while (MARS_SYS_COMM12 == MIXER_LOCK_MSH2) ; // locked by MSH2
	
	SH2_DMA_CHCR1; // read TE
	SH2_DMA_CHCR1 = 0; // clear TE
	
	if (which)
	{
		// start DMA on first buffer and fill second
		SH2_DMA_SAR1 = ((u32)&snd_buffer[0]) | 0x20000000;
		SH2_DMA_TCR1 = num_samples; // number longs
		SH2_DMA_CHCR1 = 0x18E5; // dest fixed, src incr, size long, ext req, dack mem to dev, dack hi, dack edge, dreq rising edge, cycle-steal, dual addr, intr enabled, clear TE, dma enabled

		FillAudioBuffer(&snd_buffer[MAX_NUM_SAMPLES * 2]);
	}
	else
	{
		// start DMA on second buffer and fill first
		SH2_DMA_SAR1 = ((u32)&snd_buffer[MAX_NUM_SAMPLES * 2]) | 0x20000000;
		SH2_DMA_TCR1 = num_samples; // number longs
		SH2_DMA_CHCR1 = 0x18E5; // dest fixed, src incr, size long, ext req, dack mem to dev, dack hi, dack edge, dreq rising edge, cycle-steal, dual addr, intr enabled, clear TE, dma enabled

		FillAudioBuffer(&snd_buffer[0]);
	}
	
	which ^= 1; // flip audio buffer
}
Meanwhile, my slave entry function looks like this:

Code:

void slave(void)
{
	// initialize DMA
	SH2_DMA_SAR0 = 0;
	SH2_DMA_DAR0 = 0;
	SH2_DMA_TCR0 = 0;
	SH2_DMA_CHCR0 = 0;
	SH2_DMA_DRCR0 = 0;
	SH2_DMA_SAR1 = 0;
	SH2_DMA_DAR1 = 0x20004034; // storing a long here will set left and right
	SH2_DMA_TCR1 = 0;
	SH2_DMA_CHCR1 = 0;
	SH2_DMA_DRCR1 = 0;
	SH2_DMA_DMAOR = 1; // enable DMA
	
	SH2_DMA_VCR1 = 72; // set exception vector for DMA channel 1
	SH2_INT_IPRA = (SH2_INT_IPRA & 0xF0FF) | 0x0F00; // set DMA INT to priority 15

	// initialize audio hardware
	InitAudio();

	// initialize mixer
	MARS_SYS_COMM12 = MIXER_UNLOCKED; // sound subsystem running
	FillAudioBuffer(&snd_buffer[0]); // fill first buffer
	slave_dma1_handler(); // start DMA
	
	// set up command port so we can do other stuff on the slave SH2 if necessary
	SetSH2SR(2);
	while (1)
	{
		if (MARS_SYS_COMM14 == SSH2_WAITING)
			continue; // wait for command

		// do command in COMM14
		switch(MARS_SYS_COMM14)
		{
			case SSH2_LOADSONG:
			{
				ssh2_loadSong( MARS_SYS_COMM2 );
			}
			break;
				
			case SSH2_STOPSONG:
			{
				ssh2_stopSong();
			}
			break;
		}

		// done
		MARS_SYS_COMM14 = SSH2_WAITING;
	}
}
And then InitAudio does:

Code:

void InitAudio()
{	
	int i;
	for( i = 0; i < ECHOBUFMAX; i++ )
		echo_buffer[i] = 0;

	// init the sound hardware
	MARS_PWM_MONO = 1;
	MARS_PWM_MONO = 1;
	MARS_PWM_MONO = 1;
	if (MARS_VDP_DISPMODE & MARS_NTSC_FORMAT)
		MARS_PWM_CYCLE = (((23011361 << 1)/SAMPLE_RATE + 1) >> 1) + 1; // for NTSC clock
	else
		MARS_PWM_CYCLE = (((22801467 << 1)/SAMPLE_RATE + 1) >> 1) + 1; // for PAL clock
	MARS_PWM_CTRL = 0x0185; // TM = 1, RTP, RMD = right, LMD = left
	
	// ramp PWM to center to avoid a click on real hardware
	u16 sample = SAMPLE_MIN;
	u16 ix;
	while( sample < SAMPLE_CENTER )
	{
		for( ix = 0; ix < (SAMPLE_RATE*2)/(SAMPLE_CENTER-SAMPLE_MIN); ix++ )
		{
			while( MARS_PWM_MONO & 0x8000 ); // wait until the buffer is not full
			MARS_PWM_MONO = sample;
		}
		sample++;
	}
}
Beyond this, I can only imagine it's somehow my song loading or playback routines that are freezing the slave CPU?
I also tried running in Gens/GS release 7. It produces no sound, though neither does your XM player, despite the release notes claiming that it adds SH2 DMA support for PWM audio. Pressing buttons produces no sound either, but does not hang.
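
As a cross-check on the numbers, the `MARS_PWM_CYCLE` arithmetic from `InitAudio` can be run on a host. This sketch reuses the two clock constants from the code above; `pwm_cycle`, `NTSC_CLOCK`, and `PAL_CLOCK` are hypothetical names introduced only for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NTSC_CLOCK 23011361u
#define PAL_CLOCK  22801467u

/* Same arithmetic as the MARS_PWM_CYCLE lines in InitAudio:
 * clock / rate rounded to nearest, plus one. */
static uint16_t pwm_cycle(uint32_t clock_hz, uint32_t sample_rate)
{
    assert(sample_rate != 0);
    return (uint16_t)((((clock_hz << 1) / sample_rate + 1) >> 1) + 1);
}
```

At 22050 Hz this works out to 1045 on NTSC and 1035 on PAL, so a SAMPLE_MAX of 1032 sits below the cycle value in both cases.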

Post by Chilly Willy » Wed Jul 26, 2017 2:35 pm

You probably have a slightly older version of Gens/GS r7 that only adds polled DMA PWM. I made more changes to it after that for int-driven DMA PWM. If you can compile it, I can send you the code. You need a 32-bit system to compile - I downloaded 32-bit 16.04, but still need to set up a virtual env for it for compiling. There's no way to compile any significant 32-bit code in 64-bit. I spent almost a month trying, and came to the same conclusion as everyone else - the gcc people don't care to get 32-bit compiling beyond tiny projects, which CAN be done via -m32. Anything of any significant size will require a 32-bit env using a 32-bit compiler.

It's possible the guy's 32X is barely functional. Many people needed to have Sega techs make changes to their Genesis to get their 32X to work. There's a list of changes in the 32X manuals that were scanned a few years back, like the service manual. For example, on the Model 1 VA6.0, 6.5, and 6.8, you need to remove C78 and jumper across it. That cleans up the VCLK enough to make the 32X work right.

Post by KillaMaaki » Wed Jul 26, 2017 8:44 pm

Hm, part of me really thinks it's unlikely that my demo not running would be due to faulty hardware, as I'm not doing much beyond what the XM player does. An issue that affected mine would theoretically affect the XM player too, wouldn't it? But then again I suppose hardware faults can be a bit less predictable than software issues. I'll see if I can get someone else to test on their 32X, but I have a sneaking suspicion I'm not going to get different results.

Post by Chilly Willy » Wed Jul 26, 2017 9:26 pm

Yeah, it's probably the program, but in the little you've posted, I don't see anything offhand that would cause a problem on real hardware while working on an emulator.
