Why's mah assembly so gosh-darn slow?

Ask anything you want about 32X Mushroom programming.

Moderator: BigEvilCorporation

KillaMaaki
Very interested
Posts: 84
Joined: Sat Feb 28, 2015 9:22 pm

Post by KillaMaaki » Mon Jul 03, 2017 4:24 am

So I continue to revisit my passing interest in these things, and today I visit an interesting question almost certainly born out of my complete and utter incompetence with assembly.

I wanted to write some sprite blitting routines on the 32X. I first coded my routine in C, and it looked like this (note that I'm using an 8-bit paletted framebuffer):

Code:

// draws a sprite to the screen
inline void drawSprite( u8* sprite, int x, int y, int w, int h )
{
	// get fb pointer
	volatile u8 *fb = (volatile u8 *)&MARS_FRAMEBUFFER;

	// deal with sprites partially intersecting screen edges
	u8 hwritebegin = 0;
	if( x < 0 )
	{
		hwritebegin = -x;
		x = 0;
	}
		
	u8 hwritelength = w - hwritebegin;
	if( ( x + hwritelength ) >= DISPLAY_WIDTH )
		hwritelength = DISPLAY_WIDTH - x;
		
	u8 vwritebegin = 0;
	if( y < 0 )
	{
		vwritebegin = -y;
		y = 0;
	}
		
	u8 vwritelength = h - vwritebegin;
	if( ( y + vwritelength ) >= DISPLAY_HEIGHT )
		vwritelength = DISPLAY_HEIGHT - y;
	
	// pointer to the sprite's top left corner in the framebuffer
	int vram_ptr;
	vram_ptr = 0x200 + x;
	vram_ptr += (y * DISPLAY_WIDTH);
	
	// pointer to sprite pixel
	int sprite_ptr;
	sprite_ptr = ( vwritebegin * w ) + hwritebegin;
	
	u8 half_hwritelength = hwritelength / 4;
	
	fb += vram_ptr;
	sprite += sprite_ptr;
	
	// memcpy sprite line-by-line into frame buffer
	u8 v;
	for( v = 0; v < vwritelength; v++ )
	{
		// memcpy pixel row into framebuffer
		memcpy( fb, sprite, hwritelength );
		
		// increment vram_ptr to next fb row
		fb += DISPLAY_WIDTH;
		
		// increment sprite_ptr to next img row
		sprite += w;
	}
}
And this works relatively well. I can get around 128 16x16 sprites at a stable 60 FPS; much more than that and it begins to slow down. I narrowed the bulk of the cost down to the memcpy. That isn't surprising, but it's good to know that the upfront work being done per-sprite isn't contributing much to performance. I verified this with a test where I only rendered 16 sprites but did the rest of the work, minus just the memcpy, for 1024 sprites.
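For reference, the clipping part of that routine can be pulled out into its own little function. This is only a sketch, not the code I'm actually running: `ClipRect` and `clip_sprite` are made-up names, the 224-line display height is an assumption, and it uses plain ints so that a sprite sitting completely off screen gets rejected instead of wrapping around in a u8.

```c
#include <stdbool.h>

#define DISPLAY_WIDTH  320
#define DISPLAY_HEIGHT 224  /* assumption: 224-line display mode */

/* Hypothetical helper type: the visible part of a sprite after clipping. */
typedef struct {
    int src_x, src_y;  /* first visible pixel inside the sprite */
    int dst_x, dst_y;  /* where that pixel lands on screen */
    int w, h;          /* visible width and height in pixels */
} ClipRect;

/* Clips a w x h sprite placed at (x, y) against the screen.
   Returns false when no part of the sprite is visible. */
static bool clip_sprite(int x, int y, int w, int h, ClipRect *out)
{
    int x0 = x < 0 ? 0 : x;
    int y0 = y < 0 ? 0 : y;
    int x1 = (x + w > DISPLAY_WIDTH)  ? DISPLAY_WIDTH  : x + w;
    int y1 = (y + h > DISPLAY_HEIGHT) ? DISPLAY_HEIGHT : y + h;

    if (x1 <= x0 || y1 <= y0)
        return false;  /* fully off screen, nothing to draw */

    out->src_x = x0 - x;
    out->src_y = y0 - y;
    out->dst_x = x0;
    out->dst_y = y0;
    out->w = x1 - x0;
    out->h = y1 - y0;
    return true;
}
```

With a result in hand, the row loop would start reading at `sprite + src_y * w + src_x` and writing at `fb + 0x200 + dst_y * DISPLAY_WIDTH + dst_x`, copying `out->w` bytes for `out->h` rows.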

So I thought to myself, "This might even be good enough for a game, coupled with writing some code to utilize the Genny's hardware planes for tilemapping, but just because I can I want to see if I can implement this in assembly".
My approach ended up being to continue doing all of the upfront work in C, but reimplement that memcpy loop in assembly. That turned out to look like this:

Code:

! void wtf( u8* spritePtr, u8* fbPtr, int blitW, int blitH, int blitStep )
.align 4
.global _wtf
_wtf:
! {
	! On entry: r4 = spritePtr, r5 = fbPtr, r6 = blitW, r7 = blitH, blitStep pushed onto stack
	! copy blitStep into r0
	mov.l 	@r15,r0
	
	! push r8, r9, and r10 onto stack (we're going to use them as a scratch, so we need to save them)
	mov.l	r8,@-r15
	mov.l	r9,@-r15
	mov.l	r10,@-r15

	! initialize r8 as framebuffer pitch so we can sum with fbPtr in the copy loop
	mov.w	fb_pitch,r8
	
	! initialize r1 as rows to copy
	mov		r7,r1
		
	! copy scanline loop
	wtf__copy_scan_loop:
	! {
		! copy number of bytes from spritePtr to fbPtr
		! this should be equivalent to a standard memcpy in terms of perf, right?
		
		! memory copy loop
		! initialize r2 as length to copy
		mov		r6,r2
		! copy sprite and fb pointers for iteration
		mov		r4,r9
		mov		r5,r10
		wtf__memcpy_loop:
		!{
			! load from spritePtr into r3, increment spritePtr
			mov.b	@r9+,r3
			
			! store from r3 to fbPtr
			mov.b	r3,@r10
			
			! loop check
			dt		r2
			bf/s		wtf__memcpy_loop
			add		#1,r10 ! increment fbPtr
		!} // wtf__memcpy_loop
		
		! finished copying a row of pixels
		add		r8,r5	! move framebuffer pointer to next scanline by adding fb_pitch
		
		! loop check
		dt		r1		! decrement and compare scan iterator
		bf/s		wtf__copy_scan_loop
		add		r0,r4	! move sprite pointer to next sprite pixel row by adding blitStep
	! } // wtf__copy_scan_loop
		
	! restore r8, r9, and r10 off of stack
	mov.l	@r15+,r10
	mov.l	@r15+,r9
	mov.l	@r15+,r8

	rts
	nop
	
! framebuffer pitch in bytes, used for incrementing framebuffer pointer by one scanline
fb_pitch:
	.word   320
! } // _wtf
And then the C code replaces that loop with a call to:

Code:

wtf( sprite, fb, hwritelength, vwritelength, w );
Now, this works just the same as my C code did, with one major difference: it's really goddamn slow, at least in comparison.
I'm absolutely convinced there's just something stupid I'm doing, but I'd love to get some insights into how my C code is getting noticeably better performance over my hand-written assembly.
I also tried to see if it was the fact that I nested my externed function inside another function, but splitting off the original pure-C loop into a separate function with a similar signature and calling that instead still yields the performance of the original C code, much faster than my assembly.

EDIT: Hm, actually I might have a clue. Noticed that my pure C memcpy-based version actually doesn't seem to do the zero-byte-ignore copy thing quite as I'd expect. I'm using that feature to my advantage so that zero means transparent, but my memcpy code results in odd flickering when sprites cross. My assembly-based version, on the other hand, appears to be working precisely as expected, with no flickering. I wonder if that means memcpy is trying to do some sort of optimization involving not just straight copying byte-by-byte. That could explain both the artifacts and the performance difference.

EDIT 2: OK, so I did make an optimization to my assembly and this makes it actually a little bit more performant than my C code. I made it copy entire longs at a time into the framebuffer instead of copying byte-by-byte (so now adds the requirement that sprite width is a multiple of 4), and then modified the calling code so that it copies into the overwrite buffer to preserve the zero-byte-ignore behavior. I assume memcpy is trying to do some kind of similar optimization, but the speedup is slightly less reliable than my asm code (which I guess would make sense if memcpy was doing something like checking if what remains to be copied can be copied in a long or word or etc, whereas mine just blasts through using longs with no checks)

EDIT 3: OK is it normal for stuff like this to look SERIOUSLY GODDAMN CHOPPY on Gens? Good lord. It's not so much a framerate problem as far as I can tell as much as stuff looks like it's aligned on a 2 pixel boundary or something. They jump pixel positions like nobody's business. Works perfectly in Fusion though.

EDIT 4: Also caught an issue with that optimization: objects partially overlapping the left or right screen edge end up with a copy width that isn't a multiple of 4 bytes, so sometimes it'd just crash the game. So now it branches between two different methods (one a per-byte copy, one a per-long copy), using the per-byte one whenever the horizontal copy length is not a multiple of 4 bytes. That also removes the multiple-of-4-pixels restriction, since a sprite whose width isn't a multiple of 4 simply switches over to the per-byte copy.
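In C terms, the idea is roughly the following sketch. It is not a translation of my asm: all of the names are invented for illustration, it makes the byte-or-long decision once per sprite rather than once per row, and it also checks pointer and pitch alignment, which is an extra assumption on my part beyond the width check described above.

```c
#include <stdint.h>
#include <string.h>

typedef void (*RowCopyFn)(uint8_t *dst, const uint8_t *src, int w);

/* Copies one row a byte at a time; works for any width and alignment. */
static void copy_row_bytes(uint8_t *dst, const uint8_t *src, int w)
{
    for (int i = 0; i < w; i++)
        dst[i] = src[i];
}

/* Copies one row four bytes at a time; w must be a multiple of 4.
   The fixed-size memcpy keeps this portable C; an optimizing compiler
   will typically turn each one into a single long move. */
static void copy_row_longs(uint8_t *dst, const uint8_t *src, int w)
{
    for (int i = 0; i < w; i += 4) {
        uint32_t v;
        memcpy(&v, src + i, 4);
        memcpy(dst + i, &v, 4);
    }
}

/* Picks the long copy only when every row will start on a 4-byte
   boundary and the width is a multiple of 4. */
static RowCopyFn pick_row_copy(const uint8_t *dst, int dst_pitch,
                               const uint8_t *src, int src_pitch, int w)
{
    int ok = ((uintptr_t)dst % 4 == 0) && ((uintptr_t)src % 4 == 0) &&
             (dst_pitch % 4 == 0) && (src_pitch % 4 == 0) && (w % 4 == 0);
    return ok ? copy_row_longs : copy_row_bytes;
}

/* Copies h rows of w bytes, choosing the row routine once up front. */
static void blit(uint8_t *dst, int dst_pitch,
                 const uint8_t *src, int src_pitch, int w, int h)
{
    RowCopyFn copy = pick_row_copy(dst, dst_pitch, src, src_pitch, w);
    for (int row = 0; row < h; row++) {
        copy(dst, src, w);
        dst += dst_pitch;
        src += src_pitch;
    }
}
```

Hoisting the choice out of the row loop means the inner loops contain no mode test at all, which is the part I suspect matters most.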

My new asm looks like this. I'm like 99% certain I've made these branches way less than optimal.

Code:

! // Blits a sprite into the framebuffer. Source data width must be a multiple of 4 bytes unless byteCopy is TRUE
! void GFX_BlitSprite( u8* spritePtr, u8* fbPtr, int blitW, int blitH, int blitStep, int byteCopy )
.align 4
.global _GFX_BlitSprite
_GFX_BlitSprite:
! {
	! On entry: r4 = spritePtr, r5 = fbPtr, r6 = blitW, r7 = blitH, blitStep pushed onto stack, byteCopy pushed onto stack
	! copy blitStep into r0
	mov.l 	@r15,r0
	
	! copy byteCopy into r1
	mov.l	@(4,r15),r1
	
	! push r8, r9, r10, and r11 onto stack (we're going to use them as a scratch, so we need to save them)
	mov.l	r8,@-r15
	mov.l	r9,@-r15
	mov.l	r10,@-r15
	mov.l	r11,@-r15
	
	! initialize r8 as framebuffer pitch
	mov.w	fb_pitch,r8
	
	! initialize r11 as rows to copy
	mov		r7,r11
		
	! copy scanline loop
	0:
	! {

		! copy number of bytes from spritePtr to fbPtr
		
		! memory copy loop
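		! initialize r2 as the loop count for this row
		! (note: the long loop below decrements once per 4-byte store, so with byteCopy == 0
		! the caller is assumed to pass blitW as a count of longs rather than bytes)
		mov		r6,r2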
		! copy sprite and fb pointers for iteration
		mov		r4,r9
		mov		r5,r10
		
		! if byteCopy is 0 we can use a faster copy operation
		! otherwise, resort to manual byte copy
		
		cmp/pl	r1
		bf/s		skip1
		nop
		
		1:
		!{
			! load from spritePtr into r3, increment spritePtr
			mov.b	@r9+,r3
			
			! store from r3 to fbPtr
			mov.b	r3,@r10
			
			! loop check
			dt		r2
			bf/s		1b
			add		#1,r10
		!}
		
		bra skip2
		nop
		
		skip1:
		2:
		!{
			! load from spritePtr into r3, increment spritePtr
			mov.l	@r9+,r3
			
			! store from r3 to fbPtr
			mov.l	r3,@r10
			
			! loop check
			dt		r2
			bf/s		2b
			add		#4,r10
		!}
		
		skip2:
		! finished copying a row of pixels
		add		r8,r5	! move framebuffer pointer to next scanline by adding fb_pitch
		
		! loop check
		dt		r11		! decrement and compare scan iterator
		bf/s		0b
		add		r0,r4	! move sprite pointer to next sprite pixel row by adding blitStep
	! }
		
	! restore r8, r9, r10, and r11 off of stack
	mov.l	@r15+,r11
	mov.l	@r15+,r10
	mov.l	@r15+,r9
	mov.l	@r15+,r8

	rts
	nop
	
fb_pitch:
	.word   320
! } // _GFX_BlitSprite
I'd also be super curious about how to get the 32X interacting with the MD's tileplanes. I see the command for it in ChillyWilly's stuff and even added my own extensions for interacting with either A or B plane but the idea of one tile at a time as commands sent to an MD-side program really bugs me. Is there a better way to get tile data over to the MD side? Esp. if I want to do screen scrolling for levels much larger than screen size?

Chilly Willy
Very interested
Posts: 2984
Joined: Fri Aug 17, 2007 9:33 pm

Post by Chilly Willy » Mon Jul 03, 2017 6:21 pm

Seems you figured out much of your speed issues. Yes, writing a byte at a time will slow things down compared to writing a word or long when possible. memcpy tries to copy as fast as possible, but so does gcc 7. For example, look at this loop

Code:

            for ( ; h; h--, sp+=src->s-w, dp+=dst->s-w)
                for (x=0; x<w; x++)
                    *dp++ = *sp++;
where the pointers are byte pointers. gcc 7.1 will optimize that loop to write longs when it can. That can lead to the other issue you noticed: when you write bytes of 0 to the frame buffer, they're ignored. Byte writes behave like writes to the overwrite area, even though you're not writing to the overwrite area. However, once you start writing words or longs, this doesn't happen. To keep bytes of 0 from being written when writing as words or longs, you need to write to the overwrite area instead. So you do this:

Code:

            dp += 0x00020000;
            for ( ; h; h--, sp+=src->s-w, dp+=dst->s-w)
                for (x=0; x<w; x++)
                    *dp++ = *sp++;
If you aren't using gcc 7.1, I'd suggest updating your toolchain. :D
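To make the difference concrete, here is a tiny host-side model of the two behaviours for a long write. It is only a model of what's described above, not 32X code, and `plain_long` and `overwrite_long` are names made up for this sketch.

```c
#include <stdint.h>

/* Model of a long write to the normal frame buffer area:
   all four bytes land, zeros included. Bytes are stored in
   big-endian order, as the SH2 would store them. */
static void plain_long(uint8_t *dst, uint32_t value)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (uint8_t)(value >> (24 - 8 * i));
}

/* Model of a long write to the overwrite area:
   any byte that is 0 leaves the existing pixel untouched. */
static void overwrite_long(uint8_t *dst, uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(value >> (24 - 8 * i));
        if (b != 0)
            dst[i] = b;
    }
}
```

So four pixels 1, 2, 3, 4 hit with 0xAA00BB00 become AA, 2, BB, 4 through the overwrite model, but AA, 0, BB, 0 through the plain one, which is where flickering holes between crossing sprites would come from.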

As to faster tile handling on the MD side... do it all on the MD side. :wink: My normal 32X MD code is a small loop that sits in the 68000 work ram and watches for a few small commands from the 32X side, as well as reading the pads and updating a vblank count. That doesn't mean it's the ONLY thing you can do on the MD side. You can do anything a normal MD program can do. I would suggest trying to spend as little time running code in the rom or transferring data from the rom during the game as 68000 accesses to rom hold the 32X side off for quite some time. That's why I put the main loop code into work ram - so that the MD side does nothing that would halt the 32X side. So if you want to transfer tiles from rom to the MD VDP, put it between levels when you can, when a slow-down doesn't matter. If you need the MD side accessing the rom more than a little, try to avoid accessing the rom from the 32X side. It's only when both sides try to access the rom that conflicts occur.

Post by KillaMaaki » Mon Jul 03, 2017 8:21 pm

Oooh I see what you're saying. So I guess the idea would be I could just have the 32X side issue a command to the MD side which just tells it to load a level, passing level index or w/e. The MD side then finds that in ROM and copies all of the tilemap data it needs into RAM, then handles all of the scrolling/loading/VDP stuff.
Anything special I need to consider about the MD side having pointers into the ROM?

EDIT: Or I guess let me rephrase... I'm reading this doc http://devster.monkeeh.com/sega/32xguide1.txt and finding it a bit difficult to mentally parse the memory map section of it lol.
It says that 0x880000 - 0x8FFFFF is cartridge and then in parentheses "appears as 0x000000 - 0x07FFFF". Does that mean that, from the 68k side, an address of 0x880000 will read from 0x000000 on the cartridge? So if I had a pointer to a location in the ROM where a level was stored relative to 0x000000, the 68k would have to treat that pointer as being an offset from 0x880000?
And also according to that doc, 0x900000 - 0x9FFFFF is similar except that 32X register 0x04+0xA15100 can be used to select up to four different consecutive banks. Theoretically (and I don't know that I need to worry about this yet but I still like to think things through), if I had a cartridge larger than 4 megabits, I'd have to make use of that bankswitching register in order to access that ROM from the 68k right? How would loading ROM data in that scenario work? Would it be possible/a good idea if that were the case to put all 68k-required data, like levels, in a specific bank so that the 68k doesn't have to do anything complex in order to access it? Or would it be better to code in a copy loop which is capable of automatically bankswitching when necessary? Something like:

- Take the upper nibble of the 3-byte address and set to register 0x04 (if upper nibble is 0x0, 0x04 is 0x00, if upper nibble is 0x1, 0x04 is 0x01, etc)
- Then take the lower 20 bits (five hex digits) of that address, since each bank is 1 MB, add them to 0x900000, and retrieve data.
- Maybe do something like figure out what bank boundary something lies on if any so it can be smart about when it bankswitches (set bankswitch, do copy loop, set bankswitch, do another copy loop)? Assembly is really not my forte though... T u T
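If I'm reading the doc right, the address translation on its own would look something like this host-side C sketch. `RomMapping` and `map_rom_address` are names I made up, and the whole thing assumes the memory map is as that doc describes it (a fixed 512 KB window at 0x880000 plus a 1 MB banked window at 0x900000).

```c
#include <stdint.h>

/* Hypothetical result type for the translation. */
typedef struct {
    uint32_t addr68k;  /* address the 68k should read from */
    int      bank;     /* value for the bank register, or -1 if no bank switch is needed */
} RomMapping;

/* Translates a 0x000000-relative ROM address into a 68k-side address. */
static RomMapping map_rom_address(uint32_t rom_addr)
{
    RomMapping m;
    if (rom_addr < 0x080000) {
        /* first 512 KB is always visible through the fixed window */
        m.addr68k = 0x880000 + rom_addr;
        m.bank = -1;
    } else {
        /* everything else goes through the 1 MB banked window */
        m.bank = (int)(rom_addr >> 20);
        m.addr68k = 0x900000 + (rom_addr & 0x0FFFFF);
    }
    return m;
}
```

For example, ROM address 0x123456 would come out as bank 1 and 68k address 0x923456.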

EDIT 2:
Maybe a copy loop could look something like this (C-ish pseudo-code so I can hash out what the algorithm looks like before I try to implement it) and still deal properly with bankswitching and translating 0x000000-relative ROM addresses:

Code:

void romCopy( u32 source, u32 dest, u32 length )
{
	u32 currentBank = source >> 20;
	u32 bankStart = currentBank << 20;
	volatile u8 *src = (volatile u8 *)( ( source - bankStart ) + 0x900000 );
	u8 *dst = (u8 *)dest;
	
	setBank( currentBank ); // sets register at 0x04 + 0xA15100 to value
	
	while( length > 0 )
	{
		*dst++ = *src++; // copy a byte from one address to the other and increment both
		length--;
		
		// if we step past the end of the bank window (its last byte is 0x9FFFFF), increment bank, set bankswitch register, and reset address to beginning of the window
		if( src == (volatile u8 *)0xA00000 )
		{
			setBank( ++currentBank );
			src = (volatile u8 *)0x900000;
		}
	}
}

Post by KillaMaaki » Tue Jul 04, 2017 9:45 am

OK, so I got my 68k-side ROM copy loop up and running :D Took a lot of mental gymnastics and dumping random crap to main 68k RAM for debugging lol.
But hey, at least it works and as far as I can tell should automatically handle bankswitching seamlessly. That is, aside from one assumption I'm making and I'm unsure if it's correct - that assumption being that setting the bankswitch register immediately makes the next bank of memory available without the need for nops or anything. This appears to be the case in an emulator based on some limited testing, but ofc that may have nothing to do with how actual hardware behaves ;)

Anyway, basically ended up with this function:

Code:

| void rom_copy( u32 source, u32 dest, u32 length )
| Copies bytes from 32X ROM address into 68k RAM address, handling bankswitching as necessary
rom_copy:
|{
	| save modified regs first
	move.l	d2,-(sp)
	move.l	d3,-(sp)
	move.l	d4,-(sp)
	
	| get source, dest, and length off of stack into d0, a1, and d1
	move.l	16(sp),d0
	move.l	20(sp),a1
	move.l	24(sp),d1
	
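	| decrement count so it doesn't copy an extra byte
	| (the loop instruction below counts with the low 16 bits of d1, so length is limited to 65536 bytes)
	subq.w	#1,d1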
	
	| d4 is set to 20, used for shift instructions later
	move.l	#20,d4
	
	| d2 is current bank, obtained from the top nibble of the 24-bit address
	move.l	d0,d2
	lsr.l	d4,d2
	
	| d3 is current bank's start address
	move.l	d2,d3
	lsl.l	d4,d3
	
	| set d0 to current address then copy into a0
	| could I do this on a0 directly I wonder?
	sub.l	d3,d0			| make address relative to the beginning of the current bank
	add.l	#0x900000,d0 	| accounts for 68k's 32X memory map. 0x900000 on the 68k = ( bank * 0x100000 ) in the cartridge's ROM
	move.l	d0,a0			| copy to a0, which becomes our source address iterator
	
	| set the bankswitch register to current bank
	move.w	d2,0xA15104
	
	| copy bytes
	rom_copy_1:
	|{
		| copy a byte from source (32X ROM) to destination (68K RAM). Increment both pointers.
		move.b	(a0)+,(a1)+
		
		| if we've stepped past the last byte of the bank window (0x9FFFFF), we need to reset our address and switch to the next bank
		cmp.l	#0xA00000,a0
		bne.b	rom_copy_2
		|{
			| reset source address iterator
			move.l	#0x900000,a0
			| increment current bank
			addq.w	#1,d2
			| set the bankswitch register to the new value, then fall through so the byte just copied is still counted
			move.w	d2,0xA15104
		|}
		rom_copy_2:
	|}
	| decrement the byte counter and loop until it runs out
	dbra	d1,rom_copy_1
	
	| restore modified regs
	move.l	(sp)+,d4
	move.l	(sp)+,d3
	move.l	(sp)+,d2
	
	rts
|}
Tested by adding a new copy test command that takes a long pointer as the parameter (this pointer is passed from the 32X side as a pointer to the ROM location of a label preceding a bunch of manually-typed bytes for testing, but in a game would probably be an included bin file containing resources). The command just copies a hardcoded number of bytes directly into the start of the 68K's main RAM. Running in Gens+ and then inspecting the 68K's memory yields the precise same sequence of values pasted into main RAM as were present in the file.

Now to start working on tossing some tilemaps together :)

Post by Chilly Willy » Tue Jul 04, 2017 8:35 pm

You can use VDP DMA like normal, just set the RV bit before starting the DMA, then clear it once it's done. Setting RV makes the ROM temporarily appear at the "normal" place, 0 to 4M. However, if either SH2 tries to access the rom, it will halt until RV is cleared. If you're loading a level, that doesn't matter, so it's not a big deal. It IS more of a deal if you're trying to load tiles on the fly during a level, in which case copying the data with the CPU like you're doing would maybe be better.

Remember that if you read/write save ram, or write the mapper registers (for roms > 4MB), you also need to set RV before doing so.

Post by KillaMaaki » Tue Jul 04, 2017 9:13 pm

See, my thinking was that my level would be made of 16x16px tiles, which of course isn't VDP friendly so the MegaDrive would copy in the whole level's tilemap data over to its own RAM, and then "decompress" it into a more VDP-friendly 8x8px format. And then from there, my 68K driver would handle loading in new columns and rows of tiles as the 32X issues scroll commands, but all of the data is already in the 68K's RAM so that it doesn't have to get anything off of the ROM (so it'd do DMA copies out of its own RAM into the VDP to implement the virtual scrolling of a large map).

Good tip on the RV flag. I am setting the bankswitch register so I guess I should be setting and clearing the RV flag as well? (though the emulator doesn't seem to care about this requirement, interestingly)

Post by Chilly Willy » Wed Jul 05, 2017 3:31 pm

You don't need to set RV to change banks or access the rom with the CPU through the 32X bank areas. Those areas go through the 32X IO chip, and when there is a bus conflict, the IO chip arbitrates it (the SH2s win in this case). RV is only needed when you need to access the cart (or hardware on the cart) where the thing being accessed cannot respond to the 32X IO chip bus arbitrator. If the 32X can't hold it off, you MUST set RV. Good example - the MD VDP. It assumes that once it has the bus (using the 68000 bus grant lines) that it can keep it forever. If you tried to DMA from the 32X rom bank area, any time an SH2 accessed rom, the 32X IO chip would allow the SH2 to do a bus cycle while holding off the 68000... which is already being held off by the VDP while the VDP does its own bus cycles. So you'd get garbage for the DMA cycle... and possibly cause the SH2 to get garbage on its cycle, too.

Post by KillaMaaki » Wed Jul 05, 2017 8:55 pm

Ahh gotcha. So I don't need it in this case because I'm not DMAing anything off of the ROM at all (EDIT: Not yet anyway. I think I'll probably end up DMAing tilesets off of the cart though since I only have to do it on level load and that may be easier than having to round-trip it through the MD's RAM, also since the MD is the only one that has to be concerned about drawing tilemaps I can preprocess my tilesets into a VDP ready format)

In other news got my map loader up and running (expanding a 16x16 tile format into an 8x8 cell format), so now to work on DMAing that into the VDP...
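The expansion itself is simple enough to sketch in C (my real loader is 68K asm, and this isn't a translation of it). `expand_map` is a made-up name, and the sketch assumes each 16x16 tile owns four consecutive 8x8 patterns stored in the order top-left, top-right, bottom-left, bottom-right, which may not match how any particular tileset is laid out.

```c
#include <stdint.h>

/* Expands a tw x th map of 16x16 tile indices into a (tw*2) x (th*2)
   map of 8x8 cell indices, one byte per cell.
   ASSUMPTION: tile t uses patterns t*4+0 (TL), t*4+1 (TR), t*4+2 (BL),
   t*4+3 (BR). With one byte per cell this limits a tileset to 64 tiles. */
static void expand_map(const uint8_t *tiles, int tw, int th, uint8_t *cells)
{
    int cw = tw * 2;  /* width of the output map in cells */

    for (int ty = 0; ty < th; ty++) {
        for (int tx = 0; tx < tw; tx++) {
            uint8_t base = (uint8_t)(tiles[ty * tw + tx] * 4);
            uint8_t *out = &cells[(ty * 2) * cw + tx * 2];

            out[0]      = base;                  /* top-left */
            out[1]      = (uint8_t)(base + 1);   /* top-right */
            out[cw]     = (uint8_t)(base + 2);   /* bottom-left */
            out[cw + 1] = (uint8_t)(base + 3);   /* bottom-right */
        }
    }
}
```

A two-tile row containing tiles 0 and 5 would expand to cells 0, 1, 20, 21 on the first cell row and 2, 3, 22, 23 on the second.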

Post by Chilly Willy » Wed Jul 05, 2017 10:39 pm

You might actually compress the tiles in rom and decompress directly to work ram. There's a number of good compressors here for that. It would save you room - not that that's much of an issue anymore. :D

Post by KillaMaaki » Thu Jul 06, 2017 6:07 am

Yeah I think for now I'm going to choose not to worry about that :)
That said, I'm investigating DMAing portions of work RAM into the VDP on the MD side, and I'm finding now that I have no earthly clue how DMA works and can't seem to find any good resources on what to do and (perhaps more importantly) why.

So for example, copying a row of cells into the VDP. I've got code already which grabs cell indices out of the map data into a temporary buffer (map in RAM is one byte tile index per cell, copy loop expands this into a word which also contains the other metadata bits like priority and palette and places in the temp buffer). So I've got a pointer to the temporary table, and the number of words in that buffer. Now I'd like to DMA this temporary buffer into one of the planes in the VDP. This is what's giving me a headache, as I can't seem to mentally parse any of the examples I've seen. Going to keep crawling docs and see what I can find though...
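For what it's worth, the byte-to-word expansion step is the easy part. As a rough C sketch (the real thing is 68K asm; `make_cell` and `expand_row` are names invented here, and applying one palette and priority to a whole row is a simplification for illustration):

```c
#include <stdint.h>

/* Builds one plane entry: bit 15 = priority, bits 14-13 = palette,
   bits 10-0 = tile index. The flip bits (12 and 11) are left clear. */
static uint16_t make_cell(uint8_t tile, unsigned palette, int priority)
{
    return (uint16_t)((priority ? 0x8000u : 0u) |
                      ((palette & 3u) << 13) |
                      tile);
}

/* Expands count one-byte tile indices into plane words in the temp buffer. */
static void expand_row(const uint8_t *map_row, uint16_t *out, int count,
                       unsigned palette, int priority)
{
    for (int i = 0; i < count; i++)
        out[i] = make_cell(map_row[i], palette, priority);
}
```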

EDIT Well OK I'm starting to see the light. Just now got a hold of how writing to VDP registers works lol (for example, what writing 0x8174 to the VDP control port actually means. Just to verify that I'm right, that would select Genesis display mode, DMA enable, vertical interrupt enable, and display enable. correct? just so I know whether I'm correct in this.)
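Spelled out as a small C sketch of how I'm reading it (a host-side illustration only; the struct, function, and enum names are made up for this post):

```c
#include <stdint.h>

/* Hypothetical decoded form of a register-set word written to the control port. */
typedef struct {
    int     reg;   /* register number */
    uint8_t data;  /* value written to that register */
} VdpRegWrite;

/* A register-set word is laid out as 100R RRRR DDDD DDDD. */
static VdpRegWrite decode_reg_write(uint16_t word)
{
    VdpRegWrite w;
    w.reg  = (word >> 8) & 0x1F;
    w.data = (uint8_t)(word & 0xFF);
    return w;
}

/* Bits of register #1 that make up 0x74. */
enum {
    VDP_R1_MODE5   = 0x04,  /* bit 2: Genesis display mode */
    VDP_R1_DMA     = 0x10,  /* bit 4: DMA enable */
    VDP_R1_VINT    = 0x20,  /* bit 5: vertical interrupt enable */
    VDP_R1_DISPLAY = 0x40   /* bit 6: display enable */
};
```

So 0x8174 decodes to register 1 with data 0x74, and 0x74 is exactly those four bits ORed together.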

EDIT 2 OK I think I've got it. About a million more mental gymnastics later, I think this code should work, right? (also I stuck all of the registers into symbols in another file because I prefer named things over magic numbers. Each of those is basically just 0x8000, 0x8100, 0x8200, etc)

Code:

| d0 contains address to copy from (pointer to start of temp buffer)
| d1 contains cells to copy (each cell is 1 word)
| d2 contains offset from start of plane A

| Lower byte is data to write to register.
move.w	#(VDP_REG_15 + 02)	,VDP_CTRL 	| write 2 to register 15 (increment)
move.w	#(VDP_REG_1 + 0x74)	,VDP_CTRL 	| set mode ( 0x74 sets bits 2, 4, 5, and 6. selects genesis display mode, dma enable, vint enable, display enable )

| Set VDP register #19 to words to copy (don't need to copy more than 255 words)
add.w	#VDP_REG_19,d1
move.w	d1,VDP_CTRL				| register #19 is low byte of copy length.
move.w	#VDP_REG_20,VDP_CTRL 	| register #20 is high byte of copy length. a row copy should never exceed 255 so we just set it to zero.

| Now setup address to copy from.
| Actually we only care about low and mid bytes. High byte is always FF 'cause we're just reading from RAM.
| First we mask d0 to only get low and mid bytes, then copy into d1.
and.l	#0x00FFFF,d0
move.l	d0,d1
| now mask d1 to get rid of mid byte.
and.w	#0x00FF,d1
| now write d1 to register.
add.w	#VDP_REG_21,d1
move.w	d1,VDP_CTRL
| now shift d0's mid byte over.
lsr.w	#8,d1
| and write d0 to the register
add.w	#VDP_REG_22,d0
| and finally just write FF as the high byte. That puts the address in work RAM ( 0xFF0000 - 0xFFFFFF )
move.w	#(VDP_REG_22 + 0xFF) ,VDP_CTRL

| move address of nametable A into d0
move.l	#0x40000083,d0	| high word has bit 14 set, to enable vram write. low word has bit 7 set (memory-to-vram), and top two bits of the address are in 0-1
						| those two bits point the address at 0xC000, the start of nametable A. the remaining 14 bits have to be added to the high word.
| now shift destOffset into the upper word and add it to d0
lsl.l	#8,d2 | todo: this is a shitty way to shift more than 8 bits.
lsl.l	#8,d2
add.l	d2,d0

| and finally, write that into the control port. this triggers the DMA operation.
move.l	d0,VDP_CTRL

| disable DMA.
move.w	#(VDP_REG_1 + 0x64)	,VDP_CTRL 	| set mode (same as before with 0x74 but with DMA bit cleared
EDIT OK the code's definitely wrong in some places and I'll update it when I get this working better but jesus christ I just found out this at some point stopped working in Fusion and as far as I can tell it's related to the value of a register during map unpacking. In Gens, the register is set to 3 and then decremented until zero (used as a row iterator for my very tiny 4x4 test map), according to the debugger. In Fusion, it completely crashes unless I directly assign the correct value to the register right before using it. Seriously wish Fusion had its own debugging tools - flying blind is no fun at all.
EDIT2 So I think that, for some reason, Fusion is getting 0 from the location that the map data *should* have been read into (including the headers for map size); that 0, minus one, then wraps around. Why this is, I do not know. It could be that my rom copy routine is not working on Fusion. I don't appear to have any way to find out, unfortunately. Sigh.

Post by Chilly Willy » Thu Jul 06, 2017 3:36 pm

You goofed a register...

Code:

| now shift d0's mid byte over.
lsr.w   #8,d1
Should be d0, not d1.

Post by KillaMaaki » Thu Jul 06, 2017 7:19 pm

Yeah I caught that. Also caught the fact that I never actually write that register so mid byte is always just 00.
Still need to figure out why (completely unrelated to that code) my rom completely shits the bed in Fusion T u T

Post by Chilly Willy » Thu Jul 06, 2017 9:27 pm

Well, without seeing/knowing more, we can only make educated guesses. Are you padding your rom out to a decent size? Some emulators hate when you pass an "odd" size or too small rom to it. You should always pad your roms to the next 128KB or so.

Post by KillaMaaki » Thu Jul 06, 2017 11:38 pm

The ROM is 64kb right now. Changing the makefile to pad it to 128kb makes no difference. However, writing some values I'm expecting at a particular location in RAM directly to that location in RAM does make a difference. Those values should be present after calling my rom_copy routine, so it seems that routine is failing on Fusion for some reason. It just occurred to me that maybe I could export a save state and see if I can find a way to peek at the RAM (maybe write 0xDEADBEEF into RAM and then search for that with a HEX editor), which could help in debugging what the heck's going on here.

EDIT: Huh. OK. At least now I can peek at the RAM in Fusion using that technique. But here's the weird thing... I'm now logging at a specific location in RAM, just after the DEADBEEF write, what pointer it gets for the map data to load. Viewing the RAM in Gens, it tells me that the pointer it got was 0x00901F00. However, in the Fusion save state, it instead logs the value 0x0090535F. And the area in RAM where the map should have been copied to is complete garbage data. So the rom_copy routine might actually be working, but for some reason the lower word of the pointer passed to it is totally different between the two. I wonder how this could be... :?

EDIT 2: OK, I got those pointer values wrong anyway. But here's an interesting clue. If I pass 0xAABBCCDD from my C code, and then log that directly to RAM on the MegaDrive side, here's the difference:

Gens: 0xAABBCCDD (as expected)
Fusion: 0xCCDD535F (oh hey that 535F looks awfully familiar)

I'm struggling to figure out what would be causing this difference between the emulators, but it's a start. Maybe I'll debug reading the individual comms used to store the long pointer.

EDIT 3: OK I fixed it. I still struggle to understand how an emulator difference would have produced this behavior change with the same binary ROM, but I ended up changing my C code to manually split the pointer into two words and pass into COMM2 and COMM4 separately rather than trying to cast COMM2 to an unsigned int and directly assigning it. And now maps are loaded into memory precisely as I'd expect in Fusion. Yay!
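The split looks roughly like this. It's a host-side sketch, not my actual 32X source: `fake_comm2` and `fake_comm4` are plain variables standing in for the real COMM2 and COMM4 registers, and the function names are invented.

```c
#include <stdint.h>

/* Stand-ins for the two 16-bit comm registers. On hardware these would
   be the real COMM2 and COMM4 locations, not ordinary variables. */
static volatile uint16_t fake_comm2, fake_comm4;

/* Sender side: pass a 32-bit pointer as two separate 16-bit writes. */
static void send_pointer(uint32_t ptr)
{
    fake_comm2 = (uint16_t)(ptr >> 16);     /* high word */
    fake_comm4 = (uint16_t)(ptr & 0xFFFF);  /* low word */
}

/* Receiver side: rebuild the pointer from the two words. */
static uint32_t receive_pointer(void)
{
    return ((uint32_t)fake_comm2 << 16) | fake_comm4;
}
```

Each register only ever sees a single 16-bit access this way, so nothing depends on how a 32-bit access across two comm registers gets handled.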

Post by KillaMaaki » Fri Jul 07, 2017 5:08 am

SUCCESS!!
insert m_bison_yes.gif

Image

One of the primary problems I had, after fixing the map pointer in Fusion, was that I did not realize source address had to be shifted right one bit (also word aligned but that wasn't an issue because my temporary buffers used for DMA are already word aligned), and then the top bit of the high byte had to be 0 to indicate a memory-to-vram copy. Once I figured that out, it didn't take long to get it copying rows of my map data into the VDP :D
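For anyone following along, here's roughly how the register values fall out, written as a host-side C sketch rather than my actual 68K code. `VdpDmaSetup` and `vdp_dma_to_vram` are names I made up; the 0x9300 to 0x9700 bases are just the 0x8000 + register * 0x100 pattern from my register symbols, and the command long follows the same layout as the 0x40000083 value I used for 0xC000.

```c
#include <stdint.h>

/* Hypothetical bundle of everything written to the control port for one DMA. */
typedef struct {
    uint16_t len_lo, len_hi;           /* registers 19 and 20: length in words */
    uint16_t src_lo, src_mid, src_hi;  /* registers 21, 22 and 23: source address */
    uint32_t command;                  /* control-port long that kicks off the DMA */
} VdpDmaSetup;

static VdpDmaSetup vdp_dma_to_vram(uint32_t src68k, uint16_t vram_addr, uint16_t words)
{
    VdpDmaSetup s;
    uint32_t src = src68k >> 1;  /* source address is shifted right one bit */

    s.len_lo  = (uint16_t)(0x9300 | (words & 0xFF));
    s.len_hi  = (uint16_t)(0x9400 | (words >> 8));
    s.src_lo  = (uint16_t)(0x9500 | (src & 0xFF));
    s.src_mid = (uint16_t)(0x9600 | ((src >> 8) & 0xFF));
    s.src_hi  = (uint16_t)(0x9700 | ((src >> 16) & 0x7F));  /* top bit 0 = memory-to-vram */

    /* VRAM write with the DMA bit set: low 14 address bits go in the
       high word, top two address bits go in bits 0-1 of the low word. */
    s.command = 0x40000080u |
                ((uint32_t)(vram_addr & 0x3FFF) << 16) |
                (uint32_t)(vram_addr >> 14);
    return s;
}
```

For a 40-word row copied from 0xFF0000 to the start of plane A at 0xC000, that gives 0x9328, 0x9400, 0x9500, 0x9680, 0x977F and a command of 0x40000083.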

'Course, currently just a font loaded as a tileset so it just spits out characters from the font lol but still, proof of concept :)
EDIT: Also a bunch of sprites rendered by my 32X code on top lol. Never bothered to remove them while testing map loading and DMA.

EDIT
Oh hey I think I figured out where the problem in Gens is. Seems like if I do a move.l to put something into the framebuffer, Gens forces that address to be long-aligned. This makes my sprites always snap to multiples of 4 pixels horizontally. If I use byte writes instead, it works perfectly. In fact, going to the area where it writes longs to the framebuffer and doing nothing more than simply replacing move.l with move.b (aside from making it only write one every four columns) also fixes the 4-pixel alignment problem (using move.w forces it to align to one every two pixels, which I sort of suspected).
Is this how actual hardware functions too, and Kega Fusion has it wrong, or is Gens just being dumb and it's Kega and real hardware that have this correct? I'd hate to have to either make my drawing function way more complex, or just resort to byte copies, because of a quirk like this.
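Whichever side turns out to be right, one way to sidestep the question would be to never issue a long write to an address that isn't a multiple of 4. Something like this C sketch (`copy_row_aligned` is a made-up name, and I haven't measured whether the extra head and tail handling eats the gain on small sprites):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes so that every 4-byte store lands on a 4-byte boundary. */
static void copy_row_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
    /* head: single bytes until dst sits on a 4-byte boundary */
    while (n > 0 && ((uintptr_t)dst & 3) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* body: four bytes at a time; dst is aligned here, src may not be,
       so the source long is gathered with memcpy first */
    while (n >= 4) {
        uint32_t v;
        memcpy(&v, src, 4);
        memcpy(dst, &v, 4);
        dst += 4;
        src += 4;
        n -= 4;
    }

    /* tail: whatever is left over, again as single bytes */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

A 16-pixel row starting at an odd x would then go out as up to three byte writes, three long writes, and the remaining one to three pixels as byte writes.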
