To NOP or not to NOP

furrykef · Post by **furrykef** » Mon Jul 21, 2008 8:46 pm

I'm wondering when exactly you should use NOP. For instance, here's the read_joypad1 routine from genesis.c:

ushort read_joypad1()
{
    register volatile uchar *pb;
    ushort i, j;

    pb = (uchar *) 0xa10003;

    *pb = 0x40;        /* check joypad */
    asm("nop");
    asm("nop");
    i = *pb & 0x3f;

    *pb = 0;           /* check buttons */
    asm("nop");
    asm("nop");
    j = (*pb & 0x30) << 2;

    return( ~(i|j) );
}

I realize that the NOPs are to wait for the hardware to notice what you're doing. But when exactly is it needed, and how do you know how many cycles to wait? In this particular case, it's reading from a memory register shortly after writing to the same register, so I assume that's why the wait is necessary. But I don't see anything like this specified in sega2f.doc. Were the NOPs written in after trial and error, or is this documented somewhere?

Keep in mind that I'm thinking in terms of the actual hardware, not just emulation.

- Kef

TmEE co.(TM) · Post by **TmEE co.(TM)** » Mon Jul 21, 2008 9:43 pm

On actual hardware, you need the NOPs, or at least one of them. Omitting them can lead to some non-responsiveness on real hardware (especially 6-button pads, and when you have some overclocking going on).

furrykef · Post by **furrykef** » Mon Jul 21, 2008 9:49 pm

Yes, but my question is when they're necessary... how do you know when you should put them? As I said, I haven't found any information on it in the documentation.

TmEE co.(TM) · Post by **TmEE co.(TM)** » Mon Jul 21, 2008 9:53 pm

you have to put them after every TH line modification.

Shiru · Post by **Shiru** » Mon Jul 21, 2008 11:20 pm

I'd say the question is 'how long must this delay be?' Different C compilers with different settings produce different code, so maybe those NOPs are actually unneeded (the code executed between accesses to the port almost surely provides enough delay).

furrykef · Post by **furrykef** » Tue Jul 22, 2008 12:38 am

TmEE co.(TM) wrote: you have to put them after every TH line modification.

So it's something specific to that particular register, then? Any other places where I might need to use NOP where it isn't obvious from the documentation?

Shiru wrote: I'd say the question is 'how long must this delay be?' Different C compilers with different settings produce different code, so maybe those NOPs are actually unneeded (the code executed between accesses to the port almost surely provides enough delay).

Well, I have the same routine written in ASM and it also used two NOPs. I don't know which version was written first. But I'm willing to bet that it's likely enough that it'll get assembled into essentially the same code. Compiler bloat doesn't always affect every little line of code.

It's still a good question, though: how do you know how much to delay?

- Kef

tomaitheous · Post by **tomaitheous** » Tue Jul 22, 2008 1:42 am

furrykef wrote:
It's still a good question, though: how do you know how much to delay?

Unless you know about it previously, you don't know. When coding for a console, it's best to grab as many documents as you can find. People forget to mention things, or sometimes it's assumed you know. Other times the doc authors aren't informed themselves. You need to weed through the docs and see where they differ. You don't do this across the board, but for a specific area or interface.

Having coded on other systems, the first thing about reading from the controller port would be "does it need a delay?". Especially for a multiplexed controller port.

I think after a while you start to get a feel for what might be timing sensitive communications and ask if it's not mentioned in the docs. A general rule is that anything interfacing with the processor has some sort of timing guidelines at some specific area or stage of the device. The VDP, 2612, z80, I/O ports, etc.

furrykef · Post by **furrykef** » Wed Jul 23, 2008 6:40 pm

I found one doc that says you need 16 cycles. Two NOPs is eight cycles. I'm not sure whether or not the doc means you need 16 cycles before the next move instruction that accesses the register, or 16 cycles including the next move instruction. If it includes it, then it should work out to at least 16 cycles total.

- Kef

HardWareMan · Post by **HardWareMan** » Thu Jul 24, 2008 3:46 am

furrykef wrote:I found one doc that says you need 16 cycles. Two NOPs is eight cycles. I'm not sure whether or not the doc means you need 16 cycles before the next move instruction that accesses the register, or 16 cycles including the next move instruction. If it includes it, then it should work out to at least 16 cycles total.

- Kef

Do not forget: you need 16 cycles between changes of the I/O line, not between M68K instructions. I mean, the instruction MOVE #$0000,$A10003 changes the I/O lines only after it has been fetched (1 word opcode + 1 word constant + 2 words address = 4 words).

furrykef · Post by **furrykef** » Thu Jul 24, 2008 4:31 am

I have to admit it took me a second to figure out what you meant. So basically you're saying the fetch/decode part of the CPU's fetch/decode/execute sequence for the move instruction should cover it, right? Got it.

Or actually... could NOPs actually take 8 cycles rather than 4, hence two NOPs = 16 cycles? NOP takes four cycles to execute, but it should take another four cycles to fetch the NOP instruction in the first place, shouldn't it?

- Kef

Chilly Willy · Post by **Chilly Willy** » Thu Jul 24, 2008 5:01 am

No, the instruction fetch is part of the cycle timing. The only way the fetch would make it longer is if the hardware inserted wait states.

HardWareMan · Post by **HardWareMan** » Thu Jul 24, 2008 6:30 am

I think every word read/write (the system bus is 16 bits wide) takes 4 clocks (or 8 states), without additional wait states. The first word is the opcode; its fetch is combined with execution.

So I think for the "MOVE.W #$1234,$A10003" instruction the execution flow will be:
Fetch word 1 - opcode MOVE.W
Fetch word 2 - constant #$1234
Fetch word 3 - high address word $00A1
Fetch word 4 - low address word $0003
Write word 5 - write constant #$1234 at address $A10003
And somewhere around the write of word 5, the I/O chip latches the constant and sets the I/O port line.

Mask of Destiny · Post by **Mask of Destiny** » Thu Jul 24, 2008 4:46 pm

What HardWareMan says is confirmed by the MC68000 User Manual. There are a number of tables in section 8 that give total execution time and the number of read and write cycles. Doing a move.b or move.w with an immediate source and a 32-bit constant address takes 20 cycles total with 4 read and 1 write operation.
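To make that arithmetic concrete, here is a minimal C sketch of the bus-access tally. It assumes 4 clocks per 16-bit bus access with no wait states, and it only applies to instructions that have no extra internal cycles (shifts, for example, do not fit this simple model). The names `bus_clocks` and `CLOCKS_PER_BUS_ACCESS` are made up for illustration.

```c
#include <assert.h>

/* Assumption: every 16-bit bus access costs 4 CPU clocks, no wait states. */
enum { CLOCKS_PER_BUS_ACCESS = 4 };

/* Clocks for an instruction that does nothing but the given number of
   bus reads and writes (no extra internal cycles). Illustrative only. */
static int bus_clocks(int reads, int writes)
{
    assert(reads >= 0 && writes >= 0);
    return (reads + writes) * CLOCKS_PER_BUS_ACCESS;
}
```

With this model, `bus_clocks(4, 1)` gives 20 for a move.b from an immediate to a 32-bit absolute address, `bus_clocks(4, 0)` gives 16 for a move.b from a 32-bit absolute address to a data register, `bus_clocks(2, 0)` gives 8 for a move.b through an address register, and `bus_clocks(1, 0)` gives 4 for a nop.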

Depending on how you do the read, you might not even need any nops. For SLO I use the following code:

	move.b	#$FF, $a10003	;set TH for controller A
	move.b	$a10003, d7	;CBRLUD
	andi.b	#$3F, d7
	move.b	#0, $a10003
	move.b	$a10003, d6	;SA00UD
	andi.b	#$30, d6
	lsl.b	#2, d6
	or.b	d6, d7		;SACBRLUD

move.b $a10003, d7 should take 16 cycles so depending on exactly when things latch, that's somewhere between 12 and 16 cycles worth of delay. For what it's worth, no one has reported any problems with the controller support in SLO and I've tested it with a number of 3 and 6-button controllers (all 1st party ones though, 3rd party pads could be a problem I suppose).

If you were to do something like this though:

	lea	$a10003, a0
	move.b	#$FF, (a0)		;set TH for controller A
	move.b	(a0), d7		;CBRLUD
	andi.b	#$3F, d7
	move.b	#0, (a0)
	move.b	(a0), d6		;SA00UD
	andi.b	#$30, d6
	lsl.b	#2, d6
	or.b	d6, d7		;SACBRLUD

You would likely need to add in a nop or two as move.b (a0), d7 should only take 8 cycles. Presumably a decent C compiler would produce something more like the second example rather than the first, but you should check the output if you want to be sure.
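As a rough way to reason about "a nop or two", the sketch below computes how many NOPs would pad a read up to the delay target. It assumes the 16-cycle figure quoted earlier in the thread and, optimistically, counts the whole read instruction as delay; since the exact latch point is uncertain, treat the result as a lower bound. `nops_needed` and the constants are hypothetical names.

```c
#include <assert.h>

enum {
    NOP_CLOCKS      = 4,   /* one 68000 NOP */
    REQUIRED_CLOCKS = 16   /* assumed target, taken from the doc cited in the thread */
};

/* Number of NOPs to insert between the TH write and the read, given how
   many clocks the read instruction itself is assumed to contribute. */
static int nops_needed(int read_clocks)
{
    int shortfall;

    assert(read_clocks >= 0);
    shortfall = REQUIRED_CLOCKS - read_clocks;
    if (shortfall <= 0)
        return 0;
    return (shortfall + NOP_CLOCKS - 1) / NOP_CLOCKS;  /* round up */
}
```

Under these assumptions an 8-cycle read through (a0) comes out at two NOPs, a read credited with only 12 cycles needs one, and a full 16-cycle absolute read needs none.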

furrykef · Post by **furrykef** » Thu Jul 24, 2008 7:52 pm

Mask of Destiny wrote:but you should check the output if you want to be sure.

I don't like the idea of relying on examining the compiler output for a decision like that. That gratuitously ties your code down to that particular version of that particular compiler... if you switch to a compiler that produces different code, you may need the NOPs again -- and completely fail to detect the problem, especially if your emulator doesn't require you to wait the 16 cycles. If you put 'em there, the code always works. Considering that I really don't think you need to be worrying about wasting 8 (or 16 for two controllers) cycles every frame, I doubt it's worth the trouble to omit them even if you technically can. If you care that much, write the whole routine in ASM and there'll be no doubt about its performance.

But then you should probably write the whole game in ASM in that case.
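For what it's worth, the "just leave them in" approach might look something like the sketch below. It is not the genesis.c code: the port address is passed in as a parameter so the routine can be exercised off-target, `PAD_NOP` and `read_joypad` are made-up names, and GCC-style inline asm is assumed. The "memory" clobber is an extra precaution against the compiler moving the NOP around the port accesses.

```c
#include <assert.h>

/* Hypothetical helper: emit one NOP. Assumes a GCC-compatible compiler. */
#if defined(__GNUC__)
#define PAD_NOP() __asm__ __volatile__("nop" ::: "memory")
#else
#define PAD_NOP() ((void)0)   /* placeholder only; no delay is produced */
#endif

/* Same logic as read_joypad1, with the port passed in.
   On hardware the caller would pass (volatile unsigned char *)0xa10003. */
static unsigned short read_joypad(volatile unsigned char *port)
{
    unsigned short i, j;

    assert(port != 0);

    *port = 0x40;                 /* TH high */
    PAD_NOP();                    /* always pad, whatever the compiler emits */
    PAD_NOP();
    i = *port & 0x3f;

    *port = 0;                    /* TH low */
    PAD_NOP();
    PAD_NOP();
    j = (unsigned short)((*port & 0x30) << 2);

    return (unsigned short)~(i | j);
}
```

The delay no longer depends on which addressing mode the compiler happens to pick for the port access, at a cost of 16 clocks per controller read.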

- Kef

TmEE co.(TM) · Post by **TmEE co.(TM)** » Thu Jul 24, 2008 8:14 pm

For any more serious MD dev, you need a flashcart or something else to run your code on the real deal... there's a lot of stuff you can do in emulators and not on real HW. Try using BTST on VDP