New Documentation: M68000 microcode-level bus access timing
Moderator: BigEvilCorporation
While doing some research today I came across a new document that's just been posted on an Atari forum ( http://www.atari-forum.com/viewtopic.php?f=68&t=24710 ). This document provides the results of a commercial effort made in the past, using a combination of the official documentation provided by Motorola, the patent information on the internal operation of the M68000, and actual hardware testing, to document the behaviour of all M68000 instructions and exceptions in regards to the exact timing and order of all external bus operations during instruction execution.
This documentation is far more comprehensive and complete than anything else that I've ever seen, and the author claims that the results have proven to be accurate from actual use in his previous company. I'm quite confident based on this documentation, that I can now write an M68000 core which is able to overcome the limitations of the current emulation core I wrote for Exodus, and every other M68000 core I'm aware of, where the processor is unable to keep correct timing and order for external bus access, and is unable to yield the bus and respond to exceptions at the same points that the real processor was able to. I'm currently working on a new M68000 core for the next release of Exodus to incorporate these findings.
You need to register on the Atari forum in order to download the file, so for anyone who's interested, I've mirrored the document on my webspace here:
http://nemesis.hacking-cult.org/MegaDri ... /Yacht.txt
Keep in mind that yacht is derived from the Motorola US patent. The difference is that the microcode listing in yacht is a lot more readable than the original.
You can never be sure whether the patent describes the final 68k revision.
(The biggest difference is DCNT, which became DBcc.)
Logic analyzer tests are needed.
For my emulation project I have written a 68k emulator based on this document.
http://sourceforge.net/projects/portable68000/
From the "vocabulary" section of the doc...
Quote: Microcycle : indivisible CPU cycle of execution : takes 2 clock cycles.
Microcycles take a minimum of 2 clock cycles, but some microcycles perform a complete bus operation and thus require a minimum of 4 cycles. And of course, bus operations can be extended to any number of cycles >= 4 with !DTACK.
From the rules of thumb in the doc...
Quote: 2) 68000 internal data bus is 32 bit so reading/writing word or long word from / to a register take the same time.
This isn't entirely true. There are 2 16-bit data buses that are each split into 3 segments (one segment for high-words, one segment for address register low-words and one segment for data register low-words). This allows certain 32-bit transfers (like register to register moves) to operate in the same time as equivalent 16-bit transfers, but for others (like loading a 32-bit value into the ALU for bit-shift/rotate operations) an extra micro-cycle is required (from what I remember anyway, it's been a while since I looked at the micro/nanocode for those).
Thanks for sharing Nemesis. As PiCiJi says this is a lot easier to read than the Motorola patent.
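As a rough illustration of the timing rule described in this post (2 clocks for an internal microcycle, at least 4 for one that performs a bus operation, plus any wait states), here is a minimal Python sketch. The `Microcycle` type and the `total_clocks` helper are hypothetical names made up for this example; they are not taken from yacht.txt or the patent.

```python
from dataclasses import dataclass

INTERNAL_CLOCKS = 2  # microcycle with no external bus access
BUS_CLOCKS = 4       # minimum length of a microcycle that performs a bus operation


@dataclass
class Microcycle:
    uses_bus: bool
    wait_states: int = 0  # extra clocks while !DTACK has not been asserted yet


def clocks(mc):
    """Clocks consumed by one microcycle under the rule quoted above."""
    if mc.uses_bus:
        return BUS_CLOCKS + mc.wait_states
    # Wait states only make sense for bus operations, so they are ignored here.
    return INTERNAL_CLOCKS


def total_clocks(sequence):
    """Sum the clocks of a sequence of microcycles."""
    return sum(clocks(mc) for mc in sequence)
```

For example, one internal microcycle followed by one bus microcycle gives 6 clocks with fast memory, and 9 clocks if the bus operation is held for 3 wait states.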
My reading of the document and my understanding of what the author wrote is that it's not just taken from documentation alone. Remember, this was a commercial effort, so the company doing this analysis must have had a product they were working on that relied on this information, and from what he says, it sounds like that product was developed, and worked correctly using this information. Here's what he says on the Atari forum:
Quote: Most of the technical things written in it have been proven to be right by years of practice on real hardware. But, as always, it can still have some errors in it.
Even after years spent to refining it, there's still some mysteries floating around (especially when talking about exceptions).
With this note from the author, and the extent of the documentation, with the corrections and additions given to the official documentation and the patent documentation within it, I think this document is more than just taken from the other available documentation.
That said, if there are any points in question, I'm happy to break out the logic analyser and check them over. It's much easier to offer amendments or corrections to a document like this than test everything from scratch.
Do you know this project?
http://sourceforge.net/projects/portable68000/
The author claims his emulator is cycle accurate ...
I think he claims the prefetch is done with the correct timing, and that external interrupts are taken at the correct timing, but it doesn't do full cycle accuracy and order for all external bus operations or group 0 exceptions (reset/address error/bus error). I've looked at the source, and it doesn't appear to be able to do that. I'll be aiming for these goals with my new core. I'll also be adding in support for the M68010 in the same core (as an option). The M68010 supports resuming from address and bus errors, something that can only be properly done with a core design that can emulate at a sub-opcode level.
Nemesis wrote: It's much easier to offer amendments or corrections to a document like this than test everything from scratch.
Sure. No one has time to test it all again. There are a few things I am not sure about, though.
Things like the dummy reads in some opcodes, or the 2-cycle gap in the execution times of exceptions.
For example, the author of yacht wrote:
It reads "14(3/0)" but, according to USP4325121 and with a
little common sense, 2 bus read accesses are far enough.
or
For all these exceptions, there is a difference of 2 cycles between Data
bus usage as obtained from USP4325121 and periods as written in M68000UM.
There's no proven theory to explain this gap.
To me that doesn't sound very trustworthy. It should be confirmed with a logic analyzer.
Nemesis wrote: but it doesn't do full cycle accuracy and order for all external bus operations or group 0 exceptions (reset/address error/bus error)
Stack frame creation for all exceptions (except the reset exception) should be bus accurate.
Please tell me which external bus operations you mean.
To explain: a derived class should be written to handle bus arbitration. For example, in Amiga emulation the CPU has to wait for a free bus access window. Therefore I am emulating bus hold times, meaning the CPU needs two cycles to put the address on the bus. The second two cycles are needed to read from or write to this address. These two cycles should be repeated until the bus is free.
While the CPU is waiting for a free bus, the CPU thread should be left and control switched to the thread of another bus participant, so that the overall emulation progresses.
The same should be done just before the IPL latch is sampled. All other IRQ-generating devices in the system should have caught up to the CPU's cycle position within the opcode.
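The arbitration scheme described in this post can be sketched roughly as follows. This is only an illustration under stated assumptions: a Python generator stands in for the CPU thread, and `cpu_bus_access`, `run_access` and the `Dma` class are hypothetical names invented for the example. None of this is taken from the portable68000 source.

```python
def cpu_bus_access(bus_is_free):
    """One CPU bus access, yielding the clocks used by each phase.

    Each yield is a point where the caller can switch to other bus participants.
    """
    yield 2                    # address phase: put the address on the bus
    while not bus_is_free():
        yield 2                # bus still busy: repeat the 2-clock data phase
    yield 2                    # data phase that completes the read or write


class Dma:
    """Hypothetical bus participant that owns the bus until a given clock."""

    def __init__(self, release_at):
        self.release_at = release_at
        self.now = 0

    def catch_up(self, now):
        self.now = now         # run this device up to the CPU's position

    def bus_is_free(self):
        return self.now >= self.release_at


def run_access(dma):
    """Drive one access, letting the other device catch up after every phase."""
    elapsed = 0
    for clocks in cpu_bus_access(dma.bus_is_free):
        elapsed += clocks
        dma.catch_up(elapsed)
    return elapsed
```

With a free bus the access takes the minimum 4 clocks. If the hypothetical DMA device holds the bus until clock 6, the data phase is repeated twice and the access takes 8 clocks.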
PiCiJi wrote: To me that doesn't sound very trustworthy. It should be confirmed with a logic analyzer.
I'm considering doing a "mass capture" of every form of every opcode using my logic analyser as one long continuous stream of data. This would serve as a definitive reference for the external bus behaviour of the M68000. If I do, I'll dump the raw data directly online, and then we can compare with the timing in yacht.txt. If any errors are found, we can provide an errata for this document correcting anything that's wrong.
PiCiJi wrote: Stack frame creation for all exceptions (except the reset exception) should be bus accurate.
Please tell me which external bus operations you mean.
To explain: a derived class should be written to handle bus arbitration. For example, in Amiga emulation the CPU has to wait for a free bus access window. Therefore I am emulating bus hold times, meaning the CPU needs two cycles to put the address on the bus. The second two cycles are needed to read from or write to this address. These two cycles should be repeated until the bus is free.
While the CPU is waiting for a free bus, the CPU thread should be left and control switched to the thread of another bus participant, so that the overall emulation progresses.
The same should be done just before the IPL latch is sampled. All other IRQ-generating devices in the system should have caught up to the CPU's cycle position within the opcode.
While that method can work for a simple execution scenario, it's impossible for you to, for example, generate a savestate when an opcode is half-executed in this manner. It's also impossible to effectively emulate the RTE behaviour on the M68010, where a bus or address error can be resumed from, with the exception handler possibly handling the failed bus operation in software. Also, if you have multiple devices in the one system which are emulated in this way, it's possible the system will never reach a stable point where each device is exactly at the start of an opcode, meaning things like savestates might not even be possible in such a scenario.
Now that's not necessarily a showstopper if those issues are unimportant for a particular use, but they're showstoppers for me. In order for the timing management system in Exodus to work, devices need to be able to be halted between each indivisible unit of execution, and likewise in order for savestates to work, they need to be able to fully save and load all device state at these points.
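To make the savestate point concrete, here is a minimal sketch of a core whose position inside an opcode is held in plain data rather than on a thread's call stack, so it can be saved and restored between any two indivisible steps. The `SteppedCore` class, its fields and the step names are hypothetical and heavily simplified (every step is charged 2 clocks); this is not how Exodus is implemented.

```python
import copy


class SteppedCore:
    """Toy core: each opcode is a list of steps, executed one step at a time."""

    def __init__(self, program):
        self.program = program                 # list of opcodes, each a list of step names
        self.state = {"pc": 0, "step": 0, "clock": 0}

    def step(self):
        """Run one indivisible step. Returns False once the program is finished."""
        s = self.state
        if s["pc"] >= len(self.program):
            return False
        steps = self.program[s["pc"]]
        s["clock"] += 2                        # assumption: every step costs 2 clocks
        s["step"] += 1
        if s["step"] == len(steps):            # opcode finished, move to the next one
            s["pc"] += 1
            s["step"] = 0
        return True

    def save(self):
        return copy.deepcopy(self.state)       # valid even in the middle of an opcode

    def load(self, snapshot):
        self.state = copy.deepcopy(snapshot)


def run_to_end(core):
    while core.step():
        pass
    return core.state["clock"]
```

Because the snapshot includes the step index, a state taken halfway through the second opcode resumes at exactly that step after loading, which is the property a thread-per-device design does not give you directly.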
Nemesis wrote: If I do, I'll dump the raw data directly online, and then we can compare with the timing in yacht.txt.
Sounds good.
Sure, generating savestates is not that easy, but in my opinion the reduced code complexity is worth it.
savestates
------------
In my last emulator I waited until the CPU thread was at a clean opcode edge. Afterwards I synced to the other threads and let them run to their entry points. If a sync is needed during this process, all is lost and the whole process has to be repeated... damn.
I am trying to find a save point within a whole frame. Save points like that are absolutely safe to recover from.
If it's not possible to find a save point within a whole frame, the message "save failed" will be displayed. If you handle all non-CPU threads in short cycles, it's unlikely that you will get the error message.
If I understand the doc correctly:
(An) 4(1/0) nr
2 microcodes:
- n : 2 clocks to put the address on the bus and assert AS
- r : 2 clocks to receive DTACK from memory and read the data bus onto the internal bus (assuming the memory has no latency)
Question: what happens if the memory sends DTACK later? Will the data be taken into account 2 clocks (1 microcode) later, like nnr instead of nr?
I mean, what would the perfect 68k emulator be? One in which the smallest indivisible execution unit is the microcode (2 clocks), or one that emulates clock by clock?
I'm trying to imagine the design of a perfect 68k emulator with a linked list of microcodes. When you run the 68k, it executes the next microcode in the list. The program using it could then set the data bus, the address bus and the pin signals between each microcode, like on real hardware.
Would that be sufficient, or would really perfect synchronisation need a linked list of per-clock operations? In that case only a computer from NASA would be able to run the PERFECT 68k emulator.
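For readers unfamiliar with the `4(1/0)` notation quoted in this post: it follows the usual Motorola timing-table convention of total clocks, then (bus reads/bus writes). Below is a small hedged sketch that parses such an entry and works out how many clocks are left over once each bus access has been charged its 4-clock minimum. `parse_timing` and `internal_clocks` are hypothetical helper names.

```python
import re

# Matches entries such as "4(1/0)" or "14(3/0)": clocks(reads/writes)
_TIMING = re.compile(r"^\s*(\d+)\s*\(\s*(\d+)\s*/\s*(\d+)\s*\)\s*$")

MIN_BUS_CLOCKS = 4  # minimum clocks for one bus read or write


def parse_timing(text):
    """Return (clocks, reads, writes) for an entry like '4(1/0)'."""
    match = _TIMING.match(text)
    if match is None:
        raise ValueError(f"not a timing entry: {text!r}")
    clocks, reads, writes = (int(group) for group in match.groups())
    return clocks, reads, writes


def internal_clocks(text):
    """Clocks not accounted for by the minimum length of the bus accesses."""
    clocks, reads, writes = parse_timing(text)
    bus = MIN_BUS_CLOCKS * (reads + writes)
    if bus > clocks:
        raise ValueError(f"{text!r} has fewer clocks than its bus accesses need")
    return clocks - bus
```

Under this reading, `internal_clocks("4(1/0)")` is 0, meaning the single read accounts for all 4 clocks, while an entry such as `14(3/0)` leaves 2 clocks of internal work.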
You need to step by a single clock cycle or less, not a 2-cycle "microcode" step like they talk about in the document. If you look at the M68000 User's Manual, you'll see that when DTACK isn't asserted, wait states are inserted, which consist of whole clock cycles. A wait could be 1, 2, 3, etc clock cycles, not just 2, 4, 6, etc clock cycles like you might expect from reading the yacht.txt document. Obviously delays in DTACK weren't important for whatever project they were attempting.
The proper way of handling bus operations is fully outlined in the M68000 User's Manual, section 5. The real bus logic latches and updates signals at both the rising and falling edges of the individual clock cycles. A simplification though is this:
-Nothing of interest happens externally on the first clock cycle
-At the beginning of the second clock cycle, the external bus signals are asserted (i.e., this is the time at which external devices consider the read or write operation to be occurring)
-At the beginning of the third clock cycle, the state of DTACK is latched. If DTACK hasn't been asserted, this clock cycle repeats until DTACK is asserted. This will loop infinitely if DTACK is never asserted.
-At the beginning of the fourth clock cycle, the provided data is latched for a read operation, and the external bus signals are negated.
Note that this is a simplification. In reality, the delay between when the external bus signals are asserted, and the point at which DTACK is sampled, is 1.5 clock cycles, since the bus signals are asserted on the rising edge of the clock, and DTACK is latched at the falling edge of the clock. A "perfect" emulator would be able to start the bus operation at a whole cycle boundary, latch DTACK 1.5 clock cycles later, and if DTACK hasn't been asserted, insert whole clock cycles as bus wait cycles, sampling DTACK again halfway through each bus wait cycle. This is the timing I'm going to be using in my core.
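The sampling rule in this post can be sketched as a small function. Time is counted in half clocks so that the 1.5-clock delay stays an integer: the bus signals are asserted 2 half clocks into the operation, DTACK is first sampled 3 half clocks after that, and each failed sample inserts one whole wait clock. The function name and the `max_wait_clocks` safety limit are assumptions made for this example; as the post says, the real processor would simply keep waiting.

```python
ASSERT_AT = 2         # half clocks: bus signals asserted at the start of the 2nd clock
SAMPLE_DELAY = 3      # half clocks: DTACK is sampled 1.5 clocks after assertion
MIN_BUS_CLOCKS = 4    # length of a bus operation with no wait states


def bus_cycle_clocks(dtack_at, max_wait_clocks=1000):
    """Total clocks for one bus operation.

    dtack_at is the time, in half clocks from the start of the operation,
    at which the device asserts DTACK, or None if it never does.
    """
    sample = ASSERT_AT + SAMPLE_DELAY          # first sample point, 2.5 clocks in
    waits = 0
    while dtack_at is None or dtack_at > sample:
        if waits == max_wait_clocks:           # assumption: give up instead of looping forever
            raise TimeoutError("DTACK was never asserted")
        waits += 1                             # insert one whole wait clock
        sample += 2                            # next sample is one full clock later
    return MIN_BUS_CLOCKS + waits
```

A device that asserts DTACK just after the first sample point (half clock 6) produces a 5-clock bus operation, which shows that wait states come in single clocks rather than in 2-clock microcycle steps.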
Thanks for the information, Nemesis. That could be a great challenge!
So the perfect approach would be to emulate state by state, like in the manual (S0, S1, ... S8).
Do you plan to keep the current core in Exodus in the next version? I assume this level of emulation will decrease the speed.