M68000 Microcode-level emulation

Nemesis · Post by **Nemesis** » Mon Mar 19, 2012 2:29 am

I've become very interested as of late in taking M68000 emulation to a new level. I have written an extremely accurate M68000 core for my emulator. From what I've seen, it's more accurate than the one in MAME. That said, it still has some fundamental limits on its accuracy, which I've become increasingly aware are virtually unsolvable with the way my core is designed.

Here's the issue: The M68000 is microcoded, meaning every single machine code instruction the CPU reads isn't really an instruction for the CPU directly, it's more like a key, telling it what set of internal instructions to execute. In a way, microcode is kind of like a data table the CPU uses internally to map these high-level "macro" instructions down to a set of really low-level internal operations to execute. A single opcode, for example, may actually be made up of a dozen internal operations, or internal execution steps. Effectively, a microcoded processor is a little like a RISC machine with a very basic, small instruction set, emulating a CISC machine with a more complex and larger instruction set, but in this case, the emulation layer is built into the processor.

Now all of this may seem irrelevant for emulation purposes. This is internal detail after all, and how something is actually implemented internally doesn't really matter from an emulation point of view, as long as it appears to behave identically from an external point of view. The problem is, there are a lot of quirks with M68000 instructions that occur due to the way things are done on the microcode level, which are visible from outside the M68000, in the way it performs external bus operations. Sometimes, this sub-opcode timing and behaviour is critical. The most obvious example on the Mega Drive relates to bus requests. It's possible to perform a long-word write to the VDP, for example, where the first half of the write triggers a DMA transfer operation, and the second word modifies a VDP register. In this case, on the real hardware, the first write would be performed, which would trigger the VDP to request the bus, which would cause the M68000 to yield the bus before the second word write was performed. The VDP would then perform the DMA transfer, release the bus, and be ready to receive command port data again by the time the second half of the long-word write sends data to the VDP command port. There are games that rely on this behaviour, I believe. Things get more critical when you add on systems like the MegaCD and 32x, where you have half a dozen processors sharing buses. Suddenly you can have timing problems if one processor doesn't yield the bus to another in a timely manner.
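To make the long-word write case concrete, here is a minimal sketch of a core that treats each word-sized bus cycle as a separate step and checks for a bus request in between. Everything in it is an assumption made for illustration: `ToyVdp`, `dma_trigger`, `long_write_steps` and `run` are hypothetical names, the "DMA" is just a log entry, and the high-word-first order is assumed rather than measured.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVdp:
    """Hypothetical stand-in for the VDP command port; not real VDP behaviour."""
    dma_trigger: int                       # word value that starts a "DMA" in this toy
    bus_requested: bool = False
    log: list = field(default_factory=list)

    def write_word(self, value):
        self.log.append(f"write {value:#06x}")
        if value == self.dma_trigger:
            self.bus_requested = True      # the device asks for the bus

    def run_dma(self):
        self.log.append("dma")
        self.bus_requested = False         # the device releases the bus


def long_write_steps(vdp, value):
    """Yield after each word-sized bus cycle of a long-word write."""
    vdp.write_word((value >> 16) & 0xFFFF)   # assumed order: high word first
    yield
    vdp.write_word(value & 0xFFFF)           # low word follows
    yield


def run(vdp, value):
    for _ in long_write_steps(vdp, value):
        # Arbitration point: the CPU can give up the bus between bus cycles.
        if vdp.bus_requested:
            vdp.run_dma()
    return vdp.log
```

With an opcode-level core, both words would be written before the request is ever looked at, so the second word would reach the port before the transfer had run.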

There are other cases where microcode level emulation can be important. In particular, if you emulate at the opcode level, there's no indication of the exact order of reads and writes to the external bus, especially if you also care about the order of reads and writes for an opcode relative to the instruction pre-cache operations. Some opcodes have surprising orders of reads and writes when breaking up long-word bus operations into word operations, one of which is described in this document:
http://web.archive.org/web/200910151327 ... knotes.txt
There's another good document giving a fairly comprehensive description of when pre-caching occurs for a large number of instructions:
http://pasti.fxatari.com/68kdocs/68kPrefetch.html
Both of these documents are very incomplete in their coverage though, and all of this is still ignoring timing issues, i.e., if you want an M68000 core which performs not just the correct sequence of external bus operations, but the correct timing between them, the information simply doesn't exist in the public domain. All of this can be critical when working with memory-mapped hardware, and the order of operations is critical with group 0 exception handling.

One of the biggest problems with an opcode-level M68000 is actually the code structure. You could try and retro-fit workarounds for some of this low-level order and timing behaviour into an opcode-level M68000 emulation core, but how do you make that core correctly respond to group 0 exceptions, and support savestates, when you don't emulate at the microcode level? If you consider an opcode the smallest non-divisible unit of work in your M68000 core, exactly how do you write a generic core which can yield the bus halfway through any long-word operation, or between every bus operation in that opcode? How do you generate a savestate when your M68000 core is in this state? What if you have memory-mapped hardware which is designed to request the bus after a read, and it's critical to have it obtain it before the next write to that address? How do you support that for every opcode which potentially performs a read followed by a write?
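One way to picture the structural difference: if the core's position inside an opcode is itself a piece of state, then stopping between bus operations and saving a state there stops being a special case. The sketch below is only an illustration of that idea. `STEPS`, `SteppedCore` and the step names are hypothetical, and the step list is a placeholder, not the real sequence for any M68000 instruction.

```python
import json

# Hypothetical step list for one routine: (kind, detail) pairs.
# Placeholder data, not taken from the real microcode.
STEPS = {
    "move_l_write": [("write", "high"), ("write", "low"), ("prefetch", None)],
}


class SteppedCore:
    def __init__(self):
        self.routine = "move_l_write"
        self.step = 0            # index of the next step to run
        self.trace = []

    def tick(self):
        """Run exactly one bus-level step, then return to the scheduler."""
        kind, detail = STEPS[self.routine][self.step]
        self.trace.append(kind if detail is None else f"{kind}:{detail}")
        self.step = (self.step + 1) % len(STEPS[self.routine])

    def save(self):
        """Serialise the core, including its position inside the opcode."""
        return json.dumps(
            {"routine": self.routine, "step": self.step, "trace": self.trace}
        )

    @classmethod
    def load(cls, blob):
        state = json.loads(blob)
        core = cls()
        core.routine = state["routine"]
        core.step = state["step"]
        core.trace = state["trace"]
        return core
```

A savestate taken after the first `tick()` restores a core that resumes with the second word write, which is exactly the situation an opcode-level core has no way to represent.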

Now, some of these problems can be worked around on a specific case-by-case basis, depending on what's important for a given system. I'd like to write a "perfect" M68000 core that just does everything right for all instructions though, and it's becoming clear to me that the only practical way to make that happen is to gather much lower level information about how the M68000 uses the bus. I have a logic analyser, and I was planning on snooping on the M68000 bus to determine the correct bus access order and timing of every M68000 opcode. I'll probably still do this. I realised this wouldn't solve some of the more fundamental design problems I'm presented with though, like how to correctly support yielding the bus at all the correct points, during opcode execution. This has led me to look at another approach.

I've been very interested of late in the incredible work that's been done on the 6502 processor, where they've successfully decapped the chip, analysed and deciphered all the internal connections within the die, and emulated the entire chip on the transistor level (check it out here: http://www.visual6502.org ). This is a level of emulation which is totally insane, and I take my hat off to the guys who pulled this off. It's got me thinking though. We have complete, high resolution die shots of the entire M68000 die:
http://www.visual6502.org/images/pages/ ... 68000.html
Thanks to Motorola, we have a basic block diagram of what areas of the die perform what functions:

(sourced from http://www.easy68k.com/paulrsm/doc/dpbm68k2.htm , the original image comes from Motorola, I've seen it referenced in books such as "68000, 68010, 68020 Primer", page 42, which you can download from http://www.scribd.com/doc/29071553/6800 ... 020-Primer)

We can see from this image two clear banks of ROM data, one containing microcode (marked as µROM), and one containing nanocode (marked as NROM). The machine code maps to microcode, which in turn maps to nanocode, which is what triggers the actual internal operations performed by the M68000 CPU. For any Z80 gurus, this is "similar" to the M and T states in the Z80, where nanocode instructions are your T cycles, and microcode instructions are your M cycles. What I want to do is "decode" this internal ROM data, that is, figure out how it's used, and reverse-engineer a complete, accurate description of each stage of execution for each opcode. Effectively, I want to decode the internal microcode/nanocode instructions, so that we can emulate the M68000 on the microcode/nanocode level, rather than the opcode level like every M68000 core is currently doing. I've been analysing the die shots, and I'm making progress. Based on some careful observation and cross-referencing with other decapping projects where internal ROM data has been read, in particular the work done in decapping the YM3812 ( http://yehar.com/blog/?p=665 ), I'm now able to visually read the internal ROM data from the M68000 die shots:

Now I'm trying to take this to the next level, and start figuring out how an opcode maps to this ROM data, and how this ROM data drives the internal functions within the M68000. I'm actively researching this right now and making some progress.

Here's what's obvious/well known:
-The first 16-bit command word of each M68000 opcode is enough to define the instruction, and get loaded into an internal instruction register (IRD)
-The upper 4 bits of each opcode act as a main key, describing the high-level "category" of the opcode
-Opcodes take multiple cycles to execute
-Each nanocode "instruction" takes one cycle to execute
-An internal cycle counter of some kind must be kept to track the current progression through the execution of an opcode

It seems as though this ROM data probably acts as a table, with the combination of the instruction register and the cycle counter used as a key into the table. I'm guessing the upper 4 bits of the instruction register are used as a key into microcode ROM, which in turn gives some kind of list of keys into the nanocode ROM. For each microcode instruction, multiple nanocode instructions may be run. The last nanocode instruction run for each opcode (except some special cases like the STOP opcode) is an instruction which resets the cycle counter and loads the instruction pre-cache (IR) into the instruction word (IRD). This means the next cycle will start executing that instruction.
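As a rough sketch of that guess, a table keyed on the instruction word and a cycle counter could be driven like this. The table contents, opcode values and operation names are placeholders invented for the example (no ROM data has been decoded here), and for simplicity the key uses the whole word rather than only the upper 4 bits.

```python
# Hypothetical control table: (opcode word, cycle counter) -> internal operation.
# Placeholder values only; none of this is decoded ROM data.
CONTROL = {
    (0x1000, 0): "read_operand",
    (0x1000, 1): "alu",
    (0x1000, 2): "next",     # final step: reload IRD from IR, reset the counter
    (0x2000, 0): "alu",
    (0x2000, 1): "next",
}


class Sequencer:
    def __init__(self, program):
        self.program = list(program)    # opcode words not yet fetched
        self.ir = self.program.pop(0)   # prefetched word (IR)
        self.ird = None                 # word being executed (IRD)
        self.counter = 0
        self.load_next()

    def load_next(self):
        """Move IR into IRD, prefetch the following word, reset the counter."""
        self.ird = self.ir
        self.ir = self.program.pop(0) if self.program else None
        self.counter = 0

    def cycle(self):
        """Look up and perform the operation for the current (IRD, counter)."""
        op = CONTROL[(self.ird, self.counter)]
        if op == "next":
            self.load_next()
        else:
            self.counter += 1
        return op
```

Running `Sequencer([0x1000, 0x2000])` for five cycles walks through both table entries in order, with each "next" step handing over to the following opcode.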

What I want to do is open this research up to a larger audience than just, well, me. I'm interested in any feedback, comments, suggestions, and in particular, anyone who knows more about internal CPU architecture than me (which is to say, pretty much anything), who might be able to contribute with deciphering this thing. I'm trying to make sense of the M68000 die shot, but I have no previous experience with this. It's a bit like staring at a document you know holds the answers to all your questions, but it's written in ancient Egyptian hieroglyphs, and you can only read Yiddish.

TascoDLX · Post by **TascoDLX** » Mon Mar 19, 2012 7:07 am

In case you missed it, M68000 microcode/nanocode was briefly discussed here.

The M68000 patents are very thorough -- mind-numbingly so. They include all the nanocode (see US Patent 4,325,121), somewhat distorted, but nevertheless spot-on. Must be read carefully, though. See also US Patents 4,296,469 and 4,307,445. You'll want to refer to all of these as you analyze the nanocode. (There are a couple of other patents, but they're basically a rehash of the others.)

I think you'll find the patents slightly more informative than the die shots. Although, there may be some undiscovered tidbits to be gleaned from looking at the die. Either way, good luck!

Nemesis · Post by **Nemesis** » Mon Mar 19, 2012 9:42 am

Haha! I'm constantly surprised at what information is available through patents. Thanks, I'm going to have a thorough dig through these documents.

Out of curiosity, do you know of a similar resource for the Z80? I think the information must be out there somewhere, the Z80 was a more "open" design than the M68K, but I haven't found any comprehensive information on what happens at each T state for all the opcodes. I'd love to make both the Z80 and M68K emulation cores accurate at this level.

TascoDLX · Post by **TascoDLX** » Tue Mar 20, 2012 8:45 am

There's loads of information at www.z80.info, but I'm not entirely sure it gets as detailed as you'd like it. Maybe it's buried somewhere in there -- I don't have much time to go digging. I remember seeing a bit of info on the Z80's T-states within each M-cycle, but nothing comprehensive.

Charles MacDonald · Post by **Charles MacDonald** » Wed Mar 21, 2012 1:37 am

The Mostek Z80 ("MK3880 Central Processing Unit" in their jargon) manual has more detailed information than any of the other manuals, though it's not complete. That along with Sean Young's "Z80 Documented" paper are the best sources IMO.

I made a Z80 analyzer a while ago to analyze/dump data from custom Z80s used in arcade games, and it can advance the processor a clock edge at a time and sample all the outputs and set up the inputs for the next edge. So if there is any specific behavior you'd like tested, let me know and I can run some tests in a few months (it's out on loan).

Nemesis · Post by **Nemesis** » Wed Mar 21, 2012 5:24 am

Yeah, I've seen all those references. Haven't seen anything that gets that low-level though. I guess there just isn't that level of detail written up, which is a bit surprising to be honest. With how many quirks the Z80 has, and how widely used (and cloned) it was, I thought there'd be pretty much everything documented and in the public domain by now.

Steve Snake said he had a cycle-exact Z80 core written for Kega Fusion ( viewtopic.php?p=6906#6906 ), I wonder how he pulled that off? Maybe he did the testing himself? Or maybe the documents are out there, they just haven't been scanned in and made available on the web.

Charles MacDonald wrote:I made a Z80 analyzer a while ago to analyze/dump data from custom Z80s used in arcade games, and it can advance the processor a clock edge at a time and sample all the outputs and set up the inputs for the next edge. So if there is any specific behavior you'd like tested, let me know and I can run some tests in a few months (it's out on loan).

Thanks for the offer, that sounds like a cool bit of hardware! To be honest, it'll probably be a while before I build a new Z80 core at this level. I'm working on 5 things at once right now as usual, but I really want to get my emulator polished off and in the public domain this year, which means I need to get my cycle-accurate VDP core finished off first. I'll probably release with the current opcode-based M68000 and Z80 cores initially, and make the new cycle-level M68000 and Z80 cores the first task after that. I want to get as much information gathering as possible done now though, so I have a starting point to work on the new cores later. If I can't find any documentation for this elsewhere, I'll probably just bite the bullet and get the logic analyser out, and document all the Z80 bus timing myself. I'm glad there's comprehensive documentation for the M68000 at least, since it's the more complicated of the two.

mikej · Post by **mikej** » Fri Dec 14, 2012 10:06 am

Hi,
I have just come across your post.
I have also been working on this for a while - my interest is improving the VHDL softcores we use at www.fpgaarcade.com.
I have also been looking at the die scans.

There is a book called
"Microarchitecture of VLSI Computers" by P. Antognetti, F. Anceau and J. Vuillemin

which describes a test mode for the 68K where you can read out the rom contents.

"7 - TEST MODE
The MC 68000 has a special mode for simplifying the test of
its control part (the test of the data processing section is
much more easier due to its higher observability and controlability).
This mode is enabled by taking the VPA pin to approximately
8 volts. The special hardware activated for this mode occupies
approximately 4% of the whole chip area and involves around
250 MOS transistors.
In that mode the control part is directly exercised after input
pins reconfiguration. The microinstruction contents is readable
by third on address pins.

2 - 3 Decoding PLA A1, A2 and A3
These PLAs take the 16 bits of the IR instruction register
(and the 16 bits of their complements) as inputs and each of
them provides a 10-bit address. Electrically PLA A1 has switched
loads while PLA A2-A3 have static loads.
PLAs A2 and A3 share the same AND matrix, these two PLAs also
generate the invalid operation code (IOC) and privileged instruction
signals (PRIV).
The PLA A1 decode the instruction and generate a microinstruction
address except if the "effective address" mode is used to provide
the operand address.
In such case :
- either there exists one and only one operand address
provided by effective address mode, though PLA A1 generates
the microaddress of the effective address evaluation
subroutine then PLA A2 provides the microaddress of
the instruction execution program.
- either there exist two effective address modes provided
operands, and then, PLA A1 generates the microaddress
of the source operand effective address evaluation subroutine,
then PLA A2 generates the microaddress of the
destination operand effective address evaluation subroutine.
Afterwards, PLA A3 provides the microaddress
of the execution microprogram.
The figure 20 represents PLAs participation in the microprogram
execution."
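Read literally, the quoted passage describes a fixed hand-off between the three decoder PLAs depending on how many operands use an effective-address mode. A small sketch of that reading follows; the function name and the labels are made up for the example, and it says nothing about the actual microaddresses.

```python
def decoder_sequence(ea_operands):
    """Return which decoder PLA supplies each microaddress, in order.

    ea_operands is the number of operands that use an effective-address
    mode (0, 1 or 2), following the quoted description.
    """
    if ea_operands == 0:
        # PLA A1 points straight at the execution microprogram.
        return [("A1", "execute")]
    if ea_operands == 1:
        # A1 points at the EA evaluation subroutine, A2 at the execution.
        return [("A1", "ea"), ("A2", "execute")]
    if ea_operands == 2:
        # A1: source EA, A2: destination EA, A3: execution.
        return [("A1", "source_ea"), ("A2", "destination_ea"), ("A3", "execute")]
    raise ValueError("expected 0, 1 or 2 effective-address operands")
```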

Mail me and we can discuss more.
mikej at fpgaarcade dot com

I have a real 68K wired up to an FPGA so I think I can do this ...
/MikeJ

Charles MacDonald · Post by **Charles MacDonald** » Fri Dec 14, 2012 10:51 pm

Just to clarify, this mode is for testing the PLAs that do instruction decoding, but you can't actually read out the PLA contents directly nor can you read out the microcode ROM.

You could run all possible inputs through the PLAs and determine the results of the logic which might be useful for verification, but I'm assuming the visual6502 guys have done or will do that with the die photos?

Mask of Destiny · Post by **Mask of Destiny** » Fri Dec 14, 2012 11:22 pm

Since this thread is alive again, I thought I would mention something I noticed when using patent 4,325,121 as a source of information. While it does have a lot of useful information, it does not appear to represent the shipping version of the 68000, but a late prototype. Specifically you'll notice that the dbcc instruction is not present and a simpler instruction dcnt is present in a different instruction group instead. dcnt is similar, but it lacks the ability to exit the loop based on a condition code and the displacement is built-in to the instruction rather than living in an extension word.

If we want the final nano-rom, it looks like the die shots are the only way to go. Hopefully the mapping of instruction words to micro-word addresses didn't change too much. If it's mostly the same we can use the patents as an aid in determining the meaning of various nano-word bits.

mikej · Post by **mikej** » Sat Dec 15, 2012 1:37 am

Charles MacDonald wrote:Just to clarify this mode is for testing the PLAs that do instruction decoding, but you can't actually read out the PLA contents directly nor can you read out the microcode ROM.

You could run all possible inputs through the PLAs and determine the results of the logic which might be useful for verification, but I'm assuming the visual6502 guys have done or will do that with the die photos?

Processing the die photos is taking some time. Vectorization is ongoing, but I think it is necessary to remove the top metal layer really and re-scan.

Do we have a clear picture of the IO assignment in test mode and what data we can extract?
/MikeJ

Charles MacDonald · Post by **Charles MacDonald** » Sat Dec 15, 2012 2:12 am

mikej wrote:Do we have a clear picture of the IO assignment in test mode and what data we can extract?

From what I can tell:

Address bus : input (provides 23 of 32 inputs of the parameterization PLA with the remaining inputs forced to zero)

Data bus : input (defines the "operation word" which I think is the value in IR)

IPL1,2 : These sound like inputs but I'm unsure about this misuse of the word "third" in the description.

BR,BGACK : inputs that override whatever the conditional branching PLA would normally output

DTACK,IPL0 : Inputs that select PLA A0,A1, or A2

Notice there are no outputs.

They explicitly say the data and address busses are inputs, so the functions of those seem fixed. Maybe the others can change.

The parameterization PLA has a 32-bit bus but the article seems clear that it is a 16-bit bus containing the operation word (IR register) and the other 16 bits are the complement of those. So maybe this isn't the 32-bit bus the address bus drives?

Now I'm less sure about how it works.

TascoDLX · Post by **TascoDLX** » Sat Dec 15, 2012 11:02 am

Charles MacDonald wrote:The parameterization PLA has a 32-bit bus but the article seems clear that it is a 16-bit bus containing the operation word (IR register) and the other 16 bits are the complement of those. So maybe this isn't the 32-bit bus the address bus drives?

You seem to be confusing the parametrization PLA with the decoder PLAs. Maybe I can help clear this up.

The decoder PLAs (A1,A2,A3) each take a 16-bit opcode and output a 10-bit rom address, which references a 17-bit microword used for sequencing (FIG.8 in the patent) as well as a 68-bit nanoword used for parametrization, etc. FIG.10 in the patent shows how the microinstructions map out in rom, and FIG.21 shows the mapping by opcode for all 3 decoders.

For the test mode, I agree that the description appears to imply the opcode is to be input on the data bus. As for this part:

- each third of the action part of a microinstruction can be connected to the parametrisation PLA lines (23 lines of the 32 are used, the others are forced to zero). The appropriate third is chosen by the value of pins IPL1, IPL2.

That is, the 68-bit nanoword is split into three parts: 23 bits, 23 bits, and 22 bits. The IPL pins select which part is put on the parametrization PLA lines to be connected to the address bus. BR and BGACK replace the branch control bits (C0,C1). DTACK and IPL0 select which decoder PLA to use (A1,A2,A3) or else leave it to default (i.e., whatever the microword says). Everything else is a mystery to me.
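For anyone scripting against a test rig, the three-way split could be modelled along these lines. This is only a sketch: the field order (most significant first) and the way `selector` maps onto the IPL1/IPL2 pin levels are assumptions, since neither is spelled out above.

```python
NANOWORD_BITS = 68
PART_WIDTHS = (23, 23, 22)    # widths given above; they sum to 68


def split_nanoword(word):
    """Split a 68-bit value into three fields, most significant field first.

    The bit order is an assumption made for illustration only.
    """
    if not 0 <= word < (1 << NANOWORD_BITS):
        raise ValueError("nanoword must fit in 68 bits")
    parts = []
    remaining = NANOWORD_BITS
    for width in PART_WIDTHS:
        remaining -= width
        parts.append((word >> remaining) & ((1 << width) - 1))
    return parts


def select_third(word, selector):
    """Pick one field; 'selector' (0-2) stands in for the IPL1/IPL2 value."""
    return split_nanoword(word)[selector]
```

Reading a full nanoword would then take three passes, one per selector value, with the parts reassembled afterwards.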

I must say, I'm not sure that I really care to know the exact nanoword encoding of every single instruction, but I'm definitely intrigued. I wish you good luck in figuring out this test mode.

mikej · Post by **mikej** » Sat Dec 15, 2012 10:40 pm

There are more details in the document, mail me and I can send you a copy.
The data and control blocks are described a little more clearly than the patents.
/Mike

galibert · Post by **galibert** » Thu Jan 31, 2013 1:11 pm

Mask of Destiny wrote:If we want the final nano-rom, it looks like the die shots are the only way to go. Hopefully the mapping of instruction words to micro-word addresses didn't change too much. If it's mostly the same we can use the patents as an aid in determining the meaning of various nano-word bits.

Actually it has changed a lot. But it's not that much of a problem because the A1/A2-A3 PLAs tell you a lot of the new values. I have a mostly validated dump for those who need one.

The real problem is that the no-metal image is insufficiently cleaned up to be able to type up the microcode/nanocode array correctly enough. For instance for the instructions that point to the mmrw3 code:

Code:

087 cpdw1 16000 i dbi      mmrw3                             ||| 11480 i dbi      mmrw3                          
12d tsrw1 161c0 i dbi      mawl1                             ||| 11480 i dbi      mmrw3                          
158 mrgm1 16004 i a1       mmrw3                             ||| 11480 i dbi      mmrw3                          
215 btsm1 16000 i dbi      mmrw3                             ||| 11480 i dbi      mmrw3                          
23b rlql1 16004 i a1       mmrw3                             ||| 11480 i dbi      mmrw3                          
29b mrgw1 16120 i dbi      <348>                             ||| 11480 i dbi      mmrw3                          
363 b     16000 i dbi      mmrw3                             ||| 11480 i dbi      mmrw3                          
3c3 tsmw1 16020 i dbi      <340>                             ||| 11480 i dbi      mmrw3

(left = die, right = patent)

One can see that there are about 8 bits wrong out of 136, or around 6% error rate (the 11480->16000 change is due to the change of address of mmrw3 from 026 to 300). Or in other words almost 2000 bits wrong out of the 32096 of the array, which is way more than what I can handle, the redundancy not being high.
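The figures above can be reproduced from the listing with a few lines, assuming the first numeric column is a 17-bit microword in hexadecimal and that every row should read 16000 once the mmrw3 address change is accounted for. Both are readings of the listing rather than stated facts, and the variable names are arbitrary.

```python
EXPECTED = 0x16000           # patent value 11480 after mmrw3 moved, per the post
DIE_WORDS = [0x16000, 0x161C0, 0x16004, 0x16000,
             0x16004, 0x16120, 0x16000, 0x16020]    # left column of the listing
WORD_BITS = 17               # assumed microword width (8 words * 17 = 136 bits)
ARRAY_BITS = 32096           # array size quoted in the post

# Count differing bits between each word read from the die and the expected word.
wrong = sum(bin(word ^ EXPECTED).count("1") for word in DIE_WORDS)
total = WORD_BITS * len(DIE_WORDS)
rate = wrong / total
# Scale the sample error rate up to the whole array (integer arithmetic).
estimate = wrong * ARRAY_BITS // total

print(f"{wrong} of {total} bits differ ({rate:.1%}); about {estimate} bits over the array")
```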

So anyone who wants to do the microcode-level emulation game is going to need a better delayering. Dunno if the visual6502 guys have the means and/or the will to provide that, or if I'll have to find someone else to do it.

I'm OK with sharing what I've done if someone provides some collaborative space to upload the stuff. I'd rather not be the only one writing on it though.

OG.

Nemesis · Post by **Nemesis** » Thu Jan 31, 2013 11:29 pm

I can't do much on this right now, I'm currently working towards the first release of my emulator at the end of March, and after that I've got to get the MegaLD dumping for the LaserActive into full swing, I've got a lot of people waiting on me for that. After that though, I'd love to take another look at this.

It's impressive that you've managed to decode so much of the data! It sounds like the quality of the current images is an issue though. I'm strongly considering picking myself up a few gold cap M68000s and taking a shot at this myself. With a gold cap chip, there's no need to eat away the surrounding package, you can pop the cap with a few basic tools. After that, it's a matter of cleanly removing the metal layer, and having good enough optics to capture a good image.

I'm willing to give it a go. If it all works out, I'll be able to provide a second set of images, this time focusing specifically on the microcode/nanocode arrays, with higher resolution. You could extract the ROM data from my images and cross-check it with the Visual6502 images, which should help to identify any errors in the decoded ROM. I'll also have the chip on-hand, so I can pop it under the microscope and check questionable sections as needed. If this experiment works too, I could then attempt to go on to the more difficult task of decapping other ICs that don't have gold caps.