I'm officially building a microcode-level 68000 core

Ask anything your want about Megadrive/Genesis programming.

Moderator: BigEvilCorporation

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Re: I'm officially building a microcode-level 68000 core

Post by Nemesis » Mon Jan 29, 2024 11:17 am

Mega bump to say that 5 years later, I've picked up Exodus again and I'm currently looking at this very issue once more. In the next few days I'll be writing up a post describing the opcode decoding steps taken by the processor, and a little "microcode 101" description that explains how to interpret the info in the patents, so you can work out the individual steps taken for a given instruction, the order the steps occur in, and the timing of them. I'll then (possibly in a following post), be cross-referencing this with what we know from yacht and the decap analysis to translate what we know from the patents, and apply it to the real physical M68000, allowing us to take into account changes in the final product from the patents, and correct any mistakes in there. A lot of this will be referring to the great work done by others in this area before and after I originally created this thread.

In parallel I'll be building an emulation core to actually do all this, which will help me verify everything along the way. I'm also going to build test roms for as much as I can.

Anyone who's still around here, feel free to just drop in this thread and say hi. I know this place is pretty quiet these days, but even if I'm talking to myself most of the time, I'll make an effort to document my findings and research here as I go. No better way to learn something than to try and explain it to others.

Michel Gerritse
Interested
Posts: 12
Joined: Sat Apr 25, 2020 7:25 pm
Location: The Netherlands

Re: I'm officially building a microcode-level 68000 core

Post by Michel Gerritse » Mon Jan 29, 2024 10:44 pm

I'm looking forward to your technical posts. While I have no interest in writing a Megadrive emulator, a M68000 emulator is still on my todo list for other reasons.

AmateurSegaDev
Interested
Posts: 24
Joined: Sun Feb 27, 2022 3:27 am

Re: I'm officially building a microcode-level 68000 core

Post by AmateurSegaDev » Tue Jan 30, 2024 2:35 am

Welcome back!

Yes, it is quite here, but rest assured your posts are being read. : :mrgreen:

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Re: I'm officially building a microcode-level 68000 core

Post by Nemesis » Tue Jan 30, 2024 10:59 am

Prefetch and initial state
Before you can properly understand the steps an instruction takes during execution, it's important to understand the state the processor is in before the instruction starts executing. Before each "normal" instruction (non-exception) begins executing, the following state is assumed:
  • IR and IRD both contain the first 16-bit instruction word (macroinstruction word) of the opcode about to be executed (IE, 0x4E71 for NOP)
  • IRC contains the 16-bit value that immediately followed that 16-bit instruction word in memory (IE, possibly an extension word or a following macroinstruction word)
  • PC contains the address the 16-bit value in IRC was fetched from (address of opcode about to be executed + 2)
  • The "au" unit contains the address stored in PC + 2. (address of opcode about to be executed + 4)
This means the processor is "primed" to perform either another prefetch, or to retrieve a 16-bit extension word for the new opcode if required.

A word about prefetch. Internally the 68000 has three distinct "instruction registers": IRC (Instruction Register Capture), IR (Instruction Register), and IRD (Instruction Register Delay). These three registers form a chain. Data can only be loaded into IRD from IR. Data can only be loaded into IR from IRC. IRD contains the first 16-bit instruction word of the opcode currently being executed. It cannot be read, and is directly tied to many control lines in the processor that affect execution, in particular the ALU. If a bus or address error occurs, the IRD value gets written to the exception stack frame as the "instruction register" field. Loading the value of the IR register into IRD is one of the essential steps performed at the end of each instruction. Since IRD needs to contain a valid instruction, it follows that at some point during instruction execution, a valid instruction word must also be loaded into IRC, and advanced into IR, so that it is ready at the end of the instruction to load into IRD.

Macroinstructions, microinstructions, and decoding steps
First, a quick screenshot from the annotated schematic:
Image
When an opcode (macroinstruction) is loaded into the IR register (NOT IRD as you might assume), one or more "microcode addresses" are decoded from the opcode. These are called A1, A2, and A3 respectively. Don't confuse these with the general 68000 address registers, they have nothing to do with it. These are (effectively) addresses of microcode routines which are used by the opcode (macroinstruction). All opcodes generate an A1 address, which is why it's a fairly large PLA on the die surface. Every opcode needs to be mapped to an A1 address. In many cases, different opcodes, or different forms of the same opcode, map to the same A1 address. A2 and A3 are optional. Opcodes may generate an address in either, both, or neither of the A2/A3 addresses. The A1 address is the "entry point" of the opcode. The address of the microinstruction generated by the A1 decoder defines the first microinstruction that will be executed for a given opcode.

There are technically two other addresses generated, A0 and A0S. A0 is used for group 1/2 exceptions, and A0S is used for group 0 exceptions, as detailed in the 68000 User's Manual. Simply put, when a group 1/2 exception (such as an interrupt) is being executed, A0 "overrides" A1, so instead of jumping to A1 at the start of the next opcode, the processor jumps to the start of the exception processing instruction instead. A0S is used for group 0 exceptions that are taken immediately, such as reset, bus error, and address error. In this case, A0S immediately overrides the next microinstruction address directly, so whatever was going to occur next and wherever the current opcode is during execution, the processor jumps to the start of the exception processing instruction instead.

In the patents, we're given a list of how each opcode maps to A1/A2/A3 addresses in Figure 21A-21U. Here's a single sheet from that section:
Image
Each square represents a given 16-bit opcode. The patterns matching the lowest 6 bits select the column, and are shown along the bottom of the table. The patterns matching the next 6 bits select the row, and are shown along the right side of the table. The upper 4 bits (the line) are "0000" for this sheet, as shown at the top of the page. For each opcode (even illegal instructions), you'll see there's an A1 microcode instruction name listed. A2 and and A3 are there for some opcodes and not others. By looking at appendix H, which I've previously transcribed: viewtopic.php?p=35470#p35470, we can find the 10-bit physical microcode addresses which correspond to these labels. Instructions have moved in the final physical processor, but we can decode the PLA structures from the delayered physical die photos in the "A1 decode" and "A2/A3 decode" sections I pointed out on the schematic above, which is what Oliver Galibert has done here:
https://og.kervella.org/m68k/a1-decode.txt
https://og.kervella.org/m68k/a23-decode.txt

Microcode
Here it is, how an opcode advances through its stages of execution. To understand this, here's a detailed "key" to understand the microword listing from Appendix H in the patent:
Image
With this key on hand, we can now decode the microword instructions listed from Appendix H, which I've previously transcribed: viewtopic.php?p=35470#p35470
I mistakenly labelled this as "Appendix F" when I posted transcriptions before. I've updated the naming in the post and for the files now. If we look at the flattened list in this file:
http://nemesis.exodusemulator.com/M6800 ... 20List.txt
We can now decode the microinstructions. We'll ignore the part labelled "nanoword content" for now. I'll cover that in a following post. For now, let's look at a simple instruction - NOP. To start with, we go to Fig21 and find the sheet with the NOP instruction. NOP has the opcode 4E71, so we find the sheets which are for line 0100 (4 in binary), which are Fig21G/H/I, and we'll find NOP on Fig21I here:
Image
This sheet tells us NOP has a single entry point, microinstruction "B" in A1. The note in parenthesis "(NOP)" below it is just adding information, it's not something in line A2. If we now go to that flat list, we find microinstruction B, and see it then branches to "mmrw3", the content for which are as follows:

Code: Select all

 ____________________________________ 
|                      <    |        |
| au --> db --> aob,au,pc   | irix   | mmrw3
| +2 --> au                 |--------|
|                           | dbi    |
|                           |--------|
|                           | x      |
|                           |--------|
|                           | dxdy   |
|___________________________|________|
|      213  |   b        |   b       |
\------------------------------------/

 ____________________________________ 
|                      >    |        |
| edb --> dbin,irc          | frix   |
| (ir) --> ird              |--------|
|                           | a1     |
|                           |--------|
|                           | x      |
|                           |--------|
|                           |        |
|___________________________|________|
|       26  |   mmrw3    |   mmrw3   |
\------------------------------------/
NOP typically takes 4 external clock cycles to execute, and is made up of two microinstructions - "b" and "mmrw3", which are shown sequentially above. We're going to ignore the "nanoword content" section for now. Let's start with the "b" microinstruction. Its access label is "irix", which indicates that when this microinstruction executes, the processor is to initiate a read operation of an immediate word or instruction word (IE, something that'll be going into the prefetch pipeline). The "next micro rom address" field is "dbi", which indicates that the instruction will be performing a direct branch to another microcode address, mmrw3 in this case, and that IRC (previously prefetched instruction word, not the one we're about to read) should be transferred to IR after this instruction completes. The next microinstruction is "mmrw3". Its access label is "frix", which indicates the processor must complete the current read bus operation. Its "next micro rom address" field is "a1". What does this mean in this case? Well since the last microinstruction advanced with "dbi", which caused IRC to move into IR, the instruction being decoded is now the instruction which follows this "NOP" opcode, so "a1" in this case actually means the "a1" address of the next opcode to execute. All normal instruction sequences will end with an a1 branch, instructing the processor to execute the instruction currently loaded into the IR register, which is the next instruction to execute.

A note about timing - A single microinstruction takes 2 external clock cycles to execute (one "microcycle"). The exception is microinstructions that need to "complete" a bus operation (read or write). A bus operation normally takes 4 external clock cycles in total, unless "wait cycles" are inserted (more on that later). What that means is that a bus operation can be initiated in 2 clock cycles, but waiting for a bus operation to complete can be delayed indefinitely if it needs to complete a bus cycle. If a microinstruction both starts a bus operation and waits for it to complete, it will take a minimum of 4 cycles.

I'll leave things there for now. More soon on interpreting the "nanoword content" section, and how specific operations are performed by a given opcode.

EDIT: Added a bit more detail.
Last edited by Nemesis on Thu Feb 01, 2024 8:31 am, edited 1 time in total.

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Re: I'm officially building a microcode-level 68000 core

Post by Nemesis » Thu Feb 01, 2024 8:30 am

The Execution Unit
So far we've covered how to determine the sequence of microinstructions that are executed for a given macroinstruction (opcode). Now we'll dive into what those individual microinstructions can do. The most critical component for this is the Execution Unit. There's an entire separate patent on this that describes the execution unit in some detail, it's US4296469. (https://nemesis.exodusemulator.com/M680 ... 296469.pdf). Here's figure 2 from that patent, which is also included in US4325121:
Image

And here's a stitched together version of figure 3 which shows some more detail:
Image

As we can see, the execution unit is broken into three separate parts, which are called from left to right "High Section", "Low Section", and "Data Section" respectively. The actual physical latches/flip-flops which store the register data are embedded within these sections, which combined together make up the "register file". Within each of the three sections of the execution unit there are separate arithmetic units, called AUH (Arithmetic Unit High), AUL (Arithmetic Unit Low), and ALU (Arithmetic Logic Unit) respectively. These units are capable of operating in parallel, and each have different capabilities. Finally, there are two main buses which connect everything together, called "Address Bus" and "Data Bus" respectively. These buses can be linked together or broken into three parts each at the boundaries between the sections. Almost all registers are dual-ported, meaning they can be connected to either the data bus or the address bus in their sections. This arrangement gives up to a total of 6 independent buses - Address bus high/low/data, and data bus high/low/data, and theoretically transfers can be occurring between registers/units on all 6 buses simultaneously.

Within the register file, there are a total of 21 32-bit registers. The arithmetic units are only 16-bits wide however, and all 32-bit registers are broken into 16-bit components and distributed between the sections. The high 16-bit parts of all 32-bit registers are contained within the "High" section. The low 16-bit components of "address" registers are contained in the "Low" section, while the low 16-bit components of "data" registers are contained in the "Data" section. Here's a key for the abbreviations given in the patent images above:

Code: Select all

AOB(H/L) - Address Output Buffer, a 32-bit latched register which stores the address value to output over the external address bus. Shown outside the execution unit as AOBH/AOBL above, but actually physically embedded within it like other registers. ["Memory address register" in Gigasoft analysis]
AT(H/L) = Address Temporary register ["immediate register" in Gigasoft analysis]
DT(H/L) = Data Temporary register ["Register B" in Gigasoft analysis]
PC(H/L) = Program Counter register
RF(H/L)' = SSP
RF(H/L) = USP
RE(H/L) = A6
RD(H/L) = A5
RC(H/L) = A4
RB(H/L) = A3
RA(H/L) = A2
R9(H/L) = A1
R8(H/L) = A0
R7(H/blank) = D7
R6(H/blank) = D6
R5(H/blank) = D5
R4(H/blank) = D4
R3(H/blank) = D3
R2(H/blank) = D2
R1(H/blank) = D1
R0(H/blank) = D0
where (H = Upper 16 bits (high), L or blank = Lower 16 bits (low))
This gives us a total of 21 32-bit registers in the main "register file". In addition, there are 10 16-bit registers within the register file:

Code: Select all

AUH Output - A latched (registered) 16-bit output from AUH (read-only, only AUH can modify) ["Register A" high 16 bits in Gigasoft analysis]
AUL Output - A latched (registered) 16-bit output from AUL (read-only, only AUL can modify) ["Register A" low 16 bits in Gigasoft analysis]
ALU Output - A latched (registered) 16-bit output from ALU (read-only, only ALU can modify)  ["Register C" in Gigasoft analysis]
ALUB - ALU Buffer register - A 16-bit input register to ALU (write-only, only ALU can read) ["Register E" in Gigasoft analysis]
ALUE - ALU Extension register - A 16-bit register that can be generally read and written, but also acts as a special shift register for the ALU. ["Register G" in Gigasoft analysis]
DCR - DeCodeR, A latched 4-bit to 16-bit decoder register, takes a 4-bit input and outputs "1" on exactly one of the 16-bit data lines according to the value (IE, 0000 -> 0000000000000001, 0001 -> 0000000000000010, 0010 -> 0000000000000011, etc) ["Bit selection register (BIT#)" for input, "Selected bit (BIT)" for output on Gigasoft analysis]
DOB - Data Output Buffer, a 16-bit latched register which stores the value to output over the external data bus. ["Output Register" in Gigasoft analysis]
DBIN - Data Bus Input Buffer, a 16-bit latched register which stores the last data value read from the external data bus. ["Register D" in Gigasoft analysis]
FTU - Field Translation Unit, a 16-bit latched register which serves as an I/O buffer for reading/writing content stored in other regions of the processor, such as the status register, writing a trap vector to trigger an exception (eg, divide by zero), or extracting bits from the IRD register.
PREN - PRiority ENcoder, a 16-bit register which takes a MOVEM register bitmask as an input, and outputs a 4-bit register number representing the internal register number of the next register with a bit set. [[MOVEM register list (List) / next bit in register list in Gigasoft analysis]
In addition to these, there are the following registers physically and logically outside the register file, which also store data:

Code: Select all

IRC/IR/IRD - As described in the last post, these 16-bit registers make up the instruction prefetch and decode pipeline. IRC and IR are physically attached to the bottom of the execution unit, but aren't really within it. IRD is located elsewhere.
SR - Status Register, as described in the 68000 User's Manual. Contains condition code flags, interrupt mask, supervisor state, and trace mode flags. This isn't a complete 16-bit register, it's stored as a series of latched bits in different regions of the processor.
INL - 3-bit pending interrupt level
TPEND - Bit flag indicating trace pending when the current macroinstruction completes
So the execution unit contains a number of physical register storage buffers, some arithmetic units (AUH, AUL, ALU), and some special purpose units (FTU, DCR, PREN). Additionally, there is a small table of short constant values (IE, 0, +1, -2, +8, etc) which can be transferred directly into the various arithmetic units as inputs. The only thing the execution unit itself does is perform transfers to and from these various units/registers. With this understanding of the basic contents of the execution unit, we can now move onto looking at nanocode.

Nanocode
Nanocode is really like a big set of control lines telling the processor (and the execution unit in particular), what to do during a given microinstruction. When we're at this level, it's important to make a distinction between the "logical" view of the processor, IE, some abstract concept that can take a series of instructions and perform them, and the "physical" view of the processor, meaning an actual concrete machine made out of transistors. It is important to understand that the microcode formats in the patents describe processing steps for a "logical" processor, IE, they're the kind of thing they could write before they'd even built the physical circuits, they're the "spec" the circuits have to meet. As a result, some things are not specified, such as the exact order/timing things occur in, or in some cases even what buses are used to perform transfers within the execution unit.

More on that in a little bit. For now, let's have a look at some nanocode. We're going to look at "NOP" again:

Code: Select all

 ____________________________________ 
|                      <    |        |
| au --> db --> aob,au,pc   | irix   | mmrw3
| +2 --> au                 |--------|
|                           | dbi    |
|                           |--------|
|                           | x      |
|                           |--------|
|                           | dxdy   |
|___________________________|________|
|      213  |   b        |   b       |
\------------------------------------/

 ____________________________________ 
|                      >    |        |
| edb --> dbin,irc          | frix   |
| (ir) --> ird              |--------|
|                           | a1     |
|                           |--------|
|                           | x      |
|                           |--------|
|                           |        |
|___________________________|________|
|       26  |   mmrw3    |   mmrw3   |
\------------------------------------/
Each of these microinstructions have some nanocode shown. The nanocode uses many abbreviations. To understand what they mean, we turn to "Appendix B" in the patent, which transcribed is this:

Code: Select all

abbreviation     | meaning
------------------------------------------------------------------------------------------
rx               | register (data or address) designated by Rx field in macroinstruction
  rxa            | address register designated by Rx field in macroinstruciton
  rxd            | data register designated by Rx field in macroinstruction
  rxh            | upper half (16 most significant bits) of register (data or address) designated by Rx field in macroinstruction
  rxl            | lower half (16 least significant bits) of register (data or address) designated by Rx field in macroinstruction
  rxdl           | lower half (16 least significant bits) of data register designated by Rx field in macroinstruction
ry               | register (data or address) designated by Ry field in macroinstruction
  rya            | address register designated by Ry field in macroinstruciton
  ryd            | data register designated by Ry field in macroinstruction
  ryh            | upper half (16 most significant bits) of register (data or address) designated by Ry field in macroinstruction
  ryl            | lower half (16 least significant bits) of register (data or address) designated by Ry field in macroinstruction
  rydl           | lower half (16 least significant bits) of data register designated by Ry field in macroinstruction
rz               | register (data or address) designated by 4-bit field of second word of macro-instructions using indexed addressing for specifying register to be used as the index
  rzl            | lower half (16 least significant bits) of register described immediately above
db               | DATA BUS (including high, low and data sections)
  dbh            | DATA BUS (high section only)
  dbl            | DATA BUS (low section only)
  dbd            | DATA BUS (data section only)
  db*            | DATA BUS (at least data section)
dbe              | sign extend sign bit onto high section of DATA BUS
edb              | external data bus
dbin             | data bus input buffer (including a latch) coupled to external data bus
  dbinh          | upper byte (8 most significant bits) of data bus input buffer
  dbinl          | lower byte (8 least significant bits) of data bus input buffer
dob              | data bus output buffer coupled to external data bus
  dobh           | upper byte (8 most significant bits) of data bus output buffer
  dobl           | lower byte (8 least significant bits) of data bus output buffer
ab               | ADDRESS BUS (including high, low and data sections)
  abh            | ADDRESS BUS (high section only)
  abl            | ADDRESS BUS (low section only)
  abd            | ADDRESS BUS (data section only)
  ab*            | ADDRESS BUS (at least data section)
abe              | sign extend sign bit onto high section of ADDRESS BUS
aob              | address bus output buffer coupled to external address bus
*                | ADDRESS BUS (high, low and data sections) or alternatively DATA BUS (high, low and data sections)
*e               | sign extend sign bit onto high section of ADDRESS BUS or alternatively onto high section of DATA BUS
psw              | program status word which stores condition codes, interrupt level, trace mode bit, supervisor mode bit
psws             | supervisor mode bit in the program status word
ssw              | special word which monitors status of current microinstruction; accessed in event of address error or bus error to aid processor in recovery from error
at               | temporary address register
  ath            | upper half (16 most significant bits) of temporary address register
  atl            | lower half (16 least significant bits) of temporary address register
sp               | user or supervisor stack pointer
  sph            | upper half (16 most significant bits) of user or supervisor stack pointer
  spl            | lower half (16 least significant bits) of user or supervisor stack pointer
pc               | program counter register
  pch            | upper half (16 most significant bits) of program counter register
  pcl            | lower half (16 least significant bits) of program counter register
dcr              | decoder in data section of execution unit which is used for bit manipulation
'reset pren'     | used during instruction which specifies access to multiple registers in order to advance encoder to the address of the next register to be accessed
ftu              | field translation unit
'idle wait'      | no transfers occur during this microcycle
tpend            | a one-bit latch which indicates whether the current macroinstruction should implement a trace upon completion ot the macroinstruction
inl              | latch which stores the interrupt level of the interrupting device upon recognition of an interrupt for subsequent transfer into program status word
trap             | stores vector which can be supplied to field translate unit for addressing a trap routine in event of trap condition (e.g. divide-dy-zero)
corf             | correction factor for decimal arithmetic which can be provided to ALU
'sr c-alu-alue'  | shift right used in multiply operation; carry bit coupled to msb of ALU; lsb of ALU coupled to msb of ALUE
This list is incomplete however. To it, I'm going to add a set of additional entries and list their meanings, all of which have already been introduced/discussed above:

Code: Select all

abbreviation     | meaning
------------------------------------------------------------------------------------------
au               | AUH (Arithmetic Unit High) and AUL (Arithmetic Unit Low) working together on a 32-bit value
alu              | ALU (Arithmetic Logic Unit) working on a 16-bit value
irc              | Instruction Register Capture register, top of prefetch pipeline.
ir               | Instruction Register, next macroinstruction to execute.
ird              | Instruction Register Delay register, current macroinstruction being executed.
alub             | ALU Buffer register
alue             | ALU Extension register
Now that we have our handy list of what each of these abbreviations mean, we'll now look at the format entries in the nanocode listing take. The nanocode is listed as a series of sequential lines. From our "b" microinstruction, the A1 address for NOP, the nanocode is as follows:

Code: Select all

au --> db --> aob,au,pc
+2 --> au
What this nanocode is doing is listing two separate data transfer operations that are performed by the execution unit during the two clock cycles this microinstruction runs for. All nanocode will be a list like this, with each line in one of the three following forms:

Code: Select all

<source> --> <targetList>:  Transfer data from <source> into <targetList>
<source> --> <bus> --> <targetList>: Transfer data from <source> into <targetList> via <bus>
<keyword>: Special keyword such as 'reset pren' or 'idle wait', defined above.
where <targetList> is a comma separated list of one or more targets, such as "aob,au,sp".
Note that as shown here, a single source can potentially be written to multiple registers/units in one step. These aren't separate operations, when data is being moved over a bus, that's a set of physical lines which have the source value loaded. If more than one target is set to be updated, they will all receive the same value at the same time. What about subsequent lines however? When we have two transfer operations as shown above, what order are they performed in? That comes down to a question of the "logical" vs the "physical" processor I mentioned before. The nanocode is describing a "logical", or "theoretical" processor. From that perspective, the list of nanocode instructions are "sequentially consistent", or in other words, if you ran the operations one at a time in the order listed, that sequence describes the expected outcome when the same buses or registers are used more than once. The "physical" processor won't do them one at a time of course. As discussed, the execution unit can potentially perform six separate transfers "at once", but at the level we're now at, the implementation gets "creative" with timing in order to do as much in parallel as possible and squeeze everything into two clock cycles. Internally, there's all kinds of special "timing hacks" that are performed to sequence things "just right" so that it all works out in the right order, as described by the nanocode. The patents go into more detail in some areas, but suffice to say unless you're trying to fully replicate/understand the processor at the transistor level, it simply doesn't matter at this point. For my purposes, it suffices me to say that timing signals regulate the sequence of operations such that the execution unit produces a result consistent with the nanocode instructions being sequentially executed, but it does so within the microinstruction timing previously described (2 external clock cycles per microinstruction, unless an external bus transfer is being completed).

With this understanding of names, syntax and sequence, I'll now quickly reiterate the "preconditions" I listed in my last post that are assumed before each macroinstruction starts executing:
  • IR and IRD both contain the first 16-bit instruction word (macroinstruction word) of the opcode about to be executed (IE, 0x4E71 for NOP)
  • IRC contains the 16-bit value that immediately followed that 16-bit instruction word in memory (IE, possibly an extension word or a following macroinstruction word)
  • PC contains the address the 16-bit value in IRC was fetched from (address of opcode about to be executed + 2)
  • The "au" unit contains the address stored in PC + 2. (address of opcode about to be executed + 4)
With all this, we should now be able to understand the nanocode in these NOP microinstructions. Let's start with "b"

Code: Select all

// Transfer the 32-bit AUH/AUL contents to aob (address bus output buffer) and pc (program counter) via db (data bus high/low/data)
// Additionally, feed the same value back into AUH/AUL as one input.
au --> db --> aob,au,pc

// Feed the constant +2 into AUH/AUL as another input
+2 --> au
So this microinstruction has advanced PC by 2, put that address on the external data bus, and fed it back into the AU to have 2 added to it again. Note that the actual addition operation doesn't occur as part of this code, all we did was retrieve the output in the AU unit (AUH/AUL) from a previous operation, and provide inputs for a following operation. We'll look at how these operations are controlled soon as part of the "ALU function" section. For now suffice to say it'll happen. Let's look at the read operation for prefetch now though. If we go back to the microword definition, the "access label" for the "b" microinstruction was irix. Our "key" chart tells us that means that instruction initiated a read operation for an immediate instruction parameter or a new instruction(x). This will cause the processor to start a bus read operation at the address we placed in aob, with the "Function Code" lines set appropriately ("program" read in this case, not "data"). This microinstriction also had "next micro rom address" set to "dbi", which means after this instruction completes, IRC will be advanced to IR, so the A1/A2/A3 pointers will now reference the following opcode.

Now let's look at the following "mmrw3" microinstruction:

Code: Select all

// Transfer the contents of edb (external data bus) into dbin (data bus input buffer) and irc (instruction register capture).
// This is pulling in our prefetched 16-bit value which follows the next instruction after this one. It may or may not be a macroinstruction.
edb --> dbin,irc

// Load the next instruction being decoded into IRD to make it the "current" instruction
(ir) --> ird
After this instruction completes, its "next micro rom address" field is set to a1, so it will branch to the entry point for the next instruction we just moved into the IRD register. So all in all, what we've done is move IRC into IR, prefetch another word into IRC, and move IR into IRD and jump to the start of the next macroinstruction, which is exactly what we expect NOP to do. We've also updated "au" to maintain the correct preconditions for the next macroinstruction.

Microinstructions and bus timing
The NOP instruction we've been examining takes 4 clock cycles to perform, if there are no delays performing the read. A single bus read operation also takes 4 clock cycles if there are no delays. Let's look at how the NOP macroinstruction maps to bus operations. Here's a modified image ripped from the 68000 User's Manual:
Image

Here we see an example of a NOP instruction executing, a crossed out section where "something else" happens in the middle, then an example of a NOP instruction executing where there's a delay in the read occurring, due to external device latency responding to the read request. As you can see, the bus operation starts immediately as the "b" microinstruction begins executing. This can occur, because the microinstruction itself tells the processor that it will perform a read, and the kind of read (program/data) being performed, which is sufficient to begin signaling the read operation with the correct function code. You see this happening on the diagram above within the "S0" (State 0) period. Within "S1", the address lines are asserted. This is because the nanocode has been able to perform the "au --> db --> aob,au,pc" transfer by this point. "S2" then begins, and the processor drives the Address Strobe (AS) to signal the read request to external devices. Nothing changes during "S3", and now we've reached the end of the "b" microinstruction. Here's the important point - the CPU is now waiting for the read to complete, and we're "between microinstructions". At this point, the processor waits for the bus operation to complete, IE, by DTACK or BERR being asserted. Until that occurs, the processor inserts full clock cycles, and just sits there, waiting, not beginning the new microinstruction. If DTACK is signaled at that magic edge of the clock cycle, the data is latched, and the "mmrw3" microinstruction will immediately begin. If BERR is signaled at the edge of the clock cycle, the group 0 bus error exception will be made pending, and rather than executing "mmrw3", the A0S microcode address will overwrite the target address and the bus error exception handler will trigger instead.

An address exception is similar. The "b" microinstruction executes as normal, however if an unaligned read is attempted triggering an address error, at the start of "S2" rather than asserting AS, the processor will flag a pending address error and set the A0S microcode address to the address errror exception handler, which will execute next rather than "mmrw3". Note that the external function code and address lines will still be asserted prior to the address exception being detected, but since AS isn't asserted this will be ignored by external devices.

That'll do for now. I'll write more about the ALU and more complex instructions later.

EDIT: Added info on address exceptions.

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Re: I'm officially building a microcode-level 68000 core

Post by Nemesis » Sat Feb 03, 2024 1:21 am

Math operations, flags, and the ALU
You should know there's another patent which covers the 68000 ALU specifically, and in some cases goes into more detail than US4325121. That patent is US4312034 here: https://nemesis.exodusemulator.com/M680 ... 312034.pdf

Before we talk about the ALU, it'll add some context to review the key design goals behind the microcoded approach used in the 68000 processor. These are actually laid out for us in the patents:
  • Reduce the time required to design an LSI data processor
  • Reduce the circuit complexity and simplify the layout of an LSI data processor
  • Provide an LSI data processor which provides an instruction set which may be easily modified or expanded
  • Provide a microprogrammed data processor adapted to execute a wide variety of macroinstructions while minimizing the size of the microprogram control store
These goals are sometimes in conflict, and a balance needs to be struck. This kind of compromise is most evident in the way the ALU is controlled in the 68000. So far we've talked about macroinstructions mapping down to microcode, with the execution unit controlled by nanocode instructions. The use of nanocode at all is really a space saving optimisation. It reduces the transistor count required to microcode the processor by allowing sets of microinstructions (a 17-bit entry) to share nanoinstructions (a 68-bit entry). Nanoinstructions themselves are space optimised, with those 68 bits driving around 180 control points in the processor, by using other PLAs to drive individual lines from different values in sets of bits (IE, using two bits with values 00, 01, 10, 11 to drive four different lines). The simplest and most flexible design would just wire all the control lines in the processor to individual bits in the microinstructions, and have 200+ bit microinstructions for the 544+ microwords in the 68000, but that would be a massive PLA, which wouldn't meet goal 4 stated above of "minimizing the size of the microprogram control store", which ultimately translates to physical die size, which translates to cost, and therefore profit.

In order to achieve the goal of keeping the programmable control store in check, for the 68000 they decided to do something significant - they decided to not use the either microwords or nanowords to determine what operations the ALU should be configured to perform on a given microinstruction. Instead, they go back to the IRD register, and based on the "macroword" that's currently loaded, they select a set of available operations. This allows operations which have a different ALU function, but otherwise identical execution steps (IE, ADD, SUB, AND, etc) to use the same microcode and nanocode. To achieve this, from the contents of the IRD register, a two-level PLA structure is used to resolve down to 44 lines (Figure 19A), then 15 lines (Figure 19B), which they call the "row selection lines". These row selection lines provide a row index into a table/PLA (Figure 20B), which selects the "set" of ALU operations which will be available for the current macroinstruction. Individual microinstructions are able to select the "column" index in this table, but all they can do is choose from the filtered set of operations available to them based on the macroword in the IRD register. The effective contents of this table are described in Figure 17:
Image

There are 15 rows in this table. Those are the 15 lines decoded from the macroword. They are "hard coded" for individual instructions based on the opcode. We then see 5 columns, with values 1-5. In reality, values 0-6 can be specified (which we'll talk about more soon). This information comes from the nanocode. There are three bits in the nanoinstruction (NIF0, NIF1, and NIF2) which set this column, so each microinstruction is capable of selecting/changing the active "column" in this table. That allows the enabled ALU function to be changed between each microinstruction, within the set of ALU functions available in the active "row" in this table.

Within each column, there's more than one set of information provided. First is the active ALU function. Second (if present) is configuration on how the CCR flags: C – Carry, V - oVerflow, Z - Zero, N - Negative, and X - eXtended, should be affected, called the condition code control information. If no information is provided here, the condition flags are unaffected. Here's the table from the patents describing how to interpret the condition code control information:
Image

Additionally, column 1 in the table has two sets of condition code control information. This is because the nanocode actually contains another two bits/lines (NCC0 and NCC1) which are called the "initiate" and "finish" bits respectively. In column 1, and column 1 only, this affects the CCR behaviour. If the "initiate" (NCC0) bit is set, the first set of condition code control information is used. If the "finish" (NCC1) bit is set, the second set of control code information is used. Columns 2-4 ignore these bits.

Going back to the information before the condition code control information, we mentioned it was the "active ALU" function. That needs some more explanation. For the detail on that, we refer to figure 16, which lists the ALU functions:
Image
Each of the labels in the first column matches an ALU function in the table shown in Figure 17. The second column shows us the kind of operation the 16-bit ALU will perform when that function is enabled. To understand the contents of this column, we need another key, which I'll provide here:

Code: Select all

a - ALU Input from "address" bus in Execution Unit
d - ALU Input from "data" bus in Execution Unit
r - ALU Output (result), stored in the ALU Output register
x - Arithmetic carry (X flag from CCR)
c - Standard carry (C flag from CCR)
' - Compliment of preceding source (Bitwise NOT)
^ - Bitwise AND
v - Bitwise OR
1 - The constant "1"
0 - The constant "0"
This should be enough to basically interpret the information in the second column. As you can see, the ALU itself can generate compliments (bitwise NOT) for its inputs, perform basic addition, logical AND and OR operations, byte sign extension, and shifts in several different modes. Other operations (subtraction, multiplication, division, etc) are built on top of these abilities. One interesting thing to note is the "lss" function, which shows either "0" or "1" being shifted in. How is this selected? It's a little bit of a hack. Going back to our table in Figure 17, there are 15 rows and 5 columns shown, but remember how I said that in reality values 0-6 could be selected? Value 6 is an "alias" for value 4, that is, if you select column "6", the values of column "4" will be used, with the single exception that the "lss" function will shift in a "1", rather than the normal "0". This feature is only used by the divide microcode. As for value "0", if that is set, there's an "ALURL" line signal that gets set, and stops the ALU from modifying its output register or affecting the CCR flags. Basically this means the ALU does nothing that microinstruction, otherwise it would trash register values, so effectively setting value "0" tells the ALU to "do nothing".

The third column is called "into cbit" and specifies the source of the "carry output" signal from the ALU. This column is relatively easy to understand. Here's a key to help:

Code: Select all

cm     - Carry signal generated from the most significant bit of the ALU
cm'    - Binary NOT of cm
msb    - Most significant bit of the ALU result
abd[0] - Bit 0 of from "address bus data" input value
0      - The constant "0"
All the rest of the columns here refer to physical control lines which run into the ALU. Basically this is telling you what lines are driven to tell the ALU to perform each of these tasks, which is beyond the point I'm going to dive into. In the 68000 itself, only the functions described in this table can be performed, which drive those control lines as described.

The last thing to tie this all together is to understand how we know from the microcode what ALU column is enabled (NIF0/1/2) and whether "initiate" (NCC0) or "finish" (NCC1) are set. To do that, let's look back at the key I gave earlier for interpreting the microcode listings:
Image
You'll see the "ALU function" section there I didn't explain earlier. It's time to describe that now. The "ALU function" section will contain 1, 2, or 3 characters.

The first character will be one of "x, 1, 2, 3, 4, 5, 6", which corresponds with column 1, 2, 3, 4, 5 in the lookup table shown in Figure 17 being activated for each of those numbers. "x" means "don't care" in the patents, which in practice means "0", or in other words, the ALU is idle. "6" as described before means select column "4" but shift "1" for the "lss" function rather than "0" as described before. This means the first character describes the states of the (NIF0/1/2) bits for the microinstruction (which come from the nanocode store).

The second character will either be "i", "f", "n", or it will be omitted (IE, there will only be one character in the ALU function section). "n" or omitted means the condition flags in the CCR register aren't changed by the ALU operation. "i" means this is an "initiate" operation, while "f" means this is a "finish" operation, as described when we detailed Figure 17 above. This means the second character describes the states of the (NCC0/1) bits for the microinstruction (which come from the nanocode store).

The third character if present will always be an "f", and this isn't really an instruction to the ALU at all, it's an instruction to the execution unit, saying that a byte transfer is being performed. As per the patent, in this case "only the low order 8 bits of the address bus in the DATA section are driven by the selected source such that only the low order 8 bits of the selected destination are changed while the upper 8 bits of the selected destination are not disturbed". Or in other words, transfer targets on the address bus in the data section of the register file will only have their lower 8 bits changed, even though the nanocode instruction listing doesn't indicate this by itself. This isn't part of the nanocode or the macroinstruction at all technically. From what I can see, this information is decoded directly from the IRD register, so it is actually a constant for a given macroinstruction across all microinstructions it uses, and can therefore also technically vary for one microinstruction depending on what macroinstruction it is being called as part of.

I'm going to leave this post there. Another time I'll provide some more detail the high/low arithmetic units, and conditional branching.

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Re: I'm officially building a microcode-level 68000 core

Post by Nemesis » Sat Feb 03, 2024 2:15 am

A quick note to say I've updated my 68000 annotated schematic here:
https://github.com/RogerSanders/M68000Schematic
I've corrected a lot of register names, and identified more latches and regions in the die. You can view the annotated schematic using the Schematic Explorer tool I put together an age ago here:
https://github.com/RogerSanders/SchematicExplorer

Nemesis
Very interested
Posts: 791
Joined: Wed Nov 07, 2007 1:09 am
Location: Sydney, Australia

Re: I'm officially building a microcode-level 68000 core

Post by Nemesis » Sun Feb 11, 2024 10:39 pm

Posting here to keep the updates flowing. Things have moved slower since my leave ended and I've started back at work, but it's still moving along. I've been able to fully decode the microcode in the production version of the 68000, and generate a listing for it in the same form as Appendix H in the patents. It's actually a very efficient encoding they use. When you break it down, the microcode listing format they use provides all the information in the microcode binary encoding, with the exception of some changes to the trace and tpend flags, which you could make a reasonable guess about anyway. It'd actually be possible to reconstruct the microcode/nanocode binary content of the pre-release 68000 processor described in the patents from the patent information alone.

In my next update I'll show a full side by side comparison of the patent microcode and the final microcode, as well as describing conditional branching and anything else that's left, with a few corrections to things I've said before. More schematic updates to push as well. After that, everything needed to understand the inner workings of the 68000 should basically be here.

Post Reply