I've been bashing my head against the operator unit for the last couple of days, and here are some results. Some of these things were figured out from the YM2203 die, just because it's easier to read, but I've been confirming the critical elements against the YM2612.
1. The YM2612 uses dynamic memory (DRAM) throughout. This means it can't operate below a minimum clock frequency, because data is stored as charge on small capacitors, and that charge slowly leaks away. My guess is that the chip would start to become unstable at about 1/4 to 1/10 of its rated clock speed, but this is not confirmed.
2. The operator unit, and probably the whole chip, makes heavy use of pipelining. In the operator unit I've counted five pipeline registers, plus the circular shift register array which probably counts as a sixth one:
a) At the adder between the 10-bit FM value and the 10-bit PG value
b) At the adder between the 11-bit logsin output and the 10-bit attenuation value
c) At the 10-bit mantissa value, immediately before the exponent shifter
d) At the two's complement unit (this is where the output is taken from)
e) Circular shift register array (stores outputs of previously computed operators)
f) At the unit which adds two previously-computed operator values to make the new FM value
This has three direct results:
a) Figuring out the timing of any control unit will be a pain in the ass.
b) Exactly what bits go where is easier to see. For instance, bit 9 of the FM+PG value goes through two sequential shift register cells before going to the two's complement unit, which is correct--it has to be delayed along with the data passing through registers b and c above. Also, the top four bits of the 12-bit logsin+attenuation value, which don't go through the exponential table, each go through a single shift register cell before they go to a decoder and then the shifter unit; they have to be delayed to match the rest of the data waiting in register c.
c) Anyone who's trying to implement this chip in VHDL will have massive timing/complexity issues unless they get this right.
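To make the delay matching in (b) concrete, here is a minimal Python sketch. It only models the timing, not the real datapath: `DelayLine` and `run` are made-up names, the 9-bit "magnitude" is a stand-in for whatever actually travels through registers b and c, and the logsin/attenuation math is left out entirely.

```python
from collections import deque

class DelayLine:
    """Chain of n clocked register cells; a value emerges n ticks after it enters."""
    def __init__(self, n, fill=0):
        self.cells = deque([fill] * n, maxlen=n)

    def tick(self, value):
        out = self.cells[0]        # oldest value leaves the chain
        self.cells.append(value)   # newest enters; maxlen drops the oldest
        return out

def run(phases):
    data_path = DelayLine(2)   # stands in for pipeline registers b and c
    sign_path = DelayLine(2)   # the two extra shift register cells for bit 9
    out = []
    for phase in phases:
        sign = (phase >> 9) & 1      # bit 9 of the 10-bit FM+PG value
        magnitude = phase & 0x1FF    # placeholder for the data being processed
        out.append((data_path.tick(magnitude), sign_path.tick(sign)))
    return out

# Each sign bit comes out on the same tick as the data it belongs to.
print(run([0x200 | 5, 7, 0x200 | 9, 0, 0]))
```

If `sign_path` were given a different length than `data_path`, the sign would be applied to the wrong sample, which is the kind of bug result (c) is warning about.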
3. The chip--or at least the operator unit--doesn't process all four operators of one channel and then move on to the next channel. It processes one operator from each channel before it goes on to the next operator. I don't have enough information to tell the order, that is, whether it processes operator 1 from all six channels in a row or a different operator on each channel. But on each clock cycle (this is the main internal clock, which is the external clock divided by 6) a different channel is definitely being processed, and only after they're all done is the first channel processed again. I don't know how this affects the output, but the channel accumulator unit definitely has at least six stages of circular shift registers, so there's no reason it can't add up the right operators from the right channels and spit them out at its leisure.
4. The circular shift register array which stores the outputs of previous operators has three entries. Based on Steve Snake's info, I think I understand exactly how they work--but the timing is kind of insane. The three entries store op 2, op "old 1", and op 1. Remember, there are 6 cycles between an operator starting and finishing:
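A rough Python sketch of the interleaving in point 3, under loud assumptions: channels are numbered simply by internal cycle modulo 6 (the real order is unknown), and the `accept` flag is a stand-in for whatever the control logic does to decide which operator outputs reach the accumulator. All names here are invented for illustration.

```python
NUM_CHANNELS = 6     # YM2612 voices
CLOCK_DIVIDER = 6    # internal clock = external clock / 6

def internal_cycle(external_cycle):
    """Map an external clock tick to the internal cycle it falls in."""
    return external_cycle // CLOCK_DIVIDER

def channel_for(cycle):
    """Channel handled on a given internal cycle (assumed plain round-robin)."""
    return cycle % NUM_CHANNELS

def accumulate(results):
    """results: iterable of (value, accept) pairs, one per internal cycle.

    Each channel owns one cell of a six-entry ring; a value is added only
    when the (unmodelled) control logic accepts it.
    """
    ring = [0] * NUM_CHANNELS
    for cycle, (value, accept) in enumerate(results):
        if accept:
            ring[channel_for(cycle)] += value
    return ring

# 24 internal cycles = four operator slots for each of the six channels.
print(accumulate([(1, True)] * 24))
```

The point of the ring is that the accumulator never needs all four operators of a channel at once; it just revisits the same cell every six cycles.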
0) Op 1's FM value is fed into the pipeline from stored "old 1" and stored 1. Op 2 just finishing, sent to accumulator and stored as 2.
3) Op 3's FM value is fed into the pipeline from stored 1 and stored 2. Op 4 just finishing, sent to accumulator.
6) Op 2's FM value is fed into the pipeline from new Op 1 result, which is also stored as 1 and sent to the accumulator. The old value from "stored 1" is stored as "old 1".
9) Op 4's FM value is fed into the pipeline from new Op 3 result (which is also sent to the accumulator), Op 1 stored, and Op 2 stored.
Of course the accumulator doesn't always accept the values sent to it, and neither does the adder which produces the new FM value (it only has two inputs, but those could be hooked up to stored op 2, stored op 1, stored old 1, or new result).
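The four steps above can be written down as a small Python sketch for one YM2203 channel. This only tracks which stored values get updated and which ones are available as FM sources at each step; picking two of them for the adder, and deciding what the accumulator accepts, depends on control logic that isn't modelled here. `Store` and `step` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Store:
    """The three entries of the circular shift register array."""
    op2: int = 0
    old1: int = 0
    op1: int = 0

def step(store, cycle, finishing_result):
    """Apply the update for the operator finishing at `cycle` (0, 3, 6 or 9)
    and return the candidate FM sources for the operator being started."""
    if cycle == 0:    # op 2 finishes, op 1 starts
        store.op2 = finishing_result
        return {"old1": store.old1, "op1": store.op1}
    if cycle == 3:    # op 4 finishes, op 3 starts
        return {"op1": store.op1, "op2": store.op2}
    if cycle == 6:    # op 1 finishes, op 2 starts
        store.old1 = store.op1          # previous op 1 becomes "old 1"
        store.op1 = finishing_result
        return {"new": finishing_result}
    if cycle == 9:    # op 3 finishes, op 4 starts
        return {"new": finishing_result, "op1": store.op1, "op2": store.op2}
    raise ValueError("cycle must be 0, 3, 6 or 9")

s = Store()
for cycle, result in ((0, 20), (3, 40), (6, 10), (9, 30)):
    print(cycle, step(s, cycle, result), s)
```

Note that only steps 0 and 6 write to the array; op 3 and op 4 results are consumed immediately and never stored.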
Edit: After I posted this I remembered that all this was from the YM2203, which has only three channels. So on the YM2612 the cycles indicated should be 0, 6, 12, and 18. And then I remembered that the YM2612 has a six-stage shift register added to the pipeline between the calculated 10-bit FM value and the adder that combines it with the 10-bit phase value. They purposely lengthened the pipeline by 6 cycles, so each operator now takes 12 cycles to compute and all the timing stays in sync (but with 6 voices instead of 3)! Other than that, I believe it's exactly the same in both chips.
One more thing to add: it's a pity Yamaha didn't give us more detailed control over the chip. Just by modifying the control units (and adding some more externally-visible registers), we could have had programmable FM algorithms, and we could have also had independent modulation depths for each voice. The shifter unit that implements the "feedback amount" is in the datapath for all operators, but it's just set to a constant for all but op 1.
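The arithmetic in that edit can be checked with a few lines of Python. This is just bookkeeping on the numbers quoted above (`operator_timing` is a made-up helper, not anything read off the die): the four steps start one channel-count apart, and the operator latency is the 6 base cycles plus whatever extra delay was added.

```python
def operator_timing(num_channels, base_latency=6, extra_delay=0):
    """Return (start cycles of the four steps, per-operator latency)."""
    starts = [step * num_channels for step in range(4)]
    latency = base_latency + extra_delay
    return starts, latency

ym2203 = operator_timing(3)                  # three channels, no extra delay
ym2612 = operator_timing(6, extra_delay=6)   # six channels, six-stage shift register added

for starts, latency in (ym2203, ym2612):
    # In both chips an operator finishes exactly two steps after it starts,
    # which is what keeps the storage schedule identical.
    assert latency == starts[2] - starts[0]

print(ym2203, ym2612)
```

Without the extra six stages, a YM2612 operator would finish only one step after it started, and the "stored 1" / "old 1" shuffle would no longer line up.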