Bike shedding a new CD-ROM format
Posted: Thu May 30, 2019 7:46 pm
I managed to figure out the format of the lead-in TOC, save for a few bits (eg track A2's control bits.)
I would like to create a new CD image format that captures everything: lead-in, track 1 pregap, lead-out, and all subchannel data.
In doing so, we eliminate the need for CUE, SUB, and CCD files.
So the broader structure is this:
<arbitrary # of lead-in sectors>
<program sectors>
<arbitrary # of lead-out sectors>
[repeat for multi-session CDs]
And when it comes to encoding each sector, there are three pedantry levels:
BCD1: 2448 bytes per sector: 2352 bytes (F1 frame) + 96 bytes subchannel data (not interleaved, 12 bytes/channel, P-W)
BCD2: 3234 bytes per sector: 33 bytes (F3 frame) * 98 blocks (the two subchannel sync bytes must necessarily be 0x00, since they have no EFM lookup value)
BCD3: 7203 bytes per sector: 73.5 bytes (channel frame) * 98 blocks (literally everything on the disc)
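To show where those three numbers come from, here is a quick sketch of the arithmetic. The constant names are mine, not part of any spec, and the 588 channel bits per frame figure is the usual one (24 sync + 3 merging + 33 symbols of 14 + 3 bits), which is where the awkward 73.5 bytes comes from.

```python
# Illustrative constants only; the names are not part of the proposal.
F1_BYTES = 2352            # one F1 frame (raw sector)
SUBCHANNEL_BYTES = 8 * 12  # channels P-W, 12 bytes each
BLOCKS_PER_SECTOR = 98     # frames per sector
F3_BYTES = 33              # 1 subchannel byte + 24 data + 8 CIRC parity
CHANNEL_BITS = 588         # channel bits per frame = 73.5 bytes


def sector_sizes() -> dict:
    """Bytes per sector at each pedantry level."""
    return {
        "BCD1": F1_BYTES + SUBCHANNEL_BYTES,
        "BCD2": F3_BYTES * BLOCKS_PER_SECTOR,
        "BCD3": CHANNEL_BITS * BLOCKS_PER_SECTOR // 8,
    }


print(sector_sizes())
```

Note that 588 * 98 is divisible by 8, so a whole sector of channel frames still lands on a byte boundary even though a single frame does not.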
Every format must include subchannel data in order to recover the TOC, since this format removes the need for CCD/CUE sheets.
If a consistent hash is a required trait, then each format can be baked down to error-corrected F1 frames which can then be hashed exclusively.
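As a rough sketch of what "baking down" could look like for the 2448-byte layout (BCD1), the idea is to feed only the 2352-byte F1 portion of each sector into the hash and skip the subchannel bytes. The function name and the choice of SHA-256 are my assumptions, and I'm assuming the F1 bytes come first in each sector as described above. BCD2/BCD3 would need CIRC (and EFM) decoding first, which is not shown.

```python
import hashlib

BCD1_SECTOR = 2448  # 2352 F1 bytes followed by 96 subchannel bytes (assumed order)
F1_BYTES = 2352


def hash_f1_frames(image: bytes) -> str:
    """Hash only the F1 portion of every sector in a BCD1-style image."""
    if len(image) % BCD1_SECTOR != 0:
        raise ValueError("image is not a whole number of 2448-byte sectors")
    digest = hashlib.sha256()  # hash choice is arbitrary for this sketch
    for offset in range(0, len(image), BCD1_SECTOR):
        digest.update(image[offset:offset + F1_BYTES])
    return digest.hexdigest()
```

Two images that differ only in subchannel data would hash identically this way, which is the point. Whether lead-in and lead-out sectors should be part of the hashed range is a separate decision.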
Every format must include RSPC data for data tracks, so that seeking between sectors is sane and because bad RSPC data *is* used in copy protection.
Only BCD1 is remotely possible to rip with current drives, and even then most drives are unable to rip at this level of precision.
By separating the F1 frame from the subchannel data, we make it easy for tools/emulators to read in all F1 or subchannel data in one pass, without having to deinterleave block frames, and it allows us to cleanly omit the subchannel sync bytes. By deinterleaving the subchannel bytes, we make it easier for tools to grab specific channels of data. It's a very trivial transform to reinterleave if desired.
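To back up the "very trivial transform" claim, here is a minimal sketch of going between the two layouts. It assumes the usual raw layout of one byte per block with bit 7 = P down to bit 0 = W, 96 bytes per sector with the two sync blocks already dropped, and bits packed MSB first in the deinterleaved form. The function names are mine.

```python
def deinterleave_subchannel(raw: bytes) -> bytes:
    """96 interleaved bytes -> 12 bytes of P, then 12 of Q, ... through W."""
    if len(raw) != 96:
        raise ValueError("expected 96 subchannel bytes")
    out = bytearray(96)
    for i, byte in enumerate(raw):
        for ch in range(8):  # ch 0 = P (bit 7) ... ch 7 = W (bit 0)
            if byte & (0x80 >> ch):
                out[ch * 12 + i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def interleave_subchannel(planar: bytes) -> bytes:
    """Inverse of deinterleave_subchannel."""
    if len(planar) != 96:
        raise ValueError("expected 96 subchannel bytes")
    out = bytearray(96)
    for ch in range(8):
        for i in range(96):
            if planar[ch * 12 + i // 8] & (0x80 >> (i % 8)):
                out[i] |= 0x80 >> ch
    return bytes(out)
```

With the deinterleaved form, grabbing the Q channel is just `planar[12:24]`.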
BCD2 gives us the ability to compute C1/C2 error counts for more authentic drive emulation, and such errors may be used in copy protection.
Even if we can't rip at BCD2 level, if we use a drive that reports C1/C2 corrected/uncorrectable error counts, we can reconstruct a fake approximation that will give the same correction counts (but not error counts) from our image format when CIRC decoding is emulated.
We could leave out the unrepresentable subchannel sync bytes, but it would complicate the math to seek to specific blocks.
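With the sync bytes kept in, every block is the same size, so the seek math for the 3234-byte layout is a single multiply-add. A sketch, with names that are only illustrative:

```python
BLOCKS_PER_SECTOR = 98
F3_BYTES = 33
BCD2_SECTOR = BLOCKS_PER_SECTOR * F3_BYTES  # 3234


def bcd2_block_offset(sector: int, block: int) -> int:
    """File offset of a given 33-byte block within a given sector."""
    if sector < 0 or not 0 <= block < BLOCKS_PER_SECTOR:
        raise ValueError("sector or block out of range")
    return sector * BCD2_SECTOR + block * F3_BYTES
```

If the two sync bytes were dropped, the first two blocks of each sector would be 32 bytes and the rest 33, and this function would need a special case for them.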
BCD3 is just the future-proofing format. Creating discs with invalid EFM data would prevent drives from even being able to read them. The only real benefit to it is that it's the literal exact bit pattern the drive sees (well, sort of ... it's an encoding of the pits and lands), and it's the only one that can encode the true subchannel markers.
All of these formats allow the encoding of invalid values, just as with a real disc. Eg BCD1 can have invalid RSPC data, and a track# C4, with a start time of B2:3A:5F. BCD2 can have invalid CIRC data. BCD3 can have invalid EFM data or sync patterns. The formats allow what a raw physical CD allows, and then it's up to the emulator to decide what to do if it encounters bad data. So if you were truly insane, you could emulate the quirks specific to eg the LC8951 when given bad values.
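To make the "it's up to the emulator" part concrete: since the image stores the raw bytes, an emulator has to pick a policy for fields that aren't valid BCD. The sketch below shows one hypothetical policy (decode each nibble as-is and flag the byte as invalid). It is not how any particular chip behaves; that is exactly the part that would need per-chip research.

```python
def bcd_is_valid(byte: int) -> bool:
    """True if both nibbles are decimal digits."""
    return (byte >> 4) <= 9 and (byte & 0x0F) <= 9


def bcd_decode_lenient(byte: int) -> int:
    """Decode a BCD byte without rejecting out-of-range nibbles."""
    return (byte >> 4) * 10 + (byte & 0x0F)


# The B2:3A:5F start time from above, decoded field by field.
for field in (0xB2, 0x3A, 0x5F):
    print(f"{field:02X}: valid={bcd_is_valid(field)} lenient={bcd_decode_lenient(field)}")
```

A stricter emulator could instead reject the sector, or clamp the fields. The image format doesn't need to care either way.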
Size comparisons:
ISO/2048 = 650MiB
BIN/2352 = 747MiB
BCD1/2448 = 777MiB
BCD2/3234 = 1027MiB
BCD3/7203 = 2287MiB
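For anyone checking those figures: they assume a 650MiB ISO, which works out to 332,800 sectors, with each result rounded up to the next MiB. A small sketch of the calculation (function name is mine), ignoring the extra lead-in and lead-out sectors the new formats would add:

```python
MIB = 1024 * 1024
SECTORS = 650 * MIB // 2048  # 332,800 sectors in a 650MiB ISO


def image_mib(bytes_per_sector: int, sectors: int = SECTORS) -> int:
    """Image size in MiB, rounded up."""
    total = sectors * bytes_per_sector
    return -(-total // MIB)  # integer ceiling division


for name, size in (("ISO", 2048), ("BIN", 2352), ("BCD1", 2448),
                   ("BCD2", 3234), ("BCD3", 7203)):
    print(f"{name}/{size} = {image_mib(size)}MiB")
```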
Going beyond BCD3 would be a method to encode the spiral wobble for eg PS1 era copy protection.
I don't believe the wobble should be part of the image file format, however. It's not describing data but a shape instead.
A run-length encoding followed by a 0 (decrease pitch) or 1 (increase pitch) bit, repeated, would be ideal to me.
But it's kind of beyond the scope of this proposal, and can be added on in the future without breaking this format.
The absence of a WOB(ble) file would just indicate a standard spiral pitch.
My proposal is only for CD-ROMs, but I believe the general approach can be expanded for LaserDiscs, DVDs, Blu-rays, etc, barring the obvious differences between the formats (eg DVDs and beyond lack subchannel data.)
So really, my last major question: for the BCD1 format, should the data tracks be scrambled or not?
The data is certainly scrambled on a real disc, but if it's truly a symmetric cipher, then why destroy data compressibility?
If you want scrambled data (which is rare), then scramble it. Simple.
I've only heard very weak arguments as to why this scrambling is needed, eg sectors that aren't fully written, but even in that case, it's trivial to create a sector that when scrambled will result in the same form as a CD with an incomplete sector.
The only compelling reason I can think of to store the data scrambled is consistency, so that we don't end up with BCD2/BCD3 holding scrambled data while BCD1 does not. It would also make it easier to emulate playing a "data" track as audio, or vice versa.
Well either way, I'm willing to scramble it or not.
My only fear in doing this pre-emptively is that I don't know whether my scrambler implementation is correct, as I have no test vectors to use.
Code is here, however: https://gitlab.com/higan/higan/blob/mas ... ambler.hpp
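For cross-checking, here is an independent minimal sketch of the scrambler as I understand it from ECMA-130 Annex B: a 15-bit LFSR with polynomial x^15 + x + 1, preset to 0x0001, output LSB first, XORed over bytes 12-2351 of the sector (the 12 sync bytes are left alone). This is not the code at the link above, and the names are mine.

```python
SECTOR_BYTES = 2352
SYNC_BYTES = 12


def scramble_table() -> bytes:
    """Keystream for bytes 12..2351, from a 15-bit LFSR (x^15 + x + 1)."""
    state = 0x0001
    table = bytearray(SECTOR_BYTES - SYNC_BYTES)
    for i in range(len(table)):
        value = 0
        for bit in range(8):  # bits come out LSB first
            value |= (state & 1) << bit
            feedback = (state ^ (state >> 1)) & 1
            state = (state >> 1) | (feedback << 14)
        table[i] = value
    return bytes(table)


TABLE = scramble_table()


def scramble(sector: bytes) -> bytes:
    """XOR the keystream over a sector; applying it twice is a no-op."""
    if len(sector) != SECTOR_BYTES:
        raise ValueError("expected a 2352-byte sector")
    out = bytearray(sector)
    for i, key in enumerate(TABLE):
        out[SYNC_BYTES + i] ^= key
    return bytes(out)
```

Stepping the LFSR by hand, the keystream starts 01 80 00 60 00 28 00 1E, which can serve as a sanity check rather than an authoritative test vector. Since it's a plain XOR keystream, scrambling and descrambling are the same operation.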
Any of my proposed image formats should be trivially convertible back to ISO/BIN/CCD-SUB/DAO96, etc.
So there is no compelling need for anyone to add support for this to their emulators. It's more of a proposed archival format if nothing else.
Anyone have thoughts on this?