
Bike shedding a new CD-ROM format

Posted: Thu May 30, 2019 7:46 pm
by Near
I managed to figure out the format of the lead-in TOC, save for a few bits (eg track A2's control bits.)

I would like to create a new CD image format that captures everything: lead-in, track 1 pregap, lead-out, and all subchannel data.

In doing so, we eliminate the need for CUE, SUB, and CCD files.

So the broader structure is this:
<arbitrary # of lead-in sectors>
<program sectors>
<arbitrary # of lead-out sectors>
[repeat for multi-session CDs]

And when it comes to encoding each sector, there are three pedantry levels:
BCD1, 2448 bytes per sector: 2352 bytes (F1 frame) + 96 bytes subchannel data (not interleaved, 12 bytes/channel, P-W); a layout sketch follows below
BCD2, 3234 bytes per sector: 33 bytes (F3 frame) * 98 blocks (the two subchannel sync bytes must necessarily be 0x00; they have no EFM lookup value)
BCD3, 7203 bytes per sector: 73.5 bytes (channel frame) * 98 blocks (literally everything on the disc)
Every format must include subchannel data in order to recover the TOC, since this format eliminates the need for CCD/CUE sheets.
If a consistent hash is a required trait, then each format can be baked down to error-corrected F1 frames which can then be hashed exclusively.
Every format must include RSPC data for data tracks, so that seeking between sectors is sane and because bad RSPC data *is* used in copy protection.
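
To make the first level concrete, here is a struct sketch of a BCD1 sector; the field names are mine, purely illustrative, not part of any spec.

#include <cstdint>

struct BCD1Sector {
  uint8_t frame[2352];        //F1 frame: sync, header, data, EDC/RSPC (for data tracks)
  uint8_t subchannel[8][12];  //96 bytes, deinterleaved: P, Q, R, S, T, U, V, W
};
static_assert(sizeof(BCD1Sector) == 2448);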

Only BCD1 is remotely possible to rip with current drives, and even then most drives can't rip at that precision level.
By separating the F1 frame from the subchannel data, we make it easy for tools/emulators to read in all F1 or subchannel data in one pass, without having to deinterleave block frames, and it allows us to cleanly omit the subchannel sync bytes. By deinterleaving the subchannel bytes, we make it easier for tools to grab specific channels of data. It's a very trivial transform to reinterleave if desired.
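
The transform, roughly, assuming the usual raw layout of one bit per channel per byte (P in bit 7 down to W in bit 0); the bit ordering inside the deinterleaved channel bytes is my assumption:

#include <cstdint>
#include <cstring>

//raw[96] (interleaved) -> channels[8][12] (P..W, 12 bytes per channel)
auto deinterleave(const uint8_t raw[96], uint8_t channels[8][12]) -> void {
  memset(channels, 0, 96);
  for(int bit = 0; bit < 96; bit++) {
    for(int channel = 0; channel < 8; channel++) {
      if(raw[bit] & 0x80 >> channel) channels[channel][bit >> 3] |= 0x80 >> (bit & 7);
    }
  }
}

//channels[8][12] -> raw[96]; the exact inverse
auto interleave(const uint8_t channels[8][12], uint8_t raw[96]) -> void {
  memset(raw, 0, 96);
  for(int bit = 0; bit < 96; bit++) {
    for(int channel = 0; channel < 8; channel++) {
      if(channels[channel][bit >> 3] & 0x80 >> (bit & 7)) raw[bit] |= 0x80 >> channel;
    }
  }
}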

BCD2 gives us the ability to compute C1/C2 error counts for more authentic drive emulation, and these may be used in copy protection.
Even if we can't rip at the BCD2 level, if we use a drive that reports C1/C2 corrected/uncorrectable error counts, we can reconstruct an approximation that will give the same correction counts (but not error counts) from our image format when CIRC decoding is emulated.
We could leave out the unrepresentable subchannel sync bytes, but it would complicate the math to seek to specific blocks.
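
To illustrate the seek math: with the sync bytes kept, every block sits at a fixed 33-byte stride; dropping the two sync bytes would shrink blocks 0 and 1 to 32 bytes (and the sector to 3232 bytes), so block offsets become irregular. A sketch:

#include <cstdint>

//byte offset of block 0..97 inside sector `lba`, sync bytes kept: fixed stride
auto blockOffset(int64_t lba, int block) -> int64_t {
  return lba * 3234 + block * 33;
}

//the same offset if the two sync-pattern bytes were omitted: irregular stride
auto blockOffsetNoSync(int64_t lba, int block) -> int64_t {
  int64_t inSector = block < 2 ? block * 32 : 64 + (block - 2) * 33;
  return lba * 3232 + inSector;
}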

BCD3 is just the future-proofing format. Creating discs with invalid EFM data would prevent drives from even being able to read the CDs. The only real benefit is that it's the literal exact bit pattern the drive sees (well, sort of ... it's an encoding of the pits and lands), and it's the only level that can encode the true subchannel sync patterns.

All of these formats allow the encoding of invalid values, just as with a real disc. Eg BCD1 can have invalid RSPC data, a track# of C4, and a start time of B2:3A:5F. BCD2 can have invalid CIRC data. BCD3 can have invalid EFM data or sync patterns. The formats allow what a raw physical CD allows, and then it's up to the emulator to decide what to do if it encounters bad data. So if you were truly insane, you could emulate the quirks specific to eg the LC8951 when given bad values.

Size comparisons:
ISO/2048 = 650MiB
BIN/2352 = 747MiB
BCD1/2448 = 777MiB
BCD2/3234 = 1027MiB
BCD3/7203 = 2287MiB
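
For reference, these line up with a 74-minute disc (74 * 60 * 75 = 333,000 sectors); a quick sanity check:

#include <cstdio>

int main() {
  const long long sectors = 74 * 60 * 75;  //333,000 sectors on a 74-minute disc
  const int   sizes[] = {2048, 2352, 2448, 3234, 7203};
  const char* names[] = {"ISO", "BIN", "BCD1", "BCD2", "BCD3"};
  for(int n = 0; n < 5; n++) {
    printf("%s/%d = %.0fMiB\n", names[n], sizes[n], sectors * sizes[n] / 1024.0 / 1024.0);
  }
}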

Going beyond BCD3 would be a method to encode the spiral wobble for eg PS1 era copy protection.
I don't believe the wobble should be part of the image file format, however. It's not describing data but a shape instead.
A run-length encoding followed by a 0 (decrease pitch) or 1 (increase pitch) bit, repeated, would be ideal to me.
But it's kind of beyond the scope of this proposal, and can be added on in the future without breaking this format.
The absence of a WOB(ble) file would just indicate a standard spiral pitch.

My proposal is only for CD-ROMs, but I believe the general approach can be expanded for LaserDiscs, DVDs, Blurays, etc, barring the obvious differences between the formats (eg DVDs and beyond lack subchannel data.)

So really, my last major question: for the BCD1 format, should the data tracks be scrambled or not?
The data is certainly scrambled on a real disc, but if it's truly a symmetric cipher, then why destroy data compressibility?
If you want scrambled data (which is rare), then scramble it. Simple.
I've only heard very weak arguments as to why this scrambling is needed, eg sectors that aren't fully written, but even in that case, it's trivial to create a sector that when scrambled will result in the same form as a CD with an incomplete sector.
The only compelling reason I can think of to store the data in scrambled form is to avoid BCD2/BCD3 holding scrambled data while BCD1 does not. Or to more easily emulate playing a "data" track as audio anyway, or vice versa.
Well either way, I'm willing to scramble it or not.

My only fear in doing this pre-emptively is that I don't know if my scrambler implementation is correct, as I have no test vectors to use.
Code is here, however: https://gitlab.com/higan/higan/blob/mas ... ambler.hpp
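
For reference, the algorithm as I understand it from ECMA-130 Annex B: a 15-bit LFSR with polynomial x^15 + x + 1, preset to 0x0001 at each sector sync, XORed over bytes 12..2351, least significant bit first. If that reading is right, the first keystream bytes come out as 01 80 00 60 00 28 00 1E, which is at least something to diff my implementation against:

#include <cstdint>

//treat this as a cross-check sketch of my reading of the spec, not a reference implementation
auto scramble(uint8_t sector[2352]) -> void {
  uint16_t lfsr = 0x0001;  //preset at every sync
  for(int offset = 12; offset < 2352; offset++) {
    sector[offset] ^= lfsr & 0xff;  //the next 8 output bits are the low byte of the register
    for(int bit = 0; bit < 8; bit++) {
      uint16_t carry = (lfsr ^ lfsr >> 1) & 1;  //feedback taps for x^15 + x + 1
      lfsr = lfsr >> 1 | carry << 14;
    }
  }
}
//descrambling is the identical operation (XOR with the same keystream)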

Any of my proposed image formats should be trivially convertible back to ISO/BIN/CCD-SUB/DAO96, etc.
So there is no strong compelling need for anyone to add support for this to their emulators; it's more of a proposed archival format if nothing else.

Anyone have thoughts on this?

Re: Bike shedding a new CD-ROM format

Posted: Thu May 30, 2019 9:32 pm
by Mask of Destiny
For most "normal" CD-ROM imaging needs, I kind of feel like MAME's CHD format is probably the right way to go. The lack of proper documentation is a bummer, but it handles multi-track data well and can store raw F1 frames + subcode data. It also supports lossless compression of tracks including FLAC for audio tracks. CHD also has the advantage that it is supported by some non-MAME emulators of CD based consoles already. It also has support for video tracks (for Laserdisc support), but I'm less convinced that they have the right approach here.

For archival Laserdisc imaging, I think something that records the time between zero-crossings of the RF signal is what you want. Raw Domesday Duplicator captures contain a lot of information that doesn't strictly exist on the physical disc and at least in theory shouldn't be needed for accurate decoding of the contents.

Re: Bike shedding a new CD-ROM format

Posted: Thu May 30, 2019 11:53 pm
by Near
I can't find good documentation on what CHD actually does with CDs. All I see from the format is a really complicated way of describing and compressing blocks of data that has something like seven different header revisions. I'm really into simplicity in my formats, and try not to add anything that can cause ambiguity or that adds unnecessary complexity.

Can you please elaborate on what CHD does that my proposal would not? I'm very interested if there are any shortcomings I'm missing. And certainly, BCD<>CHD is doable.

Compression of audio to FLAC is definitely a big win for disk space, but it is not friendly to emulators that would have to decompress the whole thing into RAM (a huge stall on switching audio tracks) or implement a streaming decompressor (a lot of extra work; few compression libraries are designed to be streamable, though I'd be surprised if libflac or whatever were not.)

What I wanted with BCD was the easiest possible support in emulators, with no need for a bunch of revisions to the spec, no need for decompression libraries, etc. Just multiply your sector# by N and start reading.
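
Concretely, the entire read path for an uncompressed BCD1 image would be something like this (plain iostreams, purely for illustration):

#include <cstdint>
#include <fstream>

auto readSector(std::ifstream& image, int64_t lba, uint8_t sector[2448]) -> bool {
  image.seekg(lba * 2448);
  image.read(reinterpret_cast<char*>(sector), 2448);
  return bool(image);
}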

Re: Bike shedding a new CD-ROM format

Posted: Fri May 31, 2019 8:27 am
by cero
Plus, with zlib and gzopen/gzread you get transparent decompression. It works for both uncompressed and gzipped files.
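
Something along these lines; the same code path handles a plain uncompressed image unchanged (note that zlib emulates backward seeks in gzipped files by re-decompressing from the start, so they can be slow):

#include <cstdint>
#include <cstdio>
#include <zlib.h>

//gzFile image = gzopen("game.bcd", "rb");  //works the same for a gzipped image
auto readSector(gzFile image, int64_t lba, uint8_t sector[2448]) -> bool {
  if(gzseek(image, lba * 2448, SEEK_SET) < 0) return false;
  return gzread(image, sector, 2448) == 2448;
}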

Re: Bike shedding a new CD-ROM format

Posted: Fri May 31, 2019 8:14 pm
by Mask of Destiny
byuu wrote:
Thu May 30, 2019 11:53 pm
I can't find good documentation on what CHD actually does with CDs.
So in its most basic form, a CHD is a series of compressed "hunks" of data. Hunks are relatively small (512KB max, supposedly) and are compressed individually to allow reasonably random access. For CDs, extra metadata is stored to handle the stuff that would be in the TOC and to keep track of whether each track was ripped in RAW format or not.
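
Conceptually it's this shape (made-up names, not the actual CHD structures):

#include <cstdint>
#include <vector>

//illustration of hunk-based random access only, not the real CHD on-disk format
struct HunkedImage {
  uint32_t hunkSize;                        //fixed per image, reportedly up to 512KB
  std::vector<std::vector<uint8_t>> hunks;  //each hunk compressed independently

  //stand-in for whatever codec a given hunk uses
  auto decompressHunk(const std::vector<uint8_t>& compressed) -> std::vector<uint8_t>;

  auto readByte(uint64_t offset) -> uint8_t {
    auto hunk = decompressHunk(hunks[offset / hunkSize]);  //only this hunk gets inflated
    return hunk[offset % hunkSize];
  }
};
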
byuu wrote:
Thu May 30, 2019 11:53 pm
Can you please elaborate on what CHD does that my proposal would not?
Optical disc images are big enough that users reasonably care about compression. It's also supported in a number of emulators outside MAME already (not for the Mega CD AFAIK, though), so it has that going for it.
byuu wrote:
Thu May 30, 2019 11:53 pm
I'm very interested if there are any shortcomings I'm missing.
For the scope you have chosen, BCD1 seems fine from a technical perspective. BCD2 seems pretty pointless IMO. You can't capture it with normal hardware, and if you have something exotic like a Domesday Duplicator setup and are pedantic enough (or have a weird enough disc) to want something lower level than BCD1, you'll probably also want something lower level than BCD2 as well. For BCD3, I think you at least need to include the bit clock rate in a header somewhere if you want to support analog formats like Laserdisc as well, though like I said before, I think storing the time between zero crossings (effectively RLE, I guess) makes more sense at the bit rates required to preserve the original analog bandwidth.
byuu wrote:
Thu May 30, 2019 11:53 pm
Compression of audio to FLAC is definitely a big win for disk space, but it is not friendly to emulators that would have to decompress the whole thing into RAM (a huge stall on switching audio tracks) or implement a streaming decompressor (a lot of extra work; few compression libraries are designed to be streamable, though I'd be surprised if libflac or whatever were not.)
CHD is specifically structured to make streaming reasonable as best I can tell.
byuu wrote:
Thu May 30, 2019 11:53 pm
What I wanted with BCD was the easiest possible support in emulators, with no need for a bunch of revisions to the spec, no need for decompression libraries, etc. Just multiply your sector# by N and start reading.
I get this on some level, but I'm not convinced the advantages over existing formats are sufficient to justify yet another optical disc format. BIN/CUE has some limitations and it's kind of gross that it's split into multiple files, but it serves the "simple raw format" well enough that I think it will be difficult to get traction with a slightly better "simple raw format".

Re: Bike shedding a new CD-ROM format

Posted: Fri May 31, 2019 11:55 pm
by Sik
For what it's worth: I wouldn't be surprised if loading compressed data and decompressing it on the fly takes less time than loading uncompressed data. Don't forget that reading from files is slow even with caching (especially if you're reading from a hard disk, and some large flash-based media can be pretty darn slow too).

Re: Bike shedding a new CD-ROM format

Posted: Sat Jun 01, 2019 6:20 pm
by Huge
This would be great, but what exactly would you use to rip in these formats? They are kind of pointless if nothing can create them.

Don't forget that most data you want to rip is "unrippable" due not only to damage to the CDs, but also to defective duplication, mastering errors, etc. There's a reason why there are so many levels of error correction in there. Even if you build a low-level interface to a CD drive that makes it possible to *get* to this data, it may be difficult to read due to damage and aging of the optical pickup. Certainly not possible to read it to matching CRCs over multiple discs, I think. That would be the first hurdle to get through.

You also have to take into account the general usability of a format, for example compatibility with older applications, as well as things like compression efficiency. This is why CloneCD keeps the subcodes in a separate file, while the main channel matches the format used by bin/cue to a degree. It's also why things like ECM compression were invented (stripping the ECC info out of 2352-byte bin/cue sectors and handling it as a separate block to make it compress more efficiently).
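
If I remember the Mode 1 layout right, that split looks roughly like this; only the parts that can't be regenerated get kept, everything else is rebuilt on extraction:

#include <cstdint>
#include <cstring>

struct Mode1Split {
  uint8_t header[4];   //bytes 12..15: amin, asec, aframe, mode (often predictable too)
  uint8_t data[2048];  //bytes 16..2063: the only part that really has to be stored
};

auto split(const uint8_t sector[2352], Mode1Split& out) -> void {
  memcpy(out.header, sector + 12, 4);
  memcpy(out.data, sector + 16, 2048);
  //dropped and regenerated later: sync (0..11), EDC (2064..2067),
  //zero fill (2068..2075), P parity (2076..2247), Q parity (2248..2351)
}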

Perhaps we could keep this in mind and use the CCD format as a base, extending it with extra data, to keep compatibility with older apps? That would be simpler than a *new* format that many useful applications (such as legacy apps) cannot and never will be able to read directly.

I thought of this before and perhaps some form of optical imaging would be more reliable to read out the pits, even if it would have to deal with orders of magnitude more data. At least until the "disc bits" are found and read. But that's a more radical approach and I'm not sure how feasible it is to make electron microscope scans of a 12cm compact disc.

Re: Bike shedding a new CD-ROM format

Posted: Sun Jun 02, 2019 7:41 am
by Near
This would be great, but what exactly would you use to rip in these formats?
There are different drives that can rip all of these components. I don't know if there's a single drive that can do all of them at once.

The data is going to be in a consistent format 99% of the time anyway (I can easily replicate CCD SUB files for CDs that don't use R-W subchannels), so it's similar to stripping the data track parity information and regenerating it on the fly.

I'm not too worried about compression: a good compressor with knowledge of this format could do all of this work for us:
* remove P/Q-parity from data tracks when it can be auto-generated
* remove subchannel data from tracks when it can be auto-generated
* compress audio tracks with FLAC
* reorder the data so that it compresses better
When it comes to emulating the images, I prefer the data to be uncompressed. It's much simpler for the emulator.
A tool or a library can convert between the compressed and decompressed formats, as well as support other formats like ISO, BIN/CUE, IMG/SUB, MDF/MDS, etc.

Re: Bike shedding a new CD-ROM format

Posted: Sun Jun 02, 2019 12:56 pm
by MetalliC
byuu wrote:
Sun Jun 02, 2019 7:41 am
This would be great, but what exactly would you use to rip in these formats?
There are different drives that can rip all of these components. I don't know if there's a single drive that can do all of them at once.
Which components? AFAIK most drives can return only data-area sectors (but not the lead-in) as descrambled 2352-byte sectors, plus their subcodes. Only a small number of drives (mostly Plextors) can return data as scrambled 2352 and read the lead-in / lead-out areas.
That's the best you can get. IIRC there are no drives capable of returning lower-level data, prior to EFM decode.

Re: Bike shedding a new CD-ROM format

Posted: Sun Jun 02, 2019 1:44 pm
by Near
That was what I meant, we could rip at BCD1 legitimately. Beyond that (C1/C2 and EFM), we currently cannot. And at this rate, we may never be able to.

Re: Bike shedding a new CD-ROM format

Posted: Mon Jun 03, 2019 8:47 am
by Huge
Subcode reading is dodgy, even on the best drives. That is, you can't rip two of the same discs to identical CRCs.

Re: Bike shedding a new CD-ROM format

Posted: Mon Jun 03, 2019 4:21 pm
by Near
The good news is that subcode follows a pre-defined ruleset for most discs.

You can literally generate all of it, including the lead-in TOC and lead-out data, just from a BIN/CUE (with the exception of possibly not knowing how big the lead-in really is; but then, it's supposedly standardized, even if I've heard conflicting values of 4500 and 7500 sectors for it.)

P is always all 1s or all 0s to indicate the pregap areas, and Q always has a CRC16 for data integrity checks and literally just reports control/address/track#/index#/relative-time/absolute-time. So it's quite easy to deduce and correct the errors to get a perfect rip.
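
For illustration, generating a Q packet for a program-area sector goes roughly like this; the field packing follows my reading of the spec (CRC is X^16+X^12+X^5+1 over the first ten bytes, stored inverted), so verify it against a known-good SUB rip before trusting it:

#include <cstdint>

auto toBCD(uint8_t value) -> uint8_t { return (value / 10) << 4 | value % 10; }

auto crc16(const uint8_t* data, int length) -> uint16_t {
  uint16_t crc = 0;
  for(int n = 0; n < length; n++) {
    crc ^= data[n] << 8;
    for(int bit = 0; bit < 8; bit++) crc = crc & 0x8000 ? crc << 1 ^ 0x1021 : crc << 1;
  }
  return crc;
}

//fills q[0..11] with an ADR=1 (position) packet for the program area
auto makeQ(uint8_t q[12], uint8_t control, uint8_t track, uint8_t index,
           int relative, int absolute) -> void {
  auto msf = [&](int lba, int offset) {
    q[offset + 0] = toBCD(lba / 75 / 60);  //minutes
    q[offset + 1] = toBCD(lba / 75 % 60);  //seconds
    q[offset + 2] = toBCD(lba % 75);       //frames
  };
  q[0] = control << 4 | 0x1;     //control bits in the high nibble, ADR=1 in the low
  q[1] = toBCD(track);
  q[2] = toBCD(index);
  msf(relative, 3);              //track-relative time
  q[6] = 0x00;
  msf(absolute + 150, 7);        //absolute time includes the 2-second (150 sector) offset
  uint16_t crc = ~crc16(q, 10);  //remainder stored inverted
  q[10] = crc >> 8;
  q[11] = crc & 0xff;
}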

Where it gets tricky is when CDs actually store data in R-W, which at least for the Sega CD doesn't happen outside of CD+G discs, and those aren't used in actual Sega CD games. For other systems, you'd have to resort to multiple reads and averaging of the results, which is fallible. CDs are no doubt designed to accommodate a certain amount of subchannel errors here, and so a checksum that ignores R-W subchannel data would be sufficient.

Re: Bike shedding a new CD-ROM format

Posted: Mon Jun 03, 2019 10:58 pm
by Huge
It's true that subcodes can be generated, but if you generate them, why even bother trying to read them? And if you *assume* what the format looks like based on the rules, then you'll overlook discs which do not conform to those rules.
Not that I can name any off the top of my head, but if we are using clever math and error detection to assume what a piece of dumped data looks like, then we could also do away with ripping in 2352 raw mode, because the ECC can be generated.

And if we do away with subcode when generating the CRC, well then CloneCD is sufficient for that already. Even bin/cue too, assuming your disc does not have multisession data. Hell, if it only has one data track, one ISO is fine; the rest can be generated because the format follows a pre-defined ruleset (not counting any UDF or Mac-only filesystems)...

Re: Bike shedding a new CD-ROM format

Posted: Tue Jun 04, 2019 6:41 am
by Near
It's true that subcodes can be generated, but if you generate them, why even bother trying to read them?
A fair question! It makes emulating the system easier.

The Sega CD can read back the 96-byte subchannel data; it's what populates the current time information and the header/subheader registers, and it's also how CD+G playback works.

By implementing the CDD closer to real CD data, we avoid transformations and increase our overall accuracy when valid subchannel rips are available (and it looks like most Sega CD games have IMG/SUB rips, so that is indeed the case.)
Not that I can name any off the top of my head, but if we are using clever math and error detection to assume what a piece of dumped data looks like, then we could also do away with ripping in 2352 raw mode, because the ECC can be generated.
I have no objections to any compression or container formats. I just want the emulator to get 2448-byte sectors without having to implement dozens of codecs and container formats.

Re: Bike shedding a new CD-ROM format

Posted: Tue Jun 04, 2019 2:31 pm
by Sik
Yeah, CD+G pretty much kills the "no subcodes" idea. While the P and Q subcodes can be guessed (as they're used to aid with error correction and such), the rest can't, and those still make up 75% of each subchannel byte, so you may as well store the whole thing. And while not on the Mega CD, I recall hearing about at least one PC game relying on reading data from the subcodes as a form of DRM?

CD-Text data is also stored in subcodes, if you fancy that.