Phantasy Star IV text compression

Ask anything your want about Megadrive/Genesis programming.

Moderator: BigEvilCorporation

Post Reply
Paul Jensen
Interested
Posts: 32
Joined: Mon Apr 06, 2009 4:17 pm
Location: Hiroshima, Japan

Phantasy Star IV text compression

Post by Paul Jensen » Wed Apr 22, 2009 4:32 pm

Hi everybody,

Has anybody here ever successfully decoded compressed text in a Mega Drive game? I'm trying to extract the script for Phantasy Star IV (I want to do a translation for practice), but I can't seem to crack the encoding scheme. Here's what I've figured out so far:

The game seems to use some sort of LZ compression. The data near the front of a text bank is usually pretty readable, but gets harder to read near the end of the bank as redundant data is replaced with compressed data.

Text is stored in chunks, and each chunk has a two-byte prefix. The prefix for a non-compressed chunk is always FFFFh, and the length of a non-compressed chunk is always 16 bytes. Chunks can vary in length, depending on whether a chunk contains compressed data, and the prefixes also change depending on the presence of compressed bytes.

What I can't figure out is: (1) how the prefixes (aside from FFFFh) relate to the compression; and (2) how to decode the compressed data.

On (1): From what I've read about LZ compression, prefixes are bit masks that indicate which bytes in a chunk are (un)compressed. A 2-byte (16-bit) prefix should work prefectly for 16-byte chunks, but the prefixes used in PSIV doesn't seem to be bit masks.

On (2): The compressed data doesn't seem to point to anything. Most (all?) of the data is one byte in length. One of the bytes I came across had a value of 01h, and I can't see how that could point to anything.

EDIT: I figured out (2). The compression algorithm seems to use a look-behind buffer of 255 bytes. The compressed data is the starting index of the position of the data in the look-behind buffer. That was easy.

Anyway, the key seems to lie in the relationship between the prefixes and the bytes that replace the redundant data. I just can't see how it all fits together.

It's possible to get a decompressed script by dumping the Mega Drive's RAM to a file (and I've already dumped part of it so I can have a reference to compare the encoded data against). I guess I could do a RAM dump for each section of the game, but I'd really like to figure out how the compression works.

If anybody's interested in this, I've got a RAM dump and a table file for the ROM that contains all of the kana and about 60% of the kanji used by the game. If my explanation doesn't make sense, I can try to give some examples.

Thanks.

Christuserloeser
Very interested
Posts: 145
Joined: Sun Jan 28, 2007 2:01 am
Location: DCEvolution.net
Contact:

Post by Christuserloeser » Wed Apr 22, 2009 10:29 pm

Nice to see you here. Welcome to SpritesMind :)


A link which I hope might be of interest to you (there's a Phantasy Star III editor):

http://romhacking.net/?category=&Platfo ... tle=&desc=


EDIT: They also got a PS4 hack: http://romhacking.net/hacks/480/
Last edited by Christuserloeser on Wed Apr 22, 2009 11:38 pm, edited 1 time in total.
http://www.DCEvolution.net - Gigabytes of free Dreamcast software for you

Image

tomaitheous
Very interested
Posts: 256
Joined: Tue Sep 11, 2007 9:10 pm

Re: Phantasy Star IV text compression

Post by tomaitheous » Wed Apr 22, 2009 11:04 pm

Has anybody here ever successfully decoded compressed text in a Mega Drive game?
It's a console, not a computer. Even if it had a BIOS like GBA, there's no guarantee that it will use the specified format and decompresser. I.e - there's no one compression for everything text on the Megadrive.

Anyway, there are many variants and implementation of LZ(LZSS). If it's LZ/LZSS/variant then there is a sliding window 'frame' around the decompressed data. It increments with each byte written. The 16bit flag you described sounds like standard LZSS (although it's usually 8bits wide). One value of the bit means read the next byte as a literal, and the opposite value means treat the next byte/nibble/word/etc as a pointer into the sliding window. There's also a length value. A common setup is read a WORD that's packed with 4bit length and 12bit window pointer. The 'pointer' is added to the sliding 'window' offset into the decompressed data. You can get free RLE if you have it pointer to 1 byte behind the current data pointer and set the length ;). But like I said, there are *many* variants of how all this is implemented. I've personally seen over 10 versions.

If you have a chuck of decompressed text to look at, then you should be able to figure out the LZ format fairly easily without a debugger unless they've stacked compression layers or added additional control codes for alternate decompression (combination of LZ and something else).

LZ/LZSS isn't the greatest compression for text as it's not efficient for small chunks and it has to decompress the whole block just to get at single point/string. It also requires a block of RAM to decode to first, before getting at decompressed strings/data. But that doesn't stop Japanese games from using it (alot)...


If you figure out the exact format they use (or multiple format if so), please upload a doc over at RHDN :)

Paul Jensen
Interested
Posts: 32
Joined: Mon Apr 06, 2009 4:17 pm
Location: Hiroshima, Japan

Post by Paul Jensen » Thu Apr 23, 2009 3:34 am

@Christuserloeser: Nice to see you here, too. Thanks for the links. Unfortunately, the source code for that program seems to be available only from the author's webpage, which appears to be down right now. I might luck out if they use the same compression. I checked out that PSIV hack, but it doesn't look like they've messed with the script at all.

@tomaitheous: Yeah, I know the Genny doesn't have BIOS-based data compression/decompression. I was hoping that the encoding scheme used in PSIV is also used in any other Mega Drive games, on the chance that someone else had already figured it out.

Anyway, the encoding seems to be a variant of LZ. The encoder uses a sliding window of 255 bytes. Compressed data are single bytes that refer to the position in the window. What I still haven't figured out is how the decoder distinguishes between encoded and literal data, or how it stores the length value for the encoded data.

tomaitheous
Very interested
Posts: 256
Joined: Tue Sep 11, 2007 9:10 pm

Post by tomaitheous » Thu Apr 23, 2009 4:41 am

Is the text in ASCII format? Can you provide some offsets in the rom of compressed blocks?

TascoDLX
Very interested
Posts: 262
Joined: Tue Feb 06, 2007 8:18 pm

Post by TascoDLX » Thu Apr 23, 2009 5:27 am

Paul Jensen wrote:I was hoping that the encoding scheme used in PSIV is also used in any other Mega Drive games, on the chance that someone else had already figured it out.
It appears it is the so-called "Kosinski" compression (LZSS-type, indeed) -- used in Sonic games, Sega CD BIOS, to name a few... and PSIV it seems. :wink:

I didn't write that article but if you have any questions about the format, I'm sure I can answer them.

bastien
Very interested
Posts: 208
Joined: Mon Jun 25, 2007 7:19 pm
Location: Besançon,France
Contact:

Post by bastien » Thu Apr 23, 2009 7:29 am

TascoDLX wrote:
Paul Jensen wrote:I was hoping that the encoding scheme used in PSIV is also used in any other Mega Drive games, on the chance that someone else had already figured it out.
It appears it is the so-called "Kosinski" compression (LZSS-type, indeed) -- used in Sonic games, Sega CD BIOS, to name a few... and PSIV it seems. :wink:

I didn't write that article but if you have any questions about the format, I'm sure I can answer them.
So maybe you can use this soft for help you:
http://traf.romhack.org/?page=patches3& ... ile&id=787

Paul Jensen
Interested
Posts: 32
Joined: Mon Apr 06, 2009 4:17 pm
Location: Hiroshima, Japan

Post by Paul Jensen » Thu Apr 23, 2009 7:20 pm

Thanks for the help so far you guys.

@TascoDLX: The info you pointed to really got me pointed in the right direction.

I've figured out the following about the prefixes. It turns out the prefixes really are bit flags, but they work in a strange way. Here are the first several prefixes (starting at 1CC476h in the ROM*, for anybody who's interested):

(1) FFFF
(2) FFFF
(3) C3FF
(4) FFE1
(5) FF3F
(6) 02FC
(7) FFFD
(8) 61F8

The bits get reversed, which gives bit patterns of:

(1) 1111111111111111
(2) 1111111111111111
(3) 1100001111111111
(4) 1111111110000111
(5) 1111111111111100
(6) 0100000000111111
(7) 1111111110111111
(8) 1000011000011111

A value of 1 represents a literal (uncompressed) byte.
A value of 0 represents a compressed byte, and the next three bits after that represent the original length of the data at that location, plus two (because 2 is, for good reason, the minimum length required for compression), so:

0000 = 2 bytes
0001 = 3 bytes
0010 = 4 bytes
0011 = 5 bytes
0100 = 6 bytes
0101 = 7 bytes
0110 = 8 bytes
0111 = 9 bytes

What's interesting is that the decompressor apparently needs to use the previous prefix sometimes. Look at (5) and (6) again:

(5) 1111111111111100
(6) 0100000000111111

Prefix (5) shows 14 literal bytes, followed by 00. Prefix (6) starts with 01, which is the continuation of the last two bits of prefix (5). It gives the sequence 0001, which represents compressed data with an original length of 3 bytes. So the decoder has to look at both prefixes to figure out what to do with the data. A lot of this looks similar to Kosinski compression.

There's some other wierd stuff that happens. It looks like the prefixes themselves can get compressed, although I can't wrap my head around the logic of this. Also, even if a prefix is all 1's, it doesn't necessarily mean that there are 16 bytes between the current prefix and the next one. This seems to happen when the next prefix starts with FF.

Anyway, I feel like I've made a lot of progress on this. Almost all the pieces are fitting together. Of course I'm open to more advice, though. :)

*The ROM I'm looking at is a .bin version. It looks like there's a header in there, so the location moght not be the same in other ROMs I guess.

TascoDLX
Very interested
Posts: 262
Joined: Tue Feb 06, 2007 8:18 pm

Post by TascoDLX » Fri Apr 24, 2009 2:16 am

Paul Jensen wrote:The bits get reversed, which gives bit patterns of:
Yeah, you got it. The bytes are swapped and the bits are shifted out to the right. As soon the last bit is shifted out, the next 2 bytes are read. This is why the data starts with 0xFFFF then 15 bytes (instead of 16) then 0xFFFF then 16 bytes and so on.
Paul Jensen wrote:A value of 0 represents a compressed byte, and the next three bits after that represent the original length of the data at that location, plus two (because 2 is, for good reason, the minimum length required for compression), so:

0000 = 2 bytes
0001 = 3 bytes
0010 = 4 bytes
0011 = 5 bytes
0100 = 6 bytes
0101 = 7 bytes
0110 = 8 bytes
0111 = 9 bytes
Not quite that simple.

00xx duplicates a span from within the last 256 bytes written, where xx is the length (+2). It uses a single byte for the offset, where 0xFF equals -1 and 0x00 equals -256.

01 duplicates a span from within the last 8192 bytes written. It uses two bytes combined for the length and offset. Of those two bytes, the first byte is the lower 8 bits of the offset and the upper 5 bits of the second byte is the upper 5 bits of the offset (13 bits total). The bottom 3 bits of the second byte is the length (+2). If those 3 bits equal 0, an extra byte is used for the length (+1) allowing for a length up to 256 bytes. If that extra byte equals 0, the end of the file has been reached. If it equals 1, nothing is duplicated or written -- doesn't seem very useful, I know.

That article I linked to is completely accurate except for maybe a couple minor mistakes.
Paul Jensen wrote:A lot of this looks similar to Kosinski compression.
It's identical as far as I can tell.

Paul Jensen
Interested
Posts: 32
Joined: Mon Apr 06, 2009 4:17 pm
Location: Hiroshima, Japan

Post by Paul Jensen » Fri Apr 24, 2009 4:09 am

TascoDLX wrote:Not quite that simple.

00xx duplicates a span from within the last 256 bytes written, where xx is the length (+2). It uses a single byte for the offset, where 0xFF equals -1 and 0x00 equals -256.

01 duplicates a span from within the last 8192 bytes written. It uses two bytes combined for the length and offset. Of those two bytes, the first byte is the lower 8 bits of the offset and the upper 5 bits of the second byte is the upper 5 bits of the offset (13 bits total). The bottom 3 bits of the second byte is the length (+2). If those 3 bits equal 0, an extra byte is used for the length (+1) allowing for a length up to 256 bytes. If that extra byte equals 0, the end of the file has been reached. If it equals 1, nothing is duplicated or written -- doesn't seem very useful, I know.

That article I linked to is completely accurate except for maybe a couple minor mistakes.
Ah, you're totally right. That explains the "extra" bytes I was seeing in the data. I guess I got thrown off by some of the wording used in the Kosinsky description. It all makes sense now. :) Thanks!

Post Reply