aPLib decruncher for 68000

Shiru · Post by **Shiru** » Sun Mar 07, 2010 6:16 pm

sega.s uses * and /* */ for comments.

SyX · Post by **SyX** » Sun Mar 07, 2010 8:50 pm

Well, I downloaded the SDK that appears in this thread and made the modifications to the comments and local labels that Chilly Willy suggested, and I don't get any errors when assembling.

And now I am reading the GENESIS Technical Overview; I would like to try something.

Thanks!!!

Shiru · Post by **Shiru** » Sun Mar 07, 2010 8:57 pm

SyX wrote: I made the modifications to the comments and local labels that Chilly Willy suggested, and I don't get any errors when assembling.

So why not post the corrected code here? I'm ready to test it.

SyX · Post by **SyX** » Sun Mar 07, 2010 9:41 pm

Shiru wrote: So why not post the corrected code here? I'm ready to test it.

Oops, OK. I didn't think it was necessary because it was straightforward (change ; to |, and for local labels the easiest fix was to delete the . that precedes them). Sorry for the inconvenience, here it is:

Code: Select all

*** CODE DELETED, LATEST VERSION IS IN THE FIRST POST ***

Chilly Willy · Post by **Chilly Willy** » Sun Mar 07, 2010 11:23 pm

That's what I often do with local labels like the ones the original code had: remove the . and make it a regular label. It works rather well when the labels are as specific as the ones used in the original code. More generic labels like ".loop1:" are easy enough to change to just "1:" (for example).

GManiac · Post by **GManiac** » Sun Mar 07, 2010 11:54 pm

Hmm, it would be interesting to compare compressors on parameters such as:
Compression ratio / Size of unpacker (for 68000) / Speed (for 68000) / Used RAM / Used Registers.
Competitors are, for example: RNC, aPLib, Hrust, BitBuster.
I found only sources of BitBuster for z80/ARM.
I have sources for RNC and for aPLib for 68000. Their compression ratios differ by ±1-2% from each other, but the aPLib depacker is smaller and a bit faster (1-2% in cycles).
I replaced the BSR calls to the .get_bit subroutine with macros, and the aPLib depacker became 1.5 times faster. Doing the same for .decode_gamma gives a further improvement of about 1.075 times.
I haven't tried this trick with RNC yet. Its usual depacker (the one many games use) is 398 bytes, which is already pretty large.
Also, RNC requires at least 192 bytes of RAM for arrays + RAM for stack.

So, I prefer aPLib.
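
To see why inlining get_bit can pay off, it helps to count cycles for the bit-fetch path alone. The sketch below is only a rough estimate. It assumes the standard MC68000 timings (BSR.B 18, RTS 16, SUBQ.B #1,Dn 4, taken Bcc.B 10, ADD.B Dn,Dn 4) and the subq/bne form of get_bit posted later in this thread; the names are illustrative, and the once-per-8-bits refill is ignored.

```python
# Rough 68000 cycle estimate for fetching one bit (common path, no refill).
# Timings are the standard MC68000 figures; the keys are illustrative labels.
CYCLES = {
    "bsr.b": 18,         # call into get_bit
    "rts": 16,           # return from get_bit
    "subq.b #1,d5": 4,   # decrement the bit counter
    "bne.b taken": 10,   # bits still left in d3
    "add.b d3,d3": 4,    # shift the next bit into carry
}

BODY = ["subq.b #1,d5", "bne.b taken", "add.b d3,d3"]


def cycles_per_bit(inline):
    """Cycles to fetch one bit, with or without the call/return overhead."""
    body = sum(CYCLES[op] for op in BODY)
    call = 0 if inline else CYCLES["bsr.b"] + CYCLES["rts"]
    return body + call


print(cycles_per_bit(inline=True))    # macro version
print(cycles_per_bit(inline=False))   # subroutine version
```

Under these assumptions the call and return cost more than the work itself, which is in line with a large overall speed-up from macros at the price of a bigger depacker.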

Shiru · Post by **Shiru** » Mon Mar 08, 2010 5:57 am

The latest version works great. It is (of course) much faster than the BitBuster C depacker. Thanks for the good work, SyX.

SyX wrote: Oops, OK. I didn't think it was necessary because it was straightforward

It is necessary if you want someone to use it. The thread starts with code which doesn't compile, followed by a series of changes to apply. Someone who is not good with assembly and GCC's inner workings will have problems, and may just skip the decompressor because the code in the first post is not ready to use. So, please, move the latest code to your first post and remove all other versions.

The only thing I'd add is the header skip. I don't want to fix the packer, and I doubt many people want to. In my test I simply skipped the header by advancing the source data pointer by 24, but this could be done in the assembly code instead, to make it more convenient to use. 24 bytes is not much overhead in most cases (if someone is really bothered by it, he still has the option to cut the header).
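
For anyone who would rather cut the header on the PC side before including the data, a minimal sketch follows. The 24-byte length is the figure from this post. The "AP32" tag check is an assumption about what the packer writes at the start of the file, so verify it against your own packed files; the function name is made up for illustration.

```python
HEADER_SIZE = 24        # header length quoted in this thread
HEADER_TAG = b"AP32"    # ASSUMPTION: tag expected at the start of the header


def skip_aplib_header(blob):
    """Return the raw aPLib stream, dropping the header when it is present."""
    if blob.startswith(HEADER_TAG) and len(blob) >= HEADER_SIZE:
        return blob[HEADER_SIZE:]
    return blob  # no recognisable header, assume the data is already raw


packed = HEADER_TAG + bytes(20) + b"\x41\x00"
print(skip_aplib_header(packed))
```

Checking the tag first means that running the tool twice on the same file should not eat 24 bytes of real data, barring a raw stream that happens to start with the same four bytes.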

GManiac wrote: Hmm, it would be interesting to compare compressors on parameters such as:
Compression ratio / Size of unpacker (for 68000) / Speed (for 68000) / Used RAM / Used Registers.
Competitors are, for example: RNC, aPLib, Hrust, BitBuster.
I found only sources of BitBuster for z80/ARM.

I've posted the BitBuster and Hrust depackers somewhere on this forum (BitBuster is also in the Uwol sources), but they are in pure C, so they will surely lose to an assembly version of any depacker. It would be nice to have an M68K assembly version of BitBuster, but I'm not good enough at M68K code to write an optimized version.

To test compression ratio we need to prepare a set of test files, which should include the data you'd most expect to compress: graphics, maps, etc.

SyX · Post by **SyX** » Mon Mar 08, 2010 9:41 am

GManiac wrote: I replaced the BSR calls to the .get_bit subroutine with macros, and the aPLib depacker became 1.5 times faster. Doing the same for .decode_gamma gives a further improvement of about 1.075 times.

Yes, of course, it is the fight since the beginning of time: "size vs speed".

Well, the first version used macros, but in the end I chose the subroutine approach for my 68Kung-fu.

Feel free to use whichever you think is more convenient for your code.

Shiru wrote: The latest version works great. It is (of course) much faster than the BitBuster C depacker. Thanks for the good work, SyX.

Thanks for all your suggestions for making a better release. I have added the header skip and put the latest version of the code in the first post, to make things easier for anyone interested in it.

GManiac · Post by **GManiac** » Mon Mar 08, 2010 4:28 pm

Here are some tests:
http://www.fileden.com/files/2009/4/23/ ... _tests.rar

I made the get_bit routine faster, so it's better to use my unit. I also added 2 preprocessor definitions. The size of the unpacker is 212 bytes (macro version).

This is the old version of get_bit:

Code: Select all

get_bit:
        subq.b  #1,d5
        bne.b   still_bits_left 
        moveq   #8,d5
        move.b  (a0)+,d3
still_bits_left:
        add.b   d3,d3

Here's new version:

Code: Select all

        dbra    d5,still_bits_left
        moveq   #7,d5
        move.b  (a0)+,d3        | Read next crunched byte
still_bits_left:
        add.b   d3,d3           | D3.b << 1 (lsl.b #1,d3 or roxl.b #1,d3)

There seems to be a bug in gas: if I place the get_bit: routine before decode_gamma:, gas reports an error. I can place get_bit: before aplib_decrunch:, but that's not a good choice.
Also, gas doesn't know .def, so I used .equ.

To compile the demos you need to place as, ld and objcopy from GCC into this folder (see as.bat) and run the batch files.
These demos will work on Kega and on hardware; other emulators don't play the sound. I switch sound on/off to count 68k cycles. I don't know another way to count them, since emulators don't support this.

alad.bin is taken from Aladdin, Tutorial screen, $C000 bytes of VRAM. Yes, most of this screen was originally compressed with RNC.
font.bin is an example of a font.

In my 2 tests aPLib decompression was 1.65 times faster than RNC, taking about 65-80 cycles per byte. That's a good result, given that a simple byte-copy loop

Code: Select all

cycle:
        move.b (a0)+,(a1)+
        dbra d0,cycle

takes 22 cycles per byte.
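
The 22-cycle figure and the comparison against it can be reproduced with a little arithmetic. This is only a sketch: it uses the standard MC68000 timings for the two instructions of the loop (12 cycles for the move, 10 for a taken dbra), and the constant and function names are invented for illustration.

```python
# Standard MC68000 timings for the copy loop quoted above.
MOVE_B_POSTINC = 12   # move.b (a0)+,(a1)+
DBRA_TAKEN = 10       # dbra d0,cycle (branch taken)
COPY_CYCLES_PER_BYTE = MOVE_B_POSTINC + DBRA_TAKEN


def relative_cost(depack_cycles_per_byte):
    """How many plain-copy bytes one depacked byte costs."""
    return depack_cycles_per_byte / COPY_CYCLES_PER_BYTE


for cycles in (65, 80):
    print(cycles, round(relative_cost(cycles), 2))
```

By these figures, depacking at 65-80 cycles per byte costs roughly 3 to 3.6 times as much as a plain copy.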

On the combined criteria I'd prefer aPLib to RNC. Hrust and BitBuster are weaker but MAYBE faster; we need to verify that.

Shiru · Post by **Shiru** » Mon Mar 08, 2010 4:38 pm

GManiac wrote:Hrust and Bitbuster are weaker

Get SixPack, and try to pack /demo/smd/preview.bmp. You'll see a case where BitBuster is more effective than aPLib.

By the way, I've just found a version of BitBuster for MSX which depacks data directly to VRAM, and it was based on .. aPLib ported to Z80 for Sega 8-bit consoles, which also has a version that depacks directly to VRAM.

GManiac · Post by **GManiac** » Mon Mar 08, 2010 4:51 pm

Well, of course I know that there is NO single BEST compressor. Sometimes one is better, sometimes another. But one can be better more often. That's why we need combined criteria that take into account speed / size / RAM, etc.

preview.bmp is not the kind of data you'd expect to decompress in MD games. Real MD graphics come with an additional tilemap and are usually more complex.

Anyhow, aplib overtook RNC.

SyX · Post by **SyX** » Mon Mar 08, 2010 6:39 pm

GManiac wrote: I made the get_bit routine faster, so it's better to use my unit. I also added 2 preprocessor definitions. The size of the unpacker is 212 bytes (macro version).

This is the old version of get_bit:

Code: Select all

get_bit:
        subq.b  #1,d5
        bne.b   still_bits_left
        moveq   #8,d5
        move.b  (a0)+,d3
still_bits_left:
        add.b   d3,d3

Here's the new version:

Code: Select all

        dbra    d5,still_bits_left
        moveq   #7,d5
        move.b  (a0)+,d3        | Read next crunched byte
still_bits_left:
        add.b   d3,d3           | D3.b << 1 (lsl.b #1,d3 or roxl.b #1,d3)

Good optimization!!! But remember that with the change to dbra you need to change the initialization of D5 (I saw that you have it in your sources), from:

Code: Select all

moveq   #1,d5           ; Initialize bits counter

to:

Code: Select all

moveq   #0,d5           ; Initialize bits counter
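
The reason for the new initial value can be checked with a small host-side model of both bit readers. This is a sketch, not the depacker itself: it only mimics the D5 counter and the D3 shift register, it assumes D3 starts at 0, and the function names are invented for illustration.

```python
def bits_old(data, count, d5=1):
    """Model of the subq.b/bne.b reader: D5 counts 8..1 and starts at 1."""
    d3, pos, out = 0, 0, []
    for _ in range(count):
        d5 = (d5 - 1) & 0xFF          # subq.b #1,d5
        if d5 == 0:                   # bne.b not taken: refill
            d5 = 8
            d3 = data[pos]
            pos += 1
        out.append(d3 >> 7)           # add.b d3,d3 moves bit 7 into carry
        d3 = (d3 << 1) & 0xFF
    return out


def bits_new(data, count, d5=0):
    """Model of the dbra reader: D5 counts 7..0 and starts at 0."""
    d3, pos, out = 0, 0, []
    for _ in range(count):
        d5 = (d5 - 1) & 0xFFFF        # dbra decrements the low word of D5
        if d5 == 0xFFFF:              # counter expired: fall through and refill
            d5 = 7
            d3 = data[pos]
            pos += 1
        out.append(d3 >> 7)
        d3 = (d3 << 1) & 0xFF
    return out


data = bytes([0xA5, 0x3C])
print(bits_old(data, 16) == bits_new(data, 16))        # same bit stream
print(bits_new(data, 16, d5=1) == bits_old(data, 16))  # wrong start value
```

With D5 left at 1, the dbra version takes the branch on the very first call and shifts a bit out of D3 before any byte has been loaded, so every following bit arrives one position late.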

On the subject of other crunchers, another interesting one would be Exomizer. It compresses a little better than aPLib and there are two versions of the decompressor: a normal one and a streaming one (a little slower, but ideal for decrunching in the background, for example). Of course it has disadvantages: it needs 156 bytes of RAM and there is no 68000 version, but somebody can correct that last defect.

Shiru · Post by **Shiru** » Mon Mar 15, 2010 10:29 pm

I've just released an updated version of Uwol, with a fix for a major bug and a few minor changes; one of the changes was replacing the packer. It now uses aPLib with SyX's depacker: it works faster, and in the case of this game it also has a better compression ratio.

r57shell · Post by **r57shell** » Fri Jun 28, 2013 11:52 am

Some optimizations.

Code: Select all

; -------------------------------------------------------------------------------------------------
; Aplib decruncher for MC68000 "gcc version"
; by MML 2010
; Size optimized (164 bytes) by Franck "hitchhikr" Charlet.
; More optimizations by r57shell.
; -------------------------------------------------------------------------------------------------

; Make the function visible to the linker
;.global aplib_decrunch

; -------------------------------------------------------------------------------------------------
; aplib_decrunch: A0 = Source / A1 = Destination
; -------------------------------------------------------------------------------------------------
aplib_decrunch:         movem.l a2-a5/d2-d5,-(a7)
                        lea     32000.w,a3
                        lea     1280.w,a4
                        lea     128.w,a5
                        moveq   #-$80,d3
copy_byte:              move.b  (a0)+,(a1)+
next_sequence_init:     moveq   #2,d1           ; Initialize LWM
next_sequence:          bsr.b   get_bit
                        bcc.b   copy_byte       ; if bit sequence is %0..., then copy next byte
                        bsr.b   get_bit
                        bcc.b   code_pair       ; if bit sequence is %10..., then is a code pair
                        moveq   #0,d0           ; offset = 0 (eor.l d0,d0)
                        bsr.b   get_bit
                        bcc.b   short_match     ; if bit sequence is %110..., then is a short match

; The sequence is %111..., the next 4 bits are the offset (0-15)
                        moveq   #4-1,d5
get_3_bits:             bsr.b   get_bit
                        roxl.l  #1,d0           ; addx.l  d0,d0 <- my bug, Z flag only cleared, not SET
                        dbf     d5,get_3_bits   ; (dbcc doesn't modify flags)
                        beq.b   write_byte      ; if offset == 0, then write 0x00

                        ; If offset != 0, then write the byte on destination - offset
                        move.l  a1,a2
                        suba.l  d0,a2
                        move.b  (a2),d0
write_byte:             move.b  d0,(a1)+
                        bra.b   next_sequence_init

; Short match %110...
short_match:            moveq   #3,d2           ; length = 3
                        move.b  (a0)+,d0        ; Get offset (offset is 7 bits + 1 bit to mark if copy 2 or 3 bytes)
                        lsr.b   #1,d0
                        beq.b   end_decrunch    ; if offset == 0, end of decrunching
                        bcs.b   domatch_new_lastpos
                        moveq   #2,d2           ; length = 2
                        bra.b   domatch_new_lastpos

; Code pair %10...
code_pair:              bsr.b   decode_gamma
                        sub.l   d1,d2           ; offset -= LWM
                        bne.b   normal_code_pair
                        move.l  d4,d0           ; offset = old_offset
                        bsr.b   decode_gamma
                        bra.b   copy_code_pair
normal_code_pair:       subq.l  #1,d2           ; offset -= 1
                        lsl.l   #8,d2           ; offset << 8
                        move.b  (a0)+,d2        ; get the least significant byte of the offset (16 bits)
                        move.l  d2,d0
                        bsr.b   decode_gamma
                        cmp.l   a3,d0           ; >=32000
                        bge.b   domatch_with_2inc
compare_1280:           cmp.l   a4,d0           ; >=1280 <32000
                        bge.b   domatch_with_inc
compare_128:            cmp.l   a5,d0           ; >=128 <1280
                        bge.b   domatch_new_lastpos
domatch_with_2inc:      addq.l  #1,d2
domatch_with_inc:       addq.l  #1,d2
domatch_new_lastpos:    move.l  d0,d4           ; old_offset = offset
copy_code_pair:         subq.l  #1,d2           ; length--
                        move.l  a1,a2
                        suba.l  d0,a2
loop_do_copy:           move.b  (a2)+,(a1)+
                        dbf     d2,loop_do_copy
                        moveq   #1,d1           ; LWM = 1
                        bra.b   next_sequence   ; Process next sequence

; get_bit: Get bits from the crunched data (D3) and insert the most significant bit in the carry flag.
get_bit:                add.b   d3,d3
                        bne.b   still_bits_left
                        move.b  (a0)+,d3        ; Read next crunched byte
                        addx.b  d3,d3
still_bits_left:        rts

; decode_gamma: Decode values from the crunched data using gamma code
decode_gamma:           moveq   #1,d2
get_more_gamma:         bsr.b   get_bit
                        addx.l  d2,d2
                        bsr.b   get_bit
                        bcs.b   get_more_gamma
                        rts

end_decrunch:           movem.l (a7)+,a2-a5/d2-d5
                        rts

I have tested only one archive. So, it may be buggy.
Edit: line 32 fixed, tricky bug. Thanks to Ti_.
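
Since only one archive was tested, a host-side reference model is handy for cross-checking more files against the 68000 output. Below is a minimal sketch of the same bit-stream logic in Python. It assumes headerless data (header already skipped), the names are illustrative, and it is not the official aPLib source.

```python
def aplib_depack(src):
    """Decode a headerless aPLib stream and return the unpacked bytes."""
    out = bytearray()
    pos = 0            # read position in src (A0 in the assembly)
    tag = 0            # current tag byte
    bits_left = 0      # bits remaining in tag

    def get_bit():
        nonlocal pos, tag, bits_left
        if bits_left == 0:             # tag exhausted: read the next one
            tag = src[pos]
            pos += 1
            bits_left = 8
        bits_left -= 1
        bit = (tag >> 7) & 1           # bits are consumed MSB first
        tag = (tag << 1) & 0xFF
        return bit

    def get_gamma():
        value = 1
        while True:
            value = (value << 1) | get_bit()
            if not get_bit():          # continuation bit clear: done
                return value

    def copy_match(offset, length):
        for _ in range(length):        # byte by byte, so overlaps work
            out.append(out[-offset])

    out.append(src[pos])               # first byte is always a literal
    pos += 1
    last_offset = 0
    after_match = False                # the LWM flag of the assembly

    while True:
        if not get_bit():              # %0: literal byte
            out.append(src[pos])
            pos += 1
            after_match = False
        elif not get_bit():            # %10: code pair
            offset = get_gamma()
            if not after_match and offset == 2:
                offset = last_offset   # reuse the previous offset
                length = get_gamma()
            else:
                offset -= 2 if after_match else 3
                offset = (offset << 8) | src[pos]
                pos += 1
                length = get_gamma()
                if offset >= 32000:
                    length += 1
                if offset >= 1280:
                    length += 1
                if offset < 128:
                    length += 2
            copy_match(offset, length)
            last_offset = offset
            after_match = True
        elif not get_bit():            # %110: short match, or end marker
            byte = src[pos]
            pos += 1
            offset = byte >> 1
            if offset == 0:
                break                  # offset 0 ends the stream
            copy_match(offset, 2 + (byte & 1))
            last_offset = offset
            after_match = True
        else:                          # %111: 4-bit offset, single byte
            offset = 0
            for _ in range(4):
                offset = (offset << 1) | get_bit()
            out.append(out[-offset] if offset else 0)
            after_match = False
    return bytes(out)


print(aplib_depack(bytes([0x41, 0x6C, 0x42, 0x05, 0x00])))
```

The sample stream is hand-built: a literal A, a literal B, a 3-byte short match at offset 2, then the end marker. Feeding the same packed files to this model and to the assembly routine, then comparing the two outputs, should show where a port goes wrong.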

And, I made my own aplib packer. Better packing, more time needed.
Profit?! From my tests:
693 762 bytes input.
334 520 bytes official packer.
331 982 bytes my packer.
2538 bytes profit

= 0.36% of input, 0.76% of official output.
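
The percentages can be rechecked from the three sizes listed above. A small sketch, with variable names chosen only for illustration:

```python
INPUT_SIZE = 693762      # bytes of input
OFFICIAL_SIZE = 334520   # official packer output
CUSTOM_SIZE = 331982     # output of the packer from this post

saved = OFFICIAL_SIZE - CUSTOM_SIZE


def percent(part, whole):
    """part as a percentage of whole."""
    return 100.0 * part / whole


print(saved)
print(round(percent(saved, INPUT_SIZE), 3))      # gain relative to the input
print(round(percent(saved, OFFICIAL_SIZE), 3))   # gain relative to official output
print(round(percent(OFFICIAL_SIZE, INPUT_SIZE), 2),
      round(percent(CUSTOM_SIZE, INPUT_SIZE), 2))
```

Put another way, the packed size drops from about 48.2% to about 47.9% of the input on this test set.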
I like it

http://elektropage.ru/r57shell/aplib_pack.exe

Ahh... Here is aPLib binary for packing and unpacking files without header:
http://elektropage.ru/r57shell/appack_raw.exe

Ti_ · Post by **Ti_** » Sat Jun 29, 2013 11:57 am

r57shell wrote: And, I made my own aplib packer. Better packing, more time needed.

The ends of files are corrupted (I think past 32 KB) [tried with the original version].
With your optimized version the output is slightly corrupted everywhere.