32-bit integer to string routine

Sik · Post by **Sik** » Mon Feb 13, 2017 12:03 pm

I may make a 16-bit variant later (which would be faster due to many less iterations and rotations). Also there are more optimizations that could be done, as mentioned in the comments. Anyway, here you go. May want to adapt to GAS' syntax if you want to wrap it into a C function (could be useful for sprintf?).

Yes ideally you should avoid needing this in the first place, but every so often we'll run into the need to show a number that wasn't BCD originally. Especially those that use C since it doesn't have native BCD support.

EDIT: aaaaand turns out I missed an obvious one (BCD can operate directly on memory, and while memory accesses are slow it's still nothing compared to all the rotations that were going on). Putting here the optimized routine now. The other optimizations mentioned are still relevant though.

EDIT 2: fixed what should have been an obvious bug (・＿・) I swear I had tested this earlier, must have broken when I did the optimization.

Code: Select all

;****************************************************************************
; IntToAscii, UintToAscii
; Converts a 32-bit integer into a nul-terminated ASCII string.
; IntToAscii is signed, UintToAscii is unsigned.
;----------------------------------------------------------------------------
; input d7.l ... 32-bit integer
; input a6.l ... Pointer to buffer
;----------------------------------------------------------------------------
; breaks: d5-d7, a4-a6
;----------------------------------------------------------------------------
; notes: 12 bytes are needed in the buffer.
;****************************************************************************

IntToAscii:
    tst.l   d7                          ; Is the value negative?
    bpl.s   UintToAscii
    move.b  #'-', (a6)+
    neg.l   d7

;----------------------------------------------------------------------------

UintToAscii:
    movem.l d0-d3, -(sp)                ; Save registers
    
    moveq   #0, d6                      ; The ugly part: converting from
    move.l  d6, -(sp)                   ; binary into BCD
    move.w  d6, -(sp)
    lea     1(sp), a5                   ; The way this algorithm works is
    lea     @Table(pc), a4              ; that we have BCD values 1, 2, 4...
    moveq   #32-1, d6                   ; every power of 2. Then we scan
@Loop:                                  ; through each bit and sum up the
    lsr.l   #1, d7                      ; corresponding BCD values.
    bcc.s   @Zero
    and.b   #$00, ccr                   ; Definitely faster than doing many
    addq.w  #5, a4                      ; 32-bit divisions (worse than it
    addq.w  #5, a5                      ; sounds since 68000 can only give
    abcd.b  -(a4), -(a5)                ; 16-bit quotients, uuuugh). Still
    abcd.b  -(a4), -(a5)                ; could be optimized further by
    abcd.b  -(a4), -(a5)                ; handling several bits at a time
    abcd.b  -(a4), -(a5)                ; (and the fact that not all digits
    abcd.b  -(a4), -(a5)                ; need to be added early on).
@Zero:
    addq.w  #5, a4
    dbf     d6, @Loop
    move.b  1(sp), d3
    move.l  2(sp), d2
    addq.w  #6, sp
    
    tst.b   d3                          ; Convert the first two digits
    beq.s   @No1stTwo                   ; (if any, that is)
    move.b  d3, d7
    lsr.b   #4, d7
    beq.s   @No1stOne
    add.b   #'0', d7
    move.b  d7, (a6)+
@No1stOne:
    and.b   #$0F, d3
    add.b   #'0', d3
    move.b  d3, (a6)+
    moveq   #7-1, d6
    rol.l   #4, d2
    bra.s   @NoZeroSkip
@No1stTwo:
    
    moveq   #7-1, d6                    ; Skip leading zeroes
@ZeroSkip:
    rol.l   #4, d2
    move.b  d2, d5
    and.b   #$0F, d5
    bne.s   @NoZeroSkip
    dbf     d6, @ZeroSkip
    rol.l   #4, d2
@NoZeroSkip:
    
    addq.w  #1, d6                      ; Convert remaining digits
@DigitLoop:
    move.b  d2, d5
    and.b   #$0F, d5
    add.b   #'0', d5
    move.b  d5, (a6)+
    rol.l   #4, d2
    dbf     d6, @DigitLoop
    
    clr.b   (a6)                        ; End the string here
    
    movem.l (sp)+, d0-d3                ; Restore registers
    rts                                 ; End of subroutine

;----------------------------------------------------------------------------

@Table:
    dc.b    $00,$00,$00,$00,$01
    dc.b    $00,$00,$00,$00,$02
    dc.b    $00,$00,$00,$00,$04
    dc.b    $00,$00,$00,$00,$08
    dc.b    $00,$00,$00,$00,$16
    dc.b    $00,$00,$00,$00,$32
    dc.b    $00,$00,$00,$00,$64
    dc.b    $00,$00,$00,$01,$28
    dc.b    $00,$00,$00,$02,$56
    dc.b    $00,$00,$00,$05,$12
    dc.b    $00,$00,$00,$10,$24
    dc.b    $00,$00,$00,$20,$48
    dc.b    $00,$00,$00,$40,$96
    dc.b    $00,$00,$00,$81,$92
    dc.b    $00,$00,$01,$63,$84
    dc.b    $00,$00,$03,$27,$68
    dc.b    $00,$00,$06,$55,$36
    dc.b    $00,$00,$13,$10,$72
    dc.b    $00,$00,$26,$21,$44
    dc.b    $00,$00,$52,$42,$88
    dc.b    $00,$01,$04,$85,$76
    dc.b    $00,$02,$09,$71,$52
    dc.b    $00,$04,$19,$43,$04
    dc.b    $00,$08,$38,$86,$08
    dc.b    $00,$16,$77,$72,$16
    dc.b    $00,$33,$55,$44,$32
    dc.b    $00,$67,$10,$88,$64
    dc.b    $01,$34,$21,$77,$28
    dc.b    $02,$68,$43,$54,$56
    dc.b    $05,$36,$87,$09,$12
    dc.b    $10,$73,$74,$18,$24
    dc.b    $21,$47,$48,$36,$48

r57shell · Post by **r57shell** » Tue Feb 14, 2017 5:02 pm

It's not obvious that abcd stuff is faster than division.
For better understanding it would be nice to have division code too.
And, both should have timings listed I guess.

From my opinion, easier & faster just subtract 36 times.
But we will see only after comparing.

Natsumi · Post by **Natsumi** » Tue Feb 14, 2017 5:12 pm

I dont know which of these are faster, but I thought to share this anyway, in case someone finds it helpful;

Code: Select all

; ===========================================================================
; Routine to translate an integer to null-terminated decimal string
; input:
;  d0.l - integer
;  d1 - number of digits -1
;  d2 - whether you want trailing 0's (0 - no, $FF - yes)
; output:
;  a1 - buffer address
; ===========================================================================
IntToDecString:
		lea	_IntToStringOffs.w,a1
		lea	.multbl(pc),a2
		move.w	d1,d3			; copy digit count

.rightmul	subq.w	#4,a2			; go to earlier multiplication
		dbf	d3,.rightmul		; loop til we got it right

.digitsloop	moveq	#'0',d3			; clear the digit counter
		move.l	(a2)+,d4		; get addition ctr

.numloop	sub.l	d4,d0			; sub the count from the output num
		bcs.s	.uflow			; if we underflowed, branch
		addq.w	#1,d3			; add 1 to digit ctr
		bra.s	.numloop

.uflow		add.l	d4,d0			; fix the number

		cmp.b	#'0',d3			; check if d3 = 0
		beq.s	.is0			; branch if so
		st	d2			; allow writes (if d2 was 0, then no trailing 0's will appear)

.is0		tst.b	d2			; check if writes are enabled
		bpl.s	.chkloop		; if not, branch
		move.b	d3,(a1)+		; write number

.chkloop	dbf	d1,.digitsloop		; loop for all numbers
		clr.b	(a1)			; terminate
		lea	_IntToStringOffs.w,a1
		rts

	dc.l 10000000
	dc.l 1000000
	dc.l 100000
	dc.l 10000
	dc.l 1000
	dc.l 100
	dc.l 10
	dc.l 1
.multbl

Had this code lying around for debugging options as well as quick conversion of bytes to decimal.

BigEvilCorporation · Post by **BigEvilCorporation** » Fri Feb 17, 2017 5:35 pm

Welp, my implementation is now officially embarrassing!

Sik · Post by **Sik** » Fri Mar 03, 2017 8:29 pm

And this is why I'm not ever getting hired anywhere.

flamewing · Post by **flamewing** » Sun Apr 02, 2017 12:04 am

Here is my contribution to this topic: I am presenting 3 routines to do packed BCD conversion of a 32-bit unsigned integer. They pick up the number at d7 and store the packed BCD representation at the buffer at a6.

Why doing integer to packed BCD? Well, because (1) this is the slow part on all code presented, (2) this levels the playing field by making input and output the same for all 3 routines and (3) from the packed BCD, converting to ASCII for printing is trivial (and this conversion can also be optimized and inlined to not depend on the intermediate buffer for added speed).

caveat: all code is untested, and likely buggy; it works in theory.

The first routine is based on the "shift and add 3" algorithm (also called "double dabble"), which is basically what Sik used. My version is optimized to use registers instead of RAM, keeping two digits per register; and it only uses registers when they become absolutely necessary in the worst case possible (0xffffffff). The code:

Code: Select all

; Modified shift-and-add-3 method
Long2BCD_shift_add_3:
        moveq   #0,d0                       ;  4(1/0); Clear d0
        moveq   #0,d1                       ;  4(1/0); Clear d1
        moveq   #0,d2                       ;  4(1/0); Clear d2
        moveq   #0,d3                       ;  4(1/0); Clear d3
        moveq   #$7,d4                      ;  4(1/0); Clear d4 with 3-bit mask
        swap    d7                          ;  4(1/0); So we can use word operations
        rol.w   #3,d7                       ; 12(1/0); Fetch top 3 bits
        and.b   d7,d4                       ;  4(1/0); Copy them over to d4
        ; 40(8/0)
    rept 3
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
    endm
        ; 30(6/0)
        ; We now have moved 6 bits; we now run the risk of a carry
        ; out of d4 when moving another bit to d4.
        ; We can move 3 bits to d3 without worrying about decimal
        ; correction, so we will use addx to save a few cycles.
    rept 3
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        addx.b  d3,d3                       ;  4(1/0); Move carry to d3
    endm
        ; 42(9/0)
    rept 4
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        abcd.b  d3,d3                       ;  6(1/0); Move carry to d3
    endm
        ; 64(12/0)
        ; We now have moved 13 bits; we now run the risk of a carry
        ; out of d3 when moving another bit to d4.
        ; We can move 3 bits to d2 without worrying about decimal
        ; correction, so we will use addx to save a few cycles.
    rept 3
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        abcd.b  d3,d3                       ;  6(1/0); Move carry to d3
        addx.b  d2,d2                       ;  4(1/0); Move carry to d2
    endm
        ; 60(12/0)
        ; We have moved 16 bits out of d7; we can now swap and
        ; use word operations to gain cycles.
        swap    d7                          ;  4(1/0)
        ;  4(1/0)
    rept 3
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        abcd.b  d3,d3                       ;  6(1/0); Move carry to d3
        abcd.b  d2,d2                       ;  6(1/0); Move carry to d2
    endm
        ; 66(12/0)
        ; We now have moved 19 bits; we now run the risk of a carry
        ; out of d2 when moving another bit to d4.
        ; We can move 3 bits to d1 without worrying about decimal
        ; correction, so we will use addx to save a few cycles.
    rept 3
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        abcd.b  d3,d3                       ;  6(1/0); Move carry to d3
        abcd.b  d2,d2                       ;  6(1/0); Move carry to d2
        addx.b  d1,d1                       ;  4(1/0); Move carry to d1
    endm
        ; 78(15/0)
    rept 4
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        abcd.b  d3,d3                       ;  6(1/0); Move carry to d3
        abcd.b  d2,d2                       ;  6(1/0); Move carry to d2
        abcd.b  d1,d1                       ;  6(1/0); Move carry to d1
    endm
        ; 112(20/0)
        ; We now have moved 26 bits; we now run the risk of a carry
        ; out of d1 when moving another bit to d4.
        ; We can move 3 bits to d0 without worrying about decimal
        ; correction, so we will use addx to save a few cycles.
    rept 3
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        abcd.b  d3,d3                       ;  6(1/0); Move carry to d3
        abcd.b  d2,d2                       ;  6(1/0); Move carry to d2
        abcd.b  d1,d1                       ;  6(1/0); Move carry to d1
        addx.b  d0,d0                       ;  4(1/0); Move carry to d0
    endm
        ; 96(18/0)
    rept 3
        add.w   d7,d7                       ;  4(1/0); Get top bit
        abcd.b  d4,d4                       ;  6(1/0); Move it over to d4
        abcd.b  d3,d3                       ;  6(1/0); Move carry to d3
        abcd.b  d2,d2                       ;  6(1/0); Move carry to d2
        abcd.b  d1,d1                       ;  6(1/0); Move carry to d1
        abcd.b  d0,d0                       ;  6(1/0); Move carry to d0
    endm
        ; 102(18/0)
        move.b  d0,(a6)+                    ;  8(1/1); save output
        move.b  d1,(a6)+                    ;  8(1/1); save output
        move.b  d2,(a6)+                    ;  8(1/1); save output
        move.b  d3,(a6)+                    ;  8(1/1); save output
        move.b  d4,(a6)+                    ;  8(1/1); save output
        rts                                 ; 16(4/0)
        ; 56(9/5)
        ; GRAND TOTAL: 750(140/5)

Time: 750(140/5) for all inputs.

The second routine is based on Natsumi's code (which is based on Sonic's HUD code). I modified it to remove the digit counting (all the other routines can be modified for this purpose, so this levels the playing field), and I used rept and code generation to reduce the loop overhead and use smaller operations when possible to save cycles. Here it is:

Code: Select all

; Sonic's method of repeated subtraction
Long2BCD_rept_sub:
        move.l  a1,d4                       ;  4(1/0); Save a1
        lea     .multbl(pc),a2              ;  8(2/0); Load LUT
        ; 12(3/0)
c := 0
    rept 9
      if (c#2)==0
        moveq   #0,d3                       ;  4(1/0); Clear d3
      else
        lsl.w   #4,d3                       ; 14(1/0); Make space for another digit
      endif
      if c<5
        move.l  (a2)+,d4                    ; 12(3/0); Fetch next power of 10
      else
        move.w  (a2)+,d4                    ;  8(2/0); Fetch next power of 10
      endif

.loop:
      if c<5
        sub.l   d4,d7                       ;  8(1/0); Subtract power of 10
      else
        sub.w   d4,d7                       ;  4(1/0); Subtract power of 10
      endif
        bcs.s   .break                      ; T: 10(2/0); N: 8(1/0); Branch on borrow
        addq.w  #1,d3                       ;  4(1/0); Increase digit
        bra.s   .loop                       ; 10(2/0); Loop until borrow

.break:
      if (c#2)==1
        move.b  d3,(a6)+                    ;  8(1/1); Save output
      endif
c := c + 1
    endm
    ; Minimum: 346(63/4)
    ; Maximum: 2778(495/4)
    ; Average: 1562(279/4)
        lsl.w   #4,d3                       ; 14(1/0); Make space for another digit
        add.w   d7,d3                       ;  4(1/0); Last digit
        move.b  d3,(a6)+                    ;  8(1/1); Save output
        movea.l d4,a1                       ;  4(1/0); Restore a1
        rts                                 ; 16(4/0)
        ; 46(8/1)
        ; Totals:
        ; * assuming all loops take the minimum time:
        ;   404(74/5)
        ; * assuming all loops take the maximum time:
        ;   2836(506/5)
        ; * assuming all loops take average time:
        ;   1620(290/5)
;------------------------------------------------------------------------------
.mult_tbl:
        dc.l 1000000000
        dc.l 100000000
        dc.l 10000000
        dc.l 1000000
        dc.l 100000
        dc.w 10000
        dc.w 1000
        dc.w 100
        dc.w 10
        ;dc.w 1

Time: 404(74/5) in the best case (d7 = 0), 2836(506/5) in the worst case (d7 = $FFFFFFFF), 1620(290/5) on average.

The third version uses divu to get 3 digits at a time in each register, and uses a 1000-word LUT to convert these to BCD. While this may seem impossible at first glance (because 68k is limited to 16-bit divisor and 16-bit quotient), there is a way around this: I prove a theorem in the comments which allows me to obtain the quotient and remainder of division by 1,000,000; I then divide both quotient and remainder by 1000 to obtain 1 group with 1 digit (billions) and 3 groups with 3 digits (millions, thousands, units) into separate registers. These are converted to BCD by LUT and combined together for output. The code:

Code: Select all

; Division+LUT method
Long2BCD_divs_lut:
        ; Lets divide the input into groups of 3 decimal digits with
        ; divisions. Due to 68k limitations, we will have to make use
        ; of the following
        ; =========================================================================
        ; THEOREM: let a, b and c be positive integers such that
        ; c = a * b, and let x be an arbitrary positive number.
        ; Then:
        ;    (1) floor(x / c) = floor(floor(x / a) / b)
        ;    (2) x mod c = a * (floor(x / a) mod b) + (x mod a)
        ; PROOF: By definition of floor and mod, we have that
        ;    x / c = floor(x / c) + (x mod c) / c
        ; Moreover, since c = a * b,
        ;    x / c = (x / a) / b = floor(x / a) / b + (x mod a) / (a * b) =
        ;          = floor(floor(x / a) / b) + (floor(x / a) mod b) / b + (x mod a) / (a * b)
        ; So if we can prove that
        ;    0 <= (floor(x / a) mod b) / b + (x mod a) / (a * b) < 1
        ; then, because floor(floor(x / a) / b) and floor(x / c) are integers,
        ; we will have shown that both (1) and (2) are true.
        ; By definition of mod, we have that
        ;    0 <= m mod n <= n - 1 for all positive integers m, n.
        ; Therefore,
        ;    0 <= (floor(x / a) mod b) / b + (x mod a) / (a * b) <=
        ;      <= (b - 1) / b + (a - 1) / (a * b) = (a * b - a + a - 1) / (a * b) =
        ;      = (a * b - 1) / (a * b) < 1
        ; and the theorem is proved.
        ; So how is the theorem going to be useful? Well, we will pre-divide both
        ; numerator (d7) and denominator (1000000) by 16. This will put both of
        ; them in range for a 68k divu. This will give us the correct quotient
        ; by (1). Saving the low 4 bits of d7 will allow us to recover the correct
        ; remainder by (2). For our case, a = 16 and b = 1000000/16 = 62500.
        ; =========================================================================
        ; Armed with that, let us divide the original number by 1000000 by
        ; first dividing it by 16 (right shift by 4), then dividing it by
        ; (1000000/16) = 62500.
        moveq   #$F,d4                      ;  4(1/0); Mask for low 4 bits
        and.b   d7,d4                       ;  4(1/0); Fetch them
        lsr.l   #4,d7                       ; 16(1/0); Divide by 16
        divu.w  #1000000/16,d7              ; 84-148(3/0); Low word of d7 = d7 / 1000000
        move.w  d7,d1                       ;  4(1/0); Copy it to d1
        ; We now have to divide by 1000 a few times; so cache it.
        move.w  #1000,d3                    ;  8(2/0)
        ext.l   d1                          ;  4(1/0); Prepare for another division
        divu.w  d3,d1                       ; 76-140(1/0); d1 = billions digits
        move.w  d1,d0                       ;  4(1/0); d0 = billions digits
        swap    d1                          ;  4(1/0); d1 = millions digits
        ; Now fix remainder in d7
        clr.w   d7                          ;  4(1/0); Clear top bits
        swap    d7                          ;  4(1/0); Fetch remainder of division by 62500
        lsl.l   #4,d7                       ; 16(1/0); Multiply by 16
        add.l   d4,d7                       ;  8(1/0); Add in the remainder of division by 16
        ; d7 is <= 999999 here, so we can just divide by 1000
        divu.w  d3,d7                       ; 76-140(1/0); d7 = thousands digits
        ; Just to recap:
        ;    d0 = billions digits (1 digit: [0-4])
        ;    d1 = millions digits ([0-999])
        ;    d7, low word = thousands digits ([0-999])
        ;    d7, high word = units digits ([0-999])
        ; Lets convert it all to BCD. d0 is already there.
        add.w   d1,d1                       ;  4(1/0); Convert to LUT index
        add.l   d7,d7                       ;  8(1/0); Convert to LUT index; we never carry between words here
        ; Now lets combine this in useful groups
        move.w  .bcd_lut(pc,d1.w),d1        ; 14(3/0); Convert to BCD
        move.w  .bcd_lut(pc,d7.w),d7        ; 14(3/0); Convert to BCD
        swap    d1                          ;  4(1/0); Flip words
        move.w  d0,d1                       ;  4(1/0); Copy billions digit to d1
        swap    d1                          ;  4(1/0); Billions high, millions low
        lsl.w   #4,d1                       ; 14(1/0); Align top nibble
        lsr.l   #4,d1                       ; 16(1/0); Make 4-digit BCD number
        move.w  d1,(a6)+                    ;  8(1/1); Print it
        swap    d7                          ;  4(1/0); Flip words
        move.w  .bcd_lut(pc,d7.w),d7        ; 14(3/0); Convert to BCD
        lsl.w   #4,d7                       ; 14(1/0); Align top nibble
        lsr.l   #4,d7                       ; 16(1/0); Make 6-digit BCD number
        swap    d7                          ;  4(1/0); Flip words
        move.b  d7,(a6)+                    ;  8(1/1); Print it
        swap    d7                          ;  4(1/0); Flip words
        move.w  d7,(a6)+                    ;  8(1/1); Print it
        rts                                 ; 16(4/0)
        ; Totals
        ; * assuming all divisions take the minimum time:
        ;   494(46/3)
        ; * assuming all divisions take the maximum time:
        ;   686(46/3)
        ; * assuming all divisions take average time:
        ;   590(46/3)
;------------------------------------------------------------------------------
; Lookup table with 1000 words
.bcd_lut:
c := 0
    rept 1000
        dc.w ((c / 100)<<8)+(((c / 10) # 10)<<4) + (c # 10)
c := c + 1
    endm

Time: with best case divisions, 494(46/3); with worst-case divisions, 686(46/3); on average, 590(46/3).

As I mentioned, all of these can be optimized by inlining the conversion to ASCII and printing instead of using the buffer. The divs+LUT routine also benefits from making the LUT have the digits in the opposite order, which would also speed up printing.

In any event: the overall winner is the divs+LUT method if all you care about is time, the modified shift and add 3 method if you need constant time conversion, or Sonic's/Natsumi's for smaller code.

Edit: I updated the modified shift-and-add-3 version to use swap and word operations at the beginning, saving 50 cycles but needing one more read cycle.

Sik · Post by **Sik** » Wed Apr 05, 2017 7:30 pm

I'm surprised you optimized my BCD routine without doing the look-up stuff on that one even though I mentioned that =P the idea being processing multiple bits together (which is precisely where I slacked off... I need to get that sorted out some day). Converting from 0 to 63 requires only one BCD byte, up to 8191 only two BCD bytes. That's 13 bits out of 32 that would get done in just two iterations (and you wouldn't even need to add all bytes). Do note that this requires a different table in each iteration. In theory you should need only five iterations total to get all 32 bits processed. There's also the part that you can easily bail out earlier with shorter numbers, which will be the vast majority of them.

I suppose that's what you were trying to do with the DIVU variant, the idea I mentioned was basically the same thing but operating entirely with BCD (and more specifically, on groups of bits, which should be just AND + LSR instead of needing to divide).

I was busy with other stuff but some day I should go back to it and implement those optimizations.

flamewing · Post by **flamewing** » Thu Apr 06, 2017 2:19 am

Sik wrote:I'm surprised you optimized my BCD routine without doing the look-up stuff on that one even though I mentioned that =P the idea being processing multiple bits together (which is precisely where I slacked off... I need to get that sorted out some day). Converting from 0 to 63 requires only one BCD byte, up to 8191 only two BCD bytes. That's 13 bits out of 32 that would get done in just two iterations (and you wouldn't even need to add all bytes). Do note that this requires a different table in each iteration.

You mean something like this:

Code: Select all

bcd_byte_LUT:
        ; LUT goes here; 64 byte entries
;------------------------------------------------------------------------------
; Shift+LUT method
Long2BCD_shift_lut:
        clr.l   (a6)+                       ; 20(3/2)
        move.l  a5,d1                       ;  4(1/0)
        moveq   #$3F,d0                     ;  4(1/0)
        and.b   d7,d0                       ;  4(1/0)
        eor.b   d0,d7                       ;  4(1/0); Clear these bits for the future
        move.b  bcd_byte_LUT(pc,d0.w),(a6)+ ; 14(2/1); LUT with 64 byte entries
        ; 50(9/3)
        lsr.l   #5,d7                       ; 18(1/0); Move bits down; the low bits are clear, so d7 is even and X=0
        move.b  d7,d0                       ;  4(1/0)
        eor.b   d0,d7                       ;  4(1/0); Clear these bits for the future
        lea     .bcd_word_lut+2(pc,d0.w),a5 ; 12(2/0); LUT with 128 word entries
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        addq.w  #2,a6                       ;  8(1/0)
        ; 82(12/2)
        lsr.l   #3,d7                       ; 14(1/0); Shift down to word boundary
        lsr.w   #3,d7                       ; 12(1/0); Shift only word down; top word is 0-$FF, bottom is 0-$3FF8, X is clear
        lea     .bcd_quad_lut+4(pc),a5      ;  8(2/0); LUT with 2048 4-byte entries
        adda.w  d7,a5                       ;  8(1/0)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        addq.w  #4,a6                       ;  8(1/0)
        ; 122(18/4)
        swap    d7                          ;  4(1/0)
        move.w  d7,d0                       ;  4(1/0)
        add.w   d7,d7                       ;  4(1/0)
        add.w   d7,d7                       ;  4(1/0)
        add.w   d0,d7                       ;  4(1/0)
        lea     .bcd_penta_lut+5(pc),a5     ;  8(2/0); LUT with 256 5-byte entries
        adda.w  d7,a5                       ;  8(1/0)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        abcd.b  -(a5),-(a6)                 ; 18(3/1)
        lea     5(a6),a6                    ;  8(2/0)
        ; 134(25/5)
        movea.l d1,a5                       ;  4(1/0)
        rts                                 ; 16(4/0)
        ; 20(5/0)
        ; GRAND TOTAL: 408(69/14)
;------------------------------------------------------------------------------
.bcd_word_lut:
        ; LUT goes here; 128 word entries
;------------------------------------------------------------------------------
.bcd_quad_lut:
        ; LUT goes here; 2048 4-byte entries
;------------------------------------------------------------------------------
.bcd_penta_lut:
        ; LUT goes here; 256 5-byte entries

That is a nice idea, and clocks at 408(69/14), the fastest so far (but a relatively close shave with the divs+LUT method, considering it loses 122 cycles just getting output in the same format as the others). It uses 7744 bytes of LUTs, versus 2000 bytes for the divs+LUT method.

Sik wrote:There's also the part that you can easily bail out earlier with shorter numbers, which will be the vast majority of them.

I did mention that I specifically did not consider this case because all routines can benefit from it to varying degrees. For example: narrowing the conversions to a single byte to BCD, we can estimate:

114(22/2) for the modified shift and add 3 method;
42(9/1) for the divu+LUT (really, just LUT now);
128(21/2), 596(111/2) and 362(66/2) (min, max, average) for the repeated subtraction method;
142(26/4) for the shift+LUT method

And suddenly, everything is completely different.

And I realized that the repeated subtract method can also benefit from an early swap, word arithmetic until < 100,000, swap, a few long subtractions of 10,000 until done, then back to word arithmetic for the remainder. It doesn't make much of a difference.

By the way, I found a wrong register in the modified shift and add 3 method; I fixed the other post.

Sik · Post by **Sik** » Thu Apr 06, 2017 5:00 pm

flamewing wrote:That is a nice idea, and clocks at 408(69/14), the fastest so far (but a relatively close shave with the divs+LUT method, considering it loses 122 cycles just getting output in the same format as the others). It uses 7744 bytes of LUTs, versus 2000 bytes for the divs+LUT method.

Huuuh, you merged the third and fourth iterations into one (it's going straight from two bytes to four bytes). That means you took way too many bits in that one iteration and made its table much larger than it should have been (not to mention many of those values using up four bytes when they should have been three bytes).

Off the top of my head:

- 1st iteration (up to 63) = 6 bits = 1×64 entries
- 2nd iteration (up to 8191) = 7 bits = 2×128 entries
- 3rd iteration (up to 524287) = 6 bits = 3×64 entries
- 4th iteration (up to 67108863) = 7 bits = 4×128 entries
- 5th iteration (up to 4294967295) = 6 bits = 5×64 entries

That should be 1344 bytes total (though some more on extra instructions).

flamewing · Post by **flamewing** » Thu Apr 06, 2017 7:44 pm

Sik wrote:Huuuh, you merged the third and fourth iterations into one (it's going straight from two bytes to four bytes). That means you took way too many bits in that one iteration and made its table much larger than it should have been (not to mention many of those values using up four bytes when they should have been three bytes).

That was completely intentional; trading space for time. The routine would be some 100-120 cycles slower with another iteration.

Sik · Post by **Sik** » Thu Apr 06, 2017 10:07 pm

Bleh (ーー；)

I'll just have to go and rewrite the whole function in Indigo from scratch since it's obviously the crappiest of the bunch in every sense.
EDIT: there https://github.com/sikthehedgehog/indigo/issues/5

And I have the feeling you can probably optimize the other ones further.

SpritesMind.Net

32-bit integer to string routine

32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine

Re: 32-bit integer to string routine