
Decoding UTF-8

Posted: Fri Jul 15, 2016 6:27 am
by Sik
Since I was messing with UTF-8 I may as well show how decoding works. Normally you're better off doing ASCII-only (for English) or using your own charset (if you're doing something other than English), but I was working on this mostly because the program in question will have to cope with real filenames, and that means UTF-8. Note that the code here assumes strings are valid.

Unicode is gigantic, so displaying all of UTF-8 is probably out of the question. The trick here is that you have your own charset as usual, and then you convert Unicode codepoints to your own charset on the fly as you're rendering the text (and replace anything unavailable with a placeholder, e.g. a question mark). This keeps resource usage low (just enough for whatever you may need) while still being able to use UTF-8.
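As a sketch of that idea (the function name and glyph indices here are made up for illustration, not from any real program):

```c
#include <stdint.h>

/* Hypothetical mapping from Unicode codepoints to indices in a small
   custom font. Anything unavailable falls back to the '?' glyph.
   The glyph numbers are invented for this example. */
unsigned glyph_for_codepoint(uint32_t codepoint)
{
   if (codepoint >= 0x20 && codepoint < 0x7F)
      return codepoint - 0x20;        /* printable ASCII block */

   switch (codepoint) {
      case 0x00E9: return 0x5F;       /* e with acute accent */
      case 0x00F1: return 0x60;       /* n with tilde */
      default:     return '?' - 0x20; /* placeholder glyph */
   }
}
```

The font only needs entries for the codepoints you actually care about; everything else collapses to the placeholder at render time.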

Anyway, here's how to do it in C:

Code:

#include <stdint.h>

/* Decode the UTF-8 sequence at *text into a Unicode codepoint.
   Assumes the input is valid UTF-8. */
unsigned decode_utf8(const char *text)
{
   unsigned ch = (uint8_t)(*text);
   
   /* 1 byte: 0xxxxxxx */
   if (ch < 0x80) {
      return ch;
   }
   
   /* 2 bytes: 110xxxxx 10xxxxxx */
   if (ch < 0xE0) {
      return (text[0] & 0x1F) << 6 |
             (text[1] & 0x3F);
   }
   
   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
   if (ch < 0xF0) {
      return (text[0] & 0x0F) << 12 |
             (text[1] & 0x3F) << 6 |
             (text[2] & 0x3F);
   }
   
   /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
   return (text[0] & 0x07) << 18 |
          (text[1] & 0x3F) << 12 |
          (text[2] & 0x3F) << 6 |
          (text[3] & 0x3F);
}

/* How many bytes the UTF-8 encoding of a codepoint takes. */
unsigned codepoint_size(unsigned codepoint)
{
   if (codepoint < 0x80)
      return 1;
   if (codepoint < 0x800)
      return 2;
   if (codepoint < 0x10000)
      return 3;
   return 4;
}
Here's how to do it in assembly:

Code:

; input a6.l .... Pointer to character
; output a6.l ... Pointer to next character
; output d7.l ... Unicode codepoint

DecodeUTF8:
    moveq   #0, d7
    move.b  (a6)+, d7
    bmi.s   @TwoBytes
    rts
    
@TwoBytes:
    cmp.b   #$E0, d7
    bhs.s   @ThreeBytes
    and.b   #$1F, d7
    lsl.w   #8, d7
    move.b  (a6)+, d7
    lsl.b   #2, d7
    lsr.w   #2, d7
    rts

@ThreeBytes:
    cmp.b   #$F0, d7
    bhs.s   @FourBytes
    and.b   #$0F, d7
    swap    d7
    move.b  (a6)+, d7
    lsl.b   #2, d7
    lsl.w   #6, d7
    move.b  (a6)+, d7
    lsl.b   #2, d7
    lsl.w   #2, d7
    lsr.l   #4, d7
    rts
    
@FourBytes:
    and.b   #$07, d7
    lsl.w   #8, d7
    move.b  (a6)+, d7
    lsl.b   #2, d7
    lsr.w   #2, d7
    swap    d7
    move.b  (a6)+, d7
    lsl.b   #2, d7
    lsl.w   #6, d7
    move.b  (a6)+, d7
    lsl.b   #2, d7
    lsl.w   #2, d7
    lsr.l   #4, d7
    rts
On that note, the fact that a 7.67 MHz 68000 can cope with UTF-8 decoding just fine puts to shame all the modern programs that still don't understand Unicode.

Re: Decoding UTF-8

Posted: Tue Jul 19, 2016 9:09 am
by Natsumi
it indeed is hilarious how difficult it is to deal with Unicode in C++ for example; I had to do some serious black magic to read a UTF-8 file into a wchar_t*, but it's doable. I was quite irritated that I couldn't just whip out 68k and write that code myself in a few minutes, as opposed to hunting for it for hours on StackOverflow.