Decoding UTF-8

Ask anything you want about Megadrive/Genesis programming.

Moderator: BigEvilCorporation

Sik
Very interested
Posts: 939
Joined: Thu Apr 10, 2008 3:03 pm

Post by Sik » Fri Jul 15, 2016 6:27 am

Since I was messing with UTF-8, I may as well show how decoding works. Normally you're better off sticking to ASCII (for English) or using your own charset (for anything other than English). I was working on this mostly because the program in question will have to cope with real filenames, and that means UTF-8. Note that the code here assumes the strings are valid UTF-8.
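For anyone who can't rely on that assumption, here is a minimal sketch of a structural check to run before decoding. The name utf8_is_well_formed is made up for this example, and the check is deliberately incomplete: it only verifies lead bytes and continuation bytes, so it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): returns 1 if the
   len bytes at text are structurally plausible UTF-8, 0 otherwise. */
int utf8_is_well_formed(const char *text, size_t len)
{
   const uint8_t *p = (const uint8_t *)text;
   size_t i = 0;

   while (i < len) {
      size_t extra;   /* continuation bytes expected after the lead byte */

      if (p[i] < 0x80)      extra = 0;   /* 0xxxxxxx */
      else if (p[i] < 0xC0) return 0;    /* stray continuation byte */
      else if (p[i] < 0xE0) extra = 1;   /* 110xxxxx */
      else if (p[i] < 0xF0) extra = 2;   /* 1110xxxx */
      else if (p[i] < 0xF8) extra = 3;   /* 11110xxx */
      else                  return 0;    /* not a UTF-8 lead byte */
      i++;

      if (extra > len - i)
         return 0;                       /* sequence runs past the end */

      while (extra--) {
         if ((p[i] & 0xC0) != 0x80)
            return 0;                    /* expected 10xxxxxx */
         i++;
      }
   }
   return 1;
}
```

If this returns 0 you would probably fall back to treating the string as opaque bytes, or show placeholders for it.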

Unicode is gigantic, so displaying arbitrary UTF-8 text is probably out of the question. The trick is that you keep your own charset as usual and convert Unicode codepoints to it on the fly as you render the text, replacing anything unavailable with a placeholder (e.g. a question mark). This keeps resource usage low (you only spend what your text actually needs) while still letting you use UTF-8.
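To make the conversion idea concrete, here is a rough sketch of a codepoint-to-charset lookup. Everything in it is an assumption for illustration: the glyph indices, the choice of extra characters, and the names glyph_map, extra_glyphs, codepoint_to_glyph and GLYPH_PLACEHOLDER are all invented, and a real font would have its own layout.

```c
#include <stddef.h>

/* Hypothetical charset layout: printable ASCII sits at its ASCII
   position, extra glyphs start at 0x80. All indices are made up. */
#define GLYPH_PLACEHOLDER 0x3F   /* '?' */

struct glyph_map {
   unsigned codepoint;      /* Unicode codepoint */
   unsigned char glyph;     /* index into our own charset */
};

/* The handful of non-ASCII characters this imaginary font provides. */
static const struct glyph_map extra_glyphs[] = {
   { 0x00E1, 0x80 },   /* a with acute */
   { 0x00E9, 0x81 },   /* e with acute */
   { 0x00F1, 0x82 },   /* n with tilde */
};

unsigned char codepoint_to_glyph(unsigned codepoint)
{
   size_t i;

   /* Printable ASCII maps straight through in this sketch. */
   if (codepoint >= 0x20 && codepoint < 0x7F)
      return (unsigned char)codepoint;

   /* Linear search is fine for a small table. */
   for (i = 0; i < sizeof(extra_glyphs) / sizeof(extra_glyphs[0]); i++) {
      if (extra_glyphs[i].codepoint == codepoint)
         return extra_glyphs[i].glyph;
   }

   /* Anything the font doesn't have becomes a question mark. */
   return GLYPH_PLACEHOLDER;
}
```

With a larger table you would likely sort it by codepoint and use a binary search instead.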

Anyway, here's how to do it in C:

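#include <stdint.h>

/* Decodes the codepoint at the start of text (must be valid UTF-8) */
unsigned decode_utf8(const char *text)
{
   unsigned ch = (uint8_t)(*text);
   
   /* 0xxxxxxx: plain ASCII, one byte */
   if (ch < 0x80) {
      return ch;
   }
   
   /* 110xxxxx 10xxxxxx: two bytes */
   if (ch < 0xE0) {
      return (text[0] & 0x1F) << 6 |
             (text[1] & 0x3F);
   }
   
   /* 1110xxxx 10xxxxxx 10xxxxxx: three bytes */
   if (ch < 0xF0) {
      return (text[0] & 0x0F) << 12 |
             (text[1] & 0x3F) << 6 |
             (text[2] & 0x3F);
   }
   
   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four bytes */
   return (text[0] & 0x07) << 18 |
          (text[1] & 0x3F) << 12 |
          (text[2] & 0x3F) << 6 |
          (text[3] & 0x3F);
}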

unsigned codepoint_size(unsigned codepoint)
{
   if (codepoint < 0x80)
      return 1;
   if (codepoint < 0x800)
      return 2;
   if (codepoint < 0x10000)
      return 3;
   return 4;
}
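The two functions are meant to be used together: decode a codepoint, then use its size to step to the next one. Here is a minimal sketch of such a loop. The helper name decode_string is made up for this example, and compact copies of the two functions are repeated (reading bytes as uint8_t) only so the block compiles on its own. Stepping by codepoint_size() relies on the string being valid UTF-8 without overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact copy of decode_utf8 so this example is self-contained. */
unsigned decode_utf8(const char *text)
{
   const uint8_t *p = (const uint8_t *)text;
   if (p[0] < 0x80)
      return p[0];
   if (p[0] < 0xE0)
      return (p[0] & 0x1Fu) << 6 | (p[1] & 0x3Fu);
   if (p[0] < 0xF0)
      return (p[0] & 0x0Fu) << 12 | (p[1] & 0x3Fu) << 6 | (p[2] & 0x3Fu);
   return (p[0] & 0x07u) << 18 | (p[1] & 0x3Fu) << 12 |
          (p[2] & 0x3Fu) << 6 | (p[3] & 0x3Fu);
}

/* Compact copy of codepoint_size. */
unsigned codepoint_size(unsigned codepoint)
{
   if (codepoint < 0x80)    return 1;
   if (codepoint < 0x800)   return 2;
   if (codepoint < 0x10000) return 3;
   return 4;
}

/* Hypothetical helper: decodes a NUL-terminated, valid UTF-8 string
   into out[], writing at most max codepoints. Returns the count. */
size_t decode_string(const char *text, unsigned *out, size_t max)
{
   size_t count = 0;

   while (*text != '\0' && count < max) {
      unsigned cp = decode_utf8(text);
      out[count++] = cp;
      text += codepoint_size(cp);   /* skip the bytes this codepoint used */
   }
   return count;
}
```

In a text renderer you would draw each codepoint inside the loop instead of storing it in an array.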

Here's how to do it in assembly:

; input a6.l .... Pointer to character
; output a6.l ... Pointer to next character
; output d7.l ... Unicode codepoint

DecodeUTF8:
    moveq   #0, d7
    move.b  (a6)+, d7
    bmi.s   @TwoBytes
    rts
    
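@TwoBytes:
    cmp.b   #$E0, d7        ; lead byte >= $E0 means three or more bytes
    bhs.s   @ThreeBytes
    and.b   #$1F, d7        ; keep the 5 payload bits of the lead byte
    lsl.w   #8, d7          ; move them to the upper byte of the word
    move.b  (a6)+, d7       ; fetch the continuation byte (10xxxxxx)
    lsl.b   #2, d7          ; shift its two marker bits out of the byte
    lsr.w   #2, d7          ; close the gap: 5 bits + 6 bits
    rts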

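@ThreeBytes:
    cmp.b   #$F0, d7        ; lead byte >= $F0 means four bytes
    bhs.s   @FourBytes
    and.b   #$0F, d7        ; keep the 4 payload bits of the lead byte
    swap    d7              ; park them in the upper word
    move.b  (a6)+, d7       ; first continuation byte
    lsl.b   #2, d7          ; drop its marker bits
    lsl.w   #6, d7          ; move its 6 bits up to bits 8-13
    move.b  (a6)+, d7       ; second continuation byte
    lsl.b   #2, d7          ; drop its marker bits
    lsl.w   #2, d7          ; pack both: 12 payload bits in bits 4-15
    lsr.l   #4, d7          ; shift the whole codepoint down into place
    rts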
    
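@FourBytes:
    and.b   #$07, d7        ; keep the 3 payload bits of the lead byte
    lsl.w   #8, d7          ; move them to the upper byte of the word
    move.b  (a6)+, d7       ; first continuation byte
    lsl.b   #2, d7          ; drop its marker bits
    lsr.w   #2, d7          ; close the gap: 3 bits + 6 bits
    swap    d7              ; park those 9 bits in the upper word
    move.b  (a6)+, d7       ; second continuation byte
    lsl.b   #2, d7          ; drop its marker bits
    lsl.w   #6, d7          ; move its 6 bits up to bits 8-13
    move.b  (a6)+, d7       ; third continuation byte
    lsl.b   #2, d7          ; drop its marker bits
    lsl.w   #2, d7          ; pack both: 12 payload bits in bits 4-15
    lsr.l   #4, d7          ; shift the whole codepoint down into place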
    rts

On that note, the fact that a 7.67 MHz 68000 can cope with UTF-8 decoding just fine puts to shame all the modern programs that still don't understand Unicode.

Natsumi
Very interested
Posts: 82
Joined: Mon Oct 05, 2015 3:00 pm
Location: 0x0

Re: Decoding UTF-8

Post by Natsumi » Tue Jul 19, 2016 9:09 am

It is indeed hilarious how difficult it is to deal with Unicode in C++, for example; I had to do some serious black magic to read a UTF-8 file into a wchar_t*, but it's doable. I was quite irritated that I couldn't just whip out 68k and write that code myself in a few minutes, as opposed to hunting for it for hours on StackOverflow.
