Re: Patch for CHICKEN 6 uri-generic
From: felix . winkelmann
Subject: Re: Patch for CHICKEN 6 uri-generic
Date: Fri, 24 May 2024 17:19:31 +0200
> On Fri, May 24, 2024 at 3:31 AM Peter Bex <peter@more-magic.net> wrote:
>
> > - It encodes how many bytes to use in the first byte's leading bit,
> > leading three bits, leading four bits or leading five bits depending
> > on the length.
> >
> > This latter property is extra annoying because you can't just extract
> > the length from the first byte - you have to scan the first bit to
> > decide what to do next. Then, you scan the second and third bit etc.
> >
>
> That's not actually true. You can use a table of 256 entries with one
> single-byte entry for each possible value of the first byte, specifying
> the length of the UTF-8 value. So table entries 0 to 127 have value 1,
> etc. Entries that aren't valid UTF-8 leading bytes, such as 255, have 0 in
> the table.
>
See also "C_utf_expect" in utf.c (or "C_utf_bytes_needed" in chicken.h,
which can be called via "##core#inline").
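
For illustration, a minimal sketch of the table lookup described in the
quoted text. It is independent of C_utf_expect and C_utf_bytes_needed,
whose actual code is not reproduced here. The names utf8_len_table,
utf8_len_table_init and utf8_sequence_length are hypothetical, and mapping
0xC0, 0xC1 and 0xF5-0xFF to 0 is an assumption based on RFC 3629.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; none of this is part of CHICKEN. */
static uint8_t utf8_len_table[256];
static int utf8_len_table_ready = 0;

/* Fill the table once: one entry per possible value of the first byte. */
static void utf8_len_table_init(void)
{
    for (int b = 0; b < 256; b++) {
        if (b < 0x80)
            utf8_len_table[b] = 1;  /* 0xxxxxxx: single-byte (ASCII) */
        else if (b < 0xC2)
            utf8_len_table[b] = 0;  /* continuation bytes, overlong 0xC0/0xC1 */
        else if (b < 0xE0)
            utf8_len_table[b] = 2;  /* 110xxxxx */
        else if (b < 0xF0)
            utf8_len_table[b] = 3;  /* 1110xxxx */
        else if (b < 0xF5)
            utf8_len_table[b] = 4;  /* 11110xxx, up to U+10FFFF */
        else
            utf8_len_table[b] = 0;  /* 0xF5-0xFF never start a sequence */
    }
    utf8_len_table_ready = 1;
}

/* Length in bytes of the sequence that starts with `first`,
   or 0 if `first` is not a valid leading byte. */
static int utf8_sequence_length(unsigned char first)
{
    assert(utf8_len_table_ready);   /* utf8_len_table_init() must run first */
    return utf8_len_table[first];
}
```

After calling utf8_len_table_init() once, utf8_sequence_length(p[0]) gives
the number of bytes to consume at p with a single indexed load and no
bit-by-bit scan of the first byte.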
felix