Re: UTF8 above U+10FFFF treated inconsistently
From: Eli Zaretskii
Subject: Re: UTF8 above U+10FFFF treated inconsistently
Date: Tue, 28 Sep 2021 08:42:18 +0300
> Date: Mon, 27 Sep 2021 18:17:23 -0400
> From: "Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org>
>
> however, if one directly applies the split function on a character by
> character basis, ie -
>
> split ( str, arr, // )
>
> those 4 bytes starting with \366 will be grouped together into a single cell
> within the array.
>
> As a quick refresher, the 4 byte UTF8 structure has a leading byte resembling
>
> 1111 0xxx
>
> which, hypothetically, allows for up to
>
> \367\277\277\277 aka F7 BF BF BF,
>
> before necessitating a 5 byte sequence , if caps of 0x10FFFF were not
> explicitly enforced (since the earliest of UTF8 draft proposals have
> mentioned up to 6 byte sequences, something I’ve also observed in source
> codes for some other open source softwares as well. I’ve only reached out to
> u guys first cuz the wellness of awk is what I care about, first and foremost.
What would you expect Gawk to do instead? Output an error message
about an invalid UTF-8 sequence?
> If this is indeed an underlying C library issue, then I shall reach out to
> the gnu LIBC team instead. Thanks for your time.
I think it _is_ a libc issue, because Gawk uses the library function
'mbrlen' to parse the string into characters.
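
To make the two points above concrete, here is a minimal C sketch. It is
not Gawk's actual code: the function names decode4_uncapped and
count_chars are hypothetical, and the loop is only an assumption about
what an mbrlen-driven character scan looks like. The first function
decodes a 4-byte sequence with a 1111 0xxx lead byte without enforcing
the U+10FFFF cap; the second walks a buffer with mbrlen, so whatever the
C library accepts as one character in the current locale is what the
caller sees as one character.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

#define UNICODE_MAX 0x10FFFFu

/* Hypothetical helper: decode 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   WITHOUT enforcing the U+10FFFF cap (and without rejecting overlong
   forms).  Returns the 21-bit value, or UINT32_MAX if the bytes do not
   have that shape. */
uint32_t decode4_uncapped(const unsigned char *b)
{
    if ((b[0] & 0xF8) != 0xF0)          /* lead byte must be 1111 0xxx */
        return UINT32_MAX;
    for (int i = 1; i < 4; i++)
        if ((b[i] & 0xC0) != 0x80)      /* continuation: 10xx xxxx */
            return UINT32_MAX;
    return ((uint32_t)(b[0] & 0x07) << 18)
         | ((uint32_t)(b[1] & 0x3F) << 12)
         | ((uint32_t)(b[2] & 0x3F) << 6)
         |  (uint32_t)(b[3] & 0x3F);
}

/* Hypothetical helper: count characters in s[0..n) by repeatedly
   calling mbrlen.  Returns (size_t)-1 if mbrlen reports an invalid or
   incomplete sequence.  The result depends on the current locale and
   on the C library's multibyte decoder. */
size_t count_chars(const char *s, size_t n)
{
    mbstate_t st;
    size_t count = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (n > 0) {
        size_t len = mbrlen(s, n, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return (size_t)-1;
        if (len == 0)                   /* embedded NUL: one byte */
            len = 1;
        s += len;
        n -= len;
        count++;
    }
    return count;
}
```

With this sketch, F7 BF BF BF decodes to 0x1FFFFF and F4 8F BF BF to
0x10FFFF, so any value above UNICODE_MAX is one that a strict decoder
would have to reject. Whether mbrlen rejects such a sequence is up to
the C library, which is why the behavior is not under Gawk's control.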