bug-gawk

Re: UTF8 above U+10FFFF treated inconsistently


From: Eli Zaretskii
Subject: Re: UTF8 above U+10FFFF treated inconsistently
Date: Tue, 28 Sep 2021 08:42:18 +0300

> Date: Mon, 27 Sep 2021 18:17:23 -0400
> From:  "Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org>
> 
> However, if one applies the split function directly on a 
> character-by-character basis, i.e.
> 
> split ( str, arr, // )
> 
> those 4 bytes starting with \366 will be grouped together into a single cell 
> within the array. 
> 
> As a quick refresher, the 4 byte UTF8 structure has a leading byte resembling 
> 
> 1111 0xxx
> 
> which, hypothetically, allows for up to
> 
>  \367\277\277\277  aka F7 BF BF BF,
> 
> before necessitating a 5-byte sequence, if the cap of 0x10FFFF were not 
> explicitly enforced (the earliest UTF-8 draft proposals mentioned sequences 
> of up to 6 bytes, something I've also observed in the source code of some 
> other open-source software). I've only reached out to you first because the 
> wellness of awk is what I care about, first and foremost.

What would you expect Gawk to do instead?  Output an error message
about an invalid UTF-8 sequence?

> If this is indeed an underlying C library issue, then I shall reach out to 
> the gnu LIBC team instead. Thanks for your time.

I think it _is_ a libc issue, because Gawk uses the library function
'mbrlen' to parse the string into characters.
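
For reference, the arithmetic behind the byte layout quoted above can
be sketched in C.  This is only a hand-written structural decoder, not
what Gawk or any libc actually does; the function names are
hypothetical, and it makes no attempt to reject overlong forms or
surrogates.  It just shows that a lead byte such as \366 (F6) announces
a 4-byte sequence whose value lies above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by the lead byte alone, using the original
   1..6 byte UTF-8 layout.  Returns 0 for a byte that cannot lead. */
static int utf8_lead_length(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    if (b < 0xFC) return 5;   /* 111110xx */
    if (b < 0xFE) return 6;   /* 1111110x */
    return 0;                 /* FE and FF never appear */
}

/* Decode a 4-byte sequence purely by bit structure.  Returns the
   value, or -1 if the bytes do not have the 11110xxx 10xxxxxx
   10xxxxxx 10xxxxxx shape.  No range check is applied here. */
static long utf8_decode4(const unsigned char *s)
{
    int i;

    assert(s != NULL);
    if ((s[0] & 0xF8) != 0xF0)
        return -1;
    for (i = 1; i < 4; i++)
        if ((s[i] & 0xC0) != 0x80)
            return -1;

    return ((long)(s[0] & 0x07) << 18)
         | ((long)(s[1] & 0x3F) << 12)
         | ((long)(s[2] & 0x3F) << 6)
         |  (long)(s[3] & 0x3F);
}

/* The cap that current UTF-8 (RFC 3629) places on decoded values. */
static int within_unicode_limit(long cp)
{
    return cp >= 0 && cp <= 0x10FFFFL;
}
```

With these helpers, F4 8F BF BF decodes to exactly 0x10FFFF, while
F7 BF BF BF decodes to 0x1FFFFF, which is structurally well formed but
outside the limit.  Whether a given 'mbrlen' accepts or rejects such a
sequence is up to the C library and the locale in effect.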


