bug-gawk

Re: UTF8 above U+10FFFF treated inconsistently


From: arnold
Subject: Re: UTF8 above U+10FFFF treated inconsistently
Date: Wed, 29 Sep 2021 00:34:23 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Thanks for the much clearer report.

I will look at this, but it may simply be an issue of GIGO - Garbage In,
Garbage Out, since the UTF-8 is invalid.  split() is actually fairly
complicated internally, and I don't have a definitive response without
reviewing the code.

Arnold

"Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:

> Dear Arnold
>
> It’s merely a set of bytes as input:
>
> \073\145\037\366\254\271\230\131\000
>
> A while loop that repeatedly strips the first character with sub(), where
> that character is obtained via sprintf("%.1s", str), will see the string
> properly sliced out byte by byte. This is because sprintf("%.1s", str)
> gives either a full Unicode character (single or multi-byte) if the leading
> bytes are well formed, or simply the first byte if it cannot locate any
> well-formed UTF-8 sequence at index position 1 (in awk lingo, or 0 in C).
>
> sub() and sprintf() are working exactly as expected, since the 4-byte
> sequence starting at \366 encodes a value above U+10FFFF.
>
> However, if one applies the split function directly, one character at a
> time, i.e.
>
> split ( str, arr, // )
>
> those 4 bytes starting with \366 will be grouped together into a single cell 
> within the array. 
>
> As a quick refresher, the 4-byte UTF-8 structure has a leading byte of the form
>
> 1111 0xxx
>
> which, hypothetically, allows for up to
>
>  \367\277\277\277  aka F7 BF BF BF,
>
> before necessitating a 5-byte sequence, if the cap of 0x10FFFF were not
> explicitly enforced. (The earliest UTF-8 draft proposals mentioned
> sequences of up to 6 bytes, something I’ve also observed in the source
> code of some other open source software. I’ve only reached out to you
> first because the wellness of awk is what I care about, first and foremost.
>
> It saddens me to see many in the world ignore awk and think it’s nothing
> more than a glorified sed or underpowered Perl, depending on who you ask,
> and not leverage its immense potential because they only see it as a
> legacy dinosaur.
>
> For the longest time, I myself was misled, thinking awk is only for
> tidying inputs at the Makefile stage before binary compilation. I was wrong
> to blindly follow the consensus. I only discovered awk 3 years ago, but I
> see awk as the future, not the past.)
>
> That byte sequence contains 9 bytes but only 5 valid UTF-8 code points,
> which is the count you will observe if it is fed into GNU wc.
>
> sub(), gsub() and sprintf() are working flawlessly, but the split()
> function returns a 6-element array instead of a 9-element one.
>
> I’m using standard gawk 5.1.0, binary sourced from Homebrew, in the
> en_US.UTF-8 locale, on macOS 11.
>
> If this is indeed an underlying C library issue, then I shall reach out to
> the GNU libc team instead. Thanks for your time.
>
>
> Regards,
> Jason K
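
The byte arithmetic in the report can be checked independently of gawk. What follows is a minimal Python sketch, not gawk's code and not a statement about how split() works internally. The names DATA, raw_four_byte_value and strict_units are hypothetical helpers introduced only for this illustration. It applies the 1111 0xxx bit layout to the four bytes starting at \366 without any range check, then walks the nine bytes with a strict UTF-8 decoder, falling back to a single byte wherever no well-formed sequence starts.

```python
# The nine bytes from the report, written with the same octal escapes.
DATA = b"\073\145\037\366\254\271\230\131\000"


def raw_four_byte_value(seq):
    """Apply the 1111 0xxx / 10xxxxxx bit layout with no range check."""
    b0, b1, b2, b3 = seq
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


def strict_units(data):
    """Split into well-formed UTF-8 characters, else single bytes."""
    units, i = [], 0
    while i < len(data):
        for width in (1, 2, 3, 4):
            chunk = data[i:i + width]
            try:
                # Python's strict decoder rejects values above U+10FFFF.
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            break
        else:
            # No well-formed sequence starts here: emit one byte.
            chunk = data[i:i + 1]
        units.append(chunk)
        i += len(chunk)
    return units


print(hex(raw_four_byte_value(DATA[3:7])))            # 0x1ace58
print(hex(raw_four_byte_value(b"\xf7\xbf\xbf\xbf")))  # 0x1fffff
print(len(strict_units(DATA)))                        # 9
```

Under these assumptions the four bytes starting at \366 map to 0x1ACE58, which is above U+10FFFF, and a strict byte-by-byte fallback yields 9 units, 5 of which are valid code points. That matches the counts given in the report for sub() and sprintf().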


