bug-gawk

Re: UTF8 above U+10FFFF treated inconsistently


From: Jason C. Kwan
Subject: Re: UTF8 above U+10FFFF treated inconsistently
Date: Mon, 27 Sep 2021 18:17:23 -0400

Dear Arnold

The input is merely a sequence of bytes:

\073\145\037\366\254\271\230\131\000

A while loop that repeatedly strips the first character with sub(), where that 
character is extracted via sprintf("%.1s", str), will see the string properly 
sliced out byte by byte. This is because sprintf("%.1s", str) gives either a 
full Unicode character (single or multi-byte) if the leading bytes are well 
formed, or simply the first byte if it cannot locate any well-formed UTF-8 
sequence at index position 1 (in awk lingo, or 0 in C).
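
The peeling rule described above can be modeled outside of awk. The following is 
a minimal Python sketch, not gawk's actual code: first_char() and peel() are 
hypothetical names, and Python's strict UTF-8 decoder (which rejects anything 
above U+10FFFF) stands in for the well-formedness check.

```python
def first_char(data: bytes) -> bytes:
    """Return the leading well-formed UTF-8 sequence, or just the first byte."""
    for n in range(1, 5):  # UTF-8 sequences are 1 to 4 bytes long
        try:
            # Python's strict decoder rejects values above U+10FFFF
            if len(data[:n].decode("utf-8")) == 1:
                return data[:n]
        except UnicodeDecodeError:
            continue
    return data[:1]  # no well-formed sequence at the front: take one byte


def peel(data: bytes) -> list:
    """Repeatedly remove the first 'character', as the sub() loop does."""
    pieces = []
    while data:
        head = first_char(data)
        pieces.append(head)
        data = data[len(head):]  # plays the role of sub() dropping the head
    return pieces


sample = b"\073\145\037\366\254\271\230\131\000"
pieces = peel(sample)
print(len(pieces))  # 9
```

Under this model the bytes \366 \254 \271 \230 each come out as their own 
piece, which is the 9-piece result the loop produces.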

sub() and sprintf() are working exactly as expected, since the 4-byte 
sequence starting at \366 would decode to a value above U+10FFFF.

However, if one applies the split function directly on a character-by-character 
basis, i.e.

split ( str, arr, // )

those 4 bytes starting with \366 are grouped together into a single cell 
within the array. 
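
For comparison, here is a Python sketch of a splitter that would reproduce the 
grouping described above. It is only a guess at a model of the observed 
behavior, not a claim about how gawk or the C library implements split(): the 
names structural_length() and lenient_split() are hypothetical, and the 
assumption is that the lead byte's bit pattern alone decides the sequence 
length, with no check against U+10FFFF.

```python
def structural_length(lead: int) -> int:
    """Sequence length implied by the lead byte's bit pattern alone."""
    if lead < 0x80:
        return 1              # 0xxxxxxx
    if (lead & 0xE0) == 0xC0:
        return 2              # 110xxxxx
    if (lead & 0xF0) == 0xE0:
        return 3              # 1110xxxx
    if (lead & 0xF8) == 0xF0:
        return 4              # 11110xxx, includes F5..F7 (assumption)
    return 1                  # stray continuation or other byte


def lenient_split(data: bytes) -> list:
    """Group bytes by structure only, with no U+10FFFF range check."""
    cells = []
    i = 0
    while i < len(data):
        n = structural_length(data[i])
        tail = data[i + 1:i + n]
        # fall back to a single byte unless every trailing byte is 10xxxxxx
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            n = 1
        cells.append(data[i:i + n])
        i += n
    return cells


sample = b"\073\145\037\366\254\271\230\131\000"
cells = lenient_split(sample)
print(len(cells))  # 6
```

With this structure-only rule the four bytes starting at \366 land in one 
cell, giving 6 cells in total for the sample input.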

As a quick refresher, the 4-byte UTF-8 structure has a leading byte of the form 

1111 0xxx

which, hypothetically, allows for up to

 \367\277\277\277  aka F7 BF BF BF,

before necessitating a 5-byte sequence, if the cap of 0x10FFFF were not 
explicitly enforced. (The earliest UTF-8 draft proposals mentioned sequences 
of up to 6 bytes, something I’ve also observed in the source code of some 
other open source software. I’ve only reached out to you first because 
the wellness of awk is what I care about, first and foremost.

It saddens me to see many in the world ignore awk and think it’s nothing more 
than a glorified sed or an underpowered Perl, depending on whom you ask, and 
not leverage its immense potential because they only see it as a legacy 
dinosaur. 

For the longest time, I myself was misled into thinking awk is only for 
tidying inputs at the Makefile stage before binary compilation. I was wrong 
to blindly follow the consensus. I only discovered awk 3 years ago, but I 
see awk as the future, not the past.)
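
To make the 4-byte arithmetic concrete, the payload bits of the 1111 0xxx 
layout can be unpacked by hand. This is a small Python sketch with a 
hypothetical helper name, raw_four_byte_value(); it deliberately skips all 
validation so that out-of-range values can be inspected.

```python
MAX_UNICODE = 0x10FFFF  # highest code point Unicode allows


def raw_four_byte_value(seq: bytes) -> int:
    """Unpack 3 + 6 + 6 + 6 payload bits from a 4-byte sequence, unchecked."""
    b0, b1, b2, b3 = seq
    return (
        ((b0 & 0x07) << 18)    # 3 payload bits from the lead byte
        | ((b1 & 0x3F) << 12)  # 6 bits from each continuation byte
        | ((b2 & 0x3F) << 6)
        | (b3 & 0x3F)
    )


reported = raw_four_byte_value(b"\xf6\xac\xb9\x98")  # \366\254\271\230
ceiling = raw_four_byte_value(b"\xf7\xbf\xbf\xbf")   # \367\277\277\277

print(hex(reported), reported > MAX_UNICODE)  # 0x1ace58 True
print(hex(ceiling))                           # 0x1fffff
```

So the sequence from the sample would be 0x1ACE58, past the 0x10FFFF cap, and 
the structural ceiling of the 4-byte form is 0x1FFFFF.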

The byte sequence above contains 9 bytes but only 5 valid UTF-8 code points, 
which is the result you will observe if you feed it into GNU wc. 
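
The two counts can also be cross-checked independently of wc. A quick Python 
sketch, assuming that bytes Python's strict UTF-8 decoder refuses are simply 
dropped (errors="ignore") before counting:

```python
sample = b"\073\145\037\366\254\271\230\131\000"

# Drop every byte that is not part of a well-formed UTF-8 sequence
valid = sample.decode("utf-8", errors="ignore")

print(len(sample), len(valid))  # 9 5
```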

sub(), gsub(), and sprintf() are working flawlessly, but the split() 
function returns a 6-element array instead of a 9-element one.

I’m using standard gawk 5.1.0, binary sourced from Homebrew, with the 
en_US.UTF-8 locale, on macOS 11. 

If this is indeed an underlying C library issue, then I shall reach out to the 
GNU libc team instead. Thanks for your time.


Regards,
Jason K

> Jason C. Kwan <jasonckwan@yahoo.com> wrote on 10 Sep 2021 at 00:14:
> 
> \073\145\037\366\254\271\230\131\000
