Re: UTF8 above U+10FFFF treated inconsistently
From: Jason C. Kwan
Subject: Re: UTF8 above U+10FFFF treated inconsistently
Date: Mon, 27 Sep 2021 18:17:23 -0400
Dear Arnold,
The input is merely this set of bytes:
\073\145\037\366\254\271\230\131\000
A while loop that repeatedly uses sub() to strip away the first character, as
lopped off via sprintf("%.1s", str), will see the string properly sliced out
byte by byte. That is because sprintf("%.1s") gives either a full Unicode
character, single- or multi-byte, if the leading bytes are well formed, or
simply the first byte if it cannot locate a well-formed UTF-8 sequence at
index position 1 (in awk lingo, or 0 in C).
sub() and sprintf() are working exactly as expected, since the 4-byte
sequence starting at \366 encodes a value above U+10FFFF.
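To make the "above U+10FFFF" claim concrete, here is a minimal Python sketch that applies the 4-byte UTF-8 bit layout to the bytes starting at \366, deliberately skipping any range check. The helper name raw_decode_4byte is hypothetical and only for illustration; it is not how gawk or any C library decodes.

```python
# Octal escapes copied from the message; the rest is an illustrative sketch.
data = b"\073\145\037\366\254\271\230\131\000"
seq = data[3:7]  # the four bytes starting at \366 (0xF6)


def raw_decode_4byte(b):
    """Apply the 4-byte UTF-8 bit layout with no U+10FFFF range check."""
    assert len(b) == 4 and b[0] >> 3 == 0b11110      # lead byte 1111 0xxx
    assert all(x >> 6 == 0b10 for x in b[1:])        # continuation bytes 10xx xxxx
    return (
        ((b[0] & 0x07) << 18)
        | ((b[1] & 0x3F) << 12)
        | ((b[2] & 0x3F) << 6)
        | (b[3] & 0x3F)
    )


value = raw_decode_4byte(seq)
print(seq.hex(), hex(value), value > 0x10FFFF)  # f6acb998 0x1ace58 True
```

The sequence is structurally well formed but decodes to 0x1ACE58, which is beyond the Unicode limit, so a strict decoder has to reject it.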
However, if one applies the split function directly on a character-by-character
basis, i.e.
split(str, arr, //)
those 4 bytes starting with \366 will be grouped together into a single cell
within the array.
As a quick refresher, the 4-byte UTF-8 structure has a leading byte resembling
1111 0xxx
which, hypothetically, allows for values up to
\367\277\277\277, aka F7 BF BF BF,
before necessitating a 5-byte sequence, if the cap of 0x10FFFF were not
explicitly enforced. (The earliest UTF-8 draft proposals mentioned sequences
of up to 6 bytes, something I’ve also observed in the source code of some
other open source software. I’ve only reached out to you guys first because
the wellness of awk is what I care about, first and foremost.
It saddens me to see many in the world ignore awk and think it’s nothing more
than a glorified sed or an underpowered Perl, depending on who you ask, and not
leverage its immense potential because they only see it as a legacy dinosaur.
For the longest time, I myself was misled into thinking awk is only for
tidying inputs at the Makefile stage before binary compilation. I was wrong
to blindly follow the consensus. I only discovered awk 3 years ago, but I
see awk as the future, not the past.)
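The 4-byte ceiling mentioned in the refresher can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: raw_encode_4byte is a hypothetical helper that fills the 1111 0xxx layout without enforcing the Unicode cap.

```python
UNICODE_MAX = 0x10FFFF

# Payload bits: 3 in the lead byte (1111 0xxx) plus 6 in each of the
# 3 continuation bytes (10xx xxxx).
payload_bits = 3 + 3 * 6
layout_max = (1 << payload_bits) - 1


def raw_encode_4byte(cp):
    """Fill the 4-byte layout with cp, with no check against U+10FFFF."""
    assert 0x10000 <= cp <= layout_max
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(hex(layout_max))                       # 0x1fffff
print(raw_encode_4byte(layout_max).hex())    # f7bfbfbf
print(raw_encode_4byte(UNICODE_MAX).hex())   # f48fbfbf
```

So the layout itself can carry 21 bits (up to 0x1FFFFF), while the enforced cap stops at F4 8F BF BF. Anything with a lead byte of F5 through F7 falls in the gap between the two.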
That byte sequence contains 9 bytes but only 5 valid UTF-8 code points, which is
the result you will observe if it is fed into GNU wc.
sub(), gsub() and sprintf() are working flawlessly, but the split()
function returns a 6-element array instead of a 9-element one.
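The 9 versus 6 discrepancy can be reproduced with a small model. The Python sketch below is not gawk's implementation; it only assumes two segmentation policies, one that enforces the U+10FFFF cap and falls back to single bytes, and one that accepts any structurally well-formed sequence. The names seq_len and segment are hypothetical.

```python
data = b"\073\145\037\366\254\271\230\131\000"


def seq_len(lead):
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 0  # continuation byte or 5/6-byte lead, not modelled here


def segment(buf, enforce_cap):
    """Split buf into cells: one character each, else one raw byte."""
    cells, i = [], 0
    while i < len(buf):
        n = seq_len(buf[i])
        chunk = buf[i:i + n]
        ok = n > 0 and len(chunk) == n and all(b >> 6 == 0b10 for b in chunk[1:])
        if ok and enforce_cap:
            try:
                chunk.decode("utf-8")  # strict: rejects values above U+10FFFF
            except UnicodeDecodeError:
                ok = False
        if not ok:
            chunk = buf[i:i + 1]       # fall back to a single byte
        cells.append(chunk)
        i += len(chunk)
    return cells


strict = segment(data, enforce_cap=True)
lenient = segment(data, enforce_cap=False)
valid = len(data.decode("utf-8", errors="ignore"))
print(len(strict), len(lenient), valid)  # 9 6 5
```

Under these assumptions the strict policy matches the byte-by-byte slicing seen with sub() and sprintf(), the lenient policy matches the 6 cells from split(), and dropping the invalid bytes entirely leaves the 5 valid code points.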
I’m using standard gawk 5.1.0, binary sourced from Homebrew, in the en_US.UTF-8
locale, on macOS 11.
If this is indeed an underlying C library issue, then I shall reach out to the
GNU libc team instead. Thanks for your time.
Regards,
Jason K
> On 10 Sep 2021, at 00:14, Jason C. Kwan <jasonckwan@yahoo.com> wrote:
>
> \073\145\037\366\254\271\230\131\000