bug-gawk

UTF8 above U+10FFFF treated inconsistently


From: Jason C. Kwan
Subject: UTF8 above U+10FFFF treated inconsistently
Date: Fri, 10 Sep 2021 04:14:46 +0000 (UTC)

The earliest UTF-8 specifications allowed sequences of up to 6 bytes. The
Unicode Consortium has since restricted UTF-8 to code points no higher than
U+10FFFF (1,114,111 in decimal), which is still the limit as of Unicode 13.
The "invalid UTF-8" in question is \366\254\271\230, which would decode to a
hypothetical code point of 1,756,760, or U+1ACE58 in hex.

$ echo ' 06*(64^3) + 44*(64^2) + 57*(64^1) +  24*(64^0)' | bc
1756760
$ echo 'obase=16; 1756760' | bc
1ACE58
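
The same arithmetic can be cross-checked outside of bc. The following is a
minimal sketch in Python (chosen only as a neutral reference outside gawk; the
function name decode_four_byte is hypothetical). It applies the 4-byte UTF-8
bit layout with no range checking and then compares the result against
U+10FFFF.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of the Unicode code space


def decode_four_byte(seq: bytes) -> int:
    """Apply the 4-byte UTF-8 bit layout (11110xxx 10xxxxxx x3), no range check."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("expected 4 bytes with a 11110xxx lead byte")
    if any((b & 0xC0) != 0x80 for b in seq[1:]):
        raise ValueError("expected 10xxxxxx continuation bytes")
    value = seq[0] & 0x07              # 3 payload bits from the lead byte
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)  # 6 payload bits per continuation byte
    return value


# The sequence from the report, written in octal as in the strdump output.
seq = bytes([0o366, 0o254, 0o271, 0o230])
cp = decode_four_byte(seq)
print(cp, f"U+{cp:X}", cp > MAX_CODE_POINT)  # prints: 1756760 U+1ACE58 True
```

The lead byte \366 (0xF6) contributes the payload value 6, which is why the
result lands above U+10FFFF: only lead bytes 0xF0 through 0xF4 can start a
sequence that stays in range.
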
I'd imagine the issue affects any hypothetical character from U+110000 up to
the end of the 6-byte range.

Some parts of gawk handle this correctly: ($0 ~ /^.*$/) fails, match( ) fails
on the same criteria, sprintf( ) emits one character at a time
(sprintf("%.1s") returns the first item: a whole multi-byte UTF-8 character if
the sequence is well-formed, otherwise just the first byte, whether ASCII or
8-bit), and split( ) properly splits the record into a 9-cell array.

I haven't tested gensub( ), but sub( ) and gsub( ) at least show inconsistent
treatment (row 14 below): gsub( ) splits the record into 6 elements, as if the
4 bytes \xF6 \xAC \xB9 \x98 together comprised a valid UTF-8 character. That
gives the illusion that the byte sequence is well-formed UTF-8 that is safe
for length( ) to measure directly, so length( ) then properly errors out.
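
The "one character if well-formed, otherwise one byte" stepping described
above can be sketched as follows. This is an illustration in Python under the
assumption that its strict UTF-8 decoder is an acceptable stand-in for the
rules; it is not gawk's implementation, and the names expected_length and
split_units are hypothetical.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 1 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1  # continuation bytes, 0xC0/0xC1, and 0xF5..0xFF never start a character


def split_units(data: bytes) -> list:
    """Split into well-formed UTF-8 characters, falling back to single bytes."""
    units = []
    i = 0
    while i < len(data):
        chunk = data[i:i + expected_length(data[i])]
        try:
            chunk.decode("utf-8")      # strict: rejects overlongs, surrogates, > U+10FFFF
        except UnicodeDecodeError:
            chunk = data[i:i + 1]      # not well-formed: consume one byte only
        units.append(chunk)
        i += len(chunk)
    return units


# The 9-byte record from the report, in octal as shown by strdump.
record = bytes([0o073, 0o145, 0o037, 0o366, 0o254, 0o271, 0o230, 0o131, 0o000])
units = split_units(record)
print(len(units))  # prints: 9
```

Under these rules the record yields 9 units, matching the 9 steps of the sub( )
loop, whereas treating \xF6 \xAC \xB9 \x98 as a single character would yield 6.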

$ gecho; time gcat backupgenieaudio_53128949med_.lossless.mp3 |
    gsed -n '562368p' |
    gawkfx -e '{
        s0 = $0; k = strdump($0, "x"); gsub(/[%]/, " \\x", k)
        print strdump($0) " :: " k " :: valid end-to-end ? " ($0 ~ /^.*$/)
    }
    {
        bytes = match($0, /$/)
        while (s0 != "") {
            print sub("^" sprintf("%.1s", s0), "", s0) " --> " strdump(s0)
        }
        print (s0 == "")
    }
    END {
        print NR, sum, bytes
        print " s0 : [" s0 "]"
        orig0 = $0
        gsub(/./, "< & >"); gsub(/\037/, "\\037"); gsub(/[\000]/, "\\000")
        print; print length(orig9)
    }' | gcat -n
gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
     1 \073\145\037\366\254\271\230\131\000 ::  \x3B \x65 \x1F \xF6 \xAC \xB9 \x98 \x59 \x00 :: valid end-to-end ? 0
     2 1 --> \145\037\366\254\271\230\131\000
     3 1 --> \037\366\254\271\230\131\000
     4 1 --> \366\254\271\230\131\000
     5 1 --> \254\271\230\131\000
     6 1 --> \271\230\131\000
     7 1 --> \230\131\000
     8 1 --> \131\000
     9 1 --> \000
    10 1 -->
    11 1
    12 1==10
    13  s0 : []
    14 < ; >< e >< \037 >< ???? >< Y >< \000 >
    15 9

One can verify this with a regex that hard-codes the UTF-8 spec (explicitly
skipping the high and low surrogates reserved for UTF-16). A properly formed
sequence would leave 0 bytes after the gsub( ) instead of 4:

$ gecho; time gcat backupgenieaudio_53128949med_.lossless.mp3 |
    gsed -n '562368p' |
    gawk -e '{
        print match($0, /$/) - 1
        gsub(sprintf("[%c-%c%c-%c%c-%c]", 0x00, 0x7F, 0x80, 0xD7FF, 0xE0000, 0x10FFFF), "")
        print match($0, /$/) - 1
    }'
9
4
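
The same "strip everything well-formed and count what is left" check can be
reproduced independently of gawk. The sketch below is in Python and assumes
its built-in strict UTF-8 decoder (with the "ignore" error handler) as the
reference; the function name leftover_bytes is hypothetical.

```python
def leftover_bytes(data: bytes) -> int:
    """Count bytes that are not part of any well-formed UTF-8 sequence."""
    # errors="ignore" silently drops every byte the strict decoder rejects,
    # so re-encoding the result gives the length of the well-formed part.
    well_formed = data.decode("utf-8", errors="ignore")
    return len(data) - len(well_formed.encode("utf-8"))


# The 9-byte record from the report, in octal as shown by strdump.
record = bytes([0o073, 0o145, 0o037, 0o366, 0o254, 0o271, 0o230, 0o131, 0o000])
print(len(record), leftover_bytes(record))  # prints: 9 4
```

The 4 leftover bytes are exactly \366\254\271\230; fully well-formed input
gives 0.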


