bug-gawk

Re: UTF8 above U+10FFFF treated inconsistently


From: arnold
Subject: Re: UTF8 above U+10FFFF treated inconsistently
Date: Fri, 10 Sep 2021 00:24:31 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Hello.

Thank you for taking the time to report an issue.

Unfortunately, your report is close to unreadable, due to line endings being
messed up and unreadably long lines.  Can you resend in a text only fashion?

Also, it's not clear to me what you think is a bug and what is not.
Please see the manual on how to report a bug. In particular, include
short programs and their data (best as attachments) that reproduce the
bug, as well as some indication of your operating system. Then describe
what the bug is and what you think the behavior should be. Since you're
dealing with Unicode, please also include the locale you are using.
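
A report along those lines might be put together with something like the
sketch below. The file name repro.dat is hypothetical, the nine bytes are
the ones shown in the quoted report, and the one-line gawk program is only
a placeholder for whatever short program actually shows the problem.

```shell
#!/bin/sh
# Hypothetical minimal reproduction for a bug report.
# repro.dat is an assumed file name; the bytes come from the quoted report.
printf '\073\145\037\366\254\271\230\131\000\n' > repro.dat

# Environment details worth pasting into the report.
uname -a
locale

# Skip the gawk steps if gawk is not installed on this machine.
if command -v gawk >/dev/null 2>&1; then
    gawk --version | head -n 1
    # Placeholder program: replace with the shortest one that shows the bug.
    gawk '{ print length($0) }' repro.dat
fi
```

Attaching repro.dat itself, rather than pasting its contents, keeps the
bytes from being altered by the mail client.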

Please be aware that gawk relies on the underlying C library for
converting multibyte characters into wide characters, and on the
regex and dfa routines from GNULIB.  Thus, even if there are bugs,
it may be that there's nothing I can do about them if they are in
code that I don't control.
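
One rough way to see how the system libraries treat such bytes, without
involving gawk at all, is sketched below. It uses iconv as a stand-in,
which is an assumption: iconv is not the code path gawk goes through, so
the result is only a hint about the platform, not a diagnosis.

```shell
#!/bin/sh
# Sketch: hand the four bytes from the report to iconv as UTF-8.
# Assumption: iconv is only a stand-in for the C library's multibyte
# routines; gawk does not call iconv for this.
# UTF-16 cannot represent a value above U+10FFFF, so a conforming iconv
# should fail here, either while decoding or while encoding.
if printf '\366\254\271\230' | iconv -f UTF-8 -t UTF-16LE >/dev/null 2>&1; then
    verdict="accepted"
else
    verdict="rejected"
fi
echo "iconv $verdict the sequence"
```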

Thanks,

Arnold

"Jason C. Kwan" via "Bug reports only for gawk." <bug-gawk@gnu.org> wrote:

> The earliest specs of Unicode allowed for up to 6-byte codes in UTF8. However,
> the Unicode consortium has amended that to only spec it up to U+10FFFF as of
> Unicode 13, or 1,114,111 in decimal. The "invalid UTF8" in question is
> \366\254\271\230, with a hypothetical Unicode integer value of 1,756,760, or
> U+1ACE58 in hex.
>
> $ echo ' 06*(64^3) + 44*(64^2) + 57*(64^1) +  24*(64^0)' | bc
> 1756760
> $ echo 'obase=16; 1756760' | bc
> 1ACE58
>
> I'd imagine the issue might impact any hypothetical character from U+110000
> to the end of the 6-byte spec.
>
> Some parts of gawk are correct, such as failing ($0 ~ /^.*$/), failing
> match( ) on the same criteria, sprintf( ) spitting out one character at a
> time (sprintf("%.1s") dumps out the first item, either a multi-byte UTF8
> character, if it's a well-formed sequence, or just the first byte of any
> nature, ASCII or 8-bit), and also properly splitting it into a 9-cell array
> in split( ).
>
> I haven't tested gensub( ), but at least for sub( ) and gsub( ), it's showing
> inconsistent treatment (row 14 below): splitting it into 6 elements as if the
> 4 bytes \xF6 \xAC \xB9 \x98 together comprise a valid UTF8 character, and
> thus resulting in length( ) properly erroring out because gsub( ) provided
> the illusion that the byte sequence is well-formed UTF8 that's safe for
> length( ) to directly measure.
>
> $ gecho; time gcat backupgenieaudio_53128949med_.lossless.mp3 | gsed -n '562368p' |
>     gawkfx -e '{ s0=$0; k=strdump($0,"x"); gsub(/[%]/," \\x",k);
>       print strdump($0) " :: " k " :: valid end-to-end ? " ($0 ~ /^.*$/) }
>     { bytes = match($0, /$/);
>       while (s0!="") { print sub("^"sprintf("%.1s",s0),"",s0) " --> " strdump(s0) ; };
>       print (s0=="") }
>     END { print NR, sum, bytes; print " s0 : ["  s0 "]"; orig0=$0;
>       gsub(/./, "< & >");  gsub(/\037/, "\\037"); gsub(/[\000]/, "\\000");
>       print ; print length(orig9) }' | gcat -n
> gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
>      1  \073\145\037\366\254\271\230\131\000 ::  \x3B \x65 \x1F \xF6 \xAC \xB9 \x98 \x59 \x00 :: valid end-to-end ? 0
>      2  1 --> \145\037\366\254\271\230\131\000
>      3  1 --> \037\366\254\271\230\131\000
>      4  1 --> \366\254\271\230\131\000
>      5  1 --> \254\271\230\131\000
>      6  1 --> \271\230\131\000
>      7  1 --> \230\131\000
>      8  1 --> \131\000
>      9  1 --> \000
>     10  1 -->
>     11  1
>     12  1==10
>     13   s0 : []
>     14  < ; >< e >< \037 >< ???? >< Y >< \000 >
>     15  9
>
> One can verify it via a regex constant that hard-codes the UTF8 spec
> (including explicitly skipping the high and low surrogates reserved for
> UTF16). A properly formed sequence will result in 0 bytes reported after the
> gsub( ) instead of 4:
>
> $ gecho; time gcat backupgenieaudio_53128949med_.lossless.mp3 | gsed -n '562368p' |
>     gawk -e '{ print match($0,/$/) -1 ;
>       gsub(sprintf("[%c-%c%c-%c%c-%c]",0x00,0x7F,0x80,0xD7FF,0xE0000,0x10FFFF),"");
>       print match($0,/$/) -1  }'
> 9
> 4
>
>



