bug-gawk

From: arnold
Subject: Re: inconsistency with counting characters vs bytes for multi-byte characters
Date: Tue, 12 Sep 2023 15:37:03 -0600
User-agent: Heirloom mailx 12.5 7/5/10

Hi Ed.

Thank you for reporting this. It is most definitely a bug. This is Yet
Another Interesting Corner Case.  I guess as UTF-8 becomes more and more
common, these bugs will get shaken out.

I have attached a fix below, which passes the test suite and seems to
fix the problem. I'm going to let it stew for a day or two before pushing
it out to the Git repo.

Thanks!

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

> Arnold et al - someone on a forum just pointed out this:
>
>      $ awk 'BEGIN{str="abc"; n=gsub(//,"X",str); print n, str }'
>      4 XaXbXcX
>
>      $ awk 'BEGIN{str="\342\200\257"; n=gsub(//,"X",str); print n, str }'
>      4 X▒X▒X▒X
>
> i.e. gsub() with an empty regexp matches around each byte in that 3-byte 
> character. I don't recall ever having wanted to match an empty regexp 
> and can't find a reference to that in the documentation, so I don't know if
> that's expected behavior or undefined behavior or a similar issue to the 
> match() issue below so thought it best to just pass it along so you can 
> decide what, if anything, to do about it.
>
> In case some background would be useful, there's a discussion on this at 
> the bottom of https://stackoverflow.com/a/77010950/1745001 - the person 
> whose login there is "RARE Kpop Manifesto" advocating for not changing 
> match() is the same Jason Kwan you've interacted with previously in this 
> mailing list, e.g. at 
> https://lists.gnu.org/archive/html/bug-gawk/2021-09/msg00073.html.
>
>      Ed.
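
For readers without a gawk build handy, the difference Ed describes can be
reproduced outside awk. The following is only a rough sketch in Python, not
gawk's code and not the attached fix: it uses Python's re.subn() as a
stand-in for gsub(), and the helper name count_empty_matches is made up for
this illustration. It applies an empty pattern to the same three octets
(\342\200\257, which is U+202F in UTF-8), first as raw bytes and then as one
decoded character.

```python
import re


def count_empty_matches(subject):
    """Insert X at every empty match; return (match count, result).

    Hypothetical helper for illustration only. Works on bytes (byte-wise
    matching) or str (character-wise matching).
    """
    if isinstance(subject, bytes):
        result, n = re.subn(b"", b"X", subject)
    else:
        result, n = re.subn("", "X", subject)
    return n, result


# The same three octets as "\342\200\257" in the awk example.
nnbsp_bytes = b"\xe2\x80\xaf"
# Decoded as UTF-8 they form a single character, U+202F.
nnbsp_text = nnbsp_bytes.decode("utf-8")

byte_count, byte_result = count_empty_matches(nnbsp_bytes)
char_count, char_result = count_empty_matches(nnbsp_text)

print(byte_count, byte_result)                  # empty match around each byte
print(char_count, char_result.encode("utf-8"))  # empty match around the character
```

Byte-wise, the empty pattern matches four times and splits the sequence, as
in the report above. Character-wise, it matches twice and leaves the three
octets together, which is presumably the result one would want from gsub()
in a UTF-8 locale.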

Attachment: gsub-fix.diff
Description: Text document

