From: arnold
Subject: Re: inconsistency with counting characters vs bytes for multi-byte characters
Date: Tue, 12 Sep 2023 15:37:03 -0600
User-agent: Heirloom mailx 12.5 7/5/10
Hi Ed.

Thank you for reporting this. It is most definitely a bug. This is Yet
Another Interesting Corner Case. I guess as UTF-8 becomes more and more
common, these bugs will get shaken out.

I have attached a fix below, which passes the test suite and seems to
fix the problem. I'm going to let it stew for a day or two before pushing
it out to the Git repo.

Thanks!

Arnold
Ed Morton <mortoneccc@comcast.net> wrote:
> Arnold et al - someone on a forum just pointed out this:
>
> $ awk 'BEGIN{str="abc"; n=gsub(//,"X",str); print n, str }'
> 4 XaXbXcX
>
> $ awk 'BEGIN{str="\342\200\257"; n=gsub(//,"X",str); print n, str }'
> 4 X▒X▒X▒X
>
> i.e. gsub() with an empty regexp matches around each byte in that 3-byte
> character. I don't recall ever having wanted to match an empty regexp
> and can't find a reference to that in documentation so I don't know if
> that's expected behavior or undefined behavior or a similar issue to the
> match() issue below so thought it best to just pass it along so you can
> decide what, if anything, to do about it.
>
> In case some background would be useful, there's a discussion on this at
> the bottom of https://stackoverflow.com/a/77010950/1745001 - the person
> whose login there is "RARE Kpop Manifesto" advocating for not changing
> match() is the same Jason Kwan you've interacted with previously in this
> mailing list, e.g. at
> https://lists.gnu.org/archive/html/bug-gawk/2021-09/msg00073.html.
>
> Ed.
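
For readers who want to see the byte-versus-character difference outside
of awk, here is a minimal sketch in Python (not gawk's code, and not the
attached fix). It assumes that an empty regexp should match once before
each character and once at the end, so a string of N characters gives
N+1 matches. The helper name empty_match_sub is made up for this
illustration. The octal bytes \342\200\257 from the report are the UTF-8
encoding of the single character U+202F.

```python
import re

# The three bytes from the report: octal \342\200\257 is the UTF-8
# encoding of the single character U+202F.
raw = b"\342\200\257"
text = raw.decode("utf-8")


def empty_match_sub(subject, replacement):
    """Replace every empty match in subject; return (match count, result).

    Hypothetical helper for this illustration only. It works on either
    bytes or str, so the same logic can be run at both granularities.
    """
    pattern = b"" if isinstance(subject, bytes) else ""
    result, count = re.subn(pattern, replacement, subject)
    return count, result


# Byte-oriented matching: one empty match around each of the 3 bytes,
# which splits the multi-byte character apart (the reported behavior).
byte_count, byte_result = empty_match_sub(raw, b"X")

# Character-oriented matching: one empty match on each side of the
# single character, which stays intact.
char_count, char_result = empty_match_sub(text, "X")

print(byte_count, byte_result)
print(char_count, ascii(char_result))
```

Run on bytes, the sketch finds 4 empty matches and breaks the character
into three invalid pieces, which is what the second awk command above
shows. Run on the decoded string, it finds 2 matches and leaves the
character whole.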
Attachment: gsub-fix.diff (text document)