Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug


From: Michael Klement
Subject: Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date: Fri, 29 Jan 2016 11:02:30 -0500

Thanks, Hermann.

LC_ALL=C is an effective workaround for the case at hand, though it precludes working with Unicode characters as such in the rest of the script (which may never be needed).
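
To make that trade-off concrete, here is a minimal sketch (the helper name byte_length is hypothetical, and plain awk is used on the assumption that it counts bytes in the C locale, as gawk does). Under LC_ALL=C the two UTF-8 bytes of 'ä' are counted as two separate characters:

```shell
# Hypothetical helper: print the length of its argument as awk sees it
# in the C locale, where every byte counts as one character.
byte_length() {
    printf '%s\n' "$1" | LC_ALL=C awk '{ print length($0) }'
}

# 'hät' in UTF-8 is the byte sequence h, 0xC3, 0xA4, t.
# Octal escapes keep the example independent of the file's encoding.
byte_length "$(printf 'h\303\244t')"   # prints 4, not 3
```

In a UTF-8 locale, gawk's length() would be expected to report 3 for the same input, which is the character-level view that LC_ALL=C gives up.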

Another workaround, though not fully equivalent, is:

echo 'hät' | gawk '{ gsub(/[^\x00-\x7F]/, ""); print }'

This works without LC_ALL=C, but it removes ALL non-ASCII characters, not just those in the range 128 - 255.

Which brings me to a question (couldn't figure it out from the docs): 

Are the \x.. escapes inside bracket expressions *supposed* to work with *all Unicode* codepoints?
In other words: in a UTF-8 locale, *can you specify Unicode code-point ranges* (that go way beyond 0xFF) rather than just individual-byte ranges?

The following does appear to work in locale "en_US.UTF-8", but it may be accidental:

# Exclude all non-ASCII chars (exclude the entire non-ASCII Unicode codepoint range).
echo 'hät' | gawk '{ gsub(/[^\x80-\x10f7ff]/, ""); print }'

The crash prevents me from testing the complement.

Obviously, without a construct to delimit the hex digits ({…} doesn't work), there's ambiguity.

Either way, I suggest clarifying the behavior at https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html#Bracket-Expressions


Michael


On Jan 29, 2016, at 5:15 AM, Hermann Peifer <address@hidden> wrote:

$
$ echo 'hät' | gawk '{ gsub(/[\x80-\xFF]/, ""); print }'
gawk: cmd. line:1: error: Invalid collation character: /[�-�]/
$
$ echo 'hät' | LC_ALL=C gawk '{ gsub(/[^\x80-\xFF]/, ""); print }'
ä


About my second example: what I actually meant to post was that the original line works fine in C locale:

$ echo 'hät' | LC_ALL=C gawk '{ gsub(/[\x80-\xFF]/, ""); print }'
ht


Hermann

