Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug


From: Michael Klement
Subject: Re: [bug-gawk] v4.1.3 (run on OSX 10.11.3): potential gsub() bug
Date: Sun, 7 Feb 2016 10:19:56 -0500


P.S. I'm curious what current GNU grep does with such things?

As of GNU Grep 2.22 with locale en_US.UTF-8:

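The test string is 'hätă', which comprises the following characters (values are Unicode code points):

  • 2 ASCII characters ('h' - 0x68, 't' - 0x74)
  • 1 extended-ASCII (Latin-1) character ('ä' - 0xe4)
  • 1 character beyond that range ('ă' - 0x103)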
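For reference, here is a minimal sketch of how the test string looks at the byte level under UTF-8, which is why the ranges in the commands further down have to be interpreted as code points rather than bytes. The helper names test_string_bytes and dump_hex are made up for this illustration; the string is written with octal escapes so the snippet itself stays pure ASCII.

```shell
# Hypothetical helper: emit the test string 'hätă' as UTF-8 bytes.
# \303\244 is 'ä' (U+00E4) and \304\203 is 'ă' (U+0103).
test_string_bytes() {
  printf 'h\303\244t\304\203'
}

# Hypothetical helper: dump stdin as space-separated hex bytes on one line.
dump_hex() {
  od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

test_string_bytes | dump_hex
```

This should print 68 c3 a4 74 c4 83: the two ASCII characters occupy one byte each, while 'ä' and 'ă' each occupy two bytes, none of which equals the code point value.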

With no regex option (basic regex) and with -E (extended), \xhh is apparently not recognized at all (the same applies to BSD Grep on OSX 10.11.3):

$ grep -o '[\x00-\x7f]' <<<'hätă'  # !! NO output
$ grep -Eo '[\x00-\x7f]' <<<'hätă'  # !! NO output
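One plausible reading of the above, offered as an assumption rather than a verified diagnosis: POSIX bracket expressions treat a backslash as a literal character, so \x inside [...] is not an escape introducer at all. The sketch below checks only that narrow point. The function name match_backslash_or_x is invented for this illustration, and LC_ALL=C is set purely to keep the result independent of the caller's locale.

```shell
# Hypothetical helper: print each match of the bracket expression [\x]
# found on stdin. If the backslash is literal inside brackets, this set
# contains exactly two members: a backslash and the letter "x".
match_backslash_or_x() {
  LC_ALL=C grep -o '[\x]'
}

# The input contains a literal backslash but no letter "x".
printf '%s\n' 'a\b' | match_backslash_or_x
```

With a grep that follows POSIX here, this prints a single backslash, which suggests that '[\x00-\x7f]' is parsed as a set of literal characters and ranges rather than as the byte range 0x00 to 0x7f.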

With -P (PCRE), you get Unicode-aware range support, based on \xhh and on \x{…} with a variable number of hex digits; see the commands that follow.

Note that trying to use Perl's explicit Unicode escapes instead results in the following error message: "PCRE does not support \L, \l, \N{name}, \U, or \u"

$ grep -Po '[\x00-\x7f]' <<<'hätă'  # OK - includes only ASCII chars.
h
t

$ grep -Po '[^\x80-\xFF]' <<<'hätă'  # OK - only excludes the extended ASCII range, so ă (0x103) is retained
h
t
ă

$ grep -Po '[^\x80-\x{10ffff}]' <<<'hätă'  # OK - excludes all non-ASCII chars (0x10ffff is the highest Unicode code point).
h
t


