Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode


From: Assaf Gordon
Subject: Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date: Thu, 19 Jan 2017 19:33:53 -0500

Hello,

> On Jan 19, 2017, at 19:10, Norihiro Tanaka <address@hidden> wrote:
> 
> 1. How should a \uHHHH expression be parsed inside a bracket expression?
> 
>    $ echo b | grep '[\u0041]'
> 
>    I think that a \uHHHH expression should not work inside brackets.

> 2. How should the following expression be parsed: as [a-c] or as \[a-c\]?
> 
>    $ echo b | grep '\u005Ba-c\u005D'
> 
>    I think that a \uHHHH expression should not produce a meta character,
>    i.e. I think that many users will prefer \[a-c\] to [a-c].

Thank you for raising these good points.

Currently, escape sequences are parsed and converted before
being sent to re/dfa.
Thus, '[\u0041]' is equivalent to '[A]',
and   '\u005Ba-c\u005D' is equivalent to '[a-c]'.
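
To illustrate the ordering described above, here is a minimal sketch in
Python (not the actual C code). The function name unescape_unicode is
hypothetical, and the sketch deliberately ignores escaped backslashes and
out-of-range \U values; it only shows that the regex engine sees the
pattern after the escapes have been replaced.

```python
import re

# \uHHHH takes exactly four hex digits, \UHHHHHHHH exactly eight.
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def unescape_unicode(pattern: str) -> str:
    """Replace each \\uHHHH / \\UHHHHHHHH with the character it names.

    Hypothetical helper for illustration only: it does not special-case
    brackets, so the result can contain new meta characters.
    """
    def repl(match: re.Match) -> str:
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))
    return _ESCAPE.sub(repl, pattern)

# The conversion happens first, so the regex engine receives '[a-c]'.
cooked = unescape_unicode(r'\u005Ba-c\u005D')
print(cooked)                      # [a-c]
print(bool(re.search(cooked, 'b')))  # True
```

Because the conversion is purely textual, the engine cannot tell a
literal '[' from one that was written as \u005B.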

Note that my current implementation is missing a key detail:
coreutils' printf rejects escape sequences in certain ranges, so
the two examples above would not be accepted in practice:

 "A universal character name shall not specify a character short
  identifier in the range 00000000 through 00000020, 0000007F through
  0000009F, or 0000D800 through 0000DFFF inclusive. A universal
  character name shall not designate a character in the required
  character set."
 http://lingrok.org/xref/coreutils/src/printf.c#285
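
As a rough sketch of that check, again in Python: the function name
printf_would_reject is hypothetical, and the handling of the "required
character set" clause (everything up to U+009F except '$', '@' and '`')
is an assumption about how printf.c applies the rule, so please verify
it against the source linked above.

```python
def printf_would_reject(code_point: int) -> bool:
    """Return True if the quoted rule excludes this code point.

    Assumption: the 'required character set' clause is modelled as
    'every code point up to U+009F other than $ (0x24), @ (0x40)
    and ` (0x60)'. This also covers the first two quoted ranges.
    """
    in_low_range = code_point <= 0x9F and code_point not in (0x24, 0x40, 0x60)
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_low_range or is_surrogate

for cp in (0x41, 0x5B, 0x40, 0x3A8, 0xD800):
    print(f"U+{cp:04X}", printf_would_reject(cp))
```

Under this assumption \u0041 and \u005B are rejected, while \u03a8 is
accepted.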

However, other sequences are un-escaped,
thus:
  '[\u03a8]' means '[Ψ]'
and not the set of characters \, u, 0, 3, a and 8.


It will take a bit more work (perhaps even touching re/dfa) to avoid
un-escaping sequences inside brackets. Worth considering and discussing.
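
One possible direction, sketched in Python purely for discussion: skip
the conversion while inside a bracket expression. The function name
unescape_outside_brackets is hypothetical, and the bracket tracking is
simplified (it does not understand [:class:], [=equiv=] or [.coll.]
inside brackets). It also does nothing about the second point above, a
\u005B outside brackets still turning into a meta character.

```python
import re

_UCN = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def unescape_outside_brackets(pattern: str) -> str:
    """Convert \\uHHHH / \\UHHHHHHHH only outside bracket expressions."""
    out = []
    i = 0
    in_bracket = False
    while i < len(pattern):
        ch = pattern[i]
        if in_bracket:
            # Copy everything verbatim until the closing ']'.
            out.append(ch)
            if ch == ']':
                in_bracket = False
            i += 1
            continue
        match = _UCN.match(pattern, i)
        if match:
            out.append(chr(int(match.group(1) or match.group(2), 16)))
            i = match.end()
            continue
        if ch == '\\' and i + 1 < len(pattern):
            # Any other escape (e.g. \[ or \\): keep both characters as-is.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        out.append(ch)
        i += 1
        if ch == '[':
            in_bracket = True
            # A ']' directly after '[' or '[^' is a literal member.
            if i < len(pattern) and pattern[i] == '^':
                out.append('^')
                i += 1
            if i < len(pattern) and pattern[i] == ']':
                out.append(']')
                i += 1
    return ''.join(out)

print(unescape_outside_brackets(r'\u03a8[\u0041]'))
```

Doing this as a text pre-pass duplicates bracket parsing that re/dfa
already performs, which is one reason the real fix may belong there.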

Thanks again,
 - assaf