Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode

From:	Assaf Gordon
Subject:	Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date:	Thu, 19 Jan 2017 19:33:53 -0500

Hello,

> On Jan 19, 2017, at 19:10, Norihiro Tanaka <address@hidden> wrote:
> 
> 1. How should \uHHHH expression be parsed in bracket?
> 
>    $ echo b | grep '[\u0041]'
> 
>    I think a \uXXXX expression should not work inside a bracket.

> 2. How should the following expression be parsed: as [a-c] or as \[a-c\]?
> 
>    $ echo b | grep '\u005Ba-c\u005D'
> 
>    I think that a \uHHHH expression should not produce a metacharacter,
>    i.e. I think that many users will prefer \[a-c\] to [a-c].

Thank you for raising these good points.

Currently, escape sequences are parsed and converted before
being sent to re/dfa.
Thus, '[\u0041]' is equivalent to '[A]',
and   '\u005Ba-c\u005D' is equivalent to '[a-c]'.
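
A minimal sketch of that pre-processing step, in Python rather than C and
not taken from the actual patch. The name unescape_pattern is hypothetical;
it assumes \u takes exactly 4 hex digits and \U exactly 8, leaves every
other backslash pair untouched, and does no range validation:

```python
import re

# Hypothetical helper, not the real grep/sed/gawk code: expand \uHHHH and
# \UHHHHHHHH before the pattern is handed to the regex engine.
# Any other backslash pair (including "\\") is matched and copied through,
# so an escaped backslash followed by "u0041" is not expanded.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))", re.S)


def unescape_pattern(pattern: str) -> str:
    def expand(match: re.Match) -> str:
        digits = match.group(1) or match.group(2)
        if digits is None:
            return match.group(0)  # not a \u or \U escape: keep as-is
        return chr(int(digits, 16))

    return _ESCAPE.sub(expand, pattern)


print(unescape_pattern(r"[\u0041]"))         # [A]
print(unescape_pattern(r"\u005Ba-c\u005D"))  # [a-c]
print(unescape_pattern(r"[\u03a8]"))         # [Ψ]
```

Because the expansion happens first, the regex engine never sees the
escape, which is why the result can act as a metacharacter.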

Note that my current implementation is missing a key detail:
coreutils' printf rejects escape sequences in certain ranges, so
examples like the two above would not be accepted in practice:

 "A universal character name shall not specify a character short
  identifier in the range 00000000 through 00000020, 0000007F through
  0000009F, or 0000D800 through 0000DFFF inclusive. A universal
  character name shall not designate a character in the required
  character set."
 http://lingrok.org/xref/coreutils/src/printf.c#285
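
The rule quoted above could be sketched as a predicate like the one
below. This is an illustration in Python, not the printf.c code; the
name is_rejected_ucn is hypothetical. It assumes that "the required
character set" covers every character below U+00A0 except '$', '@'
and '`', which is how the C standard phrases the same restriction:

```python
def is_rejected_ucn(code_point: int) -> bool:
    """Return True if the quoted rule would refuse this code point."""
    # Assumption: below U+00A0 only '$' (0x24), '@' (0x40) and '`' (0x60)
    # are allowed; everything else there falls either in one of the quoted
    # ranges or in the required (basic) character set.
    allowed_low = {0x24, 0x40, 0x60}
    if code_point <= 0x9F and code_point not in allowed_low:
        return True
    # Surrogates, U+D800 through U+DFFF, are rejected outright.
    return 0xD800 <= code_point <= 0xDFFF


print(is_rejected_ucn(0x0041))  # True  ('A' is in the required set)
print(is_rejected_ucn(0x005B))  # True  ('[' is in the required set)
print(is_rejected_ucn(0x03A8))  # False (Greek capital psi)
```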

However, other sequences are un-escaped. Thus:
  '[\u03a8]' means '[Ψ]'
and not the set of characters \, u, 0, 3, a, 8.

It will take a bit more work (perhaps even touching re/dfa) to avoid
un-escaping sequences inside brackets. Worth considering and discussing.
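
One possible shape for that alternative, again as a hedged Python sketch
and not a proposal for the real code: expand escapes only outside bracket
expressions. The names unescape_outside_brackets and _bracket_end are
hypothetical, and the bracket scanner is deliberately simplified (it
handles a leading '^', a leading ']', and [:class:], [.coll.], [=equiv=]
elements, and treats an unterminated bracket as running to the end):

```python
import re

_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def _bracket_end(pattern: str, start: int) -> int:
    """Return the index just past the bracket expression opening at start."""
    i = start + 1
    if i < len(pattern) and pattern[i] == "^":
        i += 1
    if i < len(pattern) and pattern[i] == "]":
        i += 1  # a leading ']' is a literal member, not the terminator
    while i < len(pattern):
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if pattern[i] == "[" and nxt and nxt in ":.=":
            close = pattern.find(nxt + "]", i + 2)
            if close == -1:
                return len(pattern)
            i = close + 2
        elif pattern[i] == "]":
            return i + 1
        else:
            i += 1
    return len(pattern)


def unescape_outside_brackets(pattern: str) -> str:
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            m = _UCN.match(pattern, i)
            if m:
                out.append(chr(int(m.group(1) or m.group(2), 16)))
                i = m.end()
            else:
                out.append(pattern[i:i + 2])  # copy other escapes as a pair
                i += 2
        elif ch == "[":
            end = _bracket_end(pattern, i)
            out.append(pattern[i:end])  # bracket text is left untouched
            i = end
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(unescape_outside_brackets(r"[\u0041]"))        # [\u0041]
print(unescape_outside_brackets(r"\u03a8[\u03a8]"))  # Ψ[\u03a8]
```

This only covers the "not inside brackets" half of the question; whether
an expanded character may act as a metacharacter outside brackets is a
separate decision.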

Thanks again,
 - assaf
