Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode

sed-devel


From:	arnold
Subject:	Re: [bug-gawk] GNU grep,awk,sed: support \u and \U for unicode
Date:	Wed, 11 Jan 2017 02:00:27 -0700
User-agent:	Heirloom mailx 12.4 7/29/08

Hi Assaf.

Thanks for the initiative.  It is a good idea for the named tools to
be consistent in their approach, even if that approach is to not
do anything. :-)

I'll take a look at the patch you posted (eventually).  If you haven't
done FSF paperwork for gawk, you should do so, even if I don't adopt \u / \U.

I've been (successfully) avoiding something like this for many years and
was hoping to continue to use the "somebody else's problem field" on it
for a while longer yet. :-)

There are a number of questions to answer. These are just what I can
think of off the top of my head.

1. What should gawk(/sed/grep) do upon encountering \u/\U in a non-UTF locale?

2. Do we even have a fool-proof way to know that we're in a UTF locale?

3. This feature further distances GNU tools from standard practice,
decreasing portability of programs that depend upon it.

4. I think that it's not hard to use current dfa/regex - just convert
the hex to a wchar_t string and then from there back to multibyte characters,
but maybe I'm wrong about that. Paul? Jim?

5. How do we handle MinGW and Cygwin, where wchar_t is 16 bits, vs. 32
bits just about everywhere else?
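
As a rough illustration of points 4 and 5, here is a minimal C sketch of
turning the hex digits of \uHHHH or \UHHHHHHHH into bytes. It is not taken
from gawk, sed, grep, dfa or regex. The helper names (parse_hex,
encode_utf8, to_utf16) are hypothetical. It also swaps techniques: instead
of going through a wchar_t string as point 4 suggests, it encodes the code
point straight to UTF-8, which is only meaningful in a UTF-8 locale. The
last helper shows the surrogate-pair split that a 16-bit wchar_t would need
for code points above U+FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read exactly ndigits hex digits from s into *cp.
   Returns 0 on success, -1 if a non-hex character (including NUL) is hit. */
static int parse_hex(const char *s, int ndigits, uint32_t *cp)
{
    uint32_t v = 0;

    assert(s != NULL && cp != NULL);
    for (int i = 0; i < ndigits; i++) {
        char c = s[i];
        uint32_t d;

        if (c >= '0' && c <= '9')
            d = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f')
            d = (uint32_t)(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F')
            d = (uint32_t)(c - 'A' + 10);
        else
            return -1;
        v = (v << 4) | d;
    }
    *cp = v;
    return 0;
}

/* Hypothetical helper: encode cp as UTF-8 into out (room for 4 bytes).
   Returns the byte count, or 0 for surrogates and values above U+10FFFF. */
static size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not valid code points */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: express cp as UTF-16 code units, as a platform
   with a 16-bit wchar_t would have to. Returns 1 or 2 units, 0 if invalid. */
static int to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

In a non-UTF-8 locale (point 1) these bytes would be wrong, so a real
implementation would still need a locale check and a conversion step, or
an error, before using them.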

I'm sure I'll think of other things.

For gawk, assuming you can convince me to go with this (:-) I will also
need documentation updates.

Thanks,

Arnold

Assaf Gordon <address@hidden> wrote:

> (sorry for cross posting, I hope the discussion is relevant for all)
>
> Hello,
>
> I'd like to suggest (or discuss) a minor addition to grep/awk/sed:
> adding support for '\u' and '\U' for unicode characters, with
> the same rules as coreutils' printf:
>   \uHHHH  Unicode (ISO/IEC 10646) character with hex value HHHH (4 digits)
>   \UHHHHHHHH  Unicode character with hex value HHHHHHHH (8 digits)
>
> For 'awk' and 'grep', I believe these sequences are currently
> undefined and unused. For sed, it uses '\U' and '\u' in limited
> capacity (upper case replacement in s///).
> As for POSIX, I believe the behavior is unspecified and thus can be 
> implemented.
>
> I think that supporting the exact same syntax with the same semantics
> across multiple GNU tools is a good long-term behavior,
> and multibyte/unicode support is becoming more important and
> more useful as time goes by.
>
> For now I'm not asking about implementation issues (which I'm sure will be
> numerous, including interplay with gnulib and glibc, locales,
> and sed's backwards incompatibility).
>
> I'm more interested in discussing whether such long-term behavior is something
> that you'd consider for each of the respective projects (perhaps even mentally
> reserve '\u' and '\U' sequences for it, or accept patches in that direction).
>
>
> As for sed,
> I'm quite new here, but my thinking is that \u and \U
> are used in a limited way 
> (https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command),
> and perhaps it can be argued that breaking compatibility will cause limited
> trouble for very specialized scripts, and is worth the long-term improvement
> (of course the functionality will remain, just with a different letter).
>
>
> Thanks for reading,
> and for any suggestions or comments,
> regards,
>  - assaf
>
>
>
