Re: [bug-gawk] Exposing wcwidth(3) as a built-in function


From: arnold
Subject: Re: [bug-gawk] Exposing wcwidth(3) as a built-in function
Date: Sat, 09 Dec 2017 11:24:23 -0700
User-agent: Heirloom mailx 12.4 7/29/08

Hi.

> The determination for this is simply 'length("宽")'. If that returns 1,
> the interpreter is considered multi-byte safe,

Say, rather, multibyte aware.  Gawk is.  The particular value you
used is likely to be valid only in a Unicode locale, though.

> Are there some multi-byte locales where I could not count on
> sprintf("%c", 23485) being "宽" in GNU Awk?

Undoubtedly. I don't know which ones, though.

> I was imagining that in order to "insulate the programmer from the
> peculiarities of the underlying platform" GAWK might do something like
> use iconv to ensure that character values are invariant regardless of
> the locale, which is why I searched the source for iconv references.

Heaven forbid! That is definitely one of those roads to Hell paved
with good intentions.  Users have to be aware of the character set
and encoding they use, that's just a fact of modern life. They must
do any desired recodings themselves, using iconv or whatever.
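
As a rough illustration of recoding outside of awk, the sketch below
converts the single ISO-8859-1 byte 0xff to UTF-8 with iconv(1).  The
choice of encodings is an assumption made for the example, and od is used
instead of xxd only because it is specified by POSIX.

```shell
# Write the byte 0xff (octal 377), recode it from Latin-1 to UTF-8,
# and dump the result as hex with the whitespace stripped.
utf8=$(printf '\377' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8"   # c3bf
```

These are the same two bytes that gawk writes for printf "%c", 255 in a
UTF-8 locale.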

> Due to the way the "%c" format conversion works in GAWK, I ran into
> something that I'm not sure qualifies as a bug: "%c" cannot be used to
> write byte literals above 127 when the locale supports Unicode. An
> example:
>
>     $ mawk 'BEGIN { printf "%c", 255 }' | xxd
>     00000000: ff                                       .
>     $ gawk 'BEGIN { printf "%c", 255 }' | xxd
>     00000000: c3bf                                     ..

This is the reason for gawk's -b option.  Use that option if you
want bytes to be treated as individual characters, no matter what the
settings of the various LC_* environment variables may be.

> I understand why this is the case, but I still find the behavior
> surprising given that hexadecimal escapes in literals are interpreted as
> bytes:
>
>     $ gawk 'BEGIN { printf "\xff" }' | xxd
>     00000000: ff                                       .

This is a documented design decision, somewhat predating multibyte
support. For example, \x2A is '*'. Should /a\x2Ab/ match literal 'a*b'
or treat the \x2A as a metacharacter? I think traditional awk treated
it literally; in any case that's what gawk does (and has done for years
and years).

Hope this helps,

Arnold


