Re: [bug-gawk] Exposing wcwidth(3) as a built-in function
From: Eli Zaretskii
Subject: Re: [bug-gawk] Exposing wcwidth(3) as a built-in function
Date: Sat, 09 Dec 2017 10:51:15 +0200
> Date: Fri, 8 Dec 2017 15:25:34 -0800
> From: Eric Pruitt <address@hidden>
> Cc: address@hidden
>
> > Thanks, but doesn't this still assume UTF-8 encoding of characters?
> > If so, it's not portable to non-UTF-8 locales, right?
>
> I realized I may've misinterpreted your question, so I will clarify and
> add a question of my own: only the code for interpreters that are not
> multi-byte safe falls back to manual UTF-8 parsing. This means that in
> GAWK, the lookup table uses lexical comparisons assuming the locale is
> multi-byte safe.
What do you mean by "multi-byte safe locale"? UTF-8 is but one
multi-byte encoding; it is not the only one.
> Are there some multi-byte locales where I could not count on
> sprintf("%c", 23485) being "宽" in GNU Awk?
23485 is a single character value, so I don't understand how it is
related to the locale's codeset being multi-byte. Instead, this has
to do with the codeset itself and its representation and
interpretation of codepoints. E.g., if the locale's codeset is some
ISO-2022 variant, where codepoints are specific to each charset, I
think 23485 could very well be something else. For example, in
codepage 936, this character's codepoint is 49133. (Codepage 936 is
used by Windows in Simplified Chinese locales; it is a multibyte
encoding, but every non-ASCII character takes exactly two bytes,
unlike the variable-length sequences of UTF-8.)
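To make the numbers above concrete, here is a small Python sketch (Python
is used only for illustration; none of this is Gawk code). It assumes
Python's "cp936" codec as a stand-in for the Windows codepage and shows
that the value identifying the character depends on the codeset.

```python
# Illustration only: one character, two codesets, two different values.
ch = chr(23485)                        # Unicode code point U+5BBD

utf8_bytes = ch.encode("utf-8")        # three bytes in UTF-8
cp936_bytes = ch.encode("cp936")       # two bytes in codepage 936

# Read the codepage 936 byte pair as one big-endian number.
cp936_value = int.from_bytes(cp936_bytes, "big")

print(f"Unicode: U+{ord(ch):04X} = {ord(ch)}")
print(f"UTF-8 bytes: {utf8_bytes.hex()}")
print(f"cp936 bytes: {cp936_bytes.hex()} = {cp936_value}")
```

So the number 23485 identifies this character only when it is interpreted
as a Unicode code point.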
> From running
> "fgrep -ir iconv --include '*.h' --include '*.c'", it doesn't look like
> GAWK uses iconv. Perhaps a more accurate question is, will GAWK work on
> platforms that do not have **any** Unicode support (be it UTF-8, UTF-16,
> etc.)?
It already does: MS-Windows is one such platform. (It does support
UTF-16, but using that would require not using 'char *' pointers for
text, which would mean a thorough rewrite of most of Gawk's code,
something I don't expect to happen just to cater to Windows.)
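One way to see why UTF-16 text does not fit 'char *' strings is sketched
below in Python (an illustration, not Gawk code). It assumes the usual C
convention that a zero byte terminates a string, and mimics that
convention with a hypothetical `c_string_view` value.

```python
# Illustration only: UTF-16 text contains zero bytes even for ASCII input.
text = "gawk"
utf16 = text.encode("utf-16-le")       # two bytes per character here

# Mimic a NUL-terminated scan: stop at the first zero byte.
c_string_view = utf16.split(b"\x00", 1)[0]

print(utf16)           # every second byte is zero
print(c_string_view)   # only the first character survives the scan
```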
> * I have since rewritten the code for multi-byte unsafe interpreters so
> the lookup table is indexed by UTF-8 byte strings instead of numeric
> code points for performance reasons.
Using the UTF-8 byte sequences was the reason why I asked whether your
implementation relies on UTF-8. In a locale whose codeset is not
UTF-8, this will not work well.
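A rough Python sketch of the failure mode (illustration only; the table
`WIDTH_BY_UTF8` and the helper `width_of` are hypothetical and are not
taken from the proposed Awk implementation): a table keyed by UTF-8 byte
strings finds nothing when the input arrives in another codeset.

```python
# Hypothetical width table keyed by UTF-8 byte sequences.
WIDE = chr(23485)                      # U+5BBD, a double-width character
WIDTH_BY_UTF8 = {
    WIDE.encode("utf-8"): 2,
    "a".encode("utf-8"): 1,
}

def width_of(raw: bytes):
    """Return the column width for one encoded character, or None."""
    return WIDTH_BY_UTF8.get(raw)

utf8_input = WIDE.encode("utf-8")      # what a UTF-8 locale would deliver
cp936_input = WIDE.encode("cp936")     # same character, codepage 936 bytes

print(width_of(utf8_input))            # found in the table
print(width_of(cp936_input))           # not found: the bytes differ
```

ASCII characters still match because most codesets encode them
identically; the table breaks only for non-ASCII input.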