bug-gawk

Re: [bug-gawk] GAWK for Windows does not work properly with UTF-8


From: Eli Zaretskii
Subject: Re: [bug-gawk] GAWK for Windows does not work properly with UTF-8
Date: Thu, 11 Feb 2016 22:46:42 +0200

> Date: Thu, 11 Feb 2016 12:46:21 +0100
> From: Marc de Bourget <address@hidden>
> Cc: address@hidden
> 
> I use this version:
> http://sourceforge.net/projects/ezwinports/files/gawk-4.1.3-w32-bin.zip/download
> 
> Problem: This GAWK for Windows version counts bytes instead of characters. 
> Céline has 6 characters but 7 bytes due to the multibyte character "é".
> 
> The length function for the string "Céline" should result in 6 but it is 7. 
> Using gawk for Windows with UTF-8 produces wrong results for at least the 
> functions length, substr, index,
> match, split("Céline", CHARS, ""), printf, sprintf.
> 
> Creating a DOS Batch with setting the environment variable LC_ALL doesn't 
> help:

MS-Windows doesn't support UTF-8 as the locale's codeset, so sadly you
cannot do this, as long as Gawk uses libc functions such as mbrtowc,
and as long as it uses wchar_t as the type to hold Unicode codepoints
in scalar values.

In fact, because Gawk relies on the system's locale support, Gawk
programs that manipulate non-ASCII characters cannot be fully
portable: they are not guaranteed to produce the same output given
the same input on all supported platforms, not even across those that
do have UTF-8 locales.

So the only way you can write a portable Gawk program that does the
right thing with UTF-8 is to implement the functionality in Awk (or
hack Gawk's C implementation, if you want).

Sorry.
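
For illustration, here is a minimal sketch in C of the locale-independent
logic such a reimplementation would need, whether it is written in Awk or
patched into Gawk. It is not Gawk's actual code: the names utf8_length and
utf8_offset are hypothetical, and it assumes well-formed UTF-8 input. It
counts characters by skipping continuation bytes (those of the form
10xxxxxx) instead of calling mbrtowc, so the result does not depend on the
locale's codeset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Gawk.
 * Count the characters (code points) in a NUL-terminated UTF-8 string
 * without consulting the locale. Assumes well-formed UTF-8. */
size_t utf8_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        /* Continuation bytes are 10xxxxxx; any other byte starts a character. */
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper, not part of Gawk.
 * Return the byte offset of the character with 0-based index idx,
 * or the length of the string in bytes if idx is past the end. */
size_t utf8_offset(const char *s, size_t idx)
{
    size_t i = 0;     /* current byte position */
    size_t seen = 0;  /* characters started before position i */

    assert(s != NULL);
    while (s[i] != '\0') {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (seen == idx)
                return i;
            seen++;
        }
        i++;
    }
    return i;
}
```

With the 7-byte string "C\xC3\xA9line" (Céline in UTF-8), utf8_length
returns 6 and utf8_offset(s, 2) returns 3, the byte position of "l".
Character-based equivalents of length and substr can be built on these
two operations.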


