Re: [bug-gawk] Inconsistent results between mac and linux when $'\t' is


From: Nelson H. F. Beebe
Subject: Re: [bug-gawk] Inconsistent results between mac and linux when $'\t' is used in a command
Date: Thu, 1 Dec 2016 11:33:18 -0700

Besides the portable reimplementation of Peng Yu's original test program

        print | "sort -t $'\t' -k 1,1n"

as

        print | "sort -t '\t' -k 1,1n"

that Arnold proposed to avoid shell-syntax dependencies, it is
particularly important to remember that lots of new software,
including ash, bash, dash, gawk, grep, ksh, mksh, sed, sort, zsh, all
of coreutils, and others, is locale aware, and unless a user forces a
particular locale by explicit settings of LANG and the LC_xxx
environment variables, then different results are likely to be found
on different machines.

If you really want to ensure consistent behavior, then you need to
specify the locale when running many modern programs.  Thus, in my
case, I would instead write

        print | "env LANG=C LC_ALL=C sort -t '\t' -k 1,1n"

but you might, for example, prefer to replace C (the original
ASCII/POSIX locale) with something like fr_FR.utf8 or de_DE.utf8.  If
you want to be really paranoid, you should first eliminate ALL
environment variables with "env -i", then put back just the ones you
want:

        print | "env -i PATH=/bin:/usr/bin HOME=$HOME LANG=C LC_ALL=C sort -t '\t' -k 1,1n"
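
As a minimal sketch of the above, the following script pipes three
tab-separated records from awk into sort under a cleaned environment and
a forced C locale.  The sample records and the variable names (records,
cmd, sorted, bytewise) are my own illustrations, and I assume that sort
lives in /bin or /usr/bin.  The command is passed with -v, where awk
turns \t into a real tab just as it does in a string literal.

```shell
#!/bin/sh
# Sample tab-separated records, deliberately out of numeric order
# (illustrative data only).
records=$(printf '10\tten\n9\tnine\n100\thundred\n')

# env -i clears the inherited environment; only the variables listed
# here reach sort.  awk expands \t in the -v value to a tab character,
# so the shell that awk starts sees -t '<TAB>'.
cmd="env -i PATH=/bin:/usr/bin LANG=C LC_ALL=C sort -t '\t' -k 1,1n"

sorted=$(printf '%s\n' "$records" |
    awk -v cmd="$cmd" '{ print | cmd } END { close(cmd) }')
printf '%s\n' "$sorted"

# In the C locale sort compares bytes, so uppercase letters precede
# lowercase ones regardless of the machine's default locale.
bytewise=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | tr '\n' ' ')
printf '%s\n' "$bytewise"
```

Running the same script with a different LANG in the calling shell
should not change either result, because the locale is fixed on the
sort command line itself.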

Unfortunately, apart from the (identical) locales C and POSIX, locale
names are not standardized.  The above examples are from a CentOS 7
GNU/Linux system; on my old Solaris 10 workstation, and on FreeBSD,
they have to be written as fr_FR.UTF-8 and de_DE.UTF-8.

To make matters even worse, when a locale name from the environment is
unrecognized, it is simply ignored, and the system default locale
(perhaps chosen by the O/S distributor, or reset by local system
administrators, or the running user's shell startup profiles) is used.
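
Because an unrecognized name fails silently, one defensive approach is
to check a name against the output of "locale -a" before using it.  The
sketch below is only an illustration: pick_locale is a hypothetical
helper of my own, not a standard utility, and it assumes that "locale
-a" prints one installed locale name per line.

```shell
#!/bin/sh
# pick_locale AVAILABLE CANDIDATE...
# (hypothetical helper) Print the first CANDIDATE that appears verbatim
# as a line of the newline-separated AVAILABLE list; print C if none does.
pick_locale() {
    available=$1
    shift
    for candidate in "$@"; do
        # -F: fixed string, -x: whole line, -q: quiet
        if printf '%s\n' "$available" | grep -Fqx -e "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Nothing matched: fall back to the one locale every system has.
    printf 'C\n'
}

# Try both spellings of the French UTF-8 locale mentioned above.
chosen=$(pick_locale "$(locale -a 2>/dev/null)" fr_FR.utf8 fr_FR.UTF-8)
printf 'would run: env LC_ALL=%s sort ...\n' "$chosen"
```

Falling back to C rather than to the system default is a design choice
in this sketch: it keeps the result predictable on a machine where
neither spelling is installed.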

Locales also depend on human languages, planetary regions, and the
character set used for encoding text in files.  For example,
Solaris offers 22 variations of French locales:

% locale -a | grep ^fr | pr -c3 -w80 -f -t
fr                         address@hidden             fr_FR
fr.ISO8859-15              fr_CA                      fr_FR.ISO8859-1
fr.UTF-8                   fr_CA.ISO8859-1            fr_FR.ISO8859-15
fr_BE                      fr_CA.UTF-8                address@hidden
fr_BE.ISO8859-1            fr_CH                      fr_FR.UTF-8
fr_BE.ISO8859-15           fr_CH.ISO8859-1            address@hidden
address@hidden             fr_CH.UTF-8                fr_LU.UTF-8
fr_BE.UTF-8
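
Names like these conventionally follow the pattern
language[_territory][.codeset][@modifier].  As a rough sketch, the
pieces can be pulled apart with plain shell parameter expansion;
parse_locale is a hypothetical helper name of my own, and since the
pattern is a convention rather than a guarantee, a system may well
accept names that do not fit it.

```shell
#!/bin/sh
# parse_locale NAME
# (hypothetical helper) Split NAME according to the conventional
# language[_territory][.codeset][@modifier] pattern and print the parts.
parse_locale() {
    name=$1
    modifier= codeset= territory=
    # Peel the optional parts off from the right-hand end.
    case $name in *@*) modifier=${name#*@}; name=${name%%@*} ;; esac
    case $name in *.*) codeset=${name#*.}; name=${name%%.*} ;; esac
    case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$name" "$territory" "$codeset" "$modifier"
}

# Two names taken from the Solaris listing above.
parse_locale fr_CA.UTF-8
parse_locale fr
```

The same language thus appears once per territory and once per codeset,
which is how a single language grows into 22 locale names.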

The reality of the evolution of character sets from the 1963 ASCII is
that any modern system is likely to contain text files encoded in at
least ASCII (most), ISO 8859-1 (aka Latin 1) covering much of Western
European language needs, and UTF-8 (Unicode in the 8-bit
variable-byte-length encoding), and perhaps others, such as any of the
15 or so ISO 8859-n encodings.  When users start to pull text
documents from the Web, things get even more hairy.  I often encounter
Web documents containing mixtures of character-set encodings.

Internationalization of software, when not done according to
widely-implemented international standards, can be a plague!

------------------------------------------------------------------------

P.S. The 2001 POSIX specification says:

>> ...
>> 3942 7.2       POSIX Locale
>>
>> 3943           Conforming systems shall provide a POSIX locale, also known as the C locale. The behavior of
>> 3944           standard utilities and functions in the POSIX locale shall be as if the locale was defined via the
>> 3945           localedef utility with input data from the POSIX locale tables in Section 7.3.
>>
>> 3946           The tables in Section 7.3 describe the characteristics and behavior of the POSIX locale for data
>> 3947           consisting entirely of characters from the portable character set and the control character set. For
>> 3948           other characters, the behavior is unspecified. For C-language programs, the POSIX locale shall be
>> 3949           the default locale when the setlocale ( ) function is not called.
>>
>> 3950           The POSIX locale can be specified by assigning to the appropriate environment variables the
>> 3951           values "C" or "POSIX".
>>
>> 3952           All implementations shall define a locale as the default locale, to be invoked when no
>> 3953           environment variables are set, or set to the empty string. This default locale can be the POSIX
>> 3954           locale or any other implementation-defined locale. Some implementations may provide facilities
>> 3955           for local installation administrators to set the default locale, customizing it for each location.
>> 3956           IEEE Std 1003.1-2001 does not require such a facility.
>> ...

The ISO C99 Standard says:

>> ...
>>     7.11.1.1 The setlocale function
>>     Synopsis
>> 1            #include <locale.h>
>>              char *setlocale(int category, const char *locale);
>>     Description
>> ...
>>
>> 3   A value of "C" for locale specifies the minimal environment for C translation; a value
>>     of "" for locale specifies the locale-specific native environment. Other
>>     implementation-defined strings may be passed as the second argument to setlocale.
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ WARNING: EXPLICIT PLATFORM DEPENDENCE
>> ...

>> ...
>> 4   At program startup, the equivalent of
>>             setlocale(LC_ALL, "C");
>>     is executed.
>> 5   The implementation shall behave as if no library function calls the setlocale function.
>> ...

To make matters even more complex, some systems implement locales only
partially, as noted in this snippet from "man locale" on a
bleeding-edge FreeBSD 12 system:

>> ...
>> STANDARDS
>>      The locale utility conforms to IEEE Std 1003.1-2004 (``POSIX.1'').  The
>>      LC_CTYPE, LC_MESSAGES and NLSPATH environment variables are not
>>      interpreted.
>> ...

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: address@hidden  -
- 155 S 1400 E RM 233                       address@hidden  address@hidden -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------


