bug-coreutils

bug#12783: info for sort has an illogical example


From: Pádraig Brady
Subject: bug#12783: info for sort has an illogical example
Date: Tue, 06 Nov 2012 14:44:07 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1

On 11/06/2012 04:20 AM, Kevin O'Gorman wrote:
One amendment...  I'd say setting LANG=C is also unwise, and for the same
reason.  Make that
- It can be unwise to set LC_ALL or LANG to affect sort order because they
may affect many other things as well, such as the language used for error
and help messages.

On Mon, Nov 5, 2012 at 7:31 PM, Kevin O'Gorman <address@hidden> wrote:

Looking at a convenient issue of the POSIX standard, I find
http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html, and in
particular where it mentions a "precedence order".  It seems that the
interaction of environment variables is not unspecified at all.

I'm guessing that something else was actually meant: the writer perhaps
found it hard to describe in a general way the interaction of a collation
locale with the contents of a file to be sorted if it happens that the
contents were created in another locale and are to be interpreted in the
way they were created.  So, applying LC_COLLATE=C to Chinese big-5 could
well produce a peculiar order of things.

This has already been the case for me.  My data is ASCII, comprising
numeric data and type data, both normal and encoded.  All codes are
printable ASCII and are specifically designed to be sorted based on the
contents of bytes.  This did not work well with LANG=en_US.UTF-8, because
for this data 'a' and 'A' are numbers that differ by 26, but the LANG
setting was treating them as nearly equivalent.  It seems that LC_COLLATE=C
is the correct cure.
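
As a minimal sketch of that cure (the sample keys below are made up for
illustration and are not the actual data described above), the setting
can be applied to a single sort invocation.  LC_ALL is emptied in the
same command so that it cannot override LC_COLLATE:

```shell
# Hypothetical sample keys in which upper- and lower-case letters
# are distinct codes that must not be folded together.
keys='b
A
a
B'

# Empty LC_ALL for this one command so it cannot override LC_COLLATE,
# then ask for byte-value collation.
sorted=$(printf '%s\n' "$keys" | LC_ALL= LC_COLLATE=C sort)
printf '%s\n' "$sorted"
```

With byte-value collation, all upper-case keys sort before all lower-case
ones (A, B, a, b), rather than being interleaved.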

If I'm right, it seems that it would be better to rewrite that footnote in
the sort info page something like this:

(1) The collation order used by 'sort' is controlled by environment
variables in accordance with the POSIX specification.  In particular, the
first of the LC_ALL, LC_COLLATE, and LANG variables that is defined to a
non-null value controls the collation order; in their absence your system
has a default.  If the collation order is incompatible with your data, you
are unlikely to get the desired results.  Often, but not always, setting
and exporting LC_COLLATE=C in your environment is the right choice, but if
your data contains natural language text or proper names the right choice
will agree with the encoding used for the data.  Setting LC_ALL can be
unwise because it can affect many other things as well, such as the
language used for error and help messages. See
http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html or any
later version for more information.
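
The precedence described in that proposed footnote can be sketched as a
small shell function.  The name collation_source is made up for this
illustration and is not part of coreutils; it only reports which
variable would win, treating an empty value the same as an unset one:

```shell
# Print the name of the variable that decides the collation order,
# following the precedence LC_ALL, then LC_COLLATE, then LANG.
# An empty value counts as "not defined" here.
collation_source() {
    if [ -n "${LC_ALL:-}" ]; then
        echo LC_ALL
    elif [ -n "${LC_COLLATE:-}" ]; then
        echo LC_COLLATE
    elif [ -n "${LANG:-}" ]; then
        echo LANG
    else
        # None of the three is set: the system default applies.
        echo default
    fi
}

collation_source
```

Running it in a given environment shows at a glance whether an exported
LC_COLLATE would actually take effect or be masked by LC_ALL.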

You may also want to change the warning in the output of "sort --help" to
*** WARNING ***
The locale specified by the environment affects sort order. For correct
operation it must be compatible with your data.
Set LC_COLLATE=C to get the traditional sort order that uses native byte
values.


On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx <address@hidden> wrote:

Kevin O'Gorman wrote:
(reformatted and numbered)
A. In that case, set the `LC_ALL' environment variable to `C'.
B. Note that setting only `LC_COLLATE' has two problems.
B1. First, it is ineffective if `LC_ALL' is also set.
B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
`LC_CTYPE' is unset) is set to an incompatible value.
B2x. For example, you get undefined behavior if `LC_CTYPE' is
`ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.

The example in B2x is illogical since A and B together mean we're setting
LC_COLLATE to C, not some random value like en_US.UTF-8.
I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting, or
anything besides LC_ALL.  I'm writing software that will use sort
extensively in unknown environments, and I'd like to keep all adjustments
as localized as possible.  So far, setting the collating sequence to POSIX
is all that I need; no other locale adjustments.
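
One way to keep the adjustment localized in a script that runs in unknown
environments is to confine it to the sort process itself.  The sketch
below follows point A and uses LC_ALL, since that cannot be overridden by
anything else; the wrapper name posix_sort is hypothetical.  The
trade-off is that sort's own diagnostics will also appear in the POSIX
locale, but nothing else in the calling script is affected:

```shell
# Hypothetical wrapper: force the POSIX locale for sort only.
# The assignment prefix applies to this one child process, so the
# caller's LANG, LC_MESSAGES and friends are left untouched.
posix_sort() {
    LC_ALL=C sort "$@"
}

# Usage: options and file names pass straight through to sort.
printf '%s\n' b A a B A | posix_sort -u
```

The usage line prints A, B, a, b, one per line, regardless of the locale
the script was started in.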

I also agree that the above is needlessly disjoint.  It doesn't flow.

Would you be able to suggest an improvement to the wording that would
make it better than the current prose?  Of course a submission as a
patch would be great.  Using git patch submissions is the preferred
format.  But just saying what you think it should say would also be
appreciated.

Thanks for cleaning that up, Kevin.
Your description is clearer.
I'm not sure we can drop the warning about LC_CTYPE though.

While LC_CTYPE is _not_ significant to sort order on Solaris or GNU/Linux...

$ for e in LANG LC_ALL LC_COLLATE LC_CTYPE; do
    printf "%s\n" B a | echo $(env -i $e=en_US sort)
done
a B
a B
a B
B a

... it may be significant when specifying (multibyte) characters
to skip etc. and thus impacts the sort order in that way.
This is either with common downstream i18n patches or future
multibyte handling in upstream sort.

Unfortunately LC_CTYPE is wound up with LC_MESSAGES too (since glibc-2.3.3):
http://www.gnu.org/software/libc/manual/html_node/Charset-conversion-in-gettext.html

thanks,
Pádraig.




