bug-coreutils

bug#12783: info for sort has an illogical example


From: Kevin O'Gorman
Subject: bug#12783: info for sort has an illogical example
Date: Tue, 6 Nov 2012 18:59:30 -0800

On Tue, Nov 6, 2012 at 6:44 AM, Pádraig Brady <address@hidden> wrote:

> On 11/06/2012 04:20 AM, Kevin O'Gorman wrote:
>
>> One amendment...  I'd say setting LANG=C is also unwise, and for the same
>> reason.  Make that:
>> - It can be unwise to set LC_ALL or LANG to affect sort order because they
>> may affect many other things as well, such as the language used for error
>> and help messages.
>>
>> On Mon, Nov 5, 2012 at 7:31 PM, Kevin O'Gorman <address@hidden>
>> wrote:
>>
>>> Looking at a convenient issue of the POSIX standard, I find
>>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html,
>>> and in
>>> particular where it mentions a "precedence order".  It seems that the
>>> interaction of environment variables is not unspecified at all.
>>>
>>> I'm guessing that something else was actually meant: the writer perhaps
>>> found it hard to describe in a general way the interaction of a collation
>>> locale with the contents of a file to be sorted if it happens that the
>>> contents were created in another locale and are to be interpreted in the
>>> way they were created.  So, applying LC_COLLATE=C to Chinese big-5 could
>>> well produce a peculiar order of things.
>>>
>>> This has already been the case for me.  My data is ASCII, comprising
>>> numeric data and type data, both normal and encoded.  All codes are
>>> printable ASCII and are specifically designed to be sorted based on the
>>> contents of bytes.  This did not work well with LANG=en_US.UTF-8, because
>>> for this data 'a' and 'A' are numbers that differ by 26, but the LANG
>>> setting was treating them as nearly equivalent.  It seems that
>>> LC_COLLATE=C is the correct cure.
>>>
>>> If I'm right, it seems that it would be better to rewrite that footnote
>>> in the sort info page something like this:
>>>
>>> (1) The collation order used by 'sort' is controlled by environment
>>> variables in accordance with the POSIX specification.  In particular, the
>>> first of the LC_ALL, LC_COLLATE, and LANG variables that is defined to a
>>> non-null value controls the collation order; in their absence your system
>>> has a default.  If the collation order is incompatible with your data,
>>> you are unlikely to get the desired results.  Often, but not always,
>>> setting and exporting LC_COLLATE=C in your environment is the right
>>> choice, but if your data contains natural language text or proper names
>>> the right choice will agree with the encoding used for the data.  Setting
>>> LC_ALL can be unwise because it can affect many other things as well,
>>> such as the language used for error and help messages.  See
>>> http://pubs.opengroup.org/onlinepubs/007908799/xbd/envvar.html or any
>>> later version for more information.
>>>
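
A minimal sketch of the precedence described in the proposed footnote,
assuming a POSIX shell and a 'sort' that follows the POSIX locale rules.
The sample input and the variable names (input, by_collate, by_all) are
made up for illustration.  The en_US.UTF-8 locale need not be installed
for this to behave the same way, since the controlling variable is set
to C in both cases.

```shell
# Mixed-case sample data; in byte order, uppercase sorts before lowercase.
input=$(printf '%s\n' b A a B)

# With LC_ALL unset, LC_COLLATE=C controls collation, whatever LANG says.
by_collate=$(printf '%s\n' "$input" |
    (unset LC_ALL; LANG=en_US.UTF-8 LC_COLLATE=C sort))

# LC_ALL takes precedence over LC_COLLATE, so LC_ALL=C wins here.
by_all=$(printf '%s\n' "$input" | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)

# Both results should be in byte order: A, B, a, b.
printf '%s\n' "$by_collate" "$by_all"
```

Swapping the two values in the second command (LC_ALL=en_US.UTF-8 with
LC_COLLATE=C) is the case where setting only LC_COLLATE has no effect.
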
>>> You may also want to change the warning in the output of "sort --help" to
>>> *** WARNING ***
>>> The locale specified by the environment affects sort order. For correct
>>> operation it must be compatible with your data.
>>> Set LC_COLLATE=C to get the traditional sort order that uses native byte
>>> values.
>>>
>>>
>>> On Fri, Nov 2, 2012 at 11:07 AM, Bob Proulx <address@hidden> wrote:
>>>
>>>> Kevin O'Gorman wrote:
>>>>
>>>>> (reformatted and numbered)
>>>>> A. In that case, set the `LC_ALL' environment variable to `C'.
>>>>> B. Note that setting only `LC_COLLATE' has two problems.
>>>>> B1. First, it is ineffective if `LC_ALL' is also set.
>>>>> B2. Second, it has undefined behavior if `LC_CTYPE' (or `LANG', if
>>>>> `LC_CTYPE' is unset) is set to an incompatible value.
>>>>> B2x. For example, you get undefined behavior if `LC_CTYPE' is
>>>>> `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.
>>>>>
>>>>> The example in B2x is illogical since A and B together mean we're
>>>>> setting LC_COLLATE to C, not some random value like en_US.UTF-8.
>>>>> I want to know if LC_COLLATE=C can be messed up by an LC_CTYPE setting,
>>>>> or anything besides LC_ALL.  I'm writing software that will use sort
>>>>> extensively in unknown environments, and I'd like to keep all
>>>>> adjustments as localized as possible.  So far, setting the collating
>>>>> sequence to POSIX is all that I need; no other locale adjustments.
>>>>>
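
One way to keep the adjustment local to a single 'sort' call can be
sketched as follows.  This is only an illustration: sort_bytes is a
hypothetical helper name, and the sketch assumes a POSIX shell.  It
leaves LC_CTYPE alone, so it does not address the multibyte caveat
raised later in this thread.

```shell
# Hypothetical helper: sort by byte value regardless of the caller's
# locale settings.  The subshell confines the changes to this one call.
sort_bytes() {
    (
        unset LC_ALL        # a set LC_ALL would override LC_COLLATE
        LC_COLLATE=C
        export LC_COLLATE
        sort "$@"
    )
}

# Usage: reads standard input (or file arguments) like plain sort.
printf '%s\n' b A a B | sort_bytes
```

Because only LC_COLLATE changes, the language of sort's error and help
messages still follows the caller's environment.
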
>>>>
>>>> I also agree that the above is needlessly disjoint.  It doesn't flow.
>>>>
>>>> Would you be able to suggest an improvement to the wording that would
>>>> make it better than the current prose?  Of course a submission as a
>>>> patch would be great.  Using git patch submissions is the preferred
>>>> format.  But just saying what you think it should say would also be
>>>> appreciated.
>>>>
>>>
> Thanks for cleaning that up Kevin.
> Your description is clearer.
> I'm not sure we can drop the warning about LC_CTYPE though.
>
> While LC_CTYPE is _not_ significant to sort order on Solaris or GNU/Linux...
>
> $ for e in LANG LC_ALL LC_COLLATE LC_CTYPE; do
>     # Sort with only $e set; echo joins the sorted lines onto one line.
>     echo $(printf "%s\n" B a | env -i "$e=en_US" sort)
> done
> a B
> a B
> a B
> B a
>
> ... it may be significant when specifying (multibyte) characters
> to skip, etc., and thus affect the sort order in that way.
> This applies either with the common downstream i18n patches or with
> future multibyte handling in upstream sort.
>
> Unfortunately LC_CTYPE is wound up with LC_MESSAGES too (since
> glibc-2.3.3):
> http://www.gnu.org/software/libc/manual/html_node/Charset-conversion-in-gettext.html
>
> thanks,
> Pádraig.
>

What I wrote is only a suggestion.  As I'm far from expert in these
matters, I'll leave the final form to you all.  Thanks for your work on
coreutils, and your attention to this matter.  I think my work is done here.

-- 
Kevin O'Gorman

programmer, n. an organism that transmutes caffeine into software.

