bug-coreutils

Re: sort


From: Nathan Moore
Subject: Re: sort
Date: Mon, 29 Aug 2005 23:42:57 -0400
User-agent: Mozilla Thunderbird 1.0.6 (X11/20050716)

Bob Proulx wrote:

Nathan Moore wrote:
I guess the best way to put it is: what is the correct behavior when none of the LC_ environment variables are set?

What is the output of 'locale'?

address@hidden:~> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
address@hidden:~>
address@hidden:~> set | grep LC_
address@hidden:~> set | grep LANG
LANG=en_US.UTF-8
address@hidden:~>
This is what I got without actually setting anything myself.

That will display the settings according to the environment
variables.  If none are set then you will get a C/POSIX locale by
default.  But that command will display them individually.

This really isn't mentioned in the documentation (or I wasn't able to find it). My version of coreutils
is 5.2.1, which is the most recent.

Please suggest improvements so that the documentation can be
improved.  The info docs currently say this:

      (1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
   `en_US'), then `sort' may produce output that is sorted differently
   than you're accustomed to.  In that case, set the `LC_ALL' environment
   variable to `C'.  Note that setting only `LC_COLLATE' has two problems.
   First, it is ineffective if `LC_ALL' is also set.  Second, it has
   undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is
   set to an incompatible value.  For example, you get undefined behavior
   if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.
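The first problem named in that text (LC_COLLATE is ineffective when LC_ALL is also set) can be checked directly. This is a minimal sketch, assuming a POSIX shell and a `sort` that honors the locale variables; the function name and the four sample letters are made up for illustration.

```shell
# demo_lc_all_wins is a hypothetical helper name, not part of coreutils.
demo_lc_all_wins() {
  # LC_ALL takes precedence over LC_COLLATE, so the LC_COLLATE value
  # given here is ignored and sort compares bytes as in the C locale
  # (uppercase ASCII letters sort before lowercase ones).
  printf '%s\n' b A a B | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort
}

demo_lc_all_wins

# The reverse case (LC_ALL=en_US.UTF-8 with LC_COLLATE=C) also follows
# LC_ALL; what it prints depends on which locales are installed.
```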

How might that be improved?

Looking at this now I think suggesting to set LC_ALL=C is too strong.
I know why it was done, so that it would override LANG.  But now I
think it should probably just suggest LANG with the warning that
LC_COLLATE overrides LANG and LC_ALL overrides LC_COLLATE.

A pointer in the man and info pages to locale(1) would be nice. One problem I had was that I didn't know what the legal settings for the variables were, and the sort docs didn't point me anywhere. Also, the default behavior should be documented better. Users of utility programs should not be assumed to know glibc locale handling. I'm a programmer and I had never touched that stuff until today (I just never had a need for it; I probably would have noticed if I weren't US/English).

I've never really messed w/ the LC_ environmental variables before
and some of mine were not set (on SuSE 9.2).

You don't need to set all of them.  Just the ones you want.  Don't try
to set them all.  Personally I use this:

 export LANG=en_US.UTF-8
 export LC_COLLATE=C
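A rough sketch of what those two settings do to sort, assuming a POSIX shell and that LC_ALL is not set (it is unset explicitly here, since it would otherwise win). The function name and the sample input are hypothetical.

```shell
# sorted_with_these_settings is a made-up name for illustration.
sorted_with_these_settings() {
  (
    # Work in a subshell so the caller's environment is left alone.
    unset LC_ALL              # LC_ALL would override LC_COLLATE if set
    LANG=en_US.UTF-8          # general default for the other categories
    LC_COLLATE=C              # collation only: plain byte order
    export LANG LC_COLLATE
    sort
  )
}

# Byte order puts uppercase ASCII letters before lowercase ones.
printf '%s\n' b A a B | sorted_with_these_settings
```

Messages, character classification and so on still follow LANG; only the comparison used for ordering changes.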

I've figured it out (export `locale`), but it seems like that is one
of those things that just isn't written down anywhere.

You should not need to do that.  I recommend against it.

Since sending the initial report, I had figured out that "LC_COLLATE=ascii sort" did what I wanted.

Hmm...  I think "ascii" is actually unrecognized and that is causing a
fallback to C/POSIX.  I think other programs will complain when they
can't find that locale data.  So this will actually create other
errors.  Better to set this to C or POSIX instead.
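One way to see whether a name such as "ascii" is a real locale is to look for it in the output of `locale -a`. The helper below is only a sketch: its name is invented, it reads the list of known names on standard input so it can be tried without any particular locales installed, and the case and hyphen folding is an assumption made so that "en_US.UTF-8" matches a listed "en_US.utf8".

```shell
# Fold case and drop hyphens so codeset spellings compare equal.
normalize_locale_name() {
  tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# is_known_locale NAME  (hypothetical helper)
# Reads candidate locale names on stdin, one per line, and succeeds
# if NAME is among them.
is_known_locale() {
  wanted=$(printf '%s\n' "$1" | normalize_locale_name)
  normalize_locale_name | grep -Fxq -- "$wanted"
}

# Typical use on a system that provides locale(1):
#   locale -a | is_known_locale ascii || echo "ascii is not installed"
```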

Well, that is odd. I would have thought that LC_COLLATE being undefined, being set to empty, or being set to something invalid would all have the same effect. But from sort I got ... Just noticed something: I had overlooked the fact that the lines were partially sorted with the defaults, but completely sorted with the correct LC_COLLATE.

OK, I'm attaching operators.tx_ (an edited-down version of a much longer file, operators.txt), which has 1 or 2 columns per line. The first column is an operator for a C-like programming language, and the optional second column is a lex action for that operator. I was trying to use sort as a quick way to make sure I hadn't missed an operator in one of 2 lists, since the sorted output would put duplicates on adjacent lines, which would be easy to spot, leaving the operators without a match needing further attention in either the lex file or the other file. I had just "cat"ed the 2 file segments together (I was not using sort's merge features).

I'm also attaching the output of a couple of runs of sort on this file. The filenames have the environment variables involved encoded into them and should be easy to figure out.
LANG="en_US.UTF-8"  for all runs.
LC_COLLATE="en_US.UTF-8" gave an empty file as output, but if export `locale` is run prior to running the sort (which sets LC_COLLATE and a bunch of other stuff to "en_US.UTF-8"), then the output is the same as if LC_COLLATE were any one of "POSIX", "C", "ascii", or "your_mama". This was actually the behavior that I wanted, but it was not what I got with LANG="en_US.UTF-8" and LC_COLLATE not set.
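The matching task described above can also be done without reading the sorted output by eye. This is a sketch of a variation, not the procedure used in the runs above: it keeps only the first column of each line, sorts in the C locale so the order is plain byte order, and prints operators that occur exactly once. The function name is hypothetical, and it assumes each operator appears at most once per list.

```shell
# unmatched_ops FILE_A FILE_B  (hypothetical helper)
# Prints operators that appear in only one of the two lists.
unmatched_ops() {
  # awk keeps the first whitespace-delimited field of each non-blank
  # line, dropping the optional lex action in the second column.
  awk 'NF { print $1 }' "$1" "$2" |
    LC_ALL=C sort |      # byte order, independent of LANG
    LC_ALL=C uniq -u     # keep lines that are not repeated
}
```

Called as `unmatched_ops list1 list2`, it leaves only the operators that still need attention in one file or the other.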

I'm going to go investigate the locale settings more on my own. Any pointers to places to look for C, shell, and system configuration material related to this would be welcome.

Thanks for your replies, and please tell me what the behavior is
without any LC_ settings.  I'm just curious.

You get C/POSIX sort ordering by default if none of LC_ nor LANG
(don't forget LANG) is set.

So, if the LC_ variables are not set but LANG is, what method of comparison is used?
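Going by the precedence POSIX defines for these variables (LC_ALL first, then the category variable, then LANG, with empty values treated as unset), the locale name that governs collation can be worked out as sketched below. The function name is invented for illustration, and it only reports the name the environment selects; whether that name is valid, and what happens when it is not, is up to the C library.

```shell
# effective_collate  (hypothetical helper)
# Prints the locale name that the environment selects for LC_COLLATE.
effective_collate() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"        # LC_ALL overrides everything
  elif [ -n "${LC_COLLATE:-}" ]; then
    printf '%s\n' "$LC_COLLATE"    # category variable comes next
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"          # LANG is the general default
  else
    printf 'C\n'                   # nothing set: C/POSIX locale
  fi
}

effective_collate
```

So with only LANG=en_US.UTF-8 set, as in the `locale` output earlier in this message, the comparison would follow en_US.UTF-8.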

Note that GNU coreutils does not set any of the locale settings in
your environment.  This was very likely done by your distro.  I
believe that doing this without notifying the user is a distro problem
and not a coreutils problem.  You might need to address this problem
with your distro.

I know that they (coreutils) do not set up the environment. Distro setups should probably have options to delve into these settings a bit more during installs. (Funny aside: I had a Red Hat distro once that didn't come with stat. That should have been illegal.)

Thanks again for all of the time and help y'all have given me.

Nathan


=
,
:
::
?
?=
.
...
.@
(
)
[
]
{
}
&
&=
&&
&                  {ASCIIOP_RETURN(AND);}
.                  {ASCIIOP_RETURN(DOT);}
[                  {ASCIIOP_RETURN(LB);}
{                  {ASCIIOP_RETURN(LC);}
-                  {ASCIIOP_RETURN(MINUS);}
~                  {ASCIIOP_RETURN(NEGATE);}
!                  {ASCIIOP_RETURN(NOT);}
+                  {ASCIIOP_RETURN(PLUS);}
]                  {ASCIIOP_RETURN(RB);}
}                  {ASCIIOP_RETURN(RC);}
*                  {ASCIIOP_RETURN(STAR);}
#                  {NAMED_PPOP_RETURN('#') ;}
##                 {NAMED_PPOP_RETURN(POUNDPOUND);}
,                  {PPOP_RETURN(COMMA);}
(                  {PPOP_RETURN(LP);}
)                  {PPOP_RETURN(RP);}
=
::
:
?=
?
&
&=
&&
.
.@
...
,
[
]
{
}
(
)
(                  {PPOP_RETURN(LP);}
)                  {PPOP_RETURN(RP);}
,                  {PPOP_RETURN(COMMA);}
#                  {NAMED_PPOP_RETURN('#') ;}
##                 {NAMED_PPOP_RETURN(POUNDPOUND);}

{                  {ASCIIOP_RETURN(LC);}
}                  {ASCIIOP_RETURN(RC);}
[                  {ASCIIOP_RETURN(LB);}
]                  {ASCIIOP_RETURN(RB);}
.                  {ASCIIOP_RETURN(DOT);}
&                  {ASCIIOP_RETURN(AND);}
*                  {ASCIIOP_RETURN(STAR);}
+                  {ASCIIOP_RETURN(PLUS);}
-                  {ASCIIOP_RETURN(MINUS);}
~                  {ASCIIOP_RETURN(NEGATE);}
!                  {ASCIIOP_RETURN(NOT);}


!                  {ASCIIOP_RETURN(NOT);}
#                  {NAMED_PPOP_RETURN('#') ;}
##                 {NAMED_PPOP_RETURN(POUNDPOUND);}
&
&                  {ASCIIOP_RETURN(AND);}
&&
&=
(
(                  {PPOP_RETURN(LP);}
)
)                  {PPOP_RETURN(RP);}
*                  {ASCIIOP_RETURN(STAR);}
+                  {ASCIIOP_RETURN(PLUS);}
,
,                  {PPOP_RETURN(COMMA);}
-                  {ASCIIOP_RETURN(MINUS);}
.
.                  {ASCIIOP_RETURN(DOT);}
...
.@
:
::
=
?
?=
[
[                  {ASCIIOP_RETURN(LB);}
]
]                  {ASCIIOP_RETURN(RB);}
{
{                  {ASCIIOP_RETURN(LC);}
}
}                  {ASCIIOP_RETURN(RC);}
~                  {ASCIIOP_RETURN(NEGATE);}
