emacs-bug-tracker
From: GNU bug Tracking System
Subject: [Emacs-bug-tracker] bug#7878: closed ("sort" bug--inconsistent single-column sorting influenced by other columns?)
Date: Fri, 21 Jan 2011 09:38:01 +0000

Your message dated Fri, 21 Jan 2011 02:45:02 -0700
with message-id <address@hidden>
and subject line Re: bug#7878: "sort" bug--inconsistent single-column sorting 
influenced by other columns?
has caused the GNU bug report #7878,
regarding "sort" bug--inconsistent single-column sorting influenced by other 
columns?
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
7878: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=7878
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Thu, 20 Jan 2011 18:40:01 -0800

“sort” does inconsistent sorting.

I’m pretty sure it has NOTHING to do with the following warning, although I could be totally wrong.

  *** WARNING ***
  The locale specified by the environment affects sort order.
  Set LC_ALL=C to get the traditional sort order that uses
  native byte values.

See the attached shell script and text files.

bash-3.2$ cat test1.txt
323|1
36|2
406|3
40|4
587|5

$ cat test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Note that the first column is the same for both files.

$ sort test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

The rows are in a different order depending on the dataset, and it is NOT a numeric sort. I'm not even sure it is ANY type of sort.

$ sort -k1 test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -k1 test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Trying to fix the problem by focusing on the first column doesn't work.

$ sort -t "|" test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -t "|" test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

$ sort -t '|' test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -t '|' test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

$ sort -k1 -t "|" test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -k1 -t "|" test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

$ sort -k1 -t '|' test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -k1 -t '|' test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Trying to fix the problem by including delimiter information doesn't work.

$ sort -k1d test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -k1d test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

$ sort -s test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -s test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

$ sort -s -k1 test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort -s -k1 test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Neither does dictionary order or stable sorting.

$ sort -g test1.txt
36|2
40|4
323|1
406|3
587|5

$ sort -g test7.txt
36|C2
40|B4
323|B1
406|B3
587|C5

$ sort -n test1.txt
36|2
40|4
323|1
406|3
587|5

$ sort -n test7.txt
36|C2
40|B4
323|B1
406|B3
587|C5

Using numeric or general sorting appears to fix the problem on this numeric example. But why did it sort inconsistently in the first place based on the other contents of the file rather than just focusing on the first column--even when I told it to?

$ sort test1.txt | join -a1 -a2 -t "\|" - test7.txt
323|1|B1
36|2|C2
40|4
406|3|B3
40|B4
587|5|C5

Inconsistent sorting when combined with 'join' provides incorrect matches and duplication of records. This is a mess.

$ sort test1.txt | sort -c
$ sort test7.txt | sort -c

Yet, sort -c says that it is sorted correctly.

$ sort test1.txt
323|1
36|2
40|4
406|3
587|5

$ sort test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

$ sort test1.txt | join -a1 -a2 -j1 -t "\|" -e "0" -o "1.1,1.2,2.2" - test7.txt

See COMMENTED Cygwin output.

# $ sort test1.txt
# 323|1
# 36|2
# 406|3
# 40|4
# 587|5

# $ sort test7.txt
# 323|B1
# 36|C2
# 406|B3
# 40|B4
# 587|C5

# $ sort test1.txt | join -a1 -a2 -j1 -t "|" -e "0" -o "1.1,1.2,2.2" - test7.txt
# |B1|1
# |C22
# |B3|3
# |B44
# |C5|5

And finally, Cygwin does this sort consistently across all three examples (but it does mess up the 'join'). ????? Sucks to be me with a defective Cygwin and an unreliable sort and work to get done. Any advice?

randall lewis
research scientist
 
address@hidden
mobile 617-671-8294
 
4401 great america parkway, santa clara, ca, 95054, us

Attachment: SortBug.sh
Description: SortBug.sh

Attachment: test7.txt
Description: test7.txt

Attachment: test1.txt
Description: test1.txt


--- End Message ---
--- Begin Message ---
Subject: Re: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Fri, 21 Jan 2011 02:45:02 -0700
User-agent: Mutt/1.5.20 (2009-06-14)

Hi Randall,

Randall Lewis wrote:
> Wow! So, a couple comments about how I seem to have figured out
> every wrong way to use "sort" when also using "join."

You did have an impressive number of cases examined!

> Who would've thought that 
>
> sort -k1 test1.txt
> 
> would default to sort on the entire line? (I normally would've
> thought that [,POS2] means "optional if you want to have it keep
> going beyond the first field.")

You are not the only one to have had that misconception.  But that is
the way that it has always worked.  Here is the GNU sort documentation.

  `-k POS1[,POS2]'
  `--key=POS1[,POS2]'
       Specify a sort field that consists of the part of the line between
       POS1 and POS2 (or the end of the line, if POS2 is omitted),
       _inclusive_.
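
To make the POS2 rule concrete, here is a minimal sketch.  It assumes
GNU sort, and the three-line input is made up for illustration (it is
not from the report).  The -s option is used so that ties are not
broken by a final whole-line comparison, which would otherwise hide
the difference.

```shell
# Hypothetical sample data: two lines share the first field "a".
data='b|2
a|9
a|1'

# -k1: the key runs from field 1 to the end of the line,
# so "a|1" is ordered before "a|9".
whole=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t '|' -k1)

# -k1,1: the key is field 1 only; with -s, lines whose keys tie
# keep their input order, so "a|9" stays ahead of "a|1".
first=$(printf '%s\n' "$data" | LC_ALL=C sort -s -t '|' -k1,1)

printf '%s\n' '-k1:' "$whole" '-k1,1:' "$first"
```

The two results differ only in the order of the "a" lines, which is
the effect of the omitted POS2.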

This behavior goes back at least to Unix v7 days and very likely well
before that time.  If you were a programmer in the mid-1970s writing a
sorting program, and you made a simple decision about how to control
sorting using command line arguments, would you have had any idea that
in 2011, nearly forty years later, we would still be using virtually
the same program and interface?  And you would have been working on
the problem for what amounts to the first time, on a new operating
system.  Having done interface design myself, with less success, I
can't complain.  :-)  Some of the decisions were less than great.
Other decisions were excellent and visionary.  On average they were
better than most of us can do on our best days.

> Also, who would've thought that the default "sort" would be
> incompatible with "join" and that you would need to write the
> command like this every time you wanted to use "join"?

When sort and join were written they were compatible.  Back then the
collation sequence was strictly byte ordering.  That is the standard C
locale ordering.

It wasn't until locales such as en_US were introduced that the
problems began.  For reasons unfathomable to me the powers that be
made the sort ordering in those locales a dictionary ordering, where
case is folded and punctuation is ignored.  They failed to see how
this would negatively impact almost everything.  Creeping features.
Because punctuation is ignored in the en_US locale it causes a lot of
problems.  You didn't have to say LC_ALL=C for the first thirty years.
Don't get me started.  I have been a rather outspoken critic of this
design decision.
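
As a sketch of keeping sort and join in agreement, the following
recreates the two sample files from the report in a scratch directory
and runs both tools with byte ordering.  It assumes a POSIX shell with
mktemp available; the file names a.sorted and b.sorted are
placeholders chosen for this example.

```shell
# Recreate the two sample files from the report in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$dir/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$dir/test7.txt"

# Sort both inputs on the join field only, using byte ordering.
LC_ALL=C sort -t '|' -k1,1 "$dir/test1.txt" > "$dir/a.sorted"
LC_ALL=C sort -t '|' -k1,1 "$dir/test7.txt" > "$dir/b.sorted"

# Run join under the same locale so it expects the same ordering.
joined=$(LC_ALL=C join -t '|' "$dir/a.sorted" "$dir/b.sorted")
printf '%s\n' "$joined"

rm -rf "$dir"
```

With both inputs ordered the same way, every key pairs up once, and
the "40" line appears a single time with both of its values.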

Personally I have the following set in my shell environment.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

I want the traditional collation sequence and so set LC_COLLATE.  But
I also want the fancy new characters with umlauts and that requires
(along with a unicode charset) a UTF-8 capable locale.  The above is a
compromise but for me a good one.
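
A quick way to check that a collation setting is in effect is to sort
a few mixed-case lines, as in this sketch.  The four-letter input is
made up for illustration.  LC_ALL takes precedence over LC_COLLATE, so
it is unset inside the subshell first.

```shell
# Hypothetical mixed-case input that makes the collation visible.
# LC_ALL overrides LC_COLLATE, so it is unset inside the subshell.
sorted=$(unset LC_ALL; printf '%s\n' b A a B | LC_COLLATE=C sort)

# With byte ordering, all uppercase letters come before lowercase ones.
printf '%s\n' "$sorted"
```

If the letters come out interleaved by case instead, something else in
the environment is overriding the collation setting.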

> LC_ALL=C sort test1.txt
> 
> Or that you would need a special type of "pre-sort" on the column
> (which I was executing wrong)?
> 
> sort -k1,1 -t "|" test1.txt

Since you had two fields you probably want to sort on the second field too.

  sort -k1,1 -k2,2 -t "|" test1.txt

That will sort on the first field and then the second field.
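
Each key can also carry its own ordering option.  The sketch below is
an addition to the command above, not a replacement for it: it appends
"n" to the first key so that field is compared as a number, using the
test7.txt data from the report.  Note that numeric order is not the
order join expects, so this is for display rather than as input to
join.

```shell
# The test7.txt data from the report.
data='323|B1
36|C2
406|B3
40|B4
587|C5'

# -k1,1n: compare field 1 as a number.
# -k2,2: break ties on field 2 as plain text.
numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -t '|' -k1,1n -k2,2)
printf '%s\n' "$numeric"
```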

> Regardless, here is "locale" (for the record, I'm pretty new to the
> utilities--and love them. I'm not a computer scientist, but rather
> an economist trying to fit in at Yahoo! with the engineers and
> computer scientists). I'm sure there's a good reason why there are
> two, and it's pretty clear that I'm novice enough that I'll have to
> learn that later.

I didn't follow where the "two" was attached.  Two as in economists
and computer scientists?  Or two as in engineers and computer
scientists?  Full disclosure: I am an electrical engineer. :-)

> Thanks, Bob, for sharing two separate ways that I could get the
> answer the way I need it--two ways I could not have come up with on
> my own.

Just to nudge in a particular direction, there are two other mailing
lists that are good to know about.  The address@hidden mailing list
is for general discussion of the coreutils.  Here on bug-coreutils is
where bug reports are collected, and every message thread opens a bug
ticket in the bug tracking system.  That is great for bug reports but
not so good for general discussion, since it keeps opening bugs that
need to be triaged.  That is why we have the coreutils mailing list,
which is just a normal list for normal discussion.  Additionally
there is a general help list, address@hidden, that is also a good
resource.

> P.S. So, the reason why sorting on the column didn't work for me was
> because it was plucking out the delimiter and then doing a string
> sort? 

Correct.

> Then it was string sorting, putting numbers before letters (as
> you might expect it to)?

It would look like this to sort:

  $ sed 's/[[:punct:]]//' test1.txt 
  3231
  362
  4063
  404
  5875

  $ sed 's/[[:punct:]]//' test1.txt | LC_ALL=C sort
  3231
  362
  404
  4063
  5875

> 323|1
> 36|2
> 406|3
> 40|7 <-- Changed from 4 to 7 changed the sort order.
> 587|5

  $ sed 's/[[:punct:]]//' test1.txt | LC_ALL=C sort
  3231
  362
  4063
  407
  5875

And case is folded too.  But that didn't come into play here.  And
this affects everything that sorts everywhere on the system.
Including the shell.

  echo *
  for f in *; do ...
  ls
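
The effect on file name ordering can be seen with a small sketch.  The
three file names are made up for illustration, and the example assumes
a shell that orders glob results by the current locale, as bash does.

```shell
# Scratch directory with hypothetical mixed-case file names.
dir=$(mktemp -d)
touch "$dir/a" "$dir/B" "$dir/c"

# Glob expansion inside a subshell that has switched to the C locale.
listing=$(cd "$dir" && LC_ALL=C && export LC_ALL && echo *)

# ls run with the C locale in its environment.
names=$(LC_ALL=C ls "$dir")

printf '%s\n' "$listing" "$names"
rm -rf "$dir"
```

In both cases the uppercase name is listed first, which is the
traditional byte ordering.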

Bob


--- End Message ---
