bug-coreutils

Re: SORT bug: multiple field sort gives unexpected sort order


From: Pádraig Brady
Subject: Re: SORT bug: multiple field sort gives unexpected sort order
Date: Wed, 17 Mar 2010 22:00:01 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 17/03/10 11:46, Pádraig Brady wrote:
> On 16/03/10 11:50, Pádraig Brady wrote:
>> On 15/03/10 15:56, Denzen, van Carl wrote:
>>> On Ubuntu v9.03 with sort version 6.10, when I do a quite simple sort, I
>>> get an unexpected result.
>>> Fields are variable length, separator is a comma. I want to sort on two
>>> (Dutch style, dd-mm-yyyy) date fields. Here is the output:
>>> address@hidden carl]$ cat test-input.txt
>>> 23-01-1999,25-04-2008
>>> ,24-04-2008
>>> 23-01-1993,23-04-2008
>>> ,01-02-1999
>>> ,12-03-1998
>>> 23-01-1991,21-04-2008
>>> address@hidden carl]$ sort <test-input.txt --field-separator=, \
>>>   --key=1.7,1.10 --key=1.4,1.5 --key=1.1,1.2 \
>>>   --key=2.7,2.10 --key=2.4,2.5 --key=2.1,2.2
>>
>> Here is the output from running with the soon-to-be-released --debug option.
>> That shows that the first field is considered to be the delimiter plus the
>> second field in the case where the first field is empty.  That is at least
>> confusing and may be a bug.  I'll look at it this evening sometime.
> 
> I just tried it on Solaris and it behaves the same as coreutils.
> So it might be some corner case of the POSIX spec (which I couldn't see on a
> very quick scan).

It seems that the end position of a key can extend into the next field.
So that's going to cause problems when your fields vary in width
and you specify particular end character positions.

$ printf "one,two\n" | sort -s -t, --debug -k1,1.5
one,two
_____
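
The same effect can be observed without --debug. The following is a
minimal sketch, assuming a sort that follows the traditional behavior
discussed in this thread (as GNU sort does); the two-line input is made
up for illustration. Both lines have the same first field, so with -s a
key limited to field 1 leaves them in input order, whereas a key end of
1.3 pulls the delimiter and the first character of field 2 into the key.

```shell
# Hypothetical two-line input: identical first field, differing second field.
input='a,2
a,1'

# Key is exactly field 1 ("a" on both lines); -s keeps the input order.
whole_field=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t, -k1,1)

# Key end 1.3 runs past the one-character field 1, so the keys become
# "a,2" and "a,1" and the comparison is decided by the second field.
spanning=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t, -k1,1.3)

printf '%s\n--\n%s\n' "$whole_field" "$spanning"
```

If the second result differs from the first, the key end has crossed
the field boundary.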

POSIX says nothing about this situation, so sort is
following traditional sort behavior as documented
in this hunk from the source code to coreutils sort:

#ifdef POSIX_UNSPECIFIED
  /* The following block of code makes GNU sort incompatible with
     standard Unix sort, so it's ifdef'd out for now.
     The POSIX spec isn't clear on how to interpret this.
     FIXME: request clarification.

     From: address@hidden (Karl Heuer)
     Date: Thu, 30 May 96 12:20:41 -0400
     [Translated to POSIX 1003.1-2001 terminology by Paul Eggert.]

     [...]I believe I've found another bug in `sort'.

     $ cat /tmp/sort.in
     a b c 2 d
     pq rs 1 t
     $ textutils-1.15/src/sort -k1.7,1.7 </tmp/sort.in
     a b c 2 d
     pq rs 1 t
     $ /bin/sort -k1.7,1.7 </tmp/sort.in
     pq rs 1 t
     a b c 2 d

     Unix sort produced the answer I expected: sort on the single character
     in column 7.  GNU sort produced different results, because it disagrees
     on the interpretation of the key-end spec "M.N".  Unix sort reads this
     as "skip M-1 fields, then N-1 characters"; but GNU sort wants it to mean
     "skip M-1 fields, then either N-1 characters or the rest of the current
     field, whichever comes first".  This extra clause applies only to
     key-ends, not key-starts.
     */

  /* Make LIM point to the end of (one byte past) the current field.  */
  if (tab != TAB_DEFAULT)
    {
      char *newlim;
      newlim = memchr (ptr, tab, lim - ptr);
      if (newlim)
        lim = newlim;
    }
  else
    {
      char *newlim;
      newlim = ptr;
      while (newlim < lim && blanks[to_uchar (*newlim)])
        ++newlim;
      while (newlim < lim && !blanks[to_uchar (*newlim)])
        ++newlim;
      lim = newlim;
    }
#endif
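
To check which interpretation a given sort uses, Karl Heuer's test case
from the comment above can be replayed. This is only a sketch: it
assumes a POSIX shell, and LC_ALL=C is added here to keep the comparison
byte-wise. With the block above compiled out, sort should take the key
as the single character in column 7, so the "pq rs 1 t" line is expected
to come first, as with the Unix sort in the quoted report.

```shell
# Karl Heuer's two-line test input, as quoted in the source comment.
result=$(printf 'a b c 2 d\npq rs 1 t\n' | LC_ALL=C sort -k1.7,1.7)

# Print the order chosen by the sort under test.
printf '%s\n' "$result"
```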

Personally I think we should enable the code.
At least I'll produce a warning in --debug mode to
say that a field end is spanning into the next field.
For now you'll need to transform your input with something like:

sed 's/^,/00-00-0000,/' |
sort ... |
sed 's/^00-00-0000,/,/'
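
Spelled out for the sample input from the original report, the
workaround could look something like the following. It is a sketch under
a few assumptions: the keys are the ones from the report (written with
the short -t and -k options), 00-00-0000 is just a placeholder that must
not occur as a real date, LC_ALL=C is added here to keep the comparison
byte-wise, and sort_dates is a hypothetical helper name.

```shell
# sort_dates: hypothetical wrapper around the sed | sort | sed workaround.
# Reads comma-separated dd-mm-yyyy pairs on stdin and writes them sorted by
# year, month, day of field 1, then year, month, day of field 2.
sort_dates() {
  sed 's/^,/00-00-0000,/' |
  LC_ALL=C sort -t, \
    -k1.7,1.10 -k1.4,1.5 -k1.1,1.2 \
    -k2.7,2.10 -k2.4,2.5 -k2.1,2.2 |
  sed 's/^00-00-0000,/,/'
}

# Sample input from the original report.
sorted=$(printf '%s\n' \
  '23-01-1999,25-04-2008' \
  ',24-04-2008' \
  '23-01-1993,23-04-2008' \
  ',01-02-1999' \
  ',12-03-1998' \
  '23-01-1991,21-04-2008' | sort_dates)

printf '%s\n' "$sorted"
```

Lines with an empty first date sort first, because the placeholder year
0000 compares lower than any real year; among themselves they are
ordered by the second date.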

cheers,
Pádraig.



