bug-coreutils

From: Assaf Gordon
Subject: bug#42340: Fwd: bug#42340: "join" reports that "sort"ed input is not sorted
Date: Wed, 15 Jul 2020 18:38:25 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

Hello,

On 2020-07-15 2:12 p.m., Beth Andres-Beck wrote:
> If that is the intended behavior, the bug is that:
>    printf '12,\n1,\n' | sort -t, -k1 -s
>    1,
>    12,
>
> does _not_ take the remainder of the line into account, and only sorts on
> the initial field, prioritizing length.
>
> It is at the very least unexpected that adding an `a` to the end of both
> lines would change the sort order of those lines:
>    printf '12,a\n1,a\n' | sort -t, -k1 -s
>    12,a
>    1,a


Not a bug, just an incomplete usage :)

sort's -k/--key parameter takes two values (the second being optional):
the first and last field to use as the key. If the second value is omitted
(as in your case), then the key runs from the start of the given field
to the end of the line.

And so:
"sort -k1,1" means take the first *and only the first* field as the key.
"sort -k1" means take the first field until the end of the line as the key.
"sort -k1,3" means take the first, second and third fields as the single key.
"sort -k1,1 -k2,2 -k3,3" means take the first field as the first key,
second field as the second key, and third field as the third key.
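
As a small illustration of the key forms listed above, here is a sketch with made-up input lines (the data is hypothetical; "LC_ALL=C" is used only so the result does not depend on the local collation rules):

```shell
# Hypothetical three-line input, comma-separated.
input='b,y
a,z
a,x'

# Field 1 is the first key, field 2 breaks ties within equal first fields.
by_both=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1 -k2,2)

# Field 2 is the only key; field 1 plays no part in the key comparison.
by_second=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k2,2)

printf '%s\n' "$by_both"     # a,x  a,z  b,y  (one per line)
printf '%s\n' "$by_second"   # a,x  b,y  a,z  (one per line)
```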

---

The "--debug" option can help illustrate what sort is doing:
it prints a row of underscore characters beneath each line to mark which characters are used as the key. Consider the following:

   $ printf '12,\n1,\n' | sort -t, -k1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,
   __
   12,
   ___

   $ printf '12,\n1,\n' | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,
   _
   12,
   __

In the first example, "-k1" means from the first field until the end of the line,
so the underscores extend under the "," character as well.
In the second example, "-k1,1" means only the first field, and the comma is not part of the key.

Now consider your second case of adding an "a" at the end of each line:

   $ printf '12,a\n1,a\n' | sort -t, -k1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   12,a
   ____
   1,a
   ___

   $ printf '12,a\n1,a\n' | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   12,a
   __

In the first example, "-k1" means: from first field until the end of the line, and so the entire string "12,a" is compared against "1,a".

**AND**, because the locale is "en_CA.utf8" rather than "C", the locale's collation rules apply, and they ignore punctuation characters on the first comparison pass (as mentioned in the previous email in this thread).
So effectively the compared strings are "12a" vs "1a".
Digits sort before letters in this locale (just as the ASCII value of "2" is
smaller than the ASCII value of "a"), and therefore "12a" appears before "1a".

If we force C locale, then the order is reversed:

   $ printf '12,a\n1,a\n' | LC_ALL=C sort -t, -k1 -s --debug
   sort: using simple byte comparison
   1,a
   ___
   12,a
   ____

Because now punctuation characters are used, and the ASCII value of ","
is smaller than the ASCII value of "2".
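
The byte values behind that comparison can be checked directly. A quick sketch using "od" (the variable name is just for illustration):

```shell
# Decimal byte values of ',', '2' and 'a', in that order.
bytes=$(printf ',2a' | od -An -tu1)

# Unquoted on purpose: word splitting collapses od's column padding.
echo $bytes   # 44 50 97
```

With plain byte comparison, "1,a" and "12,a" first differ at the second character, "," (44) against "2" (50), so "1,a" sorts first.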

**HOWEVER**, this result of using "LC_ALL=C" together with "-k1" is
only correct by a happy accident :)
It is still very likely that "-k1" is not what you wanted - you probably meant to use "-k1,1".
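
To see why this is only an accident, here is a sketch with a contrived input in which the first field ends in "+", a character whose byte value (43) is smaller than that of "," (44). The input is hypothetical and chosen only to expose the difference:

```shell
# Contrived input: first fields are "1" and "1+".
input='1,a
1+,a'

# Key runs to end of line: "1+,a" vs "1,a" differ at '+' (43) vs ',' (44).
whole=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1)

# Key is the first field only: "1" is a prefix of "1+", so it sorts first.
first=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1)

printf '%s\n' "$whole"   # 1+,a then 1,a
printf '%s\n' "$first"   # 1,a then 1+,a
```

With "-k1" the field separator itself takes part in the comparison, so the order depends on how the next character happens to compare with ","; with "-k1,1" only the field contents matter.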

---

Lastly, the "-s/--stable" option in the above contrived examples is superfluous - it doesn't affect the output order because there are no
equal field values (i.e. "1" vs "12").
A slightly better example to illustrate how "-s" affects ordering is this:

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1
   1,a
   2,b
   2,x

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s
   1,a
   2,x
   2,b

Here, "1" comes before "2" - that's obvious. But should "2,b" come before "2,x"?
If we do not use "-s/--stable", then "sort" ALSO does one additional comparison of the entire line as a last step, whenever the keys compare equal (hence "sort --help" says
"[disable] last-resort comparison" about "-s/--stable").
The substring ",b" comes before ",x" - therefore "2,b" appears first.

If we add "-s/--stable", the last comparison step of the entire line is skipped, and the lines whose key is "2" appear in the order they had in the input (hence - "stable").

By using "--debug" we can see the additional comparison step (indicated by additional underscore lines):

   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   ___
   2,b
   _
   ___
   2,x
   _
   ___


   $ printf "2,x\n1,a\n2,b\n" | sort -t, -k1,1 -s --debug
   sort: using ‘en_CA.utf8’ sorting rules
   1,a
   _
   2,x
   _
   2,b
   _
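
One way to picture the last-resort comparison is as an implicit extra key covering the whole line. The sketch below (in the C locale, to keep the result independent of collation rules) spells that key out explicitly as "-k1" and disables the built-in last resort with "-s"; treat the equivalence as an illustration for this input, not as a description of sort's internals:

```shell
input='2,x
1,a
2,b'

# Default: ties on the first field fall back to a whole-line comparison.
default=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1)

# Same tie-break written out: second key "-k1" is the whole line, "-s" turns off the implicit one.
explicit=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1 -k1 -s)

# Stable only: ties keep their input order.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1 -s)

printf '%s\n' "$default"    # 1,a  2,b  2,x  (one per line)
printf '%s\n' "$explicit"   # 1,a  2,b  2,x  (one per line)
printf '%s\n' "$stable"     # 1,a  2,x  2,b  (one per line)
```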

---

Hope this helps.
regards,
 - assaf
