From: | Eric Blake |
Subject: | bug#40226: sort: expected sort order when -c in use |
Date: | Wed, 25 Mar 2020 16:35:47 -0500 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0 |
On 3/25/20 3:02 PM, Richard Ipsum wrote:
> On Wed, Mar 25, 2020 at 01:17:19PM -0500, Eric Blake wrote:
>> On 3/25/20 12:37 PM, Richard Ipsum wrote:
> [snip]
>> See the difference? In the first case, sort is doing its default case-insensitive comparison of the entire line (because you passed -f but not -k), AND a stability comparison of the byte values of the entire line (as shown by the two ____ lines per input). But in the second case, when you add -s, the stability comparison is omitted. The two lines are indeed different when the stability comparison is performed, explaining why -c choked when -s is absent.
>>
>> Or put another way, -f affects only -k, including the implied -k1 when you don't specify anything, and not -s.
>>
>> So now that we know that, let's return to your example:
>
> I'm trying to understand this relative to POSIX, which makes no mention of stability as far as I can see (and there is no -s in POSIX). POSIX says that -f should override the default ordering rules. I don't understand why the last-resort comparison is required when -c is in use, since we're not sorting with -c, just checking if the input is already sorted?
POSIX states [sort description]: "If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale."
As I understand it, this is true even when -f modifies the collating sequence to compare all lowercase characters as their uppercase equivalent.
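To make the two-stage comparison concrete, here is a minimal Python sketch. It is a model under stated assumptions, not GNU sort's code: `fold_key`, `compare`, and `is_sorted` are hypothetical names, -f is approximated with `str.upper()`, the whole line is the key (the implied -k1), and the last-resort step compares raw UTF-8 bytes as the POSIX locale would.

```python
def fold_key(line: str) -> str:
    # Assumption: model -f by mapping lowercase letters to uppercase.
    return line.upper()


def compare(a: str, b: str, stable: bool = False) -> int:
    """Return <0, 0, or >0, like a two-stage sort comparison."""
    ka, kb = fold_key(a), fold_key(b)
    if ka != kb:
        return -1 if ka < kb else 1
    if stable:
        # With -s the last-resort comparison is skipped entirely.
        return 0
    # Last resort: byte-by-byte comparison of the unmodified lines.
    ba, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ba > bb) - (ba < bb)


def is_sorted(lines, stable: bool = False) -> bool:
    # Model of -c: every adjacent pair must already be in order.
    return all(compare(x, y, stable) <= 0 for x, y in zip(lines, lines[1:]))


if __name__ == "__main__":
    lines = ["a", "A"]  # hypothetical input, not the example from the thread
    print(is_sorted(lines))               # False: b"a" sorts after b"A"
    print(is_sorted(lines, stable=True))  # True: keys are equal after folding
```

In this model the two lines tie on the folded key, so only the byte comparison can put them out of order, which is why the check fails without the stable flag and passes with it.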
But POSIX further states [XBD LC_COLLATE]: "All implementation-provided locales (either preinstalled or provided as locale definitions which can be installed later) should define a collation sequence that has a total ordering of all characters unless the locale name has an '@' modifier indicating that it has a special collation sequence (for example, @icase could indicate that each upper and lowercase character pair collates equally).
Notes: A future version of this standard may require these locales to define a collation sequence that has a total ordering of all characters (by changing "should" to "shall").
Users installing their own locales should ensure that they define a collation sequence with a total ordering of all characters unless an '@' modifier in the locale name (such as @icase ) indicates that it has a special collation sequence."
> Put another way, should -c imply -s?
Maybe we compromise, and state that -c implies -s only for locales that do not include @ in their name (that is, if a locale already guarantees a total ordering of all characters, then even when -f collapses lowercase into uppercase, we don't need the final-resort comparison; but if a locale does not guarantee total ordering, the -s has to be explicit)?
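The proposed rule can be sketched as a small predicate. This is only an illustration of the idea floated above, not existing sort behavior: `has_modifier` and `skip_last_resort` are hypothetical names, and the test for a special collation sequence is reduced to looking for '@' in the locale name.

```python
def has_modifier(locale_name: str) -> bool:
    # Assumption: an '@' modifier signals a special collation sequence
    # that may lack a total ordering of all characters.
    return "@" in locale_name


def skip_last_resort(locale_name: str, explicit_s: bool) -> bool:
    """Decide whether a -c check would omit the last-resort comparison."""
    if explicit_s:
        # An explicit -s always disables the last-resort comparison.
        return True
    # Proposed compromise: -c implies -s only without an '@' modifier.
    return not has_modifier(locale_name)


if __name__ == "__main__":
    print(skip_last_resort("en_US.UTF-8", explicit_s=False))        # True
    print(skip_last_resort("en_US.UTF-8@icase", explicit_s=False))  # False
    print(skip_last_resort("en_US.UTF-8@icase", explicit_s=True))   # True
```

Note that `@icase` is the illustrative modifier from the POSIX text; whether any installed locale actually uses it is not something this sketch assumes.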
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org