coreutils
Re: version-sort ugliness or bugs


From: Erik Auerswald
Subject: Re: version-sort ugliness or bugs
Date: Fri, 16 Apr 2021 03:44:34 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

Hi,

On Thu, Apr 15, 2021 at 11:47:34PM +0200, Vincent Lefevre wrote:
> I'm currently using version-sort in order to get integers sorted
> in strings (due to the lack of simple numeric sort like in zsh),
> but I've noticed some ugliness. This may be bugs, but I'm not sure

This seems to be quite common:

https://www.gnu.org/software/coreutils/manual/coreutils.html#Correct_002fIncorrect-ordering-and-Expected_002fUnexpected-results

https://www.gnu.org/software/coreutils/manual/coreutils.html#Other-version_002fnatural-sort-implementations

I think that version sort "functions as designed" for your examples.

> since the description of the sorting method in the Coreutils manual
> takes several pages with all its exceptions,

I think all of your problems ("ugliness") are caused by the concept of "file
extensions" in GNU Coreutils version sort.

https://www.gnu.org/software/coreutils/manual/coreutils.html#Special-handling-of-file-extensions

All my explanations below are based on the version sort documentation at
https://www.gnu.org/software/coreutils/manual/coreutils.html#Version-sort-ordering

> and I expect no-one will try to understand.

I do try to understand this, but that's just me. ;-)

> Here are some examples with coreutils 8.32-4+b1 under Debian/unstable
> with the following locales:
> 
> $ locale
> LANG=POSIX
> LANGUAGE=
> LC_CTYPE=C.UTF-8
> LC_NUMERIC="POSIX"
> LC_TIME=en_DK.utf8
> LC_COLLATE=POSIX

I think this is relevant for sort order in general.

> LC_MONETARY="POSIX"
> LC_MESSAGES="POSIX"
> LC_PAPER="POSIX"
> LC_NAME="POSIX"
> LC_ADDRESS="POSIX"
> LC_TELEPHONE="POSIX"
> LC_MEASUREMENT="POSIX"
> LC_IDENTIFICATION="POSIX"
> LC_ALL=
> 
> in case this matters for version-sort.

Hard to say, since version sort specifies a special high-level sort order.
I do not think it matters for your examples.

> $ printf "%s\n" a.b.1 a.c a.c.0 ab.1 ac ac.0 | sort -V
> a.c
> ab.1
> ac
> ac.0
> a.b.1
> a.c.0
> 
> Here, one has "a.c" before "a.b.1", which is very surprising. It is

".c" is seen as a file extension and omitted from the key for sorting.
It is added back as a tie breaker, if necessary.  ".1" is not seen as a
file extension, and since an extension has to end at the end of the string,
the ".b" before it is not considered a possible file extension either.
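
As a rough illustration of that rule, here is a small Python sketch.  The
pattern is my paraphrase of the "file extension" definition in the manual
section linked above (a dot, then a letter or '~', then letters, digits or
'~', repeated, ending at the end of the string).  The names SUFFIX and
split_extension are made up for this example, and this is not the actual
Coreutils code:

```python
import re

# Assumed pattern, paraphrasing the manual's definition of a file extension:
# one or more groups of ".", a letter or "~", then letters, digits or "~",
# anchored at the very end of the string.
SUFFIX = re.compile(r"(?:\.[A-Za-z~][A-Za-z0-9~]*)+\Z")

def split_extension(name):
    """Return (key, extension); extension is '' when no suffix matches."""
    m = SUFFIX.search(name)
    if m is None or m.start() == 0:
        # Simplification for this sketch: names made up only of a suffix
        # are left untouched.
        return name, ""
    return name[:m.start()], m.group(0)

for name in ["a.c", "a.b.1", "a.c.0", "a.aux", "a.fdb_latexmk", "foo1b.txt"]:
    print(name, split_extension(name))
```

With this pattern, "a.c" splits into the key "a" plus ".c", while "a.b.1",
"a.c.0" and "a.fdb_latexmk" keep their full names as keys, because ".1" and
".0" start with a digit and "_" is not allowed in an extension.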

> also surprising to have all these strings between "a.c" and "a.c.0",
> which I would expect to be consecutive here.

Again, ".c" is seen as a file extension and omitted for sorting, while ".0"
is not.  Thus the keys "a" and "a.c.0" are compared with "ab.1", "ac",
"ac.0" and "a.b.1".  Since the file extension ".c" is not required as tie
breaker, it is ignored.

> $ printf "%s\n" a.aux a.fdb_latexmk a.fls a.log | sort -V
> a.aux
> a.fls
> a.log
> a.fdb_latexmk
> 
> Here, I would expect "a.fdb_latexmk" to be between "a.aux" and "a.fls".

Again, ".aux" and ".fls" are seen as file extensions, but ".fdb_latexmk" is
not.

> Less important:
> 
> $ printf "%s\n" foo1 foo1.txt foo1b foo1b.txt "foo1 bar" | sort -V
> foo1
> foo1.txt
> foo1b
> foo1b.txt
> foo1 bar
> 
> I think that having "foo1 bar" after something like "foo1b" is rather
> unusual, because "foo1 bar" makes me think that the word "foo1" is
> before "foo1b". This happens to work with lexicographic sort because
> the space is the first printable character, but this is a nice feature.

And again, ".txt" is seen as a file extension and ignored (unless a tie
breaker is needed, as in "foo1b" and "foo1b.txt").

Additionally, ' ' is a non-letter character other than '~', and thus sorts
after all letters.
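
To make the ordering of single non-digit characters concrete, here is a
minimal Python sketch of how I read the manual: '~' sorts before everything,
ASCII letters come next in ASCII order, and all other characters sort after
the letters.  The function name char_rank and the offset 256 are assumptions
for this illustration only:

```python
def char_rank(c):
    """Rank one non-digit character (a sketch, not the Coreutils code)."""
    if c == "~":
        return -1                # '~' sorts before everything else
    if c.isascii() and c.isalpha():
        return ord(c)            # ASCII letters keep their ASCII order
    return ord(c) + 256          # everything else sorts after all letters

print(sorted([" ", "b", "~", "."], key=char_rank))
```

Under these assumptions the space ranks after 'b', which fits "foo1b" being
listed before "foo1 bar" in the output above.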

Thanks,
Erik
-- 
I generally hate phones - they are irritating and disturb you as you
work or read or whatever - and a cellphone to me is just an opportunity
to be irritated wherever you are.
                        -- Linus Torvalds


