bug#36674: Sort Suggestion

From: Assaf Gordon
Subject: bug#36674: Sort Suggestion
Date: Mon, 15 Jul 2019 13:23:52 -0600
On Mon, Jul 15, 2019 at 11:42:01AM -0700, Marshall Lake wrote:
> Even though this isn't a bug, I was asked to send the following to this
> email address.

(General suggestions and discussions are better suited to the
address@hidden mailing list; that way the system won't open a new
bug item.)

> Re:  SORT Command from GNU coreutils 8.25
> A suggestion for an additional option to the SORT command is to ignore
> non-alphanumeric characters.
> As an example, in attempting to sort an index ...
> Abbott, William                        259
> sorts before:
> Abbot, William                         099
> If non-alphanumeric characters were ignored then the same two records
> would sort as:
> Abbot, William                         099
> Abbott, William                        259

There's actually something else at play here:
In your case, sort does ignore non-alphanumeric characters,
but it ALSO ignores white space.
That happens because your locale is set to some language
(for example, en_US.UTF8).

Using such a locale makes sort ignore all non-alphanumeric characters,
whitespace, and upper/lower case.

In essence, you are comparing "AbbottWilliam" (two 't's) to
"AbbotWilliam" (one 't'), so the second 't' is compared to a 'W'
and is determined to come first.

If you force a POSIX/C locale, then all characters are considered,
and the result will be as you requested.

Observe the following:

  $ printf "%s\n" AbbottWilliam AbbotWilliam | LC_ALL=en_CA.utf8 sort
  AbbottWilliam
  AbbotWilliam

  $ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=en_CA.utf8 sort
  Abbott William
  Abbot William

  $ printf "%s\n" "Abbott William" "Abbot William" | LC_ALL=C sort
  Abbot William
  Abbott William

  $ printf "%s\n" "Abbott, William" "Abbot, William" | LC_ALL=C sort
  Abbot, William
  Abbott, William
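
The C-locale results above follow from plain byte comparison: where
the two names first differ, "Abbott" has a 't' while "Abbot" has a
space or a comma, and both of those have lower byte values than 't'.
A minimal sketch that prints the values involved (the variable names
are only for illustration):

```shell
# Decimal byte values of the characters that decide the comparison.
# A leading quote in the argument makes printf print the numeric value
# of the character that follows it.
space=$(printf '%d' "' ")
comma=$(printf '%d' "',")
tee=$(printf '%d' "'t")
echo "space=$space comma=$comma t=$tee"
```

This prints "space=32 comma=44 t=116", which is why "Abbot William"
and "Abbot, William" both come first under LC_ALL=C.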

Note that 'sort' already has an option for dictionary style sorting:
   -d, --dictionary-order: consider only blanks and alphanumeric characters.

However, locale rules take precedence over it, so effectively it only
works in the "C" locale:

  $ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort
  Ab,,b,,ott William
  Abbot William

  $ printf "%s\n" "Ab,,b,,ott William" "Abbot William" | LC_ALL=C sort -d
  Abbot William
  Ab,,b,,ott William
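
If you sort indexes like this often, the two settings can be combined
in a small shell function. This is only a sketch: 'sort_index' is a
made-up name, not part of coreutils, and it assumes that byte-order
comparison of the letters is acceptable for your data.

```shell
# sort_index is a hypothetical helper name, not a coreutils command.
# LC_ALL=C forces byte-wise comparison; -d makes sort consider only
# blanks and alphanumeric characters. Extra arguments (files or other
# sort options) are passed through unchanged.
sort_index() {
  LC_ALL=C sort -d "$@"
}

printf "%s\n" "Abbott, William" "Abbot, William" | sort_index
```

With the two sample records this lists "Abbot, William" first, then
"Abbott, William", regardless of the locale set in your environment.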

You can read past discussion about the confusion resulting from locale
sorting rules here:

As such, I'm closing this as "not a bug", but discussion can continue
by replying to this thread.


