Re: uniq/sort documentation flaw
From: Pádraig Brady
Subject: Re: uniq/sort documentation flaw
Date: Tue, 5 May 2009 12:13:04 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Andries E. Brouwer wrote:
> uniq(1) says
>
> Discard all but one of successive identical lines from INPUT
>
> However, this is very misleading. "Identical" does not mean identical
> but "equal if one ignores differences that LC_COLLATE says should be ignored".
>
> This man page line should be changed, adding a reference to the locale.
> As it is now, the words locale and LC_COLLATE do not occur on the man page.
>
> The info file is better and mentions LC_COLLATE.
> But also there the fact that the meanings of "repeated" and "duplicate"
> are modified by LC_COLLATE is not mentioned explicitly.
>
> Andries
How about the attached?
> (Sorting is an operation done on all kinds of data, not only lines of text.
> I would not mind an option that tells sort to ignore the locale rules for
> sorting because what is sorted is not text. That feels cleaner than
> preceding each invocation with LC_COLLATE=C. And locale-free sort also
> is much faster.)
Well, it is a very common issue.
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
I'm not sure there is a better solution than what we have though.
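For illustration, here is a minimal sketch of the per-invocation workaround discussed above, assuming a POSIX shell. The function names bytewise_sort and bytewise_uniq are hypothetical. LC_ALL is set rather than LC_COLLATE because LC_ALL takes precedence over the individual categories.

```shell
# Hypothetical wrapper names; only sort, uniq and the LC_ALL variable are real.
# Forcing the C locale makes comparisons purely byte-wise.
bytewise_sort() { LC_ALL=C sort "$@"; }
bytewise_uniq() { LC_ALL=C uniq "$@"; }

# Byte order puts uppercase ASCII before lowercase: prints A, B, a, b.
printf 'b\nA\na\nB\n' | bytewise_sort

# Only byte-identical adjacent lines are merged: prints a, A.
printf 'a\na\nA\n' | bytewise_uniq
```

Under another locale the same input may be ordered or merged differently, depending on the collation rules installed on the system.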
cheers,
Pádraig.
From 14d5f083fc6ed571ca0c07e51e7d4365c1ddcd91 Mon Sep 17 00:00:00 2001
From: =?utf-8?q?P=C3=A1draig=20Brady?= <address@hidden>
Date: Tue, 5 May 2009 12:00:15 +0100
Subject: [PATCH] doc: note the use of LC_COLLATE in comm, join and uniq.
* doc/coreutils.texi (uniq invocation): Simplify the
text to remove the inconsequential mentioning of order,
while implying that LC_COLLATE can alter equality comparisons.
* src/comm.c (usage): Mention LC_COLLATE is significant.
* src/join.c (usage): Ditto
* src/uniq.c (usage): Ditto. Also improve the summary.
Suggestion from Andries Brouwer
---
doc/coreutils.texi | 4 ++--
src/comm.c | 4 ++++
src/join.c | 1 +
src/uniq.c | 7 +++++--
4 files changed, 12 insertions(+), 4 deletions(-)
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 918f44e..b96fdb2 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4406,8 +4406,8 @@ duplicate lines, perhaps you want to use @code{sort -u}.
@xref{sort invocation}.
@vindex LC_COLLATE
-Comparisons use the character collating sequence specified by the
-@env{LC_COLLATE} locale category.
+Comparisons honor the rules specified by the @env{LC_COLLATE}
+locale category.
If no @var{output} file is specified, @command{uniq} writes to standard
output.
diff --git a/src/comm.c b/src/comm.c
index c60936f..3c5b09a 100644
--- a/src/comm.c
+++ b/src/comm.c
@@ -129,6 +129,10 @@ and column three contains lines common to both files.\n\
"), stdout);
fputs (HELP_OPTION_DESCRIPTION, stdout);
fputs (VERSION_OPTION_DESCRIPTION, stdout);
+ fputs (_("\
+\n\
+Note, comparisons honor the rules specified by `LC_COLLATE'.\n\
+"), stdout);
emit_bug_reporting_address ();
}
exit (status);
diff --git a/src/join.c b/src/join.c
index 992a357..c716698 100644
--- a/src/join.c
+++ b/src/join.c
@@ -204,6 +204,7 @@ separated by CHAR.\n\
\n\
Important: FILE1 and FILE2 must be sorted on the join fields.\n\
E.g., use `sort -k 1b,1' if `join' has no options.\n\
+Note, comparisons honor the rules specified by `LC_COLLATE'.\n\
If the input is not sorted and some lines cannot be joined, a\n\
warning message will be given.\n\
"), stdout);
diff --git a/src/uniq.c b/src/uniq.c
index a3e0fb7..f9b4342 100644
--- a/src/uniq.c
+++ b/src/uniq.c
@@ -135,8 +135,10 @@ Usage: %s [OPTION]... [INPUT [OUTPUT]]\n\
"),
program_name);
fputs (_("\
-Discard all but one of successive identical lines from INPUT (or\n\
-standard input), writing to OUTPUT (or standard output).\n\
+Filter adjacent matching lines from INPUT (or standard input),\n\
+writing to OUTPUT (or standard output).\n\
+\n\
+With no options, matching lines are merged to the first occurrence.\n\
\n\
"), stdout);
fputs (_("\
@@ -170,6 +172,7 @@ characters. Fields are skipped before chars.\n\
\n\
Note: 'uniq' does not detect repeated lines unless they are adjacent.\n\
You may want to sort the input first, or use `sort -u' without `uniq'.\n\
+Also, comparisons honor the rules specified by `LC_COLLATE'.\n\
"), stdout);
emit_bug_reporting_address ();
}
--
1.5.3.6