coreutils uniq -d -u does not conform to POSIX
From: Paul Eggert
Subject: coreutils uniq -d -u does not conform to POSIX
Date: Tue, 13 May 2003 23:40:56 -0700
I asked my students to clone Debian GNU/Linux uniq in Python, giving
them a POSIX-compatible uniq to start with. Josh Hyman, one of my
students, pointed out a disagreement between the two programs that
turns out to be a POSIX-compatibility bug in coreutils. POSIX says
that "uniq -d -u" should output nothing, but with coreutils uniq, -d
overrides -u and vice versa.
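To make the intended semantics concrete, here is a minimal Python model of
the behavior (a hypothetical sketch, not part of the patch): the
mutually-exclusive mode enum becomes three independent booleans, so -d and
-u combine to suppress all output, as POSIX requires.

```python
def uniq_filter(lines, d=False, u=False, D=False):
    """Model of uniq's line selection with three independent booleans.
    d, u, D stand for the -d, -u, and -D (--all-repeated) options."""
    output_unique = not (d or D)       # -d and -D suppress unique lines
    output_first_repeated = not u      # -u suppresses first repeats
    output_later_repeated = D          # only -D emits later repeats
    out = []
    i = 0
    while i < len(lines):
        # Find the end of the current group of adjacent equal lines.
        j = i
        while j < len(lines) and lines[j] == lines[i]:
            j += 1
        if j - i == 1:
            if output_unique:
                out.append(lines[i])           # line is not repeated
        else:
            if output_first_repeated:
                out.append(lines[i])           # first copy of the group
            if output_later_repeated:
                out.extend(lines[i + 1:j])     # remaining copies
        i = j
    return out
```

With d=True and u=True every boolean that could emit a line is false, so
the result is empty, matching the POSIX requirement for "uniq -d -u".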
Here is a proposed patch. While I was at it, I noticed a few related
infelicities in the documentation, so this patch fixes them too.
2003-05-13 Paul Eggert <address@hidden>
Fix uniq to conform to POSIX, which requires that "uniq -d -u"
must output nothing. Problem reported by Josh Hyman.
* doc/coreutils.texi (uniq invocation, squeezing, The uniq command):
Use "repeated" rather than "duplicate" to describe adjacent
duplicates; this simplifies the description and makes it more
consistent with POSIX.
(uniq invocation): Make it clear that -d and -u suppress the
output of lines, rather than cause some lines to be output.
Mention what happens if a line lacks enough fields or characters.
* src/uniq.c (enum output_mode, mode): Remove, replacing with:
(output_unique, output_first_repeated, output_later_repeated):
New vars. All uses of "mode" changed to use these variables,
which are not mutually exclusive as "mode" was.
(writeline): New arg "match", used to control whether to
obey output_first_repeated or output_later_repeated.
All callers changed.
(check_file, main): Adjust to above changes.
* tests/uniq/Test.pm: Test that 'uniq -d -u' outputs nothing.
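The heart of the change is the rewritten predicate in writeline. Restated
as a Python sketch (hypothetical helper, for illustration only), the
selection rule it implements is:

```python
def should_write(linecount, match, output_unique,
                 output_first_repeated, output_later_repeated):
    """Decide whether writeline should print a line.
    linecount == 0 means the line occurred only once, so output_unique
    governs it.  Otherwise the line belongs to a repeated group: if it
    did not match the previous line (match is false) it stands for the
    group's first copy, governed by output_first_repeated; if it did
    match, it is a later copy, governed by output_later_repeated."""
    if linecount == 0:
        return output_unique
    return output_first_repeated if not match else output_later_repeated
```

The defaults (output_unique and output_first_repeated true,
output_later_repeated false) reproduce plain uniq; -d and -u each clear
one flag, so combining them leaves nothing to print.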
Index: doc/coreutils.texi
===================================================================
RCS file: /home/m/meyering/fetish/cu/doc/coreutils.texi,v
retrieving revision 1.110
diff -p -u -r1.110 coreutils.texi
--- doc/coreutils.texi 13 May 2003 12:42:02 -0000 1.110
+++ doc/coreutils.texi 14 May 2003 06:39:37 -0000
@@ -3271,12 +3271,12 @@ standard input if nothing is given or fo
uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
@end example
-By default, @command{uniq} prints the unique lines in a sorted file, i.e.,
-discards all but one of identical successive lines. Optionally, it can
-instead show only lines that appear exactly once, or lines that appear
-more than once.
+By default, @command{uniq} prints its input lines, except that
+it discards all but the first of adjacent repeated lines, so that
+no output lines are repeated. Optionally, it can instead discard
+lines that are not repeated, or all repeated lines.
-The input need not be sorted, but duplicate input lines are detected
+The input need not be sorted, but repeated input lines are detected
only if they are adjacent. If you want to discard non-adjacent
duplicate lines, perhaps you want to use @code{sort -u}.
@@ -3295,7 +3295,8 @@ The program accepts the following option
@itemx --skip-fields=@var{n}
@opindex -f
@opindex --skip-fields
-Skip @var{n} fields on each line before checking for uniqueness. Fields
+Skip @var{n} fields on each line before checking for uniqueness. Use
+a null string for comparison if a line has fewer than @var{n} fields. Fields
are sequences of non-space non-tab characters that are separated from
each other by at least one space or tab.
@@ -3307,7 +3308,8 @@ does not allow this; use @option{-f @var
@itemx --skip-chars=@var{n}
@opindex -s
@opindex --skip-chars
-Skip @var{n} characters before checking for uniqueness. If you use both
+Skip @var{n} characters before checking for uniqueness. Use a null string
+for comparison if a line has fewer than @var{n} characters. If you use both
the field and character skipping options, fields are skipped over first.
On older systems, @command{uniq} supports an obsolete option
@@ -3330,31 +3332,34 @@ Ignore differences in case when comparin
@itemx --repeated
@opindex -d
@opindex --repeated
-@cindex duplicate lines, outputting
-Print one copy of each duplicate line.
+@cindex repeated lines, outputting
+Discard lines that are not repeated. When used by itself, this option
+causes @command{uniq} to print the first copy of each repeated line,
+and nothing else.
@item -D
@itemx --all-repeated[=@var{delimit-method}]
@opindex -D
@opindex --all-repeated
-@cindex all duplicate lines, outputting
-Print all copies of each duplicate line.
+@cindex all repeated lines, outputting
+Do not discard the second and subsequent repeated input lines,
+but discard lines that are not repeated.
This option is useful mainly in conjunction with other options e.g.,
to ignore case or to compare only selected fields.
The optional @var{delimit-method} tells how to delimit
-groups of duplicate lines, and must be one of the following:
+groups of repeated lines, and must be one of the following:
@table @samp
@item none
-Do not delimit groups of duplicate lines.
+Do not delimit groups of repeated lines.
This is equivalent to @option{--all-repeated} (@option{-D}).
@item prepend
-Output a newline before each group of duplicate lines.
+Output a newline before each group of repeated lines.
@item separate
-Separate groups of duplicate lines with a single newline.
+Separate groups of repeated lines with a single newline.
This is the same as using @samp{prepend}, except that
there is no newline before the first group, and hence
may be better suited for output direct to users.
@@ -3373,13 +3378,14 @@ This is a @sc{gnu} extension.
@opindex -u
@opindex --unique
@cindex unique lines, outputting
-Print non-duplicate lines.
+Discard the first repeated line. When used by itself, this option
+causes @command{uniq} to print unique lines, and nothing else.
@item -w @var{n}
@itemx --check-chars=@var{n}
@opindex -w
@opindex --check-chars
-Compare @var{n} characters on each line (after skipping any specified
+Compare at most @var{n} characters on each line (after skipping any specified
fields and characters). By default the entire rest of the lines are
compared.
@@ -4649,13 +4655,13 @@ tr -s '\n'
@item
Find doubled occurrences of words in a document.
-For example, people often write ``the the'' with the duplicated words
+For example, people often write ``the the'' with the repeated words
separated by a newline. The bourne shell script below works first
by converting each sequence of punctuation and blank characters to a
single newline. That puts each ``word'' on a line by itself.
Next it maps all uppercase characters to lower case, and finally it
runs @command{uniq} with the @option{-d} option to print out only the words
-that were adjacent duplicates.
+that were repeated.
@example
#!/bin/sh
@@ -12055,8 +12061,8 @@ Finally (at least for now), we'll look a
sorting data, you will often end up with duplicate lines, lines that
are identical. Usually, all you need is one instance of each line.
This is where @command{uniq} comes in. The @command{uniq} program reads its
-standard input, which it expects to be sorted. It only prints out one
-copy of each duplicated line. It does have several options. Later on,
+standard input. It prints only one
+copy of each repeated line. It does have several options. Later on,
we'll use the @option{-c} option, which prints each unique line, preceded
by a count of the number of times that line occurred in the input.
Index: src/uniq.c
===================================================================
RCS file: /home/m/meyering/fetish/cu/src/uniq.c,v
retrieving revision 1.99
diff -p -u -r1.99 uniq.c
--- src/uniq.c 10 May 2003 13:39:05 -0000 1.99
+++ src/uniq.c 14 May 2003 06:39:37 -0000
@@ -74,16 +74,12 @@ enum countmode
times they occurred in the input. */
static enum countmode countmode;
-enum output_mode
-{
- output_repeated, /* -d Only lines that are repeated. */
- output_all_repeated, /* -D All lines that are repeated. */
- output_unique, /* -u Only lines that are not repeated. */
- output_all /* Default. Print first copy of each line. */
-};
-
-/* Which lines to output. */
-static enum output_mode mode;
+/* Which lines to output: unique lines, the first line of a group of
+ repeated lines, and the second and subsequent lines of a group of
+ repeated lines. */
+static bool output_unique;
+static bool output_first_repeated;
+static bool output_later_repeated;
/* If nonzero, ignore case when comparing. */
static int ignore_case;
@@ -240,15 +236,17 @@ different (char *old, char *new, size_t
/* Output the line in linebuffer LINE to stream STREAM
provided that the switches say it should be output.
+ MATCH is true if the line matches the previous line.
If requested, print the number of times it occurred, as well;
LINECOUNT + 1 is the number of times that the line occurred. */
static void
-writeline (const struct linebuffer *line, FILE *stream, int linecount)
+writeline (struct linebuffer const *line, FILE *stream,
+ bool match, int linecount)
{
- if ((mode == output_unique && linecount != 0)
- || (mode == output_repeated && linecount == 0)
- || (mode == output_all_repeated && linecount == 0))
+ if (! (linecount == 0 ? output_unique
+ : !match ? output_first_repeated
+ : output_later_repeated))
return;
if (countmode == count_occurrences)
@@ -295,7 +293,7 @@ check_file (const char *infile, const ch
this optimization lets uniq output each different line right away,
without waiting to see if the next one is different. */
- if (mode == output_all && countmode == count_none)
+ if (output_unique && output_first_repeated && countmode == count_none)
{
char *prevfield IF_LINT (= NULL);
size_t prevlen IF_LINT (= 0);
@@ -334,7 +332,7 @@ check_file (const char *infile, const ch
while (!feof (istream))
{
- int match;
+ bool match;
char *thisfield;
size_t thislen;
if (readline (thisline, istream) == 0)
@@ -346,7 +344,7 @@ check_file (const char *infile, const ch
if (match)
++match_count;
- if (mode == output_all_repeated && delimit_groups != DM_NONE)
+ if (delimit_groups != DM_NONE)
{
if (!match)
{
@@ -362,9 +360,9 @@ check_file (const char *infile, const ch
}
}
- if (!match || mode == output_all_repeated)
+ if (!match || output_later_repeated)
{
- writeline (prevline, ostream, match_count);
+ writeline (prevline, ostream, match, match_count);
SWAP_LINES (prevline, thisline);
prevfield = thisfield;
prevlen = thislen;
@@ -373,7 +371,7 @@ check_file (const char *infile, const ch
}
}
- writeline (prevline, ostream, match_count);
+ writeline (prevline, ostream, false, match_count);
}
closefiles:
@@ -410,7 +408,8 @@ main (int argc, char **argv)
skip_chars = 0;
skip_fields = 0;
check_chars = SIZE_MAX;
- mode = output_all;
+ output_unique = output_first_repeated = true;
+ output_later_repeated = false;
countmode = count_none;
delimit_groups = DM_NONE;
@@ -480,11 +479,12 @@ main (int argc, char **argv)
break;
case 'd':
- mode = output_repeated;
+ output_unique = false;
break;
case 'D':
- mode = output_all_repeated;
+ output_unique = false;
+ output_later_repeated = true;
if (optarg == NULL)
delimit_groups = DM_NONE;
else
@@ -508,7 +508,7 @@ main (int argc, char **argv)
break;
case 'u':
- mode = output_unique;
+ output_first_repeated = false;
break;
case 'w':
@@ -532,7 +532,7 @@ main (int argc, char **argv)
usage (EXIT_FAILURE);
}
- if (countmode == count_occurrences && mode == output_all_repeated)
+ if (countmode == count_occurrences && output_later_repeated)
{
error (0, 0,
_("printing all duplicated lines and repeat counts is meaningless"));
Index: tests/uniq/Test.pm
===================================================================
RCS file: /home/m/meyering/fetish/cu/tests/uniq/Test.pm,v
retrieving revision 1.9
diff -p -u -r1.9 Test.pm
--- tests/uniq/Test.pm 18 Feb 2002 12:39:19 -0000 1.9
+++ tests/uniq/Test.pm 14 May 2003 06:39:37 -0000
@@ -83,6 +83,8 @@ my @tv = (
['117', '--all-repeated=prepend', "a\na\nb\nc\nc\n", "\na\na\n\nc\nc\n", 0],
['118', '--all-repeated=prepend', "a\nb\n", "", 0],
['119', '--all-repeated=badoption', "a\n", "", 1],
+# Check that -d and -u suppress all output, as POSIX requires.
+['120', '-d -u', "a\na\n", "", 0],
);
sub test_vector