coreutils uniq -d -u does not conform to POSIX


From: Paul Eggert
Subject: coreutils uniq -d -u does not conform to POSIX
Date: Tue, 13 May 2003 23:40:56 -0700

I asked my students to clone Debian GNU/Linux uniq in Python, giving
them a POSIX-compatible uniq to start with.  Josh Hyman, one of my
students, pointed out a disagreement between the two programs, which
turns out to be a POSIX-compatibility bug in coreutils.  POSIX says
that "uniq -d -u" should output nothing, but with coreutils uniq, -d
overrides -u and vice versa.
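
To make the difference concrete, here is a small shell illustration.
The three-line input is arbitrary, and the "currently" results are
inferred from the existing code, where the last of -d/-u wins:

  # What POSIX requires (and what the patch implements): no output at all.
  $ printf 'a\na\nb\n' | uniq -d -u

  # What coreutils currently does: the option given last takes effect.
  $ printf 'a\na\nb\n' | uniq -d -u    # prints "b", as if only -u were given
  $ printf 'a\na\nb\n' | uniq -u -d    # prints "a", as if only -d were given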

Here is a proposed patch.  While I was at it, I noticed a few related
infelicities in the documentation, so this patch fixes them too.

2003-05-13  Paul Eggert  <address@hidden>

        Fix uniq to conform to POSIX, which requires that "uniq -d -u"
        must output nothing.  Problem reported by Josh Hyman.

        * doc/coreutils.texi (uniq invocation, squeezing, The uniq command):
        Use "repeated" rather than "duplicate" to describe adjacent
        duplicates; this simplifies the description and makes it more
        consistent with POSIX.
        (uniq invocation): Make it clear that -d and -u suppress the
        output of lines, rather than cause some lines to be output.
        Mention what happens if a line lacks enough fields or characters.

        * src/uniq.c (enum output_mode, mode): Remove, replacing with:
        (output_unique, output_first_repeated, output_later_repeated):
        New vars.  All uses of "mode" changed to use these variables,
        which are not mutually exclusive as "mode" was.
        (writeline): New arg "match", used to control whether to
        obey output_first_repeated or output_later_repeated.
        All callers changed.
        (check_file, main): Adjust to above changes.

        * tests/uniq/Test.pm: Test that 'uniq -d -u' outputs nothing.
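
Concretely, with -d and -u acting as suppressions rather than modes,
the options combine like this (the sample input is arbitrary; outputs
are shown as comments and assume the patched behavior):

  $ printf 'a\na\nb\nc\nc\nc\n' > in
  $ uniq in          # a b c      (keep the first line of each group)
  $ uniq -d in       # a c        (additionally discard non-repeated lines)
  $ uniq -u in       # b          (additionally discard the first line of each repeated group)
  $ uniq -D in       # a a c c c  (keep every line of each repeated group)
  $ uniq -d -u in    # (empty: both suppressions apply, as POSIX requires)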

Index: doc/coreutils.texi
===================================================================
RCS file: /home/m/meyering/fetish/cu/doc/coreutils.texi,v
retrieving revision 1.110
diff -p -u -r1.110 coreutils.texi
--- doc/coreutils.texi  13 May 2003 12:42:02 -0000      1.110
+++ doc/coreutils.texi  14 May 2003 06:39:37 -0000
@@ -3271,12 +3271,12 @@ standard input if nothing is given or fo
 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
 @end example
 
-By default, @command{uniq} prints the unique lines in a sorted file, i.e.,
-discards all but one of identical successive lines.  Optionally, it can
-instead show only lines that appear exactly once, or lines that appear
-more than once.
+By default, @command{uniq} prints its input lines, except that
+it discards all but the first of adjacent repeated lines, so that
+no output lines are repeated.  Optionally, it can instead discard
+lines that are not repeated, or all repeated lines.
 
-The input need not be sorted, but duplicate input lines are detected
+The input need not be sorted, but repeated input lines are detected
 only if they are adjacent.  If you want to discard non-adjacent
 duplicate lines, perhaps you want to use @code{sort -u}.
 
@@ -3295,7 +3295,8 @@ The program accepts the following option
 @itemx --skip-fields=@var{n}
 @opindex -f
 @opindex --skip-fields
-Skip @var{n} fields on each line before checking for uniqueness.  Fields
+Skip @var{n} fields on each line before checking for uniqueness.  Use
+a null string for comparison if a line has fewer than @var{n} fields.  Fields
 are sequences of non-space non-tab characters that are separated from
 each other by at least one space or tab.
 
@@ -3307,7 +3308,8 @@ does not allow this; use @option{-f @var
 @itemx --skip-chars=@var{n}
 @opindex -s
 @opindex --skip-chars
-Skip @var{n} characters before checking for uniqueness.  If you use both
+Skip @var{n} characters before checking for uniqueness.  Use a null string
+for comparison if a line has fewer than @var{n} characters.  If you use both
 the field and character skipping options, fields are skipped over first.
 
 On older systems, @command{uniq} supports an obsolete option
@@ -3330,31 +3332,34 @@ Ignore differences in case when comparin
 @itemx --repeated
 @opindex -d
 @opindex --repeated
-@cindex duplicate lines, outputting
-Print one copy of each duplicate line.
+@cindex repeated lines, outputting
+Discard lines that are not repeated.  When used by itself, this option
+causes @command{uniq} to print the first copy of each repeated line,
+and nothing else.
 
 @item -D
 @itemx --all-repeated[=@var{delimit-method}]
 @opindex -D
 @opindex --all-repeated
-@cindex all duplicate lines, outputting
-Print all copies of each duplicate line.
+@cindex all repeated lines, outputting
+Do not discard the second and subsequent repeated input lines,
+but discard lines that are not repeated.
 This option is useful mainly in conjunction with other options e.g.,
 to ignore case or to compare only selected fields.
 The optional @var{delimit-method} tells how to delimit
-groups of duplicate lines, and must be one of the following:
+groups of repeated lines, and must be one of the following:
 
 @table @samp
 
 @item none
-Do not delimit groups of duplicate lines.
+Do not delimit groups of repeated lines.
 This is equivalent to @option{--all-repeated} (@option{-D}).
 
 @item prepend
-Output a newline before each group of duplicate lines.
+Output a newline before each group of repeated lines.
 
 @item separate
-Separate groups of duplicate lines with a single newline.
+Separate groups of repeated lines with a single newline.
 This is the same as using @samp{prepend}, except that
 there is no newline before the first group, and hence
 may be better suited for output direct to users.
@@ -3373,13 +3378,14 @@ This is a @sc{gnu} extension.
 @opindex -u
 @opindex --unique
 @cindex unique lines, outputting
-Print non-duplicate lines.
+Discard the first repeated line.  When used by itself, this option
+causes @command{uniq} to print unique lines, and nothing else.
 
 @item -w @var{n}
 @itemx --check-chars=@var{n}
 @opindex -w
 @opindex --check-chars
-Compare @var{n} characters on each line (after skipping any specified
+Compare at most @var{n} characters on each line (after skipping any specified
 fields and characters).  By default the entire rest of the lines are
 compared.
 
@@ -4649,13 +4655,13 @@ tr -s '\n'
 
 @item
 Find doubled occurrences of words in a document.
-For example, people often write ``the the'' with the duplicated words
+For example, people often write ``the the'' with the repeated words
 separated by a newline.  The bourne shell script below works first
 by converting each sequence of punctuation and blank characters to a
 single newline.  That puts each ``word'' on a line by itself.
 Next it maps all uppercase characters to lower case, and finally it
 runs @command{uniq} with the @option{-d} option to print out only the words
-that were adjacent duplicates.
+that were repeated.
 
 @example
 #!/bin/sh
@@ -12055,8 +12061,8 @@ Finally (at least for now), we'll look a
 sorting data, you will often end up with duplicate lines, lines that
 are identical.  Usually, all you need is one instance of each line.
 This is where @command{uniq} comes in. The @command{uniq} program reads its
-standard input, which it expects to be sorted.  It only prints out one
-copy of each duplicated line.  It does have several options.  Later on,
+standard input.  It prints only one
+copy of each repeated line.  It does have several options.  Later on,
 we'll use the @option{-c} option, which prints each unique line, preceded
 by a count of the number of times that line occurred in the input.
 
Index: src/uniq.c
===================================================================
RCS file: /home/m/meyering/fetish/cu/src/uniq.c,v
retrieving revision 1.99
diff -p -u -r1.99 uniq.c
--- src/uniq.c  10 May 2003 13:39:05 -0000      1.99
+++ src/uniq.c  14 May 2003 06:39:37 -0000
@@ -74,16 +74,12 @@ enum countmode
    times they occurred in the input. */
 static enum countmode countmode;
 
-enum output_mode
-{
-  output_repeated,             /* -d Only lines that are repeated. */
-  output_all_repeated,         /* -D All lines that are repeated. */
-  output_unique,               /* -u Only lines that are not repeated. */
-  output_all                   /* Default.  Print first copy of each line. */
-};
-
-/* Which lines to output. */
-static enum output_mode mode;
+/* Which lines to output: unique lines, the first of a group of
+   repeated lines, and the second and subsequent lines of a group of
+   repeated lines.  */
+static bool output_unique;
+static bool output_first_repeated;
+static bool output_later_repeated;
 
 /* If nonzero, ignore case when comparing.  */
 static int ignore_case;
@@ -240,15 +236,17 @@ different (char *old, char *new, size_t 
 
 /* Output the line in linebuffer LINE to stream STREAM
    provided that the switches say it should be output.
+   MATCH is true if the line matches the previous line.
    If requested, print the number of times it occurred, as well;
    LINECOUNT + 1 is the number of times that the line occurred. */
 
 static void
-writeline (const struct linebuffer *line, FILE *stream, int linecount)
+writeline (struct linebuffer const *line, FILE *stream,
+          bool match, int linecount)
 {
-  if ((mode == output_unique && linecount != 0)
-      || (mode == output_repeated && linecount == 0)
-      || (mode == output_all_repeated && linecount == 0))
+  if (! (linecount == 0 ? output_unique
+        : !match ? output_first_repeated
+        : output_later_repeated))
     return;
 
   if (countmode == count_occurrences)
@@ -295,7 +293,7 @@ check_file (const char *infile, const ch
      this optimization lets uniq output each different line right away,
      without waiting to see if the next one is different.  */
 
-  if (mode == output_all && countmode == count_none)
+  if (output_unique && output_first_repeated && countmode == count_none)
     {
       char *prevfield IF_LINT (= NULL);
       size_t prevlen IF_LINT (= 0);
@@ -334,7 +332,7 @@ check_file (const char *infile, const ch
 
       while (!feof (istream))
        {
-         int match;
+         bool match;
          char *thisfield;
          size_t thislen;
          if (readline (thisline, istream) == 0)
@@ -346,7 +344,7 @@ check_file (const char *infile, const ch
          if (match)
            ++match_count;
 
-          if (mode == output_all_repeated && delimit_groups != DM_NONE)
+          if (delimit_groups != DM_NONE)
            {
              if (!match)
                {
@@ -362,9 +360,9 @@ check_file (const char *infile, const ch
                }
            }
 
-         if (!match || mode == output_all_repeated)
+         if (!match || output_later_repeated)
            {
-             writeline (prevline, ostream, match_count);
+             writeline (prevline, ostream, match, match_count);
              SWAP_LINES (prevline, thisline);
              prevfield = thisfield;
              prevlen = thislen;
@@ -373,7 +371,7 @@ check_file (const char *infile, const ch
            }
        }
 
-      writeline (prevline, ostream, match_count);
+      writeline (prevline, ostream, false, match_count);
     }
 
  closefiles:
@@ -410,7 +408,8 @@ main (int argc, char **argv)
   skip_chars = 0;
   skip_fields = 0;
   check_chars = SIZE_MAX;
-  mode = output_all;
+  output_unique = output_first_repeated = true;
+  output_later_repeated = false;
   countmode = count_none;
   delimit_groups = DM_NONE;
 
@@ -480,11 +479,12 @@ main (int argc, char **argv)
          break;
 
        case 'd':
-         mode = output_repeated;
+         output_unique = false;
          break;
 
        case 'D':
-         mode = output_all_repeated;
+         output_unique = false;
+         output_later_repeated = true;
          if (optarg == NULL)
            delimit_groups = DM_NONE;
          else
@@ -508,7 +508,7 @@ main (int argc, char **argv)
          break;
 
        case 'u':
-         mode = output_unique;
+         output_first_repeated = false;
          break;
 
        case 'w':
@@ -532,7 +532,7 @@ main (int argc, char **argv)
       usage (EXIT_FAILURE);
     }
 
-  if (countmode == count_occurrences && mode == output_all_repeated)
+  if (countmode == count_occurrences && output_later_repeated)
     {
       error (0, 0,
           _("printing all duplicated lines and repeat counts is meaningless"));
Index: tests/uniq/Test.pm
===================================================================
RCS file: /home/m/meyering/fetish/cu/tests/uniq/Test.pm,v
retrieving revision 1.9
diff -p -u -r1.9 Test.pm
--- tests/uniq/Test.pm  18 Feb 2002 12:39:19 -0000      1.9
+++ tests/uniq/Test.pm  14 May 2003 06:39:37 -0000
@@ -83,6 +83,8 @@ my @tv = (
 ['117', '--all-repeated=prepend', "a\na\nb\nc\nc\n", "\na\na\n\nc\nc\n", 0],
 ['118', '--all-repeated=prepend', "a\nb\n",          "",                 0],
 ['119', '--all-repeated=badoption', "a\n",           "",                 1],
+# Check that -d and -u suppress all output, as POSIX requires.
+['120', '-d -u', "a\na\n",          "",                         0],
 );
 
 sub test_vector
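
For a quick check by hand after rebuilding (assuming an in-tree build
that leaves the binary in src/), option order should no longer matter:

  $ printf 'a\na\nb\n' | ./src/uniq -d -u    # no output
  $ printf 'a\na\nb\n' | ./src/uniq -u -d    # no output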



