coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

uniq for unsorted input


From: Bruno Haible
Subject: uniq for unsorted input
Date: Tue, 26 Mar 2024 17:40:44 +0100

The documentation of 'uniq' [1] says:

  "The input need not be sorted, but repeated input lines are
   detected only if they are adjacent. If you want to discard
   non-adjacent duplicate lines, perhaps you want to use 'sort -u'."

When I wrote gnulib-tool many years ago, as a shell script, I naturally
used
  - 'sort -u' to remove duplicates,
  - 'sort' + 'join' to compute set intersections, and
  - 'sort' + 'join -v' to compute set differences.

In a couple of places I only wanted to remove duplicates. Since the coreutils
don't offer a way to remove duplicates without sorting, I accepted the sorting
as a side effect. This has led to inconsistencies: In some places, gnulib-tool
sorts file names, and in other places it doesn't. [2]

In order to avoid mistakes of this kind, I propose that coreutils at least
gives a hint that removing duplicate lines from an unsorted list of lines
is possible, in O(N log N) run time and O(1) space.

The way to do so is described in [3]:
$ cat data.txt | nl | sort -k 2 | uniq -f 1 | sort -n | sed -e 's/^[^   ]*      
//'
or
$ cat -n data.txt | sort --key=2.1 -b -u | sort -n | sed -e 's/^[^      ]*      
//'

Other methods described in [3], such as counters maintained in an 'awk'
or 'perl' process, or the 'unique' program that is part of the 'john' package
[4], can be ignored, because they need O(N) space and are thus not usable for
40 GB large inputs [5].

* Would you accept a documentation change to the 'uniq' page that explains
  the alternative with the line numbering trick mentioned above?

* What do you think about adding
  - a 'uniq' option -k / --keep-unsorted
    that implies unsorted input and removal of occurrences 2,...,n of a line
    that occurs n times,
  - a 'uniq' option --keep-unsorted-last
    that implies unsorted input and removal of occurrences 1,...,n-1 of a line
    that occurs n times?
  These options would be implemented by spawning a pipe of 'cat', 'sort',
  'sort', 'sed' programs as shown above, optionally resorting to all-in-memory
  processing if the input's size is below a certain threshold (like 'sort'
  does).

Bruno

[1] https://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html
[2] https://lists.gnu.org/archive/html/bug-gnulib/2024-03/msg00334.html
[3] https://unix.stackexchange.com/questions/11939/
[4] https://www.openwall.com/john/
[5] https://stackoverflow.com/questions/30906191/






reply via email to

[Prev in Thread] Current Thread [Next in Thread]