Hello Pádraig,
Pádraig Brady wrote, On 02/20/2013 08:47 PM:
On 02/20/2013 06:44 PM, Assaf Gordon wrote:
Hello,
Attached is a suggestion for "--group" option in uniq, as discussed here:
http://lists.gnu.org/archive/html/coreutils/2011-03/msg00000.html
http://lists.gnu.org/archive/html/coreutils/2012-03/msg00052.html
The patch adds two parameters:
--group=[method] separate each unique line (whether duplicated or not)
with a marker.
method={none,separate(default),prepend,append,both}
--group-separator=SEP with --group, separates group using SEP
(default: empty line)
--group-sep is probably overkill.
I'd just use \n or \0 if -z specified.
OK.
As for separation methods I'd just go with what we have for
--all-repeated (but remove 'none' which wouldn't be useful with --group),
as we've never had requests for anything else. So:
--group={prepend, separate(default)}
I'd like to have at least "append" or "both", for the added convenience of
downstream analysis.
It's obviously a "nice-to-have" and not "must-have" feature, and can be
implemented in other ways, but knowing that there will always be a terminating marker *after* a
group (even the last group) makes downstream processing code simpler.
Typical example:
$ cat INPUT | uniq --group=append | \
awk '$0!="" { ## item in the group, collect it
}
$0=="" { ## end of group, do something
}'
Without the final group marker, any downstream code will require two points of
"group processing": when a marker is found, and at EOF.
Something like:
$ cat INPUT | uniq --group=separate | \
awk '$0!="" { ## item in the group, collect it
}
$0=="" { ## end of group, do something
}
END { ## end of last group, do something, duplicated code
}'
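To make the "append" case concrete, here is a minimal runnable sketch. It makes two assumptions: summarize_groups is a hypothetical helper name, and the printf input only simulates what --group=append is expected to emit (an empty line after every group, including the last), so it does not depend on the patched uniq.

```shell
# Hypothetical helper: print "<count> <line>" for each group, given input in
# the assumed --group=append format (an empty line terminates every group).
summarize_groups() {
    awk '
        $0 != "" { n++; item = $0 }        # item in the group: collect it
        $0 == "" { print n, item; n = 0 }  # end of group: the only place groups are handled
    '
}

# Simulated "uniq --group=append" output for the input lines a, a, b, c, c, c.
printf 'a\na\n\nb\n\nc\nc\nc\n\n' | summarize_groups
# prints:
# 2 a
# 1 b
# 3 c
```

No END clause is needed here, because the last group is closed by its own marker.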
Similar reasoning applies to "both": it ensures that I can put any special
initialization code in the group-marker case, and don't need to duplicate it in a
separate 'BEGIN{}' clause (of course, this doesn't have to be awk; it can be
perl/python/ruby/whatever does the downstream processing).
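A rough sketch of the "both" case, under the assumption that the output has one empty line before the first group, one between consecutive groups, and one after the last group. collect_groups is a hypothetical helper name, and the printf input only simulates that format.

```shell
# Hypothetical helper: join the lines of each group, given input in the
# assumed --group=both format (a marker before, between, and after groups).
collect_groups() {
    awk '
        $0 == "" {                       # marker: close the previous group, open the next
            if (started) print "group:" items
            items = ""; started = 1      # per-group initialization lives here, not in BEGIN
            next
        }
        { items = items " " $0 }         # item in the group: collect it
    '
}

# Simulated "uniq --group=both" output for the input lines a, a, b.
printf '\na\na\n\nb\n\n' | collect_groups
# prints:
# group: a a
# group: b
```

All setup and teardown sits in the single marker rule, so neither BEGIN nor END is required.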
I realize it's not a "make-or-break" feature - but if we're trying to make text
processing easier, I believe "append/both" makes it even easier.
Supporting -u or -d with --group wouldn't be useful either really.
It's probably most consistent to just disallow those combinations.
Just to be clear on the reasoning: because with "-u" and "-d", each *line* is
implicitly a separate group, there's no apparent utility for an end-of-group marker.