From: Pádraig Brady
Subject: Re: Question about uniq's treatment of spaces-only lines
Date: Mon, 1 Aug 2022 21:54:51 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Thunderbird/98.0

On 31/07/2022 17:26, Sudarshan S Chawathe wrote:
> On 2022-07-30T13:25:34+0100 (Saturday), Pádraig Brady writes:
>> More succinctly:
>>
>>   $ printf '%s\n' first blah ' ' ' ' 'l ast' | uniq -f1
>>   first
>>   l ast
>>
>> I.e. skipping one field will compare all but the 'l ast' line as equal.
>> This is operating as per the POSIX standard which states:
>>
>>   "Ignore the first fields fields on each input line when doing
>>   comparisons, where fields is a positive decimal integer. A field is
>>   the maximal string matched by the basic regular expression:
>>
>>     [[:blank:]]*[^[:blank:]]*
>>
>>   If the fields option-argument specifies more fields than appear on an
>>   input line, a null string shall be used for comparison."
>
> Thank you for the clarification. For me, the key to resolving my
> earlier confusion was the realization that the blanks are included in
> the field as opposed to being interpreted as inter-field separators.
> This is obvious now based on what you quote above from the POSIX docs
> but escaped me earlier because I hadn't thought of checking those docs.
>
> The GNU info docs for uniq do not seem to describe what exactly a field
> is in this context. Perhaps it would be useful to include the above
> quote or an equivalent description (or pointer) there.

Yes, good point. It's quite confusing actually. Given:

  $ cat in.txt
  1 2
  2  2
  3   2
  4    2

One might think, given the current definition, that `uniq -f1`
would operate only on the '2's above. But the leading spaces
are part of the second field and so significant to the comparison.

  $ uniq -f1 in.txt
  1 2
  2  2
  3   2
  4    2

  $ tr -s ' ' <in.txt | uniq -f1
  1 2

This is quite awkward really in the presence of a variable number of blanks.

I'm applying the following to describe this operation:

 @opindex -f
 @opindex --skip-fields
 Skip @var{n} fields on each line before checking for uniqueness.  Use
-a null string for comparison if a line has fewer than @var{n} fields.  Fields
-are sequences of non-space non-tab characters that are separated from
-each other by at least one space or tab.
+a null string for comparison if a line has fewer than @var{n} fields.
+Fields are a sequence of blank characters followed by non-blank characters.
+Field numbers are one based, i.e., @option{-f 1} will skip the first
+field (which may optionally have leading blanks).

thanks,
Pádraig
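
For readers who want to experiment with the field rule discussed above, here is a minimal Python sketch of it. It is an illustration only, not how GNU uniq is implemented. The names `comparison_key` and `uniq_skip_fields` are hypothetical, `[ \t]` is assumed to stand in for `[[:blank:]]`, and the exact number of spaces in the sample `in.txt` lines is an assumption (only the fact that they differ matters).

```python
import re

# One field per the POSIX definition quoted above: optional leading
# blanks followed by a run of non-blanks. [ \t] approximates
# [[:blank:]] (assumption: no locale-specific blank characters).
FIELD = re.compile(r"[ \t]*[^ \t]*")


def comparison_key(line, nfields):
    """Return the part of `line` left after skipping `nfields` fields.

    If the line has fewer fields, the result is the empty string,
    matching the "null string" wording in the standard.
    """
    pos = 0
    for _ in range(nfields):
        # The pattern can match the empty string, so match() never fails.
        pos = FIELD.match(line, pos).end()
    return line[pos:]


def uniq_skip_fields(lines, nfields):
    """Keep the first line of each run of adjacent lines with equal keys."""
    out = []
    prev_key = object()  # sentinel that compares unequal to any string
    for line in lines:
        key = comparison_key(line, nfields)
        if key != prev_key:
            out.append(line)
        prev_key = key
    return out


if __name__ == "__main__":
    # Sample data with a varying number of spaces (assumed spacing).
    in_txt = ["1 2", "2  2", "3   2", "4    2"]
    print(uniq_skip_fields(in_txt, 1))

    # Rough equivalent of `tr -s ' '`: squeeze runs of spaces first.
    squeezed = [re.sub(r" +", " ", line) for line in in_txt]
    print(uniq_skip_fields(squeezed, 1))
```

The key for '2  2' is '  2' (two leading spaces), which differs from ' 2', so no lines are merged until the blanks are squeezed.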