|
From: | Erik Auerswald |
Subject: | Re: [PATCH] Fixed incomplete and incorrect treatment of comments and trailing whitespace |
Date: | Sat, 21 May 2022 15:12:25 +0200 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1 |
Hi Dima, On 20.05.22 18:16, Erik Auerswald wrote:
On Thu, May 19, 2022 at 11:47:44PM -0700, Dima Kogan wrote:Erik Auerswald <auerswal@unix-ag.uni-kl.de> writes:From a quick glance at the code diff in the link, this seems to allow comments inside a field, e.g., with datamash -H -C -t',' and the following input: # the next line is the header line one,two,three # the following line is the data line 1,2#this is the 2nd field,3 in the data line the string "#this is the 2nd field" would be skipped, and the data line would have three fields with values 1, 2, and 3. Is that correct (I did not test it)? Is that the intended functionality?I haven't tested this yet, but that wasn't my intent. Comments should do what they do in perl and python and awk and everywhere else, so the last line should be interpreted as "1,2". I may have made a mistake in the implementation.I have just tested it, this is not the way you're code behaves. Everything beginning with the comment sign is ignored, as intended. I misread the diff. Just to be clear: I also would expect this behavior, i.e., the comment sign and everything until the end of the line is ignored as a comment. I'd still prefer this change to comment handling to be off by default and only activated with a new option. E.g., as an option that modifies -C behavor by adding support for in-line comments, or as an option that ignores only in-line comments (I'd prefer the former). Modifying the -C behavior could automatically enable -C if it was not given explicitly, to avoid having to add two options for the intended behavior.
As an additional thought on comment handling: Currently, GNU datamash only knows of comment lines and can ignore those. Comments that start at the comment character and extend to the end of the line can be seen as a more general kind of comment, because a "comment line" is turned into an "empty line". But currently, GNU datamash does not ignore empty lines, neither by default nor via option: $ printf '1\n2\n' | datamash sum 1 3 $ printf '1\n\n2\n' | datamash sum 1 datamash: invalid input: field 1 requested, line 2 has only 0 fields This leads to the question: "what is an empty line?" I usually expect any "blank" line (i.e., matching ^[[:space:]]*$) as "empty." The combination of ignoring in-line comments with ignoring blank lines would ignore comment lines as well. As such a separate option to only activate support for in-line comments seems to be fine. Some more examples to illustrate this: $ printf '1\n ; comment line\n\n2 # in-line comment\n' \ > | ./datamash --no-strict reverse 1 ; comment line 2 # in-line comment $ $ printf '1\n ; comment line\n\n2 # in-line comment\n' \ > | ./datamash -C --no-strict reverse 1 2 # in-line comment $ $ printf '1\n ; comment line\n\n2 # in-line comment\n' \ > | sed 's/[[:space:]]*[#;].*$//;/^[[:space:]]*$/d' \ > | ./datamash --no-strict reverse 1 2 Then there is the additional question of where an in-line comment starts: at the comment character or at the first whitespace at a run of whitespace characters directly preceding the in-line comment? With the default of -t$'\t' or with -t' ', preceding whitespace probably should not be removed. With -W, preceding whitespace should probably be removed. If trailing whitespace is ignored with -W, this happens automatically. With a non-whitespace separator character, both removing and keeping whitespace preceding a comment could be seen as valid. But this all depends on the specifics of the data format and user expectations. When input data is given as CSV (as in "character separated values" where the character is neither a space nor a tab) format, an option to trim whitespace at the start and end of a field could be useful. I do think that adding convenience options to GNU datamash to ease use with human-edited input data (e.g., with additional whitespace and comments) is sensible. I do not like software that gratuitously changes how it interprets input data, especially if there is no option to revert to the former behavior. IMHO new interpretations should be activated using new options (-W vs. trailing whitespace is an edge case IMHO, changing the current behavior is probably a bug fix, at least when comparing GNU datamash with Awk or Bash). Kind regards, Erik
[Prev in Thread] | Current Thread | [Next in Thread] |