bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Fixed incomplete and incorrect treatment of comments and tra


From: Tim Rice
Subject: Re: [PATCH] Fixed incomplete and incorrect treatment of comments and trailing whitespace
Date: Sat, 28 May 2022 04:54:03 +0000

Hi Dima,

AWK doesn't have a concept of comments in the data it parses.

Yep, that's exactly the point.

Whether or not Awk has a concept of comments in the data it processes is 
irrelevant. We're relying on analogies between Datamash and Awk to think about 
correct behavior, and so the analogies need to match up in the right way.

Datamash accepting commented lines in data is not in the same category of 
functionality as Awk accepting comments in Awk scripts. Therefore, you can't 
rely on Awk's implementation of commenting to reason about this specific issue 
for Datamash.


I disagree, but I also don't particularly care what datamash does by
default. In --vnlog mode it should suppress trailing comments.

Then we needn't worry about it until we're sure the vnlog patch is ready for 
merging. I might have lost track of all developments in that thread, but iirc 
there was still more to be done.

Currently, I'm thinking that vnlog support might be a good candidate for v2.0. 
For v1.8, we should focus on bug fixes and general tidying up.

Remember, I am still new to maintaining a GNU project. For this release, I'd 
appreciate having a chance to try out the process without too many 
complications involved.


That is,

bar 5
bbb
The second line has trailing spaces. At the moment, Datamash handles this
in a way that is arguably correct:

```
$ ./datamash -W transpose < ~/tmp/testing.txt
bar     bbb
5
```

The patch adds a test to the test suite to flag the faulty behavior, and
to demonstrate that it has been fixed. See that for detail.

Yep, your test is what my example was based on. My point was that if you 
arbitrarily skip end-of-line whitespace, that will break other things that 
currently work.

And this whole discussion is obviated if you use sed to clean up eol 
whitespace. Datamash is intended to work in a pipeline with the rest of the GNU 
ecosystem. We needn't solve all problems related to cleaning messy data, so 
there is no need to break what currently works.


So, the whole point of datamash and vnlog and all the others is to be
convenient. I can do everything with AWK, or with C code, and nobody
NEEDS any of these tools. Handling obviously-quirky input in the obvious
way improves the convenience of datamash, and we should strive to do
that.

Data like in my previous example isn't "obviously quirky." This means that the 
'if' guard proposed in your GitHub code is really not the obvious way to handle this.

At the very least, you would need to introduce more book-keeping, to only 
discard the whitespace if the number of fields detected so far in the current 
record is not less than previously detected record lengths.

Which would introduce a lot more complexity. For a task that can already be 
handled quickly and simply with Sed.


Once you decide what you want the base code to be, I'll rebase my
patches on top, to do the extra stuff with --vnlog. Kinda busy right
now. Probably will get to it in a week or two.

Sure, that's okay. Like I said, vnlog support is a better match for what I have 
in mind for v2.0, and I'd prefer to roll v1.8 without it.

This gives us more time to think about the form vnlog support should take, and 
what accommodations will need to be made. Remember, for v2.0 we're thinking 
about making Datamash more modular and extensible, so vnlog might be able to 
take advantage of that work, too.

~ Tim



reply via email to

[Prev in Thread] Current Thread [Next in Thread]