bug-datamash
Re: [PATCH] vnlog support


From: Erik Auerswald
Subject: Re: [PATCH] vnlog support
Date: Sun, 15 May 2022 17:09:16 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1

Hello Dima,

On 14.05.22 22:18, Dima Kogan wrote:
> Since we're talking about working on this again, and making a new
> release, I'd like to ping this feature request. I exchanged a few emails
> about it with Assaf right before he disappeared, and it sounded like he
> was going to add this feature. I've no idea what, if anything, he wanted
> to change about the patch.
>
> vnlog support would make both projects much more useful. The original
> mailing list post (quoted in full below) contains a demo and a patch.
> The patch needs to be updated such that -v implies -W. If I can get an
> ACK from whoever is intending to take over datamash, I can re-test the
> patch, finalize things, add tests, and so on.

I am not a GNU datamash maintainer, but I'd like to provide some
high-level comments on the vnlog support patches:

1. While GNU datamash, when given the option -C, --skip-comments,
   recognizes lines where the first non-whitespace character is
   either '#' or ';' as comments, the vnlog format does not treat
   ';' as starting a comment.  Keeping ';' as a comment start in
   vnlog mode would thus create a new and slightly different
   variant of the vnlog format.  This could result in
   incompatibilities with existing data and tools.  Is this
   intended?
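
A minimal sketch of the difference between the two comment rules. The function names are hypothetical, and allowing leading whitespace before '#' in the vnlog case is an assumption carried over from the -C behaviour described above, not something taken from the vnlog specification:

```shell
# Hypothetical helpers that only illustrate the two comment rules.

# Rule described above for datamash -C: a line is a comment if its
# first non-whitespace character is '#' or ';'.
skip_comments_datamash() { grep -v -E '^[[:space:]]*[#;]'; }

# Rule assumed here for vnlog: only '#' starts a comment.
skip_comments_vnlog() { grep -v -E '^[[:space:]]*#'; }

# The row ";3 4" is data under the vnlog rule, but is dropped
# under the datamash -C rule.
printf '# x y\n1 2\n;3 4\n' | skip_comments_datamash
printf '# x y\n1 2\n;3 4\n' | skip_comments_vnlog
```

A data row whose first field starts with ';' is dropped by the first filter and kept by the second, which is the kind of incompatibility meant above.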

2. The patches do not add any special treatment of '-' to GNU
   datamash, but '-' does have a special meaning in vnlog.  I
   would expect a vnlog mode in GNU datamash to support the
   following use case:

    $ cat vnlog.example
    # v1 v2 v3
    1 2 3
    4 - 6
    - 8 9
    $ # GNU datamash does not interpret '-'
    $ ./datamash -C -W sum 1-3 < vnlog.example
    ./datamash: invalid numeric value in line 2 field 2: '-'
    $ # tr can be used for this example, but not in general
    $ tr -- - 0 < vnlog.example | ./datamash -C -W sum 1-3
    5       10      18

   Then again, empty fields do not work with "sum" in GNU datamash
   today either, unless the affected columns are left out:

    $ cat missing_value
    1       2       3
    4               6
            8       9
    $ ./datamash sum 1-3 < missing_value
    ./datamash: invalid numeric value in line 2 field 2: ''
    $ ./datamash sum 3 < missing_value
    18
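
The tr workaround above rewrites every '-' character, so it would also mangle negative numbers such as -2. A field-aware filter avoids that. The following is only a sketch with a hypothetical function name, and substituting 0 is only reasonable for "sum"; it is not a general treatment of missing values:

```shell
# Hypothetical helper: rewrite fields that are exactly "-" to "0",
# leaving negative numbers such as "-2" untouched.
dash_to_zero() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "-") $i = "0"
        print
    }'
}

printf '1 -2 3\n4 - 6\n- 8 9\n' | dash_to_zero
# 1 -2 3
# 4 0 6
# 0 8 9
```

The output could then be fed to "datamash -C -W sum 1-3" as in the example above. Other operations, e.g. a mean, would be skewed by the inserted zeros, which is why real support for '-' would have to live inside datamash.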

3. The patches seem to create a vnlog mode where both input and
   output are in vnlog format.  Could it be useful to be able to
   specify vnlog format separately for input and output?

4. If one were to create vnlog output from character-separated
   input data via GNU datamash, empty fields would need to be
   replaced with '-'.  GNU datamash has some support for missing
   values via the --no-strict and --filler=X options, but this
   does not seem to replace empty fields with the specified
   filler.  Missing fields seem to be replaced only sometimes,
   e.g., with the "transpose" operation, but not with the
   "reverse" operation.  Would it be useful to add optionally
   generating '-' fields?
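
To make the intended result concrete, here is a sketch of such a conversion done outside of datamash. The function name is hypothetical, the number of columns is assumed to be given by the first input line, and field values are assumed to contain no whitespace:

```shell
# Hypothetical helper: turn tab-separated input into space-separated
# rows in which empty and missing fields are written as "-".
fill_empty_fields() {
    awk -F '\t' -v OFS=' ' '
        NR == 1 { ncols = NF }          # first line defines the column count
        {
            n = (NF > ncols) ? NF : ncols
            for (i = 1; i <= n; i++)
                if ($i == "") $i = "-"  # empty field, or field past end of row
            $1 = $1                     # force re-joining the row with OFS
            print
        }'
}

printf '1\t2\t3\n4\t\t6\n\t8\t9\n7\t8\n' | fill_empty_fields
# 1 2 3
# 4 - 6
# - 8 9
# 7 8 -
```

The last input row is one field short and is padded in the same way as the rows with empty fields.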

5. Would it make sense to add the functionality required for
   vnlog format support via separate options?  There could be a
   --vnlog option that sets all those correctly and then adds
   the vnlog specific prologue handling.

   Perhaps the functionality could be added using variables that
   could be controlled via options, without adding all those
   controlling options immediately.

   - There is already a -W, --whitespace option.
   - There is already an --output-delimiter option.
   - There is already a -C, --skip-comments option.

   - There could be a new option to specify the comment
     character.
   - There could be a new option to treat some value, e.g., the
     filler value, as representing an empty field.
   - There could be a new option to replace empty and missing
     fields in the output with the filler value.
   - There could be a new option to add a prefix to the output
     header line.
   - There could be a new option to read the input header line
     from a vnlog prologue.
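
As an illustration of the last item, the prologue handling can be sketched as a pre-filter. The function name is hypothetical, and the rule used here is an assumption about vnlog: the first line starting with '#' but not with '##' or '#!' is taken as the header, and all other '#' lines are treated as comments:

```shell
# Hypothetical helper: convert a vnlog stream into plain
# whitespace-separated data with an ordinary header line.
vnlog_to_headed() {
    awk '
        /^#[#!]/ { next }               # "##" and "#!" lines: always comments
        /^#/ {
            if (!seen) {                # first remaining "#" line: the header
                seen = 1
                sub(/^#[ \t]*/, "")     # drop the "# " prefix
                print
            }
            next                        # later "#" lines: comments
        }
        { print }                       # data rows pass through unchanged
    '
}

printf '## made by some tool\n# v1 v2 v3\n1 2 3\n# a note\n4 5 6\n' | vnlog_to_headed
# v1 v2 v3
# 1 2 3
# 4 5 6
```

The result has a normal first-line header, so the remaining steps would map onto the existing -W and header options, plus the proposed new option for prefixing the output header line.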

I have trimmed the patches from my email, since I did not directly
comment on the code details.  Here are mailing list archive URLs
for easy reference:

- Original posting of vnlog support patches:
  https://lists.gnu.org/archive/html/bug-datamash/2020-04/msg00006.html

- Current re-posting of vnlog support patches:
  https://lists.gnu.org/archive/html/bug-datamash/2022-05/msg00015.html

The above comments are only questions and suggestions, of course.

Best regards,
Erik


