bug-datamash

Re: [PATCH] vnlog support


From: Erik Auerswald
Subject: Re: [PATCH] vnlog support
Date: Sat, 21 May 2022 20:34:25 +0200

Hello Dima,

On 14.05.22 22:18, Dima Kogan wrote:
[...]
vnlog support would make both projects much more useful.

I agree that adding support for the vnlog data format, in addition
to the current more traditional format used by 'cut' and 'paste',
where each TAB separates two fields and there are neither headers
nor comments, could make GNU datamash applicable for additional use
cases.

[...]
Dima Kogan <datamash@dima.secretsauce.net> writes:

I maintain vnlog, a toolkit for manipulating tabular ascii data:

   https://github.com/dkogan/vnlog

The cmdline tools are largely thin frontends around awk and GNU
coreutils. The capabilities are complementary with datamash, and it'd be
nice if datamash supported vnlog's data format. It already does 99% of
it, and I'm attaching a prototype patch (to the 1.4 stable release) that
adds the rest. The vnlog format:

- A whitespace-separated table of text

- Lines beginning with # are comments

It seems as if vnlog supports in-line comments as well, i.e.,
a comment can be appended to a data line.  It might make sense
to add that to the format description.

- The first line that begins with a single # (not ## or #!) is a legend, naming each column

- Empty fields reported as -

As you can see, this is very close to what datamash does already.

While there are similarities, there are also incompatibilities.
Thus I'd say datamash provides about 80% of what is needed to
support vnlog.
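
To make the format rules quoted above concrete, here is a minimal sketch that labels each input line as comment, legend, or data. The function name vnlog_classify is made up for illustration, and treating blank lines as ignorable is my assumption, not something stated in the format description.

```shell
# Hypothetical helper: label every input line according to the vnlog
# rules quoted above. Not part of vnlog or GNU datamash.
vnlog_classify() {
    awk '
    /^[ \t]*$/     { print "blank";   next }   # assumption: blank lines carry no data
    /^[ \t]*#[#!]/ { print "comment"; next }   # ## and #! never start a legend
    /^[ \t]*#/ && !have_legend {               # first single-# line names the columns
        have_legend = 1
        print "legend"
        next
    }
    /^[ \t]*#/     { print "comment"; next }   # any later single-# line is a comment
                   { print "data" }            # may still end in an in-line comment
    '
}
```

For example, piping the lines '## c', '# x y', '1 2 # note' and '# later' through vnlog_classify labels them comment, legend, data and comment, in that order.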

Currently existing:

- GNU datamash supports a header line via options
- GNU datamash supports consecutive whitespace as field delimiters
  via option
- GNU datamash supports skipping comment lines via option
- GNU datamash output lines are often valid vnlog data lines

Currently missing:

- GNU datamash does not support changing the comment start
  character sequence as required for vnlog
- GNU datamash does not support using a comment line as an
  input header line
- GNU datamash does not output a comment line as a header line
- GNU datamash accepts either '#' or ';' as starting a comment
  line, but vnlog does not recognize ';' as a comment character
- GNU datamash does not support in-line comments
- GNU datamash does not replace empty output fields with '-'

Comment lines are not passed through, but skipped (ignored) by
GNU datamash.

To use GNU datamash with vnlog data, I'd probably start with
using Awk to transform input data into TSV format without comments,
and GNU datamash TSV output into vnlog.

Minimally tested vnlog to TSV converter:

awk '# Copyright (C) 2022 Erik Auerswald
     # Copying and distribution of this file, with or without
     # modification, are permitted in any medium without royalty
     # provided the copyright notice and this notice are preserved.
     # This file is offered as-is, without any warranty.
    /^[[:space:]]*#[#!]/ { next }
    /^[[:space:]]*#/ && have_legend { next }
    /^[[:space:]]*#/ && !have_legend {
        have_legend = 1
        sub(/^[[:space:]]*#[[:space:]]*/, "")
    }
    {
        sub(/[[:space:]]*#.*$/, "")
        gsub(/[[:space:]]+/, "\t")
        print
    }'
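
A possible variant of this converter, wrapped in a shell function, is sketched below. It additionally trims leading and trailing whitespace, so that indented data lines do not produce an empty first field, and it skips lines that end up empty. The function name vnlog_to_tsv is hypothetical, and skipping blank lines is an assumption on my part. It uses [ \t] instead of [[:space:]] only because some older awk implementations lack POSIX character classes.

```shell
# Hypothetical variant of the vnlog to TSV converter shown above.
vnlog_to_tsv() {
    awk '
    /^[ \t]*#[#!]/ { next }             # ## and #! lines are always comments
    /^[ \t]*#/ {
        if (have_legend) next           # later single-# lines are comments
        have_legend = 1
        sub(/^[ \t]*#[ \t]*/, "")       # turn the legend into a header row
    }
    {
        sub(/[ \t]*#.*$/, "")           # drop an in-line comment
        sub(/^[ \t]+/, "")              # trim leading whitespace
        sub(/[ \t]+$/, "")              # trim trailing whitespace
        if ($0 == "") next              # assumption: skip lines with nothing left
        gsub(/[ \t]+/, "\t")            # exactly one TAB between fields
        print
    }'
}
```

With this, a data line such as '  1   1 # first' becomes '1', a TAB, and '1', without a leading empty field.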

Minimally tested GNU datamash TSV output to vnlog converter:

awk '# Copyright (C) 2022 Erik Auerswald
     # Copying and distribution of this file, with or without
     # modification, are permitted in any medium without royalty
     # provided the copyright notice and this notice are preserved.
     # This file is offered as-is, without any warranty.
    { sub(/^\t/, "-\t") }
    NR == 1 { sub(/^/, "# ") }
    {
        while ($0 ~ /\t\t/) {
            sub(/\t\t/, "\t-\t")
        }
        gsub(/\t/, " ")
        print
    }'
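
An alternative sketch of the same idea splits each line on TAB and replaces every empty field, which also covers an empty last field (a trailing TAB). The function name tsv_to_vnlog is hypothetical. Like the converter above, it assumes that no field contains whitespace, since vnlog would read that as a field separator.

```shell
# Hypothetical field-wise variant of the TSV to vnlog converter shown above.
tsv_to_vnlog() {
    awk '
    BEGIN { FS = "\t"; OFS = " " }
    {
        for (i = 1; i <= NF; i++) {
            if ($i == "") $i = "-"      # vnlog writes empty fields as "-"
        }
        $1 = $1                         # force the record to be rebuilt with OFS
        prefix = (NR == 1) ? "# " : ""  # the first line becomes the legend
        print prefix $0
    }'
}
```

A header line with fields 'x', empty, 'z' comes out as '# x - z', and a data line with fields empty, '2', empty comes out as '- 2 -'.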

The above converters are intended to be used with 'datamash -H'.

[...] Trivial demo:

   $ (echo '## comment'; echo '# x y'; seq 5 | awk '{print $1, $1*$1}') | ./datamash -v sum y mean x
   # sum(y) mean(x)
   55 3

This works with the above converters and 'datamash -H sum y mean x'.
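
The two conversion steps can be combined into one wrapper that runs an arbitrary TSV filter in the middle. The sketch below uses slightly hardened variants of the converters above. The name vnlog_filter is made up, and the wrapper assumes that the filter reads TSV with a header line on stdin and writes TSV with a header line on stdout, as 'datamash -H' does.

```shell
# Hypothetical wrapper: vnlog in, TSV through the given command, vnlog out.
# Not part of vnlog or GNU datamash.
vnlog_filter() {
    awk '
    /^[ \t]*#[#!]/ { next }             # ## and #! lines are always comments
    /^[ \t]*#/ {
        if (have_legend) next           # later single-# lines are comments
        have_legend = 1
        sub(/^[ \t]*#[ \t]*/, "")       # legend becomes the TSV header row
    }
    {
        sub(/[ \t]*#.*$/, "")           # drop an in-line comment
        sub(/^[ \t]+/, "")
        sub(/[ \t]+$/, "")
        if ($0 == "") next              # assumption: skip lines with nothing left
        gsub(/[ \t]+/, "\t")
        print
    }' | "$@" | awk '
    BEGIN { FS = "\t"; OFS = " " }
    {
        for (i = 1; i <= NF; i++) {
            if ($i == "") $i = "-"      # vnlog writes empty fields as "-"
        }
        $1 = $1                         # rebuild the record with OFS
        prefix = (NR == 1) ? "# " : ""  # first output line becomes the legend
        print prefix $0
    }'
}
```

The demo would then be run as 'vnlog_filter datamash -H sum y mean x' on the same input, which should give the two output lines quoted above. Using 'cat' as the filter simply normalizes the vnlog input.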

I do think that adding a --vnlog option to GNU datamash to activate
a vnlog mode could be useful.

Initially, I thought this would comprise just a few changes
that might be useful as individual options, but I am not so
sure anymore.  I'd say that vnlog is a different data format
that is incompatible with the current GNU datamash data formats.

IMHO adding vnlog format support in GNU datamash should not
affect the current data formats supported by GNU datamash,
unless it is activated via option --vnlog.

[...]

Thanks,
Erik


