bug-gawk

Re: CSV extension status


From: Andrew J. Schorr
Subject: Re: CSV extension status
Date: Tue, 18 May 2021 08:56:52 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

On Mon, May 17, 2021 at 11:44:56PM +0200, Manuel Collado wrote:
> A record is parsed when read from an input file. And also after
> assigning $0 = "new value". The API allows a custom input parser to do
> the first, but not the second.
> 
> For instance, a standard way of prepending a field to the current
> record would be:
> 
>     $0 = "new field" OFS $0
> 
> For CSV fields and records this construction only works if FPAT and
> OFS have the appropriate values. But the API doesn't allow the
> extension to silently assign values to the predefined variables.

Ah, OK, this is because the API sym_update function refuses to allow
extensions to set predefined variables listed in main.c:varinit.

> And things are even worse if the record syntax can not be parsed
> with the supported FS/FPAT/FIELDWIDTHS modes.

OK.

> A naive approach would be to let the API offer a hook that allows a
> custom input parser to fully override the internal gawk record
> parser. But this possibility requires careful consideration.
> 
> Hope this clarifies things. I'm ready to further explain my goals, if
> you like.

I think I understand the conceptual problem, but I feel as if maybe we're
letting the perfect be the enemy of the good. In 99.9% of the cases where I use
CSV files, I simply want to have read-only access to the fields. Actually, if
I'm being honest, it's 100%. In other words, I want to be able to say something
like:

gawk -lcsv '
NR == 1 {
        for (i = 1; i <= NF; i++)
                m[$i] = i
        next
}

$m["age"] > 30 {
        sum += $m["weight"]
        n++
}

END {
        printf "found %d people over 30 with an average weight of %.3f\n",
               n, (n? sum/n : 0)
}'

Can this be done without a library? I thought that the possibility of embedded
newlines meant that we needed a library for this rather than a simple FPAT
solution. Maybe I'm confused.

Perhaps I simply haven't dug deep enough into the wonders of CSV format, but if
we could somehow have a csv library or include file that enabled CSV parsing to
work transparently in the read-only case, I think that would be a big win. If
we in addition need to have an insanely complicated gawk library on top of that
to enable reparsing and reconstruction and writing of records, that's fine, but
I suspect that just being able to parse correctly on a read-only basis
(including stripping encapsulating quotes from field values) would be a very
useful tool for lots of people in many situations. Is that doable with an FPAT
solution or a parser library?

Regards,
Andy


