Re: CSV extension status

From: Andrew J. Schorr
Subject: Re: CSV extension status
Date: Tue, 18 May 2021 11:33:22 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, May 18, 2021 at 04:41:24PM +0200, Manuel Collado wrote:
> Do you mean without an API-based extension? Yes.

OK. It makes no difference to me whether the invocation is "gawk -lcsv"
or "gawk -icsv".

> >I thought that the possibility of embedded
> >newlines meant that we needed a library for this rather than a simple FPAT
> >solution. Maybe I'm confused.
> A pure gawk library is enough to effectively process CSV data. By
> using my CSVMODE library from http://mcollado.z15.es/xgawk/ your
> example can be coded almost verbatim:
> gawk -i csvmode-1 '
> NR==1 {next}
> csvfield("age") > 30 {
>       sum += csvfield("weight")
>       n++
> }
> END {
>       printf "found %d people over 30 with an average weight of %.3f\n",
>              n, (n? sum/n : 0)
> }'

I downloaded the tgz and zip files, but they seem to be missing the main code,
or I am losing my mind:

bash-4.2$ zipinfo csvmode-0.2b.zip | fgrep csvmode.awk
bash-4.2$ tar tvf csvmode-0.2b.tgz | fgrep csvmode.awk

The contents of the packaged files don't seem to match the files in your


The tarball seems to contain the test cases, but not the actual code:

bash-4.2$ ls csvmode
AUTHORS  INSTALL  NEWS  README  doc  test  usecases

> And this code works with fields quoted, unquoted or with embedded
> newlines. This is why I'm unsure if an API-based gawk-csv extension
> is really needed.
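
To illustrate the embedded-newline concern in isolation: awk splits the
input into records at newlines before any field splitting happens, so a
field pattern alone never sees a quoted field that spans two lines. A
minimal sketch with made-up sample data (the names and values are
placeholders, not taken from csvmode), which should behave the same in
any POSIX awk:

```shell
# Two logical CSV rows; the second has a newline inside a quoted field.
# The sample data is hypothetical.
out=$(printf 'name,note\n"Ann","line1\nline2"\n' | awk -F, '{ print NR ": " NF }')
printf '%s\n' "$out"
```

awk reports three records rather than two, so a CSV reader has to
reassemble records (or replace the input parser) before splitting fields.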

OK, I took a look at the code, and I get the idea. My first thought is
to wonder whether you should use a namespace to scope the proliferation
of variables and functions. There are a ton of hidden variables and
functions that should probably be isolated.

> How about also hosting pure gawk libraries, like CSVMODE, in the
> gawkextlib site? Arnold suggested this some time ago.

I'd be more than happy to host it there. The more the merrier.

That being said, this solution strikes me as much more complicated
and likely to be much slower than an input parser implemented in C.
For those who want simple, read-only access to CSV documents, my gut instinct
is that an input parser library would be a better and more robust solution.
In particular, the splitting and reconstruction of the record with OFS
seems a bit slow and fragile to me. I really just want to be able to
say gawk -lcsv and not have to worry about configuring all of the
CSV* variables correctly.

