bug-gawk

Re: CSV extension status


From: Ed Morton
Subject: Re: CSV extension status
Date: Mon, 17 May 2021 09:49:10 -0500

I took a look at the CSVMODE library and it seems like it'd work fine. I do find the variables used for input/output field separators very confusing though:

 * CSVCOMMA: The special character that delimits the fields. By default
   a comma (,).
 * CSVSEP:   Input field delimiter, default comma (,)
 * CSVOSEP:  Output field delimiter, default CSVSEP
 * CSVOFS:   Field separator for CSVMODE=-1, default SUBSEP

It's not obvious what the difference is between CSVCOMMA and CSVSEP, nor why neither of them is named `CSVFS`, since presumably one of them (CSVCOMMA?) is equivalent to FS, just as CSVOFS is presumably equivalent to OFS. I really don't like the name CSVCOMMA at all, though, since setting a variable named "comma" to some character other than a comma is very unintuitive. You could have named it `CSVCHARACTER` or something if CSVFS isn't applicable and it's somehow different from CSVSEP.

It's also not clear what the difference is between `CSVOSEP` and `CSVOFS`.

Actually, why do you need specific variables for those at all? Why not just use `FS` and `OFS`? I think the intent is to allow some input data to be CSV and other data not. I don't think that's something that needs to be supported, but I don't really care if it is, and if doing so means we need unique CSV-specific FS/OFS variables then that's fine.

In my many years of manipulating CSVs with awk (I have to do this almost daily) I've only ever needed 4 things:

1) FS = the input field separator char (usually ,)
2) OFS = the output field separator string (usually FS)
3) QUOTE = the char used to quote a field (usually ")
4) ESCAPE = the char used to escape a quote within a field (usually doubled QUOTE but sometimes \ before QUOTE)

Now that I think about it, I don't see where you allow specification of an escaped quote within a field. It looks like you only allow doubled QUOTE, which is fine - it's easy to work around if you actually need \ QUOTE.
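To make the four items above concrete, here is a minimal sketch of a field splitter in plain awk that takes the separator and quote characters as parameters and treats a doubled quote as an escaped quote. The names `csv_fields`, `csv_split`, `SEP` and `QUOTE` are made up for illustration and are not part of the CSVMODE library or gawk-csv; quoted fields that span several lines are not handled.

```shell
#!/bin/sh
# Illustrative sketch only: csv_fields, csv_split, SEP and QUOTE are
# hypothetical names, not variables of the CSVMODE library or gawk-csv.
# Usage: csv_fields [separator-char] [quote-char] < file.csv
csv_fields() {
    awk -v SEP="$1" -v QUOTE="$2" '
    # Split one record into fields[], honouring quoted fields and
    # doubled quotes. Returns the number of fields.
    function csv_split(line, fields, sepc, q,    n, i, c, inq, cur) {
        split("", fields)                   # clear fields from the previous record
        n = 0; cur = ""; inq = 0
        for (i = 1; i <= length(line); i++) {
            c = substr(line, i, 1)
            if (inq) {
                if (c != q)
                    cur = cur c             # ordinary character inside quotes
                else if (substr(line, i + 1, 1) == q) {
                    cur = cur q; i++        # doubled quote: one literal quote
                } else
                    inq = 0                 # closing quote
            } else if (c == q)
                inq = 1                     # opening quote
            else if (c == sepc) {
                fields[++n] = cur; cur = "" # separator ends the current field
            } else
                cur = cur c
        }
        fields[++n] = cur                   # last field has no trailing separator
        return n
    }
    BEGIN { if (SEP == "") SEP = ","; if (QUOTE == "") QUOTE = "\"" }
    {
        n = csv_split($0, f, SEP, QUOTE)
        for (i = 1; i <= n; i++) printf "%d:[%s]\n", i, f[i]
    }
    '
}

# Demo: prints 1:[a], 2:[b,c], 3:[say "hi"] and 4:[d], one per line.
printf '%s\n' 'a,"b,c","say ""hi""",d' | csv_fields
```

Supporting a \ QUOTE escape instead would only need one more branch in the quoted state, which is why the doubled-quote-only behaviour is easy to work around.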

As for whether to produce quoted fields or not, IMHO you should have a variable named `CSVTRIMQUOTES` just like you have `CSVTRIM` for spaces (but then name that `CSVTRIMSPACES`) or expand the possible values of `CSVTRIM` to include removing leading/trailing quotes in addition to/instead of spaces.

I think it's as valid to include this in gawkextlib as anything else and I'd be far more likely to use it than something that required me to build a local version of gawk to do so.

Regards,

    Ed.

On 5/17/2021 8:44 AM, Manuel Collado wrote:
On 16/05/2021 at 22:21, Ed Morton wrote:
The gawkextlib CSV extension has been listed on
http://gawkextlib.sourceforge.net/ for a while now (a year or more?) but
noted as "(not yet released)". Is there anything specific holding up its
release and/or a planned date when it will be released?

Thanks for your interest.

The gawk-csv extension pushes the gawk API for input parsers to its limits, and uncovers some nasty limitations. An input parser can deliver a record composed of fields, but there is no way to control further reparsing of the record after assigning new values to $0..$NF. To do so, it is necessary to have a companion pure (g)awk library that temporarily overrides FS, OFS, etc. In addition, the generation of CSV output is easier to code in AWK than in C.

The fact is that the combination of FPAT and BEGINFILE/ENDFILE is powerful enough to implement effective CSV processing with just a pure gawk library. Please look at the CSVMODE library available at:

    http://mcollado.z15.es/xgawk/

This library extends the functionality of the gawk-csv extension with the ability to operate with either CSV fragments or clean text field values.
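As a rough illustration of the FPAT approach mentioned above (and not of how the CSVMODE library itself is written), the following sketch splits CSV records by content with gawk. The pattern follows the style of the example in the gawk manual; the wrapper name `fpat_fields` is made up for illustration, and gawk must be installed because FPAT is a gawk extension.

```shell
#!/bin/sh
# Illustrative sketch only: fpat_fields is a hypothetical name and this
# is not the CSVMODE library code. Requires gawk (FPAT is gawk-specific).
fpat_fields() {
    gawk '
    BEGIN {
        # A field is either a run of non-commas (possibly empty) or a
        # double-quoted string in which "" stands for one literal quote.
        FPAT = "([^,]*)|(\"([^\"]|\"\")+\")"
    }
    {
        for (i = 1; i <= NF; i++) {
            v = $i
            if (v ~ /^".*"$/) {                  # field arrived with its quotes
                v = substr(v, 2, length(v) - 2)  # strip the enclosing quotes
                gsub(/""/, "\"", v)              # undo doubled quotes
            }
            printf "%d:[%s]\n", i, v
        }
    }
    '
}

# Demo, skipped when gawk is not installed.
if command -v gawk >/dev/null 2>&1; then
    printf '%s\n' 'a,"b,c",,"say ""hi"""' | fpat_fields
fi
```

With FPAT the fields arrive still quoted, so producing clean text values is a separate step, as the loop shows. A quoted field containing a newline would still need extra handling at the record level.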

The only advantage of using the API for parsing CSV fields is the ability to provide meaningful error messages and to recover from malformed CSV data.

In order to make a decision about maintenance of the gawk-csv extension, feedback from potential users would be very welcome. For instance:

- Would it be desirable to also host pure gawk modules on the gawkextlib site?

- In order to unify the user interface of gawk-csv and the CSVMODE library, which control variable value is more natural for selecting clean text field values?
   CSVMODE = 1 as in gawk-csv, or
   CSVMODE = -1 as in the CSVMODE library ?

Any opinions or suggestions about this or similar extensions would be much appreciated.

Regards.



