Re: CSV extension status

From: Ed Morton
Subject: Re: CSV extension status
Date: Tue, 25 May 2021 12:53:35 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.2

Manuel - thanks for replying; my responses are inline below.

On 5/25/2021 3:09 AM, Manuel Collado wrote:
On 25/05/2021 at 4:14, Ed Morton wrote:
I see the conversation has continued at bug-gawk and Arnold had
suggested spinning it off into an email chain which, if it happened, I'm
not on. I see a lot of complexity being discussed in the thread that
just doesn't seem to be necessary. Is there any reason why the simple
"buildRec()" function I posted at
https://stackoverflow.com/a/45420607/1745001 (and which could be written
more concisely if I used gawk extensions) isn't all we'd need to parse
CSVs? No modes, no extra/ambiguous terminology - just reading a CSV into
fields by calling 1 function each time a record is read.

The goal is to allow beginner awk users to process CSV data as if they were regular awk records. No need to tamper with predefined variables like FS, OFS, NR, FPAT etc. Just put -i csvmode in the command line or add @include "csvmode" to the script.
Setting the field separators using FS and OFS is IMHO desired behavior, not tampering, as that'd be the intuitive way to specify which character separates the fields of the CSV. There's no need to do anything with FPAT, and I agree it'd be good to have NR work as it does today (my buildRec() function would undesirably have record numbers get out of sync with `NR` for CSV fields that contain newlines). I'm not advocating for using my script as the solution to CSV parsing, by the way, just using it as an example of something simple that gets the main job of separating a CSV into fields done without any need for configuration variables and explanation.
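
To make "something simple" concrete, here is a minimal sketch of that kind of splitter in standard awk. It is not the buildRec() function from the Stack Overflow answer; the names parse_csv and split_csv are hypothetical, and it assumes FS is a single character, the quote is " and an embedded quote is written as "". Because it pulls continuation lines in with getline, NR counts physical lines rather than CSV records, which is the out-of-sync behavior described above.

```shell
# parse_csv: hypothetical wrapper that prints the fields of each CSV record.
parse_csv() {
    awk -F, '
    # split_csv: split rec on FS, honouring "..." quoting; returns field count.
    function split_csv(rec, flds,    n, i, c, inq, cur) {
        n = 0; cur = ""; inq = 0
        for (i = 1; i <= length(rec); i++) {
            c = substr(rec, i, 1)
            if (inq) {
                if (c == "\"") {
                    # "" inside quotes is a literal quote, else the field closes
                    if (substr(rec, i + 1, 1) == "\"") { cur = cur "\""; i++ }
                    else inq = 0
                } else cur = cur c
            } else if (c == "\"") inq = 1
            else if (c == FS) { flds[++n] = cur; cur = "" }
            else cur = cur c
        }
        flds[++n] = cur
        return n
    }
    {
        rec = $0
        # An odd number of quotes means a quoted field continues on the next line.
        while ((gsub(/"/, "&", rec) % 2) && (getline line) > 0)
            rec = rec "\n" line
        n = split_csv(rec, f)
        for (i = 1; i <= n; i++)
            printf "%s$%d=<%s>", (i > 1 ? " " : ""), i, f[i]
        print ""
    }' "$@"
}

printf '%s\n' 'a,"b, ""x""",c' | parse_csv    # $1=<a> $2=<b, "x"> $3=<c>
```

This version strips the enclosing quotes as it goes; keeping them and leaving the stripping to the caller would be an equally valid choice.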

By using the CSVMODE library your example becomes:

$ cat decsv2.awk
{
    printf "Record %d:\n", NR
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}

$ gawk -icsvmode-1 -f decsv2 file.csv
Record 1:
    $1=<rec1, fld1>
Record 2:
    $1=<rec2, fld1.1

    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $4=<rec2 fld4>

Please note that the modified decsv2.awk script is not CSV specific. It can be used unmodified to process regular awk records.
I don't think there's a use case for a single script that has to process both CSVs and non-CSVs, but if there were, it's easily done by separating the input/output field identification from the control logic, so a CSV library for awk doesn't need to provide this. To me, a script that can handle both CSV and non-CSV data is in the same ballpark as a script that can handle both FS-separated and FPAT-matched data: it's almost never going to be wanted, but if it is, there are simple ways to write it using existing constructs.

Of course, different users have different needs and tastes. This is why the library in question attempts to satisfy as many users as possible by offering a rich set of configuration options.
IMHO a rich set of configuration options isn't necessary and just makes the usage much more complicated than it should be. All you need is:

FS = a character (usually , or tab or ;)
OFS = a character (usually same as FS)

If you WANTED to allow other quotes than " and methods of escaping them other than as "" then you could also have:

CSVQUOTE = a character (usually ", rarely ')
CSVESCAPE = a character that appears before a CSVQUOTE within a field to escape it (usually ", rarely \).

but that's it. Anything else the user needed to do (strip quotes from input fields, add quotes to output fields, replace newlines in fields with spaces, etc., etc.) is all easy for them to do in their code.
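
As a rough sketch of how little user code those follow-up tasks need, the two awk functions below strip quotes from an input field and add them back for output. They assume the usual values (quote character " and an embedded quote escaped as ""); the names csv_field, unquote and quote are hypothetical and not part of any library.

```shell
# csv_field unquote|quote: hypothetical demo wrapper, one field per input line.
csv_field() {
    awk -v mode="$1" -v OFS=, '
    # Remove enclosing quotes and turn each "" back into a single quote.
    function unquote(s) {
        if (length(s) >= 2 && substr(s, 1, 1) == "\"" &&
            substr(s, length(s), 1) == "\"") {
            s = substr(s, 2, length(s) - 2)
            gsub(/""/, "\"", s)
        }
        return s
    }
    # Add quotes only when the field contains OFS, a quote or a newline.
    function quote(s) {
        if (index(s, OFS) || index(s, "\"") || index(s, "\n")) {
            gsub(/"/, "\"\"", s)
            s = "\"" s "\""
        }
        return s
    }
    { print (mode == "quote" ? quote($0) : unquote($0)) }'
}

printf '%s\n' '"say ""hi"", Bob"' | csv_field unquote   # say "hi", Bob
printf '%s\n' 'say "hi", Bob' | csv_field quote         # "say ""hi"", Bob"
```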

What's more, the library lets you modify fields and records the usual way. For instance, to add a new field "val" at position "pos":

  if (pos>NF) {$pos = val} else {$pos = val OFS $pos}; $0 = $0;
Good, that's as it should be.
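
For illustration, the insertion statement quoted above can be tried on plain comma-separated data (no quoted fields) with standard awk alone. The wrapper name insert_field is hypothetical, and FS and OFS are both assumed to be a comma.

```shell
# insert_field POS VAL: hypothetical wrapper around the quoted statement.
insert_field() {
    awk -F, -v OFS=, -v pos="$1" -v val="$2" '{
        # Past the last field: assign directly (awk pads with empty fields).
        # Otherwise: prepend val and OFS to the field currently at pos.
        if (pos>NF) {$pos = val} else {$pos = val OFS $pos}
        $0 = $0    # re-split so NF and the field numbers reflect the insert
        print NF ": " $0
    }'
}

echo 'a,b,c' | insert_field 2 X    # 4: a,X,b,c
echo 'a,b,c' | insert_field 5 X    # 5: a,b,c,,X
```

With the default whitespace FS the re-split would collapse the empty padding field, so the second case is only shown with an explicit single-character separator.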

And this code works transparently for both CSV data and regular text data.
Again, I just don't see why that's useful and if it adds even the tiniest bit of complexity, or additional configuration, or performance overhead then IMHO it shouldn't be done.


