Re: CSV extension status

From: Ed Morton
Subject: Re: CSV extension status
Date: Mon, 24 May 2021 21:14:16 -0500

I see the conversation has continued at bug-gawk, and Arnold suggested spinning it off into an email chain which, if that happened, I'm not on. A lot of the complexity being discussed in the thread just doesn't seem necessary to me. Is there any reason why the simple "buildRec()" function I posted at https://stackoverflow.com/a/45420607/1745001 (which could be written more concisely if I used gawk extensions) isn't all we'd need to parse CSVs? No modes, no extra or ambiguous terminology, just reading a CSV into fields by calling one function each time a record is read.

By the way, for this:
> Clean text -> "Hello!", she said
> CSV fragment -> """Hello!"", she said"
> Can you suggest better terms? I've not found technical terms for
> these values in the CSV bibliography.
I'd like to suggest you simply call it "unquoted" and "quoted" text.
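
To make the two forms concrete, here is a minimal sketch in Python (not awk, and not code from the CSVMODE library or from this thread). The function names to_quoted and to_unquoted are hypothetical, and the sketch assumes the doubled-quote convention shown in the example above.

```python
def to_quoted(text, quote='"'):
    """Turn unquoted (clean) text into a quoted CSV field.

    Hypothetical helper: always wraps the value and doubles embedded quotes.
    """
    return quote + text.replace(quote, quote * 2) + quote


def to_unquoted(fragment, quote='"'):
    """Turn a quoted CSV field back into unquoted text.

    A field that is not wrapped in quotes is returned unchanged.
    """
    if len(fragment) >= 2 and fragment.startswith(quote) and fragment.endswith(quote):
        # Drop the outer quotes, then collapse each doubled quote to one.
        return fragment[1:-1].replace(quote * 2, quote)
    return fragment
```

With these definitions, the unquoted text "Hello!", she said and the quoted text """Hello!"", she said" convert into each other.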


On 5/17/2021 2:59 PM, Ed Morton wrote:
Thanks for the answers. To be honest, the different types of input/output field separators with the different modes seem kinda complicated and unnecessary to me. My go-to for parsing CSVs in general is just a function that builds a record from the input as each line is read, using FS and OFS as normal, and then I use that constructed record as normal, e.g. see "buildRec()" at https://stackoverflow.com/a/45420607/1745001. It could easily be tweaked to specify the quote char, and to have options to strip leading/trailing quotes and/or spaces if useful, but it's trivial for a user to write that trimming code as a loop on the fields so I didn't bother.
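
The general approach of building a record as lines are read can be sketched as follows. This is a Python illustration under stated assumptions, not the buildRec() function from the linked answer: the names split_record and read_records are hypothetical, quotes inside fields are assumed to be doubled, and a record is assumed to be complete once it contains an even number of quote characters.

```python
def split_record(record, fs=",", quote='"'):
    """Split one complete CSV record into unquoted field values."""
    fields, cur, in_quotes, i = [], [], False, 0
    while i < len(record):
        ch = record[i]
        if in_quotes:
            if ch == quote and record[i + 1:i + 2] == quote:
                cur.append(quote)      # doubled quote -> one literal quote
                i += 1                 # skip the second quote of the pair
            elif ch == quote:
                in_quotes = False      # closing quote
            else:
                cur.append(ch)         # separators and newlines are data here
        elif ch == quote:
            in_quotes = True           # opening quote
        elif ch == fs:
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
        i += 1
    fields.append("".join(cur))
    return fields


def read_records(lines, fs=",", quote='"'):
    """Yield one field list per CSV record, joining physical lines as needed."""
    parts = []
    for line in lines:
        parts.append(line.rstrip("\r\n"))
        candidate = "\n".join(parts)
        # An odd quote count means a quoted field is still open, so keep reading.
        if candidate.count(quote) % 2 == 0:
            yield split_record(candidate, fs, quote)
            parts = []
    if parts:
        # Unterminated quoted field at end of input: split what was collected.
        yield split_record("\n".join(parts), fs, quote)
```

Calling read_records on an open file yields one list of fields per record, including records whose quoted fields span several lines. The quote character and separator are plain parameters, which mirrors the "easily tweaked" point above.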


On 5/17/2021 2:37 PM, Manuel Collado wrote:
Thanks for your comments.

On 17/05/2021 at 16:49, Ed Morton wrote:
I took a look at the CSVMODE library and it seems like it'd work fine. I
do find the variables used for input/output field separators very
confusing though:

  * CSVCOMMA: The special character that delimits the fields. By default
    a comma (,).

A documentation erratum. gawk-csv used CSVCOMMA. In the CSVMODE library the predefined variable names have changed. It should be CSVSEP, of course.

  * CSVSEP    Input field delimiter, default comma (,)
  * CSVOSEP   Output field delimiter, default CSVSEP

Roles similar to FS and OFS.

  * CSVOFS    Field separator for CSVMODE=-1, default SUBSEP

This is a different beast. CSVMODE=-1 delivers clean text field values, and forgets the original CSV notation. The parsed input record is composed of the clean text field values separated by CSVOFS. The value of CSVOFS must be a character never used in the data to be processed.
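
The idea of a record made of clean text values joined by a separator that never occurs in the data can be illustrated with a short Python sketch. It is not the CSVMODE implementation: the function name to_clean_record is hypothetical, parsing is delegated to Python's standard csv module, and the separator defaults to "\x1c", which is the default value of awk's SUBSEP.

```python
import csv

SUBSEP = "\x1c"  # default value of awk's SUBSEP (octal 034)


def to_clean_record(csv_line, sep=",", out_sep=SUBSEP):
    """Parse one CSV line and join its clean text values with out_sep.

    Hypothetical helper. Raises ValueError if out_sep occurs in the data,
    because the joined record could then no longer be split unambiguously.
    """
    fields = next(csv.reader([csv_line], delimiter=sep))
    for value in fields:
        if out_sep in value:
            raise ValueError("output separator occurs in field data")
    return out_sep.join(fields)
```

Splitting the result on the separator recovers the clean values even when a field contains a comma, which is why the separator must never appear in the data.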

It's not obvious what the difference is between CSVCOMMA and CSVSEP, or why neither of them (CSVCOMMA?) is named `CSVFS` since presumably one is
equivalent to FS like CSVOFS is presumably equivalent to OFS. I really
don't like the name CSVCOMMA at all, though, since setting a variable
named "comma" to be some other character than a comma is very
unintuitive. You could have named it `CSVCHARACTER` or something if
CSVFS isn't applicable and it's somehow different from CSVSEP.

Agreed. See above.

It's also not clear what the difference is between `CSVOSEP` and `CSVOFS`.

CSVOSEP is for CSVMODE=1 (CSV fragments). CSVOFS is for CSVMODE=-1 (clean text values).

Actually, why do you need specific variables for those at all, why not
just use `FS` and `OFS`? I think the intent is to allow some input data
to be CSV and other data not - I don't think that's something that needs
to be supported but I don't really care if it is and if doing so means
we need unique CSV-specific FS/OFS variables then it's fine.

The idea is to temporarily override FS, OFS and RS (as well as FPAT and FIELDWIDTHS) while processing CSV records, and automatically restore the previous input parsing mode at the end of the CSV file.

In my many years of manipulating CSVs with awk (I have to do this almost
daily) I've only ever needed 4 things:

1) FS = the input field separator char (usually ,)
2) OFS = the output field separator string (usually FS)
3) QUOTE = the char used to quote a field (usually ")
4) ESCAPE = the char used to escape a quote within a field (usually
doubled QUOTE but sometimes \ before QUOTE)

Now I think about that, I don't see where you allow specification of an
escaped quote within a field, looks like you only allow doubled QUOTE
which is fine - easy to work around if you actually need \ QUOTE.

Backslash escapes are not supported. Embedded quotes must be doubled.
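
One possible workaround for input that uses backslash escapes is to convert it to doubled quotes before parsing. The Python sketch below is an assumption about how such a preprocessing step might look, not something provided by the library. The function name backslash_to_doubled is hypothetical, and the sketch assumes that a doubled backslash stands for one literal backslash, which may not hold for every CSV dialect.

```python
def backslash_to_doubled(line, quote='"', escape="\\"):
    """Rewrite backslash-escaped quotes as doubled quotes.

    Hypothetical preprocessing step. Assumes escape+quote means a literal
    quote and escape+escape means a literal escape character.
    """
    out, i = [], 0
    while i < len(line):
        ch = line[i]
        nxt = line[i + 1:i + 2]
        if ch == escape and nxt == quote:
            out.append(quote * 2)   # \" becomes ""
            i += 2
        elif ch == escape and nxt == escape:
            out.append(escape)      # \\ becomes a single backslash
            i += 2
        else:
            out.append(ch)          # everything else is copied unchanged
            i += 1
    return "".join(out)
```

The converted line can then be handed to any parser that only understands doubled quotes.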

As for whether to produce quoted fields or not, IMHO you should have a
variable named `CSVTRIMQUOTES` just like you have `CSVTRIM` for spaces
(but then name that `CSVTRIMSPACES`) or expand the possible values of
`CSVTRIM` to include removing leading/trailing quotes in addition
to/instead of spaces.

Printing CSV data must be done by explicitly invoking the provided CSV printing functions.  By default printed CSV fields are quoted only if needed. A CSVQUOTEALL control variable could eventually be implemented to force quoting all fields.
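
The "quote only if needed" behaviour can be sketched in Python as follows. This is an illustration, not the library's printing functions: the name format_record is hypothetical, and the quote_all parameter stands in for the CSVQUOTEALL variable that is only mentioned above as a possibility. The sketch assumes a field needs quoting when it contains the separator, the quote character, or a line break.

```python
def format_record(fields, sep=",", quote='"', quote_all=False):
    """Join clean text values into one CSV record.

    Hypothetical helper: a field is quoted when it contains the separator,
    the quote character, or a line break, or when quote_all is set.
    """
    out = []
    for value in fields:
        needs_quotes = any(c in value for c in (sep, quote, "\n", "\r"))
        if quote_all or needs_quotes:
            # Double embedded quotes, then wrap the whole value.
            value = quote + value.replace(quote, quote * 2) + quote
        out.append(value)
    return sep.join(out)
```

Fields without special characters are written as-is unless quote_all is set, so simple data round-trips without gaining quotes.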

I think it's as valid to include this in gawkextlib as anything else and
I'd be far more likely to use it than something that required me to
build a local version of gawk to do so.

Yes. But this policy has to be considered by the gawkextlib administrators.


Thanks again.
