Thanks for your comments.
On 17/05/2021 at 16:49, Ed Morton wrote:
I took a look at the CSVMODE library and it seems like it'd work fine. I
do find the variables used for input/output field separators very
* CSVCOMMA: The special character that delimits the fields. By default
a comma (,).
A documentation erratum. gawk-csv used CSVCOMMA; in the CSVMODE library
the predefined variable names have changed. It should be CSVSEP, of course.
* CSVSEP Input field delimiter, default comma (,)
* CSVOSEP Output field delimiter, default CSVSEP
Roles similar to FS and OFS.
* CSVOFS Field separator for CSVMODE=-1, default SUBSEP
This is a different beast. CSVMODE=-1 delivers clean text field values,
and forgets the original CSV notation. The parsed input record is
composed of the clean text field values separated by CSVOFS. The value
of CSVOFS must be a character never used in the data to be processed.
It's not obvious what the difference is between CSVCOMMA and CSVSEP, nor
why neither of them (CSVCOMMA?) is named `CSVFS` since presumably one is
equivalent to FS like CSVOFS is presumably equivalent to OFS. I really
don't like the name CSVCOMMA at all, though, since setting a variable
named "comma" to be some other character than a comma is very
unintuitive. You could have named it `CSVCHARACTER` or something if
CSVFS isn't applicable and it's somehow different from CSVSEP.
Agreed. See above.
It's also not clear what the difference is between `CSVOSEP` and
CSVOSEP is for CSVMODE=1 (CSV fragments). CSVOFS is for CSVMODE=-1
(clean text values).
Actually, why do you need specific variables for those at all, why not
just use `FS` and `OFS`? I think the intent is to allow some input data
to be CSV and other data not - I don't think that's something that needs
to be supported but I don't really care if it is and if doing so means
we need unique CSV-specific FS/OFS variables then it's fine.
The idea is to temporarily override FS, OFS and RS (as well as FPAT
and FIELDWIDTHS) while processing CSV records, and automatically
restore the previous input parsing mode at the end of the CSV file.
In my many years of manipulating CSVs with awk (I have to do this almost
daily) I've only ever needed 4 things:
1) FS = the input field separator char (usually ,)
2) OFS = the output field separator string (usually FS)
3) QUOTE = the char used to quote a field (usually ")
4) ESCAPE = the char used to escape a quote within a field (usually
doubled QUOTE but sometimes \ before QUOTE)
Now I think about that, I don't see where you allow specification of an
escaped quote within a field, looks like you only allow doubled QUOTE
which is fine - easy to work around if you actually need \ QUOTE.
Backslash escapes are not supported. Embedded quotes must be doubled.
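
To make the doubled-quote rule concrete, here is a minimal sketch in plain POSIX awk, wrapped in a shell function. It is not the CSVMODE code: the names csv_fields and csv_split are hypothetical, the separator is hard-wired to a comma, and fields containing embedded newlines are not handled.

```sh
# Hypothetical helper (not part of CSVMODE): print each field of every
# input CSV line in [brackets], after removing the CSV quoting.
csv_fields() {
    awk '
    # Split s on sep into out[1..n] and return n. Inside a quoted field
    # sep is ordinary data, and a doubled quote stands for one quote.
    function csv_split(s, out, sep,    n, i, c, inq, cur, len) {
        n = 0; cur = ""; inq = 0; len = length(s)
        for (i = 1; i <= len; i++) {
            c = substr(s, i, 1)
            if (inq) {
                if (c == "\"") {
                    # Doubled quote: keep one and skip the second.
                    if (substr(s, i + 1, 1) == "\"") { cur = cur "\""; i++ }
                    else inq = 0          # closing quote
                } else cur = cur c
            } else if (c == "\"") inq = 1 # opening quote
            else if (c == sep) { out[++n] = cur; cur = "" }
            else cur = cur c
        }
        out[++n] = cur                    # last field
        return n
    }
    {
        n = csv_split($0, f, ",")
        for (i = 1; i <= n; i++) printf "[%s]", f[i]
        print ""
    }
    '
}
```

For example, feeding it the line a,"b ""x"", c",d should print [a][b "x", c][d]. A backslash before a quote gets no special treatment here, which matches the doubled-quote-only convention described above.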
As for whether to produce quoted fields or not, IMHO you should have a
variable named `CSVTRIMQUOTES` just like you have `CSVTRIM` for spaces
(but then name that `CSVTRIMSPACES`) or expand the possible values of
`CSVTRIM` to include removing leading/trailing quotes in addition
to/instead of spaces.
Printing CSV data must be done by explicitly invoking the provided CSV
printing functions. By default printed CSV fields are quoted only if
needed. A CSVQUOTEALL control variable could eventually be implemented
to force quoting all fields.
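
As an illustration of "quoted only if needed", the following is a rough sketch in POSIX awk, again wrapped in a shell function. The names csv_join and csv_quote are hypothetical and are not the printing functions provided by the library; the input is assumed to be tab-separated clean text, and the output separator defaults to a comma.

```sh
# Hypothetical helper (not the CSVMODE API): turn tab-separated clean
# text into CSV, quoting a field only when it needs it.
csv_join() {
    awk -v sep="${1:-,}" '
    # Quote s only if it contains the separator, a quote or a newline;
    # embedded quotes are doubled.
    function csv_quote(s, sep) {
        if (index(s, sep) || index(s, "\"") || index(s, "\n")) {
            gsub(/"/, "\"\"", s)
            return "\"" s "\""
        }
        return s
    }
    BEGIN { FS = "\t" }
    {
        line = ""
        for (i = 1; i <= NF; i++)
            line = line (i > 1 ? sep : "") csv_quote($i, sep)
        print line
    }
    '
}
```

With the three tab-separated values a, then b, c, then say "hi", this should produce a,"b, c","say ""hi""". A quote-everything switch such as the CSVQUOTEALL idea mentioned above would amount to skipping the index() tests.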
I think it's as valid to include this in gawkextlib as anything else and
I'd be far more likely to use it than something that required me to
build a local version of gawk to do so.
Yes. But this policy has to be considered by the gawkextlib