Re: CSV extension status
From: Ed Morton
Subject: Re: CSV extension status
Date: Mon, 17 May 2021 09:49:10 -0500
I took a look at the CSVMODE library and it seems like it'd work fine. I
do find the variables used for input/output field separators very
confusing though:
* CSVCOMMA: The special character that delimits the fields. By default
a comma (,).
* CSVSEP Input field delimiter, default comma (,)
* CSVOSEP Output field delimiter, default CSVSEP
* CSVOFS Field separator for CSVMODE=-1, default SUBSEP
It's not obvious what the difference is between CSVCOMMA and CSVSEP, nor
why neither of them (CSVCOMMA?) is named `CSVFS` since presumably one is
equivalent to FS like CSVOFS is presumably equivalent to OFS. I really
don't like the name CSVCOMMA at all, though, since setting a variable
named "comma" to be some other character than a comma is very
unintuitive. You could have named it `CSVCHARACTER` or something if
CSVFS isn't applicable and it's somehow different from CSVSEP.
It's also not clear what the difference is between `CSVOSEP` and `CSVOFS`.
Actually, why do you need specific variables for those at all, why not
just use `FS` and `OFS`? I think the intent is to allow some input data
to be CSV and other data not - I don't think that's something that needs
to be supported but I don't really care if it is and if doing so means
we need unique CSV-specific FS/OFS variables then it's fine.
In my many years of manipulating CSVs with awk (I have to do this almost
daily) I've only ever needed 4 things:
1) FS = the input field separator char (usually ,)
2) OFS = the output field separator string (usually FS)
3) QUOTE = the char used to quote a field (usually ")
4) ESCAPE = the char used to escape a quote within a field (usually
doubled QUOTE but sometimes \ before QUOTE)
Now that I think about it, I don't see where you allow specification of
an escaped quote within a field. It looks like you only allow doubled
QUOTE, which is fine - it's easy to work around if you actually need
\ QUOTE.
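To make the four items above concrete, here is a minimal sketch in POSIX awk, wrapped in a shell function. The names `csv_fields` and `csvsplit` are hypothetical and are not part of CSVMODE or gawk-csv. It assumes one record per line (no newlines inside quoted fields), a single-character separator and quote, and doubled QUOTE as the only escape.

```sh
# Hypothetical helper: print every CSV field of every input line on its own line.
# $1 = field separator char (default ,), $2 = quote char (default ").
csv_fields() {
    awk -v fs="${1-}" -v quote="${2-}" '
    BEGIN {
        if (fs == "")    fs = ","
        if (quote == "") quote = "\""
    }
    # Split rec into flds[1..n] and return n. Quoted fields keep their quotes.
    function csvsplit(rec, flds,    n, i, c, inq, cur, len) {
        n = 0; cur = ""; inq = 0; len = length(rec)
        for (i = 1; i <= len; i++) {
            c = substr(rec, i, 1)
            if (c == quote) {
                # A doubled quote toggles the state twice, so it leaves
                # the "inside quotes" state unchanged.
                inq = !inq
                cur = cur c
            } else if (c == fs && !inq) {
                flds[++n] = cur
                cur = ""
            } else {
                cur = cur c
            }
        }
        flds[++n] = cur
        return n
    }
    {
        n = csvsplit($0, f)
        for (i = 1; i <= n; i++) print f[i]
    }
    '
}
```

For example, `printf '%s\n' 'a,"b,c",d' | csv_fields` prints `a`, `"b,c"` and `d` on separate lines. Supporting \ QUOTE as the escape would need one more branch in the loop.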
As for whether to produce quoted fields or not, IMHO you should have a
variable named `CSVTRIMQUOTES` just like you have `CSVTRIM` for spaces
(but then name that `CSVTRIMSPACES`) or expand the possible values of
`CSVTRIM` to include removing leading/trailing quotes in addition
to/instead of spaces.
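What a setting like the proposed `CSVTRIMQUOTES` would have to do to each field can be sketched as below. This is only an illustration of the idea, not the library's behaviour; `csv_unquote` and `unquote` are hypothetical names, and the quote character is assumed to be " with doubled quotes as the escape.

```sh
# Hypothetical helper: treat each input line as one CSV field and print
# its clean text value.
csv_unquote() {
    awk '
    # Strip one pair of enclosing quotes, then turn "" back into ".
    function unquote(fld,    q, len) {
        q = "\""
        len = length(fld)
        if (len >= 2 && substr(fld, 1, 1) == q && substr(fld, len, 1) == q) {
            fld = substr(fld, 2, len - 2)
            gsub(q q, q, fld)
        }
        return fld
    }
    { print unquote($0) }
    '
}
```

Unquoted fields pass through unchanged, so the function is safe to apply to every field.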
I think it's as valid to include this in gawkextlib as anything else and
I'd be far more likely to use it than something that required me to
build a local version of gawk to do so.
Regards,
Ed.
On 5/17/2021 8:44 AM, Manuel Collado wrote:
On 16/05/2021 at 22:21, Ed Morton wrote:
The gawkextlib CSV extension has been listed on
http://gawkextlib.sourceforge.net/ for a while now (a year or more?) but
noted as "(not yet released)". Is there anything specific holding up its
release and/or a planned date when it will be released?
Thanks for your interest.
The gawk-csv extension pushes the gawk API for input parsers to its
limits, and uncovers some nasty limitations. An input parser can
deliver a record composed of fields, but there is no way to control
further reparsing of the record after assigning new values to $0..$NF.
To do so it is necessary to have a companion pure (g)awk library that
temporarily overrides FS, OFS, etc. In addition, the generation of CSV
output is easier to code in AWK than in C.
The fact is that the combination of FPAT and BEGINFILE/ENDFILE has
enough power to implement an effective CSV processing with just a pure
gawk library. Please look at the CSVMODE library available at:
http://mcollado.z15.es/xgawk/
This library extends the functionality of the gawk-csv extension with
the ability to operate with either CSV fragments or clean text field
values.
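As a rough illustration of the FPAT plus BEGINFILE combination mentioned above (not the CSVMODE library's actual code), the sketch below switches field splitting per input file. It requires gawk. The function name `csv_or_text_fields` and the rule "files ending in .csv are CSV" are assumptions made for the example. The FPAT value is the simple one from the gawk manual, which does not handle empty fields or doubled quotes inside a field.

```sh
# Hypothetical helper: print "index: value" for every field of every file,
# parsing *.csv files by content (FPAT) and everything else with default FS.
csv_or_text_fields() {
    gawk '
    BEGINFILE {
        if (FILENAME ~ /\.csv$/)
            FPAT = "([^,]+)|(\"[^\"]+\")"   # a field is unquoted text or a quoted string
        else
            FS = " "                        # assigning FS switches back to FS splitting
    }
    { for (i = 1; i <= NF; i++) print i ": " $i }
    ' "$@"
}
```

Fields come back as CSV fragments, that is, quoted fields keep their quotes.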
The only advantage of using the API for parsing CSV fields is the
ability to provide meaningful error messages and to recover from
malformed CSV data.
In order to make a decision about maintenance of the gawk-csv
extension, feedback from potential users would be very welcome. For
instance:
- Would it be desirable to also host pure gawk modules on the
gawkextlib site?
- In order to unify the user interface of gawk-csv and the CSVMODE
library, which control variable value is more natural to select clean
text field values?
CSVMODE = 1 as in gawk-csv, or
CSVMODE = -1 as in the CSVMODE library ?
Any opinions or suggestions about this or similar extensions would be
much appreciated.
Regards.