Re: [bug-gawk] Parsing standard CVS data by gawk
From: Jarno Suni
Subject: Re: [bug-gawk] Parsing standard CVS data by gawk
Date: Tue, 1 Sep 2015 10:30:26 +0300
On Wed, 8 Jul 2015 00:17:17 +0300
Jarno Suni <address@hidden> wrote:
> The current manual says:
> "NOTE: Some programs export CSV data that contains embedded newlines
> between the double quotes. gawk provides no way to deal with this.
> Even though a formal specification for CSV data exists, there isn’t
> much more to be done; the FPAT mechanism provides an elegant solution
> for the majority of cases, and the gawk developers are satisfied with
> that."
> https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
>
> I think this is a bit misleading, since standard CSV data can be
> parsed by gawk. The following script reads all CSV data into a
> two-dimensional array, which the END section of the gawk program
> uses to display the fields together with their array indexes:
>
> dos2unix | gawk '
> function strip_quoted_field(s)
> {
>     # drop the enclosing quotes, then unescape doubled quotes
>     s = substr(s, 2, length(s) - 2)
>     gsub(/""/, "\"", s)
>     return s
> }
> BEGIN {
>     RS = "" # read the whole input file as one record
>     FS = "" # I guess this setting reduces internal splitting work
>     record = 0
> }
> {
>     nof = patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
>     field = 0
>     for (i = 1; i <= nof; i++) {
>         field++
>         if (substr(a[i], 1, 1) == "\"")
>             f[record][field] = strip_quoted_field(a[i])
>         else
>             f[record][field] = a[i]
>         # any separator other than a comma ends the current record
>         if (seps[i] != ",") { field = 0; record++ }
>         delete a[i]
>     }
> }
> END {
>     field = length(f[0])
>     for (i = 0; i < record; i++)
>         for (j = 1; j <= field; j++)
>             printf "%d %d :%s\r\n", i, j, f[i][j]
> }'
>
> The dos2unix utility converts standard DOS-style line breaks
> (CRLF, i.e. "\r\n") and possible UTF-16 encoding (with byte order mark)
> to "\n" and UTF-8 (without byte order mark), respectively. In my
> experience, the script also works with plain Unix-style UTF-8 input
> on Linux.
>
> For some use cases it is not necessary to hold all the data in memory
> at one time, and this might not be the optimal implementation for such
> cases. I also wrote an implementation that reads the input line by line.
>
> Regards,
>
Oh, I think it is better to use RS = "\0", so that blank lines do not act
as record separators.
Don't you think the manual should be fixed?
--
Jarno Ilari Suni - http://www.iki.fi/8/