
Re: [bug-gawk] Parsing standard CSV data by gawk


From: Jarno Suni
Subject: Re: [bug-gawk] Parsing standard CSV data by gawk
Date: Tue, 1 Sep 2015 10:30:26 +0300

On Wed, 8 Jul 2015 00:17:17 +0300
Jarno Suni <address@hidden> wrote:

> The current manual says:
> "NOTE: Some programs export CSV data that contains embedded newlines
> between the double quotes. gawk provides no way to deal with this.
> Even though a formal specification for CSV data exists, there isn’t
> much more to be done; the FPAT mechanism provides an elegant solution
> for the majority of cases, and the gawk developers are satisfied with
> that."
> https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
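
(For reference, the FPAT mechanism mentioned there looks roughly like
this for the common case; just a sketch that copes with commas inside
quoted fields, but not with embedded newlines or empty fields:

echo 'one,"two, with a comma",three' | gawk '
BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
{ for (i = 1; i <= NF; i++) print i ": " $i }'

which should print the three fields one per line, with the quotes of the
second field left in place.)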
> 
> I think this is a bit misleading, since standard CSV data can be
> parsed by gawk. The following script reads all the CSV data into a
> two-dimensional array that is used in the END section of the gawk
> program to display the fields together with their array indexes:
> 
> dos2unix | gawk '
> function strip_quoted_field(s)
> {
>       s = substr(s, 2, length(s) - 2)
>       gsub(/""/, "\"", s)
>       return s
> }
> BEGIN{
>       # paragraph mode: the whole input is read as one record,
>       # as long as it contains no blank lines
>       RS = ""
>       FS = "" # I guess this setting reduces internal splitting work
>       record = 0;
> }
> {
>       # a field is either an unquoted run without , " or newline, or a
>       # double-quoted string in which a literal quote is doubled ("")
>       nof = patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
>       field = 0;
>       for (i = 1; i <= nof; i++) {
>               field++         
>               if (substr(a[i], 1, 1) == "\"")
>                       f[record][field] = strip_quoted_field(a[i])
>               else
>                       f[record][field] = a[i]
>               if (seps[i] != ",") { field=0; record++ }
>               delete a[i]     
>       }
> }
> END{
>       field = length(f[0]) # assume every record has as many fields as the first
>       for (i = 0; i < record; i++) 
>               for (j = 1; j <= field; j++)
>                       printf "%d %d :%s\r\n", i, j, f[i][j]
> }'
> 
> The dos2unix utility is used to convert standard DOS-style line breaks
> (CRLF, i.e. "\r\n") and possible UTF-16 encoding (with byte order mark)
> to "\n" and UTF-8 (without byte order mark), respectively. The script
> also works with plain Unix-style UTF-8 input on Linux, in my experience.
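
For example, feeding the pipeline a small test file such as

  name,comment
  gawk,"line one
  line two"

should print something like

  0 1 :name
  0 2 :comment
  1 1 :gawk
  1 2 :line one
  line two

i.e. the surrounding quotes are stripped and the embedded newline is
kept inside the second field of the second record.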
> 
> For some use cases it is not necessary to have all the data in memory
> at one time. This might not be an optimal implementation for such a
> case; I also wrote an implementation that reads input line by line.
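
One way to do that (only a sketch, not necessarily the same as that
implementation) is to buffer physical lines until the buffered text
contains an even number of double quotes, i.e. no quoted field is left
open, and only then split the completed logical record with the same
pattern:

dos2unix | gawk '
function strip_quoted_field(s)
{
        s = substr(s, 2, length(s) - 2)
        gsub(/""/, "\"", s)
        return s
}
{
        # append this physical line to the pending logical record
        buf = (buf == "") ? $0 : buf "\n" $0
        # an odd number of " characters means a quoted field is still open
        if (gsub(/"/, "&", buf) % 2 != 0)
                next
        n = patsplit(buf, a, /([^,"\n]*)|("(("")*[^"]*)*")/)
        for (i = 1; i <= n; i++) {
                val = (substr(a[i], 1, 1) == "\"") ? strip_quoted_field(a[i]) : a[i]
                printf "%d %d :%s\r\n", row, i, val
        }
        row++
        buf = ""
}'

This assumes well-formed input (balanced quotes); a record whose final
quoted field is never closed would be silently dropped at end of input.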
> 
> Regards,
> 

Oh, I think it is better to use RS="\0" so that blank lines are not
treated as record separators.
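
With that change the BEGIN block of the script above would become
something like:

BEGIN{
        RS = "\0" # NUL does not occur in text input, so the whole file is one record
        FS = ""
        record = 0
}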

Don't you think the manual should be fixed?

-- 
Jarno Ilari Suni - http://www.iki.fi/8/


