bug-gawk

FPAT documentation. The CSV example.


From: Manuel Collado
Subject: FPAT documentation. The CSV example.
Date: Sun, 12 Apr 2020 12:28:19 +0200
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

The FPAT and FIELDWIDTHS documentation in the gawk-5 manual
has been greatly enhanced w.r.t. gawk-4. But a small
inaccuracy still remains in the example about CSV
processing. It says:

[... each field is either "anything that is not a comma," or
"a double quote, anything that is not a double quote, and a
closing double quote." ...]

And the first proposed FPAT is /([^,]+)|("[^"]+")/, later
amended as /([^,]*)|("[^"]+")/ to accept empty fields.

But in addition to commas, a CSV field can also contain
quotes, which have to be escaped by doubling them. The
proposed regexps fail to accept quoted fields with both
commas and quotes inside. Perhaps the simplest FPAT
expression that recognizes this kind of field is
/([^,]*)|("([^"]|"")+")/. The following code tests these
variants, selected with -v fpat=N (default 0).

$ cat sample.csv
p,"q,r",s
p,"q""r",s
p,"q,""r",s
p,"",s
p,,s

$ cat fpat.awk
BEGIN {
    fp[0] = "([^,]+)|(\"[^\"]+\")"
    fp[1] = "([^,]*)|(\"[^\"]+\")"
    fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
    FPAT =  fp[fpat+0]
}

{
    print "<" $0 ">"
    printf("NF = %s ", NF)
    for (i = 1; i <= NF; i++) {
        printf("<%s>", $i)
    }
    print ""
}

$ gawk -f fpat.awk sample.csv
<p,"q,r",s>
NF = 3 <p><"q,r"><s>
<p,"q""r",s>
NF = 3 <p><"q""r"><s>
<p,"q,""r",s>
NF = 4 <p><"q,"><"r"><s>
<p,"",s>
NF = 3 <p><""><s>
<p,,s>
NF = 2 <p><s>

$ gawk -v fpat=1 -f fpat.awk sample.csv
<p,"q,r",s>
NF = 3 <p><"q,r"><s>
<p,"q""r",s>
NF = 3 <p><"q""r"><s>
<p,"q,""r",s>
NF = 4 <p><"q,"><"r"><s>
<p,"",s>
NF = 3 <p><""><s>
<p,,s>
NF = 3 <p><><s>

$ gawk -v fpat=2 -f fpat.awk sample.csv
<p,"q,r",s>
NF = 3 <p><"q,r"><s>
<p,"q""r",s>
NF = 3 <p><"q""r"><s>
<p,"q,""r",s>
NF = 3 <p><"q,""r"><s>
<p,"",s>
NF = 3 <p><""><s>
<p,,s>
NF = 3 <p><><s>

Besides that, it is often said that awk is not the right
tool to process CSV data. This is not true for recent gawk
versions. The FPAT and BEGINFILE/ENDFILE features provide
enough power to process CSV data effectively. I'm
polishing a gawk source library that mimics the gawkextlib
csv extension. Hopefully, it can be made publicly available
in the near future.

Regards.
--
Manuel Collado - http://mcollado.z15.es