Date: Fri, 20 Mar 2015 18:49:07 +0000 (UTC)
From: Ed Morton <address@hidden>
To: address@hidden
Subject: [bug-gawk] example tweak in documentations
The FPAT example used in:
http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content
is, I'm sure, used as the starting point for many people working on CSV
files. It doesn't support empty fields, however, and with a small tweak
it could. For example:
$ cat file
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
Smith,John,"314 Pi Ave, IL",HisTown,HisState,,USA
Notice that in the 2nd line the ZIP code (6th field) is empty. Here's
what the FPAT value from the documentation does with that:
$ cat tst1.awk
BEGIN {
    # A field is a run of non-commas or a double-quoted string
    FPAT = "([^,]+)|(\"[^\"]+\")"
}
{
    print "\nNF =", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}
$ awk -f tst1.awk file
NF = 7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>
NF = 6
$1 = <Smith>
$2 = <John>
$3 = <"314 Pi Ave, IL">
$4 = <HisTown>
$5 = <HisState>
$6 = <USA>
i.e. it discards the empty field completely, so NF is 6 and USA ends up
in $6 instead of $7. Now if we tweak the FPAT to just use
`*` instead of `+` as the repetition metacharacter:
$ cat tst2.awk
BEGIN {
    # Same as tst1.awk, but `*` lets a field match the empty string
    FPAT = "([^,]*)|(\"[^\"]*\")"
}
{
    print "\nNF =", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}
$ awk -f tst2.awk file
NF = 7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>
NF = 7
$1 = <Smith>
$2 = <John>
$3 = <"314 Pi Ave, IL">
$4 = <HisTown>
$5 = <HisState>
$6 = <>
$7 = <USA>
it handles the empty field correctly: $6 is empty and NF is 7 on both
lines. I know this is just an FPAT example and as such doesn't need to
be perfect or handle all cases, but given that it is probably being
copy/pasted into a lot of scripts and the fix is a trivial tweak, I
think it might be worth doing.
Ed.