[bug-gawk] 4.7 Defining Fields by Content


From: Marco Coletti
Subject: [bug-gawk] 4.7 Defining Fields by Content
Date: Mon, 14 Mar 2016 09:40:14 +0100

I see that the manual authors propose this regexp pattern to parse a typical CSV line of data:
FPAT = "([^,]*)|(\"[^\"]+\")"
This falls just short of what is needed to parse RFC 4180 formatted data correctly, because it does not account for double quotes appearing inside a quoted field. The RFC states:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
Amending the pattern is quite easy, and I believe it would be worth giving the full regexp in the manual, since this format is a de facto standard. For example, it is the format Excel uses when exporting to a CSV file.

FPAT = "([^,\"]*|(\"([^\"]|\"\")*\"))"

This even allows empty fields of the form "", which are legal according to RFC 4180. You can stumble into them when the CSV generator wraps a column in quotes just in case, because it only knows that the column has a string datatype and does not examine the actual content.

