bug-gawk

Re: Quotes being stripped by "--csv"


From: Ed Morton
Subject: Re: Quotes being stripped by "--csv"
Date: Sun, 26 Nov 2023 07:24:22 -0600
User-agent: Mozilla Thunderbird

Ben - I do appreciate your feedback and perspective, thanks. See below for my response inline.

On 11/25/2023 9:08 PM, Ben Hoyt wrote:
Hi Ed,

It's likely this discussion is moot, given that Arnold said he's not planning to change Gawk further. However, a few additional thoughts.

> My post is not about input mode vs output mode, it's entirely about input mode -
> a way to leave the quotes alone or strip them when populating fields, that is all.
> Output is left entirely up to the user in either case.

Yes, I recognized that's what you were suggesting. I just don't think that's a very helpful way of operating on CSV fields, because with the quotes left in you can't really operate on the data -- for example, you can't treat fields as numbers or take their sum (the leading quote would get in the way), and you can't even really treat them as strings without stripping the quotes (for example, to concatenate a first name field to a last name). In short, the quoted field value would only be usable if you're going to pass it straight through to the output.

That's correct, but not all CSV-processing applications require modifying fields, and not all applications that do modify fields are allowed to produce output with different quotes than the input had, even if they have to strip those quotes temporarily while modifying the fields.

I get CSVs from multiple sources and need to compare/manipulate them and return them to those sources or send to other destinations that would otherwise receive the original exported CSV. Some of those CSVs are exported from Excel or other Windows tools, some are exported from various applications that run on various web sites, some are created by various Unix tools that have evolved over the years. I see various quoting styles/rules applied across those CSVs - quote only when needed, quote all fields, quote all strings but do not quote numbers, quote only specific columns, quote the data rows but not the header row, etc., etc.

I just counted them and I have 333 gawk scripts using FPAT to manipulate CSVs plus several other CSV-processing scripts that don't use FPAT (most written pre-FPAT). For some I just need to map input fields (from potentially multiple CSVs) to output fields, e.g.:

   ARGIND == 1 { file1[FNR] = $3 OFS $7; next }
   ARGIND == 2 { file2[FNR] = $12; next }
   $3 ~ /whatever/ { print $1, $9, file1[FNR], file2[FNR] }

For others I have to modify some field(s) and today using FPAT I do something like:

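   # strip the enclosing quotes (remembering whether any were there) and un-double embedded quotes
   if ( quoted=gsub(/^"|"$/,"",$8) ) { gsub(/""/,"\"",$8) }
   # modify the bare value
   $8 = whatever
   # re-quote if the field was quoted on input or now needs quotes to be valid CSV
   if ( quoted || ($8 ~ /[\n",]/) ) { gsub(/"/,"\"\"",$8); $8 = "\"" $8 "\"" }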

So, in each case, the field is simply quoted or not in the output based on whether it was quoted or not in the input (or would be invalid CSV if it wasn't quoted).
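
For readers who don't use awk, the same rule can be sketched in Python. This is only an illustration of the idea, not the scripts described above: `decode_field` and `encode_field` are hypothetical names, and unlike the awk snippet this sketch treats a field as quoted only when both the leading and the trailing quote are present.

```python
def decode_field(raw):
    """Return (value, was_quoted) for one raw CSV field (hypothetical helper)."""
    was_quoted = len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')
    if was_quoted:
        # Drop the enclosing quotes and un-double any embedded quotes.
        raw = raw[1:-1].replace('""', '"')
    return raw, was_quoted


def encode_field(value, was_quoted):
    """Re-quote if the input was quoted or the value needs quotes to be valid CSV."""
    if was_quoted or any(ch in value for ch in '\n",'):
        return '"' + value.replace('"', '""') + '"'
    return value


value, was_quoted = decode_field('"Smith, John"')
print(encode_field(value.upper(), was_quoted))  # "SMITH, JOHN"
print(encode_field("42", False))                # 42
print(encode_field("4,2", False))               # "4,2"
```

The point of carrying `was_quoted` along is that the output quoting follows the input quoting field by field, whatever style the source happened to use.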

In that way I just don't care what the quoting rules are for whatever source I got the CSV from and will send the CSV to, I simply output whatever quoting style was input and so I KNOW it'll work at the destination without assuming they can handle all possible, or any specific, CSV quoting styles.

I obviously know how to do whatever I personally need to do to get the functionality I need, whether that's rolling my own record-reading function with `FPAT` to read multi-line fields, as I've done in the past, or rolling my own field-splitting function with `--csv` to retain quotes around fields. (I now realise I probably won't actually do the latter, as I'd then have to remember to call that function again any time I update $0, so it's more impactful to the rest of the script than using FPAT and calling a record-reading function exactly one time in one location.) But people have been writing tools to parse various subsets of CSVs with various subsets of allowed/required quoting for 50+ years, and CSVs are used in many varied applications with no one common standard they all follow, despite the existence of RFC 4180, so I expect I'm not alone in having a need for CSV parsing that simply doesn't strip quotes.

Given that, I suggested a mode like `--csv` but that'd leave quotes alone so we could do whatever we need to do in that regard, but the providers would rather not implement it. That's obviously entirely their decision and is fine; I don't think we need to discuss it any further.

    Ed.


Similarly, the "csv" module in Python and the "encoding/csv" package in Go (and I presume it's similar in other languages) give you the un-encoded field value so that you can perform operations on it.
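
As a small illustration of that behaviour, here is a sketch using Python's standard `csv` module; the sample data is made up for the example.

```python
import csv
import io

# One header row and one data row in which every field is quoted.
data = 'name,quote,amount\n"Doe, Jane","He said ""hi""","10"\n'

rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['Doe, Jane', 'He said "hi"', '10']
```

The enclosing quotes are gone and the doubled quotes are collapsed, so the last field can be passed straight to `int()`.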

> It is 1 of the 2 possible correct behaviors, and it's the one that I expect will be most
> useful most of the time.

I suppose it's not helpful to argue over what is "correct" or not, and I take your point that what you propose is a possible behaviour. However, I've tried to show above that the field values wouldn't be very useful without un-encoding the data -- except to pass it directly to the output. So I definitely disagree with the second part of your statement. Based on my own usage, I'm very often summing a field or similar, which wouldn't work with your approach (without further dequoting/decoding).

To generalize, I think most data processing tends to work this way: decode input, operate on decoded data, encode output.
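
A minimal sketch of that decode / operate / encode flow, again with Python's standard `csv` module and made-up data:

```python
import csv
import io

src = 'item,qty\n"widget","3"\n"gadget, large","4"\n'

# Decode: the reader yields un-encoded field values.
reader = csv.reader(io.StringIO(src))
header = next(reader)
rows = list(reader)

# Operate: the quantities are plain digit strings, so they can be summed.
total = sum(int(qty) for _, qty in rows)

# Encode: the writer decides the quoting (minimal quoting by default).
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
writer.writerow(["total", total])
result = out.getvalue()
print(result, end="")
```

Note that the writer applies its own quoting rule here, so only the field containing a comma is quoted on output even though every data field was quoted on input.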

In any case, I do think Kernighan's choice to have --csv decode the input so that you can operate on decoded data is the more helpful choice, and consistent with what other languages do.

-Ben


