Re: Quotes being stripped by "--csv"

bug-gawk


From:	Ed Morton
Subject:	Re: Quotes being stripped by "--csv"
Date:	Sun, 26 Nov 2023 07:24:22 -0600
User-agent:	Mozilla Thunderbird

Ben - I do appreciate your feedback and perspective, thanks, see below for my response inline.


On 11/25/2023 9:08 PM, Ben Hoyt wrote:

Hi Ed,
It's likely this discussion is moot, given that Arnold said he's not planning to change Gawk further. However, a few additional thoughts.
> My post is not about input mode vs output mode, it's entirely about input mode -> a way to leave the quotes alone or strip them when populating fields, that is all.
> Output is left entirely up to the user in either case.
Yes, I recognized that's what you were suggesting. I just don't think that's a very helpful way of operating on CSV fields, because with the quotes left in you can't really operate on the data -- for example, you can't treat fields as numbers or take their sum (the leading quote would get in the way), and you can't even really treat them as strings without stripping the quotes (for example, to concatenate a first name field to last name). In short, the quoted field value would only be usable if you're going to pass it straight through to the output.

That's correct, but not all CSV-processing applications require modifying fields and not all applications that do modify fields are allowed to produce output with different quotes than the input had even if they have to strip those quotes temporarily while modifying the fields.

I get CSVs from multiple sources and need to compare/manipulate them and return them to those sources or send to other destinations that would otherwise receive the original exported CSV. Some of those CSVs are exported from Excel or other Windows tools, some are exported from various applications that run on various web sites, some are created by various Unix tools that have evolved over the years. I see various quoting styles/rules applied across those CSVs - quote only when needed, quote all fields, quote all strings but do not quote numbers, quote only specific columns, quote the data rows but not the header row, etc., etc.

I just counted them and I have 333 gawk scripts using FPAT to manipulate CSVs plus several other CSV-processing scripts that don't use FPAT (most written pre-FPAT). For some I just need to map input fields (from potentially multiple CSVs) to output fields, e.g.:


   ARGIND == 1 { file1[FNR] = $3 OFS $7; next }
   ARGIND == 2 { file2[FNR] = $12; next }
   $3 ~ /whatever/ { print $1, $9, file1[FNR], file2[FNR] }

For others I have to modify some field(s) and today using FPAT I do something like:


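   # strip the enclosing quotes (if any) and un-double embedded quotes
   if ( quoted=gsub(/^"|"$/,"",$8) ) { gsub(/""/,"\"",$8) }
   $8 = whatever
   # re-quote if the field was quoted on input or now needs quoting
   if ( quoted || ($8 ~ /[\n",]/) ) { gsub(/"/,"\"\"",$8); $8 = "\"" $8 "\"" }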

So, in each case, the field is simply quoted or not in the output based on whether it was quoted or not in the input (or would be invalid CSV if it wasn't quoted).

In that way I just don't care what the quoting rules are for whatever source I got the CSV from and will send the CSV to, I simply output whatever quoting style was input and so I KNOW it'll work at the destination without assuming they can handle all possible, or any specific, CSV quoting styles.
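For comparison with the other languages mentioned in this thread, the quoted-in/quoted-out rule described above can be sketched in Python. This is only an illustration: `update_field` and `transform` are hypothetical names, and the sketch assumes the raw field text (quotes included) has already been split out of the record.

```python
def update_field(raw, transform):
    """Apply transform to a CSV field's value, keeping the input's quoting."""
    # A field counts as quoted if it is wrapped in double quotes.
    quoted = len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"'
    # Decode: drop the wrapping quotes and un-double embedded quotes.
    value = raw[1:-1].replace('""', '"') if quoted else raw
    value = transform(value)
    # Encode: re-quote if quoted on input, or if the new value needs it.
    if quoted or any(ch in value for ch in '\n",'):
        value = '"' + value.replace('"', '""') + '"'
    return value

print(update_field('"abc"', str.upper))               # "ABC"
print(update_field('abc', str.upper))                 # ABC
print(update_field('abc', lambda v: v + ', etc.'))    # "abc, etc."
```

A field that arrived quoted leaves quoted, an unquoted field stays unquoted, and quotes are added only when the new value would otherwise be invalid CSV.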

I obviously know how to do whatever I personally need to do to get the functionality I need, whether that's rolling my own record-reading function with `FPAT` to read multi-line fields, as I've done in the past, or rolling my own field-splitting function with `--csv` to retain quotes around fields. (I now realise I probably won't actually do the latter, as I'd then have to remember to call that function again any time I update $0, so it's more impactful to the rest of the script than using FPAT and calling a record-reading function exactly one time in one location.) But people have been writing tools to parse various subsets of CSVs with various subsets of allowed/required quoting for 50+ years, and CSVs are used in many varied applications with no 1 common standard they all follow, despite the existence of RFC 4180, so I expect I'm not alone in having a need for CSV parsing that simply doesn't strip quotes.

Given that, I suggested a mode like `--csv` but that'd leave quotes alone so we could do whatever we need to do in that regard, but the providers would rather not implement it and that's obviously entirely their decision and is fine. I don't think we need to discuss it any further.

Ed.

Similarly, the "csv" module in Python and the "encoding/csv" package in Go (and I presume it's similar in other languages) give you the un-encoded field value so that you can perform operations on it.
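As a small illustration of that point, here is how Python's standard `csv` module hands back decoded values. The helper name `read_rows` and the sample data are made up for this sketch.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into lists of decoded (unquoted) field values."""
    return list(csv.reader(io.StringIO(text, newline="")))

sample = '"name","amount"\n"Smith, J","10"\n"O""Brien","32"\n'
rows = read_rows(sample)

# The enclosing quotes are gone and the doubled quote is un-doubled.
print(rows[1][0])   # Smith, J
print(rows[2][0])   # O"Brien

# Decoded values can be converted and summed directly.
total = sum(int(row[1]) for row in rows[1:])
print(total)        # 42
```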
> It is 1 of the 2 possible correct behaviors, and it's the one that I expect will be most
> useful most of the time.
I suppose it's not helpful to argue over what is "correct" or not, and I take your point that what you propose is a possible behaviour. However, I've tried to show above that the field values wouldn't be very useful without un-encoding the data -- except to pass it directly to the output. So I definitely disagree with the second part of your statement. Based on my own usage, I'm very often summing a field or similar, which wouldn't work with your approach (without further dequoting/decoding).
To generalize, I think most data processing tends to work this way: decode input, operate on decoded data, encode output.
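A minimal sketch of that decode / operate / encode pattern, using Python's standard `csv` module. The function name `add_tax`, its `rate_percent` parameter, and the two-column layout are assumptions made for the example.

```python
import csv
import io

def add_tax(text, rate_percent=10):
    """Decode CSV text, scale the second column, and re-encode it."""
    out = io.StringIO(newline="")
    # csv.writer's default quoting is QUOTE_MINIMAL: quote only when needed.
    writer = csv.writer(out, lineterminator="\n")
    for name, amount in csv.reader(io.StringIO(text, newline="")):
        writer.writerow([name, int(amount) * (100 + rate_percent) // 100])
    return out.getvalue()

print(add_tax('"Smith, J","100"\n"Jones","50"\n'), end="")
# "Smith, J",110
# Jones,55
```

Note that the writer applies its own quoting rule on output, so the quoting in the result need not match the quoting in the input.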
In any case, I do think Kernighan's choice to have --csv decode the input so that you can operate on decoded data is the more helpful choice, and consistent with what other languages do.
-Ben
