
Re: stripping of CR characters in --csv mode


From: cph1968
Subject: Re: stripping of CR characters in --csv mode
Date: Wed, 05 Apr 2023 06:24:23 +0000

as per request
❯ gawk --version
GNU Awk 5.2.60, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.2.0, GNU MP 6.2.1)
Copyright (C) 1989, 1991-2023 Free Software Foundation.
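
For anyone reproducing this without a --csv capable gawk, here is a minimal sketch of the behaviour Andy describes as the intent further down the thread: drop a CR only when it sits immediately before the LF record terminator, and leave any other CR alone. The function name strip_trailing_cr is hypothetical, not something gawk provides, and this is plain awk rather than the actual --csv code.

```shell
# Hypothetical helper (not part of gawk): remove a CR only when it is the
# last character of a record, i.e. immediately before the LF terminator.
strip_trailing_cr() {
    awk '{ sub(/\r$/, "") } 1'
}

# The CR embedded inside the quotes should survive; the CR that belongs to
# the CRLF terminator should be dropped. od -c makes both visible.
printf '"beforeCR\rafterCR",x\r\n' | strip_trailing_cr | od -An -c
```

Piping a CRLF file through this before an FPAT-based script should keep the trailing CR out of the last field, assuming the records really are LF- or CRLF-terminated.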

kind regards,
cph1968


------- Original Message -------
On Tuesday, April 4th, 2023 at 20:48, Ed Morton <mortoneccc@comcast.net> wrote:


> Andy - got it, thanks for clarifying. Yes, that does look like a bug to me 
> too FWIW.
> 

> Talking of bugs - cph1968, in addition to your sample input and output piped 
> to `cat -v` or similar, please also let us know which version of gawk you're 
> running as there were some FPAT bugs in older versions so if you are 
> encountering a problem using FPAT maybe it's due to one of those.
> 

> Ed.
> 

> On 4/4/2023 11:23 AM, Andrew J. Schorr wrote:
> 

> > Hi Ed,
> > 

> > I think the intent is merely to strip and ignore carriage returns that appear
> > just before a LF record terminator. So it should all work painlessly regardless
> > of whether the file's records are terminated with only LF or the combination CR
> > LF.
> > 

> > However, the current code appears to have a bug whereby it strips and
> > removes CR characters regardless of where they appear in the file.
> > 

> > Here's some sample input where the first field contains an embedded CR inside
> > quotes:
> > 

> > bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | hexdump -vC
> > 00000000  22 62 65 66 6f 72 65 43  52 0d 61 66 74 65 72 43  |"beforeCR.afterC|
> > 00000010  52 22 0a                                          |R".|
> > 00000013
> > 

> > And when I run it through gawk --csv, the CR is unceremoniously dropped:
> > 

> > bash-4.2$ echo beforeCR | unix2dos | awk '{printf "\"%s%s\"\n", $1, "afterCR"}' | ./gawk --csv '{print $1}' | hexdump -vC
> > 00000000  62 65 66 6f 72 65 43 52  61 66 74 65 72 43 52 0a  |beforeCRafterCR.|
> > 00000010
> > 

> > That seems like a bug to me, but perhaps I am confused.
> > 

> > Regards,
> > Andy
> > 

> > On Tue, Apr 04, 2023 at 11:05:48AM -0500, Ed Morton wrote:
> > 

> > > Andy - I know that's what https://www.rfc-editor.org/rfc/rfc4180 says but
> > > that's just one CSV "standard" and in practice most CSVs created/used on Unix
> > > end with LF alone and if there's a CR before the LF then it's just another
> > > character unless you write code to remove it.
> > > 

> > > If the CSV file format as used by --csv defines the record terminator as CR LF
> > > and --csv strips the CRs then its output would no longer be valid CSV by that
> > > same definition so that's a surprising choice. Does that mean it'll fail if the
> > > input is just LF-terminated as most Unix files are (and in which case you
> > > couldn't write `awk --csv 'foo' input | awk --csv 'bar'`)?
> > > 

> > >     Ed.
> > > 

> > > On 4/4/2023 10:48 AM, Andrew J. Schorr wrote:
> > > 

> > >     Hi Ed,
> > > 

> > >     The CSV file format defines the record terminator as CR LF, so the new --csv
> > >     option does in fact strip CRs.
> > > 

> > >     Regards,
> > >     Andy
> > > 

> > >     On Tue, Apr 04, 2023 at 10:32:49AM -0500, Ed Morton wrote:
> > > 

> > >         Are you sure in the FPAT output you're not just seeing the expected
> > >         effects of there being a CR in your data? The `--csv` output is the
> > >         one that looks wrong to me if you have `CR`s at the end of each
> > >         line, unless `--csv` is documented to strip `CR`s from the output.
> > > 

> > >         Please provide the input file you used as it's hard to tell what's
> > >         going on from just the output. Also pipe the output to `cat -v` or
> > >         `od -c` or similar so we can see where the CRs are in the output but
> > >         my best guess right now is that `FPAT` is retaining the CRs as
> > >         expected while `--csv` is stripping them (which may or may not be
> > >         expected - I'm not familiar with that option).
> > > 

> > >             Ed.
> > > 

> > >         On 4/4/2023 5:12 AM, cph1968@proton.me wrote:
> > > 

> > >             the regex fp[2] in section 4.7.1 (below) doesn't quite cut it
> > >             if the CSV file records end in both CR and NL [0H0D 0H0A]. I
> > >             believe this is a common feature of Windows files.
> > >             A simple fix is however to use the gawk --csv option.
> > > 

> > >             ❯ head -n 2 TSCAINV_022023.csv| gawk -f print-fields.awk
> > > 

> > >                 ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> > >                 F = 1 <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY
> > >                 1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> > >                 F = 1 <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE
> > > 

> > >             note here that the last '>' is the first character on the next line.
> > > 

> > >             output using the --csv option:
> > >             ❯ head -n 2 TSCAINV_022023.csv| gawk --csv -f print-fields.awk
> > >             <ID,CASRN,casregno,UID,EXP,ChemName,DEF,UVCB,FLAG,ACTIVITY>
> > >             NF = 10 <ID><CASRN><casregno><UID><EXP><ChemName><DEF><UVCB><FLAG><ACTIVITY>
> > >             <1,50-00-0,50000,,,Formaldehyde,,,,ACTIVE>
> > >             NF = 10 <1><50-00-0><50000><><><Formaldehyde><><><><ACTIVE>
> > > 

> > >             much better :-)
> > > 

> > >             ❯ cat print-fields.awk
> > >             {
> > >                 print "<" $0 ">"
> > >                 printf("NF = %s ", NF)
> > >                 for (i = 1; i <= NF; i++) {
> > >                     printf("<%s>", $i)
> > >                 }
> > >                 print ""
> > >             }
> > > 

> > > 

> > > 

> > >             from section 4.7.1:
> > > 

> > >             BEGIN {
> > >                  fp[0] = "([^,]+)|(\"[^\"]+\")"
> > >                  fp[1] = "([^,]*)|(\"[^\"]+\")"
> > >                  fp[2] = "([^,]*)|(\"([^\"]|\"\")+\")"
> > >                  FPAT = fp[fpat+0]
> > >             }
> > > 

> > > 

> > > 

> > >             kind regards,
> > > 

> > >             cph1968
