bug-gawk

Re: CSV extension status


From: Andrew J. Schorr
Subject: Re: CSV extension status
Date: Wed, 19 May 2021 16:54:19 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

Hi Manuel,

On Wed, May 19, 2021 at 07:48:10PM +0200, Manuel Collado wrote:
> Thank you very much for your thorough inspection of my convoluted
> code. I'll revise it according to your suggestions.

You're welcome, but I haven't reviewed it thoroughly. I just scanned
some pieces of it.

> On 19/05/2021 at 15:03, Andrew J. Schorr wrote:
> >..
> >I'm not sure what happens when _csv_mode > 0. I guess it relies upon
> >FPAT parsing in that case.
> 
> Yes.

OK.

> >To be quite honest, I don't quite understand the explanations of how positive
> >and negative CSVMODE approaches differ. The documentation includes much jargon
> >that I don't understand: "fragments", "clean text", etc. Probably I'm missing
> >something, but where are these terms defined?
> 
> They are not precise technical terms. Just common English words, I hope.
> 
> Clean text -> "Hello!", she said
> CSV fragment -> """Hello!"", she said"
> 
> Can you suggest better terms? I've not found technical terms for
> these values in the CSV bibliography.

Ah, OK, so I'd describe "clean text" as "the field value after
protective/escaping quotes have been removed". And a "CSV fragment" is the ugly
field value with protective quoting. Or something like that. Perhaps the best
way to explain it is as you just did: by showing examples the first time you
introduce these terms.

So if I understand it correctly then, the CSVMODE=1 approach uses
native FPAT and gives field values with nasty CSV quoting, whereas
CSVMODE=-1 strips out the ugly CSV quoting, but should in principle
be slower because it splits and reassembles the record. And your benchmark
seems to confirm that.
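
To make the two terms concrete, here is a minimal Python sketch of the
distinction. It is not taken from the library: the names FRAGMENT,
fragments() and clean() are hypothetical, the regular expression only
loosely imitates an FPAT-style pattern, and it skips empty fields for
brevity. It assumes embedded quotes are escaped by doubling, as in the
example above.

```python
import re

# Hypothetical FPAT-like pattern: a quoted field (with "" as an escaped
# quote) or a run of non-comma characters. Empty fields are not handled.
FRAGMENT = re.compile(r'"(?:[^"]|"")*"|[^,]+')

def fragments(record):
    """Split a record into raw CSV fragments, quoting left intact."""
    return FRAGMENT.findall(record)

def clean(fragment):
    """Turn one CSV fragment into clean text."""
    if len(fragment) >= 2 and fragment[0] == '"' and fragment[-1] == '"':
        # Drop the protective outer quotes, then collapse doubled quotes.
        return fragment[1:-1].replace('""', '"')
    return fragment

record = '"""Hello!"", she said",42'
raw = fragments(record)               # what a positive-CSVMODE style split yields
cleaned = [clean(f) for f in raw]     # what a negative-CSVMODE style split yields
print(raw)
print(cleaned)
```

The first list holds the fragments exactly as they appear in the record;
the second holds the values most people would actually want to work with.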

By the way, why does the _csv_fpat function calculate a "trim" value when it
never actually uses it?

> >For negative CSVMODE, it says:
> >"CSVOFS must contain a character not used in the CSV input file. The default 
> >SUBSEP character should work in almost all cases."
> >So that's discouraging. This does not seem robust.
> 
> I think it is as robust as its use to emulate multidimensional
> arrays with linear arrays since the creation of AWK.

It's just a question of whether the ASCII octal 034 value appears in the
input. If it does, then everything breaks, unless I'm confused.
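
A small Python sketch of the failure I have in mind, under the
assumption that the clean fields are joined with the separator and then
split again. The names SEP and roundtrip() are hypothetical and only
illustrate the collision; they are not how the library is written.

```python
SEP = "\x1c"  # ASCII octal 034, the default SUBSEP value in awk

def roundtrip(fields, sep=SEP):
    """Join clean fields with sep, then split them apart again."""
    return sep.join(fields).split(sep)

safe = ["a", "b,c", "d"]
unsafe = ["a", "b\x1cc", "d"]   # second field happens to contain octal 034

print(roundtrip(safe) == safe)      # fields survive the round trip
print(len(roundtrip(unsafe)))       # field count changes: 4 instead of 3
```

So the approach is only as safe as the guarantee that the separator
never occurs in the data.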

> >But what are the drawbacks of the CSVMODE > 0 case?
> 
> It delivers what I call "CSV fragments", instead of the effective
> ("clean text") data.

Agreed -- that is typically not what I want. I imagine most people want
the field quoting removed.

I think it ought to be possible to have a super-fast C parser
that leaves the input record undisturbed, but is able to discern where
the field boundaries are and remove the outer layer of quotes, but you now
have me concerned that CSV quoting is so ugly that this can never work.
In your example:
   CSV fragment -> """Hello!"", she said"
   Clean text -> "Hello!", she said
If that's really true (and I confess that I have no idea how CSV embedded
quotes are escaped), then we'd still be stuck with a pseudo-clean
value of:
   ""Hello!"", she said
That's ugly, but an unusual case in my limited CSV experience.
But if this is really true, then I'm tempted to throw up my hands
in disgust and say that your gawk library is the best we can do, although
I think the code can be tightened up a bit.
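
Here is roughly what I mean by a parser that leaves the record
undisturbed, sketched in Python rather than C. The function name
field_spans() and the returned tuple layout are hypothetical, and the
sketch assumes doubled quotes are the only escape mechanism. It reports
offsets into the original record with the outer quotes excluded, plus a
flag saying whether doubled quotes remain inside the span.

```python
def field_spans(record):
    """Return (start, end, has_doubled_quote) for each field.

    Offsets index into the unmodified record and exclude outer quotes.
    """
    spans = []
    i, n = 0, len(record)
    while True:
        doubled = False
        if i < n and record[i] == '"':
            start = i + 1
            j = start
            while j < n:
                if record[j] == '"':
                    if j + 1 < n and record[j + 1] == '"':
                        doubled = True      # escaped quote, keep scanning
                        j += 2
                        continue
                    break                   # closing quote found
                j += 1
            end = j
            i = j + 1                       # step past the closing quote
        else:
            start = i
            j = record.find(",", i)
            end = n if j == -1 else j
            i = end
        spans.append((start, end, doubled))
        if i >= n:
            break
        i += 1                              # step past the comma
    return spans

record = '"""Hello!"", she said",42'
for start, end, doubled in field_spans(record):
    print(repr(record[start:end]), doubled)
```

For the first field the slice is the pseudo-clean value shown above,
with the flag set; only fields with the flag set would need a copy to
collapse the doubled quotes.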

And I still think you should hide the many _csv* local variables
in a namespace.

Thanks,
Andy
