bug-gawk

Re: CSV extension status


From: Andrew J. Schorr
Subject: Re: CSV extension status
Date: Wed, 19 May 2021 09:03:36 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, May 19, 2021 at 01:29:38PM +0200, Manuel Collado wrote:
> Oh! Sorry. I've just realized that you are probably talking about
> how csvmode.awk rebuilds the record with clean values delimited by
> CSVOFS.
> Of course, you are right. On simple cases demangling CSV and
> composing the clean values almost duplicates the processing time.

I'm talking about this code:

http://mcollado.z15.es/xgawk/csvmode/csvmode.awk

# Process CSV input records
_csv_mode+0 {
    # Collect multi-line data, if it is the case
    CSVRECORD = $0
    while (!_csv_1line && gsub(_csv_quote, _csv_quote, CSVRECORD) % 2 == 1 &&
           (_csv_multi = getline _csv_) > 0) {
        CSVRECORD = CSVRECORD "\n" _csv_
        NR--
        FNR--
    }
    if (_csv_multi) {
        $0 = CSVRECORD
    }

    # Convert the CSV record at user request
    if (_csv_mode < 0) {
        _csv_nf = csvsplit($0, _csv_ff)
        _csv_record = ""
        _csv_sep = ""
        for (k=1; k in _csv_ff; k++) {
            _csv_record = _csv_record _csv_sep csvunquote(_csv_ff[k])
            _csv_sep = OFS
        }
        $0 = _csv_record
    } else if (_csv_trimlvl > 0) {
        for (k=1; k<=NF; k++) {
            sub(/^[[:space:]]+/, "", $k)
            sub(/[[:space:]]+$/, "", $k)
            if (_csv_trimlvl > 1) {
                gsub( /[[:space:]]+/, " ", $k )
            }
        }
    }

    # Store a possible header record
    if (FNR==1) {
        for (k=1; k<=NF; k++) {
            if (_csv_mode < 0) label = $k
            else label = csvunquote($k, _csv_quote, _csv_trimlvl)
            _csv_column[label] = k
        }
    }
}

When _csv_mode < 0, it splits and reconstructs the record. That has to
be time-consuming.

I'm not sure what happens when _csv_mode > 0. I guess it relies upon
FPAT parsing in that case.

To be quite honest, I don't quite understand the explanations of how positive
and negative CSVMODE approaches differ. The documentation includes much jargon
that I don't understand: "fragments", "clean text", etc. Probably I'm missing
something, but where are these terms defined?

For negative CSVMODE, it says:
"CSVOFS must contain a character not used in the CSV input file. The default 
SUBSEP character should work in almost all cases."
So that's discouraging. This does not seem robust.
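
To make the robustness concern concrete, here is a minimal sketch (not taken from csvmode.awk) of what can go wrong when clean values are rejoined with a separator that also occurs in the data. The function name rejoin_field_count and the sample values are hypothetical; an empty argument stands in for the SUBSEP default mentioned in the documentation.

```sh
# Hypothetical demo: join two clean values with a separator, split the
# rebuilt record on that same separator, and print the field count.
rejoin_field_count() {   # $1 = separator; empty means use SUBSEP
    awk -v sep="$1" '
    BEGIN {
        if (sep == "") sep = SUBSEP
        # Clean values a CSV parser would yield for the input: "Doe, John",42
        clean[1] = "Doe, John"
        clean[2] = "42"
        rec = clean[1] sep clean[2]      # rebuilt record
        n = split(rec, parts, sep)       # what field splitting sees afterwards
        print n
    }'
}

rejoin_field_count ","   # prints 3: the comma inside the first value splits it
rejoin_field_count ""    # prints 2: SUBSEP does not occur in the data
```

The second call only works because the data happens not to contain the SUBSEP character, which is exactly the assumption the documentation asks the user to make.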

But what are the drawbacks of the CSVMODE > 0 case?

Also, I think I see an inefficiency in your code: _csv_multi is not reset
properly, so it can result in setting "$0 = CSVRECORD" when there's no need to do
so.  And there's no reason to embed the !_csv_1line test in the while loop
test; I'd break that out. And maybe you should be using some functions here so
you can have some local variables instead of polluting the global namespace.
But that might hurt performance, so perhaps it's better to accept that _csv*
has a proliferation of hidden variables.
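
The restructuring suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions, not a patch for csvmode.awk: the names csv_logical_records, count_quotes, collect_record and csv_joined are hypothetical, the quote character is hard-coded, the _csv_1line switch is left out, and the NR/FNR adjustment from the original is omitted. The join counter is reset for every record, and the working variables are function locals.

```sh
# Hypothetical sketch: print each logical CSV record on one line, with
# embedded newlines shown as <NL> so the joined result is easy to inspect.
csv_logical_records() {
    awk '
    # Count occurrences of q in s; t is a local copy so s is not modified.
    function count_quotes(s, q,    t) {
        t = s
        return gsub(q, q, t)
    }

    # Append physical lines to rec while its quote count is odd.
    # Returns the logical record and sets the global csv_joined to the
    # number of extra lines consumed (0 when nothing was appended).
    function collect_record(rec, q,    line, n) {
        n = 0
        while (count_quotes(rec, q) % 2 == 1 && (getline line) > 0) {
            rec = rec "\n" line
            n++
        }
        csv_joined = n
        return rec
    }

    {
        rec = collect_record($0, "\"")
        if (csv_joined > 0) {    # reassign $0 only when lines were joined
            $0 = rec
        }
        gsub(/\n/, "<NL>")
        print ++nrec ": " $0
    }
    ' "$@"
}

printf 'a,"b\nc",d\ne,f\n' | csv_logical_records
```

Whether the extra function calls cost more than the avoided "$0 = CSVRECORD" assignments save would have to be measured; the sketch makes no claim about performance.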

> But, surprisingly, even in that case the pure gawk library beats the
> API-based extension. A simple test based on your previous age/weight
> example, with a sample of 10000 random values gives:
> 
> -- with csvmode.awk
> CSVMODE = 1 (CSV fragments)
> real    0m0.151s
> user    0m0.109s
> sys     0m0.015s
> 
> CSVMODE = -1 (clean values)
> real    0m0.253s
> user    0m0.203s
> sys     0m0.030s
> 
> -- with gawk-csv (clean values)
> real    0m0.980s
> user    0m0.312s
> sys     0m0.672s
> 
> Don't know the reason of this unexpected result.

There seem to be only 2 possibilities, in theory: 1. the input parser API is
designed in such a way that it's impossible to achieve good performance; or 2.
the CSV parser is poorly implemented. Or maybe there are other explanations
that I'm missing. I have not inspected the C code.

Regards,
Andy


