bug-gawk

Re: Parse CVS in awk


From: Manuel Collado
Subject: Re: Parse CVS in awk
Date: Thu, 9 Apr 2020 19:53:36 +0200
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0

On 09/04/2020 at 17:00, Manuel Collado wrote:
On 09/04/2020 at 4:51, Peng Yu wrote:
I'm wondering if the solution mentioned here is robust against all CSV
format variations.

https://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content

This manual says:

<quote>
NOTE: Some programs export CSV data that contains embedded newlines between the double quotes. gawk provides no way to deal with this. Even though a formal specification for CSV data exists, there isn’t much more to be done; the FPAT mechanism provides an elegant solution for the majority of cases, and the gawk developers are satisfied with that.
<endquote>

Well, there is a trick that can handle fields with embedded newlines. The idea is to keep joining input lines until the record contains an even number of double quotes, and to decrement NR and FNR for each extra line read so that they keep counting records:

# Process CSV input records with embedded newlines
{
    # Collect multi-line data, if it is the case
    CSVRECORD = $0
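    # gsub() replaces every quote with itself, so its return value is the
    # number of quotes; an odd count means a quoted field is still open.
    while (gsub("\"", "\"", CSVRECORD) % 2 == 1 && (_csv_multi = getline _csv_) > 0) {
        CSVRECORD = CSVRECORD "\n" _csv_
        # getline incremented NR and FNR; undo that so they count records
        NR--
        FNR--
    }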
    if (_csv_multi) {
        $0 = CSVRECORD
    }
}
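
As a quick check of the rule above, here is a minimal sketch that wraps it in a shell function and feeds it a tiny sample. The function name csv_join_demo, the sample data and the second reporting rule are made up for illustration; I am assuming a plain POSIX awk (gawk should behave the same), and that quotes inside a field are escaped by doubling them, so they do not upset the odd/even count.

```shell
#!/bin/sh
# Hypothetical helper: reads CSV on stdin, joins physical lines that belong
# to one record, and prints the record number with its physical line count.
csv_join_demo() {
    awk '
    # Joining rule from the message, same logic
    {
        CSVRECORD = $0
        while (gsub("\"", "\"", CSVRECORD) % 2 == 1 && (_csv_multi = getline _csv_) > 0) {
            CSVRECORD = CSVRECORD "\n" _csv_
            NR--
            FNR--
        }
        if (_csv_multi) {
            $0 = CSVRECORD
        }
    }
    # Illustrative follow-up rule: count the physical lines in the record
    {
        n = split($0, parts, "\n")
        print NR ": " n " line(s)"
    }
    '
}

# Made-up sample: the second record has a quoted field with a newline in it
printf 'id,comment\n1,"first line\nsecond line"\n2,plain\n' | csv_join_demo
```

With this sample I would expect three records to be reported, numbered 1 to 3, with only the second one spanning two physical lines. Any later rule in the same program (FPAT-based field splitting, for instance) then sees the joined record in $0.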

HTH.
--
Manuel Collado - http://mcollado.z15.es
