Re: [bug-gawk] Parsing standard CVS data by gawk


From: Jarno Suni
Subject: Re: [bug-gawk] Parsing standard CVS data by gawk
Date: Tue, 1 Sep 2015 16:21:11 +0300

On Tue, 1 Sep 2015 08:22:49 -0400
"Andrew J. Schorr" <address@hidden> wrote:

> On Wed, Jul 08, 2015 at 12:17:17AM +0300, Jarno Suni wrote:
> > For some use cases it is not necessary to have all data in memory in
> > one time. This might not be optimal implementation for such a case.
> > I also wrote an implementation that reads input line by line.
> 
> I think the normal gawk usage pattern is to handle one record at a
> time. Reading the entire file into memory is not scalable, so cannot
> be held out as a general solution to the problem. My sense is that
> the proper solution is to write a CSV parser extension in C that
> handles all the corner cases properly. We are seeking a volunteer to
> implement this.

A C tool of this kind could also be used as part of handling CSV files:
https://github.com/dbro/csvquote
(The other-languages branch contains implementations in various
other programming languages.)

By the way, the line-by-line script that I wrote for the task is
faster, though a bit more complicated, than the all-at-once one.
I could possibly write an even faster one if there were faster
implementations of

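# Like match(), but searches from position start onward.
# The return value and RSTART are relative to start, not to the whole string.
function match_from(string,regex,start) {
        return match(substr(string,start),regex)
}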

and

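# Like index(), but searches from position start onward.
# The return value is relative to start, not to the whole string.
function index_from(string,substring,start) {
        return index(substr(string,start),substring)
}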

Both can be implemented efficiently in C if characters have a fixed
length (e.g. in gawk's -b mode).
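
As a rough illustration of that point, here is a minimal sketch of what a
byte-oriented index_from might look like in C. It is not gawk code: the
name index_from_bytes and its signature are hypothetical, and it assumes
one byte per character (as in -b mode) and a NUL-terminated string with no
embedded NUL bytes. With fixed-length characters the start offset becomes
plain pointer arithmetic, so no substring has to be copied.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of gawk: byte-oriented counterpart of
 * index(substr(s, start), sub).
 *
 * s     - NUL-terminated haystack, assumed to hold one byte per character
 * len   - length of s in bytes
 * sub   - NUL-terminated needle
 * start - 1-based position to start searching from (0 is treated as 1)
 *
 * Returns the 1-based position of sub relative to start, or 0 if sub
 * does not occur at or after start. A start beyond the end of s
 * returns 0 (an assumption of this sketch).
 */
size_t index_from_bytes(const char *s, size_t len, const char *sub, size_t start)
{
    if (start < 1)
        start = 1;
    if (start > len)
        return 0;

    /* Jump straight to the start offset; nothing is copied. */
    const char *base = s + (start - 1);
    const char *hit = strstr(base, sub);

    return hit ? (size_t)(hit - base) + 1 : 0;
}
```

For the 11-byte string a,b,"c,d",e a call with sub "," and start 3 returns
2, matching what the awk version gives for the same arguments. A
match_from counterpart would need the regex matcher to accept a start
offset in the same way.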

-- 
Jarno Ilari Suni - http://www.iki.fi/8/
