Re: [bug-gawk] Parsing standard CVS data by gawk
From: Jarno Suni
Subject: Re: [bug-gawk] Parsing standard CVS data by gawk
Date: Tue, 1 Sep 2015 16:21:11 +0300
On Tue, 1 Sep 2015 08:22:49 -0400
"Andrew J. Schorr" <address@hidden> wrote:
> On Wed, Jul 08, 2015 at 12:17:17AM +0300, Jarno Suni wrote:
> > For some use cases it is not necessary to have all data in memory in
> > one time. This might not be optimal implementation for such a case.
> > I also wrote an implementation that reads input line by line.
>
> I think the normal gawk usage pattern is to handle one record at a
> time. Reading the entire file into memory is not scalable, so cannot
> be held out as a general solution to the problem. My sense is that
> the proper solution is to write a CSV parser extension in C that
> handles all the corner cases properly. We are seeking a volunteer to
> implement this.
A tool like this one, written in C, could also be used as part of handling CSV files:
https://github.com/dbro/csvquote
(The other-languages branch contains implementations in various
other programming languages.)
By the way, the line-by-line script that I wrote for the task is
faster, though a bit more complicated, than the all-at-once one.
I could possibly write an even faster one if there were faster
implementations of
function match_from(string,regex,start) {
    return match(substr(string,start),regex)
}
and
function index_from(string,substring,start) {
    return index(substr(string,start),substring)
}
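Those could be implemented efficiently in C if characters have a fixed
length (e.g. in -b mode, where gawk treats characters as bytes), because
the start position then maps directly to a byte offset and no substring
has to be copied.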
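
As a rough illustration of that idea, here is a minimal sketch in C of a byte-oriented counterpart to index_from. The name index_from_bytes and its signature are hypothetical; this is not part of gawk or its extension API. It assumes one byte per character, and it returns 0 for an empty needle, which is a choice made for this sketch rather than a statement about awk's behaviour.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of gawk: search for needle in hay,
 * beginning at the 1-based byte offset start.
 * Returns the 1-based position relative to start (the same convention
 * as the awk index_from above), or 0 if there is no match.
 * Assumes fixed-length, one-byte characters. */
size_t index_from_bytes(const char *hay, size_t hay_len,
                        const char *needle, size_t needle_len,
                        size_t start)
{
    if (start < 1)
        start = 1;                      /* clamp, as substr() effectively does */
    if (needle_len == 0 || start > hay_len)
        return 0;                       /* nothing to search for or in */

    const char *base = hay + (start - 1);   /* no copy: just an offset */
    size_t remaining = hay_len - (start - 1);

    for (size_t i = 0; i + needle_len <= remaining; i++) {
        if (base[i] == needle[0] &&
            memcmp(base + i, needle, needle_len) == 0)
            return i + 1;               /* position relative to start */
    }
    return 0;
}
```

With this convention, a nonzero result r corresponds to the absolute position start - 1 + r in the original string. A match_from counterpart would follow the same pattern, with the regex matcher started at the byte offset instead of the memcmp loop.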
--
Jarno Ilari Suni - http://www.iki.fi/8/