Re: accelerating sscanf ?
From: Daniel J Sebald
Subject: Re: accelerating sscanf ?
Date: Thu, 22 Mar 2012 11:50:18 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16
On 03/22/2012 10:28 AM, CdeMills wrote:
Hello,
when reading big files with my dataframe package, most of the time is spent
converting strings to doubles. The steps are:
1) the whole file is cut into lines
2) each line is then split using the field separator and stored in a cell
array
3) the conversion is performed as
the_line = cellfun (@(x) sscanf (x, "%f", locales), dummy, 'UniformOutput',
false);
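The three steps above can be sketched end to end on made-up data (a minimal
sketch; the sample strings are invented here and the locale argument is
omitted for simplicity):

```octave
charstr = "1.5,2.5\n3.5,4.5";         % stand-in for the file contents
lines = strsplit (charstr, "\n");      % step 1: cut into lines
dummy = strsplit (lines{1}, ",");      % step 2: split one line into fields
% step 3: one sscanf call per field -- this is where the time goes
the_line = cellfun (@(x) sscanf (x, "%f"), dummy, "UniformOutput", false);
```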
This is to say that sscanf is called once per field. During each call, a
stream is created, new locales are set, and so on. Some functions working on
strings accept cells of strings as input. Would it be OK for sscanf to also
accept a cell array as its first argument? The algorithm would then be:
1) create an istringstream and set the right locale on it. Create a cell
array for the output results.
2) for each entry in args(0)
- verify that it is a string
- put its value into the istringstream
- scan it, and store the result in the output cell array
The issue I have is that files with 7500 lines of 12 fields take more than
120 seconds to parse. If the number of calls were reduced by a factor of
10 at the interpreter level, wouldn't the speedup be worth a try? What do
you think about it?
Pascal,
What you describe sounds very similar to the issue we've been discussing
concerning bin2dec, i.e., better performance when working with strings
inside cells. 7500 lines with 12 fields is what I would consider a
"database" application. Once the data is in the cells, Octave can do a lot
of powerful things to analyze it. However, it appears that getting large
data sets in, or transforming them, is somewhat inefficient.
In any case, I'd like to point you to an alternative you might try.
There is a string function called strsplit which can be pretty handy.
Here's an example:
charstr = "these, are, fields\nseparated, by, commas";
lines = strsplit (charstr, "\n")
for i = 1:length (lines)
  C(i,:) = strsplit (lines{i}, ",");
end
C
If one brings in the whole file as a hunk of characters and then uses
strsplit, that can be efficient. The only problem--I don't know the
format of your data--is text fields that contain the delimiter
character.
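Concretely, that whole-file approach might look like the sketch below. The
file name "data.csv" and the comma delimiter are assumptions for the
example (the file is written here just so the sketch is self-contained),
and the final str2double step stands in for the per-field sscanf
conversion:

```octave
% Write a small hypothetical file so the example runs on its own.
fid = fopen ("data.csv", "w");
fputs (fid, "1.5,2.5\n3.5,4.5\n");
fclose (fid);

% Slurp the whole file as one hunk of characters, then split it up.
fid = fopen ("data.csv", "r");
charstr = fread (fid, Inf, "char=>char")';   % whole file as one row vector
fclose (fid);

lines = strsplit (strtrim (charstr), "\n");  % cut into lines
for i = 1:length (lines)
  C(i,:) = strsplit (lines{i}, ",");         % breaks if a field contains ","
end
vals = str2double (C);                       % cell of strings -> numeric matrix
```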
Dan