Re: accelerating sscanf ?


From: Daniel J Sebald
Subject: Re: accelerating sscanf ?
Date: Thu, 22 Mar 2012 11:50:18 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16

On 03/22/2012 10:28 AM, CdeMills wrote:
Hello,
when reading big files with my dataframe package, most of the time is spent
converting strings to doubles. The steps are:
1) the whole file is cut into lines
2) each line is then split using the field separator and stored in a cell
array
3) the conversion is performed as
  the_line = cellfun (@(x) sscanf (x, "%f", locales), dummy, 'UniformOutput', false);

This means that sscanf is called once for each field. During each call, a
stream is created, a new locale is set, and so on. Some functions working on
strings accept cells of strings as input. Would it be OK to have sscanf also
accept a cell array as its first argument? The algorithm would then be:
1) create an istringstream and set the right locale on it. Create a cell array
for the output result.
2) for each entry in args(0):
    - verify that it is a string
    - put its value into the istringstream
    - scan it, and store the result in the output cell array
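
(A rough Octave-level sketch of that loop, purely for illustration -- the
actual change would live in Octave's C++ sscanf, where a single istringstream
and locale could be reused; sscanf_cell and its argument names are
hypothetical:)

function out = sscanf_cell (strs, fmt)
  ## Scan every string in the cell array STRS with format FMT and collect
  ## the results in a cell array.  Only the interface is mimicked here; the
  ## real speedup would come from sharing one stream and locale in C++.
  out = cell (size (strs));
  for k = 1:numel (strs)
    if (! ischar (strs{k}))
      error ("sscanf_cell: entry %d is not a string", k);
    end
    out{k} = sscanf (strs{k}, fmt);
  end
end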

The issue I have is that files with 7500 lines of 12 fields take more than
120 seconds to parse. If the number of calls were reduced by a factor of
10 at the interpreter level, wouldn't the speedup be worth a try? What do you
think about it?

Pascal,

What you describe sounds very similar to the issue we've been discussing concerning bin2dec, i.e., better performance when working with strings inside cells. 7500 lines with 12 fields is what I would consider a "database" application. Once the data is in cells, Octave can do a lot of powerful things to analyze it. However, it appears that reading in large data sets, or transforming them, is somewhat inefficient.

In any case, I'd like to point you to an alternative you might try. There is a string function called strsplit which can be pretty nice. Here's an example:

charstr = "these, are, fields\nseparated, by, commas";
lines = strsplit (charstr, "\n")        # split the text into lines
for i = 1:numel (lines)
  C(i,:) = strsplit (lines{i}, ",");    # split each line into its fields
end
C                                       # cell array of fields, one row per line

If one brings in the whole file as a single chunk of characters and then uses strsplit, that can be efficient. The only problem (I don't know the format of your data) is text fields that contain the delimiter character.
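
As a rough sketch of that whole-file approach -- assuming a comma-separated
file called "data.csv" with 12 numeric fields per line and no commas embedded
in the fields (the file name and field count are just placeholders):

fid = fopen ("data.csv", "r");
txt = fread (fid, Inf, "char=>char")';    # whole file as one row of characters
fclose (fid);
lines = strsplit (strtrim (txt), "\n");   # one cell per line
vals = zeros (numel (lines), 12);         # 12 numeric fields per line
for i = 1:numel (lines)
  vals(i,:) = str2double (strsplit (lines{i}, ","));
end

str2double returns NaN for any field it cannot parse, and a line whose field
count changes because of an embedded delimiter shows up as a dimension
mismatch in the assignment, so such problems are at least easy to spot.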

Dan

