Subject: importdata different approach
From: Daniel J Sebald
Date: Tue, 30 Jul 2013 23:40:36 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16
Erik,
I used the importdata function last night, and although it works fine
(thank you), it is quite slow for CSV files, even relatively small ones.
I profiled the routine a bit; here are some CPU times for various parts
of the routine (the size of the data is 7383 x 5):
ans = 0.0099990
ans = 0.089986
ans = 0
ans = 0
ans = 0
ans = 0.097985
ans = 0.49592
ans = 3.6494
The main thing to note is that the first stages, which involve the
regexp routine, are rather efficient, while the last stages, which
involve double looping, are quite the opposite. The main issue is that
the following tests and such consume time:
if (any (file_content_rows{i} != " "))
and
data_numeric = str2double (row_data{j});
and
for i=(header_rows+1):length(file_content_rows)
data_columns = max (data_columns,
length (regexp (file_content_rows{i},
delimiter_pattern, "split")));
endfor
and
row_data = regexp (file_content_rows{i}, delimiter_pattern, "split");
Take particular note that the last two operations are duplicating the
same work of splitting the data according to delimiter.
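One way to avoid that duplication (a sketch, not what is in the patch) is to split each row once, cache the resulting cell arrays, and reuse them both for the column count and for the later conversion. The variable names (file_content_rows, delimiter_pattern, header_rows, data_columns) are the ones already used in importdata.m:

```octave
## Split each row exactly once and keep the result, instead of
## calling regexp (..., "split") twice per row.
nrows = length (file_content_rows);
split_rows = cell (nrows, 1);
data_columns = 0;
for i = (header_rows+1):nrows
  split_rows{i} = regexp (file_content_rows{i}, delimiter_pattern, "split");
  data_columns = max (data_columns, length (split_rows{i}));
endfor
## The later conversion stage then indexes split_rows{i}
## rather than re-splitting file_content_rows{i}.
```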
The reason, say, "str2double (row_data{j})" is slow is that the argument
to str2double is a single element. Even though the core of str2double()
is pretty fast, there is precursor type-checking on the arguments:
whether they are strings, whether they are cells, etc. So, when called
this way, str2double spends an inordinate number of CPU cycles not doing
the actual conversion but checking data types. It is better to call
str2double() on large cells or string matrices.
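To illustrate (a toy example, not code from the patch), one call on the whole cell array amortizes that argument checking across all elements:

```octave
strs = {"3.1", "-7.2", "0.012"};

## Slow: one str2double call per element, so the type-checking
## overhead is paid once per conversion.
vals = zeros (1, numel (strs));
for j = 1:numel (strs)
  vals(j) = str2double (strs{j});
endfor

## Fast: one call on the whole cell array; the overhead is paid once.
vals = str2double (strs);   # => [3.1, -7.2, 0.012]
```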
I tried reworking things by removing the whitespace from the character
data stream before breaking the data into rows, then applying
str2double() on all the cells at once. I managed to cut the CPU
consumption to about 1/4 of the current version. But I then wondered
whether there wasn't something else we could use, because a big portion
of the time was spent creating the cells via regexp(...,"split"). In
fact, there already is dlmread(), which I think has enough flexibility
in its arguments to handle the importdata CSV ASCII case. It is so
efficient that I think a better approach is to
1) Just fscanf the first header lines of the file (as opposed to reading
in the whole data file)
2) Use dlmread() to do all the work, which places NaN for the cases
where the conversion failed
3) Look at the data matrix for any NaN and then retroactively read in
the data file and then compute where the associated lines are. I think
I've done it efficiently so that every entry of the file need not be
extracted, just the lines where the NaN occurred.
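In outline, the three steps look like the following sketch. The names fname, header_rows, and delimiter are placeholders for illustration, and the NaN-on-failed-conversion behavior of dlmread is as described above:

```octave
## Step 1: read only the header lines, not the whole file.
fid = fopen (fname, "r");
header = cell (header_rows, 1);
for i = 1:header_rows
  header{i} = fgetl (fid);
endfor
fclose (fid);

## Step 2: let dlmread do the heavy lifting, skipping the header
## rows; entries that fail to convert come back as NaN.
data = dlmread (fname, delimiter, header_rows, 0);

## Step 3: only if NaNs are present, go back to the file and
## extract just the affected lines for output.textdata.
[r, c] = find (isnan (data));
if (! isempty (r))
  ## ... re-read and split only the lines indexed by r ...
endif
```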
The last step slows things down, but it is still pretty efficient. Here
is the CPU consumption for stages of the revamped importdata:
octave:460> aa = importdata_new ('foo.csv');
ans = 1.0000e-03
ans = 0.029996
ans = 0
Whoo-hoo! Factor of 125 speed up. Here are the results when I place a
couple text strings amongst the data columns:
octave:461> aa = importdata_new ('foo_b.csv');
ans = 0
ans = 0.033995
ans = 0.18297
Well, you can see that having to pull the data back in and apply regexp
adds some time, but compared to the current importdata.m it is still
rather minuscule.
Perhaps Rik can take a look at the last part of figuring out the
output.textdata cell contents when the data has text strings in it. I
don't think it can be much more efficient than what I did, but if it is
possible, Rik will find it.
The patch is here:
https://savannah.gnu.org/patch/index.php?8140
There are three tests that fail after applying the patch. We can
discuss those. Basically, I don't agree with some of the results:
%!test
%! # Header
%! A.data = [3.1 -7.2 0; 0.012 6.5 128];
%! A.textdata = {"This is a header row."; \
%!    "this row does not contain any data, but the next one does."};
%
I think that allowing text with spaces rules out both using the space
character as a delimiter and automatically recognizing column names.
For example,
if the first lines of my data file were
TIME VOLTAGE DISPLACEMENT
0 3.3 0.137
0.25 3.4 0.148
0.5 3.6 0.150
how can we tell whether the first line contains data column titles or
is just some textdata?
%!test
%! # Missing values
%! A = [3.1 NaN 0; 0.012 6.5 128];
The above test produces the correct output.data, but while the expected
result is just the data, the new routine also creates an output.textdata
entry for the NaN result, which happens to be an empty string. Isn't
that the proper result?
%!test
%! # CR for line breaks
%! A = [3.1 -7.2 0; 0.012 6.5 128];
%! fn = tmpnam ();
%! fid = fopen (fn, "w");
%! fputs (fid, "3.1\t-7.2\t0\r0.012\t6.5\t128");
The new version of importdata fails on the above test. It would be easy
to correct, as a first step, by searching for any \r and replacing it
with \n. However, I wonder whether the proper fix would be a simple
addition to dlmread(). So let's hold off on this test until we are
certain where it should be fixed.
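For reference, the stopgap substitution mentioned above is a one-liner (an illustration, not part of the patch; file_content is assumed to hold the raw file as one string):

```octave
## Normalize line endings before splitting into rows:
## CRLF first, then any remaining bare CR (old-Mac style).
file_content = strrep (file_content, "\r\n", "\n");
file_content = strrep (file_content, "\r", "\n");
```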
Dan