Subject: importdata different approach
From: Daniel J Sebald
Date: Tue, 30 Jul 2013 23:40:36 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16
Erik,
I used the importdata function last night, and although it works fine
(thank you), it is quite slow for CSV files, even relatively small ones.
I profiled the routine a bit; here are some CPU times for various parts
of the routine (the size of the data is 7383 x 5):
ans = 0.0099990
ans = 0.089986
ans = 0
ans = 0
ans = 0
ans = 0.097985
ans = 0.49592
ans = 3.6494
The main thing to note is that the first stages, which involve the
regexp routine, are rather efficient, while the last stages, which
involve double looping, are quite the opposite. The main issue is that
the following tests and such consume time:
if (any (file_content_rows{i} != " "))
and
data_numeric = str2double (row_data{j});
and
for i=(header_rows+1):length(file_content_rows)
data_columns = max (data_columns,
length (regexp (file_content_rows{i},
delimiter_pattern, "split")));
endfor
and
row_data = regexp (file_content_rows{i}, delimiter_pattern, "split");
Take particular note that the last two operations are duplicating the
same work of splitting the data according to delimiter.
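One way to avoid that duplication (a sketch, not what is in the patch) is to split each row once, cache the resulting cell arrays, and reuse them both for the column count and for the later conversion. The variable names (file_content_rows, delimiter_pattern, header_rows, data_columns) are the ones already used in importdata.m:

```octave
## Split each row exactly once and keep the result, instead of
## calling regexp (..., "split") twice per row.
nrows = length (file_content_rows);
split_rows = cell (nrows, 1);
data_columns = 0;
for i = (header_rows+1):nrows
  split_rows{i} = regexp (file_content_rows{i}, delimiter_pattern, "split");
  data_columns = max (data_columns, length (split_rows{i}));
endfor
## The later conversion stage then indexes split_rows{i}
## rather than re-splitting file_content_rows{i}.
```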
The reason, say, "str2double (row_data{j})" is slow is that the argument
to str2double is a single element. Even though the core of str2double()
is pretty fast, there is precursor type-checking on the arguments:
whether they are strings, whether they are cells, etc. So, when called
this way, str2double spends an inordinate number of CPU cycles not doing
the actual conversion but checking data types. It is better to call
str2double() on large cells or string matrices.
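To illustrate (a toy example, not code from the patch), one call on the whole cell array amortizes that argument checking across all elements:

```octave
strs = {"3.1", "-7.2", "0.012"};

## Slow: one str2double call per element, so the type-checking
## overhead is paid once per conversion.
vals = zeros (1, numel (strs));
for j = 1:numel (strs)
  vals(j) = str2double (strs{j});
endfor

## Fast: one call on the whole cell array; the overhead is paid once.
vals = str2double (strs);   # => [3.1, -7.2, 0.012]
```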
I tried reworking things by removing the whitespace from the character
data stream before breaking the data into rows, then applying
str2double() on all the cells at once. I managed to cut the CPU
consumption to about 1/4 of the current version. But I then wondered
whether there wasn't something else we could use, because a big portion
of the time was spent creating the cells via regexp(...,"split"). In
fact, there already is dlmread(), which I think has enough flexibility
in its arguments to handle the importdata CSV ASCII case. It is so
efficient that I think a better approach is to
1) Just fscanf the first header lines of the file (as opposed to reading
in the whole data file)
2) Use dlmread() to do all the work, which places NaN for the cases
where the conversion failed
3) Look at the data matrix for any NaN and then retroactively read in
the data file and then compute where the associated lines are. I think
I've done it efficiently so that every entry of the file need not be
extracted, just the lines where the NaN occurred.
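In outline, the three steps look like the following sketch. The names fname, header_rows, and delimiter are placeholders for illustration, and the NaN-on-failed-conversion behavior of dlmread is as described above:

```octave
## Step 1: read only the header lines, not the whole file.
fid = fopen (fname, "r");
header = cell (header_rows, 1);
for i = 1:header_rows
  header{i} = fgetl (fid);
endfor
fclose (fid);

## Step 2: let dlmread do the heavy lifting, skipping the header
## rows; entries that fail to convert come back as NaN.
data = dlmread (fname, delimiter, header_rows, 0);

## Step 3: only if NaNs are present, go back to the file and
## extract just the affected lines for output.textdata.
[r, c] = find (isnan (data));
if (! isempty (r))
  ## ... re-read and split only the lines indexed by r ...
endif
```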
The last step slows things down, but it is still pretty efficient. Here
is the CPU consumption for stages of the revamped importdata:
octave:460> aa = importdata_new ('foo.csv');
ans = 1.0000e-03
ans = 0.029996
ans = 0
Whoo-hoo! Factor of 125 speed up. Here are the results when I place a
couple text strings amongst the data columns:
octave:461> aa = importdata_new ('foo_b.csv');
ans = 0
ans = 0.033995
ans = 0.18297
Well, you can see that having to pull the data back in and apply regexp
adds some time, but compared to the current importdata.m it is still
rather minuscule.
Perhaps Rik can take a look at the last part of figuring out the
output.textdata cell contents when the data has text strings in it. I
don't think it can be much more efficient than what I did, but if it is
possible, Rik will find it.
The patch is here:
https://savannah.gnu.org/patch/index.php?8140
There are three tests that fail after applying the patch. We can
discuss those. Basically, I don't agree with some of the results:
%!test
%! # Header
%! A.data = [3.1 -7.2 0; 0.012 6.5 128];
%! A.textdata = {"This is a header row."; \
%!    "this row does not contain any data, but the next one does."};
%
I think that allowing text with spaces rules out both using the space
character as a delimiter and automatically recognizing column names.
For example,
if the first lines of my data file were
TIME VOLTAGE DISPLACEMENT
0 3.3 0.137
0.25 3.4 0.148
0.5 3.6 0.150
how can we tell whether the first line contains data column titles or
is just some textdata?
%!test
%! # Missing values
%! A = [3.1 NaN 0; 0.012 6.5 128];
The above test produces the correct output.data, but while the expected
result is just the data, the new routine also creates an output.textdata
entry for the NaN result, which happens to be an empty string. Isn't
that the proper result?
%!test
%! # CR for line breaks
%! A = [3.1 -7.2 0; 0.012 6.5 128];
%! fn = tmpnam ();
%! fid = fopen (fn, "w");
%! fputs (fid, "3.1\t-7.2\t0\r0.012\t6.5\t128");
The new version of importdata fails on the above test. It would be easy
to correct, as a first step, by searching for any \r and replacing it
with \n. However, I wonder whether the proper fix would be a simple
addition to dlmread(). So let's hold off on this test until we are
certain where it should be fixed.
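For reference, the stopgap substitution mentioned above is a one-liner (an illustration, not part of the patch; file_content is assumed to hold the raw file as one string):

```octave
## Normalize line endings before splitting into rows:
## CRLF first, then any remaining bare CR (old-Mac style).
file_content = strrep (file_content, "\r\n", "\n");
file_content = strrep (file_content, "\r", "\n");
```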
Dan