[Octave-bug-tracker] [bug #51871] loading '-ascii' format files is slow


From: Dan Sebald
Subject: [Octave-bug-tracker] [bug #51871] loading '-ascii' format files is slow
Date: Thu, 31 Aug 2017 13:52:15 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0

Follow-up Comment #16, bug #51871 (project octave):

Regarding delimiters, see the MATLAB documentation for "load":
http://www.mathworks.com/help/matlab/ref/load.html#input_argument_d0e580248 .
Semicolon is in the list, and MATLAB does the same thing Octave is doing,
i.e., it accepts more variations of input characters than it writes on output.

I've been updating from Mint 17.3 to Mint 18.2 over the past couple days
(looks the same, but it's a change from gtk2 to gtk3...seems more robust so
far).  So I can't try many changes just yet.  I'll just make some comments for
now.

This hunk


+      // Remove trailing '\r'.
+      while (retval.size () && retval.back () == '\r')
+        retval.pop_back ();


doesn't make much difference then, probably because retval is a std::string.
I was thinking that retval.size() is like strlen() in the sense that it has
to scan from the start of the string to find the '\0'.  However, a
std::string stores its length, so size() does not scan.  Thus, as Count
points out, retval.size() and retval.back() are very cheap, constant-time
operations.  The CPU overhead is all in the creation and initialization of
the string.
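
To make the size()/back() point concrete, here is a minimal sketch of the same loop as a stand-alone function.  The name strip_trailing_cr is mine, not something in the patch.  The loop touches only the end of the string, so its cost depends on the number of trailing '\r' characters, not on the line length.

```cpp
#include <string>

// Hypothetical helper (not from the patch): strip every trailing
// carriage return from a line, in place.
// std::string keeps track of its length, so empty(), back() and
// pop_back() are constant-time; nothing rescans the line the way
// strlen() would on a C string.
void strip_trailing_cr (std::string& line)
{
  while (! line.empty () && line.back () == '\r')
    line.pop_back ();
}
```
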

Definitely we want to avoid this use of strings:


            std::string buf = get_mat_data_input_line (is);


because get_mat_data_input_line() creates a fresh string every time it is
called.  I think Count is right that each call requires a malloc().
That's too much system activity.  I would hope that this construct


+              std::getline (is, buf);


where buf is a string created only once, only exercises malloc() when the
string needs more space than it already has, i.e., when a line is longer than
any line read before it.

From #14:

>> With patch in #10, plus without get_lines_and_columns() (by hand
>> input the nc and nr): 0.981622 sec.

That's a big difference.  It sure would be nice to eliminate that 2.3 s.
Rather than get the exact number of rows on the first pass, could we get an
upper bound on the number of rows (e.g., count the newline characters in the
data with some fast C function) and then trim the Matrix back once we know the
actual size from the second pass, which tosses out comment lines?
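
Here is a rough sketch of that idea, not a patch.  Everything in it is hypothetical: the Table struct stands in for Matrix, the data is assumed to be in memory already as one string, the column count is passed in by hand, and only whitespace delimiters are handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for Matrix: row-major storage.
struct Table
{
  std::size_t rows = 0;
  std::size_t cols = 0;
  std::vector<double> data;
};

Table read_table (const std::string& text, std::size_t cols)
{
  // Pass 1: cheap upper bound on the row count.  Comment and blank
  // lines are counted too, so this can only overestimate.
  const std::size_t max_rows
    = static_cast<std::size_t> (std::count (text.begin (), text.end (), '\n'))
      + 1;

  Table t;
  t.cols = cols;
  t.data.resize (max_rows * cols);

  // Pass 2: parse, skipping lines that do not hold a full row.
  std::istringstream is (text);
  std::string buf;
  while (std::getline (is, buf))
    {
      const std::size_t pos = buf.find_first_of ("#%");
      if (pos != std::string::npos)
        buf.erase (pos);

      const char *p = buf.c_str ();
      std::size_t c = 0;
      while (c < cols)
        {
          char *end = nullptr;
          const double v = std::strtod (p, &end);
          if (end == p)
            break;                      // nothing numeric left
          t.data[t.rows * cols + c++] = v;
          p = end;
        }

      if (c == cols)
        t.rows++;                       // keep only complete rows
    }

  // Trim back to the number of rows actually stored.
  t.data.resize (t.rows * cols);
  return t;
}
```

The first pass is a single std::count() over the buffer, so it never looks at comments or numbers; the price is a temporary over-allocation of one row per comment or blank line.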

>> And no removal of any comment in get_mat_data_input_line(): 0.735889
>> sec.

That's a pretty sizable fraction too, once we first bring the benchmark down
to the order of one second.  This hunk


+      // Remove any comment.
+      size_t pos_comment = retval.find_first_of ("#%");
+      if (pos_comment != std::string::npos)
+        retval.erase (pos_comment);


is the one I was most concerned about, because regardless of whether retval
is a string or a simple buffer, this code has to scan through the whole line
if there is no comment, i.e., the most probable scenario.  That is why I say
it might be faster to scan for numeric values first, and only check for
comment characters when that fails.
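
A sketch of what I mean by "numbers first, comments on failure".  The function name is made up, and it leans on std::strtod(), which is locale-dependent and also accepts things like "inf", "nan" and hex floats, so treat it as an illustration rather than a drop-in replacement.  Commas and semicolons as delimiters are not handled here.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: parse the doubles on one line and look for a
// comment character only at the point where numeric parsing stops.
// Returns false if the line contains something that is neither a
// number nor the start of a comment.
bool parse_line (const std::string& line, std::vector<double>& out)
{
  out.clear ();
  const char *p = line.c_str ();

  for (;;)
    {
      char *end = nullptr;
      const double v = std::strtod (p, &end);
      if (end == p)
        break;                  // no conversion was performed
      out.push_back (v);
      p = end;
    }

  // strtod() skips leading whitespace only when a conversion
  // succeeds, so skip it by hand before looking at the stop character.
  while (*p == ' ' || *p == '\t' || *p == '\r')
    p++;

  return *p == '\0' || *p == '#' || *p == '%';
}
```

On a line with no comment, each character is visited once by the number parser and there is no separate find_first_of ("#%") pass.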

This hunk


+      // Detect non-whitespace.
+      no_data_found = (retval.find_first_not_of (" \t\r")
+                       == std::string::npos);


, however, is a negative search, so it returns as soon as it finds the first
non-whitespace character.  So that's not CPU-consuming (I may have been
thinking otherwise previously).

BUT, isn't the above a bug?  There are plenty of non-numeric characters that
will cause no_data_found to be false.  For example, the second line here


1.2 3.4 5.6
@ @ @
6.5 4.3 2.1


Do strings have some kind of member function similar to
"isnumeric()"?  Say, retval.find_first_numeric()?  We wouldn't want
to spell out a character set as long as

retval.find_first_of ("0123456789.-+")

would we?
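
As far as I know, std::string has no numeric-search member; the nearest thing is an algorithm plus std::isdigit() from <cctype>.  The sketch below puts the check from the hunk next to a stricter "at least one digit" test.  Both function names are made up, and the digit test is only a heuristic: "abc1" passes and "Inf" or "NaN" fail.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// The check from the hunk: any non-whitespace character counts as data.
bool has_non_whitespace (const std::string& s)
{
  return s.find_first_not_of (" \t\r") != std::string::npos;
}

// Hypothetical stricter check: the line holds at least one decimal digit.
bool has_digit (const std::string& s)
{
  return std::any_of (s.begin (), s.end (),
                      [] (unsigned char c)
                      { return std::isdigit (c) != 0; });
}
```

With the example above, the line "@ @ @" satisfies has_non_whitespace() but not has_digit(), which is exactly the case the current test lets through.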

There are other comments I could make, but let's first get the main
bottlenecks out of the way.

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?51871>
