From: Dan Sebald
Subject: [Octave-bug-tracker] [bug #50619] textscan weird behaviour when reading a csv
Date: Sat, 25 Mar 2017 18:38:58 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0

Follow-up Comment #12, bug #50619 (project octave):

Here's the comment for the delimited_stream object:


  // Delimited stream, optimized to read strings of characters separated
  // by single-character delimiters.
  //
  // The reason behind this class is that octstream doesn't provide
  // seek/tell, but the opportunity has been taken to optimise for the
  // textscan workload.
  //
  // The function reads chunks into a 4kiB buffer, and marks where the
  // last delimiter occurs.  Reads up to this delimiter can be fast.
  // After that last delimiter, the remaining text is moved to the front
  // of the buffer and the buffer is refilled.  This also allows cheap
  // seek and tell operations within a "fast read" block.


I take it that the std::stream routines are slow when reading one character
at a time, so this class brings a chunk of data into a resident buffer so
that subsequent accesses are fast.  I'm not sure what the value is of a
non-absolute seek and tell that only refer to positions within the current
buffer, though.
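
Just to check that I'm reading the intent correctly, here is a rough sketch
of what I understand the chunked-buffer scheme to be.  The class and member
names below are made up for illustration; this is not the actual
delimited_stream interface.

  #include <cstddef>
  #include <istream>
  #include <vector>

  class chunk_reader
  {
  public:
    explicit chunk_reader (std::istream& is, std::size_t bufsize = 4096)
      : m_is (is), m_buf (bufsize), m_pos (0), m_end (0) { }

    // "tell" is relative to the start of the current buffer, not the file.
    std::size_t tell () const { return m_pos; }

    // "seek" is only valid within the data currently held in the buffer.
    void seek (std::size_t pos) { if (pos <= m_end) m_pos = pos; }

    // Get one character, refilling the buffer when it is exhausted.
    int get ()
    {
      if (m_pos >= m_end && ! refill ())
        return std::istream::traits_type::eof ();
      return std::istream::traits_type::to_int_type (m_buf[m_pos++]);
    }

  private:
    bool refill ()
    {
      // The real code first moves the tail after the last delimiter to
      // the front of the buffer; that step is omitted here for brevity.
      m_is.read (m_buf.data (), m_buf.size ());
      m_end = static_cast<std::size_t> (m_is.gcount ());
      m_pos = 0;
      return m_end > 0;
    }

    std::istream& m_is;
    std::vector<char> m_buf;
    std::size_t m_pos, m_end;
  };

In other words, the seek/tell pair is only meaningful while the data it
refers to is still resident in the buffer.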

Also, what if there is no "last delimiter" in the buffer because the length of
the string is greater than the length of the buffer?

Another issue, given that reading chunks of data is the goal, is that the
existing code doesn't go about it very efficiently.  For example, here is
the code that reads a string:


        std::string vv ("        ");      // initial buffer.  Grows as needed
        switch (fmt.type)
          {
          case 's':
            scan_string (is, fmt, vv);
            break;



For every field read, a fresh std::string is created "on the stack".  Now,
"on the stack" really means that only the std::string object itself is on
the stack; the character data it manages is heap-allocated (otherwise the
stack could easily overflow if the string were very long).  But the point
is that every field gets a brand-new std::string that has to be grown from
scratch.  Instead, why not keep a std::string as part of the
delimited_stream object, one that grows as large as the largest field it
ever sees, and whose memory is released when the delimited stream is
deleted?
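
Something along these lines, say -- the class name and member are
hypothetical, just to show the idea of a scratch string owned by the stream
object:

  #include <string>

  // Hypothetical sketch: a scratch string owned by the stream object, so
  // it is allocated once and its capacity only grows to the largest field
  // ever seen.
  class delimited_stream_sketch
  {
  public:
    // Callers reuse this instead of constructing a fresh std::string for
    // every field.
    std::string& field_buffer ()
    {
      m_field.clear ();   // cheap: clears the contents, keeps the capacity
      return m_field;
    }

  private:
    std::string m_field;  // released when the stream object is destroyed
  };

The field reader would then do something like
"std::string& vv = ds.field_buffer ();" followed by
"scan_string (is, fmt, vv);", so the allocation cost is paid at most once
per stream rather than once per field.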

The per-field allocation matters because, looking at scan_string:


        for (i = 0; i < width; i++)
          {
            if (i+1 > val.length ())
              val = val + val + ' ';      // grow even if empty
            int ch = is.get ();
            if (is_delim (ch) || ch == std::istream::traits_type::eof ())
              {
                is.putback (ch);
                break;
              }
            else
              val[i] = ch;
          }


every time through the loop there is a check whether the std::string is
long enough.  Why not first use a fast character-scanning function to find
the distance to the next delimiter, then expand the std::string once, and
then copy the data?
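
For a single delimiter character the scan could be as simple as a memchr
over the resident buffer (multiple delimiters would need a strpbrk-style
loop).  Purely as a sketch, assuming the current fast-read block sits in
memory as buf[pos, end):

  #include <cstring>
  #include <string>

  // Sketch only: find the field length first, then grow and copy once.
  std::size_t
  scan_field (const char *buf, std::size_t pos, std::size_t end,
              char delim, std::string& val)
  {
    const char *start = buf + pos;
    const void *hit = std::memchr (start, delim, end - pos);
    std::size_t len = hit
      ? static_cast<std::size_t> (static_cast<const char *> (hit) - start)
      : end - pos;

    val.assign (start, len);  // one resize and one copy, no per-char check
    return pos + len;         // position of the delimiter (or end)
  }

The width limit and multi-character delimiter sets would still have to be
handled on top of this, but the per-character length check goes away.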

There are a lot of FIXMEs in this code, probably because the buffering
scheme wasn't fully planned out.

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?50619>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/



