Re: Improving strread / textread / textscan

octave-maintainers

From:	Philip Nienhuis
Subject:	Re: Improving strread / textread / textscan
Date:	Mon, 24 Oct 2011 23:47:05 +0200
User-agent:	Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Ben Abbott wrote:

On Oct 24, 2011, at 2:49 PM, Philip Nienhuis wrote:

Answers to three emails in one:

Ben Abbott wrote:

<snip>

Test #11: Passed.


Hmmm... on ML2007a, I get:
Test #11: Failed.
OBSERVED:
   49   10   76   50

EXPECTED:
   76   49   10   76   50

So ML is inconsistent...

( Note I fixed some typos in your script :-)  )


I'm confused. Did you run a modified test #11? If so, how did the unmodified 
script behave, and can you show us what you changed?

I copied/pasted your code into the ML editor, and only fixed the typos (OBSEVED -> OBSERVED, and "no enough" -> "not enough" in oct_assert.m).

Test #12: Failed.
OBSEVED:
            2

EXPECTED:
            2
            4
            0

Test #13: Passed.
Test #14: Passed.

The script with the  tests and the oct_assert function are attached.


Apparently ML doesn't recognize empty fields squeezed between two literals.


For reference, test 12 is ...

str = sprintf ('Text1Text2Text\nTextText4Text\nText57Text');
c = textscan (str, 'Text%*dText%dText');
fprintf ('Test #12:')
oct_assert (c{1}, int32 ([2; 4; 0]));

Looking at the table under "User Configurable Options" (link below), MW indicates that "EmptyValue" 
is the "Value to return for empty numeric fields in delimited files." I read this to mean that empties only 
occur between the characters defined as "delimiters".

        http://www.mathworks.com/help/techdoc/ref/textscan.html

Replace the literals (i.e. "Text") with delimiters ...

c = textscan (sprintf ('1,2\n,4\n57,'), '%*d%d', 'delimiter', ',');
c{:}

ans =

            2
            4

Notice the last value isn't between two delimiters, but is preceded by a 
delimiter and followed by white-space. If a second delimiter is added, then ...

c = textscan (sprintf ('1,2\n,4\n57,,'), '%*d%d', 'delimiter', ',');
c{:}

ans =

            2
            4
            0

I haven't studied the docs very deeply, and have only looked at the docs for 
R2011b, but it looks to me that ML is behaving in a manner that is consistent 
with its documentation (admittedly the documentation is rather esoteric).

I'd say Octave complies more strictly with the rules. But admittedly this is an extreme example.

Note that processing literals differs from processing of delimiters.

<snip>

=========================
Ben Abbott wrote:


I've made some modifications to your original notes, and added a few more below.

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that 
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) imply an empty 
value.
g. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty 
value is just an empty string.
h. If so desired, multiple consecutive delimiters can be folded into one delimiter if 
"MultipleDelimsAsOne" parameter is set to 1.
i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
affected by the MultipleDelimsAsOne parameter.


As to strread, there's another ML subrule:
<QUOTE>
If your data uses a character other than a space as a delimiter, you must use 
the strread parameter 'delimiter' to specify the delimiter
</QUOTE>
What is it, space or whitespace?


Are you referring to the different EOLs? I'm not entirely sure what you are 
asking, but I'll make a guess.


Sorry for not being clear enough.

At one place in the docs, ML says "fields are separated by whitespace", while a bit further down is the quote I gave above, which only mentions genuine spaces.

Textread operates on one line at a time. If an attempt is made to read past the 
end of a line with a single format statement, empties will be inserted for 
those fields read past the EOL.

c = textscan (sprintf ('1\n2\n\n4\n57\n\n'), '%*d%d', 'delimiter', ',');

c{:}


ans =

            0
            0
            0
            0

Unfortunately, I missed catching the problems with "i" before. I think it 
should read ...

i. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.

Does that make sense?


Not all of it.

An EOL can also be a field delimiter. That's obvious, because an EOL naturally cuts off fields if there's no other delimiter first.

The rest of i. looks correct to me.
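
Ben's proposed wording of rule i. (lines first, fields second) can be sketched roughly as below. This is only an illustration: split_lines is a hypothetical helper, not part of Octave or ML, and skipping blank lines is an assumption inferred from the four-row ML result shown above. Fields beyond the requested count are simply dropped in this sketch.

```octave
1;  % marks this file as a script so the function can be defined inline

% Hypothetical helper: split text into lines first, then into fields.
function rows = split_lines (str, delim, nfields)
  % EOL sequences (\r\n, \n or \r) delimit lines of input.
  txt_lines = regexp (str, '\r\n|\n|\r', 'split');
  % Assumption: blank lines are skipped.
  txt_lines = txt_lines(~cellfun (@isempty, txt_lines));
  % Fields read beyond an EOL stay empty.
  rows = cell (numel (txt_lines), nfields);
  rows(:) = {''};
  for r = 1:numel (txt_lines)
    pat = regexptranslate ('escape', delim);
    parts = strtrim (regexp (txt_lines{r}, pat, 'split'));
    n = min (numel (parts), nfields);
    rows(r, 1:n) = parts(1:n);
  end
end

rows = split_lines (sprintf ('1\n2\n\n4\n57\n\n'), ',', 2);
% Second column, converted the way an int32 field would be:
% empty -> NaN -> 0, because int32 cannot hold NaN.
v = int32 (str2double (rows(:, 2)))
```

With the input from the textscan example above, every line holds only one field, so the second column is empty throughout and converts to four zeros.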

Anyway, if your & my collection of inferred rules applies, I do not 
understand this (7th test of Octave strread.m):

octave:23>  a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
a =
{
  [1,1] = a b c
  [2,1] = d e
  [3,1] =
  [4,1] = f
}
(Same goes for ML)


I hadn't considered this before.  I'll have to study the docs again to see if there is a 
reference to this. I did try dropping the "delimiter" to see what happens.

a = textscan ('a b c, d e, , f', '%s');

a{:}

ans =

     'a'
     'b'
     'c,'
     'd'
     'e,'
     ','
     'f'

because in this example there are spaces ("whitespace") separating e.g., 'a' 
and 'b'.

But (ML):

a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')

a =
     1
     2
     3
     4
     5
     0
     6

In the above cases, I get the same results for textscan.

So it seems that interpretation & processing of default whitespace depends on 
the field format specifier as well?

It appears that ML doesn't use the white-space property to delimit strings when the
"delimiter" property has been specified. I've added another line to the list ("g") and
revised the EOL rule (now "j").

a. "Words" or fields (to be interpreted later) are separated by white-space or
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace"
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields. Multiple
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) implies an empty
value.
g. If the delimiter property is specified, then white-space is *not* used to
delimit character fields. However, white-space is always used to delimit
numeric fields.
h. By default "emptyvalue" is NaN for numeric data types. If the numeric type
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty
value is just an empty string.
i. If so desired, multiple consecutive delimiters can be folded into one delimiter if
"MultipleDelimsAsOne" parameter is set to 1.
j. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter.
Any fields read beyond an EOL are treated as being empty.
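
For a single line of input, rules d. through h. can be sketched roughly as below. This is an illustration only, not how strread.m or ML is implemented: split_fields is a hypothetical helper, MultipleDelimsAsOne is not modeled, and the conversion step assumes the default "emptyvalue".

```octave
1;  % marks this file as a script so the function can be defined inline

% Hypothetical helper: models rules d. to g. for one line of text.
function fields = split_fields (str, delim, is_numeric)
  % Rule d: split on every delimiter; consecutive ones are not folded.
  parts = regexp (str, regexptranslate ('escape', delim), 'split');
  % Rule e: white-space next to a delimiter is absorbed by it.
  parts = strtrim (parts);
  fields = {};
  for k = 1:numel (parts)
    if (isempty (parts{k}))
      % Rule f: nothing between two delimiters is one empty field.
      fields{end+1} = '';
    elseif (is_numeric)
      % Rule g: numeric fields are also split on white-space runs.
      fields = [fields, regexp(parts{k}, '\s+', 'split')];
    else
      % Rule g: character fields keep their embedded white-space.
      fields{end+1} = parts{k};
    end
  end
end

s = split_fields ('a b c, d e, , f', ',', false)
n = split_fields ('1 2 3, 4 5, , 6', ',', true);
% Rule h: empty -> NaN, and NaN -> 0 for types such as int32.
v = int32 (str2double (n))
```

On the two strings from the strread examples this gives four character fields ('a b c', 'd e', an empty string, 'f') and the numeric values 1 2 3 4 5 0 6, which matches what ML reports.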

Does this look correct to you?


Overall, yes, save for the EOL rule (i. above, now j.), as mentioned.

But as to g., ML seems inconsistent. Spaces in character strings would only be preserved if whitespace is set to "" (empty), according to the ML docs (they even have an example of this).

Strict compliance with rule g. might render patching of strread.m much more complicated, as for each individual format specifier we'd have to check the whitespace/delimiters around the field in question, depending on the format specifier's nature. This is more easily done in a compiled version that linearly ploughs through the text string than in the current strread.m, which works by parsing complete columns one by one.

I can try to implement rule g. in a quick-and-dirty fashion; perhaps this will solve the actual bug that provoked my renewed interest.

How much further should we go in fixing current strread (the work horse for textscan and textread), given the end-of-life for strread in ML plus jwe's upcoming compiled textscan version? (If he, or someone else, ever gets time to finish it, of course.)

I'm not in favor of blindly imitating as much as we can of the more obscure, or undocumented, or inconsistent, or corner case behavior of ML.

I'd prefer clarity and consistency over strict ML compatibility.

Your suggestion of documenting the Octave behavior that ML didn't document for its own functions is to be applauded.

BTW our (= mostly your) investigation of ML behavior does serve a purpose, i.e. to enhance jwe's textscan.


Philip
