octave-maintainers
Re: Improving strread / textread / textscan


From: Philip Nienhuis
Subject: Re: Improving strread / textread / textscan
Date: Mon, 24 Oct 2011 23:47:05 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Ben Abbott wrote:
On Oct 24, 2011, at 2:49 PM, Philip Nienhuis wrote:

Answers to three emails in one:

Ben Abbott wrote:
<snip>
Test #11: Passed.

Hmmm... on ML2007a, I get:
Test #11: Failed.
OBSERVED:
   49   10   76   50

EXPECTED:
   76   49   10   76   50

So ML is inconsistent...

( Note I fixed some typos in your script :-)  )

I'm confused. Did you run a modified test #11? If so, how did the unmodified 
script behave, and can you show us what you changed?

I copied/pasted your code into the ML editor, and only adapted the typos (OBSEVED -> OBSERVED, and "no enough"-> "not enough" in oct_assert.m).

Test #12: Failed.
OBSEVED:
            2

EXPECTED:
            2
            4
            0

Test #13: Passed.
Test #14: Passed.

The script with the  tests and the oct_assert function are attached.

Apparently ML doesn't recognize empty fields squeezed between two literals.

For reference, test 12 is ...

str = sprintf ('Text1Text2Text\nTextText4Text\nText57Text');
c = textscan (str, 'Text%*dText%dText');
fprintf ('Test #12:')
oct_assert (c{1}, int32 ([2; 4; 0]));

Looking at the table under "User Configurable Options" (link below), MW indicates that "EmptyValue" 
is the "Value to return for empty numeric fields in delimited files." I read this to mean that empties only 
occur between the characters defined as "delimiters".

        http://www.mathworks.com/help/techdoc/ref/textscan.html

Replace the literals (i.e. "Text") with delimiters ...

c = textscan (sprintf ('1,2\n,4\n57,'), '%*d%d', 'delimiter', ',');
c{:}

ans =

            2
            4

Notice the last value isn't between two delimiters, but is preceded by a 
delimiter and followed by white-space. If a second delimiter is added, then ...

c = textscan (sprintf ('1,2\n,4\n57,,'), '%*d%d', 'delimiter', ',');
c{:}

ans =

            2
            4
            0

I haven't studied the docs very deeply, and have only looked at the docs for 
R2011b, but it looks to me that ML is behaving in a manner that is consistent 
with its documentation (admittedly the documentation is rather esoteric).

I'd say Octave complies more strictly with the rules. But admittedly this is an extreme example.
Note that processing literals differs from processing of delimiters.

<snip>
=========================
Ben Abbott wrote:

I've made some modifications to your original notes, and added a few more below.

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that 
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) imply an empty 
value.
g. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty 
value is just an empty string.
h. If so desired, multiple consecutive delimiters can be folded into one delimiter if 
"MultipleDelimsAsOne" parameter is set to 1.
i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
affected by the MultipleDelimsAsOne parameter.

As to strread, there's another ML subrule:
<QUOTE>
If your data uses a character other than a space as a delimiter, you must use 
the strread parameter 'delimiter' to specify the delimiter
</QUOTE>
What is it, space or whitespace?

Are you referring to the different EOLs? I'm not entirely sure what you are 
asking, but I'll make a guess.

Sorry for not being clear enough.
At one place in the docs, ML says "fields are separated by whitespace", while a bit further down is the quote I gave above which only mentions genuine spaces.

Textread operates on one line at a time. If an attempt is made to read past the 
end of a line with a single format statement, empties will be inserted for 
those fields read past the EOL.

c = textscan (sprintf ('1\n2\n\n4\n57\n\n'), '%*d%d', 'delimiter', ',');
c{:}

ans =

            0
            0
            0
            0

Unfortunately, I missed catching the problems with "i" before. I think it 
should read ...

i. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.

Does that make sense?

Not all of it.
An EOL can also be a field delimiter. That is obvious, because an EOL naturally terminates a field if no other delimiter comes first.
The rest of i. looks correct to me.
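
The "fields read beyond an EOL are empty" part can be illustrated with a minimal sketch in plain Octave. This is not how strread.m or textscan work internally; the helper name pad_lines is made up for illustration, and skipping blank lines is an assumption inferred from the textscan output quoted above.

```octave
1;  % script marker, so the function below can be defined inline

% Hypothetical helper: split TEXT into lines, then each line into NFIELDS
% fields on DELIM.  Fields that would lie beyond the EOL come back empty.
function c = pad_lines (text, delim, nfields)
  rows = regexp (text, '\r\n|\n|\r', 'split');
  % Assumption: blank lines are skipped altogether.
  rows = rows(! cellfun (@isempty, rows));
  c = cell (numel (rows), nfields);
  c(:) = {''};
  for i = 1:numel (rows)
    parts = regexp (rows{i}, regexptranslate ('escape', delim), 'split');
    n = min (numel (parts), nfields);
    c(i, 1:n) = parts(1:n);
  endfor
endfunction

c = pad_lines (sprintf ('1\n2\n\n4\n57\n\n'), ',', 2);
second = str2double (c(:, 2));   % empty strings become NaN
second(isnan (second)) = 0;      % integer-style empty value
disp (second')                   % four zeros
```

Here the EOL ends both the line and its last field, so the second field of every line is read past the EOL and comes back empty.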


Anyway, if your & my collection of inferred rules applies, I do not 
understand this (7th test of Octave strread.m):

octave:23>  a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
a =
{
  [1,1] = a b c
  [2,1] = d e
  [3,1] =
  [4,1] = f
}
(Same goes for ML)

I hadn't considered this before.  I'll have to study the docs again to see if there is a 
reference to this. I did try dropping the "delimiter" to see what happens.

a = textscan ('a b c, d e, , f', '%s');

a{:}

ans =

     'a'
     'b'
     'c,'
     'd'
     'e,'
     ','
     'f'

because in this example there are spaces ("whitespace") separating e.g., 'a' 
and 'b'.

But (ML):
a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')
a =
     1
     2
     3
     4
     5
     0
     6

In the above cases, I get the same results for textscan.

So it seems that interpretation & processing of default whitespace depends on 
the field format specifier as well?

It appears that ML doesn't treat white-space characters as delimiters for string fields when the 
"delimiter" property has been specified. I've updated the list accordingly (specifically 
items "g" and "j").

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that 
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) implies an empty 
value.
g. If the delimiter property is specified, then white-space is *not* used to 
delimit character fields. However, white-space is always used to delimit 
numeric fields.
h. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty 
value is just an empty string.
i. If so desired, multiple consecutive delimiters can be folded into one delimiter if 
"MultipleDelimsAsOne" parameter is set to 1.
j. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.
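
Rules d., e., f., h. and i. for a single line of numeric data can be sketched roughly as follows. This is a toy model only, not the strread.m code: the helper name numeric_fields and its arguments are hypothetical, it assumes fields are separated by the delimiter only (no white-space-only separation), and anything str2double cannot convert is treated as empty.

```octave
1;  % script marker, so the function below can be defined inline

% Hypothetical helper, not part of strread.m: split one line STR into
% numeric fields.  EMPTYVAL replaces empty fields (rule h); if MULTI is
% true, runs of delimiters count as one (rule i).
function v = numeric_fields (str, delim, emptyval, multi)
  d = regexptranslate ('escape', delim);
  if (multi)
    % Fold consecutive delimiters (and white-space between them) into one.
    pat = ['\s*(?:' d '\s*)+'];
  else
    % Fold white-space around exactly one delimiter (rule e).
    pat = ['\s*' d '\s*'];
  endif
  parts = regexp (strtrim (str), pat, 'split');
  v = str2double (parts);          % empty field -> NaN (rule f)
  v(isnan (v)) = emptyval;
endfunction

disp (numeric_fields ('1, 2,,4 , ,6', ',', NaN, false))  % six fields, two empty
disp (numeric_fields ('1, 2,,4 , ,6', ',', NaN, true))   % four fields
```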

Does this look correct to you?

Overall, yes, save for j. (i. in the earlier list), as mentioned above.
But as to g., ML seems inconsistent. According to the ML docs, spaces in character strings would only be preserved if whitespace is set to "" (empty); they even have an example of this.

Strict compliance with rule g. might make patching strread.m much more complicated, as for each individual format specifier we'd have to check the whitespace/delimiters around the field in question, depending on the specifier's type. This is more easily done in a compiled version that ploughs linearly through the text string than in the current strread.m, which works by parsing complete columns one by one. I can try to implement rule g. in a quick-and-dirty fashion; perhaps this will solve the actual bug that provoked my renewed interest.
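
To show what rule g. asks for, here is a minimal sketch, assuming a single format specifier applied to the whole string. It is not the quick-and-dirty patch itself; the helper name split_rule_g and the is_numeric flag are hypothetical stand-ins for the per-specifier check described above.

```octave
1;  % script marker, so the function below can be defined inline

% Hypothetical helper: split STR on DELIM the way rule g. describes.
% For char fields only the delimiter separates fields; for numeric
% fields white-space separates them as well.
function fields = split_rule_g (str, delim, is_numeric)
  parts = regexp (str, regexptranslate ('escape', delim), 'split');
  parts = strtrim (parts);       % white-space next to a delimiter is folded
  if (! is_numeric)
    fields = parts(:);
    return;
  endif
  fields = cell (0, 1);
  for i = 1:numel (parts)
    if (isempty (parts{i}))
      fields(end+1, 1) = {''};   % empty field between two delimiters
    else
      words = regexp (parts{i}, '\s+', 'split');
      fields = [fields; words(:)];
    endif
  endfor
endfunction

s = split_rule_g ('a b c, d e, , f', ',', false)
n = str2double (split_rule_g ('1 2 3, 4 5, , 6', ',', true));
n(isnan (n)) = 0                 % empty value for an integer format
```

With the two inputs quoted earlier, this gives the four strings 'a b c', 'd e', '' and 'f' for the char case and 1 2 3 4 5 0 6 for the numeric case.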

How much further should we go in fixing the current strread (the workhorse for textscan and textread), given the end-of-life of strread in ML plus jwe's upcoming compiled textscan version (if he, or someone else, ever gets time to finish it, of course)? I'm not in favor of blindly imitating as much as we can of the more obscure, undocumented, inconsistent, or corner-case behavior of ML.
I'd prefer clarity and consistency over strict ML compatibility.
Your suggestion of documenting the Octave behavior that ML didn't document for its own functions is to be applauded.

BTW our (= mostly your) investigation of ML behavior does serve a purpose, i.e. to enhance jwe's textscan.

Philip

