Re: Improving strread / textread / textscan
From: Philip Nienhuis
Subject: Re: Improving strread / textread / textscan
Date: Mon, 24 Oct 2011 23:47:05 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6
Ben Abbott wrote:
On Oct 24, 2011, at 2:49 PM, Philip Nienhuis wrote:
Answers to three emails in one:
Ben Abbott wrote:
<snip>
Test #11: Passed.
Hmmm... on ML2007a, I get:
Test #11: Failed.
OBSERVED:
49 10 76 50
EXPECTED:
76 49 10 76 50
So ML is inconsistent...
( Note I fixed some typos in your script :-) )
I'm confused. Did you run a modified test #11? If so, how did the unmodified
script behave, and can you show us what you changed?
I copied/pasted your code into the ML editor, and only adapted the typos
(OBSEVED -> OBSERVED, and "no enough"-> "not enough" in oct_assert.m).
Test #12: Failed.
OBSEVED:
2
EXPECTED:
2
4
0
Test #13: Passed.
Test #14: Passed.
The script with the tests and the oct_assert function are attached.
Apparently ML doesn't recognize empty fields squeezed between two literals.
For reference, test 12 is ...
str = sprintf ('Text1Text2Text\nTextText4Text\nText57Text');
c = textscan (str, 'Text%*dText%dText');
fprintf ('Test #12:')
oct_assert (c{1}, int32 ([2; 4; 0]));
Looking at the table under "User Configurable Options" (link below), MW indicates that "EmptyValue"
is the "Value to return for empty numeric fields in delimited files." I read this to mean that empties only
occur between the characters defined as "delimiters".
http://www.mathworks.com/help/techdoc/ref/textscan.html
Replace the literals (i.e. "Text") with delimiters ...
c = textscan (sprintf ('1,2\n,4\n57,'), '%*d%d', 'delimiter', ',');
c{:}
ans =
2
4
Notice the last value isn't between two delimiters, but is preceded by a
delimiter and followed by white-space. If a second delimiter is added, then ...
c = textscan (sprintf ('1,2\n,4\n57,,'), '%*d%d', 'delimiter', ',');
c{:}
ans =
2
4
0
I haven't studied the docs very deeply, and have only looked at the docs for
R2011b, but it looks to me that ML is behaving in a manner that is consistent
with its documentation (admittedly the documentation is rather esoteric).
I'd say Octave complies more strictly with the rules. But admittedly this
is an extreme example.
Note that processing literals differs from processing of delimiters.
<snip>
=========================
Ben Abbott wrote:
I've made some modifications to your original notes, and added a few more below.
a. "Words" or fields (to be interpreted later) are separated by white-space or
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace"
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields. Multiple
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) imply an empty
value.
g. By default "emptyvalue" is NaN for numeric data types. If the numeric type
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty
value is just an empty string.
h. If so desired, multiple consecutive delimiters can be folded into one delimiter if
"MultipleDelimsAsOne" parameter is set to 1.
i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not
affected by the MultipleDelimsAsOne parameter.
As to strread, there's another ML subrule:
<QUOTE>
If your data uses a character other than a space as a delimiter, you must use
the strread parameter 'delimiter' to specify the delimiter
</QUOTE>
What is it, space or whitespace?
Are you referring to the different EOLs? I'm not entirely sure what you are
asking, but I'll make a guess.
Sorry for not being clear enough.
At one place in the docs, ML says "fields are separated by whitespace",
while a bit further down is the quote I gave above which only mentions
genuine spaces.
Textread operates on one line at a time. If an attempt is made to read past the
end of a line with a single format statement, empties will be inserted for
those fields read past the EOL.
c = textscan (sprintf ('1\n2\n\n4\n57\n\n'), '%*d%d', 'delimiter', ',');
c{:}
ans =
0
0
0
0
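The line-by-line reading described above can be imitated with a few lines of Octave. This is only a rough sketch, not how textscan or strread.m work internally. It assumes that blank lines are skipped (inferred from the four results shown for this input), and it relies on the "collapsedelimiters" option of strsplit, which exists only in Octave releases newer than the ones discussed in this thread. The names rows, cells and col2 are made up for the illustration.

```octave
% Sketch: split the input into lines first, then into fields per line.
% Fields that the format asks for beyond the end of a line stay empty.
str = sprintf ('1\n2\n\n4\n57\n\n');
rows = regexp (str, '\r\n|\n|\r', 'split');   % any EOL sequence ends a line
rows = rows(~cellfun (@isempty, rows));       % assumption: blank lines are skipped
nfields = 2;                                  % the format '%*d%d' asks for two fields
cells = cell (numel (rows), nfields);
cells(:) = {''};                              % every field starts out empty
for r = 1:numel (rows)
  parts = strsplit (rows{r}, ',', 'collapsedelimiters', false);
  n = min (numel (parts), nfields);
  cells(r, 1:n) = parts(1:n);                 % fields past the EOL are never filled
end
% Empty fields convert to NaN; int32 has no NaN, so they end up as 0.
col2 = int32 (str2double (cells(:, 2)));
disp (col2)
```

With this input every line holds only one field, so the second column consists of four empty values, which matches the four zeros above.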
Unfortunately, I missed catching the problems with "i" before. I think it
should read ...
i. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter.
Any fields read beyond an EOL are treated as being empty.
Does that make sense?
Not all of it.
An EOL can also be a field delimiter. Obvious, because an EOL naturally
cuts off fields if there's no other delimiter first.
The rest of i. looks correct to me.
Anyway, if your and my collection of inferred rules applies, I do not
understand this (7th test of Octave strread.m):
octave:23> a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
a =
{
[1,1] = a b c
[2,1] = d e
[3,1] =
[4,1] = f
}
(Same goes for ML)
I hadn't considered this before. I'll have to study the docs again to see if there is a
reference to this. I did try dropping the "delimiter" to see what happens.
a = textscan ('a b c, d e, , f', '%s');
a{:}
ans =
'a'
'b'
'c,'
'd'
'e,'
','
'f'
because in this example there are spaces ("whitespace") separating e.g., 'a'
and 'b'.
But (ML):
a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')
a =
1
2
3
4
5
0
6
In the above cases, I get the same results for textscan.
So it seems that interpretation and processing of default whitespace depends on
the field format specifier as well?
It appears that ML doesn't use the white-space property to delimit strings when the
"delimiter" property has been specified. I've added another line to the list (see "g"
and "j" specifically).
a. "Words" or fields (to be interpreted later) are separated by white-space or
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace"
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields. Multiple
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) implies an empty
value.
g. If the delimiter property is specified, then white-space is *not* used to
delimit character fields. However, white-space is always used to delimit
numeric fields.
h. By default "emptyvalue" is NaN for numeric data types. If the numeric type
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty
value is just an empty string.
i. If so desired, multiple consecutive delimiters can be folded into one delimiter if
"MultipleDelimsAsOne" parameter is set to 1.
j. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter.
Any fields read beyond an EOL are treated as being empty.
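Rules d., h. and i. can be tried out in isolation with strsplit, whose "collapsedelimiters" option behaves much like MultipleDelimsAsOne. This is only an analogy for illustration, assuming a recent Octave (the option does not exist in the strsplit of the releases discussed here); it says nothing about how textscan is implemented. The variable names are hypothetical.

```octave
txt = '1,,3';
% Rule d: consecutive delimiters are not folded, so one empty field remains.
kept = strsplit (txt, ',', 'collapsedelimiters', false);    % 3 fields
% Rule i: folding consecutive delimiters removes the empty field.
folded = strsplit (txt, ',', 'collapsedelimiters', true);   % 2 fields
% Rule h: an empty field becomes NaN for a floating point type ...
as_double = str2double (kept);
% ... and zero for a type without NaN, such as int32.
as_int32 = int32 (as_double);
disp (as_double)
disp (as_int32)
```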
Does this look correct to you?
Overall, yes, save for j. (the former i.), as mentioned above.
But as to g., ML seems inconsistent. Spaces in character strings would
only be preserved if whitespace is set to "" (empty), according to the
ML docs (they even have an example of this).
Strict compliance with rule g. might render patching of strread.m much
more complicated, as for each individual format specifier we'd have to
check the whitespace/delimiters around the field in question, depending
on the format specifier's nature.
This is more easily done in a compiled version that linearly ploughs
through the text string, than in current strread.m that works by parsing
complete columns one by one.
I can try to implement rule g. in a quick-and-dirty fashion, perhaps
this will solve the actual bug that provoked my renewed interest.
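To make rule g. concrete, here is a minimal sketch of the idea, under stated assumptions: it is not a patch for strread.m, it handles a single delimiter and a single format specifier only, and it uses the "collapsedelimiters" option of strsplit from later Octave releases. All variable names are invented for the example.

```octave
delim = ',';

% Character fields: only the delimiter separates fields, so inner
% white-space survives; leading and trailing white-space is trimmed.
str_fields = strtrim (strsplit ('a b c, d e, , f', delim, ...
                                'collapsedelimiters', false));

% Numeric fields: split at the delimiter first, then at white-space as well.
parts = strtrim (strsplit ('1 2 3, 4 5, , 6', delim, ...
                           'collapsedelimiters', false));
num_fields = {};
for k = 1:numel (parts)
  if isempty (parts{k})
    num_fields{end+1} = '';            % rule f: empty value between delimiters
  else
    num_fields = [num_fields, regexp(parts{k}, '\s+', 'split')];
  end
end
vals = int32 (str2double (num_fields));   % empty -> NaN -> 0 for int32

disp (str_fields)
disp (vals)
```

The character case yields four fields ('a b c', 'd e', an empty string and 'f'), and the numeric case yields seven values, 1 2 3 4 5 0 6, in line with the strread results quoted above.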
How much further should we go in fixing current strread (the work horse
for textscan and textread), given the end-of-life for strread in ML plus
jwe's upcoming compiled textscan version? (if he -or someone else- ever
gets time to finish it, of course)
I'm not in favor of blindly imitating as much as we can of the more
obscure, or undocumented, or inconsistent, or corner case behavior of ML.
I'd prefer clarity and consistency over strict ML compatibility.
Your suggestion of documenting the Octave behavior that ML didn't
document for its own functions is to be applauded.
BTW our (= mostly your) investigation of ML behavior does serve a
purpose, i.e. to enhance jwe's textscan.
Philip