octave-maintainers

Re: Improving strread / textread / textscan


From: Philip Nienhuis
Subject: Re: Improving strread / textread / textscan
Date: Tue, 25 Oct 2011 22:43:56 +0200

Hi Ben,

Ben Abbott wrote:
On Oct 24, 2011, at 5:47 PM, Philip Nienhuis wrote:

Ben Abbott wrote:
On Oct 24, 2011, at 2:49 PM, Philip Nienhuis wrote:

Answers to three emails in one:

Ben Abbott wrote:
<snip>
Test #11: Passed.

Hmmm... on ML2007a, I get:
Test #11: Failed.
OBSERVED:
   49   10   76   50

EXPECTED:
   76   49   10   76   50

So ML is inconsistent...

( Note I fixed some typos in your script :-)  )

I'm confused. Did you run a modified test #11? If so, how did the unmodified 
script behave, and can you show us what you changed?

I copied/pasted your code into the ML editor, and only adapted the typos (OBSEVED ->  OBSERVED, and 
"no enough"->  "not enough" in oct_assert.m).

The 11th test is ...

c = textscan (sprintf ('L1\nL2'), '%s', 'endofline', '');
oct_assert (int8(c{:}{:}), int8([ 76,  49,  10,  76,  50 ]));

Looks to me as if R2007a had a bug in it. Is that a reasonable conclusion?

I don't know. I only have R2007a at my disposal. Perhaps tomorrow I'll have a chance to try R2009a; we'll see then.
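
For reference, a minimal sketch of what test #11 expects: the character codes of the input string, newline included. It uses only sprintf, double and isequal; the `observed` vector is copied from the R2007a output quoted above, not recomputed.

```octave
% Character codes of the test #11 input: 'L' '1' LF 'L' '2'
s = sprintf ('L1\nL2');
expected = double (s);
observed = [49 10 76 50];      % R2007a result, copied from the report above

% The R2007a result is the expected vector without its first element ('L').
lost_first_char = isequal (observed, expected(2:end));
```

So the R2007a output differs from the expected one only by the missing leading 'L' (76).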


<snip>

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that 
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) imply an empty 
value.
g. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty 
value is just an empty string.
h. If so desired, multiple consecutive delimiters can be folded into one delimiter if 
"MultipleDelimsAsOne" parameter is set to 1.
i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
affected by the MultipleDelimsAsOne parameter.
<snip>
Unfortunately, I missed catching the problems with "i" before. I think it 
should read ...

i. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.

Does that make sense?

Not all of it.
An EOL can also be a field delimiter. Obvious, because an EOL naturally cuts 
off fields if there's no other delimiter first.
The rest of i. looks correct to me.

Maybe we're defining "delimiter" differently? ... or maybe I'm being overly 
pedantic?

I think you're simply at a more abstract level than me, while I (the guy who patched this part for Octave) tend to think at a more practical level (how do I manage to code it).

I'm using the term to indicate a character that separates lines. Which an EOL 
does. Or a character that separates fields. Which EOL does not do.

...unless the EOL chars are part of whitespace. ML's default whitespace for strread is ' \b\r\n\t'. AFAIU ML only allows '\n', '\r\n', or '\r' as EOL (default: determined from file). All of these are in strread's default whitespace, and since whitespace is the default delimiter, EOLs can implicitly delimit fields.
Perhaps this is where my confusion stems from. See a few lines below:

Thus, EOLs are delimiters for lines but not for fields within a line.

The MW docs do a reasonable job of describing this. See "Field and Row 
Delimiters" at the link below.

        http://www.mathworks.com/help/techdoc/ref/textscan.html

... we should be careful to not mix up strread and textscan.
I suppose you think more "the textscan way", while I (knowing that strread currently does the actual work for textscan) tend to view things against the background of strread.m.
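
To make the two viewpoints concrete, here is a small sketch using plain regexp rather than strread or textscan themselves. It assumes the default strread whitespace set quoted above (' \b\r\n\t') and only models the splitting step, so it says nothing about what either function actually returns.

```octave
txt = sprintf ('1 2\n3 4');

% "strread view": EOL chars are part of whitespace, so every run of
% whitespace (including the newline) separates two fields.
tokens = regexp (txt, '[ \b\r\n\t]+', 'split');

% "textscan view": EOL sequences split the input into lines first;
% fields are then separated within each line.
rows = regexp (txt, '\r\n|\n|\r', 'split');
per_line = cellfun (@(l) regexp (l, '[ \b\t]+', 'split'), rows, ...
                    'UniformOutput', false);
```

For this well-formed input both views yield the same four fields. They would only differ when a line holds fewer fields than the format expects.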

Anyway, if your and my collection of inferred rules apply, I do not 
understand this (7th test of Octave strread.m):

octave:23>   a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
a =
{
  [1,1] = a b c
  [2,1] = d e
  [3,1] =
  [4,1] = f
}
(Same goes for ML)

I hadn't considered this before.  I'll have to study the docs again to see if there is a 
reference to this. I did try dropping the "delimiter" to see what happens.

a = textscan ('a b c, d e, , f', '%s');

a{:}

ans =

     'a'
     'b'
     'c,'
     'd'
     'e,'
     ','
     'f'

because in this example there are spaces ("whitespace") separating e.g., 'a' 
and 'b'.

But (ML):
a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')
a =
     1
     2
     3
     4
     5
     0
     6

In the above cases, I get the same results for textscan.

So it seems that interpretation and processing of default whitespace depend on 
the field format specifier as well?

It appears that ML doesn't use the white-space property to delimit strings when the 
"delimiter" property has been specified. I've updated the list accordingly (specifically 
"g" and "j").

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that 
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) implies an empty 
value.
g. If the delimiter property is specified, then white-space is *not* used to 
delimit character fields. However, white-space is always used to delimit 
numeric fields.
h. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty 
value is just an empty string.
i. If so desired, multiple consecutive delimiters can be folded into one delimiter if 
"MultipleDelimsAsOne" parameter is set to 1.
j. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.

Does this look correct to you?

Overall, yes, save for i. as mentioned above.
But as to g., ML seems inconsistent. According to the ML docs, spaces in character strings 
would only be preserved if whitespace is set to "" (empty); they even have an 
example of this.

hmmm ... I think I managed to confuse myself a bit earlier. I tried a simple 
test to confirm my understanding, but just proved my understanding was 
incomplete.

a = textscan ('1, 2, 3', '%s %s %s', 'delimiter', ',', 'whitespace', '');

a{:}

ans =

     '1'


ans =

     ' 2'


ans =

     ' 3'

Notice that a{2:3} have leading spaces. If "whitespace" is not defined as empty, then 
there is no white space in a{:}.

a = textscan ('1, 2, 3', '%s %s %s', 'delimiter', ',');

a{:}

ans =

     '1'


ans =

     '2'


ans =

     '3'

These two examples and the one below (we've used before) ...

a = textscan ('a b c, d e, , f', '%s', 'delimiter', ',');
a{:}

ans =

     'a b c'
     'd e'
     ''
     'f'

... imply to me that when reading character data with "delimiter" 
specified, white-space is not used to delimit, and the characters read are trimmed of 
leading and trailing white-space.

That's my impression as well.
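
The examples above suggest a simple mental model for character fields when "delimiter" is given: split on the delimiter only, then trim. A rough sketch with regexp and strtrim follows; it models the rule as inferred in this thread and does not call textscan.

```octave
% Split on the delimiter only; white-space inside a field is kept.
raw = regexp ('a b c, d e, , f', ',', 'split');

% Default case: leading/trailing white-space is trimmed from each field.
trimmed = strtrim (raw);

% 'whitespace', '' case: nothing is trimmed, so leading spaces survive.
untrimmed = regexp ('1, 2, 3', ',', 'split');
```

Here `trimmed` holds 'a b c', 'd e', an empty string and 'f', while `untrimmed` keeps ' 2' and ' 3' with their leading spaces, as in the outputs shown above.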

From textscan docs:
<QUOTE>
textscan adds a space character, char(32), to any specified Whitespace unless Whitespace is empty ('') and the format includes any string conversion specifier.
</QUOTE>
I suppose strread does the same. Perhaps this is where we need to search for analysis of ML behavior.

Strict compliance with rule g. might render patching of strread.m much more 
complicated, as for each individual format specifier we'd have to check the 
whitespace/delimiters around the field in question, depending on the format 
specifier's nature.
This is more easily done in a compiled version that linearly ploughs through 
the text string, than in current strread.m that works by parsing complete 
columns one by one.
I can try to implement rule g. in a quick-and-dirty fashion, perhaps this will 
solve the actual bug that provoked my renewed interest.

How much further should we go in fixing current strread (the work horse for 
textscan and textread), given the end-of-life for strread in ML plus jwe's 
upcoming compiled textscan version? (if he -or someone else- ever gets time to 
finish it, of course)
I'm not in favor of blindly imitating as much as we can of the more obscure, or 
undocumented, or inconsistent, or corner case behavior of ML.
I'd prefer clarity and consistency over strict ML compatibility.
Your suggestion of documenting the Octave behavior that ML didn't document for 
its own functions is to be applauded.

For the moment, I'm mostly concerned about documenting how textscan should 
work. If you've been able to improve Octave's compatibility, then I recommend 
you put together a changeset. John or someone else may make it obsolete at some 
point, but that is part of the nature of code development ... after all you're 
about to do the same to one of my contributions ;-)

Happened to me too, several times. Yes that's our fate...
But you are quick in turning ideas into changesets. I'm more reluctant and would rather wait until I'm fairly sure.

I'll try to prepare a changeset for strread.m in the coming days (I have only little time each day due to medical issues).

In any event, my latest attempt to document how textscan parses fields 
is below.


01) Lines of input are delimited by EOL chars. The EOL character may be
     specified by the parameter "endofline". The default is determined from
     the file ("\n", "\r", or "\r\n").

... 01) only applies if textscan reads from file. Correct?

02) When reading character fields, if no "delimiter" property is defined, then
     the characters contained by the "whitespace" property are used to delimit
     fields. When the "delimiter" property is defined, the defined "whitespace"
     property is ignored for the purpose of delimiting strings. Also, when the
     "delimiter" property is defined all leading and trailing characters
     contained in the "whitespace" property are trimmed from the strings read.
03) Any fields read beyond an EOL are treated as being empty. For
     numeric data empty values are replaced by the property "emptyvalue".
04) Values for numeric fields are separated by characters contained by the
     "whitespace", or "delimiter", properties.

... or their union (?) (which is what I think); but see below 09)

05) The white-space char set can be adapted by the user with the "whitespace"
     property. It can even be set to empty.

... I'm not sure, but I think ML only allows certain characters to be part of whitespace. At least I read the strread docs this way. I don't know if this also holds for textscan.

06) A repetition of white-space chars is folded into one char.
07) Delimiters are also characters that separate fields.  Multiple
     delimiters are not folded into a single instance.
09) For numeric fields, vectors of white-space, and one delimiter, are folded
     into one _delimiter_ that separates the fields
... the numbering goes wrong here (two items are numbered 09)
09) A pair of delimiters separated by white-space (or nothing) implies an
     empty value.
10) If the delimiter property is specified, then white-space is *not* used to
     delimit character fields. However, white-space is always used to delimit
     numeric fields.
11) For numeric data, the default "emptyvalue" is NaN. If the numeric
     type doesn't support NaN, then zero is used (int32 for example). For
     character fields, an empty value is just an empty string.
12) Multiple consecutive delimiters can be folded into one delimiter by
     setting the "MultipleDelimsAsOne" parameter to true.
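
As a rough illustration of the rules for numeric fields (white-space around a delimiter folds into that delimiter, remaining white-space delimits, empty fields take "emptyvalue"), here is a sketch built from regexprep, regexp and str2double. It is a model of the rules as listed, not of the strread.m code, and the choice of int32 with 0 as the empty value mirrors the %d example earlier in the thread.

```octave
txt = '1 2 3, 4 5, , 6';

% White-space next to a delimiter is folded into the delimiter.
s = regexprep (txt, '\s*,\s*', ',');
% Any remaining run of white-space acts as a single delimiter.
s = regexprep (s, '\s+', ',');

tokens = regexp (s, ',', 'split');
vals = str2double (tokens);        % empty fields become NaN

% Integer types cannot hold NaN, so zero is substituted (see 11).
ivals = vals;
ivals(isnan (ivals)) = 0;
ivals = int32 (ivals);
```

This reproduces the 1 2 3 4 5 0 6 sequence reported for ML above; for a floating-point format the empty field would stay NaN.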

Once this part is settled, then I hope to write tests for all of this. Later 
I'll add tests for all data types, patterns, field-multiplicity, and skipping 
fields / literals.

For which textscan version?
The one in dev (default) I have developed as far as I could. (That is, I have an experimental version that can resume reading at a byte position after a specified number of fields has been read, in contrast to the current version, which simply counts lines.)

Some of what you mention below can only reasonably be implemented in a compiled function.
Otherwise, I think this is a valuable goal.

Integer, signed: %d, %d8, %d16, %d32, %d64
Integer, unsigned: %u, %u8, %u16, %u32, %u64
Floating-point: %f, %f32, %f64, %n
Character strings: %s, %q, %c

Pattern-matching: %[...], %[^...]

Multiple fields: %Nc, %Ns, %Nq, %N[...], %N[^...], %Nn, %Nd, %Nu, %Nf, %N.Dn, 
%N.Df

Skipping fields: %*, %*n, and literals

Philip

