Re: Improving strread / textread / textscan

octave-maintainers

From:	Philip Nienhuis
Subject:	Re: Improving strread / textread / textscan
Date:	Tue, 25 Oct 2011 22:43:56 +0200
User-agent:	Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Hi Ben,

Ben Abbott wrote:

On Oct 24, 2011, at 5:47 PM, Philip Nienhuis wrote:

Ben Abbott wrote:

On Oct 24, 2011, at 2:49 PM, Philip Nienhuis wrote:

Answers to three emails in one:

Ben Abbott wrote:

<snip>

Test #11: Passed.


Hmmm... on ML2007a, I get:
Test #11: Failed.
OBSERVED:
   49   10   76   50

EXPECTED:
   76   49   10   76   50

So ML is inconsistent...

( Note I fixed some typos in your script :-)  )


I'm confused. Did you run a modified test #11? If so, how did the unmodified 
script behave, and can you show us what you changed?


I copied/pasted your code into the ML editor and only fixed the typos (OBSEVED -> OBSERVED, and 
"no enough" -> "not enough" in oct_assert.m).


The 11th test is ...

c = textscan (sprintf ('L1\nL2'), '%s', 'endofline', '');
oct_assert (int8(c{:}{:}), int8([ 76,  49,  10,  76,  50 ]));

Looks to me as if R2007a had a bug in it. Is that a reasonable conclusion?

? I don't know. I only have R2007a at my disposal. Perhaps tomorrow I'll have a chance to try R2009a; we'll see then.
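
As a quick sanity check, the expected vector in that assertion is simply the character codes of the input string. The snippet below is a minimal sketch that avoids textscan altogether, so it says nothing about which ML release is right; it only shows where 76 49 10 76 50 comes from.

```octave
% Character codes of the two-line test string, computed without textscan.
% sprintf expands the \n escape in its template into a newline (code 10).
codes = double (sprintf ('L1\nL2'));
disp (codes)   % the codes for 'L', '1', newline, 'L', '2'
```

If "endofline" is empty, no character is treated as a line break, so all five codes (including the 10) would be expected in the single string returned.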



<snip>


a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that 
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) imply an empty 
value.
g. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty 
value is just an empty string.
h. If so desired, multiple consecutive delimiters can be folded into one delimiter if 
the "MultipleDelimsAsOne" parameter is set to 1.
i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
affected by the MultipleDelimsAsOne parameter.

<snip>

Unfortunately, I missed catching the problems with "i" before. I think it 
should read ...

i. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.

Does that make sense?


Not all of it.
An EOL can also be a field delimiter. Obvious, because an EOL naturally cuts 
off fields if there's no other delimiter first.
The rest of i. looks correct to me.


Maybe we're defining "delimiter" differently? ... or maybe I'm being overly 
pedantic?

I think you're simply at a more abstract level than me, while I (the guy who patched this part for Octave) tend to think more at a practical level (how do I manage to code it).

I'm using the term to indicate a character that separates lines. Which an EOL 
does. Or a character that separates fields. Which EOL does not do.

...unless the EOL chars are part of whitespace. Now ML's default whitespace for strread = ' \b\r\n\t'. AFAIU ML only allows '\n', '\r\n', or '\r' as EOL (default = determined from file), all of which are in strread's default whitespace, and as whitespace is the default delimiter, EOLs can implicitly delimit fields.

Perhaps this is where my confusion stems from. See a few lines below...:

Thus, EOLs are delimiters for lines but not for fields within a line.

The MW docs do a reasonable job of describing this. See "Field and Row 
Delimiters" at the link below.

        http://www.mathworks.com/help/techdoc/ref/textscan.html


... we should be careful to not mix up strread and textscan.

I suppose you think more "the textscan way", while I (knowing that currently strread does the actual work for textscan) tend to perceive stuff more against the strread.m background.

Anyway, if your & my collection of inferred rules applies, I do not 
understand this (7th test of Octave strread.m):

octave:23>   a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
a =
{
  [1,1] = a b c
  [2,1] = d e
  [3,1] =
  [4,1] = f
}
(Same goes for ML)


I hadn't considered this before.  I'll have to study the docs again to see if there is a 
reference to this. I did try dropping the "delimiter" to see what happens.

a = textscan ('a b c, d e, , f', '%s');

a{:}

ans =

     'a'
     'b'
     'c,'
     'd'
     'e,'
     ','
     'f'

because in this example there are spaces ("whitespace") separating e.g., 'a' 
and 'b'.

But (ML):

a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')

a =
     1
     2
     3
     4
     5
     0
     6

In the above cases, I get the same results for textscan.

So it seems that interpretation & processing of default whitespace depends on 
the field format specifier as well?

It appears that ML doesn't use the white-space property to delimit strings when the
"delimiter" property has been specified. I've updated the list accordingly (specifically
"g" and "j").

a. "Words" or fields (to be interpreted later) are separated by white-space or
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace"
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields. Multiple
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) implies an empty
value.
g. If the delimiter property is specified, then white-space is *not* used to
delimit character fields. However, white-space is always used to delimit
numeric fields.
h. By default "emptyvalue" is NaN for numeric data types. If the numeric type
doesn't support NaN, then zero is used (int32 for example). For character fields, an empty
value is just an empty string.
i. If so desired, multiple consecutive delimiters can be folded into one delimiter if
the "MultipleDelimsAsOne" parameter is set to 1.
j. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter.
Any fields read beyond an EOL are treated as being empty.
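
To make rules d., f., g. and h. concrete for numeric fields, here is a minimal sketch in plain Octave. It only models the rules as listed, using regexp and sscanf; it is not how strread.m or textscan work internally, and empty_value = 0 is an assumption standing in for the %d case, where NaN is not available.

```octave
% Sketch of rules d., f., g. and h. for numeric fields (not strread's code).
str = '1 2 3, 4 5, , 6';
empty_value = 0;                        % assumed: %d cannot hold NaN, so zero

pieces = regexp (str, ',', 'split');    % rule d.: split at every delimiter
vals = [];
for k = 1:numel (pieces)
  nums = sscanf (pieces{k}, '%f');      % rule g.: white-space separates numbers
  if isempty (nums)
    nums = empty_value;                 % rules f. and h.: empty field
  end
  vals = [vals; nums];
end
disp (vals')                            % the values 1 2 3 4 5 0 6
```

Under these assumptions the sketch reproduces the column 1 2 3 4 5 0 6 that ML's strread returned for the same input with '%d'.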

Does this look correct to you?


Overall, yes, save for i. (now j.) as mentioned above.
But as to g., ML seems inconsistent. Spaces in character strings would only be preserved 
if whitespace is set to "" (empty), according to the ML docs (they even have an 
example of this).


hmmm ... I think I managed to confuse myself a bit earlier. I tried a simple 
test to confirm my understanding, but just proved my understanding was 
incomplete.

a = textscan ('1, 2, 3', '%s %s %s', 'delimiter', ',', 'whitespace', '');

a{:}

ans =

     '1'


ans =

     ' 2'


ans =

     ' 3'

Notice a{2:3} have leading spaces. If "whitespace" is not set to empty, then 
there is no white space in a{:}.

a = textscan ('1, 2, 3', '%s %s %s', 'delimiter', ',');

a{:}

ans =

     '1'


ans =

     '2'


ans =

     '3'

These two examples and the one below (we've used before) ...

a = textscan ('a b c, d e, , f', '%s', 'delimiter', ',');

a{:}


ans =

     'a b c'
     'd e'
     ''
     'f'

... imply to me that when reading character data with "delimiter" 
specified, white-space is not used to delimit, and the characters read are trimmed of 
leading and trailing white-space.
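
That inferred behaviour for character fields can be modelled in a few lines. This is only a sketch of the rule, not textscan's implementation, and strtrim is an approximation: it trims its own fixed set of white-space characters rather than the contents of the "whitespace" property.

```octave
% Sketch of the inferred rule for character fields (not textscan's code).
raw = regexp ('a b c, d e, , f', ',', 'split');   % split at the delimiter only
fields = strtrim (raw);                           % trim leading/trailing blanks
% fields holds 'a b c', 'd e', an empty string, and 'f':
% the inner spaces survive and the third field is empty.

% With "whitespace" set to '' there is nothing to trim, so the split pieces
% are kept exactly as they are.
untrimmed = regexp ('1, 2, 3', ',', 'split');     % '1', ' 2', ' 3'
trimmed = strtrim (untrimmed);                    % '1', '2', '3'
```

The two variables at the end correspond to the two '1, 2, 3' calls above, with and without an empty "whitespace".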


That's my impression as well.

From textscan docs:
<QUOTE>

textscan adds a space character, char(32), to any specified Whitespace unless Whitespace is empty ('') and the format includes any string conversion specifier.

</QUOTE>

I suppose strread does the same. Perhaps this is where we need to search for analysis of ML behavior.

Strict compliance with rule g. might render patching of strread.m much more 
complicated, as for each individual format specifier we'd have to check the 
whitespace/delimiters around the field in question, depending on the format 
specifier's nature.
This is more easily done in a compiled version that linearly ploughs through 
the text string, than in current strread.m that works by parsing complete 
columns one by one.
I can try to implement rule g. in a quick-and-dirty fashion, perhaps this will 
solve the actual bug that provoked my renewed interest.

How much further should we go in fixing current strread (the work horse for 
textscan and textread), given the end-of-life for strread in ML plus jwe's 
upcoming compiled textscan version? (if he -or someone else- ever gets time to 
finish it, of course)
I'm not in favor of blindly imitating as much as we can of the more obscure, or 
undocumented, or inconsistent, or corner case behavior of ML.
I'd prefer clarity and consistency over strict ML compatibility.
Your suggestion of documenting the Octave behavior that ML didn't document for 
its own functions is to be applauded.


For the moment, I'm mostly concerned about documenting how textscan should 
work. If you've been able to improve Octave's compatibility, then I recommend 
you put together a changeset. John or someone else may make it obsolete at some 
point, but that is part of the nature of code development ... after all you're 
about to do the same to one of my contributions ;-)


Happened to me too, several times. Yes that's our fate...

But you are quick in turning ideas into changesets. I'm more reluctant and would rather wait until I'm fairly sure.

I'll try to prepare a changeset for strread.m in the coming days (I have only little time each day due to medical issues).

In any event, my latest attempt to document how textscan parses fields 
is below.

01) Lines of input are delimited by EOL chars. The EOL character may be
     specified by the parameter "endofline". The default is determined from
     the file ("\n", "\r", or "\r\n").


... 01) only applies if textscan reads from file. Correct?

02) When reading character fields, if no "delimiter" property is defined, then
     the characters contained by the "whitespace" property are used to delimit
     fields. When the "delimiter" property is defined, the defined "whitespace"
     property is ignored for the purpose of delimiting strings. Also, when the
     "delimiter" property is defined all leading and trailing characters
     contained in the "whitespace" property are trimmed from the strings read.
03) Any fields read beyond an EOL are treated as being empty. For
     numeric data empty values are replaced by the property "emptyvalue".
04) Values for numeric fields are separated by characters contained in the
     "whitespace" or "delimiter" properties.


... or their union (?) (which is what I think); but see below 09)

05) The white-space char set can be adapted by the user with the "whitespace"
     property. It can even be set to empty.

... I'm not sure, but I think ML only allows certain characters to be part of whitespace. At least I read the strread docs this way. I don't know if this also holds for textscan.

06) A repetition of white-space chars is folded into one char.
07) Delimiters are also characters that separate fields.  Multiple
     delimiters are not folded into a single instance.
09) For numeric fields, vectors of white-space, and one delimiter, are folded
     into one _delimiter_ that separates the fields

__VV__count goes wrong...

09) A pair of delimiters separated by white-space (or nothing) implies an
     empty value.
10) If the delimiter property is specified, then white-space is *not* used to
     delimit character fields. However, white-space is always used to delimit
     numeric fields.
11) For numeric data, the default "emptyvalue" is NaN. If the numeric
     type doesn't support NaN, then zero is used (int32 for example). For
     character fields, an empty value is just an empty string.
12) Multiple consecutive delimiters can be folded into one delimiter by
     setting the "MultipleDelimsAsOne" parameter to true.
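
Rules 01), 07) and 12) can be illustrated with plain regexp splitting. This is a sketch of the documented rules only, not of textscan's implementation; the sample string and the variable names are made up for the illustration.

```octave
% Sketch of rules 01), 07) and 12) (not textscan's actual code).
txt = sprintf ('1,,2\n3,4');

% Rule 01): EOL sequences split the input into lines.
input_lines = regexp (txt, '\r\n|\n|\r', 'split');   % '1,,2' and '3,4'

% Rule 07): every delimiter counts, so ',,' encloses an empty field.
separate = regexp (input_lines{1}, ',', 'split');    % '1', empty, '2'

% Rule 12): with MultipleDelimsAsOne, a run of delimiters acts as one.
folded = regexp (input_lines{1}, ',+', 'split');     % '1', '2'
```

Listing '\r\n' first in the alternation keeps a CR/LF pair from being counted as two line breaks.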

Once this part is settled, then I hope to write tests for all of this. Later 
I'll add tests for all data types, patterns, field-multiplicity, and skipping 
fields / literals.


For which textscan version?

The one in dev (default) has been developed as far as I could take it. (That is, I have an experimental version that allows resuming reading at byte position after a specified nr of fields have been read, in contrast to the current version which simply does line counting).

Some of what you mention below can only reasonably be implemented in a compiled function.

Otherwise, I think this is a valuable goal.

Integer, signed: %d, %d8, %d16, %d32, %d64
Integer, unsigned: %u, %u8, %u16, %u32, %u64
Floating-point: %f, %f32, %f64, %n
Character strings: %s, %q, %c

Pattern-matching: %[...], %[^...]

Multiple fields: %Nc, %Ns, %Nq, %N[...], %N[^...], %Nn, %Nd, %Nu, %Nf, %N.Dn, 
%N.Df

Skipping fields: %*, %*n, and literals


Philip
