octave-maintainers

From: Ben Abbott
Subject: Re: advice / help needed for reading formatted text (textscan, strread, & textread)
Date: Mon, 25 Oct 2010 14:52:48 +0800

On Oct 25, 2010, at 2:20 PM, John W. Eaton wrote:

> On 25-Oct-2010, Ben Abbott wrote:
> 
> | The functions textscan, strread, and textread are currently not fully
> | compliant with ML. They each need expanded support of formats.
> | 
> | ML supports the following format types: %d, %u, %f, %s, %q, %c, %[…], %[^…]
> | 
> | In addition it supports skipping entries, "%*f", and specifying precision,
> | "%12.5f"
> 
> The scanf functions already handle all of the format specifiers above
> except for %q and bit-widths for %u and %d conversions.  I suppose
> adding bit-width conversion specifiers (%d8, %d16, %d32, %d64, %u8,
> %u16, %u32, %u64) would not be too hard.  Adding %q might not be too
> hard either.  Exactly how is it supposed to work?  Does it mean to
> match something like
> 
>  [:space:]*"[^"]*"
> 
> and return everything between the two ""?  Is it an error to have a
> newline embedded in the quoted string?

Apparently, newlines may be embedded.

>> c = textscan (sprintf ('SkipThis"\n%s\n"','Hello World'), 'SkipThis%q');
>> c{1}

ans = 

    '
Hello World
'
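
For anyone who wants to experiment with the matching rule outside of ML, here is a minimal sketch in Python. It assumes that %q skips leading whitespace and then returns everything between a pair of double quotes, embedded newlines included, which is what the output above suggests. The helper name read_quoted is hypothetical, and single quotes or escaped quotes are not handled.

```python
import re

# Assumed rule for "%q": optional whitespace, then a double-quoted run.
# The negated class [^"] also matches newlines, so they may be embedded.
_QUOTED = re.compile(r'\s*"([^"]*)"')

def read_quoted(text, pos=0):
    """Return (contents, new_position), or (None, pos) if nothing matches."""
    m = _QUOTED.match(text, pos)
    if m is None:
        return None, pos
    return m.group(1), m.end()

# Same input as the textscan call above: literal prefix, then a quoted field.
data = 'SkipThis"\nHello World\n"'
value, end = read_quoted(data, len("SkipThis"))
print(repr(value))
```

The returned position is the index just past the closing quote, which is the kind of value a position output would need to report.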

> | ... and have "c" be returned as a [Nx5] cell array, where the "%s", "%c",
> | and "%[TF]" entries are cell-strings and the other cell entries are numeric.
> 
> I think we need a new function other than the scanf functions to
> return values in a cell array.  But rather than writing it from
> scratch, it should probably be based on the
> octave_base_stream::do_scanf function in oct-stream.cc.  I guess we
> would need to handle the %{ud}{8,16,32,64} and %q specifiers in the
> scanf_format_elt class and the octave_base_stream::do_scanf function.
> When called by the scanf functions, these extra format specifiers
> should be disabled.  I think that would be as simple as not
> recognizing them when constructing the scanf_format_elt object from
> the given format string and there would not have to be specific checks
> in the do_scanf function.  There would also need to be an option to
> gather the output values in a different way so they could be returned
> as a cell array.  I could probably help with implementing this.
> 
> jwe

In addition to %{ud}{8,16,32,64} ML also supports %f{32,64}
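
To make the set of extended specifiers concrete, here is a rough sketch (in Python, not the C++ of oct-stream.cc) of how a format string could be split into conversions and mapped to output classes. The table follows the doc-string below; the names TYPES, SPEC, and parse_format are hypothetical, and the %[...] conversions and literal text in the format are deliberately ignored.

```python
import re

# Conversion -> output class, as listed in the doc-string (assumed mapping).
TYPES = {
    "d": "int32", "d8": "int8", "d16": "int16", "d32": "int32", "d64": "int64",
    "u": "uint32", "u8": "uint8", "u16": "uint16", "u32": "uint32",
    "u64": "uint64",
    "f": "double", "f32": "single", "f64": "double", "n": "double",
    "s": "string", "q": "string", "c": "char",
}

# %[*][width][.precision]<conversion>[bit-width]
SPEC = re.compile(r"%(\*)?(\d+)?(?:\.(\d+))?([dufnsqc])(8|16|32|64)?")

def parse_format(fmt):
    """Return one dict per conversion found in fmt."""
    specs = []
    for m in SPEC.finditer(fmt):
        skip, width, prec, conv, bits = m.groups()
        key = conv + (bits or "")
        if key not in TYPES:
            raise ValueError("unsupported conversion: %" + key)
        specs.append({
            "type": TYPES[key],
            "skip": skip is not None,      # "%*f" style: read but discard
            "width": int(width) if width else None,
            "precision": int(prec) if prec else None,
        })
    return specs

specs = parse_format("%d8 %*f %12.5f %u64 %f32 %s")
print([s["type"] for s in specs])
```

Disabling the extra specifiers for the scanf functions would then amount to using a smaller table when the format is parsed, which matches the approach suggested above.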

For a new function, there may be other parameters / features of these
functions that could (or should?) be included.  While studying ML's textscan, I
put together a doc-string that covers everything ML supports. It is below.
Note that the position indicator works whether the 1st argument is a fid or a
string.

 -- Function File: C = textscan (FID, FORMAT)
 -- Function File: C = textscan (FID, FORMAT, N)
 -- Function File: C = textscan (FID, FORMAT, PARAM, VALUE, ...)
 -- Function File: C = textscan (FID, FORMAT, N, PARAM, VALUE, ...)
 -- Function File: C = textscan (STR, ...)
 -- Function File: [C, POSITION] = textscan (...)
     Read formatted data from a text file.

     The file associated with FID is read and parsed according to
     FORMAT. The supported formats are:

    `%c'
          Read a single character (may be white space).

    `%d'
          Read a 32 bit integer.

    `%d8'
          Read an 8 bit integer.

    `%d16'
          Read a 16 bit integer.

    `%d32'
          Read a 32 bit integer.

    `%d64'
          Read a 64 bit integer.

    `%f'
          Read a double.

    `%f32'
          Read a single.

    `%f64'
          Read a double.

    `%n'
          Read a double.

    `%q'
          Read quoted text. Single or double quotes are supported.

    `%s'
          Read a string.

    `%u'
          Read an unsigned 32 bit integer.

    `%u8'
          Read an unsigned 8 bit integer.

    `%u16'
          Read an unsigned 16 bit integer.

    `%u32'
          Read an unsigned 32 bit integer.

    `%u64'
          Read an unsigned 64 bit integer.

    `%*FMT'
          Skips a field of the specified format type.

    `%[...]'
          Read characters until one is encountered that does not match
          those between the brackets.

    `%[^...]'
          Read characters until one is encountered that does match
          those between the brackets.

     The parameters below may be specified.

    `bufsiz'
          Maximum string length in bytes. The default value is 4095.

    `collectoutput'
          When true, output cells with the same data type will be
          concatenated into a single array. The default value is
          `false'.

    `commentstyle'
             * When a string, all characters between the string and the
               end of the line will be ignored.

             * When a cell array of two strings, the first indicates
               the beginning of a comment and the second the end of a
               comment. This allows C-style comments to be ignored,
               i.e. {"/*", "*/"}.

    `delimiter'
          The characters used to delimit different fields. The
          default value is a space.

    `emptyvalue'
          Value assigned to fields that are empty. The default value is
          NaN.

    `endofline'
          The end of line character. The default is any of `\n', `\r\n',
          or `\r'.

    `expchars'
          The characters which are understood to indicate the exponent.
          The default value is "eEdD".

    `headerlines'
          The number of lines at the beginning of the file to be
          skipped. The default value is zero.

    `multipledelimsasone'
          If true, consecutive delimiters will be interpreted as a
          single delimiter.  This option only has an effect if a
          `delimiter' is specified.  The default value is `false'.

    `returnonerror'
          Indicates the behavior when an error is encountered. If set
          to `false', `textscan' will attempt to continue. If set to
          `true', an error will be returned. The default value is
          `false'.

    `treatasempty'
          Specify one or more strings which will be treated as being
          equivalent to an empty field. Value may either be a single
          string, or a cell array of strings. There is no default value.

    `whitespace'
          Characters that are interpreted as whitespace. The default
          value is ` \b\t'.

     The optional input, N, specifies the number of lines to be read from
     the file associated with FID.

     The output, C, is a cell array whose length is given by the number
     of format specifiers.

     The second output, POSITION, gives the position, in characters,
     from the beginning of the file or string.

     See also: dlmread, fscanf, load, strread, textread
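
As an illustration of how a few of these parameters interact, here is a small Python sketch of splitting one line into numeric fields. It only models `delimiter', `multipledelimsasone', `emptyvalue', `treatasempty', and the single-string form of `commentstyle', as described in the doc-string; the function names split_fields and to_numbers are hypothetical and this is not how any existing implementation is written.

```python
import math
import re

def split_fields(line, delimiter=" ", multiple_delims_as_one=False,
                 comment_style=None):
    """Split one line into raw text fields."""
    if comment_style is not None:
        # Single-string comment style: drop everything up to end of line.
        line = line.split(comment_style, 1)[0]
    # Any one of the delimiter characters separates two fields.
    pattern = "[" + re.escape(delimiter) + "]"
    if multiple_delims_as_one:
        pattern += "+"   # a run of delimiters counts as one separator
    return [field.strip() for field in re.split(pattern, line)]

def to_numbers(fields, empty_value=math.nan, treat_as_empty=()):
    """Convert fields to floats, substituting empty_value where needed."""
    out = []
    for field in fields:
        if field == "" or field in treat_as_empty:
            out.append(empty_value)
        else:
            out.append(float(field))
    return out

line = "1,,3,NA % trailing comment"
fields = split_fields(line, delimiter=",", comment_style="%")
values = to_numbers(fields, treat_as_empty=("NA",))
print(fields, values)
```

With `multipledelimsasone' switched on, the empty second field disappears instead of being replaced by `emptyvalue', so the two options can change the number of columns that come back.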





