From: Andrew Janke
Subject: [Octave-bug-tracker] [bug #55452] fopen() does not support encoding argument
Date: Sat, 9 Mar 2019 11:43:11 -0500 (EST)

Follow-up Comment #12, bug #55452 (project octave):

> I only checked with fprintf (fid, "%s", string), and fscanf (fid, "%s")
> before. I didn't have a look at "fgetl" yet.

Looks like we'll need a "fgetl()", "fgets()", and/or "textscan()" before this
test can really run. "fscanf()" elides whitespace in its input when doing
`fscanf (fid, '%s')`.

"fgetl()" would be more convenient, because making sure the example files
don't have newlines at the ends of the files is a bit of a bother.

> There doesn't seem to be a convenient function to get the number of
> characters in a string straight away (or I forgot about it). "numel" returns
> the number of bytes in the char array.

I'm really just looking for the number of bytes here, for quick
identity-checking in cases where null bytes are not apparent in the output.
I'm just calling it "chars" because it's counting Octave chars, which are
really just bytes at this point.

Your fputs()/fprintf()/fgetl()/fscanf() ideas make sense.

> I am not sure how to handle "fgets": Should we just read one byte and return
> that? Or should we make sure that we read one character (whatever the number
> of bytes necessary)?

I would say that, since fgets() is line-oriented and that seems to be the
family of functions we should make encoding-aware, it should return characters:
`fgets (fid, len)` should read len characters from the input, however many
bytes that is in the input encoding, and return len characters, however many
bytes that is in Octave's internal/native encoding. That's consistent with a
naive reading of the documentation for fgets(). But it will probably break
compatibility with existing uses of fgets(), because I suspect Octave currently
conflates bytes and characters here, so the len input is interpreted as a
number of bytes (Octave chars), not characters.
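
To make the difference concrete, here is a rough sketch in Python (not Octave,
just because its codecs are handy) of a byte-counted read versus a
character-counted read. The sample text, the choice of Shift-JIS as the input
encoding, and UTF-8 as the internal encoding are all assumptions for
illustration; this is not how fgets() is implemented.

```python
import io

# Hypothetical sample line: seven two-byte Shift-JIS characters plus a newline.
text = "日本語テキスト\n"
raw = text.encode("shift_jis")

# Byte-oriented read: len counts bytes, so len = 3 stops in the middle
# of the second character.
byte_stream = io.BytesIO(raw)
first_bytes = byte_stream.read(3)

# Character-oriented read: len counts characters of the input encoding.
char_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="shift_jis", newline="")
first_chars = char_stream.read(3)

print("input size in bytes:", len(raw))
print("byte-counted read:", first_bytes.hex())
print("char-counted read:", first_chars)
# Assumed internal encoding (UTF-8): the same 3 characters take more bytes.
print("bytes once re-encoded as UTF-8:", len(first_chars.encode("utf-8")))
```

The byte-counted read hands back a fragment that is not valid text on its own,
which is the compatibility problem in a nutshell.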

For that matter, maybe this family of functions should be called
"text-oriented" rather than "line-oriented", since it includes fgets(), which
can stop partway through a line when given a len argument.

> Should we just read one byte and return that?

I don't think that's possible in an encoding-aware world, because a given
character may be represented by different bytes, and even different numbers of
bytes, in the input and Octave internal encodings.
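
A small Python sketch of that point, assuming for illustration that the
internal encoding is UTF-8 and that the inputs are Latin-1 and Shift-JIS (the
helper name `byte_forms` is made up for this example):

```python
# Assumption for illustration only: the internal encoding is UTF-8.
INTERNAL = "utf-8"

def byte_forms(ch, input_encoding):
    """Return (bytes in the input encoding, bytes in the internal encoding)."""
    return ch.encode(input_encoding), ch.encode(INTERNAL)

samples = [("é", "latin-1"), ("あ", "shift_jis")]

for ch, enc in samples:
    src, dst = byte_forms(ch, enc)
    print(f"{ch!r}: {enc} {src.hex()} ({len(src)} bytes) -> "
          f"{INTERNAL} {dst.hex()} ({len(dst)} bytes)")
```

In both cases the byte values and the byte counts differ between the two
encodings, so "one byte in, one byte out" has no meaningful interpretation.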

> I never worked with multi-byte encodings like SHIFT-JIS. How do they encode
> ASCII characters?

Different encodings do it differently. Some encodings are fixed-width (like
UTF-32, or UTF-16 (basically)) and just use the same number of bytes for every
character, so ASCII characters get encoded as multiple bytes with 0-padding.
Some encodings (like Shift-JIS, Big5, or UTF-8) use different numbers of bytes
for different character ranges, and ASCII characters generally stay single
bytes. In these, you check the value of the first byte of a byte sequence to
determine how many bytes there will be in it; its range encodes/implies the
length of the byte sequence for that character.
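
As a sketch of the lead-byte idea, here is how it looks for UTF-8, written in
Python for convenience. The function names are made up for this example, and
the splitter trusts its input (it does not validate continuation bytes). The
last loop just shows how an ASCII "A" comes out in a few encodings.

```python
def utf8_sequence_length(first_byte):
    """Length in bytes of the UTF-8 sequence that starts with first_byte."""
    if first_byte <= 0x7F:
        return 1  # ASCII range: a single byte
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF are never valid.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")

def split_utf8(data):
    """Split well-formed UTF-8 bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars

encoded = "aé語".encode("utf-8")
print([c.hex() for c in split_utf8(encoded)])

# How an ASCII character is encoded in variable- and fixed-width encodings.
for enc in ("shift_jis", "utf-8", "utf-16-le", "utf-32-le"):
    print(enc, "A".encode(enc).hex())
```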

UTF-16 has the additional complication that it's actually a variable-width
encoding: each UTF-16 "code unit" is 2 bytes, and most characters are encoded
as a single code unit, but exotic characters outside the "Basic Multilingual
Plane" are encoded as "surrogate pairs" of two code units. Many UTF-16
implementations (including Matlab and old versions of Java) just ignore this
complexity and pretend that UTF-16 is fixed-width.
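
A short Python sketch of the surrogate-pair arithmetic, for illustration. The
helper names are made up, and the sample characters are arbitrary picks from
inside and outside the BMP.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

def utf16_code_units(text):
    """Number of 2-byte UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u8a9e"         # a CJK character inside the BMP
astral_char = "\U0001F600"  # an emoji outside the BMP

high, low = to_surrogate_pair(ord(astral_char))
print(f"surrogates: {high:04X} {low:04X}")
print("code units for BMP char:", utf16_code_units(bmp_char))
print("code units for non-BMP char:", utf16_code_units(astral_char))
```

An implementation that treats UTF-16 as fixed-width would count the second
character as two "characters", one per code unit.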

