octave-maintainers

Re: Unicode support in io Forge package


From: Andrew Janke
Subject: Re: Unicode support in io Forge package
Date: Sat, 19 Oct 2019 11:35:14 -0700



On 10/19/19 5:51 AM, PhilipNienhuis wrote:
> apjanke-floss wrote:
>> Hi, Octave and io maintainers,
>>
>> I'm confused by the Unicode support in the io package. In particular,
>> the functions unicode2utf8 and utf82unicode, and the "encode_utf"
>> options in some of the ods/xls read/write functions.
>>
>> What is the encoding that utf82unicode/unicode2utf8 are calling
>> "unicode" here? It looks like it's doing a single-byte encoding,
>> treating each byte as an unsigned int 0-255, and treating those 0-255
>> values directly as Unicode code point values. That's not any of the
>> standard Unicode encodings. (But I think it is exactly the same as
>> Latin-1/ISO 8859-1.)
>>
>> As I understand it, since about Octave 4.4, Octave's internal encoding
>> (that is, how it interprets Octave char arrays) is either UTF-8 or an
>> opaque array of bytes; it's never in the "system code page" or some
>> other locale-specific encoding.
>>
>> Is this UTF-8 support in io still relevant/correct? Maybe it should be
>> deprecated or renamed/removed? Since Octave now supports UTF-8, I think
>> you'd want to just leave UTF-8 text as is in all cases.
>
> AFAIR, an option needs to be set explicitly for unicode2utf8 and
> utf82unicode to be applied.
> I've also lost track of why they were included (and have no time to dive
> into the Mercurial logs now), but there surely was a good reason for it,
> such as bug reports.
>
> In core Octave there are native2unicode and unicode2native; maybe those
> are better alternatives.

The io code uses native2unicode as an alternative if it's available, using a feature test. Here's an example from xls2oct.m:


  ## Convert from UTF-8 and strip characters that are not supported by Octave
  ## (any chars < 32 or > 255).
  if (! strcmp (xls.xtype, "COM") && (spsh_opts.convert_utf))
    if (exist ("native2unicode", "file"))
      conv_fcn = @(str) unicode2native (native2unicode (str, "UTF-8"));
    else
      conv_fcn = @utf82unicode;
    endif
    rawarr = tidyxml (rawarr, conv_fcn);
  endif

This leaves me even more confused. I'm not sure what the round trip through both native2unicode and unicode2native accomplishes: native2unicode converts from the specified code page to UTF-8, so native2unicode (str, "UTF-8") should basically be a no-op.

Putting aside the first native2unicode call, I _think_ the use of unicode2native here is incorrect, because even on Windows, Octave's internal strings are now UTF-8 and not the system default code page. I'm going to do some more research and set up some test spreadsheets, but I suspect all the encoding conversion logic here should just be removed.
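
To make the concern concrete, here is a minimal sketch of what each step does to a single non-ASCII character. It is not code from the io package: the sample character and the variable names are made up for illustration, and it assumes Octave 4.4 or later, where native2unicode and unicode2native are available.

```octave
## Illustrative sketch only; variable names are hypothetical.
## Assumes Octave >= 4.4 (native2unicode and unicode2native available).

utf8_bytes = uint8 ([195 169]);   # UTF-8 encoding of U+00E9 ("e" with acute accent)
str = char (utf8_bytes);          # the same bytes as an Octave char array

## Step 1: "decoding" text that is already UTF-8 should leave it unchanged.
step1 = native2unicode (str, "UTF-8");
same_after_step1 = isequal (double (step1), double (str));

## Step 2 with an explicit target: Latin-1 stores U+00E9 as the single byte 233.
latin1_bytes = unicode2native (str, "ISO-8859-1");
is_latin1 = isequal (double (latin1_bytes(:).'), 233);

## Step 2 as called in xls2oct.m passes no code page, so the result follows
## the system default code page and may differ between machines.
sys_bytes = unicode2native (str);

printf ("step 1 unchanged: %d\n", same_after_step1);
printf ("Latin-1 is byte 233: %d\n", is_latin1);
printf ("system code page bytes: %s\n", mat2str (double (sys_bytes(:).')));
```

If the last line prints anything other than [195 169] on a given machine, then the conversion is altering text that Octave already holds as UTF-8, which is the behavior I suspect is wrong.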

Cheers,
Andrew


