Re: Unicode support in io Forge package

octave-maintainers

From:	Andrew Janke
Subject:	Re: Unicode support in io Forge package
Date:	Sat, 19 Oct 2019 11:35:14 -0700
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.9.0



On 10/19/19 5:51 AM, PhilipNienhuis wrote:

apjanke-floss wrote

Hi, Octave and io maintainers,

I'm confused by the Unicode support in the io package. In particular,
the functions unicode2utf8 and utf82unicode, and the "encode_utf"
options in some of the ods/xls read/write functions.

What is the encoding that utf82unicode/unicode2utf8 are calling
"unicode" here? It looks like it's doing a single-byte encoding,
treating each byte as an unsigned int 0-255, and treating those 0-255
values directly as Unicode code point values. That's not any of the
standard Unicode encodings. (But I think it is exactly the same as
Latin-1/ISO 8859-1.)
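
A minimal sketch of that Latin-1 observation, using only core Octave's native2unicode rather than the io functions themselves (whose exact behavior is not verified here). The sample byte values are made up for illustration:

```octave
## "café" stored one byte per character; byte 233 (0xE9) is read
## directly as code point U+00E9, which is how ISO 8859-1 assigns it.
latin1_bytes = uint8 ([99 97 102 233]);

## Decode those bytes as ISO 8859-1 into Octave's internal UTF-8 string.
utf8_str = native2unicode (latin1_bytes, "ISO-8859-1");

## In UTF-8, U+00E9 takes two bytes (195 169), so the char array
## is one element longer than the Latin-1 byte vector.
utf8_bytes = double (utf8_str);
disp (utf8_bytes);
```

If the "unicode" in utf82unicode really is this byte-equals-code-point scheme, it can only represent code points 0-255.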

As I understand it, since about Octave 4.4, Octave's internal encoding
(that is, how it interprets Octave char arrays) is either UTF-8 or an
opaque array of bytes; it's never in the "system code page" or some
other locale-specific encoding.

Is this UTF-8 support in io still relevant/correct? Maybe it should be
deprecated or renamed/removed? Since Octave now supports UTF-8, I think
you'd want to just leave UTF-8 text as is in all cases.


AFAIR, unicode2utf8 and utf82unicode are only applied if an option is set
explicitly.
I've also lost track of why they were included (and have no time to dive
into the Mercurial logs now), but there surely was a good reason for it,
like bug reports etc.

In core Octave there are native2unicode and unicode2native; maybe those are
better alternatives.

The io code uses native2unicode as an alternative if it's available, using a feature test. Here's an example from xls2oct.m:

  ## Convert from UTF-8 and strip characters that are not supported by Octave
  ## (any chars < 32 or > 255).
  if (! strcmp (xls.xtype, "COM") && (spsh_opts.convert_utf))
    if (exist ("native2unicode", "file"))
      conv_fcn = @(str) unicode2native (native2unicode (str, "UTF-8"));
    else
      conv_fcn = @utf82unicode;
    endif
    rawarr = tidyxml (rawarr, conv_fcn);
  endif

This is leaving me even more confused: I'm not sure what the round trip
through both native2unicode and unicode2native accomplishes, especially
since native2unicode converts from the specified code page to UTF-8, so
doing native2unicode(str, "UTF-8") should basically be a no-op.

Putting aside the first native2unicode call, I _think_ the use of
unicode2native here is incorrect, because even on Windows, Octave's
internal strings are now UTF-8 and not the system default code page. I'm
going to do some more research and set up some test spreadsheets, but I
suspect all the encoding conversion logic here should just be removed.
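
A small sketch of what each half of that round trip does to a string that is already UTF-8. The sample bytes are made up, and an explicit target code page is passed to unicode2native here; xls2oct.m omits it, which, as I read the core documentation, falls back to a system-dependent default.

```octave
## "café" as a UTF-8 char array: the "é" is the byte pair 195 169.
s = char ([99 97 102 195 169]);

## First half: "decode" UTF-8 into UTF-8. The bytes come back unchanged.
t = native2unicode (s, "UTF-8");
disp (isequal (double (s), double (t)));

## Second half: re-encode into a single-byte code page. The result is a
## uint8 byte vector in that code page, no longer a UTF-8 string.
latin1 = unicode2native (t, "ISO-8859-1");
disp (double (latin1));
```

So the only step that changes anything is the unicode2native call, which moves the text away from UTF-8.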


Cheers,
Andrew
