From: | Andrew Janke |
Subject: | Re: Unicode support in io Forge package |
Date: | Sat, 19 Oct 2019 11:35:14 -0700 |
User-agent: | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.9.0 |
On 10/19/19 5:51 AM, PhilipNienhuis wrote:
> apjanke-floss wrote:
>
> > Hi, Octave and io maintainers,
> >
> > I'm confused by the Unicode support in the io package. In particular, the functions unicode2utf8 and utf82unicode, and the "encode_utf" options in some of the ods/xls read/write functions.
> >
> > What is the encoding that utf82unicode/unicode2utf8 are calling "unicode" here? It looks like it's doing a single-byte encoding, treating each byte as an unsigned int 0-255, and treating those 0-255 values directly as Unicode code point values. That's not any of the standard Unicode encodings. (But I think it is exactly the same as Latin-1/ISO 8859-1.)
> >
> > As I understand it, since about Octave 4.4, Octave's internal encoding (that is, how it interprets Octave char arrays) is either UTF-8 or an opaque array of bytes; it's never in the "system code page" or some other locale-specific encoding.
> >
> > Is this UTF-8 support in io still relevant/correct? Maybe it should be deprecated or renamed/removed? Since Octave now supports UTF-8, I think you'd want to just leave UTF-8 text as is in all cases.
>
> AFAIR to apply unicode2utf8 and utf82unicode there needs to be an option set explicitly. I've also lost track of why it was included (and have no time to dive into the Mercurial logs now), but there sure was a good reason for it, like bug reports etc.
>
> In core Octave there's native2unicode and unicode2native; maybe those are better alternatives.
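To make the Latin-1 observation in the quoted message concrete, here is a minimal sketch in Python (used only for illustration; it is not the io package's code). It assumes the reading above is right, namely that the "unicode" form stores one code point per byte. Both helper names are hypothetical.

```python
def code_points_to_single_bytes(text: str) -> bytes:
    """Store each code point as one byte; only works for U+0000..U+00FF."""
    # bytes() raises ValueError for any code point above 255.
    return bytes(ord(ch) for ch in text)


def single_bytes_to_code_points(data: bytes) -> str:
    """Interpret each byte value 0-255 directly as a Unicode code point."""
    return "".join(chr(b) for b in data)


sample = "café"
packed = code_points_to_single_bytes(sample)

print(packed == sample.encode("latin-1"))  # True: identical to ISO 8859-1
print(packed == sample.encode("utf-8"))    # False: UTF-8 uses two bytes for "é"
```

Under that assumption the scheme is byte-for-byte Latin-1, and it cannot represent anything above U+00FF.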
The io code uses native2unicode as an alternative if it's available, using a feature test. Here's an example from xls2oct.m:
    ## Convert from UTF-8 and strip characters that are not supported by Octave
    ## (any chars < 32 or > 255).
    if (! strcmp (xls.xtype, "COM") && (spsh_opts.convert_utf))
      if (exist ("native2unicode", "file"))
        conv_fcn = @(str) unicode2native (native2unicode (str, "UTF-8"));
      else
        conv_fcn = @utf82unicode;
      endif
      rawarr = tidyxml (rawarr, conv_fcn);
    endif

This is leaving me even more confused: I'm not sure what the round trip through both native2unicode and unicode2native accomplishes, especially since native2unicode converts from the specified code page to UTF-8, so doing native2unicode(str, "UTF-8") should basically be a no-op.
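The round trip can be modelled roughly as follows. This is a sketch in Python, not Octave, and it rests on two assumptions: that native2unicode converts bytes in the named code page to UTF-8, and that unicode2native without a code page argument converts UTF-8 to the system default code page, for which windows-1252 is picked here arbitrarily. The `_model` functions are hypothetical stand-ins, not the real Octave functions.

```python
NATIVE_CODE_PAGE = "cp1252"  # assumption: stand-in for the system default code page


def native2unicode_model(data: bytes, code_page: str) -> bytes:
    """Model: convert bytes in code_page to UTF-8 bytes."""
    return data.decode(code_page).encode("utf-8")


def unicode2native_model(utf8_data: bytes, code_page: str = NATIVE_CODE_PAGE) -> bytes:
    """Model: convert UTF-8 bytes to bytes in code_page."""
    return utf8_data.decode("utf-8").encode(code_page)


cell = "Größe".encode("utf-8")  # a UTF-8 cell value as read from the file

step1 = native2unicode_model(cell, "utf-8")
print(step1 == cell)  # True: UTF-8 in, UTF-8 out, nothing changes

step2 = unicode2native_model(step1)
print(step2)  # b'Gr\xf6\xdfe': single-byte code page output
```

If the model holds, the composed function is just a UTF-8 to native-code-page conversion, and the first call contributes nothing for valid UTF-8 input.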
Putting aside the first native2unicode call, I _think_ the use of unicode2native here is incorrect, because even on Windows, Octave's internal strings are now UTF-8 and not the system default code page. I'm going to do some more research and set up some test spreadsheets, but I suspect all the encoding conversion logic here should just be removed.
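One check that could be applied to such test data is whether the converted bytes are still valid UTF-8. The sketch below is again Python for illustration only; `is_valid_utf8` is a hypothetical helper and windows-1252 is an assumed example of a system code page.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "naïve"

print(is_valid_utf8(text.encode("utf-8")))      # True
print(is_valid_utf8(text.encode("cp1252")))     # False: a lone 0xEF byte is not valid UTF-8
print(is_valid_utf8("plain".encode("cp1252")))  # True: pure ASCII is the same in both
```

If the internal strings are expected to be UTF-8, non-ASCII text converted to a single-byte code page would fail a check like this, while ASCII-only cells would hide the problem.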
Cheers, Andrew