Re: locale encoding and core functions


From: Andrew Janke
Subject: Re: locale encoding and core functions
Date: Mon, 4 Mar 2019 23:33:18 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.1



On 2/23/19 4:12 AM, "Markus Mützel" wrote:
TL;DR: Is there a way to tell whether an .m file belongs to Octave core or is a user 
function?

Some background:
With the upcoming Octave 5 it will be possible to set the mfile_encoding that is used 
to read .m files. This is important because Octave has to know which encoding an .m 
file uses in order to correctly display non-ASCII characters in strings (e.g. in the 
"workspace" view or in plots). This is done by converting from whatever encoding the 
user set up to UTF-8, and converting to whatever encoding is necessary at any 
interfaces.
However, there is a problem when we read core .m files which are always encoded 
in UTF-8 (and not in the encoding the user set up). On conversion of these 
files from the locale encoding to UTF-8, non-ASCII characters result in garbled 
text.
E.g. the German character "ä" encoded in UTF-8 is represented by two bytes: c3 a4. Assume that users would set the 
mfile_encoding to "ISO 8859-1" (Latin1). Then these two bytes are interpreted as representing the two letters 
"Ã¤". This means that a string from a core .m file that contained the letter "ä" would display as 
"Ã¤" for those users.

None of the core .m files contain any non-ASCII characters at the moment. 
However, there are a few help texts in some Octave Forge packages that do. See 
also bug #55195 [1].

The conversion to UTF-8 is done in "file_reader::get_input" in the file 
"input.cc".
If we knew in that function that the file we read from was from the core (or an 
Octave Forge package), we could skip the conversion from the locale encoding to 
mitigate the problem.

So back to the initial question: Is there a way to pass this information down 
to that function?

Markus

PS: This problem mostly affects Windows users, for whom the default mfile_encoding 
depends on the Windows locale (see also bug #49685 [2]). But in general any user 
who prefers to use an encoding other than UTF-8 in their .m files would be affected 
by this bug.

[1]: https://savannah.gnu.org/bugs/index.php?55195
[2]: https://savannah.gnu.org/bugs/index.php?49685


Fixed-encoding support like this sounds like a good idea. I would like to be able to use non-ASCII characters in .m source code in a portable manner. And I can see use cases for this in core M-code: example and test data may want to use international or special characters, both to test that the code under test supports them and to provide examples for advanced usage. It would be convenient to enter these as literal characters instead of having to use \x escape sequences.

But just switching on "core/Forge" vs "user" .m files may not be the best way to do it in the long run. In particular, I think these encoding concerns apply to non-core Octave code, too.

There's no direct way to detect whether an .m file is from core Octave. But you could build a function to do so on top of __pathorig__() pretty easily: Take that path and remove all the paths under the pkg installation locations. What's left is, I think, the Octave default core path. You could consider any .m file from one of those paths to be "core" Octave; anything else to be user-defined Octave.
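
Here's a rough, untested sketch of what I mean. (__pathorig__ is an internal, undocumented function, and the helper name here is made up purely for illustration.)

  function tf = __is_core_mfile__ (mfile)
    ## All dirs on the default load path, minus those owned by installed packages.
    dirs = strsplit (__pathorig__ (), pathsep ());
    installed = pkg ("list");
    pkg_dirs = cellfun (@(p) p.dir, installed, "UniformOutput", false);
    is_pkg = false (size (dirs));
    for i = 1:numel (pkg_dirs)
      is_pkg |= strncmp (dirs, pkg_dirs{i}, numel (pkg_dirs{i}));
    endfor
    core_dirs = dirs(! is_pkg);
    ## Treat the file as "core" if its directory is one of the remaining dirs.
    mdir = fileparts (canonicalize_file_name (mfile));
    tf = any (strcmp (mdir, core_dirs));
  endfunction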

You could also use that path to detect files which are pkg-installed vs on the user path. But that's not the same as detecting Octave Forge packages, because users might also install non-Forge packages using pkg. You would have to look into the installation metadata for each package to determine Forge vs non-Forge.
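
Again just a sketch with a made-up helper name: the pkg-installed check could reuse the install dirs reported by pkg ("list"). Note this only says "installed via pkg", not "Octave Forge"; distinguishing Forge from non-Forge packages would mean inspecting each package's packinfo/DESCRIPTION metadata, which this doesn't attempt.

  function tf = __is_pkg_mfile__ (mfile)
    installed = pkg ("list");
    mdir = fileparts (canonicalize_file_name (mfile));
    tf = false;
    for i = 1:numel (installed)
      ## A file counts as pkg-installed if it sits under a package's install dir.
      if (strncmp (mdir, installed{i}.dir, numel (installed{i}.dir)))
        tf = true;
        return;
      endif
    endfor
  endfunction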

But this core-vs-user detection has a couple of drawbacks, at least for Octave developers. It's really convenient to be able to work on Octave's .m files by cloning the octave repo, firing up a reference installed Octave, and sticking selected directories from your local repo's scripts/ dir on the front of the Octave path. If the encoding of those .m files were detected differently in that setup, this wouldn't work portably whenever the source files contained non-ASCII characters.

My real issue is that this doesn't support portability for .m code outside core Octave, which I think is a worthy goal. In today's globalized world, you might well want to share code between developers or users that are in different locales and have different default encodings on their machines. It would be nice if Octave projects were easily portable between those users without requiring them to do special configuration on their machines.

Let's say I have colleagues Edward in the UK, Cixin in China, and Juri in Japan. Edward uses an English Windows machine, Cixin's machine defaults to GB2312 encoding, and Juri's defaults to Shift-JIS. I'm running a US English Mac. Edward, Cixin, and Juri have each written Octave library projects, with .m files in their local default encoding, and we all want to write programs that use all those libraries. How can this be done? If "non-core" .m files are always read with the default system encoding, then Cixin's and Juri's files will always be garbled for Edward, and vice versa. And there's no system default encoding I can set that will let me use all these libraries at the same time. (Short of manually transcoding their source files, which is a big pain, and a total no-go if you have developers working in multiple encodings from the same project git repo.)

Another example: I have an Octave package octave-table (https://github.com/apjanke/octave-table), and I would like its +table_examples namespace to include examples with international text and emoji and the like, to demonstrate that they are supported. How should its source files be written so that they work for users running under any default encoding? I think they need to be encoded in Unicode, and Octave has to have a mechanism to know to interpret them as Unicode (or as a specific UTF format).

And if Octave does encoding detection differently for Octave Forge and non-Octave-Forge packages, would I then need to transcode my files if my package is eventually accepted to Octave Forge? When doing further development, would I also need to go through a "pkg install" step each time I changed some source code and wanted to test it?

I suspect the only way to resolve this is something like either:

a) support an explicit source code encoding indicator at a per-project, per-directory, or per-m-file level, or

b) take a big breaking change, and require all .m source files to always be in Unicode. Then locales are irrelevant when reading source.

For a), you could support a special .encoding file in either each M-code source dir (the directories added to the Octave path) or the project root (which would have to be inferred by traversing up the directory tree above the source dirs), and add UTF-8 .encoding files to all Octave core and Octave Forge code dirs. Or, for a file-level indicator, you could support a magic "%encoding <whatever>" comment, like Ruby and Python do. I would prefer a per-project/dir .encoding file, because you only need to remember to do it once, not once per file. That also makes it easier to add after the fact for existing projects that need to be internationalized.
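
To make option a) concrete, here's the kind of lookup I'm imagining. This is entirely hypothetical: the ".encoding" file name and its "one short line naming the encoding" format are just assumptions for illustration, not anything Octave supports today.

  function enc = __source_encoding_for__ (mfile)
    ## Walk upward from the file's directory looking for an ".encoding" file
    ## that contains the name of the source encoding (e.g. "utf-8").
    d = fileparts (canonicalize_file_name (mfile));
    while (true)
      enc_file = fullfile (d, ".encoding");
      if (exist (enc_file, "file"))
        enc = strtrim (fileread (enc_file));
        return;
      endif
      parent = fileparts (d);
      if (strcmp (parent, d))   # reached the filesystem root; give up
        break;
      endif
      d = parent;
    endwhile
    enc = "";   # empty => fall back to mfile_encoding / the locale default
  endfunction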

Figuring out the Matlab compatibility situation is difficult. There are some threads discussing this, but they all confuse source code file encoding with the runtime's I/O and character data processing, and no docs come right out and explicitly say how Matlab handles character encoding of its .m source files.

https://www.mathworks.com/matlabcentral/answers/340903-unicode-characters-in-m-file
https://www.mathworks.com/matlabcentral/answers/262114-why-i-can-not-read-comments-in-chinese-in-my-mfile
https://stackoverflow.com/questions/4984532/unicode-characters-in-matlab-source-files
https://www.mathworks.com/help/matlab/matlab_env/how-the-matlab-process-uses-locale-settings.html

Reading between the lines (and using memories from the dim past), I think Matlab always treats .m source files as being in the system default encoding. So I don't think there's a way to support full easy Matlab portability and full easy locale portability at the same time. And the Matlab editor does not have good non-ASCII support, so it's harder to tell what's going on.

Here's another weird edge case: if different .m files are going to be interpreted as being in different encodings, how do strings with "\x" escape sequences in those files work? Are the byte sequences produced by the "\x" escapes interpreted as being in the same encoding as that source file, or are they always taken to be in the internal encoding used by Octave's string objects? More generally, what transcoding is applied to string literals in M source, and does the "\x" escape interpretation happen before or after that transcoding? In either scenario, is it actually possible for a developer to portably write a string literal that uses \x escapes to encode multibyte international characters?
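
For what it's worth, with Octave's current behavior (char arrays are effectively byte arrays), I believe the escapes always yield the literal bytes, no matter how the source file itself is encoded:

  s = "\xc3\xa4";       # the two bytes 0xC3 0xA4, regardless of file encoding
  disp (double (s))     # prints 195 164
  ## Whether those bytes later display as "ä" depends on whether Octave, the
  ## terminal, the plotting backend, etc. treat char data as UTF-8, not on the
  ## encoding declared for the .m file that contains the escape sequence.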

Cheers,
Andrew


