Re: locale encoding and core functions


From: Andrew Janke
Subject: Re: locale encoding and core functions
Date: Mon, 4 Mar 2019 23:33:18 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.1



On 2/23/19 4:12 AM, "Markus Mützel" wrote:
TL;DR: Is there a way to tell whether an .m file belongs to Octave core or is a user 
function?

Some background:
With the upcoming Octave 5 it will be possible to set the mfile_encoding that is used 
to read .m files. This is important because Octave has to know which encoding an .m 
file uses in order to correctly display non-ASCII characters in strings (e.g. in the 
"workspace" view or in plots). This is done by converting from whatever encoding the 
user set up to UTF-8, and converting to whatever encoding is necessary at any 
interfaces.
However, there is a problem when we read core .m files which are always encoded 
in UTF-8 (and not in the encoding the user set up). On conversion of these 
files from the locale encoding to UTF-8, non-ASCII characters result in garbled 
text.
E.g. the German character "ä" encoded in UTF-8 is represented by two bytes: c3 a4. Assume that users would set the 
mfile_encoding to "ISO 8859-1" (Latin1). Then these two bytes are interpreted as representing the two letters 
"Ã¤". This means that a string from a core .m file that contained the letter "ä" would display as 
"Ã¤" for those users.

None of the core .m files contain any non-ASCII characters at the moment. 
However, there are a few help texts in some Octave Forge packages that do. See 
also bug #55195 [1].

The conversion to UTF-8 is done in "file_reader::get_input" in the file 
"input.cc".
If we knew in that function that the file we read from was from the core (or an 
Octave Forge package), we could skip the conversion from the locale encoding to 
mitigate the problem.

So back to the initial question: Is there a way to pass this information down 
to that function?

Markus

PS: This problem mostly affects Windows users, for whom the default mfile_encoding 
depends on the Windows locale (see also bug #49685 [2]). But in general any user 
who prefers to use an encoding other than UTF-8 in their .m files would be affected 
by this bug.

[1]: https://savannah.gnu.org/bugs/index.php?55195
[2]: https://savannah.gnu.org/bugs/index.php?49685


Fixed-encoding support like this sounds like a good idea. I would like to be able to use non-ASCII characters in .m source code in a portable manner. And I can see use cases for this in core M-code: example and test data may want to use international or special characters, both to test that the code under test supports them and to provide examples for advanced usage. It would be convenient to enter these as literal characters instead of having to use \x escape sequences.

But just switching on "core/Forge" vs "user" .m files may not be the best way to do it in the long run. In particular, I think these encoding concerns apply to non-core Octave code, too.

There's no direct way to detect whether an .m file is from core Octave. But you could build a function to do so on top of __pathorig__() pretty easily: Take that path and remove all the paths under the pkg installation locations. What's left is, I think, the Octave default core path. You could consider any .m file from one of those paths to be "core" Octave; anything else to be user-defined Octave.
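
Here's a rough, untested sketch of what I mean. (__pathorig__ is an internal, undocumented function, and the helper name here is made up purely for illustration.)

  function tf = __is_core_mfile__ (mfile)
    ## All dirs on the default load path, minus those owned by installed packages.
    dirs = strsplit (__pathorig__ (), pathsep ());
    installed = pkg ("list");
    pkg_dirs = cellfun (@(p) p.dir, installed, "UniformOutput", false);
    is_pkg = false (size (dirs));
    for i = 1:numel (pkg_dirs)
      is_pkg |= strncmp (dirs, pkg_dirs{i}, numel (pkg_dirs{i}));
    endfor
    core_dirs = dirs(! is_pkg);
    ## Treat the file as "core" if its directory is one of the remaining dirs.
    mdir = fileparts (canonicalize_file_name (mfile));
    tf = any (strcmp (mdir, core_dirs));
  endfunction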

You could also use that path to detect files which are pkg-installed vs on the user path. But that's not the same as detecting Octave Forge packages, because users might also install non-Forge packages using pkg. You would have to look into the installation metadata for each package to determine Forge vs non-Forge.
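
Again just a sketch with a made-up helper name: the pkg-installed check could reuse the install dirs reported by pkg ("list"). Note this only says "installed via pkg", not "Octave Forge"; distinguishing Forge from non-Forge packages would mean inspecting each package's packinfo/DESCRIPTION metadata, which this doesn't attempt.

  function tf = __is_pkg_mfile__ (mfile)
    installed = pkg ("list");
    mdir = fileparts (canonicalize_file_name (mfile));
    tf = false;
    for i = 1:numel (installed)
      ## A file counts as pkg-installed if it sits under a package's install dir.
      if (strncmp (mdir, installed{i}.dir, numel (installed{i}.dir)))
        tf = true;
        return;
      endif
    endfor
  endfunction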

But this core-vs-user detection has a couple of drawbacks, at least for Octave developers. It's really convenient to be able to work on Octave's .m files by cloning the octave repo, firing up a reference installed Octave, and sticking selected directories from your local repo's scripts/ dir on the front of the Octave path. If the encoding of those .m files were detected differently in that setup, this wouldn't work portably whenever the source files contained non-ASCII characters.

My real issue is that this doesn't support portability for .m code outside core Octave, which I think is a worthy goal. In today's globalized world, you might well want to share code between developers or users that are in different locales and have different default encodings on their machines. It would be nice if Octave projects were easily portable between those users without requiring them to do special configuration on their machines.

Let's say I have colleagues Edward in the UK, Cixin in China, and Juri in Japan. Edward uses an English Windows machine, Cixin's machine defaults to GB2312 encoding, and Juri's defaults to Shift-JIS. I'm running a US English Mac. Edward, Cixin, and Juri have each written Octave library projects, with .m files in their local default encoding, and we all want to write programs that use all those libraries. How can this be done? If "non-core" .m files are always read with the default system encoding, then Cixin's and Juri's files will always be garbled for Edward, and vice versa. And there's no system default encoding I can set that will let me use all these libraries at the same time. (Short of manually transcoding their source files, which is a big pain, and a total no-go if you have developers working in multiple encodings from the same project git repo.)

Another example: I have an Octave package octave-table (https://github.com/apjanke/octave-table), and I would like its +table_examples namespace to include examples with international text and emoji and the like, to demonstrate that they are supported. How should its source files be written so that they work for users running under any default encoding? I think they need to be encoded in Unicode, and Octave has to have a mechanism to know to interpret them as Unicode (or as a specific UTF format).

And if Octave does encoding detection differently for Octave Forge and non-Octave-Forge packages, would I then need to transcode my files if my package is eventually accepted to Octave Forge? When doing further development, would I also need to go through a "pkg install" step each time I changed some source code and wanted to test it?

I suspect the only way to resolve this is something like either:

a) support an explicit source code encoding indicator at a per-project, per-directory, or per-m-file level, or

b) take a big breaking change, and require all .m source files to always be in Unicode. Then locales are irrelevant when reading source.

For a), you could support a special .encoding file in either each M-code source dir (the directories added to the Octave path) or the project root (which would have to be inferred by traversing up the directory tree above the source dirs), and add UTF-8 .encoding files to all Octave core and Octave Forge code dirs. Or, for a file-level indicator, you could support a magic "%encoding <whatever>" comment, like Ruby and Python do. I would prefer a per-project/dir .encoding file, because you only need to remember to do it once, not once per file. That also makes it easier to add after the fact for existing projects that need to be internationalized.
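
To make option a) concrete, here's the kind of lookup I'm imagining. This is entirely hypothetical: the ".encoding" file name and its "one short line naming the encoding" format are just assumptions for illustration, not anything Octave supports today.

  function enc = __source_encoding_for__ (mfile)
    ## Walk upward from the file's directory looking for an ".encoding" file
    ## that contains the name of the source encoding (e.g. "utf-8").
    d = fileparts (canonicalize_file_name (mfile));
    while (true)
      enc_file = fullfile (d, ".encoding");
      if (exist (enc_file, "file"))
        enc = strtrim (fileread (enc_file));
        return;
      endif
      parent = fileparts (d);
      if (strcmp (parent, d))   # reached the filesystem root; give up
        break;
      endif
      d = parent;
    endwhile
    enc = "";   # empty => fall back to mfile_encoding / the locale default
  endfunction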

Figuring out the Matlab compatibility situation is difficult. There are some threads discussing this, but they all confuse source code file encoding with the runtime's I/O and character data processing, and no docs come right out and explicitly say how Matlab handles character encoding of its .m source files.

https://www.mathworks.com/matlabcentral/answers/340903-unicode-characters-in-m-file
https://www.mathworks.com/matlabcentral/answers/262114-why-i-can-not-read-comments-in-chinese-in-my-mfile
https://stackoverflow.com/questions/4984532/unicode-characters-in-matlab-source-files
https://www.mathworks.com/help/matlab/matlab_env/how-the-matlab-process-uses-locale-settings.html

Reading between the lines (and using memories from the dim past), I think Matlab always treats .m source files as being in the system default encoding. So I don't think there's a way to support full easy Matlab portability and full easy locale portability at the same time. And the Matlab editor does not have good non-ASCII support, so it's harder to tell what's going on.

Here's another weird edge case: if different .m files are going to be interpreted as being in different encodings, how do strings with "\x" escape sequences in those files work? Are the byte sequences produced by the "\x" escapes interpreted as being in the same encoding as that source file, or are they always taken to be in the internal encoding used by Octave's string objects? More generally, what transcoding is applied to string literals in M source, and does the "\x" escape interpretation happen before or after that transcoding? In either scenario, is it actually possible for a developer to portably write a string literal that uses \x escapes to encode multibyte international characters?
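
For what it's worth, with Octave's current behavior (char arrays are effectively byte arrays), I believe the escapes always yield the literal bytes, no matter how the source file itself is encoded:

  s = "\xc3\xa4";       # the two bytes 0xC3 0xA4, regardless of file encoding
  disp (double (s))     # prints 195 164
  ## Whether those bytes later display as "ä" depends on whether Octave, the
  ## terminal, the plotting backend, etc. treat char data as UTF-8, not on the
  ## encoding declared for the .m file that contains the escape sequence.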

Cheers,
Andrew


