Re: locale encoding and core functions
From: Andrew Janke
Subject: Re: locale encoding and core functions
Date: Mon, 4 Mar 2019 23:33:18 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.1
On 2/23/19 4:12 AM, "Markus Mützel" wrote:
> TL;DR: Is there a way to find out whether an .m file comes from Octave core
> or is a user function?
>
> Some background:
> With the upcoming Octave 5 it will be possible to set the mfile_encoding that
> is used to read .m files. This is important because Octave has to know which
> encoding is used in an .m file to correctly display non-ASCII characters in
> strings (e.g. in the "workspace" view or in plots). This is done by converting
> from whatever encoding the user set up to UTF-8 internally, and converting to
> whatever encoding is necessary at the interfaces.
>
> However, there is a problem when we read core .m files, which are always
> encoded in UTF-8 (and not in the encoding the user set up). On conversion of
> these files from the locale encoding to UTF-8, non-ASCII characters result in
> garbled text.
>
> E.g. the German character "ä" encoded in UTF-8 is represented by two bytes:
> c3 a4. Assume a user sets the mfile_encoding to "ISO 8859-1" (Latin-1). Then
> these two bytes are interpreted as the two separate characters "Ã¤". This
> means that a string from a core .m file containing the letter "ä" would
> display as "Ã¤" for that user.
>
> None of the core .m files contain any non-ASCII characters at the moment.
> However, a few help texts in some Octave Forge packages do. See also bug
> #55195 [1].
>
> The conversion to UTF-8 is done in "file_reader::get_input" in the file
> "input.cc". If we knew in that function that the file we are reading is from
> the core (or an Octave Forge package), we could skip the conversion from the
> locale encoding to mitigate the problem.
>
> So back to the initial question: Is there a way to pass this information down
> to that function?
>
> Markus
>
> PS: This problem mostly affects Windows users, where the default
> mfile_encoding depends on the locale of Windows (see also bug #49685 [2]).
> But in general, any user who prefers an encoding other than UTF-8 in their
> .m files would be affected by this bug.
>
> [1]: https://savannah.gnu.org/bugs/index.php?55195
> [2]: https://savannah.gnu.org/bugs/index.php?49685
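For what it's worth, the garbling described above is easy to reproduce at the byte level. A sketch in Python (only because the effect is language-independent, not because Octave works this way internally):

```python
# UTF-8 encodes the German "ä" as the two bytes 0xc3 0xa4.
utf8_bytes = "ä".encode("utf-8")
assert utf8_bytes == b"\xc3\xa4"

# Decoding those same bytes as ISO 8859-1 (Latin-1) treats each byte as
# its own character, producing the classic mojibake "Ã¤":
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # Ã¤
```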
Fixed-encoding support like this sounds like a good idea. I would like
to be able to use non-ASCII characters in .m source code in a portable
manner. And I can see use cases for this in core M-code: example and
test data may want to use international or special characters, both to
test that the code under test supports it, and to provide examples for
advanced usage. It would be convenient to enter these as literal
characters instead of having to use \x escape sequences.
But just switching on "core/Forge" vs "user" .m files may not be the
best way to do it in the long run. In particular, I think these encoding
concerns apply to non-core Octave code, too.
There's no direct way to detect whether an .m file is from core Octave.
But you could build a function to do so on top of __pathorig__() pretty
easily: Take that path and remove all the paths under the pkg
installation locations. What's left is, I think, the Octave default core
path. You could consider any .m file from one of those paths to be
"core" Octave; anything else to be user-defined Octave.
You could also use that path to detect files which are pkg-installed vs
on the user path. But that's not the same as detecting Octave Forge
packages, because users might also install non-Forge packages using pkg.
You would have to look into the installation metadata for each package
to determine Forge vs non-Forge.
But this core-vs-user detection has a couple of drawbacks, at least for
Octave developers. It's really convenient to work on Octave's .m files by
cloning the octave repo, firing up a reference installed Octave, and
sticking selected directories from your local repo's scripts/ dir on the
front of the Octave path. If the encoding of those .m files were detected
differently in that setup, the workflow wouldn't be portable whenever the
source files contained non-ASCII characters.
My real issue is that this doesn't support portability for .m code
outside core Octave, which I think is a worthy goal. In today's
globalized world, you might well want to share code between developers
or users that are in different locales and have different default
encodings on their machines. It would be nice if Octave projects were
easily portable between those users without requiring them to do special
configuration on their machines.
Let's say I have colleagues Edward in the UK, Cixin in China, and Juri
in Japan. Edward uses an English Windows machine. Cixin runs a machine
with GB2312 default encoding, and Juri runs Shift-JIS default encoding.
I'm running a US English Mac. Edward, Cixin, and Juri have each written
Octave library projects, with .m files in their local default encoding,
and we all want to write programs that use all those libraries. How can
this be done? If "non-core" .m files are always read with the default
system encoding, then Cixin and Juri's files will always be garbled for
Edward, and vice versa. And there's no system default encoding I can set
that will allow me to use all these libraries at the same time. (Without
manually transcoding their source files, which is a big pain, and a
total no-go if you have developers in multiple encodings working from
the same project git repo.)
Another example: I have an Octave package octave-table
(https://github.com/apjanke/octave-table), and I would like its
+table_examples namespace to include examples with international text
and emoji and the like, to demonstrate that they are supported. How
should its source files be written so that they work for users running
under any default encoding? I think they need to be encoded in Unicode,
and Octave has to have a mechanism to know to interpret them as Unicode
(or as a specific UTF format).
And if Octave does encoding detection differently for Octave Forge and
non-Octave-Forge packages, would I then need to transcode my files if my
package is eventually accepted to Octave Forge? When doing further
development, would I also need to go through a "pkg install" step each
time I changed some source code and wanted to test it?
I suspect the only way to resolve this is something like either:
a) support an explicit source code encoding indicator at a per-project,
per-directory, or per-m-file level, or
b) take a big breaking change, and require all .m source files to always
be in Unicode. Then locales are irrelevant when reading source.
For a), you could support a special .encoding file in either each M-code
source dir (the directories added to the Octave path) or the project root
(which would have to be inferred by traversing up the directory tree from
the source dirs), and add UTF-8 .encoding files to all Octave core and
Octave Forge code dirs. Or, for a file-level indicator, you could support
a magic "%encoding <whatever>" comment, like Ruby and Python do.
I would prefer a per-project/dir .encoding, because you only need to
remember to do it once, not once per file. That also makes it easier to
add after the fact for existing projects that need to be
internationalized.
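For concreteness, the directory-level marker could be as simple as a one-line file (the name ".encoding" and its format are hypothetical here, not an existing Octave feature):

```
# .encoding, placed in each source dir or at the project root
utf-8
```

and the per-file variant would be a magic first-line comment in the .m file itself, in the style of Ruby's and Python's source encoding declarations:

```
%encoding iso-8859-1
```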
Figuring out the Matlab compatibility situation is difficult. There are
some threads discussing this, but they all confuse source code file
encoding with the runtime's I/O and character data processing, and no
docs come right out and explicitly say how Matlab handles character
encoding of its .m source files.
https://www.mathworks.com/matlabcentral/answers/340903-unicode-characters-in-m-file
https://www.mathworks.com/matlabcentral/answers/262114-why-i-can-not-read-comments-in-chinese-in-my-mfile
https://stackoverflow.com/questions/4984532/unicode-characters-in-matlab-source-files
https://www.mathworks.com/help/matlab/matlab_env/how-the-matlab-process-uses-locale-settings.html
Reading between the lines (and using memories from the dim past), I
think Matlab always treats .m source files as being in the system
default encoding. So I don't think there's a way to support full easy
Matlab portability and full easy locale portability at the same time.
And the Matlab editor does not have good non-ASCII support, so it's
harder to tell what's going on.
Here's another weird edge case: If different .m files are going to be
interpreted as being in different encodings, how do strings with "\x"
escape sequences in those files work? Are those byte sequences produced
by the "\x" escapes interpreted as being in the same encoding as that
source file? Or are they always considered to be in the internal
encoding used by Octave's string objects? More generally, what
transcoding is applied to string literals in M source, and does the "\x"
escape interpretation happen before or after that transcoding? In either
of these scenarios, is it actually possible for a developer to portably
write a string literal that uses \x escapes to encode multibyte
international characters?
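To make that ambiguity concrete, here is a small sketch (Python, because the byte-level question is language-independent; which of the two behaviors Octave actually implements is exactly the open question):

```python
# Suppose an .m file declared as ISO 8859-1 contains the string literal
# "\xc3\xa4". Those two escape bytes could plausibly mean either:
raw = b"\xc3\xa4"

# (1) raw bytes in the file's own encoding, transcoded like ordinary
#     literal text -> two characters:
as_latin1 = raw.decode("iso-8859-1")

# (2) bytes already in the interpreter's internal (UTF-8) encoding,
#     passed through untouched -> one character:
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1))  # 'Ã¤'
print(repr(as_utf8))    # 'ä'
```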
Cheers,
Andrew