octave-maintainers

Re: locale encoding and core functions


From: Markus Mützel
Subject: Re: locale encoding and core functions
Date: Sat, 9 Mar 2019 16:10:12 +0100

On 5 March 2019 at 05:33, "Andrew Janke" wrote:
> On 2/23/19 4:12 AM, "Markus Mützel" wrote:
> I suspect the only way to resolve this is something like either:
> a) support an explicit source code encoding indicator at a per-project, 
> per-directory, or per-m-file level, or
> b) take a big breaking change, and require all .m source files to always 
> be in Unicode. Then locales are irrelevant when reading source.
> 
> For a), you could support a special .encoding file in either each M-code 
> source dir (the things added to the Octave path) or project root (would 
> have to be inferred by just traversing up the directory path above 
> source root files), and add UTF-8 .encoding files to all Octave core and 
> Octave Forge code dirs. Or for the file-level indicator, you could 
> support a magic "%encoding <whatever>" comment, like Ruby and Python do. 
> I would prefer a per-project/dir .encoding, because you only need to 
> remember to do it once, and not per file. Which also makes it easier to 
> add it in after the fact for existing projects that need to be 
> internationalized.

Your idea of .encoding files in each directory sounds promising. Maybe we 
should use ".mfile-encoding" or some other, more specific name.
I'd rather not traverse up the directory tree to look for that file. When 
should we stop looking? Should we traverse all the way up to the root? What 
should be done if we reach a directory without read access?
I would also prefer not to parse each source file for a magic comment.
Both of these options also sound like they might hurt first-run performance.
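
A minimal sketch of the per-directory lookup discussed above, written in Python purely for illustration (it is not Octave code). The marker name `.mfile-encoding`, the helpers `encoding_for_dir` and `read_m_file`, and the UTF-8 fallback are all assumptions, not existing Octave behaviour. The marker is read only from the directory that contains the .m file, so there is no upward traversal, and the result is cached so each directory is inspected at most once.

```python
import codecs
import os
from functools import lru_cache

MARKER_NAME = ".mfile-encoding"   # hypothetical name, as proposed above
DEFAULT_ENCODING = "utf-8"        # assumed fallback; could be the locale encoding


@lru_cache(maxsize=None)
def encoding_for_dir(directory):
    """Return the encoding declared by the marker file in `directory`.

    Only `directory` itself is inspected; parent directories are never
    searched. Falls back to DEFAULT_ENCODING if the marker is missing,
    unreadable, empty, or names an unknown codec.
    """
    marker = os.path.join(directory, MARKER_NAME)
    try:
        with open(marker, "r", encoding="ascii") as fh:
            name = fh.readline().strip()
        return codecs.lookup(name).name if name else DEFAULT_ENCODING
    except (OSError, UnicodeDecodeError, LookupError):
        return DEFAULT_ENCODING


def read_m_file(path):
    """Read an .m file using the encoding declared for its directory."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding=encoding_for_dir(directory)) as fh:
        return fh.read()
```

With this layout the only extra cost is one small file read per directory on the load path, rather than a scan of every source file or a walk towards the root.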


> Figuring out the Matlab compatibility situation is difficult.
I think anything we do in that respect would automatically beat Matlab, which 
ignores the source file encoding.

> Reading between the lines (and using memories from the dim past), I 
> think Matlab always treats .m source files as being in the system 
> default encoding.
That is what I gathered as well.

> Here's another weird edge case: If different .m files are going to be 
> interpreted as being in different encodings, how do strings with "\x" 
> escape sequences in those files work? Are those byte sequences produced 
> by the "\x" escapes interpreted as being in the same encoding as that 
> source file? Or are they always considered to be in the internal 
> encoding used by Octave's string objects? More generally, what 
> transcoding is applied to string literals in M source, and does the "\x" 
> escape interpretation happen before or after that transcoding? In either 
> of these scenarios, is it actually possible for a developer to portably 
> write a string literal that uses \x escapes to encode multibyte 
> international characters?
Do we automatically expand \x escape sequences when parsing .m files? Or is this 
something the interpreter does when processing double-quoted strings?
In the latter case, I don't think we have to worry about that.
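
To make the ordering question concrete, here is a small Python illustration (not Octave code). The function names, the choice of ISO-8859-1 as the example source encoding, UTF-8 as the internal encoding, and the simplification to at most two hex digits per escape are all assumptions. It contrasts expanding `\x` escapes after the source has been transcoded with expanding them before transcoding. The escape text itself is pure ASCII, so transcoding leaves it unchanged; only the meaning of the byte it produces differs.

```python
import re

# Simplified: matches \x followed by one or two hex digits.
_HEX_ESCAPE = re.compile(rb"\\x([0-9A-Fa-f]{1,2})")


def expand_hex_escapes(raw):
    """Replace each \\xHH escape in a byte string with the byte it names."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def transcode_then_expand(source_bytes, source_encoding):
    """Transcode the source to UTF-8 first, then expand escapes.

    The escaped bytes land in the internal string untouched.
    """
    utf8 = source_bytes.decode(source_encoding).encode("utf-8")
    return expand_hex_escapes(utf8)


def expand_then_transcode(source_bytes, source_encoding):
    """Expand escapes first, then transcode the result to UTF-8.

    The escaped bytes are interpreted in the source file's encoding.
    """
    expanded = expand_hex_escapes(source_bytes)
    return expanded.decode(source_encoding).encode("utf-8")


# Contents of a double-quoted literal as typed in an ISO-8859-1 file:
# the four ASCII characters \ x E 9.
literal = rb"\xE9"

after = transcode_then_expand(literal, "iso-8859-1")
before = expand_then_transcode(literal, "iso-8859-1")
```

Here `after` is the single byte 0xE9, which is not valid UTF-8 on its own, while `before` is the two-byte UTF-8 sequence 0xC3 0xA9. Under these assumptions, a literal written as "\xC3\xA9" yields the same internal bytes regardless of the file's encoding only if escapes are expanded after transcoding.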

Markus


