octave-maintainers

Re: How should we treat invalid UTF-8?


From: Andrew Janke
Subject: Re: How should we treat invalid UTF-8?
Date: Mon, 4 Nov 2019 15:48:50 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.9.0

On 11/2/19 8:24 AM, "Markus Mützel" wrote:
> Hi,
> 
> Some time ago, we decided to use UTF-8 as the default encoding in Octave.
> In particular, a change to allow (and require!) UTF-8 in regular expressions 
> [1] triggered a few bug reports and questions on the mailing lists that 
> involved invalid UTF-8 (e.g. [2]).
> Background: Some characters in UTF-8 are encoded with multiple bytes (e.g. 
> the German umlaut "ä" is encoded as decimal [195 164]). As a consequence of 
> how Unicode codepoints are encoded in UTF-8, there are some byte sequences 
> that cannot be correctly decoded to a Unicode codepoint (e.g. a byte with the 
> decimal value 228 on its own). Such byte sequences are called "invalid".
> At the moment, we don't have any logic for handling those invalid byte 
> sequences specially. This can lead to a whole lot of different errors and is 
> not limited to the regexp family of functions. E.g. entering "char (228)" at 
> the Octave prompt leads to a replacement character ("�") being displayed at 
> the command window on Linux (at least for me on Ubuntu 19.04), but it 
> completely breaks the command window on Windows (e.g. [3]).
> Similarly, there are issues when using invalid UTF-8 for strings in plots.
> 
> There are different approaches for how to handle invalid byte sequences in 
> UTF-8 (that are suggested by the standard). I can't find a direct reference 
> right now. But here is what Wikipedia says about it: [4].
> They can mainly be assigned to these 3 groups:
> 1. Throw an error.
> 2. Replace each invalid byte with the same or different replacement 
> characters.
> 3. Fall back to a different encoding for such bytes (e.g. ISO-8859-1 or 
> CP1252).
> 
> Judging from some error reports, (western) users seem to expect that they get 
> a micro sign on entering "char(181)" (and similarly for other printable 
> characters at codepoints 128-255). If we implemented falling back to 
> "ISO-8859-1" or "CP1252", we would follow that principle of least surprise in 
> that respect.
> 
> However, it is not clear to me at which level we would implement that 
> fallback conversion: For some users, it might feel "most natural" to see a 
> "µ" everywhere when they use "char(181)" in their code. Others might be 
> surprised if the conversion from one type (double) to another type (char) and 
> back leads to a different result (different number of elements even!).
> If we don't do the validation on creation of the char vector, there are 
> probably a lot of places where strings should be validated before we use them.
> 
> A similar question arises when reading strings from a file (fopen, fread, 
> fgets, fgetl, textscan, ...): Should we return the bytes as stored in the 
> file? Or should we better assure that the strings are valid?
> 
> Matlab doesn't have the same problem (for western users) because they don't 
> use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters encoded in 
> ISO-8859-1 have the same numeric value in UTF-16 (and equally in UCS-2).
> 
> I am slightly leaning towards implementing some sort of fallback mechanism 
> (see e.g. bug #57107 [2] comment #17). But I'm open to any ideas of how to 
> implement that exactly.
> 
> Another "solution" would be to review our initial decision to use UTF-8. 
> Instead, we could follow Matlab and use a "uint16_t" for our "char" class. 
> But that would probably involve some major changes and a lot of conversions 
> on interfaces to libraries we use.
> 
> Markus
> 
> [1]: http://hg.savannah.gnu.org/hgweb/octave/rev/94d490815aa8
> [2]: https://savannah.gnu.org/bugs/index.php?57107
> [3]: https://savannah.gnu.org/bugs/index.php?57133
> [4]: https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
> 

Hi all,

I'm coming around to the idea that Octave should be conservative and
strict about encodings at I/O and library boundaries: lean toward
erroring out or using replacement characters rather than any
mixed-encoding fallback mechanism. At least for our basic stuff like
fopen/fread/csvread. I think it would support higher-quality code, and
it would be easier for users to understand and diagnose, given a little
explanation.

I don't think we can fully protect users from having to know about
character encodings, and having to know what encoding their input data
is in. And trying to get fancy there could make it harder to do the
"right" thing when program correctness is important.

> There are different approaches for how to handle invalid byte
> sequences in UTF-8 [...]

One note: I don't think this is strictly about invalid byte sequences in
UTF-8, but rather invalid byte sequences in text data in any encoding.
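
That said, to make the UTF-8 case from Markus's background concrete in
Octave terms (assuming a UTF-8 locale, and char holding raw bytes as it
does today):

  double ("ä")       # => 195  164   two bytes encoding one character
  char ([195, 164])  # => "ä"        round-tripping the valid sequence is fine
  char (228)         # => a single 0xE4 byte, which is not valid UTF-8 on
                     #    its own (it is "ä" only if read as ISO-8859-1/CP1252)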

My inclination is to handle invalid encoded byte sequences as follows:
  1. When doing file input or output, raise an error immediately.
    a) That probably (maybe?) goes for encoding-aware, text-oriented
network I/O, like urlread(), too.
  2. When doing transcoding explicitly requested by the user (like a
unicode2native() call), raise an error unless the user explicitly
requested a character-replacement or fallback scheme. (This would be a
change from current behavior; see the sketch after this list.)
  3. When passing text to a UI presentation element that Octave controls
(like a GUI widget, a plot element, or terminal output), use the
"invalid character" replacement character (U+FFFD).
Validation would probably happen whenever you're crossing an encoding
boundary or a library/system-call boundary.
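
Here's a minimal sketch of what the "error by default, replace only on
request" policy could look like at such a boundary. The function name
check_utf8 and its "policy" argument are hypothetical, not an existing
Octave API, and the validation skips fine points like overlong encodings
and surrogates:

  function str = check_utf8 (bytes, policy)
    if (nargin < 2)
      policy = "error";                    # strict by default, as proposed
    endif
    bytes = double (uint8 (bytes(:)'));
    valid = true (size (bytes));
    i = 1;
    while (i <= numel (bytes))
      b = bytes(i);
      if (b < 128)
        n = 1;                             # ASCII
      elseif (bitand (b, 224) == 192)
        n = 2;                             # lead byte of a 2-byte sequence
      elseif (bitand (b, 240) == 224)
        n = 3;                             # lead byte of a 3-byte sequence
      elseif (bitand (b, 248) == 240)
        n = 4;                             # lead byte of a 4-byte sequence
      else
        n = 0;                             # not a valid lead byte
      endif
      if (n == 0 || i + n - 1 > numel (bytes) ...
          || any (bitand (bytes(i+1:i+n-1), 192) != 128))
        valid(i) = false;                  # mark the offending byte
        i += 1;
      else
        i += n;
      endif
    endwhile
    if (all (valid))
      str = char (bytes);
    elseif (strcmp (policy, "error"))
      error ("check_utf8: invalid UTF-8 byte at index %d", find (! valid, 1));
    else
      ## replace each invalid byte with U+FFFD ([239 191 189] in UTF-8)
      out = cell (1, numel (bytes));
      for k = 1:numel (bytes)
        if (valid(k))
          out{k} = bytes(k);
        else
          out{k} = [239, 191, 189];
        endif
      endfor
      str = char ([out{:}]);
    endif
  endfunction

With something like that at the conversion points, check_utf8 (228) would
raise an error, while check_utf8 (228, "replace") would hand back "�" for
display purposes.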

Doing "smart" fallback is a convenience for users who are using Octave
interactively and looking at their data as it's processed, so they can
recognize garbage (if the data set is small enough). But for automated
processes or stuff with long processing pipelines, it could end up
silently passing incorrect data through, which isn't good. And I think
it would be nice if Octave would support those scenarios. Raising an
error at the point of the conversion failure makes sure that the
user/maintainer notices the problem, and makes it easy to locate (and
with a decent error message, hopefully easy to Google to figure out what
went wrong).

> Matlab doesn't have the same problem (for western users) because they
> don't use UTF-8 but UTF-16 (or a subset of it "UCS-2"). All characters
> encoded in ISO-8859-1 have the same numeric value in UTF-16 (and equally
> in UCS-2).
>
> Another "solution" would be to review our initial decision to use
> UTF-8. Instead, we could follow Matlab and use a "uint16_t" for our
> "char" class. But that would probably involve some major changes and a
> lot of conversions on interfaces to libraries we use.

I don't think that's why Matlab has it "easy" here. I think it's because
a) all their text I/O is encoding-aware, and b) on Windows, they use the
system default legacy code page as the default encoding, which gives you
ISO-8859-1 in the West. The fact that Matlab's internal encoding is
UCS-2 and that's an easy transformation from ISO-8859-1 is just an
internal implementation detail.

Matlab does have the opposite problem: if your input data is actually
UTF-8 (which I think is the more common case these days) or if you want
your code to be portable across OSes or regions, you need to explicitly
specify UTF-8 or some other known encoding whenever your code does an
fopen(). If you have UTF-8 data and do a plain fopen(), it'll silently
garble your data.
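
For example, the portable Matlab-side fix is to name the encoding in the
fopen() call itself (whether current Octave honors the fourth argument
the same way, I'd have to check); 'data.txt' is just a stand-in name:

  % fopen (FILENAME, PERMISSION, MACHINEFMT, ENCODING); 'n' = native format
  fid = fopen ('data.txt', 'r', 'n', 'UTF-8');
  line = fgetl (fid);    % decoded as UTF-8 instead of the legacy code page
  fclose (fid);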

If we changed Octave char to be 16-bit UTF-16 code points, we'd still
have the same problem of deciding what to use for a default encoding,
and what to do when the input didn't match that encoding.

Cheers,
Andrew


