octave-maintainers

Re: How should we treat invalid UTF-8?


From: Andrew Janke
Subject: Re: How should we treat invalid UTF-8?
Date: Wed, 6 Nov 2019 18:21:38 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.9.0


On 11/6/19 4:57 AM, "Markus Mützel" wrote:
> On 5 November 2019 at 05:46, "Andrew Janke" wrote:
>> On 11/4/19 6:29 PM, "Markus Mützel" wrote:
>> [...]
>>
>> Define double(char) and char(double) to work along row vectors instead
>> of on individual elements. (You have to define char(double) this way for
>> the conversion I suggested above to make sense in a UTF-8 world anyway.)
>> [...]
> 
> You are right. I was still thinking that we wanted to implement this as a 
> fallback mechanism.
> But if we always interpret double input to char() as "Unicode code points" 
> (resembling UTF-32), round trips would be safe.
> 
> Do we want char on double (and vice versa) to do more than a simple
> cast-like operation? If we can answer this question with "Yes", I think we
> could be close to a possible solution.
> 
> What about single and the integer classes as input to char()? It would 
> probably be reasonable to do the same for them.

That makes sense to me, with the proviso that it should probably throw
an error on "overflow", that is, when the char string you're converting
contains a code point that won't fit in the target type.
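
A rough sketch of that overflow check, written against plain numeric
code points rather than any real char() conversion. The variable names
and the error message are made up for illustration, and the code points
are typed in by hand so the example does not depend on how the source
file is encoded.

```octave
% Code points for "aäb€" (U+0061, U+00E4, U+0062, U+20AC), written out by hand.
cp = [97 228 98 8364];
target = "uint8";                    % hypothetical target class
limit = double (intmax (target));    % 255 for uint8

msg = "";
try
  too_big = cp(cp > limit);
  if ! isempty (too_big)
    % Refuse to convert instead of silently saturating at intmax.
    error ("code point U+%04X does not fit in %s", too_big(1), target);
  end
  out = cast (cp, target);
catch err
  msg = err.message;
end
disp (msg)
```

Without the check, cast() to an integer class saturates, so U+20AC would
quietly become 255.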

> We have the Octave-specific unicode_idx() function that might help in these 
> situations:
> str = "aäbc";
> str(unicode_idx (str)==2) % is the second character
> But I agree that it adds complexity to use that function instead of simply 
> indexing into the string.
> 
> We could also add more functions that could better support more use cases.

Yeah, that's what you'd need.

That "==" step sounds expensive. Would it be possible to have unicode_idx
also return a precomputed lookup table of the start and end
bytes/elements for each character index? That way, you could call
unicode_idx once on a string, and then have O(1) access to each of its
characters. Like this:

str = "aäbc"
[idx, idx2] = unicode_idx (str)
% idx = [1 2 2 3 4]
% idx2 = [1 1; 2 3; 4 4; 5 5]
nth_character = str(idx2(n,1):idx2(n,2))  % No == needed, so this is O(1)
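
For what it's worth, here is a sketch of how such a table could be
computed in plain Octave today, without the proposed second output
(which does not exist; this is only an illustration). It assumes the
input is valid UTF-8 and treats every byte that is not a continuation
byte (10xxxxxx) as the start of a character. The bytes are written out
explicitly so the example does not depend on source-file encoding.

```octave
% UTF-8 bytes of "aäbc": "ä" (U+00E4) is the two bytes 195 164.
bytes = uint8 ([97 195 164 98 99]);
str = char (bytes);

% A byte starts a character unless its top two bits are 10.
is_start = bitand (bytes, uint8 (192)) != 128;

idx = cumsum (double (is_start));            % character index of each byte
starts = find (is_start);                    % first byte of each character
ends = [starts(2:end) - 1, numel(bytes)];    % last byte of each character
idx2 = [starts(:), ends(:)];                 % one row per character

n = 2;
nth_character = str(idx2(n,1):idx2(n,2));    % the two bytes of "ä"
```

This reproduces the idx and idx2 values shown above for "aäbc".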

>> [...] With 16-bit chars, you can do this conveniently with direct
>> character indexing, and can vectorize the operation using a 2-D char
>> array. [...] 
> 
> This use case also doesn't work in Octave (but does in Matlab with its wider 
> chars). But it's probably bad coding style anyway:
> a = "a";
> a(end+1) = "ä";

There's also ==, <, and >.

"xx" == "ä"  % runs, but doesn't do what you'd probably expect
"foobär" == "ä"  % dimension mismatch error

>> if any('ä' == 'Ê'); disp('yep!'); end
yep!
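
The reason that comparison fires can be seen from the bytes: in UTF-8,
"ä" (U+00E4) is C3 A4 and "Ê" (U+00CA) is C3 8A, so the two 1x2 char
arrays share their first element. A small sketch, with the bytes typed
in by hand so it does not depend on the file encoding (variable names
are only for illustration):

```octave
a_uml  = char (uint8 ([195 164]));   % UTF-8 bytes of "ä" (U+00E4)
e_circ = char (uint8 ([195 138]));   % UTF-8 bytes of "Ê" (U+00CA)

% Element-wise comparison works on the bytes, not on the characters.
same = (a_uml == e_circ);            % the lead bytes match, the second bytes differ

% Comparing whole byte sequences gives the answer you'd expect.
equal_chars = isequal (a_uml, e_circ);
```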

And sort(), unique(), and ismember():

>> sort("späm")
ans = ��mps
>> unique("foobär")
ans = ��bfor
>>

(That's not even valid UTF-8 in the results.)

>> [tf, ix] = ismember('ä', 'foobär')
tf =
  1  1
ix =
   5   6
>> [tf, ix] = ismember('ä', 'foobÊr')
tf =
  1  0
ix =
   5   0

Would anyone actually ever want to do these things on strings with
non-ASCII characters? I honestly don't know the answer to that.

This brings up another point: it'll be useful to still have a way of
getting at the raw underlying bytes inside a string, without any
validation or transcoding, for debugging odd results or invalid strings.
typecast() seems appropriate for this, and it seems to already work.
Like in that sort() result: since the results are not valid UTF-8, I
want to just look at the raw bytes to see what's going on.

>> sort("späm")
ans = ��mps
>> typecast(ans, 'uint8')
ans =
  164  195  109  112  115

Side note: that's a surprising result. If sort() is working byte-wise on
the char elements, why are the high bytes sorted to the beginning of the
result? That might be a bug.
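
One possible explanation, offered as a guess rather than anything I've
confirmed in the sources: if the char elements are compared as signed
8-bit values, bytes above 127 come out negative and sort ahead of the
ASCII range. The sketch below only simulates that assumption with int8;
it does not call sort() on a char array.

```octave
bytes = uint8 ([115 112 195 164 109]);   % UTF-8 bytes of "späm"

% Reinterpret the same bits as signed bytes: 195 -> -61, 164 -> -92.
signed_bytes = typecast (bytes, "int8");

% Sort as signed values, then view the result as unsigned again.
sorted_bytes = typecast (sort (signed_bytes), "uint8");
disp (sorted_bytes)
```

Under that assumption the result is 164 195 109 112 115, which matches
the output above.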

Cheers,
Andrew


