Re: [Nano-devel] [patch] properly show invalid byte sequences in UTF-8
From: Benno Schulenberg
Subject: Re: [Nano-devel] [patch] properly show invalid byte sequences in UTF-8
Date: Sat, 18 Apr 2015 18:38:26 +0200
On Mon, Apr 13, 2015, at 21:49, Benno Schulenberg wrote:
> When doing for example:
>
> echo "0000000: 20c2 bb6f 6f6f 20c2 7878 78" | xxd -r >botched
>
> and then opening the file 'botched' in nano (in a UTF-8 locale),
> it will show:
>
> »ooo »xxx
>
> But the second guillemet isn't really there (if you search for it, the
> first one will be the only occurrence), it is just a ghost.
A better example, without the distracting o's and x's, would be:

  echo "0000000: c2bb 2020 c220 " | xxd -r >botched

The second c2 is followed by an invalid byte: 20. A continuation
byte must be in the range 80 - bf. One might expect just a space (20)
to be displayed, but what happens is that nano picks up the bb of
the preceding multibyte sequence and so displays another guillemet
(UTF-8 bytes 0xc2 0xbb, code point U+00BB, "»") and then goes on
to display the space (20) as well.
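
To make the rule above concrete, here is a minimal Python sketch (not
nano's code; the helper names are made up for illustration) that finds
two-byte lead bytes that are not followed by a continuation byte in the
range 80 - bf. It only looks at two-byte sequences, which is all the
example needs.

```python
def is_continuation(byte):
    """True if byte is a valid UTF-8 continuation byte (0x80 - 0xbf)."""
    return 0x80 <= byte <= 0xBF


def find_invalid_leads(data):
    """Return offsets of two-byte lead bytes (0xc2 - 0xdf) that are
    not followed by a continuation byte.  Hypothetical helper."""
    bad = []
    i = 0
    while i < len(data):
        if 0xC2 <= data[i] <= 0xDF:
            if i + 1 < len(data) and is_continuation(data[i + 1]):
                i += 2          # well-formed two-byte sequence, skip it
                continue
            bad.append(i)       # lead byte without a valid successor
        i += 1
    return bad


# Same bytes as the 'botched' file from the xxd command.
botched = bytes.fromhex("c2bb2020c220")
print(find_invalid_leads(botched))
# A strict decoder marks the stray c2 instead of reusing the earlier bb.
print(botched.decode("utf-8", errors="replace"))
```

For these six bytes the only offending offset is 4 (the second c2), and
Python's own decoder replaces that byte with U+FFFD rather than showing
a second guillemet.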
If instead of the second c2 one puts a c3, UTF-8 code 0xc3 0xbb will
get displayed: code point U+00FB, a small u with a circumflex, "û".
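
The two code points mentioned above follow from the standard two-byte
UTF-8 layout (110xxxxx 10xxxxxx). A small Python sketch of that
arithmetic, with a function name invented for this mail:

```python
def two_byte_code_point(lead, cont):
    """Combine a two-byte UTF-8 sequence into its code point:
    low 5 bits of the lead byte, then low 6 bits of the continuation."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# The stale bb combined with each of the two lead bytes discussed here.
for lead in (0xC2, 0xC3):
    cp = two_byte_code_point(lead, 0xBB)
    print(f"{lead:#04x} 0xbb -> U+{cp:04X} {chr(cp)}")
```

This gives U+00BB for c2 bb and U+00FB for c3 bb, matching what gets
displayed when the leftover bb is wrongly paired with the new lead byte.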
Benno