Re: [Nano-devel] [patch] properly show invalid byte sequences in UTF-8


From:	Benno Schulenberg
Subject:	Re: [Nano-devel] [patch] properly show invalid byte sequences in UTF-8
Date:	Sat, 18 Apr 2015 18:38:26 +0200

On Mon, Apr 13, 2015, at 21:49, Benno Schulenberg wrote:
> When doing for example:
> 
>     echo "0000000: 20c2 bb6f 6f6f 20c2 7878 78" | xxd -r >botched
> 
> and then opening the file 'botched' in nano (in a UTF-8 locale),
> it will show:
> 
>  »ooo »xxx
> 
> But the second guillemet isn't really there (if you search for it, the
> first one will be the only occurrence); it is just a ghost.

A better example, without the distracting o's and x's, would be:

  echo "0000000: c2bb 2020 c220 " | xxd -r >botched

The second c2 is followed by an invalid byte: 20.  The byte that follows
a c2 lead byte should be in the range 80 - bf.  One might expect a space
(20) to be displayed, but what happens is that nano picks up the bb of
the preceding multibyte sequence and so displays another guillemet
(UTF-8 code 0xc2 0xbb, code point U+00BB, "»") and then goes on
to display the space (20) as well.
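
To make the range check concrete, here is a minimal Python sketch.  It is
not nano's code, and the helper name is_continuation is made up for this
illustration.  It only shows how a strict UTF-8 decoder treats the six
bytes of the example: the file as a whole is rejected, and with
replacement enabled the stray c2 becomes U+FFFD while the space after it
is kept.

```python
def is_continuation(byte: int) -> bool:
    """Return True if byte has the form 10xxxxxx, i.e. lies in 0x80..0xBF."""
    return 0x80 <= byte <= 0xBF


# The six bytes written by the xxd command above.
data = bytes.fromhex("c2bb 2020 c220")

# Strict decoding rejects the data: the second c2 has no valid successor.
try:
    data.decode("utf-8")
    strictly_valid = True
except UnicodeDecodeError:
    strictly_valid = False

# With errors="replace", only the stray c2 is replaced by U+FFFD;
# the 20 that follows it is still decoded as an ordinary space.
shown = data.decode("utf-8", errors="replace")
```

The point of the sketch is that no second guillemet appears anywhere in
the decoded result: the bb of the first sequence is consumed once and is
never reused.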

If instead of the second c2 one puts a c3, UTF-8 code 0xc3 0xbb will
get displayed: code point U+00FB, a small u with a circumflex, "û".
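
The two code points follow directly from the two-byte UTF-8 layout
(110xxxxx 10xxxxxx).  As a hedged sketch of that arithmetic, again not
taken from nano, with decode_two_byte being a hypothetical helper:

```python
def decode_two_byte(lead: int, cont: int) -> str:
    """Decode one two-byte UTF-8 sequence, refusing invalid bytes."""
    # Valid lead bytes for two-byte sequences are c2..df
    # (c0 and c1 would only produce overlong encodings).
    if not 0xC2 <= lead <= 0xDF:
        raise ValueError("not a two-byte lead byte: %02x" % lead)
    # The successor byte must be in the range 80..bf.
    if not 0x80 <= cont <= 0xBF:
        raise ValueError("invalid successor byte: %02x" % cont)
    # Five payload bits from the lead byte, six from the successor.
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


guillemet = decode_two_byte(0xC2, 0xBB)     # (0x02 << 6) | 0x3b = 0xbb
u_circumflex = decode_two_byte(0xC3, 0xBB)  # (0x03 << 6) | 0x3b = 0xfb
```

Calling decode_two_byte(0xC2, 0x20) raises ValueError in this sketch,
which is the case where the ghost character shows up.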

Benno
