emacs-devel

Re: Fwd: Re: Inadequate documentation of silly characters on screen.


From: David Kastrup
Subject: Re: Fwd: Re: Inadequate documentation of silly characters on screen.
Date: Thu, 19 Nov 2009 17:55:10 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)

Alan Mackenzie <address@hidden> writes:

> On Thu, Nov 19, 2009 at 10:30:18AM -0500, Stefan Monnier wrote:
>> > The actual character in the string is ñ (#x3f).
>
>> No: the string does not contain any characters, only bytes, because
>> it's a unibyte string.
>
> I'm thinking from the lisp viewpoint.  The string is a data structure
> which contains characters.  I really don't want to have to think about
> the difference between "chars" and "bytes" when I'm hacking lisp.  If
> I do, then the abstraction "string" is broken.
>
>> So it contains the byte 241, not the character ñ.
>
> That is then a bug.  I wrote "(aset nl 0 ?ñ)", not "(aset nl 0 241)".

Huh?  ?ñ is the Emacs code point of ñ, an integer (241), which in
Emacs 23 is pretty much identical to the Unicode code point.
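To make that concrete, here is a minimal sketch of what the reader does with ?ñ.  It assumes Emacs 23 or later and a source file read in a coding system that decodes ñ correctly; the results in the comments are what I would expect under those assumptions.

```elisp
;; ?ñ is read syntax for an integer, not a distinct character object.
(integerp ?ñ)      ; t
(= ?ñ 241)         ; t, i.e. #xf1, the Unicode code point of ñ
(characterp 241)   ; t, any integer in the valid character range qualifies
```

So "(aset nl 0 ?ñ)" and "(aset nl 0 241)" are the same form as far as Lisp is concerned.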

>> The byte 241 can be inserted in multibyte strings and buffers because
>> it is also a char of code 4194289 (which gets displayed as \361).
>
> Hang on a mo'!  How can the byte 241 "be" a char of code 4194289?
> This is some strange usage of the word "be" that I wasn't previously
> aware of.  ;-)

Emacs encodes most of its text internally in utf-8.  A Unicode code point is an
integer.  You can encode it in different encodings, resulting in
different byte streams.  Inside of a byte stream encoded in utf-8, the
isolated byte 241 does not correspond to a Unicode character.  It is not
valid utf-8.  When Emacs reads a file supposedly in utf-8, it wants to
represent _all_ possible byte streams in order to be able to save
unchanged data unmolested.
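As a rough illustration of the difference between a code point and its encoded bytes, the following sketch encodes the character ñ as utf-8 and lists the resulting bytes.  It uses only `string` and `encode-coding-string`, and the commented result assumes Emacs 23 or later.

```elisp
;; The character with code point 241 (ñ) occupies two bytes in utf-8.
;; Neither of them is 241: that byte on its own would be the start of
;; a four-byte sequence, so it is invalid when it stands alone.
(append (encode-coding-string (string 241) 'utf-8) nil)   ; (195 177)
```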

So it encodes the entity "illegal isolated byte 241 in a utf-8
document" with the character code 4194289, which has a representation in
Emacs' internal variant of utf-8 but is outside the range of
Unicode.
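A small sketch of that mapping, assuming Emacs 23 or later.  The offset #x3FFF00 is inferred from the numbers above (4194289 - 241) rather than quoted from the documentation, so treat it as an assumption.

```elisp
;; The raw byte 241 maps to a character code beyond the Unicode range.
(unibyte-char-to-multibyte 241)        ; 4194289, i.e. #x3FFFF1
(multibyte-char-to-unibyte 4194289)    ; 241, the mapping is reversible
(- 4194289 #x3FFF00)                   ; 241
(> 4194289 #x10FFFF)                   ; t, above the last Unicode code point

;; Decoding an isolated byte 241 as utf-8 should give that same character,
;; so writing the text back out can reproduce the original byte.
(aref (decode-coding-string (unibyte-string 241) 'utf-8) 0)   ; 4194289
```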

> At this point, would you please just agree with me that when I do
>
>    (setq nl "\n")
>    (aset nl 0 ?ñ)
>    (insert nl)
>
> , what should appear on the screen should be "ñ", NOT "\361"?  Thanks!

You assume that ?ñ is a character.  But in Emacs, it is an integer, a
Unicode code point in Emacs 23.  As long as there is something like a
unibyte string, there is no way to distinguish the character 241 from the
byte 241 unless Emacs is told explicitly, because Emacs has no separate
"character" data type.
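One way of "telling Emacs explicitly" is to choose the kind of string up front.  The sketch below is only an illustration, assuming Emacs 23 or later: `unibyte-string` builds a string of bytes, `string` builds a string of characters, and `string-to-multibyte` converts the former while preserving the raw byte.

```elisp
;; The same integer 241, stored as a byte and as a character.
(multibyte-string-p (unibyte-string 241))             ; nil, one byte
(multibyte-string-p (string 241))                     ; t, one character (ñ)
(aref (unibyte-string 241) 0)                         ; 241
(aref (string 241) 0)                                 ; 241

;; Only after conversion to multibyte do the two become distinguishable.
(aref (string-to-multibyte (unibyte-string 241)) 0)   ; 4194289, shown as \361
```

Inserting the result of (string 241) shows ñ; inserting the unibyte string into a multibyte buffer shows \361.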

-- 
David Kastrup
