bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255
From: Eli Zaretskii
Subject: bug#5700: [bug-gnu-emacs] emacs-23 and 8-bit characters in 128..255
Date: Thu, 07 Jul 2016 19:21:47 +0300

> From: npostavs@users.sourceforge.net
> Date: Wed, 06 Jul 2016 19:52:16 -0400
> Cc: "Nelson H. F. Beebe" <beebe@math.utah.edu>, 5700@debbugs.gnu.org
>
> With Emacs 24/25, using "\u00FF" works:
>
> (string-equal (buffer-substring (point) (1+ (point))) "\u00FF")
> (looking-at "\u00FF")
>
> Seems to be another instance of the unibyte vs multibyte string escape syntax
> thing:
>
> You can also use hexadecimal escape sequences (‘\xN’) and octal
> escape sequences (‘\N’) in string constants. *But beware:* If a
> string constant contains hexadecimal or octal escape sequences, and
> these escape sequences all specify unibyte characters (i.e., less
> than 256), and there are no other literal non-ASCII characters or
> Unicode-style escape sequences in the string, then Emacs
> automatically assumes that it is a unibyte string. That is to say,
> it assumes that all non-ASCII characters occurring in the string are
> 8-bit raw bytes.
>
> Stefan Monnier <monnier@IRO.UMontreal.CA> writes:
> > which seems acceptable, whereas under Emacs-23 we have:
> >
> [...]
> > (multibyte-string-p "\377") prints as "\377"
>
> In 23.4 it returns nil

Yes.

The other significant piece of the puzzle is described in this text
from the ELisp manual:

    For technical reasons, a unibyte and a multibyte string are ‘equal’
    if and only if they contain the same sequence of character codes
    and all these codes are either in the range 0 through 127 (ASCII)
    or 160 through 255 (‘eight-bit-graphic’).  However, when a unibyte
    string is converted to a multibyte string, all characters with
    codes in the range 160 through 255 are converted to characters with
    higher codes, whereas ASCII characters remain unchanged.  Thus, a
    unibyte string and its conversion to multibyte are only ‘equal’ if
    the string is all ASCII.  Character codes 160 through 255 are not
    entirely proper in multibyte text, even though they can occur.  As
    a consequence, the situation where a unibyte and a multibyte string
    are ‘equal’ without both being all ASCII is a technical oddity that
    very few Emacs Lisp programmers ever get confronted with.  *Note
    Text Representations::.

This was one of the significant changes in Emacs 23, and I think it is
the main factor for the changed behavior reported by Nelson.
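
To make the two pieces above concrete, here is a minimal sketch that can
be evaluated in *scratch* or with M-:.  It is an illustration only, not
part of the original report.  It assumes Emacs 23 or later, and the
choice of latin-1 as the coding system for the raw byte is an assumption
made for the example (the byte could just as well belong to some other
8-bit encoding).

```elisp
;; Octal and hex escapes below 256, with nothing else non-ASCII in the
;; string, are read as a unibyte string holding a raw 8-bit byte.
(multibyte-string-p "\377")                ; => nil
(multibyte-string-p "\xFF")                ; => nil

;; A Unicode-style escape makes the string multibyte; it holds the
;; character U+00FF.
(multibyte-string-p "\u00FF")              ; => t
(aref "\u00FF" 0)                          ; => 255

;; Converting the unibyte string to multibyte does not produce U+00FF:
;; the raw byte is mapped to a much higher character code.
(aref (string-to-multibyte "\377") 0)      ; => 4194303, i.e. #x3FFFFF

;; If the byte is meant to be a Latin-1 character, decode it explicitly
;; (latin-1 is assumed here) and the comparison then succeeds.
(string-equal (decode-coding-string "\377" 'latin-1) "\u00FF")  ; => t
```

So code that compares buffer text against "\377" is comparing against a
raw byte, whereas "\u00FF" names the character that a decoded buffer
normally contains, which is consistent with the "\u00FF" forms quoted
above working in Emacs 24/25.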