help-gnu-emacs
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Solved] RE: Differences between identical strings in Emacs lisp


From: Eli Zaretskii
Subject: Re: [Solved] RE: Differences between identical strings in Emacs lisp
Date: Tue, 07 Apr 2015 17:22:22 +0300

> From: Jürgen Hartmann <address@hidden>
> Date: Tue, 7 Apr 2015 15:55:48 +0200
> 
> Thank you Pascal Bourguignon for your explanation:
> 
> > ...
> > 
> >     (mapcar 'multibyte-string-p (list "\xBA" (concat '(#xBA))))
> >     --> (nil t)
> > 
> > string-equal (and therefore string=) don't ignore the multibyte property
> > of a string.
> 
> So it's all about the multibyte property?

It's about the tricky relationships between unibyte and multibyte
strings.

May I ask why you need to mess with unibyte strings?  (Your original
message doesn't seem to present a real problem, just something that
puzzled you.)

> > Now, it's hard to say how to "solve" this problem, basically, you asked
> > for it: "\xBA" is not a valid way to write a string containing masculine
> > ordinal.
> 
> In seams that one can use "\u00BA" to achieve this in a string constant; it
> evaluates to a multibyte string containing the integer 186:
> 
>    "\u00BA"
>    --> "º"

Why can't you simply use the º character? why do you need to use its
codepoint?

>    (multibyte-string-p "\u00BA")
>    --> t
> 
>    (append "\u00BA" ())
>    --> (186)
> 
> I found it very surprising, that it is not only the escape sequences
> (characters) in the string constant that determine its multibyte property,
> but it is also the other way round: The sequence \x yields
> different results depending on the multibyte property of the string constant
> it is used in. For example the constant "\x3FFFBA" is an unibyte string
> containing the integer 186:
> 
>    "\x3FFFBA"
>    --> "\272"

"Contains" is incorrect here.  That constant _represents_ a raw byte
whose value is 186.  Emacs goes out of its way under the hood to show
you 186 when the buffer or string contains 0x3FFFBA.

> 
>    (multibyte-string-p "\x3FFFBA")
>    --> nil
> 
>    (append "\x3FFFBA" ())
>    --> (186)
> 
> The constant "\x3FFFBA Ä" on the other hand is a mulibyte string in which the
> sequence \x3FFFBA yields the integer 4194234:
> 
>    "\x3FFFBA Ä"
>    --> "\272 Ä"
> 
>    (multibyte-string-p "\x3FFFBA Ä")
>    --> t
> 
>    (append "\x3FFFBA Ä" ())
>    --> (4194234 32 196)
> 
> This seems to be an undocumented feature.

It's barely documented in the node "Text Representations" in the ELisp
manual.

This is a tricky issue, so you are well advised to stay away of
unibyte strings as much as you can, for your sanity's sake.

> In this respect it is interesting to compare another pair of strings: "A" and
> (substring "AÄ" 0 1). Both of them contain the same integer, namely 65, and 
> are
> printed as "A"--they only differ in their multibyte property: The former is
> an unibyte string, the latter multibyte:

Don't try to learn about unibyte/multibyte strings using ASCII
characters as examples, because ASCII is treated specially for obvious
reasons.

> > (On the other hand, one might argue that having both unibyte and
> > multibyte strings in a lisp implementation is not a good idea, and
> > there's the opportunity for a big refactoring and simplification).

Hear, hear!

> To illustrate this, consider the strings "A" and (substring "AÄ" 0 1) from
> above. They have the same integer content, only differ in their multibyte
> property and compare equal.

Yes, and therefore you don't need to consider the multibyte property.

> If we just change their integer values--in both strings alike--from 65 to
> 186, we get the pair "\xBA" and (concat '(#xBA)), that we also discussed
> before. Also here the only difference lies in the multibyte property, while
> the integer values are the same. But this time the strings compare different.

As they should: you are comparing a character with a raw byte.

> One might say that this is not surprising, because this time the integers are
> interpreted as different characters. But this would be in contradiction to
> the definition of the term character according to which a character actually
> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").

It is an integer, but note that no one told you anywhere that a raw
byte is a character.  It's a raw byte.

> Does we come to the limit of the definition of what a character is?
> 
> But this gets pretty philosophical. For the practical purpose you helped me
> a lot and I think that I got some better feeling for this topic.

I'd still suggest that you try as much as you can not to use unibyte
strings in your Lisp applications.  That way lies madness.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]