help-gnu-emacs

Re: [Solved] RE: Differences between identical strings in Emacs lisp


From: Eli Zaretskii
Subject: Re: [Solved] RE: Differences between identical strings in Emacs lisp
Date: Tue, 07 Apr 2015 20:28:59 +0300

> From: Jürgen Hartmann <address@hidden>
> Date: Tue, 7 Apr 2015 19:02:38 +0200
> 
> Thank you for your comments and your caring advises, Eli Zaretskii:
> 
> > May I ask why you need to mess with unibyte strings?  (Your original
> > message doesn't seem to present a real problem, just something that
> > puzzled you.)
> 
> That's right: I was trying to learn something about the basic Lisp data types
> and their constants and, as a side effect, trying to understand some of these
> "cryptic" read and write sequences that one sees in Emacs from time to time.

A worthy goal.

> First I thought that some hidden decoding based on some charsets or coding
> systems occurs.

Actually, some sort of "decoding" does occur, albeit perhaps not in
the use cases you tried -- Emacs will sometimes silently convert
unibyte characters to their locale-dependent multibyte equivalents.

This whole area of unibyte strings is replete with dwim-ish hacks and
kludges, all in an attempt to do what the user expects.  Thus the
confusion and the advice to stay away from that gray area.

> >> ... For example the constant "\x3FFFBA" is an unibyte string
> >> containing the integer 186:
> >>
> >>    "\x3FFFBA"
> >>    --> "\272"
> >
> > "Contains" is incorrect here.  That constant _represents_ a raw byte
> > whose value is 186.  Emacs goes out of its way under the hood to show
> > you 186 when the buffer or string contains 0x3FFFBA.
> 
> What is the correct parlance here: Is it correct to say that the constant
> "\x3FFFBA\x3FFFBB\x3FFFBC" is not a string because it does not contain (?)
> any characters; rather it is just a sequence of raw bytes?

It's a "unibyte string", which, by definition, contains raw bytes.

But it is actually better to say that the raw bytes there are \272,
\273, and \274, not \x3FFFBA, \x3FFFBB, and \x3FFFBC.  The latter are
just the representation Emacs uses for the former; Emacs goes out of
its way not to show that internal representation to the user.
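
A minimal sketch of that relationship, for illustration only.  The
function name my/raw-byte-demo is hypothetical; unibyte-string,
string-to-multibyte and multibyte-char-to-unibyte are standard
functions, and the values in the comments are what I would expect on a
recent Emacs.

```elisp
(defun my/raw-byte-demo ()
  "Return a list showing raw byte 186 in unibyte and multibyte form.
Hypothetical helper, for illustration only."
  (let* ((uni (unibyte-string 186))          ; the unibyte string "\272"
         (multi (string-to-multibyte uni)))  ; same byte, multibyte form
    (list (multibyte-string-p uni)           ; nil: the string is unibyte
          (aref uni 0)                       ; 186, the byte itself
          (multibyte-string-p multi)         ; t: now a multibyte string
          (aref multi 0)                     ; #x3FFFBA, internal code
          ;; Converting back recovers the original byte value.
          (multibyte-char-to-unibyte (aref multi 0)))))
```

Evaluating (my/raw-byte-demo) should show 186 on the unibyte side and
#x3FFFBA (decimal 4194234) only after the explicit conversion to
multibyte.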

> >> ... definition of the term character according to which a character
> >> actually
> >> _is_ that integer (cf. lisp manual, section "2.3.3 Character Type").
> >
> > It is an integer, but note that no one told you anywhere that a raw
> > byte is a character.  It's a raw byte.
> 
> Ah, that seems to be the key: raw bytes are not characters.

Exactly.

> (Up to now I thought that raw bytes are a special set of characters
> that have different representations in unibyte and multibyte
> contexts.)

They _are_ a special "character set", but only in the very technical
sense of "character set" in Emacs.  By their nature and their
properties in Emacs, they are not characters.
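
To illustrate the "technical sense" mentioned above, here is a small
sketch using the standard functions characterp and char-charset.  The
results in the comments are what I would expect; this is not a
statement about every Emacs version.

```elisp
;; #x3FFFBA is the internal code Emacs uses for raw byte 186 (\272).
(list (characterp #x3FFFBA)     ; non-nil: a valid character code
      (char-charset #x3FFFBA)   ; the `eight-bit' charset
      (char-charset ?A))        ; the `ascii' charset
```

So the code passes the characterp test and belongs to a charset, yet
that charset, eight-bit, exists only to hold raw bytes.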

> In spite of my previous promise not to try to learn something about the
> unibyte/multibyte topic from ASCII, I shily dare to ask another question in
> this context (don't beat me): Does the A in the unibyte string "A" represent
> a character or a raw byte? Or both? In the latter case, is this that special
> treatment of ASCII you talked about before?

Raw bytes are only those whose value is above 127, so A is a
character.

For subtle technical reasons (or maybe by some historical accident), a
pure-ASCII string is a unibyte string, although it contains
characters, not raw bytes.  So having a unibyte string does not yet
mean you have raw bytes in it.
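
The distinction can be sketched as a predicate.  The name
my/unibyte-raw-bytes-p is hypothetical, and the test it applies (a
byte value above 127 in a unibyte string) is simply the rule stated
above, not an official Emacs API.

```elisp
(defun my/unibyte-raw-bytes-p (string)
  "Return t if STRING is unibyte and holds a byte above 127.
Hypothetical helper, for illustration only."
  (and (not (multibyte-string-p string))
       (let ((found nil))
         ;; Scan every element; bytes above 127 are raw bytes.
         (dotimes (i (length string))
           (when (> (aref string i) 127)
             (setq found t)))
         found)))

(multibyte-string-p "A")                          ; nil: pure ASCII is unibyte
(my/unibyte-raw-bytes-p "A")                      ; nil: no raw bytes in it
(my/unibyte-raw-bytes-p (unibyte-string 65 186))  ; t: 186 is a raw byte
```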

> > I'd still suggest that you try as much as you can not to use unibyte
> > strings in your Lisp applications.  That way lies madness.
> 
> I will try to follow that advice--and I hope that it is not too late...

Practically the only valid use case for manipulating unibyte
strings of raw bytes is encoding or decoding text by calling
encode-coding-region and its ilk.  E.g., an application that
needs to send base64-encoded text needs first to encode it using
whatever coding-system is appropriate, which produces unibyte text
containing raw bytes, and then call base64-encode-region to produce
the final result.  And similarly for decoding such stuff.  You will
see examples of this in Gnus and Rmail, for example.
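
A rough sketch of that two-step process, using the string
counterparts of the region functions (encode-coding-string and
base64-encode-string) to keep it short.  The names my/text-to-base64
and my/base64-to-text are hypothetical, and this is not how Gnus or
Rmail actually implement it.

```elisp
(defun my/text-to-base64 (text coding-system)
  "Encode TEXT with CODING-SYSTEM, then base64-encode the raw bytes.
Hypothetical helper, for illustration only."
  ;; Step 1 yields a unibyte string of raw bytes; step 2 makes it ASCII.
  (base64-encode-string (encode-coding-string text coding-system) t))

(defun my/base64-to-text (base64 coding-system)
  "Reverse `my/text-to-base64': base64-decode, then decode the bytes."
  (decode-coding-string (base64-decode-string base64) coding-system))

;; U+00E9 is written as (string #xE9) to keep this file pure ASCII.
(my/text-to-base64 (string #xE9) 'utf-8)   ; "w6k="
(my/base64-to-text "w6k=" 'utf-8)          ; the one-character string U+00E9
```

The unibyte string exists only between the two steps; the rest of the
application sees ordinary multibyte text on one side and pure ASCII on
the other.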

> So, thank you very much for your enlightening answers.

You are welcome.



