emacs-devel

Re: Unibyte characters, strings, and buffers


From: Stephen J. Turnbull
Subject: Re: Unibyte characters, strings, and buffers
Date: Fri, 28 Mar 2014 19:28:56 +0900

Eli Zaretskii writes:

 > Let's not talk about Emacs 20 vintage problems, 

If they were *only* Emacs 20 vintage, this thread wouldn't exist.

 > Likewise examples from XEmacs, since the differences in this area
 > between Emacs and XEmacs are substantial, and that precludes useful
 > comparison.

"It works fine" isn't useful information?  XEmacs has *two* reasons to
want to change its internal representation.  (1) A Unicode
representation, especially UTF-8, would allow all autosave files to be
readable by other programs.  (2) A PEP 393-like representation would
be way faster for big buffers and strings.  Bytes-character confusion
is just plain not an issue, not for anybody, not at all.

 > First, we must have a way to have buffer "text" that represents a
 > stream of bytes, not some human-readable text.  (Just as a random
 > example, a buffer visiting an mbox file, from which you decode
 > portions into another buffer for display.)  Agreed?

No, I disagree.  XEmacs/MULE has never had such a feature, yet we can
run all Emacs programs without changing the buffer representation
(modulo inability to represent all Unicode characters properly, but
the JIT charsets are plenty good enough in practice).

 > In such unibyte buffers, we need a way to represent raw bytes, which
 > are parts of as yet un-decoded byte sequences that represent encoded
 > characters.

Again, I disagree.  Unibyte is a design mistake, and unnecessary.
XEmacs proves it -- we use (essentially) the same code in many
applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
The variations for XEmacs and Emacs are due to extents vs. overlays
and such-like, not due to buffer representation.

For heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as
no-ops forever, and as far as I can tell nobody's ever needed to worry
about it (of course, maybe the folks who use those are just more clued
than the poor user in my next paragraph).

I agree that having a way to represent "undecodable bytes" in a string
or buffer is extremely convenient.  XEmacs's lack of this capability
is surely a deficiency (Hi, David K!).  But this is a completely
different issue from unibyte buffers.  Emacs doesn't need unibyte
buffers to perform its work, and if they are desirable on the grounds
of space or time efficiency, they should be opaque to Lisp.

 > We cannot represent each such byte as a Latin-1 character, because
 > Latin-1 characters are stored inside Emacs as 2-byte sequences of
 > their UTF-8 encoding.  If you interpret bytes as Latin-1
 > characters, functions like string-bytes will return wrong results
 > for those raw bytes.  Agreed?

No, I still disagree.

`(defun string-bytes (&rest junk) (error "string-bytes is unsupported"))',
and live happily ever
after.  You don't need `string-bytes' unless you've exposed internal
representation to Lisp, then you desperately need it to write correct
code (which some users won't be able to do anyway without help, cf. 
https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
*don't expose internal representation* (and the hammer marks on users'
foreheads will disappear in due time, and the headaches even faster!)

 > So here you have already at least 2 valid reasons

No, *you* have them.  XEmacs works perfectly well without them, using
code written for Emacs.

 > If we want to get rid of unibyte, Someone(TM) should present a
 > complete practical solution to those two problems (and a few
 > others), otherwise, this whole discussion leads nowhere.

Complete practical solution: "They are non-problems, forget about
them, and rewrite any code that implies you need to remember them."

Fortunately for me, I am *intimately* familiar with XEmacs internals,
and therefore RMS won't let me write this code for Emacs. :-)

 > > If you stick to the interpretation that bytes contain non-negative
 > > integers less than 256, you won't have a problem in practice if you
 > > think them as the first 256 Unicode characters, but choose not to use
 > > functions that make sense only with characters.
 > 
 > What do you mean by "choose"?  Lisp code is used by many programmers
 > out there; sometimes, they aren't even aware if the buffer they work
 > on is unibyte, or what that means.

Which is precisely why we're having this thread.  If there were *no*
Lisp-visible unibyte buffers or strings, it couldn't possibly matter.

 > Even when they are aware, they just want Emacs to DTRT, for their
 > own value of "RT".

Too bad for them, as long as Emacs has unibyte buffers.  They have to
be aware, and write code correctly for the mode of the buffer.
Viz. the poor serial port programmer in comp.emacs.

In XEmacs, they don't have to; they just use an appropriate
network-coding-system, and it just works.  That may not be *obvious*
to a programmer coming from a different background (say, Python) who
expects there to be both byte streams and text streams, but since
there's no other way to do it, it's not hard to get it right.

 > And what does "choose not to use" mean, anyway?  How do you choose not
 > to use 'insert', for example? what do you use instead?

Of course you use `insert'.  What I'm saying is that if you don't want
to trash a binary buffer where each byte is represented by an
ISO-8859-1 character in internal representation, you need to avoid
(1) coding-system-for-write other than 'binary (in XEmacs, aliased to
'iso-8859-1-unix), and (2) functions that mutate characters using
properties of characters that bytes don't have (e.g., upcase).  That's
really all there is to it.
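Concretely, a minimal sketch of that discipline (the file name is
hypothetical; in GNU Emacs the coding system corresponding to XEmacs's
'binary alias is `iso-latin-1'):

```elisp
(with-temp-buffer
  ;; Insert bytes 0-255 as the first 256 Unicode (Latin-1) characters.
  (dotimes (i 256)
    (insert i))
  ;; Rule (1): pin the output codec so each character goes out as
  ;; exactly one byte, instead of whatever the default would choose.
  (let ((coding-system-for-write 'iso-latin-1))
    (write-region (point-min) (point-max) "/tmp/raw-bytes.bin"))
  ;; Rule (2): don't call things like
  ;;   (upcase-region (point-min) (point-max))
  ;; here -- upcasing uses character properties that bytes don't have.
  )
```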

 > The issue at hand is how do you pull the trick, in practice, of
 > doing TRT with the legitimate use cases where Emacs needs to
 > manipulate raw bytes.

Follow the Nike advice: Just Do It.  Works fine, I assure you.  I can
understand that you're worried by this:

 > As long as Emacs exposes the character values to Lisp programs as
 > simple integers, I don't think we can take this path.

... but I'm not really sure why not.  I'll grant that after drinking
the Ben Wing Kool-Aid the idea of Emacsen without a character type
gives me hives, but that's because arbitrary integers, if decomposed
into byte-sized fields and inserted into a buffer, can become
non-characters and crash XEmacs.  But surely you have a function like
`char-int-p'[1] that is used (implicitly by `insert') to prevent
non-characters (in Emacs, 0xFFFF and surrogates would be examples, I
suppose) from being inserted in buffers.  Otherwise you'd have crashes
all over the place, I would imagine.  Since you don't, you must be
doing something to prevent arbitrary integers from getting inserted.
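For the record, a hypothetical sketch of such a guard (XEmacs Lisp;
GNU Emacs has no `char-int-p', but `characterp' plays the analogous
role, and `my-safe-insert' is a made-up name):

```elisp
(defun my-safe-insert (n)
  "Insert integer N as a character, rejecting non-characters."
  (if (char-int-p n)                  ; GNU Emacs: (characterp n)
      (insert n)
    (error "Not a valid character: %S" n)))
```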

It seems to me that the only real issue, given that you have a way in
Emacs to represent undecodable bytes (XEmacs doesn't, but Emacs does)
is what to do if somebody reads in data as 'binary, then proceeds to
insert non-Latin-1 characters in the buffer.  I can think of three
possibilities: (1) don't allow it without changing the buffer's output
codec, (2) treat the existing characters as Latin-1, or (3) convert
all the existing "bytes" to undecodable bytes representation.

XEmacs implicitly does (2) ((3) can't be implemented at all, at
present).  I tend to prefer (1), but ISTR that would not have worked
very well with some programs, specifically Rmail and VM (whose
author had a lot of influence on how XEmacs internals were designed),
because they narrowed the buffer and converted wire format (including
raw multibyte encodings) to displayed text in-place.

Footnotes: 
[1]  `char-int-p' is a built-in function (char-int-p OBJECT)
Documentation:
Return t if OBJECT is an integer that can be converted into a character.
See `char-int'.




