Re: Unibyte characters, strings, and buffers


From: Eli Zaretskii
Subject: Re: Unibyte characters, strings, and buffers
Date: Sat, 29 Mar 2014 13:44:57 +0300

> From: "Stephen J. Turnbull" <address@hidden>
> Cc: address@hidden,
>     address@hidden
> Date: Sat, 29 Mar 2014 18:23:17 +0900
> 
> Eli Zaretskii writes:
> 
>  > This thread is about different issues.
> 
> *sigh*  No, it's about unibyte being a premature pessimization.

*Sigh*, indeed.

>  > >  > Likewise examples from XEmacs, since the differences in this area
>  > >  > between Emacs and XEmacs are substantial, and that precludes useful
>  > >  > comparison.
>  > > 
>  > > "It works fine" isn't useful information?
>  > 
>  > No, because it describes a very different implementation.
> 
> Not at all.  The implementation of multibyte buffers is very similar.

Says you.  But I cannot talk intelligently about that, because I don't
know the details.  And it sounds like you cannot talk about the issue
at hand, because you don't know the details of how Emacs handles raw
bytes.  This discussion is about Emacs's unibyte buffers and strings,
so it won't yield any useful insights with you talking about the
XEmacs implementation without knowing the Emacs one, and me the other
way around.  That is why I asked not to bring the XEmacs
implementation into this discussion.

> What's different is that Emacs complifusticates matters by also having
> a separate implementation of unibyte buffers, and then basically
> making a union out of the two structures called "buffer".  XEmacs
> simply implements binary as a particular coding system in and out of
> multibyte buffers.

In Emacs, a coding system is only consulted when a buffer is read or
written.  If you also consult it when inserting text into a buffer, or
when deciding whether 'downcase' should or shouldn't change a
character from the buffer, then you still have unibyte buffers in
disguise; you just call them "buffers whose coding system is
'binary'".
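
To make that boundary concrete, here is a minimal Emacs Lisp sketch
(the string and values are only illustrative): inside Lisp, encoded
text is nothing but a unibyte string of bytes.

    ;; A sketch: coding systems apply at the I/O boundary; inside
    ;; Lisp, the encoded form is just a unibyte string.
    (let* ((text "Привет")                              ; multibyte: 6 characters
           (bytes (encode-coding-string text 'utf-8)))  ; unibyte: 12 bytes
      (list (multibyte-string-p text)    ; => t
            (multibyte-string-p bytes)   ; => nil
            (length text)                ; => 6
            (length bytes)))             ; => 12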

>  > Then I guess you will have to suggest how to implement this without
>  > unibyte buffers.
> 
> No, I don't.  I already told you how to do it: nuke unibyte buffers
> and use iso-8859-1-unix as the binary codec.

"Codec" is XEmacs terminology, I don't understand what that means in
practice, when applied to Emacs.  If it means the same as coding
system, then how can iso-8859-1-unix byte-stream be decoded into, say,
Cyrillic characters (assuming the byte-stream was actually UTF-8
encoded Cyrillic text)?
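
To make the question concrete, here is a sketch of the round trip I
am asking about (the text is illustrative):

    ;; Bytes "decoded" as Latin-1 must first be re-encoded back to
    ;; raw bytes before they can be decoded as UTF-8 Cyrillic text.
    (let* ((bytes (encode-coding-string "Ы" 'utf-8))           ; 2 raw bytes
           (as-latin-1 (decode-coding-string bytes 'latin-1))) ; 2 Latin-1 chars
      (decode-coding-string
       (encode-coding-string as-latin-1 'latin-1)  ; back to the raw bytes
       'utf-8))                                    ; => "Ы"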

> Then you're done, except for those applications that actually make
> the mistake of using unibyte text explicitly.

What does "explicitly" mean in this context?  Can you show an example
of "explicit" vs "implicit" use of unibyte text?

>  > >  > In such unibyte buffers, we need a way to represent raw bytes, which
>  > >  > are parts of as yet un-decoded byte sequences that represent encoded
>  > >  > characters.
>  > > 
>  > > Again, I disagree.  Unibyte is a design mistake, and unnecessary.
>  > 
>  > Then what do you call a buffer whose "text" is encoded?
> 
> "Binary."

That's just a different name.  If "binary" buffers are treated
differently from any other kind when processing characters from them,
then they are just unibyte buffers in disguise.

>  > > XEmacs proves it -- we use (essentially) the same code in many
>  > > applications (VM, Gnus for two mbox-using examples) as GNU Emacs does.
>  > 
>  > I asked you not to bring XEmacs into the discussion, because I cannot
>  > talk intelligently about its implementation.  If you insist on doing
>  > that, this discussion is futile from my POV.
> 
> The whole point here is that the exact XEmacs implementation is
> *irrelevant*.  The point is that we implement the same API as GNU
> Emacs without unibyte buffers or the annoyances and incoherence that
> come with them.

Without knowing the details of the implementation, it is impossible to
talk about the merits and demerits of each design.  Therefore,
bringing the XEmacs implementation into this discussion without
describing it in detail does not help.  Excuse me, but I won't take it
on faith that you have no problems at all in this area just because
you say so.  If you want that claim to count, you will have to delve
into the gory details, and then show why and how the problems are
avoided.

>  > > For heaven's sake, we've had `buffer-as-{multi,uni}-byte' defined as
>  > > no-ops forever
>  > 
>  > I wasn't talking about those functions.  I was talking about the need
>  > to have unibyte buffers and strings.
> 
> There is no "need for unibyte."  You're simply afraid to throw it away.

I'm not afraid of anything of the kind.  This discussion was started
in order to figure out how to get rid of unibyte.  If you want to
help, offer specific technical solutions to the specific issues we
have in Emacs.  Copying the XEmacs implementation, even if we were
sure it resolves the problem (and I'm not at all sure), is
impractical.

>  > How is it different?  What would be the encoding of a buffer that
>  > contains raw bytes?
> 
> Depends.  If it's uninterpreted bytes, "binary."  If those are
> undecodable bytes, they'll be the representation of raw bytes that
> occurred in an otherwise sane encoded stream, and the buffer's
> encoding will be the nominal encoding of that stream.  If you want to
> ensure sanity of output, then you will use an output encoding that
> errors on rawbytes, and a program that cleans up those rawbytes in a
> way appropriate for the application.  If you expect the next program
> in the pipeline to handle them, then you use a variant encoding that
> just encodes them back to the original undecodable rawbytes.

That's exactly what Emacs does, so I think you actually agree with
what I originally described as requirements, even though you said you
disagreed.
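
For reference, that raw-byte handling is visible directly from Lisp
(a sketch; the large character code is how current Emacs represents a
raw byte internally):

    ;; An undecodable byte in an otherwise sane UTF-8 stream survives
    ;; decoding as a "raw byte" character and round-trips on encoding.
    (let ((s (decode-coding-string (unibyte-string #xFF) 'utf-8)))
      (list (multibyte-string-p s)             ; => t
            (aref s 0)                         ; => 4194303, a raw-byte char
            (encode-coding-string s 'utf-8)))  ; => "\377", the original byte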

>  > But that's ridiculous: a raw byte is just a single byte, so
>  > string-bytes should return a meaningful value for a string of such
>  > bytes.
> 
> `string-bytes' should not exist.  As I wrote earlier:
> 
>  > > You don't need `string-bytes' unless you've exposed internal
>  > > representation to Lisp, then you desperately need it to write correct
>  > > code (which some users won't be able to do anyway without help, cf. 
>  > > https://groups.google.com/forum/#!topic/comp.emacs/IRKeteTzfbk).  So
>  > > *don't expose internal representation* (and the hammer marks on users'
>  > > foreheads will disappear in due time, and the headaches even faster!)
>  > 
>  > How else would you know how many bytes a string will take on disk?
> 
> How does `string-bytes' help?

It returns that information.

> You don't know what encoding will be used to write them

Yes, I do know: the buffer's coding system tells me.  And if text is
already encoded, then I know no additional encoding will be applied,
and whatever string-bytes tells me is it.
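
A minimal sketch of that (the values are illustrative):

    ;; string-bytes reports the size of the internal representation;
    ;; for an already-encoded unibyte string, that is the on-disk size.
    (let ((s "naïve"))                                        ; 5 characters
      (list (string-bytes (encode-coding-string s 'utf-8))    ; => 6
            (string-bytes (encode-coding-string s 'latin-1))  ; => 5
            (length s)))                                      ; => 5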

> If you use iso-8859-1-unix as the coding system, then "bytes on the
> wire" == "characters in the string".  No problema, seƱor.

Not if you want to recode the string in, say, UTF-8.  When you shuffle
text from one buffer to another, Emacs does not track which encoding
that text came from, so the iso-8859-1-unix information is lost.
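
Concretely (a sketch):

    ;; Two raw bytes read "as Latin-1" become two characters, which
    ;; UTF-8 then encodes as four bytes -- the byte count is lost.
    (let ((as-chars (decode-coding-string (unibyte-string #xC3 #xA9) 'latin-1)))
      (string-bytes (encode-coding-string as-chars 'utf-8)))  ; => 4, not 2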

>  > >  > So here you have already at least 2 valid reasons
>  > > 
>  > > No, *you* have them.  XEmacs works perfectly well without them, using
>  > > code written for Emacs.
>  > 
>  > XEmacs also works "perfectly well" without bidi and other stuff.  That
>  > doesn't help at all in this discussion.
> 
> You're right: because XEmacs doesn't handle bidi, it's irrelevant to
> this discussion.  Why did *you* bring it up?

To show how your way of arguing doesn't help.

> What is relevant is how to represent byte streams in Emacs.  The
> obvious non-unibyte way is a one-to-one mapping of bytes to Unicode
> characters.  It is *extremely* convenient if the first 128 of those
> bytes correspond to the ASCII coded character set, because so many
> wire protocols use ASCII "words" syntactically.  The other 128 don't
> matter much, so why not just use the extremely convenient Latin-1 set
> for them?

Because there are situations where the effect of this is not what Lisp
programs and users expect.  Case folding and case-insensitive search
are one example, though not the only one.
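
Here is one concrete instance (a sketch):

    ;; With raw bytes mapped to Latin-1, case operations "fix" binary
    ;; data: byte #xC0 is taken as the character À and downcased.
    (downcase ?\xC0)   ; => 224 (#xE0), i.e. à -- wrong for binary data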

>  > >  > If we want to get rid of unibyte, Someone(TM) should present a
>  > >  > complete practical solution to those two problems (and a few
>  > >  > others), otherwise, this whole discussion leads nowhere.
>  > > 
>  > > Complete practical solution: "They are non-problems, forget about
>  > > them, and rewrite any code that implies you need to remember them."
>  > 
>  > That's a slogan, not a solution.
> 
> No, it is a precise high-level design for a solution.

We need a low-level design, not a high-level one.

>  > > Fortunately for me, I am *intimately* familiar with XEmacs internals,
>  > > and therefore RMS won't let me write this code for Emacs. :-)
>  > 
>  > Then perhaps you shouldn't be part of this discussion.
> 
> Since I've been invited to leave, I will.  My point is sufficiently
> well-made for open minds to deal with the details.

No, it isn't made at all.  I tried to explain above why I think so.

>  > > Which is precisely why we're having this thread.  If there were *no*
>  > Lisp-visible unibyte buffers or strings, it couldn't possibly matter.
>  > 
>  > And if I had $5M in my bank account, I'd probably be elsewhere
>  > enjoying myself.  IOW, how are "if there were no..." arguments useful?
> 
> Because they point out that this thread wouldn't have happened with a
> different design.

But we _are_ stuck with this design, and have been using it for the
last 15 years.  Good luck believing that someone will come and replace
the existing design with something radically different.  There hasn't
been a comparable revolution in Emacs since 2001, so I very much doubt
it is wise to expect another one any time soon.  We don't even have
people aboard capable of making such changes.

The only practical way of advancing in this area is by low-level
changes that don't throw away the high-level design.  That is why
precisely describing the details of every proposal is so important:
without them, any proposal becomes impractical and thus not
interesting.

>  > This is not a discussion about whose model is better, Emacs or XEmacs.
>  > This is a discussion of whether and how we can remove unibyte
>  > buffers, strings, and characters from Emacs.  You must start by
>  > understanding how they are used in Emacs 24, and then suggest
>  > practical ways to
>  > change that.
> 
> Well, I would have said "tell me about it"

And I would have replied "sorry, I have no time for that".  The
sources are there to be studied, and you are welcome to ask questions
about whatever you cannot understand just by reading them.

There cannot be any useful discussion of these matters without
thorough understanding of how Emacs stores characters and raw bytes in
its buffers, and where and how the unibyte nuisance comes into play.

> I will say that nothing you've said so far even hints at issues with
> simply removing the whole concept of unibyte.

I started by describing some basic requirements that lead to unibyte.
You refuse to even acknowledge those requirements.  How can we
continue a useful discussion when we don't even agree about the
basics?  To convince me, you first need to look at the issue from my
point of view, something you refuse to do.  I cannot begin to explain
"the issues" to you if you don't even agree with my starting point.

>  > In Emacs, 'insert' does some pretty subtle stuff with unibyte buffers
>  > and characters.  If you use it, you get what it does.
> 
> And I'm telling you those subtleties are a *problem*, and they solve
> nothing that an Emacs without a unibyte concept can't handle fine.

You keep saying that, but without the details (which you cannot or
won't provide), these are just slogans with little technical value.
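
For reference, here is one of those subtleties in a minimal sketch:

    ;; The same integer inserts as a character or as a raw byte,
    ;; depending on the buffer's multibyteness.
    (list
     (with-temp-buffer                   ; multibyte by default
       (insert 192)
       (string-bytes (buffer-string)))   ; => 2: the character À
     (with-temp-buffer
       (set-buffer-multibyte nil)        ; unibyte buffer
       (insert 192)
       (string-bytes (buffer-string))))  ; => 1: the raw byte #xC0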

>  > If the buffer is not marked specially, how will I know to avoid
>  > [inserting non-Latin-1 characters in a "binary" buffer]?
> 
> All experience with XEmacs says *you* (the human programmer) *won't*
> have any problem avoiding that.  As a programmer, if you're working
> with a binary protocol, you will be using binary buffers and strings,
> and byte-sized integers.  If you accidentally mix things up, you'll
> quickly get an encoding error on output (since the binary codec can't
> output non-Latin-1 Unicode characters).

On this level, it sounds like XEmacs does things exactly like Emacs
does; it just calls them by different names.  If so, you have the same
problems; e.g., what will 'downcase-word' do in a "binary" buffer
when it sees a "character" whose value is 192?

> It's just not a problem in practice, and that's not why unibyte was
> introduced in Emacs anyway.  Unibyte was introduced because some folks
> thought working with variable-width-encoded buffers was too
> inefficient so they wanted access to a flat buffer of bytes.  That's
> why buffer-as-{uni,multi}byte type punning was included.

Maybe so, but we are now 15 years after that, so history is only
marginally important.  What _is_ important is how to get rid of the
issues we have, without a complete redesign.

>  > And I still don't see how this is relevant.  You are describing a
>  > marginally valid use case, while I'm talking about use cases we meet
>  > every day, and which must be supported, e.g. when some Lisp wants to
>  > decode or encode text by hand.
> 
> You use `encode-coding-region' and `decode-coding-region', same as you
> do now.  Do you seriously think that XEmacs doesn't support those use
> cases?

"Support" doesn't mean "there're no issues".  Emacs supports them as
well, you know.  That fact in itself doesn't help at all in this
discussion, because we all know (I hope) that at this "slogan level"
things work very well for quite some time.
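
For completeness, the "by hand" case looks like this in Emacs terms
(a sketch; the bytes are illustrative):

    ;; Decoding "by hand": bytes arrive as a unibyte string (e.g. from
    ;; a process whose coding system is 'binary), and Lisp decodes them.
    (let ((wire-bytes (unibyte-string #xD0 #x9C)))  ; UTF-8 for "М"
      (decode-coding-string wire-bytes 'utf-8))     ; => "М"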



