Re: eight-bit char handling in emacs-unicode

emacs-devel

From:	Stefan Monnier
Subject:	Re: eight-bit char handling in emacs-unicode
Date:	18 Nov 2003 22:05:39 -0500
User-agent:	Gnus/5.09 (Gnus v5.9.0) Emacs/21.3.50

> I see.  Apart from the design itself, I agree that it's difficult to
> introduce a new type.  But when I discussed the Character type
> object with Richard a few years ago, he was not that negative, provided
> that it gives a sure improvement.

Sounds about right to me: we have one free tag that we could use for chars
(and that I currently use to boost the max buffer size from 256MB to 512MB
in my local code).
But it needs to pay for itself.

> Then, we can't use make-string-unibyte for the current case
> because, in emacs-unicode, (concat '(?a 192)) returns a
> multibyte string whose second element is A-grave, not an
> eight-bit-char.  Am I missing something?

Well, obviously we need to make it accept this case (i.e. accept both the
latin-1 192 and the eight-bit-char 192).  I'm sure there'll be other issues.
I haven't had much time to think about it and you're obviously better
placed to foresee potential problems.
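
To make "accept both" concrete, here is a minimal sketch in Python rather than elisp. It is only an analogy, not Emacs code: Python's surrogateescape convention (raw byte 0xNN kept as U+DCNN) stands in for eight-bit-chars, and `lenient_to_bytes` is a hypothetical name.

```python
# Analogy only: Python's "surrogateescape" convention stands in for
# Emacs's eight-bit-chars. None of this is Emacs's actual representation.
RAW_BYTE_BASE = 0xDC00  # raw byte 0xNN is kept in a str as U+DCNN


def lenient_to_bytes(text: str) -> bytes:
    """Convert text to bytes, accepting both spellings of a byte value.

    Code points 0..255 (ASCII and Latin-1, e.g. U+00C0) map to themselves.
    Escaped raw bytes U+DC80..U+DCFF map back to 0x80..0xFF.
    Anything else is an error.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFF:
            out.append(cp)                  # Latin-1 character 192, say
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - RAW_BYTE_BASE)  # raw (eight-bit) byte 192, say
        else:
            raise ValueError(f"cannot represent U+{cp:04X} as a single byte")
    return bytes(out)


latin1_text = "a\u00c0"                                   # 'a' + A-grave
raw_text = b"a\xc0".decode("ascii", "surrogateescape")    # 'a' + raw byte 192
same = lenient_to_bytes(latin1_text) == lenient_to_bytes(raw_text)
```

Both inputs come out as the same two bytes, which is exactly the ambiguity being accepted here.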

>> To do what your string-make-unibyte does you should use
>> `encode-coding-string' where the coding system is passed explicitly.

> Those are conceptually different things (I remember the
> similar discussion we had a while ago).

> encode-coding-string does:
> char-sequence --CCS-set--> (CCS/codepoint-pair)-sequence
>     --CES--> encoded-byte-sequence

> string-make-unibyte does:
> char-sequence --CCS--> code-point-sequence
>     --concat--> code-point-sequence

> These two yield the same result only when the CCS supports all
> chars in "char-sequence" and the CES is stateless
> (e.g. iso-latin-1) and .

You lost me here (I'm a poor soul who doesn't know much outside of the
latin-1 world).
I thought that string-make-unibyte only behaves meaningfully for
"normal 8bit coding-systems" such as latin-1.
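
A rough way to see the difference between the two pipelines quoted above, sketched in Python as an analogy (the function names are hypothetical, and Python's codecs stand in for Emacs coding systems):

```python
def encode_with_coding_system(text: str, coding_system: str) -> bytes:
    """Explicit encoding: characters -> bytes via a named coding system."""
    return text.encode(coding_system)


def code_points_as_bytes(text: str) -> bytes:
    """Use each code point directly as a byte value.

    bytes() raises ValueError for any code point above 255.
    """
    return bytes(ord(ch) for ch in text)


# Stateless 8-bit case: both routes agree.
latin1_text = "a\u00c0"
agree = (encode_with_coding_system(latin1_text, "latin-1")
         == code_points_as_bytes(latin1_text))

# Stateful case: ISO-2022-JP emits escape sequences to switch character
# sets, so the encoded bytes are not a per-character code point copy.
kana = "\u3042"  # HIRAGANA LETTER A
stateful = encode_with_coding_system(kana, "iso2022_jp")
```

For `kana`, `code_points_as_bytes` has no answer at all (it raises `ValueError`), while the explicit encoding produces a byte sequence that begins with an escape sequence.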

>> I've changed my Emacs so that string-make-unibyte does the above
>> (i.e. signals an error if it encounters a non-byte char) and it works fairly
>> well, except for the few places where the elisp code is sloppy and needs to
>> be fixed.

> How did you change it?  string-make-unibyte internally uses
> the function copy_text.  Did you change it?  But, then, each
> time you copy a multibyte string into a unibyte buffer, you
> should get an error.

Of course: it's an error.  A unibyte buffer cannot represent multibyte
chars, so you need to encode them first (into a unibyte string).
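
The rule can be modelled with a toy byte buffer. This is a Python sketch with a hypothetical `UnibyteBuffer` class, not how Emacs buffers or copy_text work:

```python
class UnibyteBuffer:
    """Toy model of a unibyte buffer: it stores bytes and nothing else."""

    def __init__(self) -> None:
        self._data = bytearray()

    def insert(self, chunk: bytes) -> None:
        # Text (str) must be encoded by the caller first; inserting it
        # directly is treated as an error rather than silently converted.
        if not isinstance(chunk, (bytes, bytearray)):
            raise TypeError("encode text before inserting it into a unibyte buffer")
        self._data += chunk

    def contents(self) -> bytes:
        return bytes(self._data)


buf = UnibyteBuffer()
buf.insert("a\u00c0".encode("latin-1"))  # encode first, then insert
```

The caller chooses the coding system explicitly, so the buffer never has to guess how a character should become bytes.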

Now to tell you the truth, my change had to accept a few (not so) special
cases and it took a bit of fiddling to make the code lenient enough to
accept elisp code I didn't feel like "fixing".  I can't remember the details
off-hand, but I remember having problems with regexp matching functions
where multibyte regexps are used in unibyte buffers.


-- Stefan
