Re: Emacs Lisp's future

emacs-devel

From:	Mark H Weaver
Subject:	Re: Emacs Lisp's future
Date:	Mon, 06 Oct 2014 12:27:35 -0400
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

Eli Zaretskii <address@hidden> writes:

>> From: Mark H Weaver <address@hidden>
>> Cc: address@hidden, address@hidden, address@hidden,
>> address@hidden, address@hidden, address@hidden, address@hidden
>> Date: Mon, 06 Oct 2014 02:21:41 -0400
>> 
>> A related problem has to do with the fact that naively implemented UTF-8
>> allows code points to be represented with more bytes than are actually
>> needed, essentially by padding the code point with leading zeroes and
>> then encoding with UTF-8 as if the high bits were non-zero.  For
>> example, the ASCII quote (") can be represented as the single byte 0x22,
>> the two-byte sequence 0xC0 0xA2, etc.
>> 
>> UTF-8 decoders are supposed to detect and reject these "overlong"
>> encodings, but it is likely that many programs fail to do this.  Such
>> programs are usually vulnerable to these overlong encodings when trying
>> to detect special characters (e.g. for quoting/escaping) or when
>> validating inputs.
>> 
>> To cope with this, the Unicode standards require that UTF-8 codecs
>> reject overlong encodings and other invalid byte sequences.  This is in
>> direct conflict with the idea of "raw byte" code points, whose purpose
>> is to be tolerant of arbitrary byte sequences and to propagate them
>> unchanged.
>
> The obvious solution is to encode the raw bytes internally in a UTF-8
> compatible way.  Which is what Emacs does in its buffers and strings,
> as I'm sure you know.  Can't Guile do something similar?

I'm afraid you've misunderstood, or perhaps I've failed to explain it
clearly.

It doesn't matter how these raw bytes are encoded internally.  No matter
what mechanism we use to accomplish it, propagating invalid byte
sequences by default is bad security policy.  It has the effect of
exposing all internal subsystems to malformed UTF-8 such as overlong
encodings unless users take explicit steps to check for them and remove
them.  This is a recipe for security holes.
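
To make the risk concrete, here is a minimal Python sketch.  Both
functions are hypothetical and written only for illustration: a
byte-level check for the ASCII quote, and a naive decoder that handles
one- and two-byte sequences without checking for overlong forms.  It is
not how Emacs or Guile decode anything.

```python
def contains_quote_byte(data: bytes) -> bool:
    """Hypothetical input filter: looks only for the raw byte 0x22."""
    return b'"' in data


def lax_utf8_decode(data: bytes) -> str:
    """Hypothetical naive decoder for 1- and 2-byte sequences.

    It never checks that a 2-byte sequence encodes a code point >= 0x80,
    so it silently accepts overlong encodings.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # 2-byte form: 110xxxxx 10xxxxxx, with no minimum-value check.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte not handled by this sketch")
    return "".join(out)


# 0xC0 0xA2 is the overlong two-byte form of the ASCII quote (0x22).
payload = b"name=\xc0\xa2; DROP"

filter_saw_quote = contains_quote_byte(payload)
decoded = lax_utf8_decode(payload)
```

The filter sees no quote byte in the payload, yet the decoded string
contains one, so any quoting or escaping done at the byte level has
been bypassed.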

The Unicode standard requires that all UTF-8 codecs refuse to accept,
produce, or propagate invalid byte sequences, including the troublesome
overlong encodings.  I'm not one for blindly following standards, but in
my opinion this is the default policy we should adopt.
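
As an illustration of the strict policy, Python's built-in UTF-8 codec
rejects overlong encodings by default.  The helper name below is made up
for this sketch; only bytes.decode and UnicodeDecodeError are standard
Python.

```python
def is_strictly_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8.

    Relies on Python's default "strict" error handler, which raises
    UnicodeDecodeError on overlong and otherwise invalid sequences.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


plain_quote = b"\x22"             # the one valid encoding of the quote
overlong_2 = b"\xc0\xa2"          # overlong two-byte form
overlong_3 = b"\xe0\x80\xa2"      # overlong three-byte form

results = [
    is_strictly_valid_utf8(plain_quote),
    is_strictly_valid_utf8(overlong_2),
    is_strictly_valid_utf8(overlong_3),
]
```

Only the single-byte form is accepted; both overlong forms are refused
before any other subsystem sees them.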

Editing files is an unusual case.  Of course, we want users to be able
to edit a file with coding errors, and to leave any part of the file
untouched by the user exactly as it was.  Anything else would be a
mistake.

However, I would argue that even in Emacs, string<->bytevector
conversions should be strict by default, so that their other uses
(e.g. communication over sockets and pipes, and encoding of command-line
arguments to subprocesses) reject invalid input unless leniency is
explicitly requested.  Even if you disagree, I'd like the strict mode to
remain the default in Guile.
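
One possible shape for such an interface, sketched in Python rather than
Elisp or Scheme: strict unless the caller opts in, with the lenient mode
preserving invalid bytes so that a file can be written back unchanged.
The function names and the `lenient` parameter are assumptions made for
this example; the "surrogateescape" error handler is standard Python and
serves here only as an analogue of raw-byte code points.

```python
def bytes_to_string(data: bytes, *, lenient: bool = False) -> str:
    """Decode UTF-8; strict by default, lenient only on request."""
    errors = "surrogateescape" if lenient else "strict"
    return data.decode("utf-8", errors=errors)


def string_to_bytes(text: str, *, lenient: bool = False) -> bytes:
    """Encode UTF-8; strict by default, lenient only on request."""
    errors = "surrogateescape" if lenient else "strict"
    return text.encode("utf-8", errors=errors)


raw = b"caf\xc0\xa2"  # contains an overlong encoding of the quote

# Editor-style use: opt in, and the invalid bytes round-trip unchanged.
text = bytes_to_string(raw, lenient=True)
round_tripped = string_to_bytes(text, lenient=True)

# Default use (sockets, pipes, argv): invalid input is refused.
try:
    bytes_to_string(raw)
    strict_decode_rejected = False
except UnicodeDecodeError:
    strict_decode_rejected = True

# A string carrying escaped raw bytes cannot leak out via strict encoding.
try:
    string_to_bytes(text)
    strict_encode_rejected = False
except UnicodeEncodeError:
    strict_encode_rejected = True
```

With this arrangement the file-editing case still works, but only code
that asks for leniency is ever exposed to malformed sequences.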

      Mark
