bug-gnu-emacs

## emacs 21.1: > 3-byte UTF-8 characters get mangled on save

From: Jay Berkenbilt
Subject: emacs 21.1: > 3-byte UTF-8 characters get mangled on save
Date: Thu, 29 Nov 2001 14:22:40 -0500

From the description of the mule-utf-8 coding system:

> Unicode characters out of the ranges U+0000-U+33FF and U+E200-U+FFFF
> are decoded into sequences of eight-bit-control and eight-bit-graphic
> characters to preserve their byte sequences.

This doesn't seem to work for at least some Unicode characters whose
values require more than 3 bytes to represent (i.e., 10000 and greater),
though I don't know whether that in itself has anything to do with the
actual bug.

I am assuming that the intended behavior is that any UTF-8-encoded
file should be able to be loaded into emacs and edited and saved
safely.  Characters in the supported ranges should be shown as per the
ISO10646 font, while characters outside supported ranges are to be
encoded as described above with the intention that their byte values
be preserved.

Here's how to reproduce the bug:

rm /tmp/a
LANG=C emacs -q --no-site-file

M-x set-default-font -misc-fixed-medium-r-normal--15-*-iso10646-*

M-x find-file-literally /tmp/a

Enter the UTF-8 value for lower-case pi (U+03C0: in UTF-8: 11001111
10000000, which is \317\200): C-q 317 RET C-q 200 RET

Save and reload (with find-alternate-file rather than revert-buffer)
with the utf-8 coding system:

C-x RET c utf-8 RET C-x C-v RET

You should see a pi.  Set input method to TeX:

C-x RET C-\ TeX

At the end of the line, type \alpha.  A lower-case alpha should be
added to the buffer.  Save and find the file literally. (M-x
find-file-literally M-n RET y).

You should see

\317\200\316\261
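
The expected bytes can be cross-checked outside Emacs.  The following is a minimal sketch in Python; the helper name `octal_escapes` is made up for illustration and is not part of Emacs.  It prints the UTF-8 bytes of pi followed by alpha in the same backslash-octal notation used in this report:

```python
def octal_escapes(data: bytes) -> str:
    # Render each byte as a backslash followed by three octal digits.
    return "".join("\\%03o" % b for b in data)

# U+03C0 (pi) followed by U+03B1 (alpha), encoded as UTF-8.
encoded = "\u03c0\u03b1".encode("utf-8")
print(octal_escapes(encoded))
```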

So far, so good.  Now for the problem characters.  The UTF-8 encoding
for the Unicode (hex) value 10000 is 11110000 10010000 10000000
10000000, or \360\220\200\200.  This is the smallest Unicode value
requiring four bytes to represent in UTF-8.  Enter that on the next
line (with C-q 360 RET ...)
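
As a sanity check on the four-byte boundary, here is a short Python sketch (assuming Python's built-in utf-8 codec as the reference encoder) comparing U+FFFF, the last value that fits in three bytes, with U+10000:

```python
# U+FFFF is the largest code point that fits in three UTF-8 bytes;
# U+10000 is the first one that needs four.
three = chr(0xFFFF).encode("utf-8")
four = chr(0x10000).encode("utf-8")

print(len(three), len(four))
# Show the four-byte form in backslash-octal notation.
print("".join("\\%03o" % b for b in four))
```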

Save and C-x C-v in utf-8 mode as above.  Modify the buffer (space,
backspace) and save.  Emacs happily saves the buffer without giving
any messages about the coding system, which is fine.  Now, M-x
find-file-literally again.  Although the top line is preserved, the
second line now reads:

\360\220\200\302\200

Switch back to utf-8.  You still see this byte pattern.  Note that
\302\200 is the proper encoding for U+80 but \360 (11110000) starts a
four-byte UTF-8 character.
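
The observation above can be checked mechanically.  This is a small Python sketch, again assuming Python's strict utf-8 codec as the reference; it only inspects the saved bytes and says nothing about where inside Emacs the corruption happens:

```python
# The second line as it appears after the save, in octal as shown above.
mangled = bytes([0o360, 0o220, 0o200, 0o302, 0o200])

# A strict decoder rejects the sequence: \360 announces a four-byte
# character, but the fourth byte (\302) is not a continuation byte.
try:
    mangled.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False

# The last two bytes, taken on their own, are a well-formed character.
tail = mangled[3:].decode("utf-8")
print(valid, "U+%04X" % ord(tail))
```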

Here's another example.  In literal mode, replace the second line with
the UTF-8 encoding for U+7FFFFFFF, a nice boundary condition: 11111101
10111111 10111111 10111111 10111111 10111111, or
\375\277\277\277\277\277.  Now go back to utf-8, modify the buffer,
save it, and go back to literal.  You see that this line has changed
to \375\277\277\277\337\277.

\337\277 is the correct UTF-8 encoding for U+07FF (3777 in octal), but \375 indicates
that it is the first byte of a 6-byte UTF-8 character, and the six
bytes above do not constitute a valid UTF-8 character.
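
Python's utf-8 codec stops at U+10FFFF, so it cannot produce the six-byte form used in this example.  The sketch below is a hand-written encoder for the original one-to-six-byte UTF-8 layout (up to 31 bits); the function name `utf8_encode_legacy` is hypothetical and this is not how Emacs does it.  It reproduces both the original line and the two-byte tail, whose code point is 0x7FF (octal 3777):

```python
def utf8_encode_legacy(cp: int) -> bytes:
    # Encode cp using the original 1-to-6-byte UTF-8 layout (31-bit range).
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point out of 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # (length, exclusive upper limit, lead-byte marker) for each form.
    forms = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, marker in forms:
        if cp < limit:
            break
    out = []
    # Emit continuation bytes from the low end, six bits at a time.
    for _ in range(nbytes - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(marker | cp)
    return bytes(reversed(out))


def octal_escapes(data: bytes) -> str:
    return "".join("\\%03o" % b for b in data)


print(octal_escapes(utf8_encode_legacy(0x7FFFFFFF)))
print(octal_escapes(utf8_encode_legacy(0x7FF)))
```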

I haven't debugged this, but I have looked a little at coding.c.  One
thing that looks pretty suspicious is that the code in
detect_coding_utf_8 seems incorrect if multibytep is 1.  The
interpretation of multibytep being 1 causes special treatment for some
byte values that seems consistent with emacs's own internal multibyte
character treatment.  Then again, I don't know what sequence of bytes
it's looking at; I'm assuming it's looking at literal bytes in the
file but maybe not.  Anyway, that's not enough in itself to explain
the bug since the examples I've given don't actually contain any
instances of LEADING_CODE_8_BIT_CONTROL.  Otherwise, the code looks
correct as long as src_end - src is large enough to capture the entire
UTF-8 character.  The problem is probably being introduced when
the buffer is being saved rather than being loaded anyway.  I'll stop
rambling now, but if you need me to look into this further, let me
know.

--