Re: utf-8.el
From: Kenichi Handa
Subject: Re: utf-8.el
Date: Wed, 19 Jan 2005 11:51:14 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)
In article <address@hidden>, Stefan Monnier <address@hidden> writes:
> Does anyone see a problem with the simple patch below?
See the comment below.
> Also, could anyone confirm that the docstring of mule-utf-8 is correct in
> saying that invalid utf-8 sequences are not always correctly preserved?
> Why is that? Can't we fix it?
I remember that I fixed ccl-mule-utf-8-encode-untrans to preserve
invalid utf-8 sequences as far as possible. So perhaps the
current version preserves even invalid sequences correctly.
I have just run the following code for a fairly long time and saw no error.
(defun temp ()
  (let ((count 0))
    (while t
      (setq count (1+ count))
      (message "%d" count)
      (let* ((len (+ 6 (random 6)))
             (str (make-string len 0)))
        (dotimes (i len)
          (aset str i (+ 128 (random 128))))
        (or (equal str
                   (encode-coding-string
                    (decode-coding-string str 'utf-8) 'utf-8))
            (error "%s caused error" (setq error-string str)))))))
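For anyone who wants to repeat the check without an endless loop, here is a minimal sketch of a bounded variant of the loop above. The name my-utf-8-roundtrip-failure and the ITERATIONS argument are only for illustration and are not part of utf-8.el. It assumes, like the original, that 'utf-8 is the coding system under test.

```elisp
;; Hypothetical helper, not part of utf-8.el.
;; Same random round-trip check as the loop above, but it stops after
;; ITERATIONS tries and returns the offending string instead of
;; signaling an error.
(defun my-utf-8-roundtrip-failure (iterations)
  "Return the first random string that fails a utf-8 round trip, or nil."
  (catch 'failed
    (dotimes (_ iterations)
      (let* ((len (+ 6 (random 6)))
             (str (make-string len 0)))
        ;; Fill the string with random bytes in the range 128..255.
        (dotimes (i len)
          (aset str i (+ 128 (random 128))))
        (unless (equal str
                       (encode-coding-string
                        (decode-coding-string str 'utf-8) 'utf-8))
          (throw 'failed str))))
    nil))
```

Evaluating (my-utf-8-roundtrip-failure 100000) should then give nil if every random string survived, or the first string that did not.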
> Also could anyone explain to me why `utf-8-compose' needs to lookup the
> hashtable (get 'utf-subst-table-for-decode 'translation-hash-table), since
> it looks to me like ccl-decode-mule-utf-8 already takes care of decoding
> chars that are in this table.
The subst-tables are not preloaded. They are loaded automatically
in utf-8-post-read-conversion, but that runs after
ccl-decode-mule-utf-8 has been executed. And the hash-table arg
becomes non-nil only once the subst-tables are loaded.
> I also don't understand the following part of
> the code:
> (if (= l 2)
> (put-text-property (point) (min (point-max) (+ l (point)))
> 'display (format "\\%03o" ch))
> (compose-region (point) (+ l (point)) ?�))
> what does it mean for l (the number of bytes) to be equal to 2?
The docstring of ccl-untranslated-to-ucs is not clear. In
"Set r1 to the byte length", the byte length means how many
of r0, r1, r2, r3 (each of which contains a byte) contribute
to a Unicode character (or an invalid byte).
If l is 2, that means an invalid byte was converted to a
two-char sequence of eight-bit-graphic (#xC2 or #xC3) and
eight-bit-control/graphic. In that case, it is better to
display that sequence in octal instead of showing ?�.
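As a small illustration of the octal display mentioned above, the quoted code builds its display string with (format "\\%03o" ch). The sketch below wraps that same format call in a helper; the name my-octal-escape is hypothetical, and it assumes ch holds the value of the invalid byte.

```elisp
;; Hypothetical helper, not part of utf-8.el.
;; Build the text used for the `display' property of an invalid byte:
;; a backslash followed by the byte value as three octal digits.
(defun my-octal-escape (byte)
  "Return BYTE formatted as a backslash and three octal digits."
  (format "\\%03o" byte))

;; #xe9 is 233 decimal, which is 351 in octal.
(my-octal-escape #xe9)   ; => "\\351", i.e. the four characters \351
```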
> --- orig/lisp/international/utf-8.el
> +++ mod/lisp/international/utf-8.el
> @@ -2,7 +2,7 @@
> ;; Copyright (C) 2001, 2004 Electrotechnical Laboratory, JAPAN.
> ;; Licensed to the Free Software Foundation.
> -;; Copyright (C) 2001, 2002 Free Software Foundation, Inc.
> +;; Copyright (C) 2001, 2002, 2005 Free Software Foundation, Inc.
> ;; Author: TAKAHASHI Naoto <address@hidden>
> ;; Maintainer: FSF
> @@ -259,7 +259,7 @@
> (funcall decode-char-no-trans (car x))
> (funcall decode-char-no-trans (cdr x))))
> ranges "")))
> - ;; These forces loading and settting tables for
> + ;; This forces loading and setting tables for
> ;; utf-translate-cjk-mode.
> (setq utf-translate-cjk-lang-env nil
> ucs-mule-cjk-to-unicode (make-hash-table :test 'eq)
> @@ -951,10 +951,7 @@
> (save-excursion
> (save-restriction
> (narrow-to-region (point) (+ (point) length))
> - ;; Can't do eval-when-compile to insert a multibyte constant
> - ;; version of the string in the loop, since it's always loaded as
> - ;; unibyte from a byte-compiled file.
> - (let ((range (string-as-multibyte "^\xc0-\xc3\xe1-\xf7"))
> + (let ((range "^\xc0-\xc3\xe1-\xf7")
This change is not good because range is set to a unibyte
string, and regexp search converts it to a multibyte
string by `make-multibyte-string'. What we need here is a
multibyte string that contains eight-bit-graphic/control
chars. Anyway, it is better to change string-as-multibyte to
string-to-multibyte.
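To make the unibyte/multibyte point concrete, here is a minimal sketch that can be evaluated in a current Emacs. The behavior in Emacs 21 may differ in detail, and the helper name my-range-info is only for illustration. It relies only on multibyte-string-p and string-to-multibyte.

```elisp
;; Hypothetical helper, not part of utf-8.el.
;; Compare a string literal written with \x escapes to the result of
;; passing it through `string-to-multibyte'.
(defun my-range-info ()
  "Return a list (LITERAL-MULTIBYTE-P CONVERTED-MULTIBYTE-P SAME-LENGTH-P)."
  (let* ((literal "^\xc0-\xc3\xe1-\xf7")
         (converted (string-to-multibyte literal)))
    (list (multibyte-string-p literal)
          (multibyte-string-p converted)
          ;; The conversion keeps one character per original byte.
          (= (length literal) (length converted)))))
```

The first element should be nil (the literal is read as unibyte) and the second should be t, which is the difference the patch would have lost.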
> (buffer-multibyte enable-multibyte-characters)
> hash-table ch)
> (set-buffer-multibyte t)
> @@ -1036,8 +1033,7 @@
> mule-unicode-0100-24ff
> mule-unicode-2500-33ff
> mule-unicode-e000-ffff
> - ,@(if utf-translate-cjk-mode
> - utf-translate-cjk-charsets))
> + ,@utf-translate-cjk-charsets)
This change is ok.
> (mime-charset . utf-8)
> (coding-category . coding-category-utf-8)
> (valid-codes (0 . 255))
> @@ -1054,23 +1050,23 @@
> ;; I think this needs special private charsets defined for the
> ;; untranslated sequences, if it's going to work well.
> -;;; (defun utf-8-compose-function (pos to pattern &optional string)
> -;;; (let* ((prop (get-char-property pos 'composition string))
> -;;; (l (and prop (- (cadr prop) (car prop)))))
> -;;; (cond ((and l (> l (- to pos)))
> -;;; (delete-region pos to))
> -;;; ((and (> (char-after pos) 224)
> -;;; (< (char-after pos) 256)
> -;;; (save-restriction
> -;;; (narrow-to-region pos to)
> -;;; (utf-8-compose)))
> -;;; t))))
> -
> -;;; (dotimes (i 96)
> -;;; (aset composition-function-table
> -;;; (+ 128 i)
> -;;; `((,(string-as-multibyte "[\200-\237\240-\377]")
> -;;; . utf-8-compose-function))))
> +;; (defun utf-8-compose-function (pos to pattern &optional string)
> +;; (let* ((prop (get-char-property pos 'composition string))
> +;; (l (and prop (- (cadr prop) (car prop)))))
> +;; (cond ((and l (> l (- to pos)))
> +;; (delete-region pos to))
> +;; ((and (> (char-after pos) 224)
> +;; (< (char-after pos) 256)
> +;; (save-restriction
> +;; (narrow-to-region pos to)
> +;; (utf-8-compose)))
> +;; t))))
> +
> +;; (dotimes (i 96)
> +;; (aset composition-function-table
> +;; (+ 128 i)
> +;; `((,(string-as-multibyte "[\200-\237\240-\377]")
> +;; . utf-8-compose-function))))
> ;; arch-tag: b08735b7-753b-4ae6-b754-0f3efe4515c5
> ;;; utf-8.el ends here
This change is ok if that is the correct coding style for
comments.
---
Ken'ichi HANDA
address@hidden