address@hidden: euc-tw encoding (was: autoload for define-ccl-program)]

emacs-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
address@hidden: euc-tw encoding (was: autoload for define-ccl-program)]

From:	Richard Stallman
Subject:	address@hidden: euc-tw encoding (was: autoload for define-ccl-program)]
Date:	Mon, 10 Dec 2001 13:46:20 -0700 (MST)
Could you please look at this and comment on it?

------- Start of forwarded message -------
To: address@hidden
Subject: euc-tw encoding (was: autoload for define-ccl-program)
From: Werner LEMBERG <address@hidden>
In-Reply-To: <address@hidden>
Sender: address@hidden
Date: Sat, 08 Dec 2001 16:31:09 +0100 (CET)
X-MIME-Autoconverted: from base64 to 8bit by pele.santafe.edu id LAA10376


>  WL> No.  I want to autoload a CCL function :-)
> 
> I don't understand how you expect it to work, and it's still not clear
> it would be terribly useful.

After I've seen your solution how to handle Unicode, I'm simply
following you, using a dummy pre-write-conversion.

So here is finally my patch to implement EUC-TW encoding.  Please
test and install.


    Werner


======================================================================


2001-12-08  Werner LEMBERG  <address@hidden>

        Add support for EUC-TW decoding/encoding.

        * language/china-util.el (big5-to-flat-code, flat-code-to-big5,
        euc-to-flat-code, flat-code-to-euc, expand-euc-big5-alist):
        New auxiliary functions to build `big5-to-cns'.
        (big5-to-cns): New translation alist.

        * language/chinese.el: Added new coding system `euc-tw' and its
        alias `euc-taiwan'.
        Updated language `Chinese-CNS' to include euc-tw encoding also.
        (ccl-decode-euc-tw, ccl-encode-euc-tw): New functions for handling
        euc-tw.


- --- china-util.el.old Thu Jul 26 07:11:22 2001
+++ china-util.el       Sat Dec  8 15:45:44 2001
@@ -1,7 +1,8 @@
- -;;; china-util.el --- utilities for Chinese
+;;; china-util.el --- utilities for Chinese  -*- coding: iso-2022-7bit -*-
 
 ;; Copyright (C) 1995 Electrotechnical Laboratory, JAPAN.
 ;; Licensed to the Free Software Foundation.
+;; Copyright (C) 1995, 2001 Free Software Foundation, Inc.
 
 ;; Keywords: mule, multilingual, Chinese
 
@@ -26,7 +27,7 @@
 
 ;;; Code:
 
- -;; Hz/ZW encoding stuffs
+;; Hz/ZW/EUC-TW encoding stuff
 
 ;; HZ is an encoding method for Chinese character set GB2312 used
 ;; widely in Internet.  It is very similar to 7-bit environment of
@@ -38,6 +39,13 @@
 ;; encodes Chinese characters line by line by starting each line with
 ;; the sequence "zW".  It also uses only 7-bit as HZ.
 
+;; EUC-TW is similar to EUC-KS or EUC-JP.  Its main character set is
+;; plane 1 of CNS 11643; characters of planes 2 to 7 are accessed with
+;; a single shift escape followed by three bytes: the first gives the
+;; plane, the second and third the character code.  Note that characters
+;; of plane 1 are (redundantly) accessible with a single shift escape
+;; also.
+
 ;; ISO-2022 escape sequence to designate GB2312.
 (defvar iso2022-gb-designation "\e$A")
 ;; HZ escape sequence to designate GB2312.
@@ -155,6 +163,265 @@
   "Encode the text in the current buffer to HZ."
   (interactive)
   (encode-hz-region (point-min) (point-max)))
+
+;; The following sets up a translation table (big5-to-cns) from Big 5
+;; to CNS encoding, using some auxiliary functions to make the code
+;; more readable.
+
+;; Many kudos to Himi!  The used code has been adapted from his
+;; mule-ucs package.
+
+(defun big5-to-flat-code (num)
+  "Convert NUM in Big 5 encoding to a `flat code'.
+0xA140 will be mapped to position 0, 0xA141 to position 1, etc.
+There are no gaps in the flat code."
+
+  (let ((hi (/ num 256))
+        (lo (% num 256)))
+    (+ (* 157 (- hi #xa1))
+       (- lo (if (>= lo #xa1) 98 64)))))
+
+(defun flat-code-to-big5 (num)
+  "Convert NUM from a `flat code' to Big 5 encoding.
+This is the inverse function of `big5-to-flat-code'."
+
+  (let ((hi (/ num 157))
+        (lo (% num 157)))
+    (+ (* 256 (+ hi #xa1))
+       (+ lo (if (< lo 63) 64 98)))))
+
+(defun euc-to-flat-code (num)
+  "Convert NUM in EUC encoding (in GL representation) to a `flat code'.
+0x2121 will be mapped to position 0, 0x2122 to position 1, etc.
+There are no gaps in the flat code."
+
+  (let ((hi (/ num 256))
+        (lo (% num 256)))
+    (+ (* 94 (- hi #x21))
+       (- lo #x21))))
+
+(defun flat-code-to-euc (num)
+  "Convert NUM from a `flat code' to EUC encoding (in GL representation).
+The inverse function of `euc-to-flat-code'.  The high and low bytes are
+returned in a list."
+
+  (let ((hi (/ num 94))
+        (lo (% num 94)))
+    (list (+ hi #x21) (+ lo #x21))))
+
+(defun expand-euc-big5-alist (alist)
+  "Create a translation table and fills it with data given in ALIST.
+Elements of ALIST can be either given as
+
+  ((euc-charset . startchar) . (big5-range-begin . big5-range-end))
+
+or as
+
+  (euc-character . big5-charcode)
+
+The former maps a range of glyphs in an EUC charset (where STARTCHAR
+is in GL representation) to a certain range of Big 5 encoded
+characters, the latter maps a single glyph.  Glyphs which can't be
+mapped will be represented with the byte 0xFF.
+
+The return value is the filled translation table."
+
+  (let (chartable
+        elem
+        result
+        char
+        big5
+        i
+        end
+        codepoint
+        charset)
+    (setq chartable (make-char-table 'translation-table #xFF))
+    (while alist
+      (setq elem (car alist)
+            char (car elem)
+            big5 (cdr elem)
+            alist (cdr alist))
+      (cond ((and (consp char)
+                  (consp big5))
+               (setq i (big5-to-flat-code (car big5))
+                     end (big5-to-flat-code (cdr big5))
+                     codepoint (euc-to-flat-code (cdr char))
+                     charset (car char))
+               (while (>= end i)
+                 (aset chartable
+                       (decode-big5-char (flat-code-to-big5 i))
+                       (apply (function make-char)
+                              charset
+                              (flat-code-to-euc codepoint)))
+                 (setq i (1+ i)
+                       codepoint (1+ codepoint)))
+            )
+            ((and (char-valid-p char)
+                  (numberp big5))
+               (setq i (decode-big5-char big5))
+               (aset chartable i char)
+            )
+            (t
+             (error "Unknown slot type: %S" elem)
+            )
+      )
+    )
+    ;; the return value
+    chartable
+  )
+)
+
+;; All non-CNS encodings are commented out.
+
+(define-translation-table 'big5-to-cns
+  (expand-euc-big5-alist
+   '(
+     ;; Symbols
+     ((chinese-cns11643-1 . #x2121) . (#xA140 . #xA1F5))
+     (?$,1vp(B . #xA1F6)
+     (?$,1vr(B . #xA1F7)
+     ((chinese-cns11643-1 . #x2259) . (#xA1F8 . #xA2AE))
+     ((chinese-cns11643-1 . #x2421) . (#xA2AF . #xA3BF))
+     ;; Control codes (vendor dependent)
+     ((chinese-cns11643-1 . #x4221) . (#xA3C0 . #xA3E0))
+     ;; Level 1 Ideographs
+     ((chinese-cns11643-1 . #x4421) . (#xA440 . #xACFD))
+     (?è . #xACFE)
+     ((chinese-cns11643-1 . #x5323) . (#xAD40 . #xAFCF))
+     ((chinese-cns11643-1 . #x5754) . (#xAFD0 . #xBBC7))
+     ((chinese-cns11643-1 . #x6B51) . (#xBBC8 . #xBE51))
+     (?é¬ . #xBE52)
+     ((chinese-cns11643-1 . #x6F5C) . (#xBE53 . #xC1AA))
+     ((chinese-cns11643-1 . #x7536) . (#xC1AB . #xC2CA))
+     (?è¦ . #xC2CB)
+     ((chinese-cns11643-1 . #x7737) . (#xC2CC . #xC360))
+     ((chinese-cns11643-1 . #x782E) . (#xC361 . #xC3B8))
+     (?ç¾ . #xC3B9)
+     (?ç¹³ . #xC3BA)
+     ((chinese-cns11643-1 . #x7866) . (#xC3BB . #xC455))
+     (?å¨ . #xC456)
+     ((chinese-cns11643-1 . #x7962) . (#xC457 . #xC67E))
+     ;; Symbols
+     ((chinese-cns11643-1 . #x2621) . (#xC6A1 . #xC6BE))
+     ;; Radicals
+     (?$,2:b(B . #xC6BF)
+     (?$,2:c(B . #xC6C0)
+     (?$,2:e(B . #xC6C1)
+     (?$,2:g(B . #xC6C2)
+     (?$,2:l(B . #xC6C3)
+     (?$,2:m(B . #xC6C4)
+     (?$,2:n(B . #xC6C5)
+     (?$,2:s(B . #xC6C6)
+     (?$,2:v(B . #xC6C7)
+     (?$,2:y(B . #xC6C8)
+     (?$,2:{(B . #xC6C9)
+     (?$,2;"(B . #xC6CA)
+     (?$,2;'(B . #xC6CB)
+     (?$,2;.(B . #xC6CC)
+     (?$,2;3(B . #xC6CD)
+     (?$,2;4(B . #xC6CE)
+     (?$,2;5(B . #xC6CF)
+     (?$,2;9(B . #xC6D0)
+     (?$,2;:(B . #xC6D1)
+     (?$,2;A(B . #xC6D2)
+     (?$,2;F(B . #xC6D3)
+     (?$,2;g(B . #xC6D4)
+     (?$,2;h(B . #xC6D5)
+     (?$,2<A(B . #xC6D6)
+     (?$,2<J(B . #xC6D7)
+     ;; Diacritical Marks
+     ; ((japanese-jisx0208 . #x212F) . (#xC6D8 . #xC6D9))
+     ;; Japanese Kana Supplement
+     ; ((japanese-jisx0208 . #x2133) . (#xC6DA . #xC6E3))
+     ;; Japanese Hiragana
+     ; ((japanese-jisx0208 . #x2421) . (#xC6E7 . #xC77A))
+     ;; Japanese Katakana
+     ; ((japanese-jisx0208 . #x2521) . (#xC77B . #xC7F2))
+     ;; Cyrillic Characters
+     ; ((japanese-jisx0208 . #x2721) . (#xC7F3 . #xC854))
+     ; ((japanese-jisx0208 . #x2751) . (#xC855 . #xC875))
+     ;; Special Chinese Characters
+     (?ä¹ . #xC879)
+     (?ä¹ . #xC87B)
+     (?å . #xC87D)
+     (?å . #xC8A2)
+
+     ;; JIS X 0208 NOT SIGN (cf. U+00AC)
+     ; (?¬ . #xC8CD)
+     ;; JIS X 0212 BROKEN BAR (cf. U+00A6)
+     ; (?¦ . #xC8CE)
+
+     ;; GB 2312 characters
+     ; (?$,1s2(B . #xC8CF)
+     ; (?$,1s3(B . #xC8D0)
+        ;;;;; C8D1 - Japanese `(æ ª)'
+     ; (?$,1uV(B . #xC8D2)
+        ;;;;; C8D2 - Tel.
+
+     ;; Level 2 Ideographs
+     ((chinese-cns11643-2 . #x2121) . (#xC940 . #xC949))
+     (?å . #xC94A);; a duplicate of #xA461
+     ((chinese-cns11643-2 . #x212B) . (#xC94B . #xC96B))
+     ((chinese-cns11643-2 . #x214D) . (#xC96C . #xC9BD))
+     (?å . #xC9BE)
+     ((chinese-cns11643-2 . #x217D) . (#xC9BF . #xC9EC))
+     ((chinese-cns11643-2 . #x224E) . (#xC9ED . #xCAF6))
+     (?æ· . #xCAF7)
+     ((chinese-cns11643-2 . #x2439) . (#xCAF8 . #xD6CB))
+     (?ç . #xD6CC)
+     ((chinese-cns11643-2 . #x3770) . (#xD6CD . #xD779))
+     (?è . #xD77A)
+     ((chinese-cns11643-2 . #x387E) . (#xD77B . #xDADE))
+     (?ç¬» . #xDADF)
+     ((chinese-cns11643-2 . #x3E64) . (#xDAE0 . #xDBA6))
+     ((chinese-cns11643-2 . #x3F6B) . (#xDBA7 . #xDDFB))
+     (?å . #xDDFC);; a duplicate of #xDCD1
+     ((chinese-cns11643-2 . #x4424) . (#xDDFD . #xE8A2))
+     ((chinese-cns11643-2 . #x554C) . (#xE8A3 . #xE975))
+     ((chinese-cns11643-2 . #x5723) . (#xE976 . #xEB5A))
+     ((chinese-cns11643-2 . #x5A29) . (#xEB5B . #xEBF0))
+     (?é¥ . #xEBF1)
+     ((chinese-cns11643-2 . #x5B3F) . (#xEBF2 . #xECDD))
+     (?å¾» . #xECDE)
+     ((chinese-cns11643-2 . #x5C6A) . (#xECDF . #xEDA9))
+     ((chinese-cns11643-2 . #x5D75) . (#xEDAA . #xEEEA))
+     (?é . #xEEEB)
+     ((chinese-cns11643-2 . #x6039) . (#xEEEC . #xF055))
+     (?ç£¿ . #xF056)
+     ((chinese-cns11643-2 . #x6243) . (#xF057 . #xF0CA))
+     (?è¤ . #xF0CB)
+     ((chinese-cns11643-2 . #x6337) . (#xF0CC . #xF162))
+     ((chinese-cns11643-2 . #x6430) . (#xF163 . #xF16A))
+     (?é¥ . #xF16B)
+     ((chinese-cns11643-2 . #x6438) . (#xF16C . #xF267))
+     (?çª . #xF268)
+     ((chinese-cns11643-2 . #x6573) . (#xF269 . #xF2C2))
+     ((chinese-cns11643-2 . #x664E) . (#xF2C3 . #xF374))
+     ((chinese-cns11643-2 . #x6762) . (#xF375 . #xF465))
+     ((chinese-cns11643-2 . #x6935) . (#xF466 . #xF4B4))
+     (?è . #xF4B5)
+     ((chinese-cns11643-2 . #x6962) . (#xF4B6 . #xF4FC))
+     ((chinese-cns11643-2 . #x6A4C) . (#xF4FD . #xF662))
+     (?é¼ . #xF663)
+     ((chinese-cns11643-2 . #x6C52) . (#xF664 . #xF976))
+     ((chinese-cns11643-2 . #x7167) . (#xF977 . #xF9C3))
+     (?é¬® . #xF9C4)
+     (?é¸ . #xF9C5)
+     (?ç© . #xF9C6)
+     ((chinese-cns11643-2 . #x7235) . (#xF9C7 . #xF9D1))
+     ((chinese-cns11643-2 . #x7241) . (#xF9D2 . #xF9D5))
+
+     ;; Additional Ideographs
+     (?ç¢ . #xF9D6)
+     (?é¹ . #xF9D7)
+     (?è£ . #xF9D8)
+     (?å¢» . #xF9D9)
+     (?æ . #xF9DA)
+     (?ç²§ . #xF9DB)
+     (?å«º . #xF9DC)
+    )
+  )
+)
 
 ;;
 (provide 'china-util)
- --- chinese.el.old    Thu Jul 26 07:11:23 2001
+++ chinese.el  Sat Dec  8 15:41:47 2001
@@ -168,13 +168,154 @@
 ;; Chinese CNS11643 (traditional)
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
+(defvar big5-to-cns (make-translation-table)
+  "Translation table for encoding to `euc-tw'.")
+;; Could have been done by china-util loaded before.
+(unless (get 'big5-to-cns 'translation-table)
+  (define-translation-table 'big5-to-cns big5-to-cns))
+
+(define-ccl-program ccl-decode-euc-tw
+  ;; CNS plane 1 needs either two or four bytes in EUC-TW encoding;
+  ;; CNS planes 2 to 7 always need four bytes.  In internal encoding of
+  ;; Emacs, CNS planes 1 and 2 need three bytes, and planes 3 to 7 need
+  ;; four bytes.  Thus a buffer magnification value of 2 (for both
+  ;; encoding and decoding) is sufficient.
+  `(2
+    ;; we don't have enough registers to hold all charset-ids
+    ((r4 = ,(charset-id 'chinese-cns11643-1))
+     (r5 = ,(charset-id 'chinese-cns11643-2))
+     (r6 = ,(charset-id 'chinese-cns11643-3))
+     (loop
+      (read-if (r0 < #x80)
+         ;; ASCII
+         (write-repeat r0)
+       ;; not ASCII
+       (if (r0 == #x8E)
+           ;; single shift
+           (read-if (r1 < #xA1)
+               ;; invalid byte
+               ((write r0)
+                (write-repeat r1))
+             (if (r1 > #xA7)
+                 ;; invalid plane
+                 ((write r0)
+                  (write-repeat r1))
+               ;; OK, we have a plane
+               (read-if (r2 < #xA1)
+                   ;; invalid first byte
+                   ((write r0 r1)
+                    (write-repeat r2))
+                 (read-if (r3 < #xA1)
+                     ;; invalid second byte
+                     ((write r0 r1 r2)
+                      (write-repeat r3))
+                   ;; CNS 1-7, finally
+                   ((branch (r1 - #xA1)
+                     (r1 = r4)
+                     (r1 = r5)
+                     (r1 = r6)
+                     (r1 = ,(charset-id 'chinese-cns11643-4))
+                     (r1 = ,(charset-id 'chinese-cns11643-5))
+                     (r1 = ,(charset-id 'chinese-cns11643-6))
+                     (r1 = ,(charset-id 'chinese-cns11643-7)))
+                    (r2 = ((((r2 - #x80) << 7) + r3) - #x80))
+                    (write-multibyte-character r1 r2)
+                    (repeat))))))
+         ;; standard EUC
+         (if (r0 < #xA1)
+             ;; invalid first byte
+             (write-repeat r0)
+           (read-if (r1 < #xA1)
+               ;; invalid second byte
+               ((write r0)
+                (write-repeat r1))
+             ;; CNS 1, finally
+             ((r1 = ((((r0 - #x80) << 7) + r1) - #x80))
+              (write-multibyte-character r4 r1)
+              (repeat)))))))))
+  "CCL program to decode EUC-TW encoding."
+)
+
+(define-ccl-program ccl-encode-euc-tw
+  `(2
+    ;; we don't have enough registers to hold all charset-ids
+    ((r2 = ,(charset-id 'ascii))
+     (r3 = ,(charset-id 'chinese-big5-1))
+     (r4 = ,(charset-id 'chinese-big5-2))
+     (r5 = ,(charset-id 'chinese-cns11643-1))
+     (r6 = ,(charset-id 'chinese-cns11643-2))
+     (loop
+      (read-multibyte-character r0 r1)
+      (if (r0 == r2)
+         (write-repeat r1)
+       (;; Big 5 encoded characters are first translated to CNS
+        (if (r0 == r3)
+            (translate-character big5-to-cns r0 r1)
+          (if (r0 == r4)
+              (translate-character big5-to-cns r0 r1)))
+        (if (r0 == r5)
+            (r0 = #xA1)
+          (if (r0 == r6)
+              (r0 = #xA2)
+            (if (r0 == ,(charset-id 'chinese-cns11643-3))
+                (r0 = #xA3)
+              (if (r0 == ,(charset-id 'chinese-cns11643-4))
+                  (r0 = #xA4)
+                (if (r0 == ,(charset-id 'chinese-cns11643-5))
+                    (r0 = #xA5)
+                  (if (r0 == ,(charset-id 'chinese-cns11643-6))
+                      (r0 = #xA6)
+                    (if (r0 == ,(charset-id 'chinese-cns11643-7))
+                        (r0 = #xA7)
+                      ;; not CNS.  We use a dummy character which
+                      ;; can't occur in EUC-TW encoding to indicate
+                      ;; this.
+                      (write-repeat #xFF))))))))))
+      (if (r0 != #xA1)
+         ;; single shift and CNS plane
+         ((write #x8E)
+          (write r0)))
+      (write ((r1 >> 7) + #x80))
+      (write ((r1 % #x80) + #x80))
+      (repeat))))
+  "CCL program to encode EUC-TW encoding."
+)
+
+(defun euc-tw-pre-write-conversion (beg end)
+  "Semi-dummy pre-write function effectively to autoload china-util."
+  ;; Ensure translation table is loaded.
+  (require 'china-util)
+  ;; Don't do this again.
+  (coding-system-put 'euc-tw 'pre-write-conversion nil)
+  nil)
+
+(make-coding-system
+  'euc-tw 4 ?Z
+  "ISO 2022 based EUC encoding for Chinese CNS11643"
+  '(ccl-decode-euc-tw . ccl-encode-euc-tw)
+  '((safe-charsets ascii
+                  chinese-big5-1
+                  chinese-big5-2
+                  chinese-cns11643-1
+                  chinese-cns11643-2
+                  chinese-cns11643-3
+                  chinese-cns11643-4
+                  chinese-cns11643-5
+                  chinese-cns11643-6
+                  chinese-cns11643-7)
+    (valid-codes (0 . 255))
+    (pre-write-conversion . euc-tw-pre-write-conversion)))
+
+(define-coding-system-alias 'euc-taiwan 'euc-tw)
+
 (set-language-info-alist
  "Chinese-CNS" '((charset chinese-cns11643-1 chinese-cns11643-2
                          chinese-cns11643-3 chinese-cns11643-4
                          chinese-cns11643-5 chinese-cns11643-6
                          chinese-cns11643-7)
- -              (coding-system iso-2022-cn)
- -              (coding-priority iso-2022-cn chinese-big5 chinese-iso-8bit)
+                (coding-system iso-2022-cn euc-tw)
+                (coding-priority iso-2022-cn euc-tw chinese-big5
+                                 chinese-iso-8bit)
                 (features china-util)
                 (input-method . "chinese-cns-quick")
                 (documentation . "Support for Chinese CNS character sets."))
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÄ$,1.G(B,ý×¯zY$,1.j(BX$,1""(BX$,1%t(BIrÏÝz÷¥þ
       îþàÛiÿÿæj)ÿ{ÿ¸?$,1.H y©ÿ(B+-wèýérÏÝz÷¥
------- End of forwarded message -------
[Prev in Thread]
Current Thread
[Next in Thread]
address@hidden: euc-tw encoding (was: autoload for define-ccl-program)], Richard Stallman <=
- Re: address@hidden: euc-tw encoding (was: autoload for define-ccl-program)], Kenichi Handa, 2001/12/11
  - Re: address@hidden: euc-tw encoding, Werner LEMBERG, 2001/12/11
  - Re: address@hidden: euc-tw encoding (was: autoload for define-ccl-program)], Richard Stallman, 2001/12/12
Prev by Date: address@hidden: hl-line should be updated when overlay-arrow-position is changed]
Next by Date: Re: Lexical binding -- do we really need it?
Previous by thread: address@hidden: hl-line should be updated when overlay-arrow-position is changed]
Next by thread: Re: address@hidden: euc-tw encoding (was: autoload for define-ccl-program)]
Index(es):
- Date
- Thread