bug-gnu-emacs

bug#35766: emacs saves utf-16 le xml files as utf-16 be


From: Eli Zaretskii
Subject: bug#35766: emacs saves utf-16 le xml files as utf-16 be
Date: Fri, 17 May 2019 18:34:48 +0300

> From: Noam Postavsky <npostavs@gmail.com>
> Cc: Eli Zaretskii <eliz@gnu.org>,  "35766\@debbugs.gnu.org" 
> <35766@debbugs.gnu.org>
> Date: Fri, 17 May 2019 07:48:30 -0400
> 
>     UTF-16LE    1014    [RFC2781]   [RFC2781]   csUTF16LE

Ouch, I was looking at the wrong column in that document.

The problem is that our detection of the encoding of XML files is based
on the assumption that the header is in an ASCII-compatible encoding,
which UTF-16 isn't.  So the regexp search for the XML header fails, and
the detection fails with it.
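
To illustrate why an ASCII-oriented search misses the header, here is a
minimal sketch in Python (not Emacs code; the sample header string and
the pattern are assumptions made up for the illustration).  In UTF-16
every ASCII character is followed or preceded by a NUL byte, so the
byte sequence "<?xml" never appears contiguously:

```python
import re

# Hypothetical sample header, used only for this illustration.
header = '<?xml version="1.0" encoding="utf-16le"?>'

# A byte-level pattern that assumes an ASCII-compatible encoding.
XML_DECL = re.compile(rb"<\?xml")

as_utf8 = header.encode("utf-8")
as_utf16le = header.encode("utf-16-le")  # no BOM is written by this codec

# In UTF-8 the declaration is found as-is.
found_in_utf8 = XML_DECL.search(as_utf8) is not None

# In UTF-16LE each character is two bytes (e.g. "<" is 3C 00),
# so the contiguous ASCII pattern does not match.
found_in_utf16le = XML_DECL.search(as_utf16le) is not None

print(found_in_utf8, found_in_utf16le)
print(as_utf16le[:4])
```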

The patch below makes us at least recognize UTF-16 with BOM, and also
stops the coding-system check from frightening the user when she
specifies UTF-16 with BOM at buffer-save time.  But by default, saving
a buffer with UTF-16BE or UTF-16LE still produces a file without BOM,
and that cannot be detected by our encoding-detection machinery,
leaving it to the user to use "C-x RET c" or "C-x RET r".
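
The BOM check can be sketched outside Emacs as follows.  This is a
rough Python illustration of the idea only, not the patch itself; the
function name sniff_utf16_bom and the sample text are hypothetical.  It
shows that a file with a BOM is recognized from its first two bytes,
while the same text saved without a BOM gives the check nothing to go
on:

```python
import codecs

def sniff_utf16_bom(data: bytes):
    """Return a coding-system-like name if DATA starts with a UTF-16 BOM.

    Hypothetical helper for illustration; returns None when no BOM is found.
    """
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF
        return "utf-16be-with-signature"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE
        return "utf-16le-with-signature"
    return None

# Hypothetical sample content.
text = '<?xml version="1.0" encoding="utf-16"?>'

le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le_without_bom = text.encode("utf-16-le")

print(sniff_utf16_bom(le_with_bom))
print(sniff_utf16_bom(be_with_bom))
print(sniff_utf16_bom(le_without_bom))
```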

Perhaps we should by default produce the encoding with BOM when the XML
header specifies UTF-16?

diff --git a/lisp/international/mule-cmds.el b/lisp/international/mule-cmds.el
index dfa9e4e..a248ef8 100644
--- a/lisp/international/mule-cmds.el
+++ b/lisp/international/mule-cmds.el
@@ -1029,7 +1029,11 @@ select-safe-coding-system
                 ;; This check perhaps isn't ideal, but is probably
                 ;; the best thing to do.
                 (not (auto-coding-alist-lookup (or file buffer-file-name "")))
-                (not (coding-system-equal coding-system auto-cs)))
+                (not (coding-system-equal coding-system auto-cs))
+                 (or (equal (coding-system-type auto-cs) 'charset)
+                     (not (coding-system-equal (coding-system-type auto-cs)
+                                               (coding-system-type
+                                                coding-system)))))
            (unless (yes-or-no-p
                     (format "Selected encoding %s disagrees with \
 %s specified by file contents.  Really save (else edit coding cookies \
diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index b5414de..fcdcd3c 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2587,9 +2587,14 @@ xml-find-file-coding-system
       (let ((detected
              (with-coding-priority '(utf-8)
                (coding-system-base
-                (detect-coding-region (point-min) (point-max) t)))))
-        ;; Pure ASCII always comes back as undecided.
+                (detect-coding-region (point-min) (point-max) t))))
+            (bom (list (char-after 1) (char-after 2))))
         (cond
+         ((equal bom '(#xFE #xFF))
+          'utf-16be-with-signature)
+         ((equal bom '(#xFF #xFE))
+          'utf-16le-with-signature)
+         ;; Pure ASCII always comes back as undecided.
          ((memq detected '(utf-8 undecided))
           'utf-8)
          ((eq detected 'utf-16le-with-signature) 'utf-16le-with-signature)




