bug#20623: XML and HTML files with encoding/charset="utf-8" declaration
From: Eli Zaretskii
Subject: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sun, 10 Dec 2017 21:17:00 +0200

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: rgm@gnu.org, a.s@realize.ch, sledergerber@gmx.net,
> 20623@debbugs.gnu.org
> Date: Mon, 04 Dec 2017 16:08:14 -0500
>
> > Isn't it better to fix this in sgml-xml-auto-coding-function? That's
> > where the root cause is, AFAIU.
>
> I'd expect the same problem would affect all other uses.

Not sure what you meant by "all other uses". Could you please
elaborate?

> > And I don't understand the comment about latin-1-mac: I don't think we
> > have such problems in Emacs. The -with-signature variety is
> > different, because it is not about EOL format.
>
> You might be right, but I don't know where/how this is handled.

I would like to propose the following alternative patch, which accepts
utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
purposes of encoding of XML files. Comments? Do we want a similar
treatment for UTF-16? (That doesn't seem to be required by the bug
report, and UTF-16 in XML files is non-standard anyway. But what
about HTML?)
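
For readers who want to check the relationships this proposal relies on,
here is a minimal sketch, not part of the patch. It assumes that
utf-8-with-signature and utf-8-hfs both report utf-8 from
coding-system-type, and that utf-8-hfs is defined by ucs-normalize.el
where it is not already loaded. Evaluate the forms one at a time with
C-x C-e or M-: to inspect the results.

```elisp
;; Illustrative sketch only; evaluate each form separately.
(require 'ucs-normalize)   ; assumed to define utf-8-hfs if not preloaded

;; The coding-system *type* is what the UTF-8 variants are expected to
;; share, even though they are different coding systems.
(mapcar #'coding-system-type '(utf-8 utf-8-with-signature utf-8-hfs))

;; EOL variants are folded onto their base by `coding-system-base',
;; but the -with-signature variety stays a separate base coding system.
(coding-system-base 'utf-8-unix)
(coding-system-base 'utf-8-with-signature-unix)

;; Elements of a quoted list are not evaluated, so they must not carry
;; their own quote: 'utf-8-hfs inside the list reads as the list
;; (quote utf-8-hfs), which `memq' never matches against a symbol.
(memq 'utf-8-hfs '(utf-8 'utf-8-hfs))   ; nil
(memq 'utf-8-hfs '(utf-8 utf-8-hfs))    ; (utf-8-hfs)
```

Comparing types rather than names is what would let a file tagged
"utf-8" keep a buffer encoding of utf-8-with-signature on save.
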
diff --git a/lisp/international/mule.el b/lisp/international/mule.el
index 857fa80..5ff1acf 100644
--- a/lisp/international/mule.el
+++ b/lisp/international/mule.el
@@ -2493,7 +2493,17 @@ sgml-xml-auto-coding-function
(let* ((match (match-string 1))
(sym (intern (downcase match))))
(if (coding-system-p sym)
- sym
+ ;; If the encoding tag is UTF-8 and the buffer's
+ ;; encoding is one of the variants of UTF-8, use the
+ ;; buffer's encoding. This allows, e.g., saving an
+ ;; XML file as UTF-8 with BOM when the tag says UTF-8.
+ (if (and (coding-system-equal 'utf-8
+ (coding-system-type sym))
+ (coding-system-equal sym
+ (coding-system-type
+ buffer-file-coding-system)))
+ buffer-file-coding-system
+ sym)
(message "Warning: unknown coding system \"%s\"" match)
nil))
;; Files without an encoding tag should be UTF-8. But users
@@ -2506,7 +2516,8 @@ sgml-xml-auto-coding-function
(coding-system-base
(detect-coding-region (point-min) size t)))))
;; Pure ASCII always comes back as undecided.
- (if (memq detected '(utf-8 undecided))
+ (if (memq detected
+ '(utf-8 utf-8-with-signature utf-8-hfs undecided))
'utf-8
(warn "File contents detected as %s.
Consider adding an encoding attribute to the xml declaration,