Re: undecided vs utf-8
From: Kenichi Handa
Subject: Re: undecided vs utf-8
Date: Fri, 05 Nov 2010 11:01:58 +0900
In article <address@hidden>, Lars Magne Ingebrigtsen <address@hidden> writes:
> When using erc, it decodes iso-8859-1 fine with the default `undecided'
> into encoding. However, any utf-8 strings are, sort of, just translated
> into the same coding system:
> (decode-coding-string "u-te-\303\246ff \303\245tte" 'undecided)
>>> "u-te-Ã¦ff Ã¥tte"
Perhaps that is because you are in an iso-8859-1 locale.
As I'm in the ja_JP.UTF-8 locale, the above is decoded as utf-8 for me.
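To see why the result depends on the locale, it can help to look at what detection reports for these bytes. The following is only a sketch: the variable name `my-sample-bytes` is made up for illustration, and the actual return values depend on the language environment, so none are shown.

```elisp
;; The byte sequence from the example, as a unibyte string.
;; `my-sample-bytes' is a hypothetical name used only in this sketch.
(defvar my-sample-bytes "u-te-\303\246ff \303\245tte")

;; All coding systems that detection considers possible for these
;; bytes, ordered by the current priority.
(detect-coding-string my-sample-bytes)

;; Only the highest-priority candidate (a single symbol).
(detect-coding-string my-sample-bytes t)

;; The current priority order of coding systems, which normally
;; follows from the language environment chosen for the locale.
(coding-system-priority-list)
```

Comparing the last two forms in an iso-8859-1 session and in a UTF-8 session should show the difference described above.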
> (decode-coding-string "u-te-\303\246ff \303\245tte" 'utf-8)
>>> "u-te-æff åtte"
> So, uhm... Is this meant to be this way? I know that guessing the
> first thing is, well, correct, sort of -- it's valid iso-8859-1,
> although very strange. But it's also valid utf-8. Shouldn't
> `decode-coding-string' prefer utf-8 if it's actually valid? If it's
> valid utf-8, then it's quite likely that it's meant to be utf-8, even
> though other coding systems are also possible.
I don't want to add such a heuristic to
decode-coding-string/region (the lowest-level decoding
functions available from Lisp). Please note that the above
sequence is also valid as Big5. If people are in a Big5
locale, it's hard to say which of utf-8 or big5 should be
preferred unless we implement an NLP system.
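The ambiguity can be made concrete by decoding the same bytes explicitly with each candidate. This is a minimal sketch; the function name `my-decode-three-ways` is hypothetical and exists only for this illustration.

```elisp
;; `my-decode-three-ways' is a hypothetical helper for illustration.
(defun my-decode-three-ways (bytes)
  "Return an alist of BYTES decoded as iso-8859-1, utf-8 and big5."
  (list
   ;; Every byte becomes one character, so each 2-byte sequence
   ;; turns into two Latin-1 characters.
   (cons 'iso-8859-1 (decode-coding-string bytes 'iso-8859-1))
   ;; Each 2-byte sequence becomes one character.
   (cons 'utf-8 (decode-coding-string bytes 'utf-8))
   ;; Each 2-byte sequence is read as one Big5 character.
   (cons 'big5 (decode-coding-string bytes 'big5))))

(my-decode-three-ways "u-te-\303\246ff \303\245tte")
```

All three calls succeed without error, which is why validity alone cannot tell the decoder which reading was intended.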
Perhaps a higher-level function that accepts a list of
preferred coding systems would be good; something like
this:
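(defun detect-and-decode-coding-string (str preferred)
  "Decode STR, trying the coding systems in PREFERRED first.
Fall back to the highest-priority detected coding system."
  (let ((detected (detect-coding-string str))
        decided)
    ;; Pick the first preferred coding system that detection allows.
    (while (and preferred (not decided))
      (if (memq (car preferred) detected)
          (setq decided (car preferred))
        (setq preferred (cdr preferred))))
    (decode-coding-string str (or decided (car detected)))))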
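One thing to watch with the `memq' test above: detection may report canonical names (for example iso-latin-1) rather than the alias a caller passes (for example iso-8859-1), in which case the match would be missed. The variant below is only a sketch of one way to handle that, assuming `coding-system-base' is an acceptable way to normalize both sides; the name `my-detect-and-decode' is hypothetical.

```elisp
;; `my-detect-and-decode' is a hypothetical, alias-tolerant variant
;; of the function above; it is a sketch, not a tested proposal.
(defun my-detect-and-decode (str preferred)
  "Decode unibyte STR with the first of PREFERRED that detection allows.
Fall back to the highest-priority detected coding system."
  (let ((detected (mapcar #'coding-system-base
                          (detect-coding-string str)))
        decided)
    (dolist (cs preferred)
      ;; Compare base coding systems so that aliases and
      ;; end-of-line variants still match.
      (when (and (not decided)
                 (memq (coding-system-base cs) detected))
        (setq decided cs)))
    (decode-coding-string str (or decided (car detected)))))

;; Usage: prefer utf-8 when the bytes are valid utf-8,
;; otherwise try iso-8859-1.
(my-detect-and-decode "u-te-\303\246ff \303\245tte"
                      '(utf-8 iso-8859-1))
```

The caller states the preference explicitly, so the low-level decoding functions do not need any built-in heuristic.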
---
Kenichi Handa
address@hidden