Re: undecided vs utf-8

emacs-devel

From:	Kenichi Handa
Subject:	Re: undecided vs utf-8
Date:	Fri, 05 Nov 2010 11:01:58 +0900

In article <address@hidden>, Lars Magne Ingebrigtsen <address@hidden> writes:

> When using erc, it decodes iso-8859-1 fine with the default `undecided'
> into encoding.  However, any utf-8 strings are, sort of, just translated
> into the same coding system:

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'undecided)
>>> "u-te-Ã¦ff Ã¥tte"

That is probably because you are in an iso-8859-1 locale.
Since I'm in the ja_JP.UTF-8 locale, the above is decoded as utf-8 here.

> (decode-coding-string "u-te-\303\246ff \303\245tte" 'utf-8)
>>> "u-te-æff åtte"

> So, uhm...  Is this meant to be this way?  I know that guessing the
> first thing is, well, correct, sort of -- it's valid iso-8859-1,
> although very strange.  But it's also valid utf-8.  Shouldn't
> `decode-coding-string' prefer utf-8 if it's actually valid?  If it's
> valid utf-8, then it's quite likely that it's meant to be utf-8, even
> though other coding systems are also possible.

I don't want to add such a heuristic to
decode-coding-string/region (the lowest-level decoding functions
available from Lisp).  Please note that the above sequence is also
valid as Big5.  For people in a Big5 locale, it's hard to say
which of utf-8 or big5 should be preferred unless we implement an
NLP system.
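
A quick way to see which candidates Emacs considers for a given
string is `detect-coding-string'.  This is only a sketch: the
contents and order of the result depend on the language environment
and the coding system priorities, so no particular output is assumed.

```elisp
;; Return the list of coding systems these bytes could be, ordered by
;; the current priority.  The first element is the one most likely to
;; win when no explicit coding system is given.
(detect-coding-string "u-te-\303\246ff \303\245tte")
```

If the list has more than one element, the bytes alone do not settle
which coding system was intended.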

Perhaps it would be good to make an upper-layer function that
accepts a list of preferred coding systems; something like
this:

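(defun detect-and-decode-coding-string (str preferred)
  "Decode STR, preferring the first detected entry of PREFERRED.
PREFERRED is a list of coding systems in order of preference.  If none
of them is among the detected candidates, use the first candidate."
  (let ((detected (detect-coding-string str))
        decided)
    ;; Walk PREFERRED until one of its entries is among the candidates.
    (while (and preferred (not decided))
      (if (memq (car preferred) detected)
          (setq decided (car preferred))
        (setq preferred (cdr preferred))))
    (decode-coding-string str (or decided (car detected)))))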
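
A caller such as erc could then use it roughly as follows.  This is
a sketch under assumptions: the preference list is only an
illustration, and because the lookup uses `memq', the names in it
must be spelled the way `detect-coding-string' reports them (for
example `iso-latin-1' rather than an alias).

```elisp
;; Assumes `detect-and-decode-coding-string' above has been evaluated;
;; the `fboundp' guard only keeps this form harmless otherwise.
(when (fboundp 'detect-and-decode-coding-string)
  (detect-and-decode-coding-string "u-te-\303\246ff \303\245tte"
                                   '(utf-8 iso-latin-1)))
```

If utf-8 is among the detected candidates, the string is decoded as
utf-8 whatever the locale; otherwise the highest-priority detected
candidate is used, as before.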

---
Kenichi Handa
address@hidden
