Re: UTF-8 in path / filename
From: Peter Dyballa
Subject: Re: UTF-8 in path / filename
Date: Sat, 26 Aug 2006 11:36:34 +0200
On 26.08.2006 at 01:09, Miles Bader wrote:
> Peter Dyballa <Peter_Dyballa@Web.DE> writes:
>> There won't be a perfect solution with GNU Emacs in the near
>> future ...
> You constantly seem to be having problems with UTF-8, but it works
> absolutely perfectly for me, filenames, dired, everything (using
> emacs 22).
> [It works perfectly even if I do `emacs -Q' to avoid loading my init
> file, though I normally use (set-language-environment 'japanese).]
> AFAIK the main thing is that your LANG environment variable be set to
> something mentioning utf-8 -- I use "ja_JP.UTF-8".
pete 39 /\ .
/Users/pete
pete 40 /\ env | egrep -i 'LC|LANG'
LANG=de_DE.UTF-8
LC_CTYPE=de_DE.UTF-8
pete 41 /\ /usr/local/bin/emacs-22.0.50 -Q &
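The environment check above can also be scripted. The following is a minimal Python sketch, not anything Emacs itself runs: the helper name `utf8_locale_vars` is hypothetical, and the choice of `LC_ALL`, `LC_CTYPE` and `LANG` as the variables to inspect is an assumption based on the usual POSIX locale conventions.

```python
import os

def utf8_locale_vars(env=None):
    """Return the locale variables whose value mentions UTF-8.

    `env` defaults to the process environment; a plain dict can be
    passed instead to inspect a saved environment.
    """
    env = os.environ if env is None else env
    names = ("LC_ALL", "LC_CTYPE", "LANG")  # assumed set of relevant variables
    found = {}
    for name in names:
        # Accept both spellings, "UTF-8" and "utf8", in any letter case.
        value = env.get(name, "")
        if "utf-8" in value.lower().replace("utf8", "utf-8"):
            found[name] = value
    return found

# The environment from the transcript above, passed in explicitly.
print(utf8_locale_vars({"LANG": "de_DE.UTF-8", "LC_CTYPE": "de_DE.UTF-8"}))
```

An empty result would suggest that the session is not running in a UTF-8 locale at all, which is not the case in the transcript above.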
File names with UTF-8 characters in them are shown in dired (which has
-u: in the mode line, i.e. it uses UTF-8) as <vowel><empty box>. Some
UTF-8 characters like ß or € show up as themselves. They appear the
same way in the buffer's mode line, once visited, and in the buffer
list (C-x C-b); they are completely unreadable in the Buffers menu of
the menu bar, and unreadable in yet another way in the "Buffer Menu"
pop-up. The font used for the vowels, the empty boxes, and the other
characters is taken from the Java SDK and is quite rich (1425 mapped
characters, mostly for European and some Near Eastern scripts):
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x61)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x308)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#xDF)
-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1 (#x20AC)
Somehow this looks like a mixture of ISO 8859 characters (#x61, #xDF),
Unicode (#x20AC), and something else (#x308). Or are some of these
just abbreviated representations that omit the leading zeros?
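One possible reading, offered as an assumption and not as a statement about how Emacs produces these strings: all four numbers are plain Unicode code points printed without leading zeros. ISO 8859-1 coincides with the first 256 Unicode code points, so #x61 and #xDF would look the same under either interpretation. The Python sketch below pads each value to four hex digits and looks up its Unicode name; the helper name `describe` is hypothetical.

```python
import unicodedata

def describe(code_point):
    """Format a code point as U+XXXX followed by its Unicode name."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"

# The four values shown next to the font names above.
for cp in (0x61, 0x308, 0xDF, 0x20AC):
    print(describe(cp))
```

Under this reading, #x308 is not "something else" but U+0308, the combining diaeresis.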
The remaining information that C-u C-x = reports for these examples is:
  character: a (97, #o141, #x61, U+0061)
    charset: ascii (ASCII (ISO646 IRV))
 code point: #x61
     syntax: w   which means: word
   category: a:ASCII l:Latin
buffer code: #x61
  file code: #x61 (encoded by coding system mule-utf-8)

  character: (332488, #o1211310, #x512c8, U+0308)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x25 #x48
     syntax: w   which means: word
   category: ^:Combining diacritic or mark
buffer code: #x9C #xF4 #xA5 #xC8
  file code: #xCC #x88 (encoded by coding system mule-utf-8)

  character: ß (2271, #o4337, #x8df, U+00DF)
    charset: latin-iso8859-1 (Right-Hand Part of Latin Alphabet 1 (ISO/IEC 8859-1): ISO-IR-100.)
 code point: #x5F
     syntax: w   which means: word
   category: l:Latin
buffer code: #x81 #xDF
  file code: #xC3 #x9F (encoded by coding system mule-utf-8)

  character: € (342604, #o1235114, #x53a4c, U+20AC)
    charset: mule-unicode-0100-24ff (Unicode characters of the range U+0100..U+24FF.)
 code point: #x74 #x4C
     syntax: w   which means: word
buffer code: #x9C #xF4 #xF4 #xCC
  file code: #xE2 #x82 #xAC (encoded by coding system mule-utf-8)
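The "file code" values can be cross-checked outside Emacs. The Python sketch below encodes the same four characters as UTF-8 and prints the octets in the #x notation used above. The helper name `utf8_hex` is hypothetical, and the sketch only covers the UTF-8 side; it says nothing about how the internal "buffer code" values are derived.

```python
def utf8_hex(char):
    """Return the UTF-8 octets of `char` in Emacs-style #xNN notation."""
    return " ".join(f"#x{byte:02X}" for byte in char.encode("utf-8"))

# U+0061, U+0308, U+00DF and U+20AC, written as escapes so that the
# source file's own encoding cannot interfere.
for char in ("\u0061", "\u0308", "\u00df", "\u20ac"):
    print(f"U+{ord(char):04X} -> {utf8_hex(char)}")
```

All four results agree with the "file code" lines reported for mule-utf-8.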
An excerpt from the fontset's description (note that ISO 8859-16 is missing!):
Fontset: -*-*-medium-r-*-*-10-*-*-*-m-*-fontset-startup
CHARSET or CHAR RANGE    FONT NAME
---------------------    ---------
ascii                    -b&h-lucidatypewriter-medium-r-normal-sans-10-100-75-75-m-60-iso10646-1
    [-Adobe-Courier-Medium-R-Normal--10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-1          -b&h-lucidatypewriter-*-iso10646-1
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
latin-iso8859-2          -*-iso8859-2
latin-iso8859-3          -*-iso8859-3
latin-iso8859-4          -*-iso8859-4
thai-tis620              -*-*-*-tis620-*
greek-iso8859-7          -*-iso8859-7
arabic-iso8859-6         -*-iso8859-6
hebrew-iso8859-8         -*-iso8859-8
katakana-jisx0201        -*-jisx0201-*
latin-jisx0201           -*-jisx0201-*
cyrillic-iso8859-5       -*-iso8859-5
latin-iso8859-9          -*-iso8859-9
latin-iso8859-15         -*-iso8859-15
latin-iso8859-14         -*-iso8859-14
...
mule-unicode-2500-33ff   -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-e000-ffff   -b&h-lucidatypewriter-*-iso10646-1
mule-unicode-0100-24ff   -b&h-lucidatypewriter-*-iso10646-1
    [-B&H-LucidaTypewriter-Bold-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
    [-B&H-LucidaTypewriter-Medium-R-Normal-Sans-10-100-75-75-M-60-ISO10646-1]
...
IMO this display of UTF-8 characters is not adequate.
> If that doesn't work, I dunno, maybe it's something screwy about
> the mac.
There is something special, possibly screwy, in the way Mac OS X (or
rather HFS+, its file system) stores UTF-8 characters in file names:
they get decomposed, i.e. an ä becomes a followed by ¨, an à becomes
a followed by `, etc. (Only file names are affected; a file's contents
are not decomposed. What would such a JPEG picture look like?) So a
character that takes two or three octets in precomposed form is
expanded on disk to a pair: one octet for the base letter and
(mostly?) two octets for the combining mark. GNU Emacs should be able
to detect that: if a character is from the category (see above)
"Combining diacritic or mark", it cannot stand alone by nature, but
must be combined with the character on its left in a left-to-right
writing system or with the character on its right in a right-to-left
writing system. (I have no idea what the rules are in a top-to-bottom
writing system like Mongolian, or whether such systems have combining
characters.) And it should be able to handle the character categories
correctly.
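The decomposition described above can be reproduced with Unicode normalization. The Python sketch below uses the standard NFD form as a stand-in; this is an assumption, since the exact decomposition applied by HFS+ may differ from standard NFD for some characters. The file name "ä.txt" and the helper name `combining_positions` are made up for illustration.

```python
import unicodedata

# Hypothetical file name with one precomposed a-umlaut (U+00E4).
precomposed = "\u00e4.txt"

# NFD splits it into the base letter "a" plus U+0308 COMBINING DIAERESIS.
decomposed = unicodedata.normalize("NFD", precomposed)

def combining_positions(text):
    """Return the indices of characters that cannot stand alone."""
    return [i for i, ch in enumerate(text) if unicodedata.combining(ch) != 0]

print([f"U+{ord(ch):04X}" for ch in decomposed])
print(len(precomposed.encode("utf-8")), "octets precomposed,",
      len(decomposed.encode("utf-8")), "octets decomposed")
print("combining marks at:", combining_positions(decomposed))

# NFC recomposes the pair, which is one way a program could show the
# name as the user typed it.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

A non-zero combining class is what marks a character as one that has to attach to its base character; recomposing to NFC before display is one conceivable approach, not something the Emacs of this thread is known to do.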
--
Greetings
Pete
What's the difference between OS X and Vista?
Microsoft employees are excited about OS X