Re: Emacs and UTF-8 locale
From: Markus Kuhn
Subject: Re: Emacs and UTF-8 locale
Date: Tue, 18 Dec 2001 13:20:04 +0000
> Date: Mon, 17 Dec 2001 09:52:20 +0200 (IST)
> From: Eli Zaretskii <address@hidden>
> To: Richard Stallman <address@hidden>
> cc: address@hidden, address@hidden
> Subject: Re: UTF-8 locale
>
> On Sun, 16 Dec 2001, Richard Stallman wrote:
> > Recent changes in mule-cmds.el automatically turn on the UTF-8
> > locale when $LANG says so.
> >
> > For which values of LANG does Emacs use UTF-8?
>
> Those which match the regexp ".*utf\\(-?8\\)\\>".
The proper way of determining the encoding used by the current locale is
not to look at a single locale variable, but to query the Single Unix
Specification (and now also POSIX) function nl_langinfo(CODESET), as for
example in
utf8_mode = (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
There are UTF-8 locales in use (e.g., vi_VN) that do NOT have UTF-8 in
their name, so a direct test of the locale environment variables is
only a less reliable fallback option.
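As a minimal sketch of the nl_langinfo(CODESET) test in context: the
function name locale_is_utf8 is hypothetical, and the sketch assumes the
program has not yet called setlocale() itself. A C program starts in the
"C" locale, so without the setlocale() call nl_langinfo() would describe
the "C" locale and not the one the user selected.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the locale selected by the
   environment uses UTF-8 as its character encoding, 0 otherwise. */
int locale_is_utf8(void)
{
    /* Adopt the LC_CTYPE locale named by LC_ALL, LC_CTYPE or LANG.
       Until this is done the program remains in the "C" locale. */
    setlocale(LC_CTYPE, "");

    /* nl_langinfo(CODESET) names the encoding of the current locale. */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A caller would then write something like utf8_mode = locale_is_utf8();
once at start-up.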
It is my understanding that elisp currently has no direct access to the
output of the API function nl_langinfo(CODESET), and I hope this can be
fixed. Alternatively, you can execute the shell command "locale charmap",
which outputs the return value of nl_langinfo(CODESET) followed by a
newline. This could be used from elisp even right now, though it is of
course less elegant.
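To illustrate what a caller of "locale charmap" has to do (run the
command, read one line, strip the trailing newline), here is a rough
sketch in C using popen(). The function name locale_charmap and its
return convention are assumptions made for this example only; elisp
would do the equivalent with its own process functions.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen() and pclose() */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: runs "locale charmap" and stores its output,
   without the trailing newline, in buf. Returns 0 on success and -1
   if the command could not be run or printed nothing. */
int locale_charmap(char *buf, size_t size)
{
    FILE *p;
    int got_line;

    if (size == 0)
        return -1;

    p = popen("locale charmap", "r");
    if (p == NULL)
        return -1;

    /* The command prints a single line: the codeset name. */
    got_line = (fgets(buf, (int)size, p) != NULL);
    pclose(p);

    if (!got_line)
        return -1;

    /* Cut the string at the first newline, if there is one. */
    buf[strcspn(buf, "\r\n")] = '\0';
    return 0;
}
```

On success, the caller compares the buffer with "UTF-8" exactly as it
would compare the return value of nl_langinfo(CODESET).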
Fortunately, there exists only one single standard string that
nl_langinfo(CODESET) returns in a UTF-8 locale, and that is "UTF-8".
(For ISO 8859-1, both "ISO-8859-1" and "ISO8859-1" are used by
different manufacturers.)
There is at the moment only one widely used system that does not yet
implement nl_langinfo(3) or locale(1) (namely *BSD), and on such a
system you can use something like the following as a fallback:
  char *s;
  int utf8_mode = 0;

  if ((s = getenv("LC_ALL")) ||
      (s = getenv("LC_CTYPE")) ||
      (s = getenv("LANG"))) {
    if (strstr(s, "UTF-8"))
      utf8_mode = 1;
  }
It is important that you test not only LANG, but the first variable
in the sequence LC_ALL, LC_CTYPE and LANG that has a value. Many UTF-8
users strongly prefer LC_CTYPE=en_GB.UTF-8 LANG=C, as this changes only
the encoding but not the sorting order etc., and it also speeds up
program start time, as the C library will only load the LC_CTYPE part
of the locale data, and not all the unwanted rest.
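The precedence rule can be sketched as a pair of small functions that
take the three values as arguments, which makes the LC_CTYPE=en_GB.UTF-8
LANG=C case easy to check. The names effective_ctype and env_says_utf8
are hypothetical. Treating an empty string like an unset variable is an
assumption that goes slightly beyond the fragment above, which only
tests for NULL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the first of the three values that is
   set and non-empty, in the order LC_ALL, LC_CTYPE, LANG, or NULL if
   none of them is. */
const char *effective_ctype(const char *lc_all,
                            const char *lc_ctype,
                            const char *lang)
{
    if (lc_all != NULL && *lc_all != '\0')
        return lc_all;
    if (lc_ctype != NULL && *lc_ctype != '\0')
        return lc_ctype;
    if (lang != NULL && *lang != '\0')
        return lang;
    return NULL;
}

/* Returns 1 if the effective value contains "UTF-8", 0 otherwise. */
int env_says_utf8(const char *lc_all,
                  const char *lc_ctype,
                  const char *lang)
{
    const char *s = effective_ctype(lc_all, lc_ctype, lang);

    return s != NULL && strstr(s, "UTF-8") != NULL;
}
```

A program would call it as env_says_utf8(getenv("LC_ALL"),
getenv("LC_CTYPE"), getenv("LANG")). With LC_CTYPE=en_GB.UTF-8 and
LANG=C the LC_CTYPE value wins, whereas a test of LANG alone would miss
the UTF-8 setting.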
If you need an autoconf test for the presence of nl_langinfo(CODESET),
then here is one:
======================== m4/codeset.m4 ================================
#serial AM1
dnl From Bruno Haible.
AC_DEFUN([AM_LANGINFO_CODESET],
[
AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
[AC_TRY_LINK([#include <langinfo.h>],
[char* cs = nl_langinfo(CODESET);],
am_cv_langinfo_codeset=yes,
am_cv_langinfo_codeset=no)
])
if test $am_cv_langinfo_codeset = yes; then
AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
[Define if you have <langinfo.h> and nl_langinfo(CODESET).])
fi
])
=======================================================================
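For illustration, C code might consume the HAVE_LANGINFO_CODESET symbol
defined by this macro roughly as follows. This is only a sketch: the
function name detect_utf8_mode is hypothetical, and it assumes that the
generated config.h (or an equivalent -D option) has been seen before
this code is compiled.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#ifdef HAVE_LANGINFO_CODESET
#include <langinfo.h>
#endif

/* Hypothetical helper: prefers nl_langinfo(CODESET) where configure
   found it, and falls back to the environment variables otherwise. */
int detect_utf8_mode(void)
{
#ifdef HAVE_LANGINFO_CODESET
    /* Activate the user's locale, then ask it for its encoding. */
    setlocale(LC_CTYPE, "");
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
#else
    /* Fallback: first variable in the sequence that has a value. */
    const char *s;

    if ((s = getenv("LC_ALL")) ||
        (s = getenv("LC_CTYPE")) ||
        (s = getenv("LANG")))
        return strstr(s, "UTF-8") != NULL;
    return 0;
#endif
}
```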
For more information on how applications should activate UTF-8 modes,
please have a look at:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>