bug-gnulib

Windows and non-BMP characters


From: Bruno Haible
Subject: Windows and non-BMP characters
Date: Sun, 13 Feb 2011 22:26:36 +0100
User-agent: KMail/1.9.9

Two weeks ago, when discussing the support of non-BMP characters on Windows,
I was under the impression that it would be useful to use the wwchar_t layer
on both Cygwin >= 1.7 _and_ native Windows.
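
For background: wchar_t is 16 bits wide on Windows and holds UTF-16 code
units, so a character outside the BMP does not fit in a single wchar_t and
is stored as a surrogate pair. A minimal sketch of that encoding follows;
utf16_encode is a hypothetical helper name used for illustration, not a
gnulib function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the Unicode code point CP as UTF-16 into OUT.
   Returns the number of 16-bit units written (1 for BMP characters,
   2 for non-BMP characters), or 0 if CP is not a valid scalar value.
   NOTE: hypothetical helper for illustration only.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;                   /* out of range, or a lone surrogate */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;   /* BMP: one unit is enough */
      return 1;
    }
  cp -= 0x10000;                /* 20 bits remain */
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* low surrogate */
  return 2;
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) needs two units, whereas
U+20AC (EURO SIGN) needs one.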

Now I've come to the conclusion that it's pointless on native Windows. The
reason is that
  1) native Windows provides no locales (in the ISO C90 sense) that support
     non-BMP characters,
  2) the use of the 'char *' data type for strings is based on such locales,
  3) the programs for which gnulib is meant to be used are based on 'char *'
     strings and ISO C99 APIs.

In detail:

The documentation of the setlocale function in msvcrt [1] mentions

    "The set of available languages, country/region codes, and code
     pages includes all those supported by the Win32 NLS API except
     code pages that require more than two bytes per character, such
     as UTF-7 and UTF-8. If you provide a code page like UTF-7 or
     UTF-8, setlocale will fail, returning NULL."

This coincides with my experiments on Windows XP:
  - For code pages that require MB_CUR_MAX <= 2, Windows msvcrt
    supports such locales, e.g.
      Japanese_Japan.932
      Chinese_Taiwan.950
      Chinese_China.936
    See [2] for a more complete list.
  - The only widely used encodings with MB_CUR_MAX > 2 are UTF-8 and GB18030.
    Attempts to use setlocale with a codepage of 54936 (= GB18030)
    or 65001 (= UTF-8) fail, although the functions
    MultiByteToWideChar and WideCharToMultiByte do support codepage
    65001. [3][4]
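
An experiment of this kind can be reproduced with a small probe along the
following lines. This is a minimal sketch, not the code used for the tests
above: probe_locale and report_probes are hypothetical names, and the
".54936" and ".65001" strings assume the ".code_page" form of locale name
that msvcrt's setlocale accepts.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Try to switch LC_ALL to the locale NAME.
   Returns MB_CUR_MAX of that locale if setlocale succeeds, or 0 if it
   returns NULL.  The process is left in the "C" locale afterwards.
   NOTE: hypothetical helper for illustration only.  */
int
probe_locale (const char *name)
{
  int max_bytes = 0;
  if (setlocale (LC_ALL, name) != NULL)
    max_bytes = (int) MB_CUR_MAX;
  setlocale (LC_ALL, "C");
  return max_bytes;
}

/* Probe the locale names discussed above and print one line per name.
   The last two names are assumed spellings for GB18030 and UTF-8.  */
void
report_probes (void)
{
  static const char *const names[] =
    {
      "Japanese_Japan.932",
      "Chinese_Taiwan.950",
      "Chinese_China.936",
      ".54936",
      ".65001"
    };
  size_t i;

  for (i = 0; i < sizeof names / sizeof names[0]; i++)
    {
      int max_bytes = probe_locale (names[i]);
      if (max_bytes > 0)
        printf ("%-20s ok, MB_CUR_MAX = %d\n", names[i], max_bytes);
      else
        printf ("%-20s setlocale returned NULL\n", names[i]);
    }
}
```

Calling report_probes () from main on the system under test prints which
names are accepted. On platforms other than native Windows these names are
generally unknown, so every probe is expected to report NULL there.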

In contrast, Windows also supports locales at the Win32 level [5][6].
But this page [7] says:

  "New Windows applications should use Unicode to avoid the
   inconsistencies of varied code pages ..."

and

  "Your application can convert between Windows code pages and OEM code
   pages using the standard C runtime library functions. However, use
   of these functions presents a risk of data loss because the characters
   that can be represented by each code page do not match exactly."

Similarly, [8] says:

  "the system ACP might not cover all code points in the user's
   selected logon language identifier. For compatibility with this
   edition, your application should avoid calls that depend on GetACP
   either implicitly or explicitly, as this function can cause some
   locales to display text as question marks. Instead, the application
   should use the Unicode API functions directly ..."

See also [9].

In summary, Microsoft added support for UTF-8 and GB18030 to the
Win32 API, but there are no (and will likely be no) locales at the
setlocale() level that support UTF-8 or GB18030. They are basically
saying "stop using ANSI or OEM code pages because they are too
limited", with the implicit consequence "use API where strings are
'wchar_t *', and stop using API where strings are 'char *'".

It's obvious that Unix programs will continue to use 'char *'.

So there's no point in gnulib trying to support non-BMP characters
on native Windows.

Bruno

[1] http://msdn.microsoft.com/en-us/library/x99tb11d.aspx
[2] http://docs.moodle.org/en/Table_of_locales
[3] http://msdn.microsoft.com/en-us/library/dd319072%28v=VS.85%29.aspx
[4] http://msdn.microsoft.com/en-us/library/dd374130%28v=VS.85%29.aspx
[5] http://msdn.microsoft.com/en-us/library/dd318661%28v=VS.85%29.aspx
[6] http://msdn.microsoft.com/en-us/library/dd318716%28v=VS.85%29.aspx
[7] http://msdn.microsoft.com/en-us/library/dd317752%28v=VS.85%29.aspx
[8] http://msdn.microsoft.com/en-us/library/dd318070%28v=VS.85%29.aspx
[9] http://blogs.msdn.com/b/michkap/archive/2007/07/11/3823291.aspx

-- 
In memoriam Alexander Samoylovich 
<http://en.wikipedia.org/wiki/Alexander_Samoylovich>


