Re: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe
From: Keith Marshall
Subject: Re: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32
Date: Tue, 24 Apr 2007 19:53:01 +0100
User-agent: KMail/1.8.2
On Monday 23 April 2007 23:00, Bruno Haible wrote:
> There are two issues:
>
> 1) The approach used by libiconv for converting from/to wchar_t.
> Since the ISO C 99 standard does not "define" the representation of a
> wchar_t, the default approach is to convert through the locale
> encoding: wchar_t <--> char = locale encoding <--> other encoding.
> When one knows that the wchar_t encoding is Unicode, libiconv can
> convert directly. I didn't think about this case for Woe32 (since the
> main porting targets are Unix systems). I'm now applying the appended
> patch. It's low risk (except that it would be useful to know whether
> the Woe32 wchar_t[] encoding is really UCS-2 or UTF-16).
I can't find any definitive statement from Microsnot on this; I did find
some presentations and blogs on microsoft.com, which *suggest* that
some versions of Woe32 are UCS-2, and some (newer) ones are UTF-16, but
they lack consistency, and none says explicitly which versions of
MSVCRT implement which standard for wchar_t. It's probably safest to
assume UCS-2, for the base version preferred by MinGW.
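
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal sketch of what a UTF-16 reader has to do that a UCS-2 reader does not: combine a surrogate pair into one code point. The function name `utf16_decode' is hypothetical, and nothing here is taken from libiconv or MSVCRT; it only illustrates the encoding rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s[0..n), treating the 16-bit units as UTF-16.
   Returns the number of units consumed (1 or 2), or 0 if the input is empty
   or starts with an unpaired surrogate.  A UCS-2 reader would always consume
   exactly one unit and could not represent anything above U+FFFF. */
static size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    uint16_t hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {
        /* Basic Multilingual Plane: identical in UCS-2 and UTF-16. */
        *out = hi;
        return 1;
    }
    if (hi <= 0xDBFF && n >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        /* High surrogate followed by low surrogate: one supplementary
           character, which only UTF-16 can express. */
        *out = 0x10000
               + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }
    return 0;   /* lone or misordered surrogate */
}
```

For text confined to the BMP, which covers ISO-8859-2 and CP1252 entirely, the two interpretations agree, so assuming UCS-2 loses nothing for the conversions under discussion.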
> 2) The fact that on your system, the locale encoding for a Slovenian
> locale is CP1252.
No, I think you've misunderstood; on my system, the locale encoding is
CP1252, for "English_United Kingdom". I offered Slovenian as one
example, of many I could have chosen, to explore and illustrate the
problem I had observed.
> > The language is Slovenian, (although that choice is arbitrary),
> > the codeset is ISO-8859-2, and my woe32 box is configured with a
> > system code page, (which I don't have authority to change), of
> > CP1252. ...
> > `locale_charset' does
> >
> > #elif defined WIN32_NATIVE
> >
> > static char buf[2 + 10 + 1];
> >
> > /* Woe32 has a function returning the locale's
> > codepage as a number. */
> > sprintf (buf, "CP%u", GetACP ());
> > codeset = buf;
> >
> > which results in `tocode' being reassigned as `CP1252'; this
> > seems somehow perverse
>
> Indeed. I don't think you will get very far in such a locale. The
> use of "char *" to denote strings in locale-dependent encodings is
> pervasive in Unix and GNU software.
All I'm trying to do is parse an arbitrary byte stream, to step over
multibyte groupings in any arbitrary input encoding, exactly as Ulrich
Drepper does, in his `gencat' implementation to accompany the glibc
implementation of `catgets'. I'm using libiconv to achieve this, even
for input codesets for which the system lacks a prepared code page.
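
For illustration, the stepping technique can be sketched roughly as below. This is not the gencat code, nor my actual implementation; the helper names `mb_char_len' and `count_mb_chars' are hypothetical. It assumes an iconv() which reports an incomplete trailing sequence as EINVAL, as both glibc and libiconv do, and it resets the descriptor before each attempt, so it is only a sketch for encodings without persistent shift states.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Length in bytes of the multibyte character at the start of buf[0..len),
   as judged by the converter cd; 0 if the bytes are invalid or incomplete. */
static size_t mb_char_len(iconv_t cd, const char *buf, size_t len)
{
    assert(cd != (iconv_t)-1 && buf != NULL);
    for (size_t n = 1; n <= len; n++) {
        char out[16];
        char *inptr = (char *)buf;      /* iconv() does not modify the input */
        char *outptr = out;
        size_t inleft = n, outleft = sizeof out;

        iconv(cd, NULL, NULL, NULL, NULL);          /* back to initial state */
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) != (size_t)-1)
            return n;                   /* n bytes form one complete character */
        if (errno != EINVAL)
            return 0;                   /* EILSEQ or similar: give up */
        /* EINVAL: sequence incomplete, so retry with one more byte */
    }
    return 0;
}

/* Count the characters in buf[0..len), encoded in `fromcode';
   returns -1 if the converter cannot be opened or the input is malformed. */
static long count_mb_chars(const char *fromcode, const char *buf, size_t len)
{
    iconv_t cd = iconv_open("WCHAR_T", fromcode);
    if (cd == (iconv_t)-1)
        return -1;

    long count = 0;
    for (size_t pos = 0; pos < len; count++) {
        size_t n = mb_char_len(cd, buf + pos, len - pos);
        if (n == 0) {
            count = -1;
            break;
        }
        pos += n;
    }
    iconv_close(cd);
    return count;
}
```

Note that the target is "WCHAR_T", which is exactly the conversion that failed on Woe32 before the patch: there, a character valid in the input codeset but absent from CP1252 would be rejected, even though the locale encoding is irrelevant to the task.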
> I believe the installation of (proprietary) "language packs", such
> as for Hungarian, will allow you to get a locale with GetACP() =
> CP1250.
And this is precisely what I'm trying to avoid.
> > begs a couple of questions:--
> >
> > 1a) If neither the `fromcode' nor the `tocode' is related to
> > the current locale, why do we care what codeset is used
> > in this locale? What is the rationale for this change
> > of `tocode' to the codeset mapped for `GetACP'?
>
> In general, the wchar_t representation is locale dependent. Examples
> are Solaris and FreeBSD.
Ok, understood.
> > 1b) Since `mbrtowc' functions in the context of the process'
> > active LC_CTYPE, which doesn't even necessarily match the
> > codeset from `GetACP', (it is more likely to simply be the
> > "C" locale's portable character set), what is the rationale
> > for even considering its use in this conversion context?
> > Surely, it is unlikely to be appropriate.
>
> There is a fundamental assumption between mbrtowc and
> locale_charset(): the "char *" strings that are the input to mbrtowc
> are supposed to be encoded in locale_charset(). On Woe32, the MSVCRT
> library's implementation of mbtowc uses
> MultiByteToWideChar(__lc_codepage,...), where __lc_codepage is set by
> setlocale().
Yes, and it seems that what is set by setlocale() may not be reflected
in what is returned by GetACP().
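
A rough way to observe the mismatch is to compare the code page suffix of the LC_CTYPE locale name with GetACP(). The sketch below assumes that setlocale() on Woe32 returns names of the form "Language_Country.codepage", such as "English_United Kingdom.1252"; that is my understanding of MSVCRT's behaviour rather than something I have verified for every version. Both function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Extract the numeric code page from a locale name such as
   "English_United Kingdom.1252".  Returns 0 when the name carries no
   purely numeric suffix (e.g. "C", or a POSIX name like "en_GB.UTF-8"). */
static unsigned codepage_from_locale_name(const char *name)
{
    assert(name != NULL);
    const char *dot = strrchr(name, '.');
    if (dot == NULL || !isdigit((unsigned char)dot[1]))
        return 0;

    char *end;
    unsigned long cp = strtoul(dot + 1, &end, 10);
    return (*end == '\0') ? (unsigned)cp : 0;
}

#ifdef _WIN32
#include <locale.h>
#include <windows.h>

/* Nonzero if the C runtime's current LC_CTYPE code page equals the system
   ANSI code page.  In the default "C" locale this reports a mismatch,
   because the locale name has no code page suffix at all. */
static int crt_codepage_matches_acp(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);
    unsigned cp = (name != NULL) ? codepage_from_locale_name(name) : 0;
    return cp != 0 && cp == GetACP();
}
#endif
```

A program which never calls setlocale() stays in the "C" locale, so mbrtowc and GetACP() need not agree about the encoding of "char *" strings.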
> > 2b) ... and is followed by
> >
> > outcount = cd->ofuncs.xxx_wctomb(cd,outptr,wc,outleft);
> > if (outcount != RET_ILUNI)
> > goto outcount_ok;
> >
> > which invokes `cp1252_wctomb', on the code returned from
> > `iso8859_2_mbtowc'; in this case, the return value is not
> > RET_ILUNI
>
> Yes, you cannot get very far when you try to use Slovenian strings in
> a locale whose encoding is CP1252.
Yes, but for my purposes, I can get far enough after applying the patch
under discussion.
> > Now, observing that my GNU/Linux implementation of GCC *does*
> > define `__STDC_ISO_10646__', whereas the MinGW implementation
> > *does* *not*, suggests a possible work around for the failing
> > conversion on woe32; by arranging to have this symbol defined, with
> > any non-zero value
>
> Yes, this provides a workaround, limited to libiconv. I prefer to not
> define __STDC_ISO_10646__, because 'wchar_t' is only 16 bits, and
> ISO-10646 consists of many more than 65536 characters.
Ok. I suggested setting __STDC_ISO_10646__, simply because it seemed
less invasive than the alternative. I'd also considered something very
similar to what you've implemented. Either achieves exactly the same
effect, so I'm happy to adopt your preferred format. I'll roll a new
mingwPORT, with that included.
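
For anyone following along, the shape of the two alternatives is roughly as sketched below. This is not the patch itself, which I have not reproduced here; the macro name WCHAR_T_IS_UNICODE and the function `wchar_units_needed' are hypothetical, and the platform test is only my assumption of how a native Woe32 build might be detected.

```c
#include <assert.h>
#include <stdint.h>
#include <wchar.h>

/* Decide at compile time whether wchar_t values may be treated as Unicode,
   without claiming full ISO 10646 coverage by defining __STDC_ISO_10646__. */
#if defined __STDC_ISO_10646__
# define WCHAR_T_IS_UNICODE 1       /* the C library promises it */
#elif (defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__
# define WCHAR_T_IS_UNICODE 1       /* native Woe32: 16-bit Unicode units */
#else
# define WCHAR_T_IS_UNICODE 0       /* go through the locale encoding */
#endif

/* Number of wchar_t units needed for code point cp when wchar_t holds
   Unicode: 2 only when wchar_t is 16 bits wide and cp lies outside the BMP,
   which is why a 16-bit wchar_t cannot honestly claim all of ISO 10646. */
static int wchar_units_needed(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (sizeof(wchar_t) == 2 && cp > 0xFFFF) ? 2 : 1;
}
```

Testing a separate condition keeps the effect local to libiconv, whereas defining __STDC_ISO_10646__ would also be visible to any other code which tests that symbol.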
> > I'm less certain in the DJGPP case
>
> DJGPP has an entirely different libc. It doesn't have wchar_t
> functions at all, IIRC. Don't waste your brain cycles on it:
Oh, I wasn't planning to; I just wanted to point out that I don't have
any experience with DJGPP, so wasn't prepared to attest to expected
behaviour on that platform, even though I'd simply copied and pasted a
conditional test incorporating it, from elsewhere in the file.
> DJGPP is not a porting target any more nowadays.
Understood.
Thanks,
Keith.