[bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32
From: Keith Marshall
Subject: [bug-gnu-libiconv] Bug: Codeset to wchar_t fails unexpectedly on Woe32
Date: Sun, 22 Apr 2007 14:52:19 +0100
User-agent: KMail/1.8.2
[Report to bug-gnu-libiconv; copy to MinGW-users for info]
I've built libiconv-1.11 on woe32, using the MinGW build of
gcc-3.4.5, and the MSYS build tools from the MinGW project.
The good news is that it builds OOTB, and `make check' appears
to complete successfully, (although it would be nice if the
result of each test was confirmed by printing `ok' for each
successful outcome).
The bad news is that the implementation appears to be broken
WRT codeset to wchar_t conversions, which incorrectly report
EILSEQ errors when codeset != active system code page.
What follows is a fairly extensive (and quite long) analysis
of the problem. I believe I have identified a possible
workaround, although not a definitive solution, and would
welcome comments.
Here's an example, just one of many, taken from a parse of
the message catalogue sources provided with man-1.6; (I'm not
familiar with the language here; it just happens to be the
snippet around the first point of failure, in the residual
intermediate file left over from a successful build of the
entire set of available catalogues, on my GNU/Linux box).
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <iconv.h>
#include <errno.h>
#include <stdio.h>

#ifndef ICONV_CONST
#define ICONV_CONST
#endif
#define ICONV_CAST ICONV_CONST char **

int main( void )
{
  char *inptr;
  iconv_t mc = iconv_open( "wchar_t", "iso-8859-2" );
  /* "ni mo\xe8 odpreti\n", with the ISO-8859-2 byte 0xe8
     (LATIN SMALL LETTER C WITH CARON, `č') spelled out as a hex
     escape, so the test does not depend on the encoding in which
     this source file happens to be saved.  */
  const char *input_string = "ni mo\xe8 odpreti\\n";
  int inlen = strlen( input_string );
  wchar_t conv;
  size_t skip;

  for( inptr = (char *)(input_string); inlen > 0; inptr += skip )
  {
    char *ptr = inptr;
    size_t convlen = sizeof( conv );
    wchar_t *convptr = &conv;
    size_t probe = 0;
    /* Offer one byte at a time, widening the window on EINVAL,
       to find the shortest convertible input sequence.  */
    do { size_t try = ++probe;
         skip = iconv( mc, (ICONV_CAST)(&ptr), &try,
             (char **)(&convptr), &convlen );
       }
    while( (skip == (size_t)(-1)) && (errno == EINVAL)
        && (probe < (size_t)(inlen)) );
    if( skip == (size_t)(-1) )
      perror( "iconv" );
    skip = (ptr == inptr) ? (size_t)(1) : ptr - inptr;
    inlen -= (int)(skip);
  }
  iconv_close( mc );
  return 0;
}
The language is Slovenian, (although that choice is arbitrary),
the codeset is ISO-8859-2, and my woe32 box is configured with a
system code page, (which I don't have authority to change), of
CP1252. The sample text, defined as `input_string', appears
near the end of the second message defined in the `mess.sl'
file in the man-1.6 distribution, and the fault occurs at the
sixth byte of that string.
I've built libiconv with CFLAGS='-g -O0', and compiled the above
sample code with
gcc -g -O0 -otestcase -DICONV_CONST=const testcase.c -liconv
so I can trace it effectively in GDB, where I observe:--
1) In `iconv_open', the `tocode' is initially (correctly)
identified as `wchar_t'; this causes the invocation of
#if HAVE_MBRTOWC
    to_wchar = 1;
    tocode = locale_charset();
    continue;
#endif
and `locale_charset' does
#elif defined WIN32_NATIVE
  static char buf[2 + 10 + 1];
  /* Woe32 has a function returning the locale's codepage as a number.  */
  sprintf (buf, "CP%u", GetACP ());
  codeset = buf;
which results in `tocode' being reassigned as `CP1252'; this
seems somehow perverse, and raises a couple of questions:--
1a) If neither the `fromcode' nor the `tocode' is related to
the current locale, why do we care what codeset is used
in this locale? What is the rationale for this change
of `tocode' to the codeset mapped for `GetACP'?
1b) Since `mbrtowc' functions in the context of the process'
active LC_CTYPE, which doesn't even necessarily match the
codeset from `GetACP', (it is more likely to simply be the
"C" locale's portable character set), what is the rationale
for even considering its use in this conversion context?
Surely, it is unlikely to be appropriate.
2) In `iconv', (actually `libiconv'), for each of the first five
bytes of the sample text, a successful conversion is obtained,
with the result being the zero extended wchar_t representation,
with identical numeric value to the original byte; conversion
is correctly achieved in `iso8859_2_mbtowc' invoked indirectly
by `unicode_loop_convert' via `wchar_to_loop_convert'.
On return from `iso8859_2_mbtowc', in `unicode_loop_convert',
I then see:
2a) A check, to confirm that the conversion has not overrun
an internal buffer; this succeeds...
2b) ... and is followed by
outcount = cd->ofuncs.xxx_wctomb(cd,outptr,wc,outleft);
if (outcount != RET_ILUNI)
goto outcount_ok;
which invokes `cp1252_wctomb', on the code returned from
`iso8859_2_mbtowc'; in this case, the return value is not
RET_ILUNI, and control transfers to `outcount_ok', but to
call `cp1252_wctomb' in this context does seem somewhat
dubious, for the reason given in (3b) and (3c) below.
2c) Following the jump to `outcount_ok', control returns to
`wchar_to_loop_convert', where, after a validity check,
I see:
/* Successful conversion. */
size_t bufcount = bufptr-buf; /* = BUF_SIZE-bufleft */
mbstate_t state = wcd->state;
wchar_t wc;
res = mbrtowc(&wc,buf,bufcount,&state);
if (res == (size_t)(-2)) ...
which also seems questionable. It clearly is trying to
check if the current input byte is a possible lead byte
in a multibyte sequence, but by using `mbrtowc', it is
doing so WRT a codeset which may not match the input,
(and does not, in the case in question); thus, is the
result valid, or in any way useful?
3) Action (2) repeats, successfully converting each of the first
five bytes of `input_string', before arriving at the sixth byte,
(the accented `č', with input code `0xe8'), which is subjected to
the same sequence of conversions as above, with the results:
3a) `iso8859_2_mbtowc' returns a wchar_t conversion, with a
value of `0x10d'; AFAICT, this is the correct value, and
it is completely valid, in the input codeset.
3b) As in (2b), `unicode_loop_convert' passes this `0x10d'
code to `cp1252_wctomb', which recoils in horror, can't
find a suitable representation in CP1252, and returns
RET_ILUNI.
3c) `unicode_loop_convert' now inspects the return code from
`cp1252_wctomb', sees it was RET_ILUNI, and (incorrectly)
decrees that the input byte was invalid; (it wasn't, but it
definitely seems that the test performed on it was). At
this point, `unicode_loop_convert' gives up, sets `errno'
to EILSEQ, immediately returns (size_t)(-1), and it's
"Goodnight Vienna".
Now, if I repeat all of the above, but on my GNU/Linux box, (Ubuntu
6.06 with GCC-4.0.3), I see completely different, and entirely more
reasonable behaviour. This system defines `__STDC_ISO_10646__', and
this code fragment, appearing in `iconv_open' immediately before the
fragment shown in (1)...
#if __STDC_ISO_10646__
    if (sizeof(wchar_t) == 4) {
      to_index = ei_ucs4internal;
      break;
    }
    if (sizeof(wchar_t) == 2) {
      to_index = ei_ucs2internal;
      break;
    }
    if (sizeof(wchar_t) == 1) {
      to_index = ei_iso8859_1;
      break;
    }
#endif
... prevents control from ever reaching those former questionable
statements; consequently, the converter control struct is configured
differently, and instead of the sequence described in (2) and (3),
I now see:--
4) Instead of invoking `unicode_loop_convert' indirectly, by way
of a call to `wchar_to_loop_convert', `iconv' now passes control
directly to `unicode_loop_convert', where:--
4a) `iso8859_2_mbtowc' is again called, to get the wchar_t code
for the input byte.
4b) A similar check to that of (2a) is performed, then...
4c) ... we again progress to the
outcount = cd->ofuncs.xxx_wctomb(cd,outptr,wc,outleft);
if (outcount != RET_ILUNI)
goto outcount_ok;
step; however, on this occasion `cd->ofuncs.xxx_wctomb' is
mapped, not to anything associated with the current locale,
but to `ucs4internal_wctomb'. This has no problem with the
wide character code generated by (4a), even for the case
which is the analogue of (3b), and all is well.
Now, observing that my GNU/Linux implementation of GCC *does* define
`__STDC_ISO_10646__', whereas the MinGW implementation *does* *not*,
suggests a possible work around for the failing conversion on woe32;
by arranging to have this symbol defined, with any non-zero value,
either by patching MinGW's own `_mingw.h', (which works around the
problem only for MinGW builds), or, (for a slightly more general woe32
or MS-DOS solution), by an `#ifdef _WIN32' guarded conditional
definition within the libiconv source, e.g.
--- old/libiconv-1.11/lib/iconv.c	2006-01-23 13:16:12.000000000 +0000
+++ new/libiconv-1.11/lib/iconv.c	2007-04-22 14:05:09.000000000 +0100
@@ -18,6 +18,13 @@
  * Fifth Floor, Boston, MA 02110-1301, USA.
  */
 
+#if !defined(__STDC_ISO_10646__) \
+ && ((defined(_WIN32) && (defined(_MSC_VER) || defined(__MINGW32__))) \
+     || defined(__DJGPP__) \
+    )
+# define __STDC_ISO_10646__ 200009L
+#endif
+
 #include <iconv.h>
 #include <stdlib.h>
This causes the behaviour on woe32 to follow the GNU/Linux
behaviour much more closely, with `ucs2internal_wctomb' substituted
for the `ucs4internal_wctomb' call of the GNU/Linux case; the valid
code sequence of the above example, and those of the many other
examples found in the same set of man-1.6 message catalogue
sources, now pass correctly through the converter.
I'm less certain in the DJGPP case, but I think this is a
reasonable workaround for the woe32 cases, since AIUI the `wchar_t'
of woe32 is UCS-2 for versions up to w2K, and UTF-16 for wXP and
later, both of which are conformant with ISO-10646. Of course, it
doesn't help in any more general case, where the potential
reference to the locale charset established in (1) still seems
dubious.
Regards,
Keith.