From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Skipping over EILSEQ & EINVAL errors?
Date: Tue, 16 Sep 2008 01:49:10 +0200
User-agent: KMail/1.5.4
Hi,
address@hidden wrote:
> Can anyone explain why the iconv binary can successfully skip over bad
> characters in the input (EILSEQ & EINVAL errors), but the libiconv
> conversion function cannot?
Good question ;-) The answer is: POSIX specifies the iconv program one
way [1], and the iconv() function another way [2].
It is correct that you cannot easily implement the skipping over bad
input, as required for the iconv program, with the iconv() function.
GNU libc has a different internal API that allows this (gconv), and GNU
libiconv another internal API (iconvctl).
But the gnulib module 'striconveh' [3] contains portable code for error
handling with iconv. It supports three error handlers:
  /* Handling of unconvertible characters.  */
  enum iconv_ilseq_handler
  {
    iconveh_error,           /* return and set errno = EILSEQ */
    iconveh_question_mark,   /* use one '?' per unconvertible character */
    iconveh_escape_sequence  /* use escape sequence \uxxxx or \Uxxxxxxxx */
  };
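To illustrate what the third handler's output looks like, here is a small sketch of the two escape forms. It is not the gnulib code: format_escape is a hypothetical helper, and the lowercase hex digits are an assumption (the real module may well use a different case).

```c
#include <stdio.h>

/* Hypothetical helper, NOT part of gnulib: writes the escape form
   described for iconveh_escape_sequence into buf.
   Code points below 0x10000 get \uxxxx, all others \Uxxxxxxxx.
   Lowercase hex digits are an assumption of this sketch.
   Returns what snprintf returns (the length of the escape). */
int format_escape(unsigned long cp, char *buf, size_t buflen)
{
    if (cp < 0x10000UL)
        return snprintf(buf, buflen, "\\u%04lx", cp);
    return snprintf(buf, buflen, "\\U%08lx", cp);
}
```

So a character in the Basic Multilingual Plane takes 6 bytes in the output and anything beyond it takes 10.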
> I've been trying to use libiconv to convert CJK files into UTF-8.
>
> I've noticed that when I run something like this (using the iconv
> binary, from a command line):
>
> /usr/bin/iconv -f gb2312 [chinese language file encoded as gb2312]
This error often happens when the input is not in GB2312, but in related
encodings such as GBK or GB18030 [4].
> In the code example
> (http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html),
> hitting an EILSEQ or EINVAL error is cause for stopping processing.
That's only because it's meant to be a _simple_ example :-)
> In my own code, I've tried to lseek forward if I get either of those
> errors, but the iconv function gives no indication of how large the bad
> input is, or where the next "clean" byte is.
Yes, this error handling can be tricky. In particular, skipping just 1 byte
in UTF-16 or UTF-32 encoded input is probably a bad idea.
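For what it's worth, here is a minimal sketch of how skipping can be done with plain iconv(). It relies on the fact that when iconv() fails with EILSEQ, the input pointer is left at the start of the offending sequence, so the caller can step over it and resume. The function name convert_lossy and the `unit` parameter are made up for this example; writing a literal '?' assumes the target encoding is ASCII-compatible (such as UTF-8), and the sketch does not flush shift state, so it is not suitable for stateful target encodings.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not from libiconv or gnulib).
   Converts `inlen` bytes at `in` through `cd` into `out`, writing one
   '?' for every rejected input unit and then resuming.
   `unit` is the number of bytes to skip per bad sequence: 1 for
   byte-oriented encodings, 2 for UTF-16, 4 for UTF-32.
   Returns the number of bytes written, or (size_t)-1 with errno set
   (E2BIG if `out` is too small). */
size_t convert_lossy(iconv_t cd, const char *in, size_t inlen,
                     char *out, size_t outlen, size_t unit)
{
    char *inp = (char *)in;   /* iconv() takes char **, but does not modify the input */
    char *outp = out;
    size_t inleft = inlen, outleft = outlen;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                      /* all input was consumed */

        if (errno == EILSEQ) {
            /* inp now points at the start of the invalid sequence. */
            size_t skip = unit < inleft ? unit : inleft;
            if (skip == 0)
                skip = 1;               /* guard against unit == 0 */
            if (outleft == 0) {
                errno = E2BIG;
                return (size_t)-1;
            }
            *outp++ = '?';              /* assumes ASCII-compatible output */
            outleft--;
            inp += skip;
            inleft -= skip;
        } else if (errno == EINVAL) {
            break;                      /* truncated sequence at end of input: drop it */
        } else {
            return (size_t)-1;          /* E2BIG or another hard error */
        }
    }
    return (size_t)(outp - out);
}
```

Note that skipping a fixed number of bytes is only a heuristic: in a multibyte encoding such as GB2312 the next byte may be the tail of the bad character, in which case it is rejected (and replaced by '?') in turn.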
Bruno
[1] http://www.opengroup.org/susv3/utilities/iconv.html
[2] http://www.opengroup.org/susv3/functions/iconv.html
[3] http://www.gnu.org/software/gnulib/MODULES.html#module=striconveh
[4] http://www.haible.de/bruno/charsets/conversion-tables/Chinese.html