[bug-gnulib] Re: iconv made easy

From: Simon Josefsson
Subject: [bug-gnulib] Re: iconv made easy
Date: Mon, 13 Dec 2004 21:26:08 +0100
User-agent: Gnus/5.110003 (No Gnus v0.3) Emacs/21.3.50 (gnu/linux)

Bruno Haible <address@hidden> writes:

> Your function returns a memory block that is often way too large and thus
> wastes memory.

Yes, fixing that would be good.

> In many applications, I guess, the (to_codeset, from_codeset) pair will be
> the same for many strings. Therefore I think it's worth providing a function
> that takes an iconv_t and doesn't need to iconv_open/iconv_close.


>> I know the function doesn't handle embedded ASCII #0
> iconv() handles NUL bytes correctly; you don't need to handle them specially.

How would you find out the length of the returned string?

What I meant was that if you convert a string from ASCII to some encoding
that uses 0 for some other character (I recall some mobile phone
encodings where '\0' meant the '@' character), then you can't find out
the output string length.

Consequently, for completeness, perhaps there should be an API like:

char *iconv_length (const char *from_codeset, const char *to_codeset,
                    const char *input, size_t inlen, size_t *outlen);

Or something like that.  I have no use for this, so I'd rather avoid
implementing something that wouldn't be tested immediately.

Or is that interface not ever useful?
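
To make the length-returning idea concrete, here is a minimal sketch of
such an interface.  The name iconv_length_sketch, the initial buffer
guess and the doubling strategy are assumptions made for illustration;
they are not taken from any posted code.  It uses only iconv_open,
iconv and iconv_close, and reports the output size through *outlen so
that NUL bytes in the result do no harm.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical sketch: convert INLEN bytes of INPUT and store the number
   of output bytes in *OUTLEN.  Returns a malloc'd buffer that the caller
   frees, or NULL with errno set.  */
char *
iconv_length_sketch (const char *from_codeset, const char *to_codeset,
                     const char *input, size_t inlen, size_t *outlen)
{
  assert (outlen != NULL);

  iconv_t cd = iconv_open (to_codeset, from_codeset);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t size = inlen + 16;      /* arbitrary first guess, grown on E2BIG */
  size_t used = 0;
  char *out = malloc (size);
  char *inp = (char *) input;    /* iconv() takes char **, not const char ** */
  size_t inleft = inlen;
  int flushing = 0;              /* set once all input has been consumed */
  int err = out != NULL ? 0 : ENOMEM;

  while (err == 0)
    {
      char *outp = out + used;
      size_t outleft = size - used;
      size_t rc = flushing
        ? iconv (cd, NULL, NULL, &outp, &outleft)     /* emit final shift state */
        : iconv (cd, &inp, &inleft, &outp, &outleft);
      used = (size_t) (outp - out);

      if (rc != (size_t) -1)
        {
          if (flushing)
            break;               /* conversion and flush both done */
          flushing = 1;
        }
      else if (errno == E2BIG)
        {
          /* Output buffer full: double it and resume where iconv stopped.  */
          char *tmp = realloc (out, size * 2);
          if (tmp == NULL)
            err = ENOMEM;
          else
            {
              out = tmp;
              size *= 2;
            }
        }
      else
        err = errno;             /* EILSEQ or EINVAL: give up */
    }

  iconv_close (cd);
  if (err != 0)
    {
      free (out);
      errno = err;
      return NULL;
    }
  *outlen = used;
  return out;
}
```

A caller would then use *outlen instead of strlen on the result.  The
buffer can still be larger than needed; a final realloc down to *outlen
would address the wasted memory.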

>> I'm thinking of another API, 'iconv_lz', that would take
>> zero-terminated strings in the locale's code set, and convert them to
>> a specified code set.  But that would need nl_langinfo(CODESET) so it
>> wouldn't be thread safe
> You have a strange notion of "thread safe". nl_langinfo(CODESET) can be
> called in multiple threads simultaneously without locking. It's only
> setlocale() in other threads that can disturb nl_langinfo(CODESET).
> Therefore IMO it's setlocale() which is not MT-safe.

Is this true?  POSIX says:

   The nl_langinfo() function need not be reentrant. A function that
   is not required to be reentrant is not required to be thread-safe.

I have taken that to mean that you shouldn't call nl_langinfo from
threaded code, if you can't guarantee that other threads won't call it
at the same time.  And in a library, without mutexes, you can't.

I have been planning to change my projects so that a library never
calls nl_langinfo (CODESET) itself.  Instead, the application calls
it and passes the result down into the library, together with the
strings it provides to the library.  Alas, I have had more important
things to do and have not had time to fix this.
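
The split described above could look roughly like the following sketch.
Both function names are hypothetical, and copying the codeset string is
a precaution chosen here: POSIX allows later nl_langinfo or setlocale
calls to overwrite the returned string.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical library entry point: the caller says which codeset its
   strings are in, so the library never calls nl_langinfo() itself.  */
iconv_t
lib_open_converter (const char *to_codeset, const char *app_codeset)
{
  assert (app_codeset != NULL);
  return iconv_open (to_codeset, app_codeset);
}

/* Hypothetical application helper: query the locale's codeset once,
   ideally before any threads are started, and return a private copy.
   Returns NULL if memory is exhausted; the caller frees the result.  */
char *
app_query_codeset (void)
{
  const char *cs = nl_langinfo (CODESET);
  char *copy = malloc (strlen (cs) + 1);
  if (copy != NULL)
    strcpy (copy, cs);
  return copy;
}
```

The application would call setlocale (LC_ALL, "") and then
app_query_codeset () at startup, and hand the copy to every library
call that needs it.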

> Appended you find two alternative codes, taken from gettext (and tested for
> 4 years), and another one, taken from my unfinished libunistring - more
> powerful but untested.

Is the first one LGPL?

Is libunistring available somewhere?

> You will notice that there are two approaches to converting a string:
> a) allocate an initial buffer and extend it as needed, stopping and
>    restarting iconv() each time a realloc is needed,
> b) call iconv() once to determine the length and then once again for
>    filling the result string.
> I never measured which one is more efficient; probably it will also
> depend on the iconv implementation (glibc is noticeably faster for
> large strings than for small ones) and on the initial buffer size in
> case a).

I generally don't worry about efficiency, so I don't have an opinion
on this.
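
For readers comparing the two approaches, here is a minimal sketch of
approach b) that also takes an already opened iconv_t, so that one
descriptor can serve many strings.  The names run_pass and
convert_two_pass and the 256-byte scratch buffer are assumptions for
illustration, not the code that was appended to the quoted message.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Run one conversion pass over INPUT.  If DEST is NULL the output goes
   to a scratch buffer and is only counted.  Returns 0 on success, -1 on
   error (errno is set by iconv).  */
static int
run_pass (iconv_t cd, const char *input, size_t inlen,
          char *dest, size_t destlen, size_t *produced)
{
  char scratch[256];
  char *inp = (char *) input;   /* iconv() takes char **, not const char ** */
  size_t inleft = inlen;
  int flushing = 0;

  *produced = 0;
  iconv (cd, NULL, NULL, NULL, NULL);   /* reset to the initial shift state */

  for (;;)
    {
      char *base = dest != NULL ? dest + *produced : scratch;
      size_t room = dest != NULL ? destlen - *produced : sizeof scratch;
      char *outp = base;
      size_t outleft = room;
      size_t rc = flushing
        ? iconv (cd, NULL, NULL, &outp, &outleft)
        : iconv (cd, &inp, &inleft, &outp, &outleft);
      *produced += (size_t) (outp - base);

      if (rc != (size_t) -1)
        {
          if (flushing)
            return 0;
          flushing = 1;          /* input consumed; flush on the next round */
        }
      else if (errno != E2BIG || dest != NULL)
        return -1;               /* real error, or the exact-size buffer overflowed */
      /* E2BIG while counting: the scratch buffer filled up, so loop again.  */
    }
}

/* Hypothetical sketch of approach b): one pass to measure, one to fill.
   CD is opened and closed by the caller.  Returns a malloc'd buffer of
   exactly *OUTLEN bytes (at least 1 is allocated), or NULL on error.  */
char *
convert_two_pass (iconv_t cd, const char *input, size_t inlen,
                  size_t *outlen)
{
  size_t total, written;

  assert (outlen != NULL);
  if (run_pass (cd, input, inlen, NULL, 0, &total) < 0)
    return NULL;

  char *out = malloc (total > 0 ? total : 1);
  if (out == NULL)
    return NULL;

  if (run_pass (cd, input, inlen, out, total, &written) < 0
      || written != total)
    {
      free (out);
      return NULL;
    }
  *outlen = total;
  return out;
}
```

The result wastes no memory, at the price of converting every string
twice; whether that beats approach a) would have to be measured.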

> Which API do you find worth pursuing?

I haven't read them in detail yet.  I'll try to read them...  I guess
my only concerns would be thread safety, and whether the code depends
on many other gnulib modules.

One thing about the libunistring file that I didn't like was that it
uses locale_charset.  There may have been a time when the complexity
of that function was needed to get things to work in the real world,
but today I'd rather use nl_langinfo (CODESET) [unless a specification
argues there is a problem with that].

However, simply moving the locale_charset part of the API to a
separate module seems feasible, and would fix my concern.

Perhaps the locale_charset() module could be turned into a replacement
for nl_langinfo (CODESET), if a system doesn't have a working
nl_langinfo (CODESET), instead of providing a completely new API.

Btw, is nl_langinfo (CODESET) guaranteed to return character set
names that are understood by iconv, if the system provides both
functions?  I guess if the system doesn't have iconv natively, in
general you lose.  I'm not sure I understand what POSIX means by
"code set".  Character set and the encoding?

