[bug-gnulib] Re: iconv made easy
From: Simon Josefsson
Subject: [bug-gnulib] Re: iconv made easy
Date: Mon, 13 Dec 2004 21:26:08 +0100
User-agent: Gnus/5.110003 (No Gnus v0.3) Emacs/21.3.50 (gnu/linux)
Bruno Haible <address@hidden> writes:
> Your function returns a memory block that is often way too large and thus
> wastes memory.
Yes, fixing that would be good.
> In many applications, I guess, the (to_codeset, from_codeset) pair will be
> the same for many strings. Therefore I think it's worth providing a function
> that takes an iconv_t and doesn't need to iconv_open/iconv_close.
Sure.
>> I know the function doesn't handle embedded ASCII #0
>
> iconv() handles NUL bytes correctly; you don't need to handle them specially.
How would you find out the length of the returned string?
What I meant was that if you convert a string from ASCII to some encoding
that uses 0 for some other character (I recall some mobile phone
encodings where '\0' meant the '@' character), then you can't find out
the output string length.
Consequently, for completeness, perhaps there should be an API like:
  char *iconv_length (const char *from_codeset, const char *to_codeset,
                      const char *input, size_t inlen, size_t *outlen);
Or something like that. I have no use for this, so I'd rather avoid
implementing something that wouldn't be tested immediately.
Or is that interface not ever useful?
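
For illustration, here is a minimal sketch of how such a length-returning
interface could be implemented on top of iconv(). Only the prototype comes
from the proposal above; the body, the initial buffer size, and the
doubling strategy are assumptions, not tested gnulib code.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Sketch only: convert INLEN bytes of INPUT and return a malloc'd
   buffer holding *OUTLEN bytes.  The result is NOT NUL-terminated,
   because the target encoding may use zero bytes for real characters.
   Returns NULL on any error.  */
char *
iconv_length (const char *from_codeset, const char *to_codeset,
              const char *input, size_t inlen, size_t *outlen)
{
  iconv_t cd = iconv_open (to_codeset, from_codeset);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t size = inlen + 16;     /* assumed initial guess */
  size_t used = 0;
  char *out = malloc (size);
  char *inptr = (char *) input;
  size_t inleft = inlen;
  int flushing = 0;             /* 0: converting input, 1: final flush */

  while (out != NULL)
    {
      char *outptr = out + used;
      size_t outleft = size - used;
      size_t res;

      if (flushing)
        res = iconv (cd, NULL, NULL, &outptr, &outleft);
      else
        res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
      used = outptr - out;

      if (res != (size_t) -1)
        {
          if (flushing)
            break;              /* input consumed and state flushed */
          flushing = 1;
          continue;
        }
      if (errno != E2BIG)
        {
          free (out);           /* invalid or incomplete input */
          out = NULL;
          break;
        }
      /* Output buffer full: grow it and resume where iconv stopped.  */
      size *= 2;
      char *tmp = realloc (out, size);
      if (tmp == NULL)
        free (out);
      out = tmp;
    }

  iconv_close (cd);
  if (out != NULL)
    *outlen = used;
  return out;
}
```

Converting the two bytes "Hi" from ASCII to UTF-16LE with this sketch
yields four bytes, two of them zero, which is exactly the case where
strlen() on the result would give the wrong answer.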
>> I'm thinking of another API, 'iconv_lz', that would take
>> zero-terminated strings in the locale's code set, and convert them to
>> a specified code set. But that would need nl_langinfo(CODESET) so it
>> wouldn't be thread safe
>
> You have a strange notion of "thread safe". nl_langinfo(CODESET) can be
> called in multiple threads simultaneously without locking. It's only
> setlocale() in other threads that can disturb nl_langinfo(CODESET).
> Therefore IMO it's setlocale() which is not MT-safe.
Is this true? POSIX says:
  The nl_langinfo() function need not be reentrant. A function that
  is not required to be reentrant is not required to be thread-safe.
I have taken that to mean that you shouldn't call nl_langinfo from
threaded code, if you can't guarantee that other threads won't call it
at the same time. And in a library, without mutexes, you can't.
I have been planning to make changes in my projects for this, to make
sure a library doesn't call nl_langinfo (CODESET) itself, but rather lets
the application call it and pass the result down into the library,
together with the strings it provides the library with.
Alas, I have had more important things to do, so I haven't found time to
fix this...
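
A rough sketch of that split, assuming UTF-8 as the library's internal
encoding. The names lib_to_utf8 and app_convert are hypothetical, and the
"four output bytes per input byte" bound is an assumption that a real
implementation would replace with a growing buffer.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Library side (hypothetical name): never consults the locale itself.
   The caller states which code set STR is in.  Returns a malloc'd,
   NUL-terminated UTF-8 string, or NULL on error.  */
char *
lib_to_utf8 (const char *codeset, const char *str)
{
  iconv_t cd = iconv_open ("UTF-8", codeset);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t inleft = strlen (str);
  size_t outleft = 4 * inleft;          /* assumed upper bound */
  char *out = malloc (outleft + 1);
  char *inptr = (char *) str;
  char *outptr = out;

  if (out != NULL
      && iconv (cd, &inptr, &inleft, &outptr, &outleft) != (size_t) -1)
    *outptr = '\0';
  else
    {
      free (out);                       /* free (NULL) is harmless */
      out = NULL;
    }
  iconv_close (cd);
  return out;
}

/* Application side (hypothetical name): the only place that looks at
   the locale.  The application is assumed to have called
   setlocale (LC_ALL, "") once at startup, before creating threads.  */
char *
app_convert (const char *str)
{
  return lib_to_utf8 (nl_langinfo (CODESET), str);
}
```

With this shape the library has no dependency on nl_langinfo at all, so
whether that function is thread-safe becomes the application's problem.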
> Appended you find two alternative codes, taken from gettext (and tested for
> 4 years), and another one, taken from my unfinished libunistring - more
> powerful but untested.
Is the first one LGPL?
Is libunistring available somewhere?
> You will notice that there are two approaches to converting a string:
> a) allocate an initial buffer and extend it as needed, stopping and
> restarting iconv() each time a realloc is needed,
> b) call iconv() once to determine the length and then once again for
> filling the result string.
> I never measured which one is more efficient; probably it will also
> depend on the iconv implementation (glibc is noticeably faster for
> large strings than for small ones) and on the initial buffer size in
> case a).
I generally don't worry about efficiency, so I don't have an opinion
on this.
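
For reference, approach b) could look roughly like the sketch below. It
also takes an already opened iconv_t, as suggested earlier in the thread.
The function name and the 256-byte scratch buffer are assumptions; this
is not the gettext or libunistring code.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Sketch of approach b): one pass to measure, one pass to fill.
   CD must come from iconv_open; it is reset before each pass.
   Returns a malloc'd buffer of exactly *OUTLEN bytes (at least one
   byte is allocated), or NULL on error.  */
char *
convert_two_pass (iconv_t cd, const char *input, size_t inlen,
                  size_t *outlen)
{
  char scratch[256];
  char *inptr = (char *) input;
  size_t inleft = inlen;
  size_t total = 0;
  int flushed = 0;

  /* Pass 1: convert into a throwaway buffer and count the bytes.  */
  iconv (cd, NULL, NULL, NULL, NULL);
  while (!flushed)
    {
      char *outptr = scratch;
      size_t outleft = sizeof scratch;
      size_t res;

      if (inleft > 0)
        res = iconv (cd, &inptr, &inleft, &outptr, &outleft);
      else
        {
          res = iconv (cd, NULL, NULL, &outptr, &outleft);
          flushed = (res != (size_t) -1);
        }
      total += sizeof scratch - outleft;
      if (res == (size_t) -1 && errno != E2BIG)
        return NULL;            /* invalid or incomplete input */
    }

  /* Pass 2: allocate exactly and convert for real.  */
  char *out = malloc (total > 0 ? total : 1);
  if (out == NULL)
    return NULL;
  iconv (cd, NULL, NULL, NULL, NULL);
  inptr = (char *) input;
  inleft = inlen;
  char *outptr = out;
  size_t outleft = total;
  if ((inleft > 0
       && iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
      || iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    {
      free (out);
      return NULL;
    }
  *outlen = total - outleft;
  return out;
}
```

The result wastes no memory, at the price of running the conversion
twice. Approach a) converts once but may realloc several times and may
end with an oversized block unless it is shrunk afterwards.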
> Which API do you find worth pursuing?
I haven't read them in detail yet. I'll try to read them... I guess
my only concerns would be thread safety, or whether the code depends on
very many other gnulib modules.
One thing about the libunistring file that I didn't like was that it
uses locale_charset. There may have been a time when the complexity
of that function was needed to get things to work in the real world,
but today I'd rather use nl_langinfo (CODESET) [unless a specification
argues there is a problem with that].
However, simply moving the locale_charset part of the API to a
separate module seems feasible, and would fix my concern.
Perhaps the locale_charset() module could be turned into a replacement
for nl_langinfo (CODESET), if a system doesn't have a working
nl_langinfo (CODESET), instead of providing a completely new API.
Btw, is nl_langinfo (CODESET) guaranteed to return character set
names that are understood by iconv, if the system provides both
functions? I guess if the system doesn't have iconv natively, in
general you lose. I'm not sure I understand what POSIX means by
"code set". The character set and the encoding?
Thanks,
Simon