From: Alexander V. Lukyanov
Subject: Re: uc_width and wcwidth optimization
Date: Wed, 14 Dec 2011 14:02:33 +0400
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, Dec 13, 2011 at 11:32:53AM +0100, Bruno Haible wrote:
> 2) The wcwidth change is a good idea, but unfortunately is not multithread-
> safe. Different threads can have different locales, therefore a global
> variable as a cache won't lead to correct results always.
Fortunately charset.alias is not re-read every time wcwidth is called. ;-)
Are there any real programs which use different locales in threads?
> I'm attaching the benchmark program I'm experimenting with. So far, it seems
> that locale_charset() is really slow, whereas the is_cjk stuff is not a big
> speed problem.
is_cjk_encoding() is in second place after locale_charset().
locale_charset() is slow because it does a linear search of the charset aliases.
Unfortunately, I don't know how to optimize it to be thread-safe without
heavy artillery like thread-local storage.
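
For illustration, here is a rough sketch of what a thread-local cache could
look like. It is not the proposed patch: it assumes C11 `_Thread_local`, uses
POSIX nl_langinfo (CODESET) as a stand-in for locale_charset (), and all the
names (current_locale_is_utf8, invalidate_charset_cache, slow_lookups) are
made up for this example. It also assumes the thread itself announces locale
changes, since nothing here notices setlocale () or uselocale () calls.

```c
#include <langinfo.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-thread cache: -1 = not computed yet, 0 = no, 1 = yes.  */
static _Thread_local int cached_is_utf8 = -1;

/* Counts the slow lookups, only so the effect of the cache is observable.  */
static _Thread_local int slow_lookups = 0;

/* Return true if the locale encoding is UTF-8, doing the slow lookup at
   most once per thread until the cache is invalidated.  */
static bool
current_locale_is_utf8 (void)
{
  if (cached_is_utf8 < 0)
    {
      /* Stand-in for locale_charset (); assumes the canonical spelling.  */
      const char *codeset = nl_langinfo (CODESET);
      slow_lookups++;
      cached_is_utf8 = (strcmp (codeset, "UTF-8") == 0);
    }
  return cached_is_utf8 != 0;
}

/* Assumption of this sketch: a thread calls this after changing its
   locale, because the cache cannot detect the change by itself.  */
static void
invalidate_charset_cache (void)
{
  cached_is_utf8 = -1;
}
```

Each thread gets its own copy of the cache, so threads with different locales
do not see each other's results. The price is the explicit invalidation,
which is the part that makes this approach awkward in a library.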
> > Besides, uc_width is used in wcwidth for cjk encodings as designed.
>
> - if (STREQ (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
> + if (cached_is_utf8_encoding || cached_is_cjk_encoding)
> {
> /* We assume that in a UTF-8 locale, a wide character is the same as a
> Unicode character. */
> - return uc_width (wc, encoding);
> + return uc_width (wc, cached_is_cjk_encoding);
> }
>
> This won't work portably: The comment says that only in UTF-8 locales we know
> that a wchar_t represents a Unicode character. In locales with encodings
> such as EUC-JP or GB18030 you cannot assume anything about how the libc has
> defined the wchar_t values.
This means it is possible to avoid calling is_cjk_encoding() at all,
because uc_width only uses the encoding for the CJK check, and wcwidth only
calls uc_width in the UTF-8 case (UTF-8 is not a CJK encoding).
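
To make the argument concrete, here is a small sketch of the control flow.
sketch_uc_width and sketch_wcwidth_utf8 are hypothetical stand-ins, not the
gnulib functions: the stand-in takes a precomputed "is CJK" flag (as in the
quoted patch) and knows only a few character ranges, with a deliberately
simplified rule for legacy CJK encodings.

```c
/* Hypothetical stand-in for uc_width (), with the CJK decision passed in
   as a flag.  Only a few ranges are handled, for illustration.  */
static int
sketch_uc_width (unsigned int uc, int is_cjk)
{
  if (uc == 0)
    return 0;
  if (uc < 0x20 || (uc >= 0x7f && uc < 0xa0))
    return -1;                  /* control characters */
  if (uc >= 0x4e00 && uc <= 0x9fff)
    return 2;                   /* CJK unified ideographs */
  if (is_cjk && uc >= 0xa1)
    return 2;                   /* simplified: wide in legacy CJK encodings */
  return 1;
}

/* Width lookup for the branch of wcwidth that is taken only when the
   locale encoding is already known to be UTF-8.  UTF-8 is not a CJK
   encoding, so the flag is the constant 0 and no CJK check is needed.  */
static int
sketch_wcwidth_utf8 (unsigned int wc)
{
  return sketch_uc_width (wc, 0);
}
```

With the flag fixed at 0, a character such as U+00E9 gets width 1 in the
UTF-8 branch, while the same stand-in returns 2 for it when the flag is set.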
--
Alexander.