bug-coreutils

Re: horrible utf-8 performace in wc


From: Eric Blake
Subject: Re: horrible utf-8 performace in wc
Date: Fri, 6 Jun 2008 23:27:03 +0000 (UTC)
User-agent: Loom/3.14 (http://gmane.org/)

Bruno Haible <bruno <at> clisp.org> writes:

> > http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
> 
> But before these techniques can be used in practice in packages such as
> coreutils, two problems would have to be solved satisfactorily:
> 
>   1) "George Pollard makes the assumption that the input string is valid
>      UTF-8".
>      This assumption cannot be upheld, as long as you use the same type
>      ('char *') for UTF-8 encoded strings and normal C strings, or when
>      you occasionally convert between one and the other.

Agreed.

> 
>      For example: Assume NAME is really a valid UTF-8 string.
>      A program then does
> 
>        static char buf[20];
>        snprintf (buf, sizeof buf, "%s", NAME);
>        utf8_strlen (buf);
> 
>      Boing! You already have a buffer overrun:

Disagreed.  Reread Colin Percival's vectorized algorithm: he intentionally 
checks for NUL before counting non-leading UTF-8 bytes.  Yes, if the char* 
contents are not valid UTF-8, the final count will be garbage.  But snprintf 
guarantees a NUL, and the vectorized counter guarantees stopping at NUL; so 
the garbage is bounded: no greater than the number of bytes, and no less than 
the number of valid characters.
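
To make the bound concrete, here is a minimal scalar sketch of the idea, not 
Colin Percival's vectorized code.  The name utf8_count_sketch is hypothetical. 
It counts non-continuation bytes directly, which gives the same result as 
subtracting the continuation bytes (10xxxxxx) from the byte count, and it 
stops at the first NUL, so it never reads past a NUL-terminated buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, scalar rather than vectorized: count the bytes
   that are not UTF-8 continuation bytes (bit pattern 10xxxxxx),
   stopping at the first NUL.  On valid UTF-8 this is the number of
   characters; on invalid input the result is meaningless but can
   never exceed the number of bytes before the NUL.  */
static size_t
utf8_count_sketch (const char *s)
{
  size_t count = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      count++;
  return count;
}
```

If snprintf cuts a string in the middle of a multibyte character, the buffer 
ends with a lead byte followed by NUL.  The loop above counts that lead byte 
as one character and then stops; the count is off, but no read goes out of 
bounds.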

>   2) We already have the problem that we want to keep good performance when
>      handling strings in the "C" locale or, more generally, in a unibyte
>      locale.
>      So we get code duplication:
>        - code for unibyte locales,
>        - code for multibyte locales that uses mbrtowc().
>      If you want to optimize UTF-8 locales particularly, i.e. optimize away
>      the function calls inherent in mbrtowc(), then we get code triplication:
>        - code for unibyte locales,
>        - code for UTF-8 locales,
>        - code for multibyte locales other than UTF-8, that uses mbrtowc().
>      So, code size increases, and the testing requirements increase as well.

Unfortunately true.  But UTF-8 is such a common and special case that the 
benefits may outweigh the cost of duplication, especially if we can factor it 
well (you've already shown a factorization for writing one loop that can be 
used for unibyte and multibyte by merely swapping which header you include when 
compiling the loop).
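
For illustration, the three code paths listed above could be selected at run 
time roughly as follows.  This is only a sketch under assumptions: it uses a 
run-time dispatch rather than the header-swapping factorization, all function 
names are hypothetical, the UTF-8 test relies on POSIX nl_langinfo (CODESET) 
returning "UTF-8", and counting each invalid byte as one character is just 
one possible policy.

```c
#include <langinfo.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* UTF-8 path (hypothetical): count non-continuation bytes; no
   function call per character.  Assumes the input is valid UTF-8.  */
static size_t
count_utf8_sketch (const char *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      count++;
  return count;
}

/* Generic multibyte path (hypothetical): one mbrtowc call per
   character in the current locale.  */
static size_t
count_mbrtowc_sketch (const char *s, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof state);
  while (n > 0)
    {
      size_t len = mbrtowc (NULL, s, n, &state);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Invalid or incomplete sequence: count one byte as one
             character and restart from the initial shift state.  */
          len = 1;
          memset (&state, 0, sizeof state);
        }
      else if (len == 0)
        len = 1;                /* An embedded NUL is one character.  */
      s += len;
      n -= len;
      count++;
    }
  return count;
}

/* Dispatcher (hypothetical): pick one of the three paths based on
   the current locale.  */
static size_t
count_chars_sketch (const char *s, size_t n)
{
  if (MB_CUR_MAX == 1)
    return n;                   /* Unibyte locale: bytes == characters.  */
  if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
    return count_utf8_sketch (s, n);
  return count_mbrtowc_sketch (s, n);
}
```

A real implementation would make the locale check once, not per call, and 
each of the three paths would need its own tests, which is exactly the cost 
being weighed here.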

-- 
Eric Blake





