Re: horrible utf-8 performace in wc
From: Eric Blake
Subject: Re: horrible utf-8 performace in wc
Date: Fri, 6 Jun 2008 23:27:03 +0000 (UTC)
User-agent: Loom/3.14 (http://gmane.org/)

Bruno Haible <bruno <at> clisp.org> writes:
> > http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
>
> But before these techniques can be used in practice in packages such as
> coreutils, two problems would have to be solved satisfactorily:
>
> 1) "George Pollard makes the assumption that the input string is valid UTF-8".
> This assumption cannot be upheld, as long as you use the same type
> ('char *') for UTF-8 encoded strings and normal C strings, or when
> you occasionally convert between one and the other.
Agreed.
>
> For example: Assume NAME is really a valid UTF-8 string.
> A program then does
>
> static char buf[20];
> snprintf (buf, sizeof (buf), "%s", NAME);
> utf8_strlen (buf);
>
> Boing! You already have a buffer overrun:
Disagreed. Reread Colin Percival's vectorized algorithm: he intentionally
checks for NUL before counting non-leading UTF-8 bytes. Yes, if any of the
bytes in the char* string do not form valid UTF-8, the final count will be
garbage. But snprintf guarantees a NUL, and the vectorized counter guarantees
stopping at NUL; so the garbage is bounded: no greater than the number of
bytes, and no less than the number of valid characters.
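
A minimal sketch of that bound, assuming a plain byte-at-a-time counter rather
than Colin Percival's vectorized code (the names utf8_count_sketch and
count_after_truncation are hypothetical, chosen only for illustration). It
counts every byte that is not a continuation byte and stops at the first NUL,
so a string truncated by snprintf cannot cause a read past the buffer:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical scalar counter: counts bytes that are not UTF-8
   continuation bytes (10xxxxxx) and stops at the first NUL.
   It never reads past the terminator, valid UTF-8 or not. */
size_t utf8_count_sketch(const char *s)
{
    size_t count = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper mirroring the quoted example: copy into a
   20-byte buffer (snprintf always NUL-terminates when the size is
   nonzero), then count.  Truncation may split a character. */
size_t count_after_truncation(const char *name)
{
    char buf[20];

    snprintf(buf, sizeof buf, "%s", name);
    return utf8_count_sketch(buf);
}
```

For a NAME made of ten two-byte characters (20 bytes), snprintf keeps 19 bytes:
nine whole characters plus one orphaned lead byte. The sketch returns 10, which
lies between the 9 valid characters and the 19 bytes in the buffer.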
> 2) We already have the problem that we want to keep good performance when
> handling strings in the "C" locale or, more generally, in a unibyte locale.
> So we get code duplication:
> - code for unibyte locales,
> - code for multibyte locales that uses mbrtowc().
> If you want to optimize UTF-8 locales particularly, i.e. optimize away
> the function calls inherent in mbrtowc(), then we get code triplication:
> - code for unibyte locales,
> - code for UTF-8 locales,
> - code for multibyte locales other than UTF-8, that uses mbrtowc().
> So, code size increases, and the testing requirements increase as well.
Unfortunately true. But UTF-8 is such a common and special case that the
benefits may outweigh the cost of duplication, especially if we can factor it
well (you've already shown a factorization for writing one loop that can be
used for unibyte and multibyte by merely swapping which header you include when
compiling the loop).
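
As a rough illustration of what such factoring could look like, here is a
sketch that writes the counting loop once and instantiates it three times. It
is an assumption-laden stand-in: it uses a macro in a single file instead of
the header-swapping approach mentioned above, and every name in it
(DEFINE_CHAR_COUNTER, step_unibyte, step_utf8, step_mbrtowc) is hypothetical,
not taken from coreutils or gnulib. The UTF-8 step does no validation; it
simply groups a byte with the continuation bytes that follow it.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Each step function returns the length in bytes (>= 1) of the
   character starting at p, where p < end. */

static size_t step_unibyte(const char *p, const char *end, mbstate_t *st)
{
    (void)p; (void)end; (void)st;
    return 1;
}

static size_t step_utf8(const char *p, const char *end, mbstate_t *st)
{
    size_t n = 1;

    (void)st;
    /* Absorb the continuation bytes (10xxxxxx) that follow. */
    while (p + n < end && ((unsigned char)p[n] & 0xC0) == 0x80)
        n++;
    return n;
}

static size_t step_mbrtowc(const char *p, const char *end, mbstate_t *st)
{
    size_t n = mbrtowc(NULL, p, (size_t)(end - p), st);

    if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
        /* Invalid or incomplete sequence: reset the state and
           treat the offending byte as one character. */
        memset(st, 0, sizeof *st);
        return 1;
    }
    return n;
}

/* The loop is written once; only the step function differs. */
#define DEFINE_CHAR_COUNTER(name, step)          \
    size_t name(const char *s)                   \
    {                                            \
        const char *end = s + strlen(s);         \
        mbstate_t st;                            \
        size_t count = 0;                        \
        memset(&st, 0, sizeof st);               \
        while (s < end) {                        \
            s += step(s, end, &st);              \
            count++;                             \
        }                                        \
        return count;                            \
    }

DEFINE_CHAR_COUNTER(count_unibyte, step_unibyte)
DEFINE_CHAR_COUNTER(count_utf8, step_utf8)
DEFINE_CHAR_COUNTER(count_mbrtowc, step_mbrtowc)
```

The testing cost does not go away with this arrangement: each of the three
instantiations still needs its own test inputs, even though the loop body is
shared.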
--
Eric Blake