Re: coreutils and i18n
From: Pádraig Brady
Subject: Re: coreutils and i18n
Date: Mon, 21 Apr 2008 12:53:54 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Bruno Haible wrote:
> Jim Meyering wrote:
>>> - Processing in unibyte locales should not become significantly slower
>>> than before.
>>> - Code duplication should be avoided, for maintainability.
>>> - Macros which expand to one thing in the multibyte case and to another
>>> thing for the unibyte case are not acceptable.
>>>
>>> How will this students' project solve this dilemma?
>> There's no guarantee, but Paul and I will be supervising.
>
> I mean, what is technically the solution to the dilemma? The typical idiom
> for keeping the speed of the unibyte case is - see e.g.
> gnulib/lib/mbscasecmp.c
> as an example -
>
> #if HAVE_MBRTOWC
>   if (MB_CUR_MAX > 1)
>     ... multibyte case ...
>   else
> #endif
>     ... unibyte case ...
>
> but it does have code duplication.
That's the obvious solution, but it is not really required or desired.
If I were being paid to do it (unfortunately I have very little free time),
I would do something like this:
1. Identify filters that require multibyte handling.
2. Refactor line input processing etc. into shared code.
3. Apply multibyte processing intelligently.
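As a rough illustration of steps 2 and 3, here is a minimal Python sketch, not the coreutils design. The names `make_key` and `uniq` are hypothetical, and it assumes that an exact comparison means byte-for-byte identity (locale collation is ignored). The loop is shared, and decoding happens only when an option such as ignoring case actually needs character semantics:

```python
from typing import Callable, Iterable, Iterator


def make_key(ignore_case: bool, encoding: str) -> Callable[[bytes], object]:
    """Choose the comparison key once, up front (hypothetical helper)."""
    if not ignore_case:
        # Assumption: exact comparison is plain byte equality, so the
        # line is never decoded and the unibyte speed is kept.
        return lambda line: line
    # Case-insensitive comparison needs characters, so decode here only.
    # surrogateescape keeps invalid byte sequences from raising.
    return lambda line: line.decode(encoding, errors="surrogateescape").casefold()


def uniq(lines: Iterable[bytes], ignore_case: bool = False,
         encoding: str = "utf-8") -> Iterator[bytes]:
    """Yield each line whose key differs from the previous line's key."""
    key = make_key(ignore_case, encoding)
    previous = object()  # sentinel that compares unequal to every key
    for line in lines:
        current = key(line)
        if current != previous:
            yield line
            previous = current
```

With this shape there is one copy of the line loop, and the multibyte cost is paid only by the options that need it.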
For illustration, look at the current performance of various `uniq`
implementations:
$ rpm -q coreutils
coreutils-6.9-9.fc8
$ echo $LANG
en_IE.UTF-8
# The default one uses the existing i18n patch
$ time uniq < lines.test > /dev/null
real 0m27.724s
$ time LC_CTYPE=C uniq < lines.test > /dev/null
real 0m1.314s
$ time ~/git/coreutils/src/uniq < lines.test > /dev/null
real 0m1.187s
$ time ~/myuniq < lines.test > /dev/null
real 0m0.827s
$ time ~/uniq.py < lines.test > /dev/null
real 0m2.657s
Yes, the Python version (which I wrote in nearly the same time
as the default uniq took to complete the test) is much better!
`myuniq` is a version I implemented from scratch,
to understand some of the issues involved:
http://lists.gnu.org/archive/html/bug-coreutils/2006-07/msg00153.html
It's not just performance. The i18n patch for uniq is also functionally
buggy, for example in the presence of NUL characters:
$ for i in 1 2 3; do echo -e "1234\x0056789"; done | uniq
123456789
123456789
123456789
$ for i in 1 2 3; do echo -e "1234\x0056789"; done | LANG=C uniq
123456789
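The NUL case above can be turned into a small regression check. This is only a sketch in Python with a hypothetical helper name (`uniq_stream`); it makes no claim about where the patch goes wrong. It just shows that when lines are held as length-aware byte strings, an embedded NUL is an ordinary byte and the three identical lines collapse to one:

```python
import io
from typing import BinaryIO, Iterator


def uniq_stream(stream: BinaryIO) -> Iterator[bytes]:
    """Drop adjacent duplicate lines; lines are split on newline only."""
    previous = None
    for line in stream:
        # bytes comparison covers the full length, NUL bytes included.
        if line != previous:
            yield line
        previous = line


# Same input as the shell loop: three lines containing an embedded NUL.
data = b"1234\x0056789\n" * 3
result = list(uniq_stream(io.BytesIO(data)))
```

Here `result` holds a single line, matching the `LANG=C` behaviour shown above.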
It's great that Paul & Jim are looking at this interesting project;
as I've mentioned before, it really is important.
cheers,
Pádraig.