bug-coreutils

Re: uniq i18n implementation


From: Pádraig Brady
Subject: Re: uniq i18n implementation
Date: Wed, 09 Aug 2006 22:13:47 +0100
User-agent: Mozilla Thunderbird 1.0.8 (X11/20060502)

Paul Eggert wrote:
> Pádraig Brady <address@hidden> writes:
> 
> 
>>memcoll does 2 errno accesses per call, which shows up significantly
>>in profiles. Does strcoll even set errno?
> 
> 
> <http://www.opengroup.org/susv3/functions/strcoll.html> says it's
> allowed to.  I assume some platforms do.  I wouldn't be surprised if
> errno were set to EILSEQ on some platforms, for example, if the
> strings contain byte sequences that are not valid multibyte
> characters.  Perhaps if you investigate glibc's source code you can
> see what it does in this case; it might be worth making a special case
> for glibc at any rate.  Or maybe we could even use an Autoconf-style
> test.

Thanks for replying, Paul.

Yes, it looks like I'll need to do a lot of autoconf work
for both functionality and performance :(
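
For readers following along, here is a minimal sketch of the pattern
that costs two errno accesses per comparison. The name coll_checked
is hypothetical and this is not the actual memcoll code; it only
illustrates clearing errno before strcoll and inspecting it afterwards,
which is the only way to detect a failure since strcoll has no error
return value.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper, not coreutils' memcoll: compare two
   NUL-terminated strings with strcoll and report through *failed
   whether strcoll set errno.  Every call touches errno twice.  */
int
coll_checked (const char *a, const char *b, int *failed)
{
  errno = 0;                    /* first errno access: clear it */
  int r = strcoll (a, b);
  *failed = (errno != 0);       /* second errno access: inspect it */
  return r;
}
```

If a configure-time test showed that a platform's strcoll never sets
errno, both accesses could be compiled out; that is an assumption to
verify per platform, not something established in this thread.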

> 
> 
>>Using strcoll is inefficient anyway
> 
> 
> Don't we know it!  If we can avoid it, we'd like to.

Well, the mbstowcs+wcscoll solution I presented
should be equivalent to strcoll on any platform,
and it's much faster in my tests.
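
A rough sketch of the mbstowcs+wcscoll approach, for illustration only.
The function name wide_coll and its error convention are made up here,
and the sketch relies on the POSIX behaviour that mbstowcs with a null
destination returns the number of wide characters required. A real
implementation would reuse its buffers between lines rather than
allocate on every call.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: compare two NUL-terminated multibyte strings
   by converting each to wide characters and calling wcscoll.
   Returns 0 and stores the comparison in *result on success.
   Returns -1 if either string is not valid multibyte text in the
   current locale, or if memory runs out.  */
int
wide_coll (const char *a, const char *b, int *result)
{
  /* With a null destination, POSIX mbstowcs only counts.  */
  size_t na = mbstowcs (NULL, a, 0);
  size_t nb = mbstowcs (NULL, b, 0);
  if (na == (size_t) -1 || nb == (size_t) -1)
    return -1;

  wchar_t *wa = malloc ((na + 1) * sizeof *wa);
  wchar_t *wb = malloc ((nb + 1) * sizeof *wb);
  int status = -1;
  if (wa && wb)
    {
      mbstowcs (wa, a, na + 1);   /* +1 leaves room for the terminator */
      mbstowcs (wb, b, nb + 1);
      *result = wcscoll (wa, wb);
      status = 0;
    }
  free (wa);
  free (wb);
  return status;
}
```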

>>I noticed coreutils doesn't shortcut the string comparisons
>>by checking lengths before doing memcoll if !C locale,
>>which is fair enough, but maybe a bit restrictive?
>>Can't one just check lengths when MB_CUR_MAX == 1 ?
> 
> 
> I don't know whether that would be portable.  I can easily imagine
> locales where it wouldn't be.

Fair enough.

> 
> 
>>In general can someone give a non theoretical example
>>of 2 different byte sequences (even of the same length),
>>that compare equal with strcoll() and/or transform to the same
>>wide character with mbstowcs() in any locale.
> 
> 
> I'd expect that these two sequences:
> 
> U+006D LATIN SMALL LETTER M
> U+00ED LATIN SMALL LETTER I WITH ACUTE
> 
> U+006D LATIN SMALL LETTER M
> U+0069 LATIN SMALL LETTER I
> U+0301 COMBINING ACUTE ACCENT 
> 
> would compare equal, at least on some platforms.  However, I haven't
> tested this.  For lots more on this subject, please see
> <http://www.unicode.org/unicode/reports/tr10/>.

Wow, that's a fantastic reference that I wasn't aware of.
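
To make the example concrete, here are the two spellings as UTF-8
bytes (UTF-8 is an assumption; the thread does not name an encoding).
They differ in length, 3 bytes versus 4, so a length check alone would
call them different even in a locale whose collation treats them as
equal. Whether strcoll actually returns 0 for this pair depends on the
platform and locale and is not tested here. The helper bytes_equal is
hypothetical.

```c
#include <string.h>

/* The two spellings from the example above, encoded as UTF-8.  */
const char precomposed[] = "m\xc3\xad";    /* U+006D U+00ED */
const char decomposed[]  = "mi\xcc\x81";   /* U+006D U+0069 U+0301 */

/* Hypothetical helper: nonzero if the two strings are byte-identical.
   This is the only direction in which a byte-level shortcut is
   certainly safe: identical bytes always collate as equal.  */
int
bytes_equal (const char *a, const char *b)
{
  size_t la = strlen (a), lb = strlen (b);
  return la == lb && memcmp (a, b, la) == 0;
}
```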

> 
> 
>>I.E. how to get strcoll &/or wcscoll to only compare the primary weights.
>>I don't think this functionality is in glibc
> 
> 
> I think you're right.
> 
> 
>>but it probably is possible in ICU?
> 
> 
> Sorry, don't know.

I wonder could we add this as a dependency?

>>My test version of uniq treats the whole line as "C"
>>if it isn't all a valid multibyte sequence,
> 
> 
> I don't think we need to worry overmuch about performance for invalid
> multibyte sequences.  I'd rather have correctness.
> 
> An obvious way to define "correctness" would be to break the sequence
> of bytes into valid multibyte sequences separated by stray bytes, and
> to sort lexicographically, where we use memcoll for the multibyte
> sequences and memcmp for the stray bytes.  If we do this consistently
> in 'sort', 'uniq', 'comm', 'join', etc., I think that would be a win
> over the current situation, where the programs report an error when
> strcoll fails.

Well, the performance win is for the common non-error case,
as doing this allows one to avoid char-by-char processing in the util.
This is the major performance advantage in my version.

Also I don't agree with splitting entities into
valid multibyte ranges and "C" for the rest.
That is probably not what the user wants the data interpreted as,
and I think (at least for uniq which I've thought about),
that it's just best to treat the whole entity as "C"
if there are invalid multibyte sequences in the entity.
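
A simplified sketch of that fallback, not the code from my test
version. The names is_valid_mb and line_compare are hypothetical, and
applying the byte-wise comparison whenever either of the two lines is
invalid is an assumption made for this illustration. It again relies
on POSIX mbstowcs accepting a null destination.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: nonzero if s is entirely valid multibyte
   text in the current locale.  */
int
is_valid_mb (const char *s)
{
  return mbstowcs (NULL, s, 0) != (size_t) -1;
}

/* Hypothetical helper: compare two lines with locale collation when
   both are valid multibyte text; otherwise compare the whole lines
   byte by byte, as the "C" locale would.  */
int
line_compare (const char *a, const char *b)
{
  if (is_valid_mb (a) && is_valid_mb (b))
    return strcoll (a, b);
  return strcmp (a, b);         /* whole line treated as "C" */
}
```

The validity check is a single pass per line, so the common valid
case needs no char-by-char logic in the utility itself.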

cheers,
Pádraig.

p.s. I should clarify that I'm doing the standalone
prog just to quickly test things, and intend to
merge any changes I find appropriate back into the
existing uniq.



