Re: Using libunistring for string comparisons et al
From: Mark H Weaver
Subject: Re: Using libunistring for string comparisons et al
Date: Tue, 15 Mar 2011 21:12:28 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
Mike Gran <address@hidden> writes:
>> The reason I am still arguing this point is because I have looked
>> seriously at what I would need to do to (A) fix our i18n problems and
>> (B) make the code efficient. I very much want to fix these things,
>> but the pain of trying to do this with our current scheme is too much
>> for me to bear. I shouldn't have to rewrite libunistring, and I
>> shouldn't have to write 3 or 4 different variants of each procedure
>> that takes two string parameters.
>
> What procedures are giving incorrect results?
I know of two categories of bugs. One involves case conversions and
case-insensitive comparisons, which must operate on entire strings but
are currently done character by character. Here are some examples:
(string-upcase "Straße") => "STRAßE" (should be "STRASSE")
(string-downcase "ΧΑΟΣΣ") => "χαοσσ" (should be "χαοσς")
(string-downcase "ΧΑΟΣ Σ") => "χαοσ σ" (should be "χαος σ")
(string-ci=? "Straße" "Strasse") => #f (should be #t)
(string-ci=? "ΧΑΟΣ" "χαoσ") => #f (should be #t)
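To illustrate why a per-character mapping cannot produce "STRASSE", here is a minimal C sketch of a string-level upcase. It is a toy, not libguile or libunistring code: the function name `toy_upcase_utf8` is hypothetical, it assumes UTF-8 input, and it handles only ASCII letters plus ß (U+00DF). It does not attempt the context-dependent final sigma. The point is only that the output can be longer than the input, so the conversion has to work on the whole string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy illustration only (hypothetical helper, not a libguile API).
   Uppercases ASCII letters and expands the UTF-8 encoding of U+00DF
   (ß, bytes 0xC3 0x9F) to "SS".  All other bytes are copied unchanged.
   Returns the number of bytes written, excluding the terminating NUL,
   or (size_t) -1 if DST is too small.  */
size_t
toy_upcase_utf8 (const char *src, char *dst, size_t dst_size)
{
  size_t out = 0;

  assert (src != NULL && dst != NULL);
  if (dst_size == 0)
    return (size_t) -1;

  for (size_t i = 0; src[i] != '\0'; i++)
    {
      unsigned char c = (unsigned char) src[i];

      if (c == 0xC3 && (unsigned char) src[i + 1] == 0x9F)
        {
          /* One input character becomes two output characters, which a
             character-to-character mapping cannot express.  */
          if (out + 2 >= dst_size)
            return (size_t) -1;
          dst[out++] = 'S';
          dst[out++] = 'S';
          i++;                  /* skip the second byte of ß */
        }
      else
        {
          if (out + 1 >= dst_size)
            return (size_t) -1;
          dst[out++] = (c >= 'a' && c <= 'z') ? (char) (c - 'a' + 'A')
                                              : (char) c;
        }
    }

  dst[out] = '\0';
  return out;
}
```

Called on the UTF-8 bytes of "Straße" (six characters, seven bytes) with a large enough buffer, this writes the seven-character string "STRASSE".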
Another big category of problems has to do with the fact that
scm_from_locale_{string,symbol,keyword} are currently used in many places
where the C string being converted is a compile-time constant. This is
a bug unless the strings are ASCII-only, because the locale encoding at
run time is normally the user's, which is not necessarily the encoding
of the source code.
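The mismatch can be seen in a small C sketch that interprets the same bytes under two encodings. Both function names are hypothetical and this is not how libguile decodes strings. It only shows that a non-ASCII constant written in a UTF-8 source file reads as a different string under a Latin-1 locale, while an ASCII-only constant reads the same under both.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Under Latin-1 every byte is exactly one character.  */
size_t
count_chars_latin1 (const char *s)
{
  assert (s != NULL);
  return strlen (s);
}

/* Under UTF-8 a character is one lead byte followed by zero or more
   continuation bytes of the form 10xxxxxx, so count only the bytes
   that are not continuation bytes.  Assumes S is valid UTF-8.  */
size_t
count_chars_utf8 (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}
```

For the UTF-8 bytes of "Straße", the Latin-1 reading sees seven characters and the UTF-8 reading sees six. For an ASCII-only name such as "define", both readings agree.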
Ludovic, Andy and I discussed this on IRC, and came to the conclusion
that UTF-8 should be the encoding assumed by functions such as
scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
scm_c_export, scm_c_define_module, scm_c_resolve_module,
scm_c_use_module, etc. However, this creates pressure to make
scm_from_utf8_string and scm_from_utf8_symbol as efficient as possible.
With the current string representation scheme, the plan for
scm_from_utf8_string is to scan up to the first 100 characters of the
input string, and if the string is found to be ASCII-only, then we can
use scm_from_latin1_string. Otherwise, we need to use scm_from_stringn
which is noticeably slower.
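A minimal C sketch of the scan described above follows. The helper name `short_and_ascii_only` and the constant `ASCII_SCAN_LIMIT` are hypothetical, not part of libguile. The sketch also assumes that a string longer than the limit is simply treated as not known to be ASCII and sent down the slower path, which is one possible reading of the plan rather than a confirmed design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit taken from the "first 100 characters" in the plan.  */
#define ASCII_SCAN_LIMIT 100

/* Hypothetical helper, not a libguile function.  Returns true if the
   NUL-terminated string S is at most ASCII_SCAN_LIMIT bytes long and
   every byte is ASCII (below 0x80).  Returns false as soon as a
   non-ASCII byte is seen, or if the string is too long to decide
   within the limit.  */
bool
short_and_ascii_only (const char *s)
{
  size_t i;

  assert (s != NULL);
  for (i = 0; i < ASCII_SCAN_LIMIT && s[i] != '\0'; i++)
    if ((unsigned char) s[i] >= 0x80)
      return false;

  /* True only if the scan stopped at the terminator, not at the limit.  */
  return s[i] == '\0';
}
```

A caller would take the fast path (for example scm_from_latin1_string) when this returns true and fall back to the general conversion otherwise. ASCII is a subset of both Latin-1 and UTF-8, so the fast path is safe for such strings.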
An unfortunate complication is that the snarfing macros such as
SCM_DEFINE et al arrange to store the symbol names as compile-time
constants and thus to put them in a read-only segment of the shared
library. This is done with some preprocessor magic in snarf.h (see
SCM_IMMUTABLE_STRINGBUF). I would like to make SCM_DEFINE et al work
for any UTF-8 strings, but I can do that with cpp only if UTF-8 is the
internal representation. As things currently stand, those macros must
be limited to ASCII-only names, which is unfair to non-English speakers.
Mark