guile-devel

Using libunistring for string comparisons et al


From: Mark H Weaver
Subject: Using libunistring for string comparisons et al
Date: Fri, 11 Mar 2011 17:33:47 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)

Mike Gran <address@hidden> writes:
> [...] But doing the upper->lower operation picks
> up a few more of the corner cases, like U+03C2 GREEK
> SMALL LETTER FINAL SIGMA and U+03C3 GREEK SMALL LETTER SIGMA
> which are the same letter with different representations,
> or U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU
> which are supposed to have the same sort ordering.

Ah, okay.  Makes sense.

> Now that we've pulled in all of libunistring, it might
> be a good idea to see if it has a complete implementation
> of unicode case folding, because upper->lower is also not
> completely correct.

I looked into this.  Indeed, the libunistring documentation mentions
that in some languages (e.g. German), the to_upper and to_lower
conversions cannot be done properly on a per-character basis, because
the number of characters can change.  These operations must be done on an
entire string.  For example:

<http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-2.html>

  (string-upcase "Straße") => "STRASSE"
  (string-foldcase "Straße") => "strasse"
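
The length change is easy to observe outside Scheme as well.  The sketch
below uses Python's built-in str.upper and str.casefold purely as
stand-ins for whole-string case mapping; it does not call libunistring,
and the helper name length_change is made up for this illustration.

```python
# Illustration only: Python's str methods stand in for whole-string
# case conversion.  This is not libunistring's API.

def length_change(s: str) -> int:
    """Return how many characters upcasing adds to (or removes from) s."""
    return len(s.upper()) - len(s)

word = "Stra\u00dfe"          # "Straße"; U+00DF is LATIN SMALL LETTER SHARP S

upcased = word.upper()        # whole-string upcase
folded = word.casefold()      # whole-string case folding

print(upcased, folded, length_change(word))
```

A conversion applied one character at a time, with exactly one output
character per input character, has no way to produce the extra "S".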

libunistring contains all the necessary functions, including
case-insensitive string comparisons.  However, the only string
representations supported by these operations are: UTF-8, UTF-16,
UTF-32, or locale-encoded strings, and for comparisons both strings must
be in the same encoding.
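
To make the idea concrete, here is a rough sketch of a case-insensitive
comparison done by folding both strings, again in Python rather than
with libunistring.  The function names are hypothetical, and the sketch
skips the normalization step that full Unicode caseless matching also
calls for.  The second helper shows the "same encoding" constraint from
the caller's side: inputs in different encodings are first brought to
one common representation.

```python
# Illustration only: compare strings by case folding.  The names below
# are invented for this sketch and are not libunistring functions.

def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after folding case on the whole string."""
    return a.casefold() == b.casefold()

def caseless_equal_encoded(a: bytes, a_enc: str,
                           b: bytes, b_enc: str) -> bool:
    """Decode both inputs to one representation, then compare."""
    return caseless_equal(a.decode(a_enc), b.decode(b_enc))

# Corner cases mentioned earlier in the thread:
final_sigma, sigma = "\u03c2", "\u03c3"   # final sigma vs. small sigma
micro, mu = "\u00b5", "\u03bc"            # micro sign vs. small mu

print(caseless_equal(final_sigma, sigma))
print(caseless_equal(micro, mu))
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

Plain lowercasing would leave both the final sigma and the micro sign
unchanged, so it would not treat either pair as equal; folding does.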

I'm aware that this proposal will be very controversial, but starting in
Guile 2.2, I think we ought to consider storing strings internally in
UTF-8, as is done in Gauche.  This would of course make string-ref and
string-set! into O(n) operations.  However, I claim that any code that
depends on string-ref and string-set! could be better written 


