Re: library for unicode collation in C for texi2any?
From: Gavin Smith
Subject: Re: library for unicode collation in C for texi2any?
Date: Sat, 14 Oct 2023 10:07:27 +0100
On Fri, Oct 13, 2023 at 07:31:29AM +0000, Werner LEMBERG wrote:
>
> >> OK, no tailoring. I wasn't aware of those differences, thanks for
> >> pointing me to it.
> >>
> >> Hopefully, we agree that `@documentlanguage` should set a
> >> language-specific collation for the index.
> >
> > Without tailoring, this basically means collation according to
> > Unicode codepoints.
>
> Uh oh, this is not good. As an example, consider the letter 'ä'.
> There are two possible collations that are considered as correct for
> German:
>
> * Sort 'ä' right before 'b'.
>
> * Handle 'ä' similar to 'ae' but sort it after 'ae'.
>
> Neither collation corresponds to Unicode codepoints.
I think there is some confusion here. The Unicode Collation Algorithm
does not simply order by code point, so the code point for 'ä' (U+00E4)
is never compared numerically with that for 'a' (U+0061).
See https://www.unicode.org/reports/tr10/#Collation_And_Code_Chart_Order:
> The basic principle to remember is: The position of characters in the
> Unicode code charts does not specify their sort order.
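To make the contrast concrete, here is a small sketch of what plain
code point ordering would give. It relies only on the fact that Python
compares strings by code point by default; the sample word list is made
up for illustration and has nothing to do with texi2any itself.

```python
# Illustration only: Python's default string comparison is by code point.
words = ["z", "\u00e4b", "ab", "\u00e4", "a"]  # "\u00e4" is 'ä'

# 'ä' is U+00E4, which is numerically greater than 'z' (U+007A),
# so both entries starting with 'ä' end up after 'z'.
by_codepoint = sorted(words)
print(by_codepoint)  # ['a', 'ab', 'z', 'ä', 'äb']
```

This is the ordering that the TR10 passage warns against relying on.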
As far as I understand, there is a default ordering in which ä is sorted
after a. A "multilevel" ordering is used, giving the following ordering:
a
ä
ab
äb
z
rather than
a
ab
ä
äb
z
which is what would happen if ä was simply treated as a letter between
a and b.
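The multilevel idea can be sketched roughly as follows. This is a toy
approximation, not the real Unicode Collation Algorithm (which uses the
weights from the DUCET table): it assumes that the base letters form the
first level and that the combining marks only break ties at a second
level. The function name multilevel_key is made up for this example.

```python
import unicodedata

def multilevel_key(text):
    """Toy two-level sort key: base letters first, accents second."""
    primary = []    # base characters, compared first
    secondary = []  # combining marks attached to each base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                # attach the mark to the preceding base character
                secondary[-1] += (ord(ch),)
            # a leading mark with no base is ignored in this toy
        else:
            primary.append(ch)
            secondary.append(())
    return (primary, secondary)

words = ["z", "\u00e4b", "ab", "\u00e4", "a"]  # "\u00e4" is 'ä'
multilevel = sorted(words, key=multilevel_key)
print(multilevel)  # ['a', 'ä', 'ab', 'äb', 'z']
```

Because the accent is only looked at when the base letters are equal,
'ä' sorts directly after 'a' but before 'ab', as in the first list above.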
"Tailoring" is a further language-dependent alteration to the collation
algorithm. The TR10 document gives the example of Swedish where 'ä' would be
its own letter and sort after 'z':
a
ab
z
ä
äb
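As a rough sketch of what such a tailoring amounts to, the toy key below
gives 'å', 'ä' and 'ö' their own first-level rank after 'z', following
the usual Swedish alphabet order. It is an assumption-laden illustration
only: it handles lowercase letters, ignores everything else a real
tailoring would cover, and the names TAILORED and swedish_like_key are
invented for this example.

```python
import unicodedata

# Hypothetical tailoring table: these letters rank after 'z', in this order.
TAILORED = {"\u00e5": 1, "\u00e4": 2, "\u00f6": 3}  # å, ä, ö

def swedish_like_key(text):
    """Toy key: tailored letters are separate letters sorting after 'z'."""
    composed = unicodedata.normalize("NFC", text)
    # Rank 0 for ordinary characters, 1..3 for the tailored letters.
    return [(TAILORED.get(ch, 0), ch) for ch in composed]

words = ["z", "\u00e4b", "ab", "\u00e4", "a"]  # "\u00e4" is 'ä'
tailored = sorted(words, key=swedish_like_key)
print(tailored)  # ['a', 'ab', 'z', 'ä', 'äb']
```

Here the accent is no longer a tie-breaker at all: 'ä' is compared as a
distinct letter at the first level, which is what puts it after 'z'.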
Nobody is arguing for "codepoint-order" sorting. What is in question
here is whether there should be this latter language-dependent alteration
of the sorting order. This alteration may be good in theory, but it remains
to be seen how practical it is to achieve.