emacs-devel

Re: On language-dependent defaults for character-folding


From: Eli Zaretskii
Subject: Re: On language-dependent defaults for character-folding
Date: Tue, 23 Feb 2016 18:56:36 +0200

> From: Achim Gratz <address@hidden>
> Date: Sun, 21 Feb 2016 09:14:18 +0100
> 
> Elias Mårtenson writes:
> > Because under the Unicode decomposition rules, ø is not decomposable. I
> > can't explain why that is the case (probably because there is no reason to
> > have a combining /. After all, the only languages that use ø are languages
> > that use it as a character of its own).
> 
> AFAIK, for combining characters to be composable/decomposable the glyphs
> must not overlap.  This is the same issue as with the Polish »ł« to the
> best of my knowledge.

The definitive answer is here, for those interested:

  http://www.unicode.org/mail-arch/unicode-ml/y2016-m02/0106.html

> In other words, Unicode composition/decomposition rules tell you more
> about the glyph construction than they do about useful strategies to
> search for multiple characters.

That conclusion is too radical, IMO.  You will see in the above
message that the criterion you describe was just a means for the UTC
to draw a line somewhere, i.e. it was an ad-hoc rule more than
anything else.

> The idea of using the base character of the canonical decomposition
> in the search might still yield a useful shortcut in most cases, but
> I'm not sure it is correct in all languages even when that
> decomposition exists and, as the examples show, there are cases
> where the non-decomposed character has to be treated specially.

Language-specific tailoring is indeed needed for best results, but the
language-independent decompositions have their place.  E.g., among the
data files of the Unicode Collation Algorithm (UCA) you will see a
file named decomps.txt that is basically a list of decompositions from
UnicodeData.txt with additions specifically for collation, searching,
and matching (including ł, btw).  This tells me that the decomposition
data in UnicodeData.txt is a good basis for these features; it is not
just about glyph construction.
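
To make the distinction concrete, here is a minimal sketch in Python
(not Emacs code, and not how character-folding is implemented) that
folds each character to the base character of its canonical
decomposition, using the standard unicodedata module.  The names
base_char, fold, and EXTRA_FOLDS are made up for illustration, and the
two EXTRA_FOLDS entries are assumed tailorings standing in for the
kind of additions decomps.txt makes; they are not copied from that
file.

```python
import unicodedata

# Hypothetical additions for characters that have no canonical
# decomposition in UnicodeData.txt (assumed for illustration only).
EXTRA_FOLDS = {"ø": "o", "ł": "l"}


def base_char(ch, extra=None):
    """Return the base character of CH's canonical decomposition.

    Characters without a canonical decomposition are returned
    unchanged, unless EXTRA supplies an explicit mapping for them.
    """
    if extra and ch in extra:
        return extra[ch]
    decomp = unicodedata.decomposition(ch)
    # An empty string means "no decomposition"; a leading "<tag>"
    # marks a compatibility decomposition, which is skipped here.
    if not decomp or decomp.startswith("<"):
        return ch
    first = chr(int(decomp.split()[0], 16))
    # Decompositions can nest (e.g. ǘ -> ü -> u), so recurse.
    return base_char(first, extra)


def fold(text, extra=None):
    """Fold every character of TEXT to its base character."""
    return "".join(base_char(c, extra) for c in text)
```

With only the UnicodeData.txt decompositions, fold("é") gives "e"
while fold("ø") leaves the character alone; passing EXTRA_FOLDS as
the second argument is what makes "ø" and "ł" match "o" and "l".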


