Re: On language-dependent defaults for character-folding

On 20 February 2016 at 03:18, Eli Zaretskii <address@hidden> wrote:

> Date: Fri, 19 Feb 2016 21:37:26 +0800
> From: Elias Mårtenson <address@hidden>
> Cc: Lars Ingebrigtsen <address@hidden>, emacs-devel <address@hidden>
>
> For example, if the buffer includes ñ (2 characters), should "C-s n"
> find the n in it?
>
> That depends on the locale of the user.

There are use cases that are independent of the locale. For example,
imagine that you need to find all the literal n characters in a buffer
because you are investigating a bug in the program that produced that
buffer. As an Emacs user, I need to do such jobs almost every day. I
don't want the results affected by the locale.

Of course I'm not saying that you should not be able to do this. All I'm advocating here is sensible defaults.

> However, from the point of view of a user, there should not be a visible
> difference between the precomposed and the composed variants; they are the
> exact same character.

What if the user wants to find all those places where what looks like
ñ is actually ñ? Wouldn't that be a valid use case?

It would, but certainly a very rare one. For all intents and purposes the two forms are (should be) equivalent.

The reference you are looking for is the Unicode Standard itself. It
says to use the normalization forms, see for example section 5.16
there.

I have read that section before, and I have now read it again. The section certainly talks about searches that ignore diacritics, but it does not discuss a method for doing so. There is also a reference to TR29, but that refers to grapheme clusters, which would be a very strange way to do character folding (Koreans would be very confused).

Every character-folding search implementation decomposes characters
before matching them. So does Emacs. We didn't invent this, and we
certainly didn't use the decompositions where they weren't supposed to
be used. It's not a trick, it's what everyone else does to do the
job. See the ICU library, for example.
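
The decompose-then-match approach described above can be sketched in a few lines of Python. This is only an illustration using the standard unicodedata module, not the actual Emacs or ICU implementation; the helper names fold and folded_find are made up for this example.

```python
import unicodedata

def fold(text):
    """Decompose to NFD, then drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def folded_find(needle, haystack):
    """Return the index of needle in the folded haystack, or -1.

    Note: the index refers to the folded text, not the original buffer.
    """
    return fold(haystack).find(fold(needle))

# Both representations of "mañana" fold to the same plain string.
print(fold("ma\u00f1ana") == fold("man\u0303ana"))
```

With this kind of folding, a search for n matches ñ regardless of locale, which is exactly the default behaviour under discussion.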

Every example you have given so far discusses decomposition equivalence, i.e. the fact that the two variants of ñ are the same. Section 5.16 discusses the _concept_ of allowing n and ñ to match, but the mechanism to do so is locale-dependent. This is what Unicode says, and that is what I say. My position is simply that the default (if absolutely nothing else overrides it) should be chosen to take the locale of the user into account.

> The decompositions are used in the normalisation forms to ensure that the two variants are treated equally
> (such as the two alternative representations of ñ that we have been discussing).

Yes, and any character-folding search uses normalization forms as
well.

Yes, but that's not what normalisation forms were designed to do.

Again (I really apologise for repeating myself, I'm starting to sound like a troll and that is truly not my intention), the purpose of normalisation forms is to ensure that the two variants of ñ compare the same. They are not designed to provide a mechanism to allow n to compare equal to ñ.
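
To make the distinction concrete, here is a small Python sketch using the standard unicodedata module. The function name canonically_equal is hypothetical and only serves this illustration.

```python
import unicodedata

precomposed = "\u00f1"   # ñ as a single code point
decomposed = "n\u0303"   # n followed by COMBINING TILDE

def canonically_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw strings differ, but normalisation makes the two ñ forms equal.
print(precomposed == decomposed)
print(canonically_equal(precomposed, decomposed))

# Normalisation alone does not make plain n equal to ñ.
print(canonically_equal("n", precomposed))
```

Normalisation handles the first comparison; making the last one succeed is a separate folding step layered on top.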

> Yes. I am fully aware of this. But so be it. Having applications work differently depending on the locale of the
> environment the application was started in is nothing new.

It's not new. It's old. We should move on to more general
environments that support multiple languages. Emacs is such an
environment. The old l10n paradigms are fundamentally incompatible
with that.

Sure, but doesn't it make sense to fall back to the user's default if the buffer does not have an overriding locale?

> Being a multi-lingual environment, Emacs has no real notion of the
> locale.
>
> Perhaps it should?

That'd be a step backward, IMO.

As opposed to having no concept of locale at all? I just have to disagree with you on that.

Strange, I always thought the data was there. Perhaps you should ask
a question on the Unicode mailing list, then.

That's a good idea actually. Thank you for the suggestion. I'm reading that mailing list, and I will post a question there.

Regards,

Elias

From:	Elias Mårtenson
Subject:	Re: On language-dependent defaults for character-folding
Date:	Sat, 20 Feb 2016 13:22:57 +0800