Re: Unicode confusables and reordering characters considered harmful

From: Stefan Kangas
Subject: Re: Unicode confusables and reordering characters considered harmful
Date: Wed, 3 Nov 2021 12:19:58 +0100

Gregory Heytings <gregory@heytings.org> writes:

> There's some data that shows that this is extremely rare in general: the
> Rust Security Response WG analyzed the 70322 crates and found only 5 in
> which these codepoints were present (see [1]).  That's ~0.01 %.
> Moreover such highlighting does not make the source code or text
> unreadable, even in those few legitimate cases.

Depending on how you define it, there is at least one major world
language (Arabic) that is written in an RTL script, and other major
languages such as Urdu, Farsi and Hebrew are written right-to-left as
well (as are a couple of others).  So I think we should consider to
what extent your proposal might hurt users of such languages.

Are these characters important for writing comments and strings in any
of those languages?  Will your proposal make it harder to type in such
languages?  If yes, are there less invasive solutions?

The Rust data point is relevant, but in my opinion not sufficient to
outweigh the above considerations.  And even if it were, we would
still need to consider other programming languages, such as C,
Fortran, PHP, JavaScript, etc.  We are, after all, talking about
hundreds of millions of native speakers of the languages mentioned
above, a certain proportion of whom will be Emacs users interested in
writing strings and comments in their own language.

