Re: Unicode confusables and reordering characters considered harmful, a

From: Eli Zaretskii
Subject: Re: Unicode confusables and reordering characters considered harmful, a simple solution
Date: Fri, 05 Nov 2021 16:19:56 +0200

> From: Stefan Kangas <stefan@marxist.se>
> Date: Fri, 5 Nov 2021 06:08:42 -0700
> Cc: db48x@db48x.net, cpitclaudel@gmail.com, emacs-devel@gnu.org, 
>       monnier@iro.umontreal.ca, yuri.v.khan@gmail.com
> I didn't study `bidi-find-overridden-directionality' yet, but the
> "Trojan Source" paper writes:
>     "By banning all directionality-control characters, users with
>     legitimate Bidi-override use cases in comments are penalized.
>     Therefore, a better defense might be to ban the use of
>     _unterminated_ Bidi override characters within string literals and
>     comments.  By ensuring that each override is terminated – that is,
>     for example, that every LRI has a matching PDI – it becomes
>     impossible to distort legitimate source code outside of string
>     literals and comments."  (p. 8, their emphasis)
> So, IIUC, the problematic cases are "unterminated Bidi override
> characters", and those are the ones worth warning about.  Does that
> sound correct to you?

No.  What they say is simply wrong: such unterminated overrides and
embeddings are perfectly valid.  The Unicode Bidirectional Algorithm
(UBA) mandates (https://unicode.org/reports/tr9/#X8):

  X8. All explicit directional embeddings, overrides and isolates are
  completely terminated at the end of each paragraph.

      Explicit paragraph separators (bidirectional character type B)
      indicate the end of a paragraph. As such, they are not included in
      any embedding, override or isolate. They are simply assigned the
      paragraph embedding level.
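
As a rough illustration of what X8 implies, here is a minimal Python
sketch.  It is not the UBA and not the Emacs implementation: the
function name open_controls_per_paragraph is hypothetical, the PDF/PDI
matching is deliberately simplified, depth limits are ignored, and
CR LF is not treated as a single separator.  It only shows that
whatever explicit controls are still open when a character of bidi
class B is reached are discarded there, so nothing carries over into
the next paragraph.

```python
import unicodedata

# Explicit directional formatting characters from UAX #9.
EMBEDDINGS_AND_OVERRIDES = {
    "\u202A": "LRE", "\u202B": "RLE",
    "\u202D": "LRO", "\u202E": "RLO",
}
ISOLATES = {"\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI"}
INITIATORS = {**EMBEDDINGS_AND_OVERRIDES, **ISOLATES}
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def open_controls_per_paragraph(text):
    """Return, for each paragraph, the controls still open at its end.

    Simplified model (an assumption of this sketch): PDF closes the
    innermost initiator only if it is an embedding or override; PDI
    closes everything back to and including the innermost isolate.
    """
    paragraphs = []
    stack = []
    seen_chars = False
    for ch in text:
        if unicodedata.bidirectional(ch) == "B":
            # Rule X8: the paragraph separator terminates everything.
            paragraphs.append([INITIATORS[c] for c in stack])
            stack = []
            seen_chars = False
            continue
        seen_chars = True
        if ch in INITIATORS:
            stack.append(ch)
        elif ch == PDF:
            if stack and stack[-1] in EMBEDDINGS_AND_OVERRIDES:
                stack.pop()
        elif ch == PDI:
            if any(c in ISOLATES for c in stack):
                while stack.pop() not in ISOLATES:
                    pass
    if seen_chars:
        # End of text also ends the last paragraph.
        paragraphs.append([INITIATORS[c] for c in stack])
    return paragraphs


print(open_controls_per_paragraph("a \u202E b\nc \u2066 d \u2069\n"))
```

In this input the first line leaves an RLO open at its end, and the
second line still starts with no controls in effect.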

And in https://unicode.org/reports/tr9/#Bidirectional_Character_Types
you can see that newline is one of the characters whose bidi type is
B; compare:

  (get-char-code-property ?\n 'bidi-class) => B

So when the UBA says "at the end of each paragraph", it means in
practice at EOL, since all the other paragraph separators are rarely
if ever used in human-readable text.  (And Emacs, of course,
implements that rule.)
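
The set of paragraph separators can also be cross-checked outside
Emacs.  The sketch below uses Python's standard unicodedata module to
list every code point whose bidi class is B; the helper name
paragraph_separators is hypothetical, and the result reflects the
Unicode version bundled with the Python in use, which may differ from
the one Emacs ships.

```python
import sys
import unicodedata


def paragraph_separators():
    """Code points whose Bidi_Class is B in this Python's Unicode data."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.bidirectional(chr(cp)) == "B"]


for cp in paragraph_separators():
    # Control characters have no name in the database, hence the default.
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<control>')}")
```

Newline (U+000A) appears in the output alongside a handful of control
characters and U+2029 PARAGRAPH SEPARATOR, while U+2028 LINE SEPARATOR
does not, since its bidi class is WS.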

The authors of the paper simply don't understand the bidi stuff well
enough to make useful proposals about this.  They should have brought
this up on the Unicode mailing list, where at least the experts (and I
don't mean myself, I mean the people who wrote the UBA) could set them
straight.

I encourage you to read the comments in the implementation I wrote, to
see which cases I consider "suspicious".  The comments need to be read
with the UBA spec in mind, at least its Xn rules.  I will be happy to
explain or clarify if something is unclear there.  This is a complex
issue, and discussing it rationally could really enhance our
understanding and handling of these cases.

> > Adding one line is a nuisance.  If it can be avoided, we should avoid
> > it.  Since we are capable of detecting the really suspicious uses of
> > those controls, it is much better to use that, because in that case
> > users will not have to add anything.
> I agree that it does sound better to prefer such an approach if
> possible.

Then let's try to implement that.  If there's a need for more
bidi-specific infrastructure, let me know and I will see what I can
do.

