emacs-devel

From: Daniel Brooks
Subject: Unicode confusables and reordering characters considered harmful, a simple solution
Date: Tue, 02 Nov 2021 14:28:16 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Cc: stefan@marxist.se,  cpitclaudel@gmail.com,  emacs-devel@gnu.org
>> Date: Tue, 02 Nov 2021 15:47:27 -0400
>> 
>> > In most cases, there's no need to make these controls stand out,
>> > because situations where this presents security risks are extremely
>> > rare, to put it mildly, and OTOH having them stand out more by default
>> > will make it harder to read text with completely legitimate uses of
>> > these controls (example: TUTORIAL.he).
>> 
>> Fully agreed.  That's the problem: how to define the problematic cases
>> in a precise enough way that it doesn't rule out lots of
>> legitimate cases.
>
> That's what bidi-find-overridden-directionality already does, albeit
> not yet for the specific examples in that paper.  But Someone™ should
> write a minor mode or an optional display feature which uses that
> function to highlight the problematic stretches of text on display,
> using the function's output for finding such stretches of text.

We already have it; it is called whitespace-mode. It’s not perfect, but
this morning I customized mine to make these characters more obvious:

(custom-set-variables
 '(whitespace-display-mappings
   '((space-mark 32 [183] [46])
     (space-mark 160 [164] [95])
     (newline-mark 10 [36 10])
     (tab-mark 9 [187 9] [92 9])
     (space-mark #x202A [#x21D2]) ; ⇒ LEFT-TO-RIGHT EMBEDDING
     (space-mark #x202B [#x21D0]) ; ⇐ RIGHT-TO-LEFT EMBEDDING
     (space-mark #x202D [#x2192]) ; → LEFT-TO-RIGHT OVERRIDE
     (space-mark #x202E [#x2190]) ; ← RIGHT-TO-LEFT OVERRIDE
     (space-mark #x2066 [#x21E5]) ; ⇥ LEFT-TO-RIGHT ISOLATE
     (space-mark #x2067 [#x21E4]) ; ⇤ RIGHT-TO-LEFT ISOLATE
     (space-mark #x2068 [#x21A7]) ; ↧ FIRST STRONG ISOLATE
     (space-mark #x202C [#x21D1]) ; ⇑ POP DIRECTIONAL FORMATTING
     (space-mark #x2069 [#x2912]) ; ⤒ POP DIRECTIONAL ISOLATE
     )))

I didn’t spend much time thinking about which arrows to pick; these
seemed right to me. All of the entries use space-mark as their kind, but
I would like to extend whitespace-mode with a new kind specifically for
these characters, so that I can give them a custom face as well.
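
Until whitespace-mode has such a kind, a custom face can be had
separately through font-lock. What follows is only a minimal sketch, not
a proposed patch: the face, the regexp constant, and the minor mode are
hypothetical names made up for illustration, and how visible the result
is depends on how your Emacs displays these otherwise glyphless
characters.

```elisp
;; Hypothetical names throughout; nothing here exists in Emacs itself.
(defface my-bidi-control-face
  '((t :background "red" :foreground "white"))
  "Face for bidi embedding, override, and isolate control characters.")

(defconst my-bidi-control-regexp
  ;; U+202A..U+202E (LRE RLE PDF LRO RLO) and U+2066..U+2069 (LRI RLI FSI PDI)
  "[\u202A-\u202E\u2066-\u2069]"
  "Regexp matching the nine directional formatting characters.")

(define-minor-mode my-bidi-control-highlight-mode
  "Highlight bidi control characters in the current buffer."
  :lighter " Bidi!"
  (let ((keywords
         ;; `prepend' applies the face even inside comments and strings.
         `((,my-bidi-control-regexp 0 'my-bidi-control-face prepend))))
    (if my-bidi-control-highlight-mode
        (font-lock-add-keywords nil keywords 'append)
      (font-lock-remove-keywords nil keywords))
    (font-lock-flush)))
```

Because it only adds a font-lock keyword, this can be used alongside the
display mappings above rather than instead of them.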

Here is some sample Lisp code that I tried it on:

(defun main ()
  (let ((is_admin nil))
    ‮⁦ ; begin admins only⁩⁦(when is_admin
      (print "You are an admin."))‮⁦ ; end admins only⁩(
)

Syntax highlighting is certainly a big clue that something is odd about
this code, as the conditional is displayed in the comment face. It was,
however, a nice little puzzle to work out the permutation of characters
that I wanted.
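
When working out such a permutation it helps to see the logical order of
the characters, with the invisible ones spelled out. A small sketch of a
helper for that; the function and the alist are hypothetical names, and
the tags are just the usual Unicode abbreviations for these characters.

```elisp
;; Hypothetical helper names; the abbreviations are the standard ones.
(defconst my-bidi-control-names
  '((#x202A . "LRE") (#x202B . "RLE") (#x202C . "PDF")
    (#x202D . "LRO") (#x202E . "RLO")
    (#x2066 . "LRI") (#x2067 . "RLI") (#x2068 . "FSI") (#x2069 . "PDI"))
  "Alist mapping bidi control characters to their abbreviations.")

(defun my-bidi-reveal (string)
  "Return STRING with each bidi control replaced by a tag such as <RLO>."
  (replace-regexp-in-string
   "[\u202A-\u202E\u2066-\u2069]"
   (lambda (match)
     ;; MATCH is a one-character string; look up its abbreviation.
     (format "<%s>" (cdr (assq (aref match 0) my-bidi-control-names))))
   string t t))

;; A comment wrapped in an override plus an isolate, in logical order:
(my-bidi-reveal "x\u202E\u2066 ; comment\u2069y")
;; → "x<RLO><LRI> ; comment<PDI>y"
```

Calling it on a line of the buffer, for example with
(my-bidi-reveal (thing-at-point 'line t)), shows where each control
actually sits.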

I will note, however, that Elisp, as currently implemented, is probably
immune to this attack. The directional characters are incorrectly¹
treated as identifiers when they are outside of a comment, so if you
actually run this you will get a void-variable error. That is very
confusing at first, because the variable name is invisible. Great fun.
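
A quick way to check this behaviour without the full example is to read
a lone directional character and evaluate the result. This is a sketch
that assumes the reader behaves as described above (as it does in the
Emacs I am running); the function name is made up.

```elisp
;; Hypothetical demo function; assumes the reader accepts U+2066 as a
;; symbol constituent, as described above.
(defun my-bidi-read-demo ()
  "Read a lone LEFT-TO-RIGHT ISOLATE and try to evaluate it.
Return a list (IS-SYMBOL OUTCOME), where OUTCOME is the symbol
`void-variable' if evaluating the form signalled that error."
  (let* ((form (car (read-from-string "\u2066")))
         (outcome (condition-case nil
                      (eval form)
                    (void-variable 'void-variable))))
    (list (symbolp form) outcome)))
```

If the first element is t and the second is void-variable, the character
was read as a symbol with an invisible name, which is exactly the
confusing error mentioned above.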

I suggest that we include something along these lines in Emacs, and turn
on whitespace-mode by default in all programming modes. If I recall
correctly, the default configuration of whitespace-mode is fairly
inoffensive. I would recommend keeping it that way, except that the face
for bidi control characters should be pretty obvious; perhaps a red
background or something.

By only enabling it by default in programming modes, we avoid bothering
users of prose-oriented modes where using these characters is
benign. Maybe we could have an override for programming languages such
as Elisp that we think are immune to this attack, but I don’t really
think we need to go that far.
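
For anyone who wants to try this today, a minimal sketch of an init-file
fragment, assuming you want only the bidi marks and none of the other
whitespace visualization. It reuses the arrows from my mappings above
and only substitutes glyphs; it does not give the characters a face of
their own.

```elisp
;; Show only display-mapping marks, and only for the bidi controls.
(setq whitespace-style '(space-mark))
(setq whitespace-display-mappings
      '((space-mark #x202A [#x21D2])   ; LEFT-TO-RIGHT EMBEDDING
        (space-mark #x202B [#x21D0])   ; RIGHT-TO-LEFT EMBEDDING
        (space-mark #x202D [#x2192])   ; LEFT-TO-RIGHT OVERRIDE
        (space-mark #x202E [#x2190])   ; RIGHT-TO-LEFT OVERRIDE
        (space-mark #x2066 [#x21E5])   ; LEFT-TO-RIGHT ISOLATE
        (space-mark #x2067 [#x21E4])   ; RIGHT-TO-LEFT ISOLATE
        (space-mark #x2068 [#x21A7])   ; FIRST STRONG ISOLATE
        (space-mark #x202C [#x21D1])   ; POP DIRECTIONAL FORMATTING
        (space-mark #x2069 [#x2912]))) ; POP DIRECTIONAL ISOLATE

;; Programming modes derive from prog-mode, so one hook covers them;
;; text-mode and its descendants are left alone.
(add-hook 'prog-mode-hook #'whitespace-mode)
```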

db48x

¹ I say that this is incorrect because Unicode classifies them as
format characters (general category Cf, with the Bidi_Control property)
rather than as letters or numbers. The Elisp specification, such as it
exists, probably doesn’t say anything about them.


