From: Daniel Brooks
Subject: Re: Unicode confusables and reordering characters considered harmful, a simple solution
Date: Fri, 05 Nov 2021 17:54:37 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Gregory Heytings <gregory@heytings.org> writes:

> This "I consider" is the problem of your approach.  Malevolent actors
> are always more inventive, and will find a way to escape the safety
> net you created.


> The cases you consider suspicious are cases where the directionality
> of one or more characters is overridden by reordering control
> characters, but this is not what the "Trojan Source" paper is about.
> The problem it points to is much broader, it's about using these
> invisible control characters to make the source code appear different
> to a human reader and to a compiler.

Specifically, reordering the source so that something which is inside of
a comment or string appears to be outside of it, or vice versa.

However, as you say, arbitrary rearrangement is on the table. The paper
specifically mentions that the line can be treated as an anagram, and
the characters rearranged into an arbitrary order. It would be fun to
find a nice example where one enum variant was substituted for another,
with no string or comment on the line to supply the necessary
characters. It would require enum variants whose names are anagrams…
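
As a rough aid for hunting such an example, here is a small sketch that
groups identifiers by their sorted characters. The variant names are
invented for illustration, and the comparison is case-sensitive on the
assumption that reordering changes only the order of the characters, not
the characters themselves.

```python
from collections import defaultdict


def anagram_groups(names):
    """Group names that use exactly the same multiset of characters."""
    groups = defaultdict(list)
    for name in names:
        # Anagrams share the same sorted character sequence.
        groups["".join(sorted(name))].append(name)
    # Only groups with at least two members are interesting.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical enum variant names, not taken from any real code base.
variants = ["ALERT", "LATER", "ALTER", "OTHER"]
candidates = anagram_groups(variants)
```

Any group it returns is a set of names that a pure reordering could, in
principle, turn into one another on screen.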

> In fact, it did not take me much time to create a case that your
> algorithm doesn't detect (and AFAIU cannot detect without also
> displaying warnings about many legitimate uses).  I attach the example
> code, how that code is displayed by Emacs, and how that code would be
> displayed with the patch I proposed.
>
> #define is_restricted_user(user)                            \
>   !strcmp (user, "root") ? 0 :                                      \
>   !strcmp (user, "admin") ? 0 :                                     \
>   !strcmp (user, "superuser‮⁦? 0 : 1⁩ ⁦")

I love this example.

I think that it can be detected, though. As the paper says, we should be
on the lookout for unterminated overrides. This example has a
LEFT-TO-RIGHT ISOLATE that is never closed by a POP DIRECTIONAL
ISOLATE; it thus stays in effect long enough to reach the string
delimiter.
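
A minimal sketch of that check, assuming we scan one line at a time and
know nothing about the language's syntax. The function name and the
column reporting are my own invention, not anything from the paper or
from Emacs; the pairing rules follow my reading of UAX #9.

```python
# Bidi control characters from the Unicode Bidirectional Algorithm (UAX #9).
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

ISOLATE_OPENERS = {LRI, RLI, FSI}
EMBEDDING_OPENERS = {LRE, RLE, LRO, RLO}


def unterminated_bidi_controls(line):
    """Return (column, char) pairs for openers still open at end of line."""
    stack = []
    for col, ch in enumerate(line):
        if ch in ISOLATE_OPENERS or ch in EMBEDDING_OPENERS:
            stack.append((col, ch))
        elif ch == PDI:
            # PDI closes the nearest open isolate, along with any
            # embeddings or overrides opened inside it. With no open
            # isolate it has no effect.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][1] in ISOLATE_OPENERS:
                    del stack[i:]
                    break
        elif ch == PDF:
            # PDF closes the most recent embedding or override, but it
            # cannot reach past an isolate that is still open.
            if stack and stack[-1][1] in EMBEDDING_OPENERS:
                stack.pop()
    return stack


# Last line of the quoted macro, with the controls written as escapes.
sample = '  !strcmp (user, "superuser\u202e\u2066? 0 : 1\u2069 \u2066")'
leftover = unterminated_bidi_controls(sample)
```

On that line the sketch reports two leftovers: the final LEFT-TO-RIGHT
ISOLATE and also the RIGHT-TO-LEFT OVERRIDE before it, which never gets
its POP DIRECTIONAL FORMATTING. Both are ended only by the end of the
line.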

Personally I don’t mind detecting these sorts of errors, as long as we
recognize that we cannot reliably do so unless we also know the syntax
of the language; not every language terminates a string the same
way. Imagine this were Perl, and we were manipulating not a
double-quoted string but a q{}, a qx{}, or worse: a regex match
(m//). Recall that regex matches can use arbitrary punctuation
characters as delimiters; m[] is just as valid as m//. But perhaps it
would suffice to find isolates which are only terminated by a newline