bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator
Thu, 29 Jun 2017 17:49:24 +0300
> From: Itai Berli <address@hidden>
> Date: Thu, 29 Jun 2017 12:16:00 +0300
> I'll repeat: according to Unicode a paragraph ends with a paragraph
> separator. What constitutes a paragraph separator is specified precisely
> in section 5.8 'Newline Guidelines' of The Unicode Standard version
> 8.0.0. For instance, on a MacOS X system, it is `LF` (line feed,
> Unicode 000A). The formatting effects of the bidi algorithm must not
> cross the paragraph separator boundary.
> And yet in Emacs the formatting extends beyond the paragraph separator,
> and this is the case on all operating systems. Consider, for instance,
> the following example.
The UBA allows applications to employ "higher-level protocols" when
deciding on base paragraph direction. See section 4.3 in UAX#9 and
specifically clause HL1 there.
This is what Emacs does: it applies its own heuristics for this
decision. The reason is that Emacs's implementation of the
UBA must work reasonably well in plain-text buffers, where typically
long paragraphs are broken into lines by newline characters (which are
paragraph separators according to the UBA), and many times the
partition into lines is done by auto-fill or similar features, thus
making the first character of the next line fairly arbitrary. Using
the UBA paragraph-direction determination would then produce
unacceptable results, whereby the direction of a part of a paragraph
could change in unpredictable ways when text is refilled.
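To make the refill problem concrete, here is a minimal Python sketch. It is not Emacs code: it applies a simplified form of the UBA rules P2-P3 (first strong character wins, directional isolates ignored) to each newline-delimited line. The helper name `first_strong_direction` and the sample text are made up for illustration.

```python
import unicodedata


def first_strong_direction(line: str) -> str:
    """Simplified UAX#9 rules P2-P3: the first strong character decides.

    Assumption: directional isolates are ignored for brevity.
    """
    for ch in line:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL"):
            return "R"
    return "L"  # P3: no strong character found, default to left-to-right


SHALOM = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word, strong R characters
OLAM = "\u05e2\u05d5\u05dc\u05dd"    # Hebrew word, strong R characters

# The same text, filled at two different columns (hypothetical example).
fill_narrow = [f"{SHALOM} {OLAM}", f"Emacs {SHALOM}"]
fill_wide = [f"{SHALOM} {OLAM} Emacs", SHALOM]

narrow_dirs = [first_strong_direction(line) for line in fill_narrow]
wide_dirs = [first_strong_direction(line) for line in fill_wide]
print(narrow_dirs)  # ['R', 'L']
print(wide_dirs)    # ['R', 'R']
```

With the narrow fill, the second line happens to start with a Latin word and flips to left-to-right. With the wide fill, both lines stay right-to-left, although the text itself has not changed.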
> Consider, for instance, a LaTeX document for typesetting Hebrew
> text. Normally in order to eliminate the usual leading indentation of
> the first line of a paragraph, a `\noindent` command is placed at the
> beginning of the paragraph. However, because the Unicode bidi algorithm
> determines the directionality of a paragraph based on its first word, the
> Hebrew text is formatted like English text. This is not a problem; it is
> to be expected.
The Emacs bidirectional display doesn't have special facilities for
marked-up text, such as TeX and HTML/XML. Because those markup
languages use punctuation characters as delimiters, displaying them in
an RTL context can produce unpleasant results in the default display,
as you point out.
You can alleviate this to some extent by (in your case) starting the
paragraph with an RLM control character before \noindent, optionally
followed by an LRM or enclosing \noindent in LRE..PDF (so that the
backslash displays to the left of "noindent"). This is admittedly a
bit awkward, but I think the results are still acceptable.
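As a rough illustration of this workaround, the Python sketch below builds such a paragraph start and checks which strong character comes first. It assumes that the base direction is taken from the first strong character of the paragraph. The helper names `first_strong_class` and `rtl_paragraph_start` are hypothetical; only the control characters themselves come from Unicode.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK (strong R, invisible)
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def first_strong_class(text: str):
    """Return the bidi class of the first strong character, or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("L", "R", "AL"):
            return bidi_class
    return None


def rtl_paragraph_start(command: str, body: str) -> str:
    """Prefix RLM and wrap the markup command in LRE..PDF (hypothetical helper)."""
    return f"{RLM}{LRE}{command}{PDF} {body}"


HEBREW = "\u05e9\u05dc\u05d5\u05dd"  # sample Hebrew word

plain = "\\noindent " + HEBREW
marked = rtl_paragraph_start("\\noindent", HEBREW)

print(first_strong_class(plain))   # L
print(first_strong_class(marked))  # R
```

In the plain version the first strong character is the "n" of "noindent", so the paragraph comes out left-to-right. The leading RLM makes the first strong character right-to-left, and the LRE..PDF pair keeps the backslash and "noindent" together as one left-to-right run.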
I will gladly work with anyone who'd volunteer to introduce features
required to better support markup languages. This will require
low-level display changes and some support from the relevant major
modes to use those features. So far, the demand has been sufficiently low
(I think you are about the second person to raise the issue since
bidirectional display debuted in Emacs 24.1) to keep this issue low on
> One way to resolve this is to explicitly change the directionality of the
> paragraph, however, disregarding the fact that this is not currently
> possible due to a separate Emacs bug, even if it were possible, it would
> affect the placement of the backslash at the beginning of the
> `\noindent` command, which will no longer look like a LaTeX command.
I think my suggestion above fixes this latter issue as well.