[emacs-bidi] Re: Supporting non-plain-text buffers


From: Martin J. Dürst
Subject: [emacs-bidi] Re: Supporting non-plain-text buffers
Date: Thu, 15 Jul 2010 19:49:23 +0900
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.4pre) Gecko/20091214 Eudora/3.0b4

Hello Eli,

Sorry to be late with my reply.

On 2010/07/07 19:59, Eli Zaretskii wrote:
Date: Tue, 06 Jul 2010 16:18:17 +0900
From: "Martin J. Dürst"<address@hidden>
CC: address@hidden

One thing that we should think about is what people want to happen if
there is actual displayable text in some of these strings. I don't have
much of an idea where this is used, but I can imagine that at least in
some usage scenarios, one might want the text added via an overlay to be
rendered in exactly the same way as the text in the buffer.

Can you explain what you mean by the last sentence?  Perhaps an
example will clarify that.

Well, let's assume that there is some arcane file format with settings,
and there is some Emacs lisp that adds additional text with overlays to
make it easier to understand the format. I'm sure there are other use
cases for such strings, otherwise, why would there be before-string and
after-string properties for overlays. Anyway, if there are both RTL and
LTR characters in one of these properties, these texts also need bidi
treatment. Even if there's only RTL, it has to be reordered for display.

There's no argument that text in display strings should be reordered.
I just didn't yet write code to handle that, but it's on my todo.

Okay.

Also, in some cases, the texts in the overlay properties may form units
that are best treated as embeddings (or similar), but in other cases,
they may better be treated as part of the overall text, and that overall
text should be processed with the bidi algorithm.

I don't see any situation that RLE/LRE or RLO/LRO, as part of the
display string itself, won't be able to handle.  Do you?

It depends on where we allow the corresponding PDFs to go: (a) do the PDFs need to be in the same piece of text (or, if a PDF is missing, do we just close the embedding anyway at the end of that piece of text), or (b) are embeddings (and overrides) allowed to span several of these text pieces?

If you mean (b), then we should most probably be covered. If you mean (a), I'm not so sure about it.
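
To make the difference between (a) and (b) concrete, here is a rough Python sketch (not Emacs code, and not a full UAX#9 implementation). It only tracks explicit embedding levels for RLE/LRE/PDF over a list of text pieces. The function names and the `span_pieces` flag are made up for illustration, and overrides and depth limits are left out.

```python
# Simplified sketch: explicit embedding levels only, no overrides, no depth limit.
RLE, LRE, PDF = "\u202b", "\u202a", "\u202c"

def next_level(current, rtl):
    """Least odd (RTL) or least even (LTR) level greater than `current`."""
    if rtl:
        return (current + 1) | 1
    return (current + 2) & ~1

def levels(pieces, span_pieces, base=0):
    """Return one list of (char, level) pairs per piece.

    span_pieces=False models (a): an unclosed embedding ends with its piece.
    span_pieces=True  models (b): the embedding stack carries over to the next piece.
    """
    stack = [base]
    result = []
    for piece in pieces:
        if not span_pieces:
            stack = [base]          # (a): forget anything left open
        out = []
        for ch in piece:
            if ch in (RLE, LRE):
                stack.append(next_level(stack[-1], ch == RLE))
            elif ch == PDF:
                if len(stack) > 1:  # a PDF with nothing to close is ignored
                    stack.pop()
            else:
                out.append((ch, stack[-1]))
        result.append(out)
    return result
```

With the pieces `[RLE + "ab", "cd" + PDF + "e"]`, policy (a) puts "cd" back at the base level and ignores the stray PDF, while policy (b) keeps "cd" inside the embedding opened in the first piece.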


I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.

It's definitely also a viable solution, although there also might be
some tricky issues. Say you have a property defining an embedding from
characters 10 to 30, and another such property from characters 20 to 40.
What exactly is that supposed to mean?

This cannot happen in Emacs, because each property can have only one
value for each character.  In effect, ranges of buffer positions of the
same text property cannot overlap.

I see. But then that would make it rather difficult to define
embeddings, wouldn't it, because you have to include the number of
current embeddings and their orientation in the property.

We may need to specify the base paragraph direction for each such
portion of buffer text, yes.  But that is all; I don't see why we
would need to specify embedding level -- this can be handled with the
existing characters, RLE, RLO, etc.

This again depends on the answer to (a)/(b) above.

IOW, what I thought about was that most of the text would be not
reordered (which is okay, since outside strings and comments, the rest
is strict L2R, mostly even 7-bit ASCII, text).  Only the portions that
have the special property on them will be reordered, and that
reordering will be according to the normal UAX#9 rules.  I still don't
see which use-cases will need something more than this.  And I mean
specific practical use-cases, not hypothetical ones.

I think for something like the C programming language, this view mostly makes sense. But not all programming languages are that easy. As one example, in many programming languages (Perl, Ruby, JavaScript,...), regular expressions are part of the language syntax. There, you can have complex hierarchies of (e.g. RTL) text and syntactic structure.

Also, in several programming languages, there is string interpolation. This means that in the middle of a (let's assume RTL) string, one can go back to code. And then of course in the middle of that code, one can go back to strings. And string interpolation can also be used in regexps.

And then there is also the whole area of PHP, JSP, ASP,... where you have by definition program code in the middle of (Web page) text, and of course that program code can contain text again.

Not myself using any RTL language, I can only guess how users may want to have such constructs displayed. But assuming that the examples in Unicode TR 9 have some actual use, I'd assume that at least some of the people involved, at least in some cases, would prefer structured reordering rather than just piecewise reordering at the lowest level.

This also applies to XML/HTML. Let's take the following example from TR 9:
logical, with some LRE/RLE/PDF: DID YOU SAY ’he said “car MEANS CAR”‘?
With HTML markup:
<p lang='he' dir='rtl'>DID YOU SAY ’<span lang='en' dir='ltr'>he said “<span lang='he' dir='rtl'><span lang='en'>car</span> MEANS CAR</span>”</span>‘?</p>

To take just the innermost part here, would a user want to see
   <span lang='en'>car</span> RAC SNAEM
or would she like to see
   RAC SNAEM <span lang='en'>car</span>
or would she like to see
   RAC SNAEM <span/>car<lang='en' span>
which looks confusing, but maybe not so much if the element name is in RTL, too, which would then give something like
   RAC SNAEM <NAPS/>car<lang='en' NAPS>

So for XML, especially with native markup, we can have the whole gamut from little pieces (if anything) of RTL in a sea of LTR to little pieces (if anything) of LTR in a sea of RTL, and lots of combinations and nestings in the middle.
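
As a rough illustration of how the two logical forms of the TR 9 example above relate (the one with LRE/RLE/PDF and the one with HTML markup), here is a small Python sketch that turns dir attributes into embedding codes. It uses the standard library's html.parser. The class and function names are hypothetical, and it assumes well-formed markup in which every element is explicitly closed. It models the rendered-page view only; it says nothing about how the source text itself should be displayed in an editor.

```python
from html.parser import HTMLParser

LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"

class DirToEmbedding(HTMLParser):
    """Replace elements carrying dir='rtl'/'ltr' by RLE/LRE ... PDF."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.opened = []  # one bool per open element: did it start an embedding?

    def handle_starttag(self, tag, attrs):
        direction = dict(attrs).get("dir")
        if direction == "rtl":
            self.out.append(RLE)
        elif direction == "ltr":
            self.out.append(LRE)
        self.opened.append(direction in ("rtl", "ltr"))

    def handle_endtag(self, tag):
        # Only elements that opened an embedding emit a PDF when they close.
        if self.opened and self.opened.pop():
            self.out.append(PDF)

    def handle_data(self, data):
        self.out.append(data)

def markup_to_logical(markup):
    """Return the text content of `markup` with embedding codes inserted."""
    parser = DirToEmbedding()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Note that the innermost span (lang='en' without dir) contributes no embedding of its own, which is exactly the spot where the display alternatives listed above start to differ.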

E.g.  something like (the characters a-g are just so that there's
something between the formatting codes):

a RLE b LRE c RLE d POP e POP f POP g

would translate into (writing each character on a separate line)

a
b RLE
c RLE LRE
d RLE LRE RLE
e RLE LRE
f RLE
g

Unless you add quite a bit of intermediate library code, this will be
rather inconvenient to handle for an end user.
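
The translation above is mechanical. A minimal Python sketch of it follows (the function name is made up, and the tokens are the same stand-ins as in the example, with POP meaning PDF):

```python
def embedding_properties(tokens):
    """Map each plain character to the tuple of embeddings open around it."""
    stack = []
    result = []
    for tok in tokens:
        if tok in ("RLE", "LRE", "RLO", "LRO"):
            stack.append(tok)          # open a new embedding or override
        elif tok in ("POP", "PDF"):
            if stack:                  # ignore a POP with nothing to close
                stack.pop()
        else:
            result.append((tok, tuple(stack)))
    return result

for char, props in embedding_properties(
        "a RLE b LRE c RLE d POP e POP f POP g".split()):
    print(char, *props)
```

The inconvenient part is not this direction but the reverse: whoever edits the text would have to keep the complete stack on every character consistent by hand.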

I don't understand why this would be needed.  Could you please present
a detailed example where this is needed?

For the actual text being edited, see above. Of course, this specific feature would only be needed if there's no other way to affect display structure.


You mean overlapping properties? In that case, I agree. But if
properties cannot overlap, maybe we should use overlays. As far as I
understand, they can overlap.

Overlays don't scale up well; having lots of them in a buffer slows
down redisplay to an annoyingly low speed.  So I'd rather we didn't,
if we can find another solution.  Again, I still don't see why we
would need this one, and what problems it is supposed to solve.

To go back to the basics, we need a way (to a first approximation, any way may be okay) to tell the display reordering engine where and how it should take into account the syntactic structure of the program/markup being edited.

Whether that can best be done by
1) adding some bidi formatting codes to the text being edited,
2) adding some bidi formatting codes to display text from a property or overlay (before-string and after-string)
3) adding some bidi-specific properties or overlay properties to directly influence bidi reordering
4) some other means
is what I think we have been discussing. I continue to agree with you that 1) is a bad idea. 2) is what Kenichi originally suggested. 3) is what the example above is about.

I don't really mind too much which way we go, but given that I must assume that the bidi algorithm has hierarchically nested embeddings for a reason, and that programming languages and markup languages are in many ways quickly much more nested than natural language (see examples above), I don't think we can easily get away with a simplistic model of "everything is LTR, with an occasional RTL string in it". That might work for a programming language like C, but not for things like Perl, Ruby, PHP, JSP, ASP, HTML, and XML.


I'm not sure I understand, but if it means that the bidi algorithm is
just applied piecewise, that won't be enough. It may be enough for some
simple cases, such as C programs, where the main concern is to keep text
within string constants together, and the rest is ASCII only and
therefore goes LTR. However, on the other hand, with some XML markup
with e.g. element and attribute names in Hebrew, in our experience
actual nestings (i.e. embeddings in terms of the bidi algorithm) are
highly desirable.

Again, an example would go a long way towards explaining what you
mean.  In general, what I wrote does not eliminate the possibility
that embeddings might be used within the reordered parts, nor that the
text outside of the markup is LTR only.

Okay. In the prototype and in the Web-based editor that we have worked
on to display HTML, we typically used embeddings for:
- Elements (incl. start tag and end tag) that have a dir attribute
(which indicates an embedding in the Web page view). These can of course
be nested.
- Start tags (and end tags)
- Attribute/attribute value combinations

Not all of these may be necessary in all cases, but it would be too
complicated to try and figure exactly which ones might be left out in
any particular case, and even this wouldn't eliminate the need for
nested embeddings. And it is at least currently unclear to me how you
could achieve nested embeddings with a possibility to tell the rendering
engine "restrict yourself to this region".

Please show an actual fragment of HTML/XML which needs nesting or
embeddings.

See above.

Or a property that changes the bidi category of a character?

This can be done if we need it, but I still don't see use-cases that
would benefit from such a feature.

Making the characters that define XML syntax, such as <, >, ", ', =, ...
strong LTR would solve a lot (but not all) of the display anomalies for
XML (incl. HTML).

If it doesn't solve all the problems, I'd rather try first to find a
solution that does.

Agreed.

We probably won't want to change the bidi
properties of a character for the entire buffer (because it could be
used elsewhere in the buffer, like in a comment, where we would want
it to be reordered normally).  So this means we would need to use
different tables of bidi properties for different portions of the
text.  Switching bidi properties during display, as it walks the
buffer, is doable, but is somewhat tricky and can raise some hard
problems.

The table lookup might be done beforehand, with Font lock or some similar mechanism, and the result may be carried in properties.
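
A sketch of that idea in Python rather than Emacs Lisp: a pass run ahead of display (standing in for Font lock) is assumed to supply the ranges that are markup syntax, and the bidi class is overridden only there. The override table, the function name, and the span format are all assumptions made for this example; `unicodedata.bidirectional` is the standard library lookup for the default class.

```python
import unicodedata

# Hypothetical override table: XML syntax characters treated as strong LTR.
SYNTAX_OVERRIDES = {ch: "L" for ch in "<>\"'="}

def bidi_classes(text, markup_spans):
    """Return one bidi class per character of `text`.

    markup_spans is a list of half-open (start, end) ranges that an earlier
    pass has marked as markup syntax. Overrides apply only inside those
    ranges; everywhere else the default Unicode class is used.
    """
    in_markup = [False] * len(text)
    for start, end in markup_spans:
        for i in range(start, min(end, len(text))):
            in_markup[i] = True
    classes = []
    for i, ch in enumerate(text):
        if in_markup[i] and ch in SYNTAX_OVERRIDES:
            classes.append(SYNTAX_OVERRIDES[ch])
        else:
            classes.append(unicodedata.bidirectional(ch))
    return classes
```

This way a quote inside a tag becomes strong LTR while the same quote in running text or in a comment keeps its default neutral class, without switching tables during display.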


The fact that it is not a comprehensive solution makes me
even more reluctant to use it.

I agree with that. It may be possible to make it a comprehensive solution if we can use LRE/RLE/.../PDF as a bidi property (i.e. say that a plain old character also works as an LRE/RLE/.../PDF). But that may not be enough; we might have to go as far as being able to attach a sequence of bidi properties to a single character. Not exactly pretty :-(.


It might solve all display anomalies for programming languages like C to
define " (for strings) and comment start/end as LTR (at least as long as
there are no RTL identifiers).

But quotes can appear in the comments as well, so I think here, too,
we won't be able to use the same properties for the entire buffer.

True.

Covering each string, excluding its quotes, with a special text
property, and the same with a comment (excluding the comment
start/end) sounds a simpler solution.

This works very nicely if there is no nesting. If you can tell me for sure that nobody working with Perl, Ruby, PHP, JSP, ASP, HTML, XML,... will prefer nested bidi reordering for some cases, that might solve the problem. But I wouldn't want to make such an assertion.
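
For the non-nested case, finding the ranges to cover is simple. The following Python sketch locates the bodies of C-style double-quoted string literals, excluding the quotes. The regular expression and function name are only illustrative, and comments and character literals are deliberately not handled.

```python
import re

# A double-quoted literal on one line; backslash escapes are allowed inside.
STRING_LITERAL = re.compile(r'"((?:[^"\\\n]|\\.)*)"')

def string_body_spans(source):
    """Return half-open (start, end) spans of string bodies, quotes excluded."""
    return [m.span(1) for m in STRING_LITERAL.finditer(source)]
```

A flat list of spans like this has no way to express code inside a string inside code, which is the string interpolation and PHP/JSP/ASP situation described earlier.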


Regards,    Martin.


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:address@hidden


