[emacs-bidi] Re: Supporting non-plain-text buffers

From:	Martin J. Dürst
Subject:	[emacs-bidi] Re: Supporting non-plain-text buffers
Date:	Thu, 15 Jul 2010 19:49:23 +0900
User-agent:	Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.4pre) Gecko/20091214 Eudora/3.0b4

Hello Eli,

Sorry to be late with my reply.

On 2010/07/07 19:59, Eli Zaretskii wrote:

Date: Tue, 06 Jul 2010 16:18:17 +0900
From: "Martin J. Dürst"<address@hidden>
CC: address@hidden

One thing that we should think about is what people want to happen if
there is actual displayable text in some of these strings. I don't have
much of an idea where this is used, but I can imagine that at least in
some usage scenarios, one might want the text added via an overlay to be
rendered in exactly the same way as the text in the buffer.


Can you explain what do you mean by the last sentence?  Perhaps an
example will clarify that.


Well, let's assume that there is some arcane file format with settings,
and there is some Emacs lisp that adds additional text with overlays to
make it easier to understand the format. I'm sure there are other use
cases for such strings, otherwise, why would there be before-string and
after-string properties for overlays. Anyway, if there are both RTL and
LTR characters in one of these properties, these texts also need bidi
treatment. Even if there's only RTL, it has to be reordered for display.


There's no argument that text in display strings should be reordered.
I just didn't yet write code to handle that, but it's on my todo.


Okay.

Also, in some cases, the texts in the overlay properties may form units
that are best treated as embeddings (or similar), but in other cases,
they may better be treated as part of the overall text, and that overall
text should be processed with the bidi algorithm.


I don't see any situation that RLE/LRE or RLO/LRO, as part of the
display string itself, won't be able to handle.  Do you?

It depends on where we allow the corresponding PDFs to go. (a) do the
PDFs need to be in the same piece of text (or, if a PDF is missing, do
we just close the embedding anyway at the end of that piece of text), or
(b) are embeddings (and overrides) allowed to span several of these text
pieces?

If you mean (b), then we should most probably be covered. If you mean
(a), I'm not so sure about it.
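
To make the (a)/(b) distinction concrete, here is a minimal sketch in
Python (not Emacs Lisp). It only tracks the embedding depth at which
each text piece starts; the function name and the `span_pieces` flag are
made up for illustration and do not correspond to anything in Emacs.

```python
OPENERS = ("LRE", "RLE", "LRO", "RLO")

def depth_at_piece_starts(pieces, span_pieces):
    """Return the embedding depth in effect at the start of each piece.

    span_pieces=False models policy (a): anything still open is closed
    at the end of the piece. span_pieces=True models policy (b):
    embeddings and overrides carry over into the next piece.
    """
    depth = 0
    starts = []
    for piece in pieces:
        if not span_pieces:
            depth = 0  # policy (a): every piece starts from scratch
        starts.append(depth)
        for tok in piece:
            if tok in OPENERS:
                depth += 1
            elif tok == "PDF" and depth > 0:
                depth -= 1  # a PDF with nothing open is ignored
    return starts

# An embedding opened in the first piece and closed in the second.
pieces = [["RLE", "x"], ["y", "PDF"], ["z"]]
print(depth_at_piece_starts(pieces, span_pieces=False))
print(depth_at_piece_starts(pieces, span_pieces=True))
```

Under (a) the second piece starts at depth 0 and its PDF has nothing to
close; under (b) it starts at depth 1, inside the RLE from the first
piece.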

I think having a special text property that covers the text
that needs to be reordered is a cleaner solution.


It's definitely also a viable solution, although there also might be
some tricky issues. Say you have a property defining an embedding from
characters 10 to 30, and another such property from characters 20 to 40.
What exactly is that supposed to mean?


This cannot happen in Emacs, because each property can have only one
value for each character.  In effect, ranges of buffer positions of the
same text property cannot overlap.
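
A minimal Python model of this behaviour (not Emacs code; `put_property`
and `runs` are hypothetical helper names) treats a property as one value
per buffer position. Setting it on 10..30 and then on 20..40 simply
overwrites the overlap, so the result is two adjacent runs, not two
overlapping ones:

```python
def put_property(values, start, end, value):
    """Set the property to `value` on positions start..end-1."""
    for pos in range(start, end):
        values[pos] = value

def runs(values):
    """Collapse a position->value dict into (start, end, value) runs."""
    out = []
    for pos in sorted(values):
        if out and out[-1][1] == pos and out[-1][2] == values[pos]:
            # Same value continues: extend the current run.
            out[-1] = (out[-1][0], pos + 1, values[pos])
        else:
            out.append((pos, pos + 1, values[pos]))
    return out

embedding = {}
put_property(embedding, 10, 30, "first")
put_property(embedding, 20, 40, "second")
print(runs(embedding))
```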


I see. But then that would make it rather difficult to define
embeddings, wouldn't it, because you have to include the number of
current embeddings and their orientation in the property.


We may need to specify the base paragraph direction for each such
portion of buffer text, yes.  But that is all; I don't see why we
would need to specify embedding level -- this can be handled with the
existing characters, RLE, RLO, etc.


This again depends on the answer to (a)/(b) above.

IOW, what I thought about was that most of the text would be not
reordered (which is okay, since outside strings and comments, the rest
is strict L2R, mostly even 7-bit ASCII, text).  Only the portions that
have the special property on them will be reordered, and that
reordering will be according to the normal UAX#9 rules.  I still don't
see which use-cases will need something more than this.  And I mean
specific practical use-cases, not hypothetical ones.

I think for something like the C programming language, this view mostly
makes sense. But not all programming languages are that easy. As one
example, in many programming languages (Perl, Ruby, JavaScript, ...),
regular expressions are part of the language syntax. There, you can have
complex hierarchies of (e.g. RTL) text and syntactic structure.

Also, in several programming languages, there is string interpolation.
This means that in the middle of a (let's assume RTL) string, one can go
back to code. And then of course in the middle of that code, one can go
back to strings. And string interpolation can also be used in regexps.

And then there is also the whole area of PHP, JSP, ASP, ... where you
have by definition program code in the middle of (Web page) text, and of
course that program code can contain text again.

Not myself using any RTL language, I can only guess how users may want
to have such constructs displayed. But assuming that the examples in
Unicode TR 9 have some actual use, I'd assume that at least some of the
people involved, at least in some cases, would prefer structured
reordering rather than just piecewise reordering at the lowest level.


This also applies to XML/HTML. Let's take the following example from TR 9:
logical, with some LRE/RLE/PDF: DID YOU SAY ’he said “car MEANS CAR”‘?
With HTML markup:

DID YOU SAY ’he said“car MEANSCAR”‘?


To take just the innermost part here, would a user want to see
   <span lang='en'>car</span> RAC SNAEM
or would she like to see
   RAC SNAEM <span lang='en'>car</span>
or would she like to see
   RAC SNAEM <span/>car<lang='en' span>

which looks confusing, but maybe not so much if the element name is in
RTL, too, which would then give something like

   RAC SNAEM <NAPS/>car<lang='en' NAPS>

So for XML, especially with native markup, we can have the whole gamut
from little pieces (if anything) of RTL in a sea of LTR to little pieces
(if anything) of LTR in a sea of RTL, and lots of combinations and
nestings in the middle.

E.g.  something like (the characters a-g are just so that there's
something between the formatting codes):

a RLE b LRE c RLE d POP e POP f POP g

would translate into (writing each character on a separate line)

a
b RLE
c RLE LRE
d RLE LRE RLE
e RLE LRE
f RLE
g

Unless you add quite a bit of intermediate library code, this will be
rather inconvenient to handle for an end user.
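
The translation itself is mechanical. A minimal sketch in Python (not
Emacs Lisp; the token names RLE, LRE and POP are taken from the example
above, and `embedding_stacks` is a made-up name) that produces the
per-character listing:

```python
def embedding_stacks(tokens):
    """Pair each ordinary character with the embeddings open around it."""
    stack = []
    result = []
    for tok in tokens:
        if tok in ("RLE", "LRE"):
            stack.append(tok)       # open a new embedding
        elif tok == "POP":
            if stack:
                stack.pop()         # close the innermost embedding
        else:
            result.append((tok, tuple(stack)))
    return result

example = "a RLE b LRE c RLE d POP e POP f POP g".split()
for ch, stack in embedding_stacks(example):
    print(ch, *stack)
```

The inconvenient part is the other direction: every edit inside a nested
region would have to keep the whole stack on each character consistent.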


I don't understand why this would be needed.  Could you please present
a detailed example where this is needed?

For the actual text being edited, see above. Of course, this specific
feature would only be needed if there's no other way to affect display
structure.

You mean overlapping properties? In that case, I agree. But if
properties cannot overlap, maybe we should use overlays. As far as I
understand, they can overlap.


Overlays don't scale up well; having lots of them in a buffer slows
down redisplay to an annoyingly low speed.  So I'd rather we didn't,
if we can find another solution.  Again, I still don't see why we
would need this one, and what problems it is supposed to solve.

To go back to the basics, we need a way (on first approximation, any way
may be okay) to tell the display reordering engine where and how it
should take into account the syntactic structure of the program/markup
being edited.


Whether that can best be done by

1) adding some bidi formatting codes to the text being edited,

2) adding some bidi formatting codes to display text from a property or
overlay (before-string and after-string)

3) adding some bidi-specific properties or overlay properties to
directly influence bidi reordering

4) some other means

is what I think we have been discussing. I continue to agree with you
that 1) is a bad idea. 2) is what Kenichi originally suggested. 3) is
what the example above is about.

I don't really mind too much which way we go, but given that I must
assume that the bidi algorithm has hierarchically nested embeddings for
a reason, and that programming languages and markup languages are in
many ways quickly much more nested than natural language (see examples
above), I don't think we can easily get away with a simplistic model of
"everything is LTR, with an occasional RTL string in it". That might
work for a programming language like C, but not for things like Perl,
Ruby, PHP, JSP, ASP, HTML, and XML.

I'm not sure I understand, but if it means that the bidi algorithm is
just applied piecewise, that won't be enough. It may be enough for some
simple cases, such as C programs, where the main concern is to keep text
within string constants together, and the rest is ASCII only and
therefore goes LTR. However, on the other hand, with some XML markup
with e.g. element and attribute names in Hebrew, in our experience
actual nestings (i.e. embeddings in terms of the bidi algorithm) are
highly desirable.


Again, an example would go a long way towards explaining what you
mean.  In general, what I wrote does not eliminate the possibility
that embeddings might be used within the reordered parts, nor that the
text outside of the markup is LTR only.


Okay. In the prototype and in the Web-based editor that we have worked
on to display HTML, we typically used embeddings for:
- Elements (incl. start tag and end tag) that have a dir attribute
(which indicates an embedding in the Web page view). These can of course
be nested.
- Start tags (and end tags)
- Attribute/attribute value combinations

Not all of these may be necessary in all cases, but it would be too
complicated to try and figure exactly which ones might be left out in
any particular case, and even this wouldn't eliminate the need for
nested embeddings. And it is at least currently unclear to me how you
could achieve nested embeddings with a possibility to tell the rendering
engine "restrict yourself to this region".


Please show an actual fragment of HTML/XML which needs nesting or
embeddings.


See above.

Or a property that changes the bidi category of a character?


This can be done if we need it, but I still don't see use-cases that
would benefit from such a feature.


Making the characters that define XML syntax, such as <, >, ", ', =, ...
strong LTR would solve a lot (but not all) of the display anomalies for
XML (incl. HTML).


If it doesn't solve all the problems, I'd rather try first to find a
solution that does.


Agreed.

We probably won't want to change the bidi
properties of a character for the entire buffer (because it could be
used elsewhere in the buffer, like in a comment, where we would want
it to be reordered normally).  So this means we would need to use
different tables of bidi properties for different portions of the
text.  Switching bidi properties during display, as it walks the
buffer, is doable, but is somewhat tricky and can raise some hard
problems.

The table lookup might be done beforehand, with Font lock or some
similar mechanism, and the result may be carried in properties.
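
As a rough sketch of what such a precomputed lookup could look like, in
Python rather than Emacs Lisp: `unicodedata.bidirectional` gives the
default Unicode bidi class of a character, and a second table overrides
it only inside regions that some earlier pass has marked as markup. The
override table, the function name and the region list are all
assumptions made for illustration.

```python
import unicodedata

# Hypothetical override table: treat XML syntax characters as strong LTR.
XML_SYNTAX_OVERRIDES = {ch: "L" for ch in "<>\"'="}

def bidi_classes(text, markup_ranges):
    """Return one bidi class per character of `text`.

    Characters inside the (start, end) ranges in `markup_ranges` use the
    override table where it has an entry; everything else keeps its
    default Unicode class.
    """
    classes = [unicodedata.bidirectional(ch) for ch in text]
    for start, end in markup_ranges:
        for i in range(start, end):
            classes[i] = XML_SYNTAX_OVERRIDES.get(text[i], classes[i])
    return classes

# "<p>", two Hebrew letters around "=", then "</p>".
text = "<p>\u05d0 = \u05d1</p>"
tags = [(0, 3), (8, 12)]  # positions of the start and end tag
for ch, cls in zip(text, bidi_classes(text, tags)):
    print(repr(ch), cls)
```

The "=" between the Hebrew letters lies outside the tag ranges and keeps
its neutral class, while the "<" and ">" of the tags become strong LTR.
The resulting list plays the role of the property carried on the text.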

The fact that it is not a comprehensive solution makes me
even more reluctant to use it.

I agree with that. It may be possible to make it a comprehensive
solution if we can use LRE/RLE/.../PDF as a bidi property (i.e. say that
a plain old character also works as an LRE/RLE/.../PDF). But that may
not be enough, we might have to go as far as being able to attach a
sequence of bidi properties to a single character. Not exactly pretty :-(.

It might solve all display anomalies for programming languages like C to
define " (for strings) and comment start/end as LTR (at least as long as
there are no RTL identifiers).


But quotes can appear in the comments as well, so I think here, too,
we won't be able to use the same properties for the entire buffer.


True.

Covering each string, excluding its quotes, with a special text
property, and the same with a comment (excluding the comment
start/end) sounds a simpler solution.

This works very nicely if there is no nesting. If you can tell me for
sure that nobody working with Perl, Ruby, PHP, JSP, ASP, HTML, XML, ...
will prefer nested bidi reordering for some cases, that might solve the
problem. But I wouldn't want to make such an assertion.



Regards,    Martin.


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:address@hidden
