bug-gnu-emacs
bug#20140: 24.4; M17n shaper output rejected


From: Richard Wordingham
Subject: bug#20140: 24.4; M17n shaper output rejected
Date: Sat, 21 Mar 2015 17:58:18 +0000

On Sat, 21 Mar 2015 17:33:17 +0900
handa@gnu.org (K. Handa) wrote:

> In article <20150318222040.4066e6e9@JRWUBU2>, Richard Wordingham
> <richard.wordingham@ntlworld.com> writes: [...]
> > I extract and analyse what was rendered as shaped ('accepted') and
> > what was not ('rejected'), quoting the monitoring output.  I
> > suspect the problem is the strict testing of the from and to fields
> > in Lisp function font-shape-gstring, which is defined in file
> > font.c.
> [...]
> > The shaping of the following, with vowels or MEDIAL RA that should
> > be rendered before the consonant, was rejected:
> 
> > mflt_run( 1A3E 1A6E 1A6C 1A65) produced ( 1A6E>872:1:1 1A3E>810:0:3
> > 1A6C>869:0:3 1A65>862:0:3)
> 
> If U+1A6E is displayed before U+1A3E, and they are in
> different grapheme clusters, then when you move point forward
> one step at a time, the cursor must move back and forth as below
> (cursor is indicated by dashes):
> 
>  display: SPC 1A6E 1A3E+1A6C+1A65 SPC
>  step 1:  ---    
>  step 2:           --------------
>  step 3:      ----
>  step 4:                          ---
> 
> Is that what you want?

It gives me more control for editing in Emacs.  Another implementation
could choose to move in visual order.  The policing function could
also choose to merge the 'out of order' clusters; that is what the new
HarfBuzz does, though I think that should only be done if the client
requests it.
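
To make the merging option concrete, here is a minimal Python sketch of
such a policing step.  It assumes that each entry in the monitoring
output reads CHAR>GLYPH:FROM:TO, where FROM and TO are character
indices into the input and the glyph number is printed in decimal;
that reading is inferred from the trace quoted above, not taken from
m17n documentation.  The names Glyph, parse_trace and
merge_out_of_order are invented for the illustration and are not part
of Emacs, m17n or HarfBuzz.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: int      # source code point, e.g. 0x1A6E
    glyph_id: int  # glyph number as printed in the trace (assumed decimal)
    start: int     # 'from': first character index covered
    end: int       # 'to': last character index covered


def parse_trace(text):
    """Parse entries such as '1A6E>872:1:1' (assumed CHAR>GLYPH:FROM:TO)."""
    glyphs = []
    for item in text.split():
        char, rest = item.split(">")
        glyph_id, start, end = rest.split(":")
        glyphs.append(Glyph(int(char, 16), int(glyph_id), int(start), int(end)))
    return glyphs


def merge_out_of_order(glyphs):
    """Group glyphs (given in visual order) into clusters whose character
    ranges strictly increase, merging any that overlap or run backwards.
    Returns a list of (start, end, glyph_count) tuples."""
    clusters = []  # each entry: [start, end, [glyphs]]
    for g in glyphs:
        cur = [g.start, g.end, [g]]
        # Keep absorbing earlier clusters until the ranges are in order.
        while clusters and cur[0] <= clusters[-1][1]:
            prev = clusters.pop()
            cur = [min(prev[0], cur[0]), max(prev[1], cur[1]), prev[2] + cur[2]]
        clusters.append(cur)
    return [(s, e, len(gs)) for s, e, gs in clusters]


trace = "1A6E>872:1:1 1A3E>810:0:3 1A6C>869:0:3 1A65>862:0:3"
print(merge_out_of_order(parse_trace(trace)))  # [(0, 3, 4)]
```

On the rejected trace this collapses the preposed vowel and the rest of
the syllable into one cluster covering characters 0 to 3.  A client
that prefers the back-and-forth cursor would simply skip this step.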

What I ought to want is SIL's split cursor scheme, which indicated the
next ('point') and previous characters, even in bidirectional text.
Unfortunately, that's not compatible with m17n, which seems to assume
that cursor position will be a single number.  The Emacs functions
forward-char-intrusive and backward-char-intrusive provided a pleasant,
more intuitive, alternative, and I am sad to hear they are gone.
Perhaps I'll have to start using toggle-auto-composition.

The one consolation in Emacs is that delete-forward-char
deletes a single character, rather than a whole cluster.  That
greatly reduces the disadvantage of having clusters.  Also,
search still works by characters rather than by clusters.  If I want to
search for a character in LibreOffice, I have to go into the
special regular expression find and replace menu.  That is unpleasant.

> At least, the support for all Indic scripts (they have
> characters in logical order as your example of Tai Tham
> text) treats re-ordered glyphs as one grapheme cluster.
> That is not only Emacs but also gtk (pango) applications.

That's a nasty fault with HarfBuzz.

> Please try to move the cursor over this Devanagari text "हिंदी" on
> Emacs, gedit, and, for instance, firefox.  They all treat
> that text as 2 grapheme clusters "हिं" and "दी".  The first
> one corresponds to the character sequence U+935 U+93F, and
> U+93F (vowel I) is displayed before U+935 (base consonant).

Note that those clusters are only 3 and 2 characters long.  Retyping
them is tolerable.  Now consider the Sanskrit Devanagari text स्त्री,
which contains two consonant-combining viramas.  Emacs moves across it
in 1 step, but Claws e-mail (GTK-based, I believe) and LibreOffice
(HarfBuzz-based, at least for Linux) both take 3 steps to move across
it.  Claws and LibreOffice use different algorithms to position the
cursor.  That of LibreOffice seems more reasonable, but that of
Claws works better!  The reason is that Unicode did not declare virama
as forming grapheme clusters.
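
The difference between the two counts can be reproduced with a
deliberately simplified segmenter.  The Python sketch below is not the
Unicode grapheme cluster algorithm and not what Emacs, GTK or HarfBuzz
actually run; it only attaches combining marks to the preceding
character, with an optional rule that also joins a consonant to a
preceding virama.  The function names are invented for the
illustration.

```python
import unicodedata


def is_mark(ch):
    # Mn, Mc and Me are the combining-mark general categories.
    return unicodedata.category(ch).startswith("M")


def is_virama(ch):
    # Canonical combining class 9 is the Unicode 'Virama' class.
    return unicodedata.combining(ch) == 9


def clusters(text, join_after_virama=False):
    """Split text into clusters: marks attach to the previous character;
    optionally, any character following a virama attaches as well."""
    out = []
    for ch in text:
        attach = bool(out) and (
            is_mark(ch) or (join_after_virama and is_virama(out[-1][-1]))
        )
        if attach:
            out[-1] += ch
        else:
            out.append(ch)
    return out


HINDI = "\u0939\u093f\u0902\u0926\u0940"       # the Hindi example word
STRI = "\u0938\u094d\u0924\u094d\u0930\u0940"  # the Sanskrit example word

print([len(c) for c in clusters(HINDI)])            # [3, 2]
print(len(clusters(STRI)))                          # 3
print(len(clusters(STRI, join_after_virama=True)))  # 1
```

Without the virama rule the Sanskrit word splits into three clusters,
matching the three cursor steps seen in Claws and LibreOffice; with it,
the whole word is one cluster, matching the single step in Emacs.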

> [...]
> 
> > There does appear to be a work around, which is to have m17n declare
> > the orthographic syllables it receives to be 'grapheme clusters'.
> 
> I think that's the right solution; i.e. make all combined
> and out-of-order glyphs into one cluster.
> 
> > It solves at least some of the problems above.
> 
> Which one is not solved by it?

It seems to have solved all of them.  When I reported the bug, I was
having problems with my font because libotf was silently ignoring half
the lookups in my font.

I thought I might have problems with U+1A58 TAI THAM SIGN MAI KANG LAI,
which in Lao visually groups (usually) with the following base
consonant and in Tai Khuen groups with the preceding base consonant. My
clustering in Emacs follows the Tai Khuen scheme.  (I compose two
orthographic clusters together in Emacs, but declare two grapheme
clusters in the FLT processing.)  My font, however, follows a major
Northern Thai dictionary and places it on the following base consonant
if there is nothing above it, but otherwise places it on the preceding
base consonant.  In practice, my implementation is too dirty to cause
problems: the second cluster is not reported as deriving from the
mai kang lai character.

I wonder, though, what will happen if I manage to implement the
Universal Shaping Engine's (USE) rphf feature. The author of a Lao-style
Tai Tham font wanted this feature in HarfBuzz.  The desired effect seems
easy to achieve in m17n-flt, but placing it under font control is more
difficult.  I'm studying MLM2-OTF.flt to see how to do it.

> > However, it then makes editing of the 'clusters' more
> > difficult.  Note that there are examples above with 5
> > characters in a cluster, and this is by no means the
> > limit.
> 
> But, it seems that the current behavior is accepted, at
> least, by Indic people.

Who do you mean by 'Indic people'?  I can see at least three groups:

1) Indian speakers of Indic languages who use Indic scripts, thus
including users of Hindi, Gujarati and Bengali.

See my comments above.

2) Indian users of Indic scripts, thus also including speakers of
Malayalam and Tamil.

In Tamil, a phonetically CVCCV word will normally naturally split into
clusters as CV.C+virama.CV.  I must admit I am surprised that they have
accepted CV.CCV - or do Tamils not use Emacs for Tamil?

Tamils are notorious for regarding their writing system as a syllabary
rather than as an abugida.

I haven't studied the Malayalam script - that does seem a fairly
complicated Indian script, as one might expect when Dravidians use a
script tailored to Middle Indic and stretched to cover Old Indic.

3) Users of Indic scripts, thus also including the Burmese, Thai,
Cambodians and Lao as well as the users of the Tai Tham script.

Rebellion is rampant.  The original Unicode encoding of Thai
followed the phonetic order (allegedly - it was probably the
collation order instead).  This was rapidly thrown out as
incompatible with the current, working encoding.  Unicode responded
with the derogatory property of 'logical order exception'.

Around Unicode 5.1, the preposed vowels of Thai and Lao were suddenly
included in grapheme clusters with the base consonant. As the
consequences started to appear in applications, there were howls of
rage from Thais, and the characters were restored to their original
status as fully independent characters.

It doesn't seem so long ago that the Cambodian government imposed
Unicode on Cambodia.  You'd have thought that access to applications
would have made Unicode the obvious choice.

New Tai Lue is an interesting case.  Microsoft delayed support for this
simple Indic script for so long that most apparently Unicode-encoded
New Tai Lue text was actually encoded in visual order.  With Unicode
8.0, New Tai Lue is changing from phonetic order to visual order, and
it will no longer need any clusters at all!  Emacs 23.3 (which is what
is in long-term support Ubuntu 12.04) offers no support for New Tai
Lue, so I am not sure that there is yet a New Tai Lue view on
composition in Emacs.

Richard.




