
bug#11073: 24.0.94; BIDI-related crash in redisplay with certain byte sequences

From: Kenichi Handa
Subject: bug#11073: 24.0.94; BIDI-related crash in redisplay with certain byte sequences
Date: Mon, 26 Mar 2012 16:45:56 +0900

In article <address@hidden>, Eli Zaretskii <address@hidden> writes:

> > Why do we need this unification?  Or rather, why do we need multiple
> > codepoints, which then forces us to unify them?

> That's something Handa-san (CC'ed) will be able to explain much better
> than I ever could.

It's a long story.  When I designed emacs-unicode (the
version before it was merged into the trunk, more than 10
years ago), the unification maps from CJK charsets to Unicode
were not stable.  In addition, there were various conflicting
policies on which character to unify with which.  One reason
for this confusion was that Unicode itself didn't define
mappings to/from such CJK charsets (JIS, GB, KSC).

The unification problem is not limited to ideographic
characters.  Many CJK charsets contain, for instance,
full-width versions of Greek characters, but Unicode doesn't
distinguish them from the single-width versions (though
Unicode does have full-width versions of 'A'..'Z', etc).
There were people who wanted to distinguish full-width Greek
chars from single-width chars.
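
The asymmetry can be checked against the Unicode character
database.  The following is a small Python sketch (not Emacs
code); the helper name wide_variant is made up for this
illustration.  It looks for a code point whose compatibility
decomposition is the "<wide>" form of a given character.

```python
import sys
import unicodedata


def wide_variant(ch):
    """Return the character whose compatibility decomposition is
    '<wide> ch', or None if Unicode defines no such character.
    (Hypothetical helper, written only for this illustration.)"""
    target = f"<wide> {ord(ch):04X}"
    for cp in range(sys.maxunicode + 1):
        if unicodedata.decomposition(chr(cp)) == target:
            return chr(cp)
    return None


latin_a = "A"            # U+0041 LATIN CAPITAL LETTER A
greek_alpha = "\u0391"   # U+0391 GREEK CAPITAL LETTER ALPHA

# Latin 'A' has a separate full-width code point (U+FF21) ...
print(repr(wide_variant(latin_a)))
# ... but Greek alpha has none, so a full-width Greek letter from a
# CJK charset has nowhere distinct to go in Unicode.
print(repr(wide_variant(greek_alpha)))
```

The scan over all code points is slow but keeps the sketch free
of assumptions about which block holds the full-width forms.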

There were also people who had iso-2022-7bit text files
that distinguish characters of the GB charset from those of
the JIS charset.  To edit such a file and write it back
identical to the original, one has to disable unification
of GB or JIS (or both of them).
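
To see why unification loses this information, here is a Python
sketch (not Emacs code).  It assumes Python's iso2022_jp_2 codec
as a stand-in for Emacs's iso-2022-7bit, and uses U+4E2D, which
exists in both JIS X 0208 and GB 2312, as the sample character.
The helper name seven_bit is made up for this illustration.

```python
ESC = b"\x1b"


def seven_bit(euc_bytes):
    """Strip the high bit from EUC bytes to get the 7-bit code
    used inside an ISO-2022 stream (hypothetical helper)."""
    return bytes(b & 0x7F for b in euc_bytes)


ch = "\u4e2d"  # present in both JIS X 0208 and GB 2312

# The same character, designated once as JIS X 0208 (ESC $ B)
# and once as GB 2312 (ESC $ A), then back to ASCII (ESC ( B).
as_jis = ESC + b"$B" + seven_bit(ch.encode("euc_jp")) + ESC + b"(B"
as_gb = ESC + b"$A" + seven_bit(ch.encode("gb2312")) + ESC + b"(B"

text_jis = as_jis.decode("iso2022_jp_2")
text_gb = as_gb.decode("iso2022_jp_2")

# Different byte sequences on disk, identical text once unified.
print(as_jis != as_gb, text_jis == text_gb)

# Encoding is a function of the text alone, so at most one of the
# two files can be written back byte-for-byte.
roundtrip_ok = [text_jis.encode("iso2022_jp_2") == as_jis,
                text_gb.encode("iso2022_jp_2") == as_gb]
print(roundtrip_ok)
```

Once both inputs decode to the same code point, nothing in the
text itself records which charset each character came from.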

So, I decided at that time to give each CJK charset a unique
code space (above #x110000) in Emacs, and to allow users to
freely unify/disunify them with the Unicode code space (below
#x110000) by providing the function unify-charset.
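
The idea can be modelled in a few lines.  This is a toy Python
sketch, not the Emacs implementation: the class Charset, the
private base offsets, and the one-entry unification maps are all
assumptions made for illustration, and this unify_charset only
borrows the name of the Emacs function.

```python
UNICODE_LIMIT = 0x110000  # Unicode code points lie below this value


class Charset:
    """Toy charset with a private code space and a unification map."""

    def __init__(self, name, private_base, unify_map):
        assert private_base >= UNICODE_LIMIT
        self.name = name
        self.private_base = private_base  # hypothetical offset
        self.unify_map = unify_map        # charset code -> Unicode
        self.unified = False

    def decode_char(self, code):
        """Map a charset code to an internal character code."""
        if self.unified and code in self.unify_map:
            return self.unify_map[code]
        # Not unified: stay in the charset's own private space.
        return self.private_base + code


def unify_charset(charset, deunify=False):
    """Toy counterpart of unify-charset: switch unification on or off."""
    charset.unified = not deunify


# Illustrative bases and single-entry maps (both codes map to U+4E2D).
jis = Charset("toy-jis", 0x110000, {0x4366: 0x4E2D})
gb = Charset("toy-gb", 0x120000, {0x5650: 0x4E2D})

unify_charset(jis)
unify_charset(gb)
both_unified = (jis.decode_char(0x4366), gb.decode_char(0x5650))

# Disunify GB: its character moves back above the Unicode range,
# so it stays distinct from the JIS one.
unify_charset(gb, deunify=True)
gb_private = gb.decode_char(0x5650)
print([hex(c) for c in both_unified], hex(gb_private))
```

With both charsets unified the two codes collapse to one
character; disunifying either one keeps them distinguishable.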

FYI, http://www.unicode.org/reports/tr38/ describes some of
the difficulties of these mappings.

> AFAIU, there are good reasons to have some CJK
> characters on separate codepoints, because they need to be treated
> differently from their Unicode codepoints (perhaps a different choice
> of font to display them?)

That was one reason, but the current code pays attention to the
`charset' text property of each character to select a proper

Kenichi Handa
