Re: [Texmacs-dev] utf-8 support update

From:	Joris van der Hoeven
Subject:	Re: [Texmacs-dev] utf-8 support update
Date:	Mon, 25 Nov 2002 11:24:38 +0100 (MET)

Hi Felix,

> I have begun work on the TeXmacs universal character set -> Unicode
> mapping. It can be found at my site www.fbreuer.de/texmacs. 

Seems promising.

> > You have to be careful with the number of bytes for each character.
> > In the Cork encoding, each character only takes one byte,
> > so you should write #41 for "A" rather than "#0041".
> 
> I changed the mapping accordingly.

Cool.
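
To make the byte-count point concrete, here is a minimal C++ sketch. It assumes a hypothetical entry syntax in which "#" is followed by pairs of hex digits, each pair standing for one byte. The name decode_hex_entry is made up for illustration and is not part of TeXmacs.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode "#" followed by pairs of hex digits into raw
// bytes, one byte per pair. The entry syntax is an assumption.
std::string decode_hex_entry(const std::string& entry) {
  // A valid entry is '#' plus an even, non-zero number of hex digits,
  // so its total length must be odd and at least 3.
  if (entry.size() < 3 || entry[0] != '#' || entry.size() % 2 == 0)
    throw std::invalid_argument("expected '#' plus an even number of hex digits");
  std::string bytes;
  for (std::size_t i = 1; i < entry.size(); i += 2) {
    if (!std::isxdigit(static_cast<unsigned char>(entry[i])) ||
        !std::isxdigit(static_cast<unsigned char>(entry[i + 1])))
      throw std::invalid_argument("non-hex digit in entry");
    bytes += static_cast<char>(std::stoi(entry.substr(i, 2), nullptr, 16));
  }
  return bytes;
}
```

Under this reading, "#41" decodes to the single byte "A", whereas "#0041" decodes to two bytes (a zero byte followed by "A"), which is not what a one-byte Cork entry should contain.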

> > We would rather write these conversion routines in C++ (they must
> > be really fast) in src/Resources/Translators
> 
> I do not get how this translator works. It seems to never return a
> translated string. And instead of building a table associating indices
> into the string-to-be-translated, it associates strings with indices.
> Are texmacs hashmaps multimaps? I am lost. Could somebody explain it to
> me?

TeXmacs hashmaps just associate values of a given type to entries of
a given type. Nothing special there.

The idea would be to make the translator class more powerful.
At the moment it is just used for storing the corresponding positions in
the font encodings for mathematical characters, as well as the definitions
of virtual characters (I am not sure that we should keep this in
the translator class).

What we should do is write a function that converts strings to strings
using a character-to-character table. In order to make this function as
efficient as possible for several types of translations, we might make
the translator class abstract and perform several types of optimizations
in the concrete classes. For instance, a translation from ISO-8859-x to
Cork should be faster than a translation from TeXmacs-Universal-Encoding
(TMUE) to Unicode.
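
A rough C++ sketch of this design, under stated assumptions: it uses std::string rather than the TeXmacs string class, and the names translator_sketch, byte_translator and symbol_translator are hypothetical, not the existing translator class. The first concrete class handles one-byte-to-one-byte encodings with a flat table. The second assumes that universal symbols are written as "<name>" and maps them to arbitrary byte sequences.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical abstract interface; not the actual TeXmacs translator class.
struct translator_sketch {
  virtual ~translator_sketch() = default;
  virtual std::string translate(const std::string& in) const = 0;
};

// One byte in, one byte out (e.g. ISO-8859-x to Cork): a flat 256-entry
// table gives a constant-time lookup per character without hashing.
class byte_translator : public translator_sketch {
  std::array<unsigned char, 256> table;
public:
  byte_translator() {
    // Start from the identity mapping; callers override individual entries.
    for (std::size_t i = 0; i < 256; ++i)
      table[i] = static_cast<unsigned char>(i);
  }
  void set(unsigned char from, unsigned char to) { table[from] = to; }
  std::string translate(const std::string& in) const override {
    std::string out(in.size(), '\0');
    for (std::size_t i = 0; i < in.size(); ++i)
      out[i] = static_cast<char>(table[static_cast<unsigned char>(in[i])]);
    return out;
  }
};

// Variable-length symbols (assumed to look like "<name>") mapped to
// arbitrary byte sequences, e.g. UTF-8. Unknown symbols pass through.
class symbol_translator : public translator_sketch {
  std::unordered_map<std::string, std::string> table;
public:
  void set(const std::string& from, const std::string& to) { table[from] = to; }
  std::string translate(const std::string& in) const override {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
      std::size_t len = 1;  // default: a single-byte character
      if (in[i] == '<') {
        std::size_t close = in.find('>', i);
        if (close != std::string::npos) len = close - i + 1;
      }
      const std::string key = in.substr(i, len);
      auto it = table.find(key);
      out += (it != table.end()) ? it->second : key;
      i += len;
    }
    return out;
  }
};
```

Both classes can be used through the same translate call, so the caller does not need to know which optimization applies. Any table entries supplied to them would have to come from the real encoding tables; none are built in here.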

> Since we are talking about a conversion of the encoding of a string and
> not of a translation of its contents, wouldn't it be better to add a function
> as_utf8 to string.cc? This would lend itself more to the inclusion of
> other encodings using iconv.h. However, I don't want to argue, I just
> need someone to enlighten me :)

No, I think it is better to encapsulate these routines in dedicated
classes like translator, so that we may optimize depending on the context,
store the corresponding tables and allow for operations on the tables.

Notice that it would also be nice to have some command-line tools
for reading in our dictionaries and performing the translations
(maybe slightly slower, but with the same functionality).

> Regarding the universal characters: <big|cap> is a different character
> than \<cap\>, so <big|...> nodes would have to be converted as well. Why
> isn't <big|cap> encoded as <big|\<cap\>>? The latter seems more
> consistent to me. How about <left|...>, <right|...>, <mid|...>? 

Be careful: <big|cap> is not a character but a unary application of
"big" on "cap". The corresponding characters are of the form <big-cap-nr>,
where "nr" is a number. In other words, the fonts accommodate an infinite
number of symbols (contrary to Unicode). This is good in the particular
case of big operators, because they are really still part of the font.

In fact, it is good to think of encodings as being related to fonts.
I would like to improve the font-selection system by the introduction
of a new way to deal with composite fonts. The idea would be that
you first define a compound encoding like

        compound-enc := roman-enc, cyrillic-enc, greek-enc

which may then be used to construct compound fonts using the "compound"
keyword. For instance:

        ((cm $size $series $shape $dpi)
         (compound compound-enc cm $size $series $shape $dpi))

You may (should) then define the different parts of the compound font
as follows:

        ((cm roman-enc $size $series $shape $dpi) (cmr ...))
        ((cm cyrillic-enc $size $series $shape $dpi) (larm ...))
        ((cm greek-enc $size $series $shape $dpi) (grrm ...))

The general TeXmacs rewriting system for fonts may then be used
to specify appropriate fallbacks and so on.
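
One possible reading of the compound encoding, sketched in C++: treat it as an ordered list of component encodings and pick, for each symbol, the first component that covers it. This "first match wins" rule is an assumption made for illustration, as are the names encoding_sketch and resolve. The actual dispatch could be designed differently.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one component encoding: a name plus the set of
// symbols it covers.
struct encoding_sketch {
  std::string name;
  std::set<std::string> symbols;
};

// Return the name of the first component encoding that covers the symbol,
// or an empty string if none does (where a fallback rule would apply).
std::string resolve(const std::vector<encoding_sketch>& compound,
                    const std::string& symbol) {
  for (const auto& enc : compound)
    if (enc.symbols.count(symbol) != 0) return enc.name;
  return "";
}
```

With components listed in the order roman-enc, cyrillic-enc, greek-enc, a symbol covered only by greek-enc would resolve to that component, and the matching sub-font rule would then supply the glyph.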

We probably also have to define two more operations on fonts
for reencoding a given font and for constructing fonts of
accented characters.

<Joris>
