texmacs-dev

Re: [Texmacs-dev] string encoding


From: Joris van der Hoeven
Subject: Re: [Texmacs-dev] string encoding
Date: Tue, 19 Nov 2002 10:42:46 +0100 (MET)

> A first draft of the dictionary mapping Cork (TeXmacs) encoding to
> Unicode encoding is now finished. You can take a look at it here:
> 
> http://www.fbreuer.de/texmacs/corktounicode.scm

Great, this seems cool.
It may not really be necessary to add the comments (that is a lot of extra work).
It would be better to provide a separate dictionary that maps characters to
their symbolic names. Such a dictionary probably already exists for Unicode.
With Andrey's tool you could then compose the two dictionaries in order to
obtain an explanation :^)

> Any suggestions and/or corrections are welcome. Does anybody have an
> idea how to test this mapping? (I.e. generate a document/table where one
> can visually verify that the mapping is correct?)

You have to be careful with the number of bytes for each character.
In the Cork encoding, each character takes only one byte,
so you should write #41 for "A" rather than #0041.
In Unicode, some characters take one byte, some two, and some even more.
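
To make the byte counts concrete, here is a minimal sketch, assuming the
Unicode side is serialized as UTF-8 (the message does not say which
serialization is meant). The name utf8_encode is hypothetical and is not
an existing TeXmacs routine.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not TeXmacs code): encode one Unicode code point
// as UTF-8. Assumes cp is a valid scalar value (at most 0x10FFFF and not
// a surrogate); no validation is done here.
std::string utf8_encode(std::uint32_t cp) {
  std::string out;
  if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                         // 4 bytes: 11110xxx followed by 3 x 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

Under this assumption, "A" (code point 0x41) is still a single byte, whereas
U+00E9 needs two bytes and U+20AC needs three.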

We still have to develop something for testing all this in C++.

> I didn't make a patch from the dictionary because I don't know where to
> put it in the TeXmacs source tree. How to use this dictionary to convert
> between encodings? I guess just a bit of Scheme code would do the trick,
> but I don't know Scheme well enough (yet). 

We would rather write these conversion routines in C++ (they must
be really fast), in src/Resources/Translators.
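
As a rough idea of what a fast routine could look like, here is a sketch
based on a flat 256-entry lookup table, which works because every Cork
character is a single byte. Everything in it is an assumption made for
illustration: the names make_cork_table and cork_to_unicode are
hypothetical, only letters and digits are filled in, and the real table
would be generated from the dictionary. Multi-byte <char> symbols of the
universal encoding are not handled.

```cpp
#include <array>
#include <string>

// Hypothetical table: Cork byte -> Unicode code point.
// Only A-Z, a-z and 0-9 are filled in (they sit at their ASCII positions);
// every other slot is a placeholder until the real dictionary is loaded.
std::array<char32_t, 256> make_cork_table() {
  std::array<char32_t, 256> t{};
  t.fill(0xFFFD);  // U+FFFD REPLACEMENT CHARACTER marks "not mapped yet"
  for (char32_t c = U'A'; c <= U'Z'; ++c) t[c] = c;
  for (char32_t c = U'a'; c <= U'z'; ++c) t[c] = c;
  for (char32_t c = U'0'; c <= U'9'; ++c) t[c] = c;
  return t;
}

// One array lookup per input byte, so the cost is linear in the input.
std::u32string cork_to_unicode(const std::string& cork,
                               const std::array<char32_t, 256>& table) {
  std::u32string out;
  out.reserve(cork.size());
  for (unsigned char b : cork) out += table[b];
  return out;
}
```

The reverse direction needs a different structure (for instance a hash
table keyed by code point), since Unicode code points do not fit in a
256-entry array.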

> Next, I am going to write a TeXmacs universal encoding <-> Unicode
> dictionary. I noticed that sometimes the universal characters are
> encoded this way: \<char\> and sometimes this way: <char>. Which of
> these two should I use in the dictionary? Or should I use just char?

You should use the <char> form, which is the one being used internally.

Andrey: I noticed that you do not put the <> around the characters
when converting from .enc to .scm. This should be done for strings
of length >1.
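
The rule can be sketched as follows. The function name bracket_name is
hypothetical, length is counted in bytes, and leaving already bracketed
names untouched is an extra assumption so that the function can safely be
applied twice.

```cpp
#include <string>

// Hypothetical helper: wrap symbolic names of length > 1 in <...>.
// Single characters (and the empty string) are returned unchanged.
std::string bracket_name(const std::string& s) {
  if (s.size() <= 1) return s;
  // Assumption: a name that is already of the form <...> is left alone.
  if (s.front() == '<' && s.back() == '>') return s;
  return "<" + s + ">";
}
```

So "A" stays "A", while "alpha" becomes "<alpha>".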

> Regarding ISO-8859-*: I noticed that ISO-8859-1 is a subset of Unicode
> (see  ftp://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT). How about
> the other ISO-8859-* encodings? Instead of writing a dictionary it would
> probably be more sensible to just use iconv to convert
> ISO-8859<->Unicode.

Absolutely. Maybe you can actually find this somewhere on the web.
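
For reference, calling iconv from C++ could look roughly like the sketch
below. It uses the POSIX iconv_open/iconv/iconv_close interface; the
wrapper name to_utf8 is hypothetical, UTF-8 as the target is an assumption,
and which encoding names are accepted depends on the iconv implementation
installed on the system.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: convert a string from the encoding `from`
// (e.g. "ISO-8859-1") to UTF-8 using POSIX iconv.
std::string to_utf8(const std::string& in, const char* from) {
  iconv_t cd = iconv_open("UTF-8", from);  // arguments: (tocode, fromcode)
  if (cd == (iconv_t)-1) throw std::runtime_error("iconv_open failed");

  std::string src = in;                    // iconv wants a non-const buffer
  std::string out(in.size() * 4, '\0');    // generous upper bound on output size
  char* inp = &src[0];
  std::size_t inleft = src.size();
  char* outp = &out[0];
  std::size_t outleft = out.size();

  std::size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
  iconv_close(cd);
  if (rc == (std::size_t)-1) throw std::runtime_error("iconv conversion failed");

  out.resize(out.size() - outleft);        // keep only the bytes actually written
  return out;
}
```

The same call should work for the other ISO-8859-* encodings by changing
the `from` argument, provided the local iconv knows them.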

Note that it would be good to include the "la" encoding,
for Cyrillic, as well.




