Re: Codepage?
From: Antoine Leca
Subject: Re: Codepage?
Date: Tue, 02 May 2000 16:49:17 +0200
Rob Kramer wrote:
>
> > So in fact, the text is 8-bit encoded.
> >
> > So you need a way to convert from the encoding used in Windows Word
> > (I assume you, or your user, knows what it is) to transform to
> > character indices as stored in the font (and then to glyph indices
> > using some TT_CharToIndex function).
> > Do I still get it right so far?
>
> Correct, but I or the user don't know the encoding I guess..
Ah! Now I get the picture. What is really lacking is the information
about the encoding used (what you call the codepage is an IBM/MS way to
name just that, in a way convenient for computers, which *love* to
use numbers everywhere ;-)).
> I mean, the way
> we did Thai text was by using a Thai keyboard on a normal English Win95
> installation. Somehow the keyboard produced the proper codes to match the
> font (and as far as I could see, that font only had a 'MS symbol' map.
Yes, because that is the way Win 95 keyboards work: they generate
the proper (8-bit) codes, ready to be ingested by the application, using
the default codepage associated with each language (in this case, 874,
a.k.a. TIS-620).
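To make that concrete, here is a small Python sketch. Python's built-in cp874 and cp1252 codecs stand in for what Windows does, and the sample word is an arbitrary choice for illustration. It shows each Thai character becoming one byte under codepage 874, and the same bytes reading as something else entirely under another codepage:

```python
# Arbitrary sample: the Thai word "ไทย", written with escapes for clarity.
thai_text = "\u0e44\u0e17\u0e22"

# What a Thai keyboard on Win95 would hand to the application:
# one 8-bit code per character, in codepage 874.
raw = thai_text.encode("cp874")

# The same bytes interpreted with the Western codepage (1252)
# come out as unrelated Latin characters.
misread = raw.decode("cp1252")

print(raw.hex(), repr(misread))
```

This is why the bytes alone are not enough: without knowing the codepage, there is no way to tell which reading was intended.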
> Do you say that if I want to display Russian, my Windows (or word?) should
> be in 'Russian mode',
Yes.
> and my software should too?
Yes.
> That was what I was trying to do by having the user specify a codepage..
O.K. So what you need is to learn about the codepage used to encode
a text. If you are under Word, that is a simple query of the locale
of the current keyboard layout (we are going off-topic, I keep it short);
if the text is persistent (already typed), then the only way to know
is to get the codepage associated with the font.
In Word format, this information is encoded with a byte (named charset)
that is associated with every font. 0 is Latin-1 (1252), 0xEE means East
European Windows, 0xCC means Cyrillic, 222 = 0xDE (IIRC) means Thai.
However, the same information is not really encoded in the font (because
a font can be remapped to cover various encodings)...
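As a rough sketch of how the charset byte could be used, the table below maps the four values mentioned above to Windows codepages and decodes a run of 8-bit text accordingly. The codepage numbers 1250 and 1251 for East European and Cyrillic are the usual Windows ones and are my addition; the names `CHARSET_TO_CODEPAGE` and `decode_run` are hypothetical, and the Thai value is the one recalled above, so verify it before relying on it.

```python
# Charset byte -> Windows codepage, for the values mentioned above only.
# 1250 and 1251 are assumed from the usual Windows codepage numbering.
CHARSET_TO_CODEPAGE = {
    0x00: 1252,  # Latin-1 / Western
    0xEE: 1250,  # East European
    0xCC: 1251,  # Cyrillic
    0xDE: 874,   # Thai
}


def decode_run(raw: bytes, charset: int) -> str:
    """Decode 8-bit text using the codepage implied by a font's charset byte."""
    codepage = CHARSET_TO_CODEPAGE[charset]  # KeyError for unlisted charsets
    return raw.decode(f"cp{codepage}")


# Example: six bytes tagged with the Cyrillic charset.
print(decode_run(b"\xcf\xf0\xe8\xe2\xe5\xf2", 0xCC))
```

A real document would carry many more charset values; an unknown one is better reported as an error than silently decoded as Latin-1.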
> Can't I get an application like Word to output Unicode?
Not easily :-(
> > Depending of your platform, the job is more than probably already done,
> > but the particular solution you have to use (iconv, mbs[r]towcs,
> > MultiByteToWideChar, recode, ...) is dedicated.
>
> Are these applications or calls in some library?
Library calls (except recode). That makes them easier to use in your case :-)
  iconv is nearly standard on Unix platforms;
  MultiByteToWideChar is Microsoft's counterpart;
  mbs[r]towcs are standard C, but you should verify for yourself that
  the output (the "wc" end) is really Unicode: a lot of libraries
  do an awful job here. When it does work well, though, it is the
  most powerful of the lot...
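Whichever call is used, the job is the same: 8-bit codes in, Unicode character codes out. Here is a minimal sketch of that step in Python, where `bytes.decode` plays the role that iconv or MultiByteToWideChar would play in C. The helper name `to_char_codes` is hypothetical, and the sketch assumes the font exposes a Unicode character map to look the results up in, which may not hold for a font that only has a symbol map.

```python
def to_char_codes(raw: bytes, encoding: str) -> list[int]:
    """Convert 8-bit encoded text to a list of Unicode code points.

    Each returned integer is a character code that could then be passed
    to the font's char-to-glyph-index lookup (assuming a Unicode charmap).
    """
    return [ord(ch) for ch in raw.decode(encoding)]


# Two Thai bytes in codepage 874, printed as U+XXXX code points.
codes = to_char_codes(b"\xa1\xa2", "cp874")
print([f"U+{c:04X}" for c in codes])
```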
Hope it helps,
Antoine