help-gnu-emacs

Re: Emacs in xterm and Cyrillic?


From: vedm
Subject: Re: Emacs in xterm and Cyrillic?
Date: 13 Apr 2005 21:47:17 -0400
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.4

Kevin Rodgers <address@hidden> writes:

> vedm wrote:
>  > My cyrillic files are encoded in iso8859-5, just because that encoding
>  > is within the ASCII set and is enough for the cyrillic script. Yes, I
>  > agree that UTF is better for handling all sorts of languages, but I
>  > still haven't tried to use it in emacs and Xterm. (One disadvantage of
>  > UTF is that the UTF files (at least cyrillic files) are almost two times
>  > bigger compared to ASCII encoded files).
> 
> That can't be the case, because Cyrillic characters can't even be
> represented in ASCII.  

The first half of this statement is true, the second one is not. When
you say "That can't be the case" I assume you mean that iso-8859-5 is
not ASCII, and that is true. Until now I thought of the ISO-8859 family
of encodings as "8-bit ASCII". But I have now done some research and
found this good review of character codes:
http://www.cs.tut.fi/~jkorpela/chars.html
As it says:

<quote>

The misnomer "8-bit ASCII"

...ASCII is strictly and unambiguously a 7-bit code in the sense that
all code positions are in the range 0 - 127.

It (the term "8-bit ASCII") is a misnomer used to refer to various
character codes which are extensions of ASCII in the following sense:
the character repertoire contains ASCII as a subset, the code numbers
are in the range 0 - 255, and the code numbers of ASCII characters equal
their ASCII codes.

</quote>
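
The definition quoted above can be checked against iso-8859-5 directly.
What follows is only a minimal sketch, assuming Python 3 and its built-in
"iso8859_5" and "ascii" codecs; the function names and the sample word
are made up for illustration and do not come from the review.

```python
# Sketch: does ISO-8859-5 extend ASCII in the sense quoted above?
# Assumption: Python 3 with its standard "iso8859_5" and "ascii" codecs.

def is_ascii_extension(codec: str) -> bool:
    """True if every ASCII code point 0-127 encodes to the same single byte."""
    return all(chr(i).encode(codec) == bytes([i]) for i in range(128))

def byte_values(codec: str, text: str) -> list:
    """Byte values that the codec assigns to the characters of text."""
    return list(text.encode(codec))

sample = "Жук"  # three Cyrillic letters, chosen arbitrarily

# ASCII characters keep their ASCII codes under ISO-8859-5.
print(is_ascii_extension("iso8859_5"))

# The Cyrillic letters land in the upper half of the 8-bit range.
print(byte_values("iso8859_5", sample))

# The plain ASCII codec has no code positions for these letters.
try:
    sample.encode("ascii")
except UnicodeEncodeError as err:
    print("ascii codec refuses:", err.reason)
```

The first check covers the "ASCII as a subset" part of the definition; the
second shows that the Cyrillic letters use code numbers above 127, which
is exactly the part that plain 7-bit ASCII does not have.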

Now, the second part of your statement is that "Cyrillic characters
can't even be represented in ASCII". But the Cyrillic alphabet consists
of about 30 letters (Bulgarian - 30, Russian - 33), and the 7-bit ASCII
code has 128 positions, which is clearly more than enough to encode 30
letters (or 60, for upper and lower case).

In fact, the first Cyrillic encodings used a 7-bit char-set. A good
discussion of the Cyrillic character sets can be found here:
http://czyborra.com/charsets/cyrillic.html

> It is true that your Cyrillic files will be
> encoded in ISO-8859-5 with just 1 byte per character, whereas the
> Cyrillic characters require 2 bytes in UTF-8 (I don't know about
> UTF-16).  But the actual size of the UTF-8 files will depend on how many
> Cyrillic vs. ASCII characters are present, since the ASCII characters
> are still represented as a single byte.

My Cyrillic files consist entirely of Cyrillic characters (excluding
special characters like spaces, newlines, dots, etc.), so they are
invariably almost two times bigger when encoded in UTF-8. And these
files are meant to be on my web server: so if they are UTF-8, my server
would have to pass double the data for each page...unless there is some
compression trick.
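
The size difference, and what a compression step does to it, can be
measured rather than guessed. This is a rough sketch, assuming Python 3
and its standard gzip module; the sample sentence and the function name
are illustrative only, and the repeated sample compresses far better
than real pages would, so the numbers are only indicative.

```python
import gzip

def encoded_sizes(text: str) -> dict:
    """Byte counts for text in ISO-8859-5 and UTF-8, raw and gzip-compressed."""
    raw_8859 = text.encode("iso8859_5")
    raw_utf8 = text.encode("utf-8")
    return {
        "iso-8859-5": len(raw_8859),
        "utf-8": len(raw_utf8),
        "iso-8859-5 gzip": len(gzip.compress(raw_8859)),
        "utf-8 gzip": len(gzip.compress(raw_utf8)),
    }

# Illustrative sample: mostly Cyrillic letters plus spaces and punctuation.
# Repeating it makes the text unrealistically easy to compress.
sample = "Съешь же ещё этих мягких французских булок, да выпей чаю. " * 200

sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>16}: {size} bytes")
print("raw utf-8 / iso-8859-5 ratio:", round(sizes["utf-8"] / sizes["iso-8859-5"], 2))
```

To try it on a real page, read the file with open(path, encoding="iso8859_5")
and pass its contents to encoded_sizes. The raw ratio stays a bit under two
because spaces and punctuation are still one byte each in UTF-8.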


-- 
vedm

