Re: [Bug-gnubg] updated po files
From: Jim Segrave
Subject: Re: [Bug-gnubg] updated po files
Date: Tue, 15 Jun 2004 21:50:41 +0200
User-agent: Mutt/1.4.1i
On Tue 15 Jun 2004 (09:03 +0000), Joern Thyssen wrote:
> On Tue, Jun 15, 2004 at 10:33:45AM +0200, Petr Kadlec wrote
...
> > Well, my file has a different checksum, but if I convert the line ends
> > using fromdos, I get exactly the same hash, which proves that CVS
> > converts line ends during transfer (which is good and well).
>
> Well, yes and no. For single byte character sets this is good. However,
> for multiple byte characters sets this is problematic. For example, the
> unicode character sequence for a c with a dot above is 0x01 0x0A. I
> think cvs would convert this to 0x01 0x0A 0x0D inserting a line feed in
> the text. Consider this imaginary UTF-8 sequence: 0x0A 0x56 being
> converted by cvs to 0x0A 0x0D 0x56, which is probably an illegal UTF-8
> sequence.
>
> Anyway, I can see that Kaoru has committed a fix.
I think the Unicode character set here is encoded as UTF-8, where the
representations are somewhat different - see
<URL:http://www1.tip.nl/~t876506/utf8tbl.html>
The first byte of each sequence tells you how long it is: a byte
< 0x80 is a 1 byte character, 0xc0..0xdf is the first byte of a 2 byte
sequence, 0xe0..0xef the first byte of a 3 byte sequence, etc. So the
0x0a 0x56 becoming 0x0a 0x0d 0x56 would not create a new UTF-8
character. Further, every byte of a multibyte UTF-8 sequence has bit 7
set, so no multibyte character will ever contain 0x0a, and converting
0x0a to 0x0d 0x0a can't corrupt one.
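As a rough illustration of that point, here is a minimal Python sketch.
The crlf() helper and the sample text are made up for the example; it
only assumes that the line end conversion is a plain byte-level
replacement of 0x0a with 0x0d 0x0a.

```python
def crlf(data: bytes) -> bytes:
    """Hypothetical helper: byte-level LF -> CRLF conversion."""
    return data.replace(b"\n", b"\r\n")

# U+010B is the c with a dot above; U+20AC is the euro sign.
text = "\u010b\n\u20ac\n"
encoded = text.encode("utf-8")

# In UTF-8 the c with dot above is the two bytes 0xc4 0x8b,
# both with bit 7 set, so neither can be mistaken for 0x0a.
multibyte = "\u010b".encode("utf-8")
all_high = all(b >= 0x80 for b in multibyte)

# Converting line ends leaves the multibyte sequences intact,
# so the result still decodes cleanly as UTF-8.
converted = crlf(encoded)
roundtrip = converted.decode("utf-8")
```

After the conversion the only change in roundtrip is that each "\n"
has become "\r\n"; the non-ASCII characters are untouched.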
What can be a problem is using 8 bit single byte characters and then
reading the result expecting UTF-8, since any bytes with the high bit
set (0x80 and above) in the input will be mistakenly treated as parts
of (possibly illegal) multi-byte encodings.
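A small Python sketch of that failure mode, assuming ISO 8859-1
(Latin-1) as the single byte character set; the sample strings are
invented for the example.

```python
# "cafe" with an e-acute, stored as Latin-1: the last byte is 0xe9.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Read as UTF-8, 0xe9 looks like the first byte of a 3 byte sequence
# with nothing following it, so decoding fails.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Worse, some byte pairs happen to form a legal sequence and are
# silently turned into a different character: the two Latin-1
# characters 0xc3 0xa9 come out as a single e-acute.
two_chars = "\u00c3\u00a9".encode("latin-1")
misread = two_chars.decode("utf-8")
```

The first case at least fails loudly; the second one gives wrong text
with no error at all.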
--
Jim Segrave address@hidden