[bug-gnu-libiconv] Invalid byte sequences and multibyte encodings
From: Victor Stinner
Subject: [bug-gnu-libiconv] Invalid byte sequences and multibyte encodings
Date: Mon, 09 May 2011 17:47:13 +0200
Hi,
Someone opened an issue in the Python bug tracker asking to change how
invalid multibyte sequences are handled.
http://bugs.python.org/issue12016
b'\xffabc'.decode('gb2312', 'replace') gives "�bc". The 'a' character is
treated as the second byte of a two-byte character. Because {0xFF, 0x61}
is invalid in GB2312, both bytes are replaced by a single U+FFFD.
Is this the "right" thing to do? Or should we ignore/replace only 0xFF and
restart the decoder at 'a', giving "�abc"?
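To make the two options concrete, here is a minimal sketch of a toy two-byte decoder with both error policies. It is not the actual Python or libiconv decoder: the function name `decode_two_byte` and its `resync` flag are made up for illustration, and it simply assumes that every byte >= 0x80 starts a two-byte character, using Python's strict codec only to check whether a pair is valid.

```python
REPLACEMENT = "\ufffd"


def decode_two_byte(data: bytes, encoding: str = "gb2312", resync: bool = True) -> str:
    """Toy decoder (hypothetical helper) for a two-byte encoding.

    Assumption: any byte >= 0x80 is a lead byte followed by one trail byte.
    resync=False: an invalid pair is replaced as a whole (both bytes consumed).
    resync=True:  only the lead byte is replaced; decoding restarts at the
                  next byte.
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # Plain ASCII byte: copy through unchanged.
            out.append(chr(byte))
            i += 1
            continue
        pair = data[i:i + 2]
        try:
            # Strict decoding tells us whether the pair is a valid character.
            out.append(pair.decode(encoding))
            i += len(pair)
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
            # The only difference between the two policies is how many
            # bytes are skipped after an invalid sequence.
            i += 1 if resync else len(pair)
    return "".join(out)


if __name__ == "__main__":
    print(repr(decode_two_byte(b"\xffabc", resync=False)))
    print(repr(decode_two_byte(b"\xffabc", resync=True)))
```

With `resync=False` the 'a' is swallowed together with 0xFF; with `resync=True` it survives, which is the behavior the bug report asks for.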
The UTF-8 decoder was changed recently to replace only the invalid bytes
and restart the decoder at the next byte, so '\xF1\x80\x41\x42\x43' is now
decoded as "�ABC" instead of "�C".
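The UTF-8 case can be checked directly. This is a small sketch that assumes a recent CPython 3 release, where the truncated sequence F1 80 is replaced by a single U+FFFD and decoding resumes at 0x41; the variable names are only illustrative.

```python
# F1 80 begins a four-byte UTF-8 sequence that is cut short by 0x41 ('A').
data = b"\xF1\x80\x41\x42\x43"

# 'replace' substitutes U+FFFD for the invalid bytes, 'ignore' drops them.
replaced = data.decode("utf-8", "replace")
ignored = data.decode("utf-8", "ignore")

print(repr(replaced))
print(repr(ignored))
```

In both cases the ASCII bytes that follow the broken sequence are preserved.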
Should we do the same for all encodings? Or at least for Asian encodings
(gb2312, gbk, gb18030, Big5 family, ISO 2022 family, JIS family, EUC_KR,
CP949, CP950, ...)?
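To see what a given Python build currently does for each of these encodings, something like the following sketch could be used. The `survey` helper and the `CJK_CODECS` list are made up for this example; the codec names are standard Python aliases, and the result depends on the Python version, so no particular output is assumed here.

```python
# Python codec names for some of the encodings listed above.
CJK_CODECS = [
    "gb2312", "gbk", "gb18030",
    "big5", "cp950",
    "euc_kr", "cp949",
    "euc_jp", "shift_jis", "iso2022_jp",
]


def survey(data: bytes = b"\xffabc", errors: str = "replace") -> dict:
    """Decode the same invalid input with every codec and collect the results."""
    return {name: data.decode(name, errors) for name in CJK_CODECS}


if __name__ == "__main__":
    for name, text in survey().items():
        # A result ending in 'abc' means the decoder restarted after 0xFF;
        # a result ending in 'bc' only means the 'a' was swallowed.
        print(f"{name:12} {text!r}")
```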
I hope this question is not too off-topic for your mailing list.
Victor Stinner
PS: Could you please CC me on your replies?