[bug-gettext] Additional checks for msgfmt

From: Vladimir 'φ-coder/phcoder' Serbinenko
Subject: [bug-gettext] Additional checks for msgfmt
Date: Mon, 02 Apr 2012 17:55:29 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.3) Gecko/20120329 Icedove/10.0.3

Hello, by analysing the PO files on the Translation Project (TP), I have
found a series of common errors which are trivial to detect automatically:
- File claiming to use a one-byte encoding but in reality being UTF-8. This
can be detected by checking whether all messages are valid UTF-8 and contain
non-ASCII characters. Such strings are very unlikely to be one-byte encoded
due to the structure of UTF-8 (it never happened in the TP analysis).
- Comments being in a legacy encoding in a UTF-8 file. Check that the whole
file makes sense in the given encoding.
- Presence of C1 control codes instead of some letters (thanks to cp1250
vs latin1 confusion). These codes should never be used as their
behaviour is largely unpredictable. There was no legitimate use on TP.
- C0 control characters other than \n, \r, \v, \t, \a, \e. Same as C1
but for different reasons. xchat apparently uses \2 more or less validly,
and these characters are present in the msgids as well.
- Presence of U+FFFD due to failed conversion. U+FFFD means "invalid
- (old files only) Usage of ISO 2022. This is a largely obsolete
encoding which isn't converted well into UTF-8 (iconv, among others,
doesn't). In one case it was an obscure unspecified variant. It's easy to
detect by its very distinctive escapes: \e$( and \e(. Perhaps only vt100 escapes should be
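
To make the checks listed above concrete, here is a rough sketch in Python. It is not msgfmt code: the function names, the problem labels and the idea of checking the raw file bytes in one pass are all my assumptions, and "declared charset is not UTF-8" is used as a simplified stand-in for "claims a one-byte encoding".

```python
import codecs
import re

# C0 controls treated as acceptable: \a \t \n \v \r \e (as listed above).
ALLOWED_C0 = {0x07, 0x09, 0x0A, 0x0B, 0x0D, 0x1B}

# ISO 2022 designation escapes mentioned above: ESC $ ( and ESC (
ISO2022_RE = re.compile(r"\x1b\$?\(")


def looks_like_utf8(raw: bytes) -> bool:
    """True if raw contains non-ASCII bytes and is valid UTF-8."""
    if raw.isascii():
        return False
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_po_bytes(raw: bytes, declared: str) -> list:
    """Return a list of (hypothetical) problem labels for a PO file."""
    problems = []

    # Simplification: any non-UTF-8 declaration counts as "legacy".
    if codecs.lookup(declared).name != "utf-8" and looks_like_utf8(raw):
        problems.append("declared-not-utf8-but-is-utf8")

    # The whole file, comments included, must decode in the declared charset.
    try:
        text = raw.decode(declared)
    except UnicodeDecodeError:
        problems.append("undecodable-in-declared-charset")
        return problems

    code_points = {ord(ch) for ch in text}
    if any(0x80 <= cp <= 0x9F for cp in code_points):
        problems.append("c1-control")
    if any(cp < 0x20 and cp not in ALLOWED_C0 for cp in code_points):
        problems.append("c0-control")
    if 0xFFFD in code_points:
        problems.append("replacement-char")
    if ISO2022_RE.search(text):
        problems.append("iso2022-escape")
    return problems
```

For example, check_po_bytes("š".encode("cp1250"), "iso-8859-1") reports a C1 control code, because the cp1250 byte 0x9A lands in the C1 range when read as latin1. A real implementation would presumably want to exempt known cases such as xchat's \2.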

None of these errors is hard to check for. Should I write a patch for

Vladimir 'φ-coder/phcoder' Serbinenko
