Re: _DefaultStringEncoding
From: Richard Frith-Macdonald
Subject: Re: _DefaultStringEncoding
Date: Sat, 18 Oct 2003 07:20:09 +0100
On Friday, October 17, 2003, at 03:14 PM, Bruno Haible wrote:
Hi,
NSString._DefaultStringEncoding is determined as the value of
GetDefEncoding() in Unicode.m.
I have three questions about it.
1) Why are the possible values of GNUSTEP_STRING_ENCODING in the range
{ "NSISOLatin1StringEncoding", "NSJapaneseEUCStringEncoding", ... }
and not the widely known and standardized names
{ "ISO-8859-1", "EUC-JP", ... }?
This makes it needlessly hard for users.
Because the OpenStep standard names are used ... but I agree there is
no reason why the names that iconv supports should not be acceptable
as well. I've fixed that.
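To illustrate what accepting both kinds of name could look like, here is a
minimal sketch in C. It is not the code in Unicode.m: the enum, the table and
the function encoding_from_name() are hypothetical names invented for this
example, and only the encoding names themselves come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical encoding identifiers, for illustration only. */
typedef enum {
    ENC_UNKNOWN = 0,
    ENC_ISO_LATIN1,
    ENC_JAPANESE_EUC,
    ENC_UTF8
} encoding_id;

struct encoding_alias {
    const char *name;
    encoding_id id;
};

/* Each encoding appears under its OpenStep name and its iconv-style name. */
static const struct encoding_alias aliases[] = {
    { "NSISOLatin1StringEncoding",   ENC_ISO_LATIN1 },
    { "ISO-8859-1",                  ENC_ISO_LATIN1 },
    { "NSJapaneseEUCStringEncoding", ENC_JAPANESE_EUC },
    { "EUC-JP",                      ENC_JAPANESE_EUC },
    { "NSUTF8StringEncoding",        ENC_UTF8 },
    { "UTF-8",                       ENC_UTF8 },
};

/* Look up an encoding by either name form, ignoring case.
 * name must not be NULL; unknown names yield ENC_UNKNOWN. */
encoding_id encoding_from_name(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof aliases / sizeof aliases[0]; i++) {
        if (strcasecmp(name, aliases[i].name) == 0)
            return aliases[i].id;
    }
    return ENC_UNKNOWN;
}
```

A flat alias table keeps the OpenStep names working unchanged while letting
users write the names they already know from iconv.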
2) Why does gnustep-base-1.8.0/Documentation/Base.gsdoc say that the
value of GNUSTEP_STRING_ENCODING
"may be any of the 8-bit encodings supported by your system
(excluding multi-byte encodings)" ?
I've set it to NSUTF8StringEncoding and the Hello world program displays
its greeting message (in German, non-ASCII of course) just fine.
It's an error ... that restriction used to be there a few years ago,
but no longer applies. I've updated the documentation.
3) If GNUSTEP_STRING_ENCODING is not set, why is the default value
(set in Unicode.m:580) ISO-8859-1? On POSIX systems, all programs
are expected to interpret file names and file contents according to
the encoding given by the current locale (nl_langinfo (CODESET)).
IMO this codeset should be taken and transformed into the GNUstep
specific equivalent name. I'm using a de_DE.UTF-8 locale and all
my local files are UTF-8 encoded.
As far as I'm aware ... there is no particular reason why GNUstep should
not be POSIX compliant as long as it doesn't seriously conflict with
OpenStep and Apple compatibility. I'd be happy to accept a patch to make
this change as long as nobody knows a good reason not to.
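For anyone preparing such a patch, the lookup order under discussion might be
sketched as follows. This is only an illustration, not the existing
GetDefEncoding() code: default_encoding_name() is a hypothetical helper, and
it assumes that calling setlocale() at this point is acceptable, which changes
the process-wide LC_CTYPE locale. The codeset name returned by nl_langinfo()
is platform-specific and would still have to be translated into the
corresponding GNUstep encoding.

```c
#include <assert.h>
#include <langinfo.h>  /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Choose the name of the default string encoding.
 * Order: an explicit GNUSTEP_STRING_ENCODING setting, then the codeset
 * of the current locale, then ISO-8859-1 as the historical fallback.
 * The name is copied into buf (always NUL-terminated, truncated if
 * necessary) and buf is returned.
 */
char *default_encoding_name(char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    const char *name = getenv("GNUSTEP_STRING_ENCODING");
    if (name == NULL || *name == '\0') {
        /* Adopt the user's locale (LANG, LC_ALL, LC_CTYPE) before asking. */
        setlocale(LC_CTYPE, "");
        name = nl_langinfo(CODESET);
    }
    if (name == NULL || *name == '\0')
        name = "ISO-8859-1";

    snprintf(buf, buflen, "%s", name);
    return buf;
}
```

Checking the environment variable first keeps an explicit user setting
authoritative, so the locale only decides the default.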
The situation for URLs is different; for files read from arbitrary
URLs the following heuristic makes sense:
- If the content is valid UTF-8, then assume it is UTF-8.
- Otherwise assume it is ISO-8859-1.
The reason why this heuristic works well in practice is that normal
human-written ISO-8859-1 texts have a ~ 99.8% probability of being
invalid UTF-8.
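The heuristic above could be sketched in C along these lines. The functions
is_valid_utf8() and guess_encoding() are hypothetical names for this example
and are not part of GNUstep; the validity check follows the well-formedness
rules of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if the len bytes at s form well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;        /* continuation bytes expected */
        unsigned long min;   /* smallest code point allowed at this length */
        unsigned long cp;    /* decoded code point */

        if (c < 0x80) { i++; continue; }   /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; min = 0x80;    cp = c & 0x1F; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; min = 0x800;   cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; min = 0x10000; cp = c & 0x07; }
        else return false;   /* stray continuation or invalid lead byte */

        if (len - i <= extra) return false;   /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min) return false;                      /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
        if (cp > 0x10FFFF) return false;                 /* out of range */
        i += extra + 1;
    }
    return true;
}

/* Valid UTF-8 is taken as UTF-8, anything else as ISO-8859-1. */
const char *guess_encoding(const unsigned char *s, size_t len)
{
    return is_valid_utf8(s, len) ? "UTF-8" : "ISO-8859-1";
}
```

Pure ASCII input is valid UTF-8 and is therefore reported as UTF-8, which is
harmless because both encodings agree on those bytes.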