Re: plists in UTF8
From: Richard Frith-Macdonald
Subject: Re: plists in UTF8
Date: Wed, 14 Jun 2006 13:31:21 +0100
On 14 Jun 2006, at 13:12, David Ayers wrote:
The issue is whether a UTF-8 plist without a BOM is a valid plist
(i.e. should be considered non-portable).
Well, if it has no BOM then how do you know it's UTF-8? For an XML
plist you can theoretically use the initial header to determine
character encoding (we don't have support for that and it's not in
the OpenStep/MacOS-X spec/documentation that we should), but other
than that the only standard we have is to use the encoding for the
locale we are working in ... which is non-portable by definition.
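
To make the two explicit signals concrete, here is a minimal sketch in Python of checking for a BOM first and then for an encoding declaration in an XML header. It is only an illustration of the idea, not what GNUstep does (as noted above, the XML header is not consulted); the name sniff_encoding and the simplified regular expression are assumptions made for this example.

```python
import codecs
import re

# Known byte order marks. UTF-32 comes first because the UTF-32-LE mark
# starts with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified pattern for an XML declaration written in an ASCII-compatible
# encoding; a real XML parser handles more cases than this.
_XML_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']"
)


def sniff_encoding(data: bytes):
    """Return the encoding the data itself declares, or None if it declares none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no XML header: only the locale is left, which is not portable.
    return None
```

A None result is the case under discussion: nothing in the file says what encoding it uses.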
I've often read that BOMs in UTF-8 files cause issues (e.g.:
http://en.wikipedia.org/wiki/Byte_Order_Mark). It becomes a problem
when multiple text files are concatenated and someone (I think it was
you) told me that BOMs within files have been deprecated. (I wonder if
cat(1) or its underlying facilities would be patched to handle this).
I think a BOM within (i.e. not at the start of) a file is actually
illegal. It's the use of U+FEFF as a zero-width no-break space (the
same character acts as a BOM at the start of a file) which is deprecated.
I guess you just can't really use 'cat' to join UTF-8 (or UTF-16)
files ... depends whether you consider 'cat' to be a binary data
utility or a text utility ... probably some people would argue it
works correctly if it just concatenates the data streams.
Historically, we are used to having the same tools work with binary
data and with text, but in a world with different locales and
different text coding schemes that's no longer the case.
I don't believe that BOMs cause special problems ... they only cause
problems if you join text files improperly ... which is really no
worse (perhaps better because it's more easily detected) than if you
concatenated files containing text in different encodings.
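
As a rough illustration of joining such files properly, the following Python sketch concatenates UTF-8 byte strings and drops the BOM from every piece except the first. The function name join_utf8 is made up for this example, and it assumes every input really is UTF-8.

```python
import codecs


def join_utf8(chunks):
    """Concatenate UTF-8 byte strings, keeping at most one leading BOM."""
    out = bytearray()
    for index, chunk in enumerate(chunks):
        # A BOM is only meaningful at the very start of the joined stream,
        # so strip it from every chunk after the first.
        if index > 0 and chunk.startswith(codecs.BOM_UTF8):
            chunk = chunk[len(codecs.BOM_UTF8):]
        out += chunk
    return bytes(out)


first = codecs.BOM_UTF8 + "one\n".encode("utf-8")
second = codecs.BOM_UTF8 + "two\n".encode("utf-8")

naive = first + second                # what a byte-level 'cat' produces
joined = join_utf8([first, second])   # text-aware join
```

The naive result carries a second BOM in the middle of the data, which is easy to detect; the text-aware join leaves only the one at the start.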
I think that one could argue that a plain UTF-8 file should be
considered valid/portable by plparse... But for that to be of any value
would also mean that UTF-8 files would be parsed correctly in
non-UTF-8 locales, which I suppose is the reason that UTF-8 without BOM
is currently considered non-portable.
Well yes ... if there is no means of telling that a file is UTF-8 ...
then for practical purposes it isn't UTF-8 ... it's just a bunch of
bytes with no known meaning. You can guess what encoding it is, but
that guess is going to vary depending on the locale you are in.
Guessing may be reasonable for an editor (debatable), but is
inappropriate for a checker.
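
A small Python sketch of why the guess varies: the same bytes, with no BOM, decoded under three encodings a locale might specify. The helper name decode_under and the sample text are assumptions chosen for illustration.

```python
def decode_under(data: bytes, encodings):
    """Decode the same bytes under each encoding; None marks a failed decode."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# "café" written out as UTF-8 without a BOM.
sample = "caf\u00e9".encode("utf-8")
readings = decode_under(sample, ["utf-8", "iso-8859-1", "ascii"])
```

Here the UTF-8 reading gives back "café", the ISO-8859-1 reading silently yields "cafÃ©", and the ASCII reading fails outright. None of the three is wrong by the bytes alone, which is why a checker should not pick one.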