Re: plists in UTF8

discuss-gnustep

From:	Richard Frith-Macdonald
Subject:	Re: plists in UTF8
Date:	Wed, 14 Jun 2006 13:31:21 +0100


On 14 Jun 2006, at 13:12, David Ayers wrote:

The issue is whether a UTF-8 plist without a BOM is a valid plist (i.e.
whether it should be considered non-portable).

Well, if it has no BOM then how do you know it's UTF-8? For an XML plist you can theoretically use the initial header to determine character encoding (we don't have support for that and it's not in the OpenStep/MacOS-X spec/documentation that we should), but other than that the only standard we have is to use the encoding for the locale we are working in ... which is non-portable by definition.
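To make the BOM argument concrete, here is a minimal sketch of how a reader could identify an encoding from a leading byte order mark. It is written in Python purely for illustration; `sniff_bom` is a hypothetical helper name, not anything in GNUstep or plparse, and the sketch only assumes the standard BOM byte sequences.

```python
import codecs

def sniff_bom(data: bytes):
    """Hypothetical helper: return the encoding implied by a leading BOM,
    or None when the data carries no BOM at all."""
    # The UTF-32 marks are tested first because the UTF-32 LE mark
    # begins with the same two bytes as the UTF-16 LE mark.
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    # No BOM: the bytes alone do not say which encoding was used.
    return None
```

With no BOM the function returns None, which is the situation described above: the reader has nothing but the locale to fall back on.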

I've often read that BOMs in UTF-8 files cause issues (e.g.:
http://en.wikipedia.org/wiki/Byte_Order_Mark).  It becomes a problem
when multiple text files are concatenated, and someone (I think it was
you) told me that BOMs within files have been deprecated. (I wonder if
cat(1) or its underlying facilities would be patched to handle this).

I think a BOM within (i.e. not at the start of) a file is actually illegal. It's the zero-width no-break space in UTF-16 (which acts as a BOM at the start of a file) which is deprecated.

I guess you just can't really use 'cat' to join UTF-8 (or UTF-16) files ... depends whether you consider 'cat' to be a binary data utility or a text utility ... probably some people would argue it works correctly if it just concatenates the data streams. Historically, we are used to having the same tools work with binary data and with text, but in a world with different locales and different text coding schemes that's no longer the case.

I don't believe that BOMs cause special problems ... they only cause problems if you join text files improperly ... which is really no worse (perhaps better because it's more easily detected) than if you concatenated files containing text in different encodings.
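As an illustration of the difference between joining byte streams and joining text, the following Python sketch concatenates two BOM-marked UTF-8 inputs both ways. The function names (`naive_cat`, `text_cat`, `has_interior_bom`) are hypothetical, and the approach of decoding each part and writing a single leading BOM is just one possible way to join text files properly, not something any existing tool is claimed to do.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def naive_cat(*parts: bytes) -> bytes:
    """Join the raw byte streams, as a binary-data utility would."""
    return b"".join(parts)

def text_cat(*parts: bytes) -> bytes:
    """Join as text: 'utf-8-sig' drops one leading BOM from each part,
    then a single BOM is written at the start of the result."""
    texts = [part.decode("utf-8-sig") for part in parts]
    return BOM + "".join(texts).encode("utf-8")

def has_interior_bom(data: bytes) -> bool:
    """True if the BOM byte sequence occurs anywhere after offset 0."""
    return data.find(BOM, 1) != -1
```

The interior mark left by the naive join is easy to detect with a byte search, which is the sense in which this failure is more visible than silently mixing two different encodings.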

I think that one could argue that a plain UTF-8 file should be
considered valid/portable by plparse... But for that to be of any value,
it would also mean that UTF-8 files would be parsed correctly in non-UTF-8
locales, which I suppose is the reason that UTF-8 without BOM is
currently considered non-portable.

Well yes ... if there is no means of telling that a file is UTF-8 ... then for practical purposes it isn't UTF-8 ... it's just a bunch of bytes with no known meaning. You can guess what encoding it is, but that guess is going to vary depending on the locale you are in. Guessing may be reasonable for an editor (debatable), but is inappropriate for a checker.
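A small Python sketch of why the guess depends on the locale: the same unmarked bytes decode without error under several encodings, each giving different text. The helper name `interpretations` and the choice of encodings are assumptions made for the example only.

```python
def interpretations(data: bytes, encodings):
    """Hypothetical helper: decode the same bytes under each candidate
    encoding and return a mapping of encoding name to resulting text."""
    return {name: data.decode(name) for name in encodings}

# U+00E9 (e with acute accent) written as UTF-8 with no BOM: bytes C3 A9.
raw = "\u00e9".encode("utf-8")

readings = interpretations(raw, ["utf-8", "latin-1", "cp1252"])
# The UTF-8 reading is one character; the single-byte encodings each
# read two characters, and none of the decodes reports an error.
```

Nothing in the bytes themselves says which reading is the intended one, so a checker that picks one is reporting on its own locale rather than on the file.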


