Re: [Lynx-dev] Unicode-marking, &c

From:	Thorsten Glaser
Subject:	Re: [Lynx-dev] Unicode-marking, &c
Date:	Fri, 27 Feb 2009 09:24:00 +0000 (UTC)

David Woolley dixit:

>> Here under Windows there are constant references to the character that
>> begins a 16-bit-wide-character file (FF FE) or UTF-8 file (EF BB BF).
>
> These are all valid printable characters in ISO 8859/x.  Although somewhat
> unlikely combinations, they are not reserved sequences.

We are talking about a file that does _begin_ with these byte sequences
here, not a file that solely consists of them.

For UCS-*, things are quite clear: you get <\0h\0t\0m\0l\0>, so it
obviously is not any 8-bit encoding.
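
A minimal sketch of that check, assuming Python and a hypothetical helper
name (sniff_wide_encoding is not anything Lynx provides). It looks for the
byte sequences quoted above first, then falls back to the observation that
ASCII markup stored in 16-bit units has a NUL in every other octet. The
returned labels are illustrative only.

```python
import codecs

def sniff_wide_encoding(data: bytes):
    """Guess a BOM-marked or 16-bit encoding; return None if undecided."""
    # The UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE),
    # so the longer marks have to be tested first.
    if data.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"

    # No mark: inspect the first few octets for the <\0h\0t\0m\0l\0> pattern.
    head = data[:64]
    head = head[:len(head) - (len(head) % 2)]  # trim to an even length
    if len(head) < 2:
        return None
    even, odd = head[0::2], head[1::2]
    if all(b == 0 for b in odd) and all(b != 0 for b in even):
        return "utf-16-le"   # low byte first, e.g. '<' 00 'h' 00
    if all(b == 0 for b in even) and all(b != 0 for b in odd):
        return "utf-16-be"   # high byte first, e.g. 00 '<' 00 'h'
    return None
```

The NUL test only works while the start of the file is plain ASCII markup,
which is the case described here; anything else is left undecided.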

For UTF-8, it’s not that easy, but:

• If the file is UTF-8 and uses any non-ASCII characters, it almost
  always will contain an octet from the [0x80‥0x9F] range (C1 control
  codes in latin1, not printable characters), which practically rules
  it out from being encoded as latin1

• In case of doubt: If the file contains only valid UTF-8 with no
  encoding errors (invalid multibyte sequences), lean towards UTF-8,
  as it’s the current standard replacing the 8-bit character sets

• If the file only contains ASCII characters, point #1 above no
  longer applies, but the difference is moot anyway
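
The three points above can be sketched roughly as follows, assuming Python.
The function names and the returned labels (including "unknown-8bit") are
hypothetical and chosen for illustration; this is not how Lynx decides.

```python
def has_c1_octets(data: bytes) -> bool:
    """Point 1: True if any octet lies in the [0x80..0x9F] range."""
    return any(0x80 <= b <= 0x9F for b in data)

def guess_encoding(data: bytes) -> str:
    """Lean towards UTF-8 whenever the input is valid UTF-8."""
    # Point 3: pure ASCII reads the same as latin1 or UTF-8.
    if data.isascii():
        return "ascii"
    # Point 2: no invalid multibyte sequences, so lean towards UTF-8.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not UTF-8. Octets in the C1 range also speak against latin1,
        # so report that case separately (assumed label).
        return "unknown-8bit" if has_c1_octets(data) else "latin1"
    return "utf-8"
```

Point 1 is evidence rather than proof: "ß" in UTF-8 is C3 9F and trips
has_c1_octets(), while "é" is C3 A9 and does not, which is why the strict
decode is what settles the question here.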

bye,
//mirabilos
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
        -- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2
