lynx-dev

Re: [Lynx-dev] Unicode-marking, &c


From: David Woolley
Subject: Re: [Lynx-dev] Unicode-marking, &c
Date: Fri, 27 Feb 2009 08:15:46 +0000
User-agent: Thunderbird 2.0.0.19 (X11/20081209)

 wrote:
I saw queer little characters begin some webpages, and upon seeing such
on local webpages rendered right here, I suspect that they are magic numbers
that now mark in-Unicode-or-UTF8-encoded files, and Lynx misses this.

They are byte order marks and DO NOT indicate that the file is not ISO 8859/1. You need to know that the file is UTF-* before you start trying to interpret these codes.

Maybe when such really are downloaded it is the server's duty to strip the
page of the magic numbers and turn them into other forms of naming the
file's alphabet, but when the file is local Lynx is stuck with it.

It is the server's duty to send character set indications (which are mandatory in HTML4) that correctly represent the character encoding used. How they identify the character set used is a local issue, not one for standardisation.

What you are probably hitting is the tendency of big name browsers, particularly IE, to interpret pages as what they think the author meant, rather than what they have actually said. The most famous case of this is probably that IE will interpret text/plain as HTML, if it looks like HTML, even if the author's intent was that it be seen as the source code. That is a direct violation of the standards that MS refuse to change.

Here under Windows there are constant references to the character that
begins a 16-bit-wide-character file (FF FE) or UTF-8 file (EF BB BF).

These are all valid printable characters in ISO 8859/x. Although somewhat unlikely combinations, they are not reserved sequences.
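As a rough illustration of that point, the following Python sketch decodes the two byte sequences quoted above as ISO 8859-1 and checks that every resulting character is printable. The names `UTF8_BOM`, `UTF16_LE_BOM` and `as_latin1` are made up for this example, and only ISO 8859-1 is checked, not the other ISO 8859 parts.

```python
# Byte sequences quoted in the message above.
UTF8_BOM = b"\xef\xbb\xbf"      # EF BB BF
UTF16_LE_BOM = b"\xff\xfe"      # FF FE


def as_latin1(data: bytes) -> str:
    """Decode bytes as ISO 8859-1, where every byte maps to one character."""
    return data.decode("iso-8859-1")


def code_points(text: str) -> list:
    """Return the code points of text in ASCII-safe U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]


if __name__ == "__main__":
    for name, raw in (("EF BB BF", UTF8_BOM), ("FF FE", UTF16_LE_BOM)):
        text = as_latin1(raw)
        printable = all(ch.isprintable() for ch in text)
        print(name, code_points(text), "printable:", printable)
```

Read as ISO 8859-1, the bytes are ordinary letters and punctuation marks, so a browser that has been told the page is ISO 8859-1 has no grounds for treating them as anything else.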

Has anything been done about this?

It's a problem of sloppy authoring, much like sending GB2312 without a charset, or even with a windows-1252 one. In particular, if neither the HTTP header nor the meta content-type specifies a charset, but the HTML version is 4 or higher, the page is invalid; and if they specify the wrong charset, that charset is the one that Lynx should use.
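The rule described here could be sketched roughly as follows in Python. This is not Lynx code: `charset_from_content_type` and `declared_charset` are hypothetical helpers, and the sketch assumes that the HTTP header takes priority over the meta element, as HTML 4 specifies. It looks only at what was declared and never inspects the leading bytes of the document.

```python
import re
from typing import Optional


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if present."""
    if not value:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def declared_charset(http_content_type: Optional[str],
                     meta_content_type: Optional[str]) -> Optional[str]:
    """Return the charset the author declared, or None if none was declared.

    Assumption: the HTTP header is consulted first, then the meta
    content-type. The declared value is returned as-is, even if it is
    the wrong one for the actual bytes.
    """
    return (charset_from_content_type(http_content_type)
            or charset_from_content_type(meta_content_type))
```

A result of `None` corresponds to the invalid HTML 4 page described above; what a browser falls back to in that case is deliberately left to the caller.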

--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.



