Re: [Lynx-dev] non-ASCII characters (ISO-8859 or UTF-8?)

From: David Woolley
Subject: Re: [Lynx-dev] non-ASCII characters (ISO-8859 or UTF-8?)
Date: Sat, 15 Jul 2006 13:23:35 +0100 (BST)

> header indicates the character set.  The characters have hex codes
> 0x92 (apostrophe), 0x93 (left quote), 0x94 (right quote), and 0x97
> (em-dash).

They are not ISO 8859/1; they are invalid codes in that character set.

They are probably windows-1252 characters.  Unfortunately Microsoft
software delights in using their proprietary codes for smart quotes
like this.  Historically, they would even generate &#148;, even though
that is invalid in all versions of HTML (numeric entities always encode
the standard character set for HTML, which is ISO 10646, with some
exclusions, for HTML 4 upwards, and ISO 8859/1 before that).
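
As a rough illustration (not anything Lynx itself does), the four byte
values quoted above can be decoded both ways in Python.  The codec names
"cp1252" and "latin-1" are Python's names for windows-1252 and
ISO 8859/1; the variable names are only for this sketch.

```python
# The four byte values from the quoted message.
raw = bytes([0x92, 0x93, 0x94, 0x97])

# Read as windows-1252, they are the "smart" punctuation characters.
as_cp1252 = raw.decode("cp1252")

# Read as ISO 8859/1, Python maps them to C1 control codes with the
# same numeric values, not to printable characters.
as_latin1 = raw.decode("latin-1")

print([f"U+{ord(c):04X}" for c in as_cp1252])
print([f"U+{ord(c):04X}" for c in as_latin1])
```

The first list shows code points in the U+2000 block (quotes and the
em-dash); the second shows U+0092 and friends, which have no glyphs.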

> "7-bit approximation;" setting the option "assumed document character
> set" to ISO-8859 and to UTF-8; setting the option "raw 8-bit

You need to set assumed document character set to windows-1252, if the
actual character codes are being used.  This probably won't work if
the site actively lies about its character set, since the assumed
character set only applies when none is declared.  If entities are
used, I don't know what heuristics Lynx uses for undefined numeric
entity values.
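
For what it is worth, one possible heuristic (an assumption for
illustration, not a description of what Lynx does) is to treat numeric
references in the range 128 to 159, such as &#148;, as windows-1252
byte values.  A minimal Python sketch, with the hypothetical helper name
remap_c1_references:

```python
import re

# Matches decimal numeric character references such as "&#148;".
_NUMERIC_REF = re.compile(r"&#(\d+);")

def remap_c1_references(text):
    """Replace references in 128..159 with their windows-1252 meaning.

    Hypothetical helper for illustration only.  All other references,
    and the few values windows-1252 leaves unassigned, are returned
    unchanged.
    """
    def fix(match):
        code = int(match.group(1))
        if 0x80 <= code <= 0x9F:
            try:
                return bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # e.g. 129 (0x81) has no windows-1252 assignment
                return match.group(0)
        return match.group(0)
    return _NUMERIC_REF.sub(fix, text)

print(remap_c1_references("&#147;quoted&#148; text"))
```

References outside that range are deliberately left alone, so a
separate, ordinary entity decoder would still be needed afterwards.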
