LYNX-DEV Fwd: webwatch-l Strange Numbers in Lynx


From: Laura Eaves
Subject: LYNX-DEV Fwd: webwatch-l Strange Numbers in Lynx
Date: Sat, 10 May 1997 00:00:52 -0400 (EDT)

I just received this from Lloyd Rasmussen, who said I could forward it on to 
lynx-dev, in case anyone else is interested.
Thanks Lloyd!
--le

Date: Fri, 9 May 97 10:44:10 EDT
From: "Lloyd G. Rasmussen" <address@hidden>
Subject: Fwd: Re: webwatch-l Strange Numbers in Lynx

Dear Laura:  Here's some stuff I dug up last week about &#146;.  Since 
it's not part of the HTML-sanctioned character set, but appears to be 
mostly a Microsoft invention, it falls into the category of "do we 
make this a browser that can read everything, or do we make it an HTML 
validator."  Discussion here on Lynx-dev last week also indicated that 
these codes, when flattened from 8 bits to 7, land in the range of 
control characters, which scares some programmers, I guess.  I hope 
you have time to check this out a little.  I use Vocal-Eyes under DOS 
and Window-Eyes under Win 3.1.  I don't have Linux.  I work in the 
braille and talking book program of the Library of Congress.
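
To illustrate the 8-bit-to-7-bit point above, here is a minimal Python
sketch. The function name flatten_to_7bit is made up for this example
and is not taken from Lynx; it only shows what clearing the high bit
does to the codes 128-159.

```python
def flatten_to_7bit(code: int) -> int:
    """Clear the high (eighth) bit of an 8-bit character code."""
    return code & 0x7F

# Every code in the 128-159 range lands in 0-31, the ASCII control range.
flattened = [flatten_to_7bit(c) for c in range(128, 160)]
assert flattened == list(range(0, 32))

# 146 (0x92), the character discussed in this thread, becomes 18 (0x12).
print(flatten_to_7bit(146))  # 18
```

So a stray CP1252 apostrophe, once stripped to 7 bits, would be
indistinguishable from the control character DC2.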


----- Forwarded message begins here -----
From: Lloyd G. Rasmussen  <address@hidden>
To: address@hidden
Date: Fri, 2 May 97 10:31:03 EDT
Subject: Re: webwatch-l Strange Numbers in Lynx
On Fri, 2 May 1997 05:29:28 -0700 (PDT), 
Kelly Ford   <address@hidden> wrote:

>If I understand you correctly, the characters I'm asking about such as
>0146 won't change no matter what character set I choose.  The MSNBC site
>at http://www.msnbc.com is full of these characters.  Is this something we
>should ask Microsoft to correct or is an improvement in Lynx necessary?
>

I suspect we won't be able to get Microsoft to change these.  I asked 
your question over on lynx-dev and didn't get much of a response. 
A perusal of the comp.text.sgml newsgroup turned up a large recent 
thread on this subject.  Indulge me with the following two newsgroup 
messages, inserted below; otherwise you can hit the Delete key now.  
Basically, if we are running on code page 1252 or have a suitable 
graphical browser, we will see these characters properly.  You will 
also see these characters in some kinds of ASCII saves from MS Word.  
There are a couple of web pages referenced for testing these character 
entities.  If the developers of Lynx can be convinced to support these 
"extensions" of ISO character sets, the problem could be fixed.


------ Forwarded message ends here ------


sholarp wrote:
> A coincidence.  Just tonight I have questioned the webmaster
> at the MSNBC web site by e-mail about the use of the encoded
> character &#0146 (absent the semicolon) throughout that site's
> web pages.
> 
> This character appears where an apostrophe, or right single
> quote, should appear.  For some reason, the author of MSNBC's
> text consistently uses a nonstandard encoding (with respect to
> both ISO 8859-1 and HTML v3.2 (?)) for this character.

You are entirely right, and they definitely should fix this quickly!

The characters 128-159 are not assigned to printable characters in
ISO 8859-1 or Unicode, the character sets of HTML. MS-Windows uses
a superset of ANSI/ISO 8859-1, known to experts as "Code Page 1252
(CP1252)", a Microsoft-specific character set with additional
characters in the 128-159 range (also known as the C1 range).

All the CP1252 characters are also available in Unicode.
For example, the CP1252 character 146 that you mentioned
(RIGHT SINGLE QUOTATION MARK) has the Unicode number 8217;
you should therefore use this number in order to conform to
the HTML standard.  Modern HTML browsers like Netscape 4.0
understand Unicode and will automatically convert the Unicode
character &#8217; back into the character 146 on MS-Windows
machines, and into the suitable character on other systems.
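
The 146 <-> 8217 correspondence described above can be checked with a
short Python sketch. It relies on Python's built-in "cp1252" codec;
the two helper names are invented for this illustration.

```python
def cp1252_to_code_point(byte_value: int) -> int:
    """Unicode code point for a single CP1252 byte value."""
    return ord(bytes([byte_value]).decode("cp1252"))

def code_point_to_cp1252(code_point: int) -> int:
    """CP1252 byte value for a Unicode code point that CP1252 contains."""
    return chr(code_point).encode("cp1252")[0]

print(cp1252_to_code_point(146))   # 8217
print(code_point_to_cp1252(8217))  # 146
```

Note that Python raises UnicodeDecodeError or UnicodeEncodeError for
values that have no counterpart in CP1252, so a real tool would need
to handle those cases.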

The official CP1252<->Unicode conversion table is printed in
the Unicode 2.0 standard and is available, for instance, at
<ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/>.

MS-Windows HTML authoring software definitely should implement the
conversion table below! Please forward this mail to the developers
of your HTML authoring tool if it currently gets this wrong.

The CP1252 characters that are not part of ANSI/ISO 8859-1 and
that should therefore always be encoded as Unicode characters >255
are the following:

0x82    0x201a  #SINGLE LOW-9 QUOTATION MARK
0x83    0x0192  #LATIN SMALL LETTER F WITH HOOK
0x84    0x201e  #DOUBLE LOW-9 QUOTATION MARK
0x85    0x2026  #HORIZONTAL ELLIPSIS
0x86    0x2020  #DAGGER
0x87    0x2021  #DOUBLE DAGGER
0x88    0x02c6  #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89    0x2030  #PER MILLE SIGN
0x8a    0x0160  #LATIN CAPITAL LETTER S WITH CARON
0x8b    0x2039  #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8c    0x0152  #LATIN CAPITAL LIGATURE OE
0x91    0x2018  #LEFT SINGLE QUOTATION MARK
0x92    0x2019  #RIGHT SINGLE QUOTATION MARK
0x93    0x201c  #LEFT DOUBLE QUOTATION MARK
0x94    0x201d  #RIGHT DOUBLE QUOTATION MARK
0x95    0x2022  #BULLET
0x96    0x2013  #EN DASH
0x97    0x2014  #EM DASH
0x98    0x02dc  #SMALL TILDE
0x99    0x2122  #TRADE MARK SIGN
0x9a    0x0161  #LATIN SMALL LETTER S WITH CARON
0x9b    0x203a  #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9c    0x0153  #LATIN SMALL LIGATURE OE
0x9f    0x0178  #LATIN CAPITAL LETTER Y WITH DIAERESIS
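
As a rough sketch of how an authoring tool or filter might apply the
table above, the following Python rewrites decimal numeric character
references in the 128-159 range to their Unicode equivalents.  The
names CP1252_TO_UNICODE and fix_references are hypothetical, and the
regular expression is a simplification: it handles only decimal
references and tolerates the missing semicolon seen on the MSNBC pages.

```python
import re

# CP1252 code -> Unicode code point, copied from the table above.
CP1252_TO_UNICODE = {
    0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030,
    0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C, 0x94: 0x201D,
    0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014, 0x98: 0x02DC,
    0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9F: 0x0178,
}

# A decimal numeric character reference; the semicolon is optional.
_NUMERIC_REF = re.compile(r"&#(\d+);?")

def fix_references(html: str) -> str:
    """Rewrite CP1252-range references to their Unicode equivalents."""
    def repl(match):
        target = CP1252_TO_UNICODE.get(int(match.group(1)))
        if target is None:
            return match.group(0)  # not in the table: leave untouched
        return "&#%d;" % target
    return _NUMERIC_REF.sub(repl, html)

print(fix_references("It&#0146s here"))  # It&#8217;s here
```

References outside the table, such as &#233;, pass through unchanged.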

Hope this helped ...

Markus

-- 
Markus Kuhn, Computer Science grad student, Purdue
University, Indiana, US, email: address@hidden



In article <address@hidden>,
Markus Kuhn  <address@hidden> wrote:

> The characters 128-159 are not assigned to printable characters in
> ISO 8859-1 or Unicode, the character sets of HTML.  MS-Windows uses
> a superset of ANSI/ISO 8859-1, known to experts as "Code Page 1252
> (CP1252)", a Microsoft-specific character set with additional
> characters in the 128-159 range.  All the CP1252 characters are also
> available in Unicode.  For example, the CP1252 character 146 that
> you mentioned (RIGHT SINGLE QUOTATION MARK) has the Unicode number
> 8217; you should therefore use this number in order to conform to
> the HTML standard.  Modern HTML browsers like Netscape 4.0
> understand Unicode and will automatically convert the Unicode
> character &#8217; back into the character 146 on MS-Windows
> machines, and into the suitable character on other systems.

> 0x82    0x201a  #SINGLE LOW-9 QUOTATION MARK
> 0x83    0x0192  #LATIN SMALL LETTER F WITH HOOK
> 0x84    0x201e  #DOUBLE LOW-9 QUOTATION MARK

etc.

Here's a translation of this table into more HTML-author friendly
terms (I've also added this table to the Web page at
http://uts.cc.utexas.edu/~churchh/latin1.html , where you can test
whether your browser understands these entities):


 Windows   Unicode
  char.   HTML code        Character Description
  -----     -----          ---------------------
ALT-0130   &#8218;     Single Low-9 Quotation Mark
ALT-0131   &#402;      Latin Small Letter F With Hook
ALT-0132   &#8222;     Double Low-9 Quotation Mark
ALT-0133   &#8230;     Horizontal Ellipsis
ALT-0134   &#8224;     Dagger
ALT-0135   &#8225;     Double Dagger
ALT-0136   &#710;      Modifier Letter Circumflex Accent
ALT-0137   &#8240;     Per Mille Sign
ALT-0138   &#352;      Latin Capital Letter S With Caron
ALT-0139   &#8249;     Single Left-Pointing Angle Quotation Mark
ALT-0140   &#338;      Latin Capital Ligature OE
ALT-0145   &#8216;     Left Single Quotation Mark
ALT-0146   &#8217;     Right Single Quotation Mark
ALT-0147   &#8220;     Left Double Quotation Mark
ALT-0148   &#8221;     Right Double Quotation Mark
ALT-0149   &#8226;     Bullet
ALT-0150   &#8211;     En Dash
ALT-0151   &#8212;     Em Dash
ALT-0152   &#732;      Small Tilde
ALT-0153   &#8482;     Trade Mark Sign
ALT-0154   &#353;      Latin Small Letter S With Caron
ALT-0155   &#8250;     Single Right-Pointing Angle Quotation Mark
ALT-0156   &#339;      Latin Small Ligature OE
ALT-0159   &#376;      Latin Capital Letter Y With Diaeresis
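
Rows in the layout above can be produced mechanically.  The sketch
below is only an illustration: author_row is a made-up name, it uses
Python's built-in "cp1252" codec and the unicodedata module, and it
prints the character names in the upper-case form Unicode defines
rather than the mixed case used in the table.

```python
import unicodedata

def author_row(cp1252_code: int) -> str:
    """Format one row: ALT code, decimal HTML reference, Unicode name."""
    ch = bytes([cp1252_code]).decode("cp1252")
    return "ALT-%04d   &#%d;   %s" % (
        cp1252_code, ord(ch), unicodedata.name(ch))

for code in (130, 146, 153):
    print(author_row(code))
```

The decimal value after "&#" is simply the Unicode code point from the
hexadecimal table written in base 10, for example 0x2019 = 8217.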

--
"You know they've reintroduced the death penalty for insurance company
directors?"  "Really?" said Arthur, "No, I didn't.  For what offense?"
Trillian frowned.  "What do you mean, offense?"  "I see."  --  _Mostly 
Harmless_  ||  Henry Churchyard  ||  http://uts.cc.utexas.edu/~churchh


-- Lloyd Rasmussen
Senior Staff Engineer, Engineering Section
National Library Service for the Blind and Physically Handicapped
Library of Congress          202-707-0535
(work)       address@hidden    www.loc.gov/nls/
(home) address@hidden
