Re: LYNX-DEV minor display problem (?character 0xA2?)
From: Klaus Weide
Subject: Re: LYNX-DEV minor display problem (?character 0xA2?)
Date: Fri, 2 May 1997 15:55:43 -0500 (CDT)
On Fri, 2 May 1997, Foteos Macrides wrote:
> Hynek Med <address@hidden> wrote:
> >On Fri, 2 May 1997, Klaus Weide wrote:
> >> On Fri, 2 May 1997, Bela Lubkin wrote:
> >> > An hour later: it's because character 0xA2 is eventually being
> >> > translated to 0x9b on output. The SCO ANSI console takes 0x9b as CSI,
> >> ^^^^^^^^^^^^^^^^
> >> The IBM PC character set (cp437) contains a visible character at that
> >> position. [...]
> >
> >Or perhaps we need another option [...]
> >:-)
>
> The bank of 8-bit control characters always is illegal for
> text/html. ALWAYS, ALWAYS, ALWAYS.
They (code points in the range 128-159) are illegal for
"text/html;charset=iso-8859-N" where N=1..10.
They are also illegal in the "document character set" in the SGML sense,
for all known HTML versions, and that is why numeric character references
like "&#153;" are illegal.
But they may or may not be legal in HTML documents AS TRANSMITTED OR
STORED if they use a different "charset", because that is supposed to
be an ENCODING of the real thing.
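
To make the distinction concrete, here is a minimal sketch using the byte 0x9B from earlier in the thread. It relies only on Python's standard codecs; the helper name is_c1_code_point is made up for this example and is not anything from Lynx.

```python
# Illustrative sketch only; is_c1_code_point is a hypothetical helper.

def is_c1_code_point(cp: int) -> bool:
    """True for code points 128-159, the 8-bit control range."""
    return 0x80 <= cp <= 0x9F

raw = b"\x9b"

# Read as ISO-8859-1, the byte maps straight to code point 0x9B, a control.
as_latin1 = raw.decode("iso-8859-1")

# Read as cp437, the same byte is an ENCODING of the cent sign, U+00A2.
as_cp437 = raw.decode("cp437")

print(hex(ord(as_latin1)), is_c1_code_point(ord(as_latin1)))
print(hex(ord(as_cp437)), is_c1_code_point(ord(as_cp437)))
```

The byte is the same in both cases; only the declared charset decides whether it stands for an illegal control or for a legal document character.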
> [...] Lynx still does a single
> pass through the stream, and thus uses a state machine to juggle the
> "de-encoding" and "de-encoded parsing" at the same time.
Still true for the chartrans code in the devel Lynx. That's why attribute
values, for example ALT= text, don't get treated the same way as normal
PCTEXT.
According to the HTML Internationalization RFC 2070, the reference
processing model is
[resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                       [manager]  [parser]
but "An actual implementation may choose, or not, to translate the
document into some encoding of the document character set as
described above; the behaviour described by this reference processing
model can be achieved otherwise."
Lynx doesn't have a separate "decoder" or "entity manager", so those
functions are either also handled in the "parser" in SGML.c or deferred
to later processing.
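
For contrast, a separated pipeline might look roughly like the sketch below. All names are hypothetical and this is not how SGML.c is organized; the second function only stands in loosely for the later stages of the reference model. It assumes that an illegal numeric reference is replaced with U+FFFD, which is one possible policy, not something the RFC or Lynx prescribes.

```python
import re

# Hypothetical names throughout; a two-stage sketch, not Lynx's design.
NUMERIC_REF = re.compile(r"&#([0-9]+);")

def decoder(raw: bytes, charset: str) -> str:
    """Stage 1: turn the transmitted bytes into document characters."""
    return raw.decode(charset)

def resolve_numeric_refs(text: str) -> str:
    """Later stage: expand &#N; references, which name document code points."""
    def expand(match):
        cp = int(match.group(1))
        if 0x80 <= cp <= 0x9F or cp > 0x10FFFF:
            # Illegal reference; assumed policy is to substitute U+FFFD.
            return "\ufffd"
        return chr(cp)
    return NUMERIC_REF.sub(expand, text)

def process(raw: bytes, charset: str) -> str:
    """Run the stages in order, one complete pass each."""
    return resolve_numeric_refs(decoder(raw, charset))

print(process(b"5\x9b &#162; &#153;", "cp437"))
```

Here the raw byte 0x9B and the reference &#162; both end up as the cent sign, while &#153; is rejected because it names a code point in the 128-159 range. A single-pass parser has to keep both kinds of state at once instead of running the stages one after another.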
> But for CJK
> charsets, those aren't 8-bit control characters. They're half of a
> multibyte pair.
The same for UTF-8 encoding of Unicode, where bytes in that range are
more or less guaranteed to appear.
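
A quick way to see this, as a sketch that uses only Python's standard codecs (the helper name bytes_in_c1_range is made up for the example):

```python
# Illustrative sketch; bytes_in_c1_range is a hypothetical helper.

def bytes_in_c1_range(data: bytes) -> list:
    """Return the bytes of data that fall in 0x80-0x9F (128-159)."""
    return [b for b in data if 0x80 <= b <= 0x9F]

# UTF-8 continuation bytes span 0x80-0xBF, so many land in 0x80-0x9F.
for ch in ("\u00a2", "\u20ac", "\u2122"):
    encoded = ch.encode("utf-8")
    hits = [hex(b) for b in bytes_in_c1_range(encoded)]
    print("U+%04X" % ord(ch), encoded.hex(), hits)

# Shift_JIS uses 0x81-0x9F as lead bytes of two-byte characters.
sjis = "\u3042".encode("shift_jis")
print(sjis.hex(), [hex(b) for b in bytes_in_c1_range(sjis)])
```

Not every character produces such a byte (U+00A2 encodes as C2 A2, which has none), but any realistic amount of non-ASCII text in UTF-8 or Shift_JIS will contain some.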
Klaus