lynx-dev Re: msg00798.html (was: 0x2276 handling)

From: Foteos Macrides
Subject: lynx-dev Re: msg00798.html (was: 0x2276 handling)
Date: Tue, 5 May 1998 16:18:31 -0400

"Leonid Pauzner" <address@hidden> wrote:
>OK, besides native utf-8 + hex HREF=  may use any HTML coding scheme
>"for internal use only": client software will convert...
>What the user should see on the bottom line in advanced mode?
>- The converted result as it will be submitted?

        In v2.7.2 I had it showing the utf-8 + hex encoded URL, i.e.,
what would actually be sent to the server, in the statusline and
ShowInfo ('=') display.  That was controversial in the IETF discussions,
and left as a lot of waffling in the form of suggestions to implementors,
without any requirements, so that consensus might be reached between the
idealists and the pragmatists.  The idealists would like browsers to
display the URLs in the user's language (the "DCS", for Lynx) so that they
will see it homologously to an English-speaking user viewing a non-i18n
URL.  They felt what actually went out over the wire to the server is
irrelevant to the user, and showing the utf-8 + hex encoded URL is
defeating the intent of i18n.  The pragmatists felt that this was
unrealistic unless the browser forces use of utf-8 as the DCS because
otherwise you may not have the intended characters in the DCS, such that
7-bit approximations or other substitutions would need to be made, and
the user might then record and enter (cut and paste, etc.) that distorted
URL in other contexts, causing failures of retrieval (but what v2.8 is
doing is wrong from both the idealists' and pragmatists' perspectives :).

>> is actually sent in the request to a server.  The intent is to be able to
>> use named or numeric character references in the HTML markup, and have the
>> browser do the conversions, because few people could do the utf-8 and then
>> hex conversions in their heads when writing HTML (certainly not me :).
>HTML 4.0 recommend not only href= encoding as %hexhex(utf-8)
>but even the associated text in <a> attribute  8-)

        No.  You perhaps are misinterpreting its recommendation that utf-8 +
hex encoding be used for NAME attribute values (those typically become
fragments in URL References, and are not part of the actual URL, i.e.,
are not sent to the server, but acted upon by the browser after receiving
the server's reply).  That recommendation contradicts the HTML 4.0 DTD,
which does not allow a '%' in such attribute values.  Anyway, v2.7.2 did
not pay attention to any specs, and handled NAME and ID attribute values
so that they will actually work with the garbage HTML created by FrontPage
and the Netscape "HTML editors".  Klaus had included that code in the devel
code set, and I assume it's still there in v2.8, because we haven't again
started getting "bug reports" about fragments not working "properly" in
v2.8. :)
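
        The point that a fragment is not part of what is sent to the server
can be illustrated with a short sketch, again using Python's standard
urllib.parse rather than anything from Lynx (the URL and the function name
are made up for the example):

```python
from urllib.parse import urldefrag


def split_request(url_reference: str) -> tuple[str, str]:
    """Split a URL reference into (what is requested, what the browser keeps)."""
    # urldefrag() separates the fragment; only the first part is requested.
    result = urldefrag(url_reference)
    return result.url, result.fragment


if __name__ == "__main__":
    sent, kept = split_request("http://example.com/doc.html#sec2")
    print(sent)  # http://example.com/doc.html
    print(kept)  # sec2
```

The second value is matched by the browser against NAME and ID attribute
values only after the reply has arrived.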

>This may be useful for bookmarking, though.

        Yes, but if you do that, you'd be wise to make it optional somehow,
because the discussion about it on lynx-dev was like the idealists versus
pragmatists debates in the IETF working groups (probably also in the W3C
working groups, but you have to be a paid member for those, so I don't
know :).

        Note that Poon's latest suggestion about an ASSUMED_DCS is the
RAW mode toggle, which has been in Lynx for several releases.

>> and when I tried it with the W32 binary, it did what he thought it should
>> do, and is wrong.  The server is returning "Content-Type: text/plain"
>> without a charset parameter, so the assumed charset should apply, and
>> both the 'o'ptions page and the ShowInfo Page ('=') confirm that I have
>> it set as iso-8859-1.  Yet it looks as though CJK multibyte characters
>> in the iso-8859-1 control character range are being handled as DosLatinUS
>> characters for what is sent to the screen (does that binary use the
>> "work with MicoSoft sins" assumption that some of those are Windows
>Currently it displays &#nnn from x80-x9F range as WINDOWS-1252 codepoints
>(inflicted by FrontPage), but not display it in ALT= :-(

        Klaus had copied an old version of that code into the devel code set,
but had it commented out with NOTUSED_FOTEMODS.  Apparently TD used it in
SGML.c but not LYCharUtils.c for the v2.8 release.

>>If the document is or assumed as iso-8859-1,
>>control characters (x80-x9F) ignored sighlently if happend.
>>you mean they should be assumed as windows-1252 ?

        Those values are UNUSED in the HTML 4.0 DTD.  In real SGML and
XML, declarations can be added to use them in explicit ways (because
they're not already defined in the Document Character Set), but HTML
doesn't support such SGML declarations, and so the browser is free to
apply any error recovery it likes.  I like the error recovery of making
them do what users of FrontPage and the Netscape "HTML editors" intended
when they were creating the HTML with that software, and still have no
other way to do with those poor implementations.  But it's not an issue
of "right" or "wrong" as far as SGML is concerned (it's "undefined" :). 
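
        A minimal sketch of that kind of error recovery, written in Python
for illustration and not taken from the Lynx sources (the function name is
hypothetical): numeric character references in the 0x80-0x9F range are
treated as windows-1252 code points, and everything else is taken at face
value.

```python
def recover_c1_reference(codepoint: int) -> str:
    """Map a numeric character reference to the character to display."""
    if 0x80 <= codepoint <= 0x9F:
        try:
            # Reinterpret the value as a windows-1252 byte, as the
            # authoring tool most likely intended.
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            # A few values in this range are unassigned in windows-1252.
            return "\ufffd"
    return chr(codepoint)


if __name__ == "__main__":
    # &#147; and &#148; are commonly emitted for curly double quotes.
    print(recover_c1_reference(147) + "quoted" + recover_c1_reference(148))
```

Whether unassigned values should be dropped, shown as a replacement
character, or passed through is a further choice; the sketch picks the
replacement character only as an assumption.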

>> characters, as the v2.7.2 code did?).  Also, when I set the assumed
>> charset to euc-jp or shift_jis (it's not clear which the FAQ is using),
>> I get different, but still 8-bit characters.
>>                                 Fote
>The first paragraph of the above FAQ says the text in shift_jis.
>I have no idea how kanji should look like, but I got 8-bit characters
>on my cyrillic display (cp866) which obviously wrong.
>Than I choose "7 bit approx" display and got a translation of
>I don't know what. Let someone from Japanese describe how it should be.

        The CJK support does not use the Unicode-based chartrans procedures
(i.e., when the charset is CJK, 7-bit approximations are not used), and
you're right to worry about breaking the CJK stuff when trying to fix up
v2.8's chartrans stuff (it's easy to break it under such circumstances).
I'm reasonably confident that both the CJK and Unicode-based chartrans
stuff were working properly in v2.7.2, but I'm reluctant to try to give
you guidance about that without myself walking through it with a debugger
to be sure I remember it correctly, and I still don't have a programming
environment set up here.  I recently downloaded the CJK support for IE,
to see what that's like, and the download was 9.6 MBytes.  You'd be
talking about tables much bigger than the browser itself if you tried to
use Unicode-based CJK support in Lynx as well.  Sigh.

        Note that I'm not subscribed to lynx-dev, but sending this reply
because I got involved in this particular thread when I was subscribed.
I assume Bob is back and will pass it on to lynx-dev.

        Since I *am* posting a message, I also have a question.  What do
people use for PostScript (".ps") files on Windows 95 boxes, and where
can I get it?

Foteos Macrides
