lynx-dev Re: msg00798.html (was: 0x2276 handling)

From: Foteos Macrides
Subject: lynx-dev Re: msg00798.html (was: 0x2276 handling)
Date: Sat, 2 May 1998 14:01:55 -0400

"Leonid Pauzner" <address@hidden> wrote
>> you do for character reference handling, though, because i18n URLs are
>> not far from becoming commonplace.  The specs for them have for the
>> most part reached consensus in the IETF forums, and commercial
>> implementations have already been released.  As I've stressed, the
>> v2.7.2 code will handle those, but the v2.8 code will botch them.
>BTW, is there i18n URLs test online?

        I don't recall the URLs, nor do I have access now to the test page
I was using.  The IETF ID was updated in March, and there have been no
criticisms, so it's likely to move on as is to an RFC.  That's accessible
via the URI Working Group home page (which you can work your way to via
the Lynx online 'h'elp).

>According to the code (I check it just now)
>v2.8 seems to encode URL in utf8 (if change &lg= with &2276;)
>and then HTEscape(utf8buffer, URL_ALPHAS),
>but as we see from the bottom line
>it obviously fails the same way as for &lg= :-(
>Should we restrict the decoding of any entities with code>127 in URL?

        I'm not sure I understand your question.  The usual rules should
apply for converting raw > 7-bit, named, and numeric character references
to their Unicode values (but for now, not treating '=' as an implied
terminator for named character references), then those converted to utf-8,
and those utf-8 multibytes (not the Unicode values) hex escaped, for what
is actually sent in the request to a server.  The intent is to be able to
use named or numeric character references in the HTML markup, and have the
browser do the conversions, because few people could do the utf-8 and then
hex conversions in their heads when writing HTML (certainly not me :).
One could, of course, use the utf-8 + hex converted URLs as the attribute
values in the first place, and WYSIWYG HTML editors may do that (i.e., when
the user of the editor indicates non-ASCII characters in URLs which will be
handled as attribute values).  In the latter cases, everything already would
be in the ASCII range, and so the browser would use the attribute value as
is when processing the HTML document.
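
        A minimal sketch of that sequence in Python (not Lynx's C code):
resolve character references to Unicode, encode as utf-8, and hex escape
the resulting bytes.  The function name escape_i18n_url and the
example.org URLs are hypothetical, and the set of characters left
unescaped is an assumption chosen so that URL delimiters and existing
%XX escapes pass through untouched.

```python
import html
from urllib.parse import quote

# Characters passed through unescaped (an assumption for this sketch):
# URL delimiters, plus '%' so that already-escaped values are not
# escaped a second time.
SAFE = "/:?&=#%+;@,$!*'()~-._"

def escape_i18n_url(attr_value: str) -> str:
    """Hypothetical helper: character references -> Unicode -> utf-8 -> %XX."""
    # Step 1: named and numeric character references become Unicode.
    text = html.unescape(attr_value)
    # Steps 2 and 3: quote() encodes to utf-8 and hex escapes each byte
    # outside the safe set, so the utf-8 bytes are escaped, not the
    # Unicode code point itself.
    return quote(text, safe=SAFE, encoding="utf-8")

# U+2276 is E2 89 B6 in utf-8.
print(escape_i18n_url("http://example.org/find?q=&#x2276;"))
# A value that is already all ASCII is used as is.
print(escape_i18n_url("http://example.org/find?q=%E2%89%B6"))
```

Note that Python's html.unescape follows HTML5 text rules, which may
expand some legacy names written without a semicolon, so it is only a
stand-in for a browser's attribute-value parser.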

        That bug report from Poon reflects a lack of understanding about
the difference between the document charset and the Display Character Set.
What he describes Lynx as doing appears to be what it should be doing.
However, I tracked down a URL for the FAQ (would have been nice if he had
included it in the message):

and when I tried it with the W32 binary, it did what he thought it should
do, and is wrong.  The server is returning "Content-Type: text/plain"
without a charset parameter, so the assumed charset should apply, and
both the 'o'ptions page and the ShowInfo Page ('=') confirm that I have
it set as iso-8859-1.  Yet it looks as though CJK multibyte characters
in the iso-8859-1 control character range are being handled as DosLatinUS
characters for what is sent to the screen (does that binary use the
"work with Microsoft sins" assumption that some of those are Windows
characters, as the v2.7.2 code did?).  Also, when I set the assumed
charset to euc-jp or shift_jis (it's not clear which the FAQ is using),
I get different, but still 8-bit characters.
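
        To illustrate the expected behavior, here is a small Python sketch
(not Lynx code).  The helper name effective_charset is hypothetical, and
cp1252 and cp437 are used here as assumed stand-ins for the Windows and
DosLatinUS character sets.  It picks the charset from the Content-Type
header, falls back to the assumed charset when the parameter is absent,
and shows how one byte from the 0x80-0x9F range reads under each set.

```python
from email.message import Message

def effective_charset(content_type: str, assumed: str = "iso-8859-1") -> str:
    """Hypothetical helper: charset parameter if present, else the assumed charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when there is no charset parameter.
    return msg.get_content_charset() or assumed

# No charset parameter, so the assumed charset applies.
print(effective_charset("text/plain"))
# An explicit parameter wins over the assumed charset.
print(effective_charset("text/plain; charset=EUC-JP"))

# The same byte under three character sets: a C1 control character in
# iso-8859-1, but a printable letter in cp1252 and in cp437.
byte = b"\x8a"
views = {name: byte.decode(name) for name in ("iso-8859-1", "cp1252", "cp437")}
print(views)
```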

Foteos Macrides
