Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)


From: Leonid Pauzner
Subject: Re: lynx-dev URLs with raw 8-bit chars (was: lynx: have bug)
Date: Mon, 22 Mar 1999 17:24:36 +0300 (MSK)

21-Mar-99 12:38 Klaus Weide wrote:
> On Sun, 21 Mar 1999, Leonid Pauzner wrote:

>> One certain "problem" I personally run into is UTF-8 URL encoding:
>> when HREF= has *open 8-bit text*, the remote server (script)
>> may (1) expect such bytes %xx-encoded,
>> but lynx now (2) translates URLs from the document charset to UTF-8
>> and then sends each byte %xx-encoded (an obvious check:
>> the number of %xx-encoded bytes increases).

> But URLs should never *have* unencoded 8-bit chars - and lynx
Right.
> never generates such URLs as a result of form submission (I hope).
Right (we generate %xx encoded bytes (1), including local file names)

HTML4.0 on syntax of anchor names:
http://www.w3.org/TR/PR-html40/struct/links.html#h-12.2.1
says:

   Anchor names should be restricted to ASCII characters. Please consult
   the section on representing non-ASCII characters in URLs for more
   information.

and that section is under
http://www.w3.org/TR/PR-html40/appendix/notes.html#urls
(below)

So both (1) and (2) should be considered as a recovery from a broken document.
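
To make the difference between (1) and (2) concrete, here is a minimal
Python sketch (not Lynx code). It assumes a document sent in koi8-r and a
made-up HREF value; the function names are hypothetical and only
illustrate the two recovery strategies.

```python
from urllib.parse import quote

# Characters left unescaped so the URL structure survives (an assumption
# made for this sketch, not a rule taken from Lynx).
SAFE = "/:?=&#"

def escape_raw_bytes(href: str, doc_charset: str) -> str:
    """Interpretation (1): %xx-encode the bytes as they appear in the document."""
    return quote(href.encode(doc_charset), safe=SAFE)

def escape_via_utf8(href: str) -> str:
    """Interpretation (2): transcode to UTF-8 first, then %xx-encode each byte."""
    return quote(href.encode("utf-8"), safe=SAFE)

# Hypothetical link with two Cyrillic letters in the query string.
href = "/cgi-bin/find?q=\u0434\u0430"

print(escape_raw_bytes(href, "koi8-r"))  # two escaped bytes
print(escape_via_utf8(href))             # four escaped bytes
```

The second result contains more %xx escapes than the first, which is the
"obvious check" mentioned above. A script expecting (1) will not recognize
the query produced by (2).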

We usually bypass the problem when Lynx processes both a broken #fragment link
and a broken NAME= target (they get resolved in a consistent way),
but the problem occurs when we deal with one end only
(say, a link to a CGI script).


22-Mar-99 12:42 I wrote:
> 21-Mar-99 20:37 Klaus Weide wrote:

>> This means that the user can usually toggle between the two interpretations
>> with -raw / '@'.   It's not completely logical that the interpretation
>> of URLs should depend on this.  OTOH there's the ease of switching, and
>> it's more likely that encoding the raw value is the right thing (or even
>> possible) when the user's environment is consistent with the server's.

> It is completely wrong to overload -raw mode here (asking the user
> to make the document unreadable in order to follow a link);
> it could be switchable like "dsoft-quotes" instead.

Now I think we may overload "dsoft-quotes" to toggle between the
two interpretations: the original purpose of this key is to work around
a bug in HTML anchors which is very close to the problem discussed.
(One should decide which "interpretation" is "standard"
and which is a workaround.)

I haven't come up with a patch yet, but here are some references FYI:
HTML4.0, Lynx/2.7.2 CHANGES and Lynx/2.8 CHANGES.


***** HTML 4.0

   The following notes are informative, not normative.

B.1 Representing non-ASCII characters in URLs

   We recommend the following convention for representing non-ASCII
   characters in URLs: each character is represented in UTF-8 (see
   [RFC2044]) as one or more bytes and these bytes are then escaped with
   the URL escaping mechanism (converting each byte to %HH, where HH is
   the hexadecimal notation of the byte value).

   This procedure results in the same syntactically legal URL according
   to [RFC1738] or [RFC2141] and independent of the character encoding to
   which the HTML document carrying the URL may have been transcoded.

   Note. The procedure above doesn't guarantee that UTF-8 can be used in
   all schemes or on all resources of a scheme. The producer of a URL
   (usually the HTML author) is responsible for ensuring that this works
   for the URL in question, or using another notation (with %HH escapes
   not corresponding to UTF-8 if necessary) to address the resource in
   question.

   Note. Some older user agents trivially process URLs in HTML using the
   bytes of the character encoding in which the document was received.
   Some older HTML documents rely on this (illegal) practice and break
   when transcoded. User agents that want to handle these older documents
   should, on receiving a URL containing characters outside the legal
   set, first use the conversion based on UTF-8. Only if the resulting
   URL does not resolve should they try constructing a URL based on the
   bytes of the character encoding in which the document was received.
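
The fallback order described in the note above (UTF-8 first, document
charset bytes second) could be sketched as follows. This is only an
illustration in Python: the `resolves` callback is a hypothetical stand-in
for whatever actually fetches the URL, and the helper names are not taken
from any user agent.

```python
from typing import Callable, List, Optional
from urllib.parse import quote

# "%" is kept safe so escapes already present in the HREF are not doubled
# (an assumption of this sketch).
SAFE = "/:?=&#%"

def candidates(href: str, doc_charset: str) -> List[str]:
    """Candidate URLs in the order the note recommends trying them."""
    utf8_url = quote(href.encode("utf-8"), safe=SAFE)
    legacy_url = quote(href.encode(doc_charset), safe=SAFE)
    # Pure-ASCII links yield the same URL twice; keep only one copy.
    return [utf8_url] if utf8_url == legacy_url else [utf8_url, legacy_url]

def resolve(href: str, doc_charset: str,
            resolves: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate accepted by `resolves`, or None."""
    for url in candidates(href, doc_charset):
        if resolves(url):
            return url
    return None
```

With a server that only understands the legacy bytes, `resolve()` would
first try the UTF-8 form, see it fail, and then return the
document-charset form.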

   Note. The same conversion based on UTF-8 should be applied to anchor
   names as appearing in the name attribute of the A element.

   Note. The URL that is constructed when a form is submitted may be used
   as an anchor-style link (e.g., the href attribute for the A element).
   Unfortunately, the use of the "&" character to separate form fields
   interacts with its use in SGML attribute values to delimit character
   entity references. For example, to use the URL "http://host/?x=1&y=2"
   as a linking URL, it must be written <A
   href="http://host/?x=1&#38;y=2"> or <A
   href="http://host/?x=1&amp;y=2">. HTTP server implementors, and in
   particular, CGI implementors are encouraged to support the use of ";"
   in place of "&" to save authors the trouble of escaping "&" characters
   in this manner.
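
As a small illustration of the "&" note above, the following Python
sketch escapes the example URL for use in an attribute value and shows why
a ";" separator needs no escaping. The `split_semicolon_query` helper is
hypothetical; it only stands in for a CGI script that accepts ";" between
fields.

```python
from html import escape, unescape

url = "http://host/?x=1&y=2"

# "&" must be written as an entity inside an SGML/HTML attribute value.
attr = escape(url, quote=True)
print(f'<A href="{attr}">')

# A parser reading the attribute recovers the original URL.
assert unescape(attr) == url

# With ";" as the field separator, nothing needs escaping.
url_semicolon = "http://host/?x=1;y=2"
assert escape(url_semicolon, quote=True) == url_semicolon

def split_semicolon_query(query: str) -> dict:
    """Hypothetical server-side parsing of ';'-separated form fields."""
    return dict(pair.split("=", 1) for pair in query.split(";") if pair)

print(split_semicolon_query("x=1;y=2"))
```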

***** Lynx/2.8
1997-09-27
...
* Non-ASCII characters in URLs and similar strings encountered in the HTML.c
  processing (previously handled by LYUnEscapeToLatinOne) are now generally
  URL-encoded, instead of doing this just for 8-bit characters which are
  the result of entity expansion.  There is no clear standard definition what
  non-ASCII characters in URLs in HTML attributes (HREF etc.) actually mean,
  especially if the transmission character encoding is something else than
  iso-8859-1.  Leaving them as the raw byte values as received runs against
  the HTML i18n view that the transmission encoding is distinct from the
  document character set and has to be (conceptually at least) decoded before
  SGML parsing.  It also won't work in general for entities that expand
  to Unicode characters which cannot be expressed at all in the currently
  effective (or assumed) charset, and would lead to problems with displaying
  URLs on the statusline or representing them in auxiliary screens or bookmark
  files.  So now we try to first transform to the document charset "as usual"
  (undo the transmission encoding), then translate the Unicode value into a
  sequence of (one or more) byte values which are then URL-encoded.  Since
  character values > 255 cannot be expressed in a byte, always use UTF-8
  for them.  It may not be what the author intended, but should be at least
  consistent between internal (fragment) HREFs and NAME (or ID) attributes
  in the same document or set of documents.  Since this is dealing with
  bytes currently disallowed in URLs, it falls under error recovery.  But
  the handling should be roughly in line with current Internet Drafts
  (draft-masinter-url-i18n-00.txt, draft-duerst-query-i18n-00.txt,
  draft-ietf-ftpext-intl-ftp-02.txt).
  For character values < 256 (but > 127) this isn't currently consistently
  done; we may still be URL-escaping the byte value without UTF-8 encoding.
  - KW
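
The rule in the CHANGES entry above can be sketched roughly as follows.
This is Python written for illustration, not the Lynx source: the function
name and the `utf8_below_256` switch are hypothetical, and the switch only
models the inconsistency KW mentions for values between 128 and 255.

```python
def escape_non_ascii(text: str, utf8_below_256: bool = True) -> str:
    """URL-escape non-ASCII characters of an already decoded string.

    Characters above 255 always go through UTF-8. For 128..255 the
    caller chooses between UTF-8 and escaping the single byte value.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                    # ASCII is left untouched
        elif cp < 256 and not utf8_below_256:
            out.append(f"%{cp:02X}")          # single byte, no UTF-8 step
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)

print(escape_non_ascii("#caf\u00e9"))                        # UTF-8 path
print(escape_non_ascii("#caf\u00e9", utf8_below_256=False))  # byte path
print(escape_non_ascii("#\u0434\u0430"))                     # always UTF-8
```

Because a fragment HREF and its NAME target would pass through the same
function, both ends come out identical whichever path is taken, which is
the consistency the entry is after.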

***** Lynx/2.7.2
1997-10-06
...
* Made LYExpandString(), LYUnEscapeEntities() and LYUnEscapeToLatinOne()
  simpler, added better comments, and modified LYUnEscapeToLatinOne() so
  that it uses hex escaped UTF-8 multibytes for characters outside the
  ASCII range (may need mods when standards for internationalization of
  URLs and MIME headers are finalized).  These functions still expect
  strings in the charset of the input stream, with only invalid control
  characters removed, and still parallel the conversions done in SGML.c
  and HTPlain.c, within the context of the HTML parser's (Utterly Tag and
  Attribute Soup :) settings and the display character set options.  They
  do not URL encode any ASCII characters, except for ESC in CJK escape
  sequences when the flag to do that is set, to avoid possible double
  encoding. - FM
