RE: lynx-dev Conversion of special character codes within anchor tags


From: Greg Marr
Subject: RE: lynx-dev Conversion of special character codes within anchor tags
Date: Fri, 25 Sep 1998 08:56:21 -0400

At 10:25 PM 9/24/98 +0100, you wrote:
>> That is not the only reason they exist.
>
>It doesn't need to be the only reason, it just needs to be a valid reason.
>The point remains the same. In what situation will you include these
>characters in a URL?

Any time you want those characters in the URL.

The big problem here seems to be that you are discussing using HTML character
entities in a URL, while everyone else is discussing using HTML character
entities in an attribute value.  HTML character entities are processed in
attribute values, or else you couldn't do this:

<IMG SRC="rightarrow.gif" ALT="--&gt;">
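
To illustrate the point, here is a minimal sketch using Python's standard
html.parser module (my choice for illustration; it is not what Lynx does
internally). The AttrCollector class and parse_attrs helper are hypothetical
names. The sketch shows that a conforming parser hands back the attribute
value with the entity already decoded:

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects the attributes of every start tag it sees (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names and decodes character
        # references in attribute values before calling this method.
        self.attrs.update(dict(attrs))


def parse_attrs(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.attrs


attrs = parse_attrs('<IMG SRC="rightarrow.gif" ALT="--&gt;">')
print(attrs["alt"])  # the decoded value, not the raw "--&gt;"
```

The same decoding step is what turns &amp; in an href value back into a plain
ampersand before the URL is ever used.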

<http://www.w3.org/TR/REC-html40/appendix/notes.html#non-ascii-chars>

B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors
sometimes specify them in attribute values expecting URIs (i.e., defined with
%URI; in the DTD). For instance, the following href value is illegal: 

<A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for handling
non-ASCII characters in such cases: 

1) Represent each character in UTF-8 (see [RFC2044]) as one or more bytes. 

2) Escape these bytes with the URI escaping mechanism (i.e., by converting each
byte to %HH, where HH is the hexadecimal notation of the byte value). 

This procedure results in a syntactically legal URI (as defined in [RFC1738],
section 2.2 or [RFC2141], section 2) that is independent of the character
encoding to which the HTML document carrying the URI may have been
transcoded. 
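
(Not part of the spec text.) The two steps above can be sketched in Python
roughly as follows. The function name escape_non_ascii is hypothetical, and
the sketch assumes ASCII characters are left exactly as written and only
non-ASCII ones are touched:

```python
def escape_non_ascii(uri: str) -> str:
    """Apply the B.2.1 convention: UTF-8 encode, then %HH-escape each byte."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            # ASCII characters are passed through unchanged in this sketch.
            out.append(ch)
        else:
            # Step 1: represent the character as UTF-8 bytes.
            # Step 2: escape every byte as %HH.
            out.append("".join(f"%{byte:02X}" for byte in ch.encode("utf-8")))
    return "".join(out)


print(escape_non_ascii("http://foo.org/Håkon"))
# prints http://foo.org/H%C3%A5kon
```

The letter å is U+00E5, which is the two bytes C3 A5 in UTF-8, hence the two
escapes in the result.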

Note. Some older user agents trivially process URIs in HTML using the bytes of
the character encoding in which the document was received. Some older HTML
documents rely on this practice and break when transcoded. User agents that
want to handle these older documents should, on receiving a URI containing
characters outside the legal set, first use the conversion based on UTF-8. Only
if the resulting URI does not resolve should they try constructing a URI based
on the bytes of the character encoding in which the document was received. 
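
(Not part of the spec text.) A rough Python sketch of the fallback order this
note describes: the UTF-8 form first, then the form built from the document's
own encoding. The names escape_with and candidate_uris are hypothetical, and
actually resolving each candidate is left to the caller:

```python
def escape_with(uri: str, encoding: str) -> str:
    """%HH-escape every non-ASCII character using the given byte encoding."""
    return "".join(
        ch if ord(ch) < 128
        else "".join(f"%{byte:02X}" for byte in ch.encode(encoding))
        for ch in uri
    )


def candidate_uris(uri: str, document_encoding: str) -> list:
    """Return the URIs to try, in the order the note recommends."""
    candidates = [escape_with(uri, "utf-8")]
    try:
        legacy = escape_with(uri, document_encoding)
    except UnicodeEncodeError:
        # The character has no representation in the document encoding,
        # so there is no legacy form to fall back to.
        return candidates
    if legacy not in candidates:
        candidates.append(legacy)
    return candidates


print(candidate_uris("http://foo.org/Håkon", "iso-8859-1"))
```

For an ISO-8859-1 document the second candidate carries the single byte E5
for å instead of the two UTF-8 bytes.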

Note. The same conversion based on UTF-8 should be applied to values of the
name attribute for the A element. 

B.2.2 Ampersands in URI attribute values

The URI that is constructed when a form is submitted may be used as an
anchor-style link (e.g., the href attribute for the A element). Unfortunately,
the use of the "&" character to separate form fields interacts with its use in
SGML attribute values to delimit character entity references. For example, to
use the URI "http://host/?x=1&y=2" as a linking URI, it must be written <A
href="http://host/?x=1&#38;y=2"> or <A href="http://host/?x=1&amp;y=2">. 
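
(Not part of the spec text.) As a small sketch of the round trip, using
Python's standard html module: escape the URI when writing it into the
attribute, and a conforming reader gets the original URI back when it decodes
either spelling. The helper name href_tag is hypothetical:

```python
from html import escape, unescape


def href_tag(uri: str) -> str:
    """Write a URI into an href attribute with SGML-special characters escaped."""
    return f'<A href="{escape(uri, quote=True)}">'


print(href_tag("http://host/?x=1&y=2"))

# Both spellings from the spec decode back to the same URI.
print(unescape("http://host/?x=1&#38;y=2"))
print(unescape("http://host/?x=1&amp;y=2"))
```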

We recommend that HTTP server implementors, and in particular, CGI implementors
support the use of ";" in place of "&" to save authors the trouble of escaping
"&" characters in this manner. 
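
(Not part of the spec text.) On the server side, supporting this could look
something like the sketch below, which splits the query string on either
delimiter. The function name parse_query is hypothetical, and this is not
taken from any particular CGI library:

```python
import re
from urllib.parse import unquote_plus


def parse_query(query: str) -> list:
    """Split a query string into (name, value) pairs on either "&" or ";"."""
    pairs = []
    for field in re.split(r"[&;]", query):
        if not field:
            continue  # skip empty fields such as a trailing delimiter
        name, _, value = field.partition("=")
        # Undo %HH escapes and "+" for space in both name and value.
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


print(parse_query("x=1;y=2"))
print(parse_query("x=1&y=2"))
```

With that in place an author can write href="http://host/?x=1;y=2" and never
needs to escape anything.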

>> The semicolon is also used as a delimiter, just for this reason. I don't
>> remember the RFC, but this has been discussed here before, and someone
>> found the RFC that mentions this.  You could search the archives to find it.
>
>Would that be the semicolon as in "other possible delimiters (such as + or
>;)"? This would be a useful point if it were not for the fact that every
>browser I have ever come across uses ampersand as its delimiter. Actually,
>we may both be wrong on this point. RFC 1738 (which is the one I think you
>want) states in section 3.3:
>
>   "Within the <path> and <searchpart> components, "/", ";", "?" are
>   reserved."
>
>So ";" is not available as a delimiter.

See the quote from HTML 4.0 (section B.2.2) above.

>> The proper solution is to escape the & as &amp;.  A conforming
>> HTML processor will change this to & before it uses the URL.
>
>> Which browsers do not handle this properly?  They should be fixed.
>
>Fair enough, all the ones I can lay my hands on handle this correctly. But
>then again, only lynx 2.8 and IE3.02 cannot handle the use of an unescaped
>ampersand. Navigator 3 and 4, IE4, Opera 3.21, Quarterdeck Mosaic 2.02, lynx
>2.6 are all fine with this.
>
>> The situation is clear enough already, and doesn't need clarification. Any
>> literal ampersands in an HTML file need to be escaped as &amp; to eliminate
>> the possibility that they could be starting a character reference.
>> It is not a retrograde step that introduces an incompatibility, it is adhering
>> to the proper behaviour as described by the standards.
>
>Perhaps you could point me to the standard where this has been clarified?

See above.

>I can give you plenty of examples where this behaviour causes problems. At
>the very least, it introduces unnecessary processing requirements. I guess
>I'll have to do as you say and escape all my ampersands, but can you give me
>one example of when it would be useful and valid to include non-ASCII
>characters in a URL?

We're not including non-ASCII characters in a URL here, we're including SGML
reserved characters in an attribute value by replacing them with character
entity references.

--
Greg Marr
address@hidden
"We thought you were dead." 
"I was, but I'm better now." - Sheridan, "The Summoning"
