RE: lynx-dev Conversion of special character codes within anchor tags


From: Bruno Prior
Subject: RE: lynx-dev Conversion of special character codes within anchor tags
Date: Thu, 24 Sep 1998 22:25:06 +0100

> That is not the only reason they exist.

It doesn't need to be the only reason, it just needs to be a valid reason.
The point remains the same. In what situation will you include these
characters in a URL?

> Take the cases of & < > ".
> These entities exist because the bare characters have significance
> in HTML: the start of an entity reference, the start of a tag, the
> end of a tag, and an attribute value delimiter.

Fair enough. Let's take the case of these characters. Suppose I substitute
the field name "gt" instead of "curren" in my example. Now we have a tag
which reads <A HREF="http://www.some.site/sample.cgi?para=1&gt=GBP">. By
your logic, this would actually be interpreted as <A
HREF="http://www.some.site/sample.cgi?para=1>=GBP">. Do you want to have a
guess at how many browsers would get confused by the premature close to the
tag? Why is this useful? Why is this sensible?
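To make the hazard concrete, here is a minimal sketch using Python's standard html.unescape as a stand-in for an entity-decoding step. That is an assumption for illustration only: it applies lenient rules that accept a few legacy names such as "gt" without a trailing semicolon, and it is not a model of what any particular browser does with attribute values.

```python
from html import unescape

# The HREF value from the example above, exactly as written in the tag.
href = "http://www.some.site/sample.cgi?para=1&gt=GBP"

# A decoder that expands "&gt" even without a trailing semicolon
# rewrites the query string and loses the "gt" field name.
decoded = unescape(href)
print(decoded)  # http://www.some.site/sample.cgi?para=1>=GBP
```

The "gt" parameter never reaches the server under that decoding, which is the confusion described above.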

> URL escaping and HTML
> character entities are totally different animals.

I am well aware of that. Do you want to point me to the place in my original
message where I mentioned URL encoding? You make my point for me. If you
want to include any of these characters which you are so keen to allow in
your URLs, you have to encode them. That is exactly why it is not useful
(and in some situations positively damaging) to be able to include them in a
URL without encoding. If the URL is generated by your browser when a form is
submitted, any characters like this are automatically encoded. If the URL is
simply given in an A HREF tag, the browser sends it as is. It is your job to
encode the items, not the browser's. Therefore, you should _not_ be
including special characters directly within the URL. Of course, there is a
different set of characters which require URL encoding as opposed to
character entitizing (if that is a real word), but as it happens, the set of
characters which require URL encoding is a superset of the set of characters
which require entitizing. In other words, there is not a single character
for which it is useful to translate the character entity into the actual
character in a URL, because every single one of them (as well as several
others) needs URL encoding. The only characters which don't need URL
encoding are the alphanumeric ones (A-Z, a-z, 0-9) plus the characters
$-_.+!*'(),. Which one of these do you intend to send to the browser in the
shape of its character entity? Come to think of it, which one of these has a
character entity (unless you use the &# form)?
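A rough check of that claim, sketched with Python's urllib.parse.quote. The safe set passed in is the RFC 1738 list quoted above; treating it as the complete set of unencoded punctuation is an assumption taken from that RFC, and later URL specifications draw the line somewhat differently.

```python
from urllib.parse import quote

# Assumption: RFC 1738's list of punctuation that may appear unencoded.
RFC1738_SAFE = "$-_.+!*'(),"

# The four characters that have significance in HTML markup.
markup_chars = '&<>"'

# Every one of them is percent-encoded when treated as data in a URL.
encoded_markup = quote(markup_chars, safe=RFC1738_SAFE)
print(encoded_markup)  # %26%3C%3E%22

# Alphanumerics and the RFC 1738 punctuation pass through unchanged.
sample = "AZaz09" + RFC1738_SAFE
untouched = quote(sample, safe=RFC1738_SAFE)
print(untouched == sample)  # True
```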

> The semicolon is also used as a delimiter, just for this reason. I don't
> remember the RFC, but this has been discussed here before, and someone
> found the RFC that mentions this.  You could search the archives to find
> it.

Would that be the semicolon as in "other possible delimiters (such as + or
;)"? This would be a useful point if it were not for the fact that every
browser I have ever come across uses ampersand as its delimiter. Actually,
we may both be wrong on this point. RFC 1738 (which is the one I think you
want) states in section 3.3:

   "Within the <path> and <searchpart> components, "/", ";", "?" are
   reserved."

So ";" is not available as a delimiter.

> The proper solution is to escape the & as &amp;.  A conforming
> HTML processor will change this to & before it uses the URL.

> Which browsers do not handle this properly?  They should be fixed.

Fair enough, all the ones I can lay my hands on handle this correctly. But
then again, only lynx 2.8 and IE3.02 cannot handle the use of an unescaped
ampersand. Navigator 3 and 4, IE4, Opera 3.21, Quarterdeck Mosaic 2.02, lynx
2.6 are all fine with this.
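For reference, the escaped form can be sketched as a round trip in Python. The HrefCollector class is a hypothetical helper written for this illustration, and the standard-library HTMLParser stands in for "a conforming HTML processor", which is an assumption rather than a statement about any of the browsers listed.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlencode


class HrefCollector(HTMLParser):
    """Hypothetical helper: records the decoded HREF of every A tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href")


# Build the query string first, then escape the whole URL for HTML.
query = urlencode({"para": "1", "gt": "GBP"})
url = "http://www.some.site/sample.cgi?" + query
anchor = '<A HREF="%s">link</A>' % escape(url, quote=True)
print(anchor)  # <A HREF="http://www.some.site/sample.cgi?para=1&amp;gt=GBP">link</A>

# Parsing the tag turns &amp; back into & and recovers the original URL.
collector = HrefCollector()
collector.feed(anchor)
collector.close()
print(collector.hrefs[0] == url)  # True
```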

> The situation is clear enough already, and doesn't need clarification.
> Any literal ampersands in an HTML file need to be escaped as &amp; to
> eliminate the possibility that they could be starting a character
> reference.

> It is not a retrograde step that introduces an incompatibility, it is
> adhering to the proper behaviour as described by the standards.

Perhaps you could point me to the standard where this has been clarified?

I can give you plenty of examples where this behaviour causes problems. At
the very least, it introduces unnecessary processing requirements. I guess
I'll have to do as you say and escape all my ampersands, but can you give me
one example of when it would be useful and valid to include non-ASCII
characters in a URL?

Cheers,


Bruno Prior         address@hidden
