Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun <address@hidden> wrote:

As far as I know, revised RFC permits UTF-8 characters in the URL without encoding. Am I wrong here?

The latest URI RFC is 3986. The relevant description in prose is:

   Local names, such as file system names, are stored with a local
   character encoding.  URI producing applications (e.g., origin
   servers) will typically use the local encoding as the basis for
   producing meaningful names.  The URI producer will transform the
   local encoding to one that is suitable for a public interface and
   then transform the public interface encoding into the restricted set
   of URI characters (reserved, unreserved, and percent-encodings).
   Those characters are, in turn, encoded as octets to be used as a
   reference within a data format (e.g., a document charset), and such
   data formats are often subsequently encoded for transmission over
   Internet protocols.

The relevant parts of the BNF are:

   pct-encoded   = "%" HEXDIG HEXDIG

   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
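To make the grammar concrete, here is a minimal sketch in Python (not part of uri-common; the names SEGMENT_RE and is_valid_segment are made up for illustration) that transcribes the pchar rule into a regular expression and checks a single path segment against it. It assumes ALPHA, DIGIT and HEXDIG have their usual ASCII-only meanings.

```python
import re

# Character sets transcribed from the RFC 3986 rules quoted above.
UNRESERVED = r"A-Za-z0-9\-._~"      # ALPHA / DIGIT / "-" / "." / "_" / "~"
SUB_DELIMS = r"!$&'()*+,;="         # sub-delims
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"    # "%" HEXDIG HEXDIG

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# segment = *pchar
SEGMENT_RE = re.compile(rf"{PCHAR}*")


def is_valid_segment(segment: str) -> bool:
    """Return True if the string matches the 'segment' rule (hypothetical helper)."""
    return SEGMENT_RE.fullmatch(segment) is not None


print(is_valid_segment("index.html"))   # plain ASCII segment
print(is_valid_segment("%ED%95%9C"))    # percent-encoded octets
print(is_valid_segment("한글"))          # raw non-ASCII characters
```

Only the first two calls return True: nothing in pchar matches a character outside ASCII, so the raw Hangul segment is rejected.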

Thus you can't use raw non-ASCII bytes in a URI: they must be
percent-encoded, and how the decoded octets are interpreted is up to
the origin server (overwhelmingly UTF-8 these days).
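As a rough illustration of what "must be encoded" means in practice, the following Python sketch percent-encodes a non-ASCII path segment using the standard library. The segment value is just an example, and UTF-8 is assumed as the public interface encoding (it is also the default for urllib.parse.quote).

```python
from urllib.parse import quote, unquote

segment = "한글"  # example path segment, chosen for illustration

# Encode the text as UTF-8 octets, then percent-encode every octet
# that is not an unreserved character. safe="" means "/" is escaped too.
encoded = quote(segment, safe="")

# Decoding reverses the process, again assuming UTF-8.
decoded = unquote(encoded)

print(encoded)  # %ED%95%9C%EA%B8%80
print(decoded == segment)
```

The encoded form contains only "%" and hex digits, so it fits the pchar rule; whether the receiver decodes those octets as UTF-8 is a convention, not something the URI itself states.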

Even Solr (the search engine) permits them.

It would of course be possible for any tool or web server to
accept URIs with non-ASCII bytes, but I don't know of any
browsers which would _send_ such a request, because in
general it would be rejected.

I tried searching for non-ASCII text on whitehouse.gov (which uses
Solr) and indeed it generated a percent-encoded query. My
browser (Chrome) rendered the percent escapes as UTF-8 for
me, though.
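The same round trip can be sketched for a query string. This is only an approximation of what a browser does: the parameter name "q" and the example.org host are assumptions for illustration, not the actual request sent to whitehouse.gov.

```python
from urllib.parse import urlencode, parse_qs, unquote

# What goes on the wire: the query is percent-encoded UTF-8.
query = urlencode({"q": "한글"})                # hypothetical parameter name
url = "http://example.org/search?" + query     # hypothetical host and path

# What the address bar may show: the escapes decoded back as UTF-8.
display_url = unquote(url)

# What the server recovers, assuming it also decodes as UTF-8.
params = parse_qs(query)

print(url)
print(display_url)
print(params["q"][0])
```

The wire form is pure ASCII; the readable form is only a rendering of it.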

There's also Punycode, which can be used to represent Unicode
domain names (which otherwise don't even allow percent escapes).
In some cases certain browsers will render this for you (generally
if the encoded script matches the country-code top-level domain,
e.g. for a .kr domain Hangul would be shown), but it's in general
a dangerous extension because it makes phishing attempts easier.
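For the host part, a small Python sketch of the Punycode round trip follows. It uses the built-in "idna" codec, which implements the older IDNA 2003 rules; the domain name is a made-up example rather than a reference to a real site.

```python
# Hypothetical internationalized domain name, for illustration only.
name = "한글.kr"

# Each non-ASCII label is converted to an ASCII "xn--" label (Punycode);
# labels that are already ASCII, such as "kr", pass through unchanged.
ascii_name = name.encode("idna")

# Converting back is what a browser does when it chooses to display
# the Unicode form instead of the "xn--" form.
unicode_name = ascii_name.decode("idna")

print(ascii_name)
print(unicode_name)
```

The ASCII form is what appears in DNS queries and in the Host header; showing the Unicode form is purely a display decision, which is where the phishing concern comes from.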

Alex

From:	Alex Shinn
Subject:	Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date:	Mon, 14 Jan 2013 14:42:40 +0900