
Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.


From: Ivan Raikov
Subject: Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date: Thu, 17 Jan 2013 09:35:36 +0900

Hi Peter,

    I think that allowing raw UTF-8 sequences in uri-generic breaks compatibility with RFC 3986. In other words, if you construct a URI from a string that contains a UTF-8 sequence, the octets of that sequence will not get escaped even though they fall outside the character set RFC 3986 allows, and you could end up sending an invalid URI to a legacy system that does not understand UTF-8. For example, the UTF-8 string "пиле" consists of the octets D0 BF D0 B8 D0 BB D0 B5. None of these octets is an ASCII character, so all of them are outside the allowed character set defined in RFC 3986 and are correctly rejected by the uri-reference constructor. However, if we allow UTF-8 string operations in uri-generic and extend the unreserved character set to include Unicode, these octets will form a valid character sequence and will be accepted by uri-reference without being escaped. If you then send the result of uri->string to a system that does not understand UTF-8, the URI will get rejected.
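
To make the octet argument concrete, here is a minimal sketch in Python (not CHICKEN, and not the uri-generic API). It assumes the unreserved set from RFC 3986, Section 2.3; the helper name percent_encode is hypothetical. It shows that every octet of "пиле" lies above the ASCII range and what the string looks like once each octet is percent-encoded.

```python
# Unreserved characters per RFC 3986, Section 2.3 (ALPHA / DIGIT / "-" / "." / "_" / "~").
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def percent_encode(text: str) -> str:
    """Hypothetical helper: percent-encode every UTF-8 octet that is not unreserved."""
    out = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            out.append(chr(octet))
        else:
            out.append("%{:02X}".format(octet))
    return "".join(out)

octets = "пиле".encode("utf-8")
print(octets.hex(" ").upper())            # D0 BF D0 B8 D0 BB D0 B5
print(all(o > 0x7F for o in octets))      # True: no octet is in the ASCII range
print(percent_encode("пиле"))             # %D0%BF%D0%B8%D0%BB%D0%B5
```

The last line is a form that an ASCII-only consumer can accept, because it contains nothing but "%" and ASCII hex digits.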

  My proposed solution is to add a UTF-8-aware constructor to uri-generic and to prevent percent-decoding of UTF-8 sequences. I believe that this solution is compatible with the IRI-to-URI mapping scheme described in Section 3.1 of RFC 3987, but I do need to extend the uri-generic test suite with more UTF-8 examples to ensure that nothing is broken. I think that any solution will have to give the user the choice of whether to use ASCII or UTF-8, and not just default to UTF-8.
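
As a rough illustration of the mapping idea, the following Python sketch percent-encodes only the non-ASCII characters of an IRI and leaves every ASCII character, including delimiters and existing escapes, untouched. It is a simplification under stated assumptions: it treats every non-ASCII character as one to be encoded, and it skips the normalization and host-name handling that RFC 3987 also covers. The function name iri_to_uri and the example.org address are hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Hypothetical sketch: percent-encode the UTF-8 octets of non-ASCII characters only."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters (delimiters, existing %XX escapes) pass through unchanged.
            out.append(ch)
        else:
            # Non-ASCII characters become one %XX triplet per UTF-8 octet.
            out.extend("%{:02X}".format(o) for o in ch.encode("utf-8"))
    return "".join(out)

uri = iri_to_uri("http://example.org/пиле?q=1")
print(uri)
print(uri.isascii())
```

Keeping this as a separate, explicitly UTF-8-aware entry point leaves the existing ASCII-only behaviour as it is, which matches the idea of letting the user choose.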

   Ivan

On Thu, Jan 17, 2013 at 4:51 AM, Peter Bex <address@hidden> wrote:

OK, I took some time to investigate and pinpointed the problem.
It appears to be caused by the use of the core srfi-14 and srfi-13 in
uri-generic; its char-set operations simply don't deal with anything
beyond ASCII.  It only works after switching to the UTF-8 versions,
utf8-srfi-14, utf8-srfi-13 and unicode-char-sets:

Without patch:
$ csi -R uri-generic -P '(uri-encode-string "삼계탕")'
"�%82%BC�%B3%84�%83%95"

With patch:
$ csi -R uri-generic -P '(uri-encode-string "삼계탕")'
"%EC%82%BC%EA%B3%84%ED%83%95"
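
As an independent cross-check of the "With patch" output, this short Python sketch uses the standard library's urllib.parse.quote, which encodes the UTF-8 octets of its argument. It is only a comparison point and says nothing about how uri-generic is implemented.

```python
from urllib.parse import quote, unquote

text = "삼계탕"

# safe="" asks quote() to escape everything except the unreserved characters.
encoded = quote(text, safe="")
print(encoded)                    # %EC%82%BC%EA%B3%84%ED%83%95

# Decoding the escapes as UTF-8 recovers the original string.
print(unquote(encoded) == text)   # True
```

Each Hangul syllable occupies three UTF-8 octets, so a correct encoding has nine %XX triplets; in the "Without patch" output the first octet of each syllable is left unescaped.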

Ivan, what do you think about adding the UTF8 dependency, as per the
attached patch (against trunk)?

