bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’

From: Timothy Sample
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Mon, 27 May 2019 09:39:03 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)


Ricardo Wurmus <address@hidden> writes:

> Ludovic Courtès <address@hidden> writes:
>> Using the “lower” regexp class instead of “[a-z]” works:
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
>> $12 = #("w" (0 . 1))
>> --8<---------------cut here---------------end--------------->8---
>> However, it’s not clear to me whether the “lower” class is supposed to
>> be the same for all locales or if we’re just lucky:
>>   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>> Thoughts?
> The lower class is much larger than [a-z].  If we only wanted to work
> around this particular problem we could explicitly spell out the range,
> which would be the same in all locales.  (Obviously, that wouldn’t be
> pretty.)

I think that explicitly spelling out the range is the right thing to do
here.  The POSIX spec says that character ranges work in the POSIX
locale, but “in other locales, a range expression has unspecified
behavior.”

> But can’t URI parts contain more than those characters?

A quick reading of RFC 3986 suggests that the host part of a URI can be
an IP address (version 4 or 6) or a registered name.  It gives the
following rules for registered names:

reg-name      = *( unreserved / pct-encoded / sub-delims )
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
just the ASCII ranges you might expect (“HEXDIG” is spelled out with
uppercase letters only, but ABNF literals are case-insensitive, so
lowercase hex digits match as well).

It looks like Guile is currently a little stricter than this, but pretty
close (if you take the character ranges to mean ASCII ranges).
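
To make the grammar concrete, here is a rough sketch of a “reg-name”
check that spells out every allowed character instead of using range
expressions, so the locale cannot change what it matches.  This is not
how Guile’s URI code is written; the names ‘reg-name-rx’ and
‘reg-name?’ are made up for illustration.

```scheme
(use-modules (ice-9 regex))

;; ALPHA and DIGIT written out in full, so no "a-z" style range is used.
(define ascii-alpha
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
(define ascii-digit "0123456789")

;; One hex digit.  RFC 3986 treats "A"-"F" and "a"-"f" as equivalent.
(define hexdig "[0123456789ABCDEFabcdef]")

;; reg-name = *( unreserved / pct-encoded / sub-delims )
;; The "-" goes last in the bracket expression so it is taken literally.
(define reg-name-rx
  (make-regexp
   (string-append "^([" ascii-alpha ascii-digit "._~!$&'()*+,;=-]"
                  "|%" hexdig hexdig ")*$")))

;; Hypothetical helper: #t if STR matches the reg-name rule, else #f.
(define (reg-name? str)
  (and (regexp-exec reg-name-rx str) #t))
```

With these definitions, (reg-name? "www.gnu.org") and
(reg-name? "%7Euser") should both give #t, while (reg-name? "a b") and
(reg-name? "%zz") should give #f.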

> To circumvent
> the question whether the lower class is locale dependent we could
> generate an explicit range from a charset.

I think this is the right approach.  Using “[:lower:]” would allow
things outside of the RFC, like ‘é’.  Adding support for
internationalized domain names using Punycode would be cool, but well
outside the scope of this bug.  :)
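
For what it’s worth, generating the explicit range from a charset could
look something like the following.  It is only a sketch using SRFI-14;
‘ascii-lower’ and ‘ascii-lower-rx’ are names I made up, not anything in
Guile.

```scheme
(use-modules (ice-9 regex) (srfi srfi-14))

;; Code points U+0061 ("a") through U+007A ("z").  The upper bound of
;; 'ucs-range->char-set' is exclusive, hence #x7B.
(define ascii-lower (ucs-range->char-set #x61 #x7B))

;; Turn the charset into a bracket expression that lists each member.
;; This set holds only letters, so nothing needs special placement; a
;; set containing "]", "^", or "-" would need more care.
(define ascii-lower-rx
  (make-regexp (string-append "[" (char-set->string ascii-lower) "]")))
```

Here (regexp-exec ascii-lower-rx "w") should return a match object and
(regexp-exec ascii-lower-rx "W") should return #f, whatever the locale.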

-- Tim
