Re: AW: treatment of U+002E that is produced by NFKC

help-libidn

From:	Simon Josefsson
Subject:	Re: AW: treatment of U+002E that is produced by NFKC
Date:	Tue, 15 Jan 2008 16:51:16 +0100
User-agent:	Gnus/5.110007 (No Gnus v0.7) Emacs/22.1 (gnu/linux)

"Erik van der Poel" <address@hidden> writes:

> Looks good to me.
>
> Other than your interpretation of RFC 3490 leading to the insertion of
> 0x2E into a DNS label, but I guess you and I will simply have to agree
> that we disagree on this point. RFC 3490 should have been clearer.

I regard escaping 0x2E as the logical consequence of two things: the IDNA
design operates on single labels, and U+2024 etc. turn into 0x2E under
NFKC.  I think RFC 3490 didn't intend for ToASCII to be able to take one
label and output two labels.  I suspect the problems here stem from a
perception that ToASCII would never produce new 0x2E's.  But I can't say
for sure.
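
The NFKC behaviour in question can be checked with a few lines of
Python.  This is only an illustrative sketch that uses the standard
`unicodedata' module rather than Libidn, and the helper name
`nfkc_adds_dot' is made up for this example.

```python
import unicodedata


def nfkc_adds_dot(text: str) -> bool:
    """Return True if NFKC introduces a U+002E that was not in the input."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.count(".") > text.count(".")


# The two characters discussed in this thread.
for ch in ("\u2024", "\u248c"):
    normalized = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {normalized!r}")

# A single label without any ASCII dot gains one after normalization.
print(nfkc_adds_dot("hi\u248ccom"))
```

A caller could run a check like this on each label before ToASCII to
detect inputs that would end up with an embedded 0x2E.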

> By the way, I did a Web search for "2024 nfkc" and found that this
> issue was raised, but I guess it was not resolved adequately:
>
> http://www.ops.ietf.org/lists/idn/idn.2001/msg02450.html

Interesting.

/Simon

> Erik
>
> On Jan 15, 2008 7:15 AM, Simon Josefsson <address@hidden> wrote:
>> "Erik van der Poel" <address@hidden> writes:
>>
>> > Yes, that's right.
>> >
>> > By the way, there may be a different way to address this issue. If
>> > libidn has a separate API for NFKC or Nameprep, the caller could pass
>> > the entire domain name (including all of the dots and dot-like
>> > characters) through NFKC (or Nameprep) first, and then call the normal
>> > IDNA routine. This is quite likely to behave the same way as MSIE 7
>> > and Firefox 2. If you chose this approach, you could simply document
>> > this somewhere, and callers could then decide whether or not to go
>> > this way.
>>
>> Libidn has a simple NFKC interface, and I'm documenting that approach
>> now.  Below is the current text in the manual.  I'll forward this to the
>> Firefox IDN guys to see if they are interested in documenting their
>> practice further, possibly in an I-D.  If ToASCII(NFKC(i)) turns out to
>> actually work and behave better than RFC 3490, documenting that now
>> seems useful.
>>
>> Thanks,
>> /Simon
>>
>> Appendix B On Label Separators
>> ******************************
>>
>> Some strings contain characters whose NFKC normalized form contains the
>> ASCII dot (0x2E, ".").  Examples of these characters are U+2024 (ONE
>> DOT LEADER) and U+248C (DIGIT FIVE FULL STOP).  Such strings have the
>> interesting property that their IDNA ToASCII output will contain
>> embedded dots.  For example:
>>
>>      ToASCII (hi U+248C com) = hi5.com
>>      ToASCII (räksmörgås U+2024 com) = xn--rksmrgs.com-l8as9u
>>
>>    These demonstrate the two general cases.  In the first, the ASCII dot
>> is part of an output that does not begin with the IDN prefix "xn--".  In
>> the second, the dot is part of an output that does begin with the
>> prefix "xn--".
>>
>>    Each input string is, from the DNS point of view, a single label.
>> The IDNA algorithm translates one label at a time.  Thus, the output is
>> expected to be only one label.  What is important here is to make sure
>> the DNS resolver receives the correct query.  The DNS protocol does not
>> use the dot to delimit labels on the wire; rather, it uses length-value
>> pairs.  Thus the correct query would be for `{7}hi5.com' and
>> `{22}xn--rksmrgs.com-l8as9u' respectively.
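
The length-value encoding described above can be sketched in Python as
follows.  This is a minimal illustration, not resolver code: the function
name `to_wire' is hypothetical, and unlike the `{7}' notation in the
manual text it also appends the terminating zero-length root label that a
complete name carries on the wire.

```python
def to_wire(labels: list[bytes]) -> bytes:
    """Encode DNS labels as length-prefixed octet strings (RFC 1035 style)."""
    out = bytearray()
    for label in labels:
        if not 0 < len(label) <= 63:
            raise ValueError("a DNS label must be 1..63 octets long")
        out.append(len(label))  # one length octet ...
        out += label            # ... followed by the label octets
    out.append(0)               # zero-length root label ends the name
    return bytes(out)


# One label that happens to contain 0x2E, versus two separate labels.
print(to_wire([b"hi5.com"]))
print(to_wire([b"hi5", b"com"]))
```

The two calls produce different octet sequences, which is why a dot
inside a label is unambiguous on the wire even though it is ambiguous in
the usual textual form.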
>>
>>    Some implementations (1) have decided that these input strings are
>> potentially confusing for the user.  The string "hi U+248C com" looks
>> like "hi5.com" on systems that support Unicode properly.  These
>> implementations do not follow RFC 3490.  They yield:
>>
>>      ToASCII (hi U+248C com) = hi5.com
>>      ToASCII (räksmörgås U+2024 com) = xn--rksmrgs-5wao1o.com
>>
>>    The DNS queries they perform are `{3}hi5{3}com' and
>> `{18}xn--rksmrgs-5wao1o{3}com' respectively.  Arguably, this leads to a
>> better user experience, and suggests that the IDNA specification is
>> sub-optimal in this area.
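
The difference between the two behaviours comes down to whether label
splitting happens before or after normalization.  The sketch below
assumes plain NFKC as a stand-in for the full Nameprep step and uses the
four label separators listed in RFC 3490, section 3.1; both function
names are invented for this illustration.

```python
import re
import unicodedata

# Label separators recognised by RFC 3490, section 3.1.
IDNA_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")


def labels_split_first(name: str) -> list[str]:
    """Split on the IDNA dots, then normalize each label on its own."""
    return [unicodedata.normalize("NFKC", label) for label in IDNA_DOTS.split(name)]


def labels_normalize_first(name: str) -> list[str]:
    """Normalize the whole name, then split on the IDNA dots."""
    return IDNA_DOTS.split(unicodedata.normalize("NFKC", name))


name = "hi\u248ccom"
print(labels_split_first(name))      # one label with an embedded dot
print(labels_normalize_first(name))  # two labels
```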
>>
>> B.1 Recommended Workaround
>> ==========================
>>
>> It has been suggested to normalize the entire input string using NFKC
>> before passing it to IDNA ToASCII.  You may use
>> `stringprep_utf8_nfkc_normalize' or `stringprep_ucs4_nfkc_normalize'.
>> This will avoid the problem, and appears to lead to similar behaviour
>> as IE/Firefox.
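
As a rough illustration of this workaround, here is the same idea in
Python.  It is a sketch under stated assumptions, not Libidn code:
`unicodedata.normalize' stands in for the `stringprep_*_nfkc_normalize'
functions, Python's built-in "idna" codec (IDNA 2003) stands in for the
`idna_*' functions, and the name `to_ascii_after_nfkc' is hypothetical.

```python
import unicodedata


def to_ascii_after_nfkc(name: str) -> bytes:
    """NFKC-normalize the whole name, then apply per-label IDNA ToASCII."""
    normalized = unicodedata.normalize("NFKC", name)
    # Python's built-in "idna" codec splits on dots and converts each
    # label separately (IDNA 2003 rules).
    return normalized.encode("idna")


print(to_ascii_after_nfkc("hi\u248ccom"))
print(to_ascii_after_nfkc("r\u00e4ksm\u00f6rg\u00e5s\u2024com"))
```

With the whole string normalized first, the dot-like characters are
already ASCII dots by the time labels are split, so each output label is
free of embedded dots.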
>>
>>    Alternative workarounds are being considered.  Eventually Libidn may
>> implement a new flag to the `idna_*' functions that implements a
>> recommended way to work around this problem.
>>
>>    ---------- Footnotes ----------
>>
>>    (1) Notably Microsoft's Internet Explorer and Mozilla's Firefox, but
>> not Apple's Safari.
>>
