Re: AW: treatment of U+002E that is produced by NFKC

help-libidn

From:	Erik van der Poel
Subject:	Re: AW: treatment of U+002E that is produced by NFKC
Date:	Tue, 15 Jan 2008 08:03:35 -0800

On Jan 15, 2008 7:51 AM, Simon Josefsson <address@hidden> wrote:
> "Erik van der Poel" <address@hidden> writes:
>
> > Looks good to me.
> >
> > Other than your interpretation of RFC 3490 leading to the insertion of
> > 0x2E into a DNS label, but I guess you and I will simply have to agree
> > that we disagree on this point. RFC 3490 should have been clearer.
>
> I regard escaping 0x2E as the logical consequence of the IDNA design to
> operate on single labels and how U+2024 etc behaves under NFKC.

When you say "escaping 0x2E", I guess you're referring to the use of
the backslash (\). If so, I think I see your point of view more
clearly now. However, RFC 3490 does not explicitly say that you must
insert a backslash to escape any 0x2Es that come out of Nameprep if
you want to get 0x2E into a DNS packet using some routine that treats
the backslash as the escape character.
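
To make the backslash point concrete, here is a minimal sketch of a routine that treats `\` as the escape character when turning a textual name into length-prefixed labels. The function names `split_labels` and `to_wire` are made up for this illustration and are not part of libidn or any resolver API. The sketch assumes ASCII input (for example, the output of ToASCII), no trailing dot, and does not handle `\DDD` decimal escapes.

```python
def split_labels(name: str) -> list:
    """Split a textual name on unescaped dots; a backslash-escaped dot stays in the label."""
    labels, current, i = [], bytearray(), 0
    while i < len(name):
        ch = name[i]
        if ch == "\\" and i + 1 < len(name):
            current.append(ord(name[i + 1]))  # escaped character is taken literally
            i += 2
        elif ch == ".":
            labels.append(bytes(current))     # unescaped dot ends the current label
            current = bytearray()
            i += 1
        else:
            current.append(ord(ch))
            i += 1
    labels.append(bytes(current))
    return labels


def to_wire(labels) -> bytes:
    """Encode labels as length-prefixed octets, terminated by the zero-length root label."""
    out = bytearray()
    for label in labels:
        if not 0 < len(label) < 64:
            raise ValueError("label must be 1..63 octets")
        out.append(len(label))
        out += label
    out.append(0)
    return bytes(out)


# Escaped dot: one 7-octet label containing 0x2E.
print(to_wire(split_labels("hi5\\.com")))
# Unescaped dot: two labels, "hi5" and "com".
print(to_wire(split_labels("hi5.com")))
```

With such a routine, whether 0x2E ends up inside a label depends entirely on whether the caller escaped it, which is the step RFC 3490 does not spell out.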

Anyway, we're really splitting hairs now. The one thing we seem to
agree on is that RFC 3490 could have been clearer in this area.

Erik

> I think
> RFC 3490 didn't intend for ToASCII to be able to take one label and
> output two labels.  I suspect the reason for the problems here is that
> there was a perception that ToASCII would never produce new 0x2E's.  But
> I can't say for sure.
>
> > By the way, I did a Web search for "2024 nfkc" and found that this
> > issue was raised, but I guess it was not resolved adequately:
> >
> > http://www.ops.ietf.org/lists/idn/idn.2001/msg02450.html
>
> Interesting.
>
> /Simon
>
>
> > Erik
> >
> > On Jan 15, 2008 7:15 AM, Simon Josefsson <address@hidden> wrote:
> >> "Erik van der Poel" <address@hidden> writes:
> >>
> >> > Yes, that's right.
> >> >
> >> > By the way, there may be a different way to address this issue. If
> >> > libidn has a separate API for NFKC or Nameprep, the caller could pass
> >> > the entire domain name (including all of the dots and dot-like
> >> > characters) through NFKC (or Nameprep) first, and then call the normal
> >> > IDNA routine. This is quite likely to behave the same way as MSIE 7
> >> > and Firefox 2. If you chose this approach, you could simply document
> >> > this somewhere, and callers could then decide whether or not to go
> >> > this way.
> >>
> >> Libidn has a simple NFKC interface, and I'm documenting that approach
> >> now.  Below is the current text in the manual.  I'll forward this to the
> >> Firefox IDN guys to see if they are interested in documenting their
> >> practice further, possibly in an I-D.  If ToASCII(NFKC(i)) turns out to
> >> actually work and behave better than RFC 3490, documenting that now
> >> seems useful.
> >>
> >> Thanks,
> >> /Simon
> >>
> >> Appendix B On Label Separators
> >> ******************************
> >>
> > >> Some strings contain characters whose NFKC normalized form contains the
> >> ASCII dot (0x2E, ".").  Examples of these characters are U+2024 (ONE
> >> DOT LEADER) and U+248C (DIGIT FIVE FULL STOP).  The strings have the
> >> interesting property that their IDNA ToASCII output will contain
> >> embedded dots.  For example:
> >>
> >>      ToASCII (hi U+248C com) = hi5.com
> >>      ToASCII (räksmörgås U+2024 com) = xn--rksmrgs.com-l8as9u
> >>
> > >>    This demonstrates the two general cases: in the first, the ASCII dot
> > >> is part of an output that does not begin with the IDN prefix "xn--".  The
> > >> second example illustrates the case where the dot is part of an IDN
> > >> prefixed with "xn--".
> >>
> >>    The input strings are, from the DNS point of view, a single label.
> > >> The IDNA algorithm translates one label at a time.  Thus, the output is
> > >> expected to be only one label.  What is important here is to make sure
> > >> the DNS resolver receives the correct query.  The DNS protocol does not
> > >> use the dot to delimit labels on the wire; rather, it uses length-value
> >> pairs.  Thus the correct query would be for `{7}hi5.com' and
> >> `{22}xn--rksmrgs.com-l8as9u' respectively.
> >>
> > >>    Some implementations (1) have decided that these input strings are
> >> potentially confusing for the user.  The string "hi U+248C com" looks
> >> like "hi5.com" on systems that support Unicode properly.  These
> >> implementations do not follow RFC 3490.  They yield:
> >>
> >>      ToASCII (hi U+248C com) = hi5.com
> >>      ToASCII (räksmörgås U+2024 com) = xn--rksmrgs-5wao1o.com
> >>
> > >>    The DNS queries they perform are `{3}hi5{3}com' and
> >> `{18}xn--rksmrgs-5wao1o{3}com' respectively.  Arguably, this leads to a
> >> better user experience, and suggests that the IDNA specification is
> >> sub-optimal in this area.
> >>
> >> B.1 Recommended Workaround
> >> ==========================
> >>
> >> It has been suggested to normalize the entire input string using NFKC
> >> before passing it to IDNA ToASCII.  You may use
> >> `stringprep_utf8_nfkc_normalize' or `stringprep_ucs4_nfkc_normalize'.
> >> This will avoid the problem, and appears to lead to similar behaviour
> >> as IE/Firefox.
> >>
> >>    Alternative workarounds are being considered.  Eventually Libidn may
> >> implement a new flag to the `idna_*' functions that implements a
> >> recommended way to work around this problem.
> >>
> >>    ---------- Footnotes ----------
> >>
> >>    (1) Notably Microsoft's Internet Explorer and Mozilla's Firefox, but
> >> not Apple's Safari.
> >>
>
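
The difference between the two behaviours described in the quoted manual text can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `to_ascii_label` is a simplified stand-in for ToASCII that applies NFKC and Punycode but omits the rest of Nameprep (case mapping, prohibited characters) and the length checks, and the names `per_label` and `nfkc_first` are invented here. It uses Python's standard `unicodedata` module and `punycode` codec, not libidn.

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_ascii_label(label: str) -> str:
    """Simplified ToASCII for one label: NFKC, then Punycode if non-ASCII remains.

    Assumption: full Nameprep and the 63-octet length check are omitted.
    """
    label = unicodedata.normalize("NFKC", label)
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def per_label(name: str) -> list:
    """Treat the input as a single label, so a dot produced by NFKC stays inside it."""
    return [to_ascii_label(name)]


def nfkc_first(name: str) -> list:
    """Normalize the whole name with NFKC first, then split on dots and convert each label."""
    normalized = unicodedata.normalize("NFKC", name)
    return [to_ascii_label(part) for part in normalized.split(".")]


for name in ("hi\u248ccom", "r\u00e4ksm\u00f6rg\u00e5s\u2024com"):
    print(per_label(name), nfkc_first(name))
```

For the inputs from the manual text, `per_label` yields one label with an embedded dot, while `nfkc_first` yields two labels, matching the `{3}hi5{3}com` style of query attributed above to IE and Firefox.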
