bug-coreutils

Re: /usr/bin/printf: invalid universal character name


From: Bruno Haible
Subject: Re: /usr/bin/printf: invalid universal character name
Date: Thu, 15 May 2008 00:21:56 +0200
User-agent: KMail/1.5.4

Jim Meyering wrote:
> Paul Eggert added this feature 8 years ago

Well, all honours to Paul, but I submitted this feature to you on 2000-02-02.

> I don't know the motivation for those exceptions.

The motivation is that the ISO C 99 standard has these exceptions:

  ISO C 99, 6.4.3(2):
  "Constraints
   A universal character name shall not specify a character whose short
   identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`),
   nor one in the range D800 through DFFF inclusive."

and I find it undesirable to have different variants of the same concept in
different tools. For example, the hexadecimal escape syntax is different:
  - In C, Awk, Emacs Lisp, it accepts any number of hexadecimal digits.
  - In sh, PHP, Python, Perl, it accepts up to 2 hexadecimal digits.
  - In C#, it accepts up to 4 hexadecimal digits.
Similarly, the octal escape syntax is different:
  - In C, Awk, Emacs Lisp, it accepts up to 3 octal digits,
  - In Perl, likewise, but values between \400 and \777 are valid.
It causes headaches to the programmers, for no real benefit.
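
For the C case, the difference between the two escape syntaxes can be seen in a small sketch. The array names `octal_demo` and `hex_demo` are made up for illustration; the behaviour relies only on the standard C rule that adjacent string literals are concatenated after escape sequences have been converted.

```c
#include <stddef.h>

/* An octal escape takes at most 3 digits, so "\0123" is the three bytes
   '\012', '3' and the terminating NUL. */
static const char octal_demo[] = "\0123";

/* A hexadecimal escape takes as many hex digits as follow it.  Splitting
   the literal ends the escape after "12", giving the four bytes
   '\x12', '3', '4' and the terminating NUL. */
static const char hex_demo[] = "\x12" "34";
```

Written as a single literal, "\x1234" would be one escape whose value does not fit in a char on typical platforms.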

The motivation for those exceptions in C is probably to avoid discussing
weird cases like
   char foo[] = "abc\u000Adef";       // newline in string - allowed or not?
   char bar[] = "abc\\u00789A";       // hexadecimal escape or not?
   char mph[] = "abc\u0022";          // valid or not?
   char baz[] = "abc\\u0022";         // abc\u0022 or abc" ?

and - to a lesser extent - to allow faster parsing. A parser that needs to
interpret
   \u0023include \u0022stdio.h"
is certainly slower than a parser that can reject this input.
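
The constraint quoted above from ISO C 99, 6.4.3(2) is cheap to check, which is part of the point. A minimal sketch, assuming a hypothetical helper named `ucn_allowed` (this is not the code used in coreutils, and it does not check the upper limit of the Unicode range):

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if the code point `cp` may be written
   as a universal character name under ISO C 99, 6.4.3(2). */
static bool ucn_allowed(unsigned long cp)
{
    /* D800 through DFFF inclusive is never allowed. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;

    /* Below 00A0, only $ (0024), @ (0040) and ` (0060) are allowed. */
    if (cp < 0x00A0)
        return cp == 0x0024 || cp == 0x0040 || cp == 0x0060;

    return true;
}
```

With such a check, a parser can reject \u000A or \u0022 as soon as the digits are read, without having to decide what a newline or a quote character would mean at that position.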

Hermann Peifer wrote:
> Only DOLLAR SIGN, COMMERCIAL AT and GRAVE ACCENT are legal in the
> range 0x00..0x9f ?
>
> I still think that these 92 cases are bugs, rather than anything else:

You are entitled to your opinion. So that you cannot call them "bugs" any more,
I propose to make the restriction explicit in the coreutils manual:


2008-05-14  Bruno Haible  <address@hidden>

        * doc/coreutils.texi (printf invocation): Clarify invalid ranges for
        Unicode character escape syntax.

--- coreutils.texi.bak  2008-03-14 01:48:04.000000000 +0100
+++ coreutils.texi      2008-05-15 00:18:50.000000000 +0200
@@ -10305,7 +10305,9 @@
 four hexadecimal digits @var{hhhh}, and @samp{\U} for 32-bit Unicode
 characters, specified as eight hexadecimal digits @var{hhhhhhhh}.
 @command{printf} outputs the Unicode characters
-according to the @env{LC_CTYPE} locale.
+according to the @env{LC_CTYPE} locale.  Unicode characters in the ranges
+U+0000...U+009F, U+D800...U+DFFF cannot be specified by this syntax, except
+for U+0024 ($), U+0040 (@@), and U+0060 (@`).
 
 The processing of @samp{\u} and @samp{\U} requires a full-featured
 @code{iconv} facility.  It is activated on systems with glibc 2.2 (or newer),
