some POSIX-conformance cleanups for GNU tr
From: Paul Eggert
Subject: some POSIX-conformance cleanups for GNU tr
Date: Tue, 01 Jun 2004 15:46:43 -0700
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)
I went through the POSIX spec for 'tr' with a fine-toothed comb and
compared it to GNU tr's source. In some cases GNU 'tr' is too picky;
it diagnoses constructs for which POSIX does not require a diagnostic.
I think it's better for POSIXLY_CORRECT to affect the behavior of GNU
'tr' as little as possible, so I removed the overly-picky diagnostics.
In a few other cases GNU tr gives the wrong answer, e.g. "tr 'a\055b'
def" is treated like "tr a-b def", which isn't right: the escaped \055
should stand for a literal '-' rather than act as the range operator.
Also, POSIX requires an option -C, which GNU tr currently doesn't
support. Here is a patch.
2004-06-01 Paul Eggert <address@hidden>
Some POSIX-conformance cleanups for tr.
* doc/coreutils.texi (tr invocation): Mention -C.
* src/tr.c (posix_pedantic): Remove; no longer needed since
we need to test this in just one place now.
(usage): Mention -C.
(unquote): Note that \055, \n, etc are escaped.
Do not worry about POSIXLY_CORRECT when warning about ambiguous
escape sequences.
\ at end of string stands for itself.
Do not diagnose invalid backslash escapes: POSIX says the behavior
is unspecified in this case, so we don't need to diagnose it.
(main): Add support for -C (currently an alias for -c).
Do not diagnose 'tr [:upper:] [:upper:]', as POSIX does not require
a diagnostic here.
* tests/tr/Test.pm: New tests bs-055, bs-at-end, repeat-Compl.
Fix comment for range-a-a.
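
For readers who want the idea behind the unquote changes without reading
the whole diff, here is a rough, self-contained sketch. It is not the
code in src/tr.c: the names sketch_string, sketch_unquote and
sketch_is_range_at are made up for illustration, the buffer is a fixed
size, and only \\, \n and octal escapes are handled specially. It shows
why recording an "escaped" flag per output byte keeps 'a\055b' from
being parsed as the range a-b, and how a trailing backslash can stand
for itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for tr.c's struct E_string.  */
struct sketch_string
{
  unsigned char s[64];  /* the unquoted bytes */
  bool escaped[64];     /* true if byte J came from a backslash escape */
  size_t len;
};

/* Unquote IN into ES.  Return false if IN does not fit.
   Assumption: only \\, \n and 1-3 digit octal escapes get special
   treatment; any other escaped character stands for itself.  */
static bool
sketch_unquote (char const *in, struct sketch_string *es)
{
  assert (in != NULL && es != NULL);
  size_t j = 0;
  for (size_t i = 0; in[i] != '\0'; i++)
    {
      unsigned char c = in[i];
      bool esc = false;
      if (c == '\\' && in[i + 1] != '\0')
        {
          esc = true;
          i++;
          if ('0' <= in[i] && in[i] <= '7')
            {
              /* Up to three octal digits, never exceeding \377.  */
              int v = in[i] - '0';
              for (int n = 1;
                   n < 3 && '0' <= in[i + 1] && in[i + 1] <= '7'
                   && 8 * v + (in[i + 1] - '0') <= 255;
                   n++)
                v = 8 * v + (in[++i] - '0');
              c = v;
            }
          else if (in[i] == 'n')
            c = '\n';
          else
            c = in[i];
        }
      /* Otherwise C is an ordinary byte, or a trailing backslash,
         which stands for itself and is not marked as escaped.  */
      if (j == sizeof es->s)
        return false;
      es->s[j] = c;
      es->escaped[j] = esc;
      j++;
    }
  es->len = j;
  return true;
}

/* True if a range like a-z starts at position J, i.e. the next byte
   is an unescaped '-' and at least one more byte follows it.  */
static bool
sketch_is_range_at (struct sketch_string const *es, size_t j)
{
  return j + 2 < es->len && es->s[j + 1] == '-' && !es->escaped[j + 1];
}
```

With this sketch, unquoting the C string "a\\055b" yields the three
bytes 'a', '-', 'b' with the middle one flagged as escaped, so
sketch_is_range_at reports no range at position 0, whereas "a-b" does
start a range there.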
Index: doc/coreutils.texi
===================================================================
RCS file: /home/meyering/coreutils/cu/doc/coreutils.texi,v
retrieving revision 1.183
diff -p -u -r1.183 coreutils.texi
--- doc/coreutils.texi 1 Jun 2004 12:46:22 -0000 1.183
+++ doc/coreutils.texi 1 Jun 2004 21:41:51 -0000
@@ -4670,8 +4670,17 @@ delete characters, then squeeze repeated
The @var{set1} and (if given) @var{set2} arguments define ordered
sets of characters, referred to below as @var{set1} and @var{set2}. These
sets are the characters of the input that @command{tr} operates on.
-The @option{--complement} (@option{-c}) option replaces @var{set1} with its
+The @option{--complement} (@option{-c}, @option{-C}) option replaces
+@var{set1} with its
complement (all of the characters that are not in @var{set1}).
+
+Currently @command{tr} fully supports only single-byte characters.
+Eventually it will support multibyte characters; when it does, the
+@option{-C} option will cause it to complement the set of characters,
+whereas @option{-c} will cause it to complement the set of values.
+This distinction will matter only when some values are not characters,
+and this is possible only in locales using multibyte encodings when
+the input contains encoding errors.
@exitstatus
Index: src/tr.c
===================================================================
RCS file: /home/meyering/coreutils/cu/src/tr.c,v
retrieving revision 1.131
diff -p -u -r1.131 tr.c
--- src/tr.c 31 May 2004 11:30:27 -0000 1.131
+++ src/tr.c 1 Jun 2004 22:31:20 -0000
@@ -212,23 +212,8 @@ static bool delete = false;
/* Use the complement of set1 in place of set1. */
static bool complement = false;
-/* When nonzero, this flag causes GNU tr to provide strict
- compliance with POSIX draft 1003.2.11.2. The POSIX spec
- says that when -d is used without -s, string2 (if present)
- must be ignored. Silently ignoring arguments is a bad idea.
- The default GNU behavior is to give a usage message and exit.
- Additionally, when this flag is nonzero, tr prints warnings
- on stderr if it is being used in a manner that is not portable.
- Applicable warnings are given by default, but are suppressed
- if the environment variable `POSIXLY_CORRECT' is set, since
- being POSIX conformant means we can't issue such messages.
- Warnings on the following topics are suppressed when this
- variable is nonzero:
- 1. Ambiguous octal escapes. */
-static bool posix_pedantic;
-
/* When tr is performing translation and string1 is longer than string2,
- POSIX says that the result is undefined. That gives the implementor
+ POSIX says that the result is unspecified. That gives the implementor
of a POSIX conforming version of tr two reasonable choices for the
semantics of this case.
@@ -314,7 +299,7 @@ Usage: %s [OPTION]... SET1 [SET2]\n\
Translate, squeeze, and/or delete characters from standard input,\n\
writing to standard output.\n\
\n\
- -c, --complement first complement SET1\n\
+ -c, -C, --complement first complement SET1\n\
-d, --delete delete characters in SET1, do not translate\n\
-s, --squeeze-repeats replace each input sequence of a repeated character\n\
that is listed in SET1 with a single occurrence\n\
@@ -475,6 +460,7 @@ unquote (char const *s, struct E_string
switch (s[i])
{
case '\\':
+ es->escaped[j] = true;
switch (s[i + 1])
{
case '\\':
@@ -523,15 +509,16 @@ unquote (char const *s, struct E_string
c = 8 * c + oct_digit;
++i;
}
- else if (!posix_pedantic)
+ else
{
/* A 3-digit octal number larger than \377 won't
fit in 8 bits. So we stop when adding the
next digit would put us over the limit and
give a warning about the ambiguity. POSIX
- isn't clear on this, but one person has said
- that in his interpretation, POSIX says tr
- can't even give a warning. */
+ isn't clear on this, and we interpret this
+ lack of clarity as meaning the resulting behavior
+ is undefined, which means we're allowed to issue
+ a warning. */
error (0, 0, _("warning: the ambiguous octal escape \
\\%c%c%c is being\n\tinterpreted as the 2-byte sequence \\0%c%c, `%c'"),
s[i], s[i + 1], s[i + 2],
@@ -541,20 +528,15 @@ unquote (char const *s, struct E_string
}
break;
case '\0':
- error (0, 0, _("invalid backslash escape at end of string"));
- return false;
-
+ /* POSIX seems to require that a trailing backslash must
+ stand for itself. Weird. */
+ es->escaped[j] = false;
+ i--;
+ c = '\\';
+ break;
default:
- if (posix_pedantic)
- {
- error (0, 0, _("invalid backslash escape `\\%c'"), s[i + 1]);
- return false;
- }
- else
- {
- c = s[i + 1];
- es->escaped[j] = true;
- }
+ c = s[i + 1];
+ break;
}
++i;
es->s[j++] = c;
@@ -1701,7 +1683,7 @@ main (int argc, char **argv)
atexit (close_stdout);
- while ((c = getopt_long (argc, argv, "cdst", long_options, NULL)) != -1)
+ while ((c = getopt_long (argc, argv, "cCdst", long_options, NULL)) != -1)
{
switch (c)
{
@@ -1709,6 +1691,7 @@ main (int argc, char **argv)
break;
case 'c':
+ case 'C':
complement = true;
break;
@@ -1734,8 +1717,6 @@ main (int argc, char **argv)
}
}
- posix_pedantic = (getenv ("POSIXLY_CORRECT") != NULL);
-
non_option_args = argc - optind;
translating = (non_option_args == 2 && !delete);
@@ -1764,7 +1745,7 @@ deleting and squeezing repeats"));
this deserves a fatal error, so that's the default. */
if ((delete && !squeeze_repeats) && non_option_args != 1)
{
- if (posix_pedantic && non_option_args == 2)
+ if (non_option_args == 2 && getenv ("POSIXLY_CORRECT"))
--non_option_args;
else
error (EXIT_FAILURE, 0,
@@ -1888,17 +1869,8 @@ without squeezing repeats"));
else if ((class_s1 == UL_LOWER && class_s2 == UL_LOWER)
|| (class_s1 == UL_UPPER && class_s2 == UL_UPPER))
{
- /* By default, GNU tr permits the identity mappings: from
- [:upper:] to [:upper:] and [:lower:] to [:lower:]. But
- when POSIXLY_CORRECT is set, those evoke diagnostics. */
- if (posix_pedantic)
- {
- error (EXIT_FAILURE, 0,
- _("\
-invalid identity mapping; when translating, any [:lower:] or [:upper:]\n\
-construct in string1 must be aligned with a corresponding construct\n\
-([:upper:] or [:lower:], respectively) in string2"));
- }
+ /* POSIX says the behavior of `tr "[:upper:]" "[:upper:]"'
+ is undefined. Treat it as a no-op. */
}
else
{
Index: tests/tr/Test.pm
===================================================================
RCS file: /home/meyering/coreutils/cu/tests/tr/Test.pm,v
retrieving revision 1.10
diff -p -u -r1.10 Test.pm
--- tests/tr/Test.pm 31 May 2004 12:17:49 -0000 1.10
+++ tests/tr/Test.pm 1 Jun 2004 22:37:05 -0000
@@ -68,7 +68,7 @@ my @tv = (
['y', '-d ' . q|'a-z'|, 'abc $code', ' $', 0],
['z', '-ds ' . q|'a-z' '$.'|, 'a.b.c $$$$code\\', '. $\\', 0],
-# Make sure that a-a is accepted, even though POSIX 1001.2 says it is illegal.
+# Make sure that a-a is accepted.
['range-a-a', q|'a-a' 'z'|, 'abc', 'zbc', 0],
#
['null', q|'a' ''''|, '', '', 1],
@@ -84,6 +84,8 @@ my @tv = (
['o-rep-2', q|'[b*010]cd' '[a*7]BC[x*]'|, 'bcd', 'BCx', 0],
['esc', q|'a\-z' 'A-Z'|, 'abc-z', 'AbcBC', 0],
+['bs-055', q|'a\055b' def|, "a\055b", 'def', 0],
+['bs-at-end', q|'\' x|, "\\", 'x', 0],
#
# From Ross
@@ -108,6 +110,7 @@ my @tv = (
['repeat-0', q|abc '[b*0]'|, 'abcd', 'bbbd', 0],
['repeat-000', q|abc '[b*00000000000000000000]'|, 'abcd', 'bbbd', 0],
['repeat-compl', '-c ' . q|'[a*65536]\n' '[b*]'|, 'abcd', 'abbb', 0],
+['repeat-Compl', '-C ' . q|'[a*65536]\n' '[b*]'|, 'abcd', 'abbb', 0],
);