Re: sort -o x -o y
From: Paul Eggert
Subject: Re: sort -o x -o y
Date: 02 Sep 2003 16:03:29 -0700
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
Dan Jacobson <address@hidden> writes:
> $ echo a|sort -o x -o y
> $ ls
> y
POSIX allows this behavior, but it's admittedly weird.
I think that option order should not matter, unless POSIX or the
documentation explicitly says otherwise. So I propose the following
patch. While looking into this problem I noticed that sort's -t
option doesn't let you specify a NUL as a field separator (this is a
related issue since 'sort' uses 0 to represent "no option specified
yet"). Also, the documentation and usage strings incorrectly say
"white space" in several places where they should say "blanks". Here's
a patch for these problems.
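The NUL problem can be shown with a minimal sketch, which is not the patch itself. If 0 means "no -t given", a NUL separator cannot be told apart from the default, so the sentinel has to lie outside the range of char. The helper name parse_tab and its return convention are assumptions made for this example; only TAB_DEFAULT and the '\0' spelling come from the patch.

```c
#include <limits.h>
#include <string.h>

/* Sentinel that no char value can equal; it means "no -t given". */
enum { TAB_DEFAULT = CHAR_MAX + 1 };

/* Hypothetical helper: parse the argument of -t.
   On success, store the separator in *out and return 0.
   Return -1 for an empty argument or a multi-character argument
   other than the two-character string \0.  */
static int
parse_tab (const char *arg, int *out)
{
  if (arg[0] == '\0')
    return -1;                  /* -t '' is rejected */
  if (arg[1] == '\0')
    {
      *out = arg[0];            /* ordinary one-character separator */
      return 0;
    }
  if (strcmp (arg, "\\0") == 0)
    {
      *out = '\0';              /* backslash followed by 0 selects NUL */
      return 0;
    }
  return -1;                    /* e.g. -txx */
}
```

Because the variable holding the separator is an int that starts out as TAB_DEFAULT, a stored value of 0 now unambiguously means "NUL was requested".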
2003-09-02 Paul Eggert <address@hidden>
* NEWS: sort -t '\0' now uses a NUL tab.
sort option order no longer matters, unless POSIX requires it.
* doc/coreutils.texi (sort invocation): -d now overrides -i.
"whitespace" -> "blanks"; "whitespace" isn't correct.
-t '\0' now specifies a NUL tab.
* src/sort.c (usage): Say "blanks" instead of "whitespace".
Similar fixes for many comments.
(TAB_DEFAULT): New constant, so that we can support NUL as
the field separator.
(tab): Now int, not char. Initialize to TAB_DEFAULT.
(specify_sort_size): If multiple sizes are specified, use the largest.
(begfield, limfield): Support NUL tab char.
(set_ordering): Do not let -i override -d.
(main): Report an error if incompatible -o or -t options are given.
Report an error for "-t ''". Allow "-t '\0'" to specify a NUL tab.
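As an illustration of the order-independence rule in the entries above, here is a small sketch under stated assumptions; it is not the code in src/sort.c. The struct sort_opts and both function names are hypothetical, and NULL stands in for the "no -o seen yet" state, which the real program represents differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option state, for illustration only. */
struct sort_opts
{
  const char *outfile;          /* NULL until a -o option is seen */
  size_t sort_size;             /* 0 until a -S option is seen */
};

/* Record a -o argument.  Repeating the same file is harmless; naming a
   different file is a conflict, reported as -1, whichever came first.  */
static int
set_outfile (struct sort_opts *o, const char *arg)
{
  if (o->outfile && strcmp (o->outfile, arg) != 0)
    return -1;
  o->outfile = arg;
  return 0;
}

/* Record a -S argument.  Keeping the maximum makes the result the same
   for every ordering of the options.  */
static void
set_sort_size (struct sort_opts *o, size_t n)
{
  if (n > o->sort_size)
    o->sort_size = n;
}
```

With this shape, "-o x -o y" and "-o y -o x" both fail, and "-S 10 -S 100" gives the same buffer size as "-S 100 -S 10".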
Index: NEWS
===================================================================
RCS file: /cvsroot/coreutils/coreutils/NEWS,v
retrieving revision 1.124
diff -p -u -r1.124 NEWS
--- NEWS 27 Aug 2003 09:18:28 -0000 1.124
+++ NEWS 2 Sep 2003 22:50:50 -0000
@@ -13,6 +13,12 @@ GNU coreutils NEWS
timestamps to their full nanosecond resolution; microsecond
resolution is the best we can do right now.
+ sort now supports the zero byte (NUL) as a field separator; use -t '\0'.
+ The -t '' option, which formerly had no effect, is now an error.
+
+ sort option order no longer matters for the options -S, -d, -i, -o, and -t.
+ Stronger options override weaker, and incompatible options are diagnosed.
+
** Bug fixes
stat no longer overruns a buffer for format strings ending in `%'
Index: doc/coreutils.texi
===================================================================
RCS file: /cvsroot/coreutils/coreutils/doc/coreutils.texi,v
retrieving revision 1.130
diff -p -u -r1.130 coreutils.texi
--- doc/coreutils.texi 17 Aug 2003 17:10:25 -0000 1.130
+++ doc/coreutils.texi 2 Sep 2003 22:51:09 -0000
@@ -2969,6 +2969,8 @@ converting to floating point.
@vindex LC_CTYPE
Ignore nonprinting characters.
The @env{LC_CTYPE} locale determines character types.
+This option has no effect if the stronger @option{--dictionary-order}
+(@option{-d}) option is also given.
@item -M
@itemx --month-sort
@@ -2976,7 +2978,7 @@ The @env{LC_CTYPE} locale determines cha
@opindex --month-sort
@cindex months, sorting by
@vindex LC_TIME
-An initial string, consisting of any amount of whitespace, followed
+An initial string, consisting of any amount of blanks, followed
by a month name abbreviation, is folded to UPPER case and
compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
Invalid names compare low to valid names. The @env{LC_TIME} locale
@@ -2989,7 +2991,7 @@ category determines the month spellings.
@cindex numeric sort
@vindex LC_NUMERIC
Sort numerically: the number begins each line; specifically, it consists
-of optional whitespace, an optional @samp{-} sign, and zero or more
+of optional blanks, an optional @samp{-} sign, and zero or more
digits possibly separated by thousands separators, optionally followed
by a decimal-point character and zero or more digits. The @env{LC_NUMERIC}
locale specifies the decimal-point character and thousands separator.
@@ -3085,7 +3087,7 @@ than @var{size}.
@cindex field separator character
Use character @var{separator} as the field separator when finding the
sort keys in each line. By default, fields are separated by the empty
-string between a non-whitespace character and a whitespace character.
+string between a non-blank character and a blank character.
That is, given the input line @w{@samp{ foo bar}}, @command{sort} breaks it
into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
not considered to be part of either the field preceding or the field
@@ -3093,6 +3095,10 @@ following. But note that sort fields th
as @option{-k 2}, or sort fields consisting of a range, as @option{-k 2,3},
retain the field separators present between the endpoints of the range.
+To specify a zero byte (@acronym{ASCII} @sc{nul} (Null) character) as
+the field separator, use the two-character string @samp{\0}, e.g.,
+@samp{sort -t '\0'}.
+
@item -T @var{tempdir}
@itemx --temporary-directory=@var{tempdir}
@opindex -T
@@ -3218,7 +3224,7 @@ field-end part of the key specifier.
@item
Sort the password file on the fifth field and ignore any
-leading white space. Sort lines with equal values in field five
+leading blanks. Sort lines with equal values in field five
on the numeric user ID in field three.
@example
@@ -3242,7 +3248,7 @@ The use of @option{-print0}, @option{-z}
that pathnames that contain Line Feed characters will not get broken up
by the sort operation.
-Finally, to ignore both leading and trailing white space, you
+Finally, to ignore both leading and trailing blanks, you
could have applied the @samp{b} modifier to the field-end specifier
for the first key,
Index: src/sort.c
===================================================================
RCS file: /cvsroot/coreutils/coreutils/src/sort.c,v
retrieving revision 1.267
diff -p -u -r1.267 sort.c
--- src/sort.c 4 Aug 2003 08:55:44 -0000 1.267
+++ src/sort.c 2 Sep 2003 22:56:17 -0000
@@ -146,8 +146,8 @@ struct keyfield
size_t echar; /* Additional characters in field. */
bool const *ignore; /* Boolean array of characters to ignore. */
char const *translate; /* Translation applied to characters. */
- bool skipsblanks; /* Skip leading white space at start. */
- bool skipeblanks; /* Skip trailing white space at finish. */
+ bool skipsblanks; /* Skip leading blanks at start. */
+ bool skipeblanks; /* Skip trailing blanks at finish. */
bool numeric; /* Flag for numeric comparison. Handle
strings of digits with optional decimal
point, but no exponential notation. */
@@ -173,7 +173,7 @@ char *program_name;
internally, but doing this with good performance is a bit
tricky. */
-/* Table of white space. */
+/* Table of blanks. */
static bool blanks[UCHAR_LIM];
/* Table of non-printing characters. */
@@ -243,10 +243,13 @@ static bool reverse;
they were read if all keys compare equal. */
static bool stable;
-/* Tab character separating fields. If NUL, then fields are separated
- by the empty string between a non-whitespace character and a whitespace
+/* If TAB has this value, blanks separate fields. */
+enum { TAB_DEFAULT = CHAR_MAX + 1 };
+
+/* Tab character separating fields. If TAB_DEFAULT, then fields are
+ separated by the empty string between a non-blank character and a blank
character. */
-static char tab;
+static int tab = TAB_DEFAULT;
/* Flag to remove consecutive duplicate lines from the output.
Only the last of a sequence of equal lines will be output. */
@@ -305,7 +308,7 @@ Other options:\n\
-S, --buffer-size=SIZE use SIZE for main memory buffer\n\
"), stdout);
printf (_("\
- -t, --field-separator=SEP use SEP instead of non- to whitespace transition\n\
+ -t, --field-separator=SEP use SEP instead of non-blank to blank transition\n\
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or %s\n\
multiple options specify multiple directories\n\
-u, --unique with -c: check for strict ordering\n\
@@ -618,6 +621,11 @@ specify_sort_size (char const *s)
if (e == LONGINT_OK)
{
+ /* If multiple sort sizes are specified, take the maximum, so
+ that option order does not matter. */
+ if (n < sort_size)
+ return;
+
sort_size = n;
if (sort_size == n)
{
@@ -769,7 +777,7 @@ begfield (const struct line *line, const
/* The leading field separator itself is included in a field when -t
is absent. */
- if (tab)
+ if (tab != TAB_DEFAULT)
while (ptr < lim && sword--)
{
while (ptr < lim && *ptr != tab)
@@ -817,7 +825,7 @@ limfield (const struct line *line, const
`beginning' is the first character following the delimiting TAB.
Otherwise, leave PTR pointing at the first `blank' character after
the preceding field. */
- if (tab)
+ if (tab != TAB_DEFAULT)
while (ptr < lim && eword--)
{
while (ptr < lim && *ptr != tab)
@@ -866,7 +874,7 @@ limfield (const struct line *line, const
*/
/* Make LIM point to the end of (one byte past) the current field. */
- if (tab)
+ if (tab != TAB_DEFAULT)
{
char *newlim;
newlim = memchr (ptr, tab, lim - ptr);
@@ -2159,7 +2167,10 @@ set_ordering (register const char *s, st
key->general_numeric = true;
break;
case 'i':
- key->ignore = nonprinting;
+ /* Option order should not matter, so don't let -i override
+ -d. -d implies -i, but -i does not imply -d. */
+ if (! key->ignore)
+ key->ignore = nonprinting;
break;
case 'M':
key->month = true;
@@ -2428,6 +2439,8 @@ main (int argc, char **argv)
break;
case 'o':
+ if (outfile != minus && strcmp (outfile, optarg) != 0)
+ error (SORT_FAILURE, 0, _("multiple output files specified"));
outfile = optarg;
break;
@@ -2440,15 +2453,28 @@ main (int argc, char **argv)
break;
case 't':
- tab = optarg[0];
- if (tab && optarg[1])
- {
- /* Provoke with `sort -txx'. Complain about
- "multi-character tab" instead of "multibyte tab", so
- that the diagnostic's wording does not need to be
- changed once multibyte characters are supported. */
- error (SORT_FAILURE, 0, _("multi-character tab `%s'"), optarg);
- }
+ {
+ int newtab = optarg[0];
+ if (! newtab)
+ error (SORT_FAILURE, 0, _("empty tab"));
+ if (optarg[1])
+ {
+ if (strcmp (optarg, "\\0") == 0)
+ newtab = '\0';
+ else
+ {
+ /* Provoke with `sort -txx'. Complain about
+ "multi-character tab" instead of "multibyte tab", so
+ that the diagnostic's wording does not need to be
+ changed once multibyte characters are supported. */
+ error (SORT_FAILURE, 0, _("multi-character tab `%s'"),
+ optarg);
+ }
+ }
+ if (tab != TAB_DEFAULT && tab != newtab)
+ error (SORT_FAILURE, 0, _("incompatible tabs"));
+ tab = newtab;
+ }
break;
case 'T':