bug-coreutils

bug#19240: cut 8.22 adds newline


From: Bob Proulx
Subject: bug#19240: cut 8.22 adds newline
Date: Thu, 4 Dec 2014 10:48:38 -0700
User-agent: Mutt/1.5.23 (2014-03-12)

Eric Blake wrote:
> I'll leave it to other contributors to weigh in on whether omitting
> the final newline on output when it was missing on input is worth
> the complexity of a change.

> Pádraig Brady wrote:
> > If we were just implementing now, I'd not output the extra '\n',
> > but changing at this stage needs to be carefully considered,
> > and with all the textutils, not just cut(1).
> 
> I tend to go the opposite - producing text output, even on non-text
> input, is more likely to be useful when piping files to other utilities
> that don't handle non-text files as gracefully as the coreutils.  But I
> definitely agree that it is not something we change lightly.

I have these thoughts and comments to make.

1. I don't "like" input file lines that don't have trailing newlines.
It raises the question of whether the input is actually valid input.
It feels to me like any line missing a newline is incomplete.  There
is likely to have been an error in the creation of it.  Handling it
silently feels like ignoring the error.  But raising an actual error
by exit code or by emitting a warning or error message feels too heavy
handed.  I would lean toward assuming that any incomplete input line
is actually terminated by a newline as the lesser of the evils.

2. The suggestion for handling *fields* that do not end with a
trailing newline differently from those that do doesn't make any sense
to me at all.  What is a field?  Is the newline part of the field?  I
think not.  Consider this.

  $ printf "one two" | awk '{print$1}'
  one

  $ printf "one two" | awk '{print$2}'
  two

  $ printf "one two\n" | awk '{print$1}'
  one

  $ printf "one two\n" | awk '{print$2}'
  two

The newline is not part of field two.  Otherwise printing it in the
second pair of examples would result in two newlines being output.

  $ printf "one two" | cut -d' ' -f1
  one

  $ printf "one two" | cut -d' ' -f2
  two

  $ printf "one two\n" | cut -d' ' -f1
  one

  $ printf "one two\n" | cut -d' ' -f2
  two

Same thing for cut.  The newline is not part of any of the fields.
The newline terminates the input line.  The newline is not associated
with any of the delimited fields contained in an input line.

Byte or character operations in the utils, such as head -c, are
binary operations and should be interpreted strictly according to
the bytes.  But not cut -c, which is column based.
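
To make the distinction concrete, here is a small sketch that counts
the bytes each tool writes for the same unterminated input.  The
count_bytes helper is only an illustrative name, and the count shown
for cut assumes GNU cut as discussed in this thread; other
implementations may differ.

```shell
# Hypothetical helper: print the number of bytes read on stdin,
# with the padding that some wc implementations add removed.
count_bytes() {
  wc -c | tr -d ' '
}

# head -c copies exactly the requested bytes and appends nothing.
printf 'one two' | head -c 3 | count_bytes

# cut -c selects columns from a line, so its output is a line.
# With GNU cut this is the three characters plus a newline.
printf 'one two' | cut -c1-3 | count_bytes
```

The first count is the number of bytes asked for.  The second is one
larger wherever cut terminates the output line.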

John Kendall wrote:
> # Solaris cut
> $ printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
> 1
> 12
> 123
> 1234
> 1234
> 1234$

That is tickling non-portable behavior.  I had a friend run some tests
on HP-UX and IBM AIX and the results there were different from
Solaris.  Seems Solaris is already the unusual case.

When looking, count the "1234" lines carefully, because HP-UX and
older AIX don't process the line without a trailing newline at all.
It is omitted there.  Newer AIX appears to handle it like GNU.

  # uname -srm
  HP-UX B.10.20 9000/785
  # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
  1
  12
  123
  1234
  1234
  #

  # uname -srm
  HP-UX B.11.31 ia64
  # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
  1
  12
  123
  1234
  1234
  #

  # uname -s ; oslevel
  AIX
  4.3.3.0
  # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
  1
  12
  123
  1234
  1234
  #

  # uname -s ; oslevel
  AIX
  7.1.0.0
  # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
  1
  12
  123
  1234
  1234
  1234
  #

  # head -1 /etc/motd ; uname -m
  Compaq Tru64 UNIX V5.0A
  alpha
  # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
  1
  12
  123
  1234
  1234
  #

  # uname -s
  Darwin
  # printf "1\n12\n123\n1234\n12345\n123456" | cut -c1-4
  1
  12
  123
  1234
  1234
  1234
  #

Using input lines without a trailing newline is already a minefield of
portability problems.  It depends upon details of the implementation.

I think what Solaris cut must be doing is processing the emission of
characters across the line character by character.  When it hits the
input newline it knows it is done and emits a newline itself and
starts again on a new line.  When it hits EOF on the input it probably
just stops doing anything and exits itself without printing anything
more and therefore not emitting a newline.  Likely just an accident of
implementation.

This is what makes "lines" without a newline such an unportable thing
to count upon.  It makes the result depend upon an implementation
detail.  Different implementations might do different things.  And in
fact different ones actually do different things.  This probably isn't
too widespread of an issue or it would have come up more often.  And
more specific to the Solaris code port, there would be similar but
different problems if trying to use other legacy Unix platforms.  Best
to avoid the construct entirely for robust operation.
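
One way to avoid the construct, sketched here under the assumption
that the input is a regular text file that the script is allowed to
modify, is to check the last byte and append the missing newline
before handing the file to cut.  The function name is made up for
illustration.

```shell
# Hypothetical helper: succeeds if the last byte of the file "$1"
# is not a newline.  Command substitution strips a trailing newline,
# so the result is empty when the file is empty or properly terminated.
missing_final_newline() {
  [ -n "$(tail -c 1 "$1")" ]
}

tmp=$(mktemp)
printf '1\n12\n123456' > "$tmp"

# Terminate the last line so that every cut sees complete lines.
if missing_final_newline "$tmp"; then
  printf '\n' >> "$tmp"
fi

cut -c1-4 "$tmp"
rm -f "$tmp"
```

With the final line terminated, the implementations shown above
should no longer have an incomplete line to disagree about.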

> I came upon this while porting scripts from Solaris 10 to Centos 7.

Can you share with us the specific construct that caused this to
arise?  I have done a lot of script porting to and from HP-UX systems
and am curious as to the issue.

Bob




