From: Weeks, Nathan
Subject: Re: [bug-gawk] When RS is null, POSIX states \n should be in FS, gawk only does that if FS is single char
Date: Mon, 22 Apr 2019 23:59:24 +0000

While gawk & bwk awk exhibit the same behavior in both cases, mawk (default 
/usr/bin/awk version from Debian & Ubuntu, as well as the 1.3.4 20171017 
version I tested separately) doesn't add a newline as a field separator in 
either case (despite what POSIX states):


case #1


$ printf '1:2\n3\n' | docker run -i --rm debian:9.8-slim awk -F':' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
1/2 <1>
2/2 <2
3>

case #2

$ printf '1::2\n3\n' | docker run -i --rm debian:9.8-slim awk -F'::' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
1/2 <1>
2/2 <2
3>

For case #1, busybox 1.30.1 awk seems to implement the current wording of the
POSIX standard, namely that "records are separated by sequences consisting of a
<newline> plus one or more blank lines":

$ printf '1:2\n3\n' | docker run -i --rm busybox:1.30.1 awk -F':' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
1/4 <1>
2/4 <2>
3/4 <3>
4/4 <>

Here the input contains no sequence of two or more newline characters that
would make a record separator (or does a "blank line" also include a line that
comprises only <blank> characters?), and a single newline is a field separator.
I therefore think a literal interpretation of the current POSIX standard means
the last field is empty. (Any other interpretations?)
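
To make that literal reading concrete, here is a minimal Python sketch. It models only the POSIX wording as read above, not any particular awk implementation; the function names are hypothetical, and FS is treated as a literal string for simplicity (awk would treat a multi-character FS as a regexp).

```python
import re


def split_records_literal(data):
    """Records are separated by a newline plus one or more blank lines.

    Leading or trailing blank lines must not produce empty records,
    so empty pieces are dropped.
    """
    return [rec for rec in re.split(r"\n\n+", data) if rec]


def split_fields_literal(record, fs):
    """Split on FS, and always on newline, no matter what FS is."""
    return re.split(re.escape(fs) + "|\n", record)


def parse_literal(data, fs):
    """Return one list of fields per record."""
    return [split_fields_literal(rec, fs) for rec in split_records_literal(data)]


# Case #1: no blank line, so the final newline is a field separator
# and leaves an empty fourth field.
print(parse_literal("1:2\n3\n", ":"))    # [['1', '2', '3', '']]

# Case #3: the trailing newline plus blank line ends the record.
print(parse_literal("1:2\n3\n\n", ":"))  # [['1', '2', '3']]
```

Under these assumptions the sketch reproduces the field counts busybox awk prints for case #1 above and for the two-newline variant.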

If we add a second newline to make a newline + "blank line", then busybox awk 
does appear to interpret that as a record separator (call this case #3):

$ printf '1:2\n3\n\n' | docker run -i --rm busybox:1.30.1 awk -F':' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
1/3 <1>
2/3 <2>
3/3 <3>

I can't explain busybox awk's behavior for case #2, however:

$ printf '1::2\n3\n' | docker run -i --rm busybox:1.30.1 awk -F'::' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
1/2 <1>
2/2 <2>

So my initial interpretation is that:


  1.  POSIX should be changed to be consistent with gawk / BWK awk behavior for
both examples (how should that be worded, considering the busybox awk behavior
for case #1?)
  2.  busybox awk and mawk should be changed (where possible) to be consistent 
with the behavior standardized as a result of #1 (which may need to be worded 
loosely enough to allow some flexibility; while busybox awk seems malleable, 
the mawk version in Debian/Ubuntu seems less so).

Other thoughts?

--
Nathan

From: <address@hidden>
Date: Sun, Apr 21, 2019 at 8:25 AM
Subject: Re: [bug-gawk] When RS is null, POSIX states \n should be in FS, gawk only does that if FS is single char
To: <address@hidden>, <address@hidden>


Hi Ed.

[ BCC to some other awk maintainers, for their interest, and action
  if necessary. ]

Ed Morton <address@hidden> wrote:

> I just came across this where setting RS to null causes FS to include
> `\n` if FS is a single char but not otherwise:
>
>     $ printf '1:2\n3\n' | awk -F':' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
>     1/3 <1>
>     2/3 <2>
>     3/3 <3>
>
>     $ printf '1::2\n3\n' | awk -F'::' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
>     1/2 <1>
>     2/2 <2
>     3>
>
> with this gawk version:
>
>     $ awk --version
>     GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
>     Copyright (C) 1989, 1991-2018 Free Software Foundation.
>
> and that makes sense given the gawk documentation
> (https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line) which
> says (red/underline mine):
>
>     When RS is set to the empty string _and FS is set to a single
>     character_, the newline character always acts as a field separator.
>     This is in addition to whatever field separations result from FS.

This is how Unix awk has behaved since the dawn of time, and how
gawk behaves.  I'm not going to change gawk; see below.

> but the POSIX spec (http://pubs.opengroup.org/onlinepubs/9699919799/) says:
>
>     *RS*
>         The first character of the string value of *RS* shall be the
>         input record separator; a <newline> by default. If *RS* contains
>         more than one character, the results are unspecified. If *RS* is
>         null, then records are separated by sequences consisting of a
>         <newline> plus one or more blank lines, leading or trailing
>         blank lines shall not result in empty records at the beginning
>         or end of the input, and a <newline> shall always be a field
>         separator, no matter what the value of *FS* is.
>
> gawk behaves the way I described with or without the `--posix` flag.
> Shouldn't it add `\n` as a separator when RS is null regardless of the
> value of FS like POSIX says? FWIW OSX/BSD awk on MacOS behaves the same
> way that gawk does, idk about other awks.

The language in POSIX, "no matter what the value of FS is" has been there
since at least the 2004 standard. (I couldn't find anything older online).

In turn, that language is actually based on the Aho, Kernighan and Weinberger
book, pages 61 and 84, which say the same thing. (!)

As you note, it does imply that RS = "" should cause \n to be a separator
even if FS is a regexp.

HOWEVER, the code in Unix awk (see https://github.com/onetrueawk/awk)
is more like this:

        if (FS is a regexp)
                do regexp field splitting
        else if (FS is " ")
                split on ' ', '\t', and '\n'
        else {
                split on other single character value of FS
                if (RS is null)
                        also split on '\n'
        }

Gawk is essentially the same, although how the code works is different.
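
As a rough illustration of that branching, here is a small Python model. It is a sketch, not the C code of either implementation: the function name is hypothetical, a multi-character FS is handed to Python's `re` module (whose syntax differs from awk's regexps), and the record is assumed to arrive with its trailing newlines already stripped.

```python
import re


def split_fields_traditional(record, fs, rs):
    """Model of the field-splitting branches in the pseudocode above."""
    if record == "":
        return []
    if len(fs) > 1:
        # FS is a regexp: newline gets no special treatment,
        # even when RS is null.
        return re.split(fs, record)
    if fs == " ":
        # Default FS: fields are runs of characters other than
        # space, tab and newline.
        return re.findall(r"[^ \t\n]+", record)
    # Any other single-character FS.
    separators = {fs}
    if rs == "":
        # Only in this branch does a null RS add newline as a separator.
        separators.add("\n")
    fields, current = [], []
    for ch in record:
        if ch in separators:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# The two examples from the original report, with RS set to null:
print(split_fields_traditional("1:2\n3", ":", ""))    # ['1', '2', '3']
print(split_fields_traditional("1::2\n3", "::", ""))  # ['1', '2\n3']
```

With a single-character FS the newline splits off a third field; with the two-character FS it stays inside the second field, matching the outputs quoted above.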

Given that the existing practice dates back to at least 1987, over three
decades, I think that changing the code would be the wrong thing to do.

Instead, I will document this discrepancy, and work to get the standard
revised.

Thanks!

Arnold



