From: ED MORTON
Subject: Re: [bug-gawk] When RS is null, POSIX states \n should be in FS, gawk only does that if FS is single char
Date: Sun, 21 Apr 2019 07:56:02 -0500 (CDT)

Arnold - sounds good, thanks for looking into it and I think that's the best
approach since with the way gawk works today I can do
$ printf '1:2\n3\n' | awk -F'[:]' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
1/2 <1>
2/2 <2
3>
whenever I want a single char FS without "\n" getting added. If I couldn't do
the above then having to REMOVE an automatically added "\n" when I really only
wanted ":" as the FS would be a significant pain (off the top of my head I
actually can't think of a way to do it that doesn't involve converting "\n"s in
the record to some control-char that I just hope isn't in the input and then
re-splitting the record and then restoring the "\n"s or manually writing a
"while(index($0,FS))substr()s" loop or similar to identify the fields!).
To be honest, I'd rather awk simply NEVER added "\n" to FS when RS="", since
it's non-intuitive that that would happen and it's trivial to add "\n" to FS
myself if I want it. But I don't expect that behavior to change now, and the way
gawk works today provides a simple workaround, so I agree that just documenting
the way gawk works and trying to get the standard changed is the way to go.
Ed.
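
For illustration, here is a minimal sketch of the "add \n to FS yourself" option mentioned above. It assumes an awk that treats a multi-character FS as a regular expression and processes the `\n` escape in the `-F` argument (gawk does). The variable name `out` is only for this example.

```shell
# Sketch: list "\n" explicitly in a bracket-expression FS, so the split
# does not rely on awk adding newline implicitly when RS is null.
out=$(printf '1:2\n3\n' |
  awk -F'[:\n]' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}')
printf '%s\n' "$out"
# prints:
# 1/3 <1>
# 2/3 <2>
# 3/3 <3>
```

Dropping the `\n` from the bracket expression gives the `-F'[:]'` form shown earlier, where the newline stays inside the last field under gawk.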
> On April 21, 2019 at 6:25 AM address@hidden wrote:
>
>
> Hi Ed.
>
> [ BCC to some other awk maintainers, for their interest, and action
> if necessary. ]
>
> Ed Morton <address@hidden> wrote:
>
> > I just came across this where setting RS to null causes FS to include
> > `\n` if FS is a single char but not otherwise:
> >
> > $ printf '1:2\n3\n' | awk -F':' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
> > 1/3 <1>
> > 2/3 <2>
> > 3/3 <3>
> >
> > $ printf '1::2\n3\n' | awk -F'::' -v RS= '{for (i=1; i<=NF; i++) print i"/"NF, "<"$i">"}'
> > 1/2 <1>
> > 2/2 <2
> > 3>
> >
> > with this gawk version:
> >
> > $ awk --version
> > GNU Awk 4.2.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.1.2)
> > Copyright (C) 1989, 1991-2018 Free Software Foundation.
> >
> > and that makes sense given the gawk documentation
> > (https://www.gnu.org/software/gawk/manual/gawk.html#Multiple-Line) which
> > says (red/underline mine):
> >
> > When RS is set to the empty string _and FS is set to a single
> > character_, the newline character always acts as a field separator.
> > This is in addition to whatever field separations result from FS.
>
> This is how Unix awk has behaved since the dawn of time, and how
> gawk behaves. I'm not going to change gawk; see below.
>
> > but the POSIX spec (http://pubs.opengroup.org/onlinepubs/9699919799/) says:
> >
> > *RS*
> > The first character of the string value of *RS* shall be the
> > input record separator; a <newline> by default. If *RS* contains
> > more than one character, the results are unspecified. If *RS* is
> > null, then records are separated by sequences consisting of a
> > <newline> plus one or more blank lines, leading or trailing
> > blank lines shall not result in empty records at the beginning
> > or end of the input, and a <newline> shall always be a field
> > separator, no matter what the value of *FS* is.
> >
> > gawk behaves the way I described with or without the `--posix` flag.
> > Shouldn't it add `\n` as a separator when RS is null regardless of the
> > value of FS like POSIX says? FWIW OSX/BSD awk on MacOS behaves the same
> > way that gawk does, idk about other awks.
>
> The language in POSIX, "no matter what the value of FS is" has been there
> since at least the 2004 standard. (I couldn't find anything older online).
>
> In turn, that language is actually based on the Aho, Kernighan and Weinberger
> book, pages 61 and 84, which say the same thing. (!)
>
> As you note, it does imply that RS = "" should cause \n to be a separator
> even if FS is a regexp.
>
> HOWEVER, the code in Unix awk (see https://github.com/onetrueawk/awk)
> is more like this:
>
>     if (FS is a regexp)
>         do regexp field splitting
>     else if (FS is " ")
>         split on ' ', '\t', and '\n'
>     else {
>         split on other single character value of FS
>         if (RS is null)
>             also split on '\n'
>     }
>
> Gawk is essentially the same, although how the code works is different.
>
> Given that the existing practice dates back to at least 1987, over three
> decades, I think that changing the code would be the wrong thing to do.
>
> Instead, I will document this discrepancy, and work to get the standard
> revised.
>
> Thanks!
>
> Arnold
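
Since the thread leaves open how other awks behave, the following is a small sketch for probing whichever `awk` is on the PATH. It runs the same two-line record in paragraph mode with a single-character FS, a regexp FS, and a regexp FS that lists the newline explicitly. The helper name `probe_nf` is hypothetical, and the result for the regexp case is implementation-dependent: the thread reports 2 fields for gawk 4.2.1, while an awk that follows the POSIX wording literally would report 3.

```shell
# probe_nf FS: print the NF that awk reports for the record "1:2\n3"
# when RS is null (paragraph mode) and FS is the given value.
probe_nf() {
  printf '1:2\n3\n' | awk -F"$1" -v RS= '{ print NF }'
}

single=$(probe_nf ':')        # single-character FS
regexp=$(probe_nf '[:]')      # regexp FS matching only ":"
explicit=$(probe_nf '[:\n]')  # regexp FS with newline listed explicitly

printf '%-8s NF=%s\n' ':' "$single"
printf '%-8s NF=%s\n' '[:]' "$regexp"
printf '%-8s NF=%s\n' '[:\n]' "$explicit"
```

If the second line reports fewer fields than the first, the awk under test follows the Unix awk logic described in the quoted pseudocode rather than the POSIX text.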