Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
From: arnold
Subject: Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
Date: Thu, 29 Feb 2024 08:48:42 -0700
User-agent: Heirloom mailx 12.5 7/5/10

I looked at this briefly, and the current behavior makes sense.
The extension temporarily replaces standard output's file descriptor
with one on a new temporary file.
But the rest of the stdio.h mechanics are left alone. So if stdout
was initially a tty, it remains line buffered; if it was a file,
it's block buffered.
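
To make the cost of that difference concrete, here is a minimal sketch in
Python's io layer (not C stdio and not gawk's code) that counts how many
low-level writes it takes to emit ten lines in each mode. The names
CountingRaw and count_writes are hypothetical, chosen for illustration,
and the 8192-byte buffer size is an assumption.

```python
import io


class CountingRaw(io.RawIOBase):
    """Raw sink that counts write() calls, standing in for write(2)."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1
        return len(data)  # pretend every byte was accepted


def count_writes(lines, line_buffered):
    """Return how many raw writes it takes to emit `lines`."""
    raw = CountingRaw()
    stream = io.TextIOWrapper(
        io.BufferedWriter(raw, buffer_size=8192),  # assumed block size
        encoding="utf-8",
        line_buffering=line_buffered,
    )
    for line in lines:
        stream.write(line + "\n")
    stream.flush()  # final flush, as happens when a stream is closed
    return raw.calls


if __name__ == "__main__":
    ten_lines = [str(n) for n in range(1, 11)]
    print("line buffered: ", count_writes(ten_lines, True))
    print("block buffered:", count_writes(ten_lines, False))
```

With ten short lines this should report one raw write per line in the
line-buffered case and a single raw write in the block-buffered case,
which is the same pattern as the strace call counts quoted below.
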
According to the setvbuf(3) man page, setvbuf() can only be called
before any I/O operations are done on a FILE *. I'm not sure if
it's safe to do so in the extension, but maybe it is.
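
As a rough analogue only, the descriptor swap and a later change of
buffering mode can be sketched with Python file objects. This is not
setvbuf() and says nothing about whether calling setvbuf() at that point
is safe in gawk: Python's reconfigure() may be called after I/O has been
done, whereas the man page restricts setvbuf(). The function name
swap_descriptor_demo and the file names are hypothetical, and restoring
the descriptor at the end is an assumption based on the replacement
being described as temporary.

```python
import os
import tempfile


def swap_descriptor_demo(workdir):
    """Swap the fd under a line-buffered stream, then change its buffering."""
    original_path = os.path.join(workdir, "original.txt")
    temp_path = os.path.join(workdir, "inplace.tmp")

    # Stand-in for a tty-backed stdout: explicitly line buffered (buffering=1).
    stream = open(original_path, "w", buffering=1)
    fd = stream.fileno()

    # Point the same descriptor number at a new file, keeping a copy of the old one.
    saved_fd = os.dup(fd)
    temp_fd = os.open(temp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    os.dup2(temp_fd, fd)
    os.close(temp_fd)

    # The stream object was never told about the swap, so its mode is unchanged.
    mode_after_swap = stream.line_buffering
    stream.write("line written during the swap\n")

    # Python analogue of changing the mode after the swap (not setvbuf itself).
    stream.reconfigure(line_buffering=False)
    mode_after_reconfigure = stream.line_buffering
    stream.flush()

    # Restore the original descriptor (assumed, since the swap is temporary).
    os.dup2(saved_fd, fd)
    os.close(saved_fd)
    stream.write("line written after restoring\n")
    stream.close()

    with open(temp_path) as f:
        temp_text = f.read()
    with open(original_path) as f:
        original_text = f.read()
    return mode_after_swap, mode_after_reconfigure, temp_text, original_text


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(swap_descriptor_demo(workdir))
```

The point of the sketch is only that the buffering mode lives in the
stream object, not in the descriptor, so swapping the descriptor alone
leaves line buffering in effect until something changes it explicitly.
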
I will poke at it a little bit; it's not a cut-and-dried easy fix.
Arnold
Ed Morton <mortoneccc@comcast.net> wrote:
> No problem. Trying again to post the strace output as it got mangled by
> something in transit last time:
>
> The SE answer I linked, https://unix.stackexchange.com/a/771263/133219,
> shows strace being used on gawk with a 10-line input file and there
> being 10 writes (same as number of input lines) when used without
> redirection (look at the "calls" column below):
>
> $ strace -e trace=write -c gawk -i inplace 1 somefile
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.000098           9        10           write
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.000098           9        10           total
>
> vs. 1 write when used with redirection:
>
> $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.000020          20         1           write
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.000020          20         1           total
>
>
>
> On 2/29/2024 8:47 AM, david kerns wrote:
> > sorry for doubting your due diligence
> >
> > On Thu, Feb 29, 2024 at 7:44 AM Ed Morton <mortoneccc@comcast.net> wrote:
> >
> > Yes, I tried the same with `sed` and there was no performance
> > difference between:
> >
> > No redirection:
> >
> > $ time { sed -i -n 'p' file; }
> >
> > real 0m0.027s
> > user 0m0.000s
> > sys 0m0.000s
> >
> > Redirection:
> >
> > $ time { sed -i -n 'p' file >/dev/null; }
> >
> > real 0m0.023s
> > user 0m0.000s
> > sys 0m0.000s
> >
> > The SE answer I linked,
> > https://unix.stackexchange.com/a/771263/133219, shows strace being
> > used on gawk with a 10-line input file and there being 10 writes
> > (same as number of input lines) when used without redirection
> > (look at the "calls" column below):
> >> $ strace -e trace=write -c gawk -i inplace 1 somefile
> >> % time     seconds  usecs/call     calls    errors syscall
> >> ------ ----------- ----------- --------- --------- ----------------
> >> 100.00    0.000098           9        10           write
> >> ------ ----------- ----------- --------- --------- ----------------
> >> 100.00    0.000098           9        10           total
> >
> > vs. 1 write when used with redirection:
> >
> >> $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
> >> % time     seconds  usecs/call     calls    errors syscall
> >> ------ ----------- ----------- --------- --------- ----------------
> >> 100.00    0.000020          20         1           write
> >> ------ ----------- ----------- --------- --------- ----------------
> >> 100.00    0.000020          20         1           total
> >
> > so buffering does seem likely to be the source of the time difference.
> >
> > Regards,
> >
> > Ed.
> >
> > On 2/29/2024 8:32 AM, david kerns wrote:
> >> glad you checked that...
> >> have you tried other commands? ... perhaps the closing of stdout by the
> >> shell before the fork/exec is causing it.
> >>
> >> On Thu, Feb 29, 2024 at 6:57 AM Ed Morton <mortoneccc@comcast.net> wrote:
> >>
> >>> David - that was 3rd-run timing to ensure caching wasn't the issue.
> >>>
> >>> Ed.
> >>>
> >>> On 2/29/2024 7:35 AM, david kerns wrote:
> >>>
> >>> swap the order (do the redirect one first); I suspect the input file
> >>> was still cached for the 2nd run
> >>>
> >>>
> >>> On Thu, Feb 29, 2024 at 5:52 AM Ed Morton <mortoneccc@comcast.net> wrote:
> >>>
> >>>
> >>> Someone on StackExchange was asking about their gawk script being slow
> >>> and someone else (https://unix.stackexchange.com/a/771263/133219)
> >>> pointed out that using `-i inplace` is an order of magnitude slower if
> >>> you don't also redirect stdout, which seems unintuitive at best.
> >>>
> >>> For example given a 1 million line input file created by:
> >>>
> >>> $ seq 1000000 > file1m
> >>>
> >>> and using:
> >>>
> >>> $ awk --version
> >>> GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)
> >>>
> >>> If we just reproduce it as-is using `-i inplace` the timing is:
> >>>
> >>> $ time { awk -i inplace '1' file1m; }
> >>>
> >>> real 0m2.544s
> >>> user 0m0.265s
> >>> sys 0m1.843s
> >>>
> >>> whereas if we redirect stdout even though there is no stdout produced:
> >>>
> >>> $ time { awk -i inplace '1' file1m >/dev/null; }
> >>>
> >>> real 0m0.236s
> >>> user 0m0.187s
> >>> sys 0m0.000s
> >>>
> >>> As you can see that second execution with stdout redirected ran an
> >>> order of magnitude faster. The person who investigated thinks it's
> >>> due to the first execution being considered "interactive" since stdout
> >>> isn't technically being redirected and so doing line buffering vs the
> >>> second execution being "non-interactive" due to stdout being
> >>> redirected and so using a larger buffer.
> >>>
> >>> If that is the case, could gawk be updated to consider "inplace"
> >>> editing as non-interactive? If not, I think it'd be worth a statement
> >>> in the manual about this difference in performance between the 2.
> >>>
> >>> Ed.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >