bug-gawk

From: arnold
Subject: Re: gawk -i inplace is an order of magnitude faster when also redirecting stdout
Date: Thu, 29 Feb 2024 08:48:42 -0700
User-agent: Heirloom mailx 12.5 7/5/10

I looked at this briefly, and the current behavior makes sense.
The inplace extension temporarily replaces standard
output's file descriptor with one on a new temporary file.
But the rest of the stdio.h mechanics are left alone. So if stdout
was initially a tty, it remains line buffered; if it was a file,
it's block buffered.
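
A minimal sketch of the effect described above, assuming a POSIX system. It is not gawk's code: the helper name `bytes_visible_after_one_line` and the temporary file names are made up for illustration. A stream is given a buffering mode, its descriptor is swapped for a temporary file with dup2(), and one line is written without flushing. The buffering mode belongs to the `FILE *`, so it survives the swap.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for mkstemp(), fdopen(), fileno() */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not gawk code): open a stream with buffering
 * `mode`, swap a temporary file in underneath it with dup2(), write one
 * line WITHOUT flushing, and return how many bytes reached the temp file.
 * Returns -1 if the setup fails.
 */
long bytes_visible_after_one_line(int mode)
{
    char orig_path[] = "/tmp/bufdemo_orig_XXXXXX";
    char temp_path[] = "/tmp/bufdemo_temp_XXXXXX";
    struct stat st;
    long result = -1;

    int orig_fd = mkstemp(orig_path);   /* stands in for the original stdout */
    if (orig_fd < 0)
        return -1;
    unlink(orig_path);

    int temp_fd = mkstemp(temp_path);   /* stands in for the inplace temp file */
    if (temp_fd < 0) {
        close(orig_fd);
        return -1;
    }
    unlink(temp_path);

    FILE *fp = fdopen(orig_fd, "w");
    if (fp == NULL) {
        close(orig_fd);
        close(temp_fd);
        return -1;
    }

    /* _IOLBF mimics stdout on a tty, _IOFBF mimics stdout on a file. */
    setvbuf(fp, NULL, mode, BUFSIZ);

    /* Replace the descriptor under the stream; stdio is not told about it. */
    if (dup2(temp_fd, fileno(fp)) >= 0) {
        fputs("hello\n", fp);           /* deliberately no fflush() */
        if (fstat(temp_fd, &st) == 0)
            result = (long) st.st_size;
    }

    fclose(fp);
    close(temp_fd);
    return result;
}

#ifdef DEMO_MAIN
int main(void)
{
    printf("line buffered:  %ld bytes visible before flush\n",
           bytes_visible_after_one_line(_IOLBF));
    printf("fully buffered: %ld bytes visible before flush\n",
           bytes_visible_after_one_line(_IOFBF));
    return 0;
}
#endif
```

Compiled with `-DDEMO_MAIN`, the line-buffered case should show the whole line already in the temporary file, while the fully buffered case should show nothing until the stream is flushed or closed. That is the same per-line versus single-write difference seen in the strace output quoted below.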

According to the setvbuf(3) man page, setvbuf() can only be called
before any I/O operations are done on a FILE *.  I'm not sure if
it's safe to do so in the extension, but maybe it is.
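
A rough sketch of what such a call could look like, assuming a POSIX system. This is not the extension's actual code, and `redirect_fully_buffered` is a hypothetical name. ISO C only defines setvbuf() when it is called before any other operation on the stream, so using it on a stream that has already produced output depends on the C library tolerating that, which is the open question here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno() */
#endif
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not gawk's actual code): point `fp` at `new_fd`
 * and request full buffering. Returns 0 on success, -1 on failure.
 */
int redirect_fully_buffered(FILE *fp, int new_fd)
{
    /* Drain anything still buffered for the old target first. */
    if (fflush(fp) != 0)
        return -1;

    /* Swap the descriptor underneath the stream. */
    if (dup2(new_fd, fileno(fp)) < 0)
        return -1;

    /*
     * Assumption: the C library accepts a buffering change at this point.
     * ISO C only guarantees setvbuf() before other stream operations.
     */
    return setvbuf(fp, NULL, _IOFBF, BUFSIZ) == 0 ? 0 : -1;
}

#ifdef DEMO_MAIN
int main(void)
{
    FILE *target = tmpfile();   /* stands in for the inplace temp file */
    if (target == NULL)
        return 1;
    if (redirect_fully_buffered(stdout, fileno(target)) != 0)
        return 1;
    puts("this line goes to the temporary file, not the terminal");
    return fflush(stdout) == 0 ? 0 : 1;
}
#endif
```

A real fix would also have to restore the original descriptor and buffering mode once in-place editing of a file is finished, which this sketch leaves out.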

I will poke at it a little bit; it's not a cut-and-dried easy fix.

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

> No problem. Trying again to post the strace output as it got mangled by 
> something in transit last time:
>
> The SE answer I linked, https://unix.stackexchange.com/a/771263/133219, 
> shows strace being used on gawk with a 10-line input file and there 
> being 10 writes (same as number of input lines) when used without 
> redirection (look at the "calls" column below):
>
>     $ strace -e trace=write -c gawk -i inplace 1 somefile
>     % time     seconds  usecs/call     calls    errors syscall
>     ------ ----------- ----------- --------- --------- ----------------
>     100.00    0.000098           9        10           write
>     ------ ----------- ----------- --------- --------- ----------------
>     100.00    0.000098           9        10           total
>
> vs 1 write when used with redirection:
>
>     $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
>     % time     seconds  usecs/call     calls    errors syscall
>     ------ ----------- ----------- --------- --------- ----------------
>     100.00    0.000020          20         1           write
>     ------ ----------- ----------- --------- --------- ----------------
>     100.00    0.000020          20         1           total
>
>
>
> On 2/29/2024 8:47 AM, david kerns wrote:
> > sorry for doubting your due diligence
> >
> > On Thu, Feb 29, 2024 at 7:44 AM Ed Morton <mortoneccc@comcast.net> wrote:
> >
> >     Yes, I tried the same with `sed` and there was no performance
> >     difference between:
> >
> >     No redirection:
> >
> >         $ time { sed -i -n 'p' file; }
> >
> >         real    0m0.027s
> >         user    0m0.000s
> >         sys     0m0.000s
> >
> >     Redirection:
> >
> >         $ time { sed -i -n 'p' file >/dev/null; }
> >
> >         real    0m0.023s
> >         user    0m0.000s
> >         sys     0m0.000s
> >
> >     The SE answer I linked,
> >     https://unix.stackexchange.com/a/771263/133219, shows strace being
> >     used on gawk with a 10-line input file and there being 10 writes
> >     (same as number of input lines) when used without redirection
> >     (look at the "calls" column below):
> >
> >         $ strace -e trace=write -c gawk -i inplace 1 somefile
> >         % time     seconds  usecs/call     calls    errors syscall
> >         ------ ----------- ----------- --------- --------- ----------------
> >         100.00    0.000098           9        10           write
> >         ------ ----------- ----------- --------- --------- ----------------
> >         100.00    0.000098           9        10           total
> >
> >     vs 1 write when used with redirection:
> >
> >         $ strace -e trace=write -c gawk -i inplace 1 somefile > /dev/null
> >         % time     seconds  usecs/call     calls    errors syscall
> >         ------ ----------- ----------- --------- --------- ----------------
> >         100.00    0.000020          20         1           write
> >         ------ ----------- ----------- --------- --------- ----------------
> >         100.00    0.000020          20         1           total
> >
> >     so buffering does seem likely to be the source of the time difference.
> >
> >     Regards,
> >
> >         Ed.
> >
> >     On 2/29/2024 8:32 AM, david kerns wrote:
> >>     glad you checked that...
> >>     have you tried other commands? ... perhaps the closing of stdout by the
> >>     shell before the fork/exec is causing it.
> >>
> >>     On Thu, Feb 29, 2024 at 6:57 AM Ed Morton <mortoneccc@comcast.net> wrote:
> >>
> >>>     David - that was 3rd-run timing to ensure caching wasn't the issue.
> >>>
> >>>          Ed.
> >>>
> >>>     On 2/29/2024 7:35 AM, david kerns wrote:
> >>>
> >>>     swap the order (do the redirect one first) I suspect the input file
> >>>     was still cached for the 2nd run
> >>>
> >>>
> >>>     On Thu, Feb 29, 2024 at 5:52 AM Ed Morton <mortoneccc@comcast.net> wrote:
> >>>
> >>>
> >>>     Someone on StackExchange was asking about their gawk script being slow
> >>>     and someone else (https://unix.stackexchange.com/a/771263/133219)
> >>>     pointed out that using `-i inplace` is an order of magnitude slower if
> >>>     you don't also redirect stdout which seems unintuitive at best.
> >>>
> >>>     For example given a 1 million line input file created by:
> >>>
> >>>          $ seq 1000000 > file1m
> >>>
> >>>     and using:
> >>>
> >>>          $ awk --version
> >>>          GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)
> >>>
> >>>     If we just reproduce it as-is using `-i inplace` the timing is:
> >>>
> >>>          $ time { awk -i inplace '1' file1m; }
> >>>
> >>>          real    0m2.544s
> >>>          user    0m0.265s
> >>>          sys     0m1.843s
> >>>
> >>>     whereas if we redirect stdout even though there is no stdout produced:
> >>>
> >>>          $ time { awk -i inplace '1' file1m >/dev/null; }
> >>>
> >>>          real    0m0.236s
> >>>          user    0m0.187s
> >>>          sys     0m0.000s
> >>>
> >>>     As you can see that second execution with stdout redirected ran an
> >>>     order of magnitude faster. The person who investigated thinks it's
> >>>     due to the first execution being considered "interactive" since
> >>>     stdout isn't technically being redirected and so doing line
> >>>     buffering vs the second execution being "non-interactive" due to
> >>>     stdout being redirected and so using a larger buffer.
> >>>
> >>>     If that is the case, could gawk be updated to consider "inplace"
> >>>     editing as non-interactive? If not, I think it'd be worth a
> >>>     statement in the manual about this difference in performance
> >>>     between the 2.
> >>>
> >>>           Ed.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >


