Re: wait -n misses signaled subprocess

bug-bash

From:	Chet Ramey
Subject:	Re: wait -n misses signaled subprocess
Date:	Wed, 31 Jan 2024 11:35:57 -0500
User-agent:	Mozilla Thunderbird

On 1/30/24 12:40 PM, Robert Elz wrote:

   | since this was the way -n worked originally, before it started
   | paying attention to pid arguments.

I'm not sure what the "this" is there, if you meant as I described it
in my answer to your rhetorical question, viz:

        Find, or if there are none already, wait*(2) for, [...]
        If there's already a terminated job [...] then no wait type
        sys call gets performed

then that seems to be in conflict with some of your other statements
like:


I won't ask you to look at the code, but yes, that's pretty much what it
did: polled dead jobs to see if any could be returned because the user
had not been notified, then made sure there were actual running background
jobs and waited for one of them and returned the first one that exited.


chet.ramey@case.edu said (replying to Dale R. Worley):
   | > It looks like the underlying meaning of "-n" is to only pay attention to
   | > *new* job completions, and anything "in the past" (already notified and
   | > moved to the table of terminated background jobs) is ignored.
   | That was the original implementation, yes.

which is a different thing entirely.


Not quite. `new' in this sense is the opposite of `anything in the past'
as Dale described it -- already notified and removed from the jobs list.
Jobs in the jobs list that hadn't been marked as notified were eligible
to be returned, because to the user, they're new.

Half the problem here is that bash aggressively marks dead jobs as being
notified in non-interactive shells without job control enabled, and moves
them out of the jobs table.


   | Right -- it works on the list of running background jobs.

I know it is hard, but for determining what should happen, we need to
keep thoughts of the current implementation details out of this, as
while I'm sure you know exactly what that means, most others will not.


It's pretty much the original implementation as I described it above. The
running background jobs part kicks in after the `dead but not notified'
part.


What matters (to a script writer) is whether or not the processes listed
(if any) have had their status collected before or not - if not, then
any process (job) eligible (in the arg list of pids if there is one, or
just any) which has returned some status should be returned (if there
are multiple, any one of them) and if there are none, then we wait(2)
until one does change status.   What exactly "Running background jobs"
means there is not clear (to me anyway).

OK.

What's the mechanism by
which they find out which processes are in the state where the current version
of wait -n will work on them?    Assume there are multiple running (or
perhaps recently ended) processes, and we want to process each as it
ends (or soon after, given multiple might end around the same time).


If you use wait -n without arguments, you probably don't care, but if you
do, or if you use wait -n with pid/job arguments (which you've presumably
saved yourself) you're going to need slightly different semantics than we
have now to answer that reliably. And that will probably need a new option.
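For reference, a minimal sketch of the "wait -n with saved pid arguments" usage mentioned above. It assumes bash 5.1 or later for the -p option; the variable names (pid_a, pid_b, finished, status) are made up for illustration. Both children are still running when wait -n starts, which is the case everyone agrees works today.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash >= 5.1 (wait -n with pid arguments and -p).

sleep 0.2 & pid_a=$!    # short-lived child, expected to finish first
sleep 5   & pid_b=$!    # long-lived child

finished=""
status=0
# Block until one of the two listed pids terminates; -p stores the pid
# of the child whose status is being returned in "finished".
wait -n -p finished "$pid_a" "$pid_b" || status=$?

echo "pid $finished finished with status $status"

# Clean up the child that is still running.
kill "$pid_b" 2>/dev/null || true
wait "$pid_b" 2>/dev/null || true
```

The harder case discussed in this thread is the one where a listed pid has already terminated before wait -n is called; this sketch does not exercise it.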


   | The real question is whether or not
   | we should extend `wait -n' to behave more like `wait' without options.

That's not an answerable question, as there are several differences
between wait -n and wait without -n (which is what I assume you mean
by "wait without options").


I mean the bash/POSIX semantics for `wait' without -n, for which you can
ignore -p and -f.

And that's why I used `more': there are several differences, so which
of those differences should we attempt to change?

The one change that should be made is
to allow wait -n to collect processes/jobs that have already terminated.


Yes, that's one of the things we're talking about. I don't have any problem
with it, but should it take a new option to change those semantics?

Changing it to wait for all the listed pids (which would make it behave
more like wait without -n) is not desirable.


It's never done that.

Nor is changing a simple
"wait -n" (no pid args, the presence, or not, of -p or -f is irrelevant)
to always exit with status 0 - which is what "wait" does.   So, please
be clear.


We're not going to change the return value from wait.


   | Why impose that requirement when it's never existed before?

Never existed before in what?   In bash, perhaps.   In standard Bourne
shells (and POSIX), this isn't at all new, it has always been required
to wait for background processes (or allow the list of saved status
to overflow, and old ones to be discarded).


Yeah, but we're talking about bash here. It doesn't really matter what
the Bourne shell did; there are likely plenty of scripts that assume
the historical bash behavior.

There was never any
implicit "clean up when X happens" which is what bash seems to do
(in non-interactive shells, interactive ones clean up before PS1 is
written).


And?


   | Bash `wait' already has -f to return only when the specified job(s) has
   | terminated, reserving -t for some future use.

No, that's not what I meant, -f is making the distinction between terminated
and some other status change.   I meant the distinction between processes
that the shell has already collected status for, and those for which it
is yet to do so - ie: to add an option more or less equiv to WNOHANG in
the wait*(2) sys calls (the ones that have flags).   The shell could
simply never do a wait(2) family sys call when the option is set, or if
it does one, to see if there might be a zombie waiting to be reaped,
then it should set WNOHANG when it does, to keep the script from pausing.


You're not the first to propose something like that, but I'm not going to
be writing that code any time soon.

And again from the earlier reply to Dale, you said:

   |  If the pid has already terminated, the wait is immediate, and there's no
   | reason to call the system call, but wait still returns the status.

if only that were true in the current implementation when -n is given.

In the man page, the wording for how the wait command behaves is essentially
identical for both the -n case, and the no -n case.


It is, in fact, true in the current implementation, as long as the pid
is in the jobs list. It's always been true. If there is a job marked
(internally, if you must) as dead for which the user has not yet received
notification, wait -n returns it and marks it as notified (and deletes
it from the jobs list).


        Wait for each specified child process and return its termination
         status. [...]
        If the -n option is supplied, wait waits for [...]

In both cases, the wait builtin utility "waits for" various processes
(depending on pid args or not, ...).   It should (aside from wait without
-n waiting for all the candidate processes, and wait with -n waiting for
only one of them) behave just the same.


Yes, that's one of the things we're talking about: whether wait -n should
consider pids/jobs *not* in the jobs list, the way wait without -n does.
That's about the only thing we're talking about changing here so far.


And from a reply to Steven Pelley:

   | This has raised several other questions: whether `wait -n' should work more
   | like `wait' (see below)

You know my answer to that one.

   | and whether non-interactive shells without job
   | control enabled should be so aggressive at marking jobs as notified, since
   | it's that state that allows them to move to the list of terminated
   | processes

This all seems to be internal bash optimisation.   What list the processes/
jobs are on inside bash is, or should be, irrelevant to the script, things
should behave just the same (at least for "wait") whatever internal list the
process is on (and just like the sys call, once status of the process has
been returned once, it is gone, and never can be waited upon again).


That hasn't actually been true with bash running in default mode for a
very long time now. Bash has allowed multiple waits for the same pid for
many years, whether or not you or I think it's a good idea or the correct
semantics. Even if it was an accident of the implementation, and maybe you
could say it was, we are stuck with it.
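A small sketch of the "multiple waits for the same pid" behavior described above. The variable names are illustrative. Going by the paragraph above, bash in default mode is expected to report the same status for the second wait; a shell that discards the status after the first wait would report 127 instead.

```shell
#!/usr/bin/env bash
# Sketch only: wait twice for the same terminated child.

(exit 7) &              # background child that exits with status 7
pid=$!

first=0
wait "$pid" || first=$?     # first wait collects the child's status

second=0
wait "$pid" || second=$?    # second wait for a pid already waited for

echo "first wait: $first, second wait: $second"
```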


   | Can you think of a use case that would break if wait -n looked at
   | terminated processes?

I certainly cannot, as that's the way I always assumed it worked, and I
think most other people did as well.


It's ok, we got one.

If none of the 3 have changed state when we do the wait -n, everything
works nicely.  If one of them (say pid1) already terminated just before
the wait -n was executed, then in bash, that one is apparently exempt from
being returned.   That's impossible to program around.
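The scenario above can be reproduced with something like the following sketch. The pids and statuses are made up for illustration, and the short sleep is only there to make it likely that pid1 has terminated before wait -n runs. No particular result is claimed: what wait -n returns here depends on the bash version and on whether the dead job has already been moved out of the jobs list.

```shell
#!/usr/bin/env bash
# Sketch only: one of three children terminates before wait -n is called.

(exit 3) & pid1=$!      # exits almost immediately with status 3
sleep 1  & pid2=$!
sleep 1  & pid3=$!

sleep 0.3               # give pid1 time to terminate first

rc=0
# If wait -n considers already-terminated pids, this can return pid1's
# status; otherwise it waits for pid2 or pid3, or reports an error.
wait -n "$pid1" "$pid2" "$pid3" || rc=$?
echo "wait -n returned $rc"

wait                    # collect whatever is still running
```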


So, again, the question is whether or not `wait -n' looks in the bgpids
table, like wait without -n does. Given that things that work today would
not work tomorrow after that change, it will probably take another option.
Maybe it's time for `wait -a' (any).

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    chet@case.edu    http://tiswww.cwru.edu/~chet/
