Re: wait -n misses signaled subprocess


From: Robert Elz
Subject: Re: wait -n misses signaled subprocess
Date: Wed, 31 Jan 2024 00:40:35 +0700

    Date:        Tue, 30 Jan 2024 09:16:47 -0500
    From:        Chet Ramey <chet.ramey@case.edu>
    Message-ID:  <95841ed3-ec4f-4b17-802c-86e560b58dfa@case.edu>

  | since this was the way -n worked originally, before it started
  | paying attention to pid arguments.

I'm not sure what the "this" is there.  If you meant what I described
in my answer to your rhetorical question, viz:

        Find, or if there are none already, wait*(2) for, [...]
        If there's already a terminated job [...] then no wait type
        sys call gets performed

then that seems to be in conflict with some of your other statements
like:

chet.ramey@case.edu said (replying to Dale R. Worley):
  | > It looks like the underlying meaning of "-n" is to only pay attention to
  | > *new* job completions, and anything "in the past" (already notified and
  | > moved to the table of terminated background jobs) is ignored.
  | That was the original implementation, yes.

which is a different thing entirely.

  | Right -- it works on the list of running background jobs.

I know it is hard, but for determining what should happen, we need to
keep thoughts of the current implementation details out of this, as
while I'm sure you know exactly what that means, most others will not.

What matters (to a script writer) is whether the processes listed (if
any) have already had their status collected.   If not, then any
eligible process (job), meaning one in the arg list of pids if there is
one, or otherwise just any, which has returned some status should be
returned (if there are multiple, any one of them), and if there are none,
then we wait(2) until one does change status.   What exactly "Running
background jobs" means there is not clear (to me anyway).

But if it were to mean only processes that haven't previously terminated,
how is the script writer meant to handle that?   What's the mechanism by
which they find out which processes are in the state where the current version
of wait -n will work on them?    Assume there are multiple running (or
perhaps recently ended) processes, and we want to process each as it
ends (or soon after, given multiple might end around the same time).
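The use case in question can be sketched roughly as below.   This is
only an illustration: it assumes bash 5.1 or later (for -p), the helper
start_job and the arrays running and results are hypothetical names, and
the delays and exit statuses are arbitrary.   It also assumes no child
finishes unnoticed between two calls of wait -n, which is exactly the
assumption under discussion.

```shell
#!/usr/bin/env bash
# Sketch only: needs bash 5.1+ for `wait -p`.
declare -A running=()   # pid -> description (hypothetical bookkeeping)
results=()              # exit statuses, in the order they were collected

start_job() {           # hypothetical helper: $1 = delay, $2 = exit status
    ( sleep "$1"; exit "$2" ) &
    running[$!]="delay=$1 status=$2"
}

start_job 0.3 1
start_job 0.6 2
start_job 0.9 3

while (( ${#running[@]} > 0 )); do
    done_pid=
    # Ask for whichever of the remaining pids finishes first.
    wait -n -p done_pid "${!running[@]}"
    status=$?
    # If no pid was reported, stop rather than loop forever.
    [[ -n $done_pid ]] || break
    results+=("$status")
    unset "running[$done_pid]"
done

echo "collected ${#results[@]} statuses: ${results[*]}"
```

Each pass hands wait -n only the pids not yet collected, so a pid is
never waited for twice.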

  | The real question is whether or not
  | we should extend `wait -n' to behave more like `wait' without options.

That's not an answerable question, as there are several differences
between wait -n and wait without -n (which is what I assume you mean
by "wait without options").   The one change that should be made is
to allow wait -n to collect processes/jobs that have already terminated.
Changing it to wait for all the listed pids (which would make it behave
more like wait without -n) is not desirable.   Nor is changing a simple
"wait -n" (no pid args, the presence, or not, of -p or -f is irrelevant)
to always exit with status 0 - which is what "wait" does.   So, please
be clear.
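
To illustrate the exit status difference, a minimal sketch (the exit
statuses 3, 5 and 7 are arbitrary choices for the example): with pid
operands, wait returns the status of the last pid named, while with no
operands it returns 0 however the jobs ended.

```shell
#!/usr/bin/env bash
# Two short background jobs with different exit statuses.
( exit 3 ) & pid1=$!
( exit 5 ) & pid2=$!

# With pid operands, wait waits for all of them and returns the
# status of the last operand.
wait "$pid1" "$pid2"
with_pids=$?

# With no operands, wait waits for every child and returns 0,
# even though this job exits with 7.
( exit 7 ) &
wait
no_pids=$?

echo "with_pids=$with_pids no_pids=$no_pids"   # prints: with_pids=5 no_pids=0
```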

  | Why impose that requirement when it's never existed before?

Never existed before in what?   In bash, perhaps.   In standard Bourne
shells (and POSIX), this isn't at all new, it has always been required
to wait for background processes (or allow the list of saved status
to overflow, and old ones to be discarded).   There was never any
implicit "clean up when X happens" which is what bash seems to do
(in non-interactive shells, interactive ones clean up before PS1 is
written).

  | Bash `wait' already has -f to return only when the specified job(s) has
  | terminated, reserving -t for some future use.

No, that's not what I meant: -f makes the distinction between terminated
and some other status change.   I meant the distinction between processes
that the shell has already collected status for, and those for which it
is yet to do so - ie: to add an option more or less equiv to WNOHANG in
the wait*(2) sys calls (the ones that have flags).   The shell could
simply never do a wait(2) family sys call when the option is set, or if
it does one, to see if there might be a zombie waiting to be reaped,
then it should set WNOHANG when it does, to keep the script from pausing.

  | There's no reason to keep thousands of terminated jobs in the jobs list,
  | slowing everything down, as long as you give users a way to retrieve their
  | status.

This is just implementation detail, as long as it behaves correctly,
what optimisations the implementation chooses to make are irrelevant.

  | You can run thousands of background jobs in a loop without exceeding the
  | max process limit.

It depends just what those jobs are.   For something like

        while true; do :& done

then yes, sure, as the jobs all terminate quite quickly, and as the shell
collects the zombies as soon as they become available (more or less), the
limit never gets reached.

But those kinds of things are rarely useful to anyone except those doing
torture tests.

More likely would be something like

        while true; do sleep 10000 & done

where the "sleep" is just a placeholder for anything meaningful which is
going to take appreciable time to complete.  In this case, and obviously
depending on what the limit is set to, you will quickly reach the limit.

And again from the earlier reply to Dale, you said:

  |  If the pid has already terminated, the wait is immediate, and there's no
  | reason to call the system call, but wait still returns the status.

if only that were true in the current implementation when -n is given.

In the man page, the wording for how the wait command behaves is essentially
identical for both the -n case, and the no -n case.

        Wait for each specified child process and return its termination
        status. [...]
        If the -n option is supplied, wait waits for [...]

In both cases, the wait builtin utility "waits for" various processes
(depending on pid args or not, ...).   It should (aside from no -n waiting
for all the candidate processes, and wait with -n only waiting for one of
them) behave just the same.

And from a reply to Steven Palley:

  | This has raised several other questions: whether `wait -n' should work more
  | like `wait' (see below)

You know my answer to that one.

  | and whether non-interactive shells without job
  | control enabled should be so aggressive at marking jobs as notified, since
  | it's that state that allows them to move to the list of terminated processes

This all seems to be internal bash optimisation.   What list the processes/
jobs are on inside bash is, or should be, irrelevant to the script, things
should behave just the same (at least for "wait") whatever internal list the
process is on (and just like the sys call, once status of the process has
been returned once, it is gone, and never can be waited upon again).

  | Can you think of a use case that would break if wait -n looked at
  | terminated processes?

I certainly cannot, as that's the way I always assumed it worked, and I
think most other people did as well.

The current design seems to be imposing impossible races on the script.
The script has no way to find out whether a child has changed state or
not, other than doing a "wait" - if the script is slow doing that, and
the effect is that the job never gets returned from wait -n, then what
are we to do?   If we just do "wait pid" and the process has not terminated,
then we hang, and don't get to process other children when they end while
the one we picked keeps on running.   We need wait -n pid1 pid2 pid3 ...
(for as many as we have running) and we want whatever of those pids has
terminated or whichever terminates next to be the one returned.   That
would work if we omitted -n, but then we'd lose the status for pid1 and pid2,
and get only that for pid3, and would have to pause until all are done.

If none of the 3 have changed state when we do the wait -n, everything
works nicely.  If one of them (say pid1) already terminated just before
the wait -n was executed, then in bash, that one is apparently exempt from
being returned.   That's impossible to program around.
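
The scenario can be tried with a sketch along these lines.   The delays
and the exit status 7 are arbitrary, and what it prints depends on the
bash version in use, so no expected output is given.   A shell that
behaves as argued for here would report 7 with no noticeable delay.

```shell
#!/usr/bin/env bash
( exit 7 ) & pid1=$!      # terminates almost at once
sleep 2 & pid2=$!         # stand-in for a long-running job
sleep 0.5                 # foreground command: gives pid1 time to exit first

start=$SECONDS
wait -n "$pid1" "$pid2"   # pid1 is already gone by the time this runs
status=$?
echo "wait -n returned $status after $(( SECONDS - start ))s"

# Clean up the long-running job if it is still around.
kill "$pid2" 2>/dev/null
wait "$pid2" 2>/dev/null || true
```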

It might be different if there was a sh builtin to block (and unblock)
signals (as in sigprocmask(2)) but I don't believe that even bash has
that - if it existed maybe we could block SIGCHLD before starting the async
jobs, and then doing the wait -n would work, as the shell wouldn't have
reaped any of them (however long ago they terminated).   But that would be
bizarre, when all that's needed is to make wait -n behave rationally.








