bug#12980: 24.3.50; Zombie process left after call-process

bug-gnu-emacs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#12980: 24.3.50; Zombie process left after call-process

From:	Eli Zaretskii
Subject:	bug#12980: 24.3.50; Zombie process left after call-process
Date:	Tue, 27 Nov 2012 18:00:56 +0200

> Date: Mon, 26 Nov 2012 14:38:41 -0800
> From: Paul Eggert <eggert@cs.ucla.edu>
> CC: 12980@debbugs.gnu.org, Eli Zaretskii <eliz@gnu.org>, 
>  9488@debbugs.gnu.org
> 
> Harald, thanks for the bug report.  This part of Emacs is a bit
> messy, but I hacked away at it and came up with
> the attached proposed patch.
> 
> This patch touches some code for Microsoft platforms
> so I'm CC'ing this to Eli.

Thanks.

> +/* Return true if CHILD is reapable.  CHILD must be the process ID of
> +   a child process.  As a side effect this function may reap CHILD,
> +   discarding its status, and therefore report that CHILD is not
> +   reapable.
> +
> +   This function is intended for use after the caller was interrupted
> +   while trying to reap CHILD, and where the caller later tests
> +   whether CHILD was in fact reaped.  The caller should not create
> +   another child process (via vfork, say) between the earlier attempt
> +   to reap CHILD and now, as that would cause a race if the newly
> +   created child process happened to have CHILD as its process-ID.  */

I'm confused by this commentary, perhaps because it doesn't explain
what is the definition of "reapable".  The original naive
interpretation (i.e. the child exited and its status is ready to be
accessed) seems to contradict the latter part of the commentary, and
also the fact that the function will return false both if waitpid
returns a positive and a negative value.  I would suggest to clarify
this.

>  #ifdef MSDOS /* MW, July 1993 */
>      /* Note that on MSDOS `child_setup' actually returns the child process
> -       exit status, not its PID, so we assign it to `synch_process_retcode'
> -       below.  */
> +       exit status, not its PID, so assign it to status below.  */
>      pid = child_setup (filefd, outfilefd, fd_error, (char **) new_argv,
>                      0, current_dir);
> -
> -    /* Record that the synchronous process exited and note its
> -       termination status.  */
> -    synch_process_alive = 0;
> -    synch_process_retcode = pid;
> -    if (synch_process_retcode < 0)  /* means it couldn't be exec'ed */
> -      {
> -     synchronize_system_messages_locale ();
> -     synch_process_death = strerror (errno);
> -      }
> +    status = pid;

The feature, whereby this function (call-process) could return a
description of errno, is hereby lost.  That's quite nasty in the MSDOS
case, where errno is the only way to indicate to the user what
happened, because the value of status in case of failure is always -1,
which is not quite descriptive.  Can we please not drop this feature
on the floor?

> +    if (0 < pid)
> +      {
> +     if (INTEGERP (buffer))
> +       record_deleted_pid (pid);
> +     else
> +       {
> +#ifdef MSDOS
> +         /* MSDOS needs different cleanup information.  */
> +         cleanup_info_tail = build_string (tempfile ? tempfile : "");
> +#else
> +         cleanup_info_tail = INTEGER_TO_CONS (pid);
> +#endif
> +         record_unwind_protect (call_process_cleanup,
> +                                Fcons (Fcurrent_buffer (),
> +                                       Fcons (INTEGER_TO_CONS (fd[0]),
> +                                              cleanup_info_tail)));
> +       }
> +      }
> +
> +    unblock_child_signal ();
>      unblock_input ();
>  
>  #endif /* not WINDOWSNT */
> @@ -658,17 +705,6 @@
>    /* Enable sending signal if user quits below.  */
>    call_process_exited = 0;
>  
> -#if defined (MSDOS)
> -  /* MSDOS needs different cleanup information.  */
> -  cleanup_info_tail = build_string (tempfile ? tempfile : "");
> -#else
> -  cleanup_info_tail = INTEGER_TO_CONS (pid);
> -#endif /* not MSDOS */
> -  record_unwind_protect (call_process_cleanup,
> -                      Fcons (Fcurrent_buffer (),
> -                             Fcons (INTEGER_TO_CONS (fd[0]),
> -                                    cleanup_info_tail)));
> -

This is wrong for MSDOS, because the temporary file is created before
attempting to launch the process, and so it needs to be cleaned up
even if we failed to launch the child process.

> -  if (synch_process_termsig)
> +  if (WIFSIGNALED (status))
>      {
>        const char *signame;
>  
>        synchronize_system_messages_locale ();
> -      signame = strsignal (synch_process_termsig);
> +      signame = strsignal (WTERMSIG (status));
>  
>        if (signame == 0)
>       signame = "unknown";
>  
> -      synch_process_death = signame;
> +      return code_convert_string_norecord (build_string (signame),
> +                                        Vlocale_coding_system, 0);
>      }
>  
> -  if (synch_process_death)
> -    return code_convert_string_norecord (build_string (synch_process_death),
> -                                      Vlocale_coding_system, 0);
> -  return make_number (synch_process_retcode);
> +  eassert (WIFEXITED (status));
> +  return make_number (WEXITSTATUS (status));

This seem to assume that the only trouble in call-process could be
from some signal.  Why is that true?

>  static bool
> -process_status_retrieved (pid_t desired, pid_t have, int *status)
> +process_status_retrieved (pid_t desired, int *status, int options)
>  {
> -  if (have < 0)
> +  /* Invoke waitpid only with a known process ID; do not invoke
> +     waitpid with a nonpositive argument.  Otherwise, Emacs might
> +     reap an unwanted process by mistake.  For example, invoking
> +     waitpid (-1, ...) can mess up glib by reaping glib's subprocesses,
> +     so that another thread running glib won't find them.  */
> +  eassert (0 < desired);
> +
> +  while (1)
>      {
> -      /* Invoke waitpid only with a known process ID; do not invoke
> -      waitpid with a nonpositive argument.  Otherwise, Emacs might
> -      reap an unwanted process by mistake.  For example, invoking
> -      waitpid (-1, ...) can mess up glib by reaping glib's subprocesses,
> -      so that another thread running glib won't find them.  */
> -      do
> -     have = waitpid (desired, status, WNOHANG | WUNTRACED);
> -      while (have < 0 && errno == EINTR);
> +      pid_t pid = waitpid (desired, status, WNOHANG | options);
> +      if (0 <= pid)
> +     return 0 < pid;
> +      if (errno == ECHILD)
> +     return 0;
> +      eassert (errno == EINTR);

Why is this eassert a good idea?  We don't need to enforce the Posix
spec at a price of crashing Emacs on users, do we?  What bad things
can happen if you see some other errno here?

> --- src/w32proc.c     2012-11-24 01:57:09 +0000
> +++ src/w32proc.c     2012-11-26 22:06:24 +0000
> @@ -1273,34 +1273,7 @@
>    DebPrint (("Wait signaled with process pid %d\n", cp->pid));
>  #endif
>  
> -  if (status)
> -    {
> -      *status = retval;
> -    }
> -  else if (synch_process_alive)
> -    {
> -      synch_process_alive = 0;
> -
> -      /* Report the status of the synchronous process.  */
> -      if (WIFEXITED (retval))
> -     synch_process_retcode = WEXITSTATUS (retval);
> -      else if (WIFSIGNALED (retval))
> -     {
> -       int code = WTERMSIG (retval);
> -       const char *signame;
> -
> -       synchronize_system_messages_locale ();
> -       signame = strsignal (code);
> -
> -       if (signame == 0)
> -         signame = "unknown";
> -
> -       synch_process_death = signame;
> -     }
> -
> -      reap_subprocess (cp);
> -    }
> -
> +  *status = retval;
>    reap_subprocess (cp);

The condition testing that status is non-NULL must stay, because that
argument can be NULL, even if Emacs currently never uses that option.

I didn't see any other obvious problems.  However, given the
complexity of handling this, and the extent of changes in this
changeset, it would be nice to have somewhere a commentary with an
overview of how termination of child processes is handled.  Reverse
engineering that from the code is not really trivial.  Would you
please consider writing such an overview?

TIA

[Prev in Thread]

Current Thread

[Next in Thread]

bug#9488: 24.3.50; Zombie process left after call-process, Paul Eggert, 2012/11/26
- bug#12980: 24.3.50; Zombie process left after call-process, Eli Zaretskii <=
  - bug#12980: 24.3.50; Zombie process left after call-process, Paul Eggert, 2012/11/28
    - bug#12980: 24.3.50; Zombie process left after call-process, Eli Zaretskii, 2012/11/29
    - bug#12980: 24.3.50; Zombie process left after call-process, Paul Eggert, 2012/11/29

Prev by Date: bug#12994: 24.2.90; default value of diary-header-line-format
Next by Date: bug#13011: 24.2; Text flickering moving cursor with box around text enabled
Previous by thread: bug#9488: 24.3.50; Zombie process left after call-process
Next by thread: bug#12980: 24.3.50; Zombie process left after call-process
Index(es):
- Date
- Thread