bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#63736: [Shepherd] Loss of SIGCHLD notifications


From: Ludovic Courtès
Subject: bug#63736: [Shepherd] Loss of SIGCHLD notifications
Date: Fri, 26 May 2023 14:14:09 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)

I experienced, with the Shepherd 0.10.0, a situation where ‘herd
restart’ would get stuck after the end-of-grace-period expiration
(restarting nscd in this case):

--8<---------------cut here---------------start------------->8---
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd: 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   It is running since 
08:43:57 (36 seconds ago). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Running value is 
23753. 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   It is enabled. 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Provides (nscd). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Requires 
(user-processes syslogd). 
May 26 08:44:33 localhost shepherd[1]: [NetworkManager]   Will be respawned. 
May 26 08:44:33 localhost shepherd[1]: Stopping service nscd... 
[...]
May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; 
sending -23753 SIGKILL. 
--8<---------------cut here---------------end--------------->8---

The ‘herd restart’ process is indeed waiting, and the nscd process is
still around as zombie, meaning that shepherd didn’t call waitpid(2):

--8<---------------cut here---------------start------------->8---
$ sudo herd status nscd
Status of nscd:
  It is being stopped.
  It is enabled.
  Provides (nscd).
  Requires (user-processes syslogd).
  Will be respawned.
$ ps aux|grep nscd
root     23753  0.0  0.0      0     0 ?        Zs   08:43   0:00 [nscd] 
<defunct>
root     23870  0.0  0.1  49968 18088 ?        Sl   08:44   0:00 
/gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile 
--no-auto-compile /run/current-system/profile/bin/herd restart nscd
ludo     25280 33.3  0.0   6112  2304 pts/0    S+   09:32   0:00 grep 
--color=auto nscd
--8<---------------cut here---------------end--------------->8---

At that point, if I pick another service process and kill it, shepherd
again fails to react to SIGCHLD (or doesn’t receive it):

--8<---------------cut here---------------start------------->8---
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
$ sudo kill 23814
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
ludo@ribbon ~/src/guix$ ps 23814
  PID TTY      STAT   TIME COMMAND
23814 ?        Zs     0:00 [ntpd] <defunct>
--8<---------------cut here---------------end--------------->8---

If I repeat the same experiment while stracing shepherd, here’s what I
see:

--8<---------------cut here---------------start------------->8---
$ sudo herd status guix-publish
Status of guix-publish:
  It is running since 08:44:32 AM (55 minutes ago).
  Running value is 23822.
  It is enabled.
  Provides (guix-publish).
  Requires (user-processes guix-daemon avahi-daemon).
  Will be respawned.
$ sudo kill 23822
$ ps 23822
  PID TTY      STAT   TIME COMMAND
23822 ?        Zs     0:02 [guix publish] <defunct>
--8<---------------cut here---------------end--------------->8---

… and the strace:

--8<---------------cut here---------------start------------->8---
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1
1     accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) 
= 25
1     accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 
EAGAIN (Resource temporarily unavailable)
1     epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, 
data={u32=14, u64=14}}) = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     read(25, "(shepherd-command (version 0) (action status) (service 
guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     epoll_wait(13, [], 8, 0)          = 0
1     write(25, "(reply (version 0) (result ((service (version 0) (provides 
(guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? 
#t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) 
(conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) 
(starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running 
. 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 
1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 
686
1     close(25)                         = 0
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1
1     read(30, "", 4096)                = 0
1     close(30)                         = 0
1     close(36)                         = 0
1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1     epoll_wait(13, 
--8<---------------cut here---------------end--------------->8---

The signal FD still exists but we didn’t see any activity on it:

--8<---------------cut here---------------start------------->8---
$ sudo ls -lrt /proc/1/fd |grep signalfd
lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd]
--8<---------------cut here---------------end--------------->8---

Here’s the signal handling status (notice two signals queued):

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/1/status|grep -i Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: 0000000000014003
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

With:

  (define (signal-mask->list mask)
    (let loop ((sig 32) (lst '()))
      (if (zero? sig)
          lst
          (loop (- sig 1)
                (if (bit-set? (- sig 1) mask)
                    (cons sig lst)
                    lst)))))

… we see that the block mask is equal to ‘%precious-signals’ (good):

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000014003)
$33 = (1 2 15 17)
--8<---------------cut here---------------end--------------->8---

… while the ignored mask lists SIGPIPE:

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000001000)
$32 = (13)
--8<---------------cut here---------------end--------------->8---

However, the process also has 3 GC marker threads and 1 finalization
thread:

--8<---------------cut here---------------start------------->8---
(gdb) info threads
  Id   Target Id                                     Frame 
* 1    Thread 0x7fdbd530f380 (LWP 1) "shepherd"      0x00007fdbd5415626 in 
epoll_wait (epfd=13, 
    events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at 
../sysdeps/unix/sysv/linux/epoll_wait.c:30
  2    Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0" 
__futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 
<mark_cv+40>)
    at futex-internal.c:57
  3    Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1" 
__futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 
<mark_cv+40>)
    at futex-internal.c:57
  4    Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2" 
__futex_abstimed_wait_common64 (private=0, 
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 
<mark_cv+40>)
    at futex-internal.c:57
  5    Thread 0x7fdbd3389640 (LWP 123) "shepherd"    __GI___libc_read 
(nbytes=1, 
    buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
--8<---------------cut here---------------end--------------->8---

These threads seem to be blocking every signal, which is good:

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/118/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/119/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/120/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/123/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5ffbfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

Then all of a sudden, as I’m conducting this experiment, I see:

--8<---------------cut here---------------start------------->8---
May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated)
[…]
May 26 10:52:10 localhost shepherd[1]: Respawning nscd. 
May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) 
terminated with signal 15. 
May 26 10:52:10 localhost shepherd[1]: Respawning ntpd. 
May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish. 
May 26 10:52:10 localhost shepherd[1]: Starting service nscd... 
May 26 10:52:10 localhost shepherd[1]: Starting service ntpd... 
May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started. 
May 26 10:52:10 localhost shepherd[1]: Service ntpd started. 
May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148. 
--8<---------------cut here---------------end--------------->8---

10:52 is about the time I attached gdb to shepherd…

Interestingly, we get a “terminated with signal 15” message from
shepherd for guix-publish, but not for ntpd.

Long story short: there seems to be a problem with signal delivery.
Most likely, the initial grace period expiration above when stopping
nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
rather than the cause.

To be continued…

Ludo’.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]