bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#41948: Shepherd deadlocks


From: Ludovic Courtès
Subject: bug#41948: Shepherd deadlocks
Date: Sat, 20 Jun 2020 12:31:40 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)

Hi,

Mathieu Othacehe <othacehe@gnu.org> skribis:

> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
> to deadlock when restarting "guix-daemon". This can happen at different
> stages:
>
> * In "umount-cow-store" procedure, just before finishing the install.
>
> * During "set-http-proxy" tests inside the marionette.
>   
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

We should be able to reproduce it with much simpler tests then, right?
Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

> It seems that, just after blocking signals, in "fork+exec-command", I
> guess, Shepherd is taking a lock:
>
> 183553:1     chdir("/")                        = 0
> 183554:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, 
> tv_nsec=4387222}) = 0
> 183555:1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
> 183560:1     madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
> 183561:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, 
> tv_nsec=13592357}) = 0
> 183562:1     clone(child_stack=0x7f17981e6fb0, 
> flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
>  parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
> 183563:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, 
> ...}) = 0
> 183564:1     write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished 
> ...>
> 183566:1     <... write resumed>)              = 86
> 183575:1     getpgid(169)                      = 169
> 183576:1     getpgid(0)                        = 0
> 183577:1     kill(-169, SIGTERM <unfinished ...>
> 183579:1     <... kill resumed>)               = 0
> 183582:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, 
> ...}) = 0
> 183583:1     write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished 
> ...>
> 183585:1     <... write resumed>)              = 51
> 183594:1     getuid()                          = 0
> 183595:1     rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM 
> CHLD], 8) = 0
> 183596:1     write(6, "\1", 1 <unfinished ...>
> 183598:1     <... write resumed>)              = 1
> 183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
>
> and is then blocking forever.

When that happens, we should check how many threads exist in PID 1.
There should be the finalization thread and the main thread, plus the
signal thread (because there are still ‘sigaction’ calls in the ‘main’
procedure), plus the GC marker threads.

In <https://issues.guix.gnu.org/31925#6>, Andy suggests that the signal
thread is not properly handled; indeed it takes locks and we don’t try
to shut it down upon fork.  However, when using signalfd, the signal
thread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so I
don’t see how it could cause problems.

The GC threads are presumably taken care of by the atfork handler in
libgc.

Thoughts?

Ludo’.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]