bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#43565: cuirass: Fibers scheduling blocked.


From: Ludovic Courtès
Subject: bug#43565: cuirass: Fibers scheduling blocked.
Date: Fri, 23 Oct 2020 14:21:05 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Good afternoon fearless hacker!

Mathieu Othacehe <othacehe@gnu.org> skribis:

>> ‘process-build-log’ in Cuirass uses ‘read-line/non-blocking’ to read a
>> line from the log port of ‘build-derivations&’.  If that really is
>> non-blocking (and I think it is), then we should be fine?
>>
>> We should attach GDB to Cuirass next time to see what’s blocking.
>
> Cuirass is currently hanging probably due to the same issue. I saved a
> GDB core dump in /home/mathieu/core.76483.

For those following along at home, we have 60 threads in there.

A couple of threads are blocked in ‘clock_nanosleep’, which I considered
fishy at first:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fe26752f7a1 in __GI___clock_nanosleep (clock_id=-612010, flags=0, 
req=0x7fdf6b40d140, rem=0x7fdf6b40d140)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1  0x00007fe267a0166d in ffi_call_unix64 ()
   from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#2  0x00007fe2679ffac0 in ffi_call_int () from 
/gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#3  0x00007fe267af5f2e in scm_i_foreign_call (cif_scm=<optimized out>, 
pointer_scm=<optimized out>, 
    errno_ret=errno_ret@entry=0x7fe25a8e86cc, argv=0x7fe25b955df0) at 
foreign.c:1073
#4  0x00007fe267b64a84 in foreign_call (thread=0x7fe26741e480, cif=<optimized 
out>, pointer=<optimized out>)
    at vm.c:1282
#5  0x00007fe2505253e0 in ?? ()
#6  0x00007fe26741e480 in ?? ()
#7  0x00007fe267bd7620 in ?? () from 
/gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#8  0x00007fe26741e480 in ?? ()
#9  0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe26741e480, 
thread@entry=0x7fe2505253b0, 
    mcode=0x7fe25052627c "L\215\243\210") at jit.c:5852
#10 0x00007fe267b6bc24 in vm_regular_engine (thread=0x7fe2505253b0) at 
vm-engine.c:415
#11 0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>, 
argv=argv@entry=0x0, 
    nargs=nargs@entry=0) at vm.c:1608
#12 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>) 
at eval.c:490
#13 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag 
20045>) at async.c:406
--8<---------------cut here---------------end--------------->8---

This can only come from (fibers posix-clocks) via
‘with-interrupts’—probably OK.

Then there’s a couple of threads block in ‘pthread_cond_wait’, but
that’s presumably also Fibers internals.

Then there’s a whole bunch of threads stuck in ‘read’:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007fe267a180a4 in __libc_read (fd=80, buf=buf@entry=0x7fe22b0bb8f0, 
nbytes=nbytes@entry=8)
    at ../sysdeps/unix/sysv/linux/read.c:26
#1  0x00007fe267af69c7 in fport_read (port=<optimized out>, dst=<optimized 
out>, start=<optimized out>, count=8)
    at fports.c:597
#2  0x00007fe267b30542 in trampoline_to_c_read (port=#<port #<port-type file 
7fe25fb4db40> 7fe22b7b9880>, 
    dst="#<vu8vector>" = {...}, start=0, count=8) at ports.c:266
#3  0x00007fe2580cb5fe in ?? ()
#4  0x00007fe267431d80 in ?? ()
#5  0x00007fe267bd7620 in ?? () from 
/gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#6  0x00007fe267431d80 in ?? ()
#7  0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe267431d80, 
thread@entry=0x7fe2580cb5d0, 
    mcode=0x7fe229340690 "H\203\350(I\211\314I)\304I\203\374\060\017\205T\003") 
at jit.c:5852
#8  0x00007fe267b6b8e9 in vm_regular_engine (thread=0x7fe2580cb5d0) at 
vm-engine.c:360
#9  0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>, 
argv=argv@entry=0x0, 
    nargs=nargs@entry=0) at vm.c:1608
#10 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>) 
at eval.c:490
#11 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag 
20045>) at async.c:406
--8<---------------cut here---------------end--------------->8---

‘trampoline_to_c_read’ is known as ‘port-read’ in Scheme, so I think the
call above comes from ‘read-bytes’ in (ice-9 suspendable-ports).

Normally, this file descriptor is O_NONBLOCK, and thus ‘fport_read’
immediately returns EAGAIN, so ‘trampoline_to_c_read’ returns #false.

But does Cuirass create file descriptors as O_NONBLOCK?  This has to be
done explicitly, Fibers won’t do it for us.  As it turns out, the answer
is no, in at least one important case: the connection to the daemon
(untested patch below).

While GC is running, Cuirass typically sends ‘build-derivations’ RPCs
and they block until the GC lock is released.  That can lead to the
situation above: a bunch of threads blocked in ‘read’ from their daemon
socket, waiting for the RPC reply.  OTOH, ‘build-derivations’ RPCs are
made from a fresh thread created by ‘build-derivations&’.

There are probably other situations where the daemon replies slowly.
For instance, ‘fetch-input’ can remain stuck until GC is over.

WDYT?

Thanks for investigating!

Ludo’.

diff --git a/src/cuirass/base.scm b/src/cuirass/base.scm
index 5a0c826..6db43c4 100644
--- a/src/cuirass/base.scm
+++ b/src/cuirass/base.scm
@@ -36,6 +36,9 @@
   #:use-module ((guix config) #:select (%state-directory))
   #:use-module (git)
   #:use-module (ice-9 binary-ports)
+  #:use-module ((ice-9 suspendable-ports)
+                #:select (current-read-waiter
+                          current-write-waiter))
   #:use-module (ice-9 format)
   #:use-module (ice-9 match)
   #:use-module (ice-9 popen)
@@ -79,7 +82,12 @@
   ;; currently closes in a 'dynamic-wind' handler, which means it would close
   ;; the store at each context switch.  Remove this when the real 'with-store'
   ;; has been fixed.
-  (let ((store (open-connection)))
+  (let* ((store  (open-connection))
+         (socket (store-connection-socket store)))
+    ;; Mark SOCKET as non-blocking so Fibers can schedule the way it wants.
+    (let ((flags (fcntl socket F_GETFL)))
+      (fcntl socket F_SETFL (logior O_NONBLOCK flags)))
+
     (unwind-protect
      ;; Always set #:keep-going? so we don't stop on the first build failure.
      ;; Set #:print-build-trace explicitly to make sure 'process-build-log'
@@ -422,7 +430,12 @@ Essentially this procedure inverts the 
inversion-of-control that
           (lambda ()
             (guard (c ((store-error? c)
                        (atomic-box-set! result c)))
-              (parameterize ((current-build-output-port output))
+              (parameterize ((current-build-output-port output)
+
+                             ;; STORE's socket is O_NONBLOCK but since we're
+                             ;; not in a fiber, disable Fiber's handlers.
+                             (current-read-waiter #f)
+                             (current-write-waiter #f))
                 (let ((x (build-derivations store lst)))
                   (atomic-box-set! result x))))
             (close-port output))

reply via email to

[Prev in Thread] Current Thread [Next in Thread]