[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#43565: cuirass: Fibers scheduling blocked.
From: |
Ludovic Courtès |
Subject: |
bug#43565: cuirass: Fibers scheduling blocked. |
Date: |
Fri, 23 Oct 2020 14:21:05 +0200 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) |
Good afternoon fearless hacker!
Mathieu Othacehe <othacehe@gnu.org> skribis:
>> ‘process-build-log’ in Cuirass uses ‘read-line/non-blocking’ to read a
>> line from the log port of ‘build-derivations&’. If that really is
>> non-blocking (and I think it is), then we should be fine?
>>
>> We should attach GDB to Cuirass next time to see what’s blocking.
>
> Cuirass is currently hanging probably due to the same issue. I saved a
> GDB core dump in /home/mathieu/core.76483.
For those following along at home, we have 60 threads in there.
A couple of threads are blocked in ‘clock_nanosleep’, which I considered
fishy at first:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007fe26752f7a1 in __GI___clock_nanosleep (clock_id=-612010, flags=0,
req=0x7fdf6b40d140, rem=0x7fdf6b40d140)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
#1 0x00007fe267a0166d in ffi_call_unix64 ()
from /gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#2 0x00007fe2679ffac0 in ffi_call_int () from
/gnu/store/bw15z9kh9c65ycc2vbhl2izwfwfva7p1-libffi-3.3/lib/libffi.so.7
#3 0x00007fe267af5f2e in scm_i_foreign_call (cif_scm=<optimized out>,
pointer_scm=<optimized out>,
errno_ret=errno_ret@entry=0x7fe25a8e86cc, argv=0x7fe25b955df0) at
foreign.c:1073
#4 0x00007fe267b64a84 in foreign_call (thread=0x7fe26741e480, cif=<optimized
out>, pointer=<optimized out>)
at vm.c:1282
#5 0x00007fe2505253e0 in ?? ()
#6 0x00007fe26741e480 in ?? ()
#7 0x00007fe267bd7620 in ?? () from
/gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#8 0x00007fe26741e480 in ?? ()
#9 0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe26741e480,
thread@entry=0x7fe2505253b0,
mcode=0x7fe25052627c "L\215\243\210") at jit.c:5852
#10 0x00007fe267b6bc24 in vm_regular_engine (thread=0x7fe2505253b0) at
vm-engine.c:415
#11 0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>,
argv=argv@entry=0x0,
nargs=nargs@entry=0) at vm.c:1608
#12 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>)
at eval.c:490
#13 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag
20045>) at async.c:406
--8<---------------cut here---------------end--------------->8---
This can only come from (fibers posix-clocks) via
‘with-interrupts’—probably OK.
Then there’s a couple of threads block in ‘pthread_cond_wait’, but
that’s presumably also Fibers internals.
Then there’s a whole bunch of threads stuck in ‘read’:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007fe267a180a4 in __libc_read (fd=80, buf=buf@entry=0x7fe22b0bb8f0,
nbytes=nbytes@entry=8)
at ../sysdeps/unix/sysv/linux/read.c:26
#1 0x00007fe267af69c7 in fport_read (port=<optimized out>, dst=<optimized
out>, start=<optimized out>, count=8)
at fports.c:597
#2 0x00007fe267b30542 in trampoline_to_c_read (port=#<port #<port-type file
7fe25fb4db40> 7fe22b7b9880>,
dst="#<vu8vector>" = {...}, start=0, count=8) at ports.c:266
#3 0x00007fe2580cb5fe in ?? ()
#4 0x00007fe267431d80 in ?? ()
#5 0x00007fe267bd7620 in ?? () from
/gnu/store/0w76khfspfy8qmcpjya41chj3bgfcy0k-guile-3.0.4/lib/libguile-3.0.so.1
#6 0x00007fe267431d80 in ?? ()
#7 0x00007fe267b1043b in scm_jit_enter_mcode (thread=0x7fe267431d80,
thread@entry=0x7fe2580cb5d0,
mcode=0x7fe229340690 "H\203\350(I\211\314I)\304I\203\374\060\017\205T\003")
at jit.c:5852
#8 0x00007fe267b6b8e9 in vm_regular_engine (thread=0x7fe2580cb5d0) at
vm-engine.c:360
#9 0x00007fe267b6c5b5 in scm_call_n (proc=proc@entry=#<unmatched-tag 20045>,
argv=argv@entry=0x0,
nargs=nargs@entry=0) at vm.c:1608
#10 0x00007fe267ae8ae9 in scm_call_0 (proc=proc@entry=#<unmatched-tag 20045>)
at eval.c:490
#11 0x00007fe267adb138 in scm_call_with_unblocked_asyncs (proc=#<unmatched-tag
20045>) at async.c:406
--8<---------------cut here---------------end--------------->8---
‘trampoline_to_c_read’ is known as ‘port-read’ in Scheme, so I think the
call above comes from ‘read-bytes’ in (ice-9 suspendable-ports).
Normally, this file descriptor is O_NONBLOCK, and thus ‘fport_read’
immediately returns EAGAIN, so ‘trampoline_to_c_read’ returns #false.
But does Cuirass create file descriptors as O_NONBLOCK? This has to be
done explicitly, Fibers won’t do it for us. As it turns out, the answer
is no, in at least one important case: the connection to the daemon
(untested patch below).
While GC is running, Cuirass typically sends ‘build-derivations’ RPCs
and they block until the GC lock is released. That can lead to the
situation above: a bunch of threads blocked in ‘read’ from their daemon
socket, waiting for the RPC reply. OTOH, ‘build-derivations’ RPCs are
made from a fresh thread created by ‘build-derivations&’.
There are probably other situations where the daemon replies slowly.
For instance, ‘fetch-input’ can remain stuck until GC is over.
WDYT?
Thanks for investigating!
Ludo’.
diff --git a/src/cuirass/base.scm b/src/cuirass/base.scm
index 5a0c826..6db43c4 100644
--- a/src/cuirass/base.scm
+++ b/src/cuirass/base.scm
@@ -36,6 +36,9 @@
#:use-module ((guix config) #:select (%state-directory))
#:use-module (git)
#:use-module (ice-9 binary-ports)
+ #:use-module ((ice-9 suspendable-ports)
+ #:select (current-read-waiter
+ current-write-waiter))
#:use-module (ice-9 format)
#:use-module (ice-9 match)
#:use-module (ice-9 popen)
@@ -79,7 +82,12 @@
;; currently closes in a 'dynamic-wind' handler, which means it would close
;; the store at each context switch. Remove this when the real 'with-store'
;; has been fixed.
- (let ((store (open-connection)))
+ (let* ((store (open-connection))
+ (socket (store-connection-socket store)))
+ ;; Mark SOCKET as non-blocking so Fibers can schedule the way it wants.
+ (let ((flags (fcntl socket F_GETFL)))
+ (fcntl socket F_SETFL (logior O_NONBLOCK flags)))
+
(unwind-protect
;; Always set #:keep-going? so we don't stop on the first build failure.
;; Set #:print-build-trace explicitly to make sure 'process-build-log'
@@ -422,7 +430,12 @@ Essentially this procedure inverts the
inversion-of-control that
(lambda ()
(guard (c ((store-error? c)
(atomic-box-set! result c)))
- (parameterize ((current-build-output-port output))
+ (parameterize ((current-build-output-port output)
+
+ ;; STORE's socket is O_NONBLOCK but since we're
+ ;; not in a fiber, disable Fiber's handlers.
+ (current-read-waiter #f)
+ (current-write-waiter #f))
(let ((x (build-derivations store lst)))
(atomic-box-set! result x))))
(close-port output))