bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-sile


From: Ludovic Courtès
Subject: bug#61839: 'guix offload' occasionally hangs on berlin, hitting max-silent-time
Date: Mon, 17 Apr 2023 18:37:36 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)

Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> We often get evaluation errors for ‘core-updates’, where the build log
> reveals that this is due to a build timeout while building the Guix
> instance that will be used for the evaluation.  Example:

[...]

> exporting path `/gnu/store/m2hyx7qk2j1bsvpj1qv89701klv8kqkk-guix-command'
> exporting path 
> `/gnu/store/vlm7401d5jvna6j7qdxhcr581cd8g8vh-guix-daemon-1.4.0-4.01fd830'
> exporting path `/gnu/store/zv3c0anfj085d2lyxh4g4aahhczmcaj8-guix-daemon'
> exporting path `/gnu/store/f26rw4haizf2h126k9ja9wk31cm49mj0-guix-17bd02433'
> exporting path 
> `/gnu/store/wc736va72zl2zz36817ahf3kc5fkfr69-inferior-script.scm'
> building of `/gnu/store/yq9r44lic8cn4nphksbi5s3ppwfcsh1n-guile.drv' timed out 
> after 3600 seconds of silence

With help from Andreas, I was able to reproduce it on berlin by running:

  guix build /gnu/store/b22580wj63wk4a3i09n22wa60y36kaa5-openjdk-16.0.2.drv

Timeout happens while ‘guix offload’ is sending build requisites to the
target machine.  At that point, the client ‘guix offload’ process is
stuck like so:

--8<---------------cut here---------------start------------->8---
(gdb) bt
#0  0x00007f1fd7ad7d6f in __GI___poll (fds=0x766b00, nfds=1, timeout=3594784) 
at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f1fcc220de7 in ssh_poll_ctx_dopoll () from 
target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#2  0x00007f1fcc221f0c in ssh_handle_packets () from 
target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#3  0x00007f1fcc221fdb in ssh_handle_packets_termination ()
   from 
target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#4  0x00007f1fcc2034bb in channel_write_common () from 
target:/gnu/store/f3hn75p2jjc94jfxfwsgf96s2wz8kaql-libssh-0.9.6/lib/libssh.so.4
#5  0x00007f1fcc28b15b in ?? () from 
target:/gnu/store/rjhq6cz7xbck4l7zar6ga3ygph2chwzr-guile-ssh-0.16.0/lib/libguile-ssh.so
#6  0x00007f1fd7cf4f34 in scm_i_write_bytes (port=0x7f1fc556ba00, 
src=0x7f1fc3c27000, start=0, count=65536) at ports.c:2858
#7  0x00007f1fd7cff546 in scm_put_bytevector (port=0x7f1fc556ba00, 
bv=0x7f1fc3c27000, start=<optimized out>, count=<optimized out>)
    at r6rs-ports.c:676
#8  0x00007f1fc5ac5197 in ?? ()
#9  0x000000000069731c in ?? ()
#10 0x00007f1fc5acdf71 in ?? ()
#11 0x0000000000697318 in ?? ()
#12 0x00007f1fd7cd2ccc in scm_jit_enter_mcode (thread=0x7f1fd73b1d80, 
mcode=0x69bf14 "\034;\004") at jit.c:6038
#13 0x00007f1fd7d2874f in vm_regular_engine (thread=0x7f1fd73b1d80) at 
vm-engine.c:546
#14 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized 
out>, nargs=1) at vm.c:1610
#15 0x00007f1fd7c9d457 in scm_primitive_eval (exp=<optimized out>, 
exp@entry=0x7f1fcfab8cf0) at eval.c:671
#16 0x00007f1fd7ca34b6 in scm_eval (exp=0x7f1fcfab8cf0, 
module_or_state=0x7f1fcfaacc80) at eval.c:705
#17 0x00007f1fd7d073b6 in scm_shell (argc=11, argv=0x6a9fc0) at script.c:357
#18 0x00007f1fd7cb249c in invoke_main_func (body_data=0x7fffd2c0d370) at 
init.c:312
#19 0x00007f1fd7c9b85a in c_body (d=0x7fffd2c0d290) at continuations.c:430
#20 0x00007f1fd7d28326 in vm_regular_engine (thread=0x7f1fd73b1d80) at 
vm-engine.c:972
#21 0x00007f1fd7d355d9 in scm_call_n (proc=<optimized out>, argv=<optimized 
out>, nargs=2) at vm.c:1610
#22 0x00007f1fd7c9d09a in scm_call_2 (proc=<optimized out>, arg1=<optimized 
out>, arg2=<optimized out>) at eval.c:503
#23 0x00007f1fd7d53742 in scm_c_with_exception_handler.constprop.0 (type=0x404, 
handler_data=handler_data@entry=0x7fffd2c0d220,
    thunk_data=thunk_data@entry=0x7fffd2c0d220, thunk=<optimized out>, 
handler=<optimized out>) at exceptions.c:170
#24 0x00007f1fd7d2588f in scm_c_catch (tag=<optimized out>, body=<optimized 
out>, body_data=<optimized out>, handler=<optimized out>,
    handler_data=<optimized out>, pre_unwind_handler=<optimized out>, 
pre_unwind_handler_data=0x7f1fcfa10040) at throw.c:168
#25 0x00007f1fd7c9de66 in scm_i_with_continuation_barrier 
(pre_unwind_handler=0x7f1fd7c9db80 <pre_unwind_handler>,
    pre_unwind_handler_data=0x7f1fcfa10040, handler_data=0x7fffd2c0d290, 
handler=0x7f1fd7ca48b0 <c_handler>, body_data=0x7fffd2c0d290,
    body=0x7f1fd7c9b850 <c_body>) at continuations.c:368
#26 scm_c_with_continuation_barrier (func=<optimized out>, data=<optimized 
out>) at continuations.c:464
#27 0x00007f1fd7d24b39 in with_guile (base=0x7fffd2c0d318, data=0x7fffd2c0d340) 
at threads.c:645
#28 0x00007f1fd7bfb0ba in GC_call_with_stack_base () from 
target:/gnu/store/2lczkxbdbzh4gk7wh91bzrqrk7h5g1dl-libgc-8.0.4/lib/libgc.so.1
#29 0x00007f1fd7d1d8b8 in scm_i_with_guile (dynamic_state=<optimized out>, 
data=<optimized out>, func=<optimized out>) at threads.c:688
#30 scm_with_guile (func=<optimized out>, data=<optimized out>) at threads.c:694
#31 0x00007f1fd7cbb025 in scm_boot_guile (argc=<optimized out>, argv=<optimized 
out>, main_func=<optimized out>, closure=<optimized out>)
    at init.c:295
#32 0x0000000000401191 in ?? ()
#33 0x00007f1fd7a117dd in __libc_start_main (main=0x401110, argc=8, 
argv=0x7fffd2c0d4a8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffd2c0d498) at 
../csu/libc-start.c:332
#34 0x00000000004011da in ?? ()
(gdb) p *fds
$1 = {fd = 15, events = 1, revents = 0}
(gdb) shell ls -l /proc/62345/fd/15
lrwx------ 1 root root 64 Apr 17 17:42 /proc/62345/fd/15 -> 
'socket:[3238975699]'
--8<---------------cut here---------------end--------------->8---

The SSH connection is still live, and on the server side I see the ‘guix
repl -t machine’ process stuck in:

--8<---------------cut here---------------start------------->8---
write(2, "linking 
\342\200\230/gnu/store/0gbj38wswwxq50i68ci973fgi2l531b4-openjdk-16.0.2-checkout/test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java\342\200\231
 to 
\342\200\230/gnu/store/.links/03zscrls9431zipkpx1sn450llvf5hkhvbkpjk57j22f7d2nb3vs\342\200\231\n",
 221
--8<---------------cut here---------------end--------------->8---

(This message comes from optimise-store.cc:192.)

This suggests that ‘guix repl -t machine’ has filled its output buffer;
its output isn’t being read so it gets stuck in that write(2) call.  Its
parent ‘sshd’ process is stuck in:

--8<---------------cut here---------------start------------->8---
ppoll([{fd=3, events=POLLIN}, {fd=4, events=0}, {fd=10, events=POLLIN}, {fd=12, 
events=POLLIN}, {fd=11, events=POLLOUT}], 5, NULL, [], 8^C
--8<---------------cut here---------------end--------------->8---

One hypothesis is that when transferring store items with many files,
such as openjdk-checkout here, guix-daemon on the remote machine outputs
many such “linking … to …” messages; if the client doesn’t read them,
then we end up with a deadlock like this, with the server-side process
stuck writing more messages.

However, that “linking …” message is supposed to be written only at
non-default debugging levels:

    printMsg(lvlTalkative, format("linking ‘%1%’ to ‘%2%’") % path % linkPath);

Turns out that guix-daemon on build machines is running with ‘--debug’
starting with maintenance.git commit
72d8a62466b71eda657b52ca17e99f7189ea18ad (“sysadmin: build-machines:
Enable guix-daemon debug output.”)

The immediate fix is to undo that—I’ll do that later.

The other one is to swallow ‘current-error-port’ in
‘store-import-channel’.

To be continued…

Ludo’.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]