--- Begin Message ---
Subject: Offloading sometimes hangs
Date: Tue, 06 Feb 2018 11:04:10 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)
Hi,
On berlin.guixsd.org, offloading would sometimes hang in the middle of
an offloaded build: no more build log output showing up, nothing
happening (this is with guix-0.14.0-6.0dcf675).
On the build machine side, the guile process that forwards data between
the sshd and guix-daemon¹ is stuck on:
read(0, …)
with this stack trace:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007f09d6068aed in read () from /gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libpthread.so.0
#1 0x00007f09d653fc47 in fport_read () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#2 0x00007f09d656cd77 in scm_i_read_bytes () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#3 0x00007f09d65705fe in scm_fill_input () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#4 0x00007f09d6577897 in scm_get_bytevector_some () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#5 0x00007f09d65abc4d in vm_regular_engine () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#6 0x00007f09d65af2aa in scm_call_n () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
#7 0x00007f09d65338d7 in scm_primitive_eval () from /gnu/store/0v539yjmdqhjm1xcpvndmagkgjz5fvh2-guile-2.2.2/lib/libguile-2.2.so.1
--8<---------------cut here---------------end--------------->8---
In theory this “cannot happen”, because the forwarding process reads from
stdin only if ‘select’ reported stdin as ready.
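To make the assumption concrete, here is a minimal Python sketch of the
“read only when ‘select’ says ready” pattern. It is not the Guile code
from guix/ssh.scm; the helper names ‘read_if_ready’ and ‘read_nonblocking’
are made up for illustration. The second helper shows one generic
defensive measure (a non-blocking descriptor), which is not necessarily
what any actual fix does: if readiness is ever reported wrongly, the read
fails immediately instead of hanging.

```python
import os
import select


def read_if_ready(fd, timeout=0.0, size=4096):
    """Read from fd only if select() reports it readable within timeout.

    Returns the bytes read, or None when fd was not reported ready.
    If select() wrongly reported readiness, os.read() would block here,
    which is the hang described above.
    """
    ready, _, _ = select.select([fd], [], [], timeout)
    if fd not in ready:
        return None
    return os.read(fd, size)


def read_nonblocking(fd, size=4096):
    """Hypothetical defensive variant: never block, even on a bogus
    readiness report.  Returns None when no data is available."""
    os.set_blocking(fd, False)
    try:
        return os.read(fd, size)
    except BlockingIOError:
        return None
```

With an empty pipe, both helpers return None; once data has been written
to the other end, ‘read_if_ready’ returns it.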
On the server side (on berlin itself), the corresponding ‘guix offload’
process is stuck here:
--8<---------------cut here---------------start------------->8---
(gdb) bt
#0 0x00007ff49b3590bd in poll () from target:/gnu/store/3h31zsqxjjg52da5gp3qmhkh4x8klhah-glibc-2.25/lib/libc.so.6
#1 0x00007ff48f4db377 in ssh_poll_ctx_dopoll () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#2 0x00007ff48f4dc319 in ssh_handle_packets () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#3 0x00007ff48f4dc3ed in ssh_handle_packets_termination () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#4 0x00007ff48f4c8eff in ssh_channel_read_timeout () from target:/gnu/store/3phbrya78gpk7rg6flqyqzf53y3x9zv9-libssh-0.7.5/lib/libssh.so.4
#5 0x00007ff48f930803 in read_from_channel_port () from target:/gnu/store/xfaqdvk060yz7ddc9isk3wkybqmcfj3w-guile-ssh-0.11.2/lib/libguile-ssh.so.11
#6 0x00007ff49cea7d77 in scm_i_read_bytes () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#7 0x00007ff49ceac3fc in scm_c_read_bytes () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#8 0x00007ff49ceb2838 in scm_get_bytevector_n () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#9 0x00007ff49cee6c4d in vm_regular_engine () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#10 0x00007ff49ceea2aa in scm_call_n () from target:/gnu/store/swyipr8smrd5bc72n92sdfxzx0p4cjpi-guile-2.2.2/lib/libguile-2.2.so.1
#11 0x00007ff49ce6e8d7 in scm_primitive_eval ()
--8<---------------cut here---------------end--------------->8---
Presumably the ‘scm_get_bytevector_n’ call comes from (guix
serialization) or ‘process-stderr’.
IOW we have a deadlock where both sides are waiting for input data.
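As a toy illustration of that situation (not the actual Guix or libssh
code), the Python sketch below connects two endpoints and has each one
try to read while neither ever writes. The endpoint names and the
function ‘both_sides_wait’ are hypothetical; the timeout is only there so
the example terminates and reports which sides were stalled, whereas the
real processes block indefinitely.

```python
import socket


def both_sides_wait(timeout=0.2):
    """Return the names of the endpoints whose read timed out.

    Neither endpoint sends anything, so each recv() can only time out.
    """
    build_side, offload_side = socket.socketpair()
    build_side.settimeout(timeout)
    offload_side.settimeout(timeout)
    stalled = []
    for name, sock in (("build-machine", build_side),
                       ("offload", offload_side)):
        try:
            sock.recv(1)          # waits for data the peer never sends
        except socket.timeout:
            stalled.append(name)
    build_side.close()
    offload_side.close()
    return stalled
```

Calling ‘both_sides_wait()’ reports both endpoints as stalled, which is
the symptom seen in the two backtraces: each side is parked in a read.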
Ludo’.
¹ https://git.savannah.gnu.org/cgit/guix.git/tree/guix/ssh.scm?id=0362e5820ab6a1eb8eaf33bc47e592857c25f765#n102
--- End Message ---
--- Begin Message ---
Subject: Re: bug#30365: Offloading sometimes hangs
Date: Sat, 10 Feb 2018 11:17:13 +0100
User-agent: mu4e 0.9.18; emacs 25.3.1
Hi Ludo,
> address@hidden (Ludovic Courtès) writes:
>
>> So what we have here is that the Scheme procedure ‘select’ returned
>> stdin as “ready for reading”. How did that happen? I believe this is
>> due to <https://bugs.gnu.org/30368>: ‘scm_i_prepare_to_wait_on_fd’
>> returns 1, so ‘select’ returns EINTR but it does so without clearing the
>> FD sets.
>
> I’ve pushed a workaround here:
>
> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=8446dc5a360e3a13fecea870f86efdbd893e3905
>
> and guix-0.14.0-8.bc880f9 includes that fix.
>
> It’s been running for several hours on berlin, building a bunch of
> things notably on aarch64, and it seems to work well!
Congratulations on figuring this out!
--
Ricardo
GPG: BCA6 89B6 3655 3801 C3C6 2150 197A 5888 235F ACAC
https://elephly.net
--- End Message ---