bug-guix
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#52051: [core-updates-frozen] cannot login ('org.freedesktop.login1'


From: Maxim Cournoyer
Subject: bug#52051: [core-updates-frozen] cannot login ('org.freedesktop.login1' service times out)
Date: Fri, 10 Dec 2021 00:09:28 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Hello Ludovic!

Ludovic Courtès <ludo@gnu.org> writes:

> Hello!
>
> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> 374   connect(11, {sa_family=AF_UNIX, 
>> sun_path="/var/run/dbus/system_bus_socket"}, 34) = 0
>
> [...]
>
>> 374   epoll_wait(5, [{events=EPOLLIN|EPOLLOUT|EPOLLHUP, data={u32=24802800, 
>> u64=24802800}}], 20, -1) = 1
>> 374 sendmsg(11, {msg_name=NULL, msg_namelen=0,
>> msg_iov=[{iov_base="l\1\0\1\0\0\0\0\1\0\0\0m\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0",
>> iov_len=128}], msg_iovlen=1, msg_controllen=0, msg_flags=0},
>> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 EPIPE (Broken pipe)
>> 374   gettid()                          = 374
>> 374   epoll_ctl(5, EPOLL_CTL_MOD, 11, {events=0, data={u32=24802800, 
>> u64=24802800}}) = 0
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, 
>> tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, 
>> {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, 
>> tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, 
>> {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   gettid()                          = 374
>> 374   timerfd_settime(12, TFD_TIMER_ABSTIME, {it_interval={tv_sec=0, 
>> tv_nsec=0}, it_value={tv_sec=0, tv_nsec=1}}, NULL) = 0
>> 374   epoll_wait(5, [{events=EPOLLHUP, data={u32=24802800, u64=24802800}}, 
>> {events=EPOLLIN, data={u32=24764384, u64=24764384}}], 20, -1) = 2
>> 374   read(12, "\1\0\0\0\0\0\0\0", 8)   = 8
>> 374   epoll_ctl(5, EPOLL_CTL_DEL, 11, NULL) = 0
>> 374   close(11)                         = 0
>> 374   gettid()                          = 374
>> 374   epoll_wait(5,  <unfinished ...>
>> 391   <... close resumed>)              = 0
>> 391   madvise(0x7fd6c83dc000, 8368128, MADV_DONTNEED) = 0
>> 391   exit(0)                           = ?
>> 391   +++ exited with 0 +++
>> 374   <... epoll_wait resumed>[{events=EPOLLERR, data={u32=24768000, 
>> u64=24768000}}], 17, -1) = 1
>> 374   lseek(7, 0, SEEK_SET)             = 0
>> 374   read(7, "tty7\n", 63)             = 5
>
> As you pointed out on IRC, the initially ‘Hello’ method call above leads
> to EPIPE, and we can see that elogind eventually closes its socket to
> dbus-daemon *but* keeps doing its thing.
>
> Some interesting things to note…

Indeed, very interesting :-).  Thanks for diving in!

> First, to my surprise, elogind does not use the client library of the
> ‘dbus’ package:
>
> $ guix gc --references $(./pre-inst-env guix build elogind)|grep dbus
> $ echo $?
> 1
>
>
> (This is already the case in ‘master’ with v243.7.)  Instead, it has its
> own implementation of the DBus protocol, in C, from systemd—we can’t
> have enough sources of bugs and vulnerabilities.

Oh!  Marius found this interesting issue [0] that they shared on IRC
today; I wonder if it may be related.  sd-bus apparently pipeline things
aggressively, which is not always handled well by other D-bus
participants.

[0]  https://github.com/systemd/systemd/issues/16610

> Anyway, the “Hello” message is sent to the system bus asynchronously in
> ‘sd-bus.c’:
>
> static int bus_send_hello(sd_bus *bus) {
>         _cleanup_(sd_bus_message_unrefp) sd_bus_message *m = NULL;
>         int r;
>
>         assert(bus);
>
>         if (!bus->bus_client)
>                 return 0;
>
>         r = sd_bus_message_new_method_call(
>                         bus,
>                         &m,
>                         "org.freedesktop.DBus",
>                         "/org/freedesktop/DBus",
>                         "org.freedesktop.DBus",
>                         "Hello");
>         if (r < 0)
>                 return r;
>
>         return sd_bus_call_async(bus, NULL, m, hello_callback, NULL, 0);
> }
>
>
> A callback is called when a reply is received or an error arises:
>
> static int hello_callback(sd_bus_message *reply, void *userdata, sd_bus_error 
> *error) {
>
> [...]
>
> fail:
>         /* When Hello() failed, let's propagate this in two ways: first we 
> return the error immediately here,
>          * which is the propagated up towards the event loop. Let's also 
> invalidate the connection, so that
>          * if the user then calls back into us again we won't wait any 
> longer. */
>
>         bus_set_state(bus, BUS_CLOSING);
>         return r;
> }
>
>
> It’s not clear from that whether the authors intended for the thing to
> keep going in case of failure.  In our case it’s not helpful.

If we picture this in the systemd use case, I believe sd-bus must be
used as *the* d-bus daemon, perhaps?  So it should never die, and expect
users to call back into it to retry things?  In our case, it acts as a
D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
right for our use case.

> But why does dbus-daemon drop the connection in the first place?
>
> To know that, we could change ‘dbus-root-service-type’ to run
> dbus-daemon from a ‘--enable-verbose-mode’ build, and with the
> ‘DBUS_VERBOSE’ environment set to 1.
>
> Looking at ‘dbus-server-socket.c’ it would seem that the only sensible
> way this can happen is if dbus-daemon doesn’t yet have a
> ‘new_connection_function’ set at the time it accepts the incoming
> connection:

Interesting that you mention this; that's what I worked on yesterday
(see attached patches).  I managed to get elogind panicking the kernel
during a guix system reconfigure, which corrupted GRUB, so had to chroot
and reconfigure from there [1].  Not sure what happened, but it seems
that killing and restarting elogind is susceptible to cause hard locks.

[1]  https://lists.gnu.org/archive/html/guix-devel/2018-06/msg00202.html

Attachment: 0001-gnu-Add-dbus-verbose.patch
Description: Text Data

Attachment: 0002-services-dbus-Add-a-VERBOSE-configuration-option.patch
Description: Text Data

Attachment: 0003-build-Move-D-Bus-parser-to-its-own-build-module.patch
Description: Text Data

Attachment: 0004-services-telephony-Factorize-bus-polling-logic.patch
Description: Text Data

Attachment: 0005-services-dbus-Add-synchronization-primitives-to-dbus.patch
Description: Text Data


> static dbus_bool_t
> handle_new_client_fd_and_unlock (DBusServer *server,
>                                  DBusSocket  client_fd)
> {
>
> [...]
>
>   /* See if someone wants to handle this new connection, self-referencing
>    * for paranoia.
>    */
>   new_connection_function = server->new_connection_function;
>   new_connection_data = server->new_connection_data;
>
>   _dbus_server_ref_unlocked (server);
>   SERVER_UNLOCK (server);
>
>   if (new_connection_function)
>     {
>       (* new_connection_function) (server, connection,
>                                    new_connection_data);
>     }
>   dbus_server_unref (server);
>
>   /* If no one grabbed a reference, the connection will die. */
>   _dbus_connection_close_if_only_one_ref (connection);
>   dbus_connection_unref (connection);
>
> We know that elogind is started after dbus-daemon has written its PID
> file (there’s a Shepherd service dependency).  Is there a possibility
> that dbus-daemon writes its PID file before it has set
> ‘new_connection_function’?

Are PID files conventionally agreed to be synchronization primitives?
In the case of systemd, I feel this may be more of an afterthought,
given the system is designed to rely on socket-based activation.  Still
just guessing though, and your analysis above seems to suggest
otherwise.

> It would seem that ‘bus_context_new’ writes the PID file after it has
> called ‘dbus_server_set_new_connection_function’ via ‘setup_server’ via
> ‘process_config_first_time_only’.
>
> So not sure what happens.
>
> That’s it for today.  Thoughts?

Thank you!

I'm sorry for my lack of deep analysis here, I've spent most of my
evening attempting to fix my system just to boot ^^'.  I've at least
managed to collect the following verbose D-Bus log (54 MiB uncompressed)
which hopefully can shed some light onto how this failure came to be.

Attachment: dbus-daemon.log.xz
Description: Binary data

Happy Friday!

Maxim

reply via email to

[Prev in Thread] Current Thread [Next in Thread]