
Re: [PATCH] monitor/qmp: fix race with clients disconnecting early


From: Markus Armbruster
Subject: Re: [PATCH] monitor/qmp: fix race with clients disconnecting early
Date: Wed, 25 Aug 2021 17:54:40 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)

Stefan Reiter <s.reiter@proxmox.com> writes:

> From: Stefan Reiter <stefan@pimaker.at>
>
> The following sequence can produce a race condition that results in
> responses meant for different clients being sent to the wrong one:
>
> (QMP, no OOB)
> 1) client A connects
> 2) client A sends 'qmp_capabilities'
> 3) 'qmp_dispatch' runs in coroutine, schedules out to
>    'do_qmp_dispatch_bh' and yields
> 4) client A disconnects (i.e. aborts, crashes, etc...)
> 5) client B connects
> 6) 'do_qmp_dispatch_bh' runs 'qmp_capabilities' and wakes calling coroutine
> 7) capabilities are now set and 'mon->commands' is set to '&qmp_commands'
> 8) 'qmp_dispatch' returns to 'monitor_qmp_dispatch'
> 9) success message is sent to client B *without it ever having sent
>    'qmp_capabilities' itself*
> 9a) even if client B ignores it, it will now presumably send its own
>    greeting, which will fail because capabilities are already set
>
> The fix proposed here uses an atomic, sequential connection number
> stored in the MonitorQMP struct, which is incremented every time a new
> client connects. Since it is not changed on CHR_EVENT_CLOSED, the
> behaviour of allowing a client to disconnect only one side of the
> connection is retained.
>
> The connection_nr needs to be exposed outside of the monitor subsystem,
> since qmp_dispatch lives in qapi code. It needs to be checked twice,
> once for actually running the command in the BH (fixes 7/9a), and once for
> sending back a response (fixes 9).
>
> This satisfies my local reproducer: multiple clients constantly looping to
> open a connection, send the greeting, and then exit no longer crash other,
> normally behaving clients with unrelated responses.
>
> Signed-off-by: Stefan Reiter <s.reiter@proxmox.com>
> ---
>
> This is another patch in the apparently never-ending series of fixes to
> QMP-related race conditions. Our PVE users really seem to have a knack for
> triggering these ;)

Fortunately, you seem to have a knack for fixing them!

> Here's one of the (several) forum threads where we tried to diagnose the issue:
> https://forum.proxmox.com/threads/error-with-backup-when-backing-up-qmp-command-query-backup-failed-got-wrong-command-id.88017/
>
> It manifested itself under load, where sometimes our monitor code (which
> uses the 'id' system of QMP to track commands) would receive bogus
> responses, showing up as "got wrong command id" errors in our logs.

Let me re-explain the bug in my own words, to make sure I understand.

A QMP monitor normally runs in the monitor I/O thread.

A QMP monitor can serve only one client at a time.

It executes out-of-band commands right as it reads them.  In-band
commands are queued, and executed one after the other in the main loop.

Command output is buffered.  We write it out as fast as the character
device can take it.  If a write fails, we throw away the entire buffer
contents.

A client can disconnect at any time.  This throws away the queue.  An
in-band command may be executing in the main loop.  An out-of-band
command may be executing in the monitor's thread.

Such commands (if any) are not affected by the disconnect.  Their output
gets buffered, but writing it out fails, so it's thrown away.

*Except* when another client connects quickly enough.  Then we send it
output meant for the previous client.  This is wrong.  I suspect this
could even send invalid JSON.

Special case: if in-band command qmp_capabilities is executing when the
client disconnects, and another client connects before the command flips
the monitor from capabilities negotiation mode to command mode, that
client starts in the wrong mode.

> I'm not sure this approach is the best fix, it seems a lot like a band-aid
> to me, but it's the best I could come up with for now - open for different
> ideas of course.

I need to think through the ramifications.  Please ping me if you don't
hear from me this week.

> Note that with this patch, we're no longer running a command that was
> scheduled after a client has disconnected early. This seems like a slight
> behaviour change to me...

The commit message needs to spell this out.

> On the other hand, I didn't want to drag the connection number into
> qmp_capabilities() and special case that even further.
>
> I didn't look too deeply into 'QMP in iothread' yet, if that does what I
> think it does there might be two more potential races here:
> * between 'do_qmp_dispatch_bh' running 'aio_co_wake' and 'qmp_dispatch' 
> actually
>   yielding (very unlikely though)
> * plus a TOCTOU in 'do_qmp_dispatch_bh' with the atomic read of the
>   connection_nr and actually running cmd->fn()

We should discuss these in more detail.

> Thanks!

Thank *you*!


