qemu-devel

Re: [PATCH 6/7] block/nbd: decouple reconnect from drain


From: Roman Kagan
Subject: Re: [PATCH 6/7] block/nbd: decouple reconnect from drain
Date: Fri, 26 Mar 2021 09:16:28 +0300

On Tue, Mar 16, 2021 at 09:09:12PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 16.03.2021 19:03, Roman Kagan wrote:
> > On Mon, Mar 15, 2021 at 11:10:14PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > 15.03.2021 09:06, Roman Kagan wrote:
> > > > The reconnection logic doesn't need to stop while in a drained section.
> > > > Moreover it has to be active during the drained section, as the requests
> > > > that were caught in-flight with the connection to the server broken can
> > > > only usefully get drained if the connection is restored.  Otherwise such
> > > > requests can only either stall resulting in a deadlock (before
> > > > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > > > machinery (after 8c517de24a).
> > > > 
> > > > Since the pieces of the reconnection logic are now properly migrated
> > > > from one aio_context to another, it appears safe to just stop messing
> > > > with the drained section in the reconnection code.
> > > > 
> > > > Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")
> > > 
> > > I wouldn't say it "fixes" it. The behavior changes, but 5ad81b4946
> > > didn't introduce any bugs.
> > 
> > I guess you're right.
> > 
> > In fact I did reproduce a situation where the drain made no progress
> > exactly because the only in-flight reference was held by
> > connection_co, but that may have been due to some intermediate stage
> > of the patch development, so I need to do a more thorough analysis to
> > tell whether it was triggerable with the original code.
> > 
> > > > Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
> > > > reconnect-delay")
> > > 
> > > And here..
> > > 
> > > 1. There is an existing problem in QEMU (unrelated to nbd): a long
> > > I/O request that we have to wait for at drained_begin may trigger a
> > > deadlock
> > > (https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html)
> > > 
> > > 2. So, when we have nbd reconnect (and therefore long I/O requests),
> > > we simply trigger this deadlock. That's why I decided to cancel the
> > > requests (assuming they would most probably fail anyway).
> > > 
> > > I agree that the nbd driver is the wrong place to fix the problem
> > > described in
> > > https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html,
> > > but if you just revert 8c517de24a, you'll see the deadlock again.
> > 
> > I may have misunderstood that thread, but isn't that deadlock exactly
> > due to the requests being unable to ever drain because the
> > reconnection process is suspended while the drain is in progress?
> > 
> 
> 
> Hmm, I hadn't thought about it this way.  What you mean is that
> reconnection is cancelled on drain_begin, so drain_begin will never
> finish, because it waits for requests which will never be
> reconnected. So you are right, it's a deadlock too.

Right.
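
To spell the cycle out with a toy model (just my sketch of the control
flow, not the actual QEMU code; all names here are made up):

#include <stdbool.h>
#include <stdio.h>

static int  in_flight = 1;          /* a request caught with the link down */
static bool reconnect_active = true;

/* stand-in for one iteration of the nbd reconnect logic */
static void reconnect_poll(void)
{
    if (reconnect_active) {
        in_flight = 0;              /* connection restored, request drains */
    }
}

static void drained_begin(bool suspend_reconnect)
{
    if (suspend_reconnect) {
        reconnect_active = false;   /* pre-series behavior */
    }
    while (in_flight > 0) {         /* BDRV_POLL_WHILE-style wait: spins */
        reconnect_poll();           /* forever if reconnect is suspended */
    }
}

int main(void)
{
    drained_begin(false);           /* reconnect left active: the stuck  */
    puts("drain completed");        /* request completes, drain finishes */
    return 0;
}

With suspend_reconnect == true the while loop never terminates, which is
the hang in question.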

> But as I remember, what is described in 8c517de24a is another deadlock,
> triggered by "just a long request during drain_begin". And it may be
> triggered again if we try to reconnect for several seconds
> during drained_begin() instead of cancelling requests.

IMO it wasn't a different deadlock, it was exactly this one: the drain
got stuck in the way described above with the qemu global mutex held, so
the rest of qemu got stuck too.

> Didn't you try the scenario described in 8c517de24a on top of your
> series?

I've tried it with the current master with just 8c517de24a reverted --
the deadlock reproduced in every one of several attempts.  Then with my
series -- the deadlock didn't reproduce in any of several attempts.
More precisely, qemu appeared frozen once the guest timed out the
requests and initiated an ATA link soft reset, which presumably caused
that drain; but as soon as the reconnect timer expired (resulting in the
requests completing in the guest with errors) or the connection was
restored (resulting in the requests completing successfully), it resumed
normal operation.

So yes, I think this series does address that deadlock.

Thanks,
Roman.


