qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 6/7] block/nbd: decouple reconnect from drain


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH 6/7] block/nbd: decouple reconnect from drain
Date: Tue, 16 Mar 2021 21:09:12 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

16.03.2021 19:03, Roman Kagan wrote:
On Mon, Mar 15, 2021 at 11:10:14PM +0300, Vladimir Sementsov-Ogievskiy wrote:
15.03.2021 09:06, Roman Kagan wrote:
The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

Since the pieces of the reconnection logic are now properly migrated
from one aio_context to another, it appears safe to just stop messing
with the drained section in the reconnection code.

Fixes: 5ad81b4946 ("nbd: Restrict connection_co reentrance")

I'd not think that it "fixes" it. Behavior changes.. But 5ad81b4946
didn't introduce any bugs.

I guess you're right.

In fact I did reproduce the situation when the drain made no progress
exactly becase the only in-flight reference was taken by the
connection_co, but it may be due to some intermediate stage of the patch
development so I need to do a more thorough analysis to tell if it was
triggerable with the original code.

Fixes: 8c517de24a ("block/nbd: fix drain dead-lock because of nbd 
reconnect-delay")

And here..

1. There is an existing problem (unrelated to nbd) in Qemu that long
io request which we have to wait for at drained_begin may trigger a
dead lock
(https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html)

2. So, when we have nbd reconnect (and therefore long io requests) we
simply trigger this deadlock.. That's why I decided to cancel the
requests (assuming they will most probably fail anyway).

I agree that nbd driver is wrong place for fixing the problem
described in
(https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg01339.html),
but if you just revert 8c517de24a, you'll see the deadlock again..

I may have misunderstood that thread, but isn't that deadlock exactly
due to the requests being unable to ever drain because the
reconnection process is suspended while the drain is in progress?



Hmm, I didn't thought about it this way.  What you mean is that reconnection is 
cancelled on drain_begin, so drain_begin will never finish, because it waits 
for requests, which will never be reconnected. So, you are right it's a 
deadlock too.

But as I remember what is described in 8c517de24a is another deadlock, triggered by 
"just a long request during drain_begin". And it may be triggered again, if the 
we'll try to reconnect for several seconds during drained_begin() instead of cancelling 
requests. Didn't you try the scenario described in 8c517de24a on top of your series?



--
Best regards,
Vladimir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]