Re: [PATCH v2] block/stream: Drain subtree around graph change


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH v2] block/stream: Drain subtree around graph change
Date: Mon, 28 Mar 2022 13:24:56 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

28.03.2022 11:09, Hanna Reitz wrote:
On 28.03.22 09:44, Hanna Reitz wrote:
On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
24.03.2022 17:09, Hanna Reitz wrote:
When the stream block job cuts out the nodes between top and base in
stream_prepare(), it does not drain the subtree manually; it fetches the
base node, and tries to insert it as the top node's backing node with
bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
the actual base node might change (because the base node is actually not
part of the stream job) before the old base node passed to
bdrv_set_backing_hd() is installed.

This has two implications:

First, the stream job does not keep a strong reference to the base node.
Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
because some other block job is drained to finish), we will get a
use-after-free.  We should keep a strong reference to that node.

Second, even with such a strong reference, the problem remains that the
base node might change before bdrv_set_backing_hd() actually runs and as
a result the wrong base node is installed.
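
For illustration, the problematic sequence is roughly the following (a simplified sketch only, not the literal block/stream.c code; the helper and field names are assumed here, and error handling is omitted):

    static int stream_prepare(Job *job)
    {
        StreamBlockJob *s = container_of(job, StreamBlockJob, common.job);
        BlockDriverState *unfiltered_bs = bdrv_skip_filters(s->target_bs);
        /* Fetch what is the base right now: the COW child below above_base */
        BlockDriverState *base = bdrv_filter_or_cow_bs(s->above_base);

        /*
         * bdrv_set_backing_hd() drains internally; while it drains, another
         * client (e.g. a parallel commit job finishing) may delete or replace
         * the node fetched above, so `base` can be stale -- or already freed,
         * because we hold no strong reference -- by the time it is installed
         * as unfiltered_bs's backing node.
         */
        bdrv_set_backing_hd(unfiltered_bs, base, NULL);
        return 0;
    }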

Hmm.

So, we don't really need a strong reference: if it only helps us avoid a use-after-free, it means we'll still end up with the wrong block graph..

Sure.  But I found it better style to strongly reference a node while it’s 
used.  I’d rather have an outdated block graph (as in: A node that was supposed 
to disappear would still be in use) than a use-after-free.

Graph modifying operations must be somehow isolated from each other.


Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
case, which has five nodes, and simultaneously streams from the middle
node to the top node, and commits the middle node down to the base node.
As it is, this will sometimes crash, namely when we encounter the
above-described use-after-free.

Taking a strong reference to the base node, we no longer get a crash,
but the resulting block graph is less than ideal: The expected result is
obviously that all middle nodes are cut out and the base node is the
immediate backing child of the top node.  However, if stream_prepare()
takes a strong reference to its base node (the middle node), and then
the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
that middle node, the stream job will just reinstall it again.

Therefore, we need to keep the whole subtree drained in
stream_prepare(), so that the graph modification it performs is
effectively atomic, i.e. that the base node it fetches is still the base
node when bdrv_set_backing_hd() sets it as the top node's backing node.
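
Concretely, the approach amounts to something like the following (again only a sketch of the idea under discussion, reusing the names from the sketch above; above_base is assumed to be the filter node the job keeps directly above its base):

    /* Keep the subtree quiescent, so that no drain triggered inside
     * bdrv_set_backing_hd() can change the graph under our feet */
    bdrv_subtree_drained_begin(s->above_base);

    base = bdrv_filter_or_cow_bs(s->above_base);  /* cannot change anymore */
    bdrv_ref(base);                               /* strong reference, so the
                                                     node cannot go away */
    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
    bdrv_unref(base);

    bdrv_subtree_drained_end(s->above_base);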

Emanuele has a similar idea of isolating graph changes from each other by subtree-drain.

If I understand correctly, the idea is that we'll drain all other block jobs so they wouldn't do their block-graph modifications during the drained section. So we can safely modify the graph.

I don't like this idea:

1. drained section = stop IO. But we don't need to stop IO in the whole subtree 
to do a needed block-graph modification.

If you mean to say that draining just the single node should be sufficient, 
I’ll be happy to change it.

Not sure which node, though, because I’d think it would be `base`, but to 
safely fetch it I’d need to drain it, which seems to bite itself in the tail.  
That’s why I went for a subtree drain from `above_base`.

2. A drained section is not a lock; several clients may drain the same set of nodes. So we exploit the fact that concurrent clients will be paused by the drained section and won't proceed to their graph-modification code. But are we sure that block jobs are (and will be?) the only concurrent block-graph-modifying clients? Can QMP commands interleave somehow?

They can under very specific circumstances and that’s a bug.  See 
https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html .

Can some job from another subtree start a block-graph modification that touches our subtree?

That would be wrong.  A block job shouldn’t change nodes it doesn’t own; stream 
doesn’t own the base, but it also doesn’t change it, it only needs to have the 
top node point to it.

If we go this way, it would be safer to drain the whole block graph on any block-graph modification..

I think we'd better have a separate global mechanism for isolating graph modifications. Something like a global co-mutex or queue, where clients wait for their turn to modify the block graph.

Here is my old proposal on that topic: 
https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
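
For the sake of discussion, such a mechanism could look roughly like this (an illustrative sketch only; none of these names exist in the tree):

    #include "qemu/coroutine.h"

    /* One global lock serializing all block-graph modifications */
    static CoMutex graph_modify_lock;   /* qemu_co_mutex_init() it at startup */

    void coroutine_fn bdrv_graph_modify_lock(void)
    {
        qemu_co_mutex_lock(&graph_modify_lock);
    }

    void coroutine_fn bdrv_graph_modify_unlock(void)
    {
        qemu_co_mutex_unlock(&graph_modify_lock);
    }

    /* Every client that changes the graph (job completion, QMP command, ...)
     * would then wrap its modification: */
    bdrv_graph_modify_lock();
    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
    bdrv_graph_modify_unlock();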

That would only solve the very specific issue in 030, right?

It should solve the same issue as your patch does. You don't add subtree_drain around every graph modification.. Or do we already have that?

The stream job isn’t protected from any graph modifications but those coming
from mirror.  Might be a solution going forward (I didn’t look closer at it at 
the time, given I saw you had a discussion with Kevin), if we lock every graph 
change operation (though a global lock honestly doesn’t sound strictly better 
than draining subsections of the graph, both have their drawbacks), but that 
doesn’t look like it’d be something for 7.1.

In the same way, with the draining solution you'd have to take a subtree drain for every graph-change operation.


I wonder whether we could have a short-term version of `BdrvChild.frozen` 
that’s a coroutine mutex.  If `.frozen` is set, you just can’t change the 
graph, and you also can’t wait, so that’s just an error.  But if `.frozen_lock` 
is set, you can wait on it. Here, we’d keep `.frozen` set for all links between 
top and above_base, and then in prepare() take `.frozen_lock` on the link 
between above_base and base.
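
In rough terms (a purely hypothetical sketch; neither the new field nor any helpers around it exist today):

    typedef struct BdrvChild {
        /* ... */
        bool frozen;           /* existing: graph change here is an error */
        CoMutex frozen_lock;   /* new: graph change has to wait for its turn */
        /* ... */
    } BdrvChild;

    /* stream would keep .frozen set on the top..above_base links as it does
     * today, and stream_prepare() would take .frozen_lock on the
     * above_base->base link before fetching base and calling
     * bdrv_set_backing_hd(). */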


Yes, that seems to be an alternative to the global lock that doesn't block the whole graph. Still, I don't think it is bad to lock the whole graph for a graph modification, as modifications should be rare and fast.


Another thought: does subtree-drain really drain the whole connected component of the graph?

Imagine something like this:

[A]  [   C  ]
 |    |    |
 v    v    v
[ B    ]  [ D ]


If we do a subtree drain at node A, it will drain B and C, but not D..

Imagine another job is attached to node D, and it starts a drained section too. Then, for example, both jobs will share the drained section on node C. That doesn't seem safe, and draining is not a lock.

So, if we are going to rely on the drained section as a lock that isolates graph modifications from each other, we should drain the whole connected component of the graph.


Next, I'm not really sure that two jobs can't simultaneously enter drained sections and do graph modifications. What prevents this? Assume two block-stream jobs reach their finish simultaneously and go to subtree-drain. That just means that job_pause will be called for both jobs.. But what does that mean for a block-stream job that is already inside the bdrv_subtree_drained_begin() call in stream_prepare()? Seemingly nothing? Then both jobs will start their graph-modification process simultaneously and can interleave at any yield point (for example, rewriting backing_file in the qcow2 metadata).


Another reason why I think that subtree drain is the wrong tool, as I said, is the extra IO stop.

Imagine the following graph:

[A]
 |
 v
[B] [C]
 |   |
 v   v
[base]

If we want to rebase A onto base, we actually only need to stop IO requests in A and B. Why should C suffer from this graph modification? IO requests produced by C, which live in C and in base, don't intersect with the process of rebasing A onto base in any way.

====

Actually, I'm not strictly against your patch, and I believe that it fixes the problem in most cases. And it's probably OK in the short term. The only real doubt about including it now is that drained sections sometimes lead to deadlocks: is it possible that we fix a bug that only happens in iotest 030 (or is it reported somewhere?) while risking the introduction of some deadlock? It seems that if it's safe to call drained_begin() in some code, it should be safe to call subtree_drained_begin() there; and if that triggers some deadlock, it just exposes another bug.. Is it worth fixing now, so close to the 7.0 release? We have lived with this bug for years.. Or has something changed that I missed?

--
Best regards,
Vladimir


