qemu-block

Re: [PATCH v2] block/stream: Drain subtree around graph change


From: Hanna Reitz
Subject: Re: [PATCH v2] block/stream: Drain subtree around graph change
Date: Fri, 25 Mar 2022 09:50:53 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0

On 24.03.22 19:49, John Snow wrote:
On Thu, Mar 24, 2022 at 10:09 AM Hanna Reitz <hreitz@redhat.com> wrote:
When the stream block job cuts out the nodes between top and base in
stream_prepare(), it does not drain the subtree manually; it fetches the
base node, and tries to insert it as the top node's backing node with
bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
the actual base node might change (because the base node is actually not
part of the stream job) before the old base node passed to
bdrv_set_backing_hd() is installed.
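
(For illustration, roughly the pre-patch sequence in stream_prepare(), simplified from the context lines of the diff below, error handling omitted:)

    base = bdrv_filter_or_cow_bs(s->above_base);  /* fetch the base node */
    unfiltered_base = bdrv_skip_filters(base);
    /* ... */
    /* bdrv_set_backing_hd() drains internally; a parallel job can
     * finish in that drain and modify the graph, so @base may no
     * longer be the base node (or may already be freed) by the time
     * it is actually installed: */
    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);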

This has two implications:

First, the stream job does not keep a strong reference to the base node.
Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
because some other block job is drained to finish), we will get a
use-after-free.  We should keep a strong reference to that node.
Does that introduce any possibility of deadlock if we're now keeping a
strong reference? I guess not, the job can just delete its own
reference and everything's ... fine, right?

Handling of strong references doesn’t block, bdrv_ref() just increases the refcount, and bdrv_unref() decreases it (deleting the node should the refcount reach 0).
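
(As a minimal sketch of that pattern:)

    bdrv_ref(base);    /* refcount++; never blocks, so no deadlock */
    /* ... drain, graph manipulation, other users dropping their refs ... */
    bdrv_unref(base);  /* refcount--; the node is freed once it reaches 0 */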

Second, even with such a strong reference, the problem remains that the
base node might change before bdrv_set_backing_hd() actually runs and as
a result the wrong base node is installed.
ow

Well, not horrible.  Just means that a node that was supposed to be removed is still there.  (So you’ll need to commit again.)

If users pass auto_dismiss=false and dismiss the jobs manually in order (which I think is a good idea anyway), this won’t happen.

Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
case, which has five nodes, and simultaneously streams from the middle
node to the top node, and commits the middle node down to the base node.
As it is, this will sometimes crash, namely when we encounter the
above-described use-after-free.

Is there a BZ# or is this an occasional failure in 030 you saw?

Yep, just occasional failure in 030.

What does failure look like? Does it require anything special to trigger?

Running 030 quite a bunch of times, preferably in parallel as many times as you have cores.  (Perhaps disabling all test cases except test_overlapping_5() to make it a bit quicker.)

Failure for the use-after-free is that inside of bdrv_set_backing_hd(), the base node will just contain garbage and some pointer dereference fails:

#0  bdrv_inherits_from_recursive (parent=parent@entry=0x5614406438e0, child=0x8c8c8c8c8c8c8c8c, child@entry=0x5614412d0a70) at ../block.c:3328
#1  bdrv_set_file_or_backing_noperm (parent_bs=parent_bs@entry=0x5614406438e0, child_bs=child_bs@entry=0x5614412d0a70, is_backing=is_backing@entry=true, tran=tran@entry=0x5614414f5f40, errp=errp@entry=0x7ffdf44eb810) at ../block.c:3361
#2  0x000056143da61550 in bdrv_set_backing_noperm (errp=0x7ffdf44eb810, tran=0x5614414f5f40, backing_hd=0x5614412d0a70, bs=0x5614406438e0) at ../block.c:3447
#3  bdrv_set_backing_hd (bs=bs@entry=0x5614406438e0, backing_hd=backing_hd@entry=0x5614412d0a70, errp=errp@entry=0x7ffdf44eb810) at ../block.c:3459
#4  0x000056143dae7778 in stream_prepare (job=0x56144160b6c0) at ../block/stream.c:78
#5  0x000056143da6a91e in job_prepare (job=0x56144160b6c0) at ../job.c:837
#6  job_txn_apply (fn=<optimized out>, job=0x56144160b6c0) at ../job.c:158
#7  job_do_finalize (job=0x56144160b6c0) at ../job.c:854
#8  0x000056143da6ae02 in job_exit (opaque=0x56144160b6c0) at ../job.c:941

Hanna

Taking a strong reference to the base node, we no longer get a crash,
but the resulting block graph is less than ideal: The expected result is
obviously that all middle nodes are cut out and the base node is the
immediate backing child of the top node.  However, if stream_prepare()
takes a strong reference to its base node (the middle node), and then
the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
that middle node, the stream job will just reinstall it again.

Therefore, we need to keep the whole subtree drained in
stream_prepare(), so that the graph modification it performs is
effectively atomic, i.e. that the base node it fetches is still the base
node when bdrv_set_backing_hd() sets it as the top node's backing node.
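
(Distilled from the diff below, the resulting critical section in stream_prepare() is:)

    bdrv_subtree_drained_begin(s->above_base);   /* quiesce the subtree */

    base = bdrv_filter_or_cow_bs(s->above_base);
    if (base) {
        bdrv_ref(base);   /* keep @base alive across the graph change */
    }

    /* The subtree is already drained, so the drain inside
     * bdrv_set_backing_hd() cannot let another job finish and change
     * the graph: @base is still the base node when it is installed. */
    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);

    if (base) {
        bdrv_unref(base);
    }
    bdrv_subtree_drained_end(s->above_base);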

Verify this by asserting in said 030's test case that the base node is
always the top node's immediate backing child when both jobs are done.

(Off-topic: Sometimes I dream about having a block graph rendering
tool that can update step-by-step, so I can visualize what these block
operations look like. My "working memory" is kind of limited and I get
sidetracked too easily tracing code. That we have the ability to
render at a single point is pretty nice, but it's still hard to get
images from intermediate steps when things happen in tight sequence
without the chance for intervention.)

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
---
v2:
- Oops, the base can be NULL.  Would have noticed if I had run all test
   cases from 030, and not just test_overlapping_5()...
   That means that keeping a strong reference to the base node must be
   conditional, based on whether there even is a base node or not.
   (I mean, technically we do not even need to keep a strong reference to
   that node, given that we are in a drained section, but I believe it is
   better style to do it anyway.)
---
  block/stream.c         | 15 ++++++++++++++-
  tests/qemu-iotests/030 |  5 +++++
  2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/block/stream.c b/block/stream.c
index 3acb59fe6a..694709bd25 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -64,7 +64,13 @@ static int stream_prepare(Job *job)
      bdrv_cor_filter_drop(s->cor_filter_bs);
      s->cor_filter_bs = NULL;

+    bdrv_subtree_drained_begin(s->above_base);
+
      base = bdrv_filter_or_cow_bs(s->above_base);
+    if (base) {
+        bdrv_ref(base);
+    }
+
      unfiltered_base = bdrv_skip_filters(base);

      if (bdrv_cow_child(unfiltered_bs)) {
@@ -75,14 +81,21 @@ static int stream_prepare(Job *job)
                  base_fmt = unfiltered_base->drv->format_name;
              }
          }
+
          bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
         ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
          if (local_err) {
              error_report_err(local_err);
-            return -EPERM;
+            ret = -EPERM;
+            goto out;
          }
      }

+out:
+    if (base) {
+        bdrv_unref(base);
+    }
+    bdrv_subtree_drained_end(s->above_base);
      return ret;
  }

diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 567bf1da67..14112835ed 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -436,6 +436,11 @@ class TestParallelOps(iotests.QMPTestCase):
          self.vm.run_job(job='node4', auto_dismiss=True)
          self.assert_no_active_block_jobs()

+        # Assert that node0 is now the backing node of node4
+        result = self.vm.qmp('query-named-block-nodes')
+        node4 = next(node for node in result['return'] if node['node-name'] == 'node4')
+        self.assertEqual(node4['image']['backing-image']['filename'], self.imgs[0])
+
      # Test a block-stream and a block-commit job in parallel
      # Here the stream job is supposed to finish quickly in order to reproduce
      # the scenario that triggers the bug fixed in 3d5d319e1221 and 1a63a907507
--
2.35.1

Seems reasonable, but the best I can give right now is an ACK because
I'm a little rusty with block graph ops ...

--js
