Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Date: Thu, 9 Dec 2021 19:51:02 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0

09.12.2021 19:32, Stefan Hajnoczi wrote:
> On Thu, Dec 09, 2021 at 04:45:13PM +0100, Hanna Reitz wrote:
>> On 09.12.21 15:23, Stefan Hajnoczi wrote:
>>> The BlockBackend root child can change during bdrv_drained_begin() when
>>> aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
>>> and blk_drain() is left with a dangling BDS pointer.
>>>
>>> One example is scsi_device_purge_requests(), which calls blk_drain() to
>>> wait for in-flight requests to cancel. If the backup blockjob is active,
>>> then the BlockBackend root child is a temporary filter BDS owned by the
>>> blockjob. The blockjob can complete during bdrv_drained_begin() and the
>>> last reference to the BDS is released when the temporary filter node is
>>> removed. This results in a use-after-free when blk_drain() calls
>>> bdrv_drained_end(bs) on the dangling pointer.
>>>
>>> The general problem is that a function and its callers must not assume
>>> that bs is still valid across aio_poll(). Explicitly hold a reference to
>>> bs in blk_drain() to avoid the dangling pointer.
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>> I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
>>> 0 (at least when running "make check"), so it is currently not possible
>>> to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
>>> bdrv_do_drained_end() because they will be unbalanced. That would have
>>> been a more general solution than only fixing blk_drain().
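
(For reference, a minimal sketch of the fix described in the commit message
above; this is an illustration assuming blk_drain()'s structure in
block/block-backend.c at the time, not the exact patch hunk:)

  void blk_drain(BlockBackend *blk)
  {
      BlockDriverState *bs = blk_bs(blk);

      if (bs) {
          bdrv_ref(bs);            /* pin bs so it survives aio_poll() */
          bdrv_drained_begin(bs);
      }

      /* We may have -ENOMEDIUM completions in flight */
      AIO_WAIT_WHILE(blk_get_aio_context(blk),
                     qatomic_mb_read(&blk->in_flight) > 0);

      if (bs) {
          bdrv_drained_end(bs);    /* safe: our ref keeps bs valid */
          bdrv_unref(bs);
      }
  }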

>> Deleting nodes that have a `quiesce_counter > 0` doesn't seem wrong to me:
>> deletion only depends on strong references, so I'd expect that anything
>> that increases the quiesce_counter also holds a strong reference to the
>> node if the former wants the latter to stay around.
>>
>> I suppose we could make it so that both the quiesce_counter and the refcnt
>> need to be 0 before a BDS is deleted (and then deletion could happen from
>> both bdrv_unref() and drained_end), but I don't know whether that's really
>> necessary. I'd rather leave it to the caller to ensure they keep a strong
>> reference throughout the drain.
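
(A minimal sketch of that alternative; the helper name is hypothetical:)

  /* Hypothetical: delete a BDS only once *both* counters reach zero */
  static void bdrv_maybe_delete(BlockDriverState *bs)
  {
      if (bs->refcnt == 0 && bs->quiesce_counter == 0) {
          bdrv_delete(bs);
      }
  }

  /* ...which would be called from both bdrv_unref() and
   * bdrv_do_drained_end() instead of deleting in bdrv_unref() alone. */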

>> The question is, how often do we have a situation like this, where we
>> take a weak reference for draining because we assume there's a strong
>> reference backing us up (namely the one through blk->root), but that
>> strong reference can then go away due to the draining...

>>> Any suggestions for a better fix?

>> The fix makes sense to me.

> Okay. My concern was that this is a whole class of bugs and my patch
> only fixes blk_drain(). I have audited the code some more in the
> meantime.
>
> bdrv_insert_node() may be unsafe in the case where bs is a temporary
> filter node that is unref'd during bdrv_drained_begin():

>    BlockDriverState *bdrv_insert_node(BlockDriverState *bs, QDict *options,
>                                       int flags, Error **errp)
>    {
>        ERRP_GUARD();
>        int ret;
>        BlockDriverState *new_node_bs = NULL;
>        const char *drvname, *node_name;
>        BlockDriver *drv;
>
>        drvname = qdict_get_try_str(options, "driver");
>        if (!drvname) {
>            error_setg(errp, "driver is not specified");
>            goto fail;
>        }
>
>        drv = bdrv_find_format(drvname);
>        if (!drv) {
>            error_setg(errp, "Unknown driver: '%s'", drvname);
>            goto fail;
>        }
>
>        node_name = qdict_get_try_str(options, "node-name");
>        new_node_bs = bdrv_new_open_driver_opts(drv, node_name, options,
>                                                flags, errp);
>        options = NULL; /* bdrv_new_open_driver() eats options */
>        if (!new_node_bs) {
>            error_prepend(errp, "Could not create node: ");
>            goto fail;
>        }
>
>        bdrv_drained_begin(bs);
>        ^^^^^^^^^^^^^^^^^^^^^^ <--- bs can be a dangling pointer here
>        ret = bdrv_replace_node(bs, new_node_bs, errp);
>        bdrv_drained_end(bs);

> The fix isn't as simple as blk_drain() because we don't want to insert
> the new node before the now-deleted node. I think the correct way to
> insert a node is against BdrvChild, not BlockDriverState. That way we
> can be sure the new node will be inserted into a graph that is reachable
> via BdrvChild (e.g. BlockBackend) instead of a detached BDS.
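
(A rough sketch of what inserting against a BdrvChild might look like; the
function name is hypothetical, creation of the new node from options is left
out, and error/permission handling is omitted:)

  BlockDriverState *bdrv_insert_node_on_child(BdrvChild *child,
                                              BlockDriverState *new_node_bs,
                                              Error **errp)
  {
      BlockDriverState *old_bs = child->bs;
      int ret;

      bdrv_ref(old_bs);               /* pin old_bs across aio_poll() */
      bdrv_drained_begin(old_bs);
      /* child still identifies the edge even if old_bs lost other refs */
      ret = bdrv_replace_node(old_bs, new_node_bs, errp);
      bdrv_drained_end(old_bs);
      bdrv_unref(old_bs);

      return ret < 0 ? NULL : new_node_bs;
  }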

> bdrv_set_aio_context_ignore() and blk_io_limits_disable() need to ref bs
> like blk_drain() does in this patch.
>
> There are some other bdrv_drained_begin() calls that I'm assuming are
> safe because they happen during creation/deletion, so I think we either
> hold strong references there or nothing else knows about our BDS yet.
>
> Do you agree with extending this patch series to cover the functions I
> mentioned above?

I'm not sure.

First, we can't support arbitrary graph changes while another graph-changing
operation is in progress.

Actually, when we do one specific graph-changing operation, we should forbid
all other graph-changing operations; they should wait. Possibly, by adding
strong references everywhere, we can avoid the crashes. But what about the
logic? If we do several graph-changing operations simultaneously, the result
is absolutely unpredictable, and that's not what the user wants.

The problem is known. For example, it may lead to an iotest 030 failure.
Probably it can't anymore, after Hanna's changes, but I think we won't be
safe until we have a kind of global mutex for graph-changing operations. For
example, here is my old attempt:
https://lists.nongnu.org/archive/html/qemu-devel/2020-11/msg05290.html
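
(For illustration, a minimal sketch of what such a global lock could look
like, using QEMU's CoMutex; the names are hypothetical and this is not the
API from the linked series:)

  /* One global lock serializing all graph-changing operations;
   * initialized with qemu_co_mutex_init() at startup. */
  static CoMutex graph_modify_mutex;

  void coroutine_fn bdrv_graph_modify_begin(void)
  {
      qemu_co_mutex_lock(&graph_modify_mutex);
  }

  void coroutine_fn bdrv_graph_modify_end(void)
  {
      qemu_co_mutex_unlock(&graph_modify_mutex);
  }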

So, until we have a good solution for this, it's probably better to do only
the necessary small fixes, like your original patch.


Second, bs may actually disappear at the first yield point, which happens
earlier than bdrv_drained_begin(): in bdrv_new_open_driver_opts(). So if we
fix something, we'd better declare that the caller of bdrv_insert_node() is
responsible for keeping bs alive for the duration of the call, and then
audit the callers.
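
(I.e. callers would follow a pattern like this sketch:)

  /* Caller-side: pin bs for the whole duration of the call */
  bdrv_ref(bs);
  new_bs = bdrv_insert_node(bs, options, flags, errp);
  bdrv_unref(bs);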

--
Best regards,
Vladimir


