[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-block] [PATCH 1/3] block: add BDS field to count in-flight req

From: Kevin Wolf
Subject: Re: [Qemu-block] [PATCH 1/3] block: add BDS field to count in-flight requests
Date: Tue, 11 Oct 2016 13:00:43 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

Am 10.10.2016 um 18:37 hat Paolo Bonzini geschrieben:
> On 10/10/2016 12:36, Kevin Wolf wrote:
> > 
> > At the BlockBackend level (or really anywhere in the graph), we have two
> > different kinds of drain requests that we just fail to implement
> > differently today:
> > 
> > 1. blk_drain(), i.e. a request to actually drain all requests pending in
> >    this BlockBackend. This is what we implement today, even though only
> >    indirectly: We call bdrv_drain() on the root BDS and it calls back
> >    into what is case 2.
> > 
> > 2. BdrvChildRole.drained_begin/end(), i.e. stop sending requests to a
> >    child node. The efficient way to do this is not what we're doing
> >    today (sending all throttled requests and waiting for their
> >    completion), but just stopping to process throttled requests.
> > 
> > Your approach would permanently prevent us from doing this right and
> > force us to do the full drain even for the second case.
> You're entirely correct that these patches are a layering violation.
> However, I think you're wrong that they are a step backwards, because
> they are merely exposing pre-existing gaps in the BDS/BB separation.  In
> fact, I think that these patches are a very good first step towards
> getting the design correct, and I suspect that we agree on the design
> since we have talked about it in the past.
> Now, the good news is that we have an API that makes sense, and we're
> using it mostly correctly.  (The exception is that there are places
> where we don't have a BlockBackend and thus call
> bdrv_drain/bdrv_co_drain instead of blk_drain/blk_co_drain).  Yes,
> bdrv_drained_begin and bdrv_drained_end may need tweaks to work with
> multiple parents, but wherever we use one of the APIs in the family
> we're using the right one.
> The bad news is that while the APIs are good, they are implemented in
> the wrong way---all of them, though some less than others.  In
> particular, blk_drain and bdrv_drain and
> bdrv_drained_begin/bdrv_drained_end represent three completely different
> concepts but we conflate the implementations.   But again, knowing and
> exposing the problem is the first step of solving it, and IMO these
> patches if anything make the problem easier to solve.
> So, let's first of all look at what the three operations should mean.
> blk_drain should mean "wait for completion of all requests sent from
> *this* BlockBackend".  Typically it would be used by the owner of the
> BlockBackend to get its own *data structures* in a known-good state.  It
> should _not_ imply any wait for metadata writes to complete, and it
> should not do anything to requests sent from other BlockBackends.  Three
> random notes: 1) this should obviously include throttled operations,
> just like it does now; 2) there should be a blk_co_drain too; 3) if the
> implementation can't be contained entirely within BlockBackend, we're
> doing it wrong.
> bdrv_drain should only be needed in bdrv_set_aio_context and bdrv_close,
> everyone else should drain its own BlockBackend or use a section.  Its
> role is to ensure the BDS's children are completely quiescent.
> Quiescence of the parents should be a precondition:
> bdrv_set_aio_context, should use bdrv_drained_begin/bdrv_drained_end for
> this purpose; while in the case of bdrv_close there should be no parents.

You making some good points here. I never spent much thought on the fact
that bdrv_drain and bdrv_drained_begin/end are really for different

Actually, does bdrv_set_aio_context really need bdrv_drain, or should it
be using a begin/end pair, too?

bdrv_close() is an interesting case because it has both a section and
then inside that another bdrv_drain() call. Maybe that's the only
constellation in which bdrv_drain() without a section even really makes
sense because then the parents are already quiesced. If that's the only
user of it, though, maybe we should inline it and remove the public
defintion of bdrv_drain(), which is almost always wrong.

> bdrv_{co_,}drained_begin and bdrv_{co_,}drained_end should split into two:
> - blk_{co_,}isolate_begin(blk) and blk_{co_,}isolate_end(blk).  These
> operations visit the BdrvChild graph in the child-to-parent direction,
> starting at blk->root->bs, and tell each root BdrvChild (other than blk)
> to quiesce itself.  Quiescence means ensuring that no new I/O operations
> are triggered.  This in turn has both an external aspect (a Notifier or
> other callback invoked by blk_root_drained_begin/end) and an internal
> aspect (throttling).

It wouldn't only tell the BlockBackends to quiesce themselves, but also
all nodes on the path so that they don't issue internal requests.

Anyway, let's see which of the existing bdrv_drained_begin/end users
would use this (please correct):

* Block jobs use during completion

* The QMP transaction commands to start block jobs drain as well, but
  they don't have a BlockBackend yet. Those would call a BDS-level
  bdrv_isolate_begin/end then, right?

* Quorum wants to isolate itself from new parent requests while adding a
  new child, that's another user for bdrv_isolate_begin/end

* bdrv_snapshot_delete(), probably the save BDS-level thing

> - bdrv_drained_begin(bs) and bdrv_drained_end(bs), which quiesce all
> root BdrvChildren reachable from bs; bdrv_drained_begin then does a
> recursive bdrv_drain(bs) to flush all pending writes.  Just like
> bdrv_drain, bdrv_drained_begin and bdrv_drained_end should be used very
> rarely, possibly only in bdrv_set_aio_context.

bdrv_close probably, too. And that one doesn't really need the recursive
semantics because it closes only one node. If that makes the refcount of
the other nodes go down to zero, they will be closed individually and
drain themselves.

The reason why bdrv_set_aio_context needs the recursive version is
because it's recursive itself.

Possibly add a bool recursive flag to bdrv_drained_begin/end?

> In this new setting, BlockBackend need not disable throttling in
> blk_root_drained_begin/end, and this is important because it explains
> why conflating the concepts is bad.  Once BlockBackends can have
> "isolated sections" instead of "drained sections", calling blk_drain
> from blk_root_drained_begin/end is just one way to ensure that throttled
> writes are not submitted.  It's not a very good one, but that doesn't
> change the fact that blk_drain should wait for throttled requests to
> complete without disabling throttling!  In fact, I surmise that any
> other case where you want blk_drain to ignore throttling is a case where
> you have a bug or missing feature elsewhere, for example some minimal
> bdrv_cancel support that lets you cancel those throttled requests.  (And
> I suspect that the handling of plugged operations in bdrv_drain to be
> the same, but I haven't put much thought on it).

Okay, I think we agree on the theory.

Now I'm curious how this relates to the layering violation in this
specific patch.

> Now let's look at the implementation.
> blk_drain and blk_co_drain can simply count in-flight requests, exactly
> as done in patch 1.  Sure, I'm adding it at the wrong layer because not
> every user of the block layer has already gotten its own personal BB.

Really? Where is it still missing? I was pretty sure I had converted all

> However, moving the count from BDS to BB is easy, and another thing that
> patch 1 gets right is to track the lifetime of AIO requests correctly
> and remove the "double aio_poll" hack once and for all from the
> implementation of bdrv_drain.
> Because of the new precondition, bdrv_drain only needs to do the
> recursive descent that it does in patch 2, without the previous
> quiescing.  So that's another thing that the series nudges in the right
> direction. :)  bdrv_drain should probably be renamed, but I can't think
> of a good one.  Food for thought: think of merging the .bdrv_drain
> callback with .bdrv_inactivate.

Hm... I always thought that .bdrv_drain should really be
.bdrv_drained_begin/end in theory, but we didn't have a user for that
because the qed implementation for .bdrv_drained_end would happen to be

As for .bdrv_inactivate/invalidate, I agree that it should ideally
imply .bdrv_drained_begin/end, though in practice we don't assert
cleared BDRV_O_INACTIVE on reads, so I think it might not hold true.

I'm not so sure if .bdrv_drained_begin/end should imply inactivation,
though. Inactivation really means "this qemu process gives up control of
the image", so that a migration destination can take over. In terms of
the qcow2-based locking series that this was originally part of,
inactivation would mean dropping the lock.

Giving up control is most definitely not what you want in most cases
where you do .bdrv_drained_begin/end. Rather, you want _exclusive_
access, which is pretty much the opposite.

And this is true both for the isolate users mentioned above and for
things like changing the AIO context, which legitimately continue to use

I was wondering if we need separate .bdrv_isolate_begin/end callbacks,
but I think on this level there is actually no difference: Both tell the
block driver to stop sending requests.

> blk_isolate_begin(blk) should be just "bdrv_quiesce_begin(blk->root)"
> (I'm now really sold on that BdrvChild thing!) and likewise for
> blk_isolate_end->bdrv_quiesce_end, with bdrv_quiesce_begin/end doing the
> child-to-parent walk.  Their first argument means "invoke quiescing
> callbacks on all parents of blk->root, except for blk->root itself".
> The BdrvChild callbacks would be renamed to .bdrv_quiesce_begin and
> .bdrv_quiesce_end.
> bdrv_drained_begin(bs) and bdrv_drained_end(bs) should be implemented as
> above.
> bdrv_drain_all should be changed to bdrv_quiesce_all_{begin,end}.  qdev
> drive properties would install a vmstate notifier to quiesce their own
> BlockBackend rather than relying on a blind bdrv_drain_all from vm_stop.

Sounds reasonable.

> And thirdly, how to get there.
> It should not surprise you that I think that these patches are the first
> step.  The second is to separate sections (isolated and quiescent)
> sections from the concept of draining, including phasing out
> bdrv_drain_all.  Now the multi-parent case is handled correctly and you
> can introduce "BB for everyone" including block jobs, block migration
> and the NBD server.  Finally you cleanup bdrv_drain and move the
> in-flight count from BDS to BB.
> Thoughts?

Why don't we start with adding a proper blk_drain so that we don't have
to introduce the ugly intermediate state? This would mostly mean
splitting the throttling drain into one version called by blk_drain that
restarts all requests while ignoring the throttling, and changing the
BdrvChild callback to only stop sending them. Once you're there, I think
that gets you rid of the BDS request count manipulations from the BB
layer because the BDS layer already counts all requests.

You hinted above that the reason is that not every place that needs it
has a BB yet. If you can tell me which users are missing, I can take a
look at converting them. I'm not aware of any and already considered
this done (we should have noticed them (as in failing builds) while
doing the BdrvChild conversion).


reply via email to

[Prev in Thread] Current Thread [Next in Thread]