From: Kevin Wolf
Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock
Date: Mon, 23 May 2022 18:04:55 +0200

On 23.05.2022 at 17:13, Stefan Hajnoczi wrote:
> On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote:
> > On 22.05.2022 at 17:06, Stefan Hajnoczi wrote:
> > > Hi,
> > > Sorry for the long email. I've included three topics that may help
> > > us discuss draining and AioContext removal further.
> > > 
> > > The shortcomings of drain
> > > -------------------------
> > > I wanted to explore the logical argument that making graph
> > > modifications within a drained section is correct:
> > > - Graph modifications and BB/BDS lookup are Global State (GS).
> > > - Graph traversal from a BB/BDS pointer is IO.
> > > - Only one thread executes GS code at a time.
> > > - IO is quiesced within a drained section.
> > 
> > I think you're mixing two things here and calling both of them IO.
> > 
> > If a function is marked as IO, that means that it is safe to call from
> > I/O requests, which may be running in any iothread (they currently only
> > run in the home thread of the node's AioContext, but the function
> > annotations already promise that any thread is safe).
> > 
> > However, if a function is marked as IO, this doesn't necessarily mean
> > that it is always only called in the context of an I/O request. It can
> > be called by any other code, too.
> > 
> > So while drain does quiesce all I/O requests, this doesn't mean that
> > functions marked as IO won't run any more.
> 
> My model is that between bdrv_drained_begin() and bdrv_drained_end()
> only the caller is allowed to invoke BB/BDS functions (including
> functions marked IO).

Okay, this sounds better. It's not actually related to IO/GS, but to
BB/BDS functions, including both IO and GS functions.

So a function that does graph traversal from a BB/BDS pointer is itself
a BB/BDS function, and BB/BDS functions need to be quiesced within a
drained section.
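
To illustrate the model with a sketch (bs and new_bs are placeholder
variables, and bdrv_replace_node() stands in for any graph change):

    /* GS code, running in the main thread under the BQL */
    bdrv_drained_begin(bs);

    /*
     * In this model, every other thread and coroutine that could
     * touch bs is now quiesced, so rewiring the graph is safe.
     */
    bdrv_replace_node(bs, new_bs, &error_abort);

    bdrv_drained_end(bs);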

> The caller isn't strictly limited to one thread and at most one
> coroutine. The caller could use multiple threads and coroutines, but
> the important thing is that everything else that accesses the BB/BDS
> is suspended due to .drained_begin() callbacks (and is_external).
> 
> So when you say a function marked IO can be called from outside an I/O
> request, that "outside an I/O request" code must be quiesced too.
> Otherwise drain is definitely not safe for graph modifications.

If you phrase it as a condition and as a goal to achieve, then I agree
that this is required when you want to use drain for locking.

My impression was that you were using this to argue that the code is
already doing this and is already safe in this scenario, and this isn't
true. I think I misunderstood you there and you were really describing
the future state that you're envisioning.

> > > - Therefore a drained section in GS code suspends graph traversal,
> > >   other graph modifications, and BB/BDS lookup.
> > > - Therefore it is safe to modify the graph from a GS drained section.
> > 
> > So unfortunately, I don't think this conclusion is correct.
> > 
> > In order to make your assumption true, we would have to require that all
> > callers of IO functions must have called bdrv_inc_in_flight(). We would
> > also have to find a way to enforce this, preferably at compile time or
> > with static analysis, or at least with runtime assertions, because it
> > would be very easy to get wrong.
> 
> Or that they are quiesced by .drained_begin() callbacks or is_external.
> 
> Do you have a concrete example?

Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this.
They just need to make bdrv_drain_poll() return true as long as they are
active so that bdrv_drained_begin() waits for them. .drained_poll() is
another valid way to achieve this.

However, if we want to rely on static analysis to verify that everything
is covered, always requiring bdrv_inc_in_flight() might make this
easier. So possibly we want to require it even in cases where
.drained_poll() or aio_disable_external() would be enough in theory.
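
Spelled out as a minimal sketch (bdrv_inc_in_flight() and
bdrv_dec_in_flight() are the existing helpers in block/io.c):

    /* Code that accesses a BB/BDS outside of an I/O request: */
    bdrv_inc_in_flight(bs);    /* bdrv_drain_poll() now returns true */

    /* ... access the node ... */

    bdrv_dec_in_flight(bs);    /* wakes up a pending drain */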

> > 
> > > However, I hit on a problem that I think Emanuele and Paolo have
> > > already pointed out: draining is GS & IO. This might have worked
> > > under the 1 IOThread model but it does not make sense for
> > > multi-queue. It is possible to submit I/O requests in drained
> > > sections. How can multiple threads be in drained sections
> > > simultaneously and possibly submit further I/O requests in their
> > > drained sections? Those sections wouldn't be "drained" in any
> > > useful sense of the word.
> > 
> > Drain asks other users not to touch the block node. Currently this only
> 
> BTW interesting language choice here: you're using the more general
> definition of "other users" and "touch[ing] the block node" rather than
> the narrow definition of just "I/O requests". That's exactly how I think
> of drain, but based on what you wrote above I'm not sure whether that's
> how you think of it too.

I hope that my reply above made it clearer: The broader definition is
what is needed if we want to use drain to replace the AioContext lock
for protecting the graph, but the narrow definition is what is
implemented today.

> > includes I/O requests, but the desire to use drain for locking means
> > that we need to extend it to pretty much any operation on the node,
> > which would include calling drain on that block node. So you should
> > never have two callers in a drain section at the same time; it would
> > be a bug in this model.
> 
> Yes! Drain in its current form doesn't make sense for multi-queue since
> multiple threads could be in drained sections at the same time and they
> would all be allowed to submit new I/O requests.

Nobody would be allowed to submit new requests (because someone else has
drained the node), but they would do so anyway. ;-)

Actually, only half joking, because this shows how weak the protection
by drain is if we don't have a way to verify that the whole codebase
supports drain correctly.
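
At a minimum, the graph-modifying functions could assert the state that
this model requires. A hypothetical check (though bs->quiesce_counter
is already incremented by bdrv_drained_begin() today):

    static void assert_graph_modifiable(BlockDriverState *bs)
    {
        /* Graph changes are GS code and need a drained section */
        assert(qemu_in_main_thread());
        assert(bs->quiesce_counter > 0);
    }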

> > Of course, we know that currently drain is not respected by all relevant
> > callers (most importantly, the monitor). If you want to use drain as a
> > form of locking, you need to solve this first.
> > 
> > > It is necessary to adjust how recursive drained sections work:
> > > bdrv_drained_begin() must not return while there are deeper nested drained
> > > sections.
> > > 
> > > This is allowed:
> > > 
> > >      Monitor command            Block job
> > >      ---------------            ---------
> > >   > bdrv_drained_begin()
> > >            .                 > bdrv_drained_begin()
> > >            .                 < bdrv_drained_begin()
> > >            .                          .
> > >            .                          .
> > >            .                 > bdrv_drained_end()
> > >            .                 < bdrv_drained_end()
> > >   < bdrv_drained_begin()
> > >            .
> > >            .
> > >   > bdrv_drained_end()
> > >   < bdrv_drained_end()
> > 
> > Just to make sure I understand the scenario that you have in mind here:
> > The monitor command is not what causes the block job to do the draining
> > because this is inside bdrv_drained_begin(), so the block job just
> > randomly got a callback that caused it to do so (maybe completion)?
> 
> Yes, exactly that completion scenario. When the mirror block job
> completes exactly when a monitor command calls bdrv_drained_begin(),
> mirror_exit_common() deletes a temporary filter BDS node. It involves
> drain and modifies the graph.
> 
> > 
> > Before bdrv_drained_begin() returns, anything is still allowed to run,
> > so I agree that this is valid.
> 
> Thanks for confirming!
> 
> > > This is not allowed:
> > > 
> > >      Monitor command            Block job
> > >      ---------------            ---------
> > >   > bdrv_drained_begin()
> > >            .                 > bdrv_drained_begin()
> > >            .                 < bdrv_drained_begin()
> > >            .                          .
> > >            .                          .
> > >   < bdrv_drained_begin()              .
> > >            .                          .
> > >            .                 > bdrv_drained_end()
> > >            .                 < bdrv_drained_end()
> > >   > bdrv_drained_end()
> > >   < bdrv_drained_end()
> > > 
> > > This restriction creates an ordering between bdrv_drained_begin()
> > > callers. In this example the block job must not depend on the
> > > monitor command completing first. Otherwise there would be a
> > > deadlock, just like when two threads wait for each other while each
> > > holds a mutex.
> > 
> > So essentially drain needs to increase bs->in_flight, so that the outer
> > drain has to wait for the inner drain section to end before its
> > bdrv_drained_begin() can return.
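
For concreteness, I imagine something like the following, where
bdrv_do_drained_begin() is the existing internal helper and the
bdrv_inc_in_flight() call is the hypothetical addition:

    void bdrv_drained_begin(BlockDriverState *bs)
    {
        /* Existing logic: stop new requests, poll until quiesced */
        bdrv_do_drained_begin(bs, false, NULL, false, true);

        /*
         * Hypothetical: count the drained section itself as activity
         * on bs so that another bdrv_drained_begin() keeps polling
         * until the matching bdrv_drained_end() drops the reference.
         */
        bdrv_inc_in_flight(bs);
    }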
> > 
> > > For sanity I think draining should be GS-only. IO code should not
> > > invoke bdrv_drained_begin() to avoid ordering problems when
> > > multiple threads drain simultaneously and have dependencies on
> > > each other.
> > > 
> > > block/mirror.c needs to be modified because it currently drains
> > > from IO when mirroring is about to end.
> > > 
> > > With this change to draining I think the logical argument for
> > > correctness with graph modifications holds.
> > 
> > What is your suggestion how to modify mirror? It drains so that no new
> > requests can sneak in and source and target stay in sync while it
> > switches to the completion code running in the main AioContext. You
> > can't just drop this.
> 
> I haven't read the code in enough depth to come up with a solution.

So this sounds like it needs more thought before we can assume that
we'll have a final state where drain is GS-only.

> > 
> > > Enforcing GS/IO separation at compile time
> > > ------------------------------------------
> > > Right now GS/IO asserts check assumptions at runtime. The next step may be
> > > using the type system to separate GS and IO APIs so it's
> > > impossible for IO code to accidentally call GS code, for example.
> > > 
> > >   typedef struct {
> > >       BlockDriverState *bs;
> > >   } BlockDriverStateIOPtr;
> > > 
> > >   typedef struct {
> > >       BlockDriverState *bs;
> > >   } BlockDriverStateGSPtr;
> > > 
> > > Then APIs can be protected as follows:
> > > 
> > >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > > 
> > > A function that only has a BlockDriverStateIOPtr cannot call
> > > bdrv_set_aio_context_ignore() by mistake since the compiler will
> > > complain that the first argument must be a BlockDriverStateGSPtr.
> > 
> > And then you have a set of functions that cast from GS to IO pointers,
> > but none for the other way around? Or maybe getting an IO pointer would
> > even be coupled with automatically increasing bs->in_flight?
> > 
> > The other option that we were looking into for this was static analysis.
> > I had hacked up a small script to check consistency of these annotations
> > (without covering function pointers, though) to help me with the review
> > of Emanuele's patches that introduced them. If I understand correctly,
> > Paolo has scripts to do the same a little bit better.
> > 
> > As long as we can integrate such a script in 'make check', we wouldn't
> > necessarily have to have the churn in the C code to switch everything to
> > new wrapper types.
> 
> Yes, that's the problem with the typedef Ptr approach. It would involve
> a lot of code changes. Maybe static analysis tools are better.
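
Just to illustrate what I meant by the casting functions above
(hypothetical code, none of this exists today):

    /*
     * GS -> IO is the safe direction; pairing it with an in-flight
     * reference makes a concurrent drain wait for the IO user.
     */
    static inline BlockDriverStateIOPtr
    bdrv_gs_to_io(BlockDriverStateGSPtr gs)
    {
        bdrv_inc_in_flight(gs.bs);
        return (BlockDriverStateIOPtr) { .bs = gs.bs };
    }

    static inline void bdrv_io_ptr_put(BlockDriverStateIOPtr io)
    {
        bdrv_dec_in_flight(io.bs);
    }

    /* Deliberately no function for the IO -> GS direction */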
> 
> > 
> > > Possible steps for AioContext removal
> > > -------------------------------------
> > > I also wanted to share my assumptions about multi-queue and
> > > AioContext removal. Please let me know if anything seems wrong or
> > > questionable:
> > > 
> > > - IO code can execute in any thread that has an AioContext.
> > > - Multiple threads may execute IO code at the same time.
> > > - GS code only executes under the BQL.
> > > 
> > > For AioContext removal this means:
> > > 
> > > - bdrv_get_aio_context() becomes mostly meaningless since there is
> > >   no need for a special "home" AioContext.
> > > - bdrv_coroutine_enter() becomes mostly meaningless because there
> > >   is no need to run a coroutine in the BDS's AioContext.
> > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works
> > >   because many threads/AioContexts may submit new I/O requests.
> > >   BlockDevOps.drained_begin() may be used instead (e.g. to
> > >   temporarily disable ioeventfds on a multi-queue virtio-blk
> > >   device).
> > > - AIO_WAIT_WHILE() simplifies to
> > > 
> > >     while ((cond)) {
> > >         aio_poll(qemu_get_current_aio_context(), true);
> > >         ...
> > >     }
> > 
> > Probably not exactly, because you still need aio_wait_kick() to wake up
> > the thread. We use AioWait.num_waiters != 0 to decide whether we need to
> > schedule a BH in the main thread (because doing so unconditionally for
> > every completed request would be very wasteful).
> > 
> > If you want to be able to run this loop from any thread instead of just
> > the main thread, you would have to store somewhere which thread to wake.
> 
> qemu_cond_broadcast() can be used since the wakeup is a slow path.

I don't think we have a QemuCond for waking up a blocking aio_poll()?
Doesn't this usually involve writing to the event notifier file
descriptor of the specific AioContext?
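
For reference, the current wakeup path looks roughly like this
(simplified from util/aio-wait.c):

    static void dummy_bh_cb(void *opaque)
    {
        /* Nothing to do; the BH only exists to wake up aio_poll() */
    }

    void aio_wait_kick(void)
    {
        /* Pairs with the increment of num_waiters in AIO_WAIT_WHILE() */
        smp_mb();

        if (qatomic_read(&global_aio_wait.num_waiters)) {
            aio_bh_schedule_oneshot(qemu_get_aio_context(),
                                    dummy_bh_cb, NULL);
        }
    }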

> > >   and the distinction between home AioContext and non-home context is
> > >   eliminated. AioContext unlocking is dropped.
> > > 
> > > Does this make sense? I haven't seen these things in recent patch series.
> > 
> > Intuitively I would agree with most. There may be details that I'm not
> > aware of at the moment. I'm not surprised that we haven't seen any such
> > things in recent patch series because there is an awful lot of
> > preparatory work to be done before we can reach this final state.
> 
> Yes. I'm mostly hoping to find out whether my view is fundamentally
> flawed or very different from what you and others think.

No, not fundamentally.

The big challenge in my mind is how to verify that the conditions are
actually met. I'm fairly sure that using drain this way is by far too
complex to rely on review only.

Kevin
