From: Paolo Bonzini
Subject: Re: [Qemu-devel] [PATCH for-2.11 1/4] Revert "coroutine: abort if we try to schedule or enter a pending coroutine"
Date: Tue, 28 Nov 2017 18:35:38 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

On 28/11/2017 18:19, Kevin Wolf wrote:
> On 28.11.2017 at 18:01, Paolo Bonzini wrote:
>> Basically, once you do aio_co_schedule or aio_co_wake, the coroutine is
>> no longer yours.  It's owned by the context that will run it, and you
>> should not do anything with it.
> 
> Well, but that poses the question of how you can implement any coroutine
> that can be reentered from more than one place. Not being able to do
> that would be a severe restriction.
>
> For example, take quorum, which handles requests by spawning a coroutine
> for every child and then yielding until acb->count == s->num_children.
> This means that the main request coroutine can be reentered from the
> child request coroutines in any order and at any time.
> 
> What saves us currently is that they are all in the same AioContext, so
> we won't actually end up in aio_co_schedule(), but I'm sure that sooner
> or later we'll find cases where we're waiting for several workers that
> are spread across different I/O threads.

That can work as long as you add synchronization among those "more than
one place" (quorum doesn't need it because as you say there's only one
AioContext involved).  Take the previous example of

        f();
        g();
        qemu_coroutine_yield();

If f() and g() do

        qemu_mutex_lock(&mutex);
        x--;
        wake = (x == 0);
        qemu_mutex_unlock(&mutex);
        if (wake) {
            aio_co_wake(co);
        }

then that's fine, because aio_co_wake is only called from one place.
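
To make that concrete, here is a minimal, self-contained sketch of the
pattern.  Only the primitives (QemuMutex, qemu_coroutine_create,
qemu_coroutine_self, qemu_coroutine_yield, aio_co_schedule, aio_co_wake)
are real; the WaitForAll struct and the wait_for_all* names are made up
for illustration and are not existing QEMU code:

        #include "qemu/osdep.h"
        #include "qemu/thread.h"
        #include "qemu/coroutine.h"
        #include "block/aio.h"

        typedef struct WaitForAll {
            QemuMutex lock;
            int remaining;          /* workers that have not finished yet */
            Coroutine *main_co;     /* coroutine to wake exactly once */
        } WaitForAll;

        /* Per-child worker, possibly running in a different AioContext.
         * Only the last worker to finish calls aio_co_wake(), so the
         * main coroutine is woken from exactly one place. */
        static void coroutine_fn wait_for_all_worker(void *opaque)
        {
            WaitForAll *w = opaque;
            bool wake;

            /* ... do the actual per-child work here ... */

            qemu_mutex_lock(&w->lock);
            wake = (--w->remaining == 0);
            qemu_mutex_unlock(&w->lock);

            if (wake) {
                aio_co_wake(w->main_co);
            }
        }

        /* Main request coroutine: spawn one worker per child, then yield
         * once until the last worker wakes us. */
        static void coroutine_fn wait_for_all(AioContext **child_ctx,
                                              int num_children)
        {
            WaitForAll w = {
                .remaining = num_children,
                .main_co   = qemu_coroutine_self(),
            };
            int i;

            qemu_mutex_init(&w.lock);

            for (i = 0; i < num_children; i++) {
                Coroutine *co = qemu_coroutine_create(wait_for_all_worker, &w);
                aio_co_schedule(child_ctx[i], co);
            }

            /* Re-entered exactly once, by whichever worker finished last. */
            qemu_coroutine_yield();

            qemu_mutex_destroy(&w.lock);
        }

The counter and the mutex guarantee that aio_co_wake runs exactly once,
no matter in which AioContext or in which order the workers finish, so
the "only one place enters the coroutine" rule is preserved.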

> I don't think "don't do that" is a good answer to this.
> 
> And yes, reentering co_aio_sleep_ns() early is a real requirement, too.
> The test case that used speed=1 and would have waited a few hours before
> actually cancelling the job is an extreme example, but even just
> delaying cancellation for a second is bad if this is a second of
> migration downtime.

Ok, that can be fixed along the lines of what Jeff has just written.
I'll take a look.

Thanks,

Paolo


