
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.


From: Yoshiaki Tamura
Subject: Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Date: Thu, 20 Jan 2011 19:39:34 +0900

2011/1/20 Kevin Wolf <address@hidden>:
> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>>>>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>>>>> +                    blk_req->reqs[0].opaque);
>>>>>
>>>>> Same here.
>>>>>
>>>>>> +    bdrv_flush(bs);
>>>>>
>>>>> This looks really strange. What is this supposed to do?
>>>>>
>>>>> One point is that you write it immediately after bdrv_aio_write, so you
>>>>> get an fsync for which you don't know if it includes the current write
>>>>> request or if it doesn't. Which data do you want to get flushed to the 
>>>>> disk?
>>>>
>>>> I was expecting to flush the aio request that was just initiated.
>>>> Am I misunderstanding the function?
>>>
>>> Seems so. The function names don't use really clear terminology either,
>>> so you're not the first one to fall in this trap. Basically we have:
>>>
>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>> wanted to have exactly this, but only for a single block device. Such a
>>> function doesn't exist yet.
>>>
>>> * bdrv_flush() makes sure that all successfully completed requests are
>>> written to disk (by calling fsync)
>>>
>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>>> the fsync in the thread pool
>>
>> Then what I wanted to do is to call qemu_aio_flush first, then
>> bdrv_flush, just like live migration does.
>
> Okay, that makes sense. :-)
>
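For reference, a minimal sketch of the pattern we seem to be converging
on here (first wait for the in-flight AIO to complete, then fsync what
has been written), against the block API as it is today; the function
name flush_block_device is only illustrative and error handling is
omitted:

    #include "block.h"      /* bdrv_flush(), BlockDriverState */
    #include "qemu-aio.h"   /* qemu_aio_flush() */

    /* Sketch only: drain all outstanding AIO requests, then make sure
     * the completed writes reach the disk for this device.  These are
     * presumably the same two steps live migration takes before sending
     * the final state, only per device instead of for all devices. */
    static void flush_block_device(BlockDriverState *bs)
    {
        qemu_aio_flush();   /* wait until every AIO request has completed */
        bdrv_flush(bs);     /* fsync the completed requests on this device */
    }
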
>>>>> The other thing is that you introduce a bdrv_flush for each request,
>>>>> basically forcing everyone to something very similar to writethrough
>>>>> mode. I'm sure this will have a big impact on performance.
>>>>
>>>> The reason is to avoid inversion of queued requests.  Although
>>>> processing one-by-one is heavy, wouldn't having requests flushed
>>>> to disk out of order break the disk image?
>>>
>>> No, that's fine. If a guest issues two requests at the same time, they
>>> may complete in any order. You just need to make sure that you don't
>>> call the completion callback before the request really has completed.
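To make that rule concrete, here is a small sketch of how a tapped write
could keep the guest's callback from firing early; WrappedReq and
event_tap_aio_writev are hypothetical names for illustration, not code
from this series:

    #include "qemu-common.h"   /* qemu_malloc(), qemu_free(), QEMUIOVector */
    #include "block.h"         /* bdrv_aio_writev(), BlockDriverCompletionFunc */

    /* Hypothetical illustration: the guest's callback is invoked only
     * from wrapped_cb(), i.e. after the write has really completed,
     * regardless of the order in which concurrent requests finish. */
    typedef struct WrappedReq {
        BlockDriverCompletionFunc *cb;   /* the guest's original callback */
        void *opaque;
    } WrappedReq;

    static void wrapped_cb(void *opaque, int ret)
    {
        WrappedReq *w = opaque;
        w->cb(w->opaque, ret);           /* the request has completed now */
        qemu_free(w);
    }

    static BlockDriverAIOCB *event_tap_aio_writev(BlockDriverState *bs,
                                                  int64_t sector,
                                                  QEMUIOVector *qiov,
                                                  int nb_sectors,
                                                  BlockDriverCompletionFunc *cb,
                                                  void *opaque)
    {
        WrappedReq *w = qemu_malloc(sizeof(*w));

        w->cb = cb;
        w->opaque = opaque;
        return bdrv_aio_writev(bs, sector, qiov, nb_sectors, wrapped_cb, w);
    }
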
>>
>> We need to flush requests, meaning aio and fsync, before sending
>> the final state of the guests, to make sure we can switch to the
>> secondary safely.
>
> In theory I think you could just re-submit the requests on the secondary
> if they had not completed yet.
>
> But you're right, let's keep things simple for the start.
>
>>> I'm just starting to wonder if the guest won't timeout the requests if
>>> they are queued for too long. Even more, with IDE, it can only handle
>>> one request at a time, so not completing requests doesn't sound like a
>>> good idea at all. In what intervals is the event-tap queue flushed?
>>
>> The requests are flushed once each transaction completes.  So
>> it's not at fixed intervals.
>
> Right. So when is a transaction completed? This is the time that a
> single request will take.

The transaction is completed when the VM state has been sent to the
secondary and the primary has received the ack for it.  Please let
me know if that answer is too vague.  What I can say is that it
can't be super fast.
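
To put the ordering in one place, a very rough sketch of one transaction
follows; the helper names are made up for illustration and are not the
functions in this series:

    /* Very rough sketch of one transaction.  ft_send_vmstate(),
     * ft_wait_ack() and event_tap_flush() are hypothetical names. */
    static void ft_transaction(void)
    {
        ft_send_vmstate();   /* send the synchronized VM state to the secondary */
        ft_wait_ack();       /* wait for the secondary's acknowledgement */
        event_tap_flush();   /* transaction complete: replay and flush the
                                queued event-tap requests */
    }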

>>> On the other hand, if you complete before actually writing out, you
>>> don't get timeouts, but you signal success to the guest when the request
>>> could still fail. What would you do in this case? With a writeback cache
>>> mode we're fine, we can just fail the next flush (until then nothing is
>>> guaranteed to be on disk and order doesn't matter either), but with
>>> cache=writethrough we're in serious trouble.
>>>
>>> Have you thought about this problem? Maybe we end up having to flush the
>>> event-tap queue for each single write in writethrough mode.
>>
>> Yes, and that's what I'm trying to do at this point.
>
> Oh, I must have missed that code. Which patch/function should I look at?

Maybe I misanswered your question.  The device may see timeouts.
If timeouts don't happen, the requests are flushed one by one in
writethrough mode, because we're calling qemu_aio_flush and
bdrv_flush together.
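
So, concretely, replaying one queued request in writethrough mode would
look roughly like the hunk quoted at the top of this mail plus the two
flush calls; EventTapBlkReq below is only a placeholder for the request
structure in the patch, and error handling is omitted:

    /* Sketch of replaying one queued request in writethrough mode. */
    static void event_tap_blk_replay_one(BlockDriverState *bs,
                                         EventTapBlkReq *blk_req)
    {
        bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
                        blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
                        blk_req->reqs[0].opaque);
        qemu_aio_flush();   /* wait for the request just issued to complete */
        bdrv_flush(bs);     /* then fsync so it is really on disk */
    }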

Yoshi

>> I know that
>> performance matters a lot, but sacrificing reliability for
>> performance now isn't a good idea.  I first want to lay the
>> groundwork, and then focus on optimization.  Note that without the
>> dirty bitmap optimization, Kemari suffers a lot when sending RAM.
>> Anthony and I discussed taking this approach at KVM Forum.
>
> I agree, starting simple makes sense.
>
> Kevin


