From: Paolo Bonzini
Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support
Date: Wed, 13 Aug 2014 15:16:59 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

On 13/08/2014 11:54, Kevin Wolf wrote:
> On 12.08.2014 at 21:08, Paolo Bonzini wrote:
>> On 12/08/2014 10:12, Ming Lei wrote:
>>>>> The below patch is basically the minimal change to bypass coroutines.
>>>>> Of course the block.c part is not acceptable as is (the change to
>>>>> refresh_total_sectors is broken, the others are just ugly), but it is
>>>>> a start.  Please run it with your fio workloads, or write an aio-based
>>>>> version of a qemu-img/qemu-io *I/O* benchmark.
>>> Could you explain why the new change is introduced?
>>
>> It provides a fast path for bdrv_aio_readv/writev whenever there is
>> nothing to do after the driver routine returns.  In this case there is
>> no need to wrap the AIOCB returned by the driver routine.
>>
>> It doesn't go all the way, and in particular it doesn't reverse
>> completely the roles of bdrv_co_readv/writev vs. bdrv_aio_readv/writev.
> 
> That's actually why I think it's an option. Remember that, like you say
> below, we're optimising for an extreme case here, and I certainly don't
> want to hurt the common case for it. I can't imagine a way of reversing
> the roles without multiplying the cost for the coroutine path.

I'm not that worried about it.  Perhaps it's enough to add an
!qemu_in_coroutine() check to the AIO fast path, and let the driver
provide optimized coroutine paths, as in your patches that allocate
AIOCBs on the stack.
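
Roughly like this -- just a sketch, untested; the two helpers at the
bottom are made-up placeholders for whatever block.c actually needs to
check and fall back to:

/* Sketch: bypass the coroutine wrapper when nothing has to run after
 * the driver completes the request.
 */
BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
                                 QEMUIOVector *qiov, int nb_sectors,
                                 BlockDriverCompletionFunc *cb, void *opaque)
{
    BlockDriver *drv = bs->drv;

    /* Fast path: not inside a coroutine, the driver has a native AIO
     * implementation, and no pre/post-processing (copy-on-read,
     * throttling, tracked requests, ...) is needed for this request.
     */
    if (!qemu_in_coroutine() && drv && drv->bdrv_aio_readv &&
        bdrv_req_needs_no_postprocessing(bs, sector_num, nb_sectors)) {
        /* No wrapper AIOCB, no coroutine: hand it straight to the driver. */
        return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                                   cb, opaque);
    }

    /* Slow path: the existing coroutine-based implementation. */
    return bdrv_aio_readv_slow(bs, sector_num, qiov, nb_sectors, cb, opaque);
}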

> Or do you have a clever solution for how you'd go about it without
> impacting the common case?

I don't really have any ace up my sleeve, but there are some things that
bother me in the block layer's AIO API and in block.c in general.

One is that block.c can do all the pre-processing it wants before
issuing AIO to the driver, but it cannot run any post-processing between
the driver completing the request and the caller's callback being
invoked.  This means that my patches break bdrv_drain_all (they cannot
call tracked_request_end).
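
To make that concrete: the only way I can see for block.c to run
tracked_request_end() before the guest's callback is to interpose its
own completion callback, and that wrapper is exactly the allocation the
fast path is trying to avoid.  Roughly (names invented for
illustration):

typedef struct TrackedAIOCB {
    BdrvTrackedRequest req;            /* request being tracked */
    BlockDriverCompletionFunc *cb;     /* caller's callback */
    void *opaque;                      /* caller's opaque */
} TrackedAIOCB;

static void tracked_aio_cb(void *opaque, int ret)
{
    TrackedAIOCB *acb = opaque;

    /* Post-processing has to happen here, before the caller sees the
     * completion; skipping the wrapper means skipping this too.
     */
    tracked_request_end(&acb->req);
    acb->cb(acb->opaque, ret);
    g_free(acb);
}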

Another is all the similar structs that we have (RwCo,
BdrvTrackedRequest, BlockRequest, etc.).

Perhaps it would help if we had a single "real" block request object,
an extension of the BlockDriverAIOCB that includes enough data to
subsume all these request structs.  That should help share code between
the coroutine and AIO paths for the common case where a single yield is
enough.  I think the single-yield case is the one that is really worth
optimizing for.  If done properly, this could simplify a lot of block.c
code, but it is really difficult to get right, and unless the design is
sound the code is going to come out really ugly. :(
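
Very roughly, something of this shape -- the name and field grouping are
guesswork, only meant to show what "subsuming" the existing structs
would look like:

typedef struct BlockRequestCommon {    /* placeholder name */
    BlockDriverAIOCB common;           /* bs, cb, opaque */

    /* What RwCo carries today for the coroutine path. */
    int64_t sector_num;
    int nb_sectors;
    QEMUIOVector *qiov;
    bool is_write;
    int ret;

    /* What BdrvTrackedRequest carries for bdrv_drain_all and request
     * serialization. */
    QLIST_ENTRY(BlockRequestCommon) list;
    bool serialising;

    /* Coroutine to re-enter, or NULL in the single-yield/AIO case where
     * calling common.cb on completion is enough. */
    Coroutine *co;
} BlockRequestCommon;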

Another thing to evaluate is the performance gap (if there is any)
between aio=threads and aio=native.  The only advantage of aio=native,
AFAIU, is the batch submission of requests (plug/unplug).  But
aio=threads often ends up having better performance because the kernel
folks have optimized VFS a lot.  So, in the aio=threads case, we might
as well move the format code out of the iothread and into the worker
thread, and get rid of the coroutine cost simply by making everything
synchronous.  Looking within QEMU, this worked out very well for migration.

(We could do batch submission of requests to the thread pool if there
were a variant of sem_post that could add multiple counts to the same
semaphore, similar to ReleaseSemaphore on Windows.)
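
For what it's worth, on POSIX that would probably mean building the
semaphore on a mutex/condvar pair, so that n counts can be added with
one lock acquisition and one wakeup, in the spirit of
ReleaseSemaphore(h, n, NULL).  A self-contained sketch, not qemu-thread
code:

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    unsigned count;
} BatchSemaphore;

static void batch_sem_init(BatchSemaphore *s, unsigned value)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->count = value;
}

/* Add n counts with a single lock acquisition and a single wakeup. */
static void batch_sem_post_n(BatchSemaphore *s, unsigned n)
{
    pthread_mutex_lock(&s->lock);
    s->count += n;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

static void batch_sem_wait(BatchSemaphore *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count == 0) {
        pthread_cond_wait(&s->cond, &s->lock);
    }
    s->count--;
    pthread_mutex_unlock(&s->lock);
}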

>> The problem is that your patches do touch too much code and subtly
>> break too much stuff.  The one I wrote does have a little breakage
>> because I don't understand bs->growable 100% and I didn't really put
>> much effort into it (my deadline being basically "be done as soon as
>> the shower is free"), and it is ugly as hell, _but_ it should be
>> compatible with the way the block layer works.
> 
> Yes, your patch is definitely much more palatable than Ming's. The part
> that I still don't like about it is that it would be stating "in the
> common case, we're only doing the second best thing". I'm not yet
> convinced that coroutines perform necessarily worse than state-passing
> callbacks.

Coroutines lump all the allocation costs together at the time you
allocate the stack, but have (much) more expensive context switching.
Your patches decrease the allocation costs by placing the AIOCB on the
stack.

Since you have to allocate an AIOCB anyway if the caller uses
bdrv_aio_readv/writev, coroutines do perform necessarily worse if the
drivers have little or no state to pass around.  This is the case for
both the thread-pool and linux-aio AIOCBs, and it is also the case for
the generic block layer whenever the long "if" statements evaluate to
true.

Paolo


