
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode


From: Karl Rister
Subject: Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
Date: Mon, 14 Nov 2016 09:36:44 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0

On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the
>>> "overhead" of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order to
>>> benefit performance the host must have extra CPU cycles available on
>>> physical CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a
>>> polling callback into the event loop.  Polling functions are implemented
>>> for virtio-blk virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
>>> nanoseconds to poll before entering the usual blocking poll(2) syscall.
>>> Try setting this variable to the time from old request completion to new
>>> virtqueue kick.
>>>
>>> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to
>>> get any polling!
>>>
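To make the mechanism above concrete, here is a rough sketch of the
poll-then-block idea in C.  It is only an illustration under assumptions,
not the actual patch: PollHandler, run_poll_handlers() and
event_loop_iteration() are made-up names, max_ns stands in for
QEMU_AIO_POLL_MAX_NS, and QEMU's real AioContext API differs.

#define _POSIX_C_SOURCE 200809L
/* Illustrative sketch only -- not the actual patch.  The names here
 * are invented for the example. */
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef bool (*poll_fn)(void *opaque);  /* returns true on progress */

typedef struct {
    poll_fn poll;       /* e.g. "was the virtqueue kicked?" or "AIO done?" */
    void *opaque;
} PollHandler;

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Busy-poll the registered handlers for up to max_ns nanoseconds.
 * Returns true as soon as any handler makes progress, so the caller
 * can skip the blocking poll(2) on this iteration. */
static bool run_poll_handlers(PollHandler *handlers, int n, int64_t max_ns)
{
    int64_t deadline = now_ns() + max_ns;

    do {
        for (int i = 0; i < n; i++) {
            if (handlers[i].poll(handlers[i].opaque)) {
                return true;
            }
        }
    } while (now_ns() < deadline);

    return false;
}

static void event_loop_iteration(PollHandler *handlers, int n,
                                 struct pollfd *fds, int nfds, int64_t max_ns)
{
    /* max_ns == 0 (environment variable unset) means no polling at all. */
    if (max_ns > 0 && run_poll_handlers(handlers, n, max_ns)) {
        return;              /* progress made without a syscall */
    }
    poll(fds, nfds, -1);     /* fall back to the usual blocking poll(2) */
    /* ...dispatch ready fd handlers here as usual... */
}
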
>>> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
>>> values.  If you don't find a good value we should double-check the
>>> tracing data to see if this experimental code can be improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5-minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and an IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem), which
>> is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS      IOPS
>>                unset    31,383
>>                    1    46,860
>>                    2    46,440
>>                    4    35,246
>>                    8    34,973
>>                   16    46,794
>>                   32    46,729
>>                   64    35,520
>>                  128    45,902
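For reference, the workload above corresponds roughly to a fio job file
like the one below, assuming sequential reads.  The job name and device
path are placeholders, not the exact configuration used:

; Approximation of the test described above (illustrative only;
; job name and device path are placeholders).
[seq-4k-qd1]
ioengine=libaio
direct=1
rw=read
bs=4k
iodepth=1
runtime=300
time_based
filename=/dev/vdb
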
> 
> The environment variable is in nanoseconds.  The range of values you
> tried is very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Agreed.  As I alluded to in another post, I decided to start at 1 and
double the value until I saw a difference, expecting that it would have
to get quite large before that happened.  The results went in a
different direction, and then I got distracted by the variation at
certain points.  I figured that the fact that noticeable improvements
were possible with such low values was interesting in itself.

I will definitely continue the progression and capture some larger values.

> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
> 
>> I found the results for 4, 8, and 64 odd, so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration    QEMU_AIO_POLL_MAX_NS=2   QEMU_AIO_POLL_MAX_NS=4
>>         1                    46,972                   35,434
>>         2                    46,939                   35,719
>>         3                    47,005                   35,584
>>         4                    47,016                   35,615
>>         5                    47,267                   35,474
>>
>> So the results seem consistent.
> 
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
> 
> Comparing traces could shed light on the cause for this difference.
> 
>> I saw some discussion on the patches which makes me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256 KB) and multiple IO depths (1 and
>> 32) that we were doing when we started looking at this.
> 
> I'll send an updated version of the patches.
> 
> Stefan
> 


-- 
Karl Rister <address@hidden>


