Re: [Qemu-devel] Combining synchronous and asynchronous IO


From: Sergio Lopez
Subject: Re: [Qemu-devel] Combining synchronous and asynchronous IO
Date: Fri, 15 Mar 2019 16:33:43 +0100
User-agent: mu4e 1.0; emacs 26.1

Stefan Hajnoczi writes:

> On Thu, Mar 14, 2019 at 06:31:34PM +0100, Sergio Lopez wrote:
>> Our current AIO path does a great job of offloading work from the VM,
>> and combined with IOThreads it provides good performance in most
>> scenarios. But it also comes with costs: a longer execution path and
>> the need for the scheduler to intervene at various points.
>> 
>> There's one particular workload that suffers from this cost, and that's
>> when you have just 1 or 2 cores on the Guest issuing synchronous
>> requests. This happens to be a pretty common workload for some DBs and,
>> in a general sense, on small VMs.
>> 
>> I did a quick'n'dirty implementation on top of virtio-blk to get some
>> numbers. This comes from a VM with 4 CPUs running on an idle server,
>> with a secondary virtio-blk disk backed by a null_blk device with a
>> simulated latency of 30us.
>
> Can you describe the implementation in more detail?  Does "synchronous"
> mean that hw/block/virtio_blk.c makes a blocking preadv()/pwritev() call
> instead of calling blk_aio_preadv/pwritev()?  If so, then you are also
> bypassing the QEMU block layer (coroutines, request tracking, etc) and
> that might explain some of the latency.

The first implementation, the one I used to get these numbers, is just
preadv/pwrite called directly from virtio_blk.c, as you correctly
guessed. I know it's unfair, but I wanted to look at the best possible
scenario first, and then measure the cost of the other layers.
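
To make "synchronous" concrete, here is a minimal sketch of a blocking
vectored read/write, not the actual patch. The helper names (sio_preadv,
sio_pwritev, sio_roundtrip_demo) are hypothetical; only preadv() and
pwritev() are real calls. The calling thread sleeps in the kernel until
the request completes, which is what skips the event loop round trip.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes preadv()/pwritev() on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: blocking vectored read.
 * Returns the number of bytes read, or -errno on failure. */
static ssize_t sio_preadv(int fd, const struct iovec *iov, int iovcnt,
                          off_t off)
{
    ssize_t ret;

    assert(iov != NULL && iovcnt > 0);
    do {
        ret = preadv(fd, iov, iovcnt, off);
    } while (ret < 0 && errno == EINTR);    /* retry if a signal interrupted us */
    return ret < 0 ? -errno : ret;
}

/* Hypothetical helper: blocking vectored write, same conventions. */
static ssize_t sio_pwritev(int fd, const struct iovec *iov, int iovcnt,
                           off_t off)
{
    ssize_t ret;

    assert(iov != NULL && iovcnt > 0);
    do {
        ret = pwritev(fd, iov, iovcnt, off);
    } while (ret < 0 && errno == EINTR);
    return ret < 0 ? -errno : ret;
}

/* Usage: round-trip 8 bytes through a temporary file, reading them back
 * into two separate buffers. Returns 0 on success, -1 otherwise. */
static int sio_roundtrip_demo(void)
{
    char path[] = "/tmp/sio_demoXXXXXX";
    char out[] = "ABCDEFG";                 /* 8 bytes including the NUL */
    char a[4], b[4];
    struct iovec wv = { .iov_base = out, .iov_len = sizeof(out) };
    struct iovec rv[2] = {
        { .iov_base = a, .iov_len = sizeof(a) },
        { .iov_base = b, .iov_len = sizeof(b) },
    };
    int fd = mkstemp(path);
    int ok;

    if (fd < 0) {
        return -1;
    }
    unlink(path);                           /* file disappears on close */
    ok = sio_pwritev(fd, &wv, 1, 0) == (ssize_t)sizeof(out) &&
         sio_preadv(fd, rv, 2, 0) == (ssize_t)sizeof(out) &&
         memcmp(a, out, 4) == 0 && memcmp(b, out + 4, 4) == 0;
    close(fd);
    return ok ? 0 : -1;
}
```

Note that this sketch ignores short reads and writes, which a real
device model would have to handle.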

I'm working now on writing non-coroutine counterparts for
blk_co_[preadv|pwrite], so we have SIO without bypassing the block layer.

> It's important for this discussion that we understand what you tried
> out.  "Synchronous" can mean different things.  Since iothread is in
> play the code path is still asynchronous from the vcpu thread's
> perspective (thanks ioeventfd!).  The guest CPU is not stuck during I/O
> (good for quality of service) - however SIO+iothread may need to be
> woken up and scheduled on a host CPU (bad for latency).

I've tried SIO with ioeventfd=off, to make it fully synchronous, but the
performance is significantly worse. I'm not sure whether this is due to
cache pollution, or simply because with ioeventfd the guest CPU is able
to move on early and be ready to process the IRQ when it's signalled. Or
maybe both.

>>  - Average latency (us)
>> 
>> ----------------------------------------
>> |        | AIO+iothread | SIO+iothread |
>> | 1 job  |      70      |      55      |
>> | 2 jobs |      83      |      82      |
>> | 4 jobs |      90      |     159      |
>> ----------------------------------------
>
> BTW recently I've found that the latency distribution can contain
> important clues that a simple average doesn't show (e.g. multiple peaks,
> outliers, etc).  If you continue to investigate this it might be
> interesting to plot the distribution.

Interesting, noted.
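
As a starting point for looking at the distribution, a fixed-width
histogram over the per-request latencies is probably enough to spot
multiple peaks or a long tail. This is only a sketch: latency_histogram
and its parameters are hypothetical names, and it assumes the samples
have already been collected in microseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count latency samples into fixed-width buckets.
 * Bucket i covers [i * bucket_us, (i + 1) * bucket_us); the last bucket
 * also collects everything above its range, so outliers remain visible. */
static void latency_histogram(const unsigned int *samples_us, size_t n,
                              unsigned int bucket_us,
                              unsigned long *counts, size_t nbuckets)
{
    assert(bucket_us > 0 && nbuckets > 0);

    for (size_t i = 0; i < nbuckets; i++) {
        counts[i] = 0;
    }
    for (size_t i = 0; i < n; i++) {
        size_t b = samples_us[i] / bucket_us;

        if (b >= nbuckets) {
            b = nbuckets - 1;               /* clamp outliers into the last bucket */
        }
        counts[b]++;
    }
}
```

Two clearly separated non-empty buckets would point at two distinct code
paths or wakeup patterns, which an average hides.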

>> In this case the intuition matches the reality, and synchronous IO wins
>> when there's just 1 job issuing the requests, while it loses hard when
>> there are 4.
>
> Have you looked at the overhead of AIO+event loop?  ppoll()/epoll(),
> read()ing the eventfd to clear it, and Linux AIO io_submit().

Not for a while, and that reminds me that I wanted to check whether we
could improve the poll-max-ns heuristics.

> I had some small patches that try to reorder/optimize these operations
> but never got around to benchmarking and publishing them.  They do not
> reduce latency as low as SIO but they shave off a handful of
> microseconds.
>
> Resuming this work might be useful.  Let me know if you'd like me to dig
> out the old patches.

I would definitely like to take a look at those patches.

>> 
>> While my first thought was implementing this as a tunable, turns out we
>> have a hint about the nature of the workload in the number of the
>> requests in the VQ. So I updated the code to use SIO if there's just 1
>> request and AIO otherwise, with these results:
>
> Nice, avoiding tunables is good.  That way it can automatically adjust
> depending on the current workload and we don't need to educate users on
> tweaking a tunable.
>
>> 
>> -----------------------------------------------------------
>> |        | AIO+iothread | SIO+iothread | AIO+SIO+iothread |
>> | 1 job  |      70      |      55      |        55        |
>> | 2 jobs |      83      |      82      |        78        |
>> | 4 jobs |      90      |     159      |        90        |
>> -----------------------------------------------------------
>> 
>> This data makes me think this is something worth pursuing, but I'd like
>> to hear your opinion on it.
>
> I think it's definitely worth experimenting with more.  One thing to
> consider: the iothread is a shared resource when multiple devices are
> assigned to a single iothread.  In that case we probably do not want SIO
> since it would block the other emulated devices from processing
> requests.

Good point.
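
Putting the two conditions together, the decision could look roughly
like the sketch below. All names here (io_path, choose_io_path, and the
two counters) are hypothetical and do not correspond to existing QEMU
code; the assumption is that the device can cheaply learn how many
requests are pending in the VQ and how many devices share its iothread.

```c
#include <assert.h>

enum io_path {
    IO_PATH_SYNC,       /* issue the request with a blocking call */
    IO_PATH_ASYNC,      /* submit through the regular AIO path */
};

/* Hypothetical helper: pick the I/O path for the next batch of requests.
 * pending_reqs is the number of requests found in the virtqueue,
 * devices_on_iothread the number of devices served by this iothread. */
static enum io_path choose_io_path(unsigned int pending_reqs,
                                   unsigned int devices_on_iothread)
{
    assert(devices_on_iothread >= 1);

    /* A blocking call would stall every other device on a shared
     * iothread, so only consider SIO when the iothread is dedicated. */
    if (devices_on_iothread > 1) {
        return IO_PATH_ASYNC;
    }
    /* A single pending request hints at a synchronous guest workload. */
    return pending_reqs == 1 ? IO_PATH_SYNC : IO_PATH_ASYNC;
}
```

Since the choice is made per batch, the device would fall back to AIO on
its own as soon as the guest starts queueing more than one request.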

> On a related note, there is a summer internship project to implement
> support for the new io_uring API (successor to Linux AIO):
> https://wiki.qemu.org/Google_Summer_of_Code_2019#io_uring_AIO_engine
>
> So please *don't* implement io_uring support right now ;-).

Heh, you got me. That was my initial idea, but luckily I took a look at
the GSoC page first ;-)

Thanks,
Sergio.


