Re: [Qemu-block] Combining synchronous and asynchronous IO


From: Stefan Hajnoczi
Subject: Re: [Qemu-block] Combining synchronous and asynchronous IO
Date: Fri, 15 Mar 2019 15:00:36 +0000
User-agent: Mutt/1.11.3 (2019-02-01)

On Thu, Mar 14, 2019 at 06:31:34PM +0100, Sergio Lopez wrote:
> Our current AIO path does a great job at offloading the work from the
> VM and, combined with IOThreads, provides good performance in most
> scenarios. But it also comes at a cost, in both a longer execution
> path and the need for scheduler intervention at various points.
> 
> There's one particular workload that suffers from this cost, and that's
> when you have just 1 or 2 cores on the Guest issuing synchronous
> requests. This happens to be a pretty common workload for some DBs and,
> in a general sense, on small VMs.
> 
> I did a quick'n'dirty implementation on top of virtio-blk to get some
> numbers. This comes from a VM with 4 CPUs running on an idle server,
> with a secondary virtio-blk disk backed by a null_blk device with a
> simulated latency of 30us.

Can you describe the implementation in more detail?  Does "synchronous"
mean that hw/block/virtio_blk.c makes a blocking preadv()/pwritev() call
instead of calling blk_aio_preadv/pwritev()?  If so, then you are also
bypassing the QEMU block layer (coroutines, request tracking, etc) and
that might explain some of the latency.
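
For reference, here is a tiny standalone sketch (plain POSIX, not the
actual QEMU code paths) of what "synchronous" means at the syscall
level: the thread handling the request blocks in preadv() until the
data is there, instead of queuing the request via blk_aio_preadv() and
being called back on completion.

/*
 * Standalone illustration of a blocking request.  Not QEMU code; it only
 * shows the syscall-level behavior being discussed.
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

/* The calling thread sleeps in the kernel until the read completes. */
static ssize_t sync_request(int fd, void *buf, size_t len, off_t offset)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return preadv(fd, &iov, 1, offset);
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file-or-block-device>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    char buf[4096];
    ssize_t n = sync_request(fd, buf, sizeof(buf), 0);
    printf("read %zd bytes synchronously\n", n);
    close(fd);
    return 0;
}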

It's important for this discussion that we understand what you tried
out.  "Synchronous" can mean different things.  Since an iothread is in
play, the code path is still asynchronous from the vcpu thread's
perspective (thanks, ioeventfd!).  The guest CPU is not stuck during I/O
(good for quality of service) - however, with SIO+iothread the iothread
may still need to be woken up and scheduled on a host CPU (bad for
latency).
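
As a minimal illustration of that wakeup cost (generic eventfd code,
not the actual ioeventfd plumbing): the "vcpu" side only writes to an
eventfd, while the "iothread" sleeps in poll() and has to be woken and
rescheduled by the kernel before it can touch the request.

/* Minimal eventfd wakeup demo.  Build with: gcc demo.c -lpthread */
#include <inttypes.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int notify_fd;

static void *iothread_fn(void *arg)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN };
    poll(&pfd, 1, -1);                   /* sleep until the guest "kicks" us */
    uint64_t val;
    read(notify_fd, &val, sizeof(val));  /* clear the eventfd counter */
    printf("iothread woken up after %" PRIu64 " notification(s)\n", val);
    return NULL;
}

int main(void)
{
    notify_fd = eventfd(0, 0);
    pthread_t iothread;
    pthread_create(&iothread, NULL, iothread_fn, NULL);

    uint64_t one = 1;
    write(notify_fd, &one, sizeof(one)); /* the vcpu-side "kick" */
    pthread_join(iothread, NULL);
    close(notify_fd);
    return 0;
}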

>  - Average latency (us)
> 
> ----------------------------------------
> |        | AIO+iothread | SIO+iothread |
> | 1 job  |      70      |      55      |
> | 2 jobs |      83      |      82      |
> | 4 jobs |      90      |     159      |
> ----------------------------------------

BTW recently I've found that the latency distribution can contain
important clues that a simple average doesn't show (e.g. multiple peaks,
outliers, etc).  If you continue to investigate this it might be
interesting to plot the distribution.
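
Even something as simple as bucketing the per-request latencies makes
multiple peaks or a long tail visible.  A rough sketch (made-up sample
values, not tied to any particular benchmark tool):

#include <stdio.h>

#define NBUCKETS  20
#define BUCKET_US 10    /* 10us-wide buckets: 0-9, 10-19, ... */

static void print_histogram(const double *lat_us, int n)
{
    int hist[NBUCKETS] = { 0 };
    for (int i = 0; i < n; i++) {
        int b = (int)(lat_us[i] / BUCKET_US);
        if (b >= NBUCKETS) {
            b = NBUCKETS - 1;            /* clamp outliers into the last bucket */
        }
        hist[b]++;
    }
    for (int b = 0; b < NBUCKETS; b++) {
        printf("%3d-%3dus ", b * BUCKET_US, (b + 1) * BUCKET_US - 1);
        for (int j = 0; j < hist[b]; j++) {
            putchar('#');
        }
        putchar('\n');
    }
}

int main(void)
{
    /* Fake samples just to exercise the code. */
    double samples[] = { 55, 57, 61, 70, 72, 83, 90, 159, 160, 58 };
    print_histogram(samples, sizeof(samples) / sizeof(samples[0]));
    return 0;
}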

> In this case the intuition matches the reality, and synchronous IO wins
> when there's just 1 job issuing the requests, while it loses hard when
> there are 4.

Have you looked at the overhead of AIO+event loop?  ppoll()/epoll(),
read()ing the eventfd to clear it, and Linux AIO io_submit().
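
For reference, this is roughly the per-request sequence with raw
libaio + eventfd (a standalone sketch, not QEMU's AioContext code);
each numbered step below is a syscall on the latency path.

/* Build with: gcc -D_GNU_SOURCE sketch.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/dev/nullb0"; /* e.g. the null_blk device */
    int fd = open(path, O_RDONLY | O_DIRECT);
    int efd = eventfd(0, 0);
    io_context_t ctx = 0;
    void *buf;

    if (fd < 0 || efd < 0 || io_setup(128, &ctx) < 0 ||
        posix_memalign(&buf, 4096, 4096)) {
        perror("setup");
        return 1;
    }

    /* 1. io_submit(): hand the request to the kernel. */
    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);
    io_set_eventfd(&cb, efd);            /* completion will bump the eventfd */
    io_submit(ctx, 1, cbs);

    /* 2. ppoll(): the event loop sleeps until the eventfd becomes readable. */
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    ppoll(&pfd, 1, NULL, NULL);

    /* 3. read() the eventfd to clear it. */
    uint64_t count;
    read(efd, &count, sizeof(count));

    /* 4. io_getevents(): reap the completion. */
    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);
    printf("completed, res=%ld\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(efd);
    close(fd);
    return 0;
}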

I had some small patches that try to reorder/optimize these operations
but never got around to benchmarking and publishing them.  They do not
reduce latency as low as SIO but they shave off a handful of
microseconds.

Resuming this work might be useful.  Let me know if you'd like me to dig
out the old patches.

> 
> While my first thought was implementing this as a tunable, it turns
> out we have a hint about the nature of the workload in the number of
> requests in the VQ. So I updated the code to use SIO if there's just 1
> request and AIO otherwise, with these results:

Nice, avoiding tunables is good.  That way it can automatically adjust
depending on the current workload and we don't need to educate users on
tweaking a tunable.
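
Just to make sure we're talking about the same heuristic, here is a
rough sketch of the dispatch logic (hypothetical names, not your actual
patch): handle a lone request in-line with a blocking syscall, and keep
the AIO path for batches so requests stay in flight concurrently.

#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical request shape; the real virtio-blk request carries more. */
struct blk_request {
    int fd;
    void *buf;
    size_t len;
    off_t offset;
};

/* Stub standing in for the existing async path (blk_aio_preadv() in QEMU). */
static void submit_async(struct blk_request *req)
{
    (void)req;  /* the real code would queue this for AIO submission */
}

/* Dispatch a batch popped from the virtqueue. */
void dispatch_batch(struct blk_request *reqs, size_t n)
{
    if (n == 1) {
        struct iovec iov = { .iov_base = reqs[0].buf, .iov_len = reqs[0].len };
        ssize_t ret = preadv(reqs[0].fd, &iov, 1, reqs[0].offset); /* blocks in-line */
        (void)ret;  /* the real code would complete the request to the guest here */
    } else {
        for (size_t i = 0; i < n; i++) {
            submit_async(&reqs[i]);      /* existing AIO path for batches */
        }
    }
}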

> 
> -----------------------------------------------------------
> |        | AIO+iothread | SIO+iothread | AIO+SIO+iothread |
> | 1 job  |      70      |      55      |        55        |
> | 2 jobs |      83      |      82      |        78        |
> | 4 jobs |      90      |     159      |        90        |
> -----------------------------------------------------------
> 
> This data makes me think this is something worth pursuing, but I'd like
> to hear your opinion on it.

I think it's definitely worth experimenting with more.  One thing to
consider: the iothread is a shared resource when multiple devices are
assigned to a single iothread.  In that case we probably do not want SIO
since it would block the other emulated devices from processing
requests.

On a related note, there is a summer internship project to implement
support for the new io_uring API (successor to Linux AIO):
https://wiki.qemu.org/Google_Summer_of_Code_2019#io_uring_AIO_engine

So please *don't* implement io_uring support right now ;-).

Stefan
