From: Josh Durgin
Subject: Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?
Date: Wed, 26 Aug 2015 17:56:03 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.8.0

On 08/26/2015 04:47 PM, Andrey Korolyov wrote:
On Thu, Aug 27, 2015 at 2:31 AM, Josh Durgin <address@hidden> wrote:
On 08/26/2015 10:10 AM, Andrey Korolyov wrote:

On Thu, May 14, 2015 at 4:42 PM, Andrey Korolyov <address@hidden> wrote:

On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
<address@hidden> wrote:

On 08/25/2014 03:50 PM, Chris Friesen wrote:

I think I might have a glimmering of what's going on.  Someone please
correct me if I get something wrong.

I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
respect to max inflight operations, and neither does virtio-blk calling
virtio_add_queue() with a queue size of 128.

I think what's happening is that virtio_blk_handle_output() spins,
pulling data off the 128-entry queue and calling
virtio_blk_handle_request().  At this point that queue entry can be
reused, so the queue size isn't really relevant.

In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
every 32 writes we'll call virtio_submit_multiwrite() which calls down
into bdrv_aio_multiwrite().  That tries to merge requests and then for
each resulting request calls bdrv_aio_writev() which ends up calling
qemu_rbd_aio_writev(), which calls rbd_start_aio().
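
To make the batching concrete, it behaves roughly like the sketch
below (untested, not the actual QEMU code; the type names,
submit_batch()/handle_write() and the driver loop are all invented for
illustration):

/* Rough stand-in for the "flush every 32 writes" batching described
 * above; not QEMU code. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_BATCHED_WRITES 32

typedef struct {
    uint64_t sector;
    size_t   len;
} WriteReq;

typedef struct {
    WriteReq reqs[MAX_BATCHED_WRITES];
    int      num_reqs;
} WriteBatch;

/* Stand-in for bdrv_aio_multiwrite(); here it just reports the flush
 * and resets the batch. */
static void submit_batch(WriteBatch *b)
{
    printf("submitting %d batched write(s)\n", b->num_reqs);
    b->num_reqs = 0;
}

/* Stand-in for the per-request write path: queue up, flush at 32. */
static void handle_write(WriteBatch *b, WriteReq req)
{
    b->reqs[b->num_reqs++] = req;
    if (b->num_reqs == MAX_BATCHED_WRITES) {
        submit_batch(b);
    }
}

int main(void)
{
    WriteBatch batch = { .num_reqs = 0 };

    /* Drain a fake 128-entry queue; note that nothing here bounds how
     * many submitted writes can be outstanding at once. */
    for (uint64_t s = 0; s < 128; s++) {
        handle_write(&batch, (WriteReq){ .sector = s * 8, .len = 4096 });
    }
    if (batch.num_reqs > 0) {
        submit_batch(&batch);   /* flush the partial tail */
    }
    return 0;
}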

rbd_start_aio() allocates a buffer and converts from iovec to a single
buffer.  This buffer stays allocated until the request is acked, which
is where the bulk of the memory overhead with rbd is coming from (has
anyone considered adding iovec support to rbd to avoid this extra
copy?).
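
The copy itself is essentially an iovec flatten, along these lines
(untested sketch; flatten_iovec() is an invented name, the real code
lives in QEMU's rbd block driver):

/* Flatten a scatter/gather list into one contiguous malloc'd buffer
 * that can be handed to the rbd aio call; the caller keeps it alive
 * until the request is acked, then frees it. */
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

char *flatten_iovec(const struct iovec *iov, int iovcnt, size_t *total)
{
    size_t size = 0;
    for (int i = 0; i < iovcnt; i++) {
        size += iov[i].iov_len;
    }

    char *buf = malloc(size);
    if (!buf) {
        return NULL;
    }

    size_t off = 0;
    for (int i = 0; i < iovcnt; i++) {
        memcpy(buf + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }

    *total = size;
    return buf;
}

One buffer like this per in-flight write is what makes the RSS grow
with the number of outstanding requests.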

The only limit I see in the whole call chain from
virtio_blk_handle_request() on down is the call to
bdrv_io_limits_intercept() in bdrv_co_do_writev().  However, that
doesn't provide any limit on the absolute number of inflight
operations,
only on operations/sec.  If the ceph server cluster can't keep up with
the aggregate load, then the number of inflight operations can still
grow indefinitely.

Chris



I was a bit concerned that I'd need to extend the IO throttling code to
support a limit on total inflight bytes, but it doesn't look like that
will
be necessary.

It seems that using mallopt() to set the trim/mmap thresholds to 128K is
enough to minimize the increase in RSS and also drop it back down after
an
I/O burst.  For now this looks like it should be sufficient for our
purposes.
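
For reference, the tuning amounts to the two glibc knobs below (the
128K value is the one mentioned above), called once at startup:

#include <malloc.h>

void tune_malloc(void)
{
    /* Trim the heap back to the kernel once more than 128K is free at
     * its top, instead of keeping it around for reuse. */
    mallopt(M_TRIM_THRESHOLD, 128 * 1024);

    /* Serve allocations of 128K or more via mmap(), so they are
     * unmapped as soon as they are freed rather than growing the heap. */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
}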

I'm actually a bit surprised I didn't have to go lower, but it seems to
work
for both "dd" and dbench testcases so we'll give it a try.

Chris


Bumping this...

We still occasionally run into an unbounded cache growth issue, which
can be observed on all post-1.4 versions of qemu with the rbd backend
in writeback mode and a certain pattern of guest operations. The issue
is confirmed for virtio and can be re-triggered by issuing an excessive
number of write requests without timely acknowledging the completions
returned from the emulator's cache. Since most applications behave
correctly, the OOM issue is very rare (and we developed an ugly
workaround for such situations long ago). If anybody is interested in
fixing this, I can send a prepared image for reproduction, or
instructions for making one, whichever is preferable.

Thanks!


A gentle bump: at least for the rbd backend with a writethrough or
writeback cache, it is possible to achieve unbounded memory growth with
a lot of large unfinished ops, which can be considered a DoS. Usually
it is triggered by poorly written applications in the wild, like
proprietary KV databases or MSSQL under Windows, but regular
applications, primarily OSS databases, can easily drive RSS growth of
hundreds of megabytes. There is probably no straightforward way to
limit the in-flight request size by re-chunking it, since a malicious
guest could inflate it to very high numbers anyway, but it is fine to
crash such a guest; protecting real-world workloads with a simple
in-flight op count limiter looks like the more achievable option.


Hey, sorry I missed this thread before.

What version of ceph are you running? There was an issue with ceph
0.80.8 and earlier that could cause lots of extra memory usage by rbd's
cache (even in writethrough mode) due to copy-on-write triggering
whole-object (default 4MB) reads, and sticking those in the cache without
proper throttling [1]. I'm wondering if this could be causing the large
RSS growth you're seeing.

In-flight requests do have buffers and structures allocated for them in
librbd, but these should have lower overhead than cow. If these are the
problem, it seems to me a generic limit on in flight ops in qemu would
be a reasonable fix. Other backends have resources tied up by in-flight
ops as well.
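
A minimal sketch of what such a generic cap could look like
(illustration only, not a QEMU patch; the names are invented, and QEMU
itself would presumably park coroutines rather than block a thread,
but the accounting is the same):

#include <pthread.h>

#define MAX_INFLIGHT_REQS 128

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  can_submit;
    int             inflight;
} InflightLimiter;

static InflightLimiter limiter = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
};

/* Called before issuing an aio request: wait until there is room. */
static void limiter_acquire(InflightLimiter *l)
{
    pthread_mutex_lock(&l->lock);
    while (l->inflight >= MAX_INFLIGHT_REQS) {
        pthread_cond_wait(&l->can_submit, &l->lock);
    }
    l->inflight++;
    pthread_mutex_unlock(&l->lock);
}

/* Called from the aio completion callback: free up a slot. */
static void limiter_release(InflightLimiter *l)
{
    pthread_mutex_lock(&l->lock);
    l->inflight--;
    pthread_cond_signal(&l->can_submit);
    pthread_mutex_unlock(&l->lock);
}

Anything past the cap then waits its turn instead of piling up
allocated buffers, which bounds the per-device memory tied up by
in-flight requests regardless of how the guest behaves.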

Josh

[1] https://github.com/ceph/ceph/pull/3410




I honestly believe it is the second case. I have had your pull in my
dumpling branch since mid-February, but the number of 'near-OOM to
handle' events over the last few months has stayed the same as in
earlier times, with an overhead ranging from a hundred megabytes to a
gigabyte on top of the theoretical maximum of the VM's consumption.
Since the nature of the issue is very reactive - the RSS can grow fast,
shrink fast, and eventually hit the cgroup limit - I have only a bare
reproducer and a couple of indirect symptoms driving my thoughts in the
direction above; there is still no direct confirmation that unfinished
disk requests are always causing unbounded additional memory
allocation.

Could you run massif on one of these guests with a problematic workload
to see where most of the memory is being used?

Like in this bug report, where it pointed to reads for cow as the
culprit:

http://tracker.ceph.com/issues/6494#note-1


