[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 2/3] exec: posix_madvise usage on SunOS.
Re: [PATCH 2/3] exec: posix_madvise usage on SunOS.
Tue, 21 Jul 2020 10:22:26 +0200
Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.9.0
On 20.07.20 21:13, Dr. David Alan Gilbert wrote:
> (Copies in Dave Hildenbrand)
> * Peter Maydell (email@example.com) wrote:
>> On Sat, 18 Jul 2020 at 14:21, David CARLIER <firstname.lastname@example.org> wrote:
>>> From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
>>> From: David Carlier <email@example.com>
>>> Date: Sat, 18 Jul 2020 13:29:44 +0100
>>> Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
>>> with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
>>> is not accessible thus using posix_madvise here.
>>> Signed-off-by: David Carlier <firstname.lastname@example.org>
>>> exec.c | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>>> diff --git a/exec.c b/exec.c
>>> index 6f381f98e2..0466a75b89 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
>>> uint64_t start, size_t length)
>>> * fallocate'd away).
>>> #if defined(CONFIG_MADVISE)
>>> +#if !defined(CONFIG_SOLARIS)
>>> ret = madvise(host_startaddr, length, MADV_DONTNEED);
>>> + /*
>>> + * mmap and its caddr_t based api is not accessible
>>> + * with _XOPEN_SOURCE set on illumos
>>> + */
>>> + ret = posix_madvise(host_startaddr, length,
>> Hi. I'm not sure this patch will do the right thing, because
>> I don't think that Solaris's POSIX_MADV_DONTNEED provides
>> the semantics that this QEMU function says it needs. The
>> comment at the top of the function says:
>> * Unmap pages of memory from start to start+length such that
>> * they a) read as 0, b) Trigger whatever fault mechanism
>> * the OS provides for postcopy.
>> * The pages must be unmapped by the end of the function.
> This code has moved around a bit over it's life; joining the case
> needed by balloon and the case needed by postcopy.
>> (Aside: the use of 'unmap' in this comment is a bit confusing,
>> because it clearly doesn't mean 'unmap' if it wants read-as-0.
>> And the reference to faults on postcopy is incomprehensible
>> to me: if memory is read-as-0 it isn't going to fault.)
> I think because internally to Linux the behaviour is the same;
> this causes the mapping to disappear from the TLB so it faults;
> normally when reading the kernel resolves the fault and puts
> a read-as-zero page there, except if userfault was enabled
> for postcopy, in which case it gives us a kick and we service it.
>> Linux's madvise(MADV_DONTNEED) does guarantee us this
>> read-as-zero behaviour. (It's a silly API choice that Linux
>> put this behaviour behind madvise, which is supposed to be
>> merely advisory, but that's how it is.)
> Yes, I don't think there's any equivalent to madvise
> that guarantees anything.
>> The Solaris
>> posix_madvise() manpage says it is merely advisory and
>> doesn't affect the behaviour of accesses to the memory.
>> If posix_madvise() behaviour was OK in this function, the
>> right way to fix this would be to use qemu_madvise()
>> instead, which already provides this "if host has
>> madvise(), use it, otherwise use posix_madvise()" logic.
>> But I suspect that the direct madvise() here is deliberate.
> Yes, but I can't remember the semantics fully - I think it was because
> we needed the guarantee at this point (and even Linux's
> posix madvise did something different??)
> I've got a note saying we didn't want to use
> qemu_madvise because we wanted to be sure we didn't get
>> Side note: not sure the current code is correct for the
>> BSDs either -- they have madvise() but don't provide
>> Linux's really-read-as-zero guarantee for MADV_DONTNEED.
>> So we should probably be doing something else there, and
>> whatever that something-else is is probably also what
>> Solaris wants.
>> We use ram_block_discard_range() only in migration and
>> in virtio-balloon and virtio-mem; I've cc'd some people
>> who hopefully understand what the requirements on this
>> function are and might have a view on what the not-Linux
>> implementation should look like. (David Gilbert: git
>> blame says you wrote this code :-))
virtio-mem depends on Linux (hw/virtio/Kconfig). I guess
userfaultfd/postcopy is also not relevant in the context of SunOS. So
what remains is virtio-balloon.
virito-balloon ideally wants to discard the actual mapped pages to free
up memory. When memory is re-accessed, a fresh page is faulted in (->
zero-page under Linux). Now, we already have other cases where it looks
like "the balloon works" but it really doesn't. One example is using
vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP,
no memory is actually discarded.
I agree that POSIX_MADV_DONTNEED is not a proper match - different
guarantees. If SunOS cannot implement ram_block_discard_range() as
documented, we should disable it.
I would suggest using ram_block_discard_disable(true) when under SunOS
early during QEMU startup. In addition, we might want to return directly
in ram_block_dicard_range(). We might also want to make virito-balloon
depend on !SubOS.
David / dhildenb