From: Chenyi Qiang
Subject: Re: [RFC 1/2] system/memory: Allow creating IOMMU mappings from RAM discard populate notifiers
Date: Wed, 26 Feb 2025 20:43:36 +0800
User-agent: Mozilla Thunderbird
On 2/25/2025 5:41 PM, David Hildenbrand wrote:
> On 25.02.25 03:00, Chenyi Qiang wrote:
>>
>>
>> On 2/21/2025 6:04 PM, Chenyi Qiang wrote:
>>>
>>>
>>> On 2/21/2025 4:09 PM, David Hildenbrand wrote:
>>>> On 21.02.25 03:25, Chenyi Qiang wrote:
>>>>>
>>>>>
>>>>> On 2/21/2025 3:39 AM, David Hildenbrand wrote:
>>>>>> On 20.02.25 17:13, Jean-Philippe Brucker wrote:
>>>>>>> For Arm CCA we'd like the guest_memfd discard notifier to call the
>>>>>>> IOMMU notifiers and create e.g. VFIO mappings. The default VFIO discard
>>>>>>> notifier isn't sufficient for CCA because the DMA addresses need a
>>>>>>> translation (even without vIOMMU).
>>>>>>>
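To make the needed translation concrete, a minimal standalone sketch (invented helper, assumed address layout, not taken from the patch): in CCA the top IPA bit typically selects the unprotected (shared) half of the guest address space, so the DMA address of shared memory differs from the plain GPA even without a vIOMMU.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: compute the unprotected (shared) alias of a guest
 * address, assuming the top IPA bit selects the shared address space. */
static uint64_t shared_dma_addr(uint64_t gpa, unsigned ipa_bits)
{
    return gpa | (1ULL << (ipa_bits - 1));
}

int main(void)
{
    /* e.g. with 48 IPA bits, GPA 0x80000000 is DMA'd at 0x800080000000 */
    printf("0x%llx\n", (unsigned long long)shared_dma_addr(0x80000000ULL, 48));
    return 0;
}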
>>>>>>> At the moment:
>>>>>>> * guest_memfd_state_change() calls the populate() notifier
>>>>>>> * the populate() notifier calls the IOMMU notifiers
>>>>>>> * the IOMMU notifier handler calls memory_get_xlat_addr() to get a VA
>>>>>>> * it calls ram_discard_manager_is_populated() which fails.
>>>>>>>
>>>>>>> guest_memfd_state_change() only changes the section's state after
>>>>>>> calling the populate() notifier. We can't easily invert the order of
>>>>>>> operations because it uses the old state bitmap to know which pages
>>>>>>> need the populate() notifier.
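To make the ordering problem concrete, a minimal standalone model (invented names, not QEMU code): the populate notifier fires while the tracked state still reads as discarded, so a lookup during notification fails.

#include <stdbool.h>
#include <stdio.h>

/* Models the per-range state that the RamDiscardManager bitmap tracks. */
static bool populated;

/* Models the IOMMU notifier: it can only map a range that reads as
 * populated, mirroring the ram_discard_manager_is_populated() check. */
static int notify_populate(void)
{
    if (!populated) {
        fprintf(stderr, "xlat failed: range still reads as discarded\n");
        return -1;
    }
    return 0;
}

/* Models guest_memfd_state_change(): notifier first, state update after,
 * because the old state decides which pages get notified. */
static int state_change_to_populated(void)
{
    int ret = notify_populate();    /* consults the stale state and fails */

    if (!ret) {
        populated = true;
    }
    return ret;
}

int main(void)
{
    return state_change_to_populated() ? 1 : 0;
}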
>>>>>>
>>>>>> I assume we talk about this code: [1]
>>>>>>
>>>>>> [1] https://lkml.kernel.org/r/20250217081833.21568-1-chenyi.qiang@intel.com
>>>>>>
>>>>>>
>>>>>> +static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
>>>>>> +                                         uint64_t size, bool shared_to_private)
>>>>>> +{
>>>>>> +    int block_size = memory_attribute_manager_get_block_size(mgr);
>>>>>> +    int ret = 0;
>>>>>> +
>>>>>> +    if (!memory_attribute_is_valid_range(mgr, offset, size)) {
>>>>>> +        error_report("%s, invalid range: offset 0x%lx, size 0x%lx",
>>>>>> +                     __func__, offset, size);
>>>>>> +        return -1;
>>>>>> +    }
>>>>>> +
>>>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (shared_to_private) {
>>>>>> +        memory_attribute_notify_discard(mgr, offset, size);
>>>>>> +    } else {
>>>>>> +        ret = memory_attribute_notify_populate(mgr, offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    if (!ret) {
>>>>>> +        unsigned long first_bit = offset / block_size;
>>>>>> +        unsigned long nbits = size / block_size;
>>>>>> +
>>>>>> +        g_assert((first_bit + nbits) <= mgr->bitmap_size);
>>>>>> +
>>>>>> +        if (shared_to_private) {
>>>>>> +            bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
>>>>>> +        } else {
>>>>>> +            bitmap_set(mgr->shared_bitmap, first_bit, nbits);
>>>>>> +        }
>>>>>> +
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>>
>>>>>> Then, in memory_attribute_notify_populate(), we walk the bitmap again.
>>>>>>
>>>>>> Why?
>>>>>>
>>>>>> We just checked that it's all in the expected state, no?
>>>>>>
>>>>>>
>>>>>> virtio-mem doesn't handle it that way, so I'm curious why we would
>>>>>> have to do it here?
>>>>>
>>>>> I was concerned about the case where the guest issues a request for
>>>>> which only part of the range is in the desired state.
>>>>> I think the main problem is the policy for the guest conversion
>>>>> request. My current handling is:
>>>>>
>>>>> 1. When a conversion request is made for a range already in the
>>>>> desired state, the helper simply returns success.
>>>>
>>>> Yes.
>>>>
>>>>> 2. For requests involving a range only partially in the desired state,
>>>>> just the necessary segments are converted, so that the entire range
>>>>> ends up complying with the request.
>>>>
>>>>
>>>> Ah, now I get:
>>>>
>>>> +    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
>>>> +        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
>>>> +        return 0;
>>>> +    }
>>>> +
>>>>
>>>> We're not failing if it might already partially be in the other state.
>>>>
>>>>> 3. In scenarios where a conversion request is declined by other systems,
>>>>> such as a failure from VFIO during notify_populate(), the helper will
>>>>> roll back the request, maintaining consistency.
>>>>>
>>>>> And the policy of virtio-mem is to refuse the state change if not all
>>>>> blocks are in the opposite state.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> Actually, this part is still uncertain to me.
>>>>>
>>>>
>>>> IIUC, the problem does not exist if we only convert a single page at a
>>>> time.
>>>>
>>>> Is there a known use case where such partial conversions could happen?
>>>
>>> I don't see such a case yet. Actually, I'm trying to follow the behavior
>>> of the KVM_SET_MEMORY_ATTRIBUTES ioctl during page conversion. KVM
>>> doesn't reject the request if the whole range isn't in the opposite
>>> state; it just uses xa_store() to update it. Also, I don't see the spec
>>> saying how to handle such a case. To be robust, I just allow this
>>> special case.
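As a standalone illustration of that permissive policy (invented names; a bool array stands in for the shared bitmap, and the per-block overwrite mirrors what xa_store() does per page): only the blocks not already in the target state get flipped, so a partially converted range still converges.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 8

/* One flag per block; true = shared. Stands in for mgr->shared_bitmap. */
static bool shared[NBLOCKS];

/* Permissive conversion: blocks already shared are left alone and the
 * rest are converted, instead of rejecting the mixed-state request. */
static void convert_to_shared(unsigned first, unsigned nblocks)
{
    for (unsigned i = first; i < first + nblocks; i++) {
        if (!shared[i]) {
            /* in QEMU, these are the blocks that would be notified */
            shared[i] = true;
            printf("block %u converted\n", i);
        }
    }
}

int main(void)
{
    shared[2] = shared[3] = true;   /* request arrives partially shared */
    convert_to_shared(0, 6);        /* converts 0, 1, 4, 5; skips 2, 3 */
    return 0;
}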
>>>
>>>>
>>>>> BTW, regarding the status/bitmap tracking, virtio-mem also changes the
>>>>> bitmap after the plug/unplug notifier. This is the same, correct?
>>>> Right. But because we reject these partial requests, we don't have to
>>>> traverse the bitmap and could just adjust the bitmap operations.
>>>
>>> Yes, if we treat it as a guest error/bug, we can adjust it.
>>
>> Hi David, which option do you think is better? If you prefer to reject
>> the partial requests, I'll change it in my next version.
>
> Hi,
>
> still scratching my head. Having to work around it as in this patch here is
> suboptimal.
>
> Could we simplify the whole thing while still allowing for (unexpected)
> partial conversions?
>
> Essentially: if states are mixed, fall back to "1 block at a time"
> handling.
>
> The only problem is: what to do if we fail halfway through? Well, we can
> only have such partial completions for "populate", not for discard.
>
> Option a) Just leave it as "partially completed populate" and return the
> error. The bitmap and the notifiers are consistent.
>
> Option b) Just discard everything: someone tried to convert something
> "partially shared" to "shared". So maybe, if anything goes wrong, we can
> just make it "all private".
>
> The question is also what the expectation from the caller is: can the
> caller even make progress on failure, or do we have to retry until it
> works?
Yes, that's the key problem.
For the core mm side conversion, the caller (guest) handles three cases:
success, failure and retry. The guest can continue on failure, but it will
keep the memory in its original attribute and trigger some call trace. On
the QEMU side, a failure of kvm_set_memory_attributes() would stop the VM.
As for the VFIO conversion, at present we allow it to fail and don't return
an error code to the guest as long as we undo the conversion. It only
causes the device to not work in the guest.
I think if we view an attribute mismatch between core mm and the IOMMU as
a fatal error, we can stop the VM or let the guest retry until the
conversion succeeds.
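A minimal standalone sketch of that undo-on-failure contract (invented names; the real code is the rollback path in memory_attribute_notify_populate()): if mapping any block fails, the blocks already mapped are unmapped again, so the tracked state and the mappings never diverge.

#include <stdbool.h>
#include <stdio.h>

#define NBLOCKS 4

static bool mapped[NBLOCKS];    /* stands in for per-block VFIO mappings */

static int map_block(unsigned i)
{
    if (i == 2) {                /* pretend VFIO rejects the third block */
        return -1;
    }
    mapped[i] = true;
    return 0;
}

/* Map a range; on failure, unmap what was already mapped and report the
 * error. The device stops working, but the state stays consistent. */
static int populate_range(unsigned first, unsigned nblocks)
{
    for (unsigned i = first; i < first + nblocks; i++) {
        if (map_block(i)) {
            while (i-- > first) {
                mapped[i] = false;    /* roll back earlier blocks */
            }
            return -1;
        }
    }
    return 0;
}

int main(void)
{
    if (populate_range(0, NBLOCKS)) {
        fprintf(stderr, "populate failed and was rolled back\n");
    }
    return 0;
}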
>
> Both options would be compatible with "the caller must retry either way
> until it works".
>
> a) is probably easiest. Something like the following on top of your code:
LGTM. With option a), we need to return the retry status to the guest; the
caller (guest) must handle the retry. Not doing so with a "partially
completed populate" would cause problems.
>
>
>
> From 40c001a57c3c1350f3a44288ccb013d903d300cf Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 25 Feb 2025 09:55:38 +0100
> Subject: [PATCH] tmp
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> system/memory-attribute-manager.c | 66 +++++++++++++++++--------------
> 1 file changed, 36 insertions(+), 30 deletions(-)
>
> diff --git a/system/memory-attribute-manager.c b/system/memory-attribute-manager.c
> index 17c70cf677..31960e4a54 100644
> --- a/system/memory-attribute-manager.c
> +++ b/system/memory-attribute-manager.c
> @@ -274,9 +274,7 @@ static void memory_attribute_notify_discard(MemoryAttributeManager *mgr,
>          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>              continue;
>          }
> -
> -        memory_attribute_for_each_populated_section(mgr, &tmp, rdl,
> -                                                    memory_attribute_notify_discard_cb);
> +        rdl->notify_discard(rdl, &tmp);
>      }
>  }
>  
> @@ -292,9 +290,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
>          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>              continue;
>          }
> -
> -        ret = memory_attribute_for_each_discarded_section(mgr, &tmp, rdl,
> -                                                          memory_attribute_notify_populate_cb);
> +        ret = rdl->notify_populate(rdl, &tmp);
>          if (ret) {
>              break;
>          }
> @@ -311,9 +307,7 @@ static int memory_attribute_notify_populate(MemoryAttributeManager *mgr,
>              if (!memory_region_section_intersect_range(&tmp, offset, size)) {
>                  continue;
>              }
> -
> -            memory_attribute_for_each_discarded_section(mgr, &tmp, rdl2,
> -                                                        memory_attribute_notify_discard_cb);
> +            rdl2->notify_discard(rdl2, &tmp);
>          }
>      }
>      return ret;
> @@ -348,7 +342,10 @@ static bool memory_attribute_is_range_discarded(MemoryAttributeManager *mgr,
>  static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
>                                           uint64_t size, bool shared_to_private)
>  {
> -    int block_size = memory_attribute_manager_get_block_size(mgr);
> +    const int block_size = memory_attribute_manager_get_block_size(mgr);
> +    const unsigned long first_bit = offset / block_size;
> +    const unsigned long nbits = size / block_size;
> +    uint64_t cur_offset;
>      int ret = 0;
>  
>      if (!memory_attribute_is_valid_range(mgr, offset, size)) {
> @@ -357,32 +354,41 @@ static int memory_attribute_state_change(MemoryAttributeManager *mgr, uint64_t offset,
>          return -1;
>      }
>  
> -    if ((shared_to_private && memory_attribute_is_range_discarded(mgr, offset, size)) ||
> -        (!shared_to_private && memory_attribute_is_range_populated(mgr, offset, size))) {
> -        return 0;
> -    }
> -
>      if (shared_to_private) {
> -        memory_attribute_notify_discard(mgr, offset, size);
> -    } else {
> -        ret = memory_attribute_notify_populate(mgr, offset, size);
> -    }
> -
> -    if (!ret) {
> -        unsigned long first_bit = offset / block_size;
> -        unsigned long nbits = size / block_size;
> -
> -        g_assert((first_bit + nbits) <= mgr->bitmap_size);
> -
> -        if (shared_to_private) {
> +        if (memory_attribute_is_range_discarded(mgr, offset, size)) {
> +            /* Already private. */
> +        } else if (!memory_attribute_is_range_populated(mgr, offset, size)) {
> +            /* Unexpected mixture: not completely shared. */
> +            for (cur_offset = offset; cur_offset < offset + size;
> +                 cur_offset += block_size) {
> +                memory_attribute_state_change(mgr, cur_offset, block_size,
> +                                              true);
> +            }
> +        } else {
> +            /* Completely shared. */
>              bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
> +            memory_attribute_notify_discard(mgr, offset, size);
> +        }
> +    } else {
> +        if (memory_attribute_is_range_populated(mgr, offset, size)) {
> +            /* Already shared. */
> +        } else if (!memory_attribute_is_range_discarded(mgr, offset, size)) {
> +            /* Unexpected mixture: not completely private. */
> +            for (cur_offset = offset; cur_offset < offset + size;
> +                 cur_offset += block_size) {
> +                ret = memory_attribute_state_change(mgr, cur_offset,
> +                                                    block_size, false);
> +                if (ret) {
> +                    break;
> +                }
> +            }
>          } else {
> +            /* Completely private. */
>              bitmap_set(mgr->shared_bitmap, first_bit, nbits);
> +            ret = memory_attribute_notify_populate(mgr, offset, size);
> +            if (ret) {
> +                bitmap_clear(mgr->shared_bitmap, first_bit, nbits);
> +            }
>          }
> -
> -        return 0;
>      }
> -
>      return ret;
>  }
>
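For completeness, a hypothetical caller-side wrapper over the function from the patch (not part of the patch itself), assuming option a): since a failed populate leaves the bitmap and the notifiers consistent, reissuing the identical request until it succeeds is safe.

/* Hypothetical wrapper, assuming option a): on error, the partially
 * completed populate is already reflected in the bitmap and notifiers,
 * so the very same request can simply be retried. */
static void convert_until_done(MemoryAttributeManager *mgr, uint64_t offset,
                               uint64_t size, bool shared_to_private)
{
    while (memory_attribute_state_change(mgr, offset, size,
                                         shared_to_private)) {
        /* retry; alternatively, stop the VM after too many attempts */
    }
}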