
Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges


From: Joao Martins
Subject: Re: [PATCH v2 10/20] vfio/common: Record DMA mapped IOVA ranges
Date: Thu, 2 Mar 2023 00:07:35 +0000

On 28/02/2023 20:36, Alex Williamson wrote:
> On Tue, 28 Feb 2023 12:11:06 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 23/02/2023 21:50, Alex Williamson wrote:
>>> On Thu, 23 Feb 2023 21:19:12 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 23/02/2023 21:05, Alex Williamson wrote:  
>>>>> On Thu, 23 Feb 2023 10:37:10 +0000
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>> On 22/02/2023 22:10, Alex Williamson wrote:    
>>>>>>> On Wed, 22 Feb 2023 19:49:05 +0200
>>>>>>> Avihai Horon <avihaih@nvidia.com> wrote:      
>>>>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>>>>> @@ -612,6 +665,16 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>>>>>>>          .iova = iova,
>>>>>>>>          .size = size,
>>>>>>>>      };
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    ret = vfio_record_mapping(container, iova, size, readonly);
>>>>>>>> +    if (ret) {
>>>>>>>> +        error_report("vfio: Failed to record mapping, iova: 0x%" HWADDR_PRIx
>>>>>>>> +                     ", size: 0x" RAM_ADDR_FMT ", ret: %d (%s)",
>>>>>>>> +                     iova, size, ret, strerror(-ret));
>>>>>>>> +
>>>>>>>> +        return ret;
>>>>>>>> +    }      
>>>>>>>
>>>>>>> Is there no way to replay the mappings when a migration is started?
>>>>>>> This seems like a horrible latency and bloat trade-off for the
>>>>>>> possibility that the VM might migrate and the device might support
>>>>>>> these features.  Our performance with vIOMMU is already terrible, I
>>>>>>> can't help but believe this makes it worse.  Thanks,
>>>>>>>       
>>>>>>
>>>>>> It is a nop if the vIOMMU is being used (entries in
>>>>>> container->giommu_list), as that uses a max-iova based IOVA range. So
>>>>>> this is really for iommu identity mapping and no-vIOMMU.
>>>>>
>>>>> Ok, yes, there are no mappings recorded for any containers that have a
>>>>> non-empty giommu_list.
>>>>>     
>>>>>> We could replay them if they were tracked/stored anywhere.    
>>>>>
>>>>> Rather than piggybacking on vfio_memory_listener, why not simply
>>>>> register a new MemoryListener when migration is started?  That will
>>>>> replay all the existing ranges and allow tracking to happen separate
>>>>> from mapping, and only when needed.
>>>>
>>>> The problem with that is that *starting* dirty tracking needs to have the
>>>> whole range; we aren't supposed to start each range separately. So on a
>>>> memory listener callback you don't have introspection into when you are
>>>> dealing with the last range, do you?
>>>
>>> As soon as memory_listener_register() returns, all your callbacks to
>>> build the IOVATree have been called and you can act on the result the
>>> same as if you were relying on the vfio mapping MemoryListener.  I'm
>>> not seeing the problem.  Thanks,
>>>   
>>
>> While doing these changes, the nice thing about the current patch is that
>> whatever changes apply to vfio_listener_region_add() will be reflected in the
>> mappings tree that stores what we will dirty track. If we move the mappings
>> calculation necessary for dirty tracking to only when we start, we will have
>> to duplicate the same checks, and open up for bugs where we ask things to be
>> dirty-tracked that haven't been DMA mapped. These two aren't necessarily
>> tied, but I felt I should raise the potential duplication of the checks (and
>> the same thing applies for handling virtio-mem and whatnot).
>>
>> I understand that if we were going to store *a lot* of mappings, this would
>> add up in space requirements. But for the no-vIOMMU (or iommu=pt) case this
>> is only about 12 ranges or so, so it is much simpler to piggyback on the
>> existing listener. Would you still want to move this to its own dedicated
>> memory listener?
> 
> Code duplication and bugs are good points, but while typically we're
> only seeing a few handfuls of ranges, doesn't virtio-mem in particular
> allow that we could be seeing quite a lot more?
> 
Ugh yes, it could be.

> We used to be limited to a fairly small number of KVM memory slots,
> which effectively bounded non-vIOMMU DMA mappings, but that value is
> now 2^15, so we need to anticipate that we could see many more than a
> dozen mappings.
> 

Even with 32k memory slots today we are still down to a handful of mappings.
The hv-balloon and virtio-mem approaches, though, are the ones that may stress
such a limit, IIUC, prior to starting migration.

> Can we make the same argument that the overhead is negligible if a VM
> makes use of 10s of GB of virtio-mem with 2MB block size?
> 
> But then on a 4KB host we're limited to 256 tracking entries, so
> wasting all that time and space on a runtime IOVATree is even more
> dubious.
>
> In fact, it doesn't really matter that vfio_listener_region_add and
> this potentially new listener come to the same result, as long as the
> new listener is a superset of the existing listener. 

I am trying to structure this in a way that's not too ugly while reusing as
much as possible between vfio_listener_region_add() and
vfio_migration_mapping_add().

To give you an idea, here's how it looks thus far:

https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking

Particularly this one:

https://github.com/jpemartins/qemu/commit/3b11fa0e4faa0f9c0f42689a7367284a25d1b585

vfio_get_section_iova_range() is where most of these checks live; they are
essentially a subset of the ones in vfio_listener_region_add().
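For reference, and ignoring the Int128 arithmetic and the QEMU-specific checks, the core clipping that a helper like this performs can be sketched in plain C. The names below are hypothetical and this is not the actual QEMU code, just the alignment logic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Clip [start, start + size) to a host-page-aligned [iova, end) window:
 * align the start up and the end down. Returns false when nothing remains
 * after alignment, mirroring the early bail-out in the patch.
 */
static bool section_iova_range(uint64_t start, uint64_t size, uint64_t pgsize,
                               uint64_t *iova, uint64_t *end)
{
    uint64_t lo = (start + pgsize - 1) & ~(pgsize - 1); /* align up */
    uint64_t hi = (start + size) & ~(pgsize - 1);       /* align down */

    if (lo >= hi) {
        return false; /* section smaller than one aligned page */
    }
    *iova = lo;
    *end = hi;
    return true;
}
```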

> So I think we can
> simplify out a lot of the places we'd see duplication and bugs.  I'm
> not even really sure why we wouldn't simplify things further and only
> record a single range covering the low and high memory marks for a
> non-vIOMMU VMs, or potentially an approximation removing gaps of 1GB or
> more, for example.  Thanks,

Yes, for QEMU, having one single artificial range with a computed min and max
IOVA is the simplest to implement. It would avoid maintaining an IOVATree, as
you would only track a min/max pair (maybe plus a max below 4G).

My concern with a reduced single range is 1) big holes in the address space
leading to asking for more than you need[0], and 2) device dirty tracking
limits, e.g. hardware may have upper limits on tracked ranges, so you may
prematurely exercise those. So giving the vfio drivers more choice in how to
cope with the mapped address space description looks to have a bit more
longevity.
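One way to split the difference between those two concerns, as a sketch: keep the full list of ranges, but coalesce neighbours whose gap is below a threshold whenever the count would exceed what a device can track. Plain C with hypothetical names; not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t iova;
    uint64_t end; /* inclusive, matching the IOVATree convention */
} IOVARange;

/*
 * Merge sorted, non-overlapping ranges whose gap to the previous range is
 * smaller than gap_thresh; returns the resulting number of ranges. With
 * gap_thresh = UINT64_MAX this degenerates to the single min/max range.
 */
static size_t coalesce_ranges(IOVARange *r, size_t n, uint64_t gap_thresh)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (out > 0 && r[i].iova - r[out - 1].end - 1 < gap_thresh) {
            r[out - 1].end = r[i].end; /* absorb into the previous range */
        } else {
            r[out++] = r[i];
        }
    }
    return out;
}
```

With, say, a 1 GiB threshold, nearby low-memory ranges collapse into one entry while a range relocated past the 1T boundary stays separate, so the big hole is not asked for.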

Anyway, the temptation with having a single range is that this can all go away
if vfio_listener_region_add() tracks just the min/max IOVA pair.

Below the scissors mark is how this patch looks in the commit above, while
keeping a full list of mappings. It's also stored here:

https://github.com/jpemartins/qemu/commits/vfio-dirty-tracking

I'll respond here with a patch showing what it looks like with the range
watermark approach.
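The watermark variant is simpler still; a minimal sketch of the state the listener would maintain (plain C, hypothetical names, including the optional below-4G watermark mentioned above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t min_iova;
    uint64_t max_iova;  /* inclusive */
    uint64_t max_below; /* highest inclusive end below 4G; only meaningful
                           once some range ends below 4G */
    bool valid;
} DirtyRange;

/* Grow the watermarks to cover a newly mapped [iova, iova + size) range. */
static void dirty_range_update(DirtyRange *r, uint64_t iova, uint64_t size)
{
    uint64_t end = iova + size - 1;

    if (!r->valid || iova < r->min_iova) {
        r->min_iova = iova;
    }
    if (!r->valid || end > r->max_iova) {
        r->max_iova = end;
    }
    if (end < (1ULL << 32) && (!r->valid || end > r->max_below)) {
        r->max_below = end;
    }
    r->valid = true;
}
```

The trade-off discussed above applies: a mapping past the 1T boundary stretches the single [min, max] range across the entire hole.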

        Joao

[0] The AMD 1T boundary is what comes to mind, where QEMU relocates memory
above 4G to after 1T.

------------------>8--------------------

From: Joao Martins <joao.m.martins@oracle.com>
Date: Wed, 22 Feb 2023 19:49:05 +0200
Subject: [PATCH wip 7/12] vfio/common: Record DMA mapped IOVA ranges

According to the device DMA logging uAPI, IOVA ranges to be logged by
the device must be provided all at once upon DMA logging start.

As preparation for the following patches which will add device dirty
page tracking, keep a record of all DMA mapped IOVA ranges so later they
can be used for DMA logging start.

Note that when vIOMMU is enabled DMA mapped IOVA ranges are not tracked.
This is due to the dynamic nature of vIOMMU DMA mapping/unmapping.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/common.c              | 147 +++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events          |   2 +
 include/hw/vfio/vfio-common.h |   4 +
 3 files changed, 150 insertions(+), 3 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 655e8dbb74d4..17971e6dbaeb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -44,6 +44,7 @@
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
 #include "sysemu/tpm.h"
+#include "qemu/iova-tree.h"

 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -426,6 +427,11 @@ void vfio_unblock_multiple_devices_migration(void)
     multiple_devices_migration_blocker = NULL;
 }

+static bool vfio_have_giommu(VFIOContainer *container)
+{
+    return !QLIST_EMPTY(&container->giommu_list);
+}
+
 static void vfio_set_migration_error(int err)
 {
     MigrationState *ms = migrate_get_current();
@@ -610,6 +616,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .iova = iova,
         .size = size,
     };
+    int ret;

     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
@@ -626,8 +633,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }

+    ret = -errno;
     error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
-    return -errno;
+
+    return ret;
 }

 static void vfio_host_win_add(VFIOContainer *container,
@@ -1326,11 +1335,127 @@ static int vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
     return ret;
 }

+static bool vfio_get_section_iova_range(VFIOContainer *container,
+                                        MemoryRegionSection *section,
+                                        hwaddr *out_iova, hwaddr *out_end)
+{
+    Int128 llend, llsize;
+    hwaddr iova, end;
+
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+
+    if (int128_ge(int128_make64(iova), llend)) {
+        return false;
+    }
+    end = int128_get64(int128_sub(llend, int128_one()));
+
+    if (memory_region_is_iommu(section->mr) ||
+        memory_region_has_ram_discard_manager(section->mr)) {
+        return false;
+    }
+
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        VFIOHostDMAWindow *hostwin;
+        hwaddr pgmask;
+
+        hostwin = vfio_find_hostwin(container, iova, end);
+        if (!hostwin) {
+            return false;
+        }
+
+        pgmask = (1ULL << ctz64(hostwin->iova_pgsizes)) - 1;
+        if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
+            return false;
+        }
+    }
+
+    *out_iova = iova;
+    *out_end = int128_get64(llend);
+    return true;
+}
+
+static void vfio_migration_add_mapping(MemoryListener *listener,
+                                       MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            mappings_listener);
+    hwaddr end = 0;
+    DMAMap map;
+    int ret;
+
+    if (vfio_have_giommu(container)) {
+        vfio_set_migration_error(-EOPNOTSUPP);
+        return;
+    }
+
+    if (!vfio_listener_valid_section(section) ||
+        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
+        return;
+    }
+
+    map.size = end - map.iova - 1; /* IOVATree is end-inclusive */
+    map.perm = section->readonly ? IOMMU_RO : IOMMU_RW;
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        ret = iova_tree_insert(container->mappings, &map);
+        if (ret) {
+            if (ret == IOVA_ERR_INVALID) {
+                ret = -EINVAL;
+            } else if (ret == IOVA_ERR_OVERLAP) {
+                ret = -EEXIST;
+            }
+        }
+    }
+
+    trace_vfio_migration_mapping_add(map.iova, map.iova + map.size, ret);
+
+    if (ret) {
+        vfio_set_migration_error(ret);
+    }
+}
+
+static void vfio_migration_remove_mapping(MemoryListener *listener,
+                                          MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            mappings_listener);
+    hwaddr end = 0;
+    DMAMap map;
+
+    if (vfio_have_giommu(container)) {
+        vfio_set_migration_error(-EOPNOTSUPP);
+        return;
+    }
+
+    if (!vfio_listener_valid_section(section) ||
+        !vfio_get_section_iova_range(container, section, &map.iova, &end)) {
+        return;
+    }
+
+    map.size = end - map.iova - 1; /* IOVATree is end-inclusive */
+
+    WITH_QEMU_LOCK_GUARD(&container->mappings_mutex) {
+        iova_tree_remove(container->mappings, map);
+    }
+
+    trace_vfio_migration_mapping_del(map.iova, map.iova + map.size);
+}
+
+static const MemoryListener vfio_dirty_tracking_listener = {
+    .name = "vfio-migration",
+    .region_add = vfio_migration_add_mapping,
+    .region_del = vfio_migration_remove_mapping,
+};
+
 static void vfio_listener_log_global_start(MemoryListener *listener)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
     int ret;

+    memory_listener_register(&container->mappings_listener,
+                             container->space->as);
+
     ret = vfio_set_dirty_page_tracking(container, true);
     if (ret) {
         vfio_set_migration_error(ret);
@@ -1346,6 +1471,8 @@ static void vfio_listener_log_global_stop(MemoryListener *listener)
     if (ret) {
         vfio_set_migration_error(ret);
     }
+
+    memory_listener_unregister(&container->mappings_listener);
 }

 static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
@@ -2172,16 +2299,24 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
     QLIST_INIT(&container->vrdl_list);
+    container->mappings = iova_tree_new();
+    if (!container->mappings) {
+        error_setg(errp, "Cannot allocate DMA mappings tree");
+        ret = -ENOMEM;
+        goto free_container_exit;
+    }
+    qemu_mutex_init(&container->mappings_mutex);
+    container->mappings_listener = vfio_dirty_tracking_listener;

     ret = vfio_init_container(container, group->fd, errp);
     if (ret) {
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }

     ret = vfio_ram_block_discard_disable(container, true);
     if (ret) {
         error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
-        goto free_container_exit;
+        goto destroy_mappings_exit;
     }

     switch (container->iommu_type) {
@@ -2317,6 +2452,10 @@ listener_release_exit:
 enable_discards_exit:
     vfio_ram_block_discard_disable(container, false);

+destroy_mappings_exit:
+    qemu_mutex_destroy(&container->mappings_mutex);
+    iova_tree_destroy(container->mappings);
+
 free_container_exit:
     g_free(container);

@@ -2371,6 +2510,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
         }

         trace_vfio_disconnect_container(container->fd);
+        qemu_mutex_destroy(&container->mappings_mutex);
+        iova_tree_destroy(container->mappings);
         close(container->fd);
         g_free(container);

diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 669d9fe07cd9..c92eaadcc7c4 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -104,6 +104,8 @@ vfio_known_safe_misalignment(const char *name, uint64_t iova, uint64_t offset_wi
 vfio_listener_region_add_no_dma_map(const char *name, uint64_t iova, uint64_t size, uint64_t page_size) "Region \"%s\" 0x%"PRIx64" size=0x%"PRIx64" is not aligned to 0x%"PRIx64" and cannot be mapped for DMA"
 vfio_listener_region_del_skip(uint64_t start, uint64_t end) "SKIPPING region_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_listener_region_del(uint64_t start, uint64_t end) "region_del 0x%"PRIx64" - 0x%"PRIx64
+vfio_migration_mapping_add(uint64_t start, uint64_t end, int err) "mapping_add 0x%"PRIx64" - 0x%"PRIx64" err=%d"
+vfio_migration_mapping_del(uint64_t start, uint64_t end) "mapping_del 0x%"PRIx64" - 0x%"PRIx64
 vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 87524c64a443..48951da11ab4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -23,6 +23,7 @@

 #include "exec/memory.h"
 #include "qemu/queue.h"
+#include "qemu/iova-tree.h"
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
@@ -81,6 +82,7 @@ typedef struct VFIOContainer {
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
     MemoryListener prereg_listener;
+    MemoryListener mappings_listener;
     unsigned iommu_type;
     Error *error;
     bool initialized;
@@ -89,6 +91,8 @@ typedef struct VFIOContainer {
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
     unsigned int dma_max_mappings;
+    IOVATree *mappings;
+    QemuMutex mappings_mutex;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
     QLIST_HEAD(, VFIOGroup) group_list;
--
2.17.2


