qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v3] vfio/common: Check iova with limit not with


From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH v3] vfio/common: Check iova with limit not with size
Date: Tue, 26 Jan 2016 10:00:30 -0700

On Tue, 2016-01-26 at 15:51 +0100, Pierre Morel wrote:
> 
> On 01/22/2016 11:19 PM, Alex Williamson wrote:
> > On Fri, 2016-01-22 at 15:14 -0700, Alex Williamson wrote:
> > > On Thu, 2016-01-21 at 14:15 +0100, Pierre Morel wrote:
> > > > On 01/20/2016 04:46 PM, Alex Williamson wrote:
> > > > > On Wed, 2016-01-20 at 16:14 +0100, Pierre Morel wrote:
> > > > > > On 01/12/2016 07:16 PM, Alex Williamson wrote:
> > > > > > > On Tue, 2016-01-12 at 16:11 +0100, Pierre Morel wrote:
> > > > > > > > In vfio_listener_region_add(), we try to validate that the 
> > > > > > > > region
> > > > > > > > is
> > > > > > > > not
> > > > > > > > zero sized and hasn't overflowed the addresses space.
> > > > > > > > 
> > > > > > > > But the calculation uses the size of the region instead of
> > > > > > > > using the region's limit (size - 1).
> > > > > > > > 
> > > > > > > > This leads to Int128 overflow when the region has
> > > > > > > > been initialized to UINT64_MAX because in this case
> > > > > > > > memory_region_init() transform the size from UINT64_MAX
> > > > > > > > to int128_2_64().
> > > > > > > > 
> > > > > > > > Let's really use the limit by sustracting one to the size
> > > > > > > > and take care to use the limit for functions using limit
> > > > > > > > and size to call functions which need size.
> > > > > > > > 
> > > > > > > > Signed-off-by: Pierre Morel <address@hidden>
> > > > > > > > ---
> > > > > > > > 
> > > > > > > > Changes from v2:
> > > > > > > >        - all, just ignore v2, sorry about this,
> > > > > > > >          this is build after v1
> > > > > > > > 
> > > > > > > > Changes from v1:
> > > > > > > >        - adjust the tests by knowing we already substracted one 
> > > > > > > > to
> > > > > > > > end.
> > > > > > > > 
> > > > > > > >     hw/vfio/common.c |   14 +++++++-------
> > > > > > > >     1 files changed, 7 insertions(+), 7 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > > > > index 6797208..a5f6643 100644
> > > > > > > > --- a/hw/vfio/common.c
> > > > > > > > +++ b/hw/vfio/common.c
> > > > > > > > @@ -348,12 +348,12 @@ static void
> > > > > > > > vfio_listener_region_add(MemoryListener *listener,
> > > > > > > >         if (int128_ge(int128_make64(iova), llend)) {
> > > > > > > >             return;
> > > > > > > >         }
> > > > > > > > -    end = int128_get64(llend);
> > > > > > > > +    end = int128_get64(int128_sub(llend, int128_one()));
> > > > > > > >     
> > > > > > > > -    if ((iova < container->min_iova) || ((end - 1) > container-
> > > > > > > > > max_iova)) {
> > > > > > > > +    if ((iova < container->min_iova) || (end  > container-
> > > > > > > > > max_iova)) {
> > > > > > > >             error_report("vfio: IOMMU container %p can't map 
> > > > > > > > guest
> > > > > > > > IOVA
> > > > > > > > region"
> > > > > > > >                          " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
> > > > > > > > -                     container, iova, end - 1);
> > > > > > > > +                     container, iova, end);
> > > > > > > >             ret = -EFAULT;
> > > > > > > >             goto fail;
> > > > > > > >         }
> > > > > > > > @@ -363,7 +363,7 @@ static void
> > > > > > > > vfio_listener_region_add(MemoryListener *listener,
> > > > > > > >         if (memory_region_is_iommu(section->mr)) {
> > > > > > > >             VFIOGuestIOMMU *giommu;
> > > > > > > >     
> > > > > > > > -        trace_vfio_listener_region_add_iommu(iova, end - 1);
> > > > > > > > +        trace_vfio_listener_region_add_iommu(iova, end);
> > > > > > > >             /*
> > > > > > > >              * FIXME: We should do some checking to see if the
> > > > > > > >              * capabilities of the host VFIO IOMMU are adequate 
> > > > > > > > to
> > > > > > > > model
> > > > > > > > @@ -394,13 +394,13 @@ static void
> > > > > > > > vfio_listener_region_add(MemoryListener *listener,
> > > > > > > >                 section->offset_within_region +
> > > > > > > >                 (iova - section->offset_within_address_space);
> > > > > > > >     
> > > > > > > > -    trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
> > > > > > > > +    trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > > > > >     
> > > > > > > > -    ret = vfio_dma_map(container, iova, end - iova, vaddr,
> > > > > > > > section-
> > > > > > > > > readonly);
> > > > > > > > +    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr,
> > > > > > > > section->readonly);
> > > > > > > >         if (ret) {
> > > > > > > >             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > > > > >                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > > > > -                     container, iova, end - iova, vaddr, ret);
> > > > > > > > +                     container, iova, end - iova + 1, vaddr,
> > > > > > > > ret);
> > > > > > > >             goto fail;
> > > > > > > >         }
> > > > > > > >     
> > > > > > > Hmm, did we just push the overflow from one place to another?  If
> > > > > > > we're
> > > > > > > mapping a full region of size int128_2_64() starting at iova zero,
> > > > > > > then
> > > > > > > this becomes (0xffff_ffff_ffff_ffff - 0 + 1) = 0.  So I think we
> > > > > > > need
> > > > > > > to calculate size with 128bit arithmetic too and let it assert if
> > > > > > > we
> > > > > > > overflow, ie:
> > > > > > > 
> > > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > > > > > index a5f6643..13ad90b 100644
> > > > > > > --- a/hw/vfio/common.c
> > > > > > > +++ b/hw/vfio/common.c
> > > > > > > @@ -321,7 +321,7 @@ static void
> > > > > > > vfio_listener_region_add(MemoryListener *listener,
> > > > > > >                                          MemoryRegionSection
> > > > > > > *section)
> > > > > > >     {
> > > > > > >         VFIOContainer *container = container_of(listener,
> > > > > > > VFIOContainer, listener);
> > > > > > > -    hwaddr iova, end;
> > > > > > > +    hwaddr iova, end, size;
> > > > > > >         Int128 llend;
> > > > > > >         void *vaddr;
> > > > > > >         int ret;
> > > > > > > @@ -348,7 +348,9 @@ static void
> > > > > > > vfio_listener_region_add(MemoryListener *listener,
> > > > > > >         if (int128_ge(int128_make64(iova), llend)) {
> > > > > > >             return;
> > > > > > >         }
> > > > > > > +
> > > > > > >         end = int128_get64(int128_sub(llend, int128_one()));
> > > > > > > +    size = int128_get64(int128_sub(llend, int128_make64(iova)));
> > > > > > here again, if iova is null, since llend is section->size (2^64) ...
> > > > > > 
> > > > > > >     
> > > > > > >         if ((iova < container->min_iova) || (end  > container-
> > > > > > > > max_iova)) {
> > > > > > >             error_report("vfio: IOMMU container %p can't map guest
> > > > > > > IOVA region"
> > > > > > > @@ -396,11 +398,11 @@ static void
> > > > > > > vfio_listener_region_add(MemoryListener *listener,
> > > > > > >     
> > > > > > >         trace_vfio_listener_region_add_ram(iova, end, vaddr);
> > > > > > >     
> > > > > > > -    ret = vfio_dma_map(container, iova, end - iova + 1, vaddr,
> > > > > > > section->readonly);
> > > > > > > +    ret = vfio_dma_map(container, iova, size, vaddr, section-
> > > > > > > > readonly);
> > > > > > >         if (ret) {
> > > > > > >             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> > > > > > >                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
> > > > > > > -                     container, iova, end - iova + 1, vaddr, 
> > > > > > > ret);
> > > > > > > +                     container, iova, size, vaddr, ret);
> > > > > > >             goto fail;
> > > > > > >         }
> > > > > > >     
> > > > > > > 
> > > > > > > Does that still solve your scenario?  Perhaps vfio-iommu-type1
> > > > > > > should
> > > > > > > have used first/last rather than start/size for mapping since we
> > > > > > > seem
> > > > > > > to have an off-by-one for mapping a full 64bit space.  Seems like
> > > > > > > we
> > > > > > > could do it with two calls to vfio_dma_map if we really wanted to.
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > > 
> > > > > > You are right, every try to solve this will push the overflow
> > > > > > somewhere
> > > > > > else.
> > > > > > 
> > > > > > There is just no way to express 2^64 with 64 bits, we have the
> > > > > > int128()
> > > > > > solution,
> > > > > > but if we solve it here, we fall in the linux ioctl call anyway.
> > > > > > 
> > > > > > Intuitively, making two calls do not seem right to me.
> > > > > > 
> > > > > > But, what do you think of something like:
> > > > > > 
> > > > > > - creating a new VFIO extention
> > > > > > 
> > > > > > - and in ioctl(), since we have a flag entry in the
> > > > > > vfio_iommu_type1_dma_map,
> > > > > > may be adding a new flag meaning "map all virtual memory" ?
> > > > > > or meaning "use first/last" ?
> > > > > > I think this would break existing code unless we add a new VFIO
> > > > > > extension.
> > > > > Backup, is there ever a case where we actually need to map the entire
> > > > > 64bit address space?  This is fairly well impossible on x86.  I'm
> > > > > pointing out an issue, but I don't know that we need to solve it with
> > > > > more than an assert since it's never likely to happen.  Thanks,
> > > > > 
> > > > > Alex
> > > > > 
> > > > If I understood right, IOVA is the IO virtual address,
> > > > it is then possible to map the virtual address page 
> > > > 0xffff_ffff_ffff_f000
> > > > to something reasonable inside the real memory.
> > > It is.
> > > 
> > > > Eventual we do not need to map the last virtual page but
> > > > I think that in a general case the all virtual memory, as viewed by the
> > > > device through the IOMMU should be mapped to avoid any uninitialized
> > > > virtual memory access.
> > > When using vfio, a device only has access to the IOVA space which has
> > > been explicitly mapped.  This would be a security issue otherwise since
> > > kernel vfio can't rely on userspace to wipe the device IOVA space.
> yes.
> > > > It is the same reason that make us map the all virtual memory for the
> > > > CPU MMU.
> > > We don't really do that either, CPU mapping works based on page tables
> > > and non-existent entries simply don't exist.  We don't fully populate
> > > the page tables in advance, this would be a horrible waste of memory.
> 
> 
> Alex,
> 
> I am not sure of that, when preparing DMA from the device, the
> guest will provide the destination address and these destination addresses
> will be translated by the IOMMU when the device start the DMA.
> 
> The guest can make any decision by preparing the DMA and if I have well 
> understood,
> this is transparent to QEMU.
> What is not transparent is the IOMMU translation.
> 
> Then, when the device starts the DMA the destination address can be 
> anything inside
> the virtual memory and the IOMMU will translate this.
> To be able to translate, a table entry for this virtual address must 
> exist in the IOMMU
> page table.
> 
> If you have several level of page table you may only fill the first level
> for all entries, and may be, have only one first level entry initialized 
> and the belonging
> second level entries filled.
> Which greatly reduces the size of the tables.
> 
> But if you do not fill one of the first level entry the behavior of the 
> IOMMU
> and then of the DMA is done according to what ever has been left in this
> entry.

It seems like you're arguing that the guest is going to have a 2^64 bit
address space for DMA targets.  Sure, the guest driver can program the
device to DMA anywhere in that address space, but what should the IOMMU
actually consider an valid DMA?  It has to be things that have been
mapped, like guest physical RAM or peer-to-peer DMA targets.  How would
the IOMMU handle a stray DMA for anything else?  By default the DMA
target would not be mapped and the DMA would generate a fault.  How
that fault is handled is specific to the architecture and platform.  If
you want to provide the device with a full 2^64 bit address space where
nothing will fault, you're going to need to do it via a lot of mappings
pointing to the same host physical page.  I'm really not sure where
we're going with this though.

> > > 
> > > > May be I missed something, or may be I worry too much,
> > > > but I see this as a restriction on the supported hardware
> > > > if we compare host and guest hardware support compatibility.
> > > I don't see the issue, there's arguably a bug in the API that doesn't
> > > allow us to map the full 64bit IOVA space of a device in a single
> > > mapping, but we can do it in two.  Besides, there's really no case
> > > where a device needs a fully populated IOTLB unless you're actually
> > > giving the device access to 16 EMB of memory.
> > s/EMB/EB/  Or I suppose technically EiB
> 
> yes, I agree with this, we do not need to access so much memory.
> 
> > 
> > > > We can live with it, because in fact you are right and today
> > > > I am not aware of a hardware wanting to access this page but a
> > > > hardware designers knowing having a IOMMU may want to access exactly
> > > > this kind of strange virtual page for special features and this would 
> > > > work
> > > > on the host but not inside of the guest.
> > > The API issue is not that we can't map 0xffff_ffff_ffff_f000, it's that
> > > we can't map 0x0 through 0xffff_ffff_ffff_ffff in a single mapping
> > > because we pass the size instead of the end address (where size here
> > > would be 2^64).  We can map 0x0 through 0xffff_ffff_ffff_efff, followed
> > > by 0xffff_ffff_ffff_f000 through 0xffff_ffff_ffff_ffff, but again, why
> > > would you ever need to do this?  Thanks,
> > > 
> > > Alex
> 
> The thing is that It could be useful to say we map all the virtual memory.

Why?  And again, you can do it, just not in a single mapping, which
really seems like a theoretical problem since you're mapping the IOVA
space of a device, which lives within and consumes a small portion of
this address space, so you're pretty much always looking at doing
multiple mappings.

> Having a size of 2^64 was a possibility.
> On the other hands, with the actual implementation the
> "memory_region_iommu_replay" would take on long long long time.
> 
> In fact, depending on the IOMMU capabilities and usage we do not need to 
> call
> the "memory_region_iommu_replay" at that time or even not at all.

The mapping itself would take a long time, even with GiB pages we're
talking about populating 2^34 page table entries.  Of course many of
those entries would need to point to the same physical page to cover
empty space since otherwise you'd need to run this on a processor that
supports more than 2^64 bits of address space, but you probably don't
want to waste lots of memory covering empty space, so that means a
smaller page size, which means orders of magnitude more mappings and
more space wasted in the page tables...  All of this should have some
useful value, which is escaping me.  Thanks,

Alex



reply via email to

[Prev in Thread] Current Thread [Next in Thread]