From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
Date: Sun, 23 Oct 2016 20:32:16 -0600

On Fri, 21 Oct 2016 01:47:25 +0530
Kirti Wankhede <address@hidden> wrote:

> Alex,
> 
> Addressing your comments other than invalidation part.
> 
> On 10/20/2016 2:32 AM, Alex Williamson wrote:
> > On Tue, 18 Oct 2016 02:52:04 +0530
> > Kirti Wankhede <address@hidden> wrote:
> >   
> ...
> >> Tested by assigning below combinations of devices to a single VM:
> >> - GPU pass through only
> >> - vGPU device only
> >> - One GPU pass through and one vGPU device
> >> - Linux VM hot plug and unplug vGPU device while GPU pass through device
> >>   exist
> >> - Linux VM hot plug and unplug GPU pass through device while vGPU device
> >>   exist  
> > 
> > Were you able to do these with the locked memory limit of the user set
> > to the minimum required for existing GPU assignment?
> >   
> 
> No.  Is there a way to set the memory limit through libvirt so that it
> would set the limit to the system memory assigned to the VM?

Not that I know of, but I also don't know how you're making use of an
mdev device through libvirt yet, since libvirt doesn't support the
vfio-pci sysfsdev option.  I would recommend testing with QEMU manually.
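For instance, something along these lines (just a sketch; the mdev UUID
and the sizes are made up and would need to match your setup):

  # Cap locked memory at roughly the VM size, eg. for a 4G guest
  # (ulimit -l takes KiB):
  ulimit -l 4194304
  qemu-system-x86_64 -m 4G -enable-kvm ... \
      -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/<uuid>

That exercises the same locked memory limit that would apply to
existing GPU assignment.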

> ...
> >> +  container = group->container;
> >> +  if (IS_ERR(container)) {  
> > 
> > I don't see that we ever use an ERR_PTR to set group->container, it
> > should either be NULL or valid and the fact that we added ourselves to
> > container_users should mean that it's valid.  The paranoia test here
> > would be if container is NULL, but IS_ERR() doesn't check NULL.  If we
> > need that paranoia test, maybe we should just:
> > 
> > if (WARN_ON(!container)) {
> > 
> > I'm not fully convinced it's needed though.
> >   
> 
> Ok removing this check.
> 
> >> +          ret = PTR_ERR(container);
> >> +          goto err_pin_pages;
> >> +  }
> >> +
> >> +  down_read(&container->group_lock);
> >> +
> >> +  driver = container->iommu_driver;
> >> +  if (likely(driver && driver->ops->pin_pages))
> >> +          ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> >> +                                       npage, prot, phys_pfn);  
> > 
> > The caller is going to need to provide some means for us to callback to
> > invalidate pinned pages.
> > 
> > ret has already been used, so it's zero at this point.  I expect the
> > original intention was to let the initialization above fall through
> > here so that the caller gets an errno if the driver doesn't support
> > pin_pages.  Returning zero without actually doing anything seems like
> > an unexpected return value.
> >   
> 
> Yes, changing it to:
> 
> driver = container->iommu_driver;
> if (likely(driver && driver->ops->pin_pages))
>         ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>                                      npage, prot, phys_pfn);
> else
>         ret = -EINVAL;
> 
> 
> 
> 
> >> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn)
> >> +{
> >> +  struct vfio_pfn *p;
> >> +  struct vfio_domain *domain = iommu->local_domain;
> >> +  int ret = 1;
> >> +
> >> +  if (!domain)
> >> +          return 1;
> >> +
> >> +  mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> +  p = vfio_find_pfn(domain, pfn);
> >> +  if (p)
> >> +          ret = 0;
> >> +
> >> +  mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +  return ret;
> >> +}  
> > 
> > So if the vfio_pfn for a given pfn exists, return 0, else return 1.
> > But do we know that the vfio_pfn exists at the point where we actually
> > do that accounting?
> >  
> 
> Only the functions below call vfio_pfn_account():
> __vfio_pin_pages_remote() -> vfio_pfn_account()
> __vfio_unpin_pages_remote() -> vfio_pfn_account()
> 
> Consider the case where an mdev device is already assigned to a VM, an
> app running in the VM has pinned some pages, and then a pass through
> device is hot plugged.  __vfio_pin_pages_remote() is then called from
> vfio_iommu_replay() to pin all pages when the iommu capable domain is
> attached to the container.  If a vfio_pfn already exists at that point,
> it means the page was pinned through the local_domain while no iommu
> capable domain was present, so accounting was already done for that
> page.  Hence we return 0 here, meaning: don't count this page again.

Right, I see that's the intention.  I can't pick any holes in the
concept, but I'll keep looking for bugs.
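To spell out the case I'll be checking against: suppose the vendor
driver pins pfn X through the local_domain, which bumps locked_vm by
one; later a pass through device is hot-added and vfio_iommu_replay()
re-pins X via __vfio_pin_pages_remote().  vfio_find_pfn() then
succeeds, so vfio_pfn_account() returns 0 and locked_vm is left alone,
ie. X is only counted once.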

> >> +
> >>  struct vwork {
> >>    struct mm_struct        *mm;
> >>    long                    npage;
> >> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
> >>    kfree(vwork);
> >>  }
> >>  
> >> -static void vfio_lock_acct(long npage)
> >> +static void vfio_lock_acct(struct task_struct *task, long npage)
> >>  {
> >>    struct vwork *vwork;
> >>    struct mm_struct *mm;
> >>  
> >> -  if (!current->mm || !npage)
> >> +  if (!task->mm || !npage)
> >>            return; /* process exited or nothing to do */
> >>  
> >> -  if (down_write_trylock(&current->mm->mmap_sem)) {
> >> -          current->mm->locked_vm += npage;
> >> -          up_write(&current->mm->mmap_sem);
> >> +  if (down_write_trylock(&task->mm->mmap_sem)) {
> >> +          task->mm->locked_vm += npage;
> >> +          up_write(&task->mm->mmap_sem);
> >>            return;
> >>    }
> >>  
> >> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage)
> >>    vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> >>    if (!vwork)
> >>            return;
> >> -  mm = get_task_mm(current);
> >> +  mm = get_task_mm(task);
> >>    if (!mm) {
> >>            kfree(vwork);
> >>            return;
> >> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot)
> >>    return 0;
> >>  }  
> > 
> > This coversion of vfio_lock_acct() to pass a task_struct and updating
> > existing callers to pass current would be a great separate, easily
> > review-able patch.
> >  
> 
> Ok. I'll split this in separate commit.
> 
> 
> >>  
> >> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >> +                   int prot, unsigned long *pfn)
> >>  {
> >>    struct page *page[1];
> >>    struct vm_area_struct *vma;
> >> +  struct mm_struct *local_mm = (mm ? mm : current->mm);
> >>    int ret = -EFAULT;
> >>  
> >> -  if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> >> +  if (mm) {
> >> +          down_read(&local_mm->mmap_sem);
> >> +          ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> >> +                                  !!(prot & IOMMU_WRITE), 0, page, NULL);
> >> +          up_read(&local_mm->mmap_sem);
> >> +  } else
> >> +          ret = get_user_pages_fast(vaddr, 1,
> >> +                                    !!(prot & IOMMU_WRITE), page);
> >> +
> >> +  if (ret == 1) {
> >>            *pfn = page_to_pfn(page[0]);
> >>            return 0;
> >>    }
> >>  
> >> -  down_read(&current->mm->mmap_sem);
> >> +  down_read(&local_mm->mmap_sem);
> >>  
> >> -  vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> >> +  vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
> >>  
> >>    if (vma && vma->vm_flags & VM_PFNMAP) {
> >>            *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> >> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >>                    ret = 0;
> >>    }
> >>  
> >> -  up_read(&current->mm->mmap_sem);
> >> +  up_read(&local_mm->mmap_sem);
> >>  
> >>    return ret;
> >>  }  
> > 
> > This would also be a great separate patch.  
> 
> Ok.
> 
> >  Have you considered
> > renaming the mm_struct function arg to "remote_mm" and making the local
> > variable simply "mm"?  It seems like it would tie nicely with the
> > remote_mm path using get_user_pages_remote() while passing NULL for
> > remote_mm uses current->mm and the existing path (and avoid the general
> > oddness of passing local_mm to a "remote" function).
> >   
> 
> Yes, your suggestion looks good. Updating.
> 
> 
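Roughly what I had in mind (untested sketch, reusing the calls from
your patch):

static int vaddr_get_pfn(struct mm_struct *remote_mm, unsigned long vaddr,
			 int prot, unsigned long *pfn)
{
	struct page *page[1];
	struct mm_struct *mm = remote_mm ? remote_mm : current->mm;
	int ret = -EFAULT;

	if (remote_mm) {
		down_read(&mm->mmap_sem);
		ret = get_user_pages_remote(NULL, mm, vaddr, 1,
					    !!(prot & IOMMU_WRITE), 0,
					    page, NULL);
		up_read(&mm->mmap_sem);
	} else
		ret = get_user_pages_fast(vaddr, 1,
					  !!(prot & IOMMU_WRITE), page);
	...
}

The rest (the VM_PFNMAP fallback) would just use "mm" as before.
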
> >> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> >>   * first page and all consecutive pages with the same locking.
> >>   */
> >> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> >> -                     int prot, unsigned long *pfn_base)
> >> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu,
> >> +                              unsigned long vaddr, long npage,
> >> +                              int prot, unsigned long *pfn_base)  
> 
> ...
> 
> 
> >> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
> >>                    break;
> >>            }
> >>  
> >> +          lock_acct += vfio_pfn_account(iommu, pfn);
> >> +  
> > 
> > I take it that this is the new technique for keeping the accounting
> > accurate, we only increment the locked accounting by the amount not
> > already pinned in a vfio_pfn.
> >  
> 
> That's correct.
> 
> 
> >>            if (!rsvd && !lock_cap &&
> >> -              current->mm->locked_vm + i + 1 > limit) {
> >> +              current->mm->locked_vm + lock_acct > limit) {
> >>                    put_pfn(pfn, prot);
> >>                    pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> >>                            __func__, limit << PAGE_SHIFT);
> >> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
> >>    }
> >>  
> >>    if (!rsvd)
> >> -          vfio_lock_acct(i);
> >> +          vfio_lock_acct(current, lock_acct);
> >>  
> >>    return i;
> >>  }
> >>  
> >> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> >> -                       int prot, bool do_accounting)
> >> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
> >> +                                unsigned long pfn, long npage, int prot,
> >> +                                bool do_accounting)  
> > 
> > Have you noticed that it's kind of confusing that
> > __vfio_{un}pin_pages_remote() uses current, which does a
> > get_user_pages_fast() while "local" uses a provided task_struct and
> > uses get_user_pages_*remote*()?  And also what was effectively local
> > (ie. we're pinning for our own use here) is now "remote" and pinning
> > for a remote, vendor driver consumer, is now "local".  It's not very
> > intuitive.
> >   
> 
> 'local' in local_domain was suggested to describe the domain used for
> local page tracking.  Earlier suggestions to use 'mdev' or 'noiommu' in
> this name were discarded.  Maybe we should revisit what the name should
> be.  Any suggestion?
> 
> For the local_domain, the flow to pin pages is:
> 
> for local_domain
>     |- vfio_pin_pages()
>         |- vfio_iommu_type1_pin_pages()
>             |- __vfio_pin_page_local()
>                 |-  vaddr_get_pfn(task->mm)
>                     |- get_user_pages_remote()
> 
> __vfio_pin_page_local() --> get_user_pages_remote()


In vfio.c we have the concept of an external user; perhaps that could
be continued here.  Pinning by an mdev vendor driver would then be an
external, or remote, pinning.
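ie. a naming sketch only, nothing more than a suggestion:

  __vfio_pin_pages_remote()  ->  __vfio_pin_pages()           (current/user)
  __vfio_pin_page_local()    ->  __vfio_pin_page_external()   (mdev/vendor)

That would also line up with vfio_group_get_external_user() and friends
that already exist in vfio.c.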

> >>  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>                                     struct iommu_group *iommu_group)
> >>  {
> >>    struct vfio_iommu *iommu = iommu_data;
> >> -  struct vfio_group *group, *g;
> >> +  struct vfio_group *group;
> >>    struct vfio_domain *domain, *d;
> >>    struct bus_type *bus = NULL;
> >>    int ret;
> >> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>    mutex_lock(&iommu->lock);
> >>  
> >>    list_for_each_entry(d, &iommu->domain_list, next) {
> >> -          list_for_each_entry(g, &d->group_list, next) {
> >> -                  if (g->iommu_group != iommu_group)
> >> -                          continue;
> >> +          if (find_iommu_group(d, iommu_group)) {
> >> +                  mutex_unlock(&iommu->lock);
> >> +                  return -EINVAL;
> >> +          }
> >> +  }  
> > 
> > The find_iommu_group() conversion would also be an easy separate patch.
> >   
> 
> Ok.
> 
> >>  
> >> +  if (iommu->local_domain) {
> >> +          if (find_iommu_group(iommu->local_domain, iommu_group)) {
> >>                    mutex_unlock(&iommu->lock);
> >>                    return -EINVAL;
> >>            }
> >> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>    if (ret)
> >>            goto out_free;
> >>  
> >> +  if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> >> +      (bus == &mdev_bus_type)) {
> >> +          if (!iommu->local_domain) {
> >> +                  domain->local_addr_space =
> >> +                          kzalloc(sizeof(*domain->local_addr_space),
> >> +                                          GFP_KERNEL);
> >> +                  if (!domain->local_addr_space) {
> >> +                          ret = -ENOMEM;
> >> +                          goto out_free;
> >> +                  }
> >> +
> >> +                  domain->local_addr_space->task = current;
> >> +                  INIT_LIST_HEAD(&domain->group_list);
> >> +                  domain->local_addr_space->pfn_list = RB_ROOT;
> >> +                  mutex_init(&domain->local_addr_space->pfn_list_lock);
> >> +                  iommu->local_domain = domain;
> >> +          } else
> >> +                  kfree(domain);
> >> +
> >> +          list_add(&group->next, &domain->group_list);  
> > 
> > I think you mean s/domain/iommu->local_domain/ here, we just freed
> > domain in the else path.
> >   
> 
> Yes, corrected.
> 
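(For the archive, the one-liner I was pointing at:

	list_add(&group->next, &iommu->local_domain->group_list);

so that the group lands on the local_domain's list even when this
attach reuses the existing local_domain and frees the newly allocated
"domain".)
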
> >> +          mutex_unlock(&iommu->lock);
> >> +          return 0;
> >> +  }
> >> +
> >>    domain->domain = iommu_domain_alloc(bus);
> >>    if (!domain->domain) {
> >>            ret = -EIO;
> >> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> >>            vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> >>  }
> >>  
> >> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
> >> +{
> >> +  struct vfio_domain *domain = iommu->local_domain;
> >> +  struct vfio_dma *dma, *tdma;
> >> +  struct rb_node *n;
> >> +  long locked = 0;
> >> +
> >> +  rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
> >> +                                       node) {
> >> +          vfio_unmap_unpin(iommu, dma);
> >> +  }
> >> +
> >> +  mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> +  n = rb_first(&domain->local_addr_space->pfn_list);
> >> +
> >> +  for (; n; n = rb_next(n))
> >> +          locked++;
> >> +
> >> +  vfio_lock_acct(domain->local_addr_space->task, locked);
> >> +  mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +}  
> > 
> > Couldn't a properly timed mlock by the user allow them to lock more
> > memory than they're allowed here?  For instance imagine the vendor
> > driver has pinned the entire VM memory and the user has exactly the
> > locked memory limit for that VM.  During the gap here between unpinning
> > the entire vfio_dma list and re-accounting for the pfn_list, the user
> > can mlock up to their limit again and now they've doubled the locked
> > memory they're allowed.
> >   
> 
> As per the original code, vfio_unmap_unpin() calls
> __vfio_unpin_pages_remote(.., false), ie. with do_accounting set to
> false; why is that so?

Because vfio_dma tracks the user granularity of calling MAP_DMA, not
the granularity with which the iommu mapping was actually done.  There
might be multiple non-contiguous chunks to make that mapping and we
don't know how the iommu chose to map a given chunk to support large
page sizes.  If we chose to do accounting on the iommu_unmap()
granularity, we might account for every 4k page separately.  We choose
not to do accounting there so that we can batch the accounting into one
update per range.
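From memory, the pre-patch vfio_unmap_unpin() flow is roughly (sketch,
eliding the iommu_unmap() details):

	long unlocked = 0;

	while (iova < end) {
		...
		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
					     unmapped >> PAGE_SHIFT,
					     dma->prot, false);
		iova += unmapped;
	}

	vfio_lock_acct(-unlocked);

ie. one accounting update per vfio_dma range rather than one per
iommu_unmap() chunk.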

> If accounting were done there (ie. do_accounting set to true), then we
> wouldn't have to do the re-accounting here.

If vfio_unmap_unpin() did not do accounting, you could update
accounting once with the difference between what was pinned and what
remains pinned via the mdev and avoid the gap caused by de-accounting
everything and then re-accounting only for the mdev pinnings.
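Sketch of what I mean (hypothetical variable names, not in the patch):

	/* mdev_pinned: entries remaining in the pfn_list;
	 * dma_pinned: pages previously accounted via the dma_list */
	vfio_lock_acct(domain->local_addr_space->task,
		       mdev_pinned - dma_pinned);

A single negative delta, so there's no window where the user's
locked_vm under-counts what's actually pinned.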

> >> +
> >> +static void vfio_local_unpin_all(struct vfio_domain *domain)
> >> +{
> >> +  struct rb_node *node;
> >> +
> >> +  mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +  while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> >> +          vfio_unpin_pfn(domain,
> >> +                          rb_entry(node, struct vfio_pfn, node), false);
> >> +
> >> +  mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +}
> >> +
> >>  static void vfio_iommu_type1_detach_group(void *iommu_data,
> >>                                      struct iommu_group *iommu_group)
> >>  {
> >> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >>  
> >>    mutex_lock(&iommu->lock);
> >>  
> >> -  list_for_each_entry(domain, &iommu->domain_list, next) {
> >> -          list_for_each_entry(group, &domain->group_list, next) {
> >> -                  if (group->iommu_group != iommu_group)
> >> -                          continue;
> >> -
> >> -                  iommu_detach_group(domain->domain, iommu_group);
> >> +  if (iommu->local_domain) {
> >> +          domain = iommu->local_domain;
> >> +          group = find_iommu_group(domain, iommu_group);
> >> +          if (group) {
> >>                    list_del(&group->next);
> >>                    kfree(group);
> >> -                  /*
> >> -                   * Group ownership provides privilege, if the group
> >> -                   * list is empty, the domain goes away.  If it's the
> >> -                   * last domain, then all the mappings go away too.
> >> -                   */
> >> +
> >>                    if (list_empty(&domain->group_list)) {
> >> -                          if (list_is_singular(&iommu->domain_list))
> >> +                          vfio_local_unpin_all(domain);
> >> +                          if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> >>                                    vfio_iommu_unmap_unpin_all(iommu);
> >> -                          iommu_domain_free(domain->domain);
> >> -                          list_del(&domain->next);
> >>                            kfree(domain);
> >> +                          iommu->local_domain = NULL;
> >> +                  }  
> > 
> > 
> > I can't quite wrap my head around this: if we have mdev groups attached
> > and this iommu group matches an mdev group, remove it from the list and
> > free the group.  If there are now no more groups in the mdev group list,
> > then for each vfio_pfn, unpin the pfn, /without/ doing accounting
> > updates
> 
> corrected the code to do accounting here.
> 
> > and remove the vfio_pfn, but only if the ref_count is now
> > zero.  
> 
> Yes.  If you look at the loop in vfio_local_unpin_all(), it iterates as
> long as a node exists in the rb tree:
> 
> >> +  while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> >> +          vfio_unpin_pfn(domain,
> >> +                          rb_entry(node, struct vfio_pfn, node), false);
> >> +  
> 
> 
> and vfio_unpin_pfn() only removes the node from the rb tree if the ref
> count is zero:
> 
> static int vfio_unpin_pfn(struct vfio_domain *domain,
>                           struct vfio_pfn *vpfn, bool do_accounting)
> {
>         __vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot,
>                                 do_accounting);
> 
>         if (atomic_dec_and_test(&vpfn->ref_count))
>                 vfio_remove_from_pfn_list(domain, vpfn);
> 
>         return 1;
> }
> 
> So, for example, for a vfio_pfn with ref_count 2, the first iteration
> would:
>  - call __vfio_unpin_page_local()
>  - atomic_dec(ref_count), so now ref_count is 1, but the node is not
> removed from the rb tree.
> 
> In the next iteration:
>  - call __vfio_unpin_page_local()
>  - atomic_dec(ref_count), so now ref_count is 0, and the node is removed
> from the rb tree.

Ok, I missed that, thanks.

> >  We free the domain, so if the ref_count was non-zero we've now
> > just leaked memory.  I think that means that if a vendor driver pins a
> > given page twice, that leak occurs.  Furthermore, if there is not an
> > iommu capable domain in the container, we remove all the vfio_dma
> > entries as well, ok.  Maybe the only issue is those leaked vfio_pfns.
> >   
> 
> So if the vendor driver pins a page twice, vfio_unpin_pfn() would get
> called twice, and the node is removed from the rb tree only when the
> ref count reaches zero.  So there is no memory leak.

Ok


