From: Joao Martins
Subject: Re: [PATCH v2 17/20] vfio/common: Support device dirty page tracking with vIOMMU
Date: Fri, 24 Feb 2023 19:16:03 +0000

On 24/02/2023 15:56, Alex Williamson wrote:
> On Fri, 24 Feb 2023 12:53:26 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 24/02/2023 11:25, Joao Martins wrote:
>>> On 23/02/2023 23:26, Jason Gunthorpe wrote:  
>>>> On Thu, Feb 23, 2023 at 03:33:09PM -0700, Alex Williamson wrote:  
>>>>> On Thu, 23 Feb 2023 16:55:54 -0400
>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:  
>>>>>> On Thu, Feb 23, 2023 at 01:06:33PM -0700, Alex Williamson wrote:
>>>>>> Or even better figure out how to get interrupt remapping without IOMMU
>>>>>> support :\  
>>>>>
>>>>> -machine q35,default_bus_bypass_iommu=on,kernel-irqchip=split \
>>>>> -device intel-iommu,caching-mode=on,intremap=on  
>>>>
>>>> Joao?
>>>>
>>>> If this works lets just block migration if the vIOMMU is turned on..  
>>>
>>> At first glance, this looked like my regular iommu incantation.
>>>
>>> But reading the code, this ::bypass_iommu (new to me) apparently controls
>>> whether the vIOMMU is bypassed for the PCI devices, all the way to skipping
>>> their enumeration in the IVRS/DMAR ACPI tables. And I see VFIO double-checks
>>> whether the PCI device is within the IOMMU address space (or bypassed) prior
>>> to DMA maps and such.
>>>
>>> You can see from the other email that all of the other options in my head
>>> were either a bit inconvenient or risky. I wasn't aware of this option, for
>>> what it's worth -- much simpler, should work!
>>>  
>>>  
>>
>> I say *should*, but on second thought interrupt remapping may still be
>> required for one of these devices that are IOMMU-bypassed, say to set
>> affinities to vCPUs above 255. I was trying this out with more than 255 vCPUs
>> and a couple of VFs, and at first glance these VFs fail to probe (these are
>> CX6 VFs).
>>
>> It is a working setup without the parameter, but with
>> default_bus_bypass_iommu=on added, the VFs fail to init:
>>
>> [   32.412733] mlx5_core 0000:00:02.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
>> [   32.416242] mlx5_core 0000:00:02.0: mlx5_load:1204:(pid 3361): Failed to alloc IRQs
>> [   33.227852] mlx5_core 0000:00:02.0: probe_one:1684:(pid 3361): mlx5_init_one failed with error code -19
>> [   33.242182] mlx5_core 0000:00:03.0: firmware version: 22.31.1660
>> [   33.415876] mlx5_core 0000:00:03.0: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
>> [   33.448016] mlx5_core 0000:00:03.0: mlx5_load:1204:(pid 3361): Failed to alloc IRQs
>> [   34.207532] mlx5_core 0000:00:03.0: probe_one:1684:(pid 3361): mlx5_init_one failed with error code -19
>>
>> I haven't dug into why it fails yet.
> 
> Hmm, I was thinking this would only affect DMA, but on second thought
> I think the DRHD also describes the interrupt remapping hardware and
> while interrupt remapping is an optional feature of the DRHD, DMA
> remapping is always supported afaict.  I saw IR vectors in
> /proc/interrupts and thought it worked, but indeed an assigned device
> is having trouble getting vectors.
> 

AMD/IVRS might be a little different.

I also tried disabling the dma-translation IOMMU feature, as I had mentioned in
another email, and that gives the same result as default_bus_bypass_iommu.

So it's either this KVM PV op (which is not really interrupt remapping, and is
x86-specific) or a full vIOMMU. The PV op[*] has the natural disadvantage of
requiring a compatible guest kernel.

[*] See KVM_FEATURE_MSI_EXT_DEST_ID.
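
For illustration only (my sketch, not part of the series): a guest can probe for
that PV feature through KVM's CPUID leaves, roughly like this (leaf 0x40000001,
bit 15, per the KVM UAPI):

#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define KVM_CPUID_SIGNATURE         0x40000000
#define KVM_CPUID_FEATURES          0x40000001
#define KVM_FEATURE_MSI_EXT_DEST_ID 15

/* Run inside the guest: true if MSIs can target APIC IDs > 255 without IR. */
static bool kvm_msi_ext_dest_id(void)
{
    unsigned int eax, ebx, ecx, edx;
    char sig[13] = { 0 };

    /* KVM advertises itself with a "KVMKVMKVM" signature at 0x40000000. */
    __cpuid(KVM_CPUID_SIGNATURE, eax, ebx, ecx, edx);
    memcpy(sig, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    if (strcmp(sig, "KVMKVMKVM") != 0)
        return false;

    __cpuid(KVM_CPUID_FEATURES, eax, ebx, ecx, edx);
    return eax & (1u << KVM_FEATURE_MSI_EXT_DEST_ID);
}

int main(void)
{
    printf("MSI extended destination ID: %s\n",
           kvm_msi_ext_dest_id() ? "supported" : "not supported");
    return 0;
}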

>>
>>> And avoiding vIOMMU simplifies the whole patchset too, if it's OK to add a 
>>> live
>>> migration blocker if `bypass_iommu` is off for any PCI device.
>>>   
>>
>> Still, for starters we could have a live migration blocker until we revisit
>> the vIOMMU case ... or should we deem default_bus_bypass_iommu=on and the
>> others I suggested to be non-options?
> 
> I'm very uncomfortable presuming a vIOMMU usage model, especially when
> it leads to potentially untracked DMA if our assumptions are violated.

We can track DMA that got dirtied, but that doesn't mean said DMA is mapped.
I don't think VFIO ties those two together? You can ask to track certain ranges,
but if the IOVA isn't mapped in the IOMMU the device gets a target abort.
Starting dirty tracking doesn't imply that such DMA is allowed.

With vIOMMU, anything that falls outside the IOMMU-mapped ranges (or the
identity map) always gets marked as dirty if it wasn't armed in the device dirty
tracker. It's best effort -- I don't think supporting vIOMMU has a ton of
options without a more significant compromise. If the vIOMMU is in passthrough
mode, then things work just as if no vIOMMU is there. Avihai's code reflects
that.

Considering your earlier suggestion that we only start dirty tracking and record
ranges *when* the dirty tracking start operation happens, this gets further
simplified. We also have to take into account that we have no guarantee that the
ranges under tracking can be changed dynamically.

For improving the vIOMMU case we either track up to MAX_IOVA or we compose an
artificial range based on the max IOVA of the current vIOMMU mappings.
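
To make that concrete, here is roughly what arming a single artificial
[0, max_iova) range looks like against the kernel's VFIO DMA logging UAPI that
this work builds on (a sketch only, with minimal error handling; QEMU's real
code wraps this differently):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Start device dirty tracking over one artificial range [0, max_iova). */
static int vfio_start_dirty_tracking(int device_fd, uint64_t max_iova,
                                     uint64_t page_size)
{
    struct vfio_device_feature_dma_logging_range range = {
        .iova = 0,
        .length = max_iova,
    };
    struct vfio_device_feature_dma_logging_control control = {
        .page_size = page_size,     /* dirty bitmap granularity */
        .num_ranges = 1,
        .ranges = (uintptr_t)&range,
    };
    /* uint64_t buffer keeps the trailing payload 8-byte aligned. */
    uint64_t buf[(sizeof(struct vfio_device_feature) +
                  sizeof(control) + 7) / 8];
    struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;

    feature->argsz = sizeof(buf);
    feature->flags = VFIO_DEVICE_FEATURE_SET |
                     VFIO_DEVICE_FEATURE_DMA_LOGGING_START;
    memcpy(feature->data, &control, sizeof(control));

    return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
}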

> We could use a MemoryListener on the IOVA space to record a high water
> mark, but we'd need to continue to monitor that mark while we're in
> pre-copy, and I don't think anyone would agree that a migratable VM
> suddenly becoming unmigratable due to a random IOVA allocation would be
> supportable.  That leads me to think that a machine option to limit the
> vIOMMU address space, and testing that against the device prior to
> declaring migration support of the device, is possibly our best option.
> 
> Is that feasible?  Do all the vIOMMU models have a means to limit the
> IOVA space? 

I can say that *at least* AMD and Intel support that. Intel supports either 39-
or 48-bit address-width modes (only those two values, as I understand it). AMD
supposedly has more granular control over VASize and PASize.
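
(For reference, on the Intel model that limit is the aw-bits property, e.g.
something like:

-device intel-iommu,caching-mode=on,intremap=on,aw-bits=39

so a knob to test against already exists there.)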

I have no idea about smmuv3 or virtio-iommu.

But isn't this actually what Avihai does in the series, minus the device part?
The address width is fetched directly from the vIOMMU model via
IOMMU_ATTR_MAX_IOVA, and one of the options is to compose a range based on the
max vIOMMU range.

> How does QEMU learn a limit for a given device? 

IOMMU_ATTR_MAX_IOVA for the vIOMMU.

For the device this is not described in ACPI or any other place that I know of
:/ without getting into VF specifics.
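
On the QEMU side that lookup is just the memory API attribute the series uses,
something along these lines (illustrative only; the hwaddr out-parameter and
the fallback are my assumption):

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Ask the vIOMMU model for its maximum IOVA; fall back to a full 64-bit
 * range if the model doesn't implement the attribute.
 */
static hwaddr vfio_viommu_max_iova(IOMMUMemoryRegion *iommu_mr)
{
    hwaddr max_iova;

    if (memory_region_iommu_get_attr(iommu_mr, IOMMU_ATTR_MAX_IOVA,
                                     &max_iova)) {
        return UINT64_MAX;
    }
    return max_iova;
}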

> We
> probably need to think about whether there are devices that can even
> support the guest physical memory ranges when we start relocating RAM
> to arbitrary addresses (ex. hypertransport). 

In theory we require one more bit in the device DMA engine, so instead of a max
of 39 bits we require 40 bits for a 1T guest. GPUs and modern NICs are 64-bit
DMA-address-capable devices, but it's a bit hard to learn this as it's device
specific.
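
As a back-of-the-envelope check (my arithmetic, not from the series): with RAM
starting at 0, the highest guest-physical address of a 1 TiB guest sits just
below 2^40, so the DMA engine needs 40 address bits, one more than a 39-bit
engine offers:

#include <stdint.h>
#include <stdio.h>

/* Smallest DMA address width that can reach every byte below top_gpa. */
static unsigned int dma_bits_needed(uint64_t top_gpa /* exclusive */)
{
    unsigned int bits = 0;

    while (bits < 64 && (UINT64_C(1) << bits) < top_gpa)
        bits++;
    return bits;
}

int main(void)
{
    printf("1 TiB guest: %u bits\n", dma_bits_needed(UINT64_C(1) << 40));
    /* prints: 1 TiB guest: 40 bits */
    return 0;
}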

> Can we infer anything
> from the vCPU virtual address space or is that still an unreasonable
> range to track for devices?  Thanks,
> 
We sort of rely on that for the iommu=pt or no-vIOMMU cases, where the vCPU
address space matches the IOVA space, but I'm not sure how much the vCPU address
space gives you that the vIOMMU mappings don't give you already.


