From: Zhenzhong Duan
Subject: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
Date: Wed, 19 Feb 2025 16:22:08 +0800
Hi,
Per Jason Wang's suggestion, the iommufd nesting series [1] is split into
an "Enable stage-1 translation for emulated device" series and an
"Enable stage-1 translation for passthrough device" series.
This series is the second part, focusing on passthrough devices. We don't
shadow the guest page table for a passthrough device; instead, we pass the
stage-1 page table to the host side to construct a nested domain. There was
an earlier effort to enable this feature, see [2] for details.
The key design is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability of the host IOMMU.
As the diagram below shows, the guest I/O page table pointer, expressed
as a GPA (guest physical address), is passed to the host and used to
perform the stage-1 address translation. In addition, any modification
to a present mapping in the guest I/O page table must be followed by
an IOTLB invalidation.
        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  | FS for GIOVA->GPA      |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.----------------------------------.
        |             |   | SS for GPA->HPA, unmanaged domain|
        |             |   '----------------------------------'
        '-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
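To make the two stages in the diagram concrete, here is a minimal sketch of a nested lookup: the guest-owned FS table maps GIOVA to GPA, then the host-owned SS table maps GPA to HPA. It is a toy model, not QEMU or VT-d code. All names (ToyTable, toy_nested_translate, ...) are hypothetical, the flat array stands in for a real multi-level page table, and the sketch ignores that hardware also walks the FS table itself through SS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)

/* One page-granular mapping (hypothetical; real tables are radix trees). */
typedef struct ToyMapping {
    uint64_t in_pfn;   /* input page frame number */
    uint64_t out_pfn;  /* output page frame number */
} ToyMapping;

typedef struct ToyTable {
    const ToyMapping *maps;
    size_t nr_maps;
} ToyTable;

/* Translate through a single stage; returns false on a translation fault. */
static bool toy_stage_translate(const ToyTable *t, uint64_t in, uint64_t *out)
{
    assert(t && out);
    for (size_t i = 0; i < t->nr_maps; i++) {
        if (t->maps[i].in_pfn == (in >> TOY_PAGE_SHIFT)) {
            *out = (t->maps[i].out_pfn << TOY_PAGE_SHIFT) |
                   (in & TOY_PAGE_MASK);
            return true;
        }
    }
    return false;
}

/* Nested translation: FS (GIOVA->GPA) followed by SS (GPA->HPA). */
static bool toy_nested_translate(const ToyTable *fs, const ToyTable *ss,
                                 uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    if (!toy_stage_translate(fs, giova, &gpa)) {
        return false;                          /* stage-1 fault */
    }
    return toy_stage_translate(ss, gpa, hpa);  /* false means stage-2 fault */
}
```

With an FS entry mapping page 0x10 to 0x200 and an SS entry mapping page 0x200 to 0x7000, toy_nested_translate() turns GIOVA 0x10abc into HPA 0x7000abc; a GIOVA with no FS entry faults in stage 1.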
There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
instance with vIOMMU at the vfio device realize stage.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
to bind/unbind a device to/from IOMMUFD-backed domains, either nested
or not.
See the diagram below:
        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
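The registration half of this interaction can be sketched as a callback table that the vIOMMU side fills in and the VFIO side calls. This is a simplified model only: ToyIOMMUOps, ToyVIOMMU and ToyHostIOMMUDevice are hypothetical stand-ins, and the real PCIIOMMUOps callbacks in QEMU take more parameters (bus, opaque pointer, error reporting) than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEVFN 256

/* Stand-in for a HostIOMMUDevice instance; the fields are assumptions. */
typedef struct ToyHostIOMMUDevice {
    int iommufd;  /* iommufd object the device is linked to */
    int devid;    /* device id within that iommufd */
} ToyHostIOMMUDevice;

/* Stand-in for the vIOMMU state holding the host IOMMU device list. */
typedef struct ToyVIOMMU {
    ToyHostIOMMUDevice *devs[TOY_MAX_DEVFN];
} ToyVIOMMU;

/* Callback table modelled on the set/unset pair described above. */
typedef struct ToyIOMMUOps {
    bool (*set_iommu_device)(ToyVIOMMU *s, int devfn,
                             ToyHostIOMMUDevice *hiod);
    void (*unset_iommu_device)(ToyVIOMMU *s, int devfn);
} ToyIOMMUOps;

/* Called by the device side at realize time; rejects double registration. */
static bool toy_set_iommu_device(ToyVIOMMU *s, int devfn,
                                 ToyHostIOMMUDevice *hiod)
{
    assert(s && hiod);
    if (devfn < 0 || devfn >= TOY_MAX_DEVFN || s->devs[devfn]) {
        return false;
    }
    s->devs[devfn] = hiod;
    return true;
}

/* Called by the device side at unrealize time. */
static void toy_unset_iommu_device(ToyVIOMMU *s, int devfn)
{
    assert(s);
    if (devfn >= 0 && devfn < TOY_MAX_DEVFN) {
        s->devs[devfn] = NULL;
    }
}

static const ToyIOMMUOps toy_viommu_ops = {
    .set_iommu_device = toy_set_iommu_device,
    .unset_iommu_device = toy_unset_iommu_device,
};
```

Once a device is in the list, the vIOMMU has what it needs to call back in the other direction (attach_hwpt/detach_hwpt in the diagram) when the guest binds or unbinds a page table.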
Based on Yi's suggestion, this design shares ioas/hwpt whenever possible
and creates new ones on demand. It also supports multiple iommufd objects
and ERRATA_772415.
E.g., a stage-2 page table can be shared by different devices if there
is no conflict and the devices link to the same iommufd object, i.e.
devices under the same host IOMMU can share the same stage-2 page table.
If there is a conflict, e.g. one device is in non-cache-coherent (non-CC)
mode while the others are cache coherent, that device requires a separate
stage-2 page table in non-CC mode.
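The "share whenever possible, create on demand" rule can be sketched as a lookup keyed on the iommufd object and the coherency mode. This is only an illustration under that assumption: ToyS2Hwpt, ToyHwptPool and toy_get_s2_hwpt() are hypothetical names, and the series may detect conflicts differently (for example by trying to attach to an existing hwpt and falling back when that fails).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_HWPT 8

/* Toy stage-2 page table object (hypothetical fields). */
typedef struct ToyS2Hwpt {
    int iommufd;     /* iommufd object this page table belongs to */
    bool cc;         /* true for cache-coherent mode */
    unsigned users;  /* number of devices attached */
} ToyS2Hwpt;

typedef struct ToyHwptPool {
    ToyS2Hwpt hwpts[TOY_MAX_HWPT];
    size_t nr;
} ToyHwptPool;

/*
 * Reuse a compatible stage-2 page table if one exists, otherwise create
 * a new one. Returns NULL when the toy pool is full.
 */
static ToyS2Hwpt *toy_get_s2_hwpt(ToyHwptPool *pool, int iommufd, bool cc)
{
    ToyS2Hwpt *h;

    assert(pool);
    for (size_t i = 0; i < pool->nr; i++) {
        h = &pool->hwpts[i];
        if (h->iommufd == iommufd && h->cc == cc) {
            h->users++;          /* no conflict: share it */
            return h;
        }
    }
    if (pool->nr == TOY_MAX_HWPT) {
        return NULL;
    }
    h = &pool->hwpts[pool->nr++];  /* conflict or first user: create */
    h->iommufd = iommufd;
    h->cc = cc;
    h->users = 1;
    return h;
}
```

Two CC devices on iommufd0 end up on the same object with users == 2, while a non-CC device on iommufd0 or a CC device on iommufd1 each get their own.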
The SPR platform has ERRATA_772415, which requires that the stage-2 page
table contain no readonly mappings. This series supports creating a
VTDIOASContainer with no readonly mappings. In the rare case that some
IOMMUs on a multi-IOMMU host have ERRATA_772415 and others do not, this
design still works.
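One way to picture a container "with no readonly mappings" is a filter applied when guest memory sections are mapped into the stage-2 address space. The sketch below assumes exactly that and nothing more: ToyContainer, ToySection and the helper names are hypothetical, and it only counts what would be mapped instead of calling any real iommufd interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy counterpart of a per-iommufd IOAS container (hypothetical fields). */
typedef struct ToyContainer {
    int iommufd;
    bool errata;       /* true: "RW only", skip readonly mappings */
    size_t nr_mapped;  /* sections mapped so far */
} ToyContainer;

/* Toy guest memory section. */
typedef struct ToySection {
    uint64_t gpa;
    uint64_t size;
    bool readonly;
} ToySection;

/* A readonly section is skipped only in an errata container. */
static bool toy_container_should_map(const ToyContainer *c,
                                     const ToySection *sec)
{
    assert(c && sec);
    return !(c->errata && sec->readonly);
}

/* Map every eligible section; returns how many were mapped. */
static size_t toy_container_map_all(ToyContainer *c,
                                    const ToySection *secs, size_t n)
{
    size_t mapped = 0;

    for (size_t i = 0; i < n; i++) {
        if (toy_container_should_map(c, &secs[i])) {
            c->nr_mapped++;
            mapped++;
        }
    }
    return mapped;
}
```

Because the filter is a property of the container rather than of the whole vIOMMU, devices behind an affected IOMMU can use an "RW only" container while devices behind an unaffected one keep a regular "RW&RO" container.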
See the example diagram below for a full view:
      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.
This series is also prerequisite work for vSVA, i.e. sharing a guest
application address space with passthrough devices.
To enable stage-1 translation, add "x-scalable-mode=on,x-flts=on" to the
intel-iommu device,
e.g. -device intel-iommu,x-scalable-mode=on,x-flts=on...
A passthrough device must use the iommufd backend to work with stage-1
translation,
e.g. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
If the host doesn't support nested translation, QEMU fails and reports
that the configuration is unsupported.
Tests done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test
PATCH1-8:   Add HWPT-based nesting infrastructure support
PATCH9-10:  Some cleanup work
PATCH11:    cap/ecap related compatibility check between vIOMMU and Host IOMMU
PATCH12-19: Implement stage-1 page table for passthrough device
PATCH20:    Enable stage-1 translation for passthrough device
QEMU code can be found at [3].
TODO:
- RAM discard
- dirty tracking on stage-2 page table
[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
Thanks
Zhenzhong
Changelog:
rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead (patch1-8) so that ARM nesting can easily rebase
- Add two cleanup patches (patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- Add vtd_as_[from|to]_iommu_pasid() helpers to translate between vtd_as and
iommu pasid; this is important for dropping VTDPASIDAddressSpace
Yi Liu (3):
intel_iommu: Replay pasid binds after context cache invalidation
intel_iommu: Propagate PASID-based iotlb invalidation to host
intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
Zhenzhong Duan (17):
backends/iommufd: Add helpers for invalidating user-managed HWPT
vfio/iommufd: Add properties and handlers to
TYPE_HOST_IOMMU_DEVICE_IOMMUFD
HostIOMMUDevice: Introduce realize_late callback
vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
vfio/iommufd: Implement [at|de]tach_hwpt handlers
host_iommu_device: Define two new capabilities
HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
vtd_ce_get_pasid_entry
intel_iommu: Optimize context entry cache utilization
intel_iommu: Check for compatibility with IOMMUFD backed device when
x-flts=on
intel_iommu: Introduce a new structure VTDHostIOMMUDevice
intel_iommu: Add PASID cache management infrastructure
intel_iommu: Bind/unbind guest page table to host
intel_iommu: ERRATA_772415 workaround
intel_iommu: Bypass replay in stage-1 page table mode
intel_iommu: Enable host device when x-flts=on in scalable mode
hw/i386/intel_iommu_internal.h | 56 +
include/hw/i386/intel_iommu.h | 33 +-
include/system/host_iommu_device.h | 40 +
include/system/iommufd.h | 53 +
backends/iommufd.c | 58 +
hw/i386/intel_iommu.c | 1660 ++++++++++++++++++++++++----
hw/vfio/common.c | 17 +-
hw/vfio/iommufd.c | 48 +
backends/trace-events | 1 +
hw/i386/trace-events | 13 +
10 files changed, 1776 insertions(+), 203 deletions(-)
--
2.34.1