Re: [RFC PATCH] docs: Enhance documentation for iommu bypass
From: Michael S. Tsirkin
Subject: Re: [RFC PATCH] docs: Enhance documentation for iommu bypass
Date: Wed, 22 May 2024 05:28:50 -0400
On Wed, May 22, 2024 at 03:40:08PM +0800, Aaron Lu wrote:
> When Intel vIOMMU is used and irq remapping is enabled, using
> bypass_iommu causes the following two callstacks to be dumped during
> kernel boot, and all PCI devices attached to the root bridge lose their
> MSI capabilities and fall back to using IOAPIC:
>
> [ 0.960262] ------------[ cut here ]------------
> [ 0.961245] WARNING: CPU: 3 PID: 1 at drivers/pci/msi/msi.h:121 pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.963070] Modules linked in:
> [ 0.963695] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-rc7-00056-g45db3ab70092 #1
> [ 0.965225] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 0.967382] RIP: 0010:pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.968378] Code: 90 90 90 0f 1f 44 00 00 48 8b 87 30 03 00 00 89 f2 48 85 c0 74 14 f6 40 28 01 74 0e 48 81 c7 c0 00 00 00 31 f6 e9 29 42 9e ff <0f> 0b b8 ed ff ff ff c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00
> [ 0.971756] RSP: 0000:ffffc90000017988 EFLAGS: 00010246
> [ 0.972669] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [ 0.973901] RDX: 0000000000000005 RSI: 0000000000000005 RDI: ffff888100ee1000
> [ 0.975391] RBP: 0000000000000005 R08: ffff888101f44d90 R09: 0000000000000228
> [ 0.976629] R10: 0000000000000001 R11: 0000000000008d3f R12: ffffc90000017b80
> [ 0.977864] R13: ffff888102312000 R14: ffff888100ee1000 R15: 0000000000000005
> [ 0.979092] FS: 0000000000000000(0000) GS:ffff88817bd80000(0000) knlGS:0000000000000000
> [ 0.980473] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.981464] CR2: 0000000000000000 CR3: 000000000302e001 CR4: 0000000000770ef0
> [ 0.982687] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.983919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 0.985143] PKRU: 55555554
> [ 0.985625] Call Trace:
> [ 0.986056] <TASK>
> [ 0.986440] ? __warn+0x80/0x130
> [ 0.987014] ? pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.987810] ? report_bug+0x18d/0x1c0
> [ 0.988443] ? handle_bug+0x3a/0x70
> [ 0.989026] ? exc_invalid_op+0x13/0x60
> [ 0.989672] ? asm_exc_invalid_op+0x16/0x20
> [ 0.990374] ? pci_msi_setup_msi_irqs+0x27/0x40
> [ 0.991118] __pci_enable_msix_range+0x325/0x5b0
> [ 0.991883] pci_alloc_irq_vectors_affinity+0xa9/0x110
> [ 0.992698] vp_find_vqs_msix+0x1a8/0x4c0
> [ 0.993332] vp_find_vqs+0x3a/0x1a0
> [ 0.993893] vp_modern_find_vqs+0x17/0x70
> [ 0.994531] init_vq+0x3ad/0x410
> [ 0.995051] ? __pfx_default_calc_sets+0x10/0x10
> [ 0.995789] virtblk_probe+0xeb/0xbc0
> [ 0.996362] ? up_write+0x74/0x160
> [ 0.996900] ? down_write+0x4d/0x80
> [ 0.997450] virtio_dev_probe+0x1bc/0x270
> [ 0.998059] really_probe+0xc1/0x390
> [ 0.998626] ? __pfx___driver_attach+0x10/0x10
> [ 0.999288] __driver_probe_device+0x78/0x150
> [ 0.999924] driver_probe_device+0x1f/0x90
> [ 1.000506] __driver_attach+0xce/0x1c0
> [ 1.001073] bus_for_each_dev+0x70/0xc0
> [ 1.001638] bus_add_driver+0x112/0x210
> [ 1.002191] driver_register+0x55/0x100
> [ 1.002760] virtio_blk_init+0x4c/0x90
> [ 1.003332] ? __pfx_virtio_blk_init+0x10/0x10
> [ 1.003974] do_one_initcall+0x41/0x240
> [ 1.004510] ? kernel_init_freeable+0x240/0x4a0
> [ 1.005142] kernel_init_freeable+0x321/0x4a0
> [ 1.005749] ? __pfx_kernel_init+0x10/0x10
> [ 1.006311] kernel_init+0x16/0x1c0
> [ 1.006798] ret_from_fork+0x2d/0x50
> [ 1.007303] ? __pfx_kernel_init+0x10/0x10
> [ 1.007883] ret_from_fork_asm+0x1a/0x30
> [ 1.008431] </TASK>
> [ 1.008748] ---[ end trace 0000000000000000 ]---
>
> Another callstack happens at pci_msi_teardown_msi_irqs().
>
> Actually every PCI device will trigger these two paths. There are only
> two callstack dumps because the two places use WARN_ON_ONCE().
>
> What happened is: when irq remapping is enabled, the kernel expects
> every PCI device (or its parent bridge) to appear in some DMA Remapping
> Hardware unit Definition (DRHD)'s device scope list; if it does not,
> that device's irq domain ends up NULL, which makes enabling MSI for the
> device fail.
>
> Per my understanding, only a virtualized system can have such a setup:
> irq remapping enabled while not all PCI/PCIe devices appear in a DRHD's
> device scope.
>
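
The fallback is easy to confirm from inside the guest with standard tools
(the BDF below is a placeholder for whichever device you look at):

  # interrupt remapping status as reported by the guest kernel
  dmesg | grep -i remapping
  # an MSI/MSI-X capability that never got enabled shows up as "Enable-"
  lspci -vv -s 00:03.0 | grep -i msi
  # devices that fell back to IOAPIC get IO-APIC entries here
  grep -i 'IO-APIC' /proc/interrupts
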
> Enhance the document by mentioning what could happen when bypass_iommu
> is used.
>
> For detailed qemu cmdline and guest kernel dmesg, please see:
> https://lore.kernel.org/qemu-devel/20240510072519.GA39314@ziqianlu-desk2/
>
> Reported-by: Juro Bystricky <juro.bystricky@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Is this issue specific to Linux?
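
For anyone trying to reproduce or check this, something along these lines
should do it (an untested sketch: the disk image and accel options are
placeholders, and default_bus_bypass_iommu plus kernel-irqchip=split are
the settings I'd expect to matter; the reporter's exact command line is in
the lore link above):

  # triggers the warnings: vIOMMU with irq remapping on, root bus bypassing it
  qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split,default_bus_bypass_iommu=on \
      -device intel-iommu,intremap=on \
      -drive file=guest.img,if=virtio

  # the workaround the patch documents: turn irq remapping off
  qemu-system-x86_64 -M q35,accel=kvm,default_bus_bypass_iommu=on \
      -device intel-iommu,intremap=off \
      -drive file=guest.img,if=virtio
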
> ---
> docs/bypass-iommu.txt | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/docs/bypass-iommu.txt b/docs/bypass-iommu.txt
> index e6677bddd3..8226f79104 100644
> --- a/docs/bypass-iommu.txt
> +++ b/docs/bypass-iommu.txt
> @@ -68,6 +68,11 @@ devices might send malicious dma request to virtual machine if there is no
> iommu isolation. So it would be necessary to only bypass iommu for trusted
> device.
>
> +When Intel IOMMU is virtualized, if irq remapping is enabled, PCI and PCIe
> +devices that bypassed vIOMMU will have their MSI/MSI-x functionalities disabled
functionality
> +and fall back to IOAPIC. If this is not desired, disable irq remapping:
> +qemu -device intel-iommu,intremap=off
> +
> Implementation
> ==============
> The bypass iommu feature includes:
> --
> 2.45.0