qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-devel] [Bug 1771238] Re: Not able to passthrough > 32 PCIe devices


From: David Coronel
Subject: [Qemu-devel] [Bug 1771238] Re: Not able to passthrough > 32 PCIe devices to a KVM Guest
Date: Tue, 15 May 2018 06:25:18 -0000

** Changed in: qemu (Ubuntu)
   Importance: Critical => High

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1771238

Title:
  Not able to passthrough > 32 PCIe devices to a KVM Guest

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  New

Bug description:
  Using an Ubuntu Server 16.04-based host with KVM hypervisor installed,
  we are unable to launch a vanilla Ubuntu Server 16.04.4 guest with >=
  32 PCIe devices. It is 100% reproducible. Using fewer PCIe devices
  works fine. We are using the vanilla kvm and qemu packages from the
  Canonical repos.

  The ultimate goal is to create a KVM Guest wherein I can passthrough
  44 PCI devices.

  When a KVM Guest launches, it also has some internal PCIe devices
  including host bridge, USB, IDE (for virtual disk), and virtual nic
  etc.

  Script used to launch all devices looks like this:

  #!/bin/bash 
  NAME=16gpuvm 

  sudo qemu-img create -f qcow2 /home/lab/kvm/images/${NAME}.img 40G && 
  sudo virt-install \ 
  --name ${NAME} \ 
  --ram 716800 \ 
  --vcpus 88 \ 
  --disk path=/home/lab/kvm/images/${NAME}.img,format=qcow2 \ 
  --network bridge=virbr0 \ 
  --graphics none \ 
  --host-device 34:00.0 \ 
  --host-device 36:00.0 \ 
  --host-device 39:00.0 \ 
  --host-device 3b:00.0 \ 
  --host-device 57:00.0 \ 
  --host-device 59:00.0 \ 
  --host-device 5c:00.0 \ 
  --host-device 5e:00.0 \ 
  --host-device 61:00.0 \ 
  --host-device 62:00.0 \ 
  --host-device 63:00.0 \ 
  --host-device 65:00.0 \ 
  --host-device 66:00.0 \ 
  --host-device 67:00.0 \ 
  --host-device 35:00.0 \ 
  --host-device 3a:00.0 \ 
  --host-device 58:00.0 \ 
  --host-device 5d:00.0 \ 
  --host-device 2e:00.0 \ 
  --host-device 2f:00.0 \ 
  --host-device 51:00.0 \ 
  --host-device 52:00.0 \ 
  --host-device b7:00.0 \ 
  --host-device b9:00.0 \ 
  --host-device bc:00.0 \ 
  --host-device be:00.0 \ 
  --host-device e0:00.0 \ 
  --host-device e2:00.0 \ 
  --host-device e5:00.0 \ 
  --host-device e7:00.0 \ 
  --host-device c1:00.0 \ 
  --host-device c2:00.0 \ 
  --host-device c3:00.0 \ 
  --host-device c5:00.0 \ 
  --host-device c6:00.0 \ 
  --host-device c7:00.0 \ 
  --host-device b8:00.0 \ 
  --host-device bd:00.0 \ 
  --host-device e1:00.0 \ 
  --host-device e6:00.0 \ 
  --host-device b1:00.0 \ 
  --host-device b2:00.0 \ 
  --host-device da:00.0 \ 
  --host-device db:00.0 \ 
  --console pty,target_type=serial \ 
  --location http://ftp.ubuntu.com/ubuntu/dists/xenial/main/installer-amd64 \ 
  --initrd-inject=/home/lab/kvm/images/preseed.cfg \ 
  --extra-args=" 
  console=ttyS0,115200 
  locale=en_US 
  console-keymaps-at/keymap=us 
  console-setup/ask_detect=false 
  console-setup/layoutcode=us 
  keyboard-configuration/layout=USA 
  keyboard-configuration/variant=USA 
  hostname=${NAME} 
  file=file:/preseed.cfg 
  "

  Passing > 32 device causes this issue:  32nd device hits a DPC error
  and the host/HV crashes:

  
  Apr 25 22:34:35 xpl-evt-16 kernel: [18125.977496] dpc 0000:5b:10.0:pcie210: 
DPC containment event, status:0x0009 source:0x0000
  Apr 25 22:34:35 xpl-evt-16 kernel: [18125.977500] dpc 0000:5b:10.0:pcie210: 
DPC unmasked uncorrectable error detected, remove downstream devices
  Apr 25 22:34:35 xpl-evt-16 kernel: [18125.994326] vfio-pci 0000:5e:00.0: 
Refused to change power state, currently in D3
  Apr 25 22:34:35 xpl-evt-16 kernel: [18125.994427] iommu: Removing device 
0000:5e:00.0 from group 92

  
  From syslog (attached)

  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194358] dpc 0000:bb:04.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194387] dpc 0000:bb:10.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194413] dpc 0000:d9:00.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194439] dpc 0000:d9:01.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194472] dpc 0000:d9:02.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194499] dpc 0000:d9:03.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194526] dpc 0000:d9:04.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194553] dpc 0000:d9:0c.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194583] dpc 0000:df:00.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194619] dpc 0000:df:04.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194649] dpc 0000:df:10.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194679] dpc 0000:e4:00.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194709] dpc 0000:e4:04.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194742] dpc 0000:e4:10.0:pcie210: 
DPC error containment capabilities: Int Msg #3, RPExt- PoisonedTLP+ SwTrigger+ 
RP PIO Log 0, DL_ActiveErr+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.194763] pciehp 
0000:00:1c.0:pcie004: Slot #0 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ 
Surprise+ Interlock- NoCompl+ LLActRep+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.195036] pciehp 
0000:60:02.0:pcie204: Slot #2 AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ 
Surprise- Interlock- NoCompl- LLActRep+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.195278] pciehp 
0000:60:0a.0:pcie204: Slot #10 AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ 
Surprise- Interlock- NoCompl- LLActRep+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.195513] pciehp 
0000:c0:02.0:pcie204: Slot #2 AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ 
Surprise- Interlock- NoCompl- LLActRep+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.195753] pciehp 
0000:c0:0a.0:pcie204: Slot #10 AttnBtn+ PwrCtrl+ MRL+ AttnInd+ PwrInd+ HotPlug+ 
Surprise- Interlock- NoCompl- LLActRep+
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.196196] efifb: probing for efifb
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.196242] efifb: framebuffer at 
0x9c000000, using 1920k, total 1920k
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.196247] efifb: mode is 800x600x32, 
linelength=3200, pages=1
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.196250] efifb: scrolling: redraw
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.196254] efifb: Truecolor: 
size=8:8:8:8, shift=24:16:8:0
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.206652] Console: switching to 
colour frame buffer device 100x37
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.217034] fb0: EFI VGA frame buffer 
device
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.217173] intel_idle: MWAIT 
substates: 0x2020
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.217174] intel_idle: v0.4.1 model 
0x55
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.220874] intel_idle: 
lapic_timer_reliable_states 0xffffffff
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.221219] input: Power Button as 
/devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.221590] ACPI: Power Button [PWRF]
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.231089] ERST: Error Record 
Serialization Table (ERST) support is initialized.
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.231312] pstore: using zlib 
compression
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.231444] pstore: Registered erst as 
persistent store backend
  Apr 25 22:37:13 xpl-evt-16 kernel: [    2.232503] GHES: APEI firmware first 
mode is enabled by APEI bit and WHEA _OSC.


  All PCI devices go offline include NVMe.

  OS Drives go away, RAID-1 is remounted as RO, and eventually system
  crashes.

  
  Apr 25 22:37:13 xpl-evt-16 rsyslogd-2007: action 'action 9' suspended, next 
retry is Wed Apr 25 22:37:43 2018 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
  Apr 25 22:37:13 xpl-evt-16 systemd-udevd[1383]: Process '/sbin/mdadm 
--incremental /dev/nvme1n1p2 --offroot' failed with exit code 1.
  Apr 25 22:37:13 xpl-evt-16 systemd-udevd[1371]: Process '/sbin/mdadm 
--incremental /dev/nvme0n1p2 --offroot' failed with exit code 1.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Starting Apply Kernel Variables...
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Mounted Configuration File System.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Mounted FUSE Control File System.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Started Apply Kernel Variables.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Found device 
/dev/disk/by-uuid/269E-631A.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Starting File System Check on 
/dev/disk/by-uuid/269E-631A...
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Started File System Check Daemon to 
report status.
  Apr 25 22:37:13 xpl-evt-16 systemd-fsck[1576]: fsck.fat 3.0.28 (2015-05-16)
  Apr 25 22:37:13 xpl-evt-16 systemd-fsck[1576]: /dev/nvme0n1p1: 10 files, 
1168/130812 clusters
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Started File System Check on 
/dev/disk/by-uuid/269E-631A.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Mounting /boot/efi...
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Listening on Load/Save RF Kill Switch 
Status /dev/rfkill Watch.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Mounted /boot/efi.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Reached target Local File Systems.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Starting Preprocess NFS 
configuration...
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Starting Create Volatile Files and 
Directories...
  Apr 25 22:37:13 xpl-evt-16 systemd-tmpfiles[1714]: 
[/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
  Apr 25 22:37:13 xpl-evt-16 systemd[1]: Starting openibd - configure Mellanox 
devices...
  Apr 25 22:37:13 xpl-evt-16 kernel: [    0.000000] random: get_random_bytes 
called from start_kernel+0x42/0x50d with crng_init=0

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1771238/+subscriptions



reply via email to

[Prev in Thread] Current Thread [Next in Thread]