Re: [Qemu-devel] Multi GPU passthrough via VFIO


From:	Maik Broemme
Subject:	Re: [Qemu-devel] Multi GPU passthrough via VFIO
Date:	Mon, 14 Apr 2014 19:03:06 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)
Hi Alex,

Maik Broemme <address@hidden> wrote:
> Hi Alex,
> 
> Alex Williamson <address@hidden> wrote:
> > On Fri, 2014-02-14 at 01:01 +0100, Maik Broemme wrote:
> > > Hi Alex,
> > > 
> > > Maik Broemme <address@hidden> wrote:
> > > > Hi Alex,
> > > > 
> > > > Alex Williamson <address@hidden> wrote:
> > > > > On Fri, 2014-02-07 at 01:22 +0100, Maik Broemme wrote:
> > > > > > > The interesting part is the diff between the 1st and 2nd boot,
> > > > > > > taken by running lspci prior to booting. The only differences
> > > > > > > between the 1st start and the 2nd start are:
> > > > > > 
> > > > > > > --- 001-lspci.290x.before.1st.log   2014-02-07 01:13:41.498827928 +0100
> > > > > > > +++ 004-lspci.290x.before.2nd.log   2014-02-07 01:16:50.966611282 +0100
> > > > > > > @@ -24,7 +24,7 @@
> > > > > > >                     ClockPM- Surprise- LLActRep- BwNot-
> > > > > > >             LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
> > > > > > >                     ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > > > > > -           LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > > +           LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > > > > >             DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
> > > > > > >             DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
> > > > > > >             LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> > > > > > > @@ -33,13 +33,13 @@
> > > > > > >             LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
> > > > > > >                      EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> > > > > > >     Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> > > > > > > -           Address: 0000000000000000  Data: 0000
> > > > > > > +           Address: 00000000fee00000  Data: 0000
> > > > > > >     Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> > > > > > >     Capabilities: [150 v2] Advanced Error Reporting
> > > > > > >             UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > >             UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> > > > > > >             UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> > > > > > > -           CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> > > > > > > +           CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > >             CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> > > > > > >             AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> > > > > > >     Capabilities: [270 v1] #19
> > > > > > 
> > > > > > > After that, if I do the suspend-to-ram / resume trick, I again get
> > > > > > > the lspci output from before the 1st boot.
> > > > > 
> > > > > > The Link Status change after X is stopped seems the most interesting
> > > > > > to me.  The MSI change is probably explained by the MSI save/restore
> > > > > > of the device, but should be harmless since MSI is disabled.  I'm a
> > > > > > bit surprised the Correctable Error Status in the AER capability
> > > > > > didn't get cleared.  I would have thought that a bus reset would have
> > > > > > caused the link to retrain back to the original speed/width as well.
> > > > > > Let's check that we're actually getting a bus reset; try this in
> > > > > > addition to the previous qemu patch.  This just enables debug logging
> > > > > > for the bus reset function.  Thanks,
> > > > > 
> > > > 
> > > > > Below are the outputs from 2 boots with VGA, loading fglrx and
> > > > > starting X (the 2nd time, X got killed and an oops happened).
> > > > 
> > > > - 1st boot:
> > > > 
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio:   0000:01:00.0 group 1
> > > > vfio:   0000:01:00.1 group 1
> > > > vfio: 0000:01:00.1 hot reset: Success
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio:   0000:01:00.0 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > vfio:   0000:01:00.0 group 1
> > > > vfio:   0000:01:00.1 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > 
> > > > - 2nd boot:
> > > > 
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) multi
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio:   0000:01:00.0 group 1
> > > > vfio:   0000:01:00.1 group 1
> > > > vfio: 0000:01:00.1 hot reset: Success
> > > > vfio: vfio_pci_hot_reset(0000:01:00.1) one
> > > > vfio: 0000:01:00.1: hot reset dependent devices:
> > > > vfio:   0000:01:00.0 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.0
> > > > vfio: vfio_pci_hot_reset(0000:01:00.0) one
> > > > vfio: 0000:01:00.0: hot reset dependent devices:
> > > > vfio:   0000:01:00.0 group 1
> > > > vfio:   0000:01:00.1 group 1
> > > > vfio: vfio: found another in-use device 0000:01:00.1
> > > > 
> > > 
> > > > Did you already have a chance to look into it, or is there anything
> > > > else I can help with?
> > 
> > According to the log we're doing the bus reset on both the first and 2nd
> > boot (it's expected that only the "multi" call gets to success).  I'm
> > surprised then that the link doesn't retrain back to the original width.
> > You could try forcing the link to retrain.  Look at the root port
> > upstream from the GPU, lspci -t is handy for this.  Run lspci on the
> > root port to get the PCI express capability offset, then use setpci to
> > set the link retrain bit.  For example:
> > 
> > # lspci -tv | grep NVIDIA
> >            +-07.0-[03]--+-00.0  NVIDIA Corporation GK106GL [Quadro K4000]
> >            |            \-00.1  NVIDIA Corporation GK106 HDMI Audio 
> > Controller
> > 
> > (upstream root port is 00:07.0)
> > 
> > # lspci -v -s 7.0 | grep Capabilities
> > >     Capabilities: [40] Subsystem: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7
> > >     Capabilities: [60] MSI: Enable+ Count=1/2 Maskable+ 64bit-
> > >     Capabilities: [90] Express Root Port (Slot+), MSI 00
> > >     Capabilities: [e0] Power Management version 3
> > >     Capabilities: [100] Advanced Error Reporting
> > >     Capabilities: [150] Access Control Services
> > >     Capabilities: [160] Vendor Specific Information: ID=0002 Rev=0 Len=00c <?>
> > 
> > (PCI express capability is offset 0x90, Link Control is 0x10 off that)
> > 
> > # setpci -s 7.0 a0.w
> > 0040
> > 
> > (retrain is bit 5, 0x20, OR'd with read value is 0x60)
> > 
> > # setpci -s 7.0 a0.w=60
> > 
> > # lspci... did it work?
> > 
> > Try doing that after the first boot to see if you can get back to a x16
> > link.  If that works, we may need to add something in the kernel to do
> > it automatically around a bus reset.  Thanks,
> > 
> 
> > Well, this doesn't help either, and it looks like the VFIO reset already
> > sets the link back to its original width. For example:
> 
>            +-02.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Hawaii 
> XT [Radeon HD 8970]
>            |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device 
> aac8
> 
> Before 1st run:
> 
> > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> >               LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
> > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> >               LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > After power down of VM:
> > 
> > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> >               LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> >               LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > 
> > After 2nd start once VFIO did reset:
> > 
> > address@hidden:~# lspci -vvv -s 00:02.0 | grep LnkSta:
> >               LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
> > address@hidden:~# lspci -vvv -s 01:00.0 | grep LnkSta:
> >               LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 
> > The only difference I see on the bus here is ABWMgmt- vs ABWMgmt+, but
> > it shouldn't be relevant, as it is the same if I unload the fglrx module
> > before shutting down the VM, which is the only case where I can run
> > multiple VM reboot cycles.
> 
> So the only difference on bus is the following:
> 
> -60: 10 08 00 00 02 cd 31 00 40 00 02 b1 80 25 14 00
> +60: 10 08 00 00 02 cd 31 00 40 00 11 b0 80 25 14 00
> 
> 6a (before 02, after 11)
> 6b (before b1, after b0)
> 
> > But I cannot write these values using setpci. My PCI Express capability
> > is at offset 0x58, so Link Control is at 0x58 + 0x10 = 0x68, which is
> > already set back to 40:
> 
> address@hidden:~# lspci -vvv -s 00:02.0 | grep Capa
>       Capabilities: [50] Power Management version 3
>       Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
>       Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit-
>       Capabilities: [b0] Subsystem: Gigabyte Technology Co., Ltd Device 5000
>       Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+
> >       Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
>       Capabilities: [190 v1] Access Control Services
> 
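For reference, the two bytes that differ at 6a/6b in the dump above sit at
capability offset 0x58 + 0x12, which is where the Link Status register lives.
A rough sketch of decoding them by hand follows. decode_lnksta is a
hypothetical helper (not part of pciutils), and the speed-code table assumes
the usual 1 = 2.5GT/s, 2 = 5GT/s, 3 = 8GT/s encoding:

```shell
#!/bin/sh
# decode_lnksta: hypothetical helper, not part of pciutils.
# Takes the 16-bit Link Status word (config space is little-endian, so
# bytes "02 b1" at 0x6a/0x6b form the word 0xb102).
decode_lnksta() {
    word=$(( $1 ))
    speed=$(( word & 0xf ))           # bits 3:0, current link speed code
    width=$(( (word >> 4) & 0x3f ))   # bits 9:4, negotiated link width
    case $speed in                    # assumed speed-code mapping
        1) gts="2.5GT/s" ;;
        2) gts="5GT/s" ;;
        3) gts="8GT/s" ;;
        *) gts="unknown" ;;
    esac
    printf 'Speed %s, Width x%d\n' "$gts" "$width"
}

decode_lnksta 0xb102   # bytes 02 b1 from the "-60:" line
decode_lnksta 0xb011   # bytes 11 b0 from the "+60:" line
```

Read this way, the byte difference is just the speed/width change that lspci
already reports in LnkSta. Link Status is a status register, which would be
consistent with setpci writes to it not sticking.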

Wouldn't a D0 -> D3 -> D0 transition be a possible solution for devices
which don't support FLR? The setpci approach doesn't help me at all.
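
To make the D0 -> D3 -> D0 idea concrete, here is a minimal sketch of the
register arithmetic, assuming the standard PCI Power Management layout: the
PMCSR register is at PM capability offset + 4 and its PowerState field is
bits 1:0 (0 = D0, 3 = D3hot). pmcsr_with_state is a hypothetical helper and
the starting value 0x0100 is an assumed example, not something read from the
device in this thread:

```shell
#!/bin/sh
# pmcsr_with_state: hypothetical helper, not part of pciutils.
# Prints the PMCSR value with only the PowerState field (bits 1:0)
# replaced by the requested state; all other bits are kept.
pmcsr_with_state() {
    printf '%04x\n' $(( ($1 & ~0x3) | ($2 & 0x3) ))
}

cur=0x0100                  # assumed example PMCSR value (device in D0)
pmcsr_with_state "$cur" 3   # value to write to enter D3hot
pmcsr_with_state "$cur" 0   # value to write to return to D0
```

The two printed values would be written to PMCSR one after the other, with
some delay in between. Whether this resets anything depends on the device:
if the No_Soft_Reset bit (PMCSR bit 3) is set, the function is expected to
keep its state across D3hot -> D0, so the transition would not help.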

> > Alex
> > 
> > > > > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> > > > > index 8db182f..7fec259 100644
> > > > > --- a/hw/misc/vfio.c
> > > > > +++ b/hw/misc/vfio.c
> > > > > @@ -2927,6 +2927,10 @@ static bool 
> > > > > vfio_pci_host_match(PCIHostDeviceAddress *hos
> > > > >              host1->slot == host2->slot && host1->function == 
> > > > > host2->function);
> > > > >  }
> > > > >  
> > > > > +#undef DPRINTF
> > > > > +#define DPRINTF(fmt, ...) \
> > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > +
> > > > >  static int vfio_pci_hot_reset(VFIODevice *vdev, bool single)
> > > > >  {
> > > > >      VFIOGroup *group;
> > > > > @@ -3104,6 +3108,15 @@ out_single:
> > > > >      return ret;
> > > > >  }
> > > > >  
> > > > > +#undef DPRINTF
> > > > > +#ifdef DEBUG_VFIO
> > > > > +#define DPRINTF(fmt, ...) \
> > > > > +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> > > > > +#else
> > > > > +#define DPRINTF(fmt, ...) \
> > > > > +    do { } while (0)
> > > > > +#endif
> > > > > +
> > > > >  /*
> > > > > >   * We want to differentiate hot reset of mulitple in-use devices vs hot reset
> > > > > >   * of a single in-use device.  VFIO_DEVICE_RESET will already handle the case
> > > > > 
> > > > > 
> > > > 
> > > > --Maik
> > > > 
> > > 
> > > --Maik
> > 
> > 
> > 
> 
> --Maik
> 

--Maik