Re: [Qemu-ppc] Proposal PCI/PCIe device placement on PAPR guests
From: David Gibson
Subject: Re: [Qemu-ppc] Proposal PCI/PCIe device placement on PAPR guests
Date: Mon, 9 Jan 2017 10:43:49 +1100
User-agent: Mutt/1.7.1 (2016-10-04)
On Fri, Jan 06, 2017 at 12:57:58PM +0100, Greg Kurz wrote:
> On Thu, 5 Jan 2017 16:46:18 +1100
> David Gibson <address@hidden> wrote:
>
> > There was a discussion back in November on the qemu list which spilled
> > onto the libvirt list about how to add support for PCIe devices to
> > POWER VMs, specifically 'pseries' machine type PAPR guests.
> >
> > Here's a more concrete proposal for how to handle part of this in
> > future from the libvirt side. Strictly speaking what I'm suggesting
> > here isn't intrinsically linked to PCIe: it will make adding PCIe
> > support sanely easier, as well as having a number of advantages for
> > both PCIe and plain-PCI devices on PAPR guests.
> >
> > Background:
> >
> > * Currently the pseries machine type only supports vanilla PCI
> > buses.
> > * This is a qemu limitation, not something inherent - PAPR guests
> > running under PowerVM (the IBM hypervisor) can use passthrough
> > PCIe devices (PowerVM doesn't emulate devices though).
> > * In fact the way PCI access is para-virtualized in PAPR makes the
> > usual distinctions between PCI and PCIe largely disappear
> > * Presentation of PCIe devices to PAPR guests is unusual
> > * Unlike x86 and other "bare metal" platforms, root ports are
> > not made visible to the guest, i.e. all devices (typically)
> > appear as though they were integrated devices on x86
> > * In terms of topology all devices will appear in a way similar to
> > a vanilla PCI bus, even PCIe devices
> > * However PCIe extended config space is accessible
> > * This means libvirt's usual placement of PCIe devices is not
> > suitable for PAPR guests
> > * PAPR has its own hotplug mechanism
> > * This is used instead of standard PCIe hotplug
> > * This mechanism works for both PCIe and vanilla-PCI devices
> > * This can hotplug/unplug devices even without a root port P2P
> > bridge between it and the root "bus"
> > * Multiple independent host bridges are routine on PAPR
> > * Unlike PC (where all host bridges have multiplexed access to
> > configuration space) PCI host bridges (PHBs) are truly
> > independent for PAPR guests (disjoint MMIO regions in system
> > address space)
> > * PowerVM typically presents a separate PHB to the guest for each
> > host slot passed through
> >
> > The Proposal:
> >
> > I suggest that libvirt implement a new default algorithm for placing
> > (i.e. assigning addresses to) both PCI and PCIe devices for (only)
> > PAPR guests.
> >
> > The short summary is that by default it should assign each device to a
> > separate vPHB, creating vPHBs as necessary.
> >
> > * For passthrough sometimes a group of host devices can't be safely
> > isolated from each other - this is known as a (host) Partitionable
> > Endpoint (PE). In this case, if any device in the PE is passed
> > through to a guest, the whole PE must be passed through to the
> > same vPHB in the guest. From the guest POV, each vPHB has exactly
> > one (guest) PE.
> > * To allow for hotplugged devices, libvirt should also add a number
> > of additional, empty vPHBs (the PAPR spec allows for hotplug of
> > PHBs, but this is not yet implemented in qemu). When hotplugging
> > a new device (or PE) libvirt should locate a vPHB which doesn't
> > currently contain anything.
> > * libvirt should only (automatically) add PHBs - never root ports or
> > other PCI to PCI bridges
> >
> > In order to handle migration, the vPHBs will need to be represented in
> > the domain XML, which will also allow the user to override this
> > topology if they want.
> >
> > Advantages:
> >
> > There are still some details I need to figure out w.r.t. handling PCIe
> > devices (on both the qemu and libvirt sides). However the fact that
>
> One such detail may be that PCIe devices should have the
> "ibm,pci-config-space-type" property set to 1 in the DT,
> for the driver to be able to access the extended config
> space.
Right.
> > PAPR guests don't typically see PCIe root ports means that the normal
> > libvirt PCIe allocation scheme won't work. This scheme has several
> > advantages with or without support for PCIe devices:
> >
> > * Better performance for 32-bit devices
> >
> > With multiple devices on a single vPHB they all must share a (fairly
> > small) 32-bit DMA/IOMMU window. With separate PHBs they each have a
> > separate window. PAPR guests have an always-on guest visible IOMMU.
> >
> > * Better EEH handling for passthrough devices
> >
> > EEH is an IBM hardware-assisted mechanism for isolating and safely
> > resetting devices experiencing hardware faults so they don't bring
> > down other devices or the system at large. It's roughly similar to
> > PCIe AER in concept, but has a different IBM specific interface, and
> > works on both PCI and PCIe devices.
> >
> > Currently the kernel interfaces for handling EEH events on passthrough
> > devices will only work if there is a single (host) iommu group in the
> > vfio container. While lifting that restriction would be nice, it's
> > quite difficult to do so (it requires keeping state synchronized
> > between multiple host groups). That also means that an EEH error on
> > one device could stop another device where that isn't required by the
> > actual hardware.
> >
> > The unit of EEH isolation is a PE (Partitionable Endpoint) and
> > currently there is only one guest PE per vPHB. Changing this might
> > also be possible, but is again quite complex and may result in
> > confusing and/or broken distinctions between groups for EEH isolation
> > and IOMMU isolation purposes.
> >
> > Placing separate host groups in separate vPHBs sidesteps these
> > problems.
> >
> > * Guest NUMA node assignment of devices
> >
> > PAPR does not (and can't reasonably) use the pxb device. Instead, to
> > allocate devices to different guest NUMA nodes they should be placed
> > on different vPHBs. Placing them on different PHBs by default allows
> > NUMA nodes to be assigned to those PHBs in a straightforward manner.
> >
>
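To make the placement rule in the proposal concrete, here is a rough
sketch in Python. It is illustrative only and is not libvirt or qemu
code: the names VPHB, place_devices, hotplug and the "spare" parameter
are made up for this example, and the host PE of each device is assumed
to be known already (None meaning an emulated device with no host PE).
It gives each PE its own vPHB, appends some empty vPHBs, and puts a
hotplugged device (or PE) on the first vPHB that is still empty.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VPHB:
    """Hypothetical model of one virtual PCI host bridge."""
    index: int
    devices: List[str] = field(default_factory=list)  # empty means free


def place_devices(device_to_pe: Dict[str, Optional[str]],
                  spare: int = 2) -> List[VPHB]:
    """Assign one vPHB per host PE, then add `spare` empty vPHBs.

    Devices that share a host PE land on the same vPHB; devices with
    no PE (None) each get a vPHB of their own.
    """
    groups: Dict[object, List[str]] = {}
    for dev, pe in device_to_pe.items():
        # A device without a PE forms a group of one.
        key = pe if pe is not None else ("solo", dev)
        groups.setdefault(key, []).append(dev)

    phbs = [VPHB(i, devs) for i, devs in enumerate(groups.values())]
    # Empty vPHBs are reserved up front, since PHB hotplug is not
    # yet implemented in qemu.
    phbs += [VPHB(len(phbs) + n) for n in range(spare)]
    return phbs


def hotplug(phbs: List[VPHB], devices: List[str]) -> VPHB:
    """Place a hotplugged device (or whole PE) on the first empty vPHB."""
    for phb in phbs:
        if not phb.devices:
            phb.devices = list(devices)
            return phb
    raise RuntimeError("no empty vPHB available for hotplug")
```

For example, place_devices({"net0": None, "vfio0": "pe-a", "vfio1":
"pe-a"}, spare=1) would put net0 on vPHB 0, both vfio devices on vPHB 1,
and leave vPHB 2 empty for a later hotplug. No root ports or PCI-to-PCI
bridges are ever created, in line with the proposal.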
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson