From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH qemu v7] spapr: Support NVIDIA V100 GPU with NVLink2
Date: Thu, 21 Mar 2019 14:38:06 +1100
User-agent: Mutt/1.11.3 (2019-02-01)

On Tue, Mar 12, 2019 at 07:21:03PM +1100, Alexey Kardashevskiy wrote:
> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
> implements special regions for such GPUs and emulates an NVLink bridge.
> NVLink2-enabled POWER9 CPUs also provide address translation services,
> which include an ATS shootdown (ATSD) register exported via the NVLink
> bridge device.
> 
> This adds a quirk to VFIO to map the GPU memory and create an MR;
> the new MR is stored in a PCI device as a QOM link. The sPAPR PCI code uses
> this to get the MR and map it into the system address space.
> Another quirk does the same for ATSD.
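
For anyone skimming the archive, the handover between the VFIO quirk and
the sPAPR side boils down to roughly this (my condensed paraphrase of the
patch below, not a literal excerpt; the identifiers are the ones the patch
introduces):

    /* hw/vfio/pci-quirks.c: mmap the NVLink2 RAM region exposed by the host
     * driver and publish it as a RAM MemoryRegion owned by the vfio-pci
     * device, which makes it reachable as the "nvlink2-mr[0]" QOM link. */
    p = mmap(NULL, nv2reg->size, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_SHARED, vdev->vbasedev.fd, nv2reg->offset);
    memory_region_init_ram_ptr(&quirk->mem[0], OBJECT(vdev), "nvlink2-mr",
                               nv2reg->size, p);

    /* hw/ppc/spapr_pci_nvlink2.c: look the MR up through the QOM link and
     * map it at the guest physical address chosen by phb_placement(). */
    Object *mr_obj = object_property_get_link(OBJECT(nvslot->gpdev),
                                              "nvlink2-mr[0]", NULL);
    if (mr_obj) {
        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
                                    MEMORY_REGION(mr_obj));
    }
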
> 
> This adds additional steps to sPAPR PHB setup:
> 
> 1. Search for specific GPUs and NPUs, collect findings in
> sPAPRPHBState::nvgpus, manage system address space mappings;
> 
> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
> "memory-region", "ibm,nvlink-speed" to advertise the NVLink2 function
> to the guest;
> 
> 3. Add "ibm,mmio-atsd" to the vPHB to advertise the ATSD capability;
> 
> 4. Add new memory blocks (with an extra "linux,usable-memory" property
> to prevent the guest OS from accessing the new memory until it is
> onlined) and an npuphb# node per vPHB representing the NPU unit, which
> the GPU driver uses for link discovery.
> 
> This allocates space for GPU RAM and ATSD like we do for MMIOs by
> adding 2 new parameters to the phb_placement() hook. Older machine types
> set these to zero.
> 
> This puts each new memory node in a separate NUMA node, as the GPU RAM
> needs to be configured equally distant from any other node in the system.
> Unlike the host setup, which assigns NUMA ids from 255 downwards, this
> adds new NUMA nodes after the user-configured nodes, or from 1 if none
> were configured.
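
To make the numbering concrete (my reading of the patch, assuming a
hypothetical guest with two user-configured NUMA nodes and two
passed-through GPUs):

    /* At machine reset (hw/ppc/spapr.c): start after the user-configured
     * nodes, or at 1 if none were configured, so GPU RAM never shares
     * node 0 with normal RAM. */
    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);   /* two user nodes -> 2 */

    /* For each GPU collected on a vPHB (spapr_pci_collect_nvgpu): */
    nvslot->numa_id = spapr->gpu_numa_id;         /* GPUs get nodes 2, 3 */
    ++spapr->gpu_numa_id;
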
> 
> This adds a requirement similar to EEH - one IOMMU group per vPHB.
> The reason for this is that ATSD registers belong to a physical NPU
> so they cannot invalidate translations on GPUs attached to another NPU.
> This is guaranteed by the host platform, which does not mix NVLink bridges
> or GPUs from different NPUs in the same IOMMU group. If more than one
> IOMMU group is detected on a vPHB, this disables ATSD support for that
> vPHB and prints a warning.
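
Condensed, the ATSD gating in spapr_phb_nvgpu_populate_dt() below reuses
the EEH "one IOMMU group per vPHB" check, so the property is simply not
emitted when that does not hold (sketch, not a literal excerpt):

    if (!spapr_phb_eeh_available(sphb)) {
        error_setg(errp, "ATSD requires separate vPHB per GPU IOMMU group");
        return;    /* the caller reports this as a warning and carries on */
    }
    _FDT(fdt_setprop(fdt, bus_off, "ibm,mmio-atsd", atsd,
                     atsdnum * sizeof(atsd[0])));
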
> 
> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> [aw: for vfio portions]
> Acked-by: Alex Williamson <address@hidden>

I've applied this to ppc-for-4.1 now.  A couple of minor points are noted
below which could be fixed in a followup.

> ---
> 
> This is based on David's ppc-for-4.0 and acked "vfio_info_cap public" from
> https://patchwork.ozlabs.org/patch/1052645/
> 
> Changes:
> v7:
> * fixed compile on 32bit host (f29-i386)
> 
> v6:
> * changed error handling in spapr-nvlink code
> * fixed mmap error checking from NULL to MAP_FAILED
> * changed NUMA node ids and changed the commit log about it
> 
> v5:
> * converted MRs to VFIOQuirk - this fixed leaks
> 
> v4:
> * fixed ATSD placement
> * fixed spapr_phb_unrealize() to do nvgpu cleanup
> * replaced warn_report() with Error*
> 
> v3:
> * moved GPU RAM above PCI MMIO limit
> * renamed QOM property to nvlink2-tgt
> * moved nvlink2 code to its own file
> 
> ---
> 
> An example command line for a redbud system:
> 
> pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
> -nodefaults \
> -chardev stdio,id=STDIO0,signal=off,mux=on \
> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> -mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
> -enable-kvm -m 384G \
> -chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
> -mon chardev=SOCKET0,mode=control \
> -smp 80,sockets=1,threads=4 \
> -netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
> -device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
> img/vdisk0.img \
> -device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
> -device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
> -device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
> -device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
> -device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
> -device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
> -device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
> -device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
> -device spapr-pci-host-bridge,id=phb1,index=1 \
> -device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
> -device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
> -device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
> -device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
> -device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
> -device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
> -device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
> -device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
> -machine pseries \
> -L /home/aik/t/qemu-ppc64-bios/ -d guest_errors
> 
> Note that QEMU attaches PCI devices to the last added vPHB, so the first
> 8 devices - 4:04:00.0 through 6:00:01.2 - go to the default vPHB, and
> 35:03:00.0..7:00:01.2 go to the vPHB with id=phb1.
> ---
>  hw/ppc/Makefile.objs        |   2 +-
>  hw/vfio/pci.h               |   2 +
>  include/hw/pci-host/spapr.h |  45 ++++
>  include/hw/ppc/spapr.h      |   5 +-
>  hw/ppc/spapr.c              |  48 +++-
>  hw/ppc/spapr_pci.c          |  19 ++
>  hw/ppc/spapr_pci_nvlink2.c  | 450 ++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci-quirks.c        | 131 +++++++++++
>  hw/vfio/pci.c               |  14 ++
>  hw/vfio/trace-events        |   4 +
>  10 files changed, 711 insertions(+), 9 deletions(-)
>  create mode 100644 hw/ppc/spapr_pci_nvlink2.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 1111b218a048..636e717f207c 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>  # IBM PowerNV
>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> -obj-y += spapr_pci_vfio.o
> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
>  endif
>  obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index b1ae4c07549a..706c30443617 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -194,6 +194,8 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp);
>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>                                 struct vfio_region_info *info,
>                                 Error **errp);
> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp);
> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp);
>  
>  void vfio_display_reset(VFIOPCIDevice *vdev);
>  int vfio_display_probe(VFIOPCIDevice *vdev, Error **errp);
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index b4aad26798c0..53519c835e9f 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -87,6 +87,9 @@ struct SpaprPhbState {
>      uint32_t mig_liobn;
>      hwaddr mig_mem_win_addr, mig_mem_win_size;
>      hwaddr mig_io_win_addr, mig_io_win_size;
> +    hwaddr nv2_gpa_win_addr;
> +    hwaddr nv2_atsd_win_addr;
> +    struct spapr_phb_pci_nvgpu_config *nvgpus;
>  };
>  
>  #define SPAPR_PCI_MEM_WIN_BUS_OFFSET 0x80000000ULL
> @@ -105,6 +108,22 @@ struct SpaprPhbState {
>  
>  #define SPAPR_PCI_MSI_WINDOW         0x40000000000ULL
>  
> +#define SPAPR_PCI_NV2RAM64_WIN_BASE  SPAPR_PCI_LIMIT
> +#define SPAPR_PCI_NV2RAM64_WIN_SIZE  (2 * TiB) /* For up to 6 GPUs 256GB each */
> +
> +/* Max number of these GPUs per physical box */
> +#define NVGPU_MAX_NUM                6
> +/* Max number of NVLinks per GPU in any physical box */
> +#define NVGPU_MAX_LINKS              3
> +
> +/*
> + * GPU RAM starts at 64TiB so huge DMA window to cover it all ends at 128TiB
> + * which is enough. We do not need DMA for ATSD so we put them at 128TiB.
> + */
> +#define SPAPR_PCI_NV2ATSD_WIN_BASE   (128 * TiB)
> +#define SPAPR_PCI_NV2ATSD_WIN_SIZE   (NVGPU_MAX_NUM * NVGPU_MAX_LINKS * \
> +                                      64 * KiB)
> +
>  static inline qemu_irq spapr_phb_lsi_qirq(struct SpaprPhbState *phb, int pin)
>  {
>      SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> @@ -135,6 +154,13 @@ int spapr_phb_vfio_eeh_get_state(SpaprPhbState *sphb, int *state);
>  int spapr_phb_vfio_eeh_reset(SpaprPhbState *sphb, int option);
>  int spapr_phb_vfio_eeh_configure(SpaprPhbState *sphb);
>  void spapr_phb_vfio_reset(DeviceState *qdev);
> +void spapr_phb_nvgpu_setup(SpaprPhbState *sphb, Error **errp);
> +void spapr_phb_nvgpu_free(SpaprPhbState *sphb);
> +void spapr_phb_nvgpu_populate_dt(SpaprPhbState *sphb, void *fdt, int bus_off,
> +                                 Error **errp);
> +void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt);
> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> +                                        SpaprPhbState *sphb);
>  #else
>  static inline bool spapr_phb_eeh_available(SpaprPhbState *sphb)
>  {
> @@ -161,6 +187,25 @@ static inline int spapr_phb_vfio_eeh_configure(SpaprPhbState *sphb)
>  static inline void spapr_phb_vfio_reset(DeviceState *qdev)
>  {
>  }
> +static inline void spapr_phb_nvgpu_setup(SpaprPhbState *sphb, Error **errp)
> +{
> +}
> +static inline void spapr_phb_nvgpu_free(SpaprPhbState *sphb)
> +{
> +}
> +static inline void spapr_phb_nvgpu_populate_dt(SpaprPhbState *sphb, void *fdt,
> +                                               int bus_off, Error **errp)
> +{
> +}
> +static inline void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb,
> +                                                   void *fdt)
> +{
> +}
> +static inline void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt,
> +                                                      int offset,
> +                                                      SpaprPhbState *sphb)
> +{
> +}
>  #endif
>  
>  void spapr_phb_dma_reset(SpaprPhbState *sphb);
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 2b4c05a2ec33..74ce638bc10a 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -122,7 +122,8 @@ struct SpaprMachineClass {
>      void (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio, 
>                            hwaddr *mmio32, hwaddr *mmio64,
> -                          unsigned n_dma, uint32_t *liobns, Error **errp);
> +                          unsigned n_dma, uint32_t *liobns, hwaddr *nv2gpa,
> +                          hwaddr *nv2atsd, Error **errp);
>      SpaprResizeHpt resize_hpt_default;
>      SpaprCapabilities default_caps;
>      SpaprIrq *irq;
> @@ -198,6 +199,8 @@ struct SpaprMachineState {
>  
>      bool cmd_line_caps[SPAPR_CAP_NUM];
>      SpaprCapabilities def, eff, mig;
> +
> +    unsigned gpu_numa_id;
>  };
>  
>  #define H_SUCCESS         0
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 20cade50d5d0..7cb7e0c126e7 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1034,12 +1034,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>          0, cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE),
>          cpu_to_be32(max_cpus / smp_threads),
>      };
> +    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
>      uint32_t maxdomains[] = {
>          cpu_to_be32(4),
> -        cpu_to_be32(0),
> -        cpu_to_be32(0),
> -        cpu_to_be32(0),
> -        cpu_to_be32(nb_numa_nodes ? nb_numa_nodes : 1),
> +        maxdomain,
> +        maxdomain,
> +        maxdomain,
> +        cpu_to_be32(spapr->gpu_numa_id),
>      };
>  
>      _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
> @@ -1713,6 +1714,16 @@ static void spapr_machine_reset(void)
>          spapr_irq_msi_reset(spapr);
>      }
>  
> +    /*
> +     * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> +     * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> +     * called from vPHB reset handler so we initialize the counter here.
> +     * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> +     * must be equally distant from any other node.
> +     * The final value of spapr->gpu_numa_id is going to be written to
> +     * max-associativity-domains in spapr_build_fdt().
> +     */
> +    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
>      qemu_devices_reset();
>  
>      /*
> @@ -3935,7 +3946,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
>      smc->phb_placement(spapr, sphb->index,
>                         &sphb->buid, &sphb->io_win_addr,
>                         &sphb->mem_win_addr, &sphb->mem64_win_addr,
> -                       windows_supported, sphb->dma_liobn, errp);
> +                       windows_supported, sphb->dma_liobn,
> +                       &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
> +                       errp);
>  }
>  
>  static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> @@ -4136,7 +4149,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
>  static void spapr_phb_placement(SpaprMachineState *spapr, uint32_t index,
>                                  uint64_t *buid, hwaddr *pio,
>                                  hwaddr *mmio32, hwaddr *mmio64,
> -                                unsigned n_dma, uint32_t *liobns, Error **errp)
> +                                unsigned n_dma, uint32_t *liobns,
> +                                hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>  {
>      /*
>       * New-style PHB window placement.
> @@ -4181,6 +4195,9 @@ static void spapr_phb_placement(SpaprMachineState *spapr, uint32_t index,
>      *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
>      *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
>      *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
> +
> +    *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
> +    *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
>  }
>  
>  static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
> @@ -4385,6 +4402,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
>  /*
>   * pseries-3.1
>   */
> +static void phb_placement_3_1(SpaprMachineState *spapr, uint32_t index,
> +                              uint64_t *buid, hwaddr *pio,
> +                              hwaddr *mmio32, hwaddr *mmio64,
> +                              unsigned n_dma, uint32_t *liobns,
> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> +{
> +    spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
> +                        nv2gpa, nv2atsd, errp);
> +    *nv2gpa = 0;
> +    *nv2atsd = 0;
> +}
> +
>  static void spapr_machine_3_1_class_options(MachineClass *mc)
>  {
>      SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> @@ -4404,6 +4433,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>      smc->default_caps.caps[SPAPR_CAP_SBBC] = SPAPR_CAP_BROKEN;
>      smc->default_caps.caps[SPAPR_CAP_IBS] = SPAPR_CAP_BROKEN;
>      smc->default_caps.caps[SPAPR_CAP_LARGE_DECREMENTER] = SPAPR_CAP_OFF;
> +    smc->phb_placement = phb_placement_3_1;
>  }
>  
>  DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
> @@ -4535,7 +4565,8 @@ DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
>  static void phb_placement_2_7(SpaprMachineState *spapr, uint32_t index,
>                                uint64_t *buid, hwaddr *pio,
>                                hwaddr *mmio32, hwaddr *mmio64,
> -                              unsigned n_dma, uint32_t *liobns, Error **errp)
> +                              unsigned n_dma, uint32_t *liobns,
> +                              hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
>  {
>      /* Legacy PHB placement for pseries-2.7 and earlier machine types */
>      const uint64_t base_buid = 0x800000020000000ULL;
> @@ -4579,6 +4610,9 @@ static void phb_placement_2_7(SpaprMachineState *spapr, uint32_t index,
>       * fallback behaviour of automatically splitting a large "32-bit"
>       * window into contiguous 32-bit and 64-bit windows
>       */
> +
> +    *nv2gpa = 0;
> +    *nv2atsd = 0;
>  }
>  
>  static void spapr_machine_2_7_class_options(MachineClass *mc)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 20915d2b3c19..67a277fae481 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1355,6 +1355,8 @@ static void spapr_populate_pci_child_dt(PCIDevice *dev, void *fdt, int offset,
>      if (sphb->pcie_ecs && pci_is_express(dev)) {
>          _FDT(fdt_setprop_cell(fdt, offset, "ibm,pci-config-space-type", 0x1));
>      }
> +
> +    spapr_phb_nvgpu_populate_pcidev_dt(dev, fdt, offset, sphb);
>  }
>  
>  /* create OF node for pci device and required OF DT properties */
> @@ -1589,6 +1591,8 @@ static void spapr_phb_unrealize(DeviceState *dev, Error **errp)
>      int i;
>      const unsigned windows_supported = spapr_phb_windows_supported(sphb);
>  
> +    spapr_phb_nvgpu_free(sphb);
> +
>      if (sphb->msi) {
>          g_hash_table_unref(sphb->msi);
>          sphb->msi = NULL;
> @@ -1877,8 +1881,14 @@ void spapr_phb_dma_reset(SpaprPhbState *sphb)
>  static void spapr_phb_reset(DeviceState *qdev)
>  {
>      SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(qdev);
> +    Error *errp = NULL;
>  
>      spapr_phb_dma_reset(sphb);
> +    spapr_phb_nvgpu_free(sphb);
> +    spapr_phb_nvgpu_setup(sphb, &errp);
> +    if (errp) {
> +        error_report_err(errp);
> +    }
>  
>      /* Reset the IOMMU state */
>      object_child_foreach(OBJECT(qdev), spapr_phb_children_reset, NULL);
> @@ -1911,6 +1921,8 @@ static Property spapr_phb_properties[] = {
>                       pre_2_8_migration, false),
>      DEFINE_PROP_BOOL("pcie-extended-configuration-space", SpaprPhbState,
>                       pcie_ecs, true),
> +    DEFINE_PROP_UINT64("gpa", SpaprPhbState, nv2_gpa_win_addr, 0),
> +    DEFINE_PROP_UINT64("atsd", SpaprPhbState, nv2_atsd_win_addr, 0),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -2191,6 +2203,7 @@ int spapr_populate_pci_dt(SpaprPhbState *phb, uint32_t intc_phandle, void *fdt,
>      PCIBus *bus = PCI_HOST_BRIDGE(phb)->bus;
>      SpaprFdt s_fdt;
>      SpaprDrc *drc;
> +    Error *errp = NULL;
>  
>      /* Start populating the FDT */
>      nodename = g_strdup_printf("pci@%" PRIx64, phb->buid);
> @@ -2283,6 +2296,12 @@ int spapr_populate_pci_dt(SpaprPhbState *phb, uint32_t intc_phandle, void *fdt,
>          return ret;
>      }
>  
> +    spapr_phb_nvgpu_populate_dt(phb, fdt, bus_off, &errp);
> +    if (errp) {
> +        error_report_err(errp);
> +    }
> +    spapr_phb_nvgpu_ram_populate_dt(phb, fdt);
> +
>      return 0;
>  }
>  
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> new file mode 100644
> index 000000000000..3aa66aff6dbd
> --- /dev/null
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -0,0 +1,450 @@
> +/*
> + * QEMU sPAPR PCI for NVLink2 pass through
> + *
> + * Copyright (c) 2019 Alexey Kardashevskiy, IBM Corporation.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu-common.h"
> +#include "hw/pci/pci.h"
> +#include "hw/pci-host/spapr.h"
> +#include "qemu/error-report.h"
> +#include "hw/ppc/fdt.h"
> +#include "hw/pci/pci_bridge.h"
> +
> +#define PHANDLE_PCIDEV(phb, pdev)    (0x12000000 | \
> +                                     (((phb)->index) << 16) | ((pdev)->devfn))
> +#define PHANDLE_GPURAM(phb, n)       (0x110000FF | ((n) << 8) | \
> +                                     (((phb)->index) << 16))
> +#define PHANDLE_NVLINK(phb, gn, nn)  (0x00130000 | (((phb)->index) << 8) | \
> +                                     ((gn) << 4) | (nn))
> +
> +#define SPAPR_GPU_NUMA_ID           (cpu_to_be32(1))
> +
> +struct spapr_phb_pci_nvgpu_config {
> +    uint64_t nv2_ram_current;
> +    uint64_t nv2_atsd_current;
> +    int num; /* number of non empty (i.e. tgt!=0) entries in slots[] */
> +    struct spapr_phb_pci_nvgpu_slot {
> +        uint64_t tgt;
> +        uint64_t gpa;
> +        unsigned numa_id;
> +        PCIDevice *gpdev;
> +        int linknum;
> +        struct {
> +            uint64_t atsd_gpa;
> +            PCIDevice *npdev;
> +            uint32_t link_speed;
> +        } links[NVGPU_MAX_LINKS];
> +    } slots[NVGPU_MAX_NUM];
> +    Error *errp;
> +};
> +
> +static struct spapr_phb_pci_nvgpu_slot *
> +spapr_nvgpu_get_slot(struct spapr_phb_pci_nvgpu_config *nvgpus, uint64_t tgt)
> +{
> +    int i;
> +
> +    /* Search for partially collected "slot" */
> +    for (i = 0; i < nvgpus->num; ++i) {
> +        if (nvgpus->slots[i].tgt == tgt) {
> +            return &nvgpus->slots[i];
> +        }
> +    }
> +
> +    if (nvgpus->num == ARRAY_SIZE(nvgpus->slots)) {
> +        return NULL;
> +    }
> +
> +    i = nvgpus->num;
> +    nvgpus->slots[i].tgt = tgt;
> +    ++nvgpus->num;
> +
> +    return &nvgpus->slots[i];
> +}
> +
> +static void spapr_pci_collect_nvgpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> +                                    PCIDevice *pdev, uint64_t tgt,
> +                                    MemoryRegion *mr, Error **errp)
> +{
> +    MachineState *machine = MACHINE(qdev_get_machine());
> +    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
> +    struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
> +
> +    if (!nvslot) {
> +        error_setg(errp, "Found too many NVLink bridges per GPU");

I think this error message isn't strictly correct (it's too many GPUs
per vPHB here, rather than too many bridges per GPU).  That can be
fixed later, though.
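
Something like:

    error_setg(errp, "Found too many GPUs per vPHB");

would read better, I think (untested, just a wording suggestion).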

> +        return;
> +    }
> +    g_assert(!nvslot->gpdev);
> +    nvslot->gpdev = pdev;
> +
> +    nvslot->gpa = nvgpus->nv2_ram_current;
> +    nvgpus->nv2_ram_current += memory_region_size(mr);
> +    nvslot->numa_id = spapr->gpu_numa_id;
> +    ++spapr->gpu_numa_id;
> +}
> +
> +static void spapr_pci_collect_nvnpu(struct spapr_phb_pci_nvgpu_config *nvgpus,
> +                                    PCIDevice *pdev, uint64_t tgt,
> +                                    MemoryRegion *mr, Error **errp)
> +{
> +    struct spapr_phb_pci_nvgpu_slot *nvslot = spapr_nvgpu_get_slot(nvgpus, tgt);
> +    int j;
> +
> +    if (!nvslot) {
> +        error_setg(errp, "Found too many NVLink bridges per GPU");

Same here.

> +        return;
> +    }
> +
> +    j = nvslot->linknum;
> +    if (j == ARRAY_SIZE(nvslot->links)) {
> +        error_setg(errp, "Found too many NVLink2 bridges");

In fact the message above would seem more appropriate here.

> +        return;
> +    }
> +    ++nvslot->linknum;
> +
> +    g_assert(!nvslot->links[j].npdev);
> +    nvslot->links[j].npdev = pdev;
> +    nvslot->links[j].atsd_gpa = nvgpus->nv2_atsd_current;
> +    nvgpus->nv2_atsd_current += memory_region_size(mr);
> +    nvslot->links[j].link_speed =
> +        object_property_get_uint(OBJECT(pdev), "nvlink2-link-speed", NULL);
> +}
> +
> +static void spapr_phb_pci_collect_nvgpu(PCIBus *bus, PCIDevice *pdev,
> +                                        void *opaque)
> +{
> +    PCIBus *sec_bus;
> +    Object *po = OBJECT(pdev);
> +    uint64_t tgt = object_property_get_uint(po, "nvlink2-tgt", NULL);
> +
> +    if (tgt) {
> +        Error *local_err = NULL;
> +        struct spapr_phb_pci_nvgpu_config *nvgpus = opaque;
> +        Object *mr_gpu = object_property_get_link(po, "nvlink2-mr[0]", NULL);
> +        Object *mr_npu = object_property_get_link(po, "nvlink2-atsd-mr[0]",
> +                                                  NULL);
> +
> +        g_assert(mr_gpu || mr_npu);
> +        if (mr_gpu) {
> +            spapr_pci_collect_nvgpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_gpu),
> +                                    &local_err);
> +        } else {
> +            spapr_pci_collect_nvnpu(nvgpus, pdev, tgt, MEMORY_REGION(mr_npu),
> +                                    &local_err);
> +        }
> +        error_propagate(&nvgpus->errp, local_err);
> +    }
> +    if ((pci_default_read_config(pdev, PCI_HEADER_TYPE, 1) !=
> +         PCI_HEADER_TYPE_BRIDGE)) {
> +        return;
> +    }
> +
> +    sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(pdev));
> +    if (!sec_bus) {
> +        return;
> +    }
> +
> +    pci_for_each_device(sec_bus, pci_bus_num(sec_bus),
> +                        spapr_phb_pci_collect_nvgpu, opaque);
> +}
> +
> +void spapr_phb_nvgpu_setup(SpaprPhbState *sphb, Error **errp)
> +{
> +    int i, j, valid_gpu_num;
> +    PCIBus *bus;
> +
> +    /* Search for GPUs and NPUs */
> +    if (!sphb->nv2_gpa_win_addr || !sphb->nv2_atsd_win_addr) {
> +        return;
> +    }
> +
> +    sphb->nvgpus = g_new0(struct spapr_phb_pci_nvgpu_config, 1);
> +    sphb->nvgpus->nv2_ram_current = sphb->nv2_gpa_win_addr;
> +    sphb->nvgpus->nv2_atsd_current = sphb->nv2_atsd_win_addr;
> +
> +    bus = PCI_HOST_BRIDGE(sphb)->bus;
> +    pci_for_each_device(bus, pci_bus_num(bus),
> +                        spapr_phb_pci_collect_nvgpu, sphb->nvgpus);
> +
> +    if (sphb->nvgpus->errp) {
> +        error_propagate(errp, sphb->nvgpus->errp);
> +        sphb->nvgpus->errp = NULL;
> +        goto cleanup_exit;
> +    }
> +
> +    /* Add found GPU RAM and ATSD MRs if found */
> +    for (i = 0, valid_gpu_num = 0; i < sphb->nvgpus->num; ++i) {
> +        Object *nvmrobj;
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +
> +        if (!nvslot->gpdev) {
> +            continue;
> +        }
> +        nvmrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> +                                           "nvlink2-mr[0]", NULL);
> +        /* ATSD is pointless without GPU RAM MR so skip those */
> +        if (!nvmrobj) {
> +            continue;
> +        }
> +
> +        ++valid_gpu_num;
> +        memory_region_add_subregion(get_system_memory(), nvslot->gpa,
> +                                    MEMORY_REGION(nvmrobj));
> +
> +        for (j = 0; j < nvslot->linknum; ++j) {
> +            Object *atsdmrobj;
> +
> +            atsdmrobj = object_property_get_link(OBJECT(nvslot->links[j].npdev),
> +                                                 "nvlink2-atsd-mr[0]", NULL);
> +            if (!atsdmrobj) {
> +                continue;
> +            }
> +            memory_region_add_subregion(get_system_memory(),
> +                                        nvslot->links[j].atsd_gpa,
> +                                        MEMORY_REGION(atsdmrobj));
> +        }
> +    }
> +
> +    if (valid_gpu_num) {
> +        return;
> +    }
> +    /* We did not find any interesting GPU */
> +cleanup_exit:
> +    g_free(sphb->nvgpus);
> +    sphb->nvgpus = NULL;
> +}
> +
> +void spapr_phb_nvgpu_free(SpaprPhbState *sphb)
> +{
> +    int i, j;
> +
> +    if (!sphb->nvgpus) {
> +        return;
> +    }
> +
> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> +                                                    "nvlink2-mr[0]", NULL);
> +
> +        if (nv_mrobj) {
> +            memory_region_del_subregion(get_system_memory(),
> +                                        MEMORY_REGION(nv_mrobj));
> +        }
> +        for (j = 0; j < nvslot->linknum; ++j) {
> +            PCIDevice *npdev = nvslot->links[j].npdev;
> +            Object *atsd_mrobj;
> +            atsd_mrobj = object_property_get_link(OBJECT(npdev),
> +                                                  "nvlink2-atsd-mr[0]", NULL);
> +            if (atsd_mrobj) {
> +                memory_region_del_subregion(get_system_memory(),
> +                                            MEMORY_REGION(atsd_mrobj));
> +            }
> +        }
> +    }
> +    g_free(sphb->nvgpus);
> +    sphb->nvgpus = NULL;
> +}
> +
> +void spapr_phb_nvgpu_populate_dt(SpaprPhbState *sphb, void *fdt, int bus_off,
> +                                 Error **errp)
> +{
> +    int i, j, atsdnum = 0;
> +    uint64_t atsd[8]; /* The existing limitation of known guests */
> +
> +    if (!sphb->nvgpus) {
> +        return;
> +    }
> +
> +    for (i = 0; (i < sphb->nvgpus->num) && (atsdnum < ARRAY_SIZE(atsd)); ++i) {
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +
> +        if (!nvslot->gpdev) {
> +            continue;
> +        }
> +        for (j = 0; j < nvslot->linknum; ++j) {
> +            if (!nvslot->links[j].atsd_gpa) {
> +                continue;
> +            }
> +
> +            if (atsdnum == ARRAY_SIZE(atsd)) {
> +                error_report("Only %"PRIuPTR" ATSD registers supported",
> +                             ARRAY_SIZE(atsd));
> +                break;
> +            }
> +            atsd[atsdnum] = cpu_to_be64(nvslot->links[j].atsd_gpa);
> +            ++atsdnum;
> +        }
> +    }
> +
> +    if (!atsdnum) {
> +        error_setg(errp, "No ATSD registers found");
> +        return;
> +    }
> +
> +    if (!spapr_phb_eeh_available(sphb)) {
> +        /*
> +         * ibm,mmio-atsd contains ATSD registers; these belong to an NPU PHB
> +         * which we do not emulate as a separate device. Instead we put
> +         * ibm,mmio-atsd to the vPHB with GPU and make sure that we do not
> +         * put GPUs from different IOMMU groups to the same vPHB to ensure
> +         * that the guest will use ATSDs from the corresponding NPU.
> +         */
> +        error_setg(errp, "ATSD requires separate vPHB per GPU IOMMU group");
> +        return;
> +    }
> +
> +    _FDT((fdt_setprop(fdt, bus_off, "ibm,mmio-atsd", atsd,
> +                      atsdnum * sizeof(atsd[0]))));
> +}
> +
> +void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
> +{
> +    int i, j, linkidx, npuoff;
> +    char *npuname;
> +
> +    if (!sphb->nvgpus) {
> +        return;
> +    }
> +
> +    npuname = g_strdup_printf("npuphb%d", sphb->index);
> +    npuoff = fdt_add_subnode(fdt, 0, npuname);
> +    _FDT(npuoff);
> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#address-cells", 1));
> +    _FDT(fdt_setprop_cell(fdt, npuoff, "#size-cells", 0));
> +    /* Advertise NPU as POWER9 so the guest can enable NPU2 contexts */
> +    _FDT((fdt_setprop_string(fdt, npuoff, "compatible", "ibm,power9-npu")));
> +    g_free(npuname);
> +
> +    for (i = 0, linkidx = 0; i < sphb->nvgpus->num; ++i) {
> +        for (j = 0; j < sphb->nvgpus->slots[i].linknum; ++j) {
> +            char *linkname = g_strdup_printf("link@%d", linkidx);
> +            int off = fdt_add_subnode(fdt, npuoff, linkname);
> +
> +            _FDT(off);
> +            /* _FDT((fdt_setprop_cell(fdt, off, "reg", linkidx))); */
> +            _FDT((fdt_setprop_string(fdt, off, "compatible",
> +                                     "ibm,npu-link")));
> +            _FDT((fdt_setprop_cell(fdt, off, "phandle",
> +                                   PHANDLE_NVLINK(sphb, i, j))));
> +            _FDT((fdt_setprop_cell(fdt, off, "ibm,npu-link-index", linkidx)));
> +            g_free(linkname);
> +            ++linkidx;
> +        }
> +    }
> +
> +    /* Add memory nodes for GPU RAM and mark them unusable */
> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +        Object *nv_mrobj = object_property_get_link(OBJECT(nvslot->gpdev),
> +                                                    "nvlink2-mr[0]", NULL);
> +        uint32_t associativity[] = {
> +            cpu_to_be32(0x4),
> +            SPAPR_GPU_NUMA_ID,
> +            SPAPR_GPU_NUMA_ID,
> +            SPAPR_GPU_NUMA_ID,
> +            cpu_to_be32(nvslot->numa_id)
> +        };
> +        uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
> +        uint64_t mem_reg[2] = { cpu_to_be64(nvslot->gpa), cpu_to_be64(size) };
> +        char *mem_name = g_strdup_printf("memory@%"PRIx64, nvslot->gpa);
> +        int off = fdt_add_subnode(fdt, 0, mem_name);
> +
> +        _FDT(off);
> +        _FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
> +        _FDT((fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg))));
> +        _FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
> +                          sizeof(associativity))));
> +
> +        _FDT((fdt_setprop_string(fdt, off, "compatible",
> +                                 "ibm,coherent-device-memory")));
> +
> +        mem_reg[1] = cpu_to_be64(0);
> +        _FDT((fdt_setprop(fdt, off, "linux,usable-memory", mem_reg,
> +                          sizeof(mem_reg))));
> +        _FDT((fdt_setprop_cell(fdt, off, "phandle",
> +                               PHANDLE_GPURAM(sphb, i))));
> +        g_free(mem_name);
> +    }
> +
> +}
> +
> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int offset,
> +                                        SpaprPhbState *sphb)
> +{
> +    int i, j;
> +
> +    if (!sphb->nvgpus) {
> +        return;
> +    }
> +
> +    for (i = 0; i < sphb->nvgpus->num; ++i) {
> +        struct spapr_phb_pci_nvgpu_slot *nvslot = &sphb->nvgpus->slots[i];
> +
> +        /* Skip "slot" without attached GPU */
> +        if (!nvslot->gpdev) {
> +            continue;
> +        }
> +        if (dev == nvslot->gpdev) {
> +            uint32_t npus[nvslot->linknum];
> +
> +            for (j = 0; j < nvslot->linknum; ++j) {
> +                PCIDevice *npdev = nvslot->links[j].npdev;
> +
> +                npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
> +            }
> +            _FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
> +                             j * sizeof(npus[0])));
> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> +                                   PHANDLE_PCIDEV(sphb, dev))));
> +            continue;
> +        }
> +
> +        for (j = 0; j < nvslot->linknum; ++j) {
> +            if (dev != nvslot->links[j].npdev) {
> +                continue;
> +            }
> +
> +            _FDT((fdt_setprop_cell(fdt, offset, "phandle",
> +                                   PHANDLE_PCIDEV(sphb, dev))));
> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
> +                                  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
> +            _FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
> +                                   PHANDLE_NVLINK(sphb, i, j))));
> +            /*
> +             * If we ever want to emulate GPU RAM at the same location as on
> +             * the host - here is the encoding GPA->TGT:
> +             *
> +             * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
> +             * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
> +             * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
> +             * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
> +             */
> +            _FDT(fdt_setprop_cell(fdt, offset, "memory-region",
> +                                  PHANDLE_GPURAM(sphb, i)));
> +            _FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
> +                                 nvslot->tgt));
> +            _FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
> +                                  nvslot->links[j].link_speed));
> +        }
> +    }
> +}
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index 40a12001f580..29b2697fe12c 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -2180,3 +2180,134 @@ int vfio_add_virt_caps(VFIOPCIDevice *vdev, Error **errp)
>  
>      return 0;
>  }
> +
> +static void vfio_pci_nvlink2_get_tgt(Object *obj, Visitor *v,
> +                                     const char *name,
> +                                     void *opaque, Error **errp)
> +{
> +    uint64_t tgt = (uintptr_t) opaque;
> +    visit_type_uint64(v, name, &tgt, errp);
> +}
> +
> +static void vfio_pci_nvlink2_get_link_speed(Object *obj, Visitor *v,
> +                                                 const char *name,
> +                                                 void *opaque, Error **errp)
> +{
> +    uint32_t link_speed = (uint32_t)(uintptr_t) opaque;
> +    visit_type_uint32(v, name, &link_speed, errp);
> +}
> +
> +int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    int ret;
> +    void *p;
> +    struct vfio_region_info *nv2reg = NULL;
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_region_info_cap_nvlink2_ssatgt *cap;
> +    VFIOQuirk *quirk;
> +
> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> +                                   PCI_VENDOR_ID_NVIDIA,
> +                                   VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
> +                                   &nv2reg);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    hdr = vfio_get_region_info_cap(nv2reg, VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> +    if (!hdr) {
> +        ret = -ENODEV;
> +        goto free_exit;
> +    }
> +    cap = (void *) hdr;
> +
> +    p = mmap(NULL, nv2reg->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> +             MAP_SHARED, vdev->vbasedev.fd, nv2reg->offset);
> +    if (p == MAP_FAILED) {
> +        ret = -errno;
> +        goto free_exit;
> +    }
> +
> +    quirk = vfio_quirk_alloc(1);
> +    memory_region_init_ram_ptr(&quirk->mem[0], OBJECT(vdev), "nvlink2-mr",
> +                               nv2reg->size, p);
> +    QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);
> +
> +    object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> +                        vfio_pci_nvlink2_get_tgt, NULL, NULL,
> +                        (void *) (uintptr_t) cap->tgt, NULL);
> +    trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
> +                                          nv2reg->size);
> +free_exit:
> +    g_free(nv2reg);
> +
> +    return ret;
> +}
> +
> +int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
> +{
> +    int ret;
> +    void *p;
> +    struct vfio_region_info *atsdreg = NULL;
> +    struct vfio_info_cap_header *hdr;
> +    struct vfio_region_info_cap_nvlink2_ssatgt *captgt;
> +    struct vfio_region_info_cap_nvlink2_lnkspd *capspeed;
> +    VFIOQuirk *quirk;
> +
> +    ret = vfio_get_dev_region_info(&vdev->vbasedev,
> +                                   VFIO_REGION_TYPE_PCI_VENDOR_TYPE |
> +                                   PCI_VENDOR_ID_IBM,
> +                                   VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
> +                                   &atsdreg);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    hdr = vfio_get_region_info_cap(atsdreg,
> +                                   VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
> +    if (!hdr) {
> +        ret = -ENODEV;
> +        goto free_exit;
> +    }
> +    captgt = (void *) hdr;
> +
> +    hdr = vfio_get_region_info_cap(atsdreg,
> +                                   VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
> +    if (!hdr) {
> +        ret = -ENODEV;
> +        goto free_exit;
> +    }
> +    capspeed = (void *) hdr;
> +
> +    /* Some NVLink bridges may not have assigned ATSD */
> +    if (atsdreg->size) {
> +        p = mmap(NULL, atsdreg->size, PROT_READ | PROT_WRITE | PROT_EXEC,
> +                 MAP_SHARED, vdev->vbasedev.fd, atsdreg->offset);
> +        if (p == MAP_FAILED) {
> +            ret = -errno;
> +            goto free_exit;
> +        }
> +
> +        quirk = vfio_quirk_alloc(1);
> +        memory_region_init_ram_device_ptr(&quirk->mem[0], OBJECT(vdev),
> +                                          "nvlink2-atsd-mr", atsdreg->size, p);
> +        QLIST_INSERT_HEAD(&vdev->bars[0].quirks, quirk, next);
> +    }
> +
> +    object_property_add(OBJECT(vdev), "nvlink2-tgt", "uint64",
> +                        vfio_pci_nvlink2_get_tgt, NULL, NULL,
> +                        (void *) (uintptr_t) captgt->tgt, NULL);
> +    trace_vfio_pci_nvlink2_setup_quirk_ssatgt(vdev->vbasedev.name, captgt->tgt,
> +                                              atsdreg->size);
> +
> +    object_property_add(OBJECT(vdev), "nvlink2-link-speed", "uint32",
> +                        vfio_pci_nvlink2_get_link_speed, NULL, NULL,
> +                        (void *) (uintptr_t) capspeed->link_speed, NULL);
> +    trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
> +                                              capspeed->link_speed);
> +free_exit:
> +    g_free(atsdreg);
> +
> +    return ret;
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index dd12f363915d..07aa141aabe6 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3069,6 +3069,20 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          goto out_teardown;
>      }
>  
> +    if (vdev->vendor_id == PCI_VENDOR_ID_NVIDIA) {
> +        ret = vfio_pci_nvidia_v100_ram_init(vdev, errp);
> +        if (ret && ret != -ENODEV) {
> +            error_report("Failed to setup NVIDIA V100 GPU RAM");
> +        }
> +    }
> +
> +    if (vdev->vendor_id == PCI_VENDOR_ID_IBM) {
> +        ret = vfio_pci_nvlink2_init(vdev, errp);
> +        if (ret && ret != -ENODEV) {
> +            error_report("Failed to setup NVlink2 bridge");
> +        }
> +    }
> +
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index cf1e8868182b..88841e9a61da 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -87,6 +87,10 @@ vfio_pci_igd_opregion_enabled(const char *name) "%s"
>  vfio_pci_igd_host_bridge_enabled(const char *name) "%s"
>  vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
>  
> +vfio_pci_nvidia_gpu_setup_quirk(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> +vfio_pci_nvlink2_setup_quirk_ssatgt(const char *name, uint64_t tgt, uint64_t size) "%s tgt=0x%"PRIx64" size=0x%"PRIx64
> +vfio_pci_nvlink2_setup_quirk_lnkspd(const char *name, uint32_t link_speed) "%s link_speed=0x%x"
> +
>  # hw/vfio/common.c
>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
