From: Laurent Vivier
Subject: Re: [Qemu-devel] [PULL 01/36] spapr: Support NVIDIA V100 GPU with NVLink2
Date: Fri, 17 May 2019 19:37:04 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1
On 26/04/2019 08:05, David Gibson wrote:
> From: Alexey Kardashevskiy <address@hidden>
>
> NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
> space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
> implements special regions for such GPUs and emulates an NVLink bridge.
> NVLink2-enabled POWER9 CPUs also provide address translation services
> which includes an ATS shootdown (ATSD) register exported via the NVLink
> bridge device.
>
> This adds a quirk to VFIO to map the GPU memory and create an MR;
> the new MR is stored in the PCI device as a QOM link. The sPAPR PCI code
> uses this to retrieve the MR and map it into the system address space.
> Another quirk does the same for the ATSD.
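For context, the MR-as-QOM-link pattern described above is compact in
practice. A hedged sketch using QEMU 4.0-era APIs follows; the function
name, the property name "nvlink2-mr", and variables such as ram, pdev and
gpa are illustrative placeholders, not the patch's actual identifiers:

    /* VFIO side (sketch): wrap the mmap'ed GPU RAM in a RAM-device MR
     * and publish it on the PCI device as a QOM link. */
    static void nvlink2_gpu_ram_quirk(VFIOPCIDevice *vdev, void *ram,
                                      uint64_t size, Error **errp)
    {
        MemoryRegion *mr = g_new0(MemoryRegion, 1);

        memory_region_init_ram_device_ptr(mr, OBJECT(vdev), "nvlink2-mr",
                                          size, ram);
        object_property_add_const_link(OBJECT(vdev), "nvlink2-mr",
                                       OBJECT(mr), errp);
    }

    /* sPAPR side (sketch): look the MR up by link name and map it at
     * the guest physical address reserved for this GPU. */
    Object *mr_obj = object_property_get_link(OBJECT(pdev), "nvlink2-mr",
                                              NULL);
    if (mr_obj) {
        memory_region_add_subregion(get_system_memory(), gpa,
                                    MEMORY_REGION(mr_obj));
    }

The real code in hw/vfio/pci-quirks.c and hw/ppc/spapr_pci_nvlink2.c
below is of course more involved (VFIO region lookup, sizing, cleanup).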
>
> This adds additional steps to sPAPR PHB setup:
>
> 1. Search for specific GPUs and NPUs, collect findings in
> sPAPRPHBState::nvgpus, manage system address space mappings;
>
> 2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
> "memory-block", "link-speed" to advertise the NVLink2 function to
> the guest (see the sketch after this list);
>
> 3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;
>
> 4. Add new memory blocks (with extra "linux,memory-usable" to prevent
> the guest OS from accessing the new memory until it is onlined) and
> npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
> uses it for link discovery.
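For illustration, steps 2 and 4 amount to ordinary libfdt property
writes. A sketch only: gpu_off, npu_off, mem_off and the phandle/value
variables are made-up placeholders, and the property spellings are taken
from this commit message rather than from the code (the in-tree names may
differ; real code also byte-swaps values as needed):

    /* Step 2 (sketch): cross-link the GPU and NPU nodes and advertise
     * the NVLink2 function; _FDT() is QEMU's abort-on-error wrapper
     * around libfdt calls. */
    _FDT(fdt_setprop_cell(fdt, gpu_off, "ibm,npu", npu_phandle));
    _FDT(fdt_setprop_cell(fdt, npu_off, "ibm,gpu", gpu_phandle));
    _FDT(fdt_setprop_cell(fdt, gpu_off, "memory-block", mem_phandle));
    _FDT(fdt_setprop_cell(fdt, gpu_off, "link-speed", link_speed));

    /* Step 4 (sketch): mark the GPU RAM node so the guest does not use
     * the memory until it is onlined. */
    uint64_t usable[2] = { 0, 0 };   /* placeholder reg value */
    _FDT(fdt_setprop(fdt, mem_off, "linux,memory-usable", usable,
                     sizeof(usable)));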
>
> This allocates space for GPU RAM and ATSD like we do for MMIOs by
> adding 2 new parameters to the phb_placement() hook. Older machine types
> set these to zero.
>
> This puts the new memory nodes in a separate NUMA node, as the GPU RAM
> needs to be configured equally distant from any other node in the system.
> Unlike the host setup, which assigns NUMA ids from 255 downwards, this
> adds the new NUMA nodes after the user-configured ones, or from 1 if none
> were configured.
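In other words, the allocation is just a counter seeded from the user's
configuration (a sketch; "nvslot" is illustrative shorthand, not the
patch's structure name):

    /* At machine reset: seed the counter past any user-defined nodes,
     * or at 1, since the GPU RAM must be equally distant from any
     * other node and so cannot share node 0. */
    spapr->gpu_numa_id = MAX(1, nb_numa_nodes);

    /* Per collected GPU: hand out the next id; gpu_numa_id thus ends
     * up one past the last node and is later written to
     * max-associativity-domains. */
    nvslot->numa_id = spapr->gpu_numa_id++;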
>
> This adds a requirement similar to EEH - one IOMMU group per vPHB.
> The reason for this is that ATSD registers belong to a physical NPU,
> so they cannot invalidate translations on GPUs attached to another NPU.
> This is guaranteed by the host platform, as it does not mix NVLink bridges
> or GPUs from different NPUs in the same IOMMU group. If more than one
> IOMMU group is detected on a vPHB, this disables ATSD support for that
> vPHB and prints a warning.
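The check itself can be as simple as the following sketch; the helper
that counts groups and the exact way ATSD is suppressed are illustrative
guesses, not the patch's code:

    /* Sketch: with more than one host IOMMU group behind this vPHB,
     * the ATSD registers cannot cover all of its GPUs, so do not
     * advertise ATSD ("ibm,mmio-atsd") for it at all. */
    if (spapr_phb_iommu_group_count(sphb) > 1) {  /* hypothetical helper */
        warn_report("nvlink2: multiple IOMMU groups on vPHB %u, "
                    "disabling ATSD", sphb->index);
        sphb->nv2_atsd_win_addr = 0;  /* assumption: 0 means no window */
    }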
>
> Signed-off-by: Alexey Kardashevskiy <address@hidden>
> [aw: for vfio portions]
> Acked-by: Alex Williamson <address@hidden>
> Message-Id: <address@hidden>
> Signed-off-by: David Gibson <address@hidden>
> ---
>  hw/ppc/Makefile.objs        |   2 +-
>  hw/ppc/spapr.c              |  48 +++-
>  hw/ppc/spapr_pci.c          |  19 ++
>  hw/ppc/spapr_pci_nvlink2.c  | 450 ++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci-quirks.c        | 131 +++++++++++
>  hw/vfio/pci.c               |  14 ++
>  hw/vfio/pci.h               |   2 +
>  hw/vfio/trace-events        |   4 +
>  include/hw/pci-host/spapr.h |  45 ++++
>  include/hw/ppc/spapr.h      |   5 +-
>  10 files changed, 711 insertions(+), 9 deletions(-)
> create mode 100644 hw/ppc/spapr_pci_nvlink2.c
>
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 1111b218a0..636e717f20 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -9,7 +9,7 @@ obj-$(CONFIG_SPAPR_RNG) += spapr_rng.o
> # IBM PowerNV
> obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> -obj-y += spapr_pci_vfio.o
> +obj-y += spapr_pci_vfio.o spapr_pci_nvlink2.o
> endif
> obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> # PowerPC 4xx boards
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index b52b82d298..b81e237635 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1034,12 +1034,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> 0, cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE),
> cpu_to_be32(max_cpus / smp_threads),
> };
> + uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
> uint32_t maxdomains[] = {
> cpu_to_be32(4),
> - cpu_to_be32(0),
> - cpu_to_be32(0),
> - cpu_to_be32(0),
> - cpu_to_be32(nb_numa_nodes ? nb_numa_nodes : 1),
> + maxdomain,
> + maxdomain,
> + maxdomain,
> + cpu_to_be32(spapr->gpu_numa_id),
> };
>
> _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
> @@ -1698,6 +1699,16 @@ static void spapr_machine_reset(void)
> spapr_irq_msi_reset(spapr);
> }
>
> + /*
> + * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
> + * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
> + * called from vPHB reset handler so we initialize the counter here.
> + * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
> + * must be equally distant from any other node.
> + * The final value of spapr->gpu_numa_id is going to be written to
> + * max-associativity-domains in spapr_build_fdt().
> + */
> + spapr->gpu_numa_id = MAX(1, nb_numa_nodes);
> qemu_devices_reset();
>
> /*
> @@ -3907,7 +3918,9 @@ static void spapr_phb_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> smc->phb_placement(spapr, sphb->index,
> &sphb->buid, &sphb->io_win_addr,
> &sphb->mem_win_addr, &sphb->mem64_win_addr,
> - windows_supported, sphb->dma_liobn, errp);
> + windows_supported, sphb->dma_liobn,
> + &sphb->nv2_gpa_win_addr, &sphb->nv2_atsd_win_addr,
> + errp);
> }
>
> static void spapr_phb_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
> @@ -4108,7 +4121,8 @@ static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
> static void spapr_phb_placement(SpaprMachineState *spapr, uint32_t index,
> uint64_t *buid, hwaddr *pio,
> hwaddr *mmio32, hwaddr *mmio64,
> - unsigned n_dma, uint32_t *liobns, Error **errp)
> + unsigned n_dma, uint32_t *liobns,
> + hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> {
> /*
> * New-style PHB window placement.
> @@ -4153,6 +4167,9 @@ static void spapr_phb_placement(SpaprMachineState *spapr, uint32_t index,
> *pio = SPAPR_PCI_BASE + index * SPAPR_PCI_IO_WIN_SIZE;
> *mmio32 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM32_WIN_SIZE;
> *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
> +
> + *nv2gpa = SPAPR_PCI_NV2RAM64_WIN_BASE + index * SPAPR_PCI_NV2RAM64_WIN_SIZE;
> + *nv2atsd = SPAPR_PCI_NV2ATSD_WIN_BASE + index * SPAPR_PCI_NV2ATSD_WIN_SIZE;
> }
>
> static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
> @@ -4357,6 +4374,18 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
> /*
> * pseries-3.1
> */
> +static void phb_placement_3_1(SpaprMachineState *spapr, uint32_t index,
> + uint64_t *buid, hwaddr *pio,
> + hwaddr *mmio32, hwaddr *mmio64,
> + unsigned n_dma, uint32_t *liobns,
> + hwaddr *nv2gpa, hwaddr *nv2atsd, Error **errp)
> +{
> + spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64, n_dma, liobns,
> + nv2gpa, nv2atsd, errp);
> + *nv2gpa = 0;
> + *nv2atsd = 0;
> +}
> +
> static void spapr_machine_3_1_class_options(MachineClass *mc)
> {
> SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> @@ -4372,6 +4401,7 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
> smc->default_caps.caps[SPAPR_CAP_SBBC] = SPAPR_CAP_BROKEN;
> smc->default_caps.caps[SPAPR_CAP_IBS] = SPAPR_CAP_BROKEN;
> smc->default_caps.caps[SPAPR_CAP_LARGE_DECREMENTER] = SPAPR_CAP_OFF;
> + smc->phb_placement = phb_placement_3_1;
I think this hook should be renamed and go into the 4.0 machine type, as
4.0 has already been released.
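Something along these lines, mirroring the 3.1 hook in the patch (a
sketch of the suggested change, not tested):

    static void phb_placement_4_0(SpaprMachineState *spapr, uint32_t index,
                                  uint64_t *buid, hwaddr *pio,
                                  hwaddr *mmio32, hwaddr *mmio64,
                                  unsigned n_dma, uint32_t *liobns,
                                  hwaddr *nv2gpa, hwaddr *nv2atsd,
                                  Error **errp)
    {
        spapr_phb_placement(spapr, index, buid, pio, mmio32, mmio64,
                            n_dma, liobns, nv2gpa, nv2atsd, errp);
        *nv2gpa = 0;
        *nv2atsd = 0;
    }

    static void spapr_machine_4_0_class_options(MachineClass *mc)
    {
        SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);

        /* existing 4.0 compat settings unchanged */
        smc->phb_placement = phb_placement_4_0;
    }

so that pseries-4.0 and older keep the zeroed windows and only the next
machine type gets the NVLink2 placement.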
Thanks,
Laurent