Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration
From: Michal Privoznik
Subject: Re: [Qemu-devel] [libvirt] [RFC] libvirt vGPU QEMU integration
Date: Fri, 19 Aug 2016 14:42:27 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0
On 18.08.2016 18:41, Neo Jia wrote:
> Hi libvirt experts,
Hi, welcome to the list.
>
> I am starting this email thread to discuss the potential solution / proposal
> of integrating vGPU support into libvirt for QEMU.
>
> Some quick background: NVIDIA is implementing a VFIO-based mediated device
> framework to allow people to virtualize their devices without SR-IOV, for
> example NVIDIA vGPU and Intel KVMGT. Within this framework, we are reusing
> the VFIO API to process memory and interrupts the way QEMU does today with
> passthrough devices.
So, as far as I understand, this is solely NVIDIA's API and other vendors
(e.g. Intel) will use their own? Or is this a standard that others will
comply with?
>
> The difference here is that we are introducing a set of new sysfs files for
> virtual device discovery and life cycle management due to its virtual nature.
>
> Here is the summary of the sysfs, when they will be created and how they
> should be used:
>
> 1. Discover mediated device
>
> As part of the physical device initialization process, the vendor driver will
> register its physical devices with the mediated framework. These physical
> devices will be used to create virtual devices (mediated devices, aka mdev).
>
> Then, the sysfs file "mdev_supported_types" will be available under the
> physical device sysfs. It indicates the supported mdev types and
> configurations for this particular physical device. The content may change
> dynamically based on the system's current configuration, so libvirt needs
> to query this file every time before creating an mdev.
Ah, that was going to be my question. In the example below, you used
"echo '...vgpu_type_id=20...' > /sys/bus/.../mdev_create", and I was
wondering where the number 20 comes from. What I am wondering about now
is how libvirt should expose these to users and, moreover, how it should
let users choose.
We have a node device driver where I guess we could expose possible
options and then require some explicit value in the domain XML (but what
value would that be? I don't think taking vgpu_type_id-s as they are
would be a great idea).
>
> Note: different vendors might have their own specific configuration sysfs as
> well, if they don't have pre-defined types.
>
> For example, we have an NVIDIA Tesla M60 registered here on 86:00.0, and here
> is the NVIDIA-specific configuration on an idle system.
>
> For example, to query the "mdev_supported_types" on this Tesla M60:
>
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
> 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
> 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
> 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
> 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
> 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
> 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
> 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
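
One way to avoid exposing raw vgpu_type_id values to users would be to let
them pick a type by name and resolve the id right before creation. A minimal
sketch, assuming the file keeps the comma-separated layout of the example
output above (that layout is NVIDIA-specific and may well change). The helper
name mdev_type_id_for_name is hypothetical, and the file path is passed in by
the caller rather than hard-coded:

```shell
# Hypothetical helper: print the vgpu_type_id whose vgpu_type equals NAME.
# Assumes the comma-separated layout of the example output quoted above.
# Usage: mdev_type_id_for_name FILE NAME
# Returns non-zero if no row matches.
mdev_type_id_for_name() {
    awk -F',' -v name="$2" '
        /^#/ { next }                          # skip the header comment line
        {
            id = $1; tname = $2
            # Trim surrounding blanks and double quotes from both fields.
            gsub(/^[ \t"]+|[ \t"]+$/, "", id)
            gsub(/^[ \t"]+|[ \t"]+$/, "", tname)
            if (tname == name) { print id; found = 1; exit }
        }
        END { exit !found }
    ' "$1"
}
```

Since the content may change dynamically, the lookup would have to be repeated
every time, immediately before writing to mdev_create.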
>
> 2. Create/destroy mediated device
>
> Two sysfs files are available under the physical device sysfs path:
> mdev_create and mdev_destroy
>
> The syntax of creating a mdev is:
>
> echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
>
> The syntax of destroying a mdev is:
>
> echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
>
> The $mdev_UUID is a unique identifier for this mdev device to be created, and
> it is unique per system.
Ah, so a caller (the one doing the echo, e.g. libvirt) can generate
its own UUID under which the mdev will be known? I'm asking because of
migration: we might want to preserve UUIDs when a domain is migrated to
the other side. Speaking of which, is there such a limitation, or will
the guest be able to migrate even if the UUIDs change?
>
> For NVIDIA vGPU, we require a vGPU type identifier (shown as vgpu_type_id in
> above Tesla M60 output), and a VM UUID to be passed as
> "vendor_specific_argument_list".
I understand the need for vgpu_type_id, but can you shed more light on
the VM UUID? Why is that required?
>
> If no vendor-specific arguments are required, either "$mdev_UUID" or
> "$mdev_UUID:" will be acceptable as input syntax for the above two commands.
>
> To create a M60-4Q device, libvirt needs to do:
>
> echo "$mdev_UUID:vgpu_type_id=20,vm_uuid=$VM_UUID" > /sys/bus/pci/devices/0000\:86\:00.0/mdev_create
>
> Then, you will see a virtual device show up at:
>
> /sys/bus/mdev/devices/$mdev_UUID/
>
> For NVIDIA, if a VM is to have multiple virtual devices, all of them have to
> be created up front, before any of them is brought online.
>
> Regarding error reporting and detection: on failure, a write() to the sysfs
> file through an fd returns an error code, and a write to the sysfs file from
> a command prompt shows the string corresponding to that error code.
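
The create step and its error check could be sketched roughly as below. The
helper name mdev_create and the idea of passing the parent device directory as
a parameter are assumptions made for illustration; only the
"$mdev_UUID:vendor_specific_argument_list" syntax and the mdev_create file
name come from the description above:

```shell
# Hypothetical helper: create an mdev under a parent (physical) device.
# Usage: mdev_create PARENT_DIR MDEV_UUID [VENDOR_ARGS]
#   PARENT_DIR   e.g. the physical device's sysfs directory
#   VENDOR_ARGS  optional, e.g. "vgpu_type_id=...,vm_uuid=..."
mdev_create() {
    mc_spec=$2
    # Append ":args" only when vendor-specific arguments were given;
    # a bare UUID is accepted when there are none.
    [ -n "${3:-}" ] && mc_spec="$2:$3"
    # A failed open or write makes printf return non-zero.
    if ! printf '%s\n' "$mc_spec" > "$1/mdev_create"; then
        echo "mdev_create failed for $2 on $1" >&2
        return 1
    fi
}
```

A caller that needs the actual errno rather than the message printed by the
shell would have to use write() directly, as described above.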
>
> 3. Start/stop mediated device
>
> Under the virtual device sysfs, you will see a new "online" sysfs file.
>
> You can run cat /sys/bus/mdev/devices/$mdev_UUID/online to get the current
> status of this virtual device (0 or 1). To start or stop a virtual device,
> you can do:
>
> echo "1|0" > /sys/bus/mdev/devices/$mdev_UUID/online
>
> libvirt needs to query the current state before changing state.
>
> Note: if you have multiple devices, you need to write to the "online" file
> individually.
>
> For NVIDIA, if there are multiple mdev per VM, libvirt needs to bring all of
> them "online" before starting QEMU.
This is a valid requirement, indeed.
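
To make the "query first, then bring every device online" rule concrete, here
is a minimal sketch. The helper names mdev_set_online and mdev_online_all are
hypothetical; each argument is assumed to be a device directory such as
/sys/bus/mdev/devices/$mdev_UUID containing the "online" file described above:

```shell
# Hypothetical helper: set one mdev's "online" state to 0 or 1.
# Reads the current state first and writes only if it differs.
# Usage: mdev_set_online DEVICE_DIR 0|1
mdev_set_online() {
    so_cur=$(cat "$1/online") || return 1
    [ "$so_cur" = "$2" ] && return 0
    printf '%s\n' "$2" > "$1/online"
}

# Hypothetical helper: bring all given mdevs online, one write per device.
# Stops at the first failure so QEMU is not started with a partial set.
# Usage: mdev_online_all DEVICE_DIR...
mdev_online_all() {
    for oa_dev in "$@"; do
        mdev_set_online "$oa_dev" 1 || return 1
    done
}
```

QEMU would be started only after mdev_online_all has returned success for all
devices belonging to the VM.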
>
> 4. Launch QEMU/VM
>
> Pass the mdev sysfs path to QEMU as vfio-pci device:
>
> -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$mdev_UUID,id=vgpu0
One question here. Libvirt allows users to run qemu under a different
user:group than root:root. If that's the case, libvirt sets security
labels on all files qemu can/will touch. Are we going to need to do
something in that respect here?
>
> 5. Shutdown sequence
>
> libvirt needs to shut down QEMU, bring the virtual device offline, then
> destroy the virtual device.
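
The last two steps of that sequence could look roughly like the sketch below,
run once the QEMU process is gone. The helper name mdev_teardown and its
parameters are hypothetical; the directory layout follows the paths quoted
earlier, and only the bare-UUID form of the destroy syntax is used here:

```shell
# Hypothetical helper: take an mdev offline, then destroy it.
# Must only be called after QEMU has been shut down.
# Usage: mdev_teardown PARENT_DIR MDEV_DEVICES_DIR MDEV_UUID
#   PARENT_DIR        physical device directory holding mdev_destroy
#   MDEV_DEVICES_DIR  e.g. /sys/bus/mdev/devices
mdev_teardown() {
    # Query the current state before changing it.
    td_state=$(cat "$2/$3/online") || return 1
    if [ "$td_state" = "1" ]; then
        printf '0\n' > "$2/$3/online" || return 1
    fi
    # Destroy only after the device is offline.
    printf '%s\n' "$3" > "$1/mdev_destroy"
}
```

Vendors that require arguments on destroy would need the
"$mdev_UUID:vendor_specific_argument_list" form instead of the bare UUID.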
>
> 6. VM Reset
>
> No change or requirement for libvirt as this will be handled via VFIO reset
> API and QEMU process will keep running as before.
>
> 7. Hot-plug
>
> It is optional for vendors to support hot-plug.
>
> The same syntax is used to create a virtual device for hot-plug.
>
> For hot-unplug, after executing the QEMU monitor "device_del" command,
> libvirt needs to write to the "mdev_destroy" sysfs file to complete the
> hot-unplug process.
>
> Since hot-plug is optional, mdev_create or mdev_destroy operations may
> return an error if it is not supported.
Thank you for the very detailed description! In general, I like the API as
it looks usable from my POV (I'm no VFIO devel though).
Michal