qemu-devel


From: Yan Zhao
Subject: Re: device compatibility interface for live migration with assigned devices
Date: Mon, 17 Aug 2020 09:52:43 +0800
User-agent: Mutt/1.9.4 (2018-02-28)

On Fri, Aug 14, 2020 at 01:30:00PM +0100, Sean Mooney wrote:
> On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote:
> > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > 
> > > On 2020/8/10 下午3:46, Yan Zhao wrote:
> > > > > driver is it handled by?
> > > > 
> > > > It looks like devlink is specific to network devices; devlink.h
> > > > describes itself as
> > > > include/uapi/linux/devlink.h - Network physical device Netlink
> > > > interface,
> > > 
> > > 
> > > Actually not. I think there was some discussion last year, and the
> > > conclusion was to remove this comment.
> > > 
> > > It supports IB and probably vDPA in the future.
> > > 
> > 
> > hmm... sorry, I couldn't find the discussion you referred to, only the
> > discussions below about why devlink was added.
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
> >     >This doesn't seem to be too much related to networking? Why can't
> >     >something like this be in sysfs?
> >     
> >     It is related to networking quite a bit. There have been a couple of
> >     iterations of this, including sysfs and configfs implementations. There
> >     has been a consensus reached that this should be done by netlink. I
> >     believe netlink is really the best for this purpose. Sysfs is not a good
> >     idea
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
> >     >there is already a way to change eth/ib via
> >     >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/0000:02:00.0/mlx4_port1
> >     >
> >     >sounds like this is another way to achieve the same?
> >     
> >     It is. However the current way is driver-specific, not correct.
> >     For mlx5, we need the same, it cannot be done in this way. So devlink is
> >     the correct way to go.
> I'm not sure I agree with that.
> Standardising a filesystem-based API that is used across all vendors is also
> a valid option. That said, if devlink is the right choice from a kernel
> perspective, by all means use it, but I have not heard a convincing argument
> for why it is actually better.
> With that said, we have been using tools like ethtool to manage aspects of
> NICs for decades, so it's not that strange an idea to use a tool and a binary
> protocol rather than a text-based interface for this, but there are
> advantages to both approaches.
> >
Yes, I agree with you.

> > https://lwn.net/Articles/674867/
> >     There is a need for some userspace API that would allow to expose things
> >     that are not directly related to any device class like net_device or
> >     ib_device, but rather chip-wide/switch-ASIC-wide stuff.
> > 
> >     Use cases:
> >     1) get/set of port type (Ethernet/InfiniBand)
> >     2) monitoring of hardware messages to and from chip
> >     3) setting up port splitters - split port into multiple ones and squash
> >        again, enables usage of splitter cable
> >     4) setting up shared buffers - shared among multiple ports within one
> >        chip
> > 
> > 
> > 
> > we can actually also retrieve the same information through sysfs, e.g.
> > 
> > - [path to device]
> >   |--- migration
> >   |     |--- self
> >   |     |   |---device_api
> >   |     |   |---mdev_type
> >   |     |   |---software_version
> >   |     |   |---device_id
> >   |     |   |---aggregator
> >   |     |--- compatible
> >   |     |   |---device_api
> >   |     |   |---mdev_type
> >   |     |   |---software_version
> >   |     |   |---device_id
> >   |     |   |---aggregator
> > 
> > 
> > 
> > > 
> > > >   I feel like it's not very appropriate for a GPU driver to use
> > > > this interface. Is that right?
> > > 
> > > 
> > > I think not, though most of the users are switch or Ethernet devices. It
> > > doesn't prevent you from inventing new abstractions.
> > 
> > so we would need to patch the devlink core and the userspace devlink tool?
> > e.g. devlink migration
> And the devlink Python libs, if OpenStack was to use it directly.
> We do have cases where we just fork a process and execute a command in a shell,
> with or without elevated privileges, but we really don't like doing that due to
> the performance impact and security implications, so where we can use Python
> bindings over C APIs, we do. pyroute2 is the only Python lib I know of off the
> top of my head that supports devlink, so we would need to enhance it to support
> this new devlink API.
> There may be others; I have not really looked in the past, since we don't need
> to use devlink at all today.
> > 
> > > Note that devlink is based on netlink, netlink has been widely used by
> > > various subsystems other than networking.
> > 
> > the advantage of netlink I see is that it can monitor device status and
> > notify the upper layer that the migration database needs to be updated.
> > But I'm not sure whether openstack would like to use this capability.
> > As Sean said, it's heavy for openstack. it's heavy for vendor drivers
> > as well :)
> > 
> > And devlink monitor currently listens for notifications and dumps the state
> > changes. If we want to use it, we need to let it forward the notifications
> > and dumped info to openstack, right?
> I don't think we would use direct devlink monitoring in nova even if it was
> available.
> We could, but we already poll libvirt and the system for other resources
> periodically.
so, if we use a file system based approach, could openstack periodically check
and update the migration info?
e.g.
every minute, read /sys/<path to device>/migration/self/*, and if any file
disappears or appears, or its content changes, just let the placement
know.
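
A minimal sketch of that periodic check, assuming the attributes under
migration/self/ are plain text files. The function names are hypothetical,
and how the result is reported to placement is left out:

```python
import os


def read_migration_attrs(attr_dir):
    """Return {file name: stripped content} for each regular file in attr_dir.

    attr_dir stands for /sys/<path to device>/migration/self; the real path
    is not spelled out here and has to be supplied by the caller.
    """
    attrs = {}
    if not os.path.isdir(attr_dir):
        return attrs
    for name in sorted(os.listdir(attr_dir)):
        path = os.path.join(attr_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                attrs[name] = f.read().strip()
        except OSError:
            # The attribute vanished or became unreadable between
            # listdir() and open(); treat it as absent.
            continue
    return attrs


def diff_attrs(old, new):
    """Compare two snapshots and return (added, removed, changed)."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed
```

A periodic task would keep the previous snapshot, call read_migration_attrs()
again every minute, and notify placement only when diff_attrs() returns
something non-empty.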

Then, when about to start migration, check the source device's
/sys/<path to src device>/migration/compatible/* and search the
placement for an existing device that matches it.
If there is one, create the vm with that device and migrate to it;
if not, and if it's an mdev, try to create a matching one and migrate to
it.
(to create a matching mdev, I guess openstack can follow the sequence below:
1. find a target device with the same device id (e.g. parent pci id)
2. create an mdev with a matching mdev type
3. adjust other vendor specific attributes
4. if 2 or 3 fails, go to 1 again
)
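
The lookup and the four-step sequence could be sketched roughly as below.
This assumes, as a simplification, that a compatible/ attribute matches a
candidate when the strings are equal; the real matching syntax is not
defined in this mail. create_mdev and adjust_attrs are hypothetical
callbacks standing in for whatever openstack would use to create and
configure an mdev:

```python
def is_compatible(compatible, candidate_self):
    """True if every attribute required by the source equals the candidate's.

    Plain string equality is an assumption made for this sketch only.
    """
    return all(candidate_self.get(k) == v for k, v in compatible.items())


def find_existing(compatible, devices):
    """devices maps device name -> its migration/self attributes.

    Returns the first matching device name, or None.
    """
    for name, self_attrs in devices.items():
        if is_compatible(compatible, self_attrs):
            return name
    return None


def create_matching_mdev(compatible, parents, create_mdev, adjust_attrs):
    """parents maps parent device name -> device id.

    create_mdev(parent, mdev_type) returns an mdev handle or None on failure;
    adjust_attrs(mdev, compatible) returns True on success. Both are
    hypothetical callbacks supplied by the caller.
    """
    for parent, device_id in parents.items():
        # Step 1: only consider parents with the same device id.
        if device_id != compatible.get("device_id"):
            continue
        # Step 2: create an mdev with the matching mdev type.
        mdev = create_mdev(parent, compatible.get("mdev_type"))
        if mdev is None:
            continue  # Step 4: step 2 failed, try the next parent.
        # Step 3: adjust the other vendor specific attributes.
        if adjust_attrs(mdev, compatible):
            return mdev
        # Step 4: step 3 failed; real code would also remove the mdev here.
    return None
```

find_existing() covers the "search the placement" case, and
create_matching_mdev() is only tried when it returns None.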

is this approach feasible?


> We would likely just add monitoring via devlink to that periodic task.
> We certainly would not use it to detect a migration or a need to update a
> migration database (not sure what that is).
by migration database, I meant the traits in the placement. :)

if periodic monitoring of devlink is required, then periodically
monitoring sysfs is also viable, right?
> 
> In reality, if we can consume this info indirectly via a libvirt API, that
> will be the approach we take, at least for the libvirt driver in nova. Cyborg
> may take a different approach. We already use pyroute2 in 2 projects, os-vif
> and neutron, and it does have devlink support, so the burden of using devlink
> is not that high for OpenStack, but it's a less friendly interface for
> configuration tools like Ansible vs a filesystem-based approach.
> > 

 


