From: Laine Stump
Subject: Re: [Qemu-ppc] [libvirt] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
Date: Mon, 5 Dec 2016 15:54:49 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
(Sorry for any duplicates. I sent it from the wrong address the first time.)

On 12/01/2016 11:18 PM, David Gibson wrote:
On Fri, Nov 25, 2016 at 02:46:21PM +0100, Andrea Bolognani wrote:

That's a broad statement. Why? If qemu reports the default devices and characteristics of the devices properly (and libvirt uses that information) there's no reason for it to make the wrong decision.

Explain "kinda-sorta-but-not-really". If there's a deficiency in the model maybe it can be fixed.

But the

True, but not for the reasons you think. If qemu is able to respond to queries with adequate details about the devices available for a machinetype (and what buses are in place by default), there's no reason that libvirt can't add devices addressed such that all the connections are legal; what libvirt *can't* get right is the policy requested in the next higher layer of management (and ultimately of the user) - does this device need to be hotpluggable? Does the user want to keep all devices on the root complex to avoid extra PCI controllers?

And qemu fundamentally CAN NOT get it right either. qemu knows what is possible and what is allowed, but it doesn't know what the user *wants* (beyond "they want device X", which is only half the story), and has no way of being told what the user wants other than with a PCI address.

To back up for a minute, some background info: once a device has been added to a domain, at *any* time in the future (not just during a migration, but forever more until the end of time) that device must always have the same PCI address as it had that first time. In order to guarantee that, libvirt needs to either:

a) keep track of the order the devices were added and always put the devices in the same order on the commandline (assuming that qemu guarantees that it actually assigns addresses based on the order of the devices' appearance on the commandline, which has never been stated anywhere as an API guarantee of qemu), or

b) remember the address of each device as it is added and specify that on the commandline in the future.

libvirt chooses (b). And where is the logical place to store that address? In the config.

So we've established that the PCI address of a device needs to be stored in the config. So why does libvirt need to choose it the first time?

1) Because qemu doesn't have (and CAN NOT have) all the information about what the user's plans are for that device:

   a) It has no idea if the user wants that device to be hotpluggable (on a root-port) or not (on the root complex as an integrated device).

   b) It doesn't know if the user wants the device to be placed on an expander bus so that its NUMA status can be discovered by the guest OS.

   If there is a choice, there must be a way to make that choice. The way that qemu provides to make the choice is by specifying an address. So libvirt must specify an address in its config.

2) Because qemu is unable/unwilling to automatically add PCIe root ports when necessary, it's *not even possible* (on PCIe machinetypes) for it to place a device on a hotpluggable port without libvirt specifying a PCI address the very first time the device is added (and also adding in a root-port), but libvirt's default policy is that (almost) all devices should be hotpluggable. If we were to follow your recommendation ("libvirt never specifies PCI addresses, but instead allows qemu to assign them"), hotplug on PCIe-based machinetypes would not be possible.

There have even been mentions that even libvirt is too *low* in the stack to be specifying the PCI address of devices (i.e. that all PCI address decisions should be up to higher level management applications) - I had posted a patch that would allow specifying "hotpluggable='yes|no'" in the XML rather than forcing the caller to specify an address, and this was NACKed because it was seen as libvirt dictating policy. (In the end, libvirt *does* dictate a default policy; it's just that the only way to modify that policy is by manually specifying addresses. libvirt's default PCI address policy is that (almost) all devices will be assigned an address that makes them hotpluggable, and will not be placed on a non-0 NUMA node.)
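
Option (b) above, choosing an address the first time a device is added and replaying it from the config ever after, can be sketched roughly as follows. This is only an illustration and not libvirt's code: `PciAddress`, `DomainConfig`, `address_for` and `first_free_slot` are hypothetical names, and the "lowest free slot on bus 0" chooser is a toy policy assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PciAddress:
    bus: int
    slot: int


@dataclass
class DomainConfig:
    """Hypothetical stand-in for a persistent domain config."""
    addresses: dict = field(default_factory=dict)  # device name -> PciAddress

    def address_for(self, device, chooser):
        # First time: ask the chooser (the policy) and record the result.
        # Every later time: replay the recorded address unchanged.
        if device not in self.addresses:
            used = set(self.addresses.values())
            self.addresses[device] = chooser(device, used)
        return self.addresses[device]


def first_free_slot(device, used):
    # Toy policy (an assumption): lowest free slot on bus 0, starting at 1.
    for slot in range(1, 32):
        addr = PciAddress(0, slot)
        if addr not in used:
            return addr
    raise RuntimeError("no free slot on bus 0")


config = DomainConfig()
first_boot = [config.address_for(d, first_free_slot) for d in ["net0", "disk0"]]

# A later start lists the devices in a different order and adds a new one;
# the existing devices keep the addresses stored in the config.
later_boot = {d: config.address_for(d, first_free_slot)
              for d in ["disk0", "gpu0", "net0"]}
```

Because the address lives in the config, the result does not depend on the order in which devices are listed on later starts, which is the property option (a) could only get from an ordering guarantee that qemu has never made.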

So, in spite of libvirt's effort, in the end we *still* need to expose address configuration to higher level management applications, since they may want to force all devices onto the root complex (e.g. libguestfs, which does it to reduce PCI controller count, and thus startup time) or force certain devices to be on a non-0 NUMA node (e.g. OpenStack when it wants to place a VFIO assigned device on the same NUMA node in the guest as it is in the host).

With all of that, I fail to see how it would be at all viable to simply leave PCI address assignment up to qemu.

There are all

And since qemu knows about them, it should be able to report them. Which is what Eduardo's work is doing. And then libvirt will know about all the constraints in a programmatic manner (rather than the horrible (tedious, error prone) hardcoding of all those details that we've had to suffer with until now).

The ONLY way libvirt can get this (temporarily)

And no matter how hard qemu might try to come up with a policy for address assignment that will satisfy the needs of 100% of the users 100% of the time, it will fail (because different users have different needs). Because qemu will be unable to properly place all devices all the time, libvirt (and higher level management) will still need to do it. Even in the basic case qemu doesn't provide what libvirt requires as default - that devices be hotpluggable.

So, libvirt should be allowing the

No, that doesn't work because qemu would in many situations place the devices at the wrong address / on the wrong controller, because there are many possible topologies that are legal, and the user may (for perfectly valid reasons) want something different from what qemu would have chosen.

(An example of two differing (and valid) policies - libguestfs needs guests to start up as quickly as possible, and thus wants as few PCI controllers as possible (this makes a noticeable difference in Linux boot time), so it wants all devices to be integrated on the root complex. On the other hand, a generic guest in libvirt should make all devices hotpluggable just in case the user wants to unplug them, so by default it tries to place all devices on a pcie-root-port. You can't support both of these if addressing is all left up to qemu.)
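
To make the contrast between those two policies concrete, here is a toy sketch. It is an assumed model, not libvirt's or qemu's actual allocator: bus 0 stands for the root complex, each root port is modelled as a new bus holding one device in slot 0, and `place_devices` and the policy names are hypothetical.

```python
def place_devices(devices, policy):
    """Return ({device: (bus, slot)}, number_of_extra_controllers).

    Toy model (an assumption): bus 0 is the root complex; every root port
    creates a new bus that holds a single device in slot 0.
    """
    placement = {}
    next_root_slot = 1  # next free slot on the root complex (bus 0)
    next_bus = 1        # bus number given to the next root port
    controllers = 0
    for dev in devices:
        if policy == "integrated":
            # libguestfs-style: everything directly on the root complex.
            placement[dev] = (0, next_root_slot)
            next_root_slot += 1
        elif policy == "hotpluggable":
            # libvirt-default-style: one root port per device. The port
            # itself consumes a root-complex slot and counts as a controller.
            next_root_slot += 1
            controllers += 1
            placement[dev] = (next_bus, 0)
            next_bus += 1
        else:
            raise ValueError(f"unknown policy: {policy}")
    return placement, controllers


integrated = place_devices(["net0", "disk0"], "integrated")
hotpluggable = place_devices(["net0", "disk0"], "hotpluggable")
```

The same device list yields two different, equally legal sets of addresses, and only the caller knows which one is wanted; the address is the only channel through which that choice reaches qemu.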

It may make you feel good to say that, but the facts don't back it up. Any project makes design mistakes, but in the specific case you're discussing here, I think you haven't looked from a wide enough viewpoint to see the necessity of what libvirt is doing and why it can't be done by qemu (certainly not all the time anyway).

And what's

libvirt has always done the best that could be done with the information provided by qemu. The problem isn't that libvirt is creating new problems for qemu out of thin air. The problem is that qemu is unable to automatically address PCI devices for all possible situations and user policy preferences, so higher levels need to make the decisions about addressing to satisfy their policies (i.e. what they *want*, e.g. hotpluggable, integrated on root complex). In addition, qemu hasn't (until Eduardo's patches) been able to provide adequate information about what is *legal* (e.g. which devices can be plugged into which model of pci controller, what slots are available on each type of controller, whether those slots are hotpluggable) in a programmatic way, so libvirt has had to hardcode rules about bus-device compatibility and capabilities, slot ranges, etc. in order to make proper decisions itself when possible, and to sanity-check decisions about addresses made by higher level management when not.

I don't think that's a design flaw. I think that's making the best of a "less than ideal" situation.

I'd feel better about this if there seemed to be some recognition of

Eduardo's work isn't being done to make up for some mythical design flaw in libvirt. It is being done in order to give libvirt the (previously unavailable) information it needs to do a necessary job, and is being done at least partly at the request of libvirt (we've certainly been demanding some of that stuff for a long time!)
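
As a rough illustration of the sanity-checking described above, the sketch below validates a requested address against a table of controller capabilities. The table is hardcoded here purely as an assumption for the example (the point of the discussion is that this data should come from qemu rather than being hardcoded), and `CONTROLLERS` and `check_address` are hypothetical names, not a libvirt or qemu API.

```python
# Illustrative capability table (assumed values for this sketch only).
CONTROLLERS = {
    "pcie-root": {"slots": range(0, 32), "hotplug": False},
    "pcie-root-port": {"slots": range(0, 1), "hotplug": True},
}


def check_address(controller_model, slot, want_hotplug):
    """Return None if the placement is legal, else a reason string."""
    caps = CONTROLLERS.get(controller_model)
    if caps is None:
        return "unknown controller model"
    if slot not in caps["slots"]:
        return "slot out of range"
    if want_hotplug and not caps["hotplug"]:
        return "controller does not support hotplug"
    return None


ok = check_address("pcie-root", 5, want_hotplug=False)
no_hotplug = check_address("pcie-root", 5, want_hotplug=True)
bad_slot = check_address("pcie-root-port", 1, want_hotplug=True)
```

Note that the check only answers "is this legal?". Whether the device should be hotpluggable in the first place is still an input from the caller, which is the division of labour argued for in this message.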

The summary is that it's impossible for qemu to correctly decide where to put new devices, especially in a PCIe hierarchy, for a few reasons (at least). Because of this, libvirt (and higher level management) needs to be able to assign addresses to devices, and in order for us/them to be able to do that properly, qemu needs to provide detailed and accurate information about what buses/controllers/devices are in each machinetype, what controllers/devices are available to add, and what the legal ways of connecting those devices and controllers are.