From: Marcel Apfelbaum
Subject: Re: [Qemu-ppc] [RFC PATCH qemu] spapr_pci: Create PCI-express root bus by default
Date: Wed, 14 Dec 2016 20:26:48 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1

On 12/14/2016 04:46 AM, David Gibson wrote:
On Tue, Dec 13, 2016 at 02:25:44PM +0200, Marcel Apfelbaum wrote:
On 12/07/2016 06:42 PM, Andrea Bolognani wrote:
[Added Marcel to CC]



Hi,

Sorry for the late reply.

On Wed, 2016-12-07 at 15:11 +1100, David Gibson wrote:
Is the difference between q35 and pseries guests with
respect to PCIe only relevant when it comes to assigned
devices, or in general? I'm asking this because you seem to
focus entirely on assigned devices.

Well, in a sense that's up to us.  The only existing model we have is
PowerVM, and PowerVM only does device passthrough, no emulated
devices.  PAPR doesn't really distinguish one way or the other, but
it's written from the perspective of assuming that all PCI devices
correspond to physical devices on the host.

Okay, that makes sense.

On q35,
you'd generally expect physically separate (different slot) devices to
appear under separate root complexes.

This part I don't get at all, so please bear with me.

The way I read it you're claiming that eg. a SCSI controller
and a network adapter, being physically separate and assigned
to separate PCI slots, should have a dedicated PCIe Root
Complex each on a q35 guest.


Not a PCIe Root Complex, but a PCIe Root port.

Ah, sorry.  As I said, I've been pretty confused by all the terminology.

Right, my understanding was that if the devices were slotted, rather
than integrated, each one would sit under a separate root complex, the
root complex being a pseudo PCI to PCI bridge.

I assume "slotted" means "plugged into a slot that's not one
of those provided by pcie.0" or something along those lines.

More on the root complex bit later.

That doesn't match with my experience, where you would simply
assign them to separate slots of the default PCIe Root Bus
(pcie.0), eg. 00:01.0 and 00:02.0.
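
For readers less used to the notation: the addresses above are bus:device.function triples written in hex, so 00:01.0 and 00:02.0 are two different slots on bus 0. A minimal Python sketch of decoding them follows; the helper name parse_bdf is made up for illustration and is not a libvirt or QEMU function.

```python
def parse_bdf(addr):
    """Split a 'bus:device.function' string (hex fields) into integers."""
    bus_str, rest = addr.split(":")
    dev_str, fn_str = rest.split(".")
    bus, dev, fn = int(bus_str, 16), int(dev_str, 16), int(fn_str, 16)
    # A PCI bus has 32 device slots, each with up to 8 functions.
    if not (0 <= bus <= 0xFF and 0 <= dev <= 0x1F and 0 <= fn <= 7):
        raise ValueError("out-of-range PCI address: " + addr)
    return bus, dev, fn


print(parse_bdf("00:01.0"))  # (0, 1, 0)
print(parse_bdf("00:02.0"))  # (0, 2, 0)
```

Both results share bus 0, which is the "separate slots of the default root bus" layout described above.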

The qemu default, or the libvirt default?

I'm talking about the libvirt default, which is supposed to
follow Marcel's PCIe Guidelines.

I think this represents
treating the devices as though they were integrated devices in the
host bridge.  I believe on q35 they would not be hotpluggable


Correct. Please have a look at the new document
regarding PCIe, docs/pcie.txt, and the corresponding presentations.


Yeah, that's indeed not quite what libvirt would do by
default: in reality, there would be an ioh3420 between the
pcie.0 slots and each device exactly to enable hotplug.

but on
pseries they would be (because we don't use the standard hot plug
controller).

We can account for that in libvirt and avoid adding the
extra ioh3420s (or rather the upcoming generic PCIe Root
Ports) for pseries guests.

Maybe you're referring to the fact that you might want to
create multiple PCIe Root Complexes in order to assign the
host devices to separate guest NUMA nodes? How is creating
multiple PCIe Root Complexes on q35 using pxb-pcie different
than creating multiple PHBs using spapr-pci-host-bridge on
pseries?

Uh.. AIUI the root complex is the PCI to PCI bridge under which PCI-E
slots appear.  PXB is something different - essentially different host
bridges as you say (though with some weird hacks to access config
space, which make it dependent on the primary bus in a way which spapr
PHBs are not).

I'll admit I'm pretty confused myself about the exact distinction
between root complex, root port and upstream and downstream ports.

I think we both need to get our terminology straight :)
I'm sure Marcel will be happy to point us in the right
direction.

My understanding is that the PCIe Root Complex is the piece
of hardware that exposes a PCIe Root Bus (pcie.0 in QEMU);

right

Oh.. I wasn't as clear as I'd like to be on what the root complex is.
But I thought the root complex did have some guest visible presence in
the PCI tree.  What you're describing here seems equivalent to what
I'd call the PCI Host Bridge (== PHB).


Yes, a Root Complex is a type of Host Bridge in the sense that it bridges
between the CPU/Memory Controller and the PCI subsystem.

PXBs can be connected to slots in pcie.0 to create more buses
that behave, for the most part, like pcie.0 and are mostly
useful to bind devices to specific NUMA nodes.

right

 Same applies
to legacy PCI with the pxb (instead of pxb-pcie) device.


pxb should not be used for PCIe machines, only for legacy PCI ones.

Noted.  And not for pseries at all.  Note that because we have a
para-virtualized platform (all PCI config access goes via hypercalls)
the distinction between PCI and PCI-E is much blurrier than in the x86
case.


OK

In a similar fashion, PHBs are the hardware thingies that
expose a PCI Root Bus (pci.0 and so on), the main difference
being that they are truly independent: so a q35 guest will
always have a "primary" PCIe Root Bus and (optionally) a
bunch of "secondary" ones, but the same will not be the case
for pseries guests.

OK

I don't think the difference is that important though, at
least from libvirt's point of view: whether you're creating
a pseries guest with two PHBs, or a q35 guest with its
built-in PCIe Root Complex and an extra PCIe Expander Bus,
you will end up with two "top level" buses that you can plug
more devices into.

I agree

 If we had spapr-pcie-host-bridge, we
could treat them mostly the same - with caveats such as the
one described above, of course.

Whereas on pseries they'll
appear as siblings on a virtual bus (which makes no physical sense for
point-to-point PCI-E).

What is the virtual bus in question? Why would it matter
that they're siblings?

On pseries it won't.  But my understanding is that libvirt won't
create them that way on q35 - instead it will insert the RCs / P2P
bridges to allow them to be hotplugged.  Inserting that bridge may
confuse pseries guests which aren't expecting it.

libvirt will automatically add PCIe Root Ports to make the
devices hotpluggable on q35 guests, yes. But, as mentioned
above, we can teach it not to.

I'm possibly missing the point entirely, but so far it
looks to me like there are different configurations you
might want to use depending on your goal, and both q35
and pseries give you comparable tools to achieve such
configurations.

I suppose we could try treating all devices on pseries as though they
were chipset builtin devices on q35, which will appear on the root
PCI-E bus without root complex.

Actually the root PCIe bus is part of a root complex.

So I think what I meant above was "root port".

Yes

  The point is that
there won't be the (pseudo) PCI to PCI bridge appearing above the
device that there typically would be on q35.


Understood, on one hand we have no PCIe Root Ports, on the other
hand the devices are not integrated - they can be hot-plugged
by platform specific means.

 But I suspect that's likely to cause
trouble with hotplug, and it will certainly need different address
allocation from libvirt.

PCIe Integrated Endpoint Devices are not hotpluggable on
q35, that's why libvirt will follow QEMU's PCIe topology
recommendations and place a PCIe Root Port between them;
I assume the same could be done for pseries guests as
soon as QEMU grows support for generic PCIe Root Ports,
something Marcel has already posted patches for.

Here you've hit on it.  No, we should not do that for pseries,
AFAICT.  PAPR doesn't really have the concept of integrated endpoint
devices, and all devices can be hotplugged via the PAPR mechanisms
(and none can via the PCI-E standard hotplug mechanism).


This seems to be interfering with the PCIe spec:
  1. No PCIe root ports ? those are part of the spec.

Yes, I dare say it does interfere with the spec.  Nonetheless, there
it is.

  2. Only integrated devices ? hotplug is not PCIe native?

That's correct.  PAPR supplies its own hotplug mechanism, which works
for both PCI and PCI-E devices, which is different from the standard
PCI-E hotplug mechanism.


Ok the hw is PCIe, but configuration/hot-plug
is platform specific.


Cool, I get it now.

Again, sorry for clearly misunderstanding your explanation,
but I'm still not seeing the issue here. I'm sure it's very
clear in your mind, but I'm afraid you're going to have to
walk me through it :(

I wish it were entirely clear in my mind.  Like I say I'm still pretty
confused by exactly what the root complex entails.

Same here, but this back-and-forth is helping! :)

[...]
What about virtio devices, which present themselves either
as legacy PCI or PCIe depending on the kind of slot they
are plugged into? Would they show up as PCIe or legacy PCI
on a PCIe-enabled pseries guest?

That we'd have to address on the qemu side with some

Unfinished sentence?

[...]
Is the Root Complex not currently exposed? The Root Bus
certainly is,

Like I say, I'm fairly confused myself, but I'm pretty sure that Root
Complex != Root Bus.  The RC sits under the root bus IIRC.. or
possibly it consists of the root bus plus something under it as well.


The Root complex includes the PCI bus, some configuration registers if
needed, provides access to the configuration space, translates relevant CPU
reads/writes to PCI(e) transactions...

Do those configuration registers appear within PCI space, or outside
it (e.g. raw MMIO or PIO registers)?


Root Complexes use MMIO to expose the PCI configuration space,
they call it ECAM (enhanced configuration access mechanism) or MMConfig.
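
To make the ECAM layout concrete, here is a small Python sketch of how a bus/device/function/register tuple maps to an offset inside the memory-mapped window. The bit layout is the standard PCIe one; the function name and the base address are assumptions chosen purely for illustration.

```python
def ecam_offset(bus, dev, fn, reg):
    """Byte offset of a config register inside an ECAM window."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8 and 0 <= reg < 4096):
        raise ValueError("field out of range")
    # Each function gets a 4 KiB slice: bus in bits 27..20,
    # device in bits 19..15, function in bits 14..12.
    return (bus << 20) | (dev << 15) | (fn << 12) | reg


ECAM_BASE = 0xB0000000  # assumed base address, for illustration only

# Register 0x100 of device 00:01.0
print(hex(ECAM_BASE + ecam_offset(0, 1, 0, 0x100)))  # 0xb0008100
```

A full window covering 256 buses is therefore 256 MiB, which is one reason a platform might prefer another access method, such as hypercalls.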

Now... from what Laine was saying it sounds like more of the
differences between PCI-E placement and PCI placement may be
implemented by libvirt, rather than qemu, than I realized.  So possibly we do
want to make the bus be PCI-E on the qemu side, but have libvirt use
the vanilla-PCI placement guidelines rather than PCI-E for pseries.

Basically the special casing I was mentioning earlier.

That looks complicated.. I wish I knew more about the pseries
PCIe stuff; does anyone know where I can get the info? (besides 'google
it'...)

Andrea gave a pointer to the PAPR document.  Unfortunately how much it
covers here I'm not sure about.  In particular I'm not sure how much
of this is actually PAPR mandated, and how much is just copying
PowerVM as the pre-existing PAPR implementation.


Understood, I'll have a look, but with low expectations :).
Anyway, by now I do have some basic notions of spapr, thanks!



[...]
Maybe I just don't quite get the relationship between Root
Complexes and Root Buses, but I guess my question is: what
is preventing us from simply doing whatever a
spapr-pci-host-bridge is doing in order to expose a legacy
PCI Root Bus (pci.*) to the guest, and create a new
spapr-pcie-host-bridge that exposes a PCIe Root Bus (pcie.*)
instead?

Hrm, the suggestion of providing both a vanilla-PCI and PCI-E host
bridge came up before.  I think one of us spotted a problem with that,
but I don't recall what it was now.  I guess one is how libvirt would
map its stupid-fake-domain-numbers to which root bus to use.

This would be a weird configuration, I never heard of something like that
on a bare metal machine, but I never worked on pseries, who knows...

Which aspect?  Having multiple independent host bridges is perfectly
reasonable - x86 just doesn't do it well for rather stupid historical
reasons.


I agree about the multiple host bridges; it is actually what the pxb/pxb-pcie
devices (kind of) do.

I was talking about having one PCI PHB and another PHB which is PCI Express.
I was referring to one system having both PCI and PCIe PHBs.

PAPR is quite explicitly a paravirtual platform, you cannot have a
bare-metal PAPR machine.


Understood.

That issue is relevant whether or not we have different PHB
flavors, isn't it? As soon as multiple PHBs are present in
a pseries guest, multiple PCI domains will be there as well,
and we need to handle that somehow.

On q35, on the other hand, I haven't been able to find a way
to create extra PCI domains: adding a pxb-pcie certainly
didn't work the same as adding an extra spapr-pci-host-bridge
in that regard.


Indeed, all the pxb-pcie devices "share" the same domain.

[...]
Maybe we should have a different model, specific to
pseries guests, instead, so that all PHBs would look the
same in the guest XML, something like

   <controller type='pci' model='phb-pcie'/>

It would require shuffling libvirt's PCI address allocation
code around quite a bit, but it should be doable. And if it
makes life easier for our users, then it's worth it.

Hrm.  So my first inclination would be to stick with the generic
names, and map those to creating new pseries host bridges on pseries
guests.  I would have thought that would be the easier option for
users.  But I may not have realized all the implications yet.

You're probably right, but I can't immediately see how we
would make the user aware of which PHB is which. Maybe we
could add some sub-element or extra attribute...

Anyway, we should not focus too much on this specific bit
at the moment, deciding on a specific XML is mostly
bikeshedding :)

[...]
* Eduardo's work, which you mentioned, is going to be very
   beneficial in the long run; in the short run, Marcel's
   PCIe device placement guidelines, a document that has seen
   contributions from QEMU, OVMF and libvirt developers, have
   been invaluable to improve libvirt's PCI address allocation
   logic. So we're already doing better, and more improvements
   are on the way :)

Right.. so here's the thing, I strongly suspect that Marcel's
guidelines will not be correct for pseries.

We should make the document stick for all PCIe archs, if we need
to modify it, let's do it.

Yeah, I'm not sure it's possible to cover both x86 and pseries at
once.  As you noted, it looks rather like PAPR is contradicting the
PCI-E spec.


At this point I agree that PAPR PCI-E is not really "by the book",
so the PCIe guidelines will not work for PAPR.

Again, one possible option here is to continue to treat pseries as
having a vanilla-PCI bus, but with a special flag saying that it's
magically able to connect PCI-E devices.


A PCIe bus supporting PCI devices is strange (QEMU allows it ...),
but a PCI bus supporting PCIe devices is hard to "swallow".

I would say maybe make it a special case of a PCIe bus with different rules.
It can derive from the PCIe bus class and override the usual behavior
with PAPR-specific rules, which happen to be similar to the PCI bus rules.
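
Roughly, the idea could look like the following Python sketch. It is only a model of the proposal: the class and attribute names are hypothetical and do not correspond to QEMU's actual QOM types.

```python
class PCIBus:
    """Conventional PCI bus: devices sit directly in slots."""
    extended_config_space = False    # only 256 bytes of config space
    hotplug_needs_root_port = False  # no PCIe Root Port above a device


class PCIeBus(PCIBus):
    """PCIe root bus with the usual (x86-style) placement rules."""
    extended_config_space = True
    hotplug_needs_root_port = True   # hotpluggable devices go behind a port


class PAPRPCIeBus(PCIeBus):
    """Hypothetical PAPR flavor: still PCIe, but placed like plain PCI."""
    hotplug_needs_root_port = False  # hotplug uses the PAPR mechanism


def placement(bus_cls):
    """Describe where a hotpluggable device would be placed."""
    if bus_cls.hotplug_needs_root_port:
        return "behind a root port"
    return "directly on the bus"


for cls in (PCIBus, PCIeBus, PAPRPCIeBus):
    print(cls.__name__, cls.extended_config_space, placement(cls))
```

The PAPR variant keeps the PCIe property that matters to the guest (extended config space) while inheriting nothing of the root port requirement.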

Adding Eduardo, he is currently working on a way to properly expose
the information on what devices can be plugged on what bus/slot.

  I'm not sure if they'll
be definitively wrong, or just different enough from PowerVM that it
might confuse guests, but either way.


I really need to understand how it would confuse the guests,
it does not deviate from the PCIe spec, it only adds some restrictions.

Because the guests are written to work with PowerVM, which seems to do
something other than the PCIe spec...

Those guidelines have been developed with q35/mach-virt in
mind[1], so I wouldn't at all be surprised if they didn't
apply to pseries guests. And in fact, we just found out
that they don't!

My point is that we could easily create a similar document
for pseries guests, and then libvirt will be able to pick
up whatever recommendations we come up with just like it
did for q35/mach-virt.

Can you send me a link to that
document though, which might help me figure this out.

It's docs/pcie.txt in QEMU's git repository.


[1] Even though I now realize that this is not immediately
    clear by looking at the document itself


I kind of miss the core issue, what is the main problem?

The core problem is that at this stage it's not possible to attach
PCIe devices (either emulated or passthrough) to a pseries guest.  We
need to be able to do that - specifically allowing the guest to access
PCIe extended config space.
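
For context, conventional PCI gives each function 256 bytes of config space, while PCIe extends that to 4096 bytes, with extended capabilities starting at offset 0x100. A minimal Python sketch of the distinction (the helper name is made up for illustration):

```python
PCI_CONFIG_SIZE = 256     # conventional PCI configuration space, in bytes
PCIE_CONFIG_SIZE = 4096   # PCIe extended configuration space, in bytes


def needs_extended_access(offset):
    """True if a register offset lies beyond conventional config space."""
    if not 0 <= offset < PCIE_CONFIG_SIZE:
        raise ValueError("offset outside PCIe config space")
    return offset >= PCI_CONFIG_SIZE


print(needs_extended_access(0x10))   # False: within the standard header
print(needs_extended_access(0x100))  # True: first extended capability
```

Anything for which this returns True is unreachable by a guest that only sees a vanilla-PCI bus.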


Do we have in QEMU the code to expose the Extended Config Space
by other means instead of MMIO (used by x86)?

The PAPR virtualized PCI interfaces definitely do allow a guest to
access extended config space, but in most other regards they behave
more like vanilla-PCI than PCIe.


So the problem is related to how to expose the information to libvirt?
If yes, maybe Eduardo can help.


Thanks,
Marcel




