From: Marcel Apfelbaum
Subject: Re: [Qemu-devel] [PATCH] hw/pci-host: Fix x86 Host Bridges 64bit PCI hole
Date: Thu, 19 Oct 2017 14:29:28 +0300
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

Hi Laszlo,

On 19/10/2017 13:41, Laszlo Ersek wrote:
On 10/19/17 11:30, Marcel Apfelbaum wrote:

Hi Laszlo,

On 18/10/2017 14:40, Laszlo Ersek wrote:
Hi Marcel,

On 10/18/17 11:58, Marcel Apfelbaum wrote:
Currently there is no MMIO range over 4G
reserved for PCI hotplug. Since the 32bit PCI hole
depends on the number of cold-plugged PCI devices
and other factors, it is very possible that it is too small
to hotplug PCI devices with large BARs.

Fix it by reserving all the address space after
RAM (and the reserved space for RAM hotplug),
but no more than 40 bits. The latter seems to be
QEMU's magic number for the guest physical CPU addressable
bits and is safe with respect to migration.

Note this is a regression, since previous QEMU versions had
some range reserved for 64bit PCI hotplug.

Signed-off-by: Marcel Apfelbaum <address@hidden>
---
   hw/i386/pc.c              | 22 ++++++++++++++++++++++
   hw/pci-host/piix.c        | 10 ++++++++++
   hw/pci-host/q35.c         |  9 +++++++++
   include/hw/i386/pc.h      | 10 ++++++++++
   include/hw/pci-host/q35.h |  1 +
   5 files changed, 52 insertions(+)
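
(As a rough illustration of the approach in the commit message -- reserve
everything from the end of RAM, including the RAM-hotplug area, up to the
40-bit limit -- here is a paraphrased sketch; the helper, macro, and
parameter names are illustrative, not the exact patch code:)

    #include <stdint.h>

    /* same idea as ROUND_UP() in qemu/osdep.h, for a power-of-two d */
    #define SKETCH_ROUND_UP(n, d)  (((n) + (d) - 1) & ~((d) - 1))
    #define SKETCH_PCI_HOLE64_END  (1ULL << 40)  /* 40-bit guest-physical limit */

    /* The 64bit hole starts right after RAM plus the RAM-hotplug
     * reservation, GiB-aligned; it ends at the 40-bit boundary unless
     * the firmware already programmed BARs beyond that. */
    static uint64_t sketch_pci_hole64_start(uint64_t ram_end,
                                            uint64_t ram_hotplug_size)
    {
        return SKETCH_ROUND_UP(ram_end + ram_hotplug_size, 1ULL << 30);
    }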

- What are the symptoms of the problem?


PCI devices with *relatively* large BARs can't be hot-plugged.
The reason is that QEMU does not reserve any MMIO range over 4G and
uses whatever remains of the 32bit PCI window.

If you coldplug enough PCI devices you will end up
with a 64bit PCI window that covers only the cold-plugged PCI
devices, still leaving no room for hotplug.
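
(A hypothetical reproduction sketch; the device, backend and sizes below
are only examples, not taken from the patch or this thread:)

    # start an i440fx guest, then try to hot-plug a device whose
    # BAR needs several GB of MMIO
    qemu-system-x86_64 -M pc -m 4G -monitor stdio \
        -object memory-backend-file,id=hostmem,share=on,mem-path=/dev/shm/ivshmem,size=4G \
        [...]

    (qemu) device_add ivshmem-plain,id=shm0,memdev=hostmem
    # with no 64bit range reserved for hotplug, the guest OS may be
    # unable to assign the device's 4G BAR, so the hot-plug fails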

- What QEMU commit regressed the previous functionality?


  - commit 39848901818 pc: limit 64 bit hole to 2G by default
  shows us QEMU had the 64bit PCI hole, so it is a regression.

(Side remark: this commit was part of release v1.6.0, and at that time
we also had the "etc/pci-info" fw_cfg file.)


  - commit 2028fdf3791 piix: use 64 bit window programmed by guest
  leaves no room for PCI hotplug

(Side remark: part of v1.7.0, from 2013. So, we can call this a
regression, but then it's a very old one.

Of course this is *not* a counter-argument to your patch, or commit
message wording; it's just that I'm surprised that the issue could
remain dormant this long -- usually users report regressions very quickly.)


I also thought I was proposing a new feature, and I was pretty amazed
that I am actually fixing something that worked before.


- How exactly is the regression (and this fix) exposed to the guest?


The nuance is that both of the above commits computed the actual MMIO
range used by cold-plugged devices, but as a side effect
they influenced the remaining PCI window for hot-plugged devices.

What the guest sees is the CRS MMIO range over 4G.

OK, understood.


I'm confused because semi-recently I've done multiple successful
hotplugs, with OVMF, Q35, PCIe root ports, and PCI Express devices
having large (64-bit) MMIO BARs. The examples include ivshmem / Linux
guest (based on your hints) and vfio-pci (XL710 NIC) / Windows Server
2016 guest.

Cool! This is actually great news!
Back to the issue at hand, a few things:
1. It works with Q35 only, but not with the I440FX machines.

OK.

2. The hints are used for the PCIe root ports, meaning for PCI bridges,
    and address a slightly different issue: leaving hotplug space
    for secondary buses, while this patch tackles hotplug
    space for root buses.

OK.



By default, OVMF marks a 32GB area as 64-bit MMIO aperture, for BAR
allocation. This area is above the normal memory (plus reservation for
hotplug memory, if any). It is also GB-aligned. The aperture size can be
changed (currently) with an experimental -fw_cfg switch (the fw_cfg file
is called "opt/ovmf/X-PciMmio64Mb", it states the size of the 64-bit
MMIO aperture as a decimal number of megabytes).
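
(For example, requesting a 64 GB aperture would look like this; the value
is just an example:

    -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536

i.e. 65536 MB = 64 GB.)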

I'll try to explain how I see the "picture":
CRS is created by QEMU and computed *after* the firmware finishes
the PCI BAR allocations.

Yes.

Then *QEMU* decides how big the 64bit PCI hole
will be, not the firmware - see the past commits.

Yes, first the guest-side programming happens, then QEMU (re-)calculates
the ACPI payload once the right ACPI fw_cfg file is selected / read.
This re-calculation (including the CRS) observes the previously
programmed values.


So IMHO you don't need the opt/ovmf/X-PciMmio64Mb, since OVMF
is not deciding this value. It actually should not care at all.

OVMF cannot *not* care. It must internally provide (to other,
platform-independent components in edk2) the 64-bit MMIO aperture that
the BARs can be allocated from.


But maybe I'm totally misunderstanding this patch! Are you saying that
the patch makes the following difference only: when the CRS is
recalculated (i.e., after the PCI enumeration has completed in the
guest, and the ACPI linker/loader fw_cfg files are opened), then you
only grow / round up the 64-bit range?


Exactly.

That would make sense, because a *superset* aperture exposed to the OS
via the _CRS does not invalidate any allocations done by the firmware
earlier. Furthermore, the firmware itself need not learn about the
details of said superset. In this sense I agree that OVMF should not
care at all.


Exactly, the range is a superset.

[snip]

Also, how do you plan to integrate this feature with SeaBIOS?

(Ever since I sent my previous response, I've been glad that I finished
my email with this question -- I really think OVMF and SeaBIOS should
behave uniformly in this regard. So I'm happy about your answer:)

Simple, no need :) . It just works as it should.
SeaBIOS computes everything, *then* QEMU creates the CRS ranges
based on firmware input and other properties.
I think it also works with OVMF out of the box.

OK, so this explains it all, thank you very much -- my question is if
you could spell out those "other properties" in a bit more detail (I
apologize if these should be obvious to me already, from the commit
message and the patch):


"Other properties" was a vague link to max RAM, and things like that.
Nothing specific.

(1) How exactly does the enlarged 64-bit hole remain a *superset* of the
firmware-programmed BARs? Can you point me to a location in the code?


Sure, in the patch itself:
 We first take the firmware-computed ranges:
     pci_bus_get_w64_range(h->bus, &w64);
 Look at *get_pci_hole64_start -> if the firmware set it, we don't touch it.
 And at *get_pci_hole64_end -> we only update the value if it is smaller
                               than what the firmware set.
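
(A minimal sketch of that "only grow" pattern, using QEMU's Range helpers;
the function name and the constant below are illustrative, not the exact
patch code:)

    #include "qemu/osdep.h"
    #include "qemu/range.h"
    #include "hw/pci/pci.h"
    #include "hw/pci/pci_host.h"

    #define SKETCH_PCI_HOLE64_END (1ULL << 40)

    static uint64_t sketch_get_pci_hole64_end(PCIHostState *h)
    {
        Range w64;
        uint64_t value;

        /* the 64bit window the firmware actually programmed into BARs */
        pci_bus_get_w64_range(h->bus, &w64);
        value = range_is_empty(&w64) ? 0 : range_upb(&w64) + 1;

        /* only grow: keep the firmware value if it already reaches past
         * the default end, otherwise extend to the 40-bit end */
        if (value < SKETCH_PCI_HOLE64_END) {
            value = SKETCH_PCI_HOLE64_END;
        }
        return value;
    }

(The *_start side is the other way around, as described above: the base the
firmware programmed is kept, and the RAM-end default is only used when the
firmware left it unset.)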



(2) What happens if we do have more than 1TB RAM?


You get no 64bit "hot-pluggable" reserved range.
The code dealing with that is:

+    if (hole64_start > PCI_HOST_PCI_HOLE64_END) {
+        hole64_start = PCI_HOST_PCI_HOLE64_END;
+    }

And since the end is hard-coded to PCI_HOST_PCI_HOLE64_END
we end up with a zero-size region, which gets discarded.

An interesting case is when we have more than 1T of RAM and the
firmware assigns BARs above that: we would be increasing the
64bit window to include some of the RAM...

A better approach is to check against the actual hole64
end and not the magic value.
I'll send a v3, thanks!
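
(To make that concrete with made-up numbers: say RAM plus the hotplug
reservation ends at 1.5T and the firmware placed a 64bit BAR at 1.6T.
The start gets clamped down to the hard-coded 1T while the end keeps the
firmware value, so the advertised window [1T, 1.6T) would cover 0.5T of
guest RAM; checking the start against the actual hole64 end instead
avoids that.)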

One last note: when pxb/pxb-pcie devices become hot-pluggable
we need to leave some bits for them too. I wouldn't
really want to add fw_cfg files for them too, or
use a complicated fw_cfg scheme, or rewrite
the logic for all firmware if we don't have to.

OK.


64bit PCI window starts from memory(+reserved) end
and finishes at 40bit.

OK, this seems to (partly) answer my question (1) -- can you please
confirm that the

   finishes at 40bit

part does not imply forcefully truncating the CRS at 1TB, in case the
firmware already allocated BARs higher than that?


I confirm:

+    if (s->pci_hole64_fix && value < PCI_HOST_PCI_HOLE64_END) {
+        value = PCI_HOST_PCI_HOLE64_END;
+    }

I touch the value only if it is smaller than what the firmware set.

If you have 1T of RAM you
get nothing for PCI hotplug, which seems fair.

This is fine (as the answer to my question (2)) as long as you mean it
in the following sense: "QEMU will simply *not grow* the MMIO in the
_CRS past 1TB".

When guests with more than 1T of RAM become
frequent, I suppose QEMU's default CPU addressable
bits will change, and we will change this too.

Thanks Marcel, I feel a lot safer now.

Please confirm that the changes can only grow the area exposed in the
_CRS -- i.e., only lower or preserve the base, and only raise or
preserve the absolute limit. I'll shut up then :)


Confirmed and thanks for the review!

Thanks,
Marcel

Thanks!
Laszlo




