Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture


From: Eduardo Habkost
Subject: Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
Date: Wed, 17 Jun 2020 13:02:42 -0400

On Wed, Jun 17, 2020 at 06:43:20PM +0200, Laszlo Ersek wrote:
> On 06/17/20 18:14, Laszlo Ersek wrote:
> > On 06/17/20 15:46, Dr. David Alan Gilbert wrote:
> >> * Laszlo Ersek (lersek@redhat.com) wrote:
> >>> On 06/16/20 19:14, Guilherme Piccoli wrote:
> >>>> Thanks Gerd, Dave and Eduardo for the prompt responses!
> >>>>
> >>>> So, I understand that when we use "-host-physical-bits", we are
> >>>> passing the *real* number to the guest, correct? In that case we
> >>>> can trust that the guest physbits match the true host physbits.
> >>>>
> >>>> What if we then have OVMF rely on the physbits *iff*
> >>>> "-host-phys-bits" is used (which is the default in RH and a possible
> >>>> machine configuration in the libvirt XML on Ubuntu), and have OVMF
> >>>> fall back to 36-bit otherwise?
> >>>
> >>> I've now read the commit message on QEMU commit 258fe08bd341d, and the
> >>> complexity is simply stunning.
> >>>
> >>> Right now, OVMF calculates the guest physical address space size from
> >>> various range sizes (such as hotplug memory area end, default or
> >>> user-configured PCI64 MMIO aperture), and derives the minimum suitable
> >>> guest-phys address width from that address space size. This width is
> >>> then exposed to the rest of the firmware with the CPU HOB (hand-off
> >>> block), which in turn controls how the GCD (global coherency domain)
> >>> memory space map is sized. Etc.
> >>>
> >>> If QEMU can provide a *reliable* GPA width, in some info channel (CPUID
> >>> or even fw_cfg), then the above calculation could be reversed in OVMF.
> >>> We could take the width as a given (-> produce the CPU HOB directly),
> >>> plus calculate the *remaining* address space between the GPA space size
> >>> given by the width, and the end of the memory hotplug area. If the
> >>> "remaining size" were negative, then obviously QEMU would have been
> >>> misconfigured, so we'd halt the boot. Otherwise, the remaining area
> >>> could be used as PCI64 MMIO aperture (PEI memory footprint of DXE page
> >>> tables be darned).
> >>>
> >>>> Now, regarding the problem of whether or not to trust the guests'
> >>>> physbits, I think it's an orthogonal discussion to some extent. It'd
> >>>> be nice to have that check, and as Eduardo said, prevent migration in
> >>>> such cases. But it doesn't really prevent the big OVMF PCI64 aperture
> >>>> if we only increase the aperture _when "-host-physical-bits" is used_.
> >>>
> >>> I don't know what exactly those flags do, but I doubt they are clearly
> >>> visible to OVMF in any particular way.
> >>
> >> The firmware should trust whatever it reads from the CPUID, and thus
> >> whatever it is told by qemu; if qemu is doing the wrong thing there,
> >> then that's our problem and we need to fix it in qemu.
> > 
> > This sounds good in practice, but -- as Gerd too has stated, to my
> > understanding -- it has potential to break existing usage.
> > 
> > Consider assigning a single device with a 32G BAR -- right now that's
> > supposed to work, without the X-PciMmio64Mb OVMF knob, on even the "most
> > basic" hardware (36-bit host phys address width, and EPT supported). If
> > OVMF suddenly starts trusting the CPUID from QEMU, and that results in a
> > GPA width of 40 bits (i.e. new OVMF is run on old QEMU), then the big
> > BAR (and other stuff too) could be allocated from GPA space that EPT is
> > actually able to deal with. --> regression for the user.
> 
> s/able/unable/, sigh. :/

I was confused for a while, thanks for the clarification.  :)
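
(Aside: a rough sketch of the "reversed" calculation quoted above, with
invented names -- this is not actual OVMF code, just the arithmetic as I
read it:

    #include <stdint.h>

    /*
     * Take the guest-phys address width as authoritative and size the
     * 64-bit PCI MMIO aperture from whatever is left above the memory
     * hotplug area.  Widths of interest are 36..52, so the shift is safe.
     */
    static uint64_t
    Pci64ApertureSize (unsigned PhysAddrWidth, uint64_t MemHotplugAreaEnd)
    {
      uint64_t GpaSpaceSize = 1ULL << PhysAddrWidth;

      if (GpaSpaceSize < MemHotplugAreaEnd) {
        return 0;    /* QEMU misconfigured -- the firmware would halt */
      }
      return GpaSpaceSize - MemHotplugAreaEnd;
    }

i.e. the width becomes the input and the aperture the output, instead of
the other way around.)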

So, I'm trying to write down which additional guarantees we want
to give to guests, exactly.  I don't want the documentation to
reference "host physical address bits", but actual behavior we
don't emulate.

What does "unable to deal with" means in this specific case?  I
remember MAXPHYADDR mismatches make EPT treatment of of reserved
bits not be what guests would expect from bare metal, but can
somebody point out to the specific guest-visible VCPU behavior
that would cause a regression in OVMF?  Bonus points if anybody
can find the exact Intel SDM paragraph we fail to implement.
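
For reference, here is roughly what a guest derives from CPUID and then
expects the MMU to enforce -- a minimal sketch assuming GCC/Clang on
x86-64, not tied to any particular guest OS:

    #include <cpuid.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            return 1;                        /* leaf not available */
        }
        /* EAX[7:0]: physical address bits (MAXPHYADDR, at most 52) */
        unsigned int maxphyaddr = eax & 0xff;

        /*
         * Bits 51:MAXPHYADDR of a paging-structure entry are reserved;
         * setting any of them should raise #PF with the RSVD error-code
         * bit on bare metal.  The mismatch above is, as I understand it,
         * that with EPT the hardware applies the *host* width here, so
         * the fault the guest expects may not materialize.
         */
        uint64_t rsvd_mask = ((1ULL << 52) - 1) & ~((1ULL << maxphyaddr) - 1);

        printf("MAXPHYADDR=%u reserved PTE bit mask=0x%016llx\n",
               maxphyaddr, (unsigned long long)rsvd_mask);
        return 0;
    }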

> 
> > 
> > Sometimes I can tell users "hey given that you're building OVMF from
> > source, or taking it from a 3rd party origin anyway, can you just run
> > upstream QEMU too", but most of the time they just want everything to
> > continue working on a 3 year old Ubuntu LTS release or whatever. :/
> > 

Agreed.  It wouldn't be reasonable to ask guest software to
unconditionally trust the data we provide to it after we provided
incorrect data to guests for [*checks git log*] 13 years.

-- 
Eduardo



