Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture

From:	Laszlo Ersek
Subject:	Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
Date:	Thu, 18 Jun 2020 10:29:25 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0 Thunderbird/52.9.1

On 06/17/20 19:02, Eduardo Habkost wrote:
> On Wed, Jun 17, 2020 at 06:43:20PM +0200, Laszlo Ersek wrote:
>> On 06/17/20 18:14, Laszlo Ersek wrote:

>>> Consider assigning a single device with a 32G BAR -- right now that's
>>> supposed to work, without the X-PciMmio64Mb OVMF knob, on even the "most
>>> basic" hardware (36-bit host phys address width, and EPT supported). If
>>> OVMF suddenly starts trusting the CPUID from QEMU, and that results in a
>>> GPA width of 40 bits (i.e. new OVMF is run on old QEMU), then the big
>>> BAR (and other stuff too) could be allocated from GPA space that EPT is
>>> actually able to deal with. --> regression for the user.
>>
>> s/able/unable/, sigh. :/
> 
> I was confused for a while, thanks for the clarification.  :)

Sorry again -- looks like yesterday was already Friday for me...

> So, I'm trying to write down which additional guarantees we want
> to give to guests, exactly.  I don't want the documentation to
> reference "host physical address bits", but actual behavior we
> don't emulate.
> 
> What does "unable to deal with" means in this specific case?  I
> remember MAXPHYADDR mismatches make EPT treatment of of reserved
> bits not be what guests would expect from bare metal, but can
> somebody point out to the specific guest-visible VCPU behavior
> that would cause a regression in OVMF?  Bonus points if anybody
> can find the exact Intel SDM paragraph we fail to implement.

When I first encountered the problem, it wasn't actually related to
64-bit PCI MMIO -- it was much earlier, namely when I worked on enabling
64GiB+ *RAM size* in OVMF. Things fell apart just from the guest memory
side of things. The 64-bit PCI MMIO concern is an extrapolation from
that, not a symptom encountered in practice.

Let me find my original message about the RAM symptom...

... Meh, I can find two discussions related to this, in my personal
archives. One is on one of our internal RH mailing lists, so I'm not
going to quote that here. The other is on the public edk2-devel mailing
list -- but from June 2015, when edk2-devel was still hosted on
sourceforge.net. The sourceforge.net archives are no longer accessible,
and GMANE (the secondary archive) is dead too.

So I'm going to have to attach the one message from the thread that I
feel contains the most information. Start reading at the string "BUG:".

Thanks,
Laszlo

--- Begin Message --- Subject: Re: [edk2] [RFC 4/4] OvmfPkg: PlatformPei: invert MTRR setup in QemuInitializeRam() Date: Wed, 10 Jun 2015 15:03:05 +0200 User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0

On 06/09/15 04:15, Laszlo Ersek wrote:
> On 06/08/15 23:46, Laszlo Ersek wrote:
>> At the moment we work with a UC default MTRR type, and set three memory
>> ranges to WB:
>> - [0, 640 KB),
>> - [1 MB, LowerMemorySize),
>> - [4 GB, 4 GB + UpperMemorySize).
>>
>> Unfortunately, coverage for the third range can fail with a high
>> likelihood. If the alignment of the base (ie. 4 GB) and the alignment of
>> the size (UpperMemorySize) differ, then MtrrLib creates a series of
>> variable MTRR entries, with power-of-two sized MTRR masks. And, it's
>> really easy to run out of variable MTRR entries, dependent on the
>> alignment difference.
>>
>> This is a problem because a Linux guest will loudly reject any high memory
>> that is not covered my MTRR.
>>
>> So, let's follow the inverse pattern (loosely inspired by SeaBIOS):
>> - flip the MTRR default type to WB,
>> - set [0, 640 KB) to WB -- fixed MTRRs have precedence over the default
>>   type and variable MTRRs, so we can't avoid this,
>> - set [640 KB, 1 MB) to UC -- implemented with fixed MTRRs,
>> - set [LowerMemorySize, 4 GB) to UC -- should succeed with variable MTRRs
>>   more likely than the other scheme (due to less chaotic alignment
>>   differences).
>>
>> Effects of this patch can be observed by setting DEBUG_CACHE (0x00200000)
>> in PcdDebugPrintErrorLevel.
>>
>> BUG: Although the MTRRs look good to me in the OVMF debug log, I still
>> can't boot >= 64 GB guests with this. Instead of the complaints mentioned
>> above, the Linux guest apparently spirals into an infinite loop (on KVM),
>> or hangs with no CPU load (on TCG).
> 
> No, actually there is no bug in this patch (so s/RFC/PATCH/). I did more
> testing and these are the findings:
> - I can reproduce the same issue on KVM with SeaBIOS guests.
> - The exact symptoms are that as soon as the highest guest-phys address
>   is >= 64 GB, then the guest kernel doesn't boot. It gets stuck
>   somewhere after hitting Enter in grub.
> - Normally 3 GB of the guest RAM is mapped under 4 GB in guest-phys
>   address space, then there's a 1 GB PCI hole, and the rest is above
>   4 GB. This means that a 63 GB guest can be started (because 63 - 3 + 4
>   == 64), but if you add just 1 MB more, it won't boot.
> - (This was the big discovery:) I flipped the "ept" parameter of the
>   kvm_intel module on my host to N, and then things started to work. I
>   just booted a 128 GB Linux guest with this patchset. (I have 4 GB
>   RAM in my host, plus approx 250 GB swap.) The guest could see it all.
> - The TCG boot didn't hang either; I just couldn't wait earlier for
>   network initialization to complete.
> 
> I'm CC'ing Paolo for help with the EPT question. Other than that, this
> series is functional. (For QEMU/KVM at least; Xen will likely need more
> fixes from others.)

We have a root cause, it seems. The issue is that the processor in my
laptop, on which I tested, has only 36 bits for physical addresses:

  $ grep 'address sizes' /proc/cpuinfo
  address sizes   : 36 bits physical, 48 bits virtual
  ...

Which matches where the problem surfaces (64 GB guest-phys address
space) with hw-supported nested paging (EPT) enabled on the host.

In order to confirm this, a colleague of mine gave me access to a server
with 96 GB of RAM, and:

  address sizes : 46 bits physical, 48 bits virtual

On this host I booted a 72 GB OVMF guest on QEMU/KVM, with EPT enabled,
and according to the guest dmesg, the guest saw it all.

  Memory: 74160924K/75493820K available (7735K kernel code, 1149K
  rwdata, 3340K rodata, 1500K init, 1524K bss, 1332896K reserved, 0K
  cma-reserved)

Maoming: since you reported this issue, please confirm that the patch
series resolves it for you as well. In that case, I'll repost the series
with "PATCH" as subject-prefix instead of "RFC", and I'll drop the BUG
note from the last commit message.

Thanks
Laszlo

>> Cc: Maoming <maoming.maoming@huawei.com>
>> Cc: Huangpeng (Peter) <peter.huangpeng@huawei.com>
>> Cc: Wei Liu <wei.liu2@citrix.com>
>> Contributed-under: TianoCore Contribution Agreement 1.0
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>> ---
>>  OvmfPkg/PlatformPei/MemDetect.c | 43 
>> +++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 39 insertions(+), 4 deletions(-)
>>
>> diff --git a/OvmfPkg/PlatformPei/MemDetect.c 
>> b/OvmfPkg/PlatformPei/MemDetect.c
>> index 3ceb142..cceab22 100644
>> --- a/OvmfPkg/PlatformPei/MemDetect.c
>> +++ b/OvmfPkg/PlatformPei/MemDetect.c
>> @@ -194,6 +194,8 @@ QemuInitializeRam (
>>  {
>>    UINT64                      LowerMemorySize;
>>    UINT64                      UpperMemorySize;
>> +  MTRR_SETTINGS               MtrrSettings;
>> +  EFI_STATUS                  Status;
>>  
>>    DEBUG ((EFI_D_INFO, "%a called\n", __FUNCTION__));
>>  
>> @@ -214,12 +216,45 @@ QemuInitializeRam (
>>      }
>>    }
>>  
>> -  MtrrSetMemoryAttribute (BASE_1MB, LowerMemorySize - BASE_1MB, 
>> CacheWriteBack);
>> +  //
>> +  // We'd like to keep the following ranges uncached:
>> +  // - [640 KB, 1 MB)
>> +  // - [LowerMemorySize, 4 GB)
>> +  //
>> +  // Everything else should be WB. Unfortunately, programming the inverse 
>> (ie.
>> +  // keeping the default UC, and configuring the complement set of the 
>> above as
>> +  // WB) is not reliable in general, because the end of the upper RAM can 
>> have
>> +  // practically any alignment, and we may not have enough variable MTRRs to
>> +  // cover it exactly.
>> +  //
>> +  if (IsMtrrSupported ()) {
>> +    MtrrGetAllMtrrs (&MtrrSettings);
>>  
>> -  MtrrSetMemoryAttribute (0, BASE_512KB + BASE_128KB, CacheWriteBack);
>> +    //
>> +    // MTRRs disabled, fixed MTRRs disabled, default type is uncached
>> +    //
>> +    ASSERT ((MtrrSettings.MtrrDefType & BIT11) == 0);
>> +    ASSERT ((MtrrSettings.MtrrDefType & BIT10) == 0);
>> +    ASSERT ((MtrrSettings.MtrrDefType & 0xFF) == 0);
>>  
>> -  if (UpperMemorySize != 0) {
>> -    MtrrSetMemoryAttribute (BASE_4GB, UpperMemorySize, CacheWriteBack);
>> +    //
>> +    // flip default type to writeback
>> +    //
>> +    SetMem (&MtrrSettings.Fixed, sizeof MtrrSettings.Fixed, 0x06);
>> +    ZeroMem (&MtrrSettings.Variables, sizeof MtrrSettings.Variables);
>> +    MtrrSettings.MtrrDefType |= BIT11 | BIT10 | 6;
>> +    MtrrSetAllMtrrs (&MtrrSettings);
>> +
>> +    //
>> +    // punch holes
>> +    //
>> +    Status = MtrrSetMemoryAttribute (BASE_512KB + BASE_128KB,
>> +               SIZE_256KB + SIZE_128KB, CacheUncacheable);
>> +    ASSERT_EFI_ERROR (Status);
>> +
>> +    Status = MtrrSetMemoryAttribute (LowerMemorySize,
>> +               SIZE_4GB - LowerMemorySize, CacheUncacheable);
>> +    ASSERT_EFI_ERROR (Status);
>>    }
>>  }
>>  
>>
> 


------------------------------------------------------------------------------
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/edk2-devel

--- End Message ---

[Prev in Thread]

Current Thread

[Next in Thread]

Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture, (continued)

Prev by Date: Re: [PATCH v9 0/5] vhost-user block device backend implementation
Next by Date: [PATCH v2 2/2] iotests: filter few more luks specific create options
Previous by thread: Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
Next by thread: Re: ovmf / PCI passthrough impaired due to very limiting PCI64 aperture
Index(es):
- Date
- Thread