Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
From: Joao Martins
Subject: Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
Date: Thu, 24 Feb 2022 20:34:40 +0000
On 2/24/22 20:12, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
>> On 2/24/22 19:54, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>>>> +                                           uint64_t pci_hole64_size)
>>>>>>>>>>> +{
>>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>>>> +
>>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>>>> +        return;
>>>>>>>>>>> +    }
>>>>>>>>>>
>>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>>>> It's really about what we present to the guest though,
>>>>>>>>>> isn't it?
>>>>>>>>>
>>>>>>>>> It was the easier catch-all to use cpuid without going into
>>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there; it is only
>>>>>>>>> needed for systems with an IOMMU present.
>>>>>>>>>
>>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>>>
>>>>>>>>> I think so, I can add that. Something like an amd_iommu_exists() helper
>>>>>>>>> in util/vfio-helpers.c which checks if there are any sysfs child entries
>>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>>>>>> hardcoded in the iommu reserved regions since >=4.11 (to latest), I don't
>>>>>>>>> think it's even worth checking that the range exists in:
>>>>>>>>>
>>>>>>>>> /sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>>>
>>>>>>>>> (Also that sysfs ABI is >= 4.11 only)
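(Purely for illustration, a rough sketch of what such an extra check could
look like; the "<start> <end> <type>" line format of reserved_regions and the
helper name are assumptions of mine, not code from this series:)

#include <stdio.h>
#include <stdbool.h>
#include <inttypes.h>

/* Sketch: scan one group's reserved_regions file for an entry covering the
 * AMD HyperTransport window [0xFD00000000, 0xFFFFFFFFFF] just below 1TiB.
 * path would be e.g. "/sys/kernel/iommu_groups/0/reserved_regions". */
static bool ht_range_is_reserved(const char *path)
{
    const uint64_t ht_start = 0xfd00000000ULL, ht_end = 0xffffffffffULL;
    uint64_t start, end;
    char type[32];
    bool found = false;
    FILE *f = fopen(path, "r");

    if (!f) {
        return false;
    }
    /* Each line is assumed to be "<start> <end> <type>" with hex addresses */
    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s", &start, &end, type) == 3) {
        if (start <= ht_start && end >= ht_end) {
            found = true;
            break;
        }
    }
    fclose(f);
    return found;
}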
>>>>>>>>
>>>>>>>> Here's what I have staged in local tree, to address your comment.
>>>>>>>>
>>>>>>>> Naturally the first chunk is what's affected by this patch; the rest is a
>>>>>>>> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>>>>>> all the tests and whatnot.
>>>>>>>>
>>>>>>>> I am not entirely sure this is the right place to put such a helper; open
>>>>>>>> to suggestions. Regarding the naming of the helper, I tried to follow the
>>>>>>>> rest of the file's style.
>>>>>>>>
>>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
>>>>>>>> --- a/hw/i386/pc.c
>>>>>>>> +++ b/hw/i386/pc.c
>>>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>                                            uint64_t pci_hole64_size)
>>>>>>>>  {
>>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>> -    uint32_t eax, vendor[3];
>>>>>>>>
>>>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>> +    if (!qemu_amd_iommu_is_present()) {
>>>>>>>>          return;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>>>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
>>>>>>>> --- a/include/qemu/osdep.h
>>>>>>>> +++ b/include/qemu/osdep.h
>>>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>>>>>>>>   */
>>>>>>>>  size_t qemu_get_host_physmem(void);
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * qemu_amd_iommu_is_present:
>>>>>>>> + *
>>>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
>>>>>>>> + * is present.
>>>>>>>> + *
>>>>>>>> + */
>>>>>>>> +bool qemu_amd_iommu_is_present(void);
>>>>>>>> +
>>>>>>>>  /*
>>>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
>>>>>>>>   * for the current thread.
>>>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>>>>>> index f2be7321c59f..54cef21217c4 100644
>>>>>>>> --- a/util/oslib-posix.c
>>>>>>>> +++ b/util/oslib-posix.c
>>>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>>>>>>>>  #endif
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>> +
>>>>>>>> +bool qemu_amd_iommu_is_present(void)
>>>>>>>> +{
>>>>>>>> +    bool found = false;
>>>>>>>> +#ifdef CONFIG_LINUX
>>>>>>>> +    struct dirent *entry;
>>>>>>>> +    char *path;
>>>>>>>> +    DIR *dir;
>>>>>>>> +
>>>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
>>>>>>>> +    dir = opendir(path);
>>>>>>>> +    if (!dir) {
>>>>>>>> +        g_free(path);
>>>>>>>> +        return found;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    do {
>>>>>>>> +        entry = readdir(dir);
>>>>>>>> +        if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
>>>>>>>> +            found = true;
>>>>>>>> +            break;
>>>>>>>> +        }
>>>>>>>> +    } while (entry);
>>>>>>>> +
>>>>>>>> +    g_free(path);
>>>>>>>> +    closedir(dir);
>>>>>>>> +#endif
>>>>>>>> +    return found;
>>>>>>>> +}
>>>>>>>
>>>>>>> why are we checking whether an AMD IOMMU is present
>>>>>>> on the host?
>>>>>>> Isn't what we care about whether it's
>>>>>>> present in the VM? What we are reserving after all
>>>>>>> is a range of GPAs, not actual host PA's ...
>>>>>>>
>>>>>> Right.
>>>>>>
>>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>>>> and so we need to avoid mapping that portion of the address space
>>>>>> entirely and use some other portion of the GPA space. This has to
>>>>>> do with the host IOMMU, which is where the DMA mapping of guest PAs
>>>>>> takes place. So, if you do not have a host IOMMU, you can't service
>>>>>> guest DMA/PCIe via VFIO through the host IOMMU, and therefore you
>>>>>> don't have this problem.
>>>>>>
>>>>>> Whether one uses a guest IOMMU or not does not affect the result;
>>>>>> it would be more a case of mimicking real hardware than fixing the
>>>>>> issue this series addresses.
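(To make the GPA-space point concrete, the relocation this patch does boils
down to something like the sketch below; the constants and names here are
illustrative rather than the exact ones in the series:)

/* If the highest GPA the guest could use with the default layout would
 * reach into the AMD HyperTransport window just below 1TiB, start the
 * above-4G RAM region at 1TiB instead, so no guest RAM (and thus no IOVA
 * that VFIO/vDPA would map) lands inside that hole. */
#define AMD_HT_START        0xFD00000000ULL   /* 1TiB - 12GiB */
#define AMD_ABOVE_1TB_START 0x10000000000ULL  /* 1TiB */

static void relocate_above_4g_start(uint64_t *above_4g_mem_start,
                                    uint64_t above_4g_mem_size,
                                    uint64_t pci_hole64_size)
{
    uint64_t max_gpa = *above_4g_mem_start + above_4g_mem_size +
                       pci_hole64_size;

    if (max_gpa >= AMD_HT_START) {
        *above_4g_mem_start = AMD_ABOVE_1TB_START;
    }
}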
>>>>>
>>>>>
>>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>>>> And ideally move the logic reporting reserved ranges there too.
>>>>> However, I think vdpa has the same issue too.
>>>>> CC Jason for an opinion.
>>>>
>>>> It does have the same problem.
>>>>
>>>> This is not specific to VFIO; it applies to anything that uses the IOMMU.
>>>
>>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
>>> that we don't have a global "enable-vfio" flag since vfio devices
>>> can be hot-plugged. I think this is not the first time we wanted
>>> something like this, right Alex?
>>>
>>>> It's just that VFIO doesn't let you shoot yourself in the foot :)
>>>>
>>>> vDPA would need to validate iova ranges as well to prevent mapping on
>>>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
>>>> map request is within a valid iova range"). Now you could argue that
>>>> it's an IOMMU core problem, maybe.
>>>>
>>>> But I see this as a separate problem,
>>>> because regardless of said validation, your guest provisioning
>>>> is still going to fail for guests with >=1010G of max GPA, and even if
>>>> it doesn't fail you shouldn't be letting it DMA map those in the first
>>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
>>>> from the IOMMU if DMA is attempted over the first megabyte of that 1T hole).
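(For illustration, the kind of range check vDPA could apply would be roughly
the following; the constants and the function are mine, not kernel code or
code from this series:)

#include <stdbool.h>
#include <stdint.h>

#define AMD_HT_START 0xFD00000000ULL  /* 1TiB - 12GiB */
#define AMD_HT_END   0xFFFFFFFFFFULL  /* 1TiB - 1 */

/* Reject any DMA map request that intersects the reserved HT window;
 * assumes size > 0 and that iova + size does not wrap around. */
static bool iova_range_is_valid(uint64_t iova, uint64_t size)
{
    uint64_t end = iova + size - 1;

    return end < AMD_HT_START || iova > AMD_HT_END;
}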
>>>
>>> I wonder what's the status on a system without an IOMMU.
>>>
>> In theory you should be OK. Also it's worth keeping in mind that AMD machines
>> with >=1T have this 12G hole marked as Reserved, so even DMA, at least when
>> the CPU is the initiator, should be fine in the worst-case scenario.
>
> Not sure I understand this last sentence.
>
I was trying to say that the E820 from hardware is marked as Reserved, and
any DMA from/to endpoints using kernel-supplied CPU addresses won't land in
those reserved ranges.
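(On the guest side the series can simply mirror that; roughly the sketch
below, assuming QEMU's existing e820_add_entry()/E820_RESERVED helpers from
hw/i386/e820_memory_layout.h, with the constants written out by hand:)

static void sketch_mark_ht_window_reserved(void)
{
    /* 12GiB HyperTransport window [0xFD00000000, 0xFFFFFFFFFF] below 1TiB */
    const uint64_t ht_start = 0xFD00000000ULL;
    const uint64_t ht_size  = 0x10000000000ULL - ht_start;

    /* Mark it Reserved in the guest's E820, as real AMD hosts do */
    e820_add_entry(ht_start, ht_size, E820_RESERVED);
}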
>>>>> Also, some concerns
>>>>> - I think this changes memory layout for working VMs not using VFIO.
>>>>> Better preserve the old layout for old machine types?
>>>>>
>>>> Oh definitely, and I do that in this series. See the last commit, and
>>>> in the past it was also a machine-property exposed to userspace.
>>>> Otherwise I would be breaking (badly) compat/migration.
>>>>
>>>> I would like to emphasize that these memory layout changes are *exclusive* and
>>>> *only* for hosts on AMD with guest memory bigger than ~950G-~1010G.
>>>> It's not every guest, but a fairly limited subset of big-memory guests that
>>>> are not the norm.
>>>>
>>>> Albeit the phys-bits property errors might give us a bit more trouble, so
>>>> it might help to be more conservative.
>>>>
>>>>> - You mention the phys bits issue very briefly, and it's pretty
>>>>> concerning. Do we maybe want to also disable the workaround if phys
>>>>> bits is too small?
>>>>
>>>> We are doing that here (well, v4), as suggested by Igor. Note that in this
>>>> series it's a warning, but in v4 I have it as an error, along with the 32-bit
>>>> issues that I found through qtest.
>>>>
>>>> I share the same concern as you over making this an error because of
>>>> compatibility. Perhaps we could go back to the previous version, turn it back
>>>> into a warning, and additionally even disable the relocation altogether. Or,
>>>> if desired, even depend on a machine property to enable it.
>>>
>>> I would be inclined to disable the relocation. And maybe block vfio? I'm
>>> not 100% sure about that last one, but this can be a vfio decision to
>>> make.
>>>
>> I don't see this as a VFIO problem (VFIO is actually being a good citizen
>> and doing the right thing :)). From my perspective this fix is also useful
>> to vDPA (which we also care about), and thus will also let us avoid DMA
>> mapping in these GPA ranges.
>>
>> The reason I mention VFIO in the cover letter is that it's one visible UAPI
>> change that users will notice: suddenly their 1TB+ VMs break.
>
> What I mean is that most VMs don't use either VFIO or VDPA.
> If we had VFIO/VDPA as an accelerator that needs to be listed
> upfront when qemu is started, we could solve this with
> a bit less risk by not changing anything for VMs without these two.
>
That wouldn't work for the vfio/vdpa hotplug case (which we do use),
which is part of the reason I chose to always avoid the 1TB hole. VFIO even
tells you what the allowed IOVA ranges are when you create the container.
The machine property, though, could communicate /the intent/ to add
any VFIO or vDPA devices in the future and maybe cover for that. That
lets one avoid any 'accelerator-only' problems and restrict the issues
of this series to VMs with VFIO/VDPA if the risk is a concern.
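(If we went the machine-property route, a minimal sketch could look like the
following; the property name "iommu-dma" and the PCMachineState field are
invented here purely for illustration, not part of this series:)

static bool pc_machine_get_iommu_dma(Object *obj, Error **errp)
{
    PCMachineState *pcms = PC_MACHINE(obj);

    return pcms->iommu_dma;    /* hypothetical field */
}

static void pc_machine_set_iommu_dma(Object *obj, bool value, Error **errp)
{
    PCMachineState *pcms = PC_MACHINE(obj);

    pcms->iommu_dma = value;   /* hypothetical field */
}

/* Registered from pc_machine_class_init(); the user sets it when they
 * intend to (hot)plug VFIO/vDPA devices during the guest's lifetime. */
object_class_property_add_bool(oc, "iommu-dma",
                               pc_machine_get_iommu_dma,
                               pc_machine_set_iommu_dma);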
> Alex what do you think about adding this?
>
> Given we do not have such a thing right now, one can get
> into a situation where phys bits limit the CPU, then
> more memory is supplied than would fit above the reserved region, then
> we stick the memory over the reserved region, and finally
> a vfio device is added.
>
The current code considers the maximum possible address, taking into account
memory hotplug, the PCI hole64 and so on. So we would need to worry about
whether VFIO or vDPA is going to be hotplugged at some point in the future
during the guest lifecycle, to decide whether to alter the memory layout
at guest provisioning time.
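(Roughly, that computation looks like the sketch below; parameter names are
approximations of what the series uses, not the exact code:)

/* Highest GPA the guest could ever use: RAM above 4G, then hotpluggable
 * (device) memory, then the 64-bit PCI hole. */
static uint64_t sketch_max_used_gpa(uint64_t above_4g_start,
                                    uint64_t above_4g_size,
                                    uint64_t hotplug_mem_size,
                                    uint64_t pci_hole64_size)
{
    return above_4g_start + above_4g_size + hotplug_mem_size +
           pci_hole64_size - 1;
}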
> In this last configuration, should vfio device add fail?
> Or just warn and try to continue? I think we can code this last
> decision as part of vfio code and then Alex will get to decide ;)
> And yes, a similar thing with vdpa.
>
For all those cases I feel the machine property is better,
and more flexible than having VFIO/vDPA deal with a bad memory layout and
discover at a late stage that the user is doing something wrong (and thus
fail the DMA_MAP operation, for those who do check invalid IOVAs).