qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MM


From: Alexey Kardashevskiy
Subject: Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
Date: Tue, 19 Dec 2017 17:02:59 +1100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

On 19/12/17 14:40, Alex Williamson wrote:
> On Tue, 19 Dec 2017 14:07:13 +1100
> Alexey Kardashevskiy <address@hidden> wrote:
> 
>> On 18/12/17 16:02, Alex Williamson wrote:
>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>> table area can be mmap'd from userspace, allowing direct access to
>>> non-MSI-X registers within the host page size of this area.  However,
>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>> within that same page.  For x86/64 host, the system page size is 4K
>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>> don't honor this recommendation would see any improvement from this
>>> option.  The real targets for this feature are hosts where the page
>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>> with 64K pages.
>>>
>>> This new x-msix-relocation option accepts the following options:
>>>
>>>   off: Disable MSI-X relocation, use native device config (default)
>>>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
>>>        based on minimum additional MMIO requirement
>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>        if the BAR exists or added if the BAR slot is available.
>>>
>>> Signed-off-by: Alex Williamson <address@hidden>
>>> ---
>>>  hw/vfio/pci.c        |  102 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  hw/vfio/pci.h        |    1 
>>>  hw/vfio/trace-events |    2 +
>>>  3 files changed, 104 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index c383b842da20..b4426abf297a 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -1352,6 +1352,101 @@ static void 
>>> vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>      }
>>>  }
>>>  
>>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
>>> +{
>>> +    int target_bar = -1;
>>> +    size_t msix_sz;
>>> +
>>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* The actual minimum size of MSI-X structures */
>>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
>>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
>>> +    /* Round up to host pages, we don't want to share a page */
>>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
>>> +    /* PCI BARs must be a power of 2 */
>>> +    msix_sz = pow2ceil(msix_sz);
>>> +
>>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
>>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
>>> +        int i;
>>> +        size_t best = UINT64_MAX;
>>> +
>>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {  
>>
>>
>> I belieive that going from the other end is safer approach for "auto",
>> especially after discovering how mpt3sas works. Or you could add
>> "autoreverse" switch...
> 
> Or is extending the smallest BAR really a safer option?  I wonder how
> many drivers go through and fill fixed sized arrays with BAR info,
> expecting only the device implemented number of BARs.  Maybe they
> wouldn't notice if the BAR was simply bigger than expected.  On the
> other hand there are probably drivers dumb enough to index registers
> from the end for the BAR as well.  I don't think there exists an
> auto algorithm that will fit every device, but a higher hit rate than
> we have so far would be nice.

Everything is possible :(

I do not know if there are many users for this relocation though. So far
only one device has the problem (in 5 years or so) and it is fixed by
moving msix to bar5, I'd suggest start with this for now.

In general, I think we still need a way to simply disable that msix_table
region anyway if we find a device driver which uses all BARs, does not
tolerate changes to the default set of BARs, etc.


>  We could also implement MemoryRegionOps
> for the base BAR with some error reporting if it gets called.  That
> might make the problem more obvious than unassigned_mem_ops silently
> eating those accesses.

Makes sense.


> 
>>> +            size_t size;
>>> +
>>> +            if (vdev->bars[i].ioport) {
>>> +                continue;
>>> +            }
>>> +
>>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
>>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> 
> Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
> 
> NB, the existing test here needs a bit of work too, 32bit BARs max out
> at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
> BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
> 32bit BARs to 64bit? It's all virtual addresses anyway, right.  We're
> in real trouble if were extending BARs where this is an issue though. 

until you get a driver like this :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782


> 
>>> +                continue;
>>> +
>>> +            /*
>>> +             * Must be pow2, so larger of double existing or double 
>>> msix_sz,
>>> +             * or if BAR unimplemented, msix_sz
>>> +             */
>>> +            size = MAX(vdev->bars[i].size * 2,
>>> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
>>> +
>>> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
>>> +
>>> +            if (size < best) {
>>> +                best = size;
>>> +                target_bar = i;
>>> +            }
>>> +
>>> +            if (vdev->bars[i].mem64) {
>>> +              i++;
>>> +            }
>>> +        }
>>> +    } else {
>>> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
>>> +    }
>>> +
>>> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
>>> +        (!vdev->bars[target_bar].size &&
>>> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
>>> +        return; /* Go BOOM?  Plumb Error */
>>> +    }  
>>
>>
>> This "if" only seems to make sense for the non-auto branch...
> 
> Most of it, yes, but it's still possible for a device to exist where
> the auto loop would come up empty.  Imagine if each BAR was
> sufficiently large that we couldn't extend it and still give the MSI-X
> MMIO areas a 32-bit offset within the BAR.  Exceptionally unlikely, it
> doesn't hurt to test all the corner cases.  I also missed the case of
> testing that the BAR isn't too large already here.

Fair enough.


>  
>>> +
>>> +    /*
>>> +     * If adding a new BAR, test if we can make it 64bit.  We make it
>>> +     * prefetchable since QEMU MSI-X emulation has no read side effects
>>> +     * and doing so makes mapping more flexible.
>>> +     */
>>> +    if (!vdev->bars[target_bar].size) {
>>> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
>>> +            !vdev->bars[target_bar + 1].size) {
>>> +            vdev->bars[target_bar].mem64 = true;
>>> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
>>> +        }
>>> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
>>> +        vdev->bars[target_bar].size = msix_sz;
>>> +        vdev->msix->table_offset = 0;
>>> +    } else {
>>> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
>>> +                                          msix_sz * 2);
>>> +        /*
>>> +         * Due to above size calc, MSI-X always starts halfway into the 
>>> BAR,
>>> +         * which will always be a separate host page.
>>> +         */
>>> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
>>> +    }
>>> +
>>> +    vdev->msix->table_bar = target_bar;
>>> +    vdev->msix->pba_bar = target_bar;  
>>
>>
>> Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() 
>> out
>> was not necessary at the time but I missed that it is called before
>> vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks,
> 
> For a kernel that allows mapping the MSI-X region, yes, but if you ran
> that on an older kernel I think QEMU would break when it can't mmap the
> entire region.  We can't only support new kernels.  Thanks,


Sure, I am not suggesting changing this.



-- 
Alexey



reply via email to

[Prev in Thread] Current Thread [Next in Thread]