[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-ppc] [PATCH qemu RFC 1/7] vfio/spapr: Fix indirect levels calc
From: |
Alexey Kardashevskiy |
Subject: |
Re: [Qemu-ppc] [PATCH qemu RFC 1/7] vfio/spapr: Fix indirect levels calculation |
Date: |
Mon, 19 Nov 2018 16:51:57 +1100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.3.0 |
On 19/11/2018 12:45, David Gibson wrote:
> On Tue, Nov 13, 2018 at 07:30:58PM +1100, Alexey Kardashevskiy wrote:
>> The current code assumes that we can address more bits on a PCI bus
>> for DMA than we really can. Limit to the known tested maximum of 55 bits
>> and assume 64K IOMMU pages.
>
> This doesn't make much sense to me. It looks nothing like the
> calculation it's replacing, and not just for extreme values. For
> example the new calculation doesn't depend on the size of the window
> at all, whereas the previous one did.
Uff. Yes, this is confusing. There are a typical 64k IOMMU page size and
a typical 64k system page size...
For the window as big as 0x200.0000.0000 and 64k pages (actual working
example), 33554432 entries are needed, or 256MB of space. The existing
code calculates it as 25/6+1=5 levels. Which the hardware can do.
But:
1. all TCE levels are the same size and
2. for the minimum size of the level I take a system page size which is 64k.
So with 5 levels 16bit each (and this is without counting IOMMU page
size, which is plus extra 16 bits) the table is huuuge.
And I know from tests that 55 bits is the maximum the hardware can
handle, and I have this limit in the powernv code so VFIO ioctl to
create a window will fail. If I do not have a check in powernv, than
things silently do not work (no EEH though).
> You say you're assuming 64k
> IOMMU pages, but create.page_shift (the actual IOMMU page size) is in
> there.
It is a radix tree basically. Out of 64bits available, I cannot use bits
56..63 (hardware does not work) and 0..12/15/21 (IOMMU page size), so I
have bits 13/16/22..55 which I need to split equally between levels,
each of which is 64k (system page size).
> Nor is it clear to me why the maximum PCI address is relevant
> to the number of levels in the first place.
The window can only as big as the PHB hardware supports, this is why
maximum PCI address.
>
> In addition, AFAICT the new calculation gives the answer '2' for all
> likely IOMMU page sizes (4k, 64k, 2M)
Hm, I did not realize that :) Instead of using 64k as a default level
size, I could use some idea of what CONFIG_FORCE_MAX_ZONEORDER could be.
Currently it is:
arch/powerpc/Kconfig
config FORCE_MAX_ZONEORDER
int "Maximum zone order"
range 8 9 if PPC64 && PPC_64K_PAGES
default "9" if PPC64 && PPC_64K_PAGES
range 13 13 if PPC64 && !PPC_64K_PAGES
default "13" if PPC64 && !PPC_64K_PAGES
range 9 64 if PPC32 && PPC_16K_PAGES
default "9" if PPC32 && PPC_16K_PAGES
range 7 64 if PPC32 && PPC_64K_PAGES
default "7" if PPC32 && PPC_64K_PAGES
range 5 64 if PPC32 && PPC_256K_PAGES
default "5" if PPC32 && PPC_256K_PAGES
range 11 64
default "11"
So assuming that "8" is the absolute minimum, I could try levels as big
as 1<<(16+8-1) and if this fails, then increase the number of levels
till a window is created or it is too many levels. Or there is some
obviously better solution to the problem?
>
>>
>> Signed-off-by: Alexey Kardashevskiy <address@hidden>
>> ---
>> hw/vfio/spapr.c | 3 ++-
>> hw/vfio/trace-events | 2 +-
>> 2 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
>> index becf71a..f5fdc53 100644
>> --- a/hw/vfio/spapr.c
>> +++ b/hw/vfio/spapr.c
>> @@ -183,7 +183,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>> entries = create.window_size >> create.page_shift;
>> pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
>> pages = MAX(pow2ceil(pages), 1); /* Round up */
>> - create.levels = ctz64(pages) / 6 + 1;
>> + create.levels = MAX(1, (55 - create.page_shift) / 16);
>>
>> ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
>> if (ret) {
>> @@ -200,6 +200,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>> return -EINVAL;
>> }
>> trace_vfio_spapr_create_window(create.page_shift,
>> + create.levels,
>> create.window_size,
>> create.start_addr);
>> *pgsize = pagesize;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index a85e866..db730f3 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -128,6 +128,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start,
>> uint64_t end) "0x%"PRIx64"
>> vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end)
>> "0x%"PRIx64" - 0x%"PRIx64
>> vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64"
>> size=0x%"PRIx64" ret=%d"
>> vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64"
>> size=0x%"PRIx64" ret=%d"
>> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x
>> winsize=0x%"PRIx64" offset=0x%"PRIx64
>> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t
>> off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
>> vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
>> vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to
>> liobn fd %d"
>
--
Alexey
- Re: [Qemu-ppc] [PATCH qemu RFC 5/7] spapr-iommu: Always advertise the maximum possible DMA window size, (continued)
[Qemu-ppc] [PATCH qemu RFC 4/7] vfio/nvidia-v100: Disable VBIOS update, Alexey Kardashevskiy, 2018/11/13
[Qemu-ppc] [PATCH qemu RFC 3/7] pci: Move NVIDIA vendor id to the rest of ids, Alexey Kardashevskiy, 2018/11/13
[Qemu-ppc] [PATCH qemu RFC 1/7] vfio/spapr: Fix indirect levels calculation, Alexey Kardashevskiy, 2018/11/13
[Qemu-ppc] [PATCH qemu RFC 7/7] spapr: Add NVLink2 pass through support, Alexey Kardashevskiy, 2018/11/13
[Qemu-ppc] [PATCH qemu RFC 2/7] linux-header: Update for new capabilities, Alexey Kardashevskiy, 2018/11/13
[Qemu-ppc] [PATCH qemu RFC 6/7] vfio: Make vfio_get_region_info_cap public, Alexey Kardashevskiy, 2018/11/13