Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
From: Alexander Graf
Subject: Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
Date: Mon, 13 Jan 2014 22:48:21 +0100
> On 13.01.2014 at 22:39, Alex Williamson <address@hidden> wrote:
>
>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <address@hidden> wrote:
>>>
>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
>>>>>>>>>>>> From: Paolo Bonzini <address@hidden>
>>>>>>>>>>>>
>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
>>>>>>>>>>>> consequently messing up the computations.
>>>>>>>>>>>>
>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from address
>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive. The region it gets
>>>>>>>>>>>> is the newly introduced master abort region, which is as big as the PCI
>>>>>>>>>>>> address space (see pci_bus_init). Due to a typo that's only 2^63-1,
>>>>>>>>>>>> not 2^64. But we get it anyway because phys_page_find ignores the upper
>>>>>>>>>>>> bits of the physical address. In address_space_translate_internal then
>>>>>>>>>>>>
>>>>>>>>>>>> diff = int128_sub(section->mr->size, int128_make64(addr));
>>>>>>>>>>>> *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
>>>>>>>>>>>>
>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
>>>>>>>>>>>>
>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> Reported-by: Luiz Capitulino <address@hidden>
>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <address@hidden>
>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <address@hidden>
>>>>>>>>>>>> ---
>>>>>>>>>>>> exec.c | 8 ++------
>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/exec.c b/exec.c
>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
>>>>>>>>>>>> --- a/exec.c
>>>>>>>>>>>> +++ b/exec.c
>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
>>>>>>>>>>>>
>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables. */
>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
>>>>>>>>>>>>
>>>>>>>>>>>> #define P_L2_BITS 10
>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
>>>>>>>>>>>> {
>>>>>>>>>>>> system_memory = g_malloc(sizeof(*system_memory));
>>>>>>>>>>>>
>>>>>>>>>>>> - assert(ADDR_SPACE_BITS <= 64);
>>>>>>>>>>>> -
>>>>>>>>>>>> - memory_region_init(system_memory, NULL, "system",
>>>>>>>>>>>> - ADDR_SPACE_BITS == 64 ?
>>>>>>>>>>>> - UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
>>>>>>>>>>>> + memory_region_init(system_memory, NULL, "system", UINT64_MAX);
>>>>>>>>>>>> address_space_init(&address_space_memory, system_memory,
>>>>>>>>>>>> "memory");
>>>>>>>>>>>>
>>>>>>>>>>>> system_io = g_malloc(sizeof(*system_io));
>>>>>>>>>>>
>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit PCI
>>>>>>>>>>> BARs that I'm not sure how to handle.
>>>>>>>>>>
>>>>>>>>>> BARs are often disabled during sizing. Maybe you
>>>>>>>>>> don't detect BAR being disabled?
>>>>>>>>>
>>>>>>>>> See the trace below, the BARs are not disabled. QEMU pci-core is doing
>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
>>>>>>>>> pass-through here.
>>>>>>>>
>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
>>>>>>>> while I/O & memory are enabled in the command register. Thanks,
>>>>>>>>
>>>>>>>> Alex
>>>>>>>
>>>>>>> OK then from QEMU POV this BAR value is not special at all.
>>>>>>
>>>>>> Unfortunately
>>>>>>
>>>>>>>>>>> After this patch I get vfio
>>>>>>>>>>> traces like this:
>>>>>>>>>>>
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
>>>>>>>>>>> (save lower 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
>>>>>>>>>>> (read size mask)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
>>>>>>>>>>> (restore BAR)
>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region re-mapped)
>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
>>>>>>>>>>> (save upper 32bits of BAR)
>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
>>>>>>>>>>> (write mask to BAR)
>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
>>>>>>>>>>> (memory region gets unmapped)
>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
>>>>>>>>>>> (memory region gets re-mapped with new address)
>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
>>>>>>>>>>
>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
>>>>>>>>>
>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
>>>>>>>
>>>>>>> Why can't you? Generally the memory core lets you find out easily.
>>>>>>
>>>>>> My MemoryListener is set up for &address_space_memory and I then filter
>>>>>> out anything that's not memory_region_is_ram(). This still gets
>>>>>> through, so how do I easily find out?
>>>>>>
>>>>>>> But in this case it's vfio device itself that is sized so for sure you
>>>>>>> know it's MMIO.
>>>>>>
>>>>>> How so? I have a MemoryListener as described above and pass everything
>>>>>> through to the IOMMU. I suppose I could look through all the
>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
>>>>>> ugly.
>>>>>>
>>>>>>> Maybe you will have same issue if there's another device with a 64 bit
>>>>>>> bar though, like ivshmem?
>>>>>>
>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
>>>>>> MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
>>>>>
>>>>> Must be a 64 bit BAR to trigger the issue though.
>>>>>
>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
>>>>>>>>>
>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
>>>>>>>>>>> address, presumably because it was beyond the address space of the PCI
>>>>>>>>>>> window. This address is clearly not in a PCI MMIO space, so why are we
>>>>>>>>>>> allowing it to be realized in the system address space at this location?
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>> Why do you think it is not in PCI MMIO space?
>>>>>>>>>> True, CPU can't access this address but other pci devices can.
>>>>>>>>>
>>>>>>>>> What happens on real hardware when an address like this is programmed to
>>>>>>>>> a device? The CPU doesn't have the physical bits to access it. I have
>>>>>>>>> serious doubts that another PCI device would be able to access it
>>>>>>>>> either. Maybe in some limited scenario where the devices are on the
>>>>>>>>> same conventional PCI bus. In the typical case, PCI addresses are
>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made explicit
>>>>>>>>> in ACPI). Even if I wanted to filter these out as noise in vfio, how
>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
>>>>>>>>> programmed? PCI has this knowledge, I hope. VFIO doesn't. Thanks,
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>
>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
>>>>>>> full 64 bit addresses must be allowed and hardware validation
>>>>>>> test suites normally check that it actually does work
>>>>>>> if it happens.
>>>>>>
>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
>>>>>> routing, that's more what I'm referring to. There are generally only
>>>>>> fixed address windows for RAM vs MMIO.
>>>>>
>>>>> The physical chipset? Likely - in the presence of IOMMU.
>>>>> Without that, devices can talk to each other without going
>>>>> through chipset, and bridge spec is very explicit that
>>>>> full 64 bit addressing must be supported.
>>>>>
>>>>> So as long as we don't emulate an IOMMU,
>>>>> guest will normally think it's okay to use any address.
>>>>>
>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
>>>>>>> windows would protect you, but pci already does this filtering:
>>>>>>> if you see this address in the memory map this means
>>>>>>> your virtual device is on root bus.
>>>>>>>
>>>>>>> So I think it's the other way around: if VFIO requires specific
>>>>>>> address ranges to be assigned to devices, it should give this
>>>>>>> info to qemu and qemu can give this to guest.
>>>>>>> Then anything outside that range can be ignored by VFIO.
>>>>>>
>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO. There's
>>>>>> currently no way to find out the address width of the IOMMU. We've been
>>>>>> getting by because it's safely close enough to the CPU address width to
>>>>>> not be a concern until we start exposing things at the top of the 64bit
>>>>>> address space. Maybe I can safely ignore anything above
>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now. Thanks,
>>>>>>
>>>>>> Alex
>>>>>
>>>>> I think it's not related to target CPU at all - it's a host limitation.
>>>>> So just make up your own constant, maybe depending on host architecture.
>>>>> Long term add an ioctl to query it.
>>>>
>>>> It's a hardware limitation which I'd imagine has some loose ties to the
>>>> physical address bits of the CPU.
>>>>
>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
>>>>> placing BARs above some address.
>>>>
>>>> That doesn't help this case, it's a spurious mapping caused by sizing
>>>> the BARs with them enabled. We may still want such a thing to feed into
>>>> building ACPI tables though.
>>>
>>> Well the point is that if you want BIOS to avoid
>>> specific addresses, you need to tell it what to avoid.
>>> But neither BIOS nor ACPI actually cover the range above
>>> 2^48 ATM so it's not a high priority.
>>>
>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
>>>>> lines of vfio_get_addr_space_bits(void).
>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
>>>>
>>>> It's an IOMMU hardware limitation, legacy assignment has the same
>>>> problem. It looks like legacy will abort() in QEMU for the failed
>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
>>>> mappings. In the short term, I think I'll ignore any mappings above
>>>> TARGET_PHYS_ADDR_SPACE_BITS,
>>>
>>> That seems very wrong. It will still fail on an x86 host if we are
>>> emulating a CPU with full 64 bit addressing. The limitation is on the
>>> host side there's no real reason to tie it to the target.
>
> I doubt vfio would be the only thing broken in that case.
>
>>>> long term vfio already has an IOMMU info
>>>> ioctl that we could use to return this information, but we'll need to
>>>> figure out how to get it out of the IOMMU driver first.
>>>> Thanks,
>>>>
>>>> Alex
>>>
>>> Short term, just assume 48 bits on x86.
>
> I hate to pick an arbitrary value since we have a very specific mapping
> we're trying to avoid. Perhaps a better option is to skip anything
> where:
>
> MemoryRegionSection.offset_within_address_space >
> ~MemoryRegionSection.offset_within_address_space
>
>>> We need to figure out what's the limitation on ppc and arm -
>>> maybe there's none and it can address full 64 bit range.
>>
>> IIUC on PPC and ARM you always have BAR windows where things can get mapped
>> into. Unlike x86, where the full physical address range can be overlaid by
>> BARs.
>>
>> Or did I misunderstand the question?
>
> Sounds right, if either BAR mappings outside the window will not be
> realized in the memory space or the IOMMU has a full 64bit address
> space, there's no problem. Here we have an intermediate step in the BAR
> sizing producing a stray mapping that the IOMMU hardware can't handle.
> Even if we could handle it, it's not clear that we want to. On AMD-Vi
> the IOMMU page tables can grow to 6 levels deep. A stray mapping like
> this then causes space and time overhead until the tables are pruned
> back down. Thanks,
I thought sizing is strictly defined as a write of -1 (all ones)? Can't we check
for that one special case and treat it as "not mapped, but tell the guest the
size in config space"?
Alex
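
For reference, the special case asked about above can be sketched in C. This is
only an illustration under stated assumptions, not QEMU's pci-core code: the
helper names (bar_write_is_sizing_probe, bar_size_from_readback, bar64_address)
are hypothetical, and it assumes a memory BAR whose low four bits are flag bits.
The values in the comments are taken from the vfio trace quoted earlier in this
message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: memory BAR, so the low four bits are type/prefetch flags. */
#define BAR_MEM_FLAG_MASK 0xfu

/* Hypothetical helper: true if a 32-bit config write to one BAR dword is
 * the all-ones value used for sizing (the "-1" case). */
static bool bar_write_is_sizing_probe(uint32_t val)
{
    return val == UINT32_MAX;
}

/* Hypothetical helper: decode the BAR size from the value read back after
 * the all-ones write. In the trace, ffffc004 decodes to 0x4000, matching
 * the febe0000 - febe3fff region. Returns 0 if no address bits are set. */
static uint64_t bar_size_from_readback(uint32_t readback)
{
    uint32_t mask = readback & ~BAR_MEM_FLAG_MASK;

    if (mask == 0) {
        return 0;
    }
    return (uint64_t)(uint32_t)~mask + 1;
}

/* Hypothetical helper: combine the two dwords of a 64-bit BAR. With the
 * upper dword still holding the sizing value 0xffffffff and the lower
 * dword 0xfebe0004, this gives the stray address 0xfffffffffebe0000. */
static uint64_t bar64_address(uint32_t upper, uint32_t lower)
{
    return ((uint64_t)upper << 32) | (lower & ~BAR_MEM_FLAG_MASK);
}
```

Whether an all-ones write can always be treated as a sizing probe is exactly the
open question here, so a check like this should be read as a heuristic for the
stray mapping rather than a settled answer.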
>
> Alex
>
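
The skip condition quoted above (offset_within_address_space >
~offset_within_address_space) is true exactly when bit 63 of the address is set.
A minimal sketch comparing it with the fixed-width alternative discussed in the
thread follows. The function names are hypothetical, and the 48-bit width is
only the short-term assumption for x86 hosts suggested above, not a value
queried from the IOMMU.

```c
#include <stdbool.h>
#include <stdint.h>

/* The comparison proposed in the thread: skip a section whose start
 * address is greater than its own bitwise complement. */
static bool skip_by_complement(uint64_t addr)
{
    return addr > ~addr;
}

/* Equivalent formulation: the comparison above holds exactly when the
 * top bit of the address is set. */
static bool skip_by_top_bit(uint64_t addr)
{
    return (addr >> 63) != 0;
}

/* Alternative: reject a range whose last byte does not fit in `bits`
 * address bits (for example 48 as an assumed host limit). */
static bool exceeds_addr_width(uint64_t addr, uint64_t size, unsigned bits)
{
    if (size == 0) {
        return false;               /* empty range, nothing to map */
    }
    if (size - 1 > UINT64_MAX - addr) {
        return true;                /* range wraps past 2^64 */
    }
    if (bits >= 64) {
        return false;               /* every 64-bit address fits */
    }
    return ((addr + size - 1) >> bits) != 0;
}
```

Both checks reject the stray 0xfffffffffebe0000 mapping from the trace and keep
the real febe0000 one. The complement test avoids picking a width but lets
through anything below 2^63, while the width test depends on knowing the
IOMMU's actual limit.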