From: Philippe Mathieu-Daudé
Subject: Re: [PATCH v2] block/nvme: Fix VFIO_MAP_DMA failed: No space left on device
Date: Tue, 22 Jun 2021 14:42:30 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

On 6/22/21 10:06 AM, Philippe Mathieu-Daudé wrote:
> On 6/22/21 9:29 AM, Philippe Mathieu-Daudé wrote:
>> On 6/21/21 5:36 PM, Fam Zheng wrote:
>>>> On 21 Jun 2021, at 16:13, Philippe Mathieu-Daudé <philmd@redhat.com> wrote:
>>>> On 6/21/21 3:18 PM, Fam Zheng wrote:
>>>>>> On 21 Jun 2021, at 10:32, Philippe Mathieu-Daudé <philmd@redhat.com> wrote:
>>>>>>
>>>>>> When the NVMe block driver was introduced (see commit bdd6a90a9e5,
>>>>>> January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning
>>>>>> -ENOMEM in case of error. The driver was correctly handling the
>>>>>> error path to recycle its volatile IOVA mappings.
>>>>>>
>>>>>> To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit
>>>>>> DMA mappings per container", April 2019) added the -ENOSPC error to
>>>>>> signal the user exhausted the DMA mappings available for a container.
>>>>>>
>>>>>> The block driver started to misbehave:
>>>>>>
>>>>>> qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>>>>>> (qemu)
>>>>>> (qemu) info status
>>>>>> VM status: paused (io-error)
>>>>>> (qemu) c
>>>>>> VFIO_MAP_DMA failed: No space left on device
>>>>>> qemu-system-x86_64: block/block-backend.c:1968: blk_get_aio_context: Assertion `ctx == blk->ctx' failed.
>>>>>
>>>>> Hi Phil,
>>>>>
>>>>>
>>>>> The diff looks good to me, but I’m not sure what exactly caused the 
>>>>> assertion failure. There is `if (r) { goto fail; }` that handles -ENOSPC 
>>>>> before, so it should be treated as a general case. What am I missing?
>>>>
>>>> Good catch, ENOSPC ends up setting BLOCK_DEVICE_IO_STATUS_NOSPACE
>>>> -> BLOCK_ERROR_ACTION_STOP, so the VM is paused with DMA mapping
>>>> exhausted. I don't understand the full "VM resume" path, but this
>>>> is not what we want (IO_NOSPACE is to warn the operator to add
>>>> more storage and resume, which is pointless in our case, resuming
>>>> won't help until we flush the mappings).
>>>>
>>>> IIUC what we want is to return ENOMEM to set BLOCK_DEVICE_IO_STATUS_FAILED.
>>>
>>> I agree with that. It just makes me feel there’s another bug in the 
>>> resuming code path. Can you get a backtrace?
>>
>> It seems the resuming code path bug has been fixed elsewhere:
>>
>> (qemu) info status
>> info status
>> VM status: paused (io-error)
>> (qemu) c
>> c
>> 2021-06-22T07:27:00.745466Z qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>> (qemu) info status
>> info status
>> VM status: paused (io-error)
>> (qemu) c
>> c
>> 2021-06-22T07:27:12.458137Z qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>> (qemu) c
>> c
>> 2021-06-22T07:27:13.439167Z qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>> (qemu) c
>> c
>> 2021-06-22T07:27:14.272071Z qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>> (qemu)
>>
> 
> I tested all releases up to v4.1.0 and could not trigger the
> blk_get_aio_context() assertion. Building using --enable-debug.
> IIRC Gentoo is more aggressive, so I'll restart using -O2.

It took 4h30 to test all releases with -O3, and I couldn't reproduce it :(

I wish I hadn't postponed writing an Ansible test script...

On v1 Michal said he doesn't have access to the machine anymore,
so I'll assume the other issue got fixed elsewhere.
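
For readers following the thread, the errno handling discussed above
can be sketched in standalone C. This is only an illustration under
stated assumptions, not the patch itself: every name below
(normalize_map_error, classify_io_error, map_with_retry, and the
fake_container helpers) is hypothetical, and the fake container merely
imitates a per-container mapping limit. The idea shown is to fold
-ENOSPC into -ENOMEM so the existing "recycle mappings and retry" path
runs, and so a remaining failure is reported as a plain I/O failure
rather than as "no space".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the block layer's I/O status values. */
typedef enum { IO_STATUS_OK, IO_STATUS_FAILED, IO_STATUS_NOSPACE } io_status;

/* Fold -ENOSPC (mapping limit reached) into -ENOMEM; pass others through. */
int normalize_map_error(int r)
{
    return r == -ENOSPC ? -ENOMEM : r;
}

/* Only -ENOSPC is reported as "no space"; any other error is a failure. */
io_status classify_io_error(int r)
{
    if (r == 0) {
        return IO_STATUS_OK;
    }
    return r == -ENOSPC ? IO_STATUS_NOSPACE : IO_STATUS_FAILED;
}

typedef int (*map_fn)(void *opaque);
typedef int (*reset_fn)(void *opaque);

/* Try to map; on -ENOMEM, drop temporary mappings once and try again. */
int map_with_retry(map_fn map, reset_fn reset, void *opaque)
{
    bool retry = true;

    for (;;) {
        int r = normalize_map_error(map(opaque));
        if (r != -ENOMEM || !retry) {
            return r;
        }
        retry = false;
        r = reset(opaque);
        if (r) {
            return r;
        }
    }
}

/* Fake container (assumption): fails with -ENOSPC once 'limit' is reached. */
struct fake_container { int used; int limit; };

int fake_map(void *opaque)
{
    struct fake_container *c = opaque;
    if (c->used >= c->limit) {
        return -ENOSPC;
    }
    c->used++;
    return 0;
}

int fake_reset(void *opaque)
{
    struct fake_container *c = opaque;
    c->used = 0;   /* pretend all temporary mappings were released */
    return 0;
}
```

With a full fake container, map_with_retry() resets it and succeeds on
the second attempt; with a limit of zero it gives up with -ENOMEM, which
classify_io_error() reports as a failure instead of "no space".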



