qemu-block

Re: [PATCH] hw/nvme: Fix deallocate when metadata is present


From: Klaus Jensen
Subject: Re: [PATCH] hw/nvme: Fix deallocate when metadata is present
Date: Fri, 3 Jun 2022 22:03:18 +0200

On Jun  3 13:14, Jonathan Derrick wrote:
> When metadata is present in the namespace and deallocates are issued, the
> first deallocate can fail to zero the block range, causing another
> deallocation to be issued. Normally, after the deallocation completes and
> the range is checked for zeroes, a deallocation is then issued for the
> metadata space. In the failure case where the range is not zeroed,
> deallocation is reissued for the block range (followed by a metadata
> deallocation), but the original range deallocation task will also issue a
> metadata deallocation:
> 
> nvme_dsm_cb()
>   *range deallocation*
>   nvme_dsm_md_cb()
>     if (nvme_block_status_all()) (range deallocation failure)
>       nvme_dsm_cb()
>       *range deallocation*
>         nvme_dsm_md_cb()
>           if (nvme_block_status_all()) (no failure)
>           *metadata deallocation*
>     *metadata deallocation*
> 
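To make the double deallocation concrete, here is a minimal, self-contained C model of the callback chain in the diagram above. It is a sketch, not the actual QEMU code: dsm_cb() and dsm_md_cb() are hypothetical stand-ins for nvme_dsm_cb() and nvme_dsm_md_cb(), and a boolean simulates the nvme_block_status_all() check failing once.

#include <stdbool.h>
#include <stdio.h>

static bool range_is_zeroed = false;  /* the first check fails, as in the bug */
static int md_dealloc_count = 0;

static void dsm_cb(void);

static void dsm_md_cb(void)
{
    if (!range_is_zeroed) {           /* range deallocation failure */
        range_is_zeroed = true;       /* the reissued deallocation succeeds */
        dsm_cb();                     /* reentry -- note: no return here */
    }

    /* metadata deallocation: runs in the reentrant task AND again when
     * the original task falls through after dsm_cb() returns */
    md_dealloc_count++;
    printf("metadata deallocation #%d\n", md_dealloc_count);
}

static void dsm_cb(void)
{
    /* range deallocation completes, then its completion callback runs */
    dsm_md_cb();
}

int main(void)
{
    dsm_cb();  /* prints two metadata deallocations instead of one */
    return 0;
}

In QEMU, the second pass is what trips over iocb->bh after nvme_dsm_bh() has already run and set it to NULL, as the traces below show.
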
> This sequence results in reentry of nvme_dsm_cb() before the metadata has
> been deallocated. During reentry, the metadata is deallocated in the
> reentrant task. nvme_dsm_bh() is called, which deletes the BH and sets
> iocb->bh to NULL. When the reentrant call returns from nvme_dsm_cb(), the
> metadata deallocation takes place again and results in a NULL pointer
> dereference on iocb->bh:
> 
> BH deletion:
> #0  nvme_dsm_bh (opaque=0x55ef893e2f10) at ../hw/nvme/ctrl.c:2316
> #1  0x000055ef868eb333 in aio_bh_call (bh=0x55ef8a441b30) at ../util/async.c:141
> #2  0x000055ef868eb441 in aio_bh_poll (ctx=0x55ef892c6e40) at ../util/async.c:169
> #3  0x000055ef868d2789 in aio_dispatch (ctx=0x55ef892c6e40) at ../util/aio-posix.c:415
> #4  0x000055ef868eb896 in aio_ctx_dispatch (source=0x55ef892c6e40, callback=0x0, user_data=0x0) at ../util/async.c:311
> #5  0x00007f5bfe4ab17d in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #6  0x000055ef868fcd98 in glib_pollfds_poll () at ../util/main-loop.c:232
> #7  0x000055ef868fce16 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:255
> #8  0x000055ef868fcf27 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:531
> #9  0x000055ef864a2442 in qemu_main_loop () at ../softmmu/runstate.c:726
> #10 0x000055ef860f957a in main (argc=29, argv=0x7ffdc9705508, envp=0x7ffdc97055f8) at ../softmmu/main.c:50
> 
> nvme_dsm_cb() called for metadata after nvme_dsm_bh() completes from the reentrant task:
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> 0x000055ef868eb07c in aio_bh_enqueue (bh=0x0, new_flags=2) at ../util/async.c:70
> 70          AioContext *ctx = bh->ctx;
> (gdb) backtrace
> #0  0x000055ef868eb07c in aio_bh_enqueue (bh=0x0, new_flags=2) at ../util/async.c:70
> #1  0x000055ef868eb4cf in qemu_bh_schedule (bh=0x0) at ../util/async.c:186
> #2  0x000055ef862db21e in nvme_dsm_cb (opaque=0x55ef897b41a0, ret=0) at ../hw/nvme/ctrl.c:2423
> #3  0x000055ef8665a662 in blk_aio_complete (acb=0x55ef89c6d8c0) at ../block/block-backend.c:1419
> #4  0x000055ef8665a940 in blk_aio_write_entry (opaque=0x55ef89c6d8c0) at ../block/block-backend.c:1486
> #5  0x000055ef868edcf2 in coroutine_trampoline (i0=-536848976, i1=32602) at ../util/coroutine-ucontext.c:173
> #6  0x00007f5bfe0bc510 in __start_context () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
> #7  0x00007f5bf757bb40 in  ()
> #8  0x0000000000000000 in  ()
> 
> The fix is to return when nvme_dsm_cb() is reentered due to a failure to
> deallocate the block range, so that the metadata deallocation is issued
> only in the reentrant task and is not issued again when the reentrant task
> returns to the original task.
> 
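Continuing the hypothetical model above, the fix corresponds to returning immediately after the reentrant call, mirroring the one-line patch below:

static void dsm_md_cb(void)
{
    if (!range_is_zeroed) {           /* range deallocation failure */
        range_is_zeroed = true;
        dsm_cb();                     /* the reentrant task deallocates the metadata... */
        return;                       /* ...so don't deallocate it again here */
    }

    md_dealloc_count++;               /* now runs exactly once */
    printf("metadata deallocation #%d\n", md_dealloc_count);
}

With the early return, the model prints a single metadata deallocation, and the original task never reaches the path that scheduled the already-deleted BH.
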
> Reproduction steps (with emulated namespace):
> nvme format --lbaf=1 -f /dev/nvme0n1
> mkfs.ext4 /dev/nvme0n1
> mkfs.ext4 -F /dev/nvme0n1
> 
> Signed-off-by: Francis Pravin AntonyX Michael Raj <francis.michael@solidigm.com>
> Signed-off-by: Michael Kropaczek <michael.kropaczek@solidigm.com>
> Signed-off-by: Jonathan Derrick <jonathan.derrick@solidigm.com>
> ---
>  hw/nvme/ctrl.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 03760ddeae..74540a03d5 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -2372,6 +2372,7 @@ static void nvme_dsm_md_cb(void *opaque, int ret)
>          }
>  
>          nvme_dsm_cb(iocb, 0);
> +        return;
>      }
>  
>      iocb->aiocb = blk_aio_pwrite_zeroes(ns->blkconf.blk, nvme_moff(ns, slba),
> -- 
> 2.25.1
> 

Yeah, thanks for the thorough analysis!

A fix for this has been lingering in nvme-next. I sincerely apologize
for not sending the pull request to master earlier, which might have
saved you the trouble of tracking this down.

I just sent the pull request to Peter.
