Re: flatview_write_continue global mutex deadlock

From: Vladimir Sementsov-Ogievskiy
Subject: Re: flatview_write_continue global mutex deadlock
Date: Thu, 3 Sep 2020 18:42:12 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.12.0

03.09.2020 15:29, Paolo Bonzini wrote:
> On 03/09/20 13:16, Vladimir Sementsov-Ogievskiy wrote:
>> (gdb) info thr
>>    Id   Target Id                                            Frame
>> * 1    Thread 0x7fb9f0f39e00 (LWP 215115) "qemu-system-x86"
>> 0x00007fb9d784f54d in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x000056069bfbd3f1 in qemu_poll_ns (fds=0x7fb9401dcdf0, nfds=1,
>> timeout=542076652475) at ../util/qemu-timer.c:347
>> #2  0x000056069bfd949f in fdmon_poll_wait (ctx=0x56069e6864c0,
>> ready_list=0x7fb9481fc200, timeout=542076652475) at ../util/fdmon-poll.c:79
>> #3  0x000056069bfcdf4c in aio_poll (ctx=0x56069e6864c0, blocking=true)
>> at ../util/aio-posix.c:601
>> #4  0x000056069be80cf3 in bdrv_do_drained_begin (bs=0x56069e6c0950,
>> recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at
>> #5  0x000056069be80ddb in bdrv_drained_begin (bs=0x56069e6c0950) at
>> #6  0x000056069bf1e5b4 in blk_drain (blk=0x56069e6adb50) at
>> #7  0x000056069ba40fb5 in ide_cancel_dma_sync (s=0x56069f0d1548) at
>> #13 0x000056069bd965e2 in flatview_write_continue (fv=0x7fb9401ce100,
>> addr=49152, attrs=..., ptr=0x7fb9f0f87000, len=1, addr1=0, l=1,
>> mr=0x56069f0d2420) at ../exec.c:3176
>
> So this is a vCPU thread.  The question is, why is the reconnect timer
> not on the same AioContext?  If it were, aio_poll would execute it.


(gdb) fr 4
#4  0x0000564cdffabcf3 in bdrv_do_drained_begin (bs=0x564ce2112950, 
recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at 
427             BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, 
(gdb) p bs->aio_context
$2 = (AioContext *) 0x564ce20d84c0
(gdb) p bs->drv
$3 = (BlockDriver *) 0x564ce088bb60 <bdrv_nbd_unix>
(gdb) set $s=(BDRVNBDState *)bs->opaque
(gdb) p $s->connection_co
connection_co                 connection_co_sleep_ns_state
(gdb) p $s->connection_co_sleep_ns_state
$4 = (QemuCoSleepState *) 0x0
(gdb) p $s->state
(gdb) p $s->connection_co
$6 = (Coroutine *) 0x564ce2118880


(gdb) qemu coroutine $6
#0  0x0000564ce00f402b in qemu_coroutine_switch (from_=0x564ce2118880, 
to_=0x7f8dd2fff598, action=COROUTINE_YIELD) at ../util/coroutine-ucontext.c:302
#1  0x0000564ce00c2b1a in qemu_coroutine_yield () at 
#2  0x0000564cdffc4e50 in nbd_co_reconnect_loop (s=0x564ce21180e0) at 
#3  0x0000564cdffc4f13 in nbd_connection_entry (opaque=0x564ce21180e0) at 
#4  0x0000564ce00f3d33 in coroutine_trampoline (i0=-502167424, i1=22092) at 
#5  0x00007f8e621cc190 in ?? () from /lib64/libc.so.6
#6  0x00007ffcb9b51540 in ?? ()
#7  0x0000000000000000 in ?? ()

So no timer exists now: the reconnect code yields during drain, intending to
continue after drain ends. Haha, that's obviously bad design, as nobody will
wake up the waiting requests, so the drain will hang forever. OK, thanks, you
helped me; I see now that the NBD code is wrong.

But still, is it OK to call blk_drain() while holding the global mutex? Drain
may take a relatively long time, and the VM is unresponsive the whole time
because the global mutex is held by the vCPU thread..

Best regards,
