flatview_write_continue global mutex deadlock

From: Vladimir Sementsov-Ogievskiy
Subject: flatview_write_continue global mutex deadlock
Date: Thu, 3 Sep 2020 14:16:29 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.12.0
Hi all!
I can trigger a long I/O request with the help of the NBD reconnect-delay option,
which makes the request wait for some time for the connection to be established
again and then retry (of course, this only happens if the connection has already
been lost). So the request itself may be long. And this triggers a deadlock,
which seems unrelated to NBD itself.
So, what I do:
1. Create an image:
qemu-img create -f qcow2 xx 100M
2. Start NBD server:
qemu-nbd xx
3. Start vm with second nbd disk on node2, like this:
./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive
file=/work/images/cent7.qcow2 -drive
driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60
-vnc :0 -m 2G -enable-kvm -vga std
4. Access the vm through vnc (or some other way?), and check that NBD
drive works:
dd if=/dev/sdb of=/dev/null bs=1M count=10
- the command should succeed.
5. Now, kill the nbd server, and run dd in the guest again:
dd if=/dev/sdb of=/dev/null bs=1M count=10
Now QEMU is trying to reconnect, and the dd-generated requests are waiting for
the connection (they will wait up to 60 seconds, see the reconnect-delay option
above, and then fail). But suddenly the VM may hang completely in the deadlock.
You may need to increase the reconnect-delay period to catch the deadlock.
Guest OS is CentOS 7.3.1611, kernel 3.10.0-514.el7.x86_64
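For convenience, the reproduction steps above can be collected into one script. This is just the commands from this report gathered together; the image paths, the server address and the qemu binary location are specific to my setup and will need adjusting:

```shell
#!/bin/sh
# Reproduce the flatview_write_continue deadlock (commands from the
# steps above; adjust paths/addresses for your setup).

qemu-img create -f qcow2 xx 100M   # 1. image backing the NBD export
qemu-nbd xx &                      # 2. NBD server (kill it later to trigger reconnect)

# 3. VM with the NBD disk attached as a second drive
./build/x86_64-softmmu/qemu-system-x86_64 -nodefaults \
    -drive file=/work/images/cent7.qcow2 \
    -drive driver=nbd,server.type=inet,server.host=192.168.100.5,server.port=10809,reconnect-delay=60 \
    -vnc :0 -m 2G -enable-kvm -vga std &

# 4./5. In the guest: dd if=/dev/sdb of=/dev/null bs=1M count=10,
# then kill the qemu-nbd process on the host and run dd again.
```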
The deadlock looks as follows:
(gdb) bt
#0 0x00007fb9d784f54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb9d784ae9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007fb9d784ad68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x000056069bfb3b06 in qemu_mutex_lock_impl (mutex=0x56069c7a3fe0 <qemu_global_mutex>,
file=0x56069c24c79b "../util/main-loop.c", line=238) at ../util/qemu-thread-posix.c:79
#4 0x000056069bd00056 in qemu_mutex_lock_iothread_impl (file=0x56069c24c79b
"../util/main-loop.c", line=238) at ../softmmu/cpus.c:1782
#5 0x000056069bfcfd6f in os_host_main_loop_wait (timeout=151823947) at
../util/main-loop.c:238
#6 0x000056069bfcfe7a in main_loop_wait (nonblocking=0) at
../util/main-loop.c:516
#7 0x000056069bd7777b in qemu_main_loop () at ../softmmu/vl.c:1676
#8 0x000056069b95fec2 in main (argc=13, argv=0x7fffd42bff08,
envp=0x7fffd42bff78) at ../softmmu/main.c:50
(gdb) p qemu_global_mutex
$1 = {lock = {__data = {__lock = 2, __count = 0, __owner = 215121, __nusers =
1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next =
0x0}},
__size = "\002\000\000\000\000\000\000\000QH\003\000\001", '\000' <repeats 26 times>,
__align = 2}, file = 0x56069c1d597d "../exec.c", line = 3139, initialized = true}
exec.c:3139 is in prepare_mmio_access(), called from flatview_write_continue().
Let's check qemu_global_mutex owner thread:
(gdb) info thr
Id Target Id Frame
* 1 Thread 0x7fb9f0f39e00 (LWP 215115) "qemu-system-x86" 0x00007fb9d784f54d
in __lll_lock_wait () from /lib64/libpthread.so.0
2 Thread 0x7fb9ca20e700 (LWP 215116) "qemu-system-x86" 0x00007fb9d756bbf9
in syscall () from /lib64/libc.so.6
3 Thread 0x7fb9481ff700 (LWP 215121) "qemu-system-x86" 0x00007fb9d7566cff
in ppoll () from /lib64/libc.so.6
4 Thread 0x7fb9461ff700 (LWP 215123) "qemu-system-x86" 0x00007fb9d784ca35
in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) thr 3
[Switching to thread 3 (Thread 0x7fb9481ff700 (LWP 215121))]
#0 0x00007fb9d7566cff in ppoll () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fb9d7566cff in ppoll () from /lib64/libc.so.6
#1 0x000056069bfbd3f1 in qemu_poll_ns (fds=0x7fb9401dcdf0, nfds=1,
timeout=542076652475) at ../util/qemu-timer.c:347
#2 0x000056069bfd949f in fdmon_poll_wait (ctx=0x56069e6864c0,
ready_list=0x7fb9481fc200, timeout=542076652475) at ../util/fdmon-poll.c:79
#3 0x000056069bfcdf4c in aio_poll (ctx=0x56069e6864c0, blocking=true) at
../util/aio-posix.c:601
#4 0x000056069be80cf3 in bdrv_do_drained_begin (bs=0x56069e6c0950,
recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at
../block/io.c:427
#5 0x000056069be80ddb in bdrv_drained_begin (bs=0x56069e6c0950) at
../block/io.c:433
#6 0x000056069bf1e5b4 in blk_drain (blk=0x56069e6adb50) at
../block/block-backend.c:1718
#7 0x000056069ba40fb5 in ide_cancel_dma_sync (s=0x56069f0d1548) at
../hw/ide/core.c:723
#8 0x000056069bb90d29 in bmdma_cmd_writeb (bm=0x56069f0d22d0, val=8) at
../hw/ide/pci.c:298
#9 0x000056069b9fa529 in bmdma_write (opaque=0x56069f0d22d0, addr=0, val=8,
size=1) at ../hw/ide/piix.c:75
#10 0x000056069bd5d9d5 in memory_region_write_accessor (mr=0x56069f0d2420,
addr=0, value=0x7fb9481fc4d8, size=1, shift=0, mask=255, attrs=...) at
../softmmu/memory.c:483
#11 0x000056069bd5dbf3 in access_with_adjusted_size (addr=0, value=0x7fb9481fc4d8,
size=1, access_size_min=1, access_size_max=4, access_fn=0x56069bd5d8f6
<memory_region_write_accessor>, mr=0x56069f0d2420,
attrs=...) at ../softmmu/memory.c:544
#12 0x000056069bd60bda in memory_region_dispatch_write (mr=0x56069f0d2420,
addr=0, data=8, op=MO_8, attrs=...) at ../softmmu/memory.c:1465
#13 0x000056069bd965e2 in flatview_write_continue (fv=0x7fb9401ce100,
addr=49152, attrs=..., ptr=0x7fb9f0f87000, len=1, addr1=0, l=1,
mr=0x56069f0d2420) at ../exec.c:3176
#14 0x000056069bd9673a in flatview_write (fv=0x7fb9401ce100, addr=49152,
attrs=..., buf=0x7fb9f0f87000, len=1) at ../exec.c:3216
#15 0x000056069bd96aae in address_space_write (as=0x56069c7a5940
<address_space_io>, addr=49152, attrs=..., buf=0x7fb9f0f87000, len=1) at
../exec.c:3307
#16 0x000056069bd96b20 in address_space_rw (as=0x56069c7a5940
<address_space_io>, addr=49152, attrs=..., buf=0x7fb9f0f87000, len=1,
is_write=true) at ../exec.c:3317
#17 0x000056069bd07f06 in kvm_handle_io (port=49152, attrs=...,
data=0x7fb9f0f87000, direction=1, size=1, count=1) at
../accel/kvm/kvm-all.c:2262
#18 0x000056069bd086db in kvm_cpu_exec (cpu=0x56069e6cdb30) at
../accel/kvm/kvm-all.c:2508
#19 0x000056069bcfef84 in qemu_kvm_cpu_thread_fn (arg=0x56069e6cdb30) at
../softmmu/cpus.c:1188
#20 0x000056069bfb4681 in qemu_thread_start (args=0x56069e6f4860) at
../util/qemu-thread-posix.c:521
#21 0x00007fb9d7848ea5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007fb9d75718dd in clone () from /lib64/libc.so.6
(gdb) fr 13
#13 0x000056069bd965e2 in flatview_write_continue (fv=0x7fb9401ce100,
addr=49152, attrs=..., ptr=0x7fb9f0f87000, len=1, addr1=0, l=1,
mr=0x56069f0d2420) at ../exec.c:3176
3176 result |= memory_region_dispatch_write(mr, addr1, val,
(gdb) list
3171 release_lock |= prepare_mmio_access(mr);
3172 l = memory_access_size(mr, l, addr1);
3173 /* XXX: could force current_cpu to NULL to avoid
3174 potential bugs */
3175 val = ldn_he_p(buf, l);
3176 result |= memory_region_dispatch_write(mr, addr1, val,
3177 size_memop(l),
attrs);
3178 } else {
3179 /* RAM case */
3180 ram_ptr = qemu_ram_ptr_length(mr->ram_block, addr1, &l,
false);
(gdb) p release_lock
$2 = true
So, the global mutex is locked in flatview_write_continue(), and on the call path
we have blk_drain(), which waits for some requests to complete (actually NBD
requests waiting for the connection, but this shouldn't matter). At the same time
the main thread waits on the global mutex, so it's impossible to make progress on
these NBD requests. Deadlock.
Paolo, could you please help with it? Or who can? I know nothing about the
exec.c code :(
Side idea about the nbd-reconnect feature: probably we should drop (finish with
failure) all the requests waiting for reconnect in the .bdrv_co_drain_begin
handler of the NBD driver. But I'm not sure: it would break the reconnect
feature if drain() is a frequent event.
--
Best regards,
Vladimir