From: Daniel P. Berrange
Subject: [Qemu-devel] Hang with migration multi-thread compression under high load
Date: Wed, 27 Apr 2016 15:20:23 +0100
User-agent: Mutt/1.5.24 (2015-08-30)

I've been testing various features of migration and have hit a problem
with the multi-thread compression. It works fine when I have 2 or more
threads, but if I tell it to use only a single thread, then it almost
always hangs.

I'm migrating between 2 QEMU instances on the same machine over a TCP
localhost socket, using this command line to launch them:

  /home/berrange/src/virt/qemu/x86_64-softmmu/qemu-system-x86_64
     -chardev socket,id=mon,path=/var/tmp/qemu-src-4644-monitor.sock
     -mon chardev=mon,mode=control
     -display none
     -vga none
     -machine accel=kvm
     -kernel /boot/vmlinuz-4.4.7-300.fc23.x86_64
     -initrd /home/berrange/src/virt/qemu/tests/migration/initrd-stress.img
     -append "noapic edd=off printk.time=1 noreplace-smp cgroup_disable=memory 
pci=noearly console=ttyS0 debug ramsize=1"
     -chardev stdio,id=cdev0
     -device isa-serial,chardev=cdev0
     -m 1536
     -smp 1

The target VM also gets

    -incoming tcp:localhost:9000


When the VM hangs, the source QEMU shows this stack trace:


Thread 5 (Thread 0x7fdb0aa93700 (LWP 4648)):
#0  0x00007fdb172d4c59 in syscall () at /lib64/libc.so.6
#1  0x0000558bf0b9ba15 in qemu_event_wait (val=<optimized out>, ev=<optimized 
out>) at util/qemu-thread-posix.c:292
#2  0x0000558bf0b9ba15 in qemu_event_wait (address@hidden 
<rcu_call_ready_event>) at util/qemu-thread-posix.c:399
#3  0x0000558bf0ba9e3e in call_rcu_thread (opaque=<optimized out>) at 
util/rcu.c:250
#4  0x00007fdb175a060a in start_thread () at /lib64/libpthread.so.0
#5  0x00007fdb172daa4d in clone () at /lib64/libc.so.6

Thread 4 (Thread 0x7fdb0a292700 (LWP 4649)):
#0  0x00007fdb172d0707 in ioctl () at /lib64/libc.so.6
#1  0x0000558bf089b887 in kvm_vcpu_ioctl (address@hidden, address@hidden) at 
/home/berrange/src/virt/qemu/kvm-all.c:1984
#2  0x0000558bf089b9df in kvm_cpu_exec (address@hidden) at 
/home/berrange/src/virt/qemu/kvm-all.c:1834
#3  0x0000558bf0889982 in qemu_kvm_cpu_thread_fn (arg=0x558bf268c8c0) at 
/home/berrange/src/virt/qemu/cpus.c:1069
#4  0x00007fdb175a060a in start_thread () at /lib64/libpthread.so.0
#5  0x00007fdb172daa4d in clone () at /lib64/libc.so.6

Thread 3 (Thread 0x7fdaa3dff700 (LWP 4655)):
#0  0x00007fdb175a5b10 in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x0000558bf0b9b709 in qemu_cond_wait (address@hidden, address@hidden) at 
util/qemu-thread-posix.c:123
#2  0x0000558bf08ad40b in do_data_compress (opaque=0x558bf42e2870) at 
/home/berrange/src/virt/qemu/migration/ram.c:306
#3  0x00007fdb175a060a in start_thread () at /lib64/libpthread.so.0
#4  0x00007fdb172daa4d in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7fdaa35fe700 (LWP 4656)):
#0  0x00007fdb175a909d in sendmsg () at /lib64/libpthread.so.0
#1  0x0000558bf0b9ed35 in do_send_recv (address@hidden, address@hidden, 
iov_cnt=<optimized out>, address@hidden) at util/iov.c:107
#2  0x0000558bf0b9f2a4 in iov_send_recv (sockfd=15, address@hidden, 
address@hidden, address@hidden, address@hidden, address@hidden) at 
util/iov.c:194
#3  0x0000558bf0aaeacc in socket_writev_buffer (opaque=0x558bf2ebf1e0, 
iov=0x558bf267f130, iovcnt=1, pos=<optimized out>) at 
migration/qemu-file-unix.c:48
#4  0x0000558bf0aad47a in qemu_fflush (address@hidden) at 
migration/qemu-file.c:123
#5  0x0000558bf0aad5ed in qemu_put_buffer (f=0x558bf2677100, address@hidden "", 
address@hidden) at migration/qemu-file.c:337
#6  0x0000558bf0aae040 in qemu_put_qemu_file (size=4045, buf=0x558bf26ae840 "", 
f=0x558bf2677100) at migration/qemu-file.c:642
#7  0x0000558bf0aae040 in qemu_put_qemu_file (address@hidden, 
f_src=0x558bf26ae810) at migration/qemu-file.c:642
#8  0x0000558bf08adf2d in ram_find_and_save_block 
(bytes_transferred=0x558bf116b050 <bytes_transferred>, offset=6647840, 
block=0x558bf26c5460, f=0x558bf2677100)
    at /home/berrange/src/virt/qemu/migration/ram.c:886
#9  0x0000558bf08adf2d in ram_find_and_save_block 
(bytes_transferred=0x558bf116b050 <bytes_transferred>, last_stage=<optimized 
out>, pss=<synthetic pointer>, f=0x558bf2677100) at /home/berrange/src/virt/qemu/migration/ram.c:972
#10 0x0000558bf08adf2d in ram_find_and_save_block 
(bytes_transferred=0x558bf116b050 <bytes_transferred>, ms=<optimized out>, 
dirty_ram_abs=<optimized out>, last_stage=false, pss=<synthetic pointer>, 
f=0x558bf2677100) at /home/berrange/src/virt/qemu/migration/ram.c:1247
#11 0x0000558bf08adf2d in ram_find_and_save_block 
(bytes_transferred=0x558bf116b050 <bytes_transferred>, dirty_ram_abs=<optimized 
out>, last_stage=<optimized out>, pss=<synthetic pointer>, f=<optimized out>, 
ms=<optimized out>) at /home/berrange/src/virt/qemu/migration/ram.c:1299
#12 0x0000558bf08adf2d in ram_find_and_save_block (address@hidden, 
address@hidden, bytes_transferred=0x558bf116b050 <bytes_transferred>)
    at /home/berrange/src/virt/qemu/migration/ram.c:1359
#13 0x0000558bf08ae211 in ram_save_iterate (f=0x558bf2677100, opaque=<optimized 
out>) at /home/berrange/src/virt/qemu/migration/ram.c:1993
#14 0x0000558bf08b40e7 in qemu_savevm_state_iterate (f=0x558bf2677100, 
address@hidden) at /home/berrange/src/virt/qemu/migration/savevm.c:970
#15 0x0000558bf0aa9c74 in migration_thread (opaque=0x558bf10fa140 
<current_migration>) at migration/migration.c:1694
#16 0x00007fdb175a060a in start_thread () at /lib64/libpthread.so.0
#17 0x00007fdb172daa4d in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7fdb23bedc80 (LWP 4647)):
#0  0x00007fdb172cf0a1 in ppoll () at /lib64/libc.so.6
#1  0x0000558bf0b06769 in qemu_poll_ns (__ss=0x0, __timeout=0x7ffe007c71c0, 
__nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x0000558bf0b06769 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized 
out>, address@hidden) at qemu-timer.c:325
#3  0x0000558bf0b061da in main_loop_wait (timeout=1000000000) at main-loop.c:252
#4  0x0000558bf0b061da in main_loop_wait (nonblocking=<optimized out>) at 
main-loop.c:506
#5  0x0000558bf0855a34 in main () at vl.c:1934
#6  0x0000558bf0855a34 in main (argc=<optimized out>, argv=<optimized out>, 
envp=<optimized out>) at vl.c:4658


An strace of the source shows that the 'sendmsg' call in the source
QEMU is not completing, which indicates that the socket send buffer is
full. So I think we can discount the source QEMU as being a problem.
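
For anyone wanting to reproduce that observation, something along these
lines is enough to watch the blocked call (the PID placeholder is, of
course, whatever the source QEMU is running as):

  strace -f -p <source-qemu-pid> -e trace=sendmsg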


Now the target QEMU shows this stack trace:



Thread 4 (Thread 0x7f86de523700 (LWP 4652)):
#0  0x00007f86ead64c59 in syscall () at /lib64/libc.so.6
#1  0x0000560ecd8a2a15 in qemu_event_wait (val=<optimized out>, ev=<optimized 
out>) at util/qemu-thread-posix.c:292
#2  0x0000560ecd8a2a15 in qemu_event_wait (address@hidden 
<rcu_call_ready_event>) at util/qemu-thread-posix.c:399
#3  0x0000560ecd8b0e3e in call_rcu_thread (opaque=<optimized out>) at 
util/rcu.c:250
#4  0x00007f86eb03060a in start_thread () at /lib64/libpthread.so.0
#5  0x00007f86ead6aa4d in clone () at /lib64/libc.so.6

Thread 3 (Thread 0x7f86ddd22700 (LWP 4653)):
#0  0x00007f86eb035b10 in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x0000560ecd8a2709 in qemu_cond_wait (cond=<optimized out>, address@hidden 
<qemu_global_mutex>)
    at util/qemu-thread-posix.c:123
#2  0x0000560ecd590943 in qemu_kvm_cpu_thread_fn (cpu=<optimized out>) at 
/home/berrange/src/virt/qemu/cpus.c:1035
#3  0x0000560ecd590943 in qemu_kvm_cpu_thread_fn (arg=0x560ece73b8c0) at 
/home/berrange/src/virt/qemu/cpus.c:1074
#4  0x00007f86eb03060a in start_thread () at /lib64/libpthread.so.0
#5  0x00007f86ead6aa4d in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7f8677bed700 (LWP 4657)):
#0  0x00007f86eb035b10 in pthread_cond_wait@@GLIBC_2.3.2 () at 
/lib64/libpthread.so.0
#1  0x0000560ecd8a2709 in qemu_cond_wait (address@hidden, address@hidden) at 
util/qemu-thread-posix.c:123
#2  0x0000560ecd5b422d in do_data_decompress (opaque=0x560ed0391870) at 
/home/berrange/src/virt/qemu/migration/ram.c:2195
#3  0x00007f86eb03060a in start_thread () at /lib64/libpthread.so.0
#4  0x00007f86ead6aa4d in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7f86f767dc80 (LWP 4651)):
#0  0x0000560ecd5b744f in ram_load (len=711, host=0x7f8677e06000, 
f=0x560ed03a7950) at /home/berrange/src/virt/qemu/migration/ram.c:2263
#1  0x0000560ecd5b744f in ram_load (f=0x560ed03a7950, opaque=<optimized out>, 
version_id=<optimized out>)
    at /home/berrange/src/virt/qemu/migration/ram.c:2513
#2  0x0000560ecd5b8b87 in vmstate_load (f=0x560ed03a7950, se=0x560ece731f90, 
version_id=4)
    at /home/berrange/src/virt/qemu/migration/savevm.c:643
#3  0x0000560ecd5b8fc3 in qemu_loadvm_state_main (mis=0x560ece75c330, 
f=0x560ed03a7950) at /home/berrange/src/virt/qemu/migration/savevm.c:1819
#4  0x0000560ecd5b8fc3 in qemu_loadvm_state_main (address@hidden, 
address@hidden)
    at /home/berrange/src/virt/qemu/migration/savevm.c:1850
#5  0x0000560ecd5bbd36 in qemu_loadvm_state (address@hidden) at 
/home/berrange/src/virt/qemu/migration/savevm.c:1911
#6  0x0000560ecd7b1b2f in process_incoming_migration_co (opaque=0x560ed03a7950) 
at migration/migration.c:384
#7  0x0000560ecd8b1eba in coroutine_trampoline (i0=<optimized out>, 
i1=<optimized out>) at util/coroutine-ucontext.c:78
#8  0x00007f86eacb0f30 in __start_context () at /lib64/libc.so.6
#9  0x00007ffc877e49c0 in  ()


For some reason it isn't shown in the stack trace for thread 1 above,
but when initially connecting, GDB says the main thread is at:

decompress_data_with_multi_threads (len=702, host=0x7fd78fe06000, 
f=0x55901af09950) at /home/berrange/src/virt/qemu/migration/ram.c:2254
2254            for (idx = 0; idx < thread_count; idx++) {


Looking at the target QEMU, we see the do_data_decompress() method is
waiting on a condition variable (simplified):

        while (!param->start && !quit_decomp_thread) {
            qemu_cond_wait(&param->cond, &param->mutex);
            /* ...decompress the page... */
            param->start = false;
        }


Now, decompress_data_with_multi_threads() is checking param->start
without holding the param->mutex lock.
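
Roughly speaking, the feeder side is doing something like this (a
paraphrase of the loop GDB showed at ram.c:2254, not the exact source;
decomp_param[] is my shorthand for the per-thread state array):

        /* Spin over the decompress thread slots looking for an idle one.
         * 'start' is read here with no lock held, while the worker thread
         * updates it under param->mutex.
         */
        while (true) {
            for (idx = 0; idx < thread_count; idx++) {
                if (!decomp_param[idx].start) {       /* unlocked read */
                    /* ...copy the compressed page in, set start = true,
                     *    and signal that thread's cond var... */
                    break;
                }
            }
            if (idx < thread_count) {
                break;
            }
        }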

Changing decompress_data_with_multi_threads() to acquire the
param->mutex lock makes it work (sketched below), but isn't ideal, since
that now blocks the decompress_data_with_multi_threads() method on the
completion of each thread, which defeats the point of having multiple
threads.
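
For clarity, the workaround I tried is essentially this (again just a
sketch with the same made-up decomp_param[] name, not an actual patch):

        for (idx = 0; idx < thread_count; idx++) {
            /* Take the worker's mutex before looking at 'start' */
            qemu_mutex_lock(&decomp_param[idx].mutex);
            if (!decomp_param[idx].start) {
                /* ...hand over the page, set start = true, signal... */
                qemu_mutex_unlock(&decomp_param[idx].mutex);
                break;
            }
            qemu_mutex_unlock(&decomp_param[idx].mutex);
        }

Since the worker only drops param->mutex while it is sitting inside
qemu_cond_wait(), that qemu_mutex_lock() can't complete while a page is
actually being decompressed, which is where the serialisation comes
from.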


As mentioned above, I'm only seeing the hang when using 1 decompress
thread. If I let QEMU have multiple decompress threads everything is
fine. Also, it only happens if I have a very heavy guest workload. If
the guest is completely idle, it again works fine. So clearly there is
some kind of race condition I'm unlucky enough to hit here.

In terms of monitor commands, I'm just running:


  migrate_set_capabilities compress on        (src + dst)
  migrate_set_parameters compress-threads 1   (src only)
  migrate_set_parameters decompress-threads 1 (dst only)

Then migrate -d tcp:localhost:9000

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


