qemu-devel

Re: [Qemu-devel] Live migration results in non-working virtio-net device


From: 陈梁
Subject: Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)
Date: Sat, 01 Mar 2014 10:34:03 +0800

> On Thursday 30 January 2014 13:23:04 Neil Skrypuch wrote:
>> First, let me briefly outline the way we use live migration, as it is
>> probably not typical. We use live migration (with block migration) to make
>> backups of VMs with zero downtime. The basic process goes like this:
>> 
>> 1) migrate src VM -> dest VM
>> 2) migration completes
>> 3) cont src VM
>> 4) gracefully shut down dest VM
>> 5) dest VM's disk image is now a valid backup
>> 
>> In general, this works very well.
>> 
>> Up until now we have been using qemu-kvm 1.1.2 and have not had any issues
>> with the above process. I am now attempting to upgrade us to a newer version
>> of qemu, but all newer versions I've tried occasionally result in the
>> virtio-net device ceasing to function on the src VM after step 3.
>> 
>> I am able to reproduce this reliably (given enough iterations); it happens
>> in roughly 2% of all migrations.
>> 
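As a rough guide to what "enough iterations" means at a failure rate of about 2%: assuming each migration fails independently with the same probability (an assumption, not something measured in this thread), the number of runs needed to hit the failure at least once with a given confidence can be estimated as below. The function name runs_needed is made up for illustration.

```shell
# runs_needed P C: smallest n such that 1 - (1 - P)^n >= C,
# assuming independent runs with per-run failure probability P.
runs_needed() {
    awk -v p="$1" -v c="$2" 'BEGIN {
        n = log(1 - c) / log(1 - p)
        # round up to the next whole run
        printf "%d\n", (n == int(n)) ? n : int(n) + 1
    }'
}
```

For example, `runs_needed 0.02 0.95` prints 149, so under that assumption about 150 migrations give a 95% chance of seeing the failure at least once.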
>> Here is the complete qemu command line for the src VM:
>> 
>> /usr/bin/qemu-system-x86_64 -machine accel=kvm -drive
>> file=/var/lib/kvm/testbackup.polldev.com.img,if=virtio -m 2048 -smp
>> 4,cores=4,sockets=1,threads=1 -net
>> nic,macaddr=52:54:98:00:00:00,model=virtio -net
>> tap,script=/etc/qemu-ifup-br2,downscript=no -curses -name
>> "testbackup.polldev.com",process=testbackup.polldev.com -monitor
>> unix:/var/lib/kvm/monitor/testbackup,server,nowait
>> 
>> The dest VM:
>> 
>> /usr/bin/qemu-system-x86_64 -machine accel=kvm -drive
>> file=/backup/testbackup.polldev.com.img.bak20140129,if=virtio -m 2048 -smp
>> 4,cores=4,sockets=1,threads=1 -net
>> nic,macaddr=52:54:98:00:00:00,model=virtio -net tap,script=no,downscript=no
>> -curses -name "testbackup.polldev.com",process=testbackup.polldev.com
>> -monitor unix:/var/lib/kvm/monitor/testbackup.bak,server,nowait -incoming
>> tcp:0:4444
>> 
>> The migration is performed like so:
>> 
>> echo "migrate -b tcp:localhost:4444" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
>> echo "migrate_set_speed 1G" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
>> #wait
>> echo cont | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
>> 
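The "#wait" step in the quoted commands is not spelled out. One possible way to fill it in, as a sketch only, is to poll "info migrate" on the source monitor until it reports a final state. The helper names (hmp, migration_status, wait_for_migration) and the POLL_INTERVAL variable are hypothetical, and the parsing assumes the monitor prints a line starting with "Migration status:", which may differ between qemu versions.

```shell
# Monitor socket of the source VM (path taken from the quoted command line).
MON=/var/lib/kvm/monitor/testbackup

# hmp CMD: send a single monitor command to the source VM.
hmp() {
    echo "$1" | socat STDIO "UNIX-CONNECT:$MON"
}

# migration_status: print the status word from `info migrate`
# (assumed output format: "Migration status: <word>").
migration_status() {
    hmp "info migrate" | tr -d '\r' | sed -n 's/^Migration status: *//p'
}

# wait_for_migration: poll until the migration reaches a final state.
# Returns 0 on "completed", 1 on "failed" or "cancelled".
wait_for_migration() {
    while :; do
        case "$(migration_status)" in
            completed) return 0 ;;
            failed|cancelled) return 1 ;;
        esac
        sleep "${POLL_INTERVAL:-5}"
    done
}
```

With these defined, the "#wait" line could become `wait_for_migration && hmp cont`, so the source VM is only resumed once the migration has actually completed.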
>> The guest in question is a minimal install of CentOS 6.5.
>> 
>> I have observed this issue across the following qemu versions:
>> 
>> qemu 1.4.2
>> qemu 1.6.0
>> qemu 1.6.1
>> qemu 1.7.0
>> 
>> I also attempted to test qemu 1.5.3, but live migration flat out crashed
>> there (totally different issue).
>> 
>> I have also tested a number of other scenarios with qemu 1.6.0, all of which
>> exhibit the same failure mode:
>> 
>> qemu 1.6.0 + host kernel 3.1.0
>> qemu 1.6.0 + host kernel 3.10.7
>> qemu 1.6.0 + host kernel 3.10.17
>> qemu 1.6.0 + virtio with -netdev/-device syntax
>> qemu 1.6.0 + accel=tcg
>> 
>> The one case I have found that works properly is the following:
>> 
>> qemu 1.6.0 + e1000
>> 
>> It is worth noting that when the virtio-net device ceases to function in the
>> guest, removing and reinserting the virtio-net kernel module gets the device
>> working again (except in 1.4.2, where this had no effect).
>> 
>> As mentioned above I can reproduce this with minimal effort, and am willing
>> to test out any patches or provide further details as necessary.
>> 
>> - Neil
> 
> Ok, I was able to narrow this down to somewhere between 1.2.2 (or rather,
> 1.2.0) and 1.3.0. Migration in 1.3.0 is broken; however, I was able to
> cherry-pick d7cd369, d5f1f28, and 9ee0cb2 on top of 1.3.0 to fix the unrelated
> migration bug and confirm that the bug from this thread is still present in
> 1.3.0.
> 
> I started a git bisect on 1.2.2..1.3.0 but didn't get very far before running 
> into several unrelated bugs which kept migration from working.
> 
> I also tested out the latest master code (d844a7b) and it fails in the same 
> way as 1.7.0.
> 
> - Neil
> 

Hi, have you tried pinging another host from the VM after the migration?
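
Something along these lines, run inside the guest after the "cont" on the source, would do as a quick check. This is only a sketch: check_net is a made-up helper name, and the -c and -W options assume the Linux iputils ping.

```shell
# check_net TARGET: return 0 if TARGET answers at least one of three
# pings, waiting at most 2 seconds for each reply.
check_net() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}
```

Call it with the address of any other host reachable through the bridge and look at the exit status; a non-zero status right after the migration would show whether outgoing traffic from the guest is affected.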

