Re: [Qemu-devel] Live migration results in non-working virtio-net device


From: Neil Skrypuch
Subject: Re: [Qemu-devel] Live migration results in non-working virtio-net device (sometimes)
Date: Fri, 28 Feb 2014 15:14:12 -0500
User-agent: KMail/4.10.2 (Linux/3.9.1-gentoo-r1; KDE/4.10.2; x86_64; ; )

On Thursday 30 January 2014 13:23:04 Neil Skrypuch wrote:
> First, let me briefly outline the way we use live migration, as it is
> probably not typical. We use live migration (with block migration) to make
> backups of VMs with zero downtime. The basic process goes like this:
> 
> 1) migrate src VM -> dest VM
> 2) migration completes
> 3) cont src VM
> 4) gracefully shut down dest VM
> 5) dest VM's disk image is now a valid backup
> 
> In general, this works very well.
> 
> Up until now we have been using qemu-kvm 1.1.2 and have not had any issues
> with the above process. I am now attempting to upgrade us to a newer version
> of qemu, but all newer versions I've tried occasionally result in the
> virtio-net device ceasing to function on the src VM after step 3.
> 
> I am able to reproduce this reliably (given enough iterations); it happens
> in roughly 2% of all migrations.
> 
> Here is the complete qemu command line for the src VM:
> 
> /usr/bin/qemu-system-x86_64 -machine accel=kvm -drive
> file=/var/lib/kvm/testbackup.polldev.com.img,if=virtio -m 2048 -smp
> 4,cores=4,sockets=1,threads=1 -net
> nic,macaddr=52:54:98:00:00:00,model=virtio -net
> tap,script=/etc/qemu-ifup-br2,downscript=no -curses -name
> "testbackup.polldev.com",process=testbackup.polldev.com -monitor
> unix:/var/lib/kvm/monitor/testbackup,server,nowait
> 
> The dest VM:
> 
> /usr/bin/qemu-system-x86_64 -machine accel=kvm -drive
> file=/backup/testbackup.polldev.com.img.bak20140129,if=virtio -m 2048 -smp
> 4,cores=4,sockets=1,threads=1 -net
> nic,macaddr=52:54:98:00:00:00,model=virtio -net tap,script=no,downscript=no
> -curses -name "testbackup.polldev.com",process=testbackup.polldev.com
> -monitor unix:/var/lib/kvm/monitor/testbackup.bak,server,nowait -incoming
> tcp:0:4444
> 
> The migration is performed like so:
> 
> echo "migrate -b tcp:localhost:4444" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
> echo "migrate_set_speed 1G" | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
> #wait
> echo cont | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
> 
> The guest in question is a minimal install of CentOS 6.5.
> 
> I have observed this issue across the following qemu versions:
> 
> qemu 1.4.2
> qemu 1.6.0
> qemu 1.6.1
> qemu 1.7.0
> 
> I also attempted to test qemu 1.5.3, but live migration flat out crashed
> there (totally different issue).
> 
> I have also tested a number of other scenarios with qemu 1.6.0, all of which
> exhibit the same failure mode:
> 
> qemu 1.6.0 + host kernel 3.1.0
> qemu 1.6.0 + host kernel 3.10.7
> qemu 1.6.0 + host kernel 3.10.17
> qemu 1.6.0 + virtio with -netdev/-device syntax
> qemu 1.6.0 + accel=tcg
> 
> The one case I have found that works properly is the following:
> 
> qemu 1.6.0 + e1000
> 
> It is worth noting that when the virtio-net device ceases to function in the
> guest, removing and reinserting the virtio-net kernel module results in
> the device working again (except in 1.4.2, where this had no effect).
> 
> As mentioned above I can reproduce this with minimal effort, and am willing
> to test out any patches or provide further details as necessary.
> 
> - Neil

Ok, I was able to narrow this down to somewhere between 1.2.2 (or rather,
1.2.0) and 1.3.0. Migration in 1.3.0 is broken; however, I was able to
cherry-pick d7cd369, d5f1f28, and 9ee0cb2 on top of 1.3.0 to fix the unrelated
migration bug and confirm that the bug from this thread is still present in
1.3.0.

I started a git bisect on 1.2.2..1.3.0 but didn't get very far before running 
into several unrelated bugs which kept migration from working.
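
One common way around unrelated breakage during a bisect is to apply the known
fixes temporarily at each step, test, and then discard them, skipping any
commit where they do not apply. The following is only a sketch of that idea:
the function names, the `./migration-test.sh` script, and the `v1.2.0`/`v1.3.0`
tag names in the usage comment are assumptions for illustration, and the three
fixes may well conflict on some intermediate commits.

```shell
#!/bin/sh
# Sketch of a "git bisect run" step. Defines functions only; nothing runs
# when this file is sourced.

# Drop temporarily applied changes and any leftover cherry-pick state.
cleanup_fixes() {
    git reset -q --hard
    git cherry-pick --quit >/dev/null 2>&1 || true
}

# Usage: bisect_step TEST_CMD FIX_COMMIT...
# Returns 0 (good), 1 (bad), or 125 (tell "git bisect run" to skip).
bisect_step() {
    test_cmd=$1
    shift
    # Apply the fixes to the working tree without creating commits.
    if ! git cherry-pick --no-commit "$@" >/dev/null 2>&1; then
        cleanup_fixes
        return 125
    fi
    # TEST_CMD is a hypothetical build-and-migrate test supplied by the caller.
    if sh -c "$test_cmd"; then rc=0; else rc=1; fi
    cleanup_fixes
    return "$rc"
}

# Example (assumed file and tag names, not run here):
#   git bisect start v1.3.0 v1.2.0
#   git bisect run sh -c '. ./bisect-step.sh; \
#       bisect_step ./migration-test.sh d7cd369 d5f1f28 9ee0cb2'
```

Because the failure shows up in only about 2% of migrations, the test command
would need many clean runs before reporting "good": roughly 150 consecutive
successes leave under a 5% chance of having missed the bug (0.98^150 is about
0.048).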

I also tested out the latest master code (d844a7b) and it fails in the same 
way as 1.7.0.

- Neil
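
The `#wait` step in the quoted migration commands is left unspecified. One way
to fill it in, as a minimal sketch, is to poll the monitor with `info migrate`
and read the `Migration status:` line it prints. The function names and the
5-second polling interval below are assumptions for illustration, not part of
the original setup. Depending on timing, socat may also exit before the
monitor replies, in which case a longer socat timeout would be needed.

```shell
#!/bin/sh
# Sketch only. Defines functions; nothing runs when this file is sourced.

# Read HMP "info migrate" output on stdin and print the status word
# (e.g. active, completed, failed). Tolerates CRLF line endings.
migration_status() {
    sed -n 's/.*Migration status: *\([a-z-]*\).*/\1/p'
}

# Usage: wait_for_migration MONITOR_SOCKET
# Polls until the migration completes (returns 0) or fails (returns 1).
wait_for_migration() {
    sock=$1
    while :; do
        status=$(echo "info migrate" \
            | socat STDIO "UNIX-CONNECT:$sock" \
            | migration_status)
        case $status in
            completed) return 0 ;;
            failed|cancelled) return 1 ;;
        esac
        sleep 5   # assumed interval
    done
}

# Example (not run here):
#   wait_for_migration /var/lib/kvm/monitor/testbackup &&
#       echo cont | socat STDIO UNIX-CONNECT:/var/lib/kvm/monitor/testbackup
```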


