[Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0

From:	Chris Webb
Subject:	[Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
Date:	Mon, 2 Apr 2012 16:37:23 +0100
User-agent:	Mutt/1.5.20 (2009-06-14)

I have an interesting bug with the e1000 emulation in qemu-kvm 1.0. I've
spent a bit of time trying to track it down, but the behaviour is
sufficiently odd that I'm rather baffled.

The public networking on our VMs consists of a bridge to which the physical
NIC is enslaved, a tap interface created for each VM, and qemu-kvm instances
with

  -netdev tap,id=vlan.0,ifname=tap1,script=no,downscript=no
  -device e1000,id=nic.0,mac=02:00:54:2d:08:f2,netdev=vlan.0

arguments. There are all sorts of ebtables and iptables rules in place
during normal operation of our clusters but I can reproduce the problem by
hand with all of these removed, so I don't think they're implicated.

We first saw the problem after an upgrade from 0.15.x to 1.0. On restarting
the VMs with the same command line as before, we saw a large number of boot
failures where the guest had apparently come up without networking. Killing
and restarting these VMs generally fixed them. Now, on the same hosts, about
one e1000-using VM in twenty fails, apparently at random. Strangely, that is
much less frequent than during the mass reboot.

Because it's an intermittent fault, it's been a bit amusing to reproduce! My
best recipe for reproducing it by hand is to create a standard Debian 6.0
install in a VM, statically configure its networking so it works correctly
for the network attached to br0, and add something like

  curl -s http://www.google.com/ >/dev/null && poweroff

to /etc/rc.local. I can then start qemus in a loop until I get one with
broken networking, typically after twenty or so boots:

  N=1; while true; do
    echo "Starting VM $N:"
    tunctl -t tap1
    brctl addif br0 tap1
    ip link set tap1 up
    qemu-kvm -m 1024 -smp 1 -cpu host -nodefaults -usbdevice tablet \
             -vga cirrus -vnc :1 -drive file=reboot2.img,if=ide,index=0 \
             -netdev tap,id=vlan.0,ifname=tap1,script=no,downscript=no \
             -device e1000,id=nic.0,mac=02:00:54:2d:08:f2,netdev=vlan.0
    tunctl -d tap1
    sleep 1 && echo
    let N++
  done
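
For unattended runs, for example to compare failure rates with and without
-usbdevice tablet, the loop above could be wrapped in a watchdog. The following
is only a sketch: run_with_deadline is a hypothetical helper name, it assumes
coreutils timeout(1) is available on the host, and the 300 second deadline in
the usage comment is an arbitrary guess at how long a healthy boot, curl and
poweroff cycle takes.

```shell
# run_with_deadline SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under coreutils timeout(1).
# Returns 0 if COMMAND exited by itself before the deadline (whatever its
# own exit status was), and 1 if timeout had to stop it (timeout exits 124).
run_with_deadline() {
  rwd_deadline=$1
  shift
  timeout "$rwd_deadline" "$@"
  [ "$?" -ne 124 ]
}

# Possible use inside the loop (deadline value is an assumption):
#   if run_with_deadline 300 qemu-kvm -m 1024 ... ; then
#     echo "VM $N powered off by itself: networking worked"
#   else
#     echo "VM $N still running after 300s: networking probably broken"
#   fi
```

Note that timeout sends SIGTERM when the deadline passes, so the broken guest
is shut down rather than left running for inspection. To poke at a failed
guest, the original loop, which simply blocks on the stuck VM, is the better
fit.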

My test host is running Linux 3.2.2, but I've reproduced it on an earlier
2.6.39.2 kernel on another host as well. It's definitely present in both
mainline qemu and qemu-kvm 1.0.

That particular guest is running a 2.6.32 kernel, but the problem isn't
specific to a kernel version, nor even to Linux, as we've seen it on
Windows VMs too.

There are at least two things I can do which apparently cause the guests to
always work fine. One is to remove the -usbdevice tablet in the above. The
second is to replace 

  -netdev tap,id=vlan.0,ifname=tap1,script=no,downscript=no

with -netdev user,id=vlan.0 and reconfigure the guest networking to match
the correct user-mode networking IP address, netmask and gateway.
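
As an illustration of that reconfiguration, the sketch below prints the
guest-side commands for QEMU's default user-mode network (10.0.2.0/24, with
the gateway at 10.0.2.2 and the built-in DNS forwarder at 10.0.2.3). It
assumes -netdev user is given no net= or host= overrides and that the guest
NIC is eth0. The function and variable names are made up for this example,
and 10.0.2.15 is simply the first address the built-in DHCP server would
hand out.

```shell
# Defaults of QEMU user-mode networking when no net=/host= option is given.
SLIRP_GUEST_ADDR=10.0.2.15/24
SLIRP_GATEWAY=10.0.2.2
SLIRP_DNS=10.0.2.3

# slirp_guest_commands [IFACE]
# Hypothetical helper: prints (does not run) the commands that would give
# IFACE a static configuration matching the defaults above.
slirp_guest_commands() {
  sgc_if=${1:-eth0}
  echo "ip addr flush dev $sgc_if"
  echo "ip addr add $SLIRP_GUEST_ADDR dev $sgc_if"
  echo "ip link set $sgc_if up"
  echo "ip route replace default via $SLIRP_GATEWAY"
  echo "echo nameserver $SLIRP_DNS > /etc/resolv.conf"
}
```

Printing rather than executing keeps the commands reviewable. Inside the
guest, as root, the output could be piped into sh once it looks right.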

This is a bit puzzling to me. The first fix naively suggests something wrong
with the device emulation in the guest, whereas the second fix would point
at something wrong in qemu code that glues the guest to the tap interface.
However, the problem also never happens with virtio or rtl8139 devices.

Once I've got a guest with broken networking, the network stays down even if
I do things like 'ip link set eth0 down; sleep 5; ip link set eth0 up'.
If I kill and restart the same VM, it runs fine the next time.
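
One way to gather more detail from a guest in this state is to snapshot the
interface counters and interrupt counts twice, a few seconds apart, and see
which of them still move. This is a sketch, not a known-good diagnostic for
this bug: collect_nic_state is a hypothetical helper, it assumes a Linux guest
with iproute2, and ethtool may well not be installed (the helper skips it if
so).

```shell
# collect_nic_state [IFACE]
# Hypothetical helper: prints link counters, driver statistics and the
# interrupt line(s) for IFACE, skipping any tool that is not available.
collect_nic_state() {
  cns_if=${1:-eth0}

  echo "== link counters for $cns_if =="
  if command -v ip >/dev/null 2>&1; then
    ip -s link show dev "$cns_if" 2>&1 || true
  else
    echo "(ip not installed)"
  fi

  echo "== driver statistics for $cns_if =="
  if command -v ethtool >/dev/null 2>&1; then
    ethtool -S "$cns_if" 2>&1 || true
  else
    echo "(ethtool not installed)"
  fi

  echo "== interrupt counts for $cns_if =="
  if [ -r /proc/interrupts ]; then
    grep "$cns_if" /proc/interrupts || echo "(no interrupt line mentions $cns_if)"
  else
    echo "(/proc/interrupts not readable)"
  fi
  return 0
}

# Possible use in the guest:
#   collect_nic_state eth0 > before.txt; sleep 10
#   collect_nic_state eth0 > after.txt; diff before.txt after.txt
```

If the TX counters advance while the RX counters and the interrupt count stay
frozen, that would at least narrow down which direction is stuck.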

Any suggestions of anything I could do to better pin down this one would be
very gratefully received!

Best wishes,

Chris.
