qemu-devel

Re: [Qemu-devel] sda abort with virtio-scsi


From: Jim Minter
Subject: Re: [Qemu-devel] sda abort with virtio-scsi
Date: Wed, 3 Feb 2016 23:34:35 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

Hi again, thanks for replying,

On 03/02/16 23:19, Paolo Bonzini wrote:
> On 03/02/2016 22:46, Jim Minter wrote:
>> I am hitting the following VM lockup issue running a VM with latest
>> RHEL7 kernel on a host also running latest RHEL7 kernel.  FWIW I'm using
>> virtio-scsi because I want to use discard=unmap.  I ran the VM as follows:
>>
>> /usr/libexec/qemu-kvm -nodefaults \
>>    -cpu host \
>>    -smp 4 \
>>    -m 8192 \
>>    -drive discard=unmap,file=vm.qcow2,id=disk1,if=none,cache=unsafe \
>>    -device virtio-scsi-pci \
>>    -device scsi-disk,drive=disk1 \
>>    -netdev bridge,id=net0,br=br0 \
>>    -device virtio-net-pci,netdev=net0,mac=$(utils/random-mac.py) \
>>    -chardev socket,id=chan0,path=/tmp/rhev.sock,server,nowait \
>>    -chardev socket,id=chan1,path=/tmp/qemu.sock,server,nowait \
>>    -monitor unix:tmp/vm.sock,server,nowait \
>>    -device virtio-serial-pci \
>>    -device virtserialport,chardev=chan0,name=com.redhat.rhevm.vdsm \
>>    -device virtserialport,chardev=chan1,name=org.qemu.guest_agent.0 \
>>    -device cirrus-vga \
>>    -vnc none \
>>    -usbdevice tablet
>>
>> The host was busyish at the time, but not excessively (IMO).  Nothing
>> untoward in the host's kernel log; host storage subsystem is fine.  I
>> didn't get any qemu logs this time around, but I will when the issue
>> next recurs.  The VM's full kernel log is attached; here are the
>> highlights:
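
(Aside: to confirm that discard=unmap is actually reclaiming space, a quick check along these lines should work; vm.qcow2 is the image from the command line above, and the trim has to run as root inside the guest, so this is just a sketch.)

```shell
# Sketch: confirm discard=unmap reclaims space in the qcow2 image.
# IMG is the image from the qemu-kvm command line above; the commented
# commands need a live guest/host pair and root privileges.
IMG=vm.qcow2
# In the guest:  fstrim -av                              # trim all mounted filesystems
# On the host:   qemu-img info "$IMG" | grep 'disk size' # allocated size should drop
echo "check allocated size of $IMG with: qemu-img info $IMG"
```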

> Hannes, were you going to send a patch to disable timeouts?
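
In the meantime, I guess a guest-side workaround along these lines could lengthen the SCSI command timeout via sysfs (untested sketch; it assumes the disk shows up as /dev/sda in the guest):

```shell
# Sketch: raise the per-device SCSI command timeout inside the guest so
# transient host-side stalls don't escalate into command aborts.
# DEV=sda is an assumption; substitute your virtio-scsi disk.
DEV=sda
TIMEOUT_SYSFS=/sys/block/$DEV/device/timeout
echo "$TIMEOUT_SYSFS"              # default value is typically 30 (seconds)
# As root on a live guest (commented out so this sketch is side-effect free):
#   echo 3600 > "$TIMEOUT_SYSFS"   # one hour instead of 30 s
```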


>> INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 2, t=60002 jiffies, g=5253, c=5252, q=0)
>> sending NMI to all CPUs:
>> NMI backtrace for cpu 1
>> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-327.4.5.el7.x86_64 #1
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> task: ffff88023417d080 ti: ffff8802341a4000 task.ti: ffff8802341a4000
>> RIP: 0010:[<ffffffff81058e96>]  [<ffffffff81058e96>] native_safe_halt+0x6/0x10
>> RSP: 0018:ffff8802341a7e98  EFLAGS: 00000286
>> RAX: 00000000ffffffed RBX: ffff8802341a4000 RCX: 0100000000000000
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
>> RBP: ffff8802341a7e98 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
>> R13: ffff8802341a4000 R14: ffff8802341a4000 R15: 0000000000000000
>> FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f4978587008 CR3: 000000003645e000 CR4: 00000000003407e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Stack:
>>   ffff8802341a7eb8 ffffffff8101dbcf ffff8802341a4000 ffffffff81a68260
>>   ffff8802341a7ec8 ffffffff8101e4d6 ffff8802341a7f20 ffffffff810d62e5
>>   ffff8802341a7fd8 ffff8802341a4000 2581685d70de192c 7ba58fdb3a3bc8d4
>> Call Trace:
>>   [<ffffffff8101dbcf>] default_idle+0x1f/0xc0
>>   [<ffffffff8101e4d6>] arch_cpu_idle+0x26/0x30
>>   [<ffffffff810d62e5>] cpu_startup_entry+0x245/0x290
>>   [<ffffffff810475fa>] start_secondary+0x1ba/0x230
>> Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
>> NMI backtrace for cpu 0

> This is the NMI watchdog firing; the CPU got stuck for 20 seconds.  The
> issue was not a busy host, but busy storage (could it be a network
> partition, if the disk was hosted on NFS?)
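
For what it's worth, converting the reported stall time: the log says t=60002 jiffies, and if the guest kernel uses CONFIG_HZ=1000 (which I believe RHEL7 x86_64 does, though worth double-checking), that works out as:

```shell
# Convert the rcu_sched stall time from jiffies to seconds.
# HZ=1000 is an assumption (CONFIG_HZ for RHEL7 x86_64 kernels).
HZ=1000
STALL_JIFFIES=60002
echo "$(( STALL_JIFFIES / HZ )) seconds"   # → 60 seconds
```

i.e. about a minute of stall as seen by the guest's RCU stall detector.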

The VM's qcow2 storage is on a host-local SSD, and although there is some competition for host CPU and storage, it seems surprising to me that the VM should be starved of CPU to this extent. I was worried that the contention could somehow cause an abort, and perhaps from there the lockup (which does not recover even when the host load goes down).

> Firing the NMI watchdog is fixed in more recent QEMU, which has
> asynchronous cancellation, assuming you're running RHEL's QEMU 1.5.3
> (try /usr/libexec/qemu-kvm --version, or rpm -qf /usr/libexec/qemu-kvm).

/usr/libexec/qemu-kvm --version reports QEMU emulator version 1.5.3 (qemu-kvm-1.5.3-105.el7_2.3)

Cheers,

Jim


