qemu-block

From: Stefan Hajnoczi
Subject: Re: [Qemu-block] [Qemu-devel] Guest unresponsive after Virtqueue size exceeded error
Date: Wed, 20 Feb 2019 16:58:39 +0000
User-agent: Mutt/1.10.1 (2018-07-13)

On Mon, Feb 18, 2019 at 07:21:25AM +0000, Fernando Casas Schössow wrote:
> It took a few days but last night the problem was reproduced.
> This is the information from the log:
> 
> vdev 0x55f261d940f0 ("virtio-blk")
> vq 0x55f261d9ee40 (idx 0)
> inuse 128 vring.num 128
> old_shadow_avail_idx 58874 last_avail_idx 58625 avail_idx 58874
> avail 0x3d87a800 avail_idx (cache bypassed) 58625

Hi Paolo,
Are you aware of any recent MemoryRegionCache issues?  The avail_idx
value 58874 was read via the cache while a non-cached read produces
58625!

I suspect that 58625 is correct since the vring is already full and the
driver wouldn't bump avail_idx any further until requests complete.

Fernando also hits this issue with virtio-scsi so it's not a
virtio_blk.ko driver bug or a virtio-blk device emulation issue.

A QEMU core dump is available for debugging.

Here is the patch that produced this debug output:
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index a1ff647a66..28d89fcbcb 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
         return NULL;
     }
     rcu_read_lock();
+    uint16_t old_shadow_avail_idx = vq->shadow_avail_idx;
     if (virtio_queue_empty_rcu(vq)) {
         goto done;
     }
@@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     max = vq->vring.num;

     if (vq->inuse >= vq->vring.num) {
+        fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name);
+        fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq));
+        fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num);
+        fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx);
+        fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx)));
+        fprintf(stderr, "used_idx %u\n", vq->used_idx);
+        abort(); /* <--- core dump! */
         virtio_error(vdev, "Virtqueue size exceeded");
         goto done;
     }

Stefan

> used_idx 58497
> 2019-02-18 03:20:08.605+0000: shutting down, reason=crashed
> 
> The dump file, including guest memory, was generated successfully (after gzip 
> the file is around 492MB).
> I switched the guest now to virtio-scsi to get the information and dump with 
> this setup as well.
> 
> How should we proceed?
> 
> Thanks.
> 
> On Mon, Feb 11, 2019 at 4:17 AM, Stefan Hajnoczi <address@hidden> wrote:
>
> Thanks for collecting the data!  The fact that both virtio-blk and
> virtio-scsi failed suggests it's not a virtqueue element leak in the
> virtio-blk or virtio-scsi device emulation code.
>
> The hung task error messages from inside the guest are a consequence
> of QEMU hitting the "Virtqueue size exceeded" error.  QEMU refuses to
> process further requests after the error, causing tasks inside the
> guest to get stuck on I/O.
>
> I don't have a good theory regarding the root cause.  Two ideas:
>
> 1. The guest is corrupting the vring or submitting more requests than
>    will fit into the ring.  Somewhat unlikely because it happens with
>    both Windows and Linux guests.
>
> 2. QEMU's virtqueue code is buggy, maybe the memory region cache
>    which is used for fast guest RAM accesses.
>
> Here is an expanded version of the debug patch which might help
> identify which of these scenarios is likely.  Sorry, it requires
> running the guest again!  This time let's make QEMU dump core so both
> QEMU state and guest RAM are captured for further debugging.  That
> way it will be possible to extract more information using gdb without
> rerunning.
>
> Stefan
>
> ---
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index a1ff647a66..28d89fcbcb 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -866,6 +866,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>          return NULL;
>      }
>      rcu_read_lock();
> +    uint16_t old_shadow_avail_idx = vq->shadow_avail_idx;
>      if (virtio_queue_empty_rcu(vq)) {
>          goto done;
>      }
> @@ -879,6 +880,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>      max = vq->vring.num;
>
>      if (vq->inuse >= vq->vring.num) {
> +        fprintf(stderr, "vdev %p (\"%s\")\n", vdev, vdev->name);
> +        fprintf(stderr, "vq %p (idx %u)\n", vq, (unsigned int)(vq - vdev->vq));
> +        fprintf(stderr, "inuse %u vring.num %u\n", vq->inuse, vq->vring.num);
> +        fprintf(stderr, "old_shadow_avail_idx %u last_avail_idx %u avail_idx %u\n", old_shadow_avail_idx, vq->last_avail_idx, vq->shadow_avail_idx);
> +        fprintf(stderr, "avail %#" HWADDR_PRIx " avail_idx (cache bypassed) %u\n", vq->vring.avail, virtio_lduw_phys(vdev, vq->vring.avail + offsetof(VRingAvail, idx)));
> +        fprintf(stderr, "used_idx %u\n", vq->used_idx);
> +        abort(); /* <--- core dump! */
>          virtio_error(vdev, "Virtqueue size exceeded");
>          goto done;
>      }
> 
> 

Attachment: signature.asc
Description: PGP signature

