Segfault in hw/scsi/scsi-disk.c caused by null pointer

We are running a Ceph 15.2.16 cluster at work, connect VMs to it using librbd1 14.2.22, running on the QEMU 4.2 release by Ubuntu’s Cloud Archive (4.2-3ubuntu6.23~cloud0).

A few of our VMs randomly trigger a segfault in qemu-system-x86_64. This seems to only happen when lots of volume detach requests are executed in parallel. We can somewhat reliably reproduce this in our staging environment (it still takes thousands of detaches and sometimes hours of running tests until it occurs once). In production it is a lot rarer, but it does occur with some regularity.

Using crash dumps and debug symbols I was able to pin-point the cause of the segfault to a problem with the `scsi_disk_emulate_inquiry` function. Within, QEMU tries to access two SCSIDiskState variables that are both `0x0`, when QEMU expects them to be characters.

1. `s->version`

2. `s->vendor`

I found out that in other parts of the code (`scsi_block_realize`) those variables are not expected to always be valid characters, so I copied that over and created my own release of the distro package with this patch (here diffed against master):

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c

index efee6739f9..419d6e28f1 100644

--- a/hw/scsi/scsi-disk.c

+++ b/hw/scsi/scsi-disk.c

@@ -775,6 +775,14 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)

return -1;

}

+ /* Avoid null-pointers leading to segfaults below */

+ if (!s->version) {

+ s->version = g_strdup(qemu_hw_version());

+ }

+ if (!s->vendor) {

+ s->vendor = g_strdup("QEMU");

+ }

/* PAGE CODE == 0 */

buflen = req->cmd.xfer;

if (buflen > SCSI_MAX_INQUIRY_LEN) {

And with that fix, we have no segfaults anymore.

So I have that, but there are a few problems:

1. I don’t have a way to reproduce this in the master.

2. I don’t know if this is safe.

3. I don’t know if this just masks the issue and the problem lies somewhere deeper.

So.. should this be something I create a bug report for? I feel like a maintainer might be able to know much better if this is an actual issue or not. I would also suspect that this is indeed a QEMU bug to a degree, even though it’s maybe something that does not happen with other librbd1 releases, because the assumption that those pointers are always point to valid characters is likely wrong.

Also, even if it’s something just for our version, I would appreciate any kind of input on my patch and why you think it helps our case.

Cheers,

Denis

From:	Denis Krienbühl
Subject:	Segfault in hw/scsi/scsi-disk.c caused by null pointer
Date:	Tue, 9 Aug 2022 17:33:51 +0200