From: Hanna Czenczek
Subject: Re: [PULL 11/33] scsi: only access SCSIDevice->requests from one thread
Date: Thu, 25 Jan 2024 18:32:46 +0100
User-agent: Mozilla Thunderbird
On 23.01.24 18:10, Kevin Wolf wrote:
On 23.01.2024 at 17:40, Hanna Czenczek wrote:
On 21.12.23 22:23, Kevin Wolf wrote:
From: Stefan Hajnoczi <stefanha@redhat.com>

Stop depending on the AioContext lock and instead access SCSIDevice->requests from only one thread at a time:

- When the VM is running only the BlockBackend's AioContext may access the requests list.
- When the VM is stopped only the main loop may access the requests list.

These constraints protect the requests list without the need for locking in the I/O code path.

Note that multiple IOThreads are not supported yet because the code assumes all SCSIRequests are executed from a single AioContext. Leave that as future work.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20231204164259.1515217-2-stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/hw/scsi/scsi.h |   7 +-
 hw/scsi/scsi-bus.c     | 181 ++++++++++++++++++++++++++++-------------
 2 files changed, 131 insertions(+), 57 deletions(-)

My reproducer for https://issues.redhat.com/browse/RHEL-3934 now breaks more often because of this commit than because of the original bug, i.e. when repeatedly hot-plugging and unplugging a virtio-scsi and a scsi-hd device, this tends to happen when unplugging the scsi-hd:

{"execute":"device_del","arguments":{"id":"stg0"}}
{"return": {}}

qemu-system-x86_64: ../block/block-backend.c:2429: blk_get_aio_context: Assertion `ctx == blk->ctx' failed.
[...]
I don’t know anything about the problem yet, but as usual, I like speculation and discovering how wrong I was later on, so one thing I came across that’s funny about virtio-scsi is that requests can happen even while a disk is being attached or detached. That is, Linux seems to probe all LUNs when a new virtio-scsi device is being attached, and it won’t stop just because a disk is being attached or removed. So maybe that’s part of the problem, that we get a request while the BB is being detached, and temporarily in an inconsistent state (BDS context differs from BB context).

I don't know anything about the problem either, but since you already speculated about the cause, let me speculate about the solution: Can we hold the graph writer lock for the tran_commit() call in bdrv_try_change_aio_context()? And of course take the reader lock for blk_get_aio_context(), but that should be completely unproblematic.
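If I understand the proposal right, reduced to a stand-alone model it would look roughly like this (a pthread rwlock standing in for the graph lock; all of this is just an illustration I made up, not actual QEMU code):

#include <pthread.h>

/* Toy stand-ins; not the real QEMU structures. */
typedef struct AioContext AioContext;
typedef struct {
    AioContext *ctx;
} BlockBackend;

/* Stand-in for the block graph lock. */
static pthread_rwlock_t graph_lock = PTHREAD_RWLOCK_INITIALIZER;

/* "Writer" side: corresponds to the tran_commit() call in
 * bdrv_try_change_aio_context() that switches the context. */
static void change_aio_context(BlockBackend *blk, AioContext *new_ctx)
{
    pthread_rwlock_wrlock(&graph_lock);   /* exclusive with all readers */
    blk->ctx = new_ctx;
    pthread_rwlock_unlock(&graph_lock);
}

/* "Reader" side: corresponds to blk_get_aio_context(). */
static AioContext *get_aio_context(BlockBackend *blk)
{
    pthread_rwlock_rdlock(&graph_lock);
    AioContext *ctx = blk->ctx;
    pthread_rwlock_unlock(&graph_lock);
    return ctx;   /* may already be stale once the lock is dropped */
}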
Actually, now that "completely unproblematic" part is giving me trouble. I wanted to just put a graph lock into blk_get_aio_context() (making it a coroutine with a wrapper), but callers of blk_get_aio_context() generally assume the context is going to stay the BB’s context for as long as their AioContext * variable is in scope. I was tempted to think callers know what happens to the BB they pass to blk_get_aio_context() and that it won’t change contexts so easily, but then I remembered that this is exactly what happens in this case: we run scsi_device_for_each_req_async_bh() in one thread (which calls blk_get_aio_context()), while the other thread changes the BB’s context.
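In terms of that little model, the pattern that worries me is this (use() being a made-up stand-in for whatever the caller does with the context):

/* Hypothetical caller, illustrative only. */
static void use(AioContext *ctx) { (void)ctx; }   /* stand-in */

static void some_caller(BlockBackend *blk)
{
    AioContext *ctx = get_aio_context(blk);   /* read lock taken and dropped */

    /*
     * Nothing stops another thread from calling
     * change_aio_context(blk, new_ctx) right here, because we no longer
     * hold the read lock.
     */

    use(ctx);   /* ctx may no longer be blk's context at this point */
}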
It seems like there are very few blk_* functions right now that require taking a graph lock around them, so I’m hesitant to go that route. But if we’re protecting a BB’s context via the graph write lock, I can’t think of a way around having to take a read lock whenever reading a BB’s context, and holding it for as long as we assume that context to remain the BB’s context. It’s also hard to figure out how long that is, case by case; for example, dma_blk_read() schedules an AIO function in the BB context, but we probably don’t care whether that context remains the BB’s context until the request is done. In the case of scsi_device_for_each_req_async_bh(), we already take care to re-schedule it when it turns out the context is outdated, so it does seem quite important here, and we probably want to keep the lock until after the QTAILQ_FOREACH_SAFE() loop.
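Sticking with the toy model, that would mean widening the reader’s critical section so it covers the re-check and the list walk, something like this (current_context(), reschedule_into() and walk_request_list() are made-up stubs, only there to keep the sketch self-contained):

/* Hypothetical stubs with empty bodies, just for illustration. */
static AioContext *current_context(void) { return NULL; }
static void reschedule_into(AioContext *ctx) { (void)ctx; }
static void walk_request_list(void) { }

/* Reader side with a widened critical section: the lock now covers the
 * re-check of the context and the request-list walk, not just the load. */
static void for_each_req_bh(BlockBackend *blk)
{
    pthread_rwlock_rdlock(&graph_lock);

    if (blk->ctx != current_context()) {
        /* Outdated: reschedule into the new context and retry there. */
        AioContext *ctx = blk->ctx;
        pthread_rwlock_unlock(&graph_lock);
        reschedule_into(ctx);
        return;
    }

    /*
     * Still under the read lock: a concurrent context switch (the writer
     * in change_aio_context()) is excluded while we walk the list, which
     * is what the QTAILQ_FOREACH_SAFE() loop relies on.
     */
    walk_request_list();

    pthread_rwlock_unlock(&graph_lock);
}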
On a tangent, this TOCTTOU problem makes me wary of other blk_* functions that query information. For example, fuse_read() (in block/export/fuse.c) truncates requests to the BB length. But what if the BB length changes concurrently between blk_getlength() and blk_pread()? While we can justify using the graph lock for a BB’s AioContext, we can’t use it for other metadata like its length.
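To spell out the pattern I mean (QEMU-flavoured pseudo-code from memory, not the literal code in block/export/fuse.c; blk_getlength() and blk_pread() are the real accessors, everything else, including the exact signatures, should be taken with a grain of salt):

/* Schematic only. */
static void fuse_read_sketch(BlockBackend *blk, void *buf,
                             uint64_t offset, uint64_t size)
{
    int64_t length = blk_getlength(blk);            /* check */
    if (length < 0) {
        return;                                     /* error handling omitted */
    }

    if (offset + size > (uint64_t)length) {
        /* Truncate the request to the sampled BB length. */
        size = offset < (uint64_t)length ? (uint64_t)length - offset : 0;
    }

    /* A concurrent truncate can shrink the BB right here. */

    blk_pread(blk, offset, size, buf, 0);           /* use: clamp may be stale */
}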
Hanna