Re: deadlock when using iothread during backup_clean()


From: Fiona Ebner
Subject: Re: deadlock when using iothread during backup_clean()
Date: Thu, 28 Sep 2023 10:06:10 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.15.1

Am 05.09.23 um 13:42 schrieb Paolo Bonzini:
> On 9/5/23 12:01, Fiona Ebner wrote:
>> Can we assume block_job_remove_all_bdrv() to always hold the job's
>> AioContext?
> 
> I think so, see job_unref_locked(), job_prepare_locked() and
> job_finalize_single_locked().  These call the callbacks that ultimately
> get to block_job_remove_all_bdrv().
>     
>> And if yes, can we just tell bdrv_graph_wrlock() that it
>> needs to release that before polling to fix the deadlock?
> 
> No, but I think it should be released and re-acquired in
> block_job_remove_all_bdrv() itself.
> 

For fixing the backup cancel deadlock, I tried the following:

> diff --git a/blockjob.c b/blockjob.c
> index 58c5d64539..fd6132ebfe 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -198,7 +198,9 @@ void block_job_remove_all_bdrv(BlockJob *job)
>       * one to make sure that such a concurrent access does not attempt
>       * to process an already freed BdrvChild.
>       */
> +    aio_context_release(job->job.aio_context);
>      bdrv_graph_wrlock(NULL);
> +    aio_context_acquire(job->job.aio_context);
>      while (job->nodes) {
>          GSList *l = job->nodes;
>          BdrvChild *c = l->data;

but unfortunately, it's not enough. And I don't just mean because of the
later deadlock during bdrv_close() (via bdrv_cbw_drop()) mentioned in the
other mail.
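
(Side note on the hunk above: it might be safer to remember the AioContext
in a local variable before dropping the lock, in case the job's context
could change while the lock is not held. A sketch of what I mean:

> AioContext *ctx = job->job.aio_context;
> aio_context_release(ctx);
> bdrv_graph_wrlock(NULL);
> aio_context_acquire(ctx);

But I haven't checked whether that can actually happen here.)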

Even when I got lucky and that deadlock didn't trigger by chance, or when
I tried to avoid it with an additional change

> diff --git a/block.c b/block.c
> index e7f349b25c..02d2c4e777 100644
> --- a/block.c
> +++ b/block.c
> @@ -5165,7 +5165,7 @@ static void bdrv_close(BlockDriverState *bs)
>          bs->drv = NULL;
>      }
>  
> -    bdrv_graph_wrlock(NULL);
> +    bdrv_graph_wrlock(bs);
>      QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
>          bdrv_unref_child(bs, child);
>      }

guest IO would often get completely stuck after canceling the backup.
There's nothing obvious to me in the backtraces at that point: the vCPU
and main threads seem to be running as usual, while the IO thread is
stuck in aio_poll(), i.e. it never returns from the __ppoll() call. This
happens with both a VirtIO SCSI and a VirtIO block disk, and with both
aio=io_uring and aio=threads.
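
For reference, the thread states can be inspected by attaching gdb to the
QEMU process and dumping backtraces of all threads, e.g. along these lines
(the process name is just an example for my setup):

> gdb -p "$(pidof qemu-system-x86_64)" -batch -ex 'thread apply all bt'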

I should also mention I'm using

> fio --name=file --size=4k --direct=1 --rw=randwrite --bs=4k --ioengine=psync 
> --numjobs=5 --runtime=6000 --time_based

inside the guest while canceling the backup.

I'd be glad for any pointers on what to look for, and I'm happy to provide
more information.

Best Regards,
Fiona



