
Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [RFC PATCH 0/6] job: replace AioContext lock with job_mutex
Date: Thu, 15 Jul 2021 15:35:37 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

13.07.2021 19:38, Stefan Hajnoczi wrote:
On Tue, Jul 13, 2021 at 06:18:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
13.07.2021 16:10, Stefan Hajnoczi wrote:
On Mon, Jul 12, 2021 at 10:41:46AM +0200, Emanuele Giuseppe Esposito wrote:


On 08/07/2021 15:04, Stefan Hajnoczi wrote:
On Thu, Jul 08, 2021 at 01:32:12PM +0200, Paolo Bonzini wrote:
On 08/07/21 12:36, Stefan Hajnoczi wrote:
What is very clear from this patch is that it is strictly related to the
bdrv_* and lower-level calls, because they also internally check or even
use the AioContext lock. Therefore, in order to make it work, I
temporarily added some aio_context_acquire/release pairs around the
functions that still assert for the lock or assume it is held, and
temporarily unlock (unlock() - lock()).
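
(For illustration only - a minimal sketch of that stop-gap pattern; the
helper name below is made up and not taken from the series:)

#include "qemu/osdep.h"
#include "block/aio.h"
#include "block/block.h"

/*
 * Hypothetical helper showing the temporary pattern described above:
 * the job code itself no longer relies on the AioContext lock, but the
 * bdrv_*()/blk_*() layer still asserts that it is held, so the lock is
 * taken just around those calls.
 */
static int example_flush_node(BlockDriverState *bs)
{
    AioContext *ctx = bdrv_get_aio_context(bs);
    int ret;

    aio_context_acquire(ctx);   /* only because bdrv_flush() expects it */
    ret = bdrv_flush(bs);
    aio_context_release(ctx);

    return ret;
}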

Sounds like the issue is that this patch series assumes AioContext locks
are no longer required for calling the blk_*()/bdrv_*() APIs? That is
not the case yet, so you then had to add those aio_context_acquire() calls
back in elsewhere. This approach introduces unnecessary risk. I think we
should wait until blk_*()/bdrv_*() no longer requires the caller to hold
the AioContext lock before applying this series.

In general I'm in favor of pushing the lock further down into smaller and
smaller critical sections; it's a good approach to make further audits
easier until it's "obvious" that the lock is unnecessary.  I haven't yet
reviewed Emanuele's patches to see if this is what he's doing where he's
adding the acquire/release calls, but that's my understanding of both his
cover letter and your reply.
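
(As a rough sketch of that "push the lock down" idea - the struct and
helpers here are hypothetical, not code from the series:)

#include "qemu/osdep.h"
#include "block/aio.h"

/* Hypothetical job state, for illustration only. */
typedef struct ExampleJob {
    AioContext *ctx;
    int64_t bytes_done;   /* shared with other threads */
} ExampleJob;

static void example_step_before(ExampleJob *job, int64_t n)
{
    /* Before: the whole step runs inside the critical section. */
    aio_context_acquire(job->ctx);
    /* ... work that touches only job-local state ... */
    job->bytes_done += n;
    /* ... more work that touches only job-local state ... */
    aio_context_release(job->ctx);
}

static void example_step_after(ExampleJob *job, int64_t n)
{
    /*
     * After: the lock is pushed down around the shared access only,
     * which makes it easier to audit whether it is still needed.
     */
    /* ... work that touches only job-local state ... */
    aio_context_acquire(job->ctx);
    job->bytes_done += n;
    aio_context_release(job->ctx);
    /* ... more work that touches only job-local state ... */
}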

The problem is the unnecessary risk. We know what the goal is for
blk_*()/bdrv_*() but it's not quite there yet. Does making changes in
block jobs help solve the final issues with blk_*()/bdrv_*()?

Correct me if I am wrong, but it seems to me that the bdrv_*()/blk_*()
operations mostly take care of building, modifying and walking the BDS
graph. So since graph nodes can be in different AioContexts, it makes
sense that we have a lock when modifying the graph, right?

If so, we can simply try to replace the AioContext lock with a graph lock,
or something like that. But I am not sure about this.

Block graph manipulation (all_bdrv_states and friends) requires the BQL.
It has always been this way.

This raises the question: if block graph manipulation is already under
the BQL and BlockDriver callbacks don't need the AioContext anymore, why

I don't believe that block drivers are thread-safe now. They have some
mutexes... But who has made an audit of their thread-safety?

Emanuele :)

FWIW I took a look at the stream, mirror, and backup jobs today and
couldn't find anything that's unsafe after this series. I was expecting
to find issues but I think Emanuele and Paolo have taken care of them.


Hmm, do you mean that all jobs are thread-safe?

Looking at mirror, what protects s->ops_in_flight, for example? It's
accessed from the job coroutines and from the mirror_top filter's write path.
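
(Purely as an illustration of the kind of protection I mean - this is
not what block/mirror.c does today, and the names are made up:)

#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "qemu/thread.h"

/*
 * A per-job lock around an in-flight list that is touched both by the
 * job coroutine and by the filter's write path.
 */
typedef struct ExampleOp {
    QTAILQ_ENTRY(ExampleOp) next;
} ExampleOp;

typedef struct ExampleJob {
    QemuMutex ops_lock;   /* initialized with qemu_mutex_init() */
    QTAILQ_HEAD(, ExampleOp) ops_in_flight;
} ExampleJob;

static void example_op_start(ExampleJob *s, ExampleOp *op)
{
    qemu_mutex_lock(&s->ops_lock);
    QTAILQ_INSERT_TAIL(&s->ops_in_flight, op, next);
    qemu_mutex_unlock(&s->ops_lock);
}

static void example_op_complete(ExampleJob *s, ExampleOp *op)
{
    qemu_mutex_lock(&s->ops_lock);
    QTAILQ_REMOVE(&s->ops_in_flight, op, next);
    qemu_mutex_unlock(&s->ops_lock);
}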


are aio_context_acquire() calls still needed in block jobs?

AIO_WAIT_WHILE() requires that AioContext is acquired according to its
documentation, but I'm not sure that's true anymore. Thread-safe/atomic
primitives are used by AIO_WAIT_WHILE(), so as long as the condition
being waited for is thread-safe too it should work without the
AioContext lock.
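
(A minimal sketch of the pattern in question - the names are
hypothetical; the point is only that the condition must be safe to
evaluate from the waiting thread:)

#include "qemu/osdep.h"
#include "qemu/atomic.h"
#include "block/aio.h"
#include "block/aio-wait.h"

/*
 * Illustrative only: wait for a flag that the worker side sets with
 * qatomic_set().  AIO_WAIT_WHILE() re-evaluates the condition while
 * polling the event loop; the open question above is whether the
 * caller still needs to hold the AioContext lock at all when the
 * condition only reads thread-safe/atomic state.
 */
typedef struct ExampleWork {
    AioContext *ctx;
    bool done;              /* set with qatomic_set() by the worker */
} ExampleWork;

static void example_wait(ExampleWork *w)
{
    /* Documented usage: the caller holds the AioContext lock, which
     * AIO_WAIT_WHILE() drops and re-acquires around polling when needed. */
    aio_context_acquire(w->ctx);
    AIO_WAIT_WHILE(w->ctx, !qatomic_read(&w->done));
    aio_context_release(w->ctx);
}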

Back to my comment about unnecessary risk, pushing the lock down is a
strategy for exploring the problem, but I'm not sure those intermediate
commits need to be committed to qemu.git/master because of the time
required to review them and the risk of introducing (temporary) bugs.

I agree. Let me add my bit of criticism:

What I dislike about the whole thread-safety update you are doing:

1. There is no proof of concept - some good example of multiqueue, or
something else that uses multiple threads and shows good performance.
Something that works and shows where we are going and why it is good. That
may be draft patches with a lot of "FIXME" and "TODO", but working. For now
I feel that I've spent my time reviewing and proving to myself the
thread-safety of the two previous thread-safety series, but I have no hope
of seeing a benefit from it in the near future.

The multi-queue block layer should improve performance in cases where
the bottleneck is a single IOThread. It will allow users to assign more
than one IOThread.

But I think the bigger impact of this work will be addressing
long-standing problems with the block layer's programming model. We
continue to have IOThread bugs because there are many undocumented
assumptions. With the multi-queue block layer the code switches to a
more well-understood multi-threaded programming model and hopefully
fewer issues will arise because there is no problematic AioContext lock
to worry about.

Stefan



--
Best regards,
Vladimir


