[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: xen-block: race condition when stopping the device (WAS: Re: [Xen-de
From: |
Durrant, Paul |
Subject: |
RE: xen-block: race condition when stopping the device (WAS: Re: [Xen-devel] [xen-4.13-testing test] 144736: regressions - FAIL) |
Date: |
Mon, 16 Dec 2019 09:34:27 +0000 |
> -----Original Message-----
[snip]
> >>
> >> This feels like a race condition between the init/free code with
> >> handler. Anthony, does it ring any bell?
> >>
> >
> > From that stack bt it looks like an iothread managed to run after the
> sring was NULLed. This should not be able happen as the dataplane should
> have been moved back onto QEMU's main thread context before the ring is
> unmapped.
>
> My knowledge of this code is fairly limited, so correct me if I am wrong.
>
> blk_set_aio_context() would set the context for the block aio. AFAICT,
> the only aio for the block is xen_block_complete_aio().
Not quite. xen_block_dataplane_start() calls xen_device_bind_event_channel()
and that will add an event channel fd into the aio context, so the shared ring
is polled by the iothread as well as block i/o completion.
>
> In the stack above, we are not dealing with a block aio but an aio tie
> to the event channel (see the call from xen_device_poll). So I don't
> think the blk_set_aio_context() would affect the aio.
>
For the reason I outline above, it does.
> So it would be possible to get the iothread running because we received
> a notification on the event channel while we are stopping the block (i.e
> xen_block_dataplane_stop()).
>
We should assume an iothread can essentially run at any time, as it is a
polling entity. It should eventually block polling on fds assign to its aio
context but I don't think the abstraction guarantees that it cannot be awoken
for other reasons (e.g. off a timeout). However and event from the frontend
will certainly cause the evtchn fd poll to wake up.
> If xen_block_dataplane_stop() grab the context lock first, then the
> iothread dealing with the event may wait on the lock until its released.
>
> By the time the lock is grabbed, we may have free all the resources
> (including srings). So the event iothread will end up to dereference a
> NULL pointer.
>
I think the problem may actually be that xen_block_dataplane_event() does not
acquire the context and thus is not synchronized with
xen_block_dataplane_stop(). The documentation in multiple-iothreads.txt is not
clear whether a poll handler called by an iothread needs to acquire the context
though; TBH I would not have thought it necessary.
> It feels to me we need a way to quiesce all the iothreads (blk,
> event,...) before continuing. But I am a bit unsure how to do this in
> QEMU.
>
Looking at virtio-blk.c I see that it does seem to close off its evtchn
equivalent from iothread context via aio_wait_bh_oneshot(). So I wonder whether
the 'right' thing to do is to call xen_device_unbind_event_channel() using the
same mechanism to ensure xen_block_dataplane_event() can't race.
Paul
> Cheers,
>
> --
> Julien Grall