From: Max Reitz
Subject: Re: [Qemu-devel] Intermittent hang of iotest 194 (bdrv_drain_all after non-shared storage migration)
Date: Fri, 10 Nov 2017 18:48:53 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

On 2017-11-10 03:36, Fam Zheng wrote:
> On Thu, 11/09 20:31, Max Reitz wrote:
>> On 2017-11-09 16:30, Fam Zheng wrote:
>>> On Thu, 11/09 16:14, Max Reitz wrote:
[...]
>>>> *sigh*
>>>>
>>>> OK, I'll look into it...
>>>
>>> OK, I'll let you.. Just one more thing: could it relate to the
>>> use-after-free bug reported on block_job_defer_to_main_loop()?
>>>
>>> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01144.html
>>
>> Thanks for the heads-up; I think it's a different issue, though.
>>
>> What appears to be happening is that the mirror job completes and then
>> drains its BDS. While that is happening, a bdrv_drain_all() comes in
>> from block_migration_cleanup().
>>
>> That now tries to drain the mirror node. However, that node cannot be
>> drained until the job is truly gone now, so that is what's happening:
>> mirror_exit() is called, it cleans up, destroys the mirror node, and
>> returns.
>>
>> Now bdrv_drain_all() can go on, specifically the BDRV_POLL_WHILE() on
>> the mirror node. However, oops, that node is gone now... So that's
>> where the issue seems to be. :-/
>>
>> Maybe all that we need to do is wrap the bdrv_drain_recurse() call in
>> bdrv_drain_all_begin() in a bdrv_ref()/bdrv_unref() pair? Having run
>> 194 for a couple of minutes, that seems to indeed work -- until it dies
>> because of an invalid BB pointer in bdrv_next(). I guess that is
>> because bdrv_next() does not guard against deleted BDSs.
>>
>> Copying all BDS into an own list (in both bdrv_drain_all_begin() and
>> bdrv_drain_all_end()), with a strong reference to every single one, and
>> then draining them really seems to work, though. (Survived 9000
>> iterations, that seems good enough for something that usually fails
>> after, like, 5.)
>
> Yes, that makes sense. I'm curious if the patch in
>
> https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01649.html
>
> would also work?

No, unfortunately it did not.

(Or maybe fortunately so, since that means I didn't do a whole lot of
work for nothing :-))

Max
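
For anyone following the thread, the approach quoted above (copy every BDS
into a private list while holding a strong reference to each one, and only
then drain) can be modelled in a small standalone C program. This is only a
toy sketch, not the actual QEMU patch: Node, node_ref(), node_unref(),
poll_once(), drain_one(), drain_all() and job_node are hypothetical stand-ins
for BlockDriverState, bdrv_ref(), bdrv_unref(), BDRV_POLL_WHILE(),
bdrv_drain_recurse(), bdrv_drain_all_begin() and the mirror job. None of
them are real QEMU functions, and the real code is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a BlockDriverState: refcounted, linked into a global list. */
typedef struct Node {
    int refcnt;
    int in_flight;        /* pending requests a drain has to wait for */
    struct Node *next;    /* link in the global list */
} Node;

static Node *all_nodes;   /* toy analogue of the list that bdrv_next() walks */
static Node *job_node;    /* node owned by the toy "mirror job", if any */
static int live_nodes;    /* number of allocated nodes, for checking */

static Node *node_new(int in_flight)
{
    Node *n = calloc(1, sizeof(*n));
    assert(n);
    n->refcnt = 1;        /* reference held by the creator */
    n->in_flight = in_flight;
    n->next = all_nodes;
    all_nodes = n;
    live_nodes++;
    return n;
}

static void node_ref(Node *n)
{
    n->refcnt++;
}

static void node_unref(Node *n)
{
    assert(n->refcnt > 0);
    if (--n->refcnt > 0) {
        return;
    }
    /* Last reference gone: unlink from the global list and free. */
    Node **p = &all_nodes;
    while (*p != n) {
        p = &(*p)->next;
    }
    *p = n->next;
    free(n);
    live_nodes--;
}

/* One poll iteration: a request completes; if that was the job's last
 * request, the job exits and drops its reference to the node. */
static void poll_once(Node *n)
{
    if (n->in_flight > 0) {
        n->in_flight--;
    }
    if (n == job_node && n->in_flight == 0) {
        job_node = NULL;
        node_unref(n);    /* without another reference, n is freed here */
    }
}

/* Reads n->in_flight after every poll, so the caller must keep n alive. */
static void drain_one(Node *n)
{
    while (n->in_flight > 0) {
        poll_once(n);
    }
}

/* Snapshot all nodes with a strong reference each, drain, then release.
 * Returns the number of nodes drained. */
static int drain_all(void)
{
    size_t count = 0, i = 0;
    for (Node *n = all_nodes; n; n = n->next) {
        count++;
    }
    Node **snap = calloc(count + 1, sizeof(*snap));
    assert(snap);
    for (Node *n = all_nodes; n; n = n->next) {
        node_ref(n);
        snap[i++] = n;
    }
    for (i = 0; i < count; i++) {
        drain_one(snap[i]);   /* safe: the snapshot holds a reference */
    }
    for (i = 0; i < count; i++) {
        node_unref(snap[i]);  /* nodes whose owner went away are freed now */
    }
    free(snap);
    return (int)count;
}
```

In this model, calling drain_one() directly on the job's node while walking
all_nodes would read n->in_flight after poll_once() has freed the node, which
is the toy equivalent of the BDRV_POLL_WHILE() problem described above.
Because drain_all() takes its own reference before polling, the node only
goes away in the final node_unref() loop, and the walk never touches the
global list while nodes can disappear from it.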