qemu-devel

Re: test-blockjob: intermittent CI failures in msys2-64bit job


From: Emanuele Giuseppe Esposito
Subject: Re: test-blockjob: intermittent CI failures in msys2-64bit job
Date: Wed, 26 Apr 2023 10:17:28 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0


On 26/04/2023 10:03, Hanna Czenczek wrote:
> On 26.04.23 09:38, Emanuele Giuseppe Esposito wrote:
>>
>>> On 25/04/2023 18:48, Hanna Czenczek wrote:
>>> On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>>> On 21/04/2023 12:13, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> On 17.03.23 15:35, Thomas Huth wrote:
>>>>>>> On 17/03/2023 11.17, Peter Maydell wrote:
>>>>>>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell
>>>>>>>> <peter.maydell@linaro.org>
>>>>>>>> wrote:
>>>>>>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell
>>>>>>>>> <peter.maydell@linaro.org> wrote:
>>>>>>>>>> I've noticed that test-blockjob seems to fail intermittently
>>>>>>>>>> on the msys2-64bit job:
>>>>>>>>>>
>>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803
>>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024
>>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440
>>>>>>>>>>
>>>>>>>>>> Sample output:
>>>>>>>>>> | 53/89
>>>>>>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby:
>>>>>>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR
>>>>>>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3
>>>>>>>>> Here's an intermittent failure from my macos x86 machine:
>>>>>>>>>
>>>>>>>>> 172/621 qemu:unit / test-blockjob
>>>>>>>>>               ERROR           0.26s   killed by signal 6 SIGABRT
>>>>>>>> And an intermittent on the freebsd 13 CI job:
>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240
>>>>>>>>
>>>>>>>>>>> MALLOC_PERTURB_=197
>>>>>>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit
>>>>>>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit
>>>>>>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k
>>>>>>>> ▶ 178/650 /blockjob/ids
>>>>>>>>               OK
>>>>>>>> 178/650 qemu:unit / test-blockjob
>>>>>>>>               ERROR           0.31s   killed by signal 6 SIGABRT
>>>>>>>> ――――――――――――――――――――――――――――――――――――― ✀
>>>>>>>> ―――――――――――――――――――――――――――――――――――――
>>>>>>>> stderr:
>>>>>>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function
>>>>>>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line
>>>>>>>> 499.
>>>>>>>>
>>>>>>>>
>>>>>>>> TAP parsing error: Too few tests run (expected 9, got 1)
>>>>>>>> (test program exited with status code -6)
>>>>>>>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>>>>>>>
>>>>>>>> Anybody in the block team looking at these, or shall we just
>>>>>>>> disable this test entirely?
>>>>>>> I ran into this issue today, too:
>>>>>>>
>>>>>>>     https://gitlab.com/thuth/qemu/-/jobs/3954367101
>>>>>>>
>>>>>>> ... if nobody is interested in fixing this test, I think we should
>>>>>>> disable it...
>>>>>>>
>>>>>>>     Thomas
>>>>>>>
>>>>>> I'm looking at this now, and it seems that it has been broken since
>>>>>> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove
>>>>>> Aiocontext locks": that commit ends the critical section with a new
>>>>>> aio_context_release() call right after bdrv_drain_all_end(), so it
>>>>>> is no surprise that the job may resume at that moment and the
>>>>>> following assertion fires.
>>>>>>
>>>>>> Emanuele, could you please look at it?
>>>>>>
>>>>> Well, if I understood correctly, the only thing that was preventing
>>>>> the job from continuing was the AioContext lock being held.
>>>>>
>>>>> The failing assertion even mentions that:
>>>>> /* Lock the IO thread to prevent the job from being run */
>>>>> [...]
>>>>> /* But the job cannot run, so it will remain on standby */
>>>>> assert(job->status == JOB_STATUS_STANDBY);
>>>>>
>>>>> Essentially, bdrv_drain_all_end() would wake up the job coroutine,
>>>>> but the job would then block again soon afterwards, since the
>>>>> AioContext lock was held by the test.
>>>>>
>>>>> Now that we got rid of the AioContext lock in the job code, nothing
>>>>> prevents the job from resuming.
>>>>> Mixing the job lock with AioContext acquire/release is not a good
>>>>> idea, and it would also block the job_resume() called by
>>>>> bdrv_drain_all_end(), so the test itself would deadlock.
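>>>>>
>>>>> To illustrate (a minimal sketch paraphrasing the test code, not the
>>>>> actual source), the racy window now looks like this:
>>>>>
>>>>>     bdrv_drain_all_end();  /* schedules the job coroutine to resume */
>>>>>     /* <-- race window: nothing keeps the job in STANDBY here */
>>>>>     WITH_JOB_LOCK_GUARD() {
>>>>>         /* fires intermittently: the job may already have resumed */
>>>>>         assert(job->status == JOB_STATUS_STANDBY);
>>>>>     }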
>>>>>
>>>>> So unless @Kevin has a better idea, this seems to be just a "hack"
>>>>> to test something that is not possible to do anymore. What I would
>>>>> suggest is to get rid of that test, or at least of that assertion. I
>>>>> unfortunately cannot reproduce the failure, but I think the remaining
>>>>> functions called by the test should run as expected.
>>>>>
>>>> Thanks! I agree. Alternatively, we could perhaps just expand the
>>>> drained section, like:
>>>>
>>>> @@ -488,12 +488,6 @@ static void test_complete_in_standby(void)
>>>>       bdrv_drain_all_begin();
>>>>       assert_job_status_is(job, JOB_STATUS_STANDBY);
>>>>
>>>> -    /* Lock the IO thread to prevent the job from being run */
>>>> -    aio_context_acquire(ctx);
>>>> -    /* This will schedule the job to resume it */
>>>> -    bdrv_drain_all_end();
>>>> -    aio_context_release(ctx);
>>>> -
>>>>       WITH_JOB_LOCK_GUARD() {
>>>>           /* But the job cannot run, so it will remain on standby */
>>>>           assert(job->status == JOB_STATUS_STANDBY);
>>>> @@ -511,6 +505,7 @@ static void test_complete_in_standby(void)
>>>>           job_dismiss_locked(&job, &error_abort);
>>>>       }
>>>>
>>>> +    bdrv_drain_all_end();
>>>>       aio_context_acquire(ctx);
>>>>       destroy_blk(blk);
>>>>       aio_context_release(ctx);
>>>>
>>>>
>>>> But it seems that the test specifically wanted to check a job that is
>>>> still in STANDBY exactly after the drained section...
>>>>
>>>> [cc: Hanna]
>>>>
>>>> Hanna, it was your test (added in
>>>> c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Do you
>>>> remember whether it was important to catch the STANDBY job *after* a
>>>> drained section?
>>> I’m not quite sure, but I think the idea was that we basically try to
>>> get as close as possible to something that might come in over QMP.
>>> Over QMP, you can’t issue a job-complete while keeping everything
>>> drained, so I wouldn’t just extend the drained section.
>>>
>>> Getting rid of the assertion also seems pointless.  If we want to
>>> test whether job-complete works on jobs in standby, we must put the
>>> job in standby, and verify this.  We can get rid of the test, of
>>> course, but it is a regression test, so it isn’t like it was added just
>>> for fun.  Then again, it’s now already effectively commented out via an
>>> environment variable, so it doesn’t seem like a loss in practice to
>>> fully drop it.
>>>
>>> Anyway – the thing I wonder about is, if this is to test whether jobs in
>>> standby can be completed…  Why don’t we just pause the job instead of
>>> going through the context lock hassle?  I.e. just put a job_pause()
>>> right after bdrv_drain_all_begin().
>>>
>>> If I’m not mistaken, reproducing the bug in the test is really simple:
>>> add a sleep(1) right before WITH_JOB_LOCK_GUARD(); and doing that
>>> works just fine as long as you have a job_pause() in the drained
>>> section.  (And drop the aio_context_acquire()/release() calls, because
>>> they don’t do anything anymore.)
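>>>
>>> Something like this, I mean (just a sketch of where the calls would go,
>>> untested):
>>>
>>>     bdrv_drain_all_begin();
>>>     assert_job_status_is(job, JOB_STATUS_STANDBY);
>>>
>>>     /* Keep the pause counter unbalanced so the job stays paused */
>>>     job_pause(job);
>>>
>>>     /* This will schedule the job to resume it */
>>>     bdrv_drain_all_end();
>>>
>>>     /* Widens the race window; without the job_pause() above, this
>>>      * makes the assertion below fail reliably */
>>>     sleep(1);
>>>
>>>     WITH_JOB_LOCK_GUARD() {
>>>         /* But the job cannot run, so it will remain on standby */
>>>         assert(job->status == JOB_STATUS_STANDBY);
>>>     }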
>>>
>> Why sleep(1)? I have the patch ready:
>>
>> diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
>> index a130f6fefb..8054b72afa 100644
>> --- a/tests/unit/test-blockjob.c
>> +++ b/tests/unit/test-blockjob.c
>> @@ -488,11 +488,15 @@ static void test_complete_in_standby(void)
>>       bdrv_drain_all_begin();
>>       assert_job_status_is(job, JOB_STATUS_STANDBY);
>>
>> +    /*
>> +     * Increase pause_count so that the counter is
>> +     * unbalanced and the job won't resume
>> +     */
>> +    job_pause(job);
>> +
>>       /* Lock the IO thread to prevent the job from being run */
>> -    aio_context_acquire(ctx);
>>       /* This will schedule the job to resume it */
>>       bdrv_drain_all_end();
>> -    aio_context_release(ctx);
>>
>>       WITH_JOB_LOCK_GUARD() {
>>           /* But the job cannot run, so it will remain on standby */
>>
>> But I don't get why we should sleep additionally.
> 
> Yep, that’s the patch.  The sleep was just a note on how to get it to
> reproduce 100 % of the time.
> 
> Hanna
> 

Patch sent, thanks!
https://patchew.org/QEMU/20230426081601.2567990-1-eesposit@redhat.com/

Thank you,
Emanuele



