Re: test-blockjob: intermittent CI failures in msys2-64bit job
From: Emanuele Giuseppe Esposito
Subject: Re: test-blockjob: intermittent CI failures in msys2-64bit job
Date: Wed, 26 Apr 2023 10:17:28 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0
Am 26/04/2023 um 10:03 schrieb Hanna Czenczek:
> On 26.04.23 09:38, Emanuele Giuseppe Esposito wrote:
>>
>> Am 25/04/2023 um 18:48 schrieb Hanna Czenczek:
>>> On 24.04.23 20:32, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 24.04.23 16:36, Emanuele Giuseppe Esposito wrote:
>>>>>
>>>>> Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy:
>>>>>> On 17.03.23 15:35, Thomas Huth wrote:
>>>>>>> On 17/03/2023 11.17, Peter Maydell wrote:
>>>>>>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell
>>>>>>>> <peter.maydell@linaro.org>
>>>>>>>> wrote:
>>>>>>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell
>>>>>>>>> <peter.maydell@linaro.org> wrote:
>>>>>>>>>> I've noticed that test-blockjob seems to fail intermittently
>>>>>>>>>> on the msys2-64bit job:
>>>>>>>>>>
>>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803
>>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024
>>>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440
>>>>>>>>>>
>>>>>>>>>> Sample output:
>>>>>>>>>> 53/89 ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby:
>>>>>>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR
>>>>>>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3
>>>>>>>>> Here's an intermittent failure from my macos x86 machine:
>>>>>>>>>
>>>>>>>>> 172/621 qemu:unit / test-blockjob
>>>>>>>>> ERROR 0.26s killed by signal 6 SIGABRT
>>>>>>>> And an intermittent failure on the freebsd 13 CI job:
>>>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240
>>>>>>>>
>>>>>>>>>>> MALLOC_PERTURB_=197
>>>>>>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit
>>>>>>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit
>>>>>>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k
>>>>>>>> ▶ 178/650 /blockjob/ids
>>>>>>>> OK
>>>>>>>> 178/650 qemu:unit / test-blockjob
>>>>>>>> ERROR 0.31s killed by signal 6 SIGABRT
>>>>>>>> ――――――――――――――――――――――――――――――――――――― ✀
>>>>>>>> ―――――――――――――――――――――――――――――――――――――
>>>>>>>> stderr:
>>>>>>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function
>>>>>>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line
>>>>>>>> 499.
>>>>>>>>
>>>>>>>>
>>>>>>>> TAP parsing error: Too few tests run (expected 9, got 1)
>>>>>>>> (test program exited with status code -6)
>>>>>>>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>>>>>>>
>>>>>>>> Anybody in the block team looking at these, or shall we just
>>>>>>>> disable this test entirely?
>>>>>>> I ran into this issue today, too:
>>>>>>>
>>>>>>> https://gitlab.com/thuth/qemu/-/jobs/3954367101
>>>>>>>
>>>>>>> ... if nobody is interested in fixing this test, I think we should
>>>>>>> disable it...
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>> I'm looking at this now, and it seems it has been broken since
>>>>>> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove
>>>>>> Aiocontext locks": that commit ends the critical section with a new
>>>>>> aio_context_release() call right after bdrv_drain_all_end(), so it's
>>>>>> no surprise that the job may start at that moment and the following
>>>>>> assertion fires.
>>>>>>
>>>>>> Emanuele, could you please look at it?
>>>>>>
>>>>> Well, if I understood correctly, the only thing that was preventing
>>>>> the job from continuing was the AioContext lock being held.
>>>>>
>>>>> The failing assertion even mentions that:
>>>>> /* Lock the IO thread to prevent the job from being run */
>>>>> [...]
>>>>> /* But the job cannot run, so it will remain on standby */
>>>>> assert(job->status == JOB_STATUS_STANDBY);
>>>>>
>>>>> Essentially bdrv_drain_all_end() would wake up the job coroutine, but
>>>>> the coroutine would block again soon after, since the AioContext lock
>>>>> was held by the test.
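>>>>>
>>>>> For reference, the relevant part of test_complete_in_standby()
>>>>> currently looks roughly like this (pieced together from the hunks
>>>>> below):
>>>>>
>>>>>     bdrv_drain_all_begin();
>>>>>     assert_job_status_is(job, JOB_STATUS_STANDBY);
>>>>>
>>>>>     /* Lock the IO thread to prevent the job from being run */
>>>>>     aio_context_acquire(ctx);
>>>>>     /* This will schedule the job to resume it */
>>>>>     bdrv_drain_all_end();
>>>>>     aio_context_release(ctx);
>>>>>
>>>>>     WITH_JOB_LOCK_GUARD() {
>>>>>         /* But the job cannot run, so it will remain on standby */
>>>>>         assert(job->status == JOB_STATUS_STANDBY);
>>>>>     }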
>>>>>
>>>>> Now that we got rid of the AioContext lock in the job code, nothing
>>>>> prevents the job from resuming.
>>>>> Mixing the job lock and AioContext acquire/release is not a good idea;
>>>>> taking the job lock would anyway block job_resume(), which is called
>>>>> by bdrv_drain_all_end(), so the test itself would deadlock.
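>>>>>
>>>>> Roughly, the naive alternative that cannot work would be:
>>>>>
>>>>>     WITH_JOB_LOCK_GUARD() {
>>>>>         /*
>>>>>          * The job lock is held here, but bdrv_drain_all_end()
>>>>>          * eventually calls job_resume(), which takes the job lock
>>>>>          * itself: the test would deadlock.
>>>>>          */
>>>>>         bdrv_drain_all_end();
>>>>>
>>>>>         assert(job->status == JOB_STATUS_STANDBY);
>>>>>     }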
>>>>>
>>>>> So unless @Kevin has a better idea, this seems to be just a "hack" to
>>>>> test something that is no longer possible. What I would suggest is to
>>>>> get rid of that test, or at least of that assert function. I
>>>>> unfortunately cannot reproduce the failure, but I think the remaining
>>>>> functions called by the test should run as expected.
>>>>>
>>>> Thanks! I agree. Alternatively, we could probably just expand the
>>>> drained section, like:
>>>>
>>>> @@ -488,12 +488,6 @@ static void test_complete_in_standby(void)
>>>> bdrv_drain_all_begin();
>>>> assert_job_status_is(job, JOB_STATUS_STANDBY);
>>>>
>>>> - /* Lock the IO thread to prevent the job from being run */
>>>> - aio_context_acquire(ctx);
>>>> - /* This will schedule the job to resume it */
>>>> - bdrv_drain_all_end();
>>>> - aio_context_release(ctx);
>>>> -
>>>> WITH_JOB_LOCK_GUARD() {
>>>> /* But the job cannot run, so it will remain on standby */
>>>> assert(job->status == JOB_STATUS_STANDBY);
>>>> @@ -511,6 +505,7 @@ static void test_complete_in_standby(void)
>>>> job_dismiss_locked(&job, &error_abort);
>>>> }
>>>>
>>>> + bdrv_drain_all_end();
>>>> aio_context_acquire(ctx);
>>>> destroy_blk(blk);
>>>> aio_context_release(ctx);
>>>>
>>>>
>>>> But it seems that the test specifically wanted to check a job that is
>>>> still in STANDBY exactly after the drained section...
>>>>
>>>> [cc: Hanna]
>>>>
>>>> Hanna, it was your test (added in
>>>> c2c731a4d35062295cd3260e66b3754588a2fad4, two years ago). Do you
>>>> remember whether it was important to catch the STANDBY job *after* a
>>>> drained section?
>>> I’m not quite sure, but I think the idea was that we basically try to
>>> get as close as possible to something that might come in over QMP. Over
>>> QMP, you can’t issue a job-complete while keeping everything drained,
>>> so I wouldn’t just extend the drained section.
>>>
>>> Getting rid of the assert function also seems pointless. If we want to
>>> test whether job-complete works on jobs in standby, we must put the
>>> job in standby, and verify this. We can get rid of the test, of
>>> course, but it is a regression test, so it isn’t like it was added just
>>> for fun. Then again, it’s now already effectively commented out via an
>>> environment variable, so it doesn’t seem like a loss in practice to
>>> fully drop it.
>>>
>>> Anyway – the thing I wonder about is, if this is to test whether jobs in
>>> standby can be completed… Why don’t we just pause the job instead of
>>> going through the context lock hassle? I.e. just put a job_pause()
>>> right after bdrv_drain_all_begin().
>>>
>>> If I’m not mistaken, reproducing the bug in the test is really simple:
>>> add a sleep(1) right before WITH_JOB_LOCK_GUARD(). And doing that
>>> works just fine as long as you have a job_pause() in the drained
>>> section. (And drop the aio_context_acquire()/release() calls, because
>>> they don’t do anything anymore.)
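>>>
>>> A sketch of what I mean (the sleep(1) is only there to force the race,
>>> it would not be part of the actual fix):
>>>
>>>     bdrv_drain_all_begin();
>>>     assert_job_status_is(job, JOB_STATUS_STANDBY);
>>>
>>>     job_pause(job);      /* keep the job paused beyond the drain */
>>>     bdrv_drain_all_end();
>>>
>>>     sleep(1);            /* give the job every chance to resume */
>>>
>>>     WITH_JOB_LOCK_GUARD() {
>>>         /* Still on standby, thanks to the extra pause */
>>>         assert(job->status == JOB_STATUS_STANDBY);
>>>     }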
>>>
>> Why sleep(1)? I have the patch ready:
>>
>> diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
>> index a130f6fefb..8054b72afa 100644
>> --- a/tests/unit/test-blockjob.c
>> +++ b/tests/unit/test-blockjob.c
>> @@ -488,11 +488,15 @@ static void test_complete_in_standby(void)
>> bdrv_drain_all_begin();
>> assert_job_status_is(job, JOB_STATUS_STANDBY);
>>
>> + /*
>> + * Increase pause_count so that the counter is
>> + * unbalanced and job won't resume
>> + */
>> + job_pause(job);
>> +
>> /* Lock the IO thread to prevent the job from being run */
>> - aio_context_acquire(ctx);
>> /* This will schedule the job to resume it */
>> bdrv_drain_all_end();
>> - aio_context_release(ctx);
>>
>> WITH_JOB_LOCK_GUARD() {
>> /* But the job cannot run, so it will remain on standby */
>>
>> But I don't get why we should sleep additionally.
>
> Yep, that’s the patch. The sleep was just a note on how to get it to
> reproduce 100 %.
>
> Hanna
>
Patch sent, thanks!
https://patchew.org/QEMU/20230426081601.2567990-1-eesposit@redhat.com/
Thank you,
Emanuele