Re: test-blockjob: intermittent CI failures in msys2-64bit job

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: test-blockjob: intermittent CI failures in msys2-64bit job

From:	Emanuele Giuseppe Esposito
Subject:	Re: test-blockjob: intermittent CI failures in msys2-64bit job
Date:	Mon, 24 Apr 2023 15:36:37 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0


Am 21/04/2023 um 12:13 schrieb Vladimir Sementsov-Ogievskiy:
> On 17.03.23 15:35, Thomas Huth wrote:
>> On 17/03/2023 11.17, Peter Maydell wrote:
>>> On Mon, 6 Mar 2023 at 11:16, Peter Maydell <peter.maydell@linaro.org>
>>> wrote:
>>>>
>>>> On Fri, 3 Mar 2023 at 18:36, Peter Maydell
>>>> <peter.maydell@linaro.org> wrote:
>>>>>
>>>>> I've noticed that test-blockjob seems to fail intermittently
>>>>> on the msys2-64bit job:
>>>>>
>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3872508803
>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3871061024
>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3865312440
>>>>>
>>>>> Sample output:
>>>>> | 53/89
>>>>> ERROR:../tests/unit/test-blockjob.c:499:test_complete_in_standby:
>>>>> assertion failed: (job->status == JOB_STATUS_STANDBY) ERROR
>>>>> 53/89 qemu:unit / test-blockjob ERROR 0.08s exit status 3
>>>
>>>> Here's an intermittent failure from my macos x86 machine:
>>>>
>>>> 172/621 qemu:unit / test-blockjob
>>>>             ERROR           0.26s   killed by signal 6 SIGABRT
>>>
>>> And an intermittent on the freebsd 13 CI job:
>>> https://gitlab.com/qemu-project/qemu/-/jobs/3950667240
>>>
>>>>>> MALLOC_PERTURB_=197
>>>>>> G_TEST_BUILDDIR=/tmp/cirrus-ci-build/build/tests/unit
>>>>>> G_TEST_SRCDIR=/tmp/cirrus-ci-build/tests/unit
>>>>>> /tmp/cirrus-ci-build/build/tests/unit/test-blockjob --tap -k
>>> ▶ 178/650 /blockjob/ids
>>>             OK
>>> 178/650 qemu:unit / test-blockjob
>>>             ERROR           0.31s   killed by signal 6 SIGABRT
>>> ――――――――――――――――――――――――――――――――――――― ✀ 
>>> ―――――――――――――――――――――――――――――――――――――
>>> stderr:
>>> Assertion failed: (job->status == JOB_STATUS_STANDBY), function
>>> test_complete_in_standby, file ../tests/unit/test-blockjob.c, line
>>> 499.
>>>
>>>
>>> TAP parsing error: Too few tests run (expected 9, got 1)
>>> (test program exited with status code -6)
>>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>>
>>> Anybody in the block team looking at these, or shall we just
>>> disable this test entirely ?
>>
>> I ran into this issue today, too:
>>
>>   https://gitlab.com/thuth/qemu/-/jobs/3954367101
>>
>> ... if nobody is interested in fixing this test, I think we should
>> disable it...
>>
>>   Thomas
>>
> 
> I'm looking at this now, and seems that it's broken since
> 6f592e5aca1a27fe1c1f6 "job.c: enable job lock/unlock and remove
> Aiocontext locks", as it stops critical section by new
> aio_context_release() call exactly after bdrv_drain_all_and(), so it's
> not a surprise that job may start at that moment and the following
> assertion fires.
> 
> Emanuele could please look at it?
> 
Well if I understood correctly, the only thing that was preventing the
job from continuing was the aiocontext lock held.

The failing assertion even mentions that:
/* Lock the IO thread to prevent the job from being run */
[...]
/* But the job cannot run, so it will remain on standby */
assert(job->status == JOB_STATUS_STANDBY);

Essentially bdrv_drain_all_end() would wake up the job coroutine, but it
would anyways block somewhere after since the aiocontext lock was taken
by the test.

Now that we got rid of aiocontext lock in job code, nothing prevents the
test from resuming.
Mixing job lock and aiocontext acquire/release is not a good idea, and
it would anyways block job_resume() called by bdrv_drain_all_end(), so
the test itself would deadlock.

So unless @Kevin has a better idea, this seems to be just an "hack" to
test stuff that is not possible to do now anymore. What I would suggest
is to get rid of that test, or at least of that assert function. I
unfortunately cannot reproduce the failure, but I think the remaining
functions called by the test should run as expected.

Thank you,
Emanuele

[Prev in Thread]

Current Thread

[Next in Thread]

Re: test-blockjob: intermittent CI failures in msys2-64bit job, Vladimir Sementsov-Ogievskiy, 2023/04/21
- Re: test-blockjob: intermittent CI failures in msys2-64bit job, Emanuele Giuseppe Esposito <=
  - Re: test-blockjob: intermittent CI failures in msys2-64bit job, Vladimir Sementsov-Ogievskiy, 2023/04/24
    - Re: test-blockjob: intermittent CI failures in msys2-64bit job, Hanna Czenczek, 2023/04/25
    - Re: test-blockjob: intermittent CI failures in msys2-64bit job, Vladimir Sementsov-Ogievskiy, 2023/04/25
    - Re: test-blockjob: intermittent CI failures in msys2-64bit job, Emanuele Giuseppe Esposito, 2023/04/26
    - Re: test-blockjob: intermittent CI failures in msys2-64bit job, Hanna Czenczek, 2023/04/26
    - Re: test-blockjob: intermittent CI failures in msys2-64bit job, Emanuele Giuseppe Esposito, 2023/04/26

Prev by Date: [PULL 28/30] migration: Create migrate_cpu_throttle_tailslow() function
Next by Date: Re: [PATCH v2 RESEND] xen: Fix SEGV on domain disconnect
Previous by thread: Re: test-blockjob: intermittent CI failures in msys2-64bit job
Next by thread: Re: test-blockjob: intermittent CI failures in msys2-64bit job
Index(es):
- Date
- Thread