
Re: [Qemu-devel] State of QEMU CI as we enter 4.0


From: Alex Bennée
Subject: Re: [Qemu-devel] State of QEMU CI as we enter 4.0
Date: Thu, 21 Mar 2019 09:47:44 +0000
User-agent: mu4e 1.1.0; emacs 26.1

Wainer dos Santos Moschetta <address@hidden> writes:

> Hi all,
<snip>
>> Conclusion
>> ==========
>>
>> I think generally the state of QEMU's CI has improved over the last few
>> years but we still have a number of challenges caused by its distributed
>> nature and test stability. We are still reactive to failures rather
>> than having something fast and reliable enough to gate changes going
>> into the code base. This results in fairly long periods when one
>> or more parts of the testing mosaic are stuck on red waiting for fixes
>> to finally get merged back into master.
>>
>> So what do people think? Have I missed anything out? What else can we do
>> to improve the situation?
>>
>> Let the discussion begin ;-)
>
> I want to help improve QEMU CI, and in fact I can commit some time
> to do so. But since I'm new to the community and have made just a few
> contributions, I'm only in a position to try to understand what we
> have in place now.
>
> So allow me to put this in a different perspective. I took some notes
> on the CI workflows we have. They go below along with some comments
> and questions:
>
> ----
> Besides being distributed across CI providers, there are different CI
> workflows being executed at each stage of the development process.
>
> - Developer tests before sending the patch to the mailing-list
>   Each developer has their own recipe.
>   Can be as simple as `make check[-TEST-SUITE]` locally, or the
> Docker-based `make docker-*` tests.

The make docker-* tests mostly cover building on other distros where
there might be subtle differences. The tests themselves are the same
make check-FOO as before.
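
For reference, the shape of an invocation is something like this (exact
image and test names vary by tree - `make docker` lists what is
available; the fedora image and J=8 below are only illustrative):

  # show the available images and test targets
  make docker
  # run the quick check suite inside a Fedora container, 8-way parallel
  make docker-test-quick@fedora J=8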

>   It seems not widely used, but some may also push to GitHub/GitLab
> with triggers to a cloud CI provider.
>
>   What kind of improvements can we make here?
>   Perhaps (somehow) automate the GitHub/GitLab push + cloud provider
> trigger workflow?

We have a mechanism that can already do that with patchew. But I'm not
sure how much automation can be done for developers given they need to
have accounts on the relevant services. Once that is done however it
really is just a few git pushes.
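
Roughly speaking (the remote and branch names below are just
placeholders for your own forks):

  # one-off setup: add remotes pointing at your forks
  git remote add github git@github.com:<you>/qemu.git
  git remote add gitlab git@gitlab.com:<you>/qemu.git
  # after that, triggering the hosted CI is just a push of the branch
  git push github my-feature-branch
  git push gitlab my-feature-branch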

>   Being able to locally reproduce a failure that happened on a cloud
> provider, particularly failures that occurred at later stages of
> development (see below), seems highly appreciated.

In theory yes, in practice it seems our CI providers are quite good at
producing failures under load. I've run tests that fail on Travis tens
of thousands of times locally without incident. The reproductions I've done
recently have all been on VMs where I've constrained memory and vCPUs
and then very heavily loaded them. It seems like most developers are
blessed with beefy boxes that rarely show up these problems.

What would be more useful is being able to debug the failure that
occurred on the CI system. Either by:

  a) having some sort of access to the failed system

  The original Travis setup didn't really support that but I think there
  may be options now. I haven't really looked into the other CI setups
  yet. They may be better off. Certainly if we can augment CI with our
  own runners they are easier to give developers access to.

  b) upload the failure artefacts *somewhere*

  Quite a lot of these failures should be dumping core. Maybe if we can
  upload the core, associated binary, config.log and commit id to
  something we can then do a bit more post-mortem on what went wrong.

  c) dump more information in the CI logs

  An alternative to uploading would be some sort of clean-up script
  which could at least dump backtraces of cores in the logs.
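
  For (c), a minimal sketch of what such a clean-up step could look
  like, assuming cores are left in the build tree and we know which
  binary produced them (the binary path below is only an example):

    # path to the binary that produced the cores - an assumption,
    # adjust to match your build
    QEMU=./x86_64-softmmu/qemu-system-x86_64
    # dump a backtrace for every core file the test run left behind
    for core in $(find . -name 'core*' -type f); do
        echo "=== backtrace for $core ==="
        gdb --batch -ex 'thread apply all bt' "$QEMU" "$core"
    done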

>
> - Developer sends a patch to the mailing-list
>   Patchew pushes the patch to GitHub, runs tests (checkpatch, asan,
> address@hidden, address@hidden)
>   It reports to the ML on failure. Shouldn't it send an email on
> success as well, so that it creates awareness about CI?

Patchew has been a little inconsistent of late with its notifications.
Maybe a simple email with a "Just so you know patchew has run all its
tests on this and it's fine" wouldn't be considered too noisy?

> - Maintainer tests their branch before the pull-request
>   Like developers, it seems each one sits on their own recipe, which
> may (or may not) trigger on a CI provider.

Usually the same set of normal checks plus any particular hand-crafted
tests that might be appropriate for the patches included. For example
for all of Emilio's scaling patches I ran a lot of stress tests by hand.
They are only semi-automated because it's not something I'd do for most
branches.

> - Maintainer sends a pull-request to the mailing-list
>   Again Patchew kicks in. It seems it runs the same tests. Am I right?
>   It also sends an email to the mailing-list only on failure.

Yes - although generally a PR is a collection of patches so it's
technically a new tree state to test.

> - Peter runs tests for each PR
>   IIUC not integrated with any CI provider yet.
>   Likely here we have the most complete scenario in terms of coverage
> (several hosts, targets, build configs, etc).
>   Maybe this is the area that needs the most care.

Peter does catch stuff the CI tests don't so I don't think we are ready
to replace him with a robot just yet ;-) However he currently has access
to a wider range of other architectures than just about anybody else.

> - Post-merged on GitHub branches (master and stable-*)
>   Ubuntu x86_64 and macOS at Travis
>     Reports success/failure to qemu's IRC channel
>   Cross compilers (Debian docker image) at Shippable
>   FreeBSD at Cirrus
>   Debian x86_64 at GitLab
>
>   Is LAVA setup still in use?

Yes, although it needs some work. It basically runs the RISU tests for
AArch64, although it does have the ability to run on a bunch of
interesting platforms - probably more relevant to testing KVM-type stuff
as it can replace rootfs and kernels.

>   Shouldn't we (and are we able to?) use a single CI provider for the
> sake of maintainability? If so it seems GitLab CI is a strong
> candidate.

I would certainly like a more unified view of the state of a given
branch but the distributed nature does have some benefits in terms of
scaling and redundancy. GitLab has some promise but given how much of a
pain building an arm64 runner has been, it's not quite there yet.
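
Once you do have a runner binary, hooking it up to a project is
comparatively simple - something along these lines (the flags come from
gitlab-runner's register command; the token, image and tag values are
placeholders):

  sudo gitlab-runner register \
      --non-interactive \
      --url https://gitlab.com/ \
      --registration-token <TOKEN> \
      --executor docker \
      --docker-image debian:stable \
      --tag-list aarch64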

--
Alex Bennée


