Re: [RFC] QEMU Gating CI

On Mon, Dec 2, 2019 at 11:29 AM Cleber Rosa <address@hidden> wrote:

> On Mon, Dec 02, 2019 at 05:08:35PM +0000, Peter Maydell wrote:
> > On Mon, 2 Dec 2019 at 17:00, Stefan Hajnoczi <address@hidden> wrote:
> > >
> > > On Mon, Dec 02, 2019 at 09:05:52AM -0500, Cleber Rosa wrote:
> > > > To exemplify my point, if one specific test run as part of "check-tcg"
> > > > is found to be faulty on a specific job (say on a specific OS), the
> > > > entire "check-tcg" test set may be disabled as a CI-level maintenance
> > > > action. Of course a follow up action to deal with the specific test
> > > > is required, probably in the form of a Launchpad bug and patches
> > > > dealing with the issue, but without necessarily a CI related angle to
> > > > it.
> > >
> > > I think this coarse level of granularity is unrealistic. We cannot
> > > disable 99 tests because of 1 known failure. There must be a way of
> > > disabling individual tests. You don't need to implement it yourself,
> > > but I think this needs to be solved by someone before a gating CI can be
> > > put into use.
> > >
> > > It probably involves adding a "make EXCLUDE_TESTS=foo,bar check"
> > > variable so that .gitlab-ci.yml can be modified to exclude specific
> > > tests on certain OSes.
> >
> > We don't have this at the moment, so I'm not sure we need to
> > add it as part of moving to doing merge testing via gitlab ?
> > The current process is "if the pullreq causes a test to fail
> > then the pullreq needs to be changed, perhaps by adding a
> > patch which disables the test on a particular platform if
> > necessary". Making that smoother might be nice, but I would
> > be a little wary about adding requirements to the move-to-gitlab
> > that don't absolutely need to be there.
> >
> > thanks
> > -- PMM
> >
>
> Right, it goes without saying that:
>
> 1) I acknowledge the problem (and I can have a long conversation
> about it :)

Just make sure that the pipeline and any mandatory CI steps don't slow things down too much. The examples so far have talked about 1 or 2 pull requests being handled in parallel, and that's great. The problem comes when you try to land 10 or 20 all at once, one of them causes a failure, and you aren't sure which one it actually is. Make sure whatever you design has sane handling for that exception case, so it doesn't cause too much collateral damage.

I worked at one place that would back everything out if a once-a-week CI test run had failures. That run took 2 days to complete, so it wasn't practical to run it often, or for every commit. In the end, though, the powers that be implemented an automated bisection tool that made it marginally less sucky.
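
To make the bisection idea concrete, here is a rough sketch in Python. It is not the tool from that job and not anything QEMU has today. The names (find_first_bad, passes, "pull-N") are made up for illustration. It assumes the batch is applied in a fixed order, the tree with nothing from the batch applied passes, exactly one change is the culprit, and the failure is deterministic and stays once introduced.

```python
from typing import Callable, Optional, Sequence


def find_first_bad(
    changes: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first change whose inclusion makes the test run fail.

    `passes(prefix)` is a hypothetical hook: it applies the given changes,
    in order, on top of a known-good tree, runs the CI tests, and returns
    True if they pass. Returns None if the whole batch passes.
    """
    if passes(changes):
        return None  # nothing to bisect

    # Invariant: the first `lo` changes pass, the first `hi` changes fail.
    # lo = 0 relies on the assumption that the baseline tree is good.
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(changes[:mid]):
            lo = mid
        else:
            hi = mid
    # The change at index hi - 1 is the one that tips the run into failure.
    return changes[hi - 1]


if __name__ == "__main__":
    batch = [f"pull-{i}" for i in range(20)]
    culprit = "pull-13"  # pretend this one breaks the tests
    print(find_first_bad(batch, lambda applied: culprit not in applied))
```

With N changes in the batch this needs on the order of log2(N) extra test runs rather than N, which is what makes it tolerable when a single run is expensive. It doesn't help with flaky tests or with two pull requests that only break things in combination, so those would still need a human.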

Warner

From:	Warner Losh
Subject:	Re: [RFC] QEMU Gating CI
Date:	Mon, 2 Dec 2019 11:36:35 -0700