Re: recent flakiness (intermittent hangs) of migration-test
From: Peter Xu
Subject: Re: recent flakiness (intermittent hangs) of migration-test
Date: Mon, 2 Nov 2020 10:14:01 -0500

On Mon, Nov 02, 2020 at 03:19:50PM +0100, Christian Schoenebeck wrote:
> On Montag, 2. November 2020 14:55:04 CET Philippe Mathieu-Daudé wrote:
> > On 10/30/20 2:53 PM, Peter Xu wrote:
> > > On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > >>> Peter, is it possible that you enable QTEST_LOG=1 in your future
> > >>> migration-test testcase and try to capture the stderr? With the help
> > >>> of commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> > >>> 2020-10-26), the test should be able to dump quite some helpful
> > >>> information to further identify the issue.
> > >>
> > >> Here's the result of running just the migration test with
> > >> QTEST_LOG=1:
> > >> https://people.linaro.org/~peter.maydell/migration.log
> > >> It's 300MB because when the test hangs one of the processes
> > >> is apparently in a polling state and continues to send status
> > >> queries.
> > >>
> > >> My impression is that the test is OK on an unloaded machine but
> > >> more likely to fail if the box is doing other things at the
> > >> same time. Alternatively it might be a 'parallel make check' bug.
> > >
> > > Thanks for collecting that, Peter.
> > >
> > > I'm copy-pasting the important information out here (with some moves and
> > > indents to make things even clearer):
> > >
> > > ...
> > > {"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
> > > {"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}}
> > > {"return": {}, "id": "recover-cmd"}
> > > {"execute": "query-migrate"}
> > > ...
> > > {"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
> > > qemu-system-x86_64: ram_save_queue_pages no previous block
> > > qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> > > {"return": {}}
> > > {"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}}
> > > ...
> > >
> > > The problem is probably a misuse of last_rb on the destination node. When
> > > looking at it, I also found a race. So I guess I should fix both...
> > >
> > > Peter, would it be easy to try applying the two patches I attached to see
> > > whether the test hang is resolved? Dave, feel free to give early
> > > comments too on the two fixes before I post them on the list.
> >
> > Per this comment:
> > https://www.mail-archive.com/qemu-devel@nongnu.org/msg756235.html
> > You could add:
> > Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
>
> Yes, you can do that.
>
> We've tested extensively with Peter Xu's patches over the last couple of days
> on various systems and haven't encountered any further lockups since then.
Thanks, Christian & all. I'll post them formally soon with your tags.
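
For anyone else who has to dig through a QTEST_LOG=1 capture of that size, here
is a rough sketch of how the status polling could be filtered out so that the
interesting commands and events stand out. It is not part of the patches. The
helper names (iter_qmp_objects, strip_polling) are made up for illustration,
and it assumes the raw log interleaves QMP JSON objects with plain-text QEMU
stderr lines, and that the first "return" object after a query-migrate command
is the reply to it.

```python
import json

_decoder = json.JSONDecoder()


def iter_qmp_objects(text):
    """Yield each top-level JSON object found in text, skipping other output."""
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            obj, end = _decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON at this brace; keep scanning
            continue
        yield obj
        pos = end


def strip_polling(objects, poll_command="query-migrate"):
    """Drop poll commands and their replies; return (kept objects, poll count).

    Assumption: the first "return" object seen after a poll command is the
    reply to it. Events and error replies seen in between are kept.
    """
    kept = []
    polls = 0
    awaiting_reply = False
    for obj in objects:
        if obj.get("execute") == poll_command:
            polls += 1
            awaiting_reply = True
        elif awaiting_reply and "return" in obj:
            awaiting_reply = False  # swallow the reply to the poll
        else:
            if "error" in obj:
                awaiting_reply = False  # the poll failed; keep the error
            kept.append(obj)
    return kept, polls
```

Reading the log into a string and calling strip_polling(iter_qmp_objects(text))
gives the remaining commands, replies and events plus a count of how many polls
were dropped. The plain-text QEMU messages (e.g. the "Detected IO failure for
postcopy" line) are skipped by this sketch, so they are better grepped for
separately.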
--
Peter Xu