[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: recent flakiness (intermittent hangs) of migration-test
From: |
Christian Schoenebeck |
Subject: |
Re: recent flakiness (intermittent hangs) of migration-test |
Date: |
Mon, 02 Nov 2020 15:19:50 +0100 |
On Montag, 2. November 2020 14:55:04 CET Philippe Mathieu-Daudé wrote:
> On 10/30/20 2:53 PM, Peter Xu wrote:
> > On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> >>> Peter, is it possible that you enable QTEST_LOG=1 in your future
> >>> migration-test testcase and try to capture the stderr? With the help
> >>> of commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> >>> 2020-10-26), the test should be able to dump quite some helpful
> >>> information to further identify the issue.>>
> >> Here's the result of running just the migration test with
> >> QTEST_LOG=1:
> >> https://people.linaro.org/~peter.maydell/migration.log
> >> It's 300MB because when the test hangs one of the processes
> >> is apparently in a polling state and continues to send status
> >> queries.
> >>
> >> My impression is that the test is OK on an unloaded machine but
> >> more likely to fail if the box is doing other things at the
> >> same time. Alternatively it might be a 'parallel make check' bug.
> >
> > Thanks for collecting that, Peter.
> >
> > I'm copy-pasting the important information out here (with some moves and
> > indents to make things even clearer):
> >
> > ...
> > {"execute": "migrate-recover", "arguments": {"uri":
> > "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id":
> > "recover-cmd"} {"timestamp": {"seconds": 1604056292, "microseconds":
> > 177955}, "event": "MIGRATION", "data": {"status": "setup"}} {"return":
> > {}, "id": "recover-cmd"}
> > {"execute": "query-migrate"}
> > ...
> > {"execute": "migrate", "arguments": {"resume": true, "uri":
> > "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}} qemu-system-x86_64:
> > ram_save_queue_pages no previous block
> > qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> > {"return": {}}
> > {"execute": "migrate-set-parameters", "arguments":
> > {"max-postcopy-bandwidth": 0}} ...
> >
> > The problem is probably an misuse on last_rb on destination node. When
> > looking at it, I also found a race. So I guess I should fix both...
> >
> > Peter, would it be easy to try apply the two patches I attached to see
> > whether the test hang would be resolved? Dave, feel free to give early
> > comments too on the two fixes before I post them on the list.
>
> Per this comment:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg756235.html
> You could add:
> Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Yes, you can do that.
We've extensively tested with Peter Xu's patches in the last couple days on
various systems and haven't encountered any further lockup since then.
Best regards,
Christian Schoenebeck