qemu-devel

Re: recent flakiness (intermittent hangs) of migration-test


From: Peter Xu
Subject: Re: recent flakiness (intermittent hangs) of migration-test
Date: Mon, 2 Nov 2020 10:14:01 -0500

On Mon, Nov 02, 2020 at 03:19:50PM +0100, Christian Schoenebeck wrote:
> On Monday, 2 November 2020 14:55:04 CET Philippe Mathieu-Daudé wrote:
> > On 10/30/20 2:53 PM, Peter Xu wrote:
> > > On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
> > >>> Peter, is it possible that you enable QTEST_LOG=1 in your future
> > >>> migration-test testcase and try to capture the stderr?  With the help
> > >>> of commit a47295014d ("migration-test: Only hide error if !QTEST_LOG",
> > >>> 2020-10-26), the test should be able to dump quite some helpful
> > >>> information to further identify the issue.
> > >> 
> > >> Here's the result of running just the migration test with
> > >> QTEST_LOG=1:
> > >> https://people.linaro.org/~peter.maydell/migration.log
> > >> It's 300MB because when the test hangs one of the processes
> > >> is apparently in a polling state and continues to send status
> > >> queries.
> > >> 
> > >> My impression is that the test is OK on an unloaded machine but
> > >> more likely to fail if the box is doing other things at the
> > >> same time. Alternatively it might be a 'parallel make check' bug.
> > > 
> > > Thanks for collecting that, Peter.
> > > 
> > > I'm copy-pasting the important information out here (with some moves and
> > > indents to make things even clearer):
> > > 
> > > ...
> > > {"execute": "migrate-recover", "arguments": {"uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
> > > {"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event": "MIGRATION", "data": {"status": "setup"}}
> > > {"return": {}, "id": "recover-cmd"}
> > > {"execute": "query-migrate"}
> > > ...
> > > {"execute": "migrate", "arguments": {"resume": true, "uri": "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
> > > qemu-system-x86_64: ram_save_queue_pages no previous block
> > > qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> > > {"return": {}}
> > > {"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth": 0}}
> > > ...
> > > 
> > > The problem is probably a misuse of last_rb on the destination node.  While
> > > looking at it, I also found a race.  So I guess I should fix both...
> > > 
> > > Peter, would it be easy to try applying the two patches I attached, to see
> > > whether they resolve the test hang?  Dave, feel free to give early
> > > comments too on the two fixes before I post them on the list.
> > 
> > Per this comment:
> > https://www.mail-archive.com/qemu-devel@nongnu.org/msg756235.html
> > You could add:
> > Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> 
> Yes, you can do that.
> 
> We've extensively tested with Peter Xu's patches in the last couple of days on
> various systems and haven't encountered any further lockups since then.

Thanks, Christian & all.  I'll post them formally soon with your tags.
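
For anyone digging through a QTEST_LOG=1 capture like the excerpt quoted above,
where QMP JSON messages and qemu stderr lines end up interleaved and re-wrapped,
here is a minimal sketch of how the two could be separated.  It is not part of
the test suite or of the patches; the function name split_qmp_stream is made up
for illustration, and it assumes any mail quote prefixes have already been
stripped from the text.

```python
import json


def split_qmp_stream(text):
    """Split a captured log into parsed JSON messages and stray text.

    Hypothetical helper (not from QEMU).  Returns a list of tuples:
    ("json", obj) for each JSON object found, and ("text", s) for the
    non-JSON fragments in between, such as qemu stderr lines.
    """
    decoder = json.JSONDecoder()
    items = []
    stray = []  # pieces of text that are not part of any JSON object
    pos = 0

    def flush():
        chunk = "".join(stray).strip()
        if chunk:
            items.append(("text", chunk))
        stray.clear()

    while pos < len(text):
        start = text.find("{", pos)
        if start == -1:
            stray.append(text[pos:])
            break
        stray.append(text[pos:start])
        try:
            # raw_decode parses one JSON value starting at `start` and
            # returns it with the index just past its end; line breaks
            # between tokens inside the object are accepted.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # A "{" that does not begin valid JSON is kept as plain text.
            stray.append("{")
            pos = start + 1
            continue
        flush()
        items.append(("json", obj))
        pos = end

    flush()
    return items


sample = (
    '{"execute": "migrate", "arguments": {"resume": true, "uri":\n'
    '"unix:/tmp/migration-test-nGzu4q/migsocket-recover"}} qemu-system-x86_64:\n'
    'ram_save_queue_pages no previous block\n'
    '{"return": {}}\n'
)

for kind, value in split_qmp_stream(sample):
    print(kind, value)
```

Dropping or counting the repeated {"execute": "query-migrate"} entries from the
resulting list would be one way to shrink a log where one side keeps polling.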

-- 
Peter Xu



