[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: recent flakiness (intermittent hangs) of migration-test
From: |
Philippe Mathieu-Daudé |
Subject: |
Re: recent flakiness (intermittent hangs) of migration-test |
Date: |
Mon, 2 Nov 2020 14:55:04 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.3.1 |
On 10/30/20 2:53 PM, Peter Xu wrote:
> On Fri, Oct 30, 2020 at 11:48:28AM +0000, Peter Maydell wrote:
>>> Peter, is it possible that you enable QTEST_LOG=1 in your future
>>> migration-test
>>> testcase and try to capture the stderr? With the help of commit a47295014d
>>> ("migration-test: Only hide error if !QTEST_LOG", 2020-10-26), the test
>>> should
>>> be able to dump quite some helpful information to further identify the
>>> issue.
>>
>> Here's the result of running just the migration test with
>> QTEST_LOG=1:
>> https://people.linaro.org/~peter.maydell/migration.log
>> It's 300MB because when the test hangs one of the processes
>> is apparently in a polling state and continues to send status
>> queries.
>>
>> My impression is that the test is OK on an unloaded machine but
>> more likely to fail if the box is doing other things at the
>> same time. Alternatively it might be a 'parallel make check' bug.
>
> Thanks for collecting that, Peter.
>
> I'm copy-pasting the important information out here (with some moves and
> indents to make things even clearer):
>
> ...
> {"execute": "migrate-recover", "arguments": {"uri":
> "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}, "id": "recover-cmd"}
> {"timestamp": {"seconds": 1604056292, "microseconds": 177955}, "event":
> "MIGRATION", "data": {"status": "setup"}}
> {"return": {}, "id": "recover-cmd"}
> {"execute": "query-migrate"}
> ...
> {"execute": "migrate", "arguments": {"resume": true, "uri":
> "unix:/tmp/migration-test-nGzu4q/migsocket-recover"}}
> qemu-system-x86_64: ram_save_queue_pages no previous block
> qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> {"return": {}}
> {"execute": "migrate-set-parameters", "arguments": {"max-postcopy-bandwidth":
> 0}}
> ...
>
> The problem is probably an misuse on last_rb on destination node. When
> looking
> at it, I also found a race. So I guess I should fix both...
>
> Peter, would it be easy to try apply the two patches I attached to see whether
> the test hang would be resolved? Dave, feel free to give early comments too
> on
> the two fixes before I post them on the list.
Per this comment:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg756235.html
You could add:
Tested-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
- Re: recent flakiness (intermittent hangs) of migration-test,
Philippe Mathieu-Daudé <=