On Fri, Nov 26, 2021 at 05:44:04PM +0100, Li Zhang wrote:
On 11/26/21 4:49 PM, Daniel P. Berrangé wrote:
On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
When doing live migration with 8, 16, or more multifd channels,
the guest hangs in the presence of network errors such as missing TCP ACKs.
On the sender's side:
The main thread is blocked in qemu_thread_join. migration_fd_cleanup
is called because one thread fails in qio_channel_write_all when
the network problem occurs, while the other send threads are blocked in
sendmsg and cannot be terminated. So the main thread stays blocked in
qemu_thread_join, waiting for those threads to terminate.
Isn't the right answer here to ensure we've called 'shutdown' on
all the FDs, so that the threads get kicked out of sendmsg, before
trying to join the threads?
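
To illustrate the ordering suggested above (shutdown first, join second),
here is a minimal standalone sketch. It is not QEMU code: it uses a plain
AF_UNIX socketpair and pthreads instead of QIOChannel and QemuThread, and
the names send_thread and demo_shutdown_then_join are hypothetical. It
assumes Linux socket behaviour, where shutdown() wakes a thread blocked in
send() and that call then fails with EPIPE.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for a multifd send thread: it writes until the
 * socket reports an error, then returns the errno value it saw. */
static void *send_thread(void *opaque)
{
    int fd = *(int *)opaque;
    char buf[4096];

    memset(buf, 0, sizeof(buf));
    for (;;) {
        /* MSG_NOSIGNAL: report EPIPE as an error instead of raising SIGPIPE */
        if (send(fd, buf, sizeof(buf), MSG_NOSIGNAL) < 0) {
            return (void *)(intptr_t)errno;
        }
    }
}

/* Returns 0 if the blocked sender was kicked out by shutdown() (EPIPE). */
int demo_shutdown_then_join(void)
{
    int sv[2];
    pthread_t tid;
    void *ret = NULL;
    /* Safety net only: if shutdown() did not wake the sender, send() gives
     * up after 5 seconds so the demo cannot hang forever. */
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 200 * 1000000L };

    assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0);
    assert(setsockopt(sv[0], SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) == 0);
    assert(pthread_create(&tid, NULL, send_thread, &sv[0]) == 0);

    /* The peer (sv[1]) never reads, so the sender fills the socket buffer
     * and blocks in send(), much like a send thread stuck in sendmsg. */
    nanosleep(&pause, NULL);

    /* Step 1: shut the socket down so the blocked send() returns. */
    shutdown(sv[0], SHUT_RDWR);

    /* Step 2: only now join the thread; it has already been unblocked. */
    pthread_join(tid, &ret);

    close(sv[0]);
    close(sv[1]);
    return (int)(intptr_t)ret == EPIPE ? 0 : -1;
}
```

If the join were attempted without the shutdown() call, the only thing
that would end the wait in this sketch is the 5 second send timeout.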
If we shut down the channels on the sender's side, that can terminate the
send threads, but the receive threads are still waiting.
On the receiver's side, if the semaphore wait times out, the channels can
eventually be terminated, and the sender threads are eventually terminated too.
If something goes wrong on the sender side, the mgmt app should be
tearing down the destination QEMU entirely, so I'm not sure we need
to do anything special to deal with receive threads.
Using semtimedwait just feels risky because it will introduce false
failures if the system/network is under high load such that the
connections don't all establish within 1 second.
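
A small sketch of the false-failure concern, using POSIX sem_timedwait
directly rather than any QEMU wrapper. The names post_late, wait_ms and
demo_timed_wait are hypothetical, and the delays are scaled down so the
example runs quickly: the "connection" is signalled by a helper thread
after delay_ms, and the waiter gives up after timeout_ms.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

struct late_poster {
    sem_t *sem;
    long delay_ms;      /* how long the simulated connection takes */
};

/* Simulates a channel that does get established, just slowly. */
static void *post_late(void *opaque)
{
    struct late_poster *p = opaque;
    struct timespec d = { .tv_sec = p->delay_ms / 1000,
                          .tv_nsec = (p->delay_ms % 1000) * 1000000L };

    nanosleep(&d, NULL);
    sem_post(p->sem);
    return NULL;
}

/* Waits on sem for at most timeout_ms. Returns 0 on wakeup, or the errno
 * value from sem_timedwait (ETIMEDOUT when the deadline passes). */
static int wait_ms(sem_t *sem, long timeout_ms)
{
    struct timespec ts;

    /* sem_timedwait takes an absolute CLOCK_REALTIME deadline */
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += timeout_ms / 1000;
    ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec += 1;
        ts.tv_nsec -= 1000000000L;
    }
    while (sem_timedwait(sem, &ts) < 0) {
        if (errno != EINTR) {
            return errno;
        }
    }
    return 0;
}

int demo_timed_wait(long delay_ms, long timeout_ms)
{
    sem_t sem;
    pthread_t tid;
    struct late_poster p = { &sem, delay_ms };
    int rc;

    assert(sem_init(&sem, 0, 0) == 0);
    assert(pthread_create(&tid, NULL, post_late, &p) == 0);
    rc = wait_ms(&sem, timeout_ms);
    pthread_join(tid, NULL);    /* the post still happens, only too late */
    sem_destroy(&sem);
    return rc;
}
```

With demo_timed_wait(300, 100) the waiter reports ETIMEDOUT even though
the post arrives 200 ms later, which is the kind of false failure a fixed
timeout could produce on a loaded system. With demo_timed_wait(50, 1000)
the same code succeeds.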