From: Li Zhang
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Mon, 6 Dec 2021 10:28:33 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0

On 11/29/21 4:49 PM, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Mon, Nov 29, 2021 at 11:20:08AM +0000, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
> > > > > When doing live migration with 8, 16 or more multifd channels,
> > > > > the guest hangs in the presence of network errors such as
> > > > > missing TCP ACKs.
> > > > >
> > > > > At sender's side:
> > > > > The main thread is blocked on qemu_thread_join.
> > > > > migration_fd_cleanup is called because one thread fails on
> > > > > qio_channel_write_all when the network problem happens, and the
> > > > > other send threads are blocked on sendmsg. They cannot be
> > > > > terminated, so the main thread stays blocked on
> > > > > qemu_thread_join waiting for them to terminate.
> > > >
> > > > Isn't the right answer here to ensure we've called 'shutdown' on
> > > > all the FDs, so that the threads get kicked out of sendmsg, before
> > > > trying to join the thread ?
> > >
> > > I agree a timeout is wrong here; there is no way to get a good
> > > timeout value.
> > > However, I'm a bit confused - we should be able to try a shutdown on
> > > the receive side using the 'yank' command - that's what it's there
> > > for; Li does this solve your problem?
> >
> > Why do we even need to use 'yank' on the receive side ? Until
> > migration has switched over from src to dst, the receive side is
> > discardable and the whole process can just be terminated with
> > kill(SIGTERM/SIGKILL).
>
> True, although it's nice to be able to quit cleanly.

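To make the "shutdown before join" idea concrete, here is a minimal standalone sketch. It uses plain POSIX sockets and pthreads, not QEMU's QIOChannel or qemu_thread_join, and all names in it (struct worker, worker_main, shutdown_then_join_demo) are hypothetical. It also blocks the worker in recv() rather than sendmsg(), only because a blocked receive is easier to reproduce deterministically; the assumption is Linux socket semantics, where shutdown() wakes a thread blocked on the same socket.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical worker state, standing in for one multifd channel thread. */
struct worker {
    int fd;
    ssize_t result;
};

static void *worker_main(void *opaque)
{
    struct worker *w = opaque;
    char buf[64];

    /* Blocks: the peer never sends anything on this socket. */
    w->result = recv(w->fd, buf, sizeof(buf), 0);
    return NULL;
}

/*
 * Start a worker that blocks in recv(), call shutdown() on its socket,
 * then join it. Returns the worker's recv() result, or -2 if the setup
 * itself failed.
 */
static long shutdown_then_join_demo(void)
{
    int sv[2];
    pthread_t tid;
    struct worker w;
    struct timespec pause = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -2;
    }
    w.fd = sv[0];
    w.result = -1;

    if (pthread_create(&tid, NULL, worker_main, &w) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -2;
    }

    /* Give the worker time to block; not needed for correctness. */
    nanosleep(&pause, NULL);

    /* Kick the worker out of recv() first ... */
    shutdown(sv[0], SHUT_RDWR);

    /* ... so that the join cannot wait forever. */
    rc = pthread_join(tid, NULL);
    assert(rc == 0);

    close(sv[0]);
    close(sv[1]);
    return (long)w.result;
}
```

Without the shutdown() call the join in this sketch would never return, which is the same shape as the hang described above. No timeout value has to be chosen.
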
I found that the 'yank' function is actually already registered on the receive side.
It's different from the send side. It's in the function:

    void migration_channel_process_incoming(QIOChannel *ioc)
    {
        MigrationState *s = migrate_get_current();
        Error *local_err = NULL;

        trace_migration_set_incoming_channel(
            ioc, object_get_typename(OBJECT(ioc)));

        if (s->parameters.tls_creds &&
            *s->parameters.tls_creds &&
            !object_dynamic_cast(OBJECT(ioc),
                                 TYPE_QIO_CHANNEL_TLS)) {
            migration_tls_channel_process_incoming(s, ioc, &local_err);
        } else {
            migration_ioc_register_yank(ioc);
            migration_ioc_process_incoming(ioc, &local_err);
        }

        if (local_err) {
            error_report_err(local_err);
        }
    }

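For readers who don't know the mechanism: in the function above, the channel is registered for yank before incoming processing starts on the non-TLS path. The registration pattern itself can be sketched roughly as below. This is a standalone illustration only; every name in it (demo_yank_register, demo_yank_all, struct demo_channel) is hypothetical and it is not QEMU's yank implementation. The callback just sets a flag where real code would shut the channel down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not QEMU's yank API. */
typedef void (*yank_fn)(void *opaque);

#define MAX_YANK_ENTRIES 16

struct yank_entry {
    yank_fn fn;
    void *opaque;
};

static struct yank_entry yank_table[MAX_YANK_ENTRIES];
static size_t yank_count;

/* Register a callback; returns 0 on success, -1 if the table is full. */
static int demo_yank_register(yank_fn fn, void *opaque)
{
    assert(fn != NULL);
    if (yank_count == MAX_YANK_ENTRIES) {
        return -1;
    }
    yank_table[yank_count].fn = fn;
    yank_table[yank_count].opaque = opaque;
    yank_count++;
    return 0;
}

/* Invoke every registered callback; returns how many were called. */
static size_t demo_yank_all(void)
{
    size_t i;

    for (i = 0; i < yank_count; i++) {
        yank_table[i].fn(yank_table[i].opaque);
    }
    return yank_count;
}

/* Hypothetical channel; a real callback would call shutdown() on fd. */
struct demo_channel {
    int fd;
    int shut_down;
};

static void demo_channel_yank(void *opaque)
{
    struct demo_channel *c = opaque;

    c->shut_down = 1;
}
```

The point of the pattern is that whoever triggers the yank does not need to know which threads are blocked where; each channel registers its own way of being kicked.
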
> > On the source side 'yank' is needed, because the QEMU process is
> > still running the live workload and thus is precious and mustn't be
> > killed.
>
> True.
>
> Dave
>
> > Regards,
> > Daniel
> > --
> > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org -o- https://fstop138.berrange.com :|
> > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|