

From: Li Zhang
Subject: Re: [PATCH 1/2] multifd: use qemu_sem_timedwait in multifd_recv_thread to avoid waiting forever
Date: Wed, 1 Dec 2021 13:11:13 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0


On 11/29/21 3:50 PM, Dr. David Alan Gilbert wrote:
* Li Zhang (lizhang@suse.de) wrote:
On 11/29/21 12:20 PM, Dr. David Alan Gilbert wrote:
* Daniel P. Berrangé (berrange@redhat.com) wrote:
On Fri, Nov 26, 2021 at 04:31:53PM +0100, Li Zhang wrote:
When doing live migration with a larger number of multifd channels
(e.g. 8 or 16), the guest hangs in the presence of network errors
such as missing TCP ACKs.

On the sender's side:
The main thread is blocked on qemu_thread_join: migration_fd_cleanup
is called because one thread fails on qio_channel_write_all when
the network problem happens, while the other send threads are blocked
on sendmsg and cannot be terminated. So the main thread blocks on
qemu_thread_join, waiting for those threads to terminate.
Isn't the right answer here to ensure we've called 'shutdown' on
all the FDs, so that the threads get kicked out of sendmsg, before
trying to join the threads?
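
(For reference, a minimal sketch of that suggestion, assuming the
MultiFDSendParams layout in migration/multifd.c; the helper name
multifd_send_shutdown_channels is hypothetical:)

/* Sketch only: force every multifd send channel out of sendmsg()
 * before joining, so qemu_thread_join() cannot block forever. */
static void multifd_send_shutdown_channels(void)
{
    int i;

    for (i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        if (p->c) {
            /* unblocks a thread stuck in sendmsg() on this channel */
            qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
        }
    }
}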
I agree a timeout is wrong here; there is no way to get a good timeout
value.
However, I'm a bit confused - we should be able to try a shutdown on the
receive side using the 'yank' command - that's what it's there for. Li,
does this solve your problem?
No, I tried registering 'yank' on the receive side, and the receive
threads are still waiting there.

It seems that on the send side, 'yank' doesn't work either when the send
threads are blocked.

This may not be a case where yank applies; I am not quite sure about it.
We need to fix that; 'yank' should be able to recover from any network
issue.  If it's not working we need to understand why.

Hi Dr. David,

On the receive side, I registered 'yank' and it is called. But it only
shuts down the channels; it can't fix the problem of the receive threads
waiting on the semaphore. So the receive threads are still waiting there.

On the send side, the main process is blocked on qemu_thread_join(). When
I tried the 'yank' command over QMP, it was not handled, so QMP doesn't
work and yank doesn't work.

I think it's necessary to shut down the channels before terminating the
threads, which can prevent the send threads from being blocked on sendmsg.

Looking at the source code of yank, it only shuts down the channels, so
live migration may recover when something goes wrong in the IO channels.
But if the threads are blocked on semaphores, locks, or something else,
the yank command can't recover them.
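
(For context, the registered migration yank function amounts to roughly
the following; see migration/yank_functions.c:)

/* The yank handler only shuts down the registered IO channel.
 * Threads blocked on a QemuSemaphore or a mutex are not touched,
 * which is why yank cannot unstick the multifd receive threads. */
void migration_yank_iochannel(void *opaque)
{
    QIOChannel *ioc = QIO_CHANNEL(opaque);

    qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
}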


multifd_load_cleanup already kicks sem_sync before trying to do a
thread_join - so have we managed to trigger that on the receive side?
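
(For context, a condensed sketch of that kick-then-join pattern in
multifd_load_cleanup(), with field names as in migration/multifd.c:)

/* Wake every recv thread blocked on sem_sync, then join it.
 * A thread that was never created has p->running == false and
 * is skipped, so nothing kicks or joins it. */
for (i = 0; i < migrate_multifd_channels(); i++) {
    MultiFDRecvParams *p = &multifd_recv_state->params[i];

    if (p->running) {
        p->quit = true;
        qemu_sem_post(&p->sem_sync);   /* kick qemu_sem_wait() */
        qemu_thread_join(&p->thread);
    }
}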
There is no problem with sem_sync in function multifd_load_cleanup.

But it is not called in my case, because no errors are detected on the
receive side.
If you're getting TCP errors why aren't you seeing any errors on the
receive side?

From the kernel log, TCP SYN flooding is detected. This causes the TCP
ACKs to go missing, and the receive side just sends an RST to reset the
connection forcibly, without reporting any error.

If TCP SYN flood detection is disabled, the problem can be ignored, the
data transfer continues, and live migration works. But I don't think TCP
SYN flood detection should be disabled.

On the send side, it causes a failure when writing to the QIO channels,
and migration_save_cleanup is called.

Thanks,

Li


The problem is here:

void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
{
    MigrationIncomingState *mis = migration_incoming_get_current();
    Error *local_err = NULL;
    bool start_migration;

    ...

    if (!mis->from_src_file) {

        ...

    } else {
        /* Multiple connections */
        assert(migrate_use_multifd());
        start_migration = multifd_recv_new_channel(ioc, &local_err);
        if (local_err) {
            error_propagate(errp, local_err);
            return;
        }
    }

    if (start_migration) {
        migration_incoming_process();
    }
}

start_migration is always 0, and migration is not started because some
receive threads are not created.

No errors are detected here, and the main process works fine, but the
receive threads are all waiting on the semaphore.
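
(The reason start_migration stays 0: multifd_recv_new_channel() only
returns true once the last expected channel has connected. Condensed from
the tail of that function in migration/multifd.c:)

/* The thread for this channel is created, and the caller is told
 * to start migration only once every expected channel is in. If
 * one connection is never accepted, this stays false forever and
 * migration_incoming_process() is never reached. */
qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
                   QEMU_THREAD_JOINABLE);
qatomic_inc(&multifd_recv_state->count);
return qatomic_read(&multifd_recv_state->count) ==
       migrate_multifd_channels();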

It's hard to know whether any receive threads were not created. If we can
find a way to check that some receive threads are missing, we can kick
sem_sync and do the cleanup.
So is this only a problem for network issues that happen during startup,
before all the threads have been created?

Dave

From the source code, the thread is created when the QIO channel detects
something via the GIO watch, if I understand correctly.

If nothing is detected, socket_accept_incoming_migration won't be called
and the thread will not be created.

socket_start_incoming_migration_internal ->
    qio_net_listener_set_client_func_full(listener,
                                          socket_accept_incoming_migration,
                                          NULL, NULL,
                                          g_main_context_get_thread_default());

qio_net_listener_set_client_func_full ->
    qio_channel_add_watch_source(QIO_CHANNEL(listener->sioc[i]), G_IO_IN,
                                 qio_net_listener_channel_func,
                                 listener, (GDestroyNotify)object_unref,
                                 context);

socket_accept_incoming_migration ->
    migration_channel_process_incoming ->
        migration_ioc_process_incoming ->
            multifd_recv_new_channel ->
                qemu_thread_create(&p->thread, p->name,
                                   multifd_recv_thread, p,
                                   QEMU_THREAD_JOINABLE);
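
(For reference, a sketch of the idea in this patch: bound the wait on
sem_sync in multifd_recv_thread() so a thread that is never signalled can
observe p->quit and exit. The 1000 ms value is illustrative only; as Dave
notes above, there is no obviously correct timeout:)

/* qemu_sem_timedwait() returns 0 when posted, non-zero on timeout */
while (qemu_sem_timedwait(&p->sem_sync, 1000) != 0) {
    bool quit;

    qemu_mutex_lock(&p->mutex);
    quit = p->quit;
    qemu_mutex_unlock(&p->mutex);
    if (quit) {
        return NULL;    /* terminate the receive thread */
    }
}
/* sem_sync was posted: continue with the normal sync handling */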

Dave

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



