Re: [PATCH v2 0/1] migration: multifd live migration improvement
From: Dr. David Alan Gilbert
Subject: Re: [PATCH v2 0/1] migration: multifd live migration improvement
Date: Mon, 6 Dec 2021 19:54:41 +0000
User-agent: Mutt/2.1.3 (2021-09-10)

* Li Zhang (lizhang@suse.de) wrote:
> When testing live migration with multifd channels (8, 16, or a bigger number)
> and using qemu -incoming (without "defer"), if a network error occurs
> (for example, triggering the kernel SYN flooding detection),
> the migration fails and the guest hangs forever.
>
> The test environment and command lines are as follows:
>
> QEMU version: QEMU emulator version 6.2.91 (v6.2.0-rc1-47-gc5fbdd60cf)
> Host OS: SLE 15 with kernel: 5.14.5-1-default
> Network Card: mlx5 100Gbps
> Network card: Intel Corporation I350 Gigabit (1Gbps)
>
> Source:
> qemu-system-x86_64 -M q35 -smp 32 -nographic \
> -serial telnet:10.156.208.153:4321,server,nowait \
> -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
> -monitor stdio
> Dest:
> qemu-system-x86_64 -M q35 -smp 32 -nographic \
> -serial telnet:10.156.208.154:4321,server,nowait \
> -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
> -monitor stdio \
> -incoming tcp:1.0.8.154:4000
>
> (qemu) migrate_set_parameter max-bandwidth 100G
> (qemu) migrate_set_capability multifd on
> (qemu) migrate_set_parameter multifd-channels 16
>
> The guest hangs when executing the command: migrate -d tcp:1.0.8.154:4000.
>
> If a network problem happens, the TCP ACK is not received by the destination
> and the destination resets the connection with RST.
>
> No.  Time      Source     Destination  Protocol  Length  Info
> 119  1.021169  1.0.8.153  1.0.8.154    TCP       1410    60166 → 4000 [PSH, ACK] Seq=65 Ack=1 Win=62720 Len=1344 TSval=1338662881 TSecr=1399531897
> 125  1.021181  1.0.8.154  1.0.8.153    TCP       54      4000 → 60166 [RST] Seq=1 Win=0 Len=0
>
> kernel log:
> [334520.229445] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [334562.994919] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [334695.519927] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [334734.689511] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [335687.740415] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
> [335730.013598] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies. Check SNMP counters.
Should we document somewhere how to avoid that? Is there something we
should be doing in the connection code to avoid it?
Dave
> There are two problems here:
> 1. On the send side, the main thread is blocked on qemu_thread_join and
>    the send threads are blocked on sendmsg.
> 2. On the receive side, the receive threads are blocked on qemu_sem_wait,
>    waiting for a semaphore.
>
> The patch fixes the first problem, so the guest no longer hangs.
> There is not yet a good solution for the second problem.
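
For readers following the thread, here is a minimal sketch of the general idea behind fixing the first problem: a thread stuck in a blocking send can be woken by shutting the socket down from another thread, after which joining it no longer hangs. This is not the patch itself and not QEMU code. It uses a local socketpair as a stand-in for one multifd channel, the names blocked_sender and demo are hypothetical, and the wake-up behaviour shown is what I would expect on Linux.

```python
import socket
import threading
import time


def blocked_sender(sock, result):
    """Send until the socket buffer fills and the call blocks; record how it ends."""
    chunk = b"\0" * 65536
    try:
        while True:
            sock.sendall(chunk)  # blocks once the peer stops draining data
    except OSError as exc:
        # After shutdown() the pending or next send fails instead of blocking.
        result["error"] = type(exc).__name__


def demo():
    # Local stand-in for one migration channel; the peer (rx) never reads.
    tx, rx = socket.socketpair()
    result = {}
    worker = threading.Thread(target=blocked_sender, args=(tx, result))
    worker.start()

    time.sleep(0.5)  # give the sender time to fill the buffer and block
    result["blocked"] = worker.is_alive()

    # Analogous in spirit to shutting down the channel before joining the thread.
    tx.shutdown(socket.SHUT_RDWR)
    worker.join(timeout=5)
    result["joined"] = not worker.is_alive()

    tx.close()
    rx.close()
    return result


if __name__ == "__main__":
    print(demo())
```

Without the shutdown() call, the join would wait indefinitely, which mirrors the main thread being stuck on qemu_thread_join while the send threads sit in sendmsg.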
>
> Li Zhang (1):
> multifd: Shut down the QIO channels to avoid blocking the send threads
> when they are terminated.
>
> migration/multifd.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> --
> 2.31.1
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK