qemu-commits
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Qemu-commits] [qemu/qemu] 224f36: migration/rdma: prevent from double f


From: Peter Maydell
Subject: [Qemu-commits] [qemu/qemu] 224f36: migration/rdma: prevent from double free the same mr
Date: Wed, 14 Jul 2021 04:00:57 -0700

  Branch: refs/heads/master
  Home:   https://github.com/qemu/qemu
  Commit: 224f364a49ec88f9710908574393818d964d0593
      
https://github.com/qemu/qemu/commit/224f364a49ec88f9710908574393818d964d0593
  Author: Li Zhijian <lizhijian@cn.fujitsu.com>
  Date:   2021-07-13 (Tue, 13 Jul 2021)

  Changed paths:
    M migration/rdma.c

  Log Message:
  -----------
  migration/rdma: prevent from double free the same mr

backtrace:
'0x00007ffff5f44ec2 in __ibv_dereg_mr_1_1 (mr=0x7fff1007d390) at 
/home/lizhijian/rdma-core/libibverbs/verbs.c:478
478             void *addr              = mr->addr;
(gdb) bt
 #0  0x00007ffff5f44ec2 in __ibv_dereg_mr_1_1 (mr=0x7fff1007d390) at 
/home/lizhijian/rdma-core/libibverbs/verbs.c:478
 #1  0x0000555555891fcc in rdma_delete_block (block=<optimized out>, 
rdma=0x7fff38176010) at ../migration/rdma.c:691
 #2  qemu_rdma_cleanup (rdma=0x7fff38176010) at ../migration/rdma.c:2365
 #3  0x00005555558925b0 in qio_channel_rdma_close_rcu (rcu=0x555556b8b6c0) at 
../migration/rdma.c:3073
 #4  0x0000555555d652a3 in call_rcu_thread (opaque=opaque@entry=0x0) at 
../util/rcu.c:281
 #5  0x0000555555d5edf9 in qemu_thread_start (args=0x7fffe88bb4d0) at 
../util/qemu-thread-posix.c:541
 #6  0x00007ffff54c73f9 in start_thread () at /lib64/libpthread.so.0
 #7  0x00007ffff53f3b03 in clone () at /lib64/libc.so.6 '

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Message-Id: <20210708144521.1959614-1-lizhijian@cn.fujitsu.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


  Commit: a51dcef08ba574c129ae347f6f47b61ccb10cf07
      
https://github.com/qemu/qemu/commit/a51dcef08ba574c129ae347f6f47b61ccb10cf07
  Author: Laurent Vivier <lvivier@redhat.com>
  Date:   2021-07-13 (Tue, 13 Jul 2021)

  Changed paths:
    M migration/migration.c

  Log Message:
  -----------
  migration: failover: emit a warning when the card is not fully unplugged

When the migration fails or is canceled we wait the end of the unplug
operation to be able to plug it back. But if the unplug operation
is never finished we stop to wait and QEMU emits a warning to inform
the user.

Based-on: 20210629155007.629086-1-lvivier@redhat.com
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20210701131458.112036-1-lvivier@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


  Commit: 2e3e3da3c2ad559d1255a9a3bf3df0782c2cf231
      
https://github.com/qemu/qemu/commit/2e3e3da3c2ad559d1255a9a3bf3df0782c2cf231
  Author: Peter Xu <peterx@redhat.com>
  Date:   2021-07-13 (Tue, 13 Jul 2021)

  Changed paths:
    M migration/migration.c

  Log Message:
  -----------
  migration: Release return path early for paused postcopy

When postcopy pause triggered, we rely on the migration thread to cleanup the
to_dst_file handle, and the return path thread to cleanup the from_dst_file
handle (which is stored in the local variable "rp").

Within the process, from_dst_file cleanup (qemu_fclose) is postponed until it's
setup again due to a postcopy recovery.

It used to work before yank was born; after yank is introduced we rely on the
refcount of IOC to correctly unregister yank function in channel_close().  If
without the early and on-time release of from_dst_file handle the yank function
will be leftover during paused postcopy.

Without this patch, below steps (quoted from Xiaohui) could trigger qemu src
crash:

  1.Boot vm on src host
  2.Boot vm on dst host
  3.Enable postcopy on src&dst host
  4.Load stressapptest in vm and set postcopy speed to 50M
  5.Start migration from src to dst host, change into postcopy mode when 
migration is active.
  6.When postcopy is active, down the network card(do migration via this 
network) on dst host.
  7.Wait untill postcopy is paused on src&dst host.
  8.Before up network card, recover migration on dst host, will get error like 
following.
  9.Ignore the error of step 8, go on recovering migration on src host:

  After step 9, qemu on src host will core dump after some seconds:
  qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion 
`QLIST_EMPTY(&entry->yankfns)' failed.
  1.sh: line 38: 44662 Aborted                 (core dumped)

Reported-by: Li Xiaohui <xiaohli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-2-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


  Commit: ca30f24d12c9ba1fc0654e6e983f950f7792a217
      
https://github.com/qemu/qemu/commit/ca30f24d12c9ba1fc0654e6e983f950f7792a217
  Author: Peter Xu <peterx@redhat.com>
  Date:   2021-07-13 (Tue, 13 Jul 2021)

  Changed paths:
    M migration/migration.c

  Log Message:
  -----------
  migration: Don't do migrate cleanup if during postcopy resume

Below process could crash qemu with postcopy recovery:

  1. (hmp) migrate -d ..
  2. (hmp) migrate_start_postcopy
  3. [network down, postcopy paused]
  4. (hmp) migrate -r $WRONG_PORT
     when try the recover on an invalid $WRONG_PORT, cleanup_bh will be cleared
  5. (hmp) migrate -r $RIGHT_PORT
     [qemu crash on assert(cleanup_bh)]

The thing is we shouldn't cleanup if it's postcopy resume; the error is set
mostly because the channel is wrong, so we return directly waiting for the user
to retry.

migrate_fd_cleanup() should only be called when migration is cancelled or
completed.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-3-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


  Commit: ca7bd0821bb62a1561dd409507039558c0e1f5ac
      
https://github.com/qemu/qemu/commit/ca7bd0821bb62a1561dd409507039558c0e1f5ac
  Author: Peter Xu <peterx@redhat.com>
  Date:   2021-07-13 (Tue, 13 Jul 2021)

  Changed paths:
    M migration/migration.c

  Log Message:
  -----------
  migration: Clear error at entry of migrate_fd_connect()

For each "migrate" command, remember to clear the s->error before going on.
For one reason, when there's a new error it'll be still remembered; see
migrate_set_error() who only sets the error if error==NULL.  Meanwhile if a
failed migration completes (e.g., postcopy recovered and finished), we
shouldn't dump an error when calling migrate_fd_cleanup() at last.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-4-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


  Commit: 63268c4970a5f126cc9af75f3ccb8057abef5ec0
      
https://github.com/qemu/qemu/commit/63268c4970a5f126cc9af75f3ccb8057abef5ec0
  Author: Peter Xu <peterx@redhat.com>
  Date:   2021-07-13 (Tue, 13 Jul 2021)

  Changed paths:
    M migration/ram.c

  Log Message:
  -----------
  migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

Taking the mutex every time for each dirty bit to clear is too slow, especially
we'll take/release even if the dirty bit is cleared.  So far it's only used to
sync with special cases with qemu_guest_free_page_hint() against migration
thread, nothing really that serious yet.  Let's move the lock to be upper.

There're two callers of migration_bitmap_clear_dirty().

For migration, move it into ram_save_iterate().  With the help of MAX_WAIT
logic, we'll only run ram_save_iterate() for no more than 50ms-ish time, so
taking the lock once there at the entry.  It also means any call sites to
qemu_guest_free_page_hint() can be delayed; but it should be very rare, only
during migration, and I don't see a problem with it.

For COLO, move it up to colo_flush_ram_cache().  I think COLO forgot to take
that lock even when calling ramblock_sync_dirty_bitmap(), where another example
is migration_bitmap_sync() who took it right.  So let the mutex cover both the
ramblock_sync_dirty_bitmap() and migration_bitmap_clear_dirty() calls.

It's even possible to drop the lock so we use atomic operations upon rb->bmap
and the variable migration_dirty_pages.  I didn't do it just to still be safe,
also not predictable whether the frequent atomic ops could bring overhead too
e.g. on huge vms when it happens very often.  When that really comes, we can
keep a local counter and periodically call atomic ops.  Keep it simple for now.

Cc: Wei Wang <wei.w.wang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hailiang Zhang <zhang.zhanghailiang@huawei.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: Leonardo Bras Soares Passos <lsoaresp@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210630200805.280905-1-peterx@redhat.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


  Commit: 4598b0735025042c62e85a52e4c91fc0d50ec157
      
https://github.com/qemu/qemu/commit/4598b0735025042c62e85a52e4c91fc0d50ec157
  Author: Peter Maydell <peter.maydell@linaro.org>
  Date:   2021-07-13 (Tue, 13 Jul 2021)

  Changed paths:
    M migration/migration.c
    M migration/ram.c
    M migration/rdma.c

  Log Message:
  -----------
  Merge remote-tracking branch 
'remotes/dgilbert-gitlab/tags/pull-migration-20210713a' into staging

Migration pull 2021-07-13

# gpg: Signature made Tue 13 Jul 2021 16:22:28 BST
# gpg:                using RSA key 45F5C71B4A0CB7FB977A9FA90516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <dgilbert@redhat.com>" 
[full]
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A  9FA9 0516 331E BC5B FDE7

* remotes/dgilbert-gitlab/tags/pull-migration-20210713a:
  migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()
  migration: Clear error at entry of migrate_fd_connect()
  migration: Don't do migrate cleanup if during postcopy resume
  migration: Release return path early for paused postcopy
  migration: failover: emit a warning when the card is not fully unplugged
  migration/rdma: prevent from double free the same mr

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>


Compare: https://github.com/qemu/qemu/compare/2a54fc454cf0...4598b0735025



reply via email to

[Prev in Thread] Current Thread [Next in Thread]