Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

From: Aviad Yehezkel
Subject: Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
Date: Thu, 17 May 2018 10:31:24 +0300
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0

On 5/17/2018 5:42 AM, 858585 jemmy wrote:
On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
<address@hidden> wrote:
Hi Lidong and David,
Sorry for the late response, I had to ramp up on migration code and build a
setup on my side.

Please see my comments for this patch below.
For the RDMA post-copy patches I will comment next week, after testing on
the Mellanox side too.


On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:

-----Original Message-----
From: Dr. David Alan Gilbert [mailto:address@hidden
Sent: Wednesday, May 16, 2018 4:13 PM
To: 858585 jemmy <address@hidden>
Cc: Aviad Yehezkel <address@hidden>; Juan Quintela
<address@hidden>; qemu-devel <address@hidden>; Gal Shachaf
<address@hidden>; Adi Dotan <address@hidden>; Lidong Chen
Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED
event after rdma_disconnect

* 858585 jemmy (address@hidden) wrote:


I wonder why dereg_mr takes so long - I could understand if
reg_mr took a long time, but why for dereg? That sounds like the
easy side.
I used perf to collect the information when ibv_dereg_mr is invoked.

-   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
     - put_compound_page
        - 98.45% put_page
        + 1.55% __ib_umem_release
+   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
+   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
+   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
+   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
+   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
+   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
+   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
+   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
+   3.65%  client2  [kernel.kallsyms]  [k] release_pages
+   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
+   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
+   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
+   3.09%  client2  [kernel.kallsyms]  [k] put_page

the reason is that __ib_umem_release will loop many times for each page.

static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
{
      struct scatterlist *sg;
      struct page *page;
      int i;

      if (umem->nmap > 0)
           ib_dma_unmap_sg(dev, umem->sg_head.sgl,
                           umem->npages,
                           DMA_BIDIRECTIONAL);

      for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {   /* <-- loop a lot of times for each page, here */
           page = sg_page(sg);
           if (umem->writable && dirty)
                set_page_dirty_lock(page);
           put_page(page);
      }

      sg_free_table(&umem->sg_head);
}
Why 'lot of times for each page'?  I don't know this code at all,
but I'd have expected once per page?
sorry, once per page, but a lot of pages for a big virtual machine.
Ah OK; so yes it seems best if you can find a way to do the release
in the migration thread then;  still maybe this is something some of
the kernel people could look at speeding up?
The kernel code seems not complex, but I have no idea how to speed it up.
Me neither; but I'll ask around.
I will ask internally and get back on this one.

With your other kernel fix, does the problem of the missing
RDMA_CM_EVENT_DISCONNECTED event go away?
Yes, after the kernel and qemu fixes, this issue never happens again.
I'm confused; which qemu fix; my question was whether the kernel fix
by itself fixed the problem of the missing event.
this qemu fix:
migration: update index field when delete or qsort RDMALocalBlock
OK good; so then we shouldn't need this 2/2 patch.
It is good that the qemu fix solved this issue, but my colleagues and I
think we need the 2/2 patch anyway.
According to the IB spec, once the active side sends a DREQ message it
should wait for a DREP message, and only once that arrives should it
trigger a DISCONNECT event.
The DREP message can be dropped due to network issues.
For that case the spec defines a DREP_timeout state in the CM state machine;
if the DREP is dropped we should get a timeout, and a TIMEWAIT_EXIT event
will be triggered.
Unfortunately the current kernel CM implementation doesn't include the
DREP_timeout state, and in the above scenario we will get neither a
DISCONNECT nor a TIMEWAIT_EXIT event.
(A similar scenario exists also for the passive side.)
We think it is best to apply this patch until we enhance the kernel CM
implementation.
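
(For illustration only, not part of the patch: a minimal sketch of bounding
the wait instead of removing it, by polling the CM event channel's fd before
calling rdma_get_cm_event. The helper name and the 5-second timeout are
assumptions, not existing QEMU code.)

#include <poll.h>
#include <rdma/rdma_cma.h>

/* Sketch: wait for the disconnect event, but give up after a timeout,
 * since the kernel CM may never deliver DISCONNECTED if the DREP is lost. */
static void wait_disconnect_with_timeout(struct rdma_event_channel *channel)
{
    struct rdma_cm_event *cm_event;
    struct pollfd pfd = { .fd = channel->fd, .events = POLLIN };

    /* Assumed 5000 ms bound; poll() returns 0 on timeout. */
    if (poll(&pfd, 1, 5000) > 0 &&
        rdma_get_cm_event(channel, &cm_event) == 0) {
        rdma_ack_cm_event(cm_event);
    }
}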

Hi Aviad:
     How long is the DREP_timeout state?
     Does RDMA have a tool like tcpdump that I can use to confirm whether
     the DREP message has been received?

Are you running IB or RoCE?
this issue is also caused by some ram block not being released, but I have
not found the root cause.
Hmm, we should try and track that down.

Do you think we should remove rdma_get_cm_event after rdma_disconnect?
I don't think so; if 'rdma_disconnect' is supposed to generate the
event I think we're supposed to wait for it to know that the
disconnect is really complete.
After moving qemu_fclose to the migration thread, it will not block the main
thread while waiting for the disconnect event.
I'm not sure about moving the fclose to the migration thread; it worries
me because of the interaction with cancel and other failures.





Anyway, it should not invoke rdma_get_cm_event in the main
thread, and the event channel is also destroyed in qemu_rdma_cleanup.

Signed-off-by: Lidong Chen <address@hidden>
---
   migration/rdma.c       | 12 ++----------
   migration/trace-events |  1 -
   2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 0dd4033..92e4d30 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,

    static void qemu_rdma_cleanup(RDMAContext *rdma)
    {
-    struct rdma_cm_event *cm_event;
-    int ret, idx;
+    int idx;

       if (rdma->cm_id && rdma->connected) {
           if ((rdma->error_state ||
@@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
               qemu_rdma_post_send_control(rdma, NULL, &head);

-        ret = rdma_disconnect(rdma->cm_id);
-        if (!ret) {
-            trace_qemu_rdma_cleanup_waiting_for_disconnect();
-            ret = rdma_get_cm_event(rdma->channel, &cm_event);
-            if (!ret) {
-                rdma_ack_cm_event(cm_event);
-            }
-        }
+        rdma_disconnect(rdma->cm_id);
I'm worried whether this change could break stuff:
The docs say for rdma_disconnect that it flushes any posted work
requests to the completion queue; so unless we wait for the event,
do we know the stuff has been flushed?  In the normal non-cancel case
I'm worried that means we could lose something.
(But I don't know the rdma/infiniband specs well enough to know if it's
really a problem.)
The qemu_fclose function invokes qemu_fflush(f) before invoking the close
callback, so I think it's safe for the normal migration case.
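
(For reference, a simplified sketch of that ordering, abbreviated from
migration/qemu-file.c of this era rather than quoted exactly:)

/* Sketch of qemu_fclose(): buffered data is flushed before the close hook
 * (which for RDMA ends up in qemu_rdma_cleanup()) is invoked. */
int qemu_fclose(QEMUFile *f)
{
    int ret;
    qemu_fflush(f);                          /* flush pending writes first */
    ret = qemu_file_get_error(f);
    if (f->ops->close) {
        int ret2 = f->ops->close(f->opaque); /* then the close callback */
        if (ret >= 0) {
            ret = ret2;
        }
    }
    g_free(f);
    return ret;
}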

As for the root cause of why RDMA_CM_EVENT_DISCONNECTED is not received
after rdma_disconnect, I looped in Aviad Yehezkel <address@hidden>; Aviad
will help find the root cause.

Regarding the previous discussion on that thread:
rdma_disconnect can be invoked even without invoking ibv_dereg_mr first.

common teardown flow:

   <-- here resources like QPs are still alive and can DMA to RAM
   <-- all resources are destroyed, but the MR can be still active
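
(A sketch of that common teardown flow using generic librdmacm/libibverbs
calls; the flow is illustrative, not QEMU's exact cleanup code:)

#include <rdma/rdma_cma.h>

/* Illustrative teardown ordering for the points marked above. */
static void common_teardown_flow(struct rdma_cm_id *id, struct ibv_mr *mr,
                                 struct rdma_event_channel *channel)
{
    rdma_disconnect(id);
    /* <-- here resources like QPs are still alive and can DMA to RAM */
    rdma_destroy_qp(id);
    ibv_dereg_mr(mr);   /* the slow per-page put_page() path profiled above */
    rdma_destroy_id(id);
    rdma_destroy_event_channel(channel);
    /* <-- all resources are destroyed; without the dereg step the MR
     *     could still be active */
}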


           rdma->connected = false;
diff --git a/migration/trace-events b/migration/trace-events
index d6be74b..64573ff 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
   qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
   qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
   qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_cleanup_waiting_for_disconnect(void) ""
   qemu_rdma_close(void) ""
   qemu_rdma_connect_pin_all_requested(void) ""
   qemu_rdma_connect_pin_all_outcome(bool pin) "%d"

Dr. David Alan Gilbert / address@hidden / Manchester, UK
