qemu-devel
From: Aviad Yehezkel
Subject: Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
Date: Thu, 17 May 2018 10:46:30 +0300
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0



On 5/17/2018 10:41 AM, 858585 jemmy wrote:
On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel
<address@hidden> wrote:

On 5/17/2018 5:42 AM, 858585 jemmy wrote:
On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
<address@hidden> wrote:
Hi Lidong and David,
Sorry for the late response, I had to ramp up on the migration code and build a setup on my side.

PSB my comments for this patch below.
For the RDMA post-copy patches I will comment next week, after testing on the Mellanox side too.

Thanks!

On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:

-----Original Message-----
From: Dr. David Alan Gilbert [mailto:address@hidden
Sent: Wednesday, May 16, 2018 4:13 PM
To: 858585 jemmy <address@hidden>
Cc: Aviad Yehezkel <address@hidden>; Juan Quintela <address@hidden>; qemu-devel <address@hidden>; Gal Shachaf <address@hidden>; Adi Dotan <address@hidden>; Lidong Chen <address@hidden>
Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

* 858585 jemmy (address@hidden) wrote:

<snip>

I wonder why dereg_mr takes so long - I could understand if reg_mr took a long time, but why for dereg? That sounds like the easy side.
I used perf to collect this information when ibv_dereg_mr is invoked:

-   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
      - put_compound_page
         - 98.45% put_page
              __ib_umem_release
              ib_umem_release
              dereg_mr
              mlx5_ib_dereg_mr
              ib_dereg_mr
              uverbs_free_mr
              remove_commit_idr_uobject
              _rdma_remove_commit_uobject
              rdma_remove_commit_uobject
              ib_uverbs_dereg_mr
              ib_uverbs_write
              vfs_write
              sys_write
              system_call_fastpath
              __GI___libc_write
              0
         + 1.55% __ib_umem_release
+   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
+   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
+   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
+   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
+   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
+   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
+   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
+   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
+   3.65%  client2  [kernel.kallsyms]  [k] release_pages
+   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
+   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
+   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
+   3.09%  client2  [kernel.kallsyms]  [k] put_page
The reason is that __ib_umem_release will loop a lot of times for each page:

static void __ib_umem_release(struct ib_device *dev,
                              struct ib_umem *umem, int dirty)
{
    struct scatterlist *sg;
    struct page *page;
    int i;

    if (umem->nmap > 0)
        ib_dma_unmap_sg(dev, umem->sg_head.sgl,
                        umem->npages,
                        DMA_BIDIRECTIONAL);

    for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
        <<  this loops a lot of times, for each page, here
Why 'a lot of times for each page'?  I don't know this code at all, but I'd have expected once per page?
Sorry, once per page; but that is a lot of pages for a big virtual machine.
Ah OK; so yes it seems best if you can find a way to do the release
in the migration thread then;  still maybe this is something some of
the kernel people could look at speeding up?
The kernel code seems not complex, but I have no idea how to speed it up.
Me neither; but I'll ask around.
I will ask internally and get back on this one.

With your other kernel fix, does the problem of the missing
RDMA_CM_EVENT_DISCONNECTED events go away?
Yes, after the kernel and qemu fixes, this issue never happens again.
I'm confused; which qemu fix; my question was whether the kernel fix
by itself fixed the problem of the missing event.
this qemu fix:
migration: update index field when delete or qsort RDMALocalBlock
OK good; so then we shouldn't need this 2/2 patch.
It is good that the qemu fix solved this issue, but my colleagues and I think we need the 2/2 patch anyway.
According to the IB spec, once the active side sends a DREQ message it should wait for a DREP message, and only once that has arrived should it trigger a DISCONNECT event.
The DREP message can be dropped due to network issues.
For that case the spec defines a DREP_timeout state in the CM state machine; if the DREP is dropped we should get a timeout, and a TIMEWAIT_EXIT event will be triggered.
Unfortunately the current kernel CM implementation doesn't include the DREP_timeout state, so in the above scenario we will get neither a DISCONNECT nor a TIMEWAIT_EXIT event.
(A similar scenario exists for the passive side.)
We think it is best to apply this patch until we enhance the kernel code.

Hi Aviad:
      How long is the DREP_timeout state?
      Does RDMA have a tool like tcpdump that I can use to confirm whether the DREP message was received?

      Thanks.
Are you running IB or RoCE?
RoCE v2.
Then download the latest Wireshark; it has RDMA RoCE support.
Capture packets on the associated Ethernet interface.
Let me know if you need help.

This issue is also caused by some ram block not being released, but I have not found the root cause.
Hmm, we should try and track that down.

Do you think we should remove rdma_get_cm_event after
rdma_disconnect?
I don't think so; if 'rdma_disconnect' is supposed to generate the
event I think we're supposed to wait for it to know that the
disconnect is really complete.
After moving qemu_fclose to the migration thread, waiting for the disconnection event will not block the main thread.
I'm not sure about moving the fclose to the migration thread; it worries
me with the interaction with cancel and other failures.

Dave

        page = sg_page(sg);
        if (umem->writable && dirty)
            set_page_dirty_lock(page);
        put_page(page);
    }

    sg_free_table(&umem->sg_head);
    return;
}

Dave


Anyway, it should not invoke rdma_get_cm_event in the main thread, and the event channel is also destroyed in qemu_rdma_cleanup.

Signed-off-by: Lidong Chen <address@hidden>
---
    migration/rdma.c       | 12 ++----------
    migration/trace-events |  1 -
    2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 0dd4033..92e4d30 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,

 static void qemu_rdma_cleanup(RDMAContext *rdma)
 {
-    struct rdma_cm_event *cm_event;
-    int ret, idx;
+    int idx;

     if (rdma->cm_id && rdma->connected) {
         if ((rdma->error_state ||
@@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
             qemu_rdma_post_send_control(rdma, NULL, &head);
         }

-        ret = rdma_disconnect(rdma->cm_id);
-        if (!ret) {
-            trace_qemu_rdma_cleanup_waiting_for_disconnect();
-            ret = rdma_get_cm_event(rdma->channel, &cm_event);
-            if (!ret) {
-                rdma_ack_cm_event(cm_event);
-            }
-        }
+        rdma_disconnect(rdma->cm_id);
I'm worried whether this change could break stuff:
The docs say for rdma_disconnect that it flushes any posted work requests to the completion queue; so unless we wait for the event, do we know the stuff has been flushed?  In the normal non-cancel case I'm worried that means we could lose something.
(But I don't know the rdma/infiniband specs well enough to know if it's really a problem).
The qemu_fclose function invokes qemu_fflush(f) before invoking f->ops->close, so I think it's safe for the normal migration case.

For the root cause of why the RDMA_CM_EVENT_DISCONNECTED event is not received after rdma_disconnect, I looped in Aviad Yehezkel <address@hidden>; Aviad will help us find the root cause.

Regarding the previous discussion on that thread:
rdma_disconnect can be invoked even without invoking ib_dreg_mr first.

Common teardown flow:

rdma_disconnect
    <-- here resources like QPs are still alive and can DMA to RAM
rdma_destroy_cm_id
    <-- all resources are destroyed, but the MR can still be active
rdma_dreg_mr



Dave

            trace_qemu_rdma_cleanup_disconnect();
            rdma->connected = false;
        }
diff --git a/migration/trace-events b/migration/trace-events
index d6be74b..64573ff 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
 qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
 qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
 qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_cleanup_waiting_for_disconnect(void) ""
 qemu_rdma_close(void) ""
 qemu_rdma_connect_pin_all_requested(void) ""
 qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
--
1.8.3.1

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK




