From: Aviad Yehezkel
Subject: Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
Date: Thu, 17 May 2018 10:46:30 +0300
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0
On 5/17/2018 10:41 AM, 858585 jemmy wrote:
On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel <address@hidden> wrote:

On 5/17/2018 5:42 AM, 858585 jemmy wrote:

On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel <address@hidden> wrote:

Aviad: Hi Lidong and David,
Sorry for the late response; I had to ramp up on the migration code and build a setup on my side. Please see my comments for this patch below. For the RDMA post-copy patches I will comment next week, after testing on the Mellanox side too. Thanks!

On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:

-----Original Message-----
From: Dr. David Alan Gilbert [mailto:address@hidden]
Sent: Wednesday, May 16, 2018 4:13 PM
To: 858585 jemmy <address@hidden>
Cc: Aviad Yehezkel <address@hidden>; Juan Quintela <address@hidden>; qemu-devel <address@hidden>; Gal Shachaf <address@hidden>; Adi Dotan <address@hidden>; Lidong Chen <address@hidden>
Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

* 858585 jemmy (address@hidden) wrote:

<snip>

Dave: I wonder why dereg_mr takes so long - I could understand if reg_mr took a long time, but why for dereg? That sounds like the easy side.

jemmy: I used perf to collect information when ibv_dereg_mr is invoked:
    -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
       - put_compound_page
          - 98.45% put_page
               __ib_umem_release
               ib_umem_release
               dereg_mr
               mlx5_ib_dereg_mr
               ib_dereg_mr
               uverbs_free_mr
               remove_commit_idr_uobject
               _rdma_remove_commit_uobject
               rdma_remove_commit_uobject
               ib_uverbs_dereg_mr
               ib_uverbs_write
               vfs_write
               sys_write
               system_call_fastpath
               __GI___libc_write
          + 1.55% __ib_umem_release
    +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
    +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
    +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
    +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
    +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
    +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
    +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
    +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
    +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
    +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
    +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
    +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
    +   3.09%  client2  [kernel.kallsyms]  [k] put_page

The reason is that __ib_umem_release will loop many times for each page:

    static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
    {
        struct scatterlist *sg;
        struct page *page;
        int i;

        if (umem->nmap > 0)
            ib_dma_unmap_sg(dev, umem->sg_head.sgl,
                            umem->npages,
                            DMA_BIDIRECTIONAL);

        for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {   << loop a lot of times for each page.

Dave: Why 'lot of times for each page'?
Dave: I don't know this code at all, but I'd have expected once per page?

jemmy: Sorry, once per page - but a big virtual machine has a lot of pages.

Dave: Ah OK; so yes, it seems best if you can find a way to do the release in the migration thread then; still, maybe this is something some of the kernel people could look at speeding up?

jemmy: The kernel code seems not complex, but I have no idea how to speed it up.

Dave: Me neither, but I'll ask around.

Aviad: I will ask internally and get back on this one.

Dave: With your other kernel fix, does the problem of the missing RDMA_CM_EVENT_DISCONNECTED events go away?

jemmy: Yes; after the kernel and qemu fixes, this issue never happens again.

Dave: I'm confused; which qemu fix? My question was whether the kernel fix by itself fixed the problem of the missing event.

jemmy: This qemu fix: "migration: update index field when delete or qsort RDMALocalBlock".

Dave: OK, good; so then we shouldn't need this 2/2 patch.

Aviad: It is good that the qemu fix solved this issue, but my colleagues and I think we need the 2/2 patch anyway. According to the IB spec, once the active side sends a DREQ message it should wait for a DREP message, and only once that arrives should it trigger a DISCONNECT event. The DREP message can be dropped due to network issues. For that case the spec defines a DREP_timeout state in the CM state machine: if the DREP is dropped we should get a timeout, and a TIMEWAIT_EXIT event will be triggered. Unfortunately the current kernel CM implementation doesn't include the DREP_timeout state, so in the above scenario we will get neither a DISCONNECT nor a TIMEWAIT_EXIT event. (A similar scenario exists for the passive side.) We think it is best to apply this patch until we enhance the kernel code.

jemmy: Hi Aviad: how long is the DREP_timeout? Does RDMA have tools like tcpdump that I can use to confirm whether the DREP message was received? Thanks.

Aviad: Are you running IB or RoCE?

jemmy: RoCE v2.
Aviad: Then download the latest Wireshark; it should have RDMA RoCE support. Capture packets on the associated Ethernet interface. Let me know if you need help.
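Since RoCE v2 encapsulates the InfiniBand transport in UDP with the IANA-assigned destination port 4791, ordinary capture tools can see it. A minimal sketch of the capture commands; "eth0" is a placeholder interface name, and the script only prints the commands because actually running them needs root and a RoCE-capable NIC:

```shell
IFACE="eth0"         # placeholder: the Ethernet interface carrying RoCE traffic
ROCE_V2_PORT=4791    # IANA-assigned UDP destination port for RoCE v2

# Print the capture commands rather than running them:
echo "tcpdump -i ${IFACE} -w roce.pcap udp dst port ${ROCE_V2_PORT}"
echo "tshark -i ${IFACE} -f \"udp dst port ${ROCE_V2_PORT}\""
```

The resulting roce.pcap can then be opened in Wireshark, which should dissect the IB headers and let you check whether the DREQ/DREP CM exchange appears on the wire.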
jemmy: This issue is also caused by some ram block not being released, but I have not found the root cause.

Dave: Hmm, we should try and track that down.

jemmy: Do you think we should remove rdma_get_cm_event after rdma_disconnect?

Dave: I don't think so; if rdma_disconnect is supposed to generate the event, I think we're supposed to wait for it to know that the disconnect is really complete.

jemmy: After moving qemu_fclose to the migration thread, it will not block the main thread while waiting for the disconnect event.

Dave: I'm not sure about moving the fclose to the migration thread; it worries me with the interaction with cancel and other failures.

Dave

For reference, the rest of the quoted __ib_umem_release function:

        page = sg_page(sg);
        if (umem->writable && dirty)
            set_page_dirty_lock(page);
        put_page(page);
    }

    sg_free_table(&umem->sg_head);
    return;
    }

The patch under discussion:

Anyway, it should not invoke rdma_get_cm_event in the main thread, and the event channel is also destroyed in qemu_rdma_cleanup.

Signed-off-by: Lidong Chen <address@hidden>
---
 migration/rdma.c       | 12 ++----------
 migration/trace-events |  1 -
 2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index 0dd4033..92e4d30 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
 
 static void qemu_rdma_cleanup(RDMAContext *rdma)
 {
-    struct rdma_cm_event *cm_event;
-    int ret, idx;
+    int idx;
 
     if (rdma->cm_id && rdma->connected) {
         if ((rdma->error_state ||
@@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
             qemu_rdma_post_send_control(rdma, NULL, &head);
         }
 
-        ret = rdma_disconnect(rdma->cm_id);
-        if (!ret) {
-            trace_qemu_rdma_cleanup_waiting_for_disconnect();
-            ret = rdma_get_cm_event(rdma->channel, &cm_event);
-            if (!ret) {
-                rdma_ack_cm_event(cm_event);
-            }
-        }
+        rdma_disconnect(rdma->cm_id);

Dave: I'm worried whether this change could break stuff: the docs say for rdma_disconnect that it flushes any posted work requests to the completion queue; so unless we wait for the event, do we know the stuff has been flushed? In the normal non-cancel case I'm worried that means we could lose something. (But I don't know the rdma/infiniband specs well enough to know if it's really a problem.)

jemmy: In the qemu_fclose function, it invokes qemu_fflush(f) before invoking f->ops->close, so I think it's safe for the normal migration case. For the root cause of why the RDMA_CM_EVENT_DISCONNECTED event is not received after rdma_disconnect, I've looped in Aviad Yehezkel <address@hidden>; Aviad will help us find the root cause.

Aviad: Regarding the previous discussion on that thread: rdma_disconnect can be invoked even without invoking ib_dereg_mr first. Common teardown flow:

    rdma_disconnect      <-- here resources like QPs are still alive and can DMA to RAM
    rdma_destroy_cm_id   <-- all resources are destroyed, but MRs can still be active
    rdma_dereg_mr

Dave

The remainder of the patch:

         trace_qemu_rdma_cleanup_disconnect();
         rdma->connected = false;
     }
 
diff --git a/migration/trace-events b/migration/trace-events
index d6be74b..64573ff 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
 qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
 qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
 qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_cleanup_waiting_for_disconnect(void) ""
 qemu_rdma_close(void) ""
 qemu_rdma_connect_pin_all_requested(void) ""
 qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
--
1.8.3.1

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK