Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
From: 858585 jemmy
Subject: Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
Date: Thu, 17 May 2018 15:41:29 +0800
On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel
<address@hidden> wrote:
>
>
> On 5/17/2018 5:42 AM, 858585 jemmy wrote:
>>
>> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
>> <address@hidden> wrote:
>>>
>>> Hi Lidong and David,
>>> Sorry for the late response, I had to ramp up on the migration code and
>>> build a setup on my side.
>>>
>>> PSB my comments for this patch below.
>>> For the RDMA post-copy patches I will comment next week after testing on
>>> Mellanox side too.
>>>
>>> Thanks!
>>>
>>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Dr. David Alan Gilbert [mailto:address@hidden
>>>> Sent: Wednesday, May 16, 2018 4:13 PM
>>>> To: 858585 jemmy <address@hidden>
>>>> Cc: Aviad Yehezkel <address@hidden>; Juan Quintela
>>>> <address@hidden>; qemu-devel <address@hidden>; Gal Shachaf
>>>> <address@hidden>; Adi Dotan <address@hidden>; Lidong Chen
>>>> <address@hidden>
>>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED
>>>> event after rdma_disconnect
>>>>
>>>> * 858585 jemmy (address@hidden) wrote:
>>>>
>>>> <snip>
>>>>
>>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if
>>>>>>>>>> reg_mr took a long time, but why for dereg, that sounds like the
>>>>>>>>>> easy side.
>>>>>>>>>
>>>>>>>>> I used perf to collect this information while ibv_dereg_mr is invoked.
>>>>>>>>>
>>>>>>>>> - 9.95% client2 [kernel.kallsyms] [k] put_compound_page
>>>>>>>>>    - put_compound_page
>>>>>>>>> - 98.45% put_page
>>>>>>>>> __ib_umem_release
>>>>>>>>> ib_umem_release
>>>>>>>>> dereg_mr
>>>>>>>>> mlx5_ib_dereg_mr
>>>>>>>>> ib_dereg_mr
>>>>>>>>> uverbs_free_mr
>>>>>>>>> remove_commit_idr_uobject
>>>>>>>>> _rdma_remove_commit_uobject
>>>>>>>>> rdma_remove_commit_uobject
>>>>>>>>> ib_uverbs_dereg_mr
>>>>>>>>> ib_uverbs_write
>>>>>>>>> vfs_write
>>>>>>>>> sys_write
>>>>>>>>> system_call_fastpath
>>>>>>>>> __GI___libc_write
>>>>>>>>> 0
>>>>>>>>> + 1.55% __ib_umem_release
>>>>>>>>> + 8.31% client2 [kernel.kallsyms] [k] compound_unlock_irqrestore
>>>>>>>>> + 7.01% client2 [kernel.kallsyms] [k] page_waitqueue
>>>>>>>>> + 7.00% client2 [kernel.kallsyms] [k] set_page_dirty
>>>>>>>>> + 6.61% client2 [kernel.kallsyms] [k] unlock_page
>>>>>>>>> + 6.33% client2 [kernel.kallsyms] [k] put_page_testzero
>>>>>>>>> + 5.68% client2 [kernel.kallsyms] [k] set_page_dirty_lock
>>>>>>>>> + 4.30% client2 [kernel.kallsyms] [k] __wake_up_bit
>>>>>>>>> + 4.04% client2 [kernel.kallsyms] [k] free_pages_prepare
>>>>>>>>> + 3.65% client2 [kernel.kallsyms] [k] release_pages
>>>>>>>>> + 3.62% client2 [kernel.kallsyms] [k] arch_local_irq_save
>>>>>>>>> + 3.35% client2 [kernel.kallsyms] [k] page_mapping
>>>>>>>>> + 3.13% client2 [kernel.kallsyms] [k] get_pageblock_flags_group
>>>>>>>>> + 3.09% client2 [kernel.kallsyms] [k] put_page
>>>>>>>>>
>>>>>>>>> the reason is __ib_umem_release will loop many times for each page.
>>>>>>>>>
>>>>>>>>> static void __ib_umem_release(struct ib_device *dev,
>>>>>>>>>                               struct ib_umem *umem, int dirty)
>>>>>>>>> {
>>>>>>>>>     struct scatterlist *sg;
>>>>>>>>>     struct page *page;
>>>>>>>>>     int i;
>>>>>>>>>
>>>>>>>>>     if (umem->nmap > 0)
>>>>>>>>>         ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->npages,
>>>>>>>>>                         DMA_BIDIRECTIONAL);
>>>>>>>>>
>>>>>>>>>     for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {   << loop a lot of times for each page here
>>>>>>>>
>>>>>>>> Why 'lot of times for each page'? I don't know this code at all,
>>>>>>>> but I'd expected once per page?
>>>>>>>
>>>>>>> sorry, once per page, but there are a lot of pages for a big virtual
>>>>>>> machine.
>>>>>>
>>>>>> Ah OK; so yes it seems best if you can find a way to do the release
>>>>>> in the migration thread then; still maybe this is something some of
>>>>>> the kernel people could look at speeding up?
>>>>>
>>>>> The kernel code does not seem complex, but I have no idea how to speed
>>>>> it up.
>>>>
>>>> Me neither; but I'll ask around.
>>>
>>> I will ask internally and get back on this one.
>>>>
>>>>
>>>>>>>> With your other kernel fix, does the problem of the missing
>>>>>>>> RDMA_CM_EVENT_DISCONNECTED events go away?
>>>>>>>
>>>>>>> Yes, with both the kernel and qemu fixes applied, this issue never
>>>>>>> happens again.
>>>>>>
>>>>>> I'm confused; which qemu fix; my question was whether the kernel fix
>>>>>> by itself fixed the problem of the missing event.
>>>>>
>>>>> this qemu fix:
>>>>> migration: update index field when delete or qsort RDMALocalBlock
>>>>
>>>> OK good; so then we shouldn't need this 2/2 patch.
>>>
>>> It is good that the qemu fix solved this issue, but my colleagues and I
>>> think we need the 2/2 patch anyway.
>>> According to the IB spec, once the active side sends a DREQ message it
>>> should wait for a DREP message, and only once that arrives should it
>>> trigger a DISCONNECT event.
>>> A DREP message can be dropped due to network issues.
>>> For that case the spec defines a DREP_timeout state in the CM state
>>> machine: if the DREP is dropped we should get a timeout, and a
>>> TIMEWAIT_EXIT event will be triggered.
>>> Unfortunately the current kernel CM implementation doesn't include the
>>> DREP_timeout state, so in the above scenario we will not get a DISCONNECT
>>> or TIMEWAIT_EXIT event.
>>> (A similar scenario exists for the passive side.)
>>> We think it is best to apply this patch until we enhance the kernel code.
>>>
>> Hi Aviad:
>> How long is the DREP_timeout?
>> Does RDMA have a tool like tcpdump that I can use to confirm whether the
>> DREP message was received?
>>
>> Thanks.
>
> Are you running IB or RoCE?
RoCE v2.
>>>>>
>>>>> this issue is also caused by some ram blocks not being released, but I
>>>>> have not found the root cause.
>>>>
>>>> Hmm, we should try and track that down.
>>>>
>>>>>>> Do you think we should remove rdma_get_cm_event after
>>>>>>> rdma_disconnect?
>>>>>>
>>>>>> I don't think so; if 'rdma_disconnect' is supposed to generate the
>>>>>> event I think we're supposed to wait for it to know that the
>>>>>> disconnect is really complete.
>>>>>
>>>>> After moving qemu_fclose to the migration thread, it will not block the
>>>>> main thread while waiting for the disconnect event.
>>>>
>>>> I'm not sure about moving the fclose to the migration thread; it worries
>>>> me with the interaction with cancel and other failures.
>>>>
>>>> Dave
>>>>
>>>>>> Dave
>>>>>>
>>>>>>>> Dave
>>>>>>>>
>>>>>>>>>         page = sg_page(sg);
>>>>>>>>>         if (umem->writable && dirty)
>>>>>>>>>             set_page_dirty_lock(page);
>>>>>>>>>         put_page(page);
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     sg_free_table(&umem->sg_head);
>>>>>>>>>     return;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>>>>> Anyway, it should not invoke rdma_get_cm_event in main
>>>>>>>>>>>>>>> thread, and the event channel is also destroyed in
>>>>>>>>>>>>>>> qemu_rdma_cleanup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Lidong Chen <address@hidden>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>> migration/rdma.c | 12 ++----------
>>>>>>>>>>>>>>> migration/trace-events | 1 -
>>>>>>>>>>>>>>> 2 files changed, 2 insertions(+), 11 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>>>>>>>>>>>>>> index 0dd4033..92e4d30 100644
>>>>>>>>>>>>>>> --- a/migration/rdma.c
>>>>>>>>>>>>>>> +++ b/migration/rdma.c
>>>>>>>>>>>>>>> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>> -    struct rdma_cm_event *cm_event;
>>>>>>>>>>>>>>> -    int ret, idx;
>>>>>>>>>>>>>>> +    int idx;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      if (rdma->cm_id && rdma->connected) {
>>>>>>>>>>>>>>>          if ((rdma->error_state ||
>>>>>>>>>>>>>>> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>>>>>>>>>>>>>>>              qemu_rdma_post_send_control(rdma, NULL, &head);
>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -        ret = rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>> -        if (!ret) {
>>>>>>>>>>>>>>> -            trace_qemu_rdma_cleanup_waiting_for_disconnect();
>>>>>>>>>>>>>>> -            ret = rdma_get_cm_event(rdma->channel, &cm_event);
>>>>>>>>>>>>>>> -            if (!ret) {
>>>>>>>>>>>>>>> -                rdma_ack_cm_event(cm_event);
>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>> -        }
>>>>>>>>>>>>>>> +        rdma_disconnect(rdma->cm_id);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm worried whether this change could break stuff:
>>>>>>>>>>>>>> The docs say for rdma_disconnect that it flushes any posted
>>>>>>>>>>>>>> work
>>>>>>>>>>>>>> requests to the completion queue; so unless we wait for the
>>>>>>>>>>>>>> event
>>>>>>>>>>>>>> do we know the stuff has been flushed? In the normal
>>>>>>>>>>>>>> non-cancel case
>>>>>>>>>>>>>> I'm worried that means we could lose something.
>>>>>>>>>>>>>> (But I don't know the rdma/infiniband specs well enough to
>>>>>>>>>>>>>> know
>>>>>>>>>>>>>> if it's
>>>>>>>>>>>>>> really a problem).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The qemu_fclose function invokes qemu_fflush(f) before invoking
>>>>>>>>>>>>> f->ops->close,
>>>>>>>>>>>>> so I think it's safe for the normal migration case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the root cause of why the RDMA_CM_EVENT_DISCONNECTED event
>>>>>>>>>>>>> is not received after rdma_disconnect, I have looped in Aviad
>>>>>>>>>>>>> Yehezkel <address@hidden>; Aviad will help us find the root
>>>>>>>>>>>>> cause.
>>>
>>>
>>> Regarding the previous discussion on that thread:
>>> rdma_disconnect can be invoked even without invoking ib_dereg_mr first.
>>>
>>> The common teardown flow:
>>>
>>> rdma_disconnect
>>> <-- here resources like QPs are still alive and can DMA to RAM
>>> rdma_destroy_cm_id
>>> <-- all resources are destroyed, but MRs can still be active
>>> ib_dereg_mr
>>>
>>>
>>>
>>>>>>>>>>>>>> Dave
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          trace_qemu_rdma_cleanup_disconnect();
>>>>>>>>>>>>>>>          rdma->connected = false;
>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>> diff --git a/migration/trace-events b/migration/trace-events
>>>>>>>>>>>>>>> index d6be74b..64573ff 100644
>>>>>>>>>>>>>>> --- a/migration/trace-events
>>>>>>>>>>>>>>> +++ b/migration/trace-events
>>>>>>>>>>>>>>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d"
>>>>>>>>>>>>>>>  qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
>>>>>>>>>>>>>>>  qemu_rdma_block_for_wrid_miss(const char *wcompstr, int wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s (%d) but got %s (%" PRIu64 ")"
>>>>>>>>>>>>>>> qemu_rdma_cleanup_disconnect(void) ""
>>>>>>>>>>>>>>> -qemu_rdma_cleanup_waiting_for_disconnect(void) ""
>>>>>>>>>>>>>>> qemu_rdma_close(void) ""
>>>>>>>>>>>>>>> qemu_rdma_connect_pin_all_requested(void) ""
>>>>>>>>>>>>>>> qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> 1.8.3.1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK