
From: 858585 jemmy
Subject: Re: [Qemu-devel] [PATCH v4 11/12] migration: poll the cm event while wait RDMA work request completion
Date: Tue, 5 Jun 2018 22:26:13 +0800

On Sun, Jun 3, 2018 at 11:04 PM, Aviad Yehezkel
<address@hidden> wrote:
> +Gal
>
> Gal, please comment with our findings.

Some suggestions from Gal:
1. Regarding the GIOConditions for the GPollFD.events/revents:
G_IO_IN is enough for the cm_channel; if it is not empty, poll will
return POLLIN | POLLRDNORM, but you can also use G_IO_HUP | G_IO_ERR
just to be safe.
2. Please note that you are not currently checking the return values
of qemu_poll_ns, rdma_get_cm_event and rdma_ack_cm_event.
3. Consider checking for specific RDMA_CM_EVENT types, such as
RDMA_CM_EVENT_DISCONNECTED and RDMA_CM_EVENT_DEVICE_REMOVAL; for
example, receiving RDMA_CM_EVENT_ADDR_CHANGE should not result in an
error.
4. It is better to poll the CQ first, and only poll the cm events
queue if there are no new CQEs. This way you will not go into an
error path even though a CQE has already arrived.
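
Putting these together, the loop could look roughly like the sketch
below. This is only an illustration against the fields already used in
the patch (rdma->comp_channel, rdma->channel, rdma->error_state,
rdma->received_error); the function name is made up, only the polling
loop from the non-coroutine path is shown, and it is not the final
patch:

    /* Sketch: check poll errors, poll the CQ channel before the cm
     * channel, and only treat "real" cm events as fatal. */
    static int qemu_rdma_wait_comp_channel_sketch(RDMAContext *rdma)
    {
        while (!rdma->error_state && !rdma->received_error) {
            GPollFD pfds[2];
            int ret;

            pfds[0].fd = rdma->comp_channel->fd;   /* CQ event channel */
            pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
            pfds[0].revents = 0;

            pfds[1].fd = rdma->channel->fd;        /* cm event channel */
            pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
            pfds[1].revents = 0;

            /* 0.1s timeout, should be fine for a 'cancel' */
            ret = qemu_poll_ns(pfds, 2, 100 * 1000 * 1000);
            if (ret < 0) {
                error_report("%s: poll failed", __func__);
                return -EPIPE;
            }

            /* check the CQ channel first so a pending CQE wins over a
             * simultaneous cm event */
            if (pfds[0].revents) {
                return 0;
            }

            if (pfds[1].revents) {
                struct rdma_cm_event *cm_event;

                if (!rdma_get_cm_event(rdma->channel, &cm_event)) {
                    /* save the type: rdma_ack_cm_event frees cm_event */
                    enum rdma_cm_event_type event = cm_event->event;

                    rdma_ack_cm_event(cm_event);
                    if (event == RDMA_CM_EVENT_DISCONNECTED ||
                        event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
                        error_report("receive cm event %d while wait "
                                     "comp channel", event);
                        return -EPIPE;
                    }
                    /* e.g. RDMA_CM_EVENT_ADDR_CHANGE: keep waiting */
                }
            }

            if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
                /* Bail out and let the cancellation happen */
                return -EPIPE;
            }
        }

        return rdma->error_state;
    }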

I will send a new version of the patch.

>
> Thanks!
>
>
> On 5/31/2018 10:36 AM, 858585 jemmy wrote:
>>
>> On Thu, May 31, 2018 at 1:33 AM, Dr. David Alan Gilbert
>> <address@hidden> wrote:
>>>
>>> * Lidong Chen (address@hidden) wrote:
>>>>
>>>> If the peer qemu has crashed, the qemu_rdma_wait_comp_channel function
>>>> may loop forever. So we should also poll the cm event fd, and when we
>>>> receive any cm event, we consider that some error has happened.
>>>>
>>>> Signed-off-by: Lidong Chen <address@hidden>
>>>
>>> I don't understand enough about the way the infiniband fd's work to
>>> fully review this; so I'd appreciate if some one who does could
>>> comment/add their review.
>>
>> Hi Aviad:
>>      we need your help. I also could not find any documentation about
>> the cq channel event fd and the cm channel event fd.
>>      Should we set the events to G_IO_IN | G_IO_HUP | G_IO_ERR? Or is
>> G_IO_IN enough?
>>      pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>      pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>      Thanks.
>>
>>>> ---
>>>>   migration/rdma.c | 35 ++++++++++++++++++++++++-----------
>>>>   1 file changed, 24 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>>> index 1b9e261..d611a06 100644
>>>> --- a/migration/rdma.c
>>>> +++ b/migration/rdma.c
>>>> @@ -1489,6 +1489,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
>>>>    */
>>>>   static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>>>   {
>>>> +    struct rdma_cm_event *cm_event;
>>>> +    int ret = -1;
>>>> +
>>>>       /*
>>>>        * Coroutine doesn't start until migration_fd_process_incoming()
>>>>        * so don't yield unless we know we're running inside of a coroutine.
>>>> @@ -1504,25 +1507,35 @@ static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>>>            * But we need to be able to handle 'cancel' or an error
>>>>            * without hanging forever.
>>>>            */
>>>> -        while (!rdma->error_state  && !rdma->received_error) {
>>>> -            GPollFD pfds[1];
>>>> +        while (!rdma->error_state && !rdma->received_error) {
>>>> +            GPollFD pfds[2];
>>>>               pfds[0].fd = rdma->comp_channel->fd;
>>>>               pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>>> +            pfds[0].revents = 0;
>>>> +
>>>> +            pfds[1].fd = rdma->channel->fd;
>>>> +            pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>>> +            pfds[1].revents = 0;
>>>> +
>>>>               /* 0.1s timeout, should be fine for a 'cancel' */
>>>> -            switch (qemu_poll_ns(pfds, 1, 100 * 1000 * 1000)) {
>>>> -            case 1: /* fd active */
>>>> -                return 0;
>>>> +            qemu_poll_ns(pfds, 2, 100 * 1000 * 1000);
>>>
>>> Shouldn't we still check the return value of this; if it's negative
>>> something has gone wrong.
>>
>> I will fix this.
>> Thanks.
>>
>>> Dave
>>>
>>>> -            case 0: /* Timeout, go around again */
>>>> -                break;
>>>> +            if (pfds[1].revents) {
>>>> +                ret = rdma_get_cm_event(rdma->channel, &cm_event);
>>>> +                if (!ret) {
>>>> +                    rdma_ack_cm_event(cm_event);
>>>> +                }
>>>> +                error_report("receive cm event while wait comp channel,"
>>>> +                             "cm event is %d", cm_event->event);
>>>>
>>>> -            default: /* Error of some type -
>>>> -                      * I don't trust errno from qemu_poll_ns
>>>> -                     */
>>>> -                error_report("%s: poll failed", __func__);
>>>> +                /* consider any rdma communication event as an error */
>>>>                   return -EPIPE;
>>>>               }
>>>>
>>>> +            if (pfds[0].revents) {
>>>> +                return 0;
>>>> +            }
>>>> +
>>>>               if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
>>>>                   /* Bail out and let the cancellation happen */
>>>>                   return -EPIPE;
>>>> --
>>>> 1.8.3.1
>>>>
>>> --
>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>
>


