qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 3/5] migration/rdma: Allow cancelling while wait


From: Peter Xu
Subject: Re: [Qemu-devel] [PATCH 3/5] migration/rdma: Allow cancelling while waiting for wrid
Date: Fri, 14 Jul 2017 10:57:42 +0800
User-agent: Mutt/1.5.24 (2015-08-30)

On Wed, Jul 12, 2017 at 01:36:07PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (address@hidden) wrote:
> > On Tue, Jul 04, 2017 at 07:49:13PM +0100, Dr. David Alan Gilbert (git) 
> > wrote:
> > > From: "Dr. David Alan Gilbert" <address@hidden>
> > > 
> > > When waiting for a WRID, if the other side dies we end up waiting
> > > for ever with no way to cancel the migration.
> > > Cure this by poll()ing the fd first with a timeout and checking
> > > error flags and migration state.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert <address@hidden>
> > > ---
> > >  migration/rdma.c | 54 
> > > ++++++++++++++++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 48 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/migration/rdma.c b/migration/rdma.c
> > > index 6111e10c70..7273ae9929 100644
> > > --- a/migration/rdma.c
> > > +++ b/migration/rdma.c
> > > @@ -1466,6 +1466,52 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, 
> > > uint64_t *wr_id_out,
> > >      return  0;
> > >  }
> > >  
> > > +/* Wait for activity on the completion channel.
> > > + * Returns 0 on success, none-0 on error.
> > > + */
> > > +static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
> > > +{
> > > +    /*
> > > +     * Coroutine doesn't start until migration_fd_process_incoming()
> > > +     * so don't yield unless we know we're running inside of a coroutine.
> > > +     */
> > > +    if (rdma->migration_started_on_destination) {
> > > +        yield_until_fd_readable(rdma->comp_channel->fd);
> > > +    } else {
> > > +        /* This is the source side, we're in a separate thread
> > > +         * or destination prior to migration_fd_process_incoming()
> > > +         * we can't yield; so we have to poll the fd.
> > > +         * But we need to be able to handle 'cancel' or an error
> > > +         * without hanging forever.
> > > +         */
> > > +        while (!rdma->error_state && !rdma->error_reported &&
> > > +               !rdma->received_error) {
> > 
> > Should we just check rdma->error_state? Since iiuc error_reported only
> > means whether error is reported (set to 1 if reported), and
> > received_error means whether we got explicit error from peer (I think
> > only via a RDMA_CONTROL_ERROR message), and it's 1 if received. IIUC
> > either error_reported or received_error will mean that error_state be
> > set already.
> 
> I think I agree we don't need to check both error_state and
> error_reported; I think it's best to check received_error since if
> the other end fails we want this end to stop waiting for anything else.
> I've not convinced myself that every path from receiving that error
> till this point would have set the error_state (although it probably
> has);  so I suggest:
>    while (!rdma->error_state && !rdma->received_error) {

qemu_rdma_exchange_get_response() is the only place I see to set the
received_error. And when that is triggered, the stack would be:

   qemu_rdma_exchange_get_response() set received_error=true, ret -EIO
     <-- qemu_rdma_exchange_recv() return the -EIO
       <-- qio_channel_rdma_readv() pass the -EIO to error_state, or
       <-- qemu_rdma_registration_handle() pass the -EIO to error_state

But even so, I'm totally fine that we check both.

> 
> > > +            GPollFD pfds[1];
> > > +            pfds[0].fd = rdma->comp_channel->fd;
> > > +            pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
> > > +            /* 0.5s timeout, should be fine for a 'cancel' */
> > > +            switch (qemu_poll_ns(pfds, 1, 500 * 1000 * 1000)) {
> > 
> > (I would be more aggresive, say, 100ms, or even less, for better UI
> >  response with merely nothing lost. But 500ms is okay as well to me)
> 
> Yes, it felt fairly arbitrary here.
> 
> > > +            case 1: /* fd active */
> > > +                return 0;
> > > +
> > > +            case 0: /* Timeout, go around again */
> > > +                break;
> > > +
> > > +            default: /* Error of some type */
> > > +                return -1;
> > > +            }
> > > +
> > > +            if (migrate_get_current()->state == 
> > > MIGRATION_STATUS_CANCELLING) {
> > > +                /* Bail out and let the cancellation happen */
> > > +                return -EPIPE;
> > > +            }
> > > +        }
> > > +    }
> > > +
> > > +    return rdma->error_state || rdma->error_reported ||
> > > +           rdma->received_error;
> > 
> > Similar question here. Maybe just return rdma->error_state?
> 
> As above, I could just drop the error_reported part.

Sure. Will reply on the new version.

-- 
Peter Xu



reply via email to

[Prev in Thread] Current Thread [Next in Thread]