Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation


From: Kevin Wolf
Subject: Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation
Date: Tue, 4 May 2021 12:57:29 +0200

On 04.05.2021 at 11:44, Michael S. Tsirkin wrote:
> On Tue, May 04, 2021 at 11:27:12AM +0200, Kevin Wolf wrote:
> > On 04.05.2021 at 10:59, Michael S. Tsirkin wrote:
> > > On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > > > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > > > 
> > > > Usually, an error during initialisation means that the configuration was
> > > > wrong. Reconnecting won't make the error go away, but just turn the
> > > > error condition into an endless loop. Avoid this and return errors
> > > > again.
> > > 
> > > So there are several possible reasons for an error:
> > > 
> > > 1. remote restarted - we would like to reconnect,
> > >    this was the original use case for reconnect.
> > > 
> > >    I am not very happy that we are killing this use case.
> > 
> > This patch is killing it only during initialisation, where it's quite
> > unlikely compared to other cases and where the current implementation is
> > rather broken. So reverting the broken feature and going back to a
> > simpler correct state feels like a good idea to me.
> > 
> > The idea is to add the "retry during initialisation" feature back on top
> > of this, but it requires some more changes in the error paths so that we
> > can actually distinguish different kinds of errors and don't retry when
> > we already know that it can't succeed.
> 
> Okay ... let's make all this explicit in the commit log though, ok?

That's fair, I'll add a paragraph addressing this case when merging the
series, like this:

    Note that this removes the ability to reconnect during
    initialisation (but not during operation) when there is no permanent
    error but the backend restarts, because the reconnect implementation
    was buggy. This feature can be added back in a follow-up series after
    changing the error paths to distinguish cases where retrying could
    help from cases with permanent errors.

> > > 2. qemu detected an error and closed the connection.
> > >    It looks like we try to handle that by reconnecting;
> > >    this is something we should address.
> > 
> > Yes, if qemu produces the error locally, retrying is useless.
> > 
> > > 3. remote failed due to a bad command from qemu.
> > >    This use case isn't well supported at the moment.
> > > 
> > >    How about supporting it on the remote side? I think that if the
> > >    data is well-formed but just has a configuration the remote cannot
> > >    support, then instead of closing the connection, the remote can
> > >    wait for commands with need_reply set and respond with an error.
> > >    Or at least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been
> > >    negotiated. If VHOST_USER_SET_VRING_ERR is used then signalling
> > >    that fd might also be reasonable.
> > > 
> > >    OTOH, if qemu is buggy and sends malformed data and the remote
> > >    detects that, then having qemu retry forever is okay; it might
> > >    actually be beneficial for debugging.
> > 
> > I haven't really checked this case yet; it seems to be less common.
> > Explicitly communicating an error is certainly better than just cutting
> > the connection. But as you say, it means QEMU is buggy, so blindly
> > retrying in this case is kind of acceptable.
> > 
> > Raphael suggested that we could limit the number of retries during
> > initialisation so that it would at least not result in a hang.
> 
> Not sure how I feel about random limits ... how would we set the
> limit?

To be honest, probably even 1 would already be good enough in practice.
Make it 5 or something and you definitely cover any realistic case when
there is no bug involved.
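
Just to illustrate what I mean, a rough sketch of a bounded retry
(completely untested; the names, the constant and the transient-error
check are all made up for the example, not taken from the actual code):

    #include <errno.h>
    #include <stdbool.h>

    /* Hypothetical connect/configure step: returns 0 on success or a
     * negative errno-style value on failure. */
    extern int try_realize_connect(void);

    /* Only errors that can plausibly go away on their own (e.g. the
     * backend dropping the connection while it restarts) are worth
     * retrying; local configuration errors are permanent. */
    static bool error_is_transient(int err)
    {
        return err == -ECONNRESET || err == -EPIPE;
    }

    #define REALIZE_CONN_RETRIES 5    /* arbitrary small bound */

    static int realize_with_retries(void)
    {
        int retries = REALIZE_CONN_RETRIES;
        int ret;

        do {
            ret = try_realize_connect();
        } while (ret < 0 && error_is_transient(ret) && retries-- > 0);

        return ret;
    }

The point is just that the loop gives up immediately when the error
clearly can't be fixed by retrying, and otherwise after a handful of
attempts, instead of looping forever.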

Even hitting this case once requires bad luck with the timing: the
restart of the backend has to coincide with the point where we have
already connected to the socket, but haven't completed the
configuration yet, which is a really short window. Having the backend
drop the connection again in the same short window on the second
attempt is an almost sure sign of a bug in one of the operations done
during initialisation.

Even if this corner case turned out to be a bit less unlikely than I
think (and I think it won't happen at all), randomly failing a
device-add once in a while still feels a lot better than hanging the VM
once in a while.
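
(Coming back to your point 3 for a moment: if the backend wants to
report a failure without dropping the connection, I'd imagine the
remote side doing something roughly like the sketch below when
VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated and need_reply is
set. The struct layout and the helper names are simplified and made up
for the example.)

    #include <stdint.h>

    #define VHOST_USER_VERSION     0x1
    #define VHOST_USER_REPLY_MASK  (1u << 2)
    #define VHOST_USER_NEED_REPLY  (1u << 3)

    /* Simplified message: only the u64 payload used for the ack. */
    typedef struct VhostUserMsg {
        uint32_t request;
        uint32_t flags;
        uint32_t size;
        uint64_t payload_u64;
    } VhostUserMsg;

    /* Assumed to exist in the backend: the actual request processing
     * and writing a message back to the master's socket. */
    extern int process_request(VhostUserMsg *req);
    extern int send_msg(int fd, VhostUserMsg *msg);

    static int handle_request(int fd, VhostUserMsg *req)
    {
        /* Non-zero means the request was well-formed but can't be
         * supported with the current configuration. */
        int ret = process_request(req);

        if (req->flags & VHOST_USER_NEED_REPLY) {
            VhostUserMsg ack = {
                .request = req->request,
                .flags = VHOST_USER_VERSION | VHOST_USER_REPLY_MASK,
                .size = sizeof(uint64_t),
                .payload_u64 = ret ? 1 : 0,  /* non-zero reports failure */
            };
            return send_msg(fd, &ack);
        }

        /* Without need_reply there is no clean way to report the
         * problem, so closing the connection remains the fallback. */
        return ret;
    }

The master would then get an explicit error for the offending request
and could fail the operation cleanly instead of just seeing the
connection go away.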

Kevin