qemu-devel
Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest


From: Daniel P . Berrangé
Subject: Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
Date: Thu, 3 Dec 2020 11:45:12 +0000
User-agent: Mutt/1.14.6 (2020-07-11)

On Thu, Dec 03, 2020 at 06:40:11AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 03, 2020 at 11:32:53AM +0000, Daniel P. Berrangé wrote:
> > On Thu, Dec 03, 2020 at 06:21:47AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> > > > On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > > > > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > > > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé 
> > > > > > > wrote:
> > > > > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. 
> > > > > > > > > > Tsirkin wrote:
> > > > > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. 
> > > > > > > > > >> Berrangé wrote:
> > > > > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. 
> > > > > > > > > >> > Tsirkin wrote:
> > > > > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan 
> > > > > > > > > >> > > Quintela wrote:
> > > > > > > > > >> > > > If we have a paused guest, it can't unplug the 
> > > > > > > > > >> > > > network VF device, so
> > > > > > > > > >> > > > we wait there forever.  Just change the code to give 
> > > > > > > > > >> > > > one error on that
> > > > > > > > > >> > > > case.
> > > > > > > > > >> > > > 
> > > > > > > > > >> > > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > > > > > > > >> > > 
> > > > > > > > > >> > > It's certainly possible but it's management that 
> > > > > > > > > >> > > created
> > > > > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > > > > >> > > a policy? It is possible that management will unpause 
> > > > > > > > > >> > > immediately
> > > > > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > > > > >> > > 
> > > > > > > > > >> > > Yes migration will not happen until guest is
> > > > > > > > > > >> > > unpaused but the same is true of e.g. a guest that is
> > > > > > > > > >> > > stuck
> > > > > > > > > >> > > because of a bug.
> > > > > > > > > >> > 
> > > > > > > > > >> > That's pretty different behaviour from how migration 
> > > > > > > > > >> > normally handles
> > > > > > > > > >> > a paused guest, which is that it is guaranteed to 
> > > > > > > > > >> > complete the migration
> > > > > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > > > > >> > 
> > > > > > > > > >> > Just ignoring the situation I think will lead to 
> > > > > > > > > >> > surprise apps / admins,
> > > > > > > > > >> > because the person/entity invoking the migration is not 
> > > > > > > > > >> > likely to have
> > > > > > > > > > >> > checked whether this particular guest uses net failover
> > > > > > > > > >> > or not before
> > > > > > > > > >> > invoking - they'll just be expecting a paused migration 
> > > > > > > > > >> > to run fast and
> > > > > > > > > >> > be guaranteed to complete.
> > > > > > > > > >> > 
> > > > > > > > > >> > Regards,
> > > > > > > > > >> > Daniel
> > > > > > > > > >> 
> > > > > > > > > >> Okay I guess. But then shouldn't we handle the reverse 
> > > > > > > > > >> situation too:
> > > > > > > > > >> pausing guest after migration started but before device was
> > > > > > > > > >> unplugged?
> > > > > > > > > >> 
> > > > > > > > > >
> > > > > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > > > > pausing guest until migration is cancelled?
> > > > > > > > > >
> > > > > > > > > > All this seems heavy handed to me ...
> > > > > > > > > 
> > > > > > > > > This is the minimal fix that I can think of.
> > > > > > > > > 
> > > > > > > > > Further solution would be:
> > > > > > > > > - Add a new migration parameter: migrate-paused
> > > > > > > > > > - change libvirt to use the new parameter if it exists
> > > > > > > > > > - in qemu, when we start migration (but after we wait for the
> > > > > > > > > >   device unplug), pause the guest before starting migration and
> > > > > > > > > >   resume it after migration finishes.
> > > > > > > > 
> > > > > > > > > It would also have to handle a pause being issued after migration
> > > > > > > > > has been started - delaying the pause request until the unplug is
> > > > > > > > > complete is one answer.
> > > > > > > 
> > > > > > > Hmm my worry would be that pausing is one way to give cpu
> > > > > > > resources back to host. It's problematic if guest can delay
> > > > > > > that indefinitely.
> > > > > > 
> > > > > > hmm, yes, that is awkward.  Perhaps we should just report an 
> > > > > > explicit
> > > > > > error then.
> > > > > 
> > > > > Report an error in response to which command? Do you mean
> > > > > fail migration?
> > > > 
> > > > > If mgmt attempts to pause an existing migration that hasn't finished
> > > > the PCI unplug stage, then fail the pause request.
> > > 
> > > Pause guest not migration ...
> > > Might be tricky ...
> > > 
> > > Let me ask this, why not just produce a warning
> > > > that migration won't finish until guest actually runs?
> > > User will then know and unpause the guest when he wants
> > > migration to succeed ...
> > 
> > > A warning is going to be essentially invisible if the pause command
> > > succeeds.
> 
> I mean the situation here isn't earth shattering, an admin
> created it. Maybe he will unpause shortly
> and all will be well ...

It isn't really about the admin.  It is about countless existing mgmt apps
that expect migration will always succeed if the VM is paused. The mgmt
app triggering the migration is not necessarily the same as the app
which introduced the use of NIC failover in the config.

E.g. in OpenStack, Nova provides the VM config, but there are completely
separate apps built to do automation on top of Nova, which this is
liable to break. There's no human admin there to diagnose this and
re-try with unpause, as all the logic is in the apps.
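
To make the "explicit error" idea concrete, here is a minimal sketch of
the two checks discussed in this thread. It is plain Python for
illustration only, not QEMU code: GuestState, migrate_start_error() and
pause_request_error() are hypothetical names, and the second check is
just one of the options raised above, not an agreed design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GuestState:
    # Hypothetical model of the state relevant to this thread, not a QEMU type.
    paused: bool
    failover_vf_plugged: bool  # failover VF still needs a guest-driven unplug
    migration_active: bool = False


def migrate_start_error(guest: GuestState) -> Optional[str]:
    """Return an error message if migration cannot make progress, else None."""
    # A paused guest cannot acknowledge the VF unplug, so migration would wait forever.
    if guest.paused and guest.failover_vf_plugged:
        return "guest is paused: cannot unplug the failover VF device"
    return None


def pause_request_error(guest: GuestState) -> Optional[str]:
    """Return an error if pausing now would stall an in-flight migration."""
    # The reverse case: migration already started, unplug not yet completed.
    if guest.migration_active and guest.failover_vf_plugged:
        return "migration is waiting for the failover VF unplug: refusing to pause"
    return None
```

The point of returning an error rather than logging a warning is that an
automated mgmt app sees a failed command and can react, whereas a
warning next to a successful command goes unnoticed.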


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
