qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v1 1/1] migration: Fix yank on postcopy multifd crashing gues


From: Peter Xu
Subject: Re: [PATCH v1 1/1] migration: Fix yank on postcopy multifd crashing guest after migration
Date: Tue, 29 Nov 2022 15:50:13 -0500

On Tue, Nov 29, 2022 at 05:28:26PM -0300, Leonardo Bras Soares Passos wrote:
> Hello Peter,

Leo,

> 
> On Thu, Nov 24, 2022 at 1:04 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Nov 09, 2022 at 02:56:29AM -0300, Leonardo Bras wrote:
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index a0cdb714f7..250caff7f4 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -1889,6 +1889,8 @@ static void *postcopy_ram_listen_thread(void 
> > > *opaque)
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > > +    migration_load_cleanup();
> >
> > It's a bit weird to call multifd-load-clean in a listen phase..
> 
> I agree.
> 
> >
> > How about moving it right above
> > trace_process_incoming_migration_co_postcopy_end_main()?  Then the new
> > helper can also be static.
> 
> Seems a nice Idea to have this function to be static.
> 
> We have to guarantee this is run after the migration finished, but
> before migration_incoming_state_destroy().

IIUC it doesn't need to be when migration finished.  It should be fine as
long as we finished precopy phase, and that's what the migration coroutine
does, iiuc.  The thing is postcopy doesn't use multifd at all, so logically
it can be released before postcopy starts.

Actually, IMHO it'll be safer to do it like that, just to make sure we
won't accidentally receive multifd pages _after_ postcopy starts, because
that'll be another more severe and hard to debug issue since the guest can
see partial copied pages from multifd recv channels.

> 
> You suggested calling it right above of
> trace_process_incoming_migration_co_postcopy_end_main(), which git
> grep pointed me to an if clause in process_incoming_migration_co().
> If I got the location correctly, it would not help: this coroutine is
> ran just after the VM went to the target host, and not when the
> migration finished.
> 
> If we are using multifd channels, this will break the migration with
> segmentation fault (SIGSEGV), since the channels have not finished
> sending yet.

If this happens, then I had a feeling that there's something else that
needs syncs.  As I discussed above, we should make sure multifd pages all
landed before we start vcpu threads.

Said that, now I think I'm not against your original proposal to fix this
immediate crash.  However I am still wondering whether we really should
disable multifd with postcopy, as there seem to be still a few missing
pieces even to enable multifd during precopy-only.

Thanks,

-- 
Peter Xu




reply via email to

[Prev in Thread] Current Thread [Next in Thread]