qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [PATCH 0/2] migration/channel errors and cancelling
Date: Tue, 19 Dec 2017 11:33:21 +0000
User-agent: Mutt/1.9.1 (2017-09-22)

* Peter Xu (address@hidden) wrote:
> On Tue, Dec 19, 2017 at 10:14:08AM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Xu (address@hidden) wrote:
> > > On Fri, Dec 15, 2017 at 05:16:53PM +0000, Dr. David Alan Gilbert (git) 
> > > wrote:
> > > > From: "Dr. David Alan Gilbert" <address@hidden>
> > > > 
> > > > Hi,
> > > >   Where a channel fails asynchronously during connect, call
> > > > back through the migration code so it can clean up.
> > > >   In particular this causes the transition of a 'cancelling' state
> > > > to 'cancelled' in the case of:
> > > > 
> > > >     migrate -d tcp:deadhost:port
> > > >      <host tries to connect>
> > > >     migrate_cancel
> > > > 
> > > > previously the status would get stuck in cancelling because
> > > > the final cleanup didn't happen.
> > > > 
> > > >   This is the second part of the fix for:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1525899
> > > 
> > > IIUC this series tries to deliver the connection error a long way
> > > until migrate_fd_connect() to handle it. But, haven't we already have
> > > a function migrate_fd_error() to do that (which is faster, and
> > > simpler)?
> > > 
> > > void migrate_fd_error(MigrationState *s, const Error *error)
> > > {
> > >     trace_migrate_fd_error(error_get_pretty(error));
> > >     assert(s->to_dst_file == NULL);
> > >     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> > >                       MIGRATION_STATUS_FAILED);
> > >     migrate_set_error(s, error);
> > >     notifier_list_notify(&migration_state_notifiers, s);
> > >     block_cleanup_parameters(s);
> > > }
> > > 
> > > I think it's not handling the case when cancelling.  If we let it to
> > > handle the cancelling case well, would it be a simpler fix?
> > >
> > > Moreover, I think this is another good example that migration is not
> > > handling the cleanup "cleanly" in general... I really hope we can do
> > > this better in 2.12.  I'll see whether I can give it a shot, but in
> > > all cases it'll be after the merging of existing patches since there
> > > are already quite a lot of dangling patches.
> > 
> > No, I think migrate_fd_error is the cause of the problem here, not the
> > answer.
> 
> Could I ask why migrate_fd_error() is problematic?  Yeah I agree that
> we should have a single point to clean things up, then can we call
> migrate_fd_cleanup() somehow inside migrate_fd_error()?
> 
> The thing I don't really understand is: why we want the error be
> delivered via those functions (migration_channel_connect,
> migrate_fd_connect, etc.) to finally be cleaned up.

migrate_fd_cleanup has a lot of code for dealing with different cases:
   a) Closing the to_dst_file
   b) joining the thread if its already running
   c) Cleaning up multifd (stub)
   d) finishing of cancel
   e) notification
   f) block cleanup

we seem to have copied some of those into migrate_fd_error - but not all
of them.  In this case the one we're missing is (d) got finishing
cancel;
when you issue a cancel command we move from whatever state we were in
to the 'cancelling' state, various things get cleaned up and eventually
we move to cancelled; that wasn't happening because we missed the
code in migrate_fd_cleanup out.  So we could copy that code (copied code
is bad) or we could just make sure migrate_fd_cleanup is called like
it normally is.

The other thinking is that at the moment the migration/socket.c and tls.c
etc code has to choose between callbacks into the main migration code,
either migration_channel_connet or migrate_fd_error - now it's simpler,
once you've asked for an outgoing migration you always get a callback
to migration_channel_connect.  Much more predictable.

Dave

> > 
> > If we stick to the simple rule that a migration must always call
> > migrate_fd_cleanup then the cancellation problems are fixed - I think
> > that's how we make migration 'clean' - a single cleanup routine
> > that always gets called.
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK



reply via email to

[Prev in Thread] Current Thread [Next in Thread]