Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post

qemu-devel

From:	Dr. David Alan Gilbert
Subject:	Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp
Date:	Mon, 13 Jun 2022 12:13:05 +0100
User-agent:	Mutt/2.2.1 (2022-02-19)

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Jun 09, 2022 at 05:02:29PM -0400, Peter Xu wrote:
> > On Wed, Jun 08, 2022 at 06:05:28PM +0100, Dr. David Alan Gilbert wrote:
> > > > @@ -2005,7 +2005,17 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
> > > >      /* TODO we should move all of this lot into postcopy_ram.c or a shared code
> > > >       * in migration.c
> > > >       */
> > > > -    cpu_synchronize_all_post_init();
> > > > +    cpu_synchronize_all_post_init(&local_err);
> > > > +    if (local_err) {
> > > > +        /*
> > > > +         * TODO: a better way to do this is to tell the src that we cannot
> > > > +         * run the VM here so hopefully we can keep the VM running on src
> > > > +         * and immediately halt the switch-over.  But that needs work.
> > > 
> > > Yes, I think it is possible; unlike some of the later errors in the same
> > > function, in this case we know no disks/network/etc have been touched,
> > > so we should be able to recover.
> > > I wonder if we can move the postcopy_state_set(POSTCOPY_INCOMING_RUNNING)
> > > out of loadvm_postcopy_handle_run to after this point.
> > > 
> > > We've already got the return path, so we should be able to signal the
> > > failure unless we're very unlucky.
> > 
> > Right.  It's just that for the new ACK we may need to modify the return
> > path protocol for sure, because none of the existing ones can notify such
> > an information.
> > 
> > One idea is to reuse MIG_RP_MSG_RESUME_ACK, it was only used for postcopy
> > recovery before to do the final handshake with offload=1 only (which is
> > defined as MIGRATION_RESUME_ACK_VALUE).  We could try to fill in the
> > payload with some !1 value, to tell the source that we NACK the migration
> > then src fails the migration as long as possible?
> > 
> > That seems to be even compatible with one old qemu migrating to a new qemu
> > scenario, because when the old qemu notices the MIG_RP_MSG_RESUME_ACK
> > message with !1 payload, it'll mark the rp bad:
> 
> Oh it won't be compatible..  The clean way to do this is we need to modify
> the src qemu to halt in postcopy_start() to wait for that ack before
> continue.  That may need another cap/param to enable.

OK; I was wondering about sending a RP_MSG_SHUT with a failure; but if
you'd need to change the source it's still a problem.
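
To illustrate why a source-side change (and hence a cap/param) would be
needed, here is a minimal standalone sketch. It is not QEMU code: the
names source_decide(), SWITCHOVER_CONTINUE/SWITCHOVER_ABORT, the
wait_for_ack_cap flag and SKETCH_NACK_VALUE are all hypothetical, and the
only thing taken from the thread is that MIGRATION_RESUME_ACK_VALUE is 1.
It assumes a source that either ignores the destination's verdict (today's
behaviour in postcopy_start()) or waits for it when the capability is on.

```c
#include <stdbool.h>
#include <stdint.h>

/* From the thread: the ACK payload is 1. */
#define MIGRATION_RESUME_ACK_VALUE 1u
/* Hypothetical "!1" payload meaning "destination cannot run the VM". */
#define SKETCH_NACK_VALUE 2u

enum switchover { SWITCHOVER_CONTINUE, SWITCHOVER_ABORT };

/*
 * Hypothetical model of the source's decision at switch-over time.
 * Without the capability the source never waits for the destination's
 * verdict, so a NACK cannot stop it; with the capability it only
 * continues on an explicit ACK.
 */
static enum switchover source_decide(bool wait_for_ack_cap, uint32_t value)
{
    if (!wait_for_ack_cap) {
        /* Old behaviour: the verdict arrives too late to matter. */
        return SWITCHOVER_CONTINUE;
    }
    return value == MIGRATION_RESUME_ACK_VALUE ? SWITCHOVER_CONTINUE
                                               : SWITCHOVER_ABORT;
}
```

In this model source_decide(false, SKETCH_NACK_VALUE) still continues,
which is the incompatibility: an unmodified source has nothing to halt on.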

> The thing is I'm not very sure whether this will be worth it.
> 
> Non-compatible migrations should be rare on put register failures.  For the
> issue I was working on, it was actually a kernel bug that triggered it but
> it's just hard to figure out where's wrong.  With properly working kernels
> and matching hosts they should just not really happen.  I'm worried adding
> too much complexity could over-engineer things without much benefits.

OK that makes sense.

> In that case, I'd think it proper if we start with what this patchset
> provides, which at least allows us to fail in a crystal clear way?

Yes, the clear error is important.
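
For reference, the "fail clearly" shape being agreed on can be sketched
roughly as below. This is a standalone mock, not the patch itself: the
Error struct here is a stand-in for QEMU's real Error object, and
mock_error_set(), mock_sync_all_post_init() and mock_postcopy_run() are
hypothetical names used only to show an errp-style failure being reported
instead of silently ignored.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for QEMU's Error object (assumption: message only). */
typedef struct Error {
    const char *msg;
} Error;

/* Store an error in *errp if the caller asked for one and none is set. */
static void mock_error_set(Error **errp, const char *msg)
{
    if (errp && !*errp) {
        Error *err = malloc(sizeof(*err));
        if (err) {
            err->msg = msg;
            *errp = err;
        }
    }
}

/* Hypothetical stand-in for cpu_synchronize_all_post_init(errp). */
static void mock_sync_all_post_init(int put_regs_ret, Error **errp)
{
    if (put_regs_ret < 0) {
        mock_error_set(errp, "failed to put CPU registers");
    }
}

/* Returns 0 if the VM may start, -1 after reporting a clear error. */
static int mock_postcopy_run(int put_regs_ret)
{
    Error *local_err = NULL;

    mock_sync_all_post_init(put_regs_ret, &local_err);
    if (local_err) {
        fprintf(stderr, "postcopy: %s\n", local_err->msg);
        free(local_err);
        return -1;
    }
    return 0;
}
```

The point is only that the caller sees the failure and can stop with a
message; recovering on the source would still need the protocol work
discussed above.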

Dave

> > 
> >   if (migrate_handle_rp_resume_ack(ms, tmp32)) {
> >       mark_source_rp_bad(ms);
> >       goto out;
> >   }
> > 
> >   static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> >   {
> >       trace_source_return_path_thread_resume_ack(value);
> >   
> >       if (value != MIGRATION_RESUME_ACK_VALUE) {
> >           error_report("%s: illegal resume_ack value %"PRIu32,
> >                        __func__, value);
> >           return -1;
> >       }
> >       ...
> >   }
> > 
> > If it looks generally good, I can try with such a change in v2.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Current Thread

[PATCH RFC 0/5] CPU: Detect put cpu register errors for migrations, Peter Xu, 2022/06/07
- [PATCH RFC 1/5] cpus-common: Introduce run_on_cpu_func2 which allows error returns, Peter Xu, 2022/06/07
- [PATCH RFC 2/5] cpus-common: Add run_on_cpu2(), Peter Xu, 2022/06/07
- [PATCH RFC 3/5] accel: Allow synchronize_post_init() to take an Error**, Peter Xu, 2022/06/07
- [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp, Peter Xu, 2022/06/07
  - Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp, Dr. David Alan Gilbert, 2022/06/08
    - Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp, Peter Xu, 2022/06/09
    - Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp, Peter Xu, 2022/06/10
    - Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp, Dr. David Alan Gilbert <=
- [PATCH RFC 5/5] KVM: Hook kvm_arch_put_registers() errors to the caller, Peter Xu, 2022/06/07
