[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take a
From: |
Dr. David Alan Gilbert |
Subject: |
Re: [PATCH RFC 4/5] cpu: Allow cpu_synchronize_all_post_init() to take an errp |
Date: |
Mon, 13 Jun 2022 12:13:05 +0100 |
User-agent: |
Mutt/2.2.1 (2022-02-19) |
* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Jun 09, 2022 at 05:02:29PM -0400, Peter Xu wrote:
> > On Wed, Jun 08, 2022 at 06:05:28PM +0100, Dr. David Alan Gilbert wrote:
> > > > @@ -2005,7 +2005,17 @@ static void loadvm_postcopy_handle_run_bh(void
> > > > *opaque)
> > > > /* TODO we should move all of this lot into postcopy_ram.c or a
> > > > shared code
> > > > * in migration.c
> > > > */
> > > > - cpu_synchronize_all_post_init();
> > > > + cpu_synchronize_all_post_init(&local_err);
> > > > + if (local_err) {
> > > > + /*
> > > > + * TODO: a better way to do this is to tell the src that we
> > > > cannot
> > > > + * run the VM here so hopefully we can keep the VM running on
> > > > src
> > > > + * and immediately halt the switch-over. But that needs work.
> > >
> > > Yes, I think it is possible; unlike some of the later errors in the same
> > > function, in this case we know no disks/network/etc have been touched,
> > > so we should be able to recover.
> > > I wonder if we can move the postcopy_state_set(POSTCOPY_INCOMING_RUNNING)
> > > out of loadvm_postcopy_handle_run to after this point.
> > >
> > > We've already got the return path, so we should be able to signal the
> > > failure unless we're very unlucky.
> >
> > Right. It's just that for the new ACK we may need to modify the return
> > path protocol for sure, because none of the existing ones can notify such
> > an information.
> >
> > One idea is to reuse MIG_RP_MSG_RESUME_ACK, it was only used for postcopy
> > recovery before to do the final handshake with offload=1 only (which is
> > defined as MIGRATION_RESUME_ACK_VALUE). We could try to fill in the
> > payload with some !1 value, to tell the source that we NACK the migration
> > then src fails the migration as long as possible?
> >
> > That seems to be even compatibile with one old qemu migrating to a new qemu
> > scenario, because when the old qemu notices the MIG_RP_MSG_RESUME_ACK
> > message with !1 payload, it'll mark the rp bad:
>
> Oh it won't be compatible.. The clean way to do this is we need to modify
> the src qemu to halt in postcopy_start() to wait for that ack before
> continue. That may need another cap/param to enable.
OK; I was wondering aobut sending a RP_MSG_SHUT with a failure; but if
you'd need to change the source it's still a problem.
> The thing is I'm not very sure whether this will be worth it.
>
> Non-compatible migrations should be rare on put register failures. For the
> issue I was working on, it was actually a kernel bug that triggered it but
> it's just hard to figure out where's wrong. With properly working kernels
> and matching hosts they should just not really heppen. I'm worried adding
> too much complexity could over-engineer things without much benefits.
OK that makes sense.
> In that case, I'd think it proper if we start with what this patchset
> provides, which at least allows us to fail in a crystal clear way?
Yes, the clear error is important.
Dave
> >
> > if (migrate_handle_rp_resume_ack(ms, tmp32)) {
> > mark_source_rp_bad(ms);
> > goto out;
> > }
> >
> > static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> > {
> > trace_source_return_path_thread_resume_ack(value);
> >
> > if (value != MIGRATION_RESUME_ACK_VALUE) {
> > error_report("%s: illegal resume_ack value %"PRIu32,
> > __func__, value);
> > return -1;
> > }
> > ...
> > }
> >
> > If it looks generally good, I can try with such a change in v2.
> >
> > Thanks,
> >
> > --
> > Peter Xu
>
> --
> Peter Xu
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
[PATCH RFC 5/5] KVM: Hook kvm_arch_put_registers() errors to the caller, Peter Xu, 2022/06/07