Re: [Qemu-ppc] Problem with "savevm" on ppc64
From: David Gibson
Subject: Re: [Qemu-ppc] Problem with "savevm" on ppc64
Date: Mon, 24 Oct 2016 13:09:57 +1100
User-agent: Mutt/1.7.1 (2016-10-04)
On Fri, Oct 21, 2016 at 10:12:28AM +0100, Dr. David Alan Gilbert wrote:
> * David Gibson (address@hidden) wrote:
> > On Thu, Oct 20, 2016 at 03:17:12PM +0200, Thomas Huth wrote:
> > > Hi all,
> > >
> > > I'm currently facing a strange problem with the "savevm" HMP command on
> > > ppc64 with TCG and the pseries machine. Steps for reproduction:
> > >
> > > 1) Create a disk image:
> > > qemu-img create -f qcow2 /tmp/test.qcow2 1M
> > >
> > > 2) Start QEMU (in TCG mode):
> > > qemu-system-ppc64 -nographic -vga none -m 256 -hda /tmp/test.qcow2
> > >
> > > 3) Hit "CTRL-a c" to enter the HMP monitor
> > >
> > > 4) Type "savevm test1" to save a snapshot
> > >
> > > The savevm command then hangs forever and the test.qcow2 image keeps
> > > growing and growing.
> > >
> > > It seems like qemu_savevm_state_iterate() does not make any more
> > > progress because ram_save_iterate() keeps returning 0 ... but why can
> > > that happen?
> >
> > Hmm. You don't mention installing anything on the disk image, so I'm
> > assuming the VM is just sitting in SLOF, unable to boot. And it's
> > TCG, so there's no VRMA. Which means I'd expect the HPT to be
> > completely empty.
> >
> > I strongly suspect that's what's triggering the problem, although I'm
> > not quite sure of the mechanism.
> >
> > > I've tinkered with the code a little bit, and I can get it to work with
> > > the following patch - which however is quite independent of the
> > > ram_save_iterate() code:
> > >
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index ddb7438..a7ac0bf 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -1290,7 +1290,7 @@ static int htab_save_setup(QEMUFile *f, void *opaque)
> > >      return 0;
> > >  }
> > >  
> > > -static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > > +static int htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > >                                   int64_t max_ns)
> > >  {
> > >      bool has_timeout = max_ns != -1;
> > > @@ -1340,6 +1340,8 @@ static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > >          spapr->htab_first_pass = false;
> > >      }
> > >      spapr->htab_save_index = index;
> > > +
> > > +    return !spapr->htab_first_pass;
> > >  }
> > >  
> > >  static int htab_save_later_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > > @@ -1444,7 +1446,7 @@ static int htab_save_iterate(QEMUFile *f, void *opaque)
> > >              return rc;
> > >          }
> > >      } else if (spapr->htab_first_pass) {
> > > -        htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> > > +        rc = htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> > >      } else {
> > >          rc = htab_save_later_pass(f, spapr, MAX_ITERATION_NS);
> > >      }
> > >
> > > That means, if htab_save_iterate() does not initially return 0, then
> > > ram_save_iterate() also does not run into the condition of returning
> > > zeroes later. But how are these related? And is it safe to return a
> > > non-zero value from htab_save_first_pass()? Does anybody with more
> > > experience in this area have a clue what's going on here?
> >
> > So, looking at this I think it's unsafe. htab_save_first_pass() never
> > examines dirty bits, so we could get:
> > htab_save_first_pass() called once, saves part of HPT
> > guest dirties an HPTE in the already saved region
> > enter migration completion stage
> > htab_save_first_pass() saves the remainder of the HPT, returns 1
> >
> > That would trigger the code to think the HPT migration is complete,
> > without ever saving the HPTE that got dirtied part way through.
> >
> > But.. then I looked further and got *really* confused.
>
> Ewww, my head.....this is very confusing.
>
> > The comment above qemu_savevm_state_iterate() and the logic in
> > qemu_savevm_state() say that the iterate function returns >0 to
> > indicate that it is done and we can move onto the completion phase.
> >
> > But both ram_save_iterate() and block_save_iterate() seem to have that
> > inverted: they return >0 if they actually saved something.
>
> I think you're right that logic looks inverted somewhere, but be careful
> that we have:
> a) The return value of qemu_savevm_state_iterate
> b) The return value of the iterators it calls
Yes, but I checked that those are basically the same.
qemu_savevm_state_iterate() adds some extra error conditions, but
never alters the return value of the callback.
And even then, I've said the sense is "inverted", but that's a bit
sloppy - it's not just a matter of putting a ! in.
qemu_savevm_state_iterate() is expecting something saying whether or
not the iteration has completed. The RAM and block iterators are
returning something saying whether or not they made forward progress.
In a number of significant cases those will be the inverse of each
other, but not always.
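Just to pin that down with a toy example - this is a sketch, not QEMU
code; the function names and the "remaining"/"sent" parameters are
made up for illustration:

/* Toy sketch, not QEMU code: the two return-value conventions being
 * conflated.  "remaining" is whatever iterable state is still outstanding,
 * "sent" is what this particular call managed to put on the wire. */
#include <stdio.h>

/* Convention qemu_savevm_state_iterate() documents:
 * >0 means "this handler finished the current stage, move on". */
static int done_convention(unsigned long remaining, unsigned long sent)
{
    (void)sent;
    return remaining == 0;
}

/* Convention ram_save_iterate()/block_save_iterate() actually follow:
 * >0 means "I made forward progress on this call". */
static int progress_convention(unsigned long remaining, unsigned long sent)
{
    (void)remaining;
    return sent > 0;
}

int main(void)
{
    /* {remaining, sent} cases: drained in one call / nothing left to do /
     * partial progress with more still pending. */
    unsigned long cases[][2] = { {0, 4096}, {0, 0}, {65536, 4096} };

    for (int i = 0; i < 3; i++) {
        printf("remaining=%lu sent=%lu: done=%d progress=%d\n",
               cases[i][0], cases[i][1],
               done_convention(cases[i][0], cases[i][1]),
               progress_convention(cases[i][0], cases[i][1]));
    }
    return 0;
}

The first case is where the two agree; the other two are where they
diverge - which is the "inverse in many cases, but not always" point.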
> Both those return values individually make sense *if* the logic
> within qemu_savevm_state_iterate did the right thing; which I'm not
> convinced of.
>
> > I think the only reason migration has been working *at all* is that
> > migration_thread() and qemu_savevm_state_complete_precopy() ignore the
> > return value of the iterate() function (except for <0 errors).
> >
> > qemu_savevm_state(), however, does not.
>
> However, there are two other reasons it works:
> a) We keep iterating in the migration code while the amount of remaining
> iterable RAM/etc is large (i.e. result of qemu_savevm_state_pending).
Sorry, I misspoke; I should have said it's the only reason savevm is
working. Migration ignores the result of the iterate callback and so
works.
> b) Even if we never called the _iterate function the stream should be
> correct because the _complete function should ensure that anything
> left is migrated. I don't *think* this can happen in a normal migration
> but hmm maybe if you set a very large downtime.
>
> > So, I don't think this is actually a Power bug. But, it's masked on
> > machine types other than pseries, because they will almost never have
> > >1 iterable state handler.
>
> Well, they have it in the case of qemu block migration, but I guess that's
> probably only used in migration not savevm.
Yes. Also the block iterator has the same inverted sense as the RAM
one, so I suspect it may be subject to the same problem anyway.
> > I think what must be happening is that
> > because we don't have time limits in the savevm case, the first call
> > to ram_save_iterate() from qemu_savevm_state_iterate() is saving all
> > of RAM. On x86 that would be the end of the QTAILQ_FOREACH(), we'd
> > return non-zero and that would kick qemu_savevm_state() into the
> > completion phase.
> >
> > But on Power, we save all RAM, then move onto the HPT iterator. It
> > returns 0 on the first pass, and we loop again in
> > qemu_savevm_state(). But on the next call, RAM is all saved, so
> > ram_save_iterate now returns 0. That breaks us out of the FOREACH in
> > qemu_savevm_state_iterate(), returning 0. Repeat forever.
> >
> > Juan, Dave, sound right?
>
> I think so.
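To make that sequence concrete, here's a deliberately dumbed-down model
of the state the loop gets into once all RAM has been sent. This is not
the actual QEMU source; the toy_* names and the fixed two-handler array
just stand in for the registered RAM and htab iterators:

/* Simplified model, not QEMU source: what the savevm loop looks like on
 * pseries once every page of RAM has already been sent. */
#include <stdio.h>

/* All RAM is saved, so the "progress" convention makes this return 0. */
static int toy_ram_iterate(void)  { return 0; }
/* htab_save_first_pass() returns no status, so the caller's rc stays 0. */
static int toy_htab_iterate(void) { return 0; }

/* Walk the handlers in registration order, stopping at the first one that
 * has not reported completion of the current stage. */
static int toy_savevm_state_iterate(void)
{
    int (*handlers[])(void) = { toy_ram_iterate, toy_htab_iterate };
    int ret = 1;

    for (unsigned i = 0; i < 2; i++) {
        ret = handlers[i]();
        if (ret <= 0) {
            break;   /* RAM returns 0, so htab is never reached again */
        }
    }
    return ret;   /* 0 means "iterate again" to the caller */
}

int main(void)
{
    /* qemu_savevm_state() keeps calling the iterate step until it sees a
     * non-zero result; cap the loop here so the demo terminates. */
    for (int call = 1; call <= 5; call++) {
        printf("iterate call %d -> %d\n", call, toy_savevm_state_iterate());
    }
    printf("...and so on, forever: the savevm hang described above.\n");
    return 0;
}

With the real code there is no cap on the outer loop, which is the hang
Thomas is seeing.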
>
> > I suspect the right fix is to make ram_save_iterate() return void.
> > Migration will use the bandwidth / downtime calculations to determine
> > completion as always. savevm would skip the "live" phase entirely and
> > move directly on to completion phase unconditionally.
>
> Skipping the live phase sounds OK; I haven't got a clue whether any other
> iterators will like that. But then why did qemu_savevm_state() ever bother
> calling the iterate functions?
As far as I can tell there are only 3 live iterators in the whole of
qemu - RAM, block and htab. Between us, I think we've checked that it
should be safe to skip the iteration phase for all of those in the
savevm case. So it seems reasonable to make that a requirement for
any future iterators.
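Spelled out as a sketch - again a toy, not QEMU's real handler
structure; the ToyLiveHandler type and its hooks are invented for
illustration - the requirement is just that a handler's completion hook
can produce a complete, consistent stream on its own, with the iterate
hook treated as an optimisation that savevm may skip entirely:

/* Toy illustration of "safe to skip the iteration phase". */
#include <stdio.h>

typedef struct {
    const char *name;
    void (*setup)(void);     /* start of stream */
    int  (*iterate)(void);   /* optional: send dirty state while guest runs */
    void (*complete)(void);  /* must flush *everything* still outstanding */
} ToyLiveHandler;

static void toy_setup(void)    { printf("setup: write section header\n"); }
static int  toy_iterate(void)  { printf("iterate: send some dirty state\n"); return 0; }
static void toy_complete(void) { printf("complete: send all remaining state\n"); }

int main(void)
{
    ToyLiveHandler h = { "toy", toy_setup, toy_iterate, toy_complete };

    /* savevm-style flow under the proposal: the guest is already stopped,
     * so the live phase is skipped and correctness rests on complete(). */
    h.setup();
    /* h.iterate() intentionally never called */
    h.complete();
    return 0;
}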
> However, there's the comment near the end of qemu_savevm_state_iterate
> which I don't think we're obeying (because of the disagreement on return
> types)
> but the comment sounds a reasonable goal:
>
> /* Do not proceed to the next vmstate before this one reported
> completion of the current stage. This serializes the migration
> and reduces the probability that a faster changing state is
> synchronized over and over again. */
>
> I think the logic here is that if the return value worked the way the
> comment at the top of the function (and the htab iterator) thinks it does,
> then if you had a bandwidth limit you'd end up repeatedly calling
> ram_save_iterate and never calling the block iterator until RAM was done;
> then you'd do some block, and maybe come back to RAM, but the iteration
> would always prefer the items near the top of the list of iterators.
Right. It's not instantly obvious to me that that would be a
desirable goal, but it seems like it plausibly could be.
And yes, AFAICT we're not actually doing that, because the ram and
block iterators don't return values with the meaning
qemu_savevm_state_iterate() expects.
So.. two options
1) Change the callbacks to return void (or errors only), have
savevm go straight to completion phase, and give up on the
priority ordering in qemu_savevm_state_iterate() - it's clearly
not vital
2) Fix one side or the other so they agree on the meaning of the
return value. This is essentially *only* important for the
priority ordering, so it will be fairly easy to mess that up
again in future.
Where next?
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson