Re: [Qemu-ppc] Problem with "savevm" on ppc64


From: David Gibson
Subject: Re: [Qemu-ppc] Problem with "savevm" on ppc64
Date: Mon, 24 Oct 2016 13:09:57 +1100
User-agent: Mutt/1.7.1 (2016-10-04)

On Fri, Oct 21, 2016 at 10:12:28AM +0100, Dr. David Alan Gilbert wrote:
> * David Gibson (address@hidden) wrote:
> > On Thu, Oct 20, 2016 at 03:17:12PM +0200, Thomas Huth wrote:
> > >  Hi all,
> > > 
> > > I'm currently facing a strange problem with the "savevm" HMP command on
> > > ppc64 with TCG and the pseries machine. Steps for reproduction:
> > > 
> > > 1) Create a disk image:
> > >    qemu-img create -f qcow2  /tmp/test.qcow2 1M
> > > 
> > > 2) Start QEMU (in TCG mode):
> > >    qemu-system-ppc64 -nographic -vga none -m 256 -hda /tmp/test.qcow2
> > > 
> > > 3) Hit "CTRL-a c" to enter the HMP monitor
> > > 
> > > 4) Type "savevm test1" to save a snapshot
> > > 
> > > The savevm command then hangs forever and the test.qcow2 image keeps
> > > growing and growing.
> > > 
> > > It seems like qemu_savevm_state_iterate() does not make any more
> > > progress because ram_save_iterate() keeps returning 0 ... but why can
> > > that happen?
> > 
> > Hmm.  You don't mention installing anything on the disk image, so I'm
> > assuming the VM is just sitting in SLOF, unable to boot.  And it's
> > TCG, so there's no VRMA.  Which means I'd expect the HPT to be
> > completely empty.
> > 
> > I strongly suspect that's what's triggering the problem, although I'm
> > not quite sure of the mechanism.
> > 
> > > I've tinkered with the code a little bit, and I can get it to work with
> > > the following patch - which however is quite independent of the
> > > ram_save_iterate() code:
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index ddb7438..a7ac0bf 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -1290,7 +1290,7 @@ static int htab_save_setup(QEMUFile *f, void *opaque)
> > >      return 0;
> > >  }
> > > 
> > > -static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > > +static int htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > >                                   int64_t max_ns)
> > >  {
> > >      bool has_timeout = max_ns != -1;
> > > @@ -1340,6 +1340,8 @@ static void htab_save_first_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > >          spapr->htab_first_pass = false;
> > >      }
> > >      spapr->htab_save_index = index;
> > > +
> > > +    return !spapr->htab_first_pass;
> > >  }
> > > 
> > >  static int htab_save_later_pass(QEMUFile *f, sPAPRMachineState *spapr,
> > > @@ -1444,7 +1446,7 @@ static int htab_save_iterate(QEMUFile *f, void *opaque)
> > >              return rc;
> > >          }
> > >      } else  if (spapr->htab_first_pass) {
> > > -        htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> > > +        rc = htab_save_first_pass(f, spapr, MAX_ITERATION_NS);
> > >      } else {
> > >          rc = htab_save_later_pass(f, spapr, MAX_ITERATION_NS);
> > >      }
> > > 
> > > That means, if htab_save_iterate() does not initially return 0, then
> > > ram_save_iterate() also does not run into the condition of returning
> > > zeroes later. But how are these related? And is it safe to return a
> > > non-zero value from htab_save_first_pass()? Does anybody with more
> > > experience in this area have a clue what's going on here?
> > 
> > So, looking at this I think it's unsafe.  htab_save_first_pass() never
> > examines dirty bits, so we could get:
> >     htab_save_first_pass() called once, saves part of HPT
> >     guest dirties an HPTE in the already saved region
> >     enter migration completion stage
> >     htab_save_first_pass() saves the remainder of the HPT, returns 1
> > 
> > That would trigger the code to think the HPT migration is complete,
> > without ever saving the HPTE that got dirtied part way through.
> > 
> > But.. then I looked further and got *really* confused.
> 
> Ewww, my head.....this is very confusing.
> 
> > The comment above qemu_savevm_state_iterate() and the logic in
> > qemu_savevm_state() say that the iterate function returns >0 to
> > indicate that it is done and we can move onto the completion phase.
> > 
> > But both ram_save_iterate() and block_save_iterate() seem to have that
> > inverted: they return >0 if they actually saved something.
> 
> I think you're right that logic looks inverted somewhere, but be careful
> that we have:
>     a) The return value of qemu_savevm_state_iterate
>     b) The return value of the iterators it calls

Yes, but I checked that those are basically the same.
qemu_savevm_state_iterate() adds some extra error conditions, but
never alters the return value of the callback.

And even then, I've said the sense is "inverted", but that's a bit
sloppy - it's not just a matter of putting a ! in.
qemu_savevm_state_iterate() is expecting something saying whether or
not the iteration has completed.  The RAM and block iterators are
returning something saying whether or not they made forward progress.
In a number of significant cases those will be the inverse of each
other, but not always.
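
To spell that out, here's a toy illustration of the two predicates
(this is not QEMU code; the situations and names are made up):

/* Toy illustration of "made progress this call" vs "finished";
 * not QEMU code, just the two predicates side by side. */
#include <stdio.h>
#include <stdbool.h>

struct iter_case {
    const char *situation;
    bool made_progress;   /* what ram/block_save_iterate() report */
    bool finished;        /* what qemu_savevm_state_iterate() wants */
};

int main(void)
{
    struct iter_case cases[] = {
        /* everything already sent, guest idle: no progress, but done */
        { "all clean",           false, true  },
        /* plenty of dirty pages left: progress, but not done */
        { "lots still dirty",    true,  false },
        /* sent the last dirty page this call: progress *and* done, so
         * the two are not simply each other's inverse */
        { "sent the final page", true,  true  },
    };

    for (size_t i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
        printf("%-20s progress=%d finished=%d\n", cases[i].situation,
               cases[i].made_progress, cases[i].finished);
    }
    return 0;
}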

> Both those return values individually make sense *if* the logic
> within qemu_savevm_state_iterate did the right thing; which I'm not
> convinced of.
> 
> > I think the only reason migration has been working *at all* is that
> > migration_thread() and qemu_savevm_state_complete_precopy() ignore the
> > return value of the iterate() function (except for <0 errors).
> >
> > qemu_savevm_state(), however, does not.
> 
> However, there are two other reasons it works:
>    a) We keep iterating in the migration code while the amount of remaining
>       iterable RAM/etc is large (i.e. the result of qemu_savevm_state_pending).

Sorry, I misspoke; I should have said the only reason savevm is
working.  Migration ignores the result of the iterate callback, and so
it works.

>    b) Even if we never called the _iterate function the stream should be
>       correct because the _complete function should ensure that anything
>       left is migrated.  I don't *think* this can happen in a normal migration
>       but hmm maybe if you set a very large downtime.
> 
> > So, I don't think this is actually a Power bug.  But, it's masked on
> > machine types other than pseries, because they will almost never have
> > more than one iterable state handler.
> 
> Well, they have it in the case of qemu block migration, but I guess that's
> probably only used in migration not savevm.

Yes.  Also the block iterator has the same inverted sense as the RAM
one, so I suspect it may be subject to the same problem anyway.

> > I think what must be happening is that
> > because we don't have time limits in the savevm case, the first call
> > to ram_save_iterate() from qemu_savevm_state_iterate() is saving all
> > of RAM.  On x86 that would be the end of the QTAILQ_FOREACH(), we'd
> > return non-zero and that would kick qemu_savevm_state() into the
> > completion phase.
> > 
> > But on Power, we save all RAM, then move onto the HPT iterator.  It
> > returns 0 on the first pass, and we loop again in
> > qemu_savevm_state().  But on the next call, RAM is all saved, so
> > ram_save_iterate now returns 0.  That breaks us out of the FOREACH in
> > qemu_savevm_state_iterate(), returning 0.  Repeat forever.
> > 
> > Juan, Dave, sound right?
> 
> I think so.
> 
> > I suspect the right fix is to make ram_save_iterate() return void.
> > Migration will use the bandwidth / downtime calculations to determine
> > completion as always.  savevm would skip the "live" phase entirely and
> > move directly on to completion phase unconditionally.
> 
> Skipping the live phase sounds OK - I haven't got a clue whether any other
> iterators will like that.  But then why did qemu_savevm_state() ever bother
> calling the iterate functions?

As far as I can tell there are only 3 live iterators in the whole of
qemu - RAM, block and htab.  Between us, I think we've checked that
those should be safe to skip the iteration phase for savevm.  So it
seems reasonable to make that a requirement for any future iterators.
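
Roughly, the contract we'd be imposing is something like this (a
sketch only - simplified shape and made-up names, not the real
SaveVMHandlers):

/* Sketch of the contract for a live iterator if savevm skips the live
 * phase entirely.  Hypothetical, simplified shape; not the real
 * SaveVMHandlers structure. */
typedef struct ToyLiveHandler {
    /* Optional optimisation for live migration: push some state early
     * to shrink downtime.  May be called zero or more times, and must
     * cope with never being called at all. */
    int (*iterate)(void *opaque);

    /* Must write out everything still outstanding, whether or not
     * iterate() ever ran or finished.  savevm would rely on this
     * alone; migration would still use iterate() plus the pending
     * estimate to decide when to stop. */
    int (*complete)(void *opaque);
} ToyLiveHandler;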

> However, there's the comment near the end of qemu_savevm_state_iterate,
> which I don't think we're obeying (because of the disagreement on return
> types), but the comment sounds like a reasonable goal:
> 
>             /* Do not proceed to the next vmstate before this one reported
>                completion of the current stage. This serializes the migration
>                and reduces the probability that a faster changing state is
>                synchronized over and over again. */
> 
> I think the logic here is that, if the return value worked the way the
> comment at the top of the function (and the htab iterator) thinks it does,
> then if you had a bandwidth limit you'd end up repeatedly calling
> ram_save_iterate and never calling the block iterator until RAM was done;
> then you'd do some block, and maybe come back to RAM, but the iteration
> would always prefer the items near the top of the list of iterators.

Right.  It's not instantly obvious to me that that would be a
desirable goal, but it seems like it plausibly could be.

And yes, AFAICT we're not actually doing that because the ram and
block iterators don't give return values as
qemu_savevm_state_iterate() expects.
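
Here's a toy model of how that plays out for savevm on pseries (not
the real code - two fake handlers and a driver loop, with the handler
behaviour boiled right down):

/* Toy model of the savevm hang; not the actual QEMU code.  The first
 * handler reports progress ("did I send anything this call"), the
 * second reports completion ("am I done"), and the driver loop
 * believes everything reports completion. */
#include <stdio.h>
#include <stdbool.h>

static int ram_left = 1;           /* no time limit, so one call sends all of RAM */
static bool htab_first_pass = true;

/* "progress" convention, like ram_save_iterate() */
static int toy_ram_iterate(void)
{
    if (ram_left > 0) {
        ram_left--;
        return 1;                  /* sent something */
    }
    return 0;                      /* nothing left, so 0 from now on */
}

/* "done" convention, like htab_save_iterate(): 0 means "call me again" */
static int toy_htab_iterate(void)
{
    if (htab_first_pass) {
        htab_first_pass = false;   /* first pass walks the whole HPT */
        return 0;                  /* not done yet */
    }
    return 1;                      /* later passes: done */
}

/* Like the FOREACH in qemu_savevm_state_iterate(): stop at the first
 * handler that doesn't report completion, and return its value. */
static int toy_state_iterate(void)
{
    int rc = toy_ram_iterate();
    if (rc == 0) {                 /* "RAM not done" (really: nothing sent) */
        return 0;
    }
    return toy_htab_iterate();
}

int main(void)
{
    /* Like the loop in qemu_savevm_state(): iterate until rc > 0. */
    for (int call = 1; call <= 8; call++) {     /* capped for the demo */
        int rc = toy_state_iterate();
        printf("call %d: rc = %d\n", call, rc);
        if (rc > 0) {
            printf("move on to the completion phase\n");
            return 0;
        }
    }
    printf("rc never goes above 0: that's the savevm hang\n");
    return 0;
}

Drop the htab handler (the x86 case) and the very first call already
returns non-zero, which is why only pseries trips over this.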

So.. two options:

    1) Change the callbacks to return void (or errors only), have
       savevm go straight to completion phase, and give up on the
       priority ordering in qemu_savevm_state_iterate() - it's clearly
       not vital.  (There's a rough sketch of what that looks like on
       the savevm side below.)

    2) Fix one side or the other so they agree on the meaning of the
       return value.  This is essentially *only* important for the
       priority ordering, so it will be fairly easy to mess that up
       again in future.
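
On the savevm side, option 1 would collapse to something like this (a
very rough sketch with stubbed-out helpers, not the real
qemu_savevm_state()):

/* Very rough sketch of option 1 for savevm: no live phase at all, and
 * iterate callbacks only ever matter for live migration (errors only).
 * Stubbed helpers, hypothetical names; not the real code. */
#include <stdio.h>

static int toy_setup(void)
{
    printf("run each handler's setup\n");
    return 0;
}

static int toy_complete(void)
{
    printf("completion phase: save all outstanding state\n");
    return 0;
}

static int toy_savevm_state(void)
{
    int ret = toy_setup();
    if (ret < 0) {
        return ret;
    }
    /* No iterate loop: the guest is stopped for savevm, so there's no
     * downtime to optimise, and the completion phase has to cover
     * everything that's left anyway. */
    return toy_complete();
}

int main(void)
{
    return toy_savevm_state() < 0 ? 1 : 0;
}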

Where next?

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
