Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Fri, 21 Feb 2014 09:44:34 +0000
User-agent: Mutt/1.5.21 (2010-09-15)
* Michael R. Hines (address@hidden) wrote:
> On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
> >
> >I'm happy to use more memory to get FT, all I'm trying to do is see
> >if it's possible to put a lower bound than 2x on it while still maintaining
> >full FT, at the expense of performance in the case where it uses
> >a lot of memory.
> >
> >>The bottom line is: if you put a *hard* constraint on memory usage,
> >>what will happen to the guest when that garbage collection you mentioned
> >>shows up later and runs for several minutes? How about an hour?
> >>Are we just going to block the guest from being allowed to start a
> >>checkpoint until the memory usage goes down just for the sake of avoiding
> >>the 2x memory usage?
> >Yes, or move to the next checkpoint sooner than the N milliseconds when
> >we see the buffer is getting full.
>
> OK, I see there is definitely some common ground there. So to be
> more specific, what we really need is two things (I've learned that
> the reviewers are very cautious about adding too much policy into
> QEMU itself, but let's iron this out anyway):
>
> 1. First, we need to throttle down the guest (QEMU can already do this
> using the recently introduced "auto-converge" feature). This means
> that the guest is still making forward progress, albeit slow progress.
>
> 2. Then we would need some kind of policy, or better yet, a trigger that
> does something to the effect of "we're about to use a whole lot of
> checkpoint memory soon - can we afford this much memory usage".
> Such a trigger would be conditional on the current policy of the
> administrator or management software: We would have a QMP
> command with a boolean flag that says "Yes" or "No" - whether or not
> it's tolerable to use that much memory in the next checkpoint.
>
> If the answer is "Yes", then nothing changes.
> If the answer is "No", then we should either:
> a) throttle down the guest
> b) Adjust the checkpoint frequency
> c) Or pause it altogether while we migrate some other VMs off the
> host such that we can complete the next checkpoint in its
> entirety.
Yes, I think so, although what I was thinking was mainly (b), possibly
to the point of not starting the next checkpoint.
> It's not clear to me how much (if any) of this control loop should
> be in QEMU or in the management software, but I would definitely agree
> that a minimum of at least the ability to detect the situation and
> remedy it should be in QEMU. I'm not entirely convinced that the
> ability to *decide* to remedy the situation should be in QEMU, though.
The management software access is low frequency, high latency; it should
be setting general parameters (max memory allowed, desired checkpoint
frequency etc) but I don't see that we can use it to do anything on
a faster than a few-second basis; so yes it can monitor things and
tweak the knobs if it sees the host as a whole is getting tight on RAM
etc - but we can't rely on it to throw on the brakes if this guest
suddenly decides to take bucketloads of RAM; something has to react
quickly in relation to previously set limits.
> >>If you block the guest from being checkpointed,
> >>then what happens if there is a failure during that extended period?
> >>We will have saved memory at the expense of availability.
> >If the active machine fails during this time then the secondary carries
> >on from its last good snapshot in the knowledge that the active
> >never finished the new snapshot and so never uncorked its previous packets.
> >
> >If the secondary machine fails during this time then the active drops
> >its nascent snapshot and carries on.
>
> Yes, that makes sense. Where would that policy go, though,
> continuing the above concern?
I think there has to be some input from the management layer for failover,
because (as per my split-brain concerns) something has to make the decision
about which of the source/destination is to take over, and I don't
believe individual instances have that information.
> >However, what you have made me realise is that I don't have an answer
> >for the memory usage on the secondary; while the primary can pause
> >its guest until the secondary acks the checkpoint, the secondary has
> >to rely on the primary not to send it huge checkpoints.
>
> Good question: There are a lot of ideas out there in the academic
> community to compress the secondary, or push the secondary to
> a flash-based device, or de-duplicate the secondary. I'm sure any
> of them would put a dent in the problem, but I'm not seeing a smoking
> gun solution that would absolutely save all that memory completely.
Ah, I was thinking that flash would be a good solution for secondary;
it would be a nice demo.
> (Personally, I don't believe in swap. I wouldn't even consider swap
> or any kind of traditional disk-based remedy to be a viable solution).
Well it certainly exists - I've seen it!
Swap works well in limited circumstances; but as soon as you've got
multiple VMs fighting over something with 10s of ms latency you're doomed.
> >>The customer that is expecting 100% fault tolerance and the provider
> >>who is supporting it need to have an understanding that fault tolerance
> >>is not free and that constraining memory usage will adversely affect
> >>the VM's ability to be protected.
> >>
> >>Do I understand your expectations correctly? Is fault tolerance
> >>something you're willing to sacrifice?
> >As above, no: I'm willing to sacrifice performance but not fault tolerance.
> >(It is entirely possible that others would want the other trade off, i.e.
> >some minimum performance is worse than useless, so if we can't maintain
> >that performance then dropping FT leaves us in a more-working position).
> >
>
> Agreed - I think a "proactive" failover in this case would solve the
> problem.
> If we observed that availability/fault tolerance was going to be at
> risk soon (which is relatively easy to detect) - we could just *force*
> a failover to the secondary host and restart the protection from
> scratch.
>
>
> >>
> >>Well, that's simple: If there is a failure of the source, the destination
> >>will simply revert to the previous checkpoint using the same mode
> >>of operation. The lost ACKs that you're curious about only
> >>apply to the checkpoint that is in progress. Just because a
> >>checkpoint is in progress does not mean that the previous checkpoint
> >>is thrown away - it is already loaded into the destination's memory
> >>and ready to be activated.
> >I still don't see why, if the link between them fails, the destination
> >doesn't fall back to its previous checkpoint, AND the source carries
> >on running - I don't see how they can differentiate which of them has failed.
>
> I think you're forgetting that the source I/O is buffered - it doesn't
> matter that the source VM is still running. As long as its output is
> buffered - it cannot have any non-fault-tolerant effect on the outside
> world.
>
> In the future, if a technician accesses the machine or the network
> is restored, the management software can terminate the stale
> source virtual machine.
I think going with my comment above; I'm working on the basis it's just
as likely for the destination to fail as it is for the source to fail,
and a destination failure shouldn't kill the source; and in the case
of a destination failure the source is going to have to let its buffered
I/Os start going again.
> >>We have a script architecture (not on github) which runs MC in a tight
> >>loop hundreds of times, kills the source QEMU, and timestamps how
> >>quickly the destination QEMU receives an error code from the kernel
> >>on the TCP socket - every single time, the destination resumes nearly
> >>instantaneously.
> >>I've not empirically seen a case where the socket just hangs or doesn't
> >>change state.
> >>
> >>I'm not very familiar with the internal Linux TCP/IP stack
> >>implementation itself, but I have not had a problem with the kernel
> >>reliably shutting down the socket as soon as possible.
> >OK, that only covers a very small range of normal failures.
> >When you kill the destination QEMU the host OS knows that QEMU is dead
> >and sends a packet back closing the socket, hence the source knows
> >the destination is dead very quickly.
> >If:
> > a) The destination machine was to lose power or hang
> > b) Or a network link fail (other than the one attached to the source
> > possibly)
> >
> >the source would have to do a full TCP timeout.
> >
> >To test a,b I'd use an iptables rule somewhere to cause the packets to
> >be dropped (not rejected). Stopping the qemu in gdb might be good enough.
>
> Very good idea - I'll add that to the "todo" list of things to do
> in my test infrastructure. It may indeed turn out to be necessary
> to add a formal keepalive between the source and destination.
>
> - Michael
Dave
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK