
Re: Thoughts on VM fence infrastructure


From: Felipe Franciosi
Subject: Re: Thoughts on VM fence infrastructure
Date: Tue, 1 Oct 2019 11:38:14 +0000


> On Oct 1, 2019, at 12:10 PM, Daniel P. Berrangé <address@hidden> wrote:
> 
> On Tue, Oct 01, 2019 at 10:46:24AM +0000, Felipe Franciosi wrote:
>> Hi Daniel!
>> 
>> 
>>> On Oct 1, 2019, at 11:31 AM, Daniel P. Berrangé <address@hidden> wrote:
>>> 
>>> On Tue, Oct 01, 2019 at 09:56:17AM +0000, Felipe Franciosi wrote:
>> 
>> (Apologies for the mangled URL, nothing I can do about that.) :(
>> 
>> There are several points which favour adding this to Qemu:
>> - Not all environments use systemd.
> 
> Sure, if you want to cope with that you can just use the HW watchdog
> directly instead of via systemd. 
> 
>> - HW watchdogs always reboot the host, which is too drastic.
>> - You may not want to protect all VMs in the same way.
> 
> Same points repeated below, so I'll respond there....
> 
>>> IMHO doing this at the host OS level is going to be more reliable in
>>> terms of detecting the problem in the first place, as well as more
>>> reliable in taking the action - its very difficult for a hardware CPU
>>> reset to fail to work.
>> 
>> Absolutely, but it's a very drastic measure that:
>> - May be unnecessary.
> 
> Of course, the inability to predict future consequences is what
> forces us into assuming the worst case & taking actions to
> mitigate that. It will definitely result in unnecessary killing
> of hosts, but that is what gives you the safety guarantees you
> can't otherwise achieve.

The argument is that many configurations run in controlled settings
that do not require that drastic a level of protection, and the
feature we're discussing offers a softer way of dealing with those.

> I gave the example elsewhere that even if you kill QEMU, the kernel
> can have pending I/O associated with QEMU that can be sent if the
> host later recovers.

Even if an I/O is sent out of the host, there is no guarantee that it
isn't queued somewhere and won't reach its destination even after you
pull the power on the host. Such discussions were held separately a
while back when we were talking about task cancellation.

To that point, I've personally seen corruption with network storage
which was debugged as:

1) write(lba=0, value='a')
2) host crashed (hard reset)
3) vm restarted elsewhere
4) write(lba=0, value='a') (resubmitted from "1")
5) write(lba=0, value='b')
6) I/O from step "1" reached controller
7) read(lba=0) == 'a'

My argument is that you need to apply protection where protection is
needed. Perhaps the example above could have been avoided with a
session distinction: once I/O from step "4" was seen (coming from the
new session), I/Os from older sessions would be rejected by the
storage controller.
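
To make that concrete, here is a minimal sketch of the kind of
generation fencing I mean on the controller side. It is entirely
hypothetical (made-up structs and names, not any real storage API):
every session carries the generation it was opened with, and anything
older than the volume's current generation is rejected.

    #include <stdbool.h>
    #include <stdint.h>

    struct volume {
        uint64_t generation;  /* bumped every time a new session opens */
    };

    struct io_request {
        uint64_t generation;  /* generation of the issuing session */
        /* lba, buffer, length, ... */
    };

    /* Called when a restarted VM (or a new host) opens the volume. */
    static void volume_open_session(struct volume *v)
    {
        v->generation++;      /* all older sessions are now fenced */
    }

    /* Called for every incoming I/O; stale writes are dropped. */
    static bool volume_submit(struct volume *v, struct io_request *req)
    {
        if (req->generation < v->generation) {
            return false;     /* I/O from step "1" gets rejected here */
        }
        /* ... perform the write ... */
        return true;
    }

With something like this, the late write from step "1" arrives tagged
with the old generation and never touches the media.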

In any case, all I'm saying is that there are different levels of
protection. The feature we're discussing here offers one of them.

> 
>> - Will fence everything even if only some VMs need protection.
> 
> I don't believe it's viable to offer real protection to only
> a subset of VMs, principally because the kernel is doing I/O work
> on behalf of the VM, so to protect just 1 VM you must fence the
> kernel.

You are assuming that all users of Qemu out there do I/O through the
kernel. What if they don't?

> 
>> What are your thoughts on this 3-level approach?
>> 1) Qemu tries to log() + abort() (deadline)
> 
> Just abort()'ing isn't going to be a viable strategy with QEMU's move
> towards a multi-process architecture. This introduces the problem that
> the "main" QEMU process has to enumerate all the helpers it is dealing
> with and kill them all off in some way. This is non-trivial especially
> if some of the helpers are running under different privilege levels.

If this is to extend to a multi-process model, I don't think it should
be one process killing others. It's called "self-fencing" because each
process should be responsible for killing itself based on a heartbeat.
(Or configuring the kernel to do it.)
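
To make the idea concrete, here is a minimal sketch of what I have in
mind for level "1", with entirely made-up names (this is not a
proposal for the actual QEMU code): the main loop refreshes a
timestamp, and a dedicated thread aborts the process if the timestamp
goes stale.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define DEADLINE_SECS 10

    static atomic_long last_heartbeat;   /* CLOCK_MONOTONIC seconds */

    static long now_secs(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec;
    }

    /* Called from the process's main loop while it is healthy. */
    static void heartbeat(void)
    {
        atomic_store(&last_heartbeat, now_secs());
    }

    /* Dedicated thread: log and abort if the main loop goes quiet. */
    static void *fence_thread(void *arg)
    {
        for (;;) {
            sleep(1);
            if (now_secs() - atomic_load(&last_heartbeat) > DEADLINE_SECS) {
                fprintf(stderr, "heartbeat missed, self-fencing\n");
                abort();
            }
        }
        return NULL;
    }

    static void fence_start(void)
    {
        pthread_t tid;
        heartbeat();
        pthread_create(&tid, NULL, fence_thread, NULL);
    }

Of course, if the process is wedged hard enough that this thread
doesn't get scheduled either, the abort never happens; that is exactly
what levels "2" and "3" are for.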

> 
> You could declare that multi-process QEMU is out of scope, but I think
> QEMU self-fencing would need to offer compelling benefits over host OS
> self-fencing to justify that exception. Personally I'm not seeing it.

I would limit the feature to a monolithic model to begin with, but
definitely keep an eye on ways to extend it to a multi-process model.

The benefits are as described before, with the added arguments in this e-mail:
- May not want to protect all VMs.
- May not want to kill the entire host for a temporary network outage.
- Killing Qemu is sufficient in various configurations.

> 
>> 2) Kernel sends SIGKILL (harddeadline)
> 
> This is slightly easier to extend to multiple processes, in that it
> isn't restricted by the privileges of the main QEMU vs the helpers,
> and could perhaps take advantage of cgroups.

Right, and it should be an option from the start. Thanks for weighing
in with the extra ideas around cgroups.
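
For what it's worth, on Linux this level may not even need an external
watchdog process: a process can arm a POSIX timer whose expiry
delivers SIGKILL to itself, and keep pushing the deadline back from
its main loop. A rough sketch (error handling omitted, names made up;
older glibc needs -lrt; and I believe the kernel accepts SIGKILL as
sigev_signo since the process is asking for its own demise, though
that's worth double-checking):

    #include <signal.h>
    #include <time.h>

    #define HARD_DEADLINE_SECS 30

    static timer_t fence_timer;

    /* Called from the main loop: push the hard deadline back. */
    static void fence_heartbeat(void)
    {
        struct itimerspec its = { .it_value.tv_sec = HARD_DEADLINE_SECS };
        timer_settime(fence_timer, 0, &its, NULL);
    }

    /* Arm a one-shot timer whose expiry delivers SIGKILL to us. */
    static void fence_init(void)
    {
        struct sigevent sev = {
            .sigev_notify = SIGEV_SIGNAL,
            .sigev_signo  = SIGKILL,  /* cannot be caught or ignored */
        };
        timer_create(CLOCK_MONOTONIC, &sev, &fence_timer);
        fence_heartbeat();
    }

And for the multi-process case, if the helpers are all placed in one
cgroup, I believe cgroup v2's cgroup.kill file can take out the whole
set at once, which sounds like the sort of thing you're suggesting.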

F.

> 
>> 3) HW watchdog kicks in (harderdeadline)
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

