qemu-devel

Re: Thoughts on VM fence infrastructure


From: Felipe Franciosi
Subject: Re: Thoughts on VM fence infrastructure
Date: Mon, 30 Sep 2019 17:33:02 +0000


> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <address@hidden> wrote:
> 
> * Felipe Franciosi (address@hidden) wrote:
>> 
>> 
>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <address@hidden> wrote:
>>> 
>>> * Felipe Franciosi (address@hidden) wrote:
>>>> Hi David,
>>>> 
>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <address@hidden> wrote:
>>>>> 
>>>>> * Felipe Franciosi (address@hidden) wrote:
>>>>>> Heyall,
>>>>>> 
>>>>>> We have a use case where a host should self-fence (and all VMs should
>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
>>>>>> period. Lots of ideas were floated around where libvirt could take
>>>>>> care of killing VMs or a separate service could do it. The concern
>>>>>> with those is that various failures could lead to _those_ services
>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
>>>>>> 
>>>>>> Ultimately, it feels like Qemu should be responsible for this
>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
>>>>> 
>>>>> It doesn't feel like doing it inside qemu would be any safer; something
>>>>> outside QEMU can forcibly issue a kill -9 and qemu *will* stop.
>>>> 
>>>> The argument above is that we would have to rely on this external
>>>> service being functional. Consider the case where the host is
>>>> dysfunctional, with this service perhaps crashed and a corrupt
>>>> filesystem preventing it from restarting. The VMs would never die.
>>> 
>>> Yeh that could fail.
>>> 
>>>> It feels like a Qemu timer-driven heartbeat check that calls abort() /
>>>> exit() would be more reliable. Thoughts?
>>> 
>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
>>> signal is pretty solid; it would take the kernel to do that once it's
>>> set.
>> 
>> I'm confused about why the kernel needs to be involved. If this is a
>> timer off the Qemu main loop, it can just check on the heartbeat
>> condition (which should be customisable) and call abort() if that's
>> not satisfied. If you agree on that I'd like to talk about how that
>> check could be made customisable.
> 
> There are times when the main loop can get blocked even though the CPU
> threads can be running and can in some configurations perform IO
> even without the main loop (I think!).

Ah, that's a very good point. Indeed, IO can still be performed in
those cases, especially when using vhost devices.

> By setting a timer in the kernel that sends a signal to qemu, the kernel
> will send that signal however broken qemu is.

Got you now. That's probably better. Do you reckon a signal
(SIGEV_SIGNAL) is preferable to SIGEV_THREAD?
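
For concreteness, here is roughly how I understand the kernel-timer
approach. This is only a sketch under my own assumptions:
fence_timer_arm() and fence_timer_kick() are names I made up (nothing
like them exists in Qemu today), and the signal to send is left to the
caller. The idea is that a healthy heartbeat keeps pushing the deadline
out; if nobody does, the kernel delivers the signal on expiry.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: push the deadline out to timeout_sec from now.
 * Called every time the heartbeat is seen. Note that timeout_sec == 0
 * would disarm the timer, so callers should pass a positive value.
 */
int fence_timer_kick(timer_t timer, time_t timeout_sec)
{
    struct itimerspec its;

    memset(&its, 0, sizeof(its));
    its.it_value.tv_sec = timeout_sec;  /* it_interval stays 0: one-shot */
    return timer_settime(timer, 0, &its, NULL);
}

/*
 * Hypothetical helper: create a POSIX timer that makes the kernel send
 * `signo` to this process on expiry, and arm it for the first time.
 * Returns 0 on success, -1 on failure (errno set by the failing call).
 */
int fence_timer_arm(timer_t *timer, int signo, time_t timeout_sec)
{
    struct sigevent sev;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;    /* signal delivery, not a thread */
    sev.sigev_signo = signo;

    if (timer_create(CLOCK_MONOTONIC, &sev, timer) != 0) {
        return -1;
    }
    if (fence_timer_kick(*timer, timeout_sec) != 0) {
        timer_delete(*timer);
        return -1;
    }
    return 0;
}
```

Usage would be something like fence_timer_arm(&t, SIGABRT, 30) at
startup and fence_timer_kick(t, 30) on every successful heartbeat. On
older glibc this needs linking with -lrt.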

I'm still wondering how to make this customisable so that different
types of heartbeat could be implemented (preferably without creating
external dependencies per discussion above). Thoughts welcome.
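
As a strawman for the customisable part, one shape it could take is a
small table of in-process check callbacks. Everything below is
hypothetical (the FenceHeartbeat type, fence_heartbeats_ok() and the
age-based example check are invented for illustration); it only shows
that different heartbeat types could plug in behind one interface
without an external service.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical interface: one entry per heartbeat type. */
typedef struct FenceHeartbeat {
    const char *name;
    bool (*check)(void *opaque);    /* true means "heartbeat is healthy" */
    void *opaque;                   /* per-check private state */
} FenceHeartbeat;

/* Returns true only if every registered check passes. */
bool fence_heartbeats_ok(const FenceHeartbeat *hb, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!hb[i].check(hb[i].opaque)) {
            return false;
        }
    }
    return true;
}

/* Example check state: when the heartbeat was last seen, and how old
 * (in seconds, CLOCK_MONOTONIC) it is allowed to get. */
typedef struct FenceAgeState {
    time_t last_seen;
    time_t max_age;
} FenceAgeState;

/* Example check: healthy while the last heartbeat is recent enough.
 * Fails closed if the clock cannot be read. */
bool fence_age_check(void *opaque)
{
    const FenceAgeState *s = opaque;
    struct timespec now;

    if (clock_gettime(CLOCK_MONOTONIC, &now) != 0) {
        return false;
    }
    return now.tv_sec - s->last_seen <= s->max_age;
}
```

Whatever evaluates the checks would then only postpone the fence while
fence_heartbeats_ok() returns true.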

F.

> 
>> 
>>> 
>>> IMHO the safer way is to kick the host off the network by reprogramming
>>> switches; so even if the qemu is actually alive it can't get anywhere.
>>> 
>>> Dave
>> 
>> Naturally some off-host STONITH is preferable, but that's not always
>> available. A self-fencing mechanism right at the heart of the emulator
>> can do the job without external hardware dependencies.
> 
> Dave
> 
>> Cheers,
>> Felipe
>> 
>>> 
>>> 
>>>> Felipe
>>>> 
>>>>> 
>>>>>> Does something already exist for this purpose which could be used?
>>>>>> Would a generic Qemu-fencing infrastructure be something of interest?
>>>>> Dave
>>>>> 
>>>>> 
>>>>>> Cheers,
>>>>>> F.
>>>>>> 
>>>>> --
>>>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>>> 
>>> --
>>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>> 
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
