From: Cornelia Huck
Subject: Re: [Qemu-devel] [PATCH 1/1] virtio: fail device if set_event_notifier fails
Date: Mon, 6 Mar 2017 15:56:25 +0100

On Fri, 3 Mar 2017 14:08:37 +0100
Halil Pasic <address@hidden> wrote:

> On 03/03/2017 01:50 PM, Cornelia Huck wrote:
> > On Fri, 3 Mar 2017 13:43:32 +0100
> > Halil Pasic <address@hidden> wrote:
> > 
> >> On 03/03/2017 01:21 PM, Cornelia Huck wrote:
> >>> On Thu,  2 Mar 2017 19:59:42 +0100
> >>> Halil Pasic <address@hidden> wrote:
> >>>
> >>>> The function virtio_notify_irqfd used to ignore the return code of
> >>>> event_notifier_set. Let's fail the device should this occur.
> >>>
> >>> I'm wondering if there are reasons for event_notifier_set() to fail
> >>> beyond "we've hit an internal race and should make an effort to fix
> >>> that one, or else we have completely messed up in qemu". Marking the
> >>> device broken tells the guest that there's something wrong with the
> >>> device, but I think we want qemu bug reports when there's something
> >>> broken with the irqfd.
> >>>
> >>
> >> That's why the error is logged. I understand virtio_error as something
> >> suitable for indicating bugs.
> >>
> >> What do you suggest? Forcing a dump? I would rather leave it to the
> >> user to figure out how important the state sitting in the machine
> >> and the device is, and how much effort (s)he wants to put into
> >> recovering from the failure.
> > 
> > How likely is it that those logged messages are brought to the
> > attention of the admin? Does any management software flag machines
> > with such error messages? (that's more of a general question)
> > 
> 
> I admit, I did not investigate this thoroughly, also because the patch
> is flawed with regard to multi-threading anyway. After a quick
> investigation it seems the Linux guest won't auto-reset the device, so
> the guest should end up with a non-working device. I think it's pretty
> likely that the admin will check the logs if the device was important.

Thinking a bit more about this, it seems setting the device broken is
not the right solution for exactly that reason. Setting the virtio
device broken is a way to tell the guest 'you did something broken;
please reset the device and start anew' (and that's how current
callers use it). In our case, this is not the guest's fault.

Maybe go back to the assert 'solution'? But I'm not sure that's enough
if production builds disable asserts...
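
To make the concern concrete, here is a minimal standalone sketch (not
the actual QEMU code). demo_notifier_set() is a hypothetical stand-in
for event_notifier_set(), assumed to return 0 on success and a negative
errno value on failure. The assert() variant loses its check when built
with -DNDEBUG; the explicit variant keeps it in every build.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for event_notifier_set():
 * returns 0 on success, a negative errno value on failure. */
int demo_notifier_set(int fail)
{
    return fail ? -EBADF : 0;
}

/* Variant 1: the check is compiled out when NDEBUG is defined,
 * so a failure would go unnoticed in such a build. */
void notify_with_assert(int fail)
{
    int ret = demo_notifier_set(fail);

    assert(ret == 0);
    (void)ret; /* keeps the variable "used" when assert() is a no-op */
}

/* Variant 2: explicit check that is present in every build.
 * It logs the failure and hands the error back to the caller,
 * who can then decide whether to abort, mark something broken, etc. */
int notify_checked(int fail)
{
    int ret = demo_notifier_set(fail);

    if (ret < 0) {
        fprintf(stderr, "demo: setting the notifier failed: %d\n", ret);
    }
    return ret;
}
```

What the caller should do with the error returned by the second
variant is exactly the open question in this thread.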

> 
> I agree fully that it's a very general question, and I do not feel
> competent for answering it.
> 
> > I'd like to have some kind of trigger that rings an alarm bell so that
> > the admin might consider reporting this, but I don't have a good idea
> > on how to do that either...
> > 
> 
> There are tools for aggregating and processing logs, and triggering
> alarm bells too (for example ELK = Elasticsearch + Logstash + Kibana).
> AFAIK logs are the most common way to deal with such stuff. But I'm far
> from being an expert. Of course logs are only as good as the messages
> landing in them...

Let's hope this works properly, then.



