Re: [Qemu-devel] virtio device error reporting best practice?

On Mar 20, 2014, at 8:51 AM, Michael S. Tsirkin <address@hidden> wrote:

On Wed, Mar 19, 2014 at 11:04:19AM +1030, Rusty Russell wrote:
Dave Airlie <address@hidden> writes:
So I'm looking at how best to do virtio gpu device error reporting,
and how to deal with illegal stuff,

I've two levels of errors I want to support,

a) unrecoverable or bad guest kernel programming errors,

The QEMU standard approach is to exit at this point. No, really.

It's easy on the hypervisor but often not very friendly for driver writers
who might not be qemu experts.
QEMU's moving away from exiting on errors and it would be nice
to have a robust way to report driver bugs.
How about setting VIRTIO_CONFIG_S_DEVICE_FAILED ?
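
A rough sketch of how a device model might flag a driver bug in the status byte instead of exiting. Everything here is an assumption for illustration: the struct and helper names are hypothetical (this is not QEMU's VirtIODevice API), the bit name is taken from the suggestion above, and its value is a placeholder. The only real virtio behaviour relied on is that the guest resets a device by writing 0 to the status field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bit: name from the suggestion above,
 * value is a placeholder chosen for illustration. */
#define VIRTIO_CONFIG_S_DEVICE_FAILED 0x40u

/* Hypothetical, stripped-down device state (not QEMU's VirtIODevice). */
struct sketch_vdev {
    uint8_t status;  /* device status byte visible to the guest */
    bool    broken;  /* device refuses further requests until reset */
};

/* Called when the device detects a guest driver bug:
 * flag the failure instead of calling exit(). */
static void sketch_mark_failed(struct sketch_vdev *d)
{
    d->status |= VIRTIO_CONFIG_S_DEVICE_FAILED;
    d->broken = true;
}

/* Guest write to the status byte. Writing 0 is a device reset
 * and is the only way to clear the failure flag. */
static void sketch_set_status(struct sketch_vdev *d, uint8_t val)
{
    if (val == 0) {
        d->status = 0;
        d->broken = false;
        return;
    }
    /* Keep the failure bit sticky across ordinary status writes. */
    d->status = (uint8_t)(val | (d->status & VIRTIO_CONFIG_S_DEVICE_FAILED));
}

/* Request handlers would check this before touching any queue. */
static bool sketch_can_process(const struct sketch_vdev *d)
{
    return !d->broken;
}
```

A driver developer would then see the failure bit when reading the status back, rather than watching the whole VM disappear.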

Another idea that the Windows driver implemented is reporting
a failure reason hint. It wrote the hint out to the ISR; specifically,
it notified the host about watchdog timer expiration for the net device
in this way.

I removed it for now and really would like to have an official way to bring it back.

Also, going back to the original question: Windows can handle graphics card hardware errors by reloading the driver and resetting the device (starting from Vista).

b) per 3D context errors from the renderer backend,

(b) I can easily report in an event queue and the guest kernel can in
theory blow away the offenders, this is how GL works with some
extensions,

That's probably sanest.

If it's possible to identify the offenders, I agree
a VQ is better than config space for that.
Need to make sure the queue is big enough to avoid
underrun of that queue though. Is that always possible?
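
To make the sizing concern concrete, here is a minimal sketch of a bounded per-context error queue that counts events it had to drop when the guest has not supplied enough buffers. The names, the event layout, and the queue depth are all hypothetical; a real virtqueue works with guest-posted descriptors rather than a plain array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EVQ_SIZE 4  /* hypothetical number of guest-posted buffers */

struct sketch_err_event {
    uint32_t ctx_id;  /* offending 3D context */
    uint32_t code;    /* hypothetical renderer error code */
};

struct sketch_evq {
    struct sketch_err_event ring[SKETCH_EVQ_SIZE];
    size_t   head;    /* next slot the guest will read */
    size_t   count;   /* events currently queued */
    uint32_t lost;    /* events dropped because the ring was full */
};

/* Device side: queue an error, or record that one was lost. */
static bool sketch_evq_push(struct sketch_evq *q, uint32_t ctx_id, uint32_t code)
{
    if (q->count == SKETCH_EVQ_SIZE) {
        q->lost++;
        return false;
    }
    size_t tail = (q->head + q->count) % SKETCH_EVQ_SIZE;
    q->ring[tail].ctx_id = ctx_id;
    q->ring[tail].code = code;
    q->count++;
    return true;
}

/* Guest side: consume the oldest pending error, if any. */
static bool sketch_evq_pop(struct sketch_evq *q, struct sketch_err_event *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->ring[q->head];
    q->head = (q->head + 1) % SKETCH_EVQ_SIZE;
    q->count--;
    return true;
}
```

If the number of contexts is bounded and each context can have at most one outstanding error, sizing the queue to that bound would keep `lost` at zero; otherwise the guest needs some way to learn that events were dropped.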

GPU control queue, the response should always be no error, but in some
cases it will be because the guest hit some host resource error, or
asked for something insane, (guest kernel drivers would be broken in
most of these cases).

Alternately I can use the separate event queue to send async errors
when the guest does something bad,

I'm also considering adding some sort of flag in config space saying
the device needs a reset before it will continue doing anything,

I generally dislike error codes which Never Happen; it's like making
every void function return int just in case: the caller has no idea what
to do if it fails.

The litmus test: does *your* guest handle failures other than by giving
up on the device? If so, sure, you need to have a sane error-reporting
strategy.

Right but driver development is also a valid need.

The main reason I'm considering this stuff is for security reasons if
the guest asks for something really illegal or crazy what should the
expected behaviour of the host be? (at least secure I know that).

If the guest userspace can do it, don't exit. If only the kernel can,
and it should have known better, abort is OK.
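
That rule of thumb can be written down as a tiny policy function. This is only a sketch: the enums and the function name are hypothetical, and how a device would actually tell the two origins apart is left open.

```c
/* Who was able to trigger the illegal request (hypothetical classification). */
enum sketch_origin {
    SKETCH_FROM_GUEST_USERSPACE,  /* reachable by unprivileged guest code */
    SKETCH_FROM_GUEST_KERNEL      /* only the guest driver can issue it */
};

enum sketch_action {
    SKETCH_REPORT_AND_CONTINUE,   /* report the error, keep the VM running */
    SKETCH_ABORT                  /* driver bug: aborting is acceptable */
};

/* Never let guest userspace take the VM down; a kernel-only
 * violation means the driver is broken, so aborting is OK. */
static enum sketch_action sketch_on_illegal_request(enum sketch_origin origin)
{
    switch (origin) {
    case SKETCH_FROM_GUEST_USERSPACE:
        return SKETCH_REPORT_AND_CONTINUE;
    case SKETCH_FROM_GUEST_KERNEL:
        return SKETCH_ABORT;
    }
    /* Unknown origin: take the safe path and keep running. */
    return SKETCH_REPORT_AND_CONTINUE;
}
```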

I second that, at least for now.
Maybe we will add more capabilities in virtio 1.0, or
after that.

Sure that doesn't help much!
Rusty.

From:	Yan Vugenfirer
Subject:	Re: [Qemu-devel] virtio device error reporting best practice?
Date:	Fri, 21 Mar 2014 11:44:43 +0200