qemu-devel
Re: [Qemu-devel] [PATCH v2 1/5] msix_init: assert programming error


From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH v2 1/5] msix_init: assert programming error
Date: Fri, 30 Sep 2016 16:06:18 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

Alex Williamson <address@hidden> writes:

> On Thu, 29 Sep 2016 15:11:27 +0200
> Markus Armbruster <address@hidden> wrote:
>
>> Alex Williamson <address@hidden> writes:
>> 
>> > On Tue, 13 Sep 2016 08:16:20 +0200
>> > Markus Armbruster <address@hidden> wrote:
>> >  
>> >> Cc: Alex for device assignment expertise.
>> >> 
>> >> Cao jin <address@hidden> writes:
>> >>   
>> >> > On 09/12/2016 09:29 PM, Markus Armbruster wrote:    
>> >> >> Cao jin <address@hidden> writes:
>> >> >>    
>> >> >>> The input parameters are used for creating the msix capable device, so
>> >> >>> they must obey the PCI spec; otherwise, it is a programming error.
>> >> >>>    
>> >> >>
>> >> >> True when the parameters come from a device model attempting to
>> >> >> define a PCI device violating the spec.  But what if the parameters
>> >> >> come from an actual PCI device violating the spec, via device
>> >> >> assignment?
>> >> >>  
>> >> >
>> >> > Before the patch, on invalid param, the vfio behaviour is:
>> >> >   error_report("vfio: msix_init failed");
>> >> >   then, device create fail.
>> >> >
>> >> > After the patch, its behaviour is:
>> >> >   asserted.
>> >> >
>> >> > Do you mean we should still report some useful info to user on invalid
>> >> > params?    
>> >> 
>> >> In the normal case, asking msix_init() to create MSI-X that are out of
>> >> spec is a programming error: the code that does it is broken and needs
>> >> fixing.
>> >> 
>> >> Device assignment might be the exception: there, the parameters for
>> >> msix_init() come from the assigned device, not the program.  If they
>> >> violate the spec, the device is broken.  This wouldn't be a programming
>> >> error.  Alex, can this happen?
>> >> 
>> >> If yes, we may want to handle it by failing device assignment.  
>> >
>> >
>> > Generally, I think the entire premise of these sorts of patches is
>> > flawed.  We take a working error path that allows a driver to robustly
>> > abort on unexpected data and turn it into a time bomb.  Often the
>> > excuse for this is that "error handling is hard".  Tough.  Now a
>> > hot-add of a device that triggers this changes from a simple failure to
>> > a denial of service event.  Furthermore, we base that time bomb on our
>> > interpretation of the spec, which we can only validate against in-tree
>> > devices.
>> >
>> > We have actually had assigned devices that fail the sanity test here,
>> > there's a quirk in vfio_msix_early_setup() for a Chelsio device with
>> > this bug.  Do we really want users experiencing aborts when a simple
>> > device initialization failure is sufficient?
>> >
>> > Generally abort code paths like this cause me to do my own sanity
>> > testing, which is really poor practice since we should have that sanity
>> > testing in the common code.  Thanks,  
>> 
>> I prefer to assert on programming error, because 1. it does double duty
>> as documentation, 2. error handling of impossible conditions is commonly
>> wrong, and 3. assertion failures have a much better chance to get the
>> program fixed.  Even when presence of a working error path kills 2., the
>> other two make me stick to assertions.
>
> So we're looking at:
>
>> -    if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1) {
>> -        return -EINVAL;
>> -    }
>
> vs
>
>> +    assert(nentries >= 1 && nentries <= PCI_MSIX_FLAGS_QSIZE + 1);
>
> How do you argue that one of these provides better self documentation
> than the other?

The first one says "this can happen, and when it does, the function
fails cleanly."  For a genuine programming error, this is in part
misleading.

The second one says "I assert this can't happen.  We'd be toast if I was
wrong."
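
To make the contrast concrete, here is a minimal free-standing sketch of
the two variants.  The function names are made up for illustration and
are not what the patch uses; PCI_MSIX_FLAGS_QSIZE is repeated locally
(0x07FF, as in the PCI register header) so the fragment compiles on its
own.

```c
#include <assert.h>
#include <errno.h>

/* Repeated here only so the sketch is self-contained. */
#define PCI_MSIX_FLAGS_QSIZE 0x07FF

/*
 * Hypothetical variant 1: an out-of-range count is an expected
 * condition.  The function fails cleanly and the caller decides.
 */
int check_nentries_weak(unsigned short nentries)
{
    if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1) {
        return -EINVAL;
    }
    return 0;
}

/*
 * Hypothetical variant 2: an out-of-range count violates the
 * precondition.  The assertion documents that and stops the program.
 */
int check_nentries_strict(unsigned short nentries)
{
    assert(nentries >= 1 && nentries <= PCI_MSIX_FLAGS_QSIZE + 1);
    return 0;
}
```

With the first, a caller can pass anything and look at the return value.
With the second, passing 0 or 2049 aborts the program, unless it was
built with NDEBUG, in which case the check vanishes entirely.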

> The assert may have a better chance of getting fixed, but it's because
> the existence of the assert itself exposes a vulnerability in the code.
> Which would you rather have in production, a VMM that crashes on the
> slightest deviance from the input it expects or one that simply errors
> the faulting code path and continues?

Invalid input to a program should never be treated as a programming error.

> Error handling is hard, which is why we need to look at it as a
> collection of smaller problems.  We return an error at a leaf function
> and let callers of that function decide how to handle it.  If some of
> those callers don't want to deal with error handling, abort there, we
> can come back to them later, but let the code paths that do want proper
> error handling to continue.  If we add aborts into the leaf function,
> then any calling path that wants to be robust against an error needs to
> fully sanitize the input itself, at which point we have different
> drivers sanitizing in different ways, all building up walls to protect
> themselves from the time bombs in these leaf functions.  It's crazy.

It depends on the kind of error in the leaf function.

I suspect we're talking past each other because we have different kinds
of errors in mind.

Programming is impossible without things like preconditions,
postconditions, invariants.

If a section of code is entered when its precondition doesn't hold,
we're toast.  This is the archetypical programming error.

If it can actually happen, the program is incorrect, and needs fixing.

Checking preconditions is often (but not always) practical.  In my
opinion, checking is good practice, and the proper way to check is
assert().  Makes the incorrect program fail before it can do further
damage, and helps with finding the programming error.

A precondition is part of the contract between a function and its
users.  A strong precondition can make the function's job easier, but
that's no use if the resulting function is inconvenient to use.  On the
other hand, complicating the function to get a weaker precondition
nobody actually needs is just as dumb.

Returning an error is *not* checking preconditions.  Remember, if the
precondition doesn't hold, we're toast.  If we're toast when we return
an error, we're clearly doing it wrong.

You are arguing for weaker preconditions.  I'm not actually disagreeing
with you!  I'm merely expressing my opinion that checking preconditions
with assert() is a good idea.

>> However, input out-of-spec is not a programming error.  For most users
>> of msix_init(), the arguments are hard-coded, thus invalid arguments are
>> a programming error.  For device assignment, they come from a physical
>> device, thus invalid arguments can either be a programming error (our
>> idea of "invalid" is invalid) or bad input (the physical device is
>> out-of-spec).  Since we can't know, we better handle it rather than
>> assert.
>
> So are we going to flag every call path that device assignment might
> use as one that needs "proper" error handling and anything that's only
> used by emulated devices can assert?  How will anyone ever know?  vfio
> tries really hard to be just another device in the QEMU ecosystem.

It tries, but it can't help adding a few things.

Consider the number of MSI vectors.  It can only be 1, 2, 4, 8, 16 or
32.

When the callers of msi_init() pass literal numbers, making "the number
is valid" a precondition is quite sensible.

If the numbers come from the user via configuration, they need to be
checked.  Two sane ways to do that: check close to where the
configuration is processed, and check where it is used.  The former will
likely produce better error messages.  But the latter has its
advantages, too.  Checking next to its use in msi_init() involves making
it handle invalid numbers, i.e. weakening its precondition.

Making vectors configurable moves them from the realm of
preconditions to the realm of program input.  Code needs to be updated
for that.

What device assignment adds is moving many more bits to the program
input realm.  More code needs to be updated for that.
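
A rough sketch of the two places the vector count can sit, assuming
hypothetical helper names (none of this is the actual msi_init() code).
The same validity test serves as an asserted precondition when callers
pass literals, and as an input check once the number comes from
configuration or from a device:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: true if n is one of 1, 2, 4, 8, 16, 32. */
bool msi_nr_vectors_valid(unsigned int n)
{
    /* A power of two has exactly one bit set, so n & (n - 1) is 0. */
    return n >= 1 && n <= 32 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical "precondition" flavour: callers pass literal numbers,
 * so an invalid count can only be a programming error.
 */
void msi_setup_strict(unsigned int nr_vectors)
{
    assert(msi_nr_vectors_valid(nr_vectors));
    /* ... set up the capability here ... */
}

/*
 * Hypothetical "program input" flavour: the count may come from the
 * user or from an assigned device, so an invalid count is reported.
 */
int msi_setup_checked(unsigned int nr_vectors)
{
    if (!msi_nr_vectors_valid(nr_vectors)) {
        return -EINVAL;
    }
    /* ... set up the capability here ... */
    return 0;
}
```

The second flavour is the weaker precondition: it costs an error path
that callers then have to deal with.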

>> Bottom line: you convinced me msix_init() should stay as it is.  But now
>> msi_init() looks like it needs a change: it asserts on invalid
>> nr_vectors parameter.  Does that need fixing, Alex?
>
> IMHO, they all need to be fixed.  Besides, look at the callers of
> msi_init(), almost every one will assert on its own if msi_init()
> fails, all we're doing is hindering drivers like vfio-pci that can
> gracefully handle a failure.  I think that's exactly how each of these
> should be handled, find a leaf function with asserts, convert it to
> proper error handling, change the callers that don't already handle the
> error or assert to assert, then work down through each code path to
> figure out how they can more robustly handle an error.  I don't buy the
> argument that error handling is too hard or that we're more likely to
> get it wrong.  It needs to be handled as percolating small errors, each
> of which is trivial to handle on its own.  Thanks,

Once there's a need to handle a certain condition as an error, we should
do that, no argument.  This also provides a way to test the error path.

However, I wouldn't buy an argument that preconditions should be made as
weak as possible in leaf functions (let alone always) regardless of the
cost in complexity, and non-testability of error paths.  I'm strictly a
pay as you go person.
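
For the case where the need does exist, the arrangement discussed above
(leaf function reports the condition, each caller decides what to do)
might look roughly like this.  All names are hypothetical and only
stand in for the real functions and device models:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical leaf function: reports a bad count instead of aborting. */
int leaf_init(unsigned int nr_vectors)
{
    if (nr_vectors < 1 || nr_vectors > 32 ||
        (nr_vectors & (nr_vectors - 1)) != 0) {
        return -EINVAL;
    }
    return 0;
}

/*
 * Hypothetical emulated device: the count is a literal, so a failure
 * here can only be a programming error.  Assert in the caller.
 */
void emulated_device_realize(void)
{
    int ret = leaf_init(4);

    assert(ret == 0);
    (void)ret; /* keeps the build quiet when NDEBUG drops the assert */
}

/*
 * Hypothetical assigned device: the count comes from hardware, so it
 * is program input.  Fail device initialization gracefully.
 */
int assigned_device_realize(unsigned int nr_vectors_from_hw)
{
    int ret = leaf_init(nr_vectors_from_hw);

    if (ret < 0) {
        fprintf(stderr, "device init failed: invalid vector count %u\n",
                nr_vectors_from_hw);
        return ret;
    }
    return 0;
}
```

The error path in leaf_init() is testable here because one caller can
actually reach it with bad input.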

Back to the problem at hand.  Cao jin, would you be willing to fix
msi_init()?


