[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] Error handling in realize() methods
From: |
Laszlo Ersek |
Subject: |
Re: [Qemu-devel] Error handling in realize() methods |
Date: |
Wed, 9 Dec 2015 12:10:20 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 |
On 12/09/15 11:29, Dr. David Alan Gilbert wrote:
> * Markus Armbruster (address@hidden) wrote:
>> "Dr. David Alan Gilbert" <address@hidden> writes:
>>
>>> * Markus Armbruster (address@hidden) wrote:
>>>> In general, code running withing a realize() method should not exit() on
>>>> error. Instad, errors should be propagated through the realize()
>>>> method. Additionally, the realize() method should fail cleanly,
>>>> i.e. carefully undo its side effects such as wiring of interrupts,
>>>> mapping of memory, and so forth. Tedious work, but necessary to make
>>>> hot plug safe.
>> [...]
>>>> Next, let's consider the special case "out of memory".
>>>>
>>>> Our general approach is to treat it as immediately fatal. This makes
>>>> sense, because when a smallish allocation fails, the process is almost
>>>> certainly doomed anyway. Moreover, memory allocation is frequent, and
>>>> attempting to recover from failed memory allocation adds loads of
>>>> hard-to-test error paths. These are *dangerous* unless carefully tested
>>>> (and we don't).
>>>>
>>>> Certain important allocations we handle more gracefully. For instance,
>>>> we don't want to die when the user tries to hot-plug more memory than we
>>>> can allocate, or tries to open a QCOW2 image with a huge L1 table.
>>>>
>>>> Guest memory allocation used to have the "immediately fatal" policy
>>>> baked in at a fairly low level, but it's since been lifted into callers;
>>>> see commit c261d77..fc7a580 and fixups 4f96676..0bdaa3a. During review
>>>> of the latter, Peter Crosthwaite called out the &error_fatal in the
>>>> realize methods and their supporting code. I agreed with him back then
>>>> that the errors should really be propagated. But now I've changed my
>>>> mind: I think we should treat these memory allocation failures like
>>>> we've always treated them, namely report and exit(1). Except for
>>>> "large" allocations, where we have a higher probability of failure, and
>>>> a more realistic chance to recover safely.
>>>>
>>>> Can we agree that passing &error_fatal to memory_region_init_ram() &
>>>> friends is basically okay even in realize() methods and their supporting
>>>> code?
>>>
>>> I'd say it depends if they can be hotplugged; I think anything that we
>>> really
>>> want to hotplug onto real running machines (as opposed to where we're just
>>> hotplugging during setup) we should propagate errors properly.
>>>
>>> And tbh I don't buy the small allocation argument; I think we should handle
>>> them
>>> all; in my utopian world a guest wouldn't die unless there was no way out.
>>
>> I guess in Utopia nobody ever makes stupid coding mistakes, the error
>> paths are all covered by a comprehensive test suite, and they make the
>> code prettier, too. Oh, and kids always eat their vegetables without
>> complaint.
>
> Yes, it's lovely.
>
>> However, we don't actually live in Utopia. In our world, error paths
>> clutter the code, are full of bugs, and the histogram of their execution
>> counts in testing (automated or not) has a frighteningly tall bar at
>> zero.
>>
>> We're not going to make this problem less severe by making it bigger.
>> In fact, we consciously decided to hack off a big chunk with an axe:
>>
>> commit 8a1d02aba9f986ca03d854184cd432ee98bcd179
>> Author: aliguori <address@hidden>
>> Date: Thu Feb 5 22:05:49 2009 +0000
>>
>> Terminate emulation on memory allocation failure (Avi Kivity)
>>
>> Memory allocation failures are a very rare condition on virtual-memory
>> hosts. They are also very difficult to handle correctly (especially in a
>> hardware emulation context). Because of this, it is better to gracefully
>> terminate emulation rather than executing untested or even unwritten
>> recovery
>> code paths.
>>
>> This patch changes the qemu memory allocation routines to terminate
>> emulation
>> if an allocation failure is encountered.
>>
>> Signed-off-by: Avi Kivity <address@hidden>
>> Signed-off-by: Anthony Liguori <address@hidden>
>>
>>
>> git-svn-id: svn://svn.savannah.nongnu.org/qemu/address@hidden
>> c046a42c-6fe2-441c-8c8c-71466251a162
>>
>> Let me elaborate a bit on Avi's arguments:
>>
>> * Memory allocations are very, very common. I count at least 2500,
>> Memory allocation failure is easily the most common *potential* error,
>> both statically and dynamically.
>>
>> * Error recovery is always tedious and often hard. Especially when the
>> error happens in the middle of a complex operation that needs to be
>> fully rolled back to recover. A sensible approach is to acquire
>> resources early, when recovery is still relatively easy, but that's
>> often impractical for memory. This together with the ubiquitousness
>> of memory allocation makes memory allocation failure even harder to
>> handle than other kinds of errors.
>>
>> * Not all allocations are equal. When an attempt to allocate a gigabyte
>> fails gracefully, there's a good chance that ordinary code can go on
>> allocating and freeing kilobytes as usual. But when an attempt to
>> allocate kilobytes fails, it's very likely that handling this failure
>> gracefully will only lead to another one, and another one, until some
>> buggy error handling puts the doomed process out of its misery.
>>
>> Out of the ~2400 memory allocations that go through GLib, 59 can fail.
>> The others all terminate the process.
>>
>> * How often do we see failures from these other 2300+? Bug reports from
>> users? As far as I can see, they're vanishingly rare.
>>
>> * Reviewing and testing the error handling chains rooted at 59
>> allocations is hard enough, and I don't think we're doing particularly
>> well on the testing. What chance would we have with 2300+ more?
>>
>> 2300+ instances of a vanishingly rare error with generally complex
>> error handling and basically no test coverage: does that sound more
>> useful than 2300+ instances of a vanishingly rare error that kills the
>> process? If yes, how much of our limited resources is the difference
>> worth?
>>
>> * You might argue that we don't have to handle all 2300+ instances, only
>> the ones reachable from hotplug. I think that's a pipe dream. Once
>> you permit "terminate on memory allocation failure" in general purpose
>> code, it hides behind innocent-looking function calls. Even if we
>> could find them all, we'd still have to add memory allocation failure
>> handling to lots of general purpose code just to make it usable for
>> hot plug. And then we'd get to keep finding them all forever.
>
> I didn't say it was easy :-)
>
>> I don't think handling all memory allocation failures gracefully
>> everywhere or even just within hotplug is practical this side of Utopia.
>> I believe all we could achieve trying is an illusion of graceful
>> handling that is sufficiently convincing to let us pat on our backs,
>> call the job done, and go back to working on stuff that matters to
>> actual users.
>
> Handling them all probably isn't; handling some well defined cases is
> probably possible.
>
> Avi's argument is 6 years old, I suggest a few things have changed in that
> time:
> a) We now use the Error** mechanism in a lot of places - so a lot of
> code already is supposed to deal with a function call failing; if a function
> already has an error return and the caller deals with it, then making
> the function deal with an allocation error and the caller handle it is
> a lot easier.
> b) The use of hotplug is now common - people really hate it when their
> nice, happy working VM dies when they try and do something to it, like
> hotplug or migrate.
> c) I suspect (but don't know) that people are pushing the number of VMs on
> a host harder than they used to, but there again memory got cheap.
>
> I'm not that eager to protect every allocation; but in some common
> cases, where we already have Error** paths and it's relatively simple,
> then I think we should.
>
> (OK, to be honest I think we should protect every allocation - but I do
> have sympathy with the complexity/testing arguments).
I've been following this discussion with great interest.
My opinion should not be considered, because I won't be turning my
opinion into new code, or an agreement to support / maintain code. :)
My opinion is that
- every single allocation needs to be checked rigorously,
- any partial construction of a more complex object needs to be rolled
back in precise reverse order upon encountering any kind of failure
(not just allocation)
- this should occur regardless of testing coverage (although projects
exist (for example, SQLite, IIRC) that use random or systematic
malloc() error injection in their test suite, for good coverage)
- the primary requirements for this to work are:
- a clear notion of ownership at any point in the code
- a disciplined approach to ownership tracking; for example, a helper
callee (responsible for constructing a member of a more complex
object) is forbidden from releasing "sibling" resources allocated
by the caller
This is possible to do (I'm doing it and enforcing it in OVMF), but it
takes a lot of discipline, and *historically* the QEMU codebase has
stunk, whenever I've looked at its ownership tracking during
construction of objects.
I feel that in the last sequence of months (years?) the developer
discipline and the codebase have improved a *great* deal. Still I cannot
say how feasible it would be to bring all existent code into conformance
with the above.
... As I said, I just wanted to express this opinion as another (not
really practical) data point. My children utterly hate spinach, so
Markus's counterpoint is definitely not lost on me.
Thanks
Laszlo
>
> Dave
>
>> My current working assumption is that passing &error_fatal to
>> memory_region_init_ram() & friends is okay even in realize() methods and
>> their supporting code, except when the allocation can be large. Even
>> then, &error_fatal is better than buggy recovery code (which I can see
>> all over the place, but that's a separate topic).
>
>
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>
- [Qemu-devel] Error handling in realize() methods, Markus Armbruster, 2015/12/08
- Re: [Qemu-devel] Error handling in realize() methods, Markus Armbruster, 2015/12/10
- Re: [Qemu-devel] Error handling in realize() methods, Paolo Bonzini, 2015/12/09
- Re: [Qemu-devel] Error handling in realize() methods, Dr. David Alan Gilbert, 2015/12/09
- Re: [Qemu-devel] Error handling in realize() methods, Paolo Bonzini, 2015/12/09
- Re: [Qemu-devel] Error handling in realize() methods, Markus Armbruster, 2015/12/10