From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH v3 2/5] ppc: spapr: Handle "ibm,nmi-register" and "ibm,nmi-interlock" RTAS calls
Date: Thu, 28 Sep 2017 13:58:15 +1000
User-agent: Mutt/1.9.0 (2017-09-02)

On Wed, Sep 27, 2017 at 05:23:51PM +0530, Aravinda Prasad wrote:
> 
> 
> On Wednesday 27 September 2017 12:45 PM, David Gibson wrote:
> > On Thu, Sep 21, 2017 at 02:39:06PM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Tuesday 22 August 2017 07:38 AM, David Gibson wrote:
> >>
> >> [ . . . ]
> >>
> >>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>>>> index 46012b3..eee8d33 100644
> >>>>>> --- a/include/hw/ppc/spapr.h
> >>>>>> +++ b/include/hw/ppc/spapr.h
> >>>>>> @@ -123,6 +123,12 @@ struct sPAPRMachineState {
> >>>>>>       * occurs during the unplug process. */
> >>>>>>      QTAILQ_HEAD(, sPAPRDIMMState) pending_dimm_unplugs;
> >>>>>>  
> >>>>>> +    /* State related to "ibm,nmi-register" and "ibm,nmi-interlock" calls */
> >>>>>> +    target_ulong guest_machine_check_addr;
> >>>>>> +    bool mc_in_progress;
> >>>>>> +    int mc_cpu;
> >>>>>
> >>>>> mc_cpu isn't actually used yet in this patch.  In any case it and
> >>>>> mc_in_progress could probably be folded together, no?
> >>>>
> >>>> It is possible to fold mc_cpu and mc_in_progress together with the
> >>>> convention that -1 means no machine check is in progress; otherwise it
> >>>> holds the index of the CPU handling the machine check.
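
A minimal sketch of that convention (illustrative only, not the actual
patch; the helper name is made up):

    /* In sPAPRMachineState: -1 means no machine check is being handled,
     * otherwise this holds the index of the vCPU handling it. */
    int mc_cpu;    /* initialised to -1 when the machine is created */

    static bool spapr_mc_in_progress(sPAPRMachineState *spapr)
    {
        return spapr->mc_cpu != -1;
    }
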
> >>>>
> >>>>>
> >>>>> These values will also need to be migrated, AFAICT.
> >>>>
> >>>> I am thinking about how to handle migration when machine check handling
> >>>> is in progress. Probably wait for the machine check handling to complete
> >>>> before migrating, as the error could be irrelevant once we have migrated
> >>>> to new hardware. If that is the case, we don't need to migrate these values.
> >>>
> >>> Ok.
> >>
> >> This is what I think about handling machine checks during migration, based
> >> on my understanding of the VM migration code.
> >>
> >> There are two possibilities here. First, migration can be initiated
> >> while machine check handling is in progress. Second, a machine check
> >> error can happen while the migration is in progress.
> >>
> >> To handle the first case, we can add a migrate_add_blocker() call when we
> >> start handling the machine check error and issue migrate_del_blocker()
> >> when done. I think this should solve the issue.
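
A rough sketch of that approach (not the posted patch; the helper names
and the fwnmi_migration_blocker field are made up for illustration):

    #include "qemu/osdep.h"
    #include "qapi/error.h"
    #include "migration/blocker.h"
    #include "hw/ppc/spapr.h"

    /* Call when we start delivering a machine check to the guest. */
    static int spapr_mc_block_migration(sPAPRMachineState *spapr)
    {
        Error *local_err = NULL;

        error_setg(&spapr->fwnmi_migration_blocker,
                   "A machine check is being handled; migration is blocked");
        if (migrate_add_blocker(spapr->fwnmi_migration_blocker,
                                &local_err) < 0) {
            /* e.g. the blocker was refused because migration already began */
            error_free(local_err);
            error_free(spapr->fwnmi_migration_blocker);
            spapr->fwnmi_migration_blocker = NULL;
            return -1;
        }
        return 0;
    }

    /* Call from the "ibm,nmi-interlock" path once handling is complete. */
    static void spapr_mc_unblock_migration(sPAPRMachineState *spapr)
    {
        if (spapr->fwnmi_migration_blocker) {
            migrate_del_blocker(spapr->fwnmi_migration_blocker);
            error_free(spapr->fwnmi_migration_blocker);
            spapr->fwnmi_migration_blocker = NULL;
        }
    }
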
> >>
> >> The second case is a bit tricky. The migration has already started, and
> >> hence the migrate_add_blocker() call will fail. We also cannot wait until
> >> the migration completes to handle the machine check error, as the VM's
> >> data could be corrupt.
> >>
> >> Machine check errors should not be an issue while the migration is in the
> >> RAM copy phase, as the VM is still active with vCPUs running. The problem
> >> is when we hit a machine check as the migration is about to complete. For
> >> example,
> >>
> >> 1. vCPU2 hits a machine check error during migration.
> >>
> >> 2. KVM causes a VM exit on vCPU2 and the NIP of vCPU2 is changed to the
> >> guest-registered machine check handler.
> >>
> >> 3. migration_completion() issues vm_stop(), and hence vCPU2 is either
> >> never scheduled again on the source hardware or is preempted while
> >> executing the machine check handler.
> >>
> >> 4. vCPU2 is resumed on the target hardware and either starts or
> >> continues processing the machine check error. This could be a problem, as
> >> these errors are specific to the source hardware. For instance, when the
> >> guest issues memory poisoning upon such an error, a clean page on the
> >> target hardware is poisoned while the corrupt page on the source hardware
> >> is not poisoned.
> >>
> >> The second case of hitting a machine check during the final phase of
> >> migration is rare, but I wanted to check what others think about it.
> > 
> > So, I've had a bit of a think about this.  I don't recall if these
> > fwnmi machine checks are expected on guest RAM, or guest IO addresses.
> 
> They are expected on guest RAM. I am not sure about guest IO addresses.
> 
> > 
> > 1) If RAM
> > 
> >   What exactly is the guest's notification for?  Even without
> >   migration, the host's free to move guest memory around in host
> >   memory, so it seems any hardware level poking should be done on the
> >   host side.
> 
> If the error is a correctable error, then the host takes care of it by
> moving the page to a different location, and the guest need not be and
> will not be notified. The guest is notified only if the host is not able
> to fully recover. Hence we hit FWNMI in the guest when RAM errors are not
> recovered by the host.

Ok.

> >   Is it just to notify the guest that we weren't able to fully recover
> >   on the host side and that page may contain corrupted data?  If
> >   that's so then it seems resuming the handling on the destination is
> >   still right.  It may be new good RAM, but the contents we migrated
> >   could still be corrupt from the machine check event on the source.
> 
> Yes. This is what I am doing in my v5 patch set, which I am about to
> post. Additionally, I block migration while processing machine check
> errors.
> 
> > 
> > 2) If IO
> > 
> >   AFAICT this could only happen with VFIO passthrough devices.. but it
> >   shouldn't be possible to migrate if there are any of those.
> > 
> 
> I am not very sure about IO errors.

Ok.  It sounds like that's not the primary case you're interested in, so
I guess we can ignore it for now.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

