From: Igor Mammedov
Subject: Re: [Qemu-devel] [PATCH v6 wave 2 2/3] hw/isa/lpc_ich9: add broadcast SMI feature
Date: Wed, 18 Jan 2017 11:03:53 +0100

On Tue, 17 Jan 2017 19:53:21 +0100
Laszlo Ersek <address@hidden> wrote:

> On 01/17/17 15:20, Igor Mammedov wrote:
> > On Tue, 17 Jan 2017 14:46:05 +0100
> > Laszlo Ersek <address@hidden> wrote:
> >   
> >> On 01/17/17 14:20, Igor Mammedov wrote:  
> >>> On Tue, 17 Jan 2017 13:06:13 +0100
> >>> Laszlo Ersek <address@hidden> wrote:
> >>>     
> >>>> On 01/17/17 12:21, Igor Mammedov wrote:    
> >>>>> On Mon, 16 Jan 2017 21:46:55 +0100
> >>>>> Laszlo Ersek <address@hidden> wrote:
> >>>>>       
> >>>>>> On 01/13/17 14:09, Igor Mammedov wrote:      
> >>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
> >>>>>>> Laszlo Ersek <address@hidden> wrote:
> >>>>>>>         
> >>>>>>>> The generic edk2 SMM infrastructure prefers
> >>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each 
> >>>>>>>> processor. If
> >>>>>>>> Trigger() only brings the current processor into SMM, then edk2 
> >>>>>>>> handles it
> >>>>>>>> in the following ways:
> >>>>>>>>
> >>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
> >>>>>>>>     ExitBootServices(), but is not necessarily true at runtime), 
> >>>>>>>> then:
> >>>>>>>>
> >>>>>>>>     (a) If edk2 has been configured for "traditional" SMM 
> >>>>>>>> synchronization,
> >>>>>>>>         then the BSP sends directed SMIs to the APs with APIC 
> >>>>>>>> delivery,
> >>>>>>>>         bringing them into SMM individually. Then the BSP runs the 
> >>>>>>>> SMI
> >>>>>>>>         handler / dispatcher.
> >>>>>>>>
> >>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM 
> >>>>>>>> synchronization,
> >>>>>>>>         then the APs that are not already in SMM are not brought in, 
> >>>>>>>> and
> >>>>>>>>         the BSP runs the SMI handler / dispatcher.
> >>>>>>>>
> >>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
> >>>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
> >>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
> >>>>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
> >>>>>>>>
> >>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
> >>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
> >>>>>>>> command from (2) can take more than 3 seconds to complete, because
> >>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
> >>>>>>>>
> >>>>>>>> The larger problem is that QEMU's current behavior diverges from the
> >>>>>>>> behavior usually seen on physical hardware, and that keeps exposing
> >>>>>>>> obscure corner cases, race conditions and other instabilities in 
> >>>>>>>> edk2,
> >>>>>>>> which generally expects / prefers a software SMI to affect all CPUs 
> >>>>>>>> at
> >>>>>>>> once.
> >>>>>>>>
> >>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU to 
> >>>>>>>> inject
> >>>>>>>> the SMI on all VCPUs.
> >>>>>>>>
> >>>>>>>> While the original posting of this patch
> >>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
> >>>>>>>> only intended to speed up (2), based on our recent "stress testing" 
> >>>>>>>> of SMM
> >>>>>>>> this patch actually provides functional improvements.
> >>>>>>>>
> >>>>>>>> Cc: "Michael S. Tsirkin" <address@hidden>
> >>>>>>>> Cc: Gerd Hoffmann <address@hidden>
> >>>>>>>> Cc: Igor Mammedov <address@hidden>
> >>>>>>>> Cc: Paolo Bonzini <address@hidden>
> >>>>>>>> Signed-off-by: Laszlo Ersek <address@hidden>
> >>>>>>>> Reviewed-by: Michael S. Tsirkin <address@hidden>
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>> Notes:
> >>>>>>>>     v6:
> >>>>>>>>     - no changes, pick up Michael's R-b
> >>>>>>>>     
> >>>>>>>>     v5:
> >>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
> >>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
> >>>>>>>>       DEFINE_PROP_BIT() in the next patch)
> >>>>>>>>
> >>>>>>>>  include/hw/i386/ich9.h |  3 +++
> >>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
> >>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> >>>>>>>> index da1118727146..18dcca7ebcbf 100644
> >>>>>>>> --- a/include/hw/i386/ich9.h
> >>>>>>>> +++ b/include/hw/i386/ich9.h
> >>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
> >>>>>>>>  #define ICH9_SMB_HST_D1                         0x06
> >>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB                  0x07
> >>>>>>>>  
> >>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
> >>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT            0
> >>>>>>>> +
> >>>>>>>>  #endif /* HW_ICH9_H */
> >>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> >>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
> >>>>>>>> --- a/hw/isa/lpc_ich9.c
> >>>>>>>> +++ b/hw/isa/lpc_ich9.c
> >>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, 
> >>>>>>>> void *arg)
> >>>>>>>>  
> >>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
> >>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
> >>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >>>>>>>> +        if (lpc->smi_negotiated_features &
> >>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
> >>>>>>>> +            CPUState *cs;
> >>>>>>>> +            CPU_FOREACH(cs) {
> >>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>>>> +            }        
> >>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?      
> >>>>>>>   
> >>>>>>
> >>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry 
> >>>>>> code
> >>>>>> -- before SMBASE relocation -- accommodates all VCPUs entering the same
> >>>>>> code area for execution. The VCPUs are told apart from each other by
> >>>>>> Initial APIC ID (that is, "who am I"), which is used as an index or
> >>>>>> search criterion into a global shared array. Once they find their
> >>>>>> respective private data blocks, the VCPUs don't step on each other's 
> >>>>>> toes.      
> >>>>> That's what I'm not sure about.
> >>>>> If a broadcast SMI is sent before relocation, all CPUs will use
> >>>>> the same SMBASE and as a result save their state in the same
> >>>>> memory (even if it's the state after RESET/POWER ON), racing with and
> >>>>> overwriting each other's saved state
> >>>>
> >>>> Good point!
> >>>>    
> >>>>> and then, on exit from SMI, they will all restore
> >>>>> whatever state that race produced, so it seems like plainly the wrong thing to do.
> >>>>>
> >>>>> Are you sure that edk2 does broadcast for SMI initialization, or does it
> >>>>> use directed SMIs?
> >>>>
> >>>> Hmmm, you are right. The SmmRelocateBases() function in
> >>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE for
> >>>> each individual AP in succession, by sending SMI IPIs to them, one by
> >>>> one. Then it does the same for the BSP, by sending itself a similar SMI 
> >>>> IPI.
> >>>>
> >>>> The above QEMU code is only exercised much later, after the SMBASE
> >>>> relocation, with the SMM stack up and running. It is used when a
> >>>> (preferably broadcast) SMI is triggered for firmware services (for
> >>>> example, the UEFI variable services).
> >>>>
> >>>> So, you are right.
> >>>>
> >>>> Considering edk2 only, the difference shouldn't matter -- when this code
> >>>> is invoked (via an IO port write), the SMBASEs will all have been 
> >>>> relocated.
> >>>>
> >>>> Also, I've been informed that on real hardware, writing to APM_CNT
> >>>> triggers an SMI on all CPUs, regardless of the individual SMBASE values.
> >>>> So I believe the above code should be closest to real hardware, and the
> >>>> pre-patch code was only written in unicast form for SeaBIOS's sake.
> >>>>
> >>>> I think the code is safe above. If the guest injects an SMI via APM_CNT
> >>>> after negotiating SMI broadcast, it should have not left any VCPUs
> >>>> without an SMI handler by that time. edk2 / OVMF conforms to this. In
> >>>> the future we can tweak this further, if necessary; we have 63 more
> >>>> bits, and we'll be able to reject invalid combinations of bits.
> >>>>
> >>>> Do you feel that we should skip VCPUs whose SMBASE has not been changed
> >>>> from the default?
> >>>>
> >>>> If so, doesn't that run the risk of missing a VCPU that has an actual
> >>>> SMI handler installed, and it just happens to be placed at the default
> >>>> SMBASE location?
> >>>>
> >>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated SMBASEs,
> >>>> but I think (a) that's not what real hardware does, and (b) it is risky
> >>>> if a VCPU's actual SMI handler has been installed while keeping the
> >>>> default SMBASE value.
> >>>>
> >>>> What do you think?    
> >>> (a) probably doesn't consider hotplug, so it's totally fine for the case
> >>> where firmware is the only one who wakes up/initializes CPUs.
> >>> However, consider 2 CPUs being hotplugged and then a broadcast SMI happening
> >>> to serve some other task (if there isn't an unpark 'feature').
> >>> Even if real hw does it, it's behavior not defined by the SDM, with possibly
> >>> bad consequences.
> >>>
> >>> (b) alone is risky, so I'd skip based on a combination of
> >>>  
> >>>  if (default SMBASE & CPU is in reset state)
> >>>     skip;
> >>>
> >>> that should be safe and it leaves open possibility to avoid adding
> >>> unpark 'feature' to CPU.    
> >>
> >> The above attributes ("SMBASE" and "CPU is in reset state") are specific
> >> to x86 VCPUs. Should I use some dynamic QOM casting for this? Like the
> >> X86_CPU() macro?
> >>
> >> Can I assume here that the macro will never return NULL? (I think so,
> >> LPC is only used with x86.)  
> > yep, that should work.
> >   
> >>
> >> ... I guess SMBASE can be accessed as
> >>
> >>   X86_CPU(cs)->env.smbase  
> > preferred style:
> >     X86CPU *cpu = X86_CPU(cs);
> >     cpu->...
> >   
> >>
> >> How do I figure out if the CPU is in reset state ("waiting for first
> >> INIT") though? Note that this state should be distinguished from simply
> >> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
> >> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
> >> APs to sleep.  
> > 
> > how about using EIP reset value?
> > x86_cpu_reset()
> >    ...
> >    env->eip = 0xfff0;  
> 
> Okay, so I tried this. It doesn't work.
> 
> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
> function from ich9_apm_ctrl_changed(), due to size reasons):
> 
> > static void ich9_apm_broadcast_smi(void)
> > {
> >     CPUState *cs;
> >
> >     cpu_synchronize_all_states(); /* <--------- mark this */
It should probably be executed after pause_all_vcpus();
maybe that will help.

> >     CPU_FOREACH(cs) {
> >         X86CPU *cpu = X86_CPU(cs);
> >         CPUX86State *env = &cpu->env;
> >
> >         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >             CPUClass *k = CPU_GET_CLASS(cs);
> >             uint64_t cpu_arch_id = k->get_arch_id(cs);
> >
> >             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >             continue;
> >         }
> >
> >         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >     }
> > }  
> 
[...]
> (b) If I add the cpu_synchronize_all_states() call before the loop, then
> the incorrect filter matches go away; QEMU sees the KVM VCPU states
> correctly, and the SMI is broad-cast.
> 
> However, in this case, the boot slows to a *crawl*. It's unbearable. All
> VCPUs are pegged at 100%, and you can easily follow the OVMF debug log
> with the naked eye, almost line by line.
> I didn't expect that cpu_synchronize_all_states() would be so costly,
> but it makes the idea of checking VCPU state in the APM_CNT handler a
> non-starter.
I wonder where the bottleneck is that slows the guest down so much.
What is the actual cost of calling cpu_synchronize_all_states()?

Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
would help.

 
> Can we stick with the current "v6 wave 2" in light of this?
> 
> Thanks,
> Laszlo
