qemu-stable

Re: [Qemu-stable] [PATCH] x86: Reset MTRR on vCPU reset


From: Alex Williamson
Subject: Re: [Qemu-stable] [PATCH] x86: Reset MTRR on vCPU reset
Date: Wed, 13 Aug 2014 16:06:29 -0600

On Wed, 2014-08-13 at 22:33 +0200, Laszlo Ersek wrote:
> a number of comments -- feel free to address or ignore each as you see fit:
> 
> On 08/13/14 21:09, Alex Williamson wrote:
> > The SDM specifies (June 2014 Vol3 11.11.5):
> > 
> >     On a hardware reset, the P6 and more recent processors clear the
> >     valid flags in variable-range MTRRs and clear the E flag in the
> >     IA32_MTRR_DEF_TYPE MSR to disable all MTRRs. All other bits in the
> >     MTRRs are undefined.
> > 
> > We currently do none of that, so whatever MTRR settings you had prior
> > to reset are what you have after reset.  Usually this doesn't matter
> > because KVM often ignores the guest mappings and uses write-back
> > anyway.  However, if you have an assigned device and an IOMMU that
> > allows NoSnoop for that device, KVM defers to the guest memory
> > mappings which are now stale after reset.  The result is that OVMF
> > rebooting on such a configuration takes a full minute to LZMA
> > decompress the EFI volume, a process that is nearly instant on the
> 
> For pedantry, instead of "EFI volume" we could say "LZMA-compressed
> Firmware File System file in the FVMAIN_COMPACT firmware volume".

Can you come up with something with maybe half that many words?  And
also, does it matter?  I want someone using OVMF and experiencing a long
reboot delay to know that this might fix their problem.  Noting that the
major time consuming stall is in the LZMA decompression code helps to
rationalize why the mapping change is important.  The specific blob of
data that's being decompressed seems mostly irrelevant, which is why I
only gave it 2 words.

> > initial boot.
> > 
> > Add support for resetting the SDM-defined bits on vCPU reset.
> > 
> > Also, by my count we're already in danger of overflowing the entries
> > array that we pass to KVM, so I've topped it up for a bit of headroom.
> > 
> > Signed-off-by: Alex Williamson <address@hidden>
> > Cc: address@hidden
> > ---
> > 
> >  target-i386/cpu.c |    6 ++++++
> >  target-i386/cpu.h |    4 ++++
> >  target-i386/kvm.c |   14 +++++++++++++-
> >  3 files changed, 23 insertions(+), 1 deletion(-)
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index 6d008ab..b5ae654 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -2588,6 +2588,12 @@ static void x86_cpu_reset(CPUState *s)
> >  
> >      env->xcr0 = 1;
> >  
> > +    /* MTRR init - Clear global enable bit and valid bit in each variable reg */
> > +    env->mtrr_deftype &= ~MSR_MTRRdefType_Enable;
> > +    for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> > +        env->mtrr_var[i].mask &= ~MSR_MTRRphysMask_Valid;
> > +    }
> > +
> 
> I can see that the limit, MSR_MTRRcap_VCNT, is #defined as 8. Would you
> be willing to update the definition of the "CPUX86State.mtrr_var" array
> too, in "target-i386/cpu.h"? Currently it says:

I was tempted to do that, but I was hoping there was some deeper
reasoning why these were already defined separately.  For instance, what
if we wanted to keep a stable vmstate size, but expose fewer variable
MTRRs to the guest.  MSR_MTRRcap_VCNT is the number exposed to the
guest, so it makes sense that we only need to clear the valid bits on
those.  As I look through the commits that got us here, that was
probably just wishful thinking.
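
To make the distinction between storage size and exposed count concrete, here is a minimal sketch. It is not the QEMU code: MtrrState, MTRR_VAR_SLOTS, MTRR_VCNT and mtrr_reset() are hypothetical stand-ins for CPUX86State.mtrr_var[8], MSR_MTRRcap_VCNT and the hunk in x86_cpu_reset(). It assumes the two counts are allowed to differ and guards that assumption at compile time.

```c
#include <stdint.h>

/* Hypothetical stand-ins, not the QEMU definitions. */
#define MTRR_VAR_SLOTS  8               /* storage size, like mtrr_var[8]        */
#define MTRR_VCNT       8               /* count exposed to the guest            */
#define DEFTYPE_ENABLE  (1ull << 11)    /* E flag in IA32_MTRR_DEF_TYPE          */
#define PHYSMASK_VALID  (1ull << 11)    /* valid flag in each PhysMask register  */

/* The reset loop must never walk past the storage it clears. */
_Static_assert(MTRR_VCNT <= MTRR_VAR_SLOTS,
               "exposed MTRR count exceeds storage");

typedef struct {
    uint64_t base;
    uint64_t mask;
} MtrrVar;

typedef struct {
    uint64_t deftype;
    MtrrVar  var[MTRR_VAR_SLOTS];
} MtrrState;

/* Clear only the bits the SDM defines on reset; leave everything else alone. */
static void mtrr_reset(MtrrState *s)
{
    s->deftype &= ~DEFTYPE_ENABLE;
    for (int i = 0; i < MTRR_VCNT; i++) {
        s->var[i].mask &= ~PHYSMASK_VALID;
    }
}
```

If the exposed count were ever lowered below the storage size, the slots past MTRR_VCNT would keep their valid bits, which is harmless only as long as the guest cannot reach them.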

>     MTRRVar mtrr_var[8];
> 
> >  #if !defined(CONFIG_USER_ONLY)
> >      /* We hard-wire the BSP to the first CPU. */
> >      if (s->cpu_index == 0) {
> > diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> > index e634d83..139890f 100644
> > --- a/target-i386/cpu.h
> > +++ b/target-i386/cpu.h
> > @@ -337,6 +337,8 @@
> >  #define MSR_MTRRphysBase(reg)           (0x200 + 2 * (reg))
> >  #define MSR_MTRRphysMask(reg)           (0x200 + 2 * (reg) + 1)
> >  
> > +#define MSR_MTRRphysMask_Valid (1 << 11)
> > +
> 
> Note: a signed integer (int32_t).
> 
> >  #define MSR_MTRRfix64K_00000            0x250
> >  #define MSR_MTRRfix16K_80000            0x258
> >  #define MSR_MTRRfix16K_A0000            0x259
> > @@ -353,6 +355,8 @@
> >  
> >  #define MSR_MTRRdefType                 0x2ff
> >  
> > +#define MSR_MTRRdefType_Enable (1 << 11)
> > +
> 
> Note: a signed integer (int32_t).
> 
> Now, if you scroll back to the bit-clearing in x86_cpu_reset(), you see
> 
>   ~MSR_MTRRdefType_Enable
> 
> and
> 
>  ~MSR_MTRRphysMask_Valid
> 
> These expressions evaluate to negative int (int32_t) values (because the
> bit-neg sets their sign bits).
> 
> Due to two's complement (which we are allowed to assume in qemu, see
> HACKING), the negative int32_t values will be just correct for the next
> step, when they are converted to uint64_t for the bit-ands, as part of
> the usual arithmetic conversions. ("env->mtrr_deftype" and
> "env->mtrr_var[i].mask" are uint64_t.) Mathematically this means an
> addition of UINT64_MAX+1. ("Sign extended".)
> 
> In general, even though they are correct due to two's complement, I
> dislike such detours into negative-valued signed integers by way of
> bit-neg, because people are mostly unaware of them and assume they "just
> work". My preferred solution would be
> 
> #define MSR_MTRRphysMask_Valid (1ull << 11)
> #define MSR_MTRRdefType_Enable (1ull << 11)
> 
> Feel free to ignore this of course.

This seems like an uphill battle, but I suppose I don't have any problem
with an overly pedantic definition like this.
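
For anyone following along, a small sketch of the conversion Laszlo describes. The macro and function names are made up for illustration, and it assumes a 32-bit int. The int form works because the negative value is sign-extended when converted to uint64_t. The unsigned int variant is not in the patch; it is included only to show the nearby trap that the 1ull form rules out.

```c
#include <stdint.h>

/* Hypothetical names; bit 11 as in the patch. Assumes 32-bit int. */
#define VALID_INT   (1 << 11)     /* type int, as posted                  */
#define VALID_UINT  (1u << 11)    /* type unsigned int, NOT in the patch  */
#define VALID_ULL   (1ull << 11)  /* type unsigned long long, suggested   */

/* ~VALID_INT is a negative int; conversion to uint64_t sign-extends it
 * to 0xFFFFFFFFFFFFF7FF, so only bit 11 is cleared. */
static uint64_t clear_with_int(uint64_t v)
{
    return v & ~VALID_INT;
}

/* ~VALID_UINT is 0xFFFFF7FF as unsigned int; conversion zero-extends,
 * so bits 63:32 of v are cleared as well. */
static uint64_t clear_with_uint(uint64_t v)
{
    return v & ~VALID_UINT;
}

/* ~VALID_ULL is already 64 bits wide; no conversion is involved. */
static uint64_t clear_with_ull(uint64_t v)
{
    return v & ~VALID_ULL;
}
```

With all 64 bits set on input, the int and ull forms return 0xFFFFFFFFFFFFF7FF, while the unsigned int form returns 0x00000000FFFFF7FF.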

> >  #define MSR_CORE_PERF_FIXED_CTR0        0x309
> >  #define MSR_CORE_PERF_FIXED_CTR1        0x30a
> >  #define MSR_CORE_PERF_FIXED_CTR2        0x30b
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 097fe11..cb31338 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -79,6 +79,7 @@ static int lm_capable_kernel;
> >  static bool has_msr_hv_hypercall;
> >  static bool has_msr_hv_vapic;
> >  static bool has_msr_hv_tsc;
> > +static bool has_msr_mtrr;
> >  
> >  static bool has_msr_architectural_pmu;
> >  static uint32_t num_architectural_pmu_counters;
> > @@ -739,6 +740,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >          env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
> >      }
> >  
> > +    if (env->features[FEAT_1_EDX] & CPUID_MTRR) {
> > +        has_msr_mtrr = true;
> > +    }
> > +
> 
> Seems to match "MTRR Feature Identification" in my (older) copy of the SDM.
> 
> >      return 0;
> >  }
> >  
> > @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >      CPUX86State *env = &cpu->env;
> >      struct {
> >          struct kvm_msrs info;
> > -        struct kvm_msr_entry entries[100];
> > +        struct kvm_msr_entry entries[128];
> >      } msr_data;
> >      struct kvm_msr_entry *msrs = msr_data.entries;
> >      int n = 0, i;
> > @@ -1278,6 +1283,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >              kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_REFERENCE_TSC,
> >                                env->msr_hv_tsc);
> >          }
> > +        if (has_msr_mtrr) {
> > +            kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
> > +            for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> > +                kvm_msr_entry_set(&msrs[n++],
> > +                                  MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
> > +            }
> > +        }
> >  
> >          /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
> >           *       kvm_put_msr_feature_control. */
> > 
> 
> I think that this code is correct (and sufficient for the reset
> problem), but I'm uncertain if it's complete:
> 
> (a) Shouldn't you put the matching PhysBase registers as well (for the
> variable range ones)?
> 
> Plus, shouldn't you put mtrr_fixed[11] too (MSR_MTRRfix64K_00000, ...)?

If my change wasn't isolated to the reset portion of kvm_put_msrs() then
I would agree with you.  But since it is, all of those registers are
undefined by the SDM.
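
As a rough sketch of what is at stake for the entries array, the following counts the MSR indices for the reset subset (DefType plus the PhysMask registers) against a full MTRR put (adding PhysBase and the fixed-range registers). The helper names are hypothetical, the count of 8 variable ranges is taken from the discussion above, and the 0x268..0x26f indices for the 4K fixed-range registers come from the SDM rather than from this patch. The push() helper refuses to write past the caller's capacity, unlike a bare msrs[n++].

```c
#include <stddef.h>
#include <stdint.h>

/* Index macros mirror the ones in the diff; names here are hypothetical. */
#define MSR_PHYSBASE(reg)  (0x200 + 2 * (reg))
#define MSR_PHYSMASK(reg)  (0x200 + 2 * (reg) + 1)
#define MSR_DEFTYPE        0x2ff
#define VCNT               8

/* Fixed-range MTRR MSR indices: one 64K, two 16K, eight 4K registers. */
static const uint32_t fixed_msrs[11] = {
    0x250, 0x258, 0x259,
    0x268, 0x269, 0x26a, 0x26b, 0x26c, 0x26d, 0x26e, 0x26f,
};

/* Append one index, or fail instead of overflowing the buffer. */
static int push(uint32_t *out, size_t cap, size_t *n, uint32_t index)
{
    if (*n >= cap) {
        return -1;
    }
    out[(*n)++] = index;
    return 0;
}

/* Reset subset: only the registers whose bits the SDM defines on reset.
 * Returns the number of entries written, or -1 if cap is too small. */
static int mtrr_reset_msrs(uint32_t *out, size_t cap)
{
    size_t n = 0;

    if (push(out, cap, &n, MSR_DEFTYPE) < 0) {
        return -1;
    }
    for (int i = 0; i < VCNT; i++) {
        if (push(out, cap, &n, MSR_PHYSMASK(i)) < 0) {
            return -1;
        }
    }
    return (int)n;
}

/* Full set: reset subset plus PhysBase and the fixed-range registers. */
static int mtrr_full_msrs(uint32_t *out, size_t cap)
{
    int got = mtrr_reset_msrs(out, cap);
    size_t n;

    if (got < 0) {
        return -1;
    }
    n = (size_t)got;
    for (int i = 0; i < VCNT; i++) {
        if (push(out, cap, &n, MSR_PHYSBASE(i)) < 0) {
            return -1;
        }
    }
    for (size_t i = 0; i < sizeof(fixed_msrs) / sizeof(fixed_msrs[0]); i++) {
        if (push(out, cap, &n, fixed_msrs[i]) < 0) {
            return -1;
        }
    }
    return (int)n;
}
```

Under these assumptions the reset subset needs 9 entries and a full put would need 28, which is worth keeping in mind when sizing the array.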

> (b) You only modify kvm_put_msrs(). What about kvm_get_msrs()? I can see
> that you make the msr putting dependent on:
> 
>     /*
>      * The following MSRs have side effects on the guest or are too
>      * heavy for normal writeback. Limit them to reset or full state
>      * updates.
>      */
>     if (level >= KVM_PUT_RESET_STATE) {
> 
> But that's probably not your reason for omitting matching new code from
> kvm_get_msrs(): "HV_X64_MSR_REFERENCE_TSC" is also heavy-weight (visible
> in your patch's context), but that one is nevertheless handled in
> kvm_get_msrs().
> 
> My only reason for (b) is simply symmetry. For example, commit 48a5f3bc
> added HV_X64_MSR_REFERENCE_TSC at once to both put() and get().
> 
> According to "target-i386/machine.c", mtrr_deftype and co. are even
> migrated (part of vmstate), so this asymmetry could become a problem in
> migration. Eg. source host doesn't fetch MTRR state from KVM, hence wire
> format carries garbage, but on the target you put (part of) that garbage
> (right now, just the mask) back into KVM:
> 
> do_savevm()
>   qemu_savevm_state()
>     qemu_savevm_state_complete()
>       cpu_synchronize_all_states()
>         cpu_synchronize_state()
>           kvm_cpu_synchronize_state()
>             do_kvm_cpu_synchronize_state()
>               kvm_arch_get_registers()
>                 kvm_get_msrs()
> 
> do_loadvm()
>   load_vmstate()
>     qemu_loadvm_state()
>       cpu_synchronize_all_post_init()
>         cpu_synchronize_post_init()
>           kvm_cpu_synchronize_post_init()
>             kvm_arch_put_registers(..., KVM_PUT_FULL_STATE)
>               kvm_put_msrs(..., KVM_PUT_FULL_STATE)
> 
> /* state subset modified during VCPU reset */
> #define KVM_PUT_RESET_STATE     2
> 
> /* full state set, modified during initialization or on vmload */
> #define KVM_PUT_FULL_STATE      3
> 
> Hence I suspect (a) and (b) should be handled.
> 
> ... And then we arrive at cross-version migration, where both source and
> target hosts support MTRR, but the source qemu sends unsynchronized MTRR
> data (ie. garbage) in the migration stream, but the target passes it to
> KVM. I don't know if this is possible, and if so, what to do about it. :(

Where does the target pass it to KVM?  I think you've identified that we
migrate unsynchronized data, but the good news is that we don't do
anything with it unless you're running under TCG (in which case it is
synchronized anyway).  We neither load nor store the MTRR state from/to
KVM, which may have implications if you were to boot a guest, migrate
it, then hot-add an assigned device where we need to start caring about
guest mappings.
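
A toy model of the get/put asymmetry, in case it helps frame the follow-up work. Nothing here is QEMU or KVM API: KernelVcpu stands in for the state KVM holds, UserVcpu for the copy in CPUX86State, and every function name is invented for the sketch. The point is only that without a get step, whatever is saved for migration is the userspace copy from before the guest programmed its MTRRs.

```c
#include <stdint.h>

/* Toy stand-ins; none of these are real QEMU or KVM types. */
typedef struct { uint64_t mtrr_deftype; } KernelVcpu;  /* state held by the hypervisor */
typedef struct { uint64_t mtrr_deftype; } UserVcpu;    /* userspace copy that migrates */

/* The guest programs its MTRRs; only the kernel-side state changes. */
static void guest_wrmsr(KernelVcpu *k, uint64_t value)
{
    k->mtrr_deftype = value;
}

/* The "get" half: pull kernel state into the userspace copy. */
static void sync_get(const KernelVcpu *k, UserVcpu *u)
{
    u->mtrr_deftype = k->mtrr_deftype;
}

/* The "put" half: push the userspace copy into the kernel. */
static void sync_put(KernelVcpu *k, const UserVcpu *u)
{
    k->mtrr_deftype = u->mtrr_deftype;
}

/* What ends up in the migration stream is the userspace copy. */
static uint64_t save(const UserVcpu *u)
{
    return u->mtrr_deftype;
}
```

In this model, calling save() after guest_wrmsr() but without sync_get() returns the stale value, and a later sync_put() on the destination would install it.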

> (BTW,
> 
>         VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, 8, 8),
> 
> should be rebased to MSR_MTRRcap_VCNT too, probably.)
> 
> Apologies about the verbiage, I just wrote down whatever crossed my
> mind. I don't think I said anything overly important, but I feel unsafe
> about giving my R-b until someone disproves my migration worries.
> (Basically, before the patch, whatever MTRR data was in the migration
> stream never reached KVM. This changes now.)

Not really because it only gets pushed to KVM on vCPU reset and we're
clearing the necessary enable/valid bits.  The rest is undefined anyway.

> ... Is the following argument valid in your opinion?
> 
>   KVM cares about guest-specified MTRR values *only* when
>   kvm_arch_has_noncoherent_dma() returns true to vmx_get_mt_mask().
>   Since "kvm_arch_has_noncoherent_dma() returning true" (ie. device
>   assignment) excludes migration anyway, we don't have to care about
>   migration of MTRRs.

I think we do need to care about migration of MTRRs because a device can
be hot attached on the migration target while the MTRRs could have been
programmed on the migration source.  Therefore it doesn't matter that
device assignment excludes migration.  This patch still seems correct to
me, but you have identified another issue in the same problem space.
I'll start working on it.  Thanks,

Alex



