Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing


From: Paolo Bonzini
Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
Date: Fri, 5 Sep 2014 08:11:22 -0400 (EDT)


----- Original Message -----
> From: "Alexander Graf" <address@hidden>
> To: "Paolo Bonzini" <address@hidden>, address@hidden
> Cc: address@hidden, address@hidden, address@hidden
> Sent: Friday, September 5, 2014 9:10:01
> Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
> 
> 
> 
> On 28.08.14 19:14, Paolo Bonzini wrote:
> > PowerPC TCG flushes the TLB on every IR/DR change, which basically
> > means on every user<->kernel context switch.  Use the 6-element
> > TLB array as a cache, where each MMU index is mapped to a different
> > state of the IR/DR/PR/HV bits.
> > 
> > This brings the number of TLB flushes down from ~900000 to ~50000
> > for starting up the Debian installer, which is in line with x86
> > and gives a ~10% performance improvement.
> > 
> > Signed-off-by: Paolo Bonzini <address@hidden>
> > ---
> >  cputlb.c                    | 19 +++++++++++++++++
> >  hw/ppc/spapr_hcall.c        |  6 +++++-
> >  include/exec/exec-all.h     |  5 +++++
> >  target-ppc/cpu.h            |  4 +++-
> >  target-ppc/excp_helper.c    |  6 +-----
> >  target-ppc/helper_regs.h    | 52 +++++++++++++++++++++++++++++++--------------
> >  target-ppc/translate_init.c |  5 +++++
> >  7 files changed, 74 insertions(+), 23 deletions(-)
> > 
> > diff --git a/cputlb.c b/cputlb.c
> > index afd3705..17e1b03 100644
> > --- a/cputlb.c
> > +++ b/cputlb.c
> > @@ -67,6 +67,25 @@ void tlb_flush(CPUState *cpu, int flush_global)
> >      tlb_flush_count++;
> >  }
> >  
> > +void tlb_flush_idx(CPUState *cpu, int mmu_idx)
> > +{
> > +    CPUArchState *env = cpu->env_ptr;
> > +
> > +#if defined(DEBUG_TLB)
> > +    printf("tlb_flush_idx %d:\n", mmu_idx);
> > +#endif
> > +    /* must reset current TB so that interrupts cannot modify the
> > +       links while we are modifying them */
> > +    cpu->current_tb = NULL;
> > +
> > +    memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[mmu_idx]));
> > +    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
> > +
> > +    env->tlb_flush_addr = -1;
> > +    env->tlb_flush_mask = 0;
> > +    tlb_flush_count++;
> > +}
> > +
> >  static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
> >  {
> >      if (addr == (tlb_entry->addr_read &
> > diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> > index 467858c..b95961c 100644
> > --- a/hw/ppc/spapr_hcall.c
> > +++ b/hw/ppc/spapr_hcall.c
> > @@ -556,13 +556,17 @@ static target_ulong h_cede(PowerPCCPU *cpu, sPAPREnvironment *spapr,
> >  {
> >      CPUPPCState *env = &cpu->env;
> >      CPUState *cs = CPU(cpu);
> > +    bool flush;
> >  
> >      env->msr |= (1ULL << MSR_EE);
> > -    hreg_compute_hflags(env);
> > +    flush = hreg_compute_hflags(env);
> >      if (!cpu_has_work(cs)) {
> >          cs->halted = 1;
> >          cs->exception_index = EXCP_HLT;
> >          cs->exit_request = 1;
> > +    } else if (flush) {
> > +        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
> > +        cs->exit_request = 1;
> 
> Can this ever happen?

No, I think it can't.

> Ok, so this basically changes the semantics of mmu_idx from a static
> array with predefined meanings to a dynamic array with runtime changing
> semantics.
> 
> The first thing that comes to mind here is why we're not just extending
> the existing array? After all, we have 4 bits -> 16 states minus one for
> PR+HV. Can our existing logic not deal with this?

Yeah, that would require 12 MMU indices.  Right now, include/exec/cpu_ldst.h
only supports 6, but that's easy to extend.
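
For the record, the static encoding you describe would look something like
this (illustrative sketch only -- hreg_msr_mmu_idx is a made-up name, and the
four HV+PR states would still have to be folded away to get from 16 down to
12 indices):

    static inline int hreg_msr_mmu_idx(CPUPPCState *env)
    {
        /* Pack the four mode bits into one index:
           bit 0 = IR, bit 1 = DR, bit 2 = PR, bit 3 = HV.  */
        return ((env->msr >> MSR_IR) & 1)
             | (((env->msr >> MSR_DR) & 1) << 1)
             | (((env->msr >> MSR_PR) & 1) << 2)
             | (((env->msr >> MSR_HV) & 1) << 3);
    }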

tlb_flush becomes progressively more expensive as you add more MMU modes,
but it may work.  This patch removes 98.8% of the TLB flushes, makes the
remaining ones twice as slow (NB_MMU_MODES goes from 3 to 6), and speeds
up QEMU by 10%.  You can solve this:

    0.9 = 0.988 * 0 + 0.012 * tlb_time * 2 + (1 - tlb_time) * 1
        = 1 - 0.976 * tlb_time
    tlb_time = 0.1 / 0.976 = 0.102

to compute that the time spent in TLB flushes before the patch is 10.2% of the
whole emulation time.

Doubling the NB_MMU_MODES further from 6 to 12 would still save 98.8% of the
TLB flushes, while making the remaining ones even more expensive.  The savings
will be smaller, but actually not by much:

    0.988 * 0 + 0.012 * tlb_time * 4 + (1 - tlb_time) * 1 = 0.903

i.e. what you propose would still save 9.7%.  Still, having 12 modes seemed
like a waste, since only 4 or 5 are used in practice...

On top of this patch it is possible to do another optimization: instead of
doing a full flush, tlb_flush could clear the TLB for the current index only
and invalidate the mapping.  The TLBs for other indices will be invalidated
lazily as they are populated again.  This would cut the cost of the TLB
flushes further, though the above math suggests that the actual speedup
will likely be smallish.
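
Roughly like this (hypothetical sketch on top of this patch -- tlb_stale_mask
is an invented field, and tlb_set_page would have to test it before inserting
a new entry):

    void tlb_flush_lazy(CPUState *cpu, int cur_idx)
    {
        CPUArchState *env = cpu->env_ptr;
        int mmu_idx;

        /* As in tlb_flush_idx: reset current TB so that interrupts
           cannot modify the links while we are modifying them.  */
        cpu->current_tb = NULL;

        /* Flush only the index we are currently running in... */
        memset(env->tlb_table[cur_idx], -1, sizeof(env->tlb_table[cur_idx]));
        memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));

        /* ...and mark the others stale, so that each one is flushed
           the first time it is populated again, not right now.  */
        for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
            if (mmu_idx != cur_idx) {
                env->tlb_stale_mask |= 1u << mmu_idx;
            }
        }

        env->tlb_flush_addr = -1;
        env->tlb_flush_mask = 0;
    }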

> Second thing I'm failing to grasp still is that in the previous patch
> you're changing ctx.mem_idx to different static semantics. But that
> mem_idx gets passed to our ld/st helpers which again boils down to the
> mem_idx above, no? So aren't we accessing random unrelated mmu contexts now?

Yeah, that's br0ken.  In most cases mem_idx is used to check for privilege,
so we'd need to split the field in two (mem_idx and priv_level).
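
i.e. something along these lines in DisasContext (sketch; priv_level is an
invented name):

    typedef struct DisasContext {
        /* ... existing fields ... */
        int mem_idx;     /* dynamic: selects the env->tlb_table[] that
                            the ld/st helpers actually use */
        int priv_level;  /* static: 0/1/2 = user/supervisor/hypervisor,
                            for privilege checks at translation time */
        /* ... */
    } DisasContext;

Privilege checks in translate.c would then test priv_level, while mem_idx
keeps its single meaning as the TLB selector.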

Paolo


