Re: [Qemu-devel] [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
From: Alexander Graf
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
Date: Tue, 9 Sep 2014 22:51:10 +0200
> On 09.09.2014 at 18:42, Paolo Bonzini <address@hidden> wrote:
>
> On 05/09/2014 14:11, Paolo Bonzini wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Alexander Graf" <address@hidden>
>>> To: "Paolo Bonzini" <address@hidden>, address@hidden
>>> Cc: address@hidden, address@hidden, address@hidden
>>> Sent: Friday, 5 September 2014 9:10:01
>>> Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
>>>
>>>
>>>
>>>> On 28.08.14 19:14, Paolo Bonzini wrote:
>>>> PowerPC TCG flushes the TLB on every IR/DR change, which basically
>>>> means on every user<->kernel context switch. Use the 6-element
>>>> TLB array as a cache, where each MMU index is mapped to a different
>>>> state of the IR/DR/PR/HV bits.
>>>>
>>>> This brings the number of TLB flushes down from ~900000 to ~50000
>>>> for starting up the Debian installer, which is in line with x86
>>>> and gives a ~10% performance improvement.
>>>>
>>>> Signed-off-by: Paolo Bonzini <address@hidden>
>>>> ---
>>>> cputlb.c | 19 +++++++++++++++++
>>>> hw/ppc/spapr_hcall.c | 6 +++++-
>>>> include/exec/exec-all.h | 5 +++++
>>>> target-ppc/cpu.h | 4 +++-
>>>> target-ppc/excp_helper.c | 6 +-----
>>>> target-ppc/helper_regs.h | 52 +++++++++++++++++++++++++++++++--------------
>>>> target-ppc/translate_init.c | 5 +++++
>>>> 7 files changed, 74 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/cputlb.c b/cputlb.c
>>>> index afd3705..17e1b03 100644
>>>> --- a/cputlb.c
>>>> +++ b/cputlb.c
>>>> @@ -67,6 +67,25 @@ void tlb_flush(CPUState *cpu, int flush_global)
>>>> tlb_flush_count++;
>>>> }
>>>>
>>>> +void tlb_flush_idx(CPUState *cpu, int mmu_idx)
>>>> +{
>>>> + CPUArchState *env = cpu->env_ptr;
>>>> +
>>>> +#if defined(DEBUG_TLB)
>>>> + printf("tlb_flush_idx %d:\n", mmu_idx);
>>>> +#endif
>>>> + /* must reset current TB so that interrupts cannot modify the
>>>> + links while we are modifying them */
>>>> + cpu->current_tb = NULL;
>>>> +
>>>> + memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[mmu_idx]));
>>>> + memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
>>>> +
>>>> + env->tlb_flush_addr = -1;
>>>> + env->tlb_flush_mask = 0;
>>>> + tlb_flush_count++;
>>>> +}
>>>> +
>>>> static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong
>>>> addr)
>>>> {
>>>> if (addr == (tlb_entry->addr_read &
>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>> index 467858c..b95961c 100644
>>>> --- a/hw/ppc/spapr_hcall.c
>>>> +++ b/hw/ppc/spapr_hcall.c
>>>> @@ -556,13 +556,17 @@ static target_ulong h_cede(PowerPCCPU *cpu,
>>>> sPAPREnvironment *spapr,
>>>> {
>>>> CPUPPCState *env = &cpu->env;
>>>> CPUState *cs = CPU(cpu);
>>>> + bool flush;
>>>>
>>>> env->msr |= (1ULL << MSR_EE);
>>>> - hreg_compute_hflags(env);
>>>> + flush = hreg_compute_hflags(env);
>>>> if (!cpu_has_work(cs)) {
>>>> cs->halted = 1;
>>>> cs->exception_index = EXCP_HLT;
>>>> cs->exit_request = 1;
>>>> + } else if (flush) {
>>>> + cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
>>>> + cs->exit_request = 1;
>>>
>>> Can this ever happen?
>>
>> No, I think it can't.
>>
>>> Ok, so this basically changes the semantics of mmu_idx from a static
>>> array with predefined meanings to a dynamic array with runtime changing
>>> semantics.
>>>
>>> The first thing that comes to mind here is why we're not just extending
>>> the existing array? After all, we have 4 bits -> 16 states minus the
>>> invalid PR+HV ones. Can our existing logic not deal with this?
>>
>> Yeah, that would require 12 MMU indices. Right now, include/exec/cpu_ldst.h
>> only supports 6 but that's easy to extend.
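As a quick sanity check on that count (a throwaway Python sketch, not QEMU code): enumerating the four IR/DR/PR/HV bits and dropping the combinations where PR and HV are both set gives exactly 12 states.

```python
from itertools import product

# All IR/DR/PR/HV combinations; a state with both PR=1 (problem/user
# state) and HV=1 (hypervisor) is invalid, so those four are dropped.
valid = [(ir, dr, pr, hv)
         for ir, dr, pr, hv in product((0, 1), repeat=4)
         if not (pr and hv)]
print(len(valid))  # 16 - 4 = 12 MMU indices
```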
>>
>> tlb_flush becomes progressively more expensive as you add more MMU modes,
>> but it may work. This patch removes 98.8% of the TLB flushes, makes the
>> remaining ones twice as slow (NB_MMU_MODES goes from 3 to 6), and speeds
>> up QEMU by 10%. You can solve this:
>>
>> 0.9 = 0.988 * 0 + 0.012 * tlb_time * 2 + (1 - tlb_time) * 1
>> tlb_time = 0.1 / 0.976 = 0.102
>>
>> to compute that the time spent in TLB flushes before the patch is 10.2% of
>> the
>> whole emulation time.
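The algebra can be checked mechanically (a back-of-the-envelope Python sketch of the estimate above): with 98.8% of flushes eliminated, the surviving 1.2% twice as slow, and all other time unchanged, a 0.9x runtime implies tlb_time of roughly 10.2%.

```python
# 0.9 = 0.988 * 0 + 0.012 * t * 2 + (1 - t) * 1
# The right-hand side simplifies to 1 - 0.976 * t, so t = 0.1 / 0.976.
t = 0.1 / 0.976
print(round(t, 3))  # 0.102, i.e. ~10.2% of emulation time in TLB flushes
```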
>>
>> Doubling the NB_MMU_MODES further from 6 to 12 would still save 98.8% of
>> the TLB flushes, while making the remaining ones even more expensive. The
>> savings will be smaller, but actually not by much:
>>
>> 0.988 * 0 + 0.012 * tlb_time * 4 + (1 - tlb_time) * 1 = 0.903
>>
>> i.e. what you propose would still save 9.7%. Still, having 12 modes seemed
>> like a waste, since only 4 or 5 are used in practice...
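Plugging the tlb_time of about 0.102 derived earlier in the mail back into the same model (a Python sketch, not a measurement), the 12-mode variant lands at roughly 0.903x runtime:

```python
# Same cost model, but the surviving 1.2% of flushes now cost 4x
# (NB_MMU_MODES going from 3 to 12 instead of 3 to 6).
t = 0.102
new_time = 0.988 * 0 + 0.012 * t * 4 + (1 - t) * 1
print(round(new_time, 3))  # 0.903 -> still a ~9.7% saving
```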
>
> The 12 MMU modes work just fine. Thanks for the suggestion!
Awesome! While slightly more wasteful than the dynamic approach, I'm sure it's
a lot easier to understand and debug :).
Alex
>
> Paolo