Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
From: Paolo Bonzini
Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
Date: Tue, 09 Sep 2014 18:42:15 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.0
On 05/09/2014 14:11, Paolo Bonzini wrote:
>
>
> ----- Original Message -----
>> From: "Alexander Graf" <address@hidden>
>> To: "Paolo Bonzini" <address@hidden>, address@hidden
>> Cc: address@hidden, address@hidden, address@hidden
>> Sent: Friday, September 5, 2014 9:10:01 AM
>> Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
>>
>>
>>
>> On 28.08.14 19:14, Paolo Bonzini wrote:
>>> PowerPC TCG flushes the TLB on every IR/DR change, which basically
>>> means on every user<->kernel context switch. Use the 6-element
>>> TLB array as a cache, where each MMU index is mapped to a different
>>> state of the IR/DR/PR/HV bits.
>>>
>>> This brings the number of TLB flushes down from ~900000 to ~50000
>>> for starting up the Debian installer, which is in line with x86
>>> and gives a ~10% performance improvement.
>>>
>>> Signed-off-by: Paolo Bonzini <address@hidden>
>>> ---
>>> cputlb.c | 19 +++++++++++++++++
>>> hw/ppc/spapr_hcall.c | 6 +++++-
>>> include/exec/exec-all.h | 5 +++++
>>> target-ppc/cpu.h | 4 +++-
>>> target-ppc/excp_helper.c | 6 +-----
>>> target-ppc/helper_regs.h | 52 +++++++++++++++++++++++++++++++--------------
>>> target-ppc/translate_init.c | 5 +++++
>>> 7 files changed, 74 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/cputlb.c b/cputlb.c
>>> index afd3705..17e1b03 100644
>>> --- a/cputlb.c
>>> +++ b/cputlb.c
>>> @@ -67,6 +67,25 @@ void tlb_flush(CPUState *cpu, int flush_global)
>>> tlb_flush_count++;
>>> }
>>>
>>> +void tlb_flush_idx(CPUState *cpu, int mmu_idx)
>>> +{
>>> + CPUArchState *env = cpu->env_ptr;
>>> +
>>> +#if defined(DEBUG_TLB)
>>> + printf("tlb_flush_idx %d:\n", mmu_idx);
>>> +#endif
>>> + /* must reset current TB so that interrupts cannot modify the
>>> + links while we are modifying them */
>>> + cpu->current_tb = NULL;
>>> +
>>> + memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[mmu_idx]));
>>> + memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
>>> +
>>> + env->tlb_flush_addr = -1;
>>> + env->tlb_flush_mask = 0;
>>> + tlb_flush_count++;
>>> +}
>>> +
>>> static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong addr)
>>> {
>>> if (addr == (tlb_entry->addr_read &
>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>> index 467858c..b95961c 100644
>>> --- a/hw/ppc/spapr_hcall.c
>>> +++ b/hw/ppc/spapr_hcall.c
>>> @@ -556,13 +556,17 @@ static target_ulong h_cede(PowerPCCPU *cpu, sPAPREnvironment *spapr,
>>> {
>>> CPUPPCState *env = &cpu->env;
>>> CPUState *cs = CPU(cpu);
>>> + bool flush;
>>>
>>> env->msr |= (1ULL << MSR_EE);
>>> - hreg_compute_hflags(env);
>>> + flush = hreg_compute_hflags(env);
>>> if (!cpu_has_work(cs)) {
>>> cs->halted = 1;
>>> cs->exception_index = EXCP_HLT;
>>> cs->exit_request = 1;
>>> + } else if (flush) {
>>> + cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
>>> + cs->exit_request = 1;
>>
>> Can this ever happen?
>
> No, I think it can't.
>
>> Ok, so this basically changes the semantics of mmu_idx from a static
>> array with predefined meanings to a dynamic array with runtime changing
>> semantics.
>>
>> The first thing that comes to mind here is why we're not just extending
>> the existing array? After all, we have 4 bits -> 16 states minus one for
>> PR+HV. Can our existing logic not deal with this?
>
> Yeah, that would require 12 MMU indices. Right now, include/exec/cpu_ldst.h
> only supports 6 but that's easy to extend.
>
> tlb_flush becomes progressively more expensive as you add more MMU modes,
> but it may work. This patch removes 98.8% of the TLB flushes, makes the
> remaining ones twice as slow (NB_MMU_MODES goes from 3 to 6), and speeds
> up QEMU by 10%. You can solve this:
>
> 0.9 = 0.988 * 0 + 0.012 * tlb_time * 2 + (1 - tlb_time) * 1
> tlb_time = 0.1 / 0.976 ≈ 0.102
>
> to compute that the time spent in TLB flushes before the patch is 10.2% of the
> whole emulation time.
>
> Doubling NB_MMU_MODES further, from 6 to 12, would still save 98.8% of the
> TLB flushes, while making the remaining ones even more expensive. The
> savings will be smaller, but actually not by much:
>
> 0.988 * 0 + 0.012 * tlb_time * 4 + (1 - tlb_time) * 1 = 0.903
>
> i.e. what you propose would still save 9.7%. Still, having 12 modes seemed
> like a waste, since only 4 or 5 are used in practice...
The 12 MMU modes work just fine. Thanks for the suggestion!
Paolo