Re: [Qemu-devel] [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
From: Alexander Graf
Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
Date: Tue, 9 Sep 2014 22:51:10 +0200
> On 09.09.2014 at 18:42, Paolo Bonzini <address@hidden> wrote:
>
> On 05/09/2014 14:11, Paolo Bonzini wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Alexander Graf" <address@hidden>
>>> To: "Paolo Bonzini" <address@hidden>, address@hidden
>>> Cc: address@hidden, address@hidden, address@hidden
>>> Sent: Friday, 5 September 2014 9:10:01
>>> Subject: Re: [Qemu-ppc] [PATCH 02/17] ppc: avoid excessive TLB flushing
>>>
>>>
>>>
>>>> On 28.08.14 19:14, Paolo Bonzini wrote:
>>>> PowerPC TCG flushes the TLB on every IR/DR change, which basically
>>>> means on every user<->kernel context switch. Use the 6-element
>>>> TLB array as a cache, where each MMU index is mapped to a different
>>>> state of the IR/DR/PR/HV bits.
>>>>
>>>> This brings the number of TLB flushes down from ~900000 to ~50000
>>>> for starting up the Debian installer, which is in line with x86
>>>> and gives a ~10% performance improvement.
>>>>
>>>> Signed-off-by: Paolo Bonzini <address@hidden>
>>>> ---
>>>> cputlb.c | 19 +++++++++++++++++
>>>> hw/ppc/spapr_hcall.c | 6 +++++-
>>>> include/exec/exec-all.h | 5 +++++
>>>> target-ppc/cpu.h | 4 +++-
>>>> target-ppc/excp_helper.c | 6 +-----
>>>> target-ppc/helper_regs.h | 52 +++++++++++++++++++++++++++++++--------------
>>>> target-ppc/translate_init.c | 5 +++++
>>>> 7 files changed, 74 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/cputlb.c b/cputlb.c
>>>> index afd3705..17e1b03 100644
>>>> --- a/cputlb.c
>>>> +++ b/cputlb.c
>>>> @@ -67,6 +67,25 @@ void tlb_flush(CPUState *cpu, int flush_global)
>>>> tlb_flush_count++;
>>>> }
>>>>
>>>> +void tlb_flush_idx(CPUState *cpu, int mmu_idx)
>>>> +{
>>>> + CPUArchState *env = cpu->env_ptr;
>>>> +
>>>> +#if defined(DEBUG_TLB)
>>>> + printf("tlb_flush_idx %d:\n", mmu_idx);
>>>> +#endif
>>>> + /* must reset current TB so that interrupts cannot modify the
>>>> + links while we are modifying them */
>>>> + cpu->current_tb = NULL;
>>>> +
>>>> + memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[mmu_idx]));
>>>> + memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
>>>> +
>>>> + env->tlb_flush_addr = -1;
>>>> + env->tlb_flush_mask = 0;
>>>> + tlb_flush_count++;
>>>> +}
>>>> +
>>>> static inline void tlb_flush_entry(CPUTLBEntry *tlb_entry, target_ulong
>>>> addr)
>>>> {
>>>> if (addr == (tlb_entry->addr_read &
>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>> index 467858c..b95961c 100644
>>>> --- a/hw/ppc/spapr_hcall.c
>>>> +++ b/hw/ppc/spapr_hcall.c
>>>> @@ -556,13 +556,17 @@ static target_ulong h_cede(PowerPCCPU *cpu,
>>>> sPAPREnvironment *spapr,
>>>> {
>>>> CPUPPCState *env = &cpu->env;
>>>> CPUState *cs = CPU(cpu);
>>>> + bool flush;
>>>>
>>>> env->msr |= (1ULL << MSR_EE);
>>>> - hreg_compute_hflags(env);
>>>> + flush = hreg_compute_hflags(env);
>>>> if (!cpu_has_work(cs)) {
>>>> cs->halted = 1;
>>>> cs->exception_index = EXCP_HLT;
>>>> cs->exit_request = 1;
>>>> + } else if (flush) {
>>>> + cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
>>>> + cs->exit_request = 1;
>>>
>>> Can this ever happen?
>>
>> No, I think it can't.
>>
>>> Ok, so this basically changes the semantics of mmu_idx from a static
>>> array with predefined meanings to a dynamic array with runtime changing
>>> semantics.
>>>
>>> The first thing that comes to mind here is why we're not just extending
>>> the existing array? After all, we have 4 bits -> 16 states minus the
>>> invalid PR+HV ones. Can our existing logic not deal with this?
>>
>> Yeah, that would require 12 MMU indices. Right now, include/exec/cpu_ldst.h
>> only supports 6 but that's easy to extend.
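As a quick sanity check on that count (a throwaway Python sketch, not QEMU code): enumerating the four IR/DR/PR/HV bits and dropping the combinations where PR and HV are both set gives exactly 12 states.

```python
from itertools import product

# All IR/DR/PR/HV combinations; a state with both PR=1 (problem/user
# state) and HV=1 (hypervisor) is invalid, so those four are dropped.
valid = [(ir, dr, pr, hv)
         for ir, dr, pr, hv in product((0, 1), repeat=4)
         if not (pr and hv)]
print(len(valid))  # 16 - 4 = 12 MMU indices
```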
>>
>> tlb_flush becomes progressively more expensive as you add more MMU modes,
>> but it may work. This patch removes 98.8% of the TLB flushes, makes the
>> remaining ones twice as slow (NB_MMU_MODES goes from 3 to 6), and speeds
>> up QEMU by 10%. You can solve this:
>>
>> 0.9 = 0.988 * 0 + 0.012 * tlb_time * 2 + (1 - tlb_time) * 1
>> tlb_time = 0.1 / 0.976 = 0.102
>>
>> to compute that the time spent in TLB flushes before the patch is 10.2% of
>> the
>> whole emulation time.
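The algebra can be checked mechanically (a back-of-the-envelope Python sketch of the estimate above): with 98.8% of flushes eliminated, the surviving 1.2% twice as slow, and all other time unchanged, a 0.9x runtime implies tlb_time of roughly 10.2%.

```python
# 0.9 = 0.988 * 0 + 0.012 * t * 2 + (1 - t) * 1
# The right-hand side simplifies to 1 - 0.976 * t, so t = 0.1 / 0.976.
t = 0.1 / 0.976
print(round(t, 3))  # 0.102, i.e. ~10.2% of emulation time in TLB flushes
```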
>>
>> Doubling the NB_MMU_MODES further from 6 to 12 would still save 98.8% of
>> the TLB flushes, while making the remaining ones even more expensive. The
>> savings will be smaller, but actually not by much:
>>
>> 0.988 * 0 + 0.012 * tlb_time * 4 + (1 - tlb_time) * 1 = 0.903
>>
>> i.e. what you propose would still save 9.7%. Still, having 12 modes seemed
>> like a waste, since only 4 or 5 are used in practice...
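Plugging the tlb_time of about 0.102 derived earlier in the mail back into the same model (a Python sketch, not a measurement), the 12-mode variant lands at roughly 0.903x runtime:

```python
# Same cost model, but the surviving 1.2% of flushes now cost 4x
# (NB_MMU_MODES going from 3 to 12 instead of 3 to 6).
t = 0.102
new_time = 0.988 * 0 + 0.012 * t * 4 + (1 - t) * 1
print(round(new_time, 3))  # 0.903 -> still a ~9.7% saving
```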
>
> The 12 MMU modes work just fine. Thanks for the suggestion!
Awesome! While slightly more wasteful than the dynamic approach, I'm sure it's
a lot easier to understand and debug :).
Alex
>
> Paolo