Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path


From: Jani Kokkonen
Subject: Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
Date: Mon, 3 Jun 2013 13:21:55 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; rv:17.0) Gecko/20130509 Thunderbird/17.0.6

On 5/31/2013 10:25 PM, Richard Henderson wrote:
> On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
>> +/* Load and compare a TLB entry, leaving the flags set.  Leaves X2 pointing
>> +   to the tlb entry.  Clobbers X0,X1,X2,X3 and TMP.  */
>> +
>> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
>> +                             int s_bits, uint8_t **label_ptr,
>> +                             int tlb_offset)
>> +{
> 
> You copied the comment from ARM, and it isn't correct.  You generate branches.

I will fix the comment.
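Perhaps something along these lines, since the routine does emit the
branches to the slow path itself:

/* Load and compare a TLB entry, emitting the conditional jump to the
   slow path on mismatch.  Clobbers X0,X1,X2,X3 and TMP.  */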
> 
>> +    TCGReg base = TCG_AREG0;
>> +
>> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
>> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
>> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
>> +                             TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
>> +#if TARGET_LONG_BITS == 64
>> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#else
>> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#endif
>> +    /* check alignment */
>> +    if (s_bits) {
>> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
>> +        label_ptr[0] = s->code_ptr;
>> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
>> +    }
>> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
>> +    label_ptr[1] = s->code_ptr;
>> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> 
> I'm positive that the branch predictor would be happier with a single branch
> rather than the two you generate here.  It ought to be possible to use a
> different set of insns to do this in one go.
> 
> How about something like
> 
>       @ extract the tlb index from the address
>       ubfm    w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS
> 
>       @ add any "high bits" from the tlb offset
>       @ noting that env will be much smaller than 24 bits.
>       add     x1, env, tlb_offset & 0xfff000
> 
>       @ zap the tlb index from the address for compare
>       @ this is all high bits plus 0-3 low bits set, so this
>       @ should match a logical immediate.
>       and     w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)
> 
>       @ merge the tlb index into the env+tlb_offset
>       add     x1, x1, x0, lsl #3
> 
>       @ load the tlb comparator.  the 12-bit scaled offset
>       @ form will fit the bits remaining from above, given that
>       @ we're loading an aligned object, and so the low 2/3 bits
>       @ will be clear.
>       ldr     w/x0, [x1, tlb_offset & 0xfff]
> 
>       @ load the tlb addend.  do this early to avoid stalling.
>       @ the addend_offset differs from tlb_offset by 1-3 words.
>       @ given that we've got overlap between the scaled 12-bit
>       @ value and the 12-bit shifted value above, this also ought
>       @ to always be representable.
>       ldr     x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]
> 
>       @ perform the comparison
>       cmp     w/x0, w/x2
> 
>       @ generate the complete host address in parallel with the cmp.
>       add     x3, x3, addr_reg                @ 64-bit guest
>       add     x3, x3, addr_reg, uxtw          @ 32-bit guest
> 
>       bne     miss_label
> 
> Note that the w/x above indicates the ext setting that ought to be used,
> depending on the address size of the guest.
> 
> This is at least 2 insns shorter than your sequence.

OK, thanks. I will add the ubfm instruction and modify the
implementation based on your comments.
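
A first rough sketch of what I have in mind is below. tcg_out_ubfm does
not exist yet, and some helper variants used here (the LSL #12 form of
tcg_out_addi, a tcg_out_andi that takes the mask directly) would also
be new, so the signatures are tentative; the sketch also only handles
the addr_read case:

static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg, int s_bits,
                             uint8_t **label_ptr, int tlb_offset)
{
    TCGReg base = TCG_AREG0;
    int ext = (TARGET_LONG_BITS == 64);

    /* Extract the TLB index from the address into X0.  */
    tcg_out_ubfm(s, ext, TCG_REG_X0, addr_reg, TARGET_PAGE_BITS,
                 TARGET_PAGE_BITS + CPU_TLB_BITS - 1);
    /* Add the "high bits" of tlb_offset to env; env is much smaller
       than 24 bits, so this fits the LSL #12 form of ADD.  */
    tcg_out_addi(s, 1, TCG_REG_X1, base, tlb_offset & 0xfff000);
    /* Zap the TLB index from the address for the compare.  The mask is
       all high bits plus 0-3 low bits, hence a logical immediate.  */
    tcg_out_andi(s, ext, TCG_REG_X2, addr_reg,
                 TARGET_PAGE_MASK | ((1 << s_bits) - 1));
    /* Merge the TLB index into env+tlb_offset.  */
    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X1, TCG_REG_X1,
                  TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
    /* Load the TLB comparator, folding the low 12 bits of the offset
       into the scaled-immediate form of LDR.  */
    tcg_out_ldst(s, ext ? LDST_64 : LDST_32, LDST_LD,
                 TCG_REG_X0, TCG_REG_X1, tlb_offset & 0xfff);
    /* Load the TLB addend early to avoid stalling.  */
    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X1,
                 (tlb_offset & 0xfff) + offsetof(CPUTLBEntry, addend)
                 - offsetof(CPUTLBEntry, addr_read));
    /* Alignment and page number are checked by the one compare.  */
    tcg_out_cmp(s, ext, TCG_REG_X0, TCG_REG_X2, 0);
    /* A single branch to the slow path on mismatch.  */
    label_ptr[0] = s->code_ptr;
    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
}

The caller then adds addr_reg (with uxtw for a 32-bit guest) to the
addend in X3 to form the host address, and only the one miss branch
remains.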
> 
> Have you looked at doing the out-of-line tlb miss sequence right from the
> very beginning?  It's not that much more difficult to accomplish than the
> inline tlb miss.

I have to look into this one.
> 
> See CONFIG_QEMU_LDST_OPTIMIZATION, and the implementation in tcg/arm.
> You won't need two nops after the call; aarch64 can do all the required
> extensions and data movement operations in a single insn.
> 
> 

I will also take this into account.
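
From a first look at tcg/arm, the scheme would be: tcg_out_tlb_read
records the miss branch in a TCGLabelQemuLdst via add_qemu_ldst_label,
and the slow paths are emitted after the TB from tcg_out_tb_finalize.
Roughly, for the load case (the reloc and helper names below are
placeholders until I have checked the details):

static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
{
    /* Point the miss branch from tcg_out_tlb_read at this code.  */
    patch_reloc(lb->label_ptr[0], R_AARCH64_CONDBR19,
                (tcg_target_long)s->code_ptr, 0);
    /* Helper arguments: env, vaddr, mem_index.  */
    tcg_out_mov(s, TCG_TYPE_I64, TCG_REG_X0, TCG_AREG0);
    tcg_out_mov(s, TCG_TYPE_I64, TCG_REG_X1, lb->addrlo_reg);
    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_X2, lb->mem_index);
    tcg_out_call(s, (tcg_target_long)qemu_ld_helpers[lb->opc & 3]);
    /* The sign/zero extension of the X0 return value can be folded
       into the move to the data register with a single sxt/uxt insn,
       which is why no nops need to be reserved after the call.  */
    tcg_out_mov(s, TCG_TYPE_I64, lb->datalo_reg, TCG_REG_X0); /* + extend */
    /* Return to the code following the original qemu_ld.  */
    tcg_out_goto(s, (tcg_target_long)lb->raddr);
}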

> r~
> 

-Jani



