qemu-devel

Re: [Qemu-devel] outlined TLB lookup on x86


From: Richard Henderson
Subject: Re: [Qemu-devel] outlined TLB lookup on x86
Date: Thu, 28 Nov 2013 15:12:04 +1300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0

On 11/27/2013 08:41 PM, Xin Tong wrote:
> I am trying to implement an out-of-line TLB lookup for QEMU softmmu-x86-64 on
> an x86-64 machine, potentially for better instruction cache performance. I
> have a few questions.
>
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
> generated when tcg_out_tb_finalize is called. When a TLB lookup misses, it
> jumps to the generated slow path; the slow path refills the TLB, performs the
> load/store, and jumps to the next emulated instruction. I am wondering whether
> it is easy to outline the code for the slow path.

Hard.  There's quite a bit of code on that slow path that's unique to the
surrounding code context -- which registers contain inputs and outputs, where
to continue after the slow path.

The amount of code that's in the TB slow path now is approximately minimal, as
far as I can see.  If you've got an idea for improvement, please share.  ;-)


> I am thinking that when a TLB lookup misses, the outlined TLB lookup code
> should generate a call out to qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk
> the TLB after it's refilled? This code is off the critical path, so it's not
> as important as the code for a TLB hit.

That would work for true TLB misses to RAM, but does not work for memory mapped
I/O.

> 2. Why not use a TLB of bigger size? Currently the TLB has 1<<8 entries. The
> TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions; I
> measured this using Intel PIN. So even if the miss rate is low (say 3%), the
> overall time spent in cpu_x86_handle_mmu_fault is still significant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance.  But I suspect that the lack of TLB contexts means that
we wind up flushing the TLB more often than real hardware does, and therefore a
larger TLB merely takes longer to flush.
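
To make the size/flush trade-off concrete, here is a minimal sketch of a
direct-mapped software TLB.  It is a toy model, not QEMU's actual data
structure: every name in it (ToyTLB, toy_tlb_lookup, ...) is hypothetical, and
the 4 KiB page size is an assumption.  The hit path is one index computation
and one compare, while a flush has to touch every entry, so its cost grows
linearly with the table size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* All names and constants below are illustrative assumptions. */
#define TOY_PAGE_BITS 12                      /* assumed 4 KiB guest pages */
#define TOY_TLB_BITS  8                       /* 1 << 8 entries, as in the thread */
#define TOY_TLB_SIZE  (1u << TOY_TLB_BITS)
#define TOY_PAGE_MASK (~((UINT64_C(1) << TOY_PAGE_BITS) - 1))

typedef struct {
    uint64_t tag;     /* page-aligned guest address; all-ones means invalid */
    uint64_t addend;  /* added to the guest address to form the host address */
} ToyTLBEntry;

typedef struct {
    ToyTLBEntry e[TOY_TLB_SIZE];
} ToyTLB;

/* Direct-mapped index: low bits of the guest page number. */
static unsigned toy_tlb_index(uint64_t vaddr)
{
    return (unsigned)(vaddr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1);
}

/* Flush writes every entry, so the work is proportional to TOY_TLB_SIZE.
 * An all-ones tag is never page-aligned, so it can never match a lookup. */
static void toy_tlb_flush(ToyTLB *tlb)
{
    assert(tlb != NULL);
    memset(tlb, 0xff, sizeof(*tlb));
}

/* Refill after a miss: overwrite whatever occupied the slot. */
static void toy_tlb_fill(ToyTLB *tlb, uint64_t vaddr, uint64_t addend)
{
    ToyTLBEntry *ent = &tlb->e[toy_tlb_index(vaddr)];
    ent->tag = vaddr & TOY_PAGE_MASK;
    ent->addend = addend;
}

/* Fast path: one index, one compare.  Returns false on a miss. */
static bool toy_tlb_lookup(const ToyTLB *tlb, uint64_t vaddr, uint64_t *host)
{
    const ToyTLBEntry *ent = &tlb->e[toy_tlb_index(vaddr)];
    if (ent->tag != (vaddr & TOY_PAGE_MASK)) {
        return false;
    }
    *host = vaddr + ent->addend;
    return true;
}
```

Raising TOY_TLB_BITS reduces conflict misses in this model, but every
toy_tlb_flush call then clears proportionally more memory, which is the
effect suspected above.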

But be aware that we can't simply make the change universally.  E.g. ARM can
use an immediate 8-bit operand during the TLB lookup, but would have to use
several insns to perform a 9-bit mask.
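
The ARM constraint can be checked mechanically.  An A32 data-processing
immediate is an 8-bit value rotated right by an even amount, so an 8-bit index
mask fits in one instruction but a 9-bit mask does not.  The sketch below
tests that encoding rule; the helper names are hypothetical and it models only
the immediate encoding, not what the TCG backend actually emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate left by n bits, avoiding the undefined 32-bit shift when n == 0. */
static uint32_t rotl32(uint32_t v, unsigned n)
{
    n &= 31;
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/* True if v is expressible as imm8 rotated right by an even amount,
 * i.e. some even left-rotation of v fits in the low 8 bits. */
static bool arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(v, rot) <= 0xffu) {
            return true;
        }
    }
    return false;
}

/* Index mask for a direct-mapped table with (1 << bits) entries. */
static uint32_t tlb_index_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    return (UINT32_C(1) << bits) - 1;
}
```

With these helpers, arm_imm_encodable(tlb_index_mask(8)) is true and
arm_imm_encodable(tlb_index_mask(9)) is false, so growing the table from 256
to 512 entries is where the single-instruction mask stops being available.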

> I am thinking the TLB may need to be organized in a set-associative fashion
> to reduce conflict misses, e.g. 2-way set associative to reduce the miss
> rate, or have a victim TLB that is 4-way associative and use x86 SIMD
> instructions to do the lookup once the direct-mapped TLB misses. Has anybody
> done any work on this front?

Even with SIMD, I don't believe you could make the fast-path of a set
associative lookup fast.  This is the sort of thing for which you really need
the dedicated hardware of the real TLB.  Feel free to prove me wrong with code,
of course.


r~


