qemu-devel



From: Artyom Tarasenko
Subject: Re: [Qemu-devel] Debian 7.8.0 SPARC64 on qemu - anything i can do to speedup the emulation?
Date: Wed, 19 Aug 2015 12:41:59 +0200

Hi Richard,

On Tue, Aug 18, 2015 at 7:55 PM, Richard Henderson <address@hidden> wrote:
> On 08/18/2015 02:24 AM, Artyom Tarasenko wrote:
>> The unoptimized case is a sequence of multiple cmp and branch
>> operations (likely created by a "case" statement in the original
>> source code), especially where cmp is in a delay slot of a branch
>> instruction.
>
> Interesting.
>
>> I wonder whether we always have to finish a TB on a conditional jump.
>> Maybe it would make sense to translate further if a destination of a
>> jump is not too far from dc->pc? The definition of "not too far" is
>> indeed tricky.
>
> We can only handle two chained exits from a TB.  If we continue past
> a conditional branch, we may well encounter a second conditional branch, which
> would leave us with three different exits from the TB.
>
> Something that may be interesting to play with, however, is to change the TB
> with which the insn in a delay slot is connected.
>
> For instance, we currently spend some amount of effort computing and saving
> the branch condition, so that we can then execute the delay slot, and afterwards
> use the saved branch condition to perform the branch.
>
> Another way of doing this is to immediately branch, exiting the TB.  But we
> set up PC+NPC for the next TB such that the delay slot is the first insn that is
> executed within the next TB.  In that way, the compare in the delay slot that
> you mention *is* in the same TB as the branch that uses it, allowing
> the case to be optimized.
>
> This could wind up creating more TBs than the current solution, so it's not
> clear that it would be a win.  One can mitigate that somewhat by noticing the
> case where the delay slot is a nop.  I do think it's worth an experiment.

So is it possible to make a TB out of non-sequential instructions?
The instruction in the delay slot would most likely be located
somewhere other than next to the instructions that follow it.
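
To make the question concrete, here is a toy model of the PC/NPC pair.
It is not QEMU code: the instruction tuples and the names `run` and
`PROGRAM` are made up for illustration, it only knows an unconditional
delayed branch, and it ignores the annul bit. It assumes a block can be
identified by the (pc, npc) pair rather than by pc alone, which is
something to verify against the sparc target before relying on it.

```python
def run(program, pc, npc, steps):
    """Step a toy SPARC-like machine and return the list of executed PCs.

    program maps an address to ("op",) or ("ba", target), where "ba" is an
    unconditional branch whose delay slot is the instruction at npc.
    """
    trace = []
    for _ in range(steps):
        insn = program[pc]
        trace.append(pc)
        if insn[0] == "ba":
            # Delayed branch: the insn at npc (delay slot) runs next,
            # and only after it does control reach the branch target.
            pc, npc = npc, insn[1]
        else:
            # Ordinary instruction: advance sequentially.
            pc, npc = npc, npc + 4
    return trace


# Hypothetical layout: a branch at 0x104 with a compare in its delay slot.
PROGRAM = {
    0x100: ("op",),
    0x104: ("ba", 0x200),
    0x108: ("op",),   # delay slot, e.g. the cmp
    0x10C: ("op",),
    0x200: ("op",),   # branch target, e.g. the next branch using the cmp
    0x204: ("op",),
}

# Whole sequence, starting before the branch.
print([hex(a) for a in run(PROGRAM, 0x100, 0x104, 4)])

# A "block" entered with pc = delay slot and npc = branch target:
# two non-adjacent addresses execute back to back.
print([hex(a) for a in run(PROGRAM, 0x108, 0x200, 2)])
```

In this model the second call executes 0x108 and then 0x200, so a block
that starts in the delay slot is fully described by its (pc, npc) entry
state even though the two instructions are not adjacent in memory.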

But I think I've been chasing a red herring. I see those helpers in
perf top when running sysbench, but not when running g++ (and in the
end g++ is a much more relevant benchmark for me):


Samples: 83K of event 'cpu-clock', Event count (approx.): 15333243164,
Thread: qemu-system-spa(2743)
 27.10%  [kernel]                 [k] retint_signal
 12.66%  qemu-system-sparc64      [.] tcg_optimize
  9.18%  [vdso]                   [.] 0x0000000000000998
  8.39%  [kernel]                 [k] _raw_spin_unlock_irqrestore
  4.76%  qemu-system-sparc64      [.] tcg_liveness_analysis
  3.89%  qemu-system-sparc64      [.] tcg_reg_alloc_op
  2.80%  qemu-system-sparc64      [.] tcg_out_opc
  2.45%  qemu-system-sparc64      [.] get_physical_address_data
  1.86%  [kernel]                 [k] native_read_tsc
  1.62%  qemu-system-sparc64      [.] tlb_flush_page
  1.55%  qemu-system-sparc64      [.] tcg_out_modrm_sib_offset.constprop.42
  1.45%  [unknown]                [.] 0x00000000451c5cae
  1.43%  qemu-system-sparc64      [.] gen_intermediate_code_pc
  1.39%  qemu-system-sparc64      [.] tcg_temp_new_internal_i64
  1.24%  qemu-system-sparc64      [.] tb_flush_jmp_cache
  1.11%  qemu-system-sparc64      [.] disas_sparc_insn
  1.08%  qemu-system-sparc64      [.] tcg_out_modrm
  0.97%  qemu-system-sparc64      [.] tcg_reg_alloc_start
  0.77%  qemu-system-sparc64      [.] cpu_sparc_exec
  0.73%  qemu-system-sparc64      [.] replace_tlb_1bit_lru.isra.3
  0.72%  qemu-system-sparc64      [.] tcg_gen_code_search_pc
  0.72%  qemu-system-sparc64      [.] tcg_opt_gen_mov
  0.70%  qemu-system-sparc64      [.] reset_temp

I'm not sure why I still see kernel functions when I zoom into the qemu
thread. Is this qemu signal handling?
It would also be interesting to know where the generated code is in
this listing. Is it [vdso], [unknown], or is it hidden behind
retint_signal?
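
A small script can at least total the listing per object, and total the
symbols that share a prefix such as tcg_. This is only a sketch: the
regular expression assumes the three-column layout shown above, the
names `totals_by_object`, `total_for_prefix` and `SAMPLE` are made up,
and `SAMPLE` holds just the first six rows of the listing to keep it
short.

```python
import re
from collections import defaultdict

# percentage, object (DSO), [k] or [.] marker, symbol
LINE = re.compile(r"^\s*([\d.]+)%\s+(\S+)\s+\[[.k]\]\s+(\S+)")

SAMPLE = """\
 27.10%  [kernel]                 [k] retint_signal
 12.66%  qemu-system-sparc64      [.] tcg_optimize
  9.18%  [vdso]                   [.] 0x0000000000000998
  8.39%  [kernel]                 [k] _raw_spin_unlock_irqrestore
  4.76%  qemu-system-sparc64      [.] tcg_liveness_analysis
  3.89%  qemu-system-sparc64      [.] tcg_reg_alloc_op
"""


def parse(text):
    """Yield (percent, obj, symbol) for every row that looks like perf output."""
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            yield float(m.group(1)), m.group(2), m.group(3)


def totals_by_object(text):
    """Sum the sample percentages per object column."""
    totals = defaultdict(float)
    for percent, obj, _symbol in parse(text):
        totals[obj] += percent
    return dict(totals)


def total_for_prefix(text, prefix):
    """Sum the sample percentages of symbols starting with the given prefix."""
    return sum(p for p, _obj, sym in parse(text) if sym.startswith(prefix))


if __name__ == "__main__":
    for obj, percent in sorted(totals_by_object(SAMPLE).items()):
        print(f"{obj:24s} {percent:6.2f}%")
    print(f"tcg_* symbols            {total_for_prefix(SAMPLE, 'tcg_'):6.2f}%")
```

Pasting the full listing into `SAMPLE` gives the split between kernel
time, translation time (the tcg_* symbols) and the unresolved entries.
It does not say which of those entries is the generated code.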

Ironically, a good optimization target seems to be the tcg_optimize
function itself. If I zoom in, I see it spends most of its time in
reset_all_temps.

Any suggestions on how to improve it?

Artyom

-- 
Regards,
Artyom Tarasenko

SPARC and PPC PReP under qemu blog: http://tyom.blogspot.com/search/label/qemu


