
Re: [Qemu-devel] profiling qemu


From: Blue Swirl
Subject: Re: [Qemu-devel] profiling qemu
Date: Tue, 14 Feb 2012 18:31:54 +0000

2012/2/14 Lluís Vilanova <address@hidden>:
> Artyom Tarasenko writes:
> [...]
>> QEMU 1.0.50 monitor - type 'help' for more information
>> (qemu) profile
>> unknown command: 'profile'
>> (qemu) info profile
>> async time  38505498320 (38.505)
>> qemu time   35947093161 (35.947)
>
>> Is there a way to find out more?
>
> Command "info jit" also has some information added when compiled with
> profiling support.
>
> Search for CONFIG_PROFILER to see which code is activated during profiling.
>
>
>> Next I tried gprof:
>
>> build-prof $  gprof sparc64-softmmu/qemu-system-sparc64 gmon.out
>> Flat profile:
>
>> Each sample counts as 0.01 seconds.
>>   %   cumulative   self              self     total
>>  time   seconds   seconds    calls  Ts/call  Ts/call  name
>> 100.00      5.06     5.06                             main
>
>> Hmm. Not very informative. Is there a way to find out more details?
>
> Did you run QEMU for a reasonable amount of time? gprof uses sampling to
> capture its execution time statistics, so a small execution of QEMU will not
> be able to capture any meaningful information.
>
>
> [...]
>> Here it looks like "compute_all_sub" and "compute_all_sub_xcc" are
>> good candidates for optimizing: together they take the same amount of
>> time as cpu_sparc_exec. I guess both operations would be trivial in
>> the x86_64 assembler. What would be the best strategy to make TCG take
>> the advantage of running on a x86_64 host?
>
> A quick look into the code reveals that these two are called from a TCG helper
> (helper_compute_psr), so I see two approaches here applicable to the most
> frequently used "sub-operations" in helper_compute_psr:
>
> * Define new simpler helpers for those sub-operations that can be declared
>  with TCG_CALL_CONST and generate the new psr/xcc values in temporary
>  registers. You must make sure any other code will still be able to use the
>  new psr/xcc values.

I guess this (and just checking whether some of the functions are
already const) would be the lowest-hanging fruit.
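
As a rough illustration of what a "const" helper looks like, here is a
minimal standalone sketch in plain C, not actual QEMU code. The name
icc_flags_sub and the ICC_* macros are made up for this example; the bit
positions follow the SPARC V8 PSR layout (N=23, Z=22, V=21, C=20). The
point is only that the result depends on nothing but the two arguments,
which is the property a TCG_CALL_CONST helper would need.

```c
#include <stdint.h>

/* Hypothetical names; bit positions follow the SPARC V8 PSR icc field. */
#define ICC_N (1u << 23)   /* negative */
#define ICC_Z (1u << 22)   /* zero     */
#define ICC_V (1u << 21)   /* overflow */
#define ICC_C (1u << 20)   /* carry    */

/*
 * Flags produced by a 32-bit "subcc src1, src2".
 * Pure function: reads no global state and writes none, so two calls
 * with the same arguments always return the same value.
 */
static uint32_t icc_flags_sub(uint32_t src1, uint32_t src2)
{
    uint32_t dst = src1 - src2;
    uint32_t flags = 0;

    if (dst & 0x80000000u) {
        flags |= ICC_N;
    }
    if (dst == 0) {
        flags |= ICC_Z;
    }
    /* Signed overflow: operand signs differ and the result sign differs from src1. */
    if (((src1 ^ src2) & (src1 ^ dst)) & 0x80000000u) {
        flags |= ICC_V;
    }
    /* For subtraction the carry flag is the unsigned borrow. */
    if (src1 < src2) {
        flags |= ICC_C;
    }
    return flags;
}
```

With this shape, icc_flags_sub(5, 5) yields only ICC_Z and
icc_flags_sub(1, 2) yields ICC_N | ICC_C. Whether the existing helpers
can be reduced to this form is something that would have to be checked
in the code.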

> * Reimplement these sub-operations in pure TCG code.

One optimization mentioned by Fabrice that could be implemented:
'subcc' (aka 'cmp') is often followed by a conditional branch. This
sequence could be detected, and instead of computing the full set of
ICC and XCC flags followed by a flag test in the branch, we would just
generate a brcond which matches the original cmp. A similar
optimization is used for the i386 target.
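
To make the cmp plus branch idea concrete, here is a small standalone C
model, again not QEMU code and with made-up function names. It compares
the slow path (materialize N, Z, V, C for a 32-bit subcc, then test them
the way 'bl' and 'bleu' do) with a fused path that compares the original
operands directly. In the translator, the fused form would presumably
correspond to a single brcond with a signed or unsigned condition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags of a 32-bit subcc, kept as separate booleans for clarity. */
typedef struct {
    bool n, z, v, c;
} Icc;

static Icc subcc_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Icc f;

    f.n = (r >> 31) != 0;
    f.z = (r == 0);
    f.v = ((((a ^ b) & (a ^ r)) >> 31) != 0);   /* signed overflow */
    f.c = a < b;                                /* unsigned borrow */
    return f;
}

/* Slow path: compute every flag, then apply the branch condition. */
static bool bl_via_flags(uint32_t a, uint32_t b)
{
    Icc f = subcc_flags(a, b);
    return f.n != f.v;          /* bl: N xor V */
}

static bool bleu_via_flags(uint32_t a, uint32_t b)
{
    Icc f = subcc_flags(a, b);
    return f.c || f.z;          /* bleu: C or Z */
}

/* Fused path: one comparison on the original cmp operands. */
static bool bl_fused(uint32_t a, uint32_t b)
{
    /* The cast assumes the usual two's complement conversion. */
    return (int32_t)a < (int32_t)b;
}

static bool bleu_fused(uint32_t a, uint32_t b)
{
    return a <= b;
}
```

Both paths agree for every operand pair, which is why the flag
computation can be skipped when nothing else consumes the flags before
they are overwritten. That last condition is the part the detection
logic would have to establish.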

Also, the flags are not computed as lazily as they could be. For
example, if only XCC is used and not ICC, we could compute ICC only if
it is needed later, or even compute just the subset of flags needed by
the following conditional instruction. This would be simpler but not
as optimal.
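
A sketch of the lazier scheme, as a standalone C model with hypothetical
names (LazyCc, lazy_subcc and friends are not taken from the QEMU
source): the subcc only records its operands, and each consumer derives
the single flag it needs from them. The flag_evals counter exists only
to show when work actually happens.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lazy condition-code state: operands of the last subcc. */
typedef struct {
    uint64_t src1, src2;
    unsigned flag_evals;    /* illustration only: number of flags derived */
} LazyCc;

/* Executing subcc does no flag work at all, it just remembers the inputs. */
static void lazy_subcc(LazyCc *cc, uint64_t a, uint64_t b)
{
    cc->src1 = a;
    cc->src2 = b;
}

/* xcc.Z: zero test on the full 64-bit result. */
static bool lazy_xcc_z(LazyCc *cc)
{
    cc->flag_evals++;
    return (cc->src1 - cc->src2) == 0;
}

/* icc.Z: zero test on the low 32 bits of the result only. */
static bool lazy_icc_z(LazyCc *cc)
{
    cc->flag_evals++;
    return (uint32_t)(cc->src1 - cc->src2) == 0;
}

/* xcc.C: unsigned borrow of the 64-bit subtraction. */
static bool lazy_xcc_c(LazyCc *cc)
{
    cc->flag_evals++;
    return cc->src1 < cc->src2;
}
```

For example, after lazy_subcc(&cc, 0x100000000ULL, 0) nothing has been
evaluated yet; a following 64-bit branch on zero only calls
lazy_xcc_z(), and the ICC flags are never touched unless some later
instruction asks for them.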

The flags are also computed in canonical form at the end of each basic
block, whereas the lazy versions could linger, maybe forever.

These are just off the top of my head; I think Sparc64 could be
optimized a lot.

> But first, make sure you run a proper benchmark to establish where the
> hotspots are in the sparc code for QEMU. The problem here is to establish
> what a proper benchmark is :)
>
>
> Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth


