qemu-ppc

Re: TCG performance on PPC64


From: Matheus K. Ferst
Subject: Re: TCG performance on PPC64
Date: Thu, 19 May 2022 17:31:54 -0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0

On 18/05/2022 11:44, Richard Henderson wrote:
On 5/18/22 06:16, Matheus K. Ferst wrote:
As a final test, I changed the images to have a normal user account already created and unlocked, disabled Cloud-Init, downloaded bc-1.07 sources[4][5], installed its build dependencies[6], and changed the test script to log in, extract, configure, build, and shut down the guest. I also added an aarch64-compatible machine (Apple M1 w/ 10 cores) to our test setup. Running 100 iterations gave us the following results:

+---------+----------------------------------------------------+
|         |                        Host                        |
|  Guest  +-----------------+-----------------+----------------+
|         |      PPC64      |     x86_64      |     aarch64    |
+---------+-----------------+-----------------+----------------+
| PPC64   |  429.82 ± 11.57 |   352.34 ± 8.51 | 180.78 ± 42.02 |
| aarch64 | 1029.78 ± 46.01 | 1207.98 ± 80.49 |  487.50 ± 7.54 |
| s390x   |  589.97 ± 86.67 |  411.83 ± 41.88 | 221.86 ± 79.85 |
+---------+-----------------+-----------------+----------------+
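For anyone wanting to reproduce cells in this "mean ± deviation" form, here is a minimal sketch of how a list of per-iteration figures could be summarized. It is not the script used for these tests: the `runs` values are made up, the helper names are hypothetical, and the choice of the sample standard deviation (n - 1) is an assumption, since the thread does not say which estimator was used.

```python
import statistics


def summarize(durations):
    """Return (mean, sample standard deviation) of per-iteration figures."""
    mean = statistics.mean(durations)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stdev = statistics.stdev(durations)
    return mean, stdev


def format_cell(durations):
    """Format a table cell as 'mean ± stdev' with two decimal places."""
    mean, stdev = summarize(durations)
    return f"{mean:.2f} ± {stdev:.2f}"


# Hypothetical figures for five iterations, not data from the tests.
runs = [430.0, 420.0, 440.0, 425.0, 435.0]
print(format_cell(runs))
```

With 100 iterations per guest/host pair, the same helper would be called once per cell.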

These are some weird results.  Particularly the aarch64 host ones -- I'm really surprised that it's that much faster than the x86_64 at anything.  Oh, the E5-2687W was discontinued 7 years ago.  So I'll just put that down to age.


Right, this Xeon was discontinued even before POWER9 was launched. It's slower in other tasks but still outperforms PPC64 in TCG emulation.

What would be different in aarch64 emulation that yields a better performance on our POWER9?

That is a very good question.

  - I suppose that aarch64 has more instructions with GVec implementations than PPC64 and s390x, so maybe aarch64 guests can better use host vector instructions?

No, there's very little gvec in a kernel boot cycle.  Not none, but very little.

  - Looking at the flame graphs of each test (attached), I can see that tb_gen_code takes a proportionally smaller share of the time in aarch64 emulation than in PPC64 and s390x emulation, so it might be that decodetree is faster?

No.  (1) aarch64 base instructions aren't using decodetree, (2) the existing ppc and s390 decode is pretty well architected; decodetree is not particularly optimized, it's simply meant to be more readable.

Looking at the aarch64-on-ppc64 graph, I see that PAC encryption is taking up a huge proportion of your runtime.  Probably gcc has done a better job with those routines for the ppc64 host.  You may want to run the aarch64 guest tests again with -cpu max,pauth=off.
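As a rough illustration of the suggested change, the sketch below swaps the -cpu value in an existing QEMU argument list. Only "-cpu max,pauth=off" comes from the suggestion above; the `base` command line is a placeholder, not the one used in the tests, and `with_cpu_option` is a hypothetical helper.

```python
def with_cpu_option(argv, cpu_spec):
    """Return a copy of argv whose -cpu value is cpu_spec.

    If -cpu is present with a value, the value is replaced; otherwise
    the option (or its missing value) is appended.
    """
    out = list(argv)
    if "-cpu" in out:
        i = out.index("-cpu")
        if i + 1 < len(out):
            out[i + 1] = cpu_spec
        else:
            out.append(cpu_spec)
    else:
        out += ["-cpu", cpu_spec]
    return out


# Placeholder command line (assumption); the real test invocation is not
# shown in this thread.
base = ["qemu-system-aarch64", "-M", "virt", "-cpu", "max", "-nographic"]
print(" ".join(with_cpu_option(base, "max,pauth=off")))
```

The input list is left untouched, so the same base arguments can be reused for both the default and the pauth=off runs.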

You are right, with pauth=off:

+---------+------------------------------------------------+
|         |                       Host                     |
|  Guest  +----------------+---------------+---------------+
|         |      PPC64     |     x86_64    |    aarch64    |
+---------+----------------+---------------+---------------+
| aarch64 | 395.02 ± 12.22 | 339.13 ± 6.34 | 148.88 ± 8.32 |
+---------+----------------+---------------+---------------+
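To put the two aarch64 guest rows side by side, here is a small sketch that divides the default figures by the pauth=off figures for each host. The numbers are the means copied from the two tables above; the variable names are mine, the deviations are ignored, and reading the result as a "speedup" assumes the figures are durations where lower is better.

```python
# Means for the aarch64 guest, copied from the two tables in this thread.
default_means = {"PPC64": 1029.78, "x86_64": 1207.98, "aarch64": 487.50}
pauth_off_means = {"PPC64": 395.02, "x86_64": 339.13, "aarch64": 148.88}


def ratios(before, after):
    """Return before/after per host (a speedup if lower figures are better)."""
    return {host: before[host] / after[host] for host in before}


speedup = ratios(default_means, pauth_off_means)
for host, factor in speedup.items():
    print(f"{host:8s} {factor:.2f}x")
```

This is only a ratio of means, so it says nothing about whether the per-host differences are within the reported deviations.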

I wonder if the s390x command line also needs some cpu/machine options to be more representative of "normal" TCG uses.

Otherwise, the flame graph columns are too narrow to actually read, for me.

If your SVG viewer supports JS/CSS/etc., you can click a block to "zoom in" on a particular call stack. The function name and number of samples are shown on mouse hover, and there is a search tool with ctrl+f.

The results are also on a GitHub Wiki page now:
https://github.com/PPC64/qemu/wiki/TCG-Performance-on-PPC64

Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO <http://www.eldorado.org.br/>
Analista de Software
Aviso Legal - Disclaimer <https://www.eldorado.org.br/disclaimer.html>
