Re: [Qemu-ppc] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
Date: Thu, 29 Sep 2016 13:55:33 +1000
User-agent: Mutt/1.7.0 (2016-08-17)
On Thu, Sep 29, 2016 at 09:11:10AM +0530, Nikunj A Dadhania wrote:
> David Gibson <address@hidden> writes:
>
> > On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> >> Load 8byte at a time and manipulate.
> >>
> >> Big-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >>
> >> Little-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> >> +-------------+-------------+-------------+-------------+
> >>
> >> Vector load results in:
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >
> > Ok. I'm guessing from this that implementing those GPR<->VSR
> > instructions showed that the earlier versions were endian-incorrect as
> > I suspected.
> >
> > Have you verified that this new implementation is actually faster (or
> > at least no slower) on LE than the original implementation with
> > individual 32-bit stores?
>
> Result of million lxvw4x, mfvsrd/mfvsrld and print
>
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
> real 0m2.812s
> user 0m2.792s
> sys 0m0.020s
> [tcg_test]$
>
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
> real 0m2.801s
> user 0m2.783s
> sys 0m0.018s
> [tcg_test]$
>
> Not much perceivable difference, is there a better way to benchmark?
Not dramatically, that I can think of.  A few tweaks you can make:

  * Increase the loop counter so the test simply runs for longer
  * Also run the test multiple times, so you can get an idea of how
    much the results vary from one run to another
  * Run the test on a system that's as idle of other activity as you
    can make it (at both host and guest level).

For our purposes the user time is probably the meaningful thing here,
and should show less variance than the system and real time.

Note that it would be interesting to get these results for both a
Power and an x86 host.

In any case the results above are enough to convince me that the
change isn't likely to be a significant regression.
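
As a rough sketch of the "run it several times and look at the spread"
idea above: a small Python script can repeat the test and report the
mean and standard deviation of the user time. This is only an
illustration, not something from the patch series. The function names
(child_user_seconds, benchmark) are made up here, and it assumes a Unix
host because it relies on resource.getrusage().

```python
import resource
import statistics
import subprocess
import sys


def child_user_seconds(cmd):
    """Run cmd once and return the user CPU seconds it consumed.

    RUSAGE_CHILDREN accumulates usage of children that have been waited
    for, so the difference around subprocess.run() is this run's cost.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return after - before


def benchmark(cmd, runs=10):
    """Run cmd `runs` times and summarise the user-time samples."""
    samples = [child_user_seconds(cmd) for _ in range(runs)]
    return {
        "runs": runs,
        "mean": statistics.mean(samples),
        # stdev needs at least two samples
        "stdev": statistics.stdev(samples) if runs > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is the command to time.
    result = benchmark(sys.argv[1:])
    print("runs=%(runs)d mean=%(mean).3fs stdev=%(stdev).3fs "
          "min=%(min).3fs max=%(max).3fs" % result)
```

Saved under a hypothetical name such as bench.py, it could be invoked
with the same command line as in the timings quoted above, once with
and once without the patch:

    python3 bench.py ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x

If the difference between the two means is smaller than the standard
deviation, the runs don't really distinguish the two builds.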
--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson