Re: [Qemu-ppc] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation


From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
Date: Thu, 29 Sep 2016 13:55:33 +1000
User-agent: Mutt/1.7.0 (2016-08-17)

On Thu, Sep 29, 2016 at 09:11:10AM +0530, Nikunj A Dadhania wrote:
> David Gibson <address@hidden> writes:
> 
> > On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> >> Load 8byte at a time and manipulate.
> >> 
> >> Big-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >> 
> >> Little-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> >> +-------------+-------------+-------------+-------------+
> >> 
> >> Vector load results in:
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >
> > Ok.  I'm guessing from this that implementing those GPR<->VSR
> > instructions showed that the earlier versions were endian-incorrect as
> > I suspected.
> >
> > Have you verified that this new implementation is actually faster (or
> > at least no slower) on LE than the original implementation with
> > individual 32-bit stores?
> 
> Result of million lxvw4x, mfvsrd/mfvsrld and print
> 
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
> real  0m2.812s
> user  0m2.792s
> sys   0m0.020s
> [tcg_test]$
> 
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le -cpu POWER9 le_lxvw4x >/dev/null
> real  0m2.801s
> user  0m2.783s
> sys   0m0.018s
> [tcg_test]$
> 
> Not much perceivable difference, is there a better way to benchmark?

Not dramatically, that I can think of.  A few tweaks you can make:
    * Increase the loop counter so the test simply runs for longer
    * Also run the test multiple times, so you can get an idea of how
      much the results vary from one run to another
    * Run the test on a system that's as idle of other activity as you
      can make it (at both host and guest level).

For our purposes the user time is probably the meaningful figure here,
and it should show less variance than the system and real times.
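
As a rough sketch of the tweaks above (not part of the patch series,
and only one way to do it): the Python helper below runs a command
several times, records the user CPU time of each child run, and
reports the mean and run-to-run standard deviation. The function
names (user_time_of, summarize, bench) and the default of 10 runs are
made up for illustration; the command line to pass in would be the
one from the timings quoted above. It assumes a Unix host, where
os.times() reports child CPU times.

```python
import os
import statistics
import subprocess


def user_time_of(cmd):
    """Run cmd once; return the user CPU seconds consumed by the child."""
    before = os.times().children_user
    # Discard stdout, as the quoted runs do with >/dev/null.
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return os.times().children_user - before


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of timings."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev


def bench(cmd, runs=10):
    """Time cmd `runs` times and summarize the user-time samples."""
    return summarize([user_time_of(cmd) for _ in range(runs)])


# Hypothetical usage, with the binaries from the quoted timings:
# mean, stdev = bench(["../qemu/ppc64le-linux-user/qemu-ppc64le",
#                      "-cpu", "POWER9", "le_lxvw4x"])
# print(f"user time: mean {mean:.3f}s, stdev {stdev:.3f}s")
```

If the difference between the patched and unpatched means is smaller
than the standard deviation, the two builds are indistinguishable at
that loop count and the test needs to run for longer.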

Note that it would be interesting to get these results on both a
Power host and an x86 host.

In any case the results above are enough to convince me that the
change isn't likely to be a significant regression.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
