Re: [Bug-gnubg] Vectorizing 3rd step


From: Øystein Johansen
Subject: Re: [Bug-gnubg] Vectorizing 3rd step
Date: Tue, 19 Apr 2005 23:56:41 +0200
User-agent: Mozilla Thunderbird 0.8 (Windows/20040913)

Øystein O Johansen wrote:
|>I initialize a vector for scaling in the second loop. I believe this
|>can be made simpler. Any suggestions?
|
|
| My own suggestion:
|
| Remove:
|   float scale[4];
|   v4sf scalevector;
|   scale[0] = scale[1] = scale[2] = scale[3] = ari;
|   scalevector = __builtin_ia32_loadaps(scale);
|
| Use:
|   v4sf scalevector = (v4sf) { ari, ari, ari, ari };

What? Did I really suggest that?

Look at the code produced:
                 v4sf scalevector = (v4sf){ari, ari, ari, ari};
   80b:       8b 45 a4                mov    0xffffffa4(%ebp),%eax
   80e:       89 85 78 ff ff ff       mov    %eax,0xffffff78(%ebp)
   814:       8b 45 a4                mov    0xffffffa4(%ebp),%eax
   817:       89 85 7c ff ff ff       mov    %eax,0xffffff7c(%ebp)
   81d:       8b 45 a4                mov    0xffffffa4(%ebp),%eax
   820:       89 45 80                mov    %eax,0xffffff80(%ebp)
   823:       8b 45 a4                mov    0xffffffa4(%ebp),%eax
   826:       89 45 84                mov    %eax,0xffffff84(%ebp)
   829:       0f 28 85 78 ff ff ff    movaps 0xffffff78(%ebp),%xmm0
   830:       0f 29 45 88             movaps %xmm0,0xffffff88(%ebp)

Look! It loads ari into %eax four times and stores it to the stack four
times, just to build the vector in memory before the movaps!!

Here's a further improvement of the initialisation:

                 v4sf tmp = __builtin_ia32_loadss( &ari );
                 v4sf scalevector = __builtin_ia32_shufps(tmp, tmp, 0);

It produces this code:
                 v4sf tmp = __builtin_ia32_loadss( &ari );
   80b:       8d 45 a4                lea    0xffffffa4(%ebp),%eax
   80e:       f3 0f 10 00             movss  (%eax),%xmm0
   812:       0f 29 45 88             movaps %xmm0,0xffffff88(%ebp)
                 v4sf scalevector = __builtin_ia32_shufps(tmp, tmp, 0);
   816:       0f 28 45 88             movaps 0xffffff88(%ebp),%xmm0
   81a:       0f c6 45 88 00          shufps $0x0,0xffffff88(%ebp),%xmm0
   81f:       0f 29 85 78 ff ff ff    movaps %xmm0,0xffffff78(%ebp)

This looks better! Even if I hand-coded the assembler for the first
version to load ari only once into the register, it would take 7
instructions, as opposed to the 6 instructions above.
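
For reference, the same load-and-shuffle broadcast can be sketched with
the portable SSE intrinsics from <xmmintrin.h> instead of the GCC
builtins. This is only a minimal sketch, assuming an x86 target with SSE
enabled; broadcast_shuffle and scale_in_place are hypothetical helper
names used for illustration, not functions from the gnubg source.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Broadcast one float to all four lanes: a movss-style scalar load
 * followed by a shufps with mask 0. */
static __m128 broadcast_shuffle(const float *p)
{
    __m128 tmp = _mm_load_ss(p);         /* lane 0 = *p, lanes 1..3 = 0 */
    return _mm_shuffle_ps(tmp, tmp, 0);  /* copy lane 0 into every lane */
}

/* Hypothetical helper: multiply n floats by ari, four at a time.
 * n must be a multiple of 4; unaligned loads and stores are used so the
 * caller does not have to guarantee 16-byte alignment. */
static void scale_in_place(float *data, size_t n, float ari)
{
    assert(n % 4 == 0);
    __m128 scalevector = broadcast_shuffle(&ari);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, scalevector));
    }
}
```

_mm_set1_ps(ari) expresses the same broadcast in one call and leaves the
instruction choice to the compiler. Compiling with optimisation on should
also remove the stack round trips seen in the listings above, though that
has not been measured here.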

I'm now up to 27000 evaluations / second. I still see places in the code
where I can squeeze out a few more vectorisations and other improvements.

I will try to exit the loops without the counters in the next step. Then
I'll try to vectorize the sigmoid function.

Can anything in CalculateInputs be vectorized? Will it gain anything?

-Øystein
