octave-patch-tracker
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Octave-patch-tracker] [patch #9415] Streamline hermitian transpose rout


From: Dan Sebald
Subject: [Octave-patch-tracker] [patch #9415] Streamline hermitian transpose routine of Array.cc for efficiency
Date: Sun, 23 Jul 2017 17:06:10 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0

URL:
  <http://savannah.gnu.org/patch/?9415>

                 Summary: Streamline hermitian transpose routine of Array.cc
for efficiency
                 Project: GNU Octave
            Submitted by: sebald
            Submitted on: Sun 23 Jul 2017 09:06:08 PM UTC
                Category: None
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

Attached is a patch that improves upon the Hermitian transpose in Array.cc. 
The main oversight is the fact that no blocking is done on the last few
columns, and I think that is most likely to be the case where the cache misses
are going to occur.  Cases where there are large numbers of rows and only a
few columns (very conceivable user-construct) take a large hit as a result. 
Call this "partial blocking".

So I've actually simplified the code by moving those left-over partial blocks
into the main save/copy loops and discarding the extra loops.  Call this "full
blocking".

I also found that moving fcn() to the first loop (i.e., saving into temp
buffer) rather than the second loop (i.e., copying out of temp buffer) is 10%
better, or so.  Don't know why...probably has something to do with the
function call with data that is more spread out in memory.

Then, as I looked at the resulting code, I wondered why the special case of NR
> 8 and NC > 8.  I constructed a test for that and found that the blocking is
beneficial for certainly NR or NC equal 7.  There is marginal loss for more
complicated looping when NR=1 or NC=1, so I see no point in not always using
the the block looping and simply tossed that second case code.

I replaced the remaining 8s with SPANC and SPANR where appropriate.

The algorithms tested are:

+verbatim*
Test scripts:

1)
start = cputime; x = rand(10000) + i*rand(10000); cputime - start
start = cputime; x'; cputime - start

2-A)
start = cputime; x = rand(10000000,10) + i*rand(10000000,10); cputime - start
start = cputime; x'; cputime - start

2-B)
start = cputime; x = rand(10,10000000) + i*rand(10,10000000); cputime - start
start = cputime; x'; cputime - start

3-A)
start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime - start
start = cputime; x'; cputime - start

3-B)
start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime - start
start = cputime; x'; cputime - start

4-A)
start = cputime; x = rand(100000000,1) + i*rand(100000000,1); cputime - start
start = cputime; x'; cputime - start

4-B)
start = cputime; x = rand(1,100000000) + i*rand(1,100000000); cputime - start
start = cputime; x'; cputime - start


A summary of the results are as follows:


SUMMARY

      NOBLOCK  CURRENT  PARTIAL    FULL
               PARTIAL  BLOCKING   BLOCKING
               BLOCK    FCN FIRST  FCN FIRST
                        NO NR,NC   NO NR,NC
      -------  -------  ---------  ---------
  1)    4.05    2.62      2.37       2.35
2-A)   10.3     3.73      3.68       2.47
2-B)    2.08    1.89      1.88       1.88
3-A)    5.77    5.77      5.78       1.60
3-B)    1.59    1.60      1.59       1.58
4-A)    1.79    1.81      1.78       1.78
4-B)    1.77    1.83      1.78       1.83


Here's a more detailed description and numbers:


WITHOUT BLOCKING CACHE OPTIMIZATION
-----------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans =  8.5840
octave:2> start = cputime; x'; cputime - start
ans =  4.0560
octave:3> start = cputime; x'; cputime - start
ans =  4.0640
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans =  9.2800
octave:5> start = cputime; x'; cputime - start
ans =  10.264
octave:6> start = cputime; x'; cputime - start
ans =  10.344
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans =  8.6160
octave:8> start = cputime; x'; cputime - start
ans =  2.0760
octave:9> start = cputime; x'; cputime - start
ans =  2.1000
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans =  7.2560
octave:11> start = cputime; x'; cputime - start
ans =  5.7680
octave:12> start = cputime; x'; cputime - start
ans =  5.7840
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans =  6.9480
octave:14> start = cputime; x'; cputime - start
ans =  1.5880
octave:15> start = cputime; x'; cputime - start
ans =  1.5920
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans =  8.4960
octave:17> start = cputime; x'; cputime - start
ans =  1.7880
octave:18> start = cputime; x'; cputime - start
ans =  1.8000
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans =  8.5920
octave:20> start = cputime; x'; cputime - start
ans =  1.7760
octave:21> start = cputime; x'; cputime - start
ans =  1.7760

WITH CURRENT PARTIAL BLOCKING
-----------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans =  7.4280
octave:2> start = cputime; x'; cputime - start
ans =  2.5960
octave:3> start = cputime; x'; cputime - start
ans =  2.6640
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans =  8.3680
octave:5> start = cputime; x'; cputime - start
ans =  3.7320
octave:6> start = cputime; x'; cputime - start
ans =  3.7400
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans =  8.0280
octave:8> start = cputime; x'; cputime - start
ans =  1.8800
octave:9> start = cputime; x'; cputime - start
ans =  1.8960
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans =  6.4800
octave:11> start = cputime; x'; cputime - start
ans =  5.7680
octave:12> start = cputime; x'; cputime - start
ans =  5.7720
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans =  6.3080
octave:14> start = cputime; x'; cputime - start
ans =  1.6000
octave:15> start = cputime; x'; cputime - start
ans =  1.6080
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans =  7.8640
octave:17> start = cputime; x'; cputime - start
ans =  1.7960
octave:18> start = cputime; x'; cputime - start
ans =  1.8320
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans =  7.8720
octave:20> start = cputime; x'; cputime - start
ans =  1.8280
octave:21> start = cputime; x'; cputime - start
ans =  1.8320


WITH CURRENT PARTIAL BLOCKING AND FCN MOVED AHEAD NO NR,NC RESTRICTION
----------------------------------------------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans =  7.5600
octave:2> start = cputime; x'; cputime - start
ans =  2.3520
octave:3> start = cputime; x'; cputime - start
ans =  2.3920
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans =  8.3320
octave:5> start = cputime; x'; cputime - start
ans =  3.6720
octave:6> start = cputime; x'; cputime - start
ans =  3.6880
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans =  8.1120
octave:8> start = cputime; x'; cputime - start
ans =  1.8720
octave:9> start = cputime; x'; cputime - start
ans =  1.8800
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans =  6.6080
octave:11> start = cputime; x'; cputime - start
ans =  5.7720
octave:12> start = cputime; x'; cputime - start
ans =  5.8080
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans =  6.4880
octave:14> start = cputime; x'; cputime - start
ans =  1.5920
octave:15> start = cputime; x'; cputime - start
ans =  1.5880
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans =  7.9680
octave:17> start = cputime; x'; cputime - start
ans =  1.7880
octave:18> start = cputime; x'; cputime - start
ans =  1.7880
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans =  8.0720
octave:20> start = cputime; x'; cputime - start
ans =  1.7840
octave:21> start = cputime; x'; cputime - start
ans =  1.7840


WITH FULL BLOCKING AND FCN MOVED AHEAD AND NR,NC>=8
---------------------------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans =  8.1040
octave:2> start = cputime; x'; cputime - start
ans =  2.5840
octave:3> start = cputime; x'; cputime - start
ans =  2.5880
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans =  9.1200
octave:5> start = cputime; x'; cputime - start
ans =  2.4560
octave:6> start = cputime; x'; cputime - start
ans =  2.4560
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans =  8.5720
octave:8> start = cputime; x'; cputime - start
ans =  1.9960
octave:9> start = cputime; x'; cputime - start
ans =  2.0040
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans =  7.1400
octave:11> start = cputime; x'; cputime - start
ans =  5.7600
octave:12> start = cputime; x'; cputime - start
ans =  5.7600
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans =  6.7880
octave:14> start = cputime; x'; cputime - start
ans =  1.6120
octave:15> start = cputime; x'; cputime - start
ans =  1.6120
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans =  8.3000
octave:17> start = cputime; x'; cputime - start
ans =  1.7840
octave:18> start = cputime; x'; cputime - start
ans =  1.8000
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans =  8.3960
octave:20> start = cputime; x'; cputime - start
ans =  1.7880
octave:21> start = cputime; x'; cputime - start
ans =  1.7920


WITH FULL BLOCKING AND FCN MOVED AHEAD NO NR,NC RESTRICTION
-----------------------------------------------------------
octave:1> start = cputime; x = rand(10000) + i*rand(10000); cputime - start
ans =  7.5440
octave:2> start = cputime; x'; cputime - start
ans =  2.3560
octave:3> start = cputime; x'; cputime - start
ans =  2.3520
octave:4> start = cputime; x = rand(10000000,10) + i*rand(10000000,10);
cputime - start
ans =  8.4320
octave:5> start = cputime; x'; cputime - start
ans =  2.4520
octave:6> start = cputime; x'; cputime - start
ans =  2.4840
octave:7> start = cputime; x = rand(10,10000000) + i*rand(10,10000000);
cputime - start
ans =  8.1520
octave:8> start = cputime; x'; cputime - start
ans =  1.8760
octave:9> start = cputime; x'; cputime - start
ans =  1.8920
octave:10> start = cputime; x = rand(12000000,7) + i*rand(12000000,7); cputime
- start
ans =  6.6000
octave:11> start = cputime; x'; cputime - start
ans =  1.5960
octave:12> start = cputime; x'; cputime - start
ans =  1.6040
octave:13> start = cputime; x = rand(7,12000000) + i*rand(7,12000000); cputime
- start
ans =  6.3280
octave:14> start = cputime; x'; cputime - start
ans =  1.5800
octave:15> start = cputime; x'; cputime - start
ans =  1.5960
octave:16> start = cputime; x = rand(100000000,1) + i*rand(100000000,1);
cputime - start
ans =  7.8520
octave:17> start = cputime; x'; cputime - start
ans =  1.7800
octave:18> start = cputime; x'; cputime - start
ans =  1.7880
octave:19> start = cputime; x = rand(1,100000000) + i*rand(1,100000000);
cputime - start
ans =  7.9280
octave:20> start = cputime; x'; cputime - start
ans =  1.8280
octave:21> start = cputime; x'; cputime - start
ans =  1.8400





    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/patch/?9415>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/




reply via email to

[Prev in Thread] Current Thread [Next in Thread]