bug-gsl

Re: 'syrk' very slow compared to 'gemm'


From: Patrick Alken
Subject: Re: 'syrk' very slow compared to 'gemm'
Date: Wed, 19 Feb 2020 09:06:31 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.0

Hello,

  The GSL CBLAS library is provided solely so that users can compile applications "out of the box" without needing to install an optimized BLAS library. It is not intended for serious work and should never be used for performance-critical applications. We always recommend installing an optimized BLAS library (such as ATLAS or MKL) and then never using gslcblas again.

Writing fast BLAS routines is non-trivial, and libraries such as ATLAS and MKL have spent many years (possibly decades) optimizing every single subroutine; many of their routines are written in assembly language. The GSL project has no plans to reproduce this type of effort in its internal cblas library. As stated above, our cblas library is provided only so that users can quickly install and start using GSL without additional dependencies; it should never be used for serious computations.
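
For reference, linking an application against an optimized CBLAS instead of gslcblas typically looks like this (the library names assume a typical GNU/Linux ATLAS installation and may vary by distribution):

    $ gcc app.c -lgsl -lgslcblas -lm        # default: GSL's reference cblas
    $ gcc app.c -lgsl -lcblas -latlas -lm   # ATLAS's optimized cblas instead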

Hope this helps,
Patrick

On 2/19/20 12:47 AM, David Cortes wrote:
I'm comparing functions 'ssyrk' and 'dsyrk' in the CBLAS section
against 'sgemm' and 'dgemm' for the following operation:
C <- t(A)*A

And I'm finding that 'syrk' severely underperforms 'gemm', to the point
that it seems as if the program had crashed. In particular, I've timed
it with input sizes for A of 1,000,000x100 and 1,000x10,000, on an AMD
Ryzen 7 2700 processor, with these results:

1,000,000 x 100 sgemm: 1.63 s
1,000,000 x 100 ssyrk: 44.2 s  <- big problem here
1,000,000 x 100 dgemm: 3.26 s
1,000,000 x 100 dsyrk: 41.5 s  <- big problem here
1,000 x 10,000 sgemm: 48.3 s
1,000 x 10,000 ssyrk: 68   s   <- smaller problem
1,000 x 10,000 dgemm: 95  s
1,000 x 10,000 dsyrk: 88  s   <- smaller problem
(A naive triple-nested for loop over the *full* array, calculating
everything twice, compiled with -O3, takes about as long as syrk.)

Compare that against MKL running single threaded:
1,000,000 x 100 sgemm: 406 ms
1,000,000 x 100 ssyrk: 269 ms
1,000,000 x 100 dgemm: 864 ms
1,000,000 x 100 dsyrk: 532 ms
1,000 x 10,000 sgemm: 3.59 s
1,000 x 10,000 ssyrk: 1.84 s
1,000 x 10,000 dgemm: 7.61 s
1,000 x 10,000 dsyrk: 3.84 s

And against OpenBLAS:
1,000,000 x 100 sgemm: 449 ms
1,000,000 x 100 ssyrk: 326 ms
1,000,000 x 100 dgemm: 935 ms
1,000,000 x 100 dsyrk: 653 ms
1,000 x 10,000 sgemm: 4.32 s
1,000 x 10,000 ssyrk: 2.04 s
1,000 x 10,000 dgemm: 8.41 s
1,000 x 10,000 dsyrk: 3.96 s

Now, 'syrk' being slightly slower than 'gemm' is something that could
perhaps be expected, but being 25x slower (and 160x slower than MKL) is
too much, I think.

Setup: GSL 2.5, installed from the Debian repositories.

Below are the function calls (A being an n-by-k matrix)

#include "cblas.h"
//#include "mkl.h"
#include <string.h>
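/* C <- alpha * t(A)*A computed as a general (full) matrix product */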
void call_gemm(float *A, float *C, int n, int k, float alpha)
{
     cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                 k, k, n,
                 alpha, A, k, A, k,
                 0., C, k);
}
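/* Same product via the symmetric rank-k update; writes only the upper triangle of C */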
void call_syrk(float *A, float *C, int n, int k, float alpha)
{
     cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                 k, n, alpha,
                 A, k, 0., C, k);
}
void for_loop(float *restrict A, float *restrict C, int n, int k,
              float alpha)
{
     /* Naive reference: C <- alpha * t(A)*A over the full k-by-k
        output, computing each symmetric pair twice. The output C is
        zeroed before accumulating. */
     memset(C, 0, k*k*sizeof(float));
     for (int row = 0; row < k; row++)
         for (int col = 0; col < k; col++)
             for (int dim = 0; dim < n; dim++)
                 C[col + row*k] += alpha * A[row + dim*k] * A[col + dim*k];
}
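
A minimal timing driver for the two wrappers above could look like the following sketch (the random fill and the clock_gettime timing are illustrative assumptions, not the exact benchmark behind the numbers above):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(struct timespec a, struct timespec b)
{
     return (b.tv_sec - a.tv_sec) + 1e-9*(b.tv_nsec - a.tv_nsec);
}

int main(void)
{
     int n = 1000000, k = 100;
     float *A = malloc((size_t)n*k*sizeof(float));
     float *C = malloc((size_t)k*k*sizeof(float));
     struct timespec t0, t1;

     /* fill A with arbitrary values */
     for (size_t i = 0; i < (size_t)n*k; i++)
         A[i] = (float)rand() / (float)RAND_MAX;

     clock_gettime(CLOCK_MONOTONIC, &t0);
     call_gemm(A, C, n, k, 1.0f);   /* wrapper defined above */
     clock_gettime(CLOCK_MONOTONIC, &t1);
     printf("sgemm: %.3f s\n", seconds(t0, t1));

     clock_gettime(CLOCK_MONOTONIC, &t0);
     call_syrk(A, C, n, k, 1.0f);   /* wrapper defined above */
     clock_gettime(CLOCK_MONOTONIC, &t1);
     printf("ssyrk: %.3f s\n", seconds(t0, t1));

     free(A); free(C);
     return 0;
}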
