Re: 'syrk' very slow compared to 'gemm'

bug-gsl

From:	Patrick Alken
Subject:	Re: 'syrk' very slow compared to 'gemm'
Date:	Wed, 19 Feb 2020 09:06:31 -0700

Hello,

The GSL CBLAS library is solely intended so that users can compile applications "out of the box" without needing to install an optimized BLAS library. The GSL CBLAS is not intended for serious work, and should never be used for performance critical applications. We always recommend installing an optimized BLAS library (such as ATLAS or MKL) and then never using gslcblas again.

Writing fast BLAS routines is non-trivial, and libraries such as ATLAS and MKL have spent many years (possibly decades) optimizing every single subroutine. Many of their routines are written in assembly language. The GSL project has no plans to try to reproduce this type of effort in the internal cblas library - as stated before our cblas library is only provided so that users can quickly install and start using GSL without installing additional dependencies. It should never be used for serious computations.


Hope this helps,
Patrick

On 2/19/20 12:47 AM, David Cortes wrote:

I'm comparing functions 'ssyrk' and 'dsyrk' in the CBLAS section
against 'sgemm' and 'dgemm' for the following operation:
C <- t(A)*A

And I'm finding that 'syrk' is seriously underperformant compared to
'gemm', to the point that it seems as if the program had crashed
instead. In particular, I've timed it with input sizes of A of
1,000,000x100 and 1,000x10,000, on an AMD Ryzen 7 2700 processor, with
these results:

1,000,000 x 100 sgemm: 1.63 s
1,000,000 x 100 ssyrk: 44.2 s  <- big problem here
1,000,000 x 100 dgemm: 3.26 s
1,000,000 x 100 dsyrk: 41.5 s  <- big problem here
1,000 x 10,000 sgemm: 48.3 s
1,000 x 10,000 ssyrk: 68   s   <- smaller problem
1,000 x 10,000 dgemm: 95  s
1,000 x 10,000 dsyrk: 88  s   <- smaller problem
(A naive triple-nested for loop over the *full* array, calculating each
off-diagonal entry twice, with -O3 optimization, takes about the same
time as syrk)

Compare that against MKL running single threaded:
1,000,000 x 100 sgemm: 406 ms
1,000,000 x 100 ssyrk: 269 ms
1,000,000 x 100 dgemm: 864 ms
1,000,000 x 100 dsyrk: 532 ms
1,000 x 10,000 sgemm: 3.59 s
1,000 x 10,000 ssyrk: 1.84 s
1,000 x 10,000 dgemm: 7.61 s
1,000 x 10,000 dsyrk: 3.84 s

And against OpenBLAS:
1,000,000 x 100 sgemm: 449 ms
1,000,000 x 100 ssyrk: 326 ms
1,000,000 x 100 dgemm: 935 ms
1,000,000 x 100 dsyrk: 653 ms
1,000 x 10,000 sgemm: 4.32 s
1,000 x 10,000 ssyrk: 2.04 s
1,000 x 10,000 dgemm: 8.41 s
1,000 x 10,000 dsyrk: 3.96 s

Now, 'syrk' being slightly slower than 'gemm' is something that could
perhaps be expected, but being 25x slower (and 160x slower than MKL) is
too much I think.

Setup: GSL 2.5, installed from debian repositories.

Below are the function calls (A being an n-by-k matrix)

#include "cblas.h"
//#include "mkl.h"
#include <string.h>
void call_gemm(float *A, float *C, int n, int k, float alpha)
{
     cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                 k, k, n,
                 alpha, A, k, A, k,
                 0., C, k);
}
void call_syrk(float *A, float *C, int n, int k, float alpha)
{
     cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
                 k, n, alpha,
                 A, k, 0., C, k);
}
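void for_loop(float *restrict A, float *restrict C, int n, int k, float alpha)
{
     // Zero the k-by-k output C (not the input A) before accumulating
     memset(C, 0, (size_t)k*k*sizeof(float));
     for (int row = 0; row < k; row++)
         for (int col = 0; col < k; col++)
             for (int dim = 0; dim < n; dim++)
                 // A is n-by-k row-major, so element (dim, j) is A[j + dim*k]
                 C[col + row*k] += alpha*A[row + dim*k]*A[col + dim*k];
}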
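
Since t(A)*A is symmetric, a 'syrk'-style routine only has to fill one triangle, i.e. k*(k+1)/2 inner products instead of the k*k that a full 'gemm'-style product computes. The following is a minimal plain-C sketch of that difference, not how any BLAS library implements it. The helper names ata_full and ata_upper are hypothetical and were not part of the original message; A is assumed to be n-by-k in row-major order, as in the calls above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C <- alpha * t(A) * A, all k*k entries written.
   A is n-by-k, row-major; C is k-by-k, row-major. */
static void ata_full(const float *A, float *C, int n, int k, float alpha)
{
    assert(n > 0 && k > 0);
    for (int row = 0; row < k; row++)
        for (int col = 0; col < k; col++) {
            float sum = 0.0f;
            for (int dim = 0; dim < n; dim++)
                sum += A[row + (size_t)dim * k] * A[col + (size_t)dim * k];
            C[col + (size_t)row * k] = alpha * sum;
        }
}

/* Hypothetical helper: same product, but only entries with col >= row
   (the upper triangle) are written; the lower triangle of C is left
   untouched, which is also what 'syrk' with CblasUpper does. */
static void ata_upper(const float *A, float *C, int n, int k, float alpha)
{
    assert(n > 0 && k > 0);
    for (int row = 0; row < k; row++)
        for (int col = row; col < k; col++) {
            float sum = 0.0f;
            for (int dim = 0; dim < n; dim++)
                sum += A[row + (size_t)dim * k] * A[col + (size_t)dim * k];
            C[col + (size_t)row * k] = alpha * sum;
        }
}
```

Calling both helpers on the same input should give identical values wherever col >= row, with ata_upper doing roughly half the arithmetic. That is consistent with the MKL and OpenBLAS timings above, where 'syrk' takes noticeably less time than 'gemm'.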
