I'm comparing functions 'ssyrk' and 'dsyrk' in the CBLAS section
against 'sgemm' and 'dgemm' for the following operation:
C <- t(A)*A
And I'm finding that 'syrk' is seriously underperformant compared to
'gemm', to the point that it seems as if the program had crashed
instead. In particular, I've timed it with input sizes of A of
1,000,000x100 and 1,000x10,000, on an AMD Ryzen 7 2700 processor, with
these results:
1,000,000 x 100 sgemm: 1.63 s
1,000,000 x 100 ssyrk: 44.2 s <- big problem here
1,000,000 x 100 dgemm: 3.26 s
1,000,000 x 100 dsyrk: 41.5 s <- big problem here
1,000 x 10,000 sgemm: 48.3 s
1,000 x 10,000 ssyrk: 68 s <- smaller problem
1,000,000 x 100 dgemm: 95 s
1,000,000 x 100 dsyrk: 88 s <- smaller problem
(A naive 3-stage for loop over the *full* array calculating things
twice, with -O3 optimization, takes about the same as syrk)
Compare that against MKL running single threaded:
1,000,000 x 100 sgemm: 406 ms
1,000,000 x 100 ssyrk: 269 ms
1,000,000 x 100 dgemm: 864 ms
1,000,000 x 100 dsyrk: 532 ms
1,000 x 10,000 sgemm: 3.59 s
1,000 x 10,000 ssyrk: 1.84 s
1,000 x 10,000 dgemm: 7.61 s
1,000 x 10,000 dsyrk: 3.84 s
And against OpenBLAS:
1,000,000 x 100 sgemm: 449 ms
1,000,000 x 100 ssyrk: 326 ms
1,000,000 x 100 dgemm: 935 ms
1,000,000 x 100 dsyrk: 653 ms
1,000 x 10,000 sgemm: 4.32 s
1,000 x 10,000 ssyrk: 2.04 s
1,000 x 10,000 dgemm: 8.41 s
1,000 x 10,000 dsyrk: 3.96 s
Now, 'syrk' being slightly slower than 'gemm' is something that could
perhaps be expected, but being 25x slower (and 160x slower than MKL) is
too much I think.
Setup: GSL 2.5, installed from debian repositories.
Below are the function calls (A being an n-by-k matrix)
#include "cblas.h"
//#include "mkl.h"
#include <string.h>
void call_gemm(float *A, float *C, int n, int k, float alpha)
{
cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
k, k, n,
alpha, A, k, A, k,
0., C, k);
}
void call_syrk(float *A, float *C, int n, int k, float alpha)
{
cblas_ssyrk(CblasRowMajor, CblasUpper, CblasTrans,
k, n, alpha,
A, k, 0., C, k);
}
void for_loop(float *restrict A, float *restrict C, int n, int k, float
alpha)
{
memset(A, 0, n*k*sizeof(float));
for (int row = 0; row < k; row++)
for (int col = 0; col < k; col++)
for (int dim = 0; dim < n; dim++)
C[col + row*k] += alpha*A[row + dim*n];
}