octave-bug-tracker
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Octave-bug-tracker] [bug #60928] Performance of sort unexpectedly slow


From: Rik
Subject: [Octave-bug-tracker] [bug #60928] Performance of sort unexpectedly slow for DIM=2
Date: Mon, 16 Aug 2021 12:20:20 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36

Follow-up Comment #11, bug #60928 (project octave):

I'm narrowing in on the cause of the problem which now seems to be in the
gather phase.

First, I created benchmark data so there would be no differences between
various test runs do to actual differences in the data.

The N-D array is of size 5x5x5x5000.  The baseline times for sorting along
each dimension were


## Baseline
DIM1: 0.072642
DIM2: 2.146125
DIM3: 0.478395
DIM4: 0.137138


Next, I shut off the special case for dimension 1 and forced everything
through the gather, sort, scatter path.  Timings were:


## Everything through second code path
DIM1: 11.184590
DIM2: 2.272407
DIM3: 0.468403
DIM4: 0.146637


Clearly, it was very, very bad for dimensions 1 and 2.

Next I used a std::chrono::high_resolution_clock to bracket the gather, sort,
and scatter blocks of code.  For the first three dimensions, the only thing
that differs is the stride


  octave_idx_type ns = dv(dim);
  octave_idx_type iter = dv.numel () / ns;
  octave_idx_type stride = 1;

  for (int i = 0; i < dim; i++)
    stride *= dv(i);


The variable ns (number of elements per sort) is 5.  The number of loop
iterations is 125,000.  Only the stride changes from 1 to 5 to 125 depending
on dimension.

The sort and scatter phases always takes ~1 microsecond regardless of
dimension (1, 2, or 3).  However, the gather phase can take  a maximum of
~250, ~75 , ~9 microseconds.  Clearly something is going on in this phase.

The instrumented code is


          auto start = high_resolution_clock::now();
          octave_idx_type offset = j;
          octave_idx_type offset2 = 0;

          while (offset >= stride)
            {
              offset -= stride;
              offset2++;
            }

          offset += offset2 * stride * ns;

          // gather and partition out NaNs.
          // FIXME: impact on integer types noticeable?
          octave_idx_type kl = 0;
          octave_idx_type ku = ns;
          for (octave_idx_type i = 0; i < ns; i++)
            {
              T tmp = ov[i*stride + offset];
              if (sort_isnan<T> (tmp))
                buf[--ku] = tmp;
              else
                buf[kl++] = tmp;
            }
          auto stop = high_resolution_clock::now ();
          auto duration = duration_cast<microseconds> (stop - start);
          std::cout << "gather: " << duration.count () << std::endl;


I'm not sure what the while loop for calculation of offset and offset2 is
doing, but I don't like the look of it.  That's the next benchmark timing to
run.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?60928>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/




reply via email to

[Prev in Thread] Current Thread [Next in Thread]