Re: [Discuss-gnuradio] [VOLK] GPU acceleration -> OpenCL integration?


From: Tom Rondeau
Subject: Re: [Discuss-gnuradio] [VOLK] GPU acceleration -> OpenCL integration?
Date: Sat, 26 Dec 2015 15:53:39 -0500

On Sun, Dec 20, 2015 at 5:51 PM, West, Nathan <address@hidden> wrote:
Hi Stefan,

First of all, I'm really happy to see this done. Using OpenCL in VOLK has come up once in a while, and the general consensus was that the transport overhead and granularity of work in VOLK would not make it worthwhile, but we never knew for sure.

Another wrinkle is where the tradeoff between GPU and CPU work falls for each particular processor/GPU pair, which is impossible to know without some kind of benchmark/wisdom generation. VOLK doesn't have any mechanism for running those benchmarks and recording the results. It's interesting work if that's what you're looking to do.
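
As a rough illustration of what that benchmark/wisdom generation could look like (this is not existing VOLK code; the "offload" candidate below is just a stand-in loop where a real OpenCL proto-kernel would go), something along these lines could sweep vector sizes and report where the crossover sits on a given machine:

#include <chrono>
#include <cstdio>
#include <functional>
#include <volk/volk.h>

using kernel_fn = std::function<void(float*, const float*, const float*, unsigned int)>;

// Average the runtime of one candidate implementation over a few iterations.
static double time_kernel(const kernel_fn& k, float* out, const float* a,
                          const float* b, unsigned int n, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        k(out, a, b, n);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / iters;
}

int main()
{
    // CPU path: the stock VOLK dot product (its dispatcher picks the SIMD impl).
    kernel_fn cpu = [](float* out, const float* a, const float* b, unsigned int n) {
        volk_32f_x2_dot_prod_32f(out, a, b, n);
    };
    // Placeholder for an OpenCL-backed proto-kernel (hypothetical).
    kernel_fn offload = [](float* out, const float* a, const float* b, unsigned int n) {
        float acc = 0.0f;
        for (unsigned int i = 0; i < n; ++i)
            acc += a[i] * b[i];
        *out = acc;
    };

    const size_t align = volk_get_alignment();
    for (unsigned int n = 1024; n <= (1u << 22); n <<= 1) {
        float* a = (float*)volk_malloc(n * sizeof(float), align);
        float* b = (float*)volk_malloc(n * sizeof(float), align);
        for (unsigned int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 0.5f; }
        float out = 0.0f;
        double t_cpu = time_kernel(cpu, &out, a, b, n, 20);
        double t_off = time_kernel(offload, &out, a, b, n, 20);
        std::printf("n=%8u  cpu=%.3g s  offload=%.3g s  winner=%s\n",
                    n, t_cpu, t_off, t_off < t_cpu ? "offload" : "cpu");
        volk_free(a);
        volk_free(b);
    }
    return 0;
}

The per-size winners from a sweep like that are exactly the wisdom we'd want to record somewhere.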

From a VOLK perspective, if we can build up that wisdom capability and feed it into a dispatcher, it's probably going to be useful, especially for people who are developing some OpenCL code in a workflow but don't know for sure where the code should run. I think the best way to develop this might be in a VOLK OOT, unless you're fine working off a long-lived branch while looking at this stuff.

I'm happy to continue discussing this, especially on the list

Nathan


Good points, Nathan. This seems like an interesting direction for VOLK, at least under these circumstances. A wisdom concept might work in general for different sizes of vectors. This could be an add-on to the volk_profile utility that does a full benchmarking pass.
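
As a sketch of the kind of per-size wisdom such a volk_profile add-on might record (the file format and the log2 size bucketing here are invented for illustration; today's volk_config only stores a single best implementation per kernel):

#include <cstdio>
#include <map>
#include <string>
#include <utility>

// kernel name + log2(size bucket) -> name of the fastest implementation
using wisdom_table = std::map<std::pair<std::string, unsigned>, std::string>;

// Write the table out in a simple line-oriented format (invented here).
static void save_wisdom(const wisdom_table& w, const char* path)
{
    FILE* f = std::fopen(path, "w");
    if (!f)
        return;
    for (const auto& entry : w)
        std::fprintf(f, "%s %u %s\n", entry.first.first.c_str(),
                     entry.first.second, entry.second.c_str());
    std::fclose(f);
}

int main()
{
    wisdom_table w;
    // Entries a benchmarking pass might produce: small buffers stay on a CPU
    // SIMD implementation, large ones go to a (hypothetical) OpenCL one.
    w[{"volk_32f_x2_dot_prod_32f", 10}] = "a_avx";     // ~1K points
    w[{"volk_32f_x2_dot_prod_32f", 20}] = "a_opencl";  // ~1M points
    save_wisdom(w, "volk_wisdom.txt");
    return 0;
}

A dispatcher could then look up the bucket for the current call's num_points instead of using one answer for every size.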

But I definitely don't want to drop Stefan's work here. Let's figure out the best way to make it available so we don't lose track of it.

Tom


 
On Thu, Dec 17, 2015 at 8:53 PM, Douglas Geiger <address@hidden> wrote:
Stefan,
 First off, I definitely want to encourage investigations of this sort: even though I have some thoughts similar to Sylvain's and Tom's about whether VOLK is the right place to do this, I definitely want to encourage *trying* it, since you never know - we could be entirely wrong about whether or not this will work. The only way to know for sure is to try it.

 That said: I do think there are ways *within* VOLK to deal with the issue of the input size (i.e. vector size) having a large impact on performance - namely the custom dispatcher. This is a concept that exists in VOLK, but it has largely gone unnoticed because, by and large, the default dispatcher does a good (or at least good-enough) job of selecting the proper proto-kernel. For off-loading concepts such as utilizing GPUs via OpenCL, a custom dispatcher *could* select the appropriate proto-kernel (including directing the OpenCL implementation to select a CPU- vs. GPU-based implementation, if multiple OpenCL implementations are available) on a per-work() call basis from the GNU Radio scheduler. In other words, instead of relying on volk_profile to select the best proto-kernel for all calls to a particular VOLK kernel, the dispatcher could have something more akin to FFTW's 'wisdom', where different proto-kernels are called for different sizes of matrices/vectors (including the CPU SIMDized call instead of the OpenCL call for smaller input sizes, etc.).
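
 A rough sketch of what such a size-aware dispatch could look like (the threshold and the OpenCL stand-in below are assumptions for illustration; VOLK's generated dispatchers don't currently consult the vector length):

#include <cstdio>
#include <volk/volk.h>

// Stand-in for an OpenCL-backed proto-kernel (hypothetical); a real one would
// enqueue work on the device and read the result back.
static void dot_prod_opencl_standin(float* result, const float* input,
                                    const float* taps, unsigned int num_points)
{
    float acc = 0.0f;
    for (unsigned int i = 0; i < num_points; ++i)
        acc += input[i] * taps[i];
    *result = acc;
}

// Crossover below which per-call OpenCL setup/transfer is assumed to cost more
// than it saves.  In a wisdom scheme this would come from a benchmarking pass
// rather than a hard-coded constant.
static const unsigned int OPENCL_MIN_POINTS = 1u << 18;

// Size-aware dispatch: small buffers stay on the stock SIMD path, large ones
// go to the off-load path.
static void dot_prod_dispatch(float* result, const float* input,
                              const float* taps, unsigned int num_points)
{
    if (num_points >= OPENCL_MIN_POINTS)
        dot_prod_opencl_standin(result, input, taps, num_points);
    else
        volk_32f_x2_dot_prod_32f(result, input, taps, num_points);
}

int main()
{
    const unsigned int n = 4096;  // below the threshold, so the SIMD path runs
    float* a = (float*)volk_malloc(n * sizeof(float), volk_get_alignment());
    float* b = (float*)volk_malloc(n * sizeof(float), volk_get_alignment());
    for (unsigned int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 0.5f; }
    float out = 0.0f;
    dot_prod_dispatch(&out, a, b, n);
    std::printf("dot product of %u points: %f\n", n, out);
    volk_free(a);
    volk_free(b);
    return 0;
}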

 Anyways - I definitely think this is something that should be looked into more, and if you are interested in pursuing this - either as a GSoC project or otherwise - I would definitely encourage it, as well as offer assistance/advice where I can.

 Doug


On Thu, Dec 17, 2015 at 7:58 PM, Stefan Wunsch <address@hidden> wrote:


On 12/18/2015 12:30 AM, Tom Rondeau wrote:
> On Thu, Dec 17, 2015 at 1:14 PM, Sylvain Munaut <address@hidden> wrote:
>
>> Hi,
>>
>>> RUN_VOLK_TESTS: volk_32f_x2_matrix_nxn_multiply_puppet_32f(1000000,10)
>>> generic completed in 28482ms
>>> a_opencl completed in 13364.3ms
>>
>> Question is how does that number change for smaller problem sizes ?
>> And what would be the average problem size encountered in real env.
>>
>> For SIMD optimization the result of "who's the fastest" doesn't vary
>> too much depending on problem size, because the kernels don't have much
>> setup / teardown cost.
>> For OpenCL I very much doubt that would be the case, and you may end up
>> with an app making a lot of "smallish" calls (and given the default
>> buffer size of GR, I feel the calls to volk aren't processing millions
>> of samples at a time in a single call).
>>
>>
>> Cheers,
>>
>>     Sylvain
>>
>
>
> Stefan,
>
> This is a great start. But Sylvain makes good points about the data
> transfer issue. That's definitely a problem we have to think about. It's
> why we have avoided pursuing GPU support in VOLK in the past. Now, if
> heterogeneous processor technologies change, so might this problem.
>
> On the other hand, Doug Geiger has made progress on building OpenCL support
> into the buffer structure of the scheduler. What you've done here might
> work better as a block designed around this concept.
>
> Tom
>

Hi,

I just wondered why it had not been done yet, but I see the problems now
(Sylvain made the point).
If proper device selection and initialization were integrated into VOLK, the
same machinery could probably be reused by the scheduler (e.g., with a
generic fallback - a rough sketch is below). Then again, I admit I don't
know enough about all of this yet ;)
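
For example, a minimal device-selection routine with a generic fallback might look something like this (just a sketch against the stock OpenCL C API, nothing VOLK-specific; compile with -lOpenCL):

#include <CL/cl.h>
#include <cstdio>

// Prefer a GPU device; failing that, accept any OpenCL device (which covers
// CPU OpenCL implementations); failing that, return nullptr so the caller can
// fall back to the plain generic/SIMD kernels.
static cl_device_id pick_device()
{
    cl_uint num_platforms = 0;
    if (clGetPlatformIDs(0, nullptr, &num_platforms) != CL_SUCCESS || num_platforms == 0)
        return nullptr;

    cl_platform_id platforms[8];
    clGetPlatformIDs(num_platforms < 8 ? num_platforms : 8, platforms, nullptr);

    const cl_device_type preferred[] = { CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ALL };
    for (cl_device_type type : preferred) {
        for (cl_uint p = 0; p < num_platforms && p < 8; ++p) {
            cl_device_id dev;
            cl_uint num_devices = 0;
            if (clGetDeviceIDs(platforms[p], type, 1, &dev, &num_devices) == CL_SUCCESS
                && num_devices > 0)
                return dev;
        }
    }
    return nullptr;
}

int main()
{
    cl_device_id dev = pick_device();
    if (!dev) {
        std::printf("no OpenCL device found, falling back to CPU kernels\n");
        return 0;
    }
    char name[256] = {0};
    clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, nullptr);
    std::printf("selected OpenCL device: %s\n", name);

    // Create (and immediately release) a context just to show initialization.
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
    if (err == CL_SUCCESS)
        clReleaseContext(ctx);
    return 0;
}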

Greetings
Stefan




--
Doug Geiger
address@hidden




_______________________________________________
Discuss-gnuradio mailing list
address@hidden
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio


