Re: [Discuss-gnuradio] [VOLK] GPU acceleration -> OpenCL integration?


From: Stefan Wunsch
Subject: Re: [Discuss-gnuradio] [VOLK] GPU acceleration -> OpenCL integration?
Date: Sun, 27 Dec 2015 17:53:55 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0

Hi,

My full OpenCL integration is available on GitHub [0].

You are right: without wisdom-like profiling it is not possible to
integrate OpenCL in a fully VOLKish way, such that VOLK itself decides
whether the CPU or the GPU is faster (and the user doesn't have to
care about it). At the moment, volk_profile has to be called with the
input size of the desired use case (see [2]).
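
To put rough numbers on the tradeoff (purely illustrative, not
measured): if a host/device round trip costs about 0.2 ms and the GPU
saves about 1 ns per sample over SIMD, the GPU only wins above roughly
0.2 ms / 1 ns = 200,000 samples per call, and a typical GNU Radio
work() call hands VOLK far fewer items than that. So the profiled
input size matters a lot.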

But it would probably be possible to add OpenCL as a feature that is
turned off by default, just like the ORC integration, which can also
be disabled. That's the way I have done it. I think this kind of GPU
support would be really nice, because then the explicit use of OpenCL
doesn't break a GNU Radio OOT module for most users. Installing all
the OpenCL platform packages isn't that easy for most people (e.g.,
it's horrible on Ubuntu).

A short summary of what you can find on GitHub [0]:

The first thing is a volk_profile-like executable called
volk_opencl_config, which selects the desired OpenCL device by type
(CPU, GPU or ALL) and/or name (it takes the device if the given string
is part of the device name). The result is written to $HOME/.volk, in
the same way volk_profile does. If this has not been done before
running VOLK in a program, the integration uses the first OpenCL
device found.
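
For example, to simply take the first GPU found, or to test a name
match without writing the config (both options appear in the --help
output in [1]):

$ volk_opencl_config -t 'GPU'
$ volk_opencl_config -i 'Intel' --dry-run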

The integration in VOLK itself is done by initializing the OpenCL
platform and device during the first call of get_machine(). This is
the same mechanism used for the selection of the VOLK machine. Again,
it reads the config from $HOME/.volk or takes the first OpenCL device
found. The symbol HAVE_OPENCL is introduced, and the cl_platform and
cl_device objects are initialized as globals, which can be used by the
kernels.
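
As a rough sketch of that first-call fallback path (not the exact code
from [0]; the global names here are illustrative):

#include <CL/cl.h>

/* Globals the kernels can use (names illustrative). */
cl_platform_id volk_cl_platform;
cl_device_id   volk_cl_device;

static void volk_init_opencl(void)
{
    /* No $HOME/.volk/opencl_config found: take the first
       platform and the first device on it. */
    cl_uint n = 0;
    clGetPlatformIDs(1, &volk_cl_platform, &n);
    clGetDeviceIDs(volk_cl_platform, CL_DEVICE_TYPE_ALL,
                   1, &volk_cl_device, &n);
}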

CMake is configured with OpenCL disabled by default, so VOLK does not
change at all unless this is explicitly desired. The OpenCL-specific
code in VOLK is guarded by a FOUND_OpenCL symbol, which is passed to
the compiler only if ENABLE_OPENCL and FOUND_OpenCL are both true.
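
In the kernel code this amounts to a plain preprocessor guard, roughly
like this (a sketch; the kernel name is made up):

#ifdef FOUND_OpenCL
/* Only compiled when cmake ran with ENABLE_OPENCL=ON and an OpenCL
   installation was found. */
static inline void
volk_32f_x2_example_32f_opencl(float* out, const float* a,
                               const float* b, unsigned int num_points)
{
    /* ... enqueue the OpenCL kernel via the global platform/device ... */
}
#endif /* FOUND_OpenCL */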

Furthermore, the header volk_opencl.h is introduced, which holds
helper functions for using OpenCL in kernel code, such as compiling
the device program from given OpenCL C source.
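
Such a compile helper could look roughly like this (a minimal sketch;
the actual names in volk_opencl.h may differ):

#include <CL/cl.h>

/* Build a cl_program for one device from OpenCL C source.
   Error handling reduced to the essentials. */
static cl_program build_program(cl_context ctx, cl_device_id dev,
                                const char* src)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;
    if (clBuildProgram(prog, 1, &dev, "", NULL, NULL) != CL_SUCCESS) {
        clReleaseProgram(prog);
        return NULL;
    }
    return prog;
}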

I've placed example terminal outputs at the bottom [1,2].

Greetings
Stefan

[0] https://github.com/stwunsch/volk-opencl/tree/opencl_full_integration

[1] Example volk_opencl_config

# Output --help option

$ volk_opencl_config -h
Configure the OpenCL device for VOLK.
Options:
  -h [ --help ]                    Print help messages
  -n [ --dry-run ] [=arg(=1)] (=0) Dry run. Respect other options, but don't
                                   write the config to file
  -i [ --identifier ] arg          Part of the desired OpenCL device name,
                                   e.g., 'GeForce'.
  -t [ --type ] arg (=ALL)         Desired OpenCL device type. Options:
                                   'ALL', 'CPU', 'GPU'

# Output finding desired OpenCL device

$ volk_opencl_config -i 'GeForce' -t 'ALL'
Look through a maximum of 5 platforms with a maximum of 5 devices.
-> Found 2 platforms.
Look for devices of the type 'ALL' with the identifier 'GeForce'.
Looking for devices on platform 0.
-> Found device 0: GeForce GT 730M
-> Selected this device.
Looking for devices on platform 1.
-> Found device 0: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz
Writing "/home/user/.volk/opencl_config"...

[2] Example volk_profile with a correlation kernel

The OpenCL kernel is not faster than AVX2, but this depends strongly
on the GPU used (mine is pretty bad).

$ volk_profile -i 100 -v 10240
Using VOLK machine: avx2_64_mmx_orc_opencl
Using OpenCL device: GeForce GT 730M
RUN_VOLK_TESTS: volk_32f_x2_corr_puppet_32f(10240,100)
generic completed in 1399.89ms
a_avx completed in 106.194ms
u_avx completed in 116.254ms
a_sse completed in 127.217ms
u_sse completed in 122.363ms
opencl completed in 187.315ms
Best aligned arch: a_avx
Best unaligned arch: u_avx
Writing "/home/stefan/.volk/volk_config"...

On 12/26/2015 09:53 PM, Tom Rondeau wrote:
> On Sun, Dec 20, 2015 at 5:51 PM, West, Nathan <address@hidden>
> wrote:
> 
>> Hi Stefan,
>>
>> First of all I'm really happy to see this done. Using OpenCL in VOLK has
>> come up once in a while, and the general consensus was that the transport
>> cost and granularity of work in VOLK would not make it worth doing, but we
>> never knew for sure.
>>
>> Another wrinkle is where the tradeoff between GPU and CPU work lies for
>> each processor/GPU pair, which is impossible to know without some kind of
>> benchmark/wisdom generation. VOLK doesn't have any mechanism for doing
>> that and recording it. It's interesting work if that's what you're
>> looking to do.
>>
>> From a VOLK perspective, if we can build up that wisdom ability and add
>> it to a dispatcher, it's probably going to be useful, especially for
>> people who are developing some OpenCL code in a workflow but don't know
>> for sure where code should run. I think the best way to develop this
>> might be in a VOLK OOT, unless you're fine working off a long-lived
>> branch looking at this stuff.
>>
>> I'm happy to continue discussing this, especially on the list
>>
>> Nathan
>>
> 
> 
> Good points, Nathan. This seems like an interesting direction for VOLK, at
> least under these circumstances. A wisdom concept might work in general for
> different sizes of vectors. This could be an add-on to the volk_profile
> utility to do full benchmarking.
> 
> But I definitely don't want to drop Stefan's work here. Let's figure out
> the best way to make it available so we don't lose track of it.
> 
> Tom
> 
> 
> 
> 
>> On Thu, Dec 17, 2015 at 8:53 PM, Douglas Geiger <
>> address@hidden> wrote:
>>
>>> Stefan,
>>>  First off I definitely want to encourage investigations of this sort: so
>>> even though I have some thoughts similar to Sylvain's and Tom's about whether
>>> VOLK is the right place to do this, I definitely want to encourage *trying*
>>> this, since you never know - we could be entirely wrong about whether or
>>> not this will work. The only way to know for sure is to try it.
>>>
>>>  That said: I do think there are ways *within* VOLK to deal with the issue
>>> of the input size (i.e. vector size) having a large impact on performance -
>>> namely the custom dispatcher. This is a concept that exists in VOLK, but
>>> has largely gone unnoticed because, by and large, the default dispatcher
>>> does a good (or at least good-enough) job of selecting the proper
>>> proto-kernel. For off-loading concepts such as utilizing GPUs via OpenCL,
>>> a custom dispatcher *could* select the appropriate proto-kernel (including
>>> directing the OpenCL implementation to select a CPU- vs. GPU-based
>>> implementation, if multiple OpenCL implementations are available) on a
>>> per-work() call from the GNU Radio scheduler. In other words, instead of
>>> relying on volk_profile to select the best proto-kernel for all calls to
>>> that particular VOLK kernel, the dispatcher could have something more akin
>>> to the FFTW 'wisdom', where for different sizes of matrices/vectors
>>> different proto-kernels are called (e.g., the CPU SIMDized call instead
>>> of the OpenCL call for smaller input sizes).
>>>
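
A minimal sketch of such a size-aware dispatcher (purely illustrative;
the table contents would come from wisdom-style profiling, and the
kernel names are made up):

typedef void (*impl_fn)(float*, const float*, const float*,
                        unsigned int);

/* One row per proto-kernel: the implementation and the smallest
   input size at which it wins (taken from wisdom profiling). */
struct impl_entry {
    impl_fn      fn;
    unsigned int min_points;
};

/* Table ordered fastest-for-large-inputs first, e.g.
   { opencl, 200000 }, { a_avx, 32 }, { generic, 0 }. */
static impl_fn select_impl(const struct impl_entry* table,
                           unsigned int n_impls,
                           unsigned int num_points)
{
    for (unsigned int i = 0; i < n_impls; i++)
        if (num_points >= table[i].min_points)
            return table[i].fn;
    return table[n_impls - 1].fn; /* generic has min_points == 0 */
}
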
>>>  Anyways - I definitely think this is something that should be looked
>>> into more, and if you are interested in pursuing this, either as a GSoC
>>> project or otherwise, I would definitely encourage it, as well as offer
>>> assistance/advice where I can.
>>>
>>>  Doug
>>>
>>>
>>> On Thu, Dec 17, 2015 at 7:58 PM, Stefan Wunsch <
>>> address@hidden> wrote:
>>>
>>>>
>>>>
>>>> On 12/18/2015 12:30 AM, Tom Rondeau wrote:
>>>>> On Thu, Dec 17, 2015 at 1:14 PM, Sylvain Munaut <address@hidden>
>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>> RUN_VOLK_TESTS:
>>>> volk_32f_x2_matrix_nxn_multiply_puppet_32f(1000000,10)
>>>>>>> generic completed in 28482ms
>>>>>>> a_opencl completed in 13364.3ms
>>>>>>
>>>>>> The question is: how does that number change for smaller problem
>>>>>> sizes? And what would be the average problem size encountered in a
>>>>>> real environment?
>>>>>>
>>>>>> For SIMD optimizations the result of "who's the fastest" doesn't vary
>>>>>> too much with problem size, because they don't have much setup /
>>>>>> teardown cost.
>>>>>> For OpenCL I very much doubt that would be the case, and you may end
>>>>>> up with an app making a lot of "smallish" calls (given the default
>>>>>> buffer size of GR, I suspect the calls to VOLK aren't processing
>>>>>> millions of samples at a time in a single call).
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>     Sylvain
>>>>>>
>>>>>
>>>>>
>>>>> Stefan,
>>>>>
>>>>> This is a great start. But Sylvain makes good points about the data
>>>>> transfer issue. That's definitely a problem we have to think about.
>>>> It's
>>>>> why we have avoided pursuing GPU support in VOLK in the past. Now, if
>>>>> heterogeneous processor technologies change, so might this problem.
>>>>>
>>>>> On the other hand, Doug Geiger has made progress on building OpenCL
>>>> support
>>>>> into the buffer structure of the scheduler. What you've done here might
>>>>> work better as a block designed around this concept.
>>>>>
>>>>> Tom
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> I just wondered why it has not been done yet, but I see the problems now
>>>> (Sylvain made the point).
>>>> If proper device selection and initialization is integrated into VOLK,
>>>> the same machinery could probably be reused by the scheduler (e.g., with
>>>> a generic fallback). But then again, I don't think I know enough about
>>>> all of this ;)
>>>>
>>>> Greetings
>>>> Stefan
>>>>
>>>
>>>
>>>
>>> --
>>> Doug Geiger
>>> address@hidden
>>>
>>>
>>
>>
>>
> 
> 
> 
> 


