
Re: Using OpenMP in Octave


From: David Bateman
Subject: Re: Using OpenMP in Octave
Date: Mon, 29 Mar 2010 21:39:13 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)

Jaroslav Hajek wrote:
> Unfortunately, it confirms what I anticipated: the elementary
> operations scale poorly. Memory bandwidth is probably the real limit
> here. The mappers involve more work per cycle and hence scale much
> better.
I was hoping that the multi-level cache architecture of modern processors, with an L1 cache dedicated to each core, would make even the elementary operations faster. However, as the times are identical across all of the elementary-operation cases, it seems, as you say, that copying to and from memory takes more time than the floating-point operations themselves.
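
(Just to make the point concrete, here is a minimal stand-alone sketch of the sort of loop we are discussing. This is plain C++ with OpenMP, not the actual liboctave code; elementwise_add is a made-up name. Compile with -fopenmp.)

#include <vector>

// Elementary binary operation: each iteration does one flop but moves
// 24 bytes through the memory hierarchy, so the loop saturates the
// memory bus long before it saturates the cores -- adding threads
// barely helps.
void elementwise_add (const double *a, const double *b, double *c,
                      long n)
{
#pragma omp parallel for
  for (long i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}

int main ()
{
  long n = 50000000;
  std::vector<double> a (n, 1.0), b (n, 2.0), c (n);
  elementwise_add (&a[0], &b[0], &c[0], n);
  return 0;
}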


> This is why I think we should not hurry with multithreading the
> elementary operations, and reductions like sum(). I know Matlab does
> it, but I think it's just fancy stuff, to convince customers that new
> versions add significant value.
> Elementary operations are seldom a bottleneck; add Amdahl's law to
> their poor scaling and the result is going to be very little music for
> lots of money.
Ok, it seems that these aren't profitable.
> When I read about Matlab getting parallelized stuff like sum(), I was
> a little surprised. 50 million numbers get summed in 0.07 seconds on
> my computer; generating them in some non-trivial way typically takes
> at least 50 times that long, often much more. In that case,
> multithreaded sum is absolutely marginal, even if it scaled perfectly.

> One area where multithreading really helps is the complicated mappers,
> as shown by the second part of the benchmark.
Though I imagine airy scales better than the sine function does, as it involves more work per element.
> Still, I think we should carefully consider how to best provide parallelism.
> For instance, I would be happy with explicit parallelism, something
> like pararrayfun from the OctaveForge package, so that I could write:
>
> pararrayfun (3, @erf, x, "ChunksPerProc", 100); # parallelize on 3 threads, splitting the array into 300 chunks

> Note that if I was about to parallelize a larger section of code that
> uses erf, I could do
>
> erf = @(x) pararrayfun (3, @erf, x, "ChunksPerProc", 100); # use parallel erf for the rest of the code
Yes, I agree that this could be accelerated with OpenMP rather than with fork/pipe, as the control over the threads, and over whether they run on different cores, is more explicit.
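
Something along these lines is what I have in mind. A sketch only: map_erf, the chunk arithmetic and the thread count are invented here to mirror the pararrayfun call above, and this is not actual Octave code.

#include <cmath>
#include <omp.h>

// Apply erf elementwise. erf costs far more per element than a
// load/store, so this scales much better than the elementary ops.
void map_erf (const double *x, double *y, long n)
{
  long chunk = n / 300 > 0 ? n / 300 : 1;  // roughly 300 chunks,
                                           // like "ChunksPerProc"
#pragma omp parallel for schedule (dynamic, chunk)
  for (long i = 0; i < n; i++)
    y[i] = std::erf (x[i]);
}

int main ()
{
  omp_set_num_threads (3);  // the "3" in pararrayfun (3, @erf, ...)
  double x[1000], y[1000];
  for (long i = 0; i < 1000; i++)
    x[i] = i / 1000.0;
  map_erf (x, y, 1000);
  return 0;
}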

> If we really insisted that the builtin functions must support
> parallelism, I say it must fulfill at least the following:
>
> 1. an easy way of temporarily disabling it must exist (for high-level
> parallel constructs like parcellfun, it should be done automatically)
> 2. the tuning constants should be customizable.
Why make it tunable if we've done sufficient testing that the defaults result in faster code in every case, or at least in the majority of cases, with only minor slowdowns elsewhere?

> for instance, I can imagine something like
>
> mt_size_limit ("sin", 1000); # parallelize sin for arrays with > 1000 elements
> mt_size_limit ("erfinv", 500); # parallelize erfinv for arrays with > 500 elements
But this means we maintain a map of every parallelized mapper function and the number of elements above which we apply a multi-threaded approach, and that comes with its own overhead. Given that some functions take much longer per element than others, the optimal point at which to switch from a serial to a parallel implementation will probably differ widely between them, so if we don't maintain a table of some sort we will certainly forgo some potential speed-ups. The functions arrayfun and cellfun will be particularly nasty in this respect, as the user can pass anything to them and Octave has no way of knowing a priori the optimal serial-to-parallel switching point. For those I think I'd prefer an additional option to arrayfun and cellfun so that the user can define this value directly.
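
To be fair, the switching point itself is cheap to implement; it is the bookkeeping of per-function values that bothers me. A sketch, assuming a hypothetical map_sin wrapper in the core: the OpenMP "if" clause falls back to a serial loop below the threshold, so each mapper's tuning reduces to a single number.

#include <cmath>

// Below the threshold the if clause disables the parallel region and
// the loop runs serially, avoiding the thread start-up overhead.
void map_sin (const double *x, double *y, long n, long threshold)
{
#pragma omp parallel for if (n >= threshold)
  for (long i = 0; i < n; i++)
    y[i] = std::sin (x[i]);
}

int main ()
{
  double x[2000], y[2000];
  for (long i = 0; i < 2000; i++)
    x[i] = i * 0.001;
  map_sin (x, y, 2000, 1000);  // above the limit, so runs in parallel
  return 0;
}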

> We have no chance to determine the best constant for all machines, so
> I think users should be allowed to find out their own.
The bus speeds aren't that different between most processors, so generic values will probably be fine. If the optimal change-over point from one algorithm to another for a mapper function moves from 800 to 1000 elements, do we really care?

David

