help-gsl
[Top][All Lists]

## [Help-gsl] Re: Will I get speedup by providing Jacobian to the GSL ODE45

 From: Robert G. Brown Subject: [Help-gsl] Re: Will I get speedup by providing Jacobian to the GSL ODE45 solver? Date: Thu, 7 Jun 2007 10:21:05 -0400 (EDT)

On Wed, 6 Jun 2007, Michael wrote:


Will I get speedup by providing Jacobian to the GSL ODE45 solver?

Hi all,

I am a newbie to GSL ODE solvers. I have a critical need for speed. I
am hunting for ways to speed up. My questions are:

1. If after a comparison of GSL ODE solvers, I conclude that ODE45 is
sufficient for my need in terms of accuracy. (And in fact I turned to
GSL ODE solver from Matlab, and I have already done a comparison
analysis using all the ODE solvers in Matlab. ) And I remember I don't
provide Jacobian to ODE45 solver. And now will I get speed-up at the
fixed accuracy level by providing the Jacobian to the GSL ODE45
solver?

2. Is it possible that after providing the Jacobian, other solvers use
the Jacobian more efficiently and start to outform ODE45 in terms of
speed or accuracy?

I am still experimenting these things, but I would like to hear your



I'll reply in just one place.  There are lots of things to answer from
your previous messages, so forgive me/remind me if I forget one.  I'll
have to make this (uncharacteristically, for me:-) brief as I have some
other work to do today, so bear with me.

ODE45 in matlab is probably rkf45 in GSL and it is an excellent general
purpose ODE solver, I agree.  But, as I already noted in a previous
message, some of the alternatives sometimes turn out to be faster and
just as accurate, so I'd still recommend that you experiment with this
if speed is THE defining goal here.  If it is, the it is worth the time
required to prototype and run the code for a day or two to tune it up,
and it is trivial to swap ODE solvers with the GSL, really.  At most you
have to write the Jacobian code.

Don't be "scared" of doing this because you don't know what a Jacobian
is (not unlikely, actually:-).  GIYF:

http://en.wikipedia.org/wiki/Jacobian

It is just the matrix of partial derivatives that show how all of the
variables in the vector being solved depend on (in the case of the ODE
solver) each other (and in the more general case, on some underlying set
of "spatial" coordinates).  The van der pol oscillator example should
make this clear.  However, if you are solving (say) single variable ODEs
and make your vector "system" out of N independent 1-D ODES, each with
the same derivative, then the Jacobian matrix is diagonal and trivial --
basically the unit matrix.

OK, now, to speed things up.  Speed costs money, so you have to decide
how "valuable" the results are to you (what you want to spend) and then
optimize within that expenditure limit.  You can also spend "nothing"
and optimize on whatever hardware you have, but you can't do this AND
insist that you be able to solve umpty-zillion ODE instances in finite
time.  Reality is real, and you have to live within the physical
limitations of your hardware or spend more on better/more hardware.

There are three distinct dimensions you can optimize:

* Number of CPUs.  You are fortunate in that your problem decomposes
into embarrassingly parallel subproblems trivially.  This means you can
basically do your work twice as fast by using two systems at the same
time, N times as fast by using N systems (less a hair for overhead).  So
one way to get linear speedup is to buy more systems and make a small
beowulf cluster to do your work.  If you are interested in this
approach, check out:

http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php

to learn about parallel scaling and getting started with beowulfery.
Also look at:

http://www.clustermonkey.net

where there are all sorts of articles on this, including a whole series
by yours truly aimed at newbies.  Note well that this NEED NOT BE VERY
EXPENSIVE!  These days \$3000 will buy you 8 64-bit opteron cores with at
least 512 MB per core in only two boxes, and on an embarrassingly
parallel problem that is a HUGE amount of compute capacity!

* Quality of CPU, memory bus, etc.  Cheap, single core, small cache
32-bit CPU cores on a slow plodgy memory subsystem will not perform as
well as a bleeding edge 64-bit cores on a fast memory subsystem for
problems of this type.  If you're currently working on a 1.6 GHz Celeron
and move your code over to a 3 GHz 64-bit CPU, it will speed up by maybe
four right there.  If you move it to a dual core CPU that's eight.  If
you move it to a dual dual core system, that is as much as 16x speedup
right there per SINGLE system.

There are price point nonlinearities in here you should be aware of.
Your problem is almost certainly CPU bound -- unlikely to depend
strongly on anything but CPU clock and maybe memory bus clock and
architecture when you have enough cores hitting the same memory.
Bleeding edge CPUs are MUCH more expensive than ones only 10% to 20%
you should multiply the number of CPU cores you can afford per
architecture by the system clock of those cores and optimize the
resulting number, remembering tha AMD cores are typically faster at the
same clock than Intel cores but YMMV and it's up to you to prototype
and compare, not me.

* Writing your code to take advantage of vectorization and CPU cache,
etc.

This latter is tricky.  I know that you want me to show you an example
of a vectorized solution, but I'd have to write out a full answer and
test it (just like you will) and I don't have time to, so you'll have to
try it on your own.  I'll describe "how" to proceed, and assume/hope
that you can manage the actual coding.

First of all, it isn't completely clear that you are better off
vectorizing the independent ODEs to be solved vs calling the ODE solver
one at a time.  The trade off involved is the overhead of starting up
the ODE solver (a fixed cost per invocation per solution) vs the speedup
advantage of executing all of the core loop in cache while working on
register memory.  At some point increasing the size of the problem
forces the system out of cache and maybe even out of being relatively
local in memory (unlikely, actually, if your problem decomposes).  This
nonlinearly slows it down to a new scaling regime.  However, if you
really only evaluate the solution for four or five points per run, the
overhead of startup can be a significant fraction of the time of
solution (or not -- something to check out) and you may be better off
vectorizing a number of solutions into a single startup call.

In other words, there is LIKELY to be an optimum that looks like the
following:

a) Put each independent ODE to be solved into a component of a vector
y[].  Make the length of y a variable so you can adjust it at run time.

b) Write code that systematically assigns your parameter space to the
vector of initial conditions or parameters passed to the ODE solver or
in a global parameter vector -- it doesn't matter how, but it has to be
systematic and automated.

c) Write code that evaluates the vector of deriviatives -- if all that
changes is the parameter you can reuse the actual derivative code in a
trivial loop.

d) Write code that makes the jacobian a unit matrix -- no y[] depends
on any other y[] and \partial y_i/\partial y_i = 1.  Use the GSL example
as a template, but your code will be much simpler on the right hand
side of the assignments.

e) Test it for (say) rkf45, why not.  Make sure that it gives you the
same answers you were getting in matlab, that the evolution routine
works flawlessly to each of the desired times, etc.

Now experiment.  Try running it varying N and plot out the average time
required to complete per solution.  Try varying ODE solvers and repeat.
Try varying architecture (if you have more than one kind of system to
use to run it on).

If god is good to you, you will rapidly determine that the code has one
of the three generic limiting forms:

* It will run fastest with N=1, so you are best off running a loop to
run the ODE solver N times (on as many hosts/cores as you can manage to
split up the work) for 1 problem at a time.

* It will run fastest with the largest possible N that will fit the
program comfortably in memory -- maybe N = 2^20, maybe N = 2^28 (if you
have a lot of memory -- GB+ -- to run in).

* It will run fastest somewhere in between.

where the answers WILL vary per architecture tested and maybe per ODE
tested.

I don't even know how I would bet between these three.  I suppose I'd
"expect" an optimum somewhere in the ballpark of N = 32K -- small enough
to keep the code in L1 and data in L2, big enough to spread the startup
overhead out over a "large" number of calls, but modern memory
subsystems could easily perform well enough to make the second one true
or if you carefully write the startup code so it avoids reallocating the
ODE structs from run to run and use register variables throughout might
make the first one true.  Just carefully coding can move the optimum
around -- it is really another degree of freedom.

I personally would advise you not to go the assembler route unless it is
your last resort.  C compilers are actually pretty efficient and you'd
have to work long and hard to get a really significant speedup out of it
compared to what you could do by just careful C coding.  Indeed, I'd try
rewriting the code in C first to maybe leave out some of the GSL-added
overhead of testing for successful completions and the like -- at your
own considerable risk -- and vectorizing the core loops in that way.

But I actually wouldn't try that -- not for a parallelizable problem.
Code portability and readability and reliability matter, and using the
GSL is a good idea for all of these reasons.  Hardware is cheaper than
human time.  Buy some and throw it at the problem, accepting that you
can't squeeze blood from turnips and you can't get 2^25 solutions per
second out of hardware capable of generating only 2^20, but that if you
add 2^5 CPU cores (a mere 32 processors) you can...

HTH,

rgb

--

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305