Re: [Chicken-users] Optimizing inner loops
From: Will M Farr
Subject: Re: [Chicken-users] Optimizing inner loops
Date: Tue, 29 Aug 2006 16:07:56 -0400
Hello Carlos (and everyone),
Thanks for the reference to the inline egg---I'd seen that, but had
forgotten about it. Below is a little bit of code you can use to
test the speed of the various methods I discussed in my last email
for different sizes of vectors.
Each of the attached files can be compiled to an executable which is
invoked as
./test_{c,scm,fast_scm} <n> <n-iter>
which adds two vectors of double-precision floating-point numbers of
length <n> and stores the result in a third vector, <n-iter> times.
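For concreteness, the pure-Scheme kernel (the test_scm version) presumably boils down to something like the following. This is only a sketch of my guess at the shape of the code, since the attached test.scm isn't reproduced inline; the f64vector operations come from SRFI-4, and the name vector-add! is mine:

```scheme
;; Hypothetical sketch of the pure-Scheme inner loop (the real test.scm
;; is attached, not shown here).  Uses SRFI-4 homogeneous f64vectors.
(use srfi-4)

(define (vector-add! result a b n)
  ;; result[i] <- a[i] + b[i] for i in [0, n)
  (do ((i 0 (+ i 1)))
      ((= i n))
    (f64vector-set! result i
                    (+ (f64vector-ref a i)
                       (f64vector-ref b i)))))
```

The benchmark driver would then call this n-iter times from an outer loop.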
On my machine (PowerBook G4 800 MHz, Mac OS 10.4.7), user timings with
n = 1000 and n-iter = 1000000 (conveniently 1 GFLOP) are as follows:
csr-dyn-69:/tmp farr$ time ./test_c 1000 1000000
real 0m7.889s
user 0m6.407s
sys 0m0.057s
csr-dyn-69:/tmp farr$ time ./test_fast_scm 1000 1000000
real 0m7.879s
user 0m6.296s
sys 0m0.094s
csr-dyn-69:/tmp farr$ time ./test_scm 1000 1000000
^C
real 30m56.616s
user 19m30.921s
sys 1m14.798s
As you can see, I terminated the test_scm run prematurely because it
was taking *forever*. I guess the moral of this story is that the
pure-Scheme implementation isn't going to come close to the speed of
the C version (it was over two orders of magnitude slower when I
killed it).
A few other observations about this test:
1. Putting inline C code in the inner loops of your Scheme program is
just as fast as pure C for 1000-element vectors. This performance
parity will degrade as the vectors get shorter (because less work is
done in C before looping back into Scheme). In general, the more you
do inside the foreign-lambda* C code, the closer you will come to
pure-C speed, because the looping Scheme calls become a smaller and
smaller fraction of the run time. On my machine, test_fast_scm is
about 15 times slower than test_c when n = 3.
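The foreign-lambda* approach in point 1 might look roughly like this (again a hedged sketch, not the actual test-fast.scm attachment; the name vector-add! and the exact type signature are my guesses):

```scheme
;; Hypothetical sketch of the inline-C version: the whole inner loop
;; runs in C, so Scheme overhead is paid once per call, not per element.
(use srfi-4)

(define vector-add!
  (foreign-lambda* void ((f64vector result)
                         (f64vector a)
                         (f64vector b)
                         (int n))
    "int i;
     for (i = 0; i < n; i++) result[i] = a[i] + b[i];"))
```

The f64vector foreign type passes the vector's storage to C as a double*, which is what lets the C loop touch the elements directly.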
2. Cache effects will become important at some point (which is above
n = 1000 on my machine). That is, if you run ./test_{...} 1000000 1000,
which is also 1 GFLOP of operations, the relative timings change a
lot. With such a long vector to add, and so little work being done on
each vector element, the test becomes a measure of how fast your
computer can move vector elements from main memory into the cache;
the speed of the add instruction becomes much less relevant, so you
can probably better afford the extra work done in the pure-Scheme
code. In the paper
http://repository.readscheme.org/ftp/papers/sw2000/lucier.pdf , the
authors claim that their Gambit-C code runs just as fast as C. They
are correct, but (based on some other tests I've done) this only
holds when you're bus-speed limited: Gambit is, in fact, a bit slower
(maybe a factor of 2 or 3) than C in its arithmetic. That factor
clearly doesn't matter when you're stalled on a cache miss (though I
bet even extreme cache missing wouldn't disguise the two orders of
magnitude in the native-Scheme Chicken code).
Feel free to run this code yourself with relevant numbers for <n> and
<n-iter> for your project. That should at least give you an idea of
the maximum performance.
Will
P.S.---My compile options were:
gcc: -O3
csc: -block -optimize-level 3 -debug-level 0 -lambda-lift -disable-interrupts
On my system, csc invokes gcc with (among other path-setting options)
-Os.
P.P.S.---I'm surprised at how bad the pure-Scheme code is for this.
I'd love to hear from some of the other Chicken experts out there if
I've made a mistake somewhere....
Attachments:
test-fast.scm
test.c
test.scm