vectorized code on biprocessors -- memory contention?
From: Javier Fernandez Baldomero
Subject: vectorized code on biprocessors -- memory contention?
Date: Tue, 01 Jun 2004 00:00:42 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
Hi all,
Perhaps somebody could help me with this; I'm stuck
on a vectorization problem. I'd ask Peter, if only
he had solved his own problems :-) (sorry for the joke,
Peter, I guess we'll all be asking for your help in a
couple of months :-). I'm running octave 2.1.50 on a Linux
biprocessor (dual-CPU machine), with SMP kernels 2.6.6 and 2.4.20-8.
I have been able to reduce the problem to the attached tst.m file,
which only requires octave on a biprocessor (no MPI, no PVM, just octave).
I hope that makes it easier to reproduce my measurements, and that
somebody knows, guesses, or can give a hint on what I could do next.
The m-file is a trivially parallelizable computation typically
used to test parallel languages (it appears in the Parallel-
Processing-HOWTO, for instance). The problem is not the
result (which is OK), but the time it takes under octave
on a biprocessor.
The computation is a simple integration over 2E5 subdivisions
of the interval [0,1]. Doing it in a loop takes 3.09s (YMMV).
Doing half the subdivisions in each of two octaves started on
the biprocessor takes half the time. Perfect: no problems
with Linux scheduling, cache conflicts, memory bandwidth, etc.
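For readers who want to check the arithmetic without octave at hand, here is
the same midpoint-rule integration (pi = integral over [0,1] of 4/(1+x^2))
sketched in plain Python; the rank/C decomposition mirrors the attached
tst.m, and the function name is mine, not part of the attachment:

```python
import math

def pi_midpoint(N, rank=1, C=1):
    """Midpoint-rule partial estimate of integral_0^1 4/(1+x^2) dx = pi.
    A worker with this rank (1..C) sums only every C-th subdivision,
    exactly like i=[rank-1:C:N-1] in tst.m."""
    width = 1.0 / N
    lsum = 0.0
    for i in range(rank - 1, N, C):
        x = (i + 0.5) * width
        lsum += 4.0 / (1.0 + x * x)
    return lsum / N          # each rank returns its share of the estimate

# the partial results of C workers add up to the serial answer
serial = pi_midpoint(2 * 10**5)
split = pi_midpoint(2 * 10**5, rank=1, C=2) + pi_midpoint(2 * 10**5, rank=2, C=2)
print(serial, split)         # both close to math.pi
```

This is only a correctness cross-check of the decomposition; the email's
timing question is about the vectorized octave version, not this loop.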
Using the vectorized version of the loop, and raising N to 8E6
(to keep around 3.2s elapsed time), computing half the subdivisions
in each of two octaves does reduce the time, but not to half:
I only get a 3.2/1.96 = 1.63 ratio instead of the expected 2.0.
When I do the same on a cluster of uniprocessors, I get the 2.0 ratio
(and 3, 4, ... when using 3, 4, ... computers; well, almost).
The m-file lists a series of attempts to discover the reason
behind that 1.63 ratio. I've tried declaring the vector
variables as globals (in case allocating them both in parallel
is a problem compared to allocating double the size from just one CPU).
I've also tried initializing to 1 or 0, and row vs. column vectors,
with no effect. I then measured the time spent on each operation,
and each of the vectorized ops contributes proportionally
(more or less) to the 1.63 ratio. Please notice that when
just one CPU is working, half the subdivisions take half the time.
The problem only shows up when both CPUs compute half the workload
at the same time, and it only happens on biprocessors. On a normal
cluster, with separate memory for each CPU, it's all right.
Finally, I made the simpler mem.m, which just times the
three kinds of vector operations involved: simple allocation,
counting, and a more complex computation. Start two octaves
on a biprocessor, and try in both octaves:
octave:1> mem(8E6)
time = 2.872207
t1 = 0.244704 -- 261.540 MB/s allocating
t2 = 1.002991 -- 63.809 MB/s counting
t3 = 1.624512 -- 39.396 MB/s computing
Your mileage will vary.
Then try half the size in each octave. It scales well,
as long as the other CPU is quiet:
octave:2> mem(4E6)
time = 1.416145
t1 = 0.113682 -- 281.487 MB/s allocating
t2 = 0.493910 -- 64.789 MB/s counting
t3 = 0.808553 -- 39.577 MB/s computing
Roughly half the time; your mileage should not vary.
Then try the trick of starting both at once. Type in each octave:
octave:3> while !exist('mem.lock'), end; mem(4E6)
and in another window create the lock file:
touch mem.lock
Then you get the 1.63 ratio:
octave:3> while !exist('mem.lock'), end; mem(4E6)
time = 1.750034
t1 = 0.165004 -- 193.935 MB/s allocating
t2 = 0.579697 -- 55.201 MB/s counting
t3 = 1.005333 -- 31.830 MB/s computing
octave:4> 2.872207/1.750034
ans = 1.6412
octave:5>
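The spin-until-the-lock-file-appears trick above is just a crude barrier
so that both runs start simultaneously. For readers who want to see the
idea in isolation, here is a minimal Python sketch of the same mechanism
(hypothetical, not part of the original post; here two threads stand in
for the two octave processes):

```python
import os
import tempfile
import threading
import time

def worker(lockfile, started):
    # spin until the lock file appears, like
    # "while !exist('mem.lock'), end" at the octave prompt
    while not os.path.exists(lockfile):
        time.sleep(0.001)
    started.append(time.time())   # record the release time

lockfile = os.path.join(tempfile.mkdtemp(), "mem.lock")
started = []
threads = [threading.Thread(target=worker, args=(lockfile, started))
           for _ in range(2)]
for t in threads:
    t.start()
time.sleep(0.05)               # both workers are now spinning
open(lockfile, "w").close()    # the "touch mem.lock" step
for t in threads:
    t.join()
os.remove(lockfile)
print(max(started) - min(started))   # both released nearly together
```

A busy-wait like this burns CPU while waiting, which is harmless here
because the wait ends the moment the measurement starts.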
How could I avoid the 39 -> 31 MB/s drop,
and possibly the 64 -> 55 one as well? The 281 -> 193
drop is not too important, since it only accounts for
0.11 vs. 0.16s of the final accumulated time.
Is it due to memory contention? Then how can
it reach 193 MB/s when simply allocating (in
parallel as well)? Is it due to many internal
copies of the arrays made by octave? Can the
vectorized instructions be written so that
the ratio is better?
Thanks in advance for any idea. I'm stuck.
-javier
function time=tst(mode,rank,C,N)
% compute pi by integrating arctan' in [0,1] * 4
%
% when parallelized, each CPU would have a different rank 1..C
% and would accumulate area of rectangles C subdivisions apart
% The problem is not with the result, but with the time spent
% For what we are concerned now, keep rank==C==1.
%
% fix N=2E5 to get 3s CPU time at 2GHz in loop mode ('l')
% mono=tst('l',1,1,2E5)
% fix N=8E6 to get 3s CPU time at 2GHz with "matrix" language ('m')
% mono=tst('m',1,1,8E6)
%
% Loop mode scales well: Start 2 octaves in a biprocessor
% Type the following command on each octave
% while !exist('tst.lock'), end; bi=tst('l',1,1,1E5)
% Launch both at once typing "touch tst.lock" in other window
% Each octave spends half the expected time (around 1.5s)
% My exact figures: mono/bi = 3.0979 / 1.5455 = 2.0044
%
% Matrix mode does not scale well: On those same octaves, type
% while !exist('tst.lock'), end; bi=tst('m',1,1,4E6)
% Then "touch tst.lock" in other window
% Each octave spends more than half, typically speedup=1.6
% My exact figures: mono/bi = 3.1962 / 1.9561 = 1.6340
%
% tried: global i x, just in case it's all about allocation
% -> does not affect
% initialization to ones / zeros (ruining pi computation)
% -> yet worse, 2.4310 / 1.5368 = 1.5819 instead of 1.6
% -> raising N=1.06E7 so that mono=3.2 again gives 1.63 again
% row / column vector
% -> no change
% breaking down time spent into intervals
% -> time is proportionally spent on t1/t2/t3 segments
% compacting vectorized sentences
% -> no change
% directly measuring bandwidth
% -> small figures, memory access can't be the bottleneck
%
% help!!! :-)
%
tic
switch mode
case 'm'
% global i x
width=1/N; lsum=0;
i=[rank-1:C:N-1];
t1=toc;                 % needed by the t1/t2/t3 report after the switch
% i= ones(1,fix( (N-rank+1)/C ));
% i= ones( fix( (N-rank+1)/C ), 1 );
% i=zeros(1,fix( (N-rank+1)/C ));
% i=zeros( fix( (N-rank+1)/C ), 1 );
% x=([rank-1:C:N-1]+0.5)*width; % compacting i= and x=
x=(i+0.5)*width;
t2=toc;                 % end of midpoint computation
lsum=sum(4./(1+x.^2));
t3=toc;                 % end of the vectorized sum
case 'l'
width=1/N; lsum=0;
for i=rank-1:C:N-1
x=(i+0.5)*width;
lsum=lsum+4/(1+x^2);
end
end
pi = lsum/N             % display the estimate (shadows the builtin pi locally)
time=toc
if mode=='m'
t3-=t2; t2-=t1;
t1,t2,t3,cum=sum([t1,t2,t3])
end
function mem(sz)
% time the three kinds of vector operations involved in tst.m
tic
a=ones(1,sz);           % simple allocation
t1=toc;
C=1;
b=[1:C:sz];             % counting (range creation)
t2=toc;
c=sum(4./(1+b.^2));     % more complex computation
t3=toc;
printf('time = %f\n',t3);
SZ=8*sz; t3-=t2; t2-=t1;   % SZ = bytes touched (8-byte doubles); make times per-step
printf('t1 = %f -- %7.3f MB/s allocating\n',t1,SZ/t1/1E6);
printf('t2 = %f -- %7.3f MB/s counting \n',t2,SZ/t2/1E6);
printf('t3 = %f -- %7.3f MB/s computing \n',t3,SZ/t3/1E6);
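The MB/s figures follow from touching sz doubles of 8 bytes each in t
seconds; since mem.m subtracts before printing, the quoted t1/t2/t3 are
already per-step times. Re-doing the arithmetic for the mem(8E6)
transcript quoted earlier:

```python
def mbps(sz, t):
    """Bandwidth implied by touching sz 8-byte doubles in t seconds,
    the same formula as SZ/t/1E6 in mem.m."""
    return 8 * sz / t / 1e6

# per-step times from the mem(8E6) run quoted in the email
print(mbps(8e6, 0.244704))   # allocating, ~261.5 MB/s
print(mbps(8e6, 1.002991))   # counting,   ~63.8 MB/s
print(mbps(8e6, 1.624512))   # computing,  ~39.4 MB/s
```

This reproduces the printed figures exactly, so the bandwidth numbers in
the transcripts are internally consistent with the timings.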