[Swarm-Support] Re: Performance issues
From: Marcus G. Daniels
Subject: [Swarm-Support] Re: Performance issues
Date: Sun, 02 Feb 2003 11:24:49 -0700
User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.3b) Gecko/20030117
Bill Northcott wrote:
> First it would be very surprising if Swarm ran as well on a PowerPC
> machine as it does on a P4. As far as I can see all the recent
> development has been done on Intel hardware.
Mostly it was done on Sun hardware. That's what SFI and the SDG had
available.
I would not say that it shaped the programming much, really. As far as
profiling goes, most of that happened on Sun, and then in some
isolated, logistically nasty cases on Red Hat on Intel. Bleeding-edge
compiler and toolchain things tend to work better on Linux-based systems.
At first I could only profile fully native (GCJ-based) Java/Swarm models
on Red Hat.
> So it is sort of inevitable
> that the code contains many optimisations for Intel architectures even if
> most of them are unconscious.
Unfortunately, the profiling of Swarm (again, mostly on Suns) has
consisted of gprof run after gprof run and memory profiles: reduce
empirically observed bottlenecks and reduce memory usage. There hasn't
really been any attention given to profiling Swarm with cache
simulators. It would be a good thing to do, preferably on multiple
architectures (or simulated architectures).
> Excuse my ignorance, Marcus, but would your benchmark use both CPUs?
Nope, and neither would Swarm, and neither would most applications.
Compilers don't magically parallelize code. A different benchmark, say
the SQL example, could be multithreaded, though (mysql would be).
Whether a level 3 cache helps reduces to the working set of a
problem and its memory access pattern. When that working set gets
big and the pattern sparse and disordered, then memory latency (or even
swap) will come to dominate runtime. With lots of agents running around
in a simulation, and a shared landscape, I think a few megabytes of
cache isn't going to help a whole lot. But it really depends. A
determined person, given the tools and some knowledge, can always
measure and tune for a given architecture.
Whether the level 3 cache helps a multithreaded program will reduce to
the memory demands of the two threads. My concern, especially for any
moderately complex agent-based model (assuming Swarm could spawn
threads, which it can't), would be that having two agents running
instead of one agent just increases the chance that the cache will be
busted.
> 2. Vector processors. Apple/Motorola included these because they could
> produce huge speed-ups in signal processing apps. Apple wanted it for its
> multimedia users, and Motorola for its big market in embedded telecomms
> chips.
>
> It is extremely effective for certain types of problem. Witness the
> benchmark for SETI which uses Altivec, and the impressive performance on
> Photoshop filters and video codecs.
>
> I am sure Marcus' benchmark does not use the vector processors, so a
> substantial part of the CPU silicon is sitting there doing nothing.
If we are talking about how SIMD features can increase performance in an
application then we need to be able to compare apples with apples. It
seems to me the comparison has to reduce to
"if I compile program X with standard or easily-accessible tools on
platform A and B which runs faster?". It's not fair to say, "I really
mean program Y", which is what you are saying if you mean "you should
really write that program with the Altivec in mind". If that's the
case, then we can just say in response "you should really write that
program with the SSE2 Pentium 4 extensions in mind".
> 3. Disk throughput. As far as I can see, the target market for Xserve was
> the film industry and their extensive digital processing. These people
> are dealing in tens of terabytes of data. The published benchmarks for
> Xserve show that it compares extremely well with 1U PC (Dell PowerEdge
> 1650) servers when serving large files to multiple clients. Indeed they
> stand up well against other architectures with much higher price tags.
Apple's materials show that the Dell PowerEdge 1650 and Xserve have
similar I/O performance. Well, that's nice; the 1650 is the low-end rack
server, based on the Pentium III!
> So is any of this relevant to Swarm/ABM? It seems to me that it is.
> There has been plenty of discussion about multithreading/multiCPUs.
> Currently the stuff is not there in Swarm because Intel architecture did
> not provide the hardware, but the benefits should be fairly obvious. IBM
> and Apple clearly think multiple CPUs are the way to go rather than very
> high clock speeds. I think the standard for Power4 is 4 CPUs per header.
> It looks like Apple will start using this architecture with 64-bit chips
> (PPC970) from IBM later this year. It could provide very good
> price/performance after the novelty premium has worn off.
When vendors have compiler technology or killer-apps that
auto-parallelize reliably, I'll buy this argument. We already had a
discussion here about some of the practical problems of multithreaded
Swarm models. Until there are economical 4 or 8 processor systems, I
just don't see the benefit of complex, delicate code for multithreading
of Swarm models. And I am highly skeptical that software-based
distributed shared memory systems can provide fast enough memory access
to enable clusters to function well on agent-based models.
In my experience there is a massive amount of parameter tweaking and
iteration involved in understanding agent-based models, and this
iteration is easy to parallelize. Just run multiple simulations at once
with different parameters on different CPUs.
> Vector processors. It seems to me this would be the easier optimisation
> to put in Swarm. My misguided thoughts would be to look at random number
> generators and agents using regression/neural net/signal processing
> decision methods as prime candidates for Altivec speed-up. I think the
> necessary code is already incorporated in the current GNU compiler
> sources. I am sure Marcus can comment on this much better than me.
There might be some opportunities for Altivec or SSE2 usage, but I think
they'd mostly be for add-in libraries. Like you say, neural nets,
perhaps some GA fitness evaluations, etc. Whether it would justify the
work, I don't know.
> *I rather discount the comparison with multi-CPU Xeon-based
> server/workstation (Dell Precision 650s etc.) architectures. These cannot
> be described as cheap PCs.
Intel has been cutting prices on Xeon chips lately. The 2.8 GHz Xeons, at
$485, are around $100 cheaper than the 3 GHz Hyper-Threaded Pentium 4s and
$100 more than the 2.8 GHz Pentium 4s. Dual-chip Xeon motherboards start
at about $300. One issue for rack systems like the Xserve or PowerEdge
1650 is heat and current usage. In terms of price/performance for a fast
CPU and I/O server/workstation, the Xeon is not bad.