[Swarm-Support] Re: Performance issues


From: Marcus G. Daniels
Subject: [Swarm-Support] Re: Performance issues
Date: Sun, 02 Feb 2003 11:24:49 -0700
User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.3b) Gecko/20030117

Bill Northcott wrote:

> First it would be very surprising if Swarm ran as well on a PowerPC machine as it does on a P4. As far as I can see, all the recent development has been done on Intel hardware.

Mostly it was done on Sun hardware; that's what SFI and the SDG had available. I would not say that shaped the programming much, really. As far as profiling goes, most of that happened on Suns, and then, in some isolated and logistically nasty cases, on Red Hat on Intel. Bleeding-edge compiler and toolchain things tend to work better on Linux-based systems; at first I could only profile fully native (GCJ-based) Java/Swarm models on Red Hat.

> So it is sort of inevitable that the code contains many optimisations for Intel architectures, even if most of them are unconscious.

Unfortunately, the profiling of Swarm (again, mostly on Suns) has consisted of gprof run after gprof run, plus memory profiles: reduce empirically observed bottlenecks, reduce memory usage. There hasn't really been any attention given to profiling Swarm with cache simulators. That would be a good thing to do, preferably on multiple architectures (or simulated architectures).
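
For concreteness, a minimal sketch of what one of those gprof passes looks like. The hot function is a made-up stand-in for a model's inner loop; the build and report commands are standard GCC/gprof usage:

  /* prof_demo.c -- toy gprof run.
   *
   *   gcc -pg -O2 prof_demo.c -o prof_demo
   *   ./prof_demo               # writes gmon.out
   *   gprof prof_demo gmon.out  # flat profile + call graph
   */
  #include <stdio.h>
  #include <stdlib.h>

  /* stand-in for an empirically observed bottleneck */
  static double step_agents(double *state, long n)
  {
      long i;
      double sum = 0.0;
      for (i = 0; i < n; i++) {
          state[i] = state[i] * 0.99 + 0.01;
          sum += state[i];
      }
      return sum;
  }

  int main(void)
  {
      long t, n = 100000;
      double total = 0.0;
      double *state = calloc(n, sizeof *state);
      for (t = 0; t < 5000; t++)
          total += step_agents(state, n);
      printf("%f\n", total);   /* keep the work observable */
      free(state);
      return 0;
  }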

> Excuse my ignorance, Marcus, but would your benchmark use both CPUs?

Nope, and neither would Swarm, and neither would most applications. Compilers don't magically parallelize code. A different benchmark, say the SQL example, could be multithreaded, though (MySQL itself would be).

Whether a level 3 cache helps reduces to the size of a problem's working set and its memory access pattern. When that working set gets big and the pattern sparse and disordered, memory latency (or even swap) will come to dominate runtime. With lots of agents running around in a simulation, and a shared landscape, I think a few megabytes of cache isn't going to help a whole lot. But it really depends: a determined person, given the tools and some knowledge, can always measure and tune (ultimately, for a given architecture).
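
That "it depends" is cheap to check. Here's a minimal sketch (array size and index scheme are arbitrary, picked only to dwarf a few megabytes of cache) that times an ordered walk against a disordered walk over the same data. Once the working set outgrows the cache, the disordered walk should lose badly, purely from memory latency:

  /* cachewalk.c -- ordered vs. disordered traversal of one big array.
   * 64 MB working set: far bigger than any cache we're discussing. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (16L * 1024 * 1024)   /* 16M ints = 64 MB */

  static double walk(const int *idx, const int *data)
  {
      long i;
      double sum = 0.0;
      for (i = 0; i < N; i++)
          sum += data[idx[i]];
      return sum;
  }

  int main(void)
  {
      int *data = malloc(N * sizeof *data);
      int *seq  = malloc(N * sizeof *seq);
      int *rnd  = malloc(N * sizeof *rnd);
      long i;
      clock_t t0;
      double sink = 0.0;

      for (i = 0; i < N; i++) {
          data[i] = (int)i;
          seq[i]  = (int)i;        /* ordered access pattern */
          rnd[i]  = rand() % N;    /* sparse, disordered pattern; any
                                      full-range PRNG will do here */
      }

      t0 = clock();
      sink += walk(seq, data);
      printf("ordered:    %.2fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

      t0 = clock();
      sink += walk(rnd, data);
      printf("disordered: %.2fs (sink %g)\n",
             (double)(clock() - t0) / CLOCKS_PER_SEC, sink);

      free(data); free(seq); free(rnd);
      return 0;
  }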

Whether the level 3 cache helps a multithreaded program will reduce to the memory demands of the two threads. My concern, especially for any moderately complex agent-based model (assuming Swarm could spawn threads, which it can't), would be that having two agents running instead of one agent just increases the chance that the cache will be busted.
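
Since Swarm can't demonstrate this itself, here's a hedged pthreads sketch of the concern: two threads with disjoint buffers, each buffer small enough to be cache-friendly alone but not alongside its sibling. The sizes are invented; the point is only that the threads' working sets add up even when they share nothing:

  /* cachebust.c -- two threads, disjoint working sets (link with -lpthread) */
  #include <pthread.h>
  #include <stdlib.h>

  #define BUF (3 * 1024 * 1024)   /* ~3 MB each: fine alone, 6 MB together */

  static void *churn(void *arg)
  {
      volatile char *buf = arg;
      long pass, i;
      for (pass = 0; pass < 200; pass++)
          for (i = 0; i < BUF; i += 64)   /* one touch per cache line */
              buf[i]++;
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      char *a = calloc(BUF, 1), *b = calloc(BUF, 1);
      pthread_create(&t1, NULL, churn, a);
      pthread_create(&t2, NULL, churn, b);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      free(a); free(b);
      return 0;
  }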

> 2. Vector processors. Apple/Motorola included these because they could produce huge speed-ups in signal processing apps. Apple wanted it for its multimedia users, and Motorola for its big market in embedded telecomms chips. It is extremely effective for certain types of problem. Witness the benchmark for SETI which uses Altivec, and the impressive performance on Photoshop filters and video codecs. I am sure Marcus' benchmark does not use the vector processors, so a substantial part of the CPU silicon is sitting there doing nothing.

If we are talking about how SIMD features can increase performance in an application, then we need to compare apples with apples. It seems to me the comparison has to reduce to: "If I compile program X with standard or easily accessible tools on platforms A and B, which runs faster?" It's not fair to say "I really mean program Y," which is what you are saying if you mean "you should really write that program with the Altivec in mind." If that's the case, then we can just say in response, "you should really write that program with the Pentium 4's SSE2 extensions in mind."
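
To make "program X" concrete: the honest test case is a plain loop like the sketch below, identical source compiled with each platform's standard compiler. Whether the compilers of the day actually map it onto Altivec or SSE2 without hand-tuning is exactly the open question:

  /* saxpy.c -- "program X": one source, compiled as-is on platform A and B. */
  void saxpy(float *y, const float *x, float a, int n)
  {
      int i;
      for (i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];   /* the kind of loop SIMD units eat */
  }
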
> 3. Disk throughput. As far as I can see, the target market for the Xserve was the film industry and their extensive digital processing. These people are dealing in tens of terabytes of data. The published benchmarks for the Xserve show that it compares extremely well with 1U PC servers (the Dell PowerEdge 1650) when serving large files to multiple clients. Indeed, it stands up well against other architectures with much higher price tags.

Apple's materials show that the Dell PowerEdge 1650 and the Xserve have similar I/O performance. Well, that's nice, but the 1650 is the low-end rack server, based on the Pentium III!

> So is any of this relevant to Swarm/ABM? It seems to me that it is.
> There has been plenty of discussion about multithreading/multiple CPUs. Currently the stuff is not there in Swarm, because Intel architecture did not provide the hardware, but the benefits should be fairly obvious. IBM and Apple clearly think multiple CPUs are the way to go, rather than very high clock speeds. I think the standard for Power4 is 4 CPUs per header. It looks like Apple will start using this architecture with 64-bit chips (PPC970) from IBM later this year. It could provide very good price/performance after the novelty premium has worn off.

When vendors have compiler technology or killer apps that auto-parallelize reliably, I'll buy this argument. We already had a discussion here about some of the practical problems of multithreaded Swarm models. Until there are economical 4- or 8-processor systems, I just don't see the benefit of complex, delicate code for multithreading Swarm models. And I am highly skeptical that software-based distributed shared memory systems can provide fast enough memory access for clusters to work well on agent-based models. In my experience there is a massive amount of parameter tweaking and iteration involved in understanding agent-based models, and that iteration is easy to parallelize: just run multiple simulations at once, with different parameters, on different CPUs.
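
A minimal sketch of what I mean, assuming a hypothetical batch-mode model binary called ./model that takes one parameter value as its argument: fork one process per parameter setting and let the OS spread them across CPUs. No threads, no shared memory, no delicate code:

  /* sweep.c -- run independent simulations in parallel, one per parameter.
   * ./model is a hypothetical stand-in for a Swarm model's batch binary. */
  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
      const char *params[] = { "0.1", "0.2", "0.5", "1.0" };
      int i, n = sizeof params / sizeof params[0];

      for (i = 0; i < n; i++) {
          pid_t pid = fork();
          if (pid == 0) {                      /* child: one simulation */
              execl("./model", "model", params[i], (char *)NULL);
              perror("execl");                 /* only reached on failure */
              _exit(1);
          }
      }
      while (wait(NULL) > 0)                   /* parent: collect all runs */
          ;
      return 0;
  }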

> Vector processors. It seems to me this would be the easier optimisation to put into Swarm. My misguided thoughts would be to look at random number generators, and agents using regression/neural net/signal processing decision methods, as prime candidates for Altivec speed-up. I think the necessary code is already incorporated in the current GNU compiler sources. I am sure Marcus can comment on this much better than me.

There might be some opportunities for Altivec or SSE2 usage, but I think they'd mostly be for add-in libraries. Like you say, neural nets, perhaps some GA fitness evaluations, etc. Whether it would justify the work, I don't know.
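
For a sense of what such an add-in would look like, here's a sketch of a neural-net-style inner loop (a dot product) written with SSE2 intrinsics. The intrinsics are real (emmintrin.h, gcc -msse2), but the function itself is just an illustration of the hand-work involved; an Altivec version would be a separate hand-written port (vec_madd and friends):

  #include <emmintrin.h>   /* SSE2 intrinsics; compile with gcc -msse2 */

  /* dot product over doubles, two lanes at a time; n assumed even */
  double dot_sse2(const double *a, const double *b, int n)
  {
      __m128d acc = _mm_setzero_pd();
      double lanes[2];
      int i;
      for (i = 0; i < n; i += 2)
          acc = _mm_add_pd(acc, _mm_mul_pd(_mm_loadu_pd(a + i),
                                           _mm_loadu_pd(b + i)));
      _mm_storeu_pd(lanes, acc);   /* fold the two lanes back to a scalar */
      return lanes[0] + lanes[1];
  }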

> *I rather discount the comparison with multi-CPU, Xeon-based server/workstation architectures (Dell Precision 650s, etc.). These cannot be described as cheap PCs.

Intel has been cutting prices on Xeon chips lately. The 2.8 GHz Xeons, at $485, are around $100 cheaper than the 3 GHz Hyperthreaded Pentium 4s and $100 more than the 2.8 GHz Pentium 4s. Dual-chip Xeon motherboards start at about $300. One issue for rack systems like the Xserve or PowerEdge 1650 is heat and current usage. In terms of price/performance for a fast CPU-and-I/O server/workstation, the Xeon is not bad.



