Re: Discussion: Switching Esprseso to shared memory parallelization
Ulf D Schiller
Wed, 14 Jul 2021 03:28:04 +0000
Hi Rudolf and All,
I second Steffen's comments on weighing different aspects. My group is
currently not using ESPResSo for any large scale applications. Having
followed the development almost since its inception, I hope you
nevertheless allow me to add some considerations and perhaps clear up
some conflation in the arguments.
* When comparing paradigms, one has to be careful not to compare apples
with oranges. You correctly describe shared-memory and
distributed-memory parallelization as different paradigms, but also mix
it with a consideration of overheads/delays. It is certainly true that
MPI-based parallelization involves accessing and copying data between
processes. However, modern MPI implementations can handle intra-node
communication fairly efficiently -- both MPICH and Open MPI can use
shared memory for message transfer, and since MPI-3 there is also
explicit shared-memory programming via shared-memory windows. Even for
inter-node communication,
latency can typically be hidden by overlapping communication and
computation. I am not sure that one would find a substantial performance
difference between a semi-decent MPI-code on a single node and its
OpenMP counterpart. An optimized version of the latter can in principle
be faster, but that would be demonstrated best with benchmark data.
Apart from comparing the paradigms per-se, I would give some
consideration to programming models and tool stacks. MPI is arguably the
de-facto standard for distributed memory parallelization which brings
the advantage of composability and portability (and to a good extent
backwards compatibility). For shared-memory parallelization, the
landscape is more diverse with different toolchains depending on
architecture/vendor, and there is some uncertainty as to how that
landscape will evolve, say in the next decade.
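To illustrate what I mean by overlapping communication and computation, here is a minimal sketch in plain Python. It is not MPI code: a background thread with an artificial delay stands in for a non-blocking halo exchange, and all names (fetch_halo, smooth_overlapped, smooth_serial) are made up for this example. The pattern is what matters: start the exchange, work on the interior that needs no remote data, then wait and finish the boundary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_halo(left_neighbor, right_neighbor, delay=0.01):
    """Pretend to receive one boundary value from each neighbor (hypothetical helper)."""
    time.sleep(delay)  # stands in for communication latency
    return left_neighbor[-1], right_neighbor[0]


def smooth_overlapped(local, left_neighbor, right_neighbor):
    """Three-point average of `local`, overlapping the halo exchange with interior work."""
    n = len(local)
    out = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the "non-blocking" exchange.
        pending = pool.submit(fetch_halo, left_neighbor, right_neighbor)
        # Interior points need no remote data: compute them while the exchange runs.
        for i in range(1, n - 1):
            out[i] = (local[i - 1] + local[i] + local[i + 1]) / 3.0
        # Wait for the exchange to complete.
        halo_left, halo_right = pending.result()
    # Boundary points need the halo values.
    out[0] = (halo_left + local[0] + local[1]) / 3.0
    out[-1] = (local[-2] + local[-1] + halo_right) / 3.0
    return out


def smooth_serial(values):
    """Reference: the same periodic three-point average on the undivided array."""
    n = len(values)
    return [(values[i - 1] + values[i] + values[(i + 1) % n]) / 3.0 for i in range(n)]


# Three periodic subdomains of four sites each (illustrative data).
domains = [[0.0, 1.0, 4.0, 9.0], [16.0, 25.0, 36.0, 49.0], [64.0, 81.0, 100.0, 121.0]]
result = []
for d, local in enumerate(domains):
    left = domains[(d - 1) % len(domains)]
    right = domains[(d + 1) % len(domains)]
    result.extend(smooth_overlapped(local, left, right))
```

In an MPI code the same structure would typically be built from non-blocking sends and receives followed by a wait; whether the latency is actually hidden depends on the implementation and the hardware.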
* "Adding new features to Espresso will be easier, because a lot of
non-trivial communication code does not have to be written."
"Writing and validating MPI-parallel code is difficult."
I beg to differ on these points. MPI parallelization does not have to be
that difficult to implement and test. In my experience, debugging
shared-memory/thread-level parallelism can quickly become much more
non-trivial and cumbersome. Generally, I don't think the learning curve
for MPI is steeper than that for OpenMP when going beyond trivial loop
parallelization.
To give one example of extensibility, the original halo communication
scheme for the grid-based kernels in ESPResSo (P3M, MEMD, LB) was
designed to accommodate varying halo extents and data content, so adding
a new property was as simple as adding a field to the underlying
MPI_Datatype. I can see how things are a bit more involved for
particles, perhaps because the ghost particle scheme began to evolve at
the time of MPI-1 when derived datatypes weren't around. It might be
worth assessing whether this could be addressed by refactoring the ghost
communication and leveraging more modern MPI capabilities.
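The idea behind such a halo scheme can be sketched in a few lines of plain Python. This is not the ESPResSo implementation and it does not use MPI; HaloSpec, make_domain and exchange_halos are hypothetical names, and the "exchange" is a copy between lists in one process. The point is only that halo width and field list are data, so a new property is one more entry in the description rather than new communication code (the role a derived datatype plays in MPI).

```python
from dataclasses import dataclass


@dataclass
class HaloSpec:
    """Describes what a halo exchange moves (hypothetical, for illustration)."""
    width: int         # number of halo layers on each side, must be >= 1
    field_names: list  # names of the per-site fields to exchange


def make_domain(interior_by_field, width):
    """Pad each field's interior values with `width` empty halo cells on both sides."""
    return {
        name: [0.0] * width + list(values) + [0.0] * width
        for name, values in interior_by_field.items()
    }


def exchange_halos(domains, spec):
    """Fill the halo cells of each 1D periodic subdomain from its neighbors' interiors."""
    w = spec.width
    n = len(domains)
    for d, dom in enumerate(domains):
        left = domains[(d - 1) % n]
        right = domains[(d + 1) % n]
        for name in spec.field_names:
            # Only interiors are read and only halos are written,
            # so the in-place update does not disturb later reads.
            left_interior = left[name][w:-w]
            right_interior = right[name][w:-w]
            dom[name][:w] = left_interior[-w:]
            dom[name][-w:] = right_interior[:w]


spec = HaloSpec(width=2, field_names=["density"])
domains = [
    make_domain({"density": [1.0, 2.0, 3.0], "charge": [0.1, 0.2, 0.3]}, spec.width),
    make_domain({"density": [4.0, 5.0, 6.0], "charge": [0.4, 0.5, 0.6]}, spec.width),
]
exchange_halos(domains, spec)

# Adding a new property to the exchange is one more entry in the spec.
spec.field_names.append("charge")
exchange_halos(domains, spec)
```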
* "The MPI and Boost::MPI dependencies complicate Espresso's
installation and make it virtually impossible to run Espresso on public
Python platforms such as Azure Notebooks or Google Colab as well as
building on Windows natively."
Hmm... I'm not sure I fully understand. I am regularly building other
software packages with MPI and Boost dependencies on various systems; I
am not a fan of CMake superbuilds but they seem to work reasonably well
in the case of conflicting dependencies with system-wide installed
libraries. At any rate, this type of issue can occur with any other
library and is not an issue of MPI itself.
As for Azure and Colab, my personal take is that those are not really
platforms I desire to run a scientific computing application on. I would
rather look to federated HPC clouds where academic stakeholders might
have a say in future developments.
* "Assuming that one million time steps per day is acceptable"
For this assumption to be meaningful, one has to know what one million
time steps mean physically. And not just how many (pico, nano, micro,
...)-seconds, but what does it mean in terms of the characteristic
relaxation time? Given the range that one can find in soft matter, I'm
afraid it's really tough to find a general notion of what is acceptable.
I think that one can find many examples that are out of the question to
address on a single node.
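A back-of-the-envelope sketch of that argument, in Python. All numbers are illustrative assumptions in reduced units (a time step of 0.01, relaxation times of 1e3 and 1e7, ten relaxation times of sampling) and do not describe any particular system; days_to_cover is a made-up helper.

```python
def days_to_cover(relaxation_time, n_relaxation_times, dt, steps_per_day=1_000_000):
    """Wall-clock days needed to simulate n_relaxation_times * relaxation_time.

    relaxation_time and dt must be given in the same (simulation) time unit.
    """
    steps_needed = n_relaxation_times * relaxation_time / dt
    return steps_needed / steps_per_day


# Same throughput, same time step, very different physics (assumed values).
fast_system = days_to_cover(relaxation_time=1e3, n_relaxation_times=10, dt=0.01)
slow_system = days_to_cover(relaxation_time=1e7, n_relaxation_times=10, dt=0.01)
```

Under these assumptions the first case needs about one day and the second about ten thousand days, so the same "one million time steps per day" is comfortable for one system and hopeless for the other.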
As Steffen already pointed out, if ESPResSo becomes a single-node code,
the availability of HPC resources will be severely limited if not
completely axed. I would add a consideration regarding potential funding
sources: The major funding agencies including ERC are committed to
exascale computing and substantial amounts of money are pumped into
hard/software/co-design to demonstrate capability at scale. This is not
necessarily science driven, whether you like it or not, but a code that
cannot run across nodes is unlikely to be considered meritorious. I also
think ESPResSo would be less likely to attract users, as people tend to
select packages with the "bigger" set of features regardless of whether
they actually need them or not.
So, eventually it boils down to the question of who your targeted users are
(mostly ICP or broader community) and whether you want ESPResSo to play
a role in the HPC ecosystem. I hope that my comments do not come across
as too opinionated; I recognize that there are many more factors to
consider and hopefully my outside perspective will help you navigate the
On 7/13/21 2:23 AM, Rudolf Weeber wrote:
> Hi Steffen,
> thank you for the detailed feedback and the points you raise.
>> 1) You asked about scenarios that need MPI parallelism. At an IPVS+ITV (U.
>> Stuttgart) collaboration, we perform large simulations
>> that subject particles to a background flow field. Within that flow field,
>> we want to include multiple scales of turbulence.
>> These simulations need MPI parallelism as they have millions of particles
>> and millions of bonds. This data simply does not fit into the RAM of a
>> single node.
> This project was, to my knowledge, the biggest system that was ever run on
> Espresso by a very wide margin. The systems for which Espresso is typically
> used at the ICP are, to my understanding, all below 100k particles.
> Those of you running big systems, please speak up, so we are aware of it.
> Also, if you would like to run bigger systems but don't due to performance
> issues, that would be of interest.
>> 2) You talk about "HPC nodes" having about 20-64 cores. This is certainly
>> true. I just want to make the remark that with a shared-memory
>> parallelization there will be no more HPC nodes for ESPResSo users. When
>> applying for runtime at an HPC center, you have to give details about the
>> parallelization and the scalability of your code. If you run on one node
>> only they will most likely turn you down and you are left with your local
> I should have written "node on a cluster". Espresso simulations would
> typically go to tier 3 systems (i.e., university or regional clusters). But
> you are right that actually removing the possibility to run bigger systems
> and therefore not being seen as part of the HPC community may well be an
>> 3) While I see your point that the current MPI parallelization might not be
>> the easiest to understand and roll out, I want to make it clear that
>> devising a well-performing shared-memory parallelization is not a trivial
>> matter either. "Sprinkling in" a couple of "#pragma omp parallel for" will
>> certainly not be enough. As with the distributed-memory parallelization you
>> will have to devise a spatial domain decomposition and come up with a
>> workload distribution between the threads. You will have to know which
>> threads import data from others and devise locking mechanisms to guard
>> these accesses. Reasoning about this code and debugging it might turn out to
>> be as hard as for the MPI-based code. If you want to go down this path, I
>> strongly suggest not reinventing the wheel and taking a look at, e.g., the
>> AutoPAS  project.
> This is a valid point. Before we make any decision, we will definitely have
> some sort of technical preview/prototype to see what can be achieved at
> acceptable levels of complexity and performance.
> In my personal opinion, with ESPResSo, the aim is for ease of extensibility
> rather than for best performance.
> How do other people in the community see this?
> Some areas where I would hope for simplifications in a purely shared memory
> * Relation between Python and C++-objects: In a shared memory code, the
> Python object could directly own the core object. In an MPI-simulation, an
> intermediate layer creates and manages mirror objects on the remote processes.
> * Due to the intermediate layer and the mirror objects, checkpointing and
> restoring a simulation is very difficult. We have currently disabled it for
> certain features.
> * We may not need (or have to replace by PFFT) the custom 3D FFT in the
> electrostatic and dipolar P3M. To my understanding, there is a
> thread-parallel drop-in replacement for FFTW. (1000 lines of code, currently)
> * Bonds and virtual sites with a range much larger than the Lennard-Jones
> cutoff would not force a larger cell size (thereby slowing the short-range
> * We would probably not need most of the ghost communication code (about 600
> lines) as cells across boundaries can be linked directly.
> * We would probably not need most of the parallel callback and particle setup
> code (about 1500 lines of code, + some 90 callbacks scattered throughout the
> Of course, all of this would need to be investigated in more detail, before a
> decision was made.
>> One particular problem that I encountered in the past and that I want to
>> briefly mention here is bonds: They are only stored on one of the two (or
>> more) involved particles. This is one of the reasons why ESPResSo currently
>> needs to communicate the forces back after calculating them and you will
>> certainly need measures that deal with this circumstance in a shared-memory
>> parallel code. Such details will increase the complexity of a shared-memory
>> parallel code and it might end up not being easy to understand for newcomers
>> or make it hard to implement new features, too.
> I agree. Although, in a shared memory code, there is the option to run some
> stuff serially and still get the benefit of the parallelized short-range
> and bond loop and electrostatics.
> You are right. The bond storage will almost certainly have to be changed.
> Otherwise the bond loop cannot be executed in parallel without requiring all
> accesses to particle forces to be atomic.
> If we stay with MPI, as you point out, this would eliminate one ghost
> communication per time step.
> By now, the bond storage has been abstracted somewhat, so this change is
> probably doable now.
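The bond storage issue discussed above can be made concrete with a small Python sketch. It is sequential and only shows the data-access pattern; it is not ESPResSo's bond storage, and bond_force, forces_single_storage and forces_double_storage are hypothetical names. With each bond stored once, the loop writes to both partners, which is what would need atomics (or a force reduction) in a threaded loop. With each particle holding its own partner list, every iteration writes only to its own force entry, at the assumed cost of evaluating each bond twice.

```python
def bond_force(x_i, x_j, k=1.0, r0=1.0):
    """Force on particle i from a 1D harmonic bond to particle j."""
    d = x_j - x_i
    r = abs(d)
    if r == 0.0:
        return 0.0
    return k * (r - r0) * d / r


def forces_single_storage(positions, bonds):
    """Each bond (i, j) is stored once; every iteration writes to two particles."""
    forces = [0.0] * len(positions)
    for i, j in bonds:
        f = bond_force(positions[i], positions[j])
        forces[i] += f
        forces[j] -= f  # in a threaded loop over bonds this write could race
    return forces


def forces_double_storage(positions, bonds):
    """Each particle knows all its partners and writes only to its own force entry."""
    partners = [[] for _ in positions]
    for i, j in bonds:
        partners[i].append(j)
        partners[j].append(i)
    forces = [0.0] * len(positions)
    for i, partner_list in enumerate(partners):
        # Iterations are independent: no other iteration touches forces[i].
        forces[i] = sum(bond_force(positions[i], positions[j]) for j in partner_list)
    return forces


positions = [0.0, 1.5, 2.0, 4.0]
bonds = [(0, 1), (1, 2), (2, 3), (0, 3)]
f_single = forces_single_storage(positions, bonds)
f_double = forces_double_storage(positions, bonds)
```

Both variants give the same forces up to rounding; which one is preferable in practice would have to be decided by measurement.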
> Thank you for pointing out AutoPas. Changing particle storage to
> struct-of-arrays would be extremely beneficial for performance.
> Thank you again for sharing your thoughts! Hearing different points of view
> is very important for us to make good decisions on Espresso development.
> There is certainly the need for further discussion and experimentation before
> we decide on the future parallelization paradigm.
> The purpose of my post was to get an idea of how many use cases there
> actually are for very big systems, in the hope that it might help us to
> direct our (limited) resources to where they are most needed.
> Regards, Rudolf
ULF D. SCHILLER
ASSISTANT PROFESSOR, MATERIALS SCIENCE AND ENGINEERING
College of Engineering, Computing and Applied Sciences
299C Sirrine Hall
Clemson, SC 29634