[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Question about checkpoint
From: |
Jean-Noël Grad |
Subject: |
Re: Question about checkpoint |
Date: |
Thu, 09 Dec 2021 17:13:32 +0100 |
User-agent: |
Roundcube Webmail/1.3.17 |
Dear Lili Zeng,
Thank you for sharing your observations. The checkpointing mechanism
doesn't restore all global variables. Some of them are regenerated from
the state of other global variables. For example, the cell structure
global variable stores raw pointers to the particles, but these pointers
are meaningless when reloading from a checkpoint, and so the cell
structure global has to be regenerated by placing "new" particles in an
empty simulation box based on the particles positions stored in the
checkpoint file. This causes unintended side-effects when reloading a
simulation from a checkpoint, because a cell resort is triggered when
particles are placed in the simulation box. In particular, cached values
of the integrator are invalidated. This can lead to an error in the
calculated forces in the order of 1e-4. If the time step is 1e-2, the
particle positions will be off by 1e-8 at the next integration step and
the trajectory will diverge.
Please find attached a minimal working example to replicate this
behavior. It simulates a single particle for 4 time steps using the
Langevin thermostat, with a "perturbation" at the third time step. This
perturbation can be a reload from a checkpoint (--perturbation load), a
cell resort (--perturbation resort) or an invalidation of the cached
integrator values (--perturbation recalc). The reference trajectory is
obtained without perturbation (--perturbation continue), which also
saves a checkpoint before the third time step. Here are the results for
ESPResSo 4.2-dev:
$ ./pypresso checkpoint_mwe.py --perturbation continue
rng_counter = 1, p.pos=array([ 0.00084648, -0.00009886, 0.00006749])
p.f=array([ 15.00509416, 4.49980631, -23.50091465])
rng_counter = 2, p.pos=array([ 0.00319346, 0.00025227, -0.00221511])
p.f=array([ 24.24592812, 4.85133902, -17.03844439])
rng_counter = 3, p.pos=array([ 0.00796504, 0.00108853, -0.00620156])
p.f=array([ 14.41245656, 9.68804012, 22.38072167])
rng_counter = 4, p.pos=array([ 0.01417787, 0.00289359, -0.00794993])
p.f=array([-14.30846688, -4.63255625, 24.10474287])
$ ./pypresso checkpoint_mwe.py --perturbation load
rng_counter = 1, unknown
rng_counter = 2, unknown
rng_counter = 3, p.pos=array([ 0.00795898, 0.00108732, -0.00619730])
p.f=array([ 14.41306271, 9.68816141, 22.38029571])
rng_counter = 4, p.pos=array([ 0.01416581, 0.00289118, -0.00794146])
p.f=array([-14.30786679, -4.63243618, 24.10432117])
$ ./pypresso checkpoint_mwe.py --perturbation resort
rng_counter = 1, p.pos=array([ 0.00084648, -0.00009886, 0.00006749])
p.f=array([ 15.00509416, 4.49980631, -23.50091465])
rng_counter = 2, p.pos=array([ 0.00319346, 0.00025227, -0.00221511])
p.f=array([ 24.24592812, 4.85133902, -17.03844439])
rng_counter = 3, p.pos=array([ 0.00795898, 0.00108732, -0.00619730])
p.f=array([ 14.41306271, 9.68816141, 22.38029571])
rng_counter = 4, p.pos=array([ 0.01416581, 0.00289118, -0.00794146])
p.f=array([-14.30786679, -4.63243618, 24.10432117])
$ ./pypresso checkpoint_mwe.py --perturbation recalc
rng_counter = 1, p.pos=array([ 0.00084648, -0.00009886, 0.00006749])
p.f=array([ 15.00509416, 4.49980631, -23.50091465])
rng_counter = 2, p.pos=array([ 0.00319346, 0.00025227, -0.00221511])
p.f=array([ 24.24592812, 4.85133902, -17.03844439])
rng_counter = 3, p.pos=array([ 0.00795898, 0.00108732, -0.00619730])
p.f=array([ 14.41306271, 9.68816141, 22.38029571])
rng_counter = 4, p.pos=array([ 0.01416220, 0.00288876, -0.00794705])
p.f=array([-14.30750646, -4.63219397, 24.10488068])
According to these results, the trajectory after a reload is similar to
the trajectory one would obtain by forcing a particle resort, and
diverges by 1e-5 in unit of length at the third time step. This might
not be the only contributing factor for the drift you observed, but
characterizing which other components of ESPResSo are affected by the
reload is not a trivial task since many features of ESPResSo behave
differently depending on which other features are currently active.
Unfortunately, the checkpointing feature of ESPResSo is still at an
experimental stage and is not well maintained. We have checkpointing
tests to guarantee that the state of particles, thermostats, integrators
and numerical solvers is correctly reloaded. However the cell system
isn't properly reloaded at the moment, and therefore isn't tested. I
would not recommended enabling checkpointing if you need reproducible
trajectories.
Best,
JN
On 2021-12-07 20:21, Lili Zeng wrote:
Hi,
I'm a PhD student at McGill University in Montreal, Canada, studying
polymer physics, and I'm a user of ESPResSo (version 4.1). I have a
question about checkpoints. I noticed that when I run a simulation
(say, with 2000 timesteps) in one shot, the final positions of my
particles are different from when I run the first half of an identical
simulation (so 1000 timesteps), then save under checkpoint, restore
the simulation using checkpoint, and run to completion (another 1000
timesteps). This is true even for the simplest systems, eg one single
particle suspended in space with no interactions. In this case, the
difference in the final particle position is very small (on order of
10^(-6) for a particle of size 1 in a system box of size 50), but it
still exists. I wanted to ask whether this is normal, and whether
there is any way to get identical results using checkpoint compared to
when not using checkpoint.
Thank you!
Lili
checkpoint_mwe.py
Description: Text Data