[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH 4/4] migration: use dirty_rate_high_cnt more agg
From: |
Peter Xu |
Subject: |
Re: [Qemu-devel] [PATCH 4/4] migration: use dirty_rate_high_cnt more aggressively |
Date: |
Fri, 26 May 2017 10:45:12 +0800 |
User-agent: |
Mutt/1.5.24 (2015-08-30) |
On Thu, May 25, 2017 at 11:20:02AM +0000, Felipe Franciosi wrote:
> + Matthew Rosato, original reviewer of 070afca25
>
> > On 25 May 2017, at 02:03, Peter Xu <address@hidden> wrote:
> >
> > On Wed, May 24, 2017 at 05:10:03PM +0100, Felipe Franciosi wrote:
> >> The commit message from 070afca25 suggests that dirty_rate_high_cnt
> >> should be used more aggressively to start throttling after two
> >> iterations instead of four. The code, however, only changes the auto
> >> convergence behaviour to throttle after three iterations. This makes the
> >> behaviour more aggressive by kicking off throttling after two iterations
> >> as originally intended.
> >
> > For this one, I don't think fixing the code to match the commit
> > message is that important. Instead, for me this patch looks more like
> > something "changed iteration loops from 3 to 2".
>
> I agree. If we decide a v2 is needed I can amend the commit message
> accordingly.
>
> > So the point is, what
> > would be the best practical number for it. And when we change an
> > existing value, we should have some reason, since it'll change
> > behavior of existing user (though I'm not sure whether this one will
> > affect much).
>
> We've done a few tests with this internally using various workloads (DBs,
> synthetic mem stress, &c.). In summary, and along the lines with how Qemu
> implements auto convergence today, I would say this counter should be removed.
>
> Consider that when live migrating a large-ish VM (100GB+ RAM), the network
> will be under stress for (at least) several minutes. At the same time, the
> sole purpose of this counter is to attempt convergence without having to
> throttle the guest. That is, it defers throttling in the hope that either the
> guest calms down or that the network gets less congested.
>
> For real workloads, both cases are unlikely to happen. Throttling will
> eventually be needed otherwise the migration will not converge.
I am not much experienced with using auto convergence with real
workload, but from what you explained I see it a reasonable change to
even remove it (that sounds more persuasive to me than "just try to
follow what the commit message said", with test results :), especially
considering that we have cpu-throttle-initial and
cpu-throttle-increment parameters as tunables, so we should be fine
even we want to tune the speed down a bit in the future.
>
> > I believe with higher dirty_rate_high_cnt, we have more smooth
> > throttling, but it'll be slower in responding; While if lower or even
> > remove it, we'll get very fast throttling response speed but I guess
> > it may be more possible to report a false positive?
>
> With a higher dirty_rate_high_cnt, migration will simply take longer (if not
> converging). For real workloads, it doesn't change how much throttling is
> required. For spiky or varied workloads, it gives a chance for migration to
> converge without throttling, at the cost of massive network stress and a
> longer overall migration time (which is bad user experience IMO).
>
> > IMHO here 3 is
> > okay since after all we are solving the problem of unconverged
> > migration, so as long as we can converge, I think it'll be fine.
>
> Based on 070afca25's commit message, Jason seemed to think that four was too
> much and meant to change it to two. As explained above, I'd be in favour of
> removing this counter altogether, or at least make it configurable (perhaps a
> #define would be enough an improvement for now). This patch is intended as a
> compromise by effectively using two.
If to be a tunable, maybe another MigrationParameter? But I don't know
whether it would really worth it since the other two can do more
fine-grained tunes already. So I would prefer to remove it as well if
thorough tests are carried out.
Maybe we can also wait to see how other people think about it.
Thanks,
--
Peter Xu
[Qemu-devel] [PATCH 3/4] migration: set bytes_xfer_* outside of autoconverge logic, Felipe Franciosi, 2017/05/24