Re: [Qemu-devel] [PATCH] migration: calculate expected_downtime with ram_bytes_remaining()
From: Dr. David Alan Gilbert
Date: Tue, 10 Apr 2018 11:02:36 +0100
User-agent: Mutt/1.9.3 (2018-01-21)
* David Gibson (address@hidden) wrote:
> On Mon, 9 Apr 2018 19:57:47 +0100
> "Dr. David Alan Gilbert" <address@hidden> wrote:
>
> > * Balamuruhan S (address@hidden) wrote:
> > > On 2018-04-04 13:36, Peter Xu wrote:
> > > > On Wed, Apr 04, 2018 at 11:55:14AM +0530, Balamuruhan S wrote:
> [snip]
> > > > > > - postcopy: that'll let you start the destination VM even without
> > > > > > transferring all the RAMs before hand
> > > > >
> > > > > I am seeing an issue in postcopy migration between POWER8 (16M)
> > > > > -> POWER9 (1G), where the hugepage size is different. I am
> > > > > trying to enable it, but the host start address has to be
> > > > > aligned with the 1G page size in ram_block_discard_range(),
> > > > > which I am debugging further to fix.
> > > >
> > > > I thought the huge page size needs to be matched on both side
> > > > currently for postcopy but I'm not sure.
> > >
> > > you are right! it should be matched, but we need to support
> > > POWER8(16M) -> POWER9(1G)
> > >
> > > > CC Dave (though I think Dave's still on PTO).
> >
> > There are two problems there:
> > a) Postcopy with really big huge pages is a problem, because it takes
> > a long time to send the whole 1G page over the network and the vCPU
> > is paused during that time; for example on a 10Gbps link, it takes
> > about 1 second to send a 1G page, so that's a silly time to keep
> > the vCPU paused.
> >
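[The "about 1 second" figure for (a) is easy to verify with a quick
back-of-envelope sketch; this is plain arithmetic, not QEMU code:]

```python
# Rough sketch of the vCPU stall quoted above: how long one hugepage
# takes to cross the wire.  Page size and link speed are the figures
# from the discussion, not measurements; protocol overhead is ignored.

def page_transfer_seconds(page_bytes: int, link_bits_per_sec: float) -> float:
    """Time to push one page of page_bytes over the link."""
    return page_bytes * 8 / link_bits_per_sec

GIB = 1 << 30
stall = page_transfer_seconds(GIB, 10e9)   # one 1 GiB page, 10 Gbps link
print(f"{stall:.2f}s")                     # roughly 0.86s, i.e. ~1 second
```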
> > b) Mismatched pagesizes are a problem in postcopy; we require that the
> >    whole of a hostpage is sent contiguously, so that it can be
> >    atomically placed in memory; the source knows to do this based on
> >    the page sizes that it sees. There are some other cases as well
> >    (e.g. discards have to be page aligned.)
>
> I'm not entirely clear on what mismatched means here. Mismatched
> between where and where? I *think* the relevant thing is a mismatch
> between host backing page size on source and destination, but I'm not
> certain.
Right. As I understand it, we make no requirements on (an x86) guest
as to what page sizes it uses given any particular host page sizes.
> > Both of the problems are theoretically fixable; but neither case is
> > easy.
> > (b) could be fixed by sending the hugepage size back to the source,
> > so that it knows to perform alignments on a larger boundary to its
> > own RAM blocks.
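[The fix sketched for (b) amounts to: the destination reports its host
hugepage size back to the source, and the source then rounds its
send/discard ranges out to the larger of the two boundaries. A minimal
sketch of that alignment logic follows; the function names are
illustrative, not QEMU's actual API:]

```python
# Hedged sketch: pick the effective alignment from both hosts' hugepage
# sizes, then expand a guest-RAM range outward to that boundary so a
# whole destination hostpage is always covered.

def effective_align(src_page: int, dst_page: int) -> int:
    """Alignment the source must honour once it learns dst's page size."""
    return max(src_page, dst_page)

def align_range(start: int, length: int, align: int):
    """Expand [start, start+length) outward to align-sized boundaries."""
    aligned_start = start & ~(align - 1)
    end = start + length
    aligned_end = (end + align - 1) & ~(align - 1)
    return aligned_start, aligned_end - aligned_start

MIB = 1 << 20
align = effective_align(16 * MIB, 1024 * MIB)  # 16M source vs 1G destination
print(align // MIB)                            # 1024: align on 1G boundaries
```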
>
> Sounds feasible, but like something that will take some thought and
> time upstream.
Yes; it's not too bad.
> > (a) is a much much harder problem; one *idea* would be a major
> > reorganisation of the kernel's hugepage + userfault code to somehow
> > allow them to temporarily present as normal pages rather than a
> > hugepage.
>
> Yeah... for Power specifically, I think doing that would be really
> hard, verging on impossible, because of the way the MMU is
> virtualized. Well.. it's probably not too bad for a native POWER9
> guest (using the radix MMU), but the issue here is for POWER8 compat
> guests which use the hash MMU.
My idea was to fill the pagetables for that hugepage using small page
entries but using the physical hugepages memory; so that once we're
done we'd flip it back to being a single hugepage entry.
(But my understanding is that doesn't fit at all into the way the kernel
hugepage code works).
> > Does P9 really not have a hugepage that's smaller than 1G?
>
> It does (2M), but we can't use it in this situation. As hinted above,
> POWER9 has two very different MMU modes, hash and radix. In hash mode
> (which is similar to POWER8 and earlier CPUs) the hugepage sizes are
> 16M and 16G, in radix mode (more like x86) they are 2M and 1G.
>
> POWER9 hosts always run in radix mode. Or at least, we only support
> running them in radix mode. We support both radix mode and hash mode
> guests, the latter including all POWER8 compat mode guests.
>
> The next complication is that, because of the way the hash
> virtualization works, any page used by the guest must be
> HPA-contiguous, not just GPA-contiguous. Which means that any pagesize
> used by the guest must be smaller than or equal to the host pagesizes
> used to back the guest.
> We (sort of) cope with that by only advertising the 16M pagesize to the
> guest if all guest RAM is backed by >= 16M pages.
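[The advertisement rule described above can be sketched as: a hash-mode
guest may only be offered a pagesize if every RAM block is backed by host
pages at least that large. The sizes and helper names here are
illustrative, not QEMU's actual interface:]

```python
# Sketch of "advertise 16M only if all guest RAM is backed by >= 16M
# pages".  Guest pages must fit inside host backing pages, so the
# smallest backing page across all RAM blocks caps what can be offered.

MIB = 1 << 20
HASH_PAGE_SIZES = [64 * 1024, 16 * MIB, 16 * 1024 * MIB]  # 64K, 16M, 16G

def advertised_page_sizes(backing_page_sizes):
    """Pagesizes the guest may be offered, given per-RAM-block backing."""
    limit = min(backing_page_sizes)
    return [s for s in HASH_PAGE_SIZES if s <= limit]

# POWER8 host backing all RAM with 16M pages: the guest may use 16M.
print(advertised_page_sizes([16 * MIB]))
# POWER9 radix host backing with 2M pages: 16M is no longer allowed,
# but a migrated guest was told otherwise at boot -- the failure mode
# described above.
print(advertised_page_sizes([2 * MIB]))
```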
>
> But that advertisement only happens at guest boot. So if we migrate a
> guest from POWER8, backed by 16M pages to POWER9 backed by 2M pages,
> the guest still thinks it can use 16M pages and jams up. (I'm in the
> middle of upstream work to make the failure mode less horrible).
>
> So, the only way to run a POWER8 compat mode guest with access to 16M
> pages on a POWER9 radix mode host is using 1G hugepages on the host
> side.
Ah ok; I'm not seeing an easy answer here.
The only vague thing I can think of is if you gave P9 a fake 16M
hugepage mode that did all HPA allocation and mappings in 16M chunks
(using 8 x 2M page entries).
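[The arithmetic behind that idea: one 16M guest page would be presented
as a run of physically contiguous 2M host entries. Purely illustrative;
no kernel interface like this exists today:]

```python
# Back-of-envelope for the "fake 16M hugepage" idea: enumerate the HPAs
# of the 2M entries that would cover one physically contiguous 16M page.

MIB = 1 << 20

def split_hugepage(hpa_base: int, huge: int = 16 * MIB, small: int = 2 * MIB):
    """HPAs of the small-page entries covering one contiguous hugepage."""
    assert huge % small == 0
    return [hpa_base + i * small for i in range(huge // small)]

entries = split_hugepage(0x40000000)
print(len(entries))          # 8 entries of 2M cover one 16M page
```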
Dave
> --
> David Gibson <address@hidden>
> Principal Software Engineer, Virtualization, Red Hat
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK