Re: [Qemu-devel] [PATCH] migration: calculate expected_downtime with ram_bytes_remaining()
From: Dr. David Alan Gilbert
Date: Tue, 10 Apr 2018 11:02:36 +0100
User-agent: Mutt/1.9.3 (2018-01-21)
* David Gibson (address@hidden) wrote:
> On Mon, 9 Apr 2018 19:57:47 +0100
> "Dr. David Alan Gilbert" <address@hidden> wrote:
>
> > * Balamuruhan S (address@hidden) wrote:
> > > On 2018-04-04 13:36, Peter Xu wrote:
> > > > On Wed, Apr 04, 2018 at 11:55:14AM +0530, Balamuruhan S wrote:
> [snip]
> > > > > > - postcopy: that'll let you start the destination VM even without
> > > > > > transferring all the RAMs before hand
> > > > >
> > > > > I am seeing an issue in postcopy migration between POWER8 (16M)
> > > > > -> POWER9 (1G), where the hugepage size is different. I am
> > > > > trying to enable it, but the host start address has to be
> > > > > aligned with the 1G page size in ram_block_discard_range(),
> > > > > which I am debugging further to fix.
> > > >
> > > > I thought the huge page size needs to be matched on both side
> > > > currently for postcopy but I'm not sure.
> > >
> > > you are right! it should be matched, but we need to support
> > > POWER8(16M) -> POWER9(1G)
> > >
> > > > CC Dave (though I think Dave's still on PTO).
> >
> > There are two problems there:
> > a) Postcopy with really big huge pages is a problem, because it takes
> > a long time to send the whole 1G page over the network and the vCPU
> > is paused during that time; for example on a 10Gbps link, it takes
> > about 1 second to send a 1G page, so that's a silly time to keep
> > the vCPU paused.
> >
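[The "about 1 second" figure for (a) is easy to verify with a quick
back-of-envelope sketch; this is plain arithmetic, not QEMU code:]

```python
# Rough sketch of the vCPU stall quoted above: how long one hugepage
# takes to cross the wire.  Page size and link speed are the figures
# from the discussion, not measurements; protocol overhead is ignored.

def page_transfer_seconds(page_bytes: int, link_bits_per_sec: float) -> float:
    """Time to push one page of page_bytes over the link."""
    return page_bytes * 8 / link_bits_per_sec

GIB = 1 << 30
stall = page_transfer_seconds(GIB, 10e9)   # one 1 GiB page, 10 Gbps link
print(f"{stall:.2f}s")                     # roughly 0.86s, i.e. ~1 second
```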
> > b) Mismatched pagesizes are a problem in postcopy; we require that the
> >    whole of a hostpage is sent contiguously, so that it can be
> >    atomically placed in memory; the source knows to do this based on
> >    the page sizes that it sees. There are some other cases as well
> >    (e.g. discards have to be page aligned.)
>
> I'm not entirely clear on what mismatched means here. Mismatched
> between where and where? I *think* the relevant thing is a mismatch
> between host backing page size on source and destination, but I'm not
> certain.
Right. As I understand it, we make no requirements on (an x86) guest
as to what page sizes it uses given any particular host page sizes.
> > Both of the problems are theoretically fixable; but neither case is
> > easy.
> > (b) could be fixed by sending the hugepage size back to the source,
> > so that it knows to perform alignments on a larger boundary to its
> > own RAM blocks.
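[The fix sketched for (b) amounts to: the destination reports its host
hugepage size back to the source, and the source then rounds its
send/discard ranges out to the larger of the two boundaries. A minimal
sketch of that alignment logic follows; the function names are
illustrative, not QEMU's actual API:]

```python
# Hedged sketch: pick the effective alignment from both hosts' hugepage
# sizes, then expand a guest-RAM range outward to that boundary so a
# whole destination hostpage is always covered.

def effective_align(src_page: int, dst_page: int) -> int:
    """Alignment the source must honour once it learns dst's page size."""
    return max(src_page, dst_page)

def align_range(start: int, length: int, align: int):
    """Expand [start, start+length) outward to align-sized boundaries."""
    aligned_start = start & ~(align - 1)
    end = start + length
    aligned_end = (end + align - 1) & ~(align - 1)
    return aligned_start, aligned_end - aligned_start

MIB = 1 << 20
align = effective_align(16 * MIB, 1024 * MIB)  # 16M source vs 1G destination
print(align // MIB)                            # 1024: align on 1G boundaries
```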
>
> Sounds feasible, but like something that will take some thought and
> time upstream.
Yes; it's not too bad.
> > (a) is a much much harder problem; one *idea* would be a major
> > reorganisation of the kernel's hugepage + userfault code to somehow
> > allow them to temporarily present as normal pages rather than a
> > hugepage.
>
> Yeah... for Power specifically, I think doing that would be really
> hard, verging on impossible, because of the way the MMU is
> virtualized. Well.. it's probably not too bad for a native POWER9
> guest (using the radix MMU), but the issue here is for POWER8 compat
> guests which use the hash MMU.
My idea was to fill the pagetables for that hugepage using small page
entries but using the physical hugepages memory; so that once we're
done we'd flip it back to being a single hugepage entry.
(But my understanding is that doesn't fit at all into the way the kernel
hugepage code works).
> > Does P9 really not have a hugepage that's smaller than 1G?
>
> It does (2M), but we can't use it in this situation. As hinted above,
> POWER9 has two very different MMU modes, hash and radix. In hash mode
> (which is similar to POWER8 and earlier CPUs) the hugepage sizes are
> 16M and 16G, in radix mode (more like x86) they are 2M and 1G.
>
> POWER9 hosts always run in radix mode. Or at least, we only support
> running them in radix mode. We support both radix mode and hash mode
> guests, the latter including all POWER8 compat mode guests.
>
> The next complication is that, because of the way the hash
> virtualization works, any page used by the guest must be
> HPA-contiguous, not just GPA-contiguous. Which means that any pagesize
> used by the guest must be smaller than or equal to the host pagesizes
> used to back the guest.
> We (sort of) cope with that by only advertising the 16M pagesize to the
> guest if all guest RAM is backed by >= 16M pages.
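[The advertisement rule described above can be sketched as: a hash-mode
guest may only be offered a pagesize if every RAM block is backed by host
pages at least that large. The sizes and helper names here are
illustrative, not QEMU's actual interface:]

```python
# Sketch of "advertise 16M only if all guest RAM is backed by >= 16M
# pages".  Guest pages must fit inside host backing pages, so the
# smallest backing page across all RAM blocks caps what can be offered.

MIB = 1 << 20
HASH_PAGE_SIZES = [64 * 1024, 16 * MIB, 16 * 1024 * MIB]  # 64K, 16M, 16G

def advertised_page_sizes(backing_page_sizes):
    """Pagesizes the guest may be offered, given per-RAM-block backing."""
    limit = min(backing_page_sizes)
    return [s for s in HASH_PAGE_SIZES if s <= limit]

# POWER8 host backing all RAM with 16M pages: the guest may use 16M.
print(advertised_page_sizes([16 * MIB]))
# POWER9 radix host backing with 2M pages: 16M is no longer allowed,
# but a migrated guest was told otherwise at boot -- the failure mode
# described above.
print(advertised_page_sizes([2 * MIB]))
```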
>
> But that advertisement only happens at guest boot. So if we migrate a
> guest from POWER8, backed by 16M pages to POWER9 backed by 2M pages,
> the guest still thinks it can use 16M pages and jams up. (I'm in the
> middle of upstream work to make the failure mode less horrible).
>
> So, the only way to run a POWER8 compat mode guest with access to 16M
> pages on a POWER9 radix mode host is using 1G hugepages on the host
> side.
Ah ok; I'm not seeing an easy answer here.
The only vague thing I can think of is if you gave P9 a fake 16M
hugepage mode that did all HPA allocation and mappings in 16M chunks
(using 8 x 2M page entries).
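[The arithmetic behind that idea: one 16M guest page would be presented
as a run of physically contiguous 2M host entries. Purely illustrative;
no kernel interface like this exists today:]

```python
# Back-of-envelope for the "fake 16M hugepage" idea: enumerate the HPAs
# of the 2M entries that would cover one physically contiguous 16M page.

MIB = 1 << 20

def split_hugepage(hpa_base: int, huge: int = 16 * MIB, small: int = 2 * MIB):
    """HPAs of the small-page entries covering one contiguous hugepage."""
    assert huge % small == 0
    return [hpa_base + i * small for i in range(huge // small)]

entries = split_hugepage(0x40000000)
print(len(entries))          # 8 entries of 2M cover one 16M page
```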
Dave
> --
> David Gibson <address@hidden>
> Principal Software Engineer, Virtualization, Red Hat
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK