[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v7 14/17] spapr: Don't attempt to clamp RMA to VRMA constrain
From: |
Greg Kurz |
Subject: |
Re: [PATCH v7 14/17] spapr: Don't attempt to clamp RMA to VRMA constraint |
Date: |
Tue, 3 Mar 2020 10:55:37 +0100 |
On Tue, 3 Mar 2020 14:43:48 +1100
David Gibson <address@hidden> wrote:
> The Real Mode Area (RMA) is the part of memory which a guest can access
> when in real (MMU off) mode. Of course, for a guest under KVM, the MMU
> isn't really turned off, it's just in a special translation mode - Virtual
> Real Mode Area (VRMA) - which looks like real mode in guest mode.
>
> The mechanics of how this works when using the hash MMU (HPT) put a
> constraint on the size of the RMA, which depends on the size of the
> HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA
> we advertise to the guest based on this VRMA limit.
>
> There are several things wrong with this:
> 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
> of Node 0 memory size and the VRMA limit. That will *often* work the
> same as clamping, but there can be other constraints on RMA size which
> supersede Node 0 memory size. We have real bugs caused by this
> (currently worked around in the guest kernel)
> 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
> we're past the point that we can actually advertise an RMA limit to the
> guest
> 3) But most fundamentally, the VRMA limit depends on host configuration
> (page size) which shouldn't be visible to the guest, but this partially
> exposes it. This can cause problems with migration in certain edge
> cases, although we will mostly get away with it.
>
> In practice, this clamping is almost never applied anyway. With 64kiB
> pages and the normal rules for sizing of the HPT, the theoretical VRMA
> limit will be 4x(guest memory size) and so never hit. It will hit with
> 4kiB pages, where it will be (guest memory size)/4. However all mainstream
> distro kernels for POWER have used a 64kiB page size for at least 10 years.
>
> So, simply replace this logic with a check that the RMA we've calculated
> based only on guest visible configuration will fit within the host implied
> VRMA limit. This can break if running HPT guests on a host kernel with
> 4kiB page size. As noted that's very rare. There also exist several
> possible workarounds:
> * Change the host kernel to use 64kiB pages
> * Use radix MMU (RPT) guests instead of HPT
> * Use 64kiB hugepages on the host to back guest memory
> * Increase the guest memory size so that the RMA hits one of the fixed
> limits before the RMA limit. This is relatively easy on POWER8 which
> has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
> * Use a guest NUMA configuration which artificially constrains the RMA
> within the VRMA limit (the RMA must always fit within Node 0).
>
> Previously, on KVM, we also temporarily reduced the rma_size to 256M so
> that the we'd load the kernel and initrd safely, regardless of the VRMA
> limit. This was a) confusing, b) could significantly limit the size of
> images we could load and c) introduced a behavioural difference between
> KVM and TCG. So we remove that as well.
>
> Signed-off-by: David Gibson <address@hidden>
> Reviewed-by: Alexey Kardashevskiy <address@hidden>
> ---
Reviewed-by: Greg Kurz <address@hidden>
> hw/ppc/spapr.c | 28 ++++++++++------------------
> hw/ppc/spapr_hcall.c | 4 ++--
> include/hw/ppc/spapr.h | 3 +--
> 3 files changed, 13 insertions(+), 22 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 18bf4bc3de..ef7667455c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1569,7 +1569,7 @@ void spapr_reallocate_hpt(SpaprMachineState *spapr, int
> shift,
> spapr_set_all_lpcrs(0, LPCR_HR | LPCR_UPRT);
> }
>
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
> +void spapr_setup_hpt(SpaprMachineState *spapr)
> {
> int hpt_shift;
>
> @@ -1585,10 +1585,16 @@ void spapr_setup_hpt_and_vrma(SpaprMachineState
> *spapr)
> }
> spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal);
>
> - if (spapr->vrma_adjust) {
> + if (kvm_enabled()) {
> hwaddr vrma_limit = kvmppc_vrma_limit(spapr->htab_shift);
>
> - spapr->rma_size = MIN(spapr_node0_size(MACHINE(spapr)), vrma_limit);
> + /* Check our RMA fits in the possible VRMA */
> + if (vrma_limit < spapr->rma_size) {
> + error_report("Unable to create %" HWADDR_PRIu
> + "MiB RMA (VRMA only allows %" HWADDR_PRIu "MiB",
> + spapr->rma_size / MiB, vrma_limit / MiB);
> + exit(EXIT_FAILURE);
> + }
> }
> }
>
> @@ -1628,7 +1634,7 @@ static void spapr_machine_reset(MachineState *machine)
> spapr->patb_entry = PATE1_GR;
> spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR | LPCR_UPRT);
> } else {
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
>
> qemu_devices_reset();
> @@ -2695,20 +2701,6 @@ static void spapr_machine_init(MachineState *machine)
>
> spapr->rma_size = node0_size;
>
> - /* With KVM, we don't actually know whether KVM supports an
> - * unbounded RMA (PR KVM) or is limited by the hash table size
> - * (HV KVM using VRMA), so we always assume the latter
> - *
> - * In that case, we also limit the initial allocations for RTAS
> - * etc... to 256M since we have no way to know what the VRMA size
> - * is going to be as it depends on the size of the hash table
> - * which isn't determined yet.
> - */
> - if (kvm_enabled()) {
> - spapr->vrma_adjust = 1;
> - spapr->rma_size = MIN(spapr->rma_size, 0x10000000);
> - }
> -
> /* Actually we don't support unbounded RMA anymore since we added
> * proper emulation of HV mode. The max we can get is 16G which
> * also happens to be what we configure for PAPR mode so make sure
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index c2b3286625..40c86e91eb 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1458,7 +1458,7 @@ static void
> spapr_check_setup_free_hpt(SpaprMachineState *spapr,
> spapr_free_hpt(spapr);
> } else if (!(patbe_new & PATE1_GR)) {
> /* RADIX->HASH || NOTHING->HASH : Allocate HPT */
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
> return;
> }
> @@ -1845,7 +1845,7 @@ static target_ulong
> h_client_architecture_support(PowerPCCPU *cpu,
> * (because the guest isn't going to use radix) then set it up here.
> */
> if ((spapr->patb_entry & PATE1_GR) && !guest_radix) {
> /* legacy hash or new hash: */
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
>
> if (fdt_bufsize < sizeof(hdr)) {
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index a4216935a1..90dbc55931 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -156,7 +156,6 @@ struct SpaprMachineState {
> SpaprPendingHpt *pending_hpt; /* in-progress resize */
>
> hwaddr rma_size;
> - int vrma_adjust;
> uint32_t fdt_size;
> uint32_t fdt_initial_size;
> void *fdt_blob;
> @@ -795,7 +794,7 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool
> reset, size_t space);
> void spapr_events_init(SpaprMachineState *sm);
> void spapr_dt_events(SpaprMachineState *sm, void *fdt);
> void close_htab_fd(SpaprMachineState *spapr);
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
> +void spapr_setup_hpt(SpaprMachineState *spapr);
> void spapr_free_hpt(SpaprMachineState *spapr);
> SpaprTceTable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
> void spapr_tce_table_enable(SpaprTceTable *tcet,
- Re: [PATCH v7 01/17] ppc: Remove stub support for 32-bit hypervisor mode, (continued)
- [PATCH v7 14/17] spapr: Don't attempt to clamp RMA to VRMA constraint, David Gibson, 2020/03/02
- Re: [PATCH v7 14/17] spapr: Don't attempt to clamp RMA to VRMA constraint,
Greg Kurz <=
- [PATCH v7 12/17] spapr: Don't use weird units for MIN_RMA_SLOF, David Gibson, 2020/03/02
- [PATCH v7 08/17] target/ppc: Streamline calculation of RMA limit from LPCR[RMLS], David Gibson, 2020/03/02
- [PATCH v7 10/17] target/ppc: Only calculate RMLS derived RMA limit on demand, David Gibson, 2020/03/02
- [PATCH v7 17/17] spapr: Fold spapr_node0_size() into its only caller, David Gibson, 2020/03/02