From: David Gibson
Subject: Re: [Qemu-ppc] [PATCH 26/77] ppc/pnv: Add skeletton PowerNV platform
Date: Tue, 24 Nov 2015 13:43:51 +1100
User-agent: Mutt/1.5.23 (2015-06-09)

On Tue, Nov 24, 2015 at 12:45:48PM +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2015-11-20 at 19:21 +1100, David Gibson wrote:
> > On Wed, Nov 11, 2015 at 11:27:39AM +1100, Benjamin Herrenschmidt
> > wrote:
> > > No devices yet, not even an interrupt controller, just to get
> > > started.
> > > 
> > > Signed-off-by: Benjamin Herrenschmidt <address@hidden>
> > > ---
> > >  default-configs/ppc64-softmmu.mak |   1 +
> > >  hw/ppc/Makefile.objs              |   2 +
> > >  hw/ppc/pnv.c                      | 600 ++++++++++++++++++++++++++++++++++++++
> > >  include/hw/ppc/pnv.h              |  36 +++
> > >  4 files changed, 639 insertions(+)
> > >  create mode 100644 hw/ppc/pnv.c
> > >  create mode 100644 include/hw/ppc/pnv.h
> > 
> > Many of my comments below may be made irrelevant by later patches in
> > the series.
> 
> Heh, well this is where the "meat" of the new platform starts showing
> up :-)
> 
>  .../...
> 
> > > +#define _FDT(exp) \
> > > +    do { \
> > > +        int ret = (exp);                                           \
> > > +        if (ret < 0) {                                             \
> > > +            fprintf(stderr, "qemu: error creating device tree: %s: %s\n", \
> > > +                    #exp, fdt_strerror(ret));                      \
> > > +            exit(1);                                               \
> > > +        }                                                          \
> > > +    } while (0)
> > 
> > We should probably make a file where helper routines used by both
> > spapr.c and pnv.c can live.
> 
> Probably but I'd see that as a later cleanup rather than doing it
> in this series...

Ok.

> 
>  .../...
> 
> > > +    if (pcc->l1_dcache_size) {
> > > +        _FDT((fdt_property_cell(fdt, "d-cache-size", pcc->l1_dcache_size)));
> > > +    } else {
> > > +        fprintf(stderr, "Warning: Unknown L1 dcache size for cpu\n");
> > 
> > Hmm (note to self) should probably change a bunch of these both in
> > spapr.c and pnv.c from explicit fprintfs() to modern error_report()
> > and similar.
> 
> That's a train I completely missed, but yes.
> 
>   .../...
> 
> > > +    }
> > > +    _FDT((fdt_property(fdt, "ibm,ppc-interrupt-server#s",
> > > +                       servers_prop, sizeof(servers_prop))));
> > > +    _FDT((fdt_property(fdt, "ibm,ppc-interrupt-gserver#s",
> > > +                       gservers_prop, sizeof(gservers_prop))));
> > > +
> > > +    _FDT((fdt_end_node(fdt)));
> > > +}
> > > +
> > > +static void *powernv_create_fdt(PnvSystem *sys, uint32_t initrd_base, uint32_t initrd_size)
> > > +{
> > 
> > So.. does it make sense for qemu to create a device tree for powernv,
> > rather than leaving it to Opal?
> 
> Well, OPAL only creates a device-tree if you are on an FSP machine in
> which case it expects a complex data structure (HDAT) coming from the
> FSP to use as a source of info.
> 
> On OpenPower machines, which is closer to what we simulate here, we
> do get a device-tree as an input in OPAL, it's generated by HostBoot.
> 
> Now, I am not running HostBoot in qemu because most of what it does
> is completely irrelevant to an emulated system (training the various
> links, initializing the memory buffer chips, etc...).
> 
> However, we do need to pass a fair amount of platform information to
> OPAL which HB normally provides via the device-tree, such as which
> cores are enabled, the memory map configured for PCI, which PHBs are
> enabled, etc...  so creating a DT in qemu makes sense, it mimics HB
> in essence.
> 
> OPAL will enrich that device-tree before starting Linux.

Ok.  Some comments mentioning that you're simulating the exit state
from HostBoot would be good then.

> > > +    /*
> > > +     * Add info to the guest to identify which host it is being run on
> > > +     * and what is the uuid of the guest
> > > +     */
> > > +    if (kvmppc_get_host_model(&buf)) {
> > > +        _FDT((fdt_property_string(fdt, "host-model", buf)));
> > > +        g_free(buf);
> > > +    }
> > > +    if (kvmppc_get_host_serial(&buf)) {
> > > +        _FDT((fdt_property_string(fdt, "host-serial", buf)));
> > > +        g_free(buf);
> > > +    }
> > 
> > Since you're emulating a "bare metal" machine, surely the host
> > properties aren't relevant here.
> 
> They may or may not. But yes, I can take that out.
> 
> > > +    buf = g_strdup_printf(UUID_FMT, qemu_uuid[0], qemu_uuid[1],
> > > +                          qemu_uuid[2], qemu_uuid[3], qemu_uuid[4],
> > > +                          qemu_uuid[5], qemu_uuid[6], qemu_uuid[7],
> > > +                          qemu_uuid[8], qemu_uuid[9], qemu_uuid[10],
> > > +                          qemu_uuid[11], qemu_uuid[12], qemu_uuid[13],
> > > +                          qemu_uuid[14], qemu_uuid[15]);
> > > +
> > > +    _FDT((fdt_property_string(fdt, "vm,uuid", buf)));
> > > +    g_free(buf);
> > > +
> > > +    _FDT((fdt_begin_node(fdt, "chosen")));
> > > +    _FDT((fdt_property(fdt, "linux,initrd-start",
> > > +                       &start_prop, sizeof(start_prop))));
> > > +    _FDT((fdt_property(fdt, "linux,initrd-end",
> > > +                       &end_prop, sizeof(end_prop))));
> > > +    _FDT((fdt_end_node(fdt)));
> > > +
> > > +    _FDT((fdt_property_cell(fdt, "#address-cells", 0x2)));
> > > +    _FDT((fdt_property_cell(fdt, "#size-cells", 0x2)));
> > > +
> > > +    /* cpus */
> > > +    _FDT((fdt_begin_node(fdt, "cpus")));
> > > +    _FDT((fdt_property_cell(fdt, "#address-cells", 0x1)));
> > > +    _FDT((fdt_property_cell(fdt, "#size-cells", 0x0)));
> > > +
> > > +    CPU_FOREACH(cs) {
> > > +        powernv_create_cpu_node(fdt, cs, smt);
> > > +    }
> > > +
> > > +    _FDT((fdt_end_node(fdt)));
> > > +
> > > +    /* Memory */
> > > +    _FDT((powernv_populate_memory(fdt)));
> > > +
> > > +    /* /hypervisor node */
> > > +    if (kvm_enabled()) {
> > > +        uint8_t hypercall[16];
> > > +
> > > +        /* indicate KVM hypercall interface */
> > > +        _FDT((fdt_begin_node(fdt, "hypervisor")));
> > > +        _FDT((fdt_property_string(fdt, "compatible", "linux,kvm")));
> > > +        if (kvmppc_has_cap_fixup_hcalls()) {
> > > +            /*
> > > +             * Older KVM versions with older guest kernels were broken with the
> > > +             * magic page, don't allow the guest to map it.
> > > +             */
> > > +            kvmppc_get_hypercall(first_cpu->env_ptr, hypercall,
> > > +                                 sizeof(hypercall));
> > > +            _FDT((fdt_property(fdt, "hcall-instructions", hypercall,
> > > +                              sizeof(hypercall))));
> > > +        }
> > > +        _FDT((fdt_end_node(fdt)));
> > > +    }
> > 
> > And a hypercall interface surely doesn't make sense for powernv.
> 
> It's qemu paravirt, it exist on G5 too :-) It's for PR KVM, it allows
> to speed up some bits and pieces. But yeah we don't yet really
> "support" it at this point. However we might.

Ah, yes, I forgot about that.

> > > +
> > > +    _FDT((fdt_end_node(fdt))); /* close root node */
> > > +    _FDT((fdt_finish(fdt)));
> > > +
> > > +    return fdt;
> > > +}
> > > +
> > > +static void powernv_cpu_reset(void *opaque)
> > > +{
> > > +    PowerPCCPU *cpu = opaque;
> > > +    CPUState *cs = CPU(cpu);
> > > +    CPUPPCState *env = &cpu->env;
> > > +
> > > +    cpu_reset(cs);
> > > +
> > > +    env->spr[SPR_PIR] = ppc_get_vcpu_dt_id(cpu);
> > > +    env->spr[SPR_HIOR] = 0;
> > > +    env->gpr[3] = FDT_ADDR;
> > > +    env->nip = 0x10;
> > > +    env->msr |= MSR_HVB;
> > > +}
> > 
> > So, I believe the qemu-ishly correct way of doing this is to have the
> > cpu initialization in the cpu code, rather than the platform code, as
> > much as possible.  On PAPR we kind of get away with initialization in
> > the platform code on the grounds that it's a paravirt platform, but
> > powernv doesn't have that excuse.
> > 
> > But this may well be stuff that changes in later patches, so..
> 
> Well no, not really. But here too, we mimic the state as coming out of
> HostBoot, which isn't quite the same thing. We need to provide the FDT
> entry, etc...
> 
> The "real" reset state of a P8 isn't something we can easily
> simulate... 
> 
> It runs some microcode from a SEEPROM with a small microcontroller
> which initializes a core, which then runs some HB code off its L3
> cache etc... really not something we want to do in qemu, at least
> for now.
> 
> So the initial state here is somewhat in between full virt and
> paravirt, we simulate a platform that has been partially initialized by
> HostBoot, to the state it has when it enters OPAL.

Ok, that makes sense, but I think it needs a bit more explanation in
the code to that effect.

> > > +static const VMStateDescription vmstate_powernv = {
> > > +    .name = "powernv",
> > > +    .version_id = 1,
> > > +    .minimum_version_id = 1,
> > > +};
> > 
> > It might be best to leave out the vmstate entirely until you're ready
> > to implement migration, rather than having a partial, probably not
> > working migration implementation.
> 
> Ok.
> 
> > > +
> > > +static void pnv_create_chip(PnvSystem *sys, unsigned int chip_no)
> > > +{
> > > +    PnvChip *chip = &sys->chips[chip_no];
> > > +
> > > +    if (chip_no >= PNV_MAX_CHIPS) {
> > > +            return;
> > > +    }
> > > +
> > > +    /* XXX Improve chip numbering to better match HW */
> > > +    chip->chip_id = chip_no;
> > 
> > I think modern qemu conventions would suggest creating the chips as
> > QOM objects rather than having a fixed array.
> 
> Yeah, more code & much larger memory footprint for the same result :-)
> 
> I can look into it but it's low priority. I still want to rework some
> of that chip stuff in future patches anyway.
> 
> > > +}
> > > +
> > > +static void ppc_powernv_init(MachineState *machine)
> > > +{
> > > +    ram_addr_t ram_size = machine->ram_size;
> > > +    const char *cpu_model = machine->cpu_model;
> > > +    const char *kernel_filename = machine->kernel_filename;
> > > +    const char *initrd_filename = machine->initrd_filename;
> > > +    uint32_t initrd_base = 0;
> > > +    long initrd_size = 0;
> > > +    PowerPCCPU *cpu;
> > > +    CPUPPCState *env;
> > > +    MemoryRegion *sysmem = get_system_memory();
> > > +    MemoryRegion *ram = g_new(MemoryRegion, 1);
> > > +    sPowerNVMachineState *pnv_machine = POWERNV_MACHINE(machine);
> > > +    PnvSystem *sys = &pnv_machine->sys;
> > > +    long fw_size;
> > > +    char *filename;
> > > +    void *fdt;
> > > +    int i;
> > > +
> > > +    /* init CPUs */
> > > +    if (cpu_model == NULL) {
> > > +        cpu_model = kvm_enabled() ? "host" : "POWER8";
> > > +    }
> > > +
> > > +    for (i = 0; i < smp_cpus; i++) {
> > > +        cpu = cpu_ppc_init(cpu_model);
> > > +        if (cpu == NULL) {
> > > +            fprintf(stderr, "Unable to find PowerPC CPU definition\n");
> > > +            exit(1);
> > > +        }
> > > +        env = &cpu->env;
> > > +
> > > +        /* Set time-base frequency to 512 MHz */
> > > +        cpu_ppc_tb_init(env, TIMEBASE_FREQ);
> > > +
> > > +        /* MSR[IP] doesn't exist nowadays */
> > > +        env->msr_mask &= ~(1 << 6);
> > > +
> > > +        qemu_register_reset(powernv_cpu_reset, cpu);
> > > +    }
> > > +
> > > +    /* allocate RAM */
> > > +    memory_region_allocate_system_memory(ram, NULL, "ppc_powernv.ram", ram_size);
> > > +    memory_region_add_subregion(sysmem, 0, ram);
> > > +
> > > +    /* XXX We should decide how many chips to create based on #cores and
> > > +     * Venice vs. Murano vs. Naples chip type etc..., for now, just create
> > > +     * one chip. Also creation of the CPUs should be done per-chip
> > > +     */
> > > +    sys->num_chips = 1;
> > > +
> > > +    /* Create only one PHB for now until I figure out what's wrong
> > > +     * when I create more (resource assignment failures in Linux)
> > > +     */
> > > +    pnv_create_chip(sys, 0);
> > > +
> > > +    if (bios_name == NULL) {
> > > +        bios_name = FW_FILE_NAME;
> > > +    }
> > > +    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> > > +    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> > > +    if (fw_size < 0) {
> > > +        hw_error("qemu: could not load OPAL '%s'\n", filename);
> > > +        exit(1);
> > > +    }
> > > +    g_free(filename);
> > > +
> > > +
> > > +    if (kernel_filename == NULL) {
> > > +        kernel_filename = KERNEL_FILE_NAME;
> > > +    }
> > > +    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, kernel_filename);
> > 
> > The commit with the OPAL image should go in before this, no?
> 
> Now this is a bit of an open discussion at the moment :-)
> 
> The way OPAL is built on OPP machines today is by essentially building
> a complete flash image which contains HostBoot, OPAL and the petitboot-
> based bootloader which contains a Linux kernel etc...
> 
> We could create a target without HB and with a slimmed down Linux but
> it would still probably be about 12MB I reckon, if not more. It feels a
> bit "big" to ship as a binary as part of qemu...
> 
> We would also have to add code to qemu to "find" OPAL in that image,
> and then add a model for the flash controller.
> 
> The other option is to bundle just OPAL itself. However that means
> you can't go anywhere without a -kernel argument, which would then
> be either a petitboot-based bootloader or your actual target kernel.

Hm, ok.  But in order for this to be usable, we need some way to get a
suitable image.  So medium term, I think it makes sense to include
both opal and PetitBoot, so you can boot the guest like a real
machine.

However, including only Opal and requiring -kernel would be a
reasonable interim step.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
