From: Cédric Le Goater
Subject: Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
Date: Fri, 4 May 2018 15:11:57 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

On 05/04/2018 06:51 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 06:06:14PM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 07:35 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 09:11 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>>>>>> The XIVE presenter engine uses a set of registers to handle priority
>>>>>> management and interrupt acknowledgment, among other things. The most
>>>>>> important ones are:
>>>>>>
>>>>>>   - Pending Interrupt Priority Register (PIPR)
>>>>>>   - Interrupt Pending Buffer (IPB)
>>>>>>   - Current Processor Priority (CPPR)
>>>>>>   - Notification Source Register (NSR)
>>>>>>
>>>>>> There is one set of registers per level of privilege, four in all:
>>>>>> HW, HV pool, OS and User. These are called rings. All registers are
>>>>>> accessible through a specific MMIO region called the Thread Interrupt
>>>>>> Management Area (TIMA) but, depending on the privilege level of the
>>>>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
>>>>>> OS privilege level and therefore can only access the OS and the User
>>>>>> rings. The others are for hypervisor levels.
>>>>>>
>>>>>> The CPU interrupt state is modeled with a XiveNVT object which stores
>>>>>> the values of the different registers. The different TIMA views are
>>>>>> mapped at the same address for each CPU and 'current_cpu' is used to
>>>>>> retrieve the XiveNVT holding the ring registers.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <address@hidden>
>>>>>> ---
>>>>>>
>>>>>>  Changes since v2:
>>>>>>
>>>>>>  - introduced the XiveFabric interface
>>>>>>
>>>>>>  hw/intc/spapr_xive.c        |  25 ++++
>>>>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  include/hw/ppc/spapr_xive.h |   5 +
>>>>>>  include/hw/ppc/xive.h       |  31 +++++
>>>>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>>>>>>  5 files changed, 424 insertions(+)
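
As a reminder of the layout being discussed, this is how I read the rings
and thread registers in xive_regs.h (a sketch listing only the registers
mentioned above; the TM_RING() macro further down masks the offset with
0xf0 to recover the 16-byte ring base):

/* Ring bases in the TIMA: one 16-byte "quad word" per privilege level */
#define TM_QW0_USER      0x00    /* User ring */
#define TM_QW1_OS        0x10    /* OS ring */
#define TM_QW2_HV_POOL   0x20    /* HV pool ring */
#define TM_QW3_HV_PHYS   0x30    /* HW (physical) ring */

/* Byte offsets of the thread registers within a ring */
#define TM_NSR           0x0     /* Notification Source Register */
#define TM_CPPR          0x1     /* Current Processor Priority Register */
#define TM_IPB           0x2     /* Interrupt Pending Buffer */
#define TM_PIPR          0x7     /* Pending Interrupt Priority Register */
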
>>>>>>
>>>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>>>> index 90cde8a4082d..f07832bf0a00 100644
>>>>>> --- a/hw/intc/spapr_xive.c
>>>>>> +++ b/hw/intc/spapr_xive.c
>>>>>> @@ -13,6 +13,7 @@
>>>>>>  #include "target/ppc/cpu.h"
>>>>>>  #include "sysemu/cpus.h"
>>>>>>  #include "monitor/monitor.h"
>>>>>> +#include "hw/ppc/spapr.h"
>>>>>>  #include "hw/ppc/spapr_xive.h"
>>>>>>  #include "hw/ppc/xive.h"
>>>>>>  #include "hw/ppc/xive_regs.h"
>>>>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>>>>  
>>>>>>      /* Allocate the Interrupt Virtualization Table */
>>>>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>>>>> +
>>>>>> +    /* The Thread Interrupt Management Area has the same address for
>>>>>> +     * each chip. On sPAPR, we only need to expose the User and OS
>>>>>> +     * level views of the TIMA.
>>>>>> +     */
>>>>>> +    xive->tm_base = XIVE_TM_BASE;
>>>>>
>>>>> The constant should probably have PAPR in the name somewhere, since
>>>>> it's just for PAPR machines (same for the ESB mappings, actually).
>>>>
>>>> ok. 
>>>>
>>>> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in
>>>> case we want to change the value when the guest is instantiated.
>>>> I doubt we will, but this is an address in the global address space,
>>>> so letting the machine have control is better, I think.
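
Concretely, something like this (a sketch; 'SPAPR_XIVE_TM_BASE' stands in
for whatever the PAPR-prefixed constant ends up being called):

static Property spapr_xive_properties[] = {
    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
    DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
    DEFINE_PROP_UINT64("tm-base", sPAPRXive, tm_base, SPAPR_XIVE_TM_BASE),
    DEFINE_PROP_END_OF_LIST(),
};
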
>>>
>>> I agree.
>>>
>>>>>> +
>>>>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
>>>>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
>>>>>> +                          1ull << TM_SHIFT);
>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
>>>>>> +
>>>>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
>>>>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
>>>>>> +                          1ull << TM_SHIFT);
>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>>>>>>  }
>>>>>>  
>>>>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>>>>  }
>>>>>>  
>>>>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>>>>>> +
>>>>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>>>>> +}
>>>>>
>>>>> So this is a bit of a tangent, but I've been thinking of implementing
>>>>> a scheme where there's an opaque pointer in the cpu structure for the
>>>>> use of the machine.  I'm planning for that to replace the intc pointer
>>>>> (which isn't really used directly by the cpu). That would allow us to
>>>>> have spapr put a structure there and have both xics and xive pointers
>>>>> which could be useful later on.
>>>>
>>>> ok. That should simplify the patchset at the end, in which we need to 
>>>> switch the 'intc' pointer. 
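
Something like this is what I picture for the opaque pointer (all names
here are hypothetical):

/* Machine-owned per-cpu interrupt state, stored behind an opaque
 * pointer in PowerPCCPU instead of the single 'intc' pointer. */
typedef struct sPAPRCPUState {
    ICPState *icp;      /* XICS presenter */
    XiveNVT  *nvt;      /* XIVE presenter */
} sPAPRCPUState;

static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
{
    PowerPCCPU *cpu = spapr_find_cpu(server);

    /* SPAPR_CPU_STATE() would fetch the opaque pointer from the cpu */
    return cpu ? SPAPR_CPU_STATE(cpu)->nvt : NULL;
}
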
>>>>
>>>>> I think we'd need something similar to correctly handle migration of
>>>>> the VPA state, which is currently horribly broken.
>>>>>
>>>>>> +
>>>>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>>>>>      .name = TYPE_SPAPR_XIVE "/ive",
>>>>>>      .version_id = 1,
>>>>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>>>>      dc->vmsd = &vmstate_spapr_xive;
>>>>>>  
>>>>>>      xfc->get_ive = spapr_xive_get_ive;
>>>>>> +    xfc->get_nvt = spapr_xive_get_nvt;
>>>>>>  }
>>>>>>  
>>>>>>  static const TypeInfo spapr_xive_info = {
>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>> index dccad0318834..5691bb9474e4 100644
>>>>>> --- a/hw/intc/xive.c
>>>>>> +++ b/hw/intc/xive.c
>>>>>> @@ -14,7 +14,278 @@
>>>>>>  #include "sysemu/cpus.h"
>>>>>>  #include "sysemu/dma.h"
>>>>>>  #include "monitor/monitor.h"
>>>>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
>>>>>>  #include "hw/ppc/xive.h"
>>>>>> +#include "hw/ppc/xive_regs.h"
>>>>>> +
>>>>>> +/*
>>>>>> + * XIVE Interrupt Presenter
>>>>>> + */
>>>>>> +
>>>>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
>>>>>> +{
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
>>>>>> +{
>>>>>> +    if (cppr > XIVE_PRIORITY_MAX) {
>>>>>> +        cppr = 0xff;
>>>>>> +    }
>>>>>> +
>>>>>> +    nvt->ring_os[TM_CPPR] = cppr;
>>>>>
>>>>> Surely this needs to recheck if we should be interrupting the cpu?
>>>>
>>>> yes. In patch 9, when we introduce the nvt notify routine.
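
For the record, the idea is along these lines (a sketch only, not the
actual patch 9 code):

static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }

    nvt->ring_os[TM_CPPR] = cppr;

    /* Lowering the CPPR can make an interrupt pending in the IPB more
     * favored than the new priority, so re-run the notification logic.
     * 'xive_nvt_notify' is a placeholder for the routine patch 9 adds. */
    xive_nvt_notify(nvt);
}
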
>>>
>>> Ok.
>>>
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * OS Thread Interrupt Management Area MMIO
>>>>>> + */
>>>>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
>>>>>> +                                           unsigned size)
>>>>>> +{
>>>>>> +    uint64_t ret = -1;
>>>>>> +
>>>>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
>>>>>> +        ret = xive_nvt_accept(nvt);
>>>>>> +    } else {
>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
>>>>>> +                      HWADDR_PRIx" size %d\n", offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +#define TM_RING(offset) ((offset) & 0xf0)
>>>>>> +
>>>>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
>>>>>> +                                      unsigned size)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>
>>>>> So, as I said on a previous version of this, we can actually correctly
>>>>> represent different mappings in different cpu spaces, by exploiting
>>>>> cpu->as and not just having them all point to &address_space_memory.
>>>>
>>>> Yes, you did and I haven't studied the question yet. For the next version.
>>>
>>> So, it's possible that using the cpu->as thing will be more trouble
>>> than it's worth.
>>
>> One of the problems is the number of memory regions to use, one per cpu,
> 
> Well, we're already going to have an NVT object for each cpu, yes?  So
> a memory region per-cpu doesn't seem like a big stretch.
> 
>> and the KVM support.
> 
> And I really don't see how the memory regions impact KVM.

The TIMA is set up when the KVM device is initialized, using a specific
ioctl to get an fd on an MMIO region from the host. It is then passed to
the guest as a 'ram_device', and the same goes for the ESBs.

This is not a common region.
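
The shape of it, roughly (a sketch; field names are approximate and the
device-specific ioctl that produces the fd is not shown):

static void spapr_xive_kvm_map_tima(sPAPRXive *xive, int fd, size_t len,
                                    Error **errp)
{
    /* 'fd' is an fd on the host TIMA MMIO region, obtained from the
     * KVM XIVE device. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, 0);

    if (addr == MAP_FAILED) {
        error_setg_errno(errp, errno, "Unable to map the TIMA");
        return;
    }

    /* The guest accesses the TIMA through a 'ram_device' region, not
     * through MemoryRegionOps handlers as in the emulated case. */
    memory_region_init_ram_device_ptr(&xive->tm_mmio, OBJECT(xive),
                                      "xive.tima", len, addr);
}
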
 
>> Having a single region is much easier. 
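
To be clear about the alternative, a per-cpu mapping through cpu->as
would look something like this (a sketch, untested), with one private
TIMA region and one root region per cpu:

static void xive_nvt_init_cpu_as(PowerPCCPU *cpu, MemoryRegion *tm_mmio,
                                 hwaddr tm_base)
{
    MemoryRegion *root = g_new(MemoryRegion, 1);
    MemoryRegion *sysmem = g_new(MemoryRegion, 1);

    /* A per-cpu root: system memory underneath, with this cpu's
     * private TIMA view mapped over it at tm_base. */
    memory_region_init(root, OBJECT(cpu), "xive-cpu-root", UINT64_MAX);
    memory_region_init_alias(sysmem, OBJECT(cpu), "xive-cpu-sysmem",
                             get_system_memory(), 0, UINT64_MAX);
    memory_region_add_subregion_overlap(root, 0, sysmem, 0);
    memory_region_add_subregion_overlap(root, tm_base, tm_mmio, 1);

    cpu_address_space_init(CPU(cpu), 0, "xive-cpu-as", root);
}

Each cpu would need its own tm_mmio region (a region has a single
parent), which is the multiplication I mentioned above.
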
>>
>>> I am a little concerned about using current_cpu though.  
>>> First, will it work with KVM with kernel_irqchip=off - the
>>> cpus are running truly concurrently,
>>
>> FWIW, I haven't seen any issue yet while stress testing.
> 
> Ok.
> 
>>> but we still need to work out who's poking at the TIMA.  
>>
>> I understand. The registers are accessed by the current cpu to set the
>> CPPR and to ack an interrupt. But when we route an event, we also access
>> and modify the registers. Do you suggest some locking? I am not sure
>> how the TIMA region accesses are protected vs. the routing, which is
>> necessarily initiated by an ESB MMIO though.
> 
> Locking isn't really the issue.  I mean, we do need locking, but the
> BQL should provide that.  The issue is what exactly does "current"
> mean in the context of multiple concurrently running cpus.  Does it
> always mean what we need it to mean in every context we might call
> this from?

I would say so.

C.

>>> Second, are there any cases where we might
>>> need to trip this "on behalf of" a specific cpu that's not the current
>>> one.
>>
>> ah, yes. Sort of :) Only in powernv, when the xive is reset (and when
>> dumping the state for debug).
>>
>> The IC has a way to access the registers of a HW thread indirectly.
>> It first sets the PC_TCTXT_INDIR_THRDID register with the PIR of
>> the targeted thread; loads on the indirect TIMA can then be done
>> as if it were the current thread. The indirect TIMA is mapped 4 pages
>> after the IC BAR.
>>
>> The resulting memory region op is a little ugly and might need
>> some rework:
>>
>> static uint64_t xive_tm_hv_read(void *opaque, hwaddr offset,
>>                                  unsigned size)
>> {
>>     PowerPCCPU **cpuptr = opaque;
>>     PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>>     ...
>>
>>
>>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>>> +    uint64_t ret = -1;
>>>>>> +    int i;
>>>>>> +
>>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>>> +        return xive_tm_read_special(nvt, offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>>>>>> +                      HWADDR_PRIx"\n", offset);
>>>>>> +        return ret;
>>>>>
>>>>> Just returning -1 would be clearer here.
>>>>
>>>> ok.
>>>>
>>>>>
>>>>>> +    }
>>>>>> +
>>>>>> +    ret = 0;
>>>>>> +    for (i = 0; i < size; i++) {
>>>>>> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static bool xive_tm_is_readonly(uint8_t offset)
>>>>>> +{
>>>>>> +    return offset != TM_QW1_OS + TM_CPPR;
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
>>>>>> +                                        uint64_t value, unsigned size)
>>>>>> +{
>>>>>> +    /* TODO: support TM_SPC_SET_OS_PENDING */
>>>>>> +
>>>>>> +    /* TODO: support TM_SPC_ACK_OS_EL */
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>>>>>> +                                   uint64_t value, unsigned size)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>>> +    int i;
>>>>>> +
>>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>>> +        xive_tm_write_special(nvt, offset, value, size);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>>
>>>>> Why have this if you have separate OS and user regions as you appear
>>>>> to do below?
>>>>
>>>> This is another problem we are trying to solve. 
>>>>
>>>> The registers a CPU can access depend on the TIMA view it is using.
>>>> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
>>>>
>>>>> Or to look at it another way, shouldn't it be possible to make the
>>>>> read/write accessors the same for the OS and user rings?
>>>>
>>>> For some parts yes, but the special load/store addresses are different
>>>> for each view, as are the read-only registers. It seemed easier to
>>>> duplicate.
>>>>
>>>> I think the problem will become clearer (or worse) with pnv, which uses
>>>> the HV mode.
>>>
>>> Oh.  I had the impression that each ring had a basically identical set
>>> of registers and you just had access to the region for your ring and
>>> the ones below.  Are you saying instead it's basically a single block
>>> of registers with various different privilege levels for each of them?
>>
>> yes. I think I answered this question more clearly in a previous email.
>>
>>> [snip]
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
>>>>>> +{
>>>>>> +    qemu_unregister_reset(xive_nvt_reset, dev);
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_init(Object *obj)
>>>>>> +{
>>>>>> +    XiveNVT *nvt = XIVE_NVT(obj);
>>>>>> +
>>>>>> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
>>>>>
>>>>> The ring_os field is basically pointless, being just an offset into a
>>>>> structure you already have.  A macro or inline would be a better idea.
>>>>
>>>> ok. I liked the idea but I agree it's overkill to have an init routine
>>>> just for this. I will find something.
>>>
>>> That too, but it's also something that looks like an optimization but
>>> isn't, which is bad practice.  On modern cpus math is cheap (and this
>>> is just a trivial offset), memory accesses are expensive.  You're
>>> essentially caching this offset - raising all the usual invalidation
>>> questions for a cache - when caching it is *more* expensive than just
>>> computing it every time.
>>
>> ok. Removing this offset was a good opportunity to generalize the
>> routing algorithm and use a 'ring' parameter in all routines. Same
>> for the accept path.
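
i.e. something like this, computed at each use (sketch):

static inline uint8_t *xive_nvt_ring(XiveNVT *nvt, uint8_t ring)
{
    return &nvt->regs[ring];
}

/* e.g. xive_nvt_ring(nvt, TM_QW1_OS)[TM_CPPR] = cppr; */
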
>>
>>
>> C.
>>
> 



