Re: [Qemu-devel] [RFC PATCH 0/8] Towards an Heterogeneous QEMU


From: Christian Pinto
Subject: Re: [Qemu-devel] [RFC PATCH 0/8] Towards an Heterogeneous QEMU
Date: Thu, 22 Oct 2015 11:21:22 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.2.0

Hello Peter,

On 07/10/2015 17:48, Peter Crosthwaite wrote:
On Mon, Oct 5, 2015 at 8:50 AM, Christian Pinto
<address@hidden> wrote:
Hello Peter,

thanks for your comments

On 01/10/2015 18:26, Peter Crosthwaite wrote:
On Tue, Sep 29, 2015 at 6:57 AM, Christian Pinto
<address@hidden>  wrote:
Hi all,

This RFC patch series introduces the set of changes and architectural elements
needed to model the architecture presented in a previous RFC letter:
"[Qemu-devel][RFC] Towards an Heterogeneous QEMU".

To recap the goal of that RFC:

The idea is to enhance the current architecture of QEMU to enable the modeling
of a state-of-the-art SoC with an AMP processing style, where different
processing units share the same system memory and communicate through shared
memory and inter-processor interrupts.
This might have a lot in common with a similar inter-qemu
communication solution effort at Xilinx. Edgar talks about it at KVM
forum:

https://www.youtube.com/watch?v=L5zG5Aukfek

Around the 18:30 mark. I think it might be lower level than your proposal;
remote-port is designed to export the raw hardware interfaces (busses
and pins) between QEMU and some other system, another QEMU being the
common use case.
Thanks for pointing this out. Indeed what Edgar presented has a lot
of similarities with our proposal, but it targets a different scenario
where low-level modeling of the various hardware components is taken
into account.
The goal of my proposal is, on the other hand, to enable a set of tools
for high-level early prototyping of systems with a heterogeneous
set of cores, so as to model a platform that does not exist in reality but
that the user wants to experiment with.
As an example, I can envision a programming-model researcher willing to
explore a heterogeneous system based on an X86 and a multi-core ARM
accelerator sharing memory, to build a new programming paradigm on top of it.
Such a user would not need the specific details of the hardware nor all the
various devices available in a real SoC, but only an abstract model
encapsulating the main features needed for their research.

So, to link to your next comment as well: there is no actual SoC/hardware
targeted by this work.
An example is a multi-core ARM CPU
working alongside two Cortex-M microcontrollers.

Marcin is doing something with A9+M3. It sounds like he already has a
lot working (latest emails were on some finer points). What is the
board/SoC in question here (if you are able to share)?

From the user's point of view there is usually an operating system booting on
the Master processor (e.g. Linux) at platform startup, while the other
processors are used to offload some computation from the Master or to deal
with real-time interfaces.
I feel like this is architecting hardware based on common software use
cases, rather than directly modelling the SoC in question. Can we
model the hardware (e.g. the devices that are used for rpmsg and IPIs
etc.) as regular devices, as they are in the SoC? That means AMP is just
another guest?
This is a set of extensions focusing more on the communication channel
between the processors
rather than a full SoC model. With this patch series each of the AMP
processors is a different "guest".
Ok. My issue here is that establishing a 1:1 relationship between QEMU
guests and AMP peers is building use-case policy into the mechanism. I
am seeing the use-case where there is just one guest that sets up AMP
without any QEMU awareness of the AMPness.

It is the Master OS that triggers the boot of the
Slave processors, and also provides them with the binary code to execute (e.g.
RTOS, binary firmware) by placing it into a pre-defined memory area that is
accessible to the Slaves. Usually the memory for the Slaves is carved out from
the Master OS during boot. Once a Slave is booted the two processors can
communicate through queues in shared memory and inter-processor interrupts
(IPIs). In Linux, it is the remoteproc/rpmsg framework that enables the
control (boot/shutdown) of Slave processors, and also establishes a
communication channel based on virtio queues.

Currently, QEMU is not able to model such an architecture, mainly because
only a single processor can be emulated at a time,
SMP does work already. MTTCG will remove the one-run-at-a-time
limitation. Multi-arch will allow you to mix multiple CPU
architectures (e.g. PPC + ARM in the same QEMU). But multiple
heterogeneous ARMs should already just work, and there is already an
in-tree precedent with the xlnx-zynqmp SoC. That SoC has 4xA53 and
2xR5 (all ARM).
Since Multi-arch is not yet available, with this proposal it is possible to
experiment with heterogeneous processors at a high level of abstraction,
even beyond ARM + ARM (e.g. X86 + ARM), using off-the-shelf QEMU.

One thing I want to add is that all the solutions mentioned in this
discussion (Multi-arch, Xilinx's patches, and our proposal) could coexist
from the code point of view, and none would prevent the others from being
used.
We should consolidate where possible though.

Multiple system address spaces, and CPUs having different views of the
address space, is another common snag on this effort, and is discussed
in a recent thread between myself and Marcin.
Yes, I have seen the discussion, but it was mostly dealing with a single
QEMU instance modeling all the cores. Here the different address spaces
are enforced by multiple QEMU instances.
and the OS binary image needs
to be placed in memory at model startup.

I don't see what this limitation is exactly. Can you explain more? I
do see a need to work on the ARM bootloader for AMP flows; it is a
pure SMP bootloader that assumes total control.
The problem here, to me, was that when we launch QEMU a binary needs to be
provided and put in memory in order to be executed. In this patch series the
slave doesn't have proper memory allocated when first launched.
But it could though, couldn't it? Can't the slave guest just have full
access to its own address space (probably very similar to the master's
address space) from machine init time? This seems more realistic than
setting up the hardware based on guest-level information.

Actually, the address space for a slave is built at init time; what is not
completely configured is the memory region modeling the RAM. Such a region is
configured in terms of size, but there is no pointer to the actual memory.
The memory is mmap-ed later, before the slave boots.


The information about memory (fd + offset for mmap) is sent only later when
the boot is triggered. This is also
safe since the slave will be waiting in the incoming state, and thus no
corruption or errors can happen before the
boot is triggered.
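
To make this handshake more concrete, below is a minimal sketch (my own
illustration, not code from the patches) of how a slave instance could receive
the file descriptor and offset over the UNIX domain socket and mmap its
carve-out before booting. The socket path and the two-word message layout are
assumptions made only for the example.

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

/* Connect to the master, receive one SCM_RIGHTS fd plus (size, offset),
 * and map only the carve-out reserved for this slave. */
static void *receive_and_map_ram(const char *sock_path, size_t *ram_size)
{
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);
        return NULL;
    }

    uint64_t payload[2];                 /* assumed layout: size, offset */
    char cbuf[CMSG_SPACE(sizeof(int))];  /* room for one SCM_RIGHTS fd   */
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    if (recvmsg(sock, &msg, 0) <= 0) {
        close(sock);
        return NULL;
    }

    int mem_fd = -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_type == SCM_RIGHTS) {
        memcpy(&mem_fd, CMSG_DATA(cmsg), sizeof(mem_fd));
    }
    close(sock);

    /* The region was already sized at init time; only the backing
     * storage arrives here, right before the boot is triggered. */
    *ram_size = payload[0];
    void *ram = mmap(NULL, payload[0], PROT_READ | PROT_WRITE,
                     MAP_SHARED, mem_fd, (off_t)payload[1]);
    return ram == MAP_FAILED ? NULL : ram;
}
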
Can this effort be a bootloader overhaul? Two things:

1: The bootloader needs to be repeatable
2: The bootloaders need to be targetable (to certain CPUs or clusters)
Well, in this series the bootloader for the master is different from the one
for the slave. In my idea the master, besides the firmware/kernel image, will
also copy a bootloader for the slave.
This patch series adds a set of modules and introduces minimal changes to the
current QEMU code-base to implement what is described above, with master and
slave implemented as two different instances of QEMU. The aim of this work is
to enable application and runtime programmers to test their AMP applications,
or their new inter-SoC communication protocol.

The main changes are depicted in the following diagram and involve:
      - A new multi-client socket implementation that allows multiple
        instances of QEMU to attach to the same socket, with only one acting
        as a master.
      - A new memory backend, the shared memory backend, based on
        the file memory backend. Such a new backend makes it possible, on the
        master side, to allocate the whole memory as shareable (e.g. /dev/shm,
        or hugetlbfs).
        On the slave side it enables the startup of QEMU without any main
        memory allocated. The slave then goes into a waiting state, the same
        used in the case of an incoming migration, and a callback is
        registered on a multi-client socket shared with the master.
        The waiting state ends when the master sends the slave the file
        descriptor and offset to mmap and use as memory.
This is useful in its own right and came up in the Xilinx implementation.
It is also mentioned in the video you are pointing to, where the Microblaze
cores are instantiated as foreign QEMU instances.
Is the code publicly available? There was a question about that in the video
but I couldn't catch the answer.
This would probably be a starting point; this is the remote port
adapter, which is a generic construct for sending/receiving hardware
events to/from things outside QEMU (or other QEMUs):

https://github.com/Xilinx/qemu/blob/pub/2015.2.plnx/hw/core/remote-port.c

This specific device interfaces to the RP adapter and lets you send GPIOs
between QEMUs:

https://github.com/Xilinx/qemu/blob/pub/2015.2.plnx/hw/core/remote-port-gpio.c

Thanks, quite interesting work. As you already said,
that is a lower-level approach, modeling details like the bus transactions.
The memory sharing between the QEMU instances, though, does not go through
the socket but uses the file-backed memory, similar to what we do in our work.

      - A new inter-processor interrupt hardware distribution module, that
        is also used to trigger the boot of slave processors. Such a module
        uses a pair of eventfds for each master-slave couple to trigger
        interrupts between the instances. No slave-to-slave interrupts are
        envisioned by the current implementation.
Wouldn't that just be a software interrupt in the local QEMU instance?
Since in this proposal there will be multiple instances of QEMU running at
the same time, eventfds are used to signal the event (interrupt) among the
different processes. So writing to a register of the IDM will raise an
interrupt to a remote QEMU instance using an eventfd. Did this answer your
question?
I was thinking more about your comment about slave-to-slave
interrupts. This would just trivially be a local software-generated
interrupt of some form within the slave cluster.

Sorry, I did not catch your comment the first time. You are right: if cores
are in the same cluster a software-generated interrupt is going to be enough.
Of course the eventfd-based interrupts make sense for a remote QEMU.
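
For clarity, here is a minimal sketch of the eventfd mechanism as I intend it
(illustrative only, not the code from the series): the master writes to the
eventfd it shares with a slave, and the slave blocks on read() and would then
inject the interrupt into its guest.

#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* One descriptor per master-slave couple, created at setup time and
 * handed to the peer process over the multi-client socket. */
static int idm_create_ipi(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Master side: a write to an IDM register ends up here and raises
 * the IPI towards the selected slave. */
static void idm_raise_ipi(int ipi_efd)
{
    uint64_t one = 1;
    (void)write(ipi_efd, &one, sizeof(one));
}

/* Slave side: block until the master signals, then inject the IRQ
 * (in the real device model this would pulse the guest interrupt). */
static void idm_wait_ipi(int ipi_efd)
{
    uint64_t count;
    if (read(ipi_efd, &count, sizeof(count)) == sizeof(count)) {
        /* raise the interrupt line of the slave's IDM instance here */
    }
}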

The multi-client socket is used for the master to trigger
        the boot of a slave, and also for each master-slave couple to
        exchange the eventfd file descriptors. The IDM device can be
        instantiated either as a PCI or sysbus device.

So if everything is in one QEMU, IPIs can be implemented with just a
regular interrupt controller (which can be set from software).
As said, there are multiple instances of QEMU running at the same time, and
each of them will see the IDM in its memory map.
Even if the IDM instances are physically different, because of the
multiple processes, together they will act as a single block (e.g., a light
version of a mailbox).
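
As a companion to the earlier receive-side sketch (again an assumption, not
the series' actual multi_socket API), this is how the master could hand one
eventfd per master-slave couple to a connected slave over the same UNIX
domain socket, using SCM_RIGHTS ancillary data; the slave_id payload is
purely illustrative.

#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_ipi_eventfd(int sock, int ipi_efd, uint32_t slave_id)
{
    char cbuf[CMSG_SPACE(sizeof(int))] = { 0 };
    struct iovec iov = { .iov_base = &slave_id, .iov_len = sizeof(slave_id) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };

    /* Attach the eventfd as ancillary data so the kernel duplicates it
     * into the receiving QEMU process. */
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &ipi_efd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}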

                             Memory
                             (e.g. hugetlbfs)

+------------------+       +--------------+            +------------------+
|                  |       |              |            |                  |
|   QEMU MASTER    |       |   Master     |            |   QEMU SLAVE     |
|                  |       |   Memory     |            |                  |
| +------+  +------+-+     |              |          +-+------+  +------+ |
| |      |  |SHMEM   |     |              |          |SHMEM   |  |      | |
| | VCPU |  |Backend +----->              |    +----->Backend |  | VCPU | |
| |      |  |        |     |              |    | +--->        |  |      | |
| +--^---+  +------+-+     |              |    | |   +-+------+  +--^---+ |
|    |             |       |              |    | |     |            |     |
|    +--+          |       |              |    | |     |        +---+     |
|       | IRQ      |       | +----------+ |    | |     |    IRQ |         |
|       |          |       | |          | |    | |     |        |         |
|  +----+----+     |       | | Slave    <------+ |     |   +----+---+     |
+--+  IDM    +-----+       | | Memory   | |      |     +---+ IDM    +-----+
     +-^----^--+             | |          | |      |         +-^---^--+
       |    |                | +----------+ |      |           |   |
       |    |                +--------------+      |           |   |
       |    |                                      |           |   |
       |    +--------------------------------------+-----------+   |
       |   UNIX Domain Socket(send mem fd + offset, trigger boot)  |
       |                                                           |
       +-----------------------------------------------------------+
                                eventfd

So the slave can only see a subset of the master's memory? Is the
master's memory just the full system memory, and the master is doing
IOMMU setup for the slave pre-boot? Or is it a hard feature of the
physical SoC?
Yes, slaves can only see the memory that has been reserved for them. This is
ensured by carving out the memory from the master kernel and providing the
offset to such memory to the slave. Each slave will have its own memory map,
and see the memory at the address defined in the machine model.
There is no IOMMU modeled, but it is not a hard feature either, since it is
decided at run-time.
The whole code can be checked out from:
https://git.virtualopensystems.com/dev/qemu-het.git
branch:
qemu-het-rfc-v1

Patches apply to the current QEMU master branch

=========
Demo
=========

This patch series comes in the form of a demo, to better understand how the
changes introduced can be exploited.
In its current status the demo can be executed using an ARM target for both
master and slave.

The demo shows how a master QEMU instance carves out the memory for a slave,
copies in the Linux kernel image and device tree blob, and finally triggers
the boot.

These processes must have an underlying hardware implementation. Is the
master using a system controller to implement the slave boot (setting
reset and entry points via registers)? How hard are they to model as
regular devs?

In this series the system controller is the IDM device, which, through a set
of registers, lets the master "control" each of the slaves. The IDM device is
already seen as a regular device by each of the QEMU instances involved.
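
To give an idea of the guest-visible programming model, here is a purely
hypothetical register layout for such a controller: the names, offsets and
the selector-based scheme below are invented for illustration and are not
taken from the patch series.

#include <stdint.h>

enum {
    IDM_REG_SLAVE_SEL = 0x00, /* which slave the following accesses target */
    IDM_REG_BOOT_ADDR = 0x04, /* entry point inside the slave's carve-out  */
    IDM_REG_BOOT_TRIG = 0x08, /* write 1 to release the slave from waiting */
    IDM_REG_IPI_SET   = 0x0c, /* raise an IPI towards the selected slave   */
};

/* Master-guest helper: program the IDM through its MMIO window
 * (mapped at 'idm') to start one slave at 'entry'. */
static void idm_boot_slave(volatile uint32_t *idm, uint32_t slave,
                           uint32_t entry)
{
    idm[IDM_REG_SLAVE_SEL / 4] = slave;
    idm[IDM_REG_BOOT_ADDR / 4] = entry;
    idm[IDM_REG_BOOT_TRIG / 4] = 1;
}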

I'm starting to think this series is two things that should be
decoupled. One is the abstract device(s) to facilitate your AMP, the
other is the inter-qemu communication. For the abstract device, I
guess this would be a new virtio-idm device. We should try and involve
virtio people perhaps. I can see the value in it quite separate from
modelling the real sysctrl hardware.

Interesting; what other value/usage do you see in it? For me the IDM was
meant to work as an abstract system controller to centralize the management
of the slaves (boot_regs and interrupts).

But I think the implementation
should be free of any inter-QEMU awareness. E.g. from P4 of this
series:

+static void send_shmem_fd(IDMState *s, MSClient *c)
+{
+    int fd, len;
+    uint32_t *message;
+    HostMemoryBackend *backend = MEMORY_BACKEND(s->hostmem);
+
+    len = strlen(SEND_MEM_FD_CMD)/4 + 3;
+    message = malloc(len * sizeof(uint32_t));
+    strcpy((char *) message, SEND_MEM_FD_CMD);
+    message[len - 2] = s->pboot_size;
+    message[len - 1] = s->pboot_offset;
+
+    fd = memory_region_get_fd(&backend->mr);
+
+    multi_socket_send_fds_to(c, &fd, 1, (char *) message, len * sizeof(uint32_t));

The device itself is aware of shared memory and multi-sockets. Using
the device for single-QEMU AMP would require neither - can the IDM
device be used in a homogeneous AMP flow in one of our existing SMP
machine models (e.g. on a dual-core A9 with one core being master and
the other slave)?

Can this be architected in two phases for greater utility, with the
AMP devices as just normal devices, and the inter-qemu communication
as a separate feature?

I see your point, and it is an interesting proposal.

What I can think of here, to remove the awareness of how the IDM communicates
with the slaves, is to define a kind of AMP Slave interface. There would be an
instance of the interface for each of the slaves, encapsulating the
communication part (either local or based on sockets).
The AMP Slave interfaces would be what you called the AMP devices, with one
device per slave.

On the master side, besides the IDM, one would instantiate as many interface
devices as there are slaves. During initialization the IDM will link with all
those interfaces, and only call functions like send_interrupt() or
boot_slave() to interact with the slaves. The interface will be the same for
both local and remote slaves, while the implementation of the methods will
differ and reside in the specific AMP Slave Interface device.
On the slave side, if the slave is remote, another instance of the
interface is instantiated so as to connect to the socket/eventfd.

So, as an example, the send_shmem_fd function you pointed out could be hidden
in the slave interface, and invoked only when the IDM invokes the boot_slave()
function of a remote slave interface.

This would raise the level of abstraction and open the door to potentially
any communication mechanism between master and slave, without the need to
adapt the IDM device to the specific case, or, eventually, to mix local and
remote instances.
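
To sketch what I mean (my own illustration, not code from the series), the
interface could boil down to something like the following; in QEMU proper it
would likely become a QOM interface, but a plain vtable keeps the example
self-contained.

#include <stdint.h>

typedef struct AMPSlaveIf AMPSlaveIf;

typedef struct AMPSlaveIfOps {
    /* Start the slave at 'entry' inside its memory carve-out. */
    void (*boot_slave)(AMPSlaveIf *s, uint64_t entry);
    /* Raise an inter-processor interrupt towards the slave. */
    void (*send_interrupt)(AMPSlaveIf *s, int irq);
} AMPSlaveIfOps;

struct AMPSlaveIf {
    const AMPSlaveIfOps *ops;
    int ipi_efd;   /* used only by the remote implementation */
    int ctrl_sock; /* multi-client socket, remote case only   */
};

/* The IDM only ever goes through the ops, so it stays unaware of
 * whether the slave lives in the same QEMU or in a remote one. */
static inline void idm_boot(AMPSlaveIf *s, uint64_t entry)
{
    s->ops->boot_slave(s, entry);
}

static inline void idm_interrupt(AMPSlaveIf *s, int irq)
{
    s->ops->send_interrupt(s, irq);
}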


Thanks,

Christian


Regards,
Peter



