Re: [Qemu-devel] [RFC] x86: Allow to set NUMA distance for different NUMA nodes

From:       Eduardo Habkost
Subject:    Re: [Qemu-devel] [RFC] x86: Allow to set NUMA distance for different NUMA nodes
Date:       Fri, 3 Mar 2017 10:57:04 -0300
User-agent: Mutt/1.7.1 (2016-10-04)
On Fri, Mar 03, 2017 at 01:01:44PM +0800, He Chen wrote:
> Currently, QEMU does not provide a clear way to set vNUMA distances for a
> guest, although we already have the `-numa` option to define vNUMA nodes.
>
> vNUMA distances matter in certain scenarios.
> Right now, if we create a guest that has 4 vNUMA nodes and check the NUMA
> info via `numactl -H`, we will see:
>
> node distance:
> node   0   1   2   3
>   0:  10  20  20  20
>   1:  20  10  20  20
>   2:  20  20  10  20
>   3:  20  20  20  10
>
> The guest kernel treats every local node as distance 10 and every remote
> node as distance 20 when there is no SLIT table, and QEMU does not build one.
> That looks a little strange once you have seen the distances on an actual
> physical machine with 4 NUMA nodes. My machine shows:
>
> node distance:
> node   0   1   2   3
>   0:  10  21  31  41
>   1:  21  10  21  31
>   2:  31  21  10  21
>   3:  41  31  21  10
>
> To set vNUMA distances, the guest should see a complete SLIT table.
> QEMU already provides the `-acpitable` option, which allows users to pass an
> ACPI table to the guest, but it requires users to build the ACPI table
> themselves first. Using `-acpitable` to add a SLIT table may not be very
> straightforward or flexible: whenever the vNUMA configuration changes, another
> SLIT table has to be generated manually. That is not friendly to users or to
> upper-layer software like libvirt.
>
> This RFC patch adds SLIT table support to QEMU and provides an additional
> `distance` option for the `-numa` command so that users can set vNUMA
> distances from the QEMU command line.
>
> With this patch, when a user wants to create a guest that contains several
> vNUMA nodes and also wants to set the distances among those nodes, the QEMU
> command line would look like:
>
> ```
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
> -numa node,nodeid=0,cpus=0,memdev=node0,distance=10,distance=21,distance=31,distance=41 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
> -numa node,nodeid=1,cpus=1,memdev=node1,distance=21,distance=10,distance=21,distance=31 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
> -numa node,nodeid=2,cpus=2,memdev=node2,distance=31,distance=21,distance=10,distance=21 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
> -numa node,nodeid=3,cpus=3,memdev=node3,distance=41,distance=31,distance=21,distance=10 \
> ```
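Just as an illustration (not part of the patch itself): assuming a SLIT is built from the distance values above, `numactl -H` inside the guest should then report a matrix matching the physical-machine example earlier in this message, roughly:

```
node distances:
node   0   1   2   3
  0:  10  21  31  41
  1:  21  10  21  31
  2:  31  21  10  21
  3:  41  31  21  10
```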
>
> BTW, please forgive me for not writing a prettier wrapper for the `distance`
> option.
>
It would be nice to have a more intuitive syntax to represent
ordered lists in QemuOpts. But this is what we have today.
> What do you think of this RFC patch? And shall we consider adding vNUMA
> distance support to QEMU?
>
I think the proposal makes sense. I would like the semantics of the new option
to be documented in qapi-schema.json and qemu-options.hx.
I would call the new NumaNodeOptions field "distances", as it is
a list of distances.
I wonder if we should isolate this into a separate "-numa
distances" option, for greater flexibility.
e.g. the first version could support this:
-numa node,nodeid=0,cpus=0,memdev=node0 \
-numa node,nodeid=1,cpus=1,memdev=node1 \
-numa node,nodeid=2,cpus=2,memdev=node2 \
-numa node,nodeid=3,cpus=3,memdev=node3 \
-numa distances,from-node=0,distances=10,distances=21,distances=31,distances=41 \
-numa distances,from-node=1,distances=21,distances=10,distances=21,distances=31 \
-numa distances,from-node=2,distances=31,distances=21,distances=10,distances=21 \
-numa distances,from-node=3,distances=41,distances=31,distances=21,distances=10
but in the future we could support something like:
-numa node,nodeid=0,cpus=0,memdev=node0 \
-numa node,nodeid=1,cpus=1,memdev=node1 \
-numa node,nodeid=2,cpus=2,memdev=node2 \
-numa node,nodeid=3,cpus=3,memdev=node3 \
-numa distances,distances[0][0]=10,distances[0][1]=21,distances[0][2]=31,distances[0][3]=41,\
distances[1][0]=21,distances[1][1]=10,distances[1][2]=21,distances[1][3]=31,\
distances[2][0]=31,distances[2][1]=21,distances[2][2]=10,distances[2][3]=21,\
distances[3][0]=41,distances[3][1]=31,distances[3][2]=21,distances[3][3]=10
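As a rough sketch only (the struct and field names below are illustrative, not something this RFC defines), the separate option could be modeled in qapi-schema.json as a new struct plus an extra branch in the existing NumaOptions union:

```
##
# @NumaDistOptions:
#
# @from-node: source NUMA node ID.
#
# @distances: ordered list of distances from @from-node to node 0, 1, ...;
#             its length should equal the number of NUMA nodes.
##
{ 'struct': 'NumaDistOptions',
  'data': {
   'from-node': 'uint16',
   'distances': ['uint8'] }}

{ 'union': 'NumaOptions',
  'data': {
    'node': 'NumaNodeOptions',
    'distances': 'NumaDistOptions' }}
```

The repeated distances=... keys would then be collected into the list by the opts visitor, the same way repeated cpus=... values are handled for "-numa node" today.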
> Thanks,
> -He
>
> Signed-off-by: He Chen <address@hidden>
> ---
> hw/i386/acpi-build.c        | 28 ++++++++++++++++++++++++++++
> include/hw/acpi/acpi-defs.h |  9 +++++++++
> include/sysemu/numa.h       |  2 ++
> numa.c                      | 44 ++++++++++++++++++++++++++++++++++++++++++++
> qapi-schema.json            | 12 ++++++++----
> 5 files changed, 91 insertions(+), 4 deletions(-)
>
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 1c928ab..ee8236e 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2395,6 +2395,32 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
> }
>
> static void
> +build_slit(GArray *table_data, BIOSLinker *linker, MachineState *machine)
> +{
> +    struct AcpiSystemLocalityDistanceTable *slit;
> +    uint8_t *entry;
> +    int slit_start, slit_data_len, i, j;
> +    slit_start = table_data->len;
> +
> +    slit = acpi_data_push(table_data, sizeof(*slit));
> +    slit->nb_localities = nb_numa_nodes;
> +
> +    slit_data_len = sizeof(uint8_t) * nb_numa_nodes * nb_numa_nodes;
> +    entry = acpi_data_push(table_data, slit_data_len);
> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        for (j = 0; j < nb_numa_nodes; j++) {
> +            entry[i * nb_numa_nodes + j] = numa_info[i].distance[j];
> +        }
> +    }
> +
> +    build_header(linker, table_data,
> +                 (void *)(table_data->data + slit_start),
> +                 "SLIT",
> +                 table_data->len - slit_start, 1, NULL, NULL);
> +}
> +
> +static void
> build_mcfg_q35(GArray *table_data, BIOSLinker *linker, AcpiMcfgInfo *info)
> {
> AcpiTableMcfg *mcfg;
> @@ -2669,6 +2695,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>      if (pcms->numa_nodes) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, machine);
> +        acpi_add_table(table_offsets, tables_blob);
> +        build_slit(tables_blob, tables->linker, machine);
>      }
>      if (acpi_get_mcfg(&mcfg)) {
>          acpi_add_table(table_offsets, tables_blob);
> diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h
> index 4cc3630..b183a8f 100644
> --- a/include/hw/acpi/acpi-defs.h
> +++ b/include/hw/acpi/acpi-defs.h
> @@ -527,6 +527,15 @@ struct AcpiSratProcessorGiccAffinity
>
> typedef struct AcpiSratProcessorGiccAffinity AcpiSratProcessorGiccAffinity;
>
> +/*
> + * SLIT (NUMA distance description) table
> + */
> +struct AcpiSystemLocalityDistanceTable {
> +    ACPI_TABLE_HEADER_DEF
> +    uint64_t nb_localities;
> +} QEMU_PACKED;
> +typedef struct AcpiSystemLocalityDistanceTable AcpiSystemLocalityDistanceTable;
> +
> /* PCI fw r3.0 MCFG table. */
> /* Subtable */
> struct AcpiMcfgAllocation {
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 8f09dcf..b936eeb 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -21,6 +21,8 @@ typedef struct node_info {
>      struct HostMemoryBackend *node_memdev;
>      bool present;
>      QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> +    uint8_t distance[MAX_NODES];
> +    bool has_distance;
> } NodeInfo;
>
> extern NodeInfo numa_info[MAX_NODES];
> diff --git a/numa.c b/numa.c
> index 9f56be9..b4e11f3 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -50,6 +50,8 @@ static int have_memdevs = -1;
> static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
> * For all nodes, nodeid < max_numa_nodeid
> */
> +static int min_numa_distance = 10;
> +static int max_numa_distance = 255;
> int nb_numa_nodes;
> NodeInfo numa_info[MAX_NODES];
>
> @@ -144,6 +146,8 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
> {
>      uint16_t nodenr;
>      uint16List *cpus = NULL;
> +    uint8List *distance = NULL;
> +    int i;
>
>      if (node->has_nodeid) {
>          nodenr = node->nodeid;
> @@ -208,6 +212,28 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
>          numa_info[nodenr].node_mem = object_property_get_int(o, "size",
>                                                                NULL);
>          numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>      }
> +
> +    if (node->has_distance) {
> +        for (i = 0, distance = node->distance; distance; i++, distance = distance->next) {
> +            if (distance->value >= max_numa_distance) {
> +                error_setg(errp,
> +                           "NUMA distance (%" PRIu8 ")"
> +                           " should be smaller than maxnumadistance (%d)",
> +                           distance->value, max_numa_distance);
> +                return;
> +            }
> +            if (distance->value < min_numa_distance) {
> +                error_setg(errp,
> +                           "NUMA distance (%" PRIu8 ")"
> +                           " should be larger than minnumadistance (%d)",
> +                           distance->value, min_numa_distance);
> +                return;
> +            }
> +            numa_info[nodenr].distance[i] = distance->value;
> +        }
> +        numa_info[nodenr].has_distance = true;
> +    }
> +
>      numa_info[nodenr].present = true;
>      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> }
> @@ -294,6 +320,23 @@ static void validate_numa_cpus(void)
> g_free(seen_cpus);
> }
>
> +static void validate_numa_distance(void)
> +{
> +    int i, j;
> +
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        for (j = i; j < nb_numa_nodes; j++) {
> +            if (i == j && numa_info[i].distance[j] != min_numa_distance) {
> +                error_report("Local distance must be %d!", min_numa_distance);
> +                exit(EXIT_FAILURE);
> +            } else if (numa_info[i].distance[j] != numa_info[j].distance[i]) {
> +                error_report("Unequal NUMA distance between nodes %d and %d!", i, j);
> +                exit(EXIT_FAILURE);
> +            }
> +        }
> +    }
> +}
> +
> void parse_numa_opts(MachineClass *mc)
> {
>      int i;
> @@ -390,6 +433,7 @@ void parse_numa_opts(MachineClass *mc)
>          }
>
>          validate_numa_cpus();
> +        validate_numa_distance();
>      } else {
>          numa_set_mem_node_id(0, ram_size, 0);
>      }
> diff --git a/qapi-schema.json b/qapi-schema.json
> index baa0d26..f2f2d05 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -5597,14 +5597,18 @@
> # @memdev: #optional memory backend object. If specified for one node,
> # it must be specified for all nodes.
> #
> +# @distance: #optional NUMA distance array. The length of this array should
> +# be equal to number of NUMA nodes.
> +#
> # Since: 2.1
> ##
>  { 'struct': 'NumaNodeOptions',
>    'data': {
> -   '*nodeid': 'uint16',
> -   '*cpus': ['uint16'],
> -   '*mem': 'size',
> -   '*memdev': 'str' }}
> +   '*nodeid': 'uint16',
> +   '*cpus': ['uint16'],
> +   '*mem': 'size',
> +   '*memdev': 'str' ,
> +   '*distance': ['uint8'] }}
>
> ##
> # @HostMemPolicy:
> --
> 2.7.4
>
--
Eduardo