From: Bharata B Rao
Subject: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Sun, 30 Oct 2011 00:15:02 +0530
User-agent: Mutt/1.5.21 (2010-09-15)

Hi,

As guests become NUMA aware, it becomes important for them to have
correct NUMA policies when they run on NUMA-aware hosts. Currently,
limited support for NUMA binding is available via libvirt, where it is
possible to apply a NUMA policy to the guest as a whole. However,
multi-node guests would benefit if guest memory belonging to different
guest nodes were mapped appropriately to different host NUMA nodes.

To achieve this, we need QEMU to expose information about guest RAM
ranges (Guest Physical Address - GPA) and their host virtual address
mappings (Host Virtual Address - HVA). Using the GPA and HVA, an external
tool like libvirt would be able to divide the guest RAM as per the guest NUMA
node geometry and bind guest memory nodes to the corresponding host memory
nodes using the HVA. This requires changes in QEMU (and libvirt) as well as
in the kernel:

- System calls that set NUMA memory policies (like mbind) currently work
  only for the current (i.e. the calling) process. These syscalls need to be
  extended so that a process like libvirt is able to set NUMA memory
  policies for another process's (QEMU's) memory ranges; see the sketch of
  today's per-process usage after this list.
- This RFC is about the proposed change in QEMU to export
  GPA and HVA via the QEMU monitor.
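
As an illustration of the per-process limitation mentioned above, here is a
minimal sketch (not part of this patch) of how mbind(2) is used today to bind
a range of the calling process's own memory to a host node. It assumes
libnuma's <numaif.h> and a host that has node 0; the file name and choice of
node are mine, for illustration only. The proposed extension would allow an
equivalent call to target another process's address ranges:

/* bind-self.c: bind an anonymous mapping in our *own* address space
 * to host NUMA node 0.  Build with: gcc bind-self.c -lnuma */
#include <numaif.h>      /* mbind(), MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 1UL << 20;   /* a 1 MB, page-aligned region */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    unsigned long nodemask = 1UL << 0;   /* host node 0 only */
    if (mbind(addr, len, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");      /* fails on non-NUMA kernels */
        return 1;
    }
    return 0;
}

With the GPA-to-HVA map exported below, libvirt could issue one such call per
guest-node range, if only mbind could act on the QEMU process's memory.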
  
The patch against QEMU, present towards the end of this note, is an attempt
to achieve this. It adds a new monitor command "info ram", which prints the
GPA and HVA for the different sections of guest RAM.

For a guest booted with options "-smp sockets=2,cores=4,threads=2
-numa node,nodeid=0,cpus=0-15 -numa node,nodeid=1,cpus=16-31 -cpu core2duo
-m 5g", the exported data looks like this:

******************
(qemu) info ram
GPA: 0-9ffff RAM: 0-9ffff HVA: 0x7efe7fe00000-0x7efe7fe9ffff
GPA: cc000-effff RAM: cc000-effff HVA: 0x7efe7fecc000-0x7efe7feeffff
GPA: 100000-dfffffff RAM: 100000-dfffffff HVA: 0x7efe7ff00000-0x7eff5fdfffff
GPA: fc000000-fc7fffff RAM: 140040000-14083ffff HVA: 0x7efe7f400000-0x7efe7fbfffff
GPA: 100000000-15fffffff RAM: e0000000-13fffffff HVA: 0x7eff5fe00000-0x7effbfdfffff
******************

I will remove the ram_addr (prefixed with RAM:) from the above in the final
version; it is included here only to validate the regions against the
"info mtree" output shown below.

******************
(qemu) info mtree
memory
0000000000000000-7ffffffffffffffe (prio 0): system
  0000000000000000-00000000dfffffff (prio 0): alias ram-below-4g @pc.ram 0000000000000000-00000000dfffffff
  00000000000a0000-00000000000bffff (prio 1): alias smram-region @pci 00000000000a0000-00000000000bffff
  00000000000c0000-00000000000c3fff (prio 1): alias pam-rom @pc.ram 00000000000c0000-00000000000c3fff
  00000000000c4000-00000000000c7fff (prio 1): alias pam-rom @pc.ram 00000000000c4000-00000000000c7fff
  00000000000c8000-00000000000cbfff (prio 1): alias pam-rom @pc.ram 00000000000c8000-00000000000cbfff
  00000000000cc000-00000000000cffff (prio 1): alias pam-ram @pc.ram 00000000000cc000-00000000000cffff
  00000000000d0000-00000000000d3fff (prio 1): alias pam-ram @pc.ram 00000000000d0000-00000000000d3fff
  00000000000d4000-00000000000d7fff (prio 1): alias pam-ram @pc.ram 00000000000d4000-00000000000d7fff
  00000000000d8000-00000000000dbfff (prio 1): alias pam-ram @pc.ram 00000000000d8000-00000000000dbfff
  00000000000dc000-00000000000dffff (prio 1): alias pam-ram @pc.ram 00000000000dc000-00000000000dffff
  00000000000e0000-00000000000e3fff (prio 1): alias pam-ram @pc.ram 00000000000e0000-00000000000e3fff
  00000000000e4000-00000000000e7fff (prio 1): alias pam-ram @pc.ram 00000000000e4000-00000000000e7fff
  00000000000e8000-00000000000ebfff (prio 1): alias pam-ram @pc.ram 00000000000e8000-00000000000ebfff
  00000000000ec000-00000000000effff (prio 1): alias pam-ram @pc.ram 00000000000ec000-00000000000effff
  00000000000f0000-00000000000fffff (prio 1): alias pam-rom @pc.ram 00000000000f0000-00000000000fffff
  00000000e0000000-00000000ffffffff (prio 0): alias pci-hole @pci 00000000e0000000-00000000ffffffff
  00000000fee00000-00000000feefffff (prio 0): apic
  0000000100000000-000000015fffffff (prio 0): alias ram-above-4g @pc.ram 00000000e0000000-000000013fffffff
  4000000000000000-7fffffffffffffff (prio 0): alias pci-hole64 @pci 4000000000000000-7fffffffffffffff
pc.ram
0000000000000000-000000013fffffff (prio 0): pc.ram
******************


The current patch just exports the information and expects external tools
to make use of it for binding. We do understand, however, that memory
ranges can change and that the external tool should be able to respond to
this. The current thinking on how to handle this is:

- Whenever the address range changes, send an async notification to
  libvirt (using QMP perhaps? - see the sketch after this list).
- libvirt will note the change, re-read the current guest RAM mapping
  info and re-bind the regions as appropriate.
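
To make the first step concrete, here is a rough sketch (an assumption, not
part of this patch) of what emitting such an async event could look like with
the existing monitor event machinery. QEVENT_RAM_CHANGE is a hypothetical new
MonitorEvent value, and the payload fields are made up for illustration;
monitor_protocol_event() and qobject_from_jsonf() already exist in the tree:

/* Hypothetical sketch: notify QMP listeners that a guest RAM range
 * moved.  QEVENT_RAM_CHANGE would be a new MonitorEvent enum value;
 * libvirt would react by re-reading "info ram" and re-binding. */
#include "monitor.h"   /* monitor_protocol_event(), QEVENT_* */
#include "qjson.h"     /* qobject_from_jsonf() */

static void ram_layout_changed(target_phys_addr_t start, uint64_t size)
{
    QObject *data;

    data = qobject_from_jsonf("{ 'start': %" PRId64 ", 'size': %" PRId64 " }",
                              (int64_t)start, (int64_t)size);
    monitor_protocol_event(QEVENT_RAM_CHANGE, data);
    qobject_decref(data);
}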

I haven't fully figured out this part (the QEMU-to-libvirt notification)
yet, and any pointers or suggestions here would be useful.

Also a question:

- In what ways can the guest memory layout change? Is the change driven
  by external agents like libvirt (memory hot add), or can things change
  transparently within QEMU? If it's only the former, then we kind of know
  when to do the rebinding.

The patch follows:

---
Export guest RAM address via QEMU monitor.

NUMA-aware QEMU guests running on NUMA systems can benefit from having
guest RAM bound to the appropriate host NUMA node memory. Allow admin
tools like libvirt to achieve this by exporting guest RAM information via
the QEMU monitor.

Signed-off-by: Bharata B Rao <address@hidden>
---

 memory.c  |   33 +++++++++++++++++++++++++++++++++
 memory.h  |    2 ++
 monitor.c |   12 ++++++++++++
 3 files changed, 47 insertions(+), 0 deletions(-)


diff --git a/memory.c b/memory.c
index dc5e35d..3ae10e5 100644
--- a/memory.c
+++ b/memory.c
@@ -1402,3 +1402,36 @@ void mtree_info(fprintf_function mon_printf, void *f)
         mtree_print_mr(mon_printf, f, address_space_io.root, 0, 0, &ml_head);
     }
 }
+
+#if !defined(CONFIG_USER_ONLY)
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    FlatRange *fr;
+
+    FOR_EACH_FLAT_RANGE(fr, &address_space_memory.current_map) {
+        AddrRange ar = fr->addr;
+        ram_addr_t ram;
+        uint8_t *hva;
+
+        ram = cpu_get_physical_page_desc(ar.start);
+
+        /* Only show RAM area */
+        if ((ram & ~TARGET_PAGE_MASK) != IO_MEM_RAM) {
+            continue;
+        }
+        ram &= TARGET_PAGE_MASK;
+        hva = qemu_get_ram_ptr(ram);
+        mon_printf(f, "GPA: %llx-%llx" " RAM: "
+            RAM_ADDR_FMT "-" RAM_ADDR_FMT " HVA: %p-%p\n",
+            (unsigned long long)ar.start,
+            (unsigned long long)(ar.start+ar.size-1),
+            ram, (ram_addr_t)(ram+ar.size-1),
+            hva, hva+ar.size-1);
+    }
+}
+#else
+void ram_info_print(fprintf_function mon_printf, void *f)
+{
+    mon_printf(f, "Not supported\n");
+}
+#endif
diff --git a/memory.h b/memory.h
index d5b47da..b5fb5e0 100644
--- a/memory.h
+++ b/memory.h
@@ -503,6 +503,8 @@ void memory_region_transaction_commit(void);
 
 void mtree_info(fprintf_function mon_printf, void *f);
 
+void ram_info_print(fprintf_function mon_printf, void *f);
+
 #endif
 
 #endif
diff --git a/monitor.c b/monitor.c
index ffda0fe..3b1a7f3 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2738,6 +2738,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname)
     return -1;
 }
 
+static void do_info_ram(Monitor *mon)
+{
+    ram_info_print((fprintf_function)monitor_printf, mon);
+}
+
 static const mon_cmd_t mon_cmds[] = {
 #include "hmp-commands.h"
     { NULL, NULL, },
@@ -3050,6 +3055,13 @@ static const mon_cmd_t info_cmds[] = {
         .mhandler.info = do_trace_print_events,
     },
     {
+        .name       = "ram",
+        .args_type  = "",
+        .params     = "",
+        .help       = "show RAM information",
+        .mhandler.info = do_info_ram,
+    },
+    {
         .name       = NULL,
     },
 };


