From: Jitendra Kolhe
Subject: Re: [Qemu-devel] [PATCH RFC] mem-prealloc: Reduce large guest start-up and migration time.
Date: Tue, 7 Feb 2017 13:14:18 +0530
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0

On 1/30/2017 2:02 PM, Jitendra Kolhe wrote:
> On 1/27/2017 6:33 PM, Dr. David Alan Gilbert wrote:
>> * Jitendra Kolhe (address@hidden) wrote:
>>> Using the "-mem-prealloc" option for a very large guest leads to huge
>>> guest start-up and migration times. This is because, with "-mem-prealloc",
>>> qemu tries to map every guest page (create address translations) and
>>> make sure the pages are available during runtime. virsh/libvirt, by
>>> default, seems to use "-mem-prealloc" when the guest is configured to
>>> use huge pages. This patch maps all guest pages simultaneously by
>>> spawning multiple threads. Since the problem is more prominent for
>>> large guests, the patch limits the change to guests of at least 64GB
>>> of memory. Currently the change is limited to QEMU library functions
>>> on POSIX-compliant hosts only, as we are not sure whether the problem
>>> exists on win32. Below are some stats with the "-mem-prealloc" option
>>> for a guest configured to use huge pages.
>>>
>>> ------------------------------------------------------------------------
>>> Idle Guest      | Start-up time | Migration time
>>> ------------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - single threaded (existing code)
>>> ------------------------------------------------------------------------
>>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>>> ------------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>>> ------------------------------------------------------------------------
>>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>>> -----------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>>> -----------------------------------------------------------------------
>>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>>> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
>>> 64 Core - 256GB | 0m11.960s     | 2m0.135s
>>> -----------------------------------------------------------------------
>>
>> That's a nice improvement.
>>
>>> Signed-off-by: Jitendra Kolhe <address@hidden>
>>> ---
>>>  util/oslib-posix.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>  1 file changed, 61 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>> index f631464..a8bd7c2 100644
>>> --- a/util/oslib-posix.c
>>> +++ b/util/oslib-posix.c
>>> @@ -55,6 +55,13 @@
>>>  #include "qemu/error-report.h"
>>>  #endif
>>>  
>>> +#define PAGE_TOUCH_THREAD_COUNT 8
>>
>> It seems a shame to fix that number as a constant.
>>
> 
> Yes, as per the comments received, we will update the patch to take the vcpu count into account.
> 
>>> +typedef struct {
>>> +    char *addr;
>>> +    uint64_t numpages;
>>> +    uint64_t hpagesize;
>>> +} PageRange;
>>> +
>>>  int qemu_get_thread_id(void)
>>>  {
>>>  #if defined(__linux__)
>>> @@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
>>>      siglongjmp(sigjump, 1);
>>>  }
>>>  
>>> +static void *do_touch_pages(void *arg)
>>> +{
>>> +    PageRange *range = (PageRange *)arg;
>>> +    char *start_addr = range->addr;
>>> +    uint64_t numpages = range->numpages;
>>> +    uint64_t hpagesize = range->hpagesize;
>>> +    uint64_t i = 0;
>>> +
>>> +    for (i = 0; i < numpages; i++) {
>>> +        memset(start_addr + (hpagesize * i), 0, 1);
>>> +    }
>>> +    qemu_thread_exit(NULL);
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
>>> +{
>>> +    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
>>> +    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
>>> +    uint64_t    numpage_per_thread, size_per_thread;
>>> +    int         i = 0, tcount = 0;
>>> +
>>> +    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
>>> +    size_per_thread = (hpagesize * numpage_per_thread);
>>> +    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
>>> +        page_range[i].addr = area;
>>> +        page_range[i].numpages = numpage_per_thread;
>>> +        page_range[i].hpagesize = hpagesize;
>>> +
>>> +        qemu_thread_create(page_threads + i, "touch_pages",
>>> +                           do_touch_pages, (page_range + i),
>>> +                           QEMU_THREAD_JOINABLE);
>>> +        tcount++;
>>> +        area += size_per_thread;
>>> +        numpages -= numpage_per_thread;
>>> +    }
>>> +    for (i = 0; i < numpages; i++) {
>>> +        memset(area + (hpagesize * i), 0, 1);
>>> +    }
>>> +    for (i = 0; i < tcount; i++) {
>>> +        qemu_thread_join(page_threads + i);
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>>  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>>  {
>>>      int ret;
>>> @@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>>          size_t hpagesize = qemu_fd_getpagesize(fd);
>>>          size_t numpages = DIV_ROUND_UP(memory, hpagesize);
>>>  
>>> -        /* MAP_POPULATE silently ignores failures */
>>> -        for (i = 0; i < numpages; i++) {
>>> -            memset(area + (hpagesize * i), 0, 1);
>>> +        /* touch pages simultaneously for memory >= 64G */
>>> +        if (memory < (1ULL << 36)) {
>>> +            /* MAP_POPULATE silently ignores failures */
>>> +            for (i = 0; i < numpages; i++) {
>>> +                memset(area + (hpagesize * i), 0, 1);
>>> +            }
>>> +        } else {
>>> +            touch_all_pages(area, hpagesize, numpages);
>>>          }
>>>      }
>>
>> Maybe it's possible to do this quicker?
>> If we are using NUMA, and have separate memory-blocks for each NUMA node,
>> won't this call os_mem_prealloc separately for each node?
>> I wonder if it's possible to get that to run in parallel?
>>
> 
> I will investigate.
> 
Each numa node seems to be treated as an independent qemu object. While 
parsing and creating the object itself we already try to touch its pages 
in os_mem_prealloc(). To parallelize numa node creation we would need to 
modify host_memory_backend_memory_complete() so that the last numa 
object waits for all previously spawned numa-node creation threads to 
finish their job. That involves parsing the cmdline options more than 
once (to identify whether the numa node currently being serviced is the 
last one). Parsing the cmdline in an object-specific implementation does 
not look correct, does it?
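
Very roughly, the coordination would look something like the sketch 
below. This is only an illustration with made-up helper names 
(prealloc_worker, spawn_node_prealloc, wait_for_node_prealloc), not a 
proposed patch; a real version would live in 
host_memory_backend_memory_complete() and use QemuThread rather than 
raw pthreads.

/*
 * Illustrative sketch only, not QEMU code: each memory-backend object
 * kicks off its page touching asynchronously, and whichever object is
 * identified as the last one waits for all of them.
 */
#include <pthread.h>
#include <stddef.h>

#define MAX_NODES 128

struct prealloc_args {
    char *area;          /* start of the node's memory region */
    size_t hpagesize;    /* (huge) page size backing the region */
    size_t numpages;     /* number of pages to touch */
};

static pthread_t node_threads[MAX_NODES];
static int nodes_spawned;

static void *prealloc_worker(void *opaque)
{
    struct prealloc_args *a = opaque;

    for (size_t i = 0; i < a->numpages; i++) {
        a->area[a->hpagesize * i] = 0;   /* touch one byte per page */
    }
    return NULL;
}

/* Called as each numa-node backend is created; 'a' must stay valid
 * until the join below. */
static void spawn_node_prealloc(struct prealloc_args *a)
{
    pthread_create(&node_threads[nodes_spawned++], NULL,
                   prealloc_worker, a);
}

/* Called only for the last backend; knowing which one is last is
 * exactly the part that needs the extra cmdline parsing. */
static void wait_for_node_prealloc(void)
{
    for (int i = 0; i < nodes_spawned; i++) {
        pthread_join(node_threads[i], NULL);
    }
}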

Also, by parallelizing across numa nodes, the number of memset threads 
per node would have to be reduced accordingly so that we don't spawn too 
many threads overall. For example:
# threads spawned per numa node = min(#vcpus, 16) / (# numa nodes)
With the current implementation we would see min(#vcpus, 16) threads 
spawned, working on one numa node at a time. Both implementations should 
have almost the same performance?
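
To make the thread-budget point concrete, the arithmetic would be 
something like the placeholder function below (not QEMU code; "16" is 
just the cap discussed earlier in this thread, and vcpus/numa_nodes 
stand in for the real QEMU state):

/* Hedged sketch of the per-node thread budget discussed above. */
static int touch_threads_per_node(int vcpus, int numa_nodes)
{
    int budget = vcpus < 16 ? vcpus : 16;    /* min(#vcpus, 16) */

    if (numa_nodes <= 1) {
        /* current patch: one os_mem_prealloc() call uses the full budget */
        return budget;
    }
    /* per-node parallelism: split the same budget across all nodes */
    return budget / numa_nodes > 0 ? budget / numa_nodes : 1;
}

Either way the total number of touching threads stays at min(#vcpus, 16), 
which is why I would expect the two approaches to perform about the same.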

Thanks,
- Jitendra
> Thanks,
> - Jitendra
> 
>> Dave
>>
>>> -- 
>>> 1.8.3.1
>>>
>>>
>> --
>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
>>
> 


