
From: Stefan Hajnoczi
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH] qcow2: do lazy allocation of the L2 cache
Date: Fri, 24 Apr 2015 10:26:21 +0100
User-agent: Mutt/1.5.23 (2014-03-12)

On Thu, Apr 23, 2015 at 01:50:28PM +0200, Alberto Garcia wrote:
> On Thu 23 Apr 2015 12:15:04 PM CEST, Stefan Hajnoczi wrote:
> 
> >> For a cache size of 128MB, the PSS is actually ~10MB larger without
> >> the patch, which seems to come from posix_memalign().
> >
> > Do you mean RSS or are you using a tool that reports a "PSS" number
> > that I don't know about?
> >
> > We should understand what is going on instead of moving the code
> > around to hide/delay the problem.
> 
> Both RSS and PSS ("proportional set size", also reported by the kernel).
> 
> I'm not an expert in memory allocators, but I measured the overhead like
> this:
> 
> An L2 cache of 128MB implies a refcount cache of 32MB, in total 160MB.
> With a default cluster size of 64k, that's 2560 cache entries.
> 
> So I wrote a test case that allocates 2560 blocks of 64k each using
> posix_memalign and mmap, and here's how their /proc/<pid>/smaps compare:
> 
> -Size:             165184 kB
> -Rss:               10244 kB
> -Pss:               10244 kB
> +Size:             161856 kB
> +Rss:                   0 kB
> +Pss:                   0 kB
>  Shared_Clean:          0 kB
>  Shared_Dirty:          0 kB
>  Private_Clean:         0 kB
> -Private_Dirty:     10244 kB
> -Referenced:        10244 kB
> -Anonymous:         10244 kB
> +Private_Dirty:         0 kB
> +Referenced:            0 kB
> +Anonymous:             0 kB
>  AnonHugePages:         0 kB
>  Swap:                  0 kB
>  KernelPageSize:        4 kB
> 
> Those are the 10MB I saw. For the record I also tried with malloc() and
> the results are similar to those of posix_memalign().
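
For reference, the test you describe boils down to something like the
following sketch (block count and size taken from your 128 MB example;
the program structure itself is just my guess at how you ran it):

  /* Allocate 2560 blocks of 64 KiB with either posix_memalign() or
   * mmap(), then pause so /proc/<pid>/smaps can be inspected. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #define NB_BLOCKS  2560
  #define BLOCK_SIZE 65536

  int main(int argc, char **argv)
  {
      int use_mmap = argc > 1 && !strcmp(argv[1], "mmap");
      static void *blocks[NB_BLOCKS];
      int i;

      for (i = 0; i < NB_BLOCKS; i++) {
          if (use_mmap) {
              blocks[i] = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          } else if (posix_memalign(&blocks[i], BLOCK_SIZE, BLOCK_SIZE)) {
              blocks[i] = NULL;
          }
      }

      printf("allocated, inspect /proc/%d/smaps\n", (int)getpid());
      pause();  /* keep the process alive for inspection */
      return 0;
  }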

Calling posix_memalign() once per cache entry wastes memory.  I compared:

  posix_memalign(&memptr, 65536, 2560 * 65536);
  memset(memptr, 0, 2560 * 65536);

with:

  for (i = 0; i < 2560; i++) {
      posix_memalign(&memptr, 65536, 65536);
      memset(memptr, 0, 65536);
  }

Here are the results:

-Size:             163920 kB
-Rss:              163860 kB
-Pss:              163860 kB
+Size:             337800 kB
+Rss:              183620 kB
+Pss:              183620 kB

Note the memset simulates a fully occupied cache.
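
For completeness, made self-contained the two patterns above look
roughly like this (a sketch; "single" vs "many" is picked on the
command line, and dropping the memset simulates an empty cache):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define NB_BLOCKS  2560
  #define BLOCK_SIZE 65536

  int main(int argc, char **argv)
  {
      void *memptr;
      int i;

      if (argc > 1 && !strcmp(argv[1], "single")) {
          /* one big 160 MB aligned allocation */
          if (posix_memalign(&memptr, BLOCK_SIZE,
                             (size_t)NB_BLOCKS * BLOCK_SIZE) == 0) {
              memset(memptr, 0, (size_t)NB_BLOCKS * BLOCK_SIZE);
          }
      } else {
          /* 2560 separate 64 KiB aligned allocations */
          for (i = 0; i < NB_BLOCKS; i++) {
              if (posix_memalign(&memptr, BLOCK_SIZE, BLOCK_SIZE) == 0) {
                  memset(memptr, 0, BLOCK_SIZE);
              }
          }
      }

      printf("allocated, inspect /proc/%d/smaps\n", (int)getpid());
      pause();
      return 0;
  }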

The 19 MB RSS difference between the two seems wasteful.  The large
"Size" difference hints that the mmap pattern is very different when
posix_memalign() is called multiple times.

We could avoid the 19 MB overhead by switching to a single allocation.

What's more, dropping the memset() to simulate no cache entry usage
(like in your example) gives a grand total of 20 kB RSS, because the
kernel only populates the pages when they are first touched.  So there
is no point in delaying allocations if we do a single big upfront
allocation.

I'd prefer a patch that replaces the small allocations with a single big
one.  That's a win in both empty and full cache cases.
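
To illustrate, such a patch could look roughly like the sketch below.
The struct and helper names (Qcow2CacheSketch, table_addr,
cache_create_tables) are made up for illustration, and the real code
would use QEMU's aligned-allocation helpers rather than raw
posix_memalign():

  #include <stdint.h>
  #include <stdlib.h>

  /* Sketch only: one backing buffer for all cache tables instead of
   * one allocation per table. */
  typedef struct Qcow2CacheSketch {
      int size;           /* number of tables in the cache */
      size_t table_size;  /* cluster size, e.g. 64 KiB */
      void *table_array;  /* single allocation backing all tables */
  } Qcow2CacheSketch;

  static inline void *table_addr(Qcow2CacheSketch *c, int i)
  {
      return (uint8_t *)c->table_array + (size_t)i * c->table_size;
  }

  static int cache_create_tables(Qcow2CacheSketch *c)
  {
      /* One big aligned allocation; pages that are never written to
       * are never populated by the kernel, so an empty cache costs
       * almost no RSS. */
      return posix_memalign(&c->table_array, c->table_size,
                            (size_t)c->size * c->table_size);
  }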

Stefan

