Re: [Qemu-devel] [PATCH 0/2] linux-user: Change mmap_lock to rwlock


From: Emilio G. Cota
Subject: Re: [Qemu-devel] [PATCH 0/2] linux-user: Change mmap_lock to rwlock
Date: Sat, 23 Jun 2018 14:20:39 -0400
User-agent: Mutt/1.5.24 (2015-08-30)

On Sat, Jun 23, 2018 at 08:25:52 -0700, Richard Henderson wrote:
> On 06/22/2018 02:12 PM, Emilio G. Cota wrote:
> > I'm curious to see how much perf could be gained. It seems that the hold
> > times in SVE code for readers might not be very large, which
> > then wouldn't let us amortize the atomic inc of the read lock
> > (IOW, we might not see much of a difference compared to a regular
> > mutex).
> 
> In theory, the uncontended case for rwlocks is the same as a mutex.

In the fast path, wr_lock/unlock have one more atomic operation than
mutex_lock/unlock. The perf difference is quite large in
microbenchmarks, e.g. after changing tests/atomic_add-bench to
use pthread_mutex or pthread_rwlock_wrlock instead of
an atomic operation (enabled with the added -m flag):

$ taskset -c 0 perf record tests/atomic_add-bench-mutex  -d 4 -m
 Throughput:         62.05 Mops/s

$ taskset -c 0 perf record tests/atomic_add-bench-rwlock  -d 4 -m
 Throughput:         37.68 Mops/s
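
(For reference, this is roughly what the two -m variants reduce to --
a sketch, not the actual tests/atomic_add-bench code: the same
shared-counter increment guarded either by a mutex or by a
write-locked rwlock.)

#include <pthread.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static uint64_t counter;

static void inc_with_mutex(void)
{
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
}

static void inc_with_rwlock(void)
{
    pthread_rwlock_wrlock(&rwlock);
    counter++;
    pthread_rwlock_unlock(&rwlock);
}

int main(void)
{
    /* Single-threaded (uncontended) demo loop; the real benchmark
     * runs the increments from several threads and reports Mops/s. */
    for (int i = 0; i < 1000000; i++) {
        inc_with_mutex();
        inc_with_rwlock();
    }
    printf("counter = %" PRIu64 "\n", counter);
    return 0;
}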

That said, real user-space code (i.e. not microbenchmarks) is
unlikely to be sensitive to the additional delay and/or lower
scalability. It is common to avoid frequent calls to mmap(2) due
to potential serialization in the kernel -- memory allocators,
for instance, do a few large mmap calls and then manage the
memory themselves.
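
A minimal sketch of that allocator pattern (not any particular
allocator's code): map one large region up front and hand out chunks
with a bump pointer, so mmap(2) -- and with it the mmap_lock in
qemu-linux-user -- is only taken when a new arena is needed.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define ARENA_SIZE (64 * 1024 * 1024)   /* one large mapping up front */

static uint8_t *arena;
static size_t arena_off;

/* Bump-pointer allocation: no mmap() per allocation. Not thread-safe;
 * real allocators add per-thread caches and free lists on top of
 * this basic idea. */
static void *arena_alloc(size_t size)
{
    if (!arena) {
        arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) {
            return NULL;
        }
    }
    size = (size + 15) & ~(size_t)15;   /* 16-byte alignment */
    if (arena_off + size > ARENA_SIZE) {
        return NULL;                    /* arena exhausted */
    }
    void *p = arena + arena_off;
    arena_off += size;
    return p;
}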

To double-check I ran some multi-threaded benchmarks from
Hoard[1] under qemu-linux-user, with and without the rwlock change,
and couldn't measure a significant difference.

[1] https://github.com/emeryberger/Hoard/tree/master/benchmarks

> > Are you using any benchmark that shows any perf difference?
> 
> Not so far.  Glibc has some microbenchmarks for strings, which I will try next
> week, but they are not multi-threaded.  Maybe just run 4 threads of those
> benchmarks?

I'd run more threads if possible. I have access to a 64-core machine,
so ping me once you identify benchmarks that are of interest.

                Emilio


