From: Robert Schöne
Subject: Re: [Libunwind-devel] Question about performance of threaded access in libunwind
Date: Fri, 07 Oct 2016 12:11:12 +0200

Hi,

Thanks for the answer, but I do not think this would help me.

After some debugging I found that changing the caching_policy of
unw_local_addr_space does not affect the as->caching_policy that is
used in dwarf/Gparser.c:get_rs_cache.
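
For reference, a minimal sketch of the standard way to request the
policy (via unw_set_caching_policy; build with -lunwind). Per the
above, the value stored this way was not the one consulted by
get_rs_cache:

    #include <libunwind.h>

    int
    main (void)
    {
      /* Documented API for requesting per-thread caching on the
         local address space.  */
      if (unw_set_caching_policy (unw_local_addr_space,
                                  UNW_CACHE_PER_THREAD) < 0)
        return 1;
      return 0;
    }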

The functions get_rs_cache and put_rs_cache call lock_acquire and
lock_release, respectively. These in turn call the sigprocmask
syscall, which does not scale.
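
Roughly, the locking looks like this (a paraphrase of the internal
helpers in include/libunwind_i.h; the exact code differs between
versions), so every cache access pays two sigprocmask syscalls on top
of a mutex that all threads share under UNW_CACHE_GLOBAL:

    /* Paraphrase of libunwind's internal lock helpers -- an
       approximation, not the verbatim source.  */
    #define lock_acquire(l,m)                                     \
    do {                                                          \
      /* Block all signals, then take the shared cache mutex.  */ \
      sigprocmask (SIG_SETMASK, &unwi_full_sigmask, &(m));        \
      mutex_lock (l);                                             \
    } while (0)

    #define lock_release(l,m)                                     \
    do {                                                          \
      mutex_unlock (l);                                           \
      /* Restore the caller's signal mask.  */                    \
      sigprocmask (SIG_SETMASK, &(m), NULL);                      \
    } while (0)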

If I instead initialize the caching with UNW_CACHE_PER_THREAD in
x86_64/Ginit.c:x86_64_local_addr_space_init, the runtime is
significantly better.
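
The change, roughly (the exact default being replaced may be
UNW_CACHE_GLOBAL or UNWI_DEFAULT_CACHING_POLICY, depending on the
tree):

    /* In x86_64/Ginit.c:x86_64_local_addr_space_init, set the local
       address space's policy to per-thread caching instead of the
       compiled-in default.  */
    local_addr_space.caching_policy = UNW_CACHE_PER_THREAD;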

In my local branch, I solved the problem by implementing two new
functions in mi/init.c that are exposed by the library. With these
functions one can get and set the default local caching policy. The
default must be set before any init_local is called and must not be
changed after the first init_local call. A sketch of the idea follows.
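
The function names below are only illustrative -- the actual names and
details in my branch may differ:

    /* Hypothetical additions to mi/init.c, for illustration only.  */
    #include <libunwind.h>

    static unw_caching_policy_t default_local_caching_policy
      = UNW_CACHE_GLOBAL;

    /* Must be called before the first unw_init_local; the value is
       picked up when the local address space is initialized.  */
    int
    unw_set_default_local_caching_policy (unw_caching_policy_t policy)
    {
      default_local_caching_policy = policy;
      return 0;
    }

    unw_caching_policy_t
    unw_get_default_local_caching_policy (void)
    {
      return default_local_caching_policy;
    }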

Robert


On Thursday, 6 October 2016 at 18:25 +0200, Milian Wolff wrote:
> On Thursday, October 6, 2016 12:55:52 PM CEST Robert Schöne wrote:
> > 
> > Hello,
> > 
> > Could it be that unwinding does not work well with threading?
> > 
> > I run an Intel dual-core system with Hyperthreading on Ubuntu 16.04,
> > and patched tests/Gperf-trace.c so that this part
> 
> 
> I'm the author of heaptrack and have seen dwarf-based unwinding add a
> significant slow-down when profiling multi-threaded applications. The
> reason is mostly the synchronization point within the many calls to
> `dl_iterate_phdr` when encountering non-cached code locations. Once
> everything is cached, libunwind is pretty fast and scales OK across
> threads.
> 
> I have submitted a patch to improve the per-thread caching
> functionality, which has not been accepted upstream yet (the project
> is pretty much unmaintained atm).
> 
> Others have submitted patches to allow replacing `dl_iterate_phdr`
> with something custom, which allows one to cache the `dl_iterate_phdr`
> results once and only update that cache when dlclose/dlopen is called.
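> 
> A rough sketch of that caching idea (my paraphrase, not one of the
> actual patches; a shallow snapshot works because dlpi_name and
> dlpi_phdr point into loader-owned storage that stays valid while the
> object remains loaded, and a real version needs locking around the
> rebuild):
> 
>     #define _GNU_SOURCE
>     #include <link.h>
>     #include <stdlib.h>
> 
>     static struct dl_phdr_info *cache;
>     static size_t cache_len;
> 
>     static int
>     collect (struct dl_phdr_info *info, size_t size, void *data)
>     {
>       (void) size; (void) data;
>       void *p = realloc (cache, (cache_len + 1) * sizeof *cache);
>       if (p == NULL)
>         return 1;                    /* stop iterating on OOM */
>       cache = p;
>       cache[cache_len++] = *info;
>       return 0;                      /* keep iterating */
>     }
> 
>     /* Call once at startup and again after every dlopen/dlclose.  */
>     void
>     phdr_cache_rebuild (void)
>     {
>       cache_len = 0;
>       dl_iterate_phdr (collect, NULL);
>     }
> 
>     /* Replacement that replays the snapshot without entering the
>        loader on every unwind.  */
>     int
>     cached_dl_iterate_phdr (int (*cb) (struct dl_phdr_info *,
>                                        size_t, void *),
>                             void *data)
>     {
>       for (size_t i = 0; i < cache_len; i++)
>         {
>           int rc = cb (&cache[i], sizeof cache[i], data);
>           if (rc != 0)
>             return rc;
>         }
>       return 0;
>     }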
> 
> > 
> > According to perf and strace a significant amount of time is spent in
> > the kernel, i.e. in sigprocmask.
> Can you verify where sigprocmask is coming from, i.e. sample with
> call stacks? One way to do that is sketched below. I remember it
> being a problem once, but don't think it's the main culprit for
> thread scaling.
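> 
> For example, something like this (assuming perf with the syscall
> tracepoints available; the tracepoint name reflects that sigprocmask
> is rt_sigprocmask at the syscall level, and the binary path is just
> the test from your mail):
> 
>     perf record --call-graph dwarf \
>       -e syscalls:sys_enter_rt_sigprocmask ./tests/Gperf-trace
>     perf report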
> 
> Unrelated to this: at this stage, I would recommend looking at an
> alternative to libunwind. elfutils' libdwfl can unwind the stack, and
> is supposedly even faster at it. You have to write more code, but you
> can also implement the address lookups manually, which invalidates
> all of the points above.
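> 
> A rough sketch of the address-lookup side with libdwfl (the callback
> setup follows common usage and is an assumption, not a recipe; build
> with -ldw):
> 
>     #include <elfutils/libdwfl.h>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <unistd.h>
> 
>     static char *debuginfo_path;
>     static const Dwfl_Callbacks callbacks = {
>       .find_elf       = dwfl_linux_proc_find_elf,
>       .find_debuginfo = dwfl_standard_find_debuginfo,
>       .debuginfo_path = &debuginfo_path,
>     };
> 
>     int
>     main (void)
>     {
>       Dwfl *dwfl = dwfl_begin (&callbacks);
>       if (dwfl == NULL)
>         return 1;
> 
>       /* Report this process's modules once; refresh only when the
>          set of loaded objects changes (dlopen/dlclose).  */
>       dwfl_linux_proc_report (dwfl, getpid ());
>       dwfl_report_end (dwfl, NULL, NULL);
> 
>       /* Resolve a sample code address -- here, main itself.  */
>       Dwarf_Addr addr = (Dwarf_Addr) (uintptr_t) &main;
>       Dwfl_Module *mod = dwfl_addrmodule (dwfl, addr);
>       const char *sym = mod ? dwfl_module_addrname (mod, addr) : NULL;
>       printf ("%#jx -> %s\n", (uintmax_t) addr, sym ? sym : "??");
> 
>       dwfl_end (dwfl);
>       return 0;
>     }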
> 
> For inspiration on how to do that, look at the backward-cpp sources:
> https://github.com/bombela/backward-cpp
> 
> Cheers


