From: Robert Schöne
Subject: Re: [Libunwind-devel] Question about performance of threaded access in libunwind
Date: Fri, 07 Oct 2016 12:11:12 +0200
Hi,
Thanks for the answer, but I do not think that this would help me.
After some debugging I found that changing the caching_policy of
unw_local_addr_space does not affect the as->caching_policy that is
used in dwarf/Gparser.c:get_rs_cache.
The functions get_rs_cache and put_rs_cache call lock_acquire
and lock_release respectively. Both of these invoke the sigprocmask
syscall, which does not scale across threads.
If I initialize the caching with UNW_CACHE_PER_THREAD in
x86_64/Ginit.c:x86_64_local_addr_space_init, then the runtime is
significantly better.
In my local branch, I solved the problem by implementing two new
functions in mi/init.c that are exposed by the library. With these
functions one can get and set the default local caching policy. The
default setting must be defined before any init_local is called
and must not be changed after the first init_local call.
Robert
Am Donnerstag, den 06.10.2016, 18:25 +0200 schrieb Milian Wolff:
> On Thursday, October 6, 2016 12:55:52 PM CEST Robert Schöne wrote:
> >
> > Hello,
> >
> > Could it be that unwinding does not work well with threading?
> >
> > I run an Intel dual-core system with Hyperthreading on Ubuntu 16.04,
> > and patched tests/Gperf-trace.c so that this part
>
>
> I'm the author of heaptrack and have seen DWARF-based unwinding add a
> significant slow-down when profiling multi-threaded applications. The
> reason is mostly the synchronization point within the many calls to
> `dl_iterate_phdr` when encountering non-cached code locations. Once
> everything is cached, libunwind is pretty fast and scales OK across threads.
>
> I have submitted a patch to improve the per-thread caching functionality,
> but it has not been accepted upstream yet (the project is pretty much
> unmaintained at the moment).
>
> Others have submitted patches to allow replacing `dl_iterate_phdr` with
> something custom, which allows one to cache the `dl_iterate_phdr` results
> once and only update that cache when dlclose/dlopen is called.
>
> >
> > According to perf and strace a significant amount of time is spent in
> > the kernel, i.e. in sigprocmask.
> Can you verify where sigprocmask is coming from, i.e. sample with call
> stacks?
> I remember it being a problem once, but don't think it's the main culprit for
> thread scaling.
>
> Unrelated to this: at this stage, I would recommend looking at an
> alternative to libunwind. elfutils' libdwfl can also unwind the stack,
> and is supposedly even faster at it. You have to write more code, but
> you can also implement the address lookups yourself, which makes all of
> the points above moot.
>
> For inspiration on how to do that, look at the backward-cpp sources:
> https://github.com/bombela/backward-cpp
>
> Cheers