From: Lassi Tuura
Subject: [Libunwind-devel] Updated fast trace patch with initial performance results
Date: Wed, 23 Mar 2011 00:50:35 +0100


I've finished a round of testing on a patch set which combines my original
fast trace (1), Jason Evans' improvements (2), and integration directly into
backtrace(), reusing more of libunwind's existing internals (attached).

I ran the following tests on a RHEL5-based x86_64 system. The system libraries
(libc, libm, etc.) are compiled with Red Hat's GCC 4.1.2; all the rest was
built with GCC 4.3.4. I used the igprof memory and performance profilers as
the test tool. I ran all tests four times, twice normally and twice with
taskset restricted to one core, and picked the best performance; the variation
was small. The system is a 4-core, 24 GB RAM Intel Xeon L5520 2.27 GHz with
hyper-threading on. There were two profiling crashes caused by bad unwind
info from GCC 4.3.4-generated code; when that happened I reran the test,
having verified the crash was indeed caused by incorrect unwind info.

Without any profiling or libunwind use the program runs at 99% CPU use for 405
seconds, of which 2.5 s are system time and the rest user time. The program is
single-threaded and loads 650 shared libraries: 22 of them system libraries
(/lib64, /usr/lib64), the other 628 our own.

Here are the results, with this classification:
 P  = performance profile
 M  = memory allocation profile

 NI = non-instrumented bare application
 NT = normal "slow" trace using unw_step() loop
 FT = fast trace as I originally submitted it
 JE = Jason Evan's improved version
 BT = updated fast trace using unw_backtrace()

The table shows timing with average/RMS of the stack depths seen and of the
stack walk time, including per-stack-level time. The times are in clock cycles
from the rdtsc instruction; VSIZE and RSS are in megabytes. Best viewed in a
fixed-width font.

      Time  %     VSIZE  RSS   Walks     Depth     Walk Time   Per Stack Level
-/NI   405s 100%  1548   1251  -         - / -       - / -         - / -

P/NT   412s 102%  1614   1299  68460  29.1 / 7.7  227k / 175k   8729 / 8206
P/FT   408s 101%  1610   1294  67743  29.1 / 7.7   61k / 97k    2980 / 7887
P/JE   408s 101%  1584   1270  67749  29.2 / 7.6   53k / 90k    2580 / 7137
P/BT   410s 101%  1593   1273  68154  29.1 / 7.6   65k / 105k   3111 / 7838

M/NT  5384s 1330% 3078   2774  387M   28.0 / 6.3   26k / 30k   949.4 / [x]
M/FT  1314s 325%  2820   2515  387M   28.0 / 6.3  2767 / 1948  100.5 / 121.0
M/JE  1348s 333%  2800   2494  387M   28.0 / 6.3  2933 / 2869  106.0 / 144.8
M/BT  1251s 309%  2815   2501  387M   28.0 / 6.3  2313 / 2650   84.5 / 114.3

[x] Ran out of 64-bit integer range for calculating RMS on per-level timing.

All runs except M/NT had negligible system time, as mentioned above.
M/NT had roughly 60% user time and 40% system time; in fact it varied
between 60% and 88% user time, with the rest system time, but the total
time stayed about the same.

I tried to make the new patches as we discussed. Please review and note:

- This is still preliminary in that I haven't run the libunwind tests yet.
  I also haven't tried any system other than RHEL5/x86_64.

- I had to rename 'backtrace' to 'unw_backtrace' and add a prototype;
  otherwise it wasn't used: the dynamic linker resolved the calls to
  the system backtrace() instead. I am somewhat uncomfortable about
  attempting to override the system's backtrace() in libunwind anyway :)

- unw_backtrace() now returns one more frame than it did before.

- The frame caches are thread-local and internal to unw_tdep_trace().

- The hashes and the caches are allocated using mempool, which handles
  fixed-size objects only, so this is a hybrid of Jason's hash growth
  code and fixed-size objects. There is no alloc/free callback now.

- My hashes grow faster and remain emptier than Jason's, and I reset the
  contents (rather than rehashing) on resize. This uses slightly more
  memory, but seems to perform slightly better under heavy use.

- I used a simpler hash function than Jason's code had.

- I didn't find any meaningful way to reuse the existing register-set
  caches in the dwarf code. They are LRU caches with a generation
  change, neither of which applies to the frame cache.

- I will still try to squeeze better performance, but would welcome
  feedback on general design - let me know if you find it reasonable.

I think my two patches should really be squashed together, but I kept
them separate so you can compare the code. The underlying fast trace
patch had a few changes: 

- Added unw_tdep_trace() / unw_tdep_frame_t for other architectures.

- Adjusted unw_tdep_frame_t field widths based on value profiling:
  cfa_reg_offset 20 -> 30 bits, r{s,b}p_cfa_offset 20 -> 15 bits wide.

- Improved tdep_stash_frame() debug message and fixed off-by-one bug in
  RSP, RBP offset range validation.

- Adapted the patch for changes.

- Fixed white space warnings.

My apologies especially to Jason Evans for not responding to feedback
on these patches sooner - but thanks for your interest and input :-)



Attachment: fast-trace.tar.bz2
Description: BZip2 compressed data
