From: Peter Bex
Subject: [Chicken-hackers] [PATCH] Add a statistical profiler
Date: Sun, 3 Jan 2016 20:20:19 +0100
User-agent: Mutt/1.5.21 (2010-09-15)
Hi all,

For a while now I've been meaning to improve our profiler. I finally got around to writing a new profiler a few weeks ago and I've been testing it on and off since then. I think it's now ready to start integrating. I've successfully made some performance improvements in uri-generic and intarweb. The irregex performance improvement I sent to chicken-hackers was also a result of running it under the new profiler.

I believe the profiler needs improving because of these limitations:

* Instrumentation adds a lot of overhead and prevents various optimizations, because it requires dynamic-wind and turns all calls to instrumented procedures into CPS calls. This skews the results, sometimes so extremely that the profile output becomes useless. If you want to see an example of this, try profiling the "nbody" benchmark from Mario's chicken-benchmarks repo: it goes from 1s to 17s on my laptop. Other examples include "knucleotide", "primes" and "ray2".

* Only toplevel procedures are instrumented. This means that programs with large procedures, or programs which rely heavily on closures, are hard to profile. It's possible to work around this by tweaking the source code, but that is very annoying to do.

* Checking the performance of a program requires recompilation with -profile. This is only a minor annoyance, but sometimes you want to look "deeper" inside the library calls your program performs, and that means you have to recompile those libraries with -profile as well (and possibly the libraries they use, and so on), which gets annoying pretty quickly. Also, you can easily forget you did that for a library, and that will slow down every program using it.

The attached patches add a *second* profiler to CHICKEN, which uses statistical sampling to figure out where a program spends most of its time. It registers a timer with setitimer(). This will trigger a signal to be delivered every 10ms (this is customisable via "-:P").
When a signal arrives, we look at the top of the trace buffer to find the currently running procedure. This means the profiler needs *no* additional instrumentation or even recompilation: libraries and programs compiled without -no-trace or -d0 (i.e., most of them) are instantly profileable! This statistical sampling method is basically what gprof(1) and profil(3) also do, from what I understand of their documentation.

To see it in action, just compile the target program as normal, and then pass it the "-:p" switch when running it. This switch is recognised by the runtime system, so it can be passed to any program. On exit, the program writes out a "PROFILE.1234" file, where 1234 is the pid, just like under the instrumentation-based profiler. This file can only be read with the new version of chicken-profile, which I've tweaked a tiny bit to understand the new profile format, which now has a header.

The tweak to chicken-profile is necessary because the statistical profiler has very little context to work with: it only sees the top of the trace buffer, so it knows which procedure is currently running. Because it takes a sample every 10ms, it can count how much time is spent in that procedure, but it doesn't really know how it got there, as it can't carefully track procedure entry/exit events like the instrumentation does under the old profiler. So while we know how much time is spent in a procedure, we can only calculate the percentage of the entire running time spent in each procedure individually. This means the percentages add up to 100, which is different from the original profiler: there, the percentage is 100 for the "main" procedure, and less for everything called by it. Because of this, both profilers now write a "header" symbol at the start of the profile, which chicken-profile recognises and changes its behaviour on.
The new profiler has none of the disadvantages of the old profiler, but it comes with its own set of caveats:

* If a program blocks signals, profiling is unreliable. I got a report from Kooda that SDL's vsync() seems to do this, so this is not always directly under the programmer's control.

* The profile results are very different: the profiler shows how much time is spent in the procedure that was called last, making it very low-level. This can sometimes result in a harder-to-read profile, with lots of detail. The old profiler would simply state "procedure X is slow", while the new one will say "call (A) on line 1 is slowest, followed by call (B) on line 1507 and call (C) on line 5", even though lines 1 and 5 are both part of procedure X. Ideally you would want some way to sum the calls belonging to the same procedure. On the other hand, it can also be useful to know what each *part* of a procedure contributes to the profile.

* Due to the statistical approach, the call counts can be *completely* wrong. If you have very fast procedures which are called extremely often, the reported number of calls will be far too low, because many calls are missed between two samples.

Finally, I should mention two more caveats, though I don't think these are huge problems:

* Due to the statistical approach, the time spent can be over- or under-allocated: we simply allocate the full 10ms to whatever procedure is running at the moment we take the sample. In general this should not be a problem, because we take so many samples that it averages out; in the long run, small, fast procedures should not be running disproportionately often when a sample is taken.

* The code is all in C. This is required to reduce the impact on performance as much as possible, and to avoid having to recompile. Also, code running with interrupts disabled would not receive the signal if we processed it in Scheme. Finally, the profiler might always see itself running when taking a sample if it were written in Scheme!
Currently, profiling only works for compiled programs, but I think that with some more or less invasive changes it may be possible to profile even programs running under the interpreter. This could not be made to work with the old profiler, as far as I can see.

Given the above laundry list of caveats, and the elegant approach of the old profiler (all in Scheme, very little code, precise and more high-level results), I think we should not be removing that one just yet. In fact, I think we could even merge some parts of the code, which should also remove some of the old profiler's overhead, reducing some of its problems.

I'm sure the above text sounds very rambling, so please feel free to ask if you have questions. I'm also preparing a blog post that describes the new profiler in a more structured and easy-going way. There's a lot more we could potentially do to improve either profiler, but let's get started by applying this first.

Cheers,
Peter
0001-Add-simple-statistical-profiler-to-runtime-library.master.patch
Description: Text Data
0002-Support-profiling-on-Windows-with-native-timers.master.patch
Description: Text Data
0003-Fix-statistical-percentage-chicken-profile-output.master.patch
Description: Text Data
0001-Add-simple-statistical-profiler-to-runtime-library.chicken-5.patch
Description: Text Data
0002-Support-profiling-on-Windows-with-native-timers.chicken-5.patch
Description: Text Data
0003-Fix-statistical-percentage-chicken-profile-output.chicken-5.patch
Description: Text Data
signature.asc
Description: Digital signature