Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Tue, 20 Dec 2016 23:02:08 -0500
David

Discussion about alignment and execution speed for the Haskell compiler: https://ghc.haskell.org/trac/ghc/ticket/8279

This discussion mentions why things should be aligned, and gives some multi-byte no-ops that can be used for padding for aligned loops: http://stackoverflow.com/questions/18113995/performance-optimisations-of-x86-64-assembly-alignment-and-branch-prediction

I came across a similar issue a few weeks ago, but I was able to "fix" it by allocating more memory than I needed and then relocating to an address within that allocation that was aligned to the start of a page. This seemed to fix the problem back then, but this new flavor of alignment woes is impervious to such a trick.

--
On Tue, Dec 20, 2016 at 10:29 PM, KHMan <address@hidden> wrote:
On 12/20/2016 10:17 PM, David Mertens wrote:
I'm not convinced this is entirely an unpredictable hardware
issue. The reason is that I can easily create similar
functionality with gcc (the usual Perl XS module, the normal means
for writing a C-based extension) and it does not show these kinds
of cache swings. I think there is something gcc does while
producing its machine code that makes it less susceptible to cache
misses. (Well, there are lots of things it does, I'm sure.) I'm
hoping there's one or two simple things that gcc does which tcc
misses and could implement.
Was the behavior observed with Lua noted when working with JIT?
I couldn't find the old posting, but it was along the lines of benchmark variability due to memory layout; see "Mytkowicz memory layout". IIRC, the discussion was about a small benchmark Lua script running in the interpreter. In one posting, changing an environment variable changed the program's total running time significantly; IIRC it was in the 20-50% range. The timings were done casually and nobody did detailed follow-up research.
... which of course are the same executables and is different from your case. Long day and all. But tcc is not much of an optimizing compiler; if the change caused register spilling in an inner loop, it would hammer memory access and account for at least some of the effects...
On Tue, Dec 20, 2016 at 9:05 AM, KHMan wrote:
On 12/20/2016 9:16 PM, David Mertens wrote:
Reminder/Background: C::Blocks is my Perl wrapper around
of tcc with extended symbol table support.
I've begun writing benchmarks to seriously test how C::Blocks
compares with other JIT and JIT-ish options for Perl. I've
a couple of situations in which slight modifications to
cause a huge drop in performance. One benchmark went from
5,000ms (i.e. 5 sec).
The change to the code was so slight that I immediately
cache misses as the culprit. Running with linux's "perf"
gave proof of that (hopefully this format properly with
               Fast    Slow   Significant
time (ms)       370    5022   **
instructions   3.5B    3.5B
branches       640M    650M
branch-miss    687k    671k
dcache-miss    974k     71M   **
icache-miss    3.2M     83M   **
By dcache-miss I refer to what perf calls "L1 dcache load
and by icache-miss I refer to what perf calls "L1 icache
I'm a bit confused on what would cause this sort of persistent
cache miss behavior. In particular, I've tried working
distinct strategies for managing executable memory, including
ensuring page alignment (wrong: it should be line-width
of 64 bytes). This fixed a similar issue previously
didn't seem to improve the situation here. I used malloc
of Perl's built-in memory allocator. I created a pool for
executable memory so that multiple chunks of executable
all be written to the same page in memory. EVEN THIS did
this issue, which really surprised me since I would have
adjacent memory would hash to different caches.
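
The alignment arithmetic discussed above (page alignment versus the 64-byte line width) can be sketched in C as follows. This is only an illustration, not code from C::Blocks or tcc: `align_up`, `alloc_aligned`, and the `CACHE_LINE` constant are hypothetical names, and 64 bytes is simply the line width quoted in this thread. Note that plain malloc memory is not executable; the sketch shows the "allocate more than needed, then pick an aligned address inside the allocation" trick and nothing more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache-line width; 64 bytes is the figure mentioned in the thread. */
#define CACHE_LINE 64

/* Hypothetical helper: round addr up to the next multiple of align.
 * align must be a nonzero power of two (e.g. 64 or 4096). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + (align - 1)) & ~(uintptr_t)(align - 1);
}

/* Hypothetical helper: over-allocate by align - 1 bytes and return an
 * aligned pointer inside the block. *raw receives the original pointer,
 * which is the one that must later be passed to free(). */
static void *alloc_aligned(size_t size, size_t align, void **raw)
{
    *raw = malloc(size + align - 1);
    if (*raw == NULL)
        return NULL;
    return (void *)align_up((uintptr_t)*raw, align);
}
```

Calling `alloc_aligned(n, CACHE_LINE, &raw)` yields a line-aligned start address, while passing 4096 instead would give the page-aligned variant on systems with 4 KiB pages. Either way the caller frees `raw`, not the aligned pointer.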
I believe that what I've found is an issue with tcc, but I
golfed it down to a simple libtcc-consuming example. I can do
that, but wanted to see if anybody could think of an obvious
cause, and fix, without going to such lengths. If not, I
if I can write a small reproducible example.
This kind of behaviour was discussed on the Lua list not long
ago. IIRC, for example changing environment variables changed
the way a program is loaded, and the timing changed. Probably
cache behaviour. It's like, what can we really benchmark anymore?
When modern GHz parts have cache misses and need to access
main memory, they cause such train wrecks that everybody seems
to be moving or have already moved to neural network-based
(perceptron *cough*) branch prediction.
So well, how do we scientifically or meaningfully benchmark
these days, that is the question... (especially for folks in
academia needing to justify benchmark results...)
Kein-Hong Man (esq.)
Tinycc-devel mailing list
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan