Re: [Tinycc-devel] Huge swings in cache performance


From: David Mertens
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Thu, 5 Jan 2017 08:25:23 -0500

Hello everyone,

I have now written a very simple C program which gives highly erratic timing behavior when run under tcc -run. I have added this file to the gist; look for cache-test-simple.c here: https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638

The simple program does not attempt to produce a shared object library, and so should be runnable on any operating system that supports tcc -run, including Windows and Mac in addition to Linux. Here are some sample outputs on my machine:

$ time ./tcc -B. -DNOPS=0 -run cache-test-simple.c
real    0m0.052s
$ time ./tcc -B. -DNOPS=1 -run cache-test-simple.c  ***
real    0m1.413s
$ time ./tcc -B. -DNOPS=2 -run cache-test-simple.c
real    0m0.069s
$ time ./tcc -B. -DNOPS=3 -run cache-test-simple.c
real    0m0.076s
$ time ./tcc -B. -DNOPS=4 -run cache-test-simple.c  ***
real    0m1.158s

The starred results are over an order of magnitude slower than the unstarred results.
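
For anyone who wants to try the idea without fetching the gist, here is a minimal sketch of the kind of measurement involved. It is not the contents of cache-test-simple.c: the names counter, bump and time_calls are made up for illustration, and it leaves out whatever the gist does with NOPS. It only shows the core pattern, a function that modifies a global variable and is called many times while being timed.

```c
#include <time.h>

/* Hypothetical global that the timed function modifies. */
long counter;

/* Hypothetical function under test: its only work is the global write. */
void bump(void)
{
    counter++;
}

/* Call fn `iterations` times and return the CPU time used, in seconds.
 * Note that clock() measures processor time, whereas the `real` line
 * printed by the shell's `time` is wall-clock time. */
double time_calls(void (*fn)(void), long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(bump, 10000000) from main and printing the result would give a number to compare across builds. Whether this stripped-down form shows the same swings is something I would expect to depend on where the code and the global end up in memory, so treat it as a starting point only.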

1) Do others see this on other operating systems with 64-bit Intel processors?
2) Do others see this on any operating system with 64-bit AMD processors?
3) Do others see this on any operating system with any other architecture?

Thanks!
David

On Thu, Jan 5, 2017 at 12:59 AM, David Mertens <address@hidden> wrote:
Update: I *can* get this slowdown with tcc. The main trigger is to have a global variable that gets modified by the function.

I have updated the gist: https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638

The program generates a single function containing a run of skipped operations (their number is a command-line option), ending with a modification of a global variable. It compiles the function with tcc, then calls it a specified number of times (the repeat count is also a command-line option). It can either generate the code in memory, or write a .so file and load it with dlopen. (When generating in memory, it prints the size of the generated code.)
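
To make the description concrete, here is a rough sketch of how the source text for such a function could be built. This is an assumption about the general shape, not the code in the gist: build_source, test_func, scratch and counter are hypothetical names, and using a goto to skip the operations is just one possible mechanism.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write C source for a function `test_func` that
 * jumps over n_ops statements and then modifies the global `counter`.
 * Returns the number of characters written (excluding the final NUL),
 * or -1 if the buffer is too small. */
int build_source(char *buf, size_t cap, int n_ops)
{
    size_t used = 0;
    int n;

    /* Global variable and function prologue; the goto skips the padding. */
    n = snprintf(buf, cap,
                 "int counter;\n"
                 "void test_func(void) {\n"
                 "    int scratch = 0;\n"
                 "    goto done;\n");
    if (n < 0 || (size_t)n >= cap)
        return -1;
    used += (size_t)n;

    /* One never-executed statement per requested operation. */
    for (int i = 0; i < n_ops; i++) {
        n = snprintf(buf + used, cap - used, "    scratch += %d;\n", i + 1);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }

    /* The only statement that runs: the global modification. */
    n = snprintf(buf + used, cap - used,
                 "done:\n"
                 "    counter++;\n"
                 "}\n");
    if (n < 0 || (size_t)n >= cap - used)
        return -1;
    used += (size_t)n;

    return (int)used;
}
```

The resulting string is what would then be handed to the compiler (for example through libtcc's tcc_compile_string). Because tcc does not remove the unreachable statements, each extra operation makes the generated code a little longer, which is the knob being varied in the tables below.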

Here are the interesting results on my machine, all for 10,000,000 iterations, using compilation-in-memory:

N   Code Size (Bytes)   Time (s)
0                 128       2.52
1                 144       2.54
2                 176       2.57
3                 208       0.035
4                 224       0.058
5                 256       2.57
6                 272       0.060

Switching over to a shared object file, I get these results (code size is size of the .so file):
N   Code Size (Bytes)   Time (s)
0                2960       0.057
1                2984       0.040
2                3016       0.058
3                3040       0.039
4                3064       0.040
5                3088       0.060
6                3112       0.063

As you can see, the jit-compiled code is sometimes slower by a factor of 40 or more (about 2.5 s versus 0.035 to 0.060 s), depending on... something. The shared object file, on the other hand, is consistently fast.

Two questions:
1) Can anybody reproduce these effects on their Linux machines, especially different architectures? (I can try an ARM tomorrow.)
2) Is there something special about how tcc builds a shared object file that is not happening with the jit-compiled code?

Thanks!
David

--
 "Debugging is twice as hard as writing the code in the first place.
  Therefore, if you write the code as cleverly as possible, you are,
  by definition, not smart enough to debug it." -- Brian Kernighan


