Re: [Tinycc-devel] Huge swings in cache performance

On Thu, Jan 5, 2017 at 6:46 PM, grischka <address@hidden> wrote:

You might try larger "section alignment" for -run:

in tccrun.c:208 instead of
offset = (offset + 15) & ~15;
for example
offset = (offset + 63) & ~63;

This would add more space between your "foo" data variable and
the instructions in memory
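
To make the arithmetic behind this suggestion concrete, here is a minimal sketch. It assumes a 64-byte cache line (typical of recent x86-64 CPUs, but an assumption here) and offsets measured from a 64-byte-aligned base. The helper names align_up and cache_line_of are made up for illustration and are not part of tccrun.c.

```c
#include <stddef.h>

/* Round offset up to the next multiple of align.
   align must be a power of two (16 and 64 both are). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Index of the cache line holding a byte offset, assuming
   line_size bytes per line and a line-aligned base address. */
static size_t cache_line_of(size_t offset, size_t line_size)
{
    return offset / line_size;
}
```

For example, if a section ends at offset 130, align_up(130, 16) places the next section at 144, which is in the same 64-byte line as the last byte of the previous section (offset 129), while align_up(130, 64) places it at 192, the start of the next line.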

--- grischka

Harry van Haaren wrote:

On Thu, Jan 5, 2017 at 2:12 PM, avih <address@hidden> wrote:

I can reproduce x30 variations on Windows with tcc64 (built either using
gcc (mingw) or using tcc64 itself), but for me -DNOPS=2 or 5 or 9 are fast,
and the others (up to 9) are slow. I didn't check further.

I also removed the #include <stdio.h> since it's not where tcc typically
is, and it's not required as far as I can tell, and also removed the -B
thingy (the tcc binary is in the distribution dir on windows and its
default -B location doesn't include anything other than tcc
files/libs/includes).

Same here: I removed the stdio include and the -B. flag. This is tcc
version 0.9.26 (x86-64 Linux) on a recent desktop CPU.
Results are below: even NOPS are slow and odd NOPS are fast up to 8, then
it becomes unpredictable.

Hope that helps, -Harry

PS: My first post to TCC list - awesome project - thanks all! :)

time tcc -DNOPS=0 -run test.c
real 0m1.015s

time tcc -DNOPS=1 -run test.c
real 0m0.043s

time tcc -DNOPS=2 -run test.c
real 0m1.215s

time tcc -DNOPS=3 -run test.c
real 0m0.037s

time tcc -DNOPS=4 -run test.c
real 0m1.008s

time tcc -DNOPS=5 -run test.c
real 0m0.051s

time tcc -DNOPS=6 -run test.c
real 0m1.010s

time tcc -DNOPS=7 -run test.c
real 0m0.036s

time tcc -DNOPS=8 -run test.c
real 0m1.014s

time tcc -DNOPS=9 -run test.c
real 0m1.112s

time tcc -DNOPS=10 -run test.c
real 0m0.041s

time tcc -DNOPS=11 -run test.c
real 0m1.161s

time tcc -DNOPS=12 -run test.c
real 0m0.039s

time tcc -DNOPS=13 -run test.c
real 0m1.482s

time tcc -DNOPS=14 -run test.c
real 0m1.009s

time tcc -DNOPS=15 -run test.c
real 0m1.506s

time tcc -DNOPS=16 -run test.c
real 0m1.005s

On Thursday, January 5, 2017 3:25 PM, David Mertens <address@hidden> wrote:

Hello everyone,

I have now written a very simple C program which gives highly erratic
timing behavior when run under tcc -run. I have added this file to the
gist; look for cache-test-simple.c here:
https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638

The simple program does not attempt to produce a shared object library,
and so should be runnable on any operating system that supports tcc -run,
including Windows and Mac in addition to Linux. Here are some sample
outputs on my machine:

$ time ./tcc -B. -DNOPS=0 -run cache-test-simple.c
real 0m0.052s
$ time ./tcc -B. -DNOPS=1 -run cache-test-simple.c ***
real 0m1.413s
$ time ./tcc -B. -DNOPS=2 -run cache-test-simple.c
real 0m0.069s
$ time ./tcc -B. -DNOPS=3 -run cache-test-simple.c
real 0m0.076s
$ time ./tcc -B. -DNOPS=4 -run cache-test-simple.c ***
real 0m1.158s

The starred results are over an order of magnitude slower than the
unstarred results.
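
Note that `time tcc -run` measures the whole process, compilation included. As a rough sketch of how the calls alone could be timed from inside the program, something like the following might be used. The names time_calls, bump_foo and foo are hypothetical stand-ins and are not taken from cache-test-simple.c.

```c
#include <time.h>

/* Call fn() `iterations` times and return the CPU time used, in seconds. */
static double time_calls(void (*fn)(void), long iterations)
{
    long i;
    clock_t start = clock();
    for (i = 0; i < iterations; i++)
        fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in for the function under test:
   like the test case, it modifies a global variable. */
static long foo;
static void bump_foo(void)
{
    foo++;
}
```

Calling time_calls(bump_foo, 10000000) and printing the result would give a per-run figure that excludes tcc's own compile time. clock() reports CPU time at a fairly coarse resolution, so short runs are only approximate.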

1) Do others see this on other operating systems with 64-bit Intel
   processors?
2) Do others see this on any operating system with 64-bit AMD processors?
3) Do others see this on any operating system with any other architecture?

Thanks!
David

On Thu, Jan 5, 2017 at 12:59 AM, David Mertens <address@hidden> wrote:

Update: I *can* get this slowdown with tcc. The main trigger is to have a
global variable that gets modified by the function.

I have updated the gist:
https://gist.github.com/run4flat/fcbb6480275b1b9dcaa7a8d3a8084638

This program generates a single function filled with a collection of
skipped operations (number of operations is a command-line option) and
finished with a modification of a global variable. It compiles the function
using tcc, then calls the function a specified number of times (repeat
count specified via command-line). It can either generate code in-memory,
or it can generate a .so file and load that using dlopen. (If it generates
in-memory, it prints the size of the generated code.)

Here are the interesting results on my machine, all for 10,000,000
iterations, using compilation-in-memory:

N   Code Size (Bytes)   Time (s)
0   128                 2.52
1   144                 2.54
2   176                 2.57
3   208                 0.035
4   224                 0.058
5   256                 2.57
6   272                 0.060

Switching over to a shared object file, I get these results (code size is
size of the .so file):

N   Code Size (Bytes)   Time (s)
0   2960                0.057
1   2984                0.040
2   3016                0.058
3   3040                0.039
4   3064                0.040
5   3088                0.060
6   3112                0.063

As you can see, the jit-compiled code has odd jumps of 30x speed drops
depending on... something. The shared object file, on the other hand, has
consistently sound performance.

Two questions:
1) Can anybody reproduce these effects on their Linux machines,
   especially different architectures? (I can try an ARM tomorrow.)
2) Is there something special about how tcc builds a shared object file
   that is not happening with the jit-compiled code?

Thanks!
David

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." -- Brian Kernighan


_______________________________________________
Tinycc-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/tinycc-devel


From:	David Mertens
Subject:	Re: [Tinycc-devel] Huge swings in cache performance
Date:	Thu, 5 Jan 2017 22:44:48 -0500