chicken-users

Re: Performance question concerning chicken flonum vs "foreign flonum"


From: felix . winkelmann
Subject: Re: Performance question concerning chicken flonum vs "foreign flonum"
Date: Thu, 04 Nov 2021 21:04:32 +0100

> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
>
>[...]
>
> It would be great to get some help or explanation with this issue.

Hi!

I get similar timings, and the difference in the number of minor GCs indicates
that the c99-fma variant allocates more stack space and thus triggers more
minor GCs.
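
To illustrate why a larger per-call allocation shows up as more minor GCs,
here is a toy model in C. It assumes a fixed-size nursery that is emptied
whenever the next allocation does not fit. The function name and the sizes are
made up for illustration, and the real runtime also has to account for C stack
frames, so treat this as a sketch only:

```c
#include <assert.h>

/* Toy model (hypothetical, not CHICKEN's runtime): count how often a
   nursery of `nursery_words` words has to be emptied when each of
   `calls` calls allocates `words_per_call` words. */
static unsigned long count_minor_gcs(unsigned long nursery_words,
                                     unsigned long words_per_call,
                                     unsigned long calls)
{
    unsigned long used = 0, gcs = 0, i;

    assert(words_per_call <= nursery_words);
    for (i = 0; i < calls; i++) {
        if (used + words_per_call > nursery_words) {
            gcs++;        /* allocation does not fit: minor GC */
            used = 0;     /* nursery is empty again afterwards */
        }
        used += words_per_call;
    }
    return gcs;
}
```

In this model, 1000 calls against a 1000-word nursery cause 3 collections at
4 words per call and 6 collections at 6 words per call.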

Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the
intermediate result and thus generates relatively decent code:

/* scm-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_187(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
double f0;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_187,c,av);}
a=C_alloc(4);
f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
t5=t1;{
C_word *av2=av;
av2[0]=t5;
av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

The other version allocates a bytevector to hold the result:

/* c99-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(6);
t5=C_a_i_bytevector(&a,1,C_fix(4));
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21(t5,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

I thought that the allocation of 4 words for the bytevector (which is more than
needed on a 64-bit machine) made the difference, but it turns out to be
negligible. Changing it to 2 and also adjusting the values for
C_calculate_demand and C_alloc doesn't seem to change much, but you may want
to try that: just modify the C code and compile it with the same options as
the .scm file.
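
As a rough illustration of the word counts involved, here is a small C model
of how many machine words a boxed double needs. It is only a sketch: the
layout (one header word, padding up to an 8-byte boundary, then the 8-byte
double) and the name boxed_double_words are assumptions made for illustration,
not definitions taken from chicken.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not CHICKEN's actual definition: number of words
   occupied by a boxed double when a word is `word_bytes` bytes wide. */
static size_t boxed_double_words(size_t word_bytes)
{
    size_t offset;

    assert(word_bytes == 4 || word_bytes == 8);
    offset = word_bytes;            /* one header word */
    offset = (offset + 7) / 8 * 8;  /* pad so the double is 8-byte aligned */
    offset += 8;                    /* the double itself */
    return offset / word_bytes;
}
```

With this model, boxed_double_words(8) gives 2 and boxed_double_words(4)
gives 4, which matches the 2 vs. 4 words discussed here.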

On my laptop fma is a library call, so currently my guess is simply that
the scm-fma code is tighter and avoids 3 additional function calls (one to the
stub, one to C_a_i_bytevector and one to fma). The increased number of GCs may
also be caused by the bytevector above, which is used as a placeholder for
the flonum result, which wastes one word.

There is room for improvement in the compiler, though: the C_fix(4) is overly
conservative (4 words are correct on 32-bit, taking care of flonum alignment,
but unnecessary on 64-bit). Also, the bytevector thing is a bit of a hack - we
could actually just pass "a" to stub21 directly. You may want to try this out:

/* c99-fma in k183 in k180 in k177 in k174 (modified) */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(4);
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21((C_word)a,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

This reduces the number of minor GCs on my machine to roughly the same as for
scm-fma. If your compiler inlines stub21 and fma, then you should see
comparable performance. Also, the default optimization level for C is -Os
(pass -v to csc to see what is passed to the C compiler), so using -O2 instead
should make a difference.


felix



