[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gcl-devel] Re: gzipped tar file on profiling

From: Camm Maguire
Subject: Re: [Gcl-devel] Re: gzipped tar file on profiling
Date: 30 Dec 2003 17:11:42 -0500
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2

Greetings!  Just a preliminary note here -- the overwhelming majority
of the gbc cost with or without sgc is in collecting fixnum garbage.
Furthermore, the vast majority of this is generated in your compiled
lisp code.  

As you probably know, if GCL can figure out that an integer is a
fixnum, usually via proclaims or declares, it will compile it directly
to a machine long integer avoiding any memory allocation.  Otherwise,
it calls the allocating routine make_fixnum.  

At present, the only known (to me) system macro which often
unnecessarily allocates fixnums is dotimes -- if your code uses this a
lot let me know and I'll send you a replacement to try.  

Otherwise these massive calls to make_fixnum are likely in code
written by the compiler.  You likely know which functions of yours
take the most time.  You can look for calls to make_fixnum via the
(disassemble...)  function which will output the generated C code.  A
simple (declare (fixnum i)) or similar where appropriate will likely
eliminate most of the runtime.  

To be a bit more specific, to avoid any allocation, the compiler needs
to know both that the bounds of the integer are appropriate (which can
be gleaned from the (declare...) statement), and that any functions
using the variable as an argument have been compiled to take machine
integer arguments (which information typically comes from
(proclaims...)).  The compiler will "promote" a machine integer to a
lisp object via allocation if the integer is passed to a function
taking lisp objects as arguments.  You also probably know about the
automatic way of making a sys-proclaim.lsp for use by the compiler
in this regard.  Briefly, just load gcl_collectfn.o and execute
(compiler::emit-fn t) before compiling (generating .fn files for each
.lisp file), and then (compiler::make-all-proclaims "*.fn").  Then
load sys-proclaim.lsp before recompiling, or put this load command in
a gcl_cmpinit.lsp file in the source directory.  All this needs to be
automated in my opinion.

The only current drawback of gprof profiling is that compiled lisp
code as conventionally loaded into the program's .data section is all
lumped together in one entry labeled <hicore>.  I'll be working on
improving this at some point.  For now, you can build your final
executable in an alternate way which will retain the identities of the
compiled lisp functions for the purposes of profiling should you be
unable to identify the culprit yourself.  Basically you set the
variable compiler::*default-system-p* to t before compiling, and
instead of using (si::save-system...), do

        (list "module1.o" "module2.o" ...)
        "(progn (lisp-code-to-run-after-loading-and-before-saving))")

Here is how I do this for acl2 (this method is alas required on some

BINARIES:=acl2-fns.o axioms.o basis.o translate.o type-set-a.o type-set-b.o 
rewrite.o simplify.o \
          bdd.o other-processes.o induct.o history-management.o prove.o 
defuns.o proof-checker-a.o\
          defthm.o other-events.o ld.o proof-checker-b.o tutorial.o 
interface-raw.o \
          linear-a.o linear-b.o non-linear.o TMP1.o
BIN:=$(patsubst %,"%",$(BINARIES))

INIT:=(initialize-acl2 (quote include-book) *acl2-pass-2-files* t t)
POST:=(load \"foo.lsp\")\
      (setq compiler::*default-system-p* nil)\
      (in-package \"acl2\")\
      (save-acl2 (quote (initialize-acl2 (quote include-book) 
*acl2-pass-2-files* t)) \"saved_acl2\"))

nsaved_acl2: foo.lsp
        echo '(load "foo.lsp")(in-package "acl2")(compile-acl2)' | gcl
        echo '(load "foo.lsp")(in-package "acl2")(load-acl2)$(INIT)' | gcl
        echo '(compiler::link (list $(BIN)) "$@" "$(POST)" "" nil)' |gcl

I'd be interested in knowing any results, as my current guess is that
Allegro can infer a fixnum where we cannot at present.  If you can
give me the details of such misinference, I can try to fix it.

I would have guessed that you were doing a lot of array indexing, but
this appears not to be the case.

Take care, 

Matt Kaufmann <address@hidden> writes:

> Camm --
> The run has completed.  It seemed to take much longer without SGC.  I'll send
> you the uuencoded gzipped tar file (509K) now (see the README after you expand
> it).  I'll spare those CCed on this email from that file.
> -- Matt
>    cc: address@hidden, address@hidden, address@hidden,
>          address@hidden
>    From: Camm Maguire <address@hidden>
>    Date: 17 Dec 2003 17:11:44 -0500
>    User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2
>    Content-Type: text/plain; charset=us-ascii
>    Thanks Matt!  I got it now.  I wonder if you had also saved the gbc
>    output -- if I recall you usually run with *notify-gbc* turned on.
>    Also, a (room) before and after would be of interest, as well as the
>    effect of (si:sgc-on nil).
>    On the one hand, GBC is clearly the performance culprit, and that is
>    very good news, as it is relatively easy to address with a better
>    memory layout.  What is odd about your profiling though is that
>    marking is taking up the lion's share of the time.  In normal GBC, its
>    the sweeper that costs, at least in the tests I've run so far.  And I
>    think I know why, it is the following extra pass that's required with
>    sgc marking:
>      /* mark all non recent data on writable pages */
>      {
>        int t,i=page(heap_end);
>        struct typemanager *tm;
>        char *p;
>        while (--i >= 0) {
>        if (WRITABLE_PAGE_P(i)
>            && (t=type_map[i]) < (int) t_end)
>          ;
>        else 
>          continue;
>        tm=tm_of(t);
>        p=pagetochar(i);
>        if ( t == t_cons) 
>          for (j = tm->tm_nppage; --j >= 0; p += tm_table[t_cons].tm_size/*  
> sizeof(struct cons) */) {
>            object x = (object) p; 
>            if (SGC_OR_M(x)) 
>              continue;
>            if (x->d.t==t_cons) {
>              x->d.m = TRUE; 
>              sgc_mark_cons(x);
>            } else
>              sgc_mark_object1(x);
>          }
>        else {
>          int size=tm->tm_size;
>          for (j = tm->tm_nppage; --j >= 0; p += size) {
>            object x = (object) p; 
>            if (SGC_OR_M(x)) continue;
>            sgc_mark_object1(x);
>          }
>        }
>        }
>      }
>    I'll have to think a bit about why this was put in, but in short, sgc
>    need not be a performance boost if most of your memory is writable
>    anyway, and might (speculation) even be less efficient.  This pertains
>    to that note I sent earlier about the enabling of SGC in ACL2, which
>    you asked me to clarify, and I never did, as I don't yet have a good
>    answer as to a metric to determine when SGC helps and when it does
>    not.  One thing should be clear, and that is many heap expansions,
>    which require an (si::sgc-on nil)(gbc t)(si::sgc-on t), are very
>    expensive, as these steps require that the read-only subset of memory
>    be determined all over again.
>    To be a bit more verbose, SGC works by taking the memory set at time
>    of (si:sgc-on t), making it read-only, and allocating new writable
>    pages for use in subsequent allocation -- these are "SGC" pages,
>    i.e. the memory set is divided into two.  Write attempts on the old
>    memory make those pages writable, but they are not "SGC" pages as far
>    as the algorithm goes, and must be garbage collected separately.  The
>    idea is to enable sgc at the point where the most memory that will
>    essentially have no further modifications can be set aside.  Setting
>    aside a small amount of truly read-only memory, or a lot of memory
>    which has significant subsequent modifications, both suffer from
>    inefficiencies.
>    >From looking at b.out, for example, half the time is spend in
>    sgc_mark_phase itself, and half in its main child process,
>    sgc_mark_object1.  This suggests to me that the ostensibly read-only
>    memory which was actually modified and thus made writable (writable
>    non-SGC pages) is substantial.  The loop shown above is similar to
>    that in the sweeper which is the main cost in normal GBC, so the more
>    memory it has to process, the worse performance will be.
>    Anyway, these are just initial thoughts.  I'd like Vadim's input if
>    possible to construct for this case an analog of the maxima init file
>    he's just put together, so we can get a performance comparison with
>    ACL short of GBC.  This in turn will help focus our efforts on where
>    our out of the box algorithms and parameter settings can be improved. 
>    Take care, 
>    Matt Kaufmann <address@hidden> writes:
>    > Camm --
>    > 
>    > In case email from AMD to you is problematic, I've also just now put the
>    > profile results on the web as a gzipped tar file:
>    > 
>    > /u/www/users/moore/acl2/seminar/temp/camm.tar.gz
>    > or
>    > http://www.cs.utexas.edu/users/moore/acl2/seminar/temp/camm.tar.gz
>    > 
>    > When you let me know you've received it, I'll delete this file.
>    > 
>    > -- Matt
>    > 
>    > 
>    > 
>    -- 
>    Camm Maguire                                               address@hidden
>    ==========================================================================
>    "The earth is but one country, and mankind its citizens."  --  Baha'u'llah
> _______________________________________________
> Gcl-devel mailing list
> address@hidden
> http://mail.gnu.org/mailman/listinfo/gcl-devel

Camm Maguire                                            address@hidden
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah

reply via email to

[Prev in Thread] Current Thread [Next in Thread]