Re: GNU make 3.81beta4 released

From: Eli Zaretskii
Subject: Re: GNU make 3.81beta4 released
Date: Fri, 20 Jan 2006 13:49:19 +0200

> From: "Markus Mauhart" <address@hidden>
> Date: Fri, 20 Jan 2006 00:49:27 +0100
> "Eli Zaretskii" <address@hidden> wrote ...
> >
> >> From: "Markus Mauhart" <address@hidden>
> >> Date: Thu, 19 Jan 2006 00:55:46 +0100
> >>
> >> Now I continued with your suggestion "-j 64" -- it ran AFAICS almost
> >> 1m without errors until I got ...
> >>     Assertion failed: a == g->changed, file .\remake.c, line 169
> >> ... this comes from an assertion I had inserted around this bug:
> >>     g->changed += commands_started - ocommands_started;
> >> (g->changed is only 8 bits wide).
> >
> > What is the value of commands_started and ocommands_started at that
> > point, and what is the value of g->changed?
> see today's 3 traces below
> > What exactly is that assertion testing?  I mean, what is `a' to which
> > it compares the value of g->changed?
> yesterday it was ...
> {
> int a = (int)commands_started - (int)ocommands_started;
> a += g->changed;
> g->changed += commands_started - ocommands_started;
> assert (a == g->changed);
> }
> Today I changed it to ...
> {
> unsigned int const g_changed_old = g->changed;
> int a = (int)commands_started - (int)ocommands_started;
> a += g->changed;
> g->changed += commands_started - ocommands_started;
> if (a != g->changed)
>     {
>     fprintf (stderr, "\n(g_changed_old, commands_started, ocommands_started) = (%u ,%u ,%u)\n"
>             ,g_changed_old ,commands_started ,ocommands_started);
>     g->changed = (commands_started != ocommands_started);
>     }
> }
> 3 traces from 3 successful(!) "-j 64" runs:
> (g_changed_old, commands_started, ocommands_started) = (0 ,419) //forgot one %u
> (g_changed_old, commands_started, ocommands_started) = (0 ,436) //forgot to rebuild
> (g_changed_old, commands_started, ocommands_started) = (0 ,466 ,0)

That's very strange indeed!  Paul, can this happen?  I thought the
difference between commands_started and ocommands_started should be at
most 1, because in this code from remake.c:

              /* Save the old value of `commands_started' so we can compare
                 later.  It will be incremented when any commands are
                 actually run.  */
              ocommands_started = commands_started;

              x = update_file (file, rebuilding_makefiles ? 1 : 0);
              check_renamed (file);

              /* Set the goal's `changed' flag if any commands were started
                 by calling update_file above.  We check this flag below to
                 decide when to give an "up to date" diagnostic.  */
              g->changed += commands_started - ocommands_started;

update_file should launch at most 1 command.  Am I missing something?

> Btw, such a successful "-j 64" build then sometimes uses ~50 compilers
> in parallel - with my 1GB this is no problem with msvc71, while yet "-j 3" is
> a big problem with mingw-gcc344 (both -O2).

What do you mean by ``a big problem''?  Could you tell the details?

> > Anyway, it looks like g->changed is just a boolean flag, its value is
> > tested in remake.c to be either zero or non-zero.  So, unless I'm
> > missing something, the offending line can be modified to say:
> >
> >    g->changed = commands_started - ocommands_started;
> or ... = (commands_started != ocommands_started)

Yes.  Did you try such a change, and if so, did it allow your build
procedure to come to a successful completion?

> 1) "-j someNumber" propagates to recursive $(MAKE) only via a (simple) trick,
> but this trick also disables gmake's builtin "load balancing" for trees
> of sub-MAKEs (not a problem for my test's simple tree with one trivial root
> and exactly one child).

I don't think load balancing is disabled in recursive Makes; it is
disabled because you didn't tell Make (via the -l switch) that you
want it to pay attention to the load.

However, as long as the Windows version of getloadavg always returns
zero load, using -l is equivalent to using -j, because load_too_high
will roughly estimate the load as 25% of the number of subprocesses
launched by Make.

> 2) "-j noNumber" (btw, array sizes surely reduced to MAXIMUM_WAIT_OBJECTS, and
>                  overflows handled correctly for some levels of callers)
> As I said, this has the advantage of propagating to recursive $(MAKE) without
> tricks, but the disadvantage of looping forever in all my tests:
> Initially ~30 trivial cmds (.bat files) are executed successfully (containing echo ...).
> Then in my last test I get 89 traces ...
>     process_easy() failed failed to launch process (e=-1)

I'd be interested to know what part of process_easy failed and why.
It looks like this particular error (-1) happens when process_begin
returns -1, but what is the exact reason for the failure in
process_begin?  I see several possibilities.  Can you track that down?

>     // job.c contains the double "failed" typo
> ... only interrupted with a single (trivial) success.
> Then some 100 successful cmds (echos and compiles and links).
> Then it loops forever.
> 89 .bat-files still left in the tmp dir.
> Many (89 ?) children existing.
> IMHO there are two bugs involved:
> 1st, the failed process_easy() isn't really gracefully handled - the "transactions"
> aren't undone and hence can't be retried.
> 2nd, the loop and its sub-functions fail to detect "nothing goes":
> E.g. in reap_children() "if (!remote && pid == shell_function_pid)" in my case
> is "true" because both pid and shell_function_pid are 0, but the corresponding
> code and comment IMHO act like handling lots of success.
> Both probably are part of the problem you mentioned IIRC, that generally -j isn't
> safe against temporary resource allocation failures.

I think there should be a way for remake.c to know in advance that no
more processes can be safely launched.  When that happens, it should
not even try to launch more commands, but instead queue them for

As I wrote here yesterday, I think a similar problem exists on Posix
platforms, albeit for slightly different reasons (the limit imposed by
the OS on the number of processes).  If Paul agrees that this is a
bug, then perhaps we could have a system-independent solution for it
that will incorporate the Windows idiosyncrasies.  If not, I will
craft something Windows-specific.

Thanks again for working on this tricky problem.
