Re: GNU Make 4.2 Query

help-make

From:	nikhil jain
Subject:	Re: GNU Make 4.2 Query
Date:	Mon, 2 Sep 2019 23:34:00 +0530

Thanks for the detailed information.

I will see if I can use a shell wrapper program as you suggested.

I have used LSF a lot, for about 5 years, and I still use it: bsub, bjobs,
bkill, lim, sbatchd, mbatchd, etc. It is easy to understand and use.

lsmake - I do not want to use IBM's proprietary stuff.

Thanks for your suggestions.

Nikhil

On Mon, Sep 2, 2019 at 11:10 PM David Boyce <address@hidden> wrote:

> I'm not going to address the remote execution topic since it sounds like
> you already have the solution and are not looking for help. However, I do
> have fairly extensive experience with the NFS/retry area so will try to
> contribute there.
>
> First, I don't think what Paul says:
>
> > As for your NFS issue, another option would be to enable the .ONESHELL
> > feature available in newer versions of GNU make: that will ensure that
> > all lines in a recipe are invoked in a single shell, which means that
> > they should all be invoked on the same remote host.
>
> is sufficient. Consider the typical case of compiling foo.c to foo.o and
> linking it into foo.exe. Typically, and correctly, those actions would be
> in two separate recipes which in a distributed-build scenario could run on
> different hosts, so the linker may not find the .o file from a previous
> recipe. Here .ONESHELL cannot help since they're different recipes.
>
> In my day job we use a product from IBM called LSF (Load Sharing Facility,
> https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_welcome.html)
> which exists to distribute jobs over a server farm (typically using NFS)
> according to various metrics like load and free memory and so on. Part of
> the LSF package is a program called lsmake (
> https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/lsmake.1.html)
> which under the covers is a version of GNU make with enhancements to enable
> remote/distributed recipes; it also adds the retry-with-delay feature
> Nikhil requested. Since GNU make is GPL, IBM is required to make its package of
> enhancements available under GPL as well. Much of it is not of direct
> interest to the open source community because it's all about communicating
> with IBM's proprietary daemons but their retry logic could probably be
> taken directly from the patch. At the very least, if retries were to be
> added to GNU make per se it would be nice if the flags were compatible with
> lsmake.
>
> However, my personal belief is that retries are God's way of telling us to
> think harder and better. Retrying (and worse, delay-and-retry) is a form of
> defeatism which I call "sleep and hope". Computers are deterministic,
> there's always a root cause which can usually be found and addressed with
> sufficient analysis, etc. Granted there are cases where you understand the
> problem but can't address it for administrative/permissions/business
> reasons but that can't be known until the problem is understood.
>
> NFS caching is the root cause of unreliable distributed builds, as you've
> already described, but most or all of these issues can be addressed with a
> less blunt instrument than sleep-and-retry. Even LSF engineers threw up
> their hands and did retries but what we did here was take their patch,
> which at last check was still targeted to 3.81, and while porting it to 4.1
> added some of the cache-flushing strategies detailed below. This has solved
> most if not all of our NFS sync problems. Caveat: most of our people still
> use the LSF retry logic in addition, because they're not as absolutist as I
> am and just want to get their jobs done (go figure), which makes it harder
> to determine what percentage of problems are solved by cache flushing vs
> retries but I'm pretty sure flushing has resolved the great majority of
> problems.
>
> One problem with sleep-and-hope is that there's no amount of time
> guaranteed to be enough so you're just driving the incidence rate down, not
> fixing it.
>
> Since we were already working with a hacked version of GNU make we found
> it most convenient to implement flushing directly in the make program but
> it can also be done within recipes. In fact we have 3 different
> implementations of the same NFS cache flushing logic:
>
> 1. Directly within our enhanced version of lsmake.
> 2. In a standalone binary called "nfsflush".
> 3. In a Python script called nfsflush.py.
>
> The Python script is a lab for trying out new strategies but it's too slow
> for production use. The binary is a faster version of the same techniques
> for direct use in recipes, and that same C code is linked directly into
> lsmake as well. Here's the usage message of our Python script:
>
> $ nfsflush.py --help
> usage: nfsflush.py [-h] [-f] [-l] [-r] [-t] [-u] [-V] path [path ...]
>
> positional arguments:
>   path             directory paths to flush
>
> optional arguments:
>   -h, --help       show this help message and exit
>   -f, --fsync      Use fsync() on named files
>   -l, --lock       Lock and unlock the named files
>   -r, --recursive  flush recursively
>   -t, --touch      additional flush action - touch and remove a temp file
>   -u, --upflush    flush parent directories back to the root
>   -V, --verbose    increment verbosity level
>
> Flush the NFS filehandle caches of NFS directories.
>
> Newly created files are sometimes unavailable via NFS for a period
> of time due to filehandle caching, leading to apparent race problems.
> See http://tss.iki.fi/nfs-coding-howto.html#fhcache for details.
>
> This script forces a flush using techniques mentioned in the URL. It
> can optionally do so recursively.
>
> This always does an opendir/closedir sequence on each directory
> visited, as described in the URL, because that's cheap and safe and
> often sufficient. Other strategies, such as creating and removing a
> temp file, are optional.
>
> EXAMPLES:
>     nfsflush.py /nfs/path/...
>
> The most important thing is to read the URL given above and/or to search
> for similar resources, of which there are many. While I'm not an NFS guru
> myself, the summary of my understanding is that NFS caches all sorts of
> things (metadata like atime/mtime, directory updates, etc) with various
> degrees of aggression according to NFS vendor and internal configuration.
> We've seen substantial variation between NAS providers such as NetApp, EMC,
> etc, so much depends on whose NFS server you're using. However, the NFS
> spec _requires_ that caches be flushed on a write operation so all
> implementations will do this.
>
> Bottom line, the most common failure case is as mentioned above: foo.o is
> compiled on host A and immediately linked on host B. The close() system
> call following the final write() of foo.o on host A will cause its data to
> be flushed. Similarly I *believe* the directory write (assuming foo.o is
> newly created and not just updated) will cause the filehandle cache to be
> flushed. Thus, after these two write ops (directory and file) the NFS
> server will know about the new foo.o as soon as it's created.
>
> The problem typically arises on host B because no write operation has
> taken place there after foo.o was created on A so no one has told it to
> update its caches and as a result it doesn't know foo.o exists and the link
> fails with ENOENT. All the flushing techniques in the script above are
> attempts to address this. One takeaway from all this is that even if you do
> retries, a "dumb" retry is immeasurably enhanced by adding a flush. In
> other words the most efficient retry formula in a distributed build
> scenario would be:
>
> <recipe> || { flush; <recipe>; }
>
> This never flushes a cache unless the first attempt fails. It presumes
> that NFS implementors and admins know what they're doing and thus caching
> helps with performance so it's not done unless needed. This is what we
> built into our variant of lsmake. However, the same can also be done in the
> shell.
>
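A minimal sketch of that retry formula, written in Python since that is what nfsflush.py uses. It is not the lsmake code: the function name, the default /bin/sh, and the choice of a plain directory listing as the flush step are assumptions made for illustration. The second attempt runs after the flush regardless of whether the flush itself succeeded.

```python
import os
import subprocess


def run_flush_retry(recipe, flush_paths=(".",), shell="/bin/sh"):
    """Run a recipe; if it fails, flush the given directories and retry once.

    Returns the exit status of the last attempt. The name and signature
    are hypothetical, chosen only for this sketch.
    """
    status = subprocess.call([shell, "-c", recipe])
    if status == 0:
        return 0  # first attempt worked, so no cache is touched
    for path in flush_paths:
        try:
            # Assumed flush step: reading the directory is the cheapest
            # technique described in this thread.
            os.listdir(path)
        except OSError:
            pass  # a failed flush should not hide the recipe's own status
    return subprocess.call([shell, "-c", recipe])
```

For example, run_flush_retry("cc -o foo.exe foo.o", flush_paths=["."]) would link, and only if the link fails would it read the current directory and link again.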
> Details about implemented cache flushing techniques: the filehandle cache
> is the biggest source of problems in distributed builds and the simplest
> solution for it seems to be opening and reading the directory entry. Thus
> our script and its parallel C implementation always do that. We've also
> seen cases where forcing a directory write operation is required, which is
> what the -t, --touch option does. Sometimes you can't easily enumerate all
> directories involved (vpath etc.), so the recurse-downward (-r) and
> recurse-upward (-u) flags may be helpful, though they (especially -u) may
> also be overkill. The -f and -l options were added based on advice found
> on the net but have not been shown to be helpful in our environment.
>
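The techniques just described can be sketched roughly as follows. This is not the actual nfsflush.py, which is not shown in this thread; the function names, the temp-file prefix, and the visiting order are assumptions. It covers only the directory read, the optional touch, and the enumeration of directories for the -r and -u style options.

```python
import os
import tempfile


def flush_dir(path, touch=False):
    """Encourage the NFS client to refresh its view of one directory.

    Always opens, reads, and closes the directory. With touch=True it
    also creates and removes a temp file, which forces a directory write.
    """
    with os.scandir(path) as entries:  # opendir/readdir/closedir on POSIX
        for _ in entries:
            pass
    if touch:
        # Hypothetical temp-file name; needs write permission on path.
        fd, tmp = tempfile.mkstemp(prefix=".nfsflush-", dir=path)
        os.close(fd)
        os.unlink(tmp)


def dirs_to_flush(path, recursive=False, upflush=False):
    """List the directories to visit, starting with path itself."""
    start = os.path.abspath(path)
    dirs = [start]
    if recursive:
        # Downward: every subdirectory below the starting point.
        for root, subdirs, _files in os.walk(start):
            dirs.extend(os.path.join(root, d) for d in subdirs)
    if upflush:
        # Upward: each parent in turn until the filesystem root.
        current = start
        while True:
            parent = os.path.dirname(current)
            if parent == current:
                break
            dirs.append(parent)
            current = parent
    return dirs
```

A recipe-side call might look like: for d in dirs_to_flush("obj", recursive=True): flush_dir(d). Touching would normally be left off for the upward walk, since parent directories are often not writable.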
> Some techniques may be of limited utility because they require write
> and/or ownership privileges. For instance I've seen statements that
> umounts, even failed umounts, will force flushes. Thus a command like "cd
> <dir> && umount $(pwd)" would have to fail since the mount is busy but
> would flush as a side effect. However, I believe this requires root
> privileges so is not helpful in the normal case.
>
> In summary: although I don't believe in retries, if they're going to be
> used I think they should be implemented in a shell wrapper program which
> could be passed to make as SHELL=<wrapper> and the wrapper should use
> flushing in addition to, or instead of, retries. We didn't do it that way
> but I think our nfsflush program could just as well have been implemented
> as say "nfsshell" such that "nfsshell [other-options] -c <recipe>" would
> run the recipe along with added flushing and retrying options. I agree with
> Paul that I see no reason to implement any of these features, retry and/or
> flush, directly in make.
>
> David
>
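The "nfsshell" idea from the summary above could look something like this sketch. No such program exists in this thread, so the name, REAL_SHELL, and the single retry are all assumptions. It relies on make invoking the shell as SHELL followed by the shell flags (-c by default) and then the recipe text, so the recipe is taken to be the last argument.

```python
import os
import subprocess

REAL_SHELL = "/bin/sh"  # assumption: the shell the recipes were written for


def split_args(argv):
    """Split wrapper arguments into (shell_flags, recipe).

    make passes the recipe text as the final argument, after the flags.
    """
    if not argv:
        raise SystemExit("usage: nfsshell -c <recipe>")
    return argv[:-1], argv[-1]


def main(argv):
    """Run the recipe; on failure, flush the current directory and retry once."""
    flags, recipe = split_args(argv)
    cmd = [REAL_SHELL] + flags + [recipe]
    status = subprocess.call(cmd)
    if status != 0:
        try:
            os.listdir(os.getcwd())  # cheap flush: read the directory
        except OSError:
            pass
        status = subprocess.call(cmd)
    # A negative status means the shell died on a signal; report it the
    # way shells conventionally do (128 + signal number).
    return status if status >= 0 else 128 - status


# As an installed executable script, the file would end with:
#     import sys; sys.exit(main(sys.argv[1:]))
```

Once installed as an executable, it would be selected with make SHELL=/path/to/nfsshell, with no change to the makefile itself.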
> On Mon, Sep 2, 2019 at 6:05 AM Paul Smith <address@hidden> wrote:
>
>> On Sun, 2019-09-01 at 23:23 -0700, Kaz Kylheku (gmake) wrote:
>> > If your R&D team would allow you to add just one line to the
>> > legacy GNU Makefile to assign the SHELL variable, you can assign that
>> > to a shell wrapper program which performs command re-trying.
>>
>> You don't have to add any lines to the makefile.  You can reset SHELL
>> on the command line, just like any other make variable:
>>
>>     make SHELL=/my/special/sh
>>
>> You can even override it only for specific targets using the --eval
>> command line option:
>>
>>     make --eval 'somerule: SHELL := /my/special/sh'
>>
>> Or, you can add '-f mymakefile.mk -f Makefile' options to the command
>> line to force reading of a personal makefile before the standard
>> makefile.
>>
>> Clearly you can modify the command line, otherwise adding new options
>> to control a putative retry on error option would not be possible.
>>
>> As for your NFS issue, another option would be to enable the .ONESHELL
>> feature available in newer versions of GNU make: that will ensure that
>> all lines in a recipe are invoked in a single shell, which means that
>> they should all be invoked on the same remote host.  This can also be
>> done from the command line, as above.  If your recipes are written well
>> it should Just Work.  If they aren't, and you can't fix them, then
>> obviously this solution won't work for you.
>>
>> Regarding changes to set re-invocation on failure, at this time I don't
>> believe it's something I'd be willing to add to GNU make directly,
>> especially not an option that simply retries every failed job.  This is
>> almost never useful (why would you want to retry a compile, or link, or
>> similar?  It will always just fail again, take longer, and generate
>> confusing duplicate output--at best).
>>
>> The right answer for this problem is to modify the makefile to properly
>> retry those specific rules which need it.
>>
>> I commiserate with you that your environment is static and you're not
>> permitted to modify it, however adding new specialized capabilities to
>> GNU make so that makefiles don't have to be modified isn't a design
>> philosophy I want to adopt.
>>
>
