Re: GNU Make 4.2 Query


From: nikhil jain
Subject: Re: GNU Make 4.2 Query
Date: Wed, 18 Sep 2019 08:20:31 +0530

Hi,

I have a query. Sorry to bother you again.

Can you please let me know which function is called last when I send SIGINT
to the running make (or press Ctrl+C)? I want to add some logic
there. Please help. This is urgent.

Thanks in advance.

Nikhil

On Mon, 2 Sep 2019, 23:49 nikhil jain, <address@hidden> wrote:

> haha OK.
>
> If I were you, I would have built lsmake functionality in GMAKE and not
> pay IBM lol.
>
> Anyways, have a good day. :)
>
> On Mon, Sep 2, 2019 at 11:45 PM David Boyce <address@hidden>
> wrote:
>
>> I did not suggest using lsmake, I simply mentioned that we use it.
>>
>> On Mon, Sep 2, 2019 at 11:04 AM nikhil jain <address@hidden>
>> wrote:
>>
>>> Thanks for detailed information.
>>>
>>> I will see if I can use shell wrapper program as mentioned by you.
>>>
>>> I used LSF a lot for about 5 years, and I still use it: bsub, bjobs,
>>> bkill, lim, sbatchd, mbatchd, etc. It is easy to understand and use.
>>>
>>> lsmake - I do not want to use IBM's proprietary stuff.
>>>
>>> Thanks for your suggestions.
>>>
>>> Nikhil
>>>
>>> On Mon, Sep 2, 2019 at 11:10 PM David Boyce <address@hidden>
>>> wrote:
>>>
>>>> I'm not going to address the remote execution topic since it sounds
>>>> like you already have the solution and are not looking for help. However, I
>>>> do have fairly extensive experience with the NFS/retry area so will try to
>>>> contribute there.
>>>>
>>>> First, I don't think what Paul says:
>>>>
>>>> > As for your NFS issue, another option would be to enable the .ONESHELL
>>>> > feature available in newer versions of GNU make: that will ensure that
>>>> > all lines in a recipe are invoked in a single shell, which means that
>>>> > they should all be invoked on the same remote host.
>>>>
>>>> is sufficient. Consider the typical case of compiling foo.c to foo.o
>>>> and linking it into foo.exe. Typically, and correctly, those actions would
>>>> be in two separate recipes which in a distributed-build scenario could run
>>>> on different hosts so the linker may not find the .o file from a previous
>>>> recipe. Here .ONESHELL cannot help since they're different recipes.
>>>>
>>>> In my day job we use a product from IBM called LSF (Load Sharing
>>>> Facility,
>>>> https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_welcome.html)
>>>> which exists to distribute jobs over a server farm (typically using NFS)
>>>> according to various metrics like load and free memory and so on. Part of
>>>> the LSF package is a program called lsmake (
>>>> https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/lsmake.1.html)
>>>> which under the covers is a version of GNU make with enhancements to enable
>>>> remote/distributed recipes (it also adds the retry-with-delay feature Nikhil
>>>> requested). Since GNU make is GPL, IBM is required to make its package of
>>>> enhancements available under GPL as well. Much of it is not of direct
>>>> interest to the open source community because it's all about communicating
>>>> with IBM's proprietary daemons but their retry logic could probably be
>>>> taken directly from the patch. At the very least, if retries were to be
>>>> added to GNU make per se it would be nice if the flags were compatible with
>>>> lsmake.
>>>>
>>>> However, my personal belief is that retries are God's way of telling us
>>>> to think harder and better. Retrying (and worse, delay-and-retry) is a form
>>>> of defeatism which I call "sleep and hope". Computers are deterministic,
>>>> there's always a root cause which can usually be found and addressed with
>>>> sufficient analysis, etc. Granted there are cases where you understand the
>>>> problem but can't address it for administrative/permissions/business
>>>> reasons but that can't be known until the problem is understood.
>>>>
>>>> NFS caching is the root cause of unreliable distributed builds, as
>>>> you've already described, but most or all of these issues can be addressed
>>>> with a less blunt instrument than sleep-and-retry. Even LSF engineers threw
>>>> up their hands and did retries but what we did here was take their patch,
>>>> which at last check was still targeted to 3.81, and while porting it to 4.1
>>>> added some of the cache-flushing strategies detailed below. This has solved
>>>> most if not all of our NFS sync problems. Caveat: most of our people still
>>>> use the LSF retry logic in addition, because they're not as absolutist as I
>>>> am and just want to get their jobs done (go figure), which makes it harder
>>>> to determine what percentage of problems are solved by cache flushing vs
>>>> retries but I'm pretty sure flushing has resolved the great majority of
>>>> problems.
>>>>
>>>> One problem with sleep-and-hope is that there's no amount of time
>>>> guaranteed to be enough so you're just driving the incidence rate down, not
>>>> fixing it.
>>>>
>>>> Since we were already working with a hacked version of GNU make we
>>>> found it most convenient to implement flushing directly in the make program
>>>> but it can also be done within recipes. In fact we have 3 different
>>>> implementations of the same NFS cache flushing logic:
>>>>
>>>> 1. Directly within our enhanced version of lsmake.
>>>> 2. In a standalone binary called "nfsflush".
>>>> 3. In a Python script called nfsflush.py.
>>>>
>>>> The Python script is a lab for trying out new strategies but it's too
>>>> slow for production use. The binary is a faster version of the same
>>>> techniques for direct use in recipes, and that same C code is linked
>>>> directly into lsmake as well. Here's the usage message of our Python 
>>>> script:
>>>>
>>>>     $ nfsflush.py --help
>>>>     usage: nfsflush.py [-h] [-f] [-l] [-r] [-t] [-u] [-V] path [path ...]
>>>>
>>>>     positional arguments:
>>>>       path             directory paths to flush
>>>>
>>>>     optional arguments:
>>>>       -h, --help       show this help message and exit
>>>>       -f, --fsync      Use fsync() on named files
>>>>       -l, --lock       Lock and unlock the named files
>>>>       -r, --recursive  flush recursively
>>>>       -t, --touch      additional flush action - touch and remove a temp file
>>>>       -u, --upflush    flush parent directories back to the root
>>>>       -V, --verbose    increment verbosity level
>>>>
>>>>     Flush the NFS filehandle caches of NFS directories.
>>>>
>>>>     Newly created files are sometimes unavailable via NFS for a period
>>>>     of time due to filehandle caching, leading to apparent race problems.
>>>>     See http://tss.iki.fi/nfs-coding-howto.html#fhcache for details.
>>>>
>>>>     This script forces a flush using techniques mentioned in the URL. It
>>>>     can optionally do so recursively.
>>>>
>>>>     This always does an opendir/closedir sequence on each directory
>>>>     visited, as described in the URL, because that's cheap and safe and
>>>>     often sufficient. Other strategies, such as creating and removing a
>>>>     temp file, are optional.
>>>>
>>>>     EXAMPLES:
>>>>         nfsflush.py /nfs/path/...
>>>>
>>>> The most important thing is to read the URL given above and/or to
>>>> google for similar resources, of which there are many. While I'm not an NFS
>>>> guru myself, the summary of my understanding is that NFS caches all sorts
>>>> of things (metadata like atime/mtime, directory updates, etc) with various
>>>> degrees of aggression according to NFS vendor and internal configuration.
>>>> We've seen substantial variation between NAS providers such as NetApp, EMC,
>>>> etc, so much depends on whose NFS server you're using. However, the NFS
>>>> spec _requires_ that caches be flushed on a write operation so all
>>>> implementations will do this.
>>>>
>>>> Bottom line, the most common failure case is as mentioned above: foo.o
>>>> is compiled on host A and immediately linked on host B. The close() system
>>>> call following the final write() of foo.o on host A will cause its data to
>>>> be flushed. Similarly I *believe* the directory write (assuming foo.o is
>>>> newly created and not just updated) will cause the filehandle cache to be
>>>> flushed. Thus, after these two write ops (directory and file) the NFS
>>>> server will know about the new foo.o as soon as it's created.
>>>>
>>>> The problem typically arises on host B because no write operation has
>>>> taken place there after foo.o was created on A so no one has told it to
>>>> update its caches and as a result it doesn't know foo.o exists and the link
>>>> fails with ENOENT. All the flushing techniques in the script above are
>>>> attempts to address this. One takeaway from all this is that even if you do
>>>> retries, a "dumb" retry is immeasurably enhanced by adding a flush. In
>>>> other words the most efficient retry formula in a distributed build
>>>> scenario would be:
>>>>
>>>> <recipe> || { flush; <recipe>; }
>>>>
>>>> This never flushes a cache unless the first attempt fails. It presumes
>>>> that NFS implementors and admins know what they're doing and thus caching
>>>> helps with performance so it's not done unless needed. This is what we
>>>> built into our variant of lsmake. However, the same can also be done in the
>>>> shell.
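
The retry formula above could be sketched roughly as follows. This is only an illustration, not the code built into the lsmake variant: the function name `run_flush_retry` and the idea of passing the flush step in as a callable are assumptions made for the example, and `/bin/sh` is assumed to be the recipe shell.

```python
import subprocess
from typing import Callable


def run_flush_retry(recipe: str, flush: Callable[[], None]) -> int:
    """Run a recipe; if it fails, flush once and run it a second time.

    Returns the exit status of the last attempt. The flush callable is
    never invoked when the first attempt succeeds, so caches are left
    alone in the common case.
    """
    first = subprocess.run(["/bin/sh", "-c", recipe])
    if first.returncode == 0:
        return 0
    # First attempt failed: flush, then give the recipe one more chance.
    flush()
    second = subprocess.run(["/bin/sh", "-c", recipe])
    return second.returncode
```

Note that the second attempt has to run whether or not the flush step itself reports success, which is why the flush and the retry are sequenced rather than chained with a second "or".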
>>>>
>>>> Details about implemented cache flushing techniques: the filehandle
>>>> cache is the biggest source of problems in distributed builds and the
>>>> simplest solution for it seems to be opening and reading the directory
>>>> entry. Thus our script and its parallel C implementation always do that.
>>>> We've also seen cases where forcing a directory write operation is required
>>>> which the -t, --touch option does. Sometimes you can't easily enumerate all
>>>> directories involved (vpath etc) so the recurse-downward (-r) and
>>>> recurse-upward (-u) flags may be helpful though they (especially -u) may also be
>>>> overkill. The -f and -l options were added based on advice found on the net
>>>> but have not been shown to be helpful in our environment.
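
A minimal sketch of the techniques described above, assuming Python 3 on a POSIX host. It is not the nfsflush.py from this thread, whose source is not shown; the names `flush_dir` and `dirs_to_flush` are made up for the example. It relies on `os.scandir` performing an opendir/readdir/closedir sequence, and whether that actually refreshes a given NFS client's cache depends on the NFS implementation.

```python
import os
import tempfile
from typing import List


def flush_dir(path: str, touch: bool = False) -> None:
    """Open, read, and close one directory; optionally force a directory write."""
    with os.scandir(path) as entries:  # opendir / readdir / closedir
        for _ in entries:
            pass
    if touch:
        try:
            # Create and remove a temp file so the directory itself is written.
            fd, tmp = tempfile.mkstemp(prefix=".nfsflush-", dir=path)
            os.close(fd)
            os.unlink(tmp)
        except OSError:
            pass  # no write permission here; the read above still happened


def dirs_to_flush(path: str, recursive: bool = False,
                  upflush: bool = False) -> List[str]:
    """List the directories to visit for one starting path."""
    start = os.path.abspath(path)
    dirs = [start]
    if recursive:  # like -r: every directory below the starting point
        for root, subdirs, _files in os.walk(start):
            dirs.extend(os.path.join(root, d) for d in subdirs)
    if upflush:  # like -u: every parent back to the root
        current = start
        while os.path.dirname(current) != current:
            current = os.path.dirname(current)
            dirs.append(current)
    return dirs
```

A caller would loop over `dirs_to_flush(...)` and call `flush_dir` on each entry. The fsync and lock options are left out here since they were reported above as not helpful.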
>>>>
>>>> Some techniques may be of limited utility because they require write
>>>> and/or ownership privileges. For instance I've seen statements that
>>>> umounts, even failed umounts, will force flushes. Thus a command like "cd
>>>> <dir> && umount $(pwd)" would have to fail since the mount is busy but
>>>> would flush as a side effect. However I believe this requires root
>>>> privileges so is not helpful in the normal case.
>>>>
>>>> In summary: although I don't believe in retries, if they're going to be
>>>> used I think they should be implemented in a shell wrapper program which
>>>> could be passed to make as SHELL=<wrapper> and the wrapper should use
>>>> flushing in addition to, or instead of, retries. We didn't do it that way
>>>> but I think our nfsflush program could just as well have been implemented
>>>> as say "nfsshell" such that "nfsshell [other-options] -c <recipe>" would
>>>> run the recipe along with added flushing and retrying options. I agree with
>>>> Paul that I see no reason to implement any of these features, retry and/or
>>>> flush, directly in make.
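
The wrapper idea could look something like the sketch below. No "nfsshell" program exists in this thread, so everything here is hypothetical: the `NFSSHELL_RETRIES` environment variable, the choice of `/bin/sh` as the real shell, and flushing only the current directory are all assumptions for illustration. It relies on make invoking the shell as `$(SHELL) $(.SHELLFLAGS) <recipe>`, where the flags default to `-c`, so the wrapper simply passes its arguments through.

```python
#!/usr/bin/env python3
import os
import subprocess
import sys
from typing import List

REAL_SHELL = "/bin/sh"  # assumption: the shell the recipes were written for


def flush_cwd() -> None:
    """Read the current directory (opendir/readdir/closedir)."""
    with os.scandir(".") as entries:
        for _ in entries:
            pass


def main(argv: List[str]) -> int:
    """Run the real shell with make's arguments; flush and retry on failure."""
    # Hypothetical knob: how many flush-and-retry rounds to allow.
    retries = int(os.environ.get("NFSSHELL_RETRIES", "1"))
    status = subprocess.run([REAL_SHELL, *argv]).returncode
    for _ in range(retries):
        if status == 0:
            break
        flush_cwd()
        status = subprocess.run([REAL_SHELL, *argv]).returncode
    # A negative return code means the shell died from a signal;
    # report it the way shells do, as 128 + signal number.
    return status if status >= 0 else 128 - status


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

If saved as an executable file, it would be selected the same way as any other shell override, for example `make SHELL=/path/to/nfsshell`, with the path being wherever the script was installed.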
>>>>
>>>> David
>>>>
>>>> On Mon, Sep 2, 2019 at 6:05 AM Paul Smith <address@hidden> wrote:
>>>>
>>>>> On Sun, 2019-09-01 at 23:23 -0700, Kaz Kylheku (gmake) wrote:
>>>>> > If your R&D team would allow you to add just one line to the
>>>>> > legacy GNU Makefile to assign the SHELL variable, you can assign that
>>>>> > to a shell wrapper program which performs command re-trying.
>>>>>
>>>>> You don't have to add any lines to the makefile.  You can reset SHELL
>>>>> on the command line, just like any other make variable:
>>>>>
>>>>>     make SHELL=/my/special/sh
>>>>>
>>>>> You can even override it only for specific targets using the --eval
>>>>> command line option:
>>>>>
>>>>>     make --eval 'somerule: SHELL := /my/special/sh'
>>>>>
>>>>> Or, you can add '-f mymakefile.mk -f Makefile' options to the command
>>>>> line to force reading of a personal makefile before the standard
>>>>> makefile.
>>>>>
>>>>> Clearly you can modify the command line, otherwise adding new options
>>>>> to control a putative retry on error option would not be possible.
>>>>>
>>>>> As for your NFS issue, another option would be to enable the .ONESHELL
>>>>> feature available in newer versions of GNU make: that will ensure that
>>>>> all lines in a recipe are invoked in a single shell, which means that
>>>>> they should all be invoked on the same remote host.  This can also be
>>>>> done from the command line, as above.  If your recipes are written well
>>>>> it should Just Work.  If they aren't, and you can't fix them, then
>>>>> obviously this solution won't work for you.
>>>>>
>>>>> Regarding changes to set re-invocation on failure, at this time I don't
>>>>> believe it's something I'd be willing to add to GNU make directly,
>>>>> especially not an option that simply retries every failed job.  This is
>>>>> almost never useful (why would you want to retry a compile, or link, or
>>>>> similar?  It will always just fail again, take longer, and generate
>>>>> confusing duplicate output--at best).
>>>>>
>>>>> The right answer for this problem is to modify the makefile to properly
>>>>> retry those specific rules which need it.
>>>>>
>>>>> I commiserate with you that your environment is static and you're not
>>>>> permitted to modify it, however adding new specialized capabilities to
>>>>> GNU make so that makefiles don't have to be modified isn't a design
>>>>> philosophy I want to adopt.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Help-make mailing list
>>>>> address@hidden
>>>>> https://lists.gnu.org/mailman/listinfo/help-make
>>>>>
>>>>

