Re: Feature request: halt on threshold


From: Ole Tange
Subject: Re: Feature request: halt on threshold
Date: Sat, 19 Jul 2014 08:59:26 +0200

I have implemented --halt 10% in git to mean: 10% of the jobs run so
far must have failed, and at least 3 jobs. The minimum of 3 was necessary
to avoid too many false positives if the percentage is 50% or higher.
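
As a rough illustration of the rule described above (not the actual GNU
Parallel code), the check could be sketched like this. The function name
should_halt and its arguments are hypothetical, and whether the real
comparison is ">=" or ">" is an assumption:

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from GNU Parallel's source.
# Usage: should_halt FAILED RUN PERCENT
# Returns 0 (halt) when at least 3 jobs have failed AND the failed
# share of the jobs run so far is at least PERCENT percent.
should_halt() {
    failed=$1
    run=$2
    percent=$3
    min_failures=3   # the floor of 3 mentioned in the message
    [ "$failed" -ge "$min_failures" ] || return 1
    # Integer-only form of: failed / run >= percent / 100
    # (">=" is an assumption; the message does not say ">=" or ">")
    [ $((failed * 100)) -ge $((percent * run)) ]
}

# With a 50% threshold: 1 failure out of 1 job is 100% failed but
# below the floor of 3; 3 failures out of 5 jobs passes both checks.
should_halt 1 1 50 && echo halt || echo continue   # prints "continue"
should_halt 3 5 50 && echo halt || echo continue   # prints "halt"
```

Without the floor of 3, the first call would halt the whole run on a
single early failure, which is the false positive the minimum is meant
to prevent.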

Feedback welcome.

/Ole

On Sat, Jul 19, 2014 at 1:17 AM, Ole Tange <tange@gnu.org> wrote:
> On Fri, Jul 18, 2014 at 11:22 PM, Ben Rusholme <rusholme@caltech.edu> wrote:
>
>> There are currently three options to "--halt": ignore (0), stop new jobs
>> (1), or kill everything (2).
>>
>> I propose an additional option: to set the number of job failures allowed
>> before doing anything. This would then allow some tolerance of failure but
>> would still catch global problems.
>>
>> Consider this example: running 1000 jobs of around 1hr each, where a
>> random handful will fail due to unexpected bad data or some other unforeseen
>> bug, but the overwhelming majority will complete successfully.
>>
>> With --halt 0 all jobs will run, and I can check for the failures
>> afterwards. Great! However, say I forget to create the results directory, so
>> every "good" job runs for its full time and then fails right at the end. If
>> I wasn't monitoring, I just wasted 1000hrs of processing time.
>
> This I do not understand. GNU Parallel 20140622 creates the dirs
> before running, so your version is broken:
>
> $ parallel --results /tmp/this/does/not/exist echo ::: 1
> 1
> $ ls /tmp/this/does/not/exist/1/1/
> stderr  stdout
>
>> With --halt > 0 the run will stop at or just after the first problem. I
>> have to check the logs, figure out and fix the problem if possible, rerun
>> with the previous successes excluded, etc.
>
> Using --resume-failed.
>
>> What I would like is to set, say, the number of tolerable failures to the
>> number of workers. Then a serious bug would be caught after the first
>> iteration, but otherwise the entire job would run and tolerate some measure
>> of bad input data.
>
> You need to give a reproducible example where you cannot just use
> --halt 0 and then later --resume-failed when you have fixed the
> bug/the input data.
>
>> Does this make sense? Unfortunately it would require changing the current
>> flags, either by adding another or by changing the current halt options.
>
> One possibility for syntax is --halt 10% to allow 10% to fail.
>
>
> /Ole


