Feature request: halt on threshold
From: Ben Rusholme
Subject: Feature request: halt on threshold
Date: Fri, 18 Jul 2014 14:22:31 -0700
Hi,
There are currently three options to "--halt": ignore (0), stop new jobs (1),
or kill everything (2).
I propose an additional option: setting the number of job failures to tolerate
before doing anything. This would allow some tolerance of failure but would
still catch global problems.
Consider this example: running 1000 jobs of around 1 hr each, where a random
handful will fail due to unexpected bad data or some other unforeseen bug, but
the overwhelming majority will complete successfully.
With --halt 0, all jobs will run, and I can check for the failures afterwards.
Great! However, say I forget to create the results directory, so every "good"
job runs for its full time and then fails right at the end… If I wasn't
monitoring, I have just wasted 1000 hrs of processing time.
With --halt > 0, the run will stop at or just after the first problem. I then
have to check the logs, figure out and fix the problem if possible, rerun with
the previous successes excluded, etc.
What I would like is to set the number of tolerable failures to, say, the
number of workers. Then a serious bug would be caught after the first
iteration, but otherwise the entire job would run to completion and tolerate
some measure of bad input data.
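To make the idea concrete, here is a minimal sketch in Python. It is only an illustration of the proposed behaviour, not how --halt is implemented. The function name run_with_failure_threshold and its parameters are hypothetical, and I am assuming "stop starting new jobs once the failure count reaches the threshold" semantics (jobs already running are allowed to finish, as with --halt 1).

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_with_failure_threshold(jobs, workers, max_failures):
    """Run callables from `jobs` on `workers` threads.

    Hypothetical helper: once `max_failures` jobs have raised an exception,
    no new jobs are started; jobs that are already running finish normally.
    Returns (succeeded, failed, not_started) as lists of job indices.
    """
    succeeded, failed = [], []
    pending = {}                      # future -> index of the job it runs
    queue = iter(enumerate(jobs))     # jobs that have not been started yet

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            halted = len(failed) >= max_failures
            # Keep every worker busy unless the threshold has been reached.
            while not halted and len(pending) < workers:
                try:
                    index, job = next(queue)
                except StopIteration:
                    break
                pending[pool.submit(job)] = index
            if not pending:
                break
            # Block until at least one running job finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                index = pending.pop(future)
                if future.exception() is None:
                    succeeded.append(index)
                else:
                    failed.append(index)

    # Whatever is left in the queue was never started.
    not_started = [index for index, _ in queue]
    return succeeded, failed, not_started


def good_job():
    return "ok"


def bad_job():
    # Stands in for "the results directory does not exist".
    raise RuntimeError("cannot write results")


if __name__ == "__main__":
    # Global problem: every job fails, so the run halts after roughly one
    # round of workers instead of burning through all 1000 jobs.
    ok, bad, skipped = run_with_failure_threshold(
        [bad_job] * 1000, workers=8, max_failures=8
    )
    print(len(ok), len(bad), len(skipped))
```

Because jobs that are already running when the threshold is hit are left to finish, the final failure count can be slightly above the threshold (by at most workers - 1). With the threshold set to the number of workers, a handful of scattered failures never stops the run, while a global problem stops it after about one round.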
Does this make sense? Unfortunately it would require changing the current
flags, either by adding another flag or by changing the current halt options.
Thanks, Ben