
Re: controlling memory use beyond --noswap


From: B. Franz Lang
Subject: Re: controlling memory use beyond --noswap
Date: Tue, 06 May 2014 19:16:46 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

Hi Ole

you are definitely right for a server with more than one user, and for sets
of jobs whose memory use varies widely within the set. Maybe reduce the
problem to:

- sets of jobs have similar memory usage
- there is only one major user
- this user reserves enough memory for all the small jobs combined, so that one
  does not need to worry about them, including occasional other users. This
  would even allow jobs with very different memory sizes within a set, as long
  as they all fit into the allocated space. In fact, the one(s) that do not fit
  could simply be left out, and one would still have the information from the
  successful runs to plan ahead, or might even have enough information at hand
  already.

The best I have come up with is the ulimit thing where you accept that
jobs will be killed and restarted later if they prove to take up too
much RAM. But that is far from ideal: you could have a situation where
one 25 GB job finishes fine only because it happened to run in parallel
with tiny jobs.

I do not know of a bulletproof way to figure out how much memory a
job + its children take up. But maybe we could monitor swap:

   If swapout > 0: Don't care. There is no problem in a machine swapping out.
   If swapin > 0: Don't care. There is no problem in a machine swapping in.
   If (swapin*swapout > limit) 2 seconds in a row:
     The machine is swapping in and out at the same time: this is a problem.
     Kill the newest-started job, put it back in the queue, and wait
     until at least one job has finished before starting another.
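
A minimal sketch of how such a swap-thrashing watcher could look, assuming
vmstat(8) is available; the limit value and the kill/requeue step (here a
pgrep on a hypothetical job name) are illustrative assumptions, not a GNU
parallel feature:

    #!/bin/bash
    # Treat two consecutive samples with swap-in AND swap-out above LIMIT
    # as thrashing (the heuristic described above).
    LIMIT=100000      # assumed threshold for swapin*swapout per second
    strikes=0
    while :; do
        # vmstat column 7 = si (pages swapped in/s), column 8 = so (swapped out/s);
        # "vmstat 1 2" takes about 1 s, so this loop samples roughly once per second.
        read -r si so < <(vmstat 1 2 | tail -n 1 | awk '{print $7, $8}')
        if [ $((si * so)) -gt "$LIMIT" ]; then
            strikes=$((strikes + 1))
        else
            strikes=0
        fi
        if [ "$strikes" -ge 2 ]; then
            echo "Thrashing: kill the newest big job and requeue it"
            # e.g. kill -TERM "$(pgrep -nf velvetg)"   # newest matching PID; job name assumed
            strikes=0
        fi
    done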
As indicated above, what about leaving a comfortable portion of memory
for small jobs (a user-defined value or percentage), so that any swapping
can safely be attributed to the biggies? It would then be only those that
you would kill and restart.
Btw. a good example might be velvetg, which you probably know. It may spend
lots of time on single-threaded calculations (depending on the dataset),
and if you wish to scan across twenty kmers it can be rather tiring.
I usually run it in parallel after estimating the time from a single kmer run,
or just based on an educated guess. It is still difficult to optimize for full
use of a server, and things may end without grace. While many runs are perfect,
others take even longer as the machine starts to work within swap space (and
grinds up the hard disk :-( ). It really needs dropping and later restarting of
big jobs. I once ended up with 18 out of 20 calculations finished over a
weekend, and the other two dropped by the Linux system as swap space was
exceeded - not nice, and not to be repeated.
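
Purely as an illustration, such a kmer scan could be driven by GNU parallel
roughly like this; the read file, output directory names, velvet options and
the choice of three simultaneous assemblies are placeholder assumptions:

    # Run velveth+velvetg for twenty odd kmers (21..59), three at a time,
    # and record every run in a joblog so failed kmers are easy to spot.
    parallel -j 3 --joblog kmers.log \
        'velveth out_{} {} -fastq -short reads.fq && velvetg out_{} -exp_cov auto' \
        ::: $(seq 21 2 59)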

Cheers Franz


The above will limit the jobs started, but it will not start more
small jobs when the big jobs are finished.

With multiple servers of different sizes this becomes even harder.


/Ole

On 14-05-03 04:58 PM, Ole Tange wrote:
On Wed, Apr 30, 2014 at 11:35 PM, B. Franz Lang <Franz.Lang@umontreal.ca> wrote:

I have been trying to find a way that allows the use of 'parallel'
without completely freezing machines, which in my case happens because of
the parallel execution of very memory-hungry applications (for example on
a server that has 64 GB of memory, where one instance of an application
may, unforeseeably, need between 10 and 60 GB).

I have spent quite some time trying to think of a good way to deal
with that. But what is the correct thing to do?

Let us assume that we have 64 GB RAM, that most jobs take 10 GB, but that
20% of the jobs take between 10 and 60 GB, and that we cannot predict which
jobs those are nor how long they will run.

In theory, 80% of the time we can run 6 jobs (namely the 10 GB jobs).

How can we avoid starting 2 jobs that will both hit 60 GB at the same
time (or three 25 GB jobs)?

If we can predict the memory usage, then the user can probably do that
even better.

niceload (part of the package) has --start-mem, which will only start a
new job if a certain amount of memory is free. That may help in
some situations.
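
A hedged example of combining the two tools; the 15G threshold (and whether
your niceload version accepts the size suffix), the slot count and the
hypothetical 'bigjob' command are assumptions for illustration only:

    # Each of the 6 slots waits until roughly 15 GB is reported free
    # before its job is allowed to start.
    parallel -j 6 'niceload --start-mem 15g bigjob {}' ::: inputs/*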

But it does not solve the situation where the next 3 jobs are 25 GB
jobs that start out looking like 10 GB jobs, so you only discover
that they are 25 GB jobs long after they have started.

So right now the problem is to find an algorithm that would do the
right thing in most cases.

If your program reaches its maximum memory usage fast, then I would suggest
you use 'ulimit' to kill off the jobs: that way you can run six 10 GB
jobs at a time (killing jobs bigger than 10 GB). Using --joblog you
can keep track of the jobs that got killed. When all the 10 GB jobs
are complete, you can raise the ulimit and run three 20 GB jobs with
--resume-failed, then two 30 GB jobs, and finally the rest one job at a
time.
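
A rough sketch of that staged workflow, assuming the same input list is
given in every pass; the limits, slot counts and the hypothetical 'myjob'
command are illustrative:

    # Pass 1: six slots, each job capped at ~10 GB of virtual memory via
    # ulimit -v (kilobytes); jobs that outgrow the cap fail and are logged.
    parallel -j 6 --joblog run.log 'ulimit -v 10000000; myjob {}' ::: inputs/*
    # Pass 2: rerun only the failed jobs with a 20 GB cap, three at a time;
    # repeat with a higher cap and fewer slots for whatever still fails.
    parallel -j 3 --joblog run.log --resume-failed \
        'ulimit -v 20000000; myjob {}' ::: inputs/*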


/Ole




