Re: Limiting memory used by parallel?


From: hubert depesz lubaczewski
Subject: Re: Limiting memory used by parallel?
Date: Mon, 29 Jan 2018 13:45:03 +0100
User-agent: Mutt/1.5.23 (2014-03-12)

On Sun, Jan 28, 2018 at 02:45:42AM +0100, Ole Tange wrote:
> --pipe keeps one block per process in memory, so the above should use
> around 25 GB of RAM.
> 
> You can see the reason for this design by imagining jobs that read
> very slowly: you will want all 5 of them to be running, but you would
> have to read (and buffer) at least 4*5 GB to start the 5th process,
> and the code is cleaner if you simply read the full block for every
> process.

OK, I understand the reason.
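
For concreteness, the kind of invocation this discussion is about looks
roughly like this (the job count, block size and script name are
illustrative, not taken from the thread):

  tar cf - /some/directory | parallel -j 5 --pipe --block 5G \
    --recend '' ./handle-single-part.sh {#}

With 5 job slots and 5 GB blocks, --pipe can hold one block per slot,
which is where the roughly 5 * 5 GB = 25 GB of RAM comes from.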

> --pipepart does not buffer blocks in memory, so that is one way to
> avoid this. --pipepart is extremely fast: it delivers around 1 GB/s
> per CPU core, so it will most likely be limited by your disk speed:
> 
>   tar cf - /some/directory > bigfile.tar
>   parallel -a bigfile.tar --pipepart --block 5G --recend '' \
>     ./handle-single-part.sh {#}
> 
> But I imagine you do not have space to keep an uncompressed copy of
> the tarfile, and you really want to handle the parts _while_ tar is
> running.
> 
> You can also use --cat:
> 
>   tar cf - /some/directory | parallel -j 5 --pipe --block 5G --cat \
>     --recend '' 'cat {} | ./handle-single-part.sh {#}'
> 
> This way each block is saved to the tempdir before the job starts. In
> my limited testing this should make GNU Parallel keep only 1-2 blocks
> in memory.

Yeah. All of this requires saving either the whole file or the parts to
a temp dir, which is something I'd prefer to avoid entirely.

Will look into the cat thing, as it has some promise...
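
If the per-block temp files turn out to be the sticking point, GNU
Parallel writes them to $TMPDIR, which can also be set with --tmpdir;
a sketch, with /path/to/big/tmp standing in for a filesystem that has
room for a few 5 GB blocks:

  tar cf - /some/directory | parallel --tmpdir /path/to/big/tmp -j 5 \
    --pipe --block 5G --cat --recend '' \
    'cat {} | ./handle-single-part.sh {#}'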

Best regards,

depesz