
Re: Spreading parallel across nodes on HPC system

From: Ole Tange
Subject: Re: Spreading parallel across nodes on HPC system
Date: Fri, 11 Nov 2022 19:33:26 +0100

On Fri, Nov 11, 2022 at 5:58 PM Ken Mankoff <> wrote:

> I'll try to simplify my original question...
> If I run
> parallel --slf hostfile -j 1000 <script> ::: $(seq 1000)
> And hostfile has some hosts that have 1 CPU, and some hosts that have 100s of 
> CPUs, does parallel take care of handling this?
> I've now read the man page in more detail, and in the -S documentation (just 
> above --slf) I see
> > GNU parallel will determine the number of CPUs on the remote computers
> > and run the number of jobs as specified by -j.
> So I *think* that if I leave "-j" off the command line, parallel will use the 
> maximum number of available CPUs. This all sounds good.
> Last question, which I may be able to figure out with trial-and-error 
> testing. Does parallel detect the total number of CPUs on host, or the number 
> of CPUs allocated to me and my job? I only have access to the latter...

Try running this:

$ seq 100000 | parallel -Slo,h --eta true

Computers / CPU cores / Max jobs to run
1:h / 2 / 2
2:lo / 8 / 8

Computer:jobs running/jobs completed/%of started jobs/Average seconds
to complete
ETA: 10558s Left: 99920 AVG: 0.10s  h:2/21/26%/1.5s  lo:8/59/73%/0.5s

The server h has 2 CPU threads, the server lo has 8 CPU threads.

So GNU Parallel detects the number of CPU threads the server has.

It does not detect how many threads are reserved for you by SLURM.
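A quick way to see the difference on a node (a sketch: SLURM_CPUS_ON_NODE is
the variable SLURM sets to the allocated count, and nproc is used here only as
an approximation of what GNU Parallel detects):

```shell
# Sketch: compare what the hardware reports (roughly what GNU Parallel
# detects) with what SLURM has allocated to this job. SLURM_CPUS_ON_NODE
# is only set inside a SLURM job, so fall back to the detected value.
detected=$(nproc)
allocated=${SLURM_CPUS_ON_NODE:-$detected}
echo "detected=$detected allocated=$allocated"
```

Inside an allocation the two numbers will differ whenever SLURM gave you
fewer CPUs than the node has.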

What happens if you use more threads than allocated for you?

> > SLURM_JOB_CPUS_PER_NODE: Count of CPUs available to the job on the
> > nodes in the allocation, using the format
> > CPU_count[(xnumber_of_nodes)][,CPU_count [(xnumber_of_nodes)] ...].
> > For example: SLURM_JOB_CPUS_PER_NODE='72(x2),36' indicates that on the
> > first and second nodes (as listed by SLURM_JOB_NODELIST) the
> > allocation has 72 CPUs, while the third node has 36 CPUs.

It seems SLURM sets a lot of other env vars. Maybe one of those is
easier to parse? Could you get a sample output of `env`?

It seems it should be possible to generate a --slf file by merging
SLURM_JOB_CPUS_PER_NODE and SLURM_JOB_NODELIST. I would need to see real
examples of those two variables to confirm that.
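Something along these lines might work (a sketch only, assuming
SLURM_JOB_CPUS_PER_NODE has the documented '72(x2),36' form and that
`scontrol show hostnames` is available to expand the compressed nodelist):

```shell
#!/bin/sh
# Sketch: build a GNU Parallel --slf file from SLURM's env vars.
# Assumes SLURM_JOB_CPUS_PER_NODE looks like '72(x2),36' and that
# 'scontrol show hostnames' can expand SLURM_JOB_NODELIST.

# Expand a CPU_count[(xN)] list into one count per line:
# '72(x2),36' -> three lines: 72, 72, 36
expand_counts() {
    echo "$1" | tr ',' '\n' | while IFS= read -r spec; do
        count=${spec%%\(*}                  # CPU count before '(xN)'
        case $spec in
            *"(x"*")") reps=${spec##*"(x"}; reps=${reps%")"} ;;
            *)         reps=1 ;;
        esac
        i=0
        while [ "$i" -lt "$reps" ]; do echo "$count"; i=$((i + 1)); done
    done
}

# Only attempt the real merge inside a SLURM job.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1
then
    # One hostname per line, in the same order as the counts.
    scontrol show hostnames "$SLURM_JOB_NODELIST" > nodes.txt
    expand_counts "$SLURM_JOB_CPUS_PER_NODE"      > counts.txt
    # GNU Parallel's 'N/host' sshlogin syntax caps jobs on 'host' at N,
    # overriding the auto-detected CPU count.
    paste -d/ counts.txt nodes.txt > myslf
fi
```

With that file in place, `parallel --slf myslf ...` would run at most the
allocated number of jobs on each node instead of the detected number.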

