From: Christian Meesters
Subject: Re: Spreading parallel across nodes on HPC system
Date: Thu, 10 Nov 2022 21:27:47 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.3.0

Hi,
Take a look here for a template: https://mogonwiki.zdv.uni-mainz.de/dokuwiki/start:working_on_mogon:workflow_organization:node_local_scheduling#running_on_several_hosts
Of course, you need to adjust the partition names and the like,
and the example is unmaintained, but it worked for me for quite a
while.
Best regards,
Christian

Hello,

I'm trying to run parallel on multiple nodes. Each node may have a different number of CPUs. It appears the best syntax for this is from the man page --slf section:

    8/my-8-cpu-server.example.com
    2/my_other_username@my-dualcore.example.net

My problem is that I'm running in the SLURM environment. I can get the hostnames with

    scontrol show hostnames $SLURM_JOB_NODELIST > nodelist.0

But I cannot easily get the CPUS-per-node. From the SLURM docs:

    SLURM_JOB_CPUS_PER_NODE: Count of CPUs available to the job on the nodes in the allocation, using the format CPU_count[(xnumber_of_nodes)][,CPU_count [(xnumber_of_nodes)] ...]. For example: SLURM_JOB_CPUS_PER_NODE='72(x2),36' indicates that on the first and second nodes (as listed by SLURM_JOB_NODELIST) the allocation has 72 CPUs, while the third node has 36 CPUs.

So, parsing '72(x2),36' seems complicated. If I requested a total of 1000 tasks, but have no control over how many nodes, can I just call parallel with -j1000 and pass it a hostfile without the "CPUs/" prepended to the hostname? Would parallel then start however many jobs it can per node, and if for some reason I was allocated 1000 CPUS on 1 node, that would work fine, as would 1 CPU on 1000 different nodes?

Thanks,
-k.
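
For what it's worth, the '72(x2),36' format can be unpacked with a few lines of plain shell. The following is only a sketch, not something taken from the linked template: the function name expand_cpus and the output file nodelist.slf are made up for illustration, and it assumes the counts come in the same order as the hostnames printed by scontrol, as the SLURM docs quoted above describe.

```sh
#!/bin/sh
# Hypothetical helper: expand a SLURM_JOB_CPUS_PER_NODE string such as
# '72(x2),36' into one CPU count per line (72, 72, 36).
expand_cpus() {
    # Put each comma-separated item on its own line, then expand it.
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r item; do
        count=${item%%\(*}              # text before "(" is the CPU count
        case $item in
            *"(x"*)
                reps=${item##*\(x}      # text after "(x" ...
                reps=${reps%\)}         # ... minus the closing ")"
                ;;
            *)
                reps=1                  # no "(xN)" suffix: a single node
                ;;
        esac
        i=0
        while [ "$i" -lt "$reps" ]; do
            printf '%s\n' "$count"
            i=$((i + 1))
        done
    done
}

# Only inside a SLURM job: pair each count with its hostname to get
# "CPUs/hostname" lines in the --slf format.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    scontrol show hostnames "$SLURM_JOB_NODELIST" > nodelist.0
    expand_cpus "$SLURM_JOB_CPUS_PER_NODE" | paste -d/ - nodelist.0 > nodelist.slf
fi
```

The resulting file could then be passed along the lines of `parallel --slf nodelist.slf ...`. Note that, as far as I read the man page, -j sets the number of job slots per machine rather than a total across all hosts, so -j1000 with a bare hostfile would probably not spread 1000 jobs in proportion to the allocated CPUs.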