
Facing issue with running GNU Parallel on a cluster


From: Nanditha Rao
Subject: Facing issue with running GNU Parallel on a cluster
Date: Sun, 24 Mar 2013 12:07:17 +0530

Ques 1:
I need to run a large number of simulations using a tool called ngspice. Since I want to run a million simulations, I am distributing them across a cluster of machines (a master and one slave, each with 12 cores).
This is the command for the tool:
ngspice deck_1.sp
ngspice deck_2.sp and so on

Step 1: A Python script generates these sp files.
Step 2: Python invokes GNU Parallel to distribute the files (*.sp files) across the machines and run the simulations using ngspice.
Step 3: A Python script post-processes the results.

I generate and process only 1000 files at a time to save disk space. So the above Step 1 to 3 are repeated in a loop till a million files are simulated.

The following is the structure of my python code:

for loop in range(1, num_of_loops + 1):

    # Step 1:
    # (the existing sp files are cleared here)
    os.system('python generate.py')  # This generates deck_1.sp, deck_2.sp etc.

    # Step 2: Run GNU Parallel
    os.system("seq 1 1000 | parallel --progress -j +0 --sshloginfile /home/user1/simulations/decoder/spice_decks/sshmachines.txt  'cd /home/user1/simulations/decoder/spice_decks && ngspice deck_{}.sp' " )

    # Step 3:
    os.system('python process_the_results.py')

sshmachines.txt has the following:
:
12/user1@192.168.1.8
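
For reference, the same three steps could also be driven with subprocess instead of os.system, so that a step exiting with a non-zero status stops the loop rather than passing silently. This is only a sketch under assumptions: the helper names build_parallel_command and run_batch are hypothetical, the paths and GNU Parallel flags are copied from the command above, the clearing of old decks is left out, and Python 3.7 or later is assumed. It is not a fix for the error described below.

```python
import subprocess

DECK_DIR = "/home/user1/simulations/decoder/spice_decks"  # path from the post
SSHLOGINFILE = DECK_DIR + "/sshmachines.txt"              # path from the post


def build_parallel_command(deck_dir, sshloginfile):
    """Return the GNU Parallel argument list for Step 2 (same flags as the post)."""
    # {{}} renders as the literal {} placeholder that GNU Parallel substitutes.
    remote_job = "cd {} && ngspice deck_{{}}.sp".format(deck_dir)
    return ["parallel", "--progress", "-j", "+0",
            "--sshloginfile", sshloginfile, remote_job]


def run_batch(n_decks=1000, deck_dir=DECK_DIR, sshloginfile=SSHLOGINFILE):
    """Run Steps 1 to 3 once; check=True raises CalledProcessError on a non-zero exit."""
    # Step 1 (clearing the old decks is omitted in this sketch)
    subprocess.run(["python", "generate.py"], check=True)
    # Step 2: feed the deck numbers on stdin, replacing "seq 1 1000 |"
    deck_numbers = "".join("{}\n".format(i) for i in range(1, n_decks + 1))
    subprocess.run(build_parallel_command(deck_dir, sshloginfile),
                   input=deck_numbers, text=True, check=True)
    # Step 3
    subprocess.run(["python", "process_the_results.py"], check=True)
```

Calling run_batch() inside the for loop would then replace the three os.system calls.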


Now, my problem is: 
When the for loop runs for the first time, I have no problem: the files are distributed across the local and the slave (12/user1@192.168.1.8) machines until the 1000 simulations are complete. When the loop starts the second time, I clear the existing sp files and regenerate them (Step 1). Then, when Step 2 runs, some files are not detected for a reason I cannot explain. GNU Parallel errors out saying:

Computers / CPU cores / Max jobs to run
1:local / 12 / 12
2:user1@192.168.1.8 / 12 / 11

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
/home/user1/simulations/decoder/spice_decks/deck_13.sp: No such file or directory
/home/user1/simulations/decoder/spice_decks
local:12/0/50%/0.0s  user1@192.168.1./home/user1/simulations/decoder/spice_decks/deck_24.sp: No such file or directory
/home/user1/simulations/decoder/spice_decks
local:12/0/48%/0.0s  user1@192.168.1./home/user1/simulations/decoder/spice_decks/deck_25.sp: No such file or directory
/home/user1/simulations/decoder/spice_decks


I paused my script and ran 'ls' to print the existing file list, and 'pwd' to confirm that it is in the correct directory. I also took screenshots to verify that the files actually exist. I am not sure why it complains that certain files do not exist. Strangely, this happens only from the second iteration of the 'for' loop onward; there is no problem on the first iteration.
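
The manual 'ls' check could be automated along these lines and run on the master right before Step 2. This is a minimal sketch: the function name missing_decks is hypothetical, and it only reports what the local machine sees, which may differ from what the slave sees.

```python
import os


def missing_decks(deck_dir, n_decks=1000):
    """Return the deck numbers whose deck_<n>.sp file is absent from deck_dir."""
    return [i for i in range(1, n_decks + 1)
            if not os.path.isfile(os.path.join(deck_dir, "deck_{}.sp".format(i)))]
```

For example, missing_decks('/home/user1/simulations/decoder/spice_decks') returns an empty list when all 1000 decks are present locally.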

This happens only on the slave machine. If I delete 12/user1@192.168.1.8 from sshmachines.txt and use only the master, i.e. the local machine (:), I see no error at all.
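
Since the error shows up only when the slave is listed, one thing that could be checked is whether the slave sees the same files at the moment Step 2 starts. The sketch below is an assumption-laden diagnostic, not a diagnosis: the helper names remote_hosts and remote_ls_command are hypothetical, and it assumes the same ssh access that GNU Parallel already uses. It reads entries in the sshmachines.txt format shown above, where ':' is the local machine and a leading '12/' is the core count.

```python
def remote_hosts(sshlogin_lines):
    """Extract remote logins from sshloginfile lines, dropping an optional 'ncpu/' prefix."""
    hosts = []
    for line in sshlogin_lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # blank line or comment
        prefix, slash, rest = entry.partition("/")
        if slash and prefix.isdigit():
            entry = rest  # strip the core count, e.g. "12/"
        if entry != ":":  # ":" means the local machine
            hosts.append(entry)
    return hosts


def remote_ls_command(host, path):
    """Build an ssh command that lists one file as seen from the remote host."""
    return ["ssh", host, "ls", "-l", path]
```

Each command returned by remote_ls_command can be passed to subprocess.run for a deck that was reported missing, such as deck_13.sp.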


Ques 2: 
I then changed the parallel command (Step 2) to:

os.system("find /home/user1/simulations/decoder/spice_decks/ -name '*.sp' | parallel --progress -j +0 --sshloginfile /home/user1/simulations/decoder/sshmachines.txt --transfer 'cd /home/user1/simulations/decoder/spice_decks/ && ngspice {}'" )

Now I get the following errors (again, on the second pass through the loop). It again complains that there are no such files as deck_86.sp, deck_91.sp, etc. Any help is appreciated. Thanks.

Computers / CPU cores / Max jobs to run
1:local / 12 / 12

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
rsync: mkstemp "/home/user1/simulations/decoder/spice_decks/.deck_86.sp.Bo4pZM" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
/home/user1/simulations/decoder/spice_decks/deck_86.sp: No such file or directory
local:12/0/52%/0.0s  user1@192.168.1.8:11/0/47%/0.0s rsync: mkstemp "/home/user1/simulations/decoder/spice_decks/.deck_91.sp.VVj9mJ" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
/home/user1/simulations/decoder/spice_decks/deck_91.sp: No such file or directory
rsync: mkstemp "/home/user1/simulations/decoder/spice_decks/.deck_53.sp.254uYM" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
/home/user1/simulations/decoder/spice_decks/deck_53.sp: No such file or directory

Am I missing something? It looks strange to me that the file exists and yet cannot be found.

Please help.

Regards
Nanditha
