parallel

Re: Example of use of GNU Parallel


From: Nanditha Rao
Subject: Re: Example of use of GNU Parallel
Date: Sat, 23 Mar 2013 17:31:45 +0530

Sorry for the long mail, but I need to explain this strange problem.

I need to run a bunch of simulations using a tool called ngspice. Since I want to run a million simulations, I am distributing them across a cluster of machines (a master plus 3 slaves, each with 12 cores).
This is the command for the tool:
ngspice deck_1.sp
ngspice deck_2.sp
and so on.

Step 1: A Python script generates these .sp files.
Step 2: I then run GNU Parallel to distribute the *.sp files across the machines. I use --sshloginfile and list the IP addresses of the slave machines in that file (right now I am experimenting with only 1 master and 1 slave). The simulations themselves are run with ngspice.
Step 3: After running them, I need to post-process the results (again using a Python script).

Also, since I do not want to fill the disk by generating all one million files at once, I generate only 1000 at a time, run ngspice on these 1000 files through GNU Parallel, and then post-process. So Steps 1 to 3 above are repeated in a loop until a million files have been simulated.



The following is the structure of my code:

import os

for loop in range(1, num_of_loops + 1):

    # Step 1: generate deck_1.sp, deck_2.sp, etc. (1000 per pass, up to a million cases overall)
    os.system('python generate.py')

    # Step 2: run GNU Parallel
    os.system("seq 1 %d | parallel --progress -j +0 --sshloginfile %s/%s/sshmachines.txt 'cd %s/%s/spice_decks; ngspice %s/%s/spice_decks/deck_{}.sp' " % (num_spice, path_fixed, folder, path_fixed, folder, path_fixed, folder))

    # Essentially this means something like:
    # seq 1 1000 | parallel --progress -j +0 --sshloginfile /home/user1/simulations/decoder/spice_decks/sshmachines.txt  'cd /home/user1/simulations/decoder/spice_decks && ngspice deck_{}.sp'

    # Step 3: post-process the results
    os.system('python process_the_results.py')

sshmachines.txt has the following:
:
12/user1@192.168.1.8
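
The Step 2 call can also be written without building a shell string by hand. The following is a minimal sketch, assuming Python 3 and `subprocess.run`; the helper names `build_parallel_command`, `job_input` and `run_batch` are hypothetical, and only the options already shown above (`--progress`, `-j +0`, `--sshloginfile`) are used. It feeds the job numbers on stdin instead of piping from `seq`, and returns parallel's exit status so the loop can inspect it before moving on to Step 3.

```python
import shlex
import subprocess


def build_parallel_command(deck_dir, sshlogin_file):
    """Return the argv list for the GNU Parallel call used in Step 2."""
    # The remote command is passed as one argument; parallel fills in {}.
    remote_cmd = "cd %s && ngspice deck_{}.sp" % shlex.quote(deck_dir)
    return ["parallel", "--progress", "-j", "+0",
            "--sshloginfile", sshlogin_file, remote_cmd]


def job_input(num_decks):
    """Return the same text that `seq 1 num_decks` would print."""
    return "".join("%d\n" % i for i in range(1, num_decks + 1))


def run_batch(num_decks, deck_dir, sshlogin_file):
    """Run one batch of simulations and return parallel's exit status."""
    result = subprocess.run(build_parallel_command(deck_dir, sshlogin_file),
                            input=job_input(num_decks), text=True)
    return result.returncode
```

Inside the loop this would be called as `run_batch(num_spice, deck_dir, sshlogin_file)`, with both paths built from `path_fixed` and `folder` as in the original command.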


I execute 1000 jobs at a time.
Now, my problem is:
When I execute the for loop for the 1st time, I have no problem. The files are distributed across the local and the slave (12/user1@192.168.1.8) machines until the 1000 simulations are complete.

When the loop starts the second time, I delete the existing sp files using Python commands and regenerate them (Step 1). When I then execute Step 2, for some strange reason some files are not detected. GNU Parallel errors out saying:

Computers / CPU cores / Max jobs to run
1:local / 12 / 12
2:user1@192.168.1.8 / 12 / 11

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
/home/user1/simulations/decoder/spice_decks/deck_13.sp: No such file or directory
/home/user1/simulations/decoder/spice_decks
local:12/0/50%/0.0s  user1@192.168.1./home/user1/simulations/decoder/spice_decks/deck_24.sp: No such file or directory
/home/user1/simulations/decoder/spice_decks
local:12/0/48%/0.0s  user1@192.168.1./home/user1/simulations/decoder/spice_decks/deck_25.sp: No such file or directory
/home/user1/simulations/decoder/spice_decks


I thought that the files might not exist, but they do. I paused my script and did an 'ls' to print the existing file list, and a 'pwd' to check that it is actually in the correct directory. I took screenshots to verify that the files actually exist (screenshots attached). But I am not sure why it complains that certain files do not exist. Strangely, this happens only from the 2nd time the 'for' loop is executed; there is no problem the first time.
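
The `ls` check can also be scripted. A minimal sketch, assuming the `deck_N.sp` naming used above; `missing_decks` is a hypothetical helper name. Running it on the master and again on the slave (for example over ssh) would show whether both machines see the same set of files at the moment Step 2 starts. How the slave gets access to the deck directory is not stated in this mail, so that comparison is only a suggestion.

```python
import os


def missing_decks(deck_dir, num_decks):
    """Return the expected deck files that are not present in deck_dir."""
    expected = ["deck_%d.sp" % i for i in range(1, num_decks + 1)]
    return [name for name in expected
            if not os.path.isfile(os.path.join(deck_dir, name))]
```

An empty list means every deck from 1 to `num_decks` is visible from the machine that ran the check.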

And this happens only on the slave machine. If I remove the 12/user1@192.168.1.8 line from the sshloginfile (sshmachines.txt) and use just the master, i.e. the local machine (:), I see no error at all.
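
Whether failures really occur only on the slave can be checked from a job log rather than from the interleaved progress output. This is a sketch under several assumptions: that parallel is run with its `--joblog` option (not used in the command above), that the log is tab-separated with `Host` and `Exitval` columns, and that ngspice exits with a non-zero status when a deck is missing. `failures_by_host` is a hypothetical helper name.

```python
import csv
from collections import Counter


def failures_by_host(joblog_path):
    """Count jobs with a non-zero Exitval per host in a parallel job log."""
    failures = Counter()
    with open(joblog_path, newline="") as handle:
        # Assumes a tab-separated log whose header names Host and Exitval.
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["Exitval"] != "0":
                failures[row["Host"]] += 1
    return failures
```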

If you look at the above error it says:
user1@192.168.1./home/user1/simulations/decoder/spice_decks/deck_24.sp: No such file or directory

The IP address I provided is user1@192.168.1.8, but in the error it somehow prints user1@192.168.1.

Is it missing something? Is it a bug? It looks strange to me that the file exists and yet it cannot be found.

Please help.

Regards
Nanditha





On Mon, Feb 11, 2013 at 3:21 PM, Ole Tange <tange@gnu.org> wrote:
I have earlier encouraged you to post examples of use of GNU Parallel.

Today I had to split a 200 GB gz-file into smaller files. The file contained records of 4 lines, so I had to unpack the .gz file, chop at a 4 line limit around 10 MB and gzip the chunk under a unique name:

zcat big.gz | parallel --block 10M -L4 --pipe gzip -1 '>'small.{#}.gz
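
For comparison, the same kind of split can be sketched in plain Python. This is only an approximation of the command above, not how GNU Parallel does it: it closes a chunk as soon as it reaches `block_bytes`, always on a record boundary, and it compresses the chunks one after another rather than in parallel. The names `read_records`, `write_chunk` and `split_gz` are hypothetical.

```python
import gzip


def read_records(stream, record_lines):
    """Yield consecutive records of record_lines lines each, as bytes."""
    while True:
        record = b"".join(stream.readline() for _ in range(record_lines))
        if not record:
            return
        yield record


def write_chunk(path, records):
    """Write one chunk with fast compression, like `gzip -1`."""
    with gzip.open(path, "wb", compresslevel=1) as out:
        out.write(b"".join(records))


def split_gz(src_path, out_pattern, block_bytes=10 * 1024 * 1024, record_lines=4):
    """Split src_path into gzip chunks; out_pattern takes the chunk number."""
    outputs, chunk, size = [], [], 0
    with gzip.open(src_path, "rb") as src:
        for record in read_records(src, record_lines):
            chunk.append(record)
            size += len(record)
            if size >= block_bytes:
                outputs.append(out_pattern.format(len(outputs) + 1))
                write_chunk(outputs[-1], chunk)
                chunk, size = [], 0
    if chunk:  # final partial chunk
        outputs.append(out_pattern.format(len(outputs) + 1))
        write_chunk(outputs[-1], chunk)
    return outputs
```

A call such as `split_gz("big.gz", "small.{}.gz")` would produce `small.1.gz`, `small.2.gz`, and so on.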

The limiting factor in this was GNU Parallel itself, which is not uncommon for --pipe.

The spreadstdin() and write_record_to_pipe() are to blame. They can be sped up by re-writing these functions in C/C++. But it might even be sufficient to split up the parts into a reader process (which would read a chunk, find the split point, and put it in a queue), a few writer processes (which given a chunk would write it to the user program) and a manager process (which would communicate between the reader and the writer and spawn off new writer processes if needed), so fork does not have to be called for every block. Any takers?
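
The reader/writer split described above can be illustrated with a small queue-based sketch. It is not GNU Parallel code (which is written in Perl) and it deliberately simplifies the proposal: it uses threads instead of processes, a fixed pool of writers instead of a manager that spawns more on demand, and a hypothetical `handle_chunk` callable in place of writing to the user program. The point is only that the writers are started once and then reused for every block.

```python
import queue
import threading


def run_pipeline(chunks, handle_chunk, num_writers=3):
    """Feed chunks from one reader to long-lived writer workers via a queue."""
    work = queue.Queue(maxsize=2 * num_writers)
    results = []
    lock = threading.Lock()

    def reader():
        # Number the chunks so the original order can be restored later.
        for index, chunk in enumerate(chunks):
            work.put((index, chunk))
        for _ in range(num_writers):
            work.put(None)  # one stop marker per writer

    def writer():
        while True:
            item = work.get()
            if item is None:
                return
            index, chunk = item
            output = handle_chunk(chunk)
            with lock:
                results.append((index, output))

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=writer) for _ in range(num_writers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [output for _, output in sorted(results)]
```

The bounded queue keeps the reader from running far ahead of the writers, which limits how many chunks are held in memory at once.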


/Ole


Attachment: gnu_parallel_1.jpeg
Description: JPEG image

Attachment: gnu_parallel_2.jpeg
Description: JPEG image

