
Re: How to debug `parallel` crash?


From: Nagle, Michael F
Subject: Re: How to debug `parallel` crash?
Date: Sat, 9 Jul 2022 22:55:12 +0000

Thanks for your response, Rob. I will do my best to answer your questions. Please let me know if anything is unclear and more info would help. I appreciate your attention to this!

This is a rather powerful Dell workstation running Ubuntu 22.04 LTS, with a 12-core Intel processor and 503GB RAM.

I'm running as a user with admin privileges, but am not using sudo, so as I understand these should not be root processes.

In short, we're running some custom Python code to analyze ~1.3GB hyperspectral images, do some linear algebra and output some plots and arrays describing the biochemical composition in these images. This is benchmarked to take 2-4GB of RAM per image. There is one image per job. By default, parallel is running 24 jobs, dual-threading on each of 12 cores... There should be plenty of RAM to run 24 4GB jobs at once. Since this is an embarrassingly parallel computation and we already use bash scripting in this workflow, I prefer to keep it simple and use GNU Parallel rather than Python parallel frameworks... it always worked great in the past.
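
As a quick sanity check on the memory claim, the worst case can be worked out from the figures above (24 jobs, up to 4 GB each, 503 GB installed). The variable names are only illustrative:

```shell
# Figures taken from the description above; variable names are illustrative.
n_jobs=24        # concurrent jobs started by parallel by default on this machine
per_job_gb=4     # upper end of the benchmarked 2-4 GB per image
total_gb=503     # installed RAM

needed_gb=$(( n_jobs * per_job_gb ))
headroom_gb=$(( total_gb - needed_gb ))
echo "worst case: ${needed_gb} GB needed, ${headroom_gb} GB headroom"
```

This only covers the benchmarked per-image footprint; it says nothing about memory held by anything else on the machine.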

Here is the script I'm calling from the command line, inside the jobs file described further below: gmodetector_py/analyze_sample.py at master · naglemi/gmodetector_py (github.com)

# This is what we run to execute the .jobs file
parallel -a $job_list_name

# I have also tried limiting the number of jobs to 20, which also leads to the same crashing problem after a few runs.
parallel --jobs 20 -a $job_list_name
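
For reference, GNU Parallel only records per-job information when asked to. A minimal sketch, assuming the job list name ends in `.jobs` and that a log written next to it is acceptable (the `joblog` variable and the `images.jobs` fallback are placeholders, not part of the workflow above):

```shell
# Placeholder fallback so the snippet also runs outside the notebook.
job_list_name=${job_list_name:-images.jobs}
# Hypothetical log location: same name as the job list, with a .joblog suffix.
joblog="${job_list_name%.jobs}.joblog"

# --joblog writes one line per finished job (start time, runtime, exit value,
# signal, command) to a file on disk, independent of the terminal.
if command -v parallel >/dev/null 2>&1 && [[ -s $job_list_name ]]; then
  parallel --jobs 20 --joblog "$joblog" -a "$job_list_name"
fi
```

After a crash, the last lines of the job log would show which jobs finished and whether any of them ended with a non-zero exit value or a signal.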

# Here is how we prepare the .jobs file. We produce one job per image, each given its own line in a text file, with options set by a bunch of variables in a Jupyter notebook. Note, I have also confirmed it still crashes if we run outside of Jupyter.
for file in $data/*.hdr
do
 if [[ "$file" != *'hroma'* ]] && [[ "$file" != *'roadband'* ]]; then
  echo "python wrappers/analyze_sample.py \
--file_path $file \
--fluorophores ${fluorophores[*]} \
--min_desired_wavelength ${desired_wavelength_range[0]} \
--max_desired_wavelength ${desired_wavelength_range[1]} \
--red_channel ${FalseColor_channels[0]} \
--green_channel ${FalseColor_channels[1]} \
--blue_channel ${FalseColor_channels[2]} \
--red_cap ${FalseColor_caps[0]} \
--green_cap ${FalseColor_caps[1]} \
--blue_cap ${FalseColor_caps[2]} \
--plot 1 \
--spectral_library_path \"$spectral_library_path\" \
--output_dir $output_directory_full \
--threshold 38" >> $job_list_name
 fi
done
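
To check that the job file really has one line per image, the same filter can be applied on its own and the counts compared. This is only a sketch; `count_expected_jobs` is a hypothetical helper, not part of gmodetector_py:

```shell
# Count the .hdr files that the loop above would turn into jobs.
count_expected_jobs() {
  local dir=$1 file n=0
  for file in "$dir"/*.hdr; do
    [[ -e $file ]] || continue          # the glob matched nothing
    if [[ $file != *hroma* && $file != *roadband* ]]; then
      n=$(( n + 1 ))
    fi
  done
  echo "$n"
}
```

It could be compared against `wc -l < "$job_list_name"`. Because the loop appends with `>>`, a job file with more lines than images would suggest entries left over from an earlier run.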

Thanks again!

From: Rob Sargent <robjsargent@gmail.com>
Sent: Saturday, July 9, 2022 2:59 PM
To: Nagle, Michael F <michael.nagle@oregonstate.edu>
Cc: parallel <parallel@gnu.org>
Subject: Re: How to debug `parallel` crash?
 




On Jul 9, 2022, at 3:34 PM, Nagle, Michael F <michael.nagle@oregonstate.edu> wrote:


Hello,

First, I’d like to thank the developers and community for producing GNU Parallel and supporting it.

I use GNU parallel for a particular part of a scientific workflow, and it worked great on a previous machine. On a new machine (with many more cores), I’m now having it crash sometimes and am having trouble debugging this.

When it crashes, the terminal it is being run from crashes, so I’m left with no error message or clues I can find as to why the crash occurred. How can I figure this out?

What I’ve tried and outcomes:
1. Restarting the machine and trying again… GNU parallel never crashes the first time it is run after a restart. After several runs, it crashes every time, and the machine needs to be restarted again before it will work. This leads me to suspect some kind of zombie processes may be left behind, but I don’t see anything suspicious with `top`.
2. Looking for log files… These could be very helpful and informative if they’re out there. I looked in /var/logs/ and in the directory from which `parallel` is being run, but haven’t found logs. I haven’t been able to find info about logs in documentation. Are there logs I should be able to find, and where?
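
The zombie suspicion in item 1 can also be checked directly instead of by scanning `top`. A sketch, assuming the procps `ps` that ships with Ubuntu (`count_zombies` is a made-up name):

```shell
count_zombies() {
  # A process state beginning with Z marks a defunct (zombie) process.
  ps -eo stat= | awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

echo "zombie processes: $(count_zombies)"
```

Running it before and after each `parallel` run would show whether the count grows from one run to the next.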

Any advice for diagnosing and troubleshooting the problem would be greatly appreciated. Thanks for your time and help.


Michael Nagle

PhD Candidate, Molecular and Cellular Biology

Forest Biotechnology Laboratory

Oregon State University

301-974-7221 (cell)



Are you crashing a Linux machine? That would be impressive. Are you running as root? That would be dangerous.

Show the command line which causes the crash. Show all of it. In plain text. Describe the data files. Maybe a hint at what the processing does. Describe your machine.
