bug-coreutils

bug#36130: split bug


From: Assaf Gordon
Subject: bug#36130: split bug
Date: Mon, 10 Jun 2019 16:50:20 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.0

Hello,

On 2019-06-10 12:28 p.m., Heather Wick wrote:
> Thank you so much for your response. Here are the results of the tests
> you sent:
>
> Verbose: This seems to have made the same number of files this time;
> not sure why the other 3-4 times I ran it, it did not. They appear to
> be the same size, with paired last reads
> [...]

Glad to hear it worked.

Could it be that in previous times the queued job ran out of disk space?

That would be my first guess, as such things are common in shared grid/cluster environments, particularly if your job runs in a temporary
and limited storage location (e.g. "/tmp/job-NNNN").
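
If you want to rule that out, a quick check at the top of the job
script can print the free space. A minimal sketch, assuming the
job's scratch area follows the common $TMPDIR convention (the actual
location is site-specific):

  # print free space for the job's scratch area and the current directory
  df -h "${TMPDIR:-/tmp}" .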

I suspect that the exit code you are seeing is the exit code
of the entire job (that is, of the shell script that is being qsub'd),
and not necessarily that of 'split' (then again, this might not hold
if you explicitly checked the exit code of 'split').

Given that your grid environment already has configuration issues
(the bash and "module" related errors), I would not be surprised if
the exit code is not reliable.

I would strongly encourage you to always look into the STDERR file
of the job to verify that no other errors occurred.

Or, perhaps write shell scripts more defensively, like so:

  [...]
  zcat MH1_R1.fastq.gz | split -l 40000000 - DHT_R1_ \
        && echo split MH1_R1 OK \
        || echo split MH1_R1 FAILED
  [...]

Then check the STDOUT for positive confirmation that each program succeeded.
Or perhaps:


  # define a shell function "die" to print an error and terminate
  die()
  {
    base=$(basename "$0")
    echo "$base: error: $*" >&2
    exit 1
  }

  zcat MH1_R1.fastq.gz | split -l 40000000 - DHT_R1_ \
        || die "split MH1_R1 failed"


And then run at least one job that will fail on purpose, and ensure
you see the error message in the STDERR log and get a non-zero exit
code (and then make sure you use 'die' on every command).
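
For instance, a minimal sketch using the 'die' function above
('false' is just a stand-in for any failing command):

  # 'false' always fails - the message below should appear in the
  # STDERR log, and the job should exit with a non-zero code
  false || die "deliberate test failure"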


It is sometimes recommended to use "set -e" for "easy" error
handling in shell scripts, but I would recommend against it; many
reasons are detailed here: https://mywiki.wooledge.org/BashFAQ/105
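
One classic pitfall, as a sketch: with "set -e", a pipeline's exit
status is that of its last command, so a failing 'zcat' goes
unnoticed when 'split' exits successfully on the resulting empty
input:

  set -e
  # 'zcat' fails here (no such file), but 'split' exits 0 on empty
  # input, so 'set -e' sees success and the script keeps going
  zcat no-such-file.gz | split -l 40000000 - DHT_R1_
  echo "still running, despite the zcat failure"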

It might be tedious to add such extra checks to every program, but
from my humble experience, grid environments bring on so many
intermittent and transient problems that it is definitely worth it.


> STDERR:
> The only thing in the stderr file is an odd duck of:
>
>   -sh: module: line 1: syntax error: unexpected end of file
>   -sh: error importing function definition for `BASH_FUNC_module'
>   Python 3.6.8 :: Anaconda, Inc.
>   /bin/sh: module: line 1: syntax error: unexpected end of file
>   /bin/sh: error importing function definition for `BASH_FUNC_module'
>
> but this prints for every job I run with this particular flavor of
> conda/bash and doesn't seem to affect anything else (as far as I know)


These errors are specific to your grid/cluster environment, and the
best place to ask is the I.T. or bioinformatics department at your
institute (whoever is in charge of the cluster).

Broadly speaking, "module" is a mechanism that eases the use of
various software packages. It is usually set up by your IT
administrators. A typical use-case is to have different versions of
programs in non-standard locations, e.g.
   samtools version 1.6 in /opt/it/programs/samtools-1.6
 and
   samtools version 1.9 in /opt/bioinfo/tools/new/samtools/

and then cluster users (e.g. you) just need to add:
   "module load samtools-1.9"
and have the command "samtools" just work, without knowing the
gritty details of where the program is.
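
On a working setup, typical usage looks something like this (the
module names here are site-specific placeholders, not ones I know
your cluster has):

  module avail            # list the software packages your site provides
  module load samtools    # put a samtools version on your PATH
  module list             # show the currently loaded modules
  samtools --version      # the command now "just works"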

It seems that in your case, something relating to the "module"
setup is broken.

More information here: https://en.wikipedia.org/wiki/Environment_Modules_(software)


> All jobs finished well below allotted memory and with exit status 0,
> even when split didn't make the right number of output files.
>
> Do you know any reason why the behavior would be inconsistent?

The "allotted memory" is a non-issue for this "split" command; it
always uses a very small amount of memory, regardless of how big the
input files are.

As for "exit status 0" - I can't be sure, but I suspect the exit
status you see is that of the entire job (i.e. the shell script), and
perhaps it does not represent the exit code of the "split" program.
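
If your job script runs under bash (not plain /bin/sh), one way to
check "split" itself is bash's PIPESTATUS array, which records the
exit code of every stage of the last pipeline. A minimal sketch:

  zcat MH1_R1.fastq.gz | split -l 40000000 - DHT_R1_
  # PIPESTATUS is bash-only, and must be read right after the pipeline
  echo "zcat exit: ${PIPESTATUS[0]}, split exit: ${PIPESTATUS[1]}" >&2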

If you have the STDERR files of the jobs which failed, it's worth
checking them for any additional error messages.


> Pairing check: unfortunately my server's version of bash doesn't
> support paste in this way; I've run into this issue before but I
> forget what the workaround is. I can't run this command interactively
> because my server times out (these files are > 3 billion lines each,
> so it takes a long time to zcat them)

Ah yes, the construct:

   program <(other program)

is a "bash" feature called "process substitution", and it is not
available in plainer shells such as /bin/sh (which is often what
non-interactive job scripts run under).

One work-around is to run (from inside your script):

  bash -c "paste <(zcat MH1_R1.fastq.gz) <(zcat MH1_R2.fastq.gz)" \
       | awk 'NR%4!=1 { next } $1!=$3 { print "Error in line " NR ": " $1 }'
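
Another possible work-around is named pipes via 'mkfifo', which
avoids the bash-only construct entirely. A sketch, assuming the
job's working directory is writable:

  mkfifo r1.pipe r2.pipe              # create two named pipes
  zcat MH1_R1.fastq.gz > r1.pipe &    # decompress in the background
  zcat MH1_R2.fastq.gz > r2.pipe &
  paste r1.pipe r2.pipe \
      | awk 'NR%4!=1 { next } $1!=$3 { print "Error in line " NR ": " $1 }'
  rm -f r1.pipe r2.pipe               # remove the pipes afterwards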


----

To conclude:

If I understand correctly, the latest attempt worked correctly
and there are no problems in "split".

If this is the case, we can mark this thread as "done".

regards,
 - assaf






