Running mapping of Illumina reads to a reference and variant calling

Comparison of multiple genomes was carried out using multiple software packages wrapped in a Perl package which was incorporated into two bash pipelines for submitting the jobs in parallel on the Pawsey supercomputer (Magnus), based in Perth, Western Australia. All of the scripts, including the Perl package are available here.

The procedures were split into two overall pipelines, which were run simultaneously. The first procedure was mapping and variant calling, which also incorporates SNPs from a comparison between denovo assemblies using nucmer. This procedure was run using the 'mapping_variants.sh' bash script, which was submitted using the 'run_mapping_variants.sh' script. The mapping_variants.sh script implements mpibash, which allowed for parallelisation of the procedure. The procedure was run for all isolates analysed against the reference, which at the time of writing was 12. The mpibash script is as follows:

#!/usr/bin/env mpibash

enable -f mpibash.so mpi_init

threads=$1

mpi_init
mpi_comm_rank rank
mpi_comm_size size

echo "Script began at `date +"%c"` for process $rank of $size."

readsf=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R1.fastq))
readsr=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R2.fastq))
readsu=($(ls /scratch/y95/mderbyshire/pan_genome/data/U_*.u.fastq))
reference="/scratch/y95/mderbyshire/pan_genome/data/Ssclerotiorum_v2_sorted.fasta"

for ((i=0; i < ${#readsf[@]}; i++)); do
  stub=${readsf[$i]}
  stub=${stub##*/}
  stub=${stub%.*}.$$
  newarray+=(${readsf[$i]},${readsr[$i]},${readsu[$i]})
  stubarray+=($stub)
done

#function check_assembly_finished(){
#  count=1
#  if ls *ass/*contigs.fasta 1> /dev/null 2>&1; then
#    denovoarray=($(ls *ass/*contigs.fasta))
#  else
#    denovoarray=0
#  fi
#  while [ ${#denovoarray[@]} -lt ${#readsf[@]} ]; do
#    if [ $count == 1 ]; then
#      echo ""
#      echo "Wating for assembly to finish."
#      echo ""
#    fi
#    ((count+=1))
#    sleep 0.5
#  done
#}

if [[ $rank == 0 ]]; then
  echo "Running mapping and variant calling for:"
  echo "${stubarray[@]}"
  echo "Running ${#stubarray[@]} on $size * $threads cpus."
fi

./mapping.pl \
--reads ${newarray[$rank]} \
--reference $reference \
--prefix ${stubarray[$rank]}.map \
--threads $threads

if [[ $? != 0 ]]; then
  echo "Problem with mapping for ${stubarray[$rank]}."
else
  echo "Mapping complete for ${stubarray[$rank]}."
fi

mpi_barrier

bamarray=($(ls P_*/*rg.bam))

./assembly.pl \
--bam ${bamarray[$rank]} \
--prefix ${stubarray[$rank]}.add \
--mitochondrial /scratch/y95/mderbyshire/pan_genome/data/Sclerotinia_sclerotiorum_mtDNA.fasta \
--threads $threads

mpi_barrier

#check_assembly_finished

denovoarray=($(ls *ass/*contigs.fasta))

./callVariants.pl \
--reference $reference \
--bam ${bamarray[$rank]} \
--nucmer --denovo ${denovoarray[$rank]} \
--filter-snps --depth 0 --prefix ${stubarray[$rank]}.var --combine \
--threads $threads

if [[ $? != 0 ]]; then
  echo "Problem with calling variants for ${stubarray[$rank]}."
else
  echo "Variant calls complete for ${stubarray[$rank]}."
fi

mpi_barrier
mpi_finalize

echo "Mapping and variant calling finished at `date +"%c"`"

Running assembly and gene & repeat annotation of Illumina reads

Note the function in the previous step:

function check_assembly_finished(){
  count=1
  if ls *ass/*contigs.fasta 1> /dev/null 2>&1; then
    denovoarray=($(ls *ass/*contigs.fasta))
  else
    denovoarray=0
  fi
  while [ ${#denovoarray[@]} -lt ${#readsf[@]} ]; do
    if [ $count == 1 ]; then
      echo ""
      echo "Wating for assembly to finish."
      echo ""
    fi
    ((count+=1))
    sleep 0.5
  done
}

This assumes that 'mapping_variants.sh' is being run at the same time as 'assembly_gene_call.sh'. The latter script will produce outputs in a directory suffixed with '*.ass'. When enough of these directories have been generated to match the number of isolates being analysed, the final stage of mapping_variants.sh is able to complete. This is because this part of the script relies on the denovo assemblies generated by assembly_gene_call.sh.

However, this function is commented out as an alternative method was eventually used. In order to not exceed the time limit, 'assembly_gene_call.sh'_ _was run first and the Job ID identified. Then, 'mapping_variants.sh' was run via:

sbatch --dependecy=afterok:<job_id> run_mapping_variants.sh

This ensures the first job runs successfully before the second job is run. The same as manually running one after the other but avoids constant checking.

The assembly_gene_call.sh script is as follows:

#!/usr/bin/env mpibash

enable -f mpibash.so mpi_init

threads=$1


mpi_init
mpi_comm_rank rank
mpi_comm_size size

echo "Script began at `date +"%c"` for process $rank of $size."

readsf=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R1.fastq))
readsr=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R2.fastq))
readsu=($(ls /scratch/y95/mderbyshire/pan_genome/data/U_*.u.fastq))
reference="/scratch/y95/mderbyshire/pan_genome/data/Ssclerotiorum_v2_sorted.fasta"

for ((i=0; i < ${#readsf[@]}; i++)); do
  stub=${readsf[$i]}
  stub=${stub##*/}
  stub=${stub%.*}.$$
  newarray+=(${readsf[$i]},${readsr[$i]},${readsu[$i]})
  stubarray+=($stub)
done

if [[ $rank == 0 ]]; then
  echo "Running assembly and annotation for:"
  echo "${stubarray[@]}"
  echo "Running ${#stubarray[@]} on $size * $threads cpus."
fi 

./assembly.pl \
--reads ${newarray[$rank]} \
--prefix ${stubarray[$rank]}.ass \
--threads $threads

if [[ $? != 0 ]]; then
  echo "Problem with assembly for ${stubarray[$rank]}."
else
  echo "Assembly complete for ${stubarray[$rank]}."
fi

mpi_barrier

denovoarray=($(ls *ass/*contigs.fasta))
./annotation.pl \
--type all \
--contigs ${denovoarray[$rank]} \
--species Sclerotinia_sclerotiorum \
--prot /scratch/y95/mderbyshire/pan_genome/data/Ssclerotiorum_v2.pep.fasta \
--threads $threads \
--prefix ${stubarray[$rank]}.ann

if [[ $? != 0 ]]; then
  echo "Problem with calling annotation for ${stubarray[$rank]}."
else
  echo "Annotation complete for ${stubarray[$rank]}."
fi

mpi_barrier
mpi_finalize

echo "Script ended at `date +"%c"`"

Again, this script implements mpibash and was used to run the same procedures on all isolates in parallel. Thus, the overall approach to running in parallel was to run the two pipelines one after the other on all isolates simultaneously. Each isolate was run on a single node with 24 threads for internal algorithms to run. To run on an arbitrary number of isolates the script takes one to two days.

results matching ""

    No results matching ""