Running mapping of Illumina reads to a reference and variant calling
Comparison of multiple genomes was carried out using several software packages wrapped in a Perl package, which was incorporated into two bash pipelines for submitting the jobs in parallel on the Pawsey supercomputer (Magnus), based in Perth, Western Australia. All of the scripts, including the Perl package, are available here.
The procedures were split into two overall pipelines, which were initially designed to run simultaneously but were ultimately run one after the other via a job dependency (see below). The first procedure was mapping and variant calling, which also incorporates SNPs from a comparison between de novo assemblies using nucmer. This procedure was run using the 'mapping_variants.sh' bash script, which was submitted using the 'run_mapping_variants.sh' script. The mapping_variants.sh script uses mpibash, which allows the procedure to be parallelised across isolates. The procedure was run for all isolates analysed against the reference (12 at the time of writing). The mpibash script is as follows:
#!/usr/bin/env mpibash
enable -f mpibash.so mpi_init
threads=$1
mpi_init
mpi_comm_rank rank
mpi_comm_size size
echo "Script began at `date +"%c"` for process $rank of $size."
readsf=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R1.fastq))
readsr=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R2.fastq))
readsu=($(ls /scratch/y95/mderbyshire/pan_genome/data/U_*.u.fastq))
reference="/scratch/y95/mderbyshire/pan_genome/data/Ssclerotiorum_v2_sorted.fasta"
for ((i=0; i < ${#readsf[@]}; i++)); do
    stub=${readsf[$i]}
    stub=${stub##*/}
    stub=${stub%.*}.$$
    newarray+=(${readsf[$i]},${readsr[$i]},${readsu[$i]})
    stubarray+=($stub)
done
#function check_assembly_finished(){
#    count=1
#    while true; do
#        if ls *ass/*contigs.fasta 1> /dev/null 2>&1; then
#            denovoarray=($(ls *ass/*contigs.fasta))
#        else
#            denovoarray=()
#        fi
#        # Stop waiting once every isolate's assembly has appeared.
#        if [ ${#denovoarray[@]} -ge ${#readsf[@]} ]; then
#            break
#        fi
#        if [ $count == 1 ]; then
#            echo ""
#            echo "Waiting for assembly to finish."
#            echo ""
#        fi
#        ((count+=1))
#        sleep 0.5
#    done
#}
if [[ $rank == 0 ]]; then
    echo "Running mapping and variant calling for:"
    echo "${stubarray[@]}"
    echo "Running ${#stubarray[@]} isolates on $size * $threads cpus."
fi
./mapping.pl \
    --reads ${newarray[$rank]} \
    --reference $reference \
    --prefix ${stubarray[$rank]}.map \
    --threads $threads
if [[ $? != 0 ]]; then
    echo "Problem with mapping for ${stubarray[$rank]}."
else
    echo "Mapping complete for ${stubarray[$rank]}."
fi
mpi_barrier
bamarray=($(ls P_*/*rg.bam))
./assembly.pl \
    --bam ${bamarray[$rank]} \
    --prefix ${stubarray[$rank]}.add \
    --mitochondrial /scratch/y95/mderbyshire/pan_genome/data/Sclerotinia_sclerotiorum_mtDNA.fasta \
    --threads $threads
mpi_barrier
#check_assembly_finished
denovoarray=($(ls *ass/*contigs.fasta))
./callVariants.pl \
    --reference $reference \
    --bam ${bamarray[$rank]} \
    --nucmer --denovo ${denovoarray[$rank]} \
    --filter-snps --depth 0 --prefix ${stubarray[$rank]}.var --combine \
    --threads $threads
if [[ $? != 0 ]]; then
    echo "Problem with calling variants for ${stubarray[$rank]}."
else
    echo "Variant calls complete for ${stubarray[$rank]}."
fi
mpi_barrier
mpi_finalize
echo "Mapping and variant calling finished at `date +"%c"`"
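The stub names used throughout both scripts are built with bash parameter expansion in the read-pairing loop near the top of the script. A minimal, standalone illustration of the derivation (the example path is hypothetical; the real script additionally appends the shell's PID via '.$$' to make names unique):

```shell
# Hypothetical read path, mimicking the layout used in the scripts above.
read_path="/scratch/example/data/P_isolate1_R1.fastq"

stub=${read_path##*/}   # strip the directory: P_isolate1_R1.fastq
stub=${stub%.*}         # strip the extension: P_isolate1_R1
echo "$stub"            # the real script then appends .$$ (the shell's PID)
```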
Running assembly and gene & repeat annotation of Illumina reads
Note the function in the previous step:
function check_assembly_finished(){
    count=1
    while true; do
        if ls *ass/*contigs.fasta 1> /dev/null 2>&1; then
            denovoarray=($(ls *ass/*contigs.fasta))
        else
            denovoarray=()
        fi
        # Stop waiting once every isolate's assembly has appeared.
        if [ ${#denovoarray[@]} -ge ${#readsf[@]} ]; then
            break
        fi
        if [ $count == 1 ]; then
            echo ""
            echo "Waiting for assembly to finish."
            echo ""
        fi
        ((count+=1))
        sleep 0.5
    done
}
This assumes that 'mapping_variants.sh' is run at the same time as 'assembly_gene_call.sh'. The latter script produces outputs in directories suffixed with '.ass'. Once enough of these directories have been generated to match the number of isolates being analysed, the final stage of mapping_variants.sh can complete, because that stage relies on the de novo assemblies generated by assembly_gene_call.sh.
However, this function is commented out because an alternative method was eventually used. To avoid exceeding the job time limit, 'assembly_gene_call.sh' was run first and its job ID noted. Then 'mapping_variants.sh' was submitted via:
sbatch --dependency=afterok:<job_id> run_mapping_variants.sh
This ensures the second job only starts once the first has completed successfully. It achieves the same result as manually running one pipeline after the other, but without the need for constant checking.
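Capturing the job ID can itself be scripted rather than read off by hand. A minimal sketch, assuming SLURM's sbatch is available (the wrapper name 'run_assembly_gene_call.sh' is assumed here; '--parsable' makes sbatch print just the job ID, and without it the ID is the fourth word of the 'Submitted batch job <id>' message):

```shell
#!/bin/bash
# Submit the assembly pipeline first and capture its job ID.
jid=$(sbatch --parsable run_assembly_gene_call.sh)

# Without --parsable, the ID can be extracted from sbatch's default message:
#   jid=$(sbatch run_assembly_gene_call.sh | awk '{print $4}')

# Submit mapping/variant calling so it starts only if assembly succeeds.
sbatch --dependency=afterok:"$jid" run_mapping_variants.sh
```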
The assembly_gene_call.sh script is as follows:
#!/usr/bin/env mpibash
enable -f mpibash.so mpi_init
threads=$1
mpi_init
mpi_comm_rank rank
mpi_comm_size size
echo "Script began at `date +"%c"` for process $rank of $size."
readsf=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R1.fastq))
readsr=($(ls /scratch/y95/mderbyshire/pan_genome/data/P_*R2.fastq))
readsu=($(ls /scratch/y95/mderbyshire/pan_genome/data/U_*.u.fastq))
reference="/scratch/y95/mderbyshire/pan_genome/data/Ssclerotiorum_v2_sorted.fasta"
for ((i=0; i < ${#readsf[@]}; i++)); do
    stub=${readsf[$i]}
    stub=${stub##*/}
    stub=${stub%.*}.$$
    newarray+=(${readsf[$i]},${readsr[$i]},${readsu[$i]})
    stubarray+=($stub)
done
if [[ $rank == 0 ]]; then
    echo "Running assembly and annotation for:"
    echo "${stubarray[@]}"
    echo "Running ${#stubarray[@]} isolates on $size * $threads cpus."
fi
./assembly.pl \
    --reads ${newarray[$rank]} \
    --prefix ${stubarray[$rank]}.ass \
    --threads $threads
if [[ $? != 0 ]]; then
    echo "Problem with assembly for ${stubarray[$rank]}."
else
    echo "Assembly complete for ${stubarray[$rank]}."
fi
mpi_barrier
denovoarray=($(ls *ass/*contigs.fasta))
./annotation.pl \
    --type all \
    --contigs ${denovoarray[$rank]} \
    --species Sclerotinia_sclerotiorum \
    --prot /scratch/y95/mderbyshire/pan_genome/data/Ssclerotiorum_v2.pep.fasta \
    --threads $threads \
    --prefix ${stubarray[$rank]}.ann
if [[ $? != 0 ]]; then
    echo "Problem with annotation for ${stubarray[$rank]}."
else
    echo "Annotation complete for ${stubarray[$rank]}."
fi
mpi_barrier
mpi_finalize
echo "Script ended at `date +"%c"`"
Again, this script uses mpibash and was used to run the same procedures on all isolates in parallel. Thus, the overall approach was to run the two pipelines one after the other, each processing all isolates simultaneously. Each isolate was run on a single node, with 24 threads available to the internal tools. Because all isolates run in parallel, the pipelines take roughly one to two days regardless of the number of isolates.
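The node layout described above can be expressed directly in the submission script. A hypothetical sketch of what 'run_mapping_variants.sh' might contain under these assumptions (one node per isolate, one mpibash rank per node, 24 cpus per rank; the node count, wall time, and the use of srun rather than Cray's aprun are all placeholders, not taken from the original):

```shell
#!/bin/bash
#SBATCH --nodes=12           # one node per isolate (12 isolates at the time of writing)
#SBATCH --ntasks-per-node=1  # one mpibash rank (one isolate) per node
#SBATCH --cpus-per-task=24   # 24 threads for the internal tools
#SBATCH --time=48:00:00      # allow up to two days of wall time

# Launch one rank per node; the single argument becomes $1 (the thread
# count) inside mapping_variants.sh.
srun ./mapping_variants.sh 24
```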