Running snpEff
For snpEff to work correctly, the phase information (GFF column 8) must be removed from the annotation file being provided. This can be accomplished using the following Perl script:
#!/usr/bin/perl
# Replace the phase column (column 8) of a GFF file with "." so that
# snpEff does not report spurious stop-codon warnings.
use strict;
use warnings;

open my $FH, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while ( my $l = <$FH> ) {
    chomp $l;
    if ( $l =~ /^#/ ) {
        # Header/comment lines are printed unchanged
        print "$l\n";
    } else {
        my @t = split /\t/, $l;
        if ( $#t > 7 ) {
            $t[7] = ".";    # blank out the phase column
            print join( "\t", @t ) . "\n";
        } else {
            print "$l\n";
        }
    }
}
close $FH;
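For example, assuming the script above is saved as removePhase.pl (a placeholder name, not given in the original notes), it can be run on the annotation like so:
perl removePhase.pl annotation.gff > annotation.nophase.gff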
If the phase information is not removed from the GFF file, many genes will be flagged with "WARNING_MULTIPLE_STOP_CODONS_IN_TRANSCRIPT" warnings in the output VCF. Once this has been done, to build a snpEff database for a non-model organism, create a directory in the directory containing the snpEff.jar file called:
data/organism
Where "organism" denotes the name of the organism to be provided to snpEff on the command line. This was done as the new Sclerotinia sclerotiorum genome has not been added to the snpEff database yet.
Following this, add the following lines to the config file:
# organism, version x
organism.genome : organism
The main part to get right seems to be the 'organism.genome' entry: the part before '.genome' must match the name of the directory just created inside "data". The version number in the comment appears to be arbitrary, as does the text after the colon.
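As a concrete (hypothetical) example, if the directory created above were data/sclerotinia, the config entry would be:
# Sclerotinia sclerotiorum, version 1
sclerotinia.genome : Sclerotinia sclerotiorum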
Following this, build the database:
java -jar snpEff.jar build organism
Then, run snpEff:
java -jar snpEff.jar eff organism input.vcf > output.vcf
Once the output has been produced, use the following script to convert the VCF into a tabular file: the fixed columns of each variant are printed on one line, followed by one line per INFO key/value pair, with comma-separated values (e.g. per-allele annotations) split onto separate lines:
#!/usr/bin/perl
# Convert a snpEff-annotated VCF into a "one per line" tabular file: the
# fixed VCF columns are printed once per variant, then each INFO key/value
# pair is printed on its own line, splitting comma-separated values.
use strict;
use warnings;

open my $FH, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while ( my $l = <$FH> ) {
    chomp $l;
    next if $l =~ /^#/;    # skip header lines
    my ( $chrom, $pos, $id, $ref, $alt, $qual, $filter, $info ) = split /\t/, $l;
    print "$chrom\t$pos\t$id\t$ref\t$alt\t$qual\t$filter\n";
    foreach my $in ( split /;/, $info ) {
        if ( $in =~ /^(.*?)=(.*)/ ) {
            my ( $key, $values ) = ( $1, $2 );
            my @vals = split /,/, $values;
            if ( $#vals > 0 ) {
                foreach my $val (@vals) { print "\t\t\t\t\t\t\t$key\t$val\n"; }
            } else {
                print "\t\t\t\t\t\t\t$key\t$values\n";
            }
        } else {
            # Flag-type INFO fields with no value
            print "\t\t\t\t\t\t\t$in\n";
        }
    }
}
close $FH;
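Assuming the script above is saved as vcfToOpl.pl (a placeholder name, not from the original notes), it can be run on the snpEff output to produce the file used in the next step:
perl vcfToOpl.pl output.vcf > test.opl.vcf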
Then, use the command line to get just the highly disrupted genes and the mutations they have in the alternate strain:
grep "HIGH" test.opl.vcf | awk 'BEGIN{OFS="\t"}{split($0,a,"|");print a[2],a[3],a[4]}' > major_effects.txt
To get a file containing the InterPro annotations for these genes, use the following commands:
for line in `awk '{print $3}' major_effects.txt`; do
    grep -P "$line\s+" interpro_output.txt >> major_effects_interpro.txt
done
Then use the previously mentioned InterProScanToGOstats.pl Perl script to get the GO terms associated with these loci (the output is suitable for use with the GOstats R package):
perl InterProScanToGOstats.pl -i interpro_file.txt -p genes
Running this whole process on multiple files
To make this process scalable, a bash script was developed to run snpEff on all files. It runs some parts in parallel and others in serial, and creates files for each genome that can be used with GOstats. Importantly, the pipeline also removes genes with highly disruptive SNPs from the file containing genes with only moderately disruptive SNPs, as it makes sense to distinguish between the two for analysis purposes.
To run, simply call the script and supply an output directory as a command line argument. If the output directory does not exist, the script will create it.
bash runSnpEff.sh results
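The runSnpEff.sh script itself is not reproduced here, but a minimal sketch of the kind of pipeline described above might look like the following (the file naming, the "organism" database name, the vcfToOpl.pl helper and the use of background jobs for parallelism are all assumptions made for illustration, not the real script):
#!/bin/bash
# Minimal sketch only, not the actual runSnpEff.sh. Assumes the input VCFs
# are named *.vcf in the current directory, the snpEff database is called
# "organism", and the one-per-line script is saved as vcfToOpl.pl.
outdir=$1
mkdir -p "$outdir"

# Annotate each VCF in parallel using background jobs
for vcf in *.vcf; do
    java -jar snpEff.jar eff organism "$vcf" > "$outdir/${vcf%.vcf}.ann.vcf" &
done
wait

# Post-process in serial: flatten each annotated VCF, split HIGH and
# MODERATE effects, then drop HIGH-impact genes from the MODERATE list
for ann in "$outdir"/*.ann.vcf; do
    base=${ann%.ann.vcf}
    perl vcfToOpl.pl "$ann" > "$base.opl.vcf"
    grep "HIGH" "$base.opl.vcf" | awk 'BEGIN{OFS="\t"}{split($0,a,"|");print a[2],a[3],a[4]}' > "$base.high.txt"
    grep "MODERATE" "$base.opl.vcf" | awk 'BEGIN{OFS="\t"}{split($0,a,"|");print a[2],a[3],a[4]}' > "$base.moderate.txt"
    awk 'NR==FNR{h[$3];next} !($3 in h)' "$base.high.txt" "$base.moderate.txt" > "$base.moderate_only.txt"
done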