Running snpEff

In order for snpEff to work, phase information must be removed from the annotation file being provided. This can be accomplished using the following perl script:

 #!/usr/bin/perl

 open my $FH, "<", $ARGV[0];
 while ( $l = <$FH> ) {
         chomp $l;
         if( $l =~ /^#/ ) {
                 print "$l\n";
         } else {
                 @t = split /\t/,$l;
                 if( $#t > 7 ) {
                         $t[7] = ".";
                         print join("\t", @t) . "\n";
                 } else {
                         print "$l\n";
                 }
         }
 }
 close $FH;

If the phase information is not removed from the gff file, there will be multiple genes with warnings saying "WARNING_MULTIPLE_STOP_CODONS_IN_TRANSCRIPT" in the output vcf. Once this has been done, to build a snpEff database from a non-model organism, create a directory in the directory containing the snpEff.jar file called:

 data/organism

Where "organism" denotes the name of the organism to be provided to snpEff on the command line. This was done as the new Sclerotinia sclerotiorum genome has not been added to the snpEff database yet.

Following this, add the following lines to the config file:

 # organism, version x
 organism.genome : organism

It seems the main part to get right is the 'organism.genome' part. The prefix of this file should be the same as the directory that has just been created (it is inside the directory called "data"). It seems version is arbitrary, as is whatever is after the colon.

Following this, build the database:

 java -jar snpEff.jar build organism

Then, run snpEff:

 java -jar snpEff.jar eff organism input.vcf > output.vcf

Once the output has been produced, use the following code to create a tabular file with the information for each allele from the vcf one per line:

 #!/usr/bin/perl

 open my $STDIN, "<", $ARGV[0];
 while( $l = <$STDIN> ) {
         chomp $l;
         if( $l !~ /^#/ ) {
                 ($chrom, $pos, $id, $ref, $alt, $qual, $filter, $info) = split /\t/, $l;
                 print "$chrom\t$pos\t$id\t$ref\t$alt\t$qual\t$filter\n";  

                 foreach $in ( split /;/, $info ) {
                         if( $in =~ /^(.*?)=(.*)/ ) {
                                 $key = $1;
                                 $values = $2; 

                                 @vals = split /,/, $values;

                                 if( $#vals > 0 ) {
                                         foreach $val ( @vals ) { print "\t\t\t\t\t\t\t$key\t$val\n"; }
                                 } else {
                                         print "\t\t\t\t\t\t\t$key\t$values\n";
                                 }
                         } else {
                                 print "\t\t\t\t\t\t\t$in\n";
                         }
                 }
         }
 }
 close $STDIN;

Then, use the command line to get just the highly disrupted genes and the mutations they have in the alternate strain:

 grep "HIGH" test.opl.vcf | awk 'BEGIN{OFS="\t"}{split($0,a,"|");print a[2],a[3],a[4]}' > major_effects.txt

To get a file containing the interpro annotations for these genes use the following commands:

 for line in `awk '{print $3}' major_effects.txt`; do 
     grep -P "$line\s+" interpro_output.txt >> major_effects_interpro.txt
 done

Then use the earlier mentioned Perl script found here to get the GO terms associated with these loci (output is suitable for use with GOstats R package):

 perl InterProScanToGOstats.pl -i interpro_file.txt -p genes

Running this whole process on multiple files

In order to carry on making this process scalable, a bash script was developed for running snpEff on all files. This runs some parts in parallel and others in serial and will create files for each genome that can be used with GOStat. Importantly, the pipeline also removes genes with highly disruptive SNPs from the file containing genes with only moderately disruptive SNPs. This is useful as it makes sense to distinguish between the two for analysis purposes.

To run, simply call the script and supply an output directory as a command line argument. If the output directory does not exist, the script will create it.

bash runSnpEff.sh results

results matching ""

    No results matching ""