Background
To determine which types of genes are most affected by disruptive SNPs, a GO term enrichment analysis was carried out (see GOstats analysis). Preliminary analysis showed, however, that most genes affected by SNPs belonged to transposable elements. Transposable element sequences were therefore filtered out in order to determine which other kinds of "functional" genes were disrupted or changed.
BLAST alignment to RepeatMasker libraries
To filter out genes that are likely transposons, protein sequences from the genome of interest were aligned with blastp against the RepeatMasker protein library, RepeatPeps.lib, which is distributed with RepeatMasker at the path Libraries/RepeatPeps.lib (RepeatMasker vopen-4.0.6 was used at the time of writing). A BLAST database had already been built for this library. The BLAST command was (BLAST v2.2.28+ was used at the time of writing):
blastp -db blastdb \
-query query_gene_set \
-out output.txt \
-evalue 0.00001 \
-max_target_seqs 1 \
-max_hsps_per_subject 1 \
-num_threads 24 \
-outfmt "6 qseqid"
This produces a list of the IDs of genes with hits to RepeatPeps.lib.
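Before filtering, it can help to see how large the hit list is relative to the query set. The following is a minimal sketch, not part of the original procedure: summarise_hits is a hypothetical helper, and it assumes the query set is a FASTA file with one ">" header line per sequence.

```shell
# Hypothetical helper: report how many query sequences had a hit.
# $1 = hit list (one query ID per line), $2 = query FASTA file (assumed format).
summarise_hits() {
    hits="$1"
    fasta="$2"
    # Count distinct IDs in case an ID appears more than once.
    n_hits=$(sort -u "$hits" | wc -l | tr -d ' ')
    # Count FASTA header lines; grep -c exits non-zero when the count is 0.
    n_seqs=$(grep -c '^>' "$fasta" || true)
    printf '%s of %s query sequences have a hit\n' "$n_hits" "$n_seqs"
}

# Example call with the file names used above:
# summarise_hits output.txt query_gene_set
```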
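Then keep only the lines of the InterProScan output that do not match this list (-F treats the IDs as fixed strings rather than regular expressions, and -w restricts matches to whole words):
grep -v -F -w -f output.txt interpro.txt > interpro.filtered.txt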
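Because grep -f matches an ID anywhere on a line, an ID such as gene1 can also remove lines for a longer ID that contains it, or lines that merely mention it in another column. A stricter alternative is sketched below. It is an illustration, not the method used here: filter_by_id is a hypothetical helper, and it assumes the InterProScan output is tab-separated with the protein ID in the first column.

```shell
# Hypothetical helper: drop rows whose first tab-separated column is in the hit list.
# $1 = hit list (one ID per line), $2 = tab-separated table with the ID in column 1.
filter_by_id() {
    hits="$1"
    table="$2"
    # FILENAME == ARGV[1] is used instead of NR == FNR so that an empty
    # hit list does not cause the table itself to be read as the hit list.
    awk -F'\t' '
        FILENAME == ARGV[1] { skip[$1] = 1; next }  # remember IDs to remove
        !($1 in skip)                               # print rows with other IDs
    ' "$hits" "$table"
}

# Example call:
# filter_by_id output.txt interpro.txt > interpro.filtered.txt
```

Only exact matches on the first column are removed, so gene1 in the hit list leaves gene10 untouched.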
One domain typically associated with transposons, DNA transposons specifically, is the DDE endonuclease domain. To check whether these types of genes have been filtered out, run:
grep "DDE" interpro.filtered.txt | wc -l
grep "DDE" interpro.txt | wc -l
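The two counts can also be printed side by side with a small wrapper. This is a minimal sketch: count_matches is a hypothetical helper name, and it counts matching lines, not distinct genes.

```shell
# Hypothetical helper: print "<file><TAB><number of matching lines>" for each file.
# $1 = pattern, remaining arguments = files to search.
count_matches() {
    pattern="$1"
    shift
    for f in "$@"; do
        # grep -c prints 0 but exits non-zero when nothing matches.
        n=$(grep -c -- "$pattern" "$f" || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}

# Example call:
# count_matches "DDE" interpro.txt interpro.filtered.txt
```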
Compare the two counts. Then create the lists of genes to use as input to GOstats, for both the filtered and the unfiltered sets:
perl InterProScanToGOstats.pl -i interpro.filtered.txt -p interpro.filtered.list
perl InterProScanToGOstats.pl -i interpro.txt -p interpro.list
The resulting lists are ready to use as gene-list input to GOstats.