Retrieving positions of domains from an InterProScan analysis

Background

InterProScan outputs a standard format for domains that have been identified in each fasta sequence entry that has been scanned. This output contains the positions of the starts and ends of domains respective to each sequence. However, there is currently no 'official' method for mapping these domains to their respective positions within the whole genome sequence. In order to determine which SNPs, from a multiple genome comparison for example, fall within domains and not linker regions of enzymes, this is necessary.

To map domains to their positions within a genome sequence, this script was developed. This script is used like so:

python getDomainInfoInterpro.py -i interprooutput.txt -g genes.gff3 -t Pfam -o domains.txt

Then it is possible to determine which SNPs fall within these domains and what effects they are likely to have based on a snpEff 'one per line' vcf output (see previous section) using this script.

python getDomainSNPs.py -i domains.txt -v onePerLine.vcf -o snps.domains.txt

Testing snps in genes containing a cetain domain

To determine whether genes with a particular domain are enriched in a 'list of genes containing non-synonymous SNPs in functional domains', the following can be carried out.

First, get all entries in 'domains.txt' that contain a particular domain. In this case, for example, we use the Pfam domain 'PF00550', which is the identifier for the phosphopantethein attachment site.

grep "PF00550" domains.txt | awk '{print $1}' | sort -u > PF00550.genes.txt

Then we get all the entries in the 'snps.domains.txt' file that correspond to genes that contain at least a single 'PF00550' domain.

grep -f PF00550.genes.txt snps.domains.txt > PF00550.snps.txt

Then one can count how many genes have this domain in whole genome, how many genes have snps in functional domains in the genome and how many PF00550 containing genes have snps in functional domains.

results matching ""

    No results matching ""