GOstats analysis
R version 3.3.2 installation:
sudo add-apt-repository ppa:marutter/rrutter
sudo apt-get update
sudo apt-get install r-base r-base-dev
libxml2 installation:
sudo apt-get install libxml2-dev
Within R, install GOstats with:
source("http://bioconductor.org/biocLite.R")
biocLite("GOstats")
Numerous BioConductor packages must be installed for GOstats to function. The names of them can be found here on page one.
The tutorial on using a non-model reference genome found here.
To read into R as a suitable dataframe, the output from InterProScan must be reformatted. This was done using the Perl script found here. The following are the commands used ("interpro_file.txt" is the output from InterProScan):
perl InterProScanToGOstats.pl -i interpro_file.txt
Within R, the following commands were used to conduct the analyses ("out.txt" is the file that was produced by the above commands):
frame = read.csv("out.txt", header=TRUE, sep="\t")
goFrame=GOFrame(frame, organism="Sclerotinia sclerotiorum")
goAllFrame=GOAllFrame(goFrame)
library("GSEABase")
gsc <- GeneSetCollection(goAllFrame, setType=GOCollection())
This will simplify GO terms using the GO.db GO2All function to create GO term mappings. It will then load the GSEABase R package and create a gene set collection object called 'gsc'.
After creating this, read in your list of genes (assuming your list of genes is in the same format as "out.txt" and is contained in the file "genes.txt"):
genes <- data.frame(read.csv("genes.txt", sep="\t", header=TRUE))
genes = genes$GENE_ID
Now create the background set of genes for the hypergeometric test to be carried out against:
background = frame$GENE_ID
Now create parameters for the test:
params <- GSEAGOHyperGParams( name = "Custom database",
geneSetCollection = gsc,
geneIds = genes,
universeGeneIds = background,
ontology = "MF",
pvalueCutoff = 0.05,
conditional = FALSE,
testDirection = "over" )
Note the use of the objects "genes", "background" and "gsc" created earlier on. This will throw out warnings about removing duplicate IDs. This is because in the lists there are more than one GO term per gene ID, so the gene IDs are repeated.
Then run the overrepresentation test:
or <- hyperGTest(params)
Running GOStats on multiple files
The above procedures were run on all genome files. To do this, a small library of R functions was generated for processing multiple datasets.
To run this procedure on all genotyped isolates:
source("goEnrichmentStats.r")
setwd(/my/working/directory) #Set to the dircetory containing results from runSnpEff.sh
files <- get_files("*HIGH.gos", "*MODERATE.gos", "*LOW.gos") #Final output files from runSnpEff.sh
test.sig <- run_go_all(files, name="Database name", gene_file="/filtered/gene/set.txt", type="MF", pval=0.05)
test.all <- run_go_all(files, name="Database name", gene_file="/filtered/gene/set.txt", type="MF", pval=1)
This will produces two lists, one that contains test statistics for significantly enriched genes and another that contains test statistics for all genes. The reason for this is that these two data sets can then be combined for plotting purposes, i.e., take the GO terms that are significantly enriched in at least one isolate and get information on log odds and p value for all isolates so they can be plotted together. Note: the '/filtered/gene/set.txt' file should be the file output from 'Filtering out genes that fall within repeat regions'. This will do the analysis on data that do not have transposable element sequences in. This can also be done for the whole set, with transposable elements, which may be informative in its own right.
To combine the results from both the tests into a format suitable for plotting with ggplot, run:
plot.data <- make_plot(test.sig, test.all)
The plot.data object is of a convenient format that can be used with ggplot. It consists of three lists (plot.data[[1:3]]), which each contain three dataframes (plot.data[[1:3]][[1:3]]). The three outer lists correspond to pvalue, log(odds) and count data. The three inner dataframes correspond to high, moderate and low effect SNPs data (or the order in which the files were specified at the start).