Analysis of Genome Wide Association Study data

The aim of these analyses are to identify significant differences between phenotypes of individuals in a large population. In this case, a genome wide association study (GWAS) population. Mainly for plotting purposes, the tests require the library "agricolae".

First, the data should look something like this (data made up), with at least the two columns 'Variety' and 'AUDPC':

  Variety    AUDPC
  a            10
  a            5
  a            15
  a            17
  a            9
  b            8
  b            16
  b            15
  b            17
  b            10
  c            1
  c            2
  c            2
  c            5
  c            3

Then, read in the data, reorder the varieties based on mean values (for plotting) and conduct analysis of variance (ANOVA) followed by Tukey's honest significant difference (HSD) with output for groupings. The latter part requires the 'agricolae' library, which should be loaded before it is run.

  library("agricolae")

  data    <- read.csv("inputfile.csv")
  bymean <- with(data, reorder(Variety, AUDPC, mean))
  anov    <- aov(AUDPC ~ Variety, data=data)
  hsd    <- HSD.test(anov, "Variety", group=TRUE)

The hsd object will show groupings for each variety based on significant difference to others:

 hsd$groups
            trt    means   M
 1  a           4130.700   a
 2  b           3430.700  ab
 3  c           3414.600  ab
 4  d           2670.500 abc
 5  e           2603.125 abc
 6  f           2561.300 abc
 7  g           2474.500 abc
 8  h           2167.200 abc
 9  i           2104.375 abc
 10 j           1898.750 abc
 11 k           1365.000  bc
 12 l           860.300

The 'trt' column contains the varieties listed in 'data'. The hsd object has various attributes that can be accessed using the '$' symbol.

To get the statistics:

hsd$statistics

Then for plotting purposes, define a function for getting a list of colours to supply to 'boxplot()':

getColors <- function(input){
    colours         <- colors()
    sequence        <- c()
    for (i in 1:length(input)){
        colourList  <- replicate(input[i], colours[i])
        sequence    <- c(sequence, colourList)
        }
    return (sequence)
    }

For input to this function, create a table of group values:

 input      <- hsd$groups$M
 colours    <- getColors(input)

 colours
 [1] "white"         "aliceblue"     "aliceblue"     "antiquewhite"  "antiquewhite"  "antiquewhite"  "antiquewhite"  "antiquewhite"  "antiquewhite"  "antiquewhite"  "antiquewhite1" "antiquewhite2"

Then the data can be plotted using the vector of colours to supply the colours of the boxplot:

boxplot(AUDPC ~ bymean, data = data, col=colours, outline=FALSE, names=c("a", "b","c","d","e","f","g","h","i","j","k","l"))

This plot groups data based on significant differences between the groups, i.e. 'a' is one colour, 'a,b' is another colour, 'a,b,c' is another colour and 'b,c' is another colour. (In this example, the first two groups are hard to distinguish. This can be remedied by changing colours or the iterater in the get Colors function, to get colours further apart in the R builtins).

Then, lines can be added above the boxplot to denote the groups. This can be achieved using the function:

 getLines <- function(input){
     seen <- c()
     feat <- vector(mode="list")
     c = 1
     for (i in 1:length(input)){
         groups <- as.vector(strsplit(as.character(input[i]), ""))
         for (s in 1:length(groups[[1]])){
             if (!groups[[1]][s] %in% seen){
                 seen <- c(seen, groups[[1]][s])
                 start <- i
                 feat[[c]] <- c(groups[[1]][s], as.character(start))
                 c = c + 1
                 }
             }
         for (n in 1:length(feat)){
             end <- i - 1
             if ((!feat[[n]][1] %in% groups[[1]]) && (length(feat[[n]]) == 2)){
                 feat[[n]] <- c(feat[[n]], end)    
                 }
             }
         if (i == length(input)){
             end <- i
             feat[[length(feat)]] <- c(feat[[length(feat)]], end)
             }
         }
     return(feat)
     }

Then:

lines <- getLines(rev(hsd$groups$M))

Note: Sequence is reversed as the output from HSD.test is in reverse order of means. Then once the above plot has been called, draw the lines above it:

boxplot(AUDPC ~ bymean, data = data, col=colours, outline=FALSE, names=c("a", "b","c","d","e","f","g","h","i","j",k","l"), bty="l", frame=FALSE)
segments(as.numeric(lines[[1]][2]), max(data$AUDPC)+ 10, as.numeric(lines[[1]][3]), max(data$AUDPC) + 10, col="red")
segments(as.numeric(lines[[2]][2]), max(data$AUDPC)+ 110, as.numeric(lines[[2]][3]), max(data$AUDPC) + 110, col="blue")
segments(as.numeric(lines[[3]][2]), max(data$AUDPC)+ 210, as.numeric(lines[[3]][3]), max(data$AUDPC) + 210, col="green")
segments(as.numeric(lines[[1]][2]), max(data$AUDPC)+ 10, as.numeric(lines[[1]][2]), max(data$AUDPC) - 90, col="red")
segments(as.numeric(lines[[1]][3]), max(data$AUDPC)+ 10, as.numeric(lines[[1]][3]), max(data$AUDPC) - 90, col="red")
segments(as.numeric(lines[[2]][2]), max(data$AUDPC)+ 110, as.numeric(lines[[2]][2]), max(data$AUDPC) + 10, col="blue")
segments(as.numeric(lines[[2]][3]), max(data$AUDPC)+ 110, as.numeric(lines[[2]][3]), max(data$AUDPC) + 10, col="blue")
segments(as.numeric(lines[[3]][2]), max(data$AUDPC)+ 210, as.numeric(lines[[3]][2]), max(data$AUDPC) + 110, col="green")
segments(as.numeric(lines[[3]][3]), max(data$AUDPC)+ 210, as.numeric(lines[[3]][3]), max(data$AUDPC) + 110, col="green")

This will create the following plot:

However, the plot may look better without colouring the bars differently.

R general stats and plots

Analysis of Genome Wide Association Study data

results matching ""

No results matching ""