Updated 2018 March 15th
A short post on the different normalisation methods implemented within edgeR; to see the normalisation methods type:
#install if necessary source("http://bioconductor.org/biocLite.R") biocLite("edgeR") library(edgeR) ?calcNormFactors
From the documentation:
method="TMM" is the weighted trimmed mean of M-values (to the reference) proposed by Robinson and Oshlack (2010), where the weights are from the delta method on Binomial data. If refColumn is unspecified, the library whose upper quartile is closest to the mean upper quartile is used.
method="RLE" is the scaling factor method proposed by Anders and Huber (2010). We call it "relative log expression", as median library is calculated from the geometric mean of all columns and the median ratio of each sample to the median library is taken as the scale factor.
method="upperquartile" is the upper-quartile normalization method of Bullard et al (2010), in which the scale factors are calculated from the 75% quantile of the counts for each library, after removing transcripts which are zero in all libraries. This idea is generalized here to allow scaling by any quantile of the distributions.
If method="none", then the normalization factors are set to 1.
For symmetry, normalization factors are adjusted to multiply to 1. The effective library size is then the original library size multiplied by the scaling factor.
Note that rows that have zero counts for all columns are trimmed before normalization factors are computed. Therefore rows with all zero counts do not affect the estimated factors.
A test dataset
I created a dataset to test the different normalisation methods. There are four samples; column one and two are the controls and column three and four are the patients. 25 transcripts are in all four samples in equal amount. Another 25 transcripts are only present in the controls and their expression is the same as the first 25 transcripts in the controls. The four samples have exactly the same depth (500 counts). However, since the patient samples have half the number of transcripts than the controls (25 versus 50), they are sequenced at twice the depth. This hypothetical situation was described in Robinson and Oshlack.
#prepare example control_1 <- rep(10, 50) control_2 <- rep(10, 50) patient_1 <- c(rep(20, 25),rep(0,25)) patient_2 <- c(rep(20, 25),rep(0,25)) df <- data.frame(c1=control_1, c2=control_2, p1=patient_1, p2=patient_2) head(df) c1 c2 p1 p2 1 10 10 20 20 2 10 10 20 20 3 10 10 20 20 4 10 10 20 20 5 10 10 20 20 6 10 10 20 20 tail(df) c1 c2 p1 p2 45 10 10 0 0 46 10 10 0 0 47 10 10 0 0 48 10 10 0 0 49 10 10 0 0 50 10 10 0 0 #equal depth colSums(df) c1 c2 p1 p2 500 500 500 500
No normalisation
Let's run the differential expression analysis without any normalisation step:
#load library library(edgeR) #create group vector group <- c('control','control','patient','patient') #create DGEList object d <- DGEList(counts=df, group=group) #check out the DGEList object d An object of class "DGEList" $counts c1 c2 p1 p2 1 10 10 20 20 2 10 10 20 20 3 10 10 20 20 4 10 10 20 20 5 10 10 20 20 45 more rows ... $samples group lib.size norm.factors c1 control 500 1 c2 control 500 1 p1 patient 500 1 p2 patient 500 1 d <- DGEList(counts=df, group=group) d <- estimateCommonDisp(d) #perform the DE test de <- exactTest(d) #how many differentially expressed transcripts? table(p.adjust(de$table$PValue, method="BH")<0.05) TRUE 50
Without normalisation, every transcript is differentially expressed.
TMM normalisation
Let's test the weighted trimmed mean of M-values method:
TMM <- calcNormFactors(d, method="TMM") TMM An object of class "DGEList" $counts c1 c2 p1 p2 1 10 10 20 20 2 10 10 20 20 3 10 10 20 20 4 10 10 20 20 5 10 10 20 20 45 more rows ... $samples group lib.size norm.factors c1 control 500 0.7071068 c2 control 500 0.7071068 p1 patient 500 1.4142136 p2 patient 500 1.4142136
If we use the scaling, for transcript 1 we would compare 10/0.7071068 (~14.14) to 20/1.4142136 (~14.14) and correctly observe that they are not differentially expressed. Let's run the differential expression test:
TMM <- estimateCommonDisp(TMM) TMM <- exactTest(TMM) table(p.adjust(TMM$table$PValue, method="BH")<0.05) FALSE TRUE 25 25
Only half of the transcripts are differentially expressed (the last 25 transcripts in the control samples).
RLE normalisation
Let's test the relative log expression normalisation method:
RLE <- calcNormFactors(d, method="RLE") RLE An object of class "DGEList" $counts c1 c2 p1 p2 1 10 10 20 20 2 10 10 20 20 3 10 10 20 20 4 10 10 20 20 5 10 10 20 20 45 more rows ... $samples group lib.size norm.factors c1 control 500 0.7071068 c2 control 500 0.7071068 p1 patient 500 1.4142136 p2 patient 500 1.4142136 RLE <- estimateCommonDisp(RLE) RLE <- exactTest(RLE) table(p.adjust(RLE$table$PValue, method="BH")<0.05) FALSE TRUE 25 25
Upper-quartile normalisation
Lastly let's try the upper-quartile normalisation method:
uq <- calcNormFactors(d, method="upperquartile") uq An object of class "DGEList" $counts c1 c2 p1 p2 1 10 10 20 20 2 10 10 20 20 3 10 10 20 20 4 10 10 20 20 5 10 10 20 20 45 more rows ... $samples group lib.size norm.factors c1 control 500 0.7071068 c2 control 500 0.7071068 p1 patient 500 1.4142136 p2 patient 500 1.4142136 uq <- estimateCommonDisp(uq) uq <- exactTest(uq) table(p.adjust(uq$table$PValue, method="BH")<0.05) FALSE TRUE 25 25
Testing a real dataset
Perform differential gene expression analysis using various normalisation methods on the pnas_expression.txt dataset.
my_url <- "https://davetang.org/file/pnas_expression.txt" data <- read.table(my_url, header=TRUE, sep="\t") dim(data) [1] 37435 9 ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len 1 ENSG00000215696 0 0 0 0 0 0 0 330 2 ENSG00000215700 0 0 0 0 0 0 0 2370 3 ENSG00000215699 0 0 0 0 0 0 0 1842 4 ENSG00000215784 0 0 0 0 0 0 0 2393 5 ENSG00000212914 0 0 0 0 0 0 0 384 6 ENSG00000212042 0 0 0 0 0 0 0 92
Prepare a DGEList object.
d <- data[,2:8] rownames(d) <- data[,1] group <- c(rep("Control",4),rep("DHT",3)) d <- DGEList(counts = d, group=group) An object of class "DGEList" $counts lane1 lane2 lane3 lane4 lane5 lane6 lane8 ENSG00000215696 0 0 0 0 0 0 0 ENSG00000215700 0 0 0 0 0 0 0 ENSG00000215699 0 0 0 0 0 0 0 ENSG00000215784 0 0 0 0 0 0 0 ENSG00000212914 0 0 0 0 0 0 0 37430 more rows ... $samples group lib.size norm.factors lane1 Control 978576 1 lane2 Control 1156844 1 lane3 Control 1442169 1 lane4 Control 1485604 1 lane5 DHT 1823460 1 lane6 DHT 1834335 1 lane8 DHT 681743 1
Carry out differential gene expression analysis with no normalisation.
no_norm <- estimateCommonDisp(d) no_norm <- exactTest(no_norm) table(p.adjust(no_norm$table$PValue, method="BH")<0.05) FALSE TRUE 33404 4031
With TMM normalisation.
TMM <- calcNormFactors(d, method="TMM") TMM An object of class "DGEList" $counts lane1 lane2 lane3 lane4 lane5 lane6 lane8 ENSG00000215696 0 0 0 0 0 0 0 ENSG00000215700 0 0 0 0 0 0 0 ENSG00000215699 0 0 0 0 0 0 0 ENSG00000215784 0 0 0 0 0 0 0 ENSG00000212914 0 0 0 0 0 0 0 37430 more rows ... $samples group lib.size norm.factors lane1 Control 978576 1.0350786 lane2 Control 1156844 1.0379515 lane3 Control 1442169 1.0287815 lane4 Control 1485604 1.0222095 lane5 DHT 1823460 0.9446243 lane6 DHT 1834335 0.9412769 lane8 DHT 681743 0.9954283 TMM <- estimateCommonDisp(TMM) TMM <- exactTest(TMM) table(p.adjust(TMM$table$PValue, method="BH")<0.05) FALSE TRUE 33519 3916
With RLE.
RLE <- calcNormFactors(d, method="RLE") RLE An object of class "DGEList" $counts lane1 lane2 lane3 lane4 lane5 lane6 lane8 ENSG00000215696 0 0 0 0 0 0 0 ENSG00000215700 0 0 0 0 0 0 0 ENSG00000215699 0 0 0 0 0 0 0 ENSG00000215784 0 0 0 0 0 0 0 ENSG00000212914 0 0 0 0 0 0 0 37430 more rows ... $samples group lib.size norm.factors lane1 Control 978576 1.0150010 lane2 Control 1156844 1.0236675 lane3 Control 1442169 1.0345426 lane4 Control 1485604 1.0399724 lane5 DHT 1823460 0.9706692 lane6 DHT 1834335 0.9734955 lane8 DHT 681743 0.9466713 RLE <- estimateCommonDisp(RLE) RLE <- exactTest(RLE) table(p.adjust(RLE$table$PValue, method="BH")<0.05) FALSE TRUE 33465 3970
With the upper quartile method.
uq <- calcNormFactors(d, method="upperquartile") uq An object of class "DGEList" $counts lane1 lane2 lane3 lane4 lane5 lane6 lane8 ENSG00000215696 0 0 0 0 0 0 0 ENSG00000215700 0 0 0 0 0 0 0 ENSG00000215699 0 0 0 0 0 0 0 ENSG00000215784 0 0 0 0 0 0 0 ENSG00000212914 0 0 0 0 0 0 0 37430 more rows ... $samples group lib.size norm.factors lane1 Control 978576 1.0272514 lane2 Control 1156844 1.0222982 lane3 Control 1442169 1.0250528 lane4 Control 1485604 1.0348864 lane5 DHT 1823460 0.9728534 lane6 DHT 1834335 0.9670858 lane8 DHT 681743 0.9541011 uq <- estimateCommonDisp(uq) uq <- exactTest(uq) table(p.adjust(uq$table$PValue, method="BH")<0.05) FALSE TRUE 33466 3969
Finding the overlaps between the differential gene expression analyses.
library(gplots) get_de <- function(x, pvalue){ my_i <- p.adjust(x$PValue, method="BH") < pvalue row.names(x)[my_i] } my_de_no_norm <- get_de(no_norm$table, 0.05) my_de_tmm <- get_de(TMM$table, 0.05) my_de_rle <- get_de(RLE$table, 0.05) my_de_uq <- get_de(uq$table, 0.05) gplots::venn(list(no_norm = my_de_no_norm, TMM = my_de_tmm, RLE = my_de_rle, UQ = my_de_uq))
There is a list of genes (n = 405) only detected as differentially expressed in the no normalisation set. Furthermore, there is a common list of genes (n = 342) detected as differentially expressed in the normalisation set.
gplots::venn(list(TMM = my_de_tmm, RLE = my_de_rle, UQ = my_de_uq))
There is a large overlap of differentially expressed genes in the different normalisation approaches.
Summary
The normalisation factors were quite similar between all normalisation methods, which is why the results of the differential expression were quite concordant. Most methods down sized the DHT samples with a normalisation factor of less than one to account for the larger library sizes of these samples.
This work is licensed under a Creative Commons
Attribution 4.0 International License.
Hi, Dave Tang
Would you like explain for me about TMM method?
I do not know the formula for calculating the column norm.factor?
Thanks.
This is the paper describing the TMM method https://www.ncbi.nlm.nih.gov/pubmed/20196867.
Thanks a lot. I understood this issue.
hi davo, could you please provide any script to extract only those genes which are deferentially expressed in each case i.e. 3863 in your case?
Shouldn’t counts be multiplied by norm.factors instead of being divided?
Hello Dave,
I will grateful if you include a demo on normalizing rnaseq counts using housekeeping genes.
Thanks