Normalisation methods implemented in edgeR

Updated 2018 March 15th

A short post on the different normalisation methods implemented within edgeR; to see the normalisation methods type:

#install if necessary
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library(edgeR)
?calcNormFactors

From the documentation:

method="TMM" is the weighted trimmed mean of M-values (to the reference) proposed by Robinson and Oshlack (2010), where the weights are from the delta method on Binomial data. If refColumn is unspecified, the library whose upper quartile is closest to the mean upper quartile is used.

method="RLE" is the scaling factor method proposed by Anders and Huber (2010). We call it "relative log expression", as median library is calculated from the geometric mean of all columns and the median ratio of each sample to the median library is taken as the scale factor.

method="upperquartile" is the upper-quartile normalization method of Bullard et al (2010), in which the scale factors are calculated from the 75% quantile of the counts for each library, after removing transcripts which are zero in all libraries. This idea is generalized here to allow scaling by any quantile of the distributions.

If method="none", then the normalization factors are set to 1.

For symmetry, normalization factors are adjusted to multiply to 1. The effective library size is then the original library size multiplied by the scaling factor.

Note that rows that have zero counts for all columns are trimmed before normalization factors are computed. Therefore rows with all zero counts do not affect the estimated factors.

A test dataset

I created a dataset to test the different normalisation methods. There are four samples; column one and two are the controls and column three and four are the patients. 25 transcripts are in all four samples in equal amount. Another 25 transcripts are only present in the controls and their expression is the same as the first 25 transcripts in the controls. The four samples have exactly the same depth (500 counts). However, since the patient samples have half the number of transcripts than the controls (25 versus 50), they are sequenced at twice the depth. This hypothetical situation was described in Robinson and Oshlack.

#prepare example
control_1 <- rep(10, 50)
control_2 <- rep(10, 50)
patient_1 <- c(rep(20, 25),rep(0,25))
patient_2 <- c(rep(20, 25),rep(0,25))

df <- data.frame(c1=control_1,
                 c2=control_2,
                 p1=patient_1,
                 p2=patient_2)

head(df)
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
6 10 10 20 20

tail(df)
   c1 c2 p1 p2
45 10 10  0  0
46 10 10  0  0
47 10 10  0  0
48 10 10  0  0
49 10 10  0  0
50 10 10  0  0

#equal depth
colSums(df)
 c1  c2  p1  p2 
500 500 500 500

No normalisation

Let's run the differential expression analysis without any normalisation step:

#load library
library(edgeR)

#create group vector
group <- c('control','control','patient','patient')

#create DGEList object
d <- DGEList(counts=df, group=group)

#check out the DGEList object
d
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500            1
c2 control      500            1
p1 patient      500            1
p2 patient      500            1

d <- DGEList(counts=df, group=group)
d <- estimateCommonDisp(d)

#perform the DE test
de <- exactTest(d)

#how many differentially expressed transcripts?
table(p.adjust(de$table$PValue, method="BH")<0.05)

TRUE 
  50

Without normalisation, every transcript is differentially expressed.

TMM normalisation

Let's test the weighted trimmed mean of M-values method:

TMM <- calcNormFactors(d, method="TMM")
TMM
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500    0.7071068
c2 control      500    0.7071068
p1 patient      500    1.4142136
p2 patient      500    1.4142136

If we use the scaling, for transcript 1 we would compare 10/0.7071068 (~14.14) to 20/1.4142136 (~14.14) and correctly observe that they are not differentially expressed. Let's run the differential expression test:

TMM <- estimateCommonDisp(TMM)
TMM <- exactTest(TMM)
table(p.adjust(TMM$table$PValue, method="BH")<0.05)

FALSE  TRUE 
   25    25

Only half of the transcripts are differentially expressed (the last 25 transcripts in the control samples).

RLE normalisation

Let's test the relative log expression normalisation method:

RLE <- calcNormFactors(d, method="RLE")
RLE
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500    0.7071068
c2 control      500    0.7071068
p1 patient      500    1.4142136
p2 patient      500    1.4142136
RLE <- estimateCommonDisp(RLE)
RLE <- exactTest(RLE)
table(p.adjust(RLE$table$PValue, method="BH")<0.05)

FALSE  TRUE 
   25    25

Upper-quartile normalisation

Lastly let's try the upper-quartile normalisation method:

uq <- calcNormFactors(d, method="upperquartile")
uq
An object of class "DGEList"
$counts
  c1 c2 p1 p2
1 10 10 20 20
2 10 10 20 20
3 10 10 20 20
4 10 10 20 20
5 10 10 20 20
45 more rows ...

$samples
     group lib.size norm.factors
c1 control      500    0.7071068
c2 control      500    0.7071068
p1 patient      500    1.4142136
p2 patient      500    1.4142136

uq <- estimateCommonDisp(uq)
uq <- exactTest(uq)
table(p.adjust(uq$table$PValue, method="BH")<0.05)

FALSE  TRUE 
   25    25

Testing a real dataset

Perform differential gene expression analysis using various normalisation methods on the pnas_expression.txt dataset.

my_url <- "https://davetang.org/file/pnas_expression.txt"
data <- read.table(my_url, header=TRUE, sep="\t")

dim(data)
[1] 37435     9

       ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
1 ENSG00000215696     0     0     0     0     0     0     0  330
2 ENSG00000215700     0     0     0     0     0     0     0 2370
3 ENSG00000215699     0     0     0     0     0     0     0 1842
4 ENSG00000215784     0     0     0     0     0     0     0 2393
5 ENSG00000212914     0     0     0     0     0     0     0  384
6 ENSG00000212042     0     0     0     0     0     0     0   92

Prepare a DGEList object.

d <- data[,2:8]
rownames(d) <- data[,1]
group <- c(rep("Control",4),rep("DHT",3))
d <- DGEList(counts = d, group=group)

An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
37430 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   978576            1
lane2 Control  1156844            1
lane3 Control  1442169            1
lane4 Control  1485604            1
lane5     DHT  1823460            1
lane6     DHT  1834335            1
lane8     DHT   681743            1

Carry out differential gene expression analysis with no normalisation.

no_norm <- estimateCommonDisp(d)
no_norm <- exactTest(no_norm)
table(p.adjust(no_norm$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33404  4031

With TMM normalisation.

TMM <- calcNormFactors(d, method="TMM")
TMM
An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
37430 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   978576    1.0350786
lane2 Control  1156844    1.0379515
lane3 Control  1442169    1.0287815
lane4 Control  1485604    1.0222095
lane5     DHT  1823460    0.9446243
lane6     DHT  1834335    0.9412769
lane8     DHT   681743    0.9954283

TMM <- estimateCommonDisp(TMM)
TMM <- exactTest(TMM)
table(p.adjust(TMM$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33519  3916

With RLE.

RLE <- calcNormFactors(d, method="RLE")
RLE
An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
37430 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   978576    1.0150010
lane2 Control  1156844    1.0236675
lane3 Control  1442169    1.0345426
lane4 Control  1485604    1.0399724
lane5     DHT  1823460    0.9706692
lane6     DHT  1834335    0.9734955
lane8     DHT   681743    0.9466713

RLE <- estimateCommonDisp(RLE)
RLE <- exactTest(RLE)
table(p.adjust(RLE$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33465  3970

With the upper quartile method.

uq <- calcNormFactors(d, method="upperquartile")
uq
An object of class "DGEList"
$counts
                lane1 lane2 lane3 lane4 lane5 lane6 lane8
ENSG00000215696     0     0     0     0     0     0     0
ENSG00000215700     0     0     0     0     0     0     0
ENSG00000215699     0     0     0     0     0     0     0
ENSG00000215784     0     0     0     0     0     0     0
ENSG00000212914     0     0     0     0     0     0     0
37430 more rows ...

$samples
        group lib.size norm.factors
lane1 Control   978576    1.0272514
lane2 Control  1156844    1.0222982
lane3 Control  1442169    1.0250528
lane4 Control  1485604    1.0348864
lane5     DHT  1823460    0.9728534
lane6     DHT  1834335    0.9670858
lane8     DHT   681743    0.9541011

uq <- estimateCommonDisp(uq)
uq <- exactTest(uq)
table(p.adjust(uq$table$PValue, method="BH")<0.05)

FALSE  TRUE 
33466  3969

Finding the overlaps between the differential gene expression analyses.

library(gplots)

get_de <- function(x, pvalue){
  my_i <- p.adjust(x$PValue, method="BH") < pvalue
  row.names(x)[my_i]
}

my_de_no_norm <- get_de(no_norm$table, 0.05)
my_de_tmm <- get_de(TMM$table, 0.05)
my_de_rle <- get_de(RLE$table, 0.05)
my_de_uq <- get_de(uq$table, 0.05)

gplots::venn(list(no_norm = my_de_no_norm, TMM = my_de_tmm, RLE = my_de_rle, UQ = my_de_uq))

There is a list of genes (n = 405) only detected as differentially expressed in the no normalisation set. Furthermore, there is a common list of genes (n = 342) detected as differentially expressed in the normalisation set.

gplots::venn(list(TMM = my_de_tmm, RLE = my_de_rle, UQ = my_de_uq))

There is a large overlap of differentially expressed genes in the different normalisation approaches.

Summary

The normalisation factors were quite similar between all normalisation methods, which is why the results of the differential expression were quite concordant. Most methods down sized the DHT samples with a normalisation factor of less than one to account for the larger library sizes of these samples.

This work is licensed under a Creative Commons
Attribution 4.0 International License.

6 comments Add yours

Minh Thùy says:

April 5, 2018 at 09:01

Hi, Dave Tang
Would you like explain for me about TMM method?
I do not know the formula for calculating the column norm.factor?

Thanks.

1. Davo says:
  
  April 9, 2018 at 02:12
  
  This is the paper describing the TMM method https://www.ncbi.nlm.nih.gov/pubmed/20196867.
  
  1. Le Minh Thuy says:
    
    April 24, 2018 at 03:16
    
    Thanks a lot. I understood this issue.
    
  2. anjali says:
    
    September 16, 2019 at 17:15
    
    hi davo, could you please provide any script to extract only those genes which are deferentially expressed in each case i.e. 3863 in your case?
    
Paolo says:

February 5, 2020 at 09:38

Shouldn’t counts be multiplied by norm.factors instead of being divided?

Vincent says:

February 26, 2020 at 11:44

Hello Dave,

I will grateful if you include a demo on normalizing rnaseq counts using housekeeping genes.
Thanks