Differential gene expression analysis using DESeq2

Renesh Bedre    7 minute read

DESeq2 tutorial

Introduction

  • Differential gene expression (DGE) analysis is commonly used in the transcriptome-wide analysis (using RNA-seq) for studying the changes in gene or transcripts expressions under different conditions (e.g. control vs infected).
  • RNA sequencing (bulk and single-cell RNA-seq) using next-generation sequencing (e.g. Illumina short-read sequencing) is a de facto method for quantifying the transcriptome-wide gene or transcript expressions and performing DGE analysis.
  • There are several computational tools are available for DGE analysis. In this article, I will cover DESeq2 for DGE analysis.

DGE analysis using DESeq2

The standard workflow for DGE analysis involves the following steps

  • RNA-seq with a sequencing depth of 10-30 M reads per library (at least 3 biological replicates per sample)
  • aligning or mapping the quality-filtered sequenced reads to respective genome
  • quantifying reads that are mapped to genes or transcripts.
  • Raw integer read counts (un-normalized) are then used for DGE analysis using DESeq2

This standard and other workflows for DGE analysis are depicted in the following flowchart,

DESeq2 workflow

Note: DESeq2 requires raw integer read counts for performing accurate DGE analysis. The normalized read counts should not be used in DESeq2 analysis. DESeq2 internally normalizes the count data correcting for differences in the library sizes. Read more about DESeq2 normalization

Perform the DGE analysis using DESeq2 for read count matrix,

For DGE analysis, I will use the sugarcane RNA-seq data. The DGE analysis will be performed using the raw integer read counts for control and fungal treatment conditions. The goal here is to identify the differentially expressed genes under infected condition.

Note: This article focuses on DGE analysis using a count matrix. But, If you have gene quantification from Salmon, Sailfish, Kallisto, or RSEM, you can use the tximport package to import the count data to perform DGE analysis using DESeq2. Read more here

# load library (install DESeq2 if not installed)
library("DESeq2")

# get count dataset
count_matrix <- read.csv("https://reneshbedre.github.io/assets/posts/gexp/df_sc.csv", row.names = "gene")
# view first two rows
head(count_matrix, 2)
                 ctr1 ctr2 ctr3 trt1 trt2 trt3 length
Sobic.001G000200  338  324  246  291  202  168   1982
Sobic.001G000400   49   21   53   16   16   11   4769

# drop length column
count_matrix <- count_matrix[, -7]
                ctr1 ctr2 ctr3 trt1 trt2 trt3
Sobic.001G000200  338  324  246  291  202  168
Sobic.001G000400   49   21   53   16   16   11

DESeq2 needs sample information (metadata) for performing DGE analysis. Let’s create the sample information (you can also import sample information if you have it in a file),

coldata <- data.frame(
   sample = c( "ctr1", "ctr2", "ctr3", "trt1", "trt2", "trt3" ),
   condition = c( "control", "control",  "control", "infected", "infected", "infected" ), 
  row.names = "sample" )
coldata$condition <- as.factor(coldata$condition)

coldata
     condition
ctr1   control
ctr2   control
ctr3   control
trt1  infected
trt2  infected
trt3  infected

It is essential to have the name of the samples in the same order as that in the count matrix. Let’s check using the below code,

all(rownames(coldata) %in% colnames(count_matrix))
[1] TRUE

all(rownames(coldata) == colnames(count_matrix))
[1] TRUE

Now, construct DESeqDataSet for DGE analysis,

dds <- DESeqDataSetFromMatrix(countData = count_matrix, colData = coldata, 
                              design = ~ condition)

Note: The design formula specifies the experimental design to model the samples. It is used in the estimation of dispersions and log2 fold changes (LFCs) of the model. The design formula also allows controlling additional factors (other than the variable of interest) in the model such as batch effects, type of sequencing, etc.

Pre-filter the genes which have low counts. Here, I will remove the genes which have < 10 reads (this can vary based on research goal) in total across all the samples. Pre-filtering helps to remove genes that have very few mapped reads, reduces memory, and increases the speed of the DESeq2 analysis.

dds <- dds[rowSums(counts(dds)) >= 10,]

Now, select the reference level for condition comparisons. The reference level can set using ref parameter. The comparisons of other conditions will be compared against this reference i.e, the log2 fold changes will be calculated based on ref value (infected/control) . If this parameter is not set, comparisons will be based on alphabetical order of the levels.

# set control condition as reference
dds$condition <- relevel(dds$condition, ref = "control")

Perform differential gene expression analysis,

dds <- DESeq(dds)

# see all comparisons (here there is only one)
resultsNames(dds)
[1] "Intercept"                     "condition_infected_vs_control"

# get gene expression table
# at this step independent filtering is applied by default to remove low count genes
# independent filtering can be turned off by passing independentFiltering=FALSE to results
res <- results(dds)  # same as results(dds, name="condition_infected_vs_control") or results( dds, contrast = c("condition", "infected", "control") )
res
# output
log2 fold change (MLE): condition infected vs control 
Wald test p-value: condition infected vs control 
DataFrame with 15921 rows and 6 columns
                  baseMean log2FoldChange     lfcSE       stat     pvalue      padj
                 <numeric>      <numeric> <numeric>  <numeric>  <numeric> <numeric>
Sobic.001G000200 252.98510    -0.01225415  0.250686 -0.0488824  0.9610130  0.995253
Sobic.001G000400  25.12668    -1.04249464  0.477268 -2.1842943  0.0289406  0.280100
Sobic.001G000700  39.61157     0.48020114  0.374920  1.2808102  0.2002604  0.646876
Sobic.001G000800 434.18254    -0.00687855  0.191669 -0.0358876  0.9713720  0.997115
Sobic.001G001000   6.70029     0.61800014  0.864682  0.7147140  0.4747858  0.854971

If there are multiple group comparisons, the parameter name or contrast can be used to extract the DGE table for each comparison.

Order gene expression table by adjusted p value (Benjamini-Hochberg FDR method) ,

res[order(res$padj),]  
# output
log2 fold change (MLE): condition infected vs control 
Wald test p-value: condition infected vs control 
DataFrame with 15921 rows and 6 columns
                  baseMean log2FoldChange     lfcSE      stat      pvalue        padj
                 <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
Sobic.003G079200   59.5931        3.99942  0.434279   9.20933 3.28163e-20 4.71570e-16
Sobic.007G191200  274.8357        3.84765  0.439166   8.76127 1.93055e-18 1.38710e-14
Sobic.004G086500   67.7705        4.15436  0.507285   8.18940 2.62539e-16 1.25756e-12
Sobic.010G095100  440.5084        1.99250  0.252985   7.87596 3.38123e-15 1.21471e-11
Sobic.004G222000   90.4653        2.60385  0.334920   7.77455 7.57139e-15 2.17602e-11

Note: You may get some genes with p value set to NA. This is due to all samples have zero counts for a gene or there is extreme outlier count for a gene or that gene is subjected to independent filtering by DESeq2.

Export differential gene expression analysis table to CSV file,

write.csv(as.data.frame(res[order(res$padj),] ), file="condition_infected_vs_control_dge.csv")

Get summary of differential gene expression with adjusted p value cut-off at 0.05,

summary(results(dds, alpha=0.05))
# output
out of 15921 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 263, 1.7%
LFC < 0 (down)     : 150, 0.94%
outliers [1]       : 7, 0.044%
low counts [2]     : 1544, 9.7%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

# add lfcThreshold (default 0) parameter if you want to filter genes based on log2 fold change

Get normalized counts,

normalized_counts <- counts(dds, normalized=TRUE)
head(normalized_counts)
                       ctr1       ctr2       ctr3       trt1       trt2       trt3
Sobic.001G000200 272.429944 290.406092 199.121060 272.843984 211.859879 271.249618
Sobic.001G000400  39.494282  18.822617  42.900066  15.001731  16.780980  17.760392
Sobic.001G000700  31.434224  43.919440  24.283056  43.129977  54.538187  40.364526
Sobic.001G000800 427.183048 475.047002 403.908167 467.866489 404.841154 426.249399
Sobic.001G001000   9.672069   2.688945   3.237741   2.812825  10.488113  11.302067
Sobic.001G001132   3.224023   1.792630   1.618870   2.812825   4.195245   1.614581

Learn more about DESeq2 normalization

DGE Visualization

I will visualize the DGE using Volcano plot using Python,

Volcano plot,

# import packages
import pandas as pd
from bioinfokit import visuz

# import the DGE table (condition_infected_vs_control_dge.csv)
df = pd.read_csv("condition_infected_vs_control_dge.csv")
# drop NA values
df=df.dropna()

# create volcano plot
visuz.GeneExpression.volcano(df=df, lfc='log2FoldChange', pv='padj', sign_line=True, plotlegend=True)

DESeq2 volcano plot

Shrinkage estimation of log2 fold changes (LFCs)

The shrinkage of effect size (LFC) helps to remove the low count genes (by shrinking towards zero). The low count genes can give large estimates of LFCs which may not represent true difference in changes in gene expression between two conditions.

Shrinkage estimation of LFCs can be performed on using lfcShrink and apeglm method,

resLFC <- lfcShrink(dds, coef="condition_infected_vs_control", type="apeglm")
head(resLFC)
log2 fold change (MAP): condition infected vs control 
Wald test p-value: condition infected vs control 
DataFrame with 6 rows and 5 columns
                  baseMean log2FoldChange     lfcSE    pvalue      padj
                 <numeric>      <numeric> <numeric> <numeric> <numeric>
Sobic.001G000200 252.98510    -0.00442825  0.157724 0.9610130  0.995253
Sobic.001G000400  25.12668    -0.25339744  0.407104 0.0289406  0.280100
Sobic.001G000700  39.61157     0.12383348  0.217570 0.2002604  0.646876
Sobic.001G000800 434.18254    -0.00306346  0.139306 0.9713720  0.997115
Sobic.001G001000   6.70029     0.03240858  0.201185 0.4747858  0.854971
Sobic.001G001132   2.54303     0.01295899  0.200235 0.7161400        NA

Visualize the shrinkage estimation of LFCs with MA plot and compare it without shrinkage of LFCs,

par(mfrow = c(1, 2))
plotMA(resLFC, main="Shrinkage of LFCs", ylim=c(-4,4))
plotMA(res, main="No shrinkage of LFCs", ylim=c(-4,4))

DESeq2 MA plot with shrinkage

References

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License