A Beginner’s Guide to Differential Expression Analysis with DESeq2 in R on Open OnDemand#
Prerequisite#
Access to Tufts HPC: If you don’t already have an account, you can request one here.
Lab Storage Access: For analysis, we recommend using your lab’s designated storage rather than the
$HOME
directory. For more details on storage options, refer to this guide.
Reference#
DESeq2 official website: Comprehensive information on DESeq2, including installation instructions, documentation, and updates.
User guide for how to use Tufts HPC: Detailed instructions for accessing and using Tufts High-Performance Computing (HPC) resources.
DESeq2 tutorials
How to Download and Install R and RStudio
If you are using your own laptop, follow the instructions in this tutorial to install R and RStudio.
-
If you are not using our Open OnDemand RStudio_Pax app, follow the instructures in this tutorial to install
DESeq2
packages.
Step1: Access the Cluster via Open OnDemand#
Log in to Open OnDemand using your Tufts HPC credentials.
Start an RStudio Job:
Navigate to Interactive Apps, then select RStudio_Pax.
Fill in the required fields:
Number of hours: Specify the time needed for your session. (Recommended starting point: 24 hours)
Number of CPU cores: Choose based on your computational needs. (Recommended starting point: 8 cores)
Amount of memory: Allocate sufficient memory for your analysis. (Recommended starting point: 32 GB)
R Version: Select
4.4.1
.
Click Launch to start the job.
Note: guidelines for selecting computational resources (cores and memory) based on dataset size when performing DESeq2 differential expression analysis.
Cores:
Small datasets (<10 samples): 1–2 cores.
Medium datasets (10–50 samples): 4–8 cores.
Large datasets (>50 samples): 8–16 cores, especially if using parallelization (e.g.,
BiocParallel
).
Memory:
Small datasets: 8 GB.
Medium datasets: 16–32 GB.
Large datasets: 64 GB or more for large matrices or numerous samples.
Step2: Load the DESeq2 package and other necessary libraries#
The version of DESeq2 currently pre-installed in R is 1.44.0.
Verify it using the following commands in R:
library(DESeq2)
packageVersion("DESeq2")
If you need a newer version of DESeq2 (e.g., 1.46.0), you may need to update the package manually or request installation by the system administrator.
Load other libraries:
library(DESeq2)
library(ggplot2)
library(pheatmap)
Step3: Begin the Analysis#
3.1 Input Data Preparation#
Requirements:#
A count matrix (rows = genes, columns = samples).
A metadata file with sample information (e.g., condition, treatment).
Example:#
Count Matrix (counts_matrix.csv
):
GeneID |
Sample1 |
Sample2 |
Sample3 |
Sample4 |
---|---|---|---|---|
GeneA |
100 |
120 |
90 |
80 |
GeneB |
200 |
190 |
210 |
205 |
Metadata (metadata.csv
):
Sample |
Condition |
---|---|
Sample1 |
Control |
Sample2 |
Control |
Sample3 |
Treatment |
Sample4 |
Treatment |
Load Data:#
# Load count matrix
counts <- read.csv("counts_matrix.csv", row.names = 1)
# Load metadata
metadata <- read.csv("metadata.csv", row.names = 1)
# Ensure row names of metadata match column names of counts
metadata <- metadata[match(colnames(counts), rownames(metadata)), ]
3.2 Create DESeq2 Dataset#
Create the DESeq2 object using DESeqDataSetFromMatrix
:
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = metadata,
design = ~ Condition
)
# Pre-filtering: Remove low count genes
dds <- dds[rowSums(counts(dds)) > 10, ]
3.3 Differential Expression Analysis#
dds <- DESeq(dds)
results <- results(dds)
results <- results[order(results$padj), ]
head(results)
3.4 Exploratory Data Analysis#
# PCA plot
vsd <- vst(dds, blind = TRUE) # Variance-stabilizing transformation
plotPCA(vsd, intgroup = "Condition")
# Heatmap of Sample Distances
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
pheatmap(sampleDistMatrix, main = "Sample Distance Heatmap")
3.5 Volcano Plot#
# Add significance column
results$significant <- results$padj < 0.05 & abs(results$log2FoldChange) > 1
# Volcano plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = significant)) +
geom_point() +
theme_minimal() +
labs(x = "Log2 Fold Change", y = "-Log10 P-value")
3.6 Annotating Results#
# Add gene symbols or annotations to results (if available):
annotated_results <- merge(results, gene_annotations, by = "GeneID", all.x = TRUE)
# Save results to a CSV file:
write.csv(annotated_results, "DESeq2_results.csv")