A Beginner’s Guide to Differential Expression Analysis with DESeq2 in R on Open OnDemand#

Prerequisite#

Access to Tufts HPC: If you don’t already have an account, you can request one here.
Lab Storage Access: For analysis, we recommend using your lab’s designated storage rather than the $HOME directory. For more details on storage options, refer to this guide.

Reference#

DESeq2 official website: Comprehensive information on DESeq2, including installation instructions, documentation, and updates.
User guide for how to use Tufts HPC: Detailed instructions for accessing and using Tufts High-Performance Computing (HPC) resources.
DESeq2 tutorials
How to Download and Install R and RStudio
- If you are using your own laptop, follow the instructions in this tutorial to install R and RStudio.
Install Bioconductor Packages
- If you are not using our Open OnDemand RStudio_Pax app, follow the instructures in this tutorial to install DESeq2 packages.

Step1: Access the Cluster via Open OnDemand#

Log in to Open OnDemand using your Tufts HPC credentials.

Start an RStudio Job:

Navigate to Interactive Apps, then select RStudio_Pax.
Fill in the required fields:
- Number of hours: Specify the time needed for your session. (Recommended starting point: 24 hours)
- Number of CPU cores: Choose based on your computational needs. (Recommended starting point: 8 cores)
- Amount of memory: Allocate sufficient memory for your analysis. (Recommended starting point: 32 GB)
- R Version: Select 4.4.1.
Click Launch to start the job.

Note: guidelines for selecting computational resources (cores and memory) based on dataset size when performing DESeq2 differential expression analysis.

Cores:
- Small datasets (<10 samples): 1–2 cores.
- Medium datasets (10–50 samples): 4–8 cores.
- Large datasets (>50 samples): 8–16 cores, especially if using parallelization (e.g., BiocParallel).
Memory:
- Small datasets: 8 GB.
- Medium datasets: 16–32 GB.
- Large datasets: 64 GB or more for large matrices or numerous samples.

Step2: Load the DESeq2 package and other necessary libraries#

The version of DESeq2 currently pre-installed in R is 1.44.0.

Verify it using the following commands in R:

library(DESeq2)
packageVersion("DESeq2")

If you need a newer version of DESeq2 (e.g., 1.46.0), you may need to update the package manually or request installation by the system administrator.

Load other libraries:

library(DESeq2)
library(ggplot2)
library(pheatmap)

Step3: Begin the Analysis#

3.1 Input Data Preparation#

Requirements:#

A count matrix (rows = genes, columns = samples).
A metadata file with sample information (e.g., condition, treatment).

Example:#

Count Matrix (counts_matrix.csv):

GeneID	Sample1	Sample2	Sample3	Sample4
GeneA	100	120	90	80
GeneB	200	190	210	205

Metadata (metadata.csv):

Sample	Condition
Sample1	Control
Sample2	Control
Sample3	Treatment
Sample4	Treatment

Load Data:#

# Load count matrix
counts <- read.csv("counts_matrix.csv", row.names = 1)

# Load metadata
metadata <- read.csv("metadata.csv", row.names = 1)

# Ensure row names of metadata match column names of counts
metadata <- metadata[match(colnames(counts), rownames(metadata)), ]

3.2 Create DESeq2 Dataset#

Create the DESeq2 object using DESeqDataSetFromMatrix:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = metadata,
  design = ~ Condition
)

# Pre-filtering: Remove low count genes
dds <- dds[rowSums(counts(dds)) > 10, ]

3.3 Differential Expression Analysis#

dds <- DESeq(dds)
results <- results(dds)
results <- results[order(results$padj), ]
head(results)

3.4 Exploratory Data Analysis#

# PCA plot
vsd <- vst(dds, blind = TRUE)  # Variance-stabilizing transformation
plotPCA(vsd, intgroup = "Condition")

# Heatmap of Sample Distances
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
pheatmap(sampleDistMatrix, main = "Sample Distance Heatmap")

3.5 Volcano Plot#

# Add significance column
results$significant <- results$padj < 0.05 & abs(results$log2FoldChange) > 1

# Volcano plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = significant)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Log2 Fold Change", y = "-Log10 P-value")

3.6 Annotating Results#

# Add gene symbols or annotations to results (if available):
annotated_results <- merge(results, gene_annotations, by = "GeneID", all.x = TRUE)

# Save results to a CSV file:
write.csv(annotated_results, "DESeq2_results.csv")