A Beginner’s Guide to Differential Expression Analysis with DESeq2 in R on Open OnDemand#

Prerequisite#

  1. Access to Tufts HPC: If you don’t already have an account, you can request one here.

  2. Lab Storage Access: For analysis, we recommend using your lab’s designated storage rather than the $HOME directory. For more details on storage options, refer to this guide.

Reference#

Step1: Access the Cluster via Open OnDemand#

Log in to Open OnDemand using your Tufts HPC credentials.

Start an RStudio Job:

  1. Navigate to Interactive Apps, then select RStudio_Pax.

  2. Fill in the required fields:

    • Number of hours: Specify the time needed for your session. (Recommended starting point: 24 hours)

    • Number of CPU cores: Choose based on your computational needs. (Recommended starting point: 8 cores)

    • Amount of memory: Allocate sufficient memory for your analysis. (Recommended starting point: 32 GB)

    • R Version: Select 4.4.1.

  3. Click Launch to start the job.

Note: guidelines for selecting computational resources (cores and memory) based on dataset size when performing DESeq2 differential expression analysis.

  • Cores:

    • Small datasets (<10 samples): 1–2 cores.

    • Medium datasets (10–50 samples): 4–8 cores.

    • Large datasets (>50 samples): 8–16 cores, especially if using parallelization (e.g., BiocParallel).

  • Memory:

    • Small datasets: 8 GB.

    • Medium datasets: 16–32 GB.

    • Large datasets: 64 GB or more for large matrices or numerous samples.

Step2: Load the DESeq2 package and other necessary libraries#

The version of DESeq2 currently pre-installed in R is 1.44.0.

Verify it using the following commands in R:

library(DESeq2)
packageVersion("DESeq2")

If you need a newer version of DESeq2 (e.g., 1.46.0), you may need to update the package manually or request installation by the system administrator.

Load other libraries:

library(DESeq2)
library(ggplot2)
library(pheatmap)

Step3: Begin the Analysis#

3.1 Input Data Preparation#

Requirements:#

  • A count matrix (rows = genes, columns = samples).

  • A metadata file with sample information (e.g., condition, treatment).

Example:#

Count Matrix (counts_matrix.csv):

GeneID

Sample1

Sample2

Sample3

Sample4

GeneA

100

120

90

80

GeneB

200

190

210

205

Metadata (metadata.csv):

Sample

Condition

Sample1

Control

Sample2

Control

Sample3

Treatment

Sample4

Treatment

Load Data:#

# Load count matrix
counts <- read.csv("counts_matrix.csv", row.names = 1)

# Load metadata
metadata <- read.csv("metadata.csv", row.names = 1)

# Ensure row names of metadata match column names of counts
metadata <- metadata[match(colnames(counts), rownames(metadata)), ]

3.2 Create DESeq2 Dataset#

Create the DESeq2 object using DESeqDataSetFromMatrix:

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = metadata,
  design = ~ Condition
)

# Pre-filtering: Remove low count genes
dds <- dds[rowSums(counts(dds)) > 10, ]

3.3 Differential Expression Analysis#

dds <- DESeq(dds)
results <- results(dds)
results <- results[order(results$padj), ]
head(results)

3.4 Exploratory Data Analysis#

# PCA plot
vsd <- vst(dds, blind = TRUE)  # Variance-stabilizing transformation
plotPCA(vsd, intgroup = "Condition")

# Heatmap of Sample Distances
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
pheatmap(sampleDistMatrix, main = "Sample Distance Heatmap")

3.5 Volcano Plot#

# Add significance column
results$significant <- results$padj < 0.05 & abs(results$log2FoldChange) > 1

# Volcano plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = significant)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Log2 Fold Change", y = "-Log10 P-value")

3.6 Annotating Results#

# Add gene symbols or annotations to results (if available):
annotated_results <- merge(results, gene_annotations, by = "GeneID", all.x = TRUE)

# Save results to a CSV file:
write.csv(annotated_results, "DESeq2_results.csv")