# A Beginner’s Guide to Differential Expression Analysis with DESeq2 in R on Open OnDemand


## Prerequisite

1. **Access to Tufts HPC**: If you don’t already have an account, you can request one [here](https://shib-idp.tufts.edu/idp/profile/SAML2/POST/SSO?execution=e1s1).
2. **Lab Storage Access**: For analysis, we recommend using your lab’s designated storage rather than the `$HOME` directory. For more details on storage options, refer to [this guide](https://it.tufts.edu/research-data-storage).


## Reference

- [**DESeq2 official website**](https://bioconductor.org/packages/release/bioc/html/DESeq2.html): Comprehensive information on DESeq2, including installation instructions, documentation, and updates.  

- [User guide for how to use Tufts HPC](https://tufts.app.box.com/v/Pax-User-Guide): Detailed instructions for accessing and using Tufts High-Performance Computing (HPC) resources. 

- DESeq2 tutorials

  - [A beginner-friendly guide to using DESeq2 for differential gene expression analysis. ](https://lashlock.github.io/compbio/R_presentation.html)
  - [Step-by-step walkthrough for DESeq2 analysis. ](https://introtogenomics.readthedocs.io/en/latest/2021.11.11.DeseqTutorial.html)
  - [Differential expression with DEseq2. ](https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/)
  - [Harvard Chan Bioinformatics Core Training: Introduction to DGE. ](https://hbctraining.github.io/DGE_workshop/lessons/04_DGE_DESeq2_analysis.html)

- [How to Download and Install R and RStudio](https://rstudio-education.github.io/hopr/starting.html)

  - If you are using your own laptop, follow the instructions in this tutorial to install R and RStudio. 

- [Install Bioconductor Packages](https://www.bioconductor.org/install/)

  - If you are not using our Open OnDemand RStudio_Pax app, follow the instructures in this tutorial to install `DESeq2` packages. 

  
## Step1: Access the Cluster via Open OnDemand

**Log in** to [Open OnDemand](https://ondemand.pax.tufts.edu) using your Tufts HPC credentials.

**Start an RStudio Job**:

1. Navigate to **Interactive Apps**, then select **RStudio_Pax**.

2. Fill in the required fields:

   - **Number of hours**: Specify the time needed for your session. (Recommended starting point: 24 hours)
   - **Number of CPU cores**: Choose based on your computational needs. (Recommended starting point: 8 cores)
   - **Amount of memory**: Allocate sufficient memory for your analysis. (Recommended starting point: 32 GB)
   - **R Version**: Select `4.4.1`.

3. Click **Launch** to start the job.

   
**Note: guidelines for selecting computational resources (cores and memory) based on dataset size when performing DESeq2 differential expression analysis.**

- **Cores**:

  - Small datasets (<10 samples): 1–2 cores.

  - Medium datasets (10–50 samples): 4–8 cores.

  - Large datasets (>50 samples): 8–16 cores, especially if using parallelization (e.g., `BiocParallel`).

- **Memory**:

  - Small datasets: 8 GB.

  - Medium datasets: 16–32 GB.

  - Large datasets: 64 GB or more for large matrices or numerous samples.


## Step2: Load the DESeq2 package and other necessary libraries

The version of DESeq2 currently pre-installed in R is **1.44.0**. 

Verify it using the following commands in R:

```
library(DESeq2)
packageVersion("DESeq2")
```

If you need a newer version of DESeq2 (e.g., 1.46.0), you may need to update the package manually or request installation by the system administrator. 


Load other libraries:

```
library(DESeq2)
library(ggplot2)
library(pheatmap)
```


## Step3: Begin the Analysis

### 3.1 Input Data Preparation

#### Requirements:

- A **count matrix** (rows = genes, columns = samples).
- A **metadata file** with sample information (e.g., condition, treatment).


#### Example:

Count Matrix (`counts_matrix.csv`):

| GeneID | Sample1 | Sample2 | Sample3 | Sample4 |
| ------ | ------- | ------- | ------- | ------- |
| GeneA  | 100     | 120     | 90      | 80      |
| GeneB  | 200     | 190     | 210     | 205     |

Metadata (`metadata.csv`):

| Sample  | Condition |
| ------- | --------- |
| Sample1 | Control   |
| Sample2 | Control   |
| Sample3 | Treatment |
| Sample4 | Treatment |

#### Load Data:

```
# Load count matrix
counts <- read.csv("counts_matrix.csv", row.names = 1)

# Load metadata
metadata <- read.csv("metadata.csv", row.names = 1)

# Ensure row names of metadata match column names of counts
metadata <- metadata[match(colnames(counts), rownames(metadata)), ]
```


### 3.2 Create DESeq2 Dataset

Create the DESeq2 object using `DESeqDataSetFromMatrix`:

```
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = metadata,
  design = ~ Condition
)

# Pre-filtering: Remove low count genes
dds <- dds[rowSums(counts(dds)) > 10, ]
```


### 3.3 Differential Expression Analysis

```
dds <- DESeq(dds)
results <- results(dds)
results <- results[order(results$padj), ]
head(results)
```


### 3.4  Exploratory Data Analysis

```
# PCA plot
vsd <- vst(dds, blind = TRUE)  # Variance-stabilizing transformation
plotPCA(vsd, intgroup = "Condition")

# Heatmap of Sample Distances
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
pheatmap(sampleDistMatrix, main = "Sample Distance Heatmap")
```


### 3.5 Volcano Plot

```
# Add significance column
results$significant <- results$padj < 0.05 & abs(results$log2FoldChange) > 1

# Volcano plot
ggplot(results, aes(x = log2FoldChange, y = -log10(pvalue), color = significant)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Log2 Fold Change", y = "-Log10 P-value")
```


### 3.6 Annotating Results

```
# Add gene symbols or annotations to results (if available):
annotated_results <- merge(results, gene_annotations, by = "GeneID", all.x = TRUE)

# Save results to a CSV file:
write.csv(annotated_results, "DESeq2_results.csv")
```