scRNA-seq analysis with RStudio#

Shirley Li, Bioinformatician, TTS Research Technology, xue.li37@tufts.edu

Date: 2024-11-01

Overview#

In this tutorial, you will learn how to:

  • Set up an R environment for single-cell RNA-seq analysis.

  • Install and configure popular R packages for scRNA-seq, such as Seurat, SingleCellExperiment, and scater.

  • Load, preprocess, and analyze scRNA-seq data in R with Seurat.

Set Up RStudio on Open OnDemand for scRNA-seq#

  1. Open the RStudio app in Open OnDemand

    • Log in to Open OnDemand with your UTLN.

    • Under Interactive Apps, you will see RStudio Pax; other topic-specific RStudio apps are listed under Bioinformatics Apps.

    • Fill in the parameters according to your needs. A reasonable starting point is 64 GB of memory and 12 cores.

    • Launch the job.

  2. Load Packages in R:

    # Load the necessary libraries for scRNA-seq analysis
    library(Seurat)
    library(SeuratData)
    library(SingleCellExperiment)
    library(scater)
    library(scran)
    # If scRNAseq is not yet installed, install it from Bioconductor first:
    # BiocManager::install("scRNAseq")
    library(scRNAseq)
    library(monocle)
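
If one of these `library()` calls fails, it can help to check which packages are installed in the RStudio app you launched. The following is a minimal sketch using only base R; the package list mirrors the `library()` calls above, and `missing_pkgs` is a name chosen here for illustration.

```r
# Packages used in this tutorial (taken from the library() calls above)
pkgs <- c("Seurat", "SeuratData", "SingleCellExperiment",
          "scater", "scran", "scRNAseq", "monocle")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise,
# without attaching it to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- names(installed)[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```

Any package reported as missing would need to be installed before the corresponding `library()` call will work.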
    

Single-Cell RNA-seq Analysis Packages#

Seurat#

  • Summary: Seurat is a widely used R package for single-cell RNA-seq analysis. It supports preprocessing, clustering, dimensionality reduction, differential expression, and visualization of single-cell data.

  • Paper: Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W. M., et al. (2019). “Comprehensive Integration of Single-Cell Data.” Cell, 177(7), 1888–1902. https://doi.org/10.1016/j.cell.2019.05.031

  • Website: https://satijalab.org/seurat/

SingleCellExperiment#

  • Summary: SingleCellExperiment is a Bioconductor package that defines the SingleCellExperiment class, a container for single-cell data such as count matrices, cell and gene metadata, and dimensionality reduction results. Packages such as scater and scran operate on objects of this class.

  • Website: https://bioconductor.org/packages/SingleCellExperiment/

scater#

  • Summary: scater is used for pre-processing and quality control of single-cell data. It helps to generate various visualizations and metrics for understanding the quality and variability within scRNA-seq datasets.

  • Paper: McCarthy, D. J., Campbell, K. R., Lun, A. T. L., & Wills, Q. F. (2017). “Scater: Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R.” Bioinformatics, 33(8), 1179–1186. https://doi.org/10.1093/bioinformatics/btw777

  • Website: https://bioconductor.org/packages/scater/

scran#

  • Summary: scran provides efficient methods for single-cell RNA-seq data normalization, clustering, and marker detection. It’s optimized for scalability with large datasets and includes several statistical tools for single-cell analysis.

  • Paper: Lun, A. T. L., McCarthy, D. J., & Marioni, J. C. (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Research, 5, 2122. https://doi.org/10.12688/f1000research.9501.2

  • Website: https://bioconductor.org/packages/scran/

scRNAseq#

  • Summary: scRNAseq is a data package that includes several example scRNA-seq datasets for practice and benchmarking of analytical pipelines.

  • Paper: Available on Bioconductor.

  • Website: https://bioconductor.org/packages/scRNAseq/

Monocle#

  • Summary: Monocle is a package designed for analyzing single-cell trajectories. It identifies and orders cells based on gene expression dynamics over pseudotime, which can help reveal cell differentiation and developmental processes.

  • Paper: Trapnell, C., Cacchiarelli, D., Grimsby, J., Pokharel, P., Li, S., Morse, M., et al. (2014). “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells.” Nature Biotechnology, 32(4), 381–386. https://doi.org/10.1038/nbt.2859

  • Website: http://cole-trapnell-lab.github.io/monocle-release/

Example Code for Data Loading and Visualization#

Below is step-by-step example code for loading, preprocessing, clustering, and visualizing single-cell RNA-seq data with Seurat. Comments explain each step:

# Load packages
library(Seurat)
library(SeuratData)

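# Install (once) and load the pbmc3k example dataset from SeuratData
SeuratData::InstallData("pbmc3k")
data("pbmc3k")

# Create a Seurat object from the raw counts
# (replace pbmc_counts with your own genes x cells count matrix if needed).
# The counts are kept as a sparse matrix; converting them with as.matrix()
# is not required and uses much more memory.
pbmc_counts <- pbmc3k@assays$RNA@counts
pbmc <- CreateSeuratObject(counts = pbmc_counts)

# Quality control and filtering
# Keep cells with more than 200 and fewer than 2,500 detected genes.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500)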

# Normalize the data
# This step adjusts for differences in sequencing depth across cells.
pbmc <- NormalizeData(pbmc)

# Identify variable features
# Variable features are genes that exhibit high variability across cells.
# These features are used in downstream analyses to focus on informative genes.
pbmc <- FindVariableFeatures(pbmc)

# Scale the data
# Scaling centers and scales each gene, making them comparable for PCA and clustering.
# This is an important step for most dimensionality reduction techniques.
pbmc <- ScaleData(pbmc)

# Run Principal Component Analysis (PCA)
# PCA reduces the dimensionality of the data, allowing us to identify major sources of variation.
# By default, RunPCA uses the variable features identified earlier.
pbmc <- RunPCA(pbmc)

# Find neighbors
# This step identifies nearest neighbors for each cell based on the first 10 principal components.
# It is required before clustering the cells.
pbmc <- FindNeighbors(pbmc, dims = 1:10)

# Cluster the cells
# Clustering groups cells with similar expression profiles, aiding in cell-type identification.
# Higher resolution values give more clusters.
pbmc <- FindClusters(pbmc, resolution = 0.5)

# Run UMAP for visualization
# UMAP (Uniform Manifold Approximation and Projection) is a popular method for visualizing high-dimensional data.
# This step reduces the data to two dimensions, making it easier to visualize clusters.
pbmc <- RunUMAP(pbmc, dims = 1:10)

# Plot the UMAP results
# This visualization shows the clusters identified by Seurat, each in a different color.
DimPlot(pbmc, reduction = "umap")

This annotated code is a guide to the basic steps of single-cell RNA-seq analysis in Seurat, from data loading and quality control to visualization.
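
To make the filtering and normalization steps more concrete, here is a small base-R sketch of what they compute. It is not Seurat's implementation. The count matrix is made up, and the toy thresholds `min_features` and `max_features` stand in for the 200 and 2,500 used above. The normalization follows the log-normalization that `NormalizeData` applies by default: each cell's counts are divided by that cell's total, multiplied by a scale factor of 10,000, and transformed with `log1p`.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns); values are made up.
counts <- matrix(
  c( 5, 0, 2, 3,    # cell1
     0, 0, 1, 0,    # cell2
    10, 4, 0, 6),   # cell3
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Number of detected genes per cell (what Seurat stores as nFeature_RNA)
n_features <- colSums(counts > 0)

# Toy thresholds standing in for nFeature_RNA > 200 & nFeature_RNA < 2500
min_features <- 1
max_features <- 4
keep <- n_features > min_features & n_features < max_features
filtered <- counts[, keep, drop = FALSE]

# Log-normalization: divide by each cell's total counts, multiply by the
# scale factor, then take log(1 + x)
scale_factor <- 10000
lib_size <- colSums(filtered)
normalized <- log1p(sweep(filtered, 2, lib_size, "/") * scale_factor)

print(n_features)
print(round(normalized, 3))
```

In this toy example, cell2 is removed because only one gene is detected in it. After normalization, the two remaining cells are on a comparable scale even though one has twice as many total counts as the other.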

For a detailed tutorial on using Seurat, see the Seurat website: https://satijalab.org/seurat/
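
The `dims = 1:10` argument in the example above selects the first 10 principal components for neighbor finding and UMAP. One common way to judge how many components to keep is to look at how much variance each one explains; in Seurat this is what `ElbowPlot(pbmc)` visualizes. The sketch below illustrates the idea with base R's `prcomp` on simulated data rather than Seurat's `RunPCA`, so the numbers are illustrative only.

```r
set.seed(42)

# Simulated stand-in for scaled expression data: 100 cells x 20 genes.
# (Made-up data; in Seurat the input would be the scaled variable features.)
expr <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)

# Shift 5 genes in the first 50 cells to mimic a distinct cell population.
expr[1:50, 1:5] <- expr[1:50, 1:5] + 3

# prcomp expects observations (here, cells) in rows.
pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Fraction of the total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Analogue of dims = 1:10: cell coordinates on the first 10 components
pcs <- pca$x[, 1:10]

print(round(var_explained[1:5], 3))
```

Calling `plot(var_explained, type = "b")` draws a simple elbow plot. Components after the point where the curve flattens usually add little, which is one heuristic for choosing the upper bound of `dims`.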

Memory and Core Requirements#

Memory Requirements#

  • Small Dataset (up to ~10,000 cells): 16–32 GB RAM should suffice for typical preprocessing and clustering tasks.

  • Medium Dataset (up to ~50,000 cells): 32–64 GB RAM is recommended to handle most analyses comfortably.

  • Large Dataset (over ~100,000 cells): 128 GB or more might be needed, particularly for steps like dimensionality reduction and integration.
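
As a rough back-of-the-envelope check on these figures, the size of a count matrix can be estimated from its dimensions. The sketch below is only an approximation under stated assumptions: 20,000 genes, 5% non-zero entries, and 1 GB = 10^9 bytes are all assumed values chosen for illustration, and the function names are made up for this example.

```r
# Bytes needed for a dense numeric matrix: one 8-byte double per entry.
dense_bytes <- function(n_genes, n_cells) {
  n_genes * n_cells * 8
}

# Bytes needed for a column-compressed sparse matrix (dgCMatrix):
# per non-zero entry an 8-byte value and a 4-byte row index,
# plus one 4-byte column pointer per column (and one extra).
sparse_bytes <- function(n_genes, n_cells, frac_nonzero) {
  nnz <- n_genes * n_cells * frac_nonzero
  nnz * (8 + 4) + (n_cells + 1) * 4
}

n_genes <- 20000        # assumed number of genes
n_cells <- 50000        # a "medium" dataset
frac_nonzero <- 0.05    # assumed fraction of non-zero counts

dense_gb <- dense_bytes(n_genes, n_cells) / 1e9
sparse_gb <- sparse_bytes(n_genes, n_cells, frac_nonzero) / 1e9

cat(sprintf("Dense:  %.1f GB\nSparse: %.1f GB\n", dense_gb, sparse_gb))
```

Under these assumptions the dense matrix needs about 8 GB and the sparse one about 0.6 GB. An analysis holds several such matrices at once (raw, normalized, and scaled data, plus temporary copies), which is one reason the recommended memory is well above the size of a single matrix.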

Core Requirements#

  • Small Analyses: For small datasets or exploratory analysis, 2–4 cores are generally sufficient.

  • Medium to Large Analyses: For larger datasets, 8–16 cores are beneficial, especially for parallelizable steps such as normalization, dimensionality reduction, and clustering.

  • Very Large or High-throughput Analyses: Using 24 cores or more can help expedite analyses on large datasets, especially when utilizing parallelized workflows with packages like BiocParallel.
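
Requesting cores only helps if the R session actually uses them. The sketch below shows one way to find out how many cores are available to the session. It assumes the job runs under Slurm with the `SLURM_CPUS_PER_TASK` environment variable set, and it falls back to `parallel::detectCores()` otherwise. Reserving one core for the session is a convention chosen here, not a requirement.

```r
library(parallel)

# Assumption: under Slurm, SLURM_CPUS_PER_TASK holds the number of cores
# allocated to the job. If it is not set, use the cores detected on the node
# (which may be more than the job is allowed to use).
slurm_cpus <- Sys.getenv("SLURM_CPUS_PER_TASK", unset = "")
n_cores <- if (nzchar(slurm_cpus)) as.integer(slurm_cpus) else detectCores()
if (is.na(n_cores)) n_cores <- 1L  # detectCores() can return NA

# Leave one core free for the R session itself.
n_workers <- max(1L, n_cores - 1L)
message("Using ", n_workers, " worker(s)")

# Possible next steps, shown as comments because they need extra packages:
# BiocParallel::register(BiocParallel::MulticoreParam(workers = n_workers))
# future::plan("multisession", workers = n_workers)
```

The commented lines indicate where the worker count would be passed on: BiocParallel for Bioconductor packages, and the future package, which Seurat uses for its parallelized functions.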

Example Setup Recommendations#

For a typical medium-sized scRNA-seq analysis in R, a setup with 64 GB of memory and 8–12 cores is often effective. If you’re working with very large datasets or performing computationally intensive steps like integration across multiple datasets, consider 128 GB of memory and 16 or more cores to optimize performance and speed.