--- tags: bioinformatics --- # scRNA-seq analysis with RStudio Shirley Li, Bioinformatician, TTS Research Technology xue.li37@tufts.edu Date: 2024-11-01 ## Overview In this tutorial, you will learn how to: - Set up an R environment for single-cell RNA-seq analysis. - Install and configure popular R packages for scRNA-seq, such as **Seurat**, **SingleCellExperiment**, and **scater**. - Load, preprocess, and analyze scRNA-seq data in R with Seurat. ## Set Up RStudio on Open OnDemand for scRNA-seq 1. **Open OnDemand RStudio App** - Log in to [Open OnDemand](https://ondemand.pax.tufts.edu/) with your UTLN - You will see `RStudio Pax` under `Interactive Apps` and other topic-specific Rstudio apps under `Bioinformatics Apps` - Fill in the parameters according to your needs. Start with 64GB and 12 cores. - Launch the job. 3. **Load Packages in R**: ```r # Load the necessary libraries for scRNA-seq analysis library(Seurat) library(SeuratData) library(SingleCellExperiment) library(scater) library(scran) # code to install scRNAseq # BiocManager::install("scRNAseq") library(scRNAseq) library(monocle) ``` ## Single-Cell RNA-seq Analysis Packages ### Seurat - **Summary**: `Seurat` is a widely used R package for single-cell RNA-seq analysis. It supports preprocessing, clustering, dimensionality reduction, differential expression, and visualization of single-cell data. - **Paper**: Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W. M., et al. (2019). "Comprehensive Integration of Single-Cell Data." *Cell*, 177(7), 1888–1902. [https://doi.org/10.1016/j.cell.2019.05.031](https://doi.org/10.1016/j.cell.2019.05.031) - **Website**: [https://satijalab.org/seurat/](https://satijalab.org/seurat/) ### SingleCellExperiment - **Summary**: `SingleCellExperiment` provides a flexible framework for representing single-cell data in R, including assays, row/column metadata, and reduced dimensions. It serves as a foundational data structure in the Bioconductor ecosystem. - **Paper**: Lun, A. T. L., & Risso, D. (2023). "SingleCellExperiment: an object class for single-cell analysis in R." *F1000Research*, 7, 521. [https://doi.org/10.12688/f1000research.16298.2](https://doi.org/10.12688/f1000research.16298.2) - **Website**: [https://bioconductor.org/packages/SingleCellExperiment/](https://bioconductor.org/packages/SingleCellExperiment/) ### scater - **Summary**: `scater` is used for pre-processing and quality control of single-cell data. It helps to generate various visualizations and metrics for understanding the quality and variability within scRNA-seq datasets. - **Paper**: McCarthy, D. J., Campbell, K. R., Lun, A. T. L., & Wills, Q. F. (2017). "Scater: Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R." *Bioinformatics*, 33(8), 1179–1186. [https://doi.org/10.1093/bioinformatics/btw777](https://doi.org/10.1093/bioinformatics/btw777) - **Website**: [https://bioconductor.org/packages/scater/](https://bioconductor.org/packages/scater/) ### scran - **Summary**: `scran` provides efficient methods for single-cell RNA-seq data normalization, clustering, and marker detection. It’s optimized for scalability with large datasets and includes several statistical tools for single-cell analysis. - **Paper**: Lun, A. T. L., & Marioni, J. C. (2017). "Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data." *Nature Methods*, 13, 795–798. [https://doi.org/10.1038/nmeth.3963](https://doi.org/10.1038/nmeth.3963) - **Website**: [https://bioconductor.org/packages/scran/](https://bioconductor.org/packages/scran/) ### scRNAseq - **Summary**: `scRNAseq` is a data package that includes several example scRNA-seq datasets for practice and benchmarking of analytical pipelines. - **Paper**: Available on Bioconductor. - **Website**: [https://bioconductor.org/packages/scRNAseq/](https://bioconductor.org/packages/scRNAseq/) ### Monocle - **Summary**: `Monocle` is a package designed for analyzing single-cell trajectories. It identifies and orders cells based on gene expression dynamics over pseudotime, which can help reveal cell differentiation and developmental processes. - **Paper**: Trapnell, C., Cacchiarelli, D., Grimsby, J., Pokharel, P., Li, S., Morse, M., et al. (2014). "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." *Nature Biotechnology*, 32(4), 381–386. https://doi.org/10.1038/nbt.2859 - **Website**: http://cole-trapnell-lab.github.io/monocle-release/ ## Example Code for Data Loading and Visualization Below is a step-by-step example code for loading, preprocessing, and visualizing single-cell RNA-seq data with Seurat, including comments to explain each step: ```r # Load packages library(Seurat) library(SeuratData) # Load the pbmc3k dataset (replace with your dataset if needed) SeuratData::InstallData("pbmc3k") data("pbmc3k") # Replace with your dataset # Create a Seurat object pbmc_counts <- as.matrix(pbmc3k@assays$RNA@counts) pbmc <- CreateSeuratObject(counts = pbmc_counts) # Quality control and filtering pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500) # Normalize the data # This step adjusts for differences in sequencing depth across cells. pbmc <- NormalizeData(pbmc) # Identify variable features # Variable features are genes that exhibit high variability across cells. # These features are used in downstream analyses to focus on informative genes. pbmc <- FindVariableFeatures(pbmc) # Scale the data # Scaling centers and scales each gene, making them comparable for PCA and clustering. # This is an important step for most dimensionality reduction techniques. pbmc <- ScaleData(pbmc) # Step 4: Run Principal Component Analysis (PCA) # PCA reduces the dimensionality of the data, allowing us to identify major sources of variation. # Here, we use the variable features identified earlier to perform PCA. pbmc <- RunPCA(pbmc) # Step 5: Find Neighbors # This step identifies nearest neighbors for each cell based on their PCA scores. # It is an essential step before clustering the cells. pbmc <- FindNeighbors(pbmc, dims = 1:10) # Step 6: Cluster the cells # Clustering groups cells with similar expression profiles, aiding in cell-type identification. pbmc <- FindClusters(pbmc, resolution = 0.5) # Step 7: Run UMAP for visualization # UMAP (Uniform Manifold Approximation and Projection) is a popular method for visualizing high-dimensional data. # This step reduces the data to two dimensions, making it easier to visualize clusters. pbmc <- RunUMAP(pbmc, dims = 1:10) # Step 8: Plot the UMAP results # This visualization shows the clusters identified by Seurat, each in a different color. DimPlot(pbmc, reduction = "umap") ``` This annotated code provides a guide for basic steps of single-cell RNA-seq analysis in Seurat, from data normalization to visualization. **For a detailed tutorial on using Seurat, please visit [this link](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html)** ## Memory and Core Requirements ### **Memory Requirements** - **Small Dataset (up to ~10,000 cells)**: 16–32 GB RAM should suffice for typical preprocessing and clustering tasks. - **Medium Dataset (up to ~50,000 cells)**: 32–64 GB RAM is recommended to handle most analyses comfortably. - **Large Dataset (over ~100,000 cells)**: 128 GB or more might be needed, particularly for steps like dimensionality reduction and integration. ### **Core Requirements** - **Small Analyses**: For small datasets or exploratory analysis, 2–4 cores are generally sufficient. - **Medium to Large Analyses**: For larger datasets, 8–16 cores are beneficial, especially for parallelizable steps such as normalization, dimensionality reduction, and clustering. - **Very Large or High-throughput Analyses**: Using 24 cores or more can help expedite analyses on large datasets, especially when utilizing parallelized workflows with packages like `BiocParallel`. ### Example Setup Recommendations For a typical medium-sized scRNA-seq analysis in R, a setup with **64 GB of memory and 8–12 cores** is often effective. If you’re working with very large datasets or performing computationally intensive steps like integration across multiple datasets, consider **128 GB of memory and 16 or more cores** to optimize performance and speed.