Bioinformatics applications

Below are bioinformatics applications deployed by Tufts Research Technology.

abcreg

ABCreg is a tool used for automating approximate Bayesian computation by local linear regression.

abyss

ABySS is a de novo sequence assembler intended for short paired-end reads and genomes of all sizes.

alphafold

AlphaFold is an artificial intelligence program developed by Alphabet’s/Google’s DeepMind that predicts protein structures.

amplify

AMPlify is an attentive deep learning model for antimicrobial peptide prediction.

angsd

ANGSD is a software package for analyzing next-generation sequencing data.

bakta

Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis.

bbmap

BBMap is a short-read aligner that is distributed together with various other bioinformatics tools.

bbtools

BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.

bcftools

Bcftools is a program for variant calling and manipulating files in the Variant Call Format (VCF) and its binary counterpart BCF.

beast2

BEAST 2 is a cross-platform program for Bayesian phylogenetic analysis of molecular sequences.

bedops

Bedops is a software package for manipulating and analyzing genomic interval data.

bedtools

Bedtools is an extensive suite of utilities for genome arithmetic and comparing genomic features in BED format.

biobakery_workflows

BioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters.

biopython

Biopython is a set of freely available tools for biological computation written in Python.

blast

BLAST (Basic Local Alignment Search Tool) finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.

bowtie2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
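To give a feel for why an FM index keeps memory use low, here is a toy Python sketch of counting pattern occurrences by backward search over a Burrows-Wheeler transform. It is an illustration only and not Bowtie 2's implementation: the function names (`bwt`, `build_fm_index`, `count_occurrences`) are hypothetical, the transform is built naively from sorted rotations, and the occurrence table is stored in full, whereas production aligners typically sample it to save space.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform built naively from sorted rotations.

    `text` must end with a unique sentinel '$' that sorts before all
    other characters.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(reference):
    """Return (c_table, occ, n) for a reference string (toy version)."""
    text = reference + "$"
    last_column = bwt(text)

    # c_table[c]: number of characters in the text strictly smaller than c.
    counts = Counter(text)
    c_table, running_total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = running_total
        running_total += counts[ch]

    # occ[c][i]: number of occurrences of c in last_column[:i].
    occ = {ch: [0] * (len(last_column) + 1) for ch in counts}
    for i, ch in enumerate(last_column):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)

    return c_table, occ, len(last_column)


def count_occurrences(pattern, fm_index):
    """Count exact matches of `pattern` using backward search."""
    c_table, occ, n = fm_index
    lo, hi = 0, n  # half-open range of matching rows in the sorted rotations
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("GATTACAGATTACA")
    print(count_occurrences("ATTA", index))
```

The search touches only the two small tables, one step per pattern character, so the query cost does not depend on the length of the reference.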

breseq

Breseq is a computational pipeline for the analysis of short-read re-sequencing data.

busco

BUSCO (Benchmarking sets of Universal Single-Copy Orthologs) provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.

cactus

Cactus is a reference-free whole-genome multiple alignment program.

canu

Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing (such as the PacBio RS II/Sequel or Oxford Nanopore MinION).

cellprofiler

CellProfiler is free, open-source software designed to enable biologists without training in computer vision or programming to quantitatively measure phenotypes from thousands of images automatically.

cellprofiler-analyst

CellProfiler Analyst allows interactive exploration and analysis of data, particularly from high-throughput, image-based experiments.

cellranger

Cellranger is a set of analysis pipelines that process Chromium single-cell data to align reads, generate feature-barcode matrices, perform clustering and other secondary analysis, and more.

cellranger-atac

Cellranger-atac is a set of analysis pipelines that process Chromium Single Cell ATAC data.

cellrank

Cellrank is a toolkit to uncover cellular dynamics based on Markov state modeling of single-cell data.

cufflinks

Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.

cutadapt

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

diamond

Diamond is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data.

dorado

Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads.

dragon_ora

The DRAGEN ORA Helper Suite is a suite of software for Linux distributions, designed to integrate compressed FASTQ files in a transparent manner.

exonerate

Exonerate is a generic tool for pairwise sequence comparison/alignment.

fasta3

Fasta3 is a suite of programs for searching nucleotide or protein databases with a query sequence.

fastp

Fastp is an ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging, etc).

fastqc

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

fastspar

FastSpar is a C++ implementation of the SparCC algorithm which is up to several thousand times faster than the original Python2 release and uses much less memory. The FastSpar implementation provides threading support and a p-value estimator which accounts for the possibility of repetitious data permutations.

fasttree

FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. FastTree can handle alignments with up to a million sequences in a reasonable amount of time and memory. For large alignments, FastTree is 100-1,000 times faster than PhyML 3.0 or RAxML 7.

filtlong

Filtlong is a tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset. It uses both read length (longer is better) and read identity (higher is better) when choosing which reads pass the filter.
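The idea of ranking reads by both length and identity can be sketched in a few lines of Python. This is a hedged illustration and not Filtlong's actual scoring: the functions `mean_identity` and `filter_reads`, the length-times-identity score, and the `keep_fraction` parameter are all assumptions made for the example, and identity is approximated from Phred+33 quality characters.

```python
def mean_identity(quality, offset=33):
    """Estimate read identity as the mean per-base accuracy.

    `quality` is a FASTQ quality string; Phred+33 encoding is assumed.
    """
    if not quality:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return 1.0 - sum(error_probs) / len(error_probs)


def filter_reads(reads, keep_fraction=0.5, min_length=0):
    """Keep the best-scoring reads until `keep_fraction` of the bases is kept.

    `reads` is a list of (name, sequence, quality) tuples. The score used
    here (length multiplied by estimated identity) is a hypothetical choice
    that rewards both longer and more accurate reads.
    """
    candidates = [read for read in reads if len(read[1]) >= min_length]
    total_bases = sum(len(read[1]) for read in candidates)
    target_bases = total_bases * keep_fraction

    ranked = sorted(
        candidates,
        key=lambda read: len(read[1]) * mean_identity(read[2]),
        reverse=True,
    )

    kept, kept_bases = [], 0
    for read in ranked:
        if kept_bases >= target_bases:
            break
        kept.append(read)
        kept_bases += len(read[1])
    return kept


if __name__ == "__main__":
    example_reads = [
        ("long_good", "A" * 1000, "I" * 1000),
        ("long_bad", "A" * 1000, "#" * 1000),
        ("short_good", "A" * 100, "I" * 100),
    ]
    for name, _, _ in filter_reads(example_reads, keep_fraction=0.4):
        print(name)
```

For real data, the tool's own options should be used; the sketch only shows how a combined length and quality ranking yields a smaller, better subset.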

flye

Flye is a fast and accurate de novo assembler for single-molecule sequencing reads.

fqtk

fqtk is a toolkit for working with FASTQ files, written in Rust.

gatk4

GATK (Genome Analysis Toolkit) is a collection of command-line tools for analyzing high-throughput sequencing data with a primary focus on variant discovery.

genomad

geNomad is a tool for the identification of mobile genetic elements.

geomx_ngs_pipeline

The GeoMx NGS Pipeline, developed by NanoString, is an essential part of the GeoMx NGS workflow.

guppy

Guppy is a data processing toolkit that contains Oxford Nanopore Technologies’ basecalling algorithms and several bioinformatic post-processing features.

hap.py

Hap.py is a tool to compare diploid genotypes at haplotype level.

hisat2

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes as well as to a single reference genome.

hmmer

Hmmer is used for searching sequence databases for sequence homologs, and for making sequence alignments.

homer

HOMER is a suite of tools for Motif Discovery and next-gen sequencing analysis.

htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.

humann

Humann is a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads).

impute2

Impute2 is a genotype imputation and haplotype phasing program.

iqtree2

IQ-TREE is efficient software for phylogenomic inference by maximum likelihood.

kallisto

Kallisto is a program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads.

kneaddata

Kneaddata is a tool designed to perform quality control on metagenomic sequencing data.

kraken2

Kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences.

krakentools

Krakentools is a suite of scripts to be used for post-analysis of Kraken/KrakenUniq/Kraken2/Bracken results.

macs2

MACS2 (Model-based Analysis of ChIP-Seq) is a tool for identifying transcription factor binding sites.

macs3

MACS3 (Model-based Analysis of ChIP-Seq) is a tool for identifying transcription factor binding sites.

masurca

The MaSuRCA (Maryland Super Read Cabog Assembler) genome assembly and analysis toolkit consists of the MaSuRCA genome assembler, the QuORUM error corrector for Illumina data, the POLCA genome polishing software, a chromosome scaffolder, the Jellyfish mer counter, and the MUMmer aligner.

medaka

Medaka is a tool to create consensus sequences and variant calls from nanopore sequencing data.

megahit

Megahit is an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graphs.

meme

Meme is a collection of tools for the discovery and analysis of sequence motifs.

metaphlan

MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution.

miniasm

Miniasm is a very fast OLC-based de novo assembler for noisy long reads.

minimap2

Minimap2 is a versatile pairwise aligner for genomic and spliced nucleotide sequences.

minipolish

Minipolish is a tool for Racon polishing of miniasm assemblies.

mirdeep2

miRDeep2 discovers active known or novel miRNAs from deep sequencing data (Solexa/Illumina, 454, …).

mirge3

Mirge3 is an updated Python package for comprehensive analysis of small RNA sequencing data, including miRNA annotation, A-to-I editing, novel miRNA detection, isomiR analysis, visualization through IGV, processing of Unique Molecular Identifiers (UMIs), tRF detection, and production of interactive graphical output.

mothur

Mothur is an open source software package for bioinformatics data processing.

multiqc

Multiqc is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools.

nf-core-ampliseq

nf-core/ampliseq is a bioinformatics analysis pipeline for amplicon sequencing. It supports denoising of any amplicon and a variety of taxonomic databases for taxonomic assignment, including 16S, ITS, CO1 and 18S. Phylogenetic placement is also possible. Multiple-region analysis such as 5R is implemented. Paired-end Illumina data and single-end Illumina, PacBio and IonTorrent data are supported. The default is the analysis of 16S rRNA gene amplicons sequenced paired-end with Illumina.

nf-core-atacseq

nf-core/atacseq is a bioinformatics analysis pipeline used for ATAC-seq data.

nf-core-bacass

nf-core/bacass is a bioinformatics best-practice analysis pipeline for simple bacterial assembly and annotation. The pipeline is able to assemble short reads, long reads, or a mixture of short and long reads (hybrid assembly).

nf-core-bamtofastq

nf-core/bamtofastq is a bioinformatics best-practice analysis pipeline that converts (un)mapped .bam or .cram files into fq.gz files.

nf-core-chipseq

nf-core/chipseq is a bioinformatics analysis pipeline used for chromatin immunoprecipitation sequencing (ChIP-seq) data.

nf-core-denovotranscript

nf-core/denovotranscript is a bioinformatics pipeline for de novo transcriptome assembly of paired-end short reads from bulk RNA-seq. It takes a samplesheet and FASTQ files as input and performs quality control (QC), trimming, assembly, redundancy reduction, pseudoalignment, and quantification. It outputs a transcriptome assembly FASTA file, a transcript abundance TSV file, and a MultiQC report with assembly quality and read QC metrics.

nf-core-detaxizer

nf-core/detaxizer is a pipeline to assess raw (meta)genomic data for contaminations and optionally filter reads which were classified as contamination. Default taxa classified as contamination are Homo and Homo sapiens.

nf-core-differentialabundance

nf-core/differentialabundance is a bioinformatics pipeline that can be used to analyse data represented as matrices, comparing groups of observations to generate differential statistics and downstream analyses. The pipeline supports RNA-seq data such as that generated by the nf-core rnaseq workflow, and Affymetrix arrays via .CEL files.

nf-core-eager

nf-core/eager is a scalable and reproducible bioinformatics best-practice processing pipeline for genomic NGS data, with a focus on ancient DNA (aDNA) data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.

nf-core-fetchngs

nf-core/fetchngs is a bioinformatics pipeline to fetch metadata and raw FastQ files from public databases. At present, the pipeline supports SRA / ENA / DDBJ / GEO ids.

nf-core-funcscan

nf-core/funcscan is a bioinformatics best-practice analysis pipeline for the screening of nucleotide sequences such as assembled contigs for functional genes. It currently features mining for antimicrobial peptides, antibiotic resistance genes and biosynthetic gene clusters.

nf-core-hic

nf-core/hic is a bioinformatics best-practice analysis pipeline for the analysis of chromosome conformation capture (Hi-C) data.

nf-core-mag

nf-core/mag is a bioinformatics best-practice analysis pipeline for assembly, binning and annotation of metagenomes.

nf-core-metatdenovo

nf-core/metatdenovo is a bioinformatics best-practice analysis pipeline for assembly and annotation of metatranscriptomic data, both prokaryotic and eukaryotic.

nf-core-methylseq

nf-core/methylseq is a bioinformatics analysis pipeline used for Methylation (Bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.

nf-core-nanoseq

nf-core/nanoseq is a bioinformatics analysis pipeline for Nanopore DNA/RNA sequencing data that can be used to perform basecalling, demultiplexing, QC, alignment, and downstream analysis.

nf-core-nanostring

nf-core/nanostring is a bioinformatics pipeline that can be used to analyze NanoString data. The performed analysis steps include quality control and data normalization.

nf-core-pangenome

nf-core/pangenome is a bioinformatics best-practice analysis pipeline for pangenome graph construction. The pipeline renders a collection of sequences into a pangenome graph. Its goal is to build a graph that is locally directed and acyclic while preserving large-scale variation. Maintaining local linearity is important for interpretation, visualization, mapping, comparative genomics, and reuse of pangenome graphs.

nf-core-proteinfold

nf-core/proteinfold is a bioinformatics best-practice analysis pipeline for Protein 3D structure prediction.

nf-core-raredisease

nf-core/raredisease is a best-practice bioinformatic pipeline for calling and scoring variants from WGS/WES data from rare disease patients.

nf-core-rnafusion

nf-core/rnafusion is a bioinformatics best-practice analysis pipeline for RNA sequencing consisting of several tools designed for detecting and visualizing fusion genes. Results from up to five fusion-calling tools are generated and aggregated, most notably into a PDF visualization document, a VCF data collection file, and HTML and TSV reports.

nf-core-rnaseq

nf-core/rnaseq is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.

nf-core-rnasplice

nf-core/rnasplice is a bioinformatics pipeline for alternative splicing analysis of RNA sequencing data obtained from organisms with a reference genome and annotation.

nf-core-sarek

nf-core/sarek is a workflow designed to detect variants in whole-genome or targeted sequencing data. Initially designed for human and mouse, it can work on any species with a reference genome. Sarek can also handle tumour / normal pairs and could include additional relapses.

nf-core-scrnaseq

nf-core/scrnaseq is a bioinformatics best-practice analysis pipeline for processing 10x Genomics single-cell RNA-seq data.

nf-core-smrnaseq

nf-core/smrnaseq is a bioinformatics best-practice analysis pipeline for Small RNA-Seq.

nf-core-taxprofiler

nf-core/taxprofiler is a bioinformatics best-practice analysis pipeline for taxonomic classification and profiling of shotgun short- and long-read metagenomic data. It allows for in-parallel taxonomic identification of reads or taxonomic abundance estimation with multiple classification and profiling tools against multiple databases, and produces standardised output tables for facilitating results comparison between different tools and databases.

nf-core-viralrecon

nf-core/viralrecon is a bioinformatics analysis pipeline used to perform assembly and intra-host/low-frequency variant calling for viral samples. The pipeline supports both Illumina and Nanopore sequencing data.

orthofinder

OrthoFinder is a fast, accurate and comprehensive platform for comparative genomics. It finds orthogroups and orthologs, infers rooted gene trees for all orthogroups and identifies all of the gene duplication events in those gene trees. It also infers a rooted species tree for the species being analysed and maps the gene duplication events from the gene trees to branches in the species tree. OrthoFinder also provides comprehensive statistics for comparative genomic analyses.

pandaseq

Pandaseq is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.

parabricks

NVIDIA’s Clara Parabricks brings next generation sequencing to GPUs, accelerating an array of gold-standard tooling such as BWA-MEM, GATK4, Google’s DeepVariant, and many more. Users can achieve a 30-60x acceleration and 99.99% accuracy for variant calling when comparing against CPU-only BWA-GATK4 pipelines, meaning a single server can process up to 60 whole genomes per day. These tools can be easily integrated into current pipelines with drop-in replacement commands to quickly bring speed and data-center scale to a range of applications including germline, somatic and RNA workflows.

pepper_deepvariant

PEPPER is a genome inference module based on recurrent neural networks that enables long-read variant calling and nanopore assembly polishing in the PEPPER-Margin-DeepVariant pipeline.

petitefinder

petiteFinder is an automated computer vision tool to compute Petite colony frequencies in baker’s yeast.

picard

Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

plink

Plink is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

plink2

Plink2 is a whole genome association analysis toolset.

polypolish

Polypolish is a tool for polishing genome assemblies with short reads.

prokka

Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.

qiime2

QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

qualimap

Qualimap is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives, such as feature counts.

r-bioinformatics

RStudio is an integrated development environment (IDE) for the R statistical computation and graphics system.

r-scrnaseq

RStudio is an integrated development environment (IDE) for the R statistical computation and graphics system.

r-shinyngs

Shinyngs is an R package designed to facilitate downstream analysis of RNA-seq and similar expression data with various exploratory plots and data mining tools.

raven-assembler

Raven-assembler is a de novo genome assembler for long uncorrected reads.

raxml-ng-mpi

RAxML-NG is a phylogenetic tree inference tool that uses the maximum-likelihood (ML) optimality criterion.

relion

RELION (for REgularised LIkelihood OptimisatioN) is a stand-alone computer program for Maximum A Posteriori refinement of (multiple) 3D reconstructions or 2D class averages in cryo-electron microscopy. It is developed in the research group of Sjors Scheres at the MRC Laboratory of Molecular Biology.

rmats2sashimiplot

Rmats2sashimiplot produces a sashimiplot visualization of rMATS output.

rnaquast

Rnaquast is a quality assessment tool for de novo transcriptome assemblies.

rosettafold2

RoseTTAFold2 extends the original three-track architecture of RoseTTAFold over the full network, incorporating the concepts of Frame-aligned point error, recycling during training, and the use of a distillation set from AlphaFold2.

rosettafold2na

RoseTTAFoldNA rapidly produces three-dimensional structure models with confidence estimates for protein–DNA and protein–RNA complexes.

salmon

Salmon is a tool for quantifying the expression of transcripts using RNA-seq data.

samtools

Samtools is a set of utilities for the Sequence Alignment/Map (SAM) format.

scanpy

Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata.

scvelo

Scvelo is a scalable toolkit for RNA velocity analysis in single cells.

signalp6

SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes.

spaceranger

Spaceranger is a set of analysis pipelines that process Visium Spatial Gene Expression data with brightfield and fluorescence microscope images.

spades

Spades is an assembly toolkit containing various assembly pipelines.

squid

SQUID is designed to detect both fusion-gene and non-fusion-gene transcriptomic structural variations from RNA-seq alignment.

star

STAR (Spliced Transcripts Alignment to a Reference) is an ultrafast universal RNA-seq aligner.

subread

Subread carries out high-performance read alignment, quantification and mutation discovery. It is a general-purpose read aligner that can map both genomic DNA-seq reads and RNA-seq reads. It uses a mapping paradigm called seed-and-vote to achieve fast, accurate and scalable read mapping. Subread automatically determines whether a read should be globally or locally aligned, which makes it particularly powerful for mapping RNA-seq reads. It supports INDEL detection and can map reads with both fixed and variable lengths.

tmhmm

Tmhmm is used for prediction of transmembrane helices in proteins.

transdecoder

Transdecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.

trgt

TRGT is a tool for targeted genotyping of tandem repeats from PacBio HiFi data. In addition to basic size genotyping, TRGT profiles the sequence composition, mosaicism, and CpG methylation of each analyzed repeat and supports visualization of reads overlapping the repeats.

trim-galore

Trim-galore is a wrapper tool that automates quality and adapter trimming of FastQ files.

trimmomatic

Trimmomatic is a flexible read trimming tool for Illumina NGS data.

trinity

Trinity assembles transcript sequences from Illumina RNA-Seq data.

trinotate

Trinotate is a comprehensive annotation suite designed for automatic functional annotation of transcriptomes, particularly de novo assembled transcriptomes, from model or non-model organisms.

trycycler

Trycycler is a tool for generating consensus long-read assemblies for bacterial genomes.

vcftools

VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.

viennarna

Viennarna is a set of standalone programs and libraries used for prediction and analysis of RNA secondary structures.