Resources

Here’s a friendly, handpicked list of bioinformatics resources to help you kick off your analyses or sharpen your skills. It’s not everything out there, just stuff based on my experience. And whenever there’s a great tutorial already online, I’ll point you to it. No need to reinvent the wheel!

Programming

The Essentials: Key Programming Languages for Bioinformatics and Data Science.

Languages

Bash

The command-line powerhouse. Bash is your go-to for automating repetitive tasks, managing files, and running pipelines on servers or clusters. In bioinformatics, it’s essential for processing large datasets, running tools like BWA or SAMtools, and chaining commands into efficient workflows.

Python

The Swiss Army knife of bioinformatics. Python’s simplicity and vast ecosystem make it ideal for data analysis, scripting, and machine learning. Use it to parse genomic data, build pipelines, or create interactive visualizations: Python does it all.
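To give a feel for "parsing genomic data", here is a minimal sketch that reads FASTA records and computes GC content using only the standard library. The function names (`read_fasta`, `gc_content`) and the two example records are made up for illustration; for real projects an established library is usually the safer choice.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Hypothetical two-record input, held in memory instead of a file
example = StringIO(">seq1 demo\nATGC\nGGCC\n>seq2\nATAT\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq), gc_content(seq))
```

The same generator works on a real file by passing `open("my_sequences.fasta")` (a placeholder path) instead of the `StringIO` object.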

R

The statistician’s best friend. R shines in exploratory data analysis, statistical modeling, and publication-ready visualizations. In bioinformatics, it’s the tool of choice for differential expression analysis, clustering, and interpreting high-throughput sequencing data.

Workflow management

Snakemake

Workflow management system that uses a Python-based syntax to create reproducible and scalable data analysis pipelines, making it particularly accessible for users familiar with Python.

Nextflow

Workflow management system that utilizes a domain-specific language based on Groovy, enabling seamless integration of various programming languages and providing robust support for parallel and distributed computing environments.

Web applications

Shiny

A web application framework for R or Python that can be used to build interactive web applications.

Dash

Dash is the original low-code framework for rapidly building data apps in Python.

Flask

Web application framework designed to make getting started quick and easy, with the ability to scale up to complex applications.

Integrated Development Environments

IDEs are essential tools for developers, offering a range of features that facilitate coding, debugging, and project management.

Visual Studio Code

Visual Studio Code (VS Code) is a highly popular, open-source code editor developed by Microsoft. It supports a wide range of programming languages and comes with features like IntelliSense code completion, debugging tools, Git integration, and a vast marketplace for extensions.

PyCharm

PyCharm is an IDE from JetBrains, specifically designed for Python development. It provides features like code analysis, graphical debugging, and an integrated terminal. PyCharm also supports web development with Django, Flask, and other frameworks.

RStudio

RStudio is a powerful and widely used Integrated Development Environment (IDE) designed to enhance productivity in coding, data analysis, and visualization, primarily for R but also supporting Python. It is the go-to choice for over 90% of R programmers, making it the de facto standard in the field.

Positron

Positron is the next-generation IDE for developing R code. Developed by Posit, it is set to be the successor of RStudio, bringing new features and a smoother user experience.

Bioinformatics tools

Tools that you need for most sequencing data analyses.

Raw data quality check

FastQC

FastQC will analyze your reads and return a complete report with many metrics. It is fast and simple to use. Always check the quality of the sequencing before any analysis.

MultiQC

MultiQC scans the designated directory for the output files and logs of supported tools (FastQC reports, for example) and aggregates them into a single interactive report within seconds. Rather than replacing FastQC, it complements it: run it after your quality control, trimming, or mapping steps to compare all your samples at a glance.

Reads filtering and trimming

Trimmomatic

A powerful tool for filtering and trimming reads, offering high customization and efficiency, though its syntax may feel somewhat complex to newcomers.

fastp

Fastp is the easiest tool for filtering and trimming low-quality reads, yet it’s also highly customizable to fit your specific experimental needs.
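To make "filtering low-quality reads" concrete, here is a small sketch of the underlying idea: decode the FASTQ quality string into Phred scores and keep a read only if it is long enough and its mean quality is high enough. It assumes the common Phred+33 encoding; the function names and the thresholds (`min_mean_q=20`, `min_length=4`) are illustrative choices, not the defaults or the algorithm of Trimmomatic or fastp.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (assumes Phred+33)."""
    scores = [ord(char) - offset for char in quality]
    return sum(scores) / len(scores) if scores else 0.0


def passes_filter(quality, min_mean_q=20, min_length=4):
    """Keep a read if it is long enough and its mean quality is high enough.

    Both thresholds are arbitrary values chosen for this example.
    """
    return len(quality) >= min_length and mean_phred(quality) >= min_mean_q


# 'I' encodes Q40 and '!' encodes Q0 under Phred+33
for qual in ["IIII", "!!!!", "II"]:
    print(qual, mean_phred(qual), passes_filter(qual))
```

Real trimmers go further (sliding windows, adapter detection, paired-end handling), which is why using a dedicated tool is recommended over rolling your own.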

Genome assembly

Unicycler

Unicycler is the genome assembler you need. It has been designed for hybrid assembly (long reads + short reads) but also works very well with short reads only. It uses SPAdes in the background and produces high-quality assemblies.

Flye

If you need to perform genome assembly using long reads, use Flye. It has been designed for long reads either from PacBio or Nanopore technologies.

QUAST

Once your genome is assembled, evaluate the assembly’s quality to ensure optimal results or to compare it with others. QUAST is the tool for the job: it analyzes your contigs and delivers key metrics to accurately assess assembly quality.
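N50 is one of the contiguity metrics commonly reported for assemblies. As a rough illustration of what it means, the sketch below computes it from a list of contig lengths: the length of the contig at which the running total, going from longest to shortest, reaches half of the assembly size. The function name and the example lengths are made up; this is not QUAST's own code.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings us to at least half the total
        if running * 2 >= total:
            return length
    return 0


# Hypothetical assembly: six contigs, 300 bp in total
contigs = [100, 80, 50, 30, 20, 20]
print(n50(contigs))  # 100 + 80 = 180 >= 150, so N50 is 80
```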

Bandage

Visualizing the assembly directly can be highly informative: it helps identify issues like contamination or poor assembly quality. With Bandage, you can examine the assembly graph and assess how well your assembly has been constructed.

Phylogeny

RAxML

RAxML is a powerful and fast software for reconstructing phylogenetic trees, widely used in evolutionary biology to analyze large amounts of genetic data with precision.

IQ-TREE

Fast and user-friendly tool for phylogenetic tree inference based on maximum likelihood. It supports DNA and protein sequence alignments and includes features such as automated model selection and ultrafast bootstrap, making it well suited for efficient and accurate phylogenetic analyses.

Annotation

RAST

RAST (Rapid Annotation using Subsystem Technology) is an automated pipeline for the annotation of bacterial and archaeal genomes. It provides functional predictions for genes and assigns them to biological subsystems. Because it is only accessible online, it is system independent, but annotation may take quite a long time to finish.

Bakta

Fast and standardized tool for the annotation of bacterial genomes. It provides high-quality structural and functional annotations using curated reference databases, ensuring consistent results across analyses. Bakta is designed for ease of use and integration into bioinformatics pipelines, making it well suited for large-scale bacterial genome projects.

AMRFinder

AMRFinder is a tool developed by NCBI for identifying antimicrobial resistance (AMR) genes and associated mutations in bacterial genomes. It uses curated AMR reference data to detect resistance determinants from DNA or protein sequences, providing reliable annotations important for surveillance, research, and clinical microbiology.

ResFinder

Web-based tool for detecting acquired antimicrobial resistance genes in bacterial whole-genome sequences. Developed by the Center for Genomic Epidemiology, it compares input sequences against curated resistance gene databases to support AMR surveillance and epidemiological studies.

GenoScanner

GenomeScanner is a lightweight bioinformatics tool for taxonomically classifying microbial genomes.

Peak calling

MACS3

MACS3 (Model-based Analysis of ChIP-Seq) is a widely used tool for identifying enriched regions in ChIP-seq and related sequencing data. It models background noise to accurately detect peaks corresponding to protein–DNA interactions, making it a standard choice for transcription factor and epigenomic analyses.

HOMER

HOMER is a suite of tools for analyzing ChIP-seq and other high-throughput sequencing data, with a strong focus on peak detection and motif discovery. It enables identification of enriched genomic regions and associated regulatory motifs, supporting studies of transcriptional regulation and epigenomics.

Variant calling

Snippy

Rapid bacterial variant calling and core genome alignment tool. It identifies SNPs and small indels from whole-genome sequencing data by mapping reads to a reference genome, making it ideal for comparative genomics and outbreak investigations.

GATK

GATK (Genome Analysis Toolkit) is a comprehensive software suite for variant discovery and genotyping in high-throughput sequencing data. Widely used in human and model organism genomics, it provides robust tools for calling SNPs, indels, and structural variants, along with best-practice workflows for accurate and reproducible analysis.

bcftools

Command-line suite for processing and analyzing variant call format (VCF) and binary VCF (BCF) files. It provides efficient tools for variant calling, filtering, annotation, and manipulation, making it essential for large-scale genomic variant analysis.

Metagenomics

Kraken2

Kraken 2 is a fast and accurate taxonomic classification tool for metagenomic sequencing data. It assigns reads to taxa using exact k-mer matches against large reference databases, enabling comprehensive profiling of microbial communities from DNA or RNA sequencing datasets.
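The idea of classifying reads by exact k-mer matches can be shown with a toy example. The sketch below is a deliberate simplification and not Kraken 2's actual algorithm or data structures: it indexes every k-mer of each reference, then lets the k-mers of a read that belong to exactly one taxon "vote". The taxon names, sequences, function names, and `k=5` are all invented for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Assign a read to the taxon with the most unambiguous k-mer hits."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer)
        if taxa and len(taxa) == 1:  # ignore k-mers shared between taxa
            votes[next(iter(taxa))] += 1
    return votes.most_common(1)[0][0] if votes else "unclassified"


# Hypothetical reference "database" with two made-up taxa
references = {"taxon_A": "ATGCGTACGTTAGC", "taxon_B": "TTGACCGGTAACCG"}
index = build_index(references, k=5)
print(classify("GCGTACGT", index, k=5))
print(classify("AAAAAAAA", index, k=5))
```

Production classifiers rely on compact, pre-built databases and taxonomy-aware scoring, so treat this only as a mental model.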

Bracken

Companion tool to Kraken 2, Bracken provides accurate species- and genus-level abundance estimates from metagenomic sequencing data. By reanalyzing Kraken 2 classifications, Bracken refines read assignments to generate more precise microbial community profiles.

DADA2

Software package for high-resolution analysis of amplicon sequencing data, particularly 16S and ITS rRNA gene sequences. It models and corrects sequencing errors to infer exact biological sequences, enabling accurate identification of microbial taxa and community composition.

metaSPAdes

MetaSPAdes is a specialized genome assembler designed for metagenomic sequencing data. It reconstructs microbial genomes from complex communities by efficiently handling uneven coverage and mixed populations, making it ideal for environmental and clinical metagenomics studies.

QIIME2

Powerful, open-source platform for analyzing and visualizing microbiome sequencing data. It supports reproducible workflows for tasks such as quality control, taxonomic classification, diversity analysis, and data visualization, making it widely used in microbial ecology and microbiome research.

Others

Skani

Tool for rapid and scalable k-mer based analysis of genomic sequences. It enables fast comparison and clustering of large-scale genome datasets, making it suitable for genome similarity studies, phylogenomics, and microbial population analyses.

mge-cluster

Tool for clustering and analyzing mobile genetic elements (MGEs) in microbial genomes. It enables identification of related MGEs across datasets, facilitating studies of horizontal gene transfer, antibiotic resistance, and genome evolution.

SAMtools

SAMtools is a widely used suite of programs for interacting with high-throughput sequencing data in SAM, BAM, and CRAM formats. It provides essential functions such as sorting, indexing, and variant calling, forming a core component of many genomics and bioinformatics pipelines.

BEDTools

Powerful suite of utilities for comparing, manipulating, and analyzing genomic features in BED, GFF/GTF, VCF, and other formats. It enables tasks such as intersecting, merging, and subtracting genomic intervals, making it a staple tool for genome annotation and sequence analysis workflows.
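Merging and intersecting intervals are simple operations at heart. The sketch below shows the core logic on plain `(start, end)` tuples for a single chromosome, assuming 0-based, half-open coordinates as in BED files. The function names are hypothetical, and treating directly adjacent intervals as mergeable is a choice made for this example; it is not a reimplementation of BEDTools.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Return the overlap of two half-open intervals, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


# Hypothetical features on one chromosome
features = [(10, 20), (15, 30), (40, 50), (30, 35)]
print(merge_intervals(features))
print(intersect((10, 20), (15, 30)))
```

On genome-scale files with many chromosomes and strands, a dedicated tool will be far faster and handles the edge cases for you.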

deepTools

Toolkit for the analysis and visualization of high-throughput sequencing data, particularly ChIP-seq, RNA-seq, and ATAC-seq. It provides functions for normalization, coverage calculation, and generation of publication-quality heatmaps and profiles, facilitating exploration of genomic signal patterns.

blast+

Suite of tools for comparing nucleotide or protein sequences against sequence databases. It enables rapid identification of homologous sequences, functional annotation, and evolutionary analysis, making it a cornerstone of bioinformatics research and genomic studies.

bowtie2

Fast and memory-efficient aligner for mapping sequencing reads to reference genomes. It supports gapped, local, and paired-end alignments, making it widely used in RNA-seq, ChIP-seq, and other high-throughput sequencing analyses.

featureCounts

High-performance tool for counting reads mapped to genomic features such as genes, exons, or transcripts. It efficiently processes large RNA-seq datasets and provides accurate read assignments, making it a key step in gene expression analysis workflows.

SeqKit

Fast and versatile toolkit for manipulating and analyzing FASTA and FASTQ sequence files. It provides a wide range of functions, including filtering, sorting, sampling, and statistics calculation, making it a convenient utility for routine bioinformatics workflows.

bioconvert

Versatile tool for converting between a wide range of bioinformatics file formats, including sequence, alignment, and annotation files. It streamlines data interoperability, making it easier to integrate different tools and pipelines in genomics and transcriptomics analyses.

Online resources

Convenient websites, databases, and other resources to facilitate your analysis and data manipulation.

Plots and visualizations

From data to Viz

Find the most appropriate graph for your data with great examples and code in R and Python.

Datawrapper

Paste your data and create stunning figures directly, with ease and speed.

GenoVi

GenoVi generates circular genome representations for complete, draft, and multiple bacterial and archaeal genomes.

Fundamentals of Data Visualization

A complete guide about what to do and not to do when creating figures. And more!

Easyfig

Create multi-genome comparison figures with an easy-to-use GUI.

Databases

Glittr

A curated, searchable collection of Git repositories containing bioinformatics training material.

Data & ML Database

A collection of 200 real-world data science and machine learning case studies across companies.

What statistical test to do?

A guide to choosing the right statistical test for your data. Part of an amazing blog about statistics.

LLM for genomics data

A very cool project for training and using LLMs with sequence data.

Bioinformatics_toolkit

A GitHub repository packed with tutorials, analysis examples, and other valuable resources.

rstats

Discover practical and insightful R tips you need to know.

sandbox.bio

Practice your Bash commands in a safe sandbox. Experiment freely—no risk of breaking anything!

SequencEnG

Explore a beautiful interactive resource to deepen your understanding of sequencing techniques because knowing where your data comes from is key.

regex101

Create, test, and fine-tune your regular expressions in an interactive environment.
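Once a pattern works in an interactive tester, it can be dropped into a script. As a small sketch, the pattern below splits a FASTA header into an identifier and an optional description using Python's `re` module. The pattern, the function name `parse_header`, and the example headers are assumptions made for illustration; real headers vary a lot between databases.

```python
import re

# ">" followed by a non-whitespace ID, then an optional free-text description
HEADER = re.compile(r"^>(?P<id>\S+)(?:\s+(?P<description>.*))?$")


def parse_header(line):
    """Return (id, description) for a FASTA header line, or None otherwise."""
    match = HEADER.match(line.strip())
    if match is None:
        return None
    return match.group("id"), match.group("description") or ""


# Hypothetical header lines
print(parse_header(">seq1 Escherichia coli K-12"))
print(parse_header(">seq2"))
print(parse_header("ACGT"))
```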