BOL: Related items

3D Genome Browser: explore chromatin interaction data, such as Hi-C, ChIA-PET, Capture Hi-C, PLAC-Seq, and more

BioStar — Fri, 22 Jan 2021 20:19:32 -0600

Beside visualizing chromatin interaction data, you can also seamlessly browse other omics data such as ChIP-Seq or RNA-Seq for the same genomic region, and gain a complete view of both regulatory landscape and 3D genome structure for any given gene. You can also check the expression of any queried gene across hundreds of tissue/cell types measured by the ENCODE consortium. Finally, please check out the virtual 4C page, where we provide multiple methods to link distal cis-regulatory elements with their potential target genes, including virtual 4C, ChIA-PET and cross-cell-type correlation of proximal and distal DHSs.

Address of the bookmark: http://3dgenome.fsm.northwestern.edu/index.html

Libraries or management tools for high throughput sequencing data

LEGE — Fri, 04 Oct 2024 02:45:06 -0500

GATB Library. The Genome Analysis Toolbox with de-Bruijn graph. A large part of tools developed by the GenScale team are based on this library.
These methods enable the analysis of data sets of any size on multi-core desktop computers, including very huge amount of reads data coming from any kind of organisms such as bacteria, plants, animals and even complex samples (e.g. metagenomes). Among them are (the full is available here: https://gatb.inria.fr/software/):
LRez: C++ Library and toolkit for the barcode-based management and indexation of linked-read datasets.

Variant calling and/or genotyping

DiscoSNP++ and discoSnpRAD: Reference-free small variant discovery (SNPs and indels)
MindTheGap: Detection and assembly of large insertion variants
TakeABreak: reference-free inversion discovery tool
SVJedi: Structural Variant genotyper with long read data
SVJedi-graph: Structural Variant genotyper with long read data using a variation graph

Sequence assembly

MinYS: reference-guided genome assembly in metagenomics data
MTG-link: local assembly tool for linked-read data
Minia: De novo short read assembler
de-novo pipeline: de-novo assembly pipeline (error correction / contigs / scaffolding) for genomes and meta-genomes
Mapsembler2: Targeted assembly (not maintained)

Managing k-mers & indexation

findere: simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure.
- fimpera extends findere adding the abundance information.
kmtricks: modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.
kmindex is a tool for indexing and querying sequencing samples. It is built on top of kmtricks.
back to sequences: Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
Backpack Quotient Filter: k-mer indexing data structure with abundance
short read connector: Detect similar reads from potentially large read set
DSK: Count K-mer in sequences

Pangenome graph manipulation

Pancat: Pangenome Comparison and Analysis Toolkit
GFAGraphs: a Python library to handle pangenome graph files in GFA format.

Comparative metagenomics with k-mers

Simka and SimkaMin: Comparative metagenomics for large-scale datasets
Comparead & Commet: comparison of metagenomic datasets

Species and bacterial strains identification

ORI: software using long nanopore reads to identify bacteria present in a sample at the strain level
StrainFLAIR: STRAIN-level proFiLing using vArIation gRaph

General-purpose sequencing data manipulation

GASSST: long read mapper
Leon: short read compressor (now included in GATB-core)
Bloocoo: short read corrector
BCALM: Construct compacted de Bruijn graphs (unitigs)

Protein Structure

A_Purva: Contact Map Overlap solver
MD-Jeep: Distance Geometry solver
CSA: Comparative Structural Alignment

Workflow

SLICEE: parallel execution of bioinformatics workflows

Comparative Genomics

CASSIS: detection of rearrangement breakpoints
PLAST: intensive bank-to-bank sequence comparison
DRJBreakpointFinder: detection and precise localization of excision sites in proviral segments

Five key traits to seek out in potential bioinformatics candidates !!!

Jit — Mon, 10 Aug 2015 12:53:50 -0500

Genomics and proteomics data are being collected in bulk, but mostly, traditional biologist don’t know what to do with it. Perhaps this is the reason why (not only this!!! ) computational biologist/bioinformatics scientists are hot commodities in the research world.

In fact, there are huge demands for expert biological data analyst. It’s a fairly new (not exactly) hot area, these bioinformatician are invaluable because they know and understand the significance of biological data for your research and how you can use it for better understanding of biological problems.

The bioinformatics can discover biological patterns and stories in genomic and proteomics data. They can develop the pipeline needed to properly collect, store and analyse it.

Once your research group is ready to make a larger investment and hire a bioinformatician to gain a competitive edge, there are several key traits to seek out in potential candidates. The best bioinformatician are:

1. Highly Skilled - programming skills, experience with the biological software and tools.

The biological data won’t illuminate much if the scientist analysing it doesn’t possess practical programming skills, experience with the biological software and tools and a thorough understanding of basic biological stuff. A solid background in mathematics and statistics is also an indispensable trait.

2. Insight - Real vision, robust understanding and deep insight.

In order to hire the best bioinformatics and computational biologist scientist for your needs, it is always recommended and mostly practiced by the recruiters, to ask each contender to write and develop a sample script/presentation based on a specific set of data you provide. Then, explore the approaches used to deal with data provided and pick up those candidates who convey real vision, robust understanding and deep insight.

3. Energetic – Curiosity to explore

Mostly natural curiosity and enthusiasm for solving big biological problems coupled with an ability to transform data into a scientific stories may place one candidate above the rest. In addition to achieve that, the bioinformatician should be agile enough to quickly modify their methods to suit changes within a particular research.

4. Researcher – Publications

Look for someone who has a keen sense and understanding of concern biological problems. You can judge it by looking at previously published papers and data. It is always recommended to have a look at GitHub and other repository for codes written by her/him.

5. Impressive communicator - Insight that can’t be expressed is worthless.

Good bioinformatics scientists are able to uncover biological patterns and are willing to explain those patterns in clear and helpful ways through thoughtful and open communication. In other words, they should must have good scientific writing skills. A computational biologis/bioinformatician should know how to present the data and tell a scientific story through numbers/images.

What is Data Science? — A Bioinformatics Perspective

Abhi — Mon, 16 Jun 2025 01:44:34 -0500

In today’s era of big biology, we’re generating more data than ever before—genomes, transcriptomes, proteomes, metabolomes, microbiomes… you name it. But raw biological data doesn’t speak for itself. Making sense of it requires more than traditional biology. This is where data science steps in.

So, What Is Data Science?
At its core, data science is the interdisciplinary field that extracts knowledge and insights from data using programming, statistics, and domain expertise. In bioinformatics, data science enables us to turn gigabytes of sequence data into biological meaning.

Imagine trying to understand gene regulation in cancer by analyzing thousands of RNA-seq samples, or predicting antibiotic resistance from bacterial genomes—these challenges are not solvable through wet lab experiments alone. They require data-driven thinking.

Data Science Meets Bioinformatics
Bioinformatics is inherently a data science domain. From genomics to systems biology, every field in modern biology relies on data science techniques to:

Clean and process massive datasets

Discover patterns in high-dimensional data

Build predictive models (e.g., for disease classification)

Visualize complex biological networks and trends

Integrate diverse data types (e.g., transcriptomic + epigenomic data)

The Bioinformatics Toolkit
Here’s what data science typically looks like in bioinformatics:

Task Data Science Role
Sequence alignment Efficient algorithms, indexing, parallel processing
Gene expression analysis Statistical modeling (e.g., DESeq2, limma)
Variant calling Data filtering, probabilistic models
Clustering of cells in single-cell data Unsupervised learning
Protein structure prediction Deep learning models (e.g., AlphaFold)
Metagenomics Data integration, classification, dimensionality reduction

Common tools include Python, R, Bioconductor, scikit-learn, Pandas, Seurat, and TensorFlow—often working together in reproducible workflows.

It's Not Just About Coding
A common misconception is that bioinformatics is just programming or scripting. But being a data scientist in bioinformatics also means:

Understanding experimental design

Asking biologically meaningful questions

Choosing the right statistical or machine learning models

Communicating findings effectively (e.g., plots, dashboards, papers)

In other words, data science in bioinformatics is where biology, statistics, and computer science converge.

Why It Matters
The real power of data science in bioinformatics is its ability to scale discovery.

Instead of studying one gene, we can study thousands.

Instead of analyzing one species, we can explore entire ecosystems.

Instead of waiting months for lab results, we can generate hypotheses in days.

From personalized medicine and cancer diagnostics to agricultural genomics and pandemic surveillance, data science is at the heart of the bioinformatics revolution.

Final Thoughts
If you’re a biologist who’s curious about code, or a data enthusiast fascinated by life sciences, bioinformatics is your playground—and data science is your toolkit.

In bioinformatics, data science isn’t just useful. It’s essential.

MGRA: Breakpoint graphs and ancestral genome reconstructions

Jit — Tue, 25 Jul 2017 08:48:25 -0500

MGRA (Multiple Genome Rearrangements and Ancestors) is a tool for reconstruction of ancestor genomes and evolutionary history of extant genomes.

It takes as an input a set of genomes represented as sequences of genes (or synteny blocks) and produces such sequences for ancestral genomes at the internal nodes of the phylogenetic tree.

The phylogenetic tree may be also specified completely or partially, in the latter case MGRA can reconstruct conserved ancestral regions (CARs) of the ancestral genome of interest.

Since version 2 MGRA supports gene insertion and deletions in addition to genome rearrangements and allows the input genomes to have different gene content.

It also can reconstruct most plausible phylogenetic tree based on the rearrangement characters.

Address of the bookmark: http://mgra.cblab.org/

Genomicus: genome browser that enables users to navigate in genomes in several dimensions

Jit — Sat, 18 Nov 2017 16:10:16 -0600

Genomicus is a genome browser that enables users to navigate in genomes in several dimensions: linearly along chromosome axes, transversaly across different species, and chronologicaly along evolutionary time.

Once a query gene has been entered, it is displayed in its genomic context in parallel to the genomic context of all its orthologous and paralogous copies in all the other sequenced metazoan genomes. Moreover, Genomicus stores and displays the predicted ancestral genome structure in all the ancestral species within the phylogenetic range of interest.

All the data on extant species displayed in this browser are from Ensembl.

Address of the bookmark: http://genomicus.biologie.ens.fr/genomicus-90.01/cgi-bin/search.pl

Bandage: interactive visualization of de novo genome assemblies

Shruti Paniwala — Mon, 04 Dec 2017 10:09:37 -0600

Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone.

Availability and implementation: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage.

Address of the bookmark: http://rrwick.github.io/Bandage/

Tools for bacterial whole genome annotation

Radha Agarkar — Sat, 16 Dec 2017 17:37:47 -0600

RAST – Web tool (upload contigs), uses the subsystems in the SEED database and provides detailed annotation and pathway analysis. Takes several hours per genome but I think this is the best way to get a high quality annotation (if you have only a few genomes to annotate).

Prokka – Standalone command line tool, takes just a few minutes per genome. This is the best way to get good quality annotation in a flash, which is particularly useful if you have loads of genomes or need to annotate a pangenome or metagenome. Note however that the quality of functional information is not as good as RAST, and you will need several extra steps if you want to do functional profiling and pathway analysis of your genome(s)… which is in-built in RAST.

NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.

PGAP: NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. The first version of NCBI Prokaryotic Genome Automatic Annotation Pipeline (PGAAP; see Pubmed Article) developed in 2005 has been replaced with an upgraded version that is capable of processing a larger data volume. NCBI's annotation pipeline depends on several internal databases and is not currently available for download or use outside of the NCBI environment.

BEACON (automated tool for Bacterial GEnome Annotation ComparisON), a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/.

BlastKOLA: Assigns K numbers to the user's sequence data by BLAST searches, respectively, against a nonredundant set of KEGG GENES. KOALA (KEGG Orthology And Links Annotation) is KEGG's internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation. Annotate Sequence in KEGG Mapper and Pathogen Checker in KEGG Pathogen are special interfaces to this server and can be executed in an interactive mode. BlastKOALA is suitable for annotating fully sequenced genomes.

PAGIT: Provides a toolkit for improving the quality of genome assemblies created via an assembly software. PAGIT compiled four tools: (i) ABACAS which classifies and orientates contigs and estimates the sizes of gaps between them; (ii) IMAGE uses paired-end reads to extend contigs and close gaps within the scaffolds; (iii) ICORN for identifying and correcting small errors in consensus sequences and; (iv) RATT for help annotation. The software was mainly created to analyze parasite genomes of up to about 300 Mb.

MAKER: A portable and easily configurable genome annotation pipeline. MAKER allows smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. It identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions and automatically synthesizes these data into gene annotations having evidence-based quality values. MAKER's inputs are minimal and its ouputs can be directly loaded into a Generic Model Organism Database (GMOD). They can also be viewed in the Apollo genome browser; this feature of MAKER provides an easy means to annotate, view and edit individual contigs and BACs without the overhead of a database. MAKER is available for download and can be tested online via the MAKER Web Annotation Service (MWAS).

MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease.

AliTV—interactive visualization of whole genome comparisons

Jit — Wed, 10 Jan 2018 07:08:17 -0600

AliTV, which provides interactive visualization of whole genome alignments. AliTV reads multiple whole genome alignments or automatically generates alignments from the provided data. Optional feature annotations and phylo- genetic information are supported. The user-friendly, web-browser based and highly customizable interface allows rapid exploration and manipulation of the visualized data as well as the export of publication-ready high-quality figures. AliTV is freely available at https://github.com/AliTVTeam/AliTV

https://alitvteam.github.io/AliTV/

Address of the bookmark: https://github.com/AliTVTeam/AliTV

ARCS: scaffolding genome drafts with linked reads

Rahul Nayak — Tue, 06 Mar 2018 16:35:26 -0600

ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H.sapiensgenome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts.

Address of the bookmark: https://github.com/bcgsc/ARCS/