BOL: Related items

Comparative Genomics Data Set Including 240 Mammals Released !

Jit — Thu, 19 Nov 2020 06:45:39 -0600

A large international consortium sequenced the genomes of 130 mammals and analyzed the data together with 110 existing genomes, allowing scientists to identify which positions in the DNA are important. The report, published in Nature today, will help advance research on human disease mutations and inform how best to protect endangered species.

Alongside the human genome, these genomes, sampled widely across mammals, can be used to study how particular species have adapted to different conditions. Some otters, for example, have a thick, water-resistant coat, and some rodents, but not all, have adapted to hibernate. Understanding these animal traits can help us understand human traits and conditions, such as metabolic diseases.

With climate change and human activity threatening more animal ecosystems, protecting endangered species is becoming increasingly important. Scientists have historically studied several individuals from various populations of a species to understand the genetic variation within that species. This is important for understanding how particular species can be protected. In this study, animals on the International Union for Conservation of Nature (IUCN) Red List of Threatened Species had fewer differences within their genomes, which is consistent with their endangered status.
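One common way to summarize "differences within a genome" is heterozygosity: the fraction of sites where an individual's two allele copies differ. The sketch below illustrates that idea only. Whether the study measured variation exactly this way is an assumption, and the genotypes are hypothetical toy values, not Zoonomia data.

```python
def heterozygosity(genotypes):
    """Fraction of called sites where the two alleles differ.

    `genotypes` is a list of (allele_a, allele_b) tuples for one diploid
    individual; sites with a missing allele (None) are skipped.
    """
    called = [(a, b) for a, b in genotypes if a is not None and b is not None]
    if not called:
        raise ValueError("no called genotypes")
    het_sites = sum(1 for a, b in called if a != b)
    return het_sites / len(called)


# Hypothetical toy genotypes (assumption: five called sites per individual).
common_species = [("A", "G"), ("C", "C"), ("T", "C"), ("G", "G"), ("A", "T")]
rare_species = [("A", "A"), ("C", "C"), ("T", "T"), ("G", "G"), ("A", "T")]

print(heterozygosity(common_species))
print(heterozygosity(rare_species))
```

A lower value for an individual suggests less variation between its two genome copies, which is the pattern the article describes for endangered species.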

Reference: A comparative genomics multitool for scientific discovery and conservation, https://www.nature.com/articles/s41586-020-2876-6

Data at http://zoonomiaproject.org/

AMR Database !

LEGE — Tue, 04 Jun 2024 13:37:21 -0500

ARG-ANNOT. PMID: 24145532
CARD. PMID: 23650175
MEGARes. PMID: 27899569
NCBI BioProject: PRJNA313047
PlasmidFinder. PMID: 24777092
ResFinder. PMID: 22782487
VFDB. PMID: 26578559
SRST2's version of ARG-ANNOT. PMID: 25422674
VirulenceFinder. PMID: 24574290

Address of the bookmark: https://github.com/sanger-pathogens/ariba/wiki/Task%3A-getref

What is Data Science? — A Bioinformatics Perspective

Abhi — Mon, 16 Jun 2025 01:44:34 -0500

In today’s era of big biology, we’re generating more data than ever before—genomes, transcriptomes, proteomes, metabolomes, microbiomes… you name it. But raw biological data doesn’t speak for itself. Making sense of it requires more than traditional biology. This is where data science steps in.

So, What Is Data Science?
At its core, data science is the interdisciplinary field that extracts knowledge and insights from data using programming, statistics, and domain expertise. In bioinformatics, data science enables us to turn gigabytes of sequence data into biological meaning.

Imagine trying to understand gene regulation in cancer by analyzing thousands of RNA-seq samples, or predicting antibiotic resistance from bacterial genomes—these challenges are not solvable through wet lab experiments alone. They require data-driven thinking.

Data Science Meets Bioinformatics
Bioinformatics is inherently a data science domain. From genomics to systems biology, every field in modern biology relies on data science techniques to:

- Clean and process massive datasets
- Discover patterns in high-dimensional data
- Build predictive models (e.g., for disease classification)
- Visualize complex biological networks and trends
- Integrate diverse data types (e.g., transcriptomic + epigenomic data)

The Bioinformatics Toolkit
Here’s what data science typically looks like in bioinformatics:

Task                                    | Data Science Role
Sequence alignment                      | Efficient algorithms, indexing, parallel processing
Gene expression analysis                | Statistical modeling (e.g., DESeq2, limma)
Variant calling                         | Data filtering, probabilistic models
Clustering of cells in single-cell data | Unsupervised learning
Protein structure prediction            | Deep learning models (e.g., AlphaFold)
Metagenomics                            | Data integration, classification, dimensionality reduction

Common tools include Python, R, Bioconductor, scikit-learn, pandas, Seurat, and TensorFlow, often working together in reproducible workflows.
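To make the "unsupervised learning" row concrete, here is a minimal k-means sketch in plain Python. It is an illustration only: real single-cell analyses would normally use a library such as scikit-learn or Seurat, and the "cells" below are hypothetical two-gene expression values invented for the example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's k-means on tuples of floats, starting from the given centroids.

    Returns the labels from the final assignment step and the final centroids.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical expression of two genes in six cells (toy data).
cells = [(1.0, 0.9), (0.8, 1.1), (1.2, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(cells, centroids=[cells[0], cells[3]])
print(labels)
print(centroids)
```

Starting centroids are passed in explicitly so the result is deterministic; library implementations usually choose them with a randomized strategy instead.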

It's Not Just About Coding
A common misconception is that bioinformatics is just programming or scripting. But being a data scientist in bioinformatics also means:

- Understanding experimental design
- Asking biologically meaningful questions
- Choosing the right statistical or machine learning models
- Communicating findings effectively (e.g., plots, dashboards, papers)

In other words, data science in bioinformatics is where biology, statistics, and computer science converge.

Why It Matters
The real power of data science in bioinformatics is its ability to scale discovery.

Instead of studying one gene, we can study thousands.

Instead of analyzing one species, we can explore entire ecosystems.

Instead of waiting months for lab results, we can generate hypotheses in days.

From personalized medicine and cancer diagnostics to agricultural genomics and pandemic surveillance, data science is at the heart of the bioinformatics revolution.

Final Thoughts
If you’re a biologist who’s curious about code, or a data enthusiast fascinated by life sciences, bioinformatics is your playground—and data science is your toolkit.

In bioinformatics, data science isn’t just useful. It’s essential.

minimap2: A versatile pairwise aligner for genomic and spliced nucleotide sequences

Jit — Wed, 20 Jun 2018 07:55:29 -0500

git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long sequences against a reference genome
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -d MT-human.mmi test/MT-human.fa
./minimap2 -a MT-human.mmi test/MT-orang.fa > test.sam
# use presets (no test data)
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam       # PacBio genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam         # Oxford Nanopore genomic reads
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam      # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam       # spliced long reads
./minimap2 -ax splice -k14 -uf ref.fa reads.fa > aln.sam  # Nanopore Direct RNA-seq
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf             # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf     # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf    # Nanopore read overlap
# man page for detailed command line options
man ./minimap2.1
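Several of the commands above write PAF rather than SAM. As a rough sketch of how that output could be read downstream, the Python below parses the 12 mandatory tab-separated PAF columns and computes matches divided by alignment block length. The helper names (`PafRecord`, `parse_paf_line`, `identity`) and the example line are hypothetical, not part of minimap2.

```python
from collections import namedtuple

# The 12 mandatory PAF columns, in order.
PafRecord = namedtuple(
    "PafRecord",
    "qname qlen qstart qend strand tname tlen tstart tend nmatch alen mapq",
)


def parse_paf_line(line):
    """Parse the mandatory columns of one PAF line; optional tags are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("PAF line has fewer than 12 columns")
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
    )


def identity(record):
    """Matching bases divided by alignment block length (0.0 if empty)."""
    return record.nmatch / record.alen if record.alen else 0.0


# Made-up example line for illustration, not real minimap2 output.
example_line = "read1\t1000\t10\t990\t+\tMT-human\t16569\t200\t1185\t950\t1000\t60\ttp:A:P"
record = parse_paf_line(example_line)
print(record.qname, record.tname, record.strand, identity(record))
```

Note that without base-level alignment (`-c`), the match count in PAF is an approximation, so this ratio should be read as a rough filter rather than an exact identity.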

Address of the bookmark: https://github.com/lh3/minimap2

LoRDEC: a hybrid error correction program for long, PacBio reads

Jit — Mon, 10 Apr 2017 04:16:09 -0500

LoRDEC is a program that corrects sequencing errors in long reads from third-generation sequencing with a high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.

Usually, errors in PacBio reads include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of each PacBio read is usable for detecting regions of similarity with other sequences, for aligning the reads to the contigs of an assembly, etc.

Why is LoRDEC different?

It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on a standard computing server, and even works on desktop/laptop computers.
It adopts a novel graph-based approach: it builds a succinct de Bruijn graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
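The graph-based idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not LoRDEC's implementation: it stores k-mers in a plain set rather than a succinct structure, ignores reverse complements, and returns the shortest path between two anchor k-mers instead of choosing the path closest to the erroneous long-read region. The function names and reads are hypothetical.

```python
from collections import Counter, deque


def solid_kmers(reads, k, min_count=2):
    """k-mers seen at least `min_count` times in the short reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, count in counts.items() if count >= min_count}


def bridge(kmers, source, target, max_len):
    """Breadth-first search over the implicit de Bruijn graph.

    Successors of a k-mer are kmer[1:] + base for every base that yields a
    k-mer in `kmers`. Returns the sequence spelled by the shortest path from
    `source` to `target`, or None if no path of at most `max_len` bases exists.
    """
    if source not in kmers or target not in kmers:
        return None
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        node, spelled = queue.popleft()
        if node == target:
            return spelled
        if len(spelled) >= max_len:
            continue  # do not extend paths beyond the length limit
        for base in "ACGT":
            nxt = node[1:] + base
            if nxt in kmers and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, spelled + base))
    return None


# Hypothetical short reads covering the true sequence ACGTTGCATGGA twice.
short_reads = ["ACGTTGCA", "GTTGCATG", "TGCATGGA"] * 2
graph = solid_kmers(short_reads, k=4)

# Anchors: k-mers flanking an erroneous stretch of a long read.
corrected = bridge(graph, source="ACGT", target="TGGA", max_len=20)
print(corrected)
```

The `min_count` threshold plays the role of filtering out k-mers that likely contain short-read errors; its value here is an arbitrary choice for the toy data.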

Address of the bookmark: http://www.atgc-montpellier.fr/lordec/

proovread : large-scale high-accuracy PacBio correction through iterative short read consensus

Jit — Fri, 05 Jan 2018 04:12:20 -0600

proovread:
- outperforms PacBioToCA/LSC in terms of accuracy and contiguity/sensitivity (http://dx.doi.org/10.1093/bioinformatics/btu392)
- is easy to install, run, and configure
- supports various types of data:
  - HiSeq/MiSeq (100-500 bp)
  - Unitigs
  - 454, ...

proovread maps high-coverage data to PacBio reads (using bwa mem, blasr, or daligner) in multiple iterations.

Address of the bookmark: https://github.com/BioInf-Wuerzburg/proovread

minialign: fast and accurate alignment tool for PacBio and Nanopore long reads

Jit — Thu, 24 May 2018 08:33:26 -0500

Minialign is a little bit fast and moderately accurate nucleotide sequence alignment tool designed for PacBio and Nanopore long reads. It is built on three key algorithms: the minimizer-based index of the minimap overlapper, array-based seed chaining, and SIMD-parallel Smith-Waterman-Gotoh extension.
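As a rough illustration of the first of these ideas, the Python sketch below selects (w, k)-minimizers: in every window of w consecutive k-mers, the smallest k-mer is kept as a seed. It assumes plain lexicographic ordering for simplicity, whereas real tools typically order k-mers by a hash, and the function name is hypothetical rather than taken from minialign or minimap.

```python
def minimizers(seq, k, w):
    """Return sorted (position, kmer) pairs of the (w, k)-minimizers of `seq`.

    For each window of `w` consecutive k-mers, the lexicographically smallest
    k-mer is selected; ties are broken by taking the leftmost occurrence.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


print(minimizers("ACGTAC", k=2, w=3))

# A read copied from a reference selects the same seeds at shifted positions.
reference = "TTGACCGTAGGCATCA"
offset = 3
read = reference[offset:12]
shifted = {(pos + offset, kmer) for pos, kmer in minimizers(read, k=3, w=4)}
print(shifted <= set(minimizers(reference, k=3, w=4)))
```

Because only a subset of k-mers is indexed, the index is much smaller than one holding every k-mer, while exact matches of at least w + k - 1 bases are still guaranteed to share a seed.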

Address of the bookmark: https://github.com/ocxtal/minialign

LoRMA: A tool for correcting sequencing errors in long reads

Abhimanyu Singh — Thu, 06 Sep 2018 16:21:01 -0500

An error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher.

conda install -c atgc-montpellier lorma

Address of the bookmark: https://gite.lirmm.fr/lorma/lorma-releases/wikis/home

Pacasus: Correction of palindromes in long reads from PacBio and Nanopore

BioStar — Mon, 12 Nov 2018 05:26:48 -0600

Tool for detecting and cleaning PacBio / Nanopore long reads after whole genome amplification. Check the poster from the Revolutionizing Next-Generation Sequencing (2nd edition) conference in the source folder: https://github.com/swarris/Pacasus/blob/master/vib2017.pdf.

The preprint version is available at http://www.biorxiv.org/content/early/2017/08/09/173872

It uses the pyPaSWAS framework for sequence alignment (https://github.com/swarris/pyPaSWAS)
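One way to picture palindrome detection is to align a read against its own reverse complement: a read made of a sequence followed by its reverse complement aligns to itself almost end to end. The sketch below does this with a small Smith-Waterman implementation in plain Python. It is an assumption-laden illustration, not Pacasus or pyPaSWAS code: the scoring values, the `min_fraction` threshold, and the function names are all hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def local_alignment_score(a, b, match=2, mismatch=-3, gap=-4):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def looks_palindromic(read, min_fraction=0.5, match=2):
    """Flag a read whose self-vs-reverse-complement score is at least
    `min_fraction` of the maximum possible score (match * read length)."""
    score = local_alignment_score(read, reverse_complement(read), match=match)
    return score >= min_fraction * match * len(read)


# Toy reads: a sequence joined to its own reverse complement, and a read
# made only of A and C, which cannot match its reverse complement at all.
arm = "ACGGTCATTGCA"
chimeric_read = arm + reverse_complement(arm)
plain_read = "ACCACAACCCACAAACACCAACCA"

print(looks_palindromic(chimeric_read))
print(looks_palindromic(plain_read))
```

A correction step would then need to locate the junction and split or trim the read there, which this sketch does not attempt.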

Address of the bookmark: https://github.com/swarris/Pacasus

MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads

Abhi — Tue, 05 Sep 2023 07:31:35 -0500

MitoHiFi v3.2 is a Python pipeline distributed under the MIT License.

MitoHiFi was first developed to assemble mitogenomes for a wide range of species in the Darwin Tree of Life (DToL) project.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05385-y

Address of the bookmark: https://github.com/marcelauliano/MitoHiFi