BOL: Related items

What is Data Science? — A Bioinformatics Perspective

Abhi — Mon, 16 Jun 2025 01:44:34 -0500

In today’s era of big biology, we’re generating more data than ever before—genomes, transcriptomes, proteomes, metabolomes, microbiomes… you name it. But raw biological data doesn’t speak for itself. Making sense of it requires more than traditional biology. This is where data science steps in.

So, What Is Data Science?
At its core, data science is the interdisciplinary field that extracts knowledge and insights from data using programming, statistics, and domain expertise. In bioinformatics, data science enables us to turn gigabytes of sequence data into biological meaning.

Imagine trying to understand gene regulation in cancer by analyzing thousands of RNA-seq samples, or predicting antibiotic resistance from bacterial genomes—these challenges are not solvable through wet lab experiments alone. They require data-driven thinking.

Data Science Meets Bioinformatics
Bioinformatics is inherently a data science domain. From genomics to systems biology, every field in modern biology relies on data science techniques to:

Clean and process massive datasets

Discover patterns in high-dimensional data

Build predictive models (e.g., for disease classification)

Visualize complex biological networks and trends

Integrate diverse data types (e.g., transcriptomic + epigenomic data)

The Bioinformatics Toolkit
Here’s what data science typically looks like in bioinformatics:

Task                                    | Data Science Role
Sequence alignment                      | Efficient algorithms, indexing, parallel processing
Gene expression analysis                | Statistical modeling (e.g., DESeq2, limma)
Variant calling                         | Data filtering, probabilistic models
Clustering of cells in single-cell data | Unsupervised learning
Protein structure prediction            | Deep learning models (e.g., AlphaFold)
Metagenomics                            | Data integration, classification, dimensionality reduction

Common tools include Python, R, Bioconductor, scikit-learn, Pandas, Seurat, and TensorFlow—often working together in reproducible workflows.
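
To make the gene expression row in the table concrete, here is a minimal Python sketch of the simplest quantity such an analysis reports: a log2 fold change between two conditions. This is only an illustration. The gene names, the read counts, and the function log2_fold_change are made up for this example, and tools such as DESeq2 and limma go much further by normalizing the data and fitting statistical models before they report fold changes and significance.

```python
import math


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2 of (mean treated / mean control).

    A pseudocount is added to both means so that genes with zero
    counts do not cause a division by zero or log(0).
    """
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Hypothetical read counts per gene, three replicates per condition.
toy_counts = {
    "geneA": {"control": [10, 12, 11], "treated": [40, 44, 48]},
    "geneB": {"control": [3, 3, 3], "treated": [7, 7, 7]},
    "geneC": {"control": [20, 20, 20], "treated": [20, 20, 20]},
}

for gene, groups in toy_counts.items():
    lfc = log2_fold_change(groups["control"], groups["treated"])
    print(f"{gene}\t{lfc:+.2f}")
```

A positive value means higher expression in the treated group, a negative value means lower expression, and zero means no change. Deciding whether a change is real rather than noise is the part that needs the statistical modeling named in the table.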

It's Not Just About Coding
A common misconception is that bioinformatics is just programming or scripting. But being a data scientist in bioinformatics also means:

Understanding experimental design

Asking biologically meaningful questions

Choosing the right statistical or machine learning models

Communicating findings effectively (e.g., plots, dashboards, papers)

In other words, data science in bioinformatics is where biology, statistics, and computer science converge.

Why It Matters
The real power of data science in bioinformatics is its ability to scale discovery.

Instead of studying one gene, we can study thousands.

Instead of analyzing one species, we can explore entire ecosystems.

Instead of waiting months for lab results, we can generate hypotheses in days.

From personalized medicine and cancer diagnostics to agricultural genomics and pandemic surveillance, data science is at the heart of the bioinformatics revolution.

Final Thoughts
If you’re a biologist who’s curious about code, or a data enthusiast fascinated by life sciences, bioinformatics is your playground—and data science is your toolkit.

In bioinformatics, data science isn’t just useful. It’s essential.

The new coronavirus variant has 23 mutations in all, an unusually large number

Shruti Paniwala — Wed, 23 Dec 2020 03:50:50 -0600

The new SARS-CoV-2 variant, B.1.1.7, which was first seen in the third week of September in Kent and Greater London, has since spread to other locations in the UK. According to the COVID-19 Genomics UK Consortium (COG-UK Consortium), which analysed the genome data of the virus and identified the variant, it has been spreading "rapidly" over the last four weeks, suggesting further spread in the region.

According to a preliminary report posted on December 19 by COG-UK Consortium scientists, 1,623 genomes of the variant had been sequenced as of December 15. In a December 21 tweet, the COG-UK Consortium said that it had added 2,963 more SARS-CoV-2 genome sequences, of which 942 (32%) belong to the new variant. The Consortium intends to sequence 20,000 more SARS-CoV-2 genomes in the next two weeks to further ascertain the spread of the variant.
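
The 32% figure can be checked with a couple of lines of Python. This is just the arithmetic on the two numbers quoted from the tweet; the function name variant_share is made up for this example.

```python
def variant_share(variant_count, total_count):
    """Return the percentage of sequenced genomes that belong to the variant."""
    return 100.0 * variant_count / total_count


# Numbers quoted from the December 21 COG-UK tweet.
share = variant_share(942, 2963)
print(f"{share:.1f}% of the newly added genomes belong to the variant")
```

The share applies only to the genomes added in that batch. It is not an estimate of how common the variant is among all infections.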

There is no clear evidence, at least not yet, that the variant causes more severe disease. But there is reason to take the possibility seriously. Another coronavirus lineage in South Africa has acquired one specific mutation that is also present in B.1.1.7. This variant is spreading quickly across South Africa's coastal regions. In preliminary research, doctors have observed that individuals infected with this variant carry a higher viral load, that is, a higher concentration of the virus in their upper respiratory tract. In many viral diseases, this is associated with more severe symptoms.

Sequence Viewer: Download Transcripts, Exons and Proteins

Mon, 15 Sep 2014 17:30:36 -0500

How to download FASTA sequences for specific gene features from within NCBI's Sequence Viewer. Sequence Viewer homepage: www.ncbi.nlm.nih.gov/projects/sviewer/ Sequence Viewer playlist: https://www.youtube.com/playlist?list=PL76D7EE6A6A8AC1C3

Platanus

Jit — Fri, 13 May 2016 05:12:40 -0500

Platanus is a novel de novo sequence assembler that can reconstruct genomic sequences of highly heterozygous diploids from massively parallel shotgun sequencing data.

The latest version is 1.2.4.

To cite Platanus, please use the following:

Kajitani R, Toshimoto K, Noguchi H, Toyoda A, Ogura Y, Okuno M, Yabana M, Harada M, Nagayasu E, Maruyama H, Kohara Y, Fujiyama A, Hayashi T, Itoh T, “Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads”. Genome Res. 2014 Aug;24(8):1384-95. doi: 10.1101/gr.170720.113.

Address of the bookmark: http://platanus.bio.titech.ac.jp/

Blobsplorer

Jit — Tue, 14 Jun 2016 10:28:58 -0500

Blobsplorer is a tool for interactive visualization of assembled DNA sequence data ("contigs") derived from (often unintentionally) mixed-species pools. It allows the simultaneous display of GC content, coverage, and taxonomic annotation for collections of contigs with a view to separating out those belonging to different taxa.

Blobsplorer is unlikely to be of use on its own as it requires contig data to be supplied in a format that involves considerable preprocessing (see below for a description). The easiest way to use Blobsplorer is as part of a workflow using scripts from here.
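
To give a feel for the per-contig values such a plot is built from, here is a minimal Python sketch that computes GC content for a few contigs and lists it next to a coverage value. It is not Blobsplorer's own code or input format. The contig names, sequences, and coverage numbers are invented for this example, and in practice coverage would come from mapping reads back to the assembly.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the unambiguous bases A, C, G, T.

    Ambiguous bases such as N are ignored. An empty or all-N sequence
    returns 0.0.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Hypothetical contigs: (name, sequence, mean read coverage).
toy_contigs = [
    ("contig_1", "ATGCGCGCTA", 35.2),
    ("contig_2", "ATATATTTAA", 4.1),
    ("contig_3", "GGCCNNGGCC", 33.8),
]

print("name\tgc\tcoverage")
for name, seq, coverage in toy_contigs:
    print(f"{name}\t{gc_content(seq):.2f}\t{coverage}")
```

Contigs from the same organism tend to have similar GC content and coverage, which is why plotting the two together helps to separate taxa in a mixed-species pool.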

Address of the bookmark: http://nematodes.org/martin/blobsplorer/blobsplorer.html

EAGER

Jit — Sat, 10 Dec 2016 18:07:23 -0600

The automated reconstruction of genome sequences in ancient genome analysis is a multifaceted process.

EAGER encompasses both state-of-the-art tools for each step as well as new complementary tools tailored for ancient DNA data within a single integrated solution in an easily accessible format.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0918-z

Address of the bookmark: https://github.com/apeltzer/EAGER-GUI

NovelSeq: Novel Sequence Insertion Detection

Neel — Fri, 09 Jun 2017 04:31:30 -0500

The NovelSeq framework is designed to detect novel sequence insertions using high throughput paired-end whole genome sequencing data.

http://novelseq.sourceforge.net/Home

Paper at https://www.ncbi.nlm.nih.gov/pubmed/20385726

Address of the bookmark: http://novelseq.sourceforge.net/Home

GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads

Jit — Mon, 14 May 2018 05:25:48 -0500

This software is provided "as is" without warranty of any kind. In no event shall the author be held responsible for any damage resulting from the use of this software. The program package, including source codes, executables, and this documentation, is distributed free of charge. If you use this program in a publication, please cite the following reference:
Chong Chu, Xin Li, and Yufeng Wu. "GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads." bioRxiv (2017): 125534.

Address of the bookmark: https://github.com/Reedwarbler/GAPPadder

Sequence Tube Maps: displays multiple genomic sequences in the form of a tube map

Jit — Wed, 11 Mar 2020 01:12:06 -0500

A JavaScript module for the visualization of genomic sequence graphs. It automatically generates a "tube map"-like visualization of sequence graphs that have been created with vg (https://github.com/vgteam/vg).

Link to working demo: https://vgteam.github.io/sequenceTubeMap/

Address of the bookmark: https://github.com/vgteam/sequenceTubeMap

SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files

Rahul Nayak — Tue, 01 Mar 2022 03:13:33 -0600

A general-purpose program to manipulate and parse information from FASTA/FASTQ files, supporting gzipped input files. Includes functions to interleave and de-interleave FASTQ files, to rename sequences and to count and print statistics on sequence lengths. SeqFu is available for Linux and macOS.

A compiled program delivering high-performance analyses
Supports FASTA/FASTQ files, including gzip-compressed ones
A growing collection of handy utilities, including tools for quick inspection of datasets

Can be easily installed via conda:

conda install -c bioconda seqfu
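
For readers who want to see what "statistics on sequence lengths" involves, here is a minimal Python sketch that parses FASTA text and reports the count, total length, shortest, longest, and N50. It is not SeqFu's implementation and does not use SeqFu's command-line interface. The functions read_fasta and length_stats and the toy records are assumptions made up for this example.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_stats(lengths):
    """Return count, total, min, max and N50 for a list of sequence lengths.

    N50 is the length of the sequence at which the running total of
    lengths, sorted from longest to shortest, first reaches half of
    the total.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(ordered),
        "total": total,
        "min": ordered[-1] if ordered else 0,
        "max": ordered[0] if ordered else 0,
        "n50": n50,
    }


# Hypothetical FASTA content; seq2 is wrapped over two lines.
toy_fasta = io.StringIO(">seq1 first\nACGTACGT\n>seq2\nACG\nT\n>seq3\nAC\n")
records = list(read_fasta(toy_fasta))
stats = length_stats([len(seq) for _, seq in records])
print(stats)
```

The parser accepts any iterable of text lines, so a gzipped file opened with gzip.open(path, "rt") from the standard library would work the same way as the in-memory example.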

Address of the bookmark: https://telatin.github.io/seqfu2/