BOL: Related items

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

Poonam Mahapatra — Mon, 26 Aug 2019 18:07:33 -0500

Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads.

Address of the bookmark: https://github.com/yjx1217/LRSDAY

Comparative Genomics Data Set Including 240 Mammals Released !

Jit — Thu, 19 Nov 2020 06:45:39 -0600

The genomes of 130 mammals were sequenced by a large international consortium, and the data were analyzed together with 110 existing genomes to allow scientists to identify the important positions in the DNA. This report, published in Nature today, will help advance research on human disease mutations and inform how best to protect endangered species.

Together with what is known about the human genome, these genomes, sampled widely across mammals, can be used to study how particular species adapt to different conditions. Some otters, for example, have thick, water-resistant fur, and some rodents, but not all, have adapted to hibernate. Studying these animal traits can help us understand human traits and conditions, such as metabolic diseases.

With climate change and more animal ecosystems being threatened by human activity, the protection of endangered species is becoming increasingly important. Scientists have historically studied several individuals from various populations of a species to understand the genetic variation within that species. This is important for understanding how particular species can be protected. In this study, animals on the Red List of Endangered Species of the International Union for Conservation of Nature had fewer differences in their genomes, which is consistent with their endangered status.

Ref @ A comparative genomics multitool for scientific discovery and conservation https://www.nature.com/articles/s41586-020-2876-6

Data at http://zoonomiaproject.org/

AMR Database !

LEGE — Tue, 04 Jun 2024 13:37:21 -0500

ARG-ANNOT. PMID: 24145532
CARD. PMID: 23650175
MEGARes. PMID: 27899569
NCBI. BioProject: PRJNA313047
PlasmidFinder. PMID: 24777092
ResFinder. PMID: 22782487
VFDB. PMID: 26578559
SRST2's version of ARG-ANNOT. PMID: 25422674
VirulenceFinder. PMID: 24574290

Address of the bookmark: https://github.com/sanger-pathogens/ariba/wiki/Task%3A-getref

What is Data Science? — A Bioinformatics Perspective

Abhi — Mon, 16 Jun 2025 01:44:34 -0500

In today’s era of big biology, we’re generating more data than ever before—genomes, transcriptomes, proteomes, metabolomes, microbiomes… you name it. But raw biological data doesn’t speak for itself. Making sense of it requires more than traditional biology. This is where data science steps in.

So, What Is Data Science?
At its core, data science is the interdisciplinary field that extracts knowledge and insights from data using programming, statistics, and domain expertise. In bioinformatics, data science enables us to turn gigabytes of sequence data into biological meaning.

Imagine trying to understand gene regulation in cancer by analyzing thousands of RNA-seq samples, or predicting antibiotic resistance from bacterial genomes—these challenges are not solvable through wet lab experiments alone. They require data-driven thinking.

Data Science Meets Bioinformatics
Bioinformatics is inherently a data science domain. From genomics to systems biology, every field in modern biology relies on data science techniques to:

Clean and process massive datasets

Discover patterns in high-dimensional data

Build predictive models (e.g., for disease classification)

Visualize complex biological networks and trends

Integrate diverse data types (e.g., transcriptomic + epigenomic data)

The Bioinformatics Toolkit
Here’s what data science typically looks like in bioinformatics:

Task | Data Science Role
Sequence alignment | Efficient algorithms, indexing, parallel processing
Gene expression analysis | Statistical modeling (e.g., DESeq2, limma)
Variant calling | Data filtering, probabilistic models
Clustering of cells in single-cell data | Unsupervised learning
Protein structure prediction | Deep learning models (e.g., AlphaFold)
Metagenomics | Data integration, classification, dimensionality reduction

Common tools include Python, R, Bioconductor, scikit-learn, Pandas, Seurat, and TensorFlow—often working together in reproducible workflows.
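To make the gene expression row a little more concrete, here is a minimal sketch in plain Python of two basic steps: scaling raw read counts to counts per million (CPM) and computing a log2 fold change between two samples. This is only an illustration of the idea. It is not what DESeq2 or limma do (those fit statistical models across replicates), and the gene names, counts, and pseudocount value are made up for the example.

```python
import math


def counts_per_million(counts):
    """Scale one sample's raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """Per-gene log2(B / A); the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Hypothetical toy counts for three genes in two samples.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treated = {"geneA": 400, "geneB": 300, "geneC": 300}

cpm_control = counts_per_million(control)
cpm_treated = counts_per_million(treated)
lfc = log2_fold_change(cpm_control, cpm_treated)

for gene, value in lfc.items():
    print(f"{gene}\t{value:+.2f}")
```

In a real analysis you would have biological replicates and would use a dedicated package to estimate variance and test for significance; a single ratio like this says nothing about statistical support.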

It's Not Just About Coding
A common misconception is that bioinformatics is just programming or scripting. But being a data scientist in bioinformatics also means:

Understanding experimental design

Asking biologically meaningful questions

Choosing the right statistical or machine learning models

Communicating findings effectively (e.g., plots, dashboards, papers)

In other words, data science in bioinformatics is where biology, statistics, and computer science converge.

Why It Matters
The real power of data science in bioinformatics is its ability to scale discovery.

Instead of studying one gene, we can study thousands.

Instead of analyzing one species, we can explore entire ecosystems.

Instead of waiting months for lab results, we can generate hypotheses in days.

From personalized medicine and cancer diagnostics to agricultural genomics and pandemic surveillance, data science is at the heart of the bioinformatics revolution.

Final Thoughts
If you’re a biologist who’s curious about code, or a data enthusiast fascinated by life sciences, bioinformatics is your playground—and data science is your toolkit.

In bioinformatics, data science isn’t just useful. It’s essential.

MITObim - mitochondrial baiting and iterative mapping

Rahul Nayak — Tue, 08 May 2018 04:15:25 -0500

This document contains instructions on how to use the MITObim pipeline described in Hahn et al. 2013. The full article can be found here. Kindly cite the article if you are using MITObim in your work. The pipeline was originally developed for Illumina data, but thanks to the versatility of the MIRA assembler, MITObim in principle also supports data from the Ion Torrent, 454 and PacBio sequencing platforms.

Below you can find a few basic tutorials on how to run MITObim, and I encourage you to give them a try with the test data that comes with this repo, just to make sure everything is running smoothly on your system. It'll only take a few minutes and will potentially save you a lot of time down the line.

I provide further examples here as Jupyter notebooks. Get in touch if you feel like sharing your particular MITObim solution and I'd be happy to put it up here, too!

Address of the bookmark: https://github.com/chrishah/MITObim

Scientists map 17,294 proteins produced in human body

Jit — Thu, 29 May 2014 01:57:55 -0500

Indian scientists missed the genomic profiling bus, but they've more than made up for it by creating the first human proteome map, which is an extension of the genomic study. Until now, there was no direct equivalent of the genome map for the human proteome. Recently, however, two groups presented mass spectrometry-based analyses of human tissues, body fluids and cells, mapping the large majority of the human proteome.

The Indian scientists working in Bangalore, along with their American counterparts, have mapped more than 17,000 proteins in 30 organs of the human body. Just like the human genome was sequenced around the turn of the millennium, this is an equivalent mapping of the human proteome.

The researchers estimated that there are around 20,500 proteins in the human body. They have profiled 17,294 of these, which accounts for around 84% of the total. Apart from this, the team also traced around 2,500 of the 3,000 proteins that had been categorised as "missing proteins".

The work, done by a group of Indian scientists together with Johns Hopkins University, was published in the journal Nature ( http://www.nature.com/nature/journal/v509/n7502/full/nature13302.html ). Of the 72 people who worked on the project, 46 are Indians.

Reference:

http://www.nature.com/nature/journal/v509/n7502/full/nature13302.html

http://www.proteinatlas.org/ -The antibody-based Human Protein Atlas programme

http://www.humanproteomemap.org/ -Proteogenomic analysis by identifying translated proteins from annotated pseudogenes, non-coding RNAs and untranslated regions.

https://www.proteomicsdb.org/ -Assembled protein evidence for 18,097 genes in ProteomicsDB

segemehl

Anjana — Tue, 10 May 2016 08:10:15 -0500

segemehl is a software tool for mapping short sequencer reads to reference genomes. Unlike many other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation-contaminated reads correctly. segemehl implements a matching strategy based on enhanced suffix arrays (ESA).
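For readers who have not met suffix arrays before, the following is a rough sketch of the underlying idea in Python. It builds a plain suffix array (not an enhanced one, so no LCP or child tables) with a naive sort, and uses binary search to find exact occurrences of a read. It is not segemehl's implementation and does not handle mismatches, insertions or deletions; the function names and the toy reference sequence are made up for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffixes; fine for toy inputs, far too slow
    and memory-hungry for real genomes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, pattern):
    """Return sorted positions where pattern occurs exactly in text."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        pos = suffix_array[mid]
        if text[pos:pos + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        pos = suffix_array[mid]
        if text[pos:pos + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in [start, lo) begin with the pattern.
    return sorted(suffix_array[start:lo])


# Hypothetical toy reference and reads.
reference = "ACGTACGTTACG"
sa = build_suffix_array(reference)
hits_acg = find_exact(reference, sa, "ACG")
hits_tt = find_exact(reference, sa, "TT")
hits_none = find_exact(reference, sa, "GGG")
print(hits_acg, hits_tt, hits_none)
```

Because the suffixes are sorted, every suffix starting with the read sits in one contiguous block of the array, which is why two binary searches are enough to find all exact hits.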

More at http://www.bioinf.uni-leipzig.de/Software/segemehl/

Manual http://www.bioinf.uni-leipzig.de/Software/segemehl/segemehl_manual_0_1_7.pdf

Address of the bookmark: http://hoffmann.bioinf.uni-leipzig.de/LIFE/segemehl.html

BIMA V3: an aligner customized for mate pair library sequencing

Abhimanyu Singh — Wed, 14 Dec 2016 15:20:00 -0600

Summary: Mate pair library sequencing is an effective and economical method for detecting genomic structural variants and chromosomal abnormalities. Unfortunately, the mapping and alignment of mate pair read pairs to a reference genome is a challenging and time-consuming process for most NGS alignment programs. Large insert sizes, introduction of library preparation protocol artifacts (biotin junction reads, paired-end read contamination, chimeras, etc.), and presence of structural variant breakpoints within reads increase mapping and alignment complexity. We describe an algorithm that is up to 20 times faster and 25% more accurate than popular NGS alignment programs when processing mate pair sequencing.
Availability: http://bioinformaticstools.mayo.edu/research/bima/
Contact: vasmatzis.george@mayo.edu

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/02/12/bioinformatics.btu078.full.pdf

NCBI Magic-BLAST

Jit — Tue, 14 Aug 2018 18:11:11 -0500

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score, taking into account simultaneously the two reads of a pair, and in case of RNA-seq, locating the candidate introns and adding up the score of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Magic-BLAST incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.

Address of the bookmark: https://ncbi.github.io/magicblast/

Common steps for read mapping !

BioStar — Thu, 09 Mar 2023 02:48:02 -0600

Mapping reads to a reference genome is an essential step in many types of genomic analysis, such as variant calling and gene expression analysis. Here are some general steps to follow for mapping reads to a genome:

Choose a read mapper: There are many read mappers available, such as BWA, Bowtie, and HISAT2. Choose a mapper that is appropriate for your type of data and research question.
Index the reference genome: Before mapping reads, the reference genome needs to be indexed. This involves creating an index of the genome sequence that allows the mapper to quickly find matches to the reads. Most mappers have their own indexing tools.
Prepare the read data: The reads should be in a format that is compatible with the mapper. Most mappers accept FASTQ files, and some also accept FASTA or unaligned BAM. Depending on the quality of the data, the reads may need to be filtered or trimmed before mapping.
Run the mapper: The mapper is run with the command-line interface or using a graphical user interface. The specific command depends on the mapper being used, but typically involves specifying the input data, reference genome, and output file format.
Evaluate the mapping results: After the mapping is complete, the results should be evaluated. This includes assessing the quality of the mapping, such as the mapping rate, the number of mapped reads, and the mapping quality score.
Post-processing: Depending on the analysis being performed, post-processing of the mapped reads may be necessary. This can include filtering reads based on quality, removing duplicate reads, and calling variants.
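As an illustration of the evaluation step above, here is a minimal sketch in Python that computes the mapping rate and the number of confidently mapped reads from SAM-formatted text. It relies only on the standard SAM columns (FLAG in column 2, MAPQ in column 5) and the flag bits 0x4 (unmapped), 0x100 (secondary) and 0x800 (supplementary). The function name, the MAPQ cutoff of 20 and the toy records are assumptions made for the example; in practice a tool such as samtools would normally report these numbers.

```python
def mapping_summary(sam_lines, min_mapq=20):
    """Summarise primary alignments from an iterable of SAM text lines.

    min_mapq is an illustrative cutoff, not a recommended value.
    """
    total = mapped = confident = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & (0x100 | 0x800):
            continue  # skip secondary/supplementary records so each read counts once
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
            if mapq >= min_mapq:
                confident += 1
    rate = mapped / total if total else 0.0
    return {
        "total": total,
        "mapped": mapped,
        "mapping_rate": rate,
        "mapq_at_least_min": confident,
    }


# Hypothetical toy SAM records: two mapped reads, one unmapped read,
# and one secondary alignment that should be ignored.
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t50\t5\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
]

summary = mapping_summary(toy_sam)
print(summary)
```

To run this on a real file, pass an open file handle for an uncompressed SAM file instead of the toy list; BAM files would first need converting to SAM text or reading with a dedicated library.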

Overall, mapping reads to a reference genome is a complex process that requires careful consideration of the type of data, the research question, and the specific mapper being used.