BOL: Related items

Harvest

Jit — Tue, 31 Jan 2017 10:57:56 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Citation

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology, 15 (11), 1-15.

Address of the bookmark: http://harvest.readthedocs.io/en/latest/index.html

HivePlot

Jit — Thu, 16 Feb 2017 11:39:34 -0600

The hive plot is a rational visualization method for drawing networks. Nodes are mapped to and positioned on radially distributed linear axes — this mapping is based on network structural properties. Edges are drawn as curved links. Simple and interpretable.

The purpose of the hive plot is to establish a new baseline for visualization of large networks — a method that is both general and tunable and useful as a starting point in visually exploring network structure.
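The node-to-axis mapping described above can be sketched in a few lines. This is only an illustrative layout, not the hiveplot.com implementation: the function name `hive_layout`, the three-axis split by source/sink/other, and the use of normalized degree as the position along an axis are all assumptions chosen for the example.

```python
import math
from collections import defaultdict

def hive_layout(edges):
    """Place the nodes of a directed graph on three radial axes.

    Assumed rule (for illustration only):
      axis 0 = sources (no incoming edges)
      axis 1 = sinks (no outgoing edges)
      axis 2 = all other nodes
    Position along an axis is the node degree divided by the maximum degree.
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1

    nodes = set(indeg) | set(outdeg)
    max_deg = max(indeg[n] + outdeg[n] for n in nodes)

    layout = {}
    for n in nodes:
        if indeg[n] == 0:
            axis = 0
        elif outdeg[n] == 0:
            axis = 1
        else:
            axis = 2
        angle = 2 * math.pi * axis / 3           # axes are spread evenly around the circle
        radius = (indeg[n] + outdeg[n]) / max_deg
        layout[n] = (axis, radius * math.cos(angle), radius * math.sin(angle))
    return layout

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]
layout = hive_layout(edges)
```

Each entry in `layout` holds the axis index and the x/y coordinates of the node. Edges would then be drawn as curves between these coordinates with a plotting library of choice.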

More at http://www.hiveplot.com/

Address of the bookmark: http://www.hiveplot.com/

ConPADE: Genome Assembly Ploidy Estimation from Next-Generation Sequencing Data

Jit — Fri, 24 Feb 2017 04:55:41 -0600

ConPADE (Contig Ploidy and Allele Dosage Estimation) is a probabilistic method that estimates the ploidy of any given contig or scaffold from its allele proportions. In the process, the authors report findings on sequencing errors. The method can be used with whole genome shotgun (WGS) sequencing data, and the authors also show that it applies to variant calling and allele dosage estimation. Results on simulated and real datasets provide evidence that ConPADE performs well as long as enough sequencing coverage is available or the true contig ploidy is low.
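To make the idea of inferring ploidy from allele proportions concrete, here is a minimal sketch. It is not the ConPADE model: it ignores sequencing errors, assumes every site is heterozygous, and puts a uniform prior on allele dosage. The function names and the candidate ploidies are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def ploidy_log_likelihood(sites, ploidy):
    """sites: list of (alt_reads, depth) pairs at heterozygous positions.

    For a given ploidy, the expected alternate-allele fraction is
    dosage / ploidy for dosage = 1 .. ploidy - 1 (assumed equally likely).
    """
    dosages = range(1, ploidy)
    total = 0.0
    for alt, depth in sites:
        terms = [log_binom_pmf(alt, depth, d / ploidy) for d in dosages]
        top = max(terms)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms) / len(terms))
    return total

def estimate_ploidy(sites, candidates=(2, 3, 4)):
    """Return the candidate ploidy with the highest log-likelihood."""
    return max(candidates, key=lambda p: ploidy_log_likelihood(sites, p))

# Toy data: allele fractions near 1/3 and 2/3, as expected for a triploid contig.
triploid_sites = [(20, 60), (40, 60), (19, 60), (41, 60)]
# Toy data: allele fractions near 1/2, as expected for a diploid contig.
diploid_sites = [(30, 60), (29, 60), (31, 60)]
```

Calling `estimate_ploidy(triploid_sites)` and `estimate_ploidy(diploid_sites)` compares the candidates on each toy dataset. With low depth the binomial likelihoods for different ploidies overlap heavily, which is consistent with the coverage requirement noted above.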

https://github.com/microsoftgenomics

Address of the bookmark: https://github.com/microsoftgenomics

MyCC: Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes

Jit — Fri, 03 Mar 2017 08:34:23 -0600

MyCC is an automated binning tool that combines genomic signatures, marker genes and optional contig coverages within one or multiple samples to visualize metagenomes and identify the reconstructed genomic fragments.
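A genomic signature is commonly represented as a vector of k-mer frequencies for each contig. The sketch below shows one simple way to compute such a vector. It is an illustration only and does not reproduce MyCC's own code: the function name `kmer_signature` and the default `k=4` are assumptions.

```python
from collections import Counter
from itertools import product

def kmer_signature(seq, k=4):
    """Return the k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered alphabetically (AAAA, AAAC, ...).
    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]

signature = kmer_signature("ACGTACGTAC", k=2)
```

Contigs from the same genome tend to have similar signatures, so these vectors can be passed to a clustering or dimensionality-reduction step alongside marker-gene and coverage information.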

More at http://www.nature.com/articles/srep24175

Address of the bookmark: https://sourceforge.net/projects/sb2nhri/files/MyCC/

MaxBin: software for binning assembled metagenomic sequences based on an Expectation-Maximization algorithm.

Jit — Mon, 06 Mar 2017 04:03:38 -0600

MaxBin is software for binning assembled metagenomic sequences based on an Expectation-Maximization algorithm. Users can recover the underlying bins (genomes) of the microbes in their metagenomes by providing the assembled metagenomic sequences together with either read coverage information or the sequencing reads. For convenience, MaxBin reports genome-related statistics, including estimated completeness, GC content and genome size, in the binning summary page.
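The Expectation-Maximization idea can be illustrated with a toy example that bins contigs by coverage alone. This is a sketch under strong assumptions, not MaxBin's algorithm: it uses a Poisson mixture on per-contig coverage, takes the initial means as input, and ignores sequence composition. All names are hypothetical.

```python
import math

def em_poisson_mixture(coverages, init_means, n_iter=50):
    """Fit a Poisson mixture to contig coverages with EM.

    Returns (means, weights, bins), where bins[i] is the index of the
    most probable component for coverages[i].
    """
    k = len(init_means)
    means = list(init_means)
    weights = [1.0 / k] * k
    n = len(coverages)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each contig.
        resp = []
        for x in coverages:
            logp = [math.log(w) + x * math.log(m) - m - math.lgamma(x + 1)
                    for w, m in zip(weights, means)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            s = sum(p)
            resp.append([v / s for v in p])
        # M-step: update mixing weights and Poisson means.
        for j in range(k):
            rj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = rj / n
            means[j] = max(sum(r[j] * x for r, x in zip(resp, coverages)) / rj, 1e-9)
    bins = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, weights, bins

coverages = [9, 10, 11, 10, 48, 50, 52, 51]
means, weights, bins = em_poisson_mixture(coverages, init_means=(5.0, 40.0))
```

The E-step assigns each contig a probability of belonging to each bin, and the M-step re-estimates the bin parameters from those probabilities. The two steps repeat until the assignments settle.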

Users can use MEGAN or similar software on MaxBin bins to find the taxonomy of each bin after the binning process is finished.

https://academic.oup.com/bioinformatics/article/32/4/605/1744462/MaxBin-2-0-an-automated-binning-algorithm-to

The most recent version of MaxBin is 2.2, which supports the analysis of coassemblies of multiple samples. It is available at the JBEI downloads site as well as the MaxBin and MaxBin 2.0 SourceForge sites.

Address of the bookmark: http://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html

GroopM: Metagenomic binning toolset

Jit — Tue, 07 Mar 2017 08:59:45 -0600

GroopM is a metagenomic binning toolset. It leverages spatio-temporal dynamics (differential coverage) to accurately (and almost automatically) extract population genomes from multi-sample metagenomic datasets.

GroopM is largely parameter-free. Run groopm -h for more information.

For installation and usage instructions, see http://ecogenomics.github.io/GroopM/

Address of the bookmark: https://github.com/ecogenomics/GroopM

NCBI Prokaryotic Genome Annotation Pipeline

Jit — Tue, 16 May 2017 08:56:03 -0500

The NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Automatic Annotation Pipeline (PGAAP; see the PubMed article), developed in 2005, has been replaced with an upgraded version that is capable of processing a larger data volume. A more detailed description of the new version of the pipeline is available in the NCBI Handbook chapter. NCBI's annotation pipeline depends on several internal databases and is not currently available for download or use outside of the NCBI environment.

https://www.ncbi.nlm.nih.gov/genome/annotation_prok/

Address of the bookmark: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/

ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data

Jit — Mon, 19 Feb 2018 06:46:15 -0600

ETE v3 features numerous improvements in the underlying library of methods and provides a novel set of standalone tools to perform common tasks in comparative genomics and phylogenetics.

The new features include:

(i) building gene-based and supermatrix-based phylogenies using a single command,
(ii) testing and visualizing evolutionary models,
(iii) calculating distances between trees of different size or including duplications, and
(iv) providing seamless integration with the NCBI taxonomy database.

ETE is freely available at http://etetoolkit.org

Address of the bookmark: http://etetoolkit.org

SeqCAT: Sequence Conversion and Analysis Toolbox

Neel — Fri, 14 Jun 2024 14:36:53 -0500

Your all-in-one solution for smooth conversion of sequence coordinates.

Designed for bioinformatics data analysis and daily laboratory work, SeqCAT simplifies sequence coordinate conversion. Extract gene and transcript information, manipulate sequences, and easily validate complex genetic events such as fusions with SeqCAT.

More at https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae422/7683049?login=false

Address of the bookmark: https://mtb.bioinf.med.uni-goettingen.de/SeqCAT/home

Large Language Models in Bioinformatics: Transforming Data Analysis and Interpretation

LEGE — Thu, 02 Jan 2025 11:26:29 -0600

The integration of artificial intelligence (AI) into bioinformatics has ushered in a new era of computational biology. Among the most transformative advancements are large language models (LLMs), such as GPT and BERT, which leverage deep learning to process and interpret vast amounts of text data. These models are reshaping bioinformatics by enhancing data analysis, hypothesis generation, and literature mining.

Understanding Large Language Models

LLMs are AI systems trained on extensive datasets of natural language. Their ability to model context, identify patterns, and generate coherent language has proven invaluable across domains, including bioinformatics. By fine-tuning these models on biological datasets, researchers can unlock insights into molecular biology, systems biology, and beyond.

Key Applications of LLMs in Bioinformatics

1. Annotating Biological Data

Annotating genomic and proteomic data is fundamental yet labor-intensive. LLMs streamline this process by extracting functional annotations from literature and databases, predicting gene and protein functions, and providing automated insights.

2. Mining Scientific Literature

The exponential growth of publications makes it difficult for researchers to stay current. LLMs can process large volumes of text to extract key findings, summarize papers, and identify trends, thereby facilitating efficient literature reviews.

3. Predicting Gene and Protein Functions

By leveraging sequence data and annotations, LLMs can predict the functions of uncharacterized genes and proteins. This capability is particularly useful for studying non-model organisms and orphan genes.

4. Drug Discovery and Repurposing

LLMs enable pattern recognition across chemical, genomic, and clinical datasets, helping to identify novel drug candidates and to repurpose existing drugs for new therapeutic targets. They can also help predict interactions between drugs and biological molecules, accelerating the discovery pipeline.

5. Generating Hypotheses for Research

LLMs analyze complex datasets to propose testable hypotheses. For example, they can predict protein-protein interactions, identify regulatory motifs, or model evolutionary processes in genomes.

Advantages of LLMs in Bioinformatics

Scalability: LLMs process massive datasets rapidly, reducing the time required for data analysis.
Versatility: These models adapt to diverse bioinformatics tasks, from genomic annotation to network analysis.
Contextual Insights: By synthesizing information across disparate datasets, LLMs provide integrative insights into biological systems.

Challenges in Applying LLMs

Despite their promise, LLMs face limitations:

Data Quality and Bias: Inaccurate or biased datasets can affect model predictions, necessitating rigorous data curation.
Interpretability: Understanding the decision-making process of LLMs remains a critical challenge, especially in high-stakes fields like genomics and medicine.
Resource Intensity: Training and deploying LLMs require substantial computational power, which can limit accessibility.
Ethical Concerns: Handling sensitive genomic data raises privacy and security issues, emphasizing the need for ethical guidelines.

Future Prospects

The continued development of LLMs tailored for bioinformatics promises exciting advancements. Specialized models trained on omics data, open-access platforms, and interdisciplinary collaborations will expand the utility of LLMs. Moreover, integrating LLMs with other AI technologies, such as graph neural networks and reinforcement learning, can unlock deeper biological insights.

Conclusion

Large language models are revolutionizing bioinformatics by addressing longstanding challenges in data annotation, literature mining, and function prediction. Their ability to analyze complex biological datasets efficiently positions them as indispensable tools for modern research. As bioinformatics embraces AI, the synergy between LLMs and biological sciences holds the potential to unravel the complexities of life with unprecedented precision and scale.