BOL: Related items

HNADOCK: a nucleic acid docking server for modeling RNA/DNA–RNA/DNA 3D complex structures

Poonam Mahapatra — Thu, 04 Jun 2020 23:19:07 -0500

The HNADOCK server predicts the structure of the binding complex formed by two nucleic acid molecules. It uses a hierarchical docking algorithm that combines an FFT-based global search strategy with an intrinsic scoring function for nucleic acid interactions. Users must provide the three-dimensional (3D) structures of the two molecules to be docked.
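As a rough illustration of what an FFT-based global search does, the sketch below scores every rigid translation of one grid against another with a single FFT correlation. It is a generic, simplified example in Python with NumPy, not HNADOCK's algorithm or scoring function: the binary occupancy grids, the function name best_translation, and the plain overlap score are all assumptions made for illustration. A real docking search would also sample rotations and penalise clashes.

```python
import numpy as np


def best_translation(receptor, ligand):
    """Return (shift, score) maximising the circular overlap of two 3D grids.

    Hypothetical helper for illustration only; not part of HNADOCK.
    """
    # Correlation theorem: a few FFTs score all translations at once.
    spectrum = np.fft.rfftn(receptor) * np.conj(np.fft.rfftn(ligand))
    scores = np.fft.irfftn(spectrum, s=receptor.shape)
    # scores[t] = sum over x of receptor[x + t] * ligand[x] (indices wrap around)
    shift = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(i) for i in shift), float(scores[shift])


# Toy grids: a 2x2x2 block at the origin (ligand) and the same block moved (receptor).
grid = (8, 8, 8)
ligand = np.zeros(grid)
ligand[0:2, 0:2, 0:2] = 1.0
receptor = np.zeros(grid)
receptor[3:5, 4:6, 1:3] = 1.0

shift, score = best_translation(receptor, ligand)
print(shift, round(score))
```

Evaluating each translation directly costs on the order of N^2 operations for N grid cells, while the FFT route costs on the order of N log N, which is why this kind of search is practical for exhaustive translational scans.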

Address of the bookmark: http://huanglab.phys.hust.edu.cn/hnadock/

Graph Genome Suite

Jit — Fri, 28 Oct 2016 07:59:54 -0500

Seven Bridges is the biomedical data analysis company accelerating breakthroughs in genomics research for cancer, drug development and precision medicine. We build self-improving systems to analyze millions of genomes, including the Graph Genome Suite — the most advanced population genomics tools in the world.

Address of the bookmark: https://www.sbgenomics.com/graph/

PhyloGrapher - Graph Visualization Tool

Jit — Wed, 07 Mar 2018 18:11:25 -0600

PhyloGrapher is a program designed to visualize and study evolutionary relationships within families of homologous genes or proteins (elements). PhyloGrapher is a drawing tool that generates custom graphs for a given set of elements. In general, it is possible to use PhyloGrapher to visualize any type of relations between elements.

https://www.youtube.com/watch?v=WgufqYMHCvM

Address of the bookmark: http://www.atgc.org/PhyloGrapher/PhyloGrapher_Welcome.html

PPanGGOLiN: Depicting microbial species diversity via a Partitioned PanGenome Graph Of Linked Neighbors

LEGE — Thu, 01 Feb 2024 00:24:32 -0600

PPanGGOLiN (Gautreau et al. 2020) is a software suite for creating and manipulating prokaryotic pangenomes from a set of genomic DNA sequences or from provided genome annotations. It is designed to scale to tens of thousands of genomes. Its distinguishing feature is that it partitions the pangenome with a statistical approach rather than fixed thresholds. This lets it work with low-quality data such as Metagenome-Assembled Genomes (MAGs) or Single-cell Amplified Genomes (SAGs), so users can take advantage of large-scale environmental studies and study the pangenomes of uncultivable species.

Complete documentation is available via the project repository.

Address of the bookmark: https://github.com/labgem/PPanGGOLiN

GraphPath: A graph attention model for molecular stratification with interpretability based on the pathway-pathway interaction network

LEGE — Wed, 27 Mar 2024 20:51:21 -0500

Accurate and interpretable clinical predictions require thorough characterization of patients at both the molecular and biological pathway levels. In this paper, we present GraphPath, a biological knowledge-driven graph neural network with a multi-head self-attention mechanism that implements the pathway-pathway interaction network. We train GraphPath to classify the cancer status of patients with prostate cancer based on their multi-omics profiling.

Address of the bookmark: https://github.com/amazingma/GraphPath

Large Language Models in Bioinformatics: Transforming Data Analysis and Interpretation

LEGE — Thu, 02 Jan 2025 11:26:29 -0600

The integration of artificial intelligence (AI) into bioinformatics has ushered in a new era of computational biology. Among the most transformative advancements are large language models (LLMs), such as GPT and BERT, which leverage deep learning to process and interpret vast amounts of text data. These models are reshaping bioinformatics by enhancing data analysis, hypothesis generation, and literature mining.

Understanding Large Language Models

LLMs are AI systems trained on extensive datasets of natural language. Their ability to model context, identify patterns, and generate coherent language has proven invaluable across domains, including bioinformatics. By fine-tuning these models on biological datasets, researchers can unlock insights into molecular biology, systems biology, and beyond.

Key Applications of LLMs in Bioinformatics

1. Annotating Biological Data

Annotating genomic and proteomic data is fundamental yet labor-intensive. LLMs streamline this process by extracting functional annotations from literature and databases, predicting gene and protein functions, and providing automated insights.

2. Mining Scientific Literature

The exponential growth of publications presents a challenge for researchers to stay updated. LLMs can process large volumes of text to extract key findings, summarize papers, and identify trends, thereby facilitating efficient literature reviews.

3. Predicting Gene and Protein Functions

By leveraging sequence data and annotations, LLMs can predict the functions of uncharacterized genes and proteins. This capability is particularly useful for studying non-model organisms and orphan genes.

4. Drug Discovery and Repurposing

LLMs enable pattern recognition across chemical, genomic, and clinical datasets, identifying novel drug candidates and repurposing existing drugs for new therapeutic targets. They can simulate interactions between drugs and biological molecules, accelerating the discovery pipeline.

5. Generating Hypotheses for Research

LLMs analyze complex datasets to propose testable hypotheses. For example, they can predict protein-protein interactions, identify regulatory motifs, or model evolutionary processes in genomes.

Advantages of LLMs in Bioinformatics

Scalability: LLMs process massive datasets rapidly, reducing the time required for data analysis.
Versatility: These models adapt to diverse bioinformatics tasks, from genomic annotation to network analysis.
Contextual Insights: By synthesizing information across disparate datasets, LLMs provide integrative insights into biological systems.

Challenges in Applying LLMs

Despite their promise, LLMs face limitations:

Data Quality and Bias: Inaccurate or biased datasets can affect model predictions, necessitating rigorous data curation.
Interpretability: Understanding the decision-making process of LLMs remains a critical challenge, especially in high-stakes fields like genomics and medicine.
Resource Intensity: Training and deploying LLMs require substantial computational power, which can limit accessibility.
Ethical Concerns: Handling sensitive genomic data raises privacy and security issues, emphasizing the need for ethical guidelines.

Future Prospects

The continued development of LLMs tailored for bioinformatics promises exciting advancements. Specialized models trained on omics data, open-access platforms, and interdisciplinary collaborations will expand the utility of LLMs. Moreover, integrating LLMs with other AI technologies, such as graph neural networks and reinforcement learning, can unlock deeper biological insights.

Conclusion

Large language models are revolutionizing bioinformatics by addressing longstanding challenges in data annotation, literature mining, and function prediction. Their ability to analyze complex biological datasets efficiently positions them as indispensable tools for modern research. As bioinformatics embraces AI, the synergy between LLMs and biological sciences holds the potential to unravel the complexities of life with unprecedented precision and scale.

SAMStat: display statistics of large sequence files from next generation sequencing projects

Neel — Fri, 09 Nov 2018 13:34:56 -0600

SAMStat is an efficient C program to quickly display statistics of large sequence files from next generation sequencing projects. When applied to SAM/BAM files, all statistics are reported separately for unmapped, poorly mapped, and accurately mapped reads. This allows for identification of a variety of problems that cause poor mapping, such as remaining linker and adaptor sequences. Apart from this, SAMStat can be used to verify individual processing steps in large analysis pipelines.
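The idea of reporting statistics separately per mapping class can be sketched in a few lines of Python. This is not SAMStat's code: the function names and the MAPQ cutoff of 30 are assumptions chosen for illustration, and only two facts from the SAM format are relied on (column 2 is the bitwise FLAG, where 0x4 marks an unmapped read, and column 5 is the mapping quality).

```python
from collections import Counter


def mapping_class(sam_line, good_mapq=30):
    """Classify one SAM alignment line as 'unmapped', 'poor', or 'accurate'.

    The good_mapq cutoff is an illustrative assumption, not SAMStat's binning.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality (MAPQ)
    if flag & 0x4:         # bit 0x4 is set when the read is unmapped
        return "unmapped"
    return "accurate" if mapq >= good_mapq else "poor"


def summarize(sam_lines, good_mapq=30):
    """Count reads per mapping class, skipping header and blank lines."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        counts[mapping_class(line, good_mapq)] += 1
    return counts


# Three made-up alignment records plus one header line.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(dict(summarize(example)))
```

Once reads are split this way, any per-read statistic (length, base composition, quality) can be accumulated per class, so that a problem confined to poorly mapped reads stands out.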

Address of the bookmark: http://samstat.sourceforge.net/

Zombies like bacteria!!!

Rahul Agarwal — Tue, 03 Sep 2013 08:44:15 -0500

Do you believe in zombie stories? Confused? Don’t worry, there is news for you. Scientists from the Integrated Ocean Drilling Program have announced the discovery of long-lived bacteria that reproduce only once every 10,000 years. The bacteria were found 2.5 km (1.5 miles) below the ocean floor, in rocks that are as much as 100 million years old.

"The microbes exist in very low concentrations, of around 1,000 microbes in every tea spoon full of rock, compared with billions or trillions of bacteria that would typically be found in the same amount of soil at Earth's surface."

Reference:

http://www.bbc.co.uk/news/science-environment-23855436

Fancy Oneliner for Bioinformatics !!

Poonam Mahapatra — Thu, 07 Jul 2016 12:05:50 -0500

This webpage lists some of the one-liners that we frequently use in metagenomic analyses. You can click on the following links to browse through different topics. You can copy/paste the commands as they are in your terminal screen, provided you follow the same naming conventions and folder structures as we have. We are sharing these codes with the intention that if they are useful and help you in your analyses, then we will be appropriately credited as considerable effort has been put into devising them.

Address of the bookmark: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html

CoverM: Read coverage calculator for metagenomics

Neel — Thu, 29 Apr 2021 23:39:14 -0500

CoverM aims to be a configurable, easy to use and fast DNA read coverage and relative abundance calculator focused on metagenomics applications.

CoverM calculates the coverage of genomes/MAGs with coverm genome, or of individual contigs with coverm contig. It calculates coverage by read mapping, so its input can be either BAM files sorted by reference, or raw reads and reference genomes in various formats.
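To make the notion of read coverage concrete, here is a minimal Python sketch of mean per-contig depth computed from already-mapped read intervals. It is a simplification for illustration and not CoverM's implementation: the function name, the input layout (0-based, half-open intervals), and the plain "aligned bases divided by contig length" definition are assumptions, and CoverM itself offers several coverage methods.

```python
from collections import defaultdict


def mean_coverage(contig_lengths, alignments):
    """Return mean read depth per contig.

    contig_lengths: dict mapping contig name to length in bases.
    alignments: iterable of (contig, start, end), 0-based half-open intervals.
    Hypothetical helper for illustration; not CoverM's algorithm.
    """
    aligned_bases = defaultdict(int)
    for contig, start, end in alignments:
        # Clip each alignment to the contig boundaries before counting bases.
        start = max(start, 0)
        end = min(end, contig_lengths[contig])
        if end > start:
            aligned_bases[contig] += end - start
    return {
        name: aligned_bases[name] / length
        for name, length in contig_lengths.items()
    }


# Made-up example: two contigs and three mapped reads.
contig_lengths = {"contigA": 100, "contigB": 50}
alignments = [
    ("contigA", 0, 50),
    ("contigA", 25, 75),
    ("contigB", 40, 60),  # runs past the contig end, so it is clipped to 40..50
]
coverage = mean_coverage(contig_lengths, alignments)
print(coverage)
```

Genome-level coverage can then be approximated by pooling the aligned bases and lengths of all contigs that belong to the same genome or MAG.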

Address of the bookmark: https://github.com/wwood/CoverM