BOL: Related items

S-plot2: Rapid Visual and Statistical Analysis of Genomic Sequences

Abhimanyu Singh — Tue, 02 Oct 2018 17:57:27 -0500

S-plot2 creates an interactive, two-dimensional heatmap capturing the similarities and dissimilarities in nucleotide usage between genomic sequences (partial or complete). In S-plot2, whole eukaryotic chromosomes and smaller prokaryotic genomes can be efficiently compared. The tool includes functionality to extract, analyze, and automate BLAST queries of regions of interest within the heatmap. This facilitates the investigation of quickly evolving coding regions, novel coding regions, and laterally transferred elements.
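The idea of comparing nucleotide usage between two sequences can be sketched in a few lines of Python. This is only an illustration, not S-plot2's actual method: the choice of dinucleotide frequencies, fixed non-overlapping windows, Pearson correlation, and every function name (`kmer_profile`, `pearson`, `windows`, `similarity_matrix`) are assumptions made for this example. The resulting matrix is the kind of data a two-dimensional heatmap could display.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Relative frequency of every A/C/G/T k-mer in seq, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def windows(seq, size):
    """Split seq into non-overlapping windows; a short trailing piece is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def similarity_matrix(seq_a, seq_b, window=1000, k=2):
    """Correlate the k-mer profile of each window of seq_a with each window of seq_b."""
    prof_a = [kmer_profile(w, k) for w in windows(seq_a.upper(), window)]
    prof_b = [kmer_profile(w, k) for w in windows(seq_b.upper(), window)]
    return [[pearson(pa, pb) for pb in prof_b] for pa in prof_a]
```

Cells near 1 mark window pairs with similar nucleotide usage; cells near 0 or below mark pairs that differ, which is where regions of interest would be expected to stand out.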

Address of the bookmark: https://bitbucket.org/lkalesinskas/splot

Pango Lineage Analysis!

Abhi — Mon, 15 Nov 2021 03:38:29 -0600

The Pango nomenclature is used by researchers and public health agencies worldwide to track the transmission and spread of SARS-CoV-2, including variants of concern. This website documents all current Pango lineages and their spread, as well as various software tools that researchers can use to analyze SARS-CoV-2 sequence data.

Address of the bookmark: https://cov-lineages.org/resources/pangolin/output.html

SeqCAT: Sequence Conversion and Analysis Toolbox

Neel — Fri, 14 Jun 2024 14:36:53 -0500

Your all-in-one solution for smooth conversion of sequence coordinates.

Designed for bioinformatics data analysis and daily laboratory work, SeqCAT simplifies sequence coordinate conversion. Extract gene and transcript information, manipulate sequences, and easily validate complex genetic events such as fusions with SeqCAT.
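As a rough illustration of what sequence coordinate conversion involves, the sketch below maps a position within a spliced transcript to a genomic position. It does not use or reproduce SeqCAT's interface: the function name `transcript_to_genomic`, the 0-based end-exclusive exon intervals, and the strand handling are all assumptions chosen for this example.

```python
def transcript_to_genomic(pos, exons, strand="+"):
    """Map a 0-based transcript position to a 0-based genomic position.

    exons: list of (start, end) genomic intervals, 0-based and end-exclusive
    (a hypothetical input format chosen for this sketch).
    strand: "+" or "-"; on the minus strand the transcript starts at the
    exon with the highest genomic coordinates and runs downward.
    """
    if pos < 0:
        raise ValueError("position must be non-negative")
    # Walk the exons in transcript order.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = pos
    for start, end in ordered:
        length = end - start
        if offset < length:
            return start + offset if strand == "+" else end - 1 - offset
        offset -= length
    raise ValueError("position lies beyond the end of the transcript")
```

With exons `[(100, 110), (200, 205)]`, transcript position 10 maps to genomic position 200 on the plus strand and to 104 on the minus strand.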

More at https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkae422/7683049?login=false

Address of the bookmark: https://mtb.bioinf.med.uni-goettingen.de/SeqCAT/home

Large Language Models in Bioinformatics: Transforming Data Analysis and Interpretation

LEGE — Thu, 02 Jan 2025 11:26:29 -0600

The integration of artificial intelligence (AI) into bioinformatics has ushered in a new era of computational biology. Among the most transformative advancements are large language models (LLMs), such as GPT and BERT, which leverage deep learning to process and interpret vast amounts of text data. These models are reshaping bioinformatics by enhancing data analysis, hypothesis generation, and literature mining.

Understanding Large Language Models

LLMs are AI systems trained on extensive datasets of natural language. Their ability to model context, identify patterns, and generate coherent language has proven invaluable across domains, including bioinformatics. By fine-tuning these models on biological datasets, researchers can unlock insights into molecular biology, systems biology, and beyond.

Key Applications of LLMs in Bioinformatics

1. Annotating Biological Data

Annotating genomic and proteomic data is fundamental yet labor-intensive. LLMs streamline this process by extracting functional annotations from literature and databases, predicting gene and protein functions, and providing automated insights.

2. Mining Scientific Literature

The exponential growth of publications presents a challenge for researchers to stay updated. LLMs can process large volumes of text to extract key findings, summarize papers, and identify trends, thereby facilitating efficient literature reviews.

3. Predicting Gene and Protein Functions

By leveraging sequence data and annotations, LLMs can predict the functions of uncharacterized genes and proteins. This capability is particularly useful for studying non-model organisms and orphan genes.

4. Drug Discovery and Repurposing

LLMs enable pattern recognition across chemical, genomic, and clinical datasets, identifying novel drug candidates and repurposing existing drugs for new therapeutic targets. They can simulate interactions between drugs and biological molecules, accelerating the discovery pipeline.

5. Generating Hypotheses for Research

LLMs analyze complex datasets to propose testable hypotheses. For example, they can predict protein-protein interactions, identify regulatory motifs, or model evolutionary processes in genomes.

Advantages of LLMs in Bioinformatics

Scalability: LLMs process massive datasets rapidly, reducing the time required for data analysis.
Versatility: These models adapt to diverse bioinformatics tasks, from genomic annotation to network analysis.
Contextual Insights: By synthesizing information across disparate datasets, LLMs provide integrative insights into biological systems.

Challenges in Applying LLMs

Despite their promise, LLMs face limitations:

Data Quality and Bias: Inaccurate or biased datasets can affect model predictions, necessitating rigorous data curation.
Interpretability: Understanding the decision-making process of LLMs remains a critical challenge, especially in high-stakes fields like genomics and medicine.
Resource Intensity: Training and deploying LLMs require substantial computational power, which can limit accessibility.
Ethical Concerns: Handling sensitive genomic data raises privacy and security issues, emphasizing the need for ethical guidelines.

Future Prospects

The continued development of LLMs tailored for bioinformatics promises exciting advancements. Specialized models trained on omics data, open-access platforms, and interdisciplinary collaborations will expand the utility of LLMs. Moreover, integrating LLMs with other AI technologies, such as graph neural networks and reinforcement learning, can unlock deeper biological insights.

Conclusion

Large language models are revolutionizing bioinformatics by addressing longstanding challenges in data annotation, literature mining, and function prediction. Their ability to analyze complex biological datasets efficiently positions them as indispensable tools for modern research. As bioinformatics embraces AI, the synergy between LLMs and biological sciences holds the potential to unravel the complexities of life with unprecedented precision and scale.

DeepVariant: an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data

Jit — Sat, 25 Jan 2020 13:28:09 -0600

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. DeepVariant relies on Nucleus, a library of Python and C++ code for reading and writing data in common genomics file formats (like SAM and VCF) designed for painless integration with the TensorFlow machine learning framework.

https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html

https://www.biorxiv.org/content/10.1101/092890v6

Address of the bookmark: https://github.com/google/deepvariant

wgd—simple command line tools for the analysis of ancient whole-genome duplications

LEGE — Thu, 23 Jul 2020 05:49:45 -0500

wgd is an easy-to-use command-line tool for K_S distribution construction. The wgd suite provides commonly used K_S and colinearity analysis workflows, together with tools for modeling and visualization, making these analyses conveniently accessible to genomics researchers.

https://academic.oup.com/bioinformatics/article/35/12/2153/5162749

Address of the bookmark: https://github.com/arzwa/wgd

Kmer: a suite of tools for DNA sequence analysis

BioStar — Wed, 18 Aug 2021 00:02:54 -0500

More at https://help.rc.ufl.edu/doc/Kmer

This also includes:

A2Amapper: ATAC, Assembly to Assembly Comparison tool:
- Comparative mapping between two genome assemblies (same species), or between two different genomes (cross species).

Sim4db:
- Spliced alignment of cDNA and genomic sequences, from the same (sim4) or related (sim4cc) species. Optimized for high-throughput batched alignment.

LEAFF:
- LEAFF (ahem, Let's Extract Anything From Fasta) is a utility program for working with multi-fasta files. In addition to providing random access to the base level, it includes several analysis functions.

Meryl:
- An out-of-core k-mer counter. The amount of sequence that can be processed for any size k depends only on the amount of free disk space.
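For readers unfamiliar with k-mer counting, here is a minimal in-memory sketch of the basic operation. It is not Meryl's implementation: Meryl works out of core (on disk), whereas this example keeps all counts in memory, and the function names and the decision to count canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) are assumptions of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(seq, k):
    """Count canonical k-mers in seq, skipping any k-mer with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts
```

For example, `count_kmers("ACGTN", 2)` counts `AC` twice (once directly and once as the reverse complement of `GT`) and `CG` once, and skips `TN`. Because the whole table lives in memory, this approach is limited by available RAM, which is the constraint an out-of-core counter avoids.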

Address of the bookmark: https://help.rc.ufl.edu/doc/Kmer

pipesnake: bioinformatics best-practice analysis pipeline for phylogenomic reconstruction

LEGE — Wed, 21 Feb 2024 06:19:41 -0600

ausarg/pipesnake is a bioinformatics best-practice analysis pipeline for phylogenomic reconstruction starting from short-read 'second-generation' sequencing data.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Address of the bookmark: https://github.com/AusARG/pipesnake

Newborn babies get ready to know their whole genome soon!!!

Rahul Agarwal — Thu, 05 Sep 2013 07:24:02 -0500

The USA has launched pilot projects to examine the medical information of newborn babies. The projects are funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) and the National Human Genome Research Institute (NHGRI), both parts of the National Institutes of Health.

Awards of $5 million to four grantees have been made in fiscal year 2013 under the Genomic Sequencing and Newborn Screening Disorders research program. The program will be funded at $25 million over five years, as funds are made available.

"Hundreds of US babies will be pioneers in genomic medicine through a US$25-million programme to sequence their genomes soon after they are born."

Source:

http://blogs.nature.com/news/2013/09/scientists-to-sequence-hundreds-of-newborns-genomes.html

http://www.genome.gov/27554919

GOLD: Genomes Online Database

Jit — Wed, 26 Jul 2017 07:49:29 -0500

GOLD (Genomes Online Database) is a World Wide Web resource for comprehensive access to information about genome and metagenome sequencing projects around the world, together with their associated metadata.

https://gold.jgi.doe.gov/

Address of the bookmark: https://gold.jgi.doe.gov/