BOL: Related items

Reference Sequence Resource!

LEGE — Wed, 15 Sep 2021 21:15:22 -0500

The ENCODE project uses reference genomes from NCBI or UCSC to provide a consistent framework for mapping high-throughput sequencing data. In general, ENCODE data are mapped consistently to two human (GRCh38, hg19) and two mouse (mm9, mm10) genomes for historical comparability. Drosophila melanogaster experiments are mapped to either dm3 or dm6, and Caenorhabditis elegans experiments are mapped to ce10 or ce11.
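
The assemblies named above can be captured in a small lookup table, for example to check sample metadata before mapping. This is only an illustrative sketch: the dictionary name and the helper `is_supported` are hypothetical and are not part of any ENCODE tool or API.

```python
# Assemblies listed in the ENCODE reference-sequence description above,
# keyed by organism. The structure and names here are illustrative only.
ENCODE_ASSEMBLIES = {
    "Homo sapiens": ("GRCh38", "hg19"),
    "Mus musculus": ("mm10", "mm9"),
    "Drosophila melanogaster": ("dm6", "dm3"),
    "Caenorhabditis elegans": ("ce11", "ce10"),
}


def is_supported(organism: str, assembly: str) -> bool:
    """Return True if the assembly is one of those listed for the organism."""
    return assembly in ENCODE_ASSEMBLIES.get(organism, ())


if __name__ == "__main__":
    print(is_supported("Homo sapiens", "GRCh38"))
    print(is_supported("Mus musculus", "hg19"))
```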

Address of the bookmark: https://www.encodeproject.org/data-standards/reference-sequences/

Large Language Models in Bioinformatics: Transforming Data Analysis and Interpretation

LEGE — Thu, 02 Jan 2025 11:26:29 -0600

The integration of artificial intelligence (AI) into bioinformatics has ushered in a new era of computational biology. Among the most transformative advancements are large language models (LLMs), such as GPT and BERT, which leverage deep learning to process and interpret vast amounts of text data. These models are reshaping bioinformatics by enhancing data analysis, hypothesis generation, and literature mining.

Understanding Large Language Models

LLMs are AI systems trained on extensive datasets of natural language. Their ability to model context, identify patterns, and generate coherent language has proven invaluable across domains, including bioinformatics. By fine-tuning these models on biological datasets, researchers can unlock insights into molecular biology, systems biology, and beyond.

Key Applications of LLMs in Bioinformatics

1. Annotating Biological Data

Annotating genomic and proteomic data is fundamental yet labor-intensive. LLMs streamline this process by extracting functional annotations from literature and databases, predicting gene and protein functions, and providing automated insights.

2. Mining Scientific Literature

The exponential growth of publications makes it hard for researchers to stay current. LLMs can process large volumes of text to extract key findings, summarize papers, and identify trends, thereby facilitating efficient literature reviews.

3. Predicting Gene and Protein Functions

By leveraging sequence data and annotations, LLMs can predict the functions of uncharacterized genes and proteins. This capability is particularly useful for studying non-model organisms and orphan genes.

4. Drug Discovery and Repurposing

LLMs enable pattern recognition across chemical, genomic, and clinical datasets, which helps identify novel drug candidates and repurpose existing drugs for new therapeutic targets. They can also help predict interactions between drugs and biological molecules, accelerating the discovery pipeline.

5. Generating Hypotheses for Research

LLMs analyze complex datasets to propose testable hypotheses. For example, they can predict protein-protein interactions, identify regulatory motifs, or model evolutionary processes in genomes.

Advantages of LLMs in Bioinformatics

Scalability: LLMs process massive datasets rapidly, reducing the time required for data analysis.
Versatility: These models adapt to diverse bioinformatics tasks, from genomic annotation to network analysis.
Contextual Insights: By synthesizing information across disparate datasets, LLMs provide integrative insights into biological systems.

Challenges in Applying LLMs

Despite their promise, LLMs face limitations:

Data Quality and Bias: Inaccurate or biased datasets can affect model predictions, necessitating rigorous data curation.
Interpretability: Understanding the decision-making process of LLMs remains a critical challenge, especially in high-stakes fields like genomics and medicine.
Resource Intensity: Training and deploying LLMs require substantial computational power, which can limit accessibility.
Ethical Concerns: Handling sensitive genomic data raises privacy and security issues, emphasizing the need for ethical guidelines.

Future Prospects

The continued development of LLMs tailored for bioinformatics promises exciting advancements. Specialized models trained on omics data, open-access platforms, and interdisciplinary collaborations will expand the utility of LLMs. Moreover, integrating LLMs with other AI technologies, such as graph neural networks and reinforcement learning, can unlock deeper biological insights.

Conclusion

Large language models are revolutionizing bioinformatics by addressing longstanding challenges in data annotation, literature mining, and function prediction. Their ability to analyze complex biological datasets efficiently positions them as indispensable tools for modern research. As bioinformatics embraces AI, the synergy between LLMs and biological sciences holds the potential to unravel the complexities of life with unprecedented precision and scale.

Yannick Wurm Lab

Thu, 07 Aug 2014 18:02:37 -0500

Evolutionary genomics of social insects. Extensive theoretical work has explained how and why complex societies evolve. However, little is known about the genes and molecular mechanisms responsible for social phenotypes. We have been identifying genes and mechanisms involved in the evolution of insect societies using modern genomics tools (Illumina, RNAseq, RADseq...). For example, we recently:

1. sequenced and analyzed the genome of the invasive red fire ant Solenopsis invicta (PNAS 2011)

2. discovered that a fundamental social trait in this species (how many queens are accepted in the colony) is determined by variants of a social chromosome (Nature 2013).

3. described the gene expression changes that occur in a virgin queen when she is given the opportunity of replacing her mother (Mol Ecol 2010).

Homepage: http://yannick.poulet.org/

GOLD: Genomes Online Database

Jit — Wed, 26 Jul 2017 07:49:29 -0500

GOLD (Genomes Online Database) is a World Wide Web resource for comprehensive access to information about genome and metagenome sequencing projects around the world and their associated metadata.

Address of the bookmark: https://gold.jgi.doe.gov/

P10K: The Protist 10,000 Genomes

BioStar — Sat, 06 Jul 2024 08:29:30 -0500

The Protist 10,000 Genomes (P10K) Project aims to sequence the genomes of more than 10,000 protist species, with representatives from every major clade, and to build a comprehensive database resource for them. Samples were collected from diverse habitats, and the genome information was acquired through de novo sequencing, genome re-annotation, and integration of publicly available data. Serving as the centralized data portal for the project, the P10K database focuses on high-quality curation and efficient retrieval of protist genome data.

Address of the bookmark: https://ngdc.cncb.ac.cn/p10k/

Genomicus: genome browser that enables users to navigate in genomes in several dimensions

Jit — Sat, 18 Nov 2017 16:10:16 -0600

Genomicus is a genome browser that enables users to navigate in genomes in several dimensions: linearly along chromosome axes, transversally across different species, and chronologically along evolutionary time.

Once a query gene has been entered, it is displayed in its genomic context in parallel to the genomic context of all its orthologous and paralogous copies in all the other sequenced metazoan genomes. Moreover, Genomicus stores and displays the predicted ancestral genome structure in all the ancestral species within the phylogenetic range of interest.

All the data on extant species displayed in this browser are from Ensembl.

Address of the bookmark: http://genomicus.biologie.ens.fr/genomicus-90.01/cgi-bin/search.pl

HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus genomes

Jit — Wed, 17 Jan 2018 05:03:19 -0600

HGT-Finder:

(i) can be used for HGT detection in both prokaryotes and eukaryotes,

(ii) can report a statistical P value for each gene to indicate how likely it is to be horizontally transferred, and

(iii) is fully automated (requires minimal human intervention), as well as very easy to install and run.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4626719/

CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

Rahul Nayak — Wed, 23 May 2018 04:39:26 -0500

CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage. Genome quality can also be examined using plots of key genomic characteristics (e.g., GC, coding density), which highlight sequences outside the expected distributions of a typical genome. CheckM also provides tools for identifying genome bins that are likely candidates for merging based on marker set compatibility, similarity in genomic characteristics, and proximity within a reference genome tree.
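
The idea of estimating completeness and contamination from collocated single-copy marker sets can be sketched as follows. This is a simplified illustration, not CheckM's code or command-line interface: the function name, the input format, and the toy marker names are assumptions. Completeness is taken as the average fraction of markers found per set, and contamination as the average number of extra marker copies per set, both as percentages.

```python
from typing import Dict, Iterable, Set, Tuple


def estimate_quality(
    marker_sets: Iterable[Set[str]], copy_counts: Dict[str, int]
) -> Tuple[float, float]:
    """Return (completeness %, contamination %) for one genome bin.

    marker_sets: collocated sets of single-copy marker genes (hypothetical input).
    copy_counts: how many times each marker was found in the bin.
    """
    sets = [s for s in marker_sets if s]
    if not sets:
        raise ValueError("at least one non-empty marker set is required")

    completeness = 0.0
    contamination = 0.0
    for markers in sets:
        # Markers seen at least once count towards completeness.
        present = sum(1 for g in markers if copy_counts.get(g, 0) >= 1)
        # Every copy beyond the first counts towards contamination.
        extra = sum(max(copy_counts.get(g, 0) - 1, 0) for g in markers)
        completeness += present / len(markers)
        contamination += extra / len(markers)

    n = len(sets)
    return 100 * completeness / n, 100 * contamination / n


if __name__ == "__main__":
    toy_sets = [{"a", "b"}, {"c", "d", "e", "f"}]
    toy_counts = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1}
    print(estimate_quality(toy_sets, toy_counts))
```

Averaging per set rather than per gene keeps a cluster of co-located markers that is lost or duplicated together from dominating the estimate.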

Address of the bookmark: http://ecogenomics.github.io/CheckM/

BASE: a practical de novo assembler for large genomes using long NGS reads

Rahul Nayak — Fri, 19 Oct 2018 07:25:21 -0500

BASE is a new de novo assembler. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. BASE uses such seeds to build extension trees and then applies reverse validation to prune branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of the reads sharing each seed. These consensus sequences are then extended into contigs.
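
The notion of an adaptive seed can be illustrated with a toy sketch: grow a seed from the start of a read until it occurs rarely enough among the reads to be plausibly unique in the genome. This is not BASE's implementation. The function names, the naive substring counting (a stand-in for BASE's read index), and the `min_len` and `max_count` thresholds are all assumptions made for illustration.

```python
from typing import List, Optional


def occurrences(seed: str, reads: List[str]) -> int:
    """Count non-overlapping occurrences of seed across all reads (naive scan)."""
    return sum(read.count(seed) for read in reads)


def adaptive_seed(
    read: str, reads: List[str], min_len: int = 4, max_count: int = 1
) -> Optional[str]:
    """Return the shortest prefix of `read`, at least `min_len` long, that
    occurs at most `max_count` times in `reads`, or None if none qualifies."""
    for length in range(min_len, len(read) + 1):
        seed = read[:length]
        if occurrences(seed, reads) <= max_count:
            return seed
    return None


if __name__ == "__main__":
    toy_reads = ["ACGTACGA", "ACGTTTGA", "ACGTACCC"]
    print(adaptive_seed("ACGTACGA", toy_reads))
```

With real data, `max_count` would be chosen relative to the expected read coverage rather than fixed at 1, since a unique genomic position is still covered by many reads.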

Address of the bookmark: https://github.com/dhlbh/BASE

HaploTypo: a variant-calling pipeline for phased genomes

Jit — Thu, 19 Dec 2019 07:33:40 -0600

An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome.

Availability and Implementation

HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image.

Address of the bookmark: https://github.com/gabaldonlab/haplotypo