BOL: Related items

Proksee: in-depth characterization and visualization of bacterial genomes

LEGE — Tue, 09 May 2023 19:38:52 -0500

Proksee is an expert system for genome assembly, annotation and visualization. To begin using Proksee, provide a complete genome sequence, sequencing reads or a CGView/Proksee map JSON file.

Address of the bookmark: https://proksee.ca/

SKESA: strategic k-mer extension for scrupulous assemblies

Jit — Wed, 14 Nov 2018 04:45:41 -0600

SKESA is a de Bruijn graph-based de novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MEGAHIT shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources.
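To make the idea of a de Bruijn graph assembler concrete, here is a minimal Python sketch. It is not SKESA's algorithm: the function names (`build_de_bruijn`, `extend_unambiguous`), the toy reads and the value of k are all assumptions chosen for illustration, and real assemblers also handle sequencing errors, reverse complements and coverage thresholds.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Extend a contig from `start` while there is exactly one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # avoid looping forever on a cycle
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical toy reads that overlap each other.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)
```

Because the walk only follows nodes with a single distinct successor, the same reads always yield the same contig, which loosely mirrors the reproducibility property described above.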

Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.

Address of the bookmark: https://github.com/ncbi/SKESA/releases

Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

Neel — Sun, 03 Apr 2022 20:35:19 -0500

Merfin is a k-mer-based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and the variant caller's internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.
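The general idea of validating a variant against read k-mers can be sketched in a few lines of Python. This is a deliberately simplified presence/absence check, not Merfin's actual scoring, which compares expected and observed k-mer multiplicities. All names here (`count_kmers`, `variant_is_supported`, the toy sequences) are hypothetical, and only single-base substitutions are handled.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmers_over_site(seq, pos, k):
    """Return every k-mer of `seq` that overlaps 0-based position `pos`."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(first, last + 1)]


def missing_kmers(seq, pos, k, read_counts):
    """Number of k-mers over `pos` that never appear in the reads."""
    return sum(1 for kmer in kmers_over_site(seq, pos, k) if read_counts[kmer] == 0)


def variant_is_supported(ref, pos, alt, read_counts, k):
    """Accept a substitution only if it reduces the number of unseen k-mers."""
    edited = ref[:pos] + alt + ref[pos + 1:]
    before = missing_kmers(ref, pos, k, read_counts)
    after = missing_kmers(edited, pos, k, read_counts)
    return after < before


# Toy data: the draft has a T at position 4 where the reads support a C.
draft = "ACGTTAGCATT"
reads = ["ACGTCAGC", "GTCAGCATT"]
counts = count_kmers(reads, k=4)
print(variant_is_supported(draft, 4, "C", counts, k=4))
print(variant_is_supported(draft, 4, "G", counts, k=4))
```

The check never looks at an alignment or a caller's quality score, only at the k-mers the edit would create, which is the property the description above emphasizes.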

More at https://www.nature.com/articles/s41592-022-01445-y

Address of the bookmark: https://github.com/arangrhie/merfin

Large Language Models in Bioinformatics: Transforming Data Analysis and Interpretation

LEGE — Thu, 02 Jan 2025 11:26:29 -0600

The integration of artificial intelligence (AI) into bioinformatics has ushered in a new era of computational biology. Among the most transformative advancements are large language models (LLMs), such as GPT and BERT, which leverage deep learning to process and interpret vast amounts of text data. These models are reshaping bioinformatics by enhancing data analysis, hypothesis generation, and literature mining.

Understanding Large Language Models

LLMs are AI systems trained on extensive datasets of natural language. Their ability to model context, identify patterns, and generate coherent language has proven invaluable across domains, including bioinformatics. By fine-tuning these models on biological datasets, researchers can unlock insights into molecular biology, systems biology, and beyond.

Key Applications of LLMs in Bioinformatics

1. Annotating Biological Data

Annotating genomic and proteomic data is fundamental yet labor-intensive. LLMs streamline this process by extracting functional annotations from literature and databases, predicting gene and protein functions, and providing automated insights.

2. Mining Scientific Literature

The exponential growth of publications presents a challenge for researchers to stay updated. LLMs can process large volumes of text to extract key findings, summarize papers, and identify trends, thereby facilitating efficient literature reviews.

3. Predicting Gene and Protein Functions

By leveraging sequence data and annotations, LLMs can predict the functions of uncharacterized genes and proteins. This capability is particularly useful for studying non-model organisms and orphan genes.

4. Drug Discovery and Repurposing

LLMs enable pattern recognition across chemical, genomic, and clinical datasets, which can help identify novel drug candidates and existing drugs that could be repurposed for new therapeutic targets. They can also help predict interactions between drugs and biological molecules, which may accelerate the discovery pipeline.

5. Generating Hypotheses for Research

LLMs analyze complex datasets to propose testable hypotheses. For example, they can predict protein-protein interactions, identify regulatory motifs, or model evolutionary processes in genomes.

Advantages of LLMs in Bioinformatics

Scalability: LLMs process massive datasets rapidly, reducing the time required for data analysis.
Versatility: These models adapt to diverse bioinformatics tasks, from genomic annotation to network analysis.
Contextual Insights: By synthesizing information across disparate datasets, LLMs provide integrative insights into biological systems.

Challenges in Applying LLMs

Despite their promise, LLMs face limitations:

Data Quality and Bias: Inaccurate or biased datasets can affect model predictions, necessitating rigorous data curation.
Interpretability: Understanding the decision-making process of LLMs remains a critical challenge, especially in high-stakes fields like genomics and medicine.
Resource Intensity: Training and deploying LLMs require substantial computational power, which can limit accessibility.
Ethical Concerns: Handling sensitive genomic data raises privacy and security issues, emphasizing the need for ethical guidelines.

Future Prospects

The continued development of LLMs tailored for bioinformatics promises exciting advancements. Specialized models trained on omics data, open-access platforms, and interdisciplinary collaborations will expand the utility of LLMs. Moreover, integrating LLMs with other AI technologies, such as graph neural networks and reinforcement learning, can unlock deeper biological insights.

Conclusion

Large language models are revolutionizing bioinformatics by addressing longstanding challenges in data annotation, literature mining, and function prediction. Their ability to analyze complex biological datasets efficiently positions them as indispensable tools for modern research. As bioinformatics embraces AI, the synergy between LLMs and biological sciences holds the potential to unravel the complexities of life with unprecedented precision and scale.

ClueGO: a Cytoscape plug-in that visualizes the non-redundant biological terms for large clusters of genes

BioStar — Thu, 13 Aug 2020 10:24:29 -0500

ClueGO is a Cytoscape plug-in that visualizes the non-redundant biological terms for large clusters of genes in a functionally grouped network, and it can be used in combination with GOlorize. The identifiers can be uploaded from a text file or interactively from a Cytoscape network. The types of identifiers supported can be easily extended by the user. ClueGO performs single-cluster analysis and comparison of clusters. From the ontology sources used, the terms are selected by different filter criteria. Related terms that share similar associated genes can be fused to reduce redundancy. The ClueGO network is created with kappa statistics and reflects the relationships between the terms based on the similarity of their associated genes. On the network, the node colour can be switched between functional groups and cluster distribution. ClueGO charts underline the specificity and the common aspects of the biological role. The significance of the terms and groups is automatically calculated. ClueGO is easily updatable with the newest files from Gene Ontology and KEGG.
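As an illustration of how a kappa statistic can measure the similarity of two terms from their associated genes, the following Python sketch computes Cohen's kappa from gene membership. The function name `kappa_score`, the gene identifiers and the gene universe are assumptions made for this example; ClueGO's own implementation and thresholds may differ.

```python
def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for two terms, treating each gene as in or out of a term."""
    genes_a, genes_b, universe = set(genes_a), set(genes_b), set(universe)
    n = len(universe)
    if n == 0:
        raise ValueError("gene universe must not be empty")

    both = len(genes_a & genes_b & universe)
    only_a = len((genes_a - genes_b) & universe)
    only_b = len((genes_b - genes_a) & universe)
    neither = n - both - only_a - only_b

    # Observed agreement: genes on which the two terms agree (both in or both out).
    observed = (both + neither) / n
    # Agreement expected by chance from each term's overall size.
    expected = (
        (both + only_a) * (both + only_b)
        + (only_b + neither) * (only_a + neither)
    ) / (n * n)
    if expected == 1:
        return 0.0  # kappa is undefined here; treated as uninformative (an assumption)
    return (observed - expected) / (1 - expected)


# Hypothetical gene universe and two terms that share three of their four genes.
universe = [f"g{i}" for i in range(10)]
term_a = ["g0", "g1", "g2", "g3"]
term_b = ["g0", "g1", "g2", "g4"]
score = kappa_score(term_a, term_b, universe)
print(round(score, 3))
```

In a network built this way, two terms would be linked when their score exceeds a chosen cutoff, so terms with largely overlapping gene sets end up in the same functional group.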

Address of the bookmark: http://www.ici.upmc.fr/cluego/

HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers

Jit — Mon, 21 Jan 2019 06:50:05 -0600

A computational pipeline for genome-wide detection of putative horizontal gene transfer (HGT) events based on sequence homology search hit distribution statistics

Authors: Qiyun Zhu (qiyunzhu@gmail.com), Katharina Dittmar (katharinad@gmail.com)

Affiliation: Department of Biological Sciences, University at Buffalo, State University of New York, Buffalo, USA

Zhu Q, Kosoy M, Dittmar K. HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers. BMC Genomics. 2014. 15:717.

Usage: Execute perl HGTector.pl, or open GUI.html in a web browser for a step-by-step wizard.

Download HGTector 0.2.2.

Address of the bookmark: https://github.com/DittmarLab/HGTector

Pollard Lab

Fri, 25 Sep 2020 20:20:50 -0500

We are a bioinformatics research lab focused on developing novel methods and using them to study genome evolution, organization, and regulation. Our mission is to decode biomedical knowledge that is missed without rigorous statistical approaches.

http://docpollard.org/

Tools

http://docpollard.org/resources/software/

Methods in comparative genomics

Jit — Wed, 09 Nov 2016 16:29:24 -0600

We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change.

We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes.

We present novel methods for the systematic de novo identification of regulatory motifs. The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on the genome-wide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs.
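The reading-frame conservation idea mentioned above can be illustrated with a small Python sketch. It is only a simplified stand-in for the published test: the function name `frame_conservation`, the toy alignments and the choice to score a single fixed frame offset are assumptions. Given a gapped pairwise alignment, it reports the fraction of aligned bases that sit in the same codon position in both sequences.

```python
def frame_conservation(aligned_ref, aligned_other):
    """Fraction of aligned bases whose codon position matches in both sequences."""
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have equal length")

    ref_pos = other_pos = 0   # ungapped coordinates in each sequence
    aligned = in_frame = 0
    for r, o in zip(aligned_ref, aligned_other):
        if r != "-" and o != "-":
            aligned += 1
            if ref_pos % 3 == other_pos % 3:
                in_frame += 1
        if r != "-":
            ref_pos += 1
        if o != "-":
            other_pos += 1
    return in_frame / aligned if aligned else 0.0


ref = "ATGAAACCCGGG"
# A 3-base gap keeps the frame; a 1-base gap shifts everything downstream.
codon_gap = frame_conservation(ref, "ATG---CCCGGG")
frameshift = frame_conservation(ref, "ATG-AACCCGGG")
print(codon_gap, frameshift)
```

Gaps whose lengths are multiples of three leave the score high, as expected under selection to preserve a protein, while a single-base gap drops it sharply, which is the signal a frame-based gene test relies on.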

Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast, and will be invaluable in the study of complex genomes like that of human.

Address of the bookmark: http://web.mit.edu/manoli/www/publications/Kellis_JCB_04.pdf

SRF Bioinformatics job position in National Institute of Plant Genome Research (NIPGR)

Mon, 19 Sep 2016 05:43:38 -0500

Title: “Transcriptome and small RNA diversity analysis of developing seed contrasting rice varieties”
Qualification: Candidates with an M.Sc./M.Tech. degree or equivalent (with minimum 60% marks) in Bioinformatics and a minimum of two years of post-M.Sc./M.Tech. research experience are eligible to apply.
No. of posts: 01
How to apply: Applications should reach Dr. Pinky Agarwal, Staff Scientist, National Institute of Plant Genome Research (NIPGR), Aruna Asaf Ali Marg, P.O. Box No. 10531, New Delhi - 110067, on or before 30/09/2016.

More at http://www.nipgr.res.in/careers/vacancies_latest.php#

Phytozome v12.1: plant science community hub for accessing plant genomic data

Surabhi Chaudhary — Tue, 17 Mar 2020 07:30:17 -0500

Phytozome, the Plant Comparative Genomics portal of the Department of Energy's Joint Genome Institute, provides JGI users and the broader plant science community a hub for accessing, visualizing and analyzing JGI-sequenced plant genomes, as well as selected genomes and datasets that have been sequenced elsewhere. As of release v12.1.6, Phytozome hosts 93 assembled and annotated genomes, from 82 Viridiplantae species. More than half of these genomes have been sequenced, assembled and/or annotated with JGI Plant Science program resources. By integrating this large collection of plant genomes into a single resource and performing comprehensive and uniform annotation and analyses, Phytozome facilitates accurate and insightful comparative genomics studies.

Address of the bookmark: https://phytozome.jgi.doe.gov/pz/portal.html