BOL: Related items

Generate interactive codon usage plots

Jit — Wed, 28 Feb 2018 03:47:40 -0600

Generate interactive codon usage plots as used at ensembl.lepbase.org. The input file format can be generated from an Ensembl database using the export_json.pl script from the easy-import pipeline.

live demo

Address of the bookmark: https://github.com/rjchallis/codon-usage

BamView: a free interactive display of read alignments in BAM data files

Neel — Fri, 09 Nov 2018 13:43:22 -0600

To run the application on UNIX from the downloaded jar file, run:

java -mx512m -jar BamView.jar

Extra command-line options are listed when '-h' is used:

java -jar BamView.jar -h

BAM files can be specified on the command line with the '-a' option:

java -mx512m -jar BamView.jar -a pathToFile/sorted.bam

If a BAM filename is not given on the command line, BamView will prompt for a file to be entered. The BAM index file should have the same name as the BAM file but with a '.bai' suffix. Multiple BAM files can be loaded and overlaid in the viewer. To make this easier, BamView will read in files that contain a list of filenames.
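The index naming rule above can be checked before launching the viewer. The following is a minimal sketch in Python, not part of BamView; the helper names expected_index and missing_indexes are hypothetical, and it assumes only the rule stated above (index name = BAM name plus '.bai').

```python
from pathlib import Path


def expected_index(bam_path):
    """Return the index path implied by the rule above: BAM filename + '.bai'."""
    return Path(str(bam_path) + ".bai")


def missing_indexes(bam_paths):
    """Return the BAM paths whose '.bai' index file is not present on disk."""
    return [str(p) for p in bam_paths if not expected_index(p).exists()]


# Hypothetical path taken from the command-line example above.
print(expected_index("pathToFile/sorted.bam"))
```

Running missing_indexes over the files you plan to overlay gives a quick list of BAM files that still need indexing.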

Address of the bookmark: http://bamview.sourceforge.net/

Large Language Models in Bioinformatics: Transforming Data Analysis and Interpretation

LEGE — Thu, 02 Jan 2025 11:26:29 -0600

The integration of artificial intelligence (AI) into bioinformatics has ushered in a new era of computational biology. Among the most transformative advancements are large language models (LLMs), such as GPT and BERT, which leverage deep learning to process and interpret vast amounts of text data. These models are reshaping bioinformatics by enhancing data analysis, hypothesis generation, and literature mining.

Understanding Large Language Models

LLMs are AI systems trained on extensive datasets of natural language. Their ability to model context, identify patterns, and generate coherent language has proven invaluable across domains, including bioinformatics. By fine-tuning these models on biological datasets, researchers can unlock insights into molecular biology, systems biology, and beyond.

Key Applications of LLMs in Bioinformatics

1. Annotating Biological Data

Annotating genomic and proteomic data is fundamental yet labor-intensive. LLMs streamline this process by extracting functional annotations from literature and databases, predicting gene and protein functions, and providing automated insights.

2. Mining Scientific Literature

The exponential growth of publications makes it hard for researchers to stay current. LLMs can process large volumes of text to extract key findings, summarize papers, and identify trends, thereby facilitating efficient literature reviews.

3. Predicting Gene and Protein Functions

By leveraging sequence data and annotations, LLMs can predict the functions of uncharacterized genes and proteins. This capability is particularly useful for studying non-model organisms and orphan genes.

4. Drug Discovery and Repurposing

LLMs enable pattern recognition across chemical, genomic, and clinical datasets, identifying novel drug candidates and repurposing existing drugs for new therapeutic targets. They can simulate interactions between drugs and biological molecules, accelerating the discovery pipeline.

5. Generating Hypotheses for Research

LLMs analyze complex datasets to propose testable hypotheses. For example, they can predict protein-protein interactions, identify regulatory motifs, or model evolutionary processes in genomes.

Advantages of LLMs in Bioinformatics

Scalability: LLMs process massive datasets rapidly, reducing the time required for data analysis.
Versatility: These models adapt to diverse bioinformatics tasks, from genomic annotation to network analysis.
Contextual Insights: By synthesizing information across disparate datasets, LLMs provide integrative insights into biological systems.

Challenges in Applying LLMs

Despite their promise, LLMs face limitations:

Data Quality and Bias: Inaccurate or biased datasets can affect model predictions, necessitating rigorous data curation.
Interpretability: Understanding the decision-making process of LLMs remains a critical challenge, especially in high-stakes fields like genomics and medicine.
Resource Intensity: Training and deploying LLMs require substantial computational power, which can limit accessibility.
Ethical Concerns: Handling sensitive genomic data raises privacy and security issues, emphasizing the need for ethical guidelines.

Future Prospects

The continued development of LLMs tailored for bioinformatics promises exciting advancements. Specialized models trained on omics data, open-access platforms, and interdisciplinary collaborations will expand the utility of LLMs. Moreover, integrating LLMs with other AI technologies, such as graph neural networks and reinforcement learning, can unlock deeper biological insights.

Conclusion

Large language models are revolutionizing bioinformatics by addressing longstanding challenges in data annotation, literature mining, and function prediction. Their ability to analyze complex biological datasets efficiently positions them as indispensable tools for modern research. As bioinformatics embraces AI, the synergy between LLMs and biological sciences holds the potential to unravel the complexities of life with unprecedented precision and scale.

ClueGO: a Cytoscape plug-in that visualizes the non-redundant biological terms for large clusters of genes

BioStar — Thu, 13 Aug 2020 10:24:29 -0500

ClueGO is a Cytoscape plug-in that visualizes the non-redundant biological terms for large clusters of genes in a functionally grouped network, and it can be used in combination with GOlorize. The identifiers can be uploaded from a text file or interactively from a Cytoscape network. The types of identifiers supported can easily be extended by the user. ClueGO performs single-cluster analysis and comparison of clusters. From the ontology sources used, the terms are selected by different filter criteria. Related terms that share similar associated genes can be fused to reduce redundancy. The ClueGO network is created with kappa statistics and reflects the relationships between the terms based on the similarity of their associated genes. On the network, the node colour can be switched between functional groups and cluster distribution. ClueGO charts underline the specificity and the common aspects of the biological role. The significance of the terms and groups is calculated automatically. ClueGO is easily updated with the newest files from Gene Ontology and KEGG.
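To illustrate what a kappa-based similarity between two terms can look like, here is a minimal Python sketch of Cohen's kappa computed on binary gene membership. This is an assumption about the general technique, not ClueGO's actual code; the function name term_kappa and the toy gene sets are hypothetical.

```python
def term_kappa(genes_a, genes_b, universe_size):
    """Cohen's kappa for two terms, treating each gene as in/out of each term."""
    genes_a, genes_b = set(genes_a), set(genes_b)
    both = len(genes_a & genes_b)               # genes annotated to both terms
    only_a = len(genes_a - genes_b)             # genes in term A only
    only_b = len(genes_b - genes_a)             # genes in term B only
    neither = universe_size - both - only_a - only_b
    n = universe_size
    observed = (both + neither) / n             # observed agreement
    expected = ((both + only_a) * (both + only_b)
                + (only_b + neither) * (only_a + neither)) / (n * n)
    if expected == 1:                           # both terms cover all or no genes
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example: two terms sharing 2 of their 4 genes in a universe of 10 genes.
print(round(term_kappa({1, 2, 3, 4}, {3, 4, 5, 6}, 10), 3))  # 0.167
```

Terms whose pairwise score exceeds a chosen threshold would then be linked in the network; the threshold itself is a user setting and is not specified here.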

Address of the bookmark: http://www.ici.upmc.fr/cluego/

LAST

Archana Malhotra — Wed, 09 Mar 2016 14:27:01 -0600

LAST can:

Handle big sequence data, e.g.:
- Compare two vertebrate genomes
- Align billions of DNA reads to a genome
Indicate the reliability of each aligned column.
Use sequence quality data properly.
Compare DNA to proteins, with frameshifts.
Compare PSSMs to sequences.
Calculate the likelihood of chance similarities between random sequences.
Do split and spliced alignment.
Train alignment parameters for unusual kinds of sequence (e.g. nanopore).

Address of the bookmark: http://last.cbrc.jp/

gVolante: Completeness Assessment of Genome/Transcriptome Sequences

Neel — Sun, 13 Jan 2019 07:03:25 -0600

A brand-new web server, gVolante, provides an online tool for (i) on-demand completeness assessment of sequence sets by means of the previously developed pipelines CEGMA and BUSCO and (ii) browsing pre-computed completeness scores for publicly available data in its database section.

Address of the bookmark: https://gvolante.riken.jp/analysis.html

GenBank release 257.0 is now available!

Neel — Wed, 23 Aug 2023 00:23:23 -0500

GenBank release 257.0 (8/15/2023) is now available on the NCBI FTP site. This release has 25.10 trillion bases and 3.69 billion records. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/08/21/genbank-release-257/

The current release has:

246,119,175 traditional records containing 2,112,058,517,945 base pairs of sequence data
2,631,493,489 WGS records containing 22,294,446,104,543 base pairs of sequence data
686,271,945 bulk-oriented TSA records containing 646,176,166,908 base pairs of sequence data
124,421,006 bulk-oriented TLS records containing 48,289,699,026 base pairs of sequence data
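As a quick arithmetic check, the four categories listed above do add up to the headline totals. The short Python sketch below uses only the figures quoted in this post; the dictionary layout is just one convenient way to hold them.

```python
# (records, base pairs) per category, copied from the list above.
release_257 = {
    "traditional": (246_119_175, 2_112_058_517_945),
    "WGS": (2_631_493_489, 22_294_446_104_543),
    "TSA": (686_271_945, 646_176_166_908),
    "TLS": (124_421_006, 48_289_699_026),
}

total_records = sum(records for records, _ in release_257.values())
total_bases = sum(bases for _, bases in release_257.values())

print(f"{total_records:,} records = {total_records / 1e9:.2f} billion")
print(f"{total_bases:,} bases = {total_bases / 1e12:.2f} trillion")
```

The sums come to 3,688,305,615 records and 25,100,970,488,422 bases, which round to the 3.69 billion records and 25.10 trillion bases stated above.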

UPhO: Scripts for homology and orthology assessment from genomic sequences.

BioStar — Mon, 14 Jan 2019 10:36:42 -0600

UPhO finds orthologs with and without inparalogs from input gene family trees. Refer to the Documentation.pdf for more detailed explanations on its usage, installation and dependencies. Type UPhO.py -h for help.

The only input requirement for UPhO is a tree (or trees) in Newick format in which the leaves are named with a species identifier, a field separator, and a sequence identifier. By default, the field separator is the character "|", but custom delimiters can be defined. Examples of trees to test UPhO are provided in the TestData folder.
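The leaf naming convention can be checked before running UPhO. The sketch below is a standalone Python illustration and does not use UPhO's own code; the function parse_leaves and the example tree are hypothetical, and the simple regular expression assumes unquoted leaf labels.

```python
import re


def parse_leaves(newick, delimiter="|"):
    """Return (species, sequence_id) pairs for each leaf of a Newick string."""
    # A leaf label follows '(' or ',' and ends at ':', ',', ')' or ';'.
    labels = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    pairs = []
    for label in labels:
        species, sep, seq_id = label.strip().partition(delimiter)
        if not sep:
            raise ValueError(f"leaf {label!r} lacks the delimiter {delimiter!r}")
        pairs.append((species, seq_id))
    return pairs


# Hypothetical gene family tree with species|sequence leaf names.
tree = "((Hsap|g1:0.1,Mmus|g2:0.2):0.05,Dmel|g3:0.3);"
print(parse_leaves(tree))
# [('Hsap', 'g1'), ('Mmus', 'g2'), ('Dmel', 'g3')]
```

A leaf without the separator raises a ValueError, which flags trees that need renaming before they are passed to UPhO.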

Address of the bookmark: https://github.com/ballesterus/UPhO

Sequence Tube Maps: displays multiple genomic sequences in the form of a tube map

Jit — Wed, 11 Mar 2020 01:12:06 -0500

A JavaScript module for the visualization of genomic sequence graphs. It automatically generates a "tube map"-like visualization of sequence graphs which have been created with vg (https://github.com/vgteam/vg).

Link to working demo: https://vgteam.github.io/sequenceTubeMap/

Address of the bookmark: https://github.com/vgteam/sequenceTubeMap

Steps to find palindromes in genomes

BioStar — Thu, 09 Mar 2023 02:56:54 -0600

In DNA, a palindrome is a sequence that reads the same 5' to 3' on both strands, that is, it equals its own reverse complement (for example, GAATTC). Palindromes occur in genomes and have various biological functions. Here are some methods for discovering palindromes in genomes:

Direct sequence search: One of the simplest ways to discover palindromes is to search the genome sequence directly for palindromic sequences using pattern matching tools, such as regular expressions or string algorithms. This approach can be useful for discovering simple palindromes, but may miss more complex palindromic structures.
Dot plot analysis: Dot plot analysis is a graphical method that can be used to identify palindromic regions in a genome. It involves plotting the genome sequence against itself and examining the patterns that emerge. When matches to the reverse complement are plotted, palindromic regions (inverted repeats) appear as lines running perpendicular to the main diagonal.
Restriction enzyme analysis: Some restriction enzymes, such as EcoRI and HindIII, recognize palindromic sequences and cleave DNA at these sites. By digesting the genome with these enzymes and examining the resulting fragments, palindromic regions can be identified.
Long-read sequencing: High-throughput long-read technologies, such as PacBio and Oxford Nanopore, generate reads that can span entire palindromic regions. By mapping these reads to the genome, palindromic regions can be identified and characterized.
Comparative genomics: Comparing the genomes of related species can also reveal palindromic regions that are conserved across evolutionarily divergent lineages. This approach can help identify functional palindromes that are under selective pressure.

Overall, the discovery of palindromic sequences in genomes can be accomplished using a variety of methods, each with its own advantages and limitations. A combination of these methods can provide a comprehensive understanding of the palindromic landscape of a genome.
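The direct sequence search described above can be sketched in a few lines of Python. This is a minimal brute-force illustration, assuming a palindrome means a sequence equal to its own reverse complement; the function names and length limits are hypothetical choices, and the approach is only practical for short windows on modest sequence lengths.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq, min_len=4, max_len=12):
    """Return (start, length, subsequence) for each reverse-complement palindrome."""
    seq = seq.upper()
    hits = []
    first = min_len + (min_len % 2)  # such palindromes always have even length
    for length in range(first, max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            # Skip windows with ambiguous bases such as N.
            if set(window) <= set("ACGT") and window == reverse_complement(window):
                hits.append((start, length, window))
    return hits


# The EcoRI site GAATTC embedded in a short toy sequence.
print(find_palindromes("TTGAATTCAA", min_len=6, max_len=6))
# [(2, 6, 'GAATTC')]
```

Only even lengths are scanned because an odd-length sequence has a central base that would have to be its own complement. Imperfect palindromes or inverted repeats separated by a spacer need a more tolerant search than this exact-match scan.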