BOL: Related items

Murasaki

Anjana — Fri, 30 Sep 2016 10:22:30 -0500

Murasaki is an anchor alignment program that is:

extremely fast (17 CPU hours for the whole Human x Mouse genome, or 35 wall minutes on 40 nodes; 8 mammals in 21 CPU hours, or 42 wall minutes)
scalable (arbitrarily parallelizable across multiple nodes using MPI)
memory efficient (even a single node with 16 GB of RAM can handle over 1 Gbp of sequence)
unlimited by pattern length or selection
repeat tolerant

Address of the bookmark: http://murasaki.dna.bio.keio.ac.jp/wiki/index.php?Murasaki

Data Visualization in Bioinformatics: Useful and Eye-Catching Plots for Data Analysis

LEGE — Sat, 14 Dec 2024 12:41:53 -0600

Data visualization is a cornerstone of bioinformatics, enabling researchers to interpret complex datasets effectively. With so many data types (genomic sequences, expression profiles, protein interactions, and more), the right visualization can make or break an analysis. This post highlights some of the most useful and visually compelling plots for bioinformatics data analysis, along with tools to create them.

1. Heatmaps: Exploring Patterns in High-Dimensional Data

Heatmaps are a go-to visualization for representing high-dimensional datasets, such as gene expression or metabolomics data. They use color gradients to display data intensity, making patterns and clusters easily detectable.

Applications: Gene expression analysis, pathway enrichment, methylation studies.
Tools: Seaborn (Python), ComplexHeatmap (R), Morpheus (web-based).

Tip: Add dendrograms to visualize clustering of rows and columns for hierarchical relationships.

2. Volcano Plots: Highlighting Differential Features

Volcano plots are indispensable for identifying significantly differentially expressed genes or proteins. They plot the log2 fold change against –log10(p-value), making it easy to spot statistically significant changes.

Applications: RNA-seq, proteomics, and metabolomics.
Tools: ggplot2 (R), EnhancedVolcano (R), Plotly (Python).

Tip: Use color to highlight significant features and label key genes or proteins.
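
As a rough sketch of the data preparation behind a volcano plot, the snippet below converts fold changes and p-values into plot coordinates and a color category. The gene names, the values, and both cutoffs (|log2 fold change| of at least 1, p below 0.05) are illustrative assumptions, not values from a real dataset or a recommended standard.

```python
import math

# Hypothetical input: (gene, fold change treated/control, p-value).
results = [
    ("geneA", 4.0, 0.0001),
    ("geneB", 0.25, 0.001),
    ("geneC", 1.1, 0.5),
    ("geneD", 3.0, 0.2),
]

LOG2FC_CUTOFF = 1.0  # assumed effect-size threshold
P_CUTOFF = 0.05      # assumed significance threshold


def volcano_point(fold_change, p_value):
    """Return (x, y, category) for one feature on a volcano plot."""
    x = math.log2(fold_change)   # x-axis: log2 fold change
    y = -math.log10(p_value)     # y-axis: -log10(p-value)
    if p_value < P_CUTOFF and x >= LOG2FC_CUTOFF:
        category = "up"
    elif p_value < P_CUTOFF and x <= -LOG2FC_CUTOFF:
        category = "down"
    else:
        category = "not significant"
    return x, y, category


points = {gene: volcano_point(fc, p) for gene, fc, p in results}
for gene, (x, y, category) in points.items():
    print(f"{gene}: log2FC={x:.2f}, -log10(p)={y:.2f}, {category}")
```

The category can then be mapped to a color in whichever plotting library is used, and only the "up" and "down" features need labels.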

3. PCA Plots: Reducing Complexity with Principal Component Analysis

Principal Component Analysis (PCA) plots are used to reduce dimensionality and uncover trends or clusters in data. They provide insights into sample variability and grouping.

Applications: Transcriptomics, metabolomics, microbiome studies.
Tools: scikit-learn + Matplotlib (Python), prcomp (R), ClustVis (web-based).

Tip: Annotate clusters with metadata to enhance interpretability.

4. Manhattan Plots: Genome-Wide Association Studies

Manhattan plots visualize p-values across the genome, making it easy to identify significant associations in genome-wide studies. They resemble city skylines, with the highest peaks indicating loci of interest.

Applications: GWAS, QTL mapping.
Tools: qqman (R), Matplotlib (Python).

Tip: Use alternating colors for chromosomes and highlight significant SNPs for clarity.
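
The layout of a Manhattan plot can be sketched as follows: each SNP's x-coordinate is its position plus the summed length of all preceding chromosomes, and its y-coordinate is -log10(p). The chromosome lengths and SNPs are made up for illustration, and the 5e-8 threshold is the commonly used genome-wide significance level, assumed here rather than taken from the text.

```python
import math

# Hypothetical chromosome lengths (bp) and SNP results: (chrom, position, p-value).
chrom_lengths = {"1": 1000, "2": 800, "3": 600}
snps = [
    ("1", 100, 0.01),
    ("1", 900, 1e-9),
    ("2", 50, 0.3),
    ("3", 300, 2e-8),
]

GENOME_WIDE_P = 5e-8            # assumed significance threshold
COLORS = ["steelblue", "grey"]  # alternating chromosome colors

# Offset of each chromosome on a single shared x-axis.
offsets = {}
running_total = 0
for chrom, length in chrom_lengths.items():
    offsets[chrom] = running_total
    running_total += length

chrom_order = list(chrom_lengths)
points = []
for chrom, pos, p in snps:
    points.append({
        "x": offsets[chrom] + pos,                          # cumulative genomic position
        "y": -math.log10(p),                                # peak height
        "color": COLORS[chrom_order.index(chrom) % 2],      # alternate per chromosome
        "significant": p < GENOME_WIDE_P,                   # candidates to highlight
    })

for point in points:
    print(point)
```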

5. Circular Plots (Circos): Visualizing Genomic Relationships

Circular plots are ideal for visualizing relationships across the genome, such as structural variations, gene duplications, or synteny.

Applications: Comparative genomics, structural variation studies.
Tools: Circos (standalone), RCircos (R), pyCircos (Python).

Tip: Keep the plot clean and avoid overcrowding to maintain readability.

6. Sankey Diagrams: Tracking Data Flows

Sankey diagrams visualize flows or relationships between categories, often used to track changes in gene expression or pathway enrichment across conditions.

Applications: Pathway analysis, gene set enrichment analysis.
Tools: Plotly (Python), networkD3 (R).

Tip: Use gradients or distinct colors to highlight key transitions.

7. Network Graphs: Mapping Interactions

Network graphs represent relationships between entities, such as protein-protein interactions or gene regulatory networks. Nodes represent entities, and edges represent relationships.

Applications: Systems biology, interactomics.
Tools: Cytoscape (standalone), igraph (R), NetworkX (Python).

Tip: Use edge thickness or node size to represent interaction strength or centrality.
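
One way to derive those visual encodings, sketched with the standard library only: degree centrality sets the node size and an interaction score sets the edge width. The protein names, scores, and scaling constants are hypothetical; libraries such as NetworkX offer their own centrality functions.

```python
from collections import defaultdict

# Hypothetical undirected interactions: (protein_a, protein_b, confidence score).
edges = [
    ("P1", "P2", 0.9),
    ("P1", "P3", 0.4),
    ("P1", "P4", 0.7),
    ("P2", "P3", 0.6),
]

# Build an adjacency map from the edge list.
neighbors = defaultdict(set)
for a, b, _score in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

n_nodes = len(neighbors)

# Degree centrality: fraction of the other nodes each node is connected to.
degree_centrality = {
    node: len(nbrs) / (n_nodes - 1) for node, nbrs in neighbors.items()
}

# Simple visual encodings (arbitrary scaling chosen for illustration).
node_size = {node: 100 + 400 * c for node, c in degree_centrality.items()}
edge_width = {(a, b): 1 + 4 * score for a, b, score in edges}

for node in sorted(degree_centrality):
    print(node, round(degree_centrality[node], 2), round(node_size[node]))
```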

8. Violin Plots: Visualizing Data Distribution

Violin plots combine a boxplot with a density plot, showing the distribution and variability of data.

Applications: Single-cell RNA-seq, quantitative trait analysis.
Tools: Seaborn (Python), ggplot2 (R).

Tip: Split violins by groups for side-by-side comparisons.

9. Time-Series Plots: Monitoring Changes Over Time

Time-series plots display changes in variables across time points, useful for tracking gene expression dynamics or metabolic fluxes.

Applications: Time-course experiments, cell cycle studies.
Tools: Matplotlib (Python), ggplot2 (R).

Tip: Smooth the data to highlight trends while avoiding overfitting.
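
A centered moving average is one simple smoothing option, sketched below. The time points and expression values are invented, and the window size is an assumption to tune per dataset.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the series edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical time course: hours after treatment and expression level.
time_points_h = [0, 2, 4, 6, 8, 10]
expression = [1.0, 3.0, 2.0, 6.0, 4.0, 8.0]

smoothed = moving_average(expression, window=3)
for t, raw, smooth in zip(time_points_h, expression, smoothed):
    print(f"{t:>2} h  raw={raw:.1f}  smoothed={smooth:.2f}")
```

A wider window gives a smoother curve but can flatten short-lived peaks, so it is worth plotting the raw points alongside the smoothed line.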

10. Genome Tracks: Visualizing Genomic Features

Genome tracks display multiple layers of genomic data, such as gene annotations, sequencing coverage, and epigenetic marks.

Applications: ChIP-seq, ATAC-seq, whole-genome sequencing.
Tools: IGV (standalone), pyGenomeTracks (Python).

Tip: Stack related tracks for direct comparisons.

11. UpSet Plots: Visualizing Set Intersections

UpSet plots are a powerful alternative to Venn diagrams for visualizing intersections between multiple datasets.

Applications: Overlap analysis for gene sets, pathways, or variants.
Tools: UpSetR (R), ComplexUpset (R), UpSetPlot (Python).

Tip: Use bar plots to represent the size of each intersection for added clarity.
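
The numbers behind an UpSet plot are the sizes of the exclusive intersections, meaning each element is counted once under the exact combination of sets that contain it. A minimal sketch with made-up gene sets:

```python
from collections import Counter

# Hypothetical gene sets to compare.
gene_sets = {
    "A": {"g1", "g2", "g3", "g4", "g7"},
    "B": {"g2", "g3", "g5", "g8"},
    "C": {"g3", "g4", "g6", "g8"},
}

# For every element, record exactly which sets contain it.
membership = Counter()
all_elements = set().union(*gene_sets.values())
for element in all_elements:
    combo = tuple(name for name in sorted(gene_sets) if element in gene_sets[name])
    membership[combo] += 1

# Largest intersections first, as in the bar chart of an UpSet plot.
ranked = sorted(membership.items(), key=lambda kv: (-kv[1], kv[0]))
for combo, size in ranked:
    print(" & ".join(combo), size)
```

Because the counts are exclusive, they add up to the total number of distinct elements, which makes a quick sanity check before plotting.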

12. Ridge Plots: Comparing Distributions

Ridge plots visualize the distributions of multiple datasets, stacked for easy comparison.

Applications: Transcriptomics, single-cell RNA-seq.
Tools: ggridges (R), Matplotlib (Python).

Tip: Use transparency and consistent scaling for better readability.

13. Chord Diagrams: Visualizing Connections Between Groups

Chord diagrams illustrate relationships between categories, such as shared genes between pathways or overlaps in regulatory elements.

Applications: Pathway overlap, synteny, co-expression networks.
Tools: circlize (R), HoloViews (Python).

Tip: Use distinct colors for each group to emphasize relationships.

14. Treemaps: Hierarchical Data Representation

Treemaps visualize hierarchical data as nested rectangles, with area proportional to data size.

Applications: Ontology enrichment, pathway analysis.
Tools: treemapify (R), Plotly (Python).

Tip: Use colors to represent additional variables, like significance or enrichment scores.

15. t-SNE/UMAP Plots: Dimensionality Reduction for Clustering

t-SNE and UMAP plots project high-dimensional data into two dimensions. t-SNE mainly preserves local neighborhoods, while UMAP aims to retain more of the global structure as well.

Applications: Single-cell transcriptomics, clustering analyses.
Tools: scikit-learn (Python, t-SNE), umap-learn (Python, UMAP), Seurat (R).

Tip: Combine with metadata annotations for better cluster interpretation.

Bringing It All Together

The choice of visualization can significantly impact the insights gained from bioinformatics data. By selecting plots tailored to your data type and analysis goals, you can effectively communicate your findings and make your research more impactful. Whether you’re a seasoned bioinformatician or a beginner, mastering these visualizations will elevate your analyses and presentations.

MIRO : miRNA omics

Jit — Tue, 04 Oct 2016 14:50:48 -0500

The MIRO (miRNA omics) pipeline is a flexible and powerful tool for analyzing miRNA (or, more generally, short RNA) expression using short-read deep sequencing data. In its present implementation, MIRO is especially adapted to reads generated with the Illumina sequencing platform. MIRO can preprocess the Solexa reads, map them flexibly to several reference genomes using one of four different mappers, create differential gene (miRNA) expression profiles, and cluster reads using one of several algorithms. MIRO output is also compatible with software such as genome browsers and miRDeep.

Address of the bookmark: http://seq.crg.es/download/software/Miro/

Protein function annotation and machine learning - UPMC - Paris, France

Sat, 02 Aug 2014 01:22:52 -0500

Job Description: We are looking for an excellent postdoc with interests in protein functional annotation, machine learning and computer grids. The position is open for 3.5 years at the Université Pierre et Marie Curie, in the heart of Paris.

Research topic: Protein function annotation, multiple probabilistic models, domain architecture, machine learning, combinatorial optimization, computer grid.

Title: A novel integrative platform for large scale protein annotation that exploits a multitude of diversified probabilistic models in several protein signature databases.

We propose a novel integrated approach for large-scale protein annotation that will exploit an unprecedented amount of genomic data as well as sophisticated machine learning techniques and combinatorial optimization approaches, taking advantage of High Performance Computing (HPC) environments. The idea is to uncover, as far as possible, the evolutionary processes of protein sequences that took place throughout the whole tree of life and that affected the evolution of a protein family. We have already demonstrated in previous work that the problem of functional annotation is inherent to the ability to uncover such paths. Now, we shall extend this approach to large-scale genome annotation by considering 11 different protein databases, comprising about 10^9 protein sequences, and by producing a large pool of diversified probabilistic models coding for about 10^7 evolutionary protein pathways. Such models will be used to search for specific domains in genomes to be annotated. Our previous methodology needs to be fundamentally improved to deal with this large amount of biological data. In this project, we shall work on the algorithms to reduce the space of models and the search complexity, and we shall implement some important algorithmic changes towards the realization of a powerful integrated annotation tool.

Where: This project is run at the Laboratoire de Biologie Computationnelle et Quantitative UMR7238 CNRS-UPMC, in the Analytical Genomics team headed by A. Carbone. It is co-advised with Pierre-Henri Wuillemin, Laboratoire d’Informatique de Paris 6, Equipe DECISION.

Start date: September 1st, 2014
Contact Person: Alessandra Carbone
Contact: alessandra.carbone@lip6.fr

flo

Jitendra Narayan — Wed, 10 Feb 2016 10:52:32 -0600

flo - same species annotations lift over pipeline

Lift over is the process of transferring annotations from one genome assembly to another. Usually lift over is done because there is a new, improved genome assembly for the species and good quality annotations (maybe manually curated or experimentally verified) are available on the old assembly.

The idea is simple: align the new assembly with the old one (e.g., with BLAT), process the alignment data to define how a coordinate or coordinate range on the old assembly should be transformed to the new assembly (e.g., as a chain file), and then transform the coordinates (e.g., with liftOver).
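
To illustrate the final coordinate-transformation step, here is a toy sketch that maps positions through a list of ungapped aligned blocks. The block list, the function names, and the rule that an interval must fall inside a single block are all simplifying assumptions for illustration. This is not how flo or liftOver is implemented, and real chain files also carry strand, scores, and gap information.

```python
# Hypothetical ungapped aligned blocks: (old_start, new_start, length), 0-based.
blocks = [
    (0, 0, 100),      # region unchanged between assemblies
    (100, 150, 200),  # shifted by 50 bp in the new assembly
    (350, 350, 100),  # old positions 300-349 have no counterpart
]


def find_block(pos, blocks):
    """Return the block containing an old-assembly position, or None."""
    for block in blocks:
        old_start, _new_start, length = block
        if old_start <= pos < old_start + length:
            return block
    return None


def lift_position(pos, blocks):
    """Map a single old-assembly position to the new assembly."""
    block = find_block(pos, blocks)
    if block is None:
        return None  # position falls in an unaligned region
    old_start, new_start, _length = block
    return new_start + (pos - old_start)


def lift_interval(start, end, blocks):
    """Map a half-open interval [start, end); both ends must share one block."""
    first = find_block(start, blocks)
    last = find_block(end - 1, blocks)
    if first is None or first != last:
        return None  # simplification: intervals spanning blocks are rejected
    shift = first[1] - first[0]
    return start + shift, end + shift


print(lift_position(120, blocks))       # inside the shifted block
print(lift_interval(110, 130, blocks))  # annotation fully inside one block
print(lift_interval(90, 110, blocks))   # spans two blocks, not lifted here
```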

Address of the bookmark: https://github.com/wurmlab/flo

EAGER

Jit — Sat, 10 Dec 2016 18:07:23 -0600

The automated reconstruction of genome sequences in ancient genome analysis is a multifaceted process.

EAGER encompasses both state-of-the-art tools for each step as well as new complementary tools tailored for ancient DNA data within a single integrated solution in an easily accessible format.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0918-z

Address of the bookmark: https://github.com/apeltzer/EAGER-GUI

GenomeView: genome browser and annotation editor

Rahul Nayak — Wed, 02 Jan 2019 04:09:06 -0600

GenomeView is a genome browser and annotation editor that displays reference sequence, annotation, multiple alignments, short read alignments and graphs. Most major data formats are supported. Local and internet files can be loaded.
This project has moved to GitHub: https://github.com/GenomeView/genomeview

Address of the bookmark: https://sourceforge.net/projects/genomeview/

JCVI utility libraries

Jit — Sat, 08 May 2021 22:04:02 -0500

Collection of Python libraries to parse bioinformatics files, or perform computation related to assembly, annotation, and comparative genomics.

Address of the bookmark: https://github.com/tanghaibao/jcvi

Cgaln

Jit — Wed, 22 Feb 2017 05:14:15 -0600

Cgaln (Coarse grained alignment) is a program designed to align a pair of whole genomic sequences, from bacterial genomes up to entire vertebrate chromosomes, on a nominal desktop computer. Cgaln performs an alignment job in two steps, first at the block level and then at the nucleotide level. The former "coarse-grained" alignment can explore genomic rearrangements and reduce the regions to be analyzed in the next step. The latter is devoted to detailed alignment within the limited regions found in the first stage. The output of Cgaln is 'glocal' in the sense that rearrangements are taken into consideration while each alignable region is extended as far as possible. Thus, Cgaln is not only fast and memory-efficient, but can also filter noisy outputs without missing the most important homologous segment pairs.

Address of the bookmark: http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

HiTE: a fast and accurate dynamic boundary adjustment approach for full-length Transposable Elements detection and annotation in Genome Assemblies

LEGE — Sat, 20 Sep 2025 09:34:04 -0500

HiTE is a Python tool that uses a dynamic boundary adjustment approach to detect and annotate full-length transposable elements (TEs) in genome assemblies. According to its authors, HiTE detects a greater number of full-length TEs than other tools.

panHiTE

We have developed panHiTE, a comprehensive and accurate pipeline for TE detection in large-scale population genomes. It has been successfully applied to hundreds of plant population genomes, demonstrating its effectiveness and scalability.

For detailed instructions, please refer to the panHiTE tutorial.

Address of the bookmark: https://github.com/CSU-KangHu/HiTE