BOL: Related items

GenomeTools: The versatile open source genome analysis software

Neel — Wed, 02 Feb 2022 04:00:21 -0600

The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named gt. It is based on a C library named “libgenometools” which consists of several modules.

If you are interested in gene prediction, have a look at GenomeThreader.

http://genometools.org/pub/

Address of the bookmark: http://genometools.org/

Structural variation: the hidden genomic treasure

Jit — Sat, 10 Dec 2016 16:19:09 -0600

Genome re-sequencing projects have revealed substantial amounts of genetic variation between individuals extending beyond single nucleotide polymorphisms (SNPs) and short indels. Structural Variations (SVs) and Copy Number Variations (CNVs) are a major source of genomic variation. However, compared to SNPs, accurate detection, genotyping and understanding of CNVs is lagging behind due to much greater analytical challenges related to SV/CNV detection and analysis. In our lab we analyse SVs/CNVs using high-throughput sequencing and different analytical approaches. The most‐studied structural variants are copy number variations (CNVs) which can be generated by several different mechanisms including non‐allelic homologous recombination, non‐homologous end‐joining and deoxyribonucleic acid (DNA) replication‐related fork stalling and template switching. CNVs are closely related to segmental duplications (SDs): SDs can stimulate the formation of CNVs and themselves started out as CNVs, but became fixed in a species. Structural variation can be neutral but has also influenced our phenotypic evolution, for example our susceptibility to disease and our ability to digest certain types of food. Our understanding of the extent of structural variation is increasing rapidly, but it will be much more difficult to understand its phenotypic consequences.

Structural variants (SVs) such as deletions, insertions, duplications, inversions and translocations litter genomes and are often associated with gene expression changes and severe phenotypes (e.g., genetic diseases in humans). Recent studies on the functional aspects of different types of SVs have unveiled several cases of adaptive evolution. For example, inversions have been associated with ecological adaptations and may facilitate speciation. Due to their prevalent nature, SVs arguably have a large impact on genome evolution and should not be neglected when studying the genetics of adaptation and speciation. SVs were classically defined as chromosomal rearrangements larger than 1 kb, but due to the higher resolution of new detection methods, smaller variants (between 50 and 1000 base pairs) can now be accurately assessed. Besides various methods of detection in next-generation sequencing data (paired-end mapping, split reads, and depth of coverage), array-based approaches have proven to be particularly useful for detecting copy number variations (CNVs). These technologies have enabled researchers to catalog a wide spectrum of SVs in many organisms and infer the effects of selection shaping their evolutionary trajectories.

Structural variation sequencing signatures (Source: Nature Reviews Genetics)
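
Of the sequencing signatures mentioned above, depth of coverage is the simplest to illustrate. The following is only a minimal sketch, not the method of any tool listed here: it assumes per-window read depths are already available for a sample and a matched control, and it uses illustrative log2-ratio thresholds (about 0.58 for a gain and -1.0 for a loss, roughly what a heterozygous duplication or deletion would give in a diploid genome). The function name and the toy depth values are hypothetical.

```python
import math

def call_cnv_windows(sample_depth, control_depth, gain=0.58, loss=-1.0):
    """Label each window by the log2 ratio of sample to control read depth."""
    calls = []
    for s, c in zip(sample_depth, control_depth):
        if c == 0 or s == 0:
            # log2 is undefined here: zero sample depth over a covered
            # control window is a loss candidate, otherwise there is no data.
            calls.append("loss" if c > 0 else "no_data")
            continue
        ratio = math.log2(s / c)
        if ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

# Toy depths per window (hypothetical values, assumption for illustration).
sample = [30, 31, 15, 14, 29, 46, 45, 30]
control = [30, 30, 30, 30, 30, 30, 30, 30]
calls = call_cnv_windows(sample, control)
print(calls)
```

Real read-depth callers additionally correct for GC content and mappability and merge adjacent windows into segments, which this sketch leaves out.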

Related tools, databases and publications are listed below. If you know of any interesting papers, please let us know in the comments section:

Key concepts

Structural variation includes balanced variants such as inversions and translocations, and unbalanced ones such as duplications and deletions (copy number variations or CNVs).

Structural variants can arise by several mechanisms, including nonallelic homologous recombination (NAHR), nonhomologous end‐joining (NHEJ) and DNA replication‐based fork stalling and template switching (FoSTeS).

CNV is closely linked to segmental duplication, but is not exactly the same. Segmental duplications can stimulate CNV formation by NAHR, and themselves arise from CNVs that have become fixed.

Segmental duplications did not appear uniformly during the evolution of the Great Ape species, but rather during a burst of activity around the time of the divergence of gorilla from the human/chimpanzee ancestor.

Duplicated genes play a critical role in the evolution of a genome as they act as ‘spare parts’ that can evolve to perform new or more specialized functions.

Effects of structural variation on gene expression can be identified but only a few examples of the consequences for species biology have been documented.

Tools

CNVnator: a tool for CNV discovery and genotyping from depth of read mapping (2011a, 2011b).

AGE: a tool that implements an algorithm for optimal alignment of sequences with SVs (2011).

BreakSeq: a pipeline for annotation, classification and analysis of SVs at single-nucleotide resolution (2010).

PEMer: a computational and simulation framework for discovering SVs by paired-end read mapping (2009, 2007).

GASV https://code.google.com/archive/p/gasv/

PAIROSCOPE http://pairoscope.sourceforge.net/

SVDetect http://svdetect.sourceforge.net/Site/Home.html

BreakPtr, discovery of unbalanced structural variants (copy-number variants) with tiling microarrays

R Package https://www.bioconductor.org/help/course-materials/2010/EMBL2010/Practical-4-StructuralVariants.pdf

BreakSeq, structural variant genotyping using split reads

CopySeq, genotyping of unbalanced structural variants (copy-number variants) using read-depth

DELLY2, integrated structural variant discovery, genotyping and visualization in deep sequencing data

PEMer, structural variant discovery in 454 sequencing data by paired-end mapping

TIGER, transduction inference in germline genomes using short read data

MANTA https://github.com/Illumina/manta

SV-Bay https://github.com/InstitutCurie/SV-Bay

BreakDancer http://breakdancer.sourceforge.net/

Variation Hunter http://compbio.cs.sfu.ca/software-variation-hunter

Lumpy https://github.com/arq5x/lumpy-sv

ForestSV http://sebatlab.ucsd.edu/index.php/software-data

PBSuites for long reads https://sourceforge.net/projects/pb-jelly/

Visualization

The SV visualization tool: http://genomesavant.com/savant/

InGAP-SV (http://ingap.sourceforge.net/), a nice tool for both detection and visualisation of several kinds of structural variation (large insertions, translocations, deletions, inversions, ...)

Tools table: http://www.nature.com/nbt/journal/v29/n8/fig_tab/nbt.1904_T2.html

Variation Viewer https://www.ncbi.nlm.nih.gov/variation/view/

Papers

http://www.nature.com/nmeth/journal/v9/n2/full/nmeth.1858.html

http://journal.frontiersin.org/researchtopic/1412/structural-variations-in-genomes-ecological-and-evolutionary-implications

http://www.mi.fu-berlin.de/wiki/pub/ABI/GenomicsLecture10Materials/structural-variation.pdf

http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1479-3

https://www.ncbi.nlm.nih.gov/dbvar/content/overview/

http://www.nature.com/subjects/structural-variation

https://eichlerlab.gs.washington.edu/news/NatMeth_Feb2012.pdf

https://www.ncbi.nlm.nih.gov/pubmed/19477992 ***

https://www.ncbi.nlm.nih.gov/pubmed/22452995

http://biorxiv.org/content/early/2016/09/06/073833

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479793/

http://www.nature.com/articles/srep18501

http://www.genetics.org/content/202/1/351

http://www.cs.cmu.edu/~sssykim/teaching/s13/slides/Lecture_SVI.pdf

https://www.omicsonline.org/open-access/structural-variation-detection-from-next-generation-sequencing-2469-9853-S1-007.php?aid=69055

http://schatzlab.cshl.edu/presentations/2016/2016.01.12.PAG.Structural%20Variations.pdf

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.
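
The three filtering steps above can be sketched roughly as follows. This is not pyScaf's actual code: the `Hit` record, its field names, the greedy overlap removal by score, and the cut-off values passed in the example are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record for one contig-to-reference match (assumed fields).
Hit = namedtuple("Hit", "query ref ref_start ref_end identity overlap score")

def filter_hits(hits, min_identity, min_overlap):
    # 1. Drop matches that do not satisfy the identity/overlap cut-offs.
    passing = [h for h in hits
               if h.identity >= min_identity and h.overlap >= min_overlap]
    # 2. Keep only the best-scoring match of each query contig.
    best = {}
    for h in passing:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    # 3. Greedily remove matches that overlap a better match on the reference.
    kept = []
    for h in sorted(best.values(), key=lambda h: h.score, reverse=True):
        clash = any(k.ref == h.ref
                    and h.ref_start < k.ref_end
                    and k.ref_start < h.ref_end
                    for k in kept)
        if not clash:
            kept.append(h)
    # Return contigs in reference order, ready for scaffolding.
    return sorted(kept, key=lambda h: (h.ref, h.ref_start))

hits = [
    Hit("c1", "chr1", 0, 1000, 0.95, 0.90, 900),
    Hit("c1", "chr2", 0, 1000, 0.92, 0.90, 800),     # suboptimal match of c1
    Hit("c2", "chr1", 500, 1500, 0.93, 0.80, 700),   # overlaps c1 on chr1
    Hit("c3", "chr1", 2000, 3000, 0.97, 0.95, 950),
    Hit("c4", "chr1", 4000, 4500, 0.50, 0.90, 300),  # fails identity cut-off
]
ordered = filter_hits(hits, min_identity=0.8, min_overlap=0.5)
print([h.query for h in ordered])
```

Once matches are ordered along the reference, the gap between adjacent contigs can be estimated from the distance between the end of one match and the start of the next.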

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, correctly reconstructing all chromosomes in every run for CANPA and in nearly every run for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

Data Visualization in Bioinformatics: Useful and Eye-Catching Plots for Data Analysis

LEGE — Sat, 14 Dec 2024 12:41:53 -0600

Data visualization is a cornerstone of bioinformatics, enabling researchers to interpret complex datasets effectively. With a plethora of data types—genomic sequences, expression profiles, protein interactions, and more—the right visualizations can make or break an analysis. This blog highlights some of the most useful and visually compelling plots for bioinformatics data analysis, along with tools to create them.

1. Heatmaps: Exploring Patterns in High-Dimensional Data

Heatmaps are a go-to visualization for representing high-dimensional datasets, such as gene expression or metabolomics data. They use color gradients to display data intensity, making patterns and clusters easily detectable.

Applications: Gene expression analysis, pathway enrichment, methylation studies.
Tools: Seaborn (Python), ComplexHeatmap (R), Morpheus (web-based).

Tip: Add dendrograms to visualize clustering of rows and columns for hierarchical relationships.

2. Volcano Plots: Highlighting Differential Features

Volcano plots are indispensable for identifying significantly differentially expressed genes or proteins. They plot the log2 fold change against –log10(p-value), making it easy to spot statistically significant changes.

Applications: RNA-seq, proteomics, and metabolomics.
Tools: ggplot2 (R), EnhancedVolcano (R), Plotly (Python).

Tip: Use color to highlight significant features and label key genes or proteins.
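
As a rough illustration of what goes into a volcano plot, the sketch below turns a differential-expression table into plot coordinates and a significance label that can drive the colors. The cut-offs (absolute log2 fold change of 1, p-value of 0.05), the function name and the gene names are assumptions for the example, not recommendations.

```python
import math

def volcano_points(results, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (gene, x, y, status) with x = log2 fold change, y = -log10(p)."""
    points = []
    for gene, log2fc, pval in results:
        y = -math.log10(pval)  # pval must be > 0
        if pval < p_cutoff and log2fc >= lfc_cutoff:
            status = "up"
        elif pval < p_cutoff and log2fc <= -lfc_cutoff:
            status = "down"
        else:
            status = "ns"  # not significant, or effect too small
        points.append((gene, log2fc, y, status))
    return points

# Hypothetical results: (gene, log2 fold change, p-value).
results = [
    ("geneA", 2.5, 0.0001),
    ("geneB", -1.8, 0.001),
    ("geneC", 0.3, 0.2),
    ("geneD", 3.0, 0.3),
]
points = volcano_points(results)
for p in points:
    print(p)
```

The `status` column maps directly onto a color aesthetic in ggplot2 or a `color` argument in Plotly, and the genes labelled "up" or "down" are natural candidates for text labels.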

3. PCA Plots: Reducing Complexity with Principal Component Analysis

Principal Component Analysis (PCA) plots are used to reduce dimensionality and uncover trends or clusters in data. They provide insights into sample variability and grouping.

Applications: Transcriptomics, metabolomics, microbiome studies.
Tools: scikit-learn + Matplotlib (Python), prcomp (R), ClustVis (web-based).

Tip: Annotate clusters with metadata to enhance interpretability.

4. Manhattan Plots: Genome-Wide Association Studies

Manhattan plots visualize p-values across the genome, making it easy to identify significant associations in genome-wide studies. They resemble city skylines, with the highest peaks indicating loci of interest.

Applications: GWAS, QTL mapping.
Tools: qqman (R), Matplotlib (Python).

Tip: Use alternating colors for chromosomes and highlight significant SNPs for clarity.
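
The layout behind a Manhattan plot can be sketched as below: each chromosome is shifted by the total length of the chromosomes before it, and the color index alternates between chromosomes. This is a minimal sketch under stated assumptions: the largest observed position stands in for the chromosome length, the 5e-8 line is the conventional genome-wide significance threshold, and the SNP records are toy values.

```python
import math

def manhattan_coords(snps, chrom_order, threshold=5e-8):
    """Compute x/y coordinates, alternating color index and a significance flag."""
    # Largest observed position per chromosome stands in for its length.
    chrom_max = {c: 0 for c in chrom_order}
    for chrom, pos, _ in snps:
        chrom_max[chrom] = max(chrom_max[chrom], pos)
    # Cumulative offset so chromosomes are laid out side by side.
    offsets, running = {}, 0
    for c in chrom_order:
        offsets[c] = running
        running += chrom_max[c]
    coords = []
    for chrom, pos, pval in snps:
        coords.append({
            "x": offsets[chrom] + pos,
            "y": -math.log10(pval),
            "color": chrom_order.index(chrom) % 2,  # alternate 0/1
            "significant": pval < threshold,
        })
    return coords

# Toy SNPs: (chromosome, position, p-value); values are hypothetical.
snps = [("1", 100, 0.01), ("1", 900, 1e-9), ("2", 50, 0.5), ("3", 400, 1e-3)]
coords = manhattan_coords(snps, ["1", "2", "3"])
for c in coords:
    print(c)
```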

5. Circular Plots (Circos): Visualizing Genomic Relationships

Circular plots are ideal for visualizing relationships across the genome, such as structural variations, gene duplications, or synteny.

Applications: Comparative genomics, structural variation studies.
Tools: Circos (standalone), RCircos (R), pyCircos (Python).

Tip: Keep the plot clean and avoid overcrowding to maintain readability.

6. Sankey Diagrams: Tracking Data Flows

Sankey diagrams visualize flows or relationships between categories, often used to track changes in gene expression or pathway enrichment across conditions.

Applications: Pathway analysis, gene set enrichment analysis.
Tools: Plotly (Python), networkD3 (R).

Tip: Use gradients or distinct colors to highlight key transitions.
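
A Sankey diagram is drawn from a list of links, each with a source node, a target node and a value. The sketch below derives those lists from per-gene categories in two conditions; the output uses integer node indices, which is the form Plotly's Sankey trace expects for its link source, target and value. The function name, the prefixes and the category labels are assumptions for illustration.

```python
from collections import Counter

def sankey_links(state_a, state_b, prefix_a="A", prefix_b="B"):
    """Count transitions between categories and index them as Sankey links."""
    flows = Counter(zip(state_a, state_b))
    labels = (sorted({f"{prefix_a}:{s}" for s in state_a})
              + sorted({f"{prefix_b}:{s}" for s in state_b}))
    index = {label: i for i, label in enumerate(labels)}
    source, target, value = [], [], []
    for (a, b), n in sorted(flows.items()):
        source.append(index[f"{prefix_a}:{a}"])
        target.append(index[f"{prefix_b}:{b}"])
        value.append(n)
    return labels, source, target, value

# Hypothetical expression category of five genes in two conditions.
control = ["up", "up", "down", "flat", "flat"]
treated = ["up", "down", "down", "flat", "up"]
labels, source, target, value = sankey_links(control, treated,
                                             "control", "treated")
print(labels)
print(source, target, value)
```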

7. Network Graphs: Mapping Interactions

Network graphs represent relationships between entities, such as protein-protein interactions or gene regulatory networks. Nodes represent entities, and edges represent relationships.

Applications: Systems biology, interactomics.
Tools: Cytoscape (standalone), igraph (R), NetworkX (Python).

Tip: Use edge thickness or node size to represent interaction strength or centrality.
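
To make the tip concrete, the sketch below computes degree centrality (the number of neighbours divided by n - 1, the same normalisation NetworkX uses) from a plain edge list and maps it to a node size. The protein names and the size scaling are placeholders, not real interaction data.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {node: len(nb) / (n - 1) for node, nb in neighbours.items()}

# Toy interaction network (hypothetical protein names).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
centrality = degree_centrality(edges)
# Map centrality to a node size for plotting (arbitrary scaling).
node_size = {node: 100 + 400 * c for node, c in centrality.items()}
print(centrality)
print(node_size)
```

The resulting dictionary can be passed as node sizes to a drawing function in NetworkX or imported as a node attribute into Cytoscape.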

8. Violin Plots: Visualizing Data Distribution

Violin plots combine a boxplot with a density plot, showing the distribution and variability of data.

Applications: Single-cell RNA-seq, quantitative trait analysis.
Tools: Seaborn (Python), ggplot2 (R).

Tip: Split violins by groups for side-by-side comparisons.

9. Time-Series Plots: Monitoring Changes Over Time

Time-series plots display changes in variables across time points, useful for tracking gene expression dynamics or metabolic fluxes.

Applications: Time-course experiments, cell cycle studies.
Tools: Matplotlib (Python), ggplot2 (R).

Tip: Smooth the data to highlight trends while avoiding overfitting.
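
One simple way to smooth a time course is a centred moving average, sketched below. It is one option among many (LOESS or splines are common alternatives), and the window size is an assumption that should be kept small relative to the number of time points so that real dynamics are not flattened out.

```python
def moving_average(values, window=3):
    """Centred moving average; the window shrinks at the series edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical expression values at five time points.
expression = [1.0, 3.0, 2.0, 6.0, 4.0]
smooth = moving_average(expression, window=3)
print(smooth)
```

Plotting the raw points together with the smoothed line keeps the underlying variability visible.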

10. Genome Tracks: Visualizing Genomic Features

Genome tracks display multiple layers of genomic data, such as gene annotations, sequencing coverage, and epigenetic marks.

Applications: ChIP-seq, ATAC-seq, whole-genome sequencing.
Tools: IGV (standalone), pyGenomeTracks (Python).

Tip: Stack related tracks for direct comparisons.

11. UpSet Plots: Visualizing Set Intersections

UpSet plots are a powerful alternative to Venn diagrams for visualizing intersections between multiple datasets.

Applications: Overlap analysis for gene sets, pathways, or variants.
Tools: UpSetR (R), ComplexUpset (R), upsetplot (Python).

Tip: Use bar plots to represent the size of each intersection for added clarity.
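
The bars of an UpSet plot show exclusive intersections: each item is counted once, under the exact combination of sets it belongs to. A minimal sketch of that counting step is shown below; the set names and gene identifiers are hypothetical.

```python
from collections import Counter

def exclusive_intersections(named_sets):
    """Count items by the exact combination of sets that contain them."""
    membership = {}
    for name, items in named_sets.items():
        for item in items:
            membership.setdefault(item, set()).add(name)
    counts = Counter(frozenset(names) for names in membership.values())
    # Largest intersections first, ties broken by set names.
    return sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))

# Hypothetical gene sets.
gene_sets = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g3", "g4", "g5"},
    "C": {"g4", "g6"},
}
intersections = exclusive_intersections(gene_sets)
for combo, size in intersections:
    print(sorted(combo), size)
```

Each row of the output corresponds to one bar (the size) and one column of filled dots in the matrix below it (the combination).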

12. Ridge Plots: Comparing Distributions

Ridge plots visualize the distributions of multiple datasets, stacked for easy comparison.

Applications: Transcriptomics, single-cell RNA-seq.
Tools: ggridges (R), Matplotlib (Python).

Tip: Use transparency and consistent scaling for better readability.

13. Chord Diagrams: Visualizing Connections Between Groups

Chord diagrams illustrate relationships between categories, such as shared genes between pathways or overlaps in regulatory elements.

Applications: Pathway overlap, synteny, co-expression networks.
Tools: circlize (R), HoloViews (Python).

Tip: Use distinct colors for each group to emphasize relationships.

14. Treemaps: Hierarchical Data Representation

Treemaps visualize hierarchical data as nested rectangles, with area proportional to data size.

Applications: Ontology enrichment, pathway analysis.
Tools: treemapify (R), Plotly (Python).

Tip: Use colors to represent additional variables, like significance or enrichment scores.

15. t-SNE/UMAP Plots: Dimensionality Reduction for Clustering

t-SNE and UMAP plots are great for visualizing high-dimensional data in two dimensions while preserving local or global structure.

Applications: Single-cell transcriptomics, clustering analyses.
Tools: scikit-learn and umap-learn (Python), Seurat (R).

Tip: Combine with metadata annotations for better cluster interpretation.

Bringing It All Together

The choice of visualization can significantly impact the insights gained from bioinformatics data. By selecting plots tailored to your data type and analysis goals, you can effectively communicate your findings and make your research more impactful. Whether you’re a seasoned bioinformatician or a beginner, mastering these visualizations will elevate your analyses and presentations.

AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references

Manisha Mishra — Tue, 17 Apr 2018 16:21:20 -0500

AlignGraph is software that extends and joins contigs or scaffolds by reassembling them with the help of a reference genome of a closely related organism.

Using AlignGraph

AlignGraph --read1 reads_1.fa --read2 reads_2.fa --contig contigs.fa --genome genome.fa --distanceLow distanceLow --distanceHigh distanceHigh --extendedContig extendedContigs.fa --remainingContig remainingContigs.fa [--kMer k --insertVariation insertVariation --coverage coverage --part p --fastMap --ratioCheck --iterativeMap --misassemblyRemoval --resume]
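
When AlignGraph is called from a pipeline, it can help to assemble the required arguments of the usage line above programmatically. The sketch below only builds and prints the command; it does not run AlignGraph. The helper name is hypothetical, and the distance values (100 and 1000) are placeholders that should be replaced with values appropriate for your library.

```python
import shlex

def aligngraph_command(read1, read2, contigs, genome,
                       distance_low, distance_high,
                       extended="extendedContigs.fa",
                       remaining="remainingContigs.fa"):
    """Build the argument list for the required AlignGraph options."""
    return [
        "AlignGraph",
        "--read1", read1,
        "--read2", read2,
        "--contig", contigs,
        "--genome", genome,
        "--distanceLow", str(distance_low),
        "--distanceHigh", str(distance_high),
        "--extendedContig", extended,
        "--remainingContig", remaining,
    ]

# Placeholder inputs taken from the usage line; distances are assumptions.
cmd = aligngraph_command("reads_1.fa", "reads_2.fa",
                         "contigs.fa", "genome.fa", 100, 1000)
print(shlex.join(cmd))
```

With AlignGraph installed, the list can be handed to `subprocess.run(cmd, check=True)`; optional flags from the bracketed part of the usage line can be appended in the same way.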

Address of the bookmark: https://github.com/baoe/AlignGraph

EAGLER: a scaffolding tool for long reads.

Jit — Mon, 04 Jun 2018 05:26:03 -0500

EAGLER is a scaffolding tool for long reads. The scaffolder takes as input a draft genome created by any NGS assembler and a set of long reads. The long reads are used to extend the contigs present in the NGS draft and possibly join overlapping contigs. EAGLER supports both PacBio and Oxford Nanopore reads.

The tool should be compatible with most UNIX flavors and has been successfully tested on the following operating systems:

Mac OS X 10.11.1
Mac OS X 10.10.3
Ubuntu 14.04 LTS

https://bib.irb.hr/datoteka/844447.Diplomski_2015_Luka_terbi.pdf

Address of the bookmark: https://github.com/mculinovic/EAGLER

Nanopolish: polish a genome assembly

Rahul Nayak — Thu, 26 Jul 2018 04:51:28 -0500

Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. It can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome, and more.

Quickstart

http://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html

Algorithms

http://simpsonlab.github.io/2017/06/30/nanopolish-v0.7.0/

Address of the bookmark: https://github.com/jts/nanopolish

Software and tools to detect structural variation with long reads

Archana Malhotra — Wed, 15 Mar 2017 14:31:09 -0500

Uncovering the connection between genetics and heritable diseases requires an approach that looks at all the variant bases and types in a genome. While a PacBio de novo assembly resolves the most novel SVs, 8-10X PacBio coverage of single genomes or trios reveals triple the SVs detectable by short-read data.

With Single Molecule, Real-Time (SMRT) Sequencing, you can access structural variations having a broad range of sizes, types, and GC content with the ability to:

Uncover missing heritability linked to structural variation
Unambiguously identify genomic context and variant breakpoints at the sequence level to unravel the genetic etiology of disease
Resolve structural variation across the complete size spectrum with basepair resolution

The following SV tools can help you achieve this goal.

Sniffles: Structural variation caller using third generation sequencing

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis. Please note that the current version of Sniffles requires sorted output from BWA-MEM (use the -M and -x parameters) or NGM-LR with the optional SAM attributes enabled!

More at https://github.com/fritzsedlazeck/Sniffles

MultiBreak-SV: It identifies structural variants from next-generation paired end data, third-generation long read data, or data from a combination of sequencing platforms.

There are two pieces of software in this release: (1) a pre-processor that takes machineformat (.m5) BLASR files, and (2) MultiBreak-SV. For installation and usage instructions, see doc/MultiBreakSV-Manual.txt.

More at https://github.com/raphael-group/multibreak-sv

Parliament: A Structural Variation Tool. Why ask a single SV-detection approach to find every variant when you can have a parliament of tools deciding?

Publication about the algorithm and “…the first long-read characterization of structural variation in a diploid human personal genome…” (HS1011) - “Assessing structural variation in a personal genome—towards a human reference diploid genome”

More at https://sourceforge.net/projects/parliamentsv/

https://www.dnanexus.com/papers/Parliament_Info_Sheet.pdf

PBHoney: the structural variation discovery tool

PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp). PBHoney considers both intra-read discordance and soft-clipped tails of long reads to identify structural variants.
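
The soft-clipped tails mentioned above are recorded in the CIGAR string of each alignment as `S` operations at the read ends. The sketch below is not PBHoney's implementation; it only shows, under the assumption of standard SAM CIGAR strings, how the length of the clipped tail on each side could be extracted as a first step towards flagging candidate breakpoints. The example CIGAR strings are made up.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_tails(cigar):
    """Return (left, right) soft-clip lengths of an alignment."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips, so ignore them.
    core = [o for o in ops if o[1] != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right

# Made-up CIGAR strings for illustration.
for cigar in ["1500S8500M", "5H200S9000M30I700M1200S", "10000M"]:
    print(cigar, soft_clipped_tails(cigar))
```

A caller would then keep reads whose tail exceeds some minimum length and look for clusters of such tails at the same reference position; that minimum is a tuning parameter not specified here.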

Read The Paper http://www.biomedcentral.com/1471-2105/15/180/abstract

More at https://sourceforge.net/projects/pb-jelly/

SMRT-SV: Structural variant and indel caller for PacBio reads

Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.

SMRT-SV provides an official software package for tools described in Chaisson et al. 2014 and adds several key features including the following.

Unified variant calling user interface with built-in cluster compute support
Small indel calling (2-49 bp)
Improved inversion calling (screenInversions)
Quality metric for SV calls based on number of local assemblies supporting each call
Higher sensitivity for SV calls using tiled local assemblies across the entire genome instead of "signature" regions
Genotyping of SVs with Illumina paired-end reads from WGS samples

More at https://github.com/EichlerLab/pacbio_variant_caller

QUAST-LG: Versatile genome assembly evaluation

Jit — Thu, 25 Oct 2018 10:46:55 -0500

QUAST-LG is a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce a concept of upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG, we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference.

AVAILABILITY AND IMPLEMENTATION:

http://cab.spbu.ru/software/quast-lg

Address of the bookmark: http://cab.spbu.ru/software/quast-lg/

Lifemap

Jit — Mon, 10 Apr 2017 05:42:37 -0500

Lifemap is an interactive tool to explore the WHOLE NCBI TAXONOMY. The concept used in Lifemap is similar to the one used in cartography with tools like Google Maps© or OpenStreetMap: exploring is done by zooming and panning.

The current tree contains ALL species present in NCBI taxonomy as of October 18th, 2016: 1,135,169 species including 10,545 Archaea, 418,777 Bacteria and 705,847 Eukaryotes. The Lifemap tree is updated every two weeks.

All the nodes in the tree are clickable. This displays various information and options:

The species name (and the associated common name if there is one)
The rank (kingdom, family, class, species...)
Ability to go to the corresponding node/species on NCBI web site (displayed in a new window)
Possibility to download the corresponding subtree in newick extended format
Possibility to get the whole lineage from the current node/tip to the root of the tree.

Address of the bookmark: http://lifemap-ncbi.univ-lyon1.fr/