BOL: Related items

Gepard: allows the calculation of dotplots even for large sequences like chromosomes or bacterial genomes

Jit — Mon, 26 Aug 2019 11:38:30 -0500

Gepard (German for "cheetah"; a backronym for "GEnome PAir - Rapid Dotter") computes dotplots even for large sequences such as chromosomes or bacterial genomes. Reference: Krumsiek J, Arnold R, Rattei T. Gepard: A rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics 2007; 23(8): 1026-8. PMID: 17309896
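For readers new to dotplots, here is a minimal sketch of the idea in Python. It is not Gepard's algorithm: it uses naive exact word matching, which only suits short sequences, and the function names and the word length are illustrative assumptions.

```python
from collections import defaultdict


def dotplot_matches(seq_a, seq_b, word=3):
    """Return sorted (i, j) pairs where seq_a[i:i+word] == seq_b[j:j+word]."""
    # Index every word of seq_b by its start position.
    index = defaultdict(list)
    for j in range(len(seq_b) - word + 1):
        index[seq_b[j:j + word]].append(j)
    # Look up each word of seq_a in that index.
    matches = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], ()):
            matches.append((i, j))
    return sorted(matches)


def render(seq_a, seq_b, matches):
    """Draw the matches as text: one row per seq_b position, one column per seq_a position."""
    dots = set(matches)
    rows = []
    for j in range(len(seq_b)):
        rows.append("".join("*" if (i, j) in dots else "." for i in range(len(seq_a))))
    return "\n".join(rows)


if __name__ == "__main__":
    a, b = "ACGTACGT", "TACGTA"
    print(render(a, b, dotplot_matches(a, b)))
```

Each "*" marks the start of a shared word; runs of them along a diagonal indicate a longer shared region, and a repeat shows up as parallel diagonals.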

http://cube.univie.ac.at/gepard

Address of the bookmark: https://github.com/univieCUBE/gepard

mutatrix: a population genome simulator which generates simulated genomes.

Jit — Tue, 28 Jan 2020 04:06:58 -0600

Genome simulation across a population with zeta-distributed allele frequencies, SNPs, insertions, deletions, and multi-nucleotide polymorphisms.

More at https://github.com/ekg/mutatrix

./mutatrix -S sample -P test/ -p 2 -n 10 reference.fasta
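To give a feel for what "zeta-distributed allele frequency" means, the sketch below draws alternate-allele counts from a zeta (power-law) distribution truncated at the number of chromosome copies in the population. It is an illustration only, not mutatrix's code: the exponent s, the truncation, and the function names are assumptions, as is the reading of -p 2 and -n 10 in the command above as ploidy and number of samples.

```python
import random


def truncated_zeta_weights(max_count, s=2.0):
    """Normalised weights P(k) proportional to k**-s for k = 1..max_count."""
    raw = [k ** -s for k in range(1, max_count + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_allele_counts(n_sites, n_samples, ploidy, s=2.0, seed=None):
    """For each site, draw how many chromosome copies carry the alternate allele."""
    rng = random.Random(seed)
    max_count = n_samples * ploidy  # chromosome copies in the whole population
    weights = truncated_zeta_weights(max_count, s)
    counts = range(1, max_count + 1)
    return rng.choices(counts, weights=weights, k=n_sites)


if __name__ == "__main__":
    # Assumed to mirror "-p 2 -n 10": ploidy 2, ten samples.
    print(sample_allele_counts(n_sites=15, n_samples=10, ploidy=2, seed=1))
```

Under this distribution most variants are rare (carried by one or two copies) and only a few are common.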

Address of the bookmark: https://github.com/ekg/mutatrix

PyParanoid: a pipeline for rapid identification of homologous gene families in a set of genomes

BioStar — Thu, 13 Aug 2020 10:06:19 -0500

PyParanoid is a pipeline for rapid identification of homologous gene families in a set of genomes, a central task of any comparative genomics analysis. The "gold standard" for identifying homologs is to use reciprocal best hits (RBHs), which depend on performing an all-vs-all sequence comparison, usually using BLAST, to determine homology. However, these methods are computationally expensive, requiring O(n^2) resources to identify RBHs. This is problematic, as the modern deluge of sequencing data means that comparative genomics analyses could be performed on datasets of thousands of strains.
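The reciprocal-best-hit rule mentioned above can be sketched in a few lines of Python. This shows only the classic RBH criterion for one pair of genomes, not PyParanoid's pipeline; the score tables stand in for parsed BLAST output, and the gene names and scores are made up for illustration.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to an alignment score such as a bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_gene, b_gene) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores for genome A searched against B, and B against A.
    a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 90, ("a3", "b1"): 120}
    b_vs_a = {("b1", "a1"): 198, ("b1", "a3"): 120, ("b2", "a2"): 88}
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Both search directions are needed for every pair of genomes, which is where the quadratic cost comes from as the number of genomes grows.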

Address of the bookmark: https://github.com/ryanmelnyk/PyParanoid

Proksee: in-depth characterization and visualization of bacterial genomes

LEGE — Tue, 09 May 2023 19:38:52 -0500

Proksee is an expert system for genome assembly, annotation and visualization. To begin using Proksee, provide a complete genome sequence, sequencing reads or a CGView/Proksee map JSON file.

Address of the bookmark: https://proksee.ca/

ShRec3D

Jitendra Narayan — Thu, 25 Dec 2014 23:14:52 -0600

ShRec3D is a program that reconstructs the 3D structure of a genome solely from the contacts between genomic regions, as determined by Hi-C (http://www.ncbi.nlm.nih.gov/pubmed/19815776).

There are two options for running ShRec3D (on Linux only so far): the first uses the MATLAB Compiler Runtime (MCR); the second needs no additional library but only works with recent versions of Linux (equivalent to Fedora 19 and above).

Address of the bookmark: https://sites.google.com/site/julienmozziconacci/#TOC-Downloads

pWhatsHap: a parallel, high-performance version of WhatsHap

Jit — Wed, 14 Nov 2018 08:20:27 -0600

Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered pWhatsHap, a parallel, high-performance version of WhatsHap. pWhatsHap is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WhatsHap, pWhatsHap exhibits the same complexity, exploring a number of possible solutions that is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures significantly reduces the execution time for haplotyping, while the results retain the same high accuracy as those of WhatsHap, which increases with coverage.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1170-y

Address of the bookmark: https://bitbucket.org/whatshap/whatshap

Merqury: reference-free quality and phasing assessment for genome assemblies

Jit — Sat, 06 Jun 2020 05:38:34 -0500

Often, genome assembly projects have Illumina whole-genome sequencing reads available for the assembled individual. The k-mer spectrum of this read set can be used to evaluate assembly quality independently, without the need for a high-quality reference. Merqury provides a set of tools for this purpose.
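As a rough illustration of reference-free evaluation with k-mers, the Python sketch below compares the k-mers of an assembly with those of a read set. It is a toy, in-memory stand-in and not Merqury's implementation. The function names and the min_read_count threshold are assumptions, and the consensus quality (QV) uses a k-mer survival model, E = 1 - (1 - K_asm_only / K_total)^(1/k) and QV = -10 log10(E), which should be checked against the Merqury documentation before relying on it.

```python
import math
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical_kmers(seq, k):
    """Yield each k-mer as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= _BASES:  # skip k-mers containing N or other symbols
            yield min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(sequences, k):
    """Count canonical k-mers over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(canonical_kmers(seq, k))
    return counts


def evaluate_assembly(assembly_seqs, read_seqs, k, min_read_count=1):
    """Return (completeness, qv) of an assembly judged against read k-mers."""
    asm = count_kmers(assembly_seqs, k)
    reads = count_kmers(read_seqs, k)
    # Read k-mers seen often enough to be trusted (threshold is an assumption).
    solid = {kmer for kmer, n in reads.items() if n >= min_read_count}
    total = sum(asm.values())
    if total == 0:
        raise ValueError("assembly contains no valid k-mers")
    # Assembly k-mers with no read support are treated as likely errors.
    asm_only = sum(n for kmer, n in asm.items() if kmer not in solid)
    found = sum(1 for kmer in solid if kmer in asm)
    completeness = found / len(solid) if solid else 0.0
    if asm_only == 0:
        return completeness, math.inf
    error_rate = 1 - (1 - asm_only / total) ** (1 / k)
    return completeness, -10 * math.log10(error_rate)
```

Real read sets need a dedicated k-mer counter such as meryl (linked below); Python dictionaries will not scale to whole-genome data.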

https://github.com/marbl/meryl

Address of the bookmark: https://github.com/marbl/merqury

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
overlapping matches on the reference, which are removed.
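The three filtering steps listed above can be sketched as follows. This is a hedged illustration in Python, not pyScaf's code: the Match record, its fields, and the greedy rule for resolving overlaps on the reference (keep the higher-scoring match) are assumptions, and the cut-offs are passed in explicitly rather than guessing the tool's defaults for --identity and --overlap.

```python
from collections import namedtuple

# Hypothetical record for one contig-to-reference match; ref_end is exclusive.
Match = namedtuple("Match", "query ref ref_start ref_end identity overlap score")


def filter_matches(matches, min_identity, min_overlap):
    """Apply cut-offs, keep the best match per query, drop reference overlaps."""
    # 1. Drop matches below the identity / overlap cut-offs.
    passing = [m for m in matches
               if m.identity >= min_identity and m.overlap >= min_overlap]
    # 2. Keep only the best-scoring match of each query contig.
    best = {}
    for m in passing:
        if m.query not in best or m.score > best[m.query].score:
            best[m.query] = m
    # 3. Visit matches from best to worst and keep one only if it does not
    #    overlap an already kept match on the same reference sequence.
    kept = []
    for m in sorted(best.values(), key=lambda x: x.score, reverse=True):
        clash = any(m.ref == o.ref and m.ref_start < o.ref_end and o.ref_start < m.ref_end
                    for o in kept)
        if not clash:
            kept.append(m)
    # Report the survivors in reference order, i.e. the proposed contig order.
    return sorted(kept, key=lambda x: (x.ref, x.ref_start))
```

Sorting the surviving matches by reference position gives the contig order, and the space between consecutive matches on the reference gives a first estimate of the gap between adjacent contigs.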

In preliminary tests, pyScaf performed very well on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, correctly reconstructing all chromosomes in every run for CANPA and in nearly every run for ARATH (Figures in Dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than with scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which break synteny.
pyScaf works very well if divergence between the reference genome and the assembled contigs is below 20% at the nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions, translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies.

Abhimanyu Singh — Tue, 23 May 2017 05:20:32 -0500

GRASS (GeneRic ASsembly Scaffolder) is a novel algorithm for scaffolding second-generation sequencing assemblies that can use diverse information sources. GRASS offers a mixed-integer programming formulation of the contig scaffolding problem, which combines contig order, distance and orientation in a single optimization objective. The resulting optimization problem is solved using an expectation-maximization procedure and an unconstrained binary quadratic programming approximation of the original problem. We compared GRASS with existing HTS scaffolders using Illumina paired reads of three bacterial genomes. Our algorithm constructs a comparable number of scaffolds, but makes fewer errors. This result is further improved when additional data, in the form of related genome sequences, are used.

Address of the bookmark: https://github.com/AlexeyG/GRASS

ARCS: scaffolding genome drafts with linked reads

Rahul Nayak — Tue, 06 Mar 2018 16:35:26 -0600

ARCS is an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts.
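The core idea of using barcodes as scaffolding evidence can be illustrated with a small Python sketch: contigs that receive reads carrying the same barcode probably come from the same long DNA molecule. This is a simplification and not ARCS itself. The input format, the function name, and the min_shared threshold are assumptions, and contig ends, orientation, and read-count filters are not modelled.

```python
from collections import Counter, defaultdict
from itertools import combinations


def barcode_links(read_alignments, min_shared=2):
    """Count barcodes shared by contig pairs.

    `read_alignments` is an iterable of (barcode, contig) pairs, one per
    aligned read. Returns {(contig_1, contig_2): shared_barcode_count} for
    pairs supported by at least `min_shared` barcodes.
    """
    contigs_by_barcode = defaultdict(set)
    for barcode, contig in read_alignments:
        contigs_by_barcode[barcode].add(contig)
    shared = Counter()
    for contigs in contigs_by_barcode.values():
        # Every pair of contigs hit by the same barcode gains one link.
        for pair in combinations(sorted(contigs), 2):
            shared[pair] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


if __name__ == "__main__":
    # Hypothetical alignments: barcodes bc1 and bc2 both hit ctgA and ctgB.
    alignments = [("bc1", "ctgA"), ("bc1", "ctgB"), ("bc2", "ctgA"),
                  ("bc2", "ctgB"), ("bc3", "ctgB"), ("bc3", "ctgC")]
    print(barcode_links(alignments))
```

Pairs that pass the threshold are candidate edges in a scaffold graph; a scaffolder then has to decide order and orientation from them.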

Address of the bookmark: https://github.com/bcgsc/ARCS/