BOL: Related items

Genome assembly stats plotting

Jit — Wed, 28 Feb 2018 03:45:39 -0600

A de novo genome assembly can be summarised by a number of metrics, including:

Overall assembly length
Number of scaffolds/contigs
Length of longest scaffold/contig
Scaffold/contig N50 and N90
Assembly base composition, in particular percentage GC and percentage Ns
CEGMA completeness
Scaffold/contig length/count distribution
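Several of the metrics listed above can be computed directly from the sequences. The following is a minimal Python sketch, not code from assembly-stats itself. The function name `assembly_stats` and the toy contigs are made up for illustration, and GC and N percentages are taken over the full assembly length (some tools exclude Ns from the GC denominator).

```python
def assembly_stats(sequences):
    """Summarise a list of contig/scaffold sequences given as plain strings."""
    lengths = sorted((len(s) for s in sequences), reverse=True)
    total = sum(lengths)

    def nx(fraction):
        # Walk from the longest sequence down and return the length at which
        # the running total first reaches `fraction` of the assembly length.
        running = 0
        for length in lengths:
            running += length
            if running >= fraction * total:
                return length
        return 0

    joined = "".join(sequences).upper()
    gc = joined.count("G") + joined.count("C")
    n = joined.count("N")
    return {
        "total_length": total,
        "count": len(lengths),
        "longest": lengths[0] if lengths else 0,
        "N50": nx(0.5),
        "N90": nx(0.9),
        # Assumption: percentages are relative to total length, Ns included.
        "percent_GC": 100.0 * gc / total if total else 0.0,
        "percent_N": 100.0 * n / total if total else 0.0,
    }


# Toy assembly with contig lengths 100, 40, 20 and 10.
contigs = ["ACGT" * 25, "GGCC" * 10, "ATNN" * 5, "AT" * 5]
stats = assembly_stats(contigs)
print(stats)
```

For this toy input the total length is 170, so N50 is 100 (the longest contig alone covers half the assembly) and N90 is 20.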

assembly-stats supports two widely used presentations of these values, tabular and cumulative length plots, and introduces an additional circular plot that summarises the most commonly used assembly metrics in a single visualisation. Each of these presentations is generated using JavaScript from a common (JSON) data structure, allowing toggling between alternative views, and each can be applied to a single assembly or to multiple assemblies to allow direct comparison of alternative assemblies.
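The data behind a cumulative length plot is simply a running total of sequence lengths, ordered longest first. A small Python sketch of that calculation is shown here. assembly-stats itself is written in JavaScript, so this illustrates the idea rather than the tool's code, and the function name is hypothetical.

```python
from itertools import accumulate


def cumulative_lengths(lengths):
    """Return running totals with sequences ordered longest first."""
    ordered = sorted(lengths, reverse=True)
    return list(accumulate(ordered))


# Same toy contig lengths in arbitrary input order.
curve = cumulative_lengths([10, 100, 20, 40])
print(curve)  # [100, 140, 160, 170]
```

Plotting these totals against sequence rank gives one curve per assembly, so a more contiguous assembly rises faster towards its total length.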

Tabular presentation allows direct comparison of exact values between assemblies. The limitations of this approach lie in the necessary omission of distributions and the challenge of interpreting ratios of values that may vary by several orders of magnitude.

Address of the bookmark: https://github.com/rjchallis/assembly-stats

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Jit — Tue, 15 May 2018 07:35:26 -0500

HapCUT2 is a maximum-likelihood-based tool for assembling haplotypes from DNA sequence reads, designed to "just work" with excellent speed and accuracy. We found that previously described haplotype assembly methods are specialized for specific read technologies or protocols, with slow or inaccurate performance on others. With this in mind, HapCUT2 is designed for speed and accuracy across diverse sequencing technologies, including but not limited to:

NGS short reads (Illumina HiSeq)
clone-based sequencing (Fosmid or BAC clones)
SMRT reads (PacBio)
Oxford Nanopore reads
10X Genomics Linked-Reads
proximity-ligation (Hi-C) reads
high-coverage sequencing (>40x coverage-per-SNP) using the above technologies
combinations of the above technologies (e.g. scaffold long reads with Hi-C reads)

See the repository for specific examples of command line options and best practices for some of these technologies.

NOTE: At this time HapCUT2 is for diploid organisms only. VCF input should contain diploid variants.

If you use HapCUT2 in your research, please cite: Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. gr.213462.116 (2016). doi:10.1101/gr.213462.116

Address of the bookmark: https://github.com/vibansal/HapCUT2

SWALO: Scaffolding with assembly likelihood optimization

Jit — Wed, 20 Jun 2018 02:45:16 -0500

SWALO (scaffolding with assembly likelihood optimization) is a method for scaffolding based on likelihood of genome assemblies computed using generative models for sequencing. Please email your questions, comments, suggestions, and bug reports to atif.bd@gmail.com.

Address of the bookmark: https://atifrahman.github.io/SWALO/

Converting a VCF into a FASTA given some reference !

Jit — Fri, 20 Jul 2018 10:03:53 -0500

Samtools/BCFtools (Heng Li) provides a Perl script, vcfutils.pl, whose vcf2fq function (lines 469-528) does this.

This script has been modified by others to convert InDels as well, e.g. the vcf2fq.pl version by David Eccles:

./vcf2fq.pl -f <input.fasta> <all-site.vcf> > <output.fastq>

https://github.com/gringer/bioinfscripts/blob/master/vcf2fq.pl

https://github.com/lh3/samtools/blob/master/bcftools/vcfutils.pl
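To make the idea concrete, here is a minimal Python sketch of the core step: substituting single-base ALT alleles from a VCF into a reference sequence. It is not a port of vcf2fq, which does considerably more. The function name `apply_snps` and the toy records are hypothetical, and the sketch assumes a single reference sequence, applies every SNP record without looking at genotype or quality, and skips InDels and multi-allelic records.

```python
def apply_snps(reference, vcf_lines):
    """Return `reference` with single-base ALT alleles substituted.

    Assumptions: one reference sequence, every SNP record is applied
    (no genotype or quality filtering), and indels or multi-allelic
    records are skipped.
    """
    seq = list(reference)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue  # not a simple SNP
        # VCF positions are 1-based; check the record matches the reference.
        if seq[pos - 1].upper() != ref.upper():
            raise ValueError(f"REF mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


# Toy VCF: two SNPs and one insertion (the insertion is skipped).
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t2\t.\tC\tT\t50\tPASS\t.",
    "chr1\t5\t.\tA\tAG\t50\tPASS\t.",
    "chr1\t7\t.\tT\tG\t50\tPASS\t.",
]
consensus = apply_snps("ACGTACTG", vcf)
print(consensus)  # ATGTACGG
```

Checking REF against the reference before substituting catches the common mistake of pairing a VCF with the wrong reference build.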

SKESA: strategic k-mer extension for scrupulous assemblies

Jit — Wed, 14 Nov 2018 04:45:41 -0600

SKESA is a de Bruijn graph-based de novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MegaHit shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources.
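For readers unfamiliar with the underlying data structure, the following toy Python sketch builds a de Bruijn graph from reads and extends a contig along unambiguous edges. It only illustrates the general idea and is not SKESA's algorithm: it ignores reverse complements, k-mer counts, sequencing errors and paired-end information, and all names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_right(graph, start):
    """Follow edges from `start` until a fork, a dead end, or a repeated node."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break  # avoid looping forever on a cycle
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads from the sequence ACGTCAG.
reads = ["ACGTC", "CGTCA", "GTCAG"]
graph = de_bruijn_edges(reads, k=4)
contig = extend_right(graph, "ACG")
print(contig)  # ACGTCAG
```

Because the graph uses sets and the extension only follows nodes with exactly one successor, the result does not depend on read order.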

Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.

Address of the bookmark: https://github.com/ncbi/SKESA/releases

Versatile genome assembly evaluation with QUAST-LG

Jit — Fri, 21 Dec 2018 22:06:31 -0600

QUAST-LG is an extension of QUAST intended for evaluating large-scale genome assemblies (up to mammalian-size).

QUAST-LG is included in the QUAST package starting from version 5.0.0. Run QUAST as usual and do not forget to add the --large option to your command!
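As a rough illustration, the Python sketch below assembles such a command line without running it. The helper name and the file names are hypothetical. The options used (`--large`, `-o` for the output directory, `-t` for threads, `-r` for a reference) are standard QUAST options, but check `quast.py --help` for your installed version.

```python
import shlex


def quast_large_command(assemblies, reference=None,
                        output_dir="quast_results", threads=4):
    """Build (but do not run) a QUAST command line with --large enabled."""
    cmd = ["quast.py", "--large", "-o", output_dir, "-t", str(threads)]
    if reference:
        cmd += ["-r", reference]  # optional reference genome
    return cmd + list(assemblies)


# File names here are placeholders for illustration.
cmd = quast_large_command(["assembly.fasta"], reference="reference.fasta")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once QUAST is installed and on the PATH.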

A short list of the new features (see CHANGES for all):

Significant speedup achieved by both use of new fast aligner (minimap2) and the refactoring of alignment analyzing modules
New k-mer-based completeness and correctness metrics
BUSCO added for enhanced reference-free analysis
The concept of upper bound assembly (theoretical limits on the assembly completeness and contiguity for a given genome and set of reads)

Address of the bookmark: http://cab.spbu.ru/software/quast-lg/

List of tools frequently used in genome assembly

BioStar — Tue, 22 Jan 2019 09:39:02 -0600

I have used the following assemblers

SPAdes (v. 3.10.1)
Canu (v. 1.6)
Unicycler (v. 0.4.1)
Miniasm (v. 0.2-r137-dirty)

I have used the following mappers

minimap2 (v. 2.0rc1-r232)
minimap (v. 0.2-r124-dirty)
bwa (v. 0.7.12-r1039)

I have used the following polishing tools

Racon (v. not available)
Pilon (v. 1.18)
Nanopolish (v. 0.8.3)

I have used the following tools to assess genome assembly characteristics

ANI.pl (https://github.com/chjp/ANI)
CheckM (v. 1.0.7)
Prokka (v. 1.12)
QUAST (v. 2.3)
mummer (v. not available)

If you have any ideas or superior tools we have missed please let us know in the comments.

CONTIGuator !

BioStar — Fri, 04 Oct 2019 01:27:58 -0500

CONTIGuator is a Python script for Linux environments whose purpose is to speed up the bacterial genome assembly process and to obtain a first insight into the genome structure using the well-known Artemis Comparison Tool (ACT).

Address of the bookmark: https://sourceforge.net/projects/contiguator/

BlobToolKit: A toolkit for genome assembly QC

Jit — Fri, 21 Feb 2020 00:17:50 -0600

Filtering raw genomic datasets is essential to avoid chimeric assemblies and to increase the validity of sequence-based biological inference. BlobToolKit extends the BlobTools/Blobology approach to simplify interactive and reproducible filtering.

BlobToolKit comprises four components:

BlobToolKit Viewer allows browser-based interactive visualisation and filtering of preliminary or published genomic datasets even for highly fragmented assemblies.
BlobTools2 is a command-line program to convert assemblies and analysis results into datasets that can be further processed using BlobTools2 and/or visualised in the Viewer.
The BlobToolKit Specification features a formal schema and validator for the JSON-based BlobDir format used by BlobTools2 and the Viewer.
The BlobToolKit Pipeline is a configurable Snakemake pipeline that automates all steps from retrieving public datasets through running analyses and generating a BlobDir dataset with BlobTools2, ready for visualisation in the Viewer.

Paper: https://www.biorxiv.org/content/10.1101/844852v1.full.pdf

Address of the bookmark: https://blobtoolkit.genomehubs.org/

VICUNA: a software tool that enables consensus assembly of ultra-deep sequence derived from diverse viral or other heterogeneous populations.

biogeek — Tue, 25 Aug 2020 03:40:17 -0500

VICUNA is a de novo assembly program targeting populations with high mutation rates. It creates a single linear representation of the mixed population on which intra-host variants can be mapped. For clinical samples rich in contamination (e.g., >95%), VICUNA can leverage existing genomes, if available, to assemble only target-alike reads. After initial assembly, it can also use existing genomes to perform guided merging of contigs. For each data set (e.g., Illumina paired reads, 454), VICUNA outputs consensus sequence(s) and the corresponding multiple sequence alignment of constituent reads. VICUNA efficiently handles ultra-deep sequence data with tens-of-thousands-fold coverage.

http://software.broadinstitute.org/viral/docs/vicuna_v1.0.pdf

Address of the bookmark: https://www.broadinstitute.org/viral-genomics/vicuna