BOL: Related items

Long read assembly workshop!

Rahul Nayak — Thu, 04 Oct 2018 17:23:18 -0500

This is a tutorial for a workshop on long-read (PacBio) genome assembly.

It demonstrates how to use long PacBio sequencing reads to assemble a bacterial genome, and includes additional steps for circularising, trimming, finding plasmids, and correcting the assembly with short-read Illumina data.

Please comment if you know of any other long-read assembly tutorials.

Address of the bookmark: http://sepsis-omics.github.io/tutorials/modules/cmdline_assembly_v2/

SKESA: strategic k-mer extension for scrupulous assemblies

Jit — Wed, 14 Nov 2018 04:45:41 -0600

SKESA is a de Bruijn graph-based de novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MegaHit shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources.
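
To make the de Bruijn graph idea concrete, here is a minimal sketch in Python. It is not SKESA's implementation: the function names, the toy reads and the choice of k are all assumptions made for illustration, and real assemblers add error handling, k-mer counting and much more.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Extend a contig from `start` while exactly one distinct successor exists."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point: stop extending
        node = successors.pop()
        if node in visited:
            break  # avoid looping forever on a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Toy overlapping reads (hypothetical data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn_graph(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)
```

The three overlapping reads collapse into a single path through the graph, which is the basic reason this kind of assembler can stitch short Illumina reads into longer contigs.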

Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.

Research Paper @ Link

[Figure: steps of the SKESA algorithm]

Address of the bookmark: https://github.com/ncbi/SKESA/releases

Software and tools to detect structural variation with long reads!!

Archana Malhotra — Wed, 15 Mar 2017 14:31:09 -0500

Uncovering the connection between genetics and heritable diseases requires an approach that looks at all the variant bases and types in a genome. While a PacBio de novo assembly resolves the most novel structural variants (SVs), 8-10X PacBio coverage of single genomes or trios reveals triple the SVs detectable by short-read data.

With Single Molecule, Real-Time (SMRT) Sequencing, you can access structural variations having a broad range of sizes, types, and GC content with the ability to:

Uncover missing heritability linked to structural variation
Unambiguously identify genomic context and variant breakpoints at the sequence level to unravel the genetic etiology of disease
Resolve structural variation across the complete size spectrum with basepair resolution

The following SV tools can help you achieve this.

Sniffles: Structural variation caller using third generation sequencing

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis. Please note the current version of Sniffles requires sorted output from BWA-MEM (use the -M and -x parameters) or NGM-LR with the optional SAM attributes enabled!
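
Since the input has to be sorted, a quick sanity check before calling variants can save a failed run. The sketch below only looks at what the SAM header declares (the SO tag on the @HD line, as defined in the SAM specification); it does not verify the actual record order, and the function name and example header lines are made up for illustration.

```python
def is_coordinate_sorted(sam_header_lines):
    """Return True if the @HD header line declares SO:coordinate."""
    for line in sam_header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@HD":
            return "SO:coordinate" in fields[1:]
    # No @HD line at all: sort order is undeclared, so treat it as unsorted.
    return False


# Hypothetical header lines, e.g. as printed by a SAM/BAM header viewer.
header = ["@HD\tVN:1.6\tSO:coordinate", "@SQ\tSN:chr1\tLN:1000"]
sorted_ok = is_coordinate_sorted(header)
unsorted_ok = is_coordinate_sorted(["@HD\tVN:1.6\tSO:unsorted"])
print(sorted_ok, unsorted_ok)
```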

More at https://github.com/fritzsedlazeck/Sniffles

MultiBreak-SV: It identifies structural variants from next-generation paired end data, third-generation long read data, or data from a combination of sequencing platforms.

There are two pieces of software in this release: (1) a pre-processor that takes machine-format (.m5) BLASR files, and (2) MultiBreak-SV. For installation and usage instructions, see doc/MultiBreakSV-Manual.txt.

More at https://github.com/raphael-group/multibreak-sv

Parliament: A Structural Variation Tool. Why ask a single SV-detection approach to find every variant when you can have a parliament of tools deciding?

Publication about the algorithm and “…the first long-read characterization of structural variation in a diploid human personal genome…” (HS1011) - “Assessing structural variation in a personal genome—towards a human reference diploid genome”

More at https://sourceforge.net/projects/parliamentsv/

https://www.dnanexus.com/papers/Parliament_Info_Sheet.pdf

PBHoney: the structural variation discovery tool

PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp). PBHoney considers both intra-read discordance and soft-clipped tails of long reads to identify structural variants.
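
To illustrate what a soft-clipped tail looks like in practice, here is a small Python sketch that reads the leading and trailing soft-clip lengths from a SAM CIGAR string. This is not PBHoney's code: the function names and the min_tail threshold of 200 bp are assumptions chosen for the example.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_tails(cigar):
    """Return (leading, trailing) soft-clip lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so ignore them.
    core = [(n, op) for n, op in ops if op != "H"]
    if not core:
        return (0, 0)
    leading = core[0][0] if core[0][1] == "S" else 0
    trailing = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return (leading, trailing)


def has_long_tail(cigar, min_tail=200):
    """True if either soft-clipped tail is at least min_tail bases (hypothetical cutoff)."""
    return max(soft_clipped_tails(cigar)) >= min_tail


print(soft_clipped_tails("350S4000M20I1500M"))
print(has_long_tail("10H120S500M80S"))
```

A read whose alignment ends in a long soft clip has a tail that did not map where the rest of the read did, which is the kind of signal a tail-based SV caller would then try to re-map.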

Read The Paper http://www.biomedcentral.com/1471-2105/15/180/abstract

More at https://sourceforge.net/projects/pb-jelly/

SMRT-SV: Structural variant and indel caller for PacBio reads

Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.

SMRT-SV provides an official software package for tools described in Chaisson et al. 2014 and adds several key features including the following.

Unified variant calling user interface with built-in cluster compute support
Small indel calling (2-49 bp)
Improved inversion calling (screenInversions)
Quality metric for SV calls based on number of local assemblies supporting each call
Higher sensitivity for SV calls using tiled local assemblies across the entire genome instead of "signature" regions
Genotyping of SVs with Illumina paired-end reads from WGS samples

More at https://github.com/EichlerLab/pacbio_variant_caller

Versatile genome assembly evaluation with QUAST-LG

Jit — Fri, 21 Dec 2018 22:06:31 -0600

QUAST-LG is an extension of QUAST intended for evaluating large-scale genome assemblies (up to mammalian-size).

QUAST-LG is included in the QUAST package starting from version 5.0.0 (download the latest release). Run QUAST as usual and do not forget to add the --large option to your command!
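
If you drive QUAST from a script, a small helper like the sketch below keeps the --large option from being forgotten. The helper name, file names and defaults are hypothetical; the -o, -r and -t options are standard quast.py options as far as I know, but check quast.py --help for your installed version.

```python
def quast_large_command(assembly, reference=None, out_dir="quast_results", threads=4):
    """Build (but do not run) a quast.py argument list with --large enabled."""
    cmd = ["quast.py", "--large", "-o", out_dir, "-t", str(threads)]
    if reference is not None:
        # A reference genome is optional.
        cmd += ["-r", reference]
    cmd.append(assembly)
    return cmd


# Hypothetical file names.
cmd = quast_large_command("assembly.fasta", reference="reference.fasta")
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once QUAST is installed and on your PATH.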

A short list of the new features (see CHANGES for all):

Significant speedup achieved by both the use of a new fast aligner (minimap2) and the refactoring of the alignment analysis modules
New k-mer-based completeness and correctness metrics
BUSCO added for enhanced reference-free analysis
The concept of upper bound assembly (theoretical limits on the assembly completeness and contiguity for a given genome and set of reads)

Address of the bookmark: http://cab.spbu.ru/software/quast-lg/

List of tools frequently used during genome assembly

BioStar — Tue, 22 Jan 2019 09:39:02 -0600

List of tools frequently used during genome assembly:

I have used the following assemblers

SPAdes (v. 3.10.1)
Canu (v. 1.6)
Unicycler (v. 0.4.1)
Miniasm (v. 0.2-r137-dirty)

I have used the following mappers

minimap2 (v. 2.0rc1-r232)
minimap (v. 0.2-r124-dirty)
bwa (v. 0.7.12-r1039)

I have used the following polishing tools

Racon (v. not available)
Pilon (v. 1.18)
Nanopolish (v. 0.8.3)

I have used the following tools to assess genome assembly characteristics

ANI.pl (https://github.com/chjp/ANI)
CheckM (v. 1.0.7)
Prokka (v. 1.12)
QUAST (v. 2.3)
MUMmer (v. not available)

If you have any ideas or know of superior tools we have missed, please let us know in the comments.

MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization

Jit — Fri, 24 Jan 2020 04:09:15 -0600

MitoZ is a Python3-based toolkit that aims to automatically filter paired-end raw data (FASTQ files), assemble the genome, search the assembly for mitogenome sequences, annotate the mitogenome (producing a GenBank file), and visualize the mitogenome. MitoZ is available from https://github.com/linzhi2013/MitoZ.

https://academic.oup.com/nar/article/47/11/e63/5377471

Address of the bookmark: https://github.com/linzhi2013/MitoZ

JCVI: Python utility libraries for genome assembly, annotation and comparative genomics

Jit — Tue, 17 Mar 2020 06:19:06 -0500

Collection of Python libraries to parse bioinformatics files, or perform computation related to assembly, annotation, and comparative genomics.

https://github.com/tanghaibao/jcvi

More at https://github.com/tanghaibao/jcvi/wiki

Address of the bookmark: https://github.com/tanghaibao/jcvi

HASLR: a hybrid assembler which uses both second and third generation sequencing reads

BioStar — Mon, 04 May 2020 02:04:03 -0500

HASLR is a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. The authors' experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers. Furthermore, the contiguity and accuracy of the generated assemblies are on par with the other tools on most of the samples. HASLR is an open source tool available at https://github.com/vpc-ccg/haslr.

Address of the bookmark: https://github.com/vpc-ccg/haslr

HapSolo: An optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding.

Jit — Mon, 26 Oct 2020 21:23:36 -0500

Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding.

More at https://github.com/esolares/HapSolo

Address of the bookmark: https://github.com/esolares/HapSolo

HapSolo: An optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding

Jit — Sat, 08 May 2021 21:25:00 -0500

HapSolo identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among candidate assemblies using a cost function. The cost function can be defined by the user but by default considers the number of missing, duplicated and single-copy BUSCO genes within the assembly. HapSolo performs hill climbing to minimize cost over thousands of candidate assemblies.
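
The hill-climbing idea can be sketched in a few lines of Python. Everything here is a toy under stated assumptions, not HapSolo's code: toy_busco_counts is a made-up stand-in for running BUSCO on a candidate assembly, the single min_identity threshold (restricted to 50-100) replaces HapSolo's several alignment metrics, and the cost formula is only one plausible way to combine missing, duplicated and single-copy counts.

```python
def toy_busco_counts(min_identity):
    """Hypothetical stand-in for BUSCO results on a candidate primary assembly."""
    purged = 100 - min_identity  # a lower threshold purges more contigs
    duplicated = max(0, 30 - purged)  # purging removes duplicated genes...
    missing = 5 + 2 * max(0, purged - 25)  # ...but over-purging loses genes
    single = 100 - duplicated - missing
    return {"missing": missing, "duplicated": duplicated, "single": single}


def cost(counts):
    """Illustrative cost: missing plus duplicated genes, relative to single-copy genes."""
    return (counts["missing"] + counts["duplicated"]) / counts["single"]


def hill_climb(start, lower=50, upper=100, step=1):
    """Move to the better neighbouring threshold until no neighbour lowers the cost."""
    current = start
    current_cost = cost(toy_busco_counts(current))
    while True:
        neighbours = [t for t in (current - step, current + step) if lower <= t <= upper]
        best = min(neighbours, key=lambda t: cost(toy_busco_counts(t)))
        best_cost = cost(toy_busco_counts(best))
        if best_cost >= current_cost:
            return current, current_cost  # local minimum reached
        current, current_cost = best, best_cost


best_threshold, best_cost = hill_climb(start=100)
print(best_threshold, toy_busco_counts(best_threshold))
```

On a real cost surface, a plain hill climb like this can get stuck in a local minimum, so in practice one would typically start from several points and keep the best result.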

Address of the bookmark: https://github.com/esolares/HapSolo