BOL: Related items

BASE: a practical de novo assembler for large genomes using long NGS reads

Rahul Nayak — Fri, 19 Oct 2018 07:25:21 -0500

BASE is a new de novo assembler. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that are highly likely to appear uniquely in the genome. BASE uses these seeds to build extension trees, then applies reverse validation to prune branches based on read coverage and paired-end information. The result is a set of high-quality consensus sequences of the reads sharing each seed, which are then extended into contigs.

Address of the bookmark: https://github.com/dhlbh/BASE

Filtlong: quality filtering tool for long reads

Radha Agarkar — Wed, 13 May 2020 10:23:55 -0500

Filtlong is a tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset. It uses both read length (longer is better) and read identity (higher is better) when choosing which reads pass the filter.
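
The idea of ranking reads by both length and estimated identity can be illustrated with a minimal Python sketch. This is not Filtlong's actual scoring: the function names, the equal default weights, and the `target_bases` cutoff are assumptions made for illustration. The only fixed fact used is the standard Phred relation, where a quality score Q corresponds to an error probability of 10^(-Q/10).

```python
def mean_identity(phred_scores):
    """Estimate read identity as 1 minus the mean per-base error probability."""
    if not phred_scores:
        return 0.0
    errors = [10 ** (-q / 10) for q in phred_scores]
    return 1.0 - sum(errors) / len(errors)


def filter_reads(reads, target_bases, length_weight=1.0, identity_weight=1.0):
    """Keep the best-scoring reads until roughly target_bases are collected.

    reads: list of (name, sequence, phred_scores) tuples.
    The score is a hypothetical weighted sum of relative length and identity.
    """
    if not reads:
        return []
    # Normalise lengths against the longest read (guard against all-empty input).
    max_len = max(len(seq) for _, seq, _ in reads) or 1

    def score(read):
        _, seq, quals = read
        return (length_weight * len(seq) / max_len
                + identity_weight * mean_identity(quals))

    kept, total = [], 0
    for read in sorted(reads, key=score, reverse=True):
        if total >= target_bases:
            break
        kept.append(read)
        total += len(read[1])
    return kept


reads = [
    ("long_good", "A" * 1000, [30] * 1000),
    ("short_bad", "A" * 100, [5] * 100),
    ("mid", "A" * 500, [20] * 500),
]
best = filter_reads(reads, target_bases=1200)
print([name for name, _, _ in best])
```

With these made-up reads, the long high-quality read and the mid-length read are kept, and the short low-quality read is dropped once the base target has been reached.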

Filtlong builds into a stand-alone executable:

git clone https://github.com/rrwick/Filtlong.git
cd Filtlong
make -j
bin/filtlong -h

Address of the bookmark: https://github.com/rrwick/Filtlong

CoLoRMap: Correcting Long Reads by Mapping short reads

Jit — Mon, 20 Aug 2018 14:17:05 -0500

Second-generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. So far, however, long reads are characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.

We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.

The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap
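
The shortest-path idea in the abstract can be sketched roughly as follows. This is a simplified illustration, not CoLoRMap's implementation: the alignment tuples, the overlap rule, and the edge weight (the full edit count of the next short read) are all assumptions, and the paper's actual weighting is not reproduced here. Each short-read alignment is a node, overlapping alignments are connected, and Dijkstra's algorithm finds the chain with the lowest total edit cost.

```python
import heapq


def best_chain(alignments, source, target):
    """Find the lowest-cost chain of overlapping short-read alignments.

    alignments: list of (start, end, edits) in long-read coordinates.
    Returns (list of alignment indices, total edits), or None if the
    target cannot be reached from the source through overlaps.
    """
    def overlaps(a, b):
        # b starts inside a and extends past its end (hypothetical rule).
        return a[0] < b[0] <= a[1] < b[1]

    dist = {source: alignments[source][2]}
    prev = {}
    heap = [(dist[source], source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, aln in enumerate(alignments):
            if overlaps(alignments[u], aln):
                nd = d + aln[2]
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    prev[v] = u
                    heapq.heappush(heap, (nd, v))

    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Four made-up alignments; index 1 is a noisy read, index 2 a clean one.
alns = [(0, 100, 2), (50, 150, 10), (60, 160, 1), (140, 240, 3)]
print(best_chain(alns, source=0, target=3))
```

In this toy input the chain passes through the cleaner read at index 2 rather than the noisier read at index 1, which is the behaviour the abstract describes: the chosen sequence of short reads is the one closest to the long read.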

ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Address of the bookmark: https://github.com/sfu-compbio/colormap

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast, approximate tool for mapping long reads (PacBio/ONT) or assemblies to one or more reference genomes. It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
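
The step from Jaccard similarity to identity can be illustrated with a small Python sketch. It is not MashMap's code and leaves out winnowing entirely: it estimates the Jaccard similarity of two k-mer sets with a plain bottom-s MinHash sketch and converts it with the Mash distance formula D = -(1/k) ln(2j / (1 + j)), taking identity as 1 - D. The function names, the hash function, and the sketch size are illustrative assumptions.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (choice of hash is arbitrary)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmer_set, s):
    """Keep the s smallest hash values of the set."""
    return set(sorted(hash_kmer(x) for x in kmer_set)[:s])


def minhash_jaccard(a, b, s):
    """Bottom-s MinHash estimate of the Jaccard similarity of sets a and b."""
    sa, sb = bottom_sketch(a, s), bottom_sketch(b, s)
    merged = set(sorted(sa | sb)[:s])
    if not merged:
        return 0.0
    return len(merged & sa & sb) / len(merged)


def mash_identity(j, k):
    """Convert a Jaccard estimate j into an identity estimate via Mash distance."""
    if j <= 0.0:
        return 0.0
    distance = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - distance)


k = 4
a = kmers("ACGTACGTTGCA", k)
b = kmers("ACGTACGATGCA", k)
j = minhash_jaccard(a, b, s=1000)
print(j, mash_identity(j, k))
```

A mapper working this way would report a candidate region only when the identity estimate exceeds the chosen threshold. When the sketch size is at least as large as the union of the two k-mer sets, as in this toy example, the MinHash estimate equals the exact Jaccard similarity.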

Address of the bookmark: https://github.com/marbl/MashMap

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

Rahul Nayak — Fri, 11 May 2018 05:07:45 -0500

MECAT is an ultra-fast mapping, error correction and de novo assembly tool for single-molecule sequencing (SMRT) reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than state-of-the-art aligners and error correction tools. MECAT can be used effectively for de novo assembly of large genomes. For example, on a 32-thread computer with a 2.0 GHz CPU, MECAT takes 9.5 days to assemble a human genome from 54x SMRT data, which is 40 times faster than the current PBcR-Mhap pipeline. MECAT's performance was compared with the PBcR-Mhap pipeline, FALCON and Canu (v1.3) on five real datasets. The quality of the assembled contigs produced by MECAT is the same as or better than that of the PBcR-Mhap pipeline and FALCON.

https://www.nature.com/articles/nmeth.4432

Address of the bookmark: https://github.com/xiaochuanle/MECAT

Cerulean: A hybrid assembly using high throughput short and long reads

Rahul Nayak — Tue, 05 Jun 2018 10:10:15 -0500

Cerulean uses long reads, such as PacBio RS reads, to extend contigs assembled from short-read datasets, such as Illumina paired-end reads. Cerulean v0.1 has been implemented with bacterial genomes in mind. The method is fully described in Deshpande, V., Fung, E. D., Pham, S., & Bafna, V. (2013). Cerulean: A hybrid assembly using high throughput short and long reads. arXiv preprint arXiv:1307.7933. http://arxiv.org/abs/1307.7933

Address of the bookmark: https://sourceforge.net/projects/ceruleanassembler/

Evaluation of genome assembly software based on long reads

BioStar — Fri, 01 Feb 2019 11:55:54 -0600

Third-generation sequencing (TGS) technologies have been used to produce highly accurate de novo assemblies of hundreds of microbial genomes and highly contiguous reconstructions of many dozens of plant and animal genomes, enabling new insights into evolution and sequence diversity. They have also been applied to resequencing analyses, to create detailed maps of structural variations in many species, and to fill in many of the gaps in the human reference genome.

In this report, we compare and evaluate several genome assembly programs based on TGS technology. The experiments were performed on 4 reference genomes, and the results were evaluated with the QUAST software. The 11 programs evaluated are: Celera Assembler, Falcon, Miniasm, Newbler, SGA Assembler, Smartdenovo, Abruijn, Ra, DBG2OLC, Spades and Cerulean. The first 8 use only long reads, while the last 3 can merge long and short reads.

Flye: Fast and accurate de novo assembler for single molecule sequencing reads

Jit — Fri, 04 May 2018 19:16:22 -0500

Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.

Address of the bookmark: https://github.com/fenderglass/Flye

Wtdbg2: a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore

Neel — Fri, 19 Oct 2018 08:48:43 -0500

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output. Wtdbg2 is able to assemble the human genome and even the 32 Gb Axolotl genome at a speed tens of times faster than CANU and FALCON while producing contigs of comparable base accuracy.

Address of the bookmark: https://github.com/ruanjue/wtdbg2

LRCstats: a tool for evaluating long reads correction methods

Aaryan Lokwani — Wed, 22 Aug 2018 11:05:04 -0500

LRCstats is an open-source pipeline for benchmarking correction algorithms for DNA long reads produced by third-generation sequencing technologies, such as the machines made by Pacific Biosciences. Long reads are, as the name suggests, longer than the reads produced by next-generation sequencing technologies such as Illumina. However, long reads are plagued by high error rates, which can cause problems in downstream analysis. Long read correction algorithms reduce the error rate either through self-correction or by using accurate short reads from next-generation sequencing to correct the long reads.

Address of the bookmark: https://github.com/cchauve/lrcstats