BOL: Related items

OPERA : Optimal Paired-End Read Assembler

Jit — Fri, 09 Sep 2016 05:28:58 -0500

OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as Scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al, 2011).

Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.

Address of the bookmark: https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/

dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes

Jit — Wed, 20 Dec 2017 18:35:16 -0600

While the number of sequenced diploid genomes has been steadily increasing in the last few years, assembly of highly polymorphic (HP) diploid genomes remains challenging. As a result, there is a shortage of tools for assembling HP genomes from next-generation sequencing (NGS) data. The initial approaches to assembling HP genomes were proposed in the pre-NGS era and are not well suited for NGS projects. To address this limitation, we developed the first de Bruijn graph assembler for HP genomes, dipSPAdes, which significantly improves on the state-of-the-art assemblers for HP diploid genomes.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pubmed/25734602

Scallop: reference-based transcriptome assembler for RNA-seq

Rahul Nayak — Tue, 08 May 2018 04:23:27 -0500

Scallop is an accurate reference-based transcript assembler. It is particularly accurate at assembling multi-exon transcripts and lowly expressed transcripts. Scallop achieves this improvement through a novel algorithm that provably preserves all phasing paths from reads and paired-end reads while also achieving both transcript parsimony and coverage deviation minimization.

The Scallop paper has been published in Nature Biotechnology. The datasets and scripts used in this paper to compare the performance of Scallop and other assemblers are available at scalloptest.

Please also check out the podcast about Scallop (thanks to Roman Cheplyaka for the interview). It is available at both the bioinformatics chat and iTunes.

Address of the bookmark: https://github.com/Kingsford-Group/scallop

Shasta long read assembler

Jit — Tue, 14 Jan 2020 06:47:07 -0600

The goal of the Shasta long read assembler is to rapidly produce accurate assembled sequence using as input DNA reads generated by Oxford Nanopore flow cells.

Computational methods used by the Shasta assembler include:

Using a run-length representation of the read sequence. This makes the assembly process more resilient to errors in homopolymer repeat counts, which are the most common type of errors in Oxford Nanopore reads.
Using, in some phases of the computation, a representation of the read sequence based on markers, which are a fixed subset of short k-mers (k ≈ 10).
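
The run-length representation mentioned above can be illustrated with a minimal sketch. This is not Shasta's code; the function name `run_length_encode` and the example reads are made up for illustration. It only shows why collapsing homopolymer runs makes two reads that disagree on a repeat count look identical at the base level.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse homopolymer runs.

    Returns a pair (bases, counts): the sequence with each run reduced to a
    single base, and the length of each run. Hypothetical helper, not part
    of Shasta.
    """
    bases = []
    counts = []
    for base, run in groupby(seq):
        bases.append(base)
        counts.append(len(list(run)))
    return "".join(bases), counts


# Two illustrative reads that differ only in the length of a T homopolymer.
read_a = "GATTTACCA"
read_b = "GATTTTACCA"

bases_a, counts_a = run_length_encode(read_a)
bases_b, counts_b = run_length_encode(read_b)

# The collapsed base sequences agree; only the repeat counts differ.
print(bases_a, counts_a)
print(bases_b, counts_b)
```

In this representation, an error in a homopolymer repeat count changes only one entry of the counts list, so comparisons made on the collapsed bases are unaffected by it.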

More at https://chanzuckerberg.github.io/shasta/index.html

Address of the bookmark: https://github.com/chanzuckerberg/shasta

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
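
As a rough illustration of the identity estimate described above, the sketch below converts a k-mer Jaccard similarity into a Mash distance using the formula D = -(1/k) ln(2j / (1 + j)) and reports 1 - D as the estimated identity. It is a simplification under stated assumptions: the Jaccard value is computed exactly from full k-mer sets rather than estimated with Winnowing and MinHash as MashMap does, and all function names and example sequences are hypothetical.

```python
import math


def kmer_set(seq, k):
    """All distinct k-mers of seq (no sampling; MashMap samples them instead)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash distance for Jaccard similarity j and k-mer size k, clamped to [0, 1]."""
    if j <= 0.0:
        return 1.0
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, d))


def estimated_identity(j, k):
    """Sequence identity estimate derived from the Mash distance."""
    return 1.0 - mash_distance(j, k)


# Illustrative sequences: the query differs from the reference by one base.
k = 5
reference = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
query = "ACGTTGCATGTCGCTTGATGCATGAGAGCT"

j = jaccard(kmer_set(reference, k), kmer_set(query, k))
print(f"Jaccard = {j:.3f}, estimated identity = {estimated_identity(j, k):.3f}")
```

A mapper following this idea would report the query only when the estimated identity is at or above the chosen threshold.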

Address of the bookmark: https://github.com/marbl/MashMap

HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads

Rahul Nayak — Tue, 08 May 2018 04:27:22 -0500

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
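
For readers unfamiliar with FM indexes, here is a minimal sketch of the classic linear FM index with backward search on a single short string. It is deliberately not the graph FM index (GFM) or the hierarchical scheme that HISAT2 implements; it only shows the underlying BWT-based pattern counting that those structures extend. All names and the example text are illustrative.

```python
from collections import Counter


def build_bwt(text):
    """Burrows-Wheeler transform via a naive suffix sort (fine for toy inputs)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in suffix_array)


def build_fm_index(bwt):
    """Return (c_table, occ) for backward search.

    c_table[c]: number of characters in the text that sort before c.
    occ[c][i]:  number of occurrences of c in bwt[:i].
    """
    counts = Counter(bwt)
    c_table = {}
    total = 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ


def count_occurrences(pattern, c_table, occ, n):
    """Count exact matches of pattern by narrowing a half-open suffix range."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt = build_bwt("GATTACATTA")
c_table, occ = build_fm_index(bwt)
for pattern in ("TTA", "CAT", "GG"):
    print(pattern, count_occurrences(pattern, c_table, occ, len(bwt)))
```

Each pattern character costs a constant number of table lookups, so the search time depends on the read length rather than the reference length.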

More at https://ccb.jhu.edu/software/hisat2/index.shtml

Address of the bookmark: https://github.com/infphilo/hisat2

Circlator: automated circularization of genome assemblies using long sequencing reads

Poonam Mahapatra — Tue, 15 May 2018 09:42:32 -0500

A tool to circularize genome assemblies. The algorithm and benchmarks are described in the Genome Biology manuscript. Citation: "Circlator: automated circularization of genome assemblies using long sequencing reads", Hunt et al, Genome Biology 2015 Dec 29;16(1):294. doi: 10.1186/s13059-015-0849-0. PMID: 26714481.

Address of the bookmark: http://sanger-pathogens.github.io/circlator/

GenomeMapper: Simultaneous alignment of short reads against multiple genomes

Jit — Fri, 25 May 2018 09:29:44 -0500

GenomeMapper is a short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. It can be used to align against multiple genomes simultaneously or against a single reference. If you are unsure which one is the appropriate GenomeMapper, you might want to use the latter (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2768987/).

Address of the bookmark: http://1001genomes.org/software/genomemapper.html

npScarf: real-time scaffolder using SPAdes contigs and Nanopore sequencing reads

Shruti Paniwala — Mon, 11 Jun 2018 05:14:57 -0500

npScarf (jsa.np.npscarf) is a program that connects contigs from a draft genome to generate sequences that are closer to finished. These pipelines can run on a single laptop for microbial datasets. In real-time mode, it can be integrated with simple structural analyses such as gene ordering and plasmid forming.

Address of the bookmark: http://japsa.readthedocs.io/en/latest/tools/jsa.np.npscarf.html

AlignQC: A tool for assessing an alignment, and generating reports that are easy to share

Jit — Tue, 07 Aug 2018 04:41:07 -0500

Long read alignment analysis. Generates reports on sequence alignments covering mappability vs read sizes, error patterns, annotations and rarefaction curve analysis. The most basic analysis only requires a BAM file, and outputs a web-browser-compatible xhtml file to visualize/share/store/extract analysis results.

https://f1000research.com/articles/6-100/

https://github.com/jason-weirather/AlignQC

Address of the bookmark: https://www.healthcare.uiowa.edu/labs/au/AlignQC/