BOL: Related items

Rebaler: program for conducting reference-based assemblies using long reads.

Jit — Tue, 18 Sep 2018 07:52:41 -0500

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

Address of the bookmark: https://github.com/rrwick/Rebaler

ASplice: a scalable and memory-efficient algorithm for de novo transcriptome assembly

Rahul Nayak — Tue, 03 Jul 2018 04:09:46 -0500

With increased availability of de novo assembly algorithms, it is feasible to study entire transcriptomes of non-model organisms. While algorithms are available that are specifically designed for performing transcriptome assembly from high-throughput sequencing data, they are very memory-intensive, limiting their applications to small data sets with few libraries. Texas A&M University researchers developed a transcriptome assembly algorithm that recovers alternatively spliced isoforms and expression levels while utilizing as many RNA-Seq libraries as possible that contain hundreds of gigabases of data. New techniques were developed so that computations can be performed on a computing cluster with a moderate amount of physical memory.

Availability: A software program that implements the algorithm is available at http://faculty.cse.tamu.edu/shsze/asplice.

Sze SH, Pimsler ML, Tomberlin JK, Jones CD, Tarone AM. (2017) A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms. BMC Genomics 18(Suppl 4):387.

Address of the bookmark: http://faculty.cse.tamu.edu/shsze/asplice/

SiLiX: implements an ultra-efficient algorithm for the clustering of homologous sequences

Jit — Wed, 12 Dec 2018 09:22:41 -0600

The software package SiLiX implements an ultra-efficient algorithm for the clustering of homologous sequences, based on single transitive links (single linkage) with alignment coverage constraints.

SiLiX adopts a graph-theoretical framework to interpret similarity pairs as edges of a network. A very efficient algorithm, based on the Disjoint Sets Data Structure, allows the computation of sequence families with low time and space requirements.
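The idea of treating similarity pairs as edges and merging them with a disjoint-sets structure can be sketched as follows. This is a minimal illustration, not SiLiX's actual code: the `DisjointSets` class, the `cluster_families` function, the input tuple layout, and the identity and coverage thresholds are all assumptions made for the example.

```python
from collections import defaultdict


class DisjointSets:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen items as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Walk up to the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_families(pairs, min_identity=0.35, min_coverage=0.80):
    """Single-linkage clustering of similarity pairs.

    pairs: iterable of (seq_a, seq_b, identity, coverage), with identity
    and coverage as fractions in [0, 1]. The thresholds are illustrative.
    """
    sets = DisjointSets()
    for a, b, identity, coverage in pairs:
        # Make sure both sequences appear in the output, even as singletons.
        sets.find(a)
        sets.find(b)
        # Only pairs that pass both constraints become edges.
        if identity >= min_identity and coverage >= min_coverage:
            sets.union(a, b)
    families = defaultdict(set)
    for seq in list(sets.parent):
        families[sets.find(seq)].add(seq)
    return sorted(sorted(family) for family in families.values())
```

Because each pair is processed once and the structure never stores the full graph, memory grows with the number of sequences rather than the number of similarity pairs, which is the property the description above alludes to.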

A parallel version of SiLiX, based on MPI, is also available in this package and has been shown to be scalable, so that it allows the study of very large datasets.

SiLiX is already included in the analysis pipeline for HOGENOM.

Address of the bookmark: http://lbbe.univ-lyon1.fr/SiLiX?lang=fr

MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping

Neel — Fri, 20 May 2016 18:53:49 -0500

MOSAIK is a stable, sensitive and open-source program for mapping second- and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific Biosciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs), and generates outputs tailored to aid in SV discovery.
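The Smith-Waterman step mentioned above can be illustrated with a small local-alignment sketch. It is a textbook version with a linear gap penalty, not MOSAIK's implementation: the function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example, and the hash clustering that selects candidate regions is not shown.

```python
def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_read, aligned_ref) for one best local alignment."""
    rows, cols = len(read) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def substitution(i, j):
        return match if read[i - 1] == ref[j - 1] else mismatch

    # Fill the scoring matrix; scores are clamped at zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),  # match or mismatch
                score[i - 1][j] + gap,                     # gap in the reference
                score[i][j - 1] + gap,                     # gap in the read
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_read, aligned_ref = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        current = score[i][j]
        if current == score[i - 1][j - 1] + substitution(i, j):
            aligned_read.append(read[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif current == score[i - 1][j] + gap:
            aligned_read.append(read[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:
            aligned_read.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_read)), "".join(reversed(aligned_ref))
```

The full matrix costs time and memory proportional to read length times reference length, which is why a mapper would restrict this step to short candidate regions found by a faster seeding stage.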

Address of the bookmark: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090581

Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm

BioStar — Mon, 16 Mar 2020 10:09:26 -0500

Apollo is an assembly polishing algorithm that attempts to correct the errors in an assembly. It can take multiple sets of reads in a single run and polish the assemblies of genomes of any size. Described by Firtina et al. (preliminary version at https://arxiv.org/pdf/1902.04341.pdf).

More at https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa179/5804978?rss=1

Address of the bookmark: https://github.com/CMU-SAFARI/Apollo

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data

BioJoker — Tue, 27 Nov 2018 04:43:57 -0600

lordFAST is a sensitive tool for mapping long reads with high error rates. lordFAST is specially designed for aligning reads from PacBio sequencing technology but provides the user the ability to change alignment parameters depending on the reads and application.

lordFAST is a novel long-read mapper specifically designed to align reads generated by PacBio and potentially other single-molecule sequencing (SMS) technologies to a reference. Besides having higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.

Address of the bookmark: https://github.com/vpc-ccg/lordfast

RaGOO: Fast Reference-Guided Scaffolding of Genome Assembly Contigs

Jit — Sun, 27 Oct 2019 00:57:23 -0500

Alonge M, Soyk S, Ramakrishnan S, Wang X, Goodwin S, Sedlazeck FJ, Lippman ZB, Schatz MC: Fast and accurate reference-guided scaffolding of draft genomes. bioRxiv 2019.

RaGOO is a tool for coalescing genome assembly contigs into pseudochromosomes via minimap2 alignments to a closely related reference genome. The tool focuses on practicality and therefore has the following features:

Good performance. On a MacBook Pro using Arabidopsis data, pseudochromosome construction takes less than a minute and the whole pipeline with SV calling takes ~2 minutes.
Intact ordering and orienting of contigs.
Misassembly correction
GFF lift-over
Structural variant calling with an integrated version of Assemblytics
Confidence scores associated with the grouping, localization, and orientation for each contig.
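
The grouping, localization, and orientation of contigs mentioned in the feature list can be sketched in a simplified form. This is not RaGOO's code, and its actual rules may differ: the `scaffold` function, the alignment tuple layout, and the three heuristics (most aligned bases, leftmost alignment start, majority strand) are assumptions made for illustration.

```python
from collections import defaultdict


def scaffold(alignments):
    """Assign contigs to reference chromosomes, then order and orient them.

    alignments: iterable of (contig, chrom, ref_start, ref_end, strand),
    with strand given as "+" or "-". Returns {chrom: [(contig, strand), ...]}.
    """
    alignments = list(alignments)

    # Grouping: count aligned bases per contig and chromosome.
    per_chrom = defaultdict(lambda: defaultdict(int))
    for contig, chrom, start, end, strand in alignments:
        per_chrom[contig][chrom] += end - start
    assigned = {c: max(cov, key=cov.get) for c, cov in per_chrom.items()}

    # Use only alignments to the assigned chromosome from here on.
    strand_bases = defaultdict(lambda: defaultdict(int))
    first_start = {}
    for contig, chrom, start, end, strand in alignments:
        if chrom != assigned[contig]:
            continue
        strand_bases[contig][strand] += end - start
        first_start[contig] = min(start, first_start.get(contig, start))

    # Localization: sort by leftmost alignment start.
    # Orientation: pick the strand with the most aligned bases.
    layout = defaultdict(list)
    for contig in sorted(assigned, key=lambda c: first_start[c]):
        orientation = max(strand_bases[contig], key=strand_bases[contig].get)
        layout[assigned[contig]].append((contig, orientation))
    return dict(layout)
```

A confidence score per contig could be derived from the same tallies, for example the fraction of aligned bases that support the chosen chromosome or strand, though that formula is also an assumption here.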

Address of the bookmark: https://github.com/malonge/RaGOO

PLAST: A fast, accurate and NGS scalable bank-to-bank sequence similarity search tool

Jit — Fri, 01 Dec 2017 04:10:54 -0600

PLAST is a fast, accurate and NGS scalable bank-to-bank sequence similarity search tool providing significant accelerations of seeds-based heuristic comparison methods, such as the Blast suite of algorithms.

Relying on unique software architecture, PLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware devices.

PLAST stands for Parallel Local Sequence Alignment Search Tool and was published in BMC Bioinformatics.

PLAST is a general purpose sequence comparison tool providing the following benefits:

PLAST is a high-performance sequence comparison tool designed to compare two sets of sequences (query vs. reference),
Reduces the processing time of sequence comparisons while providing highest quality results,
Contains a fully integrated data filtering engine capable of selecting relevant hits with user-defined criteria (E-Value, identity, coverage, alignment length, etc.),
Does not require any additional hardware, since it is a software solution. It is easy to install, cost-effective, takes full advantage of multi-core processors and uses a small RAM footprint,
Ready to be used on a desktop computer, cluster, or cloud, as well as within a distributed system running Hadoop.
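
Filtering hits by user-defined criteria, as described in the list above, amounts to applying a set of thresholds to each reported hit. The sketch below shows the idea only; the `Hit` record, its field names, the `filter_hits` function, and the default thresholds are hypothetical and do not reflect PLAST's own options or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One query/subject hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    identity: float   # percent identity, 0-100
    coverage: float   # percent of the query covered, 0-100
    length: int       # alignment length in residues


def filter_hits(hits, max_evalue=1e-5, min_identity=0.0,
                min_coverage=0.0, min_length=0):
    """Keep only the hits that satisfy every threshold."""
    return [
        hit for hit in hits
        if hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.coverage >= min_coverage
        and hit.length >= min_length
    ]
```

For example, `filter_hits(hits, max_evalue=1e-10, min_identity=90.0)` would keep only near-identical, highly significant hits.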

Address of the bookmark: https://plast.inria.fr/

MMseqs2.0: ultra fast and sensitive protein search and clustering suite

Jit — Thu, 22 Mar 2018 10:40:51 -0500

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, macOS, and (as a beta version, via Cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times BLAST's speed, it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

The MMseqs2 user guide is available as a GitHub Wiki or as a PDF file (thanks to pandoc!).

Please cite: Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).

Address of the bookmark: https://github.com/soedinglab/MMseqs2

WhatsHap: fast and accurate read-based phasing

Jit — Mon, 28 May 2018 09:52:16 -0500

WhatsHap is software for phasing genomic variants using DNA sequencing reads, also called read-based phasing or haplotype assembly. It is especially suitable for long reads, but also works well with short reads.

Features

Very accurate results (Martin et al., WhatsHap: fast and accurate read-based phasing)

Works well with Illumina, PacBio, Oxford Nanopore and other types of reads

It phases SNVs, indels and even “complex” variants (such as TCG → AGAA)

Pedigree phasing mode uses reads from related individuals (such as trios) to improve results and to reduce coverage requirements (Garg et al., Read-Based Phasing of Related Individuals).

WhatsHap is easy to install

It is easy to use: Pass in a VCF and one or more BAM files, get out a phased VCF. Supports multi-sample VCFs.

It produces standard-compliant VCF output by default

If desired, get output that is compatible with ReadBackedPhasing

Open Source (MIT license)

Address of the bookmark: https://whatshap.readthedocs.io/en/latest/