BOL: Related items

PLAST: A fast, accurate and NGS scalable bank-to-bank sequence similarity search tool

Jit — Fri, 01 Dec 2017 04:10:54 -0600

PLAST is a fast, accurate and NGS-scalable bank-to-bank sequence similarity search tool that significantly accelerates seed-based heuristic comparison methods, such as the BLAST suite of algorithms.

Relying on a unique software architecture, PLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware devices.

PLAST stands for Parallel Local Sequence Alignment Search Tool and was published in BMC Bioinformatics.

PLAST is a general purpose sequence comparison tool providing the following benefits:

Is a high-performance tool designed to compare two sets of sequences (query vs. reference),
Reduces the processing time of sequence comparisons while providing the highest-quality results,
Contains a fully integrated data filtering engine capable of selecting relevant hits with user-defined criteria (E-value, identity, coverage, alignment length, etc.),
Does not require any additional hardware, since it is a software solution. It is easy to install, cost-effective, takes full advantage of multi-core processors and has a small RAM footprint,
Is ready to be used on a desktop computer, a cluster or the cloud, as well as within a distributed system running Hadoop.

https://plast.inria.fr/

Address of the bookmark: https://plast.inria.fr/

MMseqs2.0: ultra fast and sensitive protein search and clustering suite

Jit — Thu, 22 Mar 2018 10:40:51 -0500

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

The MMseqs2 user guide is available as a GitHub wiki or as a PDF file (thanks to pandoc!).

Please cite: Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).

Address of the bookmark: https://github.com/soedinglab/MMseqs2

WhatsHap: fast and accurate read-based phasing

Jit — Mon, 28 May 2018 09:52:16 -0500

WhatsHap is software for phasing genomic variants using DNA sequencing reads, also called read-based phasing or haplotype assembly. It is especially suitable for long reads, but also works well with short reads.

Features

Very accurate results (Martin et al., WhatsHap: fast and accurate read-based phasing)

Works well with Illumina, PacBio, Oxford Nanopore and other types of reads

It phases SNVs, indels and even “complex” variants (such as TCG → AGAA)

Pedigree phasing mode uses reads from related individuals (such as trios) to improve results and to reduce coverage requirements (Garg et al., Read-Based Phasing of Related Individuals).

WhatsHap is easy to install

It is easy to use: Pass in a VCF and one or more BAM files, get out a phased VCF. Supports multi-sample VCFs.

It produces standard-compliant VCF output by default

If desired, get output that is compatible with ReadBackedPhasing

Open Source (MIT license)

Address of the bookmark: https://whatshap.readthedocs.io/en/latest/

ANItools web: a web tool for fast genome comparison within multiple bacterial strains

Jit — Wed, 14 Nov 2018 04:34:23 -0600

ANItools is a software package written in Perl that runs on Linux/Unix systems. If you want to compare bacterial genomes and calculate their average nucleotide identity (ANI), you can download and run the program directly. Alternatively, you can send us the genome sequences by email and we will do the analysis for you.

https://academic.oup.com/database/article/doi/10.1093/database/baw084/2630454

Address of the bookmark: http://ani.mypathogen.cn/

MMseqs2: ultra fast and sensitive sequence search and clustering suite

Manisha Mishra — Mon, 18 Jan 2021 10:47:56 -0600

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

Address of the bookmark: https://github.com/soedinglab/MMseqs2

PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index

Rahul Nayak — Thu, 21 Apr 2022 05:41:39 -0500

PuffAligner is a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences.

Address of the bookmark: https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings

HiTE: a fast and accurate dynamic boundary adjustment approach for full-length Transposable Elements detection and annotation in Genome Assemblies

LEGE — Sat, 20 Sep 2025 09:34:04 -0500

HiTE is Python software that uses a dynamic boundary adjustment approach to detect and annotate full-length transposable elements (TEs) in genome assemblies. In comparison to other tools, HiTE detects a greater number of full-length TEs.

panHiTE

We have developed panHiTE, a comprehensive and accurate pipeline for TE detection in large-scale population genomes. It has been successfully applied to hundreds of plant population genomes, demonstrating its effectiveness and scalability.

For detailed instructions, please refer to the panHiTE tutorial.

Address of the bookmark: https://github.com/CSU-KangHu/HiTE

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Miniasm is still at an early stage of development. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio data sets and 3 out of 4 ONT data sets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X is used, not the whole data set), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50 of 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on the Y axis). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.
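The N50 figures quoted above are the contig length at which the contigs of that length or longer together cover at least half of the total assembly size. A minimal sketch of that calculation, assuming one contig length per line on standard input; the `n50` helper and the toy lengths are illustrative and are not part of miniasm:

```shell
# Hypothetical helper (not shipped with miniasm): read one contig length
# per line on stdin and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            # Walk from the longest contig down until half the total is covered.
            for (i = 1; i <= NR; i++) {
                covered += len[i]
                if (covered * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Toy lengths (made up): the total is 20, and the running sum 6, 11
# first reaches half of it (10) at the contig of length 5.
printf '%s\n' 2 3 4 5 6 | n50   # prints 5
```

Sorting first keeps the awk part a single pass over the lengths; on empty input the helper prints nothing.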

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful for producing high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count how many contigs:

grep ">" reads.fa | wc -l
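To see what the conversion and counting steps above do without running an assembly, the same awk and fold pipeline can be tried on a tiny hand-made GFA. This is only a sketch: the `gfa_to_fasta` helper name and the two toy segments are assumptions for illustration, not miniasm output.

```shell
# Hypothetical helper: turn GFA segment (S) lines read from stdin into FASTA.
gfa_to_fasta() {
    # An S line is "S <name> <sequence> ..."; print a header line, then the
    # sequence, and wrap long sequence lines at 80 columns.
    awk '/^S/{print ">"$2"\n"$3}' | fold -w 80
}

# Toy GFA (made-up data): two segments and one link line, tab-separated.
# Only the S lines become FASTA records; the L line is ignored.
contigs=$(printf 'S\tutg1\tACGTACGT\nS\tutg2\tGGCC\nL\tutg1\t+\tutg2\t+\t0M\n' \
    | gfa_to_fasta | grep -c '^>')
echo "contigs: $contigs"   # prints "contigs: 2"
```

Counting header lines with `grep -c '^>'` gives the same number as `grep ">" | wc -l`, and it stays correct even though fold splits each long sequence over several lines.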

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

ARCS: scaffolding genome drafts with linked reads

Rahul Nayak — Tue, 06 Mar 2018 16:35:26 -0600

ARCS is an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts.

Address of the bookmark: https://github.com/bcgsc/ARCS/

Jvarkit : Java utilities for Bioinformatics

Jit — Fri, 08 Jun 2018 09:31:55 -0500

Jvarkit is a collection of Java utilities for bioinformatics.

Address of the bookmark: http://lindenb.github.io/jvarkit/