BOL: Related items

Kaiju

Jit — Mon, 27 Jun 2016 11:23:04 -0500

Kaiju is a program for the taxonomic classification of metagenomic high-throughput sequencing reads. Each read is directly assigned to a taxon within the NCBI taxonomy by comparing it to a reference database containing microbial and viral protein sequences.

By default, Kaiju uses either the available complete genomes from NCBI RefSeq or the microbial subset of the non-redundant protein database nr used by NCBI BLAST, optionally also including fungi and microbial eukaryotes.

Kaiju translates reads into amino acid sequences, which are then searched in the database using a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform. The search finds maximum exact matches (MEMs), optionally allowing mismatches in the protein alignment. It can process up to millions of reads per minute using, for example, only 10 GB RAM with a protein database comprising 4821 microbial genomes. Kaiju can also be used for querying any other protein database without taxonomic classification, using either protein or nucleotide queries.
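As a rough illustration of the translation step only, the following Python sketch turns a nucleotide read into amino acid sequences for all six reading frames using the standard genetic code. Six-frame translation is an assumption made here, the function names are hypothetical, and the sketch does not reproduce Kaiju's Burrows-Wheeler search or its actual implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Return the three forward-strand frames followed by the three reverse-strand frames."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


frames = six_frame_translations("ATGGCCTAA")
# frames[0] is "MA*" (forward strand, frame 0)
```

Each translated frame could then be searched against a protein index; how the matches are scored and turned into a taxon assignment is outside this sketch.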

Kaiju is described in Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).

Address of the bookmark: http://kaiju.binf.ku.dk/

PLAST: A fast, accurate and NGS scalable bank-to-bank sequence similarity search tool

Jit — Fri, 01 Dec 2017 04:10:54 -0600

PLAST is a fast, accurate and NGS scalable bank-to-bank sequence similarity search tool providing significant accelerations of seed-based heuristic comparison methods, such as the BLAST suite of algorithms.

Relying on a unique software architecture, PLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware devices.

PLAST stands for Parallel Local Sequence Alignment Search Tool and was published in BMC Bioinformatics.

PLAST is a general purpose sequence comparison tool providing the following benefits:

Is designed to compare two sets of sequences (query vs. reference) at high performance,
Reduces the processing time of sequence comparisons while providing the highest-quality results,
Contains a fully integrated data filtering engine capable of selecting relevant hits with user-defined criteria (E-value, identity, coverage, alignment length, etc.),
Does not require any additional hardware, since it is a software solution. It is easy to install, cost-effective, takes full advantage of multi-core processors and uses a small RAM footprint,
Is ready to be used on a desktop computer, a cluster or the cloud, as well as within a distributed system running Hadoop.

https://plast.inria.fr/

Address of the bookmark: https://plast.inria.fr/

nanofilt: Filtering and trimming of long read sequencing data

Jit — Mon, 30 Jul 2018 12:01:52 -0500

Filtering on quality and/or read length, and optional trimming after passing filters.
Reads from stdin, writes to stdout.

Intended to be used:

directly after fastq extraction
prior to mapping
in a stream between extraction and mapping
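A minimal Python sketch of this kind of stream filter, not NanoFilt's own code: it reads four-line FASTQ records from one stream, keeps reads that pass a length and a mean-quality threshold, optionally trims bases from the start, and writes the survivors to another stream. The function names, the parameters (min_length, min_quality, headcrop) and the use of a plain arithmetic mean of Phred+33 scores are all assumptions made for illustration; the tool itself may compute average quality differently.

```python
import io
import sys


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of stream
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def passes(seq, qual, min_length, min_quality):
    """Check read length and arithmetic mean Phred score (Phred+33 encoding assumed)."""
    if not seq or len(seq) != len(qual) or len(seq) < min_length:
        return False
    mean_q = sum(ord(c) - 33 for c in qual) / len(qual)
    return mean_q >= min_quality


def run(in_handle, out_handle, min_length=0, min_quality=0.0, headcrop=0):
    """Filter first, then trim `headcrop` bases from the start of each passing read."""
    for header, seq, qual in read_fastq(in_handle):
        if passes(seq, qual, min_length, min_quality):
            out_handle.write(f"{header}\n{seq[headcrop:]}\n+\n{qual[headcrop:]}\n")


# Small in-memory demonstration; in a pipeline one would call run(sys.stdin, sys.stdout, ...).
demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACG\n+\n!!!\n")
kept = io.StringIO()
run(demo, kept, min_length=4, min_quality=10)
```

Calling run(sys.stdin, sys.stdout, ...) would let such a script sit in a pipe between FASTQ extraction and mapping, which is the usage pattern described above.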

https://github.com/wdecoster/nanofilt

Address of the bookmark: https://github.com/wdecoster/nanofilt

LoFreq*: A sequence-quality aware, ultra-sensitive variant caller for NGS data

BioStar — Tue, 18 Feb 2020 03:24:22 -0600

LoFreq* (i.e. LoFreq version 2) is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. It makes full use of base-call qualities and other sources of errors inherent in sequencing (e.g. mapping or base/indel alignment uncertainty), which are usually ignored by other methods or only used for filtering.

https://github.com/CSB5/lofreq

http://csb5.github.io/lofreq/installation/

https://github.com/CSB5/lofreq/tree/master/dist

Address of the bookmark: http://csb5.github.io/lofreq/

Alignment-free sequence comparison tools available for next-generation sequencing data analysis

Abhimanyu Singh — Tue, 07 Nov 2017 05:33:33 -0600

kallisto

Transcript abundance quantification from RNA-seq data (uses pseudoalignment for rapid determination of read compatibility with targets)

Software (C++)

https://pachterlab.github.io/kallisto/

Sailfish

Estimation of isoform abundances from reference sequences and RNA-seq data (k-mer based)

Software (C++)

http://www.cs.cmu.edu/~ckingsf/software/sailfish/

Salmon

Quantification of the expression of transcripts using RNA-seq data (uses k-mers)

https://combine-lab.github.io/salmon/

RNA-Skim

RNA-seq quantification at transcript-level (partitions the transcriptome into disjoint transcript clusters; uses sig-mers, a special type of k-mers)

Software (C++)

http://www.csbio.unc.edu/rs/

Variant calling

ChimeRScope

Fusion transcript prediction using gene k-mer profiles of RNA-seq paired-end reads

Software (Java)

https://github.com/ChimeRScope/ChimeRScope/wiki

FastGT

Genotyping of known SNV/SNP variants directly from raw NGS sequence reads by counting unique k-mers

Software (C)

https://github.com/bioinfo-ut/GenomeTester4/

Phy-Mer

Reference-independent mitochondrial haplogroup classifier from NGS data (k-mer based)

Software (Python)

https://github.com/danielnavarrogomez/phy-mer

LAVA

Genotyping of known SNPs (dbSNP and Affymetrix's Genome-Wide Human SNP Array) from raw NGS reads (k-mer based)

Software (C)

http://lava.csail.mit.edu/

MICADo

Detection of mutations in targeted third-generation NGS data (can distinguish patients’ specific mutations; algorithm uses k-mers and is based on colored de Bruijn graphs)

Software (Python)

http://github.com/cbib/MICADo

General mapper

Minimap

Lightweight and fast read mapper and read overlap detector (uses the concept of “minimizers”, a special type of k-mers)

Software (C)

https://github.com/lh3/minimap
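To illustrate the minimizer idea mentioned for Minimap, the Python sketch below picks, in every window of w consecutive k-mers, the smallest k-mer and records its position. This is a simplification: ordering k-mers by plain string comparison and the function name minimizers are assumptions made here, and practical mappers generally order k-mers by a hash value rather than lexicographically.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs, one winner per window of w consecutive k-mers.

    The winner of a window is its lexicographically smallest k-mer;
    ties are broken in favour of the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


# Four overlapping windows all select the same k-mer, so only one entry is stored.
print(minimizers("TTTACGTTT", k=3, w=4))  # [(3, 'ACG')]
```

Because neighbouring windows often share their winner, far fewer k-mers are stored than a sequence contains, while two sequences with a long enough exact match still select the same minimizers inside it.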

Assembly

De novo genome assembly

MHAP

Produces highly continuous assembly (fully resolved chromosome arms) from third-generation long and noisy reads (10 kbp) using the MinHash dimensionality-reduction technique

Software (Java)

https://github.com/marbl/MHAP

Miniasm

Assembler of long noisy reads (SMRT, ONT) using the Overlap-Layout Consensus (OLC) approach without the necessity of an error correction stage (uses minimap)

Software (C)

https://github.com/lh3/miniasm

LINKS

Scaffolding genome assembly with error-containing long sequences (e.g., ONT or PacBio reads, draft genomes)

Software (Perl)

https://github.com/warrenlr/LINKS/

Read clustering

afcluster

Clustering of reads from different genes and different species based on k-mer counts

Software (C++)

https://github.com/luscinius/afcluster

QCluster

Clustering of reads with alignment-free measures (k-mer based) and quality values

Software (C++)

http://www.dei.unipd.it/~ciompin/main/qcluster.html

Reads error correction

Lighter

Correction of sequencing errors in raw, whole genome sequencing reads (k-mer based)

Software (C++)

https://github.com/mourisl/Lighter

QuorUM

Error corrector for Illumina reads using k-mers

Software (C++)

https://github.com/gmarcais/Quorum

Trowel

Software (C++)

https://sourceforge.net/projects/trowel-ec/

Metagenomics

Assembly-free phylogenomics

AAF

Phylogeny reconstruction directly from unassembled raw sequence data from whole genome sequencing projects; provides bootstrap support to assess uncertainty in the tree topology (k-mer based)

Software (Python)

https://github.com/fanhuan/AAF

kSNP v3

Reference-free SNP identification and estimation of phylogenetic trees using SNPs (based on k-mer analysis)

Software (C)

https://sourceforge.net/projects/ksnp/files/

NGS-MC

Phylogeny of species based on NGS reads using alignment-free sequence dissimilarity measures d2* and d2S under different Markov chain models (using k-words)

R package

http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html

Species identification/taxonomic profiling

CLARK

Taxonomic classification of metagenomic reads to known bacterial genomes using k-mer search and LCA assignment

Software (C++)

http://clark.cs.ucr.edu/

FOCUS

Reports organisms present in metagenomic samples and profiles their abundances (uses composition-based approach and non-negative least squares for prediction)

Web service; Software (Python)

http://edwards.sdsu.edu/FOCUS/

GSM

Estimation of abundances of microbial genomes in metagenomic samples (k-mer based)

Software (Go)

https://github.com/pdtrang/GSM

Mash

Species identification using assembled or unassembled Illumina, PacBio, and ONT data (based on MinHash dimensionality-reduction technique)

Software (C++)

https://github.com/marbl/mash
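The MinHash idea referred to for Mash (and for MHAP above) can be sketched in a few lines of Python: each sequence is reduced to the s smallest hash values of its k-mers, and the Jaccard similarity of the k-mer sets is estimated from those small sketches alone. The function names, the choice of SHA-1 as hash function and the bottom-s variant shown here are assumptions for illustration, not a description of either tool's implementation.

```python
import hashlib


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 is an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_sketch(seq, k, size):
    """Return the `size` smallest distinct k-mer hash values of a sequence."""
    hashes = {hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-`size` sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


a = bottom_sketch("ACGTACGT", k=3, size=100)
b = bottom_sketch("ACGTTT", k=3, size=100)
similarity = jaccard_estimate(a, b, size=100)  # 2 shared k-mers out of 6 distinct ones
```

When the sketch size is at least the number of distinct k-mers, as in this toy example, the estimate equals the exact Jaccard index; with smaller sketches it becomes an approximation whose error shrinks as the sketch grows.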

Kraken

Taxonomic assignment in metagenome analysis by exact k-mer search; LCA assignment of short reads based on a comprehensive sequence database

Software (C++)

https://ccb.jhu.edu/software/kraken/
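The lowest common ancestor (LCA) operation mentioned for Kraken and CLARK can be illustrated with a toy taxonomy stored as a child-to-parent map. The taxonomy below is deliberately simplified (ranks are skipped) and the function names are hypothetical; the sketch shows only the LCA computation itself, not how either tool turns k-mer hits into a read classification.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root (the root is its own parent)."""
    path = [taxon]
    while parent[taxon] != taxon:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node shared by the lineages of all given taxa."""
    taxa = list(taxa)
    common = lineage(taxa[0], parent)  # ordered from the taxon up to the root
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    return common[0]


# Toy taxonomy, for illustration only.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Enterobacteriaceae": "Bacteria",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella": "Enterobacteriaceae",
    "Bacillus": "Bacteria",
}

print(lowest_common_ancestor(["Escherichia", "Salmonella"], PARENT))  # Enterobacteriaceae
print(lowest_common_ancestor(["Escherichia", "Bacillus"], PARENT))    # Bacteria
```

A k-mer found in several genomes can thus be labelled with a single node that is as specific as the evidence allows.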

LMAT

Assignment of taxonomic labels to reads by k-mer searches in a precomputed database

Software (C++/Python)

https://sourceforge.net/projects/lmat/

stringMLST

k-mer-based tool for MLST directly from the genome sequencing reads

Software (Python)

http://jordan.biology.gatech.edu/page/software/stringMLST

Taxonomer

k-mer-based ultrafast metagenomics tool for assigning taxonomy to sequencing reads from clinical and environmental samples

Web service

http://taxonomer.iobio.io/

Other

d2-tools

Word-based (k-tuple) comparison (pairwise dissimilarity matrix using d2S measure) of metatranscriptomic samples from NGS reads

Software (Python/R)

https://code.google.com/p/d2-tools/

VirHostMatcher

Prediction of hosts from metagenomic viral sequences based on oligonucleotide frequencies (ONF) using various distance measures (e.g., d2)

Software (C++)

https://github.com/jessieren/VirHostMatcher

MetaFast

Statistics calculation of metagenome sequences and the distances between them based on assembly using de Bruijn graphs and Bray–Curtis dissimilarity measure

Software (Java)

https://github.com/ctlab/metafast
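For reference, the Bray–Curtis dissimilarity between two abundance profiles a and b is 1 - 2 * sum_i min(a_i, b_i) / (sum_i a_i + sum_i b_i). The short Python sketch below computes it for profiles given as dictionaries; the function name and the toy feature names are assumptions, and which features MetaFast actually compares is not reproduced here.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two {feature: abundance} dictionaries.

    Returns 0.0 for identical profiles and 1.0 for profiles with no shared feature.
    """
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    shared = sum(
        min(abundance, profile_b.get(feature, 0))
        for feature, abundance in profile_a.items()
    )
    return 1.0 - 2.0 * shared / total


sample_1 = {"x": 2, "y": 1}
sample_2 = {"x": 1, "z": 3}
distance = bray_curtis(sample_1, sample_2)  # 1 - 2*1/7 = 5/7
```

Computing this value for every pair of samples yields the kind of distance matrix that can be passed on to clustering or ordination.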

BFC: a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data

Jit — Thu, 31 May 2018 09:35:23 -0500

BFC is a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though it also performs well for small genomes. The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al. (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in the author's fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked Bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). The use of a Bloom filter is how BFC got its name, though other correctors such as Lighter and Bless actually rely more on Bloom filters than BFC does.

https://github.com/lh3/bfc
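The counting idea described above (a Bloom filter absorbs the first sighting of each k-mer, and only k-mers seen again enter the hash table) can be sketched in Python as follows. This is an illustrative simplification and not BFC's code: it uses a plain rather than a blocked Bloom filter, and the class name, function name, filter size and hash construction are all assumptions.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (BFC itself uses a blocked variant)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (arbitrary choice).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.n_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.n_bits

    def add(self, item):
        """Insert item; return True if it already appeared to be present."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


def count_non_singletons(reads, k):
    """Count k-mers seen at least twice; most singletons never reach the dict."""
    bloom = BloomFilter()
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif bloom.add(kmer):
                counts[kmer] = 2  # the first sighting was recorded only in the filter
    return counts


counts = count_non_singletons(["ACGTA", "ACGTT"], k=4)
```

Because a Bloom filter can report false positives, an occasional singleton may still slip into the table with a count of 2, which is why the description above says most rather than all singletons are filtered out.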

Address of the bookmark: https://github.com/lh3/bfc

Ra assembler - a de novo DNA assembler for third generation sequencing data

biogeek — Wed, 27 Dec 2017 20:36:54 -0600

Integration of the Ra assembler - a de novo DNA assembler for third generation sequencing data developed at the Faculty of Electrical Engineering and Computing (FER), Ruder Boskovic Institute (RBI) and Genome Institute of Singapore (GIS).

Ra has been in development since 2014 in the form of several separate components that used to be run individually.
This project aims to ease the usage of Ra by integrating it into a complete de novo assembly tool.

Unlike other state-of-the-art assemblers, Ra does not have an error correction step. Instead, it relies on detecting overlaps using a very sensitive and specific overlapper ("graphmap -w owler", https://github.com/isovic/graphmap) and constructing and reducing an overlap graph (Ra layout, https://github.com/mariokostelac/ra).

Address of the bookmark: https://github.com/mariokostelac/ra-integrate/

HASLR: a hybrid assembler which uses both second and third generation sequencing reads

BioStar — Mon, 04 May 2020 02:04:03 -0500

HASLR is a hybrid assembler that uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers. Furthermore, the contiguity and accuracy of the generated assemblies are on par with those of the other tools on most of the samples. Availability: HASLR is an open-source tool available at https://github.com/vpc-ccg/haslr.

Address of the bookmark: https://github.com/vpc-ccg/haslr

Does anyone have Nanopore latest updates?

Poonam Mahapatra — Mon, 12 Aug 2013 12:19:29 -0500

There was a lot of buzz about Oxford Nanopore Technologies® developing the GridION™ system and the miniaturised MinION™ device. These are a new generation of electronic molecular analysis systems for use in scientific research, personalised medicine, crop science, security/defence and more. The platform technology uses nanopores to analyse single molecules including DNA/RNA and proteins. With a broad patent portfolio, the Oxford Nanopore pipeline includes biological nanopores and solid-state nanopores.

Is this available yet, or is it still in trial mode?

https://www.nanoporetech.com/

https://www.nanoporetech.com/technology/the-minion-device-a-miniaturised-sensing-system/the-minion-device-a-miniaturised-sensing-system

Which math/statistics programming language/application do you most frequently use in bioinformatics?

John Parker — Thu, 04 Sep 2014 17:46:41 -0500

I'm doing a bit more statistical analysis on some bioinformatics things lately, and I'm curious if there are any programming languages that are particularly good for this kind of NGS computation. What suggestions do you guys have? Are there any languages that have exceptionally good libraries?