BOL: Related items

HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads

Rahul Nayak — Tue, 08 May 2018 04:27:22 -0500

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).

more at https://ccb.jhu.edu/software/hisat2/index.shtml

Address of the bookmark: https://github.com/infphilo/hisat2

pbalign: maps PacBio reads to reference sequences and saves alignments to a BAM file

Jit — Thu, 24 May 2018 10:06:52 -0500

pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specific filtering criteria, and converts the output to either the SAM format or PacBio Compare HDF5 (e.g., .cmp.h5) format. The output Compare HDF5 file will be compatible with Quiver if --forQuiver option is specified.

Address of the bookmark: https://github.com/PacificBiosciences/pbalign

AlignQC: A tool for assessing an alignment, and generating reports that are easy to share

Jit — Tue, 07 Aug 2018 04:41:07 -0500

Long read alignment analysis. Generate a reports on sequence alignments for mappability vs read sizes, error patterns, annotations and rarefraction curve analysis. The most basic analysis only requires a BAM file, and outputs a web browser compatible xhtml to visualize/share/store/extract analysis results.

https://f1000research.com/articles/6-100/

https://github.com/jason-weirather/AlignQC

Address of the bookmark: https://www.healthcare.uiowa.edu/labs/au/AlignQC/

Deepbinner: a signal-level demultiplexer for Oxford Nanopore reads

Neel — Tue, 27 Nov 2018 03:38:49 -0600

Deepbinner is a tool for demultiplexing barcoded Oxford Nanopore sequencing reads. It does this with a deep convolutional neural network classifier, using many of the architectural advances that have proven successful in image classification. Unlike other demultiplexers (e.g. Albacore and Porechop), Deepbinner identifies barcodes from the raw signal (a.k.a. squiggle) which gives it greater sensitivity and fewer unclassified reads.

Reasons to use Deepbinner:
- To minimise the number of unclassified reads (use Deepbinner by itself).
- To minimise the number of misclassified reads (use Deepbinner in conjunction with Albacore demultiplexing).
- You plan on running signal-level downstream analyses, like Nanopolish. Deepbinner can demultiplex the fast5 fileswhich makes this easier.
Reasons to not use Deepbinner:
- You only have basecalled reads not the raw fast5 files (which Deepbinner requires).
- You have a small/slow computer. Deepbinner is more computationally intensive than Porechop.
- You used a sequencing/barcoding kit other than the ones Deepbinner was trained on.

Address of the bookmark: https://github.com/rrwick/Deepbinner

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

BioStar — Tue, 04 Feb 2020 23:23:16 -0600

Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.

Usage: perl run_rcorrector.pl [OPTIONS]
OPTIONS:
	Required
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paried-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma sperated files for interleaved paired-end data sets
	Optional
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-trim : allow trimming (default: false)
	-maxcorK INT: the maximum number of correction within k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate weak kmer count threshold, lower for more divergent genome (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from begining(storing kmers in bloom filter) ;
		1-start from count kmers showed up in bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.

Address of the bookmark: https://github.com/mourisl/Rcorrector/

SqueezeMeta: a fully automated metagenomics pipeline, from reads to bins

BioStar — Mon, 17 Aug 2020 05:25:10 -0500

SqueezeMeta is a full automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis. SqueezeMeta includes multi-metagenome support allowing the co-assembly of related metagenomes and the retrieval of individual genomes via binning procedures. Thus, SqueezeMeta features several unique characteristics:

Co-assembly procedure with read mapping for estimation of the abundances of genes in each metagenome
Co-assembly of a large number of metagenomes via merging of individual metagenomes
Includes binning and bin checking, for retrieving individual genomes
The results are stored in a database, where they can be easily exported and shared, and can be inspected anywhere using a web interface.
Internal checks for the assembly and binning steps inform about the consistency of contigs and bins, allowing to spot potential chimeras.
Metatranscriptomic support via mapping of cDNA reads against reference metagenomes

Address of the bookmark: https://github.com/jtamames/SqueezeMeta

Stacks

Jitendra Narayan — Wed, 24 Feb 2016 15:52:30 -0600

Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.

More at http://catchenlab.life.illinois.edu/stacks/

Address of the bookmark: http://catchenlab.life.illinois.edu/stacks/

cutadapt

Radha Agarkar — Fri, 13 May 2016 04:54:50 -0500

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

Cleaning your data in this way is often required: Reads from small-RNA sequencing contain the 3’ sequencing adapter because the read is longer than the molecule that is sequenced. Amplicon reads start with a primer sequence. Poly-A tails are useful for pulling out RNA from your sample, but often you don’t want them to be in your reads.

Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Also, paired-end reads and even colorspace data is supported. If you want, you can also just demultiplex your input data, without removing adapter sequences at all.

Cutadapt comes with an extensive suite of automated tests and is available under the terms of the MIT license.

If you use cutadapt, please cite DOI:10.14806/ej.17.1.200 .

Address of the bookmark: https://cutadapt.readthedocs.io/en/stable/installation.html#quickstart

WgSim

Jit — Thu, 23 Jun 2016 07:26:49 -0500

Reads simulator

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms.

Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.

Address of the bookmark: https://github.com/lh3/wgsim

Krona

Jit — Wed, 22 Mar 2017 04:47:35 -0500

Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The interactive charts are self-contained and can be viewed with any modern web browser (see Browser support).

Address of the bookmark: https://github.com/marbl/Krona/wiki