BOL: Related items

SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs

Neel — Sun, 20 Dec 2020 01:39:54 -0600

SneakySnake is presented by its authors as the first and only pre-alignment filtering algorithm that works efficiently on modern CPU, FPGA, and GPU architectures. It speeds up sequence alignment calculation by more than two orders of magnitude for both short (Illumina) and long (ONT and PacBio) reads. Described by Alser et al. (preliminary version at https://arxiv.org/abs/1910.09020).

Address of the bookmark: https://github.com/CMU-SAFARI/SneakySnake

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is fast, approximate software for mapping long reads (PacBio/ONT) or assemblies to reference genome(s). It maps a query sequence against a reference region if and only if the estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but instead estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This estimate is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is determined automatically from the minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
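The conversion from a Jaccard estimate to an identity estimate can be sketched with the Mash distance formula, D = -(1/k) * ln(2j / (1 + j)), with identity taken as roughly 1 - D. This is a minimal illustration rather than MashMap's own code: the function name jaccard_to_identity is hypothetical, and the k-mer size of 16 and the Jaccard values are chosen only as examples.

```shell
# Convert a k-mer Jaccard estimate j into an approximate sequence identity
# through the Mash distance. Hypothetical helper, not part of MashMap.
# Arguments: $1 = Jaccard estimate (0 < j <= 1), $2 = k-mer size.
jaccard_to_identity() {
  awk -v j="$1" -v k="$2" 'BEGIN {
    d = -1 / k * log(2 * j / (1 + j))   # Mash distance
    printf "%.4f\n", 1 - d              # estimated identity
  }'
}

jaccard_to_identity 1.0 16   # identical k-mer sets: prints 1.0000
jaccard_to_identity 0.2 16   # prints 0.9313
```

Even a modest Jaccard value corresponds to a fairly high identity, because one mismatch alters up to k overlapping k-mers.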

Address of the bookmark: https://github.com/marbl/MashMap

LRCstats: a tool for evaluating long reads correction methods

Aaryan Lokwani — Wed, 22 Aug 2018 11:05:04 -0500

LRCstats is an open-source pipeline for benchmarking correction algorithms for DNA long reads produced by third-generation sequencing technologies, such as the machines made by Pacific Biosciences. As the name suggests, these reads are longer than those produced by next-generation sequencing technologies such as Illumina. However, long reads are plagued by high error rates, which can cause problems in downstream analysis. Long read correction algorithms reduce the error rate either through self-correction or by using accurate short reads from next-generation sequencing to correct the long reads.

Address of the bookmark: https://github.com/cchauve/lrcstats

Genome assembly tutorial "Genome Assembly for short and long reads"

Jit — Sat, 19 Jan 2019 17:29:53 -0600

In this lab we will perform de novo genome assembly of a bacterial genome. You will be guided through the genome assembly starting with data quality control, through to building contigs and analysis of the results. At the end of the lab you will know:

How to perform basic quality checks on the input data
How to run a short read assembler on Illumina data
How to run a long read assembler on Pacific Biosciences or Oxford Nanopore data
How to improve the accuracy of a long read assembly using short reads
How to assess the quality of an assembly

https://bioinformaticsdotca.github.io/high-throughput_biology_2017

Address of the bookmark: https://bioinformaticsdotca.github.io/high-throughput_biology_2017_module6_lab

AlignGraph2: similar genome-assisted reassembly pipeline for PacBio long reads

Rahul Nayak — Sun, 14 Mar 2021 09:42:47 -0500

AlignGraph2 is the second version of AlignGraph, designed for PacBio long reads. It extends and refines contigs assembled from the long reads using a published genome similar to the genome being sequenced.

More at https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab022/6146772

Address of the bookmark: https://github.com/huangs001/AlignGraph2

PBSuite: Software for Long-Read Sequencing Data from PacBio

Jit — Mon, 27 Feb 2017 09:54:47 -0600

PBJelly - the genome upgrading tool.
PBHoney - the structural variation discovery tool

Both are contained within the PBSuite code found in downloads.

----- PBJelly -----
Read The Paper
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0047768

PBJelly is a highly automated pipeline that aligns long sequencing reads (such as PacBio RS reads or long 454 reads in FASTA format) to high-confidence draft assemblies. PBJelly fills or reduces as many captured gaps as possible to produce upgraded draft genomes.

----- PBHoney -----
Read The Paper
http://www.biomedcentral.com/1471-2105/15/180/abstract

PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp). PBHoney considers both intra-read discordance and soft-clipped tails of long reads to identify structural variants.

Address of the bookmark: https://sourceforge.net/projects/pb-jelly/

The MARVEL assembler

Jit — Fri, 04 May 2018 19:18:41 -0500

MARVEL consists of a set of tools that facilitate the overlapping, patching, correction and assembly of noisy (and also less noisy) long reads.

The assembly process can be summarized as follows:

overlap
patch reads
overlap (again)
scrubbing
assembly graph construction and touring
optional read correction
fasta file creation

Address of the bookmark: https://github.com/schloi/MARVEL

Nanopore adaptor !

Jit — Mon, 03 Feb 2020 00:10:29 -0600

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.
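The trim-and-chop behaviour can be illustrated with a deliberately simplified sketch. It differs from Porechop in one important way: Porechop finds adapters by alignment, so it tolerates low sequence identity, while this sketch only cuts at exact matches. The adapter string and the function name trim_and_split are made up for illustration.

```shell
# Toy adapter sequence (hypothetical, not a real Nanopore adapter).
ADAPTER="GGTTAACC"

# Read one sequence per line on stdin, cut at every exact occurrence of the
# adapter, and print the remaining non-empty pieces one per line.
# An adapter at either end is simply removed; an adapter in the middle
# splits the read in two, mimicking the handling of chimeric reads.
# Note: awk treats a multi-character separator as a regex, which is safe
# here because the adapter contains only the letters A, C, G and T.
trim_and_split() {
  awk -v ad="$ADAPTER" '{
    n = split($0, piece, ad)
    for (i = 1; i <= n; i++)
      if (piece[i] != "") print piece[i]
  }'
}

# First read has the adapter at its start, second read has it in the middle.
printf '%s\n' "GGTTAACCACGTACGT" "ACGTAAGGTTAACCTTTTCC" | trim_and_split
```

The first read comes out as ACGTACGT, and the second is split into ACGTAA and TTTTCC.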

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

The known Nanopore adapters that Porechop looks for are defined in:

https://github.com/rrwick/Porechop/blob/master/porechop/adapters.py

They are:

Ligation kit adapters
Rapid kit adapters
PCR kit adapters
Barcodes
Native barcoding
Rapid barcoding

Address of the bookmark: https://github.com/rrwick/Porechop/blob/master/porechop/adapters.py

Run miniasm assembler on nanopore reads !

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically produced by minimap) as input and outputs an assembly graph in the GFA format. Unlike mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to that of the raw input reads.

Find the details of the repeats within the reads:

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list
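What the two awk filters above select can be checked on a tiny hand-made PAF file. In PAF, column 1 is the query name, column 5 the relative strand and column 6 the target name. The read names and coordinates below are made up for illustration, and the files are written to a temporary directory.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)

# Three toy PAF records (tab-separated, values invented for illustration):
# a forward self-hit, a reverse-strand self-hit, and a hit to another read.
printf 'readA\t1000\t10\t200\t+\treadA\t1000\t500\t690\t180\t190\t60\n'  > "$tmp/toy.paf"
printf 'readA\t1000\t50\t300\t-\treadA\t1000\t600\t850\t240\t250\t60\n' >> "$tmp/toy.paf"
printf 'readA\t1000\t0\t400\t+\treadB\t2000\t100\t500\t390\t400\t60\n'  >> "$tmp/toy.paf"

# Keep only hits where query and target are the same read (within-read repeats).
awk '$1 == $6' "$tmp/toy.paf" > "$tmp/toy_self.paf"

# Among those, list each read that has at least one reverse-strand self-hit
# (a candidate inverted repeat or palindrome).
awk '$5 == "-" {print $1}' "$tmp/toy_self.paf" | sort -u > "$tmp/toy_inverted.list"

wc -l < "$tmp/toy_self.paf"     # two self-hits remain
cat "$tmp/toy_inverted.list"    # readA
```

Here sort -u does the same job as the sort|uniq pair used above.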

Generate a few palindrome and repeat plots (highlighting only repeats larger than 10, 20 and 30 kb):

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps

Enjoy the assembly!

FMLRC: a long-read error correction tool using the multi-string Burrows Wheeler Transform

Neel — Fri, 10 Aug 2018 13:29:28 -0500

FMLRC, or FM-index Long Read Corrector, is a tool for performing hybrid correction of long-read sequencing data using the BWT and FM-index of short-read sequencing data. Given a BWT of the short-read sequencing data, FMLRC will build an FM-index and use that as an implicit de Bruijn graph. Each long read is then corrected independently by identifying low-frequency k-mers in the long read and replacing them with the closest matching high-frequency k-mers in the implicit de Bruijn graph. In contrast to other de Bruijn graph based implementations, FMLRC is not restricted to a particular k-mer size and instead uses a two-pass method with both a short "k-mer" and a longer "K-mer". This allows FMLRC to correct through low-complexity regions that are computationally difficult for short k-mers.
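The first step, spotting low-frequency k-mers in a long read, can be sketched as follows. This is only an illustration of the idea: it counts k-mers in an awk hash table instead of querying a BWT/FM-index, looks at the forward strand only, and does not attempt the correction itself. The function name weak_kmers, the toy reads, k = 4 and the count threshold of 2 are all assumptions made for the example.

```shell
# Print the 1-based position and sequence of every k-mer in a long read that
# occurs fewer than a minimum number of times in a set of short reads.
# Hypothetical helper, not part of FMLRC.
# $1 = file of short reads (one sequence per line), $2 = long read sequence,
# $3 = k, $4 = minimum count for a k-mer to be treated as high frequency.
weak_kmers() {
  awk -v lr="$2" -v k="$3" -v min="$4" '
    # Count every k-mer in every short read.
    { for (i = 1; i + k - 1 <= length($0); i++) count[substr($0, i, k)]++ }
    END {
      # Scan the long read and report k-mers below the count threshold.
      for (i = 1; i + k - 1 <= length(lr); i++) {
        kmer = substr(lr, i, k)
        if (count[kmer] + 0 < min) print i, kmer
      }
    }' "$1"
}

# Toy data: three identical short reads, and a long read with one substitution
# (A to T at position 5) relative to them.
reads=$(mktemp)
printf '%s\n' ACGTACGT ACGTACGT ACGTACGT > "$reads"
weak_kmers "$reads" ACGTTCGT 4 2
```

On the toy data, the k-mers starting at positions 2 to 5 of the long read are reported, which are exactly the ones overlapping the substituted base.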

Address of the bookmark: https://github.com/holtjma/fmlrc