BOL: Related items

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

Rahul Nayak — Fri, 11 May 2018 05:07:45 -0500

MECAT is an ultra-fast Mapping, Error Correction and de novo Assembly Tool for single-molecule sequencing (SMRT) reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than state-of-the-art aligners and error correction tools. MECAT can be used to assemble large genomes de novo effectively. For example, on a 32-thread computer with a 2.0 GHz CPU, MECAT takes 9.5 days to assemble a human genome from 54x SMRT data, which is 40 times faster than the current PBcR-Mhap pipeline. MECAT's performance was compared with the PBcR-Mhap pipeline, FALCON and Canu (v1.3) on five real datasets. The quality of the assembled contigs produced by MECAT is the same as or better than that of the PBcR-Mhap pipeline and FALCON.

https://www.nature.com/articles/nmeth.4432

Address of the bookmark: https://github.com/xiaochuanle/MECAT

BLASR: Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Theory and Application

Jit — Wed, 23 May 2018 06:54:32 -0500

BLASR (Basic Local Alignment with Successive Refinement) is a method for mapping Single Molecule Sequencing (SMS) reads that are thousands to tens of thousands of bases long, where the divergence between the read and the genome is dominated by insertion and deletion errors.

Here is how I use blasr to align PacBio reads to the contigs (target.fasta). The file “target.fasta.sa” is the suffix array of “target.fasta”, generated by sawriter.

blasr query.fa ./target.fasta -sa ./target.fasta.sa -bestn 40 -maxScore -500 -m 4 -nproc 24 -out target.m4 -maxLCPLength 15

The output format option “-m 4” generates the alignment coordinates. It is not fully documented, but I can explain it to you.

I use a server with 24 cores and 48 GB of RAM for the alignment. It took about 2 to 3 hours to align 3G of PacBio reads to 10^6 short-read contigs with a mean length of 3.5 kbp.

Address of the bookmark: http://bix.ucsd.edu/projects/blasr/

EAGLER: a scaffolding tool for long reads.

Jit — Mon, 04 Jun 2018 05:26:03 -0500

EAGLER is a scaffolding tool for long reads. The scaffolder takes as input a draft genome created by any NGS assembler and a set of long reads. The long reads are used to extend the contigs present in the NGS draft and possibly join overlapping contigs. EAGLER supports both PacBio and Oxford Nanopore reads.

The tool should be compatible with most UNIX flavors and has been successfully tested on the following operating systems:

Mac OS X 10.11.1
Mac OS X 10.10.3
Ubuntu 14.04 LTS

https://bib.irb.hr/datoteka/844447.Diplomski_2015_Luka_terbi.pdf

Address of the bookmark: https://github.com/mculinovic/EAGLER

JBrowse: Embeddable genome browser built completely with JavaScript and HTML5

Jit — Fri, 29 Jun 2018 09:19:56 -0500

JBrowse is a fast, embeddable genome browser built completely with JavaScript and HTML5, with optional run-once data formatting tools written in Perl.

Headline features:

Fast, smooth scrolling and zooming. Explore your genome with unparalleled speed.
Scales easily to multi-gigabase genomes and deep-coverage sequencing.
Quickly open and view data files on your computer without uploading them to any server.
Supports GFF3, BED, FASTA, Wiggle, BigWig, BAM, VCF (with either .tbi or .idx index), REST, and more. BAM, BigBed, BigWig, and VCF data are displayed directly from chunks of the compressed binary files, no conversion needed.
Includes an optional “faceted” track selector suitable for large installations with thousands of tracks.
Very light server resource requirements. In fact, JBrowse has no back-end server code, just tools for formatting data files to be read directly over HTTP. Serve huge datasets from a single low-cost cloud instance.
Can run as a stand-alone app on OSX and Windows using the Electron platform.
Highly extensible plugin architecture, with a large plugin registry of existing examples at https://gmod.github.io/jbrowse-registry

https://jbrowse.org/

Address of the bookmark: https://github.com/GMOD/jbrowse

FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads

Jit — Mon, 20 Aug 2018 04:08:50 -0500

Here is the command to run the tool:

python finisherSC.py destinedFolder mummerPath

If you are running on a server and would like to use multiple threads, the following command runs FinisherSC with 20 threads.

python finisherSC.py -par 20 destinedFolder mummerPath

Sometimes, if the names of the raw reads and contigs contain special characters or formats, FinisherSC/MUMmer may not parse them correctly. In that case, you can quickly rename the contigs/reads in contigs.fasta or raw_reads.fasta using the following commands.

    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
    cp newRaw_reads.fasta raw_reads.fasta
    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
    cp newContigs.fasta contigs.fasta
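The same renaming can be sketched in Python for cases where Perl is not convenient. The function name `rename_fasta_headers` and the example reads are made up for illustration. Unlike the Perl one-liner, this sketch treats every line that starts with `>` as a header and replaces it entirely.

```python
def rename_fasta_headers(lines, prefix="Seg"):
    """Yield FASTA lines with headers replaced by >Seg1, >Seg2, ...

    Hypothetical helper: any line starting with '>' counts as a header;
    sequence lines are passed through unchanged.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            yield f">{prefix}{count}\n"
        else:
            yield line


# Made-up input with header names that could confuse a parser.
example = [">read/1 weird|name\n", "ACGT\n", ">read/2 other name\n", "GGCC\n"]
renamed = list(rename_fasta_headers(example))
print("".join(renamed), end="")
```

For real files, the open input file can be passed directly as `lines`, with the result written to a new file and then copied over the original, as in the commands above.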

Address of the bookmark: https://github.com/kakitone/finishingTool

LRCstats: a tool for evaluating long reads correction methods

Aaryan Lokwani — Wed, 22 Aug 2018 11:05:04 -0500

LRCstats is an open-source pipeline for benchmarking DNA long read correction algorithms on reads produced by third generation sequencing technology, such as the machines made by Pacific Biosciences. As the name suggests, these reads are longer than those produced by next generation sequencing technologies such as Illumina. However, long reads are plagued by high error rates, which can cause issues in downstream analysis. Long read correction algorithms reduce the error rate of long reads either through self-correction or by using accurate short reads from next generation sequencing technologies to correct the long reads.

Address of the bookmark: https://github.com/cchauve/lrcstats

Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads

Rahul Nayak — Fri, 19 Oct 2018 08:23:42 -0500

Rainbow was developed to provide an ultra-fast and memory-efficient solution for clustering and assembling short reads produced by RAD-seq. First, Rainbow clusters reads using a spaced seed method. Then, Rainbow applies a strategy similar to heterozygote calling to divide potential groups into haplotypes in a top-down manner. Along a guide tree, it then iteratively merges sibling leaves in a bottom-up manner if they are similar enough. Here, similarity is defined by comparing the 2nd reads of a RAD segment. This approach tries to collapse heterozygotes while discriminating repetitive sequences. Finally, Rainbow uses a greedy algorithm to locally assemble merged reads into contigs. Rainbow outputs not only the optimal but also suboptimal assembly results. Based on simulated data and a real guppy RAD-seq dataset, we show that Rainbow is more competent than other tools in dealing with RAD-seq data.
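To illustrate the first step, here is a minimal sketch of spaced-seed grouping. It is not Rainbow's implementation: the mask, the function names, and the choice to take the seed from the start of each read are all assumptions made for illustration. A spaced seed compares only the positions marked `1` in the mask, so reads that differ at an ignored position still share a key.

```python
from collections import defaultdict

# Hypothetical mask: '1' = position is compared, '0' = position is ignored.
MASK = "1101011"


def spaced_key(read, mask=MASK):
    """Return the bases of `read` at the positions where the mask is '1'."""
    return "".join(base for base, m in zip(read, mask) if m == "1")


def group_by_spaced_seed(reads, mask=MASK):
    """Group reads whose leading window yields the same spaced-seed key."""
    groups = defaultdict(list)
    for read in reads:
        if len(read) >= len(mask):  # skip reads shorter than the mask
            groups[spaced_key(read, mask)].append(read)
    return dict(groups)


# Made-up reads: the first two differ only at position 2, which the mask ignores.
reads = ["ACGTACGTT", "ACTTACGAA", "TTGTACGTT"]
groups = group_by_spaced_seed(reads)
for key, members in groups.items():
    print(key, members)
```

The first two reads land in the same group despite a mismatch, which is the tolerance a spaced seed is meant to provide; a real clustering step would also need to handle mismatches at compared positions.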

Address of the bookmark: https://sourceforge.net/projects/bio-rainbow/files/

ARCS: scaffolding genome drafts with linked reads

Jit — Mon, 17 Dec 2018 17:40:28 -0600

ARCS requires two input files:

Draft assembly FASTA file
Interleaved linked reads file (the barcode sequence is expected in the BX tag of the read header or in the form "@readname_barcode"; run Long Ranger basic on raw Chromium reads to produce this interleaved file)
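The two barcode conventions mentioned above can be sketched as a small header parser. This is an illustration, not ARCS code: the function name `barcode_from_header` and the example headers are hypothetical, and the exact layouts are assumptions (a SAM-style `BX:Z:<barcode>` tag, or the barcode after the last underscore of the read name).

```python
import re

# Assumed layout of the BX tag: SAM-style 'BX:Z:<barcode>'.
BX_TAG = re.compile(r"BX:Z:(\S+)")


def barcode_from_header(header):
    """Return the barcode found in a read header, or None if there is none.

    Hypothetical helper: checks for a BX tag first, then falls back to
    the '@readname_barcode' form.
    """
    match = BX_TAG.search(header)
    if match:
        return match.group(1)
    fields = header.lstrip("@").split()
    if fields and "_" in fields[0]:
        return fields[0].rsplit("_", 1)[1]
    return None


# Made-up headers covering both forms and the no-barcode case.
print(barcode_from_header("@read1 BX:Z:ACGTACGTACGTACGT-1"))
print(barcode_from_header("@read2_TTGCA"))
print(barcode_from_header("@read3"))
```

A check like this can be run over the first few headers of the interleaved file to confirm that barcodes are present before starting a scaffolding run.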

Address of the bookmark: https://github.com/bcgsc/ARCS/

Flye: Fast and accurate de novo assembler for single molecule sequencing reads

BioJoker — Tue, 02 Apr 2019 21:54:55 -0500

Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PB / ONT reads as input and outputs polished contigs. Flye also includes a special mode for metagenome assembly.

Address of the bookmark: https://github.com/fenderglass/Flye

HASLR: a tool for rapid genome assembly of long sequencing reads

LEGE — Fri, 31 Jan 2020 05:50:15 -0600

HASLR is a tool for rapid genome assembly of long sequencing reads. HASLR is a hybrid tool, which means it requires long reads generated by Third Generation Sequencing technologies (such as PacBio or Oxford Nanopore) together with Next Generation Sequencing reads (such as Illumina) from the same sample.

Address of the bookmark: https://github.com/vpc-ccg/haslr