BOL: Related items

COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly

Jit — Wed, 06 Dec 2017 02:08:14 -0600

Connecting Overlapped Pair-End (COPE) reads is an efficient tool that connects overlapping pair-end reads using k-mer frequencies. We evaluated our tool on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which are, respectively, 10% and 2% higher than the figures for the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs have fewer errors and give a 14-fold improvement in the N50 measurement compared with the contigs produced using unconnected reads.
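The basic idea of connecting an overlapping read pair can be sketched as below. This is a simplified illustration, not COPE's algorithm: it takes the longest end-to-start overlap that stays under a mismatch threshold and ignores k-mer frequencies, which COPE uses to resolve ambiguous overlaps. The function names and the min_overlap and max_mismatch_rate parameters are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def connect_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a pair-end read pair into one sequence, or return None.

    read2 is given as sequenced (from the opposite strand), so it is
    reverse-complemented first. The longest overlap between the end of
    read1 and the start of the reverse-complemented read2 that stays
    under the mismatch rate is accepted.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = mate[:overlap]
        mismatches = sum(1 for a, b in zip(tail, head) if a != b)
        if mismatches <= max_mismatch_rate * overlap:
            # Assumption: keep read1's bases inside the overlap.
            # A real tool would weigh base qualities here.
            return read1 + mate[overlap:]
    return None
```

Calling connect_pair(read1, read2) on two reads taken from opposite ends of the same short fragment returns the reconstructed fragment; pairs with no acceptable overlap return None and would stay unconnected.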

Address of the bookmark: ftp://ftp.genomics.org.cn/pub/cope

ALPACA: A hybrid strategy for assembly of genomic DNA shotgun sequencing reads.

Seema Singh — Mon, 30 Apr 2018 04:38:40 -0500

ALPACA requires Celera Assembler 8.3 or later. It is recommended to build Celera Assembler from source. (Why? The pre-built binaries CA_8.3rc1 and CA8.3rc2 will not work for any large data set.)

Detail paper at https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3927-8

Address of the bookmark: https://github.com/VicugnaPacos/ALPACA

NxRepair: error correction in de novo assemblies using Nextera Mate Pair Reads

BioStar — Thu, 24 Jan 2019 10:35:12 -0600

NxRepair is a Python module that automatically detects large structural errors in de novo assemblies using Nextera mate pair reads. The detector will break a contig at the site of an identified misassembly and will generate a new fasta file containing both the corrected contigs and the correct, unaffected contigs.

https://nxrepair.readthedocs.io/en/latest/tutorial.html

nxrepair aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta
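The contig-breaking step described above can be pictured with the short sketch below. It is an illustration only and not NxRepair's code: the function name break_contig, the 0-based break positions, and the _1, _2 naming of the pieces are all assumptions made for this example.

```python
def break_contig(name, sequence, break_positions):
    """Split one contig at the given 0-based positions.

    Returns a list of (name, sequence) tuples. Pieces are named
    name_1, name_2, and so on. A contig with no usable break
    position is returned unchanged, like an unaffected contig.
    """
    cuts = sorted(p for p in set(break_positions) if 0 < p < len(sequence))
    if not cuts:
        return [(name, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [
        (f"{name}_{i}", sequence[start:end])
        for i, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1)
    ]
```

Writing every returned tuple as a FASTA record would give a file holding both the split contigs and the untouched ones.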

Address of the bookmark: https://github.com/rebeccaroisin/nxrepair

HASLR: a tool for rapid genome assembly of long sequencing reads

LEGE — Fri, 31 Jan 2020 05:50:15 -0600

HASLR is a tool for rapid genome assembly of long sequencing reads. HASLR is a hybrid tool which means it requires long reads generated by Third Generation Sequencing technologies (such as PacBio or Oxford Nanopore) together with Next Generation Sequencing reads (such as Illumina) from the same sample.

Address of the bookmark: https://github.com/vpc-ccg/haslr

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

BioStar — Fri, 27 Mar 2020 22:49:31 -0500

HiCanu is a significant modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering.
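Homopolymer compression, the first of the techniques named above, simply collapses each run of identical bases to one base. A minimal sketch follows; the function name is hypothetical and this is not taken from the Canu source.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(seq))
```

Two reads that differ only in the length of a homopolymer run, such as ACCCGT and ACCGT, compress to the same string, so that kind of length error no longer affects overlap detection.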

More at https://www.biorxiv.org/content/10.1101/2020.03.14.992248v3

Address of the bookmark: https://github.com/marbl/canu

Short-read assembly using SPAdes

Abhimanyu Singh — Mon, 31 Jan 2022 07:18:16 -0600

If we only had Illumina reads, we could also assemble these using the tool SPAdes.

You can try this here, or try it later on your own data.

Get data

We will use the same Illumina data as we used above:

illumina_R1.fastq.gz: the Illumina forward reads
illumina_R2.fastq.gz: the Illumina reverse reads

Assemble

Run SPAdes:

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o spades_assembly_all_illumina

-1 is the input file of forward reads
-2 is the input file of reverse reads
--careful minimizes mismatches and short indels
--cov-cutoff auto computes the coverage threshold (rather than the default setting, “off”)
-o is the output directory

Results

Move into the output directory and look at the contigs:

infoseq contigs.fasta
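Besides listing the contigs with infoseq, it can help to summarise their lengths. The sketch below reads a FASTA file and computes the N50; the function names are hypothetical, and the file name contigs.fasta is the one used in the command above.

```python
def read_fasta_lengths(path):
    """Return the length of each sequence in a FASTA file, in file order."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly
```

For example, n50(read_fasta_lengths("contigs.fasta")) gives the N50 of the assembly when run from inside the output directory.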

ALE: a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies

Neel — Tue, 26 Apr 2016 03:38:43 -0500

The Assembly Likelihood Evaluation (ALE) framework systematically evaluates the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single-genome and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies present in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.
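To give a feel for how one of these signals can be turned into a likelihood, the sketch below scores observed insert lengths under a normal distribution. This is a toy illustration under a stated assumption (a normal model with a known mean and standard deviation) and is not ALE's actual scoring code; the function name and parameters are hypothetical.

```python
import math


def insert_log_likelihood(insert_sizes, mean, std):
    """Sum of log-densities of the insert sizes under Normal(mean, std).

    Assumption: insert lengths are modelled as normally distributed.
    A higher (less negative) value means the observed inserts agree
    better with the expected library insert length.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    log_norm = math.log(std * math.sqrt(2 * math.pi))
    total = 0.0
    for size in insert_sizes:
        z = (size - mean) / std
        total += -0.5 * z * z - log_norm
    return total
```

Read pairs that map across a misjoin tend to show implausible insert lengths, so a region with such pairs scores lower than a region whose pairs sit near the library mean.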

More at http://www.ncbi.nlm.nih.gov/pubmed/23303509

Address of the bookmark: http://sc932.github.io/ALE/about.html

Hagfish - assess an assembly through creative use of coverage plots

Abhi — Fri, 20 May 2016 19:08:17 -0500

Hagfish is a tool for data analysis of Next Generation Sequencing (NGS) experiments. Hagfish builds on the concept of coverage plots and aims to assist with, among other things, quality control of de novo genome assembly and identification of structural variation in a genome re-sequencing experiment.

Hagfish requires a reference sequence and a paired-end re-sequencing data set. The larger the insert size of the paired-end library, the more power Hagfish has.
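The kind of coverage that a paired-end library provides can be sketched as below: for each reference base, count how many read-pair fragments span it. This is a generic illustration and not Hagfish's implementation; the function name and the (start, end) fragment representation are assumptions.

```python
def span_coverage(reference_length, fragments):
    """Count, for each reference base, how many fragments span it.

    fragments is an iterable of (start, end) pairs, 0-based and
    end-exclusive, giving the leftmost and rightmost mapped
    coordinate of each read pair.
    """
    # Difference array: +1 where a fragment starts, -1 where it ends.
    delta = [0] * (reference_length + 1)
    for start, end in fragments:
        start = max(start, 0)
        end = min(end, reference_length)
        if start < end:
            delta[start] += 1
            delta[end] -= 1
    coverage = []
    depth = 0
    for change in delta[:reference_length]:
        depth += change
        coverage.append(depth)
    return coverage
```

With the same number of read pairs, longer inserts make each fragment span more bases, so the per-base count rises, which is one way to see why larger insert sizes help.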

Quick links: Installation, Operation, Read mappers, Hagfish scripts, Hagfish plots

Address of the bookmark: https://github.com/mfiers/hagfish

CABOG: Celera Assembler with Best Overlap Graph

Abhimanyu Singh — Mon, 15 May 2017 05:04:39 -0500

CABOG (Celera Assembler with Best Overlap Graph) is scientific software for DNA research. CABOG has been a critical component of many genome sequencing projects. CABOG operates on small genomes, such as bacterial ones, as well as large genomes, such as mammalian ones. CABOG is an extension of the Celera Assembler software that was originally developed at Celera for the 2001 publication of the first draft human genome sequence. The software was released to the public domain in 2004. Its open source repository on SourceForge is an internet resource for scientists around the world.

CABOG is one of many software programs called genome assemblers. These programs exist to overcome the fundamental limitation of all sequencing machines, namely, that they read out very few DNA letters at a time. These programs reconstruct genomes that are billions of letters long from the hundreds of letters per read that modern sequencers provide. What these programs do is often described as a scaled up version of a family solving a jigsaw puzzle.

The CABOG software was the first to accomplish many scientific goals. It was the first to assemble the genome of a multicellular organism (Drosophila melanogaster, 2000). It was the first to assemble both parental haplotypes of one human genome (J. Craig Venter, 2007). It was the first to assemble environmental sequence from the oceans (Sargasso Sea in 2004 and Global Ocean Sampling in 2007). It was the first to combine reads from first-generation Sanger sequencing machines and second-generation pyrosequencing machines (marine microbes, 2006). Today, CABOG is one of the leading assembly programs for data sets that include paired-end data from the Roche 454 line of sequencing machines.

Address of the bookmark: http://www.jcvi.org/cms/research/projects/cabog/overview/

Consed--A Finishing Package (BAM File Viewer, Assembly Editor, Autofinish, Autoreport, Autoedit, and Align Reads To Reference Sequence)

Neel — Fri, 07 Feb 2020 07:16:22 -0600

Supports Illumina, 454, other Next-Gen and Sanger Reads and allows mixtures of these read types
Consed includes BamScape which can view bam files with unlimited numbers of reads. BamScape can bring up consed to edit reads and the reference sequence in targeted regions.
Consed is compatible with Newbler, Cross_match, Phrap, MIRA, Velvet and PCAP output.
Quickly takes the user to each variant site for viewing (also available as an automated report)
Overview of assembly can help detect and fix misassemblies
Editing time is reduced by the program's ability to pinpoint problem areas
Editing is guided by error probabilities

Address of the bookmark: http://www.phrap.org/consed/consed.html