BOL: Related items

OPERA : Optimal Paired-End Read Assembler

Jit — Fri, 09 Sep 2016 05:28:58 -0500

OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as Scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al, 2011).

Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long-reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.

Address of the bookmark: https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/

dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes.

Jit — Wed, 20 Dec 2017 18:35:16 -0600

While the number of sequenced diploid genomes has been steadily increasing in the last few years, assembly of highly polymorphic (HP) diploid genomes remains challenging. As a result, there is a shortage of tools for assembling HP genomes from next-generation sequencing (NGS) data. The initial approaches to assembling HP genomes were proposed in the pre-NGS era and are not well suited for NGS projects. To address this limitation, we developed the first de Bruijn graph assembler, dipSPAdes, for HP genomes that significantly improves on the state-of-the-art assemblers for HP diploid genomes.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pubmed/25734602

NextDenovo: string graph-based de novo assembler for TGS long reads

Jit — Sun, 05 Jan 2020 04:08:29 -0600

NextDenovo is a string graph-based de novo assembler for TGS long reads. It uses a "correct-then-assemble" strategy similar to canu, but requires significantly less computing resources and storage. After assembly, the per-base accuracy is about 97-98%; to further improve single-base accuracy, please use NextPolish.

NextDenovo contains two core modules: NextCorrect and NextGraph. NextCorrect can be used to correct TGS long reads with approximately 15% sequencing errors, and NextGraph can be used to construct a string graph with corrected reads. It also contains a modified version of minimap2 for adapting input and output and producing more sensitive and accurate dovetail overlaps, and some useful utilities (see the project repository for more details).

Address of the bookmark: https://github.com/Nextomics/NextDenovo

Unicycler: Hybrid assembly pipeline for bacterial genomes

Jit — Fri, 10 Nov 2017 03:58:27 -0600

Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets, where it functions as a SPAdes-optimiser. It can also assemble long-read-only sets (PacBio or Nanopore), where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.

Address of the bookmark: https://github.com/rrwick/Unicycler

Nanopolish: polish a genome assembly

Rahul Nayak — Thu, 26 Jul 2018 04:51:28 -0500

Software package for signal-level analysis of Oxford Nanopore sequencing data. Nanopolish can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome and more (see Nanopolish modules, below).

Quickstart

http://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html

Algorithms

http://simpsonlab.github.io/2017/06/30/nanopolish-v0.7.0/

Address of the bookmark: https://github.com/jts/nanopolish

CANU genome assembly parameters !

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose appropriate parameters for Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Because we compute on a single Amazon server, disable the option to compute on a cluster with useGrid=false. For your own project, discuss these specifications with a local computing guru. Parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in the Canu documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.
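If UNIX tools are not at hand, the same basic statistics can be sketched in Python. This is a minimal sketch that assumes an uncompressed FASTA file; the function names read_fasta_lengths and assembly_stats are hypothetical helpers for illustration and are not part of Canu.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_stats(lengths: List[int]) -> Dict[str, int]:
    """Compute contig count, total length, longest contig and N50."""
    if not lengths:
        return {"contigs": 0, "total": 0, "longest": 0, "n50": 0}
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    # N50: length of the contig at which the running sum of the
    # length-sorted contigs first reaches half of the total length.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(ordered), "total": total, "longest": ordered[0], "n50": n50}


if __name__ == "__main__":
    # Tiny in-memory FASTA so the sketch runs without any input file.
    demo = [">c1", "ACGTACGTAC", ">c2", "ACGT", "AC", ">c3", "ACG"]
    print(assembly_stats(read_fasta_lengths(demo)))
```

For a real run, pass an open file handle for the *.contigs.fasta file instead of the demo list; a gzip-compressed file would need to be opened with gzip.open in text mode first.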

More at https://canu.readthedocs.io/en/latest/faq.html

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

Rahul Nayak — Fri, 01 May 2020 03:00:40 -0500

RefKA is a reference-based approach for long-read genome assembly. This approach relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel, followed by a final bin-stitching step.
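The binning and unique k-mer idea can be illustrated with a toy sketch. This is not RefKA's implementation: the function names, the bin size and the value of k are assumptions chosen for illustration, exact k-mer matching stands in for alignment, and reverse complements and k-mers spanning bin boundaries are ignored.

```python
from collections import Counter
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_into_bins(reference: str, bin_size: int) -> List[str]:
    """Cut the reference into consecutive, non-overlapping bins."""
    return [reference[i:i + bin_size] for i in range(0, len(reference), bin_size)]


def unique_kmers_per_bin(bins: List[str], k: int) -> List[Set[str]]:
    """For each bin, keep only the k-mers that occur in no other bin."""
    per_bin = [kmers(b, k) for b in bins]
    # Count in how many bins each k-mer appears.
    bin_count = Counter(km for s in per_bin for km in s)
    return [{km for km in s if bin_count[km] == 1} for s in per_bin]


def assign_read(read: str, unique: List[Set[str]], k: int) -> int:
    """Return the index of the bin sharing the most unique k-mers with the read.

    Returns -1 when the read shares no unique k-mer with any bin.
    """
    read_kmers = kmers(read, k)
    hits = [len(read_kmers & u) for u in unique]
    if not hits:
        return -1
    best = max(range(len(hits)), key=hits.__getitem__)
    return best if hits[best] > 0 else -1


if __name__ == "__main__":
    bins = split_into_bins("AAAACCCCGGGGTTTT", bin_size=8)
    unique = unique_kmers_per_bin(bins, k=4)
    for read in ("AACCCC", "GGTTTT", "ACGTAC"):
        print(read, "-> bin", assign_read(read, unique, k=4))
```

Reads grouped by bin in this way could then be assembled independently, which is what makes the per-bin step parallel.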

Address of the bookmark: https://github.com/AppliedBioinformatics/RefKA

CSA: A high-throughput chromosome-scale assembly pipeline for vertebrate genomes

Jit — Wed, 10 Mar 2021 06:13:49 -0600

The pipeline can use information from scaffolded assemblies (for example from HiC or 10X Genomics), or even from diverged (~65-100 Mya) reference genomes for ordering the contigs and thus support the assembly process. This typically results in improved contig N50 when compared to current state of the art methods.

For smaller vertebrate genomes (~1 Gbp) chromosome scale assemblies can be achieved within 12h on high-end Desktop computers (Intel i7, 12 CPU threads, 128 GB RAM). Larger mammalian genomes (~3Gbp) can be processed within 15-18 h on server equipment (Xeon, 96 CPU threads, 1TB RAM).

Address of the bookmark: https://github.com/HMPNK/CSA2.6

MAKER

Jitendra Narayan — Sun, 07 Feb 2016 15:59:24 -0600

MAKER is a portable and easily configurable genome annotation pipeline. Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions and automatically synthesizes these data into gene annotations having evidence-based quality values.

More at http://www.yandell-lab.org/software/maker.html

Address of the bookmark: http://www.yandell-lab.org/software/maker.html

CrossMap

Jitendra Narayan — Mon, 08 Feb 2016 15:47:00 -0600

CrossMap is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies (such as Human hg18 (NCBI36) <> hg19 (GRCh37), Mouse mm9 (MGSCv37) <> mm10 (GRCm38)).

It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF.

CrossMap is designed to lift over genome coordinates between assemblies. It’s not a program for aligning sequences to a reference genome.

We do not recommend using CrossMap to convert genome coordinates between species.
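To make the idea of coordinate conversion concrete, here is a toy sketch of lifting a single position through a list of ungapped aligned blocks. It is not CrossMap's implementation: the Block type, the lift_position function and the demo coordinates are assumptions for illustration, and strand handling and interval splitting are left out.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """An ungapped aligned block between two assemblies (0-based, same strand)."""
    src_start: int
    dst_start: int
    length: int


def lift_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Map a 0-based source position to the target assembly, or None if unmapped.

    blocks must be sorted by src_start and must not overlap in the source.
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    offset = pos - block.src_start
    if offset >= block.length:
        return None  # pos falls in a gap between blocks
    return block.dst_start + offset


if __name__ == "__main__":
    # Hypothetical blocks: source 0-49 maps to target 100-149,
    # source 60-99 maps to target 150-189; source 50-59 has no counterpart.
    demo_blocks = [Block(0, 100, 50), Block(60, 150, 40)]
    for p in (10, 55, 75):
        print(p, "->", lift_position(demo_blocks, p))
```

Positions that fall between blocks have no counterpart in the target assembly, which is why a liftover can legitimately fail for some records.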

More at http://crossmap.sourceforge.net/

Address of the bookmark: http://crossmap.sourceforge.net/