BOL: Related items

ALPACA: A hybrid strategy for assembly of genomic DNA shotgun sequencing reads.

Seema Singh — Mon, 30 Apr 2018 04:38:40 -0500

ALPACA requires Celera Assembler 8.3 or later. It is recommended to build Celera Assembler from source. (Why? The pre-built binaries CA_8.3rc1 and CA8.3rc2 will work for any large data set.

The paper is available at https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3927-8

Address of the bookmark: https://github.com/VicugnaPacos/ALPACA

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

Rahul Nayak — Tue, 05 Jun 2018 09:57:11 -0500

PERGA (Paired-End Reads Guided Assembler) is a novel read-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. Instead of using single-end reads to construct contigs, PERGA uses paired-end reads and different read overlap sizes, from O ≥ Omax down to Omin, to resolve gaps and branches. Moreover, by constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. PERGA tries to extend contigs by all feasible nucleotides and uses a look-ahead technique to determine whether multiple extensions are due to sequencing errors or to repeats. It also tries to separate the different repeats of nearby genomic regions to make the assembly longer and more accurate.

The simulated E. coli paired-end read data were generated using GemSim (KE McElroy, F Luciani, T Thomas. Gemsim: General, Error-Model Based Simulator of Next-Generation Sequencing Data. BMC Genomics 2012, 13:74), with coverage of 50x, 60x, and 100x and a read length of 100 bp, and can be downloaded from https://github.com/zhuxiao/data_PERGA.
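The idea of trying overlap sizes from Omax down to Omin can be illustrated with a toy greedy extender. This is only a sketch under strong assumptions and is not PERGA's implementation: it ignores paired-end information, replaces the machine learning decision model with a simple majority vote, and performs no look-ahead. The function names `next_base` and `extend` are hypothetical.

```python
from collections import Counter


def next_base(contig, reads, o_max, o_min):
    """Pick the next base for the contig end, or None if nothing extends it.

    Overlap sizes are tried from o_max down to o_min (o_min must be >= 1).
    At the first size where at least one read contains the contig suffix
    and continues past it, the most common following base wins.
    """
    for overlap in range(min(o_max, len(contig)), o_min - 1, -1):
        suffix = contig[-overlap:]
        votes = Counter()
        for read in reads:
            pos = read.find(suffix)
            while pos != -1:
                nxt = pos + overlap
                if nxt < len(read):
                    votes[read[nxt]] += 1
                pos = read.find(suffix, pos + 1)
        if votes:
            return votes.most_common(1)[0][0]
    return None


def extend(contig, reads, o_max, o_min, max_steps=1000):
    """Greedily extend the contig one base at a time.

    max_steps guards against endless extension through repeats.
    """
    for _ in range(max_steps):
        base = next_base(contig, reads, o_max, o_min)
        if base is None:
            break
        contig += base
    return contig


# Toy reads sampled from the sequence ACGTTGCAAGGCT.
toy_reads = ["ACGTTGC", "GTTGCAA", "GCAAGGC", "AAGGCT"]
print(extend("ACGTT", toy_reads, o_max=5, o_min=3))
```

In this toy run the extender falls back from an overlap of 5 to an overlap of 4 once, at the point where no read covers the last five bases of the contig and continues beyond them.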

Address of the bookmark: https://github.com/hitbio/PERGA

ASplice: a scalable and memory-efficient algorithm for de novo transcriptome assembly

Rahul Nayak — Tue, 03 Jul 2018 04:09:46 -0500

With increased availability of de novo assembly algorithms, it is feasible to study entire transcriptomes of non-model organisms. While algorithms are available that are specifically designed for performing transcriptome assembly from high-throughput sequencing data, they are very memory-intensive, limiting their applications to small data sets with few libraries. Texas A&M University researchers developed a transcriptome assembly algorithm that recovers alternatively spliced isoforms and expression levels while utilizing as many RNA-Seq libraries as possible that contain hundreds of gigabases of data. New techniques were developed so that computations can be performed on a computing cluster with a moderate amount of physical memory. Availability – A software program that implements the algorithm is available at: http://faculty.cse.tamu.edu/shsze/asplice. Sze SH, Pimsler ML, Tomberlin JK, Jones CD, Tarone AM. (2017) A scalable and memory-efficient algorithm for de novo transcriptome assembly of non-model organisms. BMC Genomics 18(Suppl 4):387.

Address of the bookmark: http://faculty.cse.tamu.edu/shsze/asplice/

FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads

Jit — Mon, 20 Aug 2018 04:08:50 -0500

Here is the command to run the tool:

    python finisherSC.py destinedFolder mummerPath

If you are running on a server and would like to use multiple threads, the following command runs FinisherSC with 20 threads.

    python finisherSC.py -par 20 destinedFolder mummerPath

If the names of the raw reads and contigs contain special characters or formats, FinisherSC/MUMmer may not parse them correctly. In that case, you can quickly rename the contigs/reads in contigs.fasta or raw_reads.fasta using the following commands.

    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
    cp newRaw_reads.fasta raw_reads.fasta
    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
    cp newContigs.fasta contigs.fasta
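For readers who prefer not to use the Perl one-liner, the same renaming can be sketched in Python. This is a minimal illustration, not part of FinisherSC: the function names `rename_fasta_headers` and `rename_fasta_file` are hypothetical, and unlike the Perl regex, which matches a `>` anywhere on a line, this version only rewrites lines that start with `>`.

```python
def rename_fasta_headers(lines, prefix="Seg"):
    """Yield FASTA lines with each header replaced by >Seg1, >Seg2, ...

    Sequence lines are passed through unchanged.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            yield f">{prefix}{count}\n"
        else:
            yield line


def rename_fasta_file(src, dst, prefix="Seg"):
    """Write a copy of src to dst with sequentially renamed headers."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(rename_fasta_headers(fin, prefix))


# In-memory example with awkward read names.
example = [">read/1 weird|name\n", "ACGT\n", ">read/2 weird|name\n", "GGCC\n"]
print("".join(rename_fasta_headers(example)), end="")
```

To mirror the commands above, one could call `rename_fasta_file("raw_reads.fasta", "newRaw_reads.fasta")` and then copy the result over the original file.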

Address of the bookmark: https://github.com/kakitone/finishingTool

Flye: Fast and accurate de novo assembler for single molecule sequencing reads

BioJoker — Tue, 02 Apr 2019 21:54:55 -0500

Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PB / ONT reads as input and outputs polished contigs. Flye also includes a special mode for metagenome assembly.

Address of the bookmark: https://github.com/fenderglass/Flye

StringTie: Transcript assembly and quantification for RNA-Seq

Jit — Tue, 09 Jun 2020 05:21:11 -0500

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only alignments of short reads that can also be used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. In order to identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.).

Address of the bookmark: https://ccb.jhu.edu/software/stringtie/

3D de novo assembly (3D DNA) pipeline

Jit — Sun, 02 Feb 2020 13:41:55 -0600

For a detailed description of the pipeline and how it integrates with other tools designed by the Aiden Lab, see the Genome Assembly Cookbook at http://aidenlab.org/assembly.

For the original version of the pipeline and to reproduce the Hs2-HiC and the AaegL4 genomes reported in (Dudchenko et al., Science, 2017) see the original commit.

For the detailed description of the merge section see https://github.com/theaidenlab/AGWG-merge.

Address of the bookmark: https://github.com/theaidenlab/3d-dna

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

BioStar — Fri, 27 Mar 2020 22:49:31 -0500

HiCanu is a significant modification of the Canu assembler, designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering.

More at https://www.biorxiv.org/content/10.1101/2020.03.14.992248v3

Address of the bookmark: https://github.com/marbl/canu

Supernova: generates phased, whole-genome de novo assemblies from a Chromium-prepared library.

Jit — Sun, 31 May 2020 01:59:30 -0500

Supernova generates phased, whole-genome de novo assemblies from a Chromium-prepared library.

Please see Achieving Success with De Novo Assembly and System Requirements before creating your Chromium libraries for assembly.

Supernova should be run using 38-56x coverage of the genome.
• Somewhat higher coverage is sometimes advantageous.
• Supernova will exit if it finds that coverage is far from the recommended range.
• Note that at most 2.14 billion reads are allowed.
• Please note that we have not extensively tested genomes larger than human, and any genome above approximately 4 GB should be considered experimental and is not supported.
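Before sequencing, the limits above can be sanity-checked with simple arithmetic. The sketch below assumes raw coverage is estimated as number of reads × read length / genome size; Supernova makes its own coverage estimate, which may differ. The function names and the example figures are hypothetical, and only the 38-56x range and the 2.14 billion read cap come from the text.

```python
MAX_READS = 2_140_000_000                 # "at most 2.14 billion reads"
MIN_COVERAGE, MAX_COVERAGE = 38.0, 56.0   # recommended coverage range


def raw_coverage(num_reads, read_length, genome_size):
    """Estimate raw coverage as total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def check_supernova_input(num_reads, read_length, genome_size):
    """Return (coverage, problems); problems is empty if all checks pass."""
    problems = []
    if num_reads > MAX_READS:
        problems.append(f"{num_reads} reads exceeds the {MAX_READS} read limit")
    coverage = raw_coverage(num_reads, read_length, genome_size)
    if not MIN_COVERAGE <= coverage <= MAX_COVERAGE:
        problems.append(
            f"coverage {coverage:.1f}x is outside "
            f"{MIN_COVERAGE:.0f}-{MAX_COVERAGE:.0f}x"
        )
    return coverage, problems


# Hypothetical example: 1 billion 150 bp reads for a 3.2 Gb genome.
coverage, problems = check_supernova_input(1_000_000_000, 150, 3_200_000_000)
print(f"{coverage:.3f}x", problems)
```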

Address of the bookmark: https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/using/running

auN: a new metric to measure assembly contiguity

Jit — Tue, 02 Aug 2022 01:18:47 -0500

Given a de novo assembly, we often measure the “average” contig length by N50. N50 is neither the real average nor the median. It is the length of the contig such that this and longer contigs cover at least 50% of the assembly. A longer N50 indicates better contiguity. We can similarly define Nx such that contigs no shorter than Nx cover x% of the assembly. The Nx curve plots Nx as a function of x, where x ranges from 0 to 100.
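The definitions above translate directly into a few lines of code. The sketch below computes Nx from a list of contig lengths. It also computes auN, which the excerpt names but does not define; the code assumes the definition used in the linked post, namely the area under the Nx curve, which equals the sum of squared contig lengths divided by the total length. The function names `nx` and `aun` are illustrative.

```python
def nx(lengths, x):
    """Return Nx for contig lengths, with x given in percent (0-100).

    Nx is the length of the contig at which contigs of that length or
    longer together cover at least x% of the assembly.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reached through floating-point rounding


def aun(lengths):
    """Area under the Nx curve: sum(L_i^2) / sum(L_i)."""
    return sum(length * length for length in lengths) / sum(lengths)


contigs = [80, 70, 50, 40, 30, 20, 10]  # toy contig lengths, total 300
print(nx(contigs, 50))   # N50
print(aun(contigs))      # auN
```

For these toy lengths the two largest contigs (80 + 70 = 150) reach half of the 300-base total, so N50 is 70, while auN is 16800 / 300 = 56.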

Address of the bookmark: https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity