BOL: Related items

NovoGraph: building whole genome graphs from long-read-based de novo assemblies

Jit — Thu, 15 Nov 2018 12:48:30 -0600

NovoGraph: building whole genome graphs from long-read-based de novo assemblies

An algorithmically novel approach to construct a genome graph representation of long-read-based de novo sequence assemblies. We then provide a proof of principle by creating a genome graph of seven ethnically-diverse human genomes.

https://f1000research.com/articles/7-1391/v1

Address of the bookmark: https://github.com/NCBI-Hackathons/NovoGraph

Omega2: metagenome assembly pipeline

Jit — Mon, 10 Jul 2017 05:56:07 -0500

Omega found overlaps between reads using a prefix/suffix hash table. The overlap graph of reads was simplified by removing transitive edges and trimming short branches. Unitigs were generated based on minimum cost flow analysis of the overlap graph and then merged to contigs and scaffolds using mate-pair information. In comparison with three de Bruijn graph assemblers (SOAPdenovo, IDBA-UD and MetaVelvet), Omega provided comparable overall performance on a HiSeq 100-bp dataset and superior performance on a MiSeq 300-bp dataset. In comparison with Celera on the MiSeq dataset, Omega provided more continuous assemblies overall using a fraction of the computing time of existing overlap-layout-consensus assemblers. This indicates Omega can more efficiently assemble longer Illumina reads, and at deeper coverage, for metagenomic datasets.

Address of the bookmark: http://omega.omicsbio.org/

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

So far miniasm is in early development stage. It has only been tested on a dozen of PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole dataset), miniasm finishes the assembly, including reads overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, the HGAP3produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as the more sophisticated overlapers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors)

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count how many contigs:

grep ">" reads.fa | wc -l

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.

Address of the bookmark: https://github.com/marbl/MashMap

Genome assembly stats plotting

Jit — Wed, 28 Feb 2018 03:45:39 -0600

A de novo genome assembly can be summarised b

y a number of metrics, including:

Overall assembly length
Number of scaffolds/contigs
Length of longest scaffold/contig
Scaffold/contig N50 and N90Assembly base composition, in particular percentage GC and percentage Ns
CEGMA completeness
Scaffold/contig length/count distribution

assembly-stats supports two widely used presentations of these values, tabular and cumulative length plots, and introduces an additional circular plot that summarises most commonly used assembly metrics in a single visualisation. Each of these presentations is generated using javascript from a common (JSON) data structure, allowing toggling between alternative views, and each can be applied to a single or multiple assemblies to allow direct comparison of alternate assemblies.

Tabular presentation allows direct comparison of exact values between assemblies, the limitations of this approach lie in the necessary omission of distributions and the challenge of interpreting ratios of values that may vary by several orders of magnitude.

Address of the bookmark: https://github.com/rjchallis/assembly-stats

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Jit — Tue, 15 May 2018 07:35:26 -0500

HapCUT2 is a maximum-likelihood-based tool for assembling haplotypes from DNA sequence reads, designed to "just work" with excellent speed and accuracy. We found that previously described haplotype assembly methods are specialized for specific read technologies or protocols, with slow or inaccurate performance on others. With this in mind, HapCUT2 is designed for speed and accuracy across diverse sequencing technologies, including but not limited to: NGS short reads (Illumina HiSeq) clone-based sequencing (Fosmid or BAC clones) SMRT reads (PacBio) Oxford Nanopore reads 10X Genomics Linked-Reads proximity-ligation (Hi-C) reads high-coverage sequencing (>40x coverage-per-SNP) using above technologies combinations of the above technologies (e.g. scaffold long reads with Hi-C reads) See below for specific examples of command line options and best practices for some of these technologies. NOTE: At this time HapCUT2 is for diploid organisms only. VCF input should contain diploid variants. If you use HapCUT2 in your research, please cite: Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. gr.213462.116 (2016). doi:10.1101/gr.213462.116

Address of the bookmark: https://github.com/vibansal/HapCUT2

assemblytics: delta file to analyze alignments of an assembly to another assembly or a reference genome

Jit — Thu, 14 Jun 2018 07:31:00 -0500

Download and install MUMmer Align your assembly to a reference genome using nucmer (from MUMmer package) $ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT Consult the MUMmer manual if you encounter problems Optional: Gzip the delta file to speed up upload (usually 2-4X faster) $ gzip OUT.delta Then use the OUT.delta.gz file for upload. Upload the .delta or delta.gz file (view example) to Assemblytics Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference. The unique sequence length required represents an anchor for determining if a sequence is unique enough to safely call variants from, which is an alternative to the mapping quality filter for read alignment. http://assemblytics.com/

Address of the bookmark: http://assemblytics.com/

transrate: Understanding your transcriptome assembly

Neel — Fri, 13 Jul 2018 07:49:26 -0500

Transrate is software for de-novo transcriptome assembly quality analysis. It examines your assembly in detail and compares it to experimental evidence such as the sequencing reads, reporting quality scores for contigs and assemblies. This allows you to choose between assemblers and parameters, filter out the bad contigs from an assembly, and help decide when to stop trying to improve the assembly.

Address of the bookmark: http://hibberdlab.com/transrate/index.html

HaploMerger2: rebuilding both haploid sub-assemblies from high-heterozygosity diploid genome assembly

BioStar — Thu, 27 Sep 2018 07:08:47 -0500

HM2 can process any diploid assemblies, but it is especially suitable for diploid assemblies with high heterozygosity (≥3%), which can be difficult for other tools. This pipeline also implements flexible and sensitive assembly error detection, a hierarchical scaffolding procedure and a reliable gap-closing method for haploid sub-assemblies.

Source code, executables and the testing dataset are freely available at https://github.com/mapleforest/HaploMerger2/releases/.

Address of the bookmark: https://github.com/mapleforest/HaploMerger2/releases/

Referee: Genome assembly quality scores

Jit — Sun, 04 Nov 2018 16:44:30 -0600

Modern genome sequencing technologies provide a succint measure of quality at each position in every read, however all of this information is lost in the assembly process. Referee summarizes the quality information from the reads that map to a site in an assembled genome to calculate a quality score for each position in the genome assembly.

We accomplish this by first calculating genotype likelihoods for every site. For a given site in a diploid genome, there are 10 possible genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT). Referee takes as input the genotype likelihoods calculated for all 10 genotypes given the called reference base at each position.

Referee is a program to calculate a quality score for every position in a genome assembly. This allows for easy filtering of low quality sites for any downstream analysis.

https://github.com/gwct/referee

Address of the bookmark: https://gwct.github.io/referee/#