BOL: Related items

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast, approximate tool for mapping long reads (PacBio/ONT) or assemblies to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined from the minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
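The conversion from Jaccard similarity to an identity estimate can be sketched in a few lines. This is a minimal illustration, not MashMap's implementation: it computes the exact Jaccard similarity over all k-mers instead of the sampled Winnowing/MinHash estimate, and the function names are hypothetical. The only formula it relies on is the Mash distance, D = -(1/k) ln(2j / (1 + j)), with identity taken as 1 - D.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two k-mer sets (MashMap estimates this by sampling)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_identity(j, k):
    """Convert a Jaccard similarity j into an identity estimate via the Mash distance."""
    if j <= 0.0:
        return 0.0  # no shared k-mers: the distance is undefined, report zero identity
    distance = -math.log(2.0 * j / (1.0 + j)) / k
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    k = 4  # toy value; real long-read mapping uses larger k
    query = "ACGTTGCAACGTTGCATTGA"
    target = "ACGTTGCAACGATGCATTGA"  # one substitution relative to query
    j = jaccard(kmers(query, k), kmers(target, k))
    print(f"Jaccard = {j:.3f}, estimated identity = {mash_identity(j, k):.3f}")
```

A mapper working this way would keep a candidate region only when the estimated identity is at or above the user-specified threshold.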

Address of the bookmark: https://github.com/marbl/MashMap

RGFA: powerful and convenient handling of assembly graphs

Rahul Nayak — Thu, 25 Jan 2018 05:47:53 -0600

RGFA is an implementation of the proposed GFA specification in Ruby. It allows the user to conveniently parse, edit and write GFA files. Complex operations such as the separation of the implicit instances of repeats and the merging of linear paths can be performed. A typical application of RGFA is the editing of a graph, to finish the assembly of a sequence, using information not available to the assembler. We illustrate a use case, in which the assembly of a repetitive metagenomic fosmid insert was completed using a script based on RGFA.
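To give a feel for what parsing and writing a GFA file involves, here is a minimal Python sketch. It is not RGFA and does not mirror its Ruby API; the function names are hypothetical, and it assumes GFA1 and handles only segment (S) and link (L) lines, ignoring optional tags and every other record type.

```python
def parse_gfa1(text):
    """Parse GFA1 text into (segments, links), ignoring other record types and tags."""
    segments = {}  # segment name -> sequence
    links = []     # (from, from_orient, to, to_orient, overlap_cigar)
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            links.append(tuple(fields[1:6]))
    return segments, links


def write_gfa1(segments, links):
    """Serialize segments and links back to GFA1 text with a version header."""
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{seq}" for name, seq in segments.items()]
    lines += ["L\t" + "\t".join(link) for link in links]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    gfa = "H\tVN:Z:1.0\nS\ts1\tACGT\nS\ts2\tGGCC\nL\ts1\t+\ts2\t-\t2M\n"
    segments, links = parse_gfa1(gfa)
    segments["s3"] = "TTAA"                     # edit: add a segment
    links.append(("s2", "-", "s3", "+", "0M"))  # edit: link it to the graph
    print(write_gfa1(segments, links), end="")
```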

https://github.com/ggonnella/rgfa

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5103826/

assemblytics: delta file to analyze alignments of an assembly to another assembly or a reference genome

Jit — Thu, 14 Jun 2018 07:31:00 -0500

1. Download and install MUMmer.

2. Align your assembly to a reference genome using nucmer (from the MUMmer package):

   $ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT

   Consult the MUMmer manual if you encounter problems.

3. Optional: gzip the delta file to speed up the upload (usually 2-4X faster):

   $ gzip OUT.delta

   Then use the OUT.delta.gz file for upload.

4. Upload the .delta or .delta.gz file to Assemblytics.

Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference.

The unique sequence length required represents an anchor for determining if a sequence is unique enough to safely call variants from, which is an alternative to the mapping quality filter for read alignment.

Address of the bookmark: http://assemblytics.com/

transrate: Understanding your transcriptome assembly

Neel — Fri, 13 Jul 2018 07:49:26 -0500

Transrate is software for de-novo transcriptome assembly quality analysis. It examines your assembly in detail and compares it to experimental evidence such as the sequencing reads, reporting quality scores for contigs and assemblies. This allows you to choose between assemblers and parameters, filter out the bad contigs from an assembly, and help decide when to stop trying to improve the assembly.

Address of the bookmark: http://hibberdlab.com/transrate/index.html

HaploMerger2: rebuilding both haploid sub-assemblies from high-heterozygosity diploid genome assembly

BioStar — Thu, 27 Sep 2018 07:08:47 -0500

HM2 can process any diploid assemblies, but it is especially suitable for diploid assemblies with high heterozygosity (≥3%), which can be difficult for other tools. This pipeline also implements flexible and sensitive assembly error detection, a hierarchical scaffolding procedure and a reliable gap-closing method for haploid sub-assemblies.

Source code, executables and the testing dataset are freely available at https://github.com/mapleforest/HaploMerger2/releases/.

Address of the bookmark: https://github.com/mapleforest/HaploMerger2/releases/

Referee: Genome assembly quality scores

Jit — Sun, 04 Nov 2018 16:44:30 -0600

Modern genome sequencing technologies provide a succinct measure of quality at each position in every read; however, all of this information is lost in the assembly process. Referee summarizes the quality information from the reads that map to a site in an assembled genome to calculate a quality score for each position in the genome assembly.

We accomplish this by first calculating genotype likelihoods for every site. For a given site in a diploid genome, there are 10 possible genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT). Referee takes as input the genotype likelihoods calculated for all 10 genotypes given the called reference base at each position.
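The ten diploid genotypes and their likelihoods at a single site can be illustrated with a short sketch. The error model below is a common textbook assumption (each read base is drawn from one of the two alleles with equal probability, and a sequencing error is equally likely to produce any of the other three bases); it is not necessarily the model Referee or its upstream tools use, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"
# Unordered pairs of bases: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]


def base_prob(observed, allele, error):
    """P(observed base | true allele) under a uniform-error assumption."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_likelihoods(pileup):
    """Return {genotype: likelihood} for a pileup of (base, phred_quality) pairs."""
    likelihoods = {}
    for a1, a2 in GENOTYPES:
        lik = 1.0
        for base, qual in pileup:
            error = 10 ** (-qual / 10)  # Phred score -> error probability
            lik *= 0.5 * base_prob(base, a1, error) + 0.5 * base_prob(base, a2, error)
        likelihoods[a1 + a2] = lik
    return likelihoods


if __name__ == "__main__":
    pileup = [("A", 30), ("A", 30), ("A", 20), ("C", 10), ("A", 30)]
    for genotype, lik in sorted(genotype_likelihoods(pileup).items(), key=lambda x: -x[1]):
        print(f"{genotype}\t{lik:.3e}")
```

At realistic read depths the product of probabilities becomes very small, so a practical implementation would typically sum log-likelihoods instead.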

Referee is a program to calculate a quality score for every position in a genome assembly. This allows for easy filtering of low quality sites for any downstream analysis.

https://github.com/gwct/referee

Address of the bookmark: https://gwct.github.io/referee/#

ALLHiC: Phasing and scaffolding polyploid genomes based on Hi-C data

BioStar — Thu, 20 Dec 2018 12:03:32 -0600

The major problem of scaffolding polyploid genomes is that Hi-C signals are frequently detected between allelic haplotypes, and existing state-of-the-art Hi-C scaffolding programs link the allelic haplotypes together. To solve this problem, we developed a new Hi-C scaffolding pipeline, called ALLHiC, specifically tailored to polyploid genomes. The ALLHiC pipeline contains a total of 5 steps: prune, partition, rescue, optimize and build.

Address of the bookmark: https://github.com/tangerzhang/ALLHiC/wiki

SvABA: Genome-wide detection of structural variants and indels by local assembly

Abhimanyu Singh — Mon, 21 Jan 2019 17:58:56 -0600

SvABA is a method for detecting structural variants in sequencing data using genome-wide local assembly. Under the hood, SvABA uses a custom implementation of SGA (String Graph Assembler) by Jared Simpson, and BWA-MEM by Heng Li. Contigs are assembled for every 25kb window (with some small overlap) for every region in the genome. The default is to use only clipped, discordant, unmapped and indel reads, although this can be customized to any set of reads at the command line using VariantBam rules. These contigs are then immediately aligned to the reference with BWA-MEM and parsed to identify variants. Sequencing reads are then realigned to the contigs with BWA-MEM, and variants are scored by their read support.
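The windowing scheme described above can be sketched as follows. This only illustrates tiling a chromosome into overlapping windows; it is not SvABA's code. The 25 kb window size comes from the description, while the 2 kb overlap and the function name are assumptions chosen for illustration.

```python
def tile_windows(chrom_length, window=25_000, overlap=2_000):
    """Yield half-open (start, end) windows that cover [0, chrom_length).

    The overlap value is a hypothetical default, not SvABA's actual setting.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    start = 0
    while start < chrom_length:
        end = min(start + window, chrom_length)
        yield start, end
        if end == chrom_length:
            break  # the last window reached the chromosome end
        start += step


if __name__ == "__main__":
    for start, end in tile_windows(60_000):
        print(start, end)
```

Each window would then be assembled independently, which is what makes the local-assembly approach straightforward to parallelize across the genome.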

Address of the bookmark: https://github.com/walaj/svaba

Evaluation of genome assembly software based on long reads

BioStar — Fri, 01 Feb 2019 11:55:54 -0600

Third-generation sequencing (TGS) technologies have been used to produce highly accurate de novo assemblies of hundreds of microbial genomes and highly contiguous reconstructions of many dozens of plant and animal genomes, enabling new insights into evolution and sequence diversity. They have also been applied to resequencing analyses, to create detailed maps of structural variations in many species. Also, these new technologies have been used to fill in many of the gaps in the human reference genome.

In this report, we compare and evaluate several genome assembly tools based on TGS technology. The experiments were performed on 4 reference genomes and the results were evaluated with the QUAST software. The 11 tools evaluated are: Celera Assembler, Falcon, Miniasm, Newbler, SGA Assembler, Smartdenovo, Abruijn, Ra, DBG2OLC, Spades and Cerulean. The first 8 tools use only long reads, while the last 3 can merge long and short reads.

TRITEX sequence assembly pipeline for Triticeae genomes

Jit — Tue, 20 Aug 2019 09:47:14 -0500

The pipeline is open-source and hosted in a public Bitbucket repository.

TRITEX has been run on highly inbred genotypes of barley (Hordeum vulgare), tetraploid wheat (Triticum turgidum) and hexaploid wheat (T. aestivum) with reasonable results: super-scaffold N50 values in the range of dozens of Mb and pseudomolecules with better gene space representation than a BAC-by-BAC assembly. It has never been tested and is not expected to work on heterozygous or autopolyploid genomes.
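For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name is hypothetical and the code is independent of TRITEX.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 for an empty input)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


if __name__ == "__main__":
    scaffolds = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy lengths, arbitrary units
    print(n50(scaffolds))
```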

A protocol for generating chromosome-conformation capture sequencing (Hi-C) data suitable for use with the pipeline is described in Himmelbach et al. 2018. Refer to the technical notes of 10X Genomics on how to generate Chromium data.

Address of the bookmark: https://tritexassembly.bitbucket.io/