BOL: Related items

coursera genome assembly tutorial

Jit — Sat, 25 Nov 2017 08:57:25 -0600

Solutions to Coursera Genome Sequencing (Bioinformatics II)

Address of the bookmark: https://github.com/iansealy/coursera-assembly

Bandage: interactive visualization of de novo genome assemblies

Shruti Paniwala — Mon, 04 Dec 2017 10:09:37 -0600

Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone.

Availability and implementation: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage.

Address of the bookmark: http://rrwick.github.io/Bandage/

String graph based genome assembly software and tools !

Rahul Nayak — Tue, 19 Dec 2017 17:17:38 -0600

In graph theory, a string graph is an intersection graph of curves in the plane; each curve is called a "string". String graphs were first proposed by E. W. Myers in a 2005 publication. In recent Genome Research paper describing an innovative approach for assembling large genomes from NGS data caught our attention for several reasons. i) it give different "string graph" prospective of long lasting genome assembly problem ii) the paper is coauthored by Jared Simpson, the developer of ABySS assembler and Richard Durbin. iii) Simpson-Durbin algorithm is that it does not rely on de Bruijn graphs, and instead employs a different graph construction approach called ‘string graph’.

Following are the genome assembly tools based on string graph:

1.SGA (String Graph Assembler) https://github.com/jts/sga

Assembles large genomes from high coverage short read data. SGA is designed as a modular set of programs, which are used to form an assembly pipeline. SGA implements a set of assembly algorithms based on the FM-index. As the FM-index is a compressed data structure, the algorithms are very memory efficient. The SGA assembly has three distinct phases. The first phase corrects base calling errors in the reads. The second phase assembles contigs from the corrected reads. The third phase uses paired end and/or mate pair data to build scaffolds from the contigs. The output of this software is a PDF report that allows the properties of the genome and data quality to be visually explored. By providing more information to the user at the start of an assembly project, this software will help increase awareness of the factors that make a given assembly easy or difficult, assist in the selection of software and parameters and help to troubleshoot an assembly if it runs into problems.

2. SAGE: String-overlap Assembly of GEnomes https://github.com/lucian-ilie/SAGE2

SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers.

3. FSG: Fast String Graph

The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.

4. BASE https://github.com/dhlbh/BASE

It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have high probability to appear uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. BASE is a practically efficient tool for constructing contig, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.

5. Fermi https://github.com/lh3/fermi/

Fermi is a de novo assembler with a particular focus on assembling Illumina short sequence reads from a mammal-sized genome. In addition to the role of a typical assembler, fermi also aims to preserve heterozygotes which are often collapsed by other assemblers. Its ultimate goal is to find a minimal set of unitigs to represent all the information in raw reads.

If you want to learn about String Graph assembler, please read the following papers -

i) The Fragment Assembly String Graph - E. W. Myers

This paper describes the String Graph concept.

ii) Efficient construction of an assembly string graph using the FM-index - Jared T. Simpson and Richard Durbin

This earlier paper from Simpson and Durbin

iii) Efficient de novo assembly of large genomes using compressed data structures - Jared T. Simpson and Richard Durbin

AliTV—interactive visualization of whole genome comparisons

Jit — Wed, 10 Jan 2018 07:08:17 -0600

AliTV, which provides interactive visualization of whole genome alignments. AliTV reads multiple whole genome alignments or automatically generates alignments from the provided data. Optional feature annotations and phylo- genetic information are supported. The user-friendly, web-browser based and highly customizable interface allows rapid exploration and manipulation of the visualized data as well as the export of publication-ready high-quality figures. AliTV is freely available at https://github.com/AliTVTeam/AliTV

https://alitvteam.github.io/AliTV/

Address of the bookmark: https://github.com/AliTVTeam/AliTV

Carefully opt for human reference genome

biogeek — Tue, 18 Feb 2020 07:43:32 -0600

Heng Li posted several issues with the human reference genomes given in these resources and suggests the following compressed FASTA file to be used as hg38/GRCh38 human reference genome.

if you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

There are several other versions of GRCh37/GRCh38. What’s wrong with them? Here are a collection of potential issues:

More at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Address of the bookmark: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Metassembler: merging and optimizing de novo genome assemblies

Rahul Nayak — Tue, 08 May 2018 04:52:33 -0500

Metassembler combines multiple whole genome de novo assemblies into a combined consensus assembly using the best segments of the individual assemblies.

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence.

Address of the bookmark: https://sourceforge.net/projects/metassembler/?source=directory

EAGLER: a scaffolding tool for long reads.

Jit — Mon, 04 Jun 2018 05:26:03 -0500

EAGLER is a scaffolding tool for long reads. The scaffolder takes as input a draft genome created by any NGS assembler and a set of long reads. The long reads are used to extend the contigs present in the NGS draft and possibly join overlapping contigs. EAGLER supports both PacBio and Oxford Nanopore reads.

The tool should be compatible with most UNIX flavors and has been successfully tested on the following operating systems:

Mac OS X 10.11.1
Mac OS X 10.10.3
Ubuntu 14.04 LTS

https://bib.irb.hr/datoteka/844447.Diplomski_2015_Luka_terbi.pdf

Address of the bookmark: https://github.com/mculinovic/EAGLER

getopts.pl file

Jit — Fri, 15 Jun 2018 04:43:03 -0500

SSPACE_longread complain for getopts.pl file.

To resolve this, download and have in SSPACED-Longreads folder.

Cheers :)

ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data

Rahul Nayak — Tue, 03 Jul 2018 04:14:52 -0500

ChopStitch is a new method for finding putative exons and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also detects base substitutions in transcript sequences corresponding to sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are reported as splice graphs in dot output format.

Address of the bookmark: https://github.com/bcgsc/ChopStitch

Nanopolis: polish a genome assembly

Rahul Nayak — Thu, 26 Jul 2018 04:51:28 -0500

Software package for signal-level analysis of Oxford Nanopore sequencing data. Nanopolish can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome and more (see Nanopolish modules, below).

Quickstart

http://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html

Algorithms

http://simpsonlab.github.io/2017/06/30/nanopolish-v0.7.0/

Address of the bookmark: https://github.com/jts/nanopolish