BOL: Related items

Ka, Ks and Ka/Ks calculations

Poonam Mahapatra — Mon, 29 Aug 2016 11:44:11 -0500

gKaKs is a codon-based, genome-level Ka/Ks computation pipeline built on programs from four widely used packages: BLAT, BLASTALL (including bl2seq, formatdb and fastacmd), PAML (including codeml and yn00) and KaKs_Calculator (including 10 substitution rate estimation methods). gKaKs can automatically detect and eliminate frameshift mutations and premature stop codons before computing the substitution rates (Ka, Ks and Ka/Ks) between a well-annotated genome and a non-annotated genome, or even a poorly assembled scaffold dataset. It is especially useful for newly sequenced genomes that have not been well annotated.
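To make the "frameshift and premature stop codon" check concrete, here is a minimal sketch of how a coding sequence could be screened before a Ka/Ks calculation. It is not gKaKs code: the function name `check_cds` is hypothetical, and it assumes the standard genetic code and a sequence that is already in frame from its first base.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq):
    """Return a list of problems found in a coding sequence."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # A stop codon anywhere except the last codon is premature.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index + 1}")
    return problems


print(check_cds("ATGAAATAA"))     # clean CDS
print(check_cds("ATGTAAAAATAA"))  # stop codon in the middle
```

A sequence that returns an empty list passes both checks; anything else would need to be repaired or excluded before substitution rates are estimated.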

Resources for Ka/Ks calculation:

https://github.com/fumba/kaks-calculator

http://longlab.uchicago.edu/?q=gKaKs

http://www.ncbi.nlm.nih.gov/pubmed/23314322

Address of the bookmark: http://longlab.uchicago.edu/?q=gKaKs

Redundans

Jit — Thu, 01 Sep 2016 08:28:11 -0500

The Redundans pipeline assists the assembly of heterozygous genomes.
The program takes as input assembled contigs and paired-end and/or mate-pair sequencing libraries, and returns a scaffolded homozygous genome assembly that should be less fragmented and smaller in total size than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding.

The pipeline consists of three steps/modules:

redundancy reduction: detection and selective removal of redundant contigs from an initial de novo assembly
scaffolding: joining of genome fragments using paired-end and/or mate-pair reads
gap closing
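The redundancy reduction step can be illustrated with a deliberately simplified sketch. It drops any contig that occurs exactly, on either strand, inside a longer contig that has already been kept. This is an assumption made for illustration only: heterozygous contigs usually differ by mismatches and indels, so a real pipeline needs alignment-based similarity rather than exact matching. The function names are hypothetical and are not part of Redundans.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def remove_contained_contigs(contigs):
    """Drop contigs found exactly inside a longer kept contig (either strand).

    contigs: dict mapping contig name to sequence.
    """
    kept = {}
    # Visit the longest contigs first so shorter ones are compared against them.
    by_length = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        rc = reverse_complement(seq)
        if any(seq in other or rc in other for other in kept.values()):
            continue  # redundant: already represented by a longer contig
        kept[name] = seq
    return kept


toy = {"a": "ACGTACGTTTGA", "b": "CGTACG", "c": "TCAAAC", "d": "GGGGCC"}
print(sorted(remove_contained_contigs(toy)))
```

In the toy input, contig b is a substring of a and contig c is the reverse complement of the end of a, so only a and d survive.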

Redundans is:

fast and lightweight: multi-core support and memory-optimised, so it can be run even on a laptop for small-to-medium size genomes
flexible toward many sequencing technologies (Illumina, 454 or Sanger) and library types (paired-end, mate pairs, fosmids)
modular: every step can be omitted or replaced by other tools

Address of the bookmark: https://github.com/Gabaldonlab/redundans

Assembly tutorial PPT

Jit — Wed, 07 Sep 2016 03:12:53 -0500

Saved Cornell University assembly workshop PPT.

Reference:

http://cbsu.tc.cornell.edu/lab/doc/assembly_workshop_20150420_lecture1.pdf

OPERA : Optimal Paired-End Read Assembler

Jit — Fri, 09 Sep 2016 05:28:58 -0500

OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, in a process known as Scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al, 2011).

Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long-reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.
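Since the benchmarks above are reported in terms of N50, here is a minimal sketch of the plain N50 statistic for a list of scaffold or contig lengths. The function name `n50` is chosen for illustration. The "corrected" N50 mentioned above additionally accounts for assembly errors, which this sketch does not attempt.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # → 8
```

In the example the total is 54, and the three longest sequences (10, 9 and 8) already sum to 27, so the N50 is 8.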

Address of the bookmark: https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/

Ribbon !!

Jit — Fri, 21 Oct 2016 04:54:30 -0500

Visualization has played an extremely important role in the current genomic revolution to inspect and understand variants, expression patterns, evolutionary changes, and a number of other relationships. However, most of the information in read-to-reference or genome-genome alignments is lost for structural variations in the one-dimensional views of most genome browsers showing only reference coordinates. Instead, structural variations captured by long reads or assembled contigs often need more context to understand, including alignments and other genomic information from multiple chromosomes. We have addressed this problem by creating Ribbon (genomeribbon.com), an interactive online visualization tool that displays alignments along both reference and query sequences, along with any associated variant calls in the sample. This way, Ribbon shows patterns in alignments of many reads across multiple chromosomes, while allowing detailed inspection of individual reads (Supplementary Note 1). For example, here we show a gene fusion in the SK-BR-3 breast cancer cell line linking the genes CYTH1 and EIF3H. While it has been found in the transcriptome previously, genome sequencing did not identify a direct chromosomal fusion between these two genes. After SMRT sequencing, Ribbon shows that there are indeed long reads that span from one gene to the other, going through not one but two variants, for the first time showing the genomic link between these two genes (Figure 1a). More gene fusions of this cancer cell line are investigated in Supplementary Note 2. Figure 1b shows another complex event in this sample made simple in Ribbon: the translocation of a 4.4 kb sequence deleted from chr19 and inserted into chr16. Thus, Ribbon enables understanding of complex variants, and it may also help in the detection of sequencing and sample preparation issues, testing of aligners and variant-callers, and rapid curation of structural variant candidates (Supplementary Note 3).
In addition to SAM and BAM files with long, short, or paired-end reads, Ribbon can also load coordinate files from whole genome aligners such as MUMmer. Therefore, Ribbon can be used to test assembly algorithms or inspect the similarity between species. Supplementary Note 4 shows a comparison of gorilla and human genomes using Ribbon, highlighting major structural differences. In conclusion, Ribbon is a powerful interactive web tool for viewing complex genomic alignments.

Script at https://github.com/MariaNattestad/ribbon

Address of the bookmark: http://genomeribbon.com/

eFORGE.v1.2

Jit — Fri, 28 Oct 2016 09:06:59 -0500

The eFORGE tool provides a method to view the tissue-specific regulatory component of a set of differentially methylated positions (DMPs) from an epigenome-wide association study (EWAS). eFORGE analysis takes a set of DMPs, such as the hits above the genome-wide significance threshold in an EWAS, and analyses whether there is enrichment for overlap with putative functional elements compared to matched background DMPs. It assesses enrichment on a per-cell-type basis, since functional elements are differentially active in different cell types, and hence can expose tissue-specific signals of enrichment for the given test DMP set. This can reveal the sites of action underlying the EWAS signal, and provide confirmation of the validity of the EWAS where a tissue-specific mechanism is known or expected for the phenotype. Conversely, unknown tissue involvements can also be revealed.
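The idea of comparing a test set against matched background sets can be sketched as follows. This is not eFORGE's own statistic or code: the function names, the single-chromosome toy coordinates, and the use of a simple empirical p-value are all assumptions made for illustration.

```python
def overlap_count(positions, intervals):
    """Number of positions that fall inside any half-open interval [start, end)."""
    return sum(
        any(start <= pos < end for start, end in intervals) for pos in positions
    )


def empirical_enrichment(test_positions, background_sets, intervals):
    """Compare the overlap of a test set with that of matched background sets.

    Returns (observed_overlap, empirical_p_value). The p-value is the fraction
    of background sets overlapping at least as often as the test set, with a
    +1 pseudocount so that it is never exactly zero.
    """
    observed = overlap_count(test_positions, intervals)
    background = [overlap_count(s, intervals) for s in background_sets]
    at_least = sum(count >= observed for count in background)
    p_value = (at_least + 1) / (len(background) + 1)
    return observed, p_value


# Toy functional elements for one cell type, on a single chromosome.
elements = [(100, 200), (500, 600)]
test_set = [120, 150, 550, 900]
backgrounds = [[10, 20, 30, 40], [120, 300, 700, 800], [150, 550, 560, 1000]]
print(empirical_enrichment(test_set, backgrounds, elements))
```

Repeating the calculation with the element set of each cell type would give one enrichment value per cell type, which is the per-cell-type view described above.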

Address of the bookmark: http://eforge.cs.ucl.ac.uk/eFORGE.v1.2/?documentation

Minia

Jit — Thu, 08 Dec 2016 05:07:00 -0600

Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day. The output of Minia is a set of contigs. Minia produces results of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet).

Download

Minia 2.0.7 Linux 64-bits binaries (Source code) (Legacy codebase)

For the impatient

A typical Minia command line looks like:

./minia -in reads.fa -kmer-size 31 -abundance-min 3 -out output_prefix

Type

./minia

for a quick explanation of the parameters.

For more information, refer to the manual.

KmerGenie can be used to determine the best k-mer size, minimum abundance of correct k-mers, and genome size estimation for your dataset.
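To show what the k-mer size and minimum abundance parameters mean conceptually, here is a minimal sketch that counts k-mers in a set of reads and keeps only those seen often enough to be trusted. It is not Minia's implementation: the function name `solid_kmers` is hypothetical, and for brevity it does not merge a k-mer with its reverse complement, which real assemblers typically do.

```python
from collections import Counter


def solid_kmers(reads, kmer_size, abundance_min):
    """Count all k-mers in the reads and keep those seen at least abundance_min times."""
    counts = Counter()
    for read in reads:
        # Slide a window of kmer_size bases along the read.
        for i in range(len(read) - kmer_size + 1):
            counts[read[i:i + kmer_size]] += 1
    return {kmer: n for kmer, n in counts.items() if n >= abundance_min}


reads = ["ACGTA", "ACGTT", "ACGTA"]
print(solid_kmers(reads, kmer_size=3, abundance_min=2))
```

In the toy reads, GTT occurs only once and is discarded as a likely sequencing error, while ACG, CGT and GTA are kept as graph nodes.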

Address of the bookmark: http://minia.genouest.org/

Velvet tutorial

Poonam Mahapatra — Fri, 09 Dec 2016 04:19:07 -0600

The objective of this activity is to help you understand how to run Velvet in general, how to accurately estimate the insert size of a paired-end library through the use of Bowtie, the primary parameters of Velvet, and the process involved in producing a de novo assembly from Illumina reads.
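As an illustration of the insert size estimation step, here is a minimal sketch that reads SAM-format alignment lines and takes the median template length of properly paired reads. It relies only on the SAM format itself (column 2 is the FLAG, column 9 is TLEN, and FLAG bit 0x2 marks a properly paired read). The function name and the choice of the median are assumptions, and the tutorial's own procedure may differ.

```python
from statistics import median


def estimate_insert_size(sam_lines):
    """Median positive TLEN over properly paired reads in SAM-format lines."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        # Keep one mate per proper pair: the one with a positive TLEN.
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    if not sizes:
        raise ValueError("no properly paired reads found")
    return median(sizes)
```

A typical use would be `estimate_insert_size(open("aligned.sam"))`, where `aligned.sam` is a hypothetical file of paired-end alignments to a reference or draft assembly.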

http://evomics.org/learning/assembly-and-alignment/velvet/

Address of the bookmark: http://evomics.org/learning/assembly-and-alignment/velvet/

MeGAMerge: A tool to merge assembled contigs, long reads from metagenomic sequencing runs

Jit — Mon, 19 Dec 2016 09:42:15 -0600

Description

MeGAMerge is a Perl-based wrapper/tool that accepts any number of sequence files in multi-FASTA format, containing assembled contigs of any length, and produces an improved contig set using overlap-layout-consensus (OLC) assembly. All overlap parameters (minimum overlap length, identity, etc.) can be set by the user at runtime. It is written to run on Linux.
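The role of the minimum overlap length and identity parameters can be sketched with a toy suffix-prefix merge of two contigs. This is not how MeGAMerge or its underlying assemblers work internally: the function names are hypothetical, and the overlap is compared without gaps, whereas real OLC assemblers use proper alignments.

```python
def best_overlap(left, right, min_overlap, min_identity):
    """Length of the longest suffix of left that matches a prefix of right.

    The match is ungapped and must reach min_identity (a fraction from 0 to 1).
    Returns 0 if no overlap of at least min_overlap bases qualifies.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        matches = sum(a == b for a, b in zip(suffix, prefix))
        if matches / length >= min_identity:
            return length
    return 0


def merge_contigs(left, right, min_overlap, min_identity):
    """Join two contigs across their best overlap, or return None if there is none."""
    length = best_overlap(left, right, min_overlap, min_identity)
    if length == 0:
        return None
    return left + right[length:]


print(merge_contigs("ACGTACGTGG", "CGTGGTTTAA", min_overlap=4, min_identity=1.0))
```

Raising the minimum overlap above 5 in this example prevents the merge, which shows how the thresholds trade sensitivity against the risk of joining unrelated contigs.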

Requirements:

You will need to have the following tools installed and in $PATH, or added to $binpath in the tool:

Newbler (specifically runAssembly)
Minimus2 (part of AMOS, also requires MUMmer)

Address of the bookmark: https://github.com/LANL-Bioinformatics/MeGAMerge

Software and Tools to detect structural variation with long reads !!

Archana Malhotra — Wed, 15 Mar 2017 14:31:09 -0500

Uncovering the connection between genetics and heritable diseases requires an approach that looks at all the variant bases and types in a genome. While a PacBio de novo assembly resolves the most novel structural variants (SVs), 8-10X PacBio coverage of single genomes or trios reveals triple the SVs detectable by short-read data.

With Single Molecule, Real-Time (SMRT) Sequencing, you can access structural variations having a broad range of sizes, types, and GC content with the ability to:

Uncover missing heritability linked to structural variation
Unambiguously identify genomic context and variant breakpoints at the sequence level to unravel the genetic etiology of disease
Resolve structural variation across the complete size spectrum with basepair resolution

The following SV tools can help you achieve this.

Sniffles: Structural variation caller using third generation sequencing

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis. Please note that the current version of Sniffles requires sorted output from BWA-MEM (use the -M and -x parameters) or NGM-LR with the optional SAM attributes enabled!
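Split-read evidence is recorded in SAM files through the optional SA tag, which lists the other alignments of the same read as rname,pos,strand,CIGAR,mapQ,NM entries separated by semicolons. The sketch below only shows how that tag can be read from one alignment line. It is not Sniffles code, and the function name and returned dictionary keys are hypothetical.

```python
def split_alignments(sam_line):
    """Parse the SA tag of one SAM alignment line into a list of segments."""
    fields = sam_line.rstrip("\n").split("\t")
    segments = []
    # Optional tags start at the 12th column.
    for tag in fields[11:]:
        if not tag.startswith("SA:Z:"):
            continue
        for entry in tag[5:].rstrip(";").split(";"):
            rname, pos, strand, cigar, mapq, nm = entry.split(",")
            segments.append({
                "chrom": rname,
                "pos": int(pos),
                "strand": strand,
                "cigar": cigar,
                "mapq": int(mapq),
                "nm": int(nm),
            })
    return segments
```

A read whose primary alignment is on one chromosome and whose SA entry points to another is the kind of signal a caller can cluster into a translocation candidate.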

More at https://github.com/fritzsedlazeck/Sniffles

MultiBreak-SV: It identifies structural variants from next-generation paired end data, third-generation long read data, or data from a combination of sequencing platforms.

There are two pieces of software in this release: (1) a pre-processor that takes machineformat (.m5) BLASR files, and (2) MultiBreak-SV. For installation and usage instructions, see doc/MultiBreakSV-Manual.txt.

More at https://github.com/raphael-group/multibreak-sv

Parliament: A Structural Variation Tool. Why ask a single sv-detection approach to find every variant when you can have a parliament of tools deciding?

Publication about the algorithm and “…the first long-read characterization of structural variation in a diploid human personal genome…” (HS1011) - “Assessing structural variation in a personal genome—towards a human reference diploid genome”

More at https://sourceforge.net/projects/parliamentsv/

https://www.dnanexus.com/papers/Parliament_Info_Sheet.pdf

PBHoney: the structural variation discovery tool

PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp). PBHoney considers both intra-read discordance and soft-clipped tails of long reads to identify structural variants.
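The "soft-clipped tails" mentioned above are visible in the CIGAR string of an alignment as S operations at either end. Here is a minimal sketch that extracts the lengths of those tails. It is not PBHoney's implementation, and the function name is an assumption. Hard clips (H) are ignored so that a soft clip sitting just inside one is still found.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_tails(cigar):
    """Return (left, right) soft-clip lengths of an alignment's CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


print(soft_clipped_tails("50S900M75S"))  # → (50, 75)
```

Reads with long tails that pile up at the same reference position suggest a breakpoint there; the thresholds and clustering needed to turn that into a call are beyond this sketch.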

Read The Paper http://www.biomedcentral.com/1471-2105/15/180/abstract

More at https://sourceforge.net/projects/pb-jelly/

SMRT-SV: Structural variant and indel caller for PacBio reads

Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.

SMRT-SV provides an official software package for tools described in Chaisson et al. 2014 and adds several key features including the following.

Unified variant calling user interface with built-in cluster compute support
Small indel calling (2-49 bp)
Improved inversion calling (screenInversions)
Quality metric for SV calls based on number of local assemblies supporting each call
Higher sensitivity for SV calls using tiled local assemblies across the entire genome instead of "signature" regions
Genotyping of SVs with Illumina paired-end reads from WGS samples

More at https://github.com/EichlerLab/pacbio_variant_caller