BOL: Related items

Smash: An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

Jit — Tue, 26 Apr 2016 12:18:49 -0500

Smash is a completely alignment-free method and tool for finding and visualising genomic rearrangements. Detection is based on conditional exclusive compression, using an FCM (finite-context model, a type of Markov model) of high context order (typically 20). For visualisation, Smash outputs an SVG image with an ideogram output architecture, where the patterns are represented with several HSV values (only the value component varies). The method works at both small and large scale. It is nevertheless directed more at large-scale analysis, since the main aim of the research was to learn where several primates were equal or different at large scale, chromosome by chromosome, giving an at-a-glance map of the entire genomes.
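
As a rough illustration of the idea behind compressing one sequence using only a model learned from another, here is a toy finite-context model. It is a sketch under stated assumptions, not Smash's implementation: the function names, the smoothing parameter alpha, the context order of 3, and the example sequences are all hypothetical choices made for readability.

```python
import math
from collections import defaultdict

def train_fcm(reference, k):
    """Count which base follows each length-k context in the reference."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(reference) - k):
        counts[reference[i:i + k]][reference[i + k]] += 1
    return counts

def bits_per_base(target, counts, k, alpha=1.0, alphabet="ACGT"):
    """Estimated cost, in bits, of each target base given only the reference model."""
    costs = []
    for i in range(k, len(target)):
        ctx_counts = counts.get(target[i - k:i], {})
        total = sum(ctx_counts.values())
        # Additive smoothing (an assumption) keeps unseen symbols at non-zero probability.
        p = (ctx_counts.get(target[i], 0) + alpha) / (total + alpha * len(alphabet))
        costs.append(-math.log2(p))
    return costs

k = 3  # toy order; the description above mentions a typical order of 20
model = train_fcm("ACGTACGTACGTACGT", k)
similar = bits_per_base("ACGTACGTACGTACGT", model, k)
unrelated = bits_per_base("TTTTTTTT", model, k)
print(max(similar) < 2.0)  # True: bases predicted by the reference are cheap
print(set(unrelated))      # {2.0}: unseen contexts cost a full 2 bits per base
```

Regions of the target that cost few bits are the ones the reference already "explains", which is the kind of signal a compression-based method can use to flag shared segments.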

Address of the bookmark: http://bioinformatics.ua.pt/software/smash/

AccNET

Jitendra Narayan — Fri, 07 Oct 2016 05:22:11 -0500

AccNET is a Perl application that presents a new way to study the accessory genome of a given set of organisms. Using the proteomes of these organisms, AccNET creates a bipartite network compatible with common network analysis platforms. AccNET collects phylogenetic and functional information in a single network, improving the analysis capability. Networks offer a new perspective on organism organization through elements acquired by horizontal gene transfer and are not constrained by hierarchical structures.

More at https://www.youtube.com/watch?v=vdGuy1GAJrQ

Address of the bookmark: https://sourceforge.net/projects/accnet/

MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping

Neel — Fri, 20 May 2016 18:53:49 -0500

MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery.
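
The Smith-Waterman algorithm mentioned above can be sketched in a few lines. This is a generic textbook version that returns only the best local alignment score with a linear gap penalty; it does not reproduce MOSAIK's hash clustering or its actual scoring scheme, and the match, mismatch, and gap values are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACGT", "ACGT"))  # 8: four matches at +2 each
print(smith_waterman_score("AAAA", "TTTT"))  # 0: no local similarity
```

Because the score can restart from zero anywhere, short insertions, deletions, and mismatches only lower the score locally instead of invalidating the whole alignment.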

Address of the bookmark: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090581

CNIDARIA: fast, reference-free phylogenomic clustering

Shruti Paniwala — Thu, 16 Jun 2016 17:55:17 -0500

Motivation: Identification of biological specimens is a major requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but these do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances.

Results: We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% accuracy at supra-species level and 78% accuracy for species level.

Availability and Implementation: Cnidaria is written in C++ and Python and is available at http://www.ab.wur.nl/cnidaria.

Contact: Saulo Aflitos - sauloal@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Address of the bookmark: https://github.com/sauloal/cnidaria/wiki

Samtools Primer !!

Jit — Thu, 23 Jun 2016 07:18:17 -0500

SAMtools: Primer / Tutorial by Ethan Cerami, Ph.D.

keywords: samtools, next-gen, next-generation, sequencing, bowtie, sam, bam, primer, tutorial, how-to, introduction

Revisions

    1.0: May 30, 2013: First public release on biobits.org.
    1.1: July 24, 2013: Updated with Disqus Comments / Feedback section.
    1.2: December 19, 2014: Multiple updates, including:
        Updated to use samtools 1.1 and bcftools 1.2.
        Updated usage for bcftools.

About

SAMtools is a popular open-source tool used in next-generation sequence analysis. This primer provides an introduction to SAMtools, and is geared towards those new to next-generation sequence analysis. The primer is also designed to be self-contained and hands-on, meaning that you only need to install SAMtools, and no other tools, and sample data sets are provided. Terms in bold are also explained in the glossary at the end of the document.

Address of the bookmark: http://biobits.org/samtools_primer.html

NGS Glossary !!

Jit — Mon, 27 Jun 2016 08:56:18 -0500

alignment: the mapping of a raw sequence read to a location within a reference genome. The mapping occurs because the sequences within the raw read match or align to sequences within the reference genome. Alignment information is stored in the SAM or BAM file formats.

bcftools: a set of companion tools, currently bundled with SAMtools, for identifying and filtering genomics variants.

bowtie: widely used, open source alignment software for aligning raw sequence reads to a reference genome.

BAM Format: binary, compressed format for storing SAM data.

BCF Format: Binary call format. Binary, compressed format for storing VCF data.

CIGAR String: Compact Idiosyncratic Gapped Alignment Report. A compact string that (partially) summarizes the alignment of a raw sequence read to the reference genome. Three core abbreviations are used: M for alignment match; I for insertion; and D for Deletion. For example, a CIGAR string of 5M2I63M indicates that the first 5 base pairs of the read align to the reference, followed by 2 base pairs, which are unique to the read, and not in the reference genome, followed by an additional 63 base pairs of alignment.
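
The CIGAR example above can be unpacked with a short parser. This is a minimal sketch: the function names are hypothetical, and the sets of operations that consume read or reference bases follow the SAM specification rather than anything stated in this glossary.

```python
import re

# One CIGAR operation: a run length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing characters the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def read_length(ops):
    """Bases consumed from the read (M, I, S, =, X)."""
    return sum(n for n, op in ops if op in "MIS=X")

def reference_length(ops):
    """Bases consumed from the reference (M, D, N, =, X)."""
    return sum(n for n, op in ops if op in "MDN=X")

ops = parse_cigar("5M2I63M")
print(ops)                                      # [(5, 'M'), (2, 'I'), (63, 'M')]
print(read_length(ops), reference_length(ops))  # 70 68
```

The two inserted bases count towards the read length but not the reference span, which is why the read covers 70 bases while the alignment spans only 68 reference positions.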

FASTA Format: text format for storing raw sequence data. For example, the FASTA file at: http://www.ncbi.nlm.nih.gov/nuccore/NC_008253 contains the entire genome for Escherichia coli 536.

FASTQ Format: text format for storing raw sequence data along with quality scores for each base; usually generated by sequencing machines.

genotype likelihood: the probability that a specific genotype is present in the sample of interest. Genotype likelihoods are usually expressed as a Phred-scaled probability, where P = 10^(-Q/10). For example, if the Phred-scaled score for the genotype TT (both alleles are T) at position 1,299,132 in human chromosome 12 (reference G) is 37, this translates to a probability of 10^(-37/10) = 0.0001995, meaning that there is very low probability that the reads in your sample support a TT genotype. On the other hand, a genotype of AA at the same position with a score of 0 translates into a probability of 10^(-0/10) = 1, indicating extremely high probability that your sample contains a homozygous mutation of G to A.

mate-pair: in paired-end sequencing, both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. The two ends which are sequenced form a pair, and are frequently referred to as mate-pairs.

QNAME: unique identifier of a raw sequence read (also known as the Query Name). Used in FASTQ and SAM files.

paired-end sequencing: sequencing process where both ends of a single DNA or RNA fragment are sequenced, but the intermediate region is not. Particularly useful for identifying structural rearrangements, including gene fusions.

Phred-scaled probability: a scaled value (Q) used to compactly summarize a probability, where P = 10^(-Q/10). For example, a Phred Q score of 10 translates to probability (P) = 10^(-10/10) = 0.1. Phred-scaled probabilities are common in next-generation sequencing, and are used to represent multiple types of quality metrics, including quality of base calls, quality of mappings, and probabilities associated with specific genotypes. The name Phred refers to the original Phred base-calling software, which first used and developed the scale.
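
The conversion between a Phred-scaled value and a probability is a one-line formula in each direction. The sketch below uses hypothetical function names and simply applies P = 10^(-Q/10) and its inverse to the values used in this glossary.

```python
import math

def phred_to_prob(q):
    """Convert a Phred-scaled value Q to a probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert a probability P to its Phred-scaled value Q = -10 * log10(P)."""
    return -10 * math.log10(p)

print(round(phred_to_prob(10), 7))     # 0.1
print(round(phred_to_prob(37), 7))     # 0.0001995
print(round(prob_to_phred(0.001), 6))  # 30.0
```

Every increase of 10 in Q divides the probability by 10, so Q = 20 is 0.01, Q = 30 is 0.001, and so on.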

Phred quality score: a score assigned to each base within a sequence, quantifying the probability that the base was called incorrectly. Scores use a Phred-scaled probability metric. For example, a Phred Q score of 10 translates to P = 10^(-10/10) = 0.1, indicating that the base has a 0.1 probability of being incorrect. Higher Phred scores correspond to higher accuracy. In the FASTQ format, Phred scores are represented as single ASCII letters. For details on translating between Phred scores and ASCII values, refer to Table 1 of this useful blog post from Damian Gregory Allis.
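
Decoding the ASCII letters of a FASTQ quality line back into Phred scores can be sketched as follows. The offset of 33 (the Sanger / "Phred+33" encoding) is an assumption here; older Illumina files used an offset of 64, so check which encoding your data uses. The function name and the example quality string are hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

scores = decode_qualities("!+5?I")
print(scores)  # [0, 10, 20, 30, 40]

# Probability that each base call is wrong, using P = 10^(-Q/10).
error_probs = [10 ** (-q / 10) for q in scores]
print(error_probs[0])  # 1.0: a score of 0 carries no confidence at all
```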

read-length: the number of base pairs that are sequenced in an individual sequence read.

read-depth: the number of sequence reads that pile up at the same genomic location. For example, 30X read-depth coverage indicates that the genomic location is covered by 30 independent sequencing reads. Increased read-depth translates into higher confidence for calling genomic variants.
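
Read depth can be illustrated by counting how many aligned reads cover each position. This is a toy sketch: the function name is hypothetical, the reads are made-up half-open (start, end) intervals in 0-based coordinates, and real tools work from BAM files and use far more efficient methods.

```python
from collections import Counter

def depth_per_position(alignments):
    """Count reads covering each position; alignments are (start, end) half-open intervals."""
    depth = Counter()
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

reads = [(0, 5), (2, 7), (4, 9)]  # three hypothetical aligned reads
depth = depth_per_position(reads)
print(depth[0])  # 1
print(depth[4])  # 3: all three reads pile up at position 4
```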

RNAME: reference genome identifier (also known as the Reference Name). Within a SAM formatted file, the RNAME identifies the reference genome where the raw read aligns.

SAM Flag: a single integer value (e.g. 16), which encodes multiple elements of meta-data regarding a read and its alignment. Elements include: whether the read is one part of a paired-end read, whether the read aligns to the genome, and whether the read aligns to the forward or reverse strand of the genome. A useful online utility decodes a single SAM flag value into plain English.
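
A SAM flag is a bit field, so decoding it means testing each bit in turn. The sketch below uses the bit values defined in the SAM specification; the function name and the plain-English labels are illustrative wording, not official text.

```python
# (bit value, meaning) pairs, following the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def decode_flag(flag):
    """Return the plain-English meaning of every bit set in a SAM flag."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]

print(decode_flag(16))  # ['read on reverse strand']
print(decode_flag(99))  # ['read paired', 'read mapped in proper pair', 'mate on reverse strand', 'first in pair']
```

The value 99 is 1 + 2 + 32 + 64, which is why four separate properties are reported for a single integer.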

SAM Format: Text file format for storing sequence alignments against a reference genome. See also BAM Format.

SAMtools: widely used, open source command line tool for manipulating SAM/BAM files. Includes options for converting, sorting, indexing and viewing SAM/BAM files. The SAMtools distribution also includes bcftools, a set of command line tools for identifying and filtering genomics variants. Created by Heng Li, currently of the Broad Institute.

single-read sequencing: sequencing process where only one end of a DNA or RNA fragment is sequenced. Contrast with paired-end sequencing.

VCF Format: Variant call format. Text file format for storing genomic variants, including single nucleotide polymorphisms, insertions, deletions and structural rearrangements. See also BCF format.

Next Generation Sequencing
A high-throughput sequencing method which parallelizes the sequencing process, producing thousands or millions of sequences at once.

Deep Sequencing
Techniques of nucleotide sequence analysis that increase the range, complexity, sensitivity, and accuracy of results by greatly increasing the scale of operations and thus the number of nucleotides, and the number of copies of each nucleotide sequenced.

Paired-End Sequencing
Sequence both ends of the same fragment and keep track of the paired data.

Adapter
Short oligonucleotides which are attached to the DNA to be sequenced. An adapter can provide a priming site for both amplification and sequencing of the adjoining, unknown nucleic acid.

Library
A collection of DNA fragments with adapters ligated to each end.

Bridge Amplification
Generation of in situ copies of a specific DNA molecule on an oligo-decorated solid support.

Emulsion PCR
A method for bead-based amplification of a library. A single adapter-bound fragment is attached to the surface of a bead, and an oil emulsion containing necessary amplification reagents is formed around the bead/fragment component. Parallel amplification of millions of beads with millions of single strand fragments produces a sequencer-ready library.

Alignment
Mapping of sequence reads to a known reference sequence.

Reference sequence/genome
A fully assembled version of a genome that can be used for mapping short DNA sequence reads for comparisons of genomes from various individuals.

Coverage Depth
The number of nucleotides from reads that are mapped to a given position of the reference genome.

Specificity
The percentage of sequences that map to the intended targets out of total bases per run.

Uniformity
The variability in sequence coverage across target regions.

Homopolymer
Uninterrupted stretch of a single nucleotide type (e.g., TTT or GGGGGG).

InDel
InDel stands for Insertion or deletion. A form of structural variation in which a DNA segment is either deleted or inserted.

SNP
SNP stands for Single Nucleotide Polymorphism. A single base difference found when comparing the same DNA sequence from two different individuals.

WiseScaffolder

Poonam Mahapatra — Wed, 13 Jul 2016 08:08:57 -0500

Function

WiseScaffolder is a stand-alone, semi-automatic application for genome scaffolding of pre-assembled contigs using mate-pair data. It also produces editable scaffold maps, which can be used either to build gapped scaffolds or as a common thread for the manual improvement of scaffolds.

Description

WiseScaffolder includes 4 subcommands:

    dumpconfig: generates a configuration file that notably specifies the average insert size of the mate-pair library.
    preprocess: allows the detection and correction of chimerae and the estimation of contig copy number, and produces valuable outputs for the manual improvement of scaffolds.
    scaffold: constitutes the central scaffold-builder and comprises two modules:
        i) the interative_scaffold_extender, which works with big, unambiguous contigs or, when they run out, single-copy contigs, and
        ii) the small_contig_inserter, which inserts the small contigs within scaffolds.
    buildfasta: converts the scaffold map(s) into Fasta sequences.

Address of the bookmark: http://abims.sb-roscoff.fr/wisescaffolder

VALET

Jit — Thu, 22 Sep 2016 04:27:09 -0500

VALET is a pipeline for performing de novo validation of metagenomic assemblies. VALET checks a number of properties that should hold true for a correct assembly (e.g., mate-pairs are aligned at the correct distance from each other in the assembly, the depth of coverage is fairly uniform along contigs, etc.). The violations of these invariants are reported allowing one to pinpoint areas that were potentially mis-assembled, or to compare the quality of different assemblies. For comparing multiple assemblies of the same data-sets, VALET also reports an overall estimate of the likelihood a particular assembly is correct.

Home Page:

VALET code repository

Address of the bookmark: https://www.cbcb.umd.edu/software/valet

COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly

Jit — Wed, 06 Dec 2017 02:08:14 -0600

COPE (Connecting Overlapped Pair-End reads) is an efficient tool that connects overlapping pair-end reads using k-mer frequencies. We evaluated our tool on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10% and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads.
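
For readers unfamiliar with the N50 measurement mentioned above, the following sketch computes it from a list of contig lengths using the common definition: the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size. The function name and the example lengths are hypothetical, and this is unrelated to how COPE itself is implemented.

```python
def n50(lengths):
    """Contig length at which the longest contigs together reach half the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")

print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8: 10 + 9 + 8 = 27, half of the total 54
print(n50([100, 50, 25, 25]))             # 100: the longest contig alone covers half
```

A higher N50 means more of the assembly sits in long contigs, which is why it is often used as a quick measure of assembly contiguity.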

Address of the bookmark: ftp://ftp.genomics.org.cn/pub/cope

RepeatModeler

Jit — Thu, 18 Aug 2016 09:57:15 -0500

RepeatModeler is a de-novo repeat family identification and modeling package. At the heart of RepeatModeler are two de-novo repeat finding programs ( RECON and RepeatScout ) which employ complementary computational methods for identifying repeat element boundaries and family relationships from sequence data. RepeatModeler assists in automating the runs of RECON and RepeatScout given a genomic database and uses the output to build, refine and classify consensus models of putative interspersed repeats.

Address of the bookmark: http://www.repeatmasker.org/RepeatModeler.html