BOL: Related items

Velvet tutorial

Poonam Mahapatra — Fri, 09 Dec 2016 04:19:07 -0600

The objective of this activity is to help you understand how to run Velvet in general, how to accurately estimate the insert size of a paired-end library using Bowtie, the primary parameters of Velvet, and the process involved in producing a de novo assembly from Illumina reads.
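As a rough illustration of the insert-size step, the sketch below assumes the paired-end reads have already been aligned and the alignments are available in SAM format. It takes the observed template length (column 9, TLEN) of each properly paired first-in-pair read and reports the median and standard deviation. The function name and the demo records are made up for illustration and are not taken from the tutorial.

```python
import statistics


def estimate_insert_size(sam_lines):
    """Return (median, standard deviation) of |TLEN| for properly paired reads."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):          # header lines carry no alignments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:               # not a complete alignment record
            continue
        flag = int(fields[1])             # column 2: bitwise FLAG
        tlen = int(fields[8])             # column 9: observed template length
        properly_paired = flag & 0x2
        first_in_pair = flag & 0x40       # count each pair only once
        if properly_paired and first_in_pair and tlen != 0:
            sizes.append(abs(tlen))
    if not sizes:
        raise ValueError("no properly paired reads found")
    spread = statistics.stdev(sizes) if len(sizes) > 1 else 0.0
    return statistics.median(sizes), spread


# Tiny made-up SAM records (hypothetical data, sequences shortened to 4 bases).
demo_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t99\tref\t100\t60\t4M\t=\t332\t236\tACGT\tIIII",
    "r1\t147\tref\t332\t60\t4M\t=\t100\t-236\tACGT\tIIII",
    "r2\t99\tref\t200\t60\t4M\t=\t442\t246\tACGT\tIIII",
    "r3\t99\tref\t300\t60\t4M\t=\t552\t256\tACGT\tIIII",
]
median_size, spread = estimate_insert_size(demo_sam)
print(median_size, spread)
```

For a real run, an open file handle can be passed instead of the demo list, since the function only iterates over lines.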

http://evomics.org/learning/assembly-and-alignment/velvet/

Address of the bookmark: http://evomics.org/learning/assembly-and-alignment/velvet/

Understanding Greedy Algorithms

Jit — Mon, 12 Dec 2016 04:37:40 -0600

Learning greedy algorithms, for biologists.

https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

These pages are also useful on the same topic:

http://learninglover.com/examples.php?id=59

http://www.cs.rpi.edu/~magdon/ps/conference/super_biokdd.pdf

https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/lecture-slides/MIT7_91JS14_Lecture6.pdf

http://schatzlab.cshl.edu/teaching/AssemblyClass/01.%20Assembly%20Intro.pdf

http://lsl.sinica.edu.tw/Services/Class/files/20150612449.pdf

http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-43.pdf
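To connect greedy algorithms with assembly, here is a minimal sketch of the greedy approach to the shortest common superstring problem: repeatedly merge the two reads with the longest suffix-prefix overlap. It is an illustration only, not code from any of the linked pages. The function names and the toy reads are assumptions, and the greedy result is not guaranteed to be the shortest possible superstring.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_superstring(reads):
    """Greedily merge the pair of reads with the largest overlap until none overlap."""
    reads = list(dict.fromkeys(reads))    # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best_len:   # ties keep the first pair found
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:                 # nothing overlaps any more
            break
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return "".join(reads)                 # concatenate whatever is left


toy_reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
superstring = greedy_superstring(toy_reads)
print(superstring)
```

The choice made at each step is locally optimal (largest overlap first), which is what makes the method greedy. Repeats in a genome are the classic case where this local choice can collapse or misjoin sequence.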

Address of the bookmark: https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

LAST

Bulbul — Mon, 19 Dec 2016 14:07:53 -0600

LAST can:

- Handle big sequence data, e.g.:
  - Compare two vertebrate genomes
  - Align billions of DNA reads to a genome
- Indicate the reliability of each aligned column.
- Use sequence quality data properly.
- Compare DNA to proteins, with frameshifts.
- Compare PSSMs to sequences.
- Calculate the likelihood of chance similarities between random sequences.
- Do split and spliced alignment.
- Train alignment parameters for unusual kinds of sequence (e.g. nanopore).

Address of the bookmark: http://last.cbrc.jp/

Genome Assembly Tools and Software - PART2 !!

Jit — Tue, 27 Dec 2016 16:14:35 -0600

Genome assemblers generally take a file of short sequence reads and a file of quality values as input. Since the quality-value file for high-throughput short reads is usually highly memory-intensive, only a few assemblers use it. To save memory and make the data convenient to query, high-throughput short-read data is first formatted into a specific data structure. Existing data structures for this purpose fall predominantly into two categories: string-based models and graph-based models.
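As an illustration of the graph-based category, the sketch below builds a small de Bruijn graph, the structure used by assemblers such as Velvet: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases. The function name, the toy reads and the choice of k are assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # One edge per k-mer occurrence: prefix node -> suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


toy_reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(toy_reads, k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Repeated entries in a node's list correspond to k-mers seen more than once, which is how read coverage shows up in this representation. A string-based model would instead keep the reads themselves and compare them for overlaps.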

We therefore list many genome assembly tools here. Some are mainly reported for the assembly of genomes, while others are designed to handle complex genomes.

RMAP 2.1 – Short-read Mapping
RMAP aims to accurately map reads from next-generation sequencing technology. RMAP can map reads with or without error probability information (quality scores) and supports mapping of paired-end or bisulfite-treated reads. There are no limitations on read widths or number of mismatches. RMAP can now map more than 8 million reads in an hour at full sensitivity to 2 mismatches.
MIRA 4.0.2 – Whole Genome Shotgun and EST Sequence Assembler
MIRA (Mimicking Intelligent Read Assembly) is a whole genome shotgun and EST sequence assembler for Sanger, 454, Solexa (Illumina), IonTorrent and PacBio data (the latter at the moment only CCS and error-corrected CLR reads). It can be seen as a Swiss army knife of sequence assembly, developed and used over the past 12 years to get assembly jobs done efficiently and, especially, accurately, that is, without putting too much manual work into finishing the assembly.
HapCompass 0.7.7 – A Cycle-Basis Algorithm for Accurate Haplotype Assembly
HapCompass for polyploid genomes can currently be used to create accurate pairwise SNP phasings. Given a set of aligned sequence reads in a SAM file and a set of variant calls in VCF format, HapCompass will assemble reads into haplotypes.
GAM-NGS 1.1b – Genome Assemblies Merger for Next Generation Sequencing
GAM-NGS is able to merge two or more assemblies and returns an improved assembly (more contiguous and more correct). GAM-NGS shows its full potential with multi-library Illumina-based projects.
GeneStitch 1.2.1 – Network Matching Algorithm to Gene Assembly
GeneStitch is a tool to assemble genes using a network matching algorithm. Given an already-assembled dataset, it is capable of assembling contigs together to form more complete genes with the help of a reference gene set. Currently the assembly software that GeneStitch supports is SOAPdenovo.
RACA 0.9.1.1 – Reference-Assisted Chromosome Assembly
RACA is an algorithm to reliably order and orient sequence scaffolds generated by NGS and assemblers into longer chromosomal fragments using comparative genome information and paired-end reads.
DISCOVAR 51750 – Genome Shotgun Assembler and Variant Caller
DISCOVAR is a whole genome shotgun assembler and variant caller that can generate high quality assemblies and variant calls from the latest 250 base Illumina PCR-free fragment reads.
SeqCons 1.0 – de novo and reference-guided Sequence Assembly
SeqCons (Sequence consensus) is an open source consensus computation program for Linux and Windows. The algorithm can be used for de novo and reference-guided sequence assembly.
SimAssemblyStage1/2 0.2 – Assembly Alignment of Contigs
SimAssemblyStage1: Perfectly aligns TranscriptSimulator reads to their nucleotide templates using read title information, creating an ideal simulated assembly of super contigs.
GapFiller – Closing the Gap within Paired Reads
GapFiller is not a standard de novo assembler. It aims “only” at closing the gap between pairs of reads as a first step of a large number of downstream analyses.
PAGIT 1.01 – Post Assembly Genome Improvement Toolkit
PAGIT (Post Assembly Genome Improvement Toolkit) is a toolkit for automatically generating high-quality sequence by ordering contigs, closing gaps, correcting sequence errors and transferring annotation.
ShoRAH 0.8.2 – Short Reads Assembly into Haplotypes
ShoRAH is a software package that allows for inference about the structure of a population from a set of short sequence reads as obtained from ultra-deep sequencing of a mixed sample. The package contains programs that support mapping of reads to a reference genome, correcting sequencing errors by locally clustering reads in small windows of the alignment, reconstructing a minimal set of global haplotypes that explain the reads, and estimating the frequencies of the inferred haplotypes.
RePS 2.0 – WGS Sequence Assembler
RePS (Repeat-masked Phrap with scaffolding) is a WGS sequence assembler that explicitly identifies exact k-mer repeats from the shotgun data and removes them prior to the assembly. The established software Phrap is used to compute meaningful error probabilities for each base. Clone-end-pairing information is used to construct scaffolds that order and orient the contigs. The updated version of RePS incorporates some of the ideas introduced by Phusion on clustering
treecat – Phylogenetic Comparative Assembly
treecat (phylogenetic tree based contig arrangement tool) takes several genomes and their relationships in a phylogenetic tree into account to estimate a possible ordering of the contigs.
IsoLasso 2.6.1 – A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly
IsoLasso is an algorithm to assemble transcripts and estimate their expression levels from RNA-Seq reads.
CEM 0.9.1 – Transcriptome Assembly and Isoform Expression Level Estimation from Biased RNA-Seq Reads
CEM is an algorithm to assemble transcripts and estimate their expression levels from RNA-Seq reads.
MaLTA – Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq data
MaLTA is a method for simultaneous transcriptome assembly and quantification from Ion Torrent RNA-Seq data.
AMOS 3.1.0 – Whole Genome Shotgun Assembler
The AMOS (A Modular, Open-Source) consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal: to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.
AutoEditor 1.20 – Automated Correction of Genome Sequence Errors
AutoEditor is a tool for correcting sequencing and basecaller errors using sequence assembly and chromatogram data. On average AutoEditor corrects 80% of erroneous base calls, with an accuracy of 99.99%. This in turn improves the overall accuracy of genome sequences and facilitates the use of these sequences for polymorphism discovery.
SAGE – String Graph Assembly of GEnomes
SAGE is a new string-overlap graph-based de novo genome assembler.
Omega 1.0.2 – Overlap-graph de novo Assembler for Metagenomics
Omega is software for assembling and scaffolding Illumina sequencing data of microbial communities.
TCGA-Assembler 1.0.3 – Open-Source Software for Retrieving and Processing TCGA Data
TCGA-Assembler is an open-source, freely available tool that automatically downloads, assembles, and processes public The Cancer Genome Atlas (TCGA) data, to facilitate downstream data analysis by relieving investigators from the burdens of data preparation.
SAMMate 2.7.4 / assemblySAM 1.1 – Processing Short Read Alignments in SAM/BAM format / RNA-Seq Assembly and Analysis

SAMMate is an open source GUI software suite to process RNA-Seq data. It is composed of two modules: assemblySAM and SAMMate.

assemblySAM employs a novel method to localize and assemble RNA-seq reads into RNA transcript sequences.
StringGraph beta – String Graph Construction Using Incremental Hashing
StringGraph is a novel, hash based method for constructing the string graph.
MindTheGap 1.0.0 – Detection and Assembly of Insertion Variants
MindTheGap is software that performs detection and assembly of DNA insertion variants in NGS read datasets with respect to a reference genome.
MetAMOS 1.5rc3 – Metagenomic Assembly pipeline for AMOS
MetAMOS is an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open-reading frames and taxonomic or functional annotations.
TIGER – DNA Sequence Assembly
Tiger is a novel de novo assembly framework which adapts to available computing resources by iteratively decomposing the assembly problem into sub-problems.
AlignGraph – Secondary de novo Genome Assembly guided by closely related References
AlignGraph is software that extends and joins contigs or scaffolds by reassembling them with help provided by a reference genome of a closely related organism.
scarpa 0.241 – Scaffolding Reads with Practical Algorithms
Scarpa is a stand-alone scaffolding tool for NGS data. It can be used together with virtually any genome assembler and any NGS read mapper that supports SAM format. Other features include support for multiple libraries and an option to estimate insert size distributions from data.
VGA v1 – Viral Genome Assembler
VGA is a method for accurate assembly of a heterogeneous viral population consisting of individual viral genomes (also known as quasi-species).
Genomix 0.2.11 – Parallel Genome Assembly using Hyracks
Genomix is a parallel genome assembly system built from the ground up with scalability in mind. It can assemble large and high-coverage genomes from fastq files in a short time and produces assemblies similar to Velvet or Ray in quality.
LACHESIS – Genome Assembly with Contact Probability Maps
LACHESIS is a method that exploits contact probability map data (e.g. from Hi-C) for chromosome-scale de novo genome assembly.
KGBassembler 1.2 – Karyotype-based Genome Assembler for Brassicaceae Species
KGBassembler (Brassicaceae genome assembler) is a C++ based tool for assembling contigs and/or scaffolds to full chromosomes based on the karyotype maps of Brassicaceae species and without the need of genetic and physical maps.
AutoAssemblyD 0.1 – Graphical User Interface system for several Genome Assemblers
AutoAssemblyD is software that performs local and remote genome assembly with several assemblers, based on an XML template that can replace the long command lines required by most assemblers.
SR-ASM – DNA Assembly of the Short Sequences coming from 454 sequencer
SR-ASM (Short Reads ASseMbly) algorithm is designed for DNA assembly of the short sequences coming from 454 sequencers.
YASRA 2.33 – Yet Another Short Read Assembler
YASRA performs comparative assembly of short reads using a reference genome, which can differ substantially from the genome being sequenced.
PRICE 1.2 – de novo Genome Assembler
PRICE (Paired-Read Iterative Contig Extension) is a de novo genome assembler implemented in C++. Its name describes the strategy that it implements for genome assembly: PRICE uses paired-read information to iteratively increase the size of existing contigs. Initially, those contigs can be individual reads from a subset of the paired-read dataset, non-paired reads from sequencing technologies that provide non-paired data, or contigs that were output from a prior run of PRICE or any other
ALE 20130717 – Assembly Likelihood Estimator
ALE is a probabilistic framework for determining the likelihood of an assembly given the data (raw reads) used to assemble it. It allows for the rapid discovery of errors and comparisons between similar assemblies.
SSPACE 3.0 – Scaffolding pre-assembled Contigs using Paired-read data
SSPACE (SSAKE-based Scaffolding of Pre-Assembled Contigs after Extension) is a stand-alone program for scaffolding pre-assembled contigs using paired-read data. It is unique in offering the possibility to manually control the scaffolding process. By using the distance information of paired-end and/or matepair data, SSPACE is able to assess the order, distance and orientation of your contigs and combine them into scaffolds. Currently we offer this as a command-line tool in Perl. The input data is given by pre-assembled contig sequences (FASTA) and NGS paired-read data (FASTA or FASTQ). The final scaffolds are provided in FASTA format.
IMAGE 2.4.1 – Iterative Mapping and Assembly for Gap Elimination
IMAGE (Iterative Mapping and Assembly for Gap Elimination) is software designed to close gaps in any draft assembly using Illumina paired-end reads. IMAGE is best described in several stages: aligning Illumina reads at contig ends; locally assembling reads into new contigs; extending or merging reference contigs; and iterating the whole process to extend and merge more contigs.
ATLAS GapFill 2.2 – Deals with the Repetitive Gap Assembly problem
ATLAS GapFill deals with the repetitive gap assembly problem by using the unique gap-flanking sequences to group reads and convert the problem to a local assembly task. Localizing the assembly reduces the numbers of repeats in the assembly, allows more data to be incorporated, and allows for gaps to be filled.
Atlas 2005 – Whole Genome Assembly Suite
Atlas is a collection of software tools to facilitate the assembly of large genomes from whole genome shotgun reads, or a combination of whole genome shotgun reads and BAC or other localized reads.
CGAL 0.9.6b – Computing Genome Assembly Likelihoods
CGAL is a tool for computing genome assembly likelihoods. It computes the likelihood of reads with respect to the assembly and a statistical model which can be used as a metric for evaluating assemblies.
Fermi 1.1 – WGS de novo Assembler based on the FMD-index for large Genomes
Fermi is a de novo assembler for Illumina reads from whole-genome shotgun sequencing. It also provides tools for error correction, sequence-to-read alignment and comparison between read sets. It uses the FMD-index, a novel compressed data structure, as the key data
PASHA 1.0.10 – Parallelized Short Read Assembly
PASHA is a parallel short read assembler for large genomes using de Bruijn graphs. Taking advantage of both shared-memory multi-core CPUs and distributed-memory compute clusters, PASHA has demonstrated its potential to perform high-quality de-novo assembly of large genomes in reasonable time with modest computing resources. Our evaluation using three small real paired-end datasets shows that PASHA is able to produce better assemblies with comparable genome coverage and mis-assembly rates compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. Moreover, PASHA achieves the fastest speed for all three datasets on a single CPU.
XGenovo – Extended Genovo Metagenomic Assembler by Incorporating Paired-End Information
XGenovo (Extended Genovo) extends the Genovo metagenomic assembler by incorporating paired-end information.
MetaVelvet 1.2.01 / MetaVelvet-SL – An Extension of Velvet Assembler to de novo Metagenomic Assembly / utilizing Supervised Learning
MetaVelvet is an extension of the Velvet assembler to de novo metagenome assembly from short sequence reads.
Edena v3.131028 – De Novo Short Reads Assembler
Edena is an assembler dedicated to processing the millions of very short reads produced by the Illumina Genome Analyzer.
ConPADE 1.00 – Contig Ploidy and Allele Dosage Estimation
ConPADE is a tool used to estimate contig ploidy and allele dosage in polyploid genome assemblies.
ELOPER 1.2 – Elongation of Paired-end Reads for de novo Assembly
ELOPER is a pre-processing tool for paired-end sequences that produces a better read library for assembly programs.
Oases 0.2.08 – De novo Transcriptome Assembler for very short reads
Oases is designed to heuristically assemble RNA-seq reads in the absence of a reference genome, across a broad spectrum of expression values and in the presence of alternative isoforms. It achieves this by using an array of hash lengths, a dynamic filtering of noise, a robust resolution of alternative splicing events, and the efficient merging of multiple assemblies. It was tested on human and mouse RNA-seq data and is shown to improve significantly on the transABySS and Trinity de novo
SOPRA 1.4.6 – Statistical Optimization of Paired Read Assembly
SOPRA is an assembler for mate pair/paired-end reads from high throughput sequencing platforms, e.g. Illumina and SOLiD.
hapAssembly – Haplotype Assembly from Whole-Genome Sequence Data
hapAssembly beats the previous best for the important Haplotype Assembly Problem. It is an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model.
PBSIM 1.0.3 – PacBio Reads Simulator
PacBio sequencers produce two types of characteristic reads: CCS (short and low error rate) and CLR (long and high error rate), both of which can be useful for de novo assembly of genomes. PBSIM simulates those PacBio reads by using either a model-based or sampling-based simulation.
SIS – Generate Draft Genome Sequence Scaffolds for Prokaryotes
SIS (Scaffolds from Inversion Signatures) is a new, easy-to-use tool to generate contig scaffolds.
NN50-calculator 0.5 – Evaluate the Correctness of Genome Assemblies
NN50-calculator (Normalized N50 calculator) is a tool for evaluating the correctness of genome assemblies.
Baa.pl 0.20 – use BLAT to ASSESS an ASSEMBLY
Baa.pl is a simple script that parses the output of a BLAT run of a transcriptome vs. a genome assembly.
hapsembler 2.21 – Haplotype-specific Genome Assembly Toolkit
Hapsembler is a haplotype-specific genome assembly toolkit that is designed for genomes that are rich in SNPs and other types of polymorphism. Hapsembler can be used to assemble reads from a variety of platforms including Illumina and Roche/454.
ViSpA 02 – Viral Spectrum Assembler
ViSpA (Viral Spectrum Assembling) implements novel viral assembly and frequency estimation methods. The software uses simple error correction, assembly of viral variants based on maximum-bandwidth paths in weighted read graphs, and frequency estimation via Expectation Maximization on all reads.
VelvetOptimiser 2.2.5 – Automatically Optimise Velvet Assembler Parameters
VelvetOptimiser is a multi-threaded Perl script for automatically optimising the three primary parameter options (K, -exp_cov, -cov_cutoff) for the Velvet de novo sequence assembler.
Assemblet 0.1 – Antigenic Variation Assembler
Assemblet is a short read assembler for assembling antigenic variant sequences in bacteria.
VelvetK 20120606 – Find a reasonable K-mer size to Assemble Genome Reads with Velvet
VelvetK can estimate the best k-mer size to use for your Velvet de novo assembly. It needs two inputs: the estimated genome size, and all your sequence read files. The genome size can be supplied as a number (e.g. 3.5M) or as a FASTA file of a closely related genome.
VAGUE 1.0.5 – Velvet Assembler Graphical User Environment
VAGUE (Velvet Assembler Graphical User Environment) is a GUI for the Velvet de novo assembler.
Transcriptome Assembler – Transcriptome Assembly used in RNA-seq of 16 Mammalian Species
Transcriptome Assembler is software for transcriptome assembly, used in RNA-seq of 16 mammalian species.
BioSequenceAssembler 2.0 – Microsoft Research Sequence Assembler
BioSequenceAssembler is intended for use by biologists and laboratory technicians who are responsible for managing next-generation genomic sequencing data for alignment, assembly, and/or BLAST identification.
BugBuilder – Microbial Genome Assembly
BugBuilder is a pipeline for the automated assembly and annotation of microbial genomes from high-throughput sequence data. It is configurable so as not to be tied to any assembler or scaffolder, and is designed to run in a cluster environment facilitating high-throughput processing of genomes.
MAXIMUS 0.2 – Hybrid Reference and de novo Assembly pipeline
MAXIMUS is a genome assembly pipeline which takes the best out of multiple reference assemblies and de novo assembly. The benefits of this approach include better assembled repetitive regions, fewer gaps and higher accuracy for the resultant assembly.
ISSAKE – Short Read Sequence Assembly
iSSAKE (immuno-SSAKE) is a sequencing approach and assembly software for profiling T-cell metagenomes using short reads from the massively parallel sequencing platforms.
IDBA / IDBA-UD 1.1.1 – De Bruijn Graph De Novo Assembler with Highly Uneven Sequencing Depth
IDBA is a practical iterative de Bruijn graph de novo assembler for sequence assembly in bioinformatics. Most assemblers based on the de Bruijn graph build the graph with one specific k to perform the assembly, and for all of them the choice of k is crucial. If k is too large, there will be many gap problems in the graph. If k is too small, there will be many branch problems. IDBA uses not just one specific k but a range of k values to build the de Bruijn graph iteratively, so it can keep the information from graphs with different k values. It is therefore intended to perform better than assemblers that use a single k.
est2assembly 1.13 – Assembly and Annotation of Transcriptomes for any Species
The est2assembly platform is the only platform for standardising transcriptome projects: go from raw trace files to an annotated GBrowse interface driven by the Seqfeature database. It accepts both Sanger and 454 sequencing technology for de novo assembly, annotation and data mining of EST data.
Curtain 0.2.3 beta – Assembling large Genomes from Short Read Sequences
Curtain is an assembler of next-generation sequence data. Curtain is a Java wrapper around next-generation assemblers such as Velvet, which allows the incremental introduction of read-pair information into the assembly process.
PEAssember 1.2 – A de novo Genome Assembler
PEAssember is a parallel de novo genome assembler for small to mid-sized genomes.
Contrail 0.8.2 – Assembly of Large Genomes using Cloud Computing
Contrail is a Hadoop-based genome assembler for assembling large genomes in the cloud.
BEAP 0.6 beta – Blast Extension and Assembly Program
BEAP is a computer program that uses a short starting DNA fragment, often an EST or partial gene segment, as a “primer” to recursively blast nucleotide databases in an attempt to obtain all sequences that overlap, directly or indirectly, with the “primer”, thereby helping to “extend” the length of the original sequence for constructing a “full length” sequence for functional analysis, or at least to obtain neighboring regions of the segment for SNP discovery and linkage disequilibrium
BRANCH 1.8.1 – boosting RNA-Seq Assemblies with Partial or related Genomic Sequences
BRANCH is software that extends de novo transfrags and identifies novel transfrags with DNA contigs or genes of closely related species. BRANCH discovers novel exons first and then extends/joins fragmented de novo transfrags, so that the resulting transfrags are more complete.
Quake 0.3.5 – Detect & Correct Substitution Sequencing Errors in WGS Data Sets

Quake is a package to correct substitution sequencing errors in experiments with deep coverage (e.g. >15X), specifically intended for Illumina sequencing reads. Quake adopts the k-mer error correction framework, first introduced by the EULER genome assembly package. Unlike EULER and similar programs, Quake utilizes a robust mixture model of erroneous and genuine k-mer distributions to determine where errors are located. Then Quake uses read quality values and learns the nucleotide-to-nucleotide error rates to determine what types of errors are most likely. This leads to more corrections and greater accuracy, especially with respect to avoiding mis-corrections, which create false sequence dissimilar to anything in the original genome sequence from which the read was taken.
Velvet 1.2.10 – Sequence Assembler for Very Short Reads
Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
Lucy 2.20 – DNA Sequence Quality & Vector Trimming
Lucy has been used for several years to clean sequence data from automated DNA sequencers prior to sequence assembly and other downstream uses. The quality trimming portion of lucy makes use of phred quality scores, such as those produced by many automated sequencers based on the Sanger sequencing method. As such, lucy’s quality trimming may not be appropriate for sequence data produced by some of the new “next-generation” sequencers.
iAssembler 1.3.2 – de novo Assembly of Roche-454/Sanger Transcriptome Sequences
iAssembler is a standalone package to assemble ESTs generated using Sanger and/or Roche-454 pyrosequencing technologies into contigs.
GAEMR 1.0.1 – Assembly Analysis Framework
GAEMR (Genome Assembly Evaluation Metrics and Reporting) is a complete genome analysis package that helps you evaluate and report on a genome assembly’s completeness, correctness, and contiguity.
PyroCleaner 1.3 – Clean 454 Pyrosequencing Reads in order to ease the Assembly Process
PyroCleaner is intended to clean the reads included in an sff file in order to ease the assembly process. It can filter sequences on different criteria such as length, complexity, number of undetermined bases (which has been proven to correlate with poor quality) and multiple-copy reads. It can also clean paired-end sff files, generating on one side an sff file with the validated paired-ends and on the other the sequences that can be used as shotgun reads.
SLiQ – Simple linear Inequalities based Mate-Pair reads Filtering and Scaffolding
SLiQ, a set of simple linear inequalities derived from the geometry of contigs on the line, can be used to predict the relative positions and orientations of contigs from individual mate pair reads and thus produce a contig digraph.
rectangles 2.0 – Rectangle Graph for Repeat Resolution in Genome Assembly
rectangles is an ultimate tool for resolving repeats in genome assemblies.
Arachne 4.6233 – Whole-genome Shotgun Assembler
ARACHNE is a program for assembling data from whole genome shotgun sequencing experiments. It was designed for long reads from Sanger sequencing technology, and has been used extensively to assemble many genomes, including many that are large and highly repetitive.
Reconciliator 2.0 – The tool for Merging Assemblies
Reconciliator is the tool for merging assemblies.
PhrapUMD 2 – Modified version of Phrap
Phrap UMD consists of the UMD Trimmer, UMD Overlapper and a modified version of Phrap. It is capable of assembling data downloaded directly from the NCBI Trace Archive. The pipeline runs in 3 stages: first the vector ends of the reads are examined and the vector is found. Then the reads are trimmed for vector and quality. After that the trimmed reads are fed into the 5-pass UMD Overlapper, which finds the overlaps, corrects the base caller errors and performs additional trimming if necessary. After the overlaps are produced, the trimmed and error-corrected reads and overlaps are input into the modified version of Phrap, which only puts the reads together if they overlap according to the list of overlaps produced by the UMD Overlapper.
DNA Dragon 1.5.6 build1 – DNA Sequence Contig Assembler Software
DNA Dragon Contig Assembler assembles sequences, trace data (ABI, SCF, AB1), Illumina and Roche 454 flowgrams into contigs. It is very fast and accurate DNA sequence assembly software. The DNA sequences are assembled into contigs and a direct comparison of trace data with nucleotide data is possible. It also allows for proofreading and base editing.
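
Several of the tools above evaluate assemblies rather than build them (for example NN50-calculator and GAEMR). Contiguity is commonly summarised with the N50 statistic, and the sketch below computes plain N50 from a list of contig lengths. It does not reproduce the normalization used by NN50-calculator, which is not described here. The function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length                 # add contigs from longest to shortest
        if running >= half_total:
            return length


example_lengths = [80, 70, 50, 40, 30, 20]   # hypothetical contig lengths
print(n50(example_lengths))
```

N50 rewards long contigs regardless of whether they are correct, which is one reason likelihood-based tools such as ALE and CGAL in the list take a different approach to evaluation.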

Krona

Jit — Wed, 22 Mar 2017 04:47:35 -0500

Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The interactive charts are self-contained and can be viewed with any modern web browser (see Browser support).

Address of the bookmark: https://github.com/marbl/Krona/wiki

WGS Celera Assembler version 8.3rc2

Jit — Mon, 10 Apr 2017 04:45:40 -0500

These are release notes for Celera Assembler version 8.3rc2, which was released on May 24, 2015.

This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.

The source code package includes full source code (revision 4627), Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r1994), used by some modules of Celera Assembler, is included. This distribution includes SAMtools (http://samtools.sourceforge.net/), Jellyfish 2.0 (http://www.cbcb.umd.edu/software/jellyfish/), PBUTGCNS (https://github.com/pbjd/pbutgcns), PBDAGCON (https://github.com/PacificBiosciences/pbdagcon), BLASR (https://github.com/PacificBiosciences/BLASR), and parts of the Falcon assembler (https://github.com/PacificBiosciences/FALCON/tree/v0.1.3).

Full documentation can be found online at http://wgs-assembler.sourceforge.net/.

Interesting scripts and programs within it:

urbe@urbo214b[bin] ls []
-rwxrwxr-x 1 urbe urbe 11K Apr 10 11:41 addCNSToStore
-rwxrwxr-x 1 urbe urbe 575K Apr 10 11:41 addReadsToUnitigs
-rwxrwxr-x 1 urbe urbe 128K Apr 10 11:41 analyzeBest
-rwxrwxr-x 1 urbe urbe 257K Apr 10 11:41 analyzePosMap
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 analyzeScaffolds
-rwxrwxr-x 1 urbe urbe 224K Apr 10 11:41 asmOutputFasta
-rwxrwxr-x 1 urbe urbe 448K Apr 10 11:41 asmOutputStatistics
-rwxrwxr-x 1 urbe urbe 2,4K Apr 10 11:41 asmToAGP.pl
-rwxrwxr-x 1 urbe urbe 7,6M Apr 10 11:41 blasr
-rwxrwxr-x 1 urbe urbe 1,6M Apr 10 11:41 bogart
-rwxrwxr-x 1 urbe urbe 183K Apr 10 11:41 bogus
-rwxrwxr-x 1 urbe urbe 272K Apr 10 11:41 bogusness
-rwxrwxr-x 1 urbe urbe 247K Apr 10 11:41 buildPosMap
-rwxrwxr-x 1 urbe urbe 213K Apr 10 11:41 buildRefContigs
-rwxrwxr-x 1 urbe urbe 990K Apr 10 11:41 buildUnitigs
-rwxrwxr-x 1 urbe urbe 18K Apr 10 11:41 ca2ace.pl
-rwxrwxr-x 1 urbe urbe 12K Apr 10 11:41 caqc_help.ini
-rwxrwxr-x 1 urbe urbe 61K Apr 10 11:41 caqc.pl
-rwxrwxr-x 1 urbe urbe 23K Apr 10 11:41 cat-corrects
-rwxrwxr-x 1 urbe urbe 24K Apr 10 11:41 cat-erates
-rwxrwxr-x 1 urbe urbe 1,9M Apr 10 11:41 cgw
-rwxrwxr-x 1 urbe urbe 1,4M Apr 10 11:41 cgwDump
-rwxrwxr-x 1 urbe urbe 204K Apr 10 11:41 chimChe
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:40 chimera
-rwxrwxr-x 1 urbe urbe 220K Apr 10 11:41 classifyMates
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:41 classifyMatesApply
-rwxrwxr-x 1 urbe urbe 215K Apr 10 11:41 classifyMatesPairwise
-rwxrwxr-x 1 urbe urbe 366K Apr 10 11:41 computeCoverageStat
-rwxrwxr-x 1 urbe urbe 9,8K Apr 10 11:41 convert-fasta-to-v2.pl
-rwxrwxr-x 1 urbe urbe 48K Apr 10 11:41 convertOverlap
-rwxrwxr-x 1 urbe urbe 119K Apr 10 11:41 convertSamToCA
-rwxrwxr-x 1 urbe urbe 20K Apr 10 11:41 convertToPBCNS
-rwxrwxr-x 1 urbe urbe 197K Apr 10 11:41 correct-frags
-rwxrwxr-x 1 urbe urbe 259K Apr 10 11:41 correct-olaps
-rwxrwxr-x 1 urbe urbe 520K Apr 10 11:41 correctPacBio
-rwxrwxr-x 1 urbe urbe 540K Apr 10 11:41 ctgcns
-rwxrwxr-x 1 urbe urbe 162K Apr 10 11:40 deduplicate
-rwxrwxr-x 1 urbe urbe 37K Apr 10 11:41 demotePosMap
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 dumpCloneMiddles
-rwxrwxr-x 1 urbe urbe 124K Apr 10 11:41 dumpPBRLayoutStore
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 dumpSingletons
-rwxrwxr-x 1 urbe urbe 171K Apr 10 11:41 erate-estimate
-rwxrwxr-x 1 urbe urbe 221K Apr 10 11:40 estimate-mer-threshold
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 extendClearRanges
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 extendClearRangesPartition
-rwxrwxr-x 1 urbe urbe 205K Apr 10 11:40 extractmessages
-rwxrwxr-x 1 urbe urbe 7,2M Apr 10 11:41 falcon_sense
-rwxrwxr-x 1 urbe urbe 9,8K Apr 10 11:41 fastaToCA
-rwxrwxr-x 1 urbe urbe 124K Apr 10 11:40 fastqAnalyze
-rwxrwxr-x 1 urbe urbe 137K Apr 10 11:40 fastqSample
-rwxrwxr-x 1 urbe urbe 62K Apr 10 11:40 fastqSimulate
-rwxrwxr-x 1 urbe urbe 121K Apr 10 11:40 fastqSimulate-sort
-rwxrwxr-x 1 urbe urbe 246K Apr 10 11:40 fastqToCA
-rwxrwxr-x 1 urbe urbe 140K Apr 10 11:41 filterOverlap
-rwxrwxr-x 1 urbe urbe 341K Apr 10 11:40 finalTrim
-rwxrwxr-x 1 urbe urbe 228K Apr 10 11:41 fixUnitigs
-rwxrwxr-x 1 urbe urbe 147K Apr 10 11:40 fragmentDepth
-rwxrwxr-x 1 urbe urbe 29K Apr 10 11:41 fragsInVars
-rwxrwxr-x 1 urbe urbe 545K Apr 10 11:41 frgs2clones
-rwxrwxr-x 1 urbe urbe 398K Apr 10 11:40 gatekeeper
-rwxrwxr-x 1 urbe urbe 139K Apr 10 11:40 gatekeeperbench
-rwxrwxr-x 1 urbe urbe 167K Apr 10 11:40 gkpStoreCreate
-rwxrwxr-x 1 urbe urbe 147K Apr 10 11:40 gkpStoreDumpFASTQ
-rwxrwxr-x 1 urbe urbe 184K Apr 10 11:41 greedyFragmentTiling
-rwxrwxr-x 1 urbe urbe 1,6K Apr 10 11:41 greedy_layout_to_IUM
-rwxrwxr-x 1 urbe urbe 142K Apr 10 11:40 initialTrim
-rwxrwxr-x 1 urbe urbe 967K Apr 10 11:41 jellyfish
-rwxrwxr-x 1 urbe urbe 219K Apr 10 11:41 markRepeatUnique
-rwxrwxr-x 1 urbe urbe 273K Apr 10 11:40 markUniqueUnique
-rwxrwxr-x 1 urbe urbe 114K Apr 10 11:40 mercy
-rwxrwxr-x 1 urbe urbe 3,8K Apr 10 11:41 mergeqc.pl
-rwxrwxr-x 1 urbe urbe 422K Apr 10 11:40 merTrim
-rwxrwxr-x 1 urbe urbe 125K Apr 10 11:40 merTrimApply
-rwxrwxr-x 1 urbe urbe 376K Apr 10 11:40 meryl
-rwxrwxr-x 1 urbe urbe 176K Apr 10 11:41 metagenomics_ovl_analyses
-rwxrwxr-x 1 urbe urbe 297K Apr 10 11:41 olap-from-seeds
-rwxrwxr-x 1 urbe urbe 275K Apr 10 11:41 outputLayout
-rwxrwxr-x 1 urbe urbe 229K Apr 10 11:41 overlapInCore
-rwxrwxr-x 1 urbe urbe 144K Apr 10 11:40 overlap_partition
-rwxrwxr-x 1 urbe urbe 179K Apr 10 11:41 overlapStats
-rwxrwxr-x 1 urbe urbe 179K Apr 10 11:41 overlapStore
-rwxrwxr-x 1 urbe urbe 153K Apr 10 11:41 overlapStoreBucketizer
-rwxrwxr-x 1 urbe urbe 175K Apr 10 11:41 overlapStoreBuild
-rwxrwxr-x 1 urbe urbe 33K Apr 10 11:41 overlapStoreIndexer
-rwxrwxr-x 1 urbe urbe 48K Apr 10 11:41 overlapStoreSorter
-rwxrwxr-x 1 urbe urbe 604K Apr 10 11:40 overmerry
lrwxrwxrwx 1 urbe urbe 4 Apr 10 11:41 pacBioToCA -> PBcR
-rwxrwxr-x 1 urbe urbe 131K Apr 10 11:41 PBcR
-rwxrwxr-x 1 urbe urbe 2,9M Apr 10 11:41 pbdagcon
-rwxrwxr-x 1 urbe urbe 1,9M Apr 10 11:41 pbutgcns
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:40 remove_fragment
-rwxrwxr-x 1 urbe urbe 153K Apr 10 11:40 removeMateOverlap
-rwxrwxr-x 1 urbe urbe 2,5K Apr 10 11:41 replaceUIDwithName-fastq
-rwxrwxr-x 1 urbe urbe 1,2K Apr 10 11:41 replaceUIDwithName-posmap
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 resolveSurrogates
-rwxrwxr-x 1 urbe urbe 139K Apr 10 11:41 rewriteCache
-rwxrwxr-x 1 urbe urbe 232K Apr 10 11:41 runCA
-rwxrwxr-x 1 urbe urbe 88K Apr 10 11:41 runCA-dedupe
-rwxrwxr-x 1 urbe urbe 14K Apr 10 11:41 runCA-overlapStoreBuild
-rwxrwxr-x 1 urbe urbe 3,6K Apr 10 11:41 run_greedy.csh
-rwxrwxr-x 1 urbe urbe 297K Apr 10 11:40 sffToCA
-rwxrwxr-x 1 urbe urbe 13K Apr 10 11:40 show-corrects
-rwxrwxr-x 1 urbe urbe 557K Apr 10 11:41 splitUnitigs
-rwxrwxr-x 1 urbe urbe 1,4M Apr 10 11:41 terminator
drwxrwxr-x 2 urbe urbe 4,0K Apr 10 11:41 TIGR
-rwxrwxr-x 1 urbe urbe 526K Apr 10 11:41 tigStore
-rwxrwxr-x 1 urbe urbe 35K Apr 10 11:41 tracearchiveToCA
-rwxrwxr-x 1 urbe urbe 35K Apr 10 11:41 tracedb-to-frg.pl
-rwxrwxr-x 1 urbe urbe 44K Apr 10 11:41 trimFastqByQVWindow
-rwxrwxr-x 1 urbe urbe 18K Apr 10 11:40 uidclient
-rwxrwxr-x 1 urbe urbe 589K Apr 10 11:41 unitigger
-rwxrwxr-x 1 urbe urbe 42K Apr 10 11:40 upgrade-v8-to-v9
-rwxrwxr-x 1 urbe urbe 42K Apr 10 11:40 upgrade-v9-to-v10
-rwxrwxr-x 1 urbe urbe 854 Apr 10 11:41 utg2fasta
-rwxrwxr-x 1 urbe urbe 731K Apr 10 11:41 utgcns
-rwxrwxr-x 1 urbe urbe 561K Apr 10 11:41 utgcnsfix

Address of the bookmark: http://wgs-assembler.sourceforge.net/wiki/index.php/Main_Page

STUDENTSHIP and TRAINEESHIP @ University of Madras

Sat, 16 Nov 2013 19:27:40 -0600

Bioinformatics Infrastructure Facility
University of Madras
Chennai 600 025

Applications are invited for the STUDENTSHIP and TRAINEESHIP vacancies to carry out project/research work in the DBT - Bioinformatics Infrastructure Facility, with a consolidated stipend of Rs.5,000/- per month.

Essential Qualification

Student Trainee: Those who have completed M.Sc., Bioinformatics/Biophysics/Life sciences or Pursuing M.Tech., Bioinformatics/Biotechnology

Duration : 3-4 Months

Student Trainee: Those who are pursuing M.Sc Bioinformatics/Biophysics/ Life sciences/others

Duration : 2-3 Months

Mail your CV on or before 25th November 2013 to shirai2011@gmail.com and hard copy to "Dr. D. Velmurugan, Professor & Head, CAS in Crystallography and Biophysics, University of Madras, Guindy Campus, Chennai 600 025". Also, the applicants are requested to attend the interview on 29th November, 2013 at 11 A.M.

www.unom.ac.in/uploads/announcements/bifadvertisement_20131114080003_23240.pdf

Fastq format

Jit — Wed, 03 May 2017 04:23:32 -0500

FASTQ format is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. Each sequence letter and each quality score is encoded with a single ASCII character for brevity.

It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has recently become the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer.
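To make the one-character-per-score encoding concrete, here is a minimal sketch of a reader for FASTQ records. It rests on assumptions not stated above: each record occupies exactly four lines (header starting with "@", sequence, separator starting with "+", quality string), and quality characters use the Sanger-style offset of 33. Older Illumina files used an offset of 64, so check your data first. The names parse_fastq and PHRED_OFFSET are hypothetical, chosen only for this illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: Sanger-style encoding, where score = ASCII code - 33.
PHRED_OFFSET = 33


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from four-line FASTQ records."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        # One ASCII character per base: subtract the offset to get the score.
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield header[1:], seq, scores


if __name__ == "__main__":
    example = ["@read1\n", "GATTACA\n", "+\n", "IIIIII#\n"]
    for read_id, seq, scores in parse_fastq(example):
        print(read_id, seq, scores)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly. For real projects an established parser such as the one in Biopython is probably a safer choice than a hand-written reader.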

Address of the bookmark: https://en.wikipedia.org/wiki/FASTQ_format

Five points for bioinformatics software/tools

Jitendra Narayan — Mon, 05 Aug 2013 04:12:32 -0500

In bioinformatics we spend most of our time on the computational analysis of huge amounts of data, trying to make biological sense of it. But most newbie bioinformaticians face a dilemma when they receive biological sequence data for the first time: they are confused about whether to choose open source programs, user-friendly GUI tools, or commercial bioinformatics software. Don’t be surprised; this is true, and the decision is not an easy one, because the analysis step is the most crucial part of the work and is believed to be the biggest bottleneck in publishing papers in high-impact journals. In this post I would like to address the pros and cons of both kinds of software/tools and try to assist you (hmm, not really; more like convince you) in deciding on your software selection.

The most common newbie questions are:

Should I try to use these free open source programs? Why are we not trying GUI software for computational analysis? Should I use commercial bioinformatics programs/software?

1. Let’s be open

We generally think that anything free or cheap is useless. But this idea does not apply to open source software. Most bioinformatics software is developed by highly competent biological programmers who believe in the open sharing of knowledge. Many of them are associated with the Open Bioinformatics Foundation (O|B|F), a non-profit, volunteer-run organization focused on supporting open source programming in bioinformatics. The best part about open source tools/software is that you are free to download the source code and read exactly what the program does. If you are so inclined, you can view all of the parts of the program and see the logical flow of the pipeline. In addition, open source makes an excellent learning tool for any beginning bioinformatician. Moreover, you can modify existing open source programs to deal with cutting-edge problems or to customize your pipeline. Beyond your own computational and analysis work, most reviewers also prefer results based on open source tools, because they can validate those results if validation is required.

2. Code headache

As a bioinformatician you are supposed to know the basics of programming, and if you are not good at it, please learn as soon as possible, because you are not a bio-analyst but a biological programmer. Open source programs usually lack dedicated service and support teams (often because they were the product of an overworked doctoral student or postdoc!), so you are responsible for troubleshooting your own errors most of the time. We commonly receive HELP emails asking for support in setting up a pipeline; you can find the same kind of request on any Q&A forum. I personally believe this coding horror is the biggest downside of open source programs: you need some programming skills in order to use the program in your pipeline. But if you are not able to fix the pipeline and modify the open source code according to your requirements, then you should rethink your bioinformatician name tag!!!

3. Dive into the codes

Some biologists turned bioinformaticians say, “If you can do the same thing with commercial software, then why get a migraine from weird code?” To me, this sounds like people who are keen to learn swimming but still don’t like to get wet. If you are still using paid software and doing your work through customer support and by clicking well-designed GUI buttons, then perhaps you are not interested in learning and trying new and challenging bioinformatics work. You are missing the basic flavour of bioinformatics. Let’s dive into the coding world; I am sure you will enjoy it. I recommend that you swim freely in the sea of code and enjoy the journey; do not merely watch it from the outside.

4. Paid does not mean better

Companies that specialize in bioinformatics solutions develop well-designed, well-packaged, user-friendly software, employing a large number of specialised scientists, programmers and support staff. They also provide good services to help you accomplish your biological analysis work. This means that if you hit a ‘snag’ with your data, help is likely only a phone call away! These companies price their products competitively against the cost of a dedicated bioinformatician. You may be able to afford the program, but not the additional staff! Additionally, most of the functionality that you need in your analysis is already coded into the program. Need to plot a graph? Just click this button right here. It is that easy. But for a bioinformatician this is generally not an encouraged approach to biological analysis, because the software is not available to everyone and your results can’t be independently validated. Moreover, there is very little chance that anyone will repeat your work or take up a similar kind of research (because not all the labs in the world are as rich as yours).

5. Take caution

Biological analysis work, in which you deal with GB/TB of data, carries a high chance of errors, so please be careful and always cross-check your data before coming to any conclusion. Even an error in a two-line piece of code can alter your entire analysis and produce weird results. Some scientists blindly trust commercial software, which is entirely wrong. Using proprietary tools does not absolve you of the need to actually read and research the type of analysis that you are doing. This is particularly true in the case of genome assembly and annotation.

In the end, I would like to say only one thing: open source solutions allow you to do more cutting-edge analysis than commercial tools. So let’s go for it.

Disclaimer:

This is my personal view. I have nothing to do with any company or open source community. The views expressed on these pages are mine alone and not those of my current/past employers. I do reserve the right to remove comments left by spammers or off-topic comments.

Prime Minister’s 100k Genome Project

Jitendra Narayan — Thu, 08 Aug 2013 09:40:39 -0500

Genomics England is set to sequence the genomes of 100,000 patients in England over the next five years, a landmark project by the British government.

Genomics England will play a key role in building on the UK’s long track record as a leader in medical science advances, pushing the boundaries by unlocking the power of DNA data. The UK will become the first ever country to introduce this technology in its mainstream health system, leading the global race for better tests, better drugs and, above all, better and more personalised care.

http://www.genomicsengland.co.uk/100k-genome-project/