BOL: Related items

Shinyheatmap

Jit — Fri, 21 Oct 2016 05:12:11 -0500

Background: Transcriptomics, metabolomics, metagenomics, and other various next-generation sequencing (-omics) fields are known for their production of large datasets. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources as well as programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers-to-entry for creating highly customizable heatmaps. Results: We propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions. Conclusions: shinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap.

More at http://biorxiv.org/content/early/2016/09/21/076463

Address of the bookmark: http://shinyheatmap.com/

HybPiper

Jit — Fri, 04 Nov 2016 05:02:10 -0500

HybPiper was designed for targeted sequence capture, in which DNA sequencing libraries are enriched for gene regions of interest, especially for phylogenetics. HybPiper is a suite of Python scripts that wrap and connect bioinformatics tools in order to extract target sequences from high-throughput DNA sequencing reads.

Targeted bait capture is a technique for sequencing many loci simultaneously based on bait sequences. HybPiper pipeline starts with high-throughput sequencing reads (for example from Illumina MiSeq), and assigns them to target genes using BLASTx or BWA. The reads are distributed to separate directories, where they are assembled separately using SPAdes. The main output is a FASTA file of the (in frame) CDS portion of the sample for each target region, and a separate file with the translated protein sequence.

HybPiper also includes post-processing scripts, run after the main pipeline, to also extract the intronic regions flanking each exon, investigate putative paralogs, and calculate sequencing depth. For more information, please see our wiki.

HybPiper is run separately for each sample (single or paired-end sequence reads). When HybPiper generates sequence files from the reads, it does so in a standardized directory hierarchy. Many of the post-processing scripts rely on this directory hierarchy, so do not modify it after running the initial pipeline. It is a good idea to run the pipeline for each sample from the same directory. You will end up with one directory per run of HybPiper, and some of the later scripts take advantage of this predictable directory structure.

Address of the bookmark: https://github.com/mossmatters/HybPiper

SGA: String Graph Assembler

Jit — Thu, 08 Dec 2016 05:08:59 -0600

SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.

More at

https://github.com/jts/sga

SGA dependencies:
-google sparse hash library (http://code.google.com/p/google-sparsehash/)
-the bamtools library (https://github.com/pezmaster31/bamtools)
-zlib (http://www.zlib.net/)
-(optional but suggested) the jemalloc memory allocator (http://www.canonware.com/jemalloc/download.html)

Address of the bookmark: https://github.com/jts/sga

Velvet tutorial

Poonam Mahapatra — Fri, 09 Dec 2016 04:19:07 -0600

The objective of this activity is to help you understand how to run Velvet in general, how to accurately estimate the insert size of a paired-end library through the use of Bowtie, the primary parameters of velvet, and the process involved in producing a de novo assembly from Illumina reads.

http://evomics.org/learning/assembly-and-alignment/velvet/

Address of the bookmark: http://evomics.org/learning/assembly-and-alignment/velvet/

Understanding Greedy Algorithms

Jit — Mon, 12 Dec 2016 04:37:40 -0600

Learning greedy algo for biologist.

https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

This webpage is also useful for the same:

http://learninglover.com/examples.php?id=59

http://www.cs.rpi.edu/~magdon/ps/conference/super_biokdd.pdf

https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/lecture-slides/MIT7_91JS14_Lecture6.pdf

http://schatzlab.cshl.edu/teaching/AssemblyClass/01.%20Assembly%20Intro.pdf

http://lsl.sinica.edu.tw/Services/Class/files/20150612449.pdf

http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-43.pdf

Address of the bookmark: https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

E-MEM: Efficient computation of Maximal Exact Matches

Jit — Thu, 15 Dec 2016 09:30:43 -0600

E-MEM is a C++/OpenMP program designed to efficiently compute MEMs between large genomes. See the README file for instructions on how to use E-MEM.

E-MEM source code

The source code can be downloaded here.

If you use E-MEM, please cite:

N. Khiste, L. Ilie, E-MEM: Efficient computation of Maximal Exact Matches for very large genomes, Bioinformatics 31(4) (2015) 509 -- 514.

For any questions, please contact Lucian Ilie: ilie@uwo.ca

Address of the bookmark: http://www.csd.uwo.ca/~ilie/E-MEM/

PEAR

Jit — Mon, 19 Dec 2016 09:28:30 -0600

PEAR is an ultrafast, memory-efficient and highly accurate pair-end read merger. It is fully parallelized and can run with as low as just a few kilobytes of memory.

PEAR evaluates all possible paired-end read overlaps and without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, it can merge millions of paired end reads within a couple of minutes on a standard desktop computer.

Address of the bookmark: http://sco.h-its.org/exelixis/web/software/pear/doc.html

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, reconstructing correctly all chromosomes always for CANPA and nearly always for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly before (fasta2homozygous.py) as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements ie. deletions, insertion, inversions, translocations. Note however, this is experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

MCscan

Bulbul — Thu, 22 Dec 2016 03:53:58 -0600

MCscan is a computer program that can simultaneously scan multiple genomes to identify homologous chromosomal regions and subsequently align these regions using genes as anchors. This is the toolset for generating the synteny correspondences in Plant Genome Duplication Database. It is intended as an easy-to-use and quick way to identify conserved gene arrays both within the same genome and across different genomes.

More at http://chibba.agtec.uga.edu/duplication/mcscan/

Address of the bookmark: http://chibba.agtec.uga.edu/duplication/mcscan/

Centurion

Jit — Fri, 12 Feb 2016 04:45:41 -0600

Although centromeres are essential for life and are the subject of extensive research, centromere locations in yeast genomes are difficult to infer, and in most species they are still unknown. Recently, the chromatin conformation assay Hi-C has been re-purposed for diverse applications, including de novo genome assembly, deconvolution of metagenomic samples, and inference of centromere locations. We describe a method, Centurion, that jointly infers the locations of all centromeres in a single yeast genome by exploiting the centromeres’ tendency to cluster in 3D space. We first demonstrate the accuracy of Centurion in identifying known centromere locations from high coverage Hi-C data of budding yeast and a human malaria parasite. We then use two metagenomic samples with relatively low coverage Hi-C data to infer centromere locations for each chromosome in 14 different yeast species. For yeasts with large centromeres (e.g., S. pombe) Centurion predicts the exact centromere locations. For seven yeasts with point centromeres, Centurion predicts most of the centromeres at an average of 5~kb distance from their known locations. Finally, we predict centromere coordinates for six yeast species that currently lack centromere annotations. These results suggest that Centurion can be used for centromere identification for a large number of yeast species, even with a limited amount of Hi-C sequencing.

Paper:http://www.ncbi.nlm.nih.gov/pubmed/25940625

More at http://cbio.ensmp.fr/centurion/

Address of the bookmark: http://cbio.ensmp.fr/centurion/