BOL: Related items

FERMI

Jit — Fri, 09 Sep 2016 05:37:13 -0500

Fermi is a de novo assembler with a particular focus on assembling Illumina short sequence reads from a mammal-sized genome. In addition to the role of a typical assembler, fermi also aims to preserve heterozygotes which are often collapsed by other assemblers. Its ultimate goal is to find a minimal set of unitigs to represent all the information in raw reads.

Fermi follows the overlap-layout-consensus paradigm and uses the FM-DNA-index (FMD-index) as the key data structure. It is inspired by the string graph assembler (Simpson and Durbin, 2010 and 2012) and has a similar workflow.
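
To give a feel for the kind of index mentioned above, here is a minimal Python sketch of a plain FM-index: a Burrows-Wheeler transform plus backward search to count pattern occurrences. It is not fermi's code. The FMD-index additionally indexes both DNA strands and uses compact rank structures, whereas this toy version uses naive sorting and counting. The function names `bwt` and `count_occurrences` are illustrative only.

```python
def bwt(text):
    """Burrows-Wheeler transform of text (which must not contain '$')."""
    text += "$"  # unique sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern in the indexed text via backward search."""
    # first_row[c] = number of characters in the text strictly smaller than c
    first_row = {}
    total = 0
    for c in sorted(set(bwt_str)):
        first_row[c] = total
        total += bwt_str.count(c)

    lo, hi = 0, len(bwt_str)  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        # Naive rank queries; a real index answers these in constant time.
        lo = first_row[c] + bwt_str[:lo].count(c)
        hi = first_row[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("GATTACATTA")
print(count_occurrences(index, "TTA"))
```

The search extends the pattern one character at a time from right to left, which is the property overlap-based assemblers rely on when finding reads that share a suffix or prefix.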

As a typical de novo assembler, fermi tends to produce contigs with slightly longer N50. However, the major weakness of fermi is the high misassembly rate. Although fermi provides a tool to fix misassemblies by using paired-end reads to achieve an accuracy comparable to other assemblers, this is not a favorable solution.
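
For readers unfamiliar with the N50 statistic mentioned above, the following Python sketch shows one common way to compute it from a list of contig lengths: N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly size. The function name `n50` is an illustrative choice and is not part of fermi.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Note that N50 says nothing about correctness: a misassembled contig that wrongly joins two regions raises N50, which is why it should be read alongside the misassembly rate.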

Fermi is designed to be used on a multi-core Linux machine with large shared memory. The easiest way to run fermi is to use the run-fermi.pl script, which generates a Makefile. The actual assembly is then done by invoking make. Interrupted assembly runs can be resumed. Here is an example:

run-fermi.pl -dAPe ./fermi -p NA12878 -t16 -f18 reads*.fq.gz > NA12878.mak
make -f NA12878.mak -j16

Address of the bookmark: https://github.com/lh3/fermi

Shinyheatmap

Jit — Fri, 21 Oct 2016 05:12:11 -0500

Background: Transcriptomics, metabolomics, metagenomics, and various other next-generation sequencing (-omics) fields are known for producing large datasets. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources and programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers to entry for creating highly customizable heatmaps.

Results: We propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions.

Conclusions: shinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap.

More at http://biorxiv.org/content/early/2016/09/21/076463

Address of the bookmark: http://shinyheatmap.com/

HybPiper

Jit — Fri, 04 Nov 2016 05:02:10 -0500

HybPiper was designed for targeted sequence capture, in which DNA sequencing libraries are enriched for gene regions of interest, especially for phylogenetics. HybPiper is a suite of Python scripts that wrap and connect bioinformatics tools in order to extract target sequences from high-throughput DNA sequencing reads.

Targeted bait capture is a technique for sequencing many loci simultaneously based on bait sequences. The HybPiper pipeline starts with high-throughput sequencing reads (for example from Illumina MiSeq) and assigns them to target genes using BLASTx or BWA. The reads are distributed to separate directories, where they are assembled separately using SPAdes. The main output is a FASTA file of the (in-frame) CDS portion of the sample for each target region, and a separate file with the translated protein sequence.
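
As an illustration of the relationship between the two output files described above, the Python sketch below translates an in-frame CDS into a protein sequence using the standard genetic code. This is not HybPiper's own code; the names `CODON_TABLE` and `translate_cds` are hypothetical, and the choices to mark ambiguous codons as X and to ignore a trailing partial codon are assumptions made for this example.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(cds):
    """Translate an in-frame coding sequence; '*' marks a stop codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    # Codons containing ambiguous bases (e.g. N) are reported as X.
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate_cds("ATGGCCTAA"))
```

Because the CDS is already in frame, translation starts at the first base; a sequence in a different frame would need to be trimmed first.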

HybPiper also includes post-processing scripts, run after the main pipeline, to extract the intronic regions flanking each exon, investigate putative paralogs, and calculate sequencing depth. For more information, please see our wiki.

HybPiper is run separately for each sample (single or paired-end sequence reads). When HybPiper generates sequence files from the reads, it does so in a standardized directory hierarchy. Many of the post-processing scripts rely on this directory hierarchy, so do not modify it after running the initial pipeline. It is a good idea to run the pipeline for each sample from the same directory. You will end up with one directory per run of HybPiper, and some of the later scripts take advantage of this predictable directory structure.

Address of the bookmark: https://github.com/mossmatters/HybPiper

SGA: String Graph Assembler

Jit — Thu, 08 Dec 2016 05:08:59 -0600

SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.
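
The idea behind an overlap-based graph can be sketched in a few lines of Python: reads become nodes, and an edge links two reads when a suffix of one matches a prefix of the other. This is a naive all-pairs illustration only. SGA itself finds overlaps through a compressed index rather than direct string comparison, and a true string graph also removes contained reads and transitive edges, which this sketch does not do. The function names and the `min_len` parameter are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Map each ordered read pair (a, b) to its overlap length, if any."""
    edges = {}
    for a, b in permutations(reads, 2):
        k = suffix_prefix_overlap(a, b, min_len)
        if k:
            edges[(a, b)] = k
    return edges


for (a, b), k in overlap_graph(["ACGTAC", "TACGGA", "GGATTT"]).items():
    print(a, "->", b, "overlap", k)
```

The all-pairs comparison is quadratic in the number of reads, which is exactly the cost an index-based approach is meant to avoid.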

More at https://github.com/jts/sga

SGA dependencies:
- Google sparse hash library (http://code.google.com/p/google-sparsehash/)
- the bamtools library (https://github.com/pezmaster31/bamtools)
- zlib (http://www.zlib.net/)
- (optional but suggested) the jemalloc memory allocator (http://www.canonware.com/jemalloc/download.html)

Address of the bookmark: https://github.com/jts/sga

MeGAMerge: A tool to merge assembled contigs, long reads from metagenomic sequencing runs

Jit — Mon, 19 Dec 2016 09:42:15 -0600

Description

MeGAMerge is a Perl-based wrapper tool that accepts any number of sequence files in multi-FASTA format, containing assembled contigs of any length, and produces an improved contig set using OLC-based (overlap-layout-consensus) assembly. All overlap parameters (minimum overlap length, identity, etc.) can be set by the user at runtime. It is written to run on Linux.

Requirements:

You will need to have the following tools installed and in $PATH, or added to $binpath in the tool:

Newbler (specifically runAssembly)
Minimus2 (part of AMOS, also requires MUMmer)

Address of the bookmark: https://github.com/LANL-Bioinformatics/MeGAMerge

Prokka: tool for the rapid annotation of prokaryotic genomes

Jit — Mon, 06 Mar 2017 03:49:57 -0600

Prokka is a software tool for the rapid annotation of prokaryotic genomes. A typical 4 Mbp genome can be fully annotated in less than 10 minutes on a quad-core computer, and Prokka scales well to 32-core SMP systems. It produces GFF3, GBK and SQN files that are ready for editing in Sequin and ultimately for submission to GenBank/DDBJ/ENA.
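
A quick way to inspect a GFF3 annotation file is to tally its feature types. The Python sketch below does this for GFF3 text in general, relying only on the format's nine tab-separated columns. It is not part of Prokka; the function name `count_feature_types` and the sample records are made up for illustration.

```python
from collections import Counter


def count_feature_types(gff3_text):
    """Count features by type (column 3) in GFF3-formatted text."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        counts[fields[2]] += 1
    return counts


# Hypothetical records, not real Prokka output.
sample = "\n".join([
    "##gff-version 3",
    "\t".join(["contig1", "example", "CDS", "10", "400", ".", "+", "0", "ID=gene_1"]),
    "\t".join(["contig1", "example", "CDS", "500", "900", ".", "-", "0", "ID=gene_2"]),
    "\t".join(["contig1", "example", "tRNA", "950", "1020", ".", "+", ".", "ID=gene_3"]),
    "##FASTA",
    ">contig1",
    "ACGT",
])
print(count_feature_types(sample))
```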

Address of the bookmark: http://www.vicbioinformatics.com/software.prokka.shtml

Download assemblies from NCBI

Bulbul — Mon, 15 May 2017 06:02:32 -0500

A new “Download assemblies” button is now available in the Assembly database. This makes it easy to download data for multiple genomes without having to write scripts.

For example, you can run a search in Assembly and use check boxes (see left side of screenshot below) to refine the set of genome assemblies of interest. Then, just open the “Download assemblies” menu, choose the source database (GenBank or RefSeq), choose the file type, and start the download. An archive file will be saved to your computer that can be expanded into a folder containing your selected genome data files.

More at https://ncbiinsights.ncbi.nlm.nih.gov/2017/05/08/genome-data-download-made-easy/

RATT

Jitendra Narayan — Sun, 07 Feb 2016 16:09:40 -0600

RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome.

It was first developed to transfer annotations between different genome assembly versions. However, it can also transfer annotations between strains and even different species, like Plasmodium chabaudi onto P. berghei, between different Leishmania species or Salmonella enterica onto other Salmonella serotypes. RATT is able to transfer any entries present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation.

More at http://ratt.sourceforge.net/

Address of the bookmark: http://ratt.sourceforge.net/

A 3D Map of the Human Genome

Fri, 12 Dec 2014 22:27:55 -0600

Suhas Rao and Miriam Huntley (of the Aiden Lab) describe a 3D map of the human genome at kilobase resolution, revealing the principles of chromatin looping. Guest Origami Folding: Sarah Nyquist. Suhas S.P. Rao*, Miriam H. Huntley*, Neva C. Durand, Elena K. Stamenova, Ivan D. Bochkov, James T. Robinson, Adrian L. Sanborn, Ido Machol, Arina D. Omer, Eric S. Lander, Erez Lieberman Aiden. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell.

Hagfish - assess an assembly through creative use of coverage plots

Abhi — Fri, 20 May 2016 19:08:17 -0500

Hagfish is a tool for the analysis of Next Generation Sequencing (NGS) data. It builds on the concept of coverage plots and aims to assist in, among other things, quality control of de novo genome assemblies and identification of structural variation in genome re-sequencing experiments.

Hagfish requires a reference sequence and a paired-end re-sequencing data set. The larger the insert size of the paired-end library, the more power Hagfish has.
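
The basic quantity behind a coverage plot can be sketched as follows: given the reference length and the span covered by each read pair, compute how many spans cover each position. This Python example is a generic illustration and not Hagfish's implementation. The function name `coverage` and the convention of 0-based, half-open (start, end) intervals are assumptions made here.

```python
def coverage(ref_length, intervals):
    """Per-base depth over a reference from 0-based half-open intervals."""
    # Difference array: +1 where a span starts, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, ref_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    depth = 0
    per_base = []
    for delta in diff[:ref_length]:
        depth += delta  # a running sum turns the deltas into depth
        per_base.append(depth)
    return per_base


print(coverage(10, [(0, 5), (3, 8)]))
```

Regions where depth drops to zero or spikes well above average are the kind of signal that can point to assembly errors or structural variation.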

Address of the bookmark: https://github.com/mfiers/hagfish