BOL: Related items

FERMI

Jit — Fri, 09 Sep 2016 05:37:13 -0500

Fermi is a de novo assembler with a particular focus on assembling Illumina short sequence reads from a mammal-sized genome. In addition to the role of a typical assembler, fermi also aims to preserve heterozygotes which are often collapsed by other assemblers. Its ultimate goal is to find a minimal set of unitigs to represent all the information in raw reads.

Fermi follows the overlap-layout-consensus paradigm and uses the FM-DNA-index (FMD-index) as the key data structure. It is inspired by the string graph assembler (Simpson and Durbin, 2010 and 2012) and has a similar workflow.
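To give a feel for the kind of index involved, here is a minimal sketch of a plain FM-index with backward search, which counts exact occurrences of a pattern in a text. It is not the FMD-index itself (which, as far as I understand it, also indexes reverse complements so that matches can be extended in both directions), and it is not taken from fermi's code. All function names are hypothetical, and the rotation-sorting construction only suits tiny inputs.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (tiny inputs only)."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(text):
    """Return (first_row, occ, n) for a plain FM-index over text."""
    last = bwt(text)
    counts = Counter(last)
    # first_row[c]: number of characters in the text strictly smaller than c
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in last[:i]
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first_row, occ, len(last)


def count_occurrences(index, pattern):
    """Count exact matches of pattern using backward search."""
    first_row, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("GATTACATTAG")
print(count_occurrences(index, "TTA"))
```

Real assemblers store the occurrence table in sampled, compressed form rather than as full arrays; the full table here only keeps the search logic easy to follow.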

As a typical de novo assembler, fermi tends to produce contigs with slightly longer N50. However, the major weakness of fermi is the high misassembly rate. Although fermi provides a tool to fix misassemblies by using paired-end reads to achieve an accuracy comparable to other assemblers, this is not a favorable solution.
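For readers unfamiliar with the N50 statistic mentioned above: it is the length L such that contigs of length L or longer contain at least half of the total assembled bases. A minimal sketch follows; the function name and the contig lengths are made up for illustration and do not come from any fermi run.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths: total 290, so half is 145.
# Cumulative sums from the longest are 80, 150, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

A longer N50 says nothing about correctness, which is why it has to be read alongside the misassembly rate.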

Fermi is designed to be used on a multi-core Linux machine with large shared memory. The easiest way to run fermi is to use the run-fermi.pl script, which generates a Makefile; the actual assembly is then done by invoking make. Interrupted assembly runs can be resumed. Here is an example:

run-fermi.pl -dAPe ./fermi -p NA12878 -t16 -f18 reads*.fq.gz > NA12878.mak
make -f NA12878.mak -j16

Address of the bookmark: https://github.com/lh3/fermi

Shinyheatmap

Jit — Fri, 21 Oct 2016 05:12:11 -0500

Background: Transcriptomics, metabolomics, metagenomics, and other various next-generation sequencing (-omics) fields are known for their production of large datasets. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources as well as programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers-to-entry for creating highly customizable heatmaps.

Results: We propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions.

Conclusions: shinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap.

More at http://biorxiv.org/content/early/2016/09/21/076463

Address of the bookmark: http://shinyheatmap.com/

HybPiper

Jit — Fri, 04 Nov 2016 05:02:10 -0500

HybPiper was designed for targeted sequence capture, in which DNA sequencing libraries are enriched for gene regions of interest, especially for phylogenetics. HybPiper is a suite of Python scripts that wrap and connect bioinformatics tools in order to extract target sequences from high-throughput DNA sequencing reads.

Targeted bait capture is a technique for sequencing many loci simultaneously based on bait sequences. The HybPiper pipeline starts with high-throughput sequencing reads (for example from Illumina MiSeq) and assigns them to target genes using BLASTx or BWA. The reads are distributed to separate directories, where they are assembled separately using SPAdes. The main output is a FASTA file of the (in-frame) CDS portion of the sample for each target region, and a separate file with the translated protein sequence.

HybPiper also includes post-processing scripts, run after the main pipeline, to extract the intronic regions flanking each exon, investigate putative paralogs, and calculate sequencing depth. For more information, please see our wiki.

HybPiper is run separately for each sample (single or paired-end sequence reads). When HybPiper generates sequence files from the reads, it does so in a standardized directory hierarchy. Many of the post-processing scripts rely on this directory hierarchy, so do not modify it after running the initial pipeline. It is a good idea to run the pipeline for each sample from the same directory. You will end up with one directory per run of HybPiper, and some of the later scripts take advantage of this predictable directory structure.

Address of the bookmark: https://github.com/mossmatters/HybPiper

SGA: String Graph Assembler

Jit — Thu, 08 Dec 2016 05:08:59 -0600

SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.

More at https://github.com/jts/sga

SGA dependencies:
- google sparse hash library (http://code.google.com/p/google-sparsehash/)
- the bamtools library (https://github.com/pezmaster31/bamtools)
- zlib (http://www.zlib.net/)
- (optional but suggested) the jemalloc memory allocator (http://www.canonware.com/jemalloc/download.html)

Address of the bookmark: https://github.com/jts/sga

Cutadapt

Bulbul — Wed, 14 Dec 2016 09:59:52 -0600

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Paired-end reads and even colorspace data are also supported. If you want, you can also just demultiplex your input data, without removing adapter sequences at all.
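To illustrate what "error-tolerant" adapter finding means, here is a deliberately simplified sketch that trims a 3' adapter while allowing a few mismatches, including the case where only the start of the adapter is present at the end of the read. It is not Cutadapt's algorithm: this sketch counts substitutions only and ignores indels, wildcards and quality values. The function name, parameter names and default values are assumptions chosen for illustration.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Return read with a (possibly partial) 3' adapter removed.

    Only substitutions are counted as errors; indels are not handled.
    """
    # Try every start position that leaves at least min_overlap bases.
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(a != b for a, b in zip(window, adapter))
        # Allowed errors scale with the length of the compared region.
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    return read  # no acceptable adapter match found


adapter = "AGATCGGAAGAGC"
print(trim_3prime_adapter("ACGTACGT" + adapter, adapter))          # full adapter
print(trim_3prime_adapter("ACGTACGT" + "AGATCGGTAGAGC", adapter))  # one mismatch
print(trim_3prime_adapter("ACGTACGT" + "AGATC", adapter))          # partial adapter
```

All three calls return ACGTACGT. Because the error allowance is rounded down, short partial matches at the end of a read must match exactly in this sketch.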

Cutadapt comes with an extensive suite of automated tests and is available under the terms of the MIT license.

If you use cutadapt, please cite DOI:10.14806/ej.17.1.200 .

More at https://github.com/marcelm/cutadapt

Address of the bookmark: http://cutadapt.readthedocs.io/en/stable/guide.html

MeGAMerge: A tool to merge assembled contigs, long reads from metagenomic sequencing runs

Jit — Mon, 19 Dec 2016 09:42:15 -0600

Description

MeGAMerge is a Perl-based wrapper/tool that accepts any number of sequence files in Multi-FASTA format, containing assembled contigs of any length, and produces an improved contig set using OLC-based assembly. All overlap parameters (minimum overlap length, identity, etc.) are user-declarable at runtime. It is written to run on Linux.

Requirements:

You will need to have the following tools installed and in $PATH, or added to $binpath in the tool:

Newbler (specifically runAssembly)
Minimus2 (part of AMOS, also requires MUMmer)

Address of the bookmark: https://github.com/LANL-Bioinformatics/MeGAMerge

methylKit

Jit — Fri, 03 Jun 2016 10:09:29 -0500

methylKit is an R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing. The package is designed to deal with sequencing data from RRBS and its variants, but also target-capture methods such as Agilent SureSelect methyl-seq. In addition, methylKit can deal with base-pair resolution data for 5hmC obtained from Tab-seq or oxBS-seq. It can also handle whole-genome bisulfite sequencing data if the proper input format is provided.

Address of the bookmark: https://github.com/al2na/methylKit

QuIN’s web server

Jit — Mon, 27 Jun 2016 10:44:16 -0500

Recent studies of the human genome have indicated that regulatory elements (e.g. promoters and enhancers) at distal genomic locations can interact with each other via chromatin folding and affect gene expression levels. Genomic technologies for mapping interactions between DNA regions, e.g., ChIA-PET and HiC, can generate genome-wide maps of interactions between regulatory elements. These interaction datasets are important resources to infer distal gene targets of non-coding regulatory elements and to facilitate prioritization of critical loci for important cellular functions. With the increasing diversity and complexity of genomic information and public ontologies, making sense of these datasets demands integrative and easy-to-use software tools. Moreover, network representation of chromatin interaction maps enables effective data visualization, integration, and mining. Currently, there is no software that can take full advantage of network theory approaches for the analysis of chromatin interaction datasets. To fill this gap, we developed a web-based application, QuIN, which enables: 1) building and visualizing chromatin interaction networks, 2) annotating networks with user-provided private and publicly available functional genomics and interaction datasets, 3) querying network components based on gene name or chromosome location, and 4) utilizing network based measures to identify and prioritize critical regulatory targets and their direct and indirect interactions.
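As a rough illustration of treating chromatin interactions as a network, the sketch below builds an undirected graph from pairs of interacting regions and ranks nodes by degree, one of the simplest network measures. This is not QuIN's code, and the abstract does not say which measures QuIN uses; the node names and function names are hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (region_a, region_b) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def rank_by_degree(network):
    """Return nodes sorted by number of direct partners, highest first."""
    return sorted(network, key=lambda node: (-len(network[node]), node))


# Hypothetical interaction pairs, e.g. as derived from ChIA-PET or HiC calls.
interactions = [
    ("enhancer_1", "promoter_A"),
    ("enhancer_1", "promoter_B"),
    ("enhancer_2", "promoter_A"),
    ("enhancer_1", "promoter_C"),
]
network = build_network(interactions)
print(rank_by_degree(network))
```

Here enhancer_1 ranks first because it contacts three promoters. Degree only captures direct interactions; measures that follow paths through the network would be needed to capture indirect ones.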

AVAILABILITY: QuIN’s web server is available at http://quin.jax.org. QuIN is developed in Java and JavaScript, utilizing an Apache Tomcat web server and MySQL database, and the source code is available under the GPLv3 license on GitHub: https://github.com/UcarLab/QuIN/.

Address of the bookmark: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004809

Fancy Oneliner for Bioinformatics !!

Poonam Mahapatra — Thu, 07 Jul 2016 12:05:50 -0500

This webpage lists some of the one-liners that we frequently use in metagenomic analyses. You can click on the following links to browse through different topics. You can copy/paste the commands as they are in your terminal screen, provided you follow the same naming conventions and folder structures as we have. We are sharing these codes with the intention that if they are useful and help you in your analyses, then we will be appropriately credited as considerable effort has been put into devising them.

Address of the bookmark: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html

Genome STRiP

Neel — Tue, 06 Sep 2016 03:58:19 -0500

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovering and genotyping structural variations using sequencing data. The methods are designed to detect shared variation using data from multiple individuals.

Genome STRiP looks both across and within a set of sequenced genomes to detect variation. The methods are adaptive and support heterogeneous data sets, including variations in sequencing depth, read lengths and mixtures of paired and single-end reads. A minimum of 20 to 30 genomes is required to get acceptable results, but the method gains power across genomes, and processing more genomes provides better results.

To run discovery or genotyping on a single sequenced genome or a small set of genomes, you need to call your data against a background population, such as a set of genomes from the 1000 Genomes Project. The background population does not need to be matched to the target individuals.

Address of the bookmark: http://software.broadinstitute.org/software/genomestrip/