BOL: Related items

scikit-learn

Jitendra Prajapati — Mon, 29 Feb 2016 17:39:24 -0600

Machine Learning in Python

Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license

More at http://scikit-learn.org/stable/index.html

Address of the bookmark: http://scikit-learn.org/stable/auto_examples/index.html

RNA-Seq De novo Assembly Using Trinity

Surabhi Chaudhary — Wed, 23 Mar 2016 05:53:46 -0500

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
Chrysalis groups the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.

More at https://github.com/trinityrnaseq/trinityrnaseq/wiki


Download Trinity here.

Build Trinity by typing 'make' in the base installation directory.

Assemble RNA-Seq data like so:

 Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G

Find assembled transcripts as: 'trinity_out_dir/Trinity.fasta'
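
A quick sanity check after the run is to count the FASTA header lines in the output. The snippet below is a minimal sketch, assuming the default output path given above; the helper name count_transcripts is hypothetical and not part of Trinity.

```shell
#!/bin/sh
# Hypothetical helper (not part of Trinity): count the sequences in a FASTA
# file by counting its header lines, which begin with '>'.
count_transcripts() {
    grep -c '^>' "$1"
}

# Report only if the assumed default output file exists.
fasta="trinity_out_dir/Trinity.fasta"
if [ -f "$fasta" ]; then
    echo "Assembled transcripts: $(count_transcripts "$fasta")"
fi
```

Trinity reports each isoform as a separate record, so this count is the number of transcripts rather than the number of genes.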

Address of the bookmark: https://github.com/trinityrnaseq/trinityrnaseq/wiki

PhyloGrapher - Graph Visualization Tool

Jitendra Prajapati — Wed, 06 Apr 2016 19:06:48 -0500

PhyloGrapher is a program designed to visualize and study evolutionary relationships within families of homologous genes or proteins (elements). PhyloGrapher is a drawing tool that generates custom graphs for a given set of elements. In general, it is possible to use PhyloGrapher to visualize any type of relations between elements.

More at http://www.atgc.org/PhyloGrapher/PhyloGrapher_Welcome.html

Address of the bookmark: http://www.atgc.org/PhyloGrapher/PhyloGrapher_Welcome.html

DISCOVAR

Abhimanyu Singh — Mon, 18 Apr 2016 11:59:16 -0500

DISCOVAR is a new variant caller and DISCOVAR de novo is a new genome assembler, both designed for state-of-the-art data. Their inputs are chosen to optimize quality while keeping costs low. Currently they take as input Illumina reads of length 250 or longer (produced on MiSeq or HiSeq 2500) from a single PCR-free library. These data enable a level of completeness and continuity that was not previously possible.

DISCOVAR can call variants on a region by region basis, potentially tiling an entire large genome. DISCOVAR variant calling is under active development and transitioning to VCF.

DISCOVAR de novo can generate de novo assemblies for both large and small genomes. It currently does not call variants.

More at https://www.broadinstitute.org/software/discovar/blog/?page_id=14

Address of the bookmark: https://www.broadinstitute.org/software/discovar/blog/

HOMER: Software for motif discovery and next-gen sequencing analysis

Neel — Tue, 26 Apr 2016 03:48:23 -0500

This tutorial covers topics independently of HOMER, and represents knowledge which is important to know before diving head first into more advanced analysis tools such as HOMER.

Setting up your computing environment
Retrieving and storing sequencing files (your own data or from public sources)
Checking sequence quality, trimming, general sequence manipulation
Mapping reads to a reference genome
Manipulating SAM/BAM alignment files
Visualizing data in a genome browser

RNA-Seq

Microarray

Basic analysis of Affymetrix Gene Expression Arrays using R/Bioconductor

General Tips for Data Analysis

Excel workarounds, adding gene annotation, X-Y plots tips, etc.

Address of the bookmark: http://homer.salk.edu/homer/basicTutorial/

Smash: An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

Jit — Tue, 26 Apr 2016 12:18:49 -0500

Smash is a completely alignment-free method/tool to find and visualise genomic rearrangements. The detection is based on conditional exclusive compression, using an FCM (finite-context model, a Markov model) of high context order (typically 20). For visualisation, Smash outputs an SVG image with an ideogram output architecture, where the patterns are represented with several HSV values (only value varies). The method works at both small and large scale. It is, however, aimed mainly at large-scale comparisons, since the main aim of the research was to see, chromosome by chromosome, where the genomes of several primates are the same or different, giving an at-a-glance map of the entire genomes.

Address of the bookmark: http://bioinformatics.ua.pt/software/smash/

GATB : Genome Analysis Toolbox with de-Bruijn graph

Jit — Thu, 28 Apr 2016 11:16:51 -0500

The Genome Analysis Toolbox with de-Bruijn graph (GATB) provides a set of highly efficient algorithms to analyse NGS data sets. These methods enable the analysis of data sets of any size on multi-core desktop computers, including very large volumes of read data from any kind of organism, such as bacteria, plants, animals, and even complex samples (e.g. metagenomes).

More at https://gatb.inria.fr/

Address of the bookmark: https://gatb.inria.fr/

Painless package development for R

Abhi — Tue, 03 May 2016 05:31:06 -0500

Devtools makes package development a breeze: it works with R’s existing conventions for code structure, adding efficient tools to support the cycle of package development. With devtools, developing a package becomes so easy that it will be your default layout whenever you’re writing a significant amount of code.

Before you get started be sure to check out:

Address of the bookmark: https://www.rstudio.com/products/rpackages/devtools/

Summer internship positions at DuPont

Wed, 11 May 2016 08:05:54 -0500

DuPont Industrial Biosciences has several summer internship positions
for undergrads available. We are looking for driven and creative interns
to conduct research in the following areas:

· Enzyme immobilization supports for select enzyme systems.

· New tools for microbial strain and genome engineering using
state-of-the-art methodologies.

· Rapid high throughput assays to screen microorganisms from various
sources for enzymatic activities of interest.

· High throughput combinatorial approaches to the formulation of growth
media in support of microbial enrichments, strain isolations and growth
optimization.

· Meta-transcriptomics for the discovery of new enzymes.

· Strain adaptation techniques in defined chemostat environments for
microbial strain development.

The internships are based at the Experimental Station R&D Center in
Wilmington, DE.

If interested, apply fast!

For more information and to apply, go to:

http://careers.dupont.com/jobsearch/job-details/industrial-biosciences-summer-internship/008549W-10/

SLURM basics!

Radha Agarkar — Fri, 13 May 2016 04:42:24 -0500

SLURM is a queue management system and stands for Simple Linux Utility for Resource Management. SLURM was developed at the Lawrence Livermore National Lab and currently runs some of the largest compute clusters in the world.

SLURM is similar in many ways to most other queue systems. You write a batch script then submit it to the queue manager. The queue manager then schedules your job to run on the queue (or partition in SLURM parlance) that you designate. Below we will provide an outline of how to submit jobs to SLURM, how SLURM decides when to schedule your job and how to monitor progress.
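
As an illustration of what such a batch script can look like, here is a minimal sketch. The partition name "general" and all resource values are assumptions chosen for the example, so substitute whatever your cluster provides. The #SBATCH lines are read by sbatch and are plain comments to the shell.

```shell
#!/bin/bash
#SBATCH -n 1                 # number of tasks (cores)
#SBATCH -t 10                # run time limit, in minutes
#SBATCH -p general           # partition to submit to (hypothetical name)
#SBATCH --mem=1000           # memory for the job, in MB
#SBATCH -o job_%j.out        # standard output; %j expands to the job ID
#SBATCH -e job_%j.err        # standard error

# The job body is ordinary shell. This placeholder only reports the node
# the job landed on; replace it with the real work.
report_host() {
    echo "Running on $(uname -n)"
}
report_host
```

Saved as runscript.sh, this would be submitted with `sbatch runscript.sh`.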

SLURM has a number of valuable features compared to other job management systems:

Kill and Requeue: SLURM’s ability to kill and requeue is superior to that of other systems. It waits for jobs to be cleared before scheduling the high priority job. It also does kill and requeue on memory rather than just on core count.
Memory: Memory requests are sacrosanct in SLURM. Thus the amount of memory you request at run time is guaranteed to be there. No one can infringe on that memory space and you cannot exceed the amount of memory that you request.
Accounting Tools: SLURM has a back end database which stores historical information about the cluster. This information can be queried by users who are curious about how many resources they have used.

Summary of SLURM commands

The table below shows a summary of SLURM commands. These commands are described in more detail below along with links to the SLURM doc site.

Task	SLURM command	Example
Submit a batch serial job	sbatch	`sbatch runscript.sh`
Run a script interactively	srun	`srun --pty -p interact -t 10 --mem 1000 /bin/bash /bin/hostname`
Kill a job	scancel	`scancel 999999`
View status of queues	squeue	`squeue -u akitzmiller`
Check current job by id	sacct	`sacct -j 999999`
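
The commands in the table can be chained into a simple submit-and-monitor workflow. The following is a sketch only: it needs a SLURM cluster to do anything useful, extract_job_id is a hypothetical helper rather than a SLURM command, and the sacct field list is one possible choice.

```shell
#!/bin/bash
# "sbatch --parsable" prints "jobid" or "jobid;cluster"; keep only the job ID.
# (extract_job_id is a hypothetical helper, not part of SLURM.)
extract_job_id() {
    printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    job_id=$(extract_job_id "$(sbatch --parsable runscript.sh)")
    squeue -j "$job_id"      # is the job pending or running?
    sacct -j "$job_id" --format=JobID,JobName,State,Elapsed,MaxRSS
    # scancel "$job_id"      # uncomment to kill the job
else
    echo "sbatch not found; run this on a SLURM login node" >&2
fi
```

Capturing the job ID at submission time avoids copying it by hand from the "Submitted batch job" message.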