BOL: Related items

4DGenome

Jitendra Narayan — Mon, 04 Jul 2016 00:44:55 -0500

Records in 4DGenome are compiled through comprehensive literature curation of experimentally-derived and computationally-predicted interactions. The current release contains 4,433,071 experimentally-derived and 3,605,176 computationally-predicted interactions in 5 organisms. Experimental data cover both high-throughput datasets and individual focused studies.

All interaction data are freely available in a standardized file format. Records can be queried by genomic regions, gene names, organism, and detection technology.

Address of the bookmark: http://4dgenome.research.chop.edu/

Linux command line exercises for NGS data processing

Jit — Wed, 22 Jun 2016 07:59:39 -0500

The purpose of this tutorial is to introduce students to the frequently used tools for NGS analysis and to give them experience in writing one-liners. Copy the required files to your current directory, change directory (cd) to the linuxTutorial folder, and do all the processing inside it:

[uzi@quince-srv2 ~/]$ cp -r /home/opt/MScBioinformatics/linuxTutorial .
[uzi@quince-srv2 ~/]$ cd linuxTutorial
[uzi@quince-srv2 ~/linuxTutorial]$

I have deliberately chosen Awk for the exercises because it is a language in itself and is used more often to manipulate NGS data than other command line tools such as grep, sed, and perl. Furthermore, a good command of awk will make it easier to understand advanced tutorials such as Illumina Amplicons Processing Workflow.

In Linux, we use a shell, which is a program that takes your commands from the keyboard and gives them to the operating system. Most Linux systems use the Bourne Again SHell (bash), but a typical Linux system offers several additional shell programs, such as ksh, tcsh, and zsh. To see which shell you are using, type

[uzi@quince-srv2 ~/linuxTutorial]$ echo $SHELL

/bin/bash

Address of the bookmark: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html

SAM flags

Poonam Mahapatra — Wed, 29 Jun 2016 15:38:15 -0500

Decoding SAM flags

This utility makes it easy to identify the properties of a read from its SAM flag value or, conversely, to find what the SAM flag value would be for a given combination of properties.

To decode a given SAM flag value, just enter the number in the field below. The encoded properties will be listed under Summary below, to the right.
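
The decoding described above amounts to testing individual bits of the flag. The following is a minimal Python sketch, not the code behind the Picard page: the bit values follow the SAM format specification, while the function names decode_flag and encode_flag and the wording of the property labels are chosen here for illustration.

```python
# Bit values as defined in the SAM format specification.
# The label strings are illustrative wording, not Picard's exact output.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the list of properties whose bit is set in the flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def encode_flag(properties):
    """Return the flag value for a combination of property labels."""
    lookup = {name: bit for bit, name in SAM_FLAGS}
    value = 0
    for name in properties:
        value |= lookup[name]
    return value


print(decode_flag(99))
# ['read paired', 'read mapped in proper pair', 'mate reverse strand', 'first in pair']
print(encode_flag(["read paired", "read unmapped"]))  # 5
```

For example, flag 99 is 64 + 32 + 2 + 1, so it describes the first read of a properly paired pair whose mate maps to the reverse strand.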

Address of the bookmark: https://broadinstitute.github.io/picard/explain-flags.html

Kaiju

Jit — Mon, 27 Jun 2016 11:23:04 -0500

Kaiju is a program for the taxonomic classification of metagenomic high-throughput sequencing reads. Each read is directly assigned to a taxon within the NCBI taxonomy by comparing it to a reference database containing microbial and viral protein sequences.

By default, Kaiju uses either the available complete genomes from NCBI RefSeq or the microbial subset of the non-redundant protein database nr used by NCBI BLAST, optionally also including fungi and microbial eukaryotes.

Kaiju translates reads into amino acid sequences, which are then searched in the database using a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform, which finds maximum exact matches (MEMs), optionally allowing mismatches in the protein alignment. The search can process up to millions of reads per minute using, for example, only 10 GB RAM with a protein database comprising 4821 microbial genomes. Kaiju can also be used for querying any other protein database without taxonomic classification, using either protein or nucleotide queries.
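
To illustrate the translation step only, here is a minimal Python sketch that translates a nucleotide read in all six reading frames using the standard genetic code. It is an assumption-laden illustration, not Kaiju's implementation: the function names are made up here, codons containing ambiguous bases are rendered as X, and the database search itself is not shown.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(read):
    """Return translations for offsets 0-2 on the forward, then reverse strand."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


print(six_frame_translation("ATGGCCTAA"))
# ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

The asterisk marks a stop codon; a classifier working at the protein level would typically split the translated frames at these positions before searching.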

Kaiju is described in Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).

Address of the bookmark: http://kaiju.binf.ku.dk/

Scarpa

Poonam Mahapatra — Wed, 13 Jul 2016 07:59:25 -0500

Scarpa is a stand-alone scaffolding tool for NGS data. It can be used together with virtually any genome assembler and any NGS read mapper that supports SAM format. Other features include support for multiple libraries and an option to estimate insert size distributions from data. Scarpa is available free of charge for academic and commercial use under the GNU General Public License (GPL).

See the user manual or the paper for more information about Scarpa. Supplementary material is also available.

Address of the bookmark: http://compbio.cs.toronto.edu/hapsembler/scarpa.html

Genome STRiP

Neel — Tue, 06 Sep 2016 03:58:19 -0500

Genome STRiP (Genome STRucture In Populations) is a suite of tools for discovering and genotyping structural variations using sequencing data. The methods are designed to detect shared variation using data from multiple individuals.

Genome STRiP looks both across and within a set of sequenced genomes to detect variation. The methods are adaptive and support heterogeneous data sets, including variations in sequencing depth, read lengths, and mixtures of paired and single-end reads. At least 20 to 30 genomes are required to get acceptable results; the method gains power across genomes, so processing more genomes provides better results.

To run discovery or genotyping on a single sequenced genome or a small set of genomes, you need to call your data against a background population, such as a set of genomes from the 1000 Genomes Project. The background population does not need to be matched to the target individuals.

Address of the bookmark: http://software.broadinstitute.org/software/genomestrip/

vcfR

Archana Malhotra — Fri, 19 Aug 2016 07:38:24 -0500

Most variant calling pipelines produce files containing large quantities of variant information. The variant call format (VCF) is an increasingly popular format for these data. The format of these files and their content is discussed in the vignette ‘vcf data.’ These files are typically intended to be post-processed (i.e., filtered) in an attempt to remove false positives or otherwise problematic sites. The R package vcfR provides tools to facilitate this filtering as well as to visualize the effects of choices made during this process.
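
As a rough illustration of what such post-processing means at the file level, the Python sketch below keeps header lines and drops records whose QUAL column falls below a threshold. It does not use or imitate the vcfR API; the function name filter_vcf_lines, the threshold of 30, and the example records are assumptions made for this example. The only fact relied on is that QUAL is the sixth tab-separated column of a VCF record.

```python
def filter_vcf_lines(lines, min_qual=30.0):
    """Keep header lines and records with QUAL >= min_qual.

    Records with a missing QUAL value (".") are dropped in this sketch.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.split("\t")
        qual = fields[5]  # columns: CHROM POS ID REF ALT QUAL FILTER INFO
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t10\tPASS\t.",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.",
]
for record in filter_vcf_lines(example):
    print(record)
```

Only the two header lines and the record at position 100 survive this filter. In practice the threshold would be chosen after inspecting the distribution of quality values, which is the kind of visualization the vignette covers.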

Address of the bookmark: https://cran.r-project.org/web/packages/vcfR/vignettes/visualization_1.html

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

Rahul Nayak — Tue, 05 Jun 2018 09:57:11 -0500

PERGA (Paired-End Reads Guided Assembler) is a novel read-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. Instead of using single-end reads to construct contigs, PERGA uses paired-end reads and a range of read overlap sizes, from O ≥ Omax down to Omin, to resolve gaps and branches. Moreover, by constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. PERGA tries to extend the contigs by all feasible nucleotides and uses a look-ahead technique to determine whether multiple extensions are due to sequencing errors or repeats. It also tries to separate the different repeats of nearby genomic regions to make the assembly longer and more accurate.

The simulated E. coli paired-end read data were generated using GemSim (KE McElroy, F Luciani, T Thomas. Gemsim: General, Error-Model Based Simulator of Next-Generation Sequencing Data. BMC Genomics 2012, 13:74), with coverage of 50x, 60x, and 100x and a read length of 100 bp, and can be downloaded from https://github.com/zhuxiao/data_PERGA.
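
The idea of trying overlap sizes from Omax down to Omin can be pictured with a toy greedy extension in Python. This is only a sketch under strong simplifications and is not PERGA's algorithm: it ignores paired-end information, the SVM decision model, and the look-ahead step, and it resolves branches by a simple majority vote. The names next_base and extend_contig, the max_len cap, and the reads are all hypothetical.

```python
from collections import Counter


def next_base(contig, reads, o_max, o_min):
    """Pick the next base using the largest overlap that has read support.

    o_min is assumed to be at least 1.
    """
    for overlap in range(min(o_max, len(contig)), o_min - 1, -1):
        suffix = contig[-overlap:]
        # Each read whose prefix matches the contig suffix votes for a base.
        votes = Counter(
            read[overlap]
            for read in reads
            if len(read) > overlap and read.startswith(suffix)
        )
        if votes:
            return votes.most_common(1)[0][0]
    return None  # no read supports any extension


def extend_contig(contig, reads, o_max, o_min, max_len=1000):
    """Greedily extend the contig one base at a time."""
    while len(contig) < max_len:  # cap guards against looping on repeats
        base = next_base(contig, reads, o_max, o_min)
        if base is None:
            break
        contig += base
    return contig


reads = ["GTTCA", "TTCAG", "CAGG"]
print(extend_contig("ACGTT", reads, o_max=3, o_min=2))  # ACGTTCAGG
```

In this run the third extension finds no read at overlap 3 and succeeds only after falling back to overlap 2, which is the situation a range of overlap sizes is meant to handle.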

Address of the bookmark: https://github.com/hitbio/PERGA

Useful Bioinformatics Tools

Poonam Mahapatra — Mon, 29 Aug 2016 04:08:12 -0500

A collection of a few handy tools for bioinformaticians

http://molbiol-tools.ca/Convert.htm

Address of the bookmark: http://molbiol-tools.ca/Convert.htm

Artemis Comparison Tool (ACT)

Shruti Paniwala — Wed, 07 Sep 2016 03:54:41 -0500

ACT is a Java application for displaying pairwise comparisons between two or more DNA sequences. It can be used to identify and analyse regions of similarity and difference between genomes and to explore conservation of synteny, in the context of the entire sequences and their annotation. It can read complete EMBL, GENBANK and GFF entries or sequences in FASTA or raw format.

Address of the bookmark: http://www.sanger.ac.uk/science/tools/artemis-comparison-tool-act