BOL: Related items

PLAST: A fast, accurate and NGS scalable bank-to-bank sequence similarity search tool

Jit — Fri, 01 Dec 2017 04:10:54 -0600

PLAST is a fast, accurate and NGS-scalable bank-to-bank sequence similarity search tool that significantly accelerates seed-based heuristic comparison methods, such as the BLAST suite of algorithms.

Relying on a unique software architecture, PLAST takes full advantage of recent multi-core personal computers without requiring any additional hardware devices.

PLAST stands for Parallel Local Sequence Alignment Search Tool and was published in BMC Bioinformatics.

PLAST is a general purpose sequence comparison tool providing the following benefits:

Designed for high-performance comparison of two sets of sequences (query vs. reference),
Reduces the processing time of sequence comparisons while providing the highest-quality results,
Contains a fully integrated data filtering engine capable of selecting relevant hits with user-defined criteria (E-value, identity, coverage, alignment length, etc.),
Does not require any additional hardware, since it is a software solution. It is easy to install, cost-effective, takes full advantage of multi-core processors and has a small RAM footprint,
Ready to be used on a desktop computer, a cluster or the cloud, as well as within a distributed system running Hadoop.


Address of the bookmark: https://plast.inria.fr/

A Brief Bioinformatics Tutorial

Jit — Wed, 21 May 2014 12:50:09 -0500

This is about how to use a computer to find what is known about a gene of interest and also how to get new insights about it.

The tutorial is divided into three main parts:

In the Sequence part, you will see how to look efficiently for a particular protein sequence, how to blast it against the database of your choice to find homologues, how to perform a multiple alignment of the homologues you've selected and how to edit this alignment.
The Structure part is about molecular visualization, homology modeling and structural domain prediction.
In the Function part, you will be introduced to 3 useful servers for investigating the function of a protein, i.e. finding interactors and co-expressed genes, viewing a phylogenetic profile, easily accessing papers citing your gene, etc.

Throughout all three parts, we will use the S. cerevisiae VPS36 protein as an example.

Address of the bookmark: http://www.mrc-lmb.cam.ac.uk/rlw/text/bioinfo_tuto/introduction.html

SMASH: An alignment-free tool to find and visualise rearrangements between pairs of DNA sequences

Jit — Thu, 21 Dec 2017 08:26:57 -0600

SMASH is a completely alignment-free method to find and visualise rearrangements between pairs of DNA sequences. The detection is based on relative compression, namely using a finite-context model (FCM), also known as a Markov model, of high context order (typically 20). The method is implemented in a tool (also called SMASH). For visualisation, SMASH outputs an SVG image with an ideogram-style layout, in which the patterns are represented with several HSV values (only the value component varies). The following image, illustrating the information maps between human and chimpanzee for the several chromosomes, depicts an example:

Address of the bookmark: https://github.com/pratas/smash

Bioinformatics JRF/SRF position at NII

Sun, 25 May 2014 16:54:04 -0500

NATIONAL INSTITUTE OF IMMUNOLOGY, NEW DELHI-110067

Applications are invited for the position of Senior Research Fellow for the following time-bound sponsored project as per the details given below:

1. BTIS project on, “Bioinformatics Center-National Infrastructural Facility in the Area of Immunology” funded by DBT

Senior Research Fellow (P) (One Position only)

Dr. Debasisa Mohanty
Staff Scientist-VI
deb@nii.res.in

Qualifications: M.Sc in Biological Sciences or Biotechnology with at least 04 years of Research experience in Bioinformatics or computational Biology after the master’s degree is essential.

Emoluments: The selected candidates will draw consolidated emoluments as per Institute Rules, depending upon qualifications & experience

Rs. 18,000/- per month consolidated plus 30% HRA if Leading to Ph.D/NET/GATE Qualified otherwise Rs. 14,000/- per month + 30% HRA.

Job description: The candidate should be well versed in programming in PERL/C++/HTML/CGI, web server and portal development, computational analysis of protein structure & function, molecular dynamics simulations and use of high performance computing systems.

GENERAL TERMS AND CONDITIONS:-

1. The candidates selected for the above posts will be on contract for one year or duration of the project whichever is shorter, at a time.
2. No hostel/ housing facility will be provided.
3. Number of posts may vary and shall be need based. Advertisement is no commitment.
4. Applicants may clearly mention the category they belong to i.e. SC/ST/OBC/PH and attach documentary proof of the same.
5. No TA/DA will be paid for attending the interview, if called for.
6. Apart from sending application in the prescribed format given below, candidates should send complete Curriculum Vitae along with the names of three referees. Curriculum Vitae should contain details of the experimental expertise.

HOW TO APPLY: Interested candidates may apply directly, STRICTLY IN THE PRESCRIBED FORMAT GIVEN BELOW, through e-mail, to the Investigator of the project, clearly indicating the name of the project along with their complete C.V., e-mail id, fax numbers, telephone numbers. Only shortlisted candidates will be called for interview, and they are required to submit attested copies of all their certificates and a Demand Draft of Rs 100/- drawn on Canara Bank or Indian Bank payable at Delhi/New Delhi in favour of the Director, NII (SC / ST and PH candidates are exempted subject to submission of documentary proof), at the time of interview.

LAST DATE OF RECEIPT OF APPLICATIONS: 06th June, 2014

www1.nii.res.in/sites/default/files/projectappointment-Dr.Mohanty-6June2014.pdf

Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data

Jit — Thu, 19 Apr 2018 08:06:03 -0500

Heap enables robustly sensitive and accurate calling of SNPs, particularly with low-coverage NGS data, which must be aligned to the reference genome sequences in advance. To reduce false-positive SNPs, Heap determines genotypes and calls SNPs at each site, except for sites at both ends of reads or sites containing a minor allele supported by only one read. Performance comparison with existing tools showed that Heap achieved the highest F-scores with low-coverage (7X) restriction-site associated DNA sequencing reads of sorghum and rice individuals. This will facilitate cost-effective GWAS and GP studies in this NGS era. Code and documentation of Heap are freely available from https://github.com/meiji-bioinf/heap and our web site (http://bioinf.mind.meiji.ac.jp/lab/en/tools.html).

Address of the bookmark: https://github.com/meiji-bioinf/heap

Bioinformatics JRF vacancy at ICGEB, New Delhi

Wed, 23 Jul 2014 16:07:15 -0500

Junior Research Fellow for a DBT sponsored project entitled "Computational and experimental characterization of stage specific arginine methylation in P. falciparum proteome".

Candidates should have a 1st class MSc/MTech/BTech degree in Bioinformatics. Please send complete CV, quoting Application for RMETH-JRF-2014, by email to Dr. Dinesh Gupta: dinesh@icgeb.res.in

Closing date for applications: 6 August 2014

More at http://www.icgeb.org/tl_files/Vacancies/JRF.pdf

BFC: a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data

Jit — Thu, 31 May 2018 09:35:23 -0500

BFC is a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though it also performs well for small genomes. The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al. (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in the author's fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked Bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). BFC is named after its use of a Bloom filter, though other correctors such as Lighter and Bless actually rely more on Bloom filters than BFC does.

Address of the bookmark: https://github.com/lh3/bfc

Linux Sort Commands for Bioinformatics

Rahul Nayak — Sat, 31 May 2014 15:41:16 -0500

Almost all scripting languages, such as Perl and Python, have a built-in sort, but none of them is as flexible as the sort command. When it comes to space efficiency, GNU sort stands at the top: it can sort a 20 GB file with less than 2 GB of memory. Implementing such a powerful sort yourself is not trivial.

sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt

sort a huge file (GNU sort ONLY):
sort -S 1500M -T $HOME/tmp input.txt > sorted.txt
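
The large-file invocation can be tried out on a toy input. The following is a minimal sketch, not a benchmark: the scratch directory, the three-line input and the 64M buffer size are all assumptions chosen for illustration. In GNU sort, -S sets the size of the in-memory buffer and capital -T names the directory for temporary files (lowercase -t sets the field separator instead).

```shell
#!/bin/sh
# Sketch only: the tiny input and the 64M buffer size are illustrative assumptions.
workdir=$(mktemp -d)          # scratch area for this demo
mkdir -p "$workdir/tmp"       # directory where sort may write temporary files

# Stand-in for a huge file: three unsorted, space-delimited records.
printf '%s\n' 'chr3 300' 'chr1 100' 'chr2 200' > "$workdir/input.txt"

# -S caps the in-memory buffer; -T names the directory for temporary spill files.
LC_ALL=C sort -S 64M -T "$workdir/tmp" "$workdir/input.txt" > "$workdir/sorted.txt"

sorted=$(cat "$workdir/sorted.txt")
printf '%s\n' "$sorted"

rm -rf "$workdir"             # remove the scratch directory
```

For a real multi-gigabyte file, the temporary directory should sit on a disk with enough free space to hold the spilled chunks.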

sort starting from the third column, skipping the first two columns:
sort -k3 input.txt
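
Skipping the first two columns can be written with the POSIX key option -k3, which starts the sort key at the third column. A small sketch with made-up rows (the data and variable names are assumptions for illustration):

```shell
#!/bin/sh
# Sketch with made-up rows: id, count, label (space-delimited).
rows='x 9 banana
y 1 apple
z 5 cherry'

# -k3 starts the key at column 3 and extends it to the end of the line,
# so the order is decided by the label rather than by columns 1 and 2.
by_third=$(printf '%s\n' "$rows" | LC_ALL=C sort -k3)

printf '%s\n' "$by_third"
```

Lines whose keys compare equal fall back to a whole-line comparison, so the first two columns only matter as a tie-breaker.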

sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt
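
To see the two keys at work, here is a small sketch on made-up rows (gene names, scores and labels are hypothetical). The scores 100, 20 and 5 are chosen because a plain string comparison would order them differently from a numeric one.

```shell
#!/bin/sh
# Sketch with made-up rows: name, score, label (space-delimited).
rows='geneB 5 chr2
geneA 20 chr3
geneC 20 chr1
geneD 100 chr9'

# Key 1: column 2 only, numeric (n), reversed (r) so the largest score comes first.
# Key 2: column 3 only, plain string comparison, used to break ties in key 1.
ranked=$(printf '%s\n' "$rows" | LC_ALL=C sort -k2,2nr -k3,3)

printf '%s\n' "$ranked"
```

Writing each key as START,END (for example 2,2) matters: a bare -k2 would extend the key to the end of the line and pull column 3 into the first comparison.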

sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
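
A character offset inside a column is easiest to check on IDs with a fixed prefix. The sketch below uses made-up run and library names. It also adds an explicit separator (-t ' '), which is an assumption of this example: with the default blank separator, GNU sort counts the blanks in front of a field as part of that field unless the b modifier is given, which can shift the offset.

```shell
#!/bin/sh
# Sketch with made-up sample IDs; the explicit "-t ' '" separator is an
# assumption added so that offsets are counted from the first letter of the field.
rows='runA lib10
runB lib9
runC lib2'

# -k2.4n: the key starts at the 4th character of column 2 and is compared as a number,
# so "lib10" is ranked by 10, "lib9" by 9 and "lib2" by 2.
by_lib=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ' ' -k2.4n)

printf '%s\n' "$by_lib"
```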

More Linux sort command information

If you have any sort commands you'd like to share, please add them to our comments section below. For more help, you can also type:

man sort

or

sort --help

on your Unix/Linux system.

HiGlass: a tool for exploring genomic contact matrices and tracks.

Jit — Mon, 11 Jun 2018 09:44:49 -0500

HiGlass is a tool for exploring genomic contact matrices and tracks. Please take a look at the examples and documentation for a description of the ways that it can be configured to explore and compare contact matrices. To load private data, HiGlass can be run locally within a Docker container. The Hi-C data in the examples is from Rao et al. (2014).

Address of the bookmark: http://higlass.io/

Next generation sequencing in R or bioconductor environment

John Parker — Mon, 02 Jun 2014 18:03:09 -0500

There are many R and Bioconductor packages for NGS data analysis; some of them are listed below.

Biostrings

The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed- and memory-efficient string containers, string matching algorithms, and other utilities for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. Documentation

IRanges Overview

IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods provide support for many more high-level packages like GenomicRanges, ShortRead, Rsamtools, etc. Documentation

GenomicRanges Overview

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. It is built upon the IRanges infrastructure and defines three major data containers - GRanges, GRangesList and GappedAlignments - which are supporting other important BioC-Seq packages including ShortRead, Rsamtools, rtracklayer, GenomicFeatures and BSgenome. Compared to the IRanges container, the GRanges/GRangesList classes are more flexible and extensible to store additional information about sequence ranges, such as chromosome identifiers (sequence space), strand information and annotation data. Documentation

Motif Discovery

cosmo

The cosmo package allows searching a set of unaligned DNA sequences for a shared motif that may function as a transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) in that it allows the search to be supervised by specifying a set of constraints that the motif to be discovered must satisfy. Documentation

BCRANK

BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. Documentation

rGADEM: Documentation

MotIV: Documentation

ShortRead

The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high-throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primarily focused on Solexa/Illumina reads. Documentation

Rsamtools

Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools. Documentation

Samtools Website
BWA (Burrows-Wheeler Alignment) Website

Additional tools for SNP analysis:

snpMatrix

BSgenome

BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. Documentation

rtracklayer

rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. Documentation

biomaRt

The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. This data is retrieved automatically via the Internet, so it's recommended that you cache the data locally, or check versions if your code will be adversely affected by updates to these data. Documentation

ChIP-Seq Analysis Packages

Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.

chipseq

The chipseq package combines a variety of HT-Seq packages into a pipeline for ChIP-Seq data analysis. Documentation

BayesPeak

BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. Documentation [ Publication ]

PICS

The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set. Documentation [ Publication ]

ChIPpeakAnno

The ChIPpeakAnno package provides batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages. Documentation

Additional ChIP-Seq Packages

DiffBind: Documentation

MOSAICS: Documentation

iSeq: Documentation

ChIPseqR: Documentation

ChIPsim: Documentation

CSAR: Documentation

ChIP-Seq Pipeline: PICS, rGADEM and MotIV (developer web site)

SPP: ChIP-seq processing pipeline

SPP Tutorial

MACS

SIPeS

RNA-Seq Analysis

Counting Reads that Overlap with Annotation Ranges

The GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual. Documentation

Differential Gene Expression Analysis with DESeq

The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns). Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above. Documentation

Differential Gene Expression Analysis with edgeR

The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.

Documentation

A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG):

easyRNASeq (simplifies read counting per genome feature)

DEXSeq (Inference of differential exon usage); parathyroidSE explains how to generate exon read counts in R

DEGseq

baySeq (also see: segmentSeq)

Genominator (Bullard et al. 2010)

Detection of Alternative Splice Junctions

Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:

ERANGE
TopHat

SpliceMap

SplitSeek

DNA-Methylation Data Analysis

methylPipe
bsseq
BiSeq
Much more under BiocViews

HT-Seq Data Visualization

ggbio: ggplot2 extension for genomics data (online manual)

Gviz: Plotting data and annotation information along genomic coordinates

HilbertVis: Hilbert genome plots

GenomeGraphs: Plotting genomic information from Ensembl

TileQC: Flow Cell Quality Visualization

rtracklayer: R interface to genome browsers

genoPlotR: Plotting maps of genes and genomes

Genominator: Tools for storing, accessing, analyzing and visualizing genomic data.

To install all packages

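# biocLite() was retired in Bioconductor 3.8; BiocManager is the current installer.
# Some older packages in this list may no longer be available in current Bioconductor releases.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))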