BOL: Related items

Bioinformatics JRF vacancy at ICGEB, New Delhi

Wed, 23 Jul 2014 16:07:15 -0500

Junior Research Fellow for a DBT sponsored project entitled "Computational and experimental characterization of stage specific arginine methylation in P. falciparum proteome".

Candidates should have a 1st class MSc/MTech/BTech degree in Bioinformatics. Please send complete CV, quoting Application for RMETH-JRF-2014, by email to Dr. Dinesh Gupta: dinesh@icgeb.res.in

Closing date for applications: 6 August 2014

More at http://www.icgeb.org/tl_files/Vacancies/JRF.pdf

Basics of BLAST Programs !

BioStar — Fri, 26 Jul 2024 06:04:26 -0500

The Basic Local Alignment Search Tool (BLAST) is a powerful bioinformatics program used to compare an input sequence (such as DNA, RNA, or protein sequences) against a database of sequences to find regions of similarity. Developed by the National Center for Biotechnology Information (NCBI), BLAST is widely used for identifying species, finding functional and evolutionary relationships between sequences, and predicting the function of novel sequences.

Key Features of BLAST:
1. Sequence Comparison: BLAST searches for local alignments between the query sequence and sequences in a database. It identifies regions of similarity, which can help infer functional and evolutionary relationships.

2. Speed and Efficiency: BLAST uses heuristic algorithms, making it faster than exhaustive search methods, suitable for large-scale database searches.

3. Versatility: There are several versions of BLAST for different types of sequence comparisons:
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastp: Compares a protein query sequence against a protein sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

4. Scoring and E-value: BLAST results are scored based on the quality and length of the alignments. The E-value (expect value) indicates the number of alignments one can expect to find by chance, with lower E-values representing more significant matches.

5. Output Formats: BLAST provides results in various formats, including plain text, HTML, XML, and JSON, making it adaptable for different types of analyses and integrations with other tools.
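
As a rough illustration of the E-value described in point 4, the sketch below uses the standard relation E = m * n * 2^(-S'), where m is the query length, n the total database length, and S' the bit score. It also filters tabular output (the format written by BLAST+ with -outfmt 6, in which column 11 holds the E-value and column 12 the bit score). The function names and the two sample hits are hypothetical and made up for illustration; real BLAST additionally applies effective-length corrections, so reported values will differ somewhat from this simplified arithmetic.

```shell
# E-value from a bit score: E = m * n * 2^(-S'), where m is the query length,
# n the total database length, and S' the bit score (simplified: no
# effective-length correction is applied here).
evalue_from_bits() {
    awk -v b="$1" -v m="$2" -v n="$3" 'BEGIN { printf "%.3g\n", m * n * 2^(-b) }'
}

# Keep only hits at or below an E-value cutoff in BLAST tabular output
# (-outfmt 6), where column 11 is the E-value.
filter_by_evalue() {
    awk -v max="$1" -F '\t' '($11 + 0) <= (max + 0)'
}

# Hypothetical hits, made up for illustration only.
hits=$(printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t1\t200\t2e-30\t250\nq1\ts2\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n')

# Bit score 10, query length 1024, database length 1.
evalue_from_bits 10 1024 1

# Print the subject IDs of hits with E-value <= 1e-5.
printf '%s\n' "$hits" | filter_by_evalue 1e-5 | cut -f 2
```

Each additional bit of score halves the expected number of chance alignments, which is why a cutoff such as 1e-5 is usually expressed on the E-value rather than on the raw score.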

Applications of BLAST:
- Genomic Research: Identifying genes, understanding genetic diversity, and mapping genome sequences.
- Protein Function Prediction: Inferring the function of unknown proteins by comparing them to known protein sequences.
- Evolutionary Studies: Exploring evolutionary relationships between organisms by comparing their genetic material.
- Medical Research: Identifying pathogens, understanding disease mechanisms, and developing treatments by comparing sequences of interest.

Overall, BLAST is an essential tool in bioinformatics, offering a reliable and efficient way to analyze and interpret biological sequence data.

Linux Sort Commands for Bioinformatics

Rahul Nayak — Sat, 31 May 2014 15:41:16 -0500

Almost all scripting languages, such as Perl and Python, have a built-in sort, but none of them is as flexible as the sort command. When it comes to space efficiency, GNU sort stands at the top: it can sort a 20 GB file with less than 2 GB of memory. Implementing a sort this powerful yourself is not trivial.

sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt

sort a huge file (GNU sort ONLY):
sort -S 1500M -T $HOME/tmp input.txt > sorted.txt

sort starting from the third column, skipping the first two columns:
sort -k3 input.txt

sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt

sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
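
To make the multi-key form concrete, here is a small sketch that runs the numeric-descending, then string-ascending sort on a tiny table. The file contents (gene names, counts, chromosomes) are hypothetical. Setting LC_ALL=C is an addition not mentioned above: it forces plain byte-order comparison so the result does not depend on the system locale.

```shell
# Hypothetical space-delimited table: gene name, read count, chromosome.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
geneA 10 chr2
geneB 200 chr1
geneC 10 chr1
geneD 35 chr3
EOF

# Column 2 as numbers, descending; ties broken by column 3 as strings, ascending.
# LC_ALL=C gives byte-order comparison that is independent of the locale.
sorted=$(LC_ALL=C sort -k2,2nr -k3,3 "$tmpfile")
printf '%s\n' "$sorted"

rm -f "$tmpfile"
```

The two rows with a count of 10 are ordered by chromosome, so geneC (chr1) comes before geneA (chr2).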

More Linux sort command information

If you have any sort commands you'd like to share, please add them to our comments section below. For more help, you can also type:

man sort

or

sort --help

on your Unix/Linux system.

RepeatMasker compatible blast tool

Neel — Fri, 07 Dec 2018 08:13:03 -0600

RMBlast is a RepeatMasker compatible version of the standard NCBI blastn program. The primary difference between this distribution and the NCBI distribution is the addition of a new program "rmblastn" for use with RepeatMasker and RepeatModeler.

RMBlast supports RepeatMasker searches by adding a few necessary features to the stock NCBI blastn program. These include:

Support for custom matrices (without KA-Statistics).
Support for cross_match-like complexity adjusted scoring. Cross_match is Phil Green's seeded Smith-Waterman search algorithm.
Support for cross_match-like masklevel filtering.

https://anaconda.org/bioconda/rmblast

Address of the bookmark: http://www.repeatmasker.org/RMBlast.html

Next generation sequencing in R or bioconductor environment

John Parker — Mon, 02 Jun 2014 18:03:09 -0500

There are many R and Bioconductor packages for NGS data analysis; some of them are as follows:

Biostrings

The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed and memory effective string containers, string matching algorithms, and other utilities, for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. Documentation

IRanges Overview

IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods provide support for many more high-level packages like GenomicRanges, ShortRead, Rsamtools, etc. Documentation

GenomicRanges Overview

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. It is built upon the IRanges infrastructure and defines three major data containers - GRanges, GRangesList and GappedAlignments - which are supporting other important BioC-Seq packages including ShortRead, Rsamtools, rtracklayer, GenomicFeatures and BSgenome. Compared to the IRanges container, the GRanges/GRangesList classes are more flexible and extensible to store additional information about sequence ranges, such as chromosome identifiers (sequence space), strand information and annotation data. Documentation

Motif Discovery

cosmo

The cosmo package searches a set of unaligned DNA sequences for a shared motif that may function as a transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) in that it allows the search to be supervised by specifying a set of constraints that the motif to be discovered must satisfy. Documentation

BCRANK

BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. Documentation

rGADEM: Documentation

MotIV: Documentation

ShortRead

The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primarily focused on Solexa/Illumina reads. Documentation

Rsamtools

Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools. Documentation

Samtools Website
BWA (Burrows-Wheeler Alignment) Website

Additional tools for SNP analysis:

snpMatrix

BSgenome

BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. Documentation

rtracklayer

rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. Documentation

biomaRt

The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. This data is retrieved automatically via the Internet, so it's recommended that you cache the data locally, or check versions if your code will be adversely affected by updates to these data. Documentation

ChIP-Seq Analysis Packages

Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.

chipseq

The chipseq package combines a variety of HT-Seq packages to a pipeline for ChIP-Seq data analysis. Documentation

BayesPeak

BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. Documentation [ Publication ]

PICS

The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set. Documentation [ Publication ]

ChIPpeakAnno

The ChIPpeakAnno package provides batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages. Documentation

Additional ChIP-Seq Packages

DiffBind: Documentation

MOSAICS: Documentation

iSeq: Documentation

ChIPseqR: Documentation

ChiPsim: Documentation

CSAR: Documentation

ChIP-Seq Pipeline: PICS, rGADEM and MotIV (developer web site)

SPP: ChIP-seq processing pipeline

SPP Tutorial

MACS

SIPeS

RNA-Seq Analysis

Counting Reads that Overlap with Annotation Ranges

The GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual. Documentation

Differential Gene Expression Analysis with DESeq

The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns). Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above. Documentation

Differential Gene Expression Analysis with edgeR

The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.

Documentation

A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG):

easyRNASeq (simplifies read counting per genome feature)

DEXSeq (Inference of differential exon usage); parathyroidSE explains how to generate exon read counts in R

DEGseq

baySeq (also see: segmentSeq)

Genominator (Bullard et al. 2010)

Detection of Alternative Splice Junctions

Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:

ERANGE
TopHat

SpliceMap

SplitSeek

DNA-Methylation Data Analysis

methylPipe
bsseq
BiSeq
Much more under BiocViews

HT-Seq Data Visualization

ggbio: ggplot2 extension for genomics data (online manual)

Gviz: Plotting data and annotation information along genomic coordinates

HilbertVis: Hilbert genome plots

GenomeGraphs: Plotting genomic information from Ensembl

TileQC: Flow Cell Quality Visualization

rtracklayer: R interface to genome browsers

genoPlotR: Plotting maps of genes and genomes

Genominator: Tools for storing, accessing, analyzing and visualizing genomic data.

To install all packages

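# biocLite() was retired with Bioconductor 3.8; BiocManager is its replacement.
# Some packages in this 2014 list may no longer be available in current Bioconductor releases.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))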

Visualise blast results !

Abhi — Tue, 11 Oct 2022 03:15:10 -0500

Kablammo helps you create interactive visualizations of BLAST results from your web browser. Find your most interesting alignments, list detailed parameters for each, and export a publication-ready vector image, all without installing any software.

Address of the bookmark: https://kablammo.wasmuthlab.org/

Postdoc position at Centre Méditerranéen de Médecine Moléculaire - Nice - France

Wed, 04 Jun 2014 07:20:57 -0500

The research group of Dr. Michele Trabucchi at the Centre Méditerranéen de Médecine Moléculaire (C3M) at INSERM U1065 (University of Nice Sophia-Antipolis, France) is seeking candidates for a postdoctoral fellow position starting in October 2014, funded for 3 years by FRM (Fondation pour la Recherche Médicale).
The broad interest of the lab is in understanding the expression control and function of small RNAs in activated myeloid cells (visit our webpage to check research interests and publications of the group : http://www.unice.fr/c3m/EN/Equipe10.html ).

The work will focus on the functional studies of small RNAs by using next-generation sequencing approaches.

Candidates should hold a Ph.D. degree and have strong background in bioinformatics.
The University of Nice Sophia-Antipolis provides a wide range of facilities and training essential for biomedical research.

Interested applicants should send a PDF with a cover letter stating research interests and qualifications, an updated CV, a summary of previous research experience and contact information for two references to Michele Trabucchi ( mtrabucchi@unice.fr )

Homepage: http://www.unice.fr/c3m/EN/Equipe10.html

KOALA: KEGG's internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation

Abhimanyu Singh — Wed, 12 Dec 2018 09:16:55 -0600

KOALA (KEGG Orthology And Links Annotation) is KEGG's internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation. BlastKOALA and GhostKOALA assign K numbers to the user's sequence data by BLAST and GHOSTX searches, respectively, against a nonredundant set of KEGG GENES. Annotate Sequence in KEGG Mapper and Pathogen Checker in KEGG Pathogen are special interfaces to the BlastKOALA server and can be executed in an interactive mode. See Step-by-step Instructions.

Reference: Kanehisa, M., Sato, Y., and Morishima, K. (2016) BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 428, 726-731. [pubmed] [pdf]

Address of the bookmark: https://www.kegg.jp/blastkoala/

Ten recommendations for creating usable bioinformatics command line software

RAJESH DETROJA — Sun, 08 Jun 2014 10:06:26 -0500

Bioinformatics software varies greatly in quality. In terms of usability, the command line interface is the first experience a user will have of a tool. Unfortunately, this is often also the last time a tool will be used. Here I present ten recommendations for command line software authors to follow, which I believe would greatly improve the uptake and usability of their products, waste less of users' time, and improve the quality of scientific analyses.

Address of the bookmark: http://www.gigasciencejournal.com/content/2/1/15?utm_content=buffer25ee0&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

Installing Perl GD Module

Jit — Mon, 22 Jul 2013 14:02:01 -0500

In comparative genome analysis, we usually compare more than two genomes and look for syntenic regions among them. In my research I used Evolution Highway (EH) http://eh-demo.ncsa.uiuc.edu/, which is a collaborative project designed to provide a visual means for simultaneously comparing genomes of multiple amniote species. The tool removes the burden of manually aligning these maps and allows cognitive skills to be used toward something more valuable than preparation and transformation of data. In addition to EH, the visually attractive Circos (http://circos.ca/) is also very popular for this kind of analysis.

EH is available online and is easy to access and use, whereas installing Circos is not entirely straightforward. One of the most difficult parts of the installation is installing the GD library. Since I couldn't find good instructions for installing this library on the internet, I decided to post instructions here in case they are useful to anyone else.

The following steps install the GD module on Mac OS:

1. Setup

Create a folder for the files:

$ mkdir -p /SourceCache
$ cd /SourceCache

Get and unpack the required Jpeg-6b and GD libraries:
Download Jpeg-6b (http://code.google.com/p/google-desktop-for-linux-mirror/downloads/detail?name=jpeg-6b.tar.gz&can=2&q)
Download GD (http://search.cpan.org/~lds/GD-2.46/)

Place the "tar.gz" files in "/SourceCache" and double click to unpack.

2. Install libjpeg

Copy the "config.sub" and "config.guess" files into the libjpeg source directory. Note that your "config.sub" and "config.guess" files may be in a slightly different location. The commands below show where they were on my machine:

$ cd /SourceCache/jpeg-6b/src
$ cp /usr/share/libtool/config/config.sub .
$ cp /usr/share/libtool/config/config.guess .

Configure libjpeg as follows. Note that this was installed on a 64 bit machine. However, this method may configure it in a 32 bit format. This may not be the best way to configure the installation but it works.

$ ./configure --enable-shared
$ make

Check to see if the following directories exist on your machine. Create the missing directories in the following manner:

$ mkdir -p /usr/local/include
$ mkdir -p /usr/local/bin
$ mkdir -p /usr/local/lib
$ mkdir -p /usr/local/man/man1

Finish making and installing libjpeg:

$ make install

3. Install GD

$ cd /SourceCache/GD-2.46/GD/
$ perl Makefile.PL
$ make
$ make test (optional)
$ make html (optional)
$ make install

Another way for Mac OS
The easiest way to get a lot of these is with a program called Fink, which is similar in nature to the CPAN installer, but installs common GNU utilities. Fink is available from <http://sourceforge.net/projects/fink/>.

Follow the instructions for setting up Fink. Once it's installed, you'll want to run the following as root: fink install gd

It will prompt you for a number of dependencies, type 'y' and hit enter to install all of the dependencies. Then watch it work.

To prevent creating conflicts with the software that Apple installs by default, Fink creates its own directory tree at /sw where it installs most of the software that it installs. This means your libraries and headers for libgd will be at /sw/lib and /sw/include instead of /usr/lib and /usr/local/include. Because of these changed locations for the libraries, the Perl GD module will not install directly via CPAN, because it looks for the specific paths instead of getting them from your environment. But there's a way around that :-)

Instead of typing "install GD" at the cpan> prompt, type look GD. This should go through the motions of downloading the latest version of the GD module, then it will open a shell and drop you into the build directory. Apply below patch to the Makefile.PL file (save the patch into a file and use the command patch < patchfile.)

Then, run these commands to finish the installation of the GD module:

perl Makefile.PL
make
make test
make install
And don't forget to run exit to get back to CPAN.

Install on MS Windows, using PPM

C:\Documents and Settings\Owner>ppm
PPM interactive shell (2.2.0) - type 'help' for available commands.
PPM> install GD
Install package 'GD?' (y/N): y
Installing package 'GD'...
Downloading http://ppm.ActiveState.com/PPMPackages/5.6plus/MSW. ...
Installing C:\Perl\site\lib\auto\GD\GD.bs
Installing C:\Perl\site\lib\auto\GD\GD.dll
Installing C:\Perl\site\lib\auto\GD\GD.exp
Installing C:\Perl\site\lib\auto\GD\GD.lib
Installing C:\Perl\html\site\lib\GD.html
Installing C:\Perl\site\lib\GD.pm
Installing C:\Perl\site\lib\qd.pl
Installing C:\Perl\site\lib\auto\GD\autosplit.ix
PPM>

If you can't install it from PPM, you can download it from the address below, where all Perl 5.6.1 modules are located:

http://ppm.ActiveState.com/PPMPackages/5.6plus/MSW.

Install the Perl GD Module on Linux

$ sudo perl -MCPAN -e shell

Since it was the first time I had run this command on this particular machine I had to answer a lot of questions but simply selected the defaults for everything as this usually works for me. Once in the CPAN shell I entered

cpan> install Bundle::CPAN

and selected all of the defaults again. Once the CPAN bundle had finished installing I tried to install GD::Graph by typing

cpan> install GD::Graph

but it failed with hundreds of errors – the first of which was

GD.xs:7:16: error: gd.h: No such file or directory

This was fixed with the following apt-get command (in the bash shell)

$ sudo apt-get install libgd2-xpm-dev

Back in the CPAN shell I still couldn't get GD::Graph to build, and I guessed this was because of some files left over from the failed build. I don't know the command to clean things up inside the CPAN shell and am too lazy to read the docs, so I simply went into the .cpan/build directory in my home directory and deleted anything that started with GD, e.g.

$ rm -rf GD-2.35-HC_vkB

$ rm -rf GDGraph-1.44-Evfibe

and so on. Those strings at the end (vkB and so on) look random, so they might be different on your machine. Then I went back into the CPAN shell and ran

cpan> install GD::Graph

There were a few dependencies which the script fetched and installed for me but everything worked smoothly.

Manual installation and other ways of installing Perl modules are covered in my previous blog post at http://bioinformaticsonline.com/blog/view/710/how-to-install-perl-modules-manually-using-cpan-command-and-other-quick-ways