BOL: Poonam Mahapatra's pages

Tools for Protein-Protein Docking !

Poonam Mahapatra — Wed, 25 Apr 2018 05:15:53 -0500

Predicting the structure of protein–protein complexes by docking is a difficult problem. The major challenges include identifying correct solutions and properly handling molecular flexibility and conformational changes. The following tools predict the structure of protein–protein complexes:

3D-Dock Suite

Global rigid search: FFT

Shape complementarity and electrostatics

Re-scoring and clustering. Refinement of interface side-chains

3D-Garden

Global rigid search in ensemble

Shape complementarity and Lennard–Jones potential

Side chain and backbone dihedral refinement

DOT

Global rigid search: FFT

Shape complementarity, electrostatics and VDW

Additional features: none

Escher NG

Global rigid search

Shape complementarity, hydrogen bonds and electrostatics

Integrated in VEGA

GRAMM

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Clustering of conformations

GRAMM-X

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Minimization and re-scoring with multiple filters

HEX

Global rigid search: Fourier correlation of spherical harmonics

Shape complementarity

HADDOCK

Global rigid search

Electrostatic, VDW and desolvation energy terms

MD simulated annealing refinement. Filtering based on external data.

ICM

Global rigid search: Monte Carlo

Empirical scoring function

Clustering and selection of conformations. Refinement of interface side-chains and re-scoring

MolFit

Global rigid search: FFT

Shape complementarity

Clustering of good solutions, filtering using a priori information and small, local rigid rotations around selected conformations

PatchDock

Global rigid search

Shape complementarity and atomic desolvation energy

Clustering of conformations

PyDock

Global rigid search: FFT

Shape complementarity

Re-scoring by binding electrostatics and desolvation energy

RosettaDock

Local rigid search: Monte Carlo with low and high resolution structure representation levels

Different scoring parameters for the different resolutions

ZDOCK

Global rigid search: FFT

Shape complementarity, desolvation energy, and electrostatics

Energy minimization and re-scoring

Free for academics

Point to note:

The proper treatment of flexibility in protein–protein docking is still an active field of research. You should first analyze your proteins to define their conformational space, and then choose the most suitable method for your docking problem.

Ligand Docking Tools and Software !

Poonam Mahapatra — Wed, 25 Apr 2018 05:05:17 -0500

Ligand docking refers to cases where a small molecule (the "ligand") is docked into a much larger macromolecule (the "target"). The following is a partial list of docking software, focusing on tools that are free (at least for academic institutions) and/or popular.

AutoDock

Stochastic (GA)

Flexible ligand and partially flexible target

ArgusLab

Systematic

Flexible ligand

X-Score based

DOCK

Systematic (IC)

Flexible ligand

DOCK 3.5 (force field)

eHITS

Systematic (RBD of fragments followed by reconstruction)

Flexible ligand and partially flexible target

HiTS_Score (empirical)

FlexX

Systematic (IC)

Flexible ligand

FlexX SF (empirical)

Commercial

FLIPDock

Stochastic (GA)

Flexible ligand and flexible target

AUTODOCK (empirical)

FRED

Systematic (RBD)

Flexible ligand

ChemScore, PLP, ScreenScore, ChemGauss (empirical/consensus)

GOLD

Stochastic (GA)

Flexible ligand and partially flexible target

GoldScore, ChemScore (empirical), ASP (knowledge based)

ICM

Stochastic (MC)

Flexible ligand and partially flexible target

ICM SF (empirical)

ParDOCK

Stochastic (MC)

Rigid

BAPPL (empirical)

PLANTS

Stochastic (ACO)

Flexible ligand and partially flexible target

CHEMPLP, PLP (empirical)

Surflex

Systematic (IC/MA)

Flexible ligand

Hammerhead based (empirical)

Point to note:

Several studies have shown that the performance of most docking tools depends strongly on the particular characteristics of both the binding site and the ligand, and it is difficult to determine in advance which method is most suitable in a specific context. We encourage you to try several docking methods to determine which one(s) work best for your system.

Protein-Protein Interaction Sites Predictions !

Poonam Mahapatra — Wed, 25 Apr 2018 04:53:20 -0500

The study of Protein–Protein Interactions (PPIs) has a crucial role in biology, medicine and the pharmaceutical industry. PPIs can be investigated from two aspects: The interaction partners of a specific protein and the amino acid residues participating in a given PPI. Information about a protein’s interaction partners allows scientists to construct protein interaction networks, such as signaling pathways, which in turn facilitate the understanding of many biological and clinical observations.

The following tools are commonly used for PPI prediction:

Protein-Protein Interaction Sites

PPISP

A consensus neural network method for predicting protein-protein interaction sites

HOMCOS

A server to predict interacting protein pairs and interacting sites by homology modeling of complex structures

HotPOINT

Prediction of protein interfaces using an empirical model

ISIS

Prediction of interaction hotspots from sequence

KFC server

Automated decision-tree approach to predicting protein-protein interaction hot spots

meta-PPISP

A meta server for predicting protein-protein interaction sites. meta-PPISP is built on three individual web servers: cons-PPISP, PINUP, and Promate

ODA

Identification of optimal surface patches with the lowest docking desolvation energy values

PINUP

Protein binding site prediction with an empirical scoring function

Other Sites (DNA, RNA, Metals)

CHED

Web server for predicting soft metal binding sites in proteins

DBD-Hunter

A knowledge-based method for the prediction of DNA-protein interactions

DISPLAR

Given the structure of a protein known to bind DNA, the method predicts the residues that contact DNA using a neural network.

iDBPs

Predicts DNA binding proteins for proteins with known 3D structure.

PFplus

A tool for extracting and displaying positive electrostatic patches on protein surfaces which can be indicative of nucleic acid binding interfaces.

Binding Site Prediction in Protein !

Poonam Mahapatra — Wed, 25 Apr 2018 04:35:57 -0500

The interaction between proteins and other molecules is fundamental to all biological functions. This section lists tools that can assist in predicting interaction sites on the protein surface, and tools for predicting the structure of the intermolecular complex formed between two or more molecules (docking).

Pockets Identification

CASTp

Automatic identification of pockets and cavities in protein structures, and quantitation of their volumes using Delaunay triangulation. Also available as a PyMOL plugin.

Pocket-Finder

Automatic identification of pockets and cavities in protein structures, and quantitation of their volumes.

PocketPicker

Grid-based technique for the analysis of protein pockets. PocketPicker is available as a plugin for PyMOL.

Binding Site Prediction

ConSurf

Identification of functional regions in proteins by surface-mapping of phylogenetic information

CRESCENDO

Identification of protein interaction sites. It uses sequence conservation patterns in homologous proteins to distinguish residues that are conserved due to structural restraints from those conserved due to functional restraints.

Ligand Binding Sites

3DLigandSite

The server utilizes protein-structure prediction to provide structural models of the binding site. Ligands bound to structures are superimposed onto the model and used to predict the binding site.

FINDSITE

A threading-based method for ligand-binding site prediction and functional annotation based on binding-site similarity across superimposed groups of threading templates.

LIGSITE^csc

Prediction of binding site by pocket identification using the Connolly surface and degree of conservation

metaPocket

A meta server for ligand-binding site prediction. metaPocket uses LIGSITE^csc, PASS, Q-SiteFinder and SURFNET.

Awk for Bioinformaticians and Computational Biologists

Poonam Mahapatra — Tue, 06 Feb 2018 14:54:35 -0600

Awk is a programming language that allows easy manipulation of structured data and is mostly used for pattern scanning and processing. It searches one or more files for lines that match the specified patterns and then performs the associated actions. The basic syntax is:

awk '/pattern1/ {Actions}
/pattern2/ {Actions}' file

Awk works as follows:
Awk reads the input files one line at a time.
For each line, it tests the patterns in the given order and, for each pattern that matches, performs the corresponding action.
If no pattern matches, no action is performed.
In the above syntax, either the search pattern or the action is optional, but not both.
If the search pattern is not given, Awk performs the given actions for each line of the input.
If the action is not given, Awk prints every line that matches the given pattern, which is the default action.
Empty braces without any action do nothing; they do not perform the default printing operation.
Statements within Actions should be separated by semicolons.
Say you have data/test.tsv with the following contents:
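
For readers more at home in Python, the pattern/action loop described above can be sketched roughly as follows. This is only an illustration of the control flow, not how Awk is implemented: run_rules is a hypothetical helper, and Python's re syntax is assumed to be close enough to Awk's regular expressions for simple patterns.

```python
import re

def run_rules(lines, rules):
    """Apply (pattern, action) rules to every line, in order, Awk-style.

    pattern: a regex string, or None to match every line.
    action:  a function taking the line and returning output text,
             or None for the default action (emit the line unchanged).
    """
    out = []
    for line in lines:                      # one line at a time
        for pattern, action in rules:       # rules are tried in the given order
            if pattern is None or re.search(pattern, line):
                out.append(line if action is None else action(line))
    return out

lines = ["contig1 ACTG", "contig2 ACTT", "contig3 ACTA"]

# Pattern without action: matching lines are emitted unchanged.
matched = run_rules(lines, [(r"contig3", None)])

# Action without pattern: the action runs on every line.
names = run_rules(lines, [(None, lambda l: l.split()[0].upper())])
```

Here matched holds only the contig3 line, while names holds the upper-cased first field of every line.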

$ cat data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
By default Awk prints every line from the file.

$ awk '{print;}' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
We can print only the line that matches the pattern contig3:

$ awk '/contig3/' data/test.tsv
contig3 ACTTATATATATATA
Awk has a number of built-in variables. For each record (i.e. line), it splits the record on whitespace by default and stores the fields in the $n variables. If the line has 5 words, they are stored in $1, $2, $3, $4 and $5. $0 represents the whole line. NF is a built-in variable that holds the total number of fields in a record, so $NF refers to the last field.

$ awk '{print $1","$2;}' data/test.tsv
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT

$ awk '{print $1","$NF;}' data/test.tsv
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT
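
As a rough Python analogue of the field splitting above (a sketch only; awk_fields is a hypothetical helper, and str.split() is assumed to mimic Awk's default whitespace splitting for simple lines like these):

```python
def awk_fields(record):
    """Split a record on runs of whitespace, as Awk does by default."""
    fields = record.split()
    return {"$0": record, "NF": len(fields), "fields": fields}

rec = awk_fields("contig2 ACTTTATATATT")

# Awk fields are 1-based, Python lists are 0-based:
first = rec["fields"][0]              # $1
last = rec["fields"][rec["NF"] - 1]   # $NF
joined = first + "," + last
print(joined)
```

Because each line of the example file has exactly two fields, $2 and $NF select the same column, which is why the two Awk commands above give identical output.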

Awk has two special patterns, specified by the keywords BEGIN and END. The syntax is as follows:

BEGIN { Actions before reading the file}
{ Actions for every line in the file }
END { Actions after reading the file }

For example,
$ awk 'BEGIN{print "Header,Sequence"}{print $1","$2;}END{print "-------"}' data/test.tsv
Header,Sequence
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT
-------
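
The same BEGIN / main / END structure, written out as a plain Python function for comparison. This is a sketch; to_csv is a hypothetical name, and it assumes every line has at least two whitespace-separated fields.

```python
def to_csv(lines):
    out = ["Header,Sequence"]                  # BEGIN: runs once, before any input
    for line in lines:                         # main block: runs once per line
        fields = line.split()
        out.append(fields[0] + "," + fields[1])
    out.append("-------")                      # END: runs once, after all input
    return out

result = to_csv(["contig2 ACTTTATATATT", "contig3 ACTTATATATATATA"])
print("\n".join(result))
```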
We can also use the conditional operator in a print statement, in the form print CONDITION ? PRINT_IF_TRUE_TEXT : PRINT_IF_FALSE_TEXT. For example, the code below flags sequences with lengths > 14:

$ awk '{print (length($2)>14) ? $0">14" : $0"<=14";}' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>14
contig2 ACTTTATATATT<=14
contig3 ACTTATATATATATA>14
contig4 ACTTATATATATATA>14
contig5 ACTTTATATATT<=14
We can also put 1 after the last block {} to print every line: 1 is a pattern that is always true and has no action, so the default action {print $0} applies (equivalently {print}, since print without arguments prints $0). An earlier block can modify $0 before it is printed. For example, to assign the first field to $0 for the third line (NR==3), we can use:

$ awk 'NR==3{$0=$1}1' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
You can have as many blocks as you want, and they are executed on each line in the order they appear. For example, to print $1 three times (here we use printf instead of print, as the former does not append an end-of-line character; passing an explicit "%s\t" format keeps any % in the data from being read as a format specifier):

$ awk '{printf "%s\t", $1}{printf "%s\t", $1}{print $1}' data/test.tsv
contig1 contig1 contig1
contig2 contig2 contig2
contig3 contig3 contig3
contig4 contig4 contig4
contig5 contig5 contig5
However, we can skip the later blocks for a given line by using the next keyword:

$ awk '{printf "%s\t", $1}NR==3{print "";next}{print $1}' data/test.tsv
contig1 contig1
contig2 contig2
contig3
contig4 contig4
contig5 contig5

$ awk 'NR==3{print "";next}{printf "%s\t", $1}{print $1}' data/test.tsv
contig1 contig1
contig2 contig2

contig4 contig4
contig5 contig5
You can also use getline to load the contents of another file in addition to the one you are reading. For example, in the statement below, the while loop in the BEGIN block reads each line of test.tsv into k until no lines remain, and only then does the main block process the file as usual:

$ awk 'BEGIN{while((getline k <"data/test.tsv")>0) print "BEGIN:"k}{print}' data/test.tsv
BEGIN:contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
BEGIN:contig2 ACTTTATATATT
BEGIN:contig3 ACTTATATATATATA
BEGIN:contig4 ACTTATATATATATA
BEGIN:contig5 ACTTTATATATT
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
You can also store data in memory in an associative array with the syntax VARIABLE_NAME[KEY]=VALUE, and later iterate over it with for (INDEX in VARIABLE_NAME). Note that Awk does not guarantee the order in which the keys are visited:

$ awk '{i[$1]=1}END{for (j in i) print j"<="i[j]}' data/test.tsv
contig1<=1
contig2<=1
contig3<=1
contig4<=1
contig5<=1
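
Awk's associative arrays behave much like Python dictionaries. As an illustrative sketch (group_by_sequence is a hypothetical helper, not part of any tool mentioned here), the same idea can be used to find contigs that share an identical sequence in the example data:

```python
from collections import defaultdict

def group_by_sequence(lines):
    """Map each sequence to the list of contig names that carry it."""
    groups = defaultdict(list)
    for line in lines:
        name, seq = line.split()
        groups[seq].append(name)    # comparable to an Awk array keyed by $2
    return dict(groups)

data = [
    "contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG",
    "contig2 ACTTTATATATT",
    "contig3 ACTTATATATATATA",
    "contig4 ACTTATATATATATA",
    "contig5 ACTTTATATATT",
]
groups = group_by_sequence(data)

# Keep only sequences that occur in more than one contig.
duplicates = {seq: names for seq, names in groups.items() if len(names) > 1}
```

With the example file, duplicates pairs contig2 with contig5 and contig3 with contig4.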

BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Poonam Mahapatra — Wed, 03 Jan 2018 00:25:27 -0600

BBSplit internally uses BBMap to map reads to multiple genomes at once and determine which genome they match best. This differs from ordinary mapping. If a genome (say, human) contains an exact repeat somewhere, reads mapping to it will be mapped ambiguously. But if you want to determine whether reads are mouse or human, it does not matter whether they map ambiguously within human, only whether they are ambiguous between human and mouse. BBSplit tracks this additional ambiguity information and decides how to use it based on the "ambig2" flag. BBSplit is normally used like Seal: either to quantify how many reads go to each reference, or to split the reads into multiple output files, one per reference. BBSplit can only be run using references indexed with BBSplit, as they contain additional information regarding which sequences came from which reference file.

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.
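
The binning decision described above can be pictured with a toy Python sketch. This is not BBSplit's code or its option syntax: bin_read, the score dictionary, and the mode labels "all", "none", "one" and "ambiguous" are hypothetical stand-ins for the four behaviours listed in the paragraph, and the scores are made-up numbers.

```python
def bin_read(scores, ambig_mode="one"):
    """Return the list of bins a read would be written to.

    scores:     dict of reference name -> mapping score (higher is better);
                an empty dict means the read did not map anywhere.
    ambig_mode: hypothetical label for how ties between references are handled.
    """
    if not scores:
        return ["unmapped"]
    best = max(scores.values())
    top = sorted(ref for ref, s in scores.items() if s == best)
    if len(top) == 1:                 # unambiguous between references
        return top
    if ambig_mode == "all":           # binned with all of them
        return top
    if ambig_mode == "none":          # binned with none of them
        return []
    if ambig_mode == "one":           # binned with one of them
        return top[:1]
    if ambig_mode == "ambiguous":     # special "ambiguous" bin per reference
        return ["AMBIGUOUS_" + ref for ref in top]
    raise ValueError("unknown ambig_mode: " + ambig_mode)

clear = bin_read({"ecoli": 90, "salmonella": 70})
tied = bin_read({"ecoli": 90, "salmonella": 90}, ambig_mode="ambiguous")
```

Note that only ties between references count as ambiguous here; a read that maps to several places within one reference would still go to that reference's bin.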

For example, if you had a library that was contaminated with E. coli and Salmonella, you could do this:

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean.fq (unmapped reads)

In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq

BBSplit is available here:
https://sourceforge.net/projects/bbmap/

The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"

Converting BLAST output into CSV

Poonam Mahapatra — Mon, 11 Dec 2017 04:17:58 -0600

Suppose we wanted to do something with all this BLAST output. Generally, that’s the case - you want to retrieve all matches, or do a reciprocal BLAST, or something.

As with most programs that run on UNIX, the text output is in some specific format. If the program is popular enough, there will be one or more parsers written for that format – these are just utilities written to help you retrieve whatever information you are interested in from the output.

Let’s conclude this tutorial by converting the BLAST output in out.txt into a spreadsheet format, using a Python script.

First, we need to get the script. We’ll do that using the ‘git’ program:

git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts

We’ll discuss ‘git’ more later; for now, just think of it as a way to get ahold of a particular set of files. In this case, we’ve placed the files in /root/ngs-scripts/, and you’re looking to run the script blast/blast-to-csv.py using Python:

python /root/ngs-scripts/blast/blast-to-csv.py out.txt

This outputs a spreadsheet-like list of names and e-values. To save this to a file, do:

python /root/ngs-scripts/blast/blast-to-csv.py out.txt > ~/out.csv

If you have Excel installed, try double clicking on it.
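
If you would rather keep working in Python than open a spreadsheet, the saved file can be read back with the standard csv module. The sketch below is an assumption-laden example: it assumes the e-value sits in the last column of each row, filter_hits is a hypothetical helper, and the three sample rows are invented purely to show the mechanics.

```python
import csv
import io

def filter_hits(csv_text, max_evalue=1e-5, evalue_column=-1):
    """Keep rows whose e-value (assumed to be in the last column) is small enough."""
    kept = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        try:
            evalue = float(row[evalue_column])
        except ValueError:
            continue                  # skip a header line or a malformed row
        if evalue <= max_evalue:
            kept.append(row)
    return kept

# Invented sample rows, for illustration only.
sample = "queryA,subject1,1e-30\nqueryA,subject2,0.5\nqueryB,subject3,2e-8\n"
hits = filter_hits(sample)
```

To run it on a real file, read the file's text first (for example with open(path).read()) and check the column layout before trusting the filter.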

Phylogenetic & Molecular Genetics Terms and Definitions

Poonam Mahapatra — Tue, 08 Aug 2017 08:20:31 -0500

analog -- A feature that appears similar in two taxa which have originated from two different ancestors.

ancestor -- Any organism, population, or species from which some other organism, population, or species is descended by reproduction.

apomorphy -- A specialized (= derived) character of an organism.

basal group -- The earliest diverging group within a clade; for instance, to hypothesize that sponges are basal animals is to suggest that the lineage(s) leading to sponges diverged from the lineage that gave rise to all other animals.

biological classification -- The orderly arrangement of organisms in a hierarchical system that ideally reflects evolutionary history.

cDNA -- Complementary DNA; DNA that is synthesized by reverse transcriptase from a messenger RNA template (messenger RNA contains the coded information for protein synthesis).

character -- Heritable trait possessed by an organism.

character state -- Characters are usually described in terms of their states, for example: "hair present" vs. "hair absent," where "hair" is the character, and "present" and "absent" are its states.

clade -- A monophyletic taxon; a group of organisms which includes the most recent common ancestor of all of its members and all of the descendants of that most recent common ancestor. From the Greek word "klados", meaning branch or twig.

cladogenesis -- The development of a new clade; the splitting of a single lineage into two distinct lineages; speciation.

cladogram -- A diagram, resulting from a cladistic analysis, which depicts a hypothetical branching sequence of lineages leading to the taxa under consideration. The points of branching within a cladogram are called nodes. All taxa occur at the endpoints of the cladogram.

convergence -- Similarities which have arisen independently in two or more organisms that are not closely related. Contrast with homology.

crown group -- All the taxa descended from a major cladogenesis event, recognized by possessing the clade's synapomorphy. See: stem group.

derived -- Describes a character state that is present in one or more subclades, but not all, of a clade under consideration. A derived character state is inferred to be a modified version of the primitive condition of that character, and to have arisen later in the evolution of the clade. For example, "presence of hair" is a primitive character state for all mammals, whereas the "hairlessness" of whales is a derived state for one subclade within the Mammalia.

diversity -- Term used to describe numbers of taxa, or variation in morphology.

evolution -- Darwin's definition: descent with modification. The term has been variously used and abused since Darwin to include everything from the origin of man to the origin of life.

evolutionary tree -- A diagram which depicts the hypothetical phylogeny of the taxa under consideration. The points at which lineages split represent ancestor taxa to the descendant taxa appearing at the terminal points of the cladogram.

expressed sequence tag (EST) -- A partial coding sequence isolated at random from a cDNA library, used for identification and mapping of coding sequences, for discovery of new genes and (by reference to sequence data banks) for discovery of identities with other genes.

extinction -- When all the members of a clade or taxon die, the group is said to be extinct.

genetic marker -- A DNA sequence that can be recognized and thus used to characterize the larger DNA sequence and the chromosome in which it occurs.

homolog -- A feature that appears similar in two or more taxa with a common ancestor that also possessed that feature.

homology -- Two structures are considered homologous when they are inherited from a common ancestor which possessed the structure. This may be difficult to determine when the structure has been modified through descent.

hypothesis -- A concept or idea that can be falsified by various scientific methods.

ingroup -- In a cladistic analysis, the set of taxa which are hypothesized to be more closely related to each other than any are to the outgroup.

lineage -- Any continuous line of descent; any series of organisms connected by reproduction from parent to offspring.

monophyletic -- Term applied to a group of organisms which includes the most recent common ancestor of all of its members and all of the descendants of that most recent common ancestor. A monophyletic group is called a clade.

outgroup -- In a cladistic analysis, any taxon used to help resolve the polarity of characters, and which is hypothesized to be less closely related to each of the taxa under consideration than any are to each other.

paraphyletic -- Term applied to a group of organisms which includes the most recent common ancestor of all of its members, but not all of the descendants of that most recent common ancestor.

parsimony -- Refers to a rule used to choose among possible cladograms, which states that the cladogram implying the fewest changes in character states is the best.

phylogenetics -- Field of biology that deals with the relationships between organisms. It includes the discovery of these relationships, and the study of the causes behind this pattern.

phylogeny -- The evolutionary relationships among organisms; the patterns of lineage branching produced by the true evolutionary history of the organisms being considered.

plesiomorphy -- A primitive character state for the taxa under consideration.

polarity of characters -- The states of characters used in a cladistic analysis, either original or derived. Original characters are those acquired by an ancestor deeper in the phylogeny than the most recent common ancestor of the taxa under consideration. Derived characters are those acquired by the most recent common ancestor of the taxa under consideration.

polyphyletic -- Term applied to a group of organisms which does not include the most recent common ancestor of those organisms; the ancestor does not possess the character shared by members of the group.

primitive -- Describes a character state that is present in the common ancestor of a clade. A primitive character state is inferred to be the original condition of that character within the clade under consideration. For example, "presence of hair" is a primitive character state for all mammals, whereas the "hairlessness" of whales is a derived state for one subclade within the Mammalia.

radiation -- Event of rapid cladogenesis, believed to occur under conditions where a new feature permits a lineage to move into a new niche or new habitat, and is then called an adaptive radiation.

rank -- In traditional taxonomy, taxa are ranked according to their level of inclusiveness. Thus a genus contains one or more species, a family includes one or more genera, and so on.

relatedness -- Two clades are more closely related when they share a more recent common ancestor between them than they do with any other clade.

repetitive DNA -- Sequences of DNA that are found to be repeated, sometimes thousands of times over.

reticulation -- Joining of separate lineages on a phylogenetic tree, generally through hybridization or through lateral gene transfer. Fairly common in certain land plant clades; reticulation is thought to be rare among metazoans.

selection -- Process which favors one feature of organisms in a population over another feature found in the population. This occurs through differential reproduction -- those with the favored feature produce more offspring than those with the other feature, such that they become a greater percentage of the population in the next generation.

sister group -- The two clades resulting from the splitting of a single lineage.

stem group -- All the taxa in a clade preceding a major cladogenesis event. They are often difficult to recognize because they may not possess synapomorphies found in the crown group.

symplesiomorphy -- An ancestral character shared by the taxa under consideration.

synapomorphy -- A character which is derived, and because it is shared by the taxa under consideration, is used to infer common ancestry (shared derived state).

synteny -- Portions of chromosomes in which gene order is conserved.

systematics -- Field of biology that deals with the diversity of life. Systematics is usually divided into the two areas of phylogenetics and taxonomy.

taxon -- Any named group of organisms, not necessarily a clade.

taxonomy -- The science of naming and classifying organisms.
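
As a small worked example of the monophyletic and paraphyletic definitions above, the following Python sketch tests whether a set of taxa forms a clade on a rooted tree. It is only an illustration: the nested-tuple tree encoding, the helper names, and the four-taxon tree are assumptions made for this example.

```python
def leaves(tree):
    """Return the set of taxon names at the tips below a node."""
    if isinstance(tree, str):         # a tip is represented by its name
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result

def is_monophyletic(tree, group):
    """True if some node's descendants are exactly the taxa in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

# Rooted tree as nested tuples: ((human, mouse), chicken) plus frog.
tree = ((("human", "mouse"), "chicken"), "frog")

mammals = is_monophyletic(tree, {"human", "mouse"})           # a clade
no_mouse = is_monophyletic(tree, {"human", "chicken"})        # leaves out mouse
```

On this tree, human plus mouse is a clade, whereas human plus chicken is not: their most recent common ancestor also gave rise to mouse, so the group omits one of that ancestor's descendants.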