BOL: Related items

Tools for Protein-Protein Docking !

Poonam Mahapatra — Wed, 25 Apr 2018 05:15:53 -0500

Predicting the structure of protein–protein complexes using docking approaches is a difficult problem whose major challenges include identifying correct solutions and properly handling molecular flexibility and conformational changes. The following tools predict the structure of protein–protein complexes:

3D-Dock Suite

Global rigid search: FFT

Shape complementarity and electrostatics

Re-scoring and clustering. Refinement of interface side-chains

3D-Garden

Global rigid search in ensemble

Shape complementarity and Lennard–Jones potential

Side chain and backbone dihedral refinement

DOT

Global rigid search: FFT

Shape complementarity, electrostatics and VDW

Refinement: none

Escher NG

Global rigid search

Shape complementarity, hydrogen bonds and electrostatics

Integrated in VEGA

GRAMM

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Clustering of conformations

GRAMM-X

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Minimization and re-scoring with multiple filters

HEX

Global rigid search: Fourier correlation of spherical harmonics

Shape complementarity

HADDOCK

Global rigid search

Electrostatic, VDW and desolvation energy terms

MD simulated annealing refinement. Filtering based on external data.

ICM

Global rigid search: Monte Carlo

Empirical scoring function

Clustering and selection of conformations. Refinement of interface side-chains and re-scoring

MolFit

Global rigid search: FFT

Shape complementarity

Clustering of good solutions, filtering using a priori information and small, local rigid rotations around selected conformations

PatchDock

Global rigid search

Shape complementarity and atomic desolvation energy

Clustering of conformations

PyDock

Global rigid search: FFT

Shape complementarity

Rescoring by binding electrostatics and desolvation energy

RosettaDock

Local rigid search: Monte Carlo with low and high resolution structure representation levels

Different scoring parameters for the different resolutions

ZDOCK

Global rigid search: FFT

Shape complementarity, desolvation energy, and electrostatics.

Energy minimization and re-scoring

Free for academics

Point to note:

The proper treatment of flexibility in protein–protein docking is still an active field of research. You should first analyze your proteins to define their conformational space, and then choose the most suitable method for your docking problem.
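
Many of the tools above (3D-Dock, DOT, GRAMM, MolFit, PyDock, ZDOCK) run their global rigid search as a grid correlation evaluated with FFTs. The sketch below shows only the underlying idea on a toy 2D grid, using a direct (non-FFT) translational scan in plain Python. The grid encoding, the penalty value and all function names are illustrative assumptions, not the scoring function of any listed program.

```python
# Toy 2D translational scan of a shape-complementarity score.
# Real tools use 3D grids, also scan rotations, and evaluate the same
# correlation with FFTs; this direct loop is only for illustration.

CORE_PENALTY = -10  # assumed penalty for overlapping the receptor interior


def receptor_grid(shape):
    """Encode a receptor: 1 on surface cells, CORE_PENALTY in the interior."""
    rows, cols = len(shape), len(shape[0])
    grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not shape[r][c]:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # A filled cell is "surface" if any 4-neighbour is empty or off-grid.
            exposed = any(
                not (0 <= nr < rows and 0 <= nc < cols) or not shape[nr][nc]
                for nr, nc in neighbours
            )
            grid[r][c] = 1 if exposed else CORE_PENALTY
    return grid


def correlation_score(receptor, ligand, dr, dc):
    """Sum receptor[r + dr][c + dc] * ligand[r][c] over all ligand cells."""
    rows, cols = len(receptor), len(receptor[0])
    total = 0
    for r, row in enumerate(ligand):
        for c, value in enumerate(row):
            rr, cc = r + dr, c + dc
            if value and 0 <= rr < rows and 0 <= cc < cols:
                total += receptor[rr][cc] * value
    return total


def best_translation(receptor, ligand):
    """Scan every shift of the ligand and return the best (score, dr, dc)."""
    rows, cols = len(receptor), len(receptor[0])
    lrows, lcols = len(ligand), len(ligand[0])
    candidates = [
        (correlation_score(receptor, ligand, dr, dc), dr, dc)
        for dr in range(-lrows + 1, rows)
        for dc in range(-lcols + 1, cols)
    ]
    return max(candidates)
```

Placing the ligand on the receptor's surface layer is rewarded, while penetrating the interior is heavily penalised. FFT-based programs obtain the scores for all translations at once instead of looping over them, and repeat the scan for many ligand rotations.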

Binding Site Prediction in Protein !

Poonam Mahapatra — Wed, 25 Apr 2018 04:35:57 -0500

The interaction between proteins and other molecules is fundamental to all biological functions. This section covers tools that assist in predicting interaction sites on the protein surface, and tools for predicting the structure of the intermolecular complex formed between two or more molecules (docking).

Pockets Identification

CASTp

Automatic identification of pockets and cavities in protein structures, and quantitation of their volumes using Delaunay triangulation. Also available as a PyMOL plugin.

Pocket-Finder

Automatic identification of pockets and cavities in protein structures, and quantitation of their volumes.

PocketPicker

Grid-based technique for the analysis of protein pockets. PocketPicker is available as a plugin for PyMOL.

Binding Site Prediction

ConSurf

Identification of functional regions in proteins by surface-mapping of phylogenetic information

CRESCENDO

Identification of protein interaction sites. It uses sequence conservation patterns in homologous proteins to distinguish residues that are conserved due to structural restraints from those conserved due to functional restraints.
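
ConSurf and CRESCENDO both build on the observation that functionally important positions tend to be conserved across homologs. As a rough illustration only, the sketch below scores each column of a toy alignment by Shannon entropy (lower entropy means more conserved). This is a simpler stand-in, not the phylogeny-aware scoring these servers use, and the function name and toy alignment are made up for the example.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Return the Shannon entropy (in bits) of each alignment column.

    `alignment` is a list of equal-length aligned sequences; gap
    characters ('-') are ignored when counting residues.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] != "-")
        total = sum(counts.values())
        entropy = 0.0
        for n in counts.values():
            p = n / total
            entropy -= p * math.log2(p)
        entropies.append(entropy)
    return entropies


# Hypothetical four-sequence alignment: only the third column varies.
toy_alignment = ["ACDG", "ACEG", "ACFG", "AC-G"]
scores = column_entropies(toy_alignment)
```

Columns with an entropy of zero are fully conserved in this toy alignment; mapping such scores onto surface residues is the step that links conservation to candidate binding sites.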

Ligand Binding Sites

3DLigandSite

The server utilizes protein-structure prediction to provide structural models of the binding site. Ligands bound to structures are superimposed onto the model and used to predict the binding site.

FINDSITE

A threading-based method for ligand-binding site prediction and functional annotation based on binding-site similarity across superimposed groups of threading templates.

LIGSITE^csc

Prediction of binding sites by pocket identification using the Connolly surface and degree of conservation

metaPocket

A meta server for ligand-binding site prediction. metaPocket uses LIGSITE^csc, PASS, Q-SiteFinder and SURFNET

List of bioinformatics packages for NGS analysis !

Rahul Nayak — Sat, 20 Mar 2021 00:28:51 -0500

Package suites gather software packages and installation tools for specific languages or platforms. We have some for bioinformatics software.

Bioconductor – A plethora of tools for analysis and comprehension of high-throughput genomic data, including 1500+ software packages. [ paper-2004 | web ]
Biopython – Freely available tools for biological computing in Python, with included cookbook, packaging and thorough documentation. Part of the Open Bioinformatics Foundation. Contains the very useful Entrez package for API access to the NCBI databases. [ paper-2009 | web ]
Bioconda – A channel for the conda package manager specializing in bioinformatics software. Includes a repository with 3000+ ready-to-install (with conda install) bioinformatics packages. [ paper-2018 | web ]
BioJulia – Bioinformatics and computational biology infrastructure for the Julia programming language. [ web ]
Rust-Bio – Rust implementations of algorithms and data structures useful for bioinformatics. [ paper-2016 ]
SeqAn – The modern C++ library for sequence analysis.
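
The Biopython entry above mentions the Entrez module. A minimal sketch of a PubMed search with it follows. It assumes Biopython is installed and that network access to NCBI is available; the helper name `search_pubmed` is hypothetical, and the e-mail address and query term in the usage note are placeholders.

```python
def search_pubmed(term, email, retmax=5):
    """Return up to `retmax` PubMed IDs matching `term`."""
    # Imported lazily so this file loads even where Biopython is missing.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])
```

A call such as `search_pubmed("protein-protein docking", "you@example.org")` would return a list of PubMed ID strings; the actual IDs depend on the state of the database at query time.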

List of generic simulation software/tools/resource with brief description and homepage !!!

Jit — Mon, 10 Feb 2014 05:57:29 -0600

ALF
A Simulation Framework for Genome Evolution
http://www.cbrg.ethz.ch/alf

Bayesian Serial SimCoal
Bayesian Serial SimCoal (BayeSSC) is a modification of SIMCOAL 1.0, a program written by Laurent Excoffier, John Novembre, and Stefan Schneider.
http://www.stanford.edu/group/hadlylab/ssc/index.html

BEERS
BEERS was designed to benchmark RNA-Seq alignment algorithms and also algorithms that aim to reconstruct different isoforms and alternate splicing from RNA-Seq data
http://cbil.upenn.edu/beers/

BOTTLENECK
Bottleneck is a program for detecting recent effective population size reductions from allele frequency data
http://www.ensam.inra.fr/urlb/bottleneck/bottleneck.html

BottleSim
BottleSim is a computer simulation program for simulating the process of population bottlenecks
http://chkuo.name/software/bottlesim.html

CASS
Protein Sequence Simulation
http://www.wyomingbioinformatics.org/liberlesgroup/cass/

CDPOP
CDPOP is a landscape genetics tool for simulating the emergence of spatial genetic structure in populations resulting from specified landscape processes governing organism movement behavior.
http://cel.dbs.umt.edu/cdpop

CoalFace
CoalFace is a simulation of the coalescent process with the visual display of gene genealogies.
http://web.up.ac.za/default.asp?ipkcategoryid=3283

CoaSim
CoaSim is a tool for simulating the coalescent process with recombination and gene conversion under various demographic models.
http://users-birc.au.dk/mailund/coasim/index.html

cosi
The cosi package is written in C and is available as a tar file.
http://www.broadinstitute.org/~sfs/cosi/

CS-PSeq-Gen
A program to simulate the evolution of protein sequences under the constraints of the information of a particular reconstructed phylogeny
http://bioserv.rpbs.univ-paris-diderot.fr/software/cs-pseq-gen.html

DAWG
An application designed to simulate the evolution of recombinant DNA sequences in continuous time
http://scit.us/projects/dawg

Easypop
EASYPOP is an individual based model intended to simulate datasets under a very broad range of conditions
http://www.unil.ch/dee/page36926_fr.html

EggLib
EggLib is a C++/Python library and program package for evolutionary genetics and genomics.
http://egglib.sourceforge.net/

EvolSimulator
A simulation test bed for hypotheses of genome evolution
http://acb.qfab.org/acb/evolsim/

EvolveAGene
A realistic coding sequence simulation program that separates mutation from selection and allows the user to set selection conditions
http://bellinghamresearchinstitute.com/software/index.html

fastsimcoal
A continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios
http://cmpg.unibe.ch/software/fastsimcoal/

FastSLINK
Simulation of Marker and Phenotype Data in Pedigrees
http://watson.hgen.pitt.edu/

FFPopSim
C++/Python library for population genetics.
http://webdav.tuebingen.mpg.de/ffpopsim/

FLUX SIMULATOR
The Flux Simulator aims at providing a deterministic in silico reproduction of the experimental pipelines for RNA-Seq, employing a minimal set of parameters.
http://flux.sammeth.net/simulator.html

ForSim
ForSim: A Forward Evolutionary Computer Simulation
http://www.anthro.psu.edu/weiss_lab/research.shtml

ForwSim
A program based on the algorithm described in Padhukasahasram et al. 2008 that simulates genetic drift in a standard Wright-Fisher process.
http://badri-populationgeneticsimulators.blogspot.com/

FPG
Forward Population Genetic simulation
http://genfaculty.rutgers.edu/hey/software#fpg

FREGENE
FREGENE is a C++ program that simulates sequence-like data over large genomic regions in large diploid populations.
http://www.ebi.ac.uk/projects/bargen/download/fregen/documentation_html.html

GAMETES
Genetic Architecture Model Emulator for Testing and Evaluating Software: Simulates complex SNP models with pure, strict epistatic interactions with n-loci.
http://sourceforge.net/projects/gametes/?source=navbar

GASP
Genometric Analysis Simulation Program. A software tool for testing and investigating methods in statistical genetics by generating samples of family data based on user specified models.
http://research.nhgri.nih.gov/gasp/

GemSIM
Next generation sequencing read simulator
http://sourceforge.net/projects/gemsim/

GeneArtisan
Simulation of Markers in Case-Control Study Designs
http://www.rannala.org/?page_id=241

GENOME
A rapid coalescent-based whole genome simulator
http://www.sph.umich.edu/csg/liang/genome/

GenomePop2
GenomePop2 is a specialization of the program GenomePop just to manage SNPs under more flexible and useful settings. If you need models with more than 2 alleles please use the GenomePop program version.
http://webs.uvigo.es/acraaj/genomepop2.htm

GenomeSimla
GenomeSIMLA is currently under development; however, a beta release is available for testing
http://chgr.mc.vanderbilt.edu/genomesimla/

GENS2
Simulates interactions among two genetic and one environmental factor and also allows for epistatic interactions.
https://sourceforge.net/projects/gensim/

GWAsimulator
A rapid whole genome simulation program
http://biostat.mc.vanderbilt.edu/wiki/main/gwasimulator

HAP-SAMPLE
An association simulator for candidate regions or genome scans
http://www.hapsample.org/

HAPGEN
A simulator of case-control datasets at SNP markers
https://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html

HapSim
A simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients
http://cran.r-project.org/web/packages/hapsim/index.html

HAPSIMU
A program that simulates heterogeneous populations with various known and controllable structures under the continuous migration model or the discrete model
http://l.web.umkc.edu/liujian/

IBDsim
IBDSim is a computer package for the simulation of genotypic data under general isolation by distance models.
http://raphael.leblois.free.fr/

indel-Seq-Gen
A biological sequence simulation program that simulates highly divergent DNA sequences and protein superfamilies
http://bioinfolab.unl.edu/~cstrope/isg/

Indelible
A powerful and flexible simulator of biological evolution
http://abacus.gene.ucl.ac.uk/software/indelible/

invertFREGENE
InvertFREGENE is a forward-in-time simulator of inversions in population genetic data
http://www.ebi.ac.uk/projects/bargen/

kernelPop
A spatially explicit population genetic simulation engine
http://cran.r-project.org/src/contrib/archive/kernelpop/

MaCS
Markovian Coalescent Simulator
http://www-hsc.usc.edu/~garykche/

Mason
A package for the simulation of nucleotide data.
http://www.seqan.de/projects/mason/

mbs
A modification of Hudson's ms software that generates samples of DNA sequences with a biallelic site under selection
http://www.sendou.soken.ac.jp/esb/innan/innanlab/software.html

Mendel's Accountant
Mendel's Accountant (MENDEL) is an advanced numerical simulation program for modeling genetic change over time and was developed collaboratively by Sanford, Baumgardner, Brewer, Gibson and ReMine
http://mendelsaccount.sourceforge.net/

MetaSim
A tool to generate collections of synthetic reads that reflect the diverse taxonomical composition of typical metagenome data sets
http://ab.inf.uni-tuebingen.de/software/metasim/

mlcoalsim
Multilocus Coalescent Simulations
http://code.google.com/p/mlcoalsim-v1/

ms
ms generates samples under a variety of neutral models. Its purpose is to allow one to investigate the statistical properties of such samples, to evaluate estimators or statistical tests, and generally to aid in the interpretation of polymorphism data sets.
http://home.uchicago.edu/~rhudson1/source/mksamples.html

msHOT
A modification of ms that adds crossover and gene conversion hotspots. Like ms, it is meant to help investigate the statistical properties of simulated samples, to evaluate estimators or statistical tests, and generally to aid in the interpretation of polymorphism data sets.
http://home.uchicago.edu/~rhudson1/

msms
A coalescent simulation tool with selection.
http://www.mabs.at/ewing/msms/index.shtml

MySSP
A program for the simulation of DNA sequence evolution across a phylogenetic tree
http://www.rosenberglab.net/software.php

Nemo
A forward-time, individual-based, genetically explicit, and stochastic simulation program designed to study the evolution of genetic markers, life history traits, and phenotypic traits in a flexible (meta-)population framework.
http://nemo2.sourceforge.net/

NetRecodon
Coalescent simulation of coding DNA sequences with recombination (inter and intracodon), migration and demography
http://code.google.com/p/netrecodon/

PEDAGOG
Software for simulating eco-evolutionary population dynamics
https://bcrc.bio.umass.edu/pedigreesoftware/node/5

phenosim
A tool to add phenotypes to simulated genotypes
http://evoplant.uni-hohenheim.de/doku.php?id=software:software

PhyloSim
An R package for the Monte Carlo simulation of sequence evolution
http://bit.ly/rlsim-git

pIRS
Profile-based Illumina pair-end reads simulator
https://code.google.com/p/pirs/

ProteinEvolver
Simulation of protein evolution along phylogenies under structure-based substitution models
http://code.google.com/p/proteinevolver/

QMSim
QTL and Marker Simulator
http://www.aps.uoguelph.ca/~msargol/qmsim/

quantiNEMO
An individual-based program for the analysis of quantitative traits with explicit genetic architecture potentially under selection in a structured population
http://www2.unil.ch/popgen/softwares/quantinemo/

RECOAL
Simulates new haplotype data from a reference population of haplotypes.
ftp://popgen.usc.edu/

Recodon
Coalescent simulation of coding DNA sequences with recombination, migration and demography
http://code.google.com/p/recodon/

rlsim
A package for simulating RNA-seq library preparation with parameter estimation
http://bit.ly/rlsim-git

Rmetasim
Rmetasim is a front-end for the metasim engine that is implemented as a package that runs in the statistical computing environment R
http://linum.cofc.edu/software.html#metasim

RNA Seq Simulator
RSS takes SAM alignment files from RNA-Seq data and simulates over dispersed, multiple replica, differential, non-stranded RNA-Seq datasets.
http://useq.sourceforge.net/cmdlnmenus.html#rnaseqsimulator

Rose
Random model of sequence evolution
http://bibiserv.techfak.uni-bielefeld.de/rose/

SelSim
SelSim is a program for Monte Carlo simulation of DNA polymorphism data for a recombining region within which a single bi-allelic site has experienced natural selection
http://www.well.ox.ac.uk/~spencer/selsim/

Seq-Gen
An application for the Monte Carlo simulation of molecular sequence evolution along phylogenetic trees.
http://tree.bio.ed.ac.uk/software/seqgen/

SEQPower
Statistical power analysis for sequence-based association studies
http://bioinformatics.org/spower/

SeqSIMLA
SeqSIMLA can simulate sequence data with user-specified disease and quantitative trait models. Family or unrelated case-control data can be simulated.
http://seqsimla.sourceforge.net/

Serial NetEvolve
A flexible utility for generating serially-sampled sequences along a tree or recombinant network
http://biorg.cis.fiu.edu/sne/

SFS_CODE
SFS_CODE can perform forward population genetic simulations under a general Wright-Fisher model with arbitrary migration, demographic, selective, and mutational effects.
http://sfscode.sourceforge.net/sfs_code/index/index.html

SIBSIM
Quantitative phenotype simulation in extended pedigrees
http://sourceforge.net/projects/sibsim/

SIMCOAL2
A coalescent program for the simulation of complex recombination patterns over large genomic regions under various demographic models
http://cmpg.unibe.ch/software/simcoal2/

SimCopy
An R package simulating the evolution of copy number profiles along a tree.
http://bit.ly/simcopy

SIMLA
SIMLA is a SIMuLAtion program that generates data sets of families for use in Linkage and Association studies.
http://www.chg.duke.edu/research/simla.html

SimPed
A Simulation Program to Generate Haplotype and Genotype Data for Pedigree Structures
http://www.hgsc.bcm.tmc.edu/content/simped

Simprot
A program to simulate protein evolution by substitution, insertion and deletion
http://www.uhnresearch.ca/labs/tillier/software.htm#3

SimRare
Rare variant simulation and analysis tool
http://code.google.com/p/simrare/

simuGWAS
A forward-time simulator that simulates realistic samples for genome-wide association studies.
http://simupop.sourceforge.net/cookbook/simucomplexdisease

simuPOP
simuPOP is a general-purpose individual-based forward-time population genetics simulation environment.
http://simupop.sourceforge.net/

SISSI
A software tool to generate data of related sequences along a given phylogeny, taking into account user defined system of neighbourhoods and instantaneous rate matrices.
http://www.cibiv.at/software/sissi/

SNPsim
Coalescent simulation of hotspot recombination
http://code.google.com/p/phylosoftware/

SPIP
SPIP simulates the transmission of genes from parents to offspring in a population having demographic structure defined by the user
http://swfsc.noaa.gov/textblock.aspx?division=fed&id=3434

Splatche
Spatial and Temporal Coalescences in Heterogeneous Environment
http://www.splatche.com/

srv
Simulator of Rare Variants (srv) simulates the introduction and evolution of (rare) genetic variants.
http://simupop.sourceforge.net/cookbook/simurarevariants

SUP
SLINK/FastSLINK utility program
http://mlemire.freeshell.org/software.html

TreesimJ
A flexible, forward-time population genetic simulator
http://code.google.com/p/treesimj/

Vortex
VORTEX is an individual-based simulation model for population viability analysis (PVA).
http://www.vortex9.org/vortex.html
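
Several of the forward-time simulators listed above (for example ForwSim and SFS_CODE) are described in terms of the Wright-Fisher model. As a hedged illustration of what such a model simulates, and not the algorithm of any listed program, the sketch below tracks one biallelic locus under pure genetic drift. The function name and parameters are chosen for this example only.

```python
import random


def wright_fisher_drift(pop_size, start_freq, generations, seed=None):
    """Simulate allele-frequency drift at one biallelic locus.

    Each generation, the 2N gene copies of the next generation are drawn
    independently from the current allele frequency (binomial sampling).
    No mutation, selection or migration is modelled.
    """
    rng = random.Random(seed)
    copies = 2 * pop_size  # diploid population of N individuals
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = count / copies
        trajectory.append(freq)
        if freq in (0.0, 1.0):  # allele lost or fixed: nothing more changes
            break
    return trajectory


# Hypothetical run: 50 diploid individuals, starting frequency 0.5.
trajectory = wright_fisher_drift(pop_size=50, start_freq=0.5,
                                 generations=200, seed=1)
```

Smaller populations lose or fix the allele faster on average, which is the effect that bottleneck-oriented tools in this list are designed to explore.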

MEGADOCK 4.0

Suleman Khan — Thu, 07 Aug 2014 18:08:54 -0500

An ultra–high-performance protein–protein docking software for heterogeneous supercomputers

Summary: The application of protein–protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of over 97% strong scaling.
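
For readers unfamiliar with the term, the sketch below computes strong-scaling efficiency under the usual definition: the problem size is fixed, and the speedup relative to a baseline run is divided by the increase in resources. The timings are invented for illustration and are not MEGADOCK measurements; which baseline the authors used is not stated in this summary.

```python
def strong_scaling_efficiency(base_nodes, base_time, nodes, time):
    """Efficiency of a run on `nodes` relative to a baseline run.

    Both runs must solve the same fixed-size problem. A value of 1.0
    means the run time fell in exact proportion to the added nodes.
    """
    if min(base_nodes, nodes) <= 0 or min(base_time, time) <= 0:
        raise ValueError("node counts and times must be positive")
    speedup = base_time / time
    resource_ratio = nodes / base_nodes
    return speedup / resource_ratio


# Hypothetical timings: 1000 s on 10 nodes, 103 s on 100 nodes.
efficiency = strong_scaling_efficiency(10, 1000.0, 100, 103.0)
print(f"{efficiency:.1%}")
```

With these made-up numbers, ten times the nodes give a speedup of about 9.7, so the efficiency is about 97%.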

Availability and Implementation: MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock.

Contact: akiyama@cs.titech.ac.jp

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/08/06/bioinformatics.btu532.short

Protein-Protein Interaction Sites Predictions !

Poonam Mahapatra — Wed, 25 Apr 2018 04:53:20 -0500

The study of Protein–Protein Interactions (PPIs) has a crucial role in biology, medicine and the pharmaceutical industry. PPIs can be investigated from two aspects: The interaction partners of a specific protein and the amino acid residues participating in a given PPI. Information about a protein’s interaction partners allows scientists to construct protein interaction networks, such as signaling pathways, which in turn facilitate the understanding of many biological and clinical observations.

The following tools are commonly used for predicting PPI sites:

Protein-Protein Interaction Sites

PPISP

A consensus neural network method for predicting protein-protein interaction sites

HOMCOS

A server to predict interacting protein pairs and interacting sites by homology modeling of complex structures

HotPOINT

Prediction of protein interfaces using an empirical model

ISIS

Prediction of interaction hotspots from sequence

KFC server

Automated decision-tree approach to predicting protein-protein interaction hot spots

meta-PPISP

A meta server for predicting protein-protein interaction sites. meta-PPISP is built on three individual web servers: cons-PPISP, PINUP, and Promate

ODA

Identification of optimal surface patches with the lowest docking desolvation energy values

PINUP

Protein binding site prediction with an empirical scoring function

Other Sites (DNA, RNA, Metals)

CHED

Web server for predicting soft metal binding sites in proteins

DBD-Hunter

A knowledge-based method for the prediction of DNA-protein interactions

DISPLAR

Given the structure of a protein known to bind DNA, the method predicts residues that contact DNA using a neural network method

iDBPs

Predicts DNA binding proteins for proteins with known 3D structure.

PFplus

A tool for extracting and displaying positive electrostatic patches on protein surfaces which can be indicative of nucleic acid binding interfaces.

Five points for bioinformatics software/tools

Jitendra Narayan — Mon, 05 Aug 2013 04:12:32 -0500

In the bioinformatics sector we mostly spend time on computational analysis of huge amounts of data and try to make sense of it, biologically. But most newbie bioinformaticians face a dilemma when they receive biological sequence data for the first time: they are often confused about whether to choose open source programs, user-friendly GUI tools, or commercial bioinformatics software. Don’t be surprised; this is true, and the decision is not an easy one, because the analytical step is the most crucial part and is believed to be the biggest bottleneck in publishing papers in high-impact journals. Through this blog I would like to address the pros and cons of both kinds of software/tools and try to assist (hmm, not really; it is more like convince) you in deciding on your software selection.

The most common newbie questions are:

“Should I try to use these free open source programs? Why are we not trying GUI software for computational analysis? Should I use commercial bioinformatics programs/software?”

1. Let’s be open

We generally think that free and cheap things are useless. But this idea does not apply to open source software. Most bioinformatics software is developed by highly competent biological programmers who believe in the open sharing of knowledge. They come under the Open Bioinformatics Foundation, or O|B|F, which is a non-profit, volunteer-run organization focused on supporting open source programming in bioinformatics. The best part about open source tools/software is that you are free to download the source code and read exactly what the program does. If you are so inclined, you can view all of the parts of the program and see the logical flow of the pipeline. In addition, open source makes an excellent learning tool for any beginning bioinformatician. Moreover, you can modify existing open source programs to deal with cutting-edge problems or to customize your pipeline. Beyond your own computational and analysis work, most reviewers also prefer results based on open source tools, so that they can validate the results if validation is required.

2. Code headache

As a bioinformatician you are supposed to know the basics of programming languages, and if you are not good at them, then please learn as soon as possible, because you are not a bio-analyst but a biological programmer. Open source programs usually lack dedicated service and support teams (often because they were the product of an overworked doc/postdoc!), so you are responsible for troubleshooting your own errors most of the time. We commonly receive HELP emails asking for support in setting up a pipeline; you can also find this kind of request on any Q&A forum. I personally believe this coding horror is the biggest downside of open source programs: you need some programming skills in order to implement the program in your pipeline. But if you are not able to fix the pipeline and modify the open source code according to your requirements, then you should rethink your bioinformatician name tag!!!

3. Dive into the codes

Some biologists turned bioinformaticians say, “If you can do the same thing with commercial software, then why get a migraine from weird code?” To me, this sounds like someone who is keen to learn swimming but still doesn’t like to get wet. If you are still using paid software and doing your work through customer support and by clicking well-designed GUI buttons, then perhaps you are not interested in learning and trying new and challenging bioinformatics work. You are missing the basic flavour of bioinformatics. Let’s dive into the coding world; I am sure you will enjoy it. I recommend that you swim freely in the sea of code and enjoy the journey; do not merely watch it from the outside.

4. Paid does not mean better

Companies that specialize in bioinformatics solutions develop well-designed, well-packaged, user-friendly software by employing a large number of specialised scientists, programmers and support staff. They also provide good services to help you accomplish your biological analysis work. This means that if you hit a ‘snag’ with your data, help is likely only a phone call away! These companies price their products competitively against the cost of a dedicated bioinformatician. You may be able to afford the program, but not the additional staff! Additionally, most of the functionality that you need in your analysis is already coded into the program. Need to plot a graph? Just click this button right here. It is that easy. But for a bioinformatician this is generally not an encouraged approach to biological analysis work, because the software is not available to everyone and your results can’t be validated. Moreover, there is very little chance that anyone will repeat your work or take up similar research (because not all the labs in the world are rich like yours).

5. Take caution

In biological analysis work, where you deal with GB/TB of data, the chance of errors is high, so please be careful and always cross-check your data before coming to any conclusion. Even an error in a two-line piece of code can alter your entire analysis and produce weird results. Some scientists blindly trust commercial software, which is entirely wrong. Using proprietary tools does not absolve you of the need to actually read and research the type of analysis that you are doing. This is particularly true in the case of genome assembly and annotation.

In the end, I would like to say only one thing: open source solutions allow you to do more cutting-edge analysis than commercial tools. So let’s go for it.

Disclaimer:

This is my personal view. I have nothing to do with any company or open source community. The views expressed on these pages are mine alone and not those of my current/past employers. I do reserve the right to remove comments left by spammers or off-topic comments.

Software and Tools to detect structure variation with long reads !!

Archana Malhotra — Wed, 15 Mar 2017 14:31:09 -0500

Uncovering the connection between genetics and heritable diseases requires an approach that looks at all the variant bases and types in a genome. While a PacBio de novo assembly resolves the most novel SVs, 8-10X PacBio coverage of single genomes or trios reveals triple the SVs detectable by short-read data.

With Single Molecule, Real-Time (SMRT) Sequencing, you can access structural variations having a broad range of sizes, types, and GC content with the ability to:

Uncover missing heritability linked to structural variation
Unambiguously identify genomic context and variant breakpoints at the sequence level to unravel the genetic etiology of disease
Resolve structural variation across the complete size spectrum with basepair resolution

The following SV tools can help you achieve this goal.

Sniffles: Structural variation caller using third generation sequencing

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis. Please note that the current version of Sniffles requires sorted output from BWA-MEM (use the -M and -x parameters) or NGM-LR with the optional SAM attributes enabled!

More at https://github.com/fritzsedlazeck/Sniffles

MultiBreak-SV: It identifies structural variants from next-generation paired end data, third-generation long read data, or data from a combination of sequencing platforms.

There are two pieces of software in this release: (1) a pre-processor that takes machine-format (.m5) BLASR files, and (2) MultiBreak-SV. For installation and usage instructions, see doc/MultiBreakSV-Manual.txt.

More at https://github.com/raphael-group/multibreak-sv

Parliament: A Structural Variation Tool. Why ask a single sv-detection approach to find every variant when you can have a parliament of tools deciding?

Publication about the algorithm and “…the first long-read characterization of structural variation in a diploid human personal genome…” (HS1011) - “Assessing structural variation in a personal genome—towards a human reference diploid genome”

More at https://sourceforge.net/projects/parliamentsv/

https://www.dnanexus.com/papers/Parliament_Info_Sheet.pdf

PBHoney: the structural variation discovery tool

PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp). PBHoney considers both intra-read discordance and soft-clipped tails of long reads to identify structural variants.
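
Soft-clipped tails are recorded in the CIGAR string of a SAM alignment as `S` operations at either end of the read. The sketch below, which is not PBHoney's code, shows one way to measure them with the Python standard library. The function name and the 200 bp threshold in the usage lines are illustrative assumptions.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_tails(cigar):
    """Return (left, right) soft-clip lengths of a SAM CIGAR string.

    Hard clips (H) may sit outside the soft clips and are skipped.
    An unavailable CIGAR ('*') yields (0, 0).
    """
    if cigar == "*":
        return 0, 0
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


# Hypothetical long read with clipped tails at both ends.
left, right = soft_clipped_tails("350S9000M1200S")
has_long_tail = max(left, right) >= 200  # assumed threshold, for illustration
```

Reads flagged this way are only candidates: a tail becomes evidence for a structural variant once it is realigned elsewhere or supported by other reads at the same breakpoint.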

Read The Paper http://www.biomedcentral.com/1471-2105/15/180/abstract

More at https://sourceforge.net/projects/pb-jelly/

SMRT-SV: Structural variant and indel caller for PacBio reads

Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.

SMRT-SV provides an official software package for tools described in Chaisson et al. 2014 and adds several key features including the following.

Unified variant calling user interface with built-in cluster compute support
Small indel calling (2-49 bp)
Improved inversion calling (screenInversions)
Quality metric for SV calls based on number of local assemblies supporting each call
Higher sensitivity for SV calls using tiled local assemblies across the entire genome instead of "signature" regions
Genotyping of SVs with Illumina paired-end reads from WGS samples

More at https://github.com/EichlerLab/pacbio_variant_caller

List of non-commercial NGS genotype-calling software

Jit — Thu, 09 Aug 2018 04:21:32 -0500

Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data.
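
The uncertainty mentioned above can be illustrated with a small genotype-likelihood calculation. This is a minimal sketch under simple assumptions (a biallelic diploid site, independent reads, sequencing errors spread evenly over the three other bases, and a flat prior); it is not the model of any particular program in the list, and the function names are made up for the example.

```python
import math


def genotype_log10_likelihoods(bases, quals, ref, alt):
    """log10 P(reads | genotype) for genotypes with 0, 1 and 2 copies of alt."""
    logliks = []
    for n_alt in (0, 1, 2):
        frac_alt = n_alt / 2
        total = 0.0
        for base, qual in zip(bases, quals):
            err = 10 ** (-qual / 10)  # Phred score -> error probability
            p_if_ref = 1 - err if base == ref else err / 3
            p_if_alt = 1 - err if base == alt else err / 3
            # A read samples one of the two chromosomes at random.
            total += math.log10(frac_alt * p_if_alt + (1 - frac_alt) * p_if_ref)
        logliks.append(total)
    return logliks


def genotype_posteriors(logliks, priors=(1 / 3, 1 / 3, 1 / 3)):
    """Turn log10 likelihoods into posterior probabilities (flat prior assumed)."""
    best = max(logliks)  # subtract the maximum to avoid underflow
    weighted = [10 ** (ll - best) * p for ll, p in zip(logliks, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


low_cov = genotype_posteriors(genotype_log10_likelihoods("AA", [30, 30], "A", "G"))
high_cov = genotype_posteriors(genotype_log10_likelihoods("A" * 10, [30] * 10, "A", "G"))
print("2 reads: ", [round(p, 3) for p in low_cov])
print("10 reads:", [round(p, 3) for p in high_cov])
```

With only two reads, both showing the reference base, the homozygous-reference genotype gets a posterior of only about 0.8, because a heterozygote would produce the same reads a quarter of the time. With ten reads the call is nearly certain. This is the reason low-coverage studies benefit from methods that carry the likelihoods forward instead of making a hard call.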

A list of programs for genotype and SNP calling:

SOAP2 http://soap.genomics.org.cn/index.html

Calling method: single-sample. Prerequisites: high-quality variant database (for example, dbSNP). Comments: package for NGS data analysis, which includes a single individual genotype caller (SOAPsnp).

realSFS http://128.32.118.212/thorfinn/realSFS/

Calling method: single-sample. Prerequisites: aligned reads. Comments: software for SNP and genotype calling using single individuals and allele frequencies. Site frequency spectrum (SFS) estimation.

Samtools http://samtools.sourceforge.net/

Calling method: multi-sample. Prerequisites: aligned reads. Comments: package for manipulation of NGS alignments, which includes a computation of genotype likelihoods (samtools) and SNP and genotype calling (bcftools).

GATK http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

Calling method: multi-sample. Prerequisites: aligned reads. Comments: package for aligned NGS data analysis, which includes a SNP and genotype caller (Unified Genotyper), SNP filtering (Variant Filtration) and SNP quality recalibration (Variant Recalibrator).

Beagle http://faculty.washington.edu/browning/beagle/beagle.html

Calling method: multi-sample LD. Prerequisites: candidate SNPs, genotype likelihoods. Comments: software for imputation, phasing and association that includes a mode for genotype calling.

IMPUTE2 http://mathgen.stats.ox.ac.uk/impute/impute_v2.html

Calling method: multi-sample LD. Prerequisites: candidate SNPs, genotype likelihoods. Comments: software for imputation and phasing, including a mode for genotype calling. Requires a fine-scale linkage map.

QCall ftp://ftp.sanger.ac.uk/pub/rd/QCALL

Calling method: multi-sample LD. Prerequisites: ‘feasible’ genealogies at a dense set of loci, genotype likelihoods. Comments: software for SNP and genotype calling, including a method for generating candidate SNPs without LD information (NLDA) and a method for incorporating LD information (LDA). The ‘feasible’ genealogies can be generated using Margarita (http://www.sanger.ac.uk/resources/software/margarita).

MaCH http://genome.sph.umich.edu/wiki/Thunder

Calling method: multi-sample LD. Prerequisites: genotype likelihoods. Comments: software for SNP and genotype calling, including a method (GPT_Freq) for generating candidate SNPs without LD information and a method (thunder_glf_freq) for incorporating LD information.

Ancient whole genome duplication (WGD) detection tools !

Rahul Nayak — Sun, 07 Mar 2021 00:32:44 -0600

There are two methods for detecting ancient WGD: one is collinearity analysis, and the other is based on the Ks distribution. Ks is the average number of synonymous substitutions per synonymous site; its counterpart, Ka, is the average number of non-synonymous substitutions per non-synonymous site.
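
The difference between synonymous and non-synonymous changes can be shown with a short sketch. It only counts raw codon differences between two aligned coding sequences using the standard genetic code; real Ks and Ka estimates also divide by the number of synonymous and non-synonymous sites and correct for multiple substitutions, which this example deliberately leaves out. The sequences and function name are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Count (synonymous, non-synonymous) codons that differ at exactly one base."""
    syn = nonsyn = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # identical codons, or multi-base changes this sketch skips
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Hypothetical aligned paralogs: CTT->CTC keeps Leu, GAA->GAC turns Glu into Asp.
print(count_codon_differences("ATGCTTGAAGGT", "ATGCTCGACGGT"))
```

Because synonymous changes are largely invisible to selection, they accumulate at a roughly steady rate, which is what makes Ks usable as a proxy for the age of a duplication.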

Several groups have published pipelines for WGD analysis. Searching for the keyword "wgd pipeline" turns up the following:

GenoDup: https://github.com/MaoYafei/GenoDup-Pipeline
https://peerj.com/articles/6303/
WGDdetector: https://github.com/yongzhiyang2012/WGDdetector
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2670-3
wgd: https://github.com/arzwa/wgd
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2#Sec1
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
GeNoGAP https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
purge_dups: https://github.com/dfguan/purge_dups
https://www.biorxiv.org/content/10.1101/2020.01.24.917997v1

This article introduces the usage of wgd.

wgd cannot currently be installed directly from bioconda, so installation takes a few extra steps because it depends on several external programs:

BLAST
MCL
MUSCLE/MAFFT/PRANK
PAML
PhyML/FastTree
i-ADHoRe

The good news is that most of these dependencies can be installed with bioconda:

conda create -n wgd python=3.5 blast mcl muscle mafft prank paml fasttree cmake libpng mpi=1.0=mpich
conda activate wgd

Here mpi=1.0=mpich is selected because i-ADHoRe depends on MPICH. If OpenMPI is installed instead, i-ADHoRe fails with: error while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

After that, the installation is much simpler

git clone https://github.com/arzwa/wgd.git
cd wgd
pip install .
# alternatively, install straight from GitHub without cloning:
# pip install git+https://github.com/arzwa/wgd.git

For i-ADHoRe, you need to register at http://bioinformatics.psb.ugent.be/webtools/i-adhore/licensing/ and agree to the license before you can download i-ADHoRe 3.0.

Since my miniconda3 is installed in ~/opt/, the installation prefix is ~/opt/miniconda3/envs/wgd/

tar -zxvf i-adhore-3.0.01.tar.gz
cd i-adhore-3.0.01
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=~/opt/miniconda3/envs/wgd/
make -j 4
make install

Take the sugarcane genome Saccharum spontaneum L. as an example. The assembled genome has 32 chromosomes, that is four homologous copies of each of eight basic chromosomes (4 × 8 = 32).

Download the CDS and GFF annotation files for the tutorial:

mkdir -p wgd_tutorial && cd wgd_tutorial
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.cds.fasta.gz
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.gff3.gz
gunzip *.gz

First run conda activate wgd to enter the analysis environment, then start the analysis.

Step 1: Use wgd mcl to identify homologous genes in the genome

wgd mcl -n 20 --cds --mcl -s Sspon.v20190103.cds.fasta -o Sspon_cds.out
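
To get a feel for what the clustering step produces, the sketch below reads gene families and lists the gene pairs inside each family. It assumes the usual MCL output layout of one family per line with tab-separated gene IDs; the gene names are invented, and the helper functions are illustrative, not part of wgd.

```python
import io
from itertools import combinations


def read_families(handle):
    """Read gene families, one per line, keeping only families with 2+ genes."""
    families = []
    for line in handle:
        genes = line.split()
        if len(genes) >= 2:
            families.append(genes)
    return families


def paralog_pairs(families):
    """Yield every within-family gene pair (candidate paralog pairs)."""
    for family in families:
        yield from combinations(sorted(family), 2)


# Hypothetical clustering output: a 3-gene family, a 2-gene family, a singleton.
example = io.StringIO("geneA1\tgeneA2\tgeneA3\ngeneB1\tgeneB2\ngeneC1\n")
families = read_families(example)
pairs = list(paralog_pairs(families))
print(len(families), "families,", len(pairs), "pairs")
```

The number of pairs grows quadratically with family size, which is one reason the Ks step that follows is the most time-consuming part of the analysis.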

Step 2: Use wgd ksd to build the Ks distribution

wgd ksd --n_threads 80 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl Sspon.v20190103.cds.fasta
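
A WGD shows up in a Ks distribution as a peak, because many gene pairs were created at the same time. The sketch below bins a handful of Ks values and reports interior bins that are higher than both neighbours. The values, bin width and upper cutoff are all assumptions made for the example; this is not how wgd itself fits or plots the distribution.

```python
def ks_histogram(ks_values, bin_width=0.25, max_ks=2.0):
    """Count Ks values per bin, ignoring values outside [0, max_ks)."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            index = min(int(ks / bin_width), n_bins - 1)
            counts[index] += 1
    return counts


def interior_peaks(counts):
    """Indices of bins strictly higher than both neighbours."""
    return [
        i for i in range(1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
    ]


# Hypothetical Ks values for paralogous pairs; large values are treated as saturated.
ks_values = [0.05, 0.08, 0.12, 0.3, 0.55, 0.9, 0.95,
             1.0, 1.02, 1.05, 1.1, 1.4, 2.5, 6.0]
counts = ks_histogram(ks_values)
for i in interior_peaks(counts):
    print("peak in Ks range", i * 0.25, "to", (i + 1) * 0.25)
```

The first bin is left out of the peak search on purpose: recent small-scale duplicates usually pile up near Ks = 0 and would otherwise always look like a peak.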

Step 3: If the genome assembly is of good quality, wgd syn can be used for collinearity analysis. It finds the collinear blocks in the genome and the corresponding anchor points

wgd syn --feature gene --gene_attribute ID \
-ks wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv \
Sspon.v20190103.gff3 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl
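
The idea of a collinear block can be sketched in a few lines: homologous gene pairs (anchors) whose positions increase together along two chromosomes, with only small gaps, form a block. The greedy chaining below is a deliberately crude illustration for same-orientation blocks only. It is not the algorithm used by i-ADHoRe, and the gap and size thresholds are arbitrary assumptions.

```python
def collinear_blocks(anchors, max_gap=3, min_anchors=3):
    """Greedily chain anchors (pos_a, pos_b) into same-orientation blocks.

    Positions are gene-order indices on two chromosomes (hypothetical input).
    """
    blocks = []
    current = []
    for pos_a, pos_b in sorted(anchors):
        if current:
            last_a, last_b = current[-1]
            step_a, step_b = pos_a - last_a, pos_b - last_b
            if 0 < step_a <= max_gap and 0 < step_b <= max_gap:
                current.append((pos_a, pos_b))
                continue
            # The chain is broken: keep it only if it is long enough.
            if len(current) >= min_anchors:
                blocks.append(current)
        current = [(pos_a, pos_b)]
    if len(current) >= min_anchors:
        blocks.append(current)
    return blocks


# Hypothetical anchors: two collinear runs and one isolated pair at (9, 3).
anchors = [(1, 10), (2, 11), (4, 13), (5, 14), (9, 3),
           (20, 40), (21, 42), (22, 43)]
blocks = collinear_blocks(anchors)
for block in blocks:
    print(len(block), "anchors from", block[0], "to", block[-1])
```

Anchor pairs inside such blocks are the ones most likely to descend from a large-scale duplication, which is why their Ks values are often plotted separately from the whole-paranome distribution.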

For reference, wgd has 9 sub-modules:

kde: KDE fitting to the Ks distribution
ksd: Ks distribution construction
mcl: all-vs-all BLASTP comparison + MCL clustering
mix: mixture modeling of the Ks distribution
pre: preprocess the CDS file
syn: call i-ADHoRe 3.0 to use GFF files for collinearity analysis
viz: draw histograms and density plots
wf1: standard Ks analysis workflow for the whole paranome; calls mcl, ksd and syn
wf2: standard Ks analysis workflow for one-vs-one homologous genes (orthologs); calls mcl and ksd