BOL: Related items

Stand-alone programs for Bioinformaticians

Radha Agarkar — Sat, 21 May 2016 22:50:15 -0500

This directory contains applications for stand-alone use, built specifically for a Linux 64-bit machine.

For help on the bigBed and bigWig applications see:
http://genome.ucsc.edu/goldenPath/help/bigBed.html
http://genome.ucsc.edu/goldenPath/help/bigWig.html

View the file 'FOOTER' to see the usage statement for each of the applications.

Address of the bookmark: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/

Bioinformatics tools and software

Jit — Tue, 05 Jul 2016 10:02:26 -0500

USEARCH
Extreme high-throughput sequence analysis. Orders of magnitude faster than BLAST.

MUSCLE
Multiple sequence alignment. Faster and more accurate than CLUSTALW.

UPARSE
OTU clustering for 16S and other marker genes. Highly accurate OTU sequences and improved diversity measures.

UCHIME
Chimeric sequence detection.

PILER
De novo genome repeat finder.

PILER-CR
Detection of CRISPR repeats in bacterial genomes.

QSCORE
Compare two multiple alignments for benchmarking.

PALS
Whole-genome alignment.

PREFAB
Protein Reference Alignment Database.

MSA benchmark collection
Selected multiple alignment benchmarks in a standardized FASTA format.

Address of the bookmark: http://drive5.com/software.html

TMAP - torrent mapping alignment program General Notes

Poonam Mahapatra — Sun, 02 Apr 2017 15:53:47 -0500

TMAP is a fast and accurate alignment software for short and long nucleotide sequences produced by next-generation sequencing technologies.

The latest TMAP is unsupported. To use a supported version, please see the TMAP version associated with a Torrent Suite release below.

Get the latest source code:

git clone git://github.com/iontorrent/TMAP.git
cd TMAP
git submodule init
git submodule update

https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

Address of the bookmark: https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

Tools to Predict the Impact of Missense Variants !

Jit — Mon, 23 Apr 2018 12:57:33 -0500

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen‐2, SIFT, FatHMM, MutationTaster‐2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants.

A study of 10 tools on five datasets shows that such a comparative evaluation is hindered by two types of circularity. They arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used for training and in those used for evaluation, which may lead to overly optimistic results. Comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate among all tools, and that they even outperform optimized combinations of tools.
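The two types of circularity can be checked for before an evaluation is run. The following is a minimal sketch, not the procedure used in the study: the `Variant` record layout, the function name `circularity_report`, and the toy data are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record layout: one labeled missense variant in one protein.
Variant = namedtuple("Variant", ["protein", "position", "ref", "alt", "pathogenic"])


def circularity_report(training, evaluation):
    """Sort evaluation variants by the type of circularity that affects them."""
    train_variants = {(v.protein, v.position, v.ref, v.alt) for v in training}
    train_proteins = {v.protein for v in training}

    type1 = []  # the same variant appears in training and evaluation data
    type2 = []  # a different variant, but from a protein seen in training
    clean = []  # neither the variant nor its protein was seen in training
    for v in evaluation:
        key = (v.protein, v.position, v.ref, v.alt)
        if key in train_variants:
            type1.append(v)
        elif v.protein in train_proteins:
            type2.append(v)
        else:
            clean.append(v)
    return type1, type2, clean


# Toy data, invented for this example.
training = [
    Variant("P1", 10, "A", "V", True),
    Variant("P1", 55, "G", "R", True),
    Variant("P2", 7, "L", "P", False),
]
evaluation = [
    Variant("P1", 10, "A", "V", True),   # type 1: identical variant
    Variant("P1", 80, "S", "F", True),   # type 2: new variant, known protein
    Variant("P3", 12, "K", "E", False),  # unaffected by either type
]

type1, type2, clean = circularity_report(training, evaluation)
print(len(type1), len(type2), len(clean))  # prints "1 1 1"
```

Only the variants in `clean` give an estimate of how a tool generalizes to proteins it has never seen.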

The following tools are useful for predicting the impact of missense mutations:

PolyPhen‐2 (PP2)
“Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations”

MutationTaster‐2 (MT2)
“Evaluation of the disease‐causing potential of DNA sequence alterations”

MutationAssessor (MASS)
“Predicts the functional impact of amino acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms”

LRT
“Identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein‐coding sequences, which are likely to be unconditionally deleterious”

SIFT
“Predicts whether an amino acid substitution affects protein function”

GERP++
“Identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as “rejected substitutions.” Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element”

phyloP
“Compute conservation or acceleration P values based on an alignment and a model of neutral evolution”

FatHMM unweighted (FatHMM‐U)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”

FatHMM weighted (FatHMM‐W)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”; its weighting scheme attributes higher tolerance scores to SNVs in proteins, related proteins, or domains that already include a high fraction of pathogenic variants

Combined Annotation Dependent Depletion (CADD)
“CADD is a tool for scoring the deleteriousness of single‐nucleotide variants as well as insertion/deletions variants in the human genome”

Tools for Protein-Protein Docking !

Poonam Mahapatra — Wed, 25 Apr 2018 05:15:53 -0500

Predicting the structure of protein–protein complexes using docking approaches is a difficult problem whose major challenges include identifying correct solutions, and properly dealing with molecular flexibility and conformational changes. Following are the tools to predict the structure of protein–protein complexes:

3D-Dock Suite

Global rigid search: FFT

Shape complementarity and electrostatics

Re-scoring and clustering. Refinement of interface side-chains

3D-Garden

Global rigid search in ensemble

Shape complementarity and Lennard-Jones potential

Side chain and backbone dihedral refinement

DOT

Global rigid search: FFT

Shape complementarity, electrostatics and VDW

Refinement: none

Escher NG

Global rigid search

Shape complementarity, hydrogen bonds and electrostatics

Integrated in VEGA

GRAMM

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Clustering of conformations

GRAMM-X

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Minimization and re-scoring with multiple filters

HEX

Global rigid search: Fourier correlation of spherical harmonics

Shape complementarity

HADDOCK

Global rigid search

Electrostatic, VDW and desolvation energy terms

MD simulated annealing refinement. Filtering based on external data

ICM

Global rigid search: Monte Carlo

Empirical scoring function

Clustering and selection of conformations. Refinement of interface side-chains and re-scoring

MolFit

Global rigid search: FFT

Shape complementarity

Clustering of good solutions, filtering using a priori information and small, local rigid rotations around selected conformations

PatchDock

Global rigid search

Shape complementarity and atomic desolvation energy

Clustering of conformations

PyDock

Global rigid search: FFT

Shape complementarity

Re-scoring by binding electrostatics and desolvation energy

RosettaDock

Local rigid search: Monte Carlo with low and high resolution structure representation levels

Different scoring parameters for the different resolutions

ZDOCK

Global rigid search: FFT

Shape complementarity, desolvation energy, and electrostatics

Energy minimization and re-scoring

Free for academics

Point to note:

The proper treatment of flexibility in protein–protein docking is still an active field of research. You should first analyze your proteins in order to define their conformational space, and then choose the most suitable method for your docking problem.
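To make "global rigid search" with shape-complementarity scoring more concrete, here is a toy sketch on a 2D grid. It is not the implementation of any program listed above. It scans translations only (no rotations) and does so by brute force, which is the same correlation that FFT-based programs evaluate much faster. The weights (+1 for surface contact, -15 for overlap with the core) and the shapes are illustrative assumptions.

```python
def surface_layer(core):
    """Cells outside the receptor core that touch it (8-neighbourhood)."""
    layer = set()
    for (x, y) in core:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                cell = (x + dx, y + dy)
                if cell not in core:
                    layer.add(cell)
    return layer


def score_pose(core, layer, ligand, shift, core_penalty=-15):
    """Shape complementarity of the ligand translated by `shift`."""
    tx, ty = shift
    total = 0
    for (x, y) in ligand:
        cell = (x + tx, y + ty)
        if cell in core:
            total += core_penalty  # clash with the receptor interior
        elif cell in layer:
            total += 1             # favourable surface contact
    return total


def global_rigid_search(core, ligand, span=range(-4, 9)):
    """Exhaustively scan translations and return the best one with its score."""
    layer = surface_layer(core)
    poses = [(tx, ty) for tx in span for ty in span]
    best = max(poses, key=lambda s: score_pose(core, layer, ligand, s))
    return best, score_pose(core, layer, ligand, best)


# Toy receptor: a 5x3 block with a one-cell pocket in the top row.
core = {(x, y) for x in range(5) for y in range(3)} - {(2, 2)}
# Toy ligand: a three-cell bar with a knob below its centre.
ligand = [(0, 0), (1, 0), (2, 0), (1, -1)]

best, best_score = global_rigid_search(core, ligand)
print(best, best_score)  # prints "(1, 3) 4": the knob sits in the pocket
```

Real programs add rotations, 3D grids, and further energy terms, and then re-score or refine the top-ranked poses.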

EvidentialGene: tr2aacds, mRNA Transcript Assembly Software

Rahul Nayak — Tue, 08 May 2018 04:39:39 -0500

EvidentialGene is a genome informatics project, "Evidence Directed Gene Construction for Eukaryotes", to construct high quality, accurate gene sets for animals and plants, developed by Don Gilbert at Indiana University, see
http://arthropods.eugenes.org/EvidentialGene/

Construction refers to the combination of classical gene prediction, and more recent gene assembly (de-novo and genome-assisted) methods. The basic Evigene methods involve using available best-of-breed gene prediction and assembly software, combining all evidence for genes, from expressed sequences, genome assembly sequences, related species protein sequences, and any other, to annotate and score gene constructions. Over-produced constructions are classified by gene evidence for best qualities per "locus", including genome-aligned and gene-transcript aligned (genome-free) locus identification. All software developed for EvidentialGene is publicly available. See project wiki/blog for notes.

Download

http://arthropods.eugenes.org/EvidentialGene/trassembly.html

https://sourceforge.net/p/evidentialgene/blog/

Address of the bookmark: http://arthropods.eugenes.org/EvidentialGene/trassembly.html

Introduction to phylogenies in R

Abhi — Wed, 13 Oct 2021 02:27:21 -0500

R phylogenetics is built on the contributed packages for phylogenetics in R, and there are many such packages. Let's begin today by installing a few critical packages, such as ape, phangorn, phytools, and geiger. To get the most recent CRAN version of these packages, you will need to have R 3.3.x installed on your computer!

Address of the bookmark: http://www.phytools.org/Cordoba2017/ex/2/Intro-to-phylogenies.html

Interesting Bioinformatics Resources !

Abhi — Fri, 11 Nov 2022 06:30:46 -0600

1. A reproducible workflow https://www.youtube.com/watch?v=s3JldKoA0zw This two-minute video will change your mind on reproducible research.

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. The Importance of Reproducible Research in High-Throughput Biology https://www.youtube.com/watch?v=7gYIs7uYbMo A video by Dr. Keith A. Baggerly from MD Anderson; highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042 A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Important Bioinformatics Tools !

BioStar — Tue, 30 Jul 2024 05:03:29 -0500

1. Ktrim: An extra-fast, accurate adapter trimmer for sequencing data. It processes FASTQ files from multiple lanes with minimal mismatching and over-trimming of adapters.

2. BWA MEM: A reliable alignment tool (particularly for mapping ALT contigs and HLA genes, which are not fully addressed in BWA-MEM2).

3. Sambamba markdup: Quickly marks or removes duplicate reads using Picard's criteria.

4. ichorCNA: Estimates the tumor DNA fraction in cell-free DNA from ultra-low-pass whole genome sequencing (0.1x coverage) based on copy number alterations (CNA).

5. Fragle: A deep learning method for quantifying ctDNA levels from cell-free DNA fragmentomic profiles. It detects tumor fractions (TF) as low as ~1% ctDNA and works with targeted genomic panel sequencing data.

6. AlfredQC: A quality control tool for high-throughput sequencing data. It assesses metrics like read quality scores, GC content, and duplication rates, visualized through detailed plots and summary statistics.

7. Mosdepth: A fast tool for calculating sequencing coverage depth, offering a quicker alternative to samtools/sambamba depth by processing BAM and CRAM files.

8. Bedtools: A versatile toolkit for genomics, enabling operations like intersect, merge, count, and shuffle on genomic intervals across formats such as BAM, BED, GFF/GTF, and VCF.

9. Datamash: A command-line tool for basic numeric, textual, and statistical operations on input data streams. It supports operations such as grouping, sorting, transposing, and performing arithmetic calculations on tabular data.

10. gwf.app: A pragmatic alternative to Snakemake. Developed at Aarhus University, this flexible, generic workflow tool builds and runs large scientific workflows.
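As a rough illustration of what a coverage-depth calculation produces, the sketch below computes per-base depth from read intervals. It is not the implementation of Mosdepth or any other tool in the list. The half-open `(start, end)` interval convention, the function name, and the toy reads are assumptions.

```python
def per_base_depth(reads, length):
    """Per-base depth over a region of `length` bases.

    `reads` holds half-open (start, end) alignments on one reference
    sequence. Each read adds +1 at its start and -1 at its end, and a
    running sum then turns these changes into the depth at each base.
    """
    changes = [0] * (length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, length)
        if start < end:
            changes[start] += 1
            changes[end] -= 1

    depth = []
    running = 0
    for change in changes[:length]:
        running += change
        depth.append(running)
    return depth


reads = [(0, 5), (2, 8), (4, 10)]  # toy alignments
depth = per_base_depth(reads, 10)
print(depth)                    # [1, 1, 2, 2, 3, 2, 2, 2, 1, 1]
print(sum(depth) / len(depth))  # mean depth: 1.7
```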

Predicting Pathogen Virulence Using Bioinformatics Tools

BioStar — Tue, 04 Nov 2025 07:55:53 -0600

In the genomic era, the ability to predict the virulence potential of pathogens has become an indispensable part of infectious disease research. With the exponential growth of microbial genome data, bioinformatics tools now enable scientists to identify virulence factors, model pathogen behavior, and even forecast outbreak risks — all from sequence data.

In an age where pathogens continue to evolve and cross boundaries, understanding what makes them virulent—that is, capable of causing disease—has become a critical focus in modern microbiology and genomics. Virulence prediction bridges computational biology, genomics, and machine learning to forecast the pathogenic potential of microbes before they strike.

What Is Virulence?

Virulence refers to the degree of damage a pathogen can inflict on its host. It is determined by a combination of genetic factors—called virulence factors (VFs)—that allow the organism to attach, invade, evade, and harm the host. These include genes coding for toxins, secretion systems, adhesins, and enzymes that disrupt host defenses.

Understanding virulence factors not only helps in deciphering the mechanisms of infection but also provides early warning signs for emerging threats.

Why Predict Virulence?

Traditional virulence studies relied heavily on experimental infection models, which, although accurate, are time-consuming, expensive, and ethically constrained.
Today, the availability of whole-genome sequences and large-scale pathogen databases has paved the way for in silico virulence prediction—a computational approach that can screen thousands of genomes within hours.

This approach enables researchers to:

Rapidly identify potential high-risk strains.
Prioritize pathogens for containment, surveillance, or further study.
Guide vaccine development and drug target discovery.
Support One Health frameworks, linking animal, human, and environmental health data.

How Is Virulence Predicted?

Virulence prediction combines bioinformatics pipelines with machine learning and comparative genomics. The process generally involves:

Genome Annotation: Identifying genes and coding sequences in microbial genomes.
Feature Extraction: Comparing sequences with curated databases like VFDB (Virulence Factor Database), PATRIC, or Victors.
Pattern Recognition: Using algorithms (e.g., Random Forest, SVM, or deep learning models) to classify genes or strains as virulent or non-virulent based on sequence patterns, motifs, and protein domains.
Scoring and Visualization: Assigning a virulence score or confidence level and visualizing it through heatmaps or genome maps.
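The steps above can be sketched in a few lines, under strong simplifying assumptions. The reference sequences below are invented and do not come from VFDB, PATRIC, or Victors. Shared k-mer (Jaccard) similarity stands in for a proper sequence search, a fixed threshold stands in for a trained classifier, and the "virulence score" is simply the fraction of genes flagged.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(protein, reference, k=3):
    """Most similar reference entry by Jaccard similarity of k-mer sets."""
    query = kmers(protein, k)
    best_name, best_sim = None, 0.0
    for name, ref_seq in reference.items():
        ref = kmers(ref_seq, k)
        union = query | ref
        sim = len(query & ref) / len(union) if union else 0.0
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim


def virulence_score(genes, reference, threshold=0.5):
    """Flag genes that resemble a reference virulence factor.

    Returns (fraction of genes flagged, {gene_id: (hit name, similarity)}).
    """
    hits = {}
    for gene_id, protein in genes.items():
        name, sim = best_hit(protein, reference)
        if sim >= threshold:
            hits[gene_id] = (name, round(sim, 2))
    score = len(hits) / len(genes) if genes else 0.0
    return score, hits


# Hypothetical reference set and annotated genes (made-up sequences).
reference = {"toxinA": "MKTLLVAGGS", "adhesinB": "MSPQRRWLTA"}
genes = {"g1": "MKTLLVAGGS", "g2": "MAAAAAAAAA", "g3": "MSPQRRWLTV"}

score, hits = virulence_score(genes, reference)
print(hits)  # g1 matches toxinA exactly; g3 resembles adhesinB
print(round(score, 2))
```

In a real pipeline, annotation would come from a gene caller, the comparison from an alignment tool run against a curated database, and the final call from a model trained on labeled genomes.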

Tools and Resources for Virulence Prediction

A number of tools and databases make virulence prediction accessible to the scientific community:

VFanalyzer – For identifying virulence genes based on VFDB.
PathoFact – Predicts virulence, antimicrobial resistance (AMR), and toxin genes from metagenomic data.
Pangenome-based models – Identify virulence-associated gene clusters across strains.
Machine learning models – Use features like GC content, codon usage bias, or protein domains to predict pathogenicity.
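Two of the features mentioned above, GC content and codon usage, are simple to compute from a coding sequence. This is a minimal sketch with assumed function names and a toy sequence. Raw codon frequencies are used here as a simple proxy for codon usage bias, which is normally measured relative to synonymous codons.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_frequencies(cds):
    """Relative frequency of each codon in a coding sequence.

    Trailing bases that do not form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


cds = "ATGGCCGCCTAA"  # toy coding sequence: ATG GCC GCC TAA
print(round(gc_content(cds), 3))  # 0.583
print(codon_frequencies(cds))     # {'ATG': 0.25, 'GCC': 0.5, 'TAA': 0.25}
```

Feature vectors built this way, one per gene or genome, are the kind of input a classifier such as a Random Forest or SVM would be trained on.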

Emerging tools now integrate multi-omic data—including transcriptomics, proteomics, and metabolomics—to understand virulence in a systems biology framework.

Applications in the Real World

Virulence prediction has major implications across public health and research sectors:

Epidemic preparedness: Early identification of virulent strains in outbreak samples.
AMR surveillance: Linking virulence profiles with antibiotic resistance determinants.
Environmental monitoring: Predicting pathogenic potential of soil or waterborne microbes.
Clinical diagnostics: Supporting personalized treatment through pathogen profiling.

For instance, integrating virulence prediction pipelines into national surveillance networks could enable faster risk assessment and response to infectious outbreaks.

The Road Ahead

As machine learning and genomics advance, virulence prediction will evolve from simple gene-based detection to dynamic, context-aware models that account for host–pathogen interactions, environmental signals, and evolutionary adaptation.

Future tools may predict not just if a strain is virulent, but under what conditions it expresses that virulence—bridging the gap between genotype and phenotype.

In Summary

Virulence prediction is redefining how we understand and anticipate infectious diseases. By coupling genomic insights with computational intelligence, researchers can identify potential threats earlier, design smarter interventions, and ultimately, strengthen our preparedness against emerging pathogens.