BOL: Related items

Interesting Bioinformatics Resources !

Abhi — Fri, 11 Nov 2022 06:30:46 -0600

1. A reproducible workflow. https://www.youtube.com/watch?v=s3JldKoA0zw This two-minute video will change your mind on reproducible research.

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. A video by Dr. Keith A. Baggerly from MD Anderson, [The Importance of Reproducible Research in High-Throughput Biology](https://www.youtube.com/watch?v=7gYIs7uYbMo), highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042 A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Important Bioinformatics Tools !

BioStar — Tue, 30 Jul 2024 05:03:29 -0500

1. Ktrim: An extra-fast, accurate adapter trimmer for sequencing data. It processes FASTQ files from multiple lanes with minimal mismatching and over-trimming of adapters.

2. BWA MEM: A reliable alignment tool (particularly for mapping ALT contigs and HLA genes, which are not fully addressed in BWA-MEM2).

3. Sambamba markdup: Quickly marks or removes duplicate reads using Picard's criteria.

4. ichorCNA: Estimates the tumor DNA fraction in cell-free DNA from ultra-low-pass whole genome sequencing (0.1x coverage) based on copy number alterations (CNA).

5. Fragle: A deep learning method for quantifying ctDNA levels from cell-free DNA fragmentomic profiles. It detects tumor fractions (TF) as low as ~1% ctDNA and works with targeted genomic panel sequencing data.

6. AlfredQC: A quality control tool for high-throughput sequencing data. It assesses metrics like read quality scores, GC content, and duplication rates, visualized through detailed plots and summary statistics.

7. Mosdepth: A fast tool for calculating sequencing coverage depth, offering a quicker alternative to samtools/sambamba depth by processing BAM and CRAM files.

8. Bedtools: A versatile toolkit for genomics, enabling operations like intersect, merge, count, and shuffle on genomic intervals across formats such as BAM, BED, GFF/GTF, and VCF.

9. Datamash: A command-line tool for basic numeric, textual, and statistical operations on input data streams. It supports operations such as grouping, sorting, transposing, and performing arithmetic calculations on tabular data.

10. gwf.app: A pragmatic alternative to Snakemake. Developed at Aarhus University, this flexible, generic workflow tool builds and runs large scientific workflows.
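To make the interval operations mentioned under Bedtools (item 8) a little more concrete, here is a minimal Python sketch of a merge over BED-like intervals. It is not Bedtools code: the function name `merge_intervals` and the tuple layout `(chrom, start, end)` are assumptions made for illustration, and it only mimics the idea of collapsing overlapping or directly adjacent intervals.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended half-open intervals per chromosome.

    `intervals` is an iterable of (chrom, start, end) tuples (hypothetical layout).
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        # Extend the previous interval if this one starts inside or right at its end.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 300, 400), ("chr2", 50, 60)]
print(merge_intervals(regions))  # [('chr1', 100, 400), ('chr2', 50, 60)]
```

For real data, the toolkit itself is the better choice; the sketch only shows what "merge" means on genomic intervals.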

Predicting Pathogen Virulence Using Bioinformatics Tools

BioStar — Tue, 04 Nov 2025 07:55:53 -0600

In the genomic era, the ability to predict the virulence potential of pathogens has become an indispensable part of infectious disease research. With the exponential growth of microbial genome data, bioinformatics tools now enable scientists to identify virulence factors, model pathogen behavior, and even forecast outbreak risks — all from sequence data.

In an age where pathogens continue to evolve and cross boundaries, understanding what makes them virulent—that is, capable of causing disease—has become a critical focus in modern microbiology and genomics. Virulence prediction bridges computational biology, genomics, and machine learning to forecast the pathogenic potential of microbes before they strike.

What Is Virulence?

Virulence refers to the degree of damage a pathogen can inflict on its host. It is determined by a combination of genetic factors—called virulence factors (VFs)—that allow the organism to attach, invade, evade, and harm the host. These include genes coding for toxins, secretion systems, adhesins, and enzymes that disrupt host defenses.

Understanding virulence factors not only helps in deciphering the mechanisms of infection but also provides early warning signs for emerging threats.

Why Predict Virulence?

Traditional virulence studies relied heavily on experimental infection models, which, although accurate, are time-consuming, expensive, and ethically constrained.
Today, the availability of whole-genome sequences and large-scale pathogen databases has paved the way for in silico virulence prediction—a computational approach that can screen thousands of genomes within hours.

This approach enables researchers to:

Rapidly identify potential high-risk strains.
Prioritize pathogens for containment, surveillance, or further study.
Guide vaccine development and drug target discovery.
Support One Health frameworks, linking animal, human, and environmental health data.

How Is Virulence Predicted?

Virulence prediction combines bioinformatics pipelines with machine learning and comparative genomics. The process generally involves:

Genome Annotation: Identifying genes and coding sequences in microbial genomes.
Feature Extraction: Comparing sequences with curated databases like VFDB (Virulence Factor Database), PATRIC, or Victors.
Pattern Recognition: Using algorithms (e.g., Random Forest, SVM, or deep learning models) to classify genes or strains as virulent or non-virulent based on sequence patterns, motifs, and protein domains.
Scoring and Visualization: Assigning a virulence score or confidence level and visualizing it through heatmaps or genome maps.
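
The feature-extraction and scoring steps above can be sketched in a few lines of Python. This is an illustrative toy, not the method of any tool named in this post: the hit records, the gene panel, the identity and coverage thresholds, and the "fraction of panel detected" score are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


def virulence_scores(hits, panel, min_identity=80.0, min_coverage=70.0):
    """Score each strain as the fraction of panel virulence factors it carries.

    `hits` are dicts with keys strain, vf, identity, coverage (hypothetical
    layout, e.g. parsed from a similarity search against a VF database).
    Thresholds are illustrative defaults, not recommended values.
    """
    detected = defaultdict(set)
    for hit in hits:
        passes = hit["identity"] >= min_identity and hit["coverage"] >= min_coverage
        if hit["vf"] in panel and passes:
            detected[hit["strain"]].add(hit["vf"])
    strains = sorted({hit["strain"] for hit in hits})
    return {strain: len(detected[strain]) / len(panel) for strain in strains}


panel = {"toxA", "invA", "fimH", "hlyA"}
hits = [
    {"strain": "S1", "vf": "toxA", "identity": 98.0, "coverage": 100.0},
    {"strain": "S1", "vf": "invA", "identity": 91.5, "coverage": 95.0},
    {"strain": "S1", "vf": "fimH", "identity": 60.0, "coverage": 99.0},  # fails identity
    {"strain": "S2", "vf": "fimH", "identity": 99.0, "coverage": 88.0},
    {"strain": "S2", "vf": "hlyA", "identity": 85.0, "coverage": 40.0},  # fails coverage
]
scores = virulence_scores(hits, panel)
print(scores)  # {'S1': 0.5, 'S2': 0.25}
```

A real pipeline would replace the simple fraction with a trained classifier, but the flow from filtered hits to a per-strain score is the same.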

Tools and Resources for Virulence Prediction

A number of tools and databases make virulence prediction accessible to the scientific community:

VFanalyzer – For identifying virulence genes based on VFDB.
PathoFact – Predicts virulence, antimicrobial resistance (AMR), and toxin genes from metagenomic data.
Pangenome-based models – Identify virulence-associated gene clusters across strains.
Machine learning models – Use features like GC content, codon usage bias, or protein domains to predict pathogenicity.
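
As a rough illustration of the sequence-derived features mentioned in the last item, the Python sketch below computes GC content and relative codon frequencies for a coding sequence. The function names are hypothetical, and raw codon frequencies are only a starting point for the codon usage bias measures a real model would use.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence.

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


cds = "ATGGCCGCCTAA"
print(round(gc_content(cds), 3))   # 0.583
print(codon_frequencies(cds))      # {'ATG': 0.25, 'GCC': 0.5, 'TAA': 0.25}
```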

Emerging tools now integrate multi-omic data—including transcriptomics, proteomics, and metabolomics—to understand virulence in a systems biology framework.

Applications in the Real World

Virulence prediction has major implications across public health and research sectors:

Epidemic preparedness: Early identification of virulent strains in outbreak samples.
AMR surveillance: Linking virulence profiles with antibiotic resistance determinants.
Environmental monitoring: Predicting pathogenic potential of soil or waterborne microbes.
Clinical diagnostics: Supporting personalized treatment through pathogen profiling.

For instance, integrating virulence prediction pipelines into national surveillance networks could enable faster risk assessment and response to infectious outbreaks.

The Road Ahead

As machine learning and genomics advance, virulence prediction will evolve from simple gene-based detection to dynamic, context-aware models that account for host–pathogen interactions, environmental signals, and evolutionary adaptation.

Future tools may predict not just if a strain is virulent, but under what conditions it expresses that virulence—bridging the gap between genotype and phenotype.

In Summary

Virulence prediction is redefining how we understand and anticipate infectious diseases. By coupling genomic insights with computational intelligence, researchers can identify potential threats earlier, design smarter interventions, and ultimately, strengthen our preparedness against emerging pathogens.

Alignment-free sequence comparison tools available for next-generation sequencing data analysis

Abhimanyu Singh — Tue, 07 Nov 2017 05:33:33 -0600

kallisto

Transcript abundance quantification from RNA-seq data (uses pseudoalignment for rapid determination of read compatibility with targets)

Software (C++)

https://pachterlab.github.io/kallisto/

Sailfish

Estimation of isoform abundances from reference sequences and RNA-seq data (k-mer based)

Software (C++)

http://www.cs.cmu.edu/~ckingsf/software/sailfish/

Salmon

Quantification of the expression of transcripts using RNA-seq data (uses k-mers)

https://combine-lab.github.io/salmon/

RNA-Skim

RNA-seq quantification at transcript-level (partitions the transcriptome into disjoint transcript clusters; uses sig-mers, a special type of k-mers)

Software (C++)

http://www.csbio.unc.edu/rs/

Variant calling

ChimeRScope

Fusion transcript prediction using gene k-mers profiles of the RNA-seq paired-end reads

Software (Java)

https://github.com/ChimeRScope/ChimeRScope/wiki

FastGT

Genotyping of known SNV/SNP variants directly from raw NGS sequence reads by counting unique k-mers

Software (C)

https://github.com/bioinfo-ut/GenomeTester4/
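
The idea of genotyping a known variant by counting k-mers in raw reads can be illustrated with a small Python sketch. This is not FastGT's implementation: the function names, the tiny k, the flank layout, and the naive calling rule are assumptions for illustration, and the sketch skips the genome-wide uniqueness check and reverse-complement handling that a real tool needs.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def allele_kmers(left, allele, right, k):
    """k-mers overlapping the variant base; flanks must be at least k-1 long."""
    window = left[-(k - 1):] + allele + right[:k - 1]
    return set(kmers(window, k))


def genotype(reads, left, right, ref, alt, k=5):
    """Return (call, ref_support, alt_support) from allele-specific k-mer counts."""
    read_counts = Counter(km for read in reads for km in kmers(read, k))
    ref_set = allele_kmers(left, ref, right, k)
    alt_set = allele_kmers(left, alt, right, k)
    ref_support = sum(read_counts[km] for km in ref_set - alt_set)
    alt_support = sum(read_counts[km] for km in alt_set - ref_set)
    if ref_support and alt_support:
        call = "het"
    elif ref_support:
        call = "hom_ref"
    elif alt_support:
        call = "hom_alt"
    else:
        call = "no_call"
    return call, ref_support, alt_support


reads = ["ACGTACAGTTCAG", "CGTACGGTTCA"]  # one read per allele (toy data)
print(genotype(reads, "ACGTAC", "GTTCAG", ref="A", alt="G"))  # ('het', 5, 5)
```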

Phy-Mer

Reference-independent mitochondrial haplogroup classifier from NGS data (k-mer based)

Software (Python)

https://github.com/danielnavarrogomez/phy-mer

LAVA

Genotyping of known SNPs (dbSNP and Affymetrix's Genome-Wide Human SNP Array) from raw NGS reads (k-mer based)

Software (C)

http://lava.csail.mit.edu/

MICADo

Detection of mutations in targeted third-generation NGS data (can distinguish patient-specific mutations; algorithm uses k-mers and is based on colored de Bruijn graphs)

Software (Python)

http://github.com/cbib/MICADo

General mapper

Minimap

Lightweight and fast read mapper and read overlap detector (uses the concept of “minimizers”, a special type of k-mers)

Software (C)

https://github.com/lh3/minimap
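
For readers unfamiliar with minimizers, here is a minimal Python sketch of the (w, k) selection scheme: in every window of w consecutive k-mers, keep the smallest one. It is a simplification made for illustration. The function name is hypothetical, plain lexicographic order stands in for the hash-based ordering a real mapper would typically use, and strands are ignored.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        # Smallest k-mer in the window; ties go to the leftmost occurrence.
        offset = min(range(w), key=lambda j: (window[j], j))
        selected.add((start + offset, window[offset]))
    return sorted(selected)


print(minimizers("GATTACAACT", k=3, w=4))
# [(1, 'ATT'), (4, 'ACA'), (6, 'AAC')]
```

Adjacent windows often share their minimizer, so only a subset of k-mers (here 3 of 8) needs to be indexed, which is what makes the approach lightweight.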

Assembly

De novo genome assembly

MHAP

Produces highly continuous assemblies (fully resolved chromosome arms) from long, noisy third-generation reads (10 kbp) using the MinHash dimensionality-reduction technique

Software (Java)

https://github.com/marbl/MHAP

Miniasm

Assembler of long noisy reads (SMRT, ONT) using the Overlap-Layout Consensus (OLC) approach without the necessity of an error correction stage (uses minimap)

Software (C)

https://github.com/lh3/miniasm

LINKS

Scaffolding of genome assemblies with error-containing long sequences (e.g., ONT or PacBio reads, draft genomes)

Software (Perl)

https://github.com/warrenlr/LINKS/

Read clustering

afcluster

Clustering of reads from different genes and different species based on k-mer counts

Software (C++)

https://github.com/luscinius/afcluster

QCluster

Clustering of reads with alignment-free measures (k-mer based) and quality values

Software (C++)

http://www.dei.unipd.it/~ciompin/main/qcluster.html

Reads error correction

Lighter

Correction of sequencing errors in raw, whole genome sequencing reads (k-mer based)

Software (C++)

https://github.com/mourisl/Lighter

QuorUM

Error corrector for Illumina reads using k-mers

Software (C++)

https://github.com/gmarcais/Quorum

Trowel

Software (C++)

https://sourceforge.net/projects/trowel-ec/

Metagenomics

Assembly-free phylogenomics

AAF

Phylogeny reconstruction directly from unassembled raw sequence data from whole genome sequencing projects; provides bootstrap support to assess uncertainty in the tree topology (k-mer based)

Software (Python)

https://github.com/fanhuan/AAF

kSNP v3

Reference-free SNP identification and estimation of phylogenetic trees using SNPs (based on k-mer analysis)

Software (C)

https://sourceforge.net/projects/ksnp/files/

NGS-MC

Phylogeny of species based on NGS reads using alignment-free sequence dissimilarity measures d2* and d2S under different Markov chain models (using k-words)

R package

http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html

Species identification/taxonomic profiling

CLARK

Taxonomic classification of metagenomic reads to known bacterial genomes using k-mer search and LCA assignment

Software (C++)

http://clark.cs.ucr.edu/

FOCUS

Reports organisms present in metagenomic samples and profiles their abundances (uses composition-based approach and non-negative least squares for prediction)

Web service; Software (Python)

http://edwards.sdsu.edu/FOCUS/

GSM

Estimation of abundances of microbial genomes in metagenomic samples (k-mer based)

Software (Go)

https://github.com/pdtrang/GSM

Mash

Species identification using assembled or unassembled Illumina, PacBio, and ONT data (based on MinHash dimensionality-reduction technique)

Software (C++)

https://github.com/marbl/mash
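
The MinHash idea behind this kind of comparison can be sketched in Python as a bottom-s sketch with a Jaccard estimate. This is not Mash's code: the function names are hypothetical, SHA-1 from the standard library stands in for whatever hash a real tool uses, and canonical (strand-independent) k-mers are left out to keep the example short.

```python
import hashlib


def kmer_set(seq, k):
    """Distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (SHA-1 prefix, an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest hash values over all distinct k-mers."""
    return sorted({hash_kmer(km) for km in kmer_set(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from the s smallest hashes of the merged sketches."""
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not union_bottom:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union_bottom if h in shared) / len(union_bottom)


a = "ACGTACGTTAGCCGATAGCTAGGCTA"
b = "ACGTACGTTAGCCGTTAGCTAGGCTA"
s = 8
print(jaccard_estimate(sketch(a, 4, s), sketch(b, 4, s), s))
```

With a sketch size at least as large as the number of distinct k-mers, the estimate equals the exact Jaccard index; smaller sketches trade accuracy for memory.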

Kraken

Taxonomic assignment in metagenome analysis by exact k-mer search; LCA assignment of short reads based on a comprehensive sequence database

Software (C++)

https://ccb.jhu.edu/software/kraken/

LMAT

Assignment of taxonomic labels to reads by k-mer searches in a precomputed database

Software (C++/Python)

https://sourceforge.net/projects/lmat/

stringMLST

k-mer-based tool for MLST directly from the genome sequencing reads

Software (Python)

http://jordan.biology.gatech.edu/page/software/stringMLST

Taxonomer

k-mer-based ultrafast metagenomics tool for assigning taxonomy to sequencing reads from clinical and environmental samples

Web service

http://taxonomer.iobio.io/

Other

d2-tools

Word-based (k-tuple) comparison (pairwise dissimilarity matrix using d2S measure) of metatranscriptomic samples from NGS reads

Software (Python/R)

https://code.google.com/p/d2-tools/

VirHostMatcher

Prediction of hosts from metagenomic viral sequences based on oligonucleotide frequency (ONF) using various distance measures (e.g., d2)

Software (C++)

https://github.com/jessieren/VirHostMatcher

MetaFast

Statistics calculation of metagenome sequences and the distances between them based on assembly using de Bruijn graphs and Bray–Curtis dissimilarity measure

Software (Java)

https://github.com/ctlab/metafast
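
The Bray–Curtis dissimilarity mentioned for MetaFast can be written out as a short Python sketch. It operates on generic feature counts held in dictionaries; treating the features as simple named counts (rather than the de Bruijn graph components the tool derives) is an assumption made to keep the example self-contained, and the function name is hypothetical.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: 1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))."""
    features = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(f, 0), counts_b.get(f, 0)) for f in features)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - 2.0 * shared / total


sample_a = {"x": 3, "y": 1}
sample_b = {"x": 1, "z": 3}
print(bray_curtis(sample_a, sample_b))  # 0.75
```

The value ranges from 0 for identical count profiles to 1 for samples sharing no features.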

HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads

Rahul Nayak — Tue, 08 May 2018 04:27:22 -0500

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
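
To give a feel for the kind of index described above, here is a minimal Python sketch of a plain, linear FM index with backward search. It is deliberately much simpler than HISAT2's graph-based GFM/HGFM scheme: it indexes a single short string, builds the suffix array by naive sorting, and all function names are hypothetical.

```python
def bwt_and_suffix_array(text):
    """Burrows-Wheeler transform and suffix array of text plus a '$' terminator."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive, for small inputs only
    return "".join(text[i - 1] for i in sa), sa


def build_tables(bwt):
    """first_col[c]: count of characters smaller than c; occ[c][i]: count of c in bwt[:i]."""
    alphabet = sorted(set(bwt))
    first_col, total = {}, 0
    for c in alphabet:
        first_col[c] = total
        total += bwt.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first_col, occ


def locate(pattern, sa, first_col, occ):
    """Backward search: narrow the suffix-array interval one pattern character at a time."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in first_col:
            return []
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


genome = "GATTACATTA"
bwt, sa = bwt_and_suffix_array(genome)
first_col, occ = build_tables(bwt)
print(locate("TTA", sa, first_col, occ))  # [2, 7]
```

The search cost depends on the pattern length rather than the text length, which is the property that makes FM-index-based aligners fast.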

More at https://ccb.jhu.edu/software/hisat2/index.shtml

Address of the bookmark: https://github.com/infphilo/hisat2

List of motif discovery tools !

Neel — Tue, 20 Nov 2018 03:54:26 -0600

In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids which may not be adjacent.

The following is a list of tools for motif discovery:

2Dsweep -- protein annotation by secondary structure elements

Perform secondary structure predictions on protein sequences.

3D-footprint -- database of DNA-binding protein structures

Find binding specificity information about DNA-protein complexes.

3D-footprint: DNA-binding protein database

Find information about the binding specificity of DNA-binding proteins.

3D-partner -- a web server to infer interacting partners and binding models

Predict interacting partners and binding models.

3MOTIF -- a protein structure visualization system for conserved sequence motifs

Use this web-based sequence motif visualization system to display sequence motif information in its appropriate three-dimensional (3D) context.

AFAWE -- Automatic functional annotation in a distributed Web Services Environment

Protein function prediction and annotation in an integrated environment powered by web service.

ANCHOR -- Prediction of Protein Binding Regions in Disordered Proteins

Find information about protein binding.

ANNIE -- ANNotation and Interpretation Environment for Protein Sequences

Use to predict function from de novo protein sequences.

Active Sequences Collection (ASC) database -- A new tool to assign functions to protein sequences

Search for short active protein sequences with demonstrated biological activities.

Blocks -- Ungapped segments in conserved protein sequences

Search for ungapped segments corresponding to the most highly conserved regions of proteins.

CASTp -- computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues

Identify and measure surface accessible pockets as well as interior inaccessible cavities, for proteins and other molecules.

CSA -- The Catalytic Site Atlas

Search for catalytic residue annotation for enzymes in the Protein Data Bank.

ConFunc -- Conserved residue Protein Function Prediction Server

Predict protein function using Gene Ontology.

ConSurf-DB -- evolutionary conservation profiles of protein structures database

Automatically calculate evolutionary conservation scores of key amino acid residues and map them on protein structures.

DBAli -- A Database of Structure Alignments

Mine the protein structure space.

DILIMOT -- discovery of linear motifs in proteins

Predict short linear motifs (3-8 residues) in a set of protein sequences.

Dasty2 -- an Ajax protein DAS client

A web client for visualizing protein sequence feature information using DAS.

DomainSweep -- protein annotation by domain analysis

Identify the domain architecture within a protein sequence.

E1DS -- catalytic site prediction based on 1D signatures of concurrent conservation

Predict enzyme catalytic site.

ELM -- Eukaryotic Linear Motif Resource

Predict functional sites in eukaryotic proteins.

EXPASY Proteome Tools Collection

Use a collection of tools for protein analyses.

EXPASY-Findmod

Predict potential protein post-translational modifications and find potential single amino acid substitutions in peptides.

EzCatDB -- the Enzyme Catalytic-mechanism Database

Search for information related to the catalytic mechanisms of enzymes.

FFPred -- feature-based function prediction

An integrated feature-based function prediction server for vertebrate proteomes.

FingerPRINT Scan

Identify the closest matching PRINTS sequence motif fingerprints in a protein sequence.

FireDB -- a database of functionally important residues from proteins of known structure

Search for functional annotation of important sites in proteins with known structures.

Frog2 -- a FRee Online druG 3D conformation generator

Produce 3D conformations of small drug compounds.

HGPD -- Human Gene and Protein Database

A database presenting experiment-based results in human proteomics.

HHsenser -- exhaustive transitive profile search using HMM-HMM comparison

Conduct exhaustive intermediate profile searches of a set of homologous protein sequences.

HotSpot Wizard -- Substrate Specificity Hot Spot Identification web server

Design protein mutations in site-directed mutagenesis.

INTREPID -- INformation-theoretic TREe traversal for Protein functional site IDentification

Use for protein functional site identification.

Integrating protein annotation resources through the Distributed Annotation System

Annotate protein using this integrated annotation resource.

InterProScan -- protein domains identifier

Identify protein family (and DNA) domains, patterns, motifs, protein families, and functional sites.

KFC -- Knowledge-based FADE and Contacts

Interactive forecasting of protein interaction hot spots.

MAGIIC-PRO -- detecting functional signatures by efficient discovery of long patterns in protein sequences

Discover long patterns in protein sequences.

MALISAM -- Manual ALIgnments for Structurally Analogous Motifs

Database containing pairs of structural analogs and their alignments.

MEME -- discovering and analyzing DNA and protein sequence motifs

Find sequence patterns in DNA and protein sequences.

MODPROPEP -- a program for knowledge-based modeling of protein-peptide complexes

A web server for knowledge-based modeling of protein-peptide complexes, specifically peptides in complex with major histocompatibility complex (MHC) proteins and kinases.

MeMo -- a web tool for prediction of protein methylation modifications

Predict protein methylation sites.

MegaMotifBase -- a database of structural motifs in protein families and superfamilies

Find structural segments or motifs for protein structures.

Minimotif Miner -- a tool for investigating protein function

Find motifs in a protein sequence.

Motif3D -- Relating protein sequence motifs to 3D structure

Visualize protein sequence motifs on the 3D protein structures.

MotifScan

Find presence of any known protein motif (Prosite and Pfam) in a protein sequence.

MultiBind -- Multiple Alignment of Protein Binding Sites

Recognize spatial chemical binding patterns common to a set of protein structures.

NMT -- The MYR Predictor

Analyze proteins for the presence of N-terminal N-myristoylation site.

NetNGlyc -- N-Glycosylation sites prediction tool

Find the presence of N-Glycosylation sites in human proteins.

NetOGlyc 3.1 -- O-glycosylation sites prediction tool

Find the presence of O-GalNAc (mucin type) glycosylation sites in mammalian proteins.

NetPhos 2.0 -- Phosphorylation sites predictions

Analyze eukaryotic proteins for the presence of serine, threonine and tyrosine phosphorylation sites.

NetPhosK 1.0 Server -- kinase specific eukaryotic protein phosphorylation sites prediction tool

Find possible kinase specific phosphorylation sites in eukaryotic proteins.

NetworKIN -- a resource for exploring cellular phosphorylation networks

NeuroPred -- a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides

Predict cleavage sites at basic amino acid locations in neuropeptide precursor sequences.

Non-Redundant Patent Sequences - Patented Sequence Database

Find information about patented nucleotide and protein sequences.

O-GLYCBASE

Search for information about glycoproteins with O-linked and C-linked glycosylation sites.

PANDORA -- Protein ANnotation Diagram ORiented Analysis

Find information about protein sequence annotations.

PAR-3D -- Protein Active site Residue - 3D structural motif

A server to predict protein active site residues.

PDBSite -- a database of the 3D structure of protein functional sites

Search for structural and functional information on the protein functional sites.

PDBSiteScan -- A program for searching for active, binding and posttranslational modification sites in the 3D structures of proteins

Search 3D protein fragments similar in structure to known active, binding and posttranslational modification sites.

PEDANT -- Protein Extraction, Description and ANalysis Tool

Conduct genome wide functional and structural analysis.

PHOSIDA -- Phosphorylation site database

Search for phosphorylation data of any protein of interest.

PHOSPHORYLATION SITE DATABASE

Search for information on prokaryotic proteins that undergo serine, threonine, or tyrosine phosphorylation.

PNU -- Protein Naming Utility

Determine correct names for proteins.

POODLE-S -- Prediction Of Order and Disorder by machine LEarning

Web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.

PPISearch -- Protein-Protein Interaction Search

Find homologous protein-protein interactions across multiple species.

PPSearch

Search your query sequence against PROSITE pattern database for protein motifs.
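
As a small illustration of what searching a sequence against a PROSITE-style pattern involves, the Python sketch below converts a pattern into a regular expression and scans a toy sequence. The converter is a hypothetical helper covering only the common syntax elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors). It is not the code behind PPSearch or ScanProsite, and the example sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as 'N-{P}-[ST]-{P}.' to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split an optional repeat suffix: x(2) or x(2,4).
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# N-glycosylation site pattern (PROSITE entry PS00001).
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)  # N[^P][ST][^P]

sequence = "MNGTPNPSANASL"  # toy sequence
matches = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(matches)  # [(9, 'NASL')]
```

Note that `re.finditer` reports non-overlapping matches only; a production scanner would also handle overlapping hits and the rarer pattern syntax.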

PRIDB -- Protein-RNA Interface DataBase

Find information about protein-RNA complexes from the Protein Data Bank (PDB).

PRINTS and its automatic supplement, prePRINTS -- A compendium of protein fingerprints

Search for protein fingerprints.

PROSITE

Identify protein families and domains for a given protein sequence.

PRRDB -- Pattern Recognition Receptor Database

A comprehensive database of pattern-recognition receptors and their ligands.

PatMatch -- a program for finding patterns in peptide and nucleotide sequences

Search for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences.

PepCyber:P~PEP -- a database of human protein protein interactions mediated by phosphoprotein-binding domains

Database specialized in documenting human PPBD-containing proteins and PPBD-mediated interactions.

PeptideCutter -- protein cleavage sites prediction tool

Predicts potential protease cleavage sites and sites cleaved by chemicals in a given protein sequence.

Phobius -- A combined transmembrane topology and signal peptide predictor

Predict combined transmembrane topology and signal peptides.

Phospho.ELM -- a database of phosphorylation sites

Search for eukaryotic phosphorylation sites.

Phospho3D -- a database of three-dimensional structures of protein phosphorylation sites

Search for 3D structure and functional annotation of phosphorylation sites in proteins.

PhosphoSite -- A bioinformatics resource dedicated to physiological protein phosphorylation.

Search the database of in vivo phosphorylation sites of human and mouse proteins.

PolyQ -- Polyglutamine Database

Find information about polyglutamine (polyQ) repeats.

Pratt Protein motif and pattern discovery

Find the presence of protein motifs and patterns in an amino acid sequence.

PrediSi -- Prediction of Signal Peptides and their Cleavage Positions

Predict signal peptide sequences and their cleavage positions in bacterial and eukaryotic amino acid sequences.

ProFunc -- a server for predicting protein function from 3D structure

Predict protein functions based on known structures.

ProMateus -- an open research approach to protein-binding sites analysis

Predict the location of potential protein-protein binding sites for unbound proteins.

ProTeus -- identifying signatures in protein termini

Identify short linear signatures in protein termini.

ProtSweep -- protein annotation by homology

Analyze and identify newly obtained protein sequences.

Protemot -- prediction of protein binding sites with automatically extracted geometrical templates

Predict protein binding sites in a protein sequence based on geometrical analysis of protein tertiary substructures.

QuasiMotiFinder -- protein annotation by searching for evolutionarily conserved motif-like patterns

Search for evolutionarily conserved motif-like patterns in protein sequences.

RNABindR -- software for prediction of RNA binding residues in proteins

Web-based server for analyzing and predicting RNA binding sites in proteins.

SCANMOT -- searching for similar sequences using a simultaneous scan of multiple sequence motifs

Search for similarities between proteins by simultaneous matching of multiple motifs.

SDPpred -- A Tool for Prediction of Amino Acid Residues that Determine Differences in Functional Specificity of Homologous Proteins

Predict residues in protein sequences that determine the proteins' functional specificity.

SDR -- Specificity Determining Residues Database

Predict specificity-determining residues in protein families.

SLiMDisc -- Short, Linear Motif Discovery

Find shared motifs in proteins with a common attribute.

SUMOsp -- a web server for sumoylation site prediction

Conduct in silico sumoylation sites prediction.

SWAKK -- a web server for detecting positive selection in proteins using a sliding window substitution rate analysis

Detect sections of a protein sequence that are under positive selection.

ScanProsite

Search for motifs and patterns within protein sequences.

ScanProsite -- detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins

Detect patterns, profiles and motifs in a protein sequence.

ScanSite 2.0 -- Proteome-wide prediction of cell signaling interactions using short sequence motifs

Search for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains.

SePreSA -- SErver for the PREdiction of populations susceptible to Serious Adverse drug reaction

Find information about populations carrying polymorphisms within protein binding pockets that make them susceptible to serious adverse drug reaction (SADR).

Sequence Motif Search

Search for the presence of a motif in either an amino acid sequence or a nucleotide sequence.

Signal-3L -- A 3-layer approach for predicting signal peptides

Predict signal peptides.

SignalP -- Machine learning approaches to the prediction of signal peptides, their cleavage sites, and other protein sorting signals

Predict signal peptides and their cleavage sites.

Sulfinator -- tyrosine sulfation sites prediction tool

Predict the presence of tyrosine sulfation sites in protein sequences.

SuperSite -- Ligand Binding Site Database

Look at protein structure from a ligand and binding site perspective.

Swiss EMBnet node web server

Use a collection of bioinformatics tools at this portal site.

T-REKS -- identification of Tandem REpeats in sequences with a K-meanS based algorithm

Find information about tandem repeats in proteins that carry fundamental biological functions and are related to a number of human diseases.

TMFunction -- The Functional Database of Membrane Proteins

Find information about functional residues in alpha-helical and beta-barrel membrane proteins.

TOPDOM -- Conservatively Located Domains and Motifs in Transmembrane Proteins

Database of domains and motifs with conservative location in transmembrane proteins.

The EMOTIF database

Search for highly conserved and specific protein sequence motifs.

TreeDet -- Predicting Functional Residues in Protein Sequence Alignments

Predict functional sites in protein sequence alignments using different methodologies.

W-ChIPMotifs -- ChIP-based protein Motif discovery web server

Find de novo protein motifs from chromatin immunoprecipitation data.

WebFEATURE -- an interactive web tool for identifying and visualizing functional sites on macromolecular structures

Scan query structures for functional sites in both proteins and nucleic acids.

WebProAnalyst -- an interactive tool for analysis of quantitative structure-activity relationships in protein families

Analyze quantitative structure-activity relationship of related protein families.

eBLOCKs -- enumerating conserved protein blocks to achieve maximal sensitivity and specificity

Search for ungapped alignments of highly conserved regions among a protein family or superfamily.

eF-seek -- prediction of the functional sites of proteins by searching for similar electrostatic potential and molecular surface shape

Predict the functional sites of proteins.

firestar -- prediction of functionally important residues using structural templates and alignment reliability

An expert system for predicting ligand-binding residues in protein structures.

iMOTdb -- a comprehensive collection of spatially interacting motifs in proteins

Automatically identify spatially interacting motifs among distantly related proteins sharing similar folds and possessing common ancestral lineage.

wgd—simple command line tools for the analysis of ancient whole-genome duplications

LEGE — Thu, 23 Jul 2020 05:49:45 -0500

wgd is an easy-to-use command-line tool for K_S distribution construction. The wgd suite provides commonly used K_S and colinearity analysis workflows together with tools for modeling and visualization, making these analyses conveniently accessible to genomics researchers.

https://academic.oup.com/bioinformatics/article/35/12/2153/5162749

Address of the bookmark: https://github.com/arzwa/wgd
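
To make the idea of a K_S distribution concrete, here is a minimal sketch in plain Python. It is not wgd's code or interface: the function names, the bin width, and the K_S cutoff are all assumptions chosen for illustration. It bins a list of pairwise K_S values (synonymous substitutions per synonymous site between paralog pairs) and reports the most populated bin, since a cluster of paralogs with similar K_S is the kind of signal that can point to an ancient whole-genome duplication.

```python
import math
from collections import Counter


def ks_histogram(ks_values, bin_width=0.1, max_ks=5.0):
    """Count K_S values per fixed-width bin.

    Values outside (0, max_ks) are ignored. Both bin_width and max_ks
    are illustrative defaults, not settings taken from wgd.
    """
    counts = Counter()
    for ks in ks_values:
        if 0.0 < ks < max_ks:
            counts[math.floor(ks / bin_width)] += 1
    return dict(sorted(counts.items()))


def modal_bin(hist, bin_width=0.1):
    """Return the (lower, upper) K_S bounds of the most populated bin."""
    index = max(hist, key=hist.get)
    return (index * bin_width, (index + 1) * bin_width)


# Hypothetical K_S values for a handful of paralog pairs.
example_ks = [0.05, 0.82, 0.85, 0.88, 1.55, 7.0, -1.0]
example_hist = ks_histogram(example_ks)
lower, upper = modal_bin(example_hist)
print(f"Most populated K_S bin: {lower:.1f} to {upper:.1f}")
```

A real analysis would fit a model to the distribution rather than read off the tallest bin; the wgd suite itself provides tools for that modeling and visualization.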

Frequently used bioinformatics tools for viral genome analysis !

Neel — Wed, 23 Jun 2021 07:40:41 -0500

IVA: accurate de novo assembly of RNA virus genomes.
Hunt M, Gall A, Ong SH, Brener J, Ferns B, Goulder P, Nastouli E, Keane JA, Kellam P, Otto TD.
Bioinformatics. 2015 Jul 15;31(14):2374-6. doi: 10.1093/bioinformatics/btv120. Epub 2015 Feb 28.

Adapter sequences:
Optimal enzymes for amplifying sequencing libraries.
Quail, M. A. et al. Nat. Methods 9, 10-1 (2012).

GAGE:
GAGE: A critical evaluation of genome assemblies and assembly algorithms.
Salzberg, S. L. et al. Genome Res. 22, 557-67 (2012).

KMC:
Disk-based k-mer counting on a PC.
Deorowicz, S., Debudaj-Grabysz, A. & Grabowski, S. BMC Bioinformatics 14, 160 (2013).

Kraken:
Kraken: ultrafast metagenomic sequence classification using exact alignments.
Wood, D. E. & Salzberg, S. L. Genome Biol. 15, R46 (2014).

MUMmer:
Versatile and open software for comparing large genomes.
Kurtz, S. et al. Genome Biol. 5, R12 (2004).

R:
R: A language and environment for statistical computing.
R Core Team (2013). R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

RATT:
RATT: Rapid Annotation Transfer Tool.
Otto, T. D., Dillon, G. P., Degrave, W. S. & Berriman, M. Nucleic Acids Res. 39, e57 (2011).

SAMtools:
The Sequence Alignment/Map format and SAMtools.
Li, H. et al. Bioinformatics 25, 2078-9 (2009).

Trimmomatic:
Trimmomatic: A flexible trimmer for Illumina Sequence Data.
Bolger, A. M., Lohse, M. & Usadel, B. Bioinformatics 1-7 (2014).

Bioinformatic tools for pathogens informatics at CVR

Abhi — Sat, 08 Jun 2024 15:59:46 -0500

Novel sequencing and analytical approaches focused on studying viruses and virus-host interactions. Below you will find summaries of, and links to, a number of bioinformatic tools that have been developed at the CVR.

DIGS

The database-integrated genome-screening (DIGS) tool provides a framework for implementing automated in silico screening of sequence databases using BLAST in combination with a relational database (MySQL).
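
The general pattern of pairing BLAST with a relational database can be sketched as follows. This is not DIGS code: it uses Python's built-in sqlite3 in place of MySQL, and the table name, function name, and sample hits are assumptions made up for illustration. The only external convention relied on is the 12-column default layout of BLAST tabular output (`-outfmt 6`).

```python
import sqlite3

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def load_hits(conn, tabular_text):
    """Parse BLAST tabular lines and store them in a `hits` table.

    Returns the number of rows inserted. Lines that do not have
    exactly 12 tab-separated fields are skipped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "qseqid TEXT, sseqid TEXT, pident REAL, length INTEGER, "
        "mismatch INTEGER, gapopen INTEGER, qstart INTEGER, qend INTEGER, "
        "sstart INTEGER, send INTEGER, evalue REAL, bitscore REAL)"
    )
    rows = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        if len(f) != len(BLAST_COLUMNS):
            continue
        rows.append((
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        ))
    conn.executemany(
        "INSERT INTO hits VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", rows
    )
    conn.commit()
    return len(rows)


# Hypothetical screening output: one strong hit and one weak hit.
sample = (
    "probe1\tchr1\t98.50\t200\t3\t0\t1\t200\t1000\t1199\t1e-90\t350\n"
    "probe1\tchr2\t75.00\t80\t20\t0\t10\t89\t500\t579\t0.002\t45.1\n"
)
conn = sqlite3.connect(":memory:")
inserted = load_hits(conn, sample)
strong = conn.execute(
    "SELECT sseqid FROM hits WHERE evalue <= 1e-10 ORDER BY bitscore DESC"
).fetchall()
print(inserted, strong)
```

Once hits live in a table, screening results can be filtered, joined, and re-queried with ordinary SQL instead of re-parsing BLAST output each time.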

DisCVR

DisCVR is a diagnostic tool for detecting known human viruses in clinical samples from Next-Generation Sequencing (NGS) data. The tool uses a simple and straightforward Graphical User Interface and is optimized for Windows OS without compromising speed and accuracy.

DiversiTools

DiversiTools is a computational tool that is specifically tailored towards viral HTS data sets and the analysis of the underlying viral populations that they represent. It was initially developed in collaboration with a number of virologists interested in characterising the intra-host diversity of viral populations and studying their evolution across transmission chains at the micro-evolutionary scale.

GLUE

GLUE is a flexible data-centric bioinformatics environment for virus sequence data, with a focus on virus evolution and genomic variation. GLUE has been applied to a range of viruses. A GLUE-based resource focused on Hepatitis C virus is HCV-GLUE.

Tanoti

Tanoti is a BLAST-guided, reference-based short-read aligner. It was developed to maximise alignment in highly variable next-generation sequence data sets (Illumina).

ViCTree

ViCTree is a bioinformatic framework that automatically selects new candidate virus sequences from GenBank, generates multiple sequence alignments, calculates a maximum likelihood phylogeny and integrates the sequences into the existing phylogenetic trees.

Viral Host Predictor

Viral Host Predictor provides a fast and simple way to predict the hosts and vectors of RNA viruses from viral sequences.

GRACy

GRACy is a bioinformatic tool designed for the analysis of Illumina data originating from human cytomegalovirus samples. GRACy can be used to perform read quality filtering, genotyping, de novo assembly, variant detection, annotation and data submission to public databases.

LoReTTA

LoReTTA (Long Read Template Targeted Assembler) is a reference-assisted de novo assembler specifically designed to deal with PacBio reads generated from viral genomes.

BingleSeq

BingleSeq is an R package that enables user-friendly analysis of count tables obtained from both bulk RNA-Seq and single-cell RNA-Seq protocols. The development of BingleSeq focused on providing a flexible and intuitive user experience.

Hifiasm: a haplotype-resolved assembler for accurate HiFi reads

Jit — Thu, 24 Dec 2020 10:03:36 -0600

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. It can assemble a human genome in several hours and works with the California redwood genome, one of the most complex genomes sequenced so far. Hifiasm can produce primary/alternate assemblies of quality competitive with the best assemblers. It also introduces a new graph binning algorithm and achieves the best haplotype-resolved assembly given trio data.

Address of the bookmark: https://github.com/chhylp123/hifiasm