BOL: Related items

GATB: Genome Analysis Toolbox with de-Bruijn graph

Jit — Thu, 28 Apr 2016 11:16:51 -0500

The Genome Analysis Toolbox with de-Bruijn graph (GATB) provides a set of highly efficient algorithms for analysing NGS data sets. These methods enable the analysis of data sets of any size on multi-core desktop computers, including very large volumes of read data from any kind of organism, such as bacteria, plants, animals and even complex samples (e.g. metagenomes).

More at https://gatb.inria.fr/

Address of the bookmark: https://gatb.inria.fr/

CSBB-v1.0

Neel — Wed, 29 Jun 2016 07:33:05 -0500

CSBB is a command-line bioinformatics suite for analyzing biological data acquired through a variety of biological experiments. CSBB is implemented in Perl and also uses R and Python in the background for specific modules. Its major focus is to let users from the biology and bioinformatics communities perform downstream analysis tasks without having to write code. CSBB is currently available on Linux, UNIX, macOS and Windows platforms.

CSBB currently provides 13 modules focused on analytical tasks such as performing upper-quantile normalization on expression data or converting genome-wide gene expression to z-scores when comparing expression data from different platforms.
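
To make the two tasks named above more concrete, here is a minimal Python sketch. It is not CSBB's own code (CSBB is written in Perl). It assumes that "upper-quantile" means the 75th percentile of the non-zero counts in each sample, and that z-scores are computed within each sample across genes. The function names and the dict-of-lists input layout are hypothetical.

```python
import statistics

def upper_quartile_normalize(counts):
    """Scale each sample (column) by the upper quartile of its non-zero counts.

    counts: dict mapping gene -> list of counts, one entry per sample.
    Assumption: every sample has at least two non-zero counts.
    """
    n_samples = len(next(iter(counts.values())))
    factors = []
    for j in range(n_samples):
        column = [row[j] for row in counts.values() if row[j] > 0]
        # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
        factors.append(statistics.quantiles(column, n=4, method="inclusive")[2])
    # Rescale by the mean factor so values stay on a count-like scale.
    mean_factor = statistics.mean(factors)
    return {
        gene: [row[j] / factors[j] * mean_factor for j in range(n_samples)]
        for gene, row in counts.items()
    }

def sample_z_scores(expr):
    """Convert each sample (column) to z-scores across genes."""
    genes = list(expr)
    n_samples = len(expr[genes[0]])
    z = {gene: [] for gene in genes}
    for j in range(n_samples):
        column = [expr[gene][j] for gene in genes]
        mu = statistics.mean(column)
        sd = statistics.pstdev(column)
        for gene in genes:
            z[gene].append((expr[gene][j] - mu) / sd if sd > 0 else 0.0)
    return z
```

After the z-score step every sample has mean 0 and unit variance, which is what makes values from different platforms roughly comparable.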

More at https://github.com/skygenomics/CSBB-v1.0

Address of the bookmark: https://github.com/skygenomics/CSBB-v1.0

TMAP - torrent mapping alignment program General Notes

Poonam Mahapatra — Sun, 02 Apr 2017 15:53:47 -0500

TMAP is a fast and accurate alignment software for short and long nucleotide sequences produced by next-generation sequencing technologies.

The latest TMAP is unsupported. To use a supported version, please see the TMAP version associated with a Torrent Suite release below.

Get the latest source code:

git clone https://github.com/iontorrent/TMAP.git
cd TMAP
git submodule init
git submodule update

https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

Address of the bookmark: https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

Tools to Predict the Impact of Missense Variants!

Jit — Mon, 23 Apr 2018 12:57:33 -0500

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen‐2, SIFT, FatHMM, MutationTaster‐2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants.

A study of 10 tools on five datasets shows that such a comparative evaluation of these tools is hindered by two types of circularity: they arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used for training and in those used for evaluation of these tools, which may lead to overly optimistic results. Comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate among all tools, and may even outperform optimized combinations of tools.
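
The two types of circularity can be illustrated with a small filtering step. This is only a sketch under assumed data shapes, not the procedure used in the study. Each record is a hypothetical `(protein_id, variant_id, label)` tuple, and `remove_circular` is a made-up helper name.

```python
def remove_circular(evaluation, training):
    """Filter an evaluation set against a tool's training set.

    Returns two lists:
      no_type1 - evaluation records whose exact variant is not in training
      no_type2 - additionally drops every record from a protein seen in training
    """
    train_variants = {(protein, variant) for protein, variant, _ in training}
    train_proteins = {protein for protein, _, _ in training}

    # Type 1 circularity: the same variant occurs in training and evaluation.
    no_type1 = [r for r in evaluation if (r[0], r[1]) not in train_variants]
    # Type 2 circularity: different variants, but from the same protein.
    no_type2 = [r for r in no_type1 if r[0] not in train_proteins]
    return no_type1, no_type2


# Toy records (hypothetical identifiers and labels).
training = [("P1", "A10V", 1), ("P2", "G5R", 0)]
evaluation = [("P1", "A10V", 1), ("P1", "K20E", 0), ("P3", "L7P", 1)]
no_type1, no_type2 = remove_circular(evaluation, training)
```

In the toy data only the `P3` record survives both filters, because `P1` already contributed a variant to the training set.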

The following tools are useful for predicting the impact of missense mutations:

PolyPhen‐2 (PP2)
“Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations”

MutationTaster‐2 (MT2)
“Evaluation of the disease‐causing potential of DNA sequence alterations”

MutationAssessor (MASS)
“Predicts the functional impact of amino acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms”

LRT
“Identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein‐coding sequences, which are likely to be unconditionally deleterious”

SIFT
“Predicts whether an amino acid substitution affects protein function”

GERP++
“Identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as “rejected substitutions.” Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element”

phyloP
“Compute conservation or acceleration P values based on an alignment and a model of neutral evolution”

FatHMM unweighted (FatHMM‐U)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”

FatHMM weighted (FatHMM‐W)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants” and its weighting scheme attributes higher tolerance scores to SNVs in proteins, related proteins, or domains that already include a high fraction of pathogenic variants.

Combined Annotation Dependent Depletion (CADD)
“CADD is a tool for scoring the deleteriousness of single‐nucleotide variants as well as insertion/deletions variants in the human genome”

MITObim - mitochondrial baiting and iterative mapping

Rahul Nayak — Tue, 08 May 2018 04:15:25 -0500

This document contains instructions on how to use the MITObim pipeline described in Hahn et al. 2013. The full article can be found here. Kindly cite the article if you are using MITObim in your work. The pipeline was originally developed for Illumina data, but thanks to the versatility of the MIRA assembler, MITObim in principle also supports data from the Ion Torrent, 454 and PacBio sequencing platforms.

Below you can find a few basic tutorials on how to run MITObim, and I encourage you to give them a try with the test data that comes with this repo, just to make sure everything is running smoothly on your system. It'll only take a few minutes and will potentially save you a lot of time down the line.

I provide further examples here as Jupyter notebooks. Get in touch if you feel like sharing your particular MITObim solution and I'd be happy to put it up here, too!

Address of the bookmark: https://github.com/chrishah/MITObim

gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output

Rahul Nayak — Fri, 24 Jan 2020 06:04:40 -0600

gapFinisher is based on the controlled use of a previously published gap filling tool, FGAP, and works on all standard Linux/UNIX command lines. The authors compare the performance of gapFinisher against two other published gap filling tools, PBJelly and GMcloser.

gapFinisher can fill gaps in draft genomes quickly and reliably.

Address of the bookmark: https://github.com/kammoji/gapFinisher

Tools for Differential expression analysis

Abhi — Tue, 08 Nov 2022 03:40:33 -0600

apeglm - https://bioconductor.org/packages/release/bioc/html/apeglm.html

ashr - https://github.com/stephens999/ashr, https://cran.r-project.org/web/packages/ashr/index.html

consensusDE - https://bioconductor.org/packages/release/bioc/html/consensusDE.html

DESeq2 - https://bioconductor.org/packages/release/bioc/html/DESeq2.html

edgeR - https://bioconductor.org/packages/release/bioc/html/edgeR.html

limma - https://kasperdanielhansen.github.io/genbioconductor/html/limma.html, https://bioconductor.org/packages/release/bioc/html/limma.html

MetaCycle - https://cran.r-project.org/web/packages/MetaCycle/index.html, https://github.com/gangwug/MetaCycle

RUVSeq - https://bioconductor.org/packages/release/bioc/html/RUVSeq.html

SARTools - https://github.com/PF2-pasteur-fr/SARTools

tximport - https://github.com/mikelove/tximport

Virus Bioinformatics Tools

LEGE — Wed, 24 Apr 2024 06:19:55 -0500

Bioinformatics tools play a crucial role in studying viruses, enabling researchers to analyze their genetic makeup, structure, function, and evolution. Here are some commonly used bioinformatics tools for virus research:

https://evirusbioinfc.notion.site/18e21bc49827484b8a2f84463cb40b8d?v=92e7eb6703be4720abf17a901bc9a947

Address of the bookmark: https://evirusbioinfc.notion.site/18e21bc49827484b8a2f84463cb40b8d?v=92e7eb6703be4720abf17a901bc9a947

Predicting Pathogen Virulence Using Bioinformatics Tools

BioStar — Tue, 04 Nov 2025 07:55:53 -0600

In the genomic era, the ability to predict the virulence potential of pathogens has become an indispensable part of infectious disease research. With the exponential growth of microbial genome data, bioinformatics tools now enable scientists to identify virulence factors, model pathogen behavior, and even forecast outbreak risks — all from sequence data.

In an age where pathogens continue to evolve and cross boundaries, understanding what makes them virulent—that is, capable of causing disease—has become a critical focus in modern microbiology and genomics. Virulence prediction bridges computational biology, genomics, and machine learning to forecast the pathogenic potential of microbes before they strike.

What Is Virulence?

Virulence refers to the degree of damage a pathogen can inflict on its host. It is determined by a combination of genetic factors—called virulence factors (VFs)—that allow the organism to attach, invade, evade, and harm the host. These include genes coding for toxins, secretion systems, adhesins, and enzymes that disrupt host defenses.

Understanding virulence factors not only helps in deciphering the mechanisms of infection but also provides early warning signs for emerging threats.

Why Predict Virulence?

Traditional virulence studies relied heavily on experimental infection models, which, although accurate, are time-consuming, expensive, and ethically constrained.
Today, the availability of whole-genome sequences and large-scale pathogen databases has paved the way for in silico virulence prediction—a computational approach that can screen thousands of genomes within hours.

This approach enables researchers to:

Rapidly identify potential high-risk strains.
Prioritize pathogens for containment, surveillance, or further study.
Guide vaccine development and drug target discovery.
Support One Health frameworks, linking animal, human, and environmental health data.

How Is Virulence Predicted?

Virulence prediction combines bioinformatics pipelines with machine learning and comparative genomics. The process generally involves:

Genome Annotation: Identifying genes and coding sequences in microbial genomes.
Feature Extraction: Comparing sequences with curated databases like VFDB (Virulence Factor Database), PATRIC, or Victors.
Pattern Recognition: Using algorithms (e.g., Random Forest, SVM, or deep learning models) to classify genes or strains as virulent or non-virulent based on sequence patterns, motifs, and protein domains.
Scoring and Visualization: Assigning a virulence score or confidence level and visualizing it through heatmaps or genome maps.

Tools and Resources for Virulence Prediction

A number of tools and databases make virulence prediction accessible to the scientific community:

VFanalyzer – For identifying virulence genes based on VFDB.
PathoFact – Predicts virulence, antimicrobial resistance (AMR), and toxin genes from metagenomic data.
Pangenome-based models – Identify virulence-associated gene clusters across strains.
Machine learning models – Use features like GC content, codon usage bias, or protein domains to predict pathogenicity.
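
As a rough illustration of the sequence features named in the last item, the sketch below computes GC content and plain codon frequencies for a coding sequence. Codon frequencies are used here as a simple stand-in for codon usage bias; the function names are hypothetical, and none of this reproduces any specific tool listed above.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons containing ambiguous bases such as N.
    codons = [c for c in codons if set(c) <= set("ACGT")]
    total = len(codons)
    return {codon: n / total for codon, n in Counter(codons).items()} if total else {}

def feature_vector(cds):
    """Combine the features into one dict, ready for a classifier of your choice."""
    features = {"gc": gc_content(cds)}
    features.update({"codon_" + c: f for c, f in codon_frequencies(cds).items()})
    return features
```

A vector like this would then be passed, together with labels from a curated database, to a classifier such as a Random Forest or SVM.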

Emerging tools now integrate multi-omic data—including transcriptomics, proteomics, and metabolomics—to understand virulence in a systems biology framework.

Applications in the Real World

Virulence prediction has major implications across public health and research sectors:

Epidemic preparedness: Early identification of virulent strains in outbreak samples.
AMR surveillance: Linking virulence profiles with antibiotic resistance determinants.
Environmental monitoring: Predicting pathogenic potential of soil or waterborne microbes.
Clinical diagnostics: Supporting personalized treatment through pathogen profiling.

For instance, integrating virulence prediction pipelines into national surveillance networks could enable faster risk assessment and response to infectious outbreaks.

The Road Ahead

As machine learning and genomics advance, virulence prediction will evolve from simple gene-based detection to dynamic, context-aware models that account for host–pathogen interactions, environmental signals, and evolutionary adaptation.

Future tools may predict not just if a strain is virulent, but under what conditions it expresses that virulence—bridging the gap between genotype and phenotype.

In Summary

Virulence prediction is redefining how we understand and anticipate infectious diseases. By coupling genomic insights with computational intelligence, researchers can identify potential threats earlier, design smarter interventions, and ultimately, strengthen our preparedness against emerging pathogens.

Alignment-free sequence comparison tools available for next-generation sequencing data analysis

Abhimanyu Singh — Tue, 07 Nov 2017 05:33:33 -0600

kallisto

Transcript abundance quantification from RNA-seq data (uses pseudoalignment for rapid determination of read compatibility with targets)

Software (C++)

https://pachterlab.github.io/kallisto/
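
The pseudoalignment idea mentioned for kallisto can be sketched roughly as follows: instead of aligning a read base by base, look up its k-mers and intersect the sets of transcripts that contain them. This is a simplified illustration using a plain k-mer dictionary; kallisto itself works on a transcriptome de Bruijn graph, and the function names below are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all indexed k-mers in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer not in the index, e.g. because of a sequencing error
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Toy transcripts (made-up sequences) sharing a common prefix.
transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT"}
index = build_kmer_index(transcripts, k=5)
shared = compatible_transcripts("ACGTAC", index, k=5)
unique = compatible_transcripts("GTACGG", index, k=5)
```

A read from the shared prefix stays compatible with both transcripts, while a read covering the diverging region is compatible with only one.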

Sailfish

Estimation of isoform abundances from reference sequences and RNA-seq data (k-mer based)

Software (C++)

http://www.cs.cmu.edu/~ckingsf/software/sailfish/

Salmon

Quantification of the expression of transcripts using RNA-seq data (uses k-mers)

Software (C++)

https://combine-lab.github.io/salmon/

RNA-Skim

RNA-seq quantification at transcript-level (partitions the transcriptome into disjoint transcript clusters; uses sig-mers, a special type of k-mers)

Software (C++)

http://www.csbio.unc.edu/rs/

Variant calling

ChimeRScope

Fusion transcript prediction using gene k-mers profiles of the RNA-seq paired-end reads

Software (Java)

https://github.com/ChimeRScope/ChimeRScope/wiki

FastGT

Genotyping of known SNV/SNP variants directly from raw NGS sequence reads by counting unique k-mers

Software (C)

https://github.com/bioinfo-ut/GenomeTester4/

Phy-Mer

Reference-independent mitochondrial haplogroup classifier from NGS data (k-mer based)

Software (Python)

https://github.com/danielnavarrogomez/phy-mer

LAVA

Genotyping of known SNPs (dbSNP and Affymetrix's Genome-Wide Human SNP Array) from raw NGS reads (k-mer based)

Software (C)

http://lava.csail.mit.edu/

MICADo

Detection of mutations in targeted third-generation NGS data (can distinguish patient-specific mutations; algorithm uses k-mers and is based on colored de Bruijn graphs)

Software (Python)

http://github.com/cbib/MICADo

General mapper

Minimap

Lightweight and fast read mapper and read overlap detector (uses the concept of “minimizers”, a special type of k-mers)

Software (C)

https://github.com/lh3/minimap
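
For readers unfamiliar with minimizers, here is a minimal sketch of the general (w, k)-minimizer idea: slide a window over w consecutive k-mers and keep the smallest k-mer of each window. The sketch orders k-mers lexicographically and ignores the reverse strand for simplicity. Real mappers typically order k-mers by a hash value instead, so this is an illustration of the concept rather than minimap's implementation.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Leftmost smallest k-mer in the window (lexicographic order).
        offset = min(range(w), key=lambda i: window[i])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


picked = minimizers("ACGTTGCA", k=3, w=3)
```

Two sequences that share a long enough exact stretch are guaranteed to share a minimizer from it, which is why comparing minimizers is enough to seed overlaps.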

Assembly

De novo genome assembly

MHAP

Produces highly continuous assembly (fully resolved chromosome arms) from third-generation long and noisy reads (10 kbp) using a dimensionality reduction technique MinHash

Software (Java)

https://github.com/marbl/MHAP

Miniasm

Assembler of long noisy reads (SMRT, ONT) using the Overlap-Layout-Consensus (OLC) approach without the necessity of an error correction stage (uses minimap)

Software (C)

https://github.com/lh3/miniasm

LINKS

Scaffolding of genome assemblies with error-containing long sequences (e.g., ONT or PacBio reads, draft genomes)

Software (Perl)

https://github.com/warrenlr/LINKS/

Read clustering

afcluster

Clustering of reads from different genes and different species based on k-mer counts

Software (C++)

https://github.com/luscinius/afcluster

QCluster

Clustering of reads with alignment-free measures (k-mer based) and quality values

Software (C++)

http://www.dei.unipd.it/~ciompin/main/qcluster.html

Reads error correction

Lighter

Correction of sequencing errors in raw, whole genome sequencing reads (k-mer based)

Software (C++)

https://github.com/mourisl/Lighter

QuorUM

Error corrector for Illumina reads using k-mers

Software (C++)

https://github.com/gmarcais/Quorum

Trowel

Error corrector for Illumina reads (k-mer based; uses base quality values)

Software (C++)

https://sourceforge.net/projects/trowel-ec/

Metagenomics

Assembly-free phylogenomics

AAF

Phylogeny reconstruction directly from unassembled raw sequence data from whole genome sequencing projects; provides bootstrap support to assess uncertainty in the tree topology (k-mer based)

Software (Python)

https://github.com/fanhuan/AAF

kSNP v3

Reference-free SNP identification and estimation of phylogenetic trees using SNPs (based on k-mer analysis)

Software (C)

https://sourceforge.net/projects/ksnp/files/

NGS-MC

Phylogeny of species based on NGS reads using alignment-free sequence dissimilarity measures d2* and d2S under different Markov chain models (using k-words)

R package

http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html
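
The d2S dissimilarity named above compares background-corrected word counts of two sequences. The sketch below follows the commonly cited definition, D2S = sum over words of X~w * Y~w / sqrt(X~w^2 + Y~w^2), scaled into the range 0 to 1. It assumes the simplest background model (independent bases with frequencies estimated from each sequence, i.e. a zero-order Markov chain) and works on two plain sequences rather than read sets. It is an illustration, not the NGS-MC implementation.

```python
import itertools
import math
from collections import Counter

def centered_counts(seq, k):
    """Word counts minus their expectation under an i.i.d. base model."""
    n_words = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    observed = Counter(seq[i:i + k] for i in range(n_words))
    centered = {}
    for word in map("".join, itertools.product("ACGT", repeat=k)):
        p_word = math.prod(base_freq[b] for b in word)
        centered[word] = observed[word] - n_words * p_word
    return centered

def d2s_dissimilarity(seq_x, seq_y, k):
    """d2S dissimilarity in [0, 1]; 0 means identical centered word profiles."""
    x, y = centered_counts(seq_x, k), centered_counts(seq_y, k)
    num = norm_x = norm_y = 0.0
    for word in x:
        denom = math.sqrt(x[word] ** 2 + y[word] ** 2)
        if denom == 0:
            continue  # word carries no signal in either sequence
        num += x[word] * y[word] / denom
        norm_x += x[word] ** 2 / denom
        norm_y += y[word] ** 2 / denom
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return 0.5 * (1 - num / (math.sqrt(norm_x) * math.sqrt(norm_y)))
```

Using higher-order Markov chains for the background, as the entry describes, would only change how `p_word` is computed.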

Species identification/taxonomic profiling

CLARK

Taxonomic classification of metagenomic reads to known bacterial genomes using k-mer search and LCA assignment

Software (C++)

http://clark.cs.ucr.edu/

FOCUS

Reports organisms present in metagenomic samples and profiles their abundances (uses composition-based approach and non-negative least squares for prediction)

Web service; Software (Python)

http://edwards.sdsu.edu/FOCUS/

GSM

Estimation of abundances of microbial genomes in metagenomic samples (k-mer based)

Software (Go)

https://github.com/pdtrang/GSM

Mash

Species identification using assembled or unassembled Illumina, PacBio, and ONT data (based on MinHash dimensionality-reduction technique)

Software (C++)

https://github.com/marbl/mash
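
The MinHash technique behind Mash and MHAP can be sketched in a few lines: hash every k-mer, keep only the s smallest hash values as a sketch, and estimate the Jaccard similarity of two sequences from their sketches alone. The version below is a bottom-s sketch that uses truncated SHA-1 as a stand-in hash function; that choice and the function names are assumptions made for illustration, and reverse complements are ignored.

```python
import hashlib

def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (stand-in hash function)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def bottom_sketch(seq, k, s):
    """The s smallest distinct k-mer hash values of a sequence."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]

def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


seq_a, seq_b = "ACGTACGTGA", "ACGTACGTCC"
estimate = jaccard_estimate(
    bottom_sketch(seq_a, 4, 100), bottom_sketch(seq_b, 4, 100), 100
)
```

When s is at least the number of distinct k-mers, as in this toy call, the estimate equals the exact Jaccard index; with genome-sized inputs a sketch of a few thousand hashes is used and the result is an approximation.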

Kraken

Taxonomic assignment in metagenome analysis by exact k-mer search; LCA assignment of short reads based on a comprehensive sequence database

Software (C++)

https://ccb.jhu.edu/software/kraken/
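
Exact k-mer search with LCA assignment, as described for Kraken and CLARK, can be illustrated with a toy database: each k-mer is stored with the lowest common ancestor (LCA) of all genomes that contain it, and a read is assigned to the taxon whose path to the root collects the most k-mer hits. The taxonomy, sequences and function names below are hypothetical, and the tie-breaking is simplified compared with the real tools.

```python
from collections import Counter

# Toy taxonomy (hypothetical names): child -> parent; the root has parent None.
PARENT = {"root": None, "genus_G": "root",
          "species_A": "genus_G", "species_B": "genus_G"}

def lineage(taxon, parent=PARENT):
    """Path from a taxon up to the root, starting with the taxon itself."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = parent[taxon]
    return path

def lca(a, b, parent=PARENT):
    """Lowest common ancestor of two taxa."""
    ancestors = set(lineage(a, parent))
    for taxon in lineage(b, parent):
        if taxon in ancestors:
            return taxon
    return None

def build_database(genomes, k, parent=PARENT):
    """Map each k-mer to the LCA of all taxa whose genome contains it."""
    db = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            db[kmer] = lca(db[kmer], taxon, parent) if kmer in db else taxon
    return db

def classify(read, db, k, parent=PARENT):
    """Assign the hit taxon with the best-supported root-to-taxon path."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = Counter(db[kmer] for kmer in kmers if kmer in db)
    if not hits:
        return None  # unclassified

    def path_score(taxon):
        return sum(hits[t] for t in lineage(taxon, parent))

    # Simplification: ties are broken by depth, then by first occurrence.
    return max(hits, key=lambda t: (path_score(t), len(lineage(t, parent))))


genomes = {"species_A": "ACGTACGGA", "species_B": "ACGTACCTT"}
db = build_database(genomes, k=5)
```

A read made only of k-mers shared by both species can be placed no lower than the genus, while a read reaching into a species-specific region is assigned to that species.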

LMAT

Assignment of taxonomic labels to reads by k-mer searches in a precomputed database

Software (C++/Python)

https://sourceforge.net/projects/lmat/

stringMLST

k-mer-based tool for MLST directly from the genome sequencing reads

Software (Python)

http://jordan.biology.gatech.edu/page/software/stringMLST

Taxonomer

k-mer-based ultrafast metagenomics tool for assigning taxonomy to sequencing reads from clinical and environmental samples

Web service

http://taxonomer.iobio.io/

Other

d2-tools

Word-based (k-tuple) comparison (pairwise dissimilarity matrix using d2S measure) of metatranscriptomic samples from NGS reads

Software (Python/R)

https://code.google.com/p/d2-tools/

VirHostMatcher

Prediction of hosts from metagenomic viral sequences based on oligonucleotide frequency (ONF) using various distance measures (e.g., d2)

Software (C++)

https://github.com/jessieren/VirHostMatcher

MetaFast

Statistics calculation of metagenome sequences and the distances between them based on assembly using de Bruijn graphs and Bray–Curtis dissimilarity measure

Software (Java)

https://github.com/ctlab/metafast
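
The Bray–Curtis dissimilarity used to compare samples is simple to compute once each sample is reduced to a vector of non-negative abundances. The sketch below assumes each sample is a dict from a feature name (for example a graph component) to its abundance; the feature names and values are made up.

```python
def bray_curtis(sample_x, sample_y):
    """Bray-Curtis dissimilarity: sum|x - y| / sum(x + y), between 0 and 1."""
    features = set(sample_x) | set(sample_y)
    diff = sum(abs(sample_x.get(f, 0) - sample_y.get(f, 0)) for f in features)
    total = sum(sample_x.get(f, 0) + sample_y.get(f, 0) for f in features)
    return diff / total if total else 0.0


# Toy abundance profiles (hypothetical feature names and values).
x = {"c1": 10, "c2": 0, "c3": 5}
y = {"c1": 5, "c2": 5, "c3": 5}
distance = bray_curtis(x, y)
```

A value of 0 means the two profiles are identical, and 1 means they share no features at all.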