BOL: Related items

MIX: Combining multiple assemblies from NGS data

Rahul Nayak — Tue, 08 May 2018 04:58:05 -0500

Mix is a tool that combines two or more draft assemblies without relying on a reference genome. Its goal is to reduce contig fragmentation and thus speed up genome finishing. The proposed algorithm builds an extension graph in which vertices represent contig extremities and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a path in the extension graph that maximizes the cumulative contig length.
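As a rough illustration of the "path that maximizes the cumulative contig length" idea, here is a minimal sketch. It is not the Mix implementation: it assumes a simplified, acyclic graph whose nodes are whole contigs, ignores overlap lengths and orientations, and all names (`longest_contig_path`, the toy contigs) are hypothetical.

```python
from functools import lru_cache


def longest_contig_path(lengths, edges):
    """Return (total_length, path) for the path with the largest summed contig length.

    lengths: dict mapping contig name -> length in bp
    edges:   dict mapping contig name -> contigs it can be extended into
    Assumption: the graph is acyclic (a simplification for illustration).
    """

    @lru_cache(maxsize=None)
    def best_from(node):
        # Best path that starts at `node`: the node alone, or node + best continuation.
        best_len, best_path = lengths[node], (node,)
        for nxt in edges.get(node, ()):
            sub_len, sub_path = best_from(nxt)
            if lengths[node] + sub_len > best_len:
                best_len, best_path = lengths[node] + sub_len, (node,) + sub_path
        return best_len, best_path

    return max((best_from(n) for n in lengths), key=lambda r: r[0])


# Toy example: four contigs, with alignments allowing A->B, A->C, B->D, C->D.
lengths = {"A": 500, "B": 300, "C": 800, "D": 200}
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
total, path = longest_contig_path(lengths, edges)
print(total, path)
```

In this toy graph the path A, C, D wins because it accumulates more contig length than A, B, D.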

The Mix algorithm, approach, and results were published in BMC Bioinformatics: http://www.biomedcentral.com/1471-2105/14/S15/S16

Address of the bookmark: https://github.com/cbib/MIX

Introduction to phylogenies in R

Abhi — Wed, 13 Oct 2021 02:27:21 -0500

Phylogenetics in R is built on contributed packages, and there are many of them. Let's begin today by installing a few critical packages, such as ape, phangorn, phytools, and geiger. To get the most recent CRAN versions of these packages, you will need to have R 3.3.x installed on your computer!

Address of the bookmark: http://www.phytools.org/Cordoba2017/ex/2/Intro-to-phylogenies.html

Interesting Bioinformatics Resources !

Abhi — Fri, 11 Nov 2022 06:30:46 -0600

1. A reproducible workflow: https://www.youtube.com/watch?v=s3JldKoA0zw This two-minute video will change your mind on reproducible research.

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. A video by Dr. Keith A. Baggerly from MD Anderson, [The Importance of Reproducible Research in High-Throughput Biology](https://www.youtube.com/watch?v=7gYIs7uYbMo); highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042 A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Important Bioinformatics Tools !

BioStar — Tue, 30 Jul 2024 05:03:29 -0500

1. Ktrim: An extra-fast, accurate adapter trimmer for sequencing data. It processes FASTQ files from multiple lanes with minimal mismatching and over-trimming of adapters.

2. BWA MEM: A reliable alignment tool (particularly for mapping ALT contigs and HLA genes, which are not fully addressed in BWA-MEM2).

3. Sambamba markdup: Quickly marks or removes duplicate reads using Picard's criteria.

4. ichorCNA: Estimates the tumor DNA fraction in cell-free DNA from ultra-low-pass whole genome sequencing (0.1x coverage) based on copy number alterations (CNA).

5. Fragle: A deep learning method for quantifying ctDNA levels from cell-free DNA fragmentomic profiles. It detects a tumor fraction (TF) as low as ~1% ctDNA and works with targeted genomic panel sequencing data.

6. AlfredQC: A quality control tool for high-throughput sequencing data. It assesses metrics like read quality scores, GC content, and duplication rates, visualized through detailed plots and summary statistics.

7. Mosdepth: A fast tool for calculating sequencing coverage depth, offering a quicker alternative to samtools/sambamba depth by processing BAM and CRAM files.

8. Bedtools: A versatile toolkit for genomics, enabling operations like intersect, merge, count, and shuffle on genomic intervals across formats such as BAM, BED, GFF/GTF, and VCF.

9. Datamash: A command-line tool for basic numeric, textual, and statistical operations on input data streams. It supports operations such as grouping, sorting, transposing, and performing arithmetic calculations on tabular data.

10. gwf.app: A pragmatic alternative to Snakemake. Developed at Aarhus University, this flexible, generic workflow tool builds and runs large scientific workflows.

Predicting Pathogen Virulence Using Bioinformatics Tools

BioStar — Tue, 04 Nov 2025 07:55:53 -0600

In the genomic era, the ability to predict the virulence potential of pathogens has become an indispensable part of infectious disease research. With the exponential growth of microbial genome data, bioinformatics tools now enable scientists to identify virulence factors, model pathogen behavior, and even forecast outbreak risks — all from sequence data.

In an age where pathogens continue to evolve and cross boundaries, understanding what makes them virulent—that is, capable of causing disease—has become a critical focus in modern microbiology and genomics. Virulence prediction bridges computational biology, genomics, and machine learning to forecast the pathogenic potential of microbes before they strike.

What Is Virulence?

Virulence refers to the degree of damage a pathogen can inflict on its host. It is determined by a combination of genetic factors—called virulence factors (VFs)—that allow the organism to attach, invade, evade, and harm the host. These include genes coding for toxins, secretion systems, adhesins, and enzymes that disrupt host defenses.

Understanding virulence factors not only helps in deciphering the mechanisms of infection but also provides early warning signs for emerging threats.

Why Predict Virulence?

Traditional virulence studies relied heavily on experimental infection models, which, although accurate, are time-consuming, expensive, and ethically constrained.
Today, the availability of whole-genome sequences and large-scale pathogen databases has paved the way for in silico virulence prediction—a computational approach that can screen thousands of genomes within hours.

This approach enables researchers to:

Rapidly identify potential high-risk strains.
Prioritize pathogens for containment, surveillance, or further study.
Guide vaccine development and drug target discovery.
Support One Health frameworks, linking animal, human, and environmental health data.

How Is Virulence Predicted?

Virulence prediction combines bioinformatics pipelines with machine learning and comparative genomics. The process generally involves:

1. Genome Annotation: Identifying genes and coding sequences in microbial genomes.
2. Feature Extraction: Comparing sequences with curated databases like VFDB (Virulence Factor Database), PATRIC, or Victors.
3. Pattern Recognition: Using algorithms (e.g., Random Forest, SVM, or deep learning models) to classify genes or strains as virulent or non-virulent based on sequence patterns, motifs, and protein domains.
4. Scoring and Visualization: Assigning a virulence score or confidence level and visualizing it through heatmaps or genome maps.
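The steps above can be sketched in a few lines of Python. This is only a toy under stated assumptions: genes are taken as already annotated, shared k-mer containment stands in for a real database search against VFDB, a fixed threshold stands in for a trained classifier, and the score (fraction of genes that match a virulence factor) is a hypothetical choice. The sequences and names are made up.

```python
def kmers(seq, k=5):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, ref, k=5):
    """Fraction of the query's k-mers that also occur in the reference."""
    q = kmers(query, k)
    return len(q & kmers(ref, k)) / len(q) if q else 0.0


def virulence_hits(genes, vf_db, min_containment=0.8, k=5):
    """Map each gene to its best-matching virulence factor above the threshold."""
    hits = {}
    for gene_name, gene_seq in genes.items():
        for vf_name, vf_seq in vf_db.items():
            c = containment(gene_seq, vf_seq, k)
            if c >= min_containment and c > hits.get(gene_name, ("", 0.0))[1]:
                hits[gene_name] = (vf_name, c)
    return hits


def virulence_score(genes, vf_db, **kwargs):
    """Toy score: fraction of annotated genes that hit a known virulence factor."""
    if not genes:
        return 0.0
    return len(virulence_hits(genes, vf_db, **kwargs)) / len(genes)


# Made-up sequences for illustration only.
vf_db = {"toxinA": "ATGGCTAAAGGTCTGACC"}
genes = {"g1": "ATGGCTAAAGGTCTGACC", "g2": "TTTTCCCCGGGGAAAATT"}
print(virulence_hits(genes, vf_db))
print(virulence_score(genes, vf_db))
```

A real pipeline would replace each stand-in with a dedicated tool (a gene caller, a database search, a trained model); the structure of the data flow is the point here.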

Tools and Resources for Virulence Prediction

A number of tools and databases make virulence prediction accessible to the scientific community:

VFanalyzer – For identifying virulence genes based on VFDB.
PathoFact – Predicts virulence, antimicrobial resistance (AMR), and toxin genes from metagenomic data.
Pangenome-based models – Identify virulence-associated gene clusters across strains.
Machine learning models – Use features like GC content, codon usage bias, or protein domains to predict pathogenicity.
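Two of the features mentioned above, GC content and codon usage, are simple to compute. The sketch below assumes an in-frame coding sequence and reports raw codon frequencies only; established codon-bias measures need additional normalization that is not shown. Function names are illustrative.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


print(gc_content("ATGC"))
print(codon_frequencies("ATGAAAAAATAA"))
```

Feature vectors built this way could then be passed to any classifier of choice.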

Emerging tools now integrate multi-omic data—including transcriptomics, proteomics, and metabolomics—to understand virulence in a systems biology framework.

Applications in the Real World

Virulence prediction has major implications across public health and research sectors:

Epidemic preparedness: Early identification of virulent strains in outbreak samples.
AMR surveillance: Linking virulence profiles with antibiotic resistance determinants.
Environmental monitoring: Predicting pathogenic potential of soil or waterborne microbes.
Clinical diagnostics: Supporting personalized treatment through pathogen profiling.

For instance, integrating virulence prediction pipelines into national surveillance networks could enable faster risk assessment and response to infectious outbreaks.

The Road Ahead

As machine learning and genomics advance, virulence prediction will evolve from simple gene-based detection to dynamic, context-aware models that account for host–pathogen interactions, environmental signals, and evolutionary adaptation.

Future tools may predict not just if a strain is virulent, but under what conditions it expresses that virulence—bridging the gap between genotype and phenotype.

In Summary

Virulence prediction is redefining how we understand and anticipate infectious diseases. By coupling genomic insights with computational intelligence, researchers can identify potential threats earlier, design smarter interventions, and ultimately, strengthen our preparedness against emerging pathogens.

Alignment-free sequence comparison tools available for next-generation sequencing data analysis

Abhimanyu Singh — Tue, 07 Nov 2017 05:33:33 -0600

kallisto

Transcript abundance quantification from RNA-seq data (uses pseudoalignment for rapid determination of read compatibility with targets)

Software (C++)

https://pachterlab.github.io/kallisto/
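To give a feel for pseudoalignment, the sketch below finds the set of transcripts compatible with a read by intersecting the transcript sets of the read's k-mers. It is a simplification, not kallisto's code: kallisto works on a transcriptome de Bruijn graph with further optimizations, and the function names and sequences here are hypothetical.

```python
def kmers(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Transcripts compatible with every indexed k-mer of the read."""
    compatible = None
    for km in kmers(read, k):
        hits = index.get(km)
        if hits is None:
            continue  # toy handling: ignore k-mers missing from the index
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


transcripts = {"t1": "ACGTACGGA", "t2": "ACGTACCTT"}
index = build_index(transcripts, k=4)
print(pseudoalign("GTACGG", index, k=4))  # read unique to t1
print(pseudoalign("ACGTAC", index, k=4))  # read shared by both transcripts
```

No base-level alignment is computed at any point; only set membership is tracked, which is why this style of method is fast.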

Sailfish

Estimation of isoform abundances from reference sequences and RNA-seq data (k-mer based)

Software (C++)

http://www.cs.cmu.edu/~ckingsf/software/sailfish/

Salmon

Quantification of the expression of transcripts using RNA-seq data (uses k-mers)

https://combine-lab.github.io/salmon/

RNA-Skim

RNA-seq quantification at transcript-level (partitions the transcriptome into disjoint transcript clusters; uses sig-mers, a special type of k-mers)

Software (C++)

http://www.csbio.unc.edu/rs/

Variant calling

ChimeRScope

Fusion transcript prediction using gene k-mers profiles of the RNA-seq paired-end reads

Software (Java)

https://github.com/ChimeRScope/ChimeRScope/wiki

FastGT

Genotyping of known SNV/SNP variants directly from raw NGS sequence reads by counting unique k-mers

Software (C)

https://github.com/bioinfo-ut/GenomeTester4/

Phy-Mer

Reference-independent mitochondrial haplogroup classifier from NGS data (k-mer based)

Software (Python)

https://github.com/danielnavarrogomez/phy-mer

LAVA

Genotyping of known SNPs (dbSNP and Affymetrix's Genome-Wide Human SNP Array) from raw NGS reads (k-mer based)

Software (C)

http://lava.csail.mit.edu/

MICADo

Detection of mutations in targeted third-generation NGS data (can distinguish patients’ specific mutations; algorithm uses k-mers and is based on colored de Bruijn graphs)

Software (Python)

http://github.com/cbib/MICADo

General mapper

Minimap

Lightweight and fast read mapper and read overlap detector (uses the concept of “minimizers”, a special type of k-mers)

Software (C)

https://github.com/lh3/minimap
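The minimizer idea can be illustrated briefly: in every window of w consecutive k-mers, keep only the smallest one. The sketch below orders k-mers lexicographically on a single strand, which is an assumption made for readability; real mappers typically order k-mers by a hash value and take both strands into account.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Each window covers w consecutive k-mers; the leftmost smallest k-mer
    (lexicographic order, an illustrative choice) is kept.
    """
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


print(minimizers("ACGTTGCA", k=3, w=3))
```

Because neighboring windows often share the same minimizer, far fewer k-mers need to be indexed than with a full k-mer index, while two sequences that share a long enough exact match are still guaranteed to share a minimizer.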

Assembly

De novo genome assembly

MHAP

Produces highly continuous assemblies (fully resolved chromosome arms) from long, noisy third-generation reads (10 kbp) using MinHash, a dimensionality-reduction technique

Software (Java)

https://github.com/marbl/MHAP
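A minimal MinHash sketch, assuming the bottom-s variant (keep the s smallest hash values of a sequence's k-mers) and SHA-1 as the hash function; this is not necessarily the scheme MHAP implements. The Jaccard similarity of two k-mer sets is estimated from the sketches alone.

```python
import hashlib


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k, size):
    """Bottom-s sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


a = sketch("ACGTACGTAC", k=3, size=4)
b = sketch("ACGTACGTTT", k=3, size=4)
print(jaccard_estimate(a, b, size=4))
```

The sketches are much smaller than the sequences, which is what makes all-against-all comparison of long reads tractable.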

Miniasm

Assembler of long noisy reads (SMRT, ONT) using the Overlap-Layout Consensus (OLC) approach without the necessity of an error correction stage (uses minimap)

Software (C)

https://github.com/lh3/miniasm

LINKS

Scaffolding of genome assemblies with error-containing long sequences (e.g., ONT or PacBio reads, draft genomes)

Software (Perl)

https://github.com/warrenlr/LINKS/

Read clustering

afcluster

Clustering of reads from different genes and different species based on k-mer counts

Software (C++)

https://github.com/luscinius/afcluster

QCluster

Clustering of reads with alignment-free measures (k-mer based) and quality values

Software (C++)

http://www.dei.unipd.it/~ciompin/main/qcluster.html

Reads error correction

Lighter

Correction of sequencing errors in raw, whole genome sequencing reads (k-mer based)

Software (C++)

https://github.com/mourisl/Lighter

QuorUM

Error corrector for Illumina reads using k-mers

Software (C++)

https://github.com/gmarcais/Quorum

Trowel

Software (C++)

https://sourceforge.net/projects/trowel-ec/

Metagenomics

Assembly-free phylogenomics

AAF

Phylogeny reconstruction directly from unassembled raw sequence data from whole genome sequencing projects; provides bootstrap support to assess uncertainty in the tree topology (k-mer based)

Software (Python)

https://github.com/fanhuan/AAF

kSNP v3

Reference-free SNP identification and estimation of phylogenetic trees using SNPs (based on k-mer analysis)

Software (C)

https://sourceforge.net/projects/ksnp/files/

NGS-MC

Phylogeny of species based on NGS reads using alignment-free sequence dissimilarity measures d2* and d2S under different Markov chain models (using k-words)

R package

http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html

Species identification/taxonomic profiling

CLARK

Taxonomic classification of metagenomic reads to known bacterial genomes using k-mer search and LCA assignment

Software (C++)

http://clark.cs.ucr.edu/

FOCUS

Reports organisms present in metagenomic samples and profiles their abundances (uses composition-based approach and non-negative least squares for prediction)

Web service / Software (Python)

http://edwards.sdsu.edu/FOCUS/

GSM

Estimation of abundances of microbial genomes in metagenomic samples (k-mer based)

Software (Go)

https://github.com/pdtrang/GSM

Mash

Species identification using assembled or unassembled Illumina, PacBio, and ONT data (based on MinHash dimensionality-reduction technique)

Software (C++)

https://github.com/marbl/mash

Kraken

Taxonomic assignment in metagenome analysis by exact k-mer search; LCA assignment of short reads based on a comprehensive sequence database

Software (C++)

https://ccb.jhu.edu/software/kraken/
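The combination of exact k-mer lookup and lowest common ancestor (LCA) assignment can be sketched as follows. This is a toy under stated assumptions, not Kraken's code: the taxonomy, genomes, and function names are made up, a k-mer shared by several genomes is stored under the LCA of their taxa, and a read is assigned to the hit taxon whose root-to-leaf path collects the most k-mer hits (ties are broken arbitrarily here).

```python
from collections import Counter


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def lca(a, b, parent):
    """Lowest common ancestor of two taxa in a tree given as child -> parent."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b is not None and b not in ancestors:
        b = parent.get(b)
    return b


def build_db(genomes, parent, k):
    """Map each k-mer to the LCA of all taxa whose genome contains it."""
    db = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            db[km] = lca(db[km], taxon, parent) if km in db else taxon
    return db


def classify(read, db, parent, k):
    """Assign a read to the hit taxon with the best root-to-leaf hit count."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None  # unclassified

    def path_score(taxon):
        score = 0
        while taxon is not None:
            score += hits.get(taxon, 0)
            taxon = parent.get(taxon)
        return score

    return max(hits, key=path_score)


# Hypothetical two-species taxonomy and genomes.
parent = {"root": None, "family": "root", "species_a": "family", "species_b": "family"}
genomes = {"species_a": "ACGTAAC", "species_b": "ACGTCCA"}
db = build_db(genomes, parent, k=4)
print(classify("CGTAAC", db, parent, k=4))  # k-mers unique to species_a
print(classify("ACGT", db, parent, k=4))    # shared k-mer, resolved at family level
```

Reads made only of shared k-mers end up at a higher taxonomic rank, which is the conservative behavior the LCA step is meant to provide.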

LMAT

Assignment of taxonomic labels to reads by k-mer searches in a precomputed database

Software (C++/Python)

https://sourceforge.net/projects/lmat/

stringMLST

k-mer-based tool for MLST directly from the genome sequencing reads

Software (Python)

http://jordan.biology.gatech.edu/page/software/stringMLST

Taxonomer

k-mer-based ultrafast metagenomics tool for assigning taxonomy to sequencing reads from clinical and environmental samples

Web service

http://taxonomer.iobio.io/

Other

d2-tools

Word-based (k-tuple) comparison (pairwise dissimilarity matrix using d2S measure) of metatranscriptomic samples from NGS reads

Software (Python/R)

https://code.google.com/p/d2-tools/

VirHostMatcher

Prediction of hosts from metagenomic viral sequences based on oligonucleotide frequency (ONF) using various distance measures (e.g., d2)

Software (C++)

https://github.com/jessieren/VirHostMatcher

MetaFast

Statistics calculation of metagenome sequences and the distances between them based on assembly using de Bruijn graphs and Bray–Curtis dissimilarity measure

Software (Java)

https://github.com/ctlab/metafast
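For reference, the Bray-Curtis dissimilarity between two samples is one minus twice the shared abundance divided by the total abundance. The sketch below applies it to plain feature-count dictionaries; using raw k-mer counts as features is an assumption for illustration and differs from the de Bruijn graph based features that MetaFast derives.

```python
from collections import Counter


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two feature-count mappings (0 = identical, 1 = disjoint)."""
    shared = sum(min(a.get(f, 0), b.get(f, 0)) for f in set(a) | set(b))
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total if total else 0.0


def kmer_counts(seq, k):
    """Count k-mers in a sequence (illustrative feature extraction)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


sample_a = kmer_counts("ACGTACGTAC", 3)
sample_b = kmer_counts("ACGTTTTTTT", 3)
print(bray_curtis(sample_a, sample_b))
```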

List of visualization tools for network biology

Jit — Mon, 29 Jan 2018 05:12:24 -0600

Network analysis is any structured technique used to mathematically analyze a circuit (a “network” of interconnected components). Network analysis makes it possible to quantify associations between individuals and thereby to infer details about the network as a whole at the species and/or population level. A few tools published in BMC Bioinformatics are listed here: https://bmcbioinformatics.biomedcentral.com/articles/sections/networks-analysis

The following standalone applications are available for network analysis:

Arena 3D

3D visualization of multi-layer networks

http://www.arena3d.org

Biana

Data integration and network management

http://sbi.imim.es/web/BIANA.php

BioLayout Express 3D

2D/3D network visualization

http://www.biolayout.org/

BiologicalNetworks

Efficient integrated multi-level analysis of microarray, sequence, regulatory and other data

http://www.biologicalnetworks.org

BioMiner

Modeling, analyzing and visualizing biochemical pathways and networks

http://www.zbi.uni-saarland.de/chair/projects/BioMiner

Cell Illustrator

Petri nets for modeling and simulating biological networks

http://www.cellillustrator.com

COPASI

Analysis of biochemical networks and their dynamics

http://www.copasi.org/

Cytoscape

Network visualization and analysis. Over 200 plugins [60]

http://www.cytoscape.org/

Dizzy

Chemical kinetics stochastic simulation software

http://magnet.systemsbiology.net/software/Dizzy/

DyCoNet

Gephi plugin that can be used to identify dynamic communities in networks

https://github.com/juliemkauffman/DyCoNet

GENeVis

Network and pathway visualization

http://tinyurl.com/genevis/

GEPHI

Interactive visualization and exploration for any network and complex system, dynamic and hierarchical graph.

https://gephi.org

Igraph

Collection of network analysis tools with the emphasis on efficiency, portability and ease of use

http://igraph.sourceforge.net

Medusa

Semantic and multi-edged simple networks

https://sites.google.com/site/medusa3visualization/

NAViGaTOR

Visualizing and analyzing protein-protein interaction networks

http://tinyurl.com/navigator1/

N-Browse

Interactive graphical browser for biological networks

http://www.gnetbrowse.org/

NeAT

Topological and clustering analysis of networks

http://rsat.ulb.ac.be/neat/

Ondex

Data integration and visualization of large networks

http://www.ondex.org/

Osprey

Visualization and annotation of biological networks

http://biodata.mshri.on.ca/osprey/servlet/Index

Pajek

Analysis and visualization of large networks and social network analysis

http://vlado.fmf.uni-lj.si/pub/networks/pajek/

PathwayAssist

Navigation and analysis of biological pathways, gene regulation networks and protein interaction maps.

http://www.ariadnegenomics.com/downloads/

PIVOT

Layout algorithms for visualizing protein interactions and families

http://acgt.cs.tau.ac.il/pivot/

ProCope

Prediction and evaluation of protein complexes from purification data experiments

http://www.bio.ifi.lmu.de/Complexes/ProCope/

ProViz

Visualization and exploration of interaction networks. Gene Ontology and PSI-MI formats supported

http://cbi.labri.fr/eng/proviz.htm

SpectralNET

Network analysis and visualizations. Scatter plots and dimensionality reduction algorithms

https://www.broadinstitute.org/software/spectralnet

Tulip

Enables the development of algorithms, visual encodings, interaction techniques, data models and domain-specific visualizations

http://tulip.labri.fr/TulipDrupal/

VANESA

Automatic reconstruction and analysis of biological networks and Petri nets based on life-science database information

http://agbi.techfak.uni-bielefeld.de/vanesa/

VANTED

Network reconstruction, data visualization, integration of various data types, network simulation

http://tinyurl.com/vanted/

yEd

Manual creation of diagrams and import of external data

http://tinyurl.com/yEdGraph/

Web tools for network analysis

APID

Unified protein-protein interactions from BIND, BioGRID, DIP, HPRD, IntAct and MINT

http://bioinfow.dep.usal.es/apid/

Arcadia

Translates text-based descriptions of biological networks (SBML files) into standardized diagrams (Systems Biology Graphical Notation Process Description maps)

http://arcadiapathways.sourceforge.net/

AVIS

Viewer for signaling networks

http://actin.pharm.mssm.edu/AVIS2

bioPIXIE

Discovery of biological networks from diverse functional genomic data

http://pixie.princeton.edu/pixie

CellPublisher

Interactive representations of biochemical processes

http://cellpublisher.gobics.de/

Graphle

Distributed, interactive exploration and visualization of large, dense graphs

http://tinyurl.com/graphle/

GraphWeb

Web server for graph-based analysis of biological networks

http://biit.cs.ut.ee/graphweb/

Hubba

Web-based service to explore the essential nodes in a network

http://hub.iis.sinica.edu.tw/Hubba

NetworkBLAST

Analysis of protein interaction networks across species to infer protein complexes that are conserved in evolution

http://www.cs.tau.ac.il/~bnet/networkblast.htm

Pathview

Tool set for pathway-based data integration and visualization

http://Pathview.r-forge.r-project.org/

PINA

Integrated platform for protein interaction network construction, filtering, analysis, visualization and management

http://cbg.garvan.unsw.edu.au/pina/home.do

ReMatch

Web-based tool for integration of user-given stoichiometric metabolic models into a database collected from public data sources

http://www.cs.helsinki.fi/group/sysfys/software/rematch/

SNOW

Gene mapping on a reference or human protein-protein interaction network that SNOW hosts

http://snow.bioinfo.cipf.es

STITCH

Resource to explore known and predicted interactions of chemicals and proteins

http://stitch.embl.de/

STRING

Protein interaction networks and integration of data such as genomic context, high-throughput experiments, conserved coexpression and previous knowledge derived from the literature

http://string-db.org

TVNViewer

An interactive visualization tool for exploring networks that change over time or space

http://www.sailing.cs.cmu.edu/main/?page_id=545

tYNA

System for managing, comparing and mining multiple networks

http://tyna.gersteinlab.org/tyna/

VisANT

Visualization, mining, analysis and modeling of biological networks, metabolic networks and ecosystems

http://visant.bu.edu/

DNA Nucleotide Counter

Neel — Fri, 12 Oct 2018 04:37:01 -0500

DNA Nucleotide Counter is delivered in a DNA Baser package together with other free molecular biology tools. Download the package and double click it. The programs inside the package will be extracted to the destination folder (specified by you). Go to the destination folder and double click the program you want to use.

It installs on any computer, even if you don't have administrator rights!

Address of the bookmark: http://www.dnabaser.com/download/DNA-Counter/index.html
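For readers who prefer a script, nucleotide counting takes only a few lines of Python. This is an independent sketch, not the DNA Baser program; the function name and the "other" bucket for ambiguity codes such as N are choices made here.

```python
from collections import Counter


def count_nucleotides(seq):
    """Count A, C, G, T (case-insensitive) and lump all other symbols into 'other'."""
    counts = Counter(seq.upper())
    result = {base: counts[base] for base in "ACGT"}
    # Whitespace (e.g., line breaks in a pasted sequence) is ignored.
    result["other"] = sum(
        n for ch, n in counts.items() if ch not in "ACGT" and not ch.isspace()
    )
    return result


print(count_nucleotides("AACgtn"))
```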

Troyanskaya Lab

Tue, 04 Feb 2020 06:40:36 -0600

The goal of our research is to interpret and distill this complexity through accurate analysis and modeling of molecular pathways, particularly those in which malfunctions lead to the manifestation of disease. We are inventing integrative methods for systems-level pathway modeling through integrative analysis of genome-scale datasets. We apply these approaches in studying challenging biological problems, such as how pathways function in diverse cell types and how they change dynamically.

https://function.princeton.edu/

Frequent parameters for bioinformatics tools !

BioStar — Tue, 27 Oct 2020 19:42:32 -0500

Third party executable parameters and options.

Trimmomatic

“ILLUMINACLIP:...:2:30:10”

“LEADING:15”

“TRAILING:15”

“SLIDINGWINDOW:4:20”

“MINLEN:20”

“TOPHRED33”

Filtlong

--min_length 500

--min_mean_q 85

--min_window_q 65

FastQ Screen

--aligner bowtie2 (bwa for PacBio)

--subset 1000 (for PacBio)

SPAdes

--careful

--disable-gzip-output

--cov-cutoff auto

--phred-offset 33

HGAP

Pbalign.task_options.min_accuracy: 70

Pbalign.task_options.no_split_subreads: false

Genomic_consensus.task_options.min_confidence: 40

falcon_ns.task_options.HGAP_GenomeLength_str: 6000000

Pbcoretools.task_options.read_length: 0

Genomic_consensus.task_options.use_score: 0

Pbalign.task_options.min_length: 50

Pbalign.task_options.algorithm_options: --minMatch 12 --bestn 10 --minPctSimilarity 70.0

Pbalign.task_options.hit_policy: randombest

Pbcoretools.task_options.other_filters: rq >= 0.7

Pbalign.task_options.concordant: false

Genomic_consensus.task_options.min_coverage: 5

falcon_ns.task_options.HGAP_SeedCoverage_str: 30

falcon_ns.task_options.HGAP_AggressiveAsm_bool: false

Genomic_consensus.task_options.algorithm: best

falcon_ns.task_options.HGAP_SeedLengthCutoff_str: -1

Genomic_consensus.task_options.diploid: false

MeDuSa

-random 100

Prokka

--usegenus

--force

--addgenes

--rfam

--rawproduct

cmsearch (taxonomy, 16S)

--rfam

--noali

blastn (taxonomy, 16S)

-evalue 1E-10

blastn (MLST)

-ungapped

-dust no

-evalue 1E-20

-word_size 32

-culling_limit 2

-perc_identity 95

blastp (VF)

-culling_limit 2

RGI (ABR)

--input_type contig

bowtie2 (mapping)

--sensitive

minimap2 (mapping)

-a

-x map-ont

samtools mpileup (SNP detection)

-uRI

bcftools call (SNP detection)

--variants-only

--skip-variants indels

--output-type v

--ploidy 1

-c

SNPsift filter (SNP detection)

"( QUAL >= 30 ) & (( na FILTER ) | (FILTER = 'PASS')) & ( DP >= 20 ) & ( MQ >= 20 )"
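To make the logic of this filter expression explicit, here is a rough Python equivalent applied to a variant record held in a dictionary. It is an approximation for illustration: the record layout is hypothetical, a missing or "." FILTER value is treated as "not available", and missing DP or MQ values are treated as failing, which may differ from how the actual tool handles them.

```python
def passes_filter(record):
    """Approximate the expression: QUAL >= 30, FILTER absent or PASS, DP >= 20, MQ >= 20."""
    filt = record.get("FILTER")
    filter_ok = filt in (None, "", ".") or filt == "PASS"
    return (
        record.get("QUAL", 0) >= 30
        and filter_ok
        and record.get("DP", 0) >= 20
        and record.get("MQ", 0) >= 20
    )


# Hypothetical records for illustration.
variants = [
    {"QUAL": 50, "FILTER": "PASS", "DP": 35, "MQ": 60},
    {"QUAL": 50, "FILTER": ".", "DP": 35, "MQ": 60},
    {"QUAL": 25, "FILTER": "PASS", "DP": 35, "MQ": 60},
    {"QUAL": 50, "FILTER": "LowQual", "DP": 35, "MQ": 60},
    {"QUAL": 50, "FILTER": "PASS", "DP": 10, "MQ": 60},
]
print([passes_filter(v) for v in variants])
```

All four conditions must hold at once, so a variant with excellent quality but shallow depth is still rejected.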

SNPeff ann (SNP detection)

-nodownload

-no-intron

-no-downstream

-no SPLICE_SITE_REGION

-upDownStreamLen 250

bcftools consensus (phylogenetic tree)

--haplotype 1

fasttreemp

-nt

-boot 100

roary

-e

-n

-cd 100

-g 100000