BOL: Related items

Useful Bioinformatics Tools

Poonam Mahapatra — Mon, 29 Aug 2016 04:08:12 -0500

A collection of a few handy tools for bioinformaticians

http://molbiol-tools.ca/Convert.htm

Address of the bookmark: http://molbiol-tools.ca/Convert.htm

Harvest

Jit — Tue, 31 Jan 2017 10:57:56 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Citation

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology, 15 (11), 1-15

Address of the bookmark: http://harvest.readthedocs.io/en/latest/index.html

WGS Celera Assembler version 8.3rc2

Jit — Mon, 10 Apr 2017 04:45:40 -0500

These are release notes for Celera Assembler version 8.3rc2, which was released on May 24, 2015.

This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.

The source code package includes full source code (revision 4627), Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r1994), used by some modules of Celera Assembler, is included. This distribution includes SAMtools (http://samtools.sourceforge.net/), Jellyfish 2.0 (http://www.cbcb.umd.edu/software/jellyfish/), PBUTGCNS (https://github.com/pbjd/pbutgcns), PBDAGCON (https://github.com/PacificBiosciences/pbdagcon), BLASR (https://github.com/PacificBiosciences/BLASR), and parts of the Falcon assembler (https://github.com/PacificBiosciences/FALCON/tree/v0.1.3).

Full documentation can be found online at http://wgs-assembler.sourceforge.net/.

Interesting scripts within it

urbe@urbo214b[bin] ls []
-rwxrwxr-x 1 urbe urbe 11K Apr 10 11:41 addCNSToStore
-rwxrwxr-x 1 urbe urbe 575K Apr 10 11:41 addReadsToUnitigs
-rwxrwxr-x 1 urbe urbe 128K Apr 10 11:41 analyzeBest
-rwxrwxr-x 1 urbe urbe 257K Apr 10 11:41 analyzePosMap
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 analyzeScaffolds
-rwxrwxr-x 1 urbe urbe 224K Apr 10 11:41 asmOutputFasta
-rwxrwxr-x 1 urbe urbe 448K Apr 10 11:41 asmOutputStatistics
-rwxrwxr-x 1 urbe urbe 2,4K Apr 10 11:41 asmToAGP.pl
-rwxrwxr-x 1 urbe urbe 7,6M Apr 10 11:41 blasr
-rwxrwxr-x 1 urbe urbe 1,6M Apr 10 11:41 bogart
-rwxrwxr-x 1 urbe urbe 183K Apr 10 11:41 bogus
-rwxrwxr-x 1 urbe urbe 272K Apr 10 11:41 bogusness
-rwxrwxr-x 1 urbe urbe 247K Apr 10 11:41 buildPosMap
-rwxrwxr-x 1 urbe urbe 213K Apr 10 11:41 buildRefContigs
-rwxrwxr-x 1 urbe urbe 990K Apr 10 11:41 buildUnitigs
-rwxrwxr-x 1 urbe urbe 18K Apr 10 11:41 ca2ace.pl
-rwxrwxr-x 1 urbe urbe 12K Apr 10 11:41 caqc_help.ini
-rwxrwxr-x 1 urbe urbe 61K Apr 10 11:41 caqc.pl
-rwxrwxr-x 1 urbe urbe 23K Apr 10 11:41 cat-corrects
-rwxrwxr-x 1 urbe urbe 24K Apr 10 11:41 cat-erates
-rwxrwxr-x 1 urbe urbe 1,9M Apr 10 11:41 cgw
-rwxrwxr-x 1 urbe urbe 1,4M Apr 10 11:41 cgwDump
-rwxrwxr-x 1 urbe urbe 204K Apr 10 11:41 chimChe
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:40 chimera
-rwxrwxr-x 1 urbe urbe 220K Apr 10 11:41 classifyMates
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:41 classifyMatesApply
-rwxrwxr-x 1 urbe urbe 215K Apr 10 11:41 classifyMatesPairwise
-rwxrwxr-x 1 urbe urbe 366K Apr 10 11:41 computeCoverageStat
-rwxrwxr-x 1 urbe urbe 9,8K Apr 10 11:41 convert-fasta-to-v2.pl
-rwxrwxr-x 1 urbe urbe 48K Apr 10 11:41 convertOverlap
-rwxrwxr-x 1 urbe urbe 119K Apr 10 11:41 convertSamToCA
-rwxrwxr-x 1 urbe urbe 20K Apr 10 11:41 convertToPBCNS
-rwxrwxr-x 1 urbe urbe 197K Apr 10 11:41 correct-frags
-rwxrwxr-x 1 urbe urbe 259K Apr 10 11:41 correct-olaps
-rwxrwxr-x 1 urbe urbe 520K Apr 10 11:41 correctPacBio
-rwxrwxr-x 1 urbe urbe 540K Apr 10 11:41 ctgcns
-rwxrwxr-x 1 urbe urbe 162K Apr 10 11:40 deduplicate
-rwxrwxr-x 1 urbe urbe 37K Apr 10 11:41 demotePosMap
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 dumpCloneMiddles
-rwxrwxr-x 1 urbe urbe 124K Apr 10 11:41 dumpPBRLayoutStore
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 dumpSingletons
-rwxrwxr-x 1 urbe urbe 171K Apr 10 11:41 erate-estimate
-rwxrwxr-x 1 urbe urbe 221K Apr 10 11:40 estimate-mer-threshold
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 extendClearRanges
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 extendClearRangesPartition
-rwxrwxr-x 1 urbe urbe 205K Apr 10 11:40 extractmessages
-rwxrwxr-x 1 urbe urbe 7,2M Apr 10 11:41 falcon_sense
-rwxrwxr-x 1 urbe urbe 9,8K Apr 10 11:41 fastaToCA
-rwxrwxr-x 1 urbe urbe 124K Apr 10 11:40 fastqAnalyze
-rwxrwxr-x 1 urbe urbe 137K Apr 10 11:40 fastqSample
-rwxrwxr-x 1 urbe urbe 62K Apr 10 11:40 fastqSimulate
-rwxrwxr-x 1 urbe urbe 121K Apr 10 11:40 fastqSimulate-sort
-rwxrwxr-x 1 urbe urbe 246K Apr 10 11:40 fastqToCA
-rwxrwxr-x 1 urbe urbe 140K Apr 10 11:41 filterOverlap
-rwxrwxr-x 1 urbe urbe 341K Apr 10 11:40 finalTrim
-rwxrwxr-x 1 urbe urbe 228K Apr 10 11:41 fixUnitigs
-rwxrwxr-x 1 urbe urbe 147K Apr 10 11:40 fragmentDepth
-rwxrwxr-x 1 urbe urbe 29K Apr 10 11:41 fragsInVars
-rwxrwxr-x 1 urbe urbe 545K Apr 10 11:41 frgs2clones
-rwxrwxr-x 1 urbe urbe 398K Apr 10 11:40 gatekeeper
-rwxrwxr-x 1 urbe urbe 139K Apr 10 11:40 gatekeeperbench
-rwxrwxr-x 1 urbe urbe 167K Apr 10 11:40 gkpStoreCreate
-rwxrwxr-x 1 urbe urbe 147K Apr 10 11:40 gkpStoreDumpFASTQ
-rwxrwxr-x 1 urbe urbe 184K Apr 10 11:41 greedyFragmentTiling
-rwxrwxr-x 1 urbe urbe 1,6K Apr 10 11:41 greedy_layout_to_IUM
-rwxrwxr-x 1 urbe urbe 142K Apr 10 11:40 initialTrim
-rwxrwxr-x 1 urbe urbe 967K Apr 10 11:41 jellyfish
-rwxrwxr-x 1 urbe urbe 219K Apr 10 11:41 markRepeatUnique
-rwxrwxr-x 1 urbe urbe 273K Apr 10 11:40 markUniqueUnique
-rwxrwxr-x 1 urbe urbe 114K Apr 10 11:40 mercy
-rwxrwxr-x 1 urbe urbe 3,8K Apr 10 11:41 mergeqc.pl
-rwxrwxr-x 1 urbe urbe 422K Apr 10 11:40 merTrim
-rwxrwxr-x 1 urbe urbe 125K Apr 10 11:40 merTrimApply
-rwxrwxr-x 1 urbe urbe 376K Apr 10 11:40 meryl
-rwxrwxr-x 1 urbe urbe 176K Apr 10 11:41 metagenomics_ovl_analyses
-rwxrwxr-x 1 urbe urbe 297K Apr 10 11:41 olap-from-seeds
-rwxrwxr-x 1 urbe urbe 275K Apr 10 11:41 outputLayout
-rwxrwxr-x 1 urbe urbe 229K Apr 10 11:41 overlapInCore
-rwxrwxr-x 1 urbe urbe 144K Apr 10 11:40 overlap_partition
-rwxrwxr-x 1 urbe urbe 179K Apr 10 11:41 overlapStats
-rwxrwxr-x 1 urbe urbe 179K Apr 10 11:41 overlapStore
-rwxrwxr-x 1 urbe urbe 153K Apr 10 11:41 overlapStoreBucketizer
-rwxrwxr-x 1 urbe urbe 175K Apr 10 11:41 overlapStoreBuild
-rwxrwxr-x 1 urbe urbe 33K Apr 10 11:41 overlapStoreIndexer
-rwxrwxr-x 1 urbe urbe 48K Apr 10 11:41 overlapStoreSorter
-rwxrwxr-x 1 urbe urbe 604K Apr 10 11:40 overmerry
lrwxrwxrwx 1 urbe urbe 4 Apr 10 11:41 pacBioToCA -> PBcR
-rwxrwxr-x 1 urbe urbe 131K Apr 10 11:41 PBcR
-rwxrwxr-x 1 urbe urbe 2,9M Apr 10 11:41 pbdagcon
-rwxrwxr-x 1 urbe urbe 1,9M Apr 10 11:41 pbutgcns
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:40 remove_fragment
-rwxrwxr-x 1 urbe urbe 153K Apr 10 11:40 removeMateOverlap
-rwxrwxr-x 1 urbe urbe 2,5K Apr 10 11:41 replaceUIDwithName-fastq
-rwxrwxr-x 1 urbe urbe 1,2K Apr 10 11:41 replaceUIDwithName-posmap
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 resolveSurrogates
-rwxrwxr-x 1 urbe urbe 139K Apr 10 11:41 rewriteCache
-rwxrwxr-x 1 urbe urbe 232K Apr 10 11:41 runCA
-rwxrwxr-x 1 urbe urbe 88K Apr 10 11:41 runCA-dedupe
-rwxrwxr-x 1 urbe urbe 14K Apr 10 11:41 runCA-overlapStoreBuild
-rwxrwxr-x 1 urbe urbe 3,6K Apr 10 11:41 run_greedy.csh
-rwxrwxr-x 1 urbe urbe 297K Apr 10 11:40 sffToCA
-rwxrwxr-x 1 urbe urbe 13K Apr 10 11:40 show-corrects
-rwxrwxr-x 1 urbe urbe 557K Apr 10 11:41 splitUnitigs
-rwxrwxr-x 1 urbe urbe 1,4M Apr 10 11:41 terminator
drwxrwxr-x 2 urbe urbe 4,0K Apr 10 11:41 TIGR
-rwxrwxr-x 1 urbe urbe 526K Apr 10 11:41 tigStore
-rwxrwxr-x 1 urbe urbe 35K Apr 10 11:41 tracearchiveToCA
-rwxrwxr-x 1 urbe urbe 35K Apr 10 11:41 tracedb-to-frg.pl
-rwxrwxr-x 1 urbe urbe 44K Apr 10 11:41 trimFastqByQVWindow
-rwxrwxr-x 1 urbe urbe 18K Apr 10 11:40 uidclient
-rwxrwxr-x 1 urbe urbe 589K Apr 10 11:41 unitigger
-rwxrwxr-x 1 urbe urbe 42K Apr 10 11:40 upgrade-v8-to-v9
-rwxrwxr-x 1 urbe urbe 42K Apr 10 11:40 upgrade-v9-to-v10
-rwxrwxr-x 1 urbe urbe 854 Apr 10 11:41 utg2fasta
-rwxrwxr-x 1 urbe urbe 731K Apr 10 11:41 utgcns
-rwxrwxr-x 1 urbe urbe 561K Apr 10 11:41 utgcnsfix

Address of the bookmark: http://wgs-assembler.sourceforge.net/wiki/index.php/Main_Page

Tools to Predict the Impact of Missense Variants !

Jit — Mon, 23 Apr 2018 12:57:33 -0500

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen‐2, SIFT, FatHMM, MutationTaster‐2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants.

A study of 10 tools on five datasets showed that such a comparative evaluation is hindered by two types of circularity, which arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used for training and in those used for evaluation, and which may lead to overly optimistic results. Comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate among all tools, and may even outperform optimized combinations of tools.
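The two types of circularity can be made concrete with a small Python sketch. It is only an illustration, not the procedure used in the study: the `Variant` type, the function name, and the protein identifiers are hypothetical, and a real benchmark would work from the tools' actual training sets.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    protein: str  # protein identifier (hypothetical IDs in the example below)
    change: str   # amino acid substitution, e.g. "R175H"


def split_by_circularity(training, evaluation):
    """Partition evaluation variants by their overlap with the training data.

    type1: the same variant also appears in the training set.
    type2: a different variant, but from a protein seen in training.
    clean: neither the variant nor its protein was seen in training.
    """
    train_variants = set(training)
    train_proteins = {v.protein for v in training}
    clean, type1, type2 = [], [], []
    for v in evaluation:
        if v in train_variants:
            type1.append(v)
        elif v.protein in train_proteins:
            type2.append(v)
        else:
            clean.append(v)
    return clean, type1, type2


training = [Variant("P1", "R175H"), Variant("P2", "G12D")]
evaluation = [
    Variant("P1", "R175H"),  # same variant as in training
    Variant("P1", "R248Q"),  # new variant, protein seen in training
    Variant("P3", "V600E"),  # unseen protein
]
clean, type1, type2 = split_by_circularity(training, evaluation)
print("clean:", clean)
print("type 1:", type1)
print("type 2:", type2)
```

Scoring a tool only on the `clean` subset is one simple way to check how much of its apparent accuracy depends on variants or proteins it has effectively already seen.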

The following tools are useful for predicting the impact of missense mutations:

PolyPhen‐2 (PP2)
“Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations”

MutationTaster‐2 (MT2)
“Evaluation of the disease‐causing potential of DNA sequence alterations”

MutationAssessor (MASS)
“Predicts the functional impact of amino acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms”

LRT
“Identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein‐coding sequences, which are likely to be unconditionally deleterious”

SIFT
“Predicts whether an amino acid substitution affects protein function”

GERP++
“Identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as “rejected substitutions.” Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element”

phyloP
“Compute conservation or acceleration P values based on an alignment and a model of neutral evolution”

FatHMM unweighted (FatHMM‐U)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”

FatHMM weighted (FatHMM‐W)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”, and its weighting scheme attributes higher tolerance scores to SNVs in proteins, related proteins, or domains that already include a high fraction of pathogenic variants

Combined Annotation Dependent Depletion (CADD)
“CADD is a tool for scoring the deleteriousness of single‐nucleotide variants as well as insertion/deletions variants in the human genome”

EvidentialGene: tr2aacds, mRNA Transcript Assembly Software

Rahul Nayak — Tue, 08 May 2018 04:39:39 -0500

EvidentialGene is a genome informatics project, "Evidence Directed Gene Construction for Eukaryotes", to construct high quality, accurate gene sets for animals and plants, developed by Don Gilbert at Indiana University, see
http://arthropods.eugenes.org/EvidentialGene/

Construction refers to the combination of classical gene prediction, and more recent gene assembly (de-novo and genome-assisted) methods. The basic Evigene methods involve using available best-of-breed gene prediction and assembly software, combining all evidence for genes, from expressed sequences, genome assembly sequences, related species protein sequences, and any other, to annotate and score gene constructions. Over-produced constructions are classified by gene evidence for best qualities per "locus", including genome-aligned and gene-transcript aligned (genome-free) locus identification. All software developed for EvidentialGene is publicly available. See project wiki/blog for notes.

Download

http://arthropods.eugenes.org/EvidentialGene/trassembly.html

https://sourceforge.net/p/evidentialgene/blog/

Address of the bookmark: http://arthropods.eugenes.org/EvidentialGene/trassembly.html

mmgenome: Tools for extracting individual genomes from metagenomes

Jit — Thu, 09 Aug 2018 17:41:17 -0500

The mmgenome toolbox enables reproducible extraction of individual genomes from metagenomes. It builds on the multi-metagenome concept, but wraps most of the process of extracting genomes in simple R functions, thereby making the whole process of binning easy and, at the same time, reproducible through the Rmarkdown format.

The mmgenome R package also facilitates effortless integration with additional data sources and hence should not be seen as "yet another binning method", but rather a package to integrate different binning strategies.

All functions in the mmgenome R package have associated documentation; check it out in R with, e.g., ?mmplot.

Address of the bookmark: https://github.com/MadsAlbertsen/mmgenome

Useful Bioinformatics Analysis Tools !

Neel — Thu, 23 Dec 2021 23:10:02 -0600

CoMeta

Classifier of reads from metagenomic sequencing experiments.

• Kawulok, J., Deorowicz, S., CoMeta: Classification of Metagenomes Using k-mers, PLOS ONE, 2015; 10(4):1–23,

CoMSA

Compressor of multiple sequence alignments of proteins.

• Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., CoMSA: compression of protein multiple sequence alignment files, Bioinformatics, 2019; 35(2):227–234,

DSRC

Compressor of sequencing reads.

• Roguski, L., Deorowicz, S., DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 2014; 30(15):2213–2215,
• Deorowicz, S., Grabowski, Sz., Compression of DNA sequences in FASTQ format, Bioinformatics, 2011; 27(6):860–862,

FAMSA

Multiple sequence alignment designed for huge families of proteins (even containing hundreds of thousands of sequences).

• Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., FAMSA: Fast and accurate multiple sequence alignment of huge protein families, Scientific Reports, 2016; 6(33964):

FaStore

Compressor of FASTQ files.

• Roguski, L., Ochoa, I., Hernaez, M., Deorowicz, S., FaStore - a space-saving solution for raw sequencing data, Bioinformatics, 2018; 34(16):2748–2756,

FQSqueezer

Experimental high-end compressor of FASTQ files.

• Deorowicz, S., FQSqueezer: k-mer-based compression of sequencing data, Scientific Reports, 2020; 10(578):

GDC

Compressor of collections of genome sequences.

• Deorowicz, S., Danek, A., Niemiec, M., GDC 2: Compression of large collections of genomes, Scientific Reports, 2015; 5(11565):1–12,
• Deorowicz, S., Grabowski, Sz., Robust relative compression of genomes with random access, Bioinformatics, 2011; 27(21):2979–2986,

GTC

Genotype databases compressor with support for fast queries.

• Danek, A., Deorowicz, S., GTC: how to maintain huge genotype collections in a compressed form, Bioinformatics, 2018; 34(11):1834–1840,

GTShark

Genotypes compressor.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, Bioinformatics, 2019; 35(22):4791–4793,

KMC

Memory frugal k-mer counter.

•  Kokot, M., Długosz, M., Deorowicz, S., KMC 3: counting and manipulating k -mer statistics, Bioinformatics, 2017; 33(17):2759–2761,
•  Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; 31(10):1569–1576,
•  Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013; 14():Article no. 160,
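To show what a k-mer counter computes, here is a minimal in-memory Python sketch. It is not KMC's disk-based algorithm and does not use its file formats. Counting canonical k-mers (a k-mer and its reverse complement counted together) and skipping k-mers with non-ACGT symbols are assumptions of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k):
    """Count canonical k-mers across sequences, skipping k-mers with non-ACGT symbols."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


counts = count_kmers(["ACGTAC", "GTACGN"], k=3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A plain `Counter` keeps every distinct k-mer in memory, which is why dedicated tools are needed at the scale of whole sequencing runs.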

Kmer-db

Tool for estimation of evolutionary distances in a collection of genomes.

• Deorowicz, S., Gudys, A., Dlugosz, M., Kokot, M., Danek, A., Kmer-db: instant evolutionary distance estimation, Bioinformatics, 2019; 35(1):133–136,
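As a rough illustration of how shared k-mers can be turned into a distance estimate, the Python sketch below computes the Jaccard index of two k-mer sets and the Mash distance derived from it, D = -(1/k) ln(2j / (1 + j)). This is a generic textbook-style sketch, not Kmer-db's implementation, and the choice of these two measures is an assumption made for illustration.

```python
import math


def kmer_set(seq, k):
    """Return the set of all k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index: shared k-mers divided by all distinct k-mers."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j, k):
    """Mash distance from a Jaccard index j and k-mer length k."""
    if j == 0:
        return 1.0  # no shared k-mers: report the maximum distance
    return -math.log(2 * j / (1 + j)) / k


k = 4
a = kmer_set("ACGTACGT", k)
b = kmer_set("ACGTACGA", k)
j = jaccard(a, b)
print("Jaccard index:", j)
print("Mash distance:", mash_distance(j, k))
```

Real genomes call for much longer k-mers than the toy value used here, and usually for sampling or sketching of the k-mer sets so they fit in memory.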

MuGI

Index allowing queries for a collection of multiple genome sequences.

• Danek, A., Deorowicz, S., Grabowski, Sz., Indexes of Large Genome Collections on a PC, PLOS ONE, 2014; 9(10):e109384,

ORCOM

Experimental compressor of sequencing reads.

• Grabowski, Sz., Deorowicz, S., Roguski, L., Disk-based compression of data from genome sequencing, Bioinformatics, 2014; 31(9):1389–1395,

PgSA

Index allowing queries for a collection of sequencing reads.

• Kowalski, T., Grabowski, Sz., Deorowicz, S., Indexing arbitrary-length k-mers in sequencing reads, PLOS ONE, 2015; 10(7):1–16,

QuickProbs

Multiple sequence alignment designed especially for GPU.

• Gudys, A., Deorowicz, S., QuickProbs 2: towards rapid construction of high-quality alignments of large protein families, Scientific Reports, 2017; 7(41553):
• Gudys, A., Deorowicz, S., QuickProbs – A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors, PLOS ONE, 2014; 9(2):e88901,

RECKONER

Read error corrector.

• Długosz, M., Deorowicz, S., RECKONER: read error corrector based on KMC, Bioinformatics, 2017; 33(7):1086–1089,

TGC

Compressor of collections of genomes given in Variant Call Format (VCF) files.

• Deorowicz, S., Danek, A., Grabowski, Sz., Genome compression: a novel approach for large collections, Bioinformatics, 2013; 29(20):2572–2578,

VCFShark

Compressor of VCF files.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, biorxiv.org, 2020; ():

Whisper

Experimental mapper of whole genome sequencing data.

•  Deorowicz, S., Gudys, A., Whisper 2: indel-sensitive short read mapping, bioRxiv.org, 2019; :
•  Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Whisper: read sorting allows robust mapping of DNA sequencing data, Bioinformatics, 2019; 35(12):2043–2050,
•  Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Robust mapping of whole genome sequencing data, Poster at The Biology of Genomes Conference, 2017;

Interesting Bioinformatics Resources !

Abhi — Fri, 11 Nov 2022 06:30:46 -0600

1. A reproducible workflow https://www.youtube.com/watch?v=s3JldKoA0zw This two-minute video will change your mind on reproducible research.

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. A video by Dr. Keith A. Baggerly from MD Anderson, [The Importance of Reproducible Research in High-Throughput Biology](https://www.youtube.com/watch?v=7gYIs7uYbMo), highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042 A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Important Bioinformatics Tools !

BioStar — Tue, 30 Jul 2024 05:03:29 -0500

1. Ktrim: An extra-fast, accurate adapter trimmer for sequencing data. It processes FASTQ files from multiple lanes with minimal mismatching and over-trimming of adapters.

2. BWA MEM: A reliable alignment tool (particularly for mapping ALT contigs and HLA genes, which are not fully addressed in BWA-MEM2).

3. Sambamba markdup: Quickly marks or removes duplicate reads using Picard's criteria.

4. ichorCNA: Estimates the tumor DNA fraction in cell-free DNA from ultra-low-pass whole genome sequencing (0.1x coverage) based on copy number alterations (CNA).

5. Fragle: A deep learning method for quantifying ctDNA levels from cell-free DNA fragmentomic profiles. It detects tumor fractions (TF) as low as ~1% and works with targeted genomic panel sequencing data.

6. AlfredQC: A quality control tool for high-throughput sequencing data. It assesses metrics like read quality scores, GC content, and duplication rates, visualized through detailed plots and summary statistics.

7. Mosdepth: A fast tool for calculating sequencing coverage depth, offering a quicker alternative to samtools/sambamba depth by processing BAM and CRAM files.

8. Bedtools: A versatile toolkit for genomics, enabling operations like intersect, merge, count, and shuffle on genomic intervals across formats such as BAM, BED, GFF/GTF, and VCF.

9. Datamash: A command-line tool for basic numeric, textual, and statistical operations on input data streams. It supports operations such as grouping, sorting, transposing, and performing arithmetic calculations on tabular data.

10. gwf.app: A pragmatic alternative to Snakemake. Developed at Aarhus University, this flexible, generic workflow tool builds and runs large scientific workflows.
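
The interval "merge" operation mentioned for Bedtools in item 8 can be sketched in a few lines of Python. This is not Bedtools code, only an illustration of the idea, assuming BED-style 0-based half-open intervals and treating both overlapping and directly adjacent (book-ended) intervals as mergeable; the function name is hypothetical.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or book-ended intervals on each chromosome.

    intervals: iterable of (chrom, start, end) tuples, 0-based half-open.
    Returns a list of merged (chrom, start, end) tuples sorted by position.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        spans = sorted(by_chrom[chrom])
        current_start, current_end = spans[0]
        for start, end in spans[1:]:
            if start <= current_end:
                # Overlapping or touching the current block: extend it.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


example = [
    ("chr1", 10, 20),
    ("chr1", 15, 30),
    ("chr1", 30, 40),
    ("chr1", 50, 60),
    ("chr2", 5, 8),
]
for row in merge_intervals(example):
    print(*row, sep="\t")
```

Sorting each chromosome's intervals first means every interval only needs to be compared with the block currently being built, so one pass is enough.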

Predicting Pathogen Virulence Using Bioinformatics Tools

BioStar — Tue, 04 Nov 2025 07:55:53 -0600

In the genomic era, the ability to predict the virulence potential of pathogens has become an indispensable part of infectious disease research. With the exponential growth of microbial genome data, bioinformatics tools now enable scientists to identify virulence factors, model pathogen behavior, and even forecast outbreak risks — all from sequence data.

In an age where pathogens continue to evolve and cross boundaries, understanding what makes them virulent—that is, capable of causing disease—has become a critical focus in modern microbiology and genomics. Virulence prediction bridges computational biology, genomics, and machine learning to forecast the pathogenic potential of microbes before they strike.

What Is Virulence?

Virulence refers to the degree of damage a pathogen can inflict on its host. It is determined by a combination of genetic factors—called virulence factors (VFs)—that allow the organism to attach, invade, evade, and harm the host. These include genes coding for toxins, secretion systems, adhesins, and enzymes that disrupt host defenses.

Understanding virulence factors not only helps in deciphering the mechanisms of infection but also provides early warning signs for emerging threats.

Why Predict Virulence?

Traditional virulence studies relied heavily on experimental infection models, which, although accurate, are time-consuming, expensive, and ethically constrained.
Today, the availability of whole-genome sequences and large-scale pathogen databases has paved the way for in silico virulence prediction—a computational approach that can screen thousands of genomes within hours.

This approach enables researchers to:

Rapidly identify potential high-risk strains.
Prioritize pathogens for containment, surveillance, or further study.
Guide vaccine development and drug target discovery.
Support One Health frameworks, linking animal, human, and environmental health data.

How Is Virulence Predicted?

Virulence prediction combines bioinformatics pipelines with machine learning and comparative genomics. The process generally involves:

Genome Annotation: Identifying genes and coding sequences in microbial genomes.
Feature Extraction: Comparing sequences with curated databases like VFDB (Virulence Factor Database), PATRIC, or Victors.
Pattern Recognition: Using algorithms (e.g., Random Forest, SVM, or deep learning models) to classify genes or strains as virulent or non-virulent based on sequence patterns, motifs, and protein domains.
Scoring and Visualization: Assigning a virulence score or confidence level and visualizing it through heatmaps or genome maps.
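
The feature extraction and scoring steps above can be sketched with a toy Python example. Everything in it is a hypothetical stand-in: the small gene table is not real VFDB content, the weights are invented placeholders for a trained classifier, and real pipelines match sequences by similarity search, not by gene name.

```python
import math

# Hypothetical toy stand-in for a curated virulence factor database:
# gene name -> virulence factor category. Not real database content.
TOY_VF_DATABASE = {
    "toxA": "toxin",
    "adhB": "adhesin",
    "secC": "secretion system",
}

# Invented placeholder weights standing in for a trained model.
TOY_WEIGHTS = {"toxin": 2.0, "adhesin": 1.0, "secretion system": 1.5}
TOY_BIAS = -2.0


def extract_features(annotated_genes):
    """Feature extraction: mark each virulence factor category as present (1) or absent (0)."""
    hits = {TOY_VF_DATABASE[g] for g in annotated_genes if g in TOY_VF_DATABASE}
    return {category: int(category in hits) for category in TOY_WEIGHTS}


def virulence_score(features):
    """Scoring: logistic function of a weighted feature sum, giving a value in (0, 1)."""
    z = TOY_BIAS + sum(TOY_WEIGHTS[c] * present for c, present in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical annotated gene lists for two strains.
strain_a = ["toxA", "secC", "rpoB"]
strain_b = ["rpoB", "gyrA"]
for name, genes in [("strain_a", strain_a), ("strain_b", strain_b)]:
    features = extract_features(genes)
    print(name, features, round(virulence_score(features), 3))
```

In practice the weights would be learned from labeled genomes with a method such as Random Forest or SVM, and the score would need validation against strains of known virulence.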

Tools and Resources for Virulence Prediction

A number of tools and databases make virulence prediction accessible to the scientific community:

VFanalyzer – For identifying virulence genes based on VFDB.
PathoFact – Predicts virulence, antimicrobial resistance (AMR), and toxin genes from metagenomic data.
Pangenome-based models – Identify virulence-associated gene clusters across strains.
Machine learning models – Use features like GC content, codon usage bias, or protein domains to predict pathogenicity.
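
Two of the sequence features named above, GC content and codon usage, are simple to compute. The Python sketch below is a minimal illustration with hypothetical function names; it reports raw codon frequencies, which are only a starting point for proper codon usage bias measures.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0


def codon_frequencies(cds):
    """Relative frequency of each codon in a coding sequence read in frame from position 0.

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


cds = "ATGGCTGCTTAA"  # toy coding sequence: ATG GCT GCT TAA
print("GC content:", round(gc_content(cds), 3))
print("Codon frequencies:", codon_frequencies(cds))
```

Feature vectors built this way can be passed to any standard classifier; which features actually carry signal for pathogenicity has to be established on real data.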

Emerging tools now integrate multi-omic data—including transcriptomics, proteomics, and metabolomics—to understand virulence in a systems biology framework.

Applications in the Real World

Virulence prediction has major implications across public health and research sectors:

Epidemic preparedness: Early identification of virulent strains in outbreak samples.
AMR surveillance: Linking virulence profiles with antibiotic resistance determinants.
Environmental monitoring: Predicting pathogenic potential of soil or waterborne microbes.
Clinical diagnostics: Supporting personalized treatment through pathogen profiling.

For instance, integrating virulence prediction pipelines into national surveillance networks could enable faster risk assessment and response to infectious outbreaks.

The Road Ahead

As machine learning and genomics advance, virulence prediction will evolve from simple gene-based detection to dynamic, context-aware models that account for host–pathogen interactions, environmental signals, and evolutionary adaptation.

Future tools may predict not just if a strain is virulent, but under what conditions it expresses that virulence—bridging the gap between genotype and phenotype.

In Summary

Virulence prediction is redefining how we understand and anticipate infectious diseases. By coupling genomic insights with computational intelligence, researchers can identify potential threats earlier, design smarter interventions, and ultimately, strengthen our preparedness against emerging pathogens.