BOL: Related items

Useful Bioinformatics Tools

Poonam Mahapatra — Mon, 29 Aug 2016 04:08:12 -0500

A collection of a few handy tools for bioinformaticians

http://molbiol-tools.ca/Convert.htm

Address of the bookmark: http://molbiol-tools.ca/Convert.htm

Harvest

Jit — Tue, 31 Jan 2017 10:57:56 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Citation

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology, 15 (11), 1-15.

Address of the bookmark: http://harvest.readthedocs.io/en/latest/index.html

WGS Celera Assembler version 8.3rc2

Jit — Mon, 10 Apr 2017 04:45:40 -0500

These are release notes for Celera Assembler version 8.3rc2, which was released on May 24, 2015.

This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.

The source code package includes full source code (revision 4627), Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r1994), used by some modules of Celera Assembler, is included. This distribution includes SAMtools (http://samtools.sourceforge.net/), Jellyfish 2.0 (http://www.cbcb.umd.edu/software/jellyfish/), PBUTGCNS (https://github.com/pbjd/pbutgcns), PBDAGCON (https://github.com/PacificBiosciences/pbdagcon), BLASR (https://github.com/PacificBiosciences/BLASR), and parts of the Falcon assembler (https://github.com/PacificBiosciences/FALCON/tree/v0.1.3).

Full documentation can be found online at http://wgs-assembler.sourceforge.net/.

Interesting scripts within it

urbe@urbo214b[bin] ls []
-rwxrwxr-x 1 urbe urbe 11K Apr 10 11:41 addCNSToStore
-rwxrwxr-x 1 urbe urbe 575K Apr 10 11:41 addReadsToUnitigs
-rwxrwxr-x 1 urbe urbe 128K Apr 10 11:41 analyzeBest
-rwxrwxr-x 1 urbe urbe 257K Apr 10 11:41 analyzePosMap
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 analyzeScaffolds
-rwxrwxr-x 1 urbe urbe 224K Apr 10 11:41 asmOutputFasta
-rwxrwxr-x 1 urbe urbe 448K Apr 10 11:41 asmOutputStatistics
-rwxrwxr-x 1 urbe urbe 2,4K Apr 10 11:41 asmToAGP.pl
-rwxrwxr-x 1 urbe urbe 7,6M Apr 10 11:41 blasr
-rwxrwxr-x 1 urbe urbe 1,6M Apr 10 11:41 bogart
-rwxrwxr-x 1 urbe urbe 183K Apr 10 11:41 bogus
-rwxrwxr-x 1 urbe urbe 272K Apr 10 11:41 bogusness
-rwxrwxr-x 1 urbe urbe 247K Apr 10 11:41 buildPosMap
-rwxrwxr-x 1 urbe urbe 213K Apr 10 11:41 buildRefContigs
-rwxrwxr-x 1 urbe urbe 990K Apr 10 11:41 buildUnitigs
-rwxrwxr-x 1 urbe urbe 18K Apr 10 11:41 ca2ace.pl
-rwxrwxr-x 1 urbe urbe 12K Apr 10 11:41 caqc_help.ini
-rwxrwxr-x 1 urbe urbe 61K Apr 10 11:41 caqc.pl
-rwxrwxr-x 1 urbe urbe 23K Apr 10 11:41 cat-corrects
-rwxrwxr-x 1 urbe urbe 24K Apr 10 11:41 cat-erates
-rwxrwxr-x 1 urbe urbe 1,9M Apr 10 11:41 cgw
-rwxrwxr-x 1 urbe urbe 1,4M Apr 10 11:41 cgwDump
-rwxrwxr-x 1 urbe urbe 204K Apr 10 11:41 chimChe
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:40 chimera
-rwxrwxr-x 1 urbe urbe 220K Apr 10 11:41 classifyMates
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:41 classifyMatesApply
-rwxrwxr-x 1 urbe urbe 215K Apr 10 11:41 classifyMatesPairwise
-rwxrwxr-x 1 urbe urbe 366K Apr 10 11:41 computeCoverageStat
-rwxrwxr-x 1 urbe urbe 9,8K Apr 10 11:41 convert-fasta-to-v2.pl
-rwxrwxr-x 1 urbe urbe 48K Apr 10 11:41 convertOverlap
-rwxrwxr-x 1 urbe urbe 119K Apr 10 11:41 convertSamToCA
-rwxrwxr-x 1 urbe urbe 20K Apr 10 11:41 convertToPBCNS
-rwxrwxr-x 1 urbe urbe 197K Apr 10 11:41 correct-frags
-rwxrwxr-x 1 urbe urbe 259K Apr 10 11:41 correct-olaps
-rwxrwxr-x 1 urbe urbe 520K Apr 10 11:41 correctPacBio
-rwxrwxr-x 1 urbe urbe 540K Apr 10 11:41 ctgcns
-rwxrwxr-x 1 urbe urbe 162K Apr 10 11:40 deduplicate
-rwxrwxr-x 1 urbe urbe 37K Apr 10 11:41 demotePosMap
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 dumpCloneMiddles
-rwxrwxr-x 1 urbe urbe 124K Apr 10 11:41 dumpPBRLayoutStore
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 dumpSingletons
-rwxrwxr-x 1 urbe urbe 171K Apr 10 11:41 erate-estimate
-rwxrwxr-x 1 urbe urbe 221K Apr 10 11:40 estimate-mer-threshold
-rwxrwxr-x 1 urbe urbe 1,5M Apr 10 11:41 extendClearRanges
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 extendClearRangesPartition
-rwxrwxr-x 1 urbe urbe 205K Apr 10 11:40 extractmessages
-rwxrwxr-x 1 urbe urbe 7,2M Apr 10 11:41 falcon_sense
-rwxrwxr-x 1 urbe urbe 9,8K Apr 10 11:41 fastaToCA
-rwxrwxr-x 1 urbe urbe 124K Apr 10 11:40 fastqAnalyze
-rwxrwxr-x 1 urbe urbe 137K Apr 10 11:40 fastqSample
-rwxrwxr-x 1 urbe urbe 62K Apr 10 11:40 fastqSimulate
-rwxrwxr-x 1 urbe urbe 121K Apr 10 11:40 fastqSimulate-sort
-rwxrwxr-x 1 urbe urbe 246K Apr 10 11:40 fastqToCA
-rwxrwxr-x 1 urbe urbe 140K Apr 10 11:41 filterOverlap
-rwxrwxr-x 1 urbe urbe 341K Apr 10 11:40 finalTrim
-rwxrwxr-x 1 urbe urbe 228K Apr 10 11:41 fixUnitigs
-rwxrwxr-x 1 urbe urbe 147K Apr 10 11:40 fragmentDepth
-rwxrwxr-x 1 urbe urbe 29K Apr 10 11:41 fragsInVars
-rwxrwxr-x 1 urbe urbe 545K Apr 10 11:41 frgs2clones
-rwxrwxr-x 1 urbe urbe 398K Apr 10 11:40 gatekeeper
-rwxrwxr-x 1 urbe urbe 139K Apr 10 11:40 gatekeeperbench
-rwxrwxr-x 1 urbe urbe 167K Apr 10 11:40 gkpStoreCreate
-rwxrwxr-x 1 urbe urbe 147K Apr 10 11:40 gkpStoreDumpFASTQ
-rwxrwxr-x 1 urbe urbe 184K Apr 10 11:41 greedyFragmentTiling
-rwxrwxr-x 1 urbe urbe 1,6K Apr 10 11:41 greedy_layout_to_IUM
-rwxrwxr-x 1 urbe urbe 142K Apr 10 11:40 initialTrim
-rwxrwxr-x 1 urbe urbe 967K Apr 10 11:41 jellyfish
-rwxrwxr-x 1 urbe urbe 219K Apr 10 11:41 markRepeatUnique
-rwxrwxr-x 1 urbe urbe 273K Apr 10 11:40 markUniqueUnique
-rwxrwxr-x 1 urbe urbe 114K Apr 10 11:40 mercy
-rwxrwxr-x 1 urbe urbe 3,8K Apr 10 11:41 mergeqc.pl
-rwxrwxr-x 1 urbe urbe 422K Apr 10 11:40 merTrim
-rwxrwxr-x 1 urbe urbe 125K Apr 10 11:40 merTrimApply
-rwxrwxr-x 1 urbe urbe 376K Apr 10 11:40 meryl
-rwxrwxr-x 1 urbe urbe 176K Apr 10 11:41 metagenomics_ovl_analyses
-rwxrwxr-x 1 urbe urbe 297K Apr 10 11:41 olap-from-seeds
-rwxrwxr-x 1 urbe urbe 275K Apr 10 11:41 outputLayout
-rwxrwxr-x 1 urbe urbe 229K Apr 10 11:41 overlapInCore
-rwxrwxr-x 1 urbe urbe 144K Apr 10 11:40 overlap_partition
-rwxrwxr-x 1 urbe urbe 179K Apr 10 11:41 overlapStats
-rwxrwxr-x 1 urbe urbe 179K Apr 10 11:41 overlapStore
-rwxrwxr-x 1 urbe urbe 153K Apr 10 11:41 overlapStoreBucketizer
-rwxrwxr-x 1 urbe urbe 175K Apr 10 11:41 overlapStoreBuild
-rwxrwxr-x 1 urbe urbe 33K Apr 10 11:41 overlapStoreIndexer
-rwxrwxr-x 1 urbe urbe 48K Apr 10 11:41 overlapStoreSorter
-rwxrwxr-x 1 urbe urbe 604K Apr 10 11:40 overmerry
lrwxrwxrwx 1 urbe urbe 4 Apr 10 11:41 pacBioToCA -> PBcR
-rwxrwxr-x 1 urbe urbe 131K Apr 10 11:41 PBcR
-rwxrwxr-x 1 urbe urbe 2,9M Apr 10 11:41 pbdagcon
-rwxrwxr-x 1 urbe urbe 1,9M Apr 10 11:41 pbutgcns
-rwxrwxr-x 1 urbe urbe 201K Apr 10 11:40 remove_fragment
-rwxrwxr-x 1 urbe urbe 153K Apr 10 11:40 removeMateOverlap
-rwxrwxr-x 1 urbe urbe 2,5K Apr 10 11:41 replaceUIDwithName-fastq
-rwxrwxr-x 1 urbe urbe 1,2K Apr 10 11:41 replaceUIDwithName-posmap
-rwxrwxr-x 1 urbe urbe 1,3M Apr 10 11:41 resolveSurrogates
-rwxrwxr-x 1 urbe urbe 139K Apr 10 11:41 rewriteCache
-rwxrwxr-x 1 urbe urbe 232K Apr 10 11:41 runCA
-rwxrwxr-x 1 urbe urbe 88K Apr 10 11:41 runCA-dedupe
-rwxrwxr-x 1 urbe urbe 14K Apr 10 11:41 runCA-overlapStoreBuild
-rwxrwxr-x 1 urbe urbe 3,6K Apr 10 11:41 run_greedy.csh
-rwxrwxr-x 1 urbe urbe 297K Apr 10 11:40 sffToCA
-rwxrwxr-x 1 urbe urbe 13K Apr 10 11:40 show-corrects
-rwxrwxr-x 1 urbe urbe 557K Apr 10 11:41 splitUnitigs
-rwxrwxr-x 1 urbe urbe 1,4M Apr 10 11:41 terminator
drwxrwxr-x 2 urbe urbe 4,0K Apr 10 11:41 TIGR
-rwxrwxr-x 1 urbe urbe 526K Apr 10 11:41 tigStore
-rwxrwxr-x 1 urbe urbe 35K Apr 10 11:41 tracearchiveToCA
-rwxrwxr-x 1 urbe urbe 35K Apr 10 11:41 tracedb-to-frg.pl
-rwxrwxr-x 1 urbe urbe 44K Apr 10 11:41 trimFastqByQVWindow
-rwxrwxr-x 1 urbe urbe 18K Apr 10 11:40 uidclient
-rwxrwxr-x 1 urbe urbe 589K Apr 10 11:41 unitigger
-rwxrwxr-x 1 urbe urbe 42K Apr 10 11:40 upgrade-v8-to-v9
-rwxrwxr-x 1 urbe urbe 42K Apr 10 11:40 upgrade-v9-to-v10
-rwxrwxr-x 1 urbe urbe 854 Apr 10 11:41 utg2fasta
-rwxrwxr-x 1 urbe urbe 731K Apr 10 11:41 utgcns
-rwxrwxr-x 1 urbe urbe 561K Apr 10 11:41 utgcnsfix

Address of the bookmark: http://wgs-assembler.sourceforge.net/wiki/index.php/Main_Page

Tools to Predict the Impact of Missense Variants !

Jit — Mon, 23 Apr 2018 12:57:33 -0500

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen‐2, SIFT, FatHMM, MutationTaster‐2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants.

A study of 10 tools on five datasets showed that such a comparative evaluation is hindered by two types of circularity. They arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used for training and in those used for evaluating the tools, which may lead to overly optimistic results. Comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate of all, and may even outperform optimized combinations of tools.
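The two types of circularity can be checked for before any benchmark is run. The following is a minimal sketch, assuming each variant is represented as a (protein, substitution) pair; the `Variant` type, the function name, and the protein IDs are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    protein: str  # protein identifier (hypothetical IDs in the example below)
    change: str   # amino acid substitution, e.g. "R175H"


def circularity_report(training, evaluation):
    """Sort evaluation variants by how they overlap with the training set."""
    train_variants = set(training)
    train_proteins = {v.protein for v in training}

    # Type 1: the exact same variant was already seen during training.
    same_variant = [v for v in evaluation if v in train_variants]
    # Type 2: a different variant, but from a protein seen during training.
    same_protein = [
        v for v in evaluation
        if v not in train_variants and v.protein in train_proteins
    ]
    # Neither the variant nor its protein appears in the training set.
    independent = [v for v in evaluation if v.protein not in train_proteins]

    return {
        "type1_same_variant": same_variant,
        "type2_same_protein": same_protein,
        "independent": independent,
    }


training = [Variant("P1", "R175H"), Variant("P1", "G245S"), Variant("P2", "A10V")]
evaluation = [Variant("P1", "R175H"), Variant("P1", "Y220C"), Variant("P3", "L5P")]

report = circularity_report(training, evaluation)
for label, variants in report.items():
    print(label, [f"{v.protein}:{v.change}" for v in variants])
```

Only the variants in the "independent" group give an estimate of how a tool generalizes to new variants in unseen proteins.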

The following tools are useful for predicting the impact of missense mutations:

PolyPhen‐2 (PP2)
“Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations”

MutationTaster‐2 (MT2)
“Evaluation of the disease‐causing potential of DNA sequence alterations”

MutationAssessor (MASS)
“Predicts the functional impact of amino acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms”

LRT
“Identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein‐coding sequences, which are likely to be unconditionally deleterious”

SIFT
“Predicts whether an amino acid substitution affects protein function”

GERP++
“Identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as “rejected substitutions.” Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element”

phyloP
“Compute conservation or acceleration P values based on an alignment and a model of neutral evolution”

FatHMM unweighted (FatHMM‐U)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”

FatHMM weighted (FatHMM‐W)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants” and its weighting scheme attributes higher tolerance scores to SNVs in proteins, related proteins, or domains that already include a high fraction of pathogenic variants

Combined Annotation Dependent Depletion (CADD)
“CADD is a tool for scoring the deleteriousness of single‐nucleotide variants as well as insertion/deletions variants in the human genome”
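The "rejected substitutions" idea quoted for GERP++ above comes down to simple arithmetic: for each alignment column, subtract the observed number of substitutions from the number expected under neutrality. The sketch below is a toy illustration with made-up numbers; the function name and the values are assumptions, and the real tool estimates both quantities from a phylogeny and a multiple alignment.

```python
def rejected_substitutions(expected, observed):
    """Per-column deficit: expected neutral substitutions minus observed ones."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    return [e - o for e, o in zip(expected, observed)]


# Made-up values for a four-column element (illustrative only).
expected = [2.0, 2.0, 2.0, 2.0]   # substitutions expected if the DNA were neutral
observed = [0.5, 0.0, 2.0, 3.0]   # substitutions inferred from the alignment

per_column = rejected_substitutions(expected, observed)
element_score = sum(per_column)

print(per_column)     # positive values suggest constraint, negative values excess change
print(element_score)
```

A larger positive total indicates a stronger deficit of substitutions, which the quoted description interprets as stronger past purifying selection.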

Tools for Protein-Protein Docking !

Poonam Mahapatra — Wed, 25 Apr 2018 05:15:53 -0500

Predicting the structure of protein–protein complexes using docking approaches is a difficult problem whose major challenges include identifying correct solutions and properly dealing with molecular flexibility and conformational changes. The following tools predict the structure of protein–protein complexes:

3D-Dock Suite

Global rigid search: FFT

Shape complementarity and electrostatics

Re-scoring and clustering. Refinement of interface side-chains

3D-Garden

Global rigid search in ensemble

Shape complementarity and Lennard–Jones potential

Side chain and backbone dihedral refinement

DOT

Global rigid search: FFT

Shape complementarity, electrostatics and VDW

Refinement: none

Escher NG

Global rigid search

Shape complementarity, hydrogen bonds and electrostatics

Integrated in VEGA

GRAMM

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Clustering of conformations

GRAMM-X

Global rigid search: FFT. Smooth protein surface representation for soft docking

Shape complementarity and Lennard-Jones potential

Minimization and re-scoring with multiple filters

HEX

Global rigid search: Fourier correlation of spherical harmonics

Shape complementarity

HADDOCK

Global rigid search

Electrostatic, VDW and desolvation energy terms

MD simulated annealing refinement. Filtering based on external data.

ICM

Global rigid search: Monte Carlo

Empirical scoring function

Clustering and selection of conformations. Refinement of interface side-chains and re-scoring

MolFit

Global rigid search: FFT

Shape complementarity

Clustering of good solutions, filtering using a priori information and small, local rigid rotations around selected conformations

PatchDock

Global rigid search

Shape complementarity and atomic desolvation energy

Clustering of conformations

PyDock

Global rigid search: FFT

Shape complementarity

Rescoring by binding electrostatics and desolvation energy

RosettaDock

Local rigid search: Monte Carlo with low and high resolution structure representation levels

Different scoring parameters for the different resolutions

ZDOCK

Global rigid search: FFT

Shape complementarity, desolvation energy, and electrostatics.

Energy minimization and re-scoring

Free for academics

Point to note:

The proper treatment of flexibility in protein–protein docking is still an active field of research. You should first analyze your proteins to define their conformational space and then choose the most suitable method for your docking problem.
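Several of the programs listed above combine a global rigid search with a shape-complementarity score on a grid. The sketch below illustrates that scoring idea only, under strong simplifying assumptions: a 2D grid, translations only (no rotations), and a brute-force scan instead of the FFT used by the real programs. The grid weights, shapes, and function names are hypothetical and not taken from any of the tools.

```python
def placement_score(receptor, ligand, shift):
    """Sum receptor weights under every ligand cell after shifting the ligand."""
    dx, dy = shift
    return sum(
        receptor.get((x + dx, y + dy), 0) * weight
        for (x, y), weight in ligand.items()
    )


def best_translation(receptor, ligand, shifts):
    """Brute-force scan over translations; returns (score, shift) of the best one."""
    best = None
    for dx in shifts:
        for dy in shifts:
            score = placement_score(receptor, ligand, (dx, dy))
            if best is None or score > best[0]:
                best = (score, (dx, dy))
    return best


# Toy receptor: core cells carry a clash penalty, surface-layer cells a reward.
receptor = {
    (0, 0): -10, (1, 0): -10, (2, 0): -10,   # interior (overlap is penalized)
    (0, 1): 1, (1, 1): 1, (2, 1): 1,         # layer just above the surface
}
# Toy ligand: a bar of three occupied cells.
ligand = {(0, 0): 1, (1, 0): 1, (2, 0): 1}

best_score, best_shift = best_translation(receptor, ligand, range(-3, 4))
print(best_score, best_shift)
```

The best placement sets the ligand flat against the receptor surface without entering its interior. FFT-based programs evaluate the same kind of correlation for all translations at once, in 3D and for many rotations.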

EvidentialGene: tr2aacds, mRNA Transcript Assembly Software

Rahul Nayak — Tue, 08 May 2018 04:39:39 -0500

EvidentialGene is a genome informatics project, "Evidence Directed Gene Construction for Eukaryotes", to construct high quality, accurate gene sets for animals and plants, developed by Don Gilbert at Indiana University, see
http://arthropods.eugenes.org/EvidentialGene/

Construction refers to the combination of classical gene prediction, and more recent gene assembly (de-novo and genome-assisted) methods. The basic Evigene methods involve using available best-of-breed gene prediction and assembly software, combining all evidence for genes, from expressed sequences, genome assembly sequences, related species protein sequences, and any other, to annotate and score gene constructions. Over-produced constructions are classified by gene evidence for best qualities per "locus", including genome-aligned and gene-transcript aligned (genome-free) locus identification. All software developed for EvidentialGene is publicly available. See project wiki/blog for notes.

Download

http://arthropods.eugenes.org/EvidentialGene/trassembly.html

https://sourceforge.net/p/evidentialgene/blog/

Address of the bookmark: http://arthropods.eugenes.org/EvidentialGene/trassembly.html

mmgenome: Tools for extracting individual genomes from metagenomes

Jit — Thu, 09 Aug 2018 17:41:17 -0500

The mmgenome toolbox enables reproducible extraction of individual genomes from metagenomes. It builds on the multi-metagenome concept but wraps most of the genome extraction process in simple R functions, making binning easy and, through the Rmarkdown format, reproducible.

The mmgenome R package also facilitates effortless integration with additional data sources and hence should not be seen as "yet another binning method", but rather a package to integrate different binning strategies.

All functions in the mmgenome R package have associated documentation; check it out in R with, e.g., ?mmplot.

Address of the bookmark: https://github.com/MadsAlbertsen/mmgenome

Useful Bioinformatics Analysis Tools !

Neel — Thu, 23 Dec 2021 23:10:02 -0600

CoMeta

Classifier of reads from metagenomic sequencing experiments.

• Kawulok, J., Deorowicz, S., CoMeta: Classification of Metagenomes Using k-mers, PLOS ONE, 2015; 10(4):1–23,

CoMSA

Compressor of multiple sequence alignments of proteins.

• Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., CoMSA: compression of protein multiple sequence alignment files, Bioinformatics, 2019; 35(2):227–234,

DSRC

Compressor of sequencing reads.

• Roguski, L., Deorowicz, S., DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 2014; 30(15):2213–2215,
• Deorowicz, S., Grabowski, Sz., Compression of DNA sequences in FASTQ format, Bioinformatics, 2011; 27(6):860–862,

FAMSA

Multiple sequence alignment designed for huge families of proteins (even containing hundreds of thousands of sequences).

• Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., FAMSA: Fast and accurate multiple sequence alignment of huge protein families, Scientific Reports, 2016; 6(33964):

FaStore

Compressor of FASTQ files.

• Roguski, L., Ochoa, I., Hernaez, M., Deorowicz, S., FaStore - a space-saving solution for raw sequencing data, Bioinformatics, 2018; 34(16):2748–2756,

FQSqueezer

Experimental high-end compressor of FASTQ files.

• Deorowicz, S., FQSqueezer: k-mer-based compression of sequencing data, Scientific Reports, 2020; 10(578):

GDC

Compressor of collections of genome sequences.

• Deorowicz, S., Danek, A., Niemiec, M., GDC 2: Compression of large collections of genomes, Scientific Reports, 2015; 5(11565):1–12,
• Deorowicz, S., Grabowski, Sz., Robust relative compression of genomes with random access, Bioinformatics, 2011; 27(21):2979–2986,

GTC

Genotype databases compressor with support for fast queries.

• Danek, A., Deorowicz, S., GTC: how to maintain huge genotype collections in a compressed form, Bioinformatics, 2018; 34(11):1834–1840,

GTShark

Genotypes compressor.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, Bioinformatics, 2019; 35(22):4791–4793,

KMC

Memory frugal k-mer counter.

•  Kokot, M., Długosz, M., Deorowicz, S., KMC 3: counting and manipulating k -mer statistics, Bioinformatics, 2017; 33(17):2759–2761,
•  Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; 31(10):1569–1576,
•  Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013; 14():Article no. 160,
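For readers new to k-mer counting, the basic operation can be sketched in a few lines. This is a naive in-memory illustration using Python's standard library, not the disk-based approach described in the KMC papers; the function name and the example sequence are assumptions.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every length-k substring made only of A, C, G and T."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Skip windows that contain ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


counts = count_kmers("ACGTACGTNACG", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Holding every k-mer in a dictionary like this stops scaling for whole-genome read sets, which is the problem that memory-frugal counters are designed to address.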

Kmer-db

Tool for estimation of evolutionary distances in a collection of genomes.

• Deorowicz, S., Gudys, A., Dlugosz, M., Kokot, M., Danek, A., Kmer-db: instant evolutionary distance estimation, Bioinformatics, 2019; 35(1):133–136,

MuGI

Index allowing queries for a collection of multiple genome sequences.

• Danek, A., Deorowicz, S., Grabowski, Sz., Indexes of Large Genome Collections on a PC, PLOS ONE, 2014; 9(10):e109384,

ORCOM

Experimental compressor of sequencing reads.

• Grabowski, Sz., Deorowicz, S., Roguski, L., Disk-based compression of data from genome sequencing, Bioinformatics, 2014; 31(9):1389–1395,

PgSA

Index allowing queries for a collection of sequencing reads.

• Kowalski, T., Grabowski, Sz., Deorowicz, S., Indexing arbitrary-length k-mers in sequencing reads, PLOS ONE, 2015; 10(7):1–16,

QuickProbs

Multiple sequence alignment designed especially for GPU.

• Gudys, A., Deorowicz, S., QuickProbs 2: towards rapid construction of high-quality alignments of large protein families, Scientific Reports, 2017; 7(41553):
• Gudys, A., Deorowicz, S., QuickProbs – A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors, PLOS ONE, 2014; 9(2):e88901,

RECKONER

Read error corrector.

• Długosz, M., Deorowicz, S., RECKONER: read error corrector based on KMC, Bioinformatics, 2017; 33(7):1086–1089,

TGC

Compressor of collections of genomes given in Variant Call Format (VCF) files.

• Deorowicz, S., Danek, A., Grabowski, Sz., Genome compression: a novel approach for large collections, Bioinformatics, 2013; 29(20):2572–2578,

VCFShark

Compressor of VCF files.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, biorxiv.org, 2020; ():

Whisper

Experimental mapper of whole genome sequencing data.

•  Deorowicz, S., Gudys, A., Whisper 2: indel-sensitive short read mapping, bioRxiv.org, 2019; :
•  Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Whisper: read sorting allows robust mapping of DNA sequencing data, Bioinformatics, 2019; 35(12):2043–2050,
•  Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Robust mapping of whole genome sequencing data, Poster at The Biology of Genomes Conference, 2017;

Interesting Bioinformatics Resources !

Abhi — Fri, 11 Nov 2022 06:30:46 -0600

1. A reproducible workflow. https://www.youtube.com/watch?v=s3JldKoA0zw This two-minute video will change your mind on reproducible research.

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. A video by Dr. Keith A. Baggerly from MD Anderson, [The Importance of Reproducible Research in High-Throughput Biology](https://www.youtube.com/watch?v=7gYIs7uYbMo), highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042 A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Exploring Bacterial Comparative Genomics: A Bioinformatics Approach

LEGE — Sat, 14 Dec 2024 12:31:14 -0600

In the world of microbiology, bacteria have long fascinated scientists for their diversity, adaptability, and crucial roles in ecosystems and human health. Comparative genomics—a field that involves analyzing and comparing the genomes of different organisms—has revolutionized our understanding of bacterial evolution, adaptation, and pathogenicity. By leveraging bioinformatics tools and techniques, researchers can uncover genomic insights that were once hidden. This blog delves into the principles, methodologies, and applications of bacterial comparative genomics from a bioinformatics perspective.

What is Bacterial Comparative Genomics?

Comparative genomics involves the systematic comparison of genomes across different bacterial species or strains. This approach allows scientists to:

Identify conserved and unique genes.
Explore genetic determinants of pathogenicity.
Understand bacterial evolution and phylogenetics.
Investigate horizontal gene transfer and its role in antibiotic resistance.

Bioinformatics is central to these analyses, enabling the processing and interpretation of large-scale genomic data.

Key Steps in Bacterial Comparative Genomics

Genome Sequencing and Assembly: The process begins with obtaining high-quality bacterial genome sequences. Advances in next-generation sequencing (NGS) technologies have made it faster and more affordable to sequence bacterial genomes. Tools such as SPAdes and Velvet are commonly used for genome assembly.
Genome Annotation: Annotating a genome involves identifying genes, regulatory elements, and other genomic features. Automated tools like Prokka and RAST provide functional annotations, allowing researchers to predict the roles of genes and proteins.
Genome Alignment: Aligning genomes is crucial for identifying conserved regions, single-nucleotide polymorphisms (SNPs), and structural variations. Tools like Mauve and progressiveMauve are commonly employed for whole-genome alignments.
Comparative Analyses:
- Core and Pan-genome Analysis: The core genome consists of genes shared across all strains of a species, while the pan-genome includes all genes found in any strain. Software like Roary and BPGA can perform core and pan-genome analyses.
- Phylogenetic Analysis: Comparative genomics often involves reconstructing evolutionary relationships. Tools such as MEGA and IQ-TREE facilitate phylogenetic tree construction based on genomic data.
- Functional Enrichment Analysis: To understand the biological significance of unique or shared genes, functional enrichment analysis using databases like GO (Gene Ontology) and KEGG is essential.
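The core and pan-genome definitions above map directly onto set operations. The following is a minimal sketch, assuming genes have already been clustered into shared gene families, which is the hard part that tools such as Roary handle; the strain names and gene lists are illustrative toy data, and the function name is hypothetical.

```python
def core_and_pan_genome(gene_content):
    """Return (core, pan) gene sets from a strain -> genes mapping."""
    gene_sets = [set(genes) for genes in gene_content.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)  # genes present in every strain
    pan = set.union(*gene_sets)          # genes present in at least one strain
    return core, pan


# Toy presence lists (illustrative only, not real annotations).
gene_content = {
    "strain_A": {"dnaA", "gyrB", "recA", "stx2"},
    "strain_B": {"dnaA", "gyrB", "recA"},
    "strain_C": {"dnaA", "gyrB", "recA", "blaTEM"},
}

core, pan = core_and_pan_genome(gene_content)
accessory = pan - core  # genes missing from at least one strain

print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

The accessory set is often the starting point for the functional enrichment step, since it holds the genes that differ between strains.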

Recommended Bioinformatics Tools for Comparative Genomics

Here are some additional bioinformatics tools that can aid bacterial comparative genomics:

OrthoFinder: For accurate ortholog identification across multiple genomes.
PanOCT: Specifically designed for pan-genome clustering and annotation.
FastANI: A tool for calculating Average Nucleotide Identity (ANI) for microbial genome comparisons.
Circos: For visually comparing genomic data through circular genome plots.
Galaxy Platform: A user-friendly web-based platform offering numerous genomic analysis tools.
BLAST: Essential for sequence alignment and similarity searches.
PhyloSift: Focused on phylogenetic analysis of microbial genomes using marker genes.

These tools, in combination with the methods discussed, provide a robust framework for conducting comprehensive comparative genomic studies.

Applications of Bacterial Comparative Genomics

Understanding Pathogenicity: Comparative genomics helps identify virulence factors that distinguish pathogenic strains from non-pathogenic relatives. For instance, comparing genomes of Escherichia coli strains has revealed key genetic determinants of pathogenicity in enterohemorrhagic strains.
Antibiotic Resistance Research: The spread of antibiotic resistance genes through horizontal gene transfer is a major global concern. Comparative analyses can trace the origins and dissemination of resistance genes, aiding in the development of countermeasures.
Microbial Ecology and Evolution: By studying genomic variations, researchers can understand how bacteria adapt to different environments. This is particularly relevant for extremophiles and symbiotic bacteria.
Vaccine Development: Identifying conserved antigens across pathogenic strains is critical for vaccine design. Comparative genomics has been instrumental in developing vaccines against pathogens like Neisseria meningitidis.
Biotechnology Applications: Comparative studies can uncover unique metabolic pathways in bacteria, paving the way for applications in bioremediation, synthetic biology, and industrial microbiology.

Challenges in Bacterial Comparative Genomics

While the field has made significant strides, several challenges remain:

Data Overload: The rapid growth of sequencing data requires robust computational infrastructure and efficient algorithms.
Genome Plasticity: High rates of horizontal gene transfer and genome rearrangements in bacteria complicate comparative analyses.
Annotation Accuracy: Automated annotation tools are not infallible, and manual curation is often needed for high-confidence results.
Interpreting Non-Coding Regions: Understanding the functional significance of non-coding genomic regions remains a challenge.

Future Directions

The integration of bacterial comparative genomics with other ‘omics’ approaches—such as transcriptomics, proteomics, and metabolomics—promises a more comprehensive understanding of bacterial biology. Additionally, advancements in machine learning and artificial intelligence are likely to further enhance bioinformatics analyses, enabling the prediction of complex phenotypes from genomic data.

Conclusion

Bacterial comparative genomics, driven by bioinformatics, continues to unravel the complexities of bacterial life. From combating antibiotic resistance to uncovering the secrets of microbial evolution, this interdisciplinary field holds immense potential for addressing pressing challenges in microbiology and beyond. As technology advances, so too will our ability to harness the power of comparative genomics for scientific and societal benefit.