BOL: Related items

Unlocking Evolutionary Secrets: A Dive into Comparative Genomics Methods

LEGE — Tue, 20 May 2025 00:25:09 -0500

Comparative genomics is the art and science of comparing genomes—across species, within species, or even among individuals—to unravel evolutionary relationships, functional elements, and genetic adaptations. As sequencing technologies have advanced and genome databases have expanded, comparative genomics has become a cornerstone of modern biology, shedding light on everything from antibiotic resistance in bacteria to human disease genetics.

In this post, we’ll explore the core methods used in comparative genomics, the questions they help answer, and how they’re shaping our understanding of life.

1. Whole-Genome Alignment
Whole-genome alignment involves mapping the entire genome of one species to another. Tools like MUMmer, MAUVE, and LASTZ perform large-scale sequence alignments to detect conserved regions, rearrangements, insertions, and deletions.

Use Case:
Comparing human and chimpanzee genomes to identify evolutionarily conserved sequences (ECS) and regions of divergence.

Key Challenges:
Handling repetitive sequences and genome rearrangements.

Computational complexity in large genomes.

2. Synteny and Collinearity Analysis
Synteny refers to conserved blocks of gene order across species. Tools like MCScanX, SynMap, or CHITRA (for visualizing synteny interactively) detect these blocks to understand chromosomal evolution.

Use Case:
Studying ancient genome duplications in plants.

Investigating chromosomal rearrangements in cancer genomes.
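
To make the idea of a conserved block of gene order concrete, here is a minimal sketch that chains anchor gene pairs into forward-oriented collinear blocks. It is a simplified greedy pass, not the algorithm used by MCScanX or SynMap; the function name, the `max_gap` and `min_size` parameters, and the toy anchors are all assumptions made for illustration.

```python
from typing import List, Tuple

Anchor = Tuple[int, int]  # (gene index in genome A, gene index in genome B)


def collinear_blocks(anchors: List[Anchor], max_gap: int = 2,
                     min_size: int = 3) -> List[List[Anchor]]:
    """Greedily chain anchors into forward-oriented collinear blocks."""
    blocks: List[List[Anchor]] = []
    current: List[Anchor] = []
    for a, b in sorted(anchors):
        if current:
            prev_a, prev_b = current[-1]
            # Extend the block only if both coordinates advance by a small step.
            if 0 < a - prev_a <= max_gap and 0 < b - prev_b <= max_gap:
                current.append((a, b))
                continue
            # Otherwise close the current block, keeping it if it is long enough.
            if len(current) >= min_size:
                blocks.append(current)
        current = [(a, b)]
    if len(current) >= min_size:
        blocks.append(current)
    return blocks


# Hypothetical anchors: two collinear runs plus one isolated pair.
anchors = [(1, 10), (2, 11), (3, 12), (5, 14), (9, 40), (20, 3), (21, 4), (22, 5)]
blocks = collinear_blocks(anchors)
for block in blocks:
    print(block[0], "->", block[-1], f"({len(block)} anchors)")
```

Real tools also handle inverted blocks, score gaps rather than applying a hard cutoff, and assess statistical significance, none of which this sketch attempts.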

3. Ortholog and Paralog Detection
Orthologs are genes in different species that descend from a single ancestral gene through speciation, while paralogs are genes that arose by duplication within a genome. Identifying them is crucial for functional annotation and evolutionary studies.

Popular Tools:
OrthoFinder, Orthologous MAtrix (OMA), InParanoid, and eggNOG.

Use Case:
Functional prediction of uncharacterized genes based on orthologs in model organisms.

Tracing gene family evolution.
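
One of the simplest heuristics for finding candidate orthologs is the reciprocal best hit (RBH): gene a in species A and gene b in species B are paired if each is the other's top-scoring hit. The sketch below assumes the similarity search has already been run and its results reduced to (query, subject, score) tuples; the gene names and scores are invented. The tools listed above use more elaborate methods than this.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject gene, bit score)


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Return (gene in A, gene in B) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical search results in both directions.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 180.0), ("a3", "b2", 90.0)]
b_vs_a = [("b1", "a1", 195.0), ("b2", "a2", 175.0), ("b2", "a3", 80.0)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a3 also hits b2, but b2's best hit is a2, so a3 is left unpaired; it may be a paralog or a more distant homolog.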

4. Phylogenomic Analysis
Phylogenomic methods combine phylogenetics and genomics to infer evolutionary trees based on genome-wide data. These methods can handle dozens to hundreds of genomes, using concatenated alignments or gene trees.

Tools:
RAxML, IQ-TREE, ASTRAL, Phylip, BEAST.

Use Case:
Resolving the evolutionary relationships between microbial species.

Studying speciation events.
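
The concatenation approach mentioned above can be sketched as follows: per-gene alignments are joined end to end into one supermatrix, and a taxon missing from a gene is padded with gap characters. This is only an illustration with made-up taxa and sequences; the function name is hypothetical and the result would still need to be passed to a tree-building program.

```python
from typing import Dict, List

Alignment = Dict[str, str]  # taxon name -> aligned sequence for one gene


def concatenate(alignments: List[Alignment]) -> Dict[str, str]:
    """Join per-gene alignments into a supermatrix, padding missing taxa with gaps."""
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon absent from this gene contributes a run of gaps.
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data: taxon B lacks the second gene.
gene1 = {"A": "ATG", "B": "ATA", "C": "ACG"}
gene2 = {"A": "GGCC", "C": "GGTC"}

supermatrix = concatenate([gene1, gene2])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
```

The alternative, gene-tree-based route skips concatenation and instead infers one tree per gene before summarizing them into a species tree.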

5. Pan-Genome Analysis
The pan-genome consists of the core genome (genes shared by all strains) and the accessory genome (genes present in only some strains, including strain-specific genes). This kind of analysis is especially popular in microbial genomics.

Tools:
Roary, Panaroo, BPGA, PGAP.

Use Case:
Understanding virulence factor diversity in E. coli.

Designing broad-spectrum vaccines.
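
Once genes have been clustered into families, splitting a pan-genome into its parts is plain set arithmetic. The sketch below assumes each strain is already represented as a set of gene-family names (toy names here) and separates core genes (in every strain), accessory genes (missing from at least one strain), and strain-specific genes (in exactly one strain). The tools listed above do the hard part, which is the clustering itself.

```python
from typing import Dict, Set


def partition_pangenome(strain_genes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split gene families into pan, core, accessory and strain-specific sets."""
    if not strain_genes:
        raise ValueError("at least one strain is required")
    gene_sets = list(strain_genes.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    accessory = pan - core
    # Strain-specific genes are the accessory genes seen in exactly one strain.
    unique = {g for g in accessory if sum(g in s for s in gene_sets) == 1}
    return {"pan": pan, "core": core, "accessory": accessory, "unique": unique}


# Toy presence/absence data for three hypothetical strains.
strains = {
    "strain1": {"geneA", "geneB", "geneC"},
    "strain2": {"geneA", "geneB", "geneD", "geneE"},
    "strain3": {"geneA", "geneB", "geneE"},
}

parts = partition_pangenome(strains)
for name in ("pan", "core", "accessory", "unique"):
    print(name, sorted(parts[name]))
```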

6. Comparative Transcriptomics
Comparing transcriptomes across species or conditions reveals conserved and unique expression patterns. RNA-seq reads can be mapped to each species' reference genome so that the expression profiles of orthologous genes can be compared.

Use Case:
Comparing stress response in extremophiles and model species.

Studying conserved regulatory networks.

7. Functional Element Comparison
Beyond genes, comparative genomics also targets non-coding regions—enhancers, promoters, miRNAs. Conservation across species often implies functional importance.

Tools:
PhastCons, GERP, phyloP (based on multiple alignments).

Use Case:
Detecting conserved non-coding elements in vertebrates.

Studying regulatory divergence in human evolution.
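
As a very rough stand-in for conservation scoring, the sketch below computes, for each column of a multiple alignment, the fraction of sequences that carry the most common base. PhastCons, GERP and phyloP use phylogenetic models and are far more sensitive; this toy score, the function name and the example alignment are assumptions for illustration only.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Per-column fraction of sequences sharing the most common non-gap base."""
    if not alignment or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    scores: List[float] = []
    for column in zip(*alignment):
        bases = [base for base in column if base != "-"]
        if not bases:
            scores.append(0.0)  # an all-gap column carries no signal
            continue
        top_count = Counter(bases).most_common(1)[0][1]
        # Dividing by the full column height penalizes gaps as well as mismatches.
        scores.append(top_count / len(column))
    return scores


# Toy alignment of four sequences.
alignment = ["ACGT-A", "ACGTTA", "ACCTTA", "ATGT-A"]
scores = column_conservation(alignment)
print(scores)
```

Runs of high-scoring columns in non-coding sequence are the kind of signal that flags a candidate conserved element for closer study.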

8. Horizontal Gene Transfer (HGT) Detection
In microbes, genes often jump across species boundaries. Comparative genomics can detect HGT by identifying genes that defy the expected phylogenetic pattern.

Tools:
HGTector, DarkHorse, AlienHunter, SIGI-HMM.

Use Case:
Tracing antibiotic resistance genes.

Exploring microbial adaptability in extreme environments.

Final Thoughts
Comparative genomics is a powerful lens to observe the diversity and unity of life. With a broad toolkit—from aligners to orthology pipelines, phylogenetic engines to visualization tools—it allows scientists to ask big questions: How did genomes evolve? What makes species unique? Where do new genes come from?

Whether you're studying extremophiles, building better crops, or exploring human ancestry, comparative genomics offers the methods to connect the dots across the tree of life.

G-compass: a comparative genome browser

Jit — Thu, 12 Apr 2018 10:00:27 -0500

G-compass (http://www.h-invitational.jp/g-compass/) is a comparative genome browser. It visualizes evolutionarily conserved genomic regions between human and 12 other vertebrates, based on original genome alignments that aim for higher coverage (1,2). Annotations of human genes/transcripts and their ortholog information were derived from H-InvDB and its subdatabase Evola, respectively. G-compass is available free of charge.

Address of the bookmark: http://www.h-invitational.jp/g-compass/

Postdoctoral fellow for a large-scale microbial comparative genomics project

Thu, 29 Apr 2021 08:44:53 -0500

Asaf Levy is hiring a postdoctoral fellow for a large-scale microbial comparative genomics project at the Hebrew University of Jerusalem (Israel).
The project is a continuation of the Levy et al. Nature Genetics 2018 paper.
Requirements:
1. Experience with programming in at least one programming language, preferably Python.
2. A PhD in bioinformatics/computational biology.
3. At least one first-author publication in a good journal, preferably more.
4. Good communication skills in English.
5. Ability to enter and study in Israel (not applicable for Pakistani people, for example).
6. Ability to work in a team.
Please send CV to alevy@mail.huji.ac.il

DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication

Jit — Tue, 14 Nov 2017 10:26:16 -0600

We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST was originally started as an on-line annotation server, and to date, over 7,000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 minutes, with rich information such as pseudogenes, translation exceptions, and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will also facilitate extensions for new functions and incorporation of new tools in the future.

Availability and Implementation

The software is implemented in Python and runs under both Python 2.7 and Python 3.4 or later on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/.

Address of the bookmark: https://dfast.nig.ac.jp/

jobTree based python wrapper to run the genome simulation tool suite Evolver

Jit — Fri, 08 Dec 2017 16:26:32 -0600

evolverSimControl (eSC) can be used to simulate multi-chromosome genome evolution on an arbitrary phylogeny (Newick format). In addition to simply running evolver, eSC also automatically creates statistical summaries of the simulation as it runs including text and image files. Also included are convenience scripts to: check on a running simulation and see detailed status and logging information; extract fasta sequence files from the leaf nodes of a completed simulation; extract pairwise multiple alignment files (.maf) from leaf and branch nodes from a completed simulation and with the help of mafJoin, join them together into a single maf covering the entire simulation.

Address of the bookmark: https://github.com/dentearl/evolverSimControl

Metassembler: merging and optimizing de novo genome assemblies

Rahul Nayak — Tue, 08 May 2018 04:52:33 -0500

Metassembler combines multiple whole genome de novo assemblies into a combined consensus assembly using the best segments of the individual assemblies.

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence.

Address of the bookmark: https://sourceforge.net/projects/metassembler/?source=directory

EAGLER: a scaffolding tool for long reads.

Jit — Mon, 04 Jun 2018 05:26:03 -0500

EAGLER is a scaffolding tool for long reads. The scaffolder takes as input a draft genome created by any NGS assembler and a set of long reads. The long reads are used to extend the contigs present in the NGS draft and possibly join overlapping contigs. EAGLER supports both PacBio and Oxford Nanopore reads.

The tool should be compatible with most UNIX flavors and has been successfully tested on the following operating systems:

Mac OS X 10.11.1
Mac OS X 10.10.3
Ubuntu 14.04 LTS

https://bib.irb.hr/datoteka/844447.Diplomski_2015_Luka_terbi.pdf

Address of the bookmark: https://github.com/mculinovic/EAGLER

genoPlotR - plot gene and genome maps project!

Abhimanyu Singh — Wed, 12 Dec 2018 08:33:41 -0600

genoPlotR is an R package to produce reproducible, publication-grade graphics of gene and genome maps. It allows the user to read from common formats such as protein table files and BLAST results, as well as home-made tabular files.

Features

Linear representation of several segments of DNA
Comparisons represented by areas between the segments (like Artemis, for example)
Reads from common formats: Genbank, EMBL, blast, Mauve, and from user-generated tab files
Plot several subsegments of the same segment on the same line, separated by a //
Automatic or manual placement of the segments on the plot
Add annotations to all the lines
Create smart, automatic annotations for genomes, based on gene names
Add a user-generated tree
Add a global scale or a scale to each line
Use user-defined graphical functions to represent genes

Address of the bookmark: http://genoplotr.r-forge.r-project.org/

Hawkeye: an interactive visual analytics tool for genome assemblies

Abhimanyu Singh — Tue, 01 Jan 2019 11:56:17 -0600

Genome sequencing remains an inexact science, and genome sequences can contain significant errors if they are not carefully examined. Hawkeye is our new visual analytics tool for genome assemblies, designed to aid in identifying and correcting assembly errors. Users can analyze all levels of an assembly along with summary statistics and assembly metrics, and are guided by a ranking component towards likely mis-assemblies. Hawkeye is freely available and released as part of the open source AMOS project http://amos.sourceforge.net/hawkeye.

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-3-r34

Address of the bookmark: http://amos.sourceforge.net/wiki/index.php?title=Hawkeye

Ancient whole genome duplication (WGD) detection tools

Rahul Nayak — Sun, 07 Mar 2021 00:32:44 -0600

There are two main approaches to detecting ancient WGD: collinearity analysis and analysis of the Ks distribution. Ks is the average number of synonymous substitutions per synonymous site; its counterpart, Ka, is the average number of non-synonymous substitutions per non-synonymous site.
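
As a worked illustration of these two rates, the sketch below applies the Jukes-Cantor correction used in counting-based methods: if p is the observed proportion of differing sites, the estimated number of substitutions per site is -(3/4) ln(1 - (4/3) p). The counts of sites and differences are invented, and counting them from real codon alignments is the harder part that is not shown here. Pipelines that rely on PAML typically estimate Ks by maximum likelihood rather than with this formula.

```python
import math


def jukes_cantor_rate(differences: float, sites: float) -> float:
    """Substitutions per site from an observed proportion of differences."""
    if sites <= 0:
        raise ValueError("sites must be positive")
    p = differences / sites
    if p >= 0.75:
        raise ValueError("proportion of differences is saturated (>= 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Hypothetical counts for one pair of duplicated genes.
ks = jukes_cantor_rate(differences=12, sites=60)    # synonymous differences / sites
ka = jukes_cantor_rate(differences=6, sites=240)    # non-synonymous differences / sites

print(f"Ks = {ks:.4f}, Ka = {ka:.4f}, Ka/Ks = {ka / ks:.3f}")
```

Because synonymous changes are assumed to be close to neutral, Ks behaves roughly like a clock, which is why gene pairs created by the same duplication event tend to pile up around one Ks value.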

Several WGD analysis pipelines have been published. Searching for the keyword "wgd pipeline" turned up the following:

GenoDup: https://github.com/MaoYafei/GenoDup-Pipeline
https://peerj.com/articles/6303/
WGDdetector: https://github.com/yongzhiyang2012/WGDdetector
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2670-3
wgd: https://github.com/arzwa/wgd
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2#Sec1
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
GeNoGAP: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
https://github.com/dfguan/purge_dups
https://www.biorxiv.org/content/10.1101/2020.01.24.917997v1

This article introduces the usage of wgd.

wgd cannot currently be installed directly with bioconda, and installation is somewhat tedious because it depends on several external programs:

BLAST
MCL
MUSCLE/MAFFT/PRANK
PAML
PhyML/FastTree
i-ADHoRe

The good news is that most of these dependencies can be installed with bioconda:

conda create -n wgd python=3.5 blast mcl muscle mafft prank paml fasttree cmake libpng mpi=1.0=mpich
conda activate wgd

mpi=1.0=mpich is pinned here because i-ADHoRe depends on MPICH. If OpenMPI is installed instead, an error appears: error while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

After that, installing wgd itself is much simpler. Either clone the repository and install from the local copy:

git clone https://github.com/arzwa/wgd.git
cd wgd
pip install .

or install directly from GitHub without cloning:

pip install git+https://github.com/arzwa/wgd.git

For i-ADHoRe, you need to register at http://bioinformatics.psb.ugent.be/webtools/i-adhore/licensing/ and agree to the license before you can download i-ADHoRe 3.0.

Because my miniconda3 is installed under ~/opt/, the install prefix for the environment is ~/opt/miniconda3/envs/wgd/:

tar -zxvf i-adhore-3.0.01.tar.gz
cd i-adhore-3.0.01
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX="$HOME/opt/miniconda3/envs/wgd/"
make -j 4
make install

Take the sugarcane genome Saccharum spontaneum L. as an example. The genome is octoploid, and the assembly used here has 32 chromosomes (4 x 8 = 32).

Download the CDS and GFF annotation files for the tutorial:

mkdir -p wgd_tutorial && cd wgd_tutorial
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.cds.fasta.gz
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.gff3.gz
gunzip *.gz

First run conda activate wgd to enter the analysis environment, then start the analysis.

Step 1: Use wgd mcl to identify homologous genes in the genome

wgd mcl -n 20 --cds --mcl -s Sspon.v20190103.cds.fasta -o Sspon_cds.out

Step 2: Use wgd ksd to build the Ks distribution

wgd ksd --n_threads 80 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl Sspon.v20190103.cds.fasta
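
To show what one looks for in a Ks distribution, the sketch below bins a list of Ks values and reports the most populated bin, ignoring very small values (often recent tandem duplicates) and very large, saturated ones. It does not read wgd's output files; the Ks values, the cutoffs and the function name are assumptions for illustration, and proper peak detection is the job of the kde and mix subcommands.

```python
from collections import Counter
from typing import Iterable, Tuple


def ks_peak(ks_values: Iterable[float], bin_width: float = 0.1,
            ks_min: float = 0.05, ks_max: float = 3.0) -> Tuple[float, int]:
    """Return (bin centre, count) of the fullest Ks bin within [ks_min, ks_max)."""
    counts = Counter(
        int(ks // bin_width) for ks in ks_values if ks_min <= ks < ks_max
    )
    if not counts:
        raise ValueError("no Ks values in the requested range")
    # Prefer the highest count; break ties in favour of the smaller Ks bin.
    best_bin, n = max(counts.items(), key=lambda item: (item[1], -item[0]))
    return (best_bin + 0.5) * bin_width, n


# Hypothetical Ks values for paralogous gene pairs.
ks_values = [0.82, 0.85, 0.91, 0.88, 0.79, 0.12, 0.01, 2.45, 0.86, 1.53]
centre, n = ks_peak(ks_values)
print(f"peak near Ks = {centre:.2f} ({n} gene pairs)")
```

A clear peak away from zero is consistent with many gene pairs having been created at about the same time, which is the expected signature of a WGD.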

Step 3: If the genome assembly is of good quality, wgd syn can be used for collinearity analysis. It finds the collinear blocks in the genome and their corresponding anchor pairs.

wgd syn --feature gene --gene_attribute ID \
-ks wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv \
Sspon.v20190103.gff3 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl

Further reading: wgd has 9 subcommands

kde: KDE fitting to the Ks distribution
ksd: Ks distribution construction
mcl: all-vs-all BLASTP comparison + MCL clustering
mix: mixture modeling of the Ks distribution
pre: preprocess the CDS file
syn: call i-ADHoRe 3.0 to run collinearity analysis using GFF files
viz: draw histograms and density plots
wf1: standard Ks analysis workflow for the whole paranome; calls mcl, ksd and syn
wf2: standard Ks analysis workflow for one-vs-one orthologs; calls mcl and ksd