BOL: Related items

Ancient whole genome duplication (WGD) detection tools !

Rahul Nayak — Sun, 07 Mar 2021 00:32:44 -0600

There are two approaches to detecting ancient WGD: collinearity analysis, and analysis of the Ks distribution. Ks is the average number of synonymous substitutions per synonymous site; its counterpart, Ka, is the average number of non-synonymous substitutions per non-synonymous site.

Several WGD analysis pipelines have already been published. I searched for the keyword "wgd pipeline" and found the following:
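
To make the two definitions concrete, here is a minimal Python sketch. It assumes the substitution and site counts for a gene pair are already known (the counting step is not shown), and it applies the Jukes-Cantor correction for multiple hits, which is only one of several possible estimators. The function name and all numbers are hypothetical and are not taken from any of the tools listed in this post.

```python
import math

def jc_distance(substitutions, sites):
    """Substitutions per site, with the Jukes-Cantor multiple-hit correction."""
    p = substitutions / sites            # observed proportion of differing sites
    if p >= 0.75:
        raise ValueError("saturated: correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical counts for one pair of paralogous genes
syn_subs, syn_sites = 30.0, 200.0          # synonymous substitutions / sites
nonsyn_subs, nonsyn_sites = 20.0, 600.0    # non-synonymous substitutions / sites

ks = jc_distance(syn_subs, syn_sites)
ka = jc_distance(nonsyn_subs, nonsyn_sites)
print(f"Ks = {ks:.4f}, Ka = {ka:.4f}, Ka/Ks = {ka / ks:.4f}")
```

In a Ks-based WGD analysis it is the Ks values of many such gene pairs, pooled into one distribution, that are inspected for peaks.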

GenoDup: https://github.com/MaoYafei/GenoDup-Pipeline
https://peerj.com/articles/6303/
WGDdetector: https://github.com/yongzhiyang2012/WGDdetector
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2670-3
wgd: https://github.com/arzwa/wgd
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2#Sec1
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
GeNoGAP https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
https://github.com/dfguan/purge_dups
https://www.biorxiv.org/content/10.1101/2020.01.24.917997v1

This article introduces the usage of wgd.

At present wgd cannot be installed directly with bioconda, so installation is a little troublesome because it depends on a lot of other software. wgd depends on the following:

BLAST
MCL
MUSCLE/MAFFT/PRANK
PAML
PhyML/FastTree
i-ADHoRe

The good news is that most of these dependencies can be installed with bioconda:

conda create -n wgd python=3.5 blast mcl muscle mafft prank paml fasttree cmake libpng mpi=1.0=mpich
conda activate wgd

mpi=1.0=mpich is selected here because i-ADHoRe depends on mpich. If openmpi is installed instead, i-ADHoRe fails with: error while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

After that, installing wgd itself is much simpler. Either clone the repository and install from the local copy:

git clone https://github.com/arzwa/wgd.git
cd wgd
pip install .

or install directly from GitHub in one step:

pip install git+https://github.com/arzwa/wgd.git

For i-ADHoRe, you need to register at http://bioinformatics.psb.ugent.be/webtools/i-adhore/licensing/ and agree to the license before you can download i-ADHoRe-3.0.

Since my miniconda3 is installed under ~/opt/, the installation prefix is ~/opt/miniconda3/envs/wgd/

tar -zxvf i-adhore-3.0.01.tar.gz
cd i-adhore-3.0.01
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=~/opt/miniconda3/envs/wgd/
make -j 4
make install

Take the sugarcane genome Saccharum spontaneum L. as an example. The sequenced genome has 32 chromosomes: four homologous copies of a basic set of eight (4 x 8 = 32).

Download the CDS and GFF annotation files for the tutorial:

mkdir -p wgd_tutorial && cd wgd_tutorial
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.cds.fasta.gz
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.gff3.gz
gunzip *.gz

First run conda activate wgd to start the analysis environment, and then begin the analysis.

Step 1: Use wgd mcl to identify homologous genes in the genome

wgd mcl -n 20 --cds --mcl -s Sspon.v20190103.cds.fasta -o Sspon_cds.out

Step 2: Use wgd ksd to build the Ks distribution

wgd ksd --n_threads 80 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl Sspon.v20190103.cds.fasta
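
A Ks distribution is essentially a histogram of pairwise Ks values. The following minimal Python sketch shows how such a histogram could be built from a tab-separated table. The column name "Ks", the bin width, and the upper Ks limit are assumptions made for illustration, so check the header of your own .ks.tsv file before relying on them; the demo table is made up.

```python
import csv
import io

def ks_histogram(tsv_handle, column="Ks", bin_width=0.1, max_ks=3.0):
    """Count Ks values per bin; values outside (0, max_ks] are skipped."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        try:
            ks = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue                          # missing or non-numeric value
        if 0.0 < ks <= max_ks:
            idx = min(int(ks / bin_width), n_bins - 1)
            counts[idx] += 1
    return counts

# Tiny made-up table standing in for a real Ks file
demo = io.StringIO(
    "pair\tKs\np1\t0.05\np2\t0.12\np3\t0.18\np4\t2.95\np5\tNA\np6\t7.0\n"
)
hist = ks_histogram(demo)
for i, n in enumerate(hist):
    if n:
        print(f"{i * 0.1:.1f}-{(i + 1) * 0.1:.1f}\t{'#' * n}")
```

To try it on real output, replace the demo handle with an opened file. A peak in the histogram away from Ks = 0 is the kind of signal that is usually examined as a candidate WGD event.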

Step 3: If the genome assembly is of good quality, wgd syn can be used for collinearity analysis. It finds the collinear blocks in the genome and their corresponding anchor points.

wgd syn --feature gene --gene_attribute ID \
-ks wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv \
Sspon.v20190103.gff3 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl

For further reading: wgd has 9 sub-modules

kde: KDE fitting of the Ks distribution
ksd: Ks distribution construction
mcl: all-vs-all BLASTP comparison + MCL clustering
mix: mixture modeling of the Ks distribution
pre: preprocess the CDS file
syn: calls i-ADHoRe 3.0 to run collinearity analysis using the GFF file
viz: draw histograms and density plots
wf1: standard Ks analysis workflow for the whole-genome paranome; calls mcl, ksd and syn
wf2: standard Ks analysis workflow for one-vs-one orthologs; calls mcl and ksd

Understanding kmer !

BioStar — Wed, 18 Aug 2021 04:27:51 -0500

What is a k-mer anyway? A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence). Now, it is important to remember that to get all k-mers from a sequence you need to get the first k characters, then move just a single character for the start of the next k-mer and so on. Effectively, this will create sequences that overlap in k-1 positions.
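
The sliding-window idea can be sketched in a few lines of Python. This is only an illustration with a made-up sequence and a hypothetical helper name; real k-mer counters use far more memory-efficient data structures.

```python
from collections import Counter

def kmers(seq, k):
    """Return every k-mer of seq, sliding the window one character at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ATGGCATG"          # made-up example sequence
k = 3
all_kmers = kmers(seq, k)
counts = Counter(all_kmers)

print(all_kmers)
print(counts.most_common(1))
```

A sequence of length n yields n - k + 1 k-mers, and each k-mer shares its last k-1 characters with the first k-1 characters of the next one.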

Address of the bookmark: https://bioinfologics.github.io/post/2018/09/17/k-mer-counting-part-i-introduction/

Internship Program for Bioinformatics / Biotechnology / MBA / MCA (No. Of Vacancy: 5)

Mon, 15 Dec 2014 08:11:02 -0600

ArrayGen is offering an Internship Program for postgraduate Bioinformatics / Biotechnology / MBA / MCA students and professionals. ArrayGen Technologies provides an excellent opportunity to gain research experience and explore whether a scientific career is right for you. Currently we offer positions to outstanding students interested in Next Generation Sequencing (NGS) data analysis, marketing, or software development. Applications are accepted throughout the year. Accepted students will be notified by email.

maftools

Surabhi Chaudhary — Fri, 17 Dec 2021 03:18:28 -0600

With advances in Cancer Genomics, Mutation Annotation Format (MAF) is being widely accepted and used to store somatic variants detected. The Cancer Genome Atlas Project has sequenced over 30 different cancers with sample size of each cancer type being over 200. Resulting data consisting of somatic variants are stored in the form of Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format.

https://www.bioconductor.org/packages/devel/bioc/vignettes/maftools/inst/doc/maftools.html

Address of the bookmark: https://github.com/PoisonAlien/maftools

Short-read assembly using Spades !

Abhimanyu Singh — Mon, 31 Jan 2022 07:18:16 -0600

If we only had Illumina reads, we could also assemble these using the tool Spades.

You can try this here, or try it later on your own data.

Get data

We will use the same Illumina data as we used above:

illumina_R1.fastq.gz: the Illumina forward reads
illumina_R2.fastq.gz: the Illumina reverse reads

Assemble

Run Spades:

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o spades_assembly_all_illumina

-1 is input file of forward reads
-2 is input file of reverse reads
--careful minimizes mismatches and short indels
--cov-cutoff auto computes the coverage threshold (rather than the default setting, “off”)
-o is the output directory

Results

Move into the output directory and look at the contigs:

infoseq contigs.fasta
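
If infoseq (part of EMBOSS) is not available, a rough overview of the contigs can be obtained with a short Python script. This is a minimal sketch and not part of SPAdes; the function names are illustrative and the demo records are made up.

```python
import io

def read_fasta_lengths(handle):
    """Return {record name: sequence length} for a FASTA stream."""
    lengths, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else f"record{len(lengths) + 1}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = list(lengths)
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Made-up records; for real data use: with open("contigs.fasta") as handle: ...
demo = io.StringIO(">c1\nACGTACGTAC\n>c2\nACGTAC\n>c3\nACGT\n")
lens = read_fasta_lengths(demo)
print(f"contigs: {len(lens)}")
print(f"total length: {sum(lens.values())}")
print(f"longest: {max(lens.values())}")
print(f"N50: {n50(lens.values())}")
```

Fewer, longer contigs and a higher N50 generally indicate a more contiguous assembly, although they say nothing about correctness.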

Smudgeplot: Inference of ploidy and heterozygosity structure using whole genome sequencing data

Neel — Fri, 25 Feb 2022 04:42:09 -0600

This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
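
The two quantities compared above can be computed as in the following minimal Python sketch. It is not Smudgeplot's own code: the function name and the coverages are made up, and treating CovB as the lower coverage of each pair is an assumption made here so that the relative coverage never exceeds 0.5.

```python
def smudge_coordinates(cov_pairs):
    """Map each k-mer pair (cov1, cov2) to (CovA + CovB, CovB / (CovA + CovB))."""
    coords = []
    for cov1, cov2 in cov_pairs:
        # Assumption: CovB is the lower of the two coverages
        cov_a, cov_b = max(cov1, cov2), min(cov1, cov2)
        total = cov_a + cov_b
        if total == 0:
            continue                      # no coverage, nothing to place
        coords.append((total, cov_b / total))
    return coords

# Made-up coverages of heterozygous k-mer pairs
pairs = [(25, 25), (50, 25), (75, 25), (24, 26)]
coords = smudge_coordinates(pairs)
for total, rel in coords:
    print(f"total={total}\trelative={rel:.2f}")
```

With a haploid coverage of about 25 in this toy data, pairs near a relative coverage of 1/2 at twice the haploid coverage would correspond to an AB structure, while pairs near 1/3 at three times the haploid coverage would correspond to AAB.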

Smudgeplots are computed from raw or, even better, from trimmed reads and show the haplotype structure using heterozygous kmer pairs.

Address of the bookmark: https://github.com/KamilSJaron/smudgeplot

genomenotebook

Abhi — Thu, 20 Apr 2023 13:19:01 -0500

https://dbikard.github.io/genomenotebook/

Install

pip install genomenotebook

How to use

Create a simple genome browser with a search bar. The sequence appears when zooming in.

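import genomenotebook as gn

# Placeholder paths (hypothetical file names): point these at your own files
genome_path = "genome.fasta"
gff_path = "annotations.gff"

g = gn.GenomeBrowser(genome_path, gff_path, init_pos=10000)
g.show()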

Tracks can be added to visualize your favorite genomics data. See Examples for more.

Address of the bookmark: https://dbikard.github.io/genomenotebook/

Mitochondrial genome assembly tools !

Abhi — Wed, 06 Sep 2023 00:37:18 -0500

Mitochondrial genome assembly tools are specialized software and algorithms designed to accurately reconstruct the mitochondrial genome (mitogenome) from sequencing data, typically obtained through techniques like next-generation sequencing (NGS). The mitochondrial genome is relatively small compared to the nuclear genome, making it an ideal target for assembly. Here are some commonly used mitochondrial genome assembly tools:

MitoFinder: Mitofinder is a pipeline to assemble mitochondrial genomes and annotate mitochondrial genes from trimmed read sequencing data.

MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads

MITObim: MITObim is a tool specifically developed for the iterative assembly of mitochondrial genomes. It starts with a reference mitogenome and iteratively refines the assembly using the read data.

MITOS: MITOS is a web-based platform that provides a pipeline for annotating mitochondrial genomes. It is run on an already assembled mitogenome and integrates multiple software tools for annotation and visualization.

MIRA: MIRA (Mimicking Intelligent Read Assembly) is a versatile genome assembly tool that can be used for mitochondrial genome assembly. It supports various sequencing technologies and allows for reference-based or de novo assembly.

NOVOPlasty: NOVOPlasty is a user-friendly tool designed for de novo assembly of organelle genomes, including mitochondria. It utilizes a seed-and-extend algorithm and is suitable for both short-read and long-read data.

MITOS2: MITOS2 is an updated version of the MITOS pipeline, which automates the annotation of mitochondrial genomes. It provides improved accuracy and additional features for mitochondrial genome analysis.

GetOrganelle: While primarily designed for chloroplast genome assembly, GetOrganelle can also be used for mitochondrial genome assembly. It is particularly useful for dealing with high-throughput sequencing data.

SPAdes: SPAdes (St. Petersburg genome assembler) is a versatile genome assembly tool that can be employed for mitochondrial genome assembly, especially when dealing with complex datasets that may contain nuclear mitochondrial DNA sequences (numts).

IDBA-UD: IDBA-UD (Iterative De Bruijn Graph De Novo Assembler) is another de novo assembly tool that can be used for mitochondrial genome assembly, especially in cases with relatively low coverage.

Velvet: Velvet is a de novo assembly tool that can be applied to mitochondrial genome assembly, especially when working with short-read data.

When selecting a mitochondrial genome assembly tool, it's important to consider the specific characteristics of your sequencing data, such as read length and coverage, as well as the complexity of the mitochondrial genome. Additionally, some tools are better suited for specific organisms or research objectives, so choosing the right tool will depend on your particular project requirements.

Regular Expression Cheat Sheet

Jitendra Narayan — Tue, 09 Jul 2013 17:38:42 -0500

Regular expressions are the soul of the Perl language, and for a bioinformatician they are a magic wand for taming gigantic string data. We did not find a good, user-friendly regular expression cheat sheet, so we wrote our own. The Regular Expressions Cheat Sheet is a quick reference guide for regular expressions, including symbols, ranges, grouping, assertions and some sample patterns to get you started.
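
As a small taste of those categories (ranges, grouping, assertions), here is an illustrative sketch. It uses Python's re module rather than Perl, and the patterns and the DNA string are made-up examples that are not taken from the cheat sheet itself.

```python
import re

seq = "ATGAAATGCCCTAAGGATGA"   # made-up DNA string

# Range: is the whole string made only of A, C, G and T?
is_dna = re.fullmatch(r"[ACGT]+", seq) is not None

# Grouping and a lazy quantifier: a start codon, then whole codons,
# up to the first in-frame stop codon
orf = re.search(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)", seq)

# Lookahead assertion: zero-width matches give the position of every ATG
starts = [m.start() for m in re.finditer(r"(?=ATG)", seq)]

print(is_dna)
print(orf.group() if orf else None)
print(starts)
```

The lookahead consumes no characters, which is why it can also report motif occurrences that overlap each other.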

The Story of You: ENCODE and the human genome

Sat, 24 Aug 2013 18:49:03 -0500

Ever since a monk called Mendel started breeding pea plants we've been learning about our genomes. In 1953, Watson, Crick and Franklin described the structure of the molecule that makes up our genomes: the DNA double helix. Then, in 2001, scientists wrote down the entire 3-billion letter code contained in the average human genome. Now they're trying to interpret that code; to work out how it's used to make different types of cells and different people. The ENCODE project, as it's called, is the latest chapter in the story of you. To read the ENCODE research papers and more, visit http://www.nature.com/ENCODE