BOL: Related items

Ancient whole genome duplication (WGD) detection tools

Rahul Nayak — Sun, 07 Mar 2021 00:32:44 -0600

There are two methods for detecting ancient WGD: collinearity (synteny) analysis, and analysis of the Ks distribution. Ks is the average number of synonymous substitutions per synonymous site. Its counterpart, Ka, is the average number of non-synonymous substitutions per non-synonymous site.
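
To make the "substitutions per site" definition concrete, here is a minimal sketch of the simplest counting-style estimate: the raw proportion of differing sites with a Jukes-Cantor correction for multiple hits. The function name jc_per_site and the example counts are hypothetical. Note that wgd itself does not work this way; it estimates Ks by maximum likelihood with codeml from PAML, so treat this only as an illustration of the definition.

```shell
# jc_per_site DIFFERENCES SITES
# Jukes-Cantor corrected number of substitutions per site.
# Pass synonymous counts to get a Ks-like value, non-synonymous counts for Ka.
jc_per_site() {
    awk -v d="$1" -v s="$2" 'BEGIN {
        p = d / s                          # raw proportion of differing sites
        if (p >= 0.75) { print "NA"; exit }  # saturated: correction undefined
        printf "%.4f\n", -0.75 * log(1 - 4 * p / 3)
    }'
}

# Hypothetical gene pair: 10 synonymous differences over 100 synonymous sites
jc_per_site 10 100
```

The corrected value is slightly larger than the raw proportion (0.1) because some sites will have changed more than once.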

Several WGD analysis pipelines have already been published. Searching for the keyword "wgd pipeline" turned up the following:

GenoDup: https://github.com/MaoYafei/GenoDup-Pipeline
https://peerj.com/articles/6303/
WGDdetector: https://github.com/yongzhiyang2012/WGDdetector
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2670-3
wgd: https://github.com/arzwa/wgd
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2#Sec1
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
GeNoGAP: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
https://github.com/dfguan/purge_dups
https://www.biorxiv.org/content/10.1101/2020.01.24.917997v1

This article introduces the usage of wgd.

wgd cannot currently be installed directly from bioconda, and installation is a little troublesome because it depends on a lot of other software:

BLAST
MCL
MUSCLE/MAFFT/PRANK
PAML
PhyML/FastTree
i-ADHoRe

The good news is that most of these dependencies can be installed with bioconda:

conda create -n wgd python=3.5 blast mcl muscle mafft prank paml fasttree cmake libpng mpi=1.0=mpich
conda activate wgd

Here mpi=1.0=mpich is selected because i-ADHoRe depends on MPICH. If OpenMPI is installed instead, i-ADHoRe fails with: error while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

After that, installing wgd itself is much simpler. Either clone the repository and install from the local copy:

git clone https://github.com/arzwa/wgd.git
cd wgd
pip install .

or install directly from GitHub in one step:

pip install git+https://github.com/arzwa/wgd.git

For i-ADHoRe, you need to register at http://bioinformatics.psb.ugent.be/webtools/i-adhore/licensing/ and agree to the license before you can download i-ADHoRe-3.0.

Since my miniconda3 is installed in ~/opt/, the install prefix is ~/opt/miniconda3/envs/wgd/

tar -zxvf i-adhore-3.0.01.tar.gz
cd i-adhore-3.0.01
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=~/opt/miniconda3/envs/wgd/
make -j 4
make install
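
Before moving on, it can be worth confirming that the dependencies are actually visible on PATH inside the wgd environment. The helper below is a minimal sketch; check_tools is a hypothetical name, and the executable names passed to it (blastp, codeml, FastTree, i-adhore and so on) are assumptions about what each package installs, so adjust them to match your system.

```shell
# check_tools NAME...
# Report every command that cannot be found on PATH.
# Returns 0 if all were found, 1 otherwise.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names below are assumptions; edit as needed.
check_tools blastp mcl muscle mafft prank codeml FastTree i-adhore \
    || echo "some tools are missing from PATH"
```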

Take the sugarcane genome Saccharum spontaneum L. as an example. The sequenced genome is polyploid, with 32 chromosomes in the assembly (4 x 8 = 32).

Download the CDS and GFF annotation files for the tutorial:

mkdir -p wgd_tutorial && cd wgd_tutorial
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.cds.fasta.gz
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.gff3.gz
gunzip *.gz

First run conda activate wgd to start the analysis environment, then begin the analysis.

Step 1: Use wgd mcl to identify homologous genes in the genome

wgd mcl -n 20 --cds --mcl -s Sspon.v20190103.cds.fasta -o Sspon_cds.out

Step 2: Use wgd ksd to build the Ks distribution

wgd ksd --n_threads 80 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl Sspon.v20190103.cds.fasta
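
Once the Ks table exists, a quick text histogram can show whether there is a peak worth looking at before making any plots. The sketch below is not part of wgd. It assumes the output is a tab-separated file with a header row containing a column named "Ks", and it keeps only values between 0.05 and 5, which is an illustrative filter rather than a recommendation. The function name ks_hist is hypothetical.

```shell
# ks_hist FILE [BIN_WIDTH]
# Print "bin_start<TAB>count" for the Ks column of a tab-separated table.
# Assumes a header row with a column literally named "Ks".
ks_hist() {
    awk -F'\t' -v w="${2:-0.1}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "Ks") col = i
            if (!col) { print "no Ks column found" > "/dev/stderr"; exit 1 }
            next
        }
        # keep numeric values only, inside the illustrative 0.05-5 window
        $col ~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ && $col + 0 > 0.05 && $col + 0 < 5 {
            n[int($col / w)]++
        }
        END {
            for (b = 0; b * w < 5; b++)
                if (n[b]) printf "%.2f\t%d\n", b * w, n[b]
        }
    ' "$1"
}
```

For the tutorial data this would be called as ks_hist wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv 0.1, assuming the file has the layout described above.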

Step 3: If the quality of the genome assembly is good, wgd syn can be used for collinearity analysis. It finds the collinear blocks in the genome and the corresponding anchor points

wgd syn --feature gene --gene_attribute ID \
-ks wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv \
Sspon.v20190103.gff3 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl

For more reading: there are 9 sub-modules in wgd

kde: KDE fitting to the Ks distribution
ksd: Ks distribution construction
mcl: All-vs-all BLASTP comparison + MCL clustering
mix: Mixture modeling of the Ks distribution
pre: Preprocess the CDS file
syn: Call i-ADHoRe 3.0 to use GFF files for collinearity analysis
viz: Draw histograms and density plots
wf1: Standard Ks analysis workflow for the whole paranome; calls mcl, ksd and syn
wf2: Standard Ks analysis workflow for one-vs-one orthologs; calls mcl and ksd

Professor/Associate Professor/ Assistant Professor at Chettinad Academy of Research and Education

Sat, 24 May 2014 00:00:15 -0500

OPEN FACULTY POSITION

Chettinad Academy of Research and Education (CARE) invites applications from eligible and translational research-oriented candidates to the posts of Professor/Associate Professor/Assistant Professor in Computational Biology, Bioinformatics, and Pharmaceutical Chemistry.

Emoluments: As per UGC norms (Adequate Compensation for Postdoctoral/Teaching experience)

Candidates fulfilling the eligibility criteria as per the UGC norms can send their full CV with copies of certificates and reference letters to the following address by post or by e-mail on or before 31st May 2014

The Registrar,
Chettinad Academy of Research and Education,
Chettinad Health City
Kelambakkam, Chennai 603 103
Tamil Nadu
T +91 (0)44 4741 1000
F +91 (0)44 4741 1011
Email: jobs@chettinadhealthcity.com

Advertisement: http://182.73.176.163/chc/ads2014.pdf

GEnView: A phylogeny based comparative genomics software to analyze the genetic environment of genes

Abhi — Tue, 28 Dec 2021 01:49:03 -0600

A phylogeny based comparative genomics software to analyze the genetic environment of genes. The user can select one or several taxa and provide one or several reference protein(s). Genomes and plasmids (based on user choice) will be downloaded from the NCBI Assembly/NR database and searched for the respective gene. Alternatively, custom genomes can be provided. User-selected stretches (20 kbp by default) of the gene's genetic environment are extracted, annotated and aligned between all genomes. The sequences are then visualized, enabling comparison of synteny and gene content.

More at https://pubmed.ncbi.nlm.nih.gov/34951622/

Address of the bookmark: https://github.com/EbmeyerSt/GEnView

The Minerva Research Group for Bioinformatics

Tue, 27 May 2014 15:48:14 -0500

The focus of the bioinformatics group is to use computational approaches to gain an insight into genome evolution in primates.

http://www.eva.mpg.de/genetics/bioinformatics/overview.html?Fsize=0%2C%20%40%2F%27

Kelso Group
Department of Evolutionary Genetics
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
04103 Leipzig
Germany
Phone: +49 341 3550 500

Job:
http://www.eva.mpg.de/genetics/bioinformatics/jobs.html?Fsize=0%2C%2B%40

dna2bit: an ultra-fast and accurate genomic distance estimation software

LEGE — Sun, 31 Aug 2025 06:24:58 -0500

dna2bit is a software tool developed in C++11, leveraging the capabilities of OpenMP for parallel computing and the popcount technique for efficient bit manipulation. It has been thoroughly tested using the g++ and clang compilers on both Linux and MacOS platforms.

Address of the bookmark: https://github.com/lijuzeng/dna2bit

Monitor running jobs on Linux server

Jitendra Narayan — Fri, 06 Jun 2014 16:18:43 -0500

As a bioinformatician you run lots of programs on your servers, and a shared server is sometimes also used by your colleagues. If the server is busy, you will want to check on and monitor the running programs. The "top" command comes in handy when you need to find out whether things are still running, how long they’ve been running, or how much memory is being used.

‘top’ is very simple to run: type

%% top

You’ll get a screen that looks like this, and is updated regularly:

Simple, right? Heh.

First! Note that you can use ‘q’ or ‘CTRL-C’ to exit from ‘top’.

Now let’s read and understand each line in turn.

The first line:

top - 23:00:48 up 39 days, 2 user, load average: 0.00, 0.00, 0.00

The first line tells you the current time, how long the machine has been up, how many users are logged in, and the short/medium/long-term compute load on the machine. If you run something for a long time, you’ll see these numbers go up. Right now, the machine is basically just sitting there, so these are all close to 0.
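
If you only want the three load figures, for example in a script that decides whether to start a job, they can be pulled out of that first line. This is a minimal sketch; load_avg is a hypothetical helper, and it assumes the Linux-style "load average: a, b, c" wording shown above.

```shell
# load_avg
# Read a top (or uptime) header line on stdin and print the three
# load averages separated by spaces.
load_avg() {
    sed -n 's/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

echo 'top - 23:00:48 up 39 days, 2 user, load average: 0.00, 0.00, 0.00' | load_avg
```

On a Linux server the same function should also work as uptime | load_avg, since uptime prints the same "load average:" text.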

The second line:

Tasks: 239 total,   1 running, 238 sleeping,   0 stopped,   0 zombie

This line tells you how many processes are running. On a laptop it’s not so interesting, because you really are the only one using the machine.

Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

This line contains the CPU load. The first two numbers are how busy the system is doing computation (“us” stands for “user”) and how busy the system is doing system-y things like accessing disks or network (“sy” stands for “system”). We’ll talk more about this later.

Mem:   49457320k total,    3492174k used, 14535596k free,    1435148k buffers

This should be easy to understand – how much memory you’re using!

Swap:   539356k total,   28332k used,   836562k free,    29862014k cached

Swap is just on-disk memory that can be used to “swap” out programs from main memory. Again, we’ll talk about this later.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
    1 root      39  19     0    0    0 S  0.0  0.0 246:57.22 kipmi0
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0

And... finally! What’s actually running! The two most important numbers are the %CPU and %MEM towards the right, as well as the COMMAND. This tells you how compute- and memory-intensive your program is. Right now, nothing’s running so the numbers aren’t very interesting, but just wait until we run something...
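
On a busy shared server the process table can be long, so it sometimes helps to filter it down to the rows that matter. The sketch below reads a process table in the format shown above and prints only the rows at or above a %CPU threshold. busy_procs is a hypothetical helper, and the column positions are taken from the header row rather than hard-coded.

```shell
# busy_procs [MIN_CPU]
# Read top's output on stdin; print PID, %CPU, %MEM and COMMAND
# for every process using at least MIN_CPU percent CPU (default 50).
busy_procs() {
    awk -v min="${1:-50}" '
        $1 == "PID" {
            # locate the columns by name from the header row
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU") c = i
                if ($i == "%MEM") m = i
                if ($i == "COMMAND") k = i
            }
            next
        }
        c && $c + 0 >= min { print $1, $c, $m, $k }
    '
}
```

Assuming a Linux top from procps, which supports batch mode, a one-off snapshot can be filtered with top -b -n 1 | busy_procs 50.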

BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Poonam Mahapatra — Wed, 03 Jan 2018 00:25:27 -0600

BBSplit internally uses BBMap to map reads to multiple genomes at once, and determine which genome they match best. This is different than with ordinary mapping. If a genome (say, human) contains an exact repeat somewhere, reads mapping to it will be mapped ambiguously. But if you want to determine whether reads are mouse or human, it does not matter whether they map ambiguously within human, only whether they are ambiguous between human and mouse. BBSplit tracks this additional ambiguity information and decides how to use it based on the “ambig2” flag. The normal use of BBSplit is like Seal, either quantifying how many reads go to each reference, or splitting the reads into multiple output files, one per reference. BBSplit can only be run using references indexed with BBSplit, as they contain additional information regarding which sequences came from which reference file.

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.

For example, if you had a library of something that was contaminated with E. coli and Salmonella, you could do this:

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean.fq (unmapped reads)

In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq
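
To see how many reads ended up in each bin, the output files can simply be counted. This is a minimal sketch that is independent of BBSplit; count_reads is a hypothetical helper, and it assumes plain uncompressed FASTQ with exactly four lines per read. For interleaved paired files the count includes both mates.

```shell
# count_reads FILE...
# Print "file<TAB>reads" for each FASTQ file, assuming four lines per record.
count_reads() {
    for f in "$@"; do
        printf '%s\t%d\n' "$f" "$(( $(wc -l < "$f") / 4 ))"
    done
}
```

With the first example above, this would be called as count_reads out_ecoli.fq out_salmonella.fq clean.fq.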

BBSplit is available here:
https://sourceforge.net/projects/bbmap/

The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data

BioJoker — Tue, 27 Nov 2018 04:43:57 -0600

lordFAST is a sensitive tool for mapping long reads with high error rates. lordFAST is specially designed for aligning reads from PacBio sequencing technology but provides the user the ability to change alignment parameters depending on the reads and application.

lordFAST is a novel long-read mapper specifically designed to align reads generated by PacBio, and potentially other SMS technologies, to a reference. lordFAST not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.

Address of the bookmark: https://github.com/vpc-ccg/lordfast

Assistant Professor in Bioinformatics at Dr. D. Y. Patil Biotechnology & Bioinformatics Institute

Tue, 03 Jun 2014 19:54:15 -0500

Dr. D. Y. Patil Biotechnology & Bioinformatics Institute
Tathawade, Pune 411033.

Assistant Professor in Bioinformatics

Essential :
First Class Master’s Degree in the appropriate branch of Life Sciences / Technology (Tech.)
OR
Ph.D in Life Sciences or in the respective subject area of specialization
OR
Good Academic record with at least 55% marks (or an equivalent grade) at the Master’s Degree level, in the relevant subject or an equivalent degree from an Indian / Foreign University.
Besides fulfilling the above qualifications, candidates should have cleared the eligibility test (NET) for lecturers conducted by the UGC, CSIR or similar test accredited by the UGC and as per the requirements of UGC guidelines.

Desirable :
Teaching, research industrial and/or professional experience in a reputed organization.
Papers presented at Conferences and/or in refereed journals

Note: Applications are invited in the prescribed form.
Kindly send your applications to “Registrar, Dr. D. Y. Patil Vidyapeeth, Pune, Sant Tukaram Nagar, Pimpri, Pune – 411018, Maharashtra, India.” Applications should reach the University office within 15 days of publication.

More Info: http://www.dpu.edu.in/BiotechResearchPositions.aspx

G-NEST: The Gene NEighborhood Scoring Tool

Neel — Fri, 25 Sep 2020 20:09:18 -0500

The Gene NEighborhood Scoring Tool (G-NEST) combines genomic location, gene expression, and evolutionary sequence conservation data to score putative gene neighborhoods across all window sizes. Primary author of final code = William F. Martin. Example data files are in the separate repository.

Address of the bookmark: https://github.com/dglemay/G-NEST