BOL: Related items

List of non-commercial NGS genotype-calling software

Jit — Thu, 09 Aug 2018 04:21:32 -0500

Meaningful analysis of next-generation sequencing (NGS) data, which are produced extensively by genetics and genomics studies, relies crucially on the accurate calling of SNPs and genotypes. Recently developed statistical methods both improve and quantify the considerable uncertainty associated with genotype calling, and will especially benefit the growing number of studies using low- to medium-coverage data.

A list of programs for genotype and SNP calling :

SOAP2 http://soap.genomics.org.cn/index.html

Single-sample High-quality variant database (for example, dbSNP) Package for NGS data analysis, which includes a single individual genotype caller (SOAPsnp)

realSFS http://128.32.118.212/thorfinn/realSFS/

Single-sample Aligned reads Software for SNP and genotype calling using single individuals and allele frequencies. Site frequency spectrum (SFS) estimation

Samtools http://samtools.sourceforge.net/

Multi-sample Aligned reads Package for manipulation of NGS alignments, which includes a computation of genotype likelihoods (samtools) and SNP and genotype calling (bcftools)

GATK http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit Multi-sample Aligned reads Package for aligned NGS data analysis, which includes a SNP and genotype caller (Unifed Genotyper), SNP filtering (Variant Filtration) and SNP quality recalibration (Variant Recalibrator)

Beagle http://faculty.washington.edu/browning/beagle/beagle.html

Multi-sample LD Candidate SNPs, genotype likelihoods Software for imputation, phasing and association that includes a mode for genotype calling

IMPUTE2 http://mathgen.stats.ox.ac.uk/impute/impute_v2.html

Multi-sample LD Candidate SNPs, genotype likelihoods Software for imputation and phasing, including a mode for genotype calling. Requires fine-scale linkage map

QCall ftp://ftp.sanger.ac.uk/pub/rd/QCALL

Multi-sample LD ‘Feasible’ genealogies at a dense set of loci, genotype likelihoods Software for SNP and genotype calling, including a method for generating candidate SNPs without LD information (NLDA) and a method for incorporating LD information (LDA). The ‘feasible’ genealogies can be generated using Margarita (http://www.sanger.ac.uk/resources/software/margarita)

MaCH http://genome.sph.umich.edu/wiki/Thunder

Multi-sample LD Genotype likelihoods Software for SNP and genotype calling, including a method (GPT_Freq) for generating candidate SNPs without LD information and a method (thunder_glf_freq) for incorporating LD information

Ancient whole genome duplication (WGD) detection tools !

Rahul Nayak — Sun, 07 Mar 2021 00:32:44 -0600

There are two methods for ancient WGD detection, one is collinearity analysis, and the other is based on the Ks distribution map. Among them, Ks is defined as the average number of synonymous substitutions at each synonymous site, and there is also a Ka corresponding to it, which refers to the average number of non-synonymous substitutions at each non-synonymous site.

At present, some people have posted articles about the analysis process of WGD. I searched for the keyword "wgd pipeline" and found the following:

GenoDup: https:// github.com/MaoYafei/GenoDup-Pipeline
https://peerj.com/articles/6303/
WGDdetector: https:// github.com/yongzhiyang2 012/WGDdetector
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2670-3
wgd: https:// github.com/arzwa/wgd
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2#Sec1
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
GeNoGAP https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1142-2
https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-017-0399-x
https://github.com/dfguan/purge_dups
https://www.biorxiv.org/content/10.1101/2020.01.24.917997v1

This article introduces the usage of wgd.

Wgd cannot be installed directly with bioconda at present, so it is a little troublesome to install, because it depends on a lot of software. wgd depends on the following software

BLAST
MCL
MUSCLE/MAFFT/PRANK
PAML
PhyML/FastTree
i-ADHoRe

But the good news is that most of the software it depends on can be installed with bioconda

conda create -n wgd python=3.5 blast mcl muscle mafft prank paml fasttree cmake libpng mpi=1.0=mpich
conda activate wgd

Here mpi=1.0=mpich is selected, because i-adhore depends on mpich. If openmpi is installed, an error will appear while loading shared libraries: libmpi_cxx.so.40: cannot open shared object file: No such file or directory

After that, the installation is much simpler

git clone https://github.com/arzwa/wgd.git
cd wgd
pip install .
pip install git+https://github.com/arzwa/wgd.git
For i-ADHoRe, you need to register at http:// bioinformatics.psb.ugent.be /webtools/i-adhore/licensing/Agree to the license to download i-ADHoRe-3.0

Since my miniconda3 installed ~/opt/, the installation path is so~/opt/miniconda3/envs/wgd/

tar -zxvf i-adhore-3.0.01.tar.gz
cd i-adhore-3.0.01
mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=~/opt/miniconda3/envs/wgd/
make -j 4
make insatall

Take the sugarcane genome Saccharum spontaneum L as an example. The genome is 8-ploid with 32 chromosomes (2n = 4x8 = 32)

Download the tutorial for CDS and GFF annotation files

mkdir -p wgd_tutorial && cd wgd_tutorial
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.cds.fasta.gz
wget http://www.life.illinois.edu/ming/downloads/Spontaneum_genome/Sspon.v20190103.gff3.gz
gunzip *.gz

First conda activate wgdstart our analysis environment, and then start the analysis

Step 1 : Use to wgd mclidentify homologous genes in the genome

wgd mcl -n 20 --cds --mcl -s Sspon.v20190103.cds.fasta -o Sspon_cds.out

Step 2 : Use to wgd ksdbuild Ks distribution

wgd ksd --n_threads 80 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl Sspon.v20190103.cds.fasta

Step 3 : If the quality of the genome is good, then wgd syncollinearity analysis can be used . It can help us find the collinearity block in the genome and the corresponding anchor point

wgd syn --feature gene --gene_attribute ID \
-ks wgd_ksd/Sspon.v20190103.cds.fasta.ks.tsv \
Sspon.v20190103.gff3 Sspon_cds.out/Sspon.v20190103.cds.fasta.blast.tsv.mcl

For more reading - There are 9 sub-modules in WGD

kde: KDE fitting to the Ks distribution
ksd: Ks distribution construction
mcl: BLASP comparison of All-vs-ALl + MCL classification analysis.
mix: Hybrid modeling of Ks distribution.
pre: preprocess the CDS file
syn: Call I-ADHoRe 3.0 to use GFF files for collinearity analysis
viz: draw histogram and density plot
wf1: Ks standard analysis procedure of the whole genome paranome (paranome), call mcl, ksd and syn
wf2: Ks standard analysis procedure of one-vs-one homologous gene (ortholog), call wcl and kSD

FRODOCK 2.0: fast protein–protein docking server

Neel — Wed, 17 Oct 2018 04:31:30 -0500

frodock: a user-friendly protein–protein docking server based on an improved version of FRODOCK that includes a complementary knowledge-based potential. The web interface provides a very effective tool to explore and select protein–protein models and interactively screen them against experimental distance constraints. The competitive success rates and efficiency achieved allow the retrieval of reliable potential protein–protein binding conformations that can be further refined with more computationally demanding strategies.

Address of the bookmark: http://frodock.chaconlab.org/

Protein Subcellular Localization Prediction

Poonam Mahapatra — Thu, 01 Mar 2018 06:20:47 -0600

Assigning subcellular localization to a protein is an important step towards elucidating its interaction partners, function, and potential role(s) in the cellular machinery. Computational tools offer an attractive complement to time-consuming and laborious experimental methods.

http://abi.inf.uni-tuebingen.de/Services/YLoc/webloc.cgi

Address of the bookmark: https://abi.inf.uni-tuebingen.de/Research/Systems%20Biology/protein-subcellular-localization

STRUM: structure-based prediction of protein stability changes upon single-point mutation

BioStar — Mon, 10 Sep 2018 13:21:49 -0500

STRUM is a method for predicting the fold stability change (ΔΔG) of protein molecules upon single-point nsSNP mutations. STRUM adopts a gradient boosting regression approch to train the Gibbs free-energy changes on a variety of features at different levels of sequence and structure properties. The unique characteristics of STRUM is the combination of sequence profiles with low-resolution structure models from protein structure prediction, which helps enhance the robustness and accuracy of the method and make it applicable to various protein seqences, including those without experimental structures

Address of the bookmark: https://zhanglab.ccmb.med.umich.edu/STRUM/

Protein-Protein Interaction Sites Predictions !

Poonam Mahapatra — Wed, 25 Apr 2018 04:53:20 -0500

The study of Protein–Protein Interactions (PPIs) has a crucial role in biology, medicine and the pharmaceutical industry. PPIs can be investigated from two aspects: The interaction partners of a specific protein and the amino acid residues participating in a given PPI. Information about a protein’s interaction partners allows scientists to construct protein interaction networks, such as signaling pathways, which in turn facilitate the understanding of many biological and clinical observations.

Following are the list of tools commonly used to PPIs predictions:

Protein-Protein Interaction Sites

PPISP

A consensus neural network method for predicting protein-protein interaction sites

HOMCOS

A server to predict interacting protein pairs and interacting sites by homology modeling of complex structures

HotPOINT

Prediction of protein interfaces using an empirical model

ISIS

Prediction of interaction hotspots from sequence

KFC server

Automated decision-tree approach to predicting protein-protein interaction hot spots

meta-PPISP

A meta server for predicting protein-protein interaction sites. meta-PPISP is built on three individual web servers: cons-PPISP, PINUP, and Promate

ODA

Identification of optimal surface patches with the lowest docking desolvation energy values

PINUP

Protein binding site prediction with an empirical scoring function

Other Sites (DNA, RNA, Metals)

CHED

Web server for predicting soft metal binding sites in proteins

DBD-Hunter

A knowledge-based method for the prediction of DNA-protein interactions

DISPLAR

Given the structure of a protein known to bind DNA, the method predicts residues that contact DNA using neural network method

iDBPs

Predicts DNA binding proteins for proteins with known 3D structure.

PFplus

A tool for extracting and displaying positive electrostatic patches on protein surfaces which can be indicative of nucleic acid binding interfaces.

Five points for bioinformatics software/tools

Jitendra Narayan — Mon, 05 Aug 2013 04:12:32 -0500

In the bioinformatics sector we mostly spend time on computational analysis of huge amounts of data and try to make sense of it, biologically. But, most of the newbie bioinformaticians are faced with dilemma when they receive biological sequence data for the first time. They mostly found confusing over open source, user friendly GUI, and commercial bioinformatics software. Don’t be surprise this is true and also not an easy task to decide, because analytical step is the most crucial part and believe to be the biggest bottleneck in publishing paper in high impact journals. Through this blog I would like to address the pros and cons of both kind of software/tools and try to assist (Hmmm not really, It looks convince) you to make decision on your software selections.

The most common newbie questions are:

Should I try to use these free open source programs? Why are we not trying GUI software for computational analysis? Should I use commercial bioinformatics programs/software?”

1. Let’s be open

We generally think free and cheap are useless. But this concept is not applicable when we discuss open source software. Mostly, the bioinformatics software is developed by highly competitive biological programmers who believe in open sharing of knowledge. They come under Open Bioinformatics Foundation or O|B|F which is a non-profit, volunteer run organization focused on supporting open source programming in bioinformatics. The best part about open source tools/software is that they’re free to download the source code and read exactly what the program does. If you are so inclined, you can view all of the parts of the program and see the logical flow of the pipeline. In addition, open source makes an excellent learning tool for any beginning bioinformatician. Moreover, you can modify existing open source programs to deal with cutting-edge problems or to customize your pipeline. Apart from your computational and analysis work, most of the reviewer also prefers the open source based results so that they can validate the results if validation required.

2. Code headache

As a bioinformatician you are supposed to know the basics of programming languages, and if you are not good at it, then please learn it as soon as possible because you are not a bio-analyst but biological programmers. The open source programs usually lack dedicated service and support teams (often because they were the product of an overworked doc/postdoc!) so you are responsible for troubleshooting your own errors most of the time. We commonly receive the HELP email to support and assist to setup the pipeline; you can also find this kind of request on any QA forum. I personally believe this coding horror brings the biggest downside of open-source programs; where you need some programming skills in order to implement the program in your pipeline. But, if you are not able to fix the pipeline and modify the open source code according to your requirements them you should re-think on your bioinformatician name tag!!!

3. Dive into the codes

Some of the biologist turn bioinformatician says “if you can do the same thing with commercial software then why to get migraine with weird codes”, well this statement looks to me that guys are keen to learn swimming but still don’t like to get wet. If you are still using paid software and doing your work by customer support and clicking some of the well-designed GUI button then perhaps you are not interested in learning and trying new and challenging bioinformatics works. You are missing the basic flavour of bioinformatics. Let’s dive into the coding world, I am sure your will enjoy it. I recommend your to swim freely in code’s sea, and enjoy the journey; do not merely watch it from the outside.

4. Paid does not mean better

The bioinformatics company which are specializes in bioinformatics solutions develop well designed/packed, user friendly software by using a large number of specialised scientist, programmers and support staff. They also provide good services to accomplice your biological analysis work. This means that if you hit a ‘snag’ with your data, help is likely only a phone call away! These companies price their products competitively against the cost of a dedicated bioinformatician. You may be able to afford the program, but not the additional staff! Additionally, most of the functionality that you need in your analysis is already coded into the program. Need to plot a graph? Just click this button right here. It is that easy. But, as a bioinformatician this is not generally well encouraged approach in biological analysis work, because the software is not available to everyone and your data can’t be validated. Moreover, there is very less chances that anyone will repeat your work or love to do similar kind of research (because not all the labs in the world are rich like yours).

5. Take a caution

In biological analysis work, in which you deal GB/TB of data are having maximum chances of getting errors, so please be careful and always cross check your data before coming to any conclusion. Even an error in two line code can alter your entire analysis and display weird results. Some of the scientist blindly believes on commercial software, which is entirely wrong. Using proprietary tools does not absolve you of the need to actually read and research the type of analysis that you are doing. This is particularly true in the case of genome assembly and annotation.

At the end, I would like to tell only one think that open source solutions allows you to do more cutting edge analysis than the commercial tools. So let’s go for it.

Disclaimer:

This is my personal view. I have nothing to do with any company or open source community. The views expressed on these pages are mine alone and not those of my current/past employers. I do reserve the right to remove comments left by spammers or off-topic comments.

Software and Tools to detect structure variation with long reads !!

Archana Malhotra — Wed, 15 Mar 2017 14:31:09 -0500

Uncovering the connection between genetics and heritable diseases requires an approach that looks at all the variant bases and types in a genome. While a PacBio de novo assembly resolves the most novel SV variants. 8-10X PacBio coverage of single genomes or trios reveals triple the SVs detectable by short-read data.

With Single Molecule, Real-Time (SMRT) Sequencing, you can access structural variations having a broad range of sizes, types, and GC content with the ability to:

Uncover missing heritability linked to structural variation
Unambiguously identify genomic context and variant breakpoints at the sequence level to unravel the genetic etiology of disease
Resolve structural variation across the complete size spectrum with basepair resolution

Following are the SV tools, which can assist you to achieve your goal.

Sniffles: Structural variation caller using third generation sequencing

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis. Please note the current version of Sniffles requires sorted output from BWA-MEM (use -M and -x parameter) or NGM-LR with the optional SAM attributes enabled!

More at https://github.com/fritzsedlazeck/Sniffles

MultiBreak-SV: It identifies structural variants from next-generation paired end data, third-generation long read data, or data from a combination of sequencing platforms.

There are two pieces of software in this release: (1) a pre-processor that takes machineformat (.m5) BLASR files, and (2) MultiBreak-SV. For installation and usage instructions, see doc/MultiBreakSV-Manual.txt.

More at https://github.com/raphael-group/multibreak-sv

Parliament: A Structural Variation Tool. Why ask a single sv-detection approach to find every variant when you can have a parliament of tools deciding?

Publication about the algorithm and “…the first long-read characterization of structural variation in a diploid human personal genome…” (HS1011) - “Assessing structural variation in a personal genome—towards a human reference diploid genome”

More at https://sourceforge.net/projects/parliamentsv/

https://www.dnanexus.com/papers/Parliament_Info_Sheet.pdf

PBHoney: the structural variation discovery tool

PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp). PBHoney considers both intra-read discordance and soft-clipped tails of long reads to identify structural variants.

Read The Paper http://www.biomedcentral.com/1471-2105/15/180/abstract

More at https://sourceforge.net/projects/pb-jelly/

SMRT-SV: Structural variant and indel caller for PacBio reads

Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.

SMRT-SV provides an official software package for tools described in Chaisson et al. 2014 and adds several key features including the following.

Unified variant calling user interface with built-in cluster compute support
Small indel calling (2-49 bp)
Improved inversion calling (screenInversions)
Quality metric for SV calls based on number of local assemblies supporting each call
Higher sensitivity for SV calls using tiled local assemblies across the entire genome instead of "signature" regions
Genotyping of SVs with Illumina paired-end reads from WGS samples

More at https://github.com/EichlerLab/pacbio_variant_caller

Exploring RNA Sequence Analysis: Tools for Every Bioinformatician

Neel — Fri, 13 Dec 2024 04:03:04 -0600

RNA sequence analysis has become an essential part of modern biological research. From RNA-seq pipelines to specialized tools for specific RNA types, here's a comprehensive guide to tools you can use to make sense of RNA data.

1. RNA-Seq Analysis Pipelines

RNA-seq is one of the most popular techniques for studying RNA. These tools streamline processing raw sequence data:

FASTQC: For quality control of raw RNA-seq reads.
Trimmomatic: For trimming and filtering RNA-seq reads.
HISAT2/STAR: High-performance aligners for RNA-seq reads.
FeatureCounts: For quantifying gene expression.
DESeq2/EdgeR: For differential expression analysis.

2. Transcriptome Assembly and Annotation

For analyzing transcriptomes from non-model organisms or assembling novel transcripts:

Trinity: For de novo transcriptome assembly.
StringTie: For transcript assembly and quantification from RNA-seq alignments.
TransDecoder: To predict coding regions within assembled transcripts.
TAU: Tools for annotating non-coding and coding RNAs.

3. Exploring Non-Coding RNA (ncRNA)

Non-coding RNAs play critical regulatory roles. Dedicated tools for studying them include:

Infernal: For identifying ncRNA sequences based on covariance models.
Rfam: Database and tools for ncRNA families.
miRDeep: For identifying microRNAs in RNA-seq datasets.

4. RNA Structure and Motif Analysis

Structural biology of RNA helps in understanding its function:

RNAfold (ViennaRNA): Predicts secondary structures from RNA sequences.
RNAstructure: Tools for RNA secondary structure prediction and analysis.
MEME Suite: For identifying motifs in RNA sequences.
IntaRNA: For RNA-RNA interaction prediction.

5. RNA Editing and Modifications

Epitranscriptomics is a growing field focusing on RNA modifications:

REDItools: For RNA editing analysis.
m6Aboost: For identifying m6A modifications in RNA.

6. Long-Read RNA Sequencing Analysis

Long-read technologies like Nanopore and PacBio are transforming RNA research:

FLAIR: For isoform-level analysis of long-read RNA-seq data.
NanoMod: For detecting modifications in RNA from Nanopore sequencing.

7. RNA-Protein Interactions

To study RNA-protein interactions and complexes:

RBPmap: For identifying RNA-binding protein motifs.
PARalyzer: For analyzing PAR-CLIP data.

8. Functional Enrichment Analysis

Understanding biological functions and pathways from RNA-seq data:

getENRICH: A tool designed for pathway enrichment analysis of non-model organisms (hypergeometric P-value calculation with FDR correction).
ClusterProfiler: For GO and KEGG pathway enrichment analysis.

9. Visualization and Data Sharing

Presenting and sharing RNA sequence analysis results effectively:

IGV: Genome browser for visualizing RNA-seq alignments.
Circos: Circular visualization of RNA-seq data.
DashBio: A Python library for creating bioinformatics visualizations.

Conclusion

The bioinformatics landscape for RNA sequence analysis is vast, with tools catering to specific needs. Whether you’re studying coding RNAs, non-coding RNAs, or exploring RNA-protein interactions, the right tools can transform your data into biological insights.

RagTag: a collection of software tools for scaffolding and improving modern genome assemblies

Jit — Sat, 11 Sep 2021 00:28:14 -0500

RagTag is a collection of software tools for scaffolding and improving modern genome assemblies. Tasks include:

Homology-based misassembly correction
Homology-based assembly scaffolding and patching
Scaffold merging

Address of the bookmark: https://github.com/malonge/RagTag