BOL: Related items

Frequent words problem solution by Perl

Jit — Tue, 09 Jun 2015 23:38:44 -0500

Solved with Perl: http://rosalind.info/problems/1a/

#Find the most frequent k-mers in a string.
#Given: A DNA string Text and an integer k.
#Return: All most frequent k-mers in Text (in any order).

use strict;
use warnings;

my $string="ACGTTGCATGTCGCATGATGCATGAGAGCT";
my $kmer=4;
my %myHash;
my $max=0;

for (my $aa=0; $aa<=(length($string)-$kmer); $aa++) {
   my $myStr=substr $string, $aa,$kmer;
   #print "$myStr\n";
   my $km=kmerMatch ($string, $myStr, $kmer);
   if ($km > $max) { $max = $km;}
   #print "$km\t$myStr\n";
   $myHash{$myStr}=$km;
}

#Print every k-mer whose count equals the maximum
foreach my $name (keys %myHash){
    print "$name " if $myHash{$name} == $max;
}
print "\n";

sub kmerMatch { #Count exact matches of a k-mer using a sliding window
    my ($string, $myStr, $kmer)=@_;
    my $count=0;
    for (my $aa=0; $aa<=(length($string)-$kmer); $aa++) {
        my $myWin=substr $string, $aa,$kmer;
        if ($myWin eq $myStr) {
            #print "$myWin eq $myStr\n";
            $count++;
        }
    }
    return $count;
}
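
The script above rescans the whole string once for every window, so the work grows roughly with the square of the string length. As a hedged alternative, here is a minimal Python sketch (not the author's solution; the function name most_frequent_kmers is made up for illustration) that counts every k-mer in a single pass and then keeps the ones with the highest count.

```python
from collections import Counter


def most_frequent_kmers(text, k):
    """Return the set of k-mers that occur most often in text."""
    if k <= 0 or k > len(text):
        return set()
    # One pass: slide a window of width k over the text and tally each k-mer.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    best = max(counts.values())
    return {kmer for kmer, n in counts.items() if n == best}


# Same input as the Perl script above.
print(sorted(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)))
# prints ['CATG', 'GCAT']
```

Overlapping occurrences are counted, just as in the sliding-window version above.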

BioToolbox

Jit — Fri, 19 Feb 2016 09:14:44 -0600

This is a collection of libraries and high-quality end-user scripts for bioinformatic analysis, including working with gene annotation, collecting data scores from a variety of modern file formats, and conversion between file formats. The Bio::ToolBox libraries provide a unified, abstracted interface to multiple common gene annotation formats and the collection of data from multiple data files. They rely on BioPerl SeqFeature libraries and related adaptors to access binary file formats including Bam, BigWig, BigBed, and USeq. The Bio::ToolBox package includes scripts for setting up databases of annotation, collecting annotated features, collecting genomic data relative to features, manipulating and analyzing data, and data format conversion.

More at http://cpansearch.perl.org/src/TJPARNELL/

Address of the bookmark: http://cpansearch.perl.org/src/TJPARNELL/

Benchmarking Perl Module !

Rahul Nayak — Sat, 25 Aug 2018 11:40:42 -0500

The Benchmark module is a handy tool for measuring how long code takes to run. Its output is usually reported as CPU time, which gives us a basis for optimizing our code. With the advent of petascale computing and multicore processors, it is becoming a necessity to know how much CPU time our Perl programs take.

This is the simplest way to use the module:

Example 1:
use Benchmark;
my $first_time = Benchmark->new;
# ... our code ...
my $second_time = Benchmark->new;
# timediff takes the later time first
my $final_difference = timediff($second_time, $first_time);
print "the code took: ", timestr($final_difference), "\n";

That is a very simple way to get the time difference. We can use the same pattern to time just one part of a program.

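A more sophisticated way:
use Benchmark;
my @arguments = ();    # placeholder: the arguments to pass to the sub
sub first {
    my (@args) = @_;
    # ... code to be timed ...
}
timethese(100, { first => sub { first(@arguments) } });
The first argument to timethese is the number of times to run each piece of code (here, 100).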

Hope this very small tutorial with Benchmark will help people get started.

Biologist versus computational biologist !

Abhimanyu Singh — Mon, 29 Oct 2018 04:23:24 -0500

This is how it works :)

Basics of DESeq2: Differential Expression Made Simple

LEGE — Wed, 28 May 2025 06:47:32 -0500

DESeq2 is a powerful and widely-used R package that identifies differentially expressed genes (DEGs) from RNA-seq data. Whether you're comparing treated vs untreated samples, disease vs healthy conditions, or wild-type vs mutant strains, DESeq2 helps you statistically determine which genes are significantly up- or down-regulated.

What Does DESeq2 Do?
DESeq2 analyzes count data—the number of sequencing reads that map to each gene. It:

Normalizes the data to account for sequencing depth and library size.

Estimates variance (dispersion) for each gene.

Fits a model to compare groups (e.g., control vs treated).

Calculates fold-changes and p-values to determine significance.
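
To make the normalization step more concrete, here is a minimal Python sketch of the median-of-ratios idea that DESeq2 uses to estimate per-sample size factors. It is an illustration only: the function name size_factors and the toy counts are made up, and the real package handles many more edge cases.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("every gene has at least one zero count")
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(len(counts[0])):
        # Ratio of this sample's count to the gene's geometric mean, in log space.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: the second sample was sequenced exactly twice as deeply as the first.
toy_counts = [[10, 20], [100, 200], [5, 10]]
print(size_factors(toy_counts))
```

Dividing each sample's counts by its size factor puts the samples on a comparable scale; in the toy data the second factor is twice the first.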

Installing DESeq2

You can install DESeq2 via Bioconductor in R:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")

Inputs Needed

A count matrix: genes as rows, samples as columns (raw counts, not normalized).

A sample metadata table (also called colData): defines the condition/group for each sample.

Example:

# Count matrix (rows = genes, columns = samples)
counts <- read.csv("counts.csv", row.names = 1)
# Sample metadata
colData <- data.frame(
  row.names = colnames(counts),
  condition = c("control", "control", "treated", "treated")
)

DESeq2 Workflow

1. Load the package
library(DESeq2)

2. Create a DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData,
                              design = ~ condition)

3. Run the differential expression analysis
dds <- DESeq(dds)

4. Get the results
res <- results(dds)
head(res)

This gives a table with:
log2FoldChange: how much expression changed
pvalue: statistical significance
padj: adjusted p-value (FDR corrected)
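
To show what "FDR corrected" means for the padj column, here is a minimal Python sketch of the Benjamini-Hochberg adjustment, which is the default correction behind padj. The function name and the example p-values are hypothetical, and DESeq2 additionally applies independent filtering, which this sketch leaves out.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value can never
    # exceed the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by the number of tests divided by its rank, so genes are called significant only if they stay below the threshold after accounting for how many genes were tested.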

Visualization (Optional but Powerful)

MA Plot
plotMA(res, ylim = c(-2, 2))

Volcano Plot (custom)
library(ggplot2)
# ggplot2 needs a plain data.frame; drop genes whose padj is NA
res_df <- as.data.frame(res)
res_df <- res_df[!is.na(res_df$padj), ]
res_df$significant <- res_df$padj < 0.05
ggplot(res_df, aes(x=log2FoldChange, y=-log10(padj), color=significant)) +
  geom_point() +
  theme_minimal()

Heatmap of Top Genes
library(pheatmap)
topgenes <- head(order(res$padj), 20)
vsd <- vst(dds, blind=FALSE)
pheatmap(assay(vsd)[topgenes, ])

Tips for Best Results

Use raw counts (not normalized or TPM/RPKM values).
Have replicates: DESeq2 relies on variance estimates, so at least 3 per group is ideal.
Watch out for batch effects and include them in your design if needed (e.g., ~ batch + condition).

Summary

Step                             Purpose
DESeqDataSetFromMatrix()         Load your data into DESeq2
DESeq()                          Run the differential expression analysis
results()                        Extract the output (log fold change, p-values, etc.)
plotMA() / ggplot2 / pheatmap    Visualize the results

Final Thoughts
DESeq2 is an essential tool for RNA-seq data analysis. It abstracts away much of the complexity of statistical modeling, while still giving you control when needed. Whether you're a bioinformatician or a wet-lab biologist, DESeq2 offers both ease of use and analytical power.

Tools to Predict the Impact of Missense Variants !

Jit — Mon, 23 Apr 2018 12:57:33 -0500

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen‐2, SIFT, FatHMM, MutationTaster‐2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants.

A study of 10 tools on five datasets found that such a comparative evaluation is hindered by two types of circularity. They arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used to train the tools and in those used to evaluate them, which may lead to overly optimistic results. Comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that the tools confounded by circularity are the most accurate of all, and that they even outperform optimized combinations of tools.
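
One common way to guard against the second type of circularity is to split a dataset by protein, so that no protein contributes variants to both the training and the evaluation set. The Python sketch below illustrates that idea only; it is not the procedure used in the study, and the function name, the "protein" field and the toy variants are all hypothetical.

```python
import random


def split_by_protein(variants, test_fraction=0.2, seed=0):
    """Split variants so that each protein lands entirely in one set.

    variants: list of dicts, each with at least a "protein" key.
    """
    proteins = sorted({v["protein"] for v in variants})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    # Hold out at least one protein for evaluation.
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    train = [v for v in variants if v["protein"] not in test_proteins]
    test = [v for v in variants if v["protein"] in test_proteins]
    return train, test


# Toy variants: the identifiers and labels are invented for illustration.
toy_variants = [
    {"protein": "P1", "change": "A10V", "pathogenic": True},
    {"protein": "P1", "change": "G55D", "pathogenic": False},
    {"protein": "P2", "change": "R7C", "pathogenic": True},
    {"protein": "P3", "change": "L99P", "pathogenic": False},
    {"protein": "P3", "change": "T12M", "pathogenic": True},
]
train, test = split_by_protein(toy_variants, test_fraction=0.34)
print(len(train), len(test))
```

Because whole proteins are held out, this split also rules out the first type of circularity within the dataset, since a variant cannot appear on both sides.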

The following tools are useful for predicting the impact of missense mutations:

PolyPhen‐2 (PP2)
“Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations”

MutationTaster‐2 (MT2)
“Evaluation of the disease‐causing potential of DNA sequence alterations”

MutationAssessor (MASS)
“Predicts the functional impact of amino acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms”

LRT
“Identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein‐coding sequences, which are likely to be unconditionally deleterious”

SIFT
“Predicts whether an amino acid substitution affects protein function”

GERP++
“Identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as “rejected substitutions.” Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element”

phyloP
“Compute conservation or acceleration P values based on an alignment and a model of neutral evolution”

FatHMM unweighted (FatHMM‐U)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”

FatHMM weighted (FatHMM‐W)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”; its weighting scheme attributes higher tolerance scores to SNVs in proteins, related proteins, or domains that already include a high fraction of pathogenic variants.

Combined Annotation Dependent Depletion (CADD)
“CADD is a tool for scoring the deleteriousness of single‐nucleotide variants as well as insertion/deletions variants in the human genome”

TARDIS: Toolkit for automated and rapid discovery of structural variants

Neel — Fri, 09 Jun 2017 04:43:31 -0500

tardis

Toolkit for Automated and Rapid DIscovery of Structural variants

Requirements

zlib (http://www.zlib.net)
mrfast (https://github.com/BilkentCompGen/mrfast)
htslib (included as submodule; http://htslib.org/)

Fetching tardis

git clone https://github.com/BilkentCompGen/tardis.git --recursive

https://github.com/BilkentCompGen/tardis

Address of the bookmark: https://github.com/BilkentCompGen/tardis

DeepVariant : an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Jit — Sat, 25 Jan 2020 13:28:09 -0600

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. DeepVariant relies on Nucleus, a library of Python and C++ code for reading and writing data in common genomics file formats (like SAM and VCF) designed for painless integration with the TensorFlow machine learning framework.

https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html

https://www.biorxiv.org/content/10.1101/092890v6

Address of the bookmark: https://github.com/google/deepvariant

MALVA: Genotyping by Mapping-free ALlele Detection of Known VAriants

Jit — Tue, 28 Jan 2020 03:39:22 -0600

MALVA is able to genotype multi-allelic SNPs and indels without mapping reads

MALVA correctly calls more indels than the most widely adopted genotyping pipelines

Mapping-free approaches are as accurate as alignment-based ones, while being faster

More at https://www.sciencedirect.com/science/article/pii/S2589004219302366

Address of the bookmark: https://github.com/AlgoLab/malva

Smash: An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

Jit — Tue, 26 Apr 2016 12:18:49 -0500

Smash is a completely alignment-free method/tool to find and visualise genomic rearrangements. Detection is based on conditional exclusive compression, namely using an FCM (finite-context model, a Markov model) of high context order (typically 20). For visualisation, Smash outputs an SVG image with an ideogram output architecture, where the patterns are represented with several HSV values (only the value varies). The method works at both small and large scales. It is nevertheless aimed mainly at large-scale comparison, since the main aim of the research was to learn where the large-scale structure [chromosome by chromosome] of several primates is the same or different, giving an at-a-glance map of the entire genomes.
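
The compression idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Smash's implementation: a finite-context model is trained on the reference sequence only, and the number of bits needed to encode each base of the target is then estimated. The function name, the smoothing parameter alpha, the four-letter alphabet and the small context order are all choices made for this toy example (the description above mentions a typical order of 20).

```python
import math
from collections import Counter, defaultdict


def conditional_bits(reference, target, order=3, alpha=1.0):
    """Estimate bits per base of target under a model built from reference."""
    # Count which symbol follows each context of length `order` in the reference.
    counts = defaultdict(Counter)
    for i in range(order, len(reference)):
        counts[reference[i - order:i]][reference[i]] += 1
    bits = []
    for i in range(order, len(target)):
        ctx = counts.get(target[i - order:i], {})
        total = sum(ctx.values())
        # Smoothed probability over a 4-letter DNA alphabet (assumption).
        p = (ctx.get(target[i], 0) + alpha) / (total + 4 * alpha)
        bits.append(-math.log2(p))
    return bits


reference = "ACGT" * 50
similar = conditional_bits(reference, "ACGT" * 10)
unrelated = conditional_bits(reference, "A" * 40)
print(sum(similar) / len(similar), sum(unrelated) / len(unrelated))
```

Regions of the target that need few bits are well explained by the reference, and regions that need close to 2 bits per base are not; a tool in this family can use such a profile to locate shared blocks between two sequences.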

Address of the bookmark: http://bioinformatics.ua.pt/software/smash/