BOL: Related items

Finding Patterns in Biological Sequences

Jit — Thu, 22 Dec 2016 10:30:49 -0600

In this report we provide an overview of known techniques for the discovery of patterns in biological sequences (DNA and proteins). We also provide the biological motivation and methods for the biological verification of such patterns. Finally, we list publicly available tools and databases for pattern discovery. An on-line supplement is available through http://genetics.uwaterloo.ca/~tvinar/cs798g/motif.

Address of the bookmark: http://engr.case.edu/li_jing/papers/00798gpattern.pdf

Tiny Python3.6 Notebook

Neel — Sat, 03 Jun 2017 03:16:28 -0500

This is not so much an instructional manual, but rather notes, tables, and examples for Python syntax. It was created by the author as an additional resource during training, meant to be distributed as a physical notebook. Participants (who favor the physical characteristics of dead tree material) could add their own notes, thoughts, and have a valuable reference of curated examples.

Address of the bookmark: https://github.com/mattharrison/Tiny-Python-3.6-Notebook/blob/master/python.rst

SNP Analysis: Unlocking the Secrets in Our DNA

Abhi — Wed, 16 Jul 2025 01:31:45 -0500

Single Nucleotide Polymorphisms (SNPs) are the most common type of genetic variation in humans—and many other organisms. A single base change in the DNA sequence (for example, an A instead of a G) can influence everything from our eye color to our risk of developing diseases. Analyzing these tiny changes has become central to modern genetics, medicine, agriculture, and evolutionary biology.

What are SNPs?
SNPs (pronounced "snips") are positions in the genome where individuals differ by a single nucleotide. For example:

Reference: ...A T G C A T G A...
Variant: ...A T G T A T G A...

Here, the C in the reference genome has been replaced by a T in the variant.
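The comparison above can be sketched in a few lines of Python. This is only an illustration: the function name `find_snps` is hypothetical, and the sketch assumes the two sequences are already aligned and of equal length, which real data rarely is without an alignment step.

```python
def find_snps(reference, variant):
    """Return (position, ref_base, alt_base) for every single-base mismatch.

    Positions are 1-based. Both sequences are assumed to be pre-aligned
    and of equal length (no insertions or deletions are handled).
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, variant), start=1)
        if ref_base != alt_base
    ]


# The example from the text, with the spaces removed
reference = "ATGCATGA"
variant = "ATGTATGA"
print(find_snps(reference, variant))  # [(4, 'C', 'T')]
```

The single reported difference is the C to T change at the fourth base.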

SNPs occur roughly every 300–1,000 bases in the human genome, meaning there are millions of them scattered throughout our DNA. Most SNPs have no effect on health, but some are linked to disease susceptibility, drug response, and other traits.

Why Do We Analyze SNPs?
1. Medical Genetics

Identify disease-associated variants (e.g., BRCA1/2 in breast cancer).

Predict drug response (pharmacogenomics).

Enable precision medicine by tailoring treatments.

2. Population Genetics & Ancestry

Trace human migration and ancestry.

Study genetic diversity within and between populations.

3. Agriculture & Animal Breeding

Select for desirable traits (drought resistance, yield, disease resistance).

Improve breeding efficiency in livestock.

4. Evolutionary Biology

Track natural selection.

Study adaptation in wild populations.

How is SNP Analysis Performed?
SNP analysis can be broadly divided into three steps:

SNP Detection
Genotyping arrays: Chips that test hundreds of thousands of known SNP positions simultaneously. Fast and affordable, widely used in consumer ancestry testing.

Whole-genome or whole-exome sequencing: Can detect known and novel SNPs across the genome.

Targeted sequencing or PCR: For focused analysis of specific regions.

Variant Calling
Sequencing data is aligned to a reference genome. Bioinformatics tools (e.g., GATK, bcftools) identify positions where the sequenced sample differs from the reference.
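To make the idea concrete, here is a deliberately naive Python sketch of variant calling from a pileup of aligned reads. It is not how GATK or bcftools work (those tools use probabilistic models, base qualities, and much more); the function name `call_variants`, the thresholds `min_depth` and `min_fraction`, and the toy reads are all assumptions made for illustration. Reads are assumed to be ungapped and to lie within the reference.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Toy variant caller.

    aligned_reads: iterable of (start, sequence) pairs, where start is the
    0-based reference position of the first base of an ungapped read.
    Returns a list of (1-based position, ref_base, alt_base, depth).
    """
    # Build the pileup: for each reference position, count the observed bases
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        top_base, top_count = counts.most_common(1)[0]
        ref_base = reference[pos]
        # Report a variant only if the dominant base differs from the
        # reference and is supported by enough reads
        if (top_base != ref_base and depth >= min_depth
                and top_count / depth >= min_fraction):
            calls.append((pos + 1, ref_base, top_base, depth))
    return calls


reference = "ATGCATGA"
reads = [(0, "ATGTA"), (1, "TGTAT"), (2, "GTATG"), (3, "TATGA")]
print(call_variants(reference, reads))  # [(4, 'C', 'T', 4)]
```

With these thresholds the sketch only reports positions where nearly all reads disagree with the reference, so a heterozygous site (roughly half the reads carrying the variant) would be missed; real callers model genotypes explicitly.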

Annotation and Interpretation
Tools (e.g., SnpEff, VEP) predict the functional impact of SNPs.

Are the SNPs in coding regions? Do they cause amino acid changes? Are they known to be pathogenic?

Databases like dbSNP, ClinVar, and GWAS Catalog provide information on known associations.

Common Tools for SNP Analysis
Alignment: BWA, Bowtie2

Variant Calling: GATK, FreeBayes

Visualization: IGV, UCSC Genome Browser

Annotation: SnpEff, VEP

Statistical Analysis: PLINK, SNPTEST

Challenges in SNP Analysis
False positives/negatives: Sequencing errors, alignment issues.

Population stratification: Confounding in association studies.

Interpretation: Many SNPs have unknown or complex effects.

Researchers address these with rigorous quality control, large datasets, and increasingly sophisticated statistical models.

The Future of SNP Analysis
With advances in sequencing technology and AI-driven analysis, SNP studies are expanding:

Polygenic risk scores predict disease risk based on thousands of SNPs.

Large-scale biobanks (e.g., UK Biobank, All of Us) enable powerful genome-wide association studies (GWAS).

CRISPR and functional assays help validate SNP effects in the lab.

SNP analysis is at the heart of the genomic revolution, promising insights into biology, health, and evolution at unprecedented scale.
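As a rough illustration of the polygenic risk scores mentioned above, a score can be computed as a weighted sum of effect-allele counts across SNPs. The Python sketch below assumes per-SNP weights are already available (for example from a GWAS); the SNP identifiers, weights, and the function name `polygenic_risk_score` are made up for the example.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of effect-allele dosages.

    dosages: dict mapping SNP id -> number of effect alleles carried (0, 1, or 2)
    weights: dict mapping SNP id -> per-allele effect size
    SNPs missing from either dict are ignored.
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[snp] * weights[snp] for snp in shared)


# Hypothetical weights and one individual's genotype dosages
weights = {"snp_a": 0.5, "snp_b": -0.25, "snp_c": 0.125}
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
print(polygenic_risk_score(dosages, weights))  # 0.75
```

In practice a raw score like this is only interpretable relative to the distribution of scores in a reference population.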

Conclusion
From diagnosing rare diseases to designing better crops, SNP analysis is a foundational tool in modern science. As our ability to sequence and interpret genomes improves, so will our understanding of these tiny—but mighty—variations in DNA.

Biological file format tutorial

Jit — Sun, 17 Dec 2017 18:13:03 -0600

This section explains some of the commonly used file formats in bioinformatics. The information provided here is basic and designed to help users distinguish between the different formats. Please refer to the user manual or other resources on the web for more details.

Address of the bookmark: https://bioinformatics.uconn.edu/resources-and-events/tutorials/file-formats-tutorial/

Machine learning training and courses in bioinformatics !

Rahul Nayak — Tue, 31 Dec 2019 19:33:07 -0600

Machine learning techniques have been successful in analyzing biological data because of their ability to handle randomness, uncertainty, and noise in the data, and to generalize. In this class, we will learn the basics of probabilistic models and machine learning techniques. We will focus on probabilistic models (Markov models, Hidden Markov models, and Bayesian networks) for biological sequence analysis and systems biology. Other machine learning techniques, such as Naive Bayes, neural networks, and SVMs, will only be covered briefly.

More at http://homes.sice.indiana.edu/yye/lab/teaching/spring2017-I529/

Pangolin tutorial !

Abhi — Fri, 10 Dec 2021 05:58:59 -0600

This is a tutorial for using the Pangolin Web Application. For information on using the command line tool, please visit the command line tool usage page.

Address of the bookmark: https://cov-lineages.org/resources/pangolin/tutorial.html

Trelliscope: flexibly visualize large, complex data in great detail from within the R statistical programming environment.

Jit — Tue, 21 Jan 2020 04:22:49 -0600

Trelliscope provides a way to flexibly visualize large, complex data in great detail from within the R statistical programming environment. Trelliscope is a component in the DeltaRho environment.

For those familiar with Trellis Display, faceting in ggplot, or the notion of small multiples, Trelliscope provides a scalable way to break a set of data into pieces, apply a plot method to each piece, and then arrange those plots in a grid and interactively sort, filter, and query panels of the display based on metrics of interest. With Trelliscope, we are able to create multipanel displays on data with a very large number of subsets and view them in an interactive and meaningful way.

Address of the bookmark: http://deltarho.org/docs-trelliscope/#introduction

MAJIQ 2 is released !

LEGE — Thu, 09 Jul 2020 03:06:26 -0500

Ability to detect, quantify, and visualize complex and de-novo splicing variations from RNASeq.

MAJIQ’s accuracy compares favorably to other algorithms.

MAJIQ 2 is *way* faster, more memory and I/O efficient.

New visualization (VOILA 2.0).

Ability to analyze hundreds and thousands of samples.

Why so negative? (Support for a confident negative set.)

Finally, a major reason we are excited about MAJIQ 2.0 is that it sets the code base for many exciting new algorithmic and visualization improvements, with application to new research questions, so stay tuned!

More at https://biociphers.wordpress.com/2019/04/01/majiq-2-is-out/

Address of the bookmark: https://majiq.biociphers.org/

Basics of DESeq2: Differential Expression Made Simple

LEGE — Wed, 28 May 2025 06:47:32 -0500

DESeq2 is a powerful and widely used R package that identifies differentially expressed genes (DEGs) from RNA-seq data. Whether you're comparing treated vs untreated samples, disease vs healthy conditions, or wild-type vs mutant strains, DESeq2 helps you determine statistically which genes are significantly up- or down-regulated.

What Does DESeq2 Do?
DESeq2 analyzes count data—the number of sequencing reads that map to each gene. It:

Normalizes the counts to account for differences in sequencing depth (library size) and RNA composition between samples.

Estimates variance (dispersion) for each gene.

Fits a negative binomial model for each gene to compare groups (e.g., control vs treated).

Calculates log2 fold changes and p-values to determine significance.
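The normalization step can be illustrated outside R. DESeq2's default normalization is the median-of-ratios method; the pure-Python sketch below shows that idea on a tiny made-up count table. It is not DESeq2's own code, and the function name `size_factors` and the data are assumptions for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")

    # Log of the geometric mean of each usable gene across samples
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]

    n_samples = len(counts[0])
    factors = []
    for j in range(n_samples):
        # Log ratio of this sample's count to the gene's geometric mean
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced about twice as deeply as sample 1
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 7],   # skipped: contains a zero
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print(factors)
```

Here the second size factor comes out roughly twice the first, so dividing each column by its factor puts the two samples on a comparable scale.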

Installing DESeq2

You can install DESeq2 via Bioconductor in R:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")

Inputs Needed

A count matrix: genes as rows, samples as columns (raw counts, not normalized).

A sample metadata table (also called colData): defines the condition/group for each sample.

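Example:

# Count matrix (rows = genes, columns = samples)
counts <- read.csv("counts.csv", row.names = 1)

# Sample metadata: one row per sample, in the same order as the columns of counts.
# This example assumes four samples: two controls followed by two treated.
colData <- data.frame(
  row.names = colnames(counts),
  condition = factor(c("control", "control", "treated", "treated"),
                     levels = c("control", "treated"))
)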

DESeq2 Workflow
1. Load the package
library(DESeq2)

2. Create a DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = colData,
                              design = ~ condition)

3. Run the differential expression analysis
dds <- DESeq(dds)

4. Get the results
res <- results(dds)
head(res)

This gives a table whose columns include:
log2FoldChange: log2 of the expression ratio between groups (here, treated relative to control)
pvalue: unadjusted p-value from the statistical test
padj: p-value adjusted for multiple testing (Benjamini-Hochberg FDR by default)
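To show what the FDR adjustment does to a list of p-values, here is a small Python sketch of the Benjamini-Hochberg procedure, the default adjustment method in DESeq2's results(). It is a stand-alone illustration rather than DESeq2's implementation (which, among other things, can also set some adjusted values to NA through independent filtering); the function name `benjamini_hochberg` and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as p decreases
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.20]
for p, q in zip(pvals, benjamini_hochberg(pvals)):
    print(f"p = {p:.2f}  padj = {q:.4f}")
```

For these four values the adjusted p-values are about 0.04, 0.0533, 0.0533 and 0.20, so at a 0.05 threshold only the first gene would still be called significant.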

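Visualization (Optional but Powerful)

MA Plot
plotMA(res, ylim = c(-2, 2))

Volcano Plot (custom)
library(ggplot2)
# Convert to a plain data frame for ggplot2; genes with NA padj count as not significant
res_df <- as.data.frame(res)
res_df$significant <- !is.na(res_df$padj) & res_df$padj < 0.05
ggplot(res_df, aes(x = log2FoldChange, y = -log10(padj), color = significant)) +
  geom_point() +
  theme_minimal()

Heatmap of Top Genes
library(pheatmap)
# Row indices of the 20 genes with the smallest adjusted p-values
topgenes <- head(order(res$padj), 20)
# Variance-stabilized counts are better suited to plotting than raw counts
vsd <- vst(dds, blind = FALSE)
pheatmap(assay(vsd)[topgenes, ])

Tips for Best Results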
Use raw counts (not normalized or TPM/RPKM values).
Have replicates: DESeq2 relies on variance estimates, so at least 3 per group is ideal.
Watch out for batch effects—include them in your design if needed (e.g., ~ batch + condition).

Summary

Step                            Purpose
DESeqDataSetFromMatrix()        Load your data into DESeq2
DESeq()                         Run the differential expression analysis
results()                       Extract the output (log fold change, p-values, etc.)
plotMA() / ggplot2 / pheatmap   Visualize the results

Final Thoughts
DESeq2 is an essential tool for RNA-seq data analysis. It abstracts away much of the complexity of statistical modeling, while still giving you control when needed. Whether you're a bioinformatician or a wet-lab biologist, DESeq2 offers both ease of use and analytical power.

Perl in a day !!

Jitendra Narayan — Sat, 10 Aug 2013 21:14:03 -0500

This PDF-based tutorial is a good resource for learning the basics of Perl in a day.

http://ritg.med.harvard.edu/training/perl/RC_Perl_Intro.pdf