BOL: Related items

Calculate the significance of the difference between two trends

BioStar — Tue, 14 Mar 2023 05:41:53 -0500

To calculate the significance of the difference between two trends, you can use a statistical test such as a t-test or ANOVA (analysis of variance). Here are the general steps to follow:

1. Define your null hypothesis (H0) and alternative hypothesis (H1). For example, H0 might be that there is no difference between the two trends, while H1 might be that there is a difference.
2. Collect data on the two trends. Check that the observations are independent, approximately normally distributed, and have equal variances.
3. Calculate the mean and standard deviation of each trend.
4. Calculate the test statistic using a t-test or ANOVA. The test statistic will depend on the specific test you choose, but it will generally compare the difference in means between the two trends to the variability within each trend.
5. Determine the p-value associated with the test statistic. The p-value is the probability of obtaining a test statistic at least as extreme as the one you calculated, assuming that the null hypothesis is true.
6. Compare the p-value to your chosen significance level (usually 0.05 or 0.01). If the p-value is less than or equal to the significance level, reject the null hypothesis and conclude that there is a significant difference between the two trends. If the p-value is greater than the significance level, fail to reject the null hypothesis and conclude that there is not enough evidence of a difference.

It's important to note that the specific details of each step will depend on the type of test you choose and the software you use to perform the analysis.
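
As a rough illustration of the steps above, here is a minimal Python sketch that uses only the standard library. It computes Welch's t statistic (the unequal-variance form of the t-test) and, because the standard library has no t distribution, estimates the p-value with a permutation test instead of a t-table lookup. The two groups are made-up numbers, and the function names are hypothetical; in practice you would more likely call t.test() in R.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic: difference in means divided by the unpooled standard error."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided p-value: share of label shufflings with |t| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up measurements for two groups (illustrative only).
group_a = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]
group_b = [6.2, 6.0, 5.8, 6.5, 6.1, 6.3]

t_stat = welch_t(group_a, group_b)
p_value = permutation_p_value(group_a, group_b)
alpha = 0.05  # chosen significance level
reject_h0 = p_value <= alpha
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

The permutation approach makes no normality assumption, which is why it is used here; with normally distributed data a standard t-test should lead to a similar conclusion.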

The most common methods for comparing means include:

Methods	R function	Description
T-test	t.test()	Compare two groups (parametric)
Wilcoxon test	wilcox.test()	Compare two groups (non-parametric)
ANOVA	aov() or anova()	Compare multiple groups (parametric)
Kruskal-Wallis	kruskal.test()	Compare multiple groups (non-parametric)

Data Visualization in Bioinformatics: Useful and Eye-Catching Plots for Data Analysis

LEGE — Sat, 14 Dec 2024 12:41:53 -0600

Data visualization is a cornerstone of bioinformatics, enabling researchers to interpret complex datasets effectively. With a plethora of data types—genomic sequences, expression profiles, protein interactions, and more—the right visualizations can make or break an analysis. This blog highlights some of the most useful and visually compelling plots for bioinformatics data analysis, along with tools to create them.

1. Heatmaps: Exploring Patterns in High-Dimensional Data

Heatmaps are a go-to visualization for representing high-dimensional datasets, such as gene expression or metabolomics data. They use color gradients to display data intensity, making patterns and clusters easily detectable.

Applications: Gene expression analysis, pathway enrichment, methylation studies.
Tools: Seaborn (Python), ComplexHeatmap (R), Morpheus (web-based).

Tip: Add dendrograms to visualize clustering of rows and columns for hierarchical relationships.

2. Volcano Plots: Highlighting Differential Features

Volcano plots are indispensable for identifying significantly differentially expressed genes or proteins. They plot the log2 fold change against –log10(p-value), making it easy to spot statistically significant changes.

Applications: RNA-seq, proteomics, and metabolomics.
Tools: ggplot2 (R), EnhancedVolcano (R), Plotly (Python).

Tip: Use color to highlight significant features and label key genes or proteins.
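
To make the axes concrete, the sketch below computes volcano-plot coordinates and a simple up/down/not-significant label for a few invented features. The gene names, fold changes, p-values, and both cutoffs are assumptions chosen for illustration, not values from any real analysis.

```python
import math

# Hypothetical differential-expression results: (gene, fold change, p-value).
results = [
    ("geneA", 4.0, 0.0001),
    ("geneB", 0.25, 0.002),
    ("geneC", 1.1, 0.6),
    ("geneD", 3.0, 0.2),
]

LFC_CUTOFF = 1.0  # assumed: at least a two-fold change in either direction
P_CUTOFF = 0.05   # assumed significance threshold


def volcano_point(fold_change, p_value):
    """Return the (x, y) position of one feature on a volcano plot."""
    return math.log2(fold_change), -math.log10(p_value)


def classify(fold_change, p_value):
    """Label a feature as 'up', 'down', or 'ns' (not significant)."""
    x, _ = volcano_point(fold_change, p_value)
    if p_value >= P_CUTOFF or abs(x) < LFC_CUTOFF:
        return "ns"
    return "up" if x > 0 else "down"


points = {gene: volcano_point(fc, p) for gene, fc, p in results}
labels = {gene: classify(fc, p) for gene, fc, p in results}
```

The labels can then drive the point colors in ggplot2, EnhancedVolcano, or Plotly.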

3. PCA Plots: Reducing Complexity with Principal Component Analysis

Principal Component Analysis (PCA) plots are used to reduce dimensionality and uncover trends or clusters in data. They provide insights into sample variability and grouping.

Applications: Transcriptomics, metabolomics, microbiome studies.
Tools: scikit-learn + Matplotlib (Python), prcomp (R), ClustVis (web-based).

Tip: Annotate clusters with metadata to enhance interpretability.

4. Manhattan Plots: Genome-Wide Association Studies

Manhattan plots visualize p-values across the genome, making it easy to identify significant associations in genome-wide studies. They resemble city skylines, with the highest peaks indicating loci of interest.

Applications: GWAS, QTL mapping.
Tools: qqman (R), Matplotlib (Python).

Tip: Use alternating colors for chromosomes and highlight significant SNPs for clarity.
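
The layout behind a Manhattan plot can be sketched in a few lines of Python. The example below places chromosomes end to end on the x-axis, converts p-values to -log10 scale, alternates a color index per chromosome, and flags SNPs below a threshold. The SNP records are invented, and the 5e-8 cutoff is a commonly used genome-wide threshold that is assumed here rather than taken from this post.

```python
import math

# Hypothetical association results: (chromosome, base-pair position, p-value).
snps = [
    (1, 1000, 0.20),
    (1, 5000, 3e-9),
    (2, 2000, 0.04),
    (2, 7000, 1e-3),
    (3, 1500, 6e-8),
]

GENOME_WIDE = 5e-8  # commonly used GWAS threshold; an assumption here


def manhattan_layout(records, threshold=GENOME_WIDE):
    """Return one dict per SNP: x offset, -log10(p), colour index, significance flag."""
    records = sorted(records)
    # Largest position seen per chromosome, used to lay chromosomes end to end.
    chrom_len = {}
    for chrom, pos, _ in records:
        chrom_len[chrom] = max(chrom_len.get(chrom, 0), pos)
    offsets, running = {}, 0
    for chrom in sorted(chrom_len):
        offsets[chrom] = running
        running += chrom_len[chrom]
    chrom_order = {chrom: i for i, chrom in enumerate(sorted(chrom_len))}
    layout = []
    for chrom, pos, p in records:
        layout.append({
            "x": offsets[chrom] + pos,
            "y": -math.log10(p),
            "colour": chrom_order[chrom] % 2,  # alternate between two colours
            "significant": p < threshold,
        })
    return layout


layout = manhattan_layout(snps)
```

Each dict maps directly onto one scatter point, so the result can be passed to Matplotlib or any other plotting library.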

5. Circular Plots (Circos): Visualizing Genomic Relationships

Circular plots are ideal for visualizing relationships across the genome, such as structural variations, gene duplications, or synteny.

Applications: Comparative genomics, structural variation studies.
Tools: Circos (standalone), RCircos (R), pyCircos (Python).

Tip: Keep the plot clean and avoid overcrowding to maintain readability.

6. Sankey Diagrams: Tracking Data Flows

Sankey diagrams visualize flows or relationships between categories, often used to track changes in gene expression or pathway enrichment across conditions.

Applications: Pathway analysis, gene set enrichment analysis.
Tools: Plotly (Python), networkD3 (R).

Tip: Use gradients or distinct colors to highlight key transitions.

7. Network Graphs: Mapping Interactions

Network graphs represent relationships between entities, such as protein-protein interactions or gene regulatory networks. Nodes represent entities, and edges represent relationships.

Applications: Systems biology, interactomics.
Tools: Cytoscape (standalone), igraph (R), NetworkX (Python).

Tip: Use edge thickness or node size to represent interaction strength or centrality.
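
As one way to follow this tip, the sketch below computes degree centrality from an edge list and scales it to a node size. The protein names and the size range are placeholders; igraph and NetworkX provide their own centrality functions, which you would normally use instead.

```python
from collections import defaultdict

# Hypothetical undirected protein-protein interactions (names are placeholders).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]


def degree_centrality(edge_list):
    """Fraction of the other nodes that each node is directly connected to."""
    neighbours = defaultdict(set)
    for u, v in edge_list:
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


centrality = degree_centrality(edges)
# Scale centrality to a drawing size; the 100 to 900 range is an arbitrary choice.
node_size = {node: 100 + 800 * c for node, c in centrality.items()}
hub = max(centrality, key=centrality.get)
```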

8. Violin Plots: Visualizing Data Distribution

Violin plots combine a boxplot with a density plot, showing the distribution and variability of data.

Applications: Single-cell RNA-seq, quantitative trait analysis.
Tools: Seaborn (Python), ggplot2 (R).

Tip: Split violins by groups for side-by-side comparisons.

9. Time-Series Plots: Monitoring Changes Over Time

Time-series plots display changes in variables across time points, useful for tracking gene expression dynamics or metabolic fluxes.

Applications: Time-course experiments, cell cycle studies.
Tools: Matplotlib (Python), ggplot2 (R).

Tip: Smooth the data to highlight trends while avoiding overfitting.
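
A centered moving average is one simple way to smooth a series. The sketch below is a minimal version with made-up expression values; the window size is the knob to watch, since a wide window hides real short-term changes and a narrow one leaves the noise in.

```python
def moving_average(values, window=3):
    """Centred moving average; the window shrinks at the edges so the length is kept."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up expression values at seven consecutive time points.
expression = [1.0, 3.0, 2.0, 6.0, 4.0, 8.0, 6.0]
trend = moving_average(expression, window=3)
```

Plotting `expression` as points and `trend` as a line keeps the raw data visible next to the smoothed curve.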

10. Genome Tracks: Visualizing Genomic Features

Genome tracks display multiple layers of genomic data, such as gene annotations, sequencing coverage, and epigenetic marks.

Applications: ChIP-seq, ATAC-seq, whole-genome sequencing.
Tools: IGV (standalone), pyGenomeTracks (Python).

Tip: Stack related tracks for direct comparisons.

11. UpSet Plots: Visualizing Set Intersections

UpSet plots are a powerful alternative to Venn diagrams for visualizing intersections between multiple datasets.

Applications: Overlap analysis for gene sets, pathways, or variants.
Tools: UpSetR (R), ComplexUpset (R), UpSetPlot (Python).

Tip: Use bar plots to represent the size of each intersection for added clarity.
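
The numbers behind an UpSet plot are the sizes of the exclusive intersections, that is, how many elements belong to exactly one combination of sets. Here is a small sketch with hypothetical gene sets; the set names and gene IDs are invented.

```python
from collections import Counter

# Hypothetical gene sets from three analyses.
gene_sets = {
    "A": {"g1", "g2", "g3", "g4", "g7"},
    "B": {"g2", "g3", "g5"},
    "C": {"g3", "g4", "g6"},
}


def exclusive_intersections(sets):
    """Count elements by the exact combination of sets that contain them."""
    membership = Counter()
    for element in set().union(*sets.values()):
        combo = tuple(sorted(name for name, s in sets.items() if element in s))
        membership[combo] += 1
    return membership


sizes = exclusive_intersections(gene_sets)
# Largest intersections first, as the bars of an UpSet plot are usually ordered.
bars = sorted(sizes.items(), key=lambda item: (-item[1], item[0]))
```

Each entry of `bars` corresponds to one bar and one dot pattern in the plot.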

12. Ridge Plots: Comparing Distributions

Ridge plots visualize the distributions of multiple datasets, stacked for easy comparison.

Applications: Transcriptomics, single-cell RNA-seq.
Tools: ggridges (R), Matplotlib (Python).

Tip: Use transparency and consistent scaling for better readability.

13. Chord Diagrams: Visualizing Connections Between Groups

Chord diagrams illustrate relationships between categories, such as shared genes between pathways or overlaps in regulatory elements.

Applications: Pathway overlap, synteny, co-expression networks.
Tools: circlize (R), HoloViews (Python).

Tip: Use distinct colors for each group to emphasize relationships.

14. Treemaps: Hierarchical Data Representation

Treemaps visualize hierarchical data as nested rectangles, with area proportional to data size.

Applications: Ontology enrichment, pathway analysis.
Tools: treemapify (R), Plotly (Python).

Tip: Use colors to represent additional variables, like significance or enrichment scores.

15. t-SNE/UMAP Plots: Dimensionality Reduction for Clustering

t-SNE and UMAP plots are great for visualizing high-dimensional data in two dimensions while preserving local or global structure.

Applications: Single-cell transcriptomics, clustering analyses.
Tools: scikit-learn (Python), umap-learn (Python), Seurat (R).

Tip: Combine with metadata annotations for better cluster interpretation.

Bringing It All Together

The choice of visualization can significantly impact the insights gained from bioinformatics data. By selecting plots tailored to your data type and analysis goals, you can effectively communicate your findings and make your research more impactful. Whether you’re a seasoned bioinformatician or a beginner, mastering these visualizations will elevate your analyses and presentations.

BBC Secret Universe: The Hidden Life of the Cell

Mon, 23 Sep 2013 18:19:54 -0500

This will help you understand how a cell works. © BBC MMXII

MIT Computational Biology Group

Thu, 18 Dec 2014 14:47:01 -0600

My research group consists primarily of computer science graduate students and postdocs with expertise in algorithms, statistical inference, and machine learning, who share a passion for understanding fundamental biological problems.

We work in a highly interdisciplinary environment at the interface of Computer Science and Biology. Since its inception, our lab has eagerly engaged in collaborative research partnerships with biological and experimental collaborators, facilitated by our affiliation with the Broad Institute and the Computational and Systems Biology initiative (CSBi) at MIT, our participation in the Epigenome Roadmap, ENCODE, and modENCODE consortia, and by several other ongoing collaborations at MIT, Harvard, and the Harvard Medical School affiliated hospitals.

http://compbio.mit.edu/

Raphael Lab

Sat, 04 Jul 2015 19:05:29 -0500

Raphael Lab research is focused on Bioinformatics and Computational Biology.

Current research interests include next-generation DNA sequencing, structural variation, genome rearrangements in cancer and evolution, and network analysis of somatic mutations in cancer. Earlier research included topics in comparative genomics, multiple sequence alignment, and motif finding.

More at http://compbio.cs.brown.edu/

Pollux: platform independent error correction of single and mixed genomes

Jit — Fri, 19 May 2017 09:41:27 -0500

Pollux is a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, we locate and correct insertion, deletion, and homopolymer errors while remaining sensitive to low coverage areas of sequencing projects. Using published data sets, we correct 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, and 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times rarer than successfully corrected errors. Furthermore, we show that the quality of assemblies improves when reads are corrected by our software.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0435-6

Address of the bookmark: https://github.com/emarinier/pollux

BLASR: Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Application and Theory

Jit — Wed, 23 May 2018 06:54:32 -0500

BLASR (Basic Local Alignment with Successive Refinement) is a method for mapping Single Molecule Sequencing (SMS) reads that are thousands to tens of thousands of bases long, where the divergence between the read and the genome is dominated by insertion and deletion errors.

Here is how I use blasr to align PacBio reads to the contigs (target.fasta). The file “target.fasta.sa” is the suffix array of “target.fasta”, generated by sawriter.

blasr query.fa ./target.fasta -sa ./target.fasta.sa -bestn 40 -maxScore -500 -m 4 -nproc 24 -out target.m4 -maxLCPLength 15

The output format option “-m 4” generates the alignment coordinates. It is not fully documented, but I can explain it to you.

I use a server with 24 cores and 48G of RAM for the alignment. It took about 2 to 3 hours to align 3G of PacBio reads to 10^6 short-read contigs with a mean length of 3.5 kbp.

Address of the bookmark: http://bix.ucsd.edu/projects/blasr/

Genomics for Bioinformaticians

Jitendra Narayan — Sat, 20 Jul 2013 07:03:00 -0500

Genomics is the study of the genomes of organisms. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research of single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.

Genomics was established by Fred Sanger when he first sequenced the complete genomes of a virus and a mitochondrion. His group established techniques of sequencing, genome mapping, data storage, and bioinformatic analyses in the 1970s and 1980s. A major branch of genomics is still concerned with sequencing the genomes of various organisms, but the knowledge of full genomes has created the possibility for the field of functional genomics, mainly concerned with patterns of gene expression during various conditions. The most important tools here are microarrays and bioinformatics. Study of the full set of proteins in a cell type or tissue, and the changes during various conditions, is called proteomics. A related concept is materiomics, which is defined as the study of the material properties of biological materials (e.g. hierarchical protein structures and materials, mineralized biological tissues, etc.) and their effect on the macroscopic function and failure in their biological context, linking processes, structure and properties at multiple scales through a materials science approach. The actual term 'genomics' is thought to have been coined by Dr. Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, ME), over beer at a meeting held in Maryland on the mapping of the human genome in 1986.

The NHGRI vision for the future of genomics, the outcome of almost two years of intense discussions with hundreds of scientists and members of the public, has three major areas of focus: Genomics to Biology, Genomics to Health, and Genomics to Society.

Genomics to Biology:
The human genome sequence provides foundational information that now will allow development of a comprehensive catalog of all of the genome's components, determination of the function of all human genes, and deciphering of how genes and proteins work together in pathways and networks.

Genomics to Health:
Completion of the human genome sequence offers a unique opportunity to understand the role of genetic factors in health and disease, and to apply that understanding rapidly to prevention, diagnosis, and treatment. This opportunity will be realized through such genomics-based approaches as identification of genes and pathways and determining how they interact with environmental factors in health and disease, more precise prediction of disease susceptibility and drug response, early detection of illness, and development of entirely new therapeutic approaches.

Genomics to Society:
Just as the HGP has spawned new areas of research in basic biology and in health, it has created new opportunities in exploring the ethical, legal, and social implications (ELSI) of such work. These include defining policy options regarding the use of genomic information in both medical and non-medical settings and analysis of the impact of genomics on such concepts as race, ethnicity, kinship, individual and group identity, health, disease, and "normality" for traits and behaviors.

This vision for the future of genomics is not just about the NHGRI. It encompasses the whole field of genomics, including the work of all the other Institutes and Centers at the NIH and of a number of other federal agencies. All of the NIH Institutes are already taking full advantage of the sequence and will apply its data to the better understanding of both rare and common diseases, almost all of which have a genetic component. A recent example of the way that the HGP and the knowledge and new technologies it has spawned are already facilitating science is the extremely rapid sequencing by groups in Canada and at the Centers for Disease Control and Prevention (CDC) in Atlanta of the genome of the virus that causes Severe Acute Respiratory Syndrome (SARS). The sequencing of the SARS virus genome provides insight into this new and deadly disease at a speed never before possible in science. In turn, this should lead to the rapid development of diagnostic tests and, in time, vaccines and effective treatments.

Links to additional material available on the Net

Genomes and genomics:

Bioinformatics and Genomics:

Structural genomics tutorial:

Comparative Genomics Tutorial:

GENOME TUTORIAL:

Tools and resources for identifying protein families, domains and motifs

Bioinformatics Tools
Tips, Tutorials, and Terminology for Using Selected Resources in Genome Database Guide:

A Web-Based Comparative Genomics Tutorial for Investigating Microbial Genomes:

Free Online Tutorials Teach Anyone How to Use Genome Databases:

Circos to create concise, explanatory, unique and print-ready visualizations of your data:

Genomics and Comparative Genomics Learning Module:

Computational Challenges in Comparative Genomics

A Tutorial:

A Comparative Genomics Resource for Grains:

PLAZA: A Comparative Genomics Resource to Study Gene and Genome Evolution in Plants:

VISTA:

Software for Genomics

Artemis: A free genome viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence and its six-frame translation.
Chromas: Displays and prints chromatogram files from ABI automated DNA sequencers, and Staden SCF files, which the analysis programs for ALF, Li-Cor and Visible Genetics OpenGene sequencers can create.
Glimmer: A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DN
GlimmerHMM: A fast and accurate gene finder based on a GHMM architecture, developed specifically for eukaryotes. It incorporates splice site models adapted from the GeneSplicer program and uses interpolated Markov models for evaluating the coding regions.
GlimmerM: A gene finder derived from Glimmer, but developed specifically for eukaryotes. It is based on a dynamic programming algorithm that considers all combinations of possible exons for inclusion in a gene model and chooses the best of these combinations. The d
MUMmer: A system for rapidly aligning entire genomes, whether in complete or draft form.
pDRAW: pDRAW32 is being developed as a spare-time hobby project. It is far from finished, but as it has reached a point where it could be helpful for many labs, it is now available to the scientific community.
Sequin: A stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. It is capable of handling simple submissions that contain a single short mRNA sequence, and complex submissio
Staden: The Staden Package consists of a series of tools for DNA sequence preparation (pregap4), assembly (gap4), editing (gap4) and DNA/protein sequence analysis (spin).

For more software, see http://bioinformaticsonline.com/bookmarks/view/926/list-of-popular-bioinformatics-softwaretools

Next Generation Sequencing (NGS) Tutorials

Jitendra Narayan — Sat, 24 Aug 2013 06:01:37 -0500

The Institute of Computational Biomedicine, Cornell University, provides an NGS workshop tutorial at http://chagall.med.cornell.edu/NGScourse/

You can also add your favourite NGS educational material or workshop tutorial by commenting on this bookmark, for the benefit of other users.

Understanding the basics of genome sequencing:

Tutorial by Luke Jostins.

http://www.genetic-inference.co.uk/blog/2009/04/basics-sequencing-dna-part-1/

http://www.genetic-inference.co.uk/blog/2009/08/basics-sequencing-dna-part-2/

A window into third-generation sequencing

http://hmg.oxfordjournals.org/content/19/R2/R227.full.pdf

==============================================

NGS data analysis pipelines

Detecting and annotating genetic variations using the HugeSeq pipeline DOI: 10.1038/nbt.2134
NARWHAL, a primary analysis pipeline for NGS data http://bioinformatics.oxfordjournals.org/cgi/content/abstract/28/2/284?etoc
RseqFlow: Workflows for RNA-Seq data analysis DOI: 10.1093/bioinformatics/btr441
ngs_backbone: a pipeline for read cleaning, mapping and SNP calling using Next Generation Sequence DOI: 10.1186/1471-2164-12-285
A framework for variation discovery and genotyping using next-generation DNA sequencing data PubMed: 21478889
SNiPlay: a web-based tool for detection, management and analysis of SNPs. Application to grapevine diversity projects DOI: 10.1186/1471-2105-12-134 Abstract: http://www.biomedcentral.com/1471-2105/12/134/abstract
WEP: a high-performance analysis pipeline for whole-exome data http://www.biomedcentral.com/1471-2105/14/S7/S11
DDBJ read annotation pipeline: a cloud computing-based pipeline for high-throughput analysis of next-generation sequencing data. http://www.ncbi.nlm.nih.gov/pubmed/23657089
GATK: a Toolkit for Genome Analysis http://www.broadinstitute.org/gatk/
Metagenomics: http://www.nbic.nl/education/nbic-phd-school/course-schedule/ngsmetagenomics/
RNASeq: http://www.nbic.nl/education/nbic-phd-school/course-schedule/ngsrnaseq/
Bioinformatics and Seq courses: http://www.isb-sib.ch/training/training-activities-schedule/archive-2013.html
Variant Detection (Model organism) Advanced tutorial https://docs.google.com/document/pub?id=1CuKkKylVDb03tnN7RSWl5EUzleetn0ctjmvaidPKLxM
Variant Detection Introductory tutorial https://docs.google.com/document/pub?id=1ZRzrjjOCvtAu3m-IKL-rbJ1f4On60dDL_IEwG7oejdI
Microbial de novo Assembly for Illumina Data Introductory tutorial https://docs.google.com/document/pub?id=1N3AB9ptISUu4zULqe1kXpVF0BDyGb5f5yzxWSJd_WNM
RNAseq Differential Gene Expression Introductory tutorial https://docs.google.com/document/pub?id=1KbTiBHtvHLfPRZ39AY3uriazrINA8TJzgjjwn1zPP7Y

"Please add your favourite NGS link in the comment section below for the benefit of the bioinformatics community."

Address of the bookmark: http://chagall.med.cornell.edu/NGScourse/

Bioinformatics Protocols

Rahul Nayak — Mon, 05 May 2014 10:21:41 -0500

RNA Seq

Basic Galaxy Tutorial

RNA-Seq tutorial based on Trapnell et al. (2012) Nature Protocols

In this tutorial we cover the concepts of RNA-Seq differential gene expression (DGE) analysis using a very small synthetic dataset from a well studied organism.

Advanced Galaxy Tutorial

RNA-Seq (Advanced) Tutorial

In this tutorial we compare the performance of three statistically-based differential expression tools:

* Cuffdiff

* edgeR

* DESeq2

Advanced Command Line Tutorial

Graphical Output with CummeRbund introduces some basic commands using the cummeRbund package for the R programming language.

You will need to install R, RStudio and cummeRbund on your PC (explained in the tutorial). You will learn how to produce graphical output from an RNA-Seq analysis previously run with Cuffdiff.

Variant Detection

Basic Galaxy Tutorial

Variant Detection tutorial

In this tutorial we cover the concepts of detecting small variants (SNVs and indels) in human genomic DNA using a small set of reads from chromosome 22.

Advanced Galaxy Tutorial

Variant Detection (Advanced) Tutorial

In this tutorial we compare the performance of three statistically-based variant detection tools:

* SAMtools: Mpileup

* GATK: Unified Genotyper

* FreeBayes

Each of these tools takes a BAM file of aligned reads as its input and generates a list of likely variants in VCF format.
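
To give a feel for the VCF output, here is a minimal Python sketch that reads the fixed columns of VCF data lines and labels each ALT allele as an SNV or an indel by comparing allele lengths. The example records are made up, the function names are hypothetical, and the sketch ignores symbolic alleles such as <DEL> or *; a dedicated VCF library is the better choice for real data.

```python
def classify_variant(ref, alt):
    """Label one REF/ALT pair as 'SNV', 'indel' or 'other'."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def parse_vcf(lines):
    """Yield (chrom, pos, ref, alt, kind) for every ALT allele in VCF text lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            yield chrom, int(pos), ref, alt, classify_variant(ref, alt)


# A tiny made-up VCF fragment for chromosome 22 (not real calls).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "22\t16050075\t.\tA\tG\t50\tPASS\t.",
    "22\t16050115\t.\tGA\tG\t43\tPASS\t.",
    "22\t16050213\t.\tC\tT,CTT\t38\tPASS\t.",
]
variants = list(parse_vcf(example))
```

Counting the labels per tool is a quick first check when comparing the call sets from the three variant callers.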

Pipelines are for those who are comfortable using the UNIX command line, and often allow more control over branching and iteration logic.

WGS/exome GATK-based variant calling pipeline

This is a basic variant-calling and annotation pipeline developed at the Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne. It is based around BWA, GATK and ENSEMBL and was originally designed for human (or similar) data. The master branch is configured for WGS data; there is an exome branch configured for variant calling in exome data.

To run the pipeline you will need Rubra: https://github.com/bjpop/rubra. Rubra uses the Python Ruffus library: http://www.ruffus.org.uk/.

Protocols

Familial Variant Calling

In this protocol we discuss and outline the process of calling familial related mutations.

Somatic Variant Calling

In this protocol we discuss and outline the process of identifying somatic variants or mutations.

Assembly

Basic Galaxy Tutorial

Genome assembly tutorial

In this tutorial we carry out de novo assembly of a microbial genome. We have also written a De novo Genome Assembly for Illumina Data Protocol for a more generic description of the method.

Protocol

De novo Genome Assembly for Illumina Data

In this protocol we discuss and outline the process of de novo assembly for small to medium sized genomes. Use our Genome assembly tutorial to learn a specific case of using Galaxy to carry out de novo assembly of a microbial genome.

Small RNAs

Basic Galaxy Tutorial

Quality control for small RNA

This tutorial covers the initial steps of the workflow for analysis of short RNA expression: quality control of the raw reads, processing of the raw reads for subsequent analysis, and initial quality assessment of the library.

ChIP Seq

Protocol

ChIP-Seq

In this protocol we discuss ChIP-Seq: a method to analyze the interaction between proteins and DNA.

Amplicons

Protocol

Amplicon Alignment

In this protocol we discuss and outline the process of aligning custom amplicons using primers for high precision.

Learn Galaxy

Introduction to Galaxy, for those who are very new to Galaxy.

Using Histories and Workflows, for those with some Galaxy knowledge.

The Galaxy project website has many tutorials and screencasts about using Galaxy and the tools, and developing new tools.

Address of the bookmark: https://genome.edu.au/wiki/Learn