BOL: Related items

Randomness and Probability

Jit — Tue, 08 Nov 2016 07:17:32 -0600

Randomness and probability are two different concepts: probability is a measure (in the sense of measure theory) that quantifies randomness, and randomness is the object being measured. For example, probability is a mapping from random events to real numbers between 0 and 1. Similarly, entropy measures uncertainty, and the product of length and width measures the area of a rectangle.
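
As a small illustration of the two measures mentioned above, here is a minimal Python sketch. It assumes a finite sample space with equally likely outcomes, and the function names probability and shannon_entropy are chosen for this example only.

```python
import math

def probability(event, sample_space):
    """Uniform probability measure: the share of outcomes that belong to the event."""
    return len(event & sample_space) / len(sample_space)

def shannon_entropy(distribution):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

die = {1, 2, 3, 4, 5, 6}            # sample space of a fair die (assumed example)
even = {2, 4, 6}                    # the event "an even number is rolled"

print(probability(even, die))       # 0.5
print(shannon_entropy([0.5, 0.5]))  # 1.0 (a fair coin carries one bit of uncertainty)
```

The probability always lands between 0 and 1, while the entropy grows as the outcomes become harder to predict.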

Please see “A mathematical theory of ability measure” by N. Kong et al. for more examples to answer this question.

Biostats Online course

Abhimanyu Singh — Thu, 10 Nov 2016 04:22:51 -0600

One of our primary focuses will be to develop an understanding of the various ways in which we can assign a probability to some chance event. We'll also learn the fundamental properties of probability, investigate how probability behaves, and learn how to calculate the probability of a new chance event.

This book is handy for understanding basic concepts.

Address of the bookmark: https://onlinecourses.science.psu.edu/stat414/node/287

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid-1960s scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see Reassociation Kinetics Experiments, and C-Value Paradox). These sequences were later characterized and placed into five categories:

Simple Repeats - Duplications of simple sets of DNA bases (typically 1-5bp) such as A, CA, CGG etc.
Tandem Repeats - Typically found at the centromeres and telomeres of chromosomes these are duplications of more complex 100-200 base sequences.
Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
Interspersed Repeats
Processed Pseudogenes, Retrotranscripts, SINES - Non-functional copies of RNA genes which have been reintegrated into the genome with the assistance of a reverse transcriptase.
DNA Transposons
Retrovirus Retrotransposons
Non-Retrovirus Retrotransposons ( LINES )

Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.
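
The simple repeats described in the list above (a short unit such as A, CA or CGG copied back to back) can be located with a regular expression. The following is a minimal sketch and not how dedicated repeat finders work; the function name find_simple_repeats and its thresholds (max_unit, min_copies) are assumptions made for this example.

```python
import re

def find_simple_repeats(seq, max_unit=5, min_copies=3):
    """Yield (start, unit, copies) for a 1..max_unit bp unit repeated back to back."""
    # The lazy group prefers the shortest unit; \1{n,} requires the extra copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        yield match.start(), unit, len(match.group(0)) // len(unit)

# Made-up sequence containing a CA repeat and a CGG repeat
for hit in find_simple_repeats("TTCACACACATTCGGCGGCGGAA"):
    print(hit)
# (2, 'CA', 4)
# (12, 'CGG', 3)
```

Start positions are 0-based. Longer tandem repeats, interspersed repeats and imperfect copies need the specialised tools discussed further down.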

On the other hand, in genetics the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless of the direction from which the strand is analyzed. Akin to a language palindrome, in which a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba"), with genetic palindromes it does not matter whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end of the strand.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream termini defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' end (phosphate group bound to the 5' carbon of the sugar) and the 3' end (hydroxyl group on the 3' carbon) to indicate a polarity within the molecule. Using the letters A, T, C, G to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, G to represent the nitrogenous bases adenine, uracil, cytosine, and guanine found in RNA (note that uracil in RNA replaces the thymine found in DNA), geneticists usually represent DNA by a series of base codes (e.g., 5' AATCGGATTGCA 3'). The base codes are usually arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with thymine (T) and cytosine (C) always bonds with guanine (G)), the complementary strand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.

With palindromes, the sequences on the complementary strands read the same when each is read from its 5' end. For example, a sequence of 5' GAATTC 3' on one strand would be complemented by a 3' CTTAAG 5' strand. When either strand is read from the 5' end, the sequence is GAATTC. A different kind of example is the sequence 5' CGAAGC 3', which still reads CGAAGC when the order of its bases on a single strand is reversed.
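
The base-pairing rule and the palindrome check can be sketched in a few lines of Python. This is an illustration for plain A/C/G/T DNA strings only; the helper names (complement, reverse_complement, is_reverse_palindrome) are made up for this example.

```python
# Watson-Crick pairing for DNA: A<->T and C<->G
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, i.e. the partner strand written 3' to 5'."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq):
    """The partner strand written in the usual 5' to 3' direction."""
    return complement(seq)[::-1]

def is_reverse_palindrome(seq):
    """True if both strands read the same from their 5' ends."""
    return seq.upper() == reverse_complement(seq)

print(complement("AATCGGATTGCA"))        # TTAGCCTAACGT
print(is_reverse_palindrome("GAATTC"))   # True
print(is_reverse_palindrome("CGAAGC"))   # False
print("CGAAGC" == "CGAAGC"[::-1])        # True
```

The last two lines show that CGAAGC reads the same when its letters are reversed on one strand, yet it is not its own reverse complement, whereas GAATTC is.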

Palindromes are important sequences within nucleic acids. Often they are the site of binding for specific enzymes (e.g., restriction endonucleases) designed to cut the DNA strands at specific locations (i.e., at palindromes).

Palindromes may arise from breakage and chromosomal inversions that form inverted repeats that complement each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' CGAAGC 3', if inverted (reversed 180°), still reads CGAAGC.

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

4DGenome

Jitendra Narayan — Mon, 04 Jul 2016 00:44:55 -0500

Records in 4DGenome are compiled through comprehensive literature curation of experimentally-derived and computationally-predicted interactions. The current release contains 4,433,071 experimentally-derived and 3,605,176 computationally-predicted interactions in 5 organisms. Experimental data cover both high throughput datasets and individual focused studies.

All interaction data are freely available in a standardized file format. Records can be queried by genomic regions, gene names, organism, and detection technology.

Address of the bookmark: http://4dgenome.research.chop.edu/

Enrichr: a comprehensive gene set enrichment analysis

Jit — Thu, 27 Apr 2017 05:42:09 -0500

Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180,184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr including the ability to submit fuzzy sets, upload BED files, an improved application programming interface, and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr.
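
To make the idea of enrichment analysis concrete, here is a minimal sketch of a generic over-representation test based on the hypergeometric distribution. It is not Enrichr's API and not necessarily the score Enrichr reports; the function name hypergeom_pvalue and all numbers in the example call are assumptions for illustration.

```python
from math import comb

def hypergeom_pvalue(overlap, query_size, set_size, background_size):
    """P(X >= overlap) when query_size genes are drawn at random from a
    background in which set_size genes belong to the annotated gene set."""
    upper = min(query_size, set_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 50 query genes, 5 of them in a 100-gene set,
# against a background of 20,000 genes.
p = hypergeom_pvalue(overlap=5, query_size=50, set_size=100, background_size=20000)
print(p)
```

A small p-value suggests that the query list shares more genes with the annotated set than random sampling would explain. In practice one value is computed per gene set and then corrected for multiple testing.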

https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw377

Address of the bookmark: http://amp.pharm.mssm.edu/Enrichr/

DECIPHER

Anjana — Fri, 30 Sep 2016 09:33:12 -0500

DECIPHER is a software toolset that can be used to maintain, analyze, and decipher large amounts of DNA sequence data. To install DECIPHER, see the Downloads page.

To begin using DECIPHER read the "Getting Started DECIPHERing" tutorial. Refer to the PDF documents below for instructions on how to use DECIPHER for various tasks.

Address of the bookmark: http://decipher.cee.wisc.edu/Documentation.html

RNA-Seq Analysis: A Guide for Bioinformaticians

LEGE — Sat, 07 Dec 2024 22:22:24 -0600

RNA sequencing (RNA-Seq) has revolutionized transcriptomics, offering unprecedented insights into gene expression, splicing, and transcript diversity. For bioinformaticians, RNA-Seq analysis is a gateway to exploring the complexity of RNA biology and its implications in health and disease. This blog post provides an overview of RNA-Seq analysis, key computational steps, and tools for bioinformaticians eager to delve into this powerful technique.

What is RNA-Seq?

RNA-Seq is a next-generation sequencing (NGS) technology used to study the transcriptome—the complete set of RNA molecules in a cell. It quantifies gene expression, detects novel transcripts, and captures alternative splicing events with high sensitivity and resolution.

Workflow for RNA-Seq Analysis

RNA-Seq analysis involves several stages, each requiring computational tools and expertise.

1. Experimental Design and Data Acquisition

Before diving into analysis, bioinformaticians should consider:

Biological Replicates: Ensure statistical power to detect meaningful differences.
Sequencing Depth: Align sequencing depth to study objectives (e.g., higher depth for low-abundance transcripts).
Paired-End vs. Single-End: Paired-end sequencing provides more detailed information on transcript structure.

Once sequencing is complete, raw data is provided in FASTQ format, containing sequence reads and quality scores.
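
For orientation, a FASTQ record spans four lines: a header starting with @, the sequence, a + separator, and one quality character per base. The sketch below reads such records with the standard library only. It assumes the Phred+33 quality encoding and unwrapped four-line records; the function name read_fastq and the example read are invented for illustration.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:                 # end of file (or an empty line) stops the loop
            break
        seq = handle.readline().rstrip()
        handle.readline()              # skip the '+' separator line
        qual = handle.readline().rstrip()
        # Assumed Phred+33 encoding: quality score = ASCII code - 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]

example = io.StringIO("@read1\nACGT\n+\nII5#\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, quals)         # read1 ACGT [40, 40, 20, 2]
```

Real files are usually gzip-compressed and very large, so established parsers are preferable for production work.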

2. Quality Control and Preprocessing

Quality control (QC) ensures data integrity. Tools such as FastQC evaluate metrics like base quality, GC content, and adapter contamination.

Preprocessing Steps:

Trimming: Tools like Trimmomatic or Cutadapt remove low-quality bases and adapter sequences.
Filtering: Discard reads below a certain quality threshold or length.
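
The trimming and filtering steps above can be sketched as follows. This is a simplified illustration, not the algorithm used by Trimmomatic or Cutadapt; the function names and the thresholds (min_quality, min_length, min_mean_quality) are assumptions for this example.

```python
def trim_3prime(seq, quals, min_quality=20):
    """Remove bases from the 3' end until one meets the quality threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]

def passes_filter(seq, quals, min_length=3, min_mean_quality=20):
    """Keep a read only if it is long enough and its mean quality is acceptable."""
    if not seq or len(seq) < min_length:
        return False
    return sum(quals) / len(quals) >= min_mean_quality

seq, quals = trim_3prime("ACGTAC", [38, 37, 30, 25, 10, 2])
print(seq, quals)                  # ACGT [38, 37, 30, 25]
print(passes_filter(seq, quals))   # True
```

Adapter removal is left out here because it requires matching reads against known adapter sequences.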

3. Read Alignment

Reads are mapped to a reference genome or transcriptome to determine their origin. Alignment tools include:

HISAT2: Handles large genomes efficiently and supports spliced alignments.
STAR: High-speed aligner optimized for RNA-Seq.
Bowtie2: Suitable for short-read alignment.

Output: A SAM/BAM file containing aligned reads.

4. Transcript Assembly and Quantification

This step involves identifying transcripts and quantifying their expression levels. Tools used include:

StringTie: Assembles and quantifies transcripts from aligned reads.
Salmon/Kallisto: Perform pseudo-alignment for rapid and accurate quantification.

Expression levels are typically measured as TPM (transcripts per million) or FPKM (fragments per kilobase of transcript per million mapped reads).
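
Both units follow directly from read counts and transcript lengths, as the sketch below shows. The function names and the toy numbers are made up for illustration, and effective-length corrections used by real quantifiers are ignored.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]

counts = [100, 400, 500]        # fragments assigned to three transcripts (toy data)
lengths = [1000, 2000, 500]     # transcript lengths in base pairs (toy data)

print(fpkm(counts, lengths))
print(tpm(counts, lengths))
print(sum(tpm(counts, lengths)))   # approximately 1,000,000 by construction
```

Because TPM values always sum to one million within a sample, they are somewhat easier to compare across samples than FPKM values.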

5. Differential Expression Analysis

To identify genes with altered expression between conditions, bioinformaticians use tools such as:

DESeq2: Accounts for data normalization and variability.
edgeR: Handles overdispersed count data efficiently.
Limma-voom: Combines linear modeling with RNA-Seq count data.

The output includes a list of differentially expressed genes (DEGs) with statistical significance and fold-change values.
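
The two quantities reported for each gene can be illustrated with a short sketch: a log2 fold change between condition means and a Benjamini-Hochberg adjustment of the p-values. This shows only the bookkeeping, not the count-based statistical models that DESeq2, edgeR, or limma-voom fit; the function names and the pseudocount are assumptions for this example.

```python
import math

def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids division by zero."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping the values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(log2_fold_change(7, 1))                          # 2.0
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Genes are then typically called differentially expressed when the adjusted p-value falls below a chosen cutoff and the absolute log2 fold change exceeds a chosen threshold.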

6. Functional Annotation and Pathway Analysis

Understanding the biological significance of DEGs involves:

Gene Ontology (GO) Analysis: Tools like DAVID or clusterProfiler categorize genes based on their biological functions.
Pathway Enrichment Analysis: Identifies pathways enriched in DEGs using pathway databases such as KEGG or Reactome, or methods like GSEA.

7. Visualization

Visualizing results enhances interpretability. Common visualizations include:

Heatmaps: Show expression patterns across samples (e.g., pheatmap).
Volcano Plots: Highlight significant DEGs (e.g., ggplot2).
PCA/UMAP: Assess sample clustering and variability (e.g., Seurat).

Challenges in RNA-Seq Analysis

Batch Effects: Technical variability can confound biological signals. Combat this with normalization techniques or batch-correction tools like ComBat.
Low-Quality Samples: Poor-quality RNA impacts downstream analyses.
Computational Complexity: RNA-Seq generates massive datasets, requiring robust computing resources and optimized pipelines.

Key Tools and Resources

Bioconductor: A treasure trove of R packages for RNA-Seq analysis.
Galaxy: A web-based platform for running RNA-Seq workflows.
Nextflow/Snakemake: Workflow management tools to streamline analyses.

Applications of RNA-Seq

RNA-Seq is used in diverse research areas, including:

Cancer Transcriptomics: Identifying tumor-specific expression profiles.
Developmental Biology: Studying dynamic transcriptome changes.
Drug Discovery: Screening genes modulated by therapeutic compounds.

Conclusion

RNA-Seq analysis is a cornerstone of modern transcriptomics, offering bioinformaticians a versatile toolkit for unraveling gene expression and regulation. Mastering RNA-Seq workflows and tools empowers researchers to transform raw sequencing data into biological discoveries.

Whether you’re investigating disease mechanisms, exploring cellular pathways, or developing new therapeutics, RNA-Seq is a powerful ally in your bioinformatics arsenal.

GIGGLE: a search engine for large-scale integrated genome analysis

Jit — Wed, 10 Jan 2018 03:10:45 -0600

GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.
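
The underlying question, which interval files share the most loci with a query, can be illustrated with a deliberately naive sketch. It does not reflect GIGGLE's index, its significance statistics, or its command-line interface; the function names, file names, and coordinates are invented for this example.

```python
def count_overlaps(query, database):
    """Count query intervals that overlap at least one database interval.
    Intervals are (chrom, start, end) tuples with half-open coordinates."""
    hits = 0
    for q_chrom, q_start, q_end in query:
        if any(q_chrom == c and q_start < e and s < q_end for c, s, e in database):
            hits += 1
    return hits

def rank_files(query, files):
    """Rank named interval sets by how many query intervals they overlap."""
    scores = {name: count_overlaps(query, ivs) for name, ivs in files.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

query = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
files = {
    "marks_a.bed": [("chr1", 150, 160), ("chr2", 70, 90)],   # hypothetical file
    "marks_b.bed": [("chr1", 900, 950)],                     # hypothetical file
}
print(rank_files(query, files))   # [('marks_a.bed', 2), ('marks_b.bed', 0)]
```

This brute-force comparison scales with the product of query and database sizes, which is the kind of cost an indexed search engine is designed to avoid.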

https://www.nature.com/articles/nmeth.4556

Address of the bookmark: https://github.com/ryanlayer/giggle

Bio7: an integrated development environment for ecological modeling, scientific image analysis and statistical analysis

Nidhi Rajput — Fri, 07 Feb 2020 23:32:24 -0600

The application Bio7 is an integrated development environment for ecological modeling, scientific image analysis and statistical analysis. The application itself is based on an RCP-Eclipse-Environment (Rich-Client-Platform), which offers great flexibility in configuration and extensibility because of its plug-in structure and the possibility of customization.

https://bio7.org/about/

Address of the bookmark: https://bio7.org/home-2/

Dahak: benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.

BioStar — Thu, 09 Apr 2020 04:56:09 -0500

Dahak is a software suite that integrates state-of-the-art open source tools for metagenomic analyses. Tools in the dahak software suite will perform various steps in metagenomic analysis workflows including data pre-processing, metagenome assembly, taxonomic and functional classification, genome binning, and gene assignment. We aim to deliver the analytical framework as a robust and reliable containerized workflow system, which will be free from dependency, installation, and execution problems typically associated with other open-source bioinformatics solutions. This will maximize the transparency, data provenance (i.e., the process of tracing the origins of data and its movement through the workflow), and reproducibility.

More at https://dahak-metagenomics.github.io/dahak/

Address of the bookmark: https://github.com/dahak-metagenomics/dahak