BOL: Related items

MUMmer4: A fast and versatile genome alignment system

Jit — Sat, 03 Feb 2018 04:59:17 -0600

MUMmer4 is a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes.
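To make the suffix array idea concrete, here is a minimal sketch in Python. It is not MUMmer4's implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is only suitable for toy inputs. It shows why a sorted array of suffix start positions is enough to locate every exact match of a pattern by binary search.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically."""
    # Naive construction that compares whole suffixes; fine for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, suffix_array: list[int], pattern: str) -> list[int]:
    """Binary-search the suffix array for every position where `pattern` occurs."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [start, lo) begins with the pattern.
    return sorted(suffix_array[start:lo])


reference = "ACGTACGA"
sa = build_suffix_array(reference)
hits = find_occurrences(reference, sa, "ACG")
```

Each entry of the array is an integer position in the input, so the width of that integer bounds how long an input the index can address.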

Address of the bookmark: https://mummer4.github.io/

Arvados

Martin Jones — Sat, 20 Sep 2014 16:54:21 -0500

Arvados is a free and open source bioinformatics platform for genomic and biomedical data. Users can store, organize, compute on, and share data for free.

Address of the bookmark: https://arvados.org/

QuIN’s web server

Jit — Mon, 27 Jun 2016 10:44:16 -0500

Recent studies of the human genome have indicated that regulatory elements (e.g. promoters and enhancers) at distal genomic locations can interact with each other via chromatin folding and affect gene expression levels. Genomic technologies for mapping interactions between DNA regions, e.g., ChIA-PET and HiC, can generate genome-wide maps of interactions between regulatory elements. These interaction datasets are important resources to infer distal gene targets of non-coding regulatory elements and to facilitate prioritization of critical loci for important cellular functions. With the increasing diversity and complexity of genomic information and public ontologies, making sense of these datasets demands integrative and easy-to-use software tools. Moreover, network representation of chromatin interaction maps enables effective data visualization, integration, and mining. Currently, there is no software that can take full advantage of network theory approaches for the analysis of chromatin interaction datasets. To fill this gap, we developed a web-based application, QuIN, which enables: 1) building and visualizing chromatin interaction networks, 2) annotating networks with user-provided private and publicly available functional genomics and interaction datasets, 3) querying network components based on gene name or chromosome location, and 4) utilizing network based measures to identify and prioritize critical regulatory targets and their direct and indirect interactions.
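As a rough illustration of the network view described above, the following Python sketch builds an undirected interaction network from anchor pairs and ranks nodes by degree centrality. The source does not say which network measures QuIN uses, so degree centrality is an assumption here, and the node labels and function names are hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (anchor_a, anchor_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node interacts with directly."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def rank_nodes(adjacency):
    """Order nodes from most to least connected, breaking ties by name."""
    centrality = degree_centrality(adjacency)
    return sorted(centrality, key=lambda node: (-centrality[node], node))


# Hypothetical interactions between regulatory elements.
interactions = [
    ("enhancer_1", "promoter_A"),
    ("enhancer_2", "promoter_A"),
    ("enhancer_2", "promoter_B"),
    ("promoter_A", "promoter_B"),
]
network = build_network(interactions)
ranking = rank_nodes(network)
```

In this toy network `promoter_A` interacts with every other node, so it ranks first; a real analysis would likely combine several measures and annotation layers.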

AVAILABILITY: QuIN’s web server is available at http://quin.jax.org. QuIN is developed in Java and JavaScript, utilizing an Apache Tomcat web server and a MySQL database. The source code is available under the GPLv3 license on GitHub: https://github.com/UcarLab/QuIN/.

Address of the bookmark: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004809

Randomness and Probability

Jit — Tue, 08 Nov 2016 07:17:32 -0600

Randomness and Probability

Randomness and probability are two different concepts: probability is a measure (in the sense of measure theory) that quantifies randomness, and randomness is the object being measured. For example, probability is a mapping from random events to real numbers between 0 and 1. Similar examples: entropy measures uncertainty, and the product of length and width measures the area of a rectangle.
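The distinction can be sketched in a few lines of Python. This is an illustration under simple assumptions (a finite sample space with equally likely outcomes, and Shannon entropy in bits); the function names are hypothetical.

```python
import math
from fractions import Fraction


def probability(event, sample_space):
    """Uniform probability measure: maps an event (a set of outcomes) to [0, 1]."""
    return Fraction(len(event & sample_space), len(sample_space))


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution; measures its uncertainty."""
    if any(p < 0 or p > 1 for p in probabilities):
        raise ValueError("each probability must lie in [0, 1]")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


die = {1, 2, 3, 4, 5, 6}
p_even = probability({2, 4, 6}, die)       # the event "roll is even"
fair_coin_bits = shannon_entropy([0.5, 0.5])
biased_coin_bits = shannon_entropy([0.9, 0.1])
```

The die roll is the random object; `probability` assigns a number to events about it, and `shannon_entropy` assigns a number to the uncertainty of a whole distribution.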

For more examples, please see “A mathematical theory of ability measure” by N. Kong et al.

Biostats Online Course

Abhimanyu Singh — Thu, 10 Nov 2016 04:22:51 -0600

One of our primary focuses will be to develop an understanding of the various ways in which we can assign a probability to some chance event. We'll also learn the fundamental properties of probability, investigate how probability behaves, and learn how to calculate the probability of a new chance event.

This book is handy for understanding basic concepts.

Address of the bookmark: https://onlinecourses.science.psu.edu/stat414/node/287

Bpipe - a tool for running and managing bioinformatics pipelines

Radha Agarkar — Sat, 21 May 2016 22:42:16 -0500

Bpipe provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'.

January 20th, 2016 - New! Bpipe 0.9.9 released!

Bpipe has been published in Bioinformatics! If you use Bpipe, please cite:

Sadedin S, Pope B & Oshlack A, Bpipe: A Tool for Running and Managing Bioinformatics Pipelines, Bioinformatics

Address of the bookmark: http://docs.bpipe.org/

SAM flags

Poonam Mahapatra — Wed, 29 Jun 2016 15:38:15 -0500

Decoding SAM flags

This utility makes it easy to identify the properties of a read from its SAM flag value or, conversely, to find the SAM flag value for a given combination of properties.

To decode a given SAM flag value, enter the number in the field on the linked page. The encoded properties are then listed under Summary.
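The same decoding can be done locally, because a SAM flag is a bitwise combination of the twelve bits defined in the SAM specification. The sketch below is a minimal Python version, not the Picard utility's code; the function names `decode_flag` and `encode_flag` and the wording of the property labels are this example's own.

```python
# Bit values from the SAM specification, paired with short descriptions.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> list[str]:
    """List the properties whose bits are set in a SAM flag value."""
    if not 0 <= flag < 0x1000:
        raise ValueError("SAM flags use 12 bits (0 to 4095)")
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def encode_flag(properties) -> int:
    """Combine property descriptions back into a SAM flag value."""
    lookup = {name: bit for bit, name in SAM_FLAG_BITS}
    flag = 0
    for name in properties:
        flag |= lookup[name]
    return flag


properties_of_99 = decode_flag(99)
```

Flag 99 is 1 + 2 + 32 + 64, so it decodes to a paired read, mapped in a proper pair, with the mate on the reverse strand, that is the first read in its pair.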

Address of the bookmark: https://broadinstitute.github.io/picard/explain-flags.html

Enrichr: a comprehensive gene set enrichment analysis

Jit — Thu, 27 Apr 2017 05:42:09 -0500

Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180,184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr, including the ability to submit fuzzy sets and upload BED files, an improved application programming interface, and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr.

https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw377

Address of the bookmark: http://amp.pharm.mssm.edu/Enrichr/

RNA-Seq Analysis: A Guide for Bioinformaticians

LEGE — Sat, 07 Dec 2024 22:22:24 -0600

RNA sequencing (RNA-Seq) has revolutionized transcriptomics, offering unprecedented insights into gene expression, splicing, and transcript diversity. For bioinformaticians, RNA-Seq analysis is a gateway to exploring the complexity of RNA biology and its implications in health and disease. This blog post provides an overview of RNA-Seq analysis, key computational steps, and tools for bioinformaticians eager to delve into this powerful technique.

What is RNA-Seq?

RNA-Seq is a next-generation sequencing (NGS) technology used to study the transcriptome—the complete set of RNA molecules in a cell. It quantifies gene expression, detects novel transcripts, and captures alternative splicing events with high sensitivity and resolution.

Workflow for RNA-Seq Analysis

RNA-Seq analysis involves several stages, each requiring computational tools and expertise.

1. Experimental Design and Data Acquisition

Before diving into analysis, bioinformaticians should consider:

Biological Replicates: Ensure statistical power to detect meaningful differences.
Sequencing Depth: Match sequencing depth to study objectives (e.g., higher depth for low-abundance transcripts).
Paired-End vs. Single-End: Paired-end sequencing provides more detailed information on transcript structure.

Once sequencing is complete, raw data is provided in FASTQ format, containing sequence reads and quality scores.

2. Quality Control and Preprocessing

Quality control (QC) ensures data integrity. Tools such as FastQC evaluate metrics like base quality, GC content, and adapter contamination.

Preprocessing Steps:

Trimming: Tools like Trimmomatic or Cutadapt remove low-quality bases and adapter sequences.
Filtering: Discard reads below a certain quality threshold or length.
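The filtering step can be sketched in plain Python. This is an illustration only, not a replacement for Trimmomatic or Cutadapt: it assumes four-line FASTQ records with Phred+33 quality encoding, the thresholds are hypothetical defaults, and averaging Phred scores is a simplification of what dedicated tools do.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Average Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20.0, min_length=36):
    """Keep reads that are long enough and have sufficient mean quality."""
    return [
        (read_id, sequence, quality)
        for read_id, sequence, quality in records
        if len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality
    ]


# Three toy reads: one good, one too short, one with very low quality.
example = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGT
+
IIII
@read3
ACGTACGT
+
!!!!!!!!
"""
kept = filter_reads(parse_fastq(example), min_mean_quality=20.0, min_length=8)
```

Only `read1` survives here: `read2` is shorter than the length cutoff and `read3` has a mean Phred score of 0.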

3. Read Alignment

Reads are mapped to a reference genome or transcriptome to determine their origin. Alignment tools include:

HISAT2: Handles large genomes efficiently and supports spliced alignments.
STAR: High-speed aligner optimized for RNA-Seq.
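Bowtie2: Suitable for short-read alignment; it is not splice-aware, so for RNA-Seq it is typically run against a transcriptome rather than a genome.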

Output: A SAM/BAM file containing aligned reads.

4. Transcript Assembly and Quantification

This step involves identifying transcripts and quantifying their expression levels. Tools used include:

StringTie: Assembles and quantifies transcripts from aligned reads.
Salmon/Kallisto: Perform pseudo-alignment for rapid and accurate quantification.

Expression levels are typically measured as TPM (transcripts per million) or FPKM (fragments per kilobase of transcript per million mapped reads).
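The two units differ only in the order of normalization, which a short Python sketch makes explicit. The gene names, counts, and lengths below are made up for illustration, and the functions use plain gene lengths rather than the effective lengths that quantifiers such as Salmon estimate.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


def fpkm(counts, lengths_bp):
    """Fragments per kilobase per million mapped fragments: depth-normalize first."""
    per_million = sum(counts.values()) / 1e6
    return {
        gene: counts[gene] / per_million / (lengths_bp[gene] / 1000)
        for gene in counts
    }


# Hypothetical fragment counts and transcript lengths (in base pairs).
counts = {"geneA": 100, "geneB": 200, "geneC": 300}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}

tpm_values = tpm(counts, lengths_bp)
fpkm_values = fpkm(counts, lengths_bp)
```

TPM values always sum to one million within a sample, which makes proportions comparable across samples; FPKM values carry no such guarantee.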

5. Differential Expression Analysis

To identify genes with altered expression between conditions, bioinformaticians use tools such as:

DESeq2: Normalizes for library size and models count variability with a negative binomial distribution.
edgeR: Handles overdispersed count data efficiently.
Limma-voom: Combines linear modeling with RNA-Seq count data.

The output includes a list of differentially expressed genes (DEGs) with statistical significance and fold-change values.
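The fold-change part of that output can be illustrated with a small Python sketch. This is deliberately not what DESeq2, edgeR, or limma-voom do internally: it assumes the inputs are already normalized expression values, uses a hypothetical pseudocount of 1, and omits the statistical test entirely.

```python
import math
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of mean treated to mean control expression.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def call_direction(lfc, threshold=1.0):
    """Label a gene by the size and sign of its fold change (no significance test)."""
    if lfc >= threshold:
        return "up"
    if lfc <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical normalized expression values for one gene, three replicates each.
control = [7, 7, 7]
treated = [31, 31, 31]
lfc = log2_fold_change(control, treated)   # log2(32 / 8)
label = call_direction(lfc)
```

A log2 fold change of 2 corresponds to a fourfold increase; real tools pair such an estimate with a p-value that accounts for replicate variability.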

6. Functional Annotation and Pathway Analysis

Understanding the biological significance of DEGs involves:

Gene Ontology (GO) Analysis: Tools like DAVID or clusterProfiler categorize genes based on their biological functions.
Pathway Enrichment Analysis: Identifies pathways enriched in DEGs using pathway databases such as KEGG or Reactome, or methods such as GSEA.
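Over-representation analysis of this kind often comes down to a hypergeometric (one-sided Fisher) test. The Python sketch below shows the calculation using only the standard library; the function name and the example numbers are hypothetical, and individual tools may use different statistics or background definitions.

```python
from math import comb


def enrichment_pvalue(n_background, n_pathway, n_degs, n_overlap):
    """Probability of seeing at least `n_overlap` pathway genes among the DEGs.

    Models drawing `n_degs` genes without replacement from `n_background`
    genes, of which `n_pathway` belong to the pathway.
    """
    max_overlap = min(n_pathway, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 20,000 background genes, a 100-gene pathway,
# 500 DEGs, 12 of which fall in the pathway.
p_value = enrichment_pvalue(20000, 100, 500, 12)
```

With 500 of 20,000 genes called as DEGs, about 2.5 pathway genes would be expected by chance, so an overlap of 12 yields a small p-value. Testing many pathways this way calls for multiple testing correction.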

7. Visualization

Visualizing results enhances interpretability. Common visualizations include:

Heatmaps: Show expression patterns across samples (e.g., pheatmap).
Volcano Plots: Highlight significant DEGs (e.g., ggplot2).
PCA/UMAP: Assess sample clustering and variability (e.g., DESeq2's plotPCA for bulk data, Seurat for single-cell data).

Challenges in RNA-Seq Analysis

Batch Effects: Technical variability can confound biological signals. Combat this with normalization techniques or batch-correction tools like ComBat.
Low-Quality Samples: Poor-quality RNA impacts downstream analyses.
Computational Complexity: RNA-Seq generates massive datasets, requiring robust computing resources and optimized pipelines.

Key Tools and Resources

Bioconductor: A treasure trove of R packages for RNA-Seq analysis.
Galaxy: A web-based platform for running RNA-Seq workflows.
Nextflow/Snakemake: Workflow management tools to streamline analyses.

Applications of RNA-Seq

RNA-Seq is used in diverse research areas, including:

Cancer Transcriptomics: Identifying tumor-specific expression profiles.
Developmental Biology: Studying dynamic transcriptome changes.
Drug Discovery: Screening genes modulated by therapeutic compounds.

Conclusion

RNA-Seq analysis is a cornerstone of modern transcriptomics, offering bioinformaticians a versatile toolkit for unraveling gene expression and regulation. Mastering RNA-Seq workflows and tools empowers researchers to transform raw sequencing data into biological discoveries.

Whether you’re investigating disease mechanisms, exploring cellular pathways, or developing new therapeutics, RNA-Seq is a powerful ally in your bioinformatics arsenal.

GeneBreak: a tool to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach

Jit — Sat, 01 Oct 2016 15:15:29 -0500

Development of cancer is driven by somatic alterations, including numerical and structural chromosomal aberrations. Currently, several computational methods are available and are widely applied to detect numerical copy number aberrations (CNAs) of chromosomal segments in tumor genomes. However, there is a lack of computational methods that systematically detect structural chromosomal aberrations by virtue of the genomic location of CNA-associated chromosomal breaks and identify genes that appear non-randomly affected by chromosomal breakpoints across (large) series of tumor samples. ‘GeneBreak’ is developed to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach, which can be applied to DNA copy number data obtained by array-Comparative Genomic Hybridization (CGH) or by (low-pass) whole genome sequencing (WGS). First, ‘GeneBreak’ collects the genomic locations of chromosomal CNA-associated breaks that were previously pinpointed by the segmentation algorithm that was applied to obtain CNA profiles. Next, a tailored annotation approach for breakpoint-to-gene mapping is implemented. Finally, dedicated cohort-based statistics are incorporated with correction for covariates that influence the probability of being a breakpoint gene. In addition, multiple testing correction is integrated to reveal recurrent breakpoint events. This easy-to-use algorithm, ‘GeneBreak’, is implemented in R (www.cran.r-project.org) and is available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/GeneBreak.html).
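The abstract does not say which multiple testing correction GeneBreak applies. As an illustration of the general idea, the Python sketch below implements the Benjamini-Hochberg false discovery rate adjustment, which is a common choice and an assumption here; the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-gene p-values for recurrent breakpoint events.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

Each adjusted value is the smallest false discovery rate at which that gene would still be reported, so genes can be filtered with a single cutoff.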

Address of the bookmark: http://www.bioconductor.org/packages/release/bioc/html/GeneBreak.html