BOL: Related items

DISCO : multi threaded and multiprocess distributed memory overlap-layout-consensus (OLC) metagenome assembler

Jit — Mon, 10 Jul 2017 10:09:27 -0500

DISCO is a multi threaded and multiprocess distributed memory overlap-layout-consensus (OLC) metagenome assembler. Disco was developed as a scalable assembler to assemble large metagenomes from billions of Illumina sequencing reads of complex microbial communities. Disco was parallelized for computer clusters in a hybrid architecture that integrated shared-memory multi-threading, point-to-point message passing, and remote direct memory access. The assembly and scaffolding were performed using an iterative overlap graph approach.

Address of the bookmark: http://disco.omicsbio.org/

FastProNGS: fast preprocessing of next-generation sequencing reads

Rahul Nayak — Sat, 26 Dec 2020 08:35:21 -0600

FastProNGS to integrate the quality control process with automatic adapter removal. Parallel processing was implemented to speed up the process by allocating multiple threads. Compared with similar up-to-date preprocessing tools, FastProNGS is by far the fastest.

Address of the bookmark: https://github.com/Megagenomics/FastProNGS

RNA-Seq De novo Assembly Using Trinity

Surabhi Chaudhary — Wed, 23 Mar 2016 05:53:46 -0500

Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:

Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
Chrysalis clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptonal complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes.

More at https://github.com/trinityrnaseq/trinityrnaseq/wiki

......................................................................................................................................

Download Trinity here.

Build Trinity by typing 'make' in the base installation directory.

Assemble RNA-Seq data like so:

 Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G

Find assembled transcripts as: 'trinity_out_dir/Trinity.fasta'

Address of the bookmark: https://github.com/trinityrnaseq/trinityrnaseq/wiki

NCBI Magic-BLAST

Jit — Tue, 14 Aug 2018 18:11:11 -0500

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score, taking into account simultaneously the two reads of a pair, and in case of RNA-seq, locating the candidate introns and adding up the score of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Magic-BLAST incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.

Address of the bookmark: https://ncbi.github.io/magicblast/

Encode sequencing data freely available to download and use for academic means

Rahul Agarwal — Thu, 13 Mar 2014 18:18:08 -0500

In Encode, regulatory elements investigated via DNA hypersensitivity assays, assays of DNA methylation, and chromatin immunoprecipitation (ChIP) of proteins that interact with DNA, including modified histones and transcription factors, followed by sequencing (ChIP-Seq).

More information:

https://genome.ucsc.edu/ENCODE/pilot.html

Address of the bookmark: https://genome.ucsc.edu/ENCODE/

deepTools

Martin Jones — Sat, 08 Nov 2014 15:02:08 -0600

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. To do so, deepTools contains useful modules to process the mapped reads data to create coverage files in standard bedGraph and bigWig file formats. By doing so, deepTools allows the creation of normalized coverage files or the comparison between two files (for example, treatment and control). Finally, using such normalized and standardized files, multiple visualizations can be created to identify enrichments with functional annotations of the genome.

Publicaton: http://nar.oxfordjournals.org/content/early/2014/05/05/nar.gku365.full

Source Code and Wiki: https://github.com/fidelram/deepTools/wiki

Galaxy Tool Shed repository: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools

and example Galaxy workflows: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools_workflows

Live Webinar on RNA-Seq Data Analysis on 9 Nov 2016

Strand — Wed, 19 Oct 2016 05:25:27 -0500

Live Webinar on RNA-Seq Data Analysis

Abstract: Strand NGS supports an extensive workflow for the analysis and visualization of RNA-Seq data. The workflow includes Transcriptome / Genome alignment, Differential expression analysis with Statistical approach and Splicing events detection. Strand NGS also supports novel discovery like identification of novel genes, exons and Novel splice junctions, alongside it can also detect gene fusion events. Further downstream analysis such as GO and pathway analysis can be performed on the set of interesting genes. The product has an option to create pipelines for time consuming jobs which automates analysis and leaves more time for end data interpretation. This webinar will give an overview of the features in the RNA-Seq data analysis workflow in Strand NGS and also highlights on parameters within each feature that can be optimized depending on datasets and analysis needs.

Speaker: Mr. Sugandan Sivamani, Senior Application Scientist, Strand Life Sciences

Date: 9th Nov, Session 1 for SAPK/ APFO: 2:30 PM IST Date: 9th Nov, Session 2 for AFO/ EMEA: 9:00 AM PST

StringTie Transcript assembly and quantification for RNA-Seq

Jit — Tue, 09 Jun 2020 05:21:11 -0500

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only alignments of short reads that can also be used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. In order to identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.).

Address of the bookmark: https://ccb.jhu.edu/software/stringtie/

Understanding RNA-Seq Normalization Methods: TPM vs. FPKM vs. CPM

Neel — Wed, 11 Dec 2024 00:59:15 -0600

RNA sequencing (RNA-Seq) is a powerful technology used to study transcriptomes, providing insights into gene expression levels. However, raw RNA-Seq data requires normalization to account for sequencing depth and gene length, enabling accurate comparisons between genes and samples. Among the most widely used normalization methods are TPM (Transcripts Per Million), FPKM (Fragments Per Kilobase Million), and CPM (Counts Per Million). Each method has its unique principles and applications, which we’ll explore in this blog.

Why Normalize RNA-Seq Data?

Normalization is a crucial step in RNA-Seq analysis for the following reasons:

Sequencing depth: Different RNA-Seq experiments produce varying numbers of reads, making direct comparisons between samples misleading.
Gene length: Longer genes inherently generate more reads, irrespective of their actual expression level.
Bias reduction: Normalization mitigates technical biases, enabling meaningful biological interpretation.

TPM (Transcripts Per Million)

TPM measures the proportion of reads mapped to a transcript, normalized by transcript length and sequencing depth. It is calculated as:

Key Features:

Proportionality: TPM values sum to 1,000,000 across all transcripts in a sample, making it easier to compare between samples.
Intuitive interpretation: TPM values directly represent the abundance of transcripts in a sample.
Preferred for comparisons: TPM facilitates between-sample comparisons better than FPKM.

FPKM (Fragments Per Kilobase Million)

FPKM normalizes read counts by transcript length and sequencing depth, but without enforcing proportionality like TPM. It is defined as:

Key Features:

Historical significance: FPKM was one of the first normalization methods used for RNA-Seq.
Single-end vs. paired-end: In paired-end sequencing, FPKM becomes RPKM (Reads Per Kilobase Million).
Limited utility: FPKM values are not as robust as TPM for cross-sample comparisons due to lack of proportionality.

CPM (Counts Per Million)

CPM normalizes raw read counts by sequencing depth, without considering gene length. It is expressed as:

Key Features:

Simplicity: CPM is straightforward and computationally less intensive.
Application: Suitable for non-length-dependent analyses, such as comparing total expression levels or differential expression analysis.
Gene length agnostic: CPM does not correct for gene length, making it less ideal for measuring expression levels.

When to Use Each Method

TPM: Best for comparing expression levels between samples, especially when transcript length and sequencing depth vary.
FPKM: Useful for historical consistency but generally replaced by TPM.
CPM: Ideal for differential expression analysis when gene length normalization is unnecessary.

Conclusion

Choosing the right normalization method depends on the specific objectives of your RNA-Seq analysis. TPM’s proportionality and robustness make it the preferred choice for most applications, while CPM serves well for differential expression studies. Although FPKM paved the way for RNA-Seq normalization, it has largely been supplanted by TPM in modern workflows. Understanding these methods and their nuances ensures accurate and meaningful interpretations of RNA-Seq data.

References:

Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics.
Trapnell, C., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
Law, C. W., et al. (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology.

Webinar on RNA-Seq Data Analysis on 28 Feb 2018

Strand — Thu, 22 Feb 2018 06:38:48 -0600

Strand NGS is a biologist friendly NGS analysis tool that allows biologists to analyze their data using a very intuitive workflow for the analysis and visualization of RNA-Seq data. This webinar will give an overview of the workflow which includes Transcriptome/ Genome alignment, Differential expression analysis, Splicing events and gene fusion detection. Strand NGS also supports novel discovery like identification of novel genes, exons and novel splice junctions.
We will highlight the use of Strand NGS features such as PCA, sample correlation, clustering, Venn diagrams, CVA, UMI support and elastic genome browser used in RNA-Seq workflow that supports large scale RNA-Seq data analysis too. The tool also supports biological contextualization on the set of interesting genes from the data by allowing downstream analysis such as GO and pathway analysis. The product has an option to create pipelines for time consuming jobs which automates analysis and leaves more time for end data interpretation. This webinar will give an overview of the features in the RNA-Seq data analysis workflow in Strand NGS.

Details:
Session 1: 28 Feb 2018, 9 AM CET
Session 2: 28 Feb 2018, 8 AM PST
Register here: http://www.strand-ngs.com/webinar_registration

About Speaker:

Dr. Suman Kapoor, Manager- Application Science at Strand Life Sciences, has over 11 years experience in molecular biology, next-generation sequencing based testing, clinical genomics, and personalized medicine for disease management and prenatal testing. Dr. Suman holds a Ph.D in Molecular and Cell Biology from Indian Institute of Science, Bangalore. Prior to joining Strand NGS team, Suman has worked extensively on protein synthesis in eubacteria and has experience working in CAP and NABL accredited lab validating and interpreting NGS based diagnostic tests.