BOL: Related items

NCBI Magic-BLAST

Jit — Tue, 14 Aug 2018 18:11:11 -0500

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score that takes both reads of a pair into account simultaneously and, in the case of RNA-seq, locates the candidate introns and adds up the scores of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Magic-BLAST incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.

Address of the bookmark: https://ncbi.github.io/magicblast/

LncPipe: A Nextflow-based pipeline for comprehensive analyses of long non-coding RNAs from RNA-seq datasets

LEGE — Fri, 17 Sep 2021 01:57:02 -0500

The pipeline was developed on the popular workflow framework Nextflow and is composed of four core procedures: read alignment, assembly, identification, and quantification. It offers several distinctive features, such as a well-designed lncRNA annotation strategy, optimized computational efficiency, diversified classification, and an interactive analysis report. LncPipe also gives users additional control: interrupting the pipeline, resetting parameters from the command line, modifying the main script directly, and resuming analysis from a previous checkpoint.

Ref https://www.lncrnablog.com/lncpipe-a-nextflow-based-pipeline-for-identification-and-analysis-of-long-non-coding-rnas-from-rna-seq-data/

Address of the bookmark: https://github.com/likelet/LncPipe

deepTools

Martin Jones — Sat, 08 Nov 2014 15:02:08 -0600

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. To do so, deepTools contains useful modules to process the mapped reads data to create coverage files in standard bedGraph and bigWig file formats. By doing so, deepTools allows the creation of normalized coverage files or the comparison between two files (for example, treatment and control). Finally, using such normalized and standardized files, multiple visualizations can be created to identify enrichments with functional annotations of the genome.

Publication: http://nar.oxfordjournals.org/content/early/2014/05/05/nar.gku365.full

Source Code and Wiki: https://github.com/fidelram/deepTools/wiki

Galaxy Tool Shed repository: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools

and example Galaxy workflows: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools_workflows

StringTie Transcript assembly and quantification for RNA-Seq

Jit — Tue, 09 Jun 2020 05:21:11 -0500

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only alignments of short reads that can also be used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. In order to identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software like Ballgown, Cuffdiff or other programs (DESeq2, edgeR, etc.).

Address of the bookmark: https://ccb.jhu.edu/software/stringtie/

Understanding RNA-Seq Normalization Methods: TPM vs. FPKM vs. CPM

Neel — Wed, 11 Dec 2024 00:59:15 -0600

RNA sequencing (RNA-Seq) is a powerful technology used to study transcriptomes, providing insights into gene expression levels. However, raw RNA-Seq data requires normalization to account for sequencing depth and gene length, enabling accurate comparisons between genes and samples. Among the most widely used normalization methods are TPM (Transcripts Per Million), FPKM (Fragments Per Kilobase Million), and CPM (Counts Per Million). Each method has its unique principles and applications, which we’ll explore in this blog.

Why Normalize RNA-Seq Data?

Normalization is a crucial step in RNA-Seq analysis for the following reasons:

Sequencing depth: Different RNA-Seq experiments produce varying numbers of reads, making direct comparisons between samples misleading.
Gene length: Longer genes inherently generate more reads, irrespective of their actual expression level.
Bias reduction: Normalization mitigates technical biases, enabling meaningful biological interpretation.

TPM (Transcripts Per Million)

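TPM measures the proportion of reads mapped to a transcript, normalized by transcript length and sequencing depth. It is calculated as:

$$\mathrm{TPM}_i = \frac{r_i / l_i}{\sum_j r_j / l_j} \times 10^6$$

where $r_i$ is the number of reads mapped to transcript $i$ and $l_i$ is the length of that transcript in kilobases.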

Key Features:

Proportionality: TPM values sum to 1,000,000 across all transcripts in a sample, making it easier to compare between samples.
Intuitive interpretation: TPM values directly represent the abundance of transcripts in a sample.
Preferred for comparisons: TPM facilitates between-sample comparisons better than FPKM.
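
The TPM calculation can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular quantification tool; the function name `tpm` and the toy counts and lengths are assumptions made up for the example.

```python
def tpm(counts, lengths_bp):
    """Return TPM values for parallel lists of read counts and transcript lengths (bp)."""
    # Step 1: reads per kilobase (RPK) corrects each count for transcript length.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    # Step 2: rescale so that the values in this sample sum to one million.
    total_rpk = sum(rpk)
    return [r / total_rpk * 1e6 for r in rpk]


# Hypothetical toy sample: read counts grow in step with transcript length,
# so all three transcripts end up with the same TPM.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 3000]
values = tpm(counts, lengths_bp)
print(values)
print(sum(values))
```

Because the length correction happens before the per-million scaling, the total is one million for every sample regardless of how the reads are distributed.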

FPKM (Fragments Per Kilobase Million)

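FPKM normalizes read counts by transcript length and sequencing depth, but without enforcing proportionality like TPM. It is defined as:

$$\mathrm{FPKM}_i = \frac{f_i}{l_i \times (N / 10^6)}$$

where $f_i$ is the number of fragments mapped to transcript $i$, $l_i$ is the length of that transcript in kilobases, and $N$ is the total number of mapped fragments in the sample.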

Key Features:

Historical significance: FPKM was one of the first normalization methods used for RNA-Seq.
Single-end vs. paired-end: RPKM (Reads Per Kilobase Million) counts individual reads and applies to single-end sequencing; FPKM is its paired-end counterpart, counting each fragment (read pair) once.
Limited utility: FPKM values are not as robust as TPM for cross-sample comparisons due to lack of proportionality.
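
The lack of proportionality can be seen in a small Python sketch. The function names `fpkm` and `fpkm_to_tpm` and the two toy samples are assumptions for illustration only. Both samples have the same total number of fragments, yet their FPKM totals differ; rescaling FPKM so that it sums to one million yields TPM.

```python
def fpkm(fragments, lengths_bp):
    """Return FPKM values for parallel lists of fragment counts and lengths (bp)."""
    total_fragments = sum(fragments)
    # 1e9 = 1e3 (bp -> kb) * 1e6 (fragments -> millions of fragments).
    return [f * 1e9 / (length * total_fragments)
            for f, length in zip(fragments, lengths_bp)]


def fpkm_to_tpm(fpkm_values):
    """Rescale FPKM values so that they sum to one million (that is, TPM)."""
    total = sum(fpkm_values)
    return [v / total * 1e6 for v in fpkm_values]


# Two hypothetical samples with identical depth (1000 fragments each)
# but a different distribution of fragments over the same transcripts.
lengths_bp = [1000, 2000, 7000]
fpkm_a = fpkm([100, 200, 700], lengths_bp)
fpkm_b = fpkm([700, 200, 100], lengths_bp)

print(sum(fpkm_a), sum(fpkm_b))  # the two totals differ
print(sum(fpkm_to_tpm(fpkm_a)), sum(fpkm_to_tpm(fpkm_b)))
```

Since the FPKM total depends on the composition of each sample, the same FPKM value can correspond to different relative abundances in different samples.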

CPM (Counts Per Million)

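CPM normalizes raw read counts by sequencing depth, without considering gene length. It is expressed as:

$$\mathrm{CPM}_i = \frac{r_i}{N} \times 10^6$$

where $r_i$ is the number of reads mapped to gene $i$ and $N$ is the total number of mapped reads in the sample.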

Key Features:

Simplicity: CPM is straightforward and computationally less intensive.
Application: Suitable for non-length-dependent analyses, such as comparing total expression levels or differential expression analysis.
Gene length agnostic: CPM does not correct for gene length, so it is poorly suited to comparing expression between genes of different lengths within a sample.
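
A minimal Python sketch of the CPM calculation follows. The function name `cpm` and the toy counts are assumptions for illustration. The two hypothetical samples have the same relative expression but were sequenced at a tenfold different depth, and CPM removes that difference.

```python
def cpm(counts):
    """Return counts per million for a list of raw read counts from one sample."""
    total_reads = sum(counts)
    # Only sequencing depth is corrected; gene length is not used at all.
    return [c * 1e6 / total_reads for c in counts]


# Hypothetical samples: same proportions, tenfold difference in depth.
shallow = [10, 30, 60]
deep = [100, 300, 600]

cpm_shallow = cpm(shallow)
cpm_deep = cpm(deep)
print(cpm_shallow)
print(cpm_deep)
```

Note that a long gene and a short gene with the same CPM are not necessarily expressed at the same level, because the longer gene yields more reads per transcript.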

When to Use Each Method

TPM: Best for comparing expression levels between samples, especially when transcript length and sequencing depth vary.
FPKM: Useful for historical consistency but generally replaced by TPM.
CPM: Ideal for differential expression analysis when gene length normalization is unnecessary.

Conclusion

Choosing the right normalization method depends on the specific objectives of your RNA-Seq analysis. TPM’s proportionality and robustness make it the preferred choice for most applications, while CPM serves well for differential expression studies. Although FPKM paved the way for RNA-Seq normalization, it has largely been supplanted by TPM in modern workflows. Understanding these methods and their nuances ensures accurate and meaningful interpretations of RNA-Seq data.

References:

Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics.
Trapnell, C., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
Law, C. W., et al. (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology.

dupRadar package

Jit — Sun, 04 Feb 2018 14:28:57 -0600

The dupRadar package gives insight into the duplication problem by graphically relating each gene's expression level to its duplication rate. Failed experiments can thus be identified at a glance.

Address of the bookmark: https://bioconductor.org/packages/3.7/bioc/vignettes/dupRadar/inst/doc/dupRadar.html

Will MinION Nanopore sequencing increase the number of Next Generation Sequencing projects?

Strand — Tue, 04 Aug 2015 05:14:07 -0500


Webinar: Unravelling complex mutational events in clinical cases using the power of NGS data analysis by Dr Satish Sankaran on 31 Jan 2018

Strand — Tue, 26 Dec 2017 02:00:26 -0600

Live webinar on Unravelling complex mutational events in clinical cases using the power of Next generation sequencing data analysis by Dr Satish Sankaran on 31 Jan 2018 at 9am CET and 8am PST

Speaker: Dr. Satish Sankaran, Vice President and Lab Director - Clinical Operations & Clinical Lab, Strand Life Sciences Pvt Ltd

Abstract: Next Generation Sequencing has come a long way in aiding genetic disease diagnosis by bringing down both the time and cost of testing. Testing involves massively parallel sequencing of anywhere from one to hundreds of genes in a single assay. With a large amount of sequence data being generated from such assays, it is critical that the data is analyzed using standard analysis tools to detect a wide range of variants. Strand Life Sciences has tested more than 3000 clinical samples using multi-gene panels for diagnosis of rare disease conditions. NGS data analysis is done using the Strand NGS software, and variant prioritization and reporting using StrandOMICS.

While most analysis software can easily detect single nucleotide variants, the complex ones involving insertions and deletions are usually missed. With multiple iterations the Strand NGS software is trained to effectively detect structural and copy number changes from a single NGS data set. This is critical in certain disease conditions like Retinoblastoma and Duchenne’s Muscular Dystrophy where there are clinically relevant deletions reported.

In this presentation, we present four different case studies in which we were able to detect mutations in unusual and difficult regions of the genome from the NGS data. These results were further confirmed using orthogonal methods.

Session 1: 31 Jan 2018; 9:00 AM CET
Session 2: 31 Jan 2018; 8:00 AM PST

RNA-Seq Data Pathway and Gene-set Analysis Workflows

Jit — Fri, 25 Oct 2013 08:00:48 -0500

This tutorial describes the GAGE (Luo et al., 2009) / Pathview (Luo and Brouwer, 2013) workflows for RNA-Seq data pathway analysis and gene-set analysis. The gage package (2.12.0) now includes a new tutorial, “RNA-Seq Data Pathway and Gene-set Analysis Workflows”.

We first cover a full workflow, from preparation, read counting, data preprocessing, and gene set testing to pathway visualization, in about 40 lines of code. The same workflow can also be used for GO analysis or other types of gene set analysis. We also describe joint workflows, i.e. doing gene-level analysis with one of the major RNA-Seq analysis tools (DESeq/DESeq2, edgeR, limma, or Cufflinks) and feeding the results into GAGE/Pathview for pathway analysis or visualization. All these workflows are implemented in R/Bioconductor.

The workflows cover the most common situations and issues in RNA-Seq data pathway analysis. Issues like data quality assessment are relevant for data analysis in general, yet beyond the scope of this tutorial. Although we focus on RNA-Seq data here, the pathway analysis workflow remains similar for microarray data; in particular, steps 3-4 would be the same. Please check the gage and pathview vignettes for details.

Note: You need to update to the current release versions of R (3.0.2) / Bioconductor (2.13) to use all the features.

Reference:

Please check it out:
http://bioconductor.org/packages/release/bioc/html/gage.html
http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf

Scallop: reference-based transcriptome assembler for RNA-seq

Rahul Nayak — Tue, 08 May 2018 04:23:27 -0500

Scallop is an accurate reference-based transcript assembler. It is notable for its high accuracy in assembling multi-exon transcripts as well as lowly expressed transcripts. Scallop achieves this improvement through a novel algorithm that provably preserves all phasing paths from reads and paired-end reads, while also achieving both transcript parsimony and coverage deviation minimization.

The Scallop paper has been published in Nature Biotechnology. The datasets and scripts used in this paper to compare the performance of Scallop and other assemblers are available at scalloptest.

Please also check out the podcast about Scallop (thanks to Roman Cheplyaka for the interview). It is available at both the bioinformatics chat and iTunes.


Address of the bookmark: https://github.com/Kingsford-Group/scallop