BOL: Related items

Understanding RNA-Seq Normalization Methods: TPM vs. FPKM vs. CPM

Neel — Wed, 11 Dec 2024 00:59:15 -0600

RNA sequencing (RNA-Seq) is a powerful technology used to study transcriptomes, providing insights into gene expression levels. However, raw RNA-Seq data requires normalization to account for sequencing depth and gene length, enabling accurate comparisons between genes and samples. Among the most widely used normalization methods are TPM (Transcripts Per Million), FPKM (Fragments Per Kilobase Million), and CPM (Counts Per Million). Each method has its unique principles and applications, which we’ll explore in this blog.

Why Normalize RNA-Seq Data?

Normalization is a crucial step in RNA-Seq analysis for the following reasons:

Sequencing depth: Different RNA-Seq experiments produce varying numbers of reads, making direct comparisons between samples misleading.
Gene length: Longer genes inherently generate more reads, irrespective of their actual expression level.
Bias reduction: Normalization mitigates technical biases, enabling meaningful biological interpretation.

TPM (Transcripts Per Million)

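TPM measures the relative abundance of a transcript: read counts are first divided by transcript length, and the resulting rates are then scaled so that they sum to one million within each sample. It is calculated as:

TPM_i = (c_i / l_i) / Σ_j (c_j / l_j) × 10^6

where c_i is the number of reads mapped to transcript i, l_i is its length in kilobases, and the sum runs over all transcripts j in the sample.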

Key Features:

Proportionality: TPM values sum to 1,000,000 across all transcripts in a sample, so each value is a transcript's share of that sample's total.
Intuitive interpretation: TPM values directly represent the relative abundance of transcripts in a sample.
Preferred for comparisons: Because every sample has the same total, TPM facilitates between-sample comparisons better than FPKM.
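
As a minimal sketch of the TPM calculation described above, the following Python function normalizes one sample. The function name `tpm`, the toy counts, and the transcript lengths are illustrative assumptions and do not come from any particular tool; real pipelines typically use effective rather than annotated transcript lengths.

```python
def tpm(counts, lengths_bp):
    """Return TPM values for one sample (hypothetical helper, not a library API).

    counts     -- reads mapped to each transcript
    lengths_bp -- transcript lengths in base pairs
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: reads per kilobase (length normalization).
    rpk = [c / (l / 1000) for c, l in zip(counts, lengths_bp)]
    total_rpk = sum(rpk)
    if total_rpk == 0:
        return [0.0] * len(rpk)
    # Step 2: scale so that the values sum to one million (depth normalization).
    return [r / total_rpk * 1e6 for r in rpk]


# Toy data (assumed for illustration): three transcripts in one sample.
tpm_values = tpm([10, 20, 30], [2000, 4000, 1000])
print(tpm_values)       # [125000.0, 125000.0, 750000.0]
print(sum(tpm_values))  # 1000000.0
```

The first two transcripts get the same TPM even though one has twice the reads, because it is also twice as long.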

FPKM (Fragments Per Kilobase Million)

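FPKM normalizes fragment counts by transcript length and sequencing depth, but unlike TPM its values do not sum to a fixed total per sample. It is defined as:

FPKM_i = c_i × 10^6 / (l_i × N)

where c_i is the number of fragments mapped to transcript i, l_i is its length in kilobases, and N is the total number of mapped fragments in the sample.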

Key Features:

Historical significance: FPKM was one of the first normalization methods used for RNA-Seq.
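Single-end vs. paired-end: For single-end sequencing, where every read is its own fragment, FPKM is equivalent to RPKM (Reads Per Kilobase Million). For paired-end sequencing, the two reads of a pair are counted as a single fragment.
Limited utility: FPKM values are less robust than TPM for cross-sample comparisons because their sum differs from sample to sample.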
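
The sketch below illustrates the FPKM calculation and why its per-sample totals are not fixed. The function name `fpkm` and the toy fragment counts are assumptions made for illustration; here the library size is simply taken as the sum of the listed counts, whereas in practice it would be the total number of mapped fragments.

```python
def fpkm(fragments, lengths_bp):
    """Return FPKM values for one sample (hypothetical helper, not a library API).

    fragments  -- fragments mapped to each transcript
    lengths_bp -- transcript lengths in base pairs
    """
    if len(fragments) != len(lengths_bp):
        raise ValueError("fragments and lengths_bp must have the same length")
    # Assumption: library size is the sum of the counts given here.
    total = sum(fragments)
    if total == 0:
        return [0.0] * len(fragments)
    # 1e9 = 1e3 (bp -> kb) * 1e6 (per million fragments).
    return [f * 1e9 / (l * total) for f, l in zip(fragments, lengths_bp)]


# Toy data (assumed): the same three transcripts in two samples.
lengths = [2000, 4000, 1000]
fpkm_a = fpkm([10, 20, 30], lengths)
fpkm_b = fpkm([10, 20, 10], lengths)

# Unlike TPM, the per-sample sums are not constant.
print(round(sum(fpkm_a), 2))  # 666666.67
print(round(sum(fpkm_b), 2))  # 500000.0
```

Because the two samples end up with different totals, the same FPKM value corresponds to a different share of each sample.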

CPM (Counts Per Million)

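CPM normalizes raw read counts by sequencing depth, without considering gene length. It is expressed as:

CPM_i = c_i / N × 10^6

where c_i is the number of reads mapped to gene i and N is the total number of mapped reads in the sample.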

Key Features:

Simplicity: CPM is straightforward and computationally less intensive.
Application: Suitable for analyses that do not depend on gene length, such as comparing the same gene across samples in differential expression analysis.
Gene length agnostic: CPM does not correct for gene length, so it is poorly suited to comparing expression levels between different genes within a sample.
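
A short Python sketch of the CPM calculation follows. The function name `cpm` and the toy counts are illustrative assumptions; the library size is again taken as the sum of the counts supplied.

```python
def cpm(counts):
    """Return counts per million for one sample (hypothetical helper)."""
    # Assumption: library size is the sum of the counts given here.
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total * 1e6 for c in counts]


# Toy data (assumed): sample B was sequenced twice as deeply as sample A,
# with identical relative expression.
cpm_a = cpm([10, 20, 30])
cpm_b = cpm([20, 40, 60])

# Depth differences cancel out, so the two samples give matching CPM values.
print([round(v, 2) for v in cpm_a])  # [166666.67, 333333.33, 500000.0]
print([round(v, 2) for v in cpm_b])  # [166666.67, 333333.33, 500000.0]
```

Gene length never enters the calculation, which is why CPM works for comparing one gene across samples but not for ranking different genes within a sample.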

When to Use Each Method

TPM: Best for comparing expression levels between samples, especially when transcript length and sequencing depth vary.
FPKM: Useful for historical consistency but generally replaced by TPM.
CPM: Suited to differential expression analysis, where the same gene is compared across samples and gene length normalization is therefore unnecessary.

Conclusion

Choosing the right normalization method depends on the specific objectives of your RNA-Seq analysis. TPM’s proportionality and robustness make it the preferred choice for most applications, while CPM serves well for differential expression studies. Although FPKM paved the way for RNA-Seq normalization, it has largely been supplanted by TPM in modern workflows. Understanding these methods and their nuances ensures accurate and meaningful interpretations of RNA-Seq data.

References:

Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics.
Trapnell, C., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
Law, C. W., et al. (2014). voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology.

2.4 Mb Genome Size for World's Biggest Virus

Jitendra Narayan — Thu, 08 Aug 2013 10:05:37 -0500

The genomes of the newly discovered Pandoraviruses are roughly twice the size of the genome of the previous record holder, Megavirus. Interestingly, only 6 percent of their genes resemble genes from other organisms, so it is assumed that they may have a different origin.

For details: http://www.sciencemag.org/content/341/6143/281

http://www.npr.org/blogs/health/2013/07/18/203298244/worlds-biggest-virus-may-have-ancient-roots

Ebola virus disease (EVD) or Ebola haemorrhagic fever

Rahul Nayak — Sun, 10 Aug 2014 13:08:13 -0500

Ebola virus disease (EVD), or Ebola haemorrhagic fever, is a severe and often deadly illness in humans caused by the Ebola virus. The disease has a high mortality rate, killing up to 90% of the people it infects.

The ongoing 2014 West Africa Ebola outbreak is considered the largest and longest Ebola outbreak ever recorded. Since March it has killed at least 932 people and infected more than 1,700 in Sierra Leone, Guinea, Nigeria and Liberia.

In response, on 8 August 2014 the World Health Organisation (WHO) declared the Ebola epidemic ravaging parts of West Africa an international health emergency.

Causes

EVD is caused by infection with a virus of the family Filoviridae, genus Ebolavirus. Of the five identified species of Ebolavirus, four cause disease in humans: Bundibugyo virus (BDBV), Ebola virus (EBOV), Sudan virus (SUDV) and Taï Forest virus (TAFV).

The fifth virus, Reston virus (RESTV), is not considered to be disease-causing in humans.

According to WHO, EVD first appeared in 1976 in two simultaneous outbreaks, in Nzara, Sudan, and in Yambuku, Democratic Republic of Congo. The latter was in a village situated near the Ebola River from which the disease takes its name.

How does it spread?

How Ebola first enters the human population is still unclear. However, it is believed that the first patient becomes infected through contact with an infected animal's body fluids.

Human-to-human transmission can occur through direct contact with blood, organs or other body fluids of infected people or exposure to objects such as needles and syringes that have been contaminated with infected secretions.

Men who have recovered from the disease can still transmit the virus through their semen, which remains infectious for up to 7 weeks.

Dead bodies remain infectious and can spread Ebola, so mourners who have direct contact with the body of a deceased person can also contract the disease.

Who is most at risk?

Those most at risk are health-care workers who do not wear appropriate protective clothing and family members who are in close contact with infected people or deceased patients.

Signs and symptoms:

Symptoms may occur between 2 and 21 days after contracting the infection. Common signs of Ebola include:

Fever

Headache

Muscle, abdominal and joint pain

Sore throat

Weakness

Diarrhea

Vomiting or coughing up blood

Chest pain

Difficulty in breathing and swallowing

Rash

Hiccups

Bleeding inside and outside the body

Prevention

Currently there is no vaccine available for humans. But the infection can be controlled through the use of recommended protective measures such as:

Avoiding contact with infected blood or secretions, including those of the dead.

Using standard precautions for all patients in the healthcare setting.

Sterilizing equipment, and wearing protective clothing including masks, gloves, gowns and goggles.

Washing your hands with soaps or detergents.

Disinfecting your surroundings.

Isolating people who have Ebola symptoms.

Culling of infected animals, with close supervision of burial or incineration of carcasses.

Still, the best way to avoid Ebola is not to travel to areas or countries where the virus is found.

Langya Virus Update !

Neel — Fri, 12 Aug 2022 05:31:10 -0500

https://www.ncbi.nlm.nih.gov/nuccore/OM101125,OM101126,OM101127,OM101128,OM101129,OM101130?

Zoonotic Henipavirus

https://pubmed.ncbi.nlm.nih.gov/35921459/

https://www.ncbi.nlm.nih.gov/nuccore/OM069646,,OM069567,OM069568,OM069569,OM069570,OM069571,OM069572,OM069573,OM069574,OM069575,OM069576,OM069577,OM069578,OM069579,OM069580,OM069581,OM069582,OM069583,OM069584,OM069585,OM069586,OM069587,OM069588,OM069589,OM069590,OM069591,OM069592,OM069593,OM069594,OM069595,OM069596,OM069597,OM069598,OM069599,OM069600,OM069601,OM069602,OM069603,OM069604,OM069605,OM069606,OM069607,OM069608,OM069609,OM069610,OM069611,OM069612,OM069613,OM069614,OM069615,OM069616,OM069617,OM069618,OM069619,OM069620,OM069621,OM069622,OM069623,OM069624,OM069625,OM069626,OM069627,OM069628,OM069629,OM069630,OM069631,OM069632,OM069633,OM069634,OM069635,OM069636,OM069637,OM069638,OM069639,OM069640,OM069641,OM069642,OM069643,OM069644,OM069645,OM069646

Sequencing Solutions to World Health

Rahul Agarwal — Thu, 29 Aug 2013 15:05:35 -0500

"New technology that quickly, easily and economically reveals the genomes of viruses and pathogens transforms public health and medicine."

Source: Life Technologies

Address of the bookmark: http://www.lifetechnologies.com/global/en/home/communities-social/blog/blogs/sequencing-solutions-to-world-health.html?cid=social_blogseries_20130829_11098264

Frequently used bioinformatics tools for viral genome analysis !

Neel — Wed, 23 Jun 2021 07:40:41 -0500

IVA: accurate de novo assembly of RNA virus genomes.
Hunt M, Gall A, Ong SH, Brener J, Ferns B, Goulder P, Nastouli E, Keane JA, Kellam P, Otto TD.
Bioinformatics. 2015 Jul 15;31(14):2374-6. doi: 10.1093/bioinformatics/btv120. Epub 2015 Feb 28.

Adapter sequences:
Optimal enzymes for amplifying sequencing libraries.
Quail, M. A. et al. Nat. Methods 9, 10-1 (2012).

GAGE:
GAGE: A critical evaluation of genome assemblies and assembly algorithms.
Salzberg, S. L. et al. Genome Res. 22, 557-67 (2012).

KMC:
Disk-based k-mer counting on a PC.
Deorowicz, S., Debudaj-Grabysz, A. & Grabowski, S. BMC Bioinformatics 14, 160 (2013).

Kraken:
Kraken: ultrafast metagenomic sequence classification using exact alignments.
Wood, D. E. & Salzberg, S. L. Genome Biol. 15, R46 (2014).

MUMmer:
Versatile and open software for comparing large genomes.
Kurtz, S. et al. Genome Biol. 5, R12 (2004).

R:
R: A language and environment for statistical computing.
R Core Team (2013). R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

RATT:
RATT: Rapid Annotation Transfer Tool.
Otto, T. D., Dillon, G. P., Degrave, W. S. & Berriman, M. Nucleic Acids Res. 39, e57 (2011).

SAMtools:
The Sequence Alignment/Map format and SAMtools.
Li, H. et al. Bioinformatics 25, 2078-9 (2009).

Trimmomatic:
Trimmomatic: A flexible trimmer for Illumina Sequence Data.
Bolger, A. M., Lohse, M. & Usadel, B. Bioinformatics 1-7 (2014).

Unicycler: Hybrid assembly pipeline for bacterial genomes

Jit — Fri, 10 Nov 2017 03:58:27 -0600

Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets, where it functions as a SPAdes-optimiser. It can also assemble long-read-only sets (PacBio or Nanopore), where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.

Address of the bookmark: https://github.com/rrwick/Unicycler

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Jit — Tue, 15 May 2018 07:35:26 -0500

HapCUT2 is a maximum-likelihood-based tool for assembling haplotypes from DNA sequence reads, designed to "just work" with excellent speed and accuracy. We found that previously described haplotype assembly methods are specialized for specific read technologies or protocols, with slow or inaccurate performance on others. With this in mind, HapCUT2 is designed for speed and accuracy across diverse sequencing technologies, including but not limited to:

NGS short reads (Illumina HiSeq)
clone-based sequencing (Fosmid or BAC clones)
SMRT reads (PacBio)
Oxford Nanopore reads
10X Genomics Linked-Reads
proximity-ligation (Hi-C) reads
high-coverage sequencing (>40x coverage-per-SNP) using above technologies
combinations of the above technologies (e.g. scaffold long reads with Hi-C reads)

See the repository linked below for specific examples of command line options and best practices for some of these technologies.

NOTE: At this time HapCUT2 is for diploid organisms only. VCF input should contain diploid variants.

If you use HapCUT2 in your research, please cite: Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. gr.213462.116 (2016). doi:10.1101/gr.213462.116

Address of the bookmark: https://github.com/vibansal/HapCUT2

transrate: Understanding your transcriptome assembly

Neel — Fri, 13 Jul 2018 07:49:26 -0500

Transrate is software for de-novo transcriptome assembly quality analysis. It examines your assembly in detail and compares it to experimental evidence such as the sequencing reads, reporting quality scores for contigs and assemblies. This allows you to choose between assemblers and parameters, filter out the bad contigs from an assembly, and help decide when to stop trying to improve the assembly.

Address of the bookmark: http://hibberdlab.com/transrate/index.html

ALLHiC: Phasing and scaffolding polyploid genomes based on Hi-C data

BioStar — Thu, 20 Dec 2018 12:03:32 -0600

The major problem in scaffolding a polyploid genome is that Hi-C signals are frequently detected between allelic haplotypes, and existing state-of-the-art Hi-C scaffolding programs link the allelic haplotypes together. To solve this problem, we developed a new Hi-C scaffolding pipeline, called ALLHiC, specifically tailored to polyploid genomes. The ALLHiC pipeline contains a total of 5 steps: prune, partition, rescue, optimize and build.

Address of the bookmark: https://github.com/tangerzhang/ALLHiC/wiki