BOL: Related items

IVA: accurate de novo assembly of RNA virus genomes

Neel — Wed, 23 Jun 2021 07:51:59 -0500

IVA (Iterative Virus Assembler) is designed specifically for read pairs sequenced at highly variable depth from RNA virus samples. We tested IVA on datasets from 140 sequenced samples from human immunodeficiency virus-1 or influenza-virus-infected people and demonstrated that IVA outperforms all other virus de novo assemblers.

Availability and implementation: The software runs under Linux, has the GPLv3 licence and is freely available from http://sanger-pathogens.github.io/iva

https://pubmed.ncbi.nlm.nih.gov/25725497/

Address of the bookmark: https://github.com/sanger-pathogens/iva

Understanding RNA-Seq Normalization Methods: TPM vs. FPKM vs. CPM

Neel — Wed, 11 Dec 2024 00:59:15 -0600

RNA sequencing (RNA-Seq) is a powerful technology used to study transcriptomes, providing insights into gene expression levels. However, raw RNA-Seq data requires normalization to account for sequencing depth and gene length, enabling accurate comparisons between genes and samples. Among the most widely used normalization methods are TPM (Transcripts Per Million), FPKM (Fragments Per Kilobase Million), and CPM (Counts Per Million). Each method has its unique principles and applications, which we’ll explore in this blog.

Why Normalize RNA-Seq Data?

Normalization is a crucial step in RNA-Seq analysis for the following reasons:

Sequencing depth: Different RNA-Seq experiments produce varying numbers of reads, making direct comparisons between samples misleading.
Gene length: Longer genes inherently generate more reads, irrespective of their actual expression level.
Bias reduction: Normalization mitigates technical biases, enabling meaningful biological interpretation.

TPM (Transcripts Per Million)

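TPM measures the proportion of reads mapped to a transcript, normalized first by transcript length and then by sequencing depth. It is calculated as:

$$\mathrm{TPM}_i = \frac{c_i / l_i}{\sum_j c_j / l_j} \times 10^6$$

where c_i is the number of reads mapped to transcript i, l_i is the length of transcript i, and the sum runs over all transcripts in the sample.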

Key Features:

Proportionality: TPM values sum to 1,000,000 across all transcripts in a sample, making it easier to compare between samples.
Intuitive interpretation: TPM values directly represent the abundance of transcripts in a sample.
Preferred for comparisons: TPM facilitates between-sample comparisons better than FPKM.
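
As a rough illustration of the TPM calculation described above, here is a minimal Python sketch. The function name `tpm` and the counts and lengths are made up for the example; real pipelines take these values from a quantification tool.

```python
def tpm(counts, lengths_bp):
    """Return TPM values for one sample.

    counts: mapped read counts per transcript (hypothetical input)
    lengths_bp: transcript lengths in base pairs, in the same order
    """
    # Step 1: reads per kilobase (RPK) corrects for transcript length.
    rpk = [c / (l / 1000) for c, l in zip(counts, lengths_bp)]
    # Step 2: rescale so that the values in the sample sum to one million.
    scale = sum(rpk) / 1_000_000
    return [r / scale for r in rpk]


# Made-up example: longer transcripts collect proportionally more reads,
# so after length correction all three have the same abundance.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 3000]
values = tpm(counts, lengths_bp)
print(values)
print(sum(values))
```

In this toy sample the three transcripts end up with equal TPM even though their raw counts differ, because the extra reads are explained entirely by length.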

FPKM (Fragments Per Kilobase Million)

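FPKM normalizes fragment counts by transcript length and sequencing depth, but the resulting values do not sum to a fixed total within a sample as TPM values do. It is defined as:

$$\mathrm{FPKM}_i = \frac{c_i}{l_i \cdot N / 10^6}$$

where c_i is the number of fragments mapped to transcript i, l_i is the length of transcript i in kilobases, and N is the total number of mapped fragments in the sample.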

Key Features:

Historical significance: FPKM was one of the first normalization methods used for RNA-Seq.
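Single-end vs. paired-end: In single-end sequencing, where each fragment yields one read, FPKM is equivalent to RPKM (Reads Per Kilobase Million). In paired-end sequencing, FPKM counts each fragment once even when both of its reads map.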
Limited utility: FPKM values are not as robust as TPM for cross-sample comparisons due to lack of proportionality.
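
The lack of proportionality can be seen in a small Python sketch. The helper names `fpkm` and `fpkm_to_tpm` and the two samples are hypothetical; the point is only that FPKM totals differ between samples, while rescaling FPKM by its own sum gives TPM.

```python
def fpkm(fragments, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_millions = sum(fragments) / 1_000_000
    return [f / (l / 1000) / total_millions
            for f, l in zip(fragments, lengths_bp)]


def fpkm_to_tpm(fpkm_values):
    """Rescale FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm_values)
    return [v / total * 1_000_000 for v in fpkm_values]


# Two made-up samples with the same sequencing depth (4000 fragments)
# but different distributions across transcripts of different length.
lengths_bp = [1000, 2000, 4000]
sample_a = [1000, 2000, 1000]
sample_b = [1000, 1000, 2000]

fpkm_a = fpkm(sample_a, lengths_bp)
fpkm_b = fpkm(sample_b, lengths_bp)

# The FPKM totals differ between the two samples...
print(sum(fpkm_a), sum(fpkm_b))
# ...whereas the TPM totals are one million in both.
print(sum(fpkm_to_tpm(fpkm_a)), sum(fpkm_to_tpm(fpkm_b)))
```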

CPM (Counts Per Million)

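CPM normalizes raw read counts by sequencing depth, without considering gene length. It is expressed as:

$$\mathrm{CPM}_i = \frac{c_i}{N} \times 10^6$$

where c_i is the number of reads mapped to gene i and N is the total number of mapped reads in the sample.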

Key Features:

Simplicity: CPM is straightforward and computationally less intensive.
Application: Suitable for non-length-dependent analyses, such as comparing total expression levels or differential expression analysis.
Gene length agnostic: CPM does not correct for gene length, so it should not be used to compare expression between genes of different lengths within a sample.
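
A minimal Python sketch of CPM, using made-up counts and a hypothetical function name `cpm`, shows how little is involved: only the library size enters the calculation, never the gene length.

```python
def cpm(counts):
    """Counts per million: scale raw counts by the total library size."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]


# Made-up read counts for three genes in one sample.
counts = [100, 200, 700]
values = cpm(counts)
print(values)
```

Two genes with the same count always get the same CPM, whatever their lengths.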

When to Use Each Method

TPM: Best for comparing expression levels between samples, especially when transcript length and sequencing depth vary.
FPKM: Useful for historical consistency but generally replaced by TPM.
CPM: Ideal for differential expression analysis when gene length normalization is unnecessary.

Conclusion

Choosing the right normalization method depends on the specific objectives of your RNA-Seq analysis. TPM’s proportionality and robustness make it the preferred choice for most applications, while CPM serves well for differential expression studies. Although FPKM paved the way for RNA-Seq normalization, it has largely been supplanted by TPM in modern workflows. Understanding these methods and their nuances ensures accurate and meaningful interpretations of RNA-Seq data.

Bioinformatics Codes Search

Jitendra Narayan — Thu, 15 Aug 2013 11:08:52 -0500

I bet this website will be your best friend in the near future. It lets us explore existing open source code and learn from it.

You can find useful open source bioinformatics code for your analysis work. Use the options in the left bar to filter or narrow down your search results. This webpage can be a useful resource for a beginner bioinformatician, as it contains several basic bioinformatics scripts that are commonly used by biological programmers and biologists.

Stand on the slumped, dandruff-covered shoulders of millions of computer nerds. _/\_

Enjoy the code and research work.

http://code.ohloh.net/search?s=bioinformatics

Address of the bookmark: http://code.ohloh.net/search?s=bioinformatics

MMseqs2: ultra fast and sensitive sequence search and clustering suite

Abhi — Wed, 06 Oct 2021 07:01:14 -0500

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

Address of the bookmark: https://github.com/soedinglab/MMseqs2

Genome Stability Laboratory

Mon, 07 Mar 2016 04:16:32 -0600

Baker's yeast, Saccharomyces cerevisiae, is an ideal model organism for understanding the mechanisms of meiotic chromosome segregation. In S. cerevisiae and in mammals, the majority of meiotic crossovers are formed through a highly conserved MSH4p-MSH5p, MLH1p-MLH3p-dependent pathway. We are interested in characterizing the role of these complexes in crossover formation and distribution among all homolog pairs. Errors in this process are linked to congenital birth defects in humans such as Down's syndrome. Our laboratory is also interested in understanding the effect of genetic background on mutation rate variation using S. cerevisiae as a model. These studies are relevant for understanding cancer progression, genome evolution and architecture. We use high-throughput genomic methods as well as classical genetics to achieve these aims.

More at http://faculty.iisertvm.ac.in/~nishantkt/index.html

SISRS: Site Identification from Short Read Sequences

Abhimanyu Singh — Wed, 28 Nov 2018 08:56:03 -0600

SISRS takes next-gen sequence data, such as Illumina HiSeq reads, as input. Data must be sorted into folders by taxon (e.g. species or genus). Paired reads in fastq format must be specified by _R1 and _R2 in the (otherwise identical) filenames. Paired and unpaired reads must have a fastq file extension.
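
The naming rule above can be checked before running anything. The following Python sketch is not part of SISRS; the function name `find_pairs`, the example filenames, and the assumption that the extension is exactly ".fastq" are all ours.

```python
def find_pairs(filenames):
    """Split fastq filenames into (_R1, _R2) pairs and unpaired files.

    Assumes the extension is exactly ".fastq" (our assumption) and that
    mates differ only by _R1 versus _R2 in the filename.
    """
    names = set(filenames)
    pairs, unpaired = [], []
    for name in sorted(names):
        if not name.endswith(".fastq"):
            continue  # ignore anything that is not a fastq file
        if "_R1" in name:
            mate = name.replace("_R1", "_R2", 1)
            if mate in names:
                pairs.append((name, mate))
            else:
                unpaired.append(name)
        elif "_R2" in name:
            # Already reported with its _R1 mate, unless that mate is missing.
            if name.replace("_R2", "_R1", 1) not in names:
                unpaired.append(name)
        else:
            unpaired.append(name)
    return pairs, unpaired


# Made-up contents of one taxon folder.
files = ["A_R1.fastq", "A_R2.fastq", "B.fastq", "notes.txt"]
pairs, unpaired = find_pairs(files)
print(pairs)     # [('A_R1.fastq', 'A_R2.fastq')]
print(unpaired)  # ['B.fastq']
```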

Address of the bookmark: https://github.com/rachelss/SISRS

GenBank release 257.0 is now available!

Neel — Wed, 23 Aug 2023 00:23:23 -0500

GenBank release 257.0 (8/15/2023) is now available on the NCBI FTP site. This release has 25.10 trillion bases and 3.69 billion records. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/08/21/genbank-release-257/

The current release has:

246,119,175 traditional records containing 2,112,058,517,945 base pairs of sequence data
2,631,493,489 WGS records containing 22,294,446,104,543 base pairs of sequence data
686,271,945 bulk-oriented TSA records containing 646,176,166,908 base pairs of sequence data
124,421,006 bulk-oriented TLS records containing 48,289,699,026 base pairs of sequence data
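
As a quick sanity check, the four categories listed above add up to the headline totals. The short Python sketch below uses only the numbers quoted in this post; the dictionary keys are just labels.

```python
# Record and base-pair counts as quoted in the release announcement.
records = {
    "traditional": 246_119_175,
    "WGS": 2_631_493_489,
    "TSA": 686_271_945,
    "TLS": 124_421_006,
}
bases = {
    "traditional": 2_112_058_517_945,
    "WGS": 22_294_446_104_543,
    "TSA": 646_176_166_908,
    "TLS": 48_289_699_026,
}

total_records = sum(records.values())
total_bases = sum(bases.values())

print(f"{total_records / 1e9:.2f} billion records")  # 3.69 billion records
print(f"{total_bases / 1e12:.2f} trillion bases")    # 25.10 trillion bases
```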

COSINE: non-seeding method for mapping long noisy sequences

Jit — Fri, 26 Oct 2018 00:41:59 -0500

Third generation sequencing (TGS) technologies are highly promising, but the long and noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors.

Address of the bookmark: https://github.com/SUwonglab/COSINE

UPhO: Scripts for homology and orthology assessment from genomic sequences.

BioStar — Mon, 14 Jan 2019 10:36:42 -0600

UPhO finds orthologs with and without inparalogs from input gene family trees. Refer to the Documentation.pdf for more detailed explanations on its usage, installation and dependencies. Type UPhO.py -h for help.

The only input requirement for UPhO is a tree (or trees) in Newick format in which the leaves are named with a species identifier, a field separator, and a sequence identifier. By default, the field separator is the character "|" but custom delimiters can be defined. Examples of trees to test UPhO are provided in the TestData folder.
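
To make the leaf naming convention concrete, here is a small Python sketch that pulls leaf labels out of a Newick string and splits them on the separator. It is not UPhO's own parser; the function names, the regular expression, and the example tree are hypothetical, and the sketch assumes simple unquoted labels.

```python
import re


def leaf_labels(newick):
    """Return leaf labels from a Newick string with unquoted labels.

    A leaf label follows "(" or "," and runs until ":", ",", ")" or ";".
    Internal node labels follow ")" and are therefore not captured.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def split_leaf(label, sep="|"):
    """Split a leaf label into (species identifier, sequence identifier)."""
    species, seq_id = label.split(sep, 1)  # ValueError if sep is missing
    return species, seq_id


# Made-up gene family tree with three species and branch lengths.
tree = "((Hsap|g1:0.1,Mmus|g7:0.2):0.05,Dmel|g3:0.3);"
for label in leaf_labels(tree):
    print(split_leaf(label))
```

A label that lacks the separator raises a ValueError, which is a convenient way to spot leaves that do not follow the convention.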

Address of the bookmark: https://github.com/ballesterus/UPhO

IgBLAST 1.17 is now available with improved identification of productive V gene sequences

Jit — Sun, 01 Nov 2020 16:52:58 -0600

A new release of IgBLAST (1.17), the popular package for classifying and analyzing immunoglobulin and T cell receptor sequences, is now available on the web and from the FTP site. The updated package is better at identifying productive V gene sequences. We added a new field, “V frame shift”, to the IgBLAST output to indicate whether the V gene translation frame contains a frame shift. We have also updated the definition of a productive V(D)J sequence to now exclude those with internal frame shifts.

See the new IgBLAST manual on the NCBI GitHub site for more information on setting up and running IgBLAST.

If you have any questions or concerns, please email us at blast-help@ncbi.nlm.nih.gov