BOL: Related items

The "Ifs" and "Buts" of NGS Quality Control and Trimming

BioStar — Thu, 02 Jan 2025 20:11:07 -0600

Next-Generation Sequencing (NGS) has revolutionized biological research, providing vast amounts of data for a wide range of applications. However, the reliability of NGS analyses heavily depends on the quality of raw sequencing data. Quality control (QC) and trimming are critical preprocessing steps that can make or break your downstream analyses. In this blog, we explore the "ifs" (why you should perform QC and trimming) and the "buts" (challenges or considerations) of this vital step in NGS workflows.

The "Ifs" of NGS QC and Trimming

Ensures Data Integrity
If you want to minimize errors in downstream analyses, QC and trimming remove low-quality reads and bases, ensuring high-confidence data. This step is essential for reliable variant calling, assembly, and other applications.
Removes Contaminants
If adapter sequences or contaminants are present in the raw reads, trimming can eliminate them. This prevents issues like misalignment or incorrect biological interpretations, ensuring cleaner data for analysis.
Improves Mapping and Assembly
If your goal is better alignment to a reference genome or improved de novo assembly, trimming low-quality bases and adapters is critical. High-quality reads map more efficiently and generate more accurate assemblies.
Reduces Computational Load
If you want to save computational resources, trimming reduces the dataset size, which speeds up processing and analysis. Clean datasets mean less computational time spent on processing low-quality data.
Prepares for Standardized Analyses
If your project involves multiple datasets, QC and trimming ensure uniformity across them. This standardization makes comparisons valid and reproducible, particularly in large collaborative studies.
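
To make the two basic operations above concrete, here is a minimal sketch of 3'-end quality trimming followed by adapter removal. It assumes Phred+33 quality encoding and exact adapter matching; the function names, the example read, and the adapter fragment are illustrative assumptions, and real trimmers use more tolerant matching and windowed quality checks.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) into integer scores."""
    return [ord(char) - offset for char in quality]


def trim_3prime(seq, quality, min_quality=20):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], quality[:end]


def strip_adapter(seq, quality, adapter):
    """Remove the first exact adapter match and everything 3' of it."""
    position = seq.find(adapter)
    if position == -1:
        return seq, quality
    return seq[:position], quality[:position]


# Hypothetical read: 6 good bases, an adapter fragment, then 2 low-quality bases.
read_seq = "ACGTACAGATCGGTT"
read_qual = "IIIIIIIIIIIII##"  # 'I' = Q40, '#' = Q2 in Phred+33
seq, qual = trim_3prime(read_seq, read_qual, min_quality=20)
seq, qual = strip_adapter(seq, qual, adapter="AGATCGG")
print(seq, qual)  # ACGTAC IIIIII
```

The order matters in this toy version: the low-quality tail is removed first, so the adapter search runs only on bases that are going to be kept.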

The "Buts" of NGS QC and Trimming

Risk of Over-Trimming
But excessive trimming can lead to the loss of informative sequences, reducing read depth and potentially discarding biologically relevant data. This is especially critical in studies with limited sequencing depth.
Bias Introduction
But trimming algorithms might introduce biases, especially if they inadvertently remove sequences with specific biological patterns. This can skew results and compromise biological insights.
Loss of Context in Paired-End Reads
But if trimming discards one read of a pair, or trims it far more than its mate, pairing information can be lost. This complicates downstream analyses that rely on paired-end data, such as structural variant detection.
Time and Resource Intensive
But running QC and trimming for large datasets can be computationally expensive and time-consuming. As sequencing depth increases, preprocessing becomes a bottleneck in the analysis pipeline.
Variable Standards
But the criteria for trimming (e.g., quality threshold, minimum read length) can vary between tools and datasets. This variability may affect reproducibility and comparability of results across studies.
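
The paired-end concern can be handled by filtering both mates together. The sketch below is one possible approach, assuming the two lists hold already-trimmed sequences in the same order; the function name and the default minimum length are hypothetical choices, not a standard.

```python
def split_pairs(mates1, mates2, min_length=30):
    """Separate trimmed read pairs into intact pairs and orphaned single reads.

    mates1 and mates2 are equal-length lists of trimmed sequences, where
    mates1[i] and mates2[i] come from the same fragment.
    """
    if len(mates1) != len(mates2):
        raise ValueError("paired inputs must contain the same number of reads")
    paired, orphans = [], []
    for mate1, mate2 in zip(mates1, mates2):
        ok1 = len(mate1) >= min_length
        ok2 = len(mate2) >= min_length
        if ok1 and ok2:
            paired.append((mate1, mate2))  # both mates survive together
        elif ok1:
            orphans.append(mate1)          # mate2 was trimmed too short
        elif ok2:
            orphans.append(mate2)          # mate1 was trimmed too short
        # if neither mate passes, both are dropped
    return paired, orphans


# Hypothetical trimmed reads, represented only by their lengths.
forward = ["A" * 50, "A" * 10, "A" * 40]
reverse = ["C" * 45, "C" * 50, "C" * 5]
paired, orphans = split_pairs(forward, reverse)
print(len(paired), len(orphans))  # 1 2
```

Keeping the orphans in a separate list means the paired output stays synchronized while the surviving single reads remain available for analyses that do not need pairing.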

Balancing the "Ifs" and "Buts"

To maximize the benefits of QC and trimming while mitigating the challenges, consider the following best practices:

Use QC Tools Wisely: Start with tools like FastQC to identify quality issues in your raw data. Visualizing quality metrics helps tailor your trimming parameters.
Choose Reliable Trimming Tools: Tools like Trimmomatic, Cutadapt, and BBDuk offer adaptive and customizable trimming options. Select one that aligns with your dataset and project goals.
Set Reasonable Parameters: Avoid over-trimming by setting quality thresholds and minimum read lengths that balance data retention and quality improvement.
Test Downstream Effects: Validate the impact of QC and trimming on downstream analyses, such as alignment efficiency, variant calling accuracy, or assembly quality.
Document Your Workflow: Maintain detailed records of the parameters and tools used for QC and trimming. This ensures reproducibility and enables better troubleshooting.
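
One way to combine "Set Reasonable Parameters" and "Document Your Workflow" is to measure how much data each candidate setting retains and to store the result together with the parameters. The following is a rough sketch under the assumption of Phred+33 quality strings and simple 3'-end trimming; the function names and report fields are made up for illustration.

```python
import json


def trimmed_length(quality, min_quality, offset=33):
    """Length left after cutting 3' bases whose Phred score is below min_quality."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return end


def retention_report(qualities, min_quality, min_length):
    """Summarize how many reads and bases survive one parameter setting."""
    lengths = [trimmed_length(q, min_quality) for q in qualities]
    kept = [n for n in lengths if n >= min_length]
    total_bases = sum(len(q) for q in qualities)
    return {
        "min_quality": min_quality,
        "min_length": min_length,
        "reads_in": len(qualities),
        "reads_kept": len(kept),
        "bases_retained": sum(kept) / total_bases if total_bases else 0.0,
    }


# Hypothetical quality strings ('I' = Q40, '#' = Q2 in Phred+33).
qualities = ["IIIIIIII", "IIII####", "########"]
reports = [retention_report(qualities, 20, min_len) for min_len in (4, 5)]
print(json.dumps(reports, indent=2))
```

Saving the printed JSON next to the trimmed files keeps the parameters and their effect on the data in one place, which makes later comparisons between runs easier.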

Conclusion

NGS quality control and trimming are essential steps to ensure reliable and accurate data for analysis. While the "ifs" highlight the clear benefits of these steps, the "buts" remind us of the potential pitfalls. By adopting best practices and carefully balancing these considerations, you can optimize your preprocessing workflow and unlock the full potential of your sequencing data.

gapFinisher: A reliable gap filling pipeline for SSPACE-LongRead scaffolder output

Rahul Nayak — Fri, 24 Jan 2020 06:04:40 -0600

gapFinisher is based on the controlled use of FGAP, a previously published gap filling tool, and works on all standard Linux/UNIX command lines. The authors compare the performance of gapFinisher against two other published gap filling tools, PBJelly and GMcloser.

gapFinisher can fill gaps in draft genomes quickly and reliably.

Address of the bookmark: https://github.com/kammoji/gapFinisher

Bioinformatics approach to Boar Taint

Rahul Agarwal — Wed, 17 Jul 2013 15:50:37 -0500

Meat from intact male pigs often has an offensive smell or odour known as boar taint, which is recognized as a complex genetic trait. Androstenone and skatole in the fat are the primary causes of boar taint. The metabolism of androstenone shares a common pathway with that of the sex steroids, which makes removing boar taint a very challenging task. Castration is the traditional solution, but it also lowers meat quality because of reduced steroid levels, which many consumers find objectionable. Functional variants underlying the boar taint compounds, once detected, can be used as genetic markers for selecting male pigs with reduced boar taint levels. A total of 47 samples from Norwegian Landrace (NL) and Duroc (D) pigs with varied boar taint levels were resequenced on an Illumina HiSeq 2000 to >10X average coverage. The short reads from these samples were mapped to the Sus scrofa 10.2 reference assembly using Bowtie2. The alignment files were then used to call SNPs and InDels inside previously identified QTL regions on SSC5, SSC13, and SSC7 with the variant caller FreeBayes. A final list of SNPs was prepared after filtering on SNP quality, coverage of the SNP allele, functional and structural annotation, repeats, and other criteria. The selected SNPs will be genotyped in the sample population for validation and then used to construct SNP haplotypes in close linkage disequilibrium with the QTLs and to fine-map the QTLs through association mapping of the genotyped SNPs.
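
The SNP filtering step described above can be sketched as follows. This is only an illustration of filtering by QTL region, variant quality, and allele coverage: the interval coordinates, the cutoffs, and the `Variant` record are all hypothetical, since the post does not give the actual values, and the annotation and repeat filters are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str      # e.g. "5" for SSC5
    pos: int        # 1-based position
    qual: float     # variant quality score
    alt_depth: int  # reads supporting the alternate allele


# Hypothetical QTL intervals (start, end); the real coordinates are not given in the post.
QTL_REGIONS = {"5": (1_000, 5_000), "7": (20_000, 30_000), "13": (100, 900)}


def in_qtl(variant, regions):
    """True if the variant lies inside the QTL interval on its chromosome."""
    if variant.chrom not in regions:
        return False
    start, end = regions[variant.chrom]
    return start <= variant.pos <= end


def filter_variants(variants, regions, min_qual=30.0, min_alt_depth=5):
    """Keep variants inside a QTL region that pass quality and allele-coverage cutoffs."""
    return [
        v for v in variants
        if in_qtl(v, regions) and v.qual >= min_qual and v.alt_depth >= min_alt_depth
    ]


candidates = [
    Variant("5", 1_200, 55.0, 12),   # passes every filter
    Variant("5", 9_000, 60.0, 15),   # outside the QTL interval
    Variant("7", 25_000, 12.0, 9),   # quality too low
    Variant("13", 500, 40.0, 2),     # too few reads support the allele
]
selected = filter_variants(candidates, QTL_REGIONS)
print(selected)
```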

Want to know which genome assembler rules the world?

Rahul Agarwal — Sun, 11 Aug 2013 11:42:32 -0500

Assemblathon 2: evaluating de novo methods of genome assembly

http://www.gigasciencejournal.com/content/2/1/10/abstract

http://blogs.nature.com/news/2013/07/genome-assembly-contest-prompts-soul-searching.html

http://assemblathon.org/post/44431915644/feedback-and-analysis-of-the-assemblathon-2-p

Barber's pole worm, a sheep pathogen, sequenced!

Rahul Agarwal — Tue, 03 Sep 2013 16:32:18 -0500

Haemonchus contortus is a highly pathogenic parasitic nematode that can infect a large number of wild and domesticated ruminant species and is the most economically important parasite of sheep and goats worldwide. Scientists at the Wellcome Trust Sanger Institute have sequenced the genome of the barber's pole worm (Haemonchus contortus), which will help researchers explore this tropical parasite, which has been disseminated around the world by livestock movement.

H. contortus is a member of the superfamily Trichostrongyloidea (Strongylida), which contains most of the economically important parasitic nematodes of grazing livestock. These parasites cost the global livestock industry billions of dollars per annum in lost production and drug costs. A common type of clover may be a preventative or palliative for the disease. In addition, some breeds of sheep, such as the Gulf Coast Native from the Southern United States, have been shown to have developed special resistance to H. contortus.

Having the full genome can help to tackle the problem and make the resistance mechanism easier to understand. Moreover, the genome could now provide a comprehensive understanding of how treatments against parasitic worms work and point to new treatments and vaccines. By comparing the genome of the barber's pole worm with those of worms that have acquired drug resistance, researchers expect to reveal how and why resistance has occurred. So far, researchers have uncovered essential information for the fight against drug resistance in worms.

Reference:

http://www.fwi.co.uk/articles/28/08/2013/140758/researchers-close-in-on-worm-resistance-in-sheep.htm

http://www.sciencedaily.com/releases/2013/08/130828103351.htm?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+sciencedaily%2Fplants_animals+(ScienceDaily%3A+Plants+%26+Animals+News)

Image source: Wikipedia

DNA tale of a 3- to 4-year-old Siberian boy

Rahul Agarwal — Tue, 26 Nov 2013 17:34:00 -0600

The genome of a young boy buried around 24,000 years ago at Mal’ta, near Lake Baikal in eastern Siberia, shows that he was a close relative of both Europeans and Native Americans.

Link:

http://www.nytimes.com/2013/11/21/science/two-surprises-in-dna-of-boy-found-buried-in-siberia.html?_r=0

http://www.nature.com/nature/journal/vaop/ncurrent/full/nature12736.html

Genome of Rainbow Trout Sequenced

Rahul Agarwal — Fri, 25 Apr 2014 10:36:51 -0500

Major finding:

“In humans and most vertebrates the duplication events were older so there are fewer duplicated genes still present. Most of the duplicated genes get lost or modified so much that they are no longer recognizable as duplicates over time. In the trout and salmon we can see an earlier stage in the process and many duplicated genes are still present,” said Dr Gary Thorgaard of Washington State University, a co-author of the paper published in the journal Nature Communications.

Source:

http://www.sci-news.com/genetics/science-genome-rainbow-trout-01877.html

Real time Sequencing

Rahul Agarwal — Sun, 04 May 2014 18:16:42 -0500

“... we now know we can do high-throughput sequencing at any location on Earth,” Moroz said.

Source:

http://news.ufl.edu/2014/04/28/real-time-genome-sequencing-at-sea/

Drawback of Exome Sequencing

Rahul Agarwal — Mon, 02 Jun 2014 05:46:43 -0500

Dr Eric Londin, Assistant Professor, Thomas Jefferson University, USA, stated that analysis of 44 exome datasets from four different testing kits showed that they missed a high proportion of clinically relevant regions in the 56 ACMG genes. "At least one gene in each exome method was missing more than 40 percent of disease-causing genetic variants, and we found that the worst-performing method missed more than 90 percent of such variants in four of the 56 genes," he says.

Source: http://www.eurekalert.org/pub_releases/2014-05/esoh-pco052914.php

Pathway Analysis

Rahul Agarwal — Fri, 03 Oct 2014 08:51:13 -0500

Pathway analysis is usually performed to enrich genes with their functional information and to reveal the underlying biological mechanisms in which the genes take part. It is not limited to identifying which biological pathways a particular set of expressed genes follows; it also discloses the relationships between these genes. As more genomics, transcriptomics, and proteomics data become available, the interactions between genes involved in multiple pathways become clearer, as do the relationships between the genes, their transcripts, and their gene products. However, existing tools and databases are mainly based on a knowledge-driven approach, in which pathways are identified by correlating the information in one of the pathway knowledge databases (KEGG, Reactome, Panther, BioCarta, GO, NCI, WikiPathways, etc.) with the gene expression results for a specific condition, for instance tumors, obesity, or cold-resistant crops/plants.
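
One common knowledge-driven method is over-representation analysis, which asks whether a gene list overlaps a known pathway more than chance would predict. The sketch below computes a one-sided hypergeometric p-value for a single pathway; the gene identifiers, set sizes, and function name are hypothetical, and the tools listed in this post implement more complete versions of this idea.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, list_size, overlap):
    """One-sided hypergeometric p-value: P(at least `overlap` pathway genes in the list)."""
    max_overlap = min(pathway_size, list_size)
    favorable = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, list_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return favorable / comb(universe_size, list_size)


# Hypothetical gene sets for illustration only.
universe = {f"gene{i}" for i in range(1, 21)}            # 20 measured genes
pathway = {"gene1", "gene2", "gene3", "gene4", "gene5"}  # annotated to one pathway
expressed = {"gene1", "gene2", "gene3", "gene10"}        # differentially expressed genes

overlap = len(pathway & expressed)
p_value = enrichment_p_value(len(universe), len(pathway), len(expressed), overlap)
print(overlap, p_value)
```

In practice this test is repeated for every pathway in the database, so the resulting p-values need a multiple-testing correction before they are interpreted.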

Introductory Articles/ppt/sources:

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002375

http://bioinformatics.mdanderson.org/MicroarrayCourse/Lectures09/Pathway%20Analysis.pdf

http://gettinggeneticsdone.blogspot.de/2012/03/pathway-analysis-for-high-throughput.html

http://davetang.org/muse/tag/pathway/

https://www.biostars.org/p/42219/

http://bioinformatics.ca//files/public/Pathways_2014_Module4_v2.pdf

http://bioinformatics.ca//files/public/Pathways_2014_Module2.pdf

Important Databases and Tools:

GeneMANIA, Cytoscape, IPA and MetaCore (commercial), Pathway Commons, Reactome, Panther, BioCyc, WikiPathways, PathVisio, KEGG, NCI, STRING, AmiGO, WebGestalt, ConsensusPathDB, GSEA, Blast2GO

Popular R based tools:

reactome.db, ReactomePA, clusterProfiler, gage, SPIA, topGO, pathview, DOSE, GOStat

More:

http://www.bioconductor.org/help/search/index.html?q=Enrichment+analysis+