BOL: Related items

MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads

Abhi — Tue, 05 Sep 2023 07:31:35 -0500

MitoHiFi v3.2 is a Python pipeline distributed under the MIT License.

MitoHiFi was first developed to assemble the mitogenomes of a wide range of species in the Darwin Tree of Life Project (DToL).

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05385-y

Address of the bookmark: https://github.com/marcelauliano/MitoHiFi

Trimmomatic: A flexible read trimming tool for Illumina NGS data

Jit — Fri, 15 Apr 2016 05:58:53 -0500

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)
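
The quality-based steps listed above can be sketched in Python. This is a simplified illustration and not Trimmomatic's own implementation: adapter clipping (ILLUMINACLIP) is left out, all function names are hypothetical, and Trimmomatic's exact handling of window edges may differ from the rule used here (cut at the start of the first window whose average quality falls below the threshold).

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 encoded quality string into integer scores."""
    return [ord(char) - 33 for char in quality_string]


def leading_start(quals, threshold):
    """Index of the first base whose quality is not below the threshold."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    return start


def trailing_end(quals, threshold):
    """Index one past the last base whose quality is not below the threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return end


def sliding_window_cut(quals, window, min_avg):
    """Start of the first window whose mean quality is below min_avg."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) < min_avg * window:
            return i
    return len(quals)


def trim_read(seq, quals, leading=3, trailing=3, window=4, min_avg=15, min_len=36):
    """Apply LEADING, TRAILING, SLIDINGWINDOW and MINLEN in that order.

    Returns (sequence, qualities), or None when the read is dropped.
    """
    start = leading_start(quals, leading)
    seq, quals = seq[start:], quals[start:]
    end = trailing_end(quals, trailing)
    seq, quals = seq[:end], quals[:end]
    cut = sliding_window_cut(quals, window, min_avg)
    seq, quals = seq[:cut], quals[:cut]
    if len(seq) < min_len:
        return None
    return seq, quals


# Toy read: 2 poor leading bases, 40 good bases, a low-quality tail.
toy_seq = "NN" + "A" * 40 + "CCCC" + "N"
toy_quals = [2, 2] + [30] * 40 + [10] * 4 + [2]
trimmed = trim_read(toy_seq, toy_quals)
```

The default arguments mirror the values in the command line above (3, 3, 4:15, 36), so changing them is a quick way to see how a stricter or looser setting affects a given read.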

More at http://www.usadellab.org/cms/?page=trimmomatic

Address of the bookmark: http://www.usadellab.org/cms/?page=trimmomatic

ART: Set of Simulation Tools

Jit — Thu, 03 Nov 2016 08:28:25 -0500

ART is a set of simulation tools that generate synthetic next-generation sequencing reads. ART simulates sequencing reads by mimicking the real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing datasets. ART can also simulate reads using a user's own read error model or quality profiles. ART supports simulation of single-end and paired-end/mate-pair reads for three major commercial next-generation sequencing platforms: Illumina's Solexa, Roche's 454 and Applied Biosystems' SOLiD. ART can be used to test or benchmark a variety of methods or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and SNP and structural variation discovery. ART was used as a primary tool for the simulation study of the 1000 Genomes Project. ART is implemented in C++ with optimized algorithms and is highly efficient in read simulation. ART outputs reads in the FASTQ format and alignments in the ALN format. ART can also generate alignments in the SAM alignment or UCSC BED file format. ART can be used together with genome variant simulators (e.g. VarSim) for evaluating variant calling tools or methods.
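
To make the idea of read simulation concrete, here is a toy sketch in Python. It is not ART and does not use ART's empirical error models or quality profiles: it assumes a single uniform substitution rate, single-end reads only, and a constant placeholder quality character. The function names are hypothetical.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads uniformly from a reference and add substitution errors.

    Each base is replaced by a different base with probability error_rate.
    The read name records the sampled start position for later checking.
    """
    rng = random.Random(seed)
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((f"read{i}_pos{start}", "".join(bases)))
    return reads


def to_fastq(reads, quality_char="I"):
    """Format (name, sequence) pairs as FASTQ text with constant qualities."""
    records = []
    for name, seq in reads:
        records.append(f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n")
    return "".join(records)


toy_reference = "ACGTTGCAAGGCTTAC" * 10
toy_reads = simulate_reads(toy_reference, n_reads=5, read_len=20, error_rate=0.01)
toy_fastq = to_fastq(toy_reads)
```

Because the true origin of every read is known, output like this can be used to check whether an aligner or assembler recovers the expected positions, which is the kind of benchmarking described above.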

Address of the bookmark: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

PacBio Long Reads Compatible Software and Tools

Archana Malhotra — Wed, 15 Mar 2017 14:19:01 -0500

The following software packages are known to be compatible with PacBio® data, in addition to PacBio's own SMRT® Analysis suite. All packages are believed to be open source or freely available for non-commercial use. See the individual project sites for up-to-date license information. A separate page lists commercial software.

Address of the bookmark: https://github.com/PacificBiosciences/DevNet/wiki/Compatible-Software

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jit — Tue, 07 Nov 2017 04:36:10 -0600

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
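
The core idea of walking only through k-mers with unique extensions can be sketched in Python. This is a toy under strong assumptions and not the Meraculous implementation: it ignores base qualities, reverse complements, coverage thresholds and the memory-efficient hashing scheme, and the function names are hypothetical.

```python
from collections import defaultdict


def kmer_extensions(reads, k):
    """Record, for every k-mer, the set of bases seen to its left and right."""
    left = defaultdict(set)
    right = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if i > 0:
                left[kmer].add(read[i - 1])
            if i + k < len(read):
                right[kmer].add(read[i + k])
    return left, right


def extend_right(seed, left, right):
    """Grow a contig from seed while the path is unambiguous.

    The walk stops when the current k-mer has zero or several right
    extensions, when the next k-mer has several left extensions, or
    when a k-mer would be visited twice.
    """
    contig = seed
    seen = {seed}
    kmer = seed
    while len(right[kmer]) == 1:
        next_kmer = kmer[1:] + next(iter(right[kmer]))
        if next_kmer in seen or len(left[next_kmer]) != 1:
            break
        contig += next_kmer[-1]
        seen.add(next_kmer)
        kmer = next_kmer
    return contig


# Two overlapping reads from the toy sequence ACGTTGCATC.
left, right = kmer_extensions(["ACGTTGCA", "GTTGCATC"], k=4)
contig = extend_right("ACGT", left, right)
```

Stopping at every ambiguous k-mer is what makes the traversal conservative: a contig ends rather than guessing which branch is correct.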

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3158087/

Consed--A Finishing Package (BAM File Viewer, Assembly Editor, Autofinish, Autoreport, Autoedit, and Align Reads To Reference Sequence)

Neel — Fri, 07 Feb 2020 07:16:22 -0600

Supports Illumina, 454, other Next-Gen and Sanger Reads and allows mixtures of these read types
Consed includes BamScape which can view bam files with unlimited numbers of reads. BamScape can bring up consed to edit reads and the reference sequence in targeted regions.
Consed is compatible with Newbler, Cross_match, Phrap, MIRA, Velvet and PCAP output.
Quickly takes the user to each variant site for viewing (also available as an automated report)
Overview of assembly can help detect and fix misassemblies
Editing time reduced by the program's ability to pin-point problem areas
Editing is guided by error probabilities

Address of the bookmark: http://www.phrap.org/consed/consed.html

The "Ifs" and "Buts" of NGS Quality Control and Trimming

BioStar — Thu, 02 Jan 2025 20:11:07 -0600

Next-Generation Sequencing (NGS) has revolutionized biological research, providing vast amounts of data for a wide range of applications. However, the reliability of NGS analyses heavily depends on the quality of raw sequencing data. Quality control (QC) and trimming are critical preprocessing steps that can make or break your downstream analyses. In this blog, we explore the "ifs" (why you should perform QC and trimming) and the "buts" (challenges or considerations) of this vital step in NGS workflows.

The "Ifs" of NGS QC and Trimming

Ensures Data Integrity
If you want to minimize errors in downstream analyses, QC and trimming remove low-quality reads and bases, ensuring high-confidence data. This step is essential for reliable variant calling, assembly, and other applications.
Removes Contaminants
If adapter sequences or contaminants are present in the raw reads, trimming can eliminate them. This prevents issues like misalignment or incorrect biological interpretations, ensuring cleaner data for analysis.
Improves Mapping and Assembly
If your goal is better alignment to a reference genome or improved de novo assembly, trimming low-quality bases and adapters is critical. High-quality reads map more efficiently and generate more accurate assemblies.
Reduces Computational Load
If you want to save computational resources, trimming reduces the dataset size, which speeds up processing and analysis. Clean datasets mean less computational time spent on processing low-quality data.
Prepares for Standardized Analyses
If your project involves multiple datasets, QC and trimming ensure uniformity across them. This standardization makes comparisons valid and reproducible, particularly in large collaborative studies.

The "Buts" of NGS QC and Trimming

Risk of Over-Trimming
But excessive trimming can lead to the loss of informative sequences, reducing read depth and potentially discarding biologically relevant data. This is especially critical in studies with limited sequencing depth.
Bias Introduction
But trimming algorithms might introduce biases, especially if they inadvertently remove sequences with specific biological patterns. This can skew results and compromise biological insights.
Loss of Context in Paired-End Reads
But trimming one read in a pair more than the other can lead to loss of pairing information. This complicates downstream analyses that rely on paired-end data, such as structural variant detection.
Time and Resource Intensive
But running QC and trimming for large datasets can be computationally expensive and time-consuming. As sequencing depth increases, preprocessing becomes a bottleneck in the analysis pipeline.
Variable Standards
But the criteria for trimming (e.g., quality threshold, minimum read length) can vary between tools and datasets. This variability may affect reproducibility and comparability of results across studies.

Balancing the "Ifs" and "Buts"

To maximize the benefits of QC and trimming while mitigating the challenges, consider the following best practices:

Use QC Tools Wisely: Start with tools like FastQC to identify quality issues in your raw data. Visualizing quality metrics helps tailor your trimming parameters.
Choose Reliable Trimming Tools: Tools like Trimmomatic, Cutadapt, and BBduk offer adaptive and customizable trimming options. Select one that aligns with your dataset and project goals.
Set Reasonable Parameters: Avoid over-trimming by setting quality thresholds and minimum read lengths that balance data retention and quality improvement.
Test Downstream Effects: Validate the impact of QC and trimming on downstream analyses, such as alignment efficiency, variant calling accuracy, or assembly quality.
Document Your Workflow: Maintain detailed records of the parameters and tools used for QC and trimming. This ensures reproducibility and enables better troubleshooting.

Conclusion

NGS quality control and trimming are essential steps to ensure reliable and accurate data for analysis. While the "ifs" highlight the clear benefits of these steps, the "buts" remind us of the potential pitfalls. By adopting best practices and carefully balancing these considerations, you can optimize your preprocessing workflow and unlock the full potential of your sequencing data.

Metabuli improves metagenomic read classification

Abhi — Sat, 03 Jun 2023 20:15:04 -0500

Metabuli improves metagenomic read classification through metamers (joint DNA-AA k-mers) to be both sensitive and specific, recovering 99% and 98% of DNA or AA classifiers.

Metabuli is a metagenomic classifier that jointly analyzes both DNA and amino acid (AA) sequences. DNA-based classifiers can make specific classifications, exploiting point mutations to distinguish close taxa. AA-based classifiers have higher sensitivity in detecting homology between query and reference sequences, leveraging the higher conservation of AA sequences. Metabuli combines the information of both sequence types using a novel k-mer structure, the metamer, to enable both specific and sensitive characterization of metagenomic samples. In addition, it can classify reads against a database of any size as long as it fits on the hard disk.

Address of the bookmark: https://github.com/steineggerlab/Metabuli

Sequencing Solutions to World Health

Rahul Agarwal — Thu, 29 Aug 2013 15:05:35 -0500

"New technology that quickly, easily and economically reveals the genomes of viruses and pathogens transforms public health and medicine."

Source: Life Technologies

Address of the bookmark: http://www.lifetechnologies.com/global/en/home/communities-social/blog/blogs/sequencing-solutions-to-world-health.html?cid=social_blogseries_20130829_11098264

Genome Browsers

Rahul Agarwal — Fri, 16 Aug 2013 19:04:47 -0500

A genome browser is a platform or database used for searching and retrieving the sequences and annotations of genomes belonging to various eukaryotes, prokaryotes, etc.

The following are web links to the available browsers:

http://www.ensembl.org/index.html

http://ensemblgenomes.org/

http://genome.ucsc.edu/

http://www.ncbi.nlm.nih.gov/genome

http://www.ebi.ac.uk/genomes/

http://flybase.org/

http://cmr.jcvi.org/tigr-scripts/CMR/CmrHomePage.cgi

http://www.sanger.ac.uk/resources/databases/