BOL: Related items

ALE: a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies

Neel — Tue, 26 Apr 2016 03:38:43 -0500

The Assembly Likelihood Evaluation (ALE) framework systematically evaluates the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. The framework is comprehensive and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single-genome and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies present in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies in the Spirochaeta smaragdinae finished genome, all of which were independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

More at http://www.ncbi.nlm.nih.gov/pubmed/23303509

Address of the bookmark: http://sc932.github.io/ALE/about.html

Stampy

Abhi — Fri, 20 May 2016 19:13:32 -0500

Stampy is a package for mapping short reads from Illumina sequencing machines onto a reference genome. It's recommended for most workflows, including those for genomic resequencing, RNA-Seq and ChIP-seq. Stampy excels at mapping reads that contain sequence variation relative to the reference, in particular those containing insertions or deletions.

Address of the bookmark: http://www.well.ox.ac.uk/project-stampy

ART: Set of Simulation Tools

Jit — Thu, 03 Nov 2016 08:28:25 -0500

ART is a set of simulation tools for generating synthetic next-generation sequencing reads. ART simulates sequencing reads by mimicking the real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing data. ART can also simulate reads using a user's own read error model or quality profiles. ART supports simulation of single-end and paired-end/mate-pair reads for three major commercial next-generation sequencing platforms: Illumina's Solexa, Roche's 454 and Applied Biosystems' SOLiD. ART can be used to test or benchmark a variety of methods or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and SNP and structural variation discovery. ART was used as a primary tool for the simulation study of the 1000 Genomes Project. ART is implemented in C++ with optimized algorithms and is highly efficient in read simulation. ART outputs reads in the FASTQ format, and alignments in the ALN format. ART can also generate alignments in the SAM alignment or UCSC BED file format. ART can be used together with genome variant simulators (e.g. VarSim) for evaluating variant calling tools or methods.
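As a rough illustration of what quality-profile-driven read simulation means, the Python sketch below samples a read from a reference and injects substitution errors with probabilities derived from per-position Phred scores (p = 10^(-Q/10)). It is a toy model and not ART's algorithm or interface: the function names, the hypothetical quality profile and the substitution-only error model are all assumptions made for this example.

```python
import random

def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def simulate_read(reference, read_len, quality_profile, rng):
    """Sample one read and add substitution errors (toy model, not ART's)."""
    start = rng.randrange(len(reference) - read_len + 1)
    bases = []
    for offset in range(read_len):
        base = reference[start + offset]
        if rng.random() < phred_to_error_prob(quality_profile[offset]):
            # Substitute with one of the three other bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
    # Encode qualities as Phred+33 characters, as used in FASTQ files.
    quals = "".join(chr(q + 33) for q in quality_profile)
    return start, "".join(bases), quals

def to_fastq(name, seq, quals):
    """Format one read as a four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{quals}\n"

rng = random.Random(42)
reference = "".join(rng.choice("ACGT") for _ in range(200))
# Hypothetical profile: quality drops toward the 3' end of the read.
profile = [40] * 10 + [20] * 10
start, seq, quals = simulate_read(reference, 20, profile, rng)
print(to_fastq(f"read_1_pos{start}", seq, quals), end="")
```

A real simulator would also model insertions, deletions and paired-end insert sizes, and would draw quality values from an empirical distribution rather than a fixed list.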

Address of the bookmark: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

LoRDEC: a hybrid error correction program for long, PacBio reads

Jit — Mon, 10 Apr 2017 04:16:09 -0500

LoRDEC is a program to correct sequencing errors in long reads from 3rd generation sequencing with high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.

Usually, errors in PacBio reads include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of the sequence of PacBio reads is usable for detecting regions of similarity with other sequences, for aligning them to the contigs of an assembly, etc.

Why is LoRDEC different?

It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on a standard computing server, and even works on desktop/laptop computers.
It adopts a novel graph-based approach: it builds a succinct de Bruijn graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
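The graph-based idea can be sketched in a few lines of Python. The example below builds the k-mer set of a de Bruijn graph from short reads, keeps only k-mers seen at least twice, and searches for a path that bridges two trusted anchor k-mers flanking an erroneous long-read region. It is a minimal sketch under stated assumptions, not LoRDEC's implementation: the function names, the tiny k, the count threshold and the plain breadth-first search (in place of LoRDEC's own path selection) are all hypothetical choices for illustration.

```python
from collections import Counter, deque

def solid_kmers(reads, k, min_count=2):
    """Return k-mers seen at least min_count times across the short reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}

def successors(kmer, kmers):
    """Neighbours in the de Bruijn graph: k-mers overlapping by k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]

def bridge(source, target, kmers, max_len):
    """Breadth-first search for a sequence leading from source to target."""
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        kmer, seq = queue.popleft()
        if kmer == target:
            return seq
        if len(seq) >= max_len:
            continue  # give up on paths longer than the allowed correction
        for nxt in successors(kmer, kmers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None

# Accurate short reads (each seen twice) plus one read carrying an error.
short_reads = ["ACGTTGCA", "TTGCATGG", "CATGGAC"] * 2 + ["ACGTAGCA"]
kmers = solid_kmers(short_reads, k=4)

# Anchors: trusted k-mers on either side of a noisy long-read region.
corrected = bridge("ACGT", "GGAC", kmers, max_len=20)
print(corrected)
```

The k-mers that come only from the erroneous short read fall below the count threshold, so the search can only follow the well-supported path between the two anchors.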

Address of the bookmark: http://www.atgc-montpellier.fr/lordec/

HECIL: A Hybrid Error Correction Algorithm for Long Reads with Iterative Learning

Abhimanyu Singh — Tue, 01 Jan 2019 12:01:00 -0600

HECIL (Hybrid Error Correction with Iterative Learning) is a hybrid error correction framework that determines a correction policy for erroneous long reads, based on optimal combinations of decision weights obtained from short read alignments.

HECIL's core algorithm is further enhanced by an iterative learning paradigm that improves the correction policy at each iteration by incorporating knowledge gathered from previous iterations via data-driven confidence metrics assigned to prior corrections.

Address of the bookmark: https://github.com/NDBL/HECIL

Understanding HiFi Reads!

Rahul Nayak — Thu, 24 Mar 2022 19:48:11 -0500

While little public data is available for either of the new synthetic long read approaches, Illumina showed an example comparison earlier this year at the Festival of Genomics & Biodata conference (FoG 2022). In the IGV screenshot presented, synthetic Infinity reads (labeled “Longas”) are at the top, followed by standard Illumina short reads, with PacBio HiFi reads (labeled “CCS”) at the bottom.

Address of the bookmark: http://pacb.com/blog/the-hifi-difference-true-long-reads-vs-synthetic-long-reads/

Trust But Verify: Sequencing Your Cell Lines Might Reveal an Uninvited Guest

LEGE — Wed, 04 Jun 2025 00:07:57 -0500

High-throughput sequencing has become indispensable in cell biology, enabling detailed insights into chromatin structure, gene expression, and regulatory dynamics. Yet, when faced with unexpectedly low mapping rates to the human genome, researchers often rush to troubleshoot technical parameters—sequencer quality, adapter trimming, or aligner settings.

Before you go down that path, consider this critical biological question:
Are you sequencing human cells—or bacterial contamination?

The Silent Saboteur: Mycoplasma in Cell Cultures

Mycoplasma contamination remains one of the most widespread and underdiagnosed issues in tissue culture work. Studies suggest that 15–35% of cell lines in use may be contaminated, often without visible signs. Unlike other microbial infections, Mycoplasma does not produce cloudiness, odor, or a change in pH. Many researchers won’t detect it unless they specifically test for it.

The consequences, however, are profound. Mycoplasma can significantly alter:

Host gene expression patterns
Cell proliferation rates
Epigenetic profiles and chromatin accessibility
Cytokine signaling and immune responses

In short, it can skew your results, compromise your biological conclusions, and invalidate weeks or months of research.

A Simple Diagnostic Step: Map Against Mycoplasma Genomes

If you encounter poor alignment rates to the human genome, consider mapping your reads to a Mycoplasma reference genome—or better yet, use a combined human + Mycoplasma reference. There have been cases where over half of all reads, initially assumed to be from human cells, were in fact bacterial in origin. This check is fast, easy, and could save your project.
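One way to put a number on this check is to align against the combined reference and summarize mapped reads per contig. The Python sketch below assumes the per-contig counts come from `samtools idxstats`, whose tab-separated output lists reference name, sequence length, mapped reads and unmapped reads. The `myco_` contig prefix, the function name and the example counts are hypothetical and exist only for illustration; they assume you renamed the Mycoplasma contigs when building the combined reference.

```python
def contamination_fraction(idxstats_text, contaminant_prefix="myco_"):
    """Fraction of mapped reads assigned to contaminant contigs."""
    contaminant = total = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            continue  # row for reads that are not assigned to any reference
        total += int(mapped)
        if name.startswith(contaminant_prefix):
            contaminant += int(mapped)
    return contaminant / total if total else 0.0

# Made-up counts in idxstats layout: name, length, mapped, unmapped.
example = "\n".join([
    "chr1\t248956422\t400000\t0",
    "chr2\t242193529\t350000\t0",
    "myco_contig1\t800000\t1250000\t0",  # hypothetical contaminant contig
    "*\t0\t0\t50000",
])
print(f"{contamination_fraction(example):.1%}")  # prints 62.5%
```

In this invented example more than half of the mapped reads land on the contaminant contig, which is the kind of result that should prompt a direct Mycoplasma test of the culture.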

How Contamination Happens—and Persists

Mycoplasma is small (0.1–0.3 μm), lacks a cell wall, and can pass through standard filters undetected. Common sources include:

Contaminated reagents (e.g., FBS)
Infected cell lines obtained from other labs
Poor aseptic technique or shared equipment

Once present, it spreads quickly between cultures and can persist for months, silently affecting results.

Why Treatment Is Difficult

While antibiotics such as Plasmocin or BM-Cyclin are sometimes used, they often offer only partial resolution and may themselves alter cell behavior. In many cases, the best course of action is to discard the contaminated culture and start with a fresh, verified stock.

Practical Recommendations for Researchers

Routinely test for Mycoplasma using PCR, qPCR, or fluorescence-based assays
Incorporate contamination screens into your sequencing QC pipeline
Use combined reference genomes when mapping ambiguous reads
Practice strict aseptic technique and monitor all incoming cell lines
Don’t ignore unexplained data anomalies—they might point to contamination

Closing Thought: Contamination Is a Biological Variable

It’s easy to view poor mapping as a technical issue, but sometimes the problem lies deeper—in the biology itself. Mycoplasma contamination doesn’t just interfere with sequencing; it interferes with science. As a research community, we must treat contamination not as an afterthought, but as a key variable to control.

So next time your reads won’t align, don’t just tune the aligner. Ask if your cells are telling the truth—or if they're hiding something.

npScarf: Scaffolding and Completing Assemblies in Real-time Fashion

Jit — Tue, 23 May 2017 04:53:29 -0500

npScarf (jsa.np.npscarf) is a program that scaffolds and completes draft genome assemblies in real time with Oxford Nanopore sequencing. The pipeline can run on a computing cluster as well as on a laptop computer for microbial datasets. It also facilitates the real-time analysis of positional information such as gene ordering and the detection of genes from mobile elements (plasmids and genomic islands).

Complete paper at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5321748/

Address of the bookmark: https://github.com/mdcao/npScarf

Multi-CAR: a tool of contig scaffolding using multiple references

Rahul Nayak — Tue, 06 Mar 2018 16:39:41 -0600

The authors design a simple heuristic method to revise their single reference-based scaffolding tool CAR into a new one called Multi-CAR, which can use multiple complete genomes of related organisms as references to order and orient the contigs of a draft genome more accurately. In practical usage, Multi-CAR requires neither prior knowledge of the phylogenetic relationships among the draft and reference genomes nor libraries of paired-end reads. To validate Multi-CAR, the authors tested it on a real dataset composed of several prokaryotic genomes and compared its accuracy with that of two other multiple reference-based scaffolding tools, Ragout and MeDuSa.

Address of the bookmark: http://genome.cs.nthu.edu.tw/Multi-CAR/

CSAR-web: a web server of contig scaffolding using algebraic rearrangements

BioStar — Fri, 10 Apr 2020 04:39:36 -0500

CSAR-web is a web-based tool that allows users to efficiently and accurately scaffold (i.e. order and orient) the contigs of a target draft genome based on a complete or incomplete reference genome from a related organism.

CSAR-web can serve as a convenient scaffolding tool for users who want to scaffold their draft genomes against a complete or incomplete reference genome.
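To make "order and orient" concrete, the Python sketch below sorts contigs by where they align on a reference and reverse-complements those that align to the minus strand. This is a deliberately naive illustration and not CSAR-web's algebraic rearrangement method: the function names, the placement tuples (contig name, reference start, strand) and the fixed run of Ns used as a gap are all assumptions made for the example.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence made of A, C, G and T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scaffold(contigs, placements, gap="N" * 10):
    """Order contigs by reference position and orient them to the plus strand."""
    ordered = sorted(placements, key=lambda p: p[1])
    pieces = []
    for name, _ref_start, strand in ordered:
        seq = contigs[name]
        pieces.append(seq if strand == "+" else reverse_complement(seq))
    return [p[0] for p in ordered], gap.join(pieces)

# Hypothetical contigs and their best alignments to a reference genome.
contigs = {"c1": "GGGTTT", "c2": "AAACCC", "c3": "ACGTAC"}
placements = [("c1", 500, "-"), ("c2", 10, "+"), ("c3", 900, "+")]

order, sequence = scaffold(contigs, placements)
print(order)
print(sequence)
```

Real scaffolders must also cope with contigs that align to several places, rearrangements between the target and the reference, and unknown gap sizes, none of which this sketch attempts.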

Address of the bookmark: http://genome.cs.nthu.edu.tw/CSAR-web